
Pandektes Case Law Challenge

This is a NestJS-based legal document parsing application built for the Pandektes technical challenge. It extracts case law metadata from PDF and HTML documents using Gemini AI and stores it in a PostgreSQL database.

▶️ Watch the demo video

Getting Started

Prerequisites

  • Docker & Docker Compose
  • Node.js (v20+)
  • Gemini API Key (Get one at Google AI Studio)

IMPORTANT: I attached billing to my Google account to prevent hitting the free tier limits.

Installation

  1. Clone the Repo:
    git clone https://git.georgew.dev/georgew/pandektes-challenge.git
    cd pandektes-challenge
    
  2. Environment Setup: Create a .env file in the root:
    # App config
    PORT=3000
    NODE_ENV=development
    
    # AI Config
    GOOGLE_API_KEY=your_gemini_api_key_here
    
    # Database config
    POSTGRES_USER=postgres
    POSTGRES_PASSWORD=postgres
    POSTGRES_DB=pandektes
    POSTGRES_PORT=5432
    DATABASE_URL="postgresql://postgres:postgres@localhost:5432/pandektes?schema=public"
    
    # Redis config
    REDIS_HOST=localhost
    REDIS_PORT=6379
    
    # Storage (Local Minio)
    STORAGE_ENDPOINT=http://localhost:9000
    STORAGE_BUCKET=cases
    STORAGE_REGION=us-east-1
    STORAGE_ACCESS_KEY=minioadmin
    STORAGE_SECRET_KEY=minioadmin
    STORAGE_FORCE_PATH_STYLE=true
    
    Or simply copy the example: cp .env.example .env and add your GOOGLE_API_KEY.
  3. Start Infrastructure:
    docker-compose up -d
    
  4. Install & Build:
    npm install
    npx prisma migrate dev
    npm run start:dev
    

Usage

Once running, you can interact with the app in a few ways:

  • Web UI — I created a basic interface so the app can be easily tested without additional setup. Visit http://localhost:3000 to upload files and search for cases.
  • GraphQL Playground — Available at http://localhost:3000/graphql for direct query/mutation testing.
  • Prisma Studio — Run npx prisma studio to open a visual database browser and inspect the extracted case law entries directly.

Architectural Decisions

1. Why a Background Queue (BullMQ)?

Large document processing is likely to be "spiky" and slow, especially once LLM calls are involved. If it ran directly in the HTTP request, the user's connection would likely time out, and the heavy work could block the event loop on the main thread. I used sandboxed workers (running in separate processes) to avoid both problems. This also ensures that if a particularly heavy PDF causes a memory leak or CPU spike, it doesn't crash the main API that serves other users.
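A sandboxed BullMQ processor is simply a module whose default export handles the job; BullMQ forks it into a child process when the `Worker` is given a file path instead of a function. A minimal sketch (the `fileKey` payload shape and function name are illustrative assumptions, not the actual job schema):

```typescript
// parse.processor.ts — a sandboxed BullMQ processor module.
// When a Worker is constructed with a *file path* instead of a function,
// e.g. new Worker("parse", __dirname + "/parse.processor.js", { connection }),
// BullMQ runs this module in a separate child process, so a crash or CPU
// spike here cannot take down the main API process.

interface ParseJob {
  data: { fileKey: string };
}

// The default export is the handler BullMQ invokes for each job.
export default async function parseDocument(job: ParseJob) {
  // Heavy work (PDF text extraction, the Gemini call) would happen here.
  return { status: "parsed", fileKey: job.data.fileKey };
}
```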

2. S3-Compatible Storage (Minio)

Instead of saving files to the local disk, I used an S3-compatible service. Storing files on a local disk makes the app hard to scale effectively. By using S3 patterns, the app is "cloud-ready" and I can point it at AWS S3 just by changing the env variables.
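Because the storage client only needs connection details from the environment, swapping Minio for AWS S3 is a pure configuration change. A sketch of the config object (the variable names match the `.env` above; feeding it to `new S3Client(...)` from `@aws-sdk/client-s3` is the assumed usage):

```typescript
// Builds an S3-compatible client config purely from environment variables,
// so pointing at AWS S3 instead of local Minio requires no code changes.
export const storageConfig = {
  endpoint: process.env.STORAGE_ENDPOINT ?? "http://localhost:9000",
  region: process.env.STORAGE_REGION ?? "us-east-1",
  credentials: {
    accessKeyId: process.env.STORAGE_ACCESS_KEY ?? "minioadmin",
    secretAccessKey: process.env.STORAGE_SECRET_KEY ?? "minioadmin",
  },
  // Minio expects path-style URLs (http://host/bucket/key); AWS S3 defaults
  // to virtual-hosted style, so this flag stays environment-driven.
  forcePathStyle: process.env.STORAGE_FORCE_PATH_STYLE === "true",
};
// Assumed usage: new S3Client(storageConfig) from @aws-sdk/client-s3.
```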

3. Full-Document Parsing

I chose to send the full extracted text to Gemini rather than truncating it. Gemini Flash has a 1M-token context window, so even a 50-page legal document barely scratches the surface. I did consider truncation, or limiting the input to just the start and end of the document (which likely contain the most important information), but I don't have the domain knowledge of case law to make that call. Instead, I set a generous character cap (500k) to act as a safety net against abuse.
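The safety-net cap amounts to a one-line guard before the text reaches the model (500k characters as stated above; the function name is illustrative):

```typescript
// Hard cap on extracted text length before it is sent to the model —
// a guard against abuse, not a semantic truncation strategy.
const MAX_PROMPT_CHARS = 500_000;

export function capExtractedText(text: string): string {
  return text.length > MAX_PROMPT_CHARS ? text.slice(0, MAX_PROMPT_CHARS) : text;
}
```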

4. Language Handling

It wasn't clear from the requirements whether the AI should extract metadata in the document's original language or normalise everything to English. Since the provided documents include both Danish and English, I haven't enforced any language rules on the AI and left it open for now. This would be trivial to change by adding a language instruction to the prompt.


Production Readiness (Next Steps)

If I were taking this to production, here's what I'd focus on:

De-duplication

Currently, a user can re-upload the same document multiple times, creating duplicates in the database. This could be mitigated by:

  • File hashing: Calculate a hash of the uploaded file before processing. This is quick and prevents the exact same file from being processed twice.
  • Post-AI check: Compare the extracted case number against existing records. Slower, but more logically robust since two different files could describe the same case.
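The file-hashing check is cheap: hash the raw upload bytes and short-circuit if the digest already exists in the database (e.g. via a unique index on the column). A sketch using Node's built-in crypto (the function name is illustrative):

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the raw upload — identical files always produce the same
// digest, so a unique index on this value prevents exact-duplicate rows.
export function fileDigest(buffer: Buffer): string {
  return createHash("sha256").update(buffer).digest("hex");
}
```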

File Upload Scaling

If files get large, buffering them through the NestJS server becomes a bottleneck. I'd look at removing the upload from NestJS entirely:

  • The API generates a presigned upload URL (direct to an S3 bucket) and returns it to the frontend.
  • The frontend uploads the file directly to storage — the NestJS server never touches the binary data.
  • This makes the backend far more scalable and cheaper to run, while cloud storage handles the heavy lifting.

Input Validation

Currently, identifier formatting (UUID vs. Case Number) is handled via a helper function in the service layer. For production, I'd create a custom class validator and apply it to the DTO, so invalid input fails at the entry point (the "fail fast" principle).
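The entry-point check could be a pure predicate wrapped in a custom `class-validator` decorator. A sketch of the predicate (the case-number pattern here is a hypothetical placeholder — the real format would come from the domain):

```typescript
// UUID v4 pattern, plus a *hypothetical* case-number pattern — the actual
// case-number format would need to come from domain knowledge.
const UUID_V4 =
  /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;
const CASE_NUMBER = /^[A-Z]{1,4}-\d{1,6}(\/\d{4})?$/; // placeholder format

export function isCaseIdentifier(value: string): boolean {
  return UUID_V4.test(value) || CASE_NUMBER.test(value);
}
// Assumed wiring: wrap this predicate with registerDecorator() from
// class-validator to get an @IsCaseIdentifier() decorator for the DTO field,
// so malformed identifiers are rejected before they reach the service layer.
```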

Worker Isolation

My queue implementation is a good first step (passing heavy work to a child process instead of blocking the main thread), but in production I'd look at completely isolating the workers — perhaps into their own container. This keeps NestJS as a lightweight entry point, while being able to spin up many separate workers for processing multiple PDFs simultaneously. Other improvements:

  • Exponential Backoff: If the Gemini API is down for a few minutes, workers will fail immediately. I'd configure the queue with exponential backoff (e.g., retry in 5s, then 20s, then 1min).
  • Dead Letter Queues (DLQ): If a file is so corrupted it fails after multiple retries, BullMQ should move it to a "failed" queue for manual human review rather than retrying forever.
  • Worker Timeout: A particularly large PDF could "hang" the worker process. I'd set an explicit lockDuration or timeout on jobs so they don't block the queue indefinitely.
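Retries, backoff, and failed-job retention are all per-job options in BullMQ, so the improvements above are largely configuration. A sketch (values are illustrative; BullMQ's built-in exponential backoff roughly doubles the delay each retry, and jobs that exhaust `attempts` land in the failed set, which acts as the DLQ for manual review):

```typescript
// BullMQ per-job options implementing retry-with-backoff plus a DLQ:
export const parseJobOptions = {
  attempts: 4, // 1 initial try + 3 retries before giving up
  backoff: {
    type: "exponential" as const,
    delay: 5_000, // ~5s, 10s, 20s between retries
  },
  removeOnComplete: true, // keep Redis lean
  removeOnFail: false, // keep failed jobs around for manual review (DLQ)
};
// Worker-side, lockDuration (a Worker option) bounds how long a hung job
// can hold its lock before being considered stalled and reclaimed.
```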

Logging & Observability

  • Audit logging: Track who is accessing what and when.
  • Crash reporting: Integrate a service like Sentry for real-time error alerting.
  • Health check pings: For container orchestration and uptime monitoring.

Security

  • Authentication & Authorisation: The /graphql endpoint is currently open. I'd implement Auth Guards (using Passport/JWT). Even if all users can upload, you might need to track who uploaded what for audit purposes.
  • CORS: Currently defaults to open. I'd restrict CORS in main.ts to only allow trusted frontend domains.
  • CSRF Protection: GraphQL is prone to CSRF when allowing standard multipart/form-data. I'd enable csrfPrevention: true in Apollo and require a custom header (like x-apollo-operation-name) on all requests.
  • Rate Limiting: A malicious script could flood the queue with blank PDFs, costing money in AI tokens. I'd use @nestjs/throttler to limit uploads per IP per hour.
  • File Scanning: Add an anti-virus layer (like ClamAV) before saving uploads to S3.
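The rate-limiting piece is mostly module configuration; a sketch of a throttle rule for `@nestjs/throttler` (the limits are illustrative, not tuned values):

```typescript
// Illustrative throttle rule: at most 20 uploads per hour per client.
// Assumed wiring: ThrottlerModule.forRoot([uploadThrottle]) in AppModule,
// plus { provide: APP_GUARD, useClass: ThrottlerGuard } to enforce it.
export const uploadThrottle = {
  name: "uploads",
  ttl: 60 * 60 * 1000, // window length: 1 hour, in milliseconds
  limit: 20, // max requests per window per client
};
```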

Testing

Run the suite with:

npm run test

I've focused the tests on the core parsing logic (ParserService) and utility functions, as these are the areas most likely to regress. In a production context, I'd expand coverage to include service-layer tests (e.g. verifying the queue receives the correct payload) and an E2E test for the full upload → process → query flow. I drew the line here to keep the scope reasonable for a challenge.


Author: George W.
Challenge: Pandektes Legal Tech Challenge