
Pandektes Case Law Challenge

This is a NestJS-based legal document parsing application built for the Pandektes technical challenge. It extracts case law metadata from PDF and HTML documents using Gemini AI and stores it in a PostgreSQL database.

▶️ Watch the demo video

Getting Started

Prerequisites

  • Docker & Docker Compose
  • Node.js (v20+)
  • Gemini API Key (Get one at Google AI Studio)

IMPORTANT: I attached billing to my Google account to prevent hitting the free tier limits.

Installation

  1. Clone the Repo:
    git clone https://git.georgew.dev/georgew/pandektes-challenge.git
    cd pandektes-challenge
    
  2. Environment Setup: Create a .env file in the root:
    # App config
    PORT=3000
    NODE_ENV=development
    
    # AI Config
    GOOGLE_API_KEY=your_gemini_api_key_here
    
    # Database config
    POSTGRES_USER=postgres
    POSTGRES_PASSWORD=postgres
    POSTGRES_DB=pandektes
    POSTGRES_PORT=5432
    DATABASE_URL="postgresql://postgres:postgres@localhost:5432/pandektes?schema=public"
    
    # Redis config
    REDIS_HOST=localhost
    REDIS_PORT=6379
    
    # Storage (Local Minio)
    STORAGE_ENDPOINT=http://localhost:9000
    STORAGE_BUCKET=cases
    STORAGE_REGION=us-east-1
    STORAGE_ACCESS_KEY=minioadmin
    STORAGE_SECRET_KEY=minioadmin
    STORAGE_FORCE_PATH_STYLE=true
    
    Or simply copy the example: cp .env.example .env and add your GOOGLE_API_KEY.
  3. Start Infrastructure:
    docker-compose up -d
    
  4. Install & Build:
    npm install
    npx prisma migrate dev
    npm run start:dev
    

Usage

Once running, you can interact with the app in a few ways:

  • Web UI — I created a basic interface so the app can be easily tested without additional setup. Visit http://localhost:3000 to upload files and search for cases.
  • GraphQL Playground — Available at http://localhost:3000/graphql for direct query/mutation testing.
  • Prisma Studio — Run npx prisma studio to open a visual database browser and inspect the extracted case law entries directly.

Architectural Decisions

1. Why a Background Queue (BullMQ)?

Large document processing is likely to be "spiky" and slow, especially once LLM calls are involved. If it ran directly in the HTTP request, the user's connection would likely time out, and the heavy work could block the event loop on the main thread. I used sandboxed workers (running in separate processes) to avoid both problems. This also ensures that if a particularly heavy PDF causes a memory leak or CPU spike, it doesn't crash the main API that serves other users.
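A sandboxed BullMQ processor is simply a module whose default export handles the job; BullMQ forks it into a child process when the `Worker` is given a file path instead of a function. A minimal sketch (the `fileKey` payload shape and function name are illustrative assumptions, not the actual job schema):

```typescript
// parse.processor.ts — a sandboxed BullMQ processor module.
// When a Worker is constructed with a *file path* instead of a function,
// e.g. new Worker("parse", __dirname + "/parse.processor.js", { connection }),
// BullMQ runs this module in a separate child process, so a crash or CPU
// spike here cannot take down the main API process.

interface ParseJob {
  data: { fileKey: string };
}

// The default export is the handler BullMQ invokes for each job.
export default async function parseDocument(job: ParseJob) {
  // Heavy work (PDF text extraction, the Gemini call) would happen here.
  return { status: "parsed", fileKey: job.data.fileKey };
}
```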

2. S3-Compatible Storage (Minio)

Instead of saving files to the local disk, I used an S3-compatible service. Storing files on a local disk makes the app hard to scale effectively. By using S3 patterns, the app is "cloud-ready" and I can point it at AWS S3 just by changing the env variables.
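Because the storage client only needs connection details from the environment, swapping Minio for AWS S3 is a pure configuration change. A sketch of the config object (the variable names match the `.env` above; feeding it to `new S3Client(...)` from `@aws-sdk/client-s3` is the assumed usage):

```typescript
// Builds an S3-compatible client config purely from environment variables,
// so pointing at AWS S3 instead of local Minio requires no code changes.
export const storageConfig = {
  endpoint: process.env.STORAGE_ENDPOINT ?? "http://localhost:9000",
  region: process.env.STORAGE_REGION ?? "us-east-1",
  credentials: {
    accessKeyId: process.env.STORAGE_ACCESS_KEY ?? "minioadmin",
    secretAccessKey: process.env.STORAGE_SECRET_KEY ?? "minioadmin",
  },
  // Minio expects path-style URLs (http://host/bucket/key); AWS S3 defaults
  // to virtual-hosted style, so this flag stays environment-driven.
  forcePathStyle: process.env.STORAGE_FORCE_PATH_STYLE === "true",
};
// Assumed usage: new S3Client(storageConfig) from @aws-sdk/client-s3.
```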

3. Full-Document Parsing

I chose to send the full extracted text to Gemini rather than truncating it. Gemini Flash has a 1M-token context window, so even a 50-page legal document barely scratches the surface. I did consider truncation, or limiting the input to just the start and end of the document (which likely contain the most important information), but I don't have the domain knowledge of case law to make that call. Instead, I set a generous character cap (500k) to act as a safety net against abuse.
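The safety-net cap amounts to a one-line guard before the text reaches the model (500k characters as stated above; the function name is illustrative):

```typescript
// Hard cap on extracted text length before it is sent to the model —
// a guard against abuse, not a semantic truncation strategy.
const MAX_PROMPT_CHARS = 500_000;

export function capExtractedText(text: string): string {
  return text.length > MAX_PROMPT_CHARS ? text.slice(0, MAX_PROMPT_CHARS) : text;
}
```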

4. Language Handling

It wasn't clear from the requirements whether the AI should extract metadata in the document's original language or normalise everything to English. Since the provided documents include both Danish and English, I haven't enforced any language rules on the AI and left it open for now. This would be trivial to change by adding a language instruction to the prompt.


Production Readiness (Next Steps)

If I were taking this to production, here's what I'd focus on:

De-duplication

Currently, a user can re-upload the same document multiple times, creating duplicates in the database. This could be mitigated by:

  • File hashing: Calculate a hash of the uploaded file before processing. This is quick and prevents the exact same file from being processed twice.
  • Post-AI check: Compare the extracted case number against existing records. Slower, but more logically robust since two different files could describe the same case.
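The file-hashing check is cheap: hash the raw upload bytes and short-circuit if the digest already exists in the database (e.g. via a unique index on the column). A sketch using Node's built-in crypto (the function name is illustrative):

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the raw upload — identical files always produce the same
// digest, so a unique index on this value prevents exact-duplicate rows.
export function fileDigest(buffer: Buffer): string {
  return createHash("sha256").update(buffer).digest("hex");
}
```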

File Upload Scaling

If files get large, buffering them through the NestJS server becomes a bottleneck. I'd look at removing the upload from NestJS entirely:

  • The API generates a presigned upload URL (direct to an S3 bucket) and returns it to the frontend.
  • The frontend uploads the file directly to storage — the NestJS server never touches the binary data.
  • This makes the backend far more scalable and cheaper to run, while cloud storage handles the heavy lifting.

Input Validation

Currently, identifier formatting (UUID vs. Case Number) is handled via a helper function in the service layer. For production, I'd create a custom class validator and apply it to the DTO, so invalid input fails at the entry point (the "fail fast" principle).
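The entry-point check could be a pure predicate wrapped in a custom `class-validator` decorator. A sketch of the predicate (the case-number pattern here is a hypothetical placeholder — the real format would come from the domain):

```typescript
// UUID v4 pattern, plus a *hypothetical* case-number pattern — the actual
// case-number format would need to come from domain knowledge.
const UUID_V4 =
  /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;
const CASE_NUMBER = /^[A-Z]{1,4}-\d{1,6}(\/\d{4})?$/; // placeholder format

export function isCaseIdentifier(value: string): boolean {
  return UUID_V4.test(value) || CASE_NUMBER.test(value);
}
// Assumed wiring: wrap this predicate with registerDecorator() from
// class-validator to get an @IsCaseIdentifier() decorator for the DTO field,
// so malformed identifiers are rejected before they reach the service layer.
```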

Worker Isolation

My queue implementation is a good first step (passing heavy work to a child process instead of blocking the main thread), but in production I'd look at completely isolating the workers — perhaps into their own container. This keeps NestJS as a lightweight entry point, while being able to spin up many separate workers for processing multiple PDFs simultaneously. Other improvements:

  • Exponential Backoff: If the Gemini API is down for a few minutes, workers will fail immediately. I'd configure the queue with exponential backoff (e.g., retry in 5s, then 20s, then 1min).
  • Dead Letter Queues (DLQ): If a file is so corrupted it fails after multiple retries, BullMQ should move it to a "failed" queue for manual human review rather than retrying forever.
  • Worker Timeout: A particularly large PDF could "hang" the worker process. I'd set an explicit lockDuration or timeout on jobs so they don't block the queue indefinitely.
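Retries, backoff, and failed-job retention are all per-job options in BullMQ, so the improvements above are largely configuration. A sketch (values are illustrative; BullMQ's built-in exponential backoff roughly doubles the delay each retry, and jobs that exhaust `attempts` land in the failed set, which acts as the DLQ for manual review):

```typescript
// BullMQ per-job options implementing retry-with-backoff plus a DLQ:
export const parseJobOptions = {
  attempts: 4, // 1 initial try + 3 retries before giving up
  backoff: {
    type: "exponential" as const,
    delay: 5_000, // ~5s, 10s, 20s between retries
  },
  removeOnComplete: true, // keep Redis lean
  removeOnFail: false, // keep failed jobs around for manual review (DLQ)
};
// Worker-side, lockDuration (a Worker option) bounds how long a hung job
// can hold its lock before being considered stalled and reclaimed.
```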

Logging & Observability

  • Audit logging: Track who is accessing what and when.
  • Crash reporting: Integrate a service like Sentry for real-time error alerting.
  • Health check pings: For container orchestration and uptime monitoring.

Security

  • Authentication & Authorisation: The /graphql endpoint is currently open. I'd implement Auth Guards (using Passport/JWT). Even if all users can upload, you might need to track who uploaded what for audit purposes.
  • CORS: Currently defaults to open. I'd restrict CORS in main.ts to only allow trusted frontend domains.
  • CSRF Protection: GraphQL is prone to CSRF when allowing standard multipart/form-data. I'd enable csrfPrevention: true in Apollo and require a custom header (like x-apollo-operation-name) on all requests.
  • Rate Limiting: A malicious script could flood the queue with blank PDFs, costing money in AI tokens. I'd use @nestjs/throttler to limit uploads per IP per hour.
  • File Scanning: Add an anti-virus layer (like ClamAV) before saving uploads to S3.
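The rate-limiting piece is mostly module configuration; a sketch of a throttle rule for `@nestjs/throttler` (the limits are illustrative, not tuned values):

```typescript
// Illustrative throttle rule: at most 20 uploads per hour per client.
// Assumed wiring: ThrottlerModule.forRoot([uploadThrottle]) in AppModule,
// plus { provide: APP_GUARD, useClass: ThrottlerGuard } to enforce it.
export const uploadThrottle = {
  name: "uploads",
  ttl: 60 * 60 * 1000, // window length: 1 hour, in milliseconds
  limit: 20, // max requests per window per client
};
```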

Testing

Run the suite with:

npm run test

I've focused the tests on the core parsing logic (ParserService) and utility functions, as these are the areas most likely to regress. In a production context, I'd expand coverage to include service-layer tests (e.g. verifying the queue receives the correct payload) and an E2E test for the full upload → process → query flow. I drew the line here to keep the scope reasonable for a challenge.


Author: George W.
Challenge: Pandektes Legal Tech Challenge