# Pandektes Case Law Challenge 🏛️
This is a NestJS-based legal document parsing application built for the Pandektes technical challenge. It extracts case law metadata from PDF and HTML documents using Gemini AI and stores it in a PostgreSQL database.
## 🚀 Getting Started
### Prerequisites
- **Docker & Docker Compose**
- **Node.js (v20+)**
- **Gemini API Key** (Get one at [Google AI Studio](https://aistudio.google.com/))
NOTE: I attached billing to my Google account to avoid hitting the free-tier limits.
### Installation
1. **Clone the Repo**:
```bash
git clone [your-repo-url]
cd pandektes-challenge
```
2. **Environment Setup**:
Create a `.env` file in the root:
```env
# AI Config
GOOGLE_API_KEY=your_gemini_api_key_here
# Database (Standard Docker defaults)
DATABASE_URL="postgresql://postgres:postgres@localhost:5432/pandektes?schema=public"
REDIS_HOST="localhost"
REDIS_PORT=6379
# Storage (Local Minio)
STORAGE_ENDPOINT="http://localhost:9000"
STORAGE_BUCKET="cases"
STORAGE_REGION="us-east-1"
STORAGE_ACCESS_KEY="minioadmin"
STORAGE_SECRET_KEY="minioadmin"
STORAGE_FORCE_PATH_STYLE="true"
```
3. **Start Infrastructure**:
```bash
docker-compose up -d
```
4. **Install & Build**:
```bash
npm install
npx prisma migrate dev
npm run start:dev
```
### Usage
Once running, you can interact with the app in a few ways:
- **Web UI** — I created a basic interface so the app can be easily tested without additional setup. Visit [http://localhost:3000](http://localhost:3000) to upload files and search for cases.
- **GraphQL Playground** — Available at [http://localhost:3000/graphql](http://localhost:3000/graphql) for direct query/mutation testing.
- **Prisma Studio** — Run `npx prisma studio` to open a visual database browser and inspect the extracted case law entries directly.
---
## 🏗️ Architectural Decisions
### 1. Why a Background Queue (BullMQ)?
Large document processing is likely to be "spiky" and slow, especially with LLM calls added on top. Handling it directly in the HTTP request would risk timing out the user's connection and blocking the event loop on the main thread.
I used sandboxed workers (running in separate processes) to circumvent that. This also ensures that if a particularly heavy PDF causes a memory leak or CPU spike, it doesn't crash the main API that serves other users.
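In plain BullMQ, a sandboxed processor is registered by passing a *file path* instead of an inline function — a minimal sketch, with illustrative file and queue names rather than this project's actual ones:

```typescript
// worker wiring — illustrative names, not the project's actual files
import { join } from "node:path";
import { Worker } from "bullmq";

// Passing a file path (instead of a function) tells BullMQ to run the
// processor in a separate child process, so a crash or CPU spike while
// parsing a heavy PDF cannot take down the main API process.
const worker = new Worker(
  "documents",
  join(__dirname, "document.processor.js"),
  {
    connection: {
      host: process.env.REDIS_HOST,
      port: Number(process.env.REDIS_PORT),
    },
  },
);
```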
### 2. S3-Compatible Storage (Minio)
Instead of saving files to the local disk, I used an S3-compatible service. Storing files on local disk makes the app hard to scale effectively. By following S3 patterns, the app is "cloud-ready": I can point it at AWS S3 just by changing the environment variables.
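One way to keep that swap purely configuration-driven is to derive the S3 client options from the environment variables shown above — a sketch (the option names follow the AWS SDK v3 `S3Client` shape; the helper itself is hypothetical):

```typescript
// Build S3 client options from environment variables, so switching from
// local Minio to AWS S3 is a configuration change, not a code change.
interface StorageConfig {
  endpoint?: string;
  region: string;
  forcePathStyle: boolean;
  credentials: { accessKeyId: string; secretAccessKey: string };
}

function buildStorageConfig(env: Record<string, string | undefined>): StorageConfig {
  return {
    endpoint: env.STORAGE_ENDPOINT, // required for Minio; omit for real AWS
    region: env.STORAGE_REGION ?? "us-east-1",
    // Minio needs path-style URLs (http://host/bucket/key, not bucket.host)
    forcePathStyle: env.STORAGE_FORCE_PATH_STYLE === "true",
    credentials: {
      accessKeyId: env.STORAGE_ACCESS_KEY ?? "",
      secretAccessKey: env.STORAGE_SECRET_KEY ?? "",
    },
  };
}
```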
### 3. Full-Document Parsing
I chose to send the full extracted text to Gemini rather than truncating it. Gemini Flash has a 1M-token context window, so even a 50-page legal document barely scratches the surface. I did consider truncation, or limiting input to just the start and end of the document (which likely contains the most important information), but I don't have the domain knowledge of case law to make that call.
I instead set a generous character cap (500k) to act as a safety net against abuse.
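The cap itself can be a one-line guard applied before the prompt is built — a sketch (function name is illustrative):

```typescript
// Safety net: cap the text sent to the LLM. 500k characters sits far below
// Gemini Flash's context window but guards against abuse or runaway inputs.
const MAX_PROMPT_CHARS = 500_000;

function capDocumentText(text: string, max = MAX_PROMPT_CHARS): string {
  return text.length <= max ? text : text.slice(0, max);
}
```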
### 4. Language Handling
It wasn't clear from the requirements whether the AI should extract metadata in the document's original language or normalise everything to English. Since the provided documents include both Danish and English, I haven't enforced any language rules on the AI and left it open for now. This would be trivial to change by adding a language instruction to the prompt.
---
## 🛠️ Production Readiness (Next Steps)
If I were taking this to production, here's what I'd focus on:
### De-duplication
Currently, a user can re-upload the same document multiple times, creating duplicates in the database. This could be mitigated by:
- File hashing: Calculate a hash of the uploaded file before processing. This is quick and prevents the exact same file from being processed twice.
- Post-AI check: Compare the extracted case number against existing records. Slower, but more logically robust since two different files could describe the same case.
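The first option is a few lines with Node's built-in crypto — a sketch of the fingerprinting step (storing the hash with a unique database constraint would then reject exact re-uploads before any AI tokens are spent):

```typescript
import { createHash } from "node:crypto";

// First-pass de-duplication: hash the raw upload before any processing.
// Identical bytes always produce the same digest, so an exact re-upload
// can be rejected cheaply at the entry point.
function fileFingerprint(file: Buffer): string {
  return createHash("sha256").update(file).digest("hex");
}
```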
### File Upload Scaling
If files get large, buffering them through the NestJS server becomes a bottleneck. I'd look at removing the upload from NestJS entirely:
- The API generates a presigned upload URL (direct to an S3 bucket) and returns it to the frontend.
- The frontend uploads the file directly to storage — the NestJS server never touches the binary data.
- This makes the backend far more scalable and cheaper to run, while cloud storage handles the heavy lifting.
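The server-side half of that flow is small with AWS SDK v3 — a sketch (bucket name and expiry are illustrative):

```typescript
// The API's only job in the presigned flow: sign a short-lived PUT URL.
// The client then uploads the binary straight to S3/Minio.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

async function createUploadUrl(s3: S3Client, key: string): Promise<string> {
  const command = new PutObjectCommand({ Bucket: "cases", Key: key });
  // URL is valid for 15 minutes; the file never passes through NestJS.
  return getSignedUrl(s3, command, { expiresIn: 900 });
}
```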
### Input Validation
Currently, identifier formatting (UUID vs. Case Number) is handled via a helper function in the service layer. For production, I'd create a custom class validator so invalid input fails at the entry point instead ("fail fast").
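The predicate such a class-validator constraint would wrap could look like this — a sketch, where the case-number pattern is purely hypothetical (the real format would come from the legal domain, not from this example):

```typescript
// Classify an incoming identifier so validation can fail fast at the edge.
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

type IdentifierKind = "uuid" | "case-number" | "invalid";

function classifyIdentifier(id: string): IdentifierKind {
  if (UUID_RE.test(id)) return "uuid";
  // Hypothetical case-number shape, e.g. "BS-1234/2023" — illustrative only.
  if (/^[A-Z]{1,4}-\d+\/\d{4}$/.test(id)) return "case-number";
  return "invalid";
}
```

A custom `@ValidatorConstraint` on the DTO could delegate to this function, so malformed identifiers are rejected before they ever reach the service layer.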
### Worker Isolation
My queue implementation is a good first step (passing heavy work to a child process instead of blocking the main thread), but in production I'd look at completely isolating the workers — perhaps into their own container. This keeps NestJS as a lightweight entry point, while being able to spin up many separate workers for processing multiple PDFs simultaneously. Other improvements:
- Exponential Backoff: If the Gemini API is down for a few minutes, workers will fail immediately. I'd configure the queue with exponential backoff (e.g., retry in 5s, then 20s, then 1min).
- Dead Letter Queues (DLQ): If a file is so corrupted it fails after multiple retries, BullMQ should move it to a "failed" queue for manual human review rather than retrying forever.
- Worker Timeout: A particularly large PDF could "hang" the worker process. I'd set an explicit `lockDuration` or timeout on jobs so they don't block the queue indefinitely.
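The first two improvements are mostly job options in BullMQ, whose built-in exponential backoff doubles the delay on each attempt (`delay * 2^(attempt - 1)`) — a sketch, with illustrative numbers:

```typescript
// BullMQ's built-in exponential backoff: delay * 2^(attempt - 1).
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** (attempt - 1);
}

// Job options in the shape BullMQ accepts when enqueueing:
const jobOptions = {
  attempts: 5, // after the final failure the job lands in the "failed" set,
               // which acts as the DLQ for manual review
  backoff: { type: "exponential", delay: 5_000 },
};
```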
### Logging & Observability
- Audit logging: Track who is accessing what and when.
- Crash reporting: Integrate a service like Sentry for real-time error alerting.
- Health check pings: For container orchestration and uptime monitoring.
### Security
- Authentication & Authorisation: The `/graphql` endpoint is currently open. I'd implement Auth Guards (using Passport/JWT). Even if all users can upload, you might need to track who uploaded what for audit purposes.
- CORS: Currently defaults to open. I'd restrict CORS in `main.ts` to only allow trusted frontend domains.
- CSRF Protection: GraphQL is prone to CSRF when allowing standard `multipart/form-data`. I'd enable `csrfPrevention: true` in Apollo and require a custom header (like `x-apollo-operation-name`) on all requests.
- Rate Limiting: A malicious script could flood the queue with blank PDFs, costing money in AI tokens. I'd use `@nestjs/throttler` to limit uploads per IP per hour.
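Wiring up `@nestjs/throttler` is mostly module configuration — a sketch (the array form with `ttl` in milliseconds is the v5+ shape; older versions took seconds; the numbers here are illustrative):

```typescript
import { Module } from "@nestjs/common";
import { ThrottlerModule } from "@nestjs/throttler";

@Module({
  imports: [
    ThrottlerModule.forRoot([
      { ttl: 60 * 60 * 1000, limit: 20 }, // e.g. 20 requests per IP per hour
    ]),
  ],
})
export class AppModule {}
```

The `ThrottlerGuard` would still need to be bound (e.g. as a global `APP_GUARD`) for the limits to be enforced.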
- File Scanning: Add an anti-virus layer (like ClamAV) before saving uploads to S3.
## 🧪 Testing
Run the suite with:
```bash
npm run test
```
I've focused the tests on the core parsing logic (`ParserService`) and utility functions, as these are the areas most likely to regress. In a production context, I'd expand coverage to include service-layer tests (e.g. verifying the queue receives the correct payload) and an E2E test for the full upload → process → query flow. I drew the line here to keep the scope reasonable for a challenge.
---
**Author**: George W.
**Challenge**: Pandektes Legal Tech Challenge