ipl-doc — App Logic Verification

Pipeline Architecture

flowchart TB
    Start([PDF / URL Input]) --> P0[Phase 0: Source Registry]
    P0 --> DB0[(documents + sources/)]
    DB0 --> P1[Phase 1: PDF to Markdown]
    P1 --> V1{Coverage >= 0.5?}
    V1 -->|pass| DB1[(pages + md/)]
    V1 -->|low| Retry[Retry with stronger model]
    Retry --> DB1
    DB1 --> RT{Round-trip sample 0.10}
    RT -->|ok + drift| DB_RT[(page_round_trip_checks)]
    DB_RT --> P2[Phase 2: Sectioning]
    P2 --> V2{Overlap + hash?}
    V2 -->|fail| Manual2[Manual review]
    V2 -->|pass| DB2[(sections)]
    DB2 --> P3[Phase 3: Extraction]
    P3 --> V3{Citation + confidence?}
    V3 -->|high conf| DB3a[(extractions: verified)]
    V3 -->|low conf| P4[Phase 4: Escalation]
    P4 --> V4{Model consensus?}
    V4 -->|agree| DB3b[(verified by consensus)]
    V4 -->|disagree| Manual4[Manual review]
    DB3a --> P5[Phase 5: Canonicalize]
    DB3b --> P5
    Manual4 -->|approved| P5
    P5 --> Prod[(vykony + omezeni + kombinace)]
    Prod --> P6[Phase 6: LLM Surface]
    P6 --> Surf[(nl_description + FTS index)]
    Prod --> SQLApps[SQL Applications]
    Surf --> LLMApps[LLM Applications]
    LLMLog[(llm_calls log)]
    LLMCache[(llm_cache)]
    P1 -.-> LLMLog
    P2 -.-> LLMLog
    P3 -.-> LLMLog
    P4 -.-> LLMLog
    P6 -.-> LLMLog
    LLMLog -.-> LLMCache

Rectangles = processing phases · Diamonds = validation gates · Cylinders = storage · Dotted = LLM audit logging
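
The Phase 1 coverage gate in the diagram can be sketched as a small classifier. This is a minimal illustration, not the pipeline's actual code: the function name is hypothetical, the 0.5 threshold comes from the diagram, the ratio follows the schema's definition of coverage_ratio (LLM chars / PDF chars), and treating a block with zero PDF characters as an empty page is an assumption.

```python
COVERAGE_THRESHOLD = 0.5  # from the "Coverage >= 0.5?" gate in the diagram


def coverage_status(llm_chars: int, pdf_chars: int) -> str:
    """Classify one extraction block with the status values of the pages table."""
    if pdf_chars == 0:
        # Assumption: no extractable PDF text means the block is an empty page.
        return "empty_page"
    ratio = llm_chars / pdf_chars
    # Blocks below the threshold would be retried with a stronger model.
    return "ok" if ratio >= COVERAGE_THRESHOLD else "low_coverage"
```

A block classified as low_coverage corresponds to the "Retry with stronger model" branch.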

Database Schema

PostgreSQL database ipl_doc with two layers: internal pipeline tables (processing audit trail) and stable API tables (consumer-facing contract).

erDiagram
    documents ||--o{ pages : "Phase 1 produces"
    pages ||--o| page_round_trip_checks : "Phase 1.4 verifies"
    documents ||--o{ sections : "Phase 2 segments"
    sections ||--o{ extractions : "Phase 3 extracts"
    extractions ||--o{ validations : "validated by"
    vykony ||--o{ vykony_omezeni : "has constraints"
    vykony ||--o{ vykony_kombinace : "has combination rules"
    documents {
        uuid id PK
        text title
        text source_url
        text pdf_path
        text pdf_hash "SHA-256 for dedup"
        int pdf_pages
        timestamp fetched_at
    }
    pages {
        uuid id PK
        uuid document_id FK
        int page_start
        int page_end
        text md_path "path to .md file"
        float coverage_ratio "LLM chars / PDF chars"
        text status "ok | low_coverage | empty_page"
    }
    page_round_trip_checks {
        uuid id PK
        uuid page_id FK
        float similarity "0.0 to 1.0"
        text status "ok | drift"
    }
    sections {
        uuid id PK
        uuid document_id FK
        text section_type
        text identifier
        text title
    }
    extractions {
        uuid id PK
        uuid section_id FK
        text schema_name
        jsonb json_data
        float overall_confidence
        bool citation_verified
    }
    validations {
        uuid id PK
        uuid extraction_id FK
        jsonb checks
        text final_status
    }
    vykony {
        text entity_uri PK "vykon:kod@date"
        text kod
        text nazev
        text nl_description
        tsvector searchable_text
        jsonb provenance
        tstzrange validity
    }
    vykony_omezeni {
        text entity_uri PK
        text vykon_uri FK
        text typ "freq | age | diag"
        jsonb details
    }
    vykony_kombinace {
        text entity_uri PK
        text vykon_a FK
        text vykon_b FK
        text pravidlo
    }
    llm_calls {
        uuid id PK
        text model
        text phase
        int tokens_input
        int tokens_output
        float cost_usd
        bool cache_hit
    }
    llm_cache {
        text cache_key PK "SHA-256 hash"
        jsonb response
        int hit_count
    }

Internal Tables (Pipeline Audit Trail)

These tables record every step of the pipeline. They are append-only audit logs. Downstream applications must not query these for answers — use the stable API tables instead.

Table	Phase	What It Stores
`documents`	0	Registered PDFs with source URL, hash, page count. One row per unique PDF.
`pages`	1	Extraction blocks (5-page chunks). Links to Markdown file on disk. Stores coverage ratio and status.
`page_round_trip_checks`	1.4	Round-trip verification results. Similarity score and drift/ok status for sampled blocks.
`sections`	2	Logical sections identified within the document (paragraphs, tables, appendices).
`extractions`	3	Raw structured data extracted by LLM. JSON matching Pydantic schemas, with confidence scores.
`validations`	3-4	Validation and escalation audit. Records which checks passed/failed and escalation outcomes.
`llm_calls`	all	Every LLM invocation: model, tokens, cost, duration, cache hit flag. Full audit trail.
`llm_cache`	all	Cached LLM responses keyed by a SHA-256 hash of the model, system prompt, user input, and optional schema. Avoids re-paying for identical requests.

Stable API Tables (Consumer Contract)

These are the guaranteed interface for downstream applications. Breaking changes require an architecture review. Every row has an entity_uri for stable referencing and a provenance JSONB for full traceability.

Table/View	Type	What It Contains
`vykony`	table	Healthcare procedures. Each row is one procedure at one point in time (temporal versioning via `validity` range).
`vykony_omezeni`	table	Constraints on procedures: frequency limits, age restrictions, required diagnoses, specialty requirements.
`vykony_kombinace`	table	Rules governing which procedures can/cannot be billed together.
`vykon_full`	matview	Denormalized join of all three tables above. The primary query target for applications.

Entity URI Schema

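Every entity has a stable, deterministic identifier of the form {entity_type}:{kod}@{valid_from_iso}. As the examples show, constraint URIs place the constraint type before the code, and combination URIs join both procedure codes with +.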

URI Example	What It Identifies
`vykon:09543@2025-01-01`	Procedure 09543, effective from 1 Jan 2025
`omezeni:freq:09543@2025-01-01`	Frequency constraint for procedure 09543
`kombinace:09543+09544@2025-01-01`	Combination rule between procedures 09543 and 09544

Properties: Idempotent (re-extraction produces same URI), human-readable (LLMs can cite them), deterministic (no UUIDs), temporally versioned (same code at different dates = different entities).
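
The three URI shapes in the table can be produced by plain string formatting. The sketch below is illustrative only: the function names are hypothetical, and sorting the two codes of a combination so that both argument orders give the same URI is an assumption, not something the source states.

```python
from datetime import date


def vykon_uri(kod: str, valid_from: date) -> str:
    """Procedure URI, e.g. vykon:09543@2025-01-01."""
    return f"vykon:{kod}@{valid_from.isoformat()}"


def omezeni_uri(typ: str, kod: str, valid_from: date) -> str:
    """Constraint URI; typ is one of freq, age, diag."""
    return f"omezeni:{typ}:{kod}@{valid_from.isoformat()}"


def kombinace_uri(kod_a: str, kod_b: str, valid_from: date) -> str:
    """Combination URI for a pair of procedure codes."""
    # Assumption: codes are sorted so (a, b) and (b, a) map to one URI.
    first, second = sorted((kod_a, kod_b))
    return f"kombinace:{first}+{second}@{valid_from.isoformat()}"
```

Because the output depends only on the code and the validity start date, re-running extraction on the same data yields the same identifiers.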

API & Data Access

How downstream applications query the extracted data. The citation contract is mandatory — every answer must trace back to a source PDF sentence.

flowchart LR
    subgraph Pipeline Output
        VF[(vykon_full matview)]
        V[(vykony)]
        VO[(vykony_omezeni)]
        VK[(vykony_kombinace)]
    end
    subgraph Query Methods
        FTS[Full-Text Search\nCzech tsvector]
        Lookup[Code Lookup\nkod + validity range]
        Fuzzy[Fuzzy Search\npg_trgm trigrams]
        Embed[Vector Search\npgvector - planned]
    end
    subgraph Applications
        SQL[SQL Apps\nDeterministic queries]
        LLM[LLM Apps\nFTS retrieval + citations]
    end
    VF --> FTS
    VF --> Lookup
    VF --> Fuzzy
    VF -.-> Embed
    FTS --> SQL
    FTS --> LLM
    Lookup --> SQL
    Lookup --> LLM
    Fuzzy --> LLM

Query Patterns

Pattern	When to Use	How It Works
Code Lookup	Know the exact procedure code	Query `vykon_full` by `kod` + `validity @> now()` for current version, or `validity @> $date` for a specific date
Full-Text Search	Search by description or name	Use `websearch_to_tsquery('czech', $query)` against `searchable_text`. Weighted: code (A) > name (B) > description (C)
Fuzzy Search	Handle typos and partial matches	Use `pg_trgm` `similarity()` function with `%` operator on `nazev`
Vector Search	Semantic similarity (future)	`embedding vector(1024)` column prepared but empty. Applications must fall back to FTS when NULL.
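
The first three patterns might translate into parameterized queries along these lines. This is a sketch under assumptions: the helper names are hypothetical, the matview is assumed to expose kod, validity, nazev, and searchable_text, and the $1-style placeholders follow the asyncpg convention. Ranking with ts_rank is one reasonable choice, not necessarily what the API uses.

```python
from datetime import datetime, timezone
from typing import Any, Optional

Query = tuple[str, list[Any]]  # SQL text plus positional parameters


def code_lookup(kod: str, at: Optional[datetime] = None) -> Query:
    """Version of a procedure valid at a given instant (default: now)."""
    at = at or datetime.now(timezone.utc)
    sql = "SELECT * FROM vykon_full WHERE kod = $1 AND validity @> $2::timestamptz"
    return sql, [kod, at]


def full_text_search(query: str, limit: int = 50) -> Query:
    """Czech full-text search against the weighted tsvector column."""
    sql = (
        "SELECT *, ts_rank(searchable_text, websearch_to_tsquery('czech', $1)) AS rank "
        "FROM vykon_full "
        "WHERE searchable_text @@ websearch_to_tsquery('czech', $1) "
        "ORDER BY rank DESC LIMIT $2"
    )
    return sql, [query, limit]


def fuzzy_search(term: str, limit: int = 50) -> Query:
    """Trigram match on the procedure name; requires the pg_trgm extension."""
    sql = (
        "SELECT *, similarity(nazev, $1) AS score "
        "FROM vykon_full WHERE nazev % $1 "
        "ORDER BY score DESC LIMIT $2"
    )
    return sql, [term, limit]
```

With an asyncpg connection, each pair could be executed as `await conn.fetch(sql, *params)`.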

REST API v1

Read-only JSON API at /api/v1 serving pipeline data for ipl-app consumption. Built with FastAPI + asyncpg. CORS enabled for all origins. OpenAPI 3.1 spec auto-generated.

flowchart LR
    subgraph REST API
        direction TB
        EP1["/procedures"]
        EP2["/specializations"]
        EP3["/combinations"]
        EP4["/code-lists"]
        EP5["/nursing-days"]
        EP6["/rules"]
        EP7["/changes"]
        EP8["/day-care"]
    end
    subgraph Database
        V[(vykony)]
        VO[(vykony_omezeni)]
        VK[(vykony_kombinace)]
        D[(documents)]
        EX[(extractions)]
    end
    subgraph Clients
        App[ipl-app]
        Swagger[Swagger UI]
    end
    EP1 --> V
    EP1 --> VO
    EP1 --> VK
    EP2 --> V
    EP3 --> VK
    EP4 --> D
    EP5 --> EX
    EP6 --> EX
    EP7 --> EX
    EP8 --> EX
    App --> EP1
    App --> EP2
    Swagger --> EP1

Endpoints

Method	Path	Description	Query Parameters
GET	`/api/v1/health`	Liveness check + DB status	—
GET	`/api/v1/procedures`	List procedures	`q` (FTS), `specialization`, `limit`, `offset`
GET	`/api/v1/procedures/{code}`	Procedure detail with restrictions + combinations	—
GET	`/api/v1/specializations`	Distinct specializations with procedure counts	`limit`, `offset`
GET	`/api/v1/specializations/{code}`	Procedures for one specialization	—
GET	`/api/v1/combinations`	Procedure combination rules	`procedure_code`, `type`, `limit`, `offset`
GET	`/api/v1/code-lists`	Ingested documents as code lists	`limit`, `offset`
GET	`/api/v1/code-lists/{name}`	Code list detail with page extraction status	—
GET	`/api/v1/nursing-days`	Nursing day entries	`limit`, `offset`
GET	`/api/v1/rules`	Billing rules	`limit`, `offset`
GET	`/api/v1/changes`	Change log across legislation versions	`limit`, `offset`
GET	`/api/v1/day-care`	Day-care procedure entries	`limit`, `offset`

Pagination

All list endpoints accept limit (1–500, default 50) and offset (default 0). Response shape: {"items": [...], "total": N, "limit": 50, "offset": 0}
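
A client can walk any list endpoint by advancing offset until total is reached. The sketch below assumes only the response shape shown above; `iter_items` and the `fetch_page` callable are hypothetical names, and the HTTP call itself is left to the caller.

```python
from typing import Any, Callable, Iterator

Page = dict[str, Any]  # {"items": [...], "total": N, "limit": L, "offset": O}


def iter_items(fetch_page: Callable[[int, int], Page], limit: int = 50) -> Iterator[Any]:
    """Yield every item from a paginated list endpoint.

    fetch_page(limit, offset) must return one response page as a dict.
    """
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page["items"]
        offset += limit
        # Stop when the reported total is covered or the server returns nothing.
        if offset >= page["total"] or not page["items"]:
            break
```

The empty-page check guards against looping forever if total changes between requests.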

Provenance

Entity payloads include a provenance object: {"document_id", "section_id", "extraction_id", "source_url"} — linking each data point back to its source PDF.

Environments

Environment	URL
Local	`http://localhost:8000/api/v1`
Production	`https://ipl-api.tipelt.cz/api/v1`

Citation Contract

Every answer produced by an LLM application consuming this data MUST cite entity_uri references. An answer without citations is treated as a hallucination.

flowchart LR
    URI[entity_uri] --> Prov[provenance JSONB] --> Ext[extraction_id] --> Sec[section_id] --> Doc[document_id] --> PDF[Source PDF sentence]

This chain allows any data point to be traced back to the exact sentence in the original PDF where it was stated. CLI verification: pipeline trace <table>.<column> --entity-uri=<uri>
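
An LLM application could enforce the contract before returning an answer. The following is a minimal sketch, not part of the pipeline: the regular expression is inferred from the three URI examples in this document, and `check_citations` and its `known_uris` argument are hypothetical.

```python
import re

# Pattern inferred from the documented examples, e.g. vykon:09543@2025-01-01,
# omezeni:freq:09543@2025-01-01, kombinace:09543+09544@2025-01-01.
ENTITY_URI = re.compile(r"\b(?:vykon|omezeni|kombinace):[\w:+]+@\d{4}-\d{2}-\d{2}\b")


def cited_uris(answer: str) -> list[str]:
    """Return all entity_uri references found in an answer, in order."""
    return ENTITY_URI.findall(answer)


def check_citations(answer: str, known_uris: set[str]) -> list[str]:
    """Return the cited URIs; raise if there are none or any is unknown."""
    uris = cited_uris(answer)
    if not uris:
        raise ValueError("answer has no entity_uri citation")
    unknown = [u for u in uris if u not in known_uris]
    if unknown:
        raise ValueError(f"unknown entity_uri cited: {unknown}")
    return uris
```

Rejecting unknown URIs as well as missing ones catches citations the model invented.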

LLM Integration

All LLM interactions go through the centralized LLMClient wrapper. No pipeline module calls the Anthropic SDK directly.

Supported Models

claude-haiku-4-5-20251001
claude-opus-4-6
claude-sonnet-4-6

Call Flow

flowchart LR
    Pipeline[Pipeline Phase] --> Client[LLMClient]
    Client --> CacheCheck{Cache hit?}
    CacheCheck -->|yes| Cached[Return cached response]
    CacheCheck -->|no| RateLimit[Rate Limiter]
    RateLimit --> SDK[Anthropic SDK / Claude CLI]
    SDK --> Log[Log to llm_calls table]
    SDK --> CacheStore[Store in llm_cache]
    Log --> Response[Return response]
    CacheStore --> Response

Caching

Text-based calls are cached by SHA-256 hash of (model + system_prompt + user_input + optional_schema). Cache is stored in the llm_cache PostgreSQL table with hit counting. PDF-based calls are not cached (binary content too large for key hashing).
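
One way to derive such a key is to serialize the four inputs canonically and hash the result. This is a sketch under assumptions: the JSON serialization, the field names, and the function name are illustrative, and the real LLMClient may combine the inputs differently.

```python
import hashlib
import json
from typing import Any, Optional


def cache_key(model: str, system_prompt: str, user_input: str,
              schema: Optional[dict[str, Any]] = None) -> str:
    """SHA-256 hex digest over model, prompts, and optional schema."""
    # sort_keys makes the key independent of dict ordering in the schema.
    payload = json.dumps(
        {"model": model, "system": system_prompt, "user": user_input, "schema": schema},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Serializing to a structured form rather than concatenating strings avoids collisions where the boundary between system prompt and user input shifts.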

Rate Limiting

Per-model semaphore (concurrency = 1 per model by default)
Global output token bucket: 20K tokens/minute cap
On 429 errors: exponential backoff from 2s to 600s max
Proactive pacing: waits before sending if token bucket near capacity
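
The backoff schedule for 429 errors can be sketched as a generator. The 2 s start and 600 s cap come from the list above; the doubling factor and the function name are assumptions, since the source does not state the growth rate or whether jitter is applied.

```python
import itertools
from typing import Iterator


def backoff_delays(base: float = 2.0, cap: float = 600.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield retry delays in seconds, growing geometrically up to cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First ten delays with the assumed doubling factor.
first_ten = list(itertools.islice(backoff_delays(), 10))
# first_ten == [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, 600.0]
```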

Audit Logging

Every LLM call is logged to the llm_calls table with: model, phase, prompt version, input/output token counts, cost in USD, duration in ms, cache hit flag, and the full response. This provides a complete audit trail and cost tracking for all AI operations.

Document Discovery

Automated crawler for Czech healthcare legislation sources. Finds new PDFs, validates they are final (not draft) documents from official sources, and ingests confirmed candidates.

Configured Sources

VZP (Všeobecná zdravotní pojišťovna)
MZ ČR (Ministerstvo zdravotnictví)

4-Layer Validation

flowchart TB
    PDF[Downloaded PDF] --> L1{{Layer 1: Source Authority}}
    L1 -->|official domain| L2{{Layer 2: Draft Detection}}
    L1 -->|unofficial| REJECT[REJECT]
    L2 -->|FINAL| L3{{Layer 3: Cross-Reference}}
    L2 -->|DRAFT| REJECT
    L3 --> L4{{Layer 4: Metadata Consistency}}
    L3 -->|INCONCLUSIVE| HOLD[HOLD for review]
    L4 -->|gazette + date ok| INGEST[INGEST]
    L4 -->|issues| HOLD

Layer	What It Checks	Method
1. Source Authority	Is the domain on the official whitelist? (vzp.cz, mzd.gov.cz, sukl.cz, zakonyprolidi.cz, etc.)	Domain matching against hardcoded list
2. Draft Detection	Is this a final published document, not a draft?	LLM reads the first 3 pages and checks for draft markers ("návrh", "pracovní verze", watermarks)
3. Cross-Reference	Can the document be confirmed in the Law Gazette (Sbírka zákonů)?	Web search for title + gazette number; needs 2+ confirmation keywords
4. Metadata Consistency	Does it have a concrete effective date and gazette number?	Regex extraction of dates and gazette references from PDF text

Decision matrix: Layer 1 FAIL = reject (unofficial source). Layer 2 FAIL = reject (draft). Layer 4 FAIL = hold. Any INCONCLUSIVE = hold for human review. All four PASS = ingest.
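
The decision matrix can be written as a single function over the four layer outcomes. This is an illustrative sketch with hypothetical names. Two points are assumptions because the source does not cover them: rejection takes priority over an inconclusive result in another layer, and a Layer 3 failure is treated as hold.

```python
from enum import Enum


class Outcome(str, Enum):
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"


def decide(source: Outcome, draft: Outcome,
           cross_ref: Outcome, metadata: Outcome) -> str:
    """Map the four validation layers to reject, hold, or ingest."""
    # Layers 1 and 2 are hard gates: unofficial source or draft is rejected.
    if source is Outcome.FAIL or draft is Outcome.FAIL:
        return "reject"
    if all(r is Outcome.PASS for r in (source, draft, cross_ref, metadata)):
        return "ingest"
    # Any inconclusive layer or a Layer 4 failure goes to human review.
    # Assumption: a Layer 3 failure is handled the same way.
    return "hold"
```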

Prompt Templates

All LLM prompts are stored as versioned text files. The loader automatically picks the highest version. Prompts are never inlined in code.

Name	Version	File	Size	Preview
`draft_check`	v1	`draft_check_v1.txt`	834 chars	You are reviewing a Czech legislative or regulatory document to determine whether it is a finalized (platné/účinné) version or a draft (návrh/pracovní verze). Examine the document text below. Look fo...
`nl_description_vykon`	v1	`nl_description_vykon_v1.txt`	379 chars	You are a Czech healthcare domain expert. Given structured data about a medical procedure (výkon), write a concise natural-language description in Czech. The description should: - Be 1-3 sentences lo...
`pdf_to_md`	v1	`pdf_to_md_v1.txt`	435 chars	You are a document conversion specialist. Convert the following Czech healthcare legislation PDF page to clean Markdown. Rules: - Preserve all section numbers, paragraph numbers, and legal references...
`sectioning`	v1	`sectioning_v1.txt`	575 chars	You are a Czech healthcare legislation analyst. Given a Markdown document converted from a PDF, identify and output the logical sections. For each section output a JSON object with: - "section_id": t...
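
Picking the highest version from files named like the ones in the table could look as follows. This is a sketch, not the actual loader: `latest_prompt` is a hypothetical name and the `{name}_v{N}.txt` pattern is inferred from the file names listed above.

```python
import re
from pathlib import Path
from typing import Optional

# Matches e.g. draft_check_v1.txt -> name="draft_check", version="1".
VERSIONED = re.compile(r"^(?P<name>.+)_v(?P<version>\d+)\.txt$")


def latest_prompt(prompt_dir: Path, name: str) -> Optional[Path]:
    """Return the path of the highest-versioned template, or None."""
    best: Optional[tuple[int, Path]] = None
    for path in prompt_dir.glob(f"{name}_v*.txt"):
        match = VERSIONED.match(path.name)
        if match is None or match.group("name") != name:
            continue
        # Compare versions numerically so v10 ranks above v2.
        version = int(match.group("version"))
        if best is None or version > best[0]:
            best = (version, path)
    return best[1] if best else None
```

Comparing versions as integers matters once a template passes v9, where a plain string sort would pick the wrong file.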

Database Migrations

Schema evolution managed by Alembic. Each migration is idempotent and versioned.

File	Description	Tables Created
`001_initial_schema.py`	Initial schema: extensions, all pipeline and production tables.	alter only
`002_round_trip_checks.py`	Round-trip sample check audit table for M1.4.	alter only
`003_document_status.py`	Drop document_status column — drafts detected by LLM instead.	alter only

CLI Commands

All pipeline operations are accessible via the pipeline CLI. Commands require DATABASE_URL; extraction phases also need ANTHROPIC_API_KEY.

Command	Description
`pipeline ingest`	Ingest a PDF from a URL or local file path.
`pipeline run`	Run the extraction pipeline for a document.
`pipeline status`	Show pipeline status for documents.
`pipeline trace`	Trace data lineage through the pipeline.
`pipeline review`	Review extraction results — opens quality viewer in browser.
`pipeline discover`	Discover documents from Czech healthcare legislation sources.
`pipeline eval`	Evaluate pipeline accuracy against golden datasets.

Storage Layout

All file artifacts are managed by storage/paths.py under a configurable root directory. Markdown is the single source of truth (SSOT) — stored on disk, referenced by path in the database.

flowchart TB
    Root["STORAGE_ROOT (default: /storage)"] --> Sources["sources/{document_id}/"]
    Root --> MD["md/{document_id}/"]
    Root --> Logs["logs/{document_id}/"]
    Sources --> PDF["original.pdf"]
    MD --> Block1["pages_0001-0005.md"]
    MD --> Block2["pages_0006-0010.md"]
    MD --> BlockN["..."]

sources/ — Original downloaded PDFs, never modified after ingest
md/ — Extracted Markdown files, one per 5-page block. This is the cached extraction output; if these files exist, re-runs skip the LLM call.
logs/ — Pipeline execution logs per document

Path validation: Document IDs with /, \, or .. are rejected to prevent path traversal.
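
The validation rule and the block naming from the diagram can be combined in a small helper. This is a sketch with hypothetical function names, not the contents of storage/paths.py; rejecting an empty ID is an added assumption.

```python
from pathlib import Path


def validate_document_id(document_id: str) -> str:
    """Reject IDs that could escape the storage root."""
    # Assumption: an empty ID is also rejected.
    if not document_id or any(t in document_id for t in ("/", "\\", "..")):
        raise ValueError(f"unsafe document id: {document_id!r}")
    return document_id


def md_block_path(root: Path, document_id: str,
                  page_start: int, page_end: int) -> Path:
    """Path of one Markdown block, e.g. md/{id}/pages_0001-0005.md."""
    doc = validate_document_id(document_id)
    return root / "md" / doc / f"pages_{page_start:04d}-{page_end:04d}.md"
```

For example, `md_block_path(Path("/storage"), "abc", 1, 5)` gives `/storage/md/abc/pages_0001-0005.md`.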

Auto-generated from codebase by scripts/generate_app_logic.py. Regenerate after code changes: python scripts/generate_app_logic.py