ipl-doc — App Logic Verification

Pipeline Architecture


flowchart TB
    Start([PDF / URL Input]) --> P0[Phase 0: Source Registry]
    P0 --> DB0[(documents + sources/)]
    DB0 --> P1[Phase 1: PDF to Markdown]
    P1 --> V1{Coverage >= 0.5?}
    V1 -->|pass| DB1[(pages + md/)]
    V1 -->|low| Retry[Retry with stronger model]
    Retry --> DB1
    DB1 --> RT{Round-trip sample 0.10}
    RT -->|ok + drift| DB_RT[(page_round_trip_checks)]
    DB_RT --> P2[Phase 2: Sectioning]
    P2 --> V2{Overlap + hash?}
    V2 -->|fail| Manual2[Manual review]
    V2 -->|pass| DB2[(sections)]
    DB2 --> P3[Phase 3: Extraction]
    P3 --> V3{Citation + confidence?}
    V3 -->|high conf| DB3a[(extractions: verified)]
    V3 -->|low conf| P4[Phase 4: Escalation]
    P4 --> V4{Model consensus?}
    V4 -->|agree| DB3b[(verified by consensus)]
    V4 -->|disagree| Manual4[Manual review]
    DB3a --> P5[Phase 5: Canonicalize]
    DB3b --> P5
    Manual4 -->|approved| P5
    P5 --> Prod[(vykony + omezeni + kombinace)]
    Prod --> P6[Phase 6: LLM Surface]
    P6 --> Surf[(nl_description + FTS index)]
    Prod --> SQLApps[SQL Applications]
    Surf --> LLMApps[LLM Applications]
    LLMLog[(llm_calls log)]
    LLMCache[(llm_cache)]
    P1 -.-> LLMLog
    P2 -.-> LLMLog
    P3 -.-> LLMLog
    P4 -.-> LLMLog
    P6 -.-> LLMLog
    LLMLog -.-> LLMCache
  

Rectangles = processing phases · Diamonds = validation gates · Cylinders = storage · Dotted = LLM audit logging
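
The Phase 1 coverage gate in the diagram can be illustrated with a minimal sketch. It assumes the ratio is computed as LLM characters over PDF characters (as noted in the pages table) and that a block with no extractable PDF text maps to the empty_page status. The function name and the zero-text handling are assumptions, not taken from the codebase.

```python
def coverage_status(md_chars: int, pdf_chars: int, threshold: float = 0.5) -> tuple[float, str]:
    """Return (coverage_ratio, status) for one extraction block.

    Hypothetical helper: status values mirror the pages.status column
    ("ok" | "low_coverage" | "empty_page").
    """
    if pdf_chars == 0:
        # Assumption: no extractable PDF text means the block is an empty page.
        return 0.0, "empty_page"
    ratio = md_chars / pdf_chars
    # The gate passes when coverage is at least the threshold (0.5 in the diagram).
    status = "ok" if ratio >= threshold else "low_coverage"
    return ratio, status
```

A low_coverage result would correspond to the "Retry with stronger model" branch in the flowchart.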

Database Schema

The PostgreSQL database ipl_doc has two layers: internal pipeline tables (the processing audit trail) and stable API tables (the consumer-facing contract).

erDiagram
    documents ||--o{ pages : "Phase 1 produces"
    pages ||--o| page_round_trip_checks : "Phase 1.4 verifies"
    documents ||--o{ sections : "Phase 2 segments"
    sections ||--o{ extractions : "Phase 3 extracts"
    extractions ||--o{ validations : "validated by"
    vykony ||--o{ vykony_omezeni : "has constraints"
    vykony ||--o{ vykony_kombinace : "has combination rules"
    documents {
        uuid id PK
        text title
        text source_url
        text pdf_path
        text pdf_hash "SHA-256 for dedup"
        int pdf_pages
        timestamp fetched_at
    }
    pages {
        uuid id PK
        uuid document_id FK
        int page_start
        int page_end
        text md_path "path to .md file"
        float coverage_ratio "LLM chars / PDF chars"
        text status "ok | low_coverage | empty_page"
    }
    page_round_trip_checks {
        uuid id PK
        uuid page_id FK
        float similarity "0.0 to 1.0"
        text status "ok | drift"
    }
    sections {
        uuid id PK
        uuid document_id FK
        text section_type
        text identifier
        text title
    }
    extractions {
        uuid id PK
        uuid section_id FK
        text schema_name
        jsonb json_data
        float overall_confidence
        bool citation_verified
    }
    validations {
        uuid id PK
        uuid extraction_id FK
        jsonb checks
        text final_status
    }
    vykony {
        text entity_uri PK "vykon:kod@date"
        text kod
        text nazev
        text nl_description
        tsvector searchable_text
        jsonb provenance
        tstzrange validity
    }
    vykony_omezeni {
        text entity_uri PK
        text vykon_uri FK
        text typ "freq | age | diag"
        jsonb details
    }
    vykony_kombinace {
        text entity_uri PK
        text vykon_a FK
        text vykon_b FK
        text pravidlo
    }
    llm_calls {
        uuid id PK
        text model
        text phase
        int tokens_input
        int tokens_output
        float cost_usd
        bool cache_hit
    }
    llm_cache {
        text cache_key PK "SHA-256 hash"
        jsonb response
        int hit_count
    }
  

Internal Tables (Pipeline Audit Trail)

These tables record every step of the pipeline. They are append-only audit logs. Downstream applications must not query these for answers — use the stable API tables instead.

| Table | Phase | What It Stores |
|---|---|---|
| documents | 0 | Registered PDFs with source URL, hash, page count. One row per unique PDF. |
| pages | 1 | Extraction blocks (5-page chunks). Links to Markdown file on disk. Stores coverage ratio and status. |
| page_round_trip_checks | 1.4 | Round-trip verification results. Similarity score and drift/ok status for sampled blocks. |
| sections | 2 | Logical sections identified within the document (paragraphs, tables, appendices). |
| extractions | 3 | Raw structured data extracted by LLM. JSON matching Pydantic schemas, with confidence scores. |
| validations | 3-4 | Validation and escalation audit. Records which checks passed/failed and escalation outcomes. |
| llm_calls | all | Every LLM invocation: model, tokens, cost, duration, cache hit flag. Full audit trail. |
| llm_cache | all | Cached LLM responses keyed by SHA-256(model+prompt). Avoids re-paying for identical requests. |
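
The similarity score and ok/drift status stored in page_round_trip_checks could be derived along the following lines. This is only a sketch: the document does not say how similarity is measured or where the drift cut-off lies, so difflib.SequenceMatcher and the 0.9 threshold are both assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def round_trip_status(
    original_text: str,
    round_trip_text: str,
    drift_threshold: float = 0.9,  # hypothetical cut-off, not from the source
) -> tuple[float, str]:
    """Compare two renderings of the same block and classify the result.

    Returns (similarity, status) where similarity is in the range 0.0 to 1.0
    and status is "ok" or "drift", matching the table's columns.
    """
    similarity = SequenceMatcher(None, original_text, round_trip_text).ratio()
    status = "ok" if similarity >= drift_threshold else "drift"
    return similarity, status
```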

Stable API Tables (Consumer Contract)

These are the guaranteed interface for downstream applications. Breaking changes require an architecture review. Every row has an entity_uri for stable referencing and a provenance JSONB for full traceability.

| Table/View | Type | What It Contains |
|---|---|---|
| vykony | table | Healthcare procedures. Each row is one procedure at one point in time (temporal versioning via validity range). |
| vykony_omezeni | table | Constraints on procedures: frequency limits, age restrictions, required diagnoses, specialty requirements. |
| vykony_kombinace | table | Rules governing which procedures can/cannot be billed together. |
| vykon_full | matview | Denormalized join of all three tables above. The primary query target for applications. |

Entity URI Schema

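Every entity has a stable, deterministic identifier of the form {entity_type}:{kod}@{valid_from_iso}. For constraints and combination rules the middle part carries extra detail, as the examples show: a subtype prefix (freq:09543) or two codes joined by + (09543+09544).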

| URI Example | What It Identifies |
|---|---|
| vykon:09543@2025-01-01 | Procedure 09543, effective from 1 Jan 2025 |
| omezeni:freq:09543@2025-01-01 | Frequency constraint for procedure 09543 |
| kombinace:09543+09544@2025-01-01 | Combination rule between procedures 09543 and 09544 |

Properties: Idempotent (re-extraction produces same URI), human-readable (LLMs can cite them), deterministic (no UUIDs), temporally versioned (same code at different dates = different entities).
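
To make the URI scheme concrete, here is a small sketch that builds and parses identifiers in the documented shape. The function names are hypothetical. The parser treats everything between the first colon and the last @ as the key, which is an assumption made so that the omezeni and kombinace examples parse as well.

```python
from datetime import date


def build_entity_uri(entity_type: str, key: str, valid_from: date) -> str:
    """Build a URI such as vykon:09543@2025-01-01 (deterministic, no UUIDs)."""
    return f"{entity_type}:{key}@{valid_from.isoformat()}"


def parse_entity_uri(uri: str) -> tuple[str, str, date]:
    """Split a URI into (entity_type, key, valid_from)."""
    head, _, iso_date = uri.rpartition("@")
    entity_type, _, key = head.partition(":")
    if not entity_type or not key or not iso_date:
        raise ValueError(f"not a valid entity URI: {uri!r}")
    return entity_type, key, date.fromisoformat(iso_date)
```

Because the URI is a pure function of type, key, and effective date, re-running extraction on the same source yields the same identifier.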

API & Data Access

This section describes how downstream applications query the extracted data. The citation contract is mandatory: every answer must trace back to a source PDF sentence.

flowchart LR
    subgraph Pipeline Output
        VF[(vykon_full matview)]
        V[(vykony)]
        VO[(vykony_omezeni)]
        VK[(vykony_kombinace)]
    end
    subgraph Query Methods
        FTS[Full-Text Search\nCzech tsvector]
        Lookup[Code Lookup\nkod + validity range]
        Fuzzy[Fuzzy Search\npg_trgm trigrams]
        Embed[Vector Search\npgvector - planned]
    end
    subgraph Applications
        SQL[SQL Apps\nDeterministic queries]
        LLM[LLM Apps\nFTS retrieval + citations]
    end
    VF --> FTS
    VF --> Lookup
    VF --> Fuzzy
    VF -.-> Embed
    FTS --> SQL
    FTS --> LLM
    Lookup --> SQL
    Lookup --> LLM
    Fuzzy --> LLM
  

Query Patterns

| Pattern | When to Use | How It Works |
|---|---|---|
| Code Lookup | Know the exact procedure code | Query vykon_full by kod + validity @> now() for current version, or validity @> $date for a specific date |
| Full-Text Search | Search by description or name | Use websearch_to_tsquery('czech', $query) against searchable_text. Weighted: code (A) > name (B) > description (C) |
| Fuzzy Search | Handle typos and partial matches | Use pg_trgm similarity() function with % operator on nazev |
| Vector Search | Semantic similarity (future) | embedding vector(1024) column prepared but empty. Applications must fall back to FTS when NULL. |
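
The first three patterns might translate into SQL roughly as follows. The sketch keeps the statements as Python constants with positional $1/$2 placeholders in the asyncpg style. It assumes vykon_full exposes the kod, nazev, validity, and searchable_text columns of vykony, and that a text search configuration named czech is installed, as the table implies. The constant and function names are illustrative only.

```python
from datetime import datetime
from typing import Optional

# Code lookup: current version, or the version valid at a given instant.
CODE_LOOKUP_CURRENT = "SELECT * FROM vykon_full WHERE kod = $1 AND validity @> now()"
CODE_LOOKUP_AT = "SELECT * FROM vykon_full WHERE kod = $1 AND validity @> $2::timestamptz"

# Full-text search ranked by ts_rank over the weighted tsvector.
FTS_QUERY = (
    "SELECT *, ts_rank(searchable_text, websearch_to_tsquery('czech', $1)) AS rank "
    "FROM vykon_full "
    "WHERE searchable_text @@ websearch_to_tsquery('czech', $1) "
    "ORDER BY rank DESC LIMIT $2"
)

# Fuzzy search: pg_trgm's % operator filters, similarity() orders.
FUZZY_QUERY = (
    "SELECT *, similarity(nazev, $1) AS score "
    "FROM vykon_full WHERE nazev % $1 "
    "ORDER BY score DESC LIMIT $2"
)


def code_lookup(kod: str, at: Optional[datetime] = None) -> tuple[str, list]:
    """Pick the statement and argument list for a code lookup."""
    if at is None:
        return CODE_LOOKUP_CURRENT, [kod]
    return CODE_LOOKUP_AT, [kod, at]
```

With asyncpg, the returned pair could be passed on as `await conn.fetch(sql, *args)`.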

REST API v1

Read-only JSON API at /api/v1 serving pipeline data for ipl-app consumption. Built with FastAPI + asyncpg. CORS enabled for all origins. OpenAPI 3.1 spec auto-generated.

flowchart LR
    subgraph REST API
        direction TB
        EP1["/procedures"]
        EP2["/specializations"]
        EP3["/combinations"]
        EP4["/code-lists"]
        EP5["/nursing-days"]
        EP6["/rules"]
        EP7["/changes"]
        EP8["/day-care"]
    end
    subgraph Database
        V[(vykony)]
        VO[(vykony_omezeni)]
        VK[(vykony_kombinace)]
        D[(documents)]
        EX[(extractions)]
    end
    subgraph Clients
        App[ipl-app]
        Swagger[Swagger UI]
    end
    EP1 --> V
    EP1 --> VO
    EP1 --> VK
    EP2 --> V
    EP3 --> VK
    EP4 --> D
    EP5 --> EX
    EP6 --> EX
    EP7 --> EX
    EP8 --> EX
    App --> EP1
    App --> EP2
    Swagger --> EP1
  

Endpoints

| Method | Path | Description | Query Parameters |
|---|---|---|---|
| GET | /api/v1/health | Liveness check + DB status | |
| GET | /api/v1/procedures | List procedures | q (FTS), specialization, limit, offset |
| GET | /api/v1/procedures/{code} | Procedure detail with restrictions + combinations | |
| GET | /api/v1/specializations | Distinct specializations with procedure counts | limit, offset |
| GET | /api/v1/specializations/{code} | Procedures for one specialization | |
| GET | /api/v1/combinations | Procedure combination rules | procedure_code, type, limit, offset |
| GET | /api/v1/code-lists | Ingested documents as code lists | limit, offset |
| GET | /api/v1/code-lists/{name} | Code list detail with page extraction status | |
| GET | /api/v1/nursing-days | Nursing day entries | limit, offset |
| GET | /api/v1/rules | Billing rules | limit, offset |
| GET | /api/v1/changes | Change log across legislation versions | limit, offset |
| GET | /api/v1/day-care | Day-care procedure entries | limit, offset |
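
As an illustration of how a client might address the list endpoint for procedures, the sketch below assembles a request URL from the documented query parameters. The helper name is hypothetical, and the base URL is passed in rather than hard-coded.

```python
from typing import Optional
from urllib.parse import urlencode


def procedures_url(
    base_url: str,
    q: Optional[str] = None,
    specialization: Optional[str] = None,
    limit: int = 50,
    offset: int = 0,
) -> str:
    """Build a GET URL for /procedures with the documented query parameters."""
    params: dict = {"limit": limit, "offset": offset}
    if q:
        params["q"] = q  # full-text search term
    if specialization:
        params["specialization"] = specialization
    return f"{base_url.rstrip('/')}/procedures?{urlencode(params)}"
```

For example, `procedures_url("http://localhost:8000/api/v1", q="ekg")` targets the local environment with a full-text filter.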

Pagination

All list endpoints accept limit (1–500, default 50) and offset (default 0). Response shape: {"items": [...], "total": N, "limit": 50, "offset": 0}
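
The pagination contract can be sketched in a few lines. This is an in-memory illustration of the limits and the response shape, not the API's actual implementation, which presumably pages in SQL. The function name is an assumption.

```python
def paginate(rows: list, limit: int = 50, offset: int = 0) -> dict:
    """Return one page in the documented response shape."""
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    if offset < 0:
        raise ValueError("offset must be >= 0")
    return {
        "items": rows[offset:offset + limit],
        "total": len(rows),  # total matches, independent of the page window
        "limit": limit,
        "offset": offset,
    }
```

A client can keep requesting pages, increasing offset by limit, until offset reaches total.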

Provenance

Entity payloads include a provenance object: {"document_id", "section_id", "extraction_id", "source_url"} — linking each data point back to its source PDF.

Production

| Environment | URL |
|---|---|
| Local | http://localhost:8000/api/v1 |
| Production | https://ipl-api.tipelt.cz/api/v1 |

Citation Contract

Every answer produced by an LLM application consuming this data MUST cite entity_uri references. An answer without citations is treated as a hallucination.

flowchart LR
    URI[entity_uri] --> Prov[provenance JSONB] --> Ext[extraction_id] --> Sec[section_id] --> Doc[document_id] --> PDF[Source PDF sentence]
  

This chain allows any data point to be traced back to the exact sentence in the original PDF where it was stated. CLI verification: pipeline trace <table>.<column> --entity-uri=<uri>
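
A consuming application could enforce the citation contract with a check like the one below. It is a sketch under assumptions: the regular expression is inferred from the three URI examples in this document, and the extra requirement that every cited URI be known to the caller is an illustrative addition rather than a documented rule.

```python
import re

# Pattern inferred from the documented examples (vykon, omezeni, kombinace).
ENTITY_URI_RE = re.compile(
    r"\b(?:vykon|omezeni|kombinace):[0-9A-Za-z:+]+@\d{4}-\d{2}-\d{2}\b"
)


def extract_citations(answer: str) -> list[str]:
    """Return every entity_uri mentioned in an LLM answer, in order."""
    return ENTITY_URI_RE.findall(answer)


def is_properly_cited(answer: str, known_uris: set[str]) -> bool:
    """True when the answer cites at least one URI and all cited URIs are known."""
    cited = extract_citations(answer)
    return bool(cited) and all(uri in known_uris for uri in cited)
```

An answer that fails this check would be treated as a hallucination under the contract.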

LLM Integration

All LLM interactions go through the centralized LLMClient wrapper. No pipeline module calls the Anthropic SDK directly.

Supported Models

Call Flow

flowchart LR
    Pipeline[Pipeline Phase] --> Client[LLMClient]
    Client --> CacheCheck{Cache hit?}
    CacheCheck -->|yes| Cached[Return cached response]
    CacheCheck -->|no| RateLimit[Rate Limiter]
    RateLimit --> SDK[Anthropic SDK / Claude CLI]
    SDK --> Log[Log to llm_calls table]
    SDK --> CacheStore[Store in llm_cache]
    Log --> Response[Return response]
    CacheStore --> Response
  

Caching

Text-based calls are cached by SHA-256 hash of (model + system_prompt + user_input + optional_schema). Cache is stored in the llm_cache PostgreSQL table with hit counting. PDF-based calls are not cached (binary content too large for key hashing).
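
One way to derive such a cache key is shown below, together with a lookup wrapper that counts hits. The source does not specify how the four fields are combined before hashing, so the canonical JSON serialization is an assumption. The in-memory dict stands in for the llm_cache table, and both function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Optional


def cache_key(model: str, system_prompt: str, user_input: str,
              schema: Optional[dict] = None) -> str:
    """SHA-256 over the key fields, serialized as canonical JSON (assumed format)."""
    payload = json.dumps(
        {"model": model, "system_prompt": system_prompt,
         "user_input": user_input, "schema": schema},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(cache: dict, key: str, call: Callable[[], Any]) -> tuple[Any, bool]:
    """Return (response, cache_hit); only invoke the model on a miss."""
    if key in cache:
        cache[key]["hit_count"] += 1
        return cache[key]["response"], True
    response = call()
    cache[key] = {"response": response, "hit_count": 0}
    return response, False
```

Serializing with sorted keys keeps the hash stable regardless of the order in which schema fields were defined.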

Rate Limiting

Audit Logging

Every LLM call is logged to the llm_calls table with: model, phase, prompt version, input/output token counts, cost in USD, duration in ms, cache hit flag, and the full response. This provides a complete audit trail and cost tracking for all AI operations.

Document Discovery

Automated crawler for Czech healthcare legislation sources. Finds new PDFs, validates they are final (not draft) documents from official sources, and ingests confirmed candidates.

Configured Sources

4-Layer Validation

flowchart TB
    PDF[Downloaded PDF] --> L1{{Layer 1: Source Authority}}
    L1 -->|official domain| L2{{Layer 2: Draft Detection}}
    L1 -->|unofficial| REJECT[REJECT]
    L2 -->|FINAL| L3{{Layer 3: Cross-Reference}}
    L2 -->|DRAFT| REJECT
    L3 --> L4{{Layer 4: Metadata Consistency}}
    L3 -->|INCONCLUSIVE| HOLD[HOLD for review]
    L4 -->|gazette + date ok| INGEST[INGEST]
    L4 -->|issues| HOLD
  
| Layer | What It Checks | Method |
|---|---|---|
| 1. Source Authority | Is the domain on the official whitelist? (vzp.cz, mzd.gov.cz, sukl.cz, zakonyprolidi.cz, etc.) | Domain matching against hardcoded list |
| 2. Draft Detection | Is this a final published document, not a draft? | LLM reads first 3 pages, checks for draft markers ("navrh", "pracovni verze", watermarks) |
| 3. Cross-Reference | Can the document be confirmed in the Law Gazette (Sbirka zakonu)? | Web search for title + gazette number, needs 2+ confirmation keywords |
| 4. Metadata | Does it have a concrete effective date and gazette number? | Regex extraction of dates and gazette references from PDF text |

Decision matrix: Layer 1 FAIL = reject (unofficial source). Layer 2 FAIL = reject (no drafts). Layer 4 FAIL = hold. Any INCONCLUSIVE = hold for human review. All 4 PASS = ingest.
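
The decision matrix can be expressed as a small function. This is a sketch rather than the crawler's code: the enum and function names are hypothetical, a Layer 3 FAIL is assumed to result in a hold because the source does not cover that case, and rejections are assumed to take precedence over holds.

```python
from enum import Enum


class LayerResult(Enum):
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"


def decide(l1: LayerResult, l2: LayerResult, l3: LayerResult, l4: LayerResult) -> str:
    """Map the four layer results to "reject", "hold", or "ingest"."""
    # Unofficial source or detected draft: reject outright.
    if l1 is LayerResult.FAIL or l2 is LayerResult.FAIL:
        return "reject"
    # Only a clean sweep is ingested automatically.
    if all(r is LayerResult.PASS for r in (l1, l2, l3, l4)):
        return "ingest"
    # Anything else (Layer 4 FAIL, any INCONCLUSIVE, and by assumption
    # Layer 3 FAIL) waits for human review.
    return "hold"
```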

Prompt Templates

All LLM prompts are stored as versioned text files. The loader automatically picks the highest version. Prompts are never inlined in code.
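
Selecting the highest version could work as in this sketch, which assumes the {name}_v{N}.txt naming visible in the prompt file list. The function name is hypothetical. Versions are compared numerically so that v10 outranks v2.

```python
import re
from typing import Iterable, Optional

# Assumed naming convention: <name>_v<integer>.txt
_VERSIONED = re.compile(r"^(?P<name>.+)_v(?P<version>\d+)\.txt$")


def latest_prompt_file(name: str, filenames: Iterable[str]) -> Optional[str]:
    """Return the filename with the highest version for the given prompt name."""
    best: Optional[tuple[int, str]] = None
    for filename in filenames:
        match = _VERSIONED.match(filename)
        if match and match.group("name") == name:
            version = int(match.group("version"))  # numeric, not lexicographic
            if best is None or version > best[0]:
                best = (version, filename)
    return best[1] if best else None
```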

| Name | Version | File | Size | Preview |
|---|---|---|---|---|
| draft_check | v1 | draft_check_v1.txt | 834 chars | You are reviewing a Czech legislative or regulatory document to determine whether it is a finalized (platné/účinné) version or a draft (návrh/pracovní verze). Examine the document text below. Look fo... |
| nl_description_vykon | v1 | nl_description_vykon_v1.txt | 379 chars | You are a Czech healthcare domain expert. Given structured data about a medical procedure (výkon), write a concise natural-language description in Czech. The description should: - Be 1-3 sentences lo... |
| pdf_to_md | v1 | pdf_to_md_v1.txt | 435 chars | You are a document conversion specialist. Convert the following Czech healthcare legislation PDF page to clean Markdown. Rules: - Preserve all section numbers, paragraph numbers, and legal references... |
| sectioning | v1 | sectioning_v1.txt | 575 chars | You are a Czech healthcare legislation analyst. Given a Markdown document converted from a PDF, identify and output the logical sections. For each section output a JSON object with: - "section_id": t... |

Database Migrations

Schema evolution managed by Alembic. Each migration is idempotent and versioned.

| File | Description | Tables Created |
|---|---|---|
| 001_initial_schema.py | Initial schema: extensions, all pipeline and production tables. | alter only |
| 002_round_trip_checks.py | Round-trip sample check audit table for M1.4. | alter only |
| 003_document_status.py | Drop document_status column; drafts detected by LLM instead. | alter only |

CLI Commands

All pipeline operations are accessible via the pipeline CLI. Commands require DATABASE_URL; extraction phases also need ANTHROPIC_API_KEY.

| Command | Description |
|---|---|
| pipeline ingest | Ingest a PDF from a URL or local file path. |
| pipeline run | Run the extraction pipeline for a document. |
| pipeline status | Show pipeline status for documents. |
| pipeline trace | Trace data lineage through the pipeline. |
| pipeline review | Review extraction results; opens quality viewer in browser. |
| pipeline discover | Discover documents from Czech healthcare legislation sources. |
| pipeline eval | Evaluate pipeline accuracy against golden datasets. |

Storage Layout

All file artifacts are managed by storage/paths.py under a configurable root directory. Markdown is the single source of truth (SSOT) — stored on disk, referenced by path in the database.

flowchart TB
    Root["STORAGE_ROOT (default: /storage)"] --> Sources["sources/{document_id}/"]
    Root --> MD["md/{document_id}/"]
    Root --> Logs["logs/{document_id}/"]
    Sources --> PDF["original.pdf"]
    MD --> Block1["pages_0001-0005.md"]
    MD --> Block2["pages_0006-0010.md"]
    MD --> BlockN["..."]
  

Path validation: Document IDs with /, \, or .. are rejected to prevent path traversal.
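
The validation rule and the block naming shown in the diagram can be sketched together. The function names are hypothetical and not taken from storage/paths.py. Rejecting an empty ID is an extra assumption, and the four-digit zero padding is inferred from the pages_0001-0005.md example.

```python
from pathlib import PurePosixPath


def validate_document_id(document_id: str) -> str:
    """Reject IDs that could escape the storage root."""
    if not document_id or any(bad in document_id for bad in ("/", "\\", "..")):
        raise ValueError(f"unsafe document id: {document_id!r}")
    return document_id


def block_md_path(root: str, document_id: str, page_start: int, page_end: int) -> PurePosixPath:
    """Path of one Markdown block, e.g. <root>/md/<id>/pages_0001-0005.md."""
    doc = validate_document_id(document_id)
    return PurePosixPath(root) / "md" / doc / f"pages_{page_start:04d}-{page_end:04d}.md"
```

Validating before joining means a crafted ID such as ../etc never reaches the filesystem layer.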

Auto-generated from codebase by scripts/generate_app_logic.py. Regenerate after code changes: python scripts/generate_app_logic.py