flowchart TB
Start([PDF / URL Input]) --> P0[Phase 0: Source Registry]
P0 --> DB0[(documents + sources/)]
DB0 --> P1[Phase 1: PDF to Markdown]
P1 --> V1{Coverage >= 0.5?}
V1 -->|pass| DB1[(pages + md/)]
V1 -->|low| Retry[Retry with stronger model]
Retry --> DB1
DB1 --> RT{Round-trip sample 0.10}
RT -->|ok + drift| DB_RT[(page_round_trip_checks)]
DB_RT --> P2[Phase 2: Sectioning]
P2 --> V2{Overlap + hash?}
V2 -->|fail| Manual2[Manual review]
V2 -->|pass| DB2[(sections)]
DB2 --> P3[Phase 3: Extraction]
P3 --> V3{Citation + confidence?}
V3 -->|high conf| DB3a[(extractions: verified)]
V3 -->|low conf| P4[Phase 4: Escalation]
P4 --> V4{Model consensus?}
V4 -->|agree| DB3b[(verified by consensus)]
V4 -->|disagree| Manual4[Manual review]
DB3a --> P5[Phase 5: Canonicalize]
DB3b --> P5
Manual4 -->|approved| P5
P5 --> Prod[(vykony + omezeni + kombinace)]
Prod --> P6[Phase 6: LLM Surface]
P6 --> Surf[(nl_description + FTS index)]
Prod --> SQLApps[SQL Applications]
Surf --> LLMApps[LLM Applications]
LLMLog[(llm_calls log)]
LLMCache[(llm_cache)]
P1 -.-> LLMLog
P2 -.-> LLMLog
P3 -.-> LLMLog
P4 -.-> LLMLog
P6 -.-> LLMLog
LLMLog -.-> LLMCache
Rectangles = processing phases · Diamonds = validation gates · Cylinders = storage · Dotted = LLM audit logging
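The Phase 1 coverage gate can be sketched as follows. This is a minimal illustration, assuming the ratio is computed as Markdown characters over PDF text-layer characters (as the `coverage_ratio` column comment suggests) and that a block with no PDF text is what gets marked `empty_page`. The function names are hypothetical, not taken from the pipeline code.

```python
COVERAGE_THRESHOLD = 0.5  # gate value shown in the diagram


def coverage_ratio(llm_chars: int, pdf_chars: int) -> float:
    """Characters in the LLM Markdown divided by characters in the PDF text."""
    if pdf_chars <= 0:
        return 0.0
    return llm_chars / pdf_chars


def page_status(llm_chars: int, pdf_chars: int,
                threshold: float = COVERAGE_THRESHOLD) -> str:
    """Return one of the documented statuses: ok, low_coverage, empty_page."""
    if pdf_chars == 0:
        # Assumption: a block without any PDF text is treated as empty.
        return "empty_page"
    if coverage_ratio(llm_chars, pdf_chars) >= threshold:
        return "ok"
    # A low_coverage block is the one retried with a stronger model.
    return "low_coverage"
```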
PostgreSQL database ipl_doc with two layers: internal pipeline tables (processing audit trail)
and stable API tables (consumer-facing contract).
erDiagram
documents ||--o{ pages : "Phase 1 produces"
pages ||--o| page_round_trip_checks : "Phase 1.4 verifies"
documents ||--o{ sections : "Phase 2 segments"
sections ||--o{ extractions : "Phase 3 extracts"
extractions ||--o{ validations : "validated by"
vykony ||--o{ vykony_omezeni : "has constraints"
vykony ||--o{ vykony_kombinace : "has combination rules"
documents {
uuid id PK
text title
text source_url
text pdf_path
text pdf_hash "SHA-256 for dedup"
int pdf_pages
timestamp fetched_at
}
pages {
uuid id PK
uuid document_id FK
int page_start
int page_end
text md_path "path to .md file"
float coverage_ratio "LLM chars / PDF chars"
text status "ok | low_coverage | empty_page"
}
page_round_trip_checks {
uuid id PK
uuid page_id FK
float similarity "0.0 to 1.0"
text status "ok | drift"
}
sections {
uuid id PK
uuid document_id FK
text section_type
text identifier
text title
}
extractions {
uuid id PK
uuid section_id FK
text schema_name
jsonb json_data
float overall_confidence
bool citation_verified
}
validations {
uuid id PK
uuid extraction_id FK
jsonb checks
text final_status
}
vykony {
text entity_uri PK "vykon:kod@date"
text kod
text nazev
text nl_description
tsvector searchable_text
jsonb provenance
tstzrange validity
}
vykony_omezeni {
text entity_uri PK
text vykon_uri FK
text typ "freq | age | diag"
jsonb details
}
vykony_kombinace {
text entity_uri PK
text vykon_a FK
text vykon_b FK
text pravidlo
}
llm_calls {
uuid id PK
text model
text phase
int tokens_input
int tokens_output
float cost_usd
bool cache_hit
}
llm_cache {
text cache_key PK "SHA-256 hash"
jsonb response
int hit_count
}
These tables record every step of the pipeline. They are append-only audit logs. Downstream applications must not query these for answers — use the stable API tables instead.
| Table | Phase | What It Stores |
|---|---|---|
| documents | 0 | Registered PDFs with source URL, hash, page count. One row per unique PDF. |
| pages | 1 | Extraction blocks (5-page chunks). Links to Markdown file on disk. Stores coverage ratio and status. |
| page_round_trip_checks | 1.4 | Round-trip verification results. Similarity score and drift/ok status for sampled blocks. |
| sections | 2 | Logical sections identified within the document (paragraphs, tables, appendices). |
| extractions | 3 | Raw structured data extracted by LLM. JSON matching Pydantic schemas, with confidence scores. |
| validations | 3-4 | Validation and escalation audit. Records which checks passed/failed and escalation outcomes. |
| llm_calls | all | Every LLM invocation: model, tokens, cost, duration, cache hit flag. Full audit trail. |
| llm_cache | all | Cached LLM responses keyed by SHA-256(model+prompt). Avoids re-paying for identical requests. |
These are the guaranteed interface for downstream applications. Breaking changes require an architecture review.
Every row has an entity_uri for stable referencing and a provenance JSONB for full traceability.
| Table/View | Type | What It Contains |
|---|---|---|
| vykony | table | Healthcare procedures. Each row is one procedure at one point in time (temporal versioning via validity range). |
| vykony_omezeni | table | Constraints on procedures: frequency limits, age restrictions, required diagnoses, specialty requirements. |
| vykony_kombinace | table | Rules governing which procedures can/cannot be billed together. |
| vykon_full | matview | Denormalized join of all three tables above. The primary query target for applications. |
Every entity has a stable, deterministic identifier: {entity_type}:{kod}@{valid_from_iso}
| URI Example | What It Identifies |
|---|---|
| vykon:09543@2025-01-01 | Procedure 09543, effective from 1 Jan 2025 |
| omezeni:freq:09543@2025-01-01 | Frequency constraint for procedure 09543 |
| kombinace:09543+09544@2025-01-01 | Combination rule between procedures 09543 and 09544 |
Properties: Idempotent (re-extraction produces same URI), human-readable (LLMs can cite them), deterministic (no UUIDs), temporally versioned (same code at different dates = different entities).
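A minimal sketch of how such identifiers could be built, assuming the key part is assembled the way the three examples suggest. The helper names are hypothetical, and sorting the two codes of a combination is an assumption made here so that both argument orders yield the same URI.

```python
from datetime import date


def build_entity_uri(entity_type: str, key: str, valid_from: date) -> str:
    """Format {entity_type}:{key}@{valid_from_iso}; same inputs, same URI."""
    return f"{entity_type}:{key}@{valid_from.isoformat()}"


def combination_key(kod_a: str, kod_b: str) -> str:
    """Join two procedure codes with '+'.

    Assumption: codes are sorted so (a, b) and (b, a) map to one entity.
    """
    return "+".join(sorted((kod_a, kod_b)))


effective = date(2025, 1, 1)
procedure_uri = build_entity_uri("vykon", "09543", effective)
constraint_uri = build_entity_uri("omezeni", "freq:09543", effective)
combination_uri = build_entity_uri(
    "kombinace", combination_key("09544", "09543"), effective
)
```

Because the URI depends only on the entity type, code, and effective date, re-running extraction on the same document reproduces it without any lookup.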
How downstream applications query the extracted data. The citation contract is mandatory — every answer must trace back to a source PDF sentence.
flowchart LR
subgraph Pipeline Output
VF[(vykon_full matview)]
V[(vykony)]
VO[(vykony_omezeni)]
VK[(vykony_kombinace)]
end
subgraph Query Methods
FTS[Full-Text Search\nCzech tsvector]
Lookup[Code Lookup\nkod + validity range]
Fuzzy[Fuzzy Search\npg_trgm trigrams]
Embed[Vector Search\npgvector - planned]
end
subgraph Applications
SQL[SQL Apps\nDeterministic queries]
LLM[LLM Apps\nFTS retrieval + citations]
end
VF --> FTS
VF --> Lookup
VF --> Fuzzy
VF -.-> Embed
FTS --> SQL
FTS --> LLM
Lookup --> SQL
Lookup --> LLM
Fuzzy --> LLM
| Pattern | When to Use | How It Works |
|---|---|---|
| Code Lookup | Know the exact procedure code | Query vykon_full by kod + validity @> now() for current version, or validity @> $date for a specific date |
| Full-Text Search | Search by description or name | Use websearch_to_tsquery('czech', $query) against searchable_text. Weighted: code (A) > name (B) > description (C) |
| Fuzzy Search | Handle typos and partial matches | Use pg_trgm similarity() function with % operator on nazev |
| Vector Search | Semantic similarity (future) | embedding vector(1024) column prepared but empty. Applications must fall back to FTS when NULL. |
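The first three patterns might translate into parameterized queries like the ones below. This is a sketch only: it assumes `vykon_full` exposes the `kod`, `nazev`, `validity`, and `searchable_text` columns of `vykony`, uses asyncpg-style `$n` placeholders, and the function names are hypothetical.

```python
from datetime import datetime
from typing import Optional


def code_lookup_query(kod: str, at: Optional[datetime] = None) -> tuple:
    """Current version by default, or the version valid at a given instant."""
    if at is None:
        sql = "SELECT * FROM vykon_full WHERE kod = $1 AND validity @> now()"
        return sql, [kod]
    sql = ("SELECT * FROM vykon_full "
           "WHERE kod = $1 AND validity @> $2::timestamptz")
    return sql, [kod, at]


def full_text_query(query: str, limit: int = 50) -> tuple:
    """Czech full-text search against searchable_text, best matches first."""
    sql = ("SELECT *, ts_rank(searchable_text, "
           "websearch_to_tsquery('czech', $1)) AS rank "
           "FROM vykon_full "
           "WHERE searchable_text @@ websearch_to_tsquery('czech', $1) "
           "ORDER BY rank DESC LIMIT $2")
    return sql, [query, limit]


def fuzzy_query(name: str, limit: int = 50) -> tuple:
    """Trigram match on nazev using the pg_trgm % operator and similarity()."""
    sql = ("SELECT *, similarity(nazev, $1) AS score "
           "FROM vykon_full WHERE nazev % $1 "
           "ORDER BY score DESC LIMIT $2")
    return sql, [name, limit]
```

With asyncpg, each pair would be executed as `await conn.fetch(sql, *args)`.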
Read-only JSON API at /api/v1 serving pipeline data for ipl-app consumption.
Built with FastAPI + asyncpg. CORS enabled for all origins. OpenAPI 3.1 spec auto-generated.
flowchart LR
subgraph REST API
direction TB
EP1["/procedures"]
EP2["/specializations"]
EP3["/combinations"]
EP4["/code-lists"]
EP5["/nursing-days"]
EP6["/rules"]
EP7["/changes"]
EP8["/day-care"]
end
subgraph Database
V[(vykony)]
VO[(vykony_omezeni)]
VK[(vykony_kombinace)]
D[(documents)]
EX[(extractions)]
end
subgraph Clients
App[ipl-app]
Swagger[Swagger UI]
end
EP1 --> V
EP1 --> VO
EP1 --> VK
EP2 --> V
EP3 --> VK
EP4 --> D
EP5 --> EX
EP6 --> EX
EP7 --> EX
EP8 --> EX
App --> EP1
App --> EP2
Swagger --> EP1
| Method | Path | Description | Query Parameters |
|---|---|---|---|
| GET | /api/v1/health | Liveness check + DB status | — |
| GET | /api/v1/procedures | List procedures | q (FTS), specialization, limit, offset |
| GET | /api/v1/procedures/{code} | Procedure detail with restrictions + combinations | — |
| GET | /api/v1/specializations | Distinct specializations with procedure counts | limit, offset |
| GET | /api/v1/specializations/{code} | Procedures for one specialization | — |
| GET | /api/v1/combinations | Procedure combination rules | procedure_code, type, limit, offset |
| GET | /api/v1/code-lists | Ingested documents as code lists | limit, offset |
| GET | /api/v1/code-lists/{name} | Code list detail with page extraction status | — |
| GET | /api/v1/nursing-days | Nursing day entries | limit, offset |
| GET | /api/v1/rules | Billing rules | limit, offset |
| GET | /api/v1/changes | Change log across legislation versions | limit, offset |
| GET | /api/v1/day-care | Day-care procedure entries | limit, offset |
All list endpoints accept limit (1–500, default 50) and offset (default 0).
Response shape: {"items": [...], "total": N, "limit": 50, "offset": 0}
Entity payloads include a provenance object: {"document_id", "section_id", "extraction_id", "source_url"}
— linking each data point back to its source PDF.
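The pagination contract can be illustrated with a small in-memory sketch. The real service presumably applies LIMIT and OFFSET in SQL; the `paginate` helper below is hypothetical and only demonstrates the documented bounds and response shape.

```python
from typing import Any, Sequence

DEFAULT_LIMIT = 50
MAX_LIMIT = 500


def paginate(rows: Sequence[Any], limit: int = DEFAULT_LIMIT,
             offset: int = 0) -> dict:
    """Return the list-endpoint envelope: items, total, limit, offset."""
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    if offset < 0:
        raise ValueError("offset must be >= 0")
    return {
        "items": list(rows[offset:offset + limit]),
        "total": len(rows),  # total matches, not just the current page
        "limit": limit,
        "offset": offset,
    }


page = paginate(list(range(120)), limit=50, offset=100)
```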
| Environment | URL |
|---|---|
| Local | http://localhost:8000/api/v1 |
| Production | https://ipl-api.tipelt.cz/api/v1 |
Every answer produced by an LLM application consuming this data MUST cite entity_uri references.
An answer without citations is treated as a hallucination.
flowchart LR
URI[entity_uri] --> Prov[provenance JSONB] --> Ext[extraction_id] --> Sec[section_id] --> Doc[document_id] --> PDF[Source PDF sentence]
This chain allows any data point to be traced back to the exact sentence in the original PDF where it was stated.
CLI verification: pipeline trace <table>.<column> --entity-uri=<uri>
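An application could enforce the citation contract with a check along these lines. The regular expression is an assumption derived from the three URI examples in this document, and the function names are hypothetical.

```python
import re

# Assumption: URIs look like the documented examples, e.g.
# vykon:09543@2025-01-01 or kombinace:09543+09544@2025-01-01.
ENTITY_URI = re.compile(
    r"\b(?:vykon|omezeni|kombinace):[\w:+]+@\d{4}-\d{2}-\d{2}\b"
)


def extract_citations(answer: str) -> list:
    """Return every entity_uri reference found in an LLM answer."""
    return ENTITY_URI.findall(answer)


def is_cited(answer: str) -> bool:
    """An answer with no entity_uri reference fails the contract."""
    return bool(extract_citations(answer))


answer = ("Procedure vykon:09543@2025-01-01 is limited in frequency "
          "(omezeni:freq:09543@2025-01-01).")
citations = extract_citations(answer)
```

A stricter check would also confirm that each cited URI exists in the stable API tables before accepting the answer.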
All LLM interactions go through the centralized LLMClient wrapper.
No pipeline module calls the Anthropic SDK directly.
Models: claude-haiku-4-5-20251001 · claude-opus-4-6 · claude-sonnet-4-6
flowchart LR
Pipeline[Pipeline Phase] --> Client[LLMClient]
Client --> CacheCheck{Cache hit?}
CacheCheck -->|yes| Cached[Return cached response]
CacheCheck -->|no| RateLimit[Rate Limiter]
RateLimit --> SDK[Anthropic SDK / Claude CLI]
SDK --> Log[Log to llm_calls table]
SDK --> CacheStore[Store in llm_cache]
Log --> Response[Return response]
CacheStore --> Response
Text-based calls are cached by SHA-256 hash of (model + system_prompt + user_input + optional_schema).
Cache is stored in the llm_cache PostgreSQL table with hit counting.
PDF-based calls are not cached (binary content too large for key hashing).
Every LLM call is logged to the llm_calls table with: model, phase, prompt version,
input/output token counts, cost in USD, duration in ms, cache hit flag, and the full response.
This provides a complete audit trail and cost tracking for all AI operations.
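The cache-then-call flow for text requests can be sketched in memory as below. Several details are assumptions: the key serialization (JSON with sorted keys), the recording of cache hits in the call log, and the omission of the rate limiter. `CachedClient` is a hypothetical stand-in, not the actual LLMClient.

```python
import hashlib
import json
from typing import Callable, Optional


def cache_key(model: str, system_prompt: str, user_input: str,
              schema: Optional[dict] = None) -> str:
    """SHA-256 over model, system prompt, user input, and optional schema."""
    payload = json.dumps(
        {"model": model, "system_prompt": system_prompt,
         "user_input": user_input, "schema": schema},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedClient:
    """In-memory stand-in for the llm_cache and llm_calls tables."""

    def __init__(self, backend: Callable[[str, str, str], str]):
        self.backend = backend
        self.cache = {}   # cache_key -> {"response", "hit_count"}
        self.calls = []   # one entry per invocation

    def complete(self, model: str, system_prompt: str, user_input: str,
                 schema: Optional[dict] = None) -> str:
        key = cache_key(model, system_prompt, user_input, schema)
        entry = self.cache.get(key)
        if entry is not None:
            entry["hit_count"] += 1
            self.calls.append({"model": model, "cache_hit": True})
            return entry["response"]
        response = self.backend(model, system_prompt, user_input)
        self.cache[key] = {"response": response, "hit_count": 0}
        self.calls.append({"model": model, "cache_hit": False})
        return response
```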
Automated crawler for Czech healthcare legislation sources. Finds new PDFs, validates they are final (not draft) documents from official sources, and ingests confirmed candidates.
flowchart TB
PDF[Downloaded PDF] --> L1{{Layer 1: Source Authority}}
L1 -->|official domain| L2{{Layer 2: Draft Detection}}
L1 -->|unofficial| REJECT[REJECT]
L2 -->|FINAL| L3{{Layer 3: Cross-Reference}}
L2 -->|DRAFT| REJECT
L3 --> L4{{Layer 4: Metadata Consistency}}
L3 -->|INCONCLUSIVE| HOLD[HOLD for review]
L4 -->|gazette + date ok| INGEST[INGEST]
L4 -->|issues| HOLD
| Layer | What It Checks | Method |
|---|---|---|
| 1. Source Authority | Is the domain on the official whitelist? (vzp.cz, mzd.gov.cz, sukl.cz, zakonyprolidi.cz, etc.) | Domain matching against hardcoded list |
| 2. Draft Detection | Is this a final published document, not a draft? | LLM reads first 3 pages, checks for draft markers ("navrh", "pracovni verze", watermarks) |
| 3. Cross-Reference | Can the document be confirmed in the Law Gazette (Sbirka zakonu)? | Web search for title + gazette number, needs 2+ confirmation keywords |
| 4. Metadata Consistency | Does it have a concrete effective date and gazette number? | Regex extraction of dates and gazette references from PDF text |
Decision matrix: Layer 1 FAIL = reject (unofficial source). Layer 2 FAIL = reject (drafts are never ingested). Layer 4 FAIL = hold. Any INCONCLUSIVE = hold for human review. All 4 PASS = ingest.
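The decision matrix can be expressed as a small function. Two points are assumptions because the matrix does not state them: a Layer 3 FAIL is treated as a hold, and a reject from Layer 1 or 2 takes precedence over an INCONCLUSIVE result elsewhere. The enum and function names are hypothetical.

```python
from enum import Enum


class Result(Enum):
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"


class Decision(Enum):
    INGEST = "ingest"
    HOLD = "hold"
    REJECT = "reject"


def decide(source_authority: Result, draft_detection: Result,
           cross_reference: Result, metadata: Result) -> Decision:
    """Combine the four validation layers into one outcome."""
    # Layers 1 and 2 are hard gates: unofficial sources and drafts are rejected.
    if Result.FAIL in (source_authority, draft_detection):
        return Decision.REJECT
    layers = (source_authority, draft_detection, cross_reference, metadata)
    # Any remaining non-PASS result (INCONCLUSIVE, or FAIL in layers 3 and 4)
    # sends the document to human review.
    if any(result is not Result.PASS for result in layers):
        return Decision.HOLD
    return Decision.INGEST
```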
All LLM prompts are stored as versioned text files. The loader automatically picks the highest version. Prompts are never inlined in code.
| Name | Version | File | Size | Preview |
|---|---|---|---|---|
| draft_check | v1 | draft_check_v1.txt | 834 chars | You are reviewing a Czech legislative or regulatory document to determine whether it is a finalized (platné/účinné) version or a draft (návrh/pracovní verze). Examine the document text below. Look fo... |
| nl_description_vykon | v1 | nl_description_vykon_v1.txt | 379 chars | You are a Czech healthcare domain expert. Given structured data about a medical procedure (výkon), write a concise natural-language description in Czech. The description should: - Be 1-3 sentences lo... |
| pdf_to_md | v1 | pdf_to_md_v1.txt | 435 chars | You are a document conversion specialist. Convert the following Czech healthcare legislation PDF page to clean Markdown. Rules: - Preserve all section numbers, paragraph numbers, and legal references... |
| sectioning | v1 | sectioning_v1.txt | 575 chars | You are a Czech healthcare legislation analyst. Given a Markdown document converted from a PDF, identify and output the logical sections. For each section output a JSON object with: - "section_id": t... |
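Picking the highest prompt version could look like the sketch below, assuming files follow the `{name}_v{N}.txt` pattern seen in the table and live in a single directory. The function name and the directory argument are hypothetical.

```python
import re
from pathlib import Path

# Matches names such as draft_check_v1.txt or nl_description_vykon_v12.txt.
_VERSIONED = re.compile(r"^(?P<name>.+)_v(?P<version>\d+)\.txt$")


def latest_prompt_path(prompts_dir: Path, name: str) -> Path:
    """Return the file with the highest numeric version for a prompt name."""
    best = None
    for path in prompts_dir.iterdir():
        match = _VERSIONED.match(path.name)
        if match is None or match.group("name") != name:
            continue
        # Compare versions as integers so that v10 sorts above v2.
        version = int(match.group("version"))
        if best is None or version > best[0]:
            best = (version, path)
    if best is None:
        raise FileNotFoundError(f"no prompt file found for {name!r}")
    return best[1]
```

Comparing versions numerically rather than by filename avoids the lexicographic trap where `v10` would sort before `v2`.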
Schema evolution managed by Alembic. Each migration is idempotent and versioned.
| File | Description | Tables Created |
|---|---|---|
| 001_initial_schema.py | Initial schema: extensions, all pipeline and production tables. | alter only |
| 002_round_trip_checks.py | Round-trip sample check audit table for M1.4. | alter only |
| 003_document_status.py | Drop document_status column; drafts are detected by LLM instead. | alter only |
All pipeline operations are accessible via the pipeline CLI.
Commands require DATABASE_URL; extraction phases also need ANTHROPIC_API_KEY.
| Command | Description |
|---|---|
| pipeline ingest | Ingest a PDF from a URL or local file path. |
| pipeline run | Run the extraction pipeline for a document. |
| pipeline status | Show pipeline status for documents. |
| pipeline trace | Trace data lineage through the pipeline. |
| pipeline review | Review extraction results; opens the quality viewer in a browser. |
| pipeline discover | Discover documents from Czech healthcare legislation sources. |
| pipeline eval | Evaluate pipeline accuracy against golden datasets. |
All file artifacts are managed by storage/paths.py under a configurable root directory.
Markdown is the single source of truth (SSOT) — stored on disk, referenced by path in the database.
flowchart TB
Root["STORAGE_ROOT (default: /storage)"] --> Sources["sources/{document_id}/"]
Root --> MD["md/{document_id}/"]
Root --> Logs["logs/{document_id}/"]
Sources --> PDF["original.pdf"]
MD --> Block1["pages_0001-0005.md"]
MD --> Block2["pages_0006-0010.md"]
MD --> BlockN["..."]
sources/: Original downloaded PDFs, never modified after ingest.
md/: Extracted Markdown files, one per 5-page block. This is the cached extraction output; if these files exist, re-runs skip the LLM call.
logs/: Pipeline execution logs per document.
Path validation: Document IDs containing /, \, or .. are rejected to prevent path traversal.
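The layout and the path-traversal check could be implemented roughly as follows. This is a sketch, not the contents of storage/paths.py: the function names are hypothetical, and rejecting empty IDs is an added assumption.

```python
from pathlib import Path


def validate_document_id(document_id: str) -> str:
    """Reject IDs that could escape the storage root."""
    # Assumption: an empty ID is rejected as well.
    if not document_id:
        raise ValueError("document_id must not be empty")
    if "/" in document_id or "\\" in document_id or ".." in document_id:
        raise ValueError(f"unsafe document_id: {document_id!r}")
    return document_id


def block_filename(page_start: int, page_end: int) -> str:
    """Name of a Markdown block file, e.g. pages_0001-0005.md."""
    return f"pages_{page_start:04d}-{page_end:04d}.md"


def md_block_path(root: Path, document_id: str,
                  page_start: int, page_end: int) -> Path:
    """Path of one Markdown block under md/{document_id}/."""
    safe_id = validate_document_id(document_id)
    return root / "md" / safe_id / block_filename(page_start, page_end)


def needs_llm_call(path: Path) -> bool:
    """Existing block files act as the cache, so re-runs can skip them."""
    return not path.exists()
```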
Generated by scripts/generate_app_logic.py. Regenerate after code changes: python scripts/generate_app_logic.py