Short answer: Pakistan has one of the largest publicly accessible corpora of judgments and statutes in South Asia — but accessing it is hard. For this 2026 snapshot we indexed 3,929 statutes, 428,860 structural components (parts, chapters, sections, subsections, clauses), 66,905 reported judgments, and 113,128 citation stubs drawn from across federal, provincial, and AJK legal sources. This post walks through what we found.
Why we built this corpus
Pakistani legal research online is fragmented. Statutes live on pakistancode.gov.pk, provincial code portals (punjabcode.punjab.gov.pk, sindhlaws.gov.pk, kpcode.kp.gov.pk, balochistancode.gob.pk), and individual ministry sites. Judgments live on each court’s own website — Supreme Court, five High Courts, Federal Shariat Court — each with its own access pattern. Paywalled reporters (PLD, SCMR, CLC, PLJ) fill the gaps. QanoonX’s mission is to consolidate all of this into a single, structured, AI-searchable knowledge base. This post is our first public audit of what that looks like at corpus level.
Statutes indexed
| Jurisdiction | Statutes indexed (approx.) |
|---|---|
| Federal | 259 across 13 legal domains |
| Punjab | 370+ |
| Sindh | 480+ |
| Khyber Pakhtunkhwa | 571 (via the KP Code portal) |
| Balochistan | 1,699 (via the Balochistan Code, includes regulations) |
| Islamabad Capital Territory | 120+ |
| Azad Jammu & Kashmir | 824 (legacy bulk-dump) |
| Subordinate legislation (regulator rules, act-implementing rules) | 287 |
| International treaties (UN HR, bilateral, multilateral, ICSID, FTAs) | 279 |
| Total | 3,929 items |
Each statute is broken down into its hierarchical components (parts, chapters, sections, subsections, clauses), giving us 428,860 addressable nodes and 424,009 bitemporal versions as of this snapshot.
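To make "addressable node" concrete, here is a minimal Python sketch of a statute tree in which every component gets a path-style address. The `Node` class, its field names, and the address format are hypothetical illustrations, not the QanoonX schema.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """One component of a statute (hypothetical structure for illustration)."""
    kind: str    # "statute", "part", "chapter", "section", "subsection" or "clause"
    label: str   # e.g. "PPC", "XVI", "302", "(a)"
    text: str = ""
    children: list["Node"] = field(default_factory=list)

    def walk(self, prefix: str = "") -> Iterator[tuple[str, "Node"]]:
        """Yield (address, node) for this node and every descendant, depth first."""
        address = f"{prefix}/{self.kind}:{self.label}"
        yield address, self
        for child in self.children:
            yield from child.walk(address)


# Illustrative fragment only: Section 302 PPC with its three clauses, text omitted.
ppc = Node("statute", "PPC", children=[
    Node("chapter", "XVI", children=[
        Node("section", "302", children=[
            Node("clause", "(a)"),
            Node("clause", "(b)"),
            Node("clause", "(c)"),
        ]),
    ]),
])

addresses = [address for address, _ in ppc.walk()]
for address in addresses:
    print(address)
```

Each printed address identifies exactly one node, which is what lets a citation or a version record point at a clause rather than at a whole Act.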
Case law indexed
| Court | Judgments indexed |
|---|---|
| Supreme Court of Pakistan | 3,288 (1958 to 2026) |
| Lahore High Court | 1,061+ (stratified 2018–2026) |
| Balochistan High Court | 683 |
| Federal Shariat Court | 77 |
| Sindh High Court | 128 |
| Islamabad High Court | 82 |
| Peshawar High Court | 59 |
| KP Service Tribunal | 489 |
| Sindh Service Tribunal | 276 |
| Customs Appellate Tribunal | 135 |
| Other tribunals | 60,627 (AJK bulk + regulator appellate bodies) |
| Total | 66,905 judgments |
Every judgment is stored in full text. 99.4% have an LLM-generated headnote (66,533 cases); 88% have a disposition; 70% have parties parsed out. Citation resolution: 110,142 judgment-to-statute links resolved out of 113,128 raw citation stubs extracted.
What we learned
1. Publication access is the single biggest bottleneck
Coverage depends almost entirely on what each court chooses to publish. The Lahore High Court publishes ~1,000 approved judgments per year; the Sindh High Court publishes a fraction of its output. The Balochistan HC has a clean public judgments portal; our coverage of the Federal Shariat Court depends on Wayback Machine archives for pre-2015 material. Substantive democratic oversight of the judiciary requires open publication, and the picture is uneven.
2. OCR quality matters more than any other corpus variable
88,865 pages of Pakistani legal PDFs required OCR, mostly scans of pre-2010 gazette reprints and provincial legislation. Modern PDFs from pakistancode.gov.pk have clean text layers; anything older is an OCR problem. We used PaddleOCR PP-OCRv5 on A40 GPUs and got usable but imperfect output on 1970s–1990s material. OCR quality is the biggest predictor of AI retrieval quality for older case law.
3. The citation graph is sparse in useful ways
Of 113,128 citation stubs extracted by regex, 97.4% resolved deterministically to specific statute sections. That’s striking: it says Pakistani judges cite statutes in predictable, structured ways (“Section 302 PPC,” “Article 199 of the Constitution”). The remaining 2.6% are ambiguous references worth reviewing manually. By contrast, case-to-case citations (judgment cites judgment) are harder to resolve because PLD / SCMR / CLC references are inconsistent and only ~4.6% of cases in the corpus even have a PLD-style citation attached.
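A rough Python sketch of what regex extraction followed by deterministic resolution can look like is shown here. The pattern, the alias table, and the function name are assumptions made for illustration; the production extractor is not described in this post and certainly handles more citation forms.

```python
import re

# Hypothetical alias table mapping normalised abbreviations to statutes.
STATUTE_ALIASES = {
    "ppc": "Pakistan Penal Code, 1860",
    "crpc": "Code of Criminal Procedure, 1898",
    "cpc": "Code of Civil Procedure, 1908",
    "constitution": "Constitution of the Islamic Republic of Pakistan, 1973",
}

# Matches stubs such as "Section 302 PPC" or "Article 199 of the Constitution".
CITATION_RE = re.compile(
    r"\b(?P<unit>Section|Article)\s+(?P<number>\d+[A-Z]?(?:-[A-Z])?)"
    r"(?:\s+of\s+the)?\s+(?P<statute>[A-Za-z.]+)"
)


def extract_and_resolve(text):
    """Split citation stubs into resolved tuples and unresolved raw strings."""
    resolved, unresolved = [], []
    for match in CITATION_RE.finditer(text):
        # Normalise "Cr.P.C." and "PPC." style tokens before the lookup.
        key = match.group("statute").replace(".", "").lower()
        if key in STATUTE_ALIASES:
            resolved.append(
                (STATUTE_ALIASES[key], match.group("unit"), match.group("number"))
            )
        else:
            unresolved.append(match.group(0))  # left for manual review
    return resolved, unresolved


sample = (
    "The petitioner was convicted under Section 302 PPC and invoked "
    "Article 199 of the Constitution; Section 5 of the Act was also cited."
)
resolved, unresolved = extract_and_resolve(sample)
print(resolved)
print(unresolved)
```

In this sample the first two stubs resolve through the alias table, while "Section 5 of the Act" does not name a statute and lands in the manual-review pile, which is the kind of reference the ambiguous 2.6% consists of.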
4. Bitemporal modelling is non-negotiable
The Pakistan Penal Code has been amended 50+ times since 1860. The Constitution has had 26 amendments. Searching “Section 302 PPC” without a time dimension returns the 1860 original, the 1990 Qisas & Diyat version, and the 2004 amendment in the same result set. Every judgment must be read against the version of the law as it stood on the date of the offence — not the text on pakistancode.gov.pk today. We store every statute version in a bitemporal graph so a query like “what did Section 302 say on 1995-06-01?” is one SQL call away.
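The "as it stood on a given date" lookup can be sketched in a few lines. This example uses an in-memory SQLite database so it runs anywhere, whereas the corpus itself is stored in PostgreSQL. The table name, column names, placeholder bodies, and the year-start dates are all assumptions made for illustration; they are not real commencement dates or the QanoonX schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE statute_version (
        section_ref   TEXT NOT NULL,
        body          TEXT NOT NULL,
        valid_from    TEXT NOT NULL,  -- date this wording took legal effect
        valid_to      TEXT,           -- NULL means still in force
        recorded_from TEXT NOT NULL,  -- date the row entered the database
        recorded_to   TEXT            -- NULL means this is the current record
    )
""")

# Placeholder rows: bodies and dates are illustrative, not the actual texts.
rows = [
    ("PPC/302", "<1860 original>", "1860-01-01", "1990-01-01", "2026-01-01", None),
    ("PPC/302", "<1990 Qisas & Diyat version>", "1990-01-01", "2004-01-01", "2026-01-01", None),
    ("PPC/302", "<2004 amendment>", "2004-01-01", None, "2026-01-01", None),
]
conn.executemany("INSERT INTO statute_version VALUES (?, ?, ?, ?, ?, ?)", rows)


def section_as_of(conn, section_ref, valid_on, known_on):
    """Return the body in force on valid_on, as recorded in the database on known_on."""
    row = conn.execute(
        """
        SELECT body FROM statute_version
        WHERE section_ref = ?
          AND valid_from <= ? AND (valid_to IS NULL OR ? < valid_to)
          AND recorded_from <= ? AND (recorded_to IS NULL OR ? < recorded_to)
        """,
        (section_ref, valid_on, valid_on, known_on, known_on),
    ).fetchone()
    return row[0] if row else None


result = section_as_of(conn, "PPC/302", "1995-06-01", "2026-04-15")
print(result)
```

The two date pairs are what make the model bitemporal: `valid_from`/`valid_to` track when the wording was law, while `recorded_from`/`recorded_to` track when the database learned it, so a later correction to an old version does not rewrite history. ISO-formatted date strings compare correctly as text, which keeps the sketch free of date parsing.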
5. English-only is a product decision, and a visible gap
QanoonX indexes English-language material only. Urdu-language statutes and judgments are out of scope for this first version. This is a deliberate choice: legal English in Pakistan is consistent across jurisdictions, and it is the language the bar argues in and the judgment text uses. A full Urdu corpus is future work.
What’s not in the corpus (yet)
- Paywalled reporters — PLD, PLJ, YLR, MLD, SCMR. Subscription access only.
- ITAT / FST / PST / BST judgments — no public portals.
- Army, Air Force, Navy Act tribunal output.
- District court judgments — not systematically published.
- Urdu-language material.
What this means for legal research in 2026
For advocates, the QanoonX corpus indexes virtually every reported judgment from the Supreme Court since 1958 and full stratified samples from the five High Courts since 2018, with deterministic citation resolution to the specific statute version in force at the time of the offence. For citizens, it means a plain-English AI assistant has ~67,000 real Pakistani judgments to draw on when answering questions like “how do I file an FIR” or “can I get khula without returning dower” — and when the AI can’t find supporting material, it says so explicitly rather than hallucinating.
Methodology
Corpus build: three collection waves across 89,262 source PDFs (64.77 GB), PyMuPDF for digital-text layer extraction, PaddleOCR PP-OCRv5 on GPU for scans, regex structural parser v5.2 for statutes, dedicated case-law parser for judgments. Storage: PostgreSQL with pgvector for embeddings, pg_trgm for fuzzy search, and a bitemporal version table to handle amendments. Data freeze: 2026-04-15.
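For readers unfamiliar with pg_trgm, fuzzy search of this kind ranks candidates by how many three-character fragments they share with the query. The plain-Python sketch below approximates that measure, following the padding rule in the PostgreSQL documentation (two spaces before each word, one after). It is not the extension itself, and the function names are made up for illustration.

```python
import re


def trigrams(text):
    """Return the set of padded, lowercased trigrams for every word in text."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("cat")))
print(similarity("Qisas", "Qissas"))
print(similarity("Qisas", "Diyat"))
```

A misspelled query such as "Qissas" still shares most of its trigrams with "Qisas", so it scores far higher than an unrelated term, which is why trigram matching copes reasonably well with OCR noise and inconsistent transliteration.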
Want the methodology file or the raw aggregates?
Email support@qanoonx.com — we’ll share the CSV aggregates and the methodology PDF with journalists, researchers, and academic institutions at no cost.
Related reading
This article is part of the Criminal and Civil Procedure in Pakistan pillar. Continue with: