Local ordinances shape everyday regulation (zoning, housing, licensing, nuisance), yet they are fragmented and hard to analyze at scale. LOCUS v1.0 makes that layer of law machine-observable by assembling a county-harmonized, chunk-level corpus of U.S. municipal and county ordinance text, annotated for downstream legal-NLP tasks.
Key Findings
- Scale and scope: ~2,211,516 text chunks derived from municipal and county codes, with coverage metadata that links chunks to jurisdiction, state, city, and county.
- Annotation schema: Each chunk is assigned a `function` label (Context, Rules, Process, Enforcement), a binary `is_substantive` flag, and, for substantive chunks, a coarse `topic` (Buildings, Business, Nuisance, Zoning, Other). The release also includes continuous scores for opacity, paternalism, enforcement discretion, and problem salience to support analytic tasks beyond simple classification.
- Practical tradeoffs: The dataset uses OCR and automated classifiers to scale labeling across thousands of jurisdictions; this yields broad geographic reach but introduces label noise, taxonomy coarseness, and uneven jurisdictional digitization.
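To make the annotation schema above concrete, here is a minimal Python sketch of how one annotated chunk might be represented and checked. The label vocabularies come from the description above; the class name `Chunk`, the field names `text`, `state`, and `county`, the helper `substantive_by_topic`, and the rule that non-substantive chunks carry no topic are assumptions made for illustration, not the released file layout. The sample records are invented.

```python
from dataclasses import dataclass
from typing import Optional

# Label vocabularies as listed in the schema description.
FUNCTIONS = {"Context", "Rules", "Process", "Enforcement"}
TOPICS = {"Buildings", "Business", "Nuisance", "Zoning", "Other"}


@dataclass(frozen=True)
class Chunk:
    """Hypothetical in-memory record for one annotated chunk."""
    text: str
    state: str
    county: str
    function: str
    is_substantive: bool
    topic: Optional[str] = None  # assumed to be set only for substantive chunks

    def __post_init__(self):
        if self.function not in FUNCTIONS:
            raise ValueError(f"unknown function label: {self.function!r}")
        if self.is_substantive and self.topic not in TOPICS:
            raise ValueError(f"substantive chunk needs a valid topic, got {self.topic!r}")
        if not self.is_substantive and self.topic is not None:
            raise ValueError("non-substantive chunk should not carry a topic")


def substantive_by_topic(chunks, topic):
    """Keep only substantive chunks with the given coarse topic."""
    return [c for c in chunks if c.is_substantive and c.topic == topic]


# Invented sample records, for illustration only.
sample = [
    Chunk("No fence shall exceed six feet in height.", "OH", "Franklin",
          "Rules", True, "Zoning"),
    Chunk("This chapter shall be known as the Zoning Code.", "OH", "Franklin",
          "Context", False),
    Chunk("Violations are punishable by a fine.", "TX", "Travis",
          "Enforcement", True, "Nuisance"),
]

zoning = substantive_by_topic(sample, "Zoning")
print(len(zoning), zoning[0].function)
```

Validating at construction time is one way to surface label noise early; a real pipeline might log and skip bad rows instead of raising.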
Who it's for and tradeoffs
Great fit if you need a large, jurisdiction-linked corpus to prototype legal-text classifiers, build substantive vs. non-substantive filters, or run comparative studies of municipal regulation. Look elsewhere if you require a fully audited legal source, exhaustive coverage of every U.S. locality, or fine-grained legal subject-matter labeling; LOCUS v1.0 is a snapshot and its function/topic labels are model-assigned and not fully human-validated.
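As a rough illustration of the comparative use case, the sketch below computes each state's share of substantive chunks by topic. It assumes rows are available as plain dictionaries with the keys `state`, `is_substantive`, and `topic`; those key names, the function name `topic_shares_by_state`, and the sample rows are hypothetical and not taken from the release.

```python
from collections import Counter, defaultdict


def topic_shares_by_state(rows):
    """Return {state: {topic: share}} over substantive chunks only."""
    counts = defaultdict(Counter)
    for row in rows:
        if row["is_substantive"]:
            counts[row["state"]][row["topic"]] += 1
    shares = {}
    for state, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[state] = {t: n / total for t, n in topic_counts.items()}
    return shares


# Invented rows, for illustration only.
rows = [
    {"state": "OH", "is_substantive": True, "topic": "Zoning"},
    {"state": "OH", "is_substantive": True, "topic": "Zoning"},
    {"state": "OH", "is_substantive": True, "topic": "Nuisance"},
    {"state": "OH", "is_substantive": False, "topic": None},
    {"state": "TX", "is_substantive": True, "topic": "Business"},
]

shares = topic_shares_by_state(rows)
print(shares)
```

Because the labels are model-assigned, shares computed this way are better read as estimates to be spot-checked against the source text than as exact counts.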
Where it fits
LOCUS is best used as research infrastructure for legal NLP pipelines, weakly supervised workflows, and empirical policy analysis that benefit from reproducible coverage metadata and scalable annotations. For production legal advice or litigation-grade proof, human legal review and up-to-date jurisdictional checks remain necessary.
