A Dataset Showing a Century of Evolution in the Complexity of the United States Legal Code
| Published: | 2025 |
|---|---|
| Summary: | <p dir="ltr">We leverage <b>OCR</b> and <b>Generative AI</b> techniques to recover and clean printed historical editions of the Code. This enables computational analysis of federal law for periods that predate web-based digital access. The processing pipeline includes:</p><ul><li> <b>Contents of U.S. Code</b>: Word counts, unique word counts, entropy, scaling exponents, etc.</li><li> <b>Hierarchical Structure</b>: Subtitle → Part → Chapter → Section → Subsection...</li><li> <b>Cross-Reference Relationships</b>: Title-to-title citation relationships</li></ul><p dir="ltr">Due to repository size constraints, this GitHub repository includes:</p><ul><li> A sample OCR text page (<code>ocr_processing_gemini</code>) for demonstration</li><li> Web-based U.S. Code text from 1994 for structural parsing (<code>Data Set 2</code>)</li></ul> |
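
The content and cross-reference measures named in the summary can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `text_metrics` and `cited_titles` are hypothetical, the tokenizer (lowercase alphabetic runs) is an assumption, and the citation pattern only covers the common `<title> U.S.C.` form.

```python
import math
import re
from collections import Counter


def text_metrics(text: str) -> dict:
    """Return word count, unique word count, and Shannon entropy in bits."""
    # Assumption: a token is a run of lowercase letters; the dataset's
    # actual tokenization rules may differ.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    # Shannon entropy of the empirical word distribution: sum p * log2(1/p).
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return {"words": total, "unique_words": len(counts), "entropy_bits": entropy}


def cited_titles(text: str) -> Counter:
    """Count citations written as '<title number> U.S.C.' per cited title."""
    # Assumption: only this citation form is matched; other styles
    # (e.g. "title 5, United States Code") are ignored in this sketch.
    return Counter(int(t) for t in re.findall(r"\b(\d+)\s+U\.S\.C\.", text))


metrics = text_metrics("the law the code")
citations = cited_titles("See 42 U.S.C. 1983, 5 U.S.C. 552, and 5 U.S.C. 553.")
print(metrics)
print(citations)
```

Running `cited_titles` over the text of each title in turn would give one row of a title-to-title citation matrix, which is one plausible way to represent the cross-reference relationships listed above.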