A Dataset Showing a Century of Evolution in the Complexity of the United States Legal Code

Bibliographic details
Main author: Dawoon Jeong (22382975) (author)
Other authors: James Holehouse (21698807) (author), Jisung Yoon (13043034) (author), Christopher P. Kempes (21698853) (author), Geoffrey B. West (21698858) (author), Hyejin Youn (21698921) (author)
Published: 2025
Description
Summary:<p dir="ltr">We leverage <b>OCR</b> and <b>Generative AI</b> techniques to recover and clean printed historical editions of the U.S. Code. This enables computational analysis of federal law even for periods that predate web-based digital access. The processing pipeline includes:</p><ul><li> <b>Contents of U.S. Code</b>: Word counts, unique word counts, entropy, scaling exponents, etc.</li><li> <b>Hierarchical Structure</b>: Subtitle → Part → Chapter → Section → Subsection...</li><li> <b>Cross-Reference Relationships</b>: Title-to-title citation relationships</li></ul><p dir="ltr">Due to repository size constraints, this GitHub repository includes:</p><ul><li> A sample OCR text page (<code>ocr_processing_gemini</code>) for demonstration</li><li> Web-based U.S. Code text from 1994 for structural parsing (<code>Data Set 2</code>)</li></ul>
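As a rough illustration of the content measures named in the summary (word counts, unique word counts, entropy), the following minimal Python sketch computes them for a single block of text. The function name `text_metrics`, the tokenization rule, and the choice of Shannon entropy over the empirical word-frequency distribution are all assumptions made for this example; the record does not specify how the dataset's authors define or compute these quantities.

```python
import math
import re
from collections import Counter


def text_metrics(text: str) -> dict:
    """Return word count, unique word count, and Shannon entropy (in bits)."""
    # Assumed tokenization: lowercase runs of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {"words": 0, "unique_words": 0, "entropy_bits": 0.0}
    # Shannon entropy of the empirical word-frequency distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {"words": total, "unique_words": len(counts), "entropy_bits": entropy}


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the dataset.
    print(text_metrics("The Secretary shall prescribe the regulations."))
```

Applied per title and per edition, measures of this kind could be tracked over time, though the dataset's own definitions should take precedence wherever they differ.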