Resolving Harvesting Errors in Institutional Repository Migration: Using Python Scripts with VS Code and LLM Integration
<p dir="ltr">Presented at <a href="https://coar-repositories.org/news-updates/coar-annual-conference-2025/" rel="noreferrer" target="_blank">COAR 2025</a> (2025/05/12-14)</p><h3>Introduction</h3><ul><li>In Novemb...
Saved in:
| Main Author: | |
|---|---|
| Published: |
2025
|
| Subjects: | |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| Summary: | <p dir="ltr">Presented at <a href="https://coar-repositories.org/news-updates/coar-annual-conference-2025/" rel="noreferrer" target="_blank">COAR 2025</a> (2025/05/12-14)</p><h3>Introduction</h3><ul><li>In November 2024, we migrated our institutional repository to “JAIRO Cloud”, the standard repository system in Japan managed by the National Institute of Informatics. Approx. 16,000 items were migrated, of which about 4,000 items were subject to metadata harvesting. After migration, we received an error report from the “IRDB”, a service that harvests repository content across Japan. The error report comprised 9 CSV files totaling approx. 36,000 lines.</li><li>Unlike typical CSV error reports, each line in this report represented an individual item rather than a single error. Each line (CRLF) contained internal LF line breaks for human readability, with multiple errors per item resulting in multiple internal LF breaks.</li><li>To address these errors effectively, we needed to determine how many instances of each error pattern occurred.</li><li>Due to the insufficient structure of the file, standard office software such as Excel was not adequate for analysis. Therefore, we decided to create a dedicated Python program using Large Language Model (LLM)-assisted coding.</li></ul><h3>Method</h3><ul><li>We used Visual Studio Code (VS Code), a free code editor compatible with LLM services. GitHub Copilot was selected as the LLM service. (It is the official LLM service for VS Code.)</li><li>Initially, the error report files were loaded into VS Code.</li><li>Next, we provided GitHub Copilot with contextual information, file details, and analysis objectives as initial instructions to generate the Python code.</li><li>We then reviewed the generated code and results of execution, providing additional instructions to refine the code.</li></ul><h3>Result</h3><ul><li>The initial instructions yielded Python code with approx. 70-80% completion.</li><li>Following further adjustments, the desired analytical program was completed in about two hours.</li><li>Executing this program revealed 4,216 errors across 3,078 items, categorized into 18 distinct error patterns, including the number of occurrences for each error type.</li><li>Identifying these patterns clarified the nature of the issues, enabling us to effectively prioritize and address them.</li></ul><h3>Discussion & Conclusion</h3><ul><li>We successfully addressed a challenging practical issue, which would have been difficult to resolve using conventional business tools and skills, within a relatively short period of time.</li><li>Using a code editor allowed direct interaction with insufficiently structured files, ensuring the resulting program closely aligned with our specific needs. The integration of natural language instruction with code generation, editing, and error feedback led to a significant lowering of the barrier to programming.</li><li>Conclusion: An LLM-integrated code editor is a highly practical tool, making it easy to create small, task-specific programs that fit operational needs.</li><li>Note: Using LLM services may incur costs. / Check your institution's and the LLM service's policies before use.</li></ul><p><br></p> |