Dataset for an LLM score extraction challenge
| First author: | |
|---|---|
| Published: | 2025 |
| Subjects: | |
| Tags: | |
| Abstract: | <p dir="ltr">This zipfile contains three plain text files: one describes the task, one contains the task data, and one contains the answers. To access the information you will need to use <b>LLM!!!</b> and the <a href="https://www.7-zip.org/" rel="noreferrer" target="_blank"><b>7-zip</b></a> software. Here is the challenge. Please do not share the files anywhere online: they are encrypted to prevent LLMs from reading the answers.</p><p>--------</p><p dir="ltr">The dataset includes outputs from Magistral, Llama 4 Scout and Gemma3 27b when asked to give a REF score to a journal article based on REF guidelines.</p><p dir="ltr">Some outputs are truncated at 100 tokens, or are truncated for other reasons. Some contain a score; others don't.</p><p dir="ltr">The task is to use LLMs to obtain the REF score described by each report, or to return -1 if it does not report a score.</p><p dir="ltr">The scoring scale is 1* to 4*, and -1 should be returned if it is not possible to be confident about the score.</p><p dir="ltr">For background, this is what the scores mean (from: https://2021.ref.ac.uk/guidance-on-results/guidance-on-ref-2021-results/index.html):</p><p dir="ltr">4*: Quality that is world-leading in terms of originality, significance and rigour.</p><p dir="ltr">3*: Quality that is internationally excellent in terms of originality, significance and rigour but which falls short of the highest standards of excellence.</p><p dir="ltr">2*: Quality that is recognised internationally in terms of originality, significance and rigour.</p><p dir="ltr">1*: Quality that is recognised nationally in terms of originality, significance and rigour.</p><p dir="ltr">The LLM should report either the overall score or, if no overall score is reported, the average of the significance, originality and rigour scores, provided all three are given; these component scores should be ignored if one or two are missing.</p><p dir="ltr">To count as a correct answer, the LLM output must contain only the number and (optionally) a star after the number. Additional spaces are allowed at the start and end of the response, as well as between the number and the star.</p><p dir="ltr">Examples of correct answer formats:</p><p>3.4*</p><p>2</p><p> 3* </p><p>4 *</p><p>-1</p><p dir="ltr">Examples of incorrect answer formats:</p><p>1. 3*</p><p>*4**</p><p dir="ltr">Score: 2*</p><p>-1*</p><p dir="ltr">The gold standard is the score in the report (or -1) as judged by a human.</p><p dir="ltr">Some of the gold-standard judgements are subjective and you may disagree. For example, when three scores are given with no context, they are assumed to be the rigour, originality and significance scores, and their average is rounded. When only two scores are included, this is usually counted as an unknown score.</p><p dir="ltr">The number extracted is counted as correct if it is exact or within (<=) 0.005 of the exact value (this allows for a small amount of rounding). This includes the -1s, so accuracy calculations are always based on 1446 items.</p><p dir="ltr">For clarity, any symbol in the output other than a space or one of the following characters counts as an automatic fail: 0123456789.*-</p> |
|---|---|