Data Analysis for Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage



Bibliographic Details
Main Author: Nolan Platt (author)
Other Authors: Ethan Luchs (author), Sehrish Basir Nizamani (author)
Published: 2025
Description
Abstract: Usability evaluations are essential for ensuring that modern interfaces meet user needs, yet traditional heuristic reviews by human experts can be time-consuming and subjective, especially early in development. This study investigates whether large language models (LLMs) can provide reliable and consistent heuristic assessments at the development stage. By applying Jakob Nielsen's ten usability heuristics to thirty open-source websites, we generated over 850 heuristic ratings, with three independent evaluations per site, using a pipeline built on OpenAI's GPT-4o. Agreement analysis shows moderate consistency: the average pairwise Cohen's Kappa for severity ratings was 0.63, with exact agreement in 56 percent of cases; multi-rater Fleiss's Kappa for those ratings was 0.50, while Krippendorff's Alpha was effectively zero, indicating systematic variation in how the model assigns severity levels. For binary detection of whether an issue exists, pairwise Cohen's Kappa averaged 0.50 with 84 percent exact agreement; Fleiss's Kappa was likewise 0.50, and Krippendorff's Alpha was again near zero. These results demonstrate that LLM-based evaluations can achieve a meaningful level of reliability in spotting usability issues, even though severity judgments vary. Our work provides one of the first quantitative inter-rater reliability analyses of automated heuristic testing and highlights methods for improving model consistency.
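
The abstract describes prompting GPT-4o to rate each site against Nielsen's ten heuristics in three independent runs. A minimal sketch of such a pipeline is shown below; the prompt wording, the 0-4 severity scale, and the helper names are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an LLM heuristic-evaluation pipeline (not the
# authors' code): three independent GPT-4o runs per site, each rating all
# ten of Nielsen's heuristics on an assumed 0-4 severity scale.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NIELSEN_HEURISTICS = [
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
]

def evaluate_site(source_code: str, n_runs: int = 3) -> list[dict]:
    """Run n_runs independent heuristic evaluations of one site."""
    prompt = (
        "Evaluate the following website source against each of Nielsen's "
        "ten usability heuristics. For every heuristic, report whether an "
        "issue exists and a severity from 0 (none) to 4 (catastrophic). "
        "Answer as a JSON object keyed by heuristic name.\n\n" + source_code
    )
    runs = []
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        runs.append(json.loads(resp.choices[0].message.content))
    return runs
```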
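
Given three ratings per (site, heuristic) item, the reported statistics can be reproduced with standard libraries. The sketch below assumes the ratings are arranged as an items x raters integer matrix; the random placeholder data is illustrative, not the paper's dataset.

```python
# Hypothetical agreement analysis over three independent LLM evaluations
# (treated as raters). `ratings` is an (n_items, 3) array of severity
# levels, one row per (site, heuristic) pair -- placeholder data only.
from itertools import combinations

import numpy as np
import krippendorff                       # pip install krippendorff
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings = rng.integers(0, 5, size=(300, 3))   # placeholder severity data

# Average pairwise Cohen's Kappa and exact-agreement rate across rater pairs.
pairs = list(combinations(range(ratings.shape[1]), 2))
kappas = [cohen_kappa_score(ratings[:, a], ratings[:, b]) for a, b in pairs]
exact = np.mean([np.mean(ratings[:, a] == ratings[:, b]) for a, b in pairs])
print(f"mean pairwise Cohen's kappa: {np.mean(kappas):.2f}")
print(f"mean pairwise exact agreement: {exact:.0%}")

# Multi-rater Fleiss's Kappa over per-item category counts.
counts, _ = aggregate_raters(ratings)
print(f"Fleiss's kappa: {fleiss_kappa(counts, method='fleiss'):.2f}")

# Krippendorff's Alpha expects a (raters, items) matrix.
alpha = krippendorff.alpha(reliability_data=ratings.T,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```

For the binary issue-detection statistics quoted in the abstract, the same code applies after thresholding, e.g. `(ratings > 0).astype(int)`.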