A Neighborhood Framework for Resource-Lean Content Flagging

We propose a novel framework for cross-lingual content flagging with limited target-language data, which significantly outperforms prior work in terms of predictive performance. The framework is based on a nearest-neighbor architecture. It is a modern instantiation of...

Bibliographic Details
Main Author: Sheikh Muhammad Sarwar (19517701) (author)
Other Authors: Dimitrina Zlatkova (19517704) (author), Momchil Hardalov (18618397) (author), Yoan Dinkov (19517707) (author), Isabelle Augenstein (14013962) (author), Preslav Nakov (17760905) (author)
Published: 2022
_version_ 1864513506161721344
author Sheikh Muhammad Sarwar (19517701)
author2 Dimitrina Zlatkova (19517704)
Momchil Hardalov (18618397)
Yoan Dinkov (19517707)
Isabelle Augenstein (14013962)
Preslav Nakov (17760905)
author2_role author
author
author
author
author
author_facet Sheikh Muhammad Sarwar (19517701)
Dimitrina Zlatkova (19517704)
Momchil Hardalov (18618397)
Yoan Dinkov (19517707)
Isabelle Augenstein (14013962)
Preslav Nakov (17760905)
author_role author
dc.creator.none.fl_str_mv Sheikh Muhammad Sarwar (19517701)
Dimitrina Zlatkova (19517704)
Momchil Hardalov (18618397)
Yoan Dinkov (19517707)
Isabelle Augenstein (14013962)
Preslav Nakov (17760905)
dc.date.none.fl_str_mv 2022-05-04T03:00:00Z
dc.identifier.none.fl_str_mv 10.1162/tacl_a_00472
dc.relation.none.fl_str_mv https://figshare.com/articles/journal_contribution/A_Neighborhood_Framework_for_Resource-Lean_Content_Flagging/26889445
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Information and computing sciences
Artificial intelligence
Computer vision and multimedia computation
Language, communication and culture
Language studies
Linguistics
Cross-Lingual Content Flagging
Limited Target-Language Data
Encoding Schemes
Abusive Language Detection
Jigsaw Multilingual Dataset
Model Adaptation
Neighborhood-Based Approaches
dc.title.none.fl_str_mv A Neighborhood Framework for Resource-Lean Content Flagging
dc.type.none.fl_str_mv Text
Journal contribution
info:eu-repo/semantics/publishedVersion
text
contribution to journal
description We propose a novel framework for cross-lingual content flagging with limited target-language data, which significantly outperforms prior work in terms of predictive performance. The framework is based on a nearest-neighbor architecture. It is a modern instantiation of the vanilla k-nearest neighbor model, as we use Transformer representations in all its components. Our framework can adapt to new source-language instances, without the need to be retrained from scratch. Unlike prior work on neighborhood-based approaches, we encode the neighborhood information based on query–neighbor interactions. We propose two encoding schemes and we show their effectiveness using both qualitative and quantitative analysis. Our evaluation results on eight languages from two different datasets for abusive language detection show sizable improvements of up to 9.5 F1 points absolute (for Italian) over strong baselines. On average, we achieve 3.6 absolute F1 points of improvement for the three languages in the Jigsaw Multilingual dataset and 2.14 points for the WUL dataset.
Other Information
Published in: Transactions of the Association for Computational Linguistics
License: https://creativecommons.org/licenses/by/4.0/
See article on publisher's website: https://dx.doi.org/10.1162/tacl_a_00472
eu_rights_str_mv openAccess
id Manara2_9530be59367bf37599664e76cfaa0063
identifier_str_mv 10.1162/tacl_a_00472
network_acronym_str Manara2
network_name_str Manara2
oai_identifier_str oai:figshare.com:article/26889445
publishDate 2022
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv CC BY 4.0
status_str publishedVersion
title A Neighborhood Framework for Resource-Lean Content Flagging
topic Information and computing sciences
Artificial intelligence
Computer vision and multimedia computation
Language, communication and culture
Language studies
Linguistics
Cross-Lingual Content Flagging
Limited Target-Language Data
Encoding Schemes
Abusive Language Detection
Jigsaw Multilingual Dataset
Model Adaptation
Neighborhood-Based Approaches