Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text

Background: the ability to use social media to communicate without revealing one’s real identity has created an attractive setting for cyberbullying. Several studies targeted social media to collect their datasets with the aim of automatically detecting offensive language. However, the majority of t...

Bibliographic Details
Main Author:	ALBayari, Reem (author)
Other Authors:	Abdallah, Sherief (author)
Published:	2022
Subjects:	cyberbullying; offensive language; Arabic dialect
Online Access:	https://bspace.buid.ac.ae/handle/1234/3117 https://doi.org/10.3390/data7070083

_version_	1862980613535956992
author	ALBayari, Reem
author2	Abdallah, Sherief
author2_role	author
author_facet	ALBayari, Reem Abdallah, Sherief
author_role	author
dc.creator.none.fl_str_mv	ALBayari, Reem Abdallah, Sherief
dc.date.none.fl_str_mv	2022 2025-05-24T12:51:30Z 2025-05-24T12:51:30Z
dc.identifier.none.fl_str_mv	ALBayari, R. and Abdallah, S. (2022) “Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text,” Data, 7(7), p. 83. https://bspace.buid.ac.ae/handle/1234/3117 https://doi.org/10.3390/data7070083.
dc.language.none.fl_str_mv	en
dc.publisher.none.fl_str_mv	MDPI
dc.relation.none.fl_str_mv	Data v7 n7 (2022): 83
dc.subject.none.fl_str_mv	cyberbullying; offensive language; Arabic dialect
dc.title.none.fl_str_mv	Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text
dc.type.none.fl_str_mv	Article
description	(1) Background: the ability to use social media to communicate without revealing one’s real identity has created an attractive setting for cyberbullying. Several studies have targeted social media to collect datasets with the aim of automatically detecting offensive language. However, the majority of these datasets are in English, not Arabic. Even among the few Arabic datasets that have been collected, none focuses on Instagram, despite it being a major social media platform in the Arab world. (2) Methods: we use the official Instagram APIs to collect our dataset. To establish the dataset as a benchmark, we use SPSS (Kappa statistic) to evaluate the inter-annotator agreement (IAA), and we evaluate the performance of several learning models (LR, SVM, RFC, and MNB). (3) Results: we present the first Instagram Arabic corpus focusing on cyberbullying, with multi-class (sub-class) categorization. The dataset is primarily designed for detecting offensive language in text. We collected 200,000 comments, of which 46,898 were annotated by three human annotators. The results show that the SVM classifier outperforms the other classifiers, with an F1 score of 69% for bullying comments and 85% for positive comments.
id	budr_c9b9cbb243108c1a06c3f1b00d0e148f
identifier_str_mv	ALBayari, R. and Abdallah, S. (2022) “Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text,” Data, 7(7), p. 83.
language_invalid_str_mv	en
network_acronym_str	budr
network_name_str	The British University in Dubai repository
oai_identifier_str	oai:bspace.buid.ac.ae:1234/3117
publishDate	2022
publisher.none.fl_str_mv	MDPI
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
spellingShingle	Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text ALBayari, Reem cyberbullying; offensive language; Arabic dialect
title	Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text
title_full	Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text
title_fullStr	Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text
title_full_unstemmed	Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text
title_short	Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text
title_sort	Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text
topic	cyberbullying; offensive language; Arabic dialect
url	https://bspace.buid.ac.ae/handle/1234/3117 https://doi.org/10.3390/data7070083.


Similar Items