Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text
Background: the ability to use social media to communicate without revealing one’s real identity has created an attractive setting for cyberbullying. Several studies targeted social media to collect their datasets with the aim of automatically detecting offensive language. However, the majority of t...
| Main Author: | ALBayari, Reem |
|---|---|
| Other Authors: | Abdallah, Sherief |
| Published: | 2022 |
| Subjects: | cyberbullying; offensive language; Arabic dialect |
| Online Access: | https://bspace.buid.ac.ae/handle/1234/3117 https://doi.org/10.3390/data7070083 |

| _version_ | 1862980613535956992 |
|---|---|
| author | ALBayari, Reem |
| author2 | Abdallah, Sherief |
| author2_role | author |
| author_facet | ALBayari, Reem Abdallah, Sherief |
| author_role | author |
| dc.creator.none.fl_str_mv | ALBayari, Reem Abdallah, Sherief |
| dc.date.none.fl_str_mv | 2022 2025-05-24T12:51:30Z 2025-05-24T12:51:30Z |
| dc.identifier.none.fl_str_mv | ALBayari, R. and Abdallah, S. (2022) “Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text,” Data, 7(7), p. 83. https://bspace.buid.ac.ae/handle/1234/3117 https://doi.org/10.3390/data7070083. |
| dc.language.none.fl_str_mv | en |
| dc.publisher.none.fl_str_mv | MDPI |
| dc.relation.none.fl_str_mv | Data v7 n7 (2022): 83 |
| dc.subject.none.fl_str_mv | cyberbullying; offensive language; Arabic dialect |
| dc.title.none.fl_str_mv | Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text |
| dc.type.none.fl_str_mv | Article |
| description | (1) Background: the ability to use social media to communicate without revealing one’s real identity has created an attractive setting for cyberbullying. Several studies have targeted social media to collect datasets with the aim of automatically detecting offensive language. However, the majority of these datasets are in English, not in Arabic. Even among the few Arabic datasets that have been collected, none focuses on Instagram, despite it being a major social media platform in the Arab world. (2) Methods: we use the official Instagram APIs to collect our dataset. To establish the dataset as a benchmark, we use SPSS (Kappa statistic) to evaluate the inter-annotator agreement (IAA), and we evaluate the performance of various learning models (LR, SVM, RFC, and MNB). (3) Results: in this research, we present the first Instagram Arabic corpus (multi-class, with sub-class categorization) focusing on cyberbullying. The dataset is primarily designed for detecting offensive language in text. We collected 200,000 comments, of which 46,898 were annotated by three human annotators. The results show that the SVM classifier outperforms the other classifiers, with an F1 score of 69% for bullying comments and 85% for positive comments. |
| id | budr_c9b9cbb243108c1a06c3f1b00d0e148f |
| identifier_str_mv | ALBayari, R. and Abdallah, S. (2022) “Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text,” Data, 7(7), p. 83. |
| language_invalid_str_mv | en |
| network_acronym_str | budr |
| network_name_str | The British University in Dubai repository |
| oai_identifier_str | oai:bspace.buid.ac.ae:1234/3117 |
| publishDate | 2022 |
| publisher.none.fl_str_mv | MDPI |
| title | Instagram-Based Benchmark Dataset for Cyberbullying Detection in Arabic Text |
| topic | cyberbullying; offensive language; Arabic dialect |
| url | https://bspace.buid.ac.ae/handle/1234/3117 https://doi.org/10.3390/data7070083 |