Tamp-X: Attacking explainable natural language classifiers through tampered activations

While Deep Neural Networks (DNNs) have been instrumental in achieving state-of-the-art results for various Natural Language Processing (NLP) tasks, recent works have shown that the decisions made by DNNs cannot always be trusted. Recently, Explainable Artificial Intelligence...

Bibliographic Details
Main Author:	Hassan Ali (3348749) (author)
Other Authors:	Muhammad Suleman Khan (17562612) (author), Ala Al-Fuqaha (4434340) (author), Junaid Qadir (16494902) (author)
Published:	2022
Subjects:	Information and computing sciences; Artificial intelligence; Machine learning; Explainable artificial intelligence (XAI); Natural language processing; Attacking XAI; Adversarial attacks; Model tampering

_version_	1864513535792381952
author	Hassan Ali (3348749)
author2	Muhammad Suleman Khan (17562612) Ala Al-Fuqaha (4434340) Junaid Qadir (16494902)
author2_role	author author author
author_facet	Hassan Ali (3348749) Muhammad Suleman Khan (17562612) Ala Al-Fuqaha (4434340) Junaid Qadir (16494902)
author_role	author
dc.creator.none.fl_str_mv	Hassan Ali (3348749) Muhammad Suleman Khan (17562612) Ala Al-Fuqaha (4434340) Junaid Qadir (16494902)
dc.date.none.fl_str_mv	2022-09-01T00:00:00Z
dc.identifier.none.fl_str_mv	10.1016/j.cose.2022.102791
dc.relation.none.fl_str_mv	https://figshare.com/articles/journal_contribution/Tamp-X_Attacking_explainable_natural_language_classifiers_through_tampered_activations/24745086
dc.rights.none.fl_str_mv	CC BY 4.0 info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv	Information and computing sciences Artificial intelligence Machine learning Explainable artificial intelligence (XAI) Natural language processing Attacking XAI Adversarial attacks Model tampering
dc.title.none.fl_str_mv	Tamp-X: Attacking explainable natural language classifiers through tampered activations
dc.type.none.fl_str_mv	Text Journal contribution info:eu-repo/semantics/publishedVersion text contribution to journal
description	<p>While Deep Neural Networks (DNNs) have been instrumental in achieving state-of-the-art results for various Natural Language Processing (NLP) tasks, recent works have shown that the decisions made by DNNs cannot always be trusted. Recently, Explainable Artificial Intelligence (XAI) methods have been proposed as a means of increasing the reliability and trustworthiness of DNNs. These XAI methods are, however, open to attack and can be manipulated in both white-box (gradient-based) and black-box (perturbation-based) scenarios. Exploring novel techniques to attack and robustify these XAI methods is crucial to fully understand these vulnerabilities. In this work, we propose Tamp-X, a novel attack that tampers with the activations of robust NLP classifiers, forcing state-of-the-art white-box and black-box XAI methods to generate misrepresented explanations. To the best of our knowledge, we are the first in the current NLP literature to attack both white-box and black-box XAI methods simultaneously. We quantify the reliability of explanations using three metrics: the descriptive accuracy, the cosine similarity, and the L<sub>p</sub> norms of the explanation vectors. Through extensive experimentation, we show that the explanations generated for the tampered classifiers are not reliable and disagree significantly with those generated for the untampered classifiers, even though the output decisions of the tampered and untampered classifiers are almost always the same. 
Additionally, we study the adversarial robustness of the tampered NLP classifiers and find that the tampered classifiers that are harder for XAI methods to explain are also harder for adversarial attackers to attack.</p><h2>Other Information</h2> <p> Published in: Computers & Security<br> License: <a href="http://creativecommons.org/licenses/by/4.0/" target="_blank">http://creativecommons.org/licenses/by/4.0/</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1016/j.cose.2022.102791" target="_blank">https://dx.doi.org/10.1016/j.cose.2022.102791</a></p>
eu_rights_str_mv	openAccess
id	Manara2_6c5c6d3df74258de5b1ced3167df0e8e
identifier_str_mv	10.1016/j.cose.2022.102791
network_acronym_str	Manara2
network_name_str	Manara2
oai_identifier_str	oai:figshare.com:article/24745086
publishDate	2022
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv	CC BY 4.0
status_str	publishedVersion
title	Tamp-X: Attacking explainable natural language classifiers through tampered activations
topic	Information and computing sciences Artificial intelligence Machine learning Explainable artificial intelligence (XAI) Natural language processing Attacking XAI Adversarial attacks Model tampering
