Replication Package of "Battling Phish"

Bibliographic details
Main author: Anonymous Author (7372229) (author)
Published: 2025
Subjects:
_version_ 1852015875717070848
author Anonymous Author (7372229)
author_facet Anonymous Author (7372229)
author_role author
dc.creator.none.fl_str_mv Anonymous Author (7372229)
dc.date.none.fl_str_mv 2025-10-09T21:32:42Z
dc.identifier.none.fl_str_mv 10.6084/m9.figshare.30324559.v1
dc.relation.none.fl_str_mv https://figshare.com/articles/dataset/Replication_Package_of_Battling_Phish_/30324559
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Cybersecurity and privacy not elsewhere classified
Information security management
Phishing URLs
dc.title.none.fl_str_mv Replication Package of "Battling Phish"
dc.type.none.fl_str_mv Dataset
info:eu-repo/semantics/publishedVersion
dataset
description <h4>This replication package contains all datasets and source code used in our study on phishing URL detection. The study investigates the effectiveness of traditional machine learning (ML), deep learning (DL), and large language model (LLM)-based methods using a consistent set of 32 URL-based features.<br><br>Directory Structure</h4>
<p dir="ltr">├── Datasets/</p>
<p dir="ltr">│ ├── Dataset-1.csv</p>
<p dir="ltr">│ ├── Dataset-2.csv</p>
<p dir="ltr">│ ├── Dataset-3.csv</p>
<p dir="ltr">│ ├── Dataset-4.csv</p>
<p dir="ltr">│ ├── Dataset-5.csv</p>
<p dir="ltr">│ ├── Phishing_Site_URLs_32_Features_Extracted_Data.csv</p>
<p dir="ltr">│ └── Legit_Phish_32_Features_Extracted_Data.csv</p>
<p>│</p>
<p dir="ltr">└── Source_Codes/</p>
<p dir="ltr"> ├── Feature_extraction_source_code.py</p>
<p dir="ltr"> ├── Feature_importance_analysis_source_code.py</p>
<p dir="ltr"> ├── ML/</p>
<p dir="ltr"> │ ├── Seven_ML_Models_trained_on_LP.py</p>
<p dir="ltr"> │ ├── Seven_ML_Models_trained_on_PSU.py</p>
<p dir="ltr"> │ ├── SoftVoting_trained_on_LP.py</p>
<p dir="ltr"> │ ├── SoftVoting_trained_on_PSU.py</p>
<p dir="ltr"> │ ├── HardVoting_trained_on_LP.py</p>
<p dir="ltr"> │ └── HardVoting_trained_on_PSU.py</p>
<p> │</p>
<p dir="ltr"> ├── DL/</p>
<p dir="ltr"> │ ├── [DLModel1]_trained_on_LP.py</p>
<p dir="ltr"> │ ├── [DLModel1]_trained_on_PSU.py</p>
<p dir="ltr"> │ └── ... (total 16 files for 8 DL algorithms)</p>
<p> │</p>
<p dir="ltr"> └── LLM/</p>
<p dir="ltr"> ├── BERT_Fine_Tuned_on_LP.py</p>
<p dir="ltr"> ├── BERT_Fine_Tuned_on_PSU.py</p>
<p dir="ltr"> ├── DistilBERT_Fine_Tuned_on_LP.py</p>
<p dir="ltr"> ├── DistilBERT_Fine_Tuned_on_PSU.py</p>
<p dir="ltr"> ├── PhishBERT_Evaluation.py</p>
<p dir="ltr"> └── URLBERT_Evaluation.py</p>
<h2>Datasets:</h2>
<ul>
<li><code>Dataset-1.csv</code> to <code>Dataset-5.csv</code>:<br>Used for feature importance analysis.</li>
<li><code>Phishing_Site_URLs_32_Features_Extracted_Data.csv</code> (PSU dataset):<br>Includes phishing and legitimate URLs with 32 extracted lexical features.</li>
<li><code>Legit_Phish_32_Features_Extracted_Data.csv</code> (LP dataset):<br>Another benchmark dataset with the same 32 features, used for comparative evaluation.</li>
</ul>
<p dir="ltr"></p>
<p dir="ltr"><b>Note</b>: PSU and LP datasets are used for both training and evaluating ML, DL, and LLM-based models.</p>
<h2>Source Code:</h2>
<ul>
<li><code><strong>Feature_extraction_source_code.py</strong></code><br>Extracts 32 handcrafted lexical features from raw URL data.</li>
<li><code><strong>Feature_importance_analysis_source_code.py</strong></code><br>Performs feature selection using seven statistical and model-based ranking methods.</li>
</ul>
<h3>Machine Learning (ML)</h3>
<p dir="ltr">Implements ML classifiers individually trained on LP and PSU datasets:</p>
<ul>
<li>Logistic Regression, Decision Tree, Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and XGBoost.</li>
<li>Soft Voting and Hard Voting ensembles are also implemented.</li>
</ul>
<p dir="ltr">Scripts:</p>
<ul>
<li><code>Seven_ML_Models_trained_on_LP.py</code></li>
<li><code>Seven_ML_Models_trained_on_PSU.py</code></li>
<li><code>SoftVoting_trained_on_LP.py</code>, <code>SoftVoting_trained_on_PSU.py</code></li>
<li><code>HardVoting_trained_on_LP.py</code>, <code>HardVoting_trained_on_PSU.py</code></li>
</ul>
<h3>Deep Learning (DL)</h3>
<p dir="ltr">Implements eight deep learning architectures (each trained separately on LP and PSU):</p>
<blockquote><p dir="ltr">Total of 16 scripts: 2 per DL model (1 for LP, 1 for PSU).</p></blockquote>
<h3>Large Language Models (LLMs)</h3>
<ul>
<li><b>Fine-tuned</b>:</li>
<li><ul>
<li><code>BERT_Fine_Tuned_on_LP.py</code>, <code>BERT_Fine_Tuned_on_PSU.py</code></li>
<li><code>DistilBERT_Fine_Tuned_on_LP.py</code>, <code>DistilBERT_Fine_Tuned_on_PSU.py</code></li>
</ul></li>
<li><b>Pre-trained, zero-shot or direct evaluation</b>:</li>
<li><ul>
<li><code>PhishBERT_Evaluation.py</code></li>
<li><code>URLBERT_Evaluation.py</code></li>
</ul></li>
</ul>
<p></p>
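The description mentions Soft Voting and Hard Voting ensembles without showing how they differ. The sketch below is only an illustration of the two general techniques, not the code in the package's scripts, which this record does not show. The function names, the example probabilities, and the label encoding (0 = legitimate, 1 = phishing) are all assumptions made for the example.

```python
from collections import Counter


def hard_vote(labels):
    """Hard voting: return the class label predicted by most classifiers.

    Ties are broken toward the smallest label (an assumption of this sketch).
    """
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


def soft_vote(probas):
    """Soft voting: average per-class probabilities, then take the argmax."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])


# Hypothetical outputs of three classifiers for a single URL,
# each given as [P(legitimate), P(phishing)].
probas = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]

# Each classifier's own hard prediction is its most probable class.
labels = [max(range(2), key=lambda c: p[c]) for p in probas]

hard = hard_vote(labels)   # two of three classifiers say "legitimate"
soft = soft_vote(probas)   # one very confident classifier shifts the average

print("hard vote:", hard)
print("soft vote:", soft)
```

In this made-up case the two schemes disagree: the majority of classifiers lean weakly toward "legitimate", while the averaged probabilities favor "phishing" because one classifier is highly confident. Cases like this are why the two ensembles can score differently on the same data.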
eu_rights_str_mv openAccess
id Manara_a453cd5cea777eb7b56c8a89ff69bc95
identifier_str_mv 10.6084/m9.figshare.30324559.v1
network_acronym_str Manara
network_name_str ManaraRepo
oai_identifier_str oai:figshare.com:article/30324559
publishDate 2025
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv CC BY 4.0
status_str publishedVersion
title Replication Package of "Battling Phish"
title_full Replication Package of "Battling Phish"
title_fullStr Replication Package of "Battling Phish"
title_full_unstemmed Replication Package of "Battling Phish"
title_short Replication Package of "Battling Phish"
title_sort Replication Package of "Battling Phish"
topic Cybersecurity and privacy not elsewhere classified
Information security management
Phishing URLs