Replication Package of "Battling Phish"
This replication package contains all datasets and source code used in our study on phishing URL detection. The study investigates the effectiveness of traditional machine learning (ML), deep learning (DL), and large language model (LLM)-based methods using a consistent set of 32 URL-based...
| Main author: | |
|---|---|
| Published: | 2025 |
| Subjects: | |
| _version_ | 1852015875717070848 |
|---|---|
| author | Anonymous Author (7372229) |
| author_facet | Anonymous Author (7372229) |
| author_role | author |
| dc.creator.none.fl_str_mv | Anonymous Author (7372229) |
| dc.date.none.fl_str_mv | 2025-10-09T21:32:42Z |
| dc.identifier.none.fl_str_mv | 10.6084/m9.figshare.30324559.v1 |
| dc.relation.none.fl_str_mv | https://figshare.com/articles/dataset/Replication_Package_of_Battling_Phish_/30324559 |
| dc.rights.none.fl_str_mv | CC BY 4.0 info:eu-repo/semantics/openAccess |
| dc.subject.none.fl_str_mv | Cybersecurity and privacy not elsewhere classified Information security management Phishing URLs |
| dc.title.none.fl_str_mv | Replication Package of "Battling Phish" |
| dc.type.none.fl_str_mv | Dataset info:eu-repo/semantics/publishedVersion dataset |
| description | <p dir="ltr">This replication package contains all datasets and source code used in our study on phishing URL detection. The study investigates the effectiveness of traditional machine learning (ML), deep learning (DL), and large language model (LLM)-based methods using a consistent set of 32 URL-based features.</p><h4>Directory Structure</h4><p dir="ltr">├── Datasets/</p><p dir="ltr">│ ├── Dataset-1.csv</p><p dir="ltr">│ ├── Dataset-2.csv</p><p dir="ltr">│ ├── Dataset-3.csv</p><p dir="ltr">│ ├── Dataset-4.csv</p><p dir="ltr">│ ├── Dataset-5.csv</p><p dir="ltr">│ ├── Phishing_Site_URLs_32_Features_Extracted_Data.csv</p><p dir="ltr">│ └── Legit_Phish_32_Features_Extracted_Data.csv</p><p>│</p><p dir="ltr">└── Source_Codes/</p><p dir="ltr"> ├── Feature_extraction_source_code.py</p><p dir="ltr"> ├── Feature_importance_analysis_source_code.py</p><p dir="ltr"> ├── ML/</p><p dir="ltr"> │ ├── Seven_ML_Models_trained_on_LP.py</p><p dir="ltr"> │ ├── Seven_ML_Models_trained_on_PSU.py</p><p dir="ltr"> │ ├── SoftVoting_trained_on_LP.py</p><p dir="ltr"> │ ├── SoftVoting_trained_on_PSU.py</p><p dir="ltr"> │ ├── HardVoting_trained_on_LP.py</p><p dir="ltr"> │ └── HardVoting_trained_on_PSU.py</p><p> │</p><p dir="ltr"> ├── DL/</p><p dir="ltr"> │ ├── [DLModel1]_trained_on_LP.py</p><p dir="ltr"> │ ├── [DLModel1]_trained_on_PSU.py</p><p dir="ltr"> │ └── ... (total 16 files for 8 DL algorithms)</p><p> │</p><p dir="ltr"> └── LLM/</p><p dir="ltr"> ├── BERT_Fine_Tuned_on_LP.py</p><p dir="ltr"> ├── BERT_Fine_Tuned_on_PSU.py</p><p dir="ltr"> ├── DistilBERT_Fine_Tuned_on_LP.py</p><p dir="ltr"> ├── DistilBERT_Fine_Tuned_on_PSU.py</p><p dir="ltr"> ├── PhishBERT_Evaluation.py</p><p dir="ltr"> └── URLBERT_Evaluation.py</p><h2>Datasets:</h2><ul><li><code>Dataset-1.csv</code> to <code>Dataset-5.csv</code>:<br>Used for feature importance analysis.</li><li><code>Phishing_Site_URLs_32_Features_Extracted_Data.csv</code> (PSU dataset):<br>Includes phishing and legitimate URLs with 32 extracted lexical features.</li><li><code>Legit_Phish_32_Features_Extracted_Data.csv</code> (LP dataset):<br>Another benchmark dataset with the same 32 features, used for comparative evaluation.</li></ul><p dir="ltr"><b>Note</b>: PSU and LP datasets are used for both training and evaluating ML, DL, and LLM-based models.</p><h2>Source Code:</h2><ul><li><code><strong>Feature_extraction_source_code.py</strong></code><br>Extracts 32 handcrafted lexical features from raw URL data.</li><li><code><strong>Feature_importance_analysis_source_code.py</strong></code><br>Performs feature selection using seven statistical and model-based ranking methods.</li></ul><h3>Machine Learning (ML)</h3><p dir="ltr">Implements ML classifiers individually trained on LP and PSU datasets:</p><ul><li>Logistic Regression, Decision Tree, Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and XGBoost.</li><li>Soft Voting and Hard Voting ensembles are also implemented.</li></ul><p dir="ltr">Scripts:</p><ul><li><code>Seven_ML_Models_trained_on_LP.py</code></li><li><code>Seven_ML_Models_trained_on_PSU.py</code></li><li><code>SoftVoting_trained_on_LP.py</code>, <code>SoftVoting_trained_on_PSU.py</code></li><li><code>HardVoting_trained_on_LP.py</code>, <code>HardVoting_trained_on_PSU.py</code></li></ul><h3>Deep Learning (DL)</h3><p dir="ltr">Implements eight deep learning architectures (each trained separately on LP and PSU):</p><blockquote><p dir="ltr">Total of 16 scripts: 2 per DL model (1 for LP, 1 for PSU).</p></blockquote><h3>Large Language Models (LLMs)</h3><ul><li><b>Fine-tuned</b>:<ul><li><code>BERT_Fine_Tuned_on_LP.py</code>, <code>BERT_Fine_Tuned_on_PSU.py</code></li><li><code>DistilBERT_Fine_Tuned_on_LP.py</code>, <code>DistilBERT_Fine_Tuned_on_PSU.py</code></li></ul></li><li><b>Pre-trained, zero-shot or direct evaluation</b>:<ul><li><code>PhishBERT_Evaluation.py</code></li><li><code>URLBERT_Evaluation.py</code></li></ul></li></ul> |
| eu_rights_str_mv | openAccess |
| id | Manara_a453cd5cea777eb7b56c8a89ff69bc95 |
| identifier_str_mv | 10.6084/m9.figshare.30324559.v1 |
| network_acronym_str | Manara |
| network_name_str | ManaraRepo |
| oai_identifier_str | oai:figshare.com:article/30324559 |
| publishDate | 2025 |
| repository.mail.fl_str_mv | |
| repository.name.fl_str_mv | |
| repository_id_str | |
| rights_invalid_str_mv | CC BY 4.0 |
| status_str | publishedVersion |
| title | Replication Package of "Battling Phish" |
| title_full | Replication Package of "Battling Phish" |
| title_fullStr | Replication Package of "Battling Phish" |
| title_full_unstemmed | Replication Package of "Battling Phish" |
| title_short | Replication Package of "Battling Phish" |
| title_sort | Replication Package of "Battling Phish" |
| topic | Cybersecurity and privacy not elsewhere classified Information security management Phishing URLs |
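
The description says that `Feature_extraction_source_code.py` extracts 32 handcrafted lexical features from raw URL data, but this record does not list them. The sketch below only illustrates what lexical feature extraction from a URL string can look like. The function name `extract_lexical_features` and all ten feature names are assumptions chosen for illustration. They are not taken from the package and should not be read as the authors' feature set.

```python
from urllib.parse import urlparse


def extract_lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a raw URL string.

    Hypothetical feature set: the 32 features used in the replication
    package are not enumerated in this record.
    """
    parsed = urlparse(url)
    if not parsed.netloc:
        # Without a scheme, urlparse leaves netloc empty; re-parse the
        # string as a network-path reference so the hostname is recovered.
        parsed = urlparse("//" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
        # Labels in the hostname beyond "domain.tld" (a rough heuristic).
        "num_subdomains": max(host.count(".") - 1, 0),
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
        "num_query_params": len([p for p in parsed.query.split("&") if p]),
    }


if __name__ == "__main__":
    example = "http://login.example-bank.com/secure/update?id=1&x=2"
    for name, value in extract_lexical_features(example).items():
        print(f"{name}: {value}")
```

Each feature is computed from the URL string alone, with no network lookups. This is what makes such features "lexical", and it lets the same extraction step feed every model family the package compares.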