Phishing URL Detector
An AI system for detecting phishing URLs with high precision, using ensemble learning and advanced feature engineering.
Overview
Phishing remains one of the most prevalent cybersecurity threats, tricking users into revealing sensitive information via deceptive links. The Phishing URL Detector is a machine-learning solution trained to detect such attempts from structural and behavioral features of the URL.
Problem Statement
Phishing URLs use techniques like:
- Redirection and mimicry of trusted domains
- Homoglyph attacks (e.g., "g00gle.com")
- URL-based fingerprinting and data harvesting
These threats are hard to detect manually or with simple blacklists. We needed an adaptable, intelligent system that evolves with phishing strategies.
Dataset Compilation
Combined and cleaned data from 4 major sources:
UCI Machine Learning Repository (2024)
Kaggle Datasets (2021, 2023)
PhishTank (2021)
Final dataset size: ~297,725 records | Features: 56
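The combine-and-clean step might look roughly like the sketch below. The column names `url` and `label`, the helper `combine_sources`, and the tiny in-memory frames are all hypothetical stand-ins; the project's actual schema and cleaning rules are not given here.

```python
import pandas as pd

def combine_sources(frames):
    """Concatenate per-source frames, tidy the URLs, and drop duplicates."""
    combined = pd.concat(frames, ignore_index=True)
    # Strip stray whitespace so trivially different copies of a URL match.
    # Case is left untouched because uppercase counts are used as a feature.
    combined["url"] = combined["url"].str.strip()
    # Drop rows missing a URL or label, then keep the first copy of each URL.
    combined = combined.dropna(subset=["url", "label"])
    combined = combined.drop_duplicates(subset="url", keep="first")
    return combined.reset_index(drop=True)

# Hypothetical stand-ins for the UCI, Kaggle, and PhishTank exports.
uci = pd.DataFrame({"url": ["http://a.example/login", "http://b.example"], "label": [1, 0]})
kaggle = pd.DataFrame({"url": [" http://a.example/login ", "http://c.example"], "label": [1, 0]})
phishtank = pd.DataFrame({"url": ["http://d.example/verify", None], "label": [1, 1]})

dataset = combine_sources([uci, kaggle, phishtank])
print(dataset)
```

Deduplicating on the URL alone assumes the sources agree on labels; if they can disagree, conflicting rows would need to be resolved before this step.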
Preprocessing & Feature Engineering
Web-crawling to extract uniform features from multiple sources
Added custom features such as URL entropy, dot count, and uppercase letter count
Presence of keywords: login, secure, verify, etc.
Feature selection based on statistical significance
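As an illustration of the custom features listed above, here is a minimal sketch using only the standard library. The function names, the feature names, and the three-word keyword list are assumptions made for this example, not the project's actual code, and entropy is taken to mean character-level Shannon entropy.

```python
import math
from collections import Counter

# Hypothetical keyword list; the project's full list is not given.
SUSPICIOUS_KEYWORDS = ("login", "secure", "verify")

def shannon_entropy(text):
    """Shannon entropy of the string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def extract_features(url):
    """Build a flat feature dict from the raw URL string."""
    lowered = url.lower()
    features = {
        "url_entropy": shannon_entropy(url),
        "dot_count": url.count("."),
        "uppercase_count": sum(ch.isupper() for ch in url),
    }
    # One binary flag per keyword, matched case-insensitively.
    for keyword in SUSPICIOUS_KEYWORDS:
        features[f"has_{keyword}"] = int(keyword in lowered)
    return features

print(extract_features("http://Secure-Login.example.com/verify"))
```

Random-looking URLs tend to score higher on entropy than short, readable ones, which is the usual reason for including it as a feature.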
Modeling
Trained multiple ML classifiers (e.g., Random Forest, XGBoost)
Hyperparameter tuning: GridSearchCV for a basic search, followed by Optuna for more advanced tuning
The final model showed high accuracy with no signs of overfitting.
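A basic grid search, together with one common way to check for overfitting, can be sketched as follows. This is an illustration only: the data is synthetic (generated with `make_classification` to mirror the 56-feature shape), and the parameter grid, fold count, and split ratio are assumptions rather than the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the URL feature matrix.
X, y = make_classification(
    n_samples=400, n_features=56, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small illustrative grid; the real search space is not given.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
    return_train_score=True,  # keep train scores to compare against validation
)
search.fit(X_train, y_train)

best = search.best_index_
train_cv = search.cv_results_["mean_train_score"][best]
valid_cv = search.cv_results_["mean_test_score"][best]
test_acc = search.score(X_test, y_test)  # accuracy of the refit best model

print(f"best params: {search.best_params_}")
print(f"mean train accuracy (CV): {train_cv:.3f}")
print(f"mean validation accuracy (CV): {valid_cv:.3f}")
print(f"held-out test accuracy: {test_acc:.3f}")
```

A large gap between the train and validation scores, or between validation and held-out test accuracy, would suggest overfitting; similar values across all three are what "no signs of overfitting" usually means in practice.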
Results
Multiple models were trained and tuned, including logistic regression, neural networks, and boosting algorithms (XGBoost), achieving 99.47% accuracy.
References
UCI Repository (2024)
Kaggle URL Dataset (2023)
Kaggle URL Dataset (2021)
PhishTank Database
Tech Stack
Python, scikit-learn, pandas, Optuna
BeautifulSoup & Requests for crawling
Jupyter, Matplotlib, Seaborn for analysis