An AI system for detecting phishing URLs using ensemble learning and advanced feature engineering.

Phishing URL Detector

An AI phishing detection system powered by advanced ML pipelines and feature engineering, built to spot malicious URLs with high precision.

Overview

Phishing remains one of the most prevalent cybersecurity threats, tricking users into revealing sensitive information via deceptive links. The Phishing URL Detector is a machine learning-based solution trained to detect such phishing attempts using structural and behavioral URL features.

Problem Statement

Phishing URLs use techniques like:

- Redirection & mimicry of trusted domains
- Homoglyph attacks (e.g., "g00gle.com")
- URL-based fingerprinting and data harvesting

These threats are hard to detect manually or with simple blacklists. We needed an adaptable, intelligent system that evolves with phishing strategies.
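
To make the homoglyph case concrete, here is a minimal sketch of a lookalike check. The substitution map and the brand list are hypothetical and chosen only for illustration; the project's actual detection relies on learned features, not on this rule.

```python
from urllib.parse import urlparse

# Hypothetical lookalike map and brand list, for illustration only.
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})
TRUSTED_BRANDS = {"google", "paypal", "microsoft"}


def looks_like_trusted_brand(url: str) -> bool:
    """Return True when a host label imitates a trusted brand via digit swaps."""
    host = urlparse(url).hostname or ""
    for label in host.lower().split("."):
        normalized = label.translate(LOOKALIKES)
        # A label that matches a brand only after normalization is a lookalike.
        if normalized in TRUSTED_BRANDS and label not in TRUSTED_BRANDS:
            return True
    return False


print(looks_like_trusted_brand("http://g00gle.com/login"))
print(looks_like_trusted_brand("https://www.google.com"))
```

A fixed rule like this only catches the substitutions it knows about, which is why a trained model is the better fit for evolving phishing strategies.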

Dataset Compilation

Combined and cleaned data from 4 major sources:
- UCI Machine Learning Repository (2024)
- Kaggle URL Dataset (2021)
- Kaggle URL Dataset (2023)
- PhishTank (2021)

Final Dataset Size: ~297,725 records | Features: 56

Preprocessing & Feature Engineering

Web crawling to extract uniform features from multiple sources
Added custom features such as:
- URL entropy
- Dot and uppercase letter counts
- Presence of keywords: login, secure, verify, etc.
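
The custom features listed above could be computed roughly as follows. This is a sketch, not the project's code: the function names and feature keys are assumptions, entropy is taken to mean Shannon entropy over the URL's characters, and the keyword tuple contains only the three examples named in the text.

```python
import math
from collections import Counter

# Keywords named in the text; the full list used by the project is not given.
KEYWORDS = ("login", "secure", "verify")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Build a flat feature dictionary for a single URL (hypothetical key names)."""
    lowered = url.lower()
    features = {
        "url_entropy": shannon_entropy(url),
        "dot_count": url.count("."),
        "uppercase_count": sum(ch.isupper() for ch in url),
    }
    # One binary flag per keyword, matched case-insensitively.
    for keyword in KEYWORDS:
        features[f"has_{keyword}"] = int(keyword in lowered)
    return features


print(extract_features("http://Secure-Login.example.com/verify"))
```

Each URL becomes one row of numeric values, so the output can be fed straight into a pandas DataFrame alongside the crawled features.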

Feature selection based on statistical significance
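
The text does not say which significance test was used. As one possible illustration, the sketch below applies a Pearson chi-square test to binary features against the phishing label and keeps those above the 5% critical value for one degree of freedom. The function names, the feature names, and the toy data are all assumptions.

```python
# Critical value of the chi-square distribution with 1 degree of freedom at p = 0.05.
CRITICAL_VALUE_P05 = 3.841


def chi_square_2x2(feature: list, label: list) -> float:
    """Pearson chi-square statistic for a binary feature against a binary label."""
    if len(feature) != len(label):
        raise ValueError("feature and label must have the same length")
    n = len(feature)
    if n == 0:
        return 0.0
    # Observed counts for each (feature value, label value) cell.
    observed = {(f, y): 0 for f in (0, 1) for y in (0, 1)}
    for f, y in zip(feature, label):
        observed[(f, y)] += 1
    stat = 0.0
    for f in (0, 1):
        row_total = observed[(f, 0)] + observed[(f, 1)]
        for y in (0, 1):
            col_total = observed[(0, y)] + observed[(1, y)]
            expected = row_total * col_total / n
            if expected > 0:
                stat += (observed[(f, y)] - expected) ** 2 / expected
    return stat


def select_significant(columns: dict, label: list) -> list:
    """Return the names of features whose statistic exceeds the critical value."""
    return [
        name
        for name, values in columns.items()
        if chi_square_2x2(values, label) > CRITICAL_VALUE_P05
    ]


# Toy data: 10 phishing rows followed by 10 legitimate rows.
label = [1] * 10 + [0] * 10
columns = {"has_login": [1] * 10 + [0] * 10, "has_noise": [1, 0] * 10}
print(select_significant(columns, label))
```

Continuous features such as entropy would need a different test (or binning first); this sketch covers only the binary case.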

Modeling

Trained multiple ML classifiers (e.g., Random Forest, XGBoost)
Used GridSearchCV for basic hyperparameter tuning and Optuna for advanced hyperparameter tuning.
The final model showed high accuracy without overfitting.
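
A minimal sketch of the grid-search step, assuming a Random Forest and a small hypothetical parameter grid. The data here is synthetic (`make_classification`) and stands in for the real URL feature matrix; the project's actual grid, cross-validation settings, and Optuna study are not given in the text. Comparing training and held-out accuracy is one simple way to look for overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the URL feature matrix (an assumption, not the real data).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid, kept small so the search runs quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# A large gap between these two scores would suggest overfitting.
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print(search.best_params_, round(train_acc, 3), round(test_acc, 3))
```

The same train/test split can be reused when a wider search is run with Optuna, so the scores stay comparable.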

Results

Trained and tuned multiple models, including logistic regression, neural networks, and boosting algorithms (XGBoost), achieving 99.47% accuracy.


References

UCI Repository (2024)
Kaggle URL Dataset (2023)
Kaggle URL Dataset (2021)
PhishTank Database


Tech Stack

Python, scikit-learn, pandas, Optuna
BeautifulSoup & Requests for crawling
Jupyter, Matplotlib, Seaborn for analysis
