September 2025
ML-Enhanced Web Crawler
Machine Learning-powered Crawler for Vulnerability Detection
ONUR AKTAS - AHMET BURAK CAN
Short summary
The paper presents an ML-enhanced crawler architecture that exploits features collected during automated browsing to flag pages with a higher likelihood of containing vulnerabilities. The approach aims to reduce manual triage workload and provide actionable priorities for security teams.
Objective
This work proposes a machine learning (ML)-augmented web-crawler architecture designed to detect vulnerabilities in web applications and publicly available web resources. The primary goal is to automatically flag pages likely to contain security issues via binary classification over features extracted during crawling.
Method & Approach
The proposed system consists of two main components:
- Intelligent crawler: Systematically explores target domains and collects signals such as forms, scripts, HTTP headers, and page content.
- Machine learning classifier: Uses an engineered feature set to classify pages as vulnerable or non-vulnerable. Feature engineering includes structural indicators (DOM structure, forms, input elements) and content-based signals (embedded scripts, suspicious patterns).
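The paper does not publish its feature-extraction code; the following is an illustrative sketch of how structural indicators such as form, input, and script counts could be collected from a crawled page using only the Python standard library. The specific feature names are assumptions, not the authors' exact feature set.

```python
from html.parser import HTMLParser


class FeatureExtractor(HTMLParser):
    """Collects simple structural features from an HTML page:
    counts of forms, input-style elements, and script tags."""

    def __init__(self):
        super().__init__()
        self.features = {"forms": 0, "inputs": 0, "scripts": 0}

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.features["forms"] += 1
        elif tag in ("input", "textarea", "select"):
            self.features["inputs"] += 1
        elif tag == "script":
            self.features["scripts"] += 1


def extract_features(html: str) -> dict:
    parser = FeatureExtractor()
    parser.feed(html)
    return parser.features


page = '<form><input name="q"></form><script>var x = 1;</script>'
print(extract_features(page))  # {'forms': 1, 'inputs': 1, 'scripts': 1}
```

In a real system these counts would be combined with content-based signals (e.g. suspicious string patterns in embedded scripts) into the classifier's feature vector.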
Evaluation & Findings
The authors evaluate the framework on representative datasets and crawling scenarios using standard classification metrics: accuracy, precision, recall, and F1-score. The main contribution is the integrated crawler-classifier workflow, which adds an automated filtering layer that helps security analysts prioritize investigation targets.
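For reference, the reported metrics are all derived from the binary confusion matrix. A minimal sketch (the counts below are made-up illustration, not results from the paper):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# e.g. 40 true positives, 10 false positives, 20 false negatives, 130 true negatives
m = classification_metrics(40, 10, 20, 130)
print(m)  # accuracy 0.85, precision 0.80, recall ~0.67, f1 ~0.73
```

In a vulnerability-triage setting, recall on the vulnerable class usually matters most, since a missed vulnerable page is costlier than an extra page for an analyst to review.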
Implementation & Architectural Notes
- Modular crawler design to allow easy extension with plugins or filters.
- Feature extraction pipeline that includes DOM parsing, script analysis, and discovery of endpoints/forms.
- Model update loop: periodic retraining recommended to adapt to new patterns and reduce concept drift.
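The modular, plugin-extensible design described above can be sketched as a small feature pipeline; the `FeaturePipeline` class and plugin signature here are hypothetical, intended only to show how extraction stages could be registered and swapped without touching the crawler core:

```python
from typing import Callable, Dict, List

# A plugin takes the raw page HTML and returns named feature values.
FeaturePlugin = Callable[[str], Dict[str, float]]


class FeaturePipeline:
    """Runs registered feature plugins over a page and merges their output."""

    def __init__(self):
        self.plugins: List[FeaturePlugin] = []

    def register(self, plugin: FeaturePlugin) -> None:
        self.plugins.append(plugin)

    def run(self, html: str) -> Dict[str, float]:
        features: Dict[str, float] = {}
        for plugin in self.plugins:
            features.update(plugin(html))
        return features


pipeline = FeaturePipeline()
# Naive substring-count plugins; a real extractor would parse the DOM.
pipeline.register(lambda html: {"script_count": float(html.count("<script"))})
pipeline.register(lambda html: {"form_count": float(html.count("<form"))})
print(pipeline.run("<form></form><script></script>"))
# {'script_count': 1.0, 'form_count': 1.0}
```

New signals (headers, endpoint discovery, rule-based heuristics) then become additional plugins, and the classifier consumes the merged feature dictionary.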
Limitations & Future Work
The study acknowledges several limitations when applied in real-world settings:
- Class imbalance (few positive/vulnerable examples), which may bias the model toward the majority class.
- Dynamic content (AJAX/SPA) challenges requiring realistic browser automation to capture runtime-generated artifacts.
- Operational cost of false positives and false negatives: human verification remains necessary.
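One common mitigation for the class-imbalance problem above is inverse-frequency class weighting (the scheme scikit-learn calls "balanced"). The paper does not specify its mitigation strategy; this is a generic sketch:

```python
from collections import Counter


def balanced_class_weights(labels) -> dict:
    """Inverse-frequency class weights: weight_c = n_samples / (n_classes * n_c),
    so rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# e.g. 90 non-vulnerable (0) vs 10 vulnerable (1) pages
labels = [0] * 90 + [1] * 10
print(balanced_class_weights(labels))  # {0: ~0.556, 1: 5.0}
```

Misclassifying a rare vulnerable page then costs the model roughly nine times as much as misclassifying a common benign one, counteracting the majority-class bias.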
Practical takeaways
- Provides prioritized candidate pages for security analysts to review.
- Potential to increase discovery efficiency when integrated with traditional scanners.
- Flexible architecture allows swapping ML models or adding rule-based heuristics.