September 2025
ML-Enhanced Web Crawler
Machine Learning-powered Crawler for Vulnerability Detection
ONUR AKTAS - AHMET BURAK CAN
Short summary
The paper presents an ML-enhanced crawler architecture that exploits features collected during automated browsing to flag pages with a higher likelihood of containing vulnerabilities. The approach aims to reduce manual triage workload and provide actionable priorities for security teams.
Objective
This work proposes a machine learning (ML)-augmented web-crawler architecture designed to detect vulnerabilities in web applications and publicly available web resources. The primary goal is to automatically flag pages likely to contain security issues via binary classification over features extracted during crawling.
Method & Approach
The proposed system consists of two main components:
- Intelligent crawler: Systematically explores target domains and collects signals such as forms, scripts, HTTP headers, and page content.
- Machine learning classifier: Uses an engineered feature set to classify pages as vulnerable or non-vulnerable. Feature engineering includes structural indicators (DOM structure, forms, input elements) and content-based signals (embedded scripts, suspicious patterns).
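The paper does not publish its feature-extraction code; the following is an illustrative sketch of how structural indicators such as form, input, and script counts could be collected from a crawled page using only the Python standard library. The specific feature names are assumptions, not the authors' exact feature set.

```python
from html.parser import HTMLParser


class FeatureExtractor(HTMLParser):
    """Collects simple structural features from an HTML page:
    counts of forms, input-style elements, and script tags."""

    def __init__(self):
        super().__init__()
        self.features = {"forms": 0, "inputs": 0, "scripts": 0}

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.features["forms"] += 1
        elif tag in ("input", "textarea", "select"):
            self.features["inputs"] += 1
        elif tag == "script":
            self.features["scripts"] += 1


def extract_features(html: str) -> dict:
    parser = FeatureExtractor()
    parser.feed(html)
    return parser.features


page = '<form><input name="q"></form><script>var x = 1;</script>'
print(extract_features(page))  # {'forms': 1, 'inputs': 1, 'scripts': 1}
```

In a real system these counts would be combined with content-based signals (e.g. suspicious string patterns in embedded scripts) into the classifier's feature vector.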
Evaluation & Findings
The authors evaluate the framework on representative datasets and crawling scenarios using standard classification metrics: accuracy, precision, recall, and F1-score. The main contribution is the integrated crawler-classifier workflow, which adds an automated filtering layer that helps security analysts prioritize investigation targets.
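For reference, the reported metrics are all derived from the binary confusion matrix. A minimal sketch (the counts below are made-up illustration, not results from the paper):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# e.g. 40 true positives, 10 false positives, 20 false negatives, 130 true negatives
m = classification_metrics(40, 10, 20, 130)
print(m)  # accuracy 0.85, precision 0.80, recall ~0.67, f1 ~0.73
```

In a vulnerability-triage setting, recall on the vulnerable class usually matters most, since a missed vulnerable page is costlier than an extra page for an analyst to review.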
Implementation & Architectural Notes
- Modular crawler design to allow easy extension with plugins or filters.
- Feature extraction pipeline that includes DOM parsing, script analysis, and discovery of endpoints/forms.
- Model update loop: periodic retraining recommended to adapt to new patterns and reduce concept drift.
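The modular, plugin-extensible design described above can be sketched as a small feature pipeline; the `FeaturePipeline` class and plugin signature here are hypothetical, intended only to show how extraction stages could be registered and swapped without touching the crawler core:

```python
from typing import Callable, Dict, List

# A plugin takes the raw page HTML and returns named feature values.
FeaturePlugin = Callable[[str], Dict[str, float]]


class FeaturePipeline:
    """Runs registered feature plugins over a page and merges their output."""

    def __init__(self):
        self.plugins: List[FeaturePlugin] = []

    def register(self, plugin: FeaturePlugin) -> None:
        self.plugins.append(plugin)

    def run(self, html: str) -> Dict[str, float]:
        features: Dict[str, float] = {}
        for plugin in self.plugins:
            features.update(plugin(html))
        return features


pipeline = FeaturePipeline()
# Naive substring-count plugins; a real extractor would parse the DOM.
pipeline.register(lambda html: {"script_count": float(html.count("<script"))})
pipeline.register(lambda html: {"form_count": float(html.count("<form"))})
print(pipeline.run("<form></form><script></script>"))
# {'script_count': 1.0, 'form_count': 1.0}
```

New signals (headers, endpoint discovery, rule-based heuristics) then become additional plugins, and the classifier consumes the merged feature dictionary.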
Limitations & Future Work
The study acknowledges several limitations when applied in real-world settings:
- Class imbalance (few positive/vulnerable examples), which may bias the model toward the majority class.
- Dynamic content (AJAX/SPA) challenges requiring realistic browser automation to capture runtime-generated artifacts.
- Operational cost of false positives and false negatives: human verification remains necessary.
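One common mitigation for the class-imbalance problem above is inverse-frequency class weighting (the scheme scikit-learn calls "balanced"). The paper does not specify its mitigation strategy; this is a generic sketch:

```python
from collections import Counter


def balanced_class_weights(labels) -> dict:
    """Inverse-frequency class weights: weight_c = n_samples / (n_classes * n_c),
    so rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# e.g. 90 non-vulnerable (0) vs 10 vulnerable (1) pages
labels = [0] * 90 + [1] * 10
print(balanced_class_weights(labels))  # {0: ~0.556, 1: 5.0}
```

Misclassifying a rare vulnerable page then costs the model roughly nine times as much as misclassifying a common benign one, counteracting the majority-class bias.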
Practical takeaways
- Provides prioritized candidate pages for security analysts to review.
- Potential to increase discovery efficiency when integrated with traditional scanners.
- Flexible architecture allows swapping ML models or adding rule-based heuristics.