September 2025

ML-Enhanced Web Crawler

Machine Learning-powered Crawler for Vulnerability Detection

ONUR AKTAS - AHMET BURAK CAN

    Short summary

    The paper presents an ML-enhanced crawler architecture that uses features collected during automated browsing to flag pages that are more likely to contain vulnerabilities. The approach aims to reduce manual triage workload and give security teams actionable priorities.

    Objective

    This work proposes a machine learning (ML)-augmented web crawler architecture for detecting vulnerabilities in web applications and publicly available web resources. The primary goal is to automatically separate pages that are likely to contain security issues from those that are not, using binary classification over features extracted during crawling.

    Method & Approach

    The proposed system consists of two main components:

    • Intelligent crawler: Systematically explores target domains and collects signals such as forms, scripts, HTTP headers, and page content.
    • Machine learning classifier: Uses an engineered feature set to classify pages as vulnerable or non-vulnerable. Feature engineering includes structural indicators (DOM structure, forms, input elements) and content-based signals (embedded scripts, suspicious patterns).
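
    The structural and content-based signals described above can be sketched as a small feature extractor. This is a minimal illustration using only the Python standard library, not the authors' implementation: the feature names (`forms`, `inputs`, `scripts`, `inline_scripts`, `has_csp_header`) and the function `extract_features` are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Counts a few structural signals in one HTML page (illustrative feature set)."""

    def __init__(self):
        super().__init__()
        self.features = {"forms": 0, "inputs": 0, "scripts": 0, "inline_scripts": 0}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "form":
            self.features["forms"] += 1
        elif tag == "input":
            self.features["inputs"] += 1
        elif tag == "script":
            self.features["scripts"] += 1
            # A <script> without a src attribute carries its code inline.
            if "src" not in attr_map:
                self.features["inline_scripts"] += 1


def extract_features(html, headers):
    """Build a flat feature dict from page HTML and response headers.

    `headers` is assumed to be a mapping of header name to value.
    """
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    features = dict(parser.features)
    # Hypothetical header-based signal: is a Content-Security-Policy present?
    header_names = {name.lower() for name in headers}
    features["has_csp_header"] = int("content-security-policy" in header_names)
    return features
```

    A dictionary like this would then be converted to a numeric vector and passed to the binary classifier. A real crawler would likely need a far richer feature set than the handful shown here.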

    Key technical highlights

    • Focus: Web application vulnerabilities
    • Model: Binary classification (ML)
    • Data: HTML/JS/HTTP metadata collected by the crawler
    • Goal: Reduce manual analysis effort via automated pre-filtering

    Evaluation & Findings

    The authors evaluate the framework on representative datasets and crawling scenarios using standard classification metrics such as accuracy, precision, recall, and F1-score. The main contribution is the defined workflow between crawler and classifier, providing an automated filtering layer to help security analysts prioritize investigation targets.
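
    For reference, the four metrics named above can be computed from the confusion counts of a binary classifier. The sketch below assumes labels are encoded as 1 (vulnerable) and 0 (non-vulnerable); the function name and the sample labels are illustrative and are not results from the paper.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = vulnerable)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # vulnerable, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean, not flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # vulnerable, missed

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for eight crawled pages, purely for illustration.
example = classification_metrics(
    y_true=[1, 0, 0, 1, 0, 0, 0, 1],
    y_pred=[1, 0, 1, 0, 0, 0, 0, 1],
)
```

    When vulnerable pages are rare, accuracy alone can look high even if most vulnerable pages are missed, which is why precision, recall, and F1 are reported alongside it.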

    Implementation & Architectural Notes

    1. Modular crawler design to allow easy extension with plugins or filters.
    2. Feature extraction pipeline that includes DOM parsing, script analysis, and discovery of endpoints/forms.
    3. Model update loop: periodic retraining is recommended to adapt to new patterns and mitigate the effects of concept drift.
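
    One way to picture these three notes together is a registry of pluggable feature extractors plus a simple retraining trigger. Everything in this sketch is hypothetical: the `register` decorator, the two toy extractors, and the retraining thresholds (`min_new_pages`, `max_age_days`) are assumptions for illustration, not details from the paper.

```python
from typing import Callable, Dict, List

Extractor = Callable[[str], Dict[str, float]]

# Registry of feature extractors; new plugins are added by decorating a function.
EXTRACTORS: List[Extractor] = []


def register(extractor: Extractor) -> Extractor:
    """Add an extractor to the pipeline and return it unchanged."""
    EXTRACTORS.append(extractor)
    return extractor


@register
def form_features(html: str) -> Dict[str, float]:
    # Crude substring count; a real plugin would parse the DOM instead.
    return {"form_count": float(html.lower().count("<form"))}


@register
def script_features(html: str) -> Dict[str, float]:
    return {"script_count": float(html.lower().count("<script"))}


def run_pipeline(html: str) -> Dict[str, float]:
    """Run every registered extractor and merge the results into one feature dict."""
    features: Dict[str, float] = {}
    for extractor in EXTRACTORS:
        features.update(extractor(html))
    return features


def should_retrain(new_labeled_pages: int, days_since_training: int,
                   min_new_pages: int = 500, max_age_days: int = 30) -> bool:
    """Hypothetical trigger: retrain when enough new labels arrive or the model is old."""
    return new_labeled_pages >= min_new_pages or days_since_training >= max_age_days
```

    With this layout, adding a new signal means writing one decorated function, and the retraining policy can be tuned without touching the crawler itself.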

    Limitations & Future Work

    The study acknowledges several limitations when applied in real-world settings:

    • Class imbalance: vulnerable (positive) examples are scarce, which may bias the model toward the majority class.
    • Dynamic content (AJAX/SPA): capturing runtime-generated artifacts requires realistic browser automation.
    • False positives and false negatives: both affect operations, so human verification remains necessary.
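
    A common mitigation for the class imbalance noted above is to weight each class inversely to its frequency during training. The sketch below shows that heuristic in isolation. The source does not state that the authors use it, and the 95/5 label split is invented for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so errors on them cost more in training.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Invented example: 95 non-vulnerable pages (0) and 5 vulnerable pages (1).
weights = balanced_class_weights([0] * 95 + [1] * 5)
```

    Here the vulnerable class gets a weight of 10.0 and the non-vulnerable class roughly 0.53. These values could be supplied to any classifier that accepts per-class weights.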

    Practical takeaways

    • Provides prioritized candidate pages for security analysts to review.
    • Potential to increase discovery efficiency when integrated with traditional scanners.
    • Flexible architecture allows swapping ML models or adding rule-based heuristics.
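
    The first and third takeaways can be sketched as a triage queue that ranks pages by classifier score, optionally nudged by rule-based hits. This is a hypothetical illustration: the `PageScore` fields, the `rule_bonus` value, the additive combination, and the example URLs are all assumptions rather than the paper's method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageScore:
    url: str
    model_score: float  # classifier's estimated probability that the page is vulnerable
    rule_hits: int = 0  # number of rule-based heuristics that matched (hypothetical)


def priority(page: PageScore, rule_bonus: float = 0.05) -> float:
    """Combine the model score with a small bonus per rule hit, capped at 1.0."""
    return min(1.0, page.model_score + rule_bonus * page.rule_hits)


def triage_queue(pages: List[PageScore], top_k: int = 10) -> List[PageScore]:
    """Return the top_k pages, highest priority first, for analyst review."""
    return sorted(pages, key=priority, reverse=True)[:top_k]


# Example with made-up pages and scores.
queue = triage_queue(
    [
        PageScore("https://example.test/about", 0.20),
        PageScore("https://example.test/login", 0.70, rule_hits=2),
        PageScore("https://example.test/search", 0.75),
    ],
    top_k=2,
)
```

    Because the model score and the rule bonus are combined in one small function, either part can be replaced without changing how the queue is built.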