September 2025

LPD: Machine Learning-Driven Login Page Detection

Intelligent Login Interface Identification Using Machine Learning

FURKAN COLHAK - ONUR AKTAS

System Overview

The Login Page Detection (LPD) project introduces a machine learning-driven approach to automatically detecting authentication pages across the web. Traditional detection systems often rely on keyword-based heuristics that match strings such as “login”, “signin”, or “password”. These heuristics fail on modern, JavaScript-heavy web applications that use non-standard labels, multi-language interfaces, or hidden form logic.

LPD resolves these limitations through structural DOM analysis, leveraging statistical and semantic patterns within web forms. By focusing on HTML structure rather than text content, it achieves language-agnostic and highly generalizable detection, making it ideal for applications like threat intelligence, credential leak validation, and phishing site classification.

Dataset Construction and Feature Engineering

The foundation of the project is a manually curated dataset containing 2,791 labeled samples collected from real-world websites across multiple industries and languages. Each sample was categorized as either Login or Non-Login based on human verification.

From every page, 58 engineered features were extracted. These features capture structural, attribute-based, and relational properties of the web page:

Input Field Distribution: Count and ratio of password, text, email, and hidden inputs.
Form-Level Metadata: Presence of form tags, method attributes, and action endpoints.
Label–Input Relationships: Association strength between input labels and authentication keywords.
DOM Density & Hierarchy: Depth, sibling ratio, and proximity between password and submit fields.
Token and Button Detection: Existence of captcha, remember-me checkboxes, and login buttons.

Because these features describe page structure rather than wording, detection remains effective even on non-English or visually obfuscated pages.
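
A minimal sketch of how a few structural features of this kind could be computed, using only Python's standard-library `html.parser`. The feature names, the `FormFeatureParser` class, and the sample page are hypothetical illustrations; this covers a small subset of ideas from the categories above and is not LPD's actual 58-feature extractor.

```python
from html.parser import HTMLParser

# Input types counted as form fields (assumed subset, for illustration only)
FIELD_TYPES = ("password", "text", "email", "hidden")
# Void elements never get a closing tag, so they must not change the depth
VOID_TAGS = {"input", "br", "img", "meta", "link", "hr"}


class FormFeatureParser(HTMLParser):
    """Collects a handful of structural counts while the page is parsed."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.n_forms = 0
        self.has_post_form = False
        self.counts = {t: 0 for t in FIELD_TYPES + ("checkbox", "submit")}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)
        if tag == "form":
            self.n_forms += 1
            if (attrs.get("method") or "").lower() == "post":
                self.has_post_form = True
        elif tag == "input":
            # Per the HTML spec, a missing type attribute means "text"
            input_type = (attrs.get("type") or "text").lower()
            if input_type in self.counts:
                self.counts[input_type] += 1
        elif tag == "button":
            # A <button> without a type attribute defaults to "submit"
            if (attrs.get("type") or "submit").lower() == "submit":
                self.counts["submit"] += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def extract_features(html: str) -> dict:
    """Return a small dictionary of structural features for one page."""
    parser = FormFeatureParser()
    parser.feed(html)
    c = parser.counts
    total_fields = sum(c[t] for t in FIELD_TYPES)
    return {
        "n_forms": parser.n_forms,
        "n_password": c["password"],
        "n_text": c["text"],
        "n_email": c["email"],
        "n_hidden": c["hidden"],
        "n_checkbox": c["checkbox"],
        "n_submit": c["submit"],
        "password_ratio": c["password"] / total_fields if total_fields else 0.0,
        "has_post_form": int(parser.has_post_form),
        "max_dom_depth": parser.max_depth,
    }


# Hypothetical Turkish login page: none of the English keywords appear in it
SAMPLE_HTML = """
<html><body><div><form method="post" action="/oturum">
<label for="u">Kullanıcı adı</label><input id="u" type="text" name="u">
<label for="p">Parola</label><input id="p" type="password" name="p">
<input type="hidden" name="csrf" value="x">
<button type="submit">Giriş</button>
</form></div></body></html>
"""

features = extract_features(SAMPLE_HTML)
print(features)
```

The sample page contains none of the English keywords a heuristic would look for, yet the password field, POST form, and submit button are still visible in the extracted counts.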

Machine Learning Methodology

Six algorithms were trained and evaluated under a 5-fold stratified cross-validation strategy:

Logistic Regression (LR)
Support Vector Machine (SVM)
K-Nearest Neighbors (KNN)
Decision Tree (DT)
Gradient Boosting (GB)
Random Forest (RF)

Both GridSearchCV and RandomizedSearchCV were applied for hyperparameter tuning, targeting optimal tradeoffs between precision, recall, and model complexity.

After extensive testing, Random Forest emerged as the most balanced and interpretable model, outperforming others in both generalization and stability.
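
The evaluation setup described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's actual training script: the data is synthetic (generated with `make_classification`, reduced to 600 rows so it runs quickly), the search space is made up, and F1 is assumed as the scoring metric because it balances precision and recall. The source does not state which metric or parameter ranges were used.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data: 58 columns mirrors the feature count in the text,
# but the values have nothing to do with real login pages.
X, y = make_classification(
    n_samples=600, n_features=58, n_informative=10, random_state=0
)

# 5-fold stratified CV keeps the Login/Non-Login ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical search space; the ranges used in the project are not published
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,            # kept small so the sketch finishes in seconds
    scoring="f1",        # assumed metric, see the note above
    cv=cv,
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("mean cross-validated F1:", round(search.best_score_, 3))
```

`GridSearchCV` accepts the same estimator and `cv` object but takes an explicit `param_grid` of lists and tries every combination, which is why a randomized search is often the cheaper first pass.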

Performance Results

Accuracy (Random Forest): 90.56%
Precision: 0.91
Recall: 0.90
Cross-Validation Scheme: 5-Fold
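
For reference, accuracy, precision, and recall follow directly from confusion-matrix counts. The sketch below uses hypothetical counts chosen only to show the arithmetic; they are not the project's confusion matrix, which is not given in this write-up.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp: login pages predicted as login      fp: non-login predicted as login
    tn: non-login predicted as non-login    fn: login pages that were missed
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for a 200-page test fold (illustration only)
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
print(metrics)
```

Precision answers "of the pages flagged as login, how many really were", while recall answers "of the real login pages, how many were found".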

While SVM and Gradient Boosting achieved competitive baseline results (around 87–88% accuracy), Random Forest consistently outperformed them post-tuning. The model’s interpretability was a major advantage, allowing detailed feature importance analysis that revealed form-related features contributed most to accurate detection.
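
A feature importance analysis of this kind can be sketched with scikit-learn's impurity-based `feature_importances_` attribute. The data, feature names, and labeling rule below are synthetic assumptions; the source does not say which importance measure the authors used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical feature names, for illustration only
feature_names = ["n_password_inputs", "n_forms", "max_dom_depth", "n_links"]

# Synthetic pages: each feature is a small random count
X = rng.integers(0, 5, size=(500, 4)).astype(float)
# Synthetic rule: a page is "Login" exactly when it has a password input
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:20s} {importance:.3f}")
```

Because the synthetic label depends only on the first column, that feature ranks first here; on real data the ranking is what reveals which structural signals the model relies on.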

Key Findings and Insights

Structural DOM features outperform text-based keyword methods in multi-language environments.
Form layout, field density, and button hierarchies are the most decisive predictors.
Hyperparameter tuning using RandomizedSearchCV reduced false positives by 18%.
Heuristic detection alone misclassified ~27% of single-page apps due to JavaScript rendering delays.

These findings emphasize the importance of combining DOM-based feature extraction with adaptive ML pipelines for scalable detection across the modern web.

Future Research and Improvements

Dynamic Rendering: Integrate headless browser rendering (e.g., Puppeteer/Playwright) for JS-heavy sites.
Continuous Learning: Periodically retrain models with live web data from active crawlers.
Deep Learning Exploration: Evaluate CNN/LSTM models on visual login patterns and page screenshots.
Phishing Detection Integration: Combine LPD with URL risk scoring for real-time domain classification.

Conclusion

LPD validates that machine learning can deliver a scalable, accurate, and language-agnostic solution for login page detection, outperforming conventional rule-based methods by a significant margin.

The model not only advances the accuracy of login page identification but also lays the groundwork for automated phishing analysis, credential exposure validation, and large-scale threat monitoring systems.

The complete model, dataset summary, and deployment guide are freely available at free-security-tools.
