September 2025
LPD: Machine Learning-Driven Login Page Detection
Intelligent Login Interface Identification Using Machine Learning
FURKAN COLHAK - ONUR AKTAS
System Overview
The Login Page Detection (LPD) project introduces a machine learning–driven approach to automatically detecting authentication pages across the web. Traditional detection systems often rely on keyword-based heuristics that search for terms such as “login”, “signin”, or “password”. These heuristics fail on modern, JavaScript-heavy web applications that use non-standard labels, multi-language interfaces, or hidden form logic.
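To make this limitation concrete, the following is a minimal sketch of a keyword heuristic of the kind described above. The keyword list and the function name `keyword_heuristic` are illustrative assumptions and are not taken from any specific tool.

```python
import re

# Illustrative keyword list (assumption); real heuristic detectors vary.
LOGIN_KEYWORDS = ("login", "log in", "signin", "sign in", "password")


def keyword_heuristic(html: str) -> bool:
    """Return True if any login keyword appears in the page text."""
    # Crude tag stripping: only the text between tags is inspected.
    text = re.sub(r"<[^>]+>", " ", html).lower()
    return any(keyword in text for keyword in LOGIN_KEYWORDS)


english_page = (
    '<form><label>Password</label><input type="password">'
    "<button>Sign in</button></form>"
)
german_page = (
    '<form><label>Kennwort</label><input type="password">'
    "<button>Anmelden</button></form>"
)

print(keyword_heuristic(english_page))  # True
print(keyword_heuristic(german_page))   # False
```

Both pages contain a password input, but the text-only check sees it only when the visible labels happen to be in English.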
LPD addresses these limitations through structural DOM analysis, leveraging statistical and semantic patterns within web forms. By focusing on HTML structure rather than text content, it achieves language-agnostic detection that generalizes across sites, which makes it well suited to applications like threat intelligence, credential leak validation, and phishing site classification.
Dataset Construction and Feature Engineering
The foundation of the project is a manually curated dataset containing 2,791 labeled samples collected from real-world websites across multiple industries and languages. Each sample was categorized as either Login or Non-Login based on human verification.
From every page, 58 engineered features were extracted. These features represent structural, attribute-based, and relational properties of the web page:
- Input Field Distribution: Count and ratio of password, text, email, and hidden inputs.
- Form-Level Metadata: Presence of form tags, method attributes, and action endpoints.
- Label–Input Relationships: Association strength between input labels and authentication keywords.
- DOM Density & Hierarchy: Depth, sibling ratio, and proximity between password and submit fields.
- Token and Button Detection: Existence of captcha, remember-me checkboxes, and login buttons.
Because these features describe page structure rather than wording, the design is intended to keep detection reliable even on non-English or visually obfuscated pages.
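As an illustration of the feature categories listed above, here is a minimal sketch that derives a handful of structural features from raw HTML using only Python's standard library. The article does not publish the exact definitions of the 58 features, so the feature names (`n_password`, `password_ratio`, `max_depth`, and so on) and the class `FormFeatureParser` are hypothetical. The depth tracking also assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FormFeatureParser(HTMLParser):
    """Collects a few structural counts; all feature names are hypothetical."""

    def __init__(self):
        super().__init__()
        self.input_counts = {"password": 0, "text": 0, "email": 0, "hidden": 0}
        self.total_inputs = 0
        self.form_count = 0
        self.post_forms = 0
        self.submit_controls = 0
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)
        if tag == "form":
            self.form_count += 1
            if (attrs.get("method") or "").lower() == "post":
                self.post_forms += 1
        elif tag == "input":
            # An <input> without a type attribute behaves as a text field.
            input_type = (attrs.get("type") or "text").lower()
            self.total_inputs += 1
            if input_type in self.input_counts:
                self.input_counts[input_type] += 1
            if input_type == "submit":
                self.submit_controls += 1
        elif tag == "button":
            # A <button> without a type attribute defaults to submit.
            if (attrs.get("type") or "submit").lower() == "submit":
                self.submit_controls += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def extract_features(html: str) -> dict:
    """Parse one page and return a flat dictionary of numeric features."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    total = parser.total_inputs
    n_password = parser.input_counts["password"]
    return {
        "n_password": n_password,
        "n_text": parser.input_counts["text"],
        "n_email": parser.input_counts["email"],
        "n_hidden": parser.input_counts["hidden"],
        "password_ratio": n_password / total if total else 0.0,
        "has_form": int(parser.form_count > 0),
        "has_post_form": int(parser.post_forms > 0),
        "n_submit": parser.submit_controls,
        "max_depth": parser.max_depth,
    }


sample = """
<html><body>
  <div class="card">
    <form method="post" action="/session">
      <input type="email" name="u">
      <input type="password" name="p">
      <input type="hidden" name="csrf" value="x">
      <button>Anmelden</button>
    </form>
  </div>
</body></html>
"""
features = extract_features(sample)
print(features)
```

None of these features depends on the language of the visible text: the German button label in the sample does not affect the result.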
Machine Learning Methodology
Six algorithms were trained and evaluated under a 5-fold stratified cross-validation strategy:
- Logistic Regression (LR)
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
- Decision Tree (DT)
- Gradient Boosting (GB)
- Random Forest (RF)
Both GridSearchCV and RandomizedSearchCV from scikit-learn were applied for hyperparameter tuning, targeting the best tradeoff between precision, recall, and model complexity.
After extensive testing, Random Forest emerged as the most balanced and interpretable model, outperforming others in both generalization and stability.
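The evaluation setup described in this section could be sketched with scikit-learn roughly as follows. This is not the authors' pipeline: the data is a synthetic stand-in generated with `make_classification`, only two of the six model families are shown, and the search space and the choice of F1 as the tuning metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     cross_val_score)

# Synthetic stand-in for the real dataset (assumption): 58 numeric features.
X, y = make_classification(n_samples=400, n_features=58, n_informative=10,
                           random_state=0)

# 5-fold stratified cross-validation keeps the class ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline comparison of two of the six model families.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")

# Randomized search over a small, hypothetical Random Forest search space.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled configurations
    cv=cv,
    scoring="f1",      # assumed metric balancing precision and recall
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Swapping `RandomizedSearchCV` for `GridSearchCV` (with `param_grid` instead of `param_distributions` and no `n_iter`) evaluates every combination instead of a random sample, at a higher compute cost.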
Performance Results
While SVM and Gradient Boosting achieved competitive baseline results (around 87–88% accuracy), Random Forest consistently outperformed them after tuning. Its interpretability was a major advantage: feature importance analysis showed that form-related features contributed most to accurate detection.
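A feature importance analysis of this kind can be sketched with the `feature_importances_` attribute of a fitted scikit-learn Random Forest. The data below is synthetic and the feature names are generic placeholders, so the resulting ranking only demonstrates the mechanics and says nothing about the real LPD features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names (assumption); the article does not list the real features.
feature_names = [f"feature_{i}" for i in range(6)]

# Synthetic data standing in for the extracted page features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one value per feature, normalized to sum to 1.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:12s} {importance:.3f}")
```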
Key Findings and Insights
- Structural DOM features outperform text-based keyword methods in multi-language environments.
- Form layout, field density, and button hierarchies are the most decisive predictors.
- Hyperparameter tuning using RandomizedSearchCV reduced false positives by 18%.
- Heuristic detection alone misclassified ~27% of single-page apps due to JavaScript rendering delays.
These findings emphasize the importance of combining DOM-based feature extraction with adaptive ML pipelines for scalable detection across the modern web.
Future Research and Improvements
- Dynamic Rendering: Integrate headless browser rendering (e.g., Puppeteer/Playwright) for JS-heavy sites.
- Continuous Learning: Periodically retrain models with live web data from active crawlers.
- Deep Learning Exploration: Evaluate CNN/LSTM models on visual login patterns and page screenshots.
- Phishing Detection Integration: Combine LPD with URL risk scoring for real-time domain classification.
Conclusion
LPD demonstrates that machine learning can deliver a scalable, accurate, and language-agnostic solution for login page detection, outperforming conventional rule-based methods by a significant margin.
The model not only advances the accuracy of login page identification but also lays the groundwork for automated phishing analysis, credential exposure validation, and large-scale threat monitoring systems.
The complete model, dataset summary, and deployment guide are freely available at free-security-tools.