All accepted publications by SPARTA partners produced under SPARTA funding.
Machine learning methods are now widely used to detect a wide range of cyberattacks. Nevertheless, the commonly used algorithms come with challenges of their own, one of which lies in the characteristics of network datasets. To achieve adequate results, a dataset should be well balanced in terms of the number of malicious samples versus benign traffic samples. When the data is not balanced, numerous machine learning approaches tend to classify minority class samples as majority class samples. Because network traffic data usually contains significantly fewer malicious samples than benign ones, this work addresses the problem of learning from imbalanced network traffic data in the cybersecurity domain. A number of balancing approaches are evaluated, along with their impact on different machine learning algorithms.
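
As a rough illustration of what a balancing approach does, the sketch below applies random oversampling, one common technique, to a toy dataset. The abstract does not name the approaches that were evaluated, so this is not necessarily one of the authors' methods. The function name `random_oversample` and the toy flow records are hypothetical and chosen only for this example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is grown to the size of the largest class.
    target = max(len(xs) for xs in by_class.values())

    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Toy data (made up for illustration): 8 benign flows, 2 malicious flows.
X = [[i, i * 2] for i in range(10)]
y = ["benign"] * 8 + ["malicious"] * 2

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

After the call, both classes contain 8 samples. Oversampling should normally be applied to the training split only, so that duplicated malicious samples do not leak into the evaluation data.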