Makine Öğrenmesi Teknikleri Kullanılarak Sybil Botların Tespit Edilmesi (Detection of Sybil Bots Using Machine Learning Techniques)
Date
2025
Abstract
This study comparatively evaluates the performance of several machine learning algorithms for network-based anomaly detection on the NSL-KDD dataset. NSL-KDD groups attack types into four main categories (DoS, Probe, R2L, U2R) and was treated as suitable for supervised learning because of its labeled and balanced structure. Statistical analysis and exploratory data analysis were carried out on the dataset first, followed by data preprocessing: categorical variables were converted to numerical form, missing values were removed, and the minority classes were balanced with the SMOTE technique. For feature selection, the Mutual Information (MI) method was used to identify the 15 most informative variables, and the models were trained on these features. The models were then retrained on all available features, and the two sets of results were compared. The algorithms evaluated were Logistic Regression, Naive Bayes, Random Forest, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), AdaBoost, and an Artificial Neural Network (ANN). Hyperparameters for each model were optimized with GridSearchCV or RandomizedSearchCV. Model performance was assessed with metrics including accuracy, precision, recall, and F1-score. The results indicate that some models achieved high accuracy on dominant classes such as DoS, while performance dropped significantly on the underrepresented R2L and U2R attack types. These findings show that methods applied to imbalanced datasets must be chosen carefully.
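The gap between overall accuracy and minority-class performance described in the abstract can be illustrated with a minimal sketch. It computes accuracy alongside per-class precision, recall, and F1 in plain Python. The function name `per_class_report` and the toy labels are assumptions made for illustration only; they are not the thesis code, and the numbers are not results from the study.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, report


# Hypothetical toy labels (NOT data or results from the thesis):
# 90 DoS, 8 R2L, and 2 U2R records, with most minority records
# misclassified as the dominant DoS class.
y_true = ["DoS"] * 90 + ["R2L"] * 8 + ["U2R"] * 2
y_pred = ["DoS"] * 90 + ["DoS"] * 6 + ["R2L"] * 2 + ["DoS"] * 2

accuracy, report = per_class_report(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

In this toy case the overall accuracy is high because the dominant class is classified correctly, while recall for the two minority classes stays low. This is why per-class metrics are reported next to accuracy on imbalanced data.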
Keywords
Computer Engineering and Computer Science and Control
End Page
59