Open access

In Search of netUnicorn: A Data-Collection Platform to Develop Generalizable ML Models for Network Security Problems

Published: 21 November 2023
  Abstract

    The remarkable success of machine learning-based solutions for network security problems has been impeded by the inability of the developed ML models to maintain their efficacy when used in network environments that exhibit different network behaviors. This issue is commonly referred to as the generalizability problem of ML models. The community has recognized the critical role that training datasets play in this context and has developed various techniques to improve dataset curation to overcome this problem. Unfortunately, these methods are generally ill-suited or even counterproductive in the network security domain, where they often result in unrealistic or poor-quality datasets.
    To address this issue, we propose a new closed-loop ML pipeline that leverages explainable ML tools to guide network data collection in an iterative fashion. To ensure the data's realism and quality, we require that new datasets be endogenously collected in this iterative process, thus advocating a gradual removal of data-related problems to improve model generalizability. To realize this capability, we develop a data-collection platform, netUnicorn, that takes inspiration from the classic "hourglass" model and is implemented as its "thin waist" to simplify data collection for different learning problems from diverse network environments. The proposed system decouples data-collection intents from the deployment mechanisms and disaggregates these high-level intents into smaller, reusable, self-contained tasks. We demonstrate how netUnicorn simplifies collecting data for different learning problems from multiple network environments and how the proposed iterative data collection improves a model's generalizability.
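    To make the two ideas above concrete, the following is a minimal, self-contained Python sketch of the intent/task decoupling and of the explainability-driven collection loop. It is a hypothetical illustration, not the actual netUnicorn API: the names Task, Intent, deploy, closed_loop, and train_and_explain, as well as the example task and node names, are assumptions introduced only for this sketch.

        from dataclasses import dataclass, field
        from typing import Callable, Dict, List


        @dataclass
        class Task:
            """A small, self-contained unit of work executed on a data-collection node."""
            name: str
            run: Callable[[], Dict]


        @dataclass
        class Intent:
            """A high-level data-collection intent, disaggregated into reusable tasks."""
            name: str
            tasks: List[Task] = field(default_factory=list)

            def add(self, task: Task) -> "Intent":
                self.tasks.append(task)
                return self


        def deploy(intent: Intent, nodes: List[str]) -> List[Dict]:
            """Stand-in for the deployment mechanism: the same intent runs unchanged
            on every node, regardless of the underlying environment."""
            results = []
            for node in nodes:
                for task in intent.tasks:
                    record = dict(task.run())
                    record.update({"node": node, "task": task.name})
                    results.append(record)
            return results


        def closed_loop(intent: Intent, nodes: List[str],
                        train_and_explain: Callable[[List[Dict]], List[str]],
                        max_rounds: int = 3) -> List[Dict]:
            """Iterative collect-train-explain loop: explainability output flags
            problematic features, and the next round's intent is adjusted so that
            the newly (endogenously) collected data no longer carries them."""
            dataset: List[Dict] = []
            for _ in range(max_rounds):
                dataset.extend(deploy(intent, nodes))
                flagged = train_and_explain(dataset)
                if not flagged:
                    break
                intent = intent.add(
                    Task("mitigate-" + flagged[0], lambda: {"mitigated": True}))
            return dataset


        # Reusable placeholder tasks (e.g., start a capture, generate traffic, upload).
        start_capture = Task("start-capture", lambda: {"status": "capture-started"})
        generate_traffic = Task("generate-traffic", lambda: {"flows": 100})
        upload_data = Task("upload-data", lambda: {"status": "uploaded"})

        intent = (Intent("collect-https-traffic")
                  .add(start_capture)
                  .add(generate_traffic)
                  .add(upload_data))

        print(deploy(intent, nodes=["campus-node-1", "cloud-node-1"]))

    In the real system, reusable tasks come from the netUnicorn task library (https://github.com/netunicorn/netunicorn-library) and the deployment mechanism targets actual infrastructure such as containers or physical nodes (https://github.com/netunicorn/netunicorn); the sketch only mirrors the separation of concerns, not the concrete interfaces.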


    Cited By

    • AgraBOT: Accelerating Third-Party Security Risk Management in Enterprise Setting through Generative AI. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (2024), pp. 74-79. https://doi.org/10.1145/3663529.3663829. Online publication date: 10 July 2024.


          Published In

          CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security
          November 2023
          3722 pages
          ISBN: 9798400700507
          DOI: 10.1145/3576915
          This work is licensed under a Creative Commons Attribution International 4.0 License.


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 21 November 2023


          Author Tags

          1. artificial intelligence
          2. data collection
          3. generalizability
          4. machine learning
          5. network security

          Qualifiers

          • Research-article


          Conference

          CCS '23

          Acceptance Rates

          Overall acceptance rate: 1,261 of 6,999 submissions (18%)



          Article Metrics

          • Downloads (last 12 months): 578
          • Downloads (last 6 weeks): 98

