Open access

In Search of netUnicorn: A Data-Collection Platform to Develop Generalizable ML Models for Network Security Problems

Published: 21 November 2023
  Abstract

    The remarkable success of machine learning-based solutions for network security problems has been impeded by the inability of the developed ML models to maintain their efficacy when used in network environments that exhibit different network behaviors. This issue is commonly referred to as the generalizability problem of ML models. The community has recognized the critical role that training datasets play in this context and has developed various techniques to improve dataset curation to overcome this problem. Unfortunately, these methods are generally ill-suited or even counterproductive in the network security domain, where they often result in unrealistic or poor-quality datasets.
    To address this issue, we propose a new closed-loop ML pipeline that leverages explainable ML tools to guide network data collection in an iterative fashion. To ensure the data's realism and quality, we require that new datasets be endogenously collected in this iterative process, thus advocating a gradual removal of data-related problems to improve model generalizability. To realize this capability, we develop a data-collection platform, netUnicorn, that takes inspiration from the classic "hourglass" model and is implemented as its "thin waist" to simplify data collection for different learning problems from diverse network environments. The proposed system decouples data-collection intents from the deployment mechanisms and disaggregates these high-level intents into smaller, reusable, self-contained tasks. We demonstrate how netUnicorn simplifies collecting data for different learning problems from multiple network environments and how the proposed iterative data collection improves a model's generalizability.
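    To make the two ideas above concrete, the following is a minimal, self-contained Python sketch of the intent/task decoupling and of the explainability-driven collection loop. It is a hypothetical illustration, not the actual netUnicorn API: the names Task, Intent, deploy, closed_loop, and train_and_explain, as well as the example task and node names, are assumptions introduced only for this sketch.

        from dataclasses import dataclass, field
        from typing import Callable, Dict, List


        @dataclass
        class Task:
            """A small, self-contained unit of work executed on a data-collection node."""
            name: str
            run: Callable[[], Dict]


        @dataclass
        class Intent:
            """A high-level data-collection intent, disaggregated into reusable tasks."""
            name: str
            tasks: List[Task] = field(default_factory=list)

            def add(self, task: Task) -> "Intent":
                self.tasks.append(task)
                return self


        def deploy(intent: Intent, nodes: List[str]) -> List[Dict]:
            """Stand-in for the deployment mechanism: the same intent runs unchanged
            on every node, regardless of the underlying environment."""
            results = []
            for node in nodes:
                for task in intent.tasks:
                    record = dict(task.run())
                    record.update({"node": node, "task": task.name})
                    results.append(record)
            return results


        def closed_loop(intent: Intent, nodes: List[str],
                        train_and_explain: Callable[[List[Dict]], List[str]],
                        max_rounds: int = 3) -> List[Dict]:
            """Iterative collect-train-explain loop: explainability output flags
            problematic features, and the next round's intent is adjusted so that
            the newly (endogenously) collected data no longer carries them."""
            dataset: List[Dict] = []
            for _ in range(max_rounds):
                dataset.extend(deploy(intent, nodes))
                flagged = train_and_explain(dataset)
                if not flagged:
                    break
                intent = intent.add(
                    Task("mitigate-" + flagged[0], lambda: {"mitigated": True}))
            return dataset


        # Reusable placeholder tasks (e.g., start a capture, generate traffic, upload).
        start_capture = Task("start-capture", lambda: {"status": "capture-started"})
        generate_traffic = Task("generate-traffic", lambda: {"flows": 100})
        upload_data = Task("upload-data", lambda: {"status": "uploaded"})

        intent = (Intent("collect-https-traffic")
                  .add(start_capture)
                  .add(generate_traffic)
                  .add(upload_data))

        print(deploy(intent, nodes=["campus-node-1", "cloud-node-1"]))

    In the real system, reusable tasks come from the netUnicorn task library (https://github.com/netunicorn/netunicorn-library) and the deployment mechanism targets actual infrastructure such as containers or physical nodes (https://github.com/netunicorn/netunicorn); the sketch only mirrors the separation of concerns, not the concrete interfaces.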


    Cited By

    • AgraBOT: Accelerating Third-Party Security Risk Management in Enterprise Setting through Generative AI. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (2024), pp. 74-79. https://doi.org/10.1145/3663529.3663829. Online publication date: 10 July 2024.


          Published In

          CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security
          November 2023
          3722 pages
          ISBN: 9798400700507
          DOI: 10.1145/3576915
          This work is licensed under a Creative Commons Attribution International 4.0 License.


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 21 November 2023


          Author Tags

          1. artificial intelligence
          2. data collection
          3. generalizability
          4. machine learning
          5. network security

          Qualifiers

          • Research-article


          Conference

          CCS '23

          Acceptance Rates

          Overall acceptance rate: 1,261 of 6,999 submissions (18%)



          Article Metrics

          • Downloads (last 12 months): 578
          • Downloads (last 6 weeks): 98

