Model Assessment
Compared general-purpose and cybersecurity-specific open-source models on tasks such as security interpretation, alert enrichment, and context-aware response generation.
A research project exploring how open-source and hosted LLMs can support cybersecurity workflows, including alert enrichment, synthetic data generation, retrieval-augmented context building, and anomaly detection over industrial network data.
Compared general-purpose and cybersecurity-specific open-source models on tasks such as security interpretation, alert enrichment, and context-aware response generation.
Built an enrichment flow around alerts, assets, vulnerabilities, and synthetic records so a model could generate summaries, remediation guidance, and contributing factors.
Tested LLM-assisted anomaly classification on industrial network-derived NetFlows, comparing in-context learning, fine-tuning, and reasoning-first prompting.
The work surveyed practical AI use cases for cybersecurity teams, focusing on model accessibility, local deployment constraints, domain specialization, and how LLM-assisted reasoning might improve analysis of alerts, assets, and vulnerabilities.
Two tracks were explored in parallel: Hugging Face open-source model assessment for alert enrichment and security-data interpretation, and anomaly detection experiments using Azure OpenAI workflows on network-flow data derived from industrial traffic captures.
Filtered cybersecurity-domain models and narrowed the field to candidates that were realistic to run, compare, and evaluate in a security-research workflow.
Focused on quantized 4-bit models that could be deployed locally with GPU support, reducing hardware cost while keeping experiments practical.
Tested whether models could read alert context, asset data, and vulnerability records and turn that information into useful analyst-facing summaries.
The first major use case combined alerts, assets, and vulnerability context into a security-enrichment pipeline. Synthetic datasets were created to avoid confidentiality and integrity risks while still giving the models realistic security data to reason over.
Domain-tuned models such as Lily Cybersecurity and SecurityLLM outperformed the baseline Mistral run in the presentation’s qualitative results, even though all tested models still carried heavy inference costs.



The second track used the UNB CIC Modbus 2023 dataset, converting PCAPs to NetFlows and then reformatting those flows into structured LLM prompts. The goal was to determine whether LLMs could classify benign versus anomalous industrial traffic under different prompting and fine-tuning strategies.
Provided labeled examples of normal and anomalous NetFlows directly in the prompt to guide classification. It performed well on recall but remained constrained by token limits.
Fine-tuned a hosted model on labeled NetFlow data to avoid prompt-size limits. It scaled example volume better, but the measured classification quality was weaker than the best prompt-driven runs.
Asked the model to explain its reasoning before producing a prediction, reducing post-hoc hallucinated justifications and yielding the strongest overall balance in the deck’s reported results.


The strongest results came from domain-tuned cybersecurity models and from prompt structures that force reasoning before prediction, especially when the task depends on structured evidence and contextual security knowledge.
This project sharpened practical thinking around model selection, quantization, prompt design, synthetic data safety, and how to evaluate whether an AI workflow is actually useful for analysts instead of only sounding impressive in a demo.