Malware family clustering plays a crucial role in many security tasks, including malware analysis, classification, labeling, triage, threat hunting, and lineage studies. This work examines how 11 popular static similarity features influence malware family clustering. These features include whole-file fuzzy hashes (e.g., SSDeep, TLSH), structural hashes (e.g., PE Hash, Import Hash, VirusTotal’s VHash), certificate-based features, and icon-based features. Our goal is not to propose new features or clustering approaches. Instead, we aim to measure how often these 11 features make clustering errors, i.e., cluster together samples belonging to different malware families. We also investigate the root causes behind those errors, which often lead to misinterpretations of malware relationships, hinder effective threat detection, and propagate inaccuracies in downstream analyses. To study this phenomenon, we leverage three public datasets comprising 79,993 labeled Windows malware samples. We cluster those samples using each of the analyzed features, measure the accuracy of the resulting clusters with a focus on precision, and examine why some clusters contain samples from different families. Our analysis identifies intrinsic limitations of some of the features and highlights the severe impact of EXE-building tools (like software protectors, installers, and self-extracting archives) on malware clustering. Finally, we discuss mitigations and evaluate potential improvements to address the problems we observed. Our findings provide a critical foundation for improving static malware clustering methodologies by emphasizing the importance of dataset curation and feature refinement for robust and precise clustering outcomes.
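The measurement described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: samples are grouped by exact equality of a feature value (a reasonable fit for structural hashes such as an import hash, whereas fuzzy hashes would need a similarity threshold instead), precision is computed with the common majority-family definition, and all sample identifiers, hash values, and helper names are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_by_feature(samples):
    """Group sample IDs by exact feature value.

    `samples` is an iterable of (sample_id, feature_value, family) tuples.
    Samples without the feature (value None) are left unclustered.
    """
    clusters = defaultdict(list)
    for sample_id, feature_value, _family in samples:
        if feature_value is not None:
            clusters[feature_value].append(sample_id)
    return dict(clusters)


def clustering_precision(clusters, family_of):
    """Fraction of clustered samples that belong to their cluster's majority family.

    This is an assumed, commonly used definition; the paper's exact
    metric may differ.
    """
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        return 0.0
    majority = sum(
        Counter(family_of[s] for s in members).most_common(1)[0][1]
        for members in clusters.values()
    )
    return majority / total


def mixed_clusters(clusters, family_of):
    """Return only the clusters that contain more than one family."""
    return {
        value: members
        for value, members in clusters.items()
        if len({family_of[s] for s in members}) > 1
    }


# Hypothetical toy data: (sample_id, feature_value, family label).
samples = [
    ("s1", "hash_a", "family_x"),
    ("s2", "hash_a", "family_x"),
    ("s3", "hash_a", "family_y"),  # shares hash_a with a different family
    ("s4", "hash_b", "family_y"),
    ("s5", None, "family_x"),      # feature missing, not clustered
]
family_of = {sample_id: family for sample_id, _value, family in samples}

clusters = cluster_by_feature(samples)
precision = clustering_precision(clusters, family_of)
errors = mixed_clusters(clusters, family_of)
```

In this toy input, the cluster for `hash_a` mixes two families, which is the kind of clustering error the study counts; inspecting the members of such mixed clusters is where a root-cause analysis would begin.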
Family ties: A close look at the influence of static features on the precision of malware family clustering
APWG eCrime 2025, Cybercrimes Only AI and Crimebots can Dream of, 4-7 November 2025, San Diego, USA
Type:
Conference
City:
San Diego
Date:
2025-11-04
Department:
Digital Security
Eurecom Ref:
8519
Copyright:
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in APWG eCrime 2025, Cybercrimes Only AI and Crimebots can Dream of, 4-7 November 2025, San Diego, USA and is available at:
PERMALINK: https://www.eurecom.fr/publication/8519