In my previous blog post, I discussed our endeavor to benefit from unsupervised learning on CyberPoint's malware dataset. One of the more intriguing tools I played with during that effort was the normalized compression distance (NCD).
NCD1 was introduced in 2004 as a universal distance function, with two very promising features:
- It's applicable to any sort of data you might want to analyze — without requiring any knowledge of the particulars of the data format or other domain knowledge.
- It approximately subsumes any other (computable) distance function you might come up with — even if it was cleverly devised with subject matter expertise on the particular type of data.
It achieves this by approximating the normalized Kolmogorov distance.