While there will surely be an upcoming blog post about the inner workings of Haka, our internally developed distributed fuzzer, I'll focus here on the infrastructure and workflow that run our fuzzer, as well as the problems (and solutions) we encountered while building and running a fuzzing cluster on commodity hardware. The result is a project composed of management scripts and Ansible playbooks that we call Gunsen.
In SRT we have good and, well, less than good hardware. While some of the boxes we have lying around are old, they're still powerful enough to host a KVM guest or two. We put these boxes to use for some of our distributed projects, primarily our fuzzer.
"No Cold Cores"
The basic idea of Haka is to have a modular fuzzer that can be easily run on a "diverse" set of systems. I use diverse loosely, as the real concern was getting Haka onto as many of the idle or unused commodity systems we have as possible (we have a plethora). Haka makes heavy use of Ubuntu, though with some tweaking it can certainly be worked into RHEL or other Debian-based systems. The major requirement is support for hardware virtualization (vmx/svm).
"No Cold Cores": The battle cry of SRT. The goal was to have a repeatable method to automatically provision nodes in order to make idle systems work for us. A lot of phases in our workflow are represented as playbooks in Ansible, which does most (if not all) of the legwork.
Within our inventory of ragtag hardware (an ode to afl on Edison is in order), we've installed Haka fuzzers on over 24 systems, which include everything from Asus Vivo PCs (horrible little boxes) to a Dell VRTX (the opposite). We ended up with over 260 fuzzing virtual nodes (300+ on certain tasks).
The Install phase involves building up the Haka infrastructure from scratch. This includes installing and configuring all the prerequisites to support the various Haka components (Tasker, Datastore, and Fuzzer). This should not happen very often on non-fuzzer components. When provisioning new systems, we simply add the new device's IP address into our inventory and run the installation playbook.
The current setup uses QCOW2 base images and overlays to greatly reduce the size of our virtual fuzz nodes. Multiple overlays can reference a single base image, with each overlay tracking only the blocks that differ from that base. Instead of having a full raw image per node, QCOW2 lets us run 20 Windows VMs in less than 10GB (15GB if you include the base image).
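The overlay setup can be sketched roughly as follows. This is a minimal illustration, not Haka's actual layout: the image paths, node names, and node count are made up, and the commands are just the standard `qemu-img` invocations for creating QCOW2 overlays on a backing file.

```python
# Sketch: create one QCOW2 overlay per fuzz node on top of a shared base
# image. Each overlay starts near-empty and only records blocks that
# diverge from the base, which is what keeps the per-node footprint small.
# Paths and naming are illustrative, not Haka's real scheme.

def overlay_commands(base_image, node_count, out_dir="/var/lib/libvirt/images"):
    """Build the qemu-img invocations that create the per-node overlays."""
    cmds = []
    for n in range(node_count):
        overlay = f"{out_dir}/fuzznode-{n:02d}.qcow2"
        cmds.append([
            "qemu-img", "create",
            "-f", "qcow2",      # overlay format
            "-b", base_image,   # backing (base) image
            "-F", "qcow2",      # backing image format
            overlay,
        ])
    return cmds
```

Since every overlay points back at the same read-only base, the base image is copied to a host exactly once no matter how many nodes run on it.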
The Prepare phase deals with generating the golden image that we eventually push out to all KVM hosts (as part of deployment). This can include installing applications to fuzz, making OS-specific modifications to improve performance, or simply shrinking the image. Although it adds time to the process, we zeroize unused disk space (using sdelete -z) and compress the image prior to deployment.
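The shrink step boils down to two commands, sketched below. The tool names are real (`sdelete` inside the Windows guest, `qemu-img` on the build host), but the drive letter and paths are illustrative:

```python
# Sketch of the pre-deployment image shrink. Step 1 runs inside the
# Windows guest: overwrite free space with zeros so the conversion below
# can discard it. Step 2 runs on the build host: re-copy the image with
# -c, which compresses written clusters.

GUEST_ZEROIZE = ["sdelete", "-z", "C:"]  # run inside the guest

def compress_command(src, dst):
    """qemu-img convert with -c produces a compressed QCOW2 copy."""
    return ["qemu-img", "convert", "-c", "-O", "qcow2", src, dst]
```

Zeroizing first matters: without it, deleted-but-dirty blocks in the guest filesystem still look like data to `qemu-img` and survive the conversion.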
The Deploy phase involves pushing the golden image out across all KVM hosts. Once the QCOW2 image has been successfully pushed to a host, a libvirt script is executed which automatically generates overlay images and domains, based on a node_count variable in the Ansible inventory file. At this point the fuzz nodes can start pulling work from the task queue.
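The per-node domain generation can be sketched like this. The XML below is a stripped-down illustration of a libvirt domain definition, not our real template, and the names and memory sizes are made up; each generated pair would be fed to `virsh define`:

```python
# Sketch: one domain definition per node, driven by the node_count value
# that Ansible reads from the inventory. Minimal illustrative XML only;
# a real template would also define the network interface, graphics, etc.

DOMAIN_XML = """<domain type='kvm'>
  <name>{name}</name>
  <memory unit='GiB'>2</memory>
  <vcpu>1</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='{overlay}'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>"""

def domain_definitions(node_count, image_dir="/var/lib/libvirt/images"):
    """Yield (name, xml) pairs, one per fuzz node, for `virsh define`."""
    for n in range(node_count):
        name = f"fuzznode-{n:02d}"
        yield name, DOMAIN_XML.format(
            name=name, overlay=f"{image_dir}/{name}.qcow2")
```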
It's a bit hacky, but it does the trick! Since we currently do not have a direct way to interface with the fuzz nodes, the KVM host temporarily stands up a web service and uses WinRM to command each fuzz node to retrieve files. This is mostly used when updated Haka clients need to be pushed out. Instead of redeploying golden images, which would take too long, this was a quick and easy workaround.
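The guest side of that workaround reduces to handing each node a download one-liner over WinRM (e.g. with the pywinrm library). The helper below is a hypothetical sketch: the port, destination directory, and naming are made up for illustration.

```python
# Sketch of the file-pull workaround: the KVM host serves the updated Haka
# client over a temporary HTTP server, and each fuzz node is commanded via
# WinRM to fetch it. Host IP, port, and paths are illustrative.

def fetch_script(host_ip, filename, dest_dir="C:\\haka"):
    """PowerShell one-liner each fuzz node runs to pull the file."""
    url = f"http://{host_ip}:8000/{filename}"
    dest = f"{dest_dir}\\{filename}"
    return f"Invoke-WebRequest -Uri '{url}' -OutFile '{dest}'"
```

On the KVM host side, the throwaway web service can be as simple as `python3 -m http.server 8000` run from the directory holding the client files.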
In the event that our systems enter a weird state, or when we're simply preparing to redeploy a new image, the Cleanup phase connects to each KVM host, destroys all domains, and deletes all overlay files.
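Per host, the Cleanup phase amounts to a destroy/undefine/delete loop like the sketch below. The node naming mirrors the illustrative examples above rather than Haka's real scheme; note the base image is deliberately left in place so only the cheap overlays need regenerating.

```python
# Sketch of the Cleanup phase on a single KVM host: hard-stop every
# fuzz-node domain, drop its definition, and remove its overlay image.
# Illustrative naming; the shared base image is intentionally untouched.

def cleanup_commands(node_count, image_dir="/var/lib/libvirt/images"):
    cmds = []
    for n in range(node_count):
        name = f"fuzznode-{n:02d}"
        cmds.append(["virsh", "destroy", name])    # hard power-off
        cmds.append(["virsh", "undefine", name])   # remove the domain
        cmds.append(["rm", "-f", f"{image_dir}/{name}.qcow2"])  # overlay only
    return cmds
```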
Problem #1: The number of nodes is too high!
A good problem to have, but it meant we couldn't easily control the KVM domains directly without some sort of NAT/PAT port-mapping configuration on the KVM network. And anything larger than a /24 would just be annoying, since we don't really feel the need to manually manage any of the underlying fuzz nodes. Virt-Manager + Virt-Viewer is easy enough to use for an RDP/VNC-like environment, and great for checking system state when an exception is encountered.
Problem #2: Deployment time is too long!
Probably the most painful process in all of this is deploying golden images to every KVM host (5GB * 24 = a lot); however, image redeployment does not happen very often. We've brainstormed on how to tackle this issue, as the problem will get progressively worse with every idle system we magically find around the office. Right now we're thinking of either using P2P for distributing golden images (maybe IPFS?) or a tiered distribution system. It's still up in the air, but because we can push files and run PowerShell commands via WinRM through the KVM host, it hasn't been a high priority.
Problem #3: Non-unique node names and node management.
Since we were cloning multiple systems from a single image, it became difficult to properly identify where exceptions were being generated, as every fuzz node had the same hostname as the golden image. To address this, we decided to set each node's hostname to its MAC address (H-AAAABBBBCCCC). By including a startup script as part of the Haka client installer, we could check the hostname, then rename and restart as needed. This gave us unique hostnames to reference when investigating exception events.
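The naming convention itself is trivial: strip the separators out of the MAC address and prefix the result. A sketch of the transform, matching the H-AAAABBBBCCCC pattern above:

```python
# Sketch of the MAC-to-hostname convention: "H-" plus the MAC address with
# separators removed and uppercased, e.g. H-AAAABBBBCCCC.

def mac_hostname(mac):
    """'aa:bb:cc:dd:ee:ff' -> 'H-AABBCCDDEEFF'"""
    return "H-" + mac.replace(":", "").replace("-", "").upper()
```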
As a follow-up to this problem, we now had a bunch of MAC-based hostnames that we couldn't quickly map to their KVM host, which prevented us from directly managing a given fuzz node. RDP/VNC straight to a node is also out of the question because the nodes sit behind NAT. We decided to write a script that uses the Ansible inventory file to poll the DHCP lease information on each KVM host, then launch virt-viewer against the right one.
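The lookup side of that script can be sketched as below. Libvirt's default NAT network hands out addresses via dnsmasq, whose lease file lines have the form `<expiry> <mac> <ip> <hostname> <client-id>`; given a target hostname we just scan each KVM host's lease data for a match. How the lease text is collected (and the inventory format) is left out here.

```python
# Sketch: map a MAC-based guest hostname back to the KVM host and guest IP
# by scanning dnsmasq lease data gathered from each host in the inventory.
# Lease-line format: "<expiry> <mac> <ip> <hostname> <client-id>".

def find_node(hostname, leases_by_kvm_host):
    """leases_by_kvm_host: {kvm_host: lease-file text}.
    Returns (kvm_host, guest_ip) or None if the hostname isn't leased."""
    for kvm_host, text in leases_by_kvm_host.items():
        for line in text.splitlines():
            fields = line.split()
            if len(fields) >= 4 and fields[3] == hostname:
                return kvm_host, fields[2]
    return None
```

Once the owning host is known, something like `virt-viewer --connect qemu+ssh://<kvm_host>/system <domain>` gets us a console without any port mapping.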
We built Gunsen to help with our fuzzer automation. However, we've found that it could be very useful as a stand-alone utility for other infosec research. For instance, deploying client-side high-interaction honeypots, doing AV testing, testing exploits for reliability, etc.
There are a lot of changes being made to the underlying fuzzer, but the ultimate goal of Gunsen is to simplify the management of our infrastructure. The idea is to "do less management and do more development". One idea we're throwing around is using a job scheduler application, improving the usability of the Ansible playbooks, and adding some status monitoring on our hosts. In any case, we plan on open sourcing Gunsen in the coming weeks.