Mothra Overview

Mothra is a collection of libraries and tools for working with network flow data in the Apache Spark large-scale data analytics engine.

The Mothra libraries include Apache Spark data sources for reading IPFIX and SiLK flow data, as well as some additional SiLK data files. Since Mothra works with the Apache Spark data analytics engine, numerous other data sources and file formats may also be analyzed and cross-referenced with this network data.

Other Mothra libraries provide useful functions for working with information commonly present in network data, such as IP addresses, TCP flags, port numbers, and the like.
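
To give a sense of the kind of helper this refers to, here is a minimal sketch using only the Python standard library. It does not use Mothra's API; the function names `decode_tcp_flags`, `in_network`, and `is_well_known_port` are hypothetical and chosen only for illustration.

```python
import ipaddress

# TCP flag bits as they appear in the TCP header, lowest bit first.
TCP_FLAGS = [
    (0x01, "FIN"), (0x02, "SYN"), (0x04, "RST"), (0x08, "PSH"),
    (0x10, "ACK"), (0x20, "URG"), (0x40, "ECE"), (0x80, "CWR"),
]


def decode_tcp_flags(bits):
    """Return the names of the TCP flags set in an integer bit field."""
    return [name for mask, name in TCP_FLAGS if bits & mask]


def in_network(address, cidr):
    """Report whether an IP address falls inside a CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


def is_well_known_port(port):
    """Report whether a port is in the IANA system port range (0-1023)."""
    return 0 <= port <= 1023


syn_ack = decode_tcp_flags(0x12)
inside = in_network("192.0.2.77", "192.0.2.0/24")
outside = in_network("198.51.100.1", "192.0.2.0/24")
```

Helpers of this shape are typically applied to columns of flow records, for example to keep only flows whose source address lies in a given network.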

The Mothra tools include software for loading IPFIX and SiLK data into HDFS storage for later analysis, and for partitioning IPFIX data as it is loaded to support more efficient queries.

What is Network Flow Data?

NetFlow is a traffic-summarizing format that was first implemented by Cisco Systems® primarily for accounting purposes. Network flow data (or network flow) is a generalization of NetFlow. Network flow collection differs from direct packet capture (such as with tcpdump) in that it builds a summary of communications between sources and destinations on a network. For NetFlow, this summary covers all traffic matching seven relevant keys: the source and destination IP addresses, the source and destination ports, the transport layer protocol, the type of service, and the router interface.

Both IPFIX and SiLK flow records (see below) are generally centered around the so-called "five-tuple":

  • source IP address
  • destination IP address
  • source port
  • destination port
  • transport layer protocol

When combined with information about where the flow was collected and the time the flow was observed to start, this information is enough to distinguish one network flow from another.
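
The idea of summarizing traffic by five-tuple can be sketched as follows. This is an illustration only, not how Mothra, IPFIX, or SiLK represent flows: the `Packet` and `FlowKey` types and the `summarize` function are hypothetical, and the sketch ignores the collection point, flow timeouts, and other details that real flow meters handle.

```python
from collections import namedtuple

# Hypothetical record types for illustration; not part of Mothra, IPFIX, or SiLK.
FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port protocol")
Packet = namedtuple("Packet", "time src_ip dst_ip src_port dst_port protocol size")


def summarize(packets):
    """Group packets by five-tuple and keep per-flow times and counters."""
    flows = {}
    for p in packets:
        key = FlowKey(p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        stats = flows.setdefault(
            key, {"start": p.time, "end": p.time, "packets": 0, "bytes": 0}
        )
        stats["start"] = min(stats["start"], p.time)
        stats["end"] = max(stats["end"], p.time)
        stats["packets"] += 1
        stats["bytes"] += p.size
    return flows


# Three packets: two share a five-tuple (TCP, protocol 6), one is separate (UDP, 17).
packets = [
    Packet(0.00, "192.0.2.1", "198.51.100.7", 49152, 443, 6, 60),
    Packet(0.05, "192.0.2.1", "198.51.100.7", 49152, 443, 6, 1500),
    Packet(0.10, "192.0.2.9", "198.51.100.7", 53000, 53, 17, 80),
]
flows = summarize(packets)
for key, stats in flows.items():
    print(key, stats)
```

Each entry in `flows` stands in for one flow record: the individual packets are gone, and only the key, the start and end times, and the packet and byte counts remain.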

Mothra is designed to store and process large collections of network flow data, along with supporting collections of information.

IPFIX (Internet Protocol Flow Information Export) is an IETF standard protocol for export of IP flow information. It was originally based on Cisco Systems® NetFlow Version 9. IPFIX is documented fully in RFC 7011 through RFC 7015, along with RFC 5103.

SiLK is a C-based collection of tools for working with network flow data, based originally on Cisco Systems® NetFlow Version 5. It is fully documented on the SiLK documentation website.