implicit class CERTDataFrameReader extends AnyRef
Additional methods for Spark DataFrameReader to enable reading IPFIX and SiLK data, and to allow specifying fields for IPFIX and SiLK datasources.
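With this implicit class in scope, a plain Spark DataFrameReader gains the ipfix, silkFlow, and fields methods documented below. A minimal sketch of how it might be used (the import path for the implicit is assumed from the package layout described in this documentation, and the repository path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.cert.netsa.mothra.datasources._  // assumed: brings CERTDataFrameReader into scope

val spark = SparkSession.builder().appName("example").getOrCreate()

// DataFrameReader is enriched, so IPFIX data can be loaded directly:
val df = spark.read.ipfix(repository = "/data/ipfix")  // hypothetical path
```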
Instance Constructors
- new CERTDataFrameReader(self: DataFrameReader)
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def fields(fs: FieldsSpec*): DataFrameReader
Specifies the fields which will be used by the IPFIX or SiLK flow datasources. See the individual datasources for more details.
- See also
IPFIX datasource for IPFIX fields
SiLK flow datasource for SiLK flow fields
fields.FieldsSpec object for ways to package groups of fields
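A sketch of calling fields before loading, assuming the field names shown here are valid for the datasource and that plain strings are accepted as FieldsSpec values (see the fields.FieldsSpec documentation for the actual ways to package groups of fields):

```scala
// Hypothetical field names for illustration only; consult the IPFIX
// datasource documentation for the fields it actually exposes.
val flows = spark.read
  .fields("sourceIPAddress", "destinationIPAddress", "octetCount")
  .ipfix(repository = "/data/ipfix")  // hypothetical path
```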
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def ipfix(path: String = "", repository: String = "", s3Repository: String = "", s3Prefix: String = "", numSlices: Int = 0, targetSize: Long = 0L, debugMode: String = ""): DataFrame
Loads IPFIX data from files or from a repository in HDFS or S3.
NOTE: If debugMode is given, *no result data will be generated*.
- path
Hadoop path pattern for files to be read directly
- repository
Hadoop path for filesystem repository
- s3Repository
S3 bucket name for IPFIX repository
- s3Prefix
Prefix to S3 keys for repository items
- numSlices
Target parallelism
- targetSize
Target size of slice for parallelism in bytes
- debugMode
Mode and location for debugging (INTERNAL)
- See also
IPFIX datasource for details
org.cert.netsa.mothra.datasources.ipfix.Debug for information about debugMode
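Since every parameter has a default, callers typically supply only the source they want. A sketch of the two common cases (all paths and the bucket name below are hypothetical):

```scala
// Read IPFIX files directly via a Hadoop path pattern:
val fromFiles = spark.read.ipfix(path = "hdfs:///incoming/ipfix/*")

// Read from an S3 repository, tuning parallelism:
val fromS3 = spark.read.ipfix(
  s3Repository = "my-ipfix-bucket",  // hypothetical bucket
  s3Prefix     = "repo/",
  numSlices    = 64
)
```

Exactly one of path, repository, or s3Repository would normally be given per call; see the IPFIX datasource documentation for the authoritative semantics.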
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def silkFlow(path: String = "", repository: String = "", configFile: String = ""): DataFrame
Loads SiLK flow data from files or from a SiLK repository in HDFS.
- path
Hadoop path pattern for files to be read directly
- repository
Hadoop path for SiLK repository data rootdir
- configFile
Hadoop path for the silk.conf config file
- See also
SiLK flow datasource for details
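A sketch of loading from a SiLK repository, with its site configuration file supplied explicitly (both paths are hypothetical):

```scala
val silk = spark.read.silkFlow(
  repository = "hdfs:///data/silk",            // hypothetical rootdir
  configFile = "hdfs:///data/silk/silk.conf"   // hypothetical silk.conf location
)
```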
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
This is documentation for Mothra, a collection of Scala and Spark library functions for working with Internet-related data. Some modules contain APIs of general use to Scala programmers. Some modules make those tools more useful on Spark data-processing systems.
Please see the documentation for the individual packages for more details on their use.
Scala Packages
These packages are useful in Scala code without involving Spark:
org.cert.netsa.data
This package, which is collected as the netsa-data library, provides types for working with various kinds of information:
- org.cert.netsa.data.net - types for working with network data
- org.cert.netsa.data.time - types for working with time data
- org.cert.netsa.data.unsigned - types for working with unsigned integral values
org.cert.netsa.io.ipfix
The netsa-io-ipfix library provides tools for reading and writing IETF IPFIX data from various connections and files.
org.cert.netsa.io.silk
To read and write CERT NetSA SiLK file formats and configuration files, use the netsa-io-silk library.
org.cert.netsa.util
The "junk drawer" of netsa-util so far provides only two features: first, a method for equipping Scala scala.collection.Iterators with exception handling; and second, a way to query the versions of NetSA libraries present in a JVM at runtime.
Spark Packages
These packages require the use of Apache Spark:
org.cert.netsa.mothra.datasources
Spark datasources for CERT file types. This package contains utility features which add methods to Apache Spark DataFrameReader objects, allowing IPFIX and SiLK flows to be opened using simple spark.read... calls. The mothra-datasources library contains both IPFIX and SiLK functionality, while mothra-datasources-ipfix and mothra-datasources-silk contain only what's needed for the named datasource.
org.cert.netsa.mothra.analysis
A grab-bag of analysis helper functions and example analyses.
org.cert.netsa.mothra.functions
This single Scala object provides Spark SQL functions for working with network data. It is the entirety of the mothra-functions library.