package datasources
This package contains the Mothra datasources, along with mechanisms for working with those datasources. The primary novel feature of these datasources is the fields mechanism.
To use the IPFIX or SiLK data sources, use the following methods, which become available on Spark's DataFrameReader after importing from this package:
import org.cert.netsa.mothra.datasources._
val silkDF = spark.read.silkFlow()                                    // to read from the default SiLK repository
val silkRepoDF = spark.read.silkFlow(repository="...")                // to read from an alternate SiLK repository
val silkFilesDF = spark.read.silkFlow("/path/to/silk/files")          // to read from loose SiLK files
val ipfixDF = spark.read.ipfix(repository="/path/to/mothra/data/dir") // for packed Mothra IPFIX data
val ipfixS3DF = spark.read.ipfix(s3Repository="bucket-name")          // for packed Mothra IPFIX data from an S3 bucket
val ipfixFilesDF = spark.read.ipfix("/path/to/ipfix/files")           // for loose IPFIX files
(The additional methods are defined on the implicit class CERTDataFrameReader.)
Using the fields method allows you to configure which SiLK or IPFIX fields you wish to retrieve. (This is particularly important for IPFIX data, as IPFIX files may contain many possible fields organized in various ways.)
import org.cert.netsa.mothra.datasources._
val silkDF = spark.read.fields("sIP", "dIP").silkFlow(...)
val ipfixDF = spark.read.fields("sourceIPAddress", "destinationIPAddress").ipfix(...)
Both of these dataframes will contain only the source and destination IP addresses from the specified data sources. You may also provide column names different from the source field names:
val silkDF = spark.read.fields("server" -> "sIP", "client" -> "dIP").silkFlow(...)
val ipfixDF = spark.read.fields("server" -> "sourceIPAddress", "client" -> "destinationIPAddress").ipfix(...)
You may also mix the mapped and the default names in one call:
val df = spark.read.fields("sIP", "dIP", "s" -> "sensor").silkFlow(...)
Package Members
- package fields
- package ipfix
A data source as defined by the Spark Data Source API for reading IPFIX records from Mothra data spools and from loose files.
You can use this by importing org.cert.netsa.mothra.datasources._ like this:

import org.cert.netsa.mothra.datasources._
val df1 = spark.read.ipfix("path/to/mothra/data/dir") // for packed Mothra IPFIX data
val df2 = spark.read.ipfix("path/to/ipfix/files")     // for loose IPFIX files
The IPFIX datasource uses the fields mechanism from org.cert.netsa.mothra.datasources. You can use this mechanism with arbitrary sets of fields and field name mappings, as in these examples:
import org.cert.netsa.mothra.datasources._
val df1 = spark.read.fields(
  "startTime", "endTime", "sourceIPAddress", "destinationIPAddress"
).ipfix(...)
val df2 = spark.read.fields(
  "startTime", "endTime", "TOS" -> "ipClassOfService"
).ipfix(...)
Default Fields
The default set of fields (defined in IPFIXFields.default) is:
"startTime" -> "func:startTime"
"endTime" -> "func:endTime"
"sourceIPAddress" -> "func:sourceIPAddress"
"sourcePort" -> "func:sourcePort"
"destinationIPAddress" -> "func:destinationIPAddress"
"destinationPort" -> "func:destinationPort"
"protocolIdentifier"
"observationDomainId"
"vlanId"
"reverseVlanId"
"silkAppLabel"
"packetCount" -> "packetTotalCount|packetDeltaCount"
"reversePacketCount" -> "reversePacketTotalCount|reversePacketDeltaCount"
"octetCount" -> "octetTotalCount|octetDeltaDcount"
"reverseOctetCount" -> "reverseOctetTotalCount|reverseOctetDeltaCount"
"initialTCPFlags"
"reverseInitialTCPFlags"
"unionTCPFlags"
"reverseUnionTCPFlags"
Some of these defaults are defined simply as IPFIX Information Elements. For example, "protocolIdentifier" and "vlanId" are exactly the Information Elements that are named. No right-hand side is given for these definitions, because the name of the field is the same as the name of the Information Element.

Others have simple expressions. For example, packetCount is defined as "packetTotalCount|packetDeltaCount". This expression means that the value should be taken from the packetTotalCount IE, or, if that is not set, from the packetDeltaCount IE. This allows the field to be used regardless of which Information Element contains the data.

Some others are derived in more complex ways from basic IPFIX fields. For example, the startTime field is produced using "func:startTime", which runs the "gauntlet of time" to determine the start time for a flow by whatever means possible. Other time fields are similarly defined.

Some of the "func:..." fields are actually quite simple. For example, "func:sourceIPAddress" is, practically speaking, the same as "sourceIPv4Address|sourceIPv6Address". However, these fields are defined using the func: extension mechanism so that partitioning on them is possible. (This restriction may be lifted in a future Mothra version.)
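Putting these pieces together, a reader that relies on both behaviors might look like the following sketch (the repository path is a placeholder):

import org.cert.netsa.mothra.datasources._
// "func:startTime" runs the "gauntlet of time"; the pipe expression
// falls back to packetDeltaCount when packetTotalCount is not set.
val df = spark.read.fields(
  "startTime" -> "func:startTime",
  "packetCount" -> "packetTotalCount|packetDeltaCount"
).ipfix("/path/to/mothra/data/dir")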
Field Types
The mappings between IPFIX types and Spark types are:
- octetArray → Array[Byte]
- unsigned8 → Short
- unsigned16 → Int
- unsigned32 → Long
- unsigned64 → Long
- signed8 → Byte
- signed16 → Short
- signed32 → Int
- signed64 → Long
- float32 → Float
- float64 → Double
- boolean → Boolean
- macAddress → String
- string → String
- dateTimeSeconds → Timestamp
- dateTimeMilliseconds → Timestamp
- dateTimeMicroseconds → Timestamp
- dateTimeNanoseconds → Timestamp
- ipv4Address → String
- ipv6Address → String
IPFIX's basicList, subTemplateList, and subTemplateMultiList data types are handled differently.
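As a quick illustration of these mappings, the schema of a small field selection might look like this (a sketch; the path is a placeholder, and the nullability shown is an assumption):

val df = spark.read.fields(
  "sourceIPAddress",                                   // ipv4Address/ipv6Address → String
  "packetCount" -> "packetTotalCount|packetDeltaCount" // unsigned64 → Long
).ipfix("/path/to/ipfix/files")
df.printSchema()
// root
//  |-- sourceIPAddress: string (nullable = true)
//  |-- packetCount: long (nullable = true)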
Field Expressions
As noted above, field expressions may contain simple IPFIX Information Element names, or collections of names separated by pipe characters to indicate taking the first matching choice. This language has a number of other capabilities which are documented for now in the IPFIX field parser object.
Functional Fields
A number of pre-defined "functional fields" are available. Some of these combine other information elements in ways that the expression language cannot (applying the so-called "gauntlet of time", for example). Others provide support for the Mothra repository partitioning system. And finally, a few are for debugging purposes and provide high-level overviews of IPFIX records or point to file locations on disk.
Functional fields are all defined and described in the org.cert.netsa.mothra.datasources.ipfix.fields.func package.
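As a sketch, here is a selection using only functional fields that appear in the default list above (the repository path is a placeholder):

import org.cert.netsa.mothra.datasources._
// Each right-hand side names a pre-defined functional field.
val df = spark.read.fields(
  "sIP" -> "func:sourceIPAddress",
  "dIP" -> "func:destinationIPAddress",
  "startTime" -> "func:startTime"
).ipfix("/path/to/mothra/data/dir")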
- package silk
Type Members
- implicit class CERTDataFrameReader extends AnyRef
Additional methods for Spark DataFrameReader to enable reading IPFIX and SiLK data, and to allow specifying fields for IPFIX and SiLK datasources.
- sealed trait FilterResult extends EnumEntry
A four-way logic filter result. Passes if every record in the partition will pass the filter; Fails if every record in the partition will fail the filter; Maybe if some of the records in the partition might pass and some might fail the filter; Nulls if every record in the partition will produce NULL for the filter.
FilterResults may be explicitly converted to Booleans indicating whether it's possible for the result of the filter to be true (canMatch):
- Passes ⟹ true
- Fails ⟹ false
- Maybe ⟹ true
- Nulls ⟹ false
Logical and (&&) behavior:

- Passes && Passes ⟹ Passes
- Passes && Fails ⟹ Fails
- Passes && Maybe ⟹ Maybe
- Passes && Nulls ⟹ Nulls
- Fails && Passes ⟹ Fails
- Fails && Fails ⟹ Fails
- Fails && Maybe ⟹ Fails
- Fails && Nulls ⟹ Fails
- Maybe && Passes ⟹ Maybe
- Maybe && Fails ⟹ Fails
- Maybe && Maybe ⟹ Maybe
- Maybe && Nulls ⟹ Fails
- Nulls && Passes ⟹ Nulls
- Nulls && Fails ⟹ Fails
- Nulls && Maybe ⟹ Fails
- Nulls && Nulls ⟹ Nulls
Logical or (||) behavior:

- Passes || Passes ⟹ Passes
- Passes || Fails ⟹ Passes
- Passes || Maybe ⟹ Passes
- Passes || Nulls ⟹ Passes
- Fails || Passes ⟹ Passes
- Fails || Fails ⟹ Fails
- Fails || Maybe ⟹ Maybe
- Fails || Nulls ⟹ Nulls
- Maybe || Passes ⟹ Passes
- Maybe || Fails ⟹ Maybe
- Maybe || Maybe ⟹ Maybe
- Maybe || Nulls ⟹ Maybe
- Nulls || Passes ⟹ Passes
- Nulls || Fails ⟹ Nulls
- Nulls || Maybe ⟹ Maybe
- Nulls || Nulls ⟹ Nulls
Logical not (!) behavior:

- !Passes ⟹ Fails
- !Fails ⟹ Passes
- !Maybe ⟹ Maybe
- !Nulls ⟹ Nulls
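The tables above can be restated compactly in code. The following is a minimal sketch of the same four-way logic, not the library's actual implementation (the real FilterResult is an EnumEntry-based enum; the names here are illustrative):

// A self-contained sketch of the documented four-way filter logic.
sealed trait FR {
  // Whether it is possible for the filter result to be true.
  def canMatch: Boolean = this match {
    case Passes | Maybe => true
    case Fails | Nulls  => false
  }
  def &&(that: FR): FR = (this, that) match {
    case (Fails, _) | (_, Fails)         => Fails // a definite failure dominates
    case (Maybe, Nulls) | (Nulls, Maybe) => Fails
    case (Passes, x)                     => x     // Passes is the identity for &&
    case (x, Passes)                     => x
    case (Maybe, Maybe)                  => Maybe
    case (Nulls, Nulls)                  => Nulls
  }
  def ||(that: FR): FR = (this, that) match {
    case (Passes, _) | (_, Passes)       => Passes // a definite pass dominates
    case (Fails, x)                      => x      // Fails is the identity for ||
    case (x, Fails)                      => x
    case (Maybe, _) | (_, Maybe)         => Maybe
    case (Nulls, Nulls)                  => Nulls
  }
  def unary_! : FR = this match {
    case Passes => Fails
    case Fails  => Passes
    case Maybe  => Maybe
    case Nulls  => Nulls
  }
}
case object Passes extends FR
case object Fails  extends FR
case object Maybe  extends FR
case object Nulls  extends FR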
Value Members
- object FilterResult extends Enum[FilterResult]
This is documentation for Mothra, a collection of Scala and Spark library functions for working with Internet-related data. Some modules contain APIs of general use to Scala programmers. Some modules make those tools more useful on Spark data-processing systems.
Please see the documentation for the individual packages for more details on their use.
Scala Packages
These packages are useful in Scala code without involving Spark:
org.cert.netsa.data

This package, which is collected as the netsa-data library, provides types for working with various kinds of information:

- org.cert.netsa.data.net - types for working with network data
- org.cert.netsa.data.time - types for working with time data
- org.cert.netsa.data.unsigned - types for working with unsigned integral values

org.cert.netsa.io.ipfix

The netsa-io-ipfix library provides tools for reading and writing IETF IPFIX data from various connections and files.

org.cert.netsa.io.silk

To read and write CERT NetSA SiLK file formats and configuration files, use the netsa-io-silk library.

org.cert.netsa.util

The "junk drawer" of netsa-util so far provides only two features: first, a method for equipping Scala scala.collection.Iterators with exception handling; and second, a way to query the versions of NetSA libraries present in a JVM at runtime.

Spark Packages
These packages require the use of Apache Spark:
org.cert.netsa.mothra.datasources

Spark datasources for CERT file types. This package contains utility features which add methods to Apache Spark DataFrameReader objects, allowing IPFIX and SiLK flows to be opened using simple spark.read... calls.

The mothra-datasources library contains both IPFIX and SiLK functionality, while mothra-datasources-ipfix and mothra-datasources-silk contain only what's needed for the named datasource.

org.cert.netsa.mothra.analysis

A grab-bag of analysis helper functions and example analyses.

org.cert.netsa.mothra.functions

This single Scala object provides Spark SQL functions for working with network data. It is the entirety of the mothra-functions library.