package ipfix
A data source as defined by the Spark Data Source API for reading IPFIX records from Mothra data spools and from loose files.
You can use this by importing org.cert.netsa.mothra.datasources._ like this:

```scala
import org.cert.netsa.mothra.datasources._
val df1 = spark.read.ipfix("path/to/mothra/data/dir") // for packed Mothra IPFIX data
val df2 = spark.read.ipfix("path/to/ipfix/files")     // for loose IPFIX files
```
The IPFIX datasource uses the fields mechanism from org.cert.netsa.mothra.datasources. You can make use of this mechanism as in these examples:

```scala
import org.cert.netsa.mothra.datasources._
val df1 = spark.read.fields(
  "startTime", "endTime", "sourceIPAddress", "destinationIPAddress"
).ipfix(...)
val df2 = spark.read.fields(
  "startTime", "endTime", "TOS" -> "ipClassOfService"
).ipfix(...)
```
with arbitrary sets of fields and field name mappings.
Default Fields
The default set of fields (defined in IPFIXFields.default) is:
"startTime" -> "func:startTime"
"endTime" -> "func:endTime"
"sourceIPAddress" -> "func:sourceIPAddress"
"sourcePort" -> "func:sourcePort"
"destinationIPAddress" -> "func:destinationIPAddress"
"destinationPort" -> "func:destinationPort"
"protocolIdentifier"
"observationDomainId"
"vlanId"
"reverseVlanId"
"silkAppLabel"
"packetCount" -> "packetTotalCount|packetDeltaCount"
"reversePacketCount" -> "reversePacketTotalCount|reversePacketDeltaCount"
"octetCount" -> "octetTotalCount|octetDeltaDcount"
"reverseOctetCount" -> "reverseOctetTotalCount|reverseOctetDeltaCount"
"initialTCPFlags"
"reverseInitialTCPFlags"
"unionTCPFlags"
"reverseUnionTCPFlags"
Some of these defaults are defined simply as IPFIX Information Elements. For example, "protocolIdentifier" and "vlanId" are exactly the Information Elements that are named. No "right-hand side" is given for these definitions, because the name of the field is the same as the name of the Information Element.
Others have simple expressions. For example, packetCount is defined as "packetTotalCount|packetDeltaCount". This expression means that the value should be taken from the packetTotalCount IE, or, if that is not set, from the packetDeltaCount IE. This allows the field to be used regardless of which Information Element contains the data.
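The fallback behavior of a pipe expression can be sketched in plain Scala: each alternative is tried in order and the first element that is present wins. This is only an illustrative model of the semantics, not the datasource's actual implementation; the record representation here is a hypothetical Map.

```scala
// Illustrative model of the "a|b" fallback semantics: try each
// Information Element name in order and keep the first present value.
def resolve(record: Map[String, Long], expr: String): Option[Long] =
  expr.split('|').flatMap(record.get).headOption

// A record exported with total counts, and one with delta counts:
val rec1 = Map("packetTotalCount" -> 100L)
val rec2 = Map("packetDeltaCount" -> 42L)

resolve(rec1, "packetTotalCount|packetDeltaCount") // Some(100)
resolve(rec2, "packetTotalCount|packetDeltaCount") // Some(42)
```

Either way, a query can filter on a single packetCount field without knowing which counter the exporter used.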
Some others are derived in more complex ways from basic IPFIX fields. For example, the startTime field is produced using "func:startTime", which runs the "gauntlet of time" to determine the start time for a flow by whatever means possible. Other time fields are similarly defined.
Some of the "func:..."
fields are actually quite simple. For
example, "func:sourceIPAddress"
, practically speaking, is the same
as "sourceIPv4Address|sourceIPv6Address"
. However, these fields
are defined using the func:
extension mechanism so that
partitioning on them is possible. (This restriction may be lifted in
a future Mothra version.)
Field Types
The mappings between IPFIX types and Spark types are:
- octetArray → Array[Byte]
- unsigned8 → Short
- unsigned16 → Int
- unsigned32 → Long
- unsigned64 → Long
- signed8 → Byte
- signed16 → Short
- signed32 → Int
- signed64 → Long
- float32 → Float
- float64 → Double
- boolean → Boolean
- macAddress → String
- string → String
- dateTimeSeconds → Timestamp
- dateTimeMilliseconds → Timestamp
- dateTimeMicroseconds → Timestamp
- dateTimeNanoseconds → Timestamp
- ipv4Address → String
- ipv6Address → String
IPFIX's basicList, subTemplateList, and subTemplateMultiList data types are handled differently.
Field Expressions
As noted above, field expressions may contain simple IPFIX Information Element names, or collections of names separated by pipe characters to indicate taking the first matching choice. This language has a number of other capabilities which are documented for now in the IPFIX field parser object.
Functional Fields
A number of pre-defined "functional fields" are available. Some of these combine other information elements in ways that the expression language cannot (applying the so-called "gauntlet of time", for example). Others provide support for the Mothra repository partitioning system. And finally, a few are for debugging purposes and provide high-level overviews of IPFIX records or point to file locations on disk.
Function fields are all defined and described in the org.cert.netsa.mothra.datasources.ipfix.fields.func package.
Package Members
- package fields
Most of these classes and traits relate to the definition of IPFIX fields as IPFIX record processing objects.
The IPFIXFieldParsing object defines the parser used for IPFIX field expressions, and includes the documentation for that language.
Other mechanisms, including implementations of the IPFIXField trait, provide the ability to define new "function" fields and register them into the Func registry. This is an experimental capability and is likely to be deprecated and then removed from public access in the future.
- Note
This is an experimental interface and is likely to be removed or made private in a future version.
Type Members
- sealed trait Constraint extends ConstraintSparkVerImpl
A constraint on the value of a field within a storage partition.
For example, a partition with the constraint Constraint.EQ("protocolIdentifier", "6") on it would guarantee that every record in that partition has the value 6 for its protocolIdentifier field. Constraints are typically matched against the filters of a specific query to determine whether the set of filters and constraints may produce any matches, or whether the partition may be discarded completely.
This mechanism is exposed so that function fields may override the constraints used for evaluation.
- Note
This is an experimental interface and is likely to be removed or made private in a future version.
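The partition-pruning idea behind constraints can be sketched in plain Scala: a partition may be skipped only when its constraint proves that a query's filter can never match. This is an illustrative model with hypothetical case classes, not the actual Constraint implementation.

```scala
// Hypothetical, simplified stand-ins for a partition constraint and a
// query filter, both restricted to equality for illustration.
final case class EqConstraint(field: String, value: String)
final case class EqFilter(field: String, value: String)

// A partition may contain matches unless the constraint and filter
// name the same field but demand different values.
def mayMatch(c: EqConstraint, f: EqFilter): Boolean =
  c.field != f.field || c.value == f.value

val partition = EqConstraint("protocolIdentifier", "6")
mayMatch(partition, EqFilter("protocolIdentifier", "6"))  // true: keep partition
mayMatch(partition, EqFilter("protocolIdentifier", "17")) // false: skip partition
mayMatch(partition, EqFilter("vlanId", "12"))             // true: constraint says nothing
```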
- class DefaultSource extends RelationProvider with StrictLogging
A Spark datasource for working with IPFIX data.
This is the entry point for Spark to call the datasource. See the documentation for org.cert.netsa.mothra.datasources.ipfix for details on how to use it as a user.
- trait SlicingRelationSparkVerImpl extends AnyRef
Value Members
- def IPFIX(labels: String*): FieldsSpec
Return a set of IPFIX fields for a given set of labels or field names.
Known labels and field names are listed in org.cert.netsa.mothra.datasources.ipfix.IPFIXFields.
- returns
a org.cert.netsa.mothra.datasources.fields.FieldsSpec for the given fields
IPFIX("#default", "#dns", "httpGet")
Example: - object Constraint
The basic constraint types used by Mothra.
- Note
This is an experimental interface and is likely to be removed or made private in a future version.
- object Debug
Collects debugging information for the IPFIX datasource.
NOTE: This may not work in all configurations.
To use this debugging facility, provide a debugMode argument to the ipfix data source method. The prefix of this should currently be "slices" or "files".

For a prefix of "slices", such as "slices/0" or "slices" or "slicesfoo", the debugging result will be a collection of Strings representing the slices to be processed by the data source. This can be used to verify how the data is being divvied up between executors.

For a prefix of "files", such as "files/237" or "files" or "filesinfo", the debugging result will be a collection of Strings representing the locations to be processed by the data source. This can be used to verify that the correct data partitions are being selected.

To retrieve the data for a given debugMode, index Debug with the debugMode, for example Debug("slices") or Debug("files/237") or the like. The result will be the debugging output from the last query that used that debugMode argument.
- object DefaultSource
- object IPFIXFields
Collections of fields relevant to IPFIX data sources.
These are primarily the fields produced by YAF version 3 and up, although the same fields are produced by other tools.
These collections and the individual fields within them may be passed to spark.read.fields directly and all of the fields in the collection will be included in the output.
If using a REPL with tab-completion, many of the provided symbols may be accessed using tab completion. In addition, most of the fields and field collections provide useful information if converted to strings. For example:
```scala
scala> import org.cert.netsa.mothra.datasources.ipfix.IPFIXFields
import org.cert.netsa.mothra.datasources.ipfix.IPFIXFields

scala> println(IPFIXFields.default)
// Most common fields for flow-based traffic analysis
FieldsSpec(
  // #default
  // Core flow fields, including time and the 5-tuple
  FieldsSpec(
    // #core
    // Timestamp of the first packet of this Flow
    "startTime" -> "func:startTime",
    // Timestamp of the final packet of this Flow
    "endTime" -> "func:endTime",
    // IPv4 or IPv6 source for incoming packets in this Flow
    "sourceIPAddress" -> "func:sourceIPAddress",
    // Any source port identifier from the transport header
    "sourcePort" -> "func:sourcePort",
    // IPv4 or IPv6 destination for incoming packets in this Flow
    "destinationIPAddress" -> "func:destinationIPAddress",
    // Any destination port identifier from the transport header
    "destinationPort" -> "func:destinationPort",
    // Protocol number from the IP packet header
    "protocolIdentifier"
  ),
  ...
)
```
This is printing out all of the fields in the "default" collection. The fields each include their definition as an IPFIX field expression and a short description. Some fields are grouped into larger collections which are also given names and descriptions.
Each of the field names like "startTime" may be used as an index to pull out a specific field from the collection:

```scala
scala> println(IPFIXFields.default("startTime"))
FieldsSpec(
  // Timestamp of the first packet of this Flow
  "startTime" -> "func:startTime"
)
```

And the collection names such as "#core" may also be used to pull out sub-collections:

```scala
scala> println(IPFIXFields.default("#tcpflags"))
// Fields for TCP protocol flag information
FieldsSpec(
  // #tcpflags
  // TCP flags of first incoming packet of a TCP Flow
  "initialTCPFlags",
  // TCP flags of first outgoing packet of a TCP Flow
  "reverseInitialTCPFlags",
  // Union of TCP flags of all incoming packets after the first
  "unionTCPFlags",
  // Union of TCP flags of all outgoing packets after the first
  "reverseUnionTCPFlags"
)
```
This is documentation for Mothra, a collection of Scala and Spark library functions for working with Internet-related data. Some modules contain APIs of general use to Scala programmers. Some modules make those tools more useful on Spark data-processing systems.
Please see the documentation for the individual packages for more details on their use.
Scala Packages
These packages are useful in Scala code without involving Spark:
org.cert.netsa.data

This package, which is collected as the netsa-data library, provides types for working with various kinds of information:

- org.cert.netsa.data.net - types for working with network data
- org.cert.netsa.data.time - types for working with time data
- org.cert.netsa.data.unsigned - types for working with unsigned integral values

org.cert.netsa.io.ipfix

The netsa-io-ipfix library provides tools for reading and writing IETF IPFIX data from various connections and files.

org.cert.netsa.io.silk

To read and write CERT NetSA SiLK file formats and configuration files, use the netsa-io-silk library.

org.cert.netsa.util

The "junk drawer" of netsa-util so far provides only two features: first, a method for equipping Scala scala.collection.Iterators with exception handling; and second, a way to query the versions of NetSA libraries present in a JVM at runtime.

Spark Packages
These packages require the use of Apache Spark:
org.cert.netsa.mothra.datasources

Spark datasources for CERT file types. This package contains utility features which add methods to Apache Spark DataFrameReader objects, allowing IPFIX and SiLK flows to be opened using simple spark.read... calls.

The mothra-datasources library contains both IPFIX and SiLK functionality, while mothra-datasources-ipfix and mothra-datasources-silk contain only what's needed for the named datasource.

org.cert.netsa.mothra.analysis
A grab-bag of analysis helper functions and example analyses.
org.cert.netsa.mothra.functions
This single Scala object provides Spark SQL functions for working with network data. It is the entirety of the mothra-functions library.