Packages

  • package root

    This is documentation for Mothra, a collection of Scala and Spark library functions for working with Internet-related data. Some modules contain APIs of general use to Scala programmers. Some modules make those tools more useful on Spark data-processing systems.

    Please see the documentation for the individual packages for more details on their use.

    Scala Packages

    These packages are useful in Scala code without involving Spark:

    org.cert.netsa.data

    This package, which is collected as the netsa-data library, provides types for working with various kinds of information.

    org.cert.netsa.io.ipfix

    The netsa-io-ipfix library provides tools for reading and writing IETF IPFIX data from various connections and files.

    org.cert.netsa.io.silk

    To read and write CERT NetSA SiLK file formats and configuration files, use the netsa-io-silk library.

    org.cert.netsa.util

    The "junk drawer" of netsa-util so far provides only two features: First, a method for equipping Scala scala.collection.Iterators with exception handling. And second, a way to query the versions of NetSA libraries present in a JVM at runtime.

    Spark Packages

    These packages require the use of Apache Spark:

    org.cert.netsa.mothra.datasources

    Spark datasources for CERT file types. This package contains utility features which add methods to Apache Spark DataFrameReader objects, allowing IPFIX and SiLK flows to be opened using simple spark.read... calls.

    The mothra-datasources library contains both IPFIX and SiLK functionality, while mothra-datasources-ipfix and mothra-datasources-silk contain only what's needed for the named datasource.

    org.cert.netsa.mothra.analysis

    A grab-bag of analysis helper functions and example analyses.

    org.cert.netsa.mothra.functions

    This single Scala object provides Spark SQL functions for working with network data. It is the entirety of the mothra-functions library.

    Definition Classes
    root
  • package org
    Definition Classes
    root
  • package cert
    Definition Classes
    org
  • package netsa
    Definition Classes
    cert
  • package mothra
    Definition Classes
    netsa
  • package datasources

    This package contains the Mothra datasources, along with mechanisms for working with those datasources. The primary novel feature of these datasources is the fields mechanism.

    To use the IPFIX or SiLK data sources, use the following methods, which the implicit class CERTDataFrameReader adds to DataFrameReader when you import from this package:

    import org.cert.netsa.mothra.datasources._
    val silkDF = spark.read.silkFlow()                                    // to read from the default SiLK repository
    val silkRepoDF = spark.read.silkFlow(repository="...")                // to read from an alternate SiLK repository
    val silkFilesDF = spark.read.silkFlow("/path/to/silk/files")          // to read from loose SiLK files
    val ipfixDF = spark.read.ipfix(repository="/path/to/mothra/data/dir") // for packed Mothra IPFIX data
    val ipfixS3DF = spark.read.ipfix(s3Repository="bucket-name")          // for packed Mothra IPFIX data from an S3 bucket
    val ipfixFilesDF = spark.read.ipfix("/path/to/ipfix/files")           // for loose IPFIX files


    Using the fields method allows you to configure which SiLK or IPFIX fields you wish to retrieve. (This is particularly important for IPFIX data, as IPFIX files may contain a great many possible fields organized in various ways.)

    import org.cert.netsa.mothra.datasources._
    val silkDF = spark.read.fields("sIP", "dIP").silkFlow(...)
    val ipfixDF = spark.read.fields("sourceIPAddress", "destinationIPAddress").ipfix(...)

    Both of these dataframes will contain only the source and destination IP addresses from the specified data sources. You may also provide column names different from the source field names:

    val silkDF = spark.read.fields("server" -> "sIP", "client" -> "dIP").silkFlow(...)
    val ipfixDF = spark.read.fields("server" -> "sourceIPAddress", "client" -> "destinationIPAddress").ipfix(...)

    You may also mix the mapped and the default names in one call:

    val df = spark.read.fields("sIP", "dIP", "s" -> "sensor").silkFlow(...)
    Definition Classes
    mothra
    See also

    IPFIX datasource

    SiLK flow datasource

  • package fields
    Definition Classes
    datasources
  • package ipfix

    A data source as defined by the Spark Data Source API for reading IPFIX records from Mothra data spools and from loose files.

    You can use this by importing org.cert.netsa.mothra.datasources._ like this:

    import org.cert.netsa.mothra.datasources._
    val df1 = spark.read.ipfix("path/to/mothra/data/dir") // for packed Mothra IPFIX data
    val df2 = spark.read.ipfix("path/to/ipfix/files")     // for loose IPFIX files

    The IPFIX datasource uses the fields mechanism from org.cert.netsa.mothra.datasources. You can make use of this mechanism as in the following examples:

    import org.cert.netsa.mothra.datasources._
    val df1 = spark.read.fields(
      "startTime", "endTime", "sourceIPAddress", "destinationIPAddress"
    ).ipfix(...)
    
    val df2 = spark.read.fields(
      "startTime", "endTime", "TOS" -> "ipClassOfService"
    ).ipfix(...)

    Arbitrary sets of fields and field name mappings may be given.

    Default Fields

    The default set of fields (defined in IPFIXFields.default) is:

    • "startTime" -> "func:startTime"
    • "endTime" -> "func:endTime"
    • "sourceIPAddress" -> "func:sourceIPAddress"
    • "sourcePort" -> "func:sourcePort"
    • "destinationIPAddress" -> "func:destinationIPAddress"
    • "destinationPort" -> "func:destinationPort"
    • "protocolIdentifier"
    • "observationDomainId"
    • "vlanId"
    • "reverseVlanId"
    • "silkAppLabel"
    • "packetCount" -> "packetTotalCount|packetDeltaCount"
    • "reversePacketCount" -> "reversePacketTotalCount|reversePacketDeltaCount"
    • "octetCount" -> "octetTotalCount|octetDeltaDcount"
    • "reverseOctetCount" -> "reverseOctetTotalCount|reverseOctetDeltaCount"
    • "initialTCPFlags"
    • "reverseInitialTCPFlags"
    • "unionTCPFlags"
    • "reverseUnionTCPFlags"

    Some of these defaults are defined simply as IPFIX Information Elements. For example, "protocolIdentifier" and "vlanId" are exactly the Information Elements that are named. No "right-hand-side" is given for these definitions, because the name of the field is the same as the name of the Information Element.

    Others have simple expressions. For example, packetCount is defined as "packetTotalCount|packetDeltaCount". This expression means that the value should be taken from the packetTotalCount IE, or, if that is not set, from the packetDeltaCount IE. This allows the field to be used regardless of which Information Element contains the data.
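
    In Spark terms, the effect of the pipe is similar to applying coalesce over the two Information Elements read as separate columns. A minimal sketch of that equivalence (the path is a placeholder):

    import org.apache.spark.sql.functions.{coalesce, col}

    // Roughly equivalent to requesting the single field
    // "packetCount" -> "packetTotalCount|packetDeltaCount":
    val df = spark.read
      .fields("packetTotalCount", "packetDeltaCount")
      .ipfix("/path/to/ipfix/files")
      .withColumn("packetCount",
        coalesce(col("packetTotalCount"), col("packetDeltaCount")))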

    Some others are derived in more complex ways from basic IPFIX fields. For example, the startTime field is produced using "func:startTime", which runs the "gauntlet of time" to determine the start time for a flow by whatever means possible. Other time fields are similarly defined.

    Some of the "func:..." fields are actually quite simple. For example, "func:sourceIPAddress", practically speaking, is the same as "sourceIPv4Address|sourceIPv6Address". However, these fields are defined using the func: extension mechanism so that partitioning on them is possible. (This restriction may be lifted in a future Mothra version.)

    Field Types

    The mappings between IPFIX types and Spark types are:

    • octetArray → Array[Byte]
    • unsigned8 → Short
    • unsigned16 → Int
    • unsigned32 → Long
    • unsigned64 → Long
    • signed8 → Byte
    • signed16 → Short
    • signed32 → Int
    • signed64 → Long
    • float32 → Float
    • float64 → Double
    • boolean → Boolean
    • macAddress → String
    • string → String
    • dateTimeSeconds → Timestamp
    • dateTimeMilliseconds → Timestamp
    • dateTimeMicroseconds → Timestamp
    • dateTimeNanoseconds → Timestamp
    • ipv4Address → String
    • ipv6Address → String

    IPFIX's basicList, subTemplateList, and subTemplateMultiList data types are handled differently.
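
    These mappings determine the schema of the resulting DataFrame and can be checked with printSchema(). A small sketch (the path is a placeholder):

    // protocolIdentifier is an unsigned8 IE, so it arrives as Short;
    // sourceTransportPort is unsigned16, so it arrives as Int.
    val df = spark.read
      .fields("protocolIdentifier", "sourceTransportPort")
      .ipfix("/path/to/ipfix/files")
    df.printSchema()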

    Field Expressions

    As noted above, field expressions may contain simple IPFIX Information Element names, or collections of names separated by pipe characters to indicate taking the first matching choice. The expression language has a number of other capabilities, which are currently documented in the IPFIX field parser object.

    Functional Fields

    A number of pre-defined "functional fields" are available. Some of these combine other Information Elements in ways that the expression language cannot (applying the so-called "gauntlet of time", for example). Others support the Mothra repository partitioning system. Finally, a few are for debugging: they provide high-level overviews of IPFIX records or point to file locations on disk.

    Functional fields are all defined and described in the org.cert.netsa.mothra.datasources.ipfix.fields.func package.
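
    Functional fields are requested through the same fields mechanism as everything else. For example, using the func: definitions that appear in the default field set above (the path is a placeholder):

    val df = spark.read
      .fields("startTime" -> "func:startTime",
              "sourceIPAddress" -> "func:sourceIPAddress")
      .ipfix("/path/to/ipfix/files")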

    Definition Classes
    datasources
  • package fields

    Most of these classes and traits relate to the definition of IPFIX fields as IPFIX record processing objects.

    The IPFIXFieldParsing object defines the parser used for IPFIX field expressions, and includes the documentation for that language.

    Other mechanisms, including implementations of the IPFIXField trait, provide the ability to define new "function" fields and register them into the Func registry. This is an experimental capability and is likely to be deprecated and then removed from public access in the future.

    Note

    This is an experimental interface and is likely to be removed or made private in a future version.

  • Constraint
  • Debug
  • DefaultSource
  • IPFIXFields
  • SlicingRelationSparkVerImpl
  • package silk
    Definition Classes
    datasources

package ipfix

A data source as defined by the Spark Data Source API for reading IPFIX records from Mothra data spools and from loose files. (The full description of this package appears in the package listing above.)


Package Members

  1. package fields

    Most of these classes and traits relate to the definition of IPFIX fields as IPFIX record processing objects.

    The IPFIXFieldParsing object defines the parser used for IPFIX field expressions, and includes the documentation for that language.

    Other mechanisms, including implementations of the IPFIXField trait, provide the ability to define new "function" fields and register them into the Func registry. This is an experimental capability and is likely to be deprecated and then removed from public access in the future.

    Note

    This is an experimental interface and is likely to be removed or made private in a future version.

Type Members

  1. sealed trait Constraint extends ConstraintSparkVerImpl

    A constraint on the value of a field within a storage partition. For example, a partition with the constraint Constraint.EQ("protocolIdentifier", "6") on it would guarantee that every record in that partition has the value of 6 for its protocolIdentifier field.

    Constraints are typically matched against the filters of a specific query to determine whether the combination of filters and constraints can produce any matches, or whether the partition may be discarded entirely.

    This mechanism is exposed so that function fields may override the constraints used for evaluation.
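
    For example, using the constraint named above, a sketch of how a partition could be ruled out without being read:

    // Every record in a partition carrying this constraint has
    // protocolIdentifier == 6, so a query filtering on
    // protocolIdentifier == 17 can discard the partition outright.
    val tcpOnly = Constraint.EQ("protocolIdentifier", "6")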

    Note

    This is an experimental interface and is likely to be removed or made private in a future version.

  2. class DefaultSource extends RelationProvider with StrictLogging

    A Spark datasource for working with IPFIX data. This is the entrypoint for Spark to call the datasource. See org.cert.netsa.mothra.datasources.ipfix's documentation for details on how to use it as a user.

  3. trait SlicingRelationSparkVerImpl extends AnyRef

Value Members

  1. def IPFIX(labels: String*): FieldsSpec

    Return a set of IPFIX fields for a given set of labels or field names. Known labels and field names are listed in org.cert.netsa.mothra.datasources.ipfix.IPFIXFields.

    returns

    an org.cert.netsa.mothra.datasources.fields.FieldsSpec for the given fields

    Example:
    1. IPFIX("#default", "#dns", "httpGet")
  2. object Constraint

    The basic constraint types used by Mothra.

    Note

    This is an experimental interface and is likely to be removed or made private in a future version.

  3. object Debug

    Collects debugging information for the IPFIX datasource.

    NOTE: This may not work in all configurations.

    To use this debugging facility, provide a debugMode argument to the ipfix data source method. The argument's prefix should currently be either "slices" or "files".

    For a prefix of "slices", such as "slices/0" or "slices" or "slicesfoo", the debugging result will be a collection of Strings representing the slices to be processed by the data source. This can be used to verify how the data is being divvied up between executors.

    For a prefix of "files", such as "files/237" or "files" or "filesinfo", the debugging result will be a collection of Strings representing the locations to be processed by the data source. This can be used to verify that the correct data partitions are being selected.

    To retrieve the data for the given debugMode, index Debug with the debugMode, for example Debug("slices") or Debug("files/237") or the like. The result will be the debugging output from the last query that used that debugMode argument.
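
    A sketch of that workflow, assuming debugMode is accepted as a named argument of ipfix() as described above (the path is a placeholder):

    val df = spark.read.ipfix("/path/to/ipfix/files", debugMode = "slices")
    df.count()                    // run a query so debugging data is collected
    val slices = Debug("slices")  // the output from that query's debugMode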

  4. object DefaultSource
  5. object IPFIXFields

    Collections of fields relevant to IPFIX data sources.

    These are primarily the fields produced by YAF version 3 and up, although the same fields are produced by other tools.

    These collections and the individual fields within them may be passed to spark.read.fields directly and all of the fields in the collection will be included in the output.

    If using a REPL with tab-completion, many of the provided symbols can be discovered interactively. In addition, most of the fields and field collections provide useful information when converted to strings. For example:

    scala> import org.cert.netsa.mothra.datasources.ipfix.IPFIXFields
    import org.cert.netsa.mothra.datasources.ipfix.IPFIXFields
    
    scala> println(IPFIXFields.default)
    // Most common fields for flow-based traffic analysis
    FieldsSpec( // #default
      // Core flow fields, including time and the 5-tuple
      FieldsSpec( // #core
        // Timestamp of the first packet of this Flow
        "startTime" -> "func:startTime",
        // Timestamp of the final packet of this Flow
        "endTime" -> "func:endTime",
        // IPv4 or IPv6 source for incoming packets in this Flow
        "sourceIPAddress" -> "func:sourceIPAddress",
        // Any source port identifier from the transport header
        "sourcePort" -> "func:sourcePort",
        // IPv4 or IPv6 destination for incoming packets in this Flow
        "destinationIPAddress" -> "func:destinationIPAddress",
        // Any destination port identifier from the transport header
        "destinationPort" -> "func:destinationPort",
        // Protocol number from the IP packet header
        "protocolIdentifier"
      ),
      ...
    )

    This prints all of the fields in the "default" collection. Each field includes its definition as an IPFIX field expression and a short description. Some fields are grouped into larger collections, which are also given names and descriptions.

    Each of the field names like "startTime" may be used as an index to pull out a specific field from the collection:

    scala> println(IPFIXFields.default("startTime"))
    FieldsSpec(
      // Timestamp of the first packet of this Flow
      "startTime" -> "func:startTime"
    )

    And collection names such as "#core" or "#tcpflags" may also be used to pull out sub-collections:

    scala> println(IPFIXFields.default("#tcpflags"))
    // Fields for TCP protocol flag information
    FieldsSpec( // #tcpflags
      // TCP flags of first incoming packet of a TCP Flow
      "initialTCPFlags",
      // TCP flags of first outgoing packet of a TCP Flow
      "reverseInitialTCPFlags",
      // Union of TCP flags of all incoming packets after the first
      "unionTCPFlags",
      // Union of TCP flags of all outgoing packets after the first
      "reverseUnionTCPFlags"
    )
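
    As noted above, these collections and sub-collections may be passed to spark.read.fields directly. A sketch (the exact overload and the path are assumptions):

    val tcpFlagsDF = spark.read
      .fields(IPFIXFields.default("#tcpflags"))
      .ipfix("/path/to/ipfix/files")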
