Installing Mothra libraries locally

Some organizations may need or prefer to install libraries locally rather than fetch them from Maven. This can make it easier to provide a common configuration to multiple local users.

Using local Mothra libraries for a single user

The easiest way to run the interactive spark-shell tool with the Mothra libraries is to include the mothra_2.12-1.6.0-full.jar file in the --jars option to spark-shell:


    $ spark-shell --jars path/to/mothra_2.12-1.6.0-full.jar
    

If you need to include additional jars, list them all in a single --jars option, separated by commas:


    $ spark-shell --jars path/to/first.jar,path/to/second.jar
    

The --jars switch may also be used with the spark-submit command for non-interactive jobs that need the Mothra libraries.
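
For example, a non-interactive job might be submitted like this (the application jar my-analysis.jar and its main class com.example.MyAnalysis are hypothetical placeholders for your own application):


    $ spark-submit --class com.example.MyAnalysis \
        --jars path/to/mothra_2.12-1.6.0-full.jar \
        my-analysis.jar
    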

When you use --jars, Spark will automatically copy the jar files to every node that needs them for the job (for either spark-shell or spark-submit).

Configuring Spark system-wide to use Mothra

You can also make a version of Mothra available as a system-wide default by editing your spark-defaults.conf file to include:


    spark.driver.extraClassPath /full/path/to/mothra_2.12-1.6.0-full.jar
    spark.executor.extraClassPath /full/path/to/mothra_2.12-1.6.0-full.jar
    

Unlike the --jars option, these settings are Java classpath strings: their entries are separated by the platform path separator (a colon on Linux and macOS), and you'll want to preserve any existing contents.
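
For example, to add Mothra while preserving a hypothetical existing entry /opt/jars/existing.jar, the settings might look like:


    spark.driver.extraClassPath /opt/jars/existing.jar:/full/path/to/mothra_2.12-1.6.0-full.jar
    spark.executor.extraClassPath /opt/jars/existing.jar:/full/path/to/mothra_2.12-1.6.0-full.jar
    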

When using spark-defaults.conf, Spark will not automatically copy the jar file for you. You must make sure that the jar file is present at the same expected path on every machine that might need it. (This includes any machine where either a Spark driver or a Spark executor might be run.)

Spark configuration for notebook servers

Note that while making Mothra available system-wide in a Spark installation should cause everything that uses that installation to work with Mothra, some notebook servers are configured by default to use their own bundled Spark installation. Consult the documentation for your notebook server to determine how to point it at the system-wide Spark installation, or how to configure its built-in Spark installation to use the locally installed Mothra libraries.
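
For example, many Jupyter-based notebook setups (such as kernels using findspark or Apache Toree) locate Spark through the SPARK_HOME environment variable. A minimal sketch, assuming your system-wide Spark installation lives at /opt/spark:


    $ export SPARK_HOME=/opt/spark
    $ jupyter notebook
    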

Troubleshooting version mismatches

These so-called "full" jars contain embedded dependencies so that no additional files must be downloaded to use Mothra. We've done what we can to ensure that the provided jar files work correctly with a wide range of Spark and Hadoop versions, but it's possible that something will go wrong. If you encounter difficulties, please work through the following steps to check for a version mismatch, and include your results in any bug report.

The "full" jars are built with the aim of including only the dependencies that are not known to be provided by Spark. Our testing suggests that one build for each supported combination of Scala and Spark versions is sufficient. Make sure that you use the build appropriate for your Spark installation; the _2.12 in mothra_2.12-1.6.0-full.jar indicates the Scala binary version the jar was built for.

A quick and easy way to test Spark with Mothra using this jar is to run the following command, which loads the Mothra libraries and reports the available version:


    $ spark-shell --jars mothra_2.12-1.6.0-full.jar
    ...
    scala> org.cert.netsa.util.versionInfo("mothra")
    res0: Option[String] = Some(1.6.0)

    scala> org.cert.netsa.util.versionInfo.detail("mothra")
    res1: Option[String] = Some(mothra (Scala 2.12.15) (Spark 3.2.1))
    

Examine the version number to make sure you're using the correct version of Mothra. If it doesn't match the version you intended to test, a previously installed version may be taking priority, and you should check your configuration.
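
One way to check which jar the Mothra classes are actually being loaded from is to ask the JVM directly. This uses standard Java reflection (getCodeSource may return null under some classloaders), and the path in the output below is only illustrative:


    scala> org.cert.netsa.util.versionInfo.getClass.getProtectionDomain.getCodeSource.getLocation
    res2: java.net.URL = file:/path/to/mothra_2.12-1.6.0-full.jar
    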

If something goes wrong, the version details show you precisely which versions of Scala and Spark were used for building and testing this jar file. In the above example, Scala 2.12.15 and Spark 3.2.1 were used.

(Note that Scala versions with the same first two numbers should always be binary compatible, but if the second number differs they will not be. That is: 2.11.11 and 2.11.12 are compatible, and 2.12.1 and 2.12.15 are compatible, but 2.11.x and 2.12.x are not.)
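
You can also check the Scala version of the running shell itself and compare it against the jar's build details (the value shown is illustrative):


    scala> scala.util.Properties.versionNumberString
    res3: String = 2.12.15
    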

Next, define a tiny sample query to make sure that Spark can construct a query correctly:


    scala> import org.cert.netsa.mothra.datasources._
    import org.cert.netsa.mothra.datasources._

    scala> val df = spark.read.ipfix("fccx-sample.ipfix")
    df: org.apache.spark.sql.DataFrame = [startTime: timestamp, endTime: timestamp \
    ... 17 more fields]
    

(This specific data file is available in Sample Data, but any IPFIX file will do.)

If your version of Spark uses a different version of Scala than your Mothra jar was built for (for example, your Spark uses Scala 2.12, but the full jar was built for 2.11), you are very likely to encounter an error here, something like:


    java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
      at org.cert.netsa.mothra.datasources.fields.Field.<init>(Field.scala:6)
    ...
    

Next, try actually running a query to make sure that the data is decoded correctly:


    scala> df.show(1, 0, true)
    -RECORD 0-----------------------------------------
     startTime              | 2015-09-14 14:55:20.568
    ...
    only showing top 1 row

    scala> df.count
    res2: Long = 428
    

If nothing up to this point causes a failure, it's likely there's a deeper problem than dependency issues. Either way, please report the results of your investigation with your bug report; they will help us find the source of the problem. It would also be very helpful to include the output from the following command:


    $ spark-shell --version
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
          /_/

    Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_332
    Branch HEAD
    Compiled by user hgao on 2022-01-20T19:26:14Z
    Revision 4f25b3f71238a00508a356591553f2dfa89f8290
    Url https://github.com/apache/spark
    Type --help for more information.
    

Hadoop may not be installed as a separate package on your system. If it is not, please note that in your report; if it is, please also include the output of the following command:


    $ hadoop version
    Hadoop 2.10.1
    Subversion https://github.com/apache/hadoop -r 1827467c9a56f133025f28557bfc2c56\
    2d78e816
    Compiled by centos on 2020-09-14T13:17Z
    Compiled with protoc 2.5.0
    From source with checksum 3114edef868f1f3824e7d0f68be03650
    This command was run using .../hadoop-2.10.1/share/hadoop/common/hadoop-common-\
    2.10.1.jar