Installing Mothra libraries locally

Some organizations may need or prefer to install libraries locally rather than fetch them from Maven. This can make it easier to provide a common configuration to multiple local users. To do this, generate an “assembly” using the Mothra build scripts in a location with Internet access: run ./mill show 'mothra.cross[2.12.20].assembly', then copy the resulting jar file to a name like mothra_2.12-1.7.0-full.jar (using whatever version numbers are appropriate for your installation).
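
For example, the whole sequence might look like the following. The output path printed by mill will vary with your build, so the path shown here is only illustrative:


    $ ./mill show 'mothra.cross[2.12.20].assembly'
    "ref:.../out/mothra/2.12.20/assembly.dest/out.jar"
    $ cp out/mothra/2.12.20/assembly.dest/out.jar mothra_2.12-1.7.0-full.jar
    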

Using local Mothra libraries for a single user

The easiest way to run the interactive spark-shell tool with the Mothra libraries is to include the mothra_2.12-1.7.0-full.jar file in the --jars option to spark-shell:


    $ spark-shell --jars path/to/mothra_2.12-1.7.0-full.jar
    

If you have additional jars to include, list them all in a single --jars option, separated by commas:


    $ spark-shell --jars path/to/first.jar,path/to/second.jar
    

The --jars switch may also be used with the spark-submit command for non-interactive jobs that need the Mothra libraries.
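
For example, a batch job might be submitted like this, where the application jar and its main class are placeholders for your own job:


    $ spark-submit --jars path/to/mothra_2.12-1.7.0-full.jar \
          --class com.example.MyMothraJob path/to/my-job.jar
    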

When you use --jars, Spark will automatically copy the jar files to every node that needs them for the job (for either spark-shell or spark-submit).

Configuring Spark system-wide to use Mothra

You can also make a version of Mothra available as a system-wide default by editing your spark-defaults.conf file to include:


    spark.driver.extraClassPath /full/path/to/mothra_2.12-1.7.0-full.jar
    spark.executor.extraClassPath /full/path/to/mothra_2.12-1.7.0-full.jar
    

Unlike the command-line --jars option, these values are classpath strings: multiple entries are separated by colons (on Unix-like systems), not commas. If these properties already have values, you'll want to preserve the existing entries.
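
For example, if spark.driver.extraClassPath already listed another jar (shown here as the hypothetical /existing/path/other.jar), the combined entries would look like:


    spark.driver.extraClassPath /existing/path/other.jar:/full/path/to/mothra_2.12-1.7.0-full.jar
    spark.executor.extraClassPath /existing/path/other.jar:/full/path/to/mothra_2.12-1.7.0-full.jar
    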

When using spark-defaults.conf, Spark will not automatically copy the jar file for you. You must make sure that the jar file is present at the same expected path on every machine that might need it. (This includes any machine where either a Spark driver or a Spark executor might be run.)
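
One way to do this is a simple copy loop; this sketch assumes hosts named node1 through node3 and working ssh access, so substitute your own host list and destination path:


    $ for host in node1 node2 node3; do
    >     scp /full/path/to/mothra_2.12-1.7.0-full.jar \
    >         "$host":/full/path/to/mothra_2.12-1.7.0-full.jar
    > done
    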

Spark configuration for notebook servers

Note that while making Mothra available system-wide in a Spark installation should allow everything that uses that installation to work with Mothra, some notebook server configurations use their own bundled Spark installation by default. Consult the documentation for your notebook server to determine how to configure it to use the system-wide Spark installation, or how to configure its built-in Spark installation to use the locally installed Mothra libraries.
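
As a starting point, many notebook kernels that launch Spark honor the SPARK_HOME environment variable, so something like the following may be enough; whether it is depends entirely on your notebook server:


    $ export SPARK_HOME=/path/to/system-wide/spark
    $ jupyter notebook
    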

Troubleshooting version mismatches

These so-called "full" jars contain embedded dependencies so that no additional files must be downloaded to use Mothra. Embedding dependencies can occasionally produce a version mismatch that needs to be resolved. If this happens, please include your build and runtime versions in any bug report.

A quick and easy way to test Spark with Mothra using this jar is to run the following commands, which load the Mothra libraries and display the available version:


    $ spark-shell --jars mothra_2.12-1.7.0-full.jar
    ...
    scala> org.cert.netsa.util.versionInfo("mothra")
    res0: Option[String] = Some(1.7.0)

    scala> org.cert.netsa.util.versionInfo.detail("mothra")
    res1: Option[String] = Some(mothra (Scala 2.12.20))
    

Examine the version number to make sure you're using the correct version of Mothra. If it doesn't match the version you intended to test, a previously installed version may be taking priority, and you should check your configuration.
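
One way to see exactly which jar a Mothra class was actually loaded from is a standard JVM technique (not a Mothra API); the path in the result here is illustrative, and res numbers will vary with what you've already run:


    scala> classOf[org.cert.netsa.mothra.datasources.fields.Field].getProtectionDomain.getCodeSource.getLocation
    res0: java.net.URL = file:/path/to/mothra_2.12-1.7.0-full.jar
    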

If something goes wrong, looking at the version details shows you precisely what version of Scala was used for building and testing this jar file. In the above example, Scala 2.12.20 was used.

(Note that Scala versions with the same first two numbers should always be binary compatible, but versions that differ in the second number are not. That is: 2.13.5 and 2.13.16 are compatible, and 2.12.1 and 2.12.20 are compatible, but 2.12.x and 2.13.x are not.)
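
You can check the Scala version of the running shell directly, using the standard Scala library (again, the res number will vary):


    scala> scala.util.Properties.versionString
    res1: String = version 2.12.20
    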

Next, define a tiny sample query to make sure that Spark can construct a query correctly:


    scala> import org.cert.netsa.mothra.datasources._
    import org.cert.netsa.mothra.datasources._

    scala> val df = spark.read.ipfix("fccx-sample.ipfix")
    df: org.apache.spark.sql.DataFrame = [startTime: timestamp, endTime: timestamp \
    ... 17 more fields]
    

(This specific data file is available in Sample Data, but any IPFIX file will do.)
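
You can also print the full schema to see every field by name rather than the truncated column listing; the output here is abbreviated, and the exact fields depend on the data:


    scala> df.printSchema()
    root
     |-- startTime: timestamp (nullable = true)
     |-- endTime: timestamp (nullable = true)
    ...
    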

If you are using a version of Spark built with a different version of Scala than Mothra (for example, your Spark uses Scala 2.13, but the full jar was built for 2.12), you are very likely to encounter an error here, something like:


    java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
      at org.cert.netsa.mothra.datasources.fields.Field.<init>(Field.scala:6)
    ...
    

Next, run an actual query to make sure that data is decoded correctly:


    scala> df.show(1, 0, true)
    -RECORD 0-----------------------------------------
     startTime              | 2015-09-14 14:55:20.568
    ...
    only showing top 1 row

    scala> df.count
    res2: Long = 428
    
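
To exercise the decoding a little further, a simple projection using column names from the schema also works; the output is elided here:


    scala> df.select("startTime", "endTime").show(2)
    ...
    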

If nothing up to this point causes a failure, the problem likely runs deeper than dependency issues. Either way, please report the results of your investigation in your bug report; they will help us track down the source of the problem. It is also very helpful to include the output of the following command:


    $ spark-shell --version
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.5.4
          /_/

    Using Scala version 2.13.8, OpenJDK 64-Bit Server VM, 17.0.12
    Branch HEAD
    Compiled by user yangjie01 on 2024-12-17T04:17:18Z
    Revision a6f220d951742f4074b37772485ee0ec7a774e7d
    Url https://github.com/apache/spark
    Type --help for more information.