Some organizations may need or prefer to install libraries locally rather than fetch them from Maven. This can make it easier to provide a common configuration to multiple local users.
The easiest way to run the spark-shell interactive CLI tool with the Mothra libraries is to include the mothra_2.12-1.6.0-full.jar file in the --jars option to spark-shell:
$ spark-shell --jars path/to/mothra_2.12-1.6.0-full.jar
If you have additional jars, include them all in a single --jars option, separated by commas, like:
$ spark-shell --jars path/to/first.jar,path/to/second.jar
The --jars switch may also be used with the spark-submit command for non-interactive jobs that need the Mothra libraries.
When you use --jars, Spark automatically copies the jar files to every node that needs them for the job (for either spark-shell or spark-submit).
You can also make a version of Mothra available as a system-wide default by editing your spark-defaults.conf file to include:
spark.driver.extraClassPath /full/path/to/mothra_2.12-1.6.0-full.jar
spark.executor.extraClassPath /full/path/to/mothra_2.12-1.6.0-full.jar
Unlike the comma-separated --jars option, these values are classpaths, with entries separated by your platform's path separator (: on Linux and macOS), and you'll want to preserve any existing contents.
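As a sketch, here is what the resulting spark-defaults.conf entries might look like if some other jar were already on the classpath (the /existing/libs/other.jar path is hypothetical, and the colon separator assumes Linux or macOS):

```
spark.driver.extraClassPath   /existing/libs/other.jar:/full/path/to/mothra_2.12-1.6.0-full.jar
spark.executor.extraClassPath /existing/libs/other.jar:/full/path/to/mothra_2.12-1.6.0-full.jar
```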
When using spark-defaults.conf, Spark will not automatically copy the jar file for you. You must make sure that the jar file is present at the same expected path on every machine that might need it. (This includes any machine where either a Spark driver or a Spark executor might run.)
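One low-tech way to stage the jar is a loop over your cluster hosts. The sketch below uses hypothetical hostnames (node1 through node3) and, as a safety measure, only prints the scp commands it would run; remove the echo to actually copy the file.

```shell
#!/bin/sh
# Sketch: stage the Mothra jar at the identical path on every machine
# that may run a Spark driver or executor. Hostnames are hypothetical.
JAR=/full/path/to/mothra_2.12-1.6.0-full.jar
HOSTS="node1 node2 node3"
for h in $HOSTS; do
  # Print the command; drop the 'echo' to perform the copy for real.
  echo scp "$JAR" "$h:$JAR"
done
```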
Note that while making Mothra available system-wide in a Spark installation should cause everything that uses that Spark installation to work with Mothra, some notebook server configurations may by default use their own Spark installation. You will want to consult the documentation for your notebook server to determine how to configure it to use the system-wide Spark installation, or how to configure its built-in Spark installation to use the locally installed Mothra libraries.
These so-called "full" jars contain embedded dependencies so that no additional files must be downloaded to use Mothra. We've done what we can to ensure that the provided jar files work correctly with a wide range of Spark and Hadoop versions, but it's possible that something will go wrong. If you encounter difficulties, please consider the following steps to check for a version mismatch, and report your results with any bug report.
The "full" jars are built with the aim of including only the dependencies that are not known to be provided by Spark. Our testing suggests that one build for each Scala version and each Spark version which supports it is sufficient. Make sure that you use the appropriate version for your Spark installation.
A quick and easy way to test Spark with Mothra using this jar is to run the following command, which loads the Mothra libraries (and demonstrates the available version):
$ spark-shell --jars mothra_2.12-1.6.0-full.jar
...
scala> org.cert.netsa.util.versionInfo("mothra")
res0: Option[String] = Some(1.6.0)
scala> org.cert.netsa.util.versionInfo.detail("mothra")
res1: Option[String] = Some(mothra (Scala 2.12.15) (Spark 3.2.1))
Examine the version number to make sure you're using the correct version of Mothra. If it doesn't match the version you intended to test, a previously installed version may be taking priority, and you should check your configuration.
If something goes wrong, the version details show you precisely what versions of Scala and Spark were used for building and testing this jar file. In the above example, Scala 2.12.15 and Spark 3.2.1 were used.
(Note that Scala versions with the same first two numbers should always be binary compatible, but versions differing in the second number are not. That is: 2.11.11 and 2.11.12 are compatible; 2.12.1 and 2.12.15 are compatible; 2.11.x and 2.12.x are not.)
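To make that rule concrete, here is a small shell sketch (the helper name scala_binary_version is made up for illustration) that reduces a Scala version to its binary-compatibility prefix, i.e. its first two components, and compares two versions:

```shell
#!/bin/sh
# Reduce a Scala 2.x version to its binary-compatibility prefix
# (the first two components, e.g. 2.12.15 -> 2.12).
scala_binary_version() {
  printf '%s\n' "$1" | cut -d. -f1-2
}

# 2.12.1 and 2.12.15 share the prefix 2.12, so they are compatible.
if [ "$(scala_binary_version 2.12.1)" = "$(scala_binary_version 2.12.15)" ]; then
  echo "2.12.1 and 2.12.15: compatible"
fi

# 2.11.12 has prefix 2.11, which does not match 2.12, so it is not.
if [ "$(scala_binary_version 2.11.12)" != "$(scala_binary_version 2.12.15)" ]; then
  echo "2.11.12 and 2.12.15: incompatible"
fi
```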
Next, define a tiny sample query to make sure that Spark can construct a query correctly:
scala> import org.cert.netsa.mothra.datasources._
import org.cert.netsa.mothra.datasources._
scala> val df = spark.read.ipfix("fccx-sample.ipfix")
df: org.apache.spark.sql.DataFrame = [startTime: timestamp, endTime: timestamp \
... 17 more fields]
(This specific data file is available in Sample Data, but any IPFIX file will do.)
If you are using a version of Spark built with a different version of Scala than Mothra (for example, you're using Scala 2.12, but the full jar was built for 2.11), you are very likely to encounter an error here, something like:
java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
at org.cert.netsa.mothra.datasources.fields.Field.<init>(Field.scala:6)
...
Next, you should try to actually run a query to make sure that data is decoded correctly:
scala> df.show(1, 0, true)
-RECORD 0-----------------------------------------
startTime | 2015-09-14 14:55:20.568
...
only showing top 1 row
scala> df.count
res2: Long = 428
If nothing up to this point causes a failure, it's likely there's a deeper problem than dependency issues. Either way, please include the results of your investigation with your bug report; we can use them to track down the source of the problem. It would also be very helpful to include the output from the following command:
$ spark-shell --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.2.1
/_/
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_332
Branch HEAD
Compiled by user hgao on 2022-01-20T19:26:14Z
Revision 4f25b3f71238a00508a356591553f2dfa89f8290
Url https://github.com/apache/spark
Type --help for more information.
Hadoop may not be installed as a separate package; if it is not, please note that in your report, and if it is, please also include the output of the following command:
$ hadoop version
Hadoop 2.10.1
Subversion https://github.com/apache/hadoop -r 1827467c9a56f133025f28557bfc2c56\
2d78e816
Compiled by centos on 2020-09-14T13:17Z
Compiled with protoc 2.5.0
From source with checksum 3114edef868f1f3824e7d0f68be03650
This command was run using .../hadoop-2.10.1/share/hadoop/common/hadoop-common-\
2.10.1.jar