NAME

mothra-packer - Load and partition IPFIX data in a Mothra repository

SYNOPSIS

  mothra-packer [--help] [--version]

  mothra-packer --incoming-dir=DIR --outgoing-dir=DIR
                --packing-logic=FILE --work-dir=LOCAL-DIR
                [--archive-dir=DIR] [--check-interval=N]
                [--compression=CODEC] [--file-cache-size=N]
                [--hours-per-file=N] [--max-pack-jobs=N]
                [--maximum-age=N] [--maximum-size=N] [--minimum-age=N]
                [--minimum-size=N] [--num-move-threads=N] [--one-shot]
                [--pack-attempts=N] [--polling-interval=N]

DESCRIPTION

While running, mothra-packer scans the incoming directory (--incoming-dir) for IPFIX files. It splits the IPFIX records in each file into output files under --outgoing-dir in a time-based directory structure based on the partitioning rules in the packing logic file (--packing-logic).

The incoming directory, outgoing directory, and packing logic file may all be specified as Hadoop filesystem URIs. The working directory must be a local file-system path.

Output files are initially created on local disk in the working directory (--work-dir), and when they meet size and/or age thresholds, they are moved to the outgoing directory (--outgoing-dir).

How files are checked for completion

Every time the --check-interval passes, the sizes and ages of files in --work-dir are checked. Files that meet any one of the following criteria are closed and moved into the repository in --outgoing-dir.

The criteria are:

  * The file's size has reached --maximum-size.
  * The file's age has reached --maximum-age.
  * The file's size has reached --minimum-size and its age has reached --minimum-age.
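As an illustration, the completion check can be modeled as a small predicate. This is a sketch only, not Mothra code; the function name and the combined minimum-age/minimum-size rule are assumptions inferred from the option descriptions, and the defaults mirror the documented option defaults:

```python
def ready_to_move(size_bytes, age_seconds,
                  minimum_size=64 * 1024 * 1024,   # --minimum-size default (64MB)
                  maximum_size=100 * 1024 * 1024,  # --maximum-size default (100MB)
                  minimum_age=5 * 60,              # --minimum-age default (5 minutes)
                  maximum_age=60 * 60):            # --maximum-age default (1 hour)
    """Return True if a work file should be closed and moved to --outgoing-dir."""
    if size_bytes >= maximum_size:   # hard size limit reached
        return True
    if age_seconds >= maximum_age:   # hard age limit reached
        return True
    # Otherwise move only once both "soft" minimum thresholds are satisfied.
    return size_bytes >= minimum_size and age_seconds >= minimum_age
```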

Packing logic definitions

The packing logic configuration is Scala code which is loaded and compiled at run-time to produce partitioning configuration. You can read more at the documentation website.

OPTIONS

--archive-dir=DIR

If --archive-dir is provided then archival copies of working files are placed in DIR after the files are copied into the repository. If this option is not specified, then working files are deleted after being copied. DIR is a Hadoop file URI.

--check-interval=N

When --check-interval is given, it determines how often (in seconds) files in the work directory are checked to determine if they should be moved to the outgoing directory. If it is not given, the work directory is checked every 60 seconds. See "How files are checked for completion" for details on when files are closed and moved.

--compression=CODEC

If --compression is provided then files written to HDFS will use the compression codec named CODEC. If this codec cannot be found, mothra-packer will exit with an error. Values typically supported by Hadoop include bzip2, gzip, lz4, lzo, lzop, snappy, and default. If none or the empty string is given or if this option is not specified, no compression will be used.

--file-cache-size=N

When --file-cache-size is specified, it determines the maximum number of open output files maintained by the file cache for writing to --work-dir. The packer does not limit the number of files in --work-dir; this only limits the number of files open simultaneously for writing. Once the cache reaches this number of open files and the packer needs to (re-)open a file, the packer closes the least-recently-used file. This limit does not include the files required to read incoming files, nor to copy files from the work directory to the outgoing area. If --file-cache-size is not given, the default value of 2000 is used. This value must be at least 128, and the packer will refuse to operate with a smaller value.
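The cache behavior described here resembles a least-recently-used map of open file handles. The following is a minimal sketch of that assumed behavior using Python's standard OrderedDict, not anything from Mothra itself:

```python
from collections import OrderedDict

class FileCache:
    """Keep at most `capacity` files open; evict the least-recently-used one."""

    def __init__(self, capacity=2000):   # --file-cache-size default
        assert capacity >= 128, "the packer refuses a smaller cache"
        self.capacity = capacity
        self.open_files = OrderedDict()  # path -> handle, most recent last

    def get(self, path):
        if path in self.open_files:
            self.open_files.move_to_end(path)   # mark as recently used
            return self.open_files[path]
        if len(self.open_files) >= self.capacity:
            _, lru = self.open_files.popitem(last=False)  # close the LRU file
            lru.close()
        handle = open(path, "ab")
        self.open_files[path] = handle
        return handle
```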

--help

Print the available options and exit.

--hours-per-file=N

When --hours-per-file is provided, it determines the number of hours of data included in each packed output file. Values from 1 (one hour per file) to 24 (one day per file) are allowed. For example, if "12" is given, records are split based on whether each record began in the first half or the second half of the day. If this option is not specified, the default value of "1" (one hour per file) is used.
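The bucketing arithmetic can be pictured as follows. This sketch shows only the hour-of-day calculation implied by the description, not Mothra's actual implementation:

```python
from datetime import datetime

def file_time_span(start_time: datetime, hours_per_file: int = 1):
    """Return the (start_hour, end_hour) bucket of the day a record falls in,
    given --hours-per-file. With 12 hours per file, a record that began at
    14:30 falls in the second half of the day, i.e. hours 12 through 24."""
    assert 1 <= hours_per_file <= 24
    bucket = start_time.hour // hours_per_file
    return bucket * hours_per_file, (bucket + 1) * hours_per_file
```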

--incoming-dir=DIR

The required --incoming-dir option specifies the directory to watch for incoming files which are ready to be processed. Files in this directory which are non-empty (have a length greater than zero) and whose filenames do not begin with a dot (.) are read as IPFIX input files to be packed. Note that if you create files in this directory and then write data to them, those files may be processed by the packer before they are completely written. Always use filenames beginning with a dot while writing, or move only already-complete files into this directory.

--max-pack-jobs=N

If --max-pack-jobs is provided then it determines the maximum number of input files which may be processed simultaneously. A larger value provides more throughput. If this option is not specified, then only one input file is processed at a time.

--maximum-age=N

If --maximum-age is provided, it determines the maximum age in seconds (time since first written to) that a file may have before it is closed and moved to --outgoing-dir. If --maximum-age is not specified, the default value of one hour is used. See "How files are checked for completion" for details on when files are closed and moved.

--maximum-size=N

If --maximum-size is provided, it determines the maximum size in bytes that a file may have before it is closed and moved to --outgoing-dir. If --maximum-size is not specified, the default value of 100MB is used. See "How files are checked for completion" for details on when files are closed and moved.

--minimum-age=N

If --minimum-age is specified, it determines the minimum age in seconds (time since first written to) that a file should have before it is closed and moved to --outgoing-dir. If --minimum-age is not specified, the default value of 5 minutes is used. See "How files are checked for completion" for details on when files are closed and moved.

--minimum-size=N

If --minimum-size is provided, it determines the minimum size in bytes that a file should have before it is closed and moved to --outgoing-dir. If --minimum-size is not specified, the default value of 64MB is used. See "How files are checked for completion" for details on when files are closed and moved.

--num-move-threads=N

When --num-move-threads is provided, it determines the number of simultaneous threads used for closing work files and moving them to --outgoing-dir. At most one thread is created every --check-interval. If --num-move-threads is not specified, the default value of 4 is used.

--one-shot

When --one-shot is included on the command line, --incoming-dir is only scanned one time. Once all files in --incoming-dir have been packed (or they fail to be packed after some number of attempts), the packer exits.

--outgoing-dir=DIR

The --outgoing-dir option, which must be provided, determines the root location of the outgoing IPFIX repository. This is a Hadoop filesystem URI; packed files are moved here from the working directory when they are complete. Packed files are collected in a directory structure with YYYY/MM/DD at the root, followed by further partitioning information as the directory tree grows deeper.
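A sketch of the date-based prefix described above (only the YYYY/MM/DD portion is shown, since the deeper levels depend on the packing logic; the function name is an illustration, not part of Mothra):

```python
from datetime import datetime
from posixpath import join  # repository paths use forward slashes

def repository_prefix(outgoing_dir: str, when: datetime) -> str:
    """Build the YYYY/MM/DD prefix under --outgoing-dir for a record time."""
    return join(outgoing_dir, f"{when:%Y}", f"{when:%m}", f"{when:%d}")
```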

--pack-attempts=N

If --pack-attempts is given, it determines the number of times the packer will attempt to process each input file. If N packing attempts fail for a given input file, that file will be ignored for the remainder of this invocation of the packer. If --pack-attempts is not specified, the default value of 3 is used.
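The retry behavior amounts to a bounded attempt counter per input file. The sketch below models that behavior only; the function and parameter names are assumptions, not Mothra's API:

```python
def pack_with_retries(input_file, pack_fn, pack_attempts=3):  # --pack-attempts default
    """Try to pack `input_file` up to `pack_attempts` times. Return True on
    success, or False once the file should be ignored for this invocation."""
    for _attempt in range(pack_attempts):
        try:
            pack_fn(input_file)
            return True
        except Exception:
            continue  # this attempt failed; retry up to the limit
    return False
```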

--packing-logic=FILE

The required --packing-logic option specifies a Hadoop filesystem URI of a file containing Scala source code that determines how records are packed into outgoing files. See the documentation website for examples.

--polling-interval=N

If --polling-interval is provided, it determines how long the main thread sleeps (in seconds) between scans (polls) of the incoming directory to check for new IPFIX files to process. If this option is not specified then the incoming directory is scanned every 30 seconds.
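The main loop implied by --polling-interval and --one-shot can be sketched as follows (illustrative structure only; the names are assumptions):

```python
import time

def run(scan_incoming, polling_interval=30, one_shot=False):
    """Repeatedly scan the incoming directory, sleeping --polling-interval
    seconds between polls; with --one-shot, scan exactly once and return."""
    while True:
        scan_incoming()
        if one_shot:
            break
        time.sleep(polling_interval)
```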

--version

Print the version number and information about how Mothra was configured, then exit.

--work-dir=LOCAL-DIR

The required --work-dir option specifies the directory (using a local file-system path, not a Hadoop URL) which will be used to store files while processing and before moving the files to the outgoing data spool. Any files in this directory when the packer starts will be moved into the repository immediately.