NAME

mothra-invariantpacker - Load and partition pre-split IPFIX data in a Mothra repository

SYNOPSIS

  mothra-invariantpacker [--help] [--version]

  mothra-invariantpacker --incoming-dir=DIR --outgoing-dir=DIR
                         --packing-logic=FILE
                         [--compression=CODEC]
                         [--file-cache-size=N]
                         [--max-input-age=N]
                         [--max-threads=N]
                         [--maximum-size=N]
                         [--min-input-count=N]
                         [--min-input-size=N]
                         [--observation-domain-id=ID]
                         [--one-shot]
                         [--output-idle-time=N]
                         [--polling-interval=N]

DESCRIPTION

While running, mothra-invariantpacker scans the incoming directory (--incoming-dir) for IPFIX files created by super_mediator running in "invariant" mode. It splits the IPFIX records in each file into output files under --outgoing-dir in a time-based directory structure based on the partitioning rules in the packing logic file (--packing-logic).

The incoming directory, outgoing directory, and packing logic file may all be specified as Hadoop filesystem URIs. The working directory must be a local directory, and must be specified using a path, not a URL.

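As incoming files arrive, mothra-invariantpacker does not begin processing input or opening output files for a partition until at least one of the following conditions is satisfied: an available input file for the partition is older than --max-input-age, the number of input files for the partition reaches --min-input-count, or the cumulative size of those input files reaches --min-input-size.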

After processing begins for a partition, all current input files for that partition will be processed, and further input files that arrive will also be processed. If no new files arrive for more than the time specified by --output-idle-time, then the output file will be closed and the partition will wait until the conditions described above apply again before continuing processing with a new output file.

In addition, output files may be closed if their size exceeds the value of --maximum-size, or if the size of the file cache is exceeded and a new file must be opened.
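The start conditions can be sketched in Python as follows. This is an illustration only, not the packer's actual code: the PendingInput record and the should_start_output function are hypothetical names, the use of "greater than or equal" comparisons is an assumption, and the 1MB default is taken here to mean 1,000,000 octets.

```python
from dataclasses import dataclass


@dataclass
class PendingInput:
    """One input file waiting for a partition (hypothetical record)."""
    size: int    # file size in octets
    age: float   # seconds since the file became available


def should_start_output(pending, max_input_age=900, min_input_count=3,
                        min_input_size=1_000_000):
    """Return True when any one of the three start conditions holds.

    Defaults mirror the documented ones: 15 minutes, 3 files, 1MB
    (assumed here to be 1,000,000 octets).
    """
    if not pending:
        return False
    oldest = max(f.age for f in pending)
    total_size = sum(f.size for f in pending)
    return (oldest >= max_input_age
            or len(pending) >= min_input_count
            or total_size >= min_input_size)
```

Any single condition is enough, so a lone small file still triggers output once it has waited for the --max-input-age period.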

OPTIONS

--compression=CODEC

If --compression is provided, files written to HDFS will use the compression codec named CODEC. If this codec cannot be found, mothra-invariantpacker will exit with an error. Values typically supported by Hadoop include bzip2, gzip, lz4, lzo, lzop, snappy, and default. If the value "none" or the empty string is given, or if this option is not specified, no compression will be used.

--file-cache-size=N

When --file-cache-size is specified, it determines the maximum number of open output files maintained by the file cache for writing to --outgoing-dir. This only limits the number of files open simultaneously for writing. Once the cache reaches this number of open files and the invariant packer needs to open a file, it closes the least-recently-used file. This limit does not include the files required to read incoming files. If --file-cache-size is not given, the default value of 2000 is used. This value must be at least 128, and the invariant packer will refuse to operate with a smaller value.
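The least-recently-used behavior described above can be sketched in Python as follows. The OutputFileCache class and its opener callback are hypothetical names used for illustration; the packer's real cache is not shown in this document and may differ.

```python
from collections import OrderedDict


class OutputFileCache:
    """Keep at most max_open output handles, closing the least recently used."""

    def __init__(self, max_open=2000):
        if max_open < 128:
            # Mirrors the documented minimum for --file-cache-size.
            raise ValueError("file cache size must be at least 128")
        self.max_open = max_open
        self._open = OrderedDict()  # path -> handle, least recently used first

    def get(self, path, opener):
        """Return the handle for path, opening it (and evicting) if needed."""
        if path in self._open:
            self._open.move_to_end(path)  # mark as most recently used
            return self._open[path]
        if len(self._open) >= self.max_open:
            _, oldest_handle = self._open.popitem(last=False)
            oldest_handle.close()
        handle = opener(path)
        self._open[path] = handle
        return handle
```

A file evicted this way is simply closed; later records for the same partition would go to a newly opened output file.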

--incoming-dir=DIR

The required --incoming-dir option specifies the directory to watch for incoming files that are ready to be processed. Files in this directory are read as IPFIX input files to be packed if they are non-empty (have a length greater than zero), do not have a filename beginning with a dot (.) character, and match the pattern of a super_mediator invariant file.

Input filenames are expected to match a pattern such as *-inv-year-????-month-??-day-??-hour-??-*.med. Files which do not match this pattern are ignored. Files which do match this pattern will be packed on the assumption that the date information in the filename correctly describes the records contained within the file.
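As a rough illustration of how such a filename could be checked and its date fields recovered, consider the Python sketch below. The function name invariant_file_hour and the sample filenames are hypothetical, and requiring digits in the date fields is an assumption that is stricter than the glob pattern itself. The sketch looks only at the name, not at the file length.

```python
import re
from datetime import datetime
from fnmatch import fnmatchcase

# Glob pattern quoted in the text above.
INVARIANT_GLOB = "*-inv-year-????-month-??-day-??-hour-??-*.med"

# Assumption: the date fields are decimal digits.
DATE_FIELDS = re.compile(
    r"-inv-year-(\d{4})-month-(\d{2})-day-(\d{2})-hour-(\d{2})-")


def invariant_file_hour(filename):
    """Return the hour named in filename, or None if the file is ignored."""
    if filename.startswith(".") or not fnmatchcase(filename, INVARIANT_GLOB):
        return None
    match = DATE_FIELDS.search(filename)
    if match is None:
        return None
    year, month, day, hour = map(int, match.groups())
    try:
        return datetime(year, month, day, hour)
    except ValueError:  # e.g. month 13 or hour 25
        return None
```

A check like this can be useful in a delivery script to confirm that a file will be recognized before it is moved into the incoming directory.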

Note that if you create files in this directory and then write data to them, those files may be processed by the packer before they are completely written. Always use filenames beginning with . while writing, or move already complete files into this directory.
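One way to follow this advice when the incoming directory is on a local filesystem is sketched below in Python. The deliver_file function is a hypothetical helper, not part of Mothra; for a Hadoop filesystem, the equivalent rename operation of the Hadoop client would be needed instead of os.replace.

```python
import os


def deliver_file(data, incoming_dir, final_name):
    """Write data under a dot-prefixed name, then rename it into place.

    The packer ignores names beginning with a dot, so it only sees the
    file once it is complete.
    """
    hidden_path = os.path.join(incoming_dir, "." + final_name)
    final_path = os.path.join(incoming_dir, final_name)
    with open(hidden_path, "wb") as out:
        out.write(data)
        out.flush()
        os.fsync(out.fileno())  # make sure the bytes are on disk first
    # A rename within one directory is atomic on POSIX filesystems.
    os.replace(hidden_path, final_path)
    return final_path
```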

--max-input-age=N

The --max-input-age option determines the age (in seconds) of an available input file that will force output to begin. Output will continue until all input files for the given output file have been processed, and an amount of time specified by --output-idle-time has passed. The --min-input-count and --min-input-size options may also cause output to begin. If this option is not specified, the default value of 15 minutes will be used.

--max-threads=N

When --max-threads is specified, it determines the number of output threads to run simultaneously, and therefore the number of output files to write to simultaneously. If this option is not specified, the default value of 6 is used.

--maximum-size=N

Specifying the --maximum-size option will cause output files to be closed when their size goes over N octets. Typically, a file's size will not exceed this value by more than the maximum size of an IPFIX message (64kB). This value may not be less than 512kB. When this option is not specified, an output file will not be closed due to its size, but may be closed based on the values of --output-idle-time and --file-cache-size.

--min-input-count=N

When --min-input-count is specified, it determines the number of input files for a given output file which will trigger output to begin. Output will continue until all input files for the given output file have been processed, and an amount of time specified by --output-idle-time has passed. The --max-input-age and --min-input-size options may also cause output to begin. If this option is not specified, the default value of 3 files is used.

--min-input-size=N

Specifying the --min-input-size option determines the cumulative size of input files (in octets) for a given output file which will trigger output to begin. Output will continue until all input files for the given output file have been processed, and an amount of time specified by --output-idle-time has passed. The --max-input-age and --min-input-count options may also cause output to begin. If this option is not specified, the default value of 1MB is used.

--observation-domain-id=ID

When specified, --observation-domain-id determines the Observation Domain ID used in stored IPFIX data. All output files produced by mothra-invariantpacker will use this same value, regardless of their original source. If this option is not specified, the default value of 0 (zero) is used.

--one-shot

When --one-shot is included on the command line, --incoming-dir is only scanned one time. Once all files in --incoming-dir have been packed (or they fail to be packed after some number of attempts), mothra-invariantpacker exits.

--outgoing-dir=DIR

The --outgoing-dir option, which must be provided, determines the root location of the outgoing IPFIX repository. Packed files are written under this location, which is given as a Hadoop filesystem URL. They are organized in a directory structure with YYYY/MM/DD at the top, followed by progressively finer partitioning levels deeper in the tree.
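The date portion of the layout can be illustrated with a short Python sketch. The date_directory function and the example URL are hypothetical; the deeper partition levels come from the packing logic file and are not modeled here.

```python
import posixpath
from datetime import datetime


def date_directory(outgoing_dir, hour):
    """Return the YYYY/MM/DD directory under outgoing_dir for a given hour."""
    return posixpath.join(outgoing_dir, f"{hour:%Y/%m/%d}")


# Hypothetical repository root, for illustration only.
example = date_directory("hdfs://namenode/mothra/repo",
                         datetime(2024, 3, 7, 15))
```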

--output-idle-time=N

Specifying --output-idle-time determines the maximum amount of time (in seconds) to allow an idle output file to remain open so that additional incoming input records may be appended to it. This value may not be less than 1 minute. If this option is not specified, the default value of 15 minutes is used.
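The size and idle-time rules for closing an output file can be summarized in a small Python sketch. The should_close_output function is a hypothetical name, and the sketch leaves out the third reason a file may be closed, which is eviction from the file cache.

```python
def should_close_output(size, idle_seconds, maximum_size=None,
                        output_idle_time=900):
    """Return True if an open output file should be closed.

    size is the current file size in octets; idle_seconds is the time
    since input last arrived for the file. maximum_size=None models the
    case where --maximum-size is not specified.
    """
    if maximum_size is not None and size > maximum_size:
        return True
    return idle_seconds > output_idle_time
```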

--packing-logic=FILE

The required option --packing-logic specifies a Hadoop filesystem URL to a file containing Scala source code that determines how records are packed into outgoing files. See the documentation website for examples.

--polling-interval=N

If --polling-interval is provided, it determines how long the main thread sleeps (in seconds) between scans (polls) of the incoming directory to check for new IPFIX files to process. If this option is not specified then the incoming directory is scanned every 15 seconds.
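The relationship between --polling-interval and --one-shot can be pictured with the simplified Python loop below. The poll_incoming function and its scan, sleep, and max_scans parameters are hypothetical and exist only to make the sketch testable; the real packer also waits for packing to finish before exiting in one-shot mode, which is not modeled.

```python
import time


def poll_incoming(scan, polling_interval=15, one_shot=False,
                  sleep=time.sleep, max_scans=None):
    """Call scan() repeatedly, sleeping polling_interval seconds between calls.

    With one_shot=True the directory is scanned exactly once. max_scans is
    an illustrative stop condition; a long-running service would omit it.
    Returns the number of scans performed.
    """
    scans = 0
    while True:
        scan()
        scans += 1
        if one_shot or (max_scans is not None and scans >= max_scans):
            return scans
        sleep(polling_interval)
```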