NAME

mothra-repacker - Modify the partition structure of a Mothra repository

SYNOPSIS

  mothra-repacker [--help] [--version]

  mothra-repacker SOURCE-1 [ ... SOURCE-N ] --outgoing-dir=DIR
                  --packing-logic=FILE --work-dir=LOCAL-DIR
                  [--archive-dir=DIR] [--compression=CODEC]
                  [--file-cache-size=N] [--hours-per-file=N]
                  [--max-scan-jobs=N] [--max-threads=N] [--maximum-size=N]
                  [--readers-per-scanner=N]

DESCRIPTION

mothra-repacker makes a single recursive scan of the source directories SOURCE-1 through SOURCE-N for IPFIX files. (At least one source directory must be specified.) It splits the IPFIX records in these files into output files under --outgoing-dir in a time-based directory structure based on the partitioning rules in the packing logic file (--packing-logic).

The source directories, outgoing directory, and packing logic file may all be specified as Hadoop filesystem URIs. The working directory (--work-dir) must be a local directory, specified as a local file-system path.

Output files are initially created on local disk in the working directory (--work-dir). Once all input files have been read, the output files are moved to the outgoing directory and the original source files are removed. The outgoing directory may be or be contained within one of the source directories.

mothra-repacker always runs as a batch process, not as a daemon.
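
For example, the following command (all paths are hypothetical) repacks the repository rooted at hdfs:///data/repo into a new repository rooted at hdfs:///data/repo-new, using the rules in pack.scala and a local working directory under /var/tmp:

  mothra-repacker hdfs:///data/repo \
      --outgoing-dir=hdfs:///data/repo-new \
      --packing-logic=hdfs:///data/pack.scala \
      --work-dir=/var/tmp/repacker-work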

Some possible uses for mothra-repacker include the following; an example invocation appears after the list:

1. Changing how the records are packed---for example packing by silkAppLabel instead of protocolIdentifier.
2. Combining multiple files for an hour into a single file for that hour, merging hourly files into a file that covers a longer duration, or splitting a longer-duration file into smaller files.
3. Changing the compression algorithm used on the IPFIX files.
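
As a sketch of items 2 and 3 (all paths are hypothetical), the following command rereads the repository rooted at hdfs:///data/repo, combines each day's hourly files into a single daily file, recompresses the output with the gzip codec, and writes the results back into the same repository:

  mothra-repacker hdfs:///data/repo \
      --outgoing-dir=hdfs:///data/repo \
      --packing-logic=hdfs:///data/pack.scala \
      --work-dir=/var/tmp/repacker-work \
      --hours-per-file=24 --compression=gzip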

mothra-repacker does not currently support modifying the content of records. It only collects records in different files and path structures.

mothra-repacker uses multiple threads. By default, each source directory specified on the command line gets a dedicated thread to scan that directory and its subdirectories recursively for IPFIX files, and another thread dedicated to reading those files and repacking them. mothra-repacker does not support having multiple threads scan a single directory, but it does allow multiple threads to process a single directory's files.

The work directory (--work-dir) must not be or be contained within a source directory. To repack the files in an existing working directory, specify that directory as a source and use a different working directory. The repacker ignores any files in the working directory that exist when the repacker is started, and it ignores files placed there by other programs.

OPTIONS

--archive-dir=DIR

If --archive-dir is provided, then archival copies of working files are placed in DIR after the files are copied into the repository. If this option is not specified, working files are deleted after being copied. DIR is a Hadoop filesystem URI.

--compression=CODEC

If --compression is provided, then files written to HDFS will use the compression codec named CODEC. If this codec cannot be found, mothra-repacker will exit with an error. Values typically supported by Hadoop include bzip2, gzip, lz4, lzo, lzop, snappy, and default. If the value none or the empty string is given, or if this option is not specified, no compression will be used.

--file-cache-size=N

When --file-cache-size is specified, it determines the maximum number of open output files maintained by the file cache for writing to --work-dir. The packer does not limit the number of files in --work-dir; this only limits the number of files open simultaneously for writing. Once the cache reaches this number of open files and the packer needs to (re-)open a file, the packer closes the least-recently-used file. This limit does not include the files required to read incoming files, nor to copy files from the work directory to the outgoing area. If --file-cache-size is not given, the default value of 2000 is used. This value must be at least 128, and the packer will refuse to operate with a smaller value.

--help

Print the available options and exit.

--hours-per-file=N

When --hours-per-file is provided, it determines the number of hours of data included in each packed output file. Values from 1 (one hour per file) to 24 (one day per file) are allowed. For example, if "12" is given, records are split based on whether they began in the first half or the second half of the day. If this option is not specified, the default value of "1" (one hour per file) is used.

--max-scan-jobs=N

If --max-scan-jobs is specified, it determines the maximum number of source directories to be scanned simultaneously. Setting this to a value larger than the number of source directories has no effect. If this option is not specified, every source directory will be scanned simultaneously.

--max-threads=N

When --max-threads is specified, it determines the maximum number of threads used for scanning jobs and reading jobs combined. The default value is large enough to account for every scanner thread and every reader thread allowed by --max-scan-jobs and --readers-per-scanner, so setting it to more than this value has no effect. For example, with --max-scan-jobs=2 and --readers-per-scanner=3, the default allows 2 scanner threads and 6 reader threads.

--maximum-size=N

If --maximum-size is provided, it determines the maximum size in bytes that a file may have before it is closed and moved to --outgoing-dir. If --maximum-size is not specified, there is no maximum size. Note: If --maximum-size is given and files are being repacked into the same repository that is being read, duplicate records may temporarily appear in the repository.

--outgoing-dir=DIR

The --outgoing-dir option, which must be provided, determines the root location of the outgoing IPFIX repository. This is a Hadoop filesystem URI, and repacked files are moved here from the working directory when they are complete. Packed files are collected in a directory structure that begins with YYYY/MM/DD and adds further partitioning information as the directory tree grows deeper. Note that the source directories may be contained within the outgoing directory, although data may temporarily be duplicated during processing if the --maximum-size option is also used.

--packing-logic=FILE

The required option --packing-logic specifies a Hadoop filesystem URI for a file containing Scala source code that determines how records are packed into outgoing files. See the documentation website for examples.

--readers-per-scanner=N

The --readers-per-scanner option determines the number of threads used to read and repack data for each scanning thread. If this option is not specified, the default value of 1 is used.
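
For example, the following hypothetical invocation scans three source directories, limits scanning to two directories at a time, and assigns three reader threads to each scanner:

  mothra-repacker hdfs:///data/repo1 hdfs:///data/repo2 hdfs:///data/repo3 \
      --outgoing-dir=hdfs:///data/repo-new \
      --packing-logic=hdfs:///data/pack.scala \
      --work-dir=/var/tmp/repacker-work \
      --max-scan-jobs=2 --readers-per-scanner=3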

--version

Print the version number and information about how Mothra was configured, then exit.

--work-dir=LOCAL-DIR

The required --work-dir option specifies the directory, as a local file-system path rather than a Hadoop URI, that is used to store files during processing, before they are moved to the outgoing directory. The repacker ignores any files that are already in this directory when it starts.