NAME

mothra-filejoiner - Reduce number of files in a Mothra repository

SYNOPSIS

  mothra-filejoiner [--help] [--version]

  mothra-filejoiner TARGET-1 [ ... TARGET-N ]
                    [--compression=CODEC] [--maximum-size=N]
                    [--max-threads=N] [--min-count-to-join=N]
                    [--spawn-thread=MODE]

DESCRIPTION

mothra-filejoiner reduces the number of data files in a Mothra repository. It may also be used to modify the files' compression.

Multiple directories (whether part of a single Mothra repository or of several distinct Mothra repositories) may be processed in the same invocation of mothra-filejoiner.

This tool runs as a batch process, never as a daemon.

It makes a single recursive scan of the target directories TARGET-1 ... TARGET-N for files whose names begin with YYYYMMDD.HH. or YYYYMMDD.HH-PTNNH. (Specifically, it looks for files matching the regular expression ^\d{8}\.\d{2}(?:-PT\d\d?H)?\.) Within a directory, files that share the same prefix are joined into a single new file with that prefix, and the original files are then removed.
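
As an illustration of the name pattern, the following Python sketch applies the regular expression quoted above to a few file names and extracts the prefix. The file names and the helper join_prefix are hypothetical, and Python's re module only stands in for whatever regex engine mothra-filejoiner actually uses.

```python
import re

# The pattern quoted in the text above.
NAME_PATTERN = re.compile(r"^\d{8}\.\d{2}(?:-PT\d\d?H)?\.")


def join_prefix(filename):
    """Return the matched prefix (with its trailing dot), or None if there is no match."""
    match = NAME_PATTERN.match(filename)
    return match.group(0) if match else None


# Hypothetical file names, for illustration only.
for name in [
    "20240131.07.abc123.mothra",
    "20240131.07-PT12H.def456.mothra",
    "README.txt",
]:
    print(name, "->", join_prefix(name))
```

Names for which the helper returns None would be skipped by the scan; names that return the same prefix are candidates for joining.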

Using the --maximum-size option may result in data being coalesced into multiple smaller files rather than one larger unified file.

OPTIONS

--compression=CODEC

If --compression is provided then files written to HDFS will use the compression codec named CODEC. If this codec cannot be found, mothra-filejoiner will exit with an error. Values typically supported by Hadoop include bzip2, gzip, lz4, lzo, lzop, snappy, and default. If none or the empty string is given or if this option is not specified, no compression will be used.

--help

Print the available options and exit.

--max-threads=N

When --max-threads is specified, it determines the maximum number of threads which will be used to join files simultaneously. One thread is always used to recursively scan the target directories. This value determines the number of threads started as described in the --spawn-thread option.

--maximum-size=N

If --maximum-size is provided, it determines the size threshold in bytes at which an output file is closed. After at least this many compressed bytes have been written, the output file is closed and a new output file is created. Because a file is not closed until it reaches this size, output files may be slightly larger than N bytes.
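
The size check happens after a write, which is why an output file can end up somewhat over N bytes. A minimal Python sketch of that rollover rule follows, assuming data arrives as chunks of known compressed size. The function split_by_maximum_size and the chunk sizes are hypothetical and are not part of Mothra.

```python
def split_by_maximum_size(chunk_sizes, maximum_size):
    """Group chunk sizes into output files, closing a file once it holds
    at least maximum_size bytes."""
    files = []
    current = []
    written = 0
    for size in chunk_sizes:
        current.append(size)
        written += size
        # The check runs after the write, so a file may exceed the limit.
        if written >= maximum_size:
            files.append(current)
            current = []
            written = 0
    if current:
        # Whatever remains goes into a final, smaller file.
        files.append(current)
    return files


print(split_by_maximum_size([40, 40, 40, 40, 40], maximum_size=100))
```

With these made-up numbers the first file closes at 120 bytes (three chunks) and the remaining two chunks form a second file of 80 bytes.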

--min-count-to-join=N

When --min-count-to-join is given, it determines the minimum number of files with a shared prefix that must exist before those files are processed. Setting this value to 1 or less causes even single files to be processed (useful for re-writing files with a new compression method, for example). Setting this value to a larger number avoids coalescing files unless many of them share the same prefix. If this option is not specified, the default value of 2 is used.
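
The selection rule can be sketched in Python as follows. This is an illustration under assumptions, not Mothra's implementation: the function groups_to_join and the file names are hypothetical, and the names are taken to come from a single directory.

```python
from collections import defaultdict
import re

# The file-name pattern given in the DESCRIPTION section.
NAME_PATTERN = re.compile(r"^\d{8}\.\d{2}(?:-PT\d\d?H)?\.")


def groups_to_join(filenames, min_count_to_join=2):
    """Group names by prefix and keep only the groups that contain at
    least min_count_to_join files."""
    by_prefix = defaultdict(list)
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match:
            by_prefix[match.group(0)].append(name)
    return {
        prefix: names
        for prefix, names in by_prefix.items()
        if len(names) >= min_count_to_join
    }


# Hypothetical directory listing.
names = ["20240131.07.a", "20240131.07.b", "20240131.08.c", "notes.txt"]
print(groups_to_join(names))                       # default of 2
print(groups_to_join(names, min_count_to_join=1))  # single files too
```

With the default of 2, only the two files sharing the 20240131.07. prefix are selected; with a value of 1, the lone 20240131.08. file is selected as well.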

--spawn-thread=MODE

The --spawn-thread option determines how mothra-filejoiner allocates work to individual threads. If MODE is by-directory, then a single thread is used to process all of the files in each directory which contains files to process. If MODE is by-prefix, then within each directory one thread is used for all of the files sharing a common YYYYMMDD.HH. or YYYYMMDD.HH-PTNNH. prefix. The number of threads which run simultaneously is determined by --max-threads. If --spawn-thread is not specified, the default value of by-directory is used.
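
The difference between the two modes can be sketched in Python as a choice of work-unit key, with a bounded thread pool playing the role of --max-threads. Everything here is an assumption made for illustration: the functions work_units and run, the join callback, and the paths are hypothetical, and the sketch ignores --min-count-to-join.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import posixpath
import re

# The file-name pattern given in the DESCRIPTION section.
NAME_PATTERN = re.compile(r"^\d{8}\.\d{2}(?:-PT\d\d?H)?\.")


def work_units(paths, mode="by-directory"):
    """Map each work-unit key to the file paths one thread would handle."""
    units = defaultdict(list)
    for path in paths:
        directory, name = posixpath.split(path)
        match = NAME_PATTERN.match(name)
        if not match:
            continue  # not a file the scan would pick up
        if mode == "by-directory":
            key = (directory,)
        elif mode == "by-prefix":
            key = (directory, match.group(0))
        else:
            raise ValueError("unknown mode: " + mode)
        units[key].append(path)
    return dict(units)


def run(paths, mode, max_threads, join):
    """Call join(files) once per work unit, at most max_threads at a time."""
    units = work_units(paths, mode)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(join, units.values()))


# Hypothetical paths.
paths = [
    "/repo/a/20240131.07.x",
    "/repo/a/20240131.07.y",
    "/repo/a/20240131.08.z",
    "/repo/b/20240131.07.w",
]
print(len(work_units(paths, "by-directory")))  # one unit per directory
print(len(work_units(paths, "by-prefix")))     # one unit per directory and prefix
```

In this sketch by-prefix yields more, smaller units than by-directory, which may help when a few directories hold most of the files.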

--version

Print the version number and information about how Mothra was configured, then exit.