NAME

mothra-rollupday - Reduce the number of files per day in a Mothra repository

SYNOPSIS

  mothra-rollupday [--help] [--version]

  mothra-rollupday TARGET-1 [ ... TARGET-N ]
                   [--max-threads=N]
                   [--compression=CODEC]
                   [--maximum-size=N]

DESCRIPTION

mothra-rollupday reduces the number of data files in a Mothra repository. It may also be used to modify the files' compression.

The tool first makes a single recursive scan over the target directories TARGET-1 through TARGET-N, looking for files whose names start with the pattern YYYYMMDD.HH. or YYYYMMDD.HH-PTddH. (including the trailing period). For example, 20211108.16.anything-at-all and 20211108.16-PT4H.abcdefg.filename both match.
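The name test described above can be sketched as a regular expression. This is only an illustration, not the tool's actual implementation: the names PREFIX and matches are hypothetical, and the sketch assumes that "dd" in PTddH stands for one or more decimal digits (the PT4H example has a single digit).

```python
import re

# Assumed reading of the documented prefixes:
#   YYYYMMDD.HH.        or        YYYYMMDD.HH-PTddH.
# The duration "dd" is taken to be one or more digits (an assumption).
PREFIX = re.compile(
    r"^(?P<day>\d{8})\.(?P<hour>\d{2})(?:-PT(?P<span>\d+)H)?\."
)

def matches(name: str) -> bool:
    """Return True if a bare file name starts with one of the prefixes."""
    return PREFIX.match(name) is not None
```

Under these assumptions, a name such as 20211108.16 with no trailing period is not considered a match.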

Matching files within the same directory are processed to create one or more new output files in the same directory, containing the records of all of the original files in that directory.

Effectively, assuming that the records for a given partition in a given day are spread across several files within the same directory, mothra-rollupday will "roll up" all of these records into a single daily collection.
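As a rough sketch of the grouping this implies, the following collects matching file names by directory and by the YYYYMMDD part of the name. It is not taken from the tool's source: the function name group_by_directory_and_day is hypothetical, the example paths are invented, and keying on the day in addition to the directory is an assumption based on the "single daily collection" wording.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Same assumed prefix as in the documented pattern; "dd" is read as 1+ digits.
PREFIX = re.compile(r"^(\d{8})\.\d{2}(?:-PT\d+H)?\.")

def group_by_directory_and_day(paths):
    """Map (directory, YYYYMMDD) to the matching file names found there."""
    groups = defaultdict(list)
    for p in paths:
        path = PurePosixPath(p)
        m = PREFIX.match(path.name)
        if m:  # names that do not match the prefix are ignored
            groups[(str(path.parent), m.group(1))].append(path.name)
    return dict(groups)
```

Each resulting group corresponds to the set of input files whose records would end up together in the same directory.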

A single thread scans the target directories, but the number of threads processing file contents may be controlled with the --max-threads option.

By default, the output is not compressed (even if the input was). This may be controlled using the --compression option.

OPTIONS

--compression=CODEC

By default, the output of mothra-rollupday is not compressed, even if the input is compressed. If --compression is provided, files written to HDFS use the compression codec named CODEC. If this codec cannot be found, mothra-rollupday exits with an error. Values typically supported by Hadoop include bzip2, gzip, lz4, lzo, lzop, snappy, and default. If CODEC is none or the empty string, or if this option is not specified, no compression is used.

--maximum-size=N

By default, a single file per day is produced. The --maximum-size option may be used to limit the size of a single compressed file to approximately N bytes. The limit is approximate because the size is checked only after data appears on disk, which occurs in large blocks because of buffering and compression.

--max-threads=N

The --max-threads option sets the number of reading threads that bring in data and merge it together for output. If this option is not specified, the default value of 6 is used.
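For callers that launch the tool from a script, the options above might be assembled into an argument list as follows. This is a minimal sketch: the helper name rollupday_argv is hypothetical, and it uses only the option spellings documented on this page. The resulting list could be passed to something like subprocess.run.

```python
def rollupday_argv(targets, max_threads=None, compression=None,
                   maximum_size=None):
    """Build an argument list for mothra-rollupday (hypothetical helper)."""
    if not targets:
        raise ValueError("at least one target directory is required")
    argv = ["mothra-rollupday", *targets]
    # Options are added only when set, so the tool's defaults apply otherwise.
    if max_threads is not None:
        argv.append(f"--max-threads={max_threads}")
    if compression is not None:
        argv.append(f"--compression={compression}")
    if maximum_size is not None:
        argv.append(f"--maximum-size={maximum_size}")
    return argv
```

Leaving an argument as None omits the option entirely, which keeps the documented defaults (6 reading threads, no compression, one file per day).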