NAME
rwsplit - Divide a SiLK file into a (sampled) collection of subfiles
SYNOPSIS
rwsplit { --ip-limit=LIMIT | --packet-limit=LIMIT
| --flow-limit=LIMIT | --byte-limit=LIMIT }
[--sample-ratio=SAMPLE_RATIO] [--file-ratio=FILE_RATIO]
[--file-limit=FILE_LIMIT] [--site-config-file=FILENAME]
--basename=BASENAME [FILES]
DESCRIPTION
rwsplit reads SiLK Flow records from the standard input or from files named on the command line and writes the flows into a set of subfiles based on the partition specification. In its simplest form, rwsplit partitions the file, meaning that each input flow will appear in one (and only one) of the subfiles.
In addition to splitting the file, rwsplit can generate files containing sample flows. Sampling is specified by using the --sample-ratio and --file-ratio switches.
OPTIONS
Option names may be abbreviated if the abbreviation is unique or is an exact match for an option. A parameter to an option may be specified as --arg=param or --arg param, though the first form is required for options that take optional parameters.
The splitting criterion is defined using one of the limit specifiers; one and only one must be specified. They are:
- --ip-limit=LIMIT
- Specifies the count of unique source and destination IPs at which to close the current subfile and begin a new subfile; next-hop IPs do not count toward the limit. Note that LIMIT is approximate, since a record with source and destination IPs not seen previously will increase the unique IP count by 2.
- --flow-limit=LIMIT
- Specifies the count of SiLK Flow records at which to close the current subfile and begin a new subfile.
- --packet-limit=LIMIT
- Specifies the count of packets at which to close the current subfile and begin a new subfile. LIMIT is the lower threshold on the packet count for all files except the last.
- --byte-limit=LIMIT
- Specifies the byte count at which to close the current subfile and begin a new subfile. LIMIT is the lower threshold on the byte count for all files except the last.
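The "lower threshold" wording for --packet-limit and --byte-limit can be illustrated with a small model. This is only a sketch of the documented behaviour, not rwsplit's implementation; the function name `split_by_volume` and the sample packet counts are made up for illustration.

```python
from typing import Iterable, List


def split_by_volume(volumes: Iterable[int], limit: int) -> List[List[int]]:
    """Group per-record volumes so every group but the last sums to >= limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    subfiles: List[List[int]] = []
    current: List[int] = []
    total = 0
    for volume in volumes:
        current.append(volume)   # the record always joins the open subfile
        total += volume
        if total >= limit:       # threshold met or exceeded: close the subfile
            subfiles.append(current)
            current, total = [], 0
    if current:                  # leftover records form the final, shorter subfile
        subfiles.append(current)
    return subfiles


# Hypothetical packet counts for six flow records, split with a limit of 100.
packets_per_flow = [40, 70, 10, 95, 30, 5]
groups = split_by_volume(packets_per_flow, limit=100)
print(groups)  # [[40, 70], [10, 95], [30, 5]]
```

Every group except the last reaches at least 100 packets, and may overshoot it; only the final group can fall short.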
The other switches are:
- --basename=BASENAME
- Specifies the basename of the output files; this switch is required. The flows are written sequentially to a set of subfiles whose names follow the format BASENAME.ORDER.rwf, where ORDER is an 8-digit zero-formatted sequence number (i.e., 00000000, 00000001, and so on). The sequence number will begin at zero and increase by one for every file written, unless --file-ratio is specified, in which case only one name out of every FILE_RATIO generated is written and the sequence numbers on disk will have gaps.
- --sample-ratio=SAMPLE_RATIO
- Writes one flow record, chosen at random, from every SAMPLE_RATIO flows that are read.
- --file-ratio=FILE_RATIO
- Picks one subfile, chosen at random, out of every FILE_RATIO names generated, for writing to disk.
- --file-limit=FILE_LIMIT
- Limits the number of files that are written to disk to FILE_LIMIT.
- --site-config-file=FILENAME
- Read the SiLK site configuration from the named file FILENAME. When this switch is not provided, the location specified by the SILK_CONFIG_FILE environment variable is used if that variable is not empty. The value of SILK_CONFIG_FILE should include the name of the file. Otherwise, the application looks for a file named silk.conf in the following directories: the directory specified in the SILK_DATA_ROOTDIR environment variable; the data root directory that is compiled into SiLK (use the --version switch to view this value); the directories $SILK_PATH/share/silk/ and $SILK_PATH/share/; and the share/silk/ and share/ directories parallel to the application's directory.
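The search order for the site configuration file described under --site-config-file can be written out as a short sketch. The function `silk_conf_candidates`, its parameters `compiled_root` and `app_dir`, and all paths in the usage lines are hypothetical. Reading "parallel to the application's directory" as "siblings of the directory holding the executable" is an assumption.

```python
import posixpath
from typing import List, Mapping, Optional


def silk_conf_candidates(env: Mapping[str, str],
                         compiled_root: str,
                         app_dir: str,
                         site_config_file: Optional[str] = None) -> List[str]:
    """Return candidate site-config paths in the order described above."""
    if site_config_file:                  # an explicit --site-config-file wins
        return [site_config_file]
    if env.get("SILK_CONFIG_FILE"):       # non-empty variable names the file itself
        return [env["SILK_CONFIG_FILE"]]
    dirs: List[str] = []
    if env.get("SILK_DATA_ROOTDIR"):
        dirs.append(env["SILK_DATA_ROOTDIR"])
    dirs.append(compiled_root)            # data root compiled into SiLK
    if env.get("SILK_PATH"):
        dirs.append(posixpath.join(env["SILK_PATH"], "share", "silk"))
        dirs.append(posixpath.join(env["SILK_PATH"], "share"))
    # Assumption: "parallel" means siblings of the directory holding the binary.
    parent = posixpath.dirname(app_dir.rstrip("/"))
    dirs.append(posixpath.join(parent, "share", "silk"))
    dirs.append(posixpath.join(parent, "share"))
    return [posixpath.join(d, "silk.conf") for d in dirs]


# Hypothetical environment and install locations.
env = {"SILK_DATA_ROOTDIR": "/data", "SILK_PATH": "/opt/silk"}
candidates = silk_conf_candidates(env, compiled_root="/var/silk", app_dir="/usr/local/bin")
for path in candidates:
    print(path)
```

The sketch only lists candidates; a real lookup would stop at the first path that exists.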
EXAMPLES
Assume a source file source.rwf; to split that file into files that each contain about 100 unique IP addresses:
rwsplit --basename=result --ip-limit=100 source.rwf
To split source.rwf into files that each contain 100 flows:
rwsplit --basename=result --flow-limit=100 source.rwf
The following causes rwsplit to sample 1 out of every 10 records from source.rwf; with --flow-limit=100, rwsplit will therefore read 1000 flow records to produce each 100-flow subfile:
rwsplit --basename=result --flow-limit=100 --sample-ratio=10 source.rwf
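One way to picture --sample-ratio is as a choice of one record from each consecutive block of SAMPLE_RATIO records. The sketch below assumes exactly that block-wise model and ignores a trailing partial block. Whether rwsplit selects records this way internally is not stated here, and the function name is hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_one_per_block(records: Sequence[T], ratio: int, rng: random.Random) -> List[T]:
    """Pick one record at random from each full block of `ratio` records."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    sampled: List[T] = []
    # Step through full blocks only; a trailing partial block is skipped (assumption).
    for start in range(0, len(records) - ratio + 1, ratio):
        sampled.append(records[start + rng.randrange(ratio)])
    return sampled


# 1000 stand-in records sampled at a ratio of 10 leave 100 records,
# which is exactly one subfile when --flow-limit=100.
records = list(range(1000))
sampled = sample_one_per_block(records, ratio=10, rng=random.Random(0))
print(len(sampled))  # 100
```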
When --file-ratio is specified, the file names are generated as usual (e.g., base-00000000, base-00000001, ...); however, one name is chosen at random from each set of FILE_RATIO candidates, and only that file is written to disk:
$ rwsplit --basename=result --flow-limit=100 --file-ratio=5 source.rwf
$ ls
result-00000002.rwf
result-00000008.rwf
result-00000013.rwf
result-00000016.rwf
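The listing above is consistent with one name being kept from each set of five sequence numbers (0-4, 5-9, 10-14, 15-19). The following sketch checks that property. `one_per_set` is a hypothetical helper, and the sequence numbers are copied from the ls output.

```python
from typing import List, Sequence


def set_index(sequence_number: int, file_ratio: int) -> int:
    """Return which set of `file_ratio` consecutive names a sequence number belongs to."""
    return sequence_number // file_ratio


def one_per_set(sequence_numbers: Sequence[int], file_ratio: int) -> bool:
    """True when the numbers fall in sets 0, 1, 2, ... with exactly one number per set."""
    sets: List[int] = [set_index(n, file_ratio) for n in sequence_numbers]
    return sets == list(range(len(sequence_numbers)))


listing = [2, 8, 13, 16]  # sequence numbers of the files shown by ls
print([set_index(n, 5) for n in listing])  # [0, 1, 2, 3]
print(one_per_set(listing, 5))             # True
```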
LIMITATIONS
rwsplit can take exactly 1 partitioning switch per invocation.
Partitioning is not exact: rwsplit keeps appending flow records to a file until it meets or exceeds the specified LIMIT. For example, if you specify --ip-limit=100, then rwsplit will fill up the file until it has 100 IP addresses in it; if the file has 99 addresses and a new record with 2 previously unseen addresses is received, rwsplit will put this record in the current file, resulting in a 101-address file. Similarly, if you specify --byte-limit=2000 and rwsplit receives a flow record that accounts for 2 GB of traffic, that flow record will be placed in the current subfile.
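The 99-plus-2 case can be reproduced with a toy model that tracks unique addresses per subfile. This is a sketch under the assumption that each subfile starts with an empty set of seen addresses. The function name and the addresses are invented for illustration.

```python
from typing import Iterable, List, Set, Tuple

Flow = Tuple[str, str]  # (source IP, destination IP)


def split_by_unique_ips(flows: Iterable[Flow], limit: int) -> List[int]:
    """Return the unique-IP count of each subfile a limit of `limit` would yield."""
    counts: List[int] = []
    seen: Set[str] = set()
    for sip, dip in flows:
        seen.update((sip, dip))   # the record is always placed in the open subfile
        if len(seen) >= limit:    # limit met or exceeded: close the subfile
            counts.append(len(seen))
            seen = set()          # assumption: the next subfile starts counting afresh
    if seen:
        counts.append(len(seen))
    return counts


# 49 records with two new addresses each (98), one record adding a single
# new address (99), then one record with two previously unseen addresses.
flows: List[Flow] = [(f"10.0.{i}.1", f"10.0.{i}.2") for i in range(49)]
flows.append(("10.0.49.1", "10.0.0.1"))
flows.append(("10.0.50.1", "10.0.50.2"))
print(split_by_unique_ips(flows, limit=100))  # [101]
```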
The switches --sample-ratio, --file-ratio, and --file-limit are processed in that order. So, when you specify
rwsplit --basename=result --sample-ratio=10 --ip-limit=100 --file-ratio=10 --file-limit=20 source.rwf
then rwsplit will pick 1 out of every 10 flow records, write the sampled records to subfiles of about 100 unique IPs each, pick 1 out of every 10 subfiles to write, and write up to 20 files. If the input holds 1000 records, each with 2 unique IPs, then sampling leaves 100 records with 200 unique IP addresses, which fill only 2 candidate subfiles. Because 1 name is chosen from each set of 10, rwsplit will write at most 1 file, and it may write none if the chosen name is not one of the two that were generated.
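The arithmetic behind this example can be laid out stage by stage. The helper below is a hypothetical sketch that assumes every sampled record contributes addresses not seen in any other record. It computes an upper bound on the files written, not the exact number.

```python
from typing import Dict


def stage_counts(records: int, ips_per_record: int, sample_ratio: int,
                 ip_limit: int, file_ratio: int, file_limit: int) -> Dict[str, int]:
    """Follow the switches in processing order and report the count after each stage."""
    sampled = records // sample_ratio              # --sample-ratio
    unique_ips = sampled * ips_per_record          # all addresses assumed distinct
    candidates = -(-unique_ips // ip_limit)        # --ip-limit, rounded up
    sets = -(-candidates // file_ratio)            # --file-ratio: one pick per set
    max_written = min(sets, file_limit)            # --file-limit caps the result
    return {
        "sampled_records": sampled,
        "unique_ips": unique_ips,
        "candidate_subfiles": candidates,
        "max_files_written": max_written,
    }


result = stage_counts(records=1000, ips_per_record=2, sample_ratio=10,
                      ip_limit=100, file_ratio=10, file_limit=20)
print(result)
# {'sampled_records': 100, 'unique_ips': 200, 'candidate_subfiles': 2, 'max_files_written': 1}
```

The bound is 1 rather than 2 because both candidate subfiles fall in the same set of 10 names, and only one name per set is ever written.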
ENVIRONMENT
- SILK_CONFIG_FILE
- This environment variable is used as the value for the --site-config-file switch when that switch is not provided.
- SILK_DATA_ROOTDIR
- When the --site-config-file switch is not provided and the SILK_CONFIG_FILE environment variable is not set, rwsplit looks for the site configuration file in $SILK_DATA_ROOTDIR/silk.conf.
- SILK_PATH
- This environment variable gives the root of the install tree. As part of its search for the SiLK site configuration file, rwsplit checks for a file named silk.conf in the directories $SILK_PATH/share/silk and $SILK_PATH/share.
SEE ALSO
BUGS
When used in an IPv6 environment, rwsplit will process all flows unless the --ip-limit switch is specified. When limiting files to a certain number of IP addresses, rwsplit will attempt to convert any IPv6 addresses to IPv4. Records that can be converted will be processed; all other records will be silently ignored.
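To estimate which records would survive under --ip-limit, the sketch below treats an IPv6 address as convertible only when it is IPv4-mapped (in ::ffff:0:0/96). That reading of "can be converted" is an assumption, and the function `to_ipv4` is hypothetical. It uses Python's standard ipaddress module and does not reproduce any SiLK code.

```python
import ipaddress
from typing import Optional


def to_ipv4(addr: str) -> Optional[ipaddress.IPv4Address]:
    """Return the IPv4 form of `addr`, or None when no conversion is possible."""
    ip = ipaddress.ip_address(addr)
    if isinstance(ip, ipaddress.IPv4Address):
        return ip
    # Assumption: only IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) are convertible.
    return ip.ipv4_mapped


print(to_ipv4("::ffff:192.0.2.1"))  # 192.0.2.1
print(to_ipv4("2001:db8::1"))       # None
```

Under this model, a record would be kept only when both its source and destination addresses convert.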


