SiLK - Documentation - rwsplit


NAME

rwsplit - Divide a SiLK file into a (sampled) collection of subfiles


SYNOPSIS

  rwsplit { --ip-limit=LIMIT | --packet-limit=LIMIT
            | --flow-limit=LIMIT | --byte-limit=LIMIT }
        [--sample-ratio=SAMPLE_RATIO] [--file-ratio=FILE_RATIO]
        [--file-limit=FILE_LIMIT] [--site-config-file=FILENAME]
        --basename=BASENAME [FILES]


DESCRIPTION

rwsplit reads SiLK Flow records from the standard input or from files named on the command line and writes the flows into a set of subfiles based on the partition specification. In its simplest form, rwsplit partitions the file, meaning that each input flow will appear in one (and only one) of the subfiles.

In addition to splitting the file, rwsplit can generate files containing sample flows. Sampling is specified by using the --sample-ratio and --file-ratio switches.


OPTIONS

Option names may be abbreviated if the abbreviation is unique or is an exact match for an option. A parameter to an option may be specified as --arg=param or --arg param, though the first form is required for options that take optional parameters.

The splitting criterion is defined using one of the limit specifiers; one and only one must be specified. They are:

--ip-limit=LIMIT
Specifies the count of unique source and destination IPs at which to close the current subfile and begin a new subfile; next-hop IPs do not count toward the limit. Note that LIMIT is approximate, since a record with source and destination IPs not seen previously will increase the unique IP count by 2.

--flow-limit=LIMIT
Specifies the count of SiLK Flow records at which to close the current subfile and begin a new subfile.

--packet-limit=LIMIT
Specifies the count of packets at which to close the current subfile and begin a new subfile. LIMIT is a lower threshold: every subfile except the last contains at least LIMIT packets.

--byte-limit=LIMIT
Specifies the byte count at which to close the current subfile and begin a new subfile. LIMIT is a lower threshold: every subfile except the last contains at least LIMIT bytes.
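
The "lower threshold" behavior of --packet-limit and --byte-limit can be illustrated with a short Python model. This is only a sketch of the rule as described in this page (keep appending records until the running total meets or exceeds LIMIT); the function name split_by_total and the sample packet counts are hypothetical, and this is not rwsplit's actual implementation.

```python
from typing import Iterable, List


def split_by_total(values: Iterable[int], limit: int) -> List[List[int]]:
    """Group per-record counts (packets or bytes) into subfiles.

    A subfile is closed as soon as its running total meets or exceeds
    limit, so every subfile except possibly the last totals >= limit.
    """
    subfiles: List[List[int]] = []
    current: List[int] = []
    total = 0
    for value in values:
        current.append(value)
        total += value
        if total >= limit:
            # Threshold reached: close this subfile and start a new one.
            subfiles.append(current)
            current, total = [], 0
    if current:
        # Leftover records form the final (possibly short) subfile.
        subfiles.append(current)
    return subfiles


# Hypothetical packet counts for six flow records, with LIMIT=100.
packets_per_flow = [40, 30, 50, 10, 95, 20]
print(split_by_total(packets_per_flow, 100))
# prints [[40, 30, 50], [10, 95], [20]]
```

The first two groups total 120 and 105 packets; only the last group falls below the limit.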

The other switches are:

--basename=BASENAME
Specifies the basename of the output files; this switch is required. The flows are written sequentially to a set of subfiles whose names follow the format BASENAME.ORDER.rwf, where ORDER is an 8-digit zero-formatted sequence number (i.e., 00000000, 00000001, and so on). The sequence number will begin at zero and increase by one for every file written, unless --file-ratio is specified, in which case names are still generated in sequence but only the randomly chosen ones are written to disk, leaving gaps in the numbering.

--sample-ratio=SAMPLE_RATIO
Writes one flow record, chosen at random, from every SAMPLE_RATIO flows that are read.
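
As a rough model of --sample-ratio, the following Python sketch picks one record at random from each consecutive block of SAMPLE_RATIO records. The function name sample_one_per_block is hypothetical, and the handling of a trailing partial block (the pick may fall past the end of the input) is an assumption; this page does not say how rwsplit treats that case.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def sample_one_per_block(records: Iterable[T], ratio: int,
                         rng: random.Random) -> Iterator[T]:
    """Yield one randomly chosen record from every block of ratio records."""
    pick = 0
    for index, record in enumerate(records):
        offset = index % ratio
        if offset == 0:
            # Start of a new block: choose which position in it to keep.
            pick = rng.randrange(ratio)
        if offset == pick:
            yield record


# 1000 stand-in "records" sampled at 1 in 10 give 100 records.
sampled = list(sample_one_per_block(range(1000), 10, random.Random(0)))
print(len(sampled))
```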

--file-ratio=FILE_RATIO
Picks one subfile, chosen at random, out of every FILE_RATIO names generated, for writing to disk.

--file-limit=FILE_LIMIT
Limits the number of files that are written to disk to FILE_LIMIT.

--site-config-file=FILENAME
Read the SiLK site configuration from the named file FILENAME. When this switch is not provided, the location specified by the SILK_CONFIG_FILE environment variable is used if that variable is not empty. The value of SILK_CONFIG_FILE should include the name of the file. Otherwise, the application looks for a file named silk.conf in the following directories: the directory specified in the SILK_DATA_ROOTDIR environment variable; the data root directory that is compiled into SiLK (use the --version switch to view this value); the directories $SILK_PATH/share/silk/ and $SILK_PATH/share/; and the share/silk/ and share/ directories parallel to the application's directory.
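
The search order described for --site-config-file can be summarized as a list of candidate paths. The Python sketch below is illustrative only: the function candidate_config_paths and its parameters are hypothetical, the compiled-in data root is passed in as an argument because its value depends on the build, and "parallel to the application's directory" is interpreted here as a share/ directory beside the directory containing the executable, which is an assumption.

```python
import posixpath
from typing import List, Mapping, Optional


def candidate_config_paths(switch_value: Optional[str],
                           env: Mapping[str, str],
                           compiled_data_root: str,
                           app_dir: str) -> List[str]:
    """Return the locations to try, in order, for the site configuration."""
    # 1. An explicit --site-config-file value is used as given.
    if switch_value:
        return [switch_value]
    # 2. A non-empty SILK_CONFIG_FILE names the file itself.
    if env.get("SILK_CONFIG_FILE"):
        return [env["SILK_CONFIG_FILE"]]
    # 3. Otherwise look for silk.conf in a series of directories.
    candidates: List[str] = []
    if env.get("SILK_DATA_ROOTDIR"):
        candidates.append(posixpath.join(env["SILK_DATA_ROOTDIR"], "silk.conf"))
    candidates.append(posixpath.join(compiled_data_root, "silk.conf"))
    if env.get("SILK_PATH"):
        root = env["SILK_PATH"]
        candidates.append(posixpath.join(root, "share", "silk", "silk.conf"))
        candidates.append(posixpath.join(root, "share", "silk.conf"))
    # Assumed meaning of "parallel": siblings of the application's directory.
    parent = posixpath.dirname(app_dir)
    candidates.append(posixpath.join(parent, "share", "silk", "silk.conf"))
    candidates.append(posixpath.join(parent, "share", "silk.conf"))
    return candidates


# Hypothetical environment and install locations.
for path in candidate_config_paths(None, {"SILK_DATA_ROOTDIR": "/data"},
                                   "/compiled/root", "/usr/local/bin"):
    print(path)
```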


EXAMPLES

Assume a source file source.rwf; to split that file into files that each contain about 100 unique IP addresses:

  rwsplit --basename=result --ip-limit=100 source.rwf

To split source.rwf into files that each contain 100 flows:

  rwsplit --basename=result --flow-limit=100 source.rwf

The following causes rwsplit to sample 1 out of every 10 records from source.rwf; i.e., rwsplit will read 1000 flow records to produce each 100-flow subfile:

  rwsplit --basename=result --flow-limit=100 --sample-ratio=10 source.rwf

When --file-ratio is specified, the file names are generated as usual (e.g., base-00000000, base-00000001, ...); however, one name is chosen at random from each consecutive set of FILE_RATIO candidates, and only that file is written to disk.

  $ rwsplit --basename=result --flow-limit=100 --file-ratio=5 source.rwf
  $ ls
  result-00000002.rwf
  result-00000008.rwf
  result-00000013.rwf
  result-00000016.rwf
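
In the listing above, each written file falls in a different group of five consecutive sequence numbers (0-4, 5-9, 10-14, 15-19). A minimal Python sketch of that selection follows. The function choose_written_indices is hypothetical, and letting the pick for a final partial group fall outside the generated names (so that nothing is written for that group) is an assumption about the behavior.

```python
import random
from typing import List


def choose_written_indices(total_names: int, file_ratio: int,
                           rng: random.Random) -> List[int]:
    """Pick one sequence number per group of file_ratio generated names."""
    chosen: List[int] = []
    for start in range(0, total_names, file_ratio):
        # One candidate position is drawn for each group of names.
        pick = start + rng.randrange(file_ratio)
        if pick < total_names:
            # Only names that were actually generated can be written.
            chosen.append(pick)
    return chosen


# 20 generated names with FILE_RATIO=5: one sequence number per group.
print(choose_written_indices(20, 5, random.Random(1)))
```

The exact numbers depend on the random seed; only the one-per-group structure is the point of the sketch.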


LIMITATIONS

rwsplit can take exactly 1 partitioning switch per invocation.

Partitioning is not exact: rwsplit keeps appending flow records to a file until it meets or exceeds the specified LIMIT. For example, if you specify --ip-limit=100, then rwsplit will fill up the file until it has 100 IP addresses in it; if the file has 99 addresses and a new record with 2 previously unseen addresses is received, rwsplit will put this in the current file, resulting in a 101-address file. Similarly, if you specify --byte-limit=2000, and rwsplit receives a 2GB flow record, that flow record will be placed in the current subfile.
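
The 101-address case can be reproduced with a small Python model of --ip-limit. This is a sketch under the assumption that the set of seen addresses is tracked per subfile and reset when a subfile is closed; the function split_by_unique_ips and the addresses are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def split_by_unique_ips(flows: Iterable[Tuple[str, str]],
                        limit: int) -> List[int]:
    """Return the unique-IP count of each subfile produced."""
    counts: List[int] = []
    seen: Set[str] = set()
    for src, dst in flows:
        # The record is always added to the current subfile first.
        seen.update((src, dst))
        if len(seen) >= limit:
            counts.append(len(seen))
            seen = set()
    if seen:
        counts.append(len(seen))
    return counts


# 49 records with 2 new addresses each: 98 unique addresses.
flows = [(f"10.0.{i}.1", f"10.0.{i}.2") for i in range(49)]
# One record with a single new address: 99 unique addresses.
flows.append(("10.0.0.1", "10.1.0.1"))
# One record with 2 previously unseen addresses pushes the count past 100.
flows.append(("10.2.0.1", "10.2.0.2"))
print(split_by_unique_ips(flows, 100))
# prints [101]
```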

The switches --sample-ratio, --file-ratio, and --file-limit are processed in that order. So, when you specify

  rwsplit --sample-ratio=10 --ip-limit=100 --file-ratio=10 --file-limit=20

then rwsplit will pick 1 out of every 10 flow records, write the picked records to a file until it has 100 IPs per file, pick 1 out of every 10 files to write, and write up to 20 files. If there are 1000 records, each with 2 unique IPs in them, then rwsplit will write at most 1 file: the 100 sampled records contain 200 unique IP addresses, which is enough for only 2 candidate files, and the name picked from the set of 10 may not be one of those 2.
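
The arithmetic in this example can be checked step by step. The Python sketch below computes an upper bound on the number of files written, assuming every sampled record contributes the stated number of previously unseen addresses; the function max_files_written is hypothetical and only mirrors the order of processing described above.

```python
def max_files_written(records: int, ips_per_record: int, sample_ratio: int,
                      ip_limit: int, file_ratio: int, file_limit: int) -> int:
    """Upper bound on files written after sampling, splitting, and selection."""
    # Step 1 (--sample-ratio): 1 record kept out of every sample_ratio.
    sampled = records // sample_ratio
    # Step 2 (--ip-limit): each full subfile holds at least ip_limit addresses,
    # so the number of candidate subfiles is at most ceil(unique / ip_limit).
    unique_ips = sampled * ips_per_record
    candidates = -(-unique_ips // ip_limit)
    # Step 3 (--file-ratio): at most one file per group of file_ratio names.
    groups = -(-candidates // file_ratio)
    # Step 4 (--file-limit): cap on the number of files written.
    return min(groups, file_limit)


# 1000 records, 2 unique IPs each, with the switches from the example.
print(max_files_written(1000, 2, 10, 100, 10, 20))
# prints 1
```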


ENVIRONMENT

SILK_CONFIG_FILE
This environment variable is used as the value for the --site-config-file switch when that switch is not provided.

SILK_DATA_ROOTDIR
When the --site-config-file switch is not provided and the SILK_CONFIG_FILE environment variable is not set, rwsplit looks for the site configuration file in $SILK_DATA_ROOTDIR/silk.conf.

SILK_PATH
This environment variable gives the root of the install tree. As part of its search for the SiLK site configuration file, rwsplit checks for a file named silk.conf in the directories $SILK_PATH/share/silk and $SILK_PATH/share.


SEE ALSO

rwfilter(1)


BUGS

When used in an IPv6 environment, rwsplit will process all flows unless --ip-limit is specified. When limiting files to a certain number of IP addresses, rwsplit will attempt to convert any IPv6 addresses to IPv4. Records that can be converted will be processed; all other records will be silently ignored.
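
To illustrate the kind of filtering this implies, the following Python sketch treats an IPv6 address as convertible only when it is an IPv4-mapped address (::ffff:a.b.c.d). That conversion rule is an assumption made for the example, since this page does not state which addresses rwsplit can convert, and the helper names to_ipv4 and usable_for_ip_limit are hypothetical.

```python
import ipaddress
from typing import Optional


def to_ipv4(addr: str) -> Optional[ipaddress.IPv4Address]:
    """Return the IPv4 form of addr, or None if it has none (assumed rule)."""
    ip = ipaddress.ip_address(addr)
    if isinstance(ip, ipaddress.IPv4Address):
        return ip
    # ipv4_mapped is None unless the address lies in ::ffff:0:0/96.
    return ip.ipv4_mapped


def usable_for_ip_limit(src: str, dst: str) -> bool:
    """A record is counted only if both of its addresses convert."""
    return to_ipv4(src) is not None and to_ipv4(dst) is not None


print(to_ipv4("::ffff:192.0.2.1"))
print(usable_for_ip_limit("2001:db8::1", "192.0.2.7"))
```

Because unconvertible records are dropped without a message, counting the input records before and after a split with --ip-limit is one way to notice that some were ignored.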