rwsort - Sort SiLK Flow records on one or more fields
rwsort --fields=KEY [--presorted-input] [--reverse]
[--temp-directory=DIR_PATH] [--sort-buffer-size=SIZE]
[--note-add=TEXT] [--note-file-add=FILE]
[--compression-method=COMP_METHOD] [--print-filenames]
[--output-path=PATH] [--site-config-file=FILENAME]
[--plugin=PLUGIN [--plugin=PLUGIN ...]]
[--python-file=PATH [--python-file=PATH ...]]
[--pmap-file=MAPNAME:PATH [--pmap-file=MAPNAME:PATH ...]]
{[--input-pipe=PATH] | [--xargs]|[--xargs=FILE] | [FILES...]}
rwsort [--pmap-file=MAPNAME:PATH [--pmap-file=MAPNAME:PATH ...]]
[--plugin=PLUGIN ...] [--python-file=PATH ...] --help
rwsort [--pmap-file=MAPNAME:PATH [--pmap-file=MAPNAME:PATH ...]]
[--plugin=PLUGIN ...] [--python-file=PATH ...] --help-fields
rwsort --version
rwsort reads SiLK Flow records, sorts the records by the field(s) listed in the --fields switch, and writes the records to the --output-path or to the standard output if it is not connected to a terminal. The output from rwsort is binary SiLK Flow records; the output must be passed into another tool for human-readable output.
Sorting records is an expensive operation, and it should only be used when necessary. The tools that bin flow records (rwcount(1), rwuniq(1), rwstats(1), etc.) do not require sorted data.
rwsort reads SiLK Flow records from the files named on the command line or from the standard input when no file names are specified and neither --xargs nor --input-pipe is present. To read the standard input in addition to the named files, use - or stdin as a file name. If an input file name ends in .gz, the file is uncompressed as it is read. When the --xargs switch is provided, rwsort reads the names of the files to process from the named text file or from the standard input if no file name argument is provided to the switch. The input to --xargs must contain one file name per line. The --input-pipe switch is deprecated and is provided for legacy reasons; its use is not required since rwsort will automatically read from the standard input. The --input-pipe switch will be removed in the SiLK 4.0 release.
The amount of fast memory used by rwsort will increase until it reaches a maximum near 2GB. (Use the --sort-buffer-size switch to change this upper limit on the buffer size.) If more records are read than will fit into memory, the in-core records are sorted and temporarily stored on disk as described by the --temp-directory switch. When all records have been read, the on-disk files are merged and the sorted records written to the output.
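The spill-and-merge behavior described above follows the general pattern of an external merge sort. The Python sketch below illustrates that pattern only; it is not rwsort's implementation. The function name external_sort, the record-count limit max_in_memory (rwsort limits the buffer in bytes, not records), and the use of pickle for the temporary files are all assumptions made for illustration.

```python
import heapq
import pickle
import tempfile


def external_sort(records, key, max_in_memory, temp_dir=None):
    """Yield records in sorted order, buffering at most max_in_memory of them.

    Hypothetical helper: the limit is a record count here, whereas
    rwsort's --sort-buffer-size is expressed in bytes.
    """
    runs = []    # temporary files, each holding one sorted run
    buffer = []  # records currently held in memory

    def spill():
        # Sort the in-memory records and store them in a temporary file.
        buffer.sort(key=key)
        run = tempfile.TemporaryFile(dir=temp_dir)
        for rec in buffer:
            pickle.dump(rec, run)
        run.seek(0)
        runs.append(run)
        buffer.clear()

    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_in_memory:
            spill()

    if not runs:
        # Everything fit in memory, so no temporary files are needed.
        buffer.sort(key=key)
        yield from buffer
        return

    if buffer:
        spill()

    def read_run(run):
        # Stream one sorted run back from disk, closing it at the end.
        try:
            while True:
                yield pickle.load(run)
        except EOFError:
            run.close()

    # Merge the sorted runs into a single sorted stream.
    yield from heapq.merge(*(read_run(r) for r in runs), key=key)
```

Calling list(external_sort(recs, key=lambda r: r[0], max_in_memory=2)) on five records would create three temporary runs and merge them; with a limit larger than the input, no temporary file is created.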
By default, the temporary files are stored in the /tmp directory. Because these temporary files will be large, it is strongly recommended that /tmp not be used as the temporary directory. To modify the temporary directory used by rwsort, provide the --temp-directory switch, set the SILK_TMPDIR environment variable, or set the TMPDIR environment variable.
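The three mechanisms above are consulted in a fixed order: the --temp-directory switch, then SILK_TMPDIR, then TMPDIR, then the default /tmp. A minimal Python sketch of that precedence follows. The function name resolve_temp_dir is hypothetical, and treating an empty environment variable as unset is an assumption that this page does not confirm.

```python
import os


def resolve_temp_dir(temp_directory=None, environ=None):
    """Return the directory that would hold the temporary files.

    temp_directory models the value of the --temp-directory switch.
    Assumption: an empty variable is treated the same as an unset one.
    """
    env = os.environ if environ is None else environ
    if temp_directory:
        return temp_directory
    return env.get("SILK_TMPDIR") or env.get("TMPDIR") or "/tmp"
```

For example, resolve_temp_dir(environ={"TMPDIR": "/var/tmp"}) yields /var/tmp, while adding SILK_TMPDIR to that mapping would take priority over TMPDIR.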
To merge previously sorted SiLK data files into a sorted stream, run rwsort with the --presorted-input switch. rwsort will merge-sort all the input files, reducing its memory requirements considerably. It is the user's responsibility to ensure that all the input files have been sorted with the same --fields value (and --reverse if applicable). rwsort may still require use of a temporary directory while merging the files (for example, if rwsort does not have enough available file handles to open all the input files at once).
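Merging presorted inputs needs only one record per input in memory at a time, which is why no large sort buffer is required. The sketch below shows the idea with Python's heapq.merge; records are modeled as dictionaries, and the function name merge_presorted is hypothetical. As with rwsort, the result is only correct when every input was already sorted with the same key and direction.

```python
import heapq
from operator import itemgetter


def merge_presorted(inputs, fields, reverse=False):
    """Lazily merge iterables that are already sorted on the given fields.

    Each record is modeled as a dict. The caller must ensure that every
    input uses the same key order and the same direction (reverse).
    """
    key = itemgetter(*fields)
    return heapq.merge(*inputs, key=key, reverse=reverse)
```

If one input is sorted on a different key, the merge does not fail; it silently yields records out of order, which mirrors the responsibility placed on the user above.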
Option names may be abbreviated if the abbreviation is unique or is an exact match for an option. A parameter to an option may be specified as --arg=param or --arg param, though the first form is required for options that take optional parameters.
The --fields switch is required. rwsort will fail when it is not provided.
KEY contains the list of flow attributes (a.k.a. fields or columns) that make up the key by which flows are sorted. The fields are listed in order: primary sort key, secondary key, and so on. Each field may be specified once only. KEY is a comma-separated list of field-names, field-integers, and ranges of field-integers; a range is specified by separating the start and end of the range with a hyphen (-). Field-names are case insensitive. Example:
--fields=stime,10,1-5
There is no default value for the --fields switch; the switch must be specified.
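To make the KEY syntax concrete, the following Python sketch expands a KEY string into an ordered list of field names and field integers. It illustrates the syntax rules above and is not rwsort's parser. The function name expand_key is hypothetical, rejecting a range whose start exceeds its end is an assumption, and the sketch cannot detect a field given once by name and once by its integer because it does not know the name-to-integer mapping.

```python
def expand_key(key):
    """Expand a KEY string into an ordered list of names and integers."""
    fields = []
    for token in key.split(","):
        token = token.strip()
        start, sep, end = token.partition("-")
        if sep and start.isdigit() and end.isdigit():
            # A range of field-integers such as 1-5.
            lo, hi = int(start), int(end)
            if lo > hi:
                # Assumption: a descending range is treated as an error.
                raise ValueError(f"bad range: {token}")
            items = list(range(lo, hi + 1))
        elif token.isdigit():
            items = [int(token)]
        elif token:
            # Field-names are case insensitive, so normalize them.
            items = [token.lower()]
        else:
            raise ValueError("empty field in KEY")
        for item in items:
            if item in fields:
                raise ValueError(f"field given more than once: {item}")
            fields.append(item)
    return fields
```

With the example above, expand_key("stime,10,1-5") gives ['stime', 10, 1, 2, 3, 4, 5]: stime is the primary key, field 10 the secondary key, and fields 1 through 5 follow in order.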
The complete list of built-in fields that the SiLK tool suite supports follows, though note that not all fields are present in all SiLK file formats; when a field is not present, its value is 0.
source IP address
destination IP address
source port for TCP and UDP, or equivalent
destination port for TCP and UDP, or equivalent. See note at iType.
IP protocol
packet count
byte count
bit-wise OR of TCP flags over all packets
starting time of flow (nanoseconds resolution)
duration of flow (nanoseconds resolution)
end time of flow (nanoseconds resolution)
name or ID of sensor where flow was collected
integer value of the class/type pair assigned to the flow by rwflowpack(8)
the ICMP type value for ICMP or ICMPv6 flows and zero for non-ICMP flows. Internally, SiLK stores the ICMP type and code in the dPort field, so there is no need to have both dPort and iType or iCode in the sort key. This field was introduced in SiLK 3.8.1.
the ICMP code value for ICMP or ICMPv6 flows and zero for non-ICMP flows. See note at iType.
equivalent to iType,iCode. This field may not be mixed with iType or iCode, and this field is deprecated as of SiLK 3.8.1. Prior to SiLK 3.8.1, specifying the icmpTypeCode field was equivalent to specifying the dPort field.
Many SiLK file formats do not store the following fields and their values will always be 0; they are listed here for completeness:
router SNMP input interface or vlanId if packing tools were configured to capture it (see sensor.conf(5))
router SNMP output interface or postVlanId
router next hop IP
SiLK can store flows generated by enhanced collection software that provides more information than NetFlow v5. These flows may support some or all of these additional fields; for flows without this additional information, the field's value is always 0.
TCP flags on first packet in the flow
bit-wise OR of TCP flags over all packets except the first in the flow
flow attributes set by the flow generator:
S
all the packets in this flow record are exactly the same size
F
flow generator saw additional packets in this flow following a packet with a FIN flag (excluding ACK packets)
T
flow generator prematurely created a record for a long-running connection due to a timeout. (When the flow generator yaf(1) is run with the --silk switch, it will prematurely create a flow and mark it with T if the byte count of the flow cannot be stored in a 32-bit value.)
C
flow generator created this flow as a continuation of long-running connection, where the previous flow for this connection met a timeout (or a byte threshold in the case of yaf).
Consider a long-running ssh session that exceeds the flow generator's active timeout. (This is the active timeout since the flow generator creates a flow for a connection that still has activity.) The flow generator will create multiple flow records for this ssh session, each spanning some portion of the total session. The first flow record will be marked with a T indicating that it hit the timeout. The second through next-to-last records will be marked with TC indicating that this flow both timed out and is a continuation of a flow that timed out. The final flow will be marked with a C, indicating that it was created as a continuation of an active flow.
guess as to the content of the flow. Some software that generates flow records from packet data, such as yaf, will inspect the contents of the packets that make up a flow and use traffic signatures to label the content of the flow. SiLK calls this label the application; yaf refers to it as the appLabel. The application is the port number that is traditionally used for that type of traffic (see the /etc/services file on most UNIX systems). For example, traffic that the flow generator recognizes as FTP will have a value of 21, even if that traffic is being routed through the standard HTTP/web port (80).
The following fields provide a way to label the IPs or ports on a record. These fields require external files to provide the mapping from the IP or port to the label:
categorize the source IP address as non-routable, internal, or external and sort based on the category. Uses the mapping file specified by the SILK_ADDRESS_TYPES environment variable, or the address_types.pmap mapping file, as described in addrtype(3).
as sType for the destination IP address
the country code of the source IP address. Uses the mapping file specified by the SILK_COUNTRY_CODES environment variable, or the country_codes.pmap mapping file, as described in ccfilter(3).
as scc for the destination IP
label contained in the prefix map file associated with map-name. If the prefix map is for IP addresses, the label is that associated with the source IP address. If the prefix map is for protocol/port pairs, the label is that associated with the protocol and source port. See also the description of the --pmap-file switch below and the pmapfilter(3) manual page.
as src-map-name for the destination IP address or the protocol and destination port.
as src-map-name when no map-name is associated with the prefix map file
as dst-map-name when no map-name is associated with the prefix map file
Finally, the list of built-in fields may be augmented by the run-time loading of PySiLK code or plug-ins written in C (also called shared object files or dynamic libraries), as described by the --python-file and --plugin switches.
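As a worked illustration of the T and C attribute markings described above for the long-running ssh session, the sketch below returns the timeout-related markings for a session that the flow generator split into a given number of records. The function name timeout_attributes is hypothetical, and the S and F attributes are ignored.

```python
def timeout_attributes(record_count):
    """Return the T/C markings for a session split into record_count flows."""
    if record_count < 1:
        raise ValueError("a session has at least one flow record")
    if record_count == 1:
        # The session never hit the active timeout, so nothing is marked.
        return [""]
    # First record: timed out. Middle records: timed out and continued.
    # Last record: continuation only.
    return ["T"] + ["TC"] * (record_count - 2) + ["C"]
```

A session split into four records is therefore marked T, TC, TC, C, and a session split into two records is marked T, C.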
Instruct rwsort to merge-sort the input files; that is, rwsort assumes the input files have been previously sorted using the same values for the --fields and --reverse switches as were given for this invocation. This switch can greatly reduce rwsort's memory requirements as a large buffer is not required for sorting the records. If the input files were created with rwsort, you can run rwfileinfo(1) on the files to see the rwsort invocation that created them.
Cause rwsort to reverse the sort order, causing larger values to occur in the output before smaller values. Normally smaller values appear before larger values.
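The effect of reversing a multi-field key can be sketched in Python, modeling each record as a dictionary. The function name sort_records is hypothetical. Python's sorted() is stable, and this page does not say how rwsort orders records whose keys compare equal, so the sketch makes no claim about that case.

```python
from operator import itemgetter


def sort_records(records, fields, reverse=False):
    """Sort dict records on the listed fields, primary key first.

    With reverse=True the whole composite key is compared in
    descending order, so larger values come first.
    """
    return sorted(records, key=itemgetter(*fields), reverse=reverse)
```

Sorting on ["dport", "sip"] with reverse=True puts the largest destination port first and, within one port, the largest source address first.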
Augment the list of fields by using run-time loading of the plug-in (shared object) whose path is PLUGIN. The switch may be repeated to load multiple plug-ins. The creation of plug-ins is described in the silk-plugin(3) manual page. When PLUGIN does not contain a slash (/), rwsort will attempt to find a file named PLUGIN in the directories listed in the "FILES" section. If rwsort finds the file, it uses that path. If PLUGIN contains a slash or if rwsort does not find the file, rwsort relies on your operating system's dlopen(3) call to find the file. When the SILK_PLUGIN_DEBUG environment variable is non-empty, rwsort prints status messages to the standard error as it attempts to find and open each of its plug-ins.
Specify the name of the directory in which to store data files temporarily when more records have been read than will fit into RAM. This switch overrides the directory specified in the SILK_TMPDIR environment variable, which overrides the directory specified in the TMPDIR variable, which overrides the default, /tmp.
Set the maximum size of the buffer used for sorting the records, in bytes. A larger buffer means fewer temporary files need to be created, reducing the I/O wait times. When this switch is not specified, the default maximum for this buffer is near 2GB. The SIZE may be given as an ordinary integer, or as a real number followed by a suffix K, M or G, which represents the numerical value multiplied by 1,024 (kilo), 1,048,576 (mega), and 1,073,741,824 (giga), respectively. For example, 1.5K represents 1,536 bytes, or one and one-half kilobytes. (This value does not represent the absolute maximum amount of RAM that rwsort will allocate, since additional buffers will be allocated for reading the input and writing the output.) The sort buffer is not used when the --presorted-input switch is specified.
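The SIZE arithmetic can be checked with a short Python sketch. It is an illustration of the rules above, not rwsort's parser. The function name parse_size is hypothetical; accepting only upper-case suffixes and truncating a fractional byte count are assumptions.

```python
import re

# Multipliers for the suffixes described above.
_MULTIPLIERS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a SIZE string such as '1.5K' into a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d*)?|\.\d+)([KMG]?)", text.strip())
    if match is None:
        raise ValueError(f"cannot parse SIZE: {text!r}")
    number, suffix = match.groups()
    if number.isdigit():
        # Plain integers are handled exactly, with or without a suffix.
        return int(number) * _MULTIPLIERS[suffix]
    if not suffix:
        # A real number is only allowed when a suffix follows it.
        raise ValueError("a fractional SIZE needs a K, M, or G suffix")
    # Assumption: any fractional byte is truncated.
    return int(float(number) * _MULTIPLIERS[suffix])
```

Here parse_size("1.5K") returns 1536, matching the example in the text, and parse_size("2G") returns 2147483648.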
Add the specified TEXT to the header of the output file as an annotation. This switch may be repeated to add multiple annotations to a file. To view the annotations, use the rwfileinfo(1) tool.
Open FILENAME and add the contents of that file to the header of the output file as an annotation. This switch may be repeated to add multiple annotations. Currently the application makes no effort to ensure that FILENAME contains text; be careful that you do not attempt to add a SiLK data file as an annotation.
Specify the compression library to use when writing output files. If this switch is not given, the value in the SILK_COMPRESSION_METHOD environment variable is used if the value names an available compression method. When no compression method is specified, output to the standard output or to named pipes is not compressed, and output to files is compressed using the default chosen when SiLK was compiled. The valid values for COMP_METHOD are determined by which external libraries were found when SiLK was compiled. To see the available compression methods and the default method, use the --help or --version switch. SiLK can support the following COMP_METHOD values when the required libraries are available.
Do not compress the output using an external library.
Use the zlib(3) library for compressing the output, and always compress the output regardless of the destination. Using zlib produces the smallest output files at the cost of speed.
Use the lzo1x algorithm from the LZO real time compression library for compression, and always compress the output regardless of the destination. This compression provides good compression with less memory and CPU overhead.
Use the snappy library for compression, and always compress the output regardless of the destination. This compression provides good compression with less memory and CPU overhead. Since SiLK 3.13.0.
Use lzo1x if available, otherwise use snappy if available, otherwise use zlib if available. Only compress the output when writing to a file.
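The fallback order used by the last method above can be sketched as follows. The function name pick_method is hypothetical, and returning None when no library is available is an assumption, since this page does not describe that case.

```python
# Preference order taken from the description above.
PREFERENCE = ("lzo1x", "snappy", "zlib")


def pick_method(available, writing_to_file):
    """Return the compression method to use, or None for no compression.

    available is the set of methods compiled into SiLK.
    """
    if not writing_to_file:
        # Output to the standard output or a pipe is left uncompressed.
        return None
    for name in PREFERENCE:
        if name in available:
            return name
    # Assumption: fall back to no compression when nothing is available.
    return None
```

For instance, pick_method({"zlib", "snappy"}, writing_to_file=True) selects snappy because lzo1x is missing and snappy is preferred over zlib.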
Print to the standard error the names of input files as they are opened.
Write the binary SiLK Flow records to PATH, where PATH is a filename, a named pipe, the keyword stderr to write the output to the standard error, or the keyword stdout or - to write the output to the standard output. If PATH names an existing file, rwsort exits with an error unless the SILK_CLOBBER environment variable is set, in which case PATH is overwritten. If this switch is not given, the output is written to the standard output. Attempting to write the binary output to a terminal causes rwsort to exit with an error.
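The decision rules for the output destination can be summarized in a small Python sketch. It models the checks described above and does not open any file. The function name resolve_output and its injectable exists and is_terminal parameters are hypothetical; testing only for a regular file, so that an existing named pipe is accepted, is an assumption.

```python
import os
import sys


def resolve_output(path=None, environ=None,
                   exists=os.path.isfile,
                   is_terminal=lambda stream: stream.isatty()):
    """Return 'stdout', 'stderr', or the file path that would be written."""
    env = os.environ if environ is None else environ
    if path in (None, "-", "stdout"):
        name, stream = "stdout", sys.stdout
    elif path == "stderr":
        name, stream = "stderr", sys.stderr
    else:
        # A non-empty SILK_CLOBBER permits overwriting an existing file.
        if exists(path) and not env.get("SILK_CLOBBER"):
            raise FileExistsError(path)
        return path
    if is_terminal(stream):
        raise OSError("refusing to write binary records to a terminal")
    return name
```

The exists and is_terminal parameters exist only so the rules can be exercised without touching the file system or a real terminal.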
Read the SiLK site configuration from the named file FILENAME. When this switch is not provided, rwsort searches for the site configuration file in the locations specified in the "FILES" section.
Read the SiLK Flow records to be sorted from the named pipe at PATH. If PATH is stdin or -, records are read from the standard input. Use of this switch is not required, since rwsort will automatically read data from the standard input when no file names are specified on the command line. This switch is deprecated and will be removed in the SiLK 4.0 release.
Read the names of the input files from FILENAME or from the standard input if FILENAME is not provided. The input is expected to have one filename per line. rwsort opens each named file in turn and reads records from it as if the filenames had been listed on the command line.
Print the available options and exit. Specifying switches that add new fields or additional switches before --help will allow the output to include descriptions of those fields or switches.
Print the description and alias(es) of each field and exit. Specifying switches that add new fields before --help-fields will allow the output to include descriptions of those fields.
Print the version number and information about how SiLK was configured, then exit the application.
Load the prefix map file located at PATH and create fields named src-map-name and dst-map-name where map-name is either the MAPNAME part of the argument or the map-name specified when the file was created (see rwpmapbuild(1)). If no map-name is available, rwsort names the fields sval and dval. Specify PATH as - or stdin to read from the standard input. The switch may be repeated to load multiple prefix map files, but each prefix map must use a unique map-name. The --pmap-file switch(es) must precede the --fields switch. See also pmapfilter(3).
When the SiLK Python plug-in is used, rwsort reads the Python code from the file PATH to define additional fields that can be used as part of the sort key. This file should call register_field() for each field it wishes to define. For details and examples, see the silkpython(3) and pysilk(3) manual pages.
When the temporary files and the final output are stored on the same file volume, rwsort will require approximately twice as much free disk space as the size of data to be sorted.
When the temporary files and the final output are on different volumes, rwsort will require between 1 and 1.5 times as much free space on the temporary volume as the size of the data to be sorted.
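These rules of thumb translate into a quick estimate of the free space to reserve on the volume that holds the temporary files. The sketch below only restates the approximate figures above as arithmetic; the function name free_space_estimate is hypothetical.

```python
def free_space_estimate(data_bytes, same_volume):
    """Return a (low, high) range of free bytes to have available.

    same_volume=True: temporary files and output share one volume,
    so roughly twice the data size is needed on that volume.
    same_volume=False: between 1 and 1.5 times the data size is
    needed on the temporary volume alone.
    """
    if same_volume:
        return (2 * data_bytes, 2 * data_bytes)
    return (data_bytes, data_bytes * 3 // 2)
```

Sorting 10 GiB of records would therefore call for about 20 GiB free on a shared volume, or between 10 and 15 GiB on a separate temporary volume.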
In the following examples, the dollar sign ($) represents the shell prompt. The text after the dollar sign represents the command line.
To sort the records in infile.rw based primarily on destination port and secondarily on source IP and write the binary output to outfile.rw, run:
$ rwsort --fields=dport,sip --output-path=outfile.rw infile.rw
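Because the output is binary, a script that wants readable text usually pipes rwsort into rwcut(1). The Python sketch below builds and runs such a pipeline with the subprocess module. The helper names build_commands and sorted_text are hypothetical, the --fields switch passed to rwcut comes from rwcut(1) rather than from this page, and running sorted_text requires the SiLK tools to be installed.

```python
import subprocess


def build_commands(infile, fields):
    """Return the argument lists for rwsort and rwcut."""
    key = ",".join(fields)
    sort_cmd = ["rwsort", f"--fields={key}", infile]
    # Assumption: rwcut prints the same fields that form the sort key.
    cut_cmd = ["rwcut", f"--fields={key}"]
    return sort_cmd, cut_cmd


def sorted_text(infile, fields):
    """Sort infile with rwsort and return rwcut's textual output."""
    sort_cmd, cut_cmd = build_commands(infile, fields)
    # rwsort writes binary records to its standard output, a pipe here.
    sorter = subprocess.Popen(sort_cmd, stdout=subprocess.PIPE)
    cutter = subprocess.run(cut_cmd, stdin=sorter.stdout,
                            stdout=subprocess.PIPE, text=True, check=True)
    sorter.stdout.close()
    if sorter.wait() != 0:
        raise RuntimeError("rwsort failed")
    return cutter.stdout
```

No --output-path is given to rwsort in this sketch, so its records go to the standard output and never touch the disk as a separate file.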
The silkpython(3) manual page provides examples that use PySiLK to create arbitrary fields to use as part of the key for rwsort.
When set and --temp-directory is not specified, rwsort writes the temporary files it creates to this directory. SILK_TMPDIR overrides the value of TMPDIR.
When set and SILK_TMPDIR is not set, rwsort writes the temporary files it creates to this directory.
This environment variable is used by Python to locate modules. When --python-file is specified, rwsort must load the Python files that comprise the PySiLK package, such as silk/__init__.py. If this silk/ directory is located outside Python's normal search path (for example, in the SiLK installation tree), it may be necessary to set or modify the PYTHONPATH environment variable to include the parent directory of silk/ so that Python can find the PySiLK module.
When set, Python plug-ins will output traceback information on Python errors to the standard error.
This environment variable allows the user to specify the country code mapping file that rwsort uses when computing the scc and dcc fields. The value may be a complete path or a file relative to the SILK_PATH. See the "FILES" section for standard locations of this file.
This environment variable allows the user to specify the address type mapping file that rwsort uses when computing the sType and dType fields. The value may be a complete path or a file relative to the SILK_PATH. See the "FILES" section for standard locations of this file.
The SiLK tools normally refuse to overwrite existing files. Setting SILK_CLOBBER to a non-empty value removes this restriction.
This environment variable is used as the value for --compression-method when that switch is not provided. Since SiLK 3.13.0.
This environment variable is used as the value for the --site-config-file when that switch is not provided.
This environment variable specifies the root directory of the data repository. As described in the "FILES" section, rwsort may use this environment variable when searching for the SiLK site configuration file.
This environment variable gives the root of the install tree. When searching for configuration files and plug-ins, rwsort may use this environment variable. See the "FILES" section for details.
When set to 1, rwsort prints status messages to the standard error as it attempts to find and open each of its plug-ins. In addition, when an attempt to register a field fails, the application prints a message specifying the additional function(s) that must be defined to register the field in the application. Be aware that the output can be rather verbose.
When set to 1, rwsort prints debugging messages to the standard error as it creates, re-opens, and removes temporary files.
Possible locations for the address types mapping file required by the sType and dType fields.
Possible locations for the SiLK site configuration file which are checked when the --site-config-file switch is not provided.
Possible locations for the country code mapping file required by the scc and dcc fields.
Directories that rwsort checks when attempting to load a plug-in.
Directory in which to create temporary files.
rwcount(1), rwcut(1), rwfileinfo(1), rwstats(1), rwuniq(1), rwpmapbuild(1), addrtype(3), ccfilter(3), pmapfilter(3), pysilk(3), silkpython(3), silk-plugin(3), sensor.conf(5), rwflowpack(8), silk(7), yaf(1), dlopen(3), zlib(3)
Fields sTime+msec, eTime+msec, dur+msec, and their aliases (22, 23, 24) were removed in SiLK 3.23.0. Use fields sTime, eTime, and duration instead.
If an output path is not specified, rwsort will write to the standard output unless it is connected to a terminal, in which case an error is printed and rwsort exits.
If an input pipe or a set of input files are not specified, rwsort will read records from the standard input unless it is connected to a terminal, in which case an error is printed and rwsort exits.
Note that rwsort produces binary output. Use rwcut(1) to view the records.
Do not spend the resources to sort the data if you are going to be passing it to an aggregation tool like rwtotal or rwaddrcount, which have their own internal data structures that will ignore the sorted data.
Both rwuniq(1) and rwstats(1) can take advantage of previously sorted data, but you must explicitly inform them that the input is sorted by providing the --presorted-input switch.