CERT NetSA Security Suite 
Open Source Tools for Network Monitoring 
SiLK 2.1.0 | YAF 1.0.0.2 | IPA 0.4.0 | fixbuf 0.8.0 | Portal 0.9.0 | RAVE 1.9.16 | iSiLK 0.1.6


NAME

rwsort - Sort SiLK Flow records on one or more fields


SYNOPSIS

  rwsort --fields=KEY [--presorted-input] [--reverse]
        [--plugin=PLUGIN]
        [--temp-directory=DIR_PATH] [--sort-buffer-size=SIZE]
        [--note-add=TEXT] [--note-file-add=FILE]
        [--compression-method=COMP_METHOD] [--print-filenames]
        [--output-path=PATH] [--site-config-file=FILENAME]
        [--pmap-file=MAPNAME:PATH [--pmap-file=MAPNAME:PATH ...]]
        [--python-file=PATH ...] 
        [ { --input-pipe=PATH | FILE [FILES ...] } ]
  rwsort [--pmap-file=MAPNAME:PATH [--pmap-file=MAPNAME:PATH ...]]
        [--plugin=PLUGIN ...] [--python-file=PATH ...] --help
  rwsort --version


DESCRIPTION

rwsort reads SiLK Flow records from the specified --input-pipe, from the files named on the command line, or from the standard input. The records are sorted on the field(s) listed by the --fields switch, and the SiLK Flow records are written to the --output-path or to the standard output if it is not connected to a terminal. The output from rwsort is binary SiLK Flow records; the output must be passed into another tool for human-readable output.

The amount of fast memory used by rwsort will increase until it reaches a maximum near 2GB. (Use the --sort-buffer-size switch to change this upper limit on the buffer size.) If more records are read than will fit into memory, the in-core records are sorted and temporarily stored on disk as described by the --temp-directory switch. When all records have been read, the on-disk files are merged and the sorted records written to the output.
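
The spill-and-merge behavior described above can be pictured with a minimal Python sketch. It is not rwsort's implementation: the names external_sort, _write_run, and _read_run are hypothetical, the buffer limit is counted in records rather than bytes, and records are modeled as plain comparable Python values.

```python
import heapq
import pickle
import tempfile

def _write_run(records, tmpdir):
    """Sort one buffer-full of records and spill it to a temporary file."""
    run = tempfile.TemporaryFile(dir=tmpdir)
    for rec in sorted(records):
        pickle.dump(rec, run)
    run.seek(0)
    return run

def _read_run(run):
    """Yield the records of one spilled run in their stored (sorted) order."""
    while True:
        try:
            yield pickle.load(run)
        except EOFError:
            return

def external_sort(records, max_in_memory, tmpdir=None):
    """Yield records in sorted order, buffering at most max_in_memory of them
    before spilling a sorted run to disk."""
    runs, buffer = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_in_memory:
            runs.append(_write_run(buffer, tmpdir))
            buffer = []
    if not runs:
        # Everything fit in memory, so no temporary files are needed.
        yield from sorted(buffer)
        return
    if buffer:
        runs.append(_write_run(buffer, tmpdir))
    try:
        # Merge the sorted runs; only one record per run is held at a time.
        yield from heapq.merge(*(_read_run(run) for run in runs))
    finally:
        for run in runs:
            run.close()
```

In this sketch a larger max_in_memory produces fewer runs, which mirrors the trade-off that the --sort-buffer-size switch controls.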

By default, the temporary files are stored in the /tmp directory. Because these temporary files will be large, it is strongly recommended that /tmp not be used as the temporary directory. To modify the temporary directory used by rwsort, provide the --temp-directory switch, set the SILK_TMPDIR environment variable, or set the TMPDIR environment variable.

To merge previously sorted SiLK data files into a sorted stream, run rwsort with the --presorted-input switch. rwsort will merge-sort all the input files, reducing its memory requirements considerably. It is the user's responsibility to ensure that all the input files have been sorted with the same --fields value (and --reverse if applicable). rwsort may still require use of a temporary directory while merging the files (for example, if rwsort does not have enough available file handles to open all the input files at once).


OPTIONS

Option names may be abbreviated if the abbreviation is unique or is an exact match for an option. A parameter to an option may be specified as --arg=param or --arg param, though the first form is required for options that take optional parameters.
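
The abbreviation rule can be illustrated with a short Python sketch. The function resolve_option and the SWITCHES list are hypothetical names used only for illustration; the list holds a subset of the switches from the SYNOPSIS, and the sketch does not claim to reproduce rwsort's own option parser.

```python
def resolve_option(given, known):
    """Return the full option name that `given` stands for."""
    if given in known:
        # An exact match is accepted even if it is also a prefix of another name.
        return given
    matches = [name for name in known if name.startswith(given)]
    if len(matches) == 1:
        # A unique abbreviation is accepted.
        return matches[0]
    if not matches:
        raise ValueError(f"unknown option: --{given}")
    raise ValueError(f"ambiguous option --{given}: could be {sorted(matches)}")

# A subset of the switches listed in the SYNOPSIS.
SWITCHES = ["fields", "presorted-input", "reverse", "plugin", "print-filenames",
            "pmap-file", "python-file", "temp-directory", "sort-buffer-size"]

print(resolve_option("rev", SWITCHES))   # prints: reverse
```

With this list, "pre" resolves to presorted-input, while "p" is rejected as ambiguous because several switches begin with that letter.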

The --fields switch is required. rwsort will fail when it is not provided.

--fields=KEY

KEY contains the list of flow attributes (a.k.a. fields or columns) that make up the key by which flows are sorted. The fields are listed in order from primary sort key, secondary key, etc. Each field may be specified once only. KEY is a comma separated list of field-names, field-integers, and ranges of field-integers; a range is specified by separating the start and end of the range with a hyphen (-). Field-names are case insensitive. Example:

 --fields=stime,10,1-5

There is no default value for the --fields switch; the switch must be specified.
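
As an illustration of the KEY syntax, the following Python sketch expands a KEY string into an ordered list of field-integers. The names parse_fields, FIELD_IDS, and INTEGER_ALIASES are hypothetical; the table covers only the built-in field names listed in this manual page, and treating 22, 23, and 24 as the same fields as 9, 11, and 10 for the "once only" rule is an assumption of the sketch.

```python
# Built-in field names (lower case, since matching is case insensitive)
# mapped to their field-integers as listed in this manual page.
FIELD_IDS = {
    "sip": 1, "dip": 2, "sport": 3, "dport": 4, "protocol": 5,
    "packets": 6, "pkts": 6, "bytes": 7, "flags": 8,
    "stime": 9, "stime+msec": 9, "dur": 10, "dur+msec": 10,
    "etime": 11, "etime+msec": 11, "sensor": 12,
    "in": 13, "out": 14, "nhip": 15, "class": 20, "type": 21,
    "icmptypecode": 25, "initialflags": 26, "sessionflags": 27,
    "attributes": 28, "application": 29,
}
# Assumption: these integers name the same fields as sTime, eTime, and dur.
INTEGER_ALIASES = {22: 9, 23: 11, 24: 10}
HIGHEST_FIELD = 29

def parse_fields(key):
    """Expand a KEY such as 'stime,10,1-5' into field-integers, primary key first."""
    ids = []
    for token in key.split(","):
        token = token.strip().lower()
        if token in FIELD_IDS:
            expanded = [FIELD_IDS[token]]
        elif "-" in token:
            # A range of field-integers, for example 1-5.
            start, end = (int(part) for part in token.split("-", 1))
            if start > end:
                raise ValueError(f"bad range: {token}")
            expanded = list(range(start, end + 1))
        else:
            expanded = [int(token)]  # raises ValueError for an unknown name
        for field_id in expanded:
            field_id = INTEGER_ALIASES.get(field_id, field_id)
            if not 1 <= field_id <= HIGHEST_FIELD:
                raise ValueError(f"field out of range: {field_id}")
            if field_id in ids:
                raise ValueError(f"field given more than once: {field_id}")
            ids.append(field_id)
    return ids

print(parse_fields("stime,10,1-5"))   # prints: [9, 10, 1, 2, 3, 4, 5]
```

The order of the returned list is the sort priority: the first entry is the primary key, the second is the secondary key, and so on.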

The complete list of built-in fields that the SiLK tool suite supports follows, though note that not all fields are present in all SiLK file formats; when a field is not present, its value is 0.

sIP,1

source IP address

dIP,2

destination IP address

sPort,3

source port for TCP and UDP, or equivalent

dPort,4

destination port for TCP and UDP, or equivalent

protocol,5

IP protocol

packets,pkts,6

packet count

bytes,7

byte count

flags,8

bit-wise OR of TCP flags over all packets

sTime,9,sTime+msec,22

starting time of flow (milliseconds resolution)

dur,10,dur+msec,24

duration of flow (milliseconds resolution)

eTime,11,eTime+msec,23

end time of flow (milliseconds resolution)

sensor,12

name or ID of sensor at the collection point

class,20

class of sensor at the collection point

type,21

type of sensor at the collection point

icmpTypeCode,25

the ICMP type and code

Many SiLK file formats do not store the following fields and their values will always be 0; they are listed here for completeness:

in,13

router SNMP input interface

out,14

router SNMP output interface

nhIP,15

router next hop IP

SiLK can store flows generated by enhanced collection software that provides more information than NetFlow v5. These flows may support some or all of these additional fields; for flows without this additional information, the field's value is always 0.

initialFlags,26

TCP flags on first packet in the flow

sessionFlags,27

bit-wise OR of TCP flags over all packets except the first in the flow

attributes,28

flow attributes set by the flow generator:

F

flow generator saw additional packets in this flow following a packet with a FIN flag (excluding ACK packets)

T

flow generator prematurely created a record for a long-running connection due to a timeout. (When the flow generator yaf(1) is run with the --silk switch, it will prematurely create a flow and mark it with T if the byte count of the flow cannot be stored in a 32-bit value.)

C

flow generator created this flow as a continuation of a long-running connection, where the previous flow for this connection met a timeout (or a byte threshold in the case of yaf).

Consider a long-running ssh session that exceeds the flow generator's active timeout. (This is the active timeout since the flow generator creates a flow for a connection that still has activity.) The flow generator will create multiple flow records for this ssh session, each spanning some portion of the total session. The first flow record will be marked with a T indicating that it hit the timeout. The second through next-to-last records will be marked with TC indicating that this flow both timed out and is a continuation of a flow that timed out. The final flow will be marked with a C, indicating that it was created as a continuation of an active flow.
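
The marking pattern in the ssh example can be written out as a small Python sketch. The function name continuation_marks is hypothetical, and the sketch models only the T and C attributes described above for a single connection.

```python
def continuation_marks(record_count):
    """Return the T/C attribute string for each flow record of one connection."""
    if record_count < 1:
        raise ValueError("a connection produces at least one flow record")
    if record_count == 1:
        # The connection never hit the active timeout, so neither mark is set.
        return [""]
    # First record timed out; middle records timed out and are continuations;
    # the last record is a continuation only.
    return ["T"] + ["TC"] * (record_count - 2) + ["C"]

print(continuation_marks(4))   # prints: ['T', 'TC', 'TC', 'C']
```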

application,29

guess as to the content of the flow. Some software that generates flow records from packet data, such as yaf, will inspect the contents of the packets that make up a flow and use traffic signatures to label the content of the flow. SiLK calls this label the application; yaf refers to it as the appLabel. The application is the port number that is traditionally used for that type of traffic (see the /etc/services file on most UNIX systems). For example, traffic that the flow generator recognizes as FTP will have a value of 21, even if that traffic is being routed through the standard HTTP/web port (80).

The list of built-in fields may be augmented by run-time loading of plug-ins (shared object files or dynamic libraries) when the plug-in is available. rwsort automatically looks for the following plug-ins:

ADDRESS TYPE (addrtype.so)

stype,16

categorize the source IP address as non-routable, internal, or external and sort based on the category. See addrtype(3).

dtype,17

as stype for the destination IP address

COUNTRY CODE (ccfilter.so)

scc,18

the country code of the source IP address. See ccfilter(3).

dcc,19

as scc for the destination IP

PREFIX MAP (pmapfilter.so)

src-MAPNAME

value determined by passing the source IP or the protocol/source-port to the user-defined mapping defined in the prefix map associated with MAPNAME. See the description of the --pmap-file switch and the pmapfilter(3) manual page.

dst-MAPNAME

as src-MAPNAME for the destination IP or protocol/destination-port.

sval
dval

These are deprecated field names created by pmapfilter that correspond to src-MAPNAME and dst-MAPNAME, respectively. These fields are available when a prefix map is used that is not associated with a MAPNAME.

--presorted-input

Instruct rwsort to merge-sort the input files; that is, rwsort assumes the input files have been previously sorted using the same values for the --fields and --reverse switches as were given for this invocation. This switch can greatly reduce rwsort's memory requirements as a large buffer is not required for sorting the records. If the input files were created with rwsort, you can run rwfileinfo(1) on the files to see the rwsort invocation that created them.
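
Why the inputs must share the same key and direction can be seen in a minimal Python sketch of a merge. The function merge_presorted is a hypothetical name, and records are modeled as dictionaries with dport and bytes entries rather than real SiLK Flow records.

```python
import heapq
from operator import itemgetter

def merge_presorted(inputs, key, reverse=False):
    """Lazily merge inputs that are each already sorted by `key`.

    When reverse is True, every input must be sorted largest to smallest.
    Inputs sorted on a different key or direction give incorrectly ordered output.
    """
    return heapq.merge(*inputs, key=key, reverse=reverse)

# Two inputs, each already sorted on (dport, bytes).
file_a = [{"dport": 22, "bytes": 40}, {"dport": 80, "bytes": 1500}]
file_b = [{"dport": 22, "bytes": 60}, {"dport": 443, "bytes": 52}]

sort_key = itemgetter("dport", "bytes")
merged = list(merge_presorted([file_a, file_b], sort_key))
print([rec["dport"] for rec in merged])   # prints: [22, 22, 80, 443]
```

The merge keeps only the current record from each input in memory, which is why no large sort buffer is needed in this mode.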

--reverse

Cause rwsort to reverse the sort order, causing larger values to occur in the output before smaller values. Normally smaller values appear before larger values.

--plugin=PLUGIN

Augment the list of fields by using run-time loading of the plug-in (shared object) whose path is PLUGIN. The creation of these plug-ins is beyond the scope of this manual page. When PLUGIN contains a slash (/), rwsort assumes the path to PLUGIN is correct. Otherwise, rwsort will attempt to find the file in $SILK_PATH/lib/silk, $SILK_PATH/share/lib, $SILK_PATH/lib, and in these directories parallel to the application's directory: lib/silk, share/lib, and lib. If rwsort does not find the file, it assumes the plug-in is in the current directory. To force rwsort to look in the current directory first, specify --plugin=./PLUGIN. When the SILK_PLUGIN_DEBUG environment variable is non-empty, rwsort prints status messages to the standard error as it tries to open each of its plug-ins.

--temp-directory=DIR_PATH

Specify the name of the directory in which to store data files temporarily when more records have been read than will fit into RAM. This switch overrides the directory specified in the SILK_TMPDIR environment variable, which overrides the directory specified in the TMPDIR variable, which overrides the default, /tmp.
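
The precedence order can be summarized in a short Python sketch. The function resolve_temp_directory is a hypothetical name, and treating an empty value as unset is an assumption of the sketch rather than documented behavior.

```python
import os

def resolve_temp_directory(switch_value=None, environ=None):
    """Pick the temporary directory: switch, then SILK_TMPDIR, then TMPDIR, then /tmp."""
    environ = os.environ if environ is None else environ
    candidates = (switch_value, environ.get("SILK_TMPDIR"), environ.get("TMPDIR"))
    for candidate in candidates:
        if candidate:  # assumption: an empty string counts as "not set"
            return candidate
    return "/tmp"

env = {"SILK_TMPDIR": "/var/silk/tmp", "TMPDIR": "/scratch"}
print(resolve_temp_directory(None, env))          # prints: /var/silk/tmp
print(resolve_temp_directory("/data/tmp", env))   # prints: /data/tmp
```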

--sort-buffer-size=SIZE

Set the maximum size of the buffer used for sorting the records, in bytes. A larger buffer means fewer temporary files need to be created, reducing the I/O wait times. When this switch is not specified, the default maximum for this buffer is near 2GB. The SIZE may be given as an ordinary integer, or as a real number followed by a suffix K, M or G, which represents the numerical value multiplied by 1,024 (kilo), 1,048,576 (mega), and 1,073,741,824 (giga), respectively. For example, 1.5K represents 1,536 bytes, or one and one-half kilobytes. (This value does not represent the absolute maximum amount of RAM that rwsort will allocate, since additional buffers will be allocated for reading the input and writing the output.) The sort buffer is not used when the --presorted-input switch is specified.
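
The SIZE arithmetic can be checked with a minimal Python sketch. The function parse_buffer_size is a hypothetical name; the sketch accepts only the upper-case suffixes K, M, and G named above and truncates any fractional byte, which is an assumption.

```python
MULTIPLIERS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_buffer_size(size):
    """Convert a SIZE such as '4096', '1.5K', or '2G' into a byte count."""
    text = size.strip()
    if text and text[-1] in MULTIPLIERS:
        # A real number followed by a suffix; drop any fractional byte.
        return int(float(text[:-1]) * MULTIPLIERS[text[-1]])
    return int(text)  # an ordinary integer with no suffix

print(parse_buffer_size("1.5K"))   # prints: 1536
print(parse_buffer_size("2G"))     # prints: 2147483648
```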

--note-add=TEXT

Add the specified TEXT to the header of the output file as an annotation. This switch may be repeated to add multiple annotations to a file. To view the annotations, use the rwfileinfo(1) tool.

--note-file-add=FILENAME

Open FILENAME and add the contents of that file to the header of the output file as an annotation. This switch may be repeated to add multiple annotations. Currently the application makes no effort to ensure that FILENAME contains text; be careful that you do not attempt to add a SiLK data file as an annotation.

--compression-method=COMP_METHOD

Set the compression method of the output to COMP_METHOD. Some SiLK tools can use an external library to compress their binary output. The list of available compression methods and the default method are set when SiLK is compiled (the --help and --version switches print the available and default compression methods) and depend on which supported libraries are found. SiLK can support:

none

Do not compress the output using an external library

zlib

Use the zlib(3) library for compressing the output

lzo1x

Use the lzo1x algorithm from the LZO real time compression library for compression

best

Use whichever available method gives the best compression in general, though not necessarily the best for this particular output.

--print-filenames

Print to the standard error the names of input files as they are opened.

--output-path=PATH

Write the sorted SiLK Flow records to the file at PATH. This switch must not name an existing regular file. When the standard output is not a terminal and this switch is not provided or its argument is stdout, the sorted records are written to the standard output.

--input-pipe=PATH

Read the SiLK Flow records to be sorted from the named pipe at PATH. If PATH is stdin, records are read from the standard input. Use of this switch is not required, since rwsort will automatically read data from the standard input when no file names are specified on the command line.

--site-config-file=FILENAME

Read the SiLK site configuration from the named file FILENAME. When this switch is not provided, the location specified by the SILK_CONFIG_FILE environment variable is used if that variable is not empty. The value of SILK_CONFIG_FILE should include the name of the file. Otherwise, the application looks for a file named silk.conf in the following directories: the directory specified in the SILK_DATA_ROOTDIR environment variable; the data root directory that is compiled into SiLK (use the --version switch to view this value); the directories $SILK_PATH/share/silk/ and $SILK_PATH/share/; and the share/silk/ and share/ directories parallel to the application's directory.
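
The search order can be laid out as a Python sketch that lists candidate paths without opening any of them. The function site_config_candidates and its parameters are hypothetical. The compiled-in data root is passed as an argument because its value depends on how SiLK was built, and reading "parallel to the application's directory" as a sibling of the directory that holds the executable is an interpretation, not something this page states.

```python
import posixpath

def site_config_candidates(switch_value, environ, compiled_data_root, app_dir):
    """Return the site configuration paths to try, in order."""
    if switch_value:
        return [switch_value]
    if environ.get("SILK_CONFIG_FILE"):
        # This value already includes the file name.
        return [environ["SILK_CONFIG_FILE"]]
    dirs = []
    if environ.get("SILK_DATA_ROOTDIR"):
        dirs.append(environ["SILK_DATA_ROOTDIR"])
    dirs.append(compiled_data_root)
    if environ.get("SILK_PATH"):
        dirs.append(posixpath.join(environ["SILK_PATH"], "share/silk"))
        dirs.append(posixpath.join(environ["SILK_PATH"], "share"))
    # Assumption: "parallel" means a sibling of the executable's directory.
    parent = posixpath.dirname(app_dir.rstrip("/"))
    dirs.append(posixpath.join(parent, "share/silk"))
    dirs.append(posixpath.join(parent, "share"))
    return [posixpath.join(d, "silk.conf") for d in dirs]

# "/data" and "/usr/local/bin" are made-up values for illustration.
for path in site_config_candidates(None, {"SILK_PATH": "/opt/silk"},
                                   "/data", "/usr/local/bin"):
    print(path)
```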

--help

Print the available options and exit. Options that add fields can be specified before --help so that the new options appear in the output.

--version

Print the version number and information about how SiLK was configured, then exit the application.

--dynamic-library=PLUGIN

This switch is deprecated. It is an alias for --plugin.

--pmap-file=MAPNAME:PATH
--pmap-file=PATH

When the prefix map plug-in is used, rwsort reads the mapping file located at PATH. When MAPNAME is provided, it will be used to refer to the fields specific to that prefix map. If MAPNAME is not provided, rwsort will check the prefix map file to see if a map-name was specified when the file was created. Using multiple --pmap-file switches allows additional prefix map files to be read as long as each uses a unique map-name. For more information, see pmapfilter(3).

--python-file=PATH

When the SiLK Python plug-in is used, rwsort reads the Python code from the file PATH to define additional fields that can be used as part of the sort key. This file should call register_plugin_field() for each field it wishes to define. For details and examples, see the silkpython(3) and pysilk(3) manual pages.


LIMITATIONS

When the temporary files and the final output are stored on the same file volume, rwsort will require approximately twice as much free disk space as the size of data to be sorted.

When the temporary files and the final output are on different volumes, rwsort will require between 1 and 1.5 times as much free space on the temporary volume as the size of the data to be sorted.
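
These two rules of thumb can be turned into a quick estimate with a Python sketch. The function free_space_estimate is a hypothetical name, and the result is only as precise as the approximate factors stated above.

```python
def free_space_estimate(data_bytes, same_volume):
    """Return a (low, high) estimate, in bytes, of the free space rwsort needs.

    same_volume=True:  temporary files and output share one volume, and the
                       estimate covers both.
    same_volume=False: the estimate covers the temporary volume only.
    """
    if same_volume:
        needed = 2 * data_bytes
        return needed, needed
    return data_bytes, data_bytes * 3 // 2

GIB = 1024 ** 3
low, high = free_space_estimate(10 * GIB, same_volume=False)
print(low // GIB, high // GIB)   # prints: 10 15
```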


EXAMPLES

To sort the records in infile.rwf based primarily on destination port and secondarily on source IP and write the binary output to outfile.rwf, run:

 rwsort --fields=dport,sip --output-path=outfile.rwf infile.rwf
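
Because rwsort writes binary records, a script that wants readable text has to pass the output through rwcut(1). The Python sketch below shows one way to do that, assuming both tools are installed and on the PATH. The functions build_pipeline and run_pipeline are hypothetical helper names; the only switches used are --fields for rwsort and --fields for rwcut.

```python
import subprocess

def build_pipeline(infile, sort_fields, cut_fields):
    """Build the argument lists for 'rwsort ... | rwcut ...' without running them."""
    sort_cmd = ["rwsort", f"--fields={sort_fields}", infile]
    cut_cmd = ["rwcut", f"--fields={cut_fields}"]
    return sort_cmd, cut_cmd

def run_pipeline(infile, sort_fields, cut_fields):
    """Run rwsort, pipe its binary output into rwcut, and return rwcut's text."""
    sort_cmd, cut_cmd = build_pipeline(infile, sort_fields, cut_fields)
    sorter = subprocess.Popen(sort_cmd, stdout=subprocess.PIPE)
    cutter = subprocess.Popen(cut_cmd, stdin=sorter.stdout,
                              stdout=subprocess.PIPE, text=True)
    sorter.stdout.close()  # let rwsort see a closed pipe if rwcut exits early
    output, _ = cutter.communicate()
    sorter.wait()
    return output

# Only builds the commands; call run_pipeline() on a host with SiLK installed.
print(build_pipeline("infile.rwf", "dport,sip", "sip,dport"))
```

Since rwsort's standard output is a pipe here rather than a terminal, no --output-path switch is needed.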

The silkpython(3) manual page provides examples that use PySiLK to create arbitrary fields to use as part of the key for rwsort.


ENVIRONMENT

SILK_TMPDIR

When set and --temp-directory is not specified, rwsort writes the temporary files it creates to this directory. SILK_TMPDIR overrides the value of TMPDIR.

TMPDIR

When set and SILK_TMPDIR is not set, rwsort writes the temporary files it creates to this directory.

PYTHONPATH

This environment variable is used by Python to locate modules. When --python-file is specified, rwsort loads Python, which in turn loads the PySiLK module, which consists of several files (silk/pysilk_nl.so, silk/__init__.py, etc). If this silk/ directory is located outside Python's normal search path (for example, in the SiLK installation tree), it may be necessary to set or modify the PYTHONPATH environment variable to include the parent directory of silk/ so that Python can find the PySiLK module.

SILK_PYTHON_TRACEBACK

When set, Python plug-ins will output traceback information on Python errors to the standard error.

SILK_COUNTRY_CODES

This environment variable allows the user to specify the country code mapping file that the ccfilter(3) plug-in will use. The value may be a complete path or a file relative to the SILK_PATH. If the variable is not specified, the code looks for a file named country_codes.pmap in the location specified by SILK_PATH.

SILK_CONFIG_FILE

This environment variable is used as the value for the --site-config-file when that switch is not provided.

SILK_DATA_ROOTDIR

When the --site-config-file switch is not provided and the SILK_CONFIG_FILE environment variable is not set, rwsort looks for the site configuration file in $SILK_DATA_ROOTDIR/silk.conf.

SILK_PATH

This environment variable gives the root of the install tree. As part of its search for the SiLK site configuration file, rwsort checks for a file named silk.conf in the directories $SILK_PATH/share/silk and $SILK_PATH/share. These directories are also searched when any other configuration file is required (e.g., the country code map). In addition, rwsort looks for plug-ins in $SILK_PATH/lib/silk, $SILK_PATH/share/lib and $SILK_PATH/lib.

SILK_PLUGIN_DEBUG

When set to 1, rwsort prints status messages to the standard error as it tries to open each of its plug-ins.


SEE ALSO

rwfilter(1), rwcut(1), rwfileinfo(1), rwuniq(1), addrtype(3), ccfilter(3), pmapfilter(3), silkpython(3), pysilk(3), yaf(1), zlib(3)


NOTES

If an output path is not specified, rwsort will write to the standard output unless it is connected to a terminal, in which case an error is printed and rwsort exits.

If neither an input pipe nor a set of input files is specified, rwsort will read records from the standard input unless it is connected to a terminal, in which case an error is printed and rwsort exits.

Note that rwsort produces binary output. Use rwcut(1) to view the records.

Do not spend the resources to sort the data if you are going to be passing it to an aggregation tool like rwtotal or rwaddrcount, which have their own internal data structures that will ignore the sorted order.

rwuniq(1) can take advantage of previously sorted data if it is instructed to do so with its --presorted-input switch.