CERT NetSA Security Suite 
Open Source Tools for Network Monitoring 


NAME

rwdedupe - Eliminate duplicate SiLK Flow records


SYNOPSIS

  rwdedupe [--ignore-fields=FIELDS] [--packets-delta=NUM]
        [--bytes-delta=NUM] [--stime-delta=NUM] [--duration-delta=NUM]
        [--temp-directory=DIR_PATH] [--buffer-size=SIZE]
        [--compression-method=COMP_METHOD] [--output-path=PATH]
        [--site-config-file=FILENAME] [FILES ...]


DESCRIPTION

rwdedupe reads SiLK Flow records from the files named on the command line or from the standard input. Records that appear in the input file(s) multiple times will only appear in the output stream once; that is, duplicate records are not written to the output. The SiLK Flows are written to the file specified by the --output-path switch or to the standard output when the --output-path switch is not provided and the standard output is not connected to a terminal.

As part of its processing, rwdedupe will re-order the records before writing them.

By default, rwdedupe considers one record to be a duplicate of another only when all the fields in the records match exactly. From another point of view, any difference between two records results in both records appearing in the output. Note that "all" means every field that exists on a SiLK Flow record. The complete list of fields is specified in the description of --ignore-fields in the OPTIONS section below.

To have rwdedupe ignore fields in the comparison, specify those fields in the --ignore-fields switch. When --ignore-fields=FIELDS is specified, a record is considered a duplicate of another if all fields except those in FIELDS match exactly. rwdedupe will treat FIELDS as being identical across all records. Put another way, if the only difference between two records is in the FIELDS fields, only one of those records will be written to the output.
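The effect of --ignore-fields can be illustrated with a small Python sketch. This is not rwdedupe's implementation: the dictionary-based record layout and the function name dedupe are assumptions made for illustration, and the sketch keeps the first record it sees, whereas this page does not say which of several duplicates rwdedupe retains.

```python
from typing import Iterable


def dedupe(records: Iterable[dict], ignore_fields: Iterable[str] = ()) -> list:
    """Keep one record per distinct combination of non-ignored fields.

    Assumption: the first record seen for each combination is the one kept.
    """
    ignore = set(ignore_fields)
    seen = set()
    kept = []
    for rec in records:
        # Build a hashable key from every field that is not ignored.
        key = tuple(sorted((k, v) for k, v in rec.items() if k not in ignore))
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


# Two hypothetical records that differ only in the sensor field.
flows = [
    {"sIP": "10.0.0.1", "dIP": "10.0.0.2", "sensor": "S0"},
    {"sIP": "10.0.0.1", "dIP": "10.0.0.2", "sensor": "S1"},
]
```

With no ignored fields both records are kept; with ignore_fields={"sensor"} only one remains.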

The --packets-delta, --bytes-delta, --stime-delta and --duration-delta switches allow for "fuzziness" in the input. For example, if --stime-delta=NUM is specified and the only difference between two records is in the sTime fields, and the fields are within NUM milliseconds of each other, only one record will be written to the output.
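As a rough illustration of the delta switches, the following Python sketch compares two records pairwise. The function name is_duplicate, the dictionary record layout, and the deltas mapping (field name to allowed difference) are assumptions for this example; how rwdedupe groups more than two near-identical records is not described here and is not modelled.

```python
def is_duplicate(a, b, ignore_fields=(), deltas=None):
    """Return True when records a and b would count as duplicates.

    Assumptions: both records carry the same field names; sTime and dur
    hold integer milliseconds; deltas maps a field name such as "sTime"
    to the NUM given on the matching --*-delta switch.
    """
    deltas = deltas or {}
    ignore = set(ignore_fields)
    for name in a:
        if name in ignore:
            continue  # ignored fields never distinguish records
        if name in deltas:
            # "differ by NUM or less": the bound is inclusive
            if abs(a[name] - b[name]) > deltas[name]:
                return False
        elif a[name] != b[name]:
            return False  # all other fields must match exactly
    return True
```

For example, two records whose sTime values differ by 500 milliseconds are duplicates under deltas={"sTime": 500} but distinct under the default delta of 0.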

During its processing, rwdedupe will try to allocate a large (near 2GB) in-memory array to hold the records. If 2GB cannot be allocated, rwdedupe reduces the requested size until it succeeds. (You may use the --buffer-size switch to change this default buffer size.) If more records are read than will fit into memory, the in-core records are temporarily stored on disk as described by the --temp-directory switch. When all records have been read, the on-disk files are merged to produce the output.
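The spill-and-merge behaviour described above resembles a classic external merge sort. The Python sketch below shows that general technique under several assumptions: records are modelled as comparable tuples, the in-memory limit is a record count rather than a byte size, the final partial chunk is kept in memory, and the name dedupe_external is hypothetical. It is not a description of rwdedupe's internals.

```python
import heapq
import pickle
import tempfile


def dedupe_external(records, max_in_memory, temp_dir=None):
    """Yield unique records in sorted order, spilling sorted runs to disk."""
    runs = []   # open temporary files, each holding one sorted run
    chunk = []  # records currently held in memory

    def spill():
        # Sort the in-memory records and write them to a temporary file.
        chunk.sort()
        f = tempfile.TemporaryFile(dir=temp_dir)
        for rec in chunk:
            pickle.dump(rec, f)
        f.seek(0)
        runs.append(f)
        chunk.clear()

    def read_run(f):
        # Stream the records of one sorted run back from disk.
        try:
            while True:
                yield pickle.load(f)
        except EOFError:
            return
        finally:
            f.close()

    for rec in records:
        chunk.append(rec)
        if len(chunk) >= max_in_memory:
            spill()
    chunk.sort()  # the last partial chunk stays in memory

    # Merge all sorted runs; equal records become adjacent, so a duplicate
    # is any record equal to the one yielded just before it.
    previous = object()  # sentinel that compares unequal to every record
    for rec in heapq.merge(chunk, *(read_run(f) for f in runs)):
        if rec != previous:
            yield rec
            previous = rec
```

Because the merge works on sorted runs, the output order differs from the input order, which is consistent with the re-ordering noted earlier.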

Because of the sizes of the temporary files, it is strongly recommended that /tmp not be used as the temporary directory, and rwdedupe will print a warning when /tmp is used. To modify the temporary directory used by rwdedupe, provide the --temp-directory switch, set the SILK_TMPDIR environment variable, or set the TMPDIR environment variable.


OPTIONS

Option names may be abbreviated if the abbreviation is unique or is an exact match for an option. A parameter to an option may be specified as --arg=param or --arg param, though the first form is required for options that take optional parameters.
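The abbreviation rule can be sketched in Python as follows. The option list contains only the switches shown in the SYNOPSIS; rwdedupe accepts others (for example --help and --version), so the prefixes that are actually unique may differ. The function name resolve_option is hypothetical.

```python
OPTIONS = [
    "ignore-fields", "packets-delta", "bytes-delta", "stime-delta",
    "duration-delta", "temp-directory", "buffer-size",
    "compression-method", "output-path", "site-config-file",
]


def resolve_option(abbrev, options=OPTIONS):
    """Expand an abbreviated option name to its full name.

    An exact match always wins; otherwise the abbreviation must be a
    prefix of exactly one option.
    """
    if abbrev in options:
        return abbrev
    matches = [name for name in options if name.startswith(abbrev)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("unknown option: --" + abbrev)
    raise ValueError("ambiguous option --%s: %s" % (abbrev, ", ".join(matches)))
```

Under this sketch, "output" resolves to "output-path", while "s" is rejected because it could mean either "stime-delta" or "site-config-file".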

--ignore-fields=FIELDS
Ignore the fields listed in FIELDS when determining if two flow records are identical; that is, treat FIELDS as being identical across all flows. By default, all fields are treated as significant.

FIELDS is a comma separated list of field-names, field-integers, and ranges of field-integers; a range is specified by separating the start and end of the range with a hyphen (-), e.g.,

  --ignore-fields=stime,12-15

The supported fields are:

sIP,sip,1
source IP address

dIP,dip,2
destination IP address

sPort,sport,3
source port for TCP and UDP, or equivalent

dPort,dport,4
destination port for TCP and UDP, or equivalent

protocol,5
IP protocol

packets,pkts,6
packet count

bytes,7
byte count

flags,8
bit-wise OR of TCP flags over all packets

sTime,stime,9
starting time of flow (milliseconds resolution)

dur,10
duration of flow (milliseconds resolution)

sensor,12
name or ID of sensor at the collection point

in,13
router SNMP input interface

out,14
router SNMP output interface

nhIP,15
router next hop IP

class,20
class of sensor at the collection point

type,21
type of sensor at the collection point

initialFlags,initialflags,26
TCP flags on first packet in the flow

sessionFlags,sessionflags,27
bit-wise OR of TCP flags over all packets except the first in the flow

attributes,28
flow attributes set by flow collector

application,29
guess as to the application generating the flow; value will be standard port for the application, such as 80 for web traffic
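To make the FIELDS syntax concrete, here is a Python sketch that expands a FIELDS string into field integers using the table above. It accepts only the spellings listed in the table and rejects a range that covers an unlisted integer (such as 11); both choices are assumptions, since this page does not say how rwdedupe treats those cases. The name parse_fields is hypothetical.

```python
FIELD_IDS = {
    "sIP": 1, "sip": 1, "dIP": 2, "dip": 2,
    "sPort": 3, "sport": 3, "dPort": 4, "dport": 4,
    "protocol": 5, "packets": 6, "pkts": 6, "bytes": 7, "flags": 8,
    "sTime": 9, "stime": 9, "dur": 10, "sensor": 12,
    "in": 13, "out": 14, "nhIP": 15, "class": 20, "type": 21,
    "initialFlags": 26, "initialflags": 26,
    "sessionFlags": 27, "sessionflags": 27,
    "attributes": 28, "application": 29,
}
VALID_IDS = set(FIELD_IDS.values())


def parse_fields(spec):
    """Expand a FIELDS string such as "stime,12-15" into sorted field integers."""
    ids = set()
    for token in spec.split(","):
        token = token.strip()
        if token in FIELD_IDS:
            ids.add(FIELD_IDS[token])
            continue
        # Otherwise the token must be a field-integer or a START-END range.
        start, _, end = token.partition("-")
        try:
            first = int(start)
            last = int(end) if end else first
        except ValueError:
            raise ValueError("unknown field: %r" % token)
        if first > last:
            raise ValueError("invalid range: %r" % token)
        for n in range(first, last + 1):
            if n not in VALID_IDS:  # assumption: unlisted integers are errors
                raise ValueError("unsupported field integer: %d" % n)
            ids.add(n)
    return sorted(ids)
```

Applied to the example above, parse_fields("stime,12-15") selects fields 9, 12, 13, 14 and 15, that is, sTime, sensor, in, out and nhIP.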

--packets-delta=NUM
Treat the packets field on two records as being the same if the values differ by NUM packets or less. If not specified, the default is 0.

--bytes-delta=NUM
Treat the bytes field on two records as being the same if the values differ by NUM bytes or less. If not specified, the default is 0.

--stime-delta=NUM
Treat the start-time field on two records as being the same if the values differ by NUM milliseconds or less. If not specified, the default is 0.

--duration-delta=NUM
Treat the duration field on two records as being the same if the values differ by NUM milliseconds or less. If not specified, the default is 0.

--temp-directory=DIR_PATH
Specify the name of the directory in which to store data files temporarily when more records have been read than will fit into RAM. This switch overrides the directory specified in the SILK_TMPDIR environment variable, which overrides the directory specified in the TMPDIR variable, which overrides /tmp.
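The precedence order can be written as a short Python sketch. The function name resolve_temp_directory is hypothetical, and treating an empty value as "not set" is an assumption.

```python
import os


def resolve_temp_directory(switch_value=None, environ=None):
    """Pick the temporary directory using the documented precedence:
    --temp-directory, then SILK_TMPDIR, then TMPDIR, then /tmp."""
    env = os.environ if environ is None else environ
    for candidate in (switch_value, env.get("SILK_TMPDIR"), env.get("TMPDIR")):
        if candidate:  # assumption: empty strings count as unset
            return candidate
    return "/tmp"
```

Passing an explicit environ mapping makes the function easy to check without changing the real environment.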

--buffer-size=SIZE
Set the initial (maximum) size of the buffer to use for holding the records, in bytes. A larger buffer means fewer temporary files need to be created, reducing the I/O wait times. The default maximum for this buffer is near 2GB. If the buffer cannot be allocated, the requested size is reduced by 25% and the allocation is attempted again. This cycle continues until a buffer is allocated or the minimum buffer size is reached. The SIZE may be given as an ordinary integer, or as a real number followed by a suffix K, M or G, which represents the numerical value multiplied by 1,024 (kilo), 1,048,576 (mega), and 1,073,741,824 (giga), respectively. For example, 1.5K represents 1,536 bytes, or one and one-half kilobytes. (This value does not represent the absolute maximum amount of RAM that rwdedupe will allocate, since additional buffers will be allocated for reading the input and writing the output.)
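The SIZE syntax and the 25% back-off can be illustrated in Python. This is a sketch under assumptions: only the uppercase suffixes named above are accepted, each reduction is taken from the current size rather than the original request, and the minimum buffer size is a parameter because its actual value is not given here. The names parse_size and shrink_sequence are hypothetical.

```python
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a SIZE argument such as "1.5K" or "4096" to a byte count."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        # A real number followed by K, M or G
        return int(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)  # an ordinary integer


def shrink_sequence(requested, minimum):
    """Yield the buffer sizes that would be attempted, largest first.

    Each failed attempt reduces the size by 25% until it would fall
    below the minimum.
    """
    if minimum <= 0:
        raise ValueError("minimum must be positive")
    size = requested
    while size >= minimum:
        yield size
        size = size * 3 // 4  # keep 75% of the previous size
```

For instance, starting from 1,024 bytes with a hypothetical minimum of 500 bytes, the sizes tried would be 1,024, 768 and 576.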

--compression-method=COMP_METHOD
Set the compression method of the output to COMP_METHOD. Some SiLK tools can use an external library to compress their binary output. The list of available compression methods and the default method are set when SiLK is compiled (the --help and --version switches print the available and default compression methods) and depend on which supported libraries are found. SiLK can support:
none
Do not compress the output using an external library

zlib
Use the zlib(3) library for compressing the output

lzo1x
Use the lzo1x algorithm from the LZO real time compression library for compression

best
Use whichever available method gives the best compression in general, though not necessarily the best for this particular output.

--output-path=PATH
Write the SiLK Flow records to the specified file or named pipe. This switch must not name an existing regular file. When the standard output is not a terminal and this switch is not provided or its argument is stdout, the records are written to the standard output.

--site-config-file=FILENAME
Read the SiLK site configuration from the named file FILENAME. When this switch is not provided, the location specified by the SILK_CONFIG_FILE environment variable is used if that variable is not empty. The value of SILK_CONFIG_FILE should include the name of the file. Otherwise, the application looks for a file named silk.conf in the following directories: the directory specified in the SILK_DATA_ROOTDIR environment variable; the data root directory that is compiled into SiLK (use the --version switch to view this value); the directories $SILK_PATH/share/silk/ and $SILK_PATH/share/; and the share/silk/ and share/ directories parallel to the application's directory.
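The search order can be sketched in Python as a function that lists candidate paths. Several details are assumptions: the function and parameter names are hypothetical, the compiled-in data root is passed as an argument, and "parallel to the application's directory" is interpreted as share/silk/ and share/ next to the directory that contains the executable.

```python
import posixpath


def silk_conf_candidates(switch_value, environ, compiled_data_root, app_dir):
    """Return the paths that would be tried for the site configuration file,
    in order."""
    if switch_value:
        return [switch_value]  # --site-config-file names the file directly
    if environ.get("SILK_CONFIG_FILE"):
        return [environ["SILK_CONFIG_FILE"]]  # value includes the file name

    dirs = []
    if environ.get("SILK_DATA_ROOTDIR"):
        dirs.append(environ["SILK_DATA_ROOTDIR"])
    dirs.append(compiled_data_root)
    if environ.get("SILK_PATH"):
        dirs.append(posixpath.join(environ["SILK_PATH"], "share", "silk"))
        dirs.append(posixpath.join(environ["SILK_PATH"], "share"))
    # Assumption: directories parallel to the application's directory
    parent = posixpath.dirname(app_dir.rstrip("/"))
    dirs.append(posixpath.join(parent, "share", "silk"))
    dirs.append(posixpath.join(parent, "share"))
    return [posixpath.join(d, "silk.conf") for d in dirs]
```

The first path in the returned list that exists would be the one used.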


LIMITATIONS

When the temporary files and the final output are stored on the same file volume, rwdedupe will require approximately twice as much free disk space as the size of the input data.

When the temporary files and the final output are on different volumes, rwdedupe will require between 1 and 1.5 times as much free space on the temporary volume as the size of the input data.
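These rules of thumb translate into a simple estimate, sketched below in Python. The function name free_space_needed is hypothetical, and the figures are the approximations stated above, not guarantees.

```python
def free_space_needed(input_bytes, same_volume):
    """Return an approximate (low, high) range of free bytes required.

    same_volume=True:  space on the shared volume, about 2x the input.
    same_volume=False: space on the temporary volume only, 1x to 1.5x
                       the input; the output volume is not included.
    """
    if same_volume:
        return (2 * input_bytes, 2 * input_bytes)
    return (input_bytes, input_bytes * 3 // 2)
```

For 10 GB of input, this gives roughly 20 GB on a shared volume, or between 10 GB and 15 GB on a separate temporary volume.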


EXAMPLE

Suppose you have made several rwfilter runs to find interesting traffic:

  rwfilter --start-date=2008/02/04 ... --pass=data1.rwf
  rwfilter --start-date=2008/02/04 ... --pass=data2.rwf
  rwfilter --start-date=2008/02/04 ... --pass=data3.rwf
  rwfilter --start-date=2008/02/04 ... --pass=data4.rwf

You now want to merge that traffic into a single output file, but you want to ensure that any records appearing in multiple output files are only counted once. You can use rwdedupe to merge the output:

  rwdedupe data1.rwf data2.rwf data3.rwf data4.rwf --output=data.rwf


ENVIRONMENT

SILK_TMPDIR
When set, rwdedupe writes the temporary files it creates to this directory. SILK_TMPDIR overrides the value of TMPDIR.

TMPDIR
When set and SILK_TMPDIR is not set, rwdedupe writes the temporary files it creates to this directory.

SILK_CONFIG_FILE
This environment variable is used as the value for the --site-config-file switch when that switch is not provided.

SILK_DATA_ROOTDIR
When the --site-config-file switch is not provided and the SILK_CONFIG_FILE environment variable is not set, rwdedupe looks for the site configuration file in $SILK_DATA_ROOTDIR/silk.conf.

SILK_PATH
This environment variable gives the root of the install tree. As part of its search for the SiLK site configuration file, rwdedupe checks for a file named silk.conf in the directories $SILK_PATH/share/silk and $SILK_PATH/share.


SEE ALSO

rwfilter(1)