NAME

pipeline - Examine SiLK Flow, YAF, or IPFIX records as they arrive


SYNOPSIS

There are 4 possible data sources: SiLK, YAF, IPFIX, or a configuration file with all of the details.

There are 4 possible input modes, 3 of which run continuously and are run as a daemon by default: reading from a UDP socket, reading from a TCP socket (both of which require --break-on-recs), and polling a directory for new files. The fourth is a finite list of files to process, which is never run as a daemon.

Allowable combinations: SiLK with directory polling or named files. YAF with UDP or TCP sockets or named files. IPFIX with UDP or TCP sockets, directory polling, or named files.

A data source configuration file contains all necessary details of both the data source and the input method.

Each of the 4 general input modes can be run with snarf or without snarf.

To run pipeline when built with snarf, a snarf destination can be specified with: --snarf-destination=ENDPOINT.

To run pipeline when built without snarf, alert log files must be specified with: --alert-log-file=FILE_PATH --aux-alert-file=FILE_PATH

In the examples below, substitute the above alerting configurations in place of "ALERT CONFIGURATION OPTIONS".

To run pipeline continuously but not as a daemon:

  pipeline --configuration-file=FILE_PATH
        ALERT CONFIGURATION OPTIONS
        { --silk | --yaf | --ipfix }
        { --udp-port=NUMBER | --tcp-port=NUMBER | 
            --incoming-directory=DIR_PATH --error-directory=DIR_PATH
            [--archive-directory=DIR_PATH] [--flat-archive] 
        }
        [--break-on-recs=NUMBER]
        { [--time-is-clock] | [--time-field-name=STRING] | 
          [--time-from-schema] |
          [--time-field-ent=NUMBER --time-field-id=NUMBER] 
        }
        [--polling-interval=NUMBER] [--polling-timeout=NUMBER ]
        [--country-code-file=FILE_PATH]
        [--site-config-file=FILENAME]
        --do-not-daemonize

To run pipeline over a finite list of files:

  pipeline --configuration-file=FILE_PATH
        ALERT CONFIGURATION OPTIONS
        { --silk | --yaf | --ipfix }
        --name-files
        [--break-on-recs=NUMBER]
        { [--time-is-clock] | [--time-field-name=STRING] |
          [--time-from-schema] |
          [--time-field-ent=NUMBER --time-field-id=NUMBER] 
        }
        [--polling-interval=NUMBER] [--polling-timeout=NUMBER ]
        [--country-code-file=FILE_PATH]
        [--site-config-file=FILENAME]

To run pipeline using a configuration file that specifies all data source and data input options (daemonizing can be turned off if needed):

  pipeline --configuration-file=FILE_PATH
        ALERT CONFIGURATION OPTIONS
        --data-source-configuration-file=FILE_PATH
        [--country-code-file=FILE_PATH]
        [--site-config-file=FILENAME]
        { --do-not-daemonize |
          { --log-destination=DESTINATION |
            --log-directory=DIR_PATH [--log-basename=BASENAME] |
            --log-pathname=FILE_PATH 
          }
          [--log-level=LEVEL] [--log-sysfacility=NUMBER]
          [--pidfile=FILE_PATH] 
        }

To run pipeline continuously as a daemon:

  pipeline --configuration-file=FILE_PATH
        ALERT CONFIGURATION OPTIONS
        { --silk | --yaf | --ipfix }
        { --udp-port=NUMBER | --tcp-port=NUMBER |
            --incoming-directory=DIR_PATH --error-directory=DIR_PATH
            [--archive-directory=DIR_PATH] [--flat-archive] 
        }
        [--break-on-recs=NUMBER]
        { [--time-is-clock] | [--time-field-name=STRING] | 
          [--time-from-schema] | 
          [--time-field-ent=NUMBER --time-field-id=NUMBER] 
        }
        [--polling-interval=NUMBER] [--polling-timeout=NUMBER ]
        [--country-code-file=FILE_PATH]
        [--site-config-file=FILENAME]
        { --log-destination=DESTINATION
          | --log-directory=DIR_PATH [--log-basename=BASENAME]
          | --log-pathname=FILE_PATH 
        }
        [--log-level=LEVEL] [--log-sysfacility=NUMBER]
        [--pidfile=FILE_PATH]

Help options:

  pipeline --configuration-file=FILE_PATH --verify-configuration
  pipeline --help
  pipeline --version


DESCRIPTION

The Analysis Pipeline program, pipeline, is designed to be run over three different types of input. The first, as in version 4.x, is files of SiLK Flow records as they are processed by the SiLK packing system. The second type is data coming directly out of YAF (or super_mediator) including deep packet inspection information. The last is any raw IPFIX records.

pipeline requires a configuration file that specifies filters and evaluations. The filter blocks determine which flow records are of interest (similar to SiLK's rwfilter(1) command). The evaluation blocks can compute aggregate information over the flow records (similar to rwuniq(1)) to determine whether the flow records should generate an alert. Information on the syntax of the configuration file is available in the Analysis Pipeline Handbook.

The output that pipeline produces depends on whether support for the snarf alerting library was compiled into the pipeline binary, as described in the next subsections.

Either form of output from pipeline includes country code information. To map the IP addresses to country codes, a SiLK prefix map file, country_codes.pmap, must be available to pipeline. This file can be installed in SiLK's install tree, or its location can be specified with the SILK_COUNTRY_CODES environment variable or the --country-codes-file command line switch.
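
For example, to point pipeline at a prefix map outside of SiLK's install tree, using the switch named above (the path below is illustrative; other required switches are elided):

 pipeline ... --country-codes-file=/usr/local/share/silk/country_codes.pmap ...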

Output Using Snarf

When pipeline is built with support for the snarf alerting library (http://tools.netsa.cert.org/snarf/), the --snarf-destination switch can be used to specify where to send the alerts. The parameter to the switch takes the form tcp://HOST:PORT, which specifies that a snarfd process is running on HOST at PORT. When --snarf-destination is not specified, pipeline uses the value in the SNARF_ALERT_DESTINATION environment variable. If it is not set, pipeline prints the alerts encoded in JSON (JavaScript Object Notation). The outputs go to the log file when running as a daemon, or to the standard output when the --name-files switch is specified.
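
For example, to send alerts to a snarfd process listening on a remote host (the host name and port below are illustrative; other required switches are elided):

 pipeline ... --snarf-destination=tcp://snarfd.example.com:8111 ...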

Legacy Output Not Using Snarf

When snarf support is not built into pipeline, the output of pipeline is a textual file in pipe-delimited (|-delimited) format describing which flow records raised an alert and the type of alert that was raised. The location of the output file must be specified via the --alert-log-file switch. The file is in a format that a properly configured ArcSight Log File Flexconnector can use. The pipeline.sdkfilereader.properties file in the share/analysis-pipeline/ directory can be used to configure the ArcSight Flexconnector to process the file.

pipeline can provide additional information about the alert in a separate file, called the auxiliary alert file. Specify the complete path to this file with the --aux-alert-file switch; this switch is required.

pipeline assumes that both the alert-log-file and the aux-alert-file are under the control of logrotate(8). See the Analysis Pipeline Handbook for details.
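
For example, when pipeline is built without snarf (the paths below are illustrative; other required switches are elided):

 pipeline ... --alert-log-file=/var/log/pipeline/alerts.log
       --aux-alert-file=/var/log/pipeline/aux-alerts.log ...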

Integrating pipeline into the SiLK Packing System

Normally pipeline is run as a daemon during SiLK's collection and packing process. pipeline runs on the flow records after they have been processed by rwflowpack(8), since pipeline may need to use the class, type, and sensor data that rwflowpack assigns to each flow record.

pipeline should get a copy of each incremental file that rwflowpack generates. There are three places where pipeline can be inserted so that it sees every incremental file: the local directory of rwsender, the destination directory (or duplicate-destination directory) of rwreceiver, or the archive directory of rwflowappend.

We describe each of these in turn. If none of these daemons are in use at your site, you must modify how rwflowpack runs, which is also described below.

rwsender

To use pipeline with the rwsender in SiLK 2.2 or later, specify a --local-directory argument to rwsender, and have pipeline use that directory as its incoming-directory, for example:

 rwsender ... --local-directory=/var/silk/pipeline/incoming ...
 pipeline ... --incoming-directory=/var/silk/pipeline/incoming ...

rwreceiver

When pipeline is running on a dedicated machine separate from the machine where rwflowpack is running, one can use a dedicated rwreceiver to receive the incremental files from an rwsender running on the machine where rwflowpack is running. In this case, the incoming-directory for pipeline will be the destination-directory for rwreceiver. For example:

 rwreceiver ... --destination-dir=/var/silk/pipeline/incoming ...
 pipeline ... --incoming-directory=/var/silk/pipeline/incoming ...

When pipeline is running on a machine where an rwreceiver (version 2.2 or newer) is already running, one can specify an additional --duplicate-destination directory to rwreceiver, and have pipeline use that directory as its incoming directory. For example:

 rwreceiver ... --duplicate-dest=/var/silk/pipeline/incoming ...
 pipeline ... --incoming-directory=/var/silk/pipeline/incoming ...

rwflowappend

One way to use pipeline with rwflowappend is to have rwflowappend store incremental files into an archive-directory, and have pipeline process those files. However, since rwflowappend stores the incremental files in subdirectories under the archive-directory, you must specify a --post-command to rwflowappend to move (or copy) the files into another directory where pipeline can process them. For example:

 rwflowappend ... --archive-dir=/var/silk/rwflowappend/archive
       --post-command='mv %s /var/silk/pipeline/incoming' ...
 pipeline ... --incoming-directory=/var/silk/pipeline/incoming ...

Note: Newer versions of rwflowappend support a --flat-archive switch, which places the files into the root of the archive-directory. For this situation, make the archive-directory of rwflowappend the incoming-directory of pipeline:

 rwflowappend ... --archive-dir=/var/silk/pipeline/incoming
 pipeline ... --incoming-directory=/var/silk/pipeline/incoming ...

rwflowpack only

If none of the above daemons are in use at your site because rwflowpack writes files directly into the data repository, you must modify how rwflowpack runs so it uses a temporary directory that rwflowappend monitors, and you can then insert pipeline after rwflowappend has processed the files.

Assuming your current configuration for rwflowpack is:

 rwflowpack --sensor-conf=/var/silk/rwflowpack/sensor.conf
       --log-directory=/var/silk/rwflowpack/log
       --root-directory=/data

You can modify it as follows:

 rwflowpack --sensor-conf=/var/silk/rwflowpack/sensor.conf
       --log-directory=/var/silk/rwflowpack/log
       --output-mode=sending
       --incremental-dir=/var/silk/rwflowpack/incremental
       --sender-dir=/var/silk/rwflowappend/incoming
 rwflowappend --root-directory=/data
       --log-directory=/var/silk/rwflowappend/log
       --incoming-dir=/var/silk/rwflowappend/incoming
       --error-dir=/var/silk/rwflowappend/error
       --archive-dir=/var/silk/rwflowappend/archive
       --post-command='mv %s /var/silk/pipeline/incoming' ...
 pipeline --silk --incoming-directory=/var/silk/pipeline/incoming
       --error-directory=/var/silk/pipeline/error
       --log-directory=/var/silk/pipeline/log
       --configuration-file=/var/silk/pipeline/pipeline.conf

Non-daemon mode

There are two ways to run pipeline in non-daemon mode. The first is to use one of the modes above that runs forever (reading from a socket or polling a directory) but without running it as a daemon; use --do-not-daemonize to keep the process in the foreground.

The other way is to run pipeline over files whose names are specified on the command line. In this mode, pipeline stays in the foreground, processes the files, and exits. None of the files specified on the command line are changed in any way---they are neither moved nor deleted. To run pipeline in this mode, specify the --name-files switch and the names of the files to process.
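
For example, to process two SiLK Flow files named on the command line (the file names below are illustrative; the alerting switches and other required options are elided):

 pipeline ... --silk --name-files in-S0_20230601.00 in-S0_20230601.01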


OPTIONS

Option names may be abbreviated if the abbreviation is unique or is an exact match for an option. A parameter to an option may be specified as --arg=param or --arg param, though the first form is required for options that take optional parameters.

General Configuration

These switches affect general configuration of pipeline. The first two switches are required:

--configuration-file=FILE_PATH

Give the path to the configuration file that specifies the filters that determine which flow records are of interest and the evaluations that signify when an alert is to be raised. This switch is required.

--country-codes-file=FILE_PATH

Use the designated country code prefix mapping file instead of the default.

--site-config-file=FILENAME

Read the SiLK site configuration from the named file FILENAME. When this switch is not provided, the location specified by the SILK_CONFIG_FILE environment variable is used if that variable is not empty. The value of SILK_CONFIG_FILE should include the name of the file. Otherwise, the application looks for a file named silk.conf in the following directories: the directories $SILK_PATH/share/silk/ and $SILK_PATH/share/; and the share/silk/ and share/ directories parallel to the application's directory.

--dns-public-suffix-file=FILENAME

pipeline comes with a public suffix file provided by Mozilla at https://publicsuffix.org/list/public_suffix_list.dat. To provide pipeline with a different list, use this option to specify the file. The file must be formatted the same way as Mozilla's file. This switch is optional.

--stats-log-interval=NUMBER

The number of minutes between log messages in which pipeline reports statistics on records processed and memory usage. Setting this value to 0 turns the feature off. This switch is optional; the default is 5 minutes.

Data Source Configuration Options

pipeline needs to know what general type of data it will be receiving: SiLK flows, YAF data, or raw IPFIX. If there are multiple data sources, a data source configuration file is required. When using a daemon configuration file, its data source configuration file variable is required.

If there is a single data source, the data source type can be specified on the command line. Depending on the type of data, there are different available options for receiving data.

--silk

The records are SiLK flows. The data input options are the same as in past versions:

  --incoming-directory=DIR_PATH
        Poll a directory indefinitely for new SiLK Flow files.

  --name-files
        Process the files listed on the command line as the last group of arguments.

--yaf

The records come directly from a YAF sensor (or from an instance of super_mediator). The data input options are:

  --udp-port=NUMBER --break-on-recs=NUMBER
        Listen on a UDP socket for YAF data; --break-on-recs gives the number of records to process before breaking and running evaluations.

  --tcp-port=NUMBER --break-on-recs=NUMBER
        Listen on a TCP socket for YAF data; --break-on-recs gives the number of records to process before breaking and running evaluations.

  --name-files
        Process the YAF data files listed on the command line.

--ipfix

The records are raw IPFIX records that do not come directly from YAF. The data input options are:

  --udp-port=NUMBER --break-on-recs=NUMBER
        Listen on a UDP socket for IPFIX data; --break-on-recs gives the number of records to process before breaking and running evaluations.

  --tcp-port=NUMBER --break-on-recs=NUMBER
        Listen on a TCP socket for IPFIX data; --break-on-recs gives the number of records to process before breaking and running evaluations.

  --name-files
        Process the IPFIX data files listed on the command line.

  --incoming-directory=DIR_PATH
        Poll a directory indefinitely for new IPFIX files.
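
For example, to poll a directory for files of raw IPFIX records (the directory paths below are illustrative; the alerting switches and other required options are elided; --time-from-schema is shown as one possible timing source):

 pipeline ... --ipfix --time-from-schema
       --incoming-directory=/var/pipeline/incoming
       --error-directory=/var/pipeline/error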

--data-source-configuration-file=FILENAME

The data source and input options are detailed in a configuration file. The syntax for the file is described in the Analysis Pipeline Handbook.

Timing Source Configuration Options

If the primary (or only) data source is SiLK, these options are not used; for SiLK data, the flow end time is always used as the timing source.

Otherwise, one of these options is required to provide a timing source.

--time-is-clock

Use the system clock time as the timing source.

--time-field-name=STRING

Use the provided field name as the timing source.

--time-field-ent=NUMBER and --time-field-id=NUMBER

These must be used together, as it takes an enterprise ID and an element ID to define an information element. This element will be used as the timing source.

--time-from-schema

Use the timing source specified by the schema. If no timing source is specified by the schema(s) used, pipeline will report an error.

--break-on-recs=NUMBER

Versions 4.x only worked on SiLK files, which provided an easy way to know when to stop processing/filtering records and run evaluations. When accepting a stream of records from a socket, there is no break, so pipeline needs to know how many records to process/filter before running evaluations. Use this option to tell pipeline how many records to process. This option is required for socket connections.
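
For example, to read YAF records from a TCP socket and run evaluations after every 10000 records (the port number and record count below are illustrative; the alerting switches and other required options are elided; --time-from-schema is shown as one possible timing source):

 pipeline ... --yaf --tcp-port=18000 --break-on-recs=10000
       --time-from-schema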

Alert Destination when Snarf is Available

When pipeline is built with support for snarf (http://tools.netsa.cert.org/snarf/), the following switch is available. Its use is optional.

--snarf-destination=ENDPOINT

Specify where pipeline is to send alerts. The ENDPOINT has the form tcp://HOST:PORT, which specifies that a snarfd process is running on HOST at PORT. When this switch is not specified, pipeline uses the value in the SNARF_ALERT_DESTINATION environment variable. If that variable is not set, pipeline prints the alerts locally, either to the log file (when running as a daemon), or to the standard output.

Alert Destination when Snarf is Unavailable

When pipeline is built without support for snarf, the following switches are available, and both are required.

--alert-log-file=FILE_PATH

Specify the path to the file where pipeline will write the alert records. The full path to the log file must be specified. pipeline assumes that this file will be under control of the logrotate(8) command.

--aux-alert-file=FILE_PATH

Have pipeline provide additional information about an alert to FILE_PATH. When a record causes an alert, pipeline writes the record in textual format to the alert-log-file. Often there is additional information associated with an alert that cannot be captured in a single record; this is especially true for statistic-type alerts. The aux-alert-file is a location for pipeline to write that additional information. The FILE_PATH must be an absolute path, and pipeline assumes that this file will be under control of the logrotate(8) command.

Daemon Mode

The following switches are used when pipeline is run as a daemon. They may not be mixed with the switches related to Processing Existing Files described below. When polling a directory, the --incoming-directory and --error-directory switches are required; in every case, at least one switch related to logging is required.

--incoming-directory=DIR_PATH

Watch this directory for new SiLK Flow files that are to be processed by pipeline. pipeline ignores any files in this directory whose names begin with a dot (.). In addition, new files will only be considered when their size is constant for one polling-interval after they are first noticed.

--polling-interval=NUMBER

Sets the interval in seconds at which pipeline checks for new files when polling a directory with --incoming-directory. The default polling interval is 15 seconds.

--polling-timeout=NUMBER

Sets the amount of time in seconds that pipeline waits for a new file to appear when polling a directory with --incoming-directory.

--udp-port=NUMBER

Listen on a UDP port for YAF or IPFIX records, not SiLK records. pipeline will reestablish this connection if the sender closes the socket, unless --do-not-reestablish is used.

--tcp-port=NUMBER

Listen on a TCP port for YAF or IPFIX records, not SiLK records. pipeline will reestablish this connection if the sender closes the socket, unless --do-not-reestablish is used.

--error-directory=DIR_PATH

Store in this directory SiLK files that were NOT successfully processed by pipeline.

One of the following mutually-exclusive logging-related switches is required:

--log-destination=DESTINATION

Specify the destination where logging messages are written. When DESTINATION begins with a slash /, it is treated as a file system path and all log messages are written to that file; there is no log rotation. When DESTINATION does not begin with /, it must be one of the following strings:

none

Messages are not written anywhere.

stdout

Messages are written to the standard output.

stderr

Messages are written to the standard error.

syslog

Messages are written using the syslog(3) facility.

both

Messages are written to the syslog facility and to the standard error (this option is not available on all platforms).

--log-directory=DIR_PATH

Use DIR_PATH as the directory where the log files are written. DIR_PATH must be a complete directory path. The log files have the form

  DIR_PATH/LOG_BASENAME-YYYYMMDD.log

where YYYYMMDD is the current date and LOG_BASENAME is the application name or the value passed to the --log-basename switch when provided. The log files will be rotated: at midnight local time a new log will be opened and the previous day's log file will be compressed using gzip(1). (Old log files are not removed by pipeline; the administrator should use another tool to remove them.) When this switch is provided, a process-ID file (PID) will also be written in this directory unless the --pidfile switch is provided.
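
For example, with --log-directory=/var/log/pipeline and --log-basename=pipeline (illustrative values), the messages for June 1, 2023 would be written to:

  /var/log/pipeline/pipeline-20230601.log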

--log-pathname=FILE_PATH

Use FILE_PATH as the complete path to the log file. The log file will not be rotated.

The following switches are optional:

--archive-directory=DIR_PATH

Move incoming SiLK Flow files that pipeline processes successfully into the directory DIR_PATH. DIR_PATH must be a complete directory path. When this switch is not provided, the SiLK Flow files are deleted once they have been successfully processed. When the --flat-archive switch is also provided, incoming files are moved into the top of DIR_PATH; when --flat-archive is not given, each file is moved to a subdirectory based on the current local time: DIR_PATH/YEAR/MONTH/DAY/HOUR/. Removing files from the archive-directory is not the job of pipeline; the system administrator should implement a separate process to clean this directory.
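
For example, with --archive-directory=/var/silk/pipeline/archive (an illustrative path) and without --flat-archive, a file processed at 2:00 PM local time on June 1, 2023 would be moved into:

  /var/silk/pipeline/archive/2023/06/01/14/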

--flat-archive

When archiving incoming SiLK Flow files via the --archive-directory switch, move the files into the top of the archive-directory, not into subdirectories of the archive-directory. This switch has no effect if --archive-directory is not also specified. This switch can be used to allow another process to watch for new files appearing in the archive-directory.

--log-level=LEVEL

Set the severity of messages that will be logged. The levels from most severe to least are: emerg, alert, crit, err, warning, notice, info, debug. The default is info.

--log-sysfacility=NUMBER

Set the facility that syslog(3) uses for logging messages. This switch takes a number as an argument. The default is a value that corresponds to LOG_USER on the system where pipeline is running. This switch produces an error unless --log-destination=syslog is specified.

--log-basename=LOG_BASENAME

Use LOG_BASENAME in place of the application name for the files in the log directory. See the description of the --log-directory switch.

--pidfile=FILE_PATH

Set the complete path to the file in which pipeline writes its process ID (PID) when it is running as a daemon. No PID file is written when --do-not-daemonize is given. When this switch is not present, no PID file is written unless the --log-directory switch is specified, in which case the PID is written to LOGPATH/pipeline.pid.

--do-not-daemonize

Force pipeline to stay in the foreground---it does not become a daemon. Useful for debugging.

Process Existing Files

--name-files

Cause pipeline to run its analysis over a specific set of files named on the command line. Once pipeline has processed those files, it exits. This switch cannot be mixed with the Daemon Mode and Logging and Daemon Configuration switches described above. When using files named on the command line, pipeline will not move or delete the files.

Help Options

--verify-configuration

Verify that the syntax of the configuration file is correct and then exit pipeline. If the file is incorrect or if it does not define any evaluations, an error message is printed and pipeline exits abnormally. If the file is correct, pipeline simply exits with status 0.

--print-schema-info

Print the information elements available based on the schemas that arrive. When using any data source other than SiLK flows, this feature requires data to arrive such that templates/schemas can be read and information elements made available. This option will not verify your configuration file.

--show-schema-and-verify

Print the information elements available based on the schemas that arrive, and verify the syntax of the configuration file. When using any data source other than SiLK flows, this feature requires data to arrive such that templates/schemas can be read and information elements made available.

--help

Print the available options and exit.

--version

Print the version number and information about how the SiLK library used by pipeline was configured, then exit the application.


ENVIRONMENT

SILK_CONFIG_FILE

This environment variable is used as the value for the --site-config-file when that switch is not provided.

SILK_COUNTRY_CODES

This environment variable allows the user to specify the country code mapping file that pipeline will use. The value may be a complete path or a file relative to the SILK_PATH. If the variable is not specified, the code looks for a file named country_codes.pmap in the location specified by SILK_PATH.
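
For example, in a Bourne-compatible shell (the path below is illustrative):

  export SILK_COUNTRY_CODES=/usr/local/share/silk/country_codes.pmap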

SILK_PATH

This environment variable gives the root of the install tree. As part of its search for the SiLK site configuration file, pipeline checks for a file named silk.conf in the directories $SILK_PATH/share/silk and $SILK_PATH/share. To find the country code prefix map file, pipeline checks those same directories for a file named country_codes.pmap.

SNARF_ALERT_DESTINATION

When pipeline is built with snarf support (http://tools.netsa.cert.org/snarf/), this environment variable specifies the location to send the alerts. The --snarf-destination switch has precedence over this variable.


SEE ALSO

silk(7), rwflowappend(8), rwflowpack(8), rwreceiver(8), rwsender(8), rwfilter(1), rwuniq(1), syslog(3), logrotate(8), http://tools.netsa.cert.org/snarf, Analysis Pipeline Handbook, The SiLK Installation Handbook