Monitoring for Large-Scale Networks

Analysis Pipeline

Documentation

When using multiple data sources (or when invoking pipeline as a daemon using "service pipeline start"), a data source configuration file is required to instantiate them. This file is specified on the command line with the --data-source-configuration-file switch. When using a data source config file, no other data source/input related switches can be used on the command line.

The format of this configuration file is similar to the one used for filters, evaluations, etc. In the examples below, capitalized words are keywords, lowercase words, such as data source names, are up to the user.

When declaring data sources in the configuration file, the first one must be declared to be the primary data source. If there are multiple data sources the rest are declared as secondary data sources. The primary data source must be configured with a timing source. The secondary sources cannot be configured with timing sources.

The only difference in record processing between primary and secondary data sources is that records from the primary data source are responsible for advancing time. The reason they both cannot advance time is that there is no way to guarantee that the timing elements are synced, so Pipeline can only handle one timing source for the entire application. The internal Pipeline time will stay the same during processing of records from a secondary data source.

If the timing source in the primary data source is clock time, it does not rely on record contents, so time will advance regardless of which source is being used because Pipeline will use the local system time.

The same rules apply for combinations of configuration options in this data source configuration file as the command line options.

Each timing source has beginning and end statements along with a unique name

    PRIMARY DATA SOURCE nameOfSource
    END DATA SOURCE

    SECONDARY DATA SOURCE nameOfSource
    END DATA SOURCE

Each of the following keywords for configuration options goes inside the data source block:

Data Source Specification

SILK BUILDER - This data source has SiLK records (same as --silk)
YAF BUILDER - This data source has YAF records (same as --yaf)
IPFIX BUILDER - This data source has raw IPFIX records (same as --ipfix)

Data Input Specification

NAME FILES - Read filenames to process from the command line. If there are multiple data sources, only one can use NAME FILES. Available for SiLK, YAF, and IPFIX builders.
INCOMING DIRECTORY directoryToPoll - Poll the specified directory for files. This option requires an error directory for files improperly processed. Specify this with ERROR DIRECTORY errorDirectory. Available for SiLK and IPFIX builders.
TCP PORT portNumber - Read records from a TCP socket on the specified port. Available for YAF and IPFIX builders.
UDP PORT portNumber - Read records from a UDP socket on the specified port. Available for YAF and IPFIX builders.

Timing Options Specifiation. These are only to be used in a PRIMARY DATA SOURCE. None of these can be used for a SiLK data source.

TIME IS CLOCK - Pipeline will use the local system clock for its internal timing.
TIMING FIELD NAME timeFieldName - Use timeFieldName values for internal timing. This element must be in at least one schema used by this data source.
TIMING FIELD ENT number - Specify the information element enterprise ID of the element to be used as the timing source. Requires a TIMING FIELD ID as well.
TIMING FIELD ID number - Specify the information element ID of the element to be used as the timing source. Requires a TIMING FIELD ENT as well.

To specify the number of records to read before breaking to run evaluations, statistics, etc, use BREAK ON RECS number. This cannot be used with a SiLK builder. If used with YAF or IPFIX files, Pipeline will run evaluations after reading the specified number of records. It will also run evaluations at the end each file, regardless of the number of records read since the previous break.

The following are options available as additional ways to configure the polling of a directory (INCOMING DIRECTORY) for new files.

ARCHIVE DIRECTORY directoryPath - The directory to place successfully procesed files instead of deleting them.
POLLING INTERVAL number - The number of seconds that a file has to be of constant size before it is read. Default value: 15 seconds.
POLLING TIMEOUT number - The number of seconds to wait for new files from the directory to be polled before returning that there are no files and moving onto the next data source. Default value: 1 second.

This example creates a single data source that polls a directory for SiLK flow files and does not use an archive directory:

PRIMARY DATA SOURCE silkPolling
    SILK BUILDER
    INCOMING DIRECTORY "/data/pipelineIncoming"
    ERROR DIRECTORY "/data/pipelineError"
END DATA SOURCE

This tells pipeline to read a list of SiLK files from the command line:

PRIMARY DATA SOURCE silkNameFiles
SILK BUILDER
NAME FILES
END DATA SOURCE

This reads YAF data from a TCP socket and uses the flowEndMilliseconds as the timing fields, and breaks to run evaluations after 10,000 records:

PRIMARY DATA SOURCE yafTCP
    YAF BUILDER
    TCP PORT 180000
    TIMING FIELD NAME flowEndMilliseconds
    BREAK ON RECS 10000
END DATA SOURCE

This uses three data sources, reading YAF data from a UDP socket, and IPFIX data from a TCP port, and SiLK data by polling a directory. It uses the internal system clock time as the timing source:

PRIMARY DATA SOURCE yafTCP
    YAF BUILDER
    UDP PORT 180000
    TIME IS CLOCK
    BREAK ON RECS 10000
END DATA SOURCE

SECONDARY DATA SOURCE ipfixPoll
    IPFIX BUILDER
    TCP PORT 19500
    BREAK ON RECS 5000
END DATA SOURCE

SECONDARY DATA SOURCE silkPolling
    SILK BUILDER
    INCOMING DIRECTORY "/data/pipelineIncoming"
    ERROR DIRECTORY "/data/pipelineError"
END DATA SOURCE