When using multiple data sources (or when invoking Pipeline as a daemon using "service pipeline start"), a data source configuration file is required to instantiate them. This file is specified on the command line with the --data-source-configuration-file switch. When a data source configuration file is used, no other data source or input related switches can be used on the command line.
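For example, an invocation might point at the file like this (the path shown is a placeholder, not a prescribed location):

    pipeline --data-source-configuration-file=/etc/pipeline/datasources.conf

Any other switches Pipeline requires for filters, evaluations, etc. are unaffected; only the data source and input related switches are replaced by the file.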

The format of this configuration file is similar to the one used for filters, evaluations, etc. In the examples below, capitalized words are keywords; lowercase words, such as data source names, are chosen by the user.

When declaring data sources in the configuration file, the first one must be declared as the primary data source. If there are multiple data sources, the rest are declared as secondary data sources. The primary data source must be configured with a timing source; secondary sources cannot be configured with timing sources.

The only difference in record processing between primary and secondary data sources is that only records from the primary data source advance time. Secondary sources cannot advance time because there is no way to guarantee that the timing elements of different sources are synchronized, so Pipeline can handle only one timing source for the entire application. The internal Pipeline time stays the same while records from a secondary data source are processed.

If the timing source of the primary data source is clock time, timing does not depend on record contents, so time will advance regardless of which source is being read because Pipeline uses the local system time.

The combinations of configuration options permitted in this data source configuration file follow the same rules as the corresponding command line options.

Each data source block has beginning and end statements along with a unique name:

    PRIMARY DATA SOURCE nameOfSource
    END DATA SOURCE

    SECONDARY DATA SOURCE nameOfSource
    END DATA SOURCE

Each of the following keywords for configuration options goes inside the data source block:

Data Source Specification

Data Input Specification

Timing Options Specification. These are only to be used in a PRIMARY DATA SOURCE. None of these can be used for a SiLK data source.

To specify the number of records to read before breaking to run evaluations, statistics, etc., use BREAK ON RECS number. This cannot be used with a SiLK builder. When used with YAF or IPFIX files, Pipeline runs evaluations after reading the specified number of records. It also runs evaluations at the end of each file, regardless of the number of records read since the previous break.
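As a sketch, a file-based IPFIX source combining the directory keywords from the examples in this section with a break setting might look like the following (the source name and paths are illustrative only):

    PRIMARY DATA SOURCE ipfixFiles
        IPFIX BUILDER
        INCOMING DIRECTORY "/data/ipfixIncoming"
        ERROR DIRECTORY "/data/ipfixError"
        BREAK ON RECS 5000
    END DATA SOURCE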

The following are options available as additional ways to configure the polling of a directory (INCOMING DIRECTORY) for new files.
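For instance, a polling source might add an archive directory so that processed files are retained rather than deleted. This sketch assumes the ARCHIVE DIRECTORY keyword mirrors the corresponding command line option; the source name and paths are illustrative:

    PRIMARY DATA SOURCE silkArchiving
        SILK BUILDER
        INCOMING DIRECTORY "/data/pipelineIncoming"
        ARCHIVE DIRECTORY "/data/pipelineArchive"
        ERROR DIRECTORY "/data/pipelineError"
    END DATA SOURCE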

This example creates a single data source that polls a directory for SiLK flow files and does not use an archive directory:

PRIMARY DATA SOURCE silkPolling
    SILK BUILDER
    INCOMING DIRECTORY "/data/pipelineIncoming"
    ERROR DIRECTORY "/data/pipelineError"
END DATA SOURCE

This tells Pipeline to read a list of SiLK files from the command line:

PRIMARY DATA SOURCE silkNameFiles
    SILK BUILDER
    NAME FILES
END DATA SOURCE

This reads YAF data from a TCP socket, uses flowEndMilliseconds as the timing field, and breaks to run evaluations after 10,000 records:

PRIMARY DATA SOURCE yafTCP
    YAF BUILDER
    TCP PORT 18000
    TIMING FIELD NAME flowEndMilliseconds
    BREAK ON RECS 10000
END DATA SOURCE

This uses three data sources: YAF data read from a UDP socket, IPFIX data read from a TCP port, and SiLK data read by polling a directory. The primary source uses the internal system clock as the timing source:

PRIMARY DATA SOURCE yafUDP
    YAF BUILDER
    UDP PORT 18000
    TIME IS CLOCK
    BREAK ON RECS 10000
END DATA SOURCE

SECONDARY DATA SOURCE ipfixPoll
    IPFIX BUILDER
    TCP PORT 19500
    BREAK ON RECS 5000
END DATA SOURCE

SECONDARY DATA SOURCE silkPolling
    SILK BUILDER
    INCOMING DIRECTORY "/data/pipelineIncoming"
    ERROR DIRECTORY "/data/pipelineError"
END DATA SOURCE