Pipeline has five related libraries. Three are required, two are recommended. Each must be installed before installing Pipeline.
As of version 5.11.4, fixbuf 2.0 or fixbuf 3.0 may be used to build pipeline. Ensure that SiLK, schemaTools, and pipeline are all built with the same fixbuf library.
Using IPFIX produced by YAF 3.0 has received minimal testing, is considered experimental, and is not recommended.
Download the source code from the download page. The file will be named analysis-pipeline-5.11.tar.gz. Use the following commands to unpack the source code and go into the source directory:
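For example, assuming the file name above (substitute the version you downloaded):

$ tar -xzf analysis-pipeline-5.11.tar.gz
$ cd analysis-pipeline-5.11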
The configure script in the analysis-pipeline-5.11 directory is used to prepare the pipeline source code for your particular environment. The rest of this section explains the configure command option by option.
The first thing you must decide is the parent directory of the pipeline installation. Specify that directory in the --prefix switch.
If you do not specify --prefix, the /usr/local directory is used.
./configure --prefix=/usr
If SiLK is configured to handle IPv6, schemas will be built for both IPv4 and IPv6. If SiLK does not have IPv6 support, only an IPv4 schema will be used. If, after installation, SiLK is recompiled with IPv6 or upgraded to a version with IPv6, Pipeline will need to be reconfigured, built, and installed so that it includes the IPv6 schema.
If Pipeline uses the same prefix (or the same lack of prefix) that was used to install SiLK, schemaTools, fixbuf, and snarf (if installed), nothing else needs to be specified for configure to find the libraries. If any are different, see below for details on specifying the location of each for configure:
To configure the Analysis Pipeline source code, run the configure script with
the switches you determined above:
$ ./configure --prefix=/usr --with-libsnarf=/usr/lib/pkgconfig \
      --with-silk-config=/usr/bin/silk_config
Once pipeline has been configured, you can build it:
$ make
To install pipeline, run:
$ make install
Depending on where you are installing the application, you may need to become
the root user first. To ensure that pipeline is properly installed,
try to invoke it:
$ pipeline --version
If your installation of SiLK is not in /usr, you may get an error similar to
the following when you run pipeline:
pipeline: error while loading shared libraries: libsilk-thrd.so.2:
cannot open shared object file: No such file or directory
If this occurs, you need to set or modify the LD_LIBRARY_PATH environment
variable (or your operating system’s equivalent) to include the directory
containing the SiLK libraries. For example, if SiLK is installed in /usr/local,
you can use the following to run pipeline:
$ export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
$ pipeline --version
Starting with version 5, pipeline uses schemas from the schemaTools library to describe the data. Earlier versions of pipeline had a priori knowledge of the data that was coming (SiLK IPv4 records). Now it comes with schema builders to handle three general types of data. It can handle both IPv4 and IPv6 SiLK flows. There is a schema builder that accepts YAF data (notably including deep packet inspection information) and dynamically generates schemas based on the data that arrives. There is also a builder that generates schemas from raw IPFIX records as they arrive. Based on command line options, pipeline knows which schema builder to request schemas from.
IPFIX uses information elements (IEs). Each IE has an enterprise ID, and an element ID. Standard, public elements have an enterprise ID equal to 0. These information elements are how Pipeline uniquely identifies fields. The standard IPFIX elements with their names, IDs, and descriptions can be found at http://www.iana.org/assignments/ipfix/ipfix.xhtml.
Pipeline can handle multiple types of data, and can handle multiple data sources at the same time. If using a single data source, the details can be specified just using command line switches. If using multiple data sources (or just one without using the command line), use the --data-source switch to provide a data source configuration file to pipeline. Details for the configuration file are here.
For each data source to be configured, there are three general aspects that need to be specified: the record type (SiLK, YAF, or IPFIX), the data location (socket, directory polling, or file list), and the timing options. These are discussed in the sections below.
If, regardless of the fields used, you wish to restrict the processing to one particular record type, you can specify a schema to use for the filter/eval/stat by adding the schema name, or schema number after the filter/eval/stat name. See sections filters and evals and stats for more information.
The --silk command line switch tells pipeline to use the SiLK schema builder. Getting SiLK data into pipeline is the same as in earlier versions. The first way to get SiLK data into pipeline is to have it poll a directory for new flow files. The --incoming-directory switch specifies the directory where flows will appear. These files will be deleted after processing, unless the --archive-directory is specified, in which case the files will be stored in the given directory. This input method will cause pipeline to be run as a daemon, unless --do-not-daemonize is specified. The other option is to specify the names of the flow files on the command line. Use the --name-files switch to tell pipeline to look for the list of files to process on the command line at the end of the switches. These files are NOT deleted after processing. The files should be listed in order oldest to newest.
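As a hedged illustration, a directory-polling invocation might look like the following; the directory paths and configuration file location are placeholders for this sketch, and the --configuration-file switch is described later in this document:

$ pipeline --silk --configuration-file=/etc/pipeline/config/pipeline.conf \
      --incoming-directory=/var/pipeline/incoming \
      --archive-directory=/var/pipeline/archive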
There are built-in schemas for SiLK records using both IPv4 and IPv6. If SiLK is configured to handle IPv6, a schema will be built for each. If IPv6 support is not enabled, only an IPv4 schema will be used. If, after installation, SiLK is recompiled with IPv6 or upgraded to a version with IPv6, Pipeline will need to be reconfigured, built, and installed so that it includes the IPv6 schema.
As a result, pipeline will be able to handle records of either type, including SiLK files that contain both types of records. The two schemas are identical up to the point of dealing with IP addresses. As a result, if an evaluation is calculating "SUM BYTES", it will be able to run on both record types. If something uses a particular type of IP address (v4 or v6), that filter/evaluation/statistic will only process those types of records.
As a sole or PRIMARY data source, a SiLK data source has to use the flow end time as in versions 4.x, so it cannot have a custom timing source (--time-is-clock or a different field) and cannot use --break-on-recs.
If a SiLK data source is a secondary data source, and the primary data source uses --time-is-clock, the SiLK records will be processed using --time-is-clock, just as any other secondary data source would.
The --yaf switch tells pipeline to use the YAF schema builder. YAF data can be received in a list of files on the command line using --name-files, or from a tcp or udp socket. A timing source must be specified. If a socket is used, --break-on-recs must be used as well.
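For illustration only, processing a fixed list of YAF files with a record-based timing source might look like the sketch below; the file names are placeholders, and flowEndMilliseconds is the timing field suggested later in this chapter:

$ pipeline --yaf --configuration-file=pipeline.conf \
      --time-field-name=flowEndMilliseconds \
      --name-files yaf-file1.yaf yaf-file2.yaf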
Even though YAF exports its information using IPFIX, it has its own switch and schema builder because the deep packet inspection information is nested in lists that are dynamically filled in with templates at runtime. The YAF schema builder knows what elements to expect in what template, and can export those using schemas to pipeline. The YAF schema builder also knows what to do with each template that arrives based on the template ID for efficiency reasons.
It's likely that the best way to get records to Pipeline is to add an exporter to the instance of Super Mediator that is connected to the main YAF instance. This will ensure that Pipeline will not interfere with the rest of the collection architecture.
The --ipfix switch tells pipeline to use the IPFIX schema builder. IPFIX data can be received using a list of files on the command line, a UDP or TCP socket connection, or by polling a directory. A timing source must be specified. If a socket is used, --break-on-recs must be used as well.
Independent of the data source (though there are restrictions), there are four ways to provide data to Pipeline. As with previous versions, Pipeline can poll a directory for incoming files (SiLK, YAF, and IPFIX). It can read a finite list of files from the command line (SiLK, YAF, and IPFIX). New to version 5 is the ability to read data records from UDP and TCP sockets (YAF and IPFIX). There are command line options to configure data input options. As with data sources, only one data input can be defined on the command line. With multiple data sources forcing multiple data inputs, they all must be defined in a data source config file.
Polling a directory and reading from either type of socket will cause Pipeline to run as a daemon by default unless --do-not-daemonize is specified on the command line. A list of files on the command line will never be daemonized.
Directory polling is available for SiLK, YAF and IPFIX data sources.
List of files is available for SiLK, YAF, and IPFIX data sources. --name-files indicates that the names of the files to be processed are listed on the command line at the end of the rest of the switches.
Reading from a socket is available for YAF and IPFIX data sources. Pipeline will reestablish a socket connection if it is the only active data source. If there are two sockets used (data source config file required), the first one that goes down will never be restarted, but the second one will listen again after it goes down. The --do-not-reestablish switch will prevent any connection from being reestablished, causing Pipeline to stop.
All socket connections require --break-on-recs to be used in conjunction with them.
Version 4.x of pipeline relied on flow end time values from SiLK records to advance pipeline's internal timing. With the expansion of data types and dynamic schema generation, there is no way to ensure that the records contain a sufficient timing source. Another aspect taken for granted in earlier versions that only used SiLK flow files is knowing when to stop filtering records and run the evaluations and statistics. In the 4.x versions, the end of the flow file provided this break. The options below are used to ensure that pipeline has an adequate timing source and that it knows when to run the evaluations and statistics. Only one timing source can be provided.
If you are using a SiLK data source, timing will be done the same way it was done in previous versions. As a result, none of the options in this section are allowed to be used.
--time-is-clock tells Pipeline to use the internal system clock as the timing source rather than values from the records like earlier versions. If there are analytic specific time windows used, this is not a good option for testing using --name-files, as processing will happen faster than if running in live streaming mode. For example, say it takes 1 second to process a file with 5 minutes worth of network traffic. If there are 10 files processed with --time-is-clock, pipeline's internal time will advance 10 seconds versus 50 minutes, which could throw off time window expectations.
--time-field-name=STRING or --time-field-ent=NUMBER and --time-field-id=NUMBER tell pipeline to use a field in a flow record as the timing source, such as YAF's flowEndMilliseconds. The field can be specified by name, or by the {enterprise id, element number} tuple. The information element must be of type DATETIME_SECONDS, DATETIME_MILLISECONDS, DATETIME_MICROSECONDS, or DATETIME_NANOSECONDS; UNSIGNED_64 if the units field is MILLISECONDS, MICROSECONDS, or NANOSECONDS; or UNSIGNED_32 if the units field is SECONDS.
The final option for a timing source is for it to come from the schema. A schema builder can specify that a particular element can be used as a timing source. --time-from-schema tells pipeline to get the timing element from the schema. This is more for future use as none of the schema builders that come in this release specify a timing source to be used in this manner.
If you are using multiple data sources, only the PRIMARY DATA SOURCE needs to have a timing source. If a specific information element is specified for the timing source, Pipeline's internal time will only advance while processing records from the primary data source. If --time-is-clock is specified in the primary data source, this will be used during processing of all data sources.
--break-on-recs=NUMBER tells pipeline how many records to process before breaking to run the evaluations. This feature is not permitted when using a SiLK data source. It is required when the data source uses a socket connection. It is optional when reading YAF or IPFIX files. If used in conjunction with non-SiLK files, pipeline will break when the record threshold is hit, and also at the end of the file.
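As a sketch (the file names are placeholders), an IPFIX file-list invocation that breaks every 10000 records might look like:

$ pipeline --ipfix --configuration-file=pipeline.conf \
      --time-field-name=flowEndMilliseconds \
      --break-on-recs=10000 \
      --name-files capture1.ipfix capture2.ipfix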
When using multiple data sources (or when invoking pipeline as a daemon using "service pipeline start"), a data source configuration file is required to instantiate them. This file is specified on the command line with the --data-source-configuration-file switch. This switch takes the place of all other switches discussed in this section.
Information on the contents of the data source config files can be found here.
There are other setup configurations that are required on the command line.
There are two main sections to alerts: The flow record that generated the alert, and the data metrics depending on the evaluation or statistic. For SiLK data sources, the entire flow record will be included in the alert because there is only 1 hierarchical level to the record. IPFIX data sources can have lists of elements or sub templates in them. Only the top level will be included in the alert, the list contents will not. YAF records that have DPI information will also only have their top level included in the alert. There is no human readable and machine parsable way to include all of the different levels in a single line of an alert. This applies regardless of whether snarf or json-c is installed.
When the Analysis Pipeline is built with support for libsnarf, the SNARF_ALERT_DESTINATION environment variable is set to tell pipeline the address where a snarfd process is listening for alerts. The environment variable takes the form tcp://HOST:PORT which specifies that the snarfd process is listening on HOST at PORT.
Instead of specifying the SNARF_ALERT_DESTINATION environment variable, you may specify the location using the --snarf-destination switch.
When neither SNARF_ALERT_DESTINATION nor --snarf-destination is specified, pipeline prints the alerts encoded using JSON (JavaScript Object Notation).
If the Analysis Pipeline was built without libsnarf support, the alerts generated by pipeline are written to local files. The location of the alert files must be specified using the --alert-log-file and --aux-alert-file switches on the pipeline command line. Assuming the directory layout described above, one would add the following switches to the command line specified above:
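Using the log directory from the logrotate example below, the added switches might look like the following; the aux alert file name here is illustrative:

  --alert-log-file=/var/pipeline/log/alert.log \
  --aux-alert-file=/var/pipeline/log/aux-alert.log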
If the Analysis Pipeline was built with the json-c library available, it is possible to have the alert log and aux alert log output the same data but formatted as JSON using the --json command line argument.
Alerted information is split between the two files in the following way:
The main reason for this separation of alert files is that if pipeline is being used for watchlisting, the main alert file will contain the records that hit the watchlist. This allows for easier processing of alerts that only contain records, versus records combined with data metric information.
You should configure the logrotate program to rotate both alert files
daily. Unlike the other log files that pipeline creates, this file is
not rotated automatically by the pipeline daemon. To configure
logrotate under Linux, create a new file named pipeline in
/etc/logrotate.d, and use the following as its contents:
/var/pipeline/log/alert.log {
missingok
compress
notifempty
nomail
rotate 1
daily
}
If not using JSON output formatting, the /var/pipeline/log/alert.log file contains pipe-delimited (|-delimited) text. This text can be read by a security information management (SIM) system such as ArcSight. The Analysis Pipeline includes the file pipeline.sdkfilereader.properties that can be used as a starting point to create a new ArcSight Log File FlexConnector that will monitor that alert.log file.
To use ArcSight, customize the pipeline.sdkfilereader.properties file and place a copy of the file (with the same filename) in the agent configuration directory on the machine running the ArcSight connector, CONNECTOR_HOME/current/user/agent/flexagent. If necessary, contact your ArcSight representative for instructions on how to get the Connector installation wizard. When prompted for the type of SmartConnector to install, select the entry for “ArcSight FlexConnector File”.
Pipeline logs periodic usage statistics. There are two parts to the usage updates. The first part contains the number of seconds since the last update, the number of records processed since the last update, and the number of files read since the last update. If using break-on-recs, the number of breaks takes the place of the number of files. If there are multiple data sources, the number of records and files processed is logged per data source in addition to the overall counts.
The second part of the usage update includes the number of bytes of memory used by each evaluation and statistic.
Each item in the update has a label, followed by a colon, then a value. Items are pipe delimited.
The interval for the status log updates is set by using --stats-log-interval on the command line. The values for this option are in minutes. The default interval is 5 minutes. Setting this value to zero will turn off this feature. The daemon config variable is STATS_LOG_INTERVAL.
When Pipeline is consuming more than just a list of files, it can be run as a daemon. If run from the command line using a socket or an incoming directory to provide data, it will turn itself into a daemon automatically unless --do-not-daemonize is specified. If running Pipeline as a daemon with service pipeline start, you must use a data source configuration file to specify data sources and input types, even if there is only a single source.
To provide easier control of the pipeline daemon in UNIX-like environments, an example control script (an sh-script) is provided. This control script will be invoked when the machine is booted to start the Analysis Pipeline, and it is also invoked during shutdown to stop the Analysis Pipeline. Use of the control script is optional; it is provided as a convenience.
As part of its invocation, the control script will load a second script that sets shell variables the control script uses. This second script has the name pipeline.conf. Do not confuse this variable setting script (which follows /bin/sh syntax) with the /etc/pipeline/config/pipeline.conf configuration file, which contains the filters, evaluations, statistics, etc., and is loaded by the pipeline application and follows the syntax described in the sections below.
If you are using an RPM installation of pipeline, installing the RPM will put the control script and the variable setting script into the correct locations under the /etc directory, and you can skip to the variable setting section below.
If you are not using an RPM installation, the make install step above installed the scripts into the following locations relative to pipeline's installation directory. You will need to copy them manually into the correct locations.
share/analysis-pipeline/etc/init.d/pipeline is the control script. Do not confuse this script with the pipeline application.
share/analysis-pipeline/etc/pipeline.conf is the variable setting script used by the control script.
Copy the control script to the standard location for start-up scripts on
your system (e.g., /etc/init.d/ on Linux and other SysV-type systems).
Make sure it is named pipeline and has execute permissions. Typically,
this will be done as follows:
# cp
${prefix}/share/analysis-pipeline/etc/init.d/pipeline \
/etc/init.d/pipeline
# chmod +x /etc/init.d/pipeline
Copy the variable setting script file into the proper location.
Typically, this will be done as follows:
# cp
${prefix}/share/analysis-pipeline/etc/pipeline.conf \
${prefix}/etc
Edit the variable setting script to suit your installation. Remember that the variable setting script must follow /bin/sh syntax. While most of the variables are self-explanatory or can be derived from the documentation elsewhere in this chapter and pipeline’s manual page, a few variables deserve some extra attention:
At this point you should be able to use the control script as follows to
start or stop the pipeline:
# /etc/init.d/pipeline start
# /etc/init.d/pipeline stop
To automate starting and stopping the pipeline when the operating
system boots and shuts down, you need to tell the machine about the new
script. On RedHat Linux, this can be done using:
# chkconfig --add pipeline
(If you have installed pipeline from an RPM, you do not need to perform this step.)
At this point, you should be able to start the pipeline using the
following command:
# service pipeline start
To specify the filters, evaluations, statistics, and lists, a configuration language is used. When pipeline is invoked, the --configuration-file switch must indicate the file containing all of the configuration information needed for processing.
Filters, evaluations, and statistics can appear in any order in the configuration file(s) as long as each item is defined before it is used. The only exception is named lists being referenced by filters. These can be referenced first, and defined afterwards. Since filters are used by evaluations and statistics, it is common to see filters defined first, then finally evaluations and statistics, with list configurations at the end.
In the configuration file, blank lines and lines containing only whitespace are ignored. Leading whitespace on a line is also ignored. At any location in a line, the octothorp character (a.k.a. hash or pound sign, #) indicates the beginning of a comment, which continues until the end of the line. These comments are ignored.
Each non-empty line begins with a command name, followed by zero or more arguments. Command names are a sequence of nonwhitespace characters (typically in uppercase), not including the characters # or ". Arguments may either be textual atoms (any sequence of alphanumeric characters and the symbols _, -, @, and /), or quoted strings. Quoted strings begin with the double-quote character ("), end with a double-quote, and allow for C-style backslash escapes in between. The character # inside a quoted string does not begin a comment, and whitespace is allowed inside a quoted string. Command names and arguments are case sensitive.
Every filter, evaluation, statistic, and list must have a name that is unique within the set of filters, evaluations, statistics, or lists. The name can be a double-quoted string containing arbitrary text or a textual atom.
To assist with finding errors in the configuration file, the user may specify the --verify-configuration switch to pipeline. This switch causes pipeline to parse the file, report any errors it finds, and exit without processing any files.
To print the contents of the arriving schemas, and also as a way to verify the viability of the data source configuration, the information elements that are available for processing can be displayed by specifying the --print-schema-info switch on the command line. For non-SiLK data sources, data must arrive (or --name-files must be used) for this to print the information. The SiLK schema is built in, but other data sources give Pipeline no a priori knowledge of their contents.
To print both the schema information and verify all configuration files, specify --show-schema-info-and-verify.
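For example, to check a configuration file without processing any data, or to print the elements available from a named IPFIX file, invocations along the following lines can be used (paths and file names are placeholders):

$ pipeline --silk --configuration-file=/etc/pipeline/config/pipeline.conf --verify-configuration
$ pipeline --ipfix --time-is-clock --configuration-file=pipeline.conf --print-schema-info --name-files capture1.ipfix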
The configuration information can be contained in a single file, or it may
be contained in multiple files that are incorporated into a master file
using INCLUDE statements. Syntax:
INCLUDE "path-name"
Multiple levels of file INCLUDE statements are supported. Often the top
level configuration file is named pipeline.conf, but it may have any name.
Examples:
INCLUDE "/var/pipeline/filters.conf"
INCLUDE "evaluations.conf"
The ordering of blocks in the configuration file does have an impact on the data processing of pipeline. Comparisons (in filters) and checks (in evaluations) are processed in the order they appear, and to pass the filter or evaluations, all comparisons or checks must return a true value. It is typically more efficient to put the more discerning checks and comparisons first in the list. For example, if you are looking for TCP traffic from IP address 10.20.30.40, it is better to do the address comparison first and the protocol comparison second because the address comparison will rule out more flows than the TCP comparison. This reduces the number of comparisons and in general decreases processing time. However, some comparisons are less expensive than others (for example, port and protocol comparisons are faster than checks against an IPset), and it may reduce overall time to put a faster comparison before a more-specific but slower comparison.
Filters, evaluations, and statistics run independently, so their order doesn't matter.
All keywords and hardcoded SiLK field names in pipeline are to be entered in capital letters. Filter, evaluation, statistic, and other user-specified names can be mixed case.
Throughout pipeline documentation and examples, underscores have been used within keywords in some places, and spaces used in others. Both options are accepted. For example: ALERT_EACH_ONLY_ONCE and ALERT EACH ONLY ONCE are interchangeable. Even ALERT_EACH ONLY_ONCE is allowed. Underscores and spaces will each be used throughout this document as a reminder that each are available for use.
SECONDS, MINUTES, HOURS, and DAYS are all acceptable values for units of time. Combinations of time units can be used as well, such as 1 HOUR 30 MINUTES instead of 90 MINUTES.
All fields in the data records can be used to filter data, along with some derived fields. With this version, the fields are not set ahead of time like in past versions. The available fields are based on the data sources and the schemas contained therein. If a SiLK data source is used, the fields will be the same as in previous versions, but now with IPv6 addresses available if the SiLK installation has enabled it.
Available fields can be combined into tuples, e.g. {SIP, DIP}, for more advanced analysis. These tuples are represented in the configuration file by listing the fields with spaces between them. When processed, they are sorted internally, so SIP DIP SPORT is the same as SPORT DIP SIP.
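As a brief sketch of a tuple in use (the statistic keywords FOREACH and RECORD COUNT are described in later sections, and the statistic name is arbitrary), the following counts records for each distinct {SIP, DIP} pair:

STATISTIC recordsPerIPPair
    FOREACH SIP DIP
    RECORD COUNT
END STATISTIC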
SchemaTools uses IPFIX information elements, so each element in a schema has an enterprise ID and an element ID attached to it. Some are part of the general set, with enterprise ID equal to 0, and others are custom elements with a non-zero enterprise ID. In past versions of pipeline, the name of the element or field was the only way to identify elements. In version 5.1 and beyond, the {enterprise, id} tuple can be used to identify elements. This can be used to avoid confusion if names are different in different schemas. For example, for backwards compatibility, the SiLK builder still uses the name "SIP", while YAF and other IPFIX based records will use "sourceIPv4Address". If {0,8} is used, pipeline won't care what the name of the element is in the schema.
With each specific element listed below, the enterprise id, element id tuple will be included in curly brackets after the name. Enterprise id 6871 is the enterprise id for CERT/SEI, so our custom elements will have this enterprise id.
SiLK data can still use "ANY IP", as described below in the SiLK section. In addition to the legacy groups of elements (IP and PORT), schemaTools provides each schema with groups of elements that can be used with ANY, based on their IPFIX type value. The values are: OCTET_ARRAY, UNSIGNED_8, UNSIGNED_16, UNSIGNED_32, UNSIGNED_64, SIGNED_8, SIGNED_16, SIGNED_32, SIGNED_64, FLOAT_32, FLOAT_64, BOOLEAN, MAC_ADDRESS, STRING, IPV4_ADDRESS, IPV6_ADDRESS. A group is only included in a schema, and thus available to pipeline users, if the group is non-empty.
Beginning with version 5.3, "ANY" fields can be used anywhere in the configuration file that regular fields are allowed. This includes but is not limited to derived fields, FOREACH, PROPORTION, DISTINCT, FILTERS, OUTPUT LISTS, FIELD BOOLEANS, and INTERNAL FILTERS. "PAIR" fields can be used anywhere a 2-tuple can.
To see which elements will be available for the given data, the command line switch --print-schema-info can be used. In addition to a list of available elements, a list of groups and their contents will be printed.
For the dynamically generated schemas from the YAF and IPFIX data sources, the list of available information elements is only created after a connection to the data has been made. This can be either a socket connection with initial data transmission, opening a file from the command line, or the first file being read from a directory being polled. If there are multiple data sources, pipeline will wait to connect to the primary, and then go in order based on the configuration file connecting to the other sources. Pipeline will not begin processing data records from any source, without having connected to all sources, as pipeline does not process the main configuration file without knowledge of the available elements.
Pipeline can handle data in lists, notably the DPI information in YAF records. It can also handle the situation where there are repeated fields in a record. In either case where there can be multiple values for a particular field name, Pipeline will process each value. These types of fields may also be referred to as "loopable fields".
When using a loopable field in a comparison, if ANY of the values for the field meet the criteria of the filter, it will return true. For example, if the filter is implementing a DNS watchlist, if any of the dnsQName values in the record are in the watchlist, the entire record is marked as passing the filter.
When a loopable field is chosen as the FOREACH field, there will be a state listing for every value in the record for that field.
If a loopable field is used as the field for a primitive, all values will be used. For example, if computing the SUM of a loopable, all of the values will be included in the sum.
IP addresses and ports have directionality: source and destination. The keyword ANY can be used to indicate that the direction does not matter and both values are to be tried (this can only be used when filtering). The ANY * fields can go anywhere inside the field list; the only restrictions are that the ANY must immediately precede IP, PORT, IP PAIR, or PORT PAIR, and that there can be only one ANY in a field list. The available fields are:
Field Name | {ent,ID} | Description |
ANY IP | | Either the source address or the destination address |
IP PAIR | | Either the {SIP, DIP} tuple or the {DIP, SIP} tuple. The ANY has been removed from referencing pairs due to internal processing issues. |
ANY IPv6 | | Either the source IPv6 address or the destination IPv6 address |
IPv6 PAIR | | Either the {SIP_V6, DIP_V6} tuple or the {DIP_V6, SIP_V6} tuple. The ANY has been removed from referencing pairs due to internal processing issues. |
ANY_PORT | | Either the source port or the destination port |
PORT_PAIR | | Either the {SPORT, DPORT} tuple or the {DPORT, SPORT} tuple. The ANY has been removed from referencing pairs due to internal processing issues. |
APPLICATION | {6871,33} | The service port of the record as set by the flow generator if the generator supports it, or 0 otherwise. For example, this would be 80 if the flow generator recognizes the packets as being part of an HTTP session |
BYTES | {6871,85} | The count of the number of bytes in the flow record |
BYTES PER PACKET | {6871,106} | An integer division of the bytes field and the packets field. It is a 32-bit number. The value is 0 if there are no packets |
CLASSNAME | {6871,41} | The class name assigned to the record. Classes are defined in the silk.conf file |
DIP | {0,12} | The destination IPv4 address |
DIP_V6 | {0,28} | The destination IPv6 address |
DPORT | {0,11} | The destination port |
DURATION | {0,161} | The duration of the flow record, in integer seconds. This is the difference between ETIME and STIME |
END_SECONDS | {0,151} | The wall clock time when the flow generator closed the flow record in seconds |
ETIME | {0,153} | The wall clock time when the flow generator closed the flow record in milliseconds |
FLAGS | {6871,15} | The union of the TCP flags on every packet that comprises the flow record. The value can contain any of the letters F, S, R, P, A, U, E, and C. (To match records with either ACK or SYN|ACK set, use the IN_LIST operator.) The flags formatting used by SiLK can also be used to specify a set of flags values. S/SA means to only care about SYN and ACK, and of those, only the SYN is set. The original way Pipeline accepted flags values, the raw specification of flags permutation is still allowed. |
FLOW RECORD | | This field references the entire flow record, and can only be used when checking the flow record against multiple filters using IN LIST (see below) |
ICMPCODE | {0,177} | The ICMP code. This test also adds a comparison that the protocol is 1. |
ICMPTYPE | {0,176} | The ICMP type. This test also adds a comparison that the protocol is 1. |
INITFLAGS | {6871,14} | The TCP flags on the first packet of the flow record. See FLAGS. |
INPUT | {6871,10} | The SNMP interface where the flow record entered the router. This is often 0 as SiLK does not normally store this value. |
NHIP | {0,15} | The next-hop IPv4 of the flow record as set by the router. This is often 0.0.0.0 as SiLK does not normally store this value. |
NHIP | {0,62} | The next-hop IPv6 of the flow record as set by the router. This is often 0.0.0.0 as SiLK does not normally store this value. |
OUTPUT | {6871,11} | The SNMP interface where the flow record exited the router. This is often 0 as SiLK does not normally store this value. |
PACKETS | {6871,86} | The count of the number of packets. |
PMAP | | See the pmap section for details |
PROTOCOL | {0,4} | The IP protocol. This is an integer, e.g. 6 is TCP |
SENSOR | {6871,31} | The sensor name assigned to the record. Sensors are defined in the silk.conf file. |
SESSIONFLAGS | {6871,16} | The union of the TCP flags on the second through final packets that comprise the flow record. See FLAGS |
SIP | {0,8} | The source IPv4 address |
SIP_V6 | {0,27} | The source IPv6 address |
SPORT | {0,7} | The source port |
START_SECONDS | {0,150} | The wall clock time when the flow generator opened the flow record in seconds |
STIME | {0,152} | The wall clock time when the flow generator opened the flow record in milliseconds |
TYPENAME | {6871,30} | The type name assigned to the record. Types are defined in the silk.conf file. |
YAF is capable of creating flow records used by SiLK, so most of the SiLK elements are available in YAF, though they use the standard IPFIX element names.
Element names and numbers of the potential fields in the core YAF record are listed in the YAF documentation at http://tools.netsa.cert.org/yaf/yaf.html, in the section labeled "OUTPUT: Basic Flow Record".
In addition to core flow fields, YAF exports information from deep packet inspection in dynamic lists. These are added to the schema as virtual elements by the yaf schema builder. Those elements are listed by name and element number, and are clearly described in the yaf online documentation at: http://tools.netsa.cert.org/yaf/yafdpi.html
As of version 5.3, there are three DPI fields exported by YAF that have had their information element changed by Pipeline to disambiguate them from fields in the core part of the records.
Two of these changes are with DNS records. When YAF exports the IP address returned in a DNS response, it uses sourceIPv4Address and sourceIPv6Address. To keep those separate from the source IP addresses used in the core of the records, those IP fields are changed to use fields named rrIPv4 and rrIPv6 for IPv4 and IPv6 respectively.
The other field changed from YAF export to Pipeline processing is the protocolIdentifier field in the DNSSEC DNSKEY record. This field has been changed to a field named DNSKEY_protocolIdentifier to keep it separate from the protocolIdentifier in the core flow record exported by YAF.
Certain fields are in the standard YAF record and can almost certainly be used. Derived fields available for use with flowStartMilliseconds and flowEndMilliseconds include: HOUR_OF_DAY, DAY_OF_WEEK, DAY_OF_MONTH, and MONTH.
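As a hedged sketch, a filter restricting processing to flows that ended during a particular hour of the day might look like the following; the filter name is arbitrary:

FILTER flowsEndingAtThreeAM
    HOUR_OF_DAY(flowEndMilliseconds) == 3
END FILTER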
FLOW_KEY_HASH is also available as all of the fields are part of the standard YAF record.
If YAF is exporting DPI elements, depending on their contents, the remainder of the derived fields can be used.
As of version 5.9, filters using more than one YAF DNS DPI field will be checked against each individual DNS record sequentially. Previously, each DNS field was checked against all DNS records in the DPI information for a match.
This table contains elements in the core YAF records that are always present. The names are case sensitive. Even if a field in this list has a corresponding SiLK field, these names and numbers must be used. This list is taken from the YAF documentation in the "OUTPUT" section. Associated SiLK fields are in parentheses where applicable. Reverse fields are not listed in this table as they are not guaranteed to be present.
Field Name | {ent,ID} | Description |
flowStartMilliseconds | {0,152} | Flow start time in milliseconds (STIME) |
flowEndMilliseconds | {0,153} | Flow end time in milliseconds, good choice for timing source to mimic SiLK processing. (ETIME) |
octetTotalCount | {0,85} | Byte count for the flow.(BYTES) |
packetTotalCount | {0,86} | Packet count for the flow.(PACKETS) |
sourceIPv4Address | {0,8} | Source IP address of the flow.(SIP) |
destinationIPv4Address | {0,12} | Destination IP address of the flow.(DIP) |
ANY IPV4_ADDRESS | | Group of elements that are IP addresses, which will contain sourceIPv4Address and destinationIPv4Address. (ANY IP) |
IPV4_ADDRESS PAIR | | Group of elements that are IP addresses, which will contain sourceIPv4Address and destinationIPv4Address. (IP PAIR) |
sourceIPv6Address | {0,27} | Source IPv6 address of the flow.(SIP_V6) |
destinationIPv6Address | {0,28} | Destination IPv6 address of the flow.(DIP_V6) |
ANY IPV6_ADDRESS | | Group of elements that are IPv6 addresses, which will contain sourceIPv6Address and destinationIPv6Address. (ANY IPv6) |
IPV6_ADDRESS PAIR | | Group of elements that are IPv6 addresses, which will contain sourceIPv6Address and destinationIPv6Address. (IPv6 PAIR) |
sourceTransportPort | {0,7} | Source TCP or UDP port of the flow.(SPORT) |
destinationTransportPort | {0,11} | Destination TCP or UDP port of the flow.(DPORT) |
flowAttributes | {6871,40} | Bit 1, all packets have fixed size. Bit 2, out of sequence |
protocolIdentifier | {0,4} | IP protocol of the flow.(PROTOCOL) |
flowEndReason | {0,136} | 1: idle timeout, 2: active timeout, 3: end of flow, 4: forced end, 5: lack of resources |
silkAppLabel | {6871,33} | Application label, if YAF is run with --applabel.(APPLICATION) |
vlanId | {0,58} | VLAN tag of the first packet |
ipClassOfService | {0,5} | For IPv4, the TOS field, for IPv6, the traffic class |
tcpSequenceNumber | {0,184} | Initial TCP sequence number |
initialTCPFlags | {6871,14} | TCP flags of initial packet.(INITFLAGS) |
unionTCPFlags | {6871,15} | Union of the TCP flags of all packets other than the initial packet.(SESSIONFLAGS) |
IPFIX records can include any available element; no particular elements can be assumed to be present. You either have to know the data, or use the --print-schema-info switch to discover the available elements.
The various derived fields are available based on the fields on the IPFIX records.
Prefix Maps (pmaps) are part of the SiLK tool suite and can be made using rwpmapbuild. Their output can be used just like any other field in pipeline: it can make up part of a tuple, be used in FOREACH, and be used in filtering. One caveat about pmaps used to make up a tuple in a field list is that the pmap must be listed first in the list for proper parsing. However, when referencing pmap values in a typeable tuple, it must go at the end. PMAPs take either an IP address or a PROTOCOL PORT pair as input.
Using a PMAP in Pipeline is a two stage process in the configuration file. The first step is to declare the pmap. This links a user-defined field name to a pmap file, with the name in quotes. This field name will be used in alerts to reference the field, and in the rest of the configuration file to reference the pmap.
The declaration line is not part of a FILTER or EVALUATION, so it stands by itself, similar to the INCLUDE statements. The declaration line starts with the keyword PMAP, followed by a string for the name without spaces, and lastly the filename in quotes.
PMAP userDefinedFieldName "pmapFilename"
Now that the PMAP is declared, the field name can be used throughout the file. Each time the field is used, the input to the pmap must be provided. This allows different inputs to be used throughout the file, without redeclaring the pmap.
userDefinedFieldName(inputFieldList)
For each type of pmap, there is a fixed list of inputFieldLists:
The examples above use the names of the fields from SiLK for simplicity. Any field of type IPV4_ADDRESS or IPV6_ADDRESS can be used for an IP pmap. The IPFIX elements for SIP and DIP are sourceIPv4Address and destinationIPv4Address.
The port-protocol pmaps must use the IPFIX elements of: protocolIdentifier (in place of PROTOCOL above), sourceTransportPort, and destinationTransportPort (in place of SPORT and DPORT above).
Below is an example that declares a pmap, then filters based on the result of the pmap on the SIP, then counts records per pmap result on the DIP:
PMAP thePmapField "myPmapFile.pmap"
FILTER onPmap
thePmapField(SIP) == theString
END FILTER
STATISTIC countRecords
FILTER onPmap
FOREACH thePmapField(DIP)
RECORD COUNT
END STATISTIC
Field booleans are custom fields that consist of an existing field and a list of values. If the value for the field is in the value list, then the field boolean’s value is TRUE. These are defined similar to PMAPs, but use the keyword FIELD BOOLEAN. For example, to define a boolean named webPorts, to mean the source port is one of [80, 8080]:
FIELD BOOLEAN sourceTransportPort webPorts IN [80, 8080]
Now, webPorts is a field that can be used anywhere in the configuration file that checks whether the sourceTransportPort is in [80, 8080].
If used in filtering, this is the same as saying: sourceTransportPort IN LIST [80, 8080].
However, if used as a part of FOREACH, the value TRUE or FALSE will be in the field list, to indicate whether the sourceTransportPort is 80 or 8080.
Another example could be a boolean to check whether the hour of the day, derived from a timestamp, is part of the work day. There could be a statistic constructed to report byte counts grouped by whether the hour is in the workday, which is 8am to 5pm in this example.
FIELD BOOLEAN HOUR_OF_DAY(flowStartSeconds)
workday IN [8,9,10,11,12,13,14,15,16,17]
STATISTIC workdayByteCounts
FOREACH workday
SUM octetTotalCount
END STATISTIC
If the records include fields containing domain names, the following fields can be used. If used on a non-DNS string, there will not be an error when parsing the configuration file, but most of these fields will not return data, as DNS dot separators are required for processing.
The field to be operated on is put in parentheses after the derived field name.
These fields can be used anywhere in a pipeline configuration file like any other field.
Derivation of fields can be nested as well, such as:
DNS_SLD(DNS_INVERT(DNS_NORMALIZE(dnsQName)))
For the following example domain name: tools.netsa.cert.org
These derived fields pull out human readable values from timestamps. The values they pull are just integers, but in filters, pipeline can accept the words associated with those values, e.g. JANUARY is translated to 0, as is SUNDAY. These fields work with field types: DATETIME_SECONDS, DATETIME_MILLISECONDS, DATETIME_MICROSECONDS, DATETIME_NANOSECONDS. Each will be converted to the appropriate units for processing. The system’s timezone is used to calculate the HOUR value.
The field to be operated on is put in parentheses after the derived field name.
These fields can be used anywhere in a pipeline configuration file like any other field.
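For instance, a statistic could group state by one of these derived values; the sketch below (using YAF field names, with an arbitrary statistic name) sums bytes per day of the week:

STATISTIC bytesPerDayOfWeek
    FOREACH DAY_OF_WEEK(flowEndMilliseconds)
    SUM octetTotalCount
END STATISTIC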
All derived fields can use ANY fields, such as:
STRLEN(ANY STRING)
The Analysis Pipeline passes each flow record through each filter to determine whether the record should be passed on to an evaluation or statistic. There can be any number of filters, and each runs independently. As a result, each filter sees every flow record, and keeps its own list of flows that meet its criteria.
A filter block starts with the FILTER keyword followed by the name of the filter, and it ends with the END FILTER statement. The filter name must be unique across all filters. The filter name is referenced by evaluations, internal filters, and statistics.
Filters are initially marked internally as inactive, and become active when an evaluation or statistic references them.
Filters are composed of comparisons. In the filter block, each comparison appears on a line by itself. If all comparisons in a filter return a match or success, the flow record is sent to the evaluation(s) and/or statistic(s) that use the records from that filter.
If there are no comparisons in a filter, the filter reports success for every record.
Each comparison is made up of three elements: a field, an operator, and a compare value, for example BYTES > 40. A comparison is considered a match for a record if the expression created by replacing the field name with the field's value is true.
Eight operators are supported. The operator determines the form that the compare value takes.
DPORT IN_LIST [21, 22, 80]
If there is a single field in the fieldList, and if that is an IP address, this bracketed list can contain IPSet files mixed with IP addresses that will all be combined for the filter:
SIP IN LIST ["/data/firstIPset.set", 192.168.0.0/16, "/data/secondIPset.set"]
An example is filtering for sip 1.1.1.1 with sport 80, and
2.2.2.2 with sport 443:
FILTER sipSportPair
SIP SPORT IN LIST [[1.1.1.1,80],
[2.2.2.2,443]]
END FILTER
fieldList IN LIST "/path/to/watchlist.file"
If the fieldList consists of one field and if it is of type IPV4_ADDRESS or IPV6_ADDRESS, the file MUST be a SiLK IPSet. A fieldList of just an IP cannot be any of the types described below.
A file can be used to hold either type of bracketed list described above (single or double bracketed). The contents must be formatted exactly as if typed directly into the config file; a user should be able to copy and paste between a file in this format and the config file and vice versa. The bracketed list must appear on a single line (there cannot be any newline characters within the list), and that line must end with a newline.
If the fieldList consists of a single field, a simple watchlist file can be used to hold the values. This format requires one value per line. The format of each value type is the same as if it was typed into the configuration file. Comments can be used in the file by setting the first character of the line to "#". The value in the field being compared against the watchlist must be an exact match to an entry in the file for the comparison to be true.
The exact match requirement can cause problems for DNS fields. Pipeline has no way to know that a particular field value is a DNS domain name string, such that it would return a match for "www.url.com" if "url.com" was in the list. To overcome this deficiency, a watchlist can put a particular string on the first line to tell pipeline to process the watchlist as a DNS watchlist: the first line of the file must be "##format:dns". When processing the file, pipeline will normalize the field value, making it all lower case and removing leading or trailing dots.
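A minimal sketch of such a file follows; the domain names and file path are hypothetical:

##format:dns
# domains of interest
baddomain.example
www.another-bad.example

A filter could then reference the file with a comparison such as: dnsQName IN LIST "/data/dns-watchlist.txt"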
The name of a list that is filled by the outputs of an evaluation or an internal filter. The tuple in the filter must entirely match the tuple used to fill the list.
SIP DIP PROTO SPORT DPORT IN LIST createdListOfFiveTuples
For example, to do TCP sport 80 OR UDP dport 23:
FILTER tcp80
SPORT == 80
PROTOCOL == 6
END FILTER
FILTER udp23
DPORT == 23
PROTOCOL == 17
END FILTER
FILTER filterUsingTcp80OrUdp23
FLOW RECORD IN LIST [tcp80,udp23]
END FILTER
== Succeeds when the value from the record is equal to the compare value. This also encompasses IPv4 subnets. For example, the following will succeed if either the source or destination IP address is in the 192.168.x.x subnet:
ANY_IP == 192.168.0.0/16
The compare value can reference another field on the flow record. For example, to check whether the source and destination port are the same, use: SPORT == DPORT
As described above, Pipeline version 5 uses schemas to describe how the incoming data is structured. By default, filters are run on all data records that contain the fields necessary for processing. To restrict a filter to only handle records from particular schemas, list schema names or numbers after declaring the name of the filter. If the schema name has spaces, you must put the name in quotes.
IPv4 SiLK records are in the schema named: "SILK IPv4 Schema", number:
5114
IPv6 SiLK records are in the schema named: "SILK IPv6 Schema", number:
5116
To limit a filter to only v4 records:
FILTER myV4Filter "SILK IPv4 Schema"
...
END FILTER
or
FILTER myV4Filter 5114
...
END FILTER
To limit a filter to only v6 records:
FILTER myV6Filter "SILK IPv6 Schema"
...
END FILTER
or
FILTER myV6Filter 5116
...
END FILTER
Filters can be added to manifolds, where they are evaluated in order until a record meets the criteria of one of the filters, at which point processing of that record by the manifold stops. This creates an efficiency when the sets of records that pass the different filters are mutually exclusive. Once a record meets the criteria for a filter in the manifold, the rest of the filters in the manifold will not be run on that record.
An example of two filters in a manifold:
FILTER myFirstFilter IN MANIFOLD myManifold
...
END FILTER
FILTER mySecondFilter IN MANIFOLD myManifold
...
END FILTER
FILTER myThirdFilter
...
END FILTER
In this example, any record that matched myFirstFilter's criteria will not be run against mySecondFilter. It will then be run against myThirdFilter, as it is not part of the manifold, so it runs no matter what. Note that the order of the filters in the configuration file matters, and that myManifold has no impact on the operation of filters outside the manifold, such as myThirdFilter.
Looking for traffic where the destination port is 21:
FILTER FTP_Filter
DPORT == 21
END FILTER
Watchlist checking whether the source IP is in a list defined by the
IPset "badSourceList.set":
FILTER WatchList-BadSourcesList
SIP IN_LIST
"badSourceList.set"
END FILTER
Compound example looking for an IP on a watch list communicating on TCP
port 21:
FILTER watchListPlusFTP
SIP IN_LIST
"badSourceList.set"
DPORT == 21
PROTOCOL == 6
END FILTER
Look for records with a dns query name with second level domain of
"cert"
FILTER certSLDs
DNS_SLD(dnsQName) ==
"cert"
END FILTER
There are two places where named lists can be created and populated so they can be used by filters: Internal Filters and Output Lists (which are discussed in evaluation specifics).
In each case, a field list is used to store the tuple that describes the contents of the data in the list. A filter can use these lists if the tuple used in the filters perfectly matches the tuple used to make the list.
An internal filter compares the incoming flow record against an existing filter, and if it passes, it takes some subset of fields from that record and places them into a named list. This list can be used in other filters. There can be any number of these lists.
Internal filters are different from output lists because they put data into the list(s) immediately, so the contents of the list(s) can be used for records in the same flow file as the one that caused data to be put into the list(s). Output lists, populated by evaluations, are only filled, and thus take effect, for subsequent flow files.
Internal filters are immediate reactions to encountering a notable flow record.
The fields to be pulled from the record and put into the list can be combined into any tuple. These include the ANY fields and the output of pmaps. The "WEB_REDIR" fields cannot be used here. Details on how to create an internal filter specifically for the WEB_REDIRECTION or HIGH_PORT_CHECK primitives are discussed below.
An internal filter is a combination of filters and lists, so both pieces need to be specified in the syntax. A key aspect of the internal filter declaration is specifying which fields, pulled from records that pass the filter, get put into which list. There can be more than one field-list combination per internal filter.
It is recommended that a timeout value be added to each statement which declares the length of time a value can be considered valid, but it is no longer required. To build a list from an internal filter without a timeout, leave the timeout portion of the configuration file blank.
Syntax
INTERNAL_FILTER name of this internal
filter
FILTER name of filter to use
fieldList list name timeout
END INTERNAL FILTER
Example, given an existing filter that finds records to or from a watchlist:
INTERNAL_FILTER watchlistInfo
FILTER watchlistRecords
SPORT DPORT watchlistPorts 1 HOUR
SIP DIP SPORT DPORT PROTOCOL
watchlistFiveTuples 1 DAY
END INTERNAL_FILTER
This internal filter pulls {SPORT, DPORT} tuples from flows that pass the filter watchlistRecords and puts them into a list called watchlistPorts; those values stay in the list for 1 HOUR. It also pulls the entire five tuple from those records and puts the tuples into a list called watchlistFiveTuples, where they stay for 1 DAY.
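A filter elsewhere in the configuration file could then reference one of those lists, provided its field list matches the list's tuple exactly; a brief sketch:

FILTER trafficOnWatchlistPorts
    SPORT DPORT IN LIST watchlistPorts
END FILTER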
HIGH_PORT_CHECK requires the use of internal filters as they scan for flow records to compare against that can be in the same flow file. The field list for each of these lists are keywords, that in addition to indicating the fields to be stored, tells pipeline how to store them. The keyword is HIGH_PORT_LIST.
Like filters, internal filters can also be put in manifolds to prevent execution of other internal filters in the manifold after one in the manifold has matched. Internal filter manifolds act the same way as filter manifolds, but for internal filters. See the above sections on filter manifolds for syntax. Note that filters and internal filters are not put into the same type of manifold; you can't have internal filters and filters in the same manifold.
Available operators to compare state values with thresholds include: <, <=, >, >=, and !=.
To get an overall description of primitives and their place in the configuration file, click here
Primitive | Syntax | Description |
RECORD COUNT | CHECK THRESHOLD RECORD COUNT oper threshold END CHECK | Count of flows seen by primitive |
SUM | CHECK THRESHOLD SUM field oper threshold END CHECK | Sum of the values for the given field |
AVERAGE | CHECK THRESHOLD AVERAGE field oper threshold END CHECK | Average of the values for the given field |
DISTINCT | CHECK THRESHOLD DISTINCT field oper threshold END CHECK | Count of distinct values for the given field. Field can be a field list to count distinct tuples |
PROPORTION | CHECK THRESHOLD PROPORTION field value oper threshold END CHECK | Proportion of flows seen with the given value for the given field |
To get an overall description of primitives and their place in the configuration file, click here
Primitive | Syntax | Description |
EVERYTHING PASSES | CHECK EVERYTHING PASSES END CHECK | Alert on every flow |
BEACON | CHECK BEACON COUNT minCount CHECK TOLERANCE int PERCENT TIME WINDOW minIntervalTimeVal END CHECK | Finite State Beacon Detection |
RATIO | CHECK RATIO OUTGOING integer1 TO integer2 LIST name of list from beacon # optional END CHECK | Detect IP pairs with more outgoing than incoming traffic |
ITERATIVE COMPARISON | | This primitive has been removed for version 5.3 |
HIGH PORT CHECK | CHECK HIGH_PORT_CHECK LIST listName END CHECK | Look for passive traffic |
WEB REDIRECTION | | This primitive has been removed for version 5.3 |
SENSOR OUTAGE | CHECK FILE_OUTAGE SENSOR_LIST [list of sensor names] TIME_WINDOW time units END CHECK | Alert if a sensor stops sending flows. Requires a SiLK data source |
DIFFERENCE DISTRIBUTION | STATISTIC diffDistExample DIFF DIST field END STATISTIC | Output difference distribution (Statistic only) |
FAST FLUX | CHECK FAST FLUX IP_FIELD ipFieldName ASN definedPmapName DNS dnsQueryFieldName NODE MAXIMUM maxNodesStored END CHECK | Alert if a fast flux network is detected |
PERSISTENCE | PERSISTENCE integer CONSECUTIVE HOURS USING timingFieldName END CHECK | Alert if the FOREACH tuple was present for integer consecutive HOURS or DAYS |
CALCULATE STATS | CHECK CALCULATE STATS field OUTSIDE number OF STANDARD DEVIATIONS END CHECK | Alert if the current bin value is greater than the specified number of standard deviations away from the mean |
EWMA | CHECK EWMA field OUTSIDE number OF STANDARD DEVIATIONS SMOOTHING FACTOR number COMPARE [BEFORE|AFTER] END CHECK | Alert if the current bin value is greater than the specified number of standard deviations away from the exponential weighted moving average |
HAS CHANGED | CHECK HAS CHANGED WATCH field END CHECK | Alert if the previous record to be processed by this primitive had a different value for the specified field than the current record |
Evaluations and statistics comprise the second stage of the Analysis Pipeline. Each evaluation and statistic specifies the name of a filter which feeds records to the evaluation or statistic. Specific values are pulled from those flow records, aggregate state is accumulated, and when certain criteria are met alerts are produced.
To calculate and aggregate state from the filtered flow records, pipeline uses a concept called a primitive.
Evaluations are based on a list of checks that have primitives embedded in them. The aggregate state of the primitive is compared to the user defined threshold value and alerts are generated.
Statistics use exactly one primitive to aggregate state. The statistic periodically exports all of the state as specified by a user-defined interval.
New in version 4.2: if a statistic uses FOREACH and the state for a particular unique value group is empty, that value will not be included in the statistic's alert. A statistic without FOREACH will output the state value no matter what.
An evaluation block begins with the keyword EVALUATION followed by the evaluation name. Its completion is indicated by END EVALUATION.
Similarly, a statistic block begins with the keyword STATISTIC and the statistic's name; the END STATISTIC statement closes the block.
The remainder of this section describes the settings that evaluations and statistics have in common, and the keywords they share. A description of primitives will hopefully make the details of evaluations and statistics easier to follow.
Each of the following commands (except ID) goes on its own line.
Each evaluation and statistic must have a unique string identifier. It can have letters (upper and lower case) and numbers, but no spaces. It is placed immediately following the EVALUATION or STATISTIC declaration:
EVALUATION myUniqueEvaluationName
...
END EVALUATION
STATISTIC myUniqueStatisticName
...
END STATISTIC
The ALERT TYPE is an arbitrary, user-defined string. It can be used as a general category to help when grouping or sorting alerts. If no alert type is specified, the default alert type is Evaluation for evaluations and Statistic for statistics. The value of the alert type does not affect pipeline processing.
Syntax:
ALERT TYPE alert-type-string
Evaluations and statistics must be assigned a severity level which is included in the alerts they generate. The levels are represented by integers from 1 to 255. The severity has no meaning to the Analysis Pipeline; the value is simply recorded in the alert. The value for the severity does not affect pipeline processing. This field is required.
Syntax:
SEVERITY integer
Evaluations and statistics (but not file evaluations) need to be attached to a filter, which provides them flow records to analyze. Each can have one and only one filter. The filter's name links the evaluation or statistic with the filter. As a result, the filter must be created prior to creating the evaluation or statistic.
Syntax:
FILTER filter-name
Alternatively if you want to have an evaluation or statistic process all records, you can use the keyword NO FILTER in place of the filter declaration.
Evaluations and statistics can compute aggregate values across all flow records, or they can aggregate values separately for each distinct value of particular field(s) on the flow records, grouping the flow records by the field(s). An example of this latter approach is computing something per distinct source address.
FOREACH is used to isolate a value (e.g., a malicious IP address) or a notable tuple (e.g., a suspicious port pair). The unique field value that caused an evaluation to alert will be included in any alerts. Using FOREACH in a statistic will cause the value for every unique field value to be sent out in the periodic update.
The default is not to separate the data for each distinct value. The field that is used as the key for the groups is referred to as the unique field, and is declared in the configuration file for the FOREACH command, followed by the field name:
FOREACH field
Any of the fields can be combined into a tuple, with spaces between the individual field names. The more fields included in this list, the more memory the underlying primitives need to keep all of the required state data.
The ANY IP and ANY PORT constructs can be used here to build state (for example, a sum of bytes) for both IPs or both ports in the flow. The point is to build state for an IP or port whenever it appears, regardless of whether it is the source or destination. When referencing the IP or port value to build an output list, use SIP or SPORT as the field to put in the list.
Pmaps can also be used to group state; the state is grouped by the output of the pmap. Pmaps can also be combined with other fields to build more complex tuples for grouping state, such as pmap(SIP) PROTOCOL.
To keep state per source IP Address:
FOREACH SIP
To keep state per port pair:
FOREACH SPORT DPORT
To keep state for both IPs:
FOREACH ANY IP
As with filtering, the ordering of the fields in the tuple does not matter as they are sorted internally.
There are some limits on which fields can be used: some evaluations require that a particular field be used, and some primitives do not support grouping by a field.
File evaluations do not handle records, so the FOREACH statement is illegal.
By default, evaluations and statistics are marked as active when they are defined. Specifying the INACTIVE statement in the evaluation or statistic block causes the evaluation or statistic to be created, but it is marked inactive, and it will not be used in processing records. For consistency, there is also an ACTIVE statement which is never really needed.
Syntax:
INACTIVE
By default, evaluations and statistics are updated after each input file, or after a predefined number of records when using a socket. As of Pipeline 5.8, it is possible to specify a time based bin size in which records will be grouped and evaluated after a certain amount of time. This allows better control over primitives such as CALCULATE STATS and EWMA, which generate statistics over bins instead of over records like AVERAGE. Binning also improves performance when updates are not needed after every input file.
Syntax:
BIN SIZE timeval
Binning can be used with both statistics and evaluations. When used in a statistic, the BIN SIZE configuration must be placed inside the top level of the statistic definition. When used in an evaluation, the BIN SIZE configuration must be placed inside the check block.
The bin size is not allowed to be greater than the time window. Additionally, if the input files to pipeline cover consistent durations (e.g. two minutes), do not set the bin size equal to the file duration: doing so is unnecessary because it mimics the default behavior, and it can also cause unexpected behavior.
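As a hedged sketch of the statistic placement (the filter name, field, and values are illustrative, and a real deployment may use different statements):

STATISTIC binnedByteSum
FILTER outboundTraffic
SUM BYTES
BIN SIZE 5 MINUTES
UPDATE 1 HOUR
SEVERITY 3
END STATISTIC

For an evaluation, the BIN SIZE 5 MINUTES line would instead appear between the CHECK and END CHECK statements of the check it applies to.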
Introduced in version 5.10, Analysis Pipeline supports simple one line alerting functions for Tombstone and Yaf Stats records.
Syntax:
ALERT ON TOMBSTONE
AND/OR
ALERT ON YAF_STATS
Both lines will cause an alert log message with the record's data to be generated for each of the corresponding records that arrives.
This section provides evaluation-specific details, building on the evaluation introduction and aggregate function description provided in the previous two sections.
Each evaluation block must contain one or more check blocks. The evaluation sends each flow record it receives to each check block where the records are aggregated and tests are run. If every check block test returns a true value, the evaluation produces an output entry which may become part of an alert.
Evaluations have many settings that can be configured, including the output and alerting stages, in addition to evaluation sanity checks.
In an evaluation, the check block begins with the CHECK statement which takes as a parameter the type of check. The block ends with the END CHECK statement. If the check requires any additional settings, those settings are put between the CHECK and END CHECK statements, which are laid out in the primitives section.
The FILE_OUTAGE check must be part of a FILE_EVALUATION block. All other checks must be part of an EVALUATION block.
When an evaluation threshold is met, the evaluation creates an output entry. The output entry may become part of an alert, or it may be used to affect other settings in the pipeline.
Output Timeouts
All information contained in alerts is pulled from lists of output entries from evaluations. These output entries can be configured to time out both to conserve memory and to ensure that information contained in alerts is fresh enough to provide value. The different ways to configure the alerting stage are discussed below.
One way to configure alerting is to limit the number of times alerts can be sent in a time window. This is a place where the output timeout can have a major effect. If alerts are only sent once a day, but outputs time out after one hour, then only the outputs generated in the hour before alerting will be eligible to be included in alerts.
When FOREACH is not used, output entries are little more than flow records with attached threshold information. When FOREACH is used, they contain the unique field value that caused the evaluation to return true. Each time this unique field value triggers the evaluation, the timestamp for that value is reset and the timeout clock begins again.
Take the example of an evaluation doing network profiling to identify servers. If the output timeout is set to 1 day, the list of output entries will contain all IP addresses that have acted like a server in the last day. As long as a given IP address keeps acting like a server, it will remain in the output list and is available to be included in an alert, or to be put in a named output list as described in the output list section.
Syntax:
OUTPUT TIMEOUT timeval
OUTPUT TIMEOUT 1 DAY
Shared Output Lists
When FOREACH is used with an evaluation, any value in an output entry can be put into a named output list. If the unique field is a tuple made up of multiple fields, any subset of those fields can be put into a list. There can be any number of these lists. A timeout value is not provided for each list as the OUTPUT TIMEOUT value is used. When an output entry times out, the value, or subset of that tuple is removed from all output lists that contain it.
These lists can be referenced by filters, or configured separately, as described in the list configuration section.
To create a list, a field list of what the output list will contain must be provided. A unique name for this list must be provided as well.
Syntax:
OUTPUT LIST fieldList listName
If using FOREACH SIP DIP, each of the following lists can be created
OUTPUT LIST SIP listOfSips
OUTPUT LIST DIP listOfDips
OUTPUT LIST SIP DIP listOfIPPairs
Alert on Removal
If FOREACH is used, pipeline can be configured to send an alert when an output has timed out from the output entries list.
Syntax:
ALERT ON REMOVAL
Clearing State
Once the evaluation's state has hit the threshold and an output entry has been generated, you may want to reset the current state of the evaluation. For example, if the evaluation alerts when a count of something reaches 1000, you might want to reset the count to start at 0 again. Using CLEAR ALWAYS can give a more accurate measure of timeliness, and is likely to be faster.
To set when to clear state, simply put one of the following in the body of the evaluation:
CLEAR ALWAYS
CLEAR NEVER
This field is required as of v4.3.1; CLEAR NEVER used to be the default.
Too Many Outputs
There are sanity checks that can be put in place to turn off evaluations that are finding more outputs than expected, which could happen with a poorly designed evaluation or analysis. For example, an evaluation looking for web servers may be expected to find fewer than 100, so a sanity threshold of 1000 would indicate a lot of unexpected results, and the evaluation should be shut down so as not to take up too much memory or flood alerts.
Evaluations that hit the threshold can be shut down permanently, or put to sleep for a specified period of time and then turned back on. If an evaluation is shut down temporarily, all state is cleared and memory is freed, and it will restart as if pipeline had just begun processing.
Syntax:
SHUTDOWN MORE THAN integer OUTPUTS [FOR timeval]
Examples to shut down if there are more than 1000 outputs: one shuts the evaluation down forever, and the other shuts it down for 1 day and starts over.
SHUTDOWN MORE THAN 1000 OUTPUTS
SHUTDOWN MORE THAN 1000 OUTPUTS FOR 1 DAY
Alerting is the final stage of the Analysis Pipeline. When the evaluation stage is finished, and output entries are created, alerts can be sent. The contents of all alerts come from these output entries. These alerts provide information for a user to take action and/or monitor events. The alerting stage in pipeline can be configured with how often to send alerts and how much to include in the alerts.
Based on the configurations outlined below, the alerting stage first determines if it is permitted to send alerts, then it decides which output entries can be packaged up into alerts.
There are two main sections to alerts: the flow record that generated the alert, and the data metrics that depend on the evaluation or statistic. For SiLK data sources, the entire flow record will be included in the alert because there is only one hierarchical level to the record. IPFIX data sources can have lists of elements or sub templates in them; only the top level will be included in the alert, not the list contents. YAF records that have DPI information will likewise only have their top level included in the alert. There is no human-readable and machine-parsable way to include all of the different levels in a single line of an alert. This applies regardless of whether snarf is installed.
Extra Alert Field
If there is a field of interest that is not in the main/top level of the
schema, it can still be included in the record portion of the alert. This
is done with EXTRA ALERT FIELD. For example, to include any
dnsQName values in an alert when using a YAF data source (with dnsQName
buried in the DPI data), use:
EXTRA ALERT FIELD dnsQName
The fields in the DPI portion of a YAF record are marked as "loopable", so there can be multiple values for this field. All of these values will be included at the end of the alert.
In addition to YAF elements, any DERIVED field can be added, even if
what the value is derived from is in the core record. For example, to
add the day of the week of the STIME from a SiLK record, use:
EXTRA ALERT FIELD DAY OF WEEK(STIME)
There is no maximum to the number of EXTRA ALERT FIELDs that can be added to an evaluation.
Extra Aux Alert Field
Along the same lines as EXTRA ALERT FIELD, this allows the user to add extra values to the alerts that go to the auxiliary alert file. There is no limit to the number of extra aux alert fields that can be used.
This cannot be used with snarf.
However, unlike the extra alert field above, the extra aux alert field is allowed to be an element from the core record, as those values do not get printed in the aux alert file.
To add the day of the week of the stime, and the sip from the record to
the aux alert file, use:
EXTRA AUX ALERT FIELD DAY OF WEEK(STIME)
EXTRA AUX ALERT FIELD SIP
How often to send alerts
Just because an evaluation produces output entries does not mean that alerts will be sent. An evaluation can be configured to send a batch of alerts only once an hour, or two batches per day. The first thing the alerting stage does is check when the last batch of alerts was sent and determine whether sending a new batch meets the restrictions placed by the user in the configuration file.
If it determines that alerts can be sent, it builds an alert for each output entry, unless further restricted by the settings described in the next section, which affect how much to alert.
Syntax:
ALERT integer-count TIMES timeVal
This configuration option does not affect the number of alerts sent per time period; it affects the number of times batches of alerts can be sent per time period. That is why the configuration command says "alert N times per time period" rather than "send N alerts per time period". While the semantic difference is subtle, it has a great effect on what gets sent out.
To have pipeline send only 1 batch of alerts per hour, use:
ALERT 1 TIMES 1 HOUR
To indicate that pipeline should alert every time there are output entries for alerts, use:
ALERT ALWAYS
How much to alert
The second alert setting determines how much information to send in each alert. You may wish to receive different amounts of data depending on the type of evaluation and how often it reports. Consider these examples:
The amount of data to send in an alert is relevant only when the OUTPUT_TIMEOUT statement includes a non-zero timeout and multiple alerts are generated within that time window.
To specify how much to send in an alert, specify the ALERT keyword followed by one of the following:
The default is SINCE LAST TIME. If using an EVERYTHING PASSES evaluation, be sure to use ALERT EVERYTHING to ensure that flows from files arriving less than a second apart are included in alerts.
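A hedged sketch of such an evaluation follows; the filter name and severity are illustrative, and optional statements such as OUTPUT TIMEOUT are omitted:

EVALUATION alertOnAllWatchlistHits
FILTER watchlistRecords
SEVERITY 5
CHECK EVERYTHING PASSES
END CHECK
CLEAR ALWAYS
ALERT EVERYTHING
ALERT ALWAYS
END EVALUATION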
The last option is to have an evaluation do its work, but to never send out alerts. If the goal of an evaluation is just to fill up a list so other pieces of pipeline can use the results, individual alerts may not be necessary. Another case is that the desired output of filling these lists is that the lists send alerts periodically, and getting individual alerts for each entry is not ideal. In these cases, instead of the options described above use:
DO NOT ALERT
Minimum Number of Records Before Alerting
A minimum number of records requirement can be added to an entire evaluation and/or a particular check when using FOREACH. The state will be aggregated, and data will time out according to TIME WINDOW, but the state value will not be compared against a threshold, preventing alerts from being sent and outputs from being created until the minimum number of records has been seen.
When using primitives such as AVERAGE, RATIO, or PROPORTION, alerts may be more meaningful when a sufficient number of records has been processed. This allows the state value to settle, giving a more realistic picture of the network activity.
This feature can only be used with the following primitives:
RECORD COUNT, SUM, AVERAGE, DISTINCT, PROPORTION, and
RATIO.
This feature can be used in conjunction with Minimum Time before alerting. If both are used, both restrictions must be overcome to allow alerting.
The minimum number of records requirement can be applied at two different levels:
Minimum Time Passed Before Alerting
A minimum amount of time requirement can be added to an entire evaluation and/or a particular check when using FOREACH. This time value is the network time determined by the data, not necessarily the clock time; it follows the standard pipeline internal time, the same used for TIME WINDOW and other timeouts. The state will be aggregated, and data will time out according to TIME WINDOW, but the state value will not be compared against a threshold, preventing alerts from being sent and outputs from being created until the minimum amount of network time has passed.
When using primitives such as AVERAGE, RATIO, or PROPORTION, alerts may be more meaningful when a sufficient amount of time has passed. This allows the state value to settle, giving a more realistic picture of the network activity.
This feature can only be used with the following primitives:
RECORD COUNT, SUM, AVERAGE, DISTINCT, PROPORTION, and
RATIO.
This feature can be used in conjunction with Minimum Records before alerting. If both are used, both restrictions must be overcome to allow alerting.
The minimum amount of time requirement can be applied at two different levels:
The Evals and Stats section introduced the Analysis Pipeline concept of a statistic and described the settings that statistics share with evaluations. A statistic receives flow records from a filter, computes an aggregate value, and periodically reports that value.
STATISTICs no longer keep the last record that affects the state value, so there will not be any statistic updates in the regular legacy alert log file, only in the aux file.
There are two time values that affect statistics: how often to report the statistics, and the length of the time-window used when computing the statistics. The following example reports the statistics every 10 minutes using the last 20 minutes of data to compute the statistic:
UPDATE 10 MINUTES
TIME_WINDOW 20 MINUTES
Statistics support the aggregation functions from the primitives section. Unlike an evaluation, a statistic is simply reporting the function's value, and neither the CHECK statement nor a threshold value are used. Instead, the statistic lists the primitive and any parameters it requires.
Simple examples are:
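Two hedged sketches, with illustrative filter and field names:

STATISTIC outboundByteSum
FILTER outboundTraffic
SUM BYTES
UPDATE 10 MINUTES
TIME_WINDOW 20 MINUTES
SEVERITY 3
END STATISTIC

STATISTIC distinctDestinations
FILTER outboundTraffic
FOREACH SIP
DISTINCT DIP
UPDATE 1 HOUR
SEVERITY 3
END STATISTIC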
Minimum Number of Records before Updating
A minimum number of records requirement can be added to an entire statistic and/or at the primitive level when using FOREACH. The state will be aggregated, and data will time out according to TIME WINDOW, but the state value will not be compared against the update threshold, preventing updates from being sent and outputs from being created until the minimum number of records has been seen.
When the update interval is the same as the time window, if the minimum records requirement is not reached, the entire state will be cleared, and the record count will be reset to 0. This allows the user to only get updates when there are enough records to make it useful. If the time window is greater than the update interval, or set to forever, the number of records will keep accumulating even if the state isn't included in the update.
When using primitives such as AVERAGE or PROPORTION, updates may be more meaningful when a sufficient number of records has been processed. This allows the state value to settle, giving a more realistic picture of the network activity.
This feature can only be used with the following primitives:
RECORD COUNT, SUM, AVERAGE, DISTINCT, and PROPORTION.
The minimum number of records requirement can be applied at two different levels:
Even if FOREACH is used, the minimum records requirement will apply to the number of records seen by the entire statistic
STATISTIC statNonRCZeroRecs
FILTER minStatNonRC
RECORD COUNT
UPDATE 1 SECOND
DO NOT UPDATE UNTIL 9 RECORDS SEEN
END STATISTIC
The minimum records requirement can be applied to each state as determined by FOREACH
STATISTIC statNonRCZeroRecs
FILTER minStatNonRC
FOREACH SIP
RECORD COUNT
UPDATE 1 SECOND
DO NOT UPDATE UNTIL 9 RECORDS SEEN FOREACH
END STATISTIC
Named lists created by internal filters and evaluations can be given extra configuration such that they are responsible for sending updates and alerts independent of, or in lieu of, the mechanism that populates them. If there is a list configuration block, there does not need to be an evaluation block for the configuration file to be accepted; as long as something in pipeline generates alerts, it will run.
Lists created by internal filters have their own timeouts, so they are responsible for removing outdated elements on their own. Lists populated by evaluations rely on the evaluation to track when values time out and to tell the list to remove a value, so those lists know nothing of the timeouts. As a result, due to efficiency concerns, some of the alerting functionality described below is not available for lists created and maintained by internal filters. It is explicitly stated which features cannot be used.
This extra configuration is surrounded in a LIST CONFIGURATION block, similar to other pipeline mechanisms. The list to configure must already have been declared before the configuration block.
High Level Syntax:
LIST CONFIGURATION existingListName
options discussed below
END LIST CONFIGURATION
Alerts sent due to a list configuration come from the lists, and have their own timestamps and state kept about their alerts. They are not subject to the alerting restrictions imposed on the evaluations that populate the list.
The full contents of the list can be packaged into one alert, periodically:
UPDATE timeval
This will send out the entire list every 12 hours
UPDATE 12 HOURS
An alert can be sent if the number of elements in the list meets a certain threshold. While the contents are important and can be configured to be sent periodically, knowing that the count has gone above a threshold could be more time sensitive.
ALERT MORE THAN elementThreshold ELEMENTS
This alert will only be sent the first time the number of elements crosses the threshold. There can also be a reset threshold: if the number of elements drops below this value, pipeline will once again be allowed to send an alert when the number of elements exceeds the alert threshold. No alert is sent upon going below the reset threshold, and the elements in the list are not reset by it either.
ALERT MORE THAN elementThreshold ELEMENTS RESET AT resetThreshold
This example will send an alert if there are more than 10 elements in the list. No more alerts will be sent unless the number of elements drops below 5, after which pipeline will alert again if the number of elements goes above 10.
ALERT MORE THAN 10 ELEMENTS RESET AT 5
Pipeline can send an alert any time a value is removed from the list.
ALERT ON REMOVAL
Alerting on removal cannot be used by lists created by internal filters
In addition to being included in alerts, the contents of a list can be saved to disk in a specified file. The fields used in the list determine the way the data is saved. With each update sent, this file is completely overwritten with the current contents of the list; it is NOT appended to. To compute a diff of successive files, or to keep any sort of history, post-processing will need to be done outside of Pipeline.
If there is only a single field and it is of type IPV4_ADDRESS or IPV6_ADDRESS, it will be saved as an IPSet file. Any other type will be saved as a watchlist with one data value per line in ASCII format. If there are multiple fields, making a tuple, the data will be saved using the double square bracketed format used by "typeable tuples". For example, if the field list consists of SIP,SPORT, the format of the output file will be:
[[1.1.1.1, 80],[2.2.2.2,8080],[3.3.3.3,22]]
If the field(s) used have a textual representation that Pipeline can handle, these files can be used as watchlists with (NOT)IN_LIST in filters.
Lists used to hold SIP, DIP, or NHIP can be given a set of initial values by providing an IPSet file. As of Pipeline 5.6, any field can be used to seed lists, for both internal filters and evaluations.
SEED pathToSeedFile
Files of all types (regular watchlists, single bracketed, and tuple bracketed lists) can be used to seed lists. When seeding lists, the seeded values become owned by the lists; if they time out like other elements would, they will be removed from the list even though they were part of the seed file. Elements used to seed the list have no priority over elements pipeline adds during processing, and they are not permanent members of the list.
To specify the overwriting of the SEED file on update, use OVERWRITE_ON_UPDATE in the LIST CONFIGURATION block. Syntax:
SEED "path/to/seedFile.txt"
OVERWRITE ON UPDATE
To specify the file name for the list data to be written to on update, without seeding the list, use OUTPUT_FILE. The path to the file can be either relative or absolute. The file extension does not matter.
Syntax:
OUTPUT FILE "path/to/outputFile.txt"
The use of OUTPUT_FILE implies that Pipeline is to overwrite this file on update, so OVERWRITE_ON_UPDATE cannot be used with OUTPUT_FILE.
OUTPUT_FILE can be used in concert with SEED. If so, the list will be initialized with values from the SEED file, but the contents of the list will ONLY be written to the OUTPUT_FILE, and the SEED file will remain unchanged.
To have Pipeline write the contents of a list to a file at each update interval but not send an alert containing the contents, use WRITE FILE WITHOUT ALERTING on its own line. To use this, there must be an OUTPUT FILE.
If this is used with ALERT MORE THAN X ELEMENTS, or ALERT ON REMOVAL, those alerts are still sent.
As with evaluations, lists can be configured to shut down if they become filled with too many elements. This is provided as a sanity check to let the user know if the configuration has a flaw in the analysis. If the number of elements meets the shutdown threshold, an alert is sent, the list is freed, and it is disconnected from the mechanism that had been populating it.
SHUTDOWN MORE THAN shutdownThreshold ELEMENTS
As with evaluations, a severity level must be provided to give context to alerts. It is not used during processing, but included in alerts sent from the lists.
SEVERITY integerSeverity
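Putting these options together, a hedged sketch of a complete block (the list name, paths, thresholds, and severity are illustrative):

LIST CONFIGURATION myServers
UPDATE 12 HOURS
ALERT MORE THAN 10 ELEMENTS RESET AT 5
SEED "path/to/seedFile.txt"
OUTPUT FILE "path/to/outputFile.txt"
SEVERITY 4
END LIST CONFIGURATION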
Named lists and IPSet files can now be linked such that if an element is added to all of the lists in the bundle, Pipeline can send an alert and, if desired, add that element to another named list, which can be used in a LIST CONFIGURATION block as described above.
The lists referenced in the list bundle must already have been created in the configuration file. All lists must be made up of the same fields. An IPSet file can be added to the bundle, provided that the field for the lists is SIP or DIP, and the file name must be put in quotation marks.
High Level Syntax:
LIST BUNDLE listBundleName
list names and/or quoted IPSet file paths, one per line
END LIST BUNDLE
Each list to be added to the bundle goes on its own line. Each list must already have been created in the configuration file by an evaluation or internal filter. If a list is to be made from an IPSet file, it must be in quotes.
Once an element has been found to be in all of the lists in a bundle, it can be put into a new named list. This list can be used in LIST CONFIGURATION just like any other named list. There is no timeout needed for this, as the element will be removed from this list if it is removed from any of the lists in the bundle.
OUTPUT LIST nameOfNewList
As with evaluations, a severity level must be provided to give context to alerts. It is not used during processing, but included in alerts sent from the lists.
SEVERITY integerSeverity
As with evaluations, you can force the list bundle not to alert, for example when you just want the values that meet the qualifications of the list bundle to be put into another named list (using OUTPUT LIST above) and to get alerts of the contents that way. Just add DO NOT ALERT to the list of statements for the list bundle.
Let's say an evaluation creates a list named myServers, an internal filter creates a list called interestingIPs, and there is an IPSet file named notableIPS.set. To include these lists in a bundle, and to put any IP that is in all of the lists into a new list named reallyImportantIPs, use the following:
LIST BUNDLE myExampleBundle