PySiLK is a Python extension for SiLK data objects. PySiLK requires Python v2.4, v2.5, or v2.6 and is available in SiLK 1.0.0 and later. PySiLK is used in two main ways:
- In a Python script, PySiLK provides the SiLK module (i.e., import silk), which contains classes to represent the common SiLK objects like records, files, and IPsets.
- PySiLK provides a --python-file switch to several of the tools in the SiLK tool suite to augment each tool's behavior. The argument to the --python-file switch is a file containing calls to PySiLK functions that can register new command line arguments and can modify each tool as listed here:
  - rwfilter: calls a complex Python function on all evaluated records to partition the records into the pass and fail streams
  - rwcut: defines new columns (fields) that can be displayed
  - rwgroup: defines new key fields for grouping (binning) the flow records
  - rwsort: defines new key fields for sorting the flow records
  - rwuniq: defines new key fields for binning the records and defines new "aggregate" value fields to be computed over each bin
  - rwstats: defines new key fields for binning the records and defines new "aggregate" value fields to be computed over each bin; in addition, the value field can be used to compute the top-N list
(As a special case of the rwfilter usage, PySiLK provides the --python-expr argument, which allows the user to partition the flow records using a short Python expression.)
The PySiLK documentation comes in three parts: one describes the objects that the PySiLK extension defines, one describes using PySiLK from the SiLK tools listed above, and one contains both of the above in a single document.
The SiLK module for Python contains several classes for representing common SiLK objects. For a complete list of the classes and associated methods, please refer to the PySiLK documentation.
Using the SiLK Module in a Python Script
Below is an example of a Python script that uses the SilkFile class to represent a SiLK file and the RWRec class to represent each individual record in the file. The script attempts to group all flows representing one direction of an FTP session and print them together. It takes as an argument the name of a file containing raw SiLK records sorted by start time and source port (rwsort --fields=stime,sport). The script extracts from the file all flows that potentially represent FTP traffic. We define a possible FTP flow as any flow where:
- the source port is 21 (FTP control channel)
- the source port is 20 (FTP data transfer port)
- both the source port and destination port are ephemeral (data transfer)
If a flow record has a source port of 21, the script adds its source and destination addresses to the list of possible FTP groups. The script classifies each data transfer flow (source port 20, or ephemeral to ephemeral) according to its source and destination IP address pair. If a control channel flow with the same source and destination IP addresses exists, the source and destination ports of the data flow are added to the list of ports associated with that control channel interaction; otherwise, the script lists the data transfer as unclassified. After the entire file is processed, all FTP sessions that have been grouped are displayed.
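The grouping step described above can be sketched without the SiLK module. This is a minimal illustration under stated assumptions, not the original groupFTP.py: plain (sip, dip, sport, dport) tuples stand in for RWRec objects, and the name group_ftp and the EPHEMERAL_MIN threshold of 1024 are invented for the example.

```python
# Illustrative stand-in for the grouping step. Real code would iterate over
# RWRec objects read from a SilkFile instead of plain tuples.
EPHEMERAL_MIN = 1024  # assumed lower bound for an "ephemeral" port


def group_ftp(flows):
    """Group data-transfer flows under the control-channel address pair."""
    sessions = {}       # (sip, dip) -> list of (sport, dport) data flows
    unclassified = []   # data flows with no matching control channel
    for sip, dip, sport, dport in flows:
        pair = (sip, dip)
        if sport == 21:
            # Control channel: open a group for this address pair.
            sessions.setdefault(pair, [])
        elif sport == 20 or (sport >= EPHEMERAL_MIN and dport >= EPHEMERAL_MIN):
            # Data transfer: attach to an existing group if there is one.
            if pair in sessions:
                sessions[pair].append((sport, dport))
            else:
                unclassified.append((sip, dip, sport, dport))
        # Any other flow is not a possible FTP flow and is ignored.
    return sessions, unclassified


flows = [
    ("10.0.0.1", "10.0.0.2", 21, 40000),     # control channel
    ("10.0.0.1", "10.0.0.2", 20, 40001),     # data from port 20
    ("10.0.0.1", "10.0.0.2", 50000, 40002),  # ephemeral-to-ephemeral data
    ("10.0.0.3", "10.0.0.4", 20, 40003),     # no control channel seen
    ("10.0.0.1", "10.0.0.2", 80, 40004),     # not FTP, ignored
]
sessions, unclassified = group_ftp(flows)
```

The sketch relies on the input ordering the script requires: a data flow can only be attached to a group if its control channel record has already been read.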
The --python-expr argument to rwfilter allows the user to specify a Python expression to be evaluated on each record. By doing so, the user can generate complex filtering rules without piping several rwfilter commands together or using intermediate files. For example, to isolate all illegally formatted ICMP messages without using PySiLK, we would use the following:
Using the --python-expr argument, we can reduce the filter to the following command. The expression uses the RWRec class to represent each record and evaluates to true if and only if the protocol is 1 and the ICMP type/code information does not correspond to a known type/code combination:
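To make the logic of such an expression concrete, the sketch below writes the same test as an ordinary Python function. It is not the original command. The Rec namedtuple is a stand-in for RWRec, its attribute names (protocol, icmptype, icmpcode) are assumed to mirror the RWRec attributes and should be checked against the PySiLK documentation, and the MAX_CODE table is a partial, illustrative list of type/code pairs.

```python
from collections import namedtuple

# Stand-in for the fields of an RWRec that the expression reads.
Rec = namedtuple("Rec", "protocol icmptype icmpcode")

# Partial, illustrative table: ICMP type -> highest valid code for that type.
MAX_CODE = {0: 0, 3: 15, 4: 0, 5: 3, 8: 0, 11: 1, 12: 2, 13: 0, 14: 0}


def is_malformed_icmp(rec):
    """True iff rec is ICMP and its type/code pair is not in the table."""
    if rec.protocol != 1:
        return False
    # Unknown types map to -1, so every code they carry is rejected.
    return rec.icmpcode > MAX_CODE.get(rec.icmptype, -1)
```

Collapsed into a single boolean expression, the body of this function is the kind of test that --python-expr evaluates once per record.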
In an MPI cluster deployment of SiLK, the --python-file argument to rwfilter works, but there are a few caveats:
- There is no "global" state: each node will have its own state independent of the other nodes.
- The finalize method will not work to print results.
- Your plug-in should not attempt to read or write any files, since the nodes do not have access to your home directory. However, the plug-in can import standard Python library files.
The --python-file argument allows the user to specify a file that contains Python function declarations. Unlike standalone PySiLK scripts (such as groupFTP.py, above), files called with the --python-file argument are not complete scripts. Instead, they call the register_filter() function, which takes as arguments one or two functions to be called by rwfilter:
The rwfilter(rec) function runs on each record and returns a true or false value indicating whether the record passes or fails the filter. (The conditions set by the PySiLK script are added to those specified in the rwfilter command that called it to determine the final pass/fail status.) The finalize() function runs after all records have been processed.
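The calling pattern can be illustrated with a small self-contained harness. Everything in the first half of the sketch is hypothetical: this register_filter and run_filter are local stand-ins that imitate the order in which rwfilter invokes the registered functions, and the real register_filter() signature may differ. Dictionaries stand in for RWRec objects.

```python
# Hypothetical harness: imitates the order in which rwfilter invokes the
# functions handed to register_filter(). It is not part of PySiLK.
_registered = {}


def register_filter(filter_func, finalize_func=None):
    """Stand-in that records the callbacks for later use."""
    _registered["filter"] = filter_func
    _registered["finalize"] = finalize_func


def run_filter(records):
    """Split records into pass and fail lists, then call the finalizer once."""
    passed, failed = [], []
    for rec in records:
        (passed if _registered["filter"](rec) else failed).append(rec)
    if _registered["finalize"] is not None:
        _registered["finalize"]()
    return passed, failed


# --- roughly what a --python-file script would contain ---
seen = {"count": 0, "finalized": False}


def rwfilter(rec):
    seen["count"] += 1
    return rec["dport"] == 25      # pass only flows to port 25


def finalize():
    seen["finalized"] = True       # a real script might report results here


register_filter(rwfilter, finalize)

passed, failed = run_filter([{"dport": 25}, {"dport": 80}, {"dport": 25}])
```

The per-record function sees every record once, and the finalizer runs a single time after the last record, which is why state accumulated across records is only inspected there.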
To demonstrate the use of --python-file in rwfilter, we walk through a Python script that evaluates the behavior of a set of IP addresses and determines whether each host is likely to be an SMTP server or relay. We expect (based on traffic studies) that more than 85% of a legitimate SMTP server's activity is devoted to sending or providing mail. If we find that a host exhibits this behavior, we include its IP address in a set called SMTP.set. Regardless of whether the IP address is included in the set, we pass all records that appear to be legitimate mail flows.
We run the rwfilter command as follows:
This command first collects all records of type out and outweb that have a start date of April 21, 2008. Since there are no additional command line options to filter records, all records are passed to the rwfilter(rec) function in SMTP.py. The argument rec is an instance of RWRec, which represents the record being passed.
SMTP.py begins by importing the global variable counts. counts is a dictionary indexed by source IP address; each entry is an array of size two, where the first element is the total number of bytes that the IP address has transferred and the second element is the number of those bytes that are likely to be related to mail delivery.
Using the source IP address from the record, the function retrieves the current byte counts from the counts dictionary. If this is the first occurrence of the IP address, a new entry is added. The function then adds the byte count of this record to the total byte count and determines if the record is a mail delivery message. If it is a mail message, the function adds the bytes to the total of bytes transferred as mail and returns true. Otherwise, a value of false is returned.
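The per-record bookkeeping can be sketched as follows. This is not the original SMTP.py. Dictionaries stand in for RWRec objects, and the test for "mail delivery" (TCP with port 25 on either side) is an assumption, since the actual criterion is not stated here.

```python
counts = {}          # source IP -> [total bytes, mail bytes]
MAIL_PORTS = (25,)   # assumption: only port 25 is treated as mail


def rwfilter(rec):
    """Update the byte counts for rec's source and pass mail flows only."""
    # Fetch the entry for this source, creating it on first occurrence.
    entry = counts.setdefault(rec["sip"], [0, 0])
    entry[0] += rec["bytes"]
    is_mail = rec["protocol"] == 6 and (
        rec["sport"] in MAIL_PORTS or rec["dport"] in MAIL_PORTS
    )
    if is_mail:
        entry[1] += rec["bytes"]
    return is_mail


records = [
    {"sip": "10.0.0.1", "protocol": 6, "sport": 25, "dport": 40000, "bytes": 900},
    {"sip": "10.0.0.1", "protocol": 6, "sport": 80, "dport": 40001, "bytes": 100},
    {"sip": "10.0.0.2", "protocol": 17, "sport": 53, "dport": 40002, "bytes": 60},
]
results = [rwfilter(r) for r in records]
```

Note that the total is updated for every record, while the mail total and the pass decision depend on the same test, so the two stay consistent.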
After rwfilter processes all records, it calls the finalize() function, which evaluates the collection of IP addresses. If the percentage of bytes that the host transferred in mail operations is greater than 85% of the total bytes transferred, the IP address is added to a final set of SMTP servers. The final set of SMTP servers is then saved to the SMTP.set file, and rwfilter exits.
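The threshold test in the final step can be sketched in isolation. The function name smtp_servers is hypothetical, and the sketch returns a sorted list of addresses rather than building a PySiLK IPset and saving it to SMTP.set, so that it runs without the SiLK module.

```python
THRESHOLD_PERCENT = 85


def smtp_servers(counts):
    """Return the addresses whose mail bytes exceed 85% of their total bytes."""
    servers = []
    for ip, (total, mail) in counts.items():
        # Integer arithmetic avoids rounding at the 85% boundary.
        if total > 0 and mail * 100 > total * THRESHOLD_PERCENT:
            servers.append(ip)
    return sorted(servers)


example_counts = {
    "10.0.0.1": [1000, 900],   # 90% mail: included
    "10.0.0.2": [1000, 850],   # exactly 85%: not "greater than", excluded
    "10.0.0.3": [500, 0],      # no mail: excluded
}
servers = smtp_servers(example_counts)
```

A host sitting exactly at 85% is excluded, matching the "greater than 85%" wording above.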