CERT

Introduction

PySiLK is a Python extension for SiLK data objects. PySiLK requires Python v2.4, v2.5, or v2.6 and is available in SiLK 1.0.0 and later. There are two main ways to use PySiLK:

  • In a Python script, PySiLK provides the SiLK module (i.e., import silk) which contains classes to represent the common SiLK objects like records, files, and IPsets.
  • PySiLK provides a --python-file switch to several of the tools in the SiLK tool suite to augment each tool's behavior. The argument to the --python-file switch is a file containing calls to PySiLK functions that can register new command line arguments and can modify each tool as listed here:

    rwfilter

    calls a complex Python function on all evaluated records to partition the records into the pass and fail streams

    rwcut

    defines new columns (fields) that can be displayed

    rwgroup

    defines new key fields for grouping (binning) the flow records

    rwsort

    defines new key fields for sorting the flow records

    rwuniq

    defines new key fields for binning the records and defines new "aggregate" value fields to be computed over each bin

    rwstats

    defines new key fields for binning the records and defines new "aggregate" value fields to be computed over each bin; in addition, the value field can be used to compute the top-N list

(As a special case of the rwfilter usage, PySiLK provides the --python-expr argument, which allows the user to partition the flow records using a short Python expression.)

Additional documentation:

PySiLK: SiLK in Python

describes the objects that the PySiLK extension defines

SiLK Python plug-in

describes using PySiLK from the SiLK tools listed above

PySiLK Reference Guide (HTML, PDF)

contains both of the above in a single document

SiLK Module

The SiLK module for Python contains several classes for representing common SiLK objects. For a complete list of the classes and associated methods, please refer to the PySiLK documentation.

Using the SiLK Module in a Python Script

Below is an example of a Python script that uses the SilkFile class to represent a SiLK file and the RWRec class to represent each individual record in the file. The script attempts to group all flows representing one direction of an FTP session and print them together. It takes as an argument the name of a file containing raw SiLK records sorted by start time and port number (rwsort --fields=stime,sport). The script extracts from the file all flows that potentially represent FTP traffic. We define a possible FTP flow as any flow where:

  • the source port is 21 (FTP control channel)
  • the source port is 20 (FTP data transfer port)
  • both the source port and destination port are ephemeral (data transfer)

If a flow record has a source port of 21, the script adds its source and destination address pair to the list of possible FTP groups. The script classifies each data transfer flow (source port 20 or ephemeral to ephemeral) according to its source and destination IP address pair. If a control channel flow with the same source and destination IP addresses exists, the flow's source and destination ports are added to the list of ports associated with that control channel interaction; otherwise, the script counts the data transfer as unclassified. After the entire file is processed, all FTP sessions that have been grouped are displayed.

groupFTP.py
#!/usr/bin/python2.4

# import the necessary modules
import silk
import sys


# Test that the argument count is correct; exit with a nonzero status if not
if (len(sys.argv) != 2):
  print "Must supply a SiLK data file."
  sys.exit(1)

# open the SiLK file for reading
rawFile=silk.SilkFile(sys.argv[1], silk.READ)

# Initialize the record structure
# "Unclassified" collects the data transfer flows
# that don't appear to have a control channel
interactions = {"Unclassified":[]}

# Count of records processed
count = 0

# Process the input file
for rec in rawFile:
    count += 1
    key="%15s <--> %15s"%(rec.sip,rec.dip)
    if (rec.sport==21):
      if (not interactions.has_key(key)):
        interactions[key] = []
    else:
      if (interactions.has_key(key)):
        interactions[key].append("%5d <--> %5d"%(rec.sport,rec.dport))
      else:
        interactions["Unclassified"].append(
            "%15s:%5d <--> %15s:%5d"%(rec.sip,rec.sport,rec.dip,rec.dport))


# Print the count of all records
print str(count) + " records processed"

# Print the groups of FTP flows
keyList = interactions.keys()
keyList.sort()
for key in keyList:

  print "\n" + key + " " + str(len(interactions[key]))
  if (key != "Unclassified"):
    for line in interactions[key]:
      print "   " + line
Example output of groupFTP.py
184 records processed

xxx.xxx.xxx.236 <--> yyy.yyy.yyy.231 3
      20 <--> 56180
      20 <--> 56180
      20 <--> 58354

Unclassified 158
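
The three criteria for a possible FTP flow listed earlier can be sketched as a standalone predicate. groupFTP.py as listed does not test these criteria itself, so this is an illustration rather than part of the script: the function name is_possible_ftp and the cutoff EPHEMERAL_MIN = 1024 are assumptions made for the example (sites often define the ephemeral range differently), and the ports are passed as plain integers so the sketch runs without PySiLK.

```python
# Hypothetical helper; not part of PySiLK or groupFTP.py.
# Assumption: any port >= 1024 is treated as ephemeral.
EPHEMERAL_MIN = 1024

def is_possible_ftp(sport, dport):
    """Return True if the port pair matches one of the three FTP criteria."""
    if sport == 21:          # FTP control channel
        return True
    if sport == 20:          # FTP data transfer port
        return True
    # data transfer between two ephemeral ports
    return sport >= EPHEMERAL_MIN and dport >= EPHEMERAL_MIN
```

Inside the record loop of a script such as groupFTP.py, a check like "if not is_possible_ftp(rec.sport, rec.dport): continue" could skip flows that match none of the criteria before they are classified.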

rwfilter --python-expr

The --python-expr argument to rwfilter allows the user to specify a Python expression to be evaluated on each record. By doing so, the user can generate complex filtering rules without piping several rwfilter commands together or using intermediate files. For example, if we would like to isolate all illegally formatted ICMP messages without using PySiLK, we would use the following:

rwfilter --start-date=2007/11/15 --protocol=1 --pass-destination=stdout | \
  rwfilter --input-pipe=stdin --icmp-type=1,2,7,19-29,37,38,41- \
  --pass-destination=badICMP1.rwf --fail-destination=stdout | \
  rwfilter --input-pipe=stdin \
  --icmp-type=0,4,6,8,10,13-18,30-35,36,39 --icmp-code=1- \
  --pass-destination=badICMP2.rwf --fail-destination=stdout | \
  rwfilter --input-pipe=stdin --icmp-type=5 --icmp-code=4- \
  --pass-destination=badICMP3.rwf --fail-destination=stdout | \
  rwfilter --input-pipe=stdin --icmp-type=9 --icmp-code=1-15 \
  --pass-destination=badICMP4.rwf --fail-destination=stdout | \
  rwfilter --input-pipe=stdin --icmp-type=11 --icmp-code=2- \
  --pass-destination=badICMP5.rwf --fail-destination=stdout | \
  rwfilter --input-pipe=stdin --icmp-type=12 --icmp-code=3- \
  --pass-destination=badICMP6.rwf --fail-destination=stdout | \
  rwfilter --input-pipe=stdin --icmp-type=40 --icmp-code=6- \
  --pass-destination=badICMP7.rwf

rwcat badICMP*.rwf --output-path=badICMPTotal.rwf

Using the --python-expr argument, we can reduce the filter to the following command. The expression uses the RWRec class to represent each record and evaluates to true if and only if the protocol is 1 and the ICMP type/code information does not correspond to a known type/code combination:

rwfilter --start-date=2007/11/15 --python-expr="(
  rec.protocol==1 and not(
    (rec.icmptype in
    [0,4,6,8,10,13,14,15,16,17,18,30,31,32,33,34,35,36,39] and rec.icmpcode < 1) or
    (rec.icmptype == 5 and rec.icmpcode < 4) or
    (rec.icmptype == 9 and rec.icmpcode in [0,16]) or
    (rec.icmptype == 11 and rec.icmpcode < 2) or
    (rec.icmptype == 12 and rec.icmpcode < 3 ) or
    (rec.icmptype == 40 and rec.icmpcode < 6) or
    (rec.icmptype == 3)
  ))"  --pass-destination=badICMPTotal.rwf
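
Because a long --python-expr is easy to get wrong, it can help to exercise the same logic in plain Python first. The sketch below copies the expression into a function and applies it to a stand-in record. The Rec tuple and the names is_bad_icmp and CODE_ZERO_TYPES are hypothetical; Rec only mimics the three RWRec attributes that the expression reads.

```python
from collections import namedtuple

# Stand-in for silk.RWRec with only the attributes the expression uses
Rec = namedtuple("Rec", ["protocol", "icmptype", "icmpcode"])

# ICMP types for which the expression accepts only code 0
CODE_ZERO_TYPES = [0, 4, 6, 8, 10, 13, 14, 15, 16, 17, 18,
                   30, 31, 32, 33, 34, 35, 36, 39]

def is_bad_icmp(rec):
    """Same test as the --python-expr above: ICMP with an unknown type/code."""
    return rec.protocol == 1 and not (
        (rec.icmptype in CODE_ZERO_TYPES and rec.icmpcode < 1) or
        (rec.icmptype == 5 and rec.icmpcode < 4) or
        (rec.icmptype == 9 and rec.icmpcode in [0, 16]) or
        (rec.icmptype == 11 and rec.icmpcode < 2) or
        (rec.icmptype == 12 and rec.icmpcode < 3) or
        (rec.icmptype == 40 and rec.icmpcode < 6) or
        (rec.icmptype == 3)
    )

print(is_bad_icmp(Rec(1, 8, 0)))   # echo request, code 0
print(is_bad_icmp(Rec(1, 8, 1)))   # echo request with a nonzero code
print(is_bad_icmp(Rec(6, 0, 0)))   # TCP, so the protocol test fails
```

The first and third calls print False and the second prints True, matching the intent that only ICMP records with an unrecognized type/code pair pass the filter.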

rwfilter --python-file

Warning

In an MPI cluster deployment of SiLK, the --python-file argument to rwfilter works but there are a few caveats:

  • There is no "global" state; each node will have its own state independent of the other nodes.
  • The finalize method will not work to print results.
  • Your plug-in should not attempt to read or write any files since the nodes do not have access to your home directory. However, the plug-in can import standard Python library files.

rwfilter's --python-file argument allows the user to specify a file that contains Python function definitions. Unlike standalone PySiLK scripts (such as groupFTP.py, above), files loaded with the --python-file argument are not complete scripts. Instead, the file calls the register_filter() function, passing it one or two functions for rwfilter to call: rwfilter(rec) and finalize(). The rwfilter(rec) function runs on each record and returns a true or false value indicating whether the record passes or fails the filter. The conditions set by the PySiLK script are combined with those specified on the rwfilter command line to determine the final pass/fail status. The finalize() function runs after all records have been processed.

To demonstrate the use of --python-file in rwfilter, we walk through a Python script that evaluates the behavior of a set of IP addresses and determines whether each host is likely to be an SMTP server or relay. We expect (based on traffic studies) that more than 85% of a legitimate SMTP server's activity is devoted to sending or providing mail. If we find that a host exhibits this behavior, we include its IP address in a set saved as smtp.set. Regardless of whether the IP address is included in the set, we pass all records that appear to be legitimate mail flows.

We run the rwfilter command as follows:

rwfilter --start-date=2008/4/21 --end-date=2008/4/21 --type=out,outweb \
     --sipset=possible_SMTP_servers.set --python-file=SMTP.py --print-statistics

This command first collects all records of type out and outweb that have a start date of April 21, 2008, and the --sipset switch restricts them to source addresses in possible_SMTP_servers.set. Since there are no other command line options to filter records, each of these records is passed to the rwfilter(rec) function in SMTP.py. rec is an instance of the RWRec class, which represents the record being passed.

The rwfilter(rec) function in SMTP.py begins by importing the global variables counts and smtpports. counts is a dictionary indexed by source IP address; each value is a two-element list, where the first element is the total number of bytes that the IP address has transferred and the second element is the number of those bytes that are likely to be related to mail delivery.

Using the source IP address from the record, the function retrieves the current byte counts from the counts dictionary. If this is the first occurrence of the IP address, a new entry is added. The function then adds the byte count of this record to the total byte count and determines if the record is a mail delivery message. If it is a mail message, the function adds the bytes to the total of bytes transferred as mail and returns true. Otherwise, a value of false is returned.

After rwfilter processes all records, it calls the finalize() function, which evaluates the collection of IP addresses. If mail operations account for more than 85% of the total bytes a host transferred, its IP address is added to a final set of SMTP servers. The final set of SMTP servers is then saved to the smtp.set file, and rwfilter exits.

SMTP.py
# By importing the silk module in this way we don't have to write silk.<function>
# (as we did in previous examples) when calling functions.
from silk import *

# Collection of ports commonly used by SMTP servers
smtpports = set([25, 109, 110, 143, 220, 273, 993, 995, 113])

# Minimum percentage of mail traffic before being considered a mail server
threshold = 0.85

# Collection of byte counts
counts = dict()

# This function is run over all records.
# Input:  An instance of the RWRec class representing the current record being
#         processed
# Output: True or false value indicating if the record passes or fails the filter
def rwfilter(rec):

    # Import the global variables needed for processing the record
    global smtpports, counts

    # Pull data from the record
    sip = rec.sip
    bytes = rec.bytes

    # Get a reference to the current data on the IP address in question
    data = counts.setdefault(sip, [0, 0])

    # Update the total byte count for the IP address
    data[0] += bytes

    # Is the flow mail related?  If so add the byte count to the mail bytes
    if (rec.protocol == 6 and rec.sport in smtpports and
        rec.packets > 3 and rec.bytes > 120):
        data[1] += bytes
        return True

    # If not mail related, fail the record
    return False


# This is run after all records have been processed
def finalize():

    # Import the global variables needed to evaluate the results
    global counts, threshold

    # The IP set of SMTP servers
    smtp = IPSet()

    # Iterate through all of the IP addresses.
    for ip, data in counts.iteritems():
        if (float(data[1]) / data[0]) > threshold:
            smtp.add(ip)

    # Generate the IPset of all smtp servers.
    smtp.save('smtp.set')

# Register these functions with rwfilter
register_filter(rwfilter, finalize=finalize)
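
The byte accounting and the 85% threshold can be tried without SiLK. The sketch below is a simplified stand-in, not the plug-in itself: records are plain tuples of (source address, bytes, is-mail flag), the addresses and byte counts are made up, and the names tally and mail_servers are hypothetical. A Python set takes the place of the IPSet.

```python
THRESHOLD = 0.85  # same cutoff as SMTP.py

def tally(records):
    """Accumulate [total_bytes, mail_bytes] per source address."""
    counts = {}
    for sip, nbytes, is_mail in records:
        data = counts.setdefault(sip, [0, 0])
        data[0] += nbytes
        if is_mail:
            data[1] += nbytes
    return counts

def mail_servers(counts, threshold=THRESHOLD):
    """Return the addresses whose mail share of bytes exceeds the threshold."""
    return set(ip for ip, (total, mail) in counts.items()
               if float(mail) / total > threshold)

# Made-up sample data: (source address, bytes, is the flow mail related?)
records = [
    ("192.0.2.10", 900, True),
    ("192.0.2.10", 100, False),
    ("192.0.2.20", 500, True),
    ("192.0.2.20", 500, False),
]
counts = tally(records)
servers = mail_servers(counts)
```

Here 192.0.2.10 sends 900 of its 1000 bytes as mail (a ratio of 0.90) and is kept, while 192.0.2.20 has a ratio of 0.50 and is left out.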