super_mediator De-Duplication Tutorial

super_mediator has always been capable of performing de-duplication of DNS resource records: If enabled, super_mediator collects all DNS resource records captured and exported by yaf and caches unique name, value pairs for A, AAAA, NS, CNAME, PTR, SOA, MX, TXT, SRV, NX, and particular DNSSEC records.

The 1.1.0 release of super_mediator extended the capability to other types of deep packet inspection (DPI) data. (This is sometimes called general de-duplication to distinguish it from the DNS and TLS/SSL de-duplication that super_mediator also supports.) For example, it may be of interest to determine what user agent string a particular IP address used at any point in time or to identify unique IP, service-string pairs on your network. This type of information can assist in fingerprinting hosts or identifying vulnerable systems on the network.

The de-duplication feature discussed in this tutorial is different from the DEDUP_PER_FLOW de-duplication that is performed within each flow record and discussed in the super_mediator tutorial.

super_mediator de-duplication configured within the DEDUP_CONFIG block is done per IP address (by default it uses the source IP). super_mediator caches all unique IP, data pairs. This unique tuple is only flushed from memory when either the tuple is seen a certain number of times (set by MAX_HIT_COUNT) or the tuple has not been seen within a certain time window (set by FLUSH_TIMEOUT).

Configuration File

De-duplication of all data types must be configured through the super_mediator.conf configuration file. The DEDUP_CONFIG block must be associated with a single EXPORTER block.

If the EXPORTER is a TEXT exporter, the PATH defined in the EXPORTER block must be a valid directory. For each PREFIX line present within the DEDUP_CONFIG block, a separate file will be created with the file name prefix defined on the PREFIX line. The EXPORTER will only perform deduplication; no flow records are exported.

For a JSON EXPORTER, the PATH in the EXPORTER block must be a file, and both flow records and de-duplicated records are written to that file. The key for a flow record is "flow" and the key for a deduplicated record is "dedup".

For an IPFIX FILE EXPORTER, the PATH in the EXPORTER block is the file name to export data to. Both flow records and de-duplicated records are written to the file.

To have a JSON or IPFIX EXPORTER only write de-duplicated records, add DEDUP_ONLY to the EXPORTER block.

As usual, if ROTATE is present in the EXPORTER block for JSON or IPFIX, the PATH will be the file prefix used and the date and serial number will be appended to the file prefix in the form -YYYYMMDDHHMMSS-SSSSS.med.

Examples

Below are four configuration file and data examples of an EXPORTER that has a DEDUP_CONFIG block associated with it.

TEXT Exporter

The following is an example configuration for a TEXT exporter:

EXPORTER TEXT ROTATING_FILES "dedup"
    PATH "/data/dedup"
    ROTATE_INTERVAL 300
    LOCK
EXPORTER END

DEDUP_CONFIG "dedup"
    PREFIX "useragent" [111]
    PREFIX "server" DIP [110, 171]
    PREFIX "host" SIP [117]
    PREFIX "dns" [179]
DEDUP_CONFIG END

By default, the source IP address is cached, along with the data value identified by the CERT private enterprise information element ID(s) defined within the square brackets on the PREFIX line. For example, the first line in the DEDUP_CONFIG block de-duplicates based on the source IP, HTTP User-Agent (httpUserAgent) pair and stores those in /data/dedup/useragent-YYYYMMDDHHMMSS-SSSSS.txt.

Some information elements are associated with the server, and using the destination IP address makes more sense. For these cases, DIP must appear between the name and the element ID list. For example, the second line performs two de-duplicatations, one on destination IP, HTTP server response header (httpServerString) pairs and another on destination IP, SSH version number (sshVersion) pairs; both pairs are written to /data/dedup/server-YYYYMMDDHHMMSS-SSSSS.txt.

The third line above de-duplicates the source IP and HTTP host header (httpHost); the keyword SIP is allowed but unnecessary.

The fourth line de-duplicates source IP and the DNS query name (dnsName). The only information element ID valid for DNS is 179. To de-duplicate on DNS responses, refer to the super_mediator.conf manual page, specifically the DNS_DEDUP block.

Also note that general deduplication is limited for records containing TLS/SSL DPI. The only information element ID valid for TLS/SSL is 244 (sslCertSerialNumber). See the SSL de-duplication article for more information.

As a reminder, normal flow records are not written in this configuration. If that is desired, a separate EXPORTER block must be added.

The CSV data format for de-duplicated data values configured in the DEDUP_CONFIG block is:

first_seen | last_seen | IP_address | flowkeyhash | count | data
first_seen

The flowStartMilliseconds of the first flow that contained this unique pair. first_seen is a timestamp in the form "2012-01-23 04:45:13.897."

last_seen

The flowStartMilliseconds of the last flow before the record was flushed that contained this unique pair. last_seen is a timestamp in the form "2012-01-23 04:45:13.897."

IP_address

The IP address used in the unique key. By default this is the sourceIPv4Address or sourceIPv6Address of the flow. This behavior can be changed by adding "DIP" to the PREFIX line in the DEDUP_CONFIG block.

flowkeyhash

The 32-bit hash of the 5-tuple + vlan of the last flow that contained this unique pair. This value can be used to pivot into a PCAP data repository, if available. See this YAF PCAP tutorial for more information.

count

The number of times this unique pair was seen within the first_seen, last_seen time period.

data

The value retrieved from the incoming IPFIX stream identified by (one of) the information element ID defined in the square brackets on the PREFIX line within the DEDUP_CONFIG block. Be aware when using multiple IDs on a PREFIX line that CSV output file does not indicate which element was matched.

TLS/SSL de-duplicated data has a slightly different format. See this article for more information about TLS/SSL de-duplication.

See the following example data (lines wrapped for readability):

$ head -n 5 /data/dedup/host.20110128220025.txt
2011-01-28 21:45:34.904|2011-01-28 21:45:34.995|10.10.1.60|2640424260|4|www.google.com
2011-01-28 21:45:37.636|2011-01-28 21:45:37.636|10.10.0.205|696160288|1|api.twitter.com
2011-01-28 21:45:27.349|2011-01-28 21:45:43.933|10.10.0.196|2697798766|2|www.funtrivia.com
2011-01-28 21:45:40.508|2011-01-28 21:45:40.508|10.11.0.139|3092428228|1|ajax.googleapis.com
2011-01-28 21:46:51.836|2011-01-28 21:46:51.836|10.10.1.33|741168033|1|mirror.liberty.edu

$ head -n 5 /data/dedup/useragent.20110128220025.txt
2011-01-28 21:45:34.904|2011-01-28 21:45:34.995|10.10.1.60|2640424260|4|Mozilla/5.0 \
(X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko/2009033100 Ubuntu/9.04 (jaunty) Firefox/3.0.8
2011-01-28 21:45:37.636|2011-01-28 21:45:37.636|10.10.0.205|696160288|1|TwitterAndroid/1.0.5 \
(109) Nexus One/8 (HTC;passion)
2011-01-28 21:46:23.366|2011-01-28 21:46:23.426|10.13.0.63|3301408776|2|urlgrabber/3.9.1 yum/3.2.28
2011-01-28 21:46:06.001|2011-01-28 21:47:06.736|10.11.0.139|671893917|4|OpenTable/3.2 \
CFNetwork/485.12.7 Darwin/10.4.0
2011-01-28 21:47:06.458|2011-01-28 21:47:15.766|10.13.0.65|2005639129|6|ChessWithFriendsPaid/3.07 \
CFNetwork/485.12.7 Darwin/10.4.

$ head -n 5 /data/dedup/dns.20110128220025.txt
2011-01-28 21:50:31.285|2011-01-28 21:50:31.285|10.10.1.60|1463750989|1|suggestqueries.google.com.
2011-01-28 21:50:31.285|2011-01-28 21:50:31.285|10.10.1.60|739184970|1|id.google.com.
2011-01-28 21:50:36.205|2011-01-28 21:50:36.205|10.10.0.251|428413069|1|14-courier.push.apple.com.
2011-01-28 21:50:36.205|2011-01-28 21:50:36.205|10.10.1.31|1604129129|1|reviews-cdn.northerntool.com.
2011-01-28 21:50:36.205|2011-01-28 21:50:36.205|10.10.1.31|1247744361|1|answers.northerntool.com.

In the above examples, yaf generated an IPFIX file from a large PCAP file captured at a conference in 2011. super_mediator used the IPFIX file as input and de-duplicated unique IP, data pairs defined in the DEDUP_CONFIG block. Perhaps you are curious about the particular DNS query made by IP 10.10.1.60 to "id.google.com" and would like to see the PCAP of this particular DNS transaction. You can simply provide the flow key hash to yaf to generate the PCAP file for this particular flow:

$ yaf --in big.pcap --no-output --pcap=mydns.pcap --max-payload=2000 \
--hash=739184970 --verbose

$ tcpdump -r mydns.pcap
reading from file mydns.pcap, link-type EN10MB (Ethernet)
17:45:29.409353 IP 10.10.1.60.53168 > resolver1.level3.net.domain: 28965+ A? id.google.com. (31)
17:45:29.412611 IP resolver1.level3.net.domain > 10.10.1.60.53168: 28965 5/0/0 \
CNAME id.l.google.com., A 72.14.204.101, A 72.14.204.102, A 72.14.204.113, A 72.14.204.100 (114)

JSON Exporter

The following is an example configuration for a JSON exporter:

EXPORTER JSON SINGLE_FILE "dedup"
    PATH "/data/dedup"
EXPORTER END

DEDUP_CONFIG "dedup"
    PREFIX "useragent" [111]
    PREFIX "server" DIP [110, 171]
    PREFIX "host" [117]
    PREFIX "dns" [179]
    MAX_HIT_COUNT 10000
    FLUSH_TIMEOUT 600
DEDUP_CONFIG END

JSON exporters configured with a DEDUP_CONFIG block use the top-level key "dedup" for the de-duplicated records. The same columns defined above for TEXT (CSV) export will be present in the JSON record. super_mediator will use the PREFIX name for the data keyword.

$ cat /data/dedup/host.txt
{"dedup":{"firstSeen":"2011-01-28 21:46:13.580","lastSeen":"2011-01-28 21:47:45.852",\
"sourceIPv4Address":"10.13.0.70","yafFlowKeyHash":574225501,"smDedupHitCount":8,\
"host":"search.twitter.com"}}
{"dedup":{"firstSeen":"2011-01-28 21:47:46.994","lastSeen":"2011-01-28 21:47:46.994",\
"sourceIPv4Address":"10.10.1.59","yafFlowKeyHash":3289550532,"smDedupHitCount":1,\
"host":"ocsp.godaddy.com"}}
{"dedup":{"firstSeen":"2011-01-28 21:47:47.073","lastSeen":"2011-01-28 21:47:47.073",\
"sourceIPv4Address":"10.10.1.59","yafFlowKeyHash":2310975606,"smDedupHitCount":1,\
"host":"en-us.fxfeeds.mozilla.com"}}
{"dedup":{"firstSeen":"2011-01-28 21:47:47.073","lastSeen":"2011-01-28 21:47:47.073",\
"sourceIPv4Address":"10.10.1.59","yafFlowKeyHash":2310910070,"smDedupHitCount":1,\
"host":"fxfeeds.mozilla.com"}}
{"dedup":{"firstSeen":"2011-01-28 21:47:30.433","lastSeen":"2011-01-28 21:47:31.174",\
"sourceIPv4Address":"10.10.1.4","yafFlowKeyHash":1147488830,"smDedupHitCount":2,\
"host":"chibis.adotube.com"}}

IPFIX File Exporter

The following is an example configuration for a IPFIX file exporter:

EXPORTER IPFIX ROTATING_FILES "dedup"
    PATH "/data/dedup"
    ROTATE_INTERVAL 120
    LOCK
EXPORTER END

DEDUP_CONFIG "dedup"
    PREFIX "useragent" [111]
    PREFIX "server" DIP [110, 171]
    PREFIX "host" [117]
    PREFIX "dns" [179]
DEDUP_CONFIG END

The IPFIX file uses a separate template for each de-duplicated element. Here are two of the templates (output produced by ipfixDump):

--- Template Record --- tid:   261 (0x0105), fields: 9, scope: 0, name: md_dedup_httpUserAgent ---
  monitoringIntervalStartMilliSeconds        (359) <millisec>     [8]
  monitoringIntervalEndMilliSeconds          (360) <millisec>     [8]
  flowStartMilliseconds                      (152) <millisec>     [8]
  smDedupHitCount                       (6871/929) <uint64>       [8]
  sourceIPv6Address                           (27) <ipv6>        [16]
  sourceIPv4Address                            (8) <ipv4>         [4]
  yafFlowKeyHash                        (6871/106) <uint32>       [4]
  observationDomainName                      (300) <string>   [65535]
  httpUserAgent                         (6871/111) <string>   [65535]

--- Template Record --- tid:   258 (0x0102), fields: 9, scope: 0, name: md_dedup_httpHost ---
  monitoringIntervalStartMilliSeconds        (359) <millisec>     [8]
  monitoringIntervalEndMilliSeconds          (360) <millisec>     [8]
  flowStartMilliseconds                      (152) <millisec>     [8]
  smDedupHitCount                       (6871/929) <uint64>       [8]
  sourceIPv6Address                           (27) <ipv6>        [16]
  sourceIPv4Address                            (8) <ipv4>         [4]
  yafFlowKeyHash                        (6871/106) <uint32>       [4]
  observationDomainName                      (300) <string>   [65535]
  httpHost                              (6871/117) <string>   [65535]

super_mediator can read the IPFIX file it created:

$ super_mediator -o - -m TEXT /data/dedup-20150630135338-00000.med
2011-01-28 21:46:06.001|2011-01-28 21:47:06.736|10.11.0.139|671893917|4|OpenTable/3.2 \
CFNetwork/485.12.7 Darwin/10.4.0
2011-01-28 21:45:40.508|2011-01-28 21:47:04.890|10.11.0.139|3000427195|10|www.opentable.com
2011-01-28 21:47:06.736|2011-01-28 21:47:06.736|10.11.0.139|671893917|1|data.flurry.com
2011-01-28 21:47:08.890|2011-01-28 21:47:08.890|10.13.0.65|1959568909|1|ads.mobclix.com
2011-01-28 21:47:06.977|2011-01-28 21:47:08.985|10.13.0.65|2005835719|3|newtoyinc.com
2011-01-28 21:47:09.036|2011-01-28 21:47:09.036|10.13.0.65|379960495|1|s.mobclix.com

TCP Exporter

The following is an example configuration for a TCP exporter:

EXPORTER IPFIX TCP "dedup"
   HOSTNAME "localhost"
   PORT 18000
EXPORTER END

DEDUP_CONFIG "dedup"
    PREFIX "useragent" [111]
    PREFIX "server" DIP [110, 171]
    PREFIX "host" [117]
    PREFIX "dns" [179]
DEDUP_CONFIG END

To collect this information via TCP, another super_mediator can be used to listen for TCP connections on port 18000:

$ super_mediator -o - -m TEXT --ipfix-input=TCP --ipfix-port=18000 localhost
2011-01-28 21:59:00.189|2011-01-28 21:59:00.189|10.10.0.188|970020836|1|securityd \
(unknown version) CFNetwork/485.2 Darwin/10.3.1
2011-01-28 21:59:06.491|2011-01-28 21:59:12.815|10.10.0.188|4017770420|19|Apple%20Store/1.2.1 \
CFNetwork/485.2 Darwin/10.3.1
2011-01-28 21:59:12.815|2011-01-28 22:00:06.234|10.10.0.188|4016590772|4|CMC
2011-01-28 21:59:12.731|2011-01-28 22:00:25.761|10.10.0.188|3069422788|9|Apple iPhone v8A306 Maps v4.0.1
2011-01-28 21:58:57.019|2011-01-28 21:58:57.019|10.13.0.72|4253820902|2|Microsoft-CryptoAPI/5.131.2600.5512

MERGE_TRUNCATED

Sometimes the values in the DPI fields produced by yaf are truncated. This can occur because yaf collects a limited amount of payload data per flow record (as determined by the --max-payload option). It can also occur when reading a PCAP file if the snaplen argument is not large enough. Finally, yaf is limited by the amount of data it exports as DPI; these limits are set with the per_field_limit and per_record_limit values in the yafDPIRules.conf file.

To accommodate this truncation, super_mediator provides the MERGE_TRUNCATED keyword. When the DEDUP_CONFIG block contains that keyword, super_mediator treats DPI values as equal if they begin with the same substring.

For example, the following is the output for a TEXT EXPORTER that does not have the MERGE_TRUNCATED keyword present:

2011-01-28 21:46:40.534|2011-01-28 21:46:40.534|10.13.0.69|3132439844|1|Mozilla/5.0 (Macintosh; U; \
Intel Mac OS X 10_6_6; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.1
2011-01-28 21:47:49.400|2011-01-28 21:47:49.400|10.13.0.69|3038288354|1|Mozilla/5.0 (Macintosh; U; \
Intel Mac OS X 10_6_6; en-us) Apple
2011-01-28 21:47:49.400|2011-01-28 21:47:49.400|10.13.0.69|3038484962|1|Mozilla/5.0 (Macintosh; U; \
Intel Mac OS X 10_6_6; en-us) AppleWebKit/53
2011-01-28 21:47:49.401|2011-01-28 21:47:49.401|10.13.0.69|3038419426|1|Mozilla/5.0 (Macintosh; U; \
Intel Mac OS X 10_6_6; e
2011-01-28 21:47:49.402|2011-01-28 21:47:49.402|10.13.0.69|3038747106|1|Mozilla/5.0 (Macintosh; U;
2011-01-28 21:46:40.519|2011-01-28 21:47:49.402|10.13.0.69|3038616034|26|Mozilla/5.0 (Macintosh; U; \
Intel Mac OS X 10_6_6; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4

Using the MERGE_TRUNCATED keyword will collapse all of the above records into:

2011-01-28 21:46:40.519|2011-01-28 21:47:49.402|10.13.0.69|3038616034|31|Mozilla/5.0 (Macintosh; U; \
Intel Mac OS X 10_6_6; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4