CERT
Obfuscation of IP addresses using rwtuc

rwtuc is very useful for obfuscating data to protect privacy. One useful approach is to translate addresses into an unused address range. Three /8 CIDR blocks are easy to use:

   0.0.0.0/8 - IANA reserved addresses ("this network")
   10.0.0.0/8 - Private network addresses
   127.0.0.0/8 - Loopback addresses

The first two sometimes occur in network traffic (when private traffic is routed), but the last one will not be produced by the protocol stack on any of the common operating systems. It still sometimes occurs as a source address on the Internet, but this is crafted traffic.

There are three different ways to use these addresses. Subnet-preserving substitution translates subnets (at either the /16 or /24 level) into an obfuscated zone, but leaves the host information unchanged to allow structural analysis. Subnet-obfuscating substitution uses an arbitrary but fixed substitution for each host. This allows tracking consistent behavior at the host level (including matching of incoming and outgoing flows), but makes it difficult to track network structure (including tracking of dynamically allocated hosts). Host-random substitution uses an arbitrary and varying substitution for each occurrence of a host. This offers the most privacy protection, but it also blocks tracking consistent behavior at both the host and the network-structure level.

Even though the data is obfuscated, anonymity cannot be fully guaranteed. If your recipient knows where the data originates, and something about that network (such as the addresses of common servers on that network), they can leverage that information to reduce or eliminate address obfuscation at the subnet-preserving or subnet-obfuscating levels. There are other methods (such as comparing traffic in the released data against traffic the recipients capture on their network) that may reduce the address obfuscation.

As an example, this tip protects three networks containing a total of 10 hosts:

   0.1.2.0/24 -- production network
        0.1.2.1 -- production router
        0.1.2.5 -- production server
        0.1.2.7 -- production supervisory workstation

   0.1.3.0/24 -- office network
        0.1.3.1 -- office router
        0.1.3.4 -- secretarial workstation
        0.1.3.9 -- accounting database server

   0.1.4.0/24 -- border network
        0.1.4.1 -- border router
        0.1.4.5 -- email server
        0.1.4.7 -- dns server
        0.1.4.240 -- gateway to internal network

For subnet-preserving substitution, construct a simple sed script (see the Unix manual on sed(1) for more information). This example assumes the script is called "priv.sed", and contains:

priv.sed
   s/0\.1\.2\./127.0.1./g
   s/0\.1\.3\./127.0.2./g
   s/0\.1\.4\./127.0.3./g

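These commands simply substitute the network portion of the address at the /24 level into an obfuscated zone. Note that the patterns are not anchored, so they also match inside longer addresses (0\.1\.2\. matches within 10.1.2.5, for example); check that no such addresses occur in the data before relying on them. Now we can use this sed script with rwtuc to change flow information: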

  rwcut --fields=1-11,13-29 myflows.raw |
     sed -f priv.sed | rwtuc --sensor=1 >obflows.raw

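This obfuscates both the IP address fields (at the subnet level) and the sensor field: field 12 (the sensor) is left out of the rwcut output, and rwtuc --sensor=1 assigns every record the same sensor value.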
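
The same subnet-preserving translation can be sketched in Python, which makes it easier to match whole addresses only. This is a minimal sketch, not part of the original tip: the names SUBNET_MAP, ADDR, remap, and obfuscate_line are hypothetical, and it assumes IPv4 addresses written as plain dotted quads, as in default rwcut output.

```python
import ipaddress
import re

# Hypothetical mapping from each protected /24 to its obfuscated /24,
# mirroring the substitutions in priv.sed.
SUBNET_MAP = {
    ipaddress.ip_network("0.1.2.0/24"): ipaddress.ip_network("127.0.1.0/24"),
    ipaddress.ip_network("0.1.3.0/24"): ipaddress.ip_network("127.0.2.0/24"),
    ipaddress.ip_network("0.1.4.0/24"): ipaddress.ip_network("127.0.3.0/24"),
}

# A dotted quad that is not glued to further digits or dots, so that
# "10.1.2.5" is read as one address and never as "1" + "0.1.2.5".
ADDR = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?![\d.])")

def remap(match):
    """Move an address into its obfuscated subnet, keeping the host part."""
    text = match.group(0)
    try:
        ip = ipaddress.ip_address(text)
    except ValueError:
        return text  # not a valid address; leave it untouched
    for src, dst in SUBNET_MAP.items():
        if ip in src:
            offset = int(ip) - int(src.network_address)
            return str(dst.network_address + offset)
    return text  # address is outside the protected networks

def obfuscate_line(line):
    """Apply the subnet translation to every address on one text line."""
    return ADDR.sub(remap, line)
```

A filter built on this would read the rwcut output line by line, pass each line through obfuscate_line, and print the result for rwtuc to read.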

For subnet-obfuscating substitution, construct a similar sed script that substitutes whole host addresses, rather than just the network portion. This example assumes the script is called "priv2.sed" and contains the host addresses of interest and arbitrarily chosen substitutes:

priv2.sed
   s/0\.1\.2\.1/127.0.1.3/g
   s/0\.1\.2\.5/127.0.5.2/g
   s/0\.1\.2\.7/127.0.3.1/g
   s/0\.1\.3\.1/127.0.1.5/g
   s/0\.1\.3\.4/127.0.5.5/g
   s/0\.1\.3\.9/127.0.7.2/g
   s/0\.1\.4\.1/127.0.4.3/g
   s/0\.1\.4\.5/127.0.2.5/g
   s/0\.1\.4\.7/127.0.3.7/g
   s/0\.1\.4\.240/127.0.2.1/g

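These patterns are also unanchored: s/0\.1\.2\.1/ would match the start of 0.1.2.10 as well, so check that no protected address is a prefix of another address in the data. Again, we can use this sed script with rwtuc to change flow information: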

  rwcut --fields=1-11,13-29 myflows.raw \
     | sed -f priv2.sed | rwtuc --sensor=1 >ob2flows.raw
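
For larger data sets, a fixed per-host substitution can also be kept in a Python dictionary that is filled in as addresses are first seen. The following is a minimal sketch under stated assumptions: the HostMapper class and its method names are hypothetical, and unlike priv2.sed it substitutes every dotted-quad address it finds, not only the protected hosts.

```python
import random
import re

# A dotted quad that is not glued to further digits or dots.
ADDR = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?![\d.])")

class HostMapper:
    """Assign each distinct address one arbitrary, fixed 127.x.y.z substitute."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)  # seed=None gives a new mapping per run
        self._table = {}                 # original address -> substitute
        self._used = set()               # 24-bit values already handed out

    def substitute(self, address):
        if address not in self._table:
            # Draw until we find a value no other host has been given.
            while True:
                n = self._rng.randrange(1 << 24)
                if n not in self._used:
                    break
            self._used.add(n)
            self._table[address] = "127.%d.%d.%d" % (n >> 16, (n >> 8) & 255, n & 255)
        return self._table[address]

    def obfuscate_line(self, line):
        """Replace every address on the line with its fixed substitute."""
        return ADDR.sub(lambda m: self.substitute(m.group(0)), line)
```

Because the table lives only in memory, the same mapper instance has to process all of the data that should share one mapping; the dictionary lookup also avoids running one sed expression per host over every line.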

For host-random substitution, sed is not a good solution, because each occurrence of an address needs a different substitute. A fairly simple Python script can implement this substitution. Let's assume that this script is called "hostsub.py" and contains content such as:

hostsub.py
   #!/usr/bin/env python3

   import random
   import re
   import sys

   # Unseeded generator, so the substitutions differ on every run.
   r = random.Random()
   # Matches a dotted-quad IPv4 address.
   addr = re.compile(r"\d+\.\d+\.\d+\.\d+")

   def makeaddr(iaddr):
       # Spread a 24-bit integer over the low three octets of 127.0.0.0/8.
       fourth = iaddr % 256
       third = (iaddr // 256) % 256
       second = (iaddr // 65536) % 256
       return '127.' + str(second) + '.' + str(third) + '.' + str(fourth)

   def ipaddr(line):
       # Replace each address occurrence with an independently chosen
       # random address.
       return addr.sub(lambda m: makeaddr(r.randrange(16777216)), line)

   for line in sys.stdin:
       print(ipaddr(line.rstrip('\n')))

We can use this python script to obfuscate addresses:

  rwcut --fields=1-11,13-29 myflows.raw \
    | ./hostsub.py | rwtuc --sensor=1 > ob3flows.raw

Similar methods (either fixed substitution or random substitution) can be used to obfuscate ports and protocols if needed. To obfuscate dates, one can preserve interval relationships by mapping the earliest date to a known date (Jan 1, 1970 is popular) and determining further dates by interval since the earliest date, or again use a random substitution. Obfuscation of volume information (number of packets, number of bytes, or duration of flow) is rarely needed, but again either a fixed substitution or random substitution may be applied if required.
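
The interval-preserving date mapping described above can be sketched as follows. This is a minimal illustration, not part of the original tip: the function name shift_times and the constant EPOCH are hypothetical, and it assumes the flow times have already been parsed into timezone-aware datetime objects.

```python
from datetime import datetime, timezone

# The popular anchor date mentioned above: Jan 1, 1970 (UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def shift_times(times, anchor=EPOCH):
    """Shift all times so the earliest one lands on anchor.

    Every time moves by the same offset, so the intervals between
    any two times are unchanged. The times and the anchor must all be
    timezone-aware (or all naive) for the subtraction to work.
    """
    if not times:
        return []
    offset = min(times) - anchor
    return [t - offset for t in times]
```

Since the earliest date has to be known before any record can be shifted, this takes two passes over the data: one to find the minimum start time and one to rewrite the records.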

The amount of obfuscation applied directly limits the utility of the data in analysis, so apply only as much obfuscation as the privacy requirements demand.

1 Comment

  1. Sed works well if dealing with small examples (say, for publication). For other uses that require anonymization (say, for course usage), sed is just plain too slow, as it is O(n^2). Use a python script with a dictionary (associative array), which is SIGNIFICANTLY faster.  I'll be amending this tool tip accordingly when I get the time.

    In addition, there are several other relevant anonymization topics that should be included in this tip:

     - Fixed-value replacement (e.g., all sIP become 10.0.0.1, all dIP become 192.168.1.5)
     - Flow injection (adding manufactured records)
     - Flow deletion (removing records to break up flow patterns)
     - IP shifting (moving the reported source/dest of the records to make it more difficult to de-obfuscate)
     - Combined strategies