In Rayon, data is organized into datasets, which are ordered collections of columns. A column is an ordered collection of data of the same type. All columns within a dataset must have the same number of members. Data in a dataset can be accessed by column or by row; a row is the collection of data members at a particular position in each column of a dataset, in the order in which the columns are arranged.
A Dataset object may be created from data in memory, or from data streamed from disk or the network. A dataset may also be exported to a file or stream.
Dataset objects are created from the Toolbox object. The new_dataset_from_filename and new_dataset_from_stream methods create datasets from external data like files or network streams. The new_dataset_from_columns and new_dataset_from_rows methods create datasets from data in memory. To create an empty Dataset object, pass an empty list into either the new_dataset_from_columns or new_dataset_from_rows method.
For example, this:

# t is a Toolbox object
# c1, c2, and c3 are Column objects
d = t.new_dataset_from_columns([c1, c2, c3])
is equivalent to this:
d = t.new_dataset_from_columns([])
d.append_column(c1)
d.append_column(c2)
d.append_column(c3)
Headers can be added with the set_header method:
# Add header named "foo" with value "bar"
d.set_header("foo", "bar")
Column names can be added with the set_column_names method:
# d is a dataset object with 3 columns
d.set_column_names(["foo", "bar", "baz"])
Two Dataset objects are equal if all of the following are true:

- They contain an equal number of columns.
- Their lists of column names are identical (same length, contents and order).
- They have equal columns in the same order. Column equality is determined by comparing each Column object.
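The equality rules above can be sketched in plain Python. This is an illustration of the rules only, not Rayon's implementation; here a dataset is represented simply as a list of columns plus a list of column names:

```python
def datasets_equal(cols_a, names_a, cols_b, names_b):
    """Illustrates the Dataset equality rules: equal column count,
    identical column-name lists, and equal columns in the same order."""
    if len(cols_a) != len(cols_b):      # equal number of columns
        return False
    if names_a != names_b:              # same length, contents and order
        return False
    # equal columns, compared pairwise in order
    return all(ca == cb for ca, cb in zip(cols_a, cols_b))

print(datasets_equal([[1, 2], ['a', 'b']], ['n', 's'],
                     [[1, 2], ['a', 'b']], ['n', 's']))   # True
print(datasets_equal([[1, 2], ['a', 'b']], ['n', 's'],
                     [[1, 2], ['a', 'b']], ['n', 'x']))   # False
```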
The Dataset object supports row-major and column-major iteration. The default iterator is row-major; for r in d iterates over all the rows in dataset d in the current sort order. To iterate over the rows in native sort order (regardless of the current sort order), use Dataset.iter_ignore_sorted. To get an iterator over all the columns in a Dataset object, use Dataset.iter_columns.
The Dataset object does not support array-based direct access: d[0] will raise an error. To refer to a specific column or row in a dataset, use the Dataset.get_column or Dataset.get_row methods. To refer to a specific field in a dataset, get the appropriate row or column and access the field directly:
# d is a dataset with 10 rows
# and columns foo, bar and baz

# Gets value of column foo in 5th row
d.get_row(4).foo
# ...or...
d.get_column('foo')[4]
It is often convenient to work with data that is sorted in different ways, or with subsets of data. Sometimes, as with top n lists, we want to do both.
Datasets can be sorted using the Dataset.sort method. This method takes a key function which takes a Row as input, and returns a comparable value which is actually used in the sort. The Dataset.sort method changes the sort order of the Dataset object, which changes the order through which the data is iterated or accessed through the get_row method. It does not change the ordering of data on-disk; the native sort order (or insert order) can always be recovered by calling the Dataset.sort method with no arguments.
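The key-function contract can be illustrated with plain Python's sorted; the Row objects here are simulated with namedtuples, and Rayon's internals may differ:

```python
from collections import namedtuple

Row = namedtuple('Row', ['name', 'num_hits'])
native_order = [Row('a', 5), Row('b', 2), Row('c', 9)]

# The key function maps a row to a comparable value, just as the
# key argument to Dataset.sort does.
by_hits = sorted(native_order, key=lambda r: r.num_hits)
print([r.name for r in by_hits])       # ['b', 'a', 'c']

# Sorting does not destroy the native order; it is kept separately,
# analogous to recovering it with Dataset.sort() and no arguments.
print([r.name for r in native_order])  # ['a', 'b', 'c']
```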
The Dataset object’s filtering methods also take a function. The filtering function takes a Row object as input, and returns True if the row passes the filter, False if it fails. (As a shortcut, True is equivalent to lambda r: True and False is equivalent to lambda r: False; these are sometimes used in conjunction with the limit parameter to select a portion of the dataset.) The Dataset.filter_pass and Dataset.filter_fail methods are utility methods to simplify invocations to Dataset.filter; if both the passing and failing sets of data are of interest, they can be captured in a single pass through the data using Dataset.filter_both.
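The single-pass behavior of filter_both can be sketched over plain rows; this is an illustration of the idea, not Rayon's implementation:

```python
def filter_both(rows, predicate):
    """Sketch of Dataset.filter_both: split rows into (passing, failing)
    lists in a single pass over the data."""
    passing, failing = [], []
    for row in rows:
        (passing if predicate(row) else failing).append(row)
    return passing, failing

evens, odds = filter_both([1, 2, 3, 4, 5], lambda r: r % 2 == 0)
print(evens, odds)  # [2, 4] [1, 3, 5]
```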
To generate a top n list from a dataset, first sort the data on the desired column, then filter it with a limit parameter of n:
# Get the top n elements of d, sorted by num_hits
# (sort descending so the largest values come first)
d.sort(key=lambda r: -r.num_hits)
top_n_by_num_hits = d.filter_pass(True, limit=n)
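The same top-n selection can be sketched over plain rows; heapq.nlargest avoids a full sort when n is small. This is an illustration of the pattern only; in Rayon the work happens inside sort and filter_pass:

```python
import heapq

hits = [{'url': u, 'num_hits': h} for u, h in
        [('/a', 10), ('/b', 50), ('/c', 30), ('/d', 5)]]

n = 2
# nlargest takes the same kind of key function as Dataset.sort
top_n = heapq.nlargest(n, hits, key=lambda r: r['num_hits'])
print([r['url'] for r in top_n])  # ['/b', '/c']
```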
Datasets may be exported to a file, a stream, or a string:

>>> d.to_file("bar.txt")
>>> ostrm = open("bar2.txt", 'w')
>>> d.to_stream(ostrm)
>>> print d.to_string()
1|a
2|b
3|c
This section describes the format Rayon uses for data import and export. This format is fundamentally delimited text, with additional facilities for storing metadata with the dataset. The SiLK tools’ text output (if --no-titles and --no-columns are passed to the tool as options) is a subset of this format, and may be passed to Rayon unchanged.
Rayon’s default delimiter is the pipe character (|). The delimiting character must not appear in the input. An example of valid input is:

foo|1|2
bar|3|4
Whitespace around delimiters will be removed, so the following is equivalent to the above:
foo| 1| 2
bar| 3| 4
Lines containing only whitespace are also ignored, so this is equivalent to the above:

foo| 1| 2

bar| 3| 4
Lines beginning with an octothorpe (#) are comments and are ignored. Comments are recognized only at the start of a line; the Rayon data format does not support infix or postfix comments. In the following, the text # this is not will be interpreted as content:
# this is a comment
foo | a | a
bar | b | b # this is not
Headers are special comments at the top of the file (or beginning of the stream) that start with exactly two octothorpes (##). Headers are used to store metadata about a dataset, and may be accessed or changed using methods on the Dataset object.
A header contains a name, followed by a colon, followed by a value. There may be whitespace between the colon and either the name or value. Here is an example of a header:
## Description: This is an example dataset
foo| 1| 2
bar| 3| 4
This dataset contains a header named Description. The header has the value This is an example dataset.
Header names may contain upper- and lower-case letters, numbers, underscore (_) and hyphen (-). Header values may contain any ASCII character between 0x20 and 0x7e, inclusive. Certain header names are reserved, specifically those listed in Special Headers and names beginning with Rayon-; outside these restrictions, users may create arbitrary headers as they see fit. Header case is preserved, but header lookups are case-insensitive.
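The character-set rules above can be expressed as regular expressions. This is a sketch of the stated rules, not Rayon's own validation code:

```python
import re

# Header names: letters, digits, underscore, hyphen.
NAME_RE = re.compile(r'^[A-Za-z0-9_-]+$')
# Header values: printable ASCII, 0x20-0x7e inclusive.
VALUE_RE = re.compile(r'^[\x20-\x7e]*$')

def valid_user_header(name, value):
    """Check a user-supplied header against the documented rules."""
    if name.startswith('Rayon-'):   # reserved prefix
        return False
    return bool(NAME_RE.match(name) and VALUE_RE.match(value))

print(valid_user_header('Description', 'This is an example dataset'))  # True
print(valid_user_header('Rayon-Internal', 'x'))                        # False
print(valid_user_header('bad name', 'x'))                              # False
```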
The first line of data will terminate the processing of headers; any subsequent lines beginning with any number of octothorpes will be treated as a comment.
Some headers have special meaning when parsed. For instance, the Delimiter header may be used to change the delimiting character of the file:
## Delimiter: ,
foo,1,2
bar,3,4
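A minimal reader for this format might look like the following. It is a sketch of the rules described above, not Rayon's own parser: it honors the Delimiter header, strips whitespace around fields, skips comments and blank lines, and stops header processing at the first data line:

```python
def parse_rayon_text(lines):
    """Parse headers and data rows from Rayon-format text lines."""
    headers, rows = {}, []
    delim = '|'            # default delimiter
    in_headers = True
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue                       # blank lines are ignored
        if in_headers and stripped.startswith('##'):
            name, _, value = stripped[2:].partition(':')
            name, value = name.strip(), value.strip()
            headers[name] = value
            if name.lower() == 'delimiter':    # special header
                delim = value
            continue
        if stripped.startswith('#'):
            continue                       # ordinary comment
        in_headers = False                 # first data line ends headers
        rows.append([f.strip() for f in stripped.split(delim)])
    return headers, rows

hdrs, rows = parse_rayon_text([
    '## Delimiter: ,',
    '# a comment',
    'foo, 1, 2',
    'bar,3,4',
])
print(hdrs)  # {'Delimiter': ','}
print(rows)  # [['foo', '1', '2'], ['bar', '3', '4']]
```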
The following header names have special meanings:

- Delimiter: sets the character used to delimit fields in the input.
- Column-Names: specifies the names of the dataset’s columns.
Column names may be specified with the Column-Names header:
## Column-Names: label|value1|value2
foo|1|2
bar|3|4
This input will generate a dataset with the column names “label”, “value1” and “value2”, respectively. Column names can be used to access data in the dataset, and add readability to the on-disk representation. See the Dataset documentation for more details.
As a convenience, there is an alternate syntax for specifying column names: the last comment line before the first data row may specify the names of the columns in the dataset. If that line is delimited with the delimiting character and contains as many elements as the first data line of the file, its contents will be used as the names of the dataset columns. The following is equivalent to the previous example:
# label| value1| value2
foo|1|2
bar|3|4
As with headers, whitespace between the comment character and the column name designation will be ignored, but multiple comment characters will probably give unwanted results. Thus, the following is legal:
#label| value1| value2
foo|1|2
bar|3|4
The following is also legal; the dataset ignores whitespace surrounding column names:
#label|value1|value2
foo|1|2
bar|3|4
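The heuristic described above can be sketched as a standalone function (illustrative only; not Rayon's code): the last comment line before the first data row supplies column names when it splits into as many delimited fields as the data does:

```python
def names_from_comment(comment_line, first_data_line, delim='|'):
    """Return column names from the last comment line before the data,
    or None if the heuristic does not apply."""
    # Strip leading comment characters; note that multiple '#'
    # characters are also stripped here, which matches the warning
    # that they may give unwanted results.
    body = comment_line.lstrip('#')
    names = [f.strip() for f in body.split(delim)]
    fields = first_data_line.split(delim)
    return names if len(names) == len(fields) else None

print(names_from_comment('# label| value1| value2', 'foo|1|2'))
# ['label', 'value1', 'value2']
print(names_from_comment('# only-one-name', 'foo|1|2'))  # None
```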
A Column object represents a single column of data in a dataset. Column objects can be extracted from Dataset objects and recombined in different ways. It is also possible to iterate over the data in a Column and, for numeric data, to compute statistics such as the mean and variance.
Column objects can be treated like arrays:
>>> from rayon.toolbox import Toolbox
>>> t = Toolbox.for_file()
>>> raw = [[1,2,3], ['a', 'b', 'c']]
>>> d = t.new_dataset_from_columns(raw, colnames=['foo', 'bar'])
>>> c = d.get_column('bar')
>>> c[0]
'a'
>>> len(c)
3
>>> list(c)
['a', 'b', 'c']
However, Column objects are immutable, so this won’t work:
>>> c[0] = 'z'
...
TypeError: 'Column' object does not support item assignment
The following statistical methods are available:
# c is a Column object
c.max()              # Largest value
c.mean()             # Average value
c.min()              # Smallest value in column
c.percentile(25)     # 25th percentile value
c.sample_stdev()     # Sample standard deviation
c.sample_variance()  # Sample variance
c.stdev()            # Population standard deviation
c.variance()         # Population variance
c.uniq()             # List of all unique values
In general, if a statistical function is available from the Column object, it is better to use it than to compute it independently because the Column object will cache values and (where necessary) sorted order.
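The rationale is that many of these statistics share intermediate work; for example, variance needs the mean, and the standard deviation needs the variance. A cached design computes each intermediate once. A plain-Python sketch of the idea (not Rayon's Column class):

```python
import math

class CachingColumn:
    """Sketch of why cached statistics help: mean, variance and stdev
    share work, so each intermediate is computed at most once."""
    def __init__(self, values):
        self._values = tuple(values)   # immutable, like a Rayon Column
        self._mean = None
        self._variance = None

    def mean(self):
        if self._mean is None:
            self._mean = sum(self._values) / len(self._values)
        return self._mean

    def variance(self):                # population variance
        if self._variance is None:
            m = self.mean()            # reuses the cached mean
            self._variance = (sum((v - m) ** 2 for v in self._values)
                              / len(self._values))
        return self._variance

    def stdev(self):                   # population standard deviation
        return math.sqrt(self.variance())

c = CachingColumn([2, 4, 4, 4, 5, 5, 7, 9])
print(c.mean(), c.variance(), c.stdev())  # 5.0 4.0 2.0
```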
Rows are data containers representing one row in a dataset. They are returned from the get_row method and passed as an argument to the sorting and filtering functions used in sort, filter_pass and filter_fail methods.
The data in Row objects can be accessed either by index or, if the Dataset object contains column names, by name:
>>> from rayon.toolbox import Toolbox
>>> t = Toolbox.for_file()
>>> raw = [[1,2,3], ['a', 'b', 'c']]
>>> d = t.new_dataset_from_columns(raw, colnames=['foo', 'bar'])
>>> r = d.get_row(2)
>>> len(r)
2
>>> r[0]
3
>>> r[1]
'c'
>>> r['foo']
3
>>> r['bar']
'c'
>>> r.foo
3
>>> r.bar
'c'
Multiple values can be returned by passing a tuple of indices or names:
>>> r[(0, 1)]
(3, 'c')
>>> r[('foo', 'bar')]
(3, 'c')
>>> r[(0, 'bar')]
(3, 'c')
It is also possible to iterate over Row objects:
>>> tuple(x for x in r)
(3, 'c')
>>> tuple(r)
(3, 'c')
>>> list(r)
[3, 'c']
Like Column objects, rows are immutable:
>>> r.bar = "baz"
...
TypeError: 'Row' object does not support item assignment
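The access patterns shown above (by index, by name, by tuple of either, iteration, and immutability) can be sketched with a small read-only class. This is an illustration of the interface only; Rayon's Row is not implemented this way:

```python
class SketchRow:
    """Read-only row supporting index, name, and tuple access."""
    def __init__(self, names, values):
        # bypass our own __setattr__ during construction
        object.__setattr__(self, '_names', tuple(names))
        object.__setattr__(self, '_values', tuple(values))

    def _one(self, key):
        if isinstance(key, str):
            key = self._names.index(key)   # name lookup -> index
        return self._values[key]

    def __getitem__(self, key):
        if isinstance(key, tuple):         # tuple of indices/names
            return tuple(self._one(k) for k in key)
        return self._one(key)

    def __getattr__(self, name):           # r.foo style access
        if name in self._names:
            return self._one(name)
        raise AttributeError(name)

    def __setattr__(self, name, value):    # rows are immutable
        raise TypeError("'Row' object does not support item assignment")

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)

r = SketchRow(['foo', 'bar'], [3, 'c'])
print(r[0], r['bar'], r[(0, 'bar')], list(r))  # 3 c (3, 'c') [3, 'c']
```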
Usually, it is best if each scalar value in a dataset is a member of its own column. Occasionally, however, we want a single column that contains multiple scalar values. For example, if we are displaying a bar chart showing how many times each combination of n values was observed, the combination will need to be a single column containing multiple values.
To get the data in the right place, we must first import the data in the normal way. Say the data on disk represents connections to either TCP or UDP ports in a set of network traces, and looks like this:
# proto|port|network|count
TCP|8080|A|1009
UDP|8080|A|1001388
TCP|25|A|4396
TCP|53|B|230
UDP|25|A|4
...
If we simply plot port against, say, count, TCP and UDP ports 8080 will be plotted in the same place, which is probably not what we want. Since each observation is of the form “W connections to port X via protocol Y on network Z”, we might very well wish we had a 2-column dataset, where one column was count and the other was a composite of proto, port and network.
To get this we first import the data normally, then extract a special Column object containing the proto, port and network columns as a single unit, using the meld method. We can then put this in a new dataset:
# t is a Toolbox object
d = t.new_dataset_from_filename("our-data.txt")
key = d.meld('proto', 'port', 'network')
count = d.get_column('count')
d2 = t.new_dataset_from_columns(
    [key, count], colnames=['key', 'count'])
The key column is now made up of tuples containing the original Column objects’ data:
>>> key[0]
("TCP", 8080, "A")
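What meld does conceptually can be sketched with zip over plain lists; this is an illustration only, since the real method returns a Column object:

```python
proto   = ['TCP', 'UDP', 'TCP']
port    = [8080, 8080, 25]
network = ['A', 'A', 'A']

# A melded column: one tuple per row, combining the three
# source columns positionally.
key = list(zip(proto, port, network))
print(key[0])            # ('TCP', 8080, 'A')

# The composite keys now distinguish TCP/8080 from UDP/8080,
# so they will no longer plot in the same place.
print(key[0] != key[1])  # True
```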
A melded column can be split back into its original columns with the flatten method:

# d2 is the dataset created above
d3 = d2.flatten(
    new_colnames=['proto', 'port', 'network'])
d3.append_column(d2.get_column('count'), 'count')