Data

This chapter describes the objects Rayon provides for importing, exporting and manipulating data. These objects were not designed for general-purpose numeric analysis on large datasets, but rather for moving data into and out of Rayon for visualization with reasonable efficiency.

For more information on datasets (including the syntax of Rayon’s text-based data format), see Manipulating Data in Building Visualizations.

Datasets

toolbox.new_dataset_from_stream(stream[, colnames : list, typemap : dict, delimiter="|"])
toolbox.new_dataset_from_filename(filename[, colnames : list, typemap : dict, delimiter="|"])
toolbox.new_dataset_from_columns(columns[, colnames : list, header : list, sortkey : function])
toolbox.new_dataset_from_rows(rows[, colnames : list, header : list, sortkey : function])

Dataset objects are collections of columnar data, represented by Column objects. One can iterate over the data either row-wise or column-wise; Row objects provide a row-based view of a Dataset object’s data. Dataset objects may be created using the new_dataset_from_filename (for files on disk), new_dataset_from_stream (for file or file-like objects, such as StringIO.StringIO objects), new_dataset_from_columns or new_dataset_from_rows (for in-memory sequences) methods of the Toolbox object.

Dataset objects are mutable; new columns can be added or removed. Since Column objects are immutable, however, rows cannot be modified. It is also possible to create new Dataset objects containing filtered subsets of an existing object’s data (see filter_pass, filter_fail and filter_both below).

Dataset objects may also be non-destructively sorted by adding sort key functions. The native sort order of the object is the order of the underlying columns.

Metadata about Dataset objects is stored in key-value pairs called headers. The keys and values for headers must both be strings. Headers can be added, removed or modified as desired.
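
For example, a minimal sketch of header manipulation using the set_header, get_header and del_header methods documented below (the header name and value here are arbitrary):

>>> # d is a Dataset object
>>> d.set_header('source', 'example.txt')
>>> d.get_header('source')
'example.txt'
>>> d.del_header('source')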

append_column(new_col : Column or iterable[, name : str])

Append new_col to this object. new_col must be either a Column object or an iterable sequence (list, tuple, iterator, etc.).

If this Dataset object already contains columns, new_col must have the same number of elements as the existing columns. new_col is appended to the “right” of the other columns, e.g.:

>>> # t is a Toolbox object
>>> raw = [[1,2,3], ['a', 'b', 'c'], [3,2,1]]
>>> d = t.new_dataset_from_columns(raw)
>>> d.append_column([4,5,6])
>>> d.get_num_columns()
4
>>> [i for i in d.get_column(3)]
[4, 5, 6]

It is an error to append a column to a Dataset object without a column name if the Dataset object already contains column names:

>>> # t is a Toolbox object
>>> raw = [[1,2,3], ['a', 'b', 'c'], [3,2,1]]
>>> colnames = ['foo', 'bar', 'baz']
>>> d = t.new_dataset_from_columns(raw, colnames=colnames)
>>> d.append_column([4,5,6])
RayonDataException: must specify a column name

The column will be inserted in native sort order, regardless of the current sort order:

>>> # t is a Toolbox object
>>> raw = [[1,2,3], ['a', 'b', 'c'], [3,2,1]]
>>> d = t.new_dataset_from_columns(raw)
>>> d.sort(lambda r: r[2])
>>> d.append_column([4,5,6])
>>> [[i for i in r] for r in d]
[[3, 'c', 1, 6], [2, 'b', 2, 5], [1, 'a', 3, 4]]

To append a column to a sorted Dataset object and have it appear in sorted order, first clone the Dataset object using filter_pass (passing True as the filter and insert_sorted=True, so the clone’s native order is the parent’s current sort order), then append the column to the new Dataset:

>>> # t is a Toolbox object
>>> raw = [[1,2,3], ['a', 'b', 'c'], [3,2,1]]
>>> d = t.new_dataset_from_columns(raw)
>>> d.sort(lambda r: r[2])
>>> d2 = d.filter_pass(True, insert_sorted=True)
>>> d2.append_column([4,5,6])
>>> [[i for i in r] for r in d2]
[[3, 'c', 1, 4], [2, 'b', 2, 5], [1, 'a', 3, 6]]

del_header(name : str)

Remove the header named name from the list of headers. If name does not exist, this method does nothing.

filter_both(filter_fn[, insert_sorted=False, copy_sort_key=True, copy_header=True, copy_colnames=True, limit : int, fail_limit : int]) → 2-tuple of Dataset

Produces a tuple of two new Dataset objects based on this one, with rows passing filter_fn in the first, and rows failing filter_fn in the second. This method is equivalent to calling a, b = (data.filter_pass(fn, ...), data.filter_fail(fn, ...)), but is more efficient.

filter_fn is a function taking a row object and returning True or False; filter_fn may also be one of the values True or False, meaning to pass (True) or fail (False) every row in the dataset.

If insert_sorted is True, the native sort order of the filtered datasets will be the parent dataset’s current sort order. If insert_sorted is False, the native sort order of the filtered datasets will be the parent dataset’s native sort order, regardless of the current sort order.

If copy_sort_key is True, the current sort order from the parent dataset will be copied into the filtered datasets. This does not change the native sort order.

If copy_header is True, the parent dataset’s headers will be copied into the output datasets. Otherwise, the output datasets will have no headers.

If copy_colnames is True, the parent dataset’s column names will be copied into the output datasets. Otherwise, the output datasets will have no column names.

If supplied, limit is the maximum number of passing rows to return from this filter. By default, all rows matching filter_fn will be returned in the first Dataset object.

If supplied, fail_limit is the maximum number of failing rows to return from this filter. By default, all rows failing filter_fn will be returned in the second Dataset object.
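
For example, a minimal sketch (assuming d’s first column is numeric; the threshold is arbitrary):

>>> # d is a Dataset object whose first column is numeric
>>> passing, failing = d.filter_both(lambda r: r[0] > 10)
>>> passing.get_num_rows() + failing.get_num_rows() == d.get_num_rows()
True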

filter_fail(filter_fn[, insert_sorted=False, copy_sort_key=True, copy_header=True, copy_colnames=True, limit: int]) → Dataset

Produces a new Dataset based on this one, containing only rows failing filter_fn.

filter_fn is a function taking a row object and returning True or False; filter_fn may also be one of the values True or False, meaning to pass (True) or fail (False) every row in the dataset.

If insert_sorted is True, the native sort order of the filtered dataset will be the parent dataset’s current sort order. If insert_sorted is False, the native sort order of the filtered dataset will be the parent dataset’s native sort order, regardless of the current sort order.

If copy_sort_key is True, the current sort order from the parent dataset will be copied into the filtered dataset. This does not change the native sort order.

If copy_header is True, the parent dataset’s headers will be copied into the output dataset. Otherwise, the output dataset will have no headers.

If copy_colnames is True, the parent dataset’s column names will be copied into the output dataset. Otherwise, the output dataset will have no column names.

If supplied, limit is the maximum number of rows to return from this filter. By default, all rows matching filter_fn will be returned.

filter_pass(filter_fn[, insert_sorted=False, copy_sort_key=True, copy_header=True, copy_colnames=True, limit: int]) → Dataset

Produces a new Dataset based on this one, containing only rows passing filter_fn.

filter_fn is a function taking a row object and returning True or False; filter_fn may also be one of the values True or False, meaning to pass (True) or fail (False) every row in the dataset.

If insert_sorted is True, the native sort order of the filtered dataset will be the parent dataset’s current sort order. If insert_sorted is False, the native sort order of the filtered dataset will be the parent dataset’s native sort order, regardless of the current sort order.

If copy_sort_key is True, the current sort order from the parent dataset will be copied into the filtered dataset. This does not change the native sort order.

If copy_header is True, the parent dataset’s headers will be copied into the output dataset. Otherwise, the output dataset will have no headers.

If copy_colnames is True, the parent dataset’s column names will be copied into the output dataset. Otherwise, the output dataset will have no column names.

If supplied, limit is the maximum number of rows to return from this filter. By default, all rows matching filter_fn will be returned.
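
For example, a minimal sketch:

>>> # t is a Toolbox object
>>> d = t.new_dataset_from_columns([[1,2,3], [3,2,1]])
>>> d2 = d.filter_pass(lambda r: r[0] >= 2)
>>> [[i for i in r] for r in d2]
[[2, 2], [3, 1]]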

flatten([new_colnames : iterable]) → Dataset

Creates a copy of this dataset, converting any composite Column objects into one Column object for each element in the composite. Note that this may change the column indices of subsequent columns.

If new_colnames is supplied, it should be an iterable of strings, which will be used as column names for the new dataset, with the first new column getting the first name from new_colnames, and so on. If the number of column names in new_colnames does not match the number of columns in the flattened dataset, a RayonDataException will be raised. If new_colnames is None (the default), this method will try to use the names of the parent dataset as the names of the output dataset. (This will only be possible if the Dataset object contains no melded columns.) If that isn’t possible, the output dataset will not contain column names.

See also

meld
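
A minimal sketch, assuming (hypothetically) that d holds one two-component composite column alongside two ordinary columns:

>>> # Hypothetical: d has one composite column (two components)
>>> # plus two ordinary columns, i.e. three columns before flattening.
>>> flat = d.flatten(new_colnames=['a', 'b', 'c', 'd'])
>>> flat.get_num_columns()
4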

get_column(colname_or_index : int or str[, ignore_sort=False]) → Column

Given a column index or name, returns the corresponding Column object.

colname_or_index is either the index (an integer) or name (a string) referring to the column.

If ignore_sort is False, the column is returned in the dataset’s current sort order. Otherwise, the column is returned in native (insertion) order, even if the dataset is currently sorted by some key.
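
For example:

>>> # t is a Toolbox object
>>> d = t.new_dataset_from_columns([[1,2,3]], colnames=['foo'])
>>> [i for i in d.get_column('foo')]
[1, 2, 3]
>>> [i for i in d.get_column(0)]
[1, 2, 3]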

get_column_index_from_name(name : int or str)

Returns the numeric index corresponding to a column referenced by name. name may also be a numeric index, in which case it is simply returned if it refers to a valid column index in this dataset. (Consequently, this method can be used to convert a value that may be either a name or an index to an index.)

It is an error to call this method on an empty dataset.

get_column_name_from_index(index : int or str)

Returns the name corresponding to a column referenced by index. index can be a name, in which case it is simply returned if it refers to a column in this dataset. (Consequently, this method can be used to convert a value that may be either a name or an index to a name. Note, however, that it is an error to call this method if the dataset has no column names.)
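
For example, both directions of the conversion (assuming d’s first column is named 'foo'):

>>> # d's first column is named 'foo'
>>> d.get_column_index_from_name('foo')
0
>>> d.get_column_index_from_name(0)
0
>>> d.get_column_name_from_index(0)
'foo'
>>> d.get_column_name_from_index('foo')
'foo'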

get_column_names([synthesize_indices=False]) → list of str

Returns a list of names of all columns in the dataset, in order from lowest index to highest. If no column names are specified, this method returns an empty list.

If synthesize_indices is True, this method will return a list of synthetic indices that can be used to access the columns in the dataset. This guarantees that the return value of this method may be used to access the columns of the dataset.

get_header(name : str) → str

Gets the value of the header named name.

get_headers() → iterator

Returns an iterator of 2-tuples over the headers of this dataset. Each 2-tuple is a (name, value) pair representing one header. Order is undefined.

get_num_columns() → int

Returns the number of columns in the dataset.

get_num_rows() → int

Returns the number of rows in the dataset.

get_row(index : int[, ignore_sort=False]) → Row

Returns the row in the dataset corresponding to index. Negative numbers index from the end of the dataset, as in Python lists.

If ignore_sort is False, return the row corresponding to index according to the dataset’s current sort order. If ignore_sort is True, return the row corresponding to index according to the dataset’s native sort order.
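
For example (using the unsorted three-column dataset from the append_column examples):

>>> # t is a Toolbox object
>>> d = t.new_dataset_from_columns([[1,2,3], ['a','b','c'], [3,2,1]])
>>> [i for i in d.get_row(0)]
[1, 'a', 3]
>>> [i for i in d.get_row(-1)]
[3, 'c', 1]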

iter_columns() → iterator

Iterate over the columns in the dataset, in order from the lowest index to the highest.

iter_ignore_sorted() → iterator

Iterate over the rows in the dataset according to the dataset’s native sort order, regardless of the current sort order. (To get an iterator over the rows of the dataset in the current sort order, use Python’s iter function.)
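
For example, a minimal sketch contrasting the two iteration orders:

>>> # t is a Toolbox object
>>> d = t.new_dataset_from_columns([[2, 1, 3]])
>>> d.sort(lambda r: r[0])
>>> [r[0] for r in d]                       # current sort order
[1, 2, 3]
>>> [r[0] for r in d.iter_ignore_sorted()]  # native order
[2, 1, 3]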

meld([*names_or_indices]) → MultiColumn

Creates a single composite Column object from the multiple columns referenced in names_or_indices. This is useful, for example, in data where several columns together form a unique key.

If exactly one name/index is supplied, this method returns a normal Column object, and is equivalent to calling the get_column method with that name/index.

If names_or_indices is not supplied, a composite Column object consisting of all the columns in the dataset will be created.
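
For example, a minimal sketch using the six-row color dataset shown under partition below, where the first and third columns together form a unique key:

>>> # d is the color dataset shown under partition
>>> key = d.meld(0, 2)
>>> sum(1 for i in key) == d.get_num_rows()   # one composite element per row
True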

partition(key_colname_or_index[, insert_sorted=True, copy_header=True, copy_colnames=True]) → [(keyval, Dataset)]

Creates several new Dataset objects, partitioned along the unique values of one or more of the parent dataset’s columns. The new datasets are returned as a tuple of pairs; the first element of each pair is a unique value from the partitioning column(s), and the second is a new dataset containing the matching rows.

The following examples illustrate the use of partition on a dataset d with the following rows:

red   | 1 | 0
red   | 1 | 1
green | 1 | 2
green | 1 | 3
blue  | 1 | 4
blue  | 1 | 5

The following example partitions d on the first column, producing three datasets (one for each color), each containing two rows:

>>> from pprint import pprint
>>> p0 = d.partition(0)
>>> pprint(p0)
(('red', <rayon.data.Dataset object at 0x...>),
 ('green', <rayon.data.Dataset object at 0x...>),
 ('blue', <rayon.data.Dataset object at 0x...>))
>>> print(p0[0][1].to_string())
red|1|0
red|1|1

Partitioning d on the second column produces a single dataset containing all the rows:

>>> p1 = d.partition(1)
>>> pprint(p1)
((1, <rayon.data.Dataset object at 0x...>),)
>>> print(p1[0][1].to_string())
red|1|0
red|1|1
green|1|2
green|1|3
blue|1|4
blue|1|5

As each value in the third column is unique, partitioning d on this column produces six datasets of one row each:

>>> p2 = d.partition(2)
>>> pprint(p2)
((0, <rayon.data.Dataset object at 0x...>),
 (1, <rayon.data.Dataset object at 0x...>),
 (2, <rayon.data.Dataset object at 0x...>),
 (3, <rayon.data.Dataset object at 0x...>),
 (4, <rayon.data.Dataset object at 0x...>),
 (5, <rayon.data.Dataset object at 0x...>))
>>> print(p2[0][1].to_string())
red|1|0
>>> print(p2[1][1].to_string())
red|1|1
>>> print(p2[2][1].to_string())
green|1|2

You can partition on multiple columns. To do so, pass in a sequence of column names or indices. In this case, the key will be a tuple of unique combinations of the requested key columns, in the order in which they were passed in:

>>> p3 = d.partition((0, 1))
>>> pprint(p3)
((('red', 1), <rayon.data.Dataset object at 0x...>),
 (('green', 1), <rayon.data.Dataset object at 0x...>),
 (('blue', 1), <rayon.data.Dataset object at 0x...>))
>>> p4 = d.partition((1, 0))
>>> pprint(p4)
(((1, 'red'), <rayon.data.Dataset object at 0x...>),
 ((1, 'green'), <rayon.data.Dataset object at 0x...>),
 ((1, 'blue'), <rayon.data.Dataset object at 0x...>))

The result of partitioning a Dataset object on a zero-length sequence is a single dataset containing all the rows of the original:

>>> d.partition(tuple())
(((), <rayon.data.Dataset object at 0x...>),)
>>> print(d.partition(tuple())[0][1].to_string())
red|1|0
red|1|1
green|1|2
green|1|3
blue|1|4
blue|1|5

key_colname_or_index is the name or index of the column on which to partition, or a tuple of names or indices, as described above.

If insert_sorted is True, the rows will be inserted into the partitioned datasets in the order indicated by the parent dataset’s current sort order. If insert_sorted is False, the rows will be inserted into the partitioned datasets according to the parent dataset’s native sort order, regardless of the current sort order.

If copy_header is True, the parent dataset’s headers will be copied into the output dataset. Otherwise, the output dataset will have no headers.

If copy_colnames is True, the parent dataset’s column names will be copied into the output dataset. Otherwise, the output dataset will have no column names.

This method returns a sequence of 2-tuples (keyval, dataset), where keyval is one of the unique values in the column(s) referred to by key_colname_or_index, and dataset is a Dataset object containing the rows of the parent dataset that have that value in the key column(s).

set_column_names(colnames : list of str)

Replaces the Dataset object’s column names (if they exist) with colnames.

colnames is an iterable of strings; the number of elements in colnames must match the number of columns in the dataset.
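
For example:

>>> # t is a Toolbox object
>>> d = t.new_dataset_from_columns([[1,2,3], [3,2,1]])
>>> d.set_column_names(['foo', 'bar'])
>>> d.get_column_names()
['foo', 'bar']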

set_header(name : str, value : str)

Sets the value of the header named name to value

sort([key : function])

Changes the current sort order according to the key. If key is None, revert to native sort order.

key is a function (or callable object) taking a row object and returning a comparable value; that is, a value for which the <, >, <=, >= and == operators are meaningful, so that it can be ordered relative to other values of its type. key can also be a class whose constructor takes a row object and returns an instance suitable for comparison (i.e., one defining __lt__, __gt__, and so on).

Some examples:

# Sorts on the first 2 columns of d
d.sort(key=lambda r: (r[0], r[1]))

# This is equivalent to the above...
class Comparator(object):
    def __init__(self, r):
        self.to_cmp = (r[0], r[1])
    def __lt__(self, other):
        return self.to_cmp < other.to_cmp
    def __gt__(self, other):
        return self.to_cmp > other.to_cmp
    def __le__(self, other):
        return self.to_cmp <= other.to_cmp
    def __ge__(self, other):
        return self.to_cmp >= other.to_cmp
    def __eq__(self, other):
        return self.to_cmp == other.to_cmp
    def __ne__(self, other):
        return self.to_cmp != other.to_cmp

d.sort(key=Comparator)

# ...as is this
class CallableComparator(object):
    def __call__(self, row):
        return row[0], row[1]

d.sort(key=CallableComparator())

to_file(fname : str[, delimiter="|"])

Writes the contents of this dataset to a file. The file will be created if it does not exist, and will be overwritten if it does exist. The format of the output is described in On-disk format.

Generally, this method supports writing any opaque datatype that can be represented as a string, provided that the string representation does not contain the character used as the field delimiter.

At this time, composite columns (such as those returned by the meld method) are not supported; to write datasets with melded columns to disk, first flatten them using the flatten method.

This method is equivalent to calling:

# d is the dataset
with open(fname, 'w') as f:
    return d.to_stream(f)

fname is the name of the file to write.

delimiter should be a single character that will be used in the stream to delimit fields in a record.

to_stream(stream : fobj[, delimiter="|"])

Writes the contents of a dataset to a stream. The format of the output is described in On-disk format.

Generally, this method supports writing any opaque datatype that can be represented as a string, provided that the string representation does not contain the character used as the field delimiter.

At this time, composite columns (such as those returned by the meld method) are not supported; to write datasets with melded columns to disk, first flatten them using the flatten method.

stream is a file-like object that supports writing.

delimiter should be a single character that will be used in the stream to delimit fields in a record.

to_string([delimiter="|"]) → str

Writes the contents of this dataset to a string, then returns the string.

Generally, this method supports writing any opaque datatype that can be represented as a string, provided that the string representation does not contain the character used as the field delimiter.

At this time, composite columns (such as those returned by the meld method) are not supported; to write datasets with melded columns to disk, first flatten them using the flatten method.

This method is equivalent to calling:

# d is the dataset
sio = StringIO()
d.to_stream(sio)
return sio.getvalue()

delimiter should be a single character that will be used in the stream to delimit fields in a record.

Columns

toolbox.new_column_from_data(raw_data : iterable)
toolbox.new_column_from_constant(constant, length : int)

Column objects are immutable collections of columnar data. Column objects are used by the Dataset object to store its data and are returned from some methods of the Dataset object. Column objects may also be created using the new_column_from_data and new_column_from_constant methods of the Toolbox object.

For purposes of item access, membership testing and iteration, Column objects can be treated as Python tuples. Column objects may also be added to each other if their data types match. If the data in the Column object is numeric, several methods can provide statistical information.
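
For example, a minimal sketch of tuple-like access on a Column:

>>> # t is a Toolbox object
>>> c = t.new_column_from_data([1, 2, 3])
>>> c[0]
1
>>> 2 in c
True
>>> [i * 2 for i in c]
[2, 4, 6]
>>> [i for i in t.new_column_from_constant('x', 3)]
['x', 'x', 'x']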

max() → number

Returns the largest element in the column data.

If the data in this column is not numeric, this method will raise an error.

mean() → number

Returns the mean of the data in the column.

If the data in this column is not numeric, this method will raise an error.

min() → number

Returns the smallest element in the column data.

If the data in this column is not numeric, this method will raise an error.

percentile(p : int) → number

Returns the value of the pth percentile element in the column data. p should be an integer between 1 and 100.

If the data in this column is not numeric, this method will raise an error.

sample_stdev() → number

Returns the sample standard deviation of the column data. (The data is assumed to represent a sample of a larger population.)

If the data in this column is not numeric, this method will raise an error.

sample_variance() → number

Returns the sample variance of the data. (The data is assumed to represent a sample of a larger population.)

If the data in this column is not numeric, this method will raise an error.

stdev() → number

Returns the population standard deviation of the column data. (The data is assumed to represent the total population.)

If the data in this column is not numeric, this method will raise an error.

uniq([return_sorted=False]) → list

Returns a list of all unique values in the column. If return_sorted is True, the items are returned in sorted order. Otherwise, they are returned in the order in which they first appear in the column.
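
For example:

>>> # t is a Toolbox object
>>> c = t.new_column_from_data(['b', 'a', 'b', 'c'])
>>> c.uniq()
['b', 'a', 'c']
>>> c.uniq(return_sorted=True)
['a', 'b', 'c']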

variance() → number

Returns the population variance of the column data. (The data is assumed to represent the total population.)

If the data in this column is not numeric, this method will raise an error.
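
A brief sketch of the numeric helpers on a single column (outputs assume exact float arithmetic):

>>> # t is a Toolbox object
>>> c = t.new_column_from_data([1.0, 2.0, 3.0, 4.0, 5.0])
>>> c.min(), c.max()
(1.0, 5.0)
>>> c.mean()
3.0
>>> c.variance()
2.0
>>> c.sample_variance()
2.5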

Rows

Row objects provide a row-major “view” on a Dataset object’s data. They provide access to data by index and (when a dataset has column names) by name, e.g.:

# d is a dataset with three columns,
# named "foo", "bar", and "baz".
# the first row of d is ['a', 1, 'y']
>>> r = d.get_row(0)
>>> r[0]
'a'
>>> r.foo
'a'