netsa.script.golem — Golem Script Automation

Overview

The Golem Script Automation framework is a specialized extension of the NetSA Scripting Framework for constructing automated analysis scripts. Such scripts might be launched periodically from a cron job in order to produce regular sets of results in a data repository. In addition to the golem-specific extensions, netsa.script.golem offers the same functionality as netsa.script.

At its heart, Golem is a template engine combined with some synchronization logic across analytic time bins. The templates define the output paths for result data and the command line sequences and pipelines used to generate these results. Template variables are known as tags. Golem provides certain tags by default and others can be added or modified via the golem configuration functions.

Golem Automation is designed to allow developers to build scripts that easily consume the output results of other golem scripts without the need for detailed knowledge of the implementation details of the external script (e.g. how often it runs or what pathnames it uses for populating its data repository). When provided the path to the external script in its configuration, a golem script will interrogate the external script for information regarding its output results, automatically synchronize across processing windows, and make the result paths available for use in command templates as input paths.

In addition to analysis automation, Golem scripts also offer a query mode so that results in the data repository can be easily examined or pulled to local directories.

Golem offers a number of shortcuts and convenience functions specific to the SiLK Flow Analysis suite, but is not limited to using SiLK for analysis.

See the examples at the end of this chapter to learn how to write a golem script. See the description of template and tag usage for details on tags provided for use within templates. See the API reference for more thorough documentation of the features and interface. See the CLI reference for the standard command line parameters that golem enables.

Command Line Usage

Golem-enabled scripts offer standard command line parameters grouped into three categories: basic, repository-related, and query-related. Parameters must be enabled by the script author. Parameters can be enabled individually or by category. The add_golem_params function will enable all parameters.

Golem scripts also have the standard netsa.script command line options --help, --help-expert, and --verbose.

Basic Parameters

The following golem command line parameters control the filtering of processing windows, as well as some input and output behavior. These options affect both repository and query operations.

--last-date <date>

Date specifying the last time bin of interest. The provided date will be rounded down to the first date of the processing interval in which it resides. Default: most recent

--first-date <date>

Date specifying the first time bin of interest. The provided date will be rounded down to the first date of the processing interval in which it resides. Default: value of --last-date

--intervals <count>

Process or query the last count intervals (as defined by set_interval within the script header) of data from the current date. This will override any values provided by --first-date or --last-date.

--skip-incomplete

Skip processing intervals that have incomplete input requirements (i.e. ignore the date if any source dependencies have incomplete results).

--overwrite

Overwrite output results if they already exist.

--<loop-select>

Each template loop that has been defined in the golem script (via the add_loop function) has an associated parameter that allows a comma-separated selection of a subset of that loop’s values. For example, if a sensor loop was defined under the tag sensor, the parameter would be --sensor by default. If grouping was enabled it would be --sensor-group instead.

Repository Parameters

Repository parameters control the core data processing and generation of data in the result repository, typically run via a cron job.

--data-process

This is the main parameter that makes the script do its work. It generates and stores incomplete or missing analysis results in the data repository, skipping those that are complete (unless --overwrite is given).

--data-status

Show the status of repository processing bins, for the provided options. No processing is performed.

--data-queue

List the dates of all pending repository results for the provided options. No processing is performed.

--data-complete

List the dates of all completed repository results, for the provided options. No processing is performed.

--data-inputs

Show the status of input dependencies from other golem scripts or defined templates, for the provided options. Use -v or -vv for less abbreviated paths. No processing is performed.

--data-outputs

Show the status of repository output results for the provided options. Use -v or -vv for less abbreviated paths. No processing is performed.

Query Parameters

Query parameters control how local copies of results are stored, whether copying from the repository or performing a fresh analysis.

--output-path <path>

Generate a single query result in the specified output file for the given parameters. Also accepts ‘-’ and ‘stdout’.

--output-select <arg1[,arg2[,...]]>

For golem scripts that have more than one output defined, limit output results to the comma-separated names provided.

--output-dir <path>

Copy query result files into the specified output directory for the given parameters. The files will follow the same naming scheme as specified for the repository. These names can be previewed via the --show-outputs option.

--show-inputs

Show the status of input dependencies from other golem scripts or defined templates for the provided options. Use -v or -vv for less abbreviated paths. No processing is performed.

--show-outputs

Show relative output paths that would be generated within --output-dir, given the options provided. Use -v or -vv for less abbreviated paths. No processing is performed.

Metadata Functions

Golem scripts are an extension of netsa.script. As such, golem scripts offer the same functions as netsa.script. Script authors are encouraged to use the following metadata functions:

netsa.script.golem.set_title(script_title : str)

Set the title for this script. This should be the human-readable name of the script, and denote its purpose.

netsa.script.golem.set_description(script_description : str)

Set the description for this script. This should be a longer human-readable description of the script’s purpose, including simple details of its behavior and required inputs.

netsa.script.golem.set_version(script_version : str)

Set the version number of this script. This can take any form, but the standard major.minor(.patch) format is recommended.

netsa.script.golem.set_package_name(script_package_name : str)

Set the package name for this script. This should be the human-readable name of a collection of scripts.

netsa.script.golem.set_contact(script_contact : str)

Set the point of contact email for support of this script, which must be a single string. The form should be suitable for treatment as an email address. The recommended form is a string containing:

Full Name <full.name@contact.email.org>

netsa.script.golem.set_authors(script_authors : str list)

Set the list of authors for this script, which must be a list of strings. It is recommended that each author be listed in the form described for set_contact.

netsa.script.golem.add_author(script_author : str)

Add another author to the list of authors for this script, which must be a single string. See set_authors for notes on the content of this string.

Please see netsa.script for additional functions, such as those used to add custom command line parameters.

Configuration Functions

Golem scripts are configured by calling functions from within a particular imported module. You can either import netsa.script.golem directly:

from netsa.script import golem

Or import as netsa.script.golem.script:

from netsa.script.golem import script

In either case, the imported module will provide identical functionality. All functions available within netsa.script are also available from within golem, with the addition of the golem-specific functions and classes. Please consult the netsa.script documentation for details on functions offered by that module.

The following functions configure the behavior of golem scripts.

netsa.script.golem.set_default_home(path : str[, path : str, ...])

Sets the default base path for this golem script in cases where the GOLEM_HOME environment variable is not set. Multiple arguments will be joined together as with os.path.join. If the provided path is relative, it is assumed to be relative to the directory in which the script resides.

The actual home path will be decided by the first available source in the following order:

  1. The GOLEM_HOME environment variable
  2. The default home if set by this function
  3. The directory in which the script resides

Subsequent path settings (e.g. set_repository) will be relative to the script home if they are not absolute paths.
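
As a brief sketch (the directory layout here is hypothetical), a script installed in /opt/analysis/bin might anchor its home one level up and keep its repository beneath it:

>>> from netsa.script.golem import script
>>> script.set_default_home('..')    # home becomes /opt/analysis unless GOLEM_HOME is set
>>> script.set_repository('dat')     # resolved relative to the home path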

netsa.script.golem.set_repository(path : str[, path : str, ...])

Sets the default path for the output results data repository. Multiple arguments will be joined together as with os.path.join. Output results will be stored in this directory or in a subdirectory beneath, depending on how each output template is specified. Relative paths are considered to be relative to the home path.

netsa.script.golem.set_suite_name(name : str)

Set the short suite name for this script, if it belongs to a suite of related scripts. This should be a simple label suitable for use as a component in paths or filenames. Defaults to None.

netsa.script.golem.set_name(name : str)

Set the short name for this script. The name should be a simple label suitable for use as a component in paths or filenames. This defaults to the basename of the script file itself, minus the ‘.py’ extension, if present.

netsa.script.golem.set_interval([days : int, minutes : int, hours : int, weeks : int])

Set how often this golem script is expected to generate results.

The interval roughly corresponds to how often the script should be run (such as from a cron job). Golem scripts will only process data for incomplete intervals over a provided date range, unless told otherwise via the --overwrite option.

netsa.script.golem.set_span([days : int, minutes : int, hours : int, weeks : int])

Set the span over which this golem script will expect input data for each processing interval. Defaults to the processing interval.

The span will manifest as how much data is being pulled from a SiLK repository or possibly how many outputs from another golem script are being consumed. For example, a script having a 4 week span might run once a week, pulling 4 weeks’ worth of data each time it runs.

See Intervals and Spans Explained for more details.

netsa.script.golem.set_skew([days : int, minutes : int, hours : int, weeks : int])

Shift the epoch from which all time bins and spans are anchored. For example, given a golem script with an interval of one week, skew can control which day of the week processing occurs.

netsa.script.golem.set_lag([days : int, minutes : int, hours : int, weeks : int])

Set the lag for this golem script relative to the current date and time. Defaults to 0.

As an example, 3 hours is a typical value for data to finish accumulating in a given hour within a SiLK repository. Setting lag to 3 hours effectively shifts the script’s concept of the current time that far into the past.
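
As an illustrative sketch, a script that runs weekly over a four week data window, shifts its weekly bins to start on Wednesday, and allows three hours for repository data to settle (all values here are hypothetical) might be configured as:

>>> script.set_interval(weeks=1)
>>> script.set_span(weeks=4)
>>> script.set_skew(days=2)    # shift weekly bins from Monday to Wednesday
>>> script.set_lag(hours=3)    # treat 'now' as three hours in the past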

netsa.script.golem.set_realtime(enable=True)

Set whether this golem script will report output results in real time. Defaults to False.

Normally a golem script will wait until a processing interval has completely passed before performing any processing or reporting any output results. With set_realtime enabled, the golem script will consider the current processing bin to be the most recent, even if it extends to a future date.

Enabling realtime has a side effect of setting lag to zero.

netsa.script.golem.set_tty_safe(enable=True)

Set whether or not query results are safe to send to the terminal. Defaults to False.

netsa.script.golem.set_passive_mode(enable=False)

Controls whether or not an output option is required before running the main loop of the script. Defaults to False. Normally at least one repository-related or query-related option must be present or the script aborts; enabling passive mode is useful for scripts that provide query behavior by default or that maintain the repository in a custom fashion. If enabled, script authors should explicitly check whether repository updates were requested via the command line prior to updating the repository.

netsa.script.golem.add_tag(name : str, value : str or func)

Set a command template tag with key name. The provided value can be callable, in which case it is resolved once the main function is invoked. Tags can reference other tags. See template and tag usage for more information.
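
For example (the tag names and values here are purely illustrative), a static tag and a callable tag that is not resolved until the main function runs:

>>> import os
>>> script.add_tag('web_types', 'inweb,outweb')
>>> script.add_tag('run_user', lambda: os.environ.get('USER', 'unknown'))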

netsa.script.golem.add_loop(name : str, value : str list or func[, group_by : str or func, group_name : str, sep=', '])

Add a template tag under key name whose values cycle through those provided by values, either as an iterable or callable. In the latter case, the values are not resolved until the main function is invoked.

Optional keyword arguments:

group_by
Specifies how to group entries from values into a single loop entry. If the provided value is callable, the function should accept a single entry from values and return either the group label for that entry (e.g. ‘LAB2’ might return ‘LAB’) or the original string. If the value is a dictionary, entries resolve to their mapped values when present. If it is an iterable of prefix matches, the prefixes will be converted into a regular expression applied to the beginning of each entry. In all cases, if there is no match or result, the original entry becomes its own group label.
group_name
This is the name of the template tag under which the group label appears, defaulting to the value of name appended with ‘_group’. In the above example, if name is ‘sensor’, then %(sensor)s might resolve to ‘LAB1,LAB2,LAB3’ whereas %(sensor_group)s would merely resolve to ‘LAB’.
sep
Depending on how these loops are visited, the template value under name might contain multiple values from values. Under these circumstances, the resulting string is joined using the value of sep (default: ‘,’).

Note that adding a loop will automatically add an additional query command line parameter named after the name for limiting which values to process within the loop. If grouping was requested, an additional parameter named after the group_name is also added. If these parameters are not desired or need to be modified, use the modify_golem_param function.

See template usage for more information on templates and loop values.
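
As a hedged sketch, assuming a hypothetical fixed list of sensor names grouped by prefix:

>>> script.add_loop('sensor', ['LAB1', 'LAB2', 'EDGE1', 'EDGE2'],
...     group_by=['LAB', 'EDGE'])

Templates can then reference both %(sensor)s and %(sensor_group)s, and the associated loop-select command line parameters are added as described above.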

netsa.script.golem.add_sensor_loop([name='sensor' : str, sensors : str list or func, group_by : str or func, group_name : str, auto_group=False])

This is a convenience function for adding a template loop tag under key name whose values are based on sensors defined in a SiLK repository. Special note is taken that these loop values represent SiLK sensors, so any netsa.script.Flow_params tags that are defined (see add_flow_tag) will have their sensors parameter automatically bound (if not explicitly bound to something else) to the last sensor loop defined by this function.

Optional keyword arguments:

name
The tag name within the template. Defaults to ‘sensor’, accessible from within templates as the tag %(sensor)s
sensors
The source of sensor names, specified either as an iterable or callable. Defaults to the get_sensors function, which will interrogate the local SiLK repository once the main function is invoked.
group_by
Same as with add_loop
group_name
Same as with add_loop
auto_group
Causes group_by to be set to the get_sensor_group function, a convenience function that strips numbers, possibly preceded by an underscore, from the end of sensor names. This can provide serviceable sensor grouping for descriptive sensor names (e.g. ‘LAB0’, ‘LAB1’, ‘LAB2’) but will not be of much use if they are generically named (e.g. ‘S0’, ‘S1’, etc).
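
For example, a single line suffices to loop over the sensors found in the local SiLK configuration, grouping ‘LAB0’, ‘LAB1’, and so on under ‘LAB’:

>>> script.add_sensor_loop(auto_group=True)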

netsa.script.golem.add_flow_tag(name : str[, flow_class : str, flow_type : str, flowtypes : str, sensors : str, start_date : str, end_date : str, input_pipe : str, xargs : str, filenames : str])

Add a netsa.script.Flow_params object as a template tag under key name. The rest of the keyword arguments correspond to the same parameters accepted by the netsa.script.Flow_params constructor and serve to map each field either to a specific template tag in the golem script or, if the given name is not present among the template tags, to the literal string provided. If not otherwise specified, the start_date and end_date attributes are bound to the values of %(golem_start_date)s and %(golem_end_date)s, respectively, for each loop iteration. Additionally, if any sensor-specific loops were specified via add_sensor_loop, the sensors parameter defaults to the tag associated with the last defined sensor loop (typically %(sensor)s). The resulting netsa.script.Flow_params object is associated with a tag entry specified by name. The values of tags associated with netsa.script.Flow_params attributes in this way are still accessible under their original tag names.

The following optional keyword arguments are available to map template tag values to their corresponding attributes in the flow params object: flow_class, flow_type, flowtypes, sensors, start_date, end_date, input_pipe, xargs, and filenames.

See template and tag usage for more information on templates.
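
A brief sketch of the configuration side, reusing the in_types tag and sensor loop shown elsewhere in this chapter; the resulting %(in_flow)s tag carries a Flow_params object bound to the current iteration’s dates and sensor:

>>> script.add_tag('in_types', 'in,inweb')
>>> script.add_sensor_loop()
>>> script.add_flow_tag('in_flow', flow_type='in_types')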

netsa.script.golem.add_output_template(name : str, template : str[, scope : int, mime_type : str, description : str])

Define an output template tag key name with the provided template. The provided template can use any of the tags available for each processing iteration. Wildcards are acceptable in the template specification. Absolute paths are not allowed since outputs must reside in the data repository.

The following optional keyword arguments are available. Pass any of them a value of None to disable entirely:

scope
Defines how many intervals of this output are required to represent a complete analysis result in cases where the output from a single interval represents a partial result. For example, a golem script might have an interval of 1 day whereas a “complete” set of results is 7 days worth relative to any particular day.
mime_type
The expected MIME Content-Type of the output file, if any.
description
A long-form text description, if any, of the contents of this output file.

See template and tag usage for more information on templates.

Templates are not required to reference all distinguishing tags and can therefore be ‘lossy’ across loop iterations if such a thing is desired.

netsa.script.golem.add_input_template(name : str, template : str[, required=True, count : int, offset : int, cover : bool, source_name : str, mime_type : str, description : str])

Define an input template tag under key name. This is useful for defining inputs not produced by other golem scripts. Wildcards are allowed in the template specification.

The provided template can use any of the tags available on each processing iteration. The resolved string is available to command templates under key name.

Optional keyword arguments:

required
When False, missing inputs will be ignored once the template is resolved. Defaults to True.
count
Specifies how many intervals (as defined by this script) over which to resolve the input template. For example, if this script has an interval of one week, a count of 4 will resolve the template over the last 4 weeks.
offset
Specify how far backwards (in intervals) to anchor this input. If a count has been specified, the offset shifts the entire count of intervals.
cover
If True, a count will be calculated such that the template will be resolved over all intervals covered by the span of this script. For example, for a script with an interval of one day and a span of seven days, the input template will be resolved for all seven days in the span.
source_name
Specify what produced this input, for informational purposes when listing script inputs.
mime_type
The expected MIME Content-Type of the input.
description
A long-form text description of the expected contents of this input.

See template and tag usage for more information on templates.
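
As a hedged sketch (the incoming path is hypothetical), a daily-interval script with a seven day span could resolve one incoming text file per day across its span:

>>> script.set_interval(days=1)
>>> script.set_span(days=7)
>>> script.add_input_template('daily_txt',
...     "/data/incoming/daily.%(golem_bin_iso)s.txt",
...     cover=True,
...     mime_type='text/plain')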

netsa.script.golem.add_golem_input(golem_script : str, name : str[, output_name : str, count : int, cover=False, offset : int, span : timedelta, join_on : str or str list, join : dict, required=True])

Specify a tag, under key name, that represents the path (or paths) to an output of an external golem script. Only a single output can be associated at a time – if the external script has multiple outputs defined, then additional calls to this function are necessary for each output of interest.

For each output template, efforts are made to synchronize across time intervals and loop tags as appropriate. By default, this is the output of the most recent interval of the other golem script that corresponds to the interval currently under consideration within the local script. Covering multiple intervals of the input is controlled by cover, count, or span. If, for example, you’re pulling inputs with an hourly interval into a script with a daily interval, the cover parameter should probably be used; otherwise only a single hour would be pulled in.

Matching loop tags are automatically joined unless the join parameter is provided, in which case the only joins that happen are the ones provided.

Optional keyword arguments:

output_name
The tag name used for this output in the external script if it differs from the value of name used locally for this input. By default the names are assumed to be identical.
count

Specifies how many intervals (as defined by the external script) of output data are to be used as input. By default, the most recent interval of the other golem script that corresponds to the local interval currently under consideration is provided.

For example, if the other golem script has an interval of one week, a count of 4 will provide the last 4 weeks of output from that script.

offset

Specify how far back (in units of the other script’s interval) to reference for this input. Defaults to the most recent corresponding interval. If a count has been specified, the offset shifts the entire count of intervals.

For example, if the other script’s interval is one week, an offset of -1 will reference the output from the week prior to the most recent week. Negative and positive offsets are equivalent for these purposes; they always reach backwards through time.

cover

If True, a count is calculated that will fully cover the local interval under consideration. This option cannot be used with the count or offset options. Defaults to False.

For example, if the local interval is one week, and the other interval is one day, then this is equivalent to specifying a count of 7 (days).

span
A timedelta object that represents a span of time covering the intervals of interest. Based on the other script’s interval, a count is calculated that will cover the provided span. Cannot be used simultaneously with count, offset, or cover.
join

A dictionary or iterable of tuples that provides an equivalence mapping between template loops defined in the other golem script and locally defined loops. If no join mapping is provided, an attempt is made to join on loops sharing the same name. If the join parameter is provided, no auto-join is performed.

For example, if the other script defines a template loop on the %(my_sensor)s tag, and the local script defines a loop on %(sensor)s, a mapping from ‘my_sensor’ to ‘sensor’ will ensure that for each iteration over the values of %(sensor)s the input tag value is also sensor-specific based on its %(my_sensor)s loop. Without this association, the input tag would resolve to all outputs across all sensors, regardless of which sensor or sensor group was currently under local consideration. Iterations with no valid mapping are ignored, as opposed to when a valid association exists but the other output is missing.

required
If False or 0 and the expected output from the other golem script is missing, continue processing rather than raising an exception. If specified as a positive integer, it means at least that many of the other inputs should exist, otherwise an exception is thrown. By default, at least one input is required. It is up to the developer to handle missing inputs (i.e. empty template tags) appropriately in these cases.
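
For example, a hedged sketch that consumes a hypothetical external script’s host_set output, covering the local interval and mapping the external %(my_sensor)s loop onto the local %(sensor)s loop:

>>> script.add_golem_input('host_inventory.py', 'host_set',
...     cover=True,
...     join={'my_sensor': 'sensor'})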

netsa.script.golem.add_input_group(name : str, group : list)

Group the given inputs under the provided tag as a single GolemArgs object. The result can be used in the same way as the regular inputs.

netsa.script.golem.add_output_group(name : str, group : list)

Group the given outputs under the provided tag as a single GolemArgs object. The result can be used in the same way as the regular outputs, including as inputs to other golem scripts.

netsa.script.golem.add_query_handler(name : str, query_handler : func)

Define a query handler under tag key name to be processed by the callable query_handler. The output will only be generated dynamically when specifically requested via query-related parameters.

The provided callable will be passed the name of this query and a ‘tags’ dictionary for use with templates. An additional tag %(golem_query_tgt)s will be provided in the standard dictionary; it contains the output path for this query. The function is responsible for creating this output.
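
A minimal sketch of a handler (the query name and the note it writes are purely illustrative); golem supplies the destination path under the golem_query_tgt tag and the handler is responsible for creating that file:

>>> def bin_note(name, tags):
...     with open(tags['golem_query_tgt'], 'w') as out:
...         out.write("%s query for bin %s\n" % (name, tags['golem_bin_iso']))
...
>>> script.add_query_handler('bin_note', bin_note)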

netsa.script.golem.add_self_input(name : str, output_name : str[, count : int, offset : int, span : timedelta])

Specify an input template tag, under key name, associated with this script’s own output from prior intervals. Since output tag names will necessarily collide, name is the new tag name for the input and output_name is the name of the output template.

Optional keyword arguments:

count
Same as with add_golem_input
offset
Same as with add_golem_input, except defaults to -1. If self-referencing for purposes of delta-encoding, 0 should probably be specified.
span
Same as with add_golem_input

Assuming a local loop over the template tag %(sensor)s, the following example:

>>> script.add_self_input('result', 'prior_result')

...is equivalent to this:

>>> script.add_golem_input(script.get_script_path(), 'prior_result',
...     output_name='result',
...     offset=-1,
...     join_on=['sensor'],
...     required=False)

Note

If this output happens to have a scope, the developer should ensure that this self-reference means what is intended for ‘most recent’ output.

For example, if this golem script has an interval of 1 day, but the output has a scope of 28 (days), the example above would capture the prior 28 days of output due to the offset of -1.

If the script is delta-encoding its current result with its own prior results, however, what is probably desired is the prior 27 days, in which case offset should be specified as 0. The 28th day in this scope scenario is the very result currently being generated, which does not exist yet, and therefore will not appear in the template as a ‘prior’ result. After processing is complete, however, it will appear in the collected outputs if a different golem script references this interval as input.

netsa.script.golem.add_golem_source(path : str)

Adds a directory within which to search for other golem scripts. These directories are only searched if the external script is specified as a relative path. Directories are searched in the following order:

  1. Paths within the colon-separated list of directories in the GOLEM_SOURCES environment variable
  2. Any paths added by this function (multiple invocations allowed)
  3. This script’s own directory, as reported by the get_script_dir function
  4. This script’s home directory, as reported by the get_home function, if different from the script directory

netsa.script.golem.get_script_path() → str

Returns the normalized absolute path to this script.

netsa.script.golem.get_script_dir() → str

Returns the normalized absolute path to the directory in which this script resides.

netsa.script.golem.get_home() → str

Returns the current value of this script’s home path if it has been set. Otherwise, defaults to the contents of the GOLEM_HOME environment variable (if set), or finally, the directory in which this script resides.

netsa.script.golem.get_repository() → str

Returns the current path for this script’s data output repository, or None if not set.

Parameter Functions

The following functions are used to enable and modify the standard command-line parameters available for golem scripts. No command line parameters are enabled by default, so at least one of these should be invoked in a typical golem script:

netsa.script.golem.add_golem_params([without_params : str list])

Enables all golem command line parameters. Equivalent to individually invoking each of add_golem_basic_params, add_golem_repository_params, and add_golem_query_params. Optionally accepts a list of parameters to exclude.

netsa.script.golem.add_golem_basic_params([without_params : str list])

Enables basic golem command line parameters. Optionally accepts a list of parameters to exclude.

netsa.script.golem.add_golem_query_params([without_params : str list])

Enables query-related golem command line parameters. Optionally accepts a list of parameters to exclude.

netsa.script.golem.add_golem_repository_params([without_params : str list])

Adds repository-related golem command line parameters. Optionally accepts a list of parameters to exclude.

netsa.script.golem.add_golem_param(name : str[, alias : str])

Enables a particular golem command line parameter. Accepts the same optional keyword parameters as the modify_golem_param function, with the exception of enabled, which is implied. An example is the alias parameter, which can be provided to change the default parameter string (for example, aliasing --last-date to --date in cases where date ranges are not desired).

netsa.script.golem.modify_golem_param(name : str[, enabled : bool, alias : str, help : str, ...])

Modifies the settings for the given golem script parameter. Accepts new values for the golem-specific keywords enabled and alias, along with the usual netsa.script parameter keywords (e.g. help, required, default, default_help, description, mime_type).
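
For instance, a hedged sketch that enables the repository parameters and exposes --last-date as --date (this assumes parameters are keyed by their long option names, as in the alias example above):

>>> script.add_golem_repository_params()
>>> script.add_golem_param('last-date', alias='date')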

Processing and Status Functions

The following functions are intended for use within the main function during processing or examination of status.

netsa.script.golem.execute(func)

Executes the main function of a golem script. This should be called as the last line of any golem script, with the script’s main function (whatever it might be named) as its only argument.

Warning

It is important that most, if not all, actual work the script does is done within this function. Golem scripts (as with all NetSA Scripting Framework scripts) may be loaded in such a way that they are not executed, but merely queried for metadata instead. If the golem script performs significant work outside of the main function, metadata queries will no longer be efficient. Golem scripts must use this execute function rather than netsa.script.execute.

netsa.script.golem.process([golem_view : GolemView])

Returns a GolemProcess wrapper around the given golem view, which defaults to the main script view. The result behaves much like a GolemTags object. Iterating over it returns a dictionary of resolved template tags while performing system level interactions (such as checking for input existence and creating output directories and/or files) in preparation for whatever processing the developer specifies in the processing loop. Views that have already completed processing are ignored.

Also takes an optional ‘exception_handler’ keyword argument which must be a function that accepts an exception and a golem view object (advanced use only).

netsa.script.golem.loop([golem_view : GolemView])

Returns a GolemTags view of the given golem view, which defaults to the main script view. No system level processing happens while iterating over or interacting with this view.

netsa.script.golem.inputs([golem_view : GolemView])

Returns a GolemInputs view of the given golem view, which defaults to the main script view. No system level processing happens while iterating over or interacting with this view.

netsa.script.golem.outputs([golem_view : GolemView])

Returns a GolemOutputs view of the given golem view, which defaults to the main script view. No system level processing happens while iterating over or interacting with this view.

netsa.script.golem.is_complete([golem_view : GolemView])

Examines the status of the outputs for each processing interval of the optionally provided GolemView object, which defaults to the main script view. If all appear to be complete, returns True, otherwise False.

netsa.script.golem.script_view()

Returns the currently defined global GolemView object.

netsa.script.golem.current_view([golem_view : GolemView])

Returns a version of the given GolemView object, which defaults to the main script view, for the most recent interval available.
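
As a small sketch, these status functions can gate extra work in the main function; build_latest_link below is a hypothetical helper:

>>> if script.is_complete(script.current_view()):
...     build_latest_link()    # e.g. refresh a 'latest' symlink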

Utility Functions

The following functions provide some SiLK-specific tools and other potentially useful features for script authors.

netsa.script.golem.get_sensors() → str list

Retrieves a list of sensors as defined in the local SiLK repository configuration.

netsa.script.golem.get_sensor_group(sensor : str) → str

Convenience function, suitable for passing as a value for the group_by parameter in add_sensor_loop, for extracting a sensor ‘group’ out of a sensor name. Groups are determined by extracting prefixes made of non-digit ‘word’ characters (excluding ‘_’). For example, two sensors called ‘LAB0’ and ‘LAB1’ would be grouped under ‘LAB’.

netsa.script.golem.get_sensors_by_group([grouper : func, sensors : str list])

Convenience function that uses the callable grouper to construct a tuple of named pairs of the form (group_name, members) suitable for use in constructing a dictionary. Defaults to the get_sensor_group function.
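
For example, the resulting pairs can be fed directly to dict() to map each group label to its member sensors:

>>> groups = dict(script.get_sensors_by_group())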

netsa.script.golem.get_args() → GolemArgs

Returns a GolemArgs object containing any non-parameter arguments that were provided to the script on the command line.

Additional Functions

Please see the netsa.script documentation for details regarding other functions available within netsa.script.golem.

Usage of Tags, Loops, and Templates

Golem is a templating engine. Based on a script’s configuration, the processing loop will iterate over time bins (processing intervals) and the values of any other loops defined for tags in the script. For each iteration, a dictionary of template tags is produced for use with such utilities as netsa.util.shell. A typical golem script looks something like this:

#!/usr/bin/env python

from netsa.script import golem
from netsa.util import shell

# golem configuration and command line parameter
# configuration up here

def main():

    # All 'real' work should happen in this function.
    # Before invoking the processing loop, perhaps do some
    # configuration and prep work

    for tags in golem.process():

        # Set up per-iteration prep work, such as perhaps
        # some temp files

        # maybe modify contents of tags
        tags['temp_file'] = ...

        # set up command templates

        cmd1 = ...
        cmd2 = ...
        ...

        # run the commands
        shell.run_parallel(cmd1, cmd2, vars=tags)

# pass main function to module for invocation and handling
golem.execute(main)

The following template tags are automatically available to each iteration over a golem processing loop:

Tag Name                    Contents
golem_name                  golem script name
golem_suite                 suite name, if any
golem_span                  timedelta obj for span
golem_interval              timedelta obj for interval
golem_span_iso              iso string repr of golem_span
golem_interval_iso          iso string repr of golem_interval
golem_repository            data directory for this script

golem_bin_date              start datetime for this interval
golem_bin_year              interval datetime component ‘year’
golem_bin_month             interval datetime component ‘month’
golem_bin_day               interval datetime component ‘day’
golem_bin_hour              interval datetime component ‘hour’
golem_bin_second            interval datetime component ‘second’
golem_bin_microsecond       interval datetime component ‘microsecond’
golem_bin_iso               iso string repr for golem_bin_date
golem_bin_basic             iso basic string repr for golem_bin_date
golem_bin_silk              silk string repr for golem_bin_date

golem_start_date            start datetime for this span
golem_start_year            start datetime component ‘year’
golem_start_month           start datetime component ‘month’
golem_start_day             start datetime component ‘day’
golem_start_hour            start datetime component ‘hour’
golem_start_second          start datetime component ‘second’
golem_start_microsecond     start datetime component ‘microsecond’
golem_start_iso             iso string repr for golem_start_date
golem_start_basic           iso basic string repr for golem_start_date
golem_start_silk            silk string repr for golem_start_date

golem_end_date              end datetime for this span
golem_end_year              end datetime component ‘year’
golem_end_month             end datetime component ‘month’
golem_end_day               end datetime component ‘day’
golem_end_hour              end datetime component ‘hour’
golem_end_second            end datetime component ‘second’
golem_end_microsecond       end datetime component ‘microsecond’
golem_end_iso               iso string repr for golem_end_date
golem_end_basic             iso basic string repr for golem_end_date
golem_end_silk              silk string repr for golem_end_date

golem_next_bin_date         next bin datetime for this span
golem_next_bin_year         next bin datetime component ‘year’
golem_next_bin_month        next bin datetime component ‘month’
golem_next_bin_day          next bin datetime component ‘day’
golem_next_bin_hour         next bin datetime component ‘hour’
golem_next_bin_second       next bin datetime component ‘second’
golem_next_bin_microsecond  next bin datetime component ‘microsecond’
golem_next_bin_iso          iso string repr for golem_next_bin_date
golem_next_bin_basic        iso basic string repr for golem_next_bin_date
golem_next_bin_silk         silk string repr for golem_next_bin_date

golem_view                  the GolemView object which produced these tags
golem_inputs                a collected dictionary of all inputs that are defined in the tags
golem_outputs               a collected dictionary of all outputs that are defined in the tags

For intervals, time bins are bounded by golem_bin_date and golem_end_date. Spans, on the other hand, are bounded by golem_start_date and golem_end_date. For details on how intervals and spans relate to one another, and on the precision of the formatted string representations, see Intervals and Spans Explained.

In addition to the standard golem tags defined above, all other tags, loops, inputs, and outputs defined in the initial script configuration are available for use in templates. For example, assume a script makes the following declarations:

script.add_tag('in_types',  'in,inweb')
script.add_tag('out_types', 'out,outweb')
script.add_tag('month', "%(golem_bin_month)02d/%(golem_bin_year)d")
script.add_sensor_loop()
script.add_flow_tag('in_flow',  flow_type='in_types')
script.add_flow_tag('out_flow', flow_type='out_types')
script.add_output_template('juicy_set',
    "%(golem_name)s/%(sensor)s/"
    "%(golem_name)s.%(sensor)s.%(golem_start_iso)s.set",
    description="Target 'juicy' set to generate.",
    mime_type='application/x-silk-ipset')

During each iteration of the processing loop, the tags dictionary will now include the following additional template entries:

in_types
"in,inweb"
out_types
"out,outweb"
month
"%(golem_bin_month)02d/%(golem_bin_year)d" % tags

sensor
the current iteration value of script.get_sensors()
in_flow
Flow_params(
    start_date = tags['golem_start_date'],
    end_date   = tags['golem_end_date'],
    sensors    = tags['sensor'],
    flow_type  = tags['in_types'])
out_flow
Flow_params(
    start_date = tags['golem_start_date'],
    end_date   = tags['golem_end_date'],
    sensors    = tags['sensor'],
    flow_type  = tags['out_types'])
juicy_set
"%(golem_name)s/%(sensor)s/"                             \
"%(golem_name)s.%(sensor)s.%(golem_start_iso)s.set" % tags

Sometimes, depending on how loops and dependencies are arranged and how views are being manipulated, an input or output tag for the current view might contain multiple values (e.g. multiple filenames). In these cases, the resolved values are bundled into a GolemArgs object, which in turn resolves as a string of paths separated by spaces in the final command template.

Intervals and Spans Explained

Intervals and spans represent two different concepts.

An interval is a processing interval which represents how frequently the script is intended to produce results. This is roughly analogous to how frequently the script might be invoked via a cron job, except that golem scripts will back-fill missing results upon request and ignore intervals that appear to already have results present. An interval is always represented by the first timestamp contained within the interval.

A span, on the other hand, is a data window which represents how much input data the script is expected to consume, whether it be from a SiLK repository, results from other golem scripts, or other sources.

By default, the span of a golem script is the same as its interval. Not much surprising happens when the values are equal. They can be different, however. For example, a script might have a weekly interval yet consume 4 weeks’ worth of data for each of those weeks. Alternatively, a script might run every 4 weeks yet consume only a day’s worth of data, akin to a monthly snapshot.

Intervals are anchored relative to a fixed epoch: midnight of the first Monday after January 1st, 1970, which was January 5th. Weeks therefore begin with Monday, and multiples of weeks are always relative to that particular Monday. If another day of the week is desired, use the set_skew configuration function.

Spans are always anchored relative to the end of the processing interval.

In the tags dictionary provided for each processing loop, the interval is represented by the golem_bin_date entry and the span is represented by golem_start_date and golem_end_date. Given a 3 week interval and a 4 week span, for example, these values are aligned like so:

interval:             bin_date             next_bin_date
                         |                       |
                 |-------|-------|-------|-------|
                 |                               |
span:       start_date                        end_date

Note that end_date is not inclusive—its actual value is the value of next_bin_date minus one millisecond. next_bin_date, on the other hand, can be handy if you want to represent your results files by the end of the processing interval (“as of” next_bin_date as opposed to “beginning with” bin_date).

Each of these entries is represented by a datetime.datetime object along with an assortment of formatted string representations. If both the interval and span have a magnitude of at least a day, the formatted string variations look like so:

Variation   Format
iso         YYYY-MM-DD
basic       YYYYMMDD
silk        YYYY/MM/DD

If either the interval or span is less than a day, hours are included:

Variation   Format
iso         YYYY-MM-DDTHH
basic       YYYYMMDDTHH
silk        YYYY/MM/DDTHH

In all of the examples covered in this documentation, result templates are based on the golem_bin_date values, i.e. the processing interval. As the diagram above illustrates, this naming does not necessarily make it obvious what span of data is represented in the results. It is up to script authors to decide how to name their results, but they should choose a convention, stay consistent with it, and document the decision.
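
For example, a script that prefers “as of” naming could base its output template on the end of the interval instead (a sketch; the path layout is illustrative):

script.add_output_template('weekly_set',
    "weekly/weekly.asof-%(golem_next_bin_iso)s.set",
    mime_type='application/x-silk-ipset')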

Examples

Trivial Example

The following is a simple example using the Golem API that demonstrates basic templating. The script monitors an “incoming” directory for daily text files containing IP addresses and converts them into rwset files. A line-by-line explanation follows the script:

#!/usr/bin/env python

from netsa.script.golem import script
from netsa.util import shell

script.set_title('Daily IP Sets')
script.set_description("""
    Convert daily IP lists into rwset files.
""")
script.set_version("0.1")
script.set_contact("H.P. Fnord <fnord@example.com>")
script.set_authors(["H.P. Fnord <fnord@example.com>"])

script.set_name('daily_set')
script.set_interval(days=1)
script.set_span(days=1)

script.add_golem_params()

script.set_repository('dat')

script.add_input_template('daily_txt',
    "/data/incoming/daily.%(golem_bin_iso)s.txt",
    description="Daily IP text files",
    mime_type='text/plain')

script.add_output_template('daily_set',
    "daily/daily.%(golem_bin_iso)s.set",
    description="Daily IP Sets",
    mime_type='application/x-silk-ipset')

def main():
    cmd = "rwsetbuild %(daily_txt)s %(daily_set)s"
    for tags in script.process():
        shell.run_parallel(cmd, vars=tags)

script.execute(main)

Here is the breakdown, line by line:

from netsa.script.golem import script
from netsa.util import shell

The first two lines import golem itself as well as netsa.util.shell, which assists in constructing command line templates and executing the resulting system commands and pipelines.

Next are some lines for configuring meta information about the script:

script.set_title('Daily IP Sets')
script.set_description("""
    Convert daily IP lists into rwset files.
""")
script.set_version("0.1")
script.set_contact("H.P. Fnord <fnord@example.com>")
script.set_authors(["H.P. Fnord <fnord@example.com>"])

Setting the title, description, and other meta-data of the script is the same as with the regular netsa.script metadata functions.

Now for the golem-specific configuration:

script.set_name('daily_set')

Though optional, every golem script should have a short name, suitable for inclusion within directory paths and filenames. It will be made available for use in templates as the %(golem_name)s tag. For groups of related scripts, the set_suite_name function is also available.

Next, the script must be told the size of its processing intervals and the size of its data window:

script.set_interval(days=1)
script.set_span(days=1)

These two parameters, interval and span, are the core configuration parameters for any golem script. The interval represents ‘how often’ this script is expected to generate results. Typically this would correspond to the schedule by which the script is invoked via a cron job. The span represents how far back the script will look for input data. The interval and span do not have to match as they do here—for example, a script might have a ‘daily’ interval which processes one week of data for each of those days.

Next, the script author will almost always want to enable the standard golem command line parameters:

script.add_golem_params()

There are three general categories of parameters (basic, repository-related, and query-related) which can be separately enabled; the line above enables all of these.

Next, the script can be told where its results will live:

script.set_repository('dat')

This line defines the location of the script’s output data repository. Note that some scripts can be designed for query purposes only and will therefore not need to define a repository location.

If the path provided is a relative path, it is assumed to be relative to the script’s home path. See the set_default_home function for details on how the home path is configured or determined.

The script then defines a template for its input data:

script.add_input_template('daily_txt',
    "/data/incoming/daily.%(golem_bin_iso)s.txt",
    description="Daily IP text files",
    mime_type='text/plain')

This template assumes that the incoming files will correspond to the standard ISO-formatted datetime (‘YYYY-MM-DD’).

Next, an output template is defined:

script.add_output_template('daily_set',
    "daily/daily.%(golem_bin_iso)s.set",
    description="Daily IP Sets",
    mime_type='application/x-silk-ipset')

For each day of processing, a single rwset file will be generated. Once again, the standard ISO-formatted date is chosen for the template.

In both the input and output templates, the script uses the tag %(golem_bin_iso)s. This tag is an implicit template tag, automatically available for use within golem scripts. Other timestamps are also available, including portions of each timestamp (such as year, month, and day) for constructing more elaborate templates. For more details about the rest of these ‘implicit’ template tags, see Usage of Tags, Loops, and Templates.

Now for the actual processing loop:

def main():
    cmd = "rwsetbuild %(daily_txt)s %(daily_set)s"
    for tags in script.process():
        shell.run_parallel(cmd, vars=tags)

script.execute(main)

All ‘real work’ in a golem script should take place in a main function, which is subsequently passed to the execute function. In order for golem scripts to work properly, this must always be the case. Using the netsa.script.execute function instead will not work for a golem script.

The main entry point for looping across template values is the process function. This construct does a number of things, including creating output paths and checking for the existence of required inputs. On each iteration, a dictionary of template tags with resolved values is provided.

Every golem script has an intrinsic loop over processing intervals. In this example, our processing interval is once a day. If, via the command line parameters --first-date and --last-date, a window of 1 week had been specified, it would result in seven main iterations with the tag %(golem_bin_iso)s corresponding to a string representation of the beginning timestamp of each daily interval covered in the requested range. Unless told otherwise, a golem script will skip iterations which have already generated results.

Basic Example

The Trivial Example defined an input template that described daily text files without any explanation about where or how those files were produced. The following example assumes that the resulting set of addresses represent “observed internal hosts” and illustrates how the daily set files might be produced directly from queries to a SiLK repository. In order to do so, the script relies on a couple of SiLK command line tools:

#!/usr/bin/env python

from netsa.script.golem import script
from netsa.util import shell
from netsa import files

script.set_title('SiLK Daily Active Internal Hosts')
script.set_description("""
    Daily inventory of observed internal host activity.
""")
script.set_version("0.1")
script.set_contact("H.P. Fnord <fnord@example.com>")
script.set_authors(["H.P. Fnord <fnord@example.com>"])

script.set_name('daily_set')
script.set_interval(days=1)
script.set_span(days=1)

script.add_golem_params()

script.set_repository('dat')

script.add_tag('in_types',  'in,inweb')
script.add_tag('out_types', 'out,outweb')

script.add_output_template('internal_set',
    "internal/daily/daily.%(golem_bin_iso)s.set",
    description="Daily Internal host activity",
    mime_type='application/x-silk-ipset')

def main():
    for tags in script.process():
        tags['out_fifo'] = files.get_temp_pipe_name()
        tags['in_fifo']  = files.get_temp_pipe_name()
        cmd1 = [
            "rwfilter --start-date=%(golem_start_silk)s"
                " --end-date=%(golem_end_silk)s"
                " --type=%(in_types)s"
                " --proto=0-255 --pass=stdout",
            "rwset --sip=%(out_fifo)s"]
        cmd2 = [
            "rwfilter --start-date=%(golem_start_silk)s"
                " --end-date=%(golem_end_silk)s"
                " --type=%(out_types)s"
                " --proto=0-255 --pass=stdout",
            "rwset --dip=%(in_fifo)s"]
        cmd3 = [
            "rwsettool --union --output-path=%(internal_set)s"
                " %(out_fifo)s %(in_fifo)s"]
        shell.run_parallel(cmd1, cmd2, cmd3, vars=tags)

script.execute(main)

There are a couple of new techniques to note with this script. Below the standard meta-configuration are the following lines:

script.add_tag('in_types',  'in,inweb')
script.add_tag('out_types', 'out,outweb')

These two statements add a couple of simple template tags. All templates will now have access to the tags %(in_types)s and %(out_types)s, which will resolve to the strings 'in,inweb' and 'out,outweb', respectively. This is equivalent to manually adding these entries to the tags dictionary down in the processing loop; predefining them here is a matter of style preference.

Next comes the main processing loop, which illustrates some more advanced usage of the netsa.util.shell module:

def main():
    for tags in script.process():
        tags['out_fifo'] = files.get_temp_pipe_name()
        tags['in_fifo']  = files.get_temp_pipe_name()

As mentioned earlier, a tags dictionary is provided for each processing interval. In the two lines within the processing loop, some additional tags are added to the dictionary. The lines illustrate how the netsa.files module can be used to create temporary named pipes so that data can be fed from one command to another.

These new template additions are then used in the construction of some command templates used to pull data from the SiLK repository:

cmd1 = [
    "rwfilter --start-date=%(silk_start)s"
        " --end-date=%(silk_end)s"
        " --type=%(in_types)s"
        " --proto=0-255 --pass=stdout",
    "rwset --sip=%(out_fifo)s"]
cmd2 = [
    "rwfilter --start-date=%(silk_start)s"
        " --end-date=%(silk_end)s"
        " --type=%(out_types)s"
        " --proto=0-255 --pass=stdout",
    "rwset --dip=%(in_fifo)s"]
cmd3 = [
    "rwsettool --union --output-path=%(internal_set)s"
        " %(out_fifo)s %(in_fifo)s"]
shell.run_parallel(cmd1, cmd2, cmd3, vars=tags)

The first two command templates use the tags defined earlier, %(in_types)s and %(out_types)s, along with the date ranges associated with each processing loop. Each of these commands sends its results into its respective named pipe. Finally, the third command uses rwsettool to take the union of the output from these named pipes and produce the rwset file defined by the output template. All three commands are run in parallel using the facilities of the netsa.util.shell module.

Basic Golem Dependency Example

The Basic Example provides a golem script that produces daily rwset files produced from queries to a SiLK repository. What if a weekly, rather than daily, summary of IP addresses is desired? One option would be to adjust the processing interval and span of the script, thereby pulling an entire week’s worth of data from SiLK in the calls to rwfilter. An alternative is to utilize the daily sets from the original script as inputs and construct a weekly summary via the union of the daily sets for the week in question.

One of the core features of golem scripts is that they can be assigned as inputs to one another. Details such as how often the inputs are produced, the naming scheme, and synchronization across time bins are sorted out automatically by the golem scripts involved. Assume that the script in the Basic Example is called daily_set.py. The following example illustrates how to configure the dependency on this external script:

#!/usr/bin/env python

from netsa.script.golem import script
from netsa.util import shell

script.set_title('Weekly Active Internal Host Set')
script.set_description("""
    Aggregate daily internal activity sets over the last week.
""")
script.set_version("0.1")
script.set_contact("H.P. Fnord <fnord@example.com>")
script.set_authors(["H.P. Fnord <fnord@example.com>"])

script.set_name('weekly_set')
script.set_interval(weeks=1)
script.set_span(weeks=1)

script.add_golem_params()

script.set_repository('dat')

script.add_golem_input('daily_set.py', 'daily_set', cover=True)

script.add_output_template('weekly_set',
    "weekly/weekly.%(golem_bin_iso)s.set",
    description="Aggregated weekly sets.",
    mime_type='application/x-silk-ipset')

def main():
    cmd = "rwsettool --union --output-path=%(weekly_set)s %(daily_set)s"
    for tags in script.process():
        shell.run_parallel(cmd, vars=tags)

script.execute(main)

The first thing to note is that this script has a different interval and span:

script.set_interval(weeks=1)
script.set_span(weeks=1)

The script will produce weekly results and will expect to consume a week of data while doing so.

The input template is now defined as a dependency on the external script like so:

script.add_golem_input('daily_set.py', 'daily_set', cover=True)

The first argument is the name of the external script. For details on how relative paths to scripts are resolved, see the add_golem_source function.

The second argument is the name of the desired output, as defined within that external script. Golem scripts can have multiple outputs, so the specific output of interest must be named explicitly.

The third argument, the cover parameter, controls how this external output is synchronized across local processing intervals. In this case, the local processing interval of 1 week will be ‘covered’ by 7 days’ worth of daily inputs.

After the configuration of the weekly output template comes the main processing loop:

def main():
    cmd = "rwsettool --union --output-path=%(weekly_set)s %(daily_set)s"
    for tags in script.process():
        shell.run_parallel(cmd, vars=tags)

script.execute(main)

The output tag %(weekly_set)s is based on the %(golem_bin_iso)s timestamp, which in this case is the date of the Monday on which each week in question begins. The %(daily_set)s tag represents 7 days of results; it resolves to 7 individual filenames separated by whitespace in the eventual call to rwsettool.
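
For illustration only (the daily file names are hypothetical; the actual paths come from the output template defined in daily_set.py), the resolved command for the week beginning Monday 2011-01-03 might look something like:

rwsettool --union --output-path=weekly/weekly.2011-01-03.set
    daily.2011-01-03.set daily.2011-01-04.set daily.2011-01-05.set
    daily.2011-01-06.set daily.2011-01-07.set daily.2011-01-08.set
    daily.2011-01-09.set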

Loop, Interval and Span Example

Golem scripts can define additional loops in addition to the intrinsic loop over processing intervals. The following script is a modification of the Basic Example script, which built daily inventories by directly querying the SiLK repository. Rather than construct a monolithic inventory across all sensors, this version constructs inventories on a per-sensor basis by defining a template loop over a list of sensor names. Finally, it illustrates the difference between intervals and spans by using a less frequent interval and a larger data window:

#!/usr/bin/env python

from netsa.script.golem import script
from netsa.util import shell
from netsa import files

script.set_title('Active Internal Hosts')
script.set_description("""
    Per-sensor inventory of observed internal host activity over a
    four week window of observation, generated every three weeks.
""")
script.set_version("0.1")
script.set_contact("H.P. Fnord <fnord@example.com>")
script.set_authors(["H.P. Fnord <fnord@example.com>"])

script.set_name('internal')
script.set_interval(weeks=3)
script.set_span(weeks=4)

script.add_golem_params()

script.set_repository('dat')

script.add_tag('in_types',  'in,inweb')
script.add_tag('out_types', 'out,outweb')

script.add_loop('sensor', ["S0", "S1", "S2", "S3"])

script.add_output_template('internal_set',
    "internal/internal.%(sensor)s.%(golem_bin_iso)s.set",
    description="Internal host activity",
    mime_type='application/x-silk-ipset')

def main():
    for tags in script.process():
        tags['out_fifo'] = files.get_temp_pipe_name()
        tags['in_fifo']  = files.get_temp_pipe_name()
        cmd1 = [
            "rwfilter --start-date=%(golem_start_silk)s"
                " --end-date=%(golem_end_silk)s"
                " --type=%(in_types)s"
                " --sensors=%(sensor)s"
                " --proto=0-255 --pass=stdout",
            "rwset --sip=%(out_fifo)s"]
        cmd2 = [
            "rwfilter --start-date=%(golem_start_silk)s"
                " --end-date=%(golem_end_silk)s"
                " --type=%(out_types)s"
                " --sensors=%(sensor)s"
                " --proto=0-255 --pass=stdout",
            "rwset --dip=%(in_fifo)s"]
        cmd3 = [
            "rwsettool --union --output-path=%(internal_set)s"
                " %(out_fifo)s %(in_fifo)s"]
        shell.run_parallel(cmd1, cmd2, cmd3, vars=tags)

script.execute(main)

The first thing to note is the new interval and span definitions:

script.set_interval(weeks=3)
script.set_span(weeks=4)

The script will produce results every 3 weeks and will expect to consume 4 weeks of data while doing so; consecutive 4-week data windows therefore overlap by one week. This is the first example in which the interval and span are not equal. For more detail on the implications of this, see Intervals and Spans Explained.

Further down in the script is the new loop definition:

script.add_loop('sensor', ["S0", "S1", "S2", "S3"])

With the addition of this line, for each 3-week processing interval the script will return a separate tags dictionary for each sensor, setting the value of the sensor entry accordingly. Logically speaking, this is equivalent to having two nested ‘for’ loops, one over intervals and one over sensors.
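
As a conceptual sketch only (this is not the actual golem implementation), the iteration produced by script.process() for this configuration behaves roughly like the following nested loops:

# Conceptual sketch, not the real implementation: the outer loop is over
# the 3-week intervals, the inner loop over the configured sensors.
def conceptual_iteration(interval_tags, sensors):
    for tags in interval_tags:        # intrinsic loop over intervals
        for sensor in sensors:        # loop added by add_loop('sensor', ...)
            per_sensor = dict(tags)
            per_sensor['sensor'] = sensor
            yield per_sensor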

This newly defined %(sensor)s tag is then used in the modified definition of the output template:

script.add_output_template('internal_set',
    "internal/internal.%(sensor)s.%(golem_bin_iso)s.set",
    description="Internal host activity",
    mime_type='application/x-silk-ipset')

Next comes the main processing loop. Note that, aside from the addition of the --sensors=%(sensor)s option to each rwfilter invocation, it is essentially the same as the processing loop in the earlier incarnation of the script. The interval and span were changed, an extra loop was introduced, and the output template was modified, but the essential processing logic remains unchanged.

SiLK Integration Example

The Golem API and NetSA Scripting Framework include a number of convenience functions and classes for interacting with a SiLK repository. The Loop, Interval and Span Example can be simplified using a few of these features as illustrated below:

#!/usr/bin/env python

from netsa.script.golem import script
from netsa.util import shell
from netsa import files

script.set_title('Active Internal Hosts')
script.set_description("""
    Per-sensor inventory of observed internal host activity over a
    four week window of observation, generated every three weeks.
""")
script.set_version("0.2")
script.set_contact("H.P. Fnord <fnord@example.com>")
script.set_authors(["H.P. Fnord <fnord@example.com>"])

script.set_name('internal')
script.set_interval(weeks=3)
script.set_span(weeks=4)

script.add_golem_params()

script.set_repository('dat')

script.add_tag('in_types',  'in,inweb')
script.add_tag('out_types', 'out,outweb')

script.add_sensor_loop()

script.add_flow_tag('in_flow',  flow_type='in_types')
script.add_flow_tag('out_flow', flow_type='out_types')

script.add_output_template('internal_set',
    "internal/internal.%(sensor)s.%(golem_bin_iso)s.set",
    description="Internal host activity",
    mime_type='application/x-silk-ipset')

def main():
    for tags in script.process():
        tags['out_fifo'] = files.get_temp_pipe_name()
        tags['in_fifo']  = files.get_temp_pipe_name()
        cmd1 = [
            "rwfilter %(out_flow)s --proto=0-255 --pass=stdout",
            "rwset --sip=%(out_fifo)s"]
        cmd2 = [
            "rwfilter %(in_flow)s --proto=0-255 --pass=stdout",
            "rwset --dip=%(in_fifo)s"]
        cmd3 = [
            "rwsettool --union --output-path=%(internal_set)s"
                " %(out_fifo)s %(in_fifo)s"]
        shell.run_parallel(cmd1, cmd2, cmd3, vars=tags)

script.execute(main)

The first difference to note is that rather than manually defining a loop over sensors, the following shorthand is used:

script.add_sensor_loop()

This line sets up a loop on the template tag sensor as before, but the list of sensors is automatically determined from the SiLK repository itself (see the mapsid command). The script also remembers that this particular loop involves sensors.

The next modification to note is the definition of two special SiLK-related compound tags:

script.add_flow_tag('in_flow',  flow_type='in_types')
script.add_flow_tag('out_flow', flow_type='out_types')

These statements create template entries bound to netsa.script.Flow_params objects which serve to simplify the construction of rwfilter command line templates.

Each call to add_flow_tag implicitly binds the start_date and end_date object attributes to the value of the template tags golem_start_silk and golem_end_silk. Given that a sensor-specific loop was declared earlier, the function calls will also bind the sensors attribute to the value of the sensor tag for each loop.

Additional tags can be bound to netsa.script.Flow_params attributes using keyword arguments. In this example, the in_types and out_types tags defined earlier in the script are bound to the flow_type attribute of each object.

The rest of the script proceeds as before, except that in the processing loop the rwfilter command templates are far more compact:

cmd1 = [
    "rwfilter %(out_flow)s --proto=0-255 --pass=stdout",
    "rwset --sip=%(out_fifo)s"]
cmd2 = [
    "rwfilter %(in_flow)s --proto=0-255 --pass=stdout",
    "rwset --dip=%(in_fifo)s"]

The %(out_flow)s and %(in_flow)s tags will each expand into four parameters in the eventual command string.
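
For illustration only (the dates are hypothetical and the exact argument formatting is produced by netsa.script.Flow_params), for sensor S0 and a bin beginning 2011/01/03 the %(out_flow)s tag might expand to something along the lines of:

--start-date=2011/01/03 --end-date=2011/01/30 --type=out,outweb --sensors=S0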

Synchronization Example

The following example will build a daily inventory of internal addresses that exhibit activity on source port 25. In order to limit the pool of addresses under consideration, it will utilize the internal inventory results generated by the SiLK Integration Example. Furthermore, it will utilize some additional SiLK-related tools in order to organize results into ‘sensor groups’ rather than under individual sensors. The following assumes that the prior inventory script is called internal.py:

#!/usr/bin/env python

from netsa.script.golem import script
from netsa.util import shell
from netsa import files

script.set_title('Daily Internal Port 25 Activity')
script.set_description("""
    Daily per-sensor-group inventory of observed internal host
    activity on port 25.
""")
script.set_version("0.1")
script.set_contact("H.P. Fnord <fnord@example.com>")
script.set_authors(["H.P. Fnord <fnord@example.com>"])

script.set_name('p25')
script.set_interval(days=1)
script.set_span(days=1)

script.add_golem_params()

script.set_repository('dat')

script.add_sensor_loop(auto_group=True)

script.add_tag('out_type', 'out,outweb')
script.add_flow_tag('out_flow', flow_type='out_type')

script.add_golem_input('internal.py', 'internal_set', join_on='sensor')

script.add_output_template('p25_set',
    "p25/p25.%(sensor_group)s.%(golem_bin_iso)s.set",
    description="Daily IP set for internal port 25 activity",
    mime_type='application/x-silk-ipset')

def main():
    for tags in script.process():
        tags['in_fifo'] = files.get_temp_pipe_name()
        cmd1 = [
            "rwsettool --union"
                " --output-path=%(in_fifo)s
                " %(internal_set)s"]
        cmd2 = [
            "rwfilter %(out_flow)s"
                " --proto=6"
                " --sport=25"
                " --packets=2-"
                " --sipset=%(in_fifo)s"
                " --pass=stdout",
            "rwset --sip-set=%(p25_set)s"]
        shell.run_parallel(cmd1, cmd2, vars=tags)

script.execute(main)

First, the script is configured to generate once per day using a span of one day:

script.set_interval(days=1)
script.set_span(days=1)

Next, the sensor loop is configured:

script.add_sensor_loop(auto_group=True)

This invocation of add_sensor_loop uses a new named parameter, auto_group, which causes the loop to iterate over groups of related sensors rather than over individual sensors. Normally, a single template tag, sensor, is added. When grouping is enabled for a sensor loop, another tag, sensor_group, is added in addition to the sensor tag. So, for example, if a group labeled ‘LAB’ contains the three sensors ‘LAB0’, ‘LAB1’, and ‘LAB2’, these two template tags would expand into strings like so:

Tag                 Value
%(sensor)s          LAB0,LAB1,LAB2
%(sensor_group)s    LAB

See the add_loop function for the details of how the above features work for generic, non-sensor-related loops.

Next, the script sets up the input dependency on the script from the SiLK Integration Example, internal.py:

script.add_golem_input('internal.py', 'internal_set', join_on='sensor')

When golem scripts use other golem script results as inputs, they are automatically synchronized across processing intervals. The basic rule is to synchronize on the latest external interval whose end-point is less than or equal to the end-point of the local interval under consideration.
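
The following sketch illustrates this rule (it is a conceptual illustration, not the actual golem implementation):

from datetime import date, timedelta

# Pick the latest external bin whose end-point is less than or equal to
# the end-point of the local interval under consideration.
def sync_bin(local_end, external_bins, external_interval):
    candidates = [b for b in external_bins
                  if b + external_interval <= local_end]
    return max(candidates) if candidates else None

# A daily script synchronizing to a hypothetical 3-week external script:
external_bins = [date(2011, 1, 3), date(2011, 1, 24)]
print(sync_bin(date(2011, 2, 10), external_bins, timedelta(weeks=3)))
# -> 2011-01-03 (the 2011-01-24 bin extends past 2011-02-10)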

The synchronization of any loops other than intervals must be explicitly configured. In this case, the join_on parameter is used to indicate that the external sensor loop and local sensor loop should align on each value of the sensor tag. This synchronization happens per-sensor and does not affect the eventual sensor grouping behavior.

Next, the output template is defined. Note the use of the sensor_group tag rather than sensor:

script.add_output_template('p25_set',
    "p25/p25.%(sensor_group)s.%(golem_bin_iso)s.set",
    description="Daily IP set for internal port 25 activity",
    mime_type='application/x-silk-ipset')

Followed by the processing loop:

def main():
    for tags in script.process():
        tags['in_fifo'] = files.get_temp_pipe_name()
        cmd1 = [
            "rwsettool --union"
                " --output-path=%(in_fifo)s
                " %(internal_set)s"]
        cmd2 = [
            "rwfilter %(out_flow)s"
                " --proto=6"
                " --sport=25"
                " --packets=2-"
                " --sipset=%(in_fifo)s"
                " --pass=stdout",
            "rwset --sip-set=%(p25_set)s"]
        shell.run_parallel(cmd1, cmd2, vars=tags)

script.execute(main)

Since sensors are being grouped, the %(internal_set)s tag for each loop potentially represents multiple input files, one for each individual sensor. The first command defines a template for rwsettool that sends a union of these per-sensor sets into the named pipe. The second command pipeline uses this merged set to filter the initial flows being examined by the rwfilter query.
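
For example (the paths are illustrative, and assume a group named LAB containing sensors LAB0, LAB1, and LAB2, as in the grouping illustration above), the %(internal_set)s tag for one daily iteration might expand to three paths drawn from the synchronized internal.py bin:

internal/internal.LAB0.2011-01-03.set internal/internal.LAB1.2011-01-03.set
internal/internal.LAB2.2011-01-03.set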

When invoked on a regular basis, this script will produce a daily subset of the most recent per-sensor-group inventory for those internal IP addresses that have exhibited activity on source port 25.

Self Dependency Example

The Synchronization Example demonstrates how to configure an input dependency on the results of another golem script. It is also possible to configure dependencies on a golem script’s own past results.

Recall that the SiLK Integration Example is configured with a 3-week interval and 4-week span. The 3-week interval was chosen because of the resource-intensive query across 4 weeks of data. While this does produce internal inventories, the information becomes potentially less accurate over time (particularly during the final few days of each 3-week processing interval).

The inventory script can be modified to consume its own outputs and produce delta-encoded results on a daily basis:

#!/usr/bin/env python

from netsa.script.golem import script
from netsa.util import shell
from netsa import files

script.set_title('Active Internal Hosts')
script.set_description("""
    Daily per-sensor inventory of observed internal host activity,
    delta-encoded using the prior four weeks of results.
""")
script.set_version("0.3")
script.set_contact("H.P. Fnord <fnord@example.com>")
script.set_authors(["H.P. Fnord <fnord@example.com>"])

script.set_name('internal')
script.set_interval(days=1)
script.set_span(days=1)

script.add_golem_params()

script.set_repository('dat')

script.add_tag('in_types',  'in,inweb')
script.add_tag('out_types', 'out,outweb')

script.add_sensor_loop()

script.add_flow_tag('in_flow',  flow_type='in_types')
script.add_flow_tag('out_flow', flow_type='out_types')

script.add_output_template('internal_set',
    "internal/internal.delta.%(sensor)s.%(golem_bin_iso)s.set",
    description="Delta set of internal host activity",
    mime_type='application/x-silk-ipset',
    scope=28)

script.add_self_input('prior_set', 'internal_set', offset=0)

def main():
    for tags in script.process():
        tags['out_fifo'] = files.get_temp_pipe_name()
        tags['in_fifo']  = files.get_temp_pipe_name()
        cmds = []
        cmds.append([
            "rwfilter %(out_flow)s --proto=0-255 --pass=stdout",
            "rwset --sip=%(out_fifo)s"])
        cmds.append([
            "rwfilter %(in_flow)s --proto=0-255 --pass=stdout",
            "rwset --dip=%(in_fifo)s"])
        cmds.append([
            "rwsettool --union --output-path=%(current_set)s"
                " %(out_fifo)s %(in_fifo)s"])
        if tags['prior_set']:
            tags['current_set'] = files.get_temp_pipe_name()
            cmds.append([
              "rwsettool --difference"
                  " --output-path=%(internal_set)s"
                  " %(current_out)s %(prior_set)s"])
        else:
            tags['current_set'] = tags['internal_set']

        shell.run_parallel(vars=tags, *cmds)

script.execute(main)

The goal is to generate a viable internal inventory on a daily basis with minimal overhead. The naive approach would be to define an interval of 1 day and leave the span at 4 weeks. This would pull 4 weeks of data every single day and construct a full inventory for that day, which is inefficient in terms of both processing and storage. Instead, this script introduces a new concept called scope. Scope is used to indicate situations where a single interval of processing does not represent a complete analysis result.

First, the basics are configured:

script.set_interval(days=1)
script.set_span(days=1)

The script produces a daily result and expects to consume a single day’s worth of ‘regular’ data while doing so. Next, the script must define its daily output template:

script.add_output_template('internal_set',
    "internal/internal.delta.%(sensor)s.%(golem_bin_iso)s.set",
    description="Delta set of internal host activity",
    mime_type='application/x-silk-ipset',
    scope=28)

This declaration shows the use of the new scope parameter. The scope indicates the number of processing interval outputs required to represent a complete result. Here, the scope is defined as 28 intervals (days in this case).

Now when other golem scripts use this script’s output as an input dependency, they will see 4 weeks of files relative to each day of interest. This also applies in cases where a golem script asks itself for prior results. An example of this is shown next:

script.add_self_input('prior_set', 'internal_set', offset=0)

This self-referential input dependency maps internal_set to a new template tag called prior_set.

By default, self-referential inputs have an offset of -1 which excludes the results for the current processing interval. In cases such as this, where the goal is delta-encoding, the offset should be 0. (The daily result being generated for the current day represents addresses not present in the last 27 days).

Next is the main processing loop:

def main():
    for tags in script.process():
        tags['out_fifo'] = files.get_temp_pipe_name()
        tags['in_fifo']  = files.get_temp_pipe_name()
        cmds = []
        cmds.append([
            "rwfilter %(out_flow)s --proto=0-255 --pass=stdout",
            "rwset --sip=%(out_fifo)s"])
        cmds.append([
            "rwfilter %(in_flow)s --proto=0-255 --pass=stdout",
            "rwset --dip=%(in_fifo)s"])
        cmds.append([
            "rwsettool --union --output-path=%(current_set)s"
                " %(out_fifo)s %(in_fifo)s"])
        if tags['prior_set']:
            tags['current_set'] = files.get_temp_pipe_name()
            cmds.append([
              "rwsettool --difference"
                  " --output-path=%(internal_set)s"
                  " %(current_out)s %(prior_set)s"])
        else:
            tags['current_set'] = tags['internal_set']

        shell.run_parallel(vars=tags, *cmds)

script.execute(main)

The core logic is similar to the earlier version. A new template tag, current_set, is added to the tags dictionary for each iteration, and its value is set in one of two ways. If no prior results are available, a regular rwset is constructed just as before. If prior results are available, the difference is taken between the current day’s results and the union of up to 27 days of prior results.

This technique allows the reconstruction of an accurate 4-week internal inventory, for any particular day, by taking the union over the 28 days ending on that day.
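
As a sketch (the sensor name and dates are illustrative; the paths follow the delta output template above), such a reconstruction could be performed with a single union over the 28 daily delta sets:

from datetime import date, timedelta
from netsa.util import shell

# Hypothetical reconstruction of the full 4-week inventory for sensor S0
# as of 2011-02-10: union the 28 daily delta sets ending on that day.
day = date(2011, 2, 10)
paths = ["internal/internal.delta.S0.%s.set" % (day - timedelta(days=i))
         for i in range(28)]
shell.run_parallel(
    ["rwsettool --union --output-path=internal.S0.full.set " + " ".join(paths)])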

Having made these changes, what now needs to be changed in the script from the Synchronization Example which depends on these internal sets as input?

Not a single thing.

The script in the Synchronization Example is already performing a union with rwsettool on the tag %(internal_set)s in order to merge data across sensors into sensor groups. Due to the scope declaration, the %(internal_set)s tag will now also include paths to the files for each of the 28 days required to reconstruct results.

Classes

GolemView

class netsa.script.golem.GolemView(golem : Golem[, first_date : datetime, last_date : datetime])

A GolemView object encapsulates a golem script model and is used to view and manipulate it in various ways. These different views are primarily accessed through the loop, outputs, and inputs methods.

Optional keyword arguments:

last_date
The interval containing this datetime object is the last to be considered for processing. (default: most recent)
first_date
The interval containing this datetime object is the first to be considered for processing. (default: last_date)
golem

The golem script model which this view manipulates.

first_bin

A datetime object representing the first processing interval for this view, as determined by the first_date and last_date parameters during construction. Defaults to last_bin.

last_bin

A datetime object representing the last processing interval for this view, as determined by the last_date and first_date parameters during construction. Defaults to the ‘most recent’ interval that does not overlap into the future, taking into account lag.

start_date

A datetime object representing the beginning of the first data span covered by this view. Spans can be larger (or smaller) than the defined interval, so this value is not necessarily equal to first_bin.

end_date

A datetime object representing the end of the last data span covered by this view. If the span is less than or equal to the interval, this is equal to last_bin + interval - 1 microsecond, otherwise it is equal to last_bin + span - 1 microsecond.

using([golem : Golem, first_date : datetime, last_date : datetime])

Return a copy of this GolemView object, optionally using new values for the following keyword arguments:

golem
Use a different golem script model.
first_date
Select a different starting time bin based on the provided datetime object.
last_date
Select a different ending time bin based on the provided datetime object.

bin_dates() → datetime iter

Provide an iterator over datetime objects representing all processing intervals represented by this view.

bins() → GolemView iter

Provide an iterator over GolemView objects for each interval represented by this view.

group_by(key : str[, ...]) → (str tuple, GolemView) iter

Returns an iterator that yields a tuple containing a primary key and a GolemView object, grouped by the provided keys. Each primary key is a tuple containing the current values of the keys provided to group_by. Iterating over the provided view objects will resolve any remaining loops that were not used for the provided keys.

by_key(key : str) → (str, GolemView) iter

Similar to group_by, but takes only a single key as an argument. Returns an iterator that yields view objects for each value of the key; iterating over the provided view objects will resolve any remaining loops, if present.

product() → GolemView iter

Fully resolve the loops defined within this view. The ‘outer’ loop is always over intervals, followed by any other loops in the order in which they were defined. Each view thus provided is fully resolved, with no loops remaining.

bin_count() → int

Return the number of intervals represented by this view, as defined by first_bin and last_bin.

loop_count() → int

Return the number of non-interval iterations represented by this view that are produced by resolving any defined loops.

sync_to(other : GolemView[, count : int, offset : int, cover=False, trail=False]) → GolemView

Given another GolemView object, return a version of self that has been synchronized to the given view object.

Optional keyword arguments:

count
Synchronize to this many intervals of the given object (default: 1)
offset
Synchronize to this many interval offsets behind the given object (default: 0)
cover
Calculate a count necessary to cover all intervals represented by the given object (overrides count and offset)
trail
Force the end_date of the new view to always be less than or equal to the end_date of the given view.

loop() → GolemTags

Return a GolemTags object representing this view.

outputs() → GolemOutputs

Return a GolemOutputs object representing this view.

inputs() → GolemInputs

Return a GolemInputs object representing this view.

__len__() → int

Return the number of fully-resolved iterations represented by this view, over intervals as well as any defined loops.

__iter__() → GolemView iter

Iterates over the views returned by the product method.
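
As an illustrative sketch only (gv stands in for a GolemView instance for a script that defines a sensor loop; how such an instance is obtained is covered in the API reference), these methods compose as follows:

# Summarize a GolemView 'gv' for a script with a 'sensor' loop.
def summarize(gv):
    # Number of time bins versus fully resolved iterations (bins x loops).
    print("%d intervals, %d iterations" % (gv.bin_count(), len(gv)))
    for key, view in gv.group_by('sensor'):
        # key is a tuple of the grouped tag values, e.g. ('S0',)
        for per_bin in view.bins():
            print("%s %s" % (key, per_bin.first_bin))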

GolemTags

class netsa.script.golem.GolemTags(golem : Golem[, first_date : datetime, last_date : datetime])

Bases: netsa.script.golem.GolemView

A GolemTags object is used to examine resolved template tags produced by looping over intervals and other defined loops.

As well as the methods and attributes of GolemView, the following additional and overridden methods are available:

tags() → dict

Return a dictionary of resolved template tags for the current view (flattens tags across the loops that would result from invoking the product method).

__iter__() → dict iter

Iterate over the views produced by the product method, yielding a dictionary of resolved template tags for each iteration.

GolemOutputs

class netsa.script.golem.GolemOutputs(golem : Golem[, first_date : datetime, last_date : datetime])

Bases: netsa.script.golem.GolemView

A GolemOutputs object is used to examine resolved output templates, either for a specific iteration or aggregated across multiple iterations.

As well as the methods and attributes of GolemView, the following additional and overridden methods are available:

expand() → GolemArgs

Returns a GolemArgs object representing all resolved output templates for the current view.

__len__() → int

Return the number of resolved output templates for the current view.

__iter__() → str iter

Iterate over each resolved output template for the current view.

GolemInputs

class netsa.script.golem.GolemInputs(golem : Golem[, first_date : datetime, last_date : datetime])

Bases: netsa.script.golem.GolemView

A GolemInputs object is used to examine resolved input templates, either for a specific iteration or aggregated across multiple iterations.

As well as the methods and attributes of GolemView, the following additional and overridden methods are available:

expand() → GolemArgs

Returns a GolemArgs object representing all resolved input templates for the current view.

members() → GolemOutputs iter

Iterate over each golem script that provides inputs for this golem script, returning each as a synchronized GolemOutputs object.

__len__() → int

Return the number of resolved input templates for the current view.

__iter__() → str iter

Iterate over each resolved input template for the current view.

GolemArgs

class netsa.script.golem.GolemArgs(item : str or str iter[, ...])

A GolemArgs object encapsulates a list of resolved input or output templates destined to be used as a value in a tags dictionary. The constructor takes any number of strings, or string iterators, and flattens them into a unique sorted tuple. Individual items can be accessed and iterated over as with a tuple. One keyword argument is accepted, sep, which is used to join the items when the object is rendered as a string; it defaults to a single space.

With a space as the separator, the object renders as a string of space-separated values and will resolve properly when passed to the netsa.util.shell module for command and pipeline execution.

Note that some file-related Python functions (such as open) will complain if passed a single-valued GolemArgs object (representing a single file name) unless it has first been explicitly converted to a string via str or by indexing element 0.

The length of a GolemArgs object is the number of items it contains; individual items can be accessed via an index, as with a list. Two objects can be added to or subtracted from one another, as with sets.
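
The following sketch illustrates the behavior described above (the file names are arbitrary):

from netsa.script.golem import GolemArgs

paths = GolemArgs("b.set", ["a.set", "c.set"])  # flattened, sorted, unique
str(paths)                          # -> "a.set b.set c.set"
len(paths)                          # -> 3
paths[0]                            # -> "a.set"
str(paths - GolemArgs("c.set"))     # -> "a.set b.set" (set-like subtraction)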

GolemProcess

class netsa.script.golem.GolemProcess(gview : GolemView[, overwrite_outputs=False, skip_complete=True, keep_empty_outputs=False, skip_missing_inputs=False, optional_inputs : dict])

A utility class for performing system-level interactions (such as checking for required inputs and pre-existing outputs, creating output paths, etc.) while iterating over the provided view.

exception_handler
Function for processing GolemException events. Takes the exception and the current GolemView as arguments.
overwrite_outputs
Delete existing outputs prior to processing. (default: False)
keep_empty_outputs
Consider zero-byte output results to be valid; otherwise they will be ignored or deleted when encountered, regardless of the value of overwrite_outputs. (default: False)

Most methods and attributes available from the GolemTags class are available through this class as well, with some behavioral changes as noted below. The following methods are in addition to those available from GolemTags:

is_complete() → bool

Returns a boolean value indicating whether processing has been completed for the intervals represented by this view.

status(label : str) → (str, bool) iter

Iterate over items within the given tag, returning a tuple containing the item string and its current status. Status is typically the size in bytes of each input or output, or None if it does not exist.

The following methods have slightly different behavior than that of GolemTags:

using([gview : GolemView, overwrite_outputs=False, skip_complete=True, keep_empty_outputs=False, skip_missing_inputs=False, create_slots=True, optional_inputs : dict])

Return a copy of this GolemProcess object, possibly replacing certain attributes corresponding to the keyword arguments in the constructor.

product() → GolemProcess iter

Return a GolemProcess object for each iteration over the processing intervals and loops defined for this process view, possibly performing system level tasks along the way (such as creating output paths and performing input checks). Iterations where processing is complete will be skipped, unless overwrite_outputs has been enabled for this object.

bins() → GolemProcess iter

Provide an iterator over GolemProcess objects for each processing interval represented by this view. Iterations for which processing is complete will be skipped, unless overwrite_outputs has been enabled for this object.

group_by(key : str[, ...]) → (str tuple, GolemProcess) iter

Returns an iterator that yields a tuple containing a primary key and a GolemProcess object, grouped by the provided keys. Each primary key is a tuple containing the current values of the keys provided. Iterating over the resulting process objects will resolve any loops remaining in that view, if any. Views for which processing is complete will be skipped, unless overwrite_outputs has been enabled for this object.

by_key(key : str) → GolemProcess iter

Similar to group_by, but takes a single key as an argument. Returns an iterator that yields GolemProcess objects for each value of the key. Views for which processing is complete will be skipped, unless overwrite_outputs has been enabled for this object.

__iter__() → dict iter

Iterate over the views produced by the product method, yielding a dictionary of fully resolved template tags. Iterations for which processing is complete will be skipped, unless overwrite_outputs has been enabled for this object.