Runnable API
eHive exposes an interface for Runnables (jobs) to interact with the system:
query their own parameters (see Parameter Handling),
control their own execution and report issues,
run system commands,
trigger some dataflow events (e.g. create new jobs).
Reporting and logging
Jobs can log messages to the standard output with the
$self->say_with_header($message, $important)
method. However, messages are only printed
when the debug mode is enabled (see below) or when the $important
flag is set.
They are also prefixed with a standard header describing the
runtime context (Worker, Role, Job).
The debug mode is controlled by the --debug X
option of
beekeeper and runWorker. X is an integer,
allowing multiple levels of debug, although most of the modules will only
check whether it is 0 or not.
$self->warning($message)
calls $self->say_with_header($message, 1)
(so that the message is printed on the standard output) but also stores
it in the database (in the log_message
table).
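For illustration, here is a minimal sketch of logging from within a Runnable (the message strings are hypothetical):

    sub run {
        my $self = shift;

        # Only printed when debug mode is enabled (e.g. runWorker with --debug 1)
        $self->say_with_header('Started processing');

        # Printed regardless of debug mode, thanks to the $important flag
        $self->say_with_header('Reached the main loop', 1);

        # Printed on the standard output *and* stored in the log_message table
        $self->warning('The input file is empty');
    }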
To indicate that a Job has to be terminated early (i.e. before reaching
the end of write_output), you can call:
$self->complete_early($message)
to mark the Job as DONE (successful run) and record the message in the database. Beware that this will trigger the autoflow.
$self->complete_early($message, $branch_code)
is a variation of the above that replaces the autoflow (branch 1) with a dataflow on the given branch.
$self->throw($message)
to log a failed attempt. The Job may be given additional retries according to the analysis' max_retry_count parameter, or be marked as FAILED in the database.
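For illustration, a minimal sketch of early termination, assuming a hypothetical 'input_file' parameter:

    sub run {
        my $self = shift;

        my $input_file = $self->param_required('input_file');   # hypothetical parameter

        # Failed attempt: the Job may be retried up to the analysis' max_retry_count
        if (! -e $input_file) {
            $self->throw("Input file '$input_file' is missing");
        }

        # Successful early completion; beware that this triggers the autoflow
        if (-z $input_file) {
            $self->complete_early("Input file '$input_file' is empty, nothing to do");
        }
    }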
System interactions
All Runnables have access to the $self->run_system_command
method to run
arbitrary system commands (the SystemCmd
Runnable is merely a wrapper
around this method).
run_system_command
takes two arguments (see the sketch after the list below):
The command to run, given as a single string or an arrayref. Arrayrefs are the preferred way as they simplify the handling of whitespace and quotes in the command-line arguments. Arrayrefs that correspond to straightforward commands, e.g. ['find', '-type', 'd'], are passed to the underlying system function as lists. Arrayrefs can also contain shell meta-characters and delimiters such as > (to redirect the output to a file), ; (to separate two commands that have to be run sequentially) or | (a pipe); these are quoted, joined, and passed to system as a single string.

A hashref of options. Accepted options are:
use_bash_pipefail: Normally, the exit status of a pipeline (e.g. cmd1 | cmd2) is the exit status of the last command, meaning that errors in the first command are not captured. With the option turned on, the exit status of the pipeline will capture errors in any command of the pipeline, and will only be 0 if all the commands exit successfully.
use_bash_errexit: Exit immediately if a command fails. This is mostly useful for cases like cmd1; cmd2, where by default cmd2 would always be executed regardless of the exit status of cmd1.
timeout: the maximum number of seconds the command is allowed to run for. The exit status will be set to -2 if the command had to be aborted.
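For illustration, a minimal sketch of calling run_system_command (the command and file names are hypothetical, and we assume the scalar return value is the command's exit status):

    # Run a pipeline; with use_bash_pipefail, a failure in 'zcat' is not
    # masked by a successful 'wc'
    my $exit_status = $self->run_system_command(
        ['zcat', 'input.txt.gz', '|', 'wc', '-l'],      # hypothetical command
        { 'use_bash_pipefail' => 1, 'timeout' => 3600 },
    );
    $self->throw("Counting lines failed (exit status $exit_status)")
        if $exit_status;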
During their execution, Jobs often need temporary files.
eHive provides a directory that exists throughout the lifespan of the
Worker via the $self->worker_temp_directory
method. The directory is created
the first time the method is called, and deleted when the Worker ends. It is the Runnable’s
responsibility to leave the directory in a clean-enough state for the next
Job (by removing some files, for instance), or to clean it up completely
with $self->cleanup_worker_temp_directory
.
By default, this directory will be put under /tmp, but it can be overridden
by setting the -worker_base_tmp_dir
option for workers through their
resource classes. This can
be used to:
use a faster filesystem (although /tmp is usually local to the machine),
use a network filesystem (needed for distributed applications, e.g. over MPI). See Temporary files in the How to use MPI section.
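For illustration, a minimal sketch of using the Worker's temporary directory (the file name is hypothetical):

    my $tmp_dir  = $self->worker_temp_directory;       # created on first call
    my $tmp_file = "$tmp_dir/partial_results.txt";     # hypothetical file name

    open(my $fh, '>', $tmp_file) or $self->throw("Cannot write to '$tmp_file': $!");
    print $fh "intermediate data\n";
    close($fh);

    # Leave the directory clean for the next Job ...
    unlink $tmp_file;
    # ... or wipe it completely:
    $self->cleanup_worker_temp_directory();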
Dataflows
eHive is an event-driven system whereby agents trigger events that are immediately reacted upon. The main event is called "dataflow" (see Dataflows for more information). A dataflow event is made up of two parts: a "branch number" that identifies the event, and an attached data payload consisting of parameters. A Runnable can create as many events as desired, whenever desired. The branch number can be any integer, but note that -2, -1, 0 and 1 have a special meaning within eHive: -2, -1 and 0 are reserved for error handling, and 1 is the autoflow branch.
Warning
If a Runnable explicitly generates a dataflow event on branch 1, then no autoflow event will be generated when the Job finishes. This is unusual behaviour – many pipelines expect and depend on autoflow coinciding with Job completion. Therefore, you should avoid explicitly creating dataflow on branch 1, unless no alternative exists to produce the correct logic in the Runnable. If you do override the autoflow by creating an event on branch 1, be sure to clearly indicate this in the Runnable’s documentation.
Within a Runnable, dataflow events are performed via the $self->dataflow_output_id($data,
$branch_number)
method.
The payload $data
must be one of the following:
a hash-reference that maps parameter names (strings) to their values,
an array-reference of hash-references of the above type, or
undef, to propagate the Job's input_id.
If no branch number is provided, it defaults to 1.
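For illustration, a minimal sketch of creating dataflow events from a Runnable (the parameter names and branch numbers are hypothetical):

    my @chunk_names = ('chunk_01', 'chunk_02');   # hypothetical data

    # Fan out: one event per item, on branch 2
    foreach my $chunk_name (@chunk_names) {
        $self->dataflow_output_id( { 'chunk_name' => $chunk_name }, 2 );
    }

    # A single event on branch 3; omitting the branch number would send it
    # to branch 1 and thus override the autoflow
    $self->dataflow_output_id( { 'n_chunks' => scalar(@chunk_names) }, 3 );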
Runnables can also use dataflow_output_ids_from_json($filename, $default_branch)
.
This method simply wraps dataflow_output_id
, allowing external programs
to easily generate events. The method takes two arguments:
The path to a file containing one JSON object per line. Each line can be prefixed with a branch number (and some whitespace), which will override the default branch number.
The default branch number (defaults to 1).
Use of this method is demonstrated in the Bio::EnsEMBL::Hive::RunnableDB::SystemCmd Runnable.
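For illustration, a hypothetical events file and the corresponding call (the file name and parameter names are made up); each line holds one JSON object, optionally prefixed with a branch number:

    # Contents of a hypothetical 'events.json':
    #   2 {"chunk_name": "chunk_01"}
    #   2 {"chunk_name": "chunk_02"}
    #   {"summary_file": "all_chunks.txt"}

    # Lines prefixed with "2" flow on branch 2; the unprefixed line goes to
    # the default branch given as the second argument (here, 1)
    $self->dataflow_output_ids_from_json('events.json', 1);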