Runnable API
eHive exposes an interface for Runnables (jobs) to interact with the system:
query their own parameters (see Parameter Handling),
control their own execution and report issues,
run system commands,
trigger some dataflow events (e.g. create new jobs).
Reporting and logging
Jobs can log messages to the standard output with the
$self->say_with_header($message, $important)
method. However, messages are only printed
when the debug mode is enabled (see below) or when the $important
flag is set.
They are also prefixed with a standard header describing the
runtime context (Worker, Role, Job).
The debug mode is controlled by the --debug X
option of
beekeeper and runWorker. X is an integer,
allowing multiple levels of debug, although most of the modules will only
check whether it is 0 or not.
$self->warning($message)
calls $self->say_with_header($message, 1)
(so that the message is printed on the standard output) but also stores
it in the database (in the log_message
table).
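For illustration, here is a minimal sketch of logging from within a Runnable (the message strings are hypothetical):

    sub run {
        my $self = shift;

        # Only printed when debug mode is enabled (e.g. runWorker with --debug 1)
        $self->say_with_header('Started processing');

        # Printed regardless of debug mode, thanks to the $important flag
        $self->say_with_header('Reached the main loop', 1);

        # Printed on the standard output *and* stored in the log_message table
        $self->warning('The input file is empty');
    }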
To indicate that a Job has to be terminated early (i.e. before reaching
the end of write_output), you can call:
$self->complete_early($message)
to mark the Job as DONE (successful run) and record the message in the database. Beware that this will trigger the autoflow.
$self->complete_early($message, $branch_code)
is a variation of the above that replaces the autoflow (branch 1) with a dataflow on the given branch.
$self->throw($message)
to log a failed attempt. The Job may be given additional retries according to the analysis' max_retry_count parameter, or be marked as FAILED in the database.
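For illustration, a minimal sketch of early termination, assuming a hypothetical 'input_file' parameter:

    sub run {
        my $self = shift;

        my $input_file = $self->param_required('input_file');   # hypothetical parameter

        # Failed attempt: the Job may be retried up to the analysis' max_retry_count
        if (! -e $input_file) {
            $self->throw("Input file '$input_file' is missing");
        }

        # Successful early completion; beware that this triggers the autoflow
        if (-z $input_file) {
            $self->complete_early("Input file '$input_file' is empty, nothing to do");
        }
    }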
System interactions
All Runnables have access to the $self->run_system_command
method to run
arbitrary system commands (the SystemCmd
Runnable is merely a wrapper
around this method).
run_system_command
takes two arguments (see the sketch after the list below):
The command to run, given as a single string or an arrayref. Arrayrefs are the preferred way as they simplify the handling of whitespace and quotes in the command-line arguments. Arrayrefs that correspond to straightforward commands, e.g. ['find', '-type', 'd'], are passed to the underlying system function as lists. Arrayrefs can also contain shell meta-characters and delimiters such as > (to redirect the output to a file), ; (to separate two commands that have to be run sequentially) or | (a pipe); these are quoted, joined, and passed to system as a single string.

A hashref of options. Accepted options are:
use_bash_pipefail: Normally, the exit status of a pipeline (e.g. cmd1 | cmd2) is the exit status of the last command, meaning that errors in the first command are not captured. With the option turned on, the exit status of the pipeline will capture errors in any command of the pipeline, and will only be 0 if all the commands exit successfully.
use_bash_errexit: Exit immediately if a command fails. This is mostly useful for cases like cmd1; cmd2, where by default cmd2 would always be executed regardless of the exit status of cmd1.
timeout: the maximum number of seconds the command is allowed to run for. The exit status will be set to -2 if the command had to be aborted.
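For illustration, a minimal sketch of calling run_system_command (the command and file names are hypothetical, and we assume the scalar return value is the command's exit status):

    # Run a pipeline; with use_bash_pipefail, a failure in 'zcat' is not
    # masked by a successful 'wc'
    my $exit_status = $self->run_system_command(
        ['zcat', 'input.txt.gz', '|', 'wc', '-l'],      # hypothetical command
        { 'use_bash_pipefail' => 1, 'timeout' => 3600 },
    );
    $self->throw("Counting lines failed (exit status $exit_status)")
        if $exit_status;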
During their execution, Jobs often need temporary files.
eHive provides a directory that exists throughout the lifespan of the
Worker via the $self->worker_temp_directory
method. The directory is created
the first time the method is called, and deleted when the Worker ends. It is the Runnable’s
responsibility to leave the directory in a clean-enough state for the next
Job (by removing some files, for instance), or to clean it up completely
with $self->cleanup_worker_temp_directory
.
By default, this directory will be put under /tmp, but it can be overridden
by setting the -worker_base_tmp_dir
option for workers through their
resource classes. This can
be used to:
use a faster filesystem (although /tmp is usually local to the machine),
use a network filesystem (needed for distributed applications, e.g. over MPI). See Temporary files in the How to use MPI section.
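For illustration, a minimal sketch of using the Worker's temporary directory (the file name is hypothetical):

    my $tmp_dir  = $self->worker_temp_directory;       # created on first call
    my $tmp_file = "$tmp_dir/partial_results.txt";     # hypothetical file name

    open(my $fh, '>', $tmp_file) or $self->throw("Cannot write to '$tmp_file': $!");
    print $fh "intermediate data\n";
    close($fh);

    # Leave the directory clean for the next Job ...
    unlink $tmp_file;
    # ... or wipe it completely:
    $self->cleanup_worker_temp_directory();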
Dataflows
eHive is an event-driven system whereby agents trigger events that are immediately reacted upon. The main event is called "dataflow" (see Dataflows for more information). A dataflow event is made up of two parts: a "branch number" that identifies the event, and an attached data payload consisting of parameters. A Runnable can create as many events as desired, whenever desired. The branch number can be any integer, but note that -2, -1, 0 and 1 have a special meaning within eHive: -2, -1 and 0 are reserved for error handling, and 1 is the autoflow branch.
Warning
If a Runnable explicitly generates a dataflow event on branch 1, then no autoflow event will be generated when the Job finishes. This is unusual behaviour – many pipelines expect and depend on autoflow coinciding with Job completion. Therefore, you should avoid explicitly creating dataflow on branch 1, unless no alternative exists to produce the correct logic in the Runnable. If you do override the autoflow by creating an event on branch 1, be sure to clearly indicate this in the Runnable’s documentation.
Within a Runnable, dataflow events are performed via the $self->dataflow_output_id($data,
$branch_number)
method.
The payload $data
must be one of the following:
a hash-reference that maps parameter names (strings) to their values,
an array-reference of hash-references of the above type, or
undef, to propagate the Job's input_id.
If no branch number is provided, it defaults to 1.
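For illustration, a minimal sketch of creating dataflow events from a Runnable (the parameter names and branch numbers are hypothetical):

    my @chunk_names = ('chunk_01', 'chunk_02');   # hypothetical data

    # Fan out: one event per item, on branch 2
    foreach my $chunk_name (@chunk_names) {
        $self->dataflow_output_id( { 'chunk_name' => $chunk_name }, 2 );
    }

    # A single event on branch 3; omitting the branch number would send it
    # to branch 1 and thus override the autoflow
    $self->dataflow_output_id( { 'n_chunks' => scalar(@chunk_names) }, 3 );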
Runnables can also use dataflow_output_ids_from_json($filename, $default_branch)
.
This method simply wraps dataflow_output_id
, allowing external programs
to easily generate events. The method takes two arguments:
The path to a file containing one JSON object per line. Each line can be prefixed with a branch number (and some whitespace), which will override the default branch number.
The default branch number (defaults to 1).
Use of this method is demonstrated in the Bio::EnsEMBL::Hive::RunnableDB::SystemCmd Runnable.
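For illustration, a hypothetical events file and the corresponding call (the file name and parameter names are made up); each line holds one JSON object, optionally prefixed with a branch number:

    # Contents of a hypothetical 'events.json':
    #   2 {"chunk_name": "chunk_01"}
    #   2 {"chunk_name": "chunk_02"}
    #   {"summary_file": "all_chunks.txt"}

    # Lines prefixed with "2" flow on branch 2; the unprefixed line goes to
    # the default branch given as the second argument (here, 1)
    $self->dataflow_output_ids_from_json('events.json', 1);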