PipeConfig Files
PipeConfig basics
A PipeConfig file describes a workflow as it should be run by eHive:

- Definitions of Analyses and the relationships between them,
- Parameters required by the workflow, optionally with default values,
- Certain eHive configuration options (meta-parameters).
The file itself is a Perl module that subclasses Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf. This class defines five interface methods, some or all of which can be overridden as needed to describe a particular workflow:
pipeline_analyses()
    Returns a list of hash structures that define Analyses and the relationships between them.
default_options()
    Returns a hash of defaults for options that the rest of the configuration depends on.
pipeline_create_commands()
    Returns a list of strings that will be executed as system commands to set up pipeline dependencies.
pipeline_wide_parameters()
    Returns a hash of pipeline-wide parameters: names and values.
resource_classes()
    Returns a hash of Resource Class definitions.
By convention, PipeConfig files are given names ending in ‘_conf’.
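For orientation, here is a minimal sketch of the boilerplate every PipeConfig module shares; the package name MyCodeBase::PipeConfig::Example_conf is a hypothetical placeholder:

package MyCodeBase::PipeConfig::Example_conf;

use strict;
use warnings;

# Subclassing HiveGeneric_conf provides working defaults for all five
# interface methods; override only the ones your workflow needs.
use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');

# ... overridden methods such as pipeline_analyses() go here ...

1;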
pipeline_analyses()
Every useful PipeConfig will have the pipeline_analyses method, as this is where the workflow is described in terms of Analyses and the dependencies between them. This method returns an arrayref of hash structures – each hash defines one Analysis. For example:
sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name => 'first_analysis',
            -comment    => 'this is the first analysis in this pipeline',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
            -flow_into  => {
                '1' => 'second_analysis',
            },
        },
        {   -logic_name => 'second_analysis',
            -module     => 'MyCodeBase::RunnableDB::DoSomeWork',
        },
    ];
}
The code above creates a simple pipeline with two Analyses: “first_analysis” and “second_analysis”. When a first_analysis Job runs, it will create a dataflow event on branch #1, which will seed a Job of second_analysis. Note that this relationship between first_analysis and second_analysis – where second_analysis Jobs are seeded by first_analysis Jobs – is entirely created through the -flow_into block in the first_analysis definition. The order in which the Analysis definitions appear in the list has no effect on Analysis relationships, and is completely arbitrary. That said, it’s generally a good idea to list Analysis definitions in a roughly sequential order to help make the code understandable.
The following directives are available for use in an Analysis definition:
| Directive | Required? | Type | Description |
|---|---|---|---|
| logic_name | required | string | A name to identify this Analysis. Must be unique within the pipeline, but is otherwise arbitrary. |
| module | required | string | The classname of the Runnable for this Analysis. |
| analysis_capacity | optional | integer | Sets the Analysis capacity. Default is unlimited. |
| batch_size | optional | integer | Sets the batch size. Default 1. |
| blocked | optional | boolean (0 or 1) | Seeded Jobs of this Analysis will start out BLOCKED. |
| can_be_empty | optional | boolean (0 or 1) | Works in conjunction with wait_for. If set, then this Analysis will block other Analyses that are set to wait_for it, even if this Analysis has no Jobs. |
| comment | optional | string | A place for documentation. Please be kind to others who will use this pipeline. |
| failed_job_tolerance | optional | integer | Percentage of Jobs allowed to fail before the Analysis is considered to have failed. |
| flow_into | optional | string, arrayref, or hashref (see below) | Directs dataflow events generated by Jobs of this Analysis. |
| hive_capacity | optional | integer | Sets the reciprocal relative load of this Analysis in proportion to the overall hive_capacity. See the section covering hive capacity for details. |
| input_ids | optional | arrayref | Sets an input_id hash, or a list of input_id hashes, to seed Jobs for this Analysis at pipeline initialisation time. See the section on seeding Jobs for details. |
| language | optional | string | Language of the Runnable: Perl or Python. |
| max_retry_count | optional | integer | Maximum number of times Jobs of this Analysis can be retried before they are considered FAILED. Default 3. |
| meadow_type | optional | string | Restricts Jobs of this Analysis to a particular meadow type. Most commonly used to restrict an Analysis to running Jobs in the LOCAL meadow, but any valid meadow can be given. Note that if a non-local meadow is specified, this will stop automatic failover to LOCAL if LOCAL is the only meadow available. |
| parameters | optional | hashref | Sets analysis-wide parameters and values. |
| priority | optional | integer | Sets the relative priority of Jobs of this Analysis. Workers will claim available Jobs from higher-priority Analyses before claiming Jobs of lower-priority Analyses. |
| rc_name | optional | string | Name of the Resource Class for this Analysis. |
| tags | optional | arrayref or comma-delimited string | A tag or set of tags for this Analysis. |
| wait_for | optional | arrayref or string | Logic_name, or list of logic_names, of Analyses that Jobs of this Analysis will wait for. |
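As an illustration of how several of these directives combine, here is a sketch of one pipeline_analyses entry. All logic_names and Runnable names are hypothetical placeholders, and the referenced Analyses are assumed to be defined elsewhere in the same list:

sub pipeline_analyses {
    my ($self) = @_;
    return [
        # ... other Analysis definitions, including 'dump_sequences' ...
        {   -logic_name        => 'align_sequences',
            -module            => 'MyCodeBase::RunnableDB::AlignSequences',
            -analysis_capacity => 50,                  # run at most 50 of these Jobs at once
            -max_retry_count   => 1,
            -rc_name           => 'high_memory',       # must name a key from resource_classes()
            -wait_for          => ['dump_sequences'],  # logic_name of another Analysis
            -flow_into         => {
                '2' => 'store_alignment',        # events on branch #2 seed store_alignment Jobs
                '1' => 'summarise_alignments',   # branch #1 carries the default autoflow event
            },
        },
    ];
}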
default_options()
A PipeConfig can be created with a set of overridable default options using the default_options method. This method should return a hashref, where the keys are option names and the values are option values:
sub default_options {
    my ($self) = @_;
    return {
        # First, inherit from the base class. Doing this first
        # allows any defined options to be overridden
        %{ $self->SUPER::default_options() },

        # An example of overriding 'hive_use_param_stack', which is defined
        # in Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
        'hive_use_param_stack' => 1,

        # An example of setting a new, multilevel default option
        'input_file' => {
            -file_format   => 'FASTA',
            -file_contents => 'Nucleotide',
        },
    };
}
Note that a number of options are set in the base class Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf – these may be overridden by providing a new key-value pair in the returned hashref. Also note that the value for a default option can itself be a hashref, creating nested options.
Options set in default_options are available elsewhere in the PipeConfig via eHive’s $self->o() mechanism. For example, to take the “input_file” option above and make it available in the “an_analysis” Analysis as a parameter named “input”:
sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name => 'an_analysis',
            -module     => 'Some::Runnable',
            -parameters => {
                'input' => $self->o('input_file'),
            },
        },
    ];
}
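A convenient consequence of this mechanism is that any option declared in default_options can be overridden at pipeline initialisation time by passing a flag of the same name to init_pipeline.pl, for example (PipeConfig name hypothetical):

init_pipeline.pl MyCodeBase::PipeConfig::Example_conf -hive_use_param_stack 0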
pipeline_create_commands()
For some workflows, it may be desirable to perform extra operations at pipeline creation time. A common example would be adding extra tables to the eHive database. The pipeline_create_commands method is provided as a place to add these operations that don’t fit into the other methods provided in the PipeConfig interface.
This method should return an arrayref containing system-executable statements. For example, the following code runs db_cmd.pl as a system command to add a “final_result” table to this Pipeline’s eHive database:
sub pipeline_create_commands {
    my ($self) = @_;
    return [
        @{ $self->SUPER::pipeline_create_commands },

        # $self->db_cmd() returns a db_cmd.pl command plus options and parameters
        # as a properly escaped string suitable to be passed to system()
        $self->db_cmd('CREATE TABLE final_result (inputfile VARCHAR(255) NOT NULL, result DOUBLE PRECISION NOT NULL, PRIMARY KEY (inputfile))'),
    ];
}
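Note that the inherited $self->SUPER::pipeline_create_commands is included first: in the base class it returns the commands that create the eHive database itself, so it should stay at the front of the list so that any additional commands run against an existing database.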
pipeline_wide_parameters()
The pipeline_wide_parameters method should return a hashref containing parameters available to every Analysis in the pipeline. In the hashref, the hash keys are parameter names, and the hash values are the parameter values.
sub pipeline_wide_parameters {
    my ($self) = @_;
    return {
        # Although Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
        # does not set any pipeline-wide parameters, a PipeConfig
        # may inherit from a subclass of HiveGeneric_conf that does.
        %{ $self->SUPER::pipeline_wide_parameters },

        'my_parameter' => 1,
    };
}
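Pipeline-wide parameters act as a baseline: if an Analysis sets a parameter of the same name in its -parameters hashref, the analysis-level value takes precedence for Jobs of that Analysis. A minimal sketch, as one entry in the pipeline_analyses() list (the Runnable name is a hypothetical placeholder):

{   -logic_name => 'special_analysis',
    -module     => 'MyCodeBase::RunnableDB::DoSomeWork',
    -parameters => {
        'my_parameter' => 2,  # overrides the pipeline-wide value of 1 for this Analysis only
    },
},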
resource_classes()
Resource classes for a pipeline are defined in a PipeConfig’s resource_classes method. This method should return a hashref of resource class definitions.
sub resource_classes {
    my ($self) = @_;
    return {
        %{ $self->SUPER::resource_classes },

        'high_memory' => { 'LSF' => '-C0 -M16000 -R"rusage[mem=16000]"' },
    };
}
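An Analysis opts into one of these definitions by naming it in its -rc_name directive, for example as an entry in the pipeline_analyses() list (the Runnable name is a hypothetical placeholder):

{   -logic_name => 'memory_hungry_analysis',
    -module     => 'MyCodeBase::RunnableDB::DoSomeWork',
    -rc_name    => 'high_memory',
},

The base class already defines a 'default' Resource Class, which is used by any Analysis that does not set -rc_name.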