PipeConfig Files

PipeConfig basics

A PipeConfig file describes a workflow as it should be run by eHive:

  • Definitions of analyses and the relationships between them,

  • Parameters required by the workflow, optionally with default values,

  • Certain eHive configuration options (meta-parameters).

The file itself is a Perl module that extends Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf. This base class defines five interface methods, some or all of which can be overridden as needed to describe a particular workflow:

  • pipeline_analyses() returns a list of hash structures that define Analyses and the relationships between them.

  • default_options() returns a hash of defaults for options that the rest of the configuration depends on.

  • pipeline_create_commands() returns a list of strings that will be executed as system commands to set up pipeline dependencies.

  • pipeline_wide_parameters() returns a hash of pipeline-wide parameter names and values.

  • resource_classes() returns a hash of Resource Class definitions.

By convention, PipeConfig files are given names ending in ‘_conf’.
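Putting this together, a minimal PipeConfig module can be sketched as follows. The package name My::PipeConfig::Example_conf is purely illustrative, and only the interface methods a workflow actually needs have to be overridden:

package My::PipeConfig::Example_conf;

use strict;
use warnings;

# Inherit default implementations of all the interface methods
use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');

# Override only what this workflow needs; here, a single Analysis
sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name => 'do_nothing',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
        },
    ];
}

1;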

pipeline_analyses()

Every useful PipeConfig will have the pipeline_analyses method, as this is where the workflow is described in terms of Analyses and the dependencies between them. This method returns a list of hash structures – each hash defines one Analysis. For example:

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name => 'first_analysis',
            -comment    => 'this is the first analysis in this pipeline',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
            -flow_into  => {
               '1' => 'second_analysis',
            },
        },
        {   -logic_name => 'second_analysis',
            -module     => 'MyCodeBase::RunnableDB::DoSomeWork',
        },
    ];
}

The code above creates a simple pipeline with two Analyses: “first_analysis” and “second_analysis”. When a first_analysis Job runs, it will create a dataflow event on branch #1, which will seed a Job of second_analysis. Note that this relationship between first_analysis and second_analysis – where second_analysis Jobs are seeded by first_analysis Jobs – is entirely created through the -flow_into block in the first_analysis definition. The order in which the Analysis definitions appear in the list has no effect on Analysis relationships, and is completely arbitrary. That said, it’s generally a good idea to list Analysis definitions in a roughly sequential order to help make the code understandable.

The following directives are available for use in an Analysis definition; the sketch after the list shows several of them combined in a single definition.

Analysis definition directives

  logic_name (required, string)
      A name to identify this Analysis. Must be unique within the pipeline, but is otherwise arbitrary.

  module (required, string)
      The classname of the Runnable for this Analysis.

  analysis_capacity (optional, integer)
      Sets the Analysis capacity. Default is unlimited.

  batch_size (optional, integer)
      Sets the batch size. Default 1.

  blocked (optional, boolean (0 or 1))
      Seeded Jobs of this Analysis will start out [BLOCKED].

  can_be_empty (optional, boolean (0 or 1))
      Works in conjunction with wait_for. If set, this Analysis will block other Analyses that are set to wait_for it, even if this Analysis has no Jobs.

  comment (optional, string)
      A place for documentation. Please be kind to others who will use this pipeline.

  failed_job_tolerance (optional, integer)
      Percentage of Jobs allowed to fail before the Analysis is considered to have failed. For example, -failed_job_tolerance => 25 means that up to 25% of Jobs can be [FAILED] before the Analysis is considered failed. Default 0. Note that this does not affect semaphores: any failed Job in a fan will still prevent release of a semaphore, regardless of failed_job_tolerance.

  flow_into (optional, string, arrayref, or hashref (see below))
      Directs dataflow events generated by Jobs of this Analysis.

  hive_capacity (optional, integer)
      Sets the reciprocal relative load of this Analysis in proportion to the overall hive_capacity. See the section covering hive capacity for details.

  input_ids (optional, arrayref)
      Sets an input_id hash, or a list of input_id hashes, to seed Jobs for this Analysis at pipeline initialisation time. See the section on seeding Jobs for details.

  language (optional, string)
      Language of the Runnable: Java, Perl, or Python.

  max_retry_count (optional, integer)
      Maximum number of times Jobs of this Analysis can be retried before they are considered [FAILED]. Default 3.

  meadow_type (optional, string)
      Restricts Jobs of this Analysis to a particular meadow type. Most commonly used to restrict an Analysis to running its Jobs in the LOCAL meadow, but any valid meadow can be given. Note that if a non-local meadow is specified, this will stop automatic failover to LOCAL if LOCAL is the only meadow available.

  parameters (optional, hashref)
      Sets analysis-wide parameters and values.

  priority (optional, integer)
      Sets the relative priority of Jobs of this Analysis. Workers will claim available Jobs from higher-priority Analyses before claiming Jobs of lower-priority Analyses.

  rc_name (optional, string)
      Name of the Resource Class for this Analysis.

  tags (optional, arrayref or comma-delimited string)
      A tag or set of tags for this Analysis.

  wait_for (optional, arrayref or string)
      Logic_name, or list of logic_names, of Analyses that Jobs of this Analysis will wait for.
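For illustration, the sketch below combines several of these directives in one Analysis definition; all of the names in it (logic_names, module, and resource class) are hypothetical:

{   -logic_name        => 'align_chunks',
    -module            => 'MyCodeBase::RunnableDB::AlignChunk',
    -comment           => 'aligns one chunk of the input per Job',
    -analysis_capacity => 50,            # run at most 50 Workers at a time
    -batch_size        => 10,            # each Worker claims 10 Jobs at once
    -max_retry_count   => 2,
    -priority          => 10,
    -rc_name           => 'high_memory',
    -wait_for          => [ 'index_genome' ],
    -flow_into         => {
        '1' => 'merge_alignments',
    },
},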

default_options()

A PipeConfig can be created with a set of overridable default options using the default_options method. This method should return a hashref, where the keys are option names and the values are option values:

sub default_options {
    my ($self) = @_;

    return {
            #First, inherit from the base class. Doing this first
            #allows any defined options to be overridden
            %{ $self->SUPER::default_options() },

            #An example of overriding 'hive_use_param_stack' which is defined
            #in Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
            'hive_use_param_stack' => 1,

            #An example of setting a new, multilevel default option
            'input_file' => {
                -file_format   => 'FASTA',
                -file_contents => 'Nucleotide',
            },
    };
}

Note that a number of options are set in the base class Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf – these may be overridden by providing a new key-value pair in the returned hashref. Also note that the value for a default option can itself be a hashref, creating nested options.

Options set in default_options are available elsewhere in the PipeConfig via eHive’s $self->o mechanism. For example, to take the “input_file” option above and make it available in the “an_analysis” Analysis as a parameter named “input”:

sub pipeline_analyses {
    my ($self) = @_;

    return [
        {   -logic_name => 'an_analysis',
            -module     => 'Some::Runnable',
            -parameters => {
                'input' => $self->o('input_file')
            },
        },
    ];
}
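Because these options are resolved through the $self->o mechanism, any option defined in default_options can normally also be overridden on the command line when the pipeline is initialised with init_pipeline.pl, by passing -option_name value (for example, -hive_use_param_stack 0).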

pipeline_create_commands()

For some workflows, it may be desirable to perform extra operations at pipeline creation time. A common example is adding extra tables to the eHive database. The pipeline_create_commands method provides a place for operations like these that do not fit into the other methods of the PipeConfig interface.

This method should return an arrayref containing system-executable statements.

For example, the following code runs db_cmd.pl as a system command to add a “final_result” table to this Pipeline’s eHive database:

sub pipeline_create_commands {
    my ($self) = @_;

    return [
        @{$self->SUPER::pipeline_create_commands},

        # $self->db_cmd() returns a db_cmd.pl command plus options and parameters
        # as a properly escaped string suitable to be passed to system()
        $self->db_cmd('CREATE TABLE final_result (inputfile VARCHAR(255) NOT NULL, result DOUBLE PRECISION NOT NULL, PRIMARY KEY (inputfile))'),
    ];
}
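A table created this way is typically filled by dataflow. As a sketch, assuming a Runnable in this pipeline emits dataflow events on branch #1 whose parameters match the columns defined above (inputfile and result), its Analysis definition could direct those events into the new table like this:

-flow_into  => {
    1 => [ '?table_name=final_result' ],
},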

pipeline_wide_parameters()

The pipeline_wide_parameters method should return a hashref containing parameters available to every Analysis in the pipeline. In the hashref, the hash keys are parameter names, and the hash values are the parameter values.

sub pipeline_wide_parameters {
    my ($self) = @_;

    return {
        # Although Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
        # does not set any pipeline-wide parameters, a PipeConfig
        # may inherit from a subclass of HiveGeneric_conf that does.
        %{$self->SUPER::pipeline_wide_parameters},

        'my_parameter' => 1,
    };
}
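Pipeline-wide parameters act as defaults: if an Analysis sets a parameter of the same name via -parameters (or a Job receives it in its input_id), that value takes precedence. Continuing the example above, the hypothetical Analysis sketched below would see my_parameter with a value of 2 rather than 1:

{   -logic_name => 'override_example',
    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
    -parameters => {
        'my_parameter' => 2,    # overrides the pipeline-wide value of 1
    },
},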

resource_classes()

Resource classes for a pipeline are defined in a PipeConfig’s resource_classes method. This method should return a hashref of resource class definitions.

sub resource_classes {
    my ($self) = @_;

    return {
        %{$self->SUPER::resource_classes},
        'high_memory' => { 'LSF' => '-C0 -M16000 -R"rusage[mem=16000]"' },
    };
}
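An Analysis selects one of these Resource Classes by name via the rc_name directive. A minimal sketch, with a hypothetical logic_name and module:

{   -logic_name => 'assemble_genome',
    -module     => 'MyCodeBase::RunnableDB::Assemble',
    -rc_name    => 'high_memory',
},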