Job parameters

Parameter sources

Job parameters come from several sources and are aggregated according to the following precedence rules (highest priority first):

  1. Job-specific parameters:

    1. Accumulated parameters.

    2. Job’s input_id.

    3. Inherited parameters from previous Jobs, still giving priority to a given Job’s accumulated parameters over its input_id.

  2. Analysis-wide parameters

  3. Pipeline-wide parameters

  4. Default parameters hard-coded in the Runnable

Jobs can access all their parameters via $self->param('param_name') regardless of their origin.
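
For illustration, here is a minimal sketch of a Runnable reading its parameters. The module and parameter names are made up; only the eHive accessors ($self->param() and $self->param_required()) and the param_defaults() hook are real. param_defaults() supplies the hard-coded defaults of point 4 above:

package GreetSpecies;     # hypothetical Runnable, for illustration only

use strict;
use warnings;

use base ('Bio::EnsEMBL::Hive::Process');

# Defaults hard-coded in the Runnable (point 4 of the precedence list);
# they apply only when no higher-priority source defines the parameter
sub param_defaults {
    return {
        'greeting' => 'Hello',
    };
}

sub run {
    my $self = shift;

    # The same accessor works whatever the parameter's origin
    my $greeting     = $self->param('greeting');

    # param_required() raises an error if the parameter is missing
    my $species_name = $self->param_required('species_name');

    $self->warning("$greeting, $species_name!");
}

1;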

Parameter substitution

You can manipulate parameters in a sophisticated way within the -flow_into section of a PipeConfig file. Within this section, you can use hash-sign-based constructs in parameter values to rename or interpolate other parameters’ values.

Full redirection

A “full redirection” happens when the value of a parameter a is defined as #b#. This means that eHive will return b’s value when $self->param('a') is called, regardless of b’s type (e.g. a numeric value, a string, a structure made of hashes and arrays, or even undef).

Syntax:

-flow_into => {1 => {'target_analysis' => {'a' => '#b#'}}}

Interpolation

Parameter values can also be interpolated, as in other languages (Perl, PHP, etc.). For instance, if the value of a parameter a is defined as value is #b#, b’s value will be stringified and appended to the string “value is “. This only works well for numeric and string values, as Perl stringifies hash and array references to something like HASH(0x1760cb8).

Syntax:

-flow_into => {1 => {'target_analysis' => {'a' => 'value is #b#'}}}

Arbitrary expressions

Arbitrary Perl expressions can also be used via the #expr()expr# construct. Any valid Perl expression can be put within the parentheses; parameters are inserted into the expression by enclosing them in hash marks (#b#, #c#, etc.).

For example, to add 1 to the value in parameter ‘alpha’, define 'alpha_plus_one' => '#expr( #alpha#+1 )expr#'. The next Job will then see a parameter alpha_plus_one which will have a value one greater than the value of alpha.

If the parameter holds a data structure (an arrayref or a hashref), you can dereference it with curly braces as in standard Perl. You can also use these functions from List::Util: first, min, max, minstr, maxstr, reduce, sum, shuffle.

For example, 'array_max' => '#expr( max @{#array#} )expr#' will dereference $self->param('array') and call max on it.
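
Putting the two examples above in context, such definitions would sit in a dataflow template like this (reusing the target_analysis name from earlier):

-flow_into => {
    1 => { 'target_analysis' => {
        'alpha_plus_one' => '#expr( #alpha# + 1 )expr#',
        'array_max'      => '#expr( max @{#array#} )expr#',
    } },
},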

Substitution chaining

Substitutions can involve more than two parameters in a chain, e.g. a requires b, which requires c, which requires d, etc., each of the substitutions being one of the patterns described above.

In the example below, the comp_size parameter is a hash mapping filenames to their compressed sizes. The text parameter is a message composed of the minimum and maximum compressed sizes.

'min_comp_size' => '#expr(min values %{#comp_size#})expr#',
'max_comp_size' => '#expr(max values %{#comp_size#})expr#',
'text' => 'compressed sizes between #min_comp_size# and #max_comp_size#',
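
For instance, these definitions could be attached to a dataflow as follows (report_analysis is a placeholder name):

-flow_into => {
    1 => { 'report_analysis' => {
        'min_comp_size' => '#expr(min values %{#comp_size#})expr#',
        'max_comp_size' => '#expr(max values %{#comp_size#})expr#',
        'text'          => 'compressed sizes between #min_comp_size# and #max_comp_size#',
    } },
},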

Dataflow templates

“Templates” are a way of setting the input_id of newly created Jobs differently from what has been passed in the dataflow with dataflow_output_id(). For instance, you may want to connect a Runnable “R” that flows two parameters named name and dbID into an Analysis that expects species_name and species_id. eHive needs to be told about the mapping { 'species_name' => '#name#', 'species_id' => '#dbID#' }. This mapping can be defined in three places:

  1. Pipeline-wide parameters, but only if this doesn’t clash with other usages of species_name and species_id.

  2. Analysis-wide parameters of every downstream Analysis. This can obviously be quite tedious.

  3. Template on the dataflow coming from the Runnable “R”:

    {   -logic_name => 'R',
        -flow_into  => {
            2 => { 'R_consumer' => { 'species_name' => '#name#', 'species_id' => '#dbID#' } },
        },
    },
    {   -logic_name => 'R_consumer',
    },
    

This will tell eHive that Jobs created in R_consumer must have their input_ids composed of two parameters species_name and species_id, whose values respectively are #name# and #dbID#.

Values in template expressions are evaluated the same way as $self->param('param_name') calls, meaning they can undergo the complex substitution patterns described above. These expressions are evaluated in the context of the runtime environment of the emitting Job.

Expressions can also be simple “pass-through” definitions, like 'creation_date' => '#creation_date#'.
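
A single template can freely mix all these patterns. The sketch below (downstream_analysis, out_dir and out_file are made-up names) combines a pass-through, an interpolation and a full redirection:

-flow_into => {
    2 => { 'downstream_analysis' => {
        'creation_date' => '#creation_date#',                  # pass-through
        'out_file'      => '#out_dir#/#species_name#.txt',     # interpolation
        'species_id'    => '#dbID#',                           # full redirection
    } },
},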

Parameter scope

Explicit propagation and templates

By default, parameters are passed in an explicit manner, i.e. they have to be explicitly listed at one of these two levels:

  1. In the code of the emitting Runnable, by listing them in the dataflow_output_id() call,

  2. In the pipeline configuration, with templates to dataflow-targets.

A common problem when using factory dataflows is that the fan Jobs may need access to some parameters of their factory, but the system does not propagate these automatically. For instance, eHive’s JobFactory Runnable emits hashes that do not include any of its input parameters. In this case, you will need to define a template to add the extra required parameters.

In the example below, the “parse_file” Analysis expects its Jobs to have the inputfile parameter defined. “parse_file” is a JobFactory Analysis that reads the tab-delimited file, extracts the first two columns and creates one dataflow event per row. Each event carries two parameters, one named species_name and the other species_id. These events are wired to seed Jobs of an Analysis named “species_processor”. By default, the latter will not know the name of the input file the data come from. If it requires that information, we can use a template to define the input_ids of its Jobs as 1) the parameters set by the factory and 2) the extra inputfile parameter, which comes from parse_file’s input_id. Note that with explicit propagation, you need to list all the parameters that you want to propagate.

{   -logic_name => 'parse_file',
    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::JobFactory',
    -parameters => {
        'column_names'     => [ 'species_name', 'species_id' ],
    },
    -flow_into  => {
        2 => { 'species_processor' => { 'species_name' => '#species_name#', 'species_id' => '#species_id#', 'inputfile' => '#inputfile#' } },
    },
},
{   -logic_name => 'species_processor',
},

Per-Analysis implicit propagation using INPUT_PLUS

INPUT_PLUS is a template modifier that makes the dataflow automatically propagate both the dataflow output_id and the emitting Job’s parameters. The Analysis above can be rewritten as:

{   -logic_name => 'parse_file',
    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::JobFactory',
    -parameters => {
        'column_names'     => [ 'species_name', 'species_id' ],
    },
    -flow_into  => {
        2 => { 'species_processor' => INPUT_PLUS() },
    },
},
{   -logic_name => 'species_processor',
},

INPUT_PLUS is specific to a dataflow target, and has to be repeated in all the Analyses that require it.

It can also be extended to include other templated variables, like INPUT_PLUS( { 'species_key' => '#species_name#_#species_id#' } )
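
Applied to the parse_file Analysis above, that extension would look like this (species_key is an illustrative name, not one required by eHive):

-flow_into  => {
    2 => { 'species_processor' => INPUT_PLUS( {
        'species_key' => '#species_name#_#species_id#',
    } ) },
},

Jobs of species_processor would then see species_name and species_id from the dataflow event, all of parse_file's parameters (via INPUT_PLUS), plus the extra species_key.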

Here are two diagrams showing how parameters are propagated in the absence and in the presence of INPUT_PLUS modifiers.

digraph {
   label="Propagation without INPUT_PLUS"
   A -> B;
   A -> D;
   B -> C;
   B -> E;
   A [color="red", label=<<font color='red'>Job A<br/>Pa<sub>1</sub>,Pa<sub>2</sub></font>>];
   B [color="DodgerBlue", label=<<font color='DodgerBlue'>Job B<br/>Pb<sub>1</sub>,Pb<sub>2</sub>,Pb<sub>3</sub></font>>];
   C [label=<Job C<br/>Pc<sub>1</sub>,Pc<sub>2</sub>>];
   D [label=<Job D<br/>Pd<sub>1</sub>>];
   E [label=<Job E<br/>Pe<sub>1</sub>>];
}
digraph {
   label="Propagation with INPUT_PLUS"
   A -> B [label="INPUT_PLUS"];
   A -> D [label=<<i>no INPUT_PLUS</i>>];
   B -> C [label=<<i>no INPUT_PLUS</i>>];
   B -> E [label="INPUT_PLUS"];
   A [color="red", label=<<font color='red'>Job A<br/>Pa<sub>1</sub>,Pa<sub>2</sub></font>>];
   B [color="DodgerBlue", label=<<font color='DodgerBlue'>Job B<br/><font color='red'>Pa<sub>1</sub>,Pa<sub>2</sub></font>,Pb<sub>1</sub>,Pb<sub>2</sub>,Pb<sub>3</sub></font>>];
   C [label=<Job C<br/><font color='red'>Pa<sub>1</sub>,Pa<sub>2</sub></font>,Pc<sub>1</sub>,Pc<sub>2</sub>>];
   D [label=<Job D<br/>Pd<sub>1</sub>>];
   E [label=<Job E<br/><font color='red'>Pa<sub>1</sub>,Pa<sub>2</sub></font>,<font color='DodgerBlue'>Pb<sub>1</sub>,Pb<sub>2</sub>,Pb<sub>3</sub></font>,Pe<sub>1</sub>>];
}

Global implicit propagation

In this mode, all Jobs automatically see all the parameters of their ancestors, without having to define any templates or INPUT_PLUS modifiers. Global implicit propagation is enabled by setting the hive_use_param_stack parameter to 1 in the hive_meta table, as in the example below:

sub hive_meta_table {
    my ($self) = @_;
    return {
        %{$self->SUPER::hive_meta_table},       # here we inherit anything from the base class
        'hive_use_param_stack'  => 1,           # switch on the new param_stack mechanism
    };
}

The parse_file Analysis then becomes:

{   -logic_name => 'parse_file',
    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::JobFactory',
    -parameters => {
        'column_names'     => [ 'species_name', 'species_id' ],
    },
    -flow_into  => {
        2 => [ 'species_processor' ],
    },
},
{   -logic_name => 'species_processor',
},

Reusing the same five Jobs as in the previous section, here is how the parameters would be propagated when hive_use_param_stack is switched on.

digraph {
   label="Propagation with hive_use_param_stack"
   A -> B;
   A -> D;
   B -> C;
   B -> E;
   A [color="red", label=<<font color='red'>Job A<br/>Pa<sub>1</sub>,Pa<sub>2</sub></font>>];
   B [color="DodgerBlue", label=<<font color='DodgerBlue'>Job B<br/><font color='red'>Pa<sub>1</sub>,Pa<sub>2</sub></font>,Pb<sub>1</sub>,Pb<sub>2</sub>,Pb<sub>3</sub></font>>];
   C [label=<Job C<br/><font color='red'>Pa<sub>1</sub>,Pa<sub>2</sub></font>,<font color='DodgerBlue'>Pb<sub>1</sub>,Pb<sub>2</sub>,Pb<sub>3</sub></font>,Pc<sub>1</sub>,Pc<sub>2</sub>>];
   D [label=<Job D<br/><font color='red'>Pa<sub>1</sub>,Pa<sub>2</sub></font>,Pd<sub>1</sub>>];
   E [label=<Job E<br/><font color='red'>Pa<sub>1</sub>,Pa<sub>2</sub></font>,<font color='DodgerBlue'>Pb<sub>1</sub>,Pb<sub>2</sub>,Pb<sub>3</sub></font>,Pe<sub>1</sub>>];
}