Inheritance diagram for Bio::EnsEMBL::Hive::RunnableDB::JobFactory:

Public Member Functions
public	param_defaults ()

public	run ()

public	write_output ()

protected	_get_rows_from_list ()

protected	_get_rows_from_query ()

protected	_get_rows_from_open ()

protected	_substitute_rows ()

protected	_substitute_minibatched_rows ()

protected	_fisher_yates_shuffle_in_place ()

Public Member Functions inherited from Bio::EnsEMBL::Hive::Process
public	new ()

public	life_cycle ()

public	say_with_header ()

public	enter_status ()

public	warning ()

public	param_defaults ()

public	fetch_input ()

public	run ()

public	write_output ()

public Bio::EnsEMBL::Hive::Worker	worker ()

public	execute_writes ()

public Bio::EnsEMBL::Hive::DBSQL::DBAdaptor	db ()

public Bio::EnsEMBL::Hive::DBSQL::DBConnection	dbc ()

public Bio::EnsEMBL::Hive::DBSQL::DBConnection	data_dbc ()

public	run_system_command ()

public Bio::EnsEMBL::Hive::AnalysisJob	input_job ()

public	input_id ()

public	param ()

public	param_required ()

public	param_exists ()

public	param_is_defined ()

public	param_substitute ()

public	dataflow_output_id ()

public	dataflow_output_ids_from_json ()

public	throw ()

public	complete_early ()

public Int	debug ()

public	worker_temp_directory ()

public	cleanup_worker_temp_directory ()

Detailed Description

Synopsis

standaloneJob.pl Bio::EnsEMBL::Hive::RunnableDB::JobFactory \
                --inputcmd 'cd ${ENSEMBL_CVS_ROOT_DIR}/ensembl-hive/modules/Bio/EnsEMBL/Hive/RunnableDB; ls -1 *.pm' \
                --flow_into "{ 2 => { 'mysql://ensadmin:${ENSADMIN_PSW}@127.0.0.1:2914/lg4_compara_families_70/meta' => {'meta_key'=>'module_name','meta_value'=>'#_0#'} } }"

Description

    This is a generic RunnableDB module for creating batches of similar jobs using the dataflow mechanism
    (a fan of jobs is created in one branch and the funnel in another).
    Make sure you wire this building block properly from outside.

    You can supply as a parameter one of 4 sources of ids from which the batches will be generated:

        param('inputlist'):  The list is given explicitly in the parameters and can be abbreviated: 'inputlist' => ['a'..'z']

        param('inputfile'):  The list is contained in a file whose name is supplied as a parameter: 'inputfile' => 'myfile.txt'

        param('inputquery'): The list is generated by an SQL query (against the production database by default): 'inputquery' => 'SELECT object_id FROM object WHERE x=y'

        param('inputcmd'):   The list is generated by running a system command: 'inputcmd' => 'find /tmp/big_directory -type f'

    NB for developers: the fetch_input() method is intentionally missing from JobFactory.pm.
    If JobFactory is subclassed (say, by a Compara RunnableDB), the child class should use fetch_input()
    to set $self->param('inputlist') to whatever list of ids is specific to that particular type of data (slices, members, etc).
    The rest of the functionality will be taken care of by the parent class code.
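
The wiring mentioned above is done in the pipeline configuration, not in JobFactory itself. The fragment below is a minimal sketch of how a fan and a funnel might be declared, assuming the semaphore-style '2->A' / 'A->1' flow_into notation of eHive PipeConfig files. The analysis names (make_jobs, process_one, collect_results) and the input file name are hypothetical.

```perl
# Fragment of a pipeline_analyses() list; analysis and file names are hypothetical.
{   -logic_name => 'make_jobs',
    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::JobFactory',
    -parameters => {
        'inputfile' => 'myfile.txt',        # one id per line
    },
    -flow_into  => {
        '2->A' => [ 'process_one' ],        # the fan: jobs created on branch 2
        'A->1' => [ 'collect_results' ],    # the funnel: waits for the whole fan
    },
},
```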

Definition at line 35 of file JobFactory.pm.

Member Function Documentation

◆ _fisher_yates_shuffle_in_place()

protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_fisher_yates_shuffle_in_place ( )

    Description: a private function (not a method) that shuffles a list of ids
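
For readers unfamiliar with the technique, here is a minimal standalone sketch of an in-place Fisher-Yates shuffle. It is an illustration under the assumption that the list is passed as an array reference; it is not the code from JobFactory.pm, and the function name is chosen for this example.

```perl
use strict;
use warnings;

# Shuffle the array behind $array_ref in place (Fisher-Yates shuffle).
sub fisher_yates_shuffle_in_place {
    my ($array_ref) = @_;

    # Walk from the last index down to 1, swapping each element with a
    # randomly chosen element at or below its position.
    for (my $upper = $#{$array_ref}; $upper > 0; $upper--) {
        my $lower = int(rand($upper + 1));      # 0 <= $lower <= $upper
        next if $lower == $upper;               # nothing to swap
        @{$array_ref}[$lower, $upper] = @{$array_ref}[$upper, $lower];
    }
    return;
}

my @ids = (1 .. 10);
fisher_yates_shuffle_in_place(\@ids);
print join(',', @ids), "\n";    # the same ten ids, in random order
```

Because the function is called as a plain subroutine rather than a method, no object is passed as the first argument.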

◆ _get_rows_from_list()

protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_get_rows_from_list ( )

    Description: a private method that ensures the list is 2D
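
One way to picture "ensuring the list is 2D" is shown below: scalars are wrapped into one-element rows so that later steps can always treat a row as a list of columns. This is a sketch with a hypothetical function name and may differ in detail from the actual method.

```perl
use strict;
use warnings;

# Return a list of rows in which every row is an arrayref.
# Scalars become one-element rows; existing arrayrefs are kept as they are.
sub rows_from_list {
    my ($list) = @_;
    return [ map { ref($_) eq 'ARRAY' ? $_ : [ $_ ] } @{$list} ];
}

my $rows = rows_from_list([ 'a', 'b', [ 'c', 3 ] ]);
print scalar(@{$rows}), " rows, first row has ", scalar(@{ $rows->[0] }), " column(s)\n";
```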

◆ _get_rows_from_open()

protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_get_rows_from_open ( )

    Description: a private method that loads ids from a given file or command pipe
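
The run() code calls this method with the open mode '<' for a file and '-|' for a command pipe, so a single reader can serve both sources. The sketch below illustrates that idea with Perl's three-argument open. The function name is hypothetical, the delimiter is treated as a literal string (an assumption of this sketch), and column-name parsing is left out.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read rows from a file ('<') or from the output of a command ('-|').
# With a delimiter, each line is split into columns; without one,
# the whole line becomes a single-column row.
sub rows_from_open {
    my ($source, $mode, $delimiter) = @_;

    open(my $fh, $mode, $source) or die "Could not open '$source': $!";
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        push @rows, [ defined($delimiter) ? split(/\Q$delimiter\E/, $line) : $line ];
    }
    close($fh) or die "Could not close '$source' (status $?): $!";
    return \@rows;
}

# Usage: write a small tab-delimited file, then read it back both ways.
my ($out, $filename) = tempfile(UNLINK => 1);
print {$out} "ENSG01\tchr1\nENSG02\tchr2\n";
close $out;

my $from_file = rows_from_open($filename, '<', "\t");
my $from_cmd  = rows_from_open("cat $filename", '-|', "\t");
print scalar(@{$from_file}), " rows from file, ", scalar(@{$from_cmd}), " rows from command\n";
```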

◆ _get_rows_from_query()

protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_get_rows_from_query ( )

    Description: a private method that loads ids from a given sql query

    param('db_conn'): An optional hash to pass in connection parameters to the database upon which the query will have to be run.

◆ _substitute_minibatched_rows()

protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_substitute_minibatched_rows ( )

    Description: a private method that minibatches a list and transforms every minibatch using param-substitution
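
The grouping step can be sketched as follows, using the 'step', 'contiguous' and 'key_column' parameters documented under run(). This is an illustration only: it omits param-substitution, assumes numeric keys when 'contiguous' is set, and the output key names (range_start, range_end, range_count) are chosen for this sketch.

```perl
use strict;
use warnings;

# Group rows into minibatches of at most $step rows. When $contiguous is true,
# a batch is also closed as soon as the next key is not the previous key + 1.
sub minibatch_rows {
    my ($rows, $step, $contiguous, $key_column) = @_;
    $key_column //= 0;

    my (@batches, @current);
    for my $row (@{$rows}) {
        my $key  = $row->[$key_column];
        my $full = @current >= $step;
        my $gap  = $contiguous && @current && $key != $current[-1] + 1;
        if ($full || $gap) {
            push @batches, [ @current ];    # close the current batch
            @current = ();
        }
        push @current, $key;
    }
    push @batches, [ @current ] if @current;

    # Summarise each batch by its first key, last key and size.
    return [ map { +{ range_start => $_->[0], range_end => $_->[-1], range_count => scalar(@{$_}) } } @batches ];
}

my $rows = [ map { [ $_ ] } (1, 2, 5, 6, 7, 8) ];
for my $batch (@{ minibatch_rows($rows, 3, 1, 0) }) {
    print "$batch->{range_start}..$batch->{range_end} ($batch->{range_count})\n";
}
```

The gap between 2 and 5 closes the first batch early, which is one reason the real size of a range may be smaller than the requested step.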

◆ _substitute_rows()

protected Bio::EnsEMBL::Hive::RunnableDB::JobFactory::_substitute_rows ( )

    Description: a private method that goes through a list and transforms every row into a hash
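
As an illustration of the row-to-hash step, the sketch below keys each column by its name when column names are available and by position (_0, _1, ...) otherwise; the positional form is what the '#_0#' placeholder in the Synopsis refers to. The function name and the exact fallback rule are assumptions of this sketch.

```perl
use strict;
use warnings;

# Turn each row (an arrayref) into a hash of job parameters.
# A column is keyed by its name if one is given, otherwise by "_<index>".
sub rows_to_hashes {
    my ($rows, $column_names) = @_;

    my @hashes;
    for my $row (@{$rows}) {
        my %job_params;
        for my $i (0 .. $#{$row}) {
            my $key = ($column_names && defined $column_names->[$i])
                ? $column_names->[$i]
                : "_$i";
            $job_params{$key} = $row->[$i];
        }
        push @hashes, \%job_params;
    }
    return \@hashes;
}

my $hashes = rows_to_hashes([ [ 'ENSG01', 'chr1' ] ], [ 'gene_id', 'region' ]);
print join(', ', map { "$_ => $hashes->[0]{$_}" } sort keys %{ $hashes->[0] }), "\n";
```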

◆ param_defaults()

public Bio::EnsEMBL::Hive::RunnableDB::JobFactory::param_defaults ( )

Undocumented method

◆ run()

public Bio::EnsEMBL::Hive::RunnableDB::JobFactory::run ( )

    Description : Implements run() interface method of Bio::EnsEMBL::Hive::Process that is used to perform the main bulk of the job (minus input and output).

    param('column_names'):  Controls the column names that come out of the parser: 0 = "no names", 1 = "parse names from data", arrayref = "take names from this array"

    param('delimiter'): If you set it, your lines in file/cmd mode will be split into columns that you can use individually when constructing the input_id_template hash.

    param('randomize'): Shuffles the rows before creating jobs - can sometimes lead to better overall performance of the pipeline. Doesn't make any sense for minibatches (step>1).

    param('step'):      The requested size of the minibatch (1 by default). The real size of a range may be smaller than the requested size.

    param('contiguous'): Whether the key_column range of each minibatch should be contiguous (0 by default).

    param('key_column'): If every line of your input is a list (it happens, for example, when your SQL returns multiple columns or you have set the 'delimiter' in file/cmd mode)
                         this is the way to say which column is undergoing 'ranging'

    The following 4 parameters are mutually exclusive and define the source of ids for the jobs:

    param('inputlist'):  The list is given explicitly in the parameters and can be abbreviated: 'inputlist' => ['a'..'z']

    param('inputfile'):  The list is contained in a file whose name is supplied as a parameter: 'inputfile' => 'myfile.txt'

    param('inputquery'): The list is generated by an SQL query (against the production database by default): 'inputquery' => 'SELECT object_id FROM object WHERE x=y'

    param('inputcmd'):   The list is generated by running a system command: 'inputcmd' => 'find /tmp/big_directory -type f'
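
To show how these parameters might fit together, here is a hedged configuration sketch for a tab-delimited input file. It assumes the usual PipeConfig analysis layout and the template form of flow_into used in the Synopsis. The file name, analysis names, column names and template keys are all hypothetical.

```perl
# Fragment of a pipeline_analyses() list; file, analysis and column names are hypothetical.
{   -logic_name => 'make_gene_jobs',
    -module     => 'Bio::EnsEMBL::Hive::RunnableDB::JobFactory',
    -parameters => {
        'inputfile'    => 'genes.tsv',                  # one source only: inputlist, inputfile, inputquery or inputcmd
        'delimiter'    => "\t",                         # split each line into columns
        'column_names' => [ 'gene_id', 'species' ],     # take the names from this array
        'randomize'    => 1,                            # shuffle the rows before creating jobs
    },
    -flow_into  => {
        # the input_id template of the dataflow rule refers to the columns by name
        2 => { 'process_gene' => { 'gene' => '#gene_id#', 'taxon' => '#species#' } },
    },
},
```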

Code:

sub run {
    my $self = shift @_;
 
    my $column_names    = $self->param('column_names');   # can be 0 (no names), 1 (names from data) or an arrayref (names from this array)
    my $delimiter       = $self->param('delimiter');
 
    my $randomize       = $self->param('randomize');
 
        # minibatching-related:
    my $step            = $self->param('step');
    my $contiguous      = $self->param('contiguous');
    my $key_column      = $self->param('key_column');
 
    my $inputlist       = $self->param('inputlist');
    my $inputfile       = $self->param('inputfile');
    my $inputquery      = $self->param('inputquery');
    my $inputcmd        = $self->param('inputcmd');
 
    my $parse_column_names = $column_names && (ref($column_names) ne 'ARRAY');
 
    my ($rows, $column_names_from_data) =
              $inputlist    ? $self->_get_rows_from_list(  $inputlist  )
            : $inputquery   ? $self->_get_rows_from_query( $inputquery )
            : $inputfile    ? $self->_get_rows_from_open(  $inputfile  , '<', $delimiter, $parse_column_names )
            : $inputcmd     ? $self->_get_rows_from_open( ($self->param('use_bash_pipefail') ? 'set -o pipefail; ': '').$inputcmd, '-|', $delimiter, $parse_column_names )
            : die "range of values should be defined by setting 'inputlist', 'inputquery', 'inputfile' or 'inputcmd'";
 
    if( $column_names_from_data                                             # column data is available
    and ( defined($column_names) ? (ref($column_names) ne 'ARRAY') : 1 )    # and is badly needed
    ) {
        $column_names = $column_names_from_data;
    }
    # after this point $column_names should either contain a list or be false
 
    if( $self->param('input_id') ) {
        die "'input_id' is no longer supported, please reconfigure as the input_id_template of the dataflow_rule";
    }
 
    if($randomize) {
        _fisher_yates_shuffle_in_place($rows);
    }
 
    my $output_ids = $step
        ? $self->_substitute_minibatched_rows($rows, $column_names, $step, $contiguous, $key_column)
        : $self->_substitute_rows($rows, $column_names);
 
    $self->param('output_ids', $output_ids);
}

◆ write_output()

public Bio::EnsEMBL::Hive::RunnableDB::JobFactory::write_output ( )

    Description : Implements write_output() interface method of Bio::EnsEMBL::Hive::Process that is used to deal with the job's output after execution.
                  Here we rely on the dataflow mechanism to create jobs.

    param('fan_branch_code'): defines the branch where the fan of jobs is created (2 by default).

The documentation for this class was generated from the following file:

modules/Bio/EnsEMBL/Hive/RunnableDB/JobFactory.pm
