ensembl-hive  2.5
Process.pm
=pod

=head1 NAME

 Bio::EnsEMBL::Hive::Process

=head1 DESCRIPTION

 Abstract superclass. Each Process makes up the individual building blocks
 of the system. Instances of these processes are created in a hive workflow
 graph of Analysis entries that are linked together with dataflow and
 AnalysisCtrl rules.

 Instances of these Processes are created by the system as work is done.
 The newly created Process will have preset $self->db, $self->dbc, $self->input_id
 and several other variables.
 From this input and configuration data, each Process can then proceed to
 do something. The flow of execution within a Process is:

   pre_cleanup() if($retry_count>0);  # clean up databases/filesystem before subsequent attempts
   fetch_input();                     # fetch the data from databases/filesystems
   run();                             # perform the main computation
   write_output();                    # record the results in databases/filesystems
   post_healthcheck();                # check if we got the expected result (optional)
   post_cleanup();                    # destroy all non-trivial data structures after the job is done

 The developer can implement their own versions of
 pre_cleanup, fetch_input, run, write_output, and post_cleanup to do what they need.

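 As an illustration only (not part of this module), a minimal Runnable
 subclass could look like the following; the package name and parameters
 are hypothetical:

   package MyProject::RunnableDB::WordCount;

   use strict;
   use warnings;
   use base ('Bio::EnsEMBL::Hive::Process');

   sub fetch_input {              # fetch: slurp the input file named by the job's parameters
       my $self = shift;
       my $filename = $self->param_required('filename');
       open(my $fh, '<', $filename) or die "Cannot open '$filename': $!";
       local $/ = undef;          # slurp mode
       $self->param('text', scalar <$fh>);
       close($fh);
   }

   sub run {                      # run: do the (trivial) computation in memory
       my $self = shift;
       my @words = split(/\s+/, $self->param('text'));
       $self->param('word_count', scalar(@words));
   }

   sub write_output {             # write: record the result by flowing it down branch #1
       my $self = shift;
       $self->dataflow_output_id( { 'word_count' => $self->param('word_count') }, 1 );
   }

   1;
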
 The entire system is based around the concept of a workflow graph which
 can split and loop back on itself. This is accomplished by dataflow
 rules (similar to Unix pipes) that connect one Process (or analysis) to others.
 Where a Unix command-line program can send output to its STDOUT and STDERR pipes,
 a hive Process has access to unlimited pipes referenced by numerical
 branch_codes. This is accomplished within the Process via

   $self->dataflow_output_id(...);

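 For example, a Process could flow one event down the default branch #1 and
 another down branch #2 (hypothetical here; it only has an effect if the
 workflow graph wires that branch to something):

   $self->dataflow_output_id( { 'gene_id'  => $gene_id }, 1 );   # default branch
   $self->dataflow_output_id( { 'log_line' => $message }, 2 );   # numbered side-branch
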
 The design philosophy is that each Process does its work and creates output,
 but it doesn't worry about where the input came from, or where its output
 goes. If the system has dataflow pipes connected, then the output jobs
 have purpose; if not, the output work is thrown away. The workflow graph
 'controls' the behaviour of the system, not the processes. The processes just
 need to do their job. The design of the workflow graph is based on the knowledge
 of what each Process does, so that the graph can be correctly constructed.
 The workflow graph can be constructed a priori, or can be constructed and
 modified by intelligent Processes as the system runs.


 The Hive is based on AI concepts and modeled on the social structure and
 behaviour of a honey bee hive. So where a worker honey bee's purpose is
 (go find pollen, bring it back to the hive, drop off the pollen, repeat), an ensembl-hive
 worker's purpose is (find a job, create a Process for that job, run it,
 drop off the output job(s), repeat). While most workflow systems are based
 on 'smart' central controllers and external control of 'dumb' processes,
 the Hive is based on 'dumb' workflow graphs and a job kiosk, and 'smart' workers
 (autonomous agents) who are self-configuring and figure out for themselves what
 needs to be done, and then do it. The workers are based around a set of
 emergent behaviour rules which allow a predictable system behaviour to emerge
 from what otherwise might appear at first glance to be a chaotic system. There
 is an inherent asynchronous disconnect between one worker and the next.
 Work (or jobs) is simply 'posted' on a blackboard or kiosk within the hive
 database where other workers can find it.

 The emergent behaviour rules of a worker are:
  1) If a job is posted, someone needs to do it.
  2) Don't grab something that someone else is working on.
  3) Don't grab more than you can handle.
  4) If you grab a job, it needs to be finished correctly.
  5) Keep busy doing work.
  6) If you fail, do the best you can to report back.

 For further reading on the AI principles employed in this design see:
  http://en.wikipedia.org/wiki/Autonomous_Agent
  http://en.wikipedia.org/wiki/Emergence

=head1 LICENSE

 Copyright [1999-2015] Wellcome Trust Sanger Institute and the EMBL-European Bioinformatics Institute
 Copyright [2016-2022] EMBL-European Bioinformatics Institute

 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software distributed under the License
 is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and limitations under the License.

=head1 CONTACT

 Please subscribe to the Hive mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/ehive-users to discuss Hive-related questions or to be notified of our updates

=head1 APPENDIX

 The rest of the documentation details each of the object methods.
 Internal methods are usually preceded with a _

=cut


package Bio::EnsEMBL::Hive::Process;

use strict;
use warnings;

use File::Path qw(remove_tree);
use JSON;
use Scalar::Util qw(looks_like_number);
use Time::HiRes qw(time);

use Bio::EnsEMBL::Hive::Utils ('stringify', 'go_figure_dbc', 'join_command_args', 'timeout');
use Bio::EnsEMBL::Hive::Utils::Stopwatch;


sub new {
    my $class = shift @_;

    my $self = bless {}, $class;

    return $self;
}


sub life_cycle {
    my ($self) = @_;

    my $job = $self->input_job();
    my $partial_stopwatch = Bio::EnsEMBL::Hive::Utils::Stopwatch->new();
    my %job_partial_timing = ();

    $job->incomplete(1);    # reinforce, in case the life_cycle is not run by a Worker
    $job->autoflow(1);

    eval {
        # Catch all the "warn" calls
        #$SIG{__WARN__} = sub { $self->warning(@_) };

        if( $self->can('pre_cleanup') and $job->retry_count()>0 ) {
            $self->enter_status('PRE_CLEANUP');
            $self->pre_cleanup;
        }

        # PRE_HEALTHCHECK can come here

        $self->enter_status('FETCH_INPUT');
        $partial_stopwatch->restart();
        $self->fetch_input;
        $job_partial_timing{'FETCH_INPUT'} = $partial_stopwatch->pause->get_elapsed;

        $self->enter_status('RUN');
        $partial_stopwatch->restart();
        $self->run;
        $job_partial_timing{'RUN'} = $partial_stopwatch->pause->get_elapsed;

        if($self->worker->execute_writes) {
            $self->enter_status('WRITE_OUTPUT');
            $partial_stopwatch->restart();
            $self->write_output;
            $job_partial_timing{'WRITE_OUTPUT'} = $partial_stopwatch->pause->get_elapsed;

            if( $self->can('post_healthcheck') ) {
                $self->enter_status('POST_HEALTHCHECK');
                $self->post_healthcheck;
            }
        } else {
            $self->say_with_header( ": *no* WRITE_OUTPUT requested, so there will be no AUTOFLOW" );
        }
    };
    # Restore the default handler
    #$SIG{__WARN__} = 'DEFAULT';

    if(my $life_cycle_msg = $@) {
        $job->died_somewhere( $job->incomplete );   # it will be OR'd inside
        Bio::EnsEMBL::Hive::Process::warning($self, $life_cycle_msg, $job->incomplete?'WORKER_ERROR':'INFO');   # in case the Runnable has redefined warning()
    }

    if( $self->can('post_cleanup') ) {   # may be run to clean up memory even after partially failed attempts
        eval {
            $job->incomplete(1);    # it could have been reset by a previous call to complete_early
            $self->enter_status('POST_CLEANUP');
            $self->post_cleanup;
        };
        if(my $post_cleanup_msg = $@) {
            $job->died_somewhere( $job->incomplete );   # it will be OR'd inside
            Bio::EnsEMBL::Hive::Process::warning($self, $post_cleanup_msg, $job->incomplete?'WORKER_ERROR':'INFO');   # in case the Runnable has redefined warning()
        }
    }

    unless( $job->died_somewhere ) {

        if( $self->execute_writes and $job->autoflow ) {    # AUTOFLOW doesn't have its own status, so will have whatever the previous state of the job was
            $self->say_with_header( ': AUTOFLOW input->output' );
            $job->dataflow_output_id();
        }

        my @zombie_funnel_dataflow_rule_ids = keys %{$job->fan_cache};
        if( scalar(@zombie_funnel_dataflow_rule_ids) ) {
            $job->transient_error(0);
            die "The group of semaphored jobs is incomplete ! Some fan jobs (coming from dataflow_rule_id(s) ".join(',',@zombie_funnel_dataflow_rule_ids).") are missing a job on their funnel. Check the order of your dataflow_output_id() calls.";
        }

        $job->incomplete(0);

        return \%job_partial_timing;
    }
}


sub say_with_header {
    my ($self, $msg, $important) = @_;

    $important //= $self->debug();

    if($important) {
        if(my $worker = $self->worker) {
            $worker->worker_say( $msg );
        } else {
            print "StandaloneJob $msg\n";
        }
    }
}


sub enter_status {
    my ($self, $status) = @_;

    my $job = $self->input_job();

    $job->set_and_update_status( $status );

    if(my $worker = $self->worker) {
        $worker->set_and_update_status( 'JOB_LIFECYCLE' );  # to ensure the when_checked_in TIMESTAMP is updated
    }

    $self->say_with_header( '-> '.$status );
}


sub warning {
    my ($self, $msg, $message_class) = @_;

    $message_class = 'WORKER_ERROR' if $message_class && looks_like_number($message_class);
    $message_class ||= 'INFO';
    chomp $msg;

    $self->say_with_header( "$message_class : $msg", 1 );

    my $job = $self->input_job;
    my $worker = $self->worker;

    if(my $job_adaptor = ($job && $job->adaptor)) {
        $job_adaptor->db->get_LogMessageAdaptor()->store_job_message($job->dbID, $msg, $message_class);
    } elsif(my $worker_adaptor = ($worker && $worker->adaptor)) {
        $worker_adaptor->db->get_LogMessageAdaptor()->store_worker_message($worker, $msg, $message_class);
    }
}
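
# Typical usage from a subclass (the message text is illustrative):
#
#   $self->warning("Gene $gene_id has no transcripts, skipping");
#
# The message is printed and, when adaptors are available, also stored
# in the hive database via the LogMessageAdaptor.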


##########################################
#
# methods subclasses should override
# in order to give this process function
#
##########################################


=head2 param_defaults

 Title   : param_defaults
 Function: subclass can define defaults for all params used by the RunnableDB/Process

=cut
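
# An illustrative override (these parameter names are hypothetical):
#
#   sub param_defaults {
#       return {
#           'min_length' => 100,     # used unless overridden by the analysis or the job
#           'mode'       => 'fast',
#       };
#   }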

sub param_defaults {
    return {};
}


#
## Function: subclass can implement functions related to cleaning up the database/filesystem after the previous unsuccessful run.
#

# sub pre_cleanup {
#     my $self = shift;
#
#     return 1;
# }


=head2 fetch_input

 Title   : fetch_input
 Function: subclass can implement functions related to data fetching.
           Typical activities would be to parse $self->input_id .
           Subclasses may also want to fetch data from databases
           or from files within this function.

=cut

sub fetch_input {
    my $self = shift;

    return 1;
}


=head2 run

 Title   : run
 Function: subclass can implement functions related to process execution.
           Typical activities include running external programs or running
           algorithms by calling Perl methods. A Process may also choose to
           parse results into memory if an external program was used.

=cut

sub run {
    my $self = shift;

    return 1;
}


=head2 write_output

 Title   : write_output
 Function: subclass can implement functions related to storing results.
           Typical activities include writing results into database tables
           or into files on a shared filesystem.

=cut

sub write_output {
    my $self = shift;

    return 1;
}


#
## Function: subclass can implement functions related to cleaning up after running one job
# (destroying non-trivial data structures in memory).
#

#sub post_cleanup {
#    my $self = shift;
#
#    return 1;
#}


######################################################
#
# methods that subclasses can use to get access
# to hive infrastructure
#
######################################################


=head2 worker

 Title   : worker
 Usage   : my $worker = $self->worker;
 Function: returns the Worker object this Process is run by
 Returns : Bio::EnsEMBL::Hive::Worker object

=cut

sub worker {
    my $self = shift;

    $self->{'_worker'} = shift if(@_);
    return $self->{'_worker'};
}


sub execute_writes {
    my $self = shift;

    return $self->worker->execute_writes(@_);
}


=head2 db

 Title   : db
 Usage   : my $hiveDBA = $self->db;
 Function: returns the DBAdaptor to the Hive database
 Returns : Bio::EnsEMBL::Hive::DBSQL::DBAdaptor object

=cut

sub db {
    my $self = shift;

    return $self->worker->adaptor && $self->worker->adaptor->db(@_);
}


=head2 dbc

 Title   : dbc
 Usage   : my $hiveDBConnection = $self->dbc;
 Function: returns the DBConnection to the Hive database
 Returns : Bio::EnsEMBL::Hive::DBSQL::DBConnection object

=cut

sub dbc {
    my $self = shift;

    return $self->db && $self->db->dbc;
}


=head2 data_dbc

 Title   : data_dbc
 Usage   : my $data_dbc = $self->data_dbc;
 Function: returns a Bio::EnsEMBL::Hive::DBSQL::DBConnection object (the "current" one by default, but can be set up otherwise)

=cut

sub data_dbc {
    my $self = shift @_;

    my $given_db_conn = shift @_ || ($self->param_is_defined('db_conn') ? $self->param('db_conn') : $self);
    my $given_ref = ref( $given_db_conn );
    my $given_signature = ($given_ref eq 'ARRAY' or $given_ref eq 'HASH') ? stringify ( $given_db_conn ) : "$given_db_conn";

    if (!$self->param_is_defined('db_conn') and !$self->db and !$self->dbc) {
        # go_figure_dbc won't be able to create a DBConnection, so let's
        # just print a nicer error message
        $self->input_job->transient_error(0);
        throw('In standaloneJob mode, $self->data_dbc requires the -db_conn parameter to be defined on the command-line');
    }

    if( !$self->{'_cached_db_signature'} or ($self->{'_cached_db_signature'} ne $given_signature) ) {
        $self->{'_cached_db_signature'} = $given_signature;
        $self->{'_cached_data_dbc'} = go_figure_dbc( $given_db_conn );
    }

    return $self->{'_cached_data_dbc'};
}
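
# For illustration only (the URL and table name are hypothetical), a Runnable
# can point data_dbc() at an external database via the 'db_conn' parameter:
#
#   standaloneJob.pl MyRunnable -db_conn mysql://user:pass@host:3306/target_db
#
# and then query it from inside the Runnable:
#
#   my ($count) = $self->data_dbc->db_handle->selectrow_array('SELECT COUNT(*) FROM my_table');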


=head2 run_system_command

 Title   : run_system_command
 Arg[1]  : (string or arrayref) Command to be run
 Arg[2]  : (hashref, optional) Options, amongst:
            - use_bash_pipefail: when enabled, a command with pipes will require all sides to succeed
            - use_bash_errexit: when enabled, will stop at the first failure (otherwise commands such as "do_something_that_fails; do_something_that_succeeds" would return 0)
            - timeout: the maximum number of seconds the command can run for. Will return the exit code -2 if the command has to be aborted
            - die_on_failure: when enabled, dies if the command returns a non-zero exit code
 Usage   : my $return_code = $self->run_system_command('script.sh with many_arguments');  # command as a single string
           my $return_code = $self->run_system_command(['script.sh', 'arg1', 'arg2']);    # command as an array-ref
           my ($return_code, $stderr, $string_command) = $self->run_system_command(['script.sh', 'arg1', 'arg2']);  # same in list context; $string_command will be "script.sh arg1 arg2"
           my $return_code = $self->run_system_command('script1.sh with many_arguments | script2.sh', {'use_bash_pipefail' => 1});  # command with pipes evaluated in a bash "pipefail" environment
 Function: Runs a command given as a single string or an array-ref. The second argument is a hashref of options.
 Returns : the return code in scalar context, or a list (return code, standard error, flattened command string, standard output, elapsed time in milliseconds) in list context

=cut

sub run_system_command {
    my ($self, $cmd, $options) = @_;

    require Capture::Tiny;

    $options //= {};
    my ($join_needed, $flat_cmd) = join_command_args($cmd);
    my @cmd_to_run;

    my $need_bash = $options->{'use_bash_pipefail'} || $options->{'use_bash_errexit'};
    if ($need_bash) {
        @cmd_to_run = ('bash',
                       $options->{'use_bash_pipefail'} ? ('-o' => 'pipefail') : (),
                       $options->{'use_bash_errexit'}  ? ('-o' => 'errexit')  : (),
                       '-c' => $flat_cmd);
    } else {
        # Let's use the array if possible, it saves us from running a shell
        @cmd_to_run = $join_needed ? $flat_cmd : (ref($cmd) ? @$cmd : $cmd);
    }

    $self->say_with_header("Command given: " . stringify($cmd));
    $self->say_with_header("Command to run: " . stringify(\@cmd_to_run));

    $self->dbc and $self->dbc->disconnect_if_idle();    # release this connection for the duration of the system() call

    my $return_value;

    # Capture::Tiny has weird behavior if 'require'd instead of 'use'd
    # see, for example, http://www.perlmonks.org/?node_id=870439
    my $starttime = time() * 1000;
    my ($stdout, $stderr) = Capture::Tiny::tee(sub {
        $return_value = timeout( sub {system(@cmd_to_run)}, $options->{'timeout'} );
    });
    die sprintf("Could not run '%s', got %s\nSTDERR %s\n", $flat_cmd, $return_value, $stderr) if $return_value && $options->{die_on_failure};

    return ($return_value, $stderr, $flat_cmd, $stdout, time()*1000-$starttime) if wantarray;
    return $return_value;
}
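
# An illustrative use of the 'timeout' and 'die_on_failure' options
# (the command itself is hypothetical):
#
#   my $rc = $self->run_system_command(['long_running_script.sh', '--all'],
#                                      {'timeout' => 3600, 'die_on_failure' => 0});
#   if ($rc == -2) {
#       $self->warning('command was aborted after the 1-hour timeout');
#   }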


=head2 input_job

 Title   : input_job
 Function: Returns the AnalysisJob to be run by this process.
           Subclasses should treat this as a read-only object.
 Returns : Bio::EnsEMBL::Hive::AnalysisJob object

=cut

sub input_job {
    my $self = shift @_;

    if(@_) {
        if(my $job = $self->{'_input_job'} = shift) {
            throw("Not a Bio::EnsEMBL::Hive::AnalysisJob object") unless ($job->isa("Bio::EnsEMBL::Hive::AnalysisJob"));
        }
    }
    return $self->{'_input_job'};
}


# ##################### subroutines that link through to Job's methods #########################

sub input_id {
    my $self = shift;

    return $self->input_job->input_id(@_);
}

sub param {
    my $self = shift @_;

    return $self->input_job->param(@_);
}

sub param_required {
    my $self = shift @_;

    my $prev_transient_error = $self->input_job->transient_error();    # make a note of the previously set transience status
    $self->input_job->transient_error(0);                              # make sure if we die in param_required it is not transient

    my $value = $self->input_job->param_required(@_);

    $self->input_job->transient_error($prev_transient_error);          # restore the previous transience status
    return $value;
}

sub param_exists {
    my $self = shift @_;

    return $self->input_job->param_exists(@_);
}

sub param_is_defined {
    my $self = shift @_;

    return $self->input_job->param_is_defined(@_);
}

sub param_substitute {
    my $self = shift @_;

    return $self->input_job->param_substitute(@_);
}

sub dataflow_output_id {
    my $self = shift @_;

    # Let's not spend time stringifying a large object if it's not going to be printed anyway
    $self->say_with_header('Dataflow on branch #' . ($_[1] // 1) . (defined $_[0] ? ' of ' . stringify($_[0]) : ' (no parameters -> input parameters repeated)')) if $self->debug;
    return $self->input_job->dataflow_output_id(@_);
}


=head2 dataflow_output_ids_from_json

 Title   : dataflow_output_ids_from_json
 Arg[1]  : File name
 Arg[2]  : (optional) Branch number, defaults to 1 (see L<AnalysisJob::dataflow_output_id>)
 Function: Wrapper around L<dataflow_output_id> that takes the output_ids from a JSON file.
           Each line in the JSON file is expected to be a complete JSON structure, which
           may be prefixed with a branch number.

=cut

sub dataflow_output_ids_from_json {
    my ($self, $filename, $default_branch) = @_;

    my $json_formatter = JSON->new()->indent(0);
    my @output_job_ids;
    open(my $fh, '<', $filename) or die "Could not open '$filename' because: $!";
    while (my $l = $fh->getline()) {
        chomp $l;
        my $branch = $default_branch;
        my $json = $l;
        if ($l =~ /^(-?\d+)\s+(.*)$/) {
            $branch = $1;
            $json = $2;
        }
        my $hash = $json_formatter->decode($json);
        push @output_job_ids, @{ $self->dataflow_output_id($hash, $branch) };
    }
    close($fh);
    return \@output_job_ids;
}
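
# An input file could look like this (contents illustrative); the first two
# lines flow on the branch given as Arg[2], the third explicitly on branch #2:
#
#   {"gene_id": 17}
#   {"gene_id": 18}
#   2 {"status": "done"}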


sub throw {
    my $msg = pop @_;

    Bio::EnsEMBL::Hive::Utils::throw( $msg );   # this module doesn't import 'throw' to avoid a namespace clash
}


=head2 complete_early

 Arg[1]      : (string) message
 Arg[2]      : (integer, optional) branch number
 Description : Ends the job with the given message, whilst marking the job as complete.
               If a branch number is given, dataflows to that branch right beforehand,
               in which case the autoflow is disabled too.
 Returntype  : This function does not return

=cut

sub complete_early {
    my ($self, $msg, $branch_code) = @_;

    if (defined $branch_code) {
        $self->dataflow_output_id(undef, $branch_code);
        $self->input_job->autoflow(0);
    }
    $self->input_job->incomplete(0);
    $msg .= "\n" unless $msg =~ /\n$/;
    die $msg;
}
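
# Typical usage from within fetch_input() or run() (the condition and
# message are illustrative):
#
#   $self->complete_early('No input rows found; nothing to do') unless $row_count;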


=head2 debug

 Title   : debug
 Function: Gets/sets the flag for the debug level. Set through Worker/runWorker.pl.
           Subclasses should treat this as a read-only variable.
 Returns : integer

=cut

sub debug {
    my $self = shift;

    return $self->worker->debug(@_);
}


=head2 worker_temp_directory

 Title   : worker_temp_directory
 Function: Returns the path to a directory on the local /tmp disk
           which the subclass can use as temporary file space.
           This directory is made the first time the function is called.
           It persists for as long as the worker is alive, which allows
           multiple jobs run by the worker to share temporary data.
           For example, the worker (which runs jobs of a single Analysis) might need
           to dump a data file which is needed by all jobs run through
           this analysis. The process can first check the worker_temp_directory
           for the file, and dump it if it is missing. This way the first job
           run by the worker will do the dump, but subsequent jobs can reuse the
           file.
 Usage   : $tmp_dir = $self->worker_temp_directory;
 Returns : <string> path to a local (/tmp) directory

=cut

sub worker_temp_directory {
    my $self = shift @_;

    unless(defined($self->{'_tmp_dir'}) and (-e $self->{'_tmp_dir'})) {
        $self->{'_tmp_dir'} = $self->worker_temp_directory_name();
        mkdir($self->{'_tmp_dir'}, 0777);
        throw("unable to create a writable directory ".$self->{'_tmp_dir'}) unless(-w $self->{'_tmp_dir'});
    }
    return $self->{'_tmp_dir'};
}
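
# A sketch of the dump-once pattern described above (file and command names
# are hypothetical):
#
#   my $fasta_file = $self->worker_temp_directory . '/proteins.fasta';
#   unless (-e $fasta_file) {    # only the first job run by this worker pays the cost of dumping
#       $self->run_system_command("dump_proteins.sh > $fasta_file", {'die_on_failure' => 1});
#   }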

sub worker_temp_directory_name {
    my $self = shift @_;

    return $self->worker->temp_directory_name;
}


=head2 cleanup_worker_temp_directory

 Title   : cleanup_worker_temp_directory
 Function: Cleans up the directory on the local /tmp disk that is used by the
           worker. It can be used to remove files left there by previous jobs.
 Usage   : $self->cleanup_worker_temp_directory;

=cut

sub cleanup_worker_temp_directory {
    my $self = shift @_;

    my $tmp_dir = $self->worker_temp_directory_name();
    if(-e $tmp_dir) {
        remove_tree($tmp_dir, {error => undef});
    }
}


1;