ensembl-hive  2.8.1
SmartLongMult_conf.pm
=pod

=head1 NAME

    Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::SmartLongMult_conf

=head1 SYNOPSIS

        # initialize the database and build the graph in it (it will also print the value of EHIVE_URL) :
    init_pipeline.pl Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::SmartLongMult_conf

        # optionally also seed it with your specific values:
    seed_pipeline.pl -url $EHIVE_URL -logic_name take_b_apart -input_id '{ "a_multiplier" => "12345678", "b_multiplier" => "3359559666" }'

        # run the pipeline:
    beekeeper.pl -url $EHIVE_URL -loop

=head1 DESCRIPTION

    This is the PipeConfig file for the long multiplication pipeline example.
    The main point of this pipeline is to provide an example of how to write Hive Runnables and link them together into a pipeline.

    Please refer to the Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf module to understand the interface implemented here.

    The setting: let's assume we are given two loooooong numbers to multiply. Reeeeally long.
    Soooo long that they do not fit into the registers of the CPU and have to be multiplied digit-by-digit.
    For the purposes of this example we also assume this task is very computationally intensive and has to be done in parallel.

    The long multiplication pipeline consists of four "analyses" (types of tasks):
    'redirect_trivial_jobs', 'take_b_apart', 'part_multiply' and 'add_together', which we use to exemplify various features of the Hive.

        * A 'redirect_trivial_jobs' job takes in two string parameters, 'a_multiplier' and 'b_multiplier', and checks whether the second one is 0 or a power of 10.
          If that is the case, the multiplication is easy to compute and the result flows directly to the 'final_result' table.

        * A 'take_b_apart' job takes in two string parameters, 'a_multiplier' and 'b_multiplier',
          takes the second one apart into digits, finds which _different_ digits occur in it,
          creates several jobs of the 'part_multiply' analysis and one job of the 'add_together' analysis.
          'take_b_apart' is used when 'redirect_trivial_jobs' could not recognize any of the "trivial" patterns.

        * A 'part_multiply' job takes in 'a_multiplier' and 'digit', multiplies them and accumulates the result in the 'partial_product' accumulator.

        * An 'add_together' job waits for the 'part_multiply' jobs created by its 'take_b_apart' parent to complete,
          takes in 'a_multiplier', 'b_multiplier' and the 'partial_product' hash and stores the final result in the 'final_result' table.

    Please see the implementation details in the Runnable modules themselves.
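
    The division of labour between 'take_b_apart', 'part_multiply' and 'add_together' can be sketched in plain Perl as below.
    This is only an illustration, not the code of the Runnables: the function name long_multiply() is hypothetical,
    Math::BigInt stands in for the digit-by-digit arithmetic, and filling in the entries for digits 0 and 1 without a
    multiplication is an assumption suggested by the '#digit#>1' filter used in this PipeConfig.

```perl
use strict;
use warnings;
use Math::BigInt;

# Hypothetical helper: multiplies two decimal strings the way the pipeline splits the work.
sub long_multiply {
    my ($a_multiplier, $b_multiplier) = @_;

    # "take_b_apart": find the distinct digits of b that need a real multiplication
    my %needs_product = map { $_ => 1 } grep { $_ > 1 } split //, $b_multiplier;

    # "part_multiply": one partial product per distinct digit, keyed by digit
    # (assumption: digits 0 and 1 need no multiplication, so they are filled in directly)
    my %partial_product = (
        0 => Math::BigInt->new(0),
        1 => Math::BigInt->new($a_multiplier),
    );
    foreach my $digit (keys %needs_product) {
        $partial_product{$digit} = Math::BigInt->new($a_multiplier)->bmul($digit);
    }

    # "add_together": walk b from its most significant digit, shifting and adding
    my $result = Math::BigInt->new(0);
    foreach my $digit (split //, $b_multiplier) {
        $result->bmul(10)->badd( $partial_product{$digit} );
    }
    return $result->bstr();
}

print long_multiply('9650156169', '327358788'), "\n";
```

    In the pipeline the three steps run as separate jobs and the per-digit products can be computed in parallel;
    in this sketch they simply run one after another within a single process.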

=head1 LICENSE

    See the NOTICE file distributed with this work for additional information
    regarding copyright ownership.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

         http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License
    is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and limitations under the License.

=head1 CONTACT

    Please subscribe to the Hive mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/ehive-users to discuss Hive-related questions or to be notified of our updates.

=cut


package Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::SmartLongMult_conf;

use strict;
use warnings;

use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');  # All Hive databases configuration files should inherit from HiveGeneric, directly or indirectly
use Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf;           # Allow this particular config to use conditional dataflow and INPUT_PLUS


=head2 pipeline_create_commands

    Description : Implements pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that lists the commands that will create and set up the Hive database.
                  In addition to the standard creation of the database and populating it with Hive tables and procedures, it also creates the pipeline-specific 'final_result' table that receives the results.

=cut

sub pipeline_create_commands {
    my ($self) = @_;
    return [
        @{$self->SUPER::pipeline_create_commands},  # inheriting database and hive tables' creation

            # additional table needed for long multiplication pipeline's operation:
        $self->db_cmd('CREATE TABLE final_result (a_multiplier varchar(255) NOT NULL, b_multiplier varchar(255) NOT NULL, result varchar(255) NOT NULL, PRIMARY KEY (a_multiplier, b_multiplier))'),
    ];
}


=head2 pipeline_wide_parameters

    Description : Interface method that should return a hash of pipeline_wide_parameter_name->pipeline_wide_parameter_value pairs.
                  The value doesn't have to be a scalar; it can be any Perl structure (it will be stringified and de-stringified automatically).
                  Please see existing PipeConfig modules for examples.

=cut

sub pipeline_wide_parameters {
    my ($self) = @_;
    return {
        %{$self->SUPER::pipeline_wide_parameters},  # here we inherit anything from the base class

        'take_time' => 1,
    };
}


=head2 pipeline_analyses

    Description : Implements pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that defines the structure of the pipeline: analyses, jobs, rules, etc.
                  Here it defines four analyses:
                    * 'redirect_trivial_jobs' that is auto-seeded with a pair of jobs (to check the commutativity of multiplication).
                      Each job will check whether the multiplication can be done quickly (multiplication by 0 or a power of 10) and, if so, flow the result to the final_result table.
                      Otherwise, it passes the job on to 'take_b_apart'.

                    * 'take_b_apart' with jobs fed from redirect_trivial_jobs#1.
                      Each job will dataflow (create more jobs) via branch #2 into 'part_multiply' and via branch #1 into 'add_together'.

                    * 'part_multiply' with jobs fed from take_b_apart#2.
                      It multiplies input parameters 'a_multiplier' and 'digit' and dataflows the resulting 'product' via branch #1 into the 'partial_product' accumulator.

                    * 'add_together' with jobs fed from take_b_apart#1.
                      It adds together results of partial multiplication computed by 'part_multiply'.
                      These results are accumulated in the 'partial_product' hash.
                      Until the hash is complete the corresponding 'add_together' job is blocked by a semaphore.
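
    The two WHEN conditions attached to 'redirect_trivial_jobs' amount to the checks sketched below.
    The function name trivial_product() is hypothetical and the snippet runs outside the Hive;
    it only reuses the regular expressions and the "0" x (length-1) expression that appear in the -flow_into rules of this PipeConfig.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the product as a string for the "trivial" cases,
# or nothing (undef) when the job would have to go on to 'take_b_apart'.
sub trivial_product {
    my ($a_multiplier, $b_multiplier) = @_;

    # b consists of zeros only: the product is 0
    return '0' if $b_multiplier =~ /^0+$/;

    # b is 1 followed by zeros (a power of 10): append that many zeros to a
    return $a_multiplier . ('0' x (length($b_multiplier) - 1)) if $b_multiplier =~ /^10*$/;

    return;
}

foreach my $b_multiplier ('0', '1000', '327358788') {
    my $product = trivial_product('9650156169', $b_multiplier);
    print "$b_multiplier: ", (defined $product ? $product : 'not trivial, would flow to take_b_apart'), "\n";
}
```

    Only the second multiplier is inspected, so a job whose 'a_multiplier' is a power of 10 still goes through 'take_b_apart'.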

=cut

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name => 'redirect_trivial_jobs',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
            -meadow_type=> 'LOCAL',     # do not bother the farm with such a simple task (and get it done faster)
            -analysis_capacity  =>  2,  # use per-analysis limiter
            -input_ids => [
                { 'a_multiplier' => '9650156169', 'b_multiplier' => '327358788' },
                { 'a_multiplier' => '327358788', 'b_multiplier' => '9650156169' },
            ],
            -flow_into => {
                    # Identify "easy" multiplications and flow their results directly to the table
                    # We use WHEN to detect the cases, and INPUT_PLUS to make parent job's parameters available to the kids
                1 => WHEN(
                    '#b_multiplier# =~ /^0+$/' => { '?table_name=final_result' => INPUT_PLUS( { 'result' => '0' } ) },
                    '#b_multiplier# =~ /^10*$/' => { '?table_name=final_result' => INPUT_PLUS( { 'result' => '#a_multiplier##expr("0" x (length(#b_multiplier#)-1))expr#' } ) },
                    ELSE 'take_b_apart',
                ),
            },
        },

        {   -logic_name => 'take_b_apart',
            -module     => 'Bio::EnsEMBL::Hive::Examples::LongMult::RunnableDB::DigitFactory',
            -meadow_type=> 'LOCAL',     # do not bother the farm with such a simple task (and get it done faster)
            -analysis_capacity  =>  2,  # use per-analysis limiter
            -flow_into => {
                    # creating a semaphored fan of jobs; filtering by WHEN; using INPUT_PLUS or templates to top-up the hashes.
                    #
                    # A WHEN block is not a hash, so multiple occurrences of each condition (including ELSE) are permitted.
                '2->A' => WHEN(
                    '#digit#>1' => { 'part_multiply' => INPUT_PLUS() },     # make parent job's parameters available to the kids
#                    ELSE { 'part_multiply' => { 'a_multiplier' => '#a_multiplier#', 'digit' => '#digit#' } },
                ),
                    # creating a semaphored funnel job to wait for the fan to complete and add the results:
                'A->1' => [ 'add_together' ],
            },
        },

        {   -logic_name => 'part_multiply',
            -module     => 'Bio::EnsEMBL::Hive::Examples::LongMult::RunnableDB::PartMultiply',
            -analysis_capacity  =>  4,  # use per-analysis limiter
            -flow_into => {
                1 => [ '?accu_name=partial_product&accu_address={digit}&accu_input_variable=product' ],
            },
        },

        {   -logic_name => 'add_together',
            -module     => 'Bio::EnsEMBL::Hive::Examples::LongMult::RunnableDB::AddTogether',
            -flow_into => {
                1 => [ '?table_name=final_result' ],
            },
        },
    ];
}

1;
