    # initialize the database and build the graph in it (it will also print the value of EHIVE_URL) :
    # optionally also seed it with your specific values:
seed_pipeline.pl -url $EHIVE_URL -logic_name chunk_sequences -input_id '{ "sequence" => "gcpct_example.fa" }'
beekeeper.pl -url $EHIVE_URL -loop
    This is the PipeConfig file for the %GC pipeline example.
    The main point of this pipeline is to provide an example of how to write Hive Runnables and link them together into a pipeline.
    The setting. Let's assume we are given a nucleotide sequence and want to calculate what percentage of bases are G or C.
    The approach to this problem is quite simple: go through the sequence, tally up how many times a G or C occurs, then divide by the total number of bases in the sequence.
    Thinking a bit more about this problem, we see that it is very easy to split up into smaller subproblems.
    Each base is its own, independent entity, and they can be tallied in any order, or even simultaneously, without impacting the final result.
    (As an aside, this problem falls into a class of problems that computer scientists call "embarrassingly parallel" or "pleasingly parallel",
    as they are so easy to divide.)
    We can take advantage of this and speed up the computation on longer sequences by splitting up the input sequences into smaller chunks,
    tallying Gs and Cs in those chunks in parallel, then adding up the individual results into a final total.
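
The split-and-tally idea described above can be sketched in plain Perl, without any eHive machinery. This is only an illustration: the helper names tally_gc, split_into_chunks and gc_percent_chunked are hypothetical and are not part of eHive or of the example Runnables. The chunks are processed sequentially here, but because the per-chunk tallies are simply added up, they could be computed in any order or in parallel.

```perl
use strict;
use warnings;

# Tally the G/C bases and the total A/C/G/T bases in one sequence string.
# tr/// with an empty replacement list only counts; it does not modify $seq.
sub tally_gc {
    my ($seq) = @_;
    my $gc    = ($seq =~ tr/GCgc//);
    my $total = ($seq =~ tr/ACGTacgt//);
    return ($gc, $total);
}

# Split a sequence into consecutive chunks of at most $max_len characters.
sub split_into_chunks {
    my ($seq, $max_len) = @_;
    return $seq =~ /(.{1,$max_len})/g;
}

# Tally each chunk independently, then combine the subtotals into a percentage.
sub gc_percent_chunked {
    my ($seq, $max_len) = @_;
    my ($gc_sum, $total_sum) = (0, 0);
    for my $chunk (split_into_chunks($seq, $max_len)) {
        my ($gc, $total) = tally_gc($chunk);
        $gc_sum    += $gc;
        $total_sum += $total;
    }
    return $total_sum ? 100 * $gc_sum / $total_sum : 0;
}

my $sequence = 'ATGCGCGATATTAGCC';    # 16 bases, 8 of them G or C
printf "whole:   %.1f%% GC\n", gc_percent_chunked($sequence, length $sequence);
printf "chunked: %.1f%% GC\n", gc_percent_chunked($sequence, 5);
```

Both calls report the same percentage, which is the property that makes the problem easy to parallelise.
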
    The %GC pipeline consists of three "analyses" (types of tasks):
    'chunk_sequences', 'count_atgc', and 'calc_overall_percentage', which we use to exemplify various features of the Hive.
    * A 'chunk_sequences' job takes sequences in a file and splits them
      into smaller chunks. It creates a set of new files to store these sequence chunks, and
      one new job for each of the new files it creates. In this configuration file, we specify that each of these
      new jobs will be a 'count_atgc' job.

    * A 'count_atgc' job takes in a string parameter 'fasta_filename', then tallies up the number of As, Cs, Gs and Ts in the sequence(s)
      in that file. It outputs the tallies as two parameters: 'at_count' and 'gc_count'. In this pipeline,
      these parameters are flowed into two accumulators, also called 'at_count' and 'gc_count', where they are
      collected for the 'calc_overall_percentage' job.

    * The 'calc_overall_percentage' job is run after all 'count_atgc' jobs have completed.
      It takes in the tallied AT and GC counts from the 'at_count' and 'gc_count' accumulators,
      calculates the overall GC percentage, and outputs it as a 'result' parameter.
      This pipeline then flows that result into the 'final_result' table.
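
The data flow between the last two analyses can be mimicked in plain Perl as a rough sketch. The two subs below are hypothetical stand-ins that borrow the analysis and parameter names from the description above. They are not the real Runnables and do not use the eHive API, and the accumulators are modelled as plain arrays, which is an assumption about how the collected values are presented to the final job.

```perl
use strict;
use warnings;

# Stand-in for one 'count_atgc' job: returns its two output parameters.
sub count_atgc {
    my ($chunk) = @_;
    return {
        at_count => ($chunk =~ tr/ATat//),
        gc_count => ($chunk =~ tr/GCgc//),
    };
}

# Stand-in for the 'calc_overall_percentage' job: sums the collected
# subtotals and returns a 'result' parameter.
sub calc_overall_percentage {
    my ($accu) = @_;
    my ($at, $gc) = (0, 0);
    $at += $_ for @{ $accu->{at_count} };
    $gc += $_ for @{ $accu->{gc_count} };
    my $total = $at + $gc;
    return { result => $total ? 100 * $gc / $total : 0 };
}

# The "fan": one count_atgc call per chunk, each pushing into the accumulators.
my %accu = ( at_count => [], gc_count => [] );
for my $chunk (qw(ATGCG CGATA TTAGC C)) {
    my $output = count_atgc($chunk);
    push @{ $accu{$_} }, $output->{$_} for qw(at_count gc_count);
}

# The "funnel": runs once, after every chunk has been counted.
my $final = calc_overall_percentage(\%accu);
printf "%.1f\n", $final->{result};    # prints 50.0
```

In the real pipeline the waiting is handled by a semaphore rather than by the order of statements in a script.
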
    Please see the implementation details in the Runnable modules themselves.
    See the NOTICE file distributed with this work for additional information
    regarding copyright ownership.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
    You may obtain a copy of the License at
    Unless required by applicable law or agreed to in writing, software distributed under the License
    is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and limitations under the License.
    Please subscribe to the Hive mailing list: http:
use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');  # All Hive databases configuration files should inherit from HiveGeneric, directly or indirectly
=head2 pipeline_create_commands

    Description : Implements pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
                  that lists the commands that will create and set up the Hive database.
                  In addition to the standard creation of the database and populating it with Hive tables and procedures, it
                  also creates a pipeline-specific table called 'final_result' to store the results of the computation.

=cut

sub pipeline_create_commands {
    my ($self) = @_;
    return [
        @{$self->SUPER::pipeline_create_commands},  # inheriting database and hive tables' creation

            # create an additional table to store the end result of the computation:
        $self->db_cmd('CREATE TABLE final_result (inputfile VARCHAR(255) NOT NULL, result DOUBLE PRECISION NOT NULL, PRIMARY KEY (inputfile))'),
    ];
}

=head2 pipeline_wide_parameters

    Description : Interface method that should return a hash of
                  pipeline_wide_parameter_name->pipeline_wide_parameter_value pairs.
                  The values don't have to be scalars; they can be any Perl structure. (They will be stringified and
                  de-stringified automagically.)

=cut

sub pipeline_wide_parameters {
    my ($self) = @_;
    return {
        %{$self->SUPER::pipeline_wide_parameters},  # here we inherit anything from the base class

        # init_pipeline.pl makes the best guess of the hive root directory and stores it in EHIVE_ROOT_DIR, if it wasn't already set in the shell
        'inputfile'    => $ENV{'EHIVE_ROOT_DIR'} . '/t/input_fasta.fa',  # name of the input file, here set to a sample file included with the eHive distribution
        'input_format' => 'FASTA',  # the expected format of the input file

        # Because this is an example pipeline, we provide a way to slow down execution so
        # that it can be more easily observed as it runs. The 'take_time' parameter
        # specifies how much additional time a step should take before setting itself
=head2 pipeline_analyses

    Description : Implements pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that
                  defines the structure of the pipeline: analyses, jobs, rules, etc.
                  Here it defines three analyses:

                  * 'chunk_sequences' which uses the FastaFactory runnable to split sequences in an input file

                  * 'count_atgc' which takes a chunk produced by chunk_sequences, and tallies the number of occurrences
                    of each base in the sequence(s) in the file

                  * 'calc_overall_percentage' which takes the base count subtotals from all count_atgc jobs and calculates
                    the overall %GC in the sequence(s) in the original input file. The 'calc_overall_percentage' job is
                    blocked by a semaphore until all count_atgc jobs have completed.

=cut

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name => 'chunk_sequences',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::FastaFactory',
            -parameters => {
                'max_chunk_length' => 100,                      # amount of sequence, in bases, to include in a single chunk file
                'output_dir'       => '.',                      # directory to store the chunk files
                'output_prefix'    => 'gcpct_pipeline_chunk_',  # common prefix for the chunk files
                'output_suffix'    => '.chnk',                  # common suffix for the chunk files
            },
            -input_ids => [ { } ],  # auto-seed one job with default parameters (coming from pipeline-wide parameters or analysis parameters)
            -flow_into => {
                '2->A' => [ 'count_atgc' ],               # will create a semaphored fan of jobs; will use param_stack mechanism to pass parameters around
                'A->1' => [ 'calc_overall_percentage' ],  # will create a semaphored funnel job to wait for the fan to complete
            },
        },

        {   -logic_name => 'count_atgc',
            -module     => 'Bio::EnsEMBL::Hive::Examples::GC::RunnableDB::CountATGC',
            -analysis_capacity => 4,  # use per-analysis limiter
            -flow_into => {
                1 => [ '?accu_name=at_count&accu_address=[]',
                       '?accu_name=gc_count&accu_address=[]' ],
            },
        },

        {   -logic_name => 'calc_overall_percentage',
            -module     => 'Bio::EnsEMBL::Hive::Examples::GC::RunnableDB::CalcOverallPercentage',
            -flow_into => {
                1 => [ '?table_name=final_result' ],  # flows output into the DB table 'final_result'
            },
        },
    ];
}