ensembl-hive  2.7.0
GCPct_conf.pm
=pod

=head1 NAME

    Bio::EnsEMBL::Hive::Examples::GC::PipeConfig::GCPct_conf

=head1 SYNOPSIS

    # initialize the database and build the graph in it (it will also print the value of EHIVE_URL):
    init_pipeline.pl Bio::EnsEMBL::Hive::Examples::GC::PipeConfig::GCPct_conf -password <mypass>

    # optionally also seed it with your specific values:
    seed_pipeline.pl -url $EHIVE_URL -logic_name chunk_sequences -input_id '{ "inputfile" => "gcpct_example.fa" }'

    # run the pipeline:
    beekeeper.pl -url $EHIVE_URL -loop

=head1 DESCRIPTION

    This is the PipeConfig file for the %GC pipeline example.
    The main point of this pipeline is to provide an example of how to write Hive Runnables and link them together into a pipeline.

    Please refer to Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf module to understand the interface implemented here.

    The setting. Let's assume we are given a nucleotide sequence and want to calculate what percentage of bases are G or C.
    The approach to this problem is quite simple: go through the sequence, tally up how many times a G or C occurs, then divide by the total number of bases in the sequence.
    Thinking a bit more about this problem, we see that it is very easy to split up into smaller subproblems.
    Each base is its own, independent entity, and they can be tallied in any order, or even simultaneously, without impacting the final result.
    (As an aside, this problem falls into a class of problems that computer scientists call "embarrassingly parallel" or "pleasingly parallel",
    as they are so easy to divide.)
    We can take advantage of this and speed up the computation on longer sequences by splitting up the input sequences into smaller chunks,
    tallying Gs and Cs in those chunks in parallel, then adding up the individual results into a final total.
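
    The split, tally and merge idea described above can be sketched in plain Perl, without any eHive machinery.
    This is only an illustration under stated assumptions: the helper names chunk_sequence() and gc_percentage()
    and the sample sequence are made up for this sketch, the chunks are processed one after another rather than
    in parallel, and the actual Runnables may be implemented differently.

```perl
use strict;
use warnings;

# Split a sequence into consecutive chunks of at most $max_len bases.
# (Hypothetical helper, standing in for what a chunking step does.)
sub chunk_sequence {
    my ($seq, $max_len) = @_;
    my @chunks;
    for (my $pos = 0; $pos < length($seq); $pos += $max_len) {
        push @chunks, substr($seq, $pos, $max_len);
    }
    return @chunks;
}

# Tally A/T and G/C bases in a single chunk (case-insensitive).
# tr/// with an empty replacement list only counts; it does not modify $chunk.
sub count_atgc {
    my ($chunk) = @_;
    my $at = ($chunk =~ tr/ATat//);
    my $gc = ($chunk =~ tr/GCgc//);
    return { at_count => $at, gc_count => $gc };
}

# Merge the per-chunk tallies and compute the overall percentage of G and C.
sub gc_percentage {
    my (@tallies) = @_;
    my ($at, $gc) = (0, 0);
    for my $tally (@tallies) {
        $at += $tally->{at_count};
        $gc += $tally->{gc_count};
    }
    my $total = $at + $gc;
    return $total ? 100 * $gc / $total : 0;   # avoid dividing by zero on empty input
}

my $sequence = 'ATGCGCGCATATGGCC';                       # made-up sample sequence
my @tallies  = map { count_atgc($_) } chunk_sequence($sequence, 5);
printf "GC%% = %.2f\n", gc_percentage(@tallies);
```

    Because each chunk is tallied independently, the order in which the tallies are merged does not change the result,
    which is the property the pipeline relies on.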

    The %GC pipeline consists of three "analyses" (types of tasks):
    'chunk_sequences', 'count_atgc', and 'calc_overall_percentage' that we use to exemplify various features of the Hive.

        * A 'chunk_sequences' job takes sequences in a file and splits them
          into smaller chunks. It creates a set of new files to store these sequence chunks, and it creates
          one new job for each of the new files. In this configuration file, we specify that each of these
          new jobs will be a 'count_atgc' job.

        * A 'count_atgc' job takes in a string parameter 'fasta_filename', then tallies up the number of As, Cs, Gs and Ts in the sequence(s)
          in that file. It outputs the tallies as two parameters: 'at_count' and 'gc_count'. In this pipeline,
          these parameters are flowed into two accumulators, also called 'at_count' and 'gc_count', where they are
          stored for later use.

        * The 'calc_overall_percentage' job is run after all 'count_atgc' jobs have completed.
          It takes in the tallied AT and GC counts from the 'at_count' and 'gc_count' accumulators,
          calculates the overall GC percentage, and outputs it as a 'result' parameter.
          This pipeline then flows that result into the 'final_result' table.

    Please see the implementation details in the Runnable modules themselves.

=head1 LICENSE

    See the NOTICE file distributed with this work for additional information
    regarding copyright ownership.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

         http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License
    is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and limitations under the License.

=head1 CONTACT

    Please subscribe to the Hive mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/ehive-users to discuss Hive-related questions or to be notified of our updates

=cut


package Bio::EnsEMBL::Hive::Examples::GC::PipeConfig::GCPct_conf;

use strict;
use warnings;

use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');  # All Hive database configuration files should inherit from HiveGeneric, directly or indirectly


=head2 pipeline_create_commands

    Description : Implements pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
                  that lists the commands that will create and set up the Hive database.
                  In addition to the standard creation of the database and populating it with Hive tables and procedures it
                  also creates a pipeline-specific table called 'final_result' to store the results of the computation.

=cut

sub pipeline_create_commands {
    my ($self) = @_;
    return [
        @{$self->SUPER::pipeline_create_commands},  # inheriting database and hive tables' creation

        # create an additional table to store the end result of the computation:
        $self->db_cmd('CREATE TABLE final_result (inputfile VARCHAR(255) NOT NULL, result DOUBLE PRECISION NOT NULL, PRIMARY KEY (inputfile))'),
    ];
}


=head2 pipeline_wide_parameters

    Description : Interface method that should return a hash of
                  pipeline_wide_parameter_name->pipeline_wide_parameter_value pairs.
                  The value doesn't have to be a scalar, it can be any Perl structure. (They will be stringified and
                  de-stringified automagically).
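
    To make the stringify / de-stringify round trip concrete, here is a minimal sketch that uses only the core
    Data::Dumper module. It is an illustration under stated assumptions, not a description of eHive's own helper
    functions, which may differ in detail. The function names to_string() and from_string() and the parameter
    name 'chunk_sizes' are hypothetical.

```perl
use strict;
use warnings;
use Data::Dumper;

# Turn any Perl structure into a compact, single-line string.
sub to_string {
    my ($value) = @_;
    local $Data::Dumper::Indent   = 0;   # no newlines or indentation
    local $Data::Dumper::Terse    = 1;   # omit the leading "$VAR1 = "
    local $Data::Dumper::Sortkeys = 1;   # stable key order
    return Dumper($value);
}

# Rebuild the structure from its string form.
# String eval is tolerable here only because the input is our own output;
# never eval strings that come from an untrusted source.
sub from_string {
    my ($string) = @_;
    # Assigning to a variable forces a leading "{" to be parsed as a hash, not a block.
    my $value = eval "my \$v = $string; \$v";
    die "Could not parse '$string': $@" if $@;
    return $value;
}

# 'chunk_sizes' is a hypothetical parameter used only for this illustration.
my $param  = { 'chunk_sizes' => [ 100, 200 ], 'format' => 'FASTA' };
my $stored = to_string($param);      # what could be kept in a text column
my $copy   = from_string($stored);   # what a job would get back

print "$stored\n";
print $copy->{'chunk_sizes'}[1], "\n";
```

    The point of the sketch is that a nested structure survives storage as text, so a parameter value can be
    a list or a hash as well as a plain scalar.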

=cut

sub pipeline_wide_parameters {
    my ($self) = @_;
    return {
        %{$self->SUPER::pipeline_wide_parameters},  # here we inherit anything from the base class

        # init_pipeline.pl makes the best guess of the hive root directory and stores it in EHIVE_ROOT_DIR, if it wasn't already set in the shell
        'inputfile'    => $ENV{'EHIVE_ROOT_DIR'} . '/t/input_fasta.fa',  # name of the input file, here set to a sample file included with the eHive distribution
        'input_format' => 'FASTA',                                       # the expected format of the input file

        # Because this is an example pipeline, we provide a way to slow down execution so
        # that it can be more easily observed as it runs. The 'take_time' parameter
        # specifies how much additional time a step should take before setting itself
        # to "DONE".
        'take_time'    => 1,
    };
}


=head2 pipeline_analyses

    Description : Implements pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that
                  defines the structure of the pipeline: analyses, jobs, rules, etc.
                  Here it defines three analyses:
                    * 'chunk_sequences' which uses the FastaFactory runnable to split sequences in an input file
                      into smaller chunks

                    * 'count_atgc' which takes a chunk produced by chunk_sequences, and tallies the number of occurrences
                      of each base in the sequence(s) in the file

                    * 'calc_overall_percentage' which takes the base count subtotals from all count_atgc jobs and calculates
                      the overall %GC in the sequence(s) in the original input file. The 'calc_overall_percentage' job is
                      blocked by a semaphore until all count_atgc jobs have completed.
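
    The blocking behaviour of the funnel job can be pictured with a toy model in plain Perl. This is a sketch for
    intuition only and is not how eHive implements semaphores or accumulators internally: the hash layout and the
    helper names finish_fan_job(), funnel_is_ready() and run_funnel() are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Toy model of a fan and funnel: the funnel keeps a counter of fan jobs that
# are still outstanding and may run only once that counter reaches zero.
my @chunks = ('ATGCG', 'CGCAT', 'ATGGC');          # made-up chunk contents
my %funnel = (
    semaphore_count => scalar(@chunks),            # one count per fan job
    at_count        => [],                         # stands in for the 'at_count' accumulator
    gc_count        => [],                         # stands in for the 'gc_count' accumulator
);

# A fan job records its subtotals, then releases its hold on the funnel.
sub finish_fan_job {
    my ($funnel, $chunk) = @_;
    push @{ $funnel->{at_count} }, ($chunk =~ tr/AT//);
    push @{ $funnel->{gc_count} }, ($chunk =~ tr/GC//);
    $funnel->{semaphore_count}--;
    return;
}

sub funnel_is_ready { return $_[0]{semaphore_count} == 0 }

# The funnel refuses to run while any fan job is outstanding.
sub run_funnel {
    my ($funnel) = @_;
    die "funnel is still blocked\n" unless funnel_is_ready($funnel);
    my ($at, $gc) = (0, 0);
    $at += $_ for @{ $funnel->{at_count} };
    $gc += $_ for @{ $funnel->{gc_count} };
    my $total = $at + $gc;
    return $total ? 100 * $gc / $total : 0;
}

finish_fan_job(\%funnel, $_) for @chunks;
printf "overall GC%% = %.1f\n", run_funnel(\%funnel);
```

    In the toy model, calling run_funnel() before every fan job has finished raises an error, which mirrors
    the guarantee that the overall percentage is never computed from partial subtotals.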

=cut

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name => 'chunk_sequences',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::FastaFactory',
            -parameters => {
                'max_chunk_length' => 100,                      # amount of sequence, in bases, to include in a single chunk file
                'output_dir'       => '.',                      # directory to store the chunk files
                'output_prefix'    => 'gcpct_pipeline_chunk_',  # common prefix for the chunk files
                'output_suffix'    => '.chnk',                  # common suffix for the chunk files
            },
            -input_ids => [ { } ],  # auto-seed one job with default parameters (coming from pipeline-wide parameters or analysis parameters)
            -flow_into => {
                '2->A' => [ 'count_atgc' ],               # will create a semaphored fan of jobs; will use param_stack mechanism to pass parameters around
                'A->1' => [ 'calc_overall_percentage' ],  # will create a semaphored funnel job to wait for the fan to complete
            },
        },

        {   -logic_name        => 'count_atgc',
            -module            => 'Bio::EnsEMBL::Hive::Examples::GC::RunnableDB::CountATGC',
            -analysis_capacity => 4,  # use per-analysis limiter
            -flow_into         => {
                1 => [ '?accu_name=at_count&accu_address=[]',
                       '?accu_name=gc_count&accu_address=[]' ],
            },
        },

        {   -logic_name => 'calc_overall_percentage',
            -module     => 'Bio::EnsEMBL::Hive::Examples::GC::RunnableDB::CalcOverallPercentage',
            -flow_into  => {
                1 => [ '?table_name=final_result' ],  # flows output into the DB table 'final_result'
            },
        },
    ];
}

1;
