ensembl-hive  2.7.0
KmerPipelineAoHIP_conf.pm
=pod

=head1 NAME

    Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineAoHIP_conf

=head1 SYNOPSIS

    # initialize the database and build the graph in it (it will also print the value of EHIVE_URL):
    init_pipeline.pl Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineAoHIP_conf

    # optionally also seed it with your specific values:
    seed_pipeline.pl -url $EHIVE_URL -logic_name split_sequence -input_id '{ "sequence_file" => "my_sequence.fa", "chunk_size" => 1000, "overlap_size" => 12 }'

    # run the pipeline:
    beekeeper.pl -url $EHIVE_URL -loop

=head1 DESCRIPTION

    This is the PipeConfig file for the Kmer counting pipeline example.
    This pipeline illustrates how to write PipeConfigs and Runnables that utilize the eHive features:
        * Factories creating a fan of jobs
        * Array of hash Accumulators
        * Semaphores
        * Conditional pipeline flow
        * Controlling parameter flow using INPUT_PLUS

    Please refer to Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf module to understand the interface implemented here.

    Determining the frequency of k-mers (runs of nucleotides k bases long) is an important part of sequence analysis.
    This pipeline takes a flat file containing one or more sequences, counts the k-mers in them, then records
    the count of each k-mer in a table in the hive database.
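
    As a minimal sketch of what "counting k-mers" means, the snippet below slides a window of width k
    along a single sequence and tallies each substring. The function name count_kmers is chosen for
    illustration only; this is not the code of the CountKmers runnable.

```perl
use strict;
use warnings;

# Count every k-mer in a sequence by sliding a window of width $k one base at a time.
# Returns a hashref mapping k-mer => number of occurrences.
sub count_kmers {
    my ($sequence, $k) = @_;
    my %counts;
    # The last window starts at length - k; the range is empty if the sequence is shorter than k.
    for my $start (0 .. length($sequence) - $k) {
        $counts{ substr($sequence, $start, $k) }++;
    }
    return \%counts;
}

my $counts = count_kmers('ACGTACGT', 3);
print "$_\t$counts->{$_}\n" for sort keys %$counts;
```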

    The pipeline can be run in two modes: short-sequence mode and long-sequence mode. These modes reflect two k-mer
    analysis use cases.

    Short-sequence mode is useful for counting k-mers when the input contains many short (< a few kb) sequences. In this
    mode, the input file is chunked into several smaller files, each of which contains a subset of the sequences from
    the original input. The k-mers in these sequences are counted in parallel. Then, the pipeline sums the
    individual sub-counts into an overall count for each k-mer.

    Long-sequence mode is useful for counting k-mers when the input contains a few very long (> hundreds of kb) sequences.
    In this mode, the sequence or sequences in the input file are split into shorter subsequences, with overlapping ends.
    The k-mers in these subsequences are counted in parallel. Then, the pipeline sums the individual sub-counts
    into an overall count for each k-mer.
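
    The overlap matters because a k-mer that straddles a cut point would otherwise be lost. The sketch
    below assumes an overlap of k-1 bases (the value this config derives for "overlap_size") and shows
    that the summed sub-counts then equal the counts of the unsplit sequence. The function name
    split_with_overlap is hypothetical; this is not the SplitSequence runnable.

```perl
use strict;
use warnings;

# Split $sequence into pieces of at most $chunk_size bases, where each piece
# starts $overlap bases before the end of the previous one.
sub split_with_overlap {
    my ($sequence, $chunk_size, $overlap) = @_;
    die "chunk_size must exceed overlap\n" unless $chunk_size > $overlap;
    my @chunks;
    my $step = $chunk_size - $overlap;
    # Stop once fewer than $overlap + 1 bases remain: no new k-mer could start there.
    for (my $start = 0; $start < length($sequence) - $overlap; $start += $step) {
        push @chunks, substr($sequence, $start, $chunk_size);
    }
    return @chunks;
}

my ($sequence, $k) = ('ACGTACGTAC', 3);

# Tally k-mers chunk by chunk, then compare with a tally over the whole sequence.
my (%summed, %whole);
for my $chunk (split_with_overlap($sequence, 5, $k - 1)) {
    $summed{ substr($chunk, $_, $k) }++ for 0 .. length($chunk) - $k;
}
$whole{ substr($sequence, $_, $k) }++ for 0 .. length($sequence) - $k;

for my $kmer (sort keys %whole) {
    print "$kmer\twhole=$whole{$kmer}\tsummed=$summed{$kmer}\n";
}
```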

    Selection of short- or long-sequence mode is done by setting the "seqtype" parameter. This parameter determines,
    via eHive's conditional dataflow mechanism, which splitting analysis the split_strategy job flows into.

    Parameters:
        seqtype          => Either 'short' or 'long'. Selects short-sequence mode or long-sequence mode
                            (see descriptions above), and therefore which runnable the pipeline
                            uses to split the sequence
        input_format     => Format of the input sequence file (e.g. FASTA, FASTQ). Must be supported by Bio::SeqIO
        inputfile        => Name of input file
        chunk_size       => Size of sub-sequences or sub-files (in bases) - see the documentation in the FastaFactory
                            and SplitSequence runnables for details
        max_chunk_length => Maximum length of sequence in a sub-file - see the documentation in the FastaFactory
                            and SplitSequence runnables for details. In this pipeline it is set from chunk_size
                            in the chunk_sequence analysis
        output_dir       => Directory for the intermediate split files generated by this pipeline
        output_prefix    => Filename prefix for the intermediate split files generated by this pipeline
        output_suffix    => Filename suffix for the intermediate split files generated by this pipeline
        k                => Length (in bases) of the k-mers to count. This file does not set a default for it

=head1 LICENSE

    See the NOTICE file distributed with this work for additional information
    regarding copyright ownership.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

         http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License
    is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and limitations under the License.

=head1 CONTACT

    Please subscribe to the Hive mailing list:  http://listserver.ebi.ac.uk/mailman/listinfo/ehive-users  to discuss Hive-related questions or to be notified of our updates

=cut

package Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineAoHIP_conf;

use strict;
use warnings;

use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');  # All Hive databases configuration files should inherit from HiveGeneric, directly or indirectly
use Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf;           # Allow this particular config to use conditional dataflow and INPUT_PLUS

=head2 default_options

    Description : Implements the default_options() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
                  that sets default parameter values. These values can be overridden when running the init_pipeline.pl script.
                  Here, we set defaults for:

        seqtype       => Either 'short' or 'long'. Selects short-sequence mode or long-sequence mode
                         (see descriptions above), and therefore which runnable the pipeline
                         uses to split the sequence
        input_format  => Format of the input sequence file (e.g. FASTA, FASTQ). Must be supported by Bio::SeqIO
        inputfile     => Name of input file
        chunk_size    => Size of sub-sequences or sub-files (in bases) - see the documentation in the FastaFactory
                         and SplitSequence runnables for details
        output_dir    => Directory for the intermediate split files generated by this pipeline
        output_prefix => Filename prefix for the intermediate split files generated by this pipeline
        output_suffix => Filename suffix for the intermediate split files generated by this pipeline
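
    The method relies on a plain Perl idiom: the inherited hash is expanded first and the local
    pairs are listed after it, so a local value replaces an inherited one with the same key.
    The sketch below shows only that idiom; base_defaults, my_defaults and their values are
    stand-ins invented for illustration, not eHive code.

```perl
use strict;
use warnings;

# Stand-in for the inherited defaults (hypothetical values, not eHive's real ones).
sub base_defaults {
    return { 'pipeline_name' => 'generic', 'seqtype' => 'long' };
}

# Same shape as default_options(): inherited pairs first, local pairs after.
# When a key appears twice in a hash constructor, the later value is kept.
sub my_defaults {
    return {
        %{ base_defaults() },
        'seqtype'    => 'short',
        'chunk_size' => 40,
    };
}

my $opts = my_defaults();
print "$_ => $opts->{$_}\n" for sort keys %$opts;
```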

=cut

sub default_options {
    my ($self) = @_;

    return {
        %{ $self->SUPER::default_options() },               # inherit other stuff from the base class
        'seqtype'       => 'short',
        'input_format'  => 'FASTA',
        # init_pipeline makes a best guess of the hive root directory and stores
        # it in EHIVE_ROOT_DIR, if it is not already set in the shell
        'inputfile'     => $ENV{'EHIVE_ROOT_DIR'} . '/t/input_fasta.fa',
        'chunk_size'    => 40,
        'output_dir'    => '.',
        'output_prefix' => 'k_split_',
        'output_suffix' => '.fa',
    };
}

=head2 pipeline_create_commands

    Description : Implements pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that lists the commands that will create and set up the Hive database.
                  In addition to the standard creation of the database and populating it with Hive tables and procedures it also creates a table to hold this pipeline's final result.

=cut

sub pipeline_create_commands {
    my ($self) = @_;
    return [
        @{$self->SUPER::pipeline_create_commands},  # inheriting database and hive tables' creation

        # additional table to store results:
        $self->db_cmd('CREATE TABLE final_result (filename VARCHAR(255) NOT NULL, kmer VARCHAR(255) NOT NULL, count INT NOT NULL, PRIMARY KEY (filename, kmer))'),
    ];
}


=head2 pipeline_wide_parameters

    Description : Interface method that should return a hash of pipeline_wide_parameter_name->pipeline_wide_parameter_value pairs.
                  The value does not have to be a scalar; it can be any Perl structure (it is stringified and de-stringified automatically).
                  Here, nothing is added to the parameters inherited from the base class.
                  Please see existing PipeConfig modules for examples.
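
    To picture what stringifying and de-stringifying a structure involves, here is a rough sketch
    using the core Data::Dumper module. It is an assumption-laden illustration of the round trip
    only: to_string and from_string are hypothetical names, and eHive uses its own routines.

```perl
use strict;
use warnings;
use Data::Dumper;

# Turn a Perl structure into a one-line string (a stand-in for storing it in a database column).
sub to_string {
    my ($value) = @_;
    local $Data::Dumper::Terse    = 1;   # no "$VAR1 =" prefix
    local $Data::Dumper::Indent   = 0;   # keep everything on a single line
    local $Data::Dumper::Sortkeys = 1;   # stable key order
    return Dumper($value);
}

# Rebuild the structure from its string form.
# String eval executes code, so this is only acceptable for trusted input.
sub from_string {
    my ($string) = @_;
    my $value = eval "($string)";        # parentheses force "{...}" to parse as a hash, not a block
    die "cannot parse '$string': $@" if $@;
    return $value;
}

my $stored   = to_string({ 'chunk_size' => 40, 'modes' => [ 'short', 'long' ] });
my $restored = from_string($stored);
print "stored as: $stored\n";
print "second mode: $restored->{modes}[1]\n";
```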

=cut

sub pipeline_wide_parameters {
    my ($self) = @_;
    return {
        %{$self->SUPER::pipeline_wide_parameters},          # here we inherit anything from the base class
    };
}

=head2 hive_meta_table

    Description: Interface method that should return a hash of meta-information about the pipeline (e.g. pipeline name or schema version).
                 Here, there is nothing to declare, especially not the parameter stack since we are using INPUT_PLUS.

=cut

sub hive_meta_table {
    my ($self) = @_;
    return {
        %{$self->SUPER::hive_meta_table},       # here we inherit anything from the base class
    };
}

=head2 pipeline_analyses

    Description : Implements pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that defines the structure of the pipeline: analyses, jobs, rules, etc.
                  Here it defines these analyses:

    * split_strategy -- This analysis uses the runnable Bio::EnsEMBL::Hive::RunnableDB::Dummy. It performs no work in itself;
                        rather it exists to trigger dataflow. The interesting part of this pipeline is the WHEN-ELSE flow control
                        in the flow_into section of the analysis definition. Here, subsequent analyses are determined based
                        on the value in the "seqtype" parameter.
    * split_sequence -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence.
                        It splits sequences in an input file with overlap, and stores the subsequences in a collection of
                        output files. In this pipeline, flow goes from split_strategy into split_sequence when the "seqtype"
                        parameter is anything other than "short".
    * chunk_sequence -- This analysis uses the runnable Bio::EnsEMBL::Hive::RunnableDB::FastaFactory. It splits a file
                        containing many sequences into a collection of sub-files, each containing a few of the sequences from
                        the original input file. Individual sequences are kept intact (unlike SplitSequence). In this pipeline,
                        flow goes from split_strategy into chunk_sequence when the "seqtype" parameter is "short".
    * count_kmers    -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CountKmers, which
                        identifies and tallies k-mers in the sequences in an input file. This pipeline is designed to create
                        several count_kmers jobs in parallel, the fan of jobs being created by either split_sequence or chunk_sequence.
    * compile_counts -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsAoH.
                        In this pipeline, a compile_counts job is created but it is initially blocked from running
                        by a semaphore. When all count_kmers jobs have finished, the semaphore is cleared, allowing a worker
                        to claim the compile_counts job and run it. This job compiles all the k-mer counts from
                        the previous count_kmers jobs into overall counts for each k-mer.
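
    The compile step can be pictured as summing an array of hashes and then emitting one row per
    k-mer. The sketch below is a simplified stand-in, not the CompileCountsAoH runnable: sum_counts,
    the sample tallies and the filename 'input.fa' are all invented for illustration, and the rows
    are printed rather than flowed into the final_result table.

```perl
use strict;
use warnings;

# Sum a list of per-job k-mer tallies (an array of hashes) into one overall tally.
sub sum_counts {
    my ($all_counts) = @_;
    my %total;
    for my $job_counts (@$all_counts) {
        $total{$_} += $job_counts->{$_} for keys %$job_counts;
    }
    return \%total;
}

# Hypothetical contents of the 'all_counts' accumulator after three count_kmers jobs.
# The order of the elements does not affect the result.
my $all_counts = [
    { 'ACG' => 2, 'CGT' => 1 },
    { 'CGT' => 3 },
    { 'ACG' => 1, 'GTA' => 4 },
];

my $total = sum_counts($all_counts);

# One row per k-mer, with keys named after the columns of the final_result table.
for my $kmer (sort keys %$total) {
    my %row = ( 'filename' => 'input.fa', 'kmer' => $kmer, 'count' => $total->{$kmer} );
    print join("\t", @row{qw(filename kmer count)}), "\n";
}
```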

=cut

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name  => 'split_strategy',
            -module      => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
            -meadow_type => 'LOCAL',     # do not bother the farm with such a simple task (and get it done faster)
            -input_ids   => [
                {   'seqtype'       => $self->o('seqtype'),
                    'input_format'  => $self->o('input_format'),
                    'inputfile'     => $self->o('inputfile'),
                    'chunk_size'    => $self->o('chunk_size'),
                    'output_dir'    => $self->o('output_dir'),
                    'output_prefix' => $self->o('output_prefix'),
                    'output_suffix' => $self->o('output_suffix'),
                    'k'             => $self->o('k'),
                },
            ],
            -flow_into => {
                # use conditional dataflow to determine the next analysis, based on the value of the "seqtype" parameter
                '1->A' => WHEN('#seqtype# eq "short"' => [ 'chunk_sequence' ],
                          ELSE                           [ 'split_sequence' ]),
                # creating a semaphored funnel job to wait for the fan to complete and add the results:
                'A->1' => [ 'compile_counts' ],
            },

        },

        {   -logic_name => 'split_sequence',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence',
            # here, a template is used to perform a calculation on a parameter
            -parameters => { "overlap_size" => "#expr(#k#-1)expr#" },
            -analysis_capacity => 2,    # use per-analysis limiter
            -flow_into => {
                '2' => { 'count_kmers' => INPUT_PLUS() },
            },
        },

        {   -logic_name => 'chunk_sequence',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::FastaFactory',
            -parameters => { "max_chunk_length" => "#chunk_size#" },
            -flow_into => {
                '2' => { 'count_kmers' => INPUT_PLUS() },
            },
        },

        {   -logic_name => 'count_kmers',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CountKmers',
            # Here, we use a template to rename a parameter.
            -parameters => {
                "sequence_file" => '#chunk_name#',
            },
            -analysis_capacity => 4,    # use per-analysis limiter
            -flow_into => {
                # Flows into a "pile of hashes" accumulator called 'all_counts'. This is analogous to an array of hashes in Perl:
                # Each job is going to create a hash table where the key is a kmer sequence, and the value is
                # the count of that kmer. In a pile, unlike an array, the ordering of elements is not guaranteed.
                # Breaking down the URL into its individual pieces:
                #   ?accu_name=all_counts       : This is the name of the accu, other parts of the pipeline use this name to
                #                               : access it later.
                #   &accu_address=[]            : The [] indicates store this in a pile (an array where the order of elements
                #                               : is not important). Having no label inside the brackets (e.g. [], not [i])
                #                               : is what makes this a pile. Components of the pipeline accessing the accu
                #                               : "all_counts" later will be able to enumerate over the elements, but there's
                #                               : no guarantee for the order of those elements.
                #   &accu_input_variable=counts : This is the name of the variable in the dataflow output that holds the
                #                               : value that hive will store in the accu. Here, the accu_input_variable
                #                               : "counts" matches counts flown on branch 3 in
                #                               : CountKmers::write_output
                # The "hash" portion of this "array of hashes" is controlled by the runnable. In this case,
                # CountKmers packs an entire hash into a hashref, which is flown into the accu.
                3 => [ '?accu_name=all_counts&accu_address=[]&accu_input_variable=counts' ],
            },
        },

        {   -logic_name => 'compile_counts',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsAoH',
            -flow_into => {
                # Flows the output into a table in the hive database called 'final_result'.
                # We created this table earlier in this conf file during pipeline_create_commands().
                # It has three columns, 'filename', 'kmer' and 'count'. Each field
                # is filled by matching the column name to a param name, and filling in with the value
                # from that param. In the CompileCountsAoH runnable, there is a loop in write_output
                # that creates a dataflow event for each kmer seen. Each iteration of this loop
                # (i.e. each dataflow event generated in that loop) fills in one row of the table.
                4 => [ '?table_name=final_result' ],
            },
        },
    ];
}

1;