# KmerPipelineAoH_conf.pm (ensembl-hive 2.7.0)

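=pod

=head1 NAME

    Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineAoH_conf

=head1 SYNOPSIS

    # initialize the database and build the graph in it (it will also print the value of EHIVE_URL):
    init_pipeline.pl Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineAoH_conf
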
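    # optionally also seed it with your specific values (jobs enter the pipeline through split_strategy):
    seed_pipeline.pl -url $EHIVE_URL -logic_name split_strategy -input_id '{ "seqtype" => "long", "input_format" => "FASTA", "inputfile" => "my_sequence.fa", "chunk_size" => 1000, "output_dir" => ".", "output_prefix" => "k_split_", "output_suffix" => ".fa", "k" => 5 }'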

    # run the pipeline:
    beekeeper.pl -url $EHIVE_URL -loop

=head1 DESCRIPTION

    This is the PipeConfig file for the Kmer counting pipeline example.
    This pipeline illustrates how to write PipeConfigs and Runnables that use the following eHive features:
        * Factories creating a fan of jobs
        * Array-of-hashes accumulators
        * Semaphores
        * Conditional pipeline flow

    Please refer to the Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf module to understand the interface implemented here.

    Determining the frequency of k-mers (runs of nucleotides k bases long) is an important part of sequence analysis.
    This pipeline takes a flat file containing one or more sequences, counts the k-mers in them, then records
    the count of each k-mer in a table in the hive database.
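
    As a minimal sketch of what "counting k-mers" means (plain Perl, independent of eHive; the function name
    count_kmers_in_sequence is hypothetical and is not taken from the CountKmers runnable), the following
    slides a window of k bases along one sequence and tallies each window it sees:

```perl
use strict;
use warnings;

# Count every k-mer (window of $k consecutive bases) in one sequence.
# Returns a hashref mapping k-mer => number of occurrences.
# Hypothetical helper for illustration only.
sub count_kmers_in_sequence {
    my ($sequence, $k) = @_;

    my %counts;
    # The last window starts at length - k; a sequence shorter than k
    # gives an empty range and therefore no k-mers.
    for my $start (0 .. length($sequence) - $k) {
        $counts{ uc substr($sequence, $start, $k) }++;
    }
    return \%counts;
}

my $counts = count_kmers_in_sequence('ACGTACGT', 3);
print "$_\t$counts->{$_}\n" for sort keys %$counts;
# prints:
# ACG   2
# CGT   2
# GTA   1
# TAC   1
```

    A sequence of length n contains n - k + 1 windows, so the eight-base example yields six 3-mers in total.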

    The pipeline can be run in two modes: short-sequence mode and long-sequence mode. These modes reflect two k-mer
    analysis use cases.

    Short-sequence mode is useful for counting k-mers when the input contains many short sequences (each up to a few kb). In this
    mode, the input file is chunked into several smaller files, each of which contains a subset of the sequences from
    the original input. The k-mers in these sub-files are counted in parallel. Then the pipeline sums the
    per-file counts into an overall count for each k-mer.
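
    The chunking step can be pictured with the sketch below. It is an illustration under stated assumptions,
    not the FastaFactory implementation: the helper name group_into_chunks and the rule "close a chunk once its
    combined length reaches the limit" are hypothetical, and the real runnable may apply a different cut-off.
    What it does share with the description above is that whole sequences are kept intact:

```perl
use strict;
use warnings;

# Group whole sequences into chunks. A chunk is closed once the combined
# length of its sequences reaches $max_chunk_length; sequences are never cut.
# Hypothetical helper; the cut-off rule is an assumption for illustration.
sub group_into_chunks {
    my ($sequences, $max_chunk_length) = @_;    # $sequences: arrayref of [name, bases]

    my (@chunks, @current);
    my $current_length = 0;
    for my $seq (@$sequences) {
        push @current, $seq;
        $current_length += length $seq->[1];
        if ($current_length >= $max_chunk_length) {
            push @chunks, [@current];
            @current        = ();
            $current_length = 0;
        }
    }
    push @chunks, [@current] if @current;       # leftover sequences form the last chunk
    return \@chunks;
}

my $chunks = group_into_chunks(
    [   [ seq1 => 'ACGTACGTAC' ],
        [ seq2 => 'GGGTTT' ],
        [ seq3 => 'ACGTACGTACGTACG' ],
        [ seq4 => 'TTAA' ],
    ],
    15,
);
for my $i (0 .. $#$chunks) {
    printf "chunk %d: %s\n", $i + 1, join(', ', map { $_->[0] } @{ $chunks->[$i] });
}
# prints:
# chunk 1: seq1, seq2
# chunk 2: seq3
# chunk 3: seq4
```

    Because no sequence is split, each chunk can be counted independently and the per-chunk tallies simply add up.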

    Long-sequence mode is useful for counting k-mers when the input contains a few very long sequences (hundreds of kb or more).
    In this mode, the sequence or sequences in the input file are split into shorter subsequences with overlapping ends.
    The overlap is k-1 bases, so that k-mers spanning a split point are still counted. The k-mers in these subsequences
    are counted in parallel. Then the pipeline sums the per-subsequence counts into an overall count for each k-mer.
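
    The reason for overlapping by k-1 bases can be checked with a short sketch. This is plain Perl written for
    illustration; split_with_overlap is a hypothetical helper and the SplitSequence runnable may differ in its
    details. Each piece starts overlap_size bases before the end of the previous piece:

```perl
use strict;
use warnings;

# Cut $sequence into pieces of at most $chunk_size bases, where each piece
# starts $overlap_size bases before the end of the previous one.
# Hypothetical helper for illustration only.
sub split_with_overlap {
    my ($sequence, $chunk_size, $overlap_size) = @_;
    die "chunk_size must exceed overlap_size\n" unless $chunk_size > $overlap_size;

    my @pieces;
    my $step = $chunk_size - $overlap_size;
    for (my $start = 0; $start < length($sequence); $start += $step) {
        push @pieces, substr($sequence, $start, $chunk_size);
        last if $start + $chunk_size >= length($sequence);    # this piece reached the end
    }
    return @pieces;
}

my $k        = 3;
my $sequence = 'ACGTACGTACGT';
my @pieces   = split_with_overlap($sequence, 6, $k - 1);
print join(' ', @pieces), "\n";    # prints: ACGTAC ACGTAC ACGT

# A piece of length L contributes L - k + 1 windows.
my $windows = 0;
$windows += length($_) - $k + 1 for @pieces;
printf "%d windows in the pieces, %d in the whole sequence\n",
    $windows, length($sequence) - $k + 1;
# prints: 10 windows in the pieces, 10 in the whole sequence
```

    With an overlap of exactly k-1, every k-mer window of the original sequence falls entirely inside one piece
    and only one, so summing the per-piece counts reproduces the counts for the unsplit sequence.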

    Selection of short- or long-sequence mode is done by setting the "seqtype" parameter. This parameter determines
    which splitting analysis receives jobs, via eHive's conditional dataflow mechanism.

    Parameters:
        seqtype          => Can be 'short' or 'long', which determines whether the pipeline runs in short-sequence mode
                            or long-sequence mode (see descriptions above). The value determines which runnable the pipeline
                            will use to split the sequence
        input_format     => Format of the input sequence file (e.g. FASTA, FASTQ). Must be supported by Bio::SeqIO
        inputfile        => Name of input file
        chunk_size       => Size of sub-sequences or sub-files (in bases) - see the documentation in the FastaFactory
                            and SplitSequence runnables for details
        max_chunk_length => Maximum length of sequence in a sub-file; the chunk_sequence analysis sets it from chunk_size -
                            see the documentation in the FastaFactory and SplitSequence runnables for details
        output_dir       => Directory for the intermediate split files generated by this pipeline
        output_prefix    => Filename prefix for the intermediate split files generated by this pipeline
        output_suffix    => Filename suffix for the intermediate split files generated by this pipeline
        k                => Length of the k-mers to count. This file sets no default for it, so it has to be
                            supplied when running init_pipeline.pl (e.g. -k 5)

=head1 LICENSE

    See the NOTICE file distributed with this work for additional information
    regarding copyright ownership.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

         http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License
    is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and limitations under the License.

=head1 CONTACT

    Please subscribe to the Hive mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/ehive-users to discuss Hive-related questions or to be notified of our updates

=cut

package Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineAoH_conf;

use strict;
use warnings;

use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');  # All Hive database configuration files should inherit from HiveGeneric, directly or indirectly
use Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf;           # Imports the WHEN/ELSE helpers so that this config can use conditional dataflow

=head2 default_options

    Description : Implements the default_options() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
                  that sets default parameter values. These values can be overridden when running the init_pipeline.pl script.
                  Here, we set defaults for:

                    seqtype       => Can be 'short' or 'long', which determines whether the pipeline runs in short-sequence mode
                                     or long-sequence mode (see descriptions above). The value determines which runnable the pipeline
                                     will use to split the sequence
                    input_format  => Format of the input sequence file (e.g. FASTA, FASTQ). Must be supported by Bio::SeqIO
                    inputfile     => Name of input file
                    chunk_size    => Size of sub-sequences or sub-files (in bases) - see the documentation in the FastaFactory
                                     and SplitSequence runnables for details
                    output_dir    => Directory for the intermediate split files generated by this pipeline
                    output_prefix => Filename prefix for the intermediate split files generated by this pipeline
                    output_suffix => Filename suffix for the intermediate split files generated by this pipeline

=cut

sub default_options {
    my ($self) = @_;

    return {
        %{ $self->SUPER::default_options() },   # inherit other stuff from the base class
        'seqtype'       => 'short',
        'input_format'  => 'FASTA',
        # init_pipeline makes a best guess of the hive root directory and stores
        # it in EHIVE_ROOT_DIR, if it is not already set in the shell
        'inputfile'     => $ENV{'EHIVE_ROOT_DIR'} . '/t/input_fasta.fa',
        'chunk_size'    => 40,
        'output_dir'    => '.',
        'output_prefix' => 'k_split_',
        'output_suffix' => '.fa',
    };
}

=head2 pipeline_create_commands

    Description : Implements the pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf, which lists the commands that will create and set up the Hive database.
                  In addition to the standard creation of the database and populating it with Hive tables and procedures, it also creates a table to hold this pipeline's final result.

=cut

sub pipeline_create_commands {
    my ($self) = @_;
    return [
        @{$self->SUPER::pipeline_create_commands},  # inheriting database and hive tables' creation

        # additional table to store results:
        $self->db_cmd('CREATE TABLE final_result (filename VARCHAR(255) NOT NULL, kmer VARCHAR(255) NOT NULL, count INT NOT NULL, PRIMARY KEY (filename, kmer))'),
    ];
}


=head2 pipeline_wide_parameters

    Description : Interface method that should return a hash of pipeline_wide_parameter_name->pipeline_wide_parameter_value pairs.
                  The value does not have to be a scalar; it can be any Perl structure (it will be stringified and de-stringified automatically).
                  This pipeline adds no pipeline-wide parameters of its own and only inherits those of the base class.
                  Please see existing PipeConfig modules for examples.

=cut

sub pipeline_wide_parameters {
    my ($self) = @_;
    return {
        %{$self->SUPER::pipeline_wide_parameters},  # here we inherit anything from the base class
    };
}

=head2 hive_meta_table

    Description : Interface method that should return a hash of meta-information about the pipeline (e.g. pipeline name or schema version).
                  Here, we indicate that this pipeline should use the parameter stack by setting 'hive_use_param_stack' to 1.

=cut

sub hive_meta_table {
    my ($self) = @_;
    return {
        %{$self->SUPER::hive_meta_table},   # here we inherit anything from the base class

        'hive_use_param_stack' => 1,        # switch on the param_stack mechanism
    };
}

=head2 pipeline_analyses

    Description : Implements the pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf, which defines the structure of the pipeline: analyses, jobs, rules, etc.
                  Here it defines these analyses:

    * split_strategy -- This analysis uses the runnable Bio::EnsEMBL::Hive::RunnableDB::Dummy. It performs no work itself;
                        rather, it exists to trigger dataflow. The interesting part of this analysis is the WHEN-ELSE flow control
                        in the flow_into section of its definition. Here, the subsequent analysis is chosen based
                        on the value of the "seqtype" parameter.
    * split_sequence -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence.
                        It splits the sequences in an input file with overlap, and stores the subsequences in a collection of
                        output files. In this pipeline, flow goes from split_strategy into split_sequence when the "seqtype"
                        parameter is not "short".
    * chunk_sequence -- This analysis uses the runnable Bio::EnsEMBL::Hive::RunnableDB::FastaFactory. It splits a file
                        containing many sequences into a collection of sub-files, each containing a few of the sequences from
                        the original input file. Individual sequences are kept intact (unlike SplitSequence). In this pipeline,
                        flow goes from split_strategy into chunk_sequence when the "seqtype" parameter is "short".
    * count_kmers    -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CountKmers, which
                        identifies and tallies k-mers in the sequences in an input file. This pipeline is designed to create
                        several count_kmers jobs in parallel, the fan of jobs being created by either split_sequence or chunk_sequence.
    * compile_counts -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsAoH.
                        In this pipeline, a compile_counts job is created but it is initially blocked from running
                        by a semaphore. When all count_kmers jobs have finished, the semaphore is cleared, allowing a worker
                        to claim the compile_counts job and run it. This job compiles all the k-mer counts from
                        the previous count_kmers jobs into overall counts for each k-mer.
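
    To make the compile step concrete, here is a minimal sketch of summing an array of per-job count hashes into
    one tally. It is plain Perl for illustration and does not reproduce CompileCountsAoH: the function name
    sum_kmer_counts and the sample data are hypothetical, and fetching the accumulator through the eHive API is left out.

```perl
use strict;
use warnings;

# Merge a list of per-job k-mer count hashes into one overall tally.
# The order of the list does not matter, which is why an unordered
# "pile" accumulator is sufficient. Hypothetical helper for illustration.
sub sum_kmer_counts {
    my ($all_counts) = @_;    # arrayref of hashrefs: k-mer => count

    my %total;
    for my $job_counts (@$all_counts) {
        $total{$_} += $job_counts->{$_} for keys %$job_counts;
    }
    return \%total;
}

# Hypothetical accumulator content after three count_kmers jobs:
my $all_counts = [
    { ACG => 2, CGT => 1 },
    { CGT => 3, GTA => 1 },
    { ACG => 1 },
];

my $total = sum_kmer_counts($all_counts);
print "$_\t$total->{$_}\n" for sort keys %$total;
# prints:
# ACG   3
# CGT   4
# GTA   1
```

    Each key of the resulting hash corresponds to one row that the compile step would write to the final_result table.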

=cut

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name  => 'split_strategy',
            -module      => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
            -meadow_type => 'LOCAL',    # do not bother the farm with such a simple task (and get it done faster)
            -input_ids   => [
                {   'seqtype'       => $self->o('seqtype'),
                    'input_format'  => $self->o('input_format'),
                    'inputfile'     => $self->o('inputfile'),
                    'chunk_size'    => $self->o('chunk_size'),
                    'output_dir'    => $self->o('output_dir'),
                    'output_prefix' => $self->o('output_prefix'),
                    'output_suffix' => $self->o('output_suffix'),
                    'k'             => $self->o('k'),   # no default in default_options(): must be supplied to init_pipeline.pl
                },
            ],
            -flow_into => {
                # use conditional dataflow to determine the next analysis, based on the value of the "seqtype" parameter
                '1->A' => WHEN('#seqtype# eq "short"' => [ 'chunk_sequence' ],
                          ELSE                           [ 'split_sequence' ]),
                # creating a semaphored funnel job to wait for the fan to complete and add the results:
                'A->1' => [ 'compile_counts' ],
            },

        },

        {   -logic_name => 'split_sequence',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence',
            # here, a template is used to perform a calculation on a parameter
            -parameters => { "overlap_size" => "#expr(#k#-1)expr#" },
            -analysis_capacity => 2,    # use per-analysis limiter
            -flow_into  => {
                '2' => [ 'count_kmers' ],
            },
        },

        {   -logic_name => 'chunk_sequence',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::FastaFactory',
            -parameters => { "max_chunk_length" => "#chunk_size#" },
            -flow_into  => {
                '2' => [ 'count_kmers' ],
            },
        },

        {   -logic_name => 'count_kmers',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CountKmers',
            # Here, we use a template to rename a parameter.
            -parameters => {
                "sequence_file" => '#chunk_name#',
            },
            -analysis_capacity => 4,    # use per-analysis limiter
            -flow_into  => {
                # Flows into a "pile of hashes" accumulator called 'all_counts'. This is analogous to an array of hashes in Perl:
                # each job creates a hash table where the key is a k-mer sequence and the value is
                # the count of that k-mer. In a pile, unlike an array, the ordering of elements is not guaranteed.
                # Breaking down the URL into its individual pieces:
                #   ?accu_name=all_counts       : This is the name of the accu; other parts of the pipeline use this name to
                #                               : access it later.
                #   &accu_address=[]            : The [] means "store this in a pile" (an array where the order of elements
                #                               : is not important). Having no label inside the brackets (i.e. [], not [i])
                #                               : is what makes this a pile. Components of the pipeline accessing the accu
                #                               : "all_counts" later will be able to enumerate the elements, but there is
                #                               : no guarantee about the order of those elements.
                #   &accu_input_variable=counts : This is the name of the variable in the dataflow output that holds the
                #                               : value that hive will store in the accu. Here, the accu_input_variable
                #                               : "counts" matches the "counts" value flown on branch 3 in
                #                               : CountKmers::write_output
                # The "hash" portion of this "array of hashes" is controlled by the runnable. In this case,
                # CountKmers packs an entire hash into a hashref, which is flown into the accu.
                3 => [ '?accu_name=all_counts&accu_address=[]&accu_input_variable=counts' ],
            },
        },

        {   -logic_name => 'compile_counts',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsAoH',
            -flow_into  => {
                # Flows the output into a table in the hive database called 'final_result'.
                # We created this table earlier in this conf file during pipeline_create_commands().
                # It has three columns: 'filename', 'kmer' and 'count'. Each field
                # is filled by matching the column name to a param name, and filling in the value
                # from that param. In the CompileCountsAoH runnable, there is a loop in write_output
                # that creates a dataflow event for each k-mer seen. Each iteration of this loop
                # (i.e. each dataflow event generated in that loop) fills in one row of the table.
                4 => [ '?table_name=final_result' ],
            },
        },
    ];
}

1;