ensembl-hive  2.7.0
KmerPipelineHoH_conf.pm
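=pod

=head1 NAME

    Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineHoH_conf

=head1 SYNOPSIS

    # initialize the database and build the graph in it (it will also print the value of EHIVE_URL):
    init_pipeline.pl Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineHoH_conf
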
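    # optionally also seed it with your specific values (the keys mirror the input_ids of the split_strategy analysis):
    seed_pipeline.pl -url $EHIVE_URL -logic_name split_strategy -input_id '{ "seqtype" => "short", "input_format" => "FASTA", "inputfile" => "my_sequence.fa", "chunk_size" => 1000, "output_dir" => ".", "output_prefix" => "k_split_", "output_suffix" => ".fa", "k" => 5 }'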

    # run the pipeline:
    beekeeper.pl -url $EHIVE_URL -loop

=head1 DESCRIPTION

    This is the PipeConfig file for the Kmer counting pipeline example.
    This pipeline illustrates how to write PipeConfigs and Runnables that use the following eHive features:
      * Factories creating a fan of jobs
      * Hash accumulator
      * Semaphores
      * Conditional pipeline flow

    Please refer to the Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf module to understand the interface implemented here.

    Determining the frequency of k-mers (runs of nucleotides k bases long) is an important part of sequence analysis.
    This pipeline takes a flat file containing one or more sequences, counts the k-mers in them, then records
    the count of each k-mer in a table in the hive database.

    The pipeline can be run in two modes: short-sequence mode and long-sequence mode. These modes reflect two k-mer
    analysis use cases.

    Short-sequence mode is useful for counting k-mers when the input contains many short sequences (shorter than a few kb).
    In this mode, the input file is chunked into several smaller files, each of which contains a subset of the sequences
    from the original input. The k-mers in these sequences are counted in parallel. Then, the pipeline compiles the
    k-mer counts from those individual sub-counts.

    Long-sequence mode is useful for counting k-mers when the input contains a few very long sequences (longer than
    hundreds of kb). In this mode, the sequence or sequences in the input file are split into shorter subsequences with
    overlapping ends. The k-mers in these subsequences are counted in parallel. Then, the pipeline compiles the k-mer
    counts from those individual sub-counts.

    The mode is selected by setting the "seqtype" parameter to 'short' or 'long'. This parameter determines
    which analyses are included in the pipeline via eHive's conditional dataflow mechanism.
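
    The overlap used in long-sequence mode is k-1 bases (see the overlap_size parameter of the split_sequence
    analysis in pipeline_analyses). The following stand-alone sketch is not part of the pipeline and does not
    claim to reproduce how the SplitSequence runnable chooses its chunk boundaries: count_kmers() and
    split_with_overlap() are hypothetical helpers written only for this illustration, and chunk_size is
    assumed to be at least k. It suggests why an overlap of k-1 lets per-chunk counts add up to the counts
    of the whole sequence:

        use strict;
        use warnings;

        # Count every k-mer in a string (hypothetical helper, not the CountKmers runnable).
        sub count_kmers {
            my ($seq, $k) = @_;
            my %counts;
            $counts{ substr($seq, $_, $k) }++ for 0 .. length($seq) - $k;
            return \%counts;
        }

        # Cut a sequence into chunks whose start positions are chunk_size - (k-1) apart,
        # so that neighbouring chunks share exactly k-1 bases.
        sub split_with_overlap {
            my ($seq, $chunk_size, $k) = @_;
            my $step = $chunk_size - ($k - 1);
            my @chunks;
            for (my $start = 0; $start + $k <= length($seq); $start += $step) {
                push @chunks, substr($seq, $start, $chunk_size);
            }
            return @chunks;
        }

        my ($seq, $k, $chunk_size) = ('ACGTACGTAC', 3, 5);

        # Sum the per-chunk counts, as a funnel job would do after the fan has finished.
        my %summed;
        for my $chunk (split_with_overlap($seq, $chunk_size, $k)) {
            my $chunk_counts = count_kmers($chunk, $k);
            $summed{$_} += $chunk_counts->{$_} for keys %$chunk_counts;
        }

        # Compare with counting the unsplit sequence directly.
        my $whole = count_kmers($seq, $k);
        print "$_\t$summed{$_}\t$whole->{$_}\n" for sort keys %$whole;

    With an overlap of k-1, every k-mer start position falls in exactly one chunk, so no k-mer that spans a
    chunk boundary is lost and none is counted twice.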

    Parameters:
      seqtype          => Either 'short' or 'long'. Selects short-sequence or long-sequence mode (see descriptions
                          above) and thereby the runnable the pipeline uses to split the sequence.
      input_format     => Format of the input sequence file (e.g. FASTA, FASTQ). Must be supported by Bio::SeqIO.
      inputfile        => Name of the input file.
      chunk_size       => Size of sub-sequences or sub-files (in bases) - see the documentation in the FastaFactory
                          and SplitSequence runnables for details.
      max_chunk_length => Maximum length of sequence in a sub-file - see the documentation in the FastaFactory
                          and SplitSequence runnables for details. In this pipeline it is set from chunk_size
                          in the chunk_sequence analysis.
      output_dir       => Passed to the seed job together with output_prefix and output_suffix; presumably the
                          directory in which the intermediate split files are written.
      output_prefix    => Filename prefix for the intermediate split files generated by this pipeline.
      output_suffix    => Filename suffix for the intermediate split files generated by this pipeline.
      k                => Length of the k-mers to count. No default is set in default_options() in this file,
                          so it presumably has to be supplied when running init_pipeline.pl.

=head1 LICENSE

    See the NOTICE file distributed with this work for additional information
    regarding copyright ownership.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

         http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License
    is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and limitations under the License.

=head1 CONTACT

    Please subscribe to the Hive mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/ehive-users to discuss Hive-related questions or to be notified of our updates.

=cut

package Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineHoH_conf;

use strict;
use warnings;

use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');  # All Hive databases configuration files should inherit from HiveGeneric, directly or indirectly
use Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf;           # Allow this particular config to use conditional dataflow (imports WHEN and ELSE)

=head2 default_options

    Description : Implements the default_options() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
                  that sets default parameter values. These values can be overridden when running the init_pipeline.pl script.
                  Here, we set defaults for:

                  seqtype       => Either 'short' or 'long'. Selects short-sequence or long-sequence mode (see
                                   descriptions above) and thereby the runnable the pipeline uses to split the sequence.
                  input_format  => Format of the input sequence file (e.g. FASTA, FASTQ). Must be supported by Bio::SeqIO.
                  inputfile     => Name of the input file.
                  chunk_size    => Size of sub-sequences or sub-files (in bases) - see the documentation in the FastaFactory
                                   and SplitSequence runnables for details.
                  output_dir    => Passed to the seed job together with output_prefix and output_suffix.
                  output_prefix => Filename prefix for the intermediate split files generated by this pipeline.
                  output_suffix => Filename suffix for the intermediate split files generated by this pipeline.

                  No default is set here for 'k', although pipeline_analyses() reads it with $self->o('k').

=cut

sub default_options {
    my ($self) = @_;

    return {
        %{ $self->SUPER::default_options() },   # inherit other stuff from the base class
        'seqtype'       => 'short',
        'input_format'  => 'FASTA',
        # init_pipeline makes a best guess of the hive root directory and stores
        # it in EHIVE_ROOT_DIR, if it is not already set in the shell
        'inputfile'     => $ENV{'EHIVE_ROOT_DIR'} . '/t/input_fasta.fa',
        'chunk_size'    => 40,
        'output_dir'    => '.',
        'output_prefix' => 'k_split_',
        'output_suffix' => '.fa',
    };
}
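
=pod

    Besides passing options to init_pipeline.pl, the defaults above can be changed by subclassing this
    PipeConfig. The following is a minimal sketch, assuming a hypothetical module name (LongKmer_conf)
    and illustrative values that do not come from this file:

        package LongKmer_conf;    # hypothetical name, for illustration only

        use strict;
        use warnings;

        use base ('Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineHoH_conf');

        sub default_options {
            my ($self) = @_;

            return {
                %{ $self->SUPER::default_options() },   # keep every default defined above
                'seqtype'    => 'long',                  # switch to long-sequence mode
                'chunk_size' => 1000,                    # illustrative value
                'k'          => 5,                       # illustrative value; no default for 'k' is set above
            };
        }

        1;

    Because the hash returned by SUPER::default_options() is expanded first, the keys listed after it
    take precedence over the inherited values.

=cut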

=head2 pipeline_create_commands

    Description : Implements the pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
                  that lists the commands that will create and set up the Hive database.
                  In addition to the standard creation of the database and populating it with Hive tables and procedures,
                  it also creates a table to hold this pipeline's final result.

=cut

sub pipeline_create_commands {
    my ($self) = @_;
    return [
        @{$self->SUPER::pipeline_create_commands},  # inheriting database and hive tables' creation

        # additional table to store results:
        $self->db_cmd('CREATE TABLE final_result (filename VARCHAR(255) NOT NULL, kmer VARCHAR(255) NOT NULL, count INT NOT NULL, PRIMARY KEY (filename, kmer))'),
    ];
}


=head2 pipeline_wide_parameters

    Description : Interface method that should return a hash of pipeline_wide_parameter_name->pipeline_wide_parameter_value pairs.
                  A value does not have to be a scalar; it can be any Perl structure (it will be stringified and de-stringified automatically).
                  Please see existing PipeConfig modules for examples.

=cut

sub pipeline_wide_parameters {
    my ($self) = @_;
    return {
        %{$self->SUPER::pipeline_wide_parameters},  # here we inherit anything from the base class
    };
}

=head2 hive_meta_table

    Description : Interface method that should return a hash of meta-information about the pipeline (e.g. pipeline name or schema version).
                  Here, we indicate that this pipeline should use the parameter stack by setting 'hive_use_param_stack' to 1.

=cut

sub hive_meta_table {
    my ($self) = @_;
    return {
        %{$self->SUPER::hive_meta_table},   # here we inherit anything from the base class

        'hive_use_param_stack' => 1,        # switch on the param_stack mechanism
    };
}

=head2 pipeline_analyses

    Description : Implements the pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
                  that defines the structure of the pipeline: analyses, jobs, rules, etc.
                  Here it defines these analyses:

        * split_strategy -- This analysis uses the runnable Bio::EnsEMBL::Hive::RunnableDB::Dummy. It performs no work in itself;
                            rather it exists to trigger dataflow. The interesting part of this pipeline is the WHEN-ELSE flow control
                            in the flow_into section of the analysis definition. Here, subsequent analyses are determined based
                            on the value in the "seqtype" parameter.
        * split_sequence -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence.
                            It splits sequences in an input file with overlap, and stores the subsequences in a collection of
                            output files. In this pipeline, flow goes from split_strategy into split_sequence when the "seqtype"
                            parameter is not "short".
        * chunk_sequence -- This analysis uses the runnable Bio::EnsEMBL::Hive::RunnableDB::FastaFactory. It splits a file
                            containing many sequences into a collection of sub-files, each containing a few of the sequences from
                            the original input file. Individual sequences are kept intact (unlike SplitSequence). In this pipeline,
                            flow goes from split_strategy into chunk_sequence when the "seqtype" parameter is "short".
        * count_kmers    -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CountKmers, which
                            identifies and tallies k-mers in the sequences in an input file. This pipeline is designed to create
                            several count_kmers jobs in parallel, the fan of jobs being created by either split_sequence or chunk_sequence.
        * compile_counts -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsHoH.
                            In this pipeline, a compile_counts job is created but it is initially blocked from running
                            by a semaphore. When all count_kmers jobs have finished, the semaphore is cleared, allowing a worker
                            to claim the compile_counts job and run it. This job compiles the k-mer counts from
                            the previous count_kmers jobs and flows them into the final_result table.
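
    As the comments in the count_kmers analysis explain, the funnel receives the "all_counts" accumulator as a
    two-dimensional hash keyed first by chunk filename and then by k-mer. The following stand-alone sketch uses
    invented file names and counts and is not the code of the CompileCountsHoH runnable; it only illustrates
    how such a structure could be walked to produce one (filename, kmer, count) row per entry, matching the
    columns of the final_result table:

        use strict;
        use warnings;

        # Invented example data with the shape { chunk filename => { kmer => count } }.
        my %all_counts = (
            'k_split_1.fa' => { 'ACG' => 2, 'CGT' => 1 },
            'k_split_2.fa' => { 'ACG' => 1, 'GTA' => 3 },
        );

        # Build one row per (filename, kmer) pair, in a stable order.
        my @rows;
        for my $filename (sort keys %all_counts) {
            for my $kmer (sort keys %{ $all_counts{$filename} }) {
                push @rows, {
                    'filename' => $filename,
                    'kmer'     => $kmer,
                    'count'    => $all_counts{$filename}{$kmer},
                };
            }
        }

        # Print the rows as tab-separated values.
        print join("\t", @{$_}{qw(filename kmer count)}), "\n" for @rows;

    In a Runnable, each such row would typically be emitted as one dataflow event rather than printed.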

=cut

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name  => 'split_strategy',
            -module      => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
            -meadow_type => 'LOCAL',     # do not bother the farm with such a simple task (and get it done faster)
            -input_ids   => [
                {   'seqtype'       => $self->o('seqtype'),
                    'input_format'  => $self->o('input_format'),
                    'inputfile'     => $self->o('inputfile'),
                    'chunk_size'    => $self->o('chunk_size'),
                    'output_dir'    => $self->o('output_dir'),
                    'output_prefix' => $self->o('output_prefix'),
                    'output_suffix' => $self->o('output_suffix'),
                    'k'             => $self->o('k'),
                },
            ],
            -flow_into => {
                # use conditional dataflow to determine the next analysis, based on the value of the "seqtype" parameter
                '1->A' => WHEN('#seqtype# eq "short"' => [ 'chunk_sequence' ],
                          ELSE [ 'split_sequence' ]),
                # creating a semaphored funnel job to wait for the fan to complete and add the results:
                'A->1' => [ 'compile_counts' ],
            },

        },

        {   -logic_name => 'split_sequence',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence',
            # here, a template is used to perform a calculation on a parameter
            -parameters => { "overlap_size" => "#expr(#k#-1)expr#" },
            -analysis_capacity => 2,    # use per-analysis limiter
            -flow_into => {
                '2' => ['count_kmers'],
            },
        },

        {   -logic_name => 'chunk_sequence',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::FastaFactory',
            -parameters => { "max_chunk_length" => "#chunk_size#" },
            -flow_into => {
                '2' => ['count_kmers'],
            },
        },

        {   -logic_name => 'count_kmers',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CountKmers',
            # Here, we use a template to rename a parameter.
            -parameters => {
                "sequence_file" => '#chunk_name#',
            },
            -analysis_capacity => 4,    # use per-analysis limiter
            -flow_into => {
                # Flows into a hash accumulator called all_counts. The hash key is the name of the chunk
                # file in which the kmers were counted: it is dataflown out in a parameter called
                # 'sequence_file', and we indicate it is to be the hash key in the 'accu_address={sequence_file}'
                # portion of the url below. The value for each key is dataflown out in a parameter
                # called 'counts'; the 'accu_input_variable=counts' portion of the url is where it's
                # set as the value.
                # The name of the Accumulator is 'all_counts', as designated by 'accu_name=all_counts' in the url.
                # It is not required to match the name of a param, but it is allowed.

                3 => [ '?accu_name=all_counts&accu_address={sequence_file}&accu_input_variable=counts' ],

                # It is important to notice that values can be structures too: here "counts" is in fact
                # a hash that associates each kmer seen in the file with its count. The funnel Runnable
                # CompileCountsHoH will treat the "all_counts" variable as a two-dimensional hash indexed
                # by the chunk filename first, and then the kmer. The Runnable would thus work exactly
                # the same way if the accumulator were directly defined as a two-dimensional hash.
                # In that case, the accumulator would have to be connected to branch #4 (which
                # triggers 1 event per kmer count) and use this syntax:
                #4 => [ '?accu_name=all_counts&accu_address={sequence_file}{kmer}&accu_input_variable=count' ],
                # The first-level hash key is still 'sequence_file', and the second level is now explicitly
                # 'kmer'. 'kmer' is a string with the kmer sequence: it is dataflown out in a parameter
                # with the same name. Together they form a two-dimensional hash accumulator defined in the
                # 'accu_address={sequence_file}{kmer}' portion of the url above. The value for each pair of
                # keys is dataflown out in a parameter called 'count', which represents the actual count for
                # this kmer in this chunk file; the 'accu_input_variable=count' portion of the url is where
                # it's set as the value. The name of the Accumulator is still 'all_counts', as designated by
                # 'accu_name=all_counts' in the url.
            },
        },

        {   -logic_name => 'compile_counts',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsHoH',
            -flow_into => {
                # Flows the output into a table in the hive database called 'final_result'.
                # We created this table earlier in this conf file during pipeline_create_commands().
                # It has three columns, 'filename', 'kmer' and 'count'. Each field
                # is filled by matching the column name to a param name, and filling in with the value
                # from that param. In the CompileCountsHoH runnable, there is a loop in write_output
                # that creates a dataflow event for each kmer seen. Each iteration of this loop
                # (i.e. each dataflow event generated in that loop) fills in one row of the table.
                4 => [ '?table_name=final_result' ],
            },
        },
    ];
}

1;