=head1 SYNOPSIS

    # initialize the database and build the graph in it (it will also print the value of EHIVE_URL) :

    # optionally also seed it with your specific values:
    seed_pipeline.pl -url $EHIVE_URL -logic_name split_sequence -input_id '{ "sequence_file" => "my_sequence.fa", "chunk_size" => 1000, "overlap_size" => 12 }'

    # run the pipeline:
    beekeeper.pl -url $EHIVE_URL -loop
=head1 DESCRIPTION

    This is the PipeConfig file for the Kmer counting pipeline example.
    This pipeline illustrates how to write PipeConfigs and Runnables that use the following eHive features:
        * Factories creating a fan of jobs
        * Array of hash Accumulators
        * Conditional pipeline flow
        * Controlling parameter flow using INPUT_PLUS

    Determining the frequency of k-mers (runs of nucleotides k bases long) is an important part of sequence analysis.
    This pipeline takes a flat file containing one or more sequences, counts the k-mers in them, then records
    the count of each k-mer in a table in the hive database.

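As a minimal sketch of what "counting k-mers" means, the plain-Perl snippet below slides a window of width k along one sequence and tallies each substring. The helper name count_kmers and the sample sequence are assumptions made for illustration; this is not the code of the CountKmers runnable.

```perl
use strict;
use warnings;

# Count every k-mer (substring of length $k) in a single sequence.
# Hypothetical helper for illustration only; not part of eHive.
sub count_kmers {
    my ($sequence, $k) = @_;
    my %counts;
    # Slide a window of width $k along the sequence, one base at a time.
    # If the sequence is shorter than $k the range is empty and nothing is counted.
    for my $i (0 .. length($sequence) - $k) {
        $counts{ substr($sequence, $i, $k) }++;
    }
    return \%counts;
}

my $counts = count_kmers('ACGTACGT', 3);   # made-up example sequence
print "$_\t$counts->{$_}\n" for sort keys %$counts;
```

Each printed line is a k-mer followed by the number of times it occurs, which is the same kind of record the pipeline eventually stores per k-mer.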
    The pipeline can be run in two modes: short-sequence mode and long-sequence mode. These modes reflect two k-mer
    counting strategies, described below.

    Short-sequence mode is useful for counting k-mers when the input contains many short (< a few kb) sequences. In this
    mode, the input file is chunked into several smaller files, each of which contains a subset of the sequences from
    the original input. The k-mers in these sequences are counted up in parallel. Then, the pipeline sums up all the
    k-mer counts from those individual sub-counts.

    Long-sequence mode is useful for counting k-mers when the input contains a few very long (> hundreds of kb) sequences.
    In this mode, the sequence or sequences in the input file are split into shorter subsequences, with overlapping ends.
    The k-mers in these subsequences are counted up in parallel. Then, the pipeline sums up all the k-mer counts from
    those individual sub-counts.

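The reason long-sequence mode needs overlapping ends can be sketched in plain Perl: when neighbouring pieces share k-1 bases, every k-mer of the original sequence falls entirely inside exactly one piece, so the summed sub-counts equal the counts of the unsplit sequence. The helper names (count_kmers, split_with_overlap), the sample sequence and the chunk size below are assumptions for illustration; the SplitSequence runnable may differ in its details.

```perl
use strict;
use warnings;

# Count every k-mer in one sequence (hypothetical helper, not an eHive API).
sub count_kmers {
    my ($sequence, $k) = @_;
    my %counts;
    $counts{ substr($sequence, $_, $k) }++ for 0 .. length($sequence) - $k;
    return \%counts;
}

# Split a sequence into pieces of at most $chunk_size bases, where each piece
# starts $overlap bases before the end of the previous one.
sub split_with_overlap {
    my ($sequence, $chunk_size, $overlap) = @_;
    die "chunk_size must be larger than overlap\n" unless $chunk_size > $overlap;
    my $step = $chunk_size - $overlap;
    my @chunks;
    for (my $start = 0; $start < length($sequence); $start += $step) {
        push @chunks, substr($sequence, $start, $chunk_size);
        last if $start + $chunk_size >= length($sequence);   # this piece reached the end
    }
    return @chunks;
}

my $k        = 3;
my $sequence = 'ACGTACGTAC';                                  # made-up example sequence
my @chunks   = split_with_overlap($sequence, 5, $k - 1);      # overlap of k-1 bases

# Count each piece independently (the parallel step), then add up the sub-counts.
my %summed;
for my $chunk (@chunks) {
    my $sub_counts = count_kmers($chunk, $k);
    $summed{$_} += $sub_counts->{$_} for keys %$sub_counts;
}

# Compare with counting the unsplit sequence directly.
my $direct    = count_kmers($sequence, $k);
my $as_string = sub { my ($c) = @_; return join(',', map { $_ . '=' . $c->{$_} } sort keys %$c) };
print $as_string->(\%summed) eq $as_string->($direct) ? "counts match\n" : "counts differ\n";
```

With an overlap smaller than k-1, k-mers that straddle a boundary would be lost; with a larger overlap, some would be counted twice.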
    Short- or long-sequence mode is selected by setting the "seqtype" parameter. This parameter determines
    which analyses are included in the pipeline via eHive's conditional dataflow mechanism.

    seqtype          => Can be 'short' or 'long', which determines whether the pipeline runs in short-sequence mode
                        or long-sequence mode (see descriptions above). The value determines which runnable the pipeline
                        will use to split the sequence
    input_format     => Format of the input sequence file (e.g. FASTA, FASTQ). Must be supported by Bio::SeqIO
    inputfile        => Name of input file
    chunk_size       => Size of sub-sequences or sub-files (in bases) - see the documentation in the FastaFactory
                        and SplitSequence runnables for details
    max_chunk_length => Maximum length of sequence in a sub-file - see the documentation in the FastaFactory
                        and SplitSequence runnables for details
    output_prefix    => Filename prefix for the intermediate split files generated by this pipeline
    output_suffix    => Filename suffix for the intermediate split files generated by this pipeline

=head1 LICENSE

    See the NOTICE file distributed with this work for additional information
    regarding copyright ownership.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

    Unless required by applicable law or agreed to in writing, software distributed under the License
    is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and limitations under the License.

=head1 CONTACT

    Please subscribe to the Hive mailing list: http:

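=cut

use strict;
use warnings;

use Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf;           # imports WHEN, ELSE and INPUT_PLUS, which pipeline_analyses() uses
use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');  # All Hive databases configuration files should inherit from HiveGeneric, directly or indirectly
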
=head2 default_options

    Description : Implements default_options() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
                  that sets default parameter values. These values can be overridden when running the init_pipeline.pl script.
                  Here, we set defaults for:

                  seqtype       => Can be 'short' or 'long', which determines whether the pipeline runs in short-sequence mode
                                   or long-sequence mode (see descriptions above). The value determines which runnable the pipeline
                                   will use to split the sequence
                  input_format  => Format of the input sequence file (e.g. FASTA, FASTQ). Must be supported by Bio::SeqIO
                  inputfile     => Name of input file
                  chunk_size    => Size of sub-sequences or sub-files (in bases) - see the documentation in the FastaFactory
                                   and SplitSequence runnables for details
                  output_prefix => Filename prefix for the intermediate split files generated by this pipeline
                  output_suffix => Filename suffix for the intermediate split files generated by this pipeline

=cut

sub default_options {
    my ($self) = @_;
    return {
        %{ $self->SUPER::default_options() },   # inherit other stuff from the base class

        'seqtype'       => 'short',
        'input_format'  => 'FASTA',
        # init_pipeline makes a best guess of the hive root directory and stores
        # it in EHIVE_ROOT_DIR, if it is not already set in the shell
        'inputfile'     => $ENV{'EHIVE_ROOT_DIR'} . '/t/input_fasta.fa',
        'output_prefix' => 'k_split_',
        'output_suffix' => '.fa',
    };
}

=head2 pipeline_create_commands

    Description : Implements pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that lists the commands that will create and set up the Hive database.
                  In addition to the standard creation of the database and populating it with Hive tables and procedures it also creates a table to hold this pipeline's final result.

=cut

sub pipeline_create_commands {
    my ($self) = @_;
    return [
        @{$self->SUPER::pipeline_create_commands},  # inheriting database and hive tables' creation

        # additional table to store results:
        $self->db_cmd('CREATE TABLE final_result (filename VARCHAR(255) NOT NULL, kmer VARCHAR(255) NOT NULL, count INT NOT NULL, PRIMARY KEY (filename, kmer))'),
    ];
}

=head2 pipeline_wide_parameters

    Description : Interface method that should return a hash of pipeline_wide_parameter_name->pipeline_wide_parameter_value pairs.
                  The value doesn't have to be a scalar; it can be any Perl structure (it will be stringified and de-stringified automatically).
                  Please see existing PipeConfig modules for examples.

=cut

sub pipeline_wide_parameters {
    my ($self) = @_;
    return {
        %{$self->SUPER::pipeline_wide_parameters},  # here we inherit anything from the base class
    };
}

=head2 hive_meta_table

    Description : Interface method that should return a hash of meta-information about the pipeline (e.g. pipeline name or schema version).
                  Here, there is nothing to declare; in particular, the parameter stack is not needed since we are using INPUT_PLUS.

=cut

sub hive_meta_table {
    my ($self) = @_;
    return {
        %{$self->SUPER::hive_meta_table},   # here we inherit anything from the base class
    };
}

=head2 pipeline_analyses

    Description : Implements pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that defines the structure of the pipeline: analyses, jobs, rules, etc.
                  Here it defines these analyses:

                  * split_strategy -- This analysis uses the runnable Bio::EnsEMBL::Hive::RunnableDB::Dummy. It performs no work in itself;
                    rather it exists to trigger dataflow. The interesting part of this pipeline is the WHEN-ELSE flow control
                    in the flow_into section of the analysis definition. Here, subsequent analyses are determined based
                    on the value in the "seqtype" parameter.
                  * split_sequence -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence.
                    It splits sequences in an input file with overlap, and stores the subsequences in a collection of
                    output files. In this pipeline, flow goes from split_strategy into split_sequence when the "seqtype"
                    parameter is not "short".
                  * chunk_sequence -- This analysis uses the runnable Bio::EnsEMBL::Hive::RunnableDB::FastaFactory. It splits a file
                    containing many sequences into a collection of sub-files, each containing a few of the sequences from
                    the original input file. Individual sequences are kept intact (unlike SplitSequence). In this pipeline,
                    flow goes from split_strategy into chunk_sequence when the "seqtype" parameter is "short".
                  * count_kmers -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CountKmers, which
                    identifies and tallies k-mers in the sequences in an input file. This pipeline is designed to create
                    several count_kmers jobs in parallel, the fan of jobs being created by either split_sequence or chunk_sequence.
                  * compile_counts -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsAoH.
                    In this pipeline, a compile_counts job is created but it is initially blocked from running
                    by a semaphore. When all count_kmers jobs have finished, the semaphore is cleared, allowing a worker
                    to claim the compile_counts job and run it. This job compiles all the k-mer counts from
                    the previous count_kmers jobs into overall counts for each k-mer.

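The compile_counts step described above can be sketched in plain Perl, outside of eHive. The snippet assumes the accumulated counts arrive as an unordered list of hashes (one per count_kmers job) and turns them into one row per k-mer, shaped like the columns of the final_result table. The sample counts and the file name are made up for illustration, and this is not the code of the CompileCountsAoH runnable.

```perl
use strict;
use warnings;

# Stand-in for the "all_counts" accumulator: one hash of k-mer counts per
# finished count_kmers job, in no guaranteed order. The values are made up.
my @all_counts = (
    { 'ACG' => 2, 'CGT' => 1 },
    { 'CGT' => 3, 'GTA' => 1 },
    { 'ACG' => 1 },
);

# Sum the per-job counts into one overall count per k-mer.
my %total;
for my $job_counts (@all_counts) {
    $total{$_} += $job_counts->{$_} for keys %$job_counts;
}

# Build one row per k-mer, with keys named after the final_result columns.
my $filename = 'my_sequence.fa';   # hypothetical input file name
my @rows = map { +{ filename => $filename, kmer => $_, count => $total{$_} } } sort keys %total;

printf "%s\t%s\t%d\n", @{$_}{qw(filename kmer count)} for @rows;
```

Because addition is commutative, the result does not depend on the order in which the per-job hashes are visited, which is why an unordered pile is sufficient for this accumulator.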
=cut

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name  => 'split_strategy',
            -module      => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
            -meadow_type => 'LOCAL',     # do not bother the farm with such a simple task (and get it done faster)
            -input_ids   => [
                {   'seqtype'       => $self->o('seqtype'),
                    'input_format'  => $self->o('input_format'),
                    'inputfile'     => $self->o('inputfile'),
                    'chunk_size'    => $self->o('chunk_size'),
                    'output_dir'    => $self->o('output_dir'),
                    'output_prefix' => $self->o('output_prefix'),
                    'output_suffix' => $self->o('output_suffix'),
                    'k'             => $self->o('k'),
                },
            ],
            -flow_into => {
                # use conditional dataflow to determine the next analysis, based on the value of the "seqtype" parameter
                '1->A' => WHEN('#seqtype# eq "short"' => [ 'chunk_sequence' ],
                          ELSE [ 'split_sequence' ]),
                # creating a semaphored funnel job to wait for the fan to complete and add the results:
                'A->1' => [ 'compile_counts' ],
            },
        },

        {   -logic_name => 'split_sequence',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence',
            # here, a template is used to perform a calculation on a parameter
            -parameters => { "overlap_size" => "#expr(#k#-1)expr#" },
            -analysis_capacity => 2,    # use per-analysis limiter
            -flow_into => {
                '2' => { 'count_kmers' => INPUT_PLUS() },
            },
        },

        {   -logic_name => 'chunk_sequence',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::FastaFactory',
            -parameters => { "max_chunk_length" => "#chunk_size#" },
            -flow_into => {
                '2' => { 'count_kmers' => INPUT_PLUS() },
            },
        },

        {   -logic_name => 'count_kmers',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CountKmers',
            # Here, we use a template to rename a parameter.
            -parameters => {
                "sequence_file" => '#chunk_name#',
            },
            -analysis_capacity => 4,    # use per-analysis limiter
            -flow_into => {
                # Flows into a "pile of hashes" accumulator called 'all_counts'. This is analogous to an array of hashes in Perl:
                # Each job is going to create a hash table where the key is a kmer sequence, and the value is
                # the count of that kmer. In a pile, unlike an array, the ordering of elements is not guaranteed.
                # Breaking down the URL into its individual pieces:
                #  ?accu_name=all_counts       : This is the name of the accu; other parts of the pipeline use this name to
                #                              : refer to it.
                #  &accu_address=[]            : The [] indicates store this in a pile (an array where the order of elements
                #                              : is not important). Having no label inside the brackets (e.g. [], not [i])
                #                              : is what makes this a pile. Components of the pipeline accessing the accu
                #                              : "all_counts" later will be able to enumerate over the elements, but there's
                #                              : no guarantee for the order of those elements.
                #  &accu_input_variable=counts : This is the name of the variable in the dataflow output that holds the
                #                              : value that hive will store in the accu. Here, the accu_input_variable
                #                              : "counts" matches the "counts" value that flows on branch 3 in
                #                              : CountKmers::write_output
                # The "hash" portion of this "array of hashes" is controlled by the runnable. In this case,
                # CountKmers packs an entire hash into a hashref, which is flown into the accu.
                3 => [ '?accu_name=all_counts&accu_address=[]&accu_input_variable=counts' ],
            },
        },

        {   -logic_name => 'compile_counts',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsAoH',
            -flow_into => {
                # Flows the output into a table in the hive database called 'final_result'.
                # We created this table earlier in this conf file during pipeline_create_commands().
                # It has three columns, 'filename', 'kmer' and 'count'. Each field
                # is filled by matching the column name to a param name, and filling in with the value
                # from that param. In the CompileCountsAoH runnable, there is a loop in write_output
                # that creates a dataflow event for each kmer seen. Each iteration of this loop
                # (i.e. each dataflow event generated in that loop) fills in one row of the table.
                4 => [ '?table_name=final_result' ],
            },
        },
    ];
}