# initialize the database and build the graph in it (it will also print the value of EHIVE_URL) :
# optionally also seed it with your specific values:
seed_pipeline.pl -url $EHIVE_URL -logic_name split_sequence -input_id '{ "sequence_file" => "my_sequence.fa", "chunk_size" => 1000, "overlap_size" => 12 }'
beekeeper.pl -url $EHIVE_URL -loop
This is the PipeConfig file for the Kmer counting pipeline example.
This pipeline illustrates how to write PipeConfigs and Runnables that utilize the eHive features:
* Factories creating a fan of jobs
* Hash of array accumulator
* Conditional pipeline flow
Determining the frequency of k-mers (runs of nucleotides k bases long) is an important part of sequence analysis.
This pipeline takes a flat file containing one or more sequences, counts the k-mers in them, then records
the count of each k-mer in a table in the hive database.

The pipeline can be run in two modes: short-sequence mode and long-sequence mode. These modes reflect two k-mer

Short-sequence mode is useful for counting k-mers when the input contains many short (< a few kb) sequences. In this
mode, the input file is chunked into several smaller files, each of which contains a subset of the sequences from
the original input. The k-mers in these sequences are counted up in parallel. Then, the pipeline sums up all the
k-mer counts from those individual sub-counts.
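
As a rough illustration of short-sequence mode (not the FastaFactory or CountKmers code itself), the sketch below groups whole sequences into chunks, counts the k-mers in each chunk, and sums the sub-counts. The helper names count_kmers, chunk_sequences and merge_counts, the grouping rule, and the toy sequences are all assumptions made for this example.

```perl
use strict;
use warnings;

# Tally every k-mer (substring of length $k) across a list of sequences.
sub count_kmers {
    my ($k, @seqs) = @_;
    my %counts;
    for my $seq (@seqs) {
        $counts{ substr($seq, $_, $k) }++ for 0 .. length($seq) - $k;
    }
    return \%counts;
}

# Group whole sequences into chunks; a new chunk starts once adding the next
# sequence would push the chunk past $max_len bases (assumed rule, for illustration).
sub chunk_sequences {
    my ($max_len, @seqs) = @_;
    my @chunks = ([]);
    my $len = 0;
    for my $seq (@seqs) {
        if (@{ $chunks[-1] } && $len + length($seq) > $max_len) {
            push @chunks, [];
            $len = 0;
        }
        push @{ $chunks[-1] }, $seq;
        $len += length($seq);
    }
    return @chunks;
}

# Sum several per-chunk count hashes into one overall hash.
sub merge_counts {
    my %total;
    for my $sub (@_) {
        $total{$_} += $sub->{$_} for keys %$sub;
    }
    return \%total;
}

my $k      = 3;
my @seqs   = qw(ACGTAC GTACGT TTACG ACGAC);    # made-up short sequences
my @chunks = chunk_sequences(12, @seqs);
my $total  = merge_counts(map { count_kmers($k, @$_) } @chunks);
print "$_\t$total->{$_}\n" for sort keys %$total;
```

Because sequences are never cut, no k-mer can span two chunks, so the merged totals equal what a single pass over all sequences would give.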

Long-sequence mode is useful for counting k-mers when the input contains a few very long (> hundreds of kb) sequences.
In this mode, the sequence or sequences in the input file are split into shorter subsequences, with overlapping ends.
The k-mers in these subsequences are counted up in parallel. Then, the pipeline sums up all the k-mer counts from
those individual subcounts.
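
The overlap is what keeps the summed counts correct: this PipeConfig sets the split_sequence overlap to k-1, so a k-mer that straddles a cut point still appears whole in exactly one piece. The sketch below checks that idea on a toy sequence. The splitting rule in split_with_overlap, the helper names, and the sequence are assumptions for illustration; the SplitSequence runnable may place its boundaries differently.

```perl
use strict;
use warnings;

# Tally every k-mer (substring of length $k) in one sequence.
sub count_kmers {
    my ($seq, $k) = @_;
    my %counts;
    $counts{ substr($seq, $_, $k) }++ for 0 .. length($seq) - $k;
    return \%counts;
}

# Cut a sequence into pieces that start every $chunk_size bases and extend
# $overlap bases into the following piece.
sub split_with_overlap {
    my ($seq, $chunk_size, $overlap) = @_;
    my @pieces;
    for (my $start = 0; $start < length($seq); $start += $chunk_size) {
        push @pieces, substr($seq, $start, $chunk_size + $overlap);
    }
    return @pieces;
}

my $k      = 4;
my $seq    = 'ACGTACGTTGCAACGTAGCTAGCTAACG';    # made-up 28-base sequence
my @pieces = split_with_overlap($seq, 10, $k - 1);

# Sum the per-piece counts and compare them with a single pass over the whole sequence.
my %summed;
for my $piece (@pieces) {
    my $sub = count_kmers($piece, $k);
    $summed{$_} += $sub->{$_} for keys %$sub;
}
my $whole = count_kmers($seq, $k);
my $same  = keys(%summed) == keys(%$whole)
         && !grep { $summed{$_} != $whole->{$_} } keys %$whole;
print $same ? "counts match\n" : "counts differ\n";
```

With an overlap smaller than k-1, k-mers crossing a boundary would be lost; with a larger one, some would be counted twice.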

Selection of short- or long-sequence mode is done by setting the "seqtype" parameter. This parameter determines
which analyses are included in the pipeline via eHive's conditional dataflow mechanism.

    seqtype          => Can be 'short' or 'long', which determines whether the pipeline runs in short-sequence mode
                        or long-sequence mode (see descriptions above). The value determines which runnable the pipeline
                        will use to split the sequence
    input_format     => Format of the input sequence file (e.g. FASTA, FASTQ). Must be supported by Bio::SeqIO
    inputfile        => Name of input file
    chunk_size       => Size of sub-sequences or sub-files (in bases) - see the documentation in the FastaFactory
                        and SplitSequence runnables for details
    max_chunk_length => Maximum length of sequence in a sub-file - see the documentation in the FastaFactory
                        and SplitSequence runnables for details
    output_prefix    => Filename prefix for the intermediate split files generated by this pipeline
    output_suffix    => Filename suffix for the intermediate split files generated by this pipeline

See the NOTICE file distributed with this work for additional information
regarding copyright ownership.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

Please subscribe to the Hive mailing list: http:

use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');  # All Hive databases configuration files should inherit from HiveGeneric, directly or indirectly

=head2 default_options

    Description : Implements default_options() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf
                  that sets default parameter values. These values can be overridden when running the init_pipeline.pl script.
                  Here, we set defaults for:

    seqtype       => Can be 'short' or 'long', which determines whether the pipeline runs in short-sequence mode
                     or long-sequence mode (see descriptions above). The value determines which runnable the pipeline
                     will use to split the sequence
    input_format  => Format of the input sequence file (e.g. FASTA, FASTQ). Must be supported by Bio::SeqIO
    inputfile     => Name of input file
    chunk_size    => Size of sub-sequences or sub-files (in bases) - see the documentation in the FastaFactory
                     and SplitSequence runnables for details
    output_prefix => Filename prefix for the intermediate split files generated by this pipeline
    output_suffix => Filename suffix for the intermediate split files generated by this pipeline

=cut

sub default_options {
    my ($self) = @_;
    return {
        %{ $self->SUPER::default_options() },   # inherit other stuff from the base class
        'seqtype'       => 'short',
        'input_format'  => 'FASTA',
        # init_pipeline makes a best guess of the hive root directory and stores
        # it in EHIVE_ROOT_DIR, if it is not already set in the shell
        'inputfile'     => $ENV{'EHIVE_ROOT_DIR'} . '/t/input_fasta.fa',
        'output_prefix' => 'k_split_',
        'output_suffix' => '.fa',
    };
}

=head2 pipeline_create_commands

    Description : Implements pipeline_create_commands() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that lists the commands that will create and set up the Hive database.
                  In addition to the standard creation of the database and populating it with Hive tables and procedures it also creates a table to hold this pipeline's final result.

=cut

sub pipeline_create_commands {
    my ($self) = @_;
    return [
        @{$self->SUPER::pipeline_create_commands},  # inheriting database and hive tables' creation
        # additional table to store results:
        $self->db_cmd('CREATE TABLE final_result (filename VARCHAR(255) NOT NULL, kmer VARCHAR(255) NOT NULL, count INT NOT NULL, PRIMARY KEY (filename, kmer))'),
    ];
}

=head2 pipeline_wide_parameters

    Description : Interface method that should return a hash of pipeline_wide_parameter_name->pipeline_wide_parameter_value pairs.
                  The value does not have to be a scalar; it can be any Perl structure (it will be stringified and de-stringified automatically).
                  Please see existing PipeConfig modules for examples.
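
To make the stringify/de-stringify round trip concrete, here is a minimal sketch using the core Data::Dumper module and eval as stand-ins. It does not reproduce eHive's own helpers; the function names stringify_value and destringify_value and the parameter value are hypothetical.

```perl
use strict;
use warnings;
use Data::Dumper;

# Turn any Perl structure into a one-line string (a stand-in for what happens on storage).
sub stringify_value {
    my ($value) = @_;
    local $Data::Dumper::Terse    = 1;   # no "$VAR1 =" prefix
    local $Data::Dumper::Indent   = 0;   # keep everything on a single line
    local $Data::Dumper::Sortkeys = 1;   # stable key order
    return Dumper($value);
}

# Turn the stored string back into a Perl structure.
sub destringify_value {
    my ($string) = @_;
    # The leading '+' makes Perl read '{...}' as a hash reference, not a code block.
    my $value = eval "+$string";         # only safe for trusted strings
    die "could not parse '$string': $@" if $@;
    return $value;
}

my $param  = { 'kmer_sizes' => [ 3, 5, 7 ], 'label' => 'demo' };   # made-up parameter value
my $stored = stringify_value($param);
my $back   = destringify_value($stored);
print "$stored\n";
print "third size: $back->{kmer_sizes}[2]\n";
```

The point is only that a nested structure survives being written as text and read back, which is why a pipeline-wide parameter can be an array or hash reference.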

=cut

sub pipeline_wide_parameters {
    my ($self) = @_;
    return {
        %{$self->SUPER::pipeline_wide_parameters},  # here we inherit anything from the base class
    };
}

=head2 hive_meta_table

    Description : Interface method that should return a hash of meta-information about the pipeline (e.g. pipeline name or schema version).
                  Here, we indicate that this pipeline should use the parameter stack by setting 'hive_use_param_stack' to 1.

=cut

sub hive_meta_table {
    my ($self) = @_;
    return {
        %{$self->SUPER::hive_meta_table},   # here we inherit anything from the base class
        'hive_use_param_stack' => 1,        # switch on the param_stack mechanism
    };
}

=head2 pipeline_analyses

    Description : Implements pipeline_analyses() interface method of Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf that defines the structure of the pipeline: analyses, jobs, rules, etc.
                  Here it defines these analyses:

    * split_strategy -- This analysis uses the runnable Bio::EnsEMBL::Hive::RunnableDB::Dummy. It performs no work in itself;
      rather it exists to trigger dataflow. The interesting part of this pipeline is the WHEN-ELSE flow control
      in the flow_into section of the analysis definition. Here, subsequent analyses are determined based
      on the value in the "seqtype" parameter.
    * split_sequence -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence.
      It splits sequences in an input file with overlap, and stores the subsequences in a collection of
      output files. In this pipeline, flow goes from split_strategy into split_sequence when the "seqtype"
      parameter is not "short".
    * chunk_sequence -- This analysis uses the runnable Bio::EnsEMBL::Hive::RunnableDB::FastaFactory. It splits a file
      containing many sequences into a collection of sub-files, each containing a few of the sequences from
      the original input file. Individual sequences are kept intact (unlike SplitSequence). In this pipeline,
      flow goes from split_strategy into chunk_sequence when the "seqtype" parameter is "short".
    * count_kmers -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CountKmers, which
      identifies and tallies k-mers in the sequences in an input file. This pipeline is designed to create
      several count_kmers jobs in parallel, the fan of jobs being created by either split_sequence or chunk_sequence.
    * compile_counts -- This analysis uses the runnable Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsHoA.
      In this pipeline, a compile_counts job is created but it is initially blocked from running
      by a semaphore. When all count_kmers jobs have finished, the semaphore is cleared, allowing a worker
      to claim the compile_counts job and run it. This job compiles all the k-mer counts from
      the previous count_kmers jobs into overall counts for each k-mer.
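
The fan, accumulator and funnel described above can be mimicked in plain Perl, without any eHive calls. In this sketch each element of @fan_outputs stands for the output of one count_kmers job, %all_counts plays the role of the hash-of-arrays accumulator, and the final loop does what the funnel job does once the semaphore clears. The data, the variable names and the filename value are made up for illustration.

```perl
use strict;
use warnings;

# Made-up output of three count_kmers jobs: each job reports a count per k-mer.
my @fan_outputs = (
    { ACG => 2, CGT => 1 },
    { ACG => 1, GTA => 3 },
    { CGT => 4 },
);

# Accumulation: every (kmer, count) pair is pushed onto the array stored under
# that k-mer, giving a hash of arrays like the 'all_counts' accumulator.
my %all_counts;
for my $job_output (@fan_outputs) {
    push @{ $all_counts{$_} }, $job_output->{$_} for keys %$job_output;
}

# Funnel step: once all fan jobs are done, sum each array and emit one row per
# k-mer, shaped like the columns of the final_result table.
my @final_rows;
for my $kmer (sort keys %all_counts) {
    my $sum = 0;
    $sum += $_ for @{ $all_counts{$kmer} };
    push @final_rows, { filename => 'my_sequence.fa', kmer => $kmer, count => $sum };
}

print join("\t", @{$_}{qw(filename kmer count)}), "\n" for @final_rows;
```

Keeping the individual sub-counts in an array until the funnel runs means the fan jobs never need to read or update a shared total.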

=cut

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name  => 'split_strategy',
            -module      => 'Bio::EnsEMBL::Hive::RunnableDB::Dummy',
            -meadow_type => 'LOCAL',     # do not bother the farm with such a simple task (and get it done faster)
            -input_ids   => [
                {   'seqtype'       => $self->o('seqtype'),
                    'input_format'  => $self->o('input_format'),
                    'inputfile'     => $self->o('inputfile'),
                    'chunk_size'    => $self->o('chunk_size'),
                    'output_dir'    => $self->o('output_dir'),
                    'output_prefix' => $self->o('output_prefix'),
                    'output_suffix' => $self->o('output_suffix'),
                    'k'             => $self->o('k'),
                },
            ],
            -flow_into => {
                # use conditional dataflow to determine the next analysis, based on the value of the "seqtype" parameter
                '1->A' => WHEN( '#seqtype# eq "short"' => [ 'chunk_sequence' ],
                                ELSE                      [ 'split_sequence' ] ),
                # creating a semaphored funnel job to wait for the fan to complete and add the results:
                'A->1' => [ 'compile_counts' ],
            },
        },

        {   -logic_name => 'split_sequence',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence',
            # here, a template is used to perform a calculation on a parameter
            -parameters => { "overlap_size" => "#expr(#k#-1)expr#" },
            -analysis_capacity => 2,    # use per-analysis limiter
            -flow_into => {
                '2' => [ 'count_kmers' ],
            },
        },

        {   -logic_name => 'chunk_sequence',
            -module     => 'Bio::EnsEMBL::Hive::RunnableDB::FastaFactory',
            -parameters => { "max_chunk_length" => "#chunk_size#" },
            -flow_into => {
                '2' => [ 'count_kmers' ],
            },
        },

        {   -logic_name => 'count_kmers',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CountKmers',
            # Here, we use a template to rename a parameter.
            -parameters => {
                "sequence_file" => '#chunk_name#',
            },
            -analysis_capacity => 4,    # use per-analysis limiter
            -flow_into => {
                # Flows into a hash accumulator called all_counts. The hash key is a string with the kmer
                # sequence: it is dataflown out in a parameter called 'kmer', and we indicate it is to
                # be the hash key in the 'accu_address={kmer}' portion of the url below. The value for
                # each key is dataflown out in a parameter called 'count'; the
                # 'accu_input_variable=count' portion of the url is where it's set as the value in the accu.
                # The name of the Accumulator is 'all_counts', as designated by 'accu_name=all_counts' in the url.
                # It is allowed to use the same name as the input variable, in which case accu_name could be skipped
                4 => [ '?accu_name=all_counts&accu_address={kmer}[]&accu_input_variable=count' ],
            },
        },

        {   -logic_name => 'compile_counts',
            -module     => 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsHoA',
            -flow_into => {
                # Flows the output into a table in the hive database called 'final_result'.
                # We created this table earlier in this conf file during pipeline_create_commands().
                # It has three columns, 'filename', 'kmer' and 'count'. Each field
                # is filled by matching the column name to a param name, and filling in with the value
                # from that param. In the CompileCountsHoA runnable, there is a loop in write_output
                # that creates a dataflow event for each kmer seen. Each iteration of this loop
                # (i.e. each dataflow event generated in that loop) fills in one row of the table.
                4 => [ '?table_name=final_result' ],
            },
        },
    ];
}