to understand how this particular example pipeline is configured and run.
Kmer::RunnableDB::CompileCounts is the last runnable in the kmer counting pipeline (using an array of hashes Accumulator).
This runnable fetches kmer counts that the previous jobs stored in the hash Accumulator, and combines them to determine
the overall kmer counts from the sequences in the original input file.
See the NOTICE file distributed with this work for additional information
regarding copyright ownership.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
Please subscribe to the Hive mailing list: http:
package Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsAoH;

use base ('Bio::EnsEMBL::Hive::Process');
Description : Implements param_defaults() interface method of Bio::EnsEMBL::Hive::Process that defines module defaults for parameters.
Description : Implements fetch_input() interface method of Bio::EnsEMBL::Hive::Process that is used to read in parameters and load data.
In this runnable, fetch_input is left empty. The runnable fetches its data from a hive Accumulator, so there are no extra database
connections to open, nor files to check. It is more sensible to fetch data from the Accumulator in run, where it is needed,
rather than to fetch it here and then pass it along in another parameter.
Description : Implements run() interface method of Bio::EnsEMBL::Hive::Process that is used to perform the main bulk of the job (minus input and output).

In this method, we fetch kmer counts produced by previous jobs and stored in an Accumulator. We sum up the
number of times each kmer is found over all the chunks, and store the sums in a param. Storing the results
in a param makes them available to other methods in this runnable -- specifically write_output.
This method expects counts to be stored in the Accumulator as an array of hashes. The counts themselves are stored in a hash: each key
is a kmer sequence, and each value is the count for that kmer (e.g. {'ACGT' => 5, 'CCGG' => 3, ...}). Each element of the array is a hashref
pointing to one of these kmer => count hashes, generated by a CountKmers job run previously in this pipeline.
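The array-of-hashes layout described above can be sketched outside of eHive as follows. This is a minimal, standalone illustration: the sample data in $all_counts is hypothetical and stands in for what the 'all_counts' Accumulator would return.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of the 'all_counts' Accumulator:
# one hashref of kmer => count per CountKmers job.
my $all_counts = [
    { 'ACGT' => 5, 'CCGG' => 3 },
    { 'ACGT' => 2, 'TTAA' => 1 },
];

# Sum the counts for each kmer across all chunks.
my %sum_of_counts;
foreach my $count_kmers_result (@{$all_counts}) {
    foreach my $kmer (keys %{$count_kmers_result}) {
        $sum_of_counts{$kmer} += $count_kmers_result->{$kmer};
    }
}

# Print the totals in a stable (sorted) order.
foreach my $kmer (sort keys %sum_of_counts) {
    print "$kmer\t$sum_of_counts{$kmer}\n";
}
```

With this sample data, 'ACGT' appears in both chunks, so its total is the sum of the two per-chunk counts; 'CCGG' and 'TTAA' each appear in only one chunk and keep their original counts.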
# Create a hash where we can add up counts for each kmer from each previous CountKmers job to determine overall total counts.
my %sum_of_counts;

# Access the Accumulator by its name ('all_counts'), as a param.
# We get an arrayref back.
my $all_counts = $self->param('all_counts');

# Loop through all the results from each individual CountKmers job.
foreach my $count_kmers_result (@{$all_counts}) {

    # For each CountKmers result, retrieve the count for each particular kmer, and add it to our total.
    foreach my $kmer (keys %{$count_kmers_result}) {
        $sum_of_counts{$kmer} += $count_kmers_result->{$kmer};
    }
}

# Finally, store our total counts for each kmer in a param called 'sum_of_counts', making them available to other methods.
$self->param('sum_of_counts', \%sum_of_counts);
Description : Implements write_output() interface method of Bio::EnsEMBL::Hive::Process that is used to deal with the job's output after execution.
Here, we flow out three values:
    * filename -- name of the sequence file given at the start of the pipeline
    * kmer -- the kmer being counted
    * count -- count of that kmer across the entire original input
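As an illustration of the shape of these output records, here is a minimal standalone sketch that builds one hashref per kmer without calling eHive. The file name 'sequences.fa' and the counts are hypothetical sample values; in the runnable they come from the 'inputfile' and 'sum_of_counts' params.

```perl
use strict;
use warnings;

# Hypothetical values: in the runnable these come from params.
my $inputfile     = 'sequences.fa';
my $sum_of_counts = { 'ACGT' => 7, 'CCGG' => 3 };

# Build one record per kmer, with the three keys listed above.
# The leading '+' tells Perl the inner braces are a hashref, not a block.
my @output_ids = map {
    +{ 'filename' => $inputfile, 'kmer' => $_, 'count' => $sum_of_counts->{$_} }
} sort keys %{$sum_of_counts};

# Print each record as a tab-separated line.
foreach my $output_id (@output_ids) {
    print join("\t", @{$output_id}{qw(filename kmer count)}), "\n";
}
```

Each element of @output_ids has the same three keys that the runnable passes to dataflow_output_id, one record per distinct kmer.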
my $self = shift(@_);

my $sum_of_counts = $self->param('sum_of_counts');

foreach my $kmer (keys(%{$sum_of_counts})) {
    $self->dataflow_output_id({
        'filename' => $self->param('inputfile'),
        'kmer'     => $kmer,
        'count'    => $sum_of_counts->{$kmer}