ensembl-hive  2.8.1
CompileCountsHoA.pm
=pod

=head1 NAME

    Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsHoA

=head1 SYNOPSIS

    Please refer to the Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::KmerPipelineHoA_conf pipeline configuration file
    to understand how this particular example pipeline is configured and run.

=head1 DESCRIPTION

    Kmer::RunnableDB::CompileCountsHoA is the last runnable in the kmer counting pipeline (using a hash of arrays Accumulator).
    This runnable fetches the kmer counts that the previous jobs stored in the Accumulator, and combines them to determine
    the overall kmer counts from the sequences in the original input file.

=head1 LICENSE

    See the NOTICE file distributed with this work for additional information
    regarding copyright ownership.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

         http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License
    is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and limitations under the License.

=head1 CONTACT

    Please subscribe to the Hive mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/ehive-users to discuss Hive-related questions or to be notified of our updates

=cut


package Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::CompileCountsHoA;

use strict;
use warnings;

use base ('Bio::EnsEMBL::Hive::Process');


=head2 param_defaults

    Description : Implements param_defaults() interface method of Bio::EnsEMBL::Hive::Process that defines module defaults for parameters.
                  This runnable defines no defaults of its own, so the method body is empty.

=cut

sub param_defaults {
}


=head2 fetch_input

    Description : Implements fetch_input() interface method of Bio::EnsEMBL::Hive::Process that is used to read in parameters and load data.
                  In this runnable, fetch_input is left empty. The runnable reads its data from a Hive Accumulator, so there are no extra database
                  connections to open and no files to check. It is more sensible to fetch data from the Accumulator in run, where it is needed,
                  than to fetch it here and pass it along in another parameter.

=cut

sub fetch_input {

}

=head2 run

    Description : Implements run() interface method of Bio::EnsEMBL::Hive::Process that is used to perform the main bulk of the job (minus input and output).

    In this method, we fetch the kmer counts produced by previous jobs and stored in an Accumulator. We sum up the
    number of times each kmer is found over all the chunks, and store the sums in a param. Storing the results
    in a param makes them available to other methods in this runnable, specifically write_output.

    This method expects counts to be stored in the Accumulator as a hash of arrays. The key in the hash is the kmer sequence, and the value
    is the list of counts of this kmer in the chunk files (e.g. [5, 8, ...]). Each element of the array is a value generated by a CountKmers
    job run previously in this pipeline.
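
    Example : A minimal, standalone sketch of the summation described above. The kmers and counts are
    made-up values standing in for the Accumulator contents; in the pipeline they come from
    $self->param('all_counts') instead.

        use strict;
        use warnings;

        # Hypothetical Accumulator contents: kmer => [ count from each CountKmers job ].
        my $all_counts = {
            'ACG' => [5, 8],
            'CGT' => [2],
        };

        # Add up the per-chunk counts for each kmer, as run() does.
        my %sum_of_counts;
        foreach my $kmer (keys %{$all_counts}) {
            foreach my $c (@{$all_counts->{$kmer}}) {
                $sum_of_counts{$kmer} += $c;
            }
        }

        # %sum_of_counts is now ( 'ACG' => 13, 'CGT' => 2 ).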

=cut

sub run {
    my $self = shift @_;

    # Create a hash where we can add up the counts for each kmer from each previous CountKmers job to determine the overall totals.
    my %sum_of_counts;

    # Access the Accumulator by its name ('all_counts'), as a param.
    # We get a hashref back: kmer => arrayref of per-job counts.
    my $all_counts = $self->param('all_counts');

    # Loop through all the keys (kmers) stored in the Accumulator.
    foreach my $kmer (keys %{$all_counts}) {

        # For each kmer, retrieve the count from each CountKmers job, and add it to our total.
        foreach my $c (@{$all_counts->{$kmer}}) {
            $sum_of_counts{$kmer} += $c;
        }
    }

    # Finally, store our total counts for each kmer in a param called 'sum_of_counts', making them available to other methods.
    $self->param('sum_of_counts', \%sum_of_counts);
}

=head2 write_output

    Description : Implements write_output() interface method of Bio::EnsEMBL::Hive::Process that is used to deal with the job's output after the execution.

    Here, we flow out three values on branch 4, once for each kmer:
    * filename -- name of the sequence file given at the start of the pipeline
    * kmer -- the kmer being counted
    * count -- count of that kmer across the entire original input
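
    Example : For illustration only, the standalone loop below builds hashes of the same shape as those
    passed to dataflow_output_id(). The counts and the file name ('sequences.fa') are hypothetical, and the
    sort is there only to give the example a stable order; write_output itself does not sort.

        use strict;
        use warnings;

        # Hypothetical inputs: the summed counts and the original input file name.
        my $sum_of_counts = { 'ACG' => 13, 'CGT' => 2 };
        my $inputfile     = 'sequences.fa';

        # Build one output id per kmer.
        my @output_ids;
        foreach my $kmer (sort keys %{$sum_of_counts}) {
            push @output_ids, {
                'filename' => $inputfile,
                'kmer'     => $kmer,
                'count'    => $sum_of_counts->{$kmer},
            };
        }

        # @output_ids now holds two hashes:
        #   { filename => 'sequences.fa', kmer => 'ACG', count => 13 }
        #   { filename => 'sequences.fa', kmer => 'CGT', count => 2 }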

=cut

sub write_output {
    my $self = shift(@_);

    my $sum_of_counts = $self->param('sum_of_counts');

    # Flow out one record per kmer on branch 4.
    foreach my $kmer (keys(%{$sum_of_counts})) {
        $self->dataflow_output_id({
            'filename' => $self->param('inputfile'),
            'kmer' => $kmer,
            'count' => $sum_of_counts->{$kmer}
        }, 4);
    }
}

1;