ensembl-hive  2.8.1
SplitSequence.pm
Go to the documentation of this file.
1 =pod
2 
3 =head1 NAME
4 
6 
7 =head1 SYNOPSIS
8 
9  Please refer to Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::Kmer_conf pipeline configuration file
10  to understand how this particular example pipeline is configured and ran.
11 
12 =head1 DESCRIPTION
13 
14  'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence is a utility runnable to split a DNA sequence into several shorter,
15  overlapping subsequences. (Compare Bio::EnsEMBL::Hive::RunnableDB::FastaFactory, which splits a file containing many sequences
16  into several files containing some of the sequences in the original file, but keeps the sequences intact).
17 
18  As input, it takes the name of a file with one or more DNA sequences,
19  the format of that file (which must be supported by Bio::SeqIO),
20  a chunk size in base-pairs (BP),
21  and an overlap size, also in BP.
22 
23  Overlap size must be >= 1. Chunk size must be > overlap size.
24 
25  From this, it generates subsequences [chunk-size] BP long, overlapping each other by [overlap-size] BP.
26  The last chunk may be shorter than [chunk-size] BP, but will always be at least [overlap-size] + 1 BP
27 
28  It then stores each of these chunks into new FASTA-format files, one chunk per file.
29 
30  It flows out each chunk's filename, along with a flag indicating if a particular chunk is the last chunk
31  from the original sequence.
32 
33 
34 =head1 LICENSE
35 
36  See the NOTICE file distributed with this work for additional information
37  regarding copyright ownership.
38 
39  Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
40  You may obtain a copy of the License at
41 
42  http://www.apache.org/licenses/LICENSE-2.0
43 
44  Unless required by applicable law or agreed to in writing, software distributed under the License
45  is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
46  See the License for the specific language governing permissions and limitations under the License.
47 
48 =head1 CONTACT
49 
50  Please subscribe to the Hive mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/ehive-users to discuss Hive-related questions or to be notified of our updates
51 
52 =cut
53 
54 package Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence;
55 
56 use strict;
57 use warnings;
58 
59 use Bio::SeqIO;
60 use Bio::Seq;
61 
62 use Data::Dumper;
63 
64 use base ('Bio::EnsEMBL::Hive::Process');
65 
66 =head2 param_defaults
67 
68  Description : Implements param_defaults() interface method of Bio::EnsEMBL::Hive::Process that defines module defaults for parameters.
69 
70 =cut
71 
72 sub param_defaults {
73 
74  return {
75  'output_dir' => '.',
76  'output_prefix' => 'my_chunk_',
77  'output_suffix' => '.fasta',
78  'input_format' => 'FASTA',
79  };
80 }
81 
82 =head2 fetch_input
83 
84  Description : Implements fetch_input() interface method of Bio::EnsEMBL::Hive::Process that is used to read in parameters and load data.
85 
86  There are no hard and fast rules on whether to fetch parameters in fetch_input(), or to wait until run() to fetch them.
87  In general, fetch_input() is a place to validate parameter existence and values for errors before the worker get set into RUN state
88  from the FETCH_INPUT state.
89 
90 
91 
92 =cut
93 
94 sub fetch_input {
95 
96 }
97 
98 =head2 run
99 
100  Description : Implements run() interface method of Bio::EnsEMBL::Hive::Process that is used to perform the main bulk of the job (minus input and output).
101 
102  Here, run() reads in the filename of a file containing a DNA sequence, an integer chunk-size in bases,
103  and an integer overlap size in bases. The file needs to be in a format supported by Bio::SeqIO.
104 
105  param('inputfile'): Name of the file containing the sequence to split
106  param('input_format'): Format of the sequence file (e.g. FASTA)
107  param('chunk_size'): Desired chunk size in base-pairs
108  param('overlap_size'): Desired length of overlap between chunks, in base-pairs
109 
110 
111 
112 =cut
113 
114 sub run {
115  my $self = shift @_;
116 
117  my $inputfile = $self->param_required('inputfile');
118  my $chunk_size = $self->param_required('chunk_size');
119  my $overlap_size = $self->param_required('overlap_size');
120 
121  my $input_seqio;
122  if ( $inputfile =~ /\.(?:gz|Z)$/) {
123  open(my $in_fh, '-|', "gunzip -c $inputfile");
124  $input_seqio = Bio::SeqIO->new(-fh => $in_fh, -format => $self->param('input_format'));
125  $self->param('input_fh', $in_fh); # storing as a param so that it can be closed by post_cleanup if necessary
126  } else {
127  $input_seqio = Bio::SeqIO->new(-file => $inputfile, -format => $self->param('input_format'));
128  $self->param('input_fh', undef);
129  }
130  die "Could not open or parse '$inputfile', please investigate" unless $input_seqio;
131 
132 
133  if (($chunk_size < 1) ||
134  ($overlap_size < 0) ||
135  ($overlap_size >= $chunk_size)) {
136  die "chunk_size must be > overlap_size, and both must be positive";
137  }
138 
139  my $split_sequences = _split_sequences($input_seqio, $chunk_size, $overlap_size);
140  $self->param('split_sequences', $split_sequences);
141 }
142 
143 =head2 write_output
144 
145  Description: Implements the write_output() interface method of Bio::EnsEMBL::Hive::Process that is used to flow output to the rest of the pipeline.
146 
147  Here, we flow out two values:
148  * chunk_name -- the name of a file containing a sequence chunk
149  * output_format -- the format of the output file. For this runnable, this is hard-coded as 'FASTA'
150  as it always outputs FASTA format files
151 
152 =cut
153 
154 
155 sub write_output {
156  my $self = shift @_;
157 
158  my @split_sequences = @{$self->param('split_sequences')};
159 
160  for (my $i = 0; $i <= $#split_sequences; $i++) {
161  my $seq_object = Bio::Seq->new(-seq => $split_sequences[$i],
162  -id => "split_" . $i);
163 
164  my $chunk_filename = ($self->param('output_dir') ? $self->param('output_dir') . '/' : '' ) . $self->param('output_prefix') . $i . $self->param('output_suffix');
165  my $chunk_seqio = Bio::SeqIO->new(-file => '>' . $chunk_filename,
166  -format => 'fasta');
167 
168  $chunk_seqio->write_seq($seq_object);
169 
170 
171  $self->dataflow_output_id( {'chunk_name' => $chunk_filename,
172  'output_format' => 'FASTA',
173  }, 2);
174  }
175 
176 }
177 
178 =head2 post_cleanup
179 
180  Description: Here, we implement the post_cleanup method of Bio::EnsEMBL::Hive::Process that is used to take care
181 of housekeeping details after a runnable finishes.
182 
183  This method will run even if the job fails, or write_output never gets called - so it's somewhat analogous to
184  the finalize methods many languages implement as part of exception handling.
185 
186  In this case, it ensures the input filehandle gets closed.
187 
188 =cut
189 
190 sub post_cleanup {
191  my $self = shift;
192  close( $self->param('input_fh') ) if $self->param('input_fh');
193 }
194 
195 =head2 _split_sequence
196 
197  Description: This is a private function (not a method, so it doesn't know about parameters)
198  that takes a long sequence and returns a collection of shorter subsequences.
199 
200  Arg [1] : The sequence, as a string
201  Arg [2] : The length of each subsequence (if possible). Subsequences will be no longer than
202  this value, but one may be shorter
203  Arg [3] : Number of bases to overlap adjacent subsequences
204 
205 =cut
206 
207 sub _split_sequences {
208  my ($seqio, $chunk_size, $overlap_size) = @_;
209 
210  my @split_sequences;
211 
212  while (my $seq = $seqio->next_seq()) {
213  my $seq_str = $seq->seq();
214 
215  $seq_str = uc($seq_str);
216  my $chunk_pointer = 0;
217  my $last_chunk = 0;
218  my $chunk_substring_size = $chunk_size;
219  do {
220  my $chunk_end = ($chunk_pointer + $chunk_size) - 1;
221  if ($chunk_end > length($seq_str)) {
222  $chunk_end = length($seq_str) - 1;
223  $chunk_substring_size = ($chunk_end - $chunk_pointer) + 1;
224  $last_chunk = 1;
225  }
226  my $subseq = substr($seq_str, $chunk_pointer, $chunk_substring_size);
227  push(@split_sequences, $subseq);
228 
229  $chunk_pointer = (($chunk_end + 1) - $overlap_size);
230  } while ($last_chunk == 0);
231  }
232 
233  return \@split_sequences;
234 }
235 
236 1;
Bio::EnsEMBL::Hive::Process
Definition: Process.pm:77
main
public main()
about
public about()
run
public run()
Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence
Definition: SplitSequence.pm:40
Bio
Definition: AltAlleleGroup.pm:4