9 Please refer to Bio::EnsEMBL::Hive::Examples::Kmer::PipeConfig::Kmer_conf pipeline configuration file
10 to understand how
this particular example pipeline is configured and ran.
14 'Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence is a utility runnable to split a DNA sequence into several shorter,
15 overlapping subsequences. (Compare Bio::EnsEMBL::Hive::RunnableDB::FastaFactory, which splits a file containing many sequences
16 into several files containing some of the sequences in the original file, but keeps the sequences intact).
18 As input, it takes the name of a file with one or more DNA sequences,
19 the format of that file (which must be supported by Bio::SeqIO),
20 a chunk size in base-pairs (BP),
21 and an overlap size, also in BP.
23 Overlap size must be >= 1. Chunk size must be > overlap size.
25 From this, it generates subsequences [chunk-size] BP long, overlapping each other by [overlap-size] BP.
26 The last chunk may be shorter than [chunk-size] BP, but will always be at least [overlap-size] + 1 BP
28 It then stores each of these chunks into new FASTA-format files, one chunk per file.
30 It flows out each chunk's filename, along with a flag indicating
if a particular chunk is the last chunk
31 from the original sequence.
36 See the NOTICE file distributed with
this work
for additional information
37 regarding copyright ownership.
39 Licensed under the Apache License, Version 2.0 (the
"License"); you may not use
this file except in compliance with the License.
40 You may obtain a copy of the License at
44 Unless required by applicable law or agreed to in writing, software distributed under the License
45 is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
46 See the License
for the specific language governing permissions and limitations under the License.
50 Please subscribe to the Hive mailing list: http:
54 package Bio::EnsEMBL::Hive::Examples::Kmer::RunnableDB::SplitSequence;
64 use base (
'Bio::EnsEMBL::Hive::Process');
68 Description : Implements param_defaults()
interface method of
Bio::EnsEMBL::Hive::Process that defines module defaults for parameters.
76 'output_prefix' =>
'my_chunk_',
77 'output_suffix' =>
'.fasta',
78 'input_format' =>
'FASTA',
84 Description : Implements fetch_input()
interface method of
Bio::EnsEMBL::Hive::Process that is used to read in parameters and load data.
86 There are no hard and fast rules on whether to fetch parameters in fetch_input(), or to wait until
run() to fetch them.
87 In general, fetch_input() is a place to validate parameter existence and values for errors before the worker get set into RUN state
88 from the FETCH_INPUT state.
100 Description : Implements
run()
interface method of
Bio::EnsEMBL::Hive::Process that is used to perform the
main bulk of the job (minus input and output).
102 Here,
run() reads in the filename of a file containing a DNA sequence, an integer chunk-size in bases,
103 and an integer overlap size in bases. The file needs to be in a format supported by Bio::SeqIO.
105 param('inputfile'): Name of the file containing the sequence to split
106 param('input_format'): Format of the sequence file (e.g. FASTA)
107 param('chunk_size'): Desired chunk size in base-pairs
108 param('overlap_size'): Desired length of overlap between chunks, in base-pairs
117 my $inputfile = $self->param_required(
'inputfile');
118 my $chunk_size = $self->param_required(
'chunk_size');
119 my $overlap_size = $self->param_required(
'overlap_size');
122 if ( $inputfile =~ /\.(?:gz|Z)$/) {
123 open(my $in_fh,
'-|',
"gunzip -c $inputfile");
124 $input_seqio = Bio::SeqIO->new(-fh => $in_fh, -format => $self->param(
'input_format'));
125 $self->param(
'input_fh', $in_fh); # storing as a param so that it can be closed by post_cleanup
if necessary
127 $input_seqio = Bio::SeqIO->new(-file => $inputfile, -format => $self->param(
'input_format'));
128 $self->param(
'input_fh', undef);
130 die
"Could not open or parse '$inputfile', please investigate" unless $input_seqio;
133 if (($chunk_size < 1) ||
134 ($overlap_size < 0) ||
135 ($overlap_size >= $chunk_size)) {
136 die
"chunk_size must be > overlap_size, and both must be positive";
139 my $split_sequences = _split_sequences($input_seqio, $chunk_size, $overlap_size);
140 $self->param(
'split_sequences', $split_sequences);
145 Description: Implements the write_output()
interface method of
Bio::EnsEMBL::Hive::Process that is used to flow output to the rest of the pipeline.
147 Here, we flow out two values:
148 * chunk_name -- the name of a file containing a sequence chunk
149 * output_format -- the format of the output file. For this runnable, this is hard-coded as 'FASTA'
150 as it always outputs FASTA format files
158 my @split_sequences = @{$self->param(
'split_sequences')};
160 for (my $i = 0; $i <= $#split_sequences; $i++) {
161 my $seq_object = Bio::Seq->new(-seq => $split_sequences[$i],
162 -
id =>
"split_" . $i);
164 my $chunk_filename = ($self->param(
'output_dir') ? $self->param(
'output_dir') .
'/' :
'' ) . $self->param(
'output_prefix') . $i . $self->param(
'output_suffix');
165 my $chunk_seqio = Bio::SeqIO->new(-file =>
'>' . $chunk_filename,
168 $chunk_seqio->write_seq($seq_object);
171 $self->dataflow_output_id( {
'chunk_name' => $chunk_filename,
172 'output_format' =>
'FASTA',
181 of housekeeping details after a runnable finishes.
183 This method will
run even
if the job fails, or write_output never gets called - so it
's somewhat analogous to
184 the finalize methods many languages implement as part of exception handling.
186 In this case, it ensures the input filehandle gets closed.
192 close( $self->param('input_fh
') ) if $self->param('input_fh
');
195 =head2 _split_sequence
197 Description: This is a private function (not a method, so it doesn't know
about parameters)
198 that takes a
long sequence and returns a collection of shorter subsequences.
200 Arg [1] : The sequence, as a
string
201 Arg [2] : The length of each subsequence (
if possible). Subsequences will be no longer than
202 this value, but one may be shorter
203 Arg [3] : Number of bases to overlap adjacent subsequences
207 sub _split_sequences {
208 my ($seqio, $chunk_size, $overlap_size) = @_;
212 while (my $seq = $seqio->next_seq()) {
213 my $seq_str = $seq->seq();
215 $seq_str = uc($seq_str);
216 my $chunk_pointer = 0;
218 my $chunk_substring_size = $chunk_size;
220 my $chunk_end = ($chunk_pointer + $chunk_size) - 1;
221 if ($chunk_end > length($seq_str)) {
222 $chunk_end = length($seq_str) - 1;
223 $chunk_substring_size = ($chunk_end - $chunk_pointer) + 1;
226 my $subseq = substr($seq_str, $chunk_pointer, $chunk_substring_size);
227 push(@split_sequences, $subseq);
229 $chunk_pointer = (($chunk_end + 1) - $overlap_size);
230 }
while ($last_chunk == 0);
233 return \@split_sequences;