ensembl-hive  2.8.1
BaseAlignFeature.pm
Go to the documentation of this file.
1 =head1 LICENSE
2 
3 See the NOTICE file distributed with this work for additional information
4 regarding copyright ownership.
5 
6 Licensed under the Apache License, Version 2.0 (the "License");
7 you may not use this file except in compliance with the License.
8 You may obtain a copy of the License at
9 
10  http://www.apache.org/licenses/LICENSE-2.0
11 
12 Unless required by applicable law or agreed to in writing, software
13 distributed under the License is distributed on an "AS IS" BASIS,
14 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 See the License for the specific language governing permissions and
16 limitations under the License.
17 
18 =cut
19 
20 
21 =head1 CONTACT
22 
23  Please email comments or questions to the public Ensembl
24  developers list at <http://lists.ensembl.org/mailman/listinfo/dev>.
25 
26  Questions may also be sent to the Ensembl help desk at
27  <http://www.ensembl.org/Help/Contact>.
28 
29 =cut
30 
31 =head1 NAME
32 
33 Bio::EnsEMBL::BaseAlignFeature - Baseclass providing a common abstract
34 implementation for alignment features
35 
36 =head1 SYNOPSIS
37 
38  my $feat = new Bio::EnsEMBL::DnaPepAlignFeature(
39  -slice => $slice,
40  -start => 100,
41  -end => 120,
42  -strand => 1,
43  -hseqname => 'SP:RF1231',
44  -hstart => 200,
45  -hend => 220,
46  -analysis => $analysis,
47  -cigar_string => '10M3D5M2I',
48  -align_type => 'ensembl'
49  );
50 
51  where $analysis is a Bio::EnsEMBL::Analysis object.
52 
53  Alternatively if you have an array of ungapped features:
54 
55  my $feat =
56  new Bio::EnsEMBL::DnaPepAlignFeature( -features => \@features );
57 
58  where @features is an array of Bio::EnsEMBL::FeaturePair objects.
59 
60  There is a method to (re)create ungapped features from the cigar_string:
61 
62  my @ungapped_features = $feat->ungapped_features();
63 
64  where @ungapped_features is an array of Bio::EnsEMBL::FeaturePair's.
65 
66  Bio::EnsEMBL::BaseAlignFeature inherits from:
67  Bio::EnsEMBL::FeaturePair, which in turn inherits from:
68  Bio::EnsEMBL::Feature,
69  thus methods from both parent classes are available.
70 
71 
72  The cigar_string is a condensed representation of the matches and gaps
73  which make up the gapped alignment (where CIGAR stands for
74  Concise Idiosyncratic Gapped Alignment Report).
75 
76  CIGAR format is: n <matches> [ x <deletes or inserts> m <matches> ]*
77  where M = match, D = delete, I = insert; n, m are match lengths;
78  x is delete or insert length.
79 
80  Spaces are omitted, thus: "23M4I12M2D1M"
81  as are counts for any lengths of 1, thus 1M becomes M: "23M4I12M2DM"
82 
83 
84  To make things clearer this is how a blast HSP would be parsed:
85 
86  >AK014066
87  Length = 146
88 
89  Minus Strand HSPs:
90 
91  Score = 76 (26.8 bits), Expect = 1.4, P = 0.74
92  Identities = 20/71 (28%), Positives = 29/71 (40%), Frame = -1
93 
94  Query: 479 GLQAPPPTPQGCRLIPPPPLGLQAPLPTLRAVGSSHHHP*GRQGSSLSSFRSSLASKASA 300
95  G APPP PQG R P P G + P L + + ++ R +A +
96  Sbjct: 7 GALAPPPAPQG-RWAFPRPTG-KRPATPLHGTARQDRQVRRSEAAKVTGCRGRVAPHVAP 64
97 
98  Query: 299 SSPHNPSPLPS 267
99  H P+P P+
100  Sbjct: 65 PLTHTPTPTPT 75
101 
102  The alignment goes from 479 down to 267 in the query sequence on the reverse
103  strand, and from 7 to 75 in the subject sequence.
104 
105  The alignment is made up of the following ungapped pieces:
106 
107  query_seq start 447 , sbjct_seq hstart 7 , match length 33 , strand -1
108  query_seq start 417 , sbjct_seq hstart 18 , match length 27 , strand -1
109  query_seq start 267 , sbjct_seq hstart 27 , match length 147 , strand -1
110 
111  When assembled into a DnaPepAlignFeature where:
112  (seqname, start, end, strand) refer to the query sequence,
113  (hseqname, hstart, hend, hstrand) refer to the subject sequence,
114  these ungapped pieces are represented by the cigar string:
115  33M3I27M3I147M
116  with start 267, end 479, strand -1, and hstart 7, hend 75, hstrand 1.
117 
118 =head1 CAVEATS
119 
120  AlignFeature cigar strings have the opposite 'sense'
121  ('D' and 'I' swapped) compared with Exonerate cigar strings.
122 
123  Exonerate modules in Bio::EnsEMBL::Analysis use this convention:
124 
125  The longer genomic sequence specified by:
126  exonerate: target
127  AlignFeature: (sequence, start, end, strand)
128 
129  A shorter sequence (such as EST or protein) specified by:
130  exonerate: query
131  AlignFeature: (hsequence, hstart, hend, hstrand)
132 
133  The resulting AlignFeature cigar strings have 'D' and 'I'
134  swapped compared with the Exonerate output, i.e.:
135 
136  exonerate: M 123 D 1 M 11 I 1 M 39
137  AlignFeature: 123MI11MD39M
138 
139 =head1 METHODS
140 
141 =cut
142 
143 
144 package Bio::EnsEMBL::BaseAlignFeature;
145 
146 use Bio::EnsEMBL::FeaturePair;
147 use Bio::EnsEMBL::Utils::Argument qw(rearrange);
148 use Bio::EnsEMBL::Utils::Exception qw(throw warning);
149 
150 use vars qw(@ISA);
151 use strict;
152 
153 @ISA = qw(Bio::EnsEMBL::FeaturePair);
154 
155 
156 =head2 new
157 
158  Arg [..] : List of named arguments. (-cigar_string , -features, -align_type) defined
159  in this constructor, others defined in FeaturePair and
160  SeqFeature superclasses. Either cigar_string or a list
161  of ungapped features should be provided - not both.
162  Example : $baf = new BaseAlignFeatureSubclass(-cigar_string => '3M3I12M', -align_type => 'ensembl');
163  Description: Creates a new BaseAlignFeature using either a cigar string or
164  a list of ungapped features. BaseAlignFeature is an abstract
165  baseclass and should not actually be instantiated - rather its
166  subclasses should be.
167  Returntype : Bio::EnsEMBL::BaseAlignFeature
168  Exceptions : thrown if both feature and cigar string args are provided
169  thrown if neither feature nor cigar string args are provided
170  warn if cigar string is provided without cigar type
171  Caller : general
172  Status : Stable
173 
174 =cut
175 
176 sub new {
177 
178  my $caller = shift;
179 
180  my $class = ref($caller) || $caller;
181 
182  my $self = $class->SUPER::new(@_);
183 
184  my ($cigar_string,$align_type,$features) = rearrange([qw(CIGAR_STRING ALIGN_TYPE FEATURES)], @_);
185 
186  if (defined($align_type)) {
187  $self->{'align_type'} = $align_type;
188  } else {
189  warning("No align_type provided, using ensembl as default");
190  $self->{'align_type'} = 'ensembl';
191  }
192 
193  if (defined($cigar_string) && defined($features)) {
194  throw("CIGAR_STRING or FEATURES argument is required - not both.");
195  } elsif (defined($features)) {
196  $self->_parse_features($features);
197 
198  } elsif (defined($cigar_string)) {
199  $self->{'cigar_string'} = $cigar_string;
200  } else {
201  throw("CIGAR_STRING or FEATURES argument is required");
202  }
203 
204  return $self;
205 }
206 
207 
208 =head2 cigar_string
209 
210  Arg [1] : string $cigar_string
211  Example : $feature->cigar_string( "12MI3M" );
212  Description: get/set for attribute cigar_string.
213  cigar_string describes the alignment:
214  "xM" stands for x matches (or mismatches),
215  "xI" for x inserts into the query sequence,
216  "xD" for x deletions from the query sequence
217  where the query sequence is specified by (seqname, start, ...)
218  and the subject sequence by (hseqname, hstart, ...).
219  An "x" that is 1 can be omitted.
220  See the SYNOPSIS for an example.
221  Returntype : string
222  Exceptions : none
223  Caller : general
224  Status : Stable
225 
226 =cut
227 
228 sub cigar_string {
229  my $self = shift;
230  $self->{'cigar_string'} = shift if(@_);
231  return $self->{'cigar_string'};
232 }
233 
234 
235 =head2 align_type
236 
237  Arg [1] : type $align_type
238  Example : $feature->align_type( "ensembl" );
239  Description: get/set for attribute align_type.
240  align_type specifies which cigar string
241  is used to describe the alignment:
242  The default is 'ensembl'
243  Returntype : string
244  Exceptions : none
245  Caller : general
246  Status : Stable
247 
248 =cut
249 
250 sub align_type {
251  my $self = shift;
252  $self->{'align_type'} = shift if(@_);
253  return $self->{'align_type'};
254 }
255 
256 
257 =head2 alignment_length
258 
259  Arg [1] : None
260  Description: return the alignment length (including indels) based on the alignment_type ('ensembl', 'mdtag')
261  Returntype : int
262  Exceptions :
263  Caller :
264  Status : Stable
265 
266 =cut
267 
268 sub alignment_length {
269  my $self = shift;
270 
271  # ensembl: Internal CIGAR format
272  if ($self->{'align_type'} eq 'ensembl') {
273  return $self->_ensembl_cigar_alignment_length();
274  # mdtag: MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* (Refer: SAM/BAM specification)
275  } elsif ($self->{'align_type'} eq 'mdtag') {
276  return $self->_mdtag_alignment_length();
277  } else {
278  throw("No alignment_length method available for " . $self->{'align_type'});
279  }
280 
281 }
282 
283 
284 =head2 _ensembl_cigar_alignment_length
285 
286  Arg [1] : None
287  Description: return the alignment length (including indels) based on the cigar_string
288  Returntype : int
289  Exceptions :
290  Caller :
291  Status : Stable
292 
293 =cut
294 
295 sub _ensembl_cigar_alignment_length {
296  my $self = shift;
297 
298  if (! defined $self->{'_alignment_length'} && defined $self->cigar_string) {
299 
300  my @pieces = ( $self->cigar_string =~ /(\d*[MDI])/g );
301  unless (@pieces) {
302  print STDERR "Error parsing cigar_string\n";
303  }
304  my $alignment_length = 0;
305  foreach my $piece (@pieces) {
306  my ($length) = ( $piece =~ /^(\d*)/ );
307  if (! defined $length || $length eq "") {
308  $length = 1;
309  }
310  $alignment_length += $length;
311  }
312  $self->{'_alignment_length'} = $alignment_length;
313  }
314  return $self->{'_alignment_length'};
315 }
316 
317 =head2 ungapped_features
318 
319  Args : none
320  Example : @ungapped_features = $align_feature->get_feature
321  Description: converts the internal cigar_string into an array of
322  ungapped feature pairs
323  Returntype : list of Bio::EnsEMBL::FeaturePair
324  Exceptions : cigar_string not set internally
325  Caller : general
326  Status : Stable
327 
328 =cut
329 
330 sub ungapped_features {
331  my ($self) = @_;
332 
333  if (!defined($self->{'cigar_string'})) {
334  throw("No cigar_string defined. Can't return ungapped features");
335  }
336 
337  return @{$self->_parse_cigar()};
338 }
339 
340 =head2 strands_reversed
341 
342  Arg [1] : int $strands_reversed
343  Description: get/set for attribute strands_reversed
344  0 means that strand and hstrand are the original strands obtained
345  from the alignment program used
346  1 means that strand and hstrand have been flipped as compared to
347  the original result provided by the alignment program used.
348  You may want to use the reverse_complement method to restore the
349  original strandness.
350  Returntype : int
351  Exceptions : none
352  Caller : general
353  Status : Stable
354 
355 =cut
356 
357 sub strands_reversed {
358  my ($self, $arg) = @_;
359 
360  if ( defined $arg ) {
361  $self->{'strands_reversed'} = $arg ;
362  }
363 
364  $self->{'strands_reversed'} = 0 unless (defined $self->{'strands_reversed'});
365 
366  return $self->{'strands_reversed'};
367 }
368 
369 
370 =head2 reverse_complement
371 
372  Args : none
373  Description: reverse complement the FeaturePair based on the cigar type
374  modifing strand, hstrand and cigar_string in consequence
375  Returntype : none
376  Exceptions : none
377  Caller : general
378  Status : Stable
379 
380 =cut
381 
382 
383 sub reverse_complement {
384  my ($self) = @_;
385 
386  if ($self->{'align_type'} eq 'ensembl') {
387  return $self->_ensembl_reverse_complement();
388  } else {
389  throw("no reverse_complement method implemented for " . $self->{'align_type'});
390  }
391 }
392 
393 =head2 _ensembl_reverse_complement
394 
395  Args : none
396  Description: reverse complement the FeaturePair for ensembl cigar string,
397  modifing strand, hstrand and cigar_string in consequence
398  Returntype : none
399  Exceptions : none
400  Caller : general
401  Status : Stable
402 
403 =cut
404 
405 sub _ensembl_reverse_complement {
406  my ($self) = @_;
407 
408  # reverse strand in both sequences
409  $self->strand($self->strand * -1);
410  $self->hstrand($self->hstrand * -1);
411 
412  # reverse cigar_string as consequence
413  my $cigar_string = $self->cigar_string;
414  $cigar_string =~ s/(D|I|M)/$1 /g;
415  my @cigar_pieces = split / /,$cigar_string;
416  $cigar_string = "";
417  while (my $piece = pop @cigar_pieces) {
418  $cigar_string .= $piece;
419  }
420 
421  $self->{'strands_reversed'} = 0 unless (defined $self->{'strands_reversed'});
422 
423  if ($self->strands_reversed) {
424  $self->strands_reversed(0)
425  } else {
426  $self->strands_reversed(1);
427  }
428 
429  $self->cigar_string($cigar_string);
430 }
431 
432 
433 
434 =head2 transform
435 
436  Arg 1 : String $coordinate_system_name
437  Arg [2] : String $coordinate_system_version
438  Example : $feature = $feature->transform('contig');
439  $feature = $feature->transform('chromosome', 'NCBI33');
440  Description: Moves this AlignFeature to the given coordinate system.
441  If the feature cannot be transformed to the destination
442  coordinate system undef is returned instead.
443  Returntype : Bio::EnsEMBL::BaseAlignFeature;
444  Exceptions : wrong parameters
445  Caller : general
446  Status : Medium Risk
447 
448 =cut
449 
450 sub transform {
451  my $self = shift;
452 
453  my $new_feature = $self->SUPER::transform(@_);
454  if ( !defined($new_feature)
455  || $new_feature->length() != $self->length() )
456  {
457  my @segments = @{ $self->project(@_) };
458 
459  if ( !@segments ) {
460  return undef;
461  }
462 
463  my @ungapped;
464  foreach my $f ( $self->ungapped_features() ) {
465  $f = $f->transform(@_);
466  if ( defined($f) ) {
467  push( @ungapped, $f );
468  } else {
469  warning( "Failed to transform alignment feature; "
470  . "ungapped component could not be transformed" );
471  return undef;
472  }
473  }
474 
475  eval { $new_feature = $self->new( -features => \@ungapped ); };
476 
477  if ($@) {
478  warning($@);
479  return undef;
480  }
481  } ## end if ( !defined($new_feature...))
482 
483  return $new_feature;
484 }
485 
486 
487 =head2 _parse_ensembl_cigar
488 
489  Args : none
490  Description: PRIVATE (internal) method - creates ungapped features from
491  internally stored cigar line in ensembl format
492  Returntype : list of Bio::EnsEMBL::FeaturePair
493  Exceptions : none
494  Caller : ungapped_features
495  Status : Stable
496 
497 =cut
498 
499 sub _parse_ensembl_cigar {
500  my ( $self ) = @_;
501 
502  my $query_unit = $self->_query_unit();
503  my $hit_unit = $self->_hit_unit();
504 
505  my $string = $self->{'cigar_string'};
506 
507  throw("No cigar string defined in object") if(!defined($string));
508 
509  my @pieces = ( $string =~ /(\d*[MDI])/g );
510 
511  my @features;
512  my $strand1 = $self->{'strand'} || 1;
513  my $strand2 = $self->{'hstrand'}|| 1;
514 
515  my ( $start1, $start2 );
516 
517  if( $strand1 == 1 ) {
518  $start1 = $self->{'start'};
519  } else {
520  $start1 = $self->{'end'};
521  }
522 
523  if( $strand2 == 1 ) {
524  $start2 = $self->{'hstart'};
525  } else {
526  $start2 = $self->{'hend'};
527  }
528 
529  #
530  # Construct ungapped blocks as FeaturePairs objects for each MATCH
531  #
532  foreach my $piece (@pieces) {
533 
534  my ($length) = ( $piece =~ /^(\d*)/ );
535  if( $length eq "" ) { $length = 1 }
536  my $mapped_length;
537 
538  # explicit if statements to avoid rounding problems
539  # and make sure we have sane coordinate systems
540  if( $query_unit == 1 && $hit_unit == 3 ) {
541  $mapped_length = $length*3;
542  } elsif( $query_unit == 3 && $hit_unit == 1 ) {
543  $mapped_length = $length / 3;
544  } elsif ( $query_unit == 1 && $hit_unit == 1 ) {
545  $mapped_length = $length;
546  } else {
547  throw("Internal error $query_unit $hit_unit, currently only " .
548  "allowing 1 or 3 ");
549  }
550 
551  if( int($mapped_length) != $mapped_length and
552  ($piece =~ /M$/ or $piece =~ /D$/)) {
553  throw("Internal error with mismapped length of hit, query " .
554  "$query_unit, hit $hit_unit, length $length");
555  }
556 
557  if( $piece =~ /M$/ ) {
558  #
559  # MATCH
560  #
561  my ( $qstart, $qend);
562  if( $strand1 == 1 ) {
563  $qstart = $start1;
564  $qend = $start1 + $length - 1;
565  $start1 = $qend + 1;
566  } else {
567  $qend = $start1;
568  $qstart = $start1 - $length + 1;
569  $start1 = $qstart - 1;
570  }
571 
572  my ($hstart, $hend);
573  if( $strand2 == 1 ) {
574  $hstart = $start2;
575  $hend = $start2 + $mapped_length - 1;
576  $start2 = $hend + 1;
577  } else {
578  $hend = $start2;
579  $hstart = $start2 - $mapped_length + 1;
580  $start2 = $hstart - 1;
581  }
582 
583 
584  push @features, Bio::EnsEMBL::FeaturePair->new
585  (-SLICE => $self->{'slice'},
586  -SEQNAME => $self->{'seqname'},
587  -START => $qstart,
588  -END => $qend,
589  -STRAND => $strand1,
590  -HSLICE => $self->{'hslice'},
591  -HSEQNAME => $self->{'hseqname'},
592  -HSTART => $hstart,
593  -HEND => $hend,
594  -HSTRAND => $strand2,
595  -SCORE => $self->{'score'},
596  -PERCENT_ID => $self->{'percent_id'},
597  -ANALYSIS => $self->{'analysis'},
598  -P_VALUE => $self->{'p_value'},
599  -EXTERNAL_DB_ID => $self->{'external_db_id'},
600  -HCOVERAGE => $self->{'hcoverage'},
601  -GROUP_ID => $self->{'group_id'},
602  -LEVEL_ID => $self->{'level_id'});
603 
604 
605  # end M cigar bits
606  } elsif( $piece =~ /I$/ ) {
607  #
608  # INSERT
609  #
610  if( $strand1 == 1 ) {
611  $start1 += $length;
612  } else {
613  $start1 -= $length;
614  }
615  } elsif( $piece =~ /D$/ ) {
616  #
617  # DELETION
618  #
619  if( $strand2 == 1 ) {
620  $start2 += $mapped_length;
621  } else {
622  $start2 -= $mapped_length;
623  }
624  } else {
625  throw( "Illegal cigar line $string!" );
626  }
627  }
628 
629  return \@features;
630 }
631 
632 
633 sub _parse_cigar {
634  my $self = shift;
635 
636  if ($self->{'align_type'} eq 'ensembl') {
637  return $self->_parse_ensembl_cigar();
638  }
639  else {
640  throw("No parsing method implemented for " . $self->{'align_type'});
641  }
642 }
643 
644 
645 =head2 _parse_features
646 
647  Arg [1] : listref Bio::EnsEMBL::FeaturePair $ungapped_features
648  Description: creates internal cigar_string and start,end hstart,hend
649  entries.
650  Returntype : none, fills in values of self
651  Exceptions : argument list undergoes many sanity checks - throws under many
652  invalid conditions
653  Caller : new
654  Status : Stable
655 
656 =cut
657 
658 sub _parse_features {
659  my ($self, $features) = @_;
660  if ($self->{'align_type'} eq 'ensembl') {
661  $self->_parse_ensembl_features($features);
662  } else {
663  throw("No _parse_features method implemented for " . $self->{'align_type'});
664  }
665 }
666 
667 my $message_only_once = 1;
668 
669 sub _parse_ensembl_features {
670  my ($self,$features ) = @_;
671 
672  if (ref($features) ne "ARRAY") {
673  throw("features must be an array reference not a [".ref($features)."]");
674  } elsif (scalar(@$features) == 0) {
675  throw("features array must not be empty");
676  }
677 
678  my $query_unit = $self->_query_unit();
679  my $hit_unit = $self->_hit_unit();
680 
681  my $strand = $features->[0]->strand;
682 
683  throw ('FeaturePair needs to have strand == 1 or strand == -1') if(!$strand);
684 
685  my @f;
686 
687  #
688  # Sort the features on their start position
689  # Ascending order on positive strand, descending on negative strand
690  #
691  if( $strand == 1 ) {
692  @f = sort {$a->start <=> $b->start} @$features;
693  } else {
694  @f = sort { $b->start <=> $a->start} @$features;
695  }
696 
697  my $hstrand = $f[0]->hstrand;
698  my $slice = $f[0]->slice();
699  my $hslice = $f[0]->hslice();
700  my $name = $slice ? $slice->name() : undef;
701  my $hname = $f[0]->hseqname;
702  my $score = $f[0]->score;
703  my $percent = $f[0]->percent_id;
704  my $analysis = $f[0]->analysis;
705  my $pvalue = $f[0]->p_value();
706  my $external_db_id = $f[0]->external_db_id;
707  my $hcoverage = $f[0]->hcoverage;
708  my $group_id = $f[0]->group_id;
709  my $level_id = $f[0]->level_id;
710 
711  my $seqname = $f[0]->seqname;
712  # implicit strand 1 for peptide sequences
713  $strand ||= 1;
714  $hstrand ||= 1;
715  my $ori = $strand * $hstrand;
716 
717  throw("No features in the array to parse") if(scalar(@f) == 0);
718 
719  my $prev1; # where last feature q part ended
720  my $prev2; # where last feature s part ended
721 
722  my $string;
723 
724  # Use strandedness info of query and hit to make sure both sets of
725  # start and end coordinates are oriented the right way around.
726  my $f1start;
727  my $f1end;
728  my $f2end;
729  my $f2start;
730 
731  if ($strand == 1) {
732  $f1start = $f[0]->start;
733  $f1end = $f[-1]->end;
734  } else {
735  $f1start = $f[-1]->start;
736  $f1end = $f[0]->end;
737  }
738 
739  if ($hstrand == 1) {
740  $f2start = $f[0]->hstart;
741  $f2end = $f[-1]->hend;
742  } else {
743  $f2start = $f[-1]->hstart;
744  $f2end = $f[0]->hend;
745  }
746 
747  #
748  # Loop through each portion of alignment and construct cigar string
749  #
750 
751  foreach my $f (@f) {
752  #
753  # Sanity checks
754  #
755 
756  if (!$f->isa("Bio::EnsEMBL::FeaturePair")) {
757  throw("Array element [$f] is not a Bio::EnsEMBL::FeaturePair");
758  }
759  if( defined($f->hstrand()) && $f->hstrand() != $hstrand ) {
760  throw("Inconsistent hstrands in feature array");
761  }
762  if( defined($f->strand()) && ($f->strand != $strand)) {
763  throw("Inconsistent strands in feature array");
764  }
765  if ( defined($name) && $name ne $f->slice->name()) {
766  throw("Inconsistent names in feature array [$name - ".
767  $f->slice->name()."]");
768  }
769  if ( defined($hname) && $hname ne $f->hseqname) {
770  throw("Inconsistent hit names in feature array [$hname - ".
771  $f->hseqname . "]");
772  }
773  if ( defined($score) && $score ne $f->score) {
774  throw("Inconsisent scores in feature array [$score - " .
775  $f->score . "]");
776  }
777  if (defined($f->percent_id) && $percent ne $f->percent_id) {
778  throw("Inconsistent pids in feature array [$percent - " .
779  $f->percent_id . "]");
780  }
781  if(defined($pvalue) && $pvalue != $f->p_value()) {
782  throw("Inconsistant p_values in feature arraw [$pvalue " .
783  $f->p_value() . "]");
784  }
785  if($seqname && $seqname ne $f->seqname){
786  throw("Inconsistent seqname in feature array [$seqname - ".
787  $f->seqname . "]");
788  }
789  my $start1 = $f->start; #source sequence alignment start
790  my $start2 = $f->hstart(); #hit sequence alignment start
791 
792  #
793  # More sanity checking
794  #
795  if (defined($prev1)) {
796  if ( $strand == 1 ) {
797  if ($f->start < $prev1) {
798  throw("Inconsistent coords in feature array (forward strand).\n" .
799  "Start [".$f->start()."] in current feature should be greater " .
800  "than previous feature end [$prev1].");
801  }
802  } else {
803  if ($f->end > $prev1) {
804  throw("Inconsistent coords in feature array (reverse strand).\n" .
805  "End [".$f->end() ."] should be less than previous feature " .
806  "start [$prev1].");
807  }
808  }
809  }
810 
811  my $length = ($f->end - $f->start + 1); #length of source seq alignment
812  my $hlength = ($f->hend - $f->hstart + 1); #length of hit seq alignment
813 
814  # using multiplication to avoid rounding errors, hence the
815  # switch from query to hit for the ratios
816 
817  #
818  # Yet more sanity checking
819  #
820  if($query_unit > $hit_unit){
821  # I am going to make the assumption here that this situation will
822  # only occur with DnaPepAlignFeatures, this may not be true
823  my $query_p_length = sprintf "%.0f", ($length/$query_unit);
824  my $hit_p_length = sprintf "%.0f", ($hlength * $hit_unit);
825  if( $query_p_length != $hit_p_length) {
826  throw( "Feature lengths not comparable Lengths:" .$length .
827  " " . $hlength . " Ratios:" . $query_unit . " " .
828  $hit_unit );
829  }
830  } else{
831  my $query_d_length = sprintf "%.0f", ($length*$hit_unit);
832  my $hit_d_length = sprintf "%.0f", ($hlength * $query_unit);
833  if( $length * $hit_unit != $hlength * $query_unit ) {
834  throw( "Feature lengths not comparable Lengths:" . $length .
835  " " . $hlength . " Ratios:" . $query_unit . " " .
836  $hit_unit );
837  }
838  }
839 
840  my $hlengthfactor = ($query_unit/$hit_unit);
841 
842  #
843  # Check to see if there is an I type (insertion) gap:
844  # If there is a space between the end of the last source sequence
845  # alignment and the start of this one, then this is an insertion
846  #
847 
848  my $insertion_flag = 0;
849  if( $strand == 1 ) {
850  if( ( defined $prev1 ) && ( $f->start > $prev1 + 1 )) {
851 
852  #there is an insertion
853  $insertion_flag = 1;
854  my $gap = $f->start - $prev1 - 1;
855  if( $gap == 1 ) {
856  $gap = ""; # no need for a number if gap length is 1
857  }
858  $string .= "$gap"."I";
859 
860  }
861 
862  #shift our position in the source seq alignment
863  $prev1 = $f->end();
864  } else {
865 
866  if(( defined $prev1 ) && ($f->end + 1 < $prev1 )) {
867 
868  #there is an insertion
869  $insertion_flag = 1;
870  my $gap = $prev1 - $f->end() - 1;
871  if( $gap == 1 ) {
872  $gap = ""; # no need for a number if gap length is 1
873  }
874  $string .= "$gap"."I";
875  }
876 
877  #shift our position in the source seq alignment
878  $prev1 = $f->start();
879  }
880 
881  #
882  # Check to see if there is a D type (deletion) gap
883  # There is a deletion gap if there is a space between the end of the
884  # last portion of the hit sequence alignment and this one
885  #
886  if( $hstrand == 1 ) {
887  if(( defined $prev2 ) && ( $f->hstart() > $prev2 + 1 )) {
888 
889  #there is a deletion
890  my $gap = $f->hstart - $prev2 - 1;
891  my $gap2 = int( $gap * $hlengthfactor + 0.5 );
892 
893  if( $gap2 == 1 ) {
894  $gap2 = ""; # no need for a number if gap length is 1
895  }
896  $string .= "$gap2"."D";
897 
898  #sanity check, Should not be an insertion and deletion
899  if($insertion_flag) {
900  if ($message_only_once) {
901  warning("Should not be an deletion and insertion on the " .
902  "same alignment region. cigar_line=$string\n");
903  $message_only_once = 0;
904  }
905  }
906  }
907  #shift our position in the hit seq alignment
908  $prev2 = $f->hend();
909 
910  } else {
911  if( ( defined $prev2 ) && ( $f->hend() + 1 < $prev2 )) {
912 
913  #there is a deletion
914  my $gap = $prev2 - $f->hend - 1;
915  my $gap2 = int( $gap * $hlengthfactor + 0.5 );
916 
917  if( $gap2 == 1 ) {
918  $gap2 = ""; # no need for a number if gap length is 1
919  }
920  $string .= "$gap2"."D";
921 
922  #sanity check, Should not be an insertion and deletion
923  if($insertion_flag) {
924  if ($message_only_once) {
925  warning("Should not be an deletion and insertion on the " .
926  "same alignment region. prev2 = $prev2; f->hend() = " .
927  $f->hend() . "; cigar_line = $string;\n");
928  $message_only_once = 0;
929  }
930  }
931  }
932  #shift our position in the hit seq alignment
933  $prev2 = $f->hstart();
934  }
935 
936  my $matchlength = $f->end() - $f->start() + 1;
937  if( $matchlength == 1 ) {
938  $matchlength = "";
939  }
940  $string .= $matchlength."M";
941  }
942 
943  $self->{'start'} = $f1start;
944  $self->{'end'} = $f1end;
945  $self->{'seqname'} = $seqname;
946  $self->{'strand'} = $strand;
947  $self->{'score'} = $score;
948  $self->{'percent_id'} = $percent;
949  $self->{'analysis'} = $analysis;
950  $self->{'slice'} = $slice;
951  $self->{'hslice'} = $hslice;
952  $self->{'hstart'} = $f2start;
953  $self->{'hend'} = $f2end;
954  $self->{'hstrand'} = $hstrand;
955  $self->{'hseqname'} = $hname;
956  $self->{'cigar_string'} = $string;
957  $self->{'p_value'} = $pvalue;
958  $self->{'external_db_id'} = $external_db_id;
959  $self->{'hcoverage'} = $hcoverage;
960  $self->{'group_id'} = $group_id;
961  $self->{'level_id'} = $level_id;
962 }
963 
964 
965 
966 
967 
968 
969 =head2 _hit_unit
970 
971  Args : none
972  Description: abstract method, overwrite with something that returns
973  one or three
974  Returntype : int 1,3
975  Exceptions : none
976  Caller : internal
977  Status : Stable
978 
979 =cut
980 
981 sub _hit_unit {
982  my $self = shift;
983  throw( "Abstract method call!" );
984 }
985 
986 
987 
988 =head2 _query_unit
989 
990  Args : none
991  Description: abstract method, overwrite with something that returns
992  one or three
993  Returntype : int 1,3
994  Exceptions : none
995  Caller : internal
996  Status : Stable
997 
998 =cut
999 
1000 sub _query_unit {
1001  my $self = shift;
1002  throw( "Abstract method call!" );
1003 }
1004 
1005 =head2 _mdtag_alignment_length
1006 
1007  Arg [1] : None
1008  Description: return the alignment length (including indels) based on the mdtag (mdz) string
1009  Returntype : int
1010  Exceptions : none
1011  Caller : internal
1012  Status : Stable
1013 
1014 =cut
1015 
1016 sub _mdtag_alignment_length {
1017  my $self = shift;
1018 
1019  if (! defined $self->{'_alignment_length'} && defined $self->cigar_string) {
1020  my $mdz_string = $self->cigar_string;
1021  my $chunks = $self->_get_mdz_chunks($mdz_string);
1022  my $alignment_length = 0;
1023  $alignment_length = $self->_get_mdz_alignment_length($chunks);
1024  $self->{'_alignment_length'} = $alignment_length;
1025  }
1026  return $self->{'_alignment_length'};
1027 }
1028 
1029 =head2 _get_mdz_chunks
1030 
1031  Arg [1] : mdtag string - MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* (Refer: SAM/BAM specification)
1032  Description: parses the mdtag string and group it according the type
1033  eg: MD:Z:35^VIVALE31^GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE10 returns ['35', '^', 'VIVALE', '31', '^', 'GRPLIQPRRKKAYQLEHTFQGLLGKRSLFTE', '10']
1034  Returntype : array of strings
1035  Exceptions : none
1036  Caller : internal
1037  Status : Stable
1038 
1039 =cut
1040 sub _get_mdz_chunks {
1041  my $self = shift;
1042 
1043  my $mdz_string = shift;
1044  my $alignment_length = 0;
1045  my @chunks;
1046  if ( $mdz_string =~ /^MD:Z:(.+)/ ){
1047  my $mdtag = $1;
1048 
1049  #if start to end is a number then all match return as it is
1050  if($mdtag =~/^\d+$/){
1051  return [$mdtag];
1052  } else{
1053  my @char_arr = split('',$mdtag);
1054  # Set up for the first loop, this also will handle
1055  # if the array is size one, as we'll have a character
1056  # in the current_chunk to push on the pile at the end.
1057  my $prev_type = $self->_get_mdz_chunk_type( $char_arr[0]);
1058  my $current_chunk = $char_arr[0];
1059  for(my $i = 1; $i <= $#char_arr; $i++) {
1060  my $cur_type = $self->_get_mdz_chunk_type( $char_arr[$i] );
1061  if($cur_type ne $prev_type) {
1062  # We've found a new character class, push the
1063  # current chunk on to the list
1064  push @chunks, $current_chunk;
1065  # Restart the current pile
1066  $current_chunk = $char_arr[$i];
1067  } else {
1068  # Same character class, put it on the current pile
1069  $current_chunk .= $char_arr[$i];
1070  }
1071  # Shift current type in to the prior slot
1072  $prev_type = $cur_type;
1073  }
1074 
1075  # Post loop cleanup of the last piece
1076  # This also handles an array of size one where
1077  # we never entered the loop, that one element
1078  # now gets pushed on the pile.
1079  push @chunks, $current_chunk;
1080  }
1081  }
1082  return \@chunks;
1083 }
1084 
1085 =head2 _get_mdz_alignment_length
1086 
1087  Arg [1] : array of strings
1088  Description: calculate the alignment length from the given chunks
1089  Returntype : array of strings
1090  Exceptions : none
1091  Caller : internal
1092  Status : Stable
1093 
1094 =cut
1095 
1096 sub _get_mdz_alignment_length{
1097  my $self = shift;
1098 
1099  my $chunks = shift;
1100  my $length = 0;
1101  for(my $i=0; $i< scalar(@$chunks); $i++){
1102  my $chunk = $chunks->[$i];
1103  my $type = $self->_get_mdz_chunk_type($chunk);
1104 
1105  if($type eq 'num'){
1106  $length = $length + $chunk;
1107  }elsif($type eq 'alpha'){
1108  $length = $length + length($chunk);
1109  }elsif($type eq 'del'){
1110  #skip next chunk
1111  $i++;
1112  }
1113 
1114  }
1115  return $length;
1116 }
1117 
1118 =head2 _get_mdz_chunk_type
1119 
1120  Arg [1] : char
1121  Description: get the chunk type
1122  Returntype : string
1123  Exceptions : none
1124  Caller : internal
1125  Status : Stable
1126 
1127 =cut
1128 
1129 sub _get_mdz_chunk_type{
1130  my $self = shift;
1131  my $char = shift;
1132 
1133  my $type;
1134  if($char eq '^'){
1135  $type = "del";
1136  } elsif($char =~ /\d+/){
1137  $type = 'num';
1138  } elsif($char =~ /\w+/){
1139  $type = 'alpha';
1140  }
1141  return $type;
1142 }
1143 
1144 
1145 =head2 _mdz_alignment_string
1146 
1147  Arg [1] : input sequence
1148  Arg [2] : MD Z String for mismatching positions. Regex : [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* (Refer: SAM/BAM specification)
1149  eg: MD:Z:96^RHKTDSFVGLMGKRALNS0V14
1150  Example : $pf->alignment_strings
1151  Description: Allows to rebuild the alignment string of both the seq and hseq sequence
1152  Returntype : array reference containing 2 strings
1153  the first corresponds to seq
1154  the second corresponds to hseq
1155  Exceptions : none
1156  Caller : general
1157  Status : Stable
1158 
1159 =cut
1160 
1161 # Try to get the sequence from translation object
1162 # mdz_string from current object
1163 sub _mdz_alignment_string {
1164  my ($self, $input_seq, $mdz_string) = @_;
1165 
1166  my $chunks = $self->_get_mdz_chunks($mdz_string);
1167 
1168  my $target_seq = "";
1169  my $query_seq = "";
1170 
1171  #Handle first chunk, which is always a num
1172  my $first_chunk = $chunks->[0];
1173  my $first_chunk_type = $self->_get_mdz_chunk_type($first_chunk);
1174 
1175  my $offset = 0;
1176  if($first_chunk_type eq 'num'){
1177  if($first_chunk > 0){
1178  $target_seq = substr($input_seq, $offset, $first_chunk);
1179 
1180  #query and target are same at this point
1181  $query_seq = $target_seq;
1182  $offset = length($target_seq);
1183  }
1184  }else{
1185  die "First chunk should always be a num...something wrong\n";
1186  }
1187 
1188  for(my $i=1; $i < scalar(@$chunks); $i++){
1189 
1190  my $chunk = $chunks->[$i];
1191  my $chunk_type = $self->_get_mdz_chunk_type($chunk);
1192 
1193  if($chunk_type eq 'num'){
1194  #if 0, what follows next is a INSERT
1195  if($chunk == 0){
1196  my $insert_chunk = $chunks->[++$i];
1197  my $insert_chunk_type = $self->_get_mdz_chunk_type($insert_chunk);
1198 
1199  #insert_chunk_type is always alpha
1200  die if $insert_chunk_type eq "num";
1201  if($insert_chunk_type eq 'del'){
1202  $i--;
1203  next;
1204  }
1205  $target_seq = $target_seq . substr($input_seq, $offset, length($insert_chunk));
1206  $query_seq = $query_seq . $insert_chunk;
1207 
1208  $offset = $offset + length($insert_chunk);
1209 
1210  }else{
1211  #if non-0, then it is a match
1212  my $match_chunk = $chunks->[$i];
1213  my $match_chunk_type = $self->_get_mdz_chunk_type($match_chunk);
1214 
1215  #$match_chunk_type is always num
1216  die if $match_chunk_type ne "num";
1217 
1218  my $match_target_string = substr($input_seq, $offset, $match_chunk);
1219 
1220  $target_seq = $target_seq . $match_target_string;
1221  $query_seq = $query_seq . $match_target_string;
1222 
1223  $offset = $offset + $match_chunk;
1224  }
1225 
1226  }elsif($chunk_type eq 'alpha'){
1227  my $insert_chunk = $chunk;
1228  my $insert_chunk_type = $self->_get_mdz_chunk_type($insert_chunk);
1229  die if $insert_chunk_type ne "alpha";
1230 
1231  $target_seq = $target_seq . substr($input_seq, $offset, length($insert_chunk));
1232  $query_seq = $query_seq . $insert_chunk;
1233 
1234  $offset = $offset + length($insert_chunk);
1235 
1236  }elsif($chunk_type eq 'del'){
1237  #get next chunk which contains the deleted sequence
1238  my $del_chunk = $chunks->[++$i];
1239  my $del_chunk_type = $self->_get_mdz_chunk_type($del_chunk);
1240 
1241  #del_chunk_type is always alpha
1242  die if $del_chunk_type ne 'alpha';
1243 
1244  $target_seq = $target_seq . "-" x length($del_chunk);
1245  $query_seq = $query_seq . $del_chunk;
1246 
1247  #no change in offset
1248  $offset = $offset;
1249 
1250 
1251  }
1252 
1253  }
1254 
1255  return [$target_seq, $query_seq];
1256 
1257 }
1258 
1259 
1260 1;
Bio::EnsEMBL::DnaPepAlignFeature
Definition: DnaPepAlignFeature.pm:12
Bio::EnsEMBL::BaseAlignFeature
Definition: BaseAlignFeature.pm:90
Bio::EnsEMBL::Feature
Definition: Feature.pm:47
Bio::EnsEMBL::FeaturePair
Definition: FeaturePair.pm:56
Bio::EnsEMBL::Analysis
Definition: PairAlign.pm:3