International Journal of Advanced Science and Technology
                                                      Vol. 29, No. 3, (2020), pp. 11251 - 11265

     A Comprehensive Review and Performance Evaluation of
       Sequence Alignment Algorithms for DNA Sequences

                     Neelofar Sohi1 and Amardeep Singh2
   1 Assistant Professor, Department of Computer Science & Engineering,
                 Punjabi University Patiala, Punjab, India
   2 Professor, Department of Computer Science & Engineering,
                 Punjabi University Patiala, Punjab, India
              1 sohi_ce@yahoo.co.in, 2 amardeepsingh@pbi.ac.in

                                       Abstract

Background: Sequence alignment is a very important step for high level sequence analysis
applications in the area of biocomputing. Alignment of DNA sequences helps in finding
the origin of sequences and homology between sequences, constructing phylogenetic trees
depicting evolutionary relationships, and other tasks. It also helps in identifying genetic
variations in DNA sequences which might lead to diseases. Objectives: This paper
presents a comprehensive review of sequence alignment approaches, methods and
various state-of-the-art algorithms. Performance evaluation and comparison of a few
algorithms and tools is performed. Methods and Results: In this study, various tools and
algorithms are studied and implemented, and their performance is evaluated and compared
using Identity Percentage as the main metric. Conclusion: It is observed that for
pairwise sequence alignment, Clustal Omega Emboss Matcher outperforms the other tools &
algorithms, followed by Clustal Omega Emboss Water, further followed by Blast
(Needleman-Wunsch algorithm for global alignment).

Keywords: sequence alignment; DNA sequences; progressive alignment; iterative
alignment; natural computing approach; identity percentage.

1. Brief History

Sequence alignment is a very important area in the field of Biocomputing and
Bioinformatics. Sequence alignment aims to identify the regions of similarity between
two or more sequences. Alignment is also termed ‘mapping’, which is done to identify
and compare the residues of the sequences: nucleotide bases (A, G, C and T) in DNA
sequences, nucleotide bases in RNA sequences and amino acids in proteins.

Sequence alignment acts as an important step in solving problems like finding homology,
determining the origin of a sequence, protein structure prediction, identifying new
members of a protein family, constructing evolutionary (phylogenetic) trees, identifying
mutations and further higher level sequence analyses [1-2].

Sequence analyses provide evidence that 99% of the genome sequences of different
individuals are identical [3] and that the difference of 1% is due to genetic variations
which can lead to inherited human diseases. Single Nucleotide Polymorphisms (SNPs),
insertions/deletions, block substitutions, inversions, variable number of tandem repeat
sequences (VNTRs) and copy number variations (CNVs) are a few common genetic
variations [4].

  ISSN: 2005-4238 IJAST
  Copyright ⓒ 2020 SERSC

Sequence alignment is a pre-processing step for detection and identification of these
genetic variations.

There are two categories of sequence alignment viz. pairwise sequence alignment (PSA)
and multiple sequence alignment (MSA). Pairwise sequence alignment involves
comparison of two sequences to identify the matching regions. MSA involves comparison
of more than two sequences, where there can be ‘n’ query sequences to be compared
against ‘n’ reference sequences. Before aligning the sequences, sequences of unequal
length need to be made equal in length by inserting gaps in between. A large number of
gap insertion algorithms are available. Optimality of a gap insertion algorithm relies on
maximisation of the number of matches.

Next generation sequencing technologies like Roche/454 (454 Life Sciences, 2013),
Illumina (Illumina, 2013) and SOLiD (SolidTM4 System, 2013), and large scale projects like
the Human Genome Project (Human Genome Project Information, 2013), the 1000 Genomes
Project (Home, 1000 Genomes, 2013) and the Genome 10K project (G.K.C.O Scientists, 2009),
are generating large volumes of sequence data. Alignment of multiple sequences of huge
lengths becomes a big challenge. As many high level applications depend upon alignment
of sequences, it becomes highly important to produce alignments with high accuracy, high
speed, good quality and low computational complexity [5-7]. For early MSA tools, time
complexity is O(L^n), where L is the length of a sequence and ‘n’ is the number of
sequences. Earlier, ‘L’ used to be large and ‘n’ used to be smaller than ‘L’, whereas in
the current situation ‘n’, the number of sequences, has become larger than ‘L’. Clustal
Omega is the only MSA algorithm which can handle up to 190,000 sequences, aligning them
in a few hours on a single processor. In a study conducted in 2013, Sievers et al. compared
18 standard automated MSA tools and packages with respect to scalability. The study
concluded that tools like PSAlign, Prank, FSA and Mummals can align up to 100 sequences.
Tools like Probcons, MUSCLE, MAFFT, ClustalW and MSAProbs can align up to 1000 sequences.
Tools such as Clustal Omega, Kalign and Part-Tree can align up to 50,000 sequences.
Some MSA methods produce high quality output but do not scale well to thousands of
sequences, whereas a few provide good scalability but produce poor quality
output [8]. Various researchers have reviewed the pairwise and multiple sequence
alignment methods and approaches [9-12]. Comparison as well as evaluation of alignment
methods has been done in certain studies [8], [13-16].

There are two types of sequence alignment viz. local and global alignment. Local
alignment is done to find local regions of high similarity between two sequences.
This is best suited for quite divergent sequences having local regions of high similarity.
The global alignment strategy performs end-to-end alignment by comparing the full lengths
of two sequences against each other. Local alignment algorithms produce alignment without
gaps, hence the problem of fixing the gap penalty is resolved. Global alignments provide
information for evolutionary comparisons and local alignments are useful for structural
predictions [17].

The paper is structured as follows: Section 2 describes major approaches, methods and
state-of-the-art algorithms & tools used for sequence alignment of DNA sequences.
Section 3 presents performance evaluation including performance metrics and results of
evaluation for various sequence alignment algorithms & tools. Section 4 presents
conclusion drawn from the study.


2. Sequence Alignment Approaches
2.1 Dynamic Programming

Dynamic Programming (DP) is the approach that produces an optimal alignment. The
Needleman-Wunsch algorithm is a DP based technique which produces global alignment
and the Smith-Waterman algorithm is another DP based technique which produces local
alignment. For obtaining an MSA, we try to maximize the Sum of Pairs (SP) score obtained
from the pairwise alignments of sequences.

There is no universally accepted objective function for MSA using the DP approach. The
time complexity of DP is O(L^n), where L is the length of a sequence and ‘n’ is the number
of sequences; for a pair of sequences (n = 2) this is O(L^2). DP produces an optimal
alignment for a pair of sequences, but the time complexity grows exponentially with the
number of sequences. DP involves the following steps [17]:

⚫    Every nucleotide in one sequence is compared to each and every nucleotide of the
     second sequence.

⚫    Results of this comparison are marked and stored in the form of an m*n matrix, where
     m and n are the lengths of the two sequences.

⚫    All paths in the matrix are searched to find the optimal alignment with highest score.
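
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration: the scoring values (match = 1, mismatch = -1, linear gap = -2) and the function names are assumptions chosen for the example and are not taken from the reviewed tools. The first function fills the matrix and traces back a global (Needleman-Wunsch) alignment; the second returns only the best local (Smith-Waterman) score.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right cell to recover one optimal path
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score: cells are clamped at zero and the maximum is kept."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
    print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

Both functions take time and memory proportional to m*n, which is the pairwise case of the complexity discussed above.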

DP provides the optimum alignment for a given objective function for the pairwise sequence
alignment problem through a trace-back procedure, whereas this trace-back procedure takes
exponential time for MSA [18]. MSA is an NP-complete problem where the aim is to identify
the MSA with maximum score among the set of found alignments. Hence, for MSA, more
sophisticated heuristic methods are required [19]. Agarwal et al. (2005) proposed a more
efficient version of DP which produces an optimal alignment. Bayat et al. (2019)
proposed a DP based method which produces semi-global alignment, where a few of the
first and last bases of the compared sequences can be skipped. This method extracts
‘Maximal Exact Matches’ (MEMs) from the compared sequences using shift and compare
operations on the two sequences. This method is suitable where the number of MEMs is
lower than the total number of bases (or nucleotides) in the sequences.

2.2 Heuristic Approach

Heuristic methods do not guarantee optimal solutions, but they provide a feasible
solution in a short amount of time. This is in contrast to the DP based approach, i.e. the
exact alignment approaches, which provide high quality, optimal or near optimal results [18].

2.2.1 Pairwise sequence alignment

For pairwise sequence alignment, BLAST, BLAT and FASTA are the popular tools based
on the heuristic approach, which produce a solution in a short amount of time.

2.2.1.1 BLAT: BLAT [20] stands for BLAST-Like Alignment Tool. BLAT was written
by Jim Kent. Its working principle is similar to that of BLAST, with the improvement that
it stores an index of the reference sequence in memory rather than the full sequence,
leading to a low memory requirement and enhanced speed of alignment. The index is used to
find areas of homology, which can then be loaded into memory for a detailed alignment.
BLAT is meant for finding sequences with high similarity from the same or closely-related
species. BLAT is available as a standalone tool and as a web application, and it is
integrated with the UCSC Genome Browser [21]. The steps involved in working of BLAT are
discussed below:


⚫    Break query sequence into query words. L-w+1 words are formed where L is length
     of sequence and w is word size (w=3 by default)

⚫    Break reference sequence into words of the same size

⚫    Compare the word list with database and find exact matches

⚫    Extend the match to neighbouring regions

⚫    Find High Scoring Segment Pairs (HSPs)
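
As a rough illustration of the word-breaking and indexing idea described above, the sketch below builds an in-memory index of the reference words and looks up the query words in it. The function names are hypothetical and the index here stores every overlapping word, which is a simplifying assumption and not a description of BLAT's actual implementation.

```python
from collections import defaultdict


def query_words(seq, w):
    """Break a sequence of length L into its L - w + 1 overlapping words."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_word_index(reference, w):
    """Map every word of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos, word in enumerate(query_words(reference, w)):
        index[word].append(pos)
    return index


def word_hits(query, index, w):
    """Exact word matches as (query_position, reference_position) pairs."""
    hits = []
    for qpos, word in enumerate(query_words(query, w)):
        for rpos in index.get(word, []):
            hits.append((qpos, rpos))
    return hits


if __name__ == "__main__":
    idx = build_word_index("ACGTACGT", 3)
    print(word_hits("TACG", idx, 3))
```

Only the index needs to stay in memory during the search; the matching regions of the reference can be fetched afterwards for the detailed alignment.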

2.2.1.2 BLAST: BLAST stands for Basic Local Alignment Search Tool. It was developed in
1990 by S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman and NCBI [22]. It
produces local alignments and can be used for DNA as well as protein sequence alignment.
It can compare one query sequence to a database of sequences. BLAST enables species
identification, locating domains, DNA mapping and annotation. BLAST has higher speed than
the FASTA program [23]. Different variants of BLAST are listed in table 1:

                                Table 1. Variants of BLAST

Program                      Query sequence                       Reference sequence

BLASTP                       Protein                              Protein
BLASTN                       Nucleotide                           Nucleotide
BLASTX                       Nucleotide (translated)              Protein
TBLASTN                      Protein                              Nucleotide (translated)
TBLASTX                      Nucleotide (translated)              Nucleotide (translated)

 The steps involved in working of BLAST are discussed below:

⚫    Break query sequence into query words. L-w+1 words are formed where L is length
     of sequence and w is word size (w=3 by default)

⚫    Break reference sequence into words of the same size

⚫    Compare the word list with database and find exact matches

⚫    Extend the match to neighbouring regions

⚫    Find High Scoring Segment Pairs (HSPs)
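
The seed-and-extend idea behind these steps can be sketched as follows. This is a simplified, ungapped illustration under assumed scores (match = 1, mismatch = -1) and an assumed drop-off threshold `x_drop`; the function names are hypothetical and real BLAST uses substitution matrices, neighbourhood words and statistical thresholds that are not modelled here.

```python
def find_seeds(query, ref, w):
    """Exact word matches between query and reference as (qpos, rpos) pairs."""
    ref_words = {}
    for r in range(len(ref) - w + 1):
        ref_words.setdefault(ref[r:r + w], []).append(r)
    return [(q, r)
            for q in range(len(query) - w + 1)
            for r in ref_words.get(query[q:q + w], [])]


def extend_seed(query, ref, qpos, rpos, w, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps in both directions.

    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen. Returns (score, q_start, r_start, length).
    """
    score = best = w * match
    best_right = 0
    i = 0
    while qpos + w + i < len(query) and rpos + w + i < len(ref):
        score += match if query[qpos + w + i] == ref[rpos + w + i] else mismatch
        i += 1
        if score > best:
            best, best_right = score, i
        elif best - score > x_drop:
            break
    score = best          # continue from the best right-hand extension
    best_left = 0
    i = 0
    while qpos - 1 - i >= 0 and rpos - 1 - i >= 0:
        score += match if query[qpos - 1 - i] == ref[rpos - 1 - i] else mismatch
        i += 1
        if score > best:
            best, best_left = score, i
        elif best - score > x_drop:
            break
    return best, qpos - best_left, rpos - best_left, w + best_left + best_right


if __name__ == "__main__":
    query, ref = "TTACGTACGG", "CCACGTACGA"
    hsps = [extend_seed(query, ref, q, r, 3) for q, r in find_seeds(query, ref, 3)]
    print(max(hsps))  # highest scoring segment pair found from any seed
```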

2.2.1.3 FASTA: FASTA was developed in 1988 as an improved version of FASTP, which was
developed by D.J. Lipman and W.R. Pearson in 1985 [24-25]. FASTP was used for
protein sequences only, whereas FASTA is suitable for DNA versus DNA comparison, translated
protein versus DNA comparison and for evaluating statistical significance. TFASTAX, TFASTAY,
FASTAX and FASTAY are the various programs of FASTA. FASTA is found to be more
accurate (in terms of sensitivity) than BLAST [23]. FASTA is derived from the concept of
the dot plot. It computes the best diagonals from all frames of alignment. It looks for
exact matches between words in the query sequence and the reference sequence.


 The steps involved in working of FASTA are discussed below:

⚫    Identify common k-words between I and J where I denotes query sequence and J
     denotes reference sequence. (words for DNA are 6 nucleotides long, like AGTCCA,
     and words for proteins are 2 amino acids long)

⚫    Score diagonals with k-word matches, identify 10 best diagonals. High scoring
     diagonals are selected where offset is given by i-j

⚫    Rescore initial regions with a substitution score matrix like PAM

⚫    Join initial regions using gaps; then penalise the gaps

⚫    Perform Dynamic Programming to find final alignments
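
The first two steps above (finding common k-words and scoring diagonals by the offset i-j) can be illustrated with a short sketch. The function name and the choice of simply counting word hits per diagonal are assumptions made for the example; the rescoring, joining and DP steps are not shown.

```python
from collections import Counter, defaultdict


def best_diagonals(query, ref, k=6, top=10):
    """Count k-word matches per diagonal (offset i - j) and return the best ones."""
    positions = defaultdict(list)
    for j in range(len(ref) - k + 1):
        positions[ref[j:j + k]].append(j)
    diag_hits = Counter()
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], ()):
            diag_hits[i - j] += 1      # hits on the same diagonal share one offset
    return diag_hits.most_common(top)


if __name__ == "__main__":
    # The query occurs in the reference shifted by two positions,
    # so every 6-word hit falls on the diagonal with offset -2.
    print(best_diagonals("AGTCCAGGT", "TTAGTCCAGGT"))
```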

The FASTA tool reports a z-score and an E-value to measure the significance of an alignment.

2.2.2 Multiple Sequence Alignment

Progressive alignment and Iterative alignment are two popular techniques based on
heuristic approach used for MSA.

2.2.2.1 Progressive alignment: Progressive alignment is the most popular heuristic for
producing MSA [26]. It was developed by Feng and Doolittle [27]. In this technique, a
scoring matrix is prepared based on similarity. Sequences are aligned in the order of their
similarity. The steps involved in progressive alignment technique are discussed below:

⚫    Perform pairwise alignments using Needleman-Wunsch algorithm, Smith-Waterman
     algorithm, k-mer algorithm or k-tuple algorithm

⚫    Next, clustering of sequences is done using mBed or k-means algorithm [28]

⚫    Distance scores are obtained from similarity scores and construction of guide trees
     (or dendrograms) is done from distance scores using Neighbour-Joining or
     Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

⚫    Based on the guide tree, most similar sequences are first aligned followed by less
     similar sequences and so on
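
As an illustration of the guide tree step, the following sketch builds a UPGMA tree from a distance matrix and returns it as nested tuples. The function name, the input format and the tuple representation are assumptions made for the example; production tools use more efficient data structures.

```python
def upgma(names, dist):
    """Build a UPGMA guide tree.

    names: list of sequence labels; dist: symmetric matrix as a list of lists.
    Returns the tree as nested tuples, e.g. ("D", ("C", ("A", "B"))).
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}      # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the two closest clusters
        (a, b), _ = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted arithmetic mean of the two old distances
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


if __name__ == "__main__":
    labels = ["A", "B", "C", "D"]
    matrix = [[0, 2, 6, 10],
              [2, 0, 6, 10],
              [6, 6, 0, 10],
              [10, 10, 10, 0]]
    print(upgma(labels, matrix))
```

Reading the result from the innermost tuple outwards gives the order in which a progressive aligner would add sequences: A with B first, then C, then D.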

It generally takes time of O(N^2) for a few thousand sequences of medium length, where
N is the number of sequences [8]. This approach was proposed by a number of researchers
[27], [29-33]. A number of tools are based on the progressive alignment approach, such as
ClustalW [34], Clustal Omega [8], MAFFT [35], Kalign [13], Probalign [36], MUSCLE
[37], DIALIGN [38], PRANK [39], FSA [40], T-Coffee [41], ProbCons (Notredame et al.,
2000), MULTALIGN (Barton and Sternberg, 1987), MULTAL [42], MAP [43], PCMA
[44], MUMMALS/PROMALS [45-46] and MSAProbs [47]. Progressive alignment
produces fast, efficient and reasonable alignments. The major problem with the progressive
alignment approach is that it considers only two sequences at a time while the rest of the
sequences are ignored. This makes it a greedy approach, and optimal results
cannot be guaranteed with it. The second problem is that the quality of the final result
relies upon the initially selected pair of sequences, hence an error propagates throughout
the alignment and cannot be fixed at later stages. This problem was overcome by Gotoh.
The third problem concerns the selection of gap parameters [17].


➢    ClustalW

ClustalW [34] represents the third generation of the Clustal series of programs. It was
developed by Thompson in 1994. The most popular methods for global MSA belong to this
Clustal family. In 1988, the first Clustal program was proposed by Des Higgins. It was
designed to be used on a personal computer with quite low computing power. It was based on
DP and the progressive alignment approach. In 1992, another tool named ClustalV was
developed, in which alignment of alignments, termed ‘profile alignment’, was done. The tree
was then generated using the Neighbor-Joining (NJ) method [48]. The steps involved in
working of ClustalW are discussed below:

⚫    Perform alignment between every pair of sequences using k-tuple method proposed
     by Wilbur and Lipman [49] or Needleman-Wunsch algorithm [50]

⚫    Convert the similarity scores of pairwise alignments to distance scores

⚫    Construct guide tree based on distance scores using Neighbour-Joining method

⚫    Finally MSA is obtained by progressively aligning the sequences in the order given
     by the guide tree beginning from tip of the tree going down to its root
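
The first two steps (pairwise similarity by a k-tuple method, then conversion to distances) can be illustrated with a deliberately simplified sketch. The similarity used here, the fraction of k-tuples shared between two sequences, is an assumption chosen for brevity; it is not the exact Wilbur and Lipman scoring used by ClustalW, and the function names are hypothetical.

```python
from collections import Counter


def ktuples(seq, k):
    """Multiset of the overlapping k-tuples of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ktuple_distance(a, b, k=2):
    """Distance = 1 - (shared k-tuples / k-tuples in the shorter sequence)."""
    ca, cb = ktuples(a, k), ktuples(b, k)
    shared = sum((ca & cb).values())          # multiset intersection
    shortest = min(sum(ca.values()), sum(cb.values()))
    if shortest == 0:
        return 1.0                            # sequence shorter than k: no evidence
    return 1.0 - shared / shortest


def distance_matrix(seqs, k=2):
    """Symmetric matrix of pairwise k-tuple distances."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = ktuple_distance(seqs[i], seqs[j], k)
    return dist


if __name__ == "__main__":
    for row in distance_matrix(["ACGTACGT", "ACGTACGA", "TTTTGGGG"]):
        print(row)
```

A matrix of this form is what the guide tree construction in the third step takes as input.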

ClustalW is found to have better quality, sensitivity and speed as compared to its
counterparts. Sometimes, different weights are attached to gaps, which leads to
different final results [17].

➢    MUSCLE

MUSCLE stands for MUltiple Sequence Comparison by Log-Expectation. It is based on the
progressive alignment approach and performs multiple sequence alignment. It has better
accuracy and higher speed of alignment than ClustalW2 and T-Coffee. The steps involved
in working of MUSCLE are discussed below [37]:

⚫    Guide tree is constructed using UPGMA method

⚫    Guide tree guides the progressive alignment to produce the initial MSA

⚫    Next, Kimura distance method is used to re-estimate the initial guide tree

⚫    Progressive alignment is done using second guide tree producing second MSA

⚫    If the SP score gets improved with second MSA then new alignment is kept
     otherwise it is discarded and first alignment is kept.
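
Two quantities used in these steps, the SP (Sum of Pairs) score of an MSA and the Kimura distance between two aligned sequences, can be sketched as below. The scoring values are assumptions for the example. The Kimura formula shown, d = -ln(1 - D - D^2/5) with D the fraction of differing positions, is the commonly quoted correction derived for protein sequences and is only defined while 1 - D - D^2/5 stays positive; how MUSCLE handles larger D is not modelled here.

```python
import math
from itertools import combinations


def sp_score(msa, match=1, mismatch=-1, gap=-2):
    """Sum of Pairs score: score every pair of residues in every column."""
    total = 0
    for column in zip(*msa):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue                  # gap against gap is not scored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


def kimura_distance(row_a, row_b):
    """Kimura distance between two aligned rows, ignoring columns with gaps."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    d_frac = sum(x != y for x, y in pairs) / len(pairs)
    return -math.log(1 - d_frac - d_frac * d_frac / 5)


if __name__ == "__main__":
    msa = ["ACGT", "A-GT", "ACGA"]
    print(sp_score(msa))
    print(kimura_distance(msa[0], msa[2]))
```

Comparing `sp_score` for the first and second MSA corresponds to the final step above, where the alignment with the better SP score is kept.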

Edgar came up with MUSCLE-fast, an improved version of the program MUSCLE which
provides higher accuracy and speed. It can align 1000 sequences of average length of 282
in around 21 seconds on a personal computer [51].

➢    Clustal Omega

Clustal Omega has replaced the older Clustal-W tool. It is the latest tool from the Clustal
suite of tools. Accuracy of Clustal Omega is similar to that of its other high-quality
counterparts, whereas for a large number of sequences it performs better than its
counterparts. It has lower execution time and high accuracy. Sievers et al., in their study,
used Clustal Omega to align 190,000 sequences on a single processor in a few hours [8].
The steps involved in working of Clustal Omega are discussed below:

⚫    Pairwise alignments are produced using k-tuple method (same as used by older
     Clustal-W)


⚫    Sequences are clustered using mBed method which works by embedding each
     sequence in a space of ‘n’ dimensions where ‘n’ is proportional to logN. mBed has a
     complexity of O(NlogN)

⚫    K-means++ clustering [28] is used for clustering of sequences. This algorithm
     eliminates the problem of selecting initial cluster centers in k-means hence improving
     its speed and accuracy

⚫    UPGMA method is used to construct guide tree

⚫    HHalign package is used to produce final MSA [8]. HHalign was proposed by
     Johannes Söding in 2005 [52].
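
The seeding idea of k-means++ mentioned in the steps above can be sketched as follows: the first centre is picked at random and each further centre is picked with probability proportional to its squared distance from the nearest centre chosen so far. The function name is hypothetical, and the input points are assumed to be numeric vectors such as the mBed embeddings of the sequences; the subsequent k-means iterations are not shown.

```python
import random


def kmeans_pp_seeds(points, k, rng=None):
    """Choose k initial cluster centres from points using k-means++ seeding."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen centre
        d2 = [min(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers)
              for p in points]
        if sum(d2) == 0:
            # Every point coincides with a centre: fall back to a uniform choice
            centers.append(rng.choice(points))
            continue
        # Points far from all current centres are more likely to be selected
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


if __name__ == "__main__":
    embedded = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans_pp_seeds(embedded, 2, random.Random(1)))
```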

➢    T-Coffee

T-Coffee [15] is based on the progressive alignment approach. It performs both pairwise and
multiple sequence alignment. It stands for Tree-based Consistency Objective Function for
Alignment Evaluation. It performs MSA using an iterative approach. Its major shortcoming
is that it can align only up to 100 sequences without affecting accuracy. T-Coffee is found
to have 5-10% better accuracy than Clustal-W. The steps involved in working of T-Coffee are
discussed below:

⚫    Distance matrix is produced from pairwise alignments

⚫    Guide tree is formed using Neighbour-Joining method

⚫    Guide tree guides the grouping of sequences during MSA

⚫    DP is used to align two most similar sequences and process goes on until all
     sequences are aligned

Tommaso et al. (2011) developed a new interface for T-Coffee. There is a standard
T-Coffee mode for protein and nucleotide sequences, an M-Coffee mode which combines
alignments produced by various methods, and a template based mode of T-Coffee. It is a
parallelized algorithm which provides higher accuracy and can align up to 150 sequences
[41].

➢    M-Coffee

The M-Coffee method combines the alignments produced by various methods. A distance matrix
is computed where each value depicts the difference between two methods in terms of the
number of bases aligned similarly in their results. Based on the distance matrix, methods
are arranged in the form of a tree. Then their outputs are combined as per the method tree.
It is found to deliver better results than any single method [14].

➢    MultAlin

This method proposed by Corpet in 1988 is based on DP approach of pairwise sequence
alignment. It performs multiple sequence alignment with hierarchical clustering. Steps
involved in its working are as follows:

⚫    Firstly, closest sequences are aligned to form set of sequences and then all sequences
     in these sets are aligned.

⚫    Now, sequences are aligned in hierarchical order forming another matrix.


⚫    If this alignment is different from the one produced in step 1, process is iterated until
     they converge [17].

➢    DIALIGN

This tool is used for pairwise as well as multiple sequence alignment. It can identify
similarity at a local level when sequences are not globally similar. It came up in 1996 at
the University of Bielefeld. Its original version works with the help of primary-sequence
based information only, without requiring human input. Advanced versions of DIALIGN
make use of expert knowledge. It can be used for protein as well as DNA sequences.
Presently, the University of Göttingen is carrying out extensive research on DIALIGN [53].
Schmollinger et al. (2004) came up with the idea of executing DIALIGN on multiple
processors to reduce the running time required for alignment. The running time of DIALIGN
was reduced by up to 97% [54]. Subramanian et al. (2005) came up with DIALIGN-T, an
improved version of DIALIGN [55].

2.2.2.2 Iterative alignment: Iterative approach is an improvement of progressive
alignment. The underlying principle of this approach is same as that of progressive
alignment with repeated application of DP to perform re-alignment of sequences so that
overall alignment quality is improved. This also removes any errors introduced in initial
alignment hence enhancing overall accuracy of alignment [56]. Iterative alignment
approach is applicable to few hundred sequences only [8]. A number of algorithms and
tools are available based on iterative approach such as PRRP [57], MUSCLE [37],
Dialign [58], SAGA [59], MAFFT [35], PRIME [60] and T-Coffee [15].

➢    MAFFT

MAFFT stands for Multiple Alignment using Fast Fourier Transform. It is based on
progressive and iterative alignment approach [35]. The various steps involved in working
of MAFFT are discussed below:

⚫    Fast Fourier Transform is used to identify homologous regions

⚫    A simple scoring system is used to reduce the CPU time and improve the accuracy of
     alignment

MAFFT is based upon two heuristics viz. a progressive method (FFT-NS-2) and an
iterative refinement method (FFT-NS-i). Firstly, FFT-NS-2 is used to build an alignment
from pairwise distances and then FFT-NS-i is used to refine the result. FFT-NS-i is more
accurate than FFT-NS-2, whereas FFT-NS-2 performs faster alignments. The PartTree option
can be used, which offers high scalability in alignment for up to 50,000 sequences
[35].

➢    Kalign

Kalign is a global progressive alignment method based on a string matching algorithm
called Wu-Manber, which is used to calculate the distance between sequences. It introduces
the factor of local matches into the global alignment strategy. Distance between two
sequences is given by the Levenshtein edit distance: there is a distance ‘d’ between two
sequences P and Q if P can be changed to Q through ‘d’ mismatches, insertions or
deletions. The following steps are involved in working of Kalign:

⚫    Firstly, pairwise distances are calculated using k-tuple method.

⚫    Next, guide tree is formed using UPGMA or Neighbor-Joining method.


Kalign provides higher speed and accuracy even for large number of sequences. It is
found to be faster than ClustalW [13].
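
The Levenshtein edit distance used in the description of Kalign can be computed with the standard dynamic programming recurrence, sketched below. This is the textbook exact computation and is given only to make the definition concrete; it is not the approximate Wu-Manber matching that Kalign relies on for speed.

```python
def levenshtein(p, q):
    """Minimum number of mismatches, insertions and deletions turning p into q."""
    previous = list(range(len(q) + 1))        # distances from "" to q[:j]
    for i, cp in enumerate(p, start=1):
        current = [i]                         # distance from p[:i] to ""
        for j, cq in enumerate(q, start=1):
            current.append(min(previous[j] + 1,                # delete cp
                               current[j - 1] + 1,             # insert cq
                               previous[j - 1] + (cp != cq)))  # match or mismatch
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("ACGT", "AGT"))
    print(levenshtein("kitten", "sitting"))
```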

2.3 Burrows Wheeler Aligner (BWA)
BWA [61] is based upon the Burrows Wheeler Transform (BWT), which is used for compression
and pattern matching. After performing BWT, a string comprising the last characters of the
sorted rotations and an index are obtained, which enable pattern matching. This principle is
used in BWA to align the sequences. It is good for mapping less divergent sequences against
a large reference genome such as the human genome. It has three different algorithms, the
first of which is run through the aln/samse/sampe commands:

⚫    BWA-backtrack (BWA aln/samse/sampe): It is used for sequence reads up to 100 bp.

⚫    BWA-SW: It is used for sequence reads ranging from 70 bp up to 1 Mbp.

⚫    BWA-MEM: It is used for sequence reads ranging from 70 bp up to 1 Mbp.
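
The BWT principle described above can be illustrated with a naive sketch: the transform is built by sorting all rotations of the text and the number of exact occurrences of a pattern is found by backward search over the transformed string. The function names are hypothetical, and the quadratic construction and linear-time counting are simplifications for readability; BWA itself uses compact precomputed index structures and also handles mismatches and gaps.

```python
def bwt(text):
    """Burrows Wheeler Transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern in the original text by backward search."""
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c] = number of characters in the text that sort before c
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def occ(ch, i):
        """Occurrences of ch in bwt_str[:i] (computed naively here)."""
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)        # current range of sorted rotations
    for ch in reversed(pattern):    # process the pattern from its last character
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("ACGTACGT")
    print(transformed, count_occurrences(transformed, "ACG"))
```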

2.4 Bowtie
There are many new algorithms and tools coming up for short sequence reads. Bowtie
[62], Maq [63] and SOAP [64] are a few prominent tools in this category. Bowtie produces
very fast alignments and requires less memory space. Bowtie 2 is an improved version of
Bowtie. Bowtie 2 is available as a part of another tool named CodonCode Aligner, which
is a GUI tool. CodonCode Aligner performs sequence alignment, assembly and mutation
detection.

2.5 Hidden Markov Model Approach

ProbCons [65] is a method based on Hidden Markov Model, a statistical model [18]. It is
a progressive alignment method which combines probabilistic modeling and consistency
based alignment. MUMMALS and PROMALS are extended form of ProbCons.

2.6 Natural Computing Approach

Methods based on heuristic approach are fast but they do not provide optimal solutions.
Natural computing algorithms are gaining importance as they provide optimal or
near-optimal solution. A number of hybrid techniques have been proposed by various
researchers based upon the natural computing algorithms [66-69]. These are also called
stochastic methods. They aim at solving complex and poorly defined optimization
problems. Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony
Optimization (ACO) and Artificial Bee Colony are the prominent names in this category.
A number of genetic algorithms have been proposed as discussed by Chowdhury and
Garai in their study [18]. A new algorithm has been developed based on Flower
Pollination Algorithm (FPA) termed as Pairwise Sequence Alignment with Flower
Pollination Algorithm (PSAFPA). It is inspired from the pollination of flowering plant
species [70].

         Table 2. List of few Sequence Alignment Algorithms with their URLs

    Algorithm/Tool                          URL

    ClustalW [34]                           http://www.ebi.ac.uk/clustalw/
    T-Coffee [15]                           www.ebi.ac.uk/Tools/msa/tcoffee/
    DiAlign-T [55]                          http://dialign-t.gobics.de/
    MAFFT [35]                              www.ebi.ac.uk/Tools/msa/mafft/
    BLAST [22]                              blast.ncbi.nlm.nih.gov
    BLAT [20]                               http://genome.ucsc.edu/cgi-bin/hgBlat
    FASTA [24-25]                           ncbi.nlm.nih.gov
    Clustal Omega [8]                       www.ebi.ac.uk/Tools/msa/clustalo/
    Muscle [37]                             www.ebi.ac.uk/Tools/msa/muscle/
    PRIME [60]                              http://prime.cbrc.jp/
    Burrows Wheeler Aligner (BWA) [61]      http://bio-bwa.sourceforge.net/
    Bowtie2 [62]                            https://www.codoncode.com/aligner/

3. Performance Evaluation of Sequence Alignment tools
3.1 Evaluation Metrics

As sequence alignment acts as a very important pre-processing step for further high level
tasks such as identification of genetic variations, it becomes highly important to produce
alignments with high accuracy, high speed, good quality and low computational
complexity [5], [51], [71]. The first aspect of accuracy is how well the produced
alignment adheres to the true alignment of the sequences. The second aspect is whether the
insertions, deletions and gaps are indicated at the right positions. Computational
complexity includes time, memory and CPU requirements. Complexity of an MSA tool is
calculated as O(L^n), where L is the length of the sequence and ‘n’ is the number
of sequences to be aligned [19]. Identity score and Identity percentage (the identity score
expressed as a percentage) are quite suitable measures of alignment quality. Identity score
reflects the number of matches found against the total number of compared nucleotide bases.
The higher the identity score, the higher is the similarity found between the compared
sequences and hence the better is the alignment quality of the algorithm (or tool). Identity
percentage for sequences of unequal length is measured by the following formula:

% Identity = {(2 * S) / (L1 + L2)} * 100

where  L1 is the length of seq1
       L2 is the length of seq2
       S is the number of matches


For sequences of equal length, identity percentage is given by the formula:

%Identity= (No.of matches/Total length of sequence)*100
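
As an illustration of the two formulas above, the following minimal Python sketch counts the matches in a gapped pairwise alignment and computes the identity percentage. The function names, the '-' gap symbol and the example sequences are assumptions made for illustration only and are not taken from the tools evaluated in this study. L1 and L2 are taken here as the ungapped sequence lengths, which is also an assumption.

```python
def count_matches(aligned1: str, aligned2: str, gap: str = "-") -> int:
    """Count columns where both rows hold the same non-gap base (S)."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned rows must have the same number of columns")
    return sum(1 for a, b in zip(aligned1, aligned2) if a == b and a != gap)


def percent_identity(matches: int, len1: int, len2: int) -> float:
    """%Identity = (2 * S) / (L1 + L2) * 100."""
    if len1 <= 0 or len2 <= 0:
        raise ValueError("sequence lengths must be positive")
    if not 0 <= matches <= min(len1, len2):
        raise ValueError("matches must lie between 0 and the shorter length")
    return 100.0 * 2 * matches / (len1 + len2)


if __name__ == "__main__":
    # Hypothetical alignment of a 5-base and a 6-base sequence.
    row1, row2 = "ACGT-A", "ACCTGA"
    s = count_matches(row1, row2)
    l1 = len(row1.replace("-", ""))  # ungapped length of seq1
    l2 = len(row2.replace("-", ""))  # ungapped length of seq2
    print(s, l1, l2, round(percent_identity(s, l1, l2), 2))
```

When L1 = L2 = L, the expression 2S/(L1+L2) reduces to S/L, so the same function also covers the equal-length formula.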

Another way to evaluate an alignment is to compare it with an accurate reference
alignment. This reference alignment is taken as true and is termed a test case. Julie D.
Thompson suggested this evaluation system. The reference alignment (test case) selected
for evaluation affects the evaluation process, and not all algorithms perform equally on
the problems posed by a given test case. Lambert et al. (2003) suggest that users should
select the alignment method according to the required sensitivity and selectivity, i.e.
according to the set of sequences to be aligned. Results are more reliable when a
consensus can be built from the results of different methods.
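
The comparison against a reference alignment can be sketched as follows. The text does not fix a particular scoring rule, so this sketch assumes a simple sum-of-pairs style measure: the fraction of base pairs aligned in the reference that are also aligned in the test alignment. The function names and the example alignments are hypothetical.

```python
def aligned_pairs(row1: str, row2: str, gap: str = "-") -> set:
    """Return the set of (i, j) positions of seq1 and seq2 sharing a column."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same number of columns")
    pairs = set()
    i = j = 0  # positions in the ungapped sequences
    for a, b in zip(row1, row2):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def reference_agreement(test: tuple, reference: tuple) -> float:
    """Fraction of reference-aligned pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned pairs")
    return len(ref_pairs & aligned_pairs(*test)) / len(ref_pairs)


if __name__ == "__main__":
    reference = ("ACGT-A", "AC-TGA")  # hypothetical "true" alignment
    test = ("ACGTA", "ACTGA")         # hypothetical tool output, no gaps
    print(reference_agreement(test, reference))
```

A score of 1.0 means that every aligned pair of the reference is reproduced. The score depends on the chosen reference, in line with the observation above that the test case affects the evaluation.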

3.2 Results and Discussion

In this study, various tools and algorithms are studied and implemented, and their
performance is evaluated and compared. DNA sequences are retrieved from the National
Center for Biotechnology Information (NCBI), available at https://www.ncbi.nlm.nih.gov/.
Identity percentage is used as the measure of alignment quality to evaluate and compare
the discussed algorithms and tools. Results of sequence alignment for three pairs of
sequences are presented in Table 3.

  Table 3. Percentage Identity Values for Various Sequence Alignment Algorithms

                                              Percentage Identity
  Algorithm/Tool                      Seq1 & Seq2    Seq3 & Seq4    Seq5 & Seq6
  Clustal Omega Emboss Needle         47.8%          62.3%          35.6%
  Clustal Omega Emboss Stretcher      46.2%          64.9%          48.6%
  Clustal Omega Emboss Water          70.8%          78.7%          55.6%
  Clustal Omega Emboss Matcher        100%           78.7%          84.4%
  MUSCLE                              44.19%         62.67%         54.05%
  T-Coffee (using M-Coffee)           38.46%         56.75%         50.64%
  T-Coffee (ebi.ac.uk)                43.59%         83%            50.67%
  MAFFT                               41.02%         59.45%         50.67%
  Blast (Needleman-Wunsch)            51%            65%            52%
  Clustal W                           43.59%         -              50.67%
  PSAFPA                              38.35%         45.11%         48.64%

From the results, it is observed that for pairwise sequence alignment, Clustal Omega
Emboss Matcher outperforms the other tools and algorithms, followed by Clustal Omega
Emboss Water and then by Blast (Needleman-Wunsch algorithm for global
alignment). Apart from these algorithms, Burrows Wheeler Aligner (BWA) and Bowtie2
(as part of Codon Code Aligner) are also implemented as part of this study.
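
For reference, the Needleman-Wunsch global alignment named above can be sketched in a few lines of Python. This is a minimal illustration with a linear gap penalty. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name are assumptions and do not reproduce the parameters of any tool listed in Table 3.

```python
def needleman_wunsch(seq1: str, seq2: str,
                     match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment by dynamic programming; returns (row1, row2, score)."""
    n, m = len(seq1), len(seq2)
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in seq2
                              score[i][j - 1] + gap)      # gap in seq1

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row1.append(seq1[i - 1])
                row2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(row1)), "".join(reversed(row2)), score[n][m]


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has (L1+1) x (L2+1) cells, so time and memory grow with the product of the two sequence lengths. A different scoring scheme can change the alignment and therefore the identity percentage.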

4. Conclusions
The biggest challenge in sequence alignment is to produce alignment results with good
quality, high accuracy, high speed and low computational complexity. Multiple Sequence
Alignment (MSA) is a computationally intensive task whose complexity is many times
higher than that of pairwise sequence alignment. Speed of alignment and computational
complexity are adversely affected as the length of the sequences increases. Heuristic
methods provide a feasible alignment faster than the Dynamic Programming approach, which
has high time complexity. Manual refinement of alignment results continues to dominate,
as fully automated algorithms are not able to meet the demand. Big data technologies,
parallelism and cloud computing offer a promising solution for aligning a large number
of sequences in a shorter amount of time. It is observed from the evaluation of methods
that no single method performs well for every test sequence. Some methods are good for
long sequences whereas others provide high-quality solutions for short sequences. A
hybrid method combining the features of several state-of-the-art methods can be a good
solution.
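
To illustrate why exact dynamic programming becomes impractical for MSA, the short Python sketch below counts the cells of a full dynamic-programming table, which is on the order of L^n for n sequences of length L. The function name and the chosen values of L and n are illustrative assumptions.

```python
def dp_table_cells(length: int, n_sequences: int) -> int:
    """Order-of-magnitude cell count of a full DP table: L ** n."""
    if length < 1 or n_sequences < 2:
        raise ValueError("need length >= 1 and at least two sequences")
    return length ** n_sequences


if __name__ == "__main__":
    # Each extra sequence multiplies the table size by L.
    for n in range(2, 6):
        print(n, dp_table_cells(100, n))
```

Each additional sequence multiplies the table size by L, which is why heuristic methods are preferred once more than a few sequences are aligned.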

References
1. X. Xia, Editor, “Bioinformatics and the Cell: Modern Computational Approaches in Genomics: Proteomics
    and Transcriptomics”, vol. XVI.
2. E.M. Mohamed, H.M. Mousa and A.E. Keshk, “Comparative Analysis of Multiple Sequence Alignment
    Tools”, I.J. Information Technology and Computer Science, vol. 8, (2018), pp. 24-30.
3. R. Sachidanandam, D. Weissman, S.C. Schmidt, J.M. Kakol, L.D. Stein, G. Marth, S. Sherry, J.C. Mullikin,
    B.J. Mortimore, D.L. Willey, S.E. Hunt, C.G. Cole, P.C. Coggill, Z. Ning, J. Rogers, D.R. Bentley, P.Y.
    Kwok, E.R. Mardis, R.T. Yeh, B. Schultz, L.Cook, R. Davenport, M.Dante, L. Fulton, L. Hillier, R.H.
    Waterston, J.D. McPherson, B. Gilman, S. Schaffner, W.J. Van Etten, D. Reich, J. Higgins, M.J.Daly, B.
    Blumenstiel, J. Baldwin, N. Stange-Thomann, M.C. Zody, L. Linton, E.S. Lander, D. Altshuler and
    International SNP Map Working Group, “A map of human genome sequence variation containing 1.42
    million single nucleotide polymorphisms”, Nature, vol. 409, (2001), pp. 928-933.
4. F.S. Collins, L.D. Brooks and A. Chakravarti, “A DNA polymorphism discovery resource for research on
    human genetic variation”, Genome Research, vol. 8, (1998), pp. 1229–1231.
5. C. Kemena and C. Notredame, “Upcoming challenges for multiple sequence alignment methods in the
    high-throughput era”, Bioinformatics, vol. 25, no. 19, (2009), pp. 2455-2465.
6. R.C. Edgar and S. Batzoglou, “Multiple Sequence Alignment”, Current Opinion in Structural Biology, vol.
    16, no. 3, (2006), pp. 368-373.
7. C. Notredame, “Recent Evolutions of Multiple Sequence Alignment Algorithms”, PLoS Computational
    Biology, vol. 3, no. 8, (2007).
8. F. Sievers, A. Wilm, D. Dineen, T.J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J.
    Soding, J.D. Thompson and D.G. Higgins, “Fast, scalable generation of high-quality protein multiple
    sequence alignments using Clustal Omega”, Molecular Systems Biology, vol. 7, (2011).
9. M. Chatzou, C. Magis, J.M. Chang, C. Kemena, G. Bussotti, I. Erb and C. Notredame, “Multiple
    sequence alignment modeling: methods and applications”, Briefings in Bioinformatics, vol. 17, no. 6,
    (2015), pp. 1009–1023.
10. D.J. Russell, Editor, “Multiple Sequence Alignment Methods”, Humana Press, New York,
    (2014).
11. C. Notredame, “Recent progress in multiple sequence alignment: a survey”, Pharmacogenomics, vol. 3, no.
    1, (2002), pp. 131-144.
12. H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing”,
    Briefings in Bioinformatics, vol. 11, no. 5, (2010), pp. 473-483.
13. T. Lassmann and E.L.L. Sonnhammer, “Kalign – an accurate and fast multiple sequence alignment
    algorithm”, BMC Bioinformatics, vol. 6, (2005), pp. 298.
14. I.M. Wallace, O. O’Sullivan, D.G. Higgins and C. Notredame, “M-Coffee: combining multiple sequence
    alignment methods with T-Coffee”, Nucleic Acids Research, vol. 34, no. 6, (2006), pp. 1692–1699.
15. C. Notredame, D.G. Higgins and J. Heringa, “T-Coffee: A Novel Method for Fast and Accurate Multiple
    Sequence Alignment”, Journal of Molecular Biology, vol. 302, (2000), pp. 205-217.

16. J.D. Thompson, B. Linard, O. Lecompte and O. Poch, “A Comprehensive Benchmark Study of Multiple
    Sequence Alignment Methods: Current Challenges and Future Perspectives”, PLoS ONE, vol. 6, no. 3,
    (2011).
17. C. Lambert, J.M.V. Campenhout, X. DeBolle and E. Depiereux, “Review of Common Sequence
    Alignment Methods: Clues to Enhance Reliability”, Current Genomics, vol. 4, (2003), pp. 131-146.
18. B. Chowdhury and G. Garai, “A review on multiple sequence alignment from the perspective of genetic
    algorithm”, Genomics, vol. 109, (2017), pp. 419-431.
19. J. Daugelaite, A.O. Driscoll and R.D. Sleator, “An Overview of Multiple Sequence Alignments and Cloud
    Computing in Bioinformatics”, ISRN Biomathematics, vol. 2013, (2013).
20. W.J. Kent, “BLAT-The BLAST-Like Alignment Tool”, Genome Research, vol. 12, (2002), pp. 656-664.
21. M. Bhagwat, L. Young and R.R. Robison, “Using BLAT to Find Sequence Similarity in Closely Related
    Genomes”, Current Protocols in Bioinformatics, (2012), pp. 10.8.1-10.8.24.
22. S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman, “Basic local alignment search tool”, Journal
    of Molecular Biology, vol. 215, (1990), pp. 403-410.
23. E.S. Donkor, N.T.K.D. Dayie and T.K. Adiku, “Bioinformatics with basic local alignment search tool
    (BLAST) and fast alignment (FASTA)”, Journal of Bioinformatics and Sequence Analysis, vol. 6, no. 1,
    (2014), pp. 1-6.
24. D.J. Lipman and W.R. Pearson, “Rapid and sensitive protein similarity searches”, Science, vol. 227,
    (1985), pp. 1435-1441.
25. W.R. Pearson, “Rapid and sensitive sequence comparison with FASTP and FASTA”, Methods in
    Enzymology, vol. 183, (1990), pp. 63-98.
26. G.J. Barton and M.J. Sternberg, “A strategy for the rapid multiple alignment of protein sequences.
    Confidence levels from tertiary structure comparisons”, Journal of Molecular Biology, vol. 198, (1987),
    pp. 327-337.
27. D.F. Feng and R.F. Doolittle, “Progressive sequence alignment as a prerequisite to correct phylogenetic
    trees”, Journal of Molecular Evolution, vol. 25, (1987), pp. 351-360.
28. D. Arthur and S. Vassilvitskii, “k-means++: the advantages of careful seeding”, Proceedings of the 18th
    Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied
    Mathematics.
29. Main Page-KVM, 2013, http://www.linux-kvm.org/page/
30. P. Hogeweg and B. Hesper, “The alignment of sets of sequences and the construction of phyletic trees : an
    integrated method”, Journal of Molecular Evolution, vol. 20, (1984), pp. 175-186.
31. M.S. Waterman and M.D. Perlwitz, “Line geometries for sequence comparisons”, Bulletin of
    Mathematical Biology, vol. 46, (1984), pp. 567-577.
32. W.R. Taylor, “Multiple sequence alignment by a pairwise algorithm”, Computer Applications in the
    Biosciences, vol. 3, (1987), pp. 81-87.
33. D.G. Higgins and P.M. Sharp, “Fast and sensitive multiple sequence alignments on a microcomputer”,
    Computer Applications in the Biosciences, vol. 8, (1989), pp. 189-191.
34. J.D. Thompson, D.G. Higgins and T.J. Gibson, “CLUSTAL W: improving the sensitivity of progressive
    multiple sequence alignment through sequence weighting, position-specific gap penalties and weight
    matrix choice”, Nucleic Acids Research, vol. 22, no. 22, (1994), pp. 4673-4680.
35. K. Katoh and D.M. Standley, “MAFFT multiple sequence alignment software version 7: improvements in
    performance and usability”, Molecular Biology and Evolution, vol. 30, no. 4, (2013), pp. 772-780.
36. U. Roshan and D.R. Livesay, “Probalign: multiple sequence alignment using partition function posterior
    probabilities”, Bioinformatics, vol. 22, no. 22, (2006), pp. 2715–2721.
37. R.C. Edgar, “MUSCLE: a multiple sequence alignment method with reduced time and space complexity”,
    BMC Bioinformatics, vol. 5, (2004).
38. B. Morgenstern, “DIALIGN: multiple DNA and protein sequence alignment at BiBiServ”, Nucleic Acids
    Research, vol. 32, no. 2, (2004), pp. W33–W36.
39. A. Loytynoja and N. Goldman, “Phylogeny-aware gap placement prevents errors in sequence alignment
    and evolutionary analysis”, Science, vol. 320, no. 5883, (2008), pp. 1632–1635.
40. R.K. Bradley, A. Roberts, M. Smoot, S. Juvekar, J. Do, C. Dewey, I. Holmes and L. Pachter, “Fast
    statistical Alignment”, PLoS Computational Biology, vol. 5, no. 5, (2009).
41. P. Di Tommaso, S. Moretti, I. Xenarios, M. Orobitg, A. Montanyola, J.M. Chang, J.F. Taly and C.
    Notredame, “T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences
    using structural information and homology extension”, Nucleic Acids Research, vol. 39, no. 2, (2011), pp.
    W13–W17.

42. W.R. Taylor, “Multiple sequence alignment by a pairwise algorithm”, Computer Applications in the
    Biosciences, vol. 3, (1987), pp. 81-87.
43. X. Huang, “On global sequence alignment”, Computer Applications in the Biosciences, vol. 10, (1994),
    pp. 227-235.
44. J. Pei, R. Sadreyev and N.V. Grishin, “PCMA: fast and accurate multiple sequence alignment based on
    profile consistency”, Bioinformatics, vol. 19, (2003), pp. 427-428.
45. J. Pei and N.V. Grishin, “MUMMALS: multiple sequence alignment improved by using hidden Markov
    models with local structural information”, Nucleic Acids Research, vol. 34, (2006), pp. 4364-4374.
46. J. Pei and N.V. Grishin, “PROMALS: towards accurate multiple sequence alignments of distantly related
    protein”, Bioinformatics, vol. 23, (2006), pp. 802-808.
47. Y. Liu, B. Schmidt and D.L. Maskell, “MSAProbs: multiple sequence alignment based on pair hidden
    Markov models and partition function posterior probabilities”, Bioinformatics, vol. 26, no. 16, (2010), pp.
    1958-1964.
48. R. Chenna, H. Sugawara, T. Koike, T.J. Gibson, D.G. Higgins and J.D. Thompson, “Multiple sequence
    alignment with the Clustal series of programs”, Nucleic Acids Research, vol. 31, no. 13, (2003), pp.
    3497-3500.
49. W.J. Wilbur and D.J. Lipman, “Rapid similarity searches of nucleic acid and protein data banks”, in the
    Proceedings of the National Academy of Sciences of the United States of America, vol. 80, no.3, (1983),
    pp. 726-730.
50. S.B. Needleman and C.D. Wunsch, “A general method applicable to the search for similarities in the
    amino acid sequence of two proteins”, Journal of Molecular Biology, vol. 48, no. 3, (1970), pp. 443-453.
51. R.C. Edgar and S. Batzoglou, “Multiple Sequence Alignment”, Current Opinion in Structural Biology, vol.
    16, no.3, (2006), pp. 368-373.
52. J. Soding, “Protein homology detection by HMM-HMM comparison”, Bioinformatics, vol. 21, no. 7,
    (2005), pp. 951-960.
53. L.A. Ait, Z. Yamak and B. Morgenstern, “DIALIGN at GOBICS-multiple sequence alignment using
    various sources of external information”, Nucleic Acids Research, vol. 41, (2013), pp. W3-W7.
54. M. Schmollinger, K. Nieselt, M. Kaufmann and B. Morgenstern, “DIALIGN P: Fast pair-wise and
    multiple sequence alignment using parallel processors”, BMC Bioinformatics, vol. 5, (2004), pp. 128.
55. A.R. Subramanian, J. Weyer-Menkhoff, M. Kaufmann and B. Morgenstern, “DIALIGN-T: An improved
    algorithm for segment-based multiple sequence alignment”, BMC Bioinformatics, vol. 6, (2005), pp. 66.
56. D.W. Mount, “Using iterative methods for global multiple sequence alignment”, Cold Spring Harbor
    Protocols, vol. 4, no. 7, (2009).
57. O. Gotoh, “Optimal alignment between groups of sequences and its application to multiple sequence
    alignment”, Computer Applications in the Biosciences, vol. 9, no. 3, (1993), pp. 361-370.
58. B. Morgenstern, “DIALIGN: multiple DNA and protein sequence alignment at BiBiServ”, Nucleic Acids
    Research, vol. 32, no. 2, (2004), pp. W33–W36.
59. C. Notredame and D.G. Higgins, “SAGA: sequence alignment by genetic algorithm”, Nucleic Acids
    Research, vol. 24, no. 8, (1996), pp. 1515-1524.
60. S. Yamada, O. Gotoh and H. Yamana, “Improvement in accuracy of multiple sequence alignment using
    novel group-to-group sequence alignment algorithm with piecewise linear gap cost”, BMC
    Bioinformatics, vol. 7, (2006), pp. 524.
61. H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler Transform”,
    Bioinformatics, vol. 25, (2009), pp. 1754-1760.
62. B. Langmead, C. Trapnell, M. Pop and S.L. Salzberg, “Ultrafast and memory-efficient alignment of short
    DNA sequences to the human genome”, Genome Biology, vol. 10, no. 3, (2009).
63. H. Li, J. Ruan and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping
    quality scores”, Genome Research, vol. 18, no. 11, (2008), pp. 1851–1858.
64. R. Li, Y. Li, K. Kristiansen and J. Wang, “SOAP: short oligonucleotide alignment program”,
    Bioinformatics, vol. 24, no. 5, (2008), pp. 713-714.
65. C.B. Do, M.S.P. Mahabhashyam, M. Brudno and S. Batzoglou, “ProbCons: probabilistic
    consistency-based multiple sequence alignment”, Genome Research, vol. 15, no. 2, (2005), pp. 330-340.
66. S.R. Jangam and N. Chakraborti, “A novel method for alignment of two nucleic acid sequences using ant
    colony optimization and genetic algorithm”, Applied Soft Computing, vol. 7, no. 3, (2007),
    pp. 1121-1130.
67. C. Gondro and B.P. Kinghorn, “A simple genetic algorithm for multiple sequence alignment”, Genetics
    and Molecular Research, vol. 6, no. 4, (2007), pp. 964-982.

68. Z.J. Lee, A.F. Su, C.C. Chuang and K.H. Liu, “Genetic algorithm with ant colony optimization (GA-ACO)
    for multiple sequence alignment”, Applied Soft Computing, vol. 8, no. 1, (2008), pp. 55-78.
69. G. Garai and B. Chowdhury, “A cascaded pairwise biomolecular sequence alignment technique using
    evolutionary algorithm”, Information Sciences, vol. 297, (2015), pp. 118-139.
70. Y. Kaur and N. Sohi, “Pairwise Sequence Alignment Method Using Flower Pollination Algorithm”, 4th
    IEEE International Conference on Signal Processing, Computing and Control, Sept 21-23, 2017, Solan,
    India, (2017), pp. 408-413.
71. C. Notredame, “Recent Evolutions of Multiple Sequence Alignment Algorithms”, PLoS Computational
    Biology, vol. 3, no. 8, (2007).
