Using Perl for Bioinformatics - Science and Technology Support Group High Performance Computing

Page created by Ida Weaver
 
Using Perl for Bioinformatics

Science and Technology Support Group
     High Performance Computing

      Ohio Supercomputer Center
         1224 Kinnear Road
      Columbus, OH 43212-1163
Table of Contents
• Section 1
     – Concatenate sequences
     – Transcribe DNA to RNA
     – Reverse complement of sequences
     – Read sequence data from files
     – Searching for motifs in DNA or proteins
     – Exercises 1
• Section 2
     – Subroutines
     – Mutations and Randomization
     – Translating DNA into Proteins using Libraries of Subroutines
     – BioPerl Modules
     – Exercises 2
• Section 3
     – Read FASTA Files
     – Exercises 3
• Section 4
     – GenBank Files and Libraries
     – Exercises 4
• Section 5
     – PDB
     – Exercises 5
• Section 6
     – Blast Files
     – Exercises 6

Section 1 : Sequences and Regular Expressions

Example 1-1 : Concatenation of two strings of DNA
• Concatenate two DNA sequences held in two Perl variables.
     – Two character sequences are assigned to scalar variables.
     – The two sequences are used to create a third variable.
     – The third variable holds the concatenated sequence, built with the ‘.’ (dot) operator.
• Use the ‘print’ command to print the concatenated sequence to stdout.
     – Example 1-1 prints the concatenated sequence in several different ways.
     – Note the use of the newline character, “\n”.


Example 1-1
#!/usr/bin/perl -w
# Example 1-1 Concatenating DNA

# Store two DNA fragments into two variables called $DNA1 and $DNA2
$DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

# Print the DNA onto the screen
print "Here are the original two DNA fragments:\n\n";

print $DNA1, "\n";

print $DNA2, "\n\n";

# Concatenate the DNA fragments into a third variable and print them
# Using "string interpolation"
$DNA3 = "$DNA1$DNA2";

print "Here is the concatenation of the first two fragments (version 1):\n\n";

print "$DNA3\n\n";

# An alternative way using the "dot operator":
# Concatenate the DNA fragments into a third variable and print them
$DNA3 = $DNA1 . $DNA2;

print "Here is the concatenation of the first two fragments (version 2):\n\n";

print "$DNA3\n\n";

# Print the same thing without using the variable $DNA3
print "Here is the concatenation of the first two fragments (version 3):\n\n";

print $DNA1, $DNA2, "\n";

exit;

Note that the two different assignments to $DNA3 achieve the same result:
1.       $DNA3 = "$DNA1$DNA2";
2.       $DNA3 = $DNA1.$DNA2;

Results of running example 1-1:

Here are the original two DNA fragments:
ACGGGAGGACGGGAAAATTACTACGGCATTAGC
ATAGTGCCGTGAGAGTGATGTAGTA
Here is the concatenation of the first two fragments (version 1):
ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
Here is the concatenation of the first two fragments (version 2):
ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
Here is the concatenation of the first two fragments (version 3):
ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
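
When more than two fragments have to be joined, the same idea extends to a list. The following is a minimal sketch and not part of the original example: the array @fragments and its third fragment are made up for illustration. It builds the same string once with join and once with the dot operator in a loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The first two fragments are from Example 1-1; the third is made up.
my @fragments = (
    'ACGGGAGGACGGGAAAATTACTACGGCATTAGC',
    'ATAGTGCCGTGAGAGTGATGTAGTA',
    'GGGTTT',
);

# join('', LIST) concatenates every element with nothing in between.
my $joined = join('', @fragments);

# The dot operator in a loop builds the same string step by step.
my $by_dot = '';
foreach my $fragment (@fragments) {
    $by_dot = $by_dot . $fragment;
}

print "$joined\n";
print "Both methods give the same string\n" if $joined eq $by_dot;
```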


Example 1-2 : Transcribing DNA to RNA
• Replace all thymine with uracil in the DNA
    – Replace all the ‘T’ characters in the string with ‘U’.
    – Use the binding operator ‘=~’.
    – Use a global regular expression substitution, ‘s/T/U/g’.


Example 1-2
#!/usr/bin/perl -w
# Transcribing DNA into RNA
# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";
# Transcribe the DNA to RNA by substituting
# all T's with U's.
$RNA = $DNA;
$RNA =~ s/T/U/g;
# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";
print "$RNA\n";
# Exit the program.
exit;

1.      Copy the string in $DNA into the variable $RNA.
2.      $RNA =~ s/T/U/g; substitutes every uppercase T in $RNA with an uppercase U.

Results of running example 1-2:

Here is the starting DNA:
ACGGGAGGACGGGAAAATTACTACGGCATTAGC
Here is the result of transcribing the DNA to RNA:
ACGGGAGGACGGGAAAAUUACUACGGCAUUAGC
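
Because transcription swaps one character for one other character, the tr/// operator can do the same job as s/T/U/g. The sketch below is an illustration only, not part of the original example; the variable names $by_subst and $by_tr are assumptions. It applies both operators to copies of the same DNA and compares the results.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Copy the DNA, then substitute globally with a regular expression.
my $by_subst = $DNA;
$by_subst =~ s/T/U/g;

# Copy the DNA, then map T to U character by character with tr///.
my $by_tr = $DNA;
$by_tr =~ tr/T/U/;

print "s/T/U/g gives: $by_subst\n";
print "tr/T/U/ gives: $by_tr\n";
print "The original DNA is unchanged: $DNA\n";
```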


Example 1-3 : Calculating the Reverse Complement of a DNA strand
• Find the reverse of the DNA string.
• Calculate the complement of the reversed string.
    – Substitute each base with its complement.
         • A -> T; T -> A; C -> G; G -> C.
    – Could use the substitution operator of the regular expression
         •   $var =~ s/A/T/g;
         •   $var =~ s/T/A/g;
         •   $var =~ s/C/G/g;
         •   $var =~ s/G/C/g;
    – This would give the wrong answer! Each substitution acts on the result of the previous one, so the T’s created by the first step are turned back into A’s by the second.
    – Fortunately Perl has a related operator, ‘tr///’ (the ‘translator’), that replaces every character in a single pass.
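
The problem with chaining four global substitutions can be seen on a short string. The following is a minimal sketch, not part of the original slides; the test string AACCGGTT and the variable names $wrong and $right are chosen only for illustration. It computes the complement only, without reversing, to keep the effect easy to follow.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $DNA = 'AACCGGTT';

# Four global substitutions, applied one after another.
my $wrong = $DNA;
$wrong =~ s/A/T/g;    # AACCGGTT -> TTCCGGTT
$wrong =~ s/T/A/g;    # every T, old or new, becomes A -> AACCGGAA
$wrong =~ s/C/G/g;    # AACCGGAA -> AAGGGGAA
$wrong =~ s/G/C/g;    # every G, old or new, becomes C -> AACCCCAA

# tr/// maps each character exactly once, so nothing is converted twice.
my $right = $DNA;
$right =~ tr/ACGT/TGCA/;    # AACCGGTT -> TTGGCCAA

print "Original:         $DNA\n";
print "Four s///g steps: $wrong\n";
print "One tr///:        $right\n";
```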


Example 1-3
#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA
# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";

# Make a new, reversed copy of the DNA
$revcom = reverse $DNA;
# See the text for a discussion of tr///
$revcom =~ tr/ACGTacgt/TGCAtgca/;
# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";
print "$revcom\n";
exit;

Note that the translator, tr///, replaces each character in the first sequence with the corresponding character in the second sequence. In this example both uppercase and lowercase bases are translated.

Results of running example 1-3:

Here is the starting DNA:
ACGGGAGGACGGGAAAATTACTACGGCATTAGC

Here is the reverse complement DNA:
GCTAATGCCGTAGTAATTTTCCCGTCCTCCCGT


Example 1-4 : Reading protein sequences from a file
• Use ‘open’.
     – Use a character string variable for the filename.
     – open(FILEPOINTER, $filename);
• Read in the contents.
     – Use angle brackets around the filepointer, ‘<FILEPOINTER>’.
     – Need to create a loop to read in all lines
• Read from a file named in the command line.
     – Use empty angle brackets, ‘<>’.
     – Do not need to create a filepointer.
     – Read into an array
     – Need to create a loop to process all lines of the array

Example 1-4
#!/usr/bin/perl -w

$longprotein = '';

# Example 1-4 Reading protein sequence data from a file
# Usage: perl example1-4.pl

# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';

# First we have to "open" the file, and associate
# a "filehandle" with it. We choose the filehandle
# PROTEINFILE for readability.
open(PROTEINFILE, $proteinfilename);

# Now we do the actual reading of the protein sequence from the
# file by using the angle brackets < and > to get the input from the
# filehandle. We store the data into our variable $protein.
while ($protein = <PROTEINFILE>) {
        $longprotein .= $protein;
}

# Now that we've got our data, we can close the file.
close PROTEINFILE;

# Print the protein onto the screen
print "Here is the protein:\n\n";

print $longprotein;

exit;

The filename is set by assigning the string variable $proteinfilename. The ‘while’ loop reads in from the file one line at a time. Each line from the file is concatenated onto the end of the previous string. It is good programming practice to close the file pointer when done. Note that each line of the file appears on its own line in the output, because the newline at the end of each line is kept.

Results of running example 1-4:

Here is the protein:
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

Example 1-4b
#!/usr/bin/perl -w

$longprotein = '';

# Example 1-4b Reading protein sequence data from a file
# Usage: perl example1-4b.pl filename

# The filename of the file containing the protein sequence data
# is in the command line. The '<>' is shortcut for <ARGV>.
# <ARGV> treats the @ARGV array as a list of
# filenames, returning the contents
# of those files one line at a time. The contents of those files are
# available to the program, using the angle brackets <>,
# without a filehandle.
@data_from_file = <>;

# Using the foreach loop, we access the data from the array,
# one line at a time. Removing the 'newline' from the string,
# concatenate to the string variable, making one long protein
# string. chomp removes only a trailing newline, so the last
# residue is safe even if the final line has no newline.
foreach (@data_from_file) {
       chomp $_;
       $longprotein .= $_;
}

# Print the protein onto the screen
print "Here is the protein:\n";
print $longprotein."\n";
exit;

The filename is given as an argument on the command line. This is much more convenient than writing a different Perl script for each file we need to open. The command:
         @data_from_file = <>;
treats each item on the command line as a filename, opens each file, and then reads each line of the file into the array. Creating a filehandle is not needed.

The ‘foreach’ loop then retrieves each element of the array, discards the newline at the end, then concatenates the string onto the end of the string variable $longprotein.

Results of running example 1-4b:

Here is the protein:
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQDSVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQGLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
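
The examples above use a bareword filehandle and a two-argument open. A common alternative in current Perl, sketched here as an illustration rather than as the method taught in this course, is a lexical filehandle with the three-argument form of open and an explicit error check. To keep the sketch self-contained it reads from an in-memory string: $file_contents is a made-up stand-in holding the first 30 residues of the protein, split into invented lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the contents of a sequence file (hypothetical line breaks).
my $file_contents = "MNIDDKLEGL\nFLKCGGIDEM\nQSSRTMVVMG\n";

# Three-argument open. A reference to a scalar opens it as an in-memory
# file; for a real file, pass the filename instead of \$file_contents.
open(my $fh, '<', \$file_contents)
    or die "Cannot open sequence data: $!";

my $longprotein = '';
while (my $line = <$fh>) {
    chomp $line;              # remove the trailing newline, if there is one
    $longprotein .= $line;
}
close $fh;

print "Here is the protein:\n";
print "$longprotein\n";
```

Checking the return value of open means a missing file stops the program with a message instead of silently producing an empty sequence.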


Example 1-5 : Searching for motifs in DNA or proteins
• Prompt the user for a filename and for motif strings
     – Specify a filename to open
     – open(FILEPOINTER, $filename);
• Read in the contents.
     – Read the lines of the file into an array.
     – Concatenate all lines of the array into a scalar variable.
     – Remove all newlines and blanks from the scalar variable.
• Compare the motif entered from the terminal to the protein string.
     – Use regular expression comparison.
     – Exit the program when the motif contains only whitespace.

Example 1-5
#!/usr/bin/perl -w
# Example 1-5 Searching for motifs

# Ask the user for the filename of the file containing
# the protein sequence data, and collect it from the keyboard
print "Please type the filename of the protein sequence data: ";

$proteinfilename = <STDIN>;
# Remove the newline from the protein filename
chomp $proteinfilename;
# open the file, or exit
unless ( open(PROTEINFILE, $proteinfilename) ) {
   print "Cannot open file \"$proteinfilename\"\n\n";
   exit;
}

# Read the protein sequence data from the file, and store it
# into the array variable @protein
@protein = <PROTEINFILE>;

# Close the file - we've read all the data into @protein now.
close PROTEINFILE;

# Put the protein sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
$protein = join( '', @protein);

# Remove whitespace
$protein =~ s/\s//g;

The filename is given on standard input in answer to the question:
         $proteinfilename = <STDIN>;

The ‘unless’ condition checks whether the file could be opened, exiting if not:
         unless ( open(PROTEINFILE, $proteinfilename) )

Each line of the file is then put into an array, @protein, after which the filehandle is closed:
         @protein = <PROTEINFILE>;

By using ‘join’, the lines in the array are put into one long character string, including newline characters:
         $protein = join( '', @protein);

All whitespace, including newlines, tabs and blanks, is then removed:
         $protein =~ s/\s//g;
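
The comment about a motif occurring over a line break can be made concrete. The following is a minimal sketch under stated assumptions: the two lines in @lines are an invented split of the start of the protein, chosen so that the motif KLEG straddles the break. It is not part of the original example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two hypothetical lines of a sequence file; KLEG spans the line break.
my @lines = ("MNIDDKL\n", "EGLFLKC\n");
my $motif = 'KLEG';

# Searching line by line misses the motif.
my $found_in_lines = 0;
foreach my $line (@lines) {
    $found_in_lines = 1 if $line =~ /$motif/;
}

# Joining the lines and removing whitespace exposes it.
my $protein = join('', @lines);
$protein =~ s/\s//g;
my $found_in_string = ($protein =~ /$motif/) ? 1 : 0;

print "Found in separate lines: $found_in_lines\n";
print "Found in joined string:  $found_in_string\n";
```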

Example 1-5 (cont’d)
# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
# Exit if no motif is entered.
do {
   print "Enter a motif to search for: ";
   $motif = <STDIN>;
   # Remove the newline at the end of $motif
   chomp $motif;

   # Look for the motif
   if ( $protein =~ /$motif/ ) {
      print "I found it!\n\n";
   } else {
      print "I couldn\'t find it.\n\n";
   }

# exit on an empty user input
} until ( $motif =~ /^\s*$/ );

# exit the program
exit;

The loop controls the search for the character string in the entire protein string.

The variable $motif is assigned the character string typed in the shell:
         $motif = <STDIN>;

The newline character is removed from the end of the string:
         chomp $motif;

The character string $motif is compared to the protein string for a match:
         $protein =~ /$motif/

When the user types nothing but whitespace, the program exits:
         until ( $motif =~ /^\s*$/ );

Results from running example1-5.pl:
Please type the filename of the protein sequence data: NM_021964fragment.pep
Enter a motif to search for: SVLQ
I found it!
Enter a motif to search for: sqlv
I couldn't find it.
Enter a motif to search for: QDSV
I found it!
Enter a motif to search for: HERLPQGLQ
I found it!
Enter a motif to search for:
I couldn't find it.
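
The match in Example 1-5 only reports whether the motif occurs at all. As a hedged illustration of how the same regular expression match can be repeated along a string, the sketch below collects the start position of every non-overlapping occurrence. The toy sequence, the motif and the array @positions are made up for this sketch; it relies on the /g modifier and on the built-in array @-, whose first element holds the start offset of the last successful match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $protein = 'AAEEKEEAA';    # made-up toy sequence
my $motif   = 'EE';

my @positions;
# With /g in a while condition, each pass finds the next match.
while ($protein =~ /$motif/g) {
    push @positions, $-[0];   # 0-based start of this match
}

if (@positions) {
    print "Found $motif at position(s): @positions\n";
} else {
    print "Could not find $motif\n";
}
```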

Exercises for Section 1
1.   Explore the sensitivity of programming languages to errors of syntax. Try removing the semicolon from the end
     of any statement of one of our working programs and examining the error messages that result, if any. Try
     changing other syntactical items: add a parenthesis or a curly brace; misspell some command, like "print" or
     some other reserved word; just type in, or delete, anything. Programmers get used to seeing such errors; even
     after getting to know the language well, it is still common to have some syntax errors as you gradually add code
     to a program. Notice how one error can lead to many lines of error reporting. Is Perl accurately reporting the
     line where the error is?
2.   Write a program that prints DNA (which could be in upper- or lowercase originally) in lowercase (acgt); write
     another that prints the DNA in uppercase (ACGT). Use the function tr///.
3.   Do the same thing as Exercise 2, but use the string directives \U and \L for upper- and lowercase. For instance,
     print "\U$DNA" prints the data in $DNA in uppercase.
4.   Prompt the user to enter two (short) strings of DNA. Concatenate the two strings of DNA by appending the
     second to the first using the .= assignment operator. Print the two strings as concatenated, and then print the
     second string lined up over its copy at the end of the concatenated strings. For example, if the input strings are
     AAAA and TTTT, print:
          AAAATTTT
              TTTT
5.   Write a program to calculate the reverse complement of a strand of DNA. Do not use the s/// or the tr functions.
     Use the substr function, and examine each base one at a time in the original while you build up the reverse
     complement. (Hint: you might find it easier to examine the original right to left, rather than left to right,
     although either is possible.)
6.   Write a program to report how GC-rich some sequence is. (In other words, just give the percentage of G and C
     in the DNA.)
7.   Modify Example 1-5 to not only find motifs by regular expressions but to print out the motif that was found. For
     example, if you search, using regular expressions, for the motif EE.*EE, your program should print
     EETVKNDEE. You can use the special variable $&. After a successful pattern match, this special variable is set
     to hold the text that was matched.
8.   Write a program that switches two bases in a DNA string at specified positions. (Hint: you can use the Perl
     function substr or an array slice.)

Section 2 : Mutations, Randomization and Modules

Example 2-1 : Counting bases in DNA string, using subroutines.
• Subroutines are very efficient
     – Write once, use many times.
     – Routines which are widely useful may be stored in a library for future use.
• Lexical scoping using the ‘my’ declaration
     – Important to understand the scope of variables
     – Use ‘my’ to declare variables within the scope of the enclosing code block
     – The same variable names may be used in different code segments
     – Declare ‘use strict’ to require that variables be declared with ‘my’
• Use the special array @_ to pass arguments to a subroutine
     – my($var1, $var2, $var3) = @_;
     – This will assign the values of arguments passed to the subroutine to the named
       variables
     – Mistake of not using the @_
           • Variables will not have their passed values
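
The effect of forgetting @_ can be seen in a short sketch. This is an illustration only: the subroutine names countA and countA_broken and the test string GATTACA are assumptions made for this example, not part of the course material.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Correct: copy the argument out of @_ into a lexical variable.
sub countA {
    my ($dna) = @_;
    my $count = ($dna =~ tr/Aa//);
    return $count;
}

# Mistake: $dna is declared but never assigned from @_, so it stays
# undefined no matter what the caller passes.
sub countA_broken {
    my ($dna);                       # forgot "= @_"
    return 0 unless defined $dna;    # this branch is always taken
    my $count = ($dna =~ tr/Aa//);
    return $count;
}

my $good = countA('GATTACA');
my $bad  = countA_broken('GATTACA');

print "countA:        $good\n";
print "countA_broken: $bad\n";
```

The broken version checks for the undefined value only to keep the sketch free of warnings; without that check Perl would warn about using an uninitialized value.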

Example 2-1
#!/usr/bin/perl -w
# Example 2-1 Counting the number of G's in some DNA on the
# command line
use strict;

# Collect the DNA from the arguments on the command line
# when the user calls the program.
# If no arguments are given, print a USAGE statement and exit.
# $0 is a special variable that has the name of the program.

my($USAGE) = "$0 DNA\n\n";

# @ARGV is an array containing all command-line arguments.
#
# If it is empty, the test will fail and the print USAGE and exit
# statements will be called.
unless(@ARGV) {
   print $USAGE;
   exit;
}

# Read in the DNA from the argument on the command line.
my($dna) = $ARGV[0];

The command ‘use strict’ requires every variable to be declared, here with ‘my’. This limits the scope of each variable.

Declare a string variable to hold the usage line.

The ‘unless’ condition makes sure there are arguments on the command line. The special array @ARGV always exists, but it is empty, and so tests false, when no arguments are present on the command line.

Assign the value of the character string on the command line to the variable $dna. The first element of the argument array, in this case the only argument, is $ARGV[0]. Individual elements of an array are referenced by the syntax $array1[n].

Example 2-1 (cont’d)
# Call the subroutine that does the real work, and collect the result.
my($num_of_Gs) = countG ( $dna );

# Report the result and exit.
print "\nThe DNA $dna has $num_of_Gs G\'s in it!\n\n";
exit;

########################################
# Subroutines for Example 2-1
########################################

sub countG {
# return a count of the number of G's in the argument $dna
# initialize arguments and variables
  my($dna) = @_;

  my($count) = 0;

# Use tr to count the G's in the DNA
  $count = ( $dna =~ tr/Gg//);
  return $count;
}

The subroutine ‘countG’ takes a character string as an argument and returns a number.

The line “my($num_of_Gs) = countG($dna);” passes the DNA sequence to the subroutine ‘countG’ and assigns the returned number to the variable ‘$num_of_Gs’.

Inside the subroutine, a new variable $dna, lexically scoped to the subroutine only, is assigned the value passed.

The variable $count is initialized to the value ‘0’.

The expression $dna =~ tr/Gg// has an empty replacement list, so it leaves the string unchanged and simply counts every uppercase or lowercase G in it.

The value assigned to $count is the number of characters that tr matched, and it is returned.

Results from running example2-1.pl:
perl example2-1.pl CGGATTTAGCGCGT

The DNA CGGATTTAGCGCGT has 5 G's in it!
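
What tr does with an empty replacement list can be checked directly. The following is a minimal sketch, not part of the original example; the variable names $copy and $removed are assumptions. It contrasts plain tr/Gg//, which only counts, with tr/Gg//d, where the d modifier actually deletes the matched characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = 'CGGATTTAGCGCGT';

# Empty replacement list, no modifier: count the G's, change nothing.
my $count = ($dna =~ tr/Gg//);

# Same search list with the d modifier: delete the G's from a copy.
my $copy    = $dna;
my $removed = ($copy =~ tr/Gg//d);

print "Counted $count G's; the DNA is still $dna\n";
print "Deleted $removed G's from the copy, leaving $copy\n";
```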


Example 2-2 : Creating mutant DNA using Perl’s random number generator
• Simulate mutating DNA using a random number generator
     – Randomly pick a nucleotide position in a DNA string
     – Randomly pick a base from the four: A, C, T, G
     – Replace the nucleotide at the selected position of the DNA string with the
       randomly selected base
• Random number algorithms produce only pseudo-random numbers
     – With the same seed, a random number generator will produce the same series of numbers
     – Algorithms are designed to give an even distribution of values
• Random numbers require a ‘seed’
     – Should be selected randomly, as well
     – Different seed values will produce different sequences of random numbers
     – If program security or privacy (for example, patient records) is important, you should
       consult the Perl documentation, and the Math::Random and Math::TrulyRandom
       modules from CPAN
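
The role of the seed can be demonstrated with a short sketch. It is an illustration under stated assumptions: the helper draw_bases and the seed values 42 and 43 are made up here and do not appear in the course examples. Seeding twice with the same value reproduces the same series of picks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');

# draw_bases() is a hypothetical helper: seed the generator with $seed,
# then pick $n bases at random and return them as one string.
sub draw_bases {
    my ($seed, $n) = @_;
    srand($seed);
    my $picked = '';
    for (1 .. $n) {
        # rand(@bases) returns a number from 0 up to, but not including, 4
        $picked .= $bases[ int(rand(@bases)) ];
    }
    return $picked;
}

my $first  = draw_bases(42, 20);
my $second = draw_bases(42, 20);   # same seed as $first
my $other  = draw_bases(43, 20);   # different seed

print "Seed 42:       $first\n";
print "Seed 42 again: $second\n";
print "Seed 43:       $other\n";
```

A fixed seed like this is useful for reproducing a run while debugging; for ordinary use the seed should vary from run to run.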

Example 2-2
#!/usr/bin/perl -w
# Example 2-2 Mutate DNA
# using a random number generator to randomly select bases to mutate

use strict;
use warnings;

# Declare the variables
# The DNA is chosen to make it easy to see mutations:
my $DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA';

# $i is a common name for a counter variable, short for "integer"
my $i;
my $mutant;

# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);

$mutant = mutate($DNA);
print "\nMutate DNA\n\n";
print "\nHere is the original DNA:\n\n";
print "$DNA\n";
print "\nHere is the mutant DNA:\n\n";
print "$mutant\n";
# Let's put it in a loop and watch that bad boy accumulate mutations:
print "\nHere are 10 more successive mutations:\n\n";
for ($i=0 ; $i < 10 ; ++$i) {
   $mutant = mutate($mutant);
   print "$mutant\n";
}
exit;

This is the main program, which seeds the random number algorithm and calls the subroutine mutate().

The call to srand() uses the seed ‘time|$$’, which ORs the current time with the process id, creating a unique seed. This is not a very secure method but it will do for our purposes.

The argument to mutate() is the current DNA string.

Example 2-2 (cont’d)
The subroutine mutate() takes its argument from the special array @_ and
assigns it to the variable $dna.

The array @nucleotides is initialized with the four nucleotides.

The subroutine randomposition() takes the current DNA string and returns a
random position within the string.

The subroutine randomnucleotide() is passed our array of bases and returns a
randomly selected base.

Finally, the Perl built-in function substr() takes the DNA string, the random position, the
length of the segment to replace (here 1), and the replacement string, and
modifies the string in $dna in place.

########################################################
# Subroutines for Example 2-2
########################################################
# A subroutine to perform a mutation in a string of DNA
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.

sub mutate {
    my($dna) = @_;

    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase) = randomnucleotide(@nucleotides);

    # Insert the random nucleotide into the random position in the DNA
    # The substr arguments mean the following:
    # In the string $dna at position $position change 1 character to
    # the string in $newbase
    substr($dna,$position,1,$newbase);

    return $dna;
}
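
A minimal sketch of the four-argument form of substr() used in mutate(), with a made-up five-base string: the call changes the string in place, and its return value is the text that was replaced, not the new string.

```perl
use strict;
use warnings;

my $dna = 'AAAAA';

# Four-argument substr: replace 1 character at offset 2 with 'G'.
# It changes $dna in place and returns the text that was replaced.
my $old = substr($dna, 2, 1, 'G');

print "replaced '$old', DNA is now $dna\n";   # replaced 'A', DNA is now AAGAA
```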

Section 2 : Mutations, Randomization and Modules
Example 2-2 (cont’d)
randomnucleotide() passes our array of bases to the function randomelement()
and, in turn, returns the randomly chosen nucleotide.

randomelement() is given an array and returns a randomly selected element
from it. How is this done? rand() expects a scalar argument, so @array is
evaluated in scalar context, which yields the size of @array. Perl uses only
the integer part of a floating-point value as an array subscript.
So $array[rand @array] returns the element of the array whose subscript is
randomly chosen from 0 to n-1, where n is the length of the array.

# A subroutine to randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomelement {

    my(@array) = @_;

    # Here the code is succinctly represented rather than
    # "return $array[int rand scalar @array];"
    return $array[rand @array];
}

# randomnucleotide
#
# A subroutine to select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.

sub randomnucleotide {

    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array.
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}
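
The two facts that randomelement() relies on can be checked separately. The sketch below is only an illustration (the fixed seed and the 1000 draws are arbitrary choices): a fractional subscript is truncated, and an array used where a number is expected gives its size.

```perl
use strict;
use warnings;

my @nucleotides = ('A', 'C', 'G', 'T');

# A fractional subscript is truncated toward zero, so 2.9 selects element 2.
print $nucleotides[2.9], "\n";    # G

# In scalar context an array yields its size.
my $size = @nucleotides;
print "$size\n";                  # 4

# rand @nucleotides is at least 0 and less than 4,
# so the subscript used is always 0, 1, 2 or 3.
srand(1);
my %seen;
$seen{ $nucleotides[rand @nucleotides] }++ for 1 .. 1000;
print join(' ', sort keys %seen), "\n";
```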

Section 2 : Mutations, Randomization and Modules
Example 2-2 (cont’d)
randomposition() takes a string argument and calculates a random position
within the string. It is very concise and useful. The return command could have
been written:
          return (int (rand (length $string)));
Certainly, this is more understandable, but I believe there is no loss of clarity
in the shorter form, since chaining single-argument functions is often done in Perl.

rand() takes the length as an argument and returns a floating-point number
greater than or equal to 0 and less than the length. int() truncates the
floating-point number, giving an integer in the range 0 to length-1.

# randomposition
#
# A subroutine to randomly select a position in a string.
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.

sub randomposition {

    my($string) = @_;

    # Notice the "nested" arguments:
    #
    # $string is the argument to length
    # length($string) is the argument to rand
    # rand(length($string)) is the argument to int
    # int(rand(length($string))) is the argument to return
    #
    # rand returns a decimal number between 0 and its argument.
    # int returns the integer portion of a decimal number.
    #
    # The whole expression returns a random number
    # between 0 and length-1,
    # which is how the positions in a string are numbered in Perl.
    #

    return int rand length $string;
}

Results from running example2-2.pl:

Mutate DNA

Here is the original DNA:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Here is the mutant DNA:

AAAAAAAAAAAAAAAAAAAAAAAGAAAAAA

Here are 10 more successive mutations:

AAAAAAAAAAAAAAAAAAAAAAAGAAAAAG
AAAAAAAAAAAAAAAAAAAACAAGAAAAAG
AAAAAAAAAAAAAAAAAAAACAAGAAAAAG
CAAAAAAAAAAAAAAAAAAACAAGAAAAAG
CAAAAAAAAAAAAAAAAAAACAAGATAAAG
CAAAAAAAAAAAGAAAAAAACAAGATAAAG
CAAAAAAAAAAAGAACAAAACAAGATAAAG
GAAAAAAAAAAAGAACAAAACAAGATAAAG
GAAAAAAAAAAAGAACAAAAGAAGATAAAG
GAAAAAAAAAAAGAACAAAAGCAGATAAAG
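
Two successive lines in the results above are identical. This can happen because the randomly chosen base may be the same as the base already at the chosen position. The sketch below shows one possible way to force a change on every call; mutate_always() is a hypothetical name and is not part of Example 2-2 or BeginPerlBioinfo.pm.

```perl
use strict;
use warnings;

# Hypothetical variant of mutate() that always changes the chosen base.
sub mutate_always {
    my ($dna) = @_;
    my @nucleotides = ('A', 'C', 'G', 'T');

    my $position = int rand length $dna;
    my $oldbase  = uc substr($dna, $position, 1);

    # Keep only the bases that differ from the current one, then pick one.
    my @choices = grep { $_ ne $oldbase } @nucleotides;
    my $newbase = $choices[rand @choices];

    substr($dna, $position, 1, $newbase);
    return $dna;
}

srand(time|$$);
my $dna    = 'AAAAAAAAAA';
my $mutant = mutate_always($dna);
print "$dna\n$mutant\n";
```

Whether a "mutation" that leaves the base unchanged should count is a modeling decision; the original mutate() allows it.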

Section 2 : Mutations, Randomization and Modules

Example 2-3 : Translating DNA into proteins … using modules
• First transcribe DNA to RNA
• Translate RNA to amino acids
     – Four bases, A, U, C, G
     – Codon defined by sequence of three bases
     – 64 possible combinations, 4^3
     – There are only 20 amino acids and a stop
     – Redundancy with codons, more than one codon represents each amino acid
     – Refer to Table 1 on page ??
•   Use subroutine defined in BeginPerlBioinfo.pm
     –   Specify module filename in perl code
     –   If not installed in a known library path, need “use lib ‘pathname’” to specify where to find the
         module
•   Subroutine codon2aa() returns a single character amino acid from the 3-character
    codon input
•   Need to write a loop which will grab 3 characters while stepping through the
    RNA sequence
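
The count of 64 codons mentioned in the list above can be confirmed with a short sketch that builds every ordered triple of the four RNA bases. The variable names are illustrative only.

```perl
use strict;
use warnings;

my @bases = ('A', 'U', 'C', 'G');

# Build every ordered triple of bases: 4 * 4 * 4 = 4**3 = 64 codons.
my @codons;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            push @codons, $first . $second . $third;
        }
    }
}

print scalar(@codons), " codons, from $codons[0] to $codons[-1]\n";
# 64 codons, from AAA to GGG
```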

Section 2 : Mutations, Randomization and Modules
Example 2-3
#
# codon2aa
#
# A subroutine to translate a DNA 3-character
# codon to an amino acid
# Using hash lookup

sub codon2aa {
    my($codon) = @_;

    $codon = uc $codon;

    my(%genetic_code) = (
    'TCA' => 'S',    # Serine
    'TCC' => 'S',    # Serine
    'TCG' => 'S',    # Serine
    'TCT' => 'S',    # Serine
    'TTC' => 'F',    # Phenylalanine
    'TTT' => 'F',    # Phenylalanine
    'TTA' => 'L',    # Leucine
    'TTG' => 'L',    # Leucine
    'TAC' => 'Y',    # Tyrosine
    'TAT' => 'Y',    # Tyrosine
    'TAA' => '_',    # Stop
    'TAG' => '_',    # Stop
    'TGC' => 'C',    # Cysteine
    'TGT' => 'C',    # Cysteine
    'TGA' => '_',    # Stop
    'TGG' => 'W',    # Tryptophan
    'CTA' => 'L',    # Leucine
    'CTC' => 'L',    # Leucine
    'CTG' => 'L',    # Leucine
    'CTT' => 'L',    # Leucine
    'CCA' => 'P',    # Proline
    'CCC' => 'P',    # Proline
    'CCG' => 'P',    # Proline
    'CCT' => 'P',    # Proline
    'CAC' => 'H',    # Histidine
    'CAT' => 'H',    # Histidine
    'CAA' => 'Q',    # Glutamine
    'CAG' => 'Q',    # Glutamine
    'CGA' => 'R',    # Arginine
    'CGC' => 'R',    # Arginine
    'CGG' => 'R',    # Arginine
    'CGT' => 'R',    # Arginine
    'ATA' => 'I',    # Isoleucine
    'ATC' => 'I',    # Isoleucine
    'ATT' => 'I',    # Isoleucine
    'ATG' => 'M',    # Methionine
    'ACA' => 'T',    # Threonine
    'ACC' => 'T',    # Threonine
    'ACG' => 'T',    # Threonine
    'ACT' => 'T',    # Threonine
    'AAC' => 'N',    # Asparagine
    'AAT' => 'N',    # Asparagine
    'AAA' => 'K',    # Lysine
    'AAG' => 'K',    # Lysine
    'AGC' => 'S',    # Serine
    'AGT' => 'S',    # Serine
    'AGA' => 'R',    # Arginine
    'AGG' => 'R',    # Arginine
    'GTA' => 'V',    # Valine
    'GTC' => 'V',    # Valine
    'GTG' => 'V',    # Valine
    'GTT' => 'V',    # Valine
    'GCA' => 'A',    # Alanine
    'GCC' => 'A',    # Alanine
    'GCG' => 'A',    # Alanine
    'GCT' => 'A',    # Alanine
    'GAC' => 'D',    # Aspartic Acid
    'GAT' => 'D',    # Aspartic Acid
    'GAA' => 'E',    # Glutamic Acid
    'GAG' => 'E',    # Glutamic Acid
    'GGA' => 'G',    # Glycine
    'GGC' => 'G',    # Glycine
    'GGG' => 'G',    # Glycine
    'GGT' => 'G',    # Glycine
    );

    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }
    else{
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}

This subroutine takes, as an argument, a three-character DNA sequence and returns the single-character
representation of the amino acid. The data structure used is a hash. The condition
        if (exists $genetic_code{$codon})
checks whether the three characters of the codon match one of the keys in the hash. If the key is found, its
associated value is returned. Otherwise an error is reported and the program terminates. This subroutine is
included in the module BeginPerlBioinfo.pm, which will be used, along with other subroutines, throughout the rest
of the workshop.
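
The hash lookup with exists can be tried on its own. The following sketch uses a three-entry subset of the genetic code and a hypothetical subroutine, lookup_codon(), which is not part of BeginPerlBioinfo.pm. Unlike codon2aa(), it returns undef for an unknown codon so that the caller can decide how to react.

```perl
use strict;
use warnings;

# A small subset of the genetic code, enough for illustration.
my %genetic_code = (
    'ATG' => 'M',    # Methionine
    'TGG' => 'W',    # Tryptophan
    'TAA' => '_',    # Stop
);

# Hypothetical variant of codon2aa(): returns undef for an unknown codon
# instead of exiting.
sub lookup_codon {
    my ($codon) = @_;
    $codon = uc $codon;
    return exists $genetic_code{$codon} ? $genetic_code{$codon} : undef;
}

for my $codon ('atg', 'TAA', 'NNN') {
    my $aa = lookup_codon($codon);
    print "$codon => ", (defined $aa ? $aa : 'unknown'), "\n";
}
```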

Section 2 : Mutations, Randomization and Modules
Example 2-3
This is the Perl code which, with only a few lines, translates DNA into a
protein sequence. The command ‘use lib …’ instructs Perl to add a directory to
the search path for libraries such as BeginPerlBioinfo.pm.
BeginPerlBioinfo.pm comes from the book Beginning Perl for
Bioinformatics, by James Tisdall.

The ‘for’ loop steps through the DNA string by threes, starting at index 0:
         Index:  0  3  6  9 ...
                 CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

The 3-character substring is assigned to the $codon variable by the Perl
function ‘substr’. Then the amino acid returned by the subroutine codon2aa() is
appended to the end of the current protein string, $protein.

#!/usr/bin/perl -w
# Example 2-3 : Translate DNA into protein

use lib '../ModLib/';
use strict;
use warnings;
use BeginPerlBioinfo; # This does not require the '.pm' in the 'use' command

# Initialize variables
my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
my $protein = '';
my $codon;

# Translate each three-base codon into an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
   $codon = substr($dna,$i,3);
   $protein .= codon2aa($codon);
}

print "I translated the DNA\n\n$dna\n\n into the protein\n\n$protein\n\n";
exit;

Results from running example2-3.pl:

I translated the DNA

CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

 into the protein

RRLRTGLARVGR
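
The loop bound ‘length($dna) - 2’ matters when the sequence length is not a multiple of three. The sketch below isolates the loop in a hypothetical helper, split_codons(), which is an assumption for illustration and not part of BeginPerlBioinfo.pm; with an 8-base input, the last two bases are left over and never reach codon2aa().

```perl
use strict;
use warnings;

# Hypothetical helper: return the complete codons of a sequence, in order.
sub split_codons {
    my ($seq) = @_;
    my @codons;
    # Continue only while at least three characters remain, as in Example 2-3.
    for (my $i = 0; $i < length($seq) - 2; $i += 3) {
        push @codons, substr($seq, $i, 3);
    }
    return @codons;
}

my @codons = split_codons('CGACGTCT');   # 8 bases: two codons plus 'CT' left over
print "@codons\n";                        # CGA CGT
```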

Section 2 : BioPerl and CPAN

Example 2-4 : Installing and testing bioperl
• http://bioperl.org
• The Bioperl Project is an international association of developers of open
   source Perl tools for bioinformatics, genomics and life science research.
• The Bioperl server provides an online resource for modules, scripts, and web
   links for developers of Perl-based software for life science research.
• Bioperl modules and documentation are very extensive
• Good examples to illustrate uses
• Will discuss installation of bioperl
• Also take a quick look at some test scripts
• In Chapter 9 of Mastering Perl for Bioinformatics, James Tisdall gives a
   personal account of installing bioperl.
    – Depends on installing using CPAN shell
    – Linux installations vary from site to site, so it is advised that someone with
      administrator privileges install bioperl

Section 2 : BioPerl and CPAN

Example 2-4 : Installing and testing bioperl
• My own experiences were slightly different
     – Download the core bioperl install file (version 1.4 is the most recent)
     – Follow the make instructions included in the INSTALL documentation
     – Carefully follow the ‘make test’ instruction
          • Make sure you have an internet connection
     – Note where the test script fails
          • You will see module names like LWP, IO::String, etc.
     – I noticed that LWP and IO::String were involved in quite a few failures
•   Here is where I installed the missing modules using the CPAN shell
     – >> perl -MCPAN -e shell
     – At the CPAN prompt, install the missing module
          • cpan> install LWP
     – After exiting the CPAN shell, run ‘make test’ again to see whether there are fewer failures
•   After concluding that the remaining failures won’t impede using bioperl, run ‘make
    install’
•   This usually puts the modules in /usr/lib/perl5/5.x.x/site_perl, on Linux
    systems

Section 2 : BioPerl and CPAN
Example bptest0.pl
These simple tests check whether bioperl is installed correctly.

Test ‘bptest0.pl’ simply checks if Perl can find Bio::Perl. If it doesn’t complain,
we are one step closer.

#!/usr/bin/perl -w

use Bio::Perl;

exit;

######################################################
Example bptest1.pl
The file ‘bptest1.pl’ needs internet access. The Perl program retrieves a
swissprot sequence and prints it to a file, ‘roa1.fasta’, in FASTA format.

#!/usr/bin/perl -w
# Example to Test the Bioperl installation
use Bio::Perl;
# Must use this script with an internet connection
$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
write_sequence(">roa1.fasta", 'fasta', $seq_object);
exit;

######################################################
Example bptest2.pl
The last Perl script uses NCBI to BLAST a sequence and saves the results to a
file. This should be used judiciously, as we don’t want to abuse the computing
cycles of NCBI. Such requests should be limited to individual searches.
Download the blast package locally to do large numbers of BLAST searches.

#!/usr/bin/perl -w
# Example to Test the Bioperl installation
use Bio::Perl;
# Must use this script with an internet connection

$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
$blast_result = blast_sequence($seq_object);
write_blast(">roa1.blast", $blast_result);
exit;
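
If bptest0.pl fails, Perl stops with an error at the ‘use’ line. A gentler check can be sketched with eval and require, which report a missing module without stopping the program. The helper name module_available() is hypothetical, and the list of modules checked here is only an example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded, false otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Bio::Perl -> Bio/Perl.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module ('Bio::Perl', 'LWP', 'IO::String') {
    printf "%-12s %s\n", $module, module_available($module) ? 'found' : 'MISSING';
}
```

This only tells us that the module loads; it says nothing about whether the network-dependent tests would pass.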

Section 2 : Mutations, Randomization and Bioperl
Exercises for Section 2

1.   Write a subroutine to concatenate two strings of DNA.

2.   Write a subroutine to report the percentage of each nucleotide in DNA. Count the number of each nucleotide,
     divide by the total length of the DNA, then multiply by 100 to get the percentage. Your arguments should be the
     DNA and the nucleotide you want to report on. The int function can be used to discard digits after the decimal
     point, if needed.

3.   Write a module that contains subroutines that report various statistics on DNA sequences, for instance length,
     GC content, presence or absence of poly-T sequences (long stretches of mostly T’s at the 5’ (left) end of many
     DNA sequences), or other measures of interest.

4.   Write a program that asks you to pick an amino acid and then keeps (randomly) guessing which amino acid you
     picked.

5.   Write a program to mutate protein sequence, similar to the code in Example 2-2 that mutates DNA.

6.   Write a program that uses Bioperl to perform a BLAST search at the NCBI web site, then use Bioperl to parse
     the BLAST output.

Section 3 : Fasta Files and Frames

•   Many different formats for saving sequence data and annotations in files
•   Perhaps as many as 20 such formats for DNA
•   Some of the most popular
     – FASTA and BLAST (Basic Local Alignment Search Tool), both using the
       FASTA format
     – Genetic Sequence Data Bank (GenBank)
     – European Molecular Biology Laboratory (EMBL)
•   In this section we will focus on reading FASTA format
•   Sample of FASTA format:
                      > sample dna | (This is a typical fasta header.)
                      agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
                      tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
                      gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
                      tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
                      cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
                      cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
                      gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
                      cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
                      ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga

Section 3 : Fasta Files and Frames

Example 3-1: Reading FASTA format and extracting sequence data
• Write three subroutines and rely on regular expressions
• First subroutine will get data from a file
     –   Read filename from command line argument = filename
     –   open file
           •   if can’t open, print error message and exit
     –   read in data
     –   return array which contains each line of the file, @data
•   Second subroutine extracts sequence data from fasta file
     –   Read in array of file data in fasta format
     –   Discard all header, blank and comment lines
     –   If first character of first line is >, discard it
     –   Read in the rest of the file, joined in a scalar,
     –   edit out non-sequence data, white spaces
     –   return sequence
•   Third subroutine writes the sequence data
     –   More often than not, the sequence to print is longer than most page widths
     –   Need to specify a length parameter to control the output

Section 3 : Fasta Files and Frames
Example 3-1
get_file_data() takes a string argument, the filename. The unless condition
attempts to open the file. If unsuccessful, it prints an error message and exits the
program.

If the file can be opened, each line of the file is saved, one by one, into the array
@filedata. The array is returned to the main routine after the filehandle is closed.

# get_file_data
#
# A subroutine to get data from a file given its filename
sub get_file_data {

    my($filename) = @_;
    use strict;
    use warnings;
    # Initialize variables
    my @filedata = ( );
    unless ( open (GET_FILE_DATA, $filename) ) {
        print STDERR "Cannot open file \"$filename\"\n\n";
        exit;
    }

    @filedata = <GET_FILE_DATA>;
    close GET_FILE_DATA;
    return @filedata;
}
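
The same idea can be sketched with a lexical filehandle and the three-argument form of open, a style often preferred in more recent Perl code. The subroutine read_lines() is a hypothetical name, not part of BeginPerlBioinfo.pm, and the temporary file exists only so the sketch can run on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical variant of get_file_data() using a lexical filehandle
# and the three-argument form of open.
sub read_lines {
    my ($filename) = @_;
    open(my $fh, '<', $filename)
        or die "Cannot open file \"$filename\": $!\n";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# Write a small FASTA file to a temporary location, then read it back.
my ($out, $tmpname) = tempfile(UNLINK => 1);
print {$out} "> sample dna\n", "agatggcggc\n", "gctgaggggt\n";
close $out;

my @data = read_lines($tmpname);
print scalar(@data), " lines read\n";   # 3 lines read
```

Using die instead of exit lets a caller trap the failure with eval, and $! reports why the open failed.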

Section 3 : Fasta Files and Frames
Example 3-1
extract_sequence_from_fasta_data() takes the array that holds the contents of the
fasta file. The foreach loop takes each element of the array, a complete
line of the file, and assigns it to the variable $line. The different conditions help
us ignore the blank, comment and header lines:
•   /^\s*$/ matches lines that contain only white space from beginning to end
•   /^\s*#/ matches comment lines, which have the pound character, optionally
    preceded by white space
•   /^>/ matches lines which have the ‘greater-than’ symbol at the
    beginning of the line, the fasta header line
•   all other lines are concatenated together into the $sequence variable

When all is done, all white space characters are removed:
        $sequence =~ s/\s//g;

The sequence is returned to the calling routine.

# extract_sequence_from_fasta_data
#
# A subroutine to extract FASTA sequence data from an array
sub extract_sequence_from_fasta_data {
    my(@fasta_file_data) = @_;
    use strict;
    use warnings;

    # Declare and initialize variables
    my $sequence = '';
    foreach my $line (@fasta_file_data) {
        # discard blank line
        if ($line =~ /^\s*$/) {
            next;
        # discard comment line
        } elsif($line =~ /^\s*#/) {
            next;
        # discard fasta header line
        } elsif($line =~ /^>/) {
            next;
        # keep line, add to sequence string
        } else {
            $sequence .= $line;
        }
    }

    # remove non-sequence data (in this case, whitespace) from $sequence string
    $sequence =~ s/\s//g;
    return $sequence;
}
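
The three patterns can be tried on a few made-up lines. This sketch uses grep rather than the foreach loop of Example 3-1, so it is an alternative formulation and not the code in BeginPerlBioinfo.pm; the sample lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines of a FASTA file.
my @lines = (
    "> sample dna | header\n",
    "\n",
    "  # a comment line\n",
    "agat ggcg\n",
    "GCTGAG\n",
);

# Keep only lines that are not blank, not comments and not headers.
my @kept = grep { !/^\s*$/ && !/^\s*#/ && !/^>/ } @lines;

# Join the kept lines and strip all whitespace, including newlines.
my $sequence = join '', @kept;
$sequence =~ s/\s//g;

print "$sequence\n";   # agatggcgGCTGAG
```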

Section 3 : Fasta file format
Example 3-1                                                               Finally, the print_sequence() routine takes the cleaned string and an integer
# print_sequence
                                                                          specifying the number of characters to print, per line. Again notice that the
#
                                                                          variables are assigned from the special array, @_. This is accomplished by the
# A subroutine to format and print sequencedata
                                                                          for for loop and the substr module. The print command takes a substring of the
sub print_sequence {
                                                                          complete string on a new line.
    my($sequence, $length) = @_;
    use strict;
    use warnings;
    # Print sequence in lines of $length                                  Well, now that we have the produced the subroutines needed for our program,
    for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {    these subroutines have been installed in the BeginPerlBioinfo.pm module. Our
     print substr($sequence, $pos, $length), "\n";                        program may be succinctly written as in the code to the left. The final command
  }                                                                       prints the sequence, passing the character string and the length to the
}                                                                         print_sequence subroutine.

Example 3-1
#!/usr/bin/perl
# Read a fasta file and extract the sequence data
use lib '../ModLib/'; # Must point to where BeginPerlBioinfo.pm resides
use strict;
use warnings;
use BeginPerlBioinfo;

# Declare and initialize variables
my @file_data = ( );
my $dna = '';

# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");

# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);

# Print the sequence in lines 25 characters long
print_sequence($dna, 25);
exit;

Output from Example 3-1
agatggcggcgctgaggggtcttgg
gggctctaggccggccacctactgg
tttgcagcggagacgacgcatgggg
cctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagt
agttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgc
aggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcgga
…
gaagttcgggggccccaacaagatc
cggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgta
caagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgc
caaggccccgccggccactgcccac
ccaacagcagccacagccatcacag
aagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagt
caaggagcctcctgaggctacagcc
acacctgagccactctcagatgagg
accta

Section 3 : Fasta Files and Frames

Example 3-2: Translate a DNA sequence in all six reading frames
• Given a sequence of DNA, it is necessary to examine all six reading frames of
   the DNA to find the coding regions the cell uses to make proteins
• Genes very often occur in pieces that are spliced together during the
   transcription/translation process
• Since codons are three bases long, translation can happen in three
   "frames," starting at the first base, the second, or the third
• Each starting place gives a different series of codons and, as a result, a
   different series of amino acids
• The reverse complement strand has three frames of its own, giving six in all
• Goal: examine all six reading frames of a DNA sequence and look at the
   resulting protein translations
• Stop codons are definite breaks in the DNA => protein translation process
• If a stop codon is reached, the translation stops
• We need some code to compute the reverse complement of the DNA
• We need to break both strands into their respective frames
• We need to translate each frame of DNA to protein
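The idea of frames can be made concrete with a short sketch. The helper
codons_in_frame and the 9-base sequence are assumptions made up for this
illustration; they are not part of BeginPerlBioinfo.pm. The sketch only splits
each strand into codons and does not translate them.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sequence into codons starting at a
# 1-based frame offset (1, 2, or 3). Trailing bases that do not fill
# a complete codon are dropped.
sub codons_in_frame {
    my ($dna, $frame) = @_;
    my @codons;
    for (my $pos = $frame - 1; $pos + 3 <= length($dna); $pos += 3) {
        push @codons, substr($dna, $pos, 3);
    }
    return @codons;
}

my $dna = 'ATGGCCTAA';                  # made-up sequence
my $revcom = reverse $dna;              # scalar context: reverses the string
$revcom =~ tr/ACGTacgt/TGCAtgca/;       # complement each base

# Frames 1-3 come from the forward strand, frames 4-6 from the reverse complement
for my $frame (1 .. 3) {
    print "Frame $frame: ", join(' ', codons_in_frame($dna, $frame)), "\n";
}
for my $frame (1 .. 3) {
    print "Frame ", $frame + 3, ": ", join(' ', codons_in_frame($revcom, $frame)), "\n";
}
```

For this input, frame 1 reads ATG GCC TAA while frame 2 reads TGG CCT, which
shows how shifting the start by a single base changes every codon.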

Section 3 : Fasta Files and Frames
Example 3-2
# revcom
#
# A subroutine to compute the reverse complement of DNA sequence
sub revcom {
   my($dna) = @_;

   # First reverse the sequence
   # (scalar assignment, so reverse() reverses the string itself;
   #  in list context it would return the one-element list unchanged)
   my $revcom = reverse($dna);
   # Next, complement the sequence, dealing with upper and lower case
   # A->T, T->A, C->G, G->C
   $revcom =~ tr/ACGTacgt/TGCAtgca/;
   return $revcom;
}

# translate_frame
#
# A subroutine to translate a frame of DNA
sub translate_frame {
   my($seq, $start, $end) = @_;
   my $protein;
   # To make the subroutine easier to use, you won't need to specify
   # the end point--it will just go to the end of the sequence
   # by default.
   unless($end) {
     $end = length($seq);
   }
   # Finally, calculate and return the translation
   return dna2peptide ( substr ( $seq, $start - 1, $end - $start + 1) );
}

We are going to reuse our old code from Section 1, revcom(). We have to
rewrite it as a subroutine.

Now we need to design the subroutine that will break the DNA strings into our
frames and translate each one into protein. Our old Perl function substr()
should do the trick for taking apart our frames. The unless($end) condition
checks for a value in the variable $end; if there is none, it sets $end to the
length of the sequence. Positions are given counting from 1, while substr()
counts from 0, so 1 is subtracted from the start. The length of the desired
sequence doesn't change with the change in indices, since:
(end - 1) - (start - 1) + 1 = end - start + 1

To translate to peptides we revisit our codon2aa() subroutine from Section 2.
It is used by the subroutine dna2peptide(), which is already in
BeginPerlBioinfo.pm.
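The index arithmetic in translate_frame can be checked on its own with a small
sketch. The helper frame_slice is hypothetical: it uses the same substr call as
translate_frame but returns the DNA slice instead of passing it to
dna2peptide. It also tests $end with defined, which is a deliberate variation
on the unless($end) test above.

```perl
use strict;
use warnings;

# Hypothetical helper (not in BeginPerlBioinfo.pm): same 1-based start/end
# arithmetic as translate_frame, without the translation step.
sub frame_slice {
    my ($seq, $start, $end) = @_;
    # Default to the end of the sequence when no end point is given
    $end = length($seq) unless defined $end;
    # Offset is start - 1 (substr counts from 0); length is end - start + 1
    return substr($seq, $start - 1, $end - $start + 1);
}

my $seq = 'ACGTACGTAC';                 # made-up 10-base sequence
print frame_slice($seq, 2), "\n";       # positions 2..10: CGTACGTAC
print frame_slice($seq, 2, 7), "\n";    # positions 2..7:  CGTACG
```

Both start and end are inclusive 1-based positions, so a slice from 2 to 7
contains 7 - 2 + 1 = 6 bases.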

Section 3 : Fasta Files and Frames
Example 3-2
#!/usr/bin/perl
# Translate a DNA sequence in all six reading frames
use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;
# Initialize variables
my @file_data = ( );
my $dna = '';
my $revcom = '';
my $protein = '';
# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");
# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);
# Translate the DNA to protein in six reading frames
# and print the protein in lines 70 characters long
print "\n -------Reading Frame 1--------\n\n";
$protein = translate_frame($dna, 1);
print_sequence($protein, 70);
print "\n -------Reading Frame 2--------\n\n";
$protein = translate_frame($dna, 2);
print_sequence($protein, 70);
print "\n -------Reading Frame 3--------\n\n";
$protein = translate_frame($dna, 3);
print_sequence($protein, 70);
# Calculate reverse complement
$revcom = revcom($dna);
print "\n -------Reading Frame 4--------\n\n";
$protein = translate_frame($revcom, 1);
print_sequence($protein, 70);
print "\n -------Reading Frame 5--------\n\n";
$protein = translate_frame($revcom, 2);
print_sequence($protein, 70);
print "\n -------Reading Frame 6--------\n\n";
$protein = translate_frame($revcom, 3);
print_sequence($protein, 70);
exit;

Now that we have done all that work, and it appears that our subroutines will
provide the function we need, these routines are provided in
BeginPerlBioinfo.pm. So the Perl program is a short exercise and is very
modular.

Output from Example 3-2
-------Reading Frame 1--------

RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAWDAAE
WSVQVRGSLAGVVRECAGSGDMEGDGSDPEPPDAGEDSKSENGENAPIYCICRKP
DINCFMIGCDNCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKS
RERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSP
QPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKI
RQKCRLRQCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRI
REDEGAVASSTVKEPPEATATPEPLSDEDL

…

-------Reading Frame 5--------

RSSSESGSGVAVASGGSLTVDDATAPSSSRMRPNFCDGCGCCWVGSGRRGLGRDS
EGVTGESEEGKYLYDSRARSWHWRSRHFCRILLGPPNFFMSRQKSQ_PQSSVRRH
ASHSPHMRADRLICCCCCW_CWLGVATKGCGEDLWGEAEPRASMAPTPVPDPARR
CRSGSGTGLLRPPPSSRGSLLSRSLPSRSRDFLCR_RISSLGSFSLHSRQYHSRM
ALAIFSVIRMQSPWNHSLQLSHPIMKQLMSGLRQMQ_MGAFSPFSDLLSSPASGG
SGSEPSPSISPLPAHSLTTPASDPRTCTDHSAASQAVAKAPTTTSASSHASQAAY
SYCAGPMRRLRCKPVGGRPRAPKTPQRRH

-------Reading Frame 6--------

GPHLRVAQVWL_PQEAP_LLMTPLPPHLHGCALTSVMAVAAVGWAVAGGALAGTL
RASLVRARKGSTCTIPGPAAGTGAAGTSAGSCWGPRTSSCPDRNHSDHSPQCADM
PHTHHTCGLTV_SAAAAAGDAGWVWPPRAAERICGAKQSPEQAWPQPLSLTLPGA
AGLDQGQASCALHPHPGAHCCPAHCHPAPVTSCADSESLAWGLSLCTPDSTTPGW
PWPSSQ_SGCSPHGTTHCSCHTRS_SS_CPVCGRCSRWAHSPHSRTCCPPRHLEA
LGLNHLPPYLRSRRTPSRPPPATREPAQTTRRRPRRLQRRPQLLPLLVTPPRQRT
PIAQAPCVVSAANQ_VAGLEPPRPLSAAI
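The six nearly identical stanzas in the program can also be written as a loop
over the two strands and the three offsets. The following is only a sketch
under stated assumptions: translate_dna is a stand-in written for this
illustration, not the dna2peptide routine from BeginPerlBioinfo.pm. It uses the
standard genetic code, writes '_' for stop codons as in the output above, and
writes 'X' for any codon containing an unexpected base (the 'X' is an
assumption of this sketch). The input sequence is made up.

```perl
use strict;
use warnings;

# Standard genetic code, one letter per codon, with codons ordered by
# base 1, base 2, base 3 in the order T, C, A, G; '_' marks a stop codon.
my $AMINO = 'FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %INDEX = (T => 0, C => 1, A => 2, G => 3);

# Stand-in translator (hypothetical, not dna2peptide): translate complete
# codons from the start of $dna; leftover bases at the end are ignored.
sub translate_dna {
    my ($dna) = @_;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($dna); $pos += 3) {
        my @b = map { $INDEX{$_} } split //, uc substr($dna, $pos, 3);
        $protein .= (grep { !defined $_ } @b)
            ? 'X'
            : substr($AMINO, 16 * $b[0] + 4 * $b[1] + $b[2], 1);
    }
    return $protein;
}

my $dna = 'ATGGCCTAA';                  # made-up sequence
my $revcom = reverse $dna;              # scalar context: reverses the string
$revcom =~ tr/ACGTacgt/TGCAtgca/;

# Frames 1-3: forward strand at offsets 0, 1, 2; frames 4-6: reverse complement
my $frame = 0;
for my $strand ($dna, $revcom) {
    for my $offset (0 .. 2) {
        $frame++;
        print "Frame $frame: ", translate_dna(substr($strand, $offset)), "\n";
    }
}
```

With this input, frame 1 prints MA_ (Met, Ala, stop). The loop trades the
explicit, easy-to-read stanzas of Example 3-2 for less repetition; either form
produces the frames in the same order.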

Section 3 : FASTA file format
Exercises for Section 3
1.   Add to the Perl program in Example 3-1 a translation from DNA to protein and print out the protein.

2.   Write a subroutine that checks a string and returns true if it’s a DNA sequence. Write another that checks for
     protein sequence data.

3.   Write a program that can search by name for a gene in an unsorted array.

4.   Write a subroutine that inserts an element into a sorted array. Hint: use the splice Perl function to insert the
     element.

5.   Write a subroutine that checks an array of data and returns true if it’s in FASTA format. Note that FASTA
     expects the standard IUB/IUPAC amino acid and nucleic acid codes, plus the dash (-) that represents a gap of
     unknown length. Also, the asterisk (*) represents a stop codon for amino acids. Be careful using an asterisk in
     regular expressions; use a \* to escape it to match an actual asterisk.
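The note about the asterisk can be illustrated briefly. This is a minimal
sketch of the escaping only, not a solution to the exercise; the helper name
count_stops and the peptide string are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: count literal asterisks (stop markers) in a string.
# \* matches an actual asterisk; an unescaped * would be a quantifier.
sub count_stops {
    my ($peptide) = @_;
    my $count = () = $peptide =~ /\*/g;   # list of matches, counted
    return $count;
}

print count_stops('MEGDGS*PEPP*'), "\n";   # two asterisks in this made-up string
```

Inside a character class, as in [A-Z*-], the asterisk needs no backslash, but a
dash should be placed first or last so it is not read as a range.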

Section 4 : GenBank (Genetic Sequence Data Bank) Files

•   International repository of known genetic sequences from a variety of
    organisms
•   GenBank is a flat file: an ASCII text file that is easily readable
•   GenBank is referred to as a databank or data store rather than a database
     – Databases have a relational structure
     – Databases include associated indices
     – Databases include links and a query language
•   Perl modules and constructs are ideal for processing flat files
•   For additional bioinformatics software, reference these web sites
     –   National Center for Biotechnology Information (NCBI)
     –   National Institutes of Health (NIH), http://proxy.lib.ohio-state.edu:2224
     –   European Bioinformatics Institute (EBI), http://www.ebi.ac.uk
     –   European Molecular Biology Laboratory (EMBL), http://www.embl-heidelberg.de/
•   Let’s take a look at a short GenBank file

Section 4 : GenBank Files
Example of a short GenBank file:
LOCUS       AB031069 2487 bp mRNA PRI 27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1 GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2 (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN
(cont’d)

Section 4 : GenBank Files
Example of a short GenBank file (cont’d):
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
…
     2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat
     2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt
     2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat
     2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag
     2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt
     2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa
     2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//

For a view of the complete file and its format, look at ‘record.gb’ in
Section 4 of the exercises.

A typical GenBank entry is packed with information. With Perl we will be able
to separate the different parts. For instance, by extracting the sequence, we
can search for motifs, calculate statistics on the sequence, or compare it
with other sequences. Also, by separating the various parts of the data
annotation, we have access to ID numbers, gene names, genus and species,
publications, etc. The FEATURES table part of the annotation includes specific
information about the DNA, such as the locations of exons, regulatory regions,
important mutations, and so on. The format specification of GenBank files and
a great deal of other information about GenBank can be found in the GenBank
release notes, gbrel.txt, on the GenBank web site at
ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.
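As a first taste of separating a record into its parts, here is a minimal
sketch. It assumes the whole record is held in one string, that the annotation
ends with the ORIGIN line, and that the record ends with a line containing
only //. The helper name split_genbank_record is hypothetical, and the record
is an abbreviated copy of a few lines from the example above.

```perl
use strict;
use warnings;

# Abbreviated record built from lines of the example GenBank file
my $record = <<'END_RECORD';
LOCUS       AB031069 2487 bp mRNA PRI 27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
ORIGIN
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
//
END_RECORD

# Hypothetical helper: return (annotation, sequence) for one record,
# or an empty list if the text does not look like a GenBank record.
sub split_genbank_record {
    my ($text) = @_;
    my ($annotation, $sequence) =
        $text =~ /\A(.*?^ORIGIN[^\n]*\n)(.*?)^\/\/\s*\z/ms
        or return;
    $sequence =~ s/[\s\d]//g;    # drop position numbers and whitespace
    return ($annotation, $sequence);
}

my ($annotation, $sequence) = split_genbank_record($record);

# Pull one field out of the annotation: the accession number
my ($accession) = $annotation =~ /^ACCESSION\s+(\S+)/m;

print "Accession: $accession\n";
print "Sequence length: ", length($sequence), "\n";
print "First 20 bases: ", substr($sequence, 0, 20), "\n";
```

Once the sequence is a plain string of bases, it can be passed to routines such
as print_sequence or translate_frame from Section 3.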
