Using Perl for Bioinformatics - Science and Technology Support Group High Performance Computing
Using Perl for Bioinformatics
Science and Technology Support Group
High Performance Computing
Ohio Supercomputer Center
1224 Kinnear Road
Columbus, OH 43212-1163
Table of Contents
• Section 1
– Concatenate sequences
– Transcribe DNA to RNA
– Reverse complement of sequences
– Read sequence data from files
– Searching for motifs in DNA or proteins
– Exercises 1
• Section 2
– Subroutines
– Mutations and Randomization
– Translating DNA into Proteins using Libraries of Subroutines
– BioPerl Modules
– Exercises 2
• Section 3
– Read FASTA Files
– Exercises 3
• Section 4
– GenBank Files and Libraries
– Exercises 4
• Section 5
– PDB
– Exercises 5
• Section 6
– Blast Files
– Exercises 6
Section 1 : Sequences and Regular Expressions
Example 1-1 : Concatenation of two strings of DNA
• Concatenate two DNA sequences held in two Perl variables.
– Two character sequences are assigned to scalar variables.
– The two sequences are used to create a third variable.
– The third variable holds the concatenated sequence, built with the ‘.’ operator.
• Use the ‘print’ command to print the concatenated sequence to stdout.
– Example 1-1 shows several different ways to print the concatenated sequence.
– Use of the newline character, “\n”.
Example 1-1
#!/usr/bin/perl -w
# Example 1-1 Concatenating DNA

# Store two DNA fragments into two variables called $DNA1 and $DNA2
$DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

# Print the DNA onto the screen
print "Here are the original two DNA fragments:\n\n";
print $DNA1, "\n";
print $DNA2, "\n\n";

# Concatenate the DNA fragments into a third variable and print them
# Using "string interpolation"
$DNA3 = "$DNA1$DNA2";
print "Here is the concatenation of the first two fragments (version 1):\n\n";
print "$DNA3\n\n";

# An alternative way using the "dot operator":
# Concatenate the DNA fragments into a third variable and print them
$DNA3 = $DNA1 . $DNA2;
print "Here is the concatenation of the first two fragments (version 2):\n\n";
print "$DNA3\n\n";

# Print the same thing without using the variable $DNA3
print "Here is the concatenation of the first two fragments (version 3):\n\n";
print $DNA1, $DNA2, "\n";
exit;

Note that the different assignments to $DNA3 achieve the same result:
1. $DNA3 = "$DNA1$DNA2";
2. $DNA3 = $DNA1 . $DNA2;

Results of running example 1-1:
Here are the original two DNA fragments:
ACGGGAGGACGGGAAAATTACTACGGCATTAGC
ATAGTGCCGTGAGAGTGATGTAGTA
Here is the concatenation of the first two fragments (version 1):
ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
Here is the concatenation of the first two fragments (version 2):
ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
Here is the concatenation of the first two fragments (version 3):
ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
Example 1-2 : Transcribing DNA to RNA
• Replace all thymine with uracil in the DNA
– Replace all the ‘T’ characters in the string with ‘U’.
– Use binding operator ‘=~’.
– Regular expression substitution, globally, ‘s/T/U/g’.
Example 1-2
#!/usr/bin/perl -w
# Transcribing DNA into RNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";

# Transcribe the DNA to RNA by substituting
# all T's with U's.
$RNA = $DNA;
$RNA =~ s/T/U/g;

# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";
print "$RNA\n";

# Exit the program.
exit;

1. Copy the string in $DNA into the variable $RNA.
2. $RNA =~ s/T/U/g; substitutes every uppercase T in $RNA with an uppercase U.

Results of running example 1-2:
Here is the starting DNA:
ACGGGAGGACGGGAAAATTACTACGGCATTAGC
Here is the result of transcribing the DNA to RNA:
ACGGGAGGACGGGAAAAUUACUACGGCAUUAGC
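The transcription step can also be wrapped in a small subroutine so the original DNA string is left untouched. This is a minimal sketch, not part of the workshop code: the name dna_to_rna is hypothetical, and handling lowercase t as well as uppercase T is an added assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the workshop files): return an RNA copy
# of a DNA string, leaving the original string unchanged.
sub dna_to_rna {
    my ($dna) = @_;
    my $rna = $dna;      # work on a copy
    $rna =~ s/T/U/g;     # uppercase thymine -> uracil
    $rna =~ s/t/u/g;     # lowercase thymine -> uracil (assumption: mixed case is allowed)
    return $rna;
}

print dna_to_rna('ACGGGAGGACGGGAAAATTACTACGGCATTAGC'), "\n";
```

Because the subroutine returns a new string, the caller can keep both the DNA and the RNA without an explicit copy step.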
Example 1-3 : Calculating the Reverse Complement of a DNA strand
• Find the reverse of the DNA string.
• Calculate the complement of the reversed string.
– Substitute each base with its complement.
• A -> T; T -> A; C -> G; G -> C.
– Could use the substitution operator with regular expressions
• $var =~ s/A/T/g;
• $var =~ s/T/A/g;
• $var =~ s/C/G/g;
• $var =~ s/G/C/g;
– This would give the wrong result! Each substitution also changes the bases produced by the one before it: after s/A/T/g, the following s/T/A/g turns every T, old and new, into A.
– Fortunately, Perl has a transliteration operator, tr/// (the ‘translator’), which replaces every character in a single pass.
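The difference between the two approaches can be seen on a short test string. The sketch below is illustrative only; the sequence AACCGGTT and the variable names are assumptions chosen to make the effect easy to follow.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = 'AACCGGTT';

# Naive approach: four global substitutions, one after another.
my $naive = $dna;
$naive =~ s/A/T/g;   # AACCGGTT -> TTCCGGTT
$naive =~ s/T/A/g;   # every T, old or new, becomes A -> AACCGGAA
$naive =~ s/C/G/g;   # -> AAGGGGAA
$naive =~ s/G/C/g;   # every G, old or new, becomes C -> AACCCCAA

# tr/// maps each character exactly once, so nothing is overwritten.
(my $complement = $dna) =~ tr/ACGT/TGCA/;

print "naive s///: $naive\n";        # AACCCCAA (not the complement)
print "tr///:      $complement\n";   # TTGGCCAA (the complement)
```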
Example 1-3
#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";

# Make a new, reversed copy of the DNA
$revcom = reverse $DNA;

# See the text for a discussion of tr///
$revcom =~ tr/ACGTacgt/TGCAtgca/;

# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";
print "$revcom\n";
exit;

Note that the translator replaces each character in the first list with the corresponding character in the second list. In this example both uppercase and lowercase bases are translated.

Results of running example 1-3:
Here is the starting DNA:
ACGGGAGGACGGGAAAATTACTACGGCATTAGC
Here is the reverse complement DNA:
GCTAATGCCGTAGTAATTTTCCCGTCCTCCCGT
Example 1-4 : Reading protein sequences from a file
• Use ‘open’.
– Use a character string variable.
– open(FILEPOINTER, $filename);
• Read in the contents.
– Use angle brackets, ‘<FILEPOINTER>’.
– Need to create a loop to read in all lines
• Read from a file named in the command line.
– Use angle brackets, ‘<>’.
– Do not need to create a filepointer.
– Read into an array
– Need to create a loop to read in all lines of the array
Example 1-4
#!/usr/bin/perl -w

$longprotein = '';

# Example 4-5 Reading protein sequence data from a file
# Usage: perl example1-4.pl

# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';

# First we have to "open" the file, and associate
# a "filehandle" with it. We choose the filehandle
# PROTEINFILE for readability.
open(PROTEINFILE, $proteinfilename);

# Now we do the actual reading of the protein sequence from the
# file by using the angle brackets < and > to get the input from the
# filehandle. We store the data into our variable $protein.
while ($protein = <PROTEINFILE>) {
    $longprotein .= $protein;
}

# Now that we've got our data, we can close the file.
close PROTEINFILE;

# Print the protein onto the screen
print "Here is the protein:\n\n";
print $longprotein;
exit;

The filename is set by assigning the string variable $proteinfilename. The ‘while’ loop reads from the file one line at a time. Each line from the file is concatenated onto the end of the previous string. It is good programming practice to close the file pointer when done. Note that each line of the file appears on its own line in the output, because the newlines are kept.

Results of running example 1-4:
Here is the protein:
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
Example 1-4
#!/usr/bin/perl -w

$longprotein = '';

# Example 4-5 Reading protein sequence data from a file
# Usage: perl example1-4b.pl filename

# The filename of the file containing the protein sequence data
# is in the command line. The '<>' is shortcut for <ARGV>.
# <ARGV> treats the @ARGV array as a list of
# filenames, returning the contents
# of those files one line at a time. The contents of those files are
# available to the program, using the angle brackets <>,
# without a filehandle.
@data_from_file = <>;

# Using the foreach loop, we access the data from the array,
# one line at a time. Removing the 'newline' from the string,
# concatenate to the string variable, making one long protein
# string.
foreach (@data_from_file) {
    chomp $_;
    $longprotein .= $_;
}

# Print the protein onto the screen
print "Here is the protein:\n";
print $longprotein."\n";
exit;

The filename is given as an argument on the command line. This is much more convenient than writing a different perl script for each file we need to open. The command:
@data_from_file = <>;
treats each argument on the command line as a filename, opens each file, and then reads each line of the file into the array. Creating a filehandle is not needed.
The ‘foreach’ loop then retrieves each element of the array, discards the newline at the end, then concatenates the string onto the end of the string variable $longprotein.

Results of running example 1-4b:
Here is the protein:
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQDSVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQGLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
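Both versions of Example 1-4 can be combined into one reusable routine. The sketch below is an illustration rather than the workshop's own code: the subroutine name read_sequence_file is hypothetical, and the three-argument open with a lexical filehandle and ‘die’ on failure is a common modern idiom substituted for the bareword filehandle used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read every line of a file into one string,
# with the line endings removed.
sub read_sequence_file {
    my ($filename) = @_;

    # Three-argument open with a lexical filehandle; $! says why open failed.
    open(my $fh, '<', $filename)
        or die "Cannot open file \"$filename\": $!\n";

    my $sequence = '';
    while (my $line = <$fh>) {
        chomp $line;            # remove the trailing newline, if any
        $sequence .= $line;
    }
    close $fh;
    return $sequence;
}

# Usage: perl read_sequence.pl NM_021964fragment.pep
if (@ARGV) {
    print read_sequence_file($ARGV[0]), "\n";
}
```

Checking the return value of open means a misspelled filename stops the program with a message instead of silently producing an empty protein.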
Example 1-5 : Searching for motifs in DNA or proteins
• Prompt the user for filename and protein strings
– Specify a filename to open
– open(FILEPOINTER, $filename);
• Read in the contents.
– Read the lines of the file into an array.
– Concatenate all lines of the array into a scalar variable.
– Remove all newlines and blanks from the scalar variable.
• Compare the motif entered from the terminal to the protein string.
– Use regular expression comparison.
– Exit the program when the motif contains only whitespace.
Example 1-5
#!/usr/bin/perl -w
# Example 5-3 Searching for motifs

# Ask the user for the filename of the file containing
# the protein sequence data, and collect it from the keyboard
print "Please type the filename of the protein sequence data: ";
$proteinfilename = <STDIN>;

# Remove the newline from the protein filename
chomp $proteinfilename;

# open the file, or exit
unless ( open(PROTEINFILE, $proteinfilename) ) {
    print "Cannot open file \"$proteinfilename\"\n\n";
    exit;
}

# Read the protein sequence data from the file, and store it
# into the array variable @protein
@protein = <PROTEINFILE>;

# Close the file - we've read all the data into @protein now.
close PROTEINFILE;

# Put the protein sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
$protein = join( '', @protein);

# Remove whitespace
$protein =~ s/\s//g;

The filename is given on standard input in answer to the question:
$proteinfilename = <STDIN>;
The ‘unless’ condition checks whether the file can be opened, exiting if not:
unless ( open(PROTEINFILE, $proteinfilename) )
Each line of the file is then put into an array, @protein, after which the filehandle is closed:
@protein = <PROTEINFILE>;
By using ‘join’, the lines in the array are put into one long character string, including newline characters:
$protein = join( '', @protein);
All whitespace, including newlines, tabs and blanks, is then removed:
$protein =~ s/\s//g;
Example 1-5 (cont’d)
# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
# Exit if no motif is entered.
do {
    print "Enter a motif to search for: ";
    $motif = <STDIN>;

    # Remove the newline at the end of $motif
    chomp $motif;

    # Look for the motif
    if ( $protein =~ /$motif/ ) {
        print "I found it!\n\n";
    } else {
        print "I couldn\'t find it.\n\n";
    }

# exit on an empty user input
} until ( $motif =~ /^\s*$/ );

# exit the program
exit;

The loop controls the search for the character string in the entire protein string.
The variable $motif is assigned the character string typed in the shell:
$motif = <STDIN>;
The newline character is removed from the end of the string:
chomp $motif;
The character string $motif is compared to the protein string for a match:
$protein =~ /$motif/
When the user types nothing but whitespace, the program exits:
until ( $motif =~ /^\s*$/ );

Results from running example1-5.pl:
Please type the filename of the protein sequence data: NM_021964fragment.pep
Enter a motif to search for: SVLQ
I found it!
Enter a motif to search for: sqlv
I couldn’t find it.
Enter a motif to search for: QDSV
I found it!
Enter a motif to search for: HERLPQGLQ
I found it!
Enter a motif to search for:
I couldn’t find it.
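The matching step can be separated from the prompting loop so that it can be reused and tested without a keyboard. The following is a sketch under stated assumptions: the subroutine name find_motif is hypothetical, the protein string is a fragment copied from the example output, and the special array @- (whose first element holds the offset where the last match began) is used to report a position, which the original program does not do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: search $protein for the regular expression in $motif.
# Returns the matched text and its 0-based start position, or an empty list.
sub find_motif {
    my ($protein, $motif) = @_;
    return if $motif =~ /^\s*$/;       # nothing to search for
    if ($protein =~ /($motif)/) {
        return ($1, $-[0]);            # $-[0] is the offset where the match starts
    }
    return;
}

my $protein = 'MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQDSVLQDRSMPHQEILAAD';

for my $motif ('SVLQ', 'sqlv', 'GG.*GG') {
    if (my ($match, $start) = find_motif($protein, $motif)) {
        print "Found \"$match\" at position $start\n";
    } else {
        print "Could not find \"$motif\"\n";
    }
}
```

As in the original program, the motif is treated as a regular expression, so text typed by the user that is not a valid pattern would stop the program with an error.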
Exercises for Section 1
1. Explore the sensitivity of programming languages to errors of syntax. Try removing the semicolon from the end
of any statement of one of our working programs and examining the error messages that result, if any. Try
changing other syntactical items: add a parenthesis or a curly brace; misspell some command, like "print" or
some other reserved word; just type in, or delete, anything. Programmers get used to seeing such errors; even
after getting to know the language well, it is still common to have some syntax errors as you gradually add code
to a program. Notice how one error can lead to many lines of error reporting. Is Perl accurately reporting the
line where the error is?
2. Write a program that prints DNA (which could be in upper- or lowercase originally) in lowercase (acgt); write
another that prints the DNA in uppercase (ACGT). Use the function tr///.
3. Do the same thing as Exercise 2, but use the string directives \U and \L for upper- and lowercase. For instance,
print "\U$DNA" prints the data in $DNA in uppercase.
4. Prompt the user to enter two (short) strings of DNA. Concatenate the two strings of DNA by appending the
second to the first using the .= assignment operator. Print the two strings as concatenated, and then print the
second string lined up over its copy at the end of the concatenated strings. For example, if the input strings are
AAAA and TTTT, print:
AAAATTTT
    TTTT
5. Write a program to calculate the reverse complement of a strand of DNA. Do not use the s/// or the tr functions.
Use the substr function, and examine each base one at a time in the original while you build up the reverse
complement. (Hint: you might find it easier to examine the original right to left, rather than left to right,
although either is possible.)
6. Write a program to report how GC-rich some sequence is. (In other words, just give the percentage of G and C
in the DNA.)
7. Modify Example 1-5 to not only find motifs by regular expressions but to print out the motif that was found. For
example, if you search, using regular expressions, for the motif EE.*EE, your program should print
EETVKNDEE. You can use the special variable $&. After a successful pattern match, this special variable is set
to hold the text that was matched.
8. Write a program that switches two bases in a DNA string at specified positions. (Hint: you can use the Perl
functions substr or slice.)
Section 2 : Mutations, Randomization and Modules
Example 2-1 : Counting bases in DNA string, using subroutines.
• Subroutines are very efficient
– Write once, use many times.
– Routines which have a pervasive utility may be stored in a library for future use.
• Lexical scoping using the ‘my’ declaration
– Important to understand the scope of variables
– Use ‘my’ to declare variables within the scope of the enclosing block of code
– The same variable name may then be used in different code segments
– Declare ‘use strict’ to require that variables are declared with ‘my’
• Use the special array @_ to pass arguments to a subroutine
– my($var1, $var2, $var3) = @_;
– This will assign the values of arguments passed to the subroutine to the named
variables
– A common mistake is forgetting to assign from @_
• The variables will then not have their passed values
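The @_ mistake mentioned in the list above is easy to reproduce. This is a minimal sketch; both subroutine names are hypothetical and exist only to contrast the two cases.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Correct: copy the arguments out of @_ into lexical variables.
sub join_fragments {
    my ($first, $second) = @_;
    return $first . $second;
}

# Mistake: the lexical variables are declared but never assigned from @_,
# so they are undefined inside the subroutine.
sub join_fragments_broken {
    my ($first, $second);
    return defined $first ? $first . $second : 'nothing was passed in';
}

print join_fragments('ACGT', 'TTAA'), "\n";          # ACGTTTAA
print join_fragments_broken('ACGT', 'TTAA'), "\n";   # nothing was passed in
```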
Example 2-1
#!/usr/bin/perl -w
# Example 2-1 Counting the number of G's in some DNA on the
# command line

use strict;

# Collect the DNA from the arguments on the command line
# when the user calls the program.
# If no arguments are given, print a USAGE statement and exit.
# $0 is a special variable that has the name of the program.
my($USAGE) = "$0 DNA\n\n";

# @ARGV is an array containing all command-line arguments.
#
# If it is empty, the test will fail and the print USAGE and exit
# statements will be called.
unless(@ARGV) {
    print $USAGE;
    exit;
}

# Read in the DNA from the argument on the command line.
my($dna) = $ARGV[0];

The command ‘use strict’ requires every variable to be declared, here with ‘my’, which limits the scope of each variable.
Declare a string variable to hold the usage line.
The ‘unless’ condition makes sure there are arguments on the command line. The special array @ARGV always exists, but it is empty when no arguments are present on the command line.
Assign the character string given on the command line to the variable $dna. The first element of the argument array, in this case the only argument, is $ARGV[0]. Individual elements of an array are referenced by the syntax $array1[n].
Example 2-1 (cont’d)
# Call the subroutine that does the real work, and collect the result.
my($num_of_Gs) = countG ( $dna );

# Report the result and exit.
print "\nThe DNA $dna has $num_of_Gs G\'s in it!\n\n";
exit;

########################################
# Subroutines for Example 2-1
########################################

sub countG {
    # return a count of the number of G's in the argument $dna
    # initialize arguments and variables
    my($dna) = @_;
    my($count) = 0;

    # Use tr/// to count nucleotides in DNA
    $count = ( $dna =~ tr/Gg//);
    return $count;
}

The subroutine ‘countG’ takes a character string as an argument and returns a number.
The line “my($num_of_Gs) = countG($dna);” passes the dna sequence to the subroutine ‘countG’ and assigns the returned number to the variable ‘$num_of_Gs’.
Inside the subroutine, a new variable $dna, lexically scoped only to the subroutine, is assigned the value passed.
The variable $count is initialized to the value ‘0’.
The translate of the dna string, $dna =~ tr/Gg//, matches every upper or lower case G. Because the replacement list is empty, the string itself is left unchanged.
tr/// returns the number of characters it matched. That count is assigned to the variable $count and returned.

Results from running example2-1.pl:
perl example2-1.pl CGGATTTAGCGCGT
The DNA CGGATTTAGCGCGT has 5 G's in it!
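The same counting idiom extends to all four bases. The sketch below is an illustration, not workshop code: the subroutine name count_bases and the choice to return a hash are assumptions. It also shows that tr/// with an empty replacement list leaves the string unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count each base, ignoring case.
# tr/// with an empty replacement list only counts; it does not modify $dna.
sub count_bases {
    my ($dna) = @_;
    my %count = (
        A => ($dna =~ tr/Aa//),
        C => ($dna =~ tr/Cc//),
        G => ($dna =~ tr/Gg//),
        T => ($dna =~ tr/Tt//),
    );
    return %count;
}

my $dna   = 'CGGATTTAGCGCGT';
my %count = count_bases($dna);
print "$_: $count{$_}\n" for sort keys %count;
print "The DNA is unchanged: $dna\n";
```

For the sequence used in Example 2-1 this reports 5 for G, in agreement with the result shown above.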
Example 2-2 : Creating mutant DNA using Perl’s random number generator
• Simulate mutating DNA using a random number generator
– Randomly pick a nucleotide in a DNA string
– Randomly pick a base from the four, A, C, T, G
– Replace the picked nucleotide in the selected position of the DNA string with the
randomly selected base
• Random number algorithms produce only pseudo-random numbers
– With the same seed, a random number generator will produce the same series of numbers
– Algorithms are designed to give an even distribution of values
• Random numbers require a ‘seed’
– Should be selected randomly, as well
– Different seed values will produce different sequences of random numbers
– If program security or privacy (for example, patient records) is important, you should
consult the Perl documentation, and the Math::Random and Math::TrulyRandom
modules from CPAN
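The effect of the seed can be checked directly. In this sketch the helper name draw_with_seed and the seed values 42 and 43 are arbitrary assumptions; the point is only that reseeding with the same value replays the same series.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: seed the generator, then draw $n integers in 0..3.
sub draw_with_seed {
    my ($seed, $n) = @_;
    srand($seed);
    return map { int(rand(4)) } 1 .. $n;
}

my @first  = draw_with_seed(42, 10);
my @second = draw_with_seed(42, 10);   # same seed: the same series again
my @third  = draw_with_seed(43, 10);   # different seed: very likely a different series

print "seed 42: @first\n";
print "seed 42: @second\n";
print "seed 43: @third\n";
```

A fixed seed is useful when a simulation has to be repeatable, for instance while debugging; a seed that changes from run to run is what you want otherwise.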
Example 2-2
#!/usr/bin/perl -w
# Example 2-2 Mutate DNA
# using a random number generator to randomly select bases to mutate

use strict;
use warnings;

# Declare the variables
# The DNA is chosen to make it easy to see mutations:
my $DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA';

# $i is a common name for a counter variable, short for "integer"
my $i;
my $mutant;

# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);

$mutant = mutate($DNA);

print "\nMutate DNA\n\n";
print "\nHere is the original DNA:\n\n";
print "$DNA\n";
print "\nHere is the mutant DNA:\n\n";
print "$mutant\n";

# Let's put it in a loop and watch that bad boy accumulate mutations:
print "\nHere are 10 more successive mutations:\n\n";
for ($i=0 ; $i < 10 ; ++$i) {
    $mutant = mutate($mutant);
    print "$mutant\n";
}
exit;

This is the main program, which seeds the random number algorithm and calls the subroutine mutate().
The call to srand() uses the seed ‘time|$$’, which ORs the current time with the process id so that the seed is very likely to differ from run to run. This is not a very secure method but it will do for our purposes.
The argument to mutate() is the current DNA string.
Example 2-2 (cont’d)
########################################################
# Subroutines for Example 2-2
########################################################

# A subroutine to perform a mutation in a string of DNA
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub mutate {
    my($dna) = @_;
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase) = randomnucleotide(@nucleotides);

    # Insert the random nucleotide into the random position in the DNA
    # The substr arguments mean the following:
    # In the string $dna at position $position change 1 character to
    # the string in $newbase
    substr($dna,$position,1,$newbase);
    return $dna;
}

The subroutine mutate() takes the argument from the special array @_ and assigns it to the variable $dna.
The array @nucleotides is initialized with the values which are our nucleotides.
The subroutine randomposition() takes the current dna string and returns a position within the string.
The subroutine randomnucleotide() takes our array of bases and returns a randomly selected value.
Finally, the Perl built-in function substr() takes the DNA string, the random position, the length of the substring to replace (here it is 1) and the replacement string, and changes $dna in place. The new string in $dna is then returned.
Example 2-2 (cont’d)
# A subroutine to randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomelement {
    my(@array) = @_;

    # Here the code is succinctly represented rather than
    # "return $array[int rand scalar @array];"
    return $array[rand @array];
}

# randomnucleotide
#
# A subroutine to select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomnucleotide {
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array.
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}

randomnucleotide() passes our array of bases to the function randomelement() and, in turn, returns the randomly chosen nucleotide.
randomelement() is given an array and returns a randomly selected element from it. How is this done? rand() expects a scalar value, so the array @array is evaluated in a scalar context, which gives the size of @array. Perl was designed to take as array subscripts the integer part of a floating-point value. Here $array[rand @array] returns the element of the array associated with the subscript randomly chosen from 0 to n-1, where n is the length of the array.
Example 2-2 (cont’d)
# randomposition
#
# A subroutine to randomly select a position in a string.
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomposition {
    my($string) = @_;

    # Notice the "nested" arguments:
    #
    # $string is the argument to length
    # length($string) is the argument to rand
    # rand(length($string)) is the argument to int
    # int(rand(length($string))) is the argument to return
    #
    # rand returns a decimal number between 0 and its argument.
    # int returns the integer portion of a decimal number.
    #
    # The whole expression returns a random number
    # between 0 and length-1,
    # which is how the positions in a string are numbered in Perl.
    #
    return int rand length $string;
}

randomposition() takes a string argument and calculates a random position within the string. It is very concise and useful. The return command could have been written:
return (int (rand (length $string)));
Certainly, this is more understandable, but I believe there is no loss of clarity, as in Perl we can write these as a sequence of function calls without parentheses. Chaining single-argument functions is often done in Perl.
rand() takes the length as an argument and calculates a floating point number that is at least 0 and less than the length. int() truncates the floating point number to an integer in the range 0 to length-1.

Results from running example2-2.pl:
Mutate DNA
Here is the original DNA:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Here is the mutant DNA:
AAAAAAAAAAAAAAAAAAAAAAAGAAAAAA
Here are 10 more successive mutations:
AAAAAAAAAAAAAAAAAAAAAAAGAAAAAG
AAAAAAAAAAAAAAAAAAAACAAGAAAAAG
AAAAAAAAAAAAAAAAAAAACAAGAAAAAG
CAAAAAAAAAAAAAAAAAAACAAGAAAAAG
CAAAAAAAAAAAAAAAAAAACAAGATAAAG
CAAAAAAAAAAAGAAAAAAACAAGATAAAG
CAAAAAAAAAAAGAACAAAACAAGATAAAG
GAAAAAAAAAAAGAACAAAACAAGATAAAG
GAAAAAAAAAAAGAACAAAAGAAGATAAAG
GAAAAAAAAAAAGAACAAAAGCAGATAAAG
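In the results above, two successive lines are identical: mutate() can pick a replacement base that equals the base already at that position. If every call should change the sequence, one possible variation is sketched below. The subroutine name mutate_to_different_base is hypothetical, this is not the workshop's mutate(), and it assumes a non-empty DNA string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of mutate(): the replacement base is drawn only from
# the three bases that differ from the base currently at that position.
# Assumes $dna is not empty.
sub mutate_to_different_base {
    my ($dna) = @_;
    my $position = int rand length $dna;
    my $current  = uc substr($dna, $position, 1);
    my @choices  = grep { $_ ne $current } ('A', 'C', 'G', 'T');
    substr($dna, $position, 1, $choices[rand @choices]);
    return $dna;
}

# Seed as in Example 2-2.
srand(time | $$);

my $dna    = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA';
my $mutant = mutate_to_different_base($dna);
print "$dna\n$mutant\n";
```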
Example 2-3 : Translating DNA into proteins … using modules
• First transcribe DNA to RNA
• Translate RNA to amino acids
– Four bases, A, U, C, G
– Codon defined by a sequence of three bases
– 64 possible combinations, 4^3.
– There are only 20 amino acids and a stop
– Redundancy with codons: most amino acids are represented by more than one codon
– Refer to Table 1 on page ??
• Use subroutine defined in BeginPerlBioinfo.pm
– Specify module filename in perl code
– If not installed in a known library path, need “use lib ‘pathname’” to specify where to find the
module
• Subroutine codon2aa() returns a single character amino acid from the 3-character
codon input
• Need to write a loop which will grab 3 characters while stepping through the
sequence
Example 2-3
#
# codon2aa
#
# A subroutine to translate a DNA 3-character
# codon to an amino acid
# Using hash lookup
sub codon2aa {
    my($codon) = @_;

    $codon = uc $codon;

    my(%genetic_code) = (
    'TCA' => 'S', # Serine
    'TCC' => 'S', # Serine
    'TCG' => 'S', # Serine
    'TCT' => 'S', # Serine
    'TTC' => 'F', # Phenylalanine
    'TTT' => 'F', # Phenylalanine
    'TTA' => 'L', # Leucine
    'TTG' => 'L', # Leucine
    'TAC' => 'Y', # Tyrosine
    'TAT' => 'Y', # Tyrosine
    'TAA' => '_', # Stop
    'TAG' => '_', # Stop
    'TGC' => 'C', # Cysteine
    'TGT' => 'C', # Cysteine
    'TGA' => '_', # Stop
    'TGG' => 'W', # Tryptophan
    'CTA' => 'L', # Leucine
    'CTC' => 'L', # Leucine
    'CTG' => 'L', # Leucine
    'CTT' => 'L', # Leucine
    'CCA' => 'P', # Proline
    'CCC' => 'P', # Proline
    'CCG' => 'P', # Proline
    'CCT' => 'P', # Proline
    'CAC' => 'H', # Histidine
    'CAT' => 'H', # Histidine
    'CAA' => 'Q', # Glutamine
    'CAG' => 'Q', # Glutamine
    'CGA' => 'R', # Arginine
    'CGC' => 'R', # Arginine
    'CGG' => 'R', # Arginine
    'CGT' => 'R', # Arginine
    'ATA' => 'I', # Isoleucine
    'ATC' => 'I', # Isoleucine
    'ATT' => 'I', # Isoleucine
    'ATG' => 'M', # Methionine
    'ACA' => 'T', # Threonine
    'ACC' => 'T', # Threonine
    'ACG' => 'T', # Threonine
    'ACT' => 'T', # Threonine
    'AAC' => 'N', # Asparagine
    'AAT' => 'N', # Asparagine
    'AAA' => 'K', # Lysine
    'AAG' => 'K', # Lysine
    'AGC' => 'S', # Serine
    'AGT' => 'S', # Serine
    'AGA' => 'R', # Arginine
    'AGG' => 'R', # Arginine
    'GTA' => 'V', # Valine
    'GTC' => 'V', # Valine
    'GTG' => 'V', # Valine
    'GTT' => 'V', # Valine
    'GCA' => 'A', # Alanine
    'GCC' => 'A', # Alanine
    'GCG' => 'A', # Alanine
    'GCT' => 'A', # Alanine
    'GAC' => 'D', # Aspartic Acid
    'GAT' => 'D', # Aspartic Acid
    'GAA' => 'E', # Glutamic Acid
    'GAG' => 'E', # Glutamic Acid
    'GGA' => 'G', # Glycine
    'GGC' => 'G', # Glycine
    'GGG' => 'G', # Glycine
    'GGT' => 'G', # Glycine
    );

    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }
    else{
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}

This subroutine takes, as an argument, a three character DNA sequence and returns the single character representation of the amino acid. The data type used is a hash lookup. The condition
if (exists $genetic_code{$codon})
searches for a match between the 3 characters of the codon and the list of keys in the hash. The value associated with the key, if found, is returned. Otherwise an error is reported and the program terminates. This subroutine is included in the module BeginPerlBioinfo.pm, which will be used, along with other subroutines, throughout the rest of the workshop.
Example 2-3
#!/usr/bin/perl -w
# Example 2-3 : Translate DNA into protein

use lib '../ModLib/';
use strict;
use warnings;
use BeginPerlBioinfo; # This does not require the ‘.pm’ in the ‘use’ command

# Initialize variables
my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
my $protein = '';
my $codon;

# Translate each three-base codon into an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
    $codon = substr($dna,$i,3);
    $protein .= codon2aa($codon);
}

print "I translated the DNA\n\n$dna\n\n into the protein\n\n$protein\n\n";
exit;

This is the perl code which, with only a few lines, translates DNA into a protein sequence. The command ‘use lib …’ instructs the perl compiler to add a directory to the search path for necessary libraries, like BeginPerlBioinfo.pm. BeginPerlBioinfo.pm is a part of the book Beginning Perl for Bioinformatics, by James Tisdall.
The ‘for’ loop steps through the dna string sequence by threes, starting at index 0 (0, 3, 6, 9, …):
CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC
The 3 character substring is assigned to the $codon variable by the perl command ‘substr’. Then the amino acid returned by the subroutine codon2aa() is appended to the end of the current protein string, $protein.

Results from running example2-3.pl:
I translated the DNA
CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC
into the protein
RRLRTGLARVGR
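For readers who do not have BeginPerlBioinfo.pm installed, the translation loop can be tried with a self-contained stand-in. The sketch below is an assumption-laden illustration, not the module's code: it builds the standard genetic code from a 64-letter string ordered by the bases T, C, A, G, uses ‘_’ for stop codons as codon2aa does, and the subroutine name translate_dna is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the standard genetic code as a hash. Codons are generated in
# T, C, A, G order for each position; '_' marks a stop codon.
my @bases       = ('T', 'C', 'A', 'G');
my @amino_acids = split //, 'FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %genetic_code;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $genetic_code{"$first$second$third"} = shift @amino_acids;
        }
    }
}

# Hypothetical helper: translate a DNA string codon by codon.
# A trailing partial codon (1 or 2 bases) is ignored.
sub translate_dna {
    my ($dna) = @_;
    $dna = uc $dna;
    my $protein = '';
    for (my $i = 0; $i < length($dna) - 2; $i += 3) {
        my $codon = substr($dna, $i, 3);
        die "Bad codon \"$codon\"\n" unless exists $genetic_code{$codon};
        $protein .= $genetic_code{$codon};
    }
    return $protein;
}

print translate_dna('CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC'), "\n";
```

Run on the DNA from Example 2-3, this prints the same protein, RRLRTGLARVGR.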
Section 2 : BioPerl and CPAN
Example 2-4 : Installing and testing bioperl
• http://bioperl.org
• The Bioperl Project is an international association of developers of open
source Perl tools for bioinformatics, genomics and life science research.
• The Bioperl server provides an online resource for modules, scripts, and web
links for developers of Perl-based software for life science research.
• Bioperl modules and documentation are very extensive
• Good examples to illustrate uses
• Will discuss installation of bioperl
• Also take a quick look at some test scripts
• In Chapter 9 of Mastering Perl for Bioinformatics, James Tisdall gives a
personal account of installing bioperl.
– Depends on installing using CPAN shell
– Linux installations vary from site to site, so it is advised that someone with
administrator privileges install bioperl
Example 2-4 : Installing and testing bioperl
• My own experiences were slightly different
– Download the core bioperl install file, version 1.4 (the most recent at the time of writing)
– Follow the make instructions included in the INSTALL documentation
– Carefully follow the ‘make test’ instruction
• Make sure you have an internet connection
– Note where the test script fails
• You will see module names like LWP, IO::String, etc.
– I noticed that LWP and IO::String were involved in quite a few failures
• Here is where I installed the missing modules using the CPAN shell
– >> perl -MCPAN -e shell
– At the CPAN prompt, install the missing module
• cpan> install LWP
– After exiting the CPAN shell, run ‘make test’ again to see whether there are fewer failures
• After concluding that the remaining failures won’t impede using bioperl, run ‘make
install’
• This usually puts the modules in /usr/lib/perl5/5.x.x/site_perl, on Linux
systems
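Before running ‘make test’, it can help to see which prerequisite modules Perl can already find. The following is only a minimal sketch: the helper name module_available is hypothetical, and the module list is simply the set of names mentioned above, so adjust it to whatever your own test run reports.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # IO::String -> IO/String.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# The module names below are assumptions taken from the notes above.
for my $module (qw(LWP IO::String Bio::Perl)) {
    printf "%-12s %s\n", $module,
        module_available($module) ? 'found' : 'MISSING';
}
```

Any module reported as MISSING is a candidate for ‘install’ at the CPAN prompt.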
These simple tests check whether bioperl is installed correctly.
Example bptest0.pl
Test ‘bptest0.pl’ simply checks if Perl can find Bio::Perl. If it doesn’t complain,
we are one step closer.
#!/usr/bin/perl -w
use Bio::Perl;
exit;
######################################################
Example bptest1.pl
In the file ‘bptest1.pl’, we need internet access. The perl program retrieves a
swissprot sequence and prints it to a file, ‘roa1.fasta’, in FASTA format.
#!/usr/bin/perl -w
# Example to Test the Bioperl installation
use Bio::Perl;
# Must use this script with an internet connection
$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
write_sequence(">roa1.fasta", 'fasta', $seq_object);
exit;
######################################################
Example bptest2.pl
The last perl script uses NCBI to BLAST a sequence and saves the results to a
file. This should be used judiciously as we don’t want to abuse the computing
cycles of NCBI. These requests should be done for individual searches.
Download the blast package locally to do large numbers of BLAST searches.
#!/usr/bin/perl -w
# Example to Test the Bioperl installation
use Bio::Perl;
# Must use this script with an internet connection
$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
$blast_result = blast_sequence($seq_object);
write_blast(">roa1.blast", $blast_result);
exit;
Section 2 : Mutations, Randomization and Bioperl
Exercises for Section 2
1. Write a subroutine to concatenate two strings of DNA.
2. Write a subroutine to report the percentage of each nucleotide in DNA. Count the number of each nucleotide,
divide by the total length of the DNA, then multiply by 100 to get the percentage. Your arguments should be the
DNA and the nucleotide you want to report on. The int function can be used to discard digits after the decimal
point, if needed.
3. Write a module that contains subroutines that report various statistics on DNA sequences, for instance length,
GC content, presence or absence of poly-T sequences (long stretches of mostly T’s at the 5’ (left) end of many
DNA sequences), or other measures of interest.
4. Write a program that asks you to pick an amino acid and then keeps (randomly) guessing which amino acid you
picked.
5. Write a program to mutate protein sequence, similar to the code in Example 2-2 that mutates DNA.
6. Write a program that uses Bioperl to perform a BLAST search at the NCBI web site, then use Bioperl to parse
the BLAST output.
Section 3 : Fasta Files and Frames
• Many different formats for saving sequence data and annotations in files
• Perhaps as many as 20 such formats for DNA
• Some of the most popular
– FASTA and BLAST (Basic Local Alignment Search Tool), both using the
FASTA format
– Genetic Sequence Data Bank (GenBank)
– European Molecular Biology Laboratory (EMBL)
• In this section we will focus on reading FASTA format
• Sample of FASTA format:
> sample dna | (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
Example 3-1: Reading FASTA format and extracting sequence data
• Write three subroutines and rely on regular expressions
• First subroutine will get data from a file
– Read the filename from the command-line argument
– open file
• if can’t open, print error message and exit
– read in data
– return array which contains each line of the file, @data
• Second subroutine extracts sequence data from fasta file
– Read in array of file data in fasta format
– Discard all header, blank and comment lines
– If first character of first line is >, discard it
– Read in the rest of the file, joined in a scalar,
– edit out non-sequence data, white spaces
– return sequence
• Third subroutine writes the sequence data
– More often than not, the sequence to print is longer than most page widths
– Need to specify a length parameter to control the output
Example 3-1
get_file_data() takes a string argument, the filename. The unless condition
attempts to open a file. If unsuccessful, it prints an error statement and exits the
program.
If the file exists, it saves each line of the file, one by one, into the array
@filedata. It returns the array to the main routine, after closing the filehandle, of
course.
# get_file_data
#
# A subroutine to get data from a file given its filename
sub get_file_data {
    my($filename) = @_;
    use strict;
    use warnings;
    # Initialize variables
    my @filedata = ( );
    unless ( open (GET_FILE_DATA, $filename) ) {
        print STDERR "Cannot open file \"$filename\"\n\n";
        exit;
    }
    # Read every line of the file into the array
    @filedata = <GET_FILE_DATA>;
    close GET_FILE_DATA;
    return @filedata;
}
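The same idea can also be written with a lexical filehandle and the three-argument form of open. This is only a sketch of an alternative, not the version stored in BeginPerlBioinfo.pm; the name get_file_data_lexical is hypothetical, and it uses die instead of exit so that the caller sees the reason the open failed.

```perl
use strict;
use warnings;

# Hypothetical variant of get_file_data: lexical filehandle,
# three-argument open, and die (with the system error) on failure.
sub get_file_data_lexical {
    my ($filename) = @_;
    open(my $fh, '<', $filename)
        or die "Cannot open file \"$filename\": $!\n";
    my @filedata = <$fh>;    # list context: one array element per line
    close $fh;
    return @filedata;
}
```

It would be called in the same way, for instance my @file_data = get_file_data_lexical("sample.dna");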
Example 3-1
extract_sequence_from_fasta_data() takes the array that is the contents of the
fasta file. The foreach loop takes each of the elements of the array, a complete
line of the file, and assigns it to the variable $line. The different conditions help
us ignore the blank, comment and header lines:
• /^\s*$/ looks for lines that have just white spaces from beginning to end
• /^\s*#/ looks for lines which have the pound character, optionally preceded by white
spaces, as a comment line
• /^>/ looks for lines which have the ‘greater-than’ symbol at the
beginning of the line, the fasta header line
• all other lines are concatenated together into the $sequence variable
When all is done, all white space characters are removed:
$sequence =~ s/\s//g;
The sequence is returned to the calling routine.
# extract_sequence_from_fasta_data
#
# A subroutine to extract FASTA sequence data from an array
sub extract_sequence_from_fasta_data {
    my(@fasta_file_data) = @_;
    use strict;
    use warnings;
    # Declare and initialize variables
    my $sequence = '';
    foreach my $line (@fasta_file_data) {
        # discard blank line
        if ($line =~ /^\s*$/) {
            next;
        # discard comment line
        } elsif($line =~ /^\s*#/) {
            next;
        # discard fasta header line
        } elsif($line =~ /^>/) {
            next;
        # keep line, add to sequence string
        } else {
            $sequence .= $line;
        }
    }
    # remove non-sequence data (in this case, whitespace) from $sequence string
    $sequence =~ s/\s//g;
    return $sequence;
}
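To see how the three patterns sort the lines of a file, they can be tried on a few sample lines. The sketch below is illustrative only: the helper classify_fasta_line is hypothetical and the sample lines are made up, but the regular expressions are the ones used in the subroutine above, tested in the same order.

```perl
use strict;
use warnings;

# Hypothetical helper: label a line the way
# extract_sequence_from_fasta_data treats it.
sub classify_fasta_line {
    my ($line) = @_;
    return 'blank'   if $line =~ /^\s*$/;
    return 'comment' if $line =~ /^\s*#/;
    return 'header'  if $line =~ /^>/;
    return 'sequence';
}

# Made-up sample lines, one of each kind
for my $line ("\n", "  # a comment\n", "> sample dna\n", "agatggcggc\n") {
    (my $shown = $line) =~ s/\n//;
    printf "%-10s '%s'\n", classify_fasta_line($line), $shown;
}
```

Only lines labeled ‘sequence’ would be appended to $sequence.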
Example 3-1
Finally, the print_sequence() routine takes the cleaned string and an integer
specifying the number of characters to print per line. Again notice that the
variables are assigned from the special array, @_. The printing is done by the
for loop and the substr function: each print command writes one substring of the
complete string on a new line.
# print_sequence
#
# A subroutine to format and print sequence data
sub print_sequence {
    my($sequence, $length) = @_;
    use strict;
    use warnings;
    # Print sequence in lines of $length
    for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
        print substr($sequence, $pos, $length), "\n";
    }
}
Now that we have produced the subroutines needed for our program,
they have been installed in the BeginPerlBioinfo.pm module. Our
program may be succinctly written as in the code below. The final command
prints the sequence, passing the character string and the length to the
print_sequence subroutine.
Example 3-1
#!/usr/bin/perl
# Read a fasta file and extract the sequence data
use lib '../ModLib/'; # Must point to where BeginPerlBioinfo.pm resides
use strict;
use warnings;
use BeginPerlBioinfo;
# Declare and initialize variables
my @file_data = ( );
my $dna = '';
# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");
# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);
# Print the sequence in lines 25 characters long
print_sequence($dna, 25);
exit;
Output from example 3-1
agatggcggcgctgaggggtcttgg
gggctctaggccggccacctactgg
tttgcagcggagacgacgcatgggg
cctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagt
agttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgc
aggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcgga
…
gaagttcgggggccccaacaagatc
cggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgta
caagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgc
caaggccccgccggccactgcccac
ccaacagcagccacagccatcacag
aagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagt
caaggagcctcctgaggctacagcc
acacctgagccactctcagatgagg
accta
Example 3-2: Translate a DNA sequence in all six reading frames
• Given a sequence of DNA, it is necessary to examine all six reading frames of
the DNA to find the coding regions the cell uses to make proteins
• Genes very often occur in pieces that are spliced together during the
transcription/translation process
• Since the codons are three bases long, the translation happens in three
"frames," starting at the first base, or the second, or perhaps the third. The
reverse complement strand gives three more, for six in all.
• Each starting place gives a different series of codons, and, as a result, a
different series of amino acids.
• Examine all six reading frames of a DNA sequence and look at the resulting
protein translations
• Stop codons are definite breaks in the DNA => protein translation process
• If a stop codon is reached, the translation stops
• We need some code to represent the reverse complement of the DNA
• Need to break both strings into the representative frames
• Translate each frame of DNA to protein
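The effect of the starting base on the series of codons can be seen with a short sequence. This is a minimal sketch, assuming a hypothetical helper codons_in_frame that only splits the DNA into complete codons; it does not translate them, and the test sequence is just the first eleven bases of the DNA used in Example 2-3.

```perl
use strict;
use warnings;

# Hypothetical helper: list the complete codons of $dna
# read in frame 1, 2 or 3 (frame = 1-based starting base).
sub codons_in_frame {
    my ($dna, $frame) = @_;
    my @codons;
    for (my $i = $frame - 1; $i + 3 <= length($dna); $i += 3) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

my $dna = 'CGACGTCTTCG';
for my $frame (1 .. 3) {
    print "Frame $frame: ", join(' ', codons_in_frame($dna, $frame)), "\n";
}
```

Each frame yields a different list of codons from the same bases, which is why each frame translates to a different protein. Applying the same helper to the reverse complement would give frames 4 to 6.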
Example 3-2
We are going to reuse our old code from Section 1, revcom(). We have to
rewrite it as a subroutine.
# revcom
#
# A subroutine to compute the reverse complement of DNA sequence
sub revcom {
    my($dna) = @_;
    # First reverse the sequence
    # (scalar assignment, so reverse() reverses the string itself)
    my $revcom = reverse($dna);
    # Next, complement the sequence, dealing with upper and lower case
    # A->T, T->A, C->G, G->C
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    return $revcom;
}
Now we need to design the subroutine which will break the DNA strings
into our frames and translate the string into proteins. Our old perl command
substr() should do the trick for taking apart our frames. The unless($end)
condition checks for a value in the variable $end; if there is no value, it
sets the end value to the length of the sequence. Positions are counted from 1,
while substr() counts from 0, so the offset passed to substr() is start - 1. The
length of the desired sequence doesn’t change with the change in indices, since:
(end - 1) - (start - 1) + 1 = end - start + 1
To translate to peptides we revisit our codon2aa() subroutine from Section
2. It has been included in a subroutine dna2peptide() which is already in
BeginPerlBioinfo.pm.
# translate_frame
#
# A subroutine to translate a frame of DNA
sub translate_frame {
    my($seq, $start, $end) = @_;
    my $protein;
    # To make the subroutine easier to use, you won't need to specify
    # the end point--it will just go to the end of the sequence
    # by default.
    unless($end) {
        $end = length($seq);
    }
    # Finally, calculate and return the translation
    return dna2peptide ( substr ( $seq, $start - 1, $end - $start + 1) );
}
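The index arithmetic in translate_frame can be checked on its own, without the translation step. The sketch below assumes a hypothetical helper, frame_substring, that does only the substr() part with the same 1-based start and end positions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bases from 1-based position $start
# through $end (inclusive); $end defaults to the end of the sequence.
sub frame_substring {
    my ($seq, $start, $end) = @_;
    $end = length($seq) unless $end;
    # substr() counts from 0, so the offset is $start - 1;
    # the number of bases is $end - $start + 1.
    return substr($seq, $start - 1, $end - $start + 1);
}

my $seq = 'ACGTACGT';
print frame_substring($seq, 1), "\n";       # whole sequence
print frame_substring($seq, 2), "\n";       # from the second base onward
print frame_substring($seq, 2, 4), "\n";    # bases 2 through 4
```

With start = 2 and end = 4 the call takes 4 - 2 + 1 = 3 bases beginning at offset 1.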
Example 3-2
Now that we have done all that work, and it appears that our subroutines will
provide us with the function we need, these routines are provided in
BeginPerlBioinfo.pm. So, the Perl program is a short exercise and is very
modular.
#!/usr/bin/perl
# Translate a DNA sequence in all six reading frames
use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;
# Initialize variables
my @file_data = ( );
my $dna = '';
my $revcom = '';
my $protein = '';
# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");
# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);
# Translate the DNA to protein in six reading frames
# and print the protein in lines 70 characters long
print "\n -------Reading Frame 1--------\n\n";
$protein = translate_frame($dna, 1);
print_sequence($protein, 70);
print "\n -------Reading Frame 2--------\n\n";
$protein = translate_frame($dna, 2);
print_sequence($protein, 70);
print "\n -------Reading Frame 3--------\n\n";
$protein = translate_frame($dna, 3);
print_sequence($protein, 70);
# Calculate reverse complement
$revcom = revcom($dna);
print "\n -------Reading Frame 4--------\n\n";
$protein = translate_frame($revcom, 1);
print_sequence($protein, 70);
print "\n -------Reading Frame 5--------\n\n";
$protein = translate_frame($revcom, 2);
print_sequence($protein, 70);
print "\n -------Reading Frame 6--------\n\n";
$protein = translate_frame($revcom, 3);
print_sequence($protein, 70);
exit;
Output from example 3-2
-------Reading Frame 1--------

RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAWDAAE
WSVQVRGSLAGVVRECAGSGDMEGDGSDPEPPDAGEDSKSENGENAPIYCICRKP
DINCFMIGCDNCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKS
RERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSP
QPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKI
RQKCRLRQCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRI
REDEGAVASSTVKEPPEATATPEPLSDEDL

…

-------Reading Frame 5--------

RSSSESGSGVAVASGGSLTVDDATAPSSSRMRPNFCDGCGCCWVGSGRRGLGRDS
EGVTGESEEGKYLYDSRARSWHWRSRHFCRILLGPPNFFMSRQKSQ_PQSSVRRH
ASHSPHMRADRLICCCCCW_CWLGVATKGCGEDLWGEAEPRASMAPTPVPDPARR
CRSGSGTGLLRPPPSSRGSLLSRSLPSRSRDFLCR_RISSLGSFSLHSRQYHSRM
ALAIFSVIRMQSPWNHSLQLSHPIMKQLMSGLRQMQ_MGAFSPFSDLLSSPASGG
SGSEPSPSISPLPAHSLTTPASDPRTCTDHSAASQAVAKAPTTTSASSHASQAAY
SYCAGPMRRLRCKPVGGRPRAPKTPQRRH

-------Reading Frame 6--------

GPHLRVAQVWL_PQEAP_LLMTPLPPHLHGCALTSVMAVAAVGWAVAGGALAGTL
RASLVRARKGSTCTIPGPAAGTGAAGTSAGSCWGPRTSSCPDRNHSDHSPQCADM
PHTHHTCGLTV_SAAAAAGDAGWVWPPRAAERICGAKQSPEQAWPQPLSLTLPGA
AGLDQGQASCALHPHPGAHCCPAHCHPAPVTSCADSESLAWGLSLCTPDSTTPGW
PWPSSQ_SGCSPHGTTHCSCHTRS_SS_CPVCGRCSRWAHSPHSRTCCPPRHLEA
LGLNHLPPYLRSRRTPSRPPPATREPAQTTRRRPRRLQRRPQLLPLLVTPPRQRT
PIAQAPCVVSAANQ_VAGLEPPRPLSAAI
Section 3 : FASTA file format
Exercises for Section 3
1. Add to the Perl program in Example 3-1 a translation from DNA to protein and print out the protein.
2. Write a subroutine that checks a string and returns true if it’s a DNA sequence. Write another that checks for
protein sequence data.
3. Write a program that can search by name for a gene in an unsorted array.
4. Write a subroutine that inserts an element into a sorted array. Hint: use the splice Perl function to insert the
element.
5. Write a subroutine that checks an array of data and returns true if it’s in FASTA format. Note that FASTA
expects the standard IUB/IUPAC amino acid and nucleic acid codes, plus the dash (-) that represents a gap of
unknown length. Also, the asterisk (*) represents a stop codon for amino acids. Be careful using an asterisk in
regular expressions; use a \* to escape it to match an actual asterisk.
Section 4 : GenBank (Genetic Sequence Data Bank) Files
• International repository of known genetic sequences from a variety of
organisms
• GenBank is a flat file, an ASCII text file, that is easily readable
• GenBank is referred to as a databank or data store, as distinct from a database.
Databases typically have:
– a relational structure
– associated indices
– links and a query language
• Perl modules and constructs are ideal for processing flat files
• For additional bioinformatics software, reference these web sites
– National Center for Biotechnology Information (NCBI)
– National Institutes of Health (NIH), http://proxy.lib.ohio-state.edu:2224
– European Bioinformatics Institute (EBI), http://www.ebi.ac.uk
– European Molecular Biology Laboratory (EMBL), http://www.embl-heidelberg.de/
• Let’s take a look at a short GenBank file
Example of a short GenBank file:
LOCUS       AB031069     2487 bp    mRNA    PRI    27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2 (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
…
     2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat
     2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt
     2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat
     2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag
     2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt
     2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa
     2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//
For a view of the complete file and its format, look at ‘record.gb’ in Section 4
of the exercises.
A typical GenBank entry is packed with information. With perl we will be
able to separate the different parts. For instance, by extracting the sequence,
we can search for motifs, calculate statistics on the sequence, or compare with
other sequences. Also, by separating the various parts of the data annotation, we
have access to ID numbers, gene names, genus and species, publications, etc.
The FEATURES table part of the annotation includes specific information
about the DNA, such as the locations of exons, regulatory regions, important
mutations, and so on. The format specification of GenBank files and a great
deal of other information about GenBank can be found in the GenBank release
notes, gbrel.txt, on the GenBank web site at
ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.
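As a first taste of separating the parts of a record, the annotation can be split from the sequence at the ORIGIN line. The following is only a sketch under stated assumptions: the helper split_genbank_record is hypothetical, it expects one whole record in a single string, and the record used here is a heavily abbreviated stand-in for the file shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: split one GenBank record (a single string) into
# its annotation (through the ORIGIN line) and its bare sequence.
sub split_genbank_record {
    my ($record) = @_;
    if ($record =~ m{\A(.*^ORIGIN[^\n]*\n)(.*?)^//}ms) {
        my ($annotation, $sequence) = ($1, $2);
        $sequence =~ s/[\s0-9]//g;    # drop position numbers and whitespace
        return ($annotation, $sequence);
    }
    return;    # no ORIGIN ... // section found
}

# Abbreviated stand-in for a real record (an assumption for illustration)
my $record = <<'END_RECORD';
LOCUS       AB031069     2487 bp    mRNA    PRI    27-MAY-2000
ORIGIN
        1 agatggcggc gctgaggggt cttgggggct
//
END_RECORD

my ($annotation, $sequence) = split_genbank_record($record);
print "Sequence: $sequence\n";
```

The sequence returned this way can be passed to subroutines such as print_sequence, while the annotation string is left for further parsing.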