3.10 PROTBASE: Searching Through A Protein Database

Inside every large problem is a small problem struggling to get out.15

With apologies to C.A.R. Hoare

Yahoo’s database of stock market data is just one among the many large databases on the Internet. Another one is located at NCBI (National Center for Biotechnology Information). Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information. In this section, we look at one of NCBI’s public services, which is called BLAST (Basic Local Alignment Search Tool).

You probably know that the information necessary for reproducing living cells is encoded in the genetic material of the cells. The genetic material is a very long chain of four base nucleotides. It is the order of appearance (the sequence) of nucleotides which contains the information about the substance to be produced. Scientists in biotechnology often find a specific fragment, determine the nucleotide sequence, and need to know where the sequence at hand comes from.

This is where the large databases enter the game. At NCBI, databases store the knowledge about which sequences have ever been found and where they have been found. When the scientist sends his sequence to the BLAST service, the server looks for regions of genetic material in its database which look the most similar to the delivered nucleotide sequence. After a search time of some seconds or minutes the server sends an answer to the scientist. In order to make access simple, NCBI chose to offer their database service through popular Internet protocols. There are four basic ways to use the so-called BLAST services:

Our starting point is the demonstration client mentioned in the first option. The README file that comes along with the client explains the whole process in a nutshell. In the rest of this section, we first show what such requests look like. Then we show how to use gawk to implement a client in about 10 lines of code. Finally, we show how to interpret the result returned from the service.

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, ‘U’ and ‘*’ are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., ‘N’ for unknown nucleic acid residue or ‘X’ for unknown amino acid residue). The nucleic acid codes supported are:

A --> adenosine               M --> A C (amino)
C --> cytidine                S --> G C (strong)
G --> guanine                 W --> A T (weak)
T --> thymidine               B --> G T C
U --> uridine                 D --> G A T
R --> G A (purine)            H --> A C T
Y --> T C (pyrimidine)        V --> G C A
K --> G T (keto)              N --> A G C T (any)
                              -  gap of indeterminate length

Now you know the alphabet of nucleotide sequences. The last two lines of the following example query show such a sequence, which is obviously made up only of elements of the alphabet just described. Store this example query into a file named protbase.request. You are now ready to send it to the server with the demonstration client.

PROGRAM blastn
>GAWK310 the gawking gene GNU AWK

The actual search request begins with the mandatory parameter ‘PROGRAM’ in the first column followed by the value ‘blastn’ (the name of the program) for searching nucleic acids. The next line contains the mandatory search parameter ‘DATALIB’ with the value ‘month’ for the newest nucleic acid sequences. The third line contains an optional ‘EXPECT’ parameter and the value desired for it. The fourth line contains the mandatory ‘BEGIN’ directive, followed by the query sequence in FASTA/Pearson format. Each line of information must be less than 80 characters in length.

The “month” database contains all new or revised sequences released in the last 30 days and is useful for searching against new sequences. There are five different blast programs, blastn being the one that compares a nucleotide query sequence against a nucleotide sequence database.

The last server directive that must appear in every request is the ‘BEGIN’ directive. The query sequence should immediately follow the ‘BEGIN’ directive and must appear in FASTA/Pearson format. A sequence in FASTA/Pearson format begins with a single-line description. The description line, which is required, is distinguished from the lines of sequence data that follow it by having a greater-than (‘>’) symbol in the first column. For the purposes of the BLAST server, the text of the description is arbitrary.

If you prefer to use a client written in gawk, just store the following 10 lines of code into a file named protbase.awk and use this client instead. Invoke it with ‘gawk -f protbase.awk protbase.request’. Then wait a minute and watch the result coming in. In order to replicate the demonstration client’s behavior as closely as possible, this client does not use a proxy server. We could also have extended the client program in Retrieving Web Pages, to implement the client request from protbase.awk as a special case.

{ request = request "\n" $0 }

  BLASTService     = "/inet/tcp/0/www.ncbi.nlm.nih.gov/80"
  printf "POST /cgi-bin/BLAST/nph-blast_report HTTP/1.0\n" |& BLASTService
  printf "Content-Length: " length(request) "\n\n"         |& BLASTService
  printf request                                           |& BLASTService
  while ((BLASTService |& getline) > 0)
      print $0

The demonstration client from NCBI is 214 lines long (written in C) and it is not immediately obvious what it does. Our client is so short that it is obvious what it does. First it loops over all lines of the query and stores the whole query into a variable. Then the script establishes an Internet connection to the NCBI server and transmits the query by framing it with a proper HTTP request. Finally it receives and prints the complete result coming from the server.

Now, let us look at the result. It begins with an HTTP header, which you can ignore. Then there are some comments about the query having been filtered to avoid spuriously high scores. After this, there is a reference to the paper that describes the software being used for searching the data base. After a repetition of the original query’s description we find the list of significant alignments:

Sequences producing significant alignments:                        (bits)  Value

gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733...    38  0.20
gb|AC021056.12|AC021056 Homo sapiens chromosome 3 clone RP11-115...    38  0.20
emb|AL160278.10|AL160278 Homo sapiens chromosome 9 clone RP11-57...    38  0.20
emb|AL391139.11|AL391139 Homo sapiens chromosome X clone RP11-35...    38  0.20
emb|AL365192.6|AL365192 Homo sapiens chromosome 6 clone RP3-421H...    38  0.20
emb|AL138812.9|AL138812 Homo sapiens chromosome 11 clone RP1-276...    38  0.20
gb|AC073881.3|AC073881 Homo sapiens chromosome 15 clone CTD-2169...    38  0.20

This means that the query sequence was found in seven human chromosomes. But the value 0.20 (20%) means that the probability of an accidental match is rather high (20%) in all cases and should be taken into account. You may wonder what the first column means. It is a key to the specific database in which this occurrence was found. The unique sequence identifiers reported in the search results can be used as sequence retrieval keys via the NCBI server. The syntax of sequence header lines used by the NCBI BLAST server depends on the database from which each sequence was obtained. The table below lists the identifiers for the databases from which the sequences were derived.

EMBL Data Libraryemb|accession|locus
DDBJ, DNA Database of Japandbj|accession|locus
NBRF PIRpir||entry
Protein Research Foundationprf||name
SWISS-PROTsp|accession|entry name
Brookhaven Protein Data Bankpdb|entry|chain
Kabat’s Sequences of Immuno…gnl|kabat|identifier
GenInfo Backbone Idbbs|number

For example, an identifier might be ‘gb|AC021182.14|AC021182’, where the ‘gb’ tag indicates that the identifier refers to a GenBank sequence, ‘AC021182.14’ is its GenBank ACCESSION, and ‘AC021182’ is the GenBank LOCUS. The identifier contains no spaces, so that a space indicates the end of the identifier.

Let us continue in the result listing. Each of the seven alignments mentioned above is subsequently described in detail. We will have a closer look at the first of them.

>gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733N23, WORKING DRAFT SEQUENCE, 4
             unordered pieces
          Length = 176383

 Score = 38.2 bits (19), Expect = 0.20
 Identities = 19/19 (100%)
 Strand = Plus / Plus

Query: 35    tggtgaagtgtgtttcttg 53
Sbjct: 69786 tggtgaagtgtgtttcttg 69804

This alignment was located on the human chromosome 7. The fragment on which part of the query was found had a total length of 176383. Only 19 of the nucleotides matched and the matching sequence ran from character 35 to 53 in the query sequence and from 69786 to 69804 in the fragment on chromosome 7. If you are still reading at this point, you are probably interested in finding out more about Computational Biology and you might appreciate the following hints.

  1. There is a book called Introduction to Computational Biology by Michael S. Waterman, which is worth reading if you are seriously interested. You can find a good book review on the Internet.
  2. While Waterman’s book explains the algorithms employed internally in the database search engines, most practitioners prefer to approach the subject differently. The applied side of Computational Biology is called Bioinformatics, and emphasizes the tools available for day-to-day work as well as how to actually use them. One of the very few affordable books on Bioinformatics is Developing Bioinformatics Computer Skills.
  3. The sequences gawk and gnuawk are in widespread use in the genetic material of virtually every earthly living being. Let us take this as a clear indication that the divine creator has intended gawk to prevail over other scripting languages such as ‘perl’, ‘tcl’, or ‘python’ which are not even proper sequences. (:-)



What C.A.R. Hoare actually said was “Inside every large program is a small program struggling to get out.”