|
DNA Sequence Access, Formats

Lab2

This lab should:

Entrez

Entrez is a very powerful relational database search tool for DNA sequence, protein sequence, protein structure, bioscience literature, and disease information. Entrez relates several independent databases: Genbank, GenProt, Protein Data Bank, PubMed, Genome Data Base, Taxonomy Data Base, and Online Mendelian Inheritance in Man. Genbank stores all known DNA sequences. GenProt is a compilation of known and inferred protein sequences. The Protein Data Bank stores X-ray coordinates for large macromolecules, including proteins and DNA. PubMed is a bibliographic database that provides access to the biomedical literature. The Genome Data Base is related to Genbank, but accumulates genome-wide information. The Taxonomy Data Base is a phylogeny-based database that provides access to other information. Online Mendelian Inheritance in Man accumulates information about human genetic disease.

Each database has limits on the information present. The primary DNA, protein, and structure databases are kept up to date with their counterpart databases around the world, so they are essentially compilations of almost all publicly available information of this sort. One limitation that basic scientists often encounter is that PubMed does not include scientific literature that is primarily agricultural or chemical. This limitation can be very troubling to people in basic science and in fields like agricultural biotechnology. Nevertheless, these resources are outstanding, and Entrez is a shining example of a superb relational database and a tool that scientists find useful on a daily basis. One can enter the databases from essentially any data type and find that information as well as related information. One of the best sources of information about Entrez is its source, the National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health. Every bioinformatician has the NCBI bookmarked!
We will learn about Entrez by using it for some defined searches. Entrez is available in two flavors: WWW Entrez and Network Entrez. These are different interfaces that achieve similar ends. The popularity of WWW Entrez has eclipsed that of Network Entrez, though in my experience Network Entrez offers greater flexibility in retrieving large data sets rapidly. There is an excellent chapter by Baxevanis that includes information on Network Entrez and explains many universal aspects of Entrez.

Entrez can be used as a simple retrieval tool, but it shines when used as a relational database tool. For example, if you would like to know whether the mating type gene from Neurospora crassa has been sequenced, try entering the terms neurospora crassa mating type in the Entrez search window of the nucleotide database. Your result should look like this. The problem is that mating type is a common descriptor of all strains, so you retrieved almost all known Neurospora genes! You might cleverly refine your search (think about how), but another way would be to use some other entree to the correct data. For example, you might search for neurospora crassa mating type in PubMed, the literature database. At present, you would find 638 articles. If you retrieve all of these, you will see that there are 32 pages. Unfortunately, the entries that you want are not obvious here, either.

Finally, what if you search in the protein database? You find 20 proteins, and if you retrieve them, many seem appropriate. How do you get the DNA sequence? You could first retrieve the literature links, then use the display related function to find nucleotide sequences. What if you want to find other related articles? Use the neighboring function to find related articles, which you can use to leapfrog to other mating type sequences! The key to Entrez in its relational role is its concept of "neighboring", which is very biological. The basic neighboring functions are:
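For readers who would rather script these searches than type them into the WWW Entrez forms, here is a minimal sketch. It assumes NCBI's E-utilities web interface (`esearch` and `elink`), which this lab does not otherwise use, and it only builds the request URLs without contacting NCBI. The field-restricted query is just one possible refinement, and the identifier 12345 is a placeholder, not a real record.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (an assumption: the lab itself uses WWW Entrez).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def search_url(db, term, retmax=20):
    """Build an esearch URL that looks up `term` in one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def neighbor_url(dbfrom, db, uid):
    """Build an elink URL that follows links from one record into another database."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": uid})
    return f"{EUTILS}/elink.fcgi?{query}"


# One possible refinement: tie each phrase to a field instead of searching all text.
refined = "Neurospora crassa[Organism] AND mating type[Title]"

print(search_url("nuccore", refined))  # nuccore is the nucleotide database
print(search_url("pubmed", "neurospora crassa mating type"))
print(neighbor_url("protein", "nuccore", 12345))  # placeholder protein ID
```

Pasting any of the printed URLs into a browser returns XML rather than the formatted pages that WWW Entrez displays.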
You will use Entrez in the exercises below. One of the critical points is how to enter a retrieved sequence into SeqWeb.

DNA sequence formats

The lecture described several formats. We will focus on the Genbank, FASTA, and SeqWeb formats. In the exercises below, note how each format uses the same basic information. Note in particular how FASTA is information-poor in comparison to the other formats. This is illustrated by the conversion process into SeqWeb.

Format conversion utilities

SeqWeb is very limited in format conversion utilities. At this point, we will not introduce UNIX on seqanal to the class. If you are an experienced UNIX user, please try the optional section. Readseq is a popular UNIX sequence utility written by Don Gilbert that is often used on UNIX or WWW machines as a "front end" to allow multiple types of sequence input. This can also be used alone on seqanal by typing the command readseq. There is a nice tutorial on reformatting offered by the Pittsburgh Supercomputer Center. Versions of readseq for UNIX, Mac, and Win3.1 are available for download. I have not yet found a Win95 version. My advice is to archive your DNA sequences in Genbank format and in a format suitable for the analysis suite that you most often use. FASTA format is required by some programs, but can be readily generated by a conversion utility or by text processing.

DNA sequence manipulations

There are three simple manipulations that are often used with DNA sequences:
The SeqWeb commands for these are quite obvious. The greatest limitations of SeqWeb with respect to these commands are options that are available in the regular UNIX versions but have not yet been added to SeqWeb. Translate is awkward to use on multi-exon genes because the exons must be individually added from the clipboard or "spawned" as separate sequences. The UNIX version of Translate is much more powerful. Map is easy to use, but the graphical results are not available. Molecular biologists often use graphic restriction maps to think about their clones. This is possible with the old version, but somewhat difficult. To do so, you must use Mapsort, Mapplot, and a graphics viewer. This is easy in the SeqLab X-windows version. Findpatterns is very useful in searching for user-defined patterns.
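To make the multi-exon point about Translate concrete, here is a minimal Python sketch of what the command has to do: join the exons, then read the joined sequence codon by codon. This is not the GCG Translate program. The gene, its exon coordinates, and the function names are hypothetical, and the standard genetic code is assumed.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def splice(sequence, exons):
    """Join exons given as 1-based, inclusive (start, end) pairs."""
    return "".join(sequence[start - 1:end] for start, end in exons)


def translate(coding):
    """Translate codon by codon; codons with ambiguous bases become 'X'."""
    coding = coding.upper()
    usable = len(coding) - len(coding) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(coding[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Hypothetical two-exon gene: exon 1 = 1..6, intron = 7..19, exon 2 = 20..25.
gene = "ATGGCC" + "GTAAGTCTAACAG" + "TTTTAA"
print(translate(splice(gene, [(1, 6), (20, 25)])))  # MAF*
```

Passing the exon coordinates as one list is the step that SeqWeb makes awkward, since there each exon has to be supplied separately.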
Findpatterns can find exact matches or matches with defined mismatches. Please
note that allowing mismatches will often dramatically increase the number of
"finds". SeqWeb accepts the IUPAC ambiguity characters:
Findpatterns also accepts "gaps", according to the manual. For example, one finds that the branchpoint of fungal introns usually is CTRAC followed by 5-15 bases followed by YAG. One can define this as: CTRAC(N){5,15}YAG and use Findpatterns to search for that. This works on the UNIX version, but does not appear to work on SeqWeb. I am working on that bug!! More on that later. If you are a UNIX user, you can do this on GCG-UNIX. |
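If neither version of Findpatterns is at hand, the same kind of search can be approximated with regular expressions. The sketch below is not Findpatterns and does not handle mismatches. It assumes the standard IUPAC nucleotide ambiguity codes and relies on the fact that the parentheses and `{5,15}` repeat count in the pattern above are already valid regular-expression syntax. The sequence fragment and function names are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def to_regex(pattern):
    """Expand IUPAC letters; parentheses and {m,n} counts pass through unchanged."""
    return "".join(IUPAC.get(ch, ch) for ch in pattern.upper())


def find_pattern(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]


# Hypothetical intron fragment: CTAAC, then 7 bases, then CAG.
fragment = "GGCTAACTTTGTTTCAGAT"
print(to_regex("CTRAC(N){5,15}YAG"))                # CT[AG]AC([ACGT]){5,15}[CT]AG
print(find_pattern("CTRAC(N){5,15}YAG", fragment))  # [(3, 'CTAACTTTGTTTCAG')]
```

Because the repeat count is greedy, each reported match extends to the last YAG within 15 bases of the branchpoint, and overlapping matches are not listed.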