How to get good fungal and bacterial identifications from GenBank sequences

GenBank from the NCBI is an amazing and invaluable resource for DNA sequences, and combined with the BLAST search tool it makes identifying an unknown gene sequence really easy.

The trouble is that many of the organism identifications attached to GenBank sequences are dubious, outdated, or just plain wrong. Identifying a sequence as the wrong species is bad science whatever your reason for doing the identification, and getting it right is especially important for regulatory agencies. The problem is also self-compounding, because other sequence submitters use these incorrect identifications to name their own sequences.

Ideally you would use a carefully curated “known good” list of sequences. The very best sequences to use are those derived from “type material”. The nomenclatural type is the specimen that was used to originally describe a species, so by definition a sequence from it is the best one to use to get an accurate identification. A great website for bacteria that uses the 16S rDNA region is EzTaxon. I use this frequently, but it has an inherent limitation: the 16S region varies too little between some bacterial species, so it is not always easy to get an accurate identification.

For fungi using the universal barcode ITS rDNA region, a good website is this searches some sequences that have not yet been submitted to GenBank. GenBank has recognised the problem of poor quality identifications and for fungi has a curated list of type sequences, described in the publication Finding needles in haystacks (disclaimer: I am a co-author).

Still, nothing beats the vast scope of GenBank, particularly if you want to use a gene other than ITS or 16S. There are several ways to limit the scope of your BLAST search to just good sequences. One way I recommend is to tick the box “Exclude Uncultured/environmental sample sequences” under the exclude option. These sequences will be of no value in getting an identification, and just clutter up the results. Ticking “Exclude Models (XM/XP)” will make no difference either way, as these are automatically annotated genes from a few NCBI genomes (human, mouse, rat, honey bee, chicken, chimpanzee). You should also try ticking “Sequences from type material” under the ‘limit to’ option. I find it very useful to view the Distance tree of results (select 'show all' under Collapse Mode), rather than just relying on the ranking given in the results page.

In addition to this you can use the very powerful Entrez Query option to limit the results further. These also work really well if you are using another service to query the GenBank database and not going through the website (e.g. using Geneious). For example try these:

Entrez Query options

sequence from type[filter]
Only retrieves sequences from type cultures or specimens (works the same as the “Sequences from type material” option in the web interface).

src specimen voucher[prop] OR src culture collection[prop]
Only retrieves sequences that are associated with a herbarium, fungarium, or culture collection.

collection icmp[prop]
Only retrieves sequences from cultures in the ICMP culture collection.

NOT(environmental samples[organism] OR metagenomes[organism])
Filters out sequences from environmental samples or metagenomes, which typically have poor identifications.
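
If you are scripting your searches rather than filling in the web form, the filters listed above can be combined into a single Entrez Query string. The following is a minimal Python sketch rather than an official NCBI recipe. The helper name build_entrez_query and the placeholder sequence are my own inventions for illustration, and the parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY) are the ones I understand the NCBI BLAST URL API to accept, so check them against the current NCBI documentation before relying on them. The code only builds strings and never contacts NCBI.

```python
from urllib.parse import urlencode

# Filter strings copied from the Entrez Query options listed above.
TYPE_MATERIAL = "sequence from type[filter]"
VOUCHERED = "src specimen voucher[prop] OR src culture collection[prop]"
ENVIRONMENTAL = ["environmental samples[organism]", "metagenomes[organism]"]


def build_entrez_query(include, exclude=()):
    """Hypothetical helper: AND the 'include' filters, then NOT the 'exclude' terms.

    Each included filter is wrapped in parentheses so that an OR inside one
    filter cannot leak into its neighbours.
    """
    query = " AND ".join(f"({term})" for term in include)
    if exclude:
        excluded = " OR ".join(exclude)
        # strip() removes the leading space when nothing is included.
        query = f"{query} NOT ({excluded})".strip()
    return query


# Vouchered or cultured material only, with environmental samples removed.
entrez_query = build_entrez_query([VOUCHERED], ENVIRONMENTAL)

# Assumed BLAST URL API parameters; the sequence is a made-up placeholder.
blast_params = {
    "CMD": "Put",
    "PROGRAM": "blastn",
    "DATABASE": "nt",
    "QUERY": "ACGTACGTACGTACGT",
    "ENTREZ_QUERY": entrez_query,
}
encoded = urlencode(blast_params)

print(entrez_query)
print(encoded)
```

Swapping VOUCHERED for TYPE_MATERIAL in the include list restricts the search to type material instead, which is the stricter of the two filters.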

These Entrez Queries can also be used when finding sequences in the GenBank database without using BLAST. For example, collection icmp[prop] OR icmp[title] AND fungi[orgn] AND 2014/01/01[PDAT] : 2014/12/31[PDAT] finds all ICMP fungal cultures deposited in GenBank in the year 2014.
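
The same kind of query can be run from a script through the NCBI E-utilities. The sketch below only builds the search URL for the ICMP query above and does not send it. The function name esearch_url and the retmax value are my own choices, and I have added parentheses around the OR clause to make the grouping explicit; as far as I know Entrez evaluates Boolean operators left to right, so this should match the unparenthesised query, but treat that as an assumption and compare the result counts yourself.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The ICMP query from the text, with the OR clause grouped explicitly.
ICMP_FUNGI_2014 = (
    "(collection icmp[prop] OR icmp[title]) AND fungi[orgn] "
    "AND 2014/01/01[PDAT] : 2014/12/31[PDAT]"
)


def esearch_url(term, db="nuccore", retmax=20):
    """Hypothetical helper: return an ESearch URL for an Entrez query.

    db="nuccore" is the nucleotide database that holds GenBank records;
    retmax limits how many record IDs are returned.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = esearch_url(ICMP_FUNGI_2014)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a JSON document listing the matching record IDs and the total count.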

I hope this helps you use GenBank more effectively.

