Glossary of Terms Commonly Used in Genomics Research
accession number: a
unique code that identifies a sequence in a database
algorithm: a
procedure embedded in a computer program
alignment: the
process of lining up two or more sequences to achieve the highest degree of
identity, for the purpose of assessing the degree of similarity and the
possibility of homology
alternative splicing: mechanism
by which the exons of a single gene are joined in different combinations after
the introns (intervening sequences found within a gene) are removed, producing
alternative mRNAs and protein products from the same gene
amino acids: The 20 organic compounds that are
building blocks of proteins. The sequence of nucleotide bases in DNA determines
the sequence of amino acids in proteins. Some examples of amino acids are:
alanine, glycine, arginine, leucine
base/base pair: the
4 nitrogenous bases found in the nucleotides of DNA: adenine (abbreviated as A),
guanine (G), cytosine (C), and thymine (T). Most organisms contain thousands or
more of these, in a long double-stranded chain wound into a
double helix. The linear order (sequence) of the nucleotides defines the
organism. DNA is double-stranded, and the 4 bases are complementary to each
other (e.g., A pairs only with T, and G only with C). Determining the order of
the bases on one strand therefore gives the order on the other strand, so the
terms base and base pair are often used interchangeably as a measure of the
size of a genome (for example, the human genome is approximately 3 billion base
pairs long).
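The complementarity rule can be illustrated with a minimal Python sketch. The function name `reverse_complement` is a hypothetical helper chosen for this example; the sketch assumes the input contains only the letters A, C, G, and T.

```python
# Watson-Crick pairing: A pairs with T, and G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written in its own reading direction."""
    # The two strands run in opposite directions, so the complement is reversed.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

print(reverse_complement("GCATATTGCT"))  # AGCAATATGC
```

Because each base has exactly one partner, applying the function twice returns the original strand.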
BAC: Bacterial
artificial chromosome, used as a vector to carry the DNA of another organism
for cloning and molecular biology purposes
bioinformatics: the
field combining computer science, biology, and information technology,
involving the storage, management, and analysis of large amounts of data
BLAST: Basic Local
Alignment Search Tool, an algorithm for finding regions of local similarity
between sequences, available through NCBI
cDNA: complementary
DNA, synthesized from mRNA, from which the introns have been spliced out
chromosome: the
structure in cells that carries the linearly arranged genetic material. The
number of chromosomes varies among species (for example, humans have 23 pairs,
Arabidopsis has 5 pairs)
codon: a set of 3
nucleotides in a DNA or mRNA sequence, which corresponds to a specific amino
acid or to a stop signal
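As a small illustration of reading codons, the Python sketch below splits a DNA sequence into triplets and looks each one up. The table is deliberately partial: it holds only six codons from the standard genetic code, and the names `split_codons` and `translate` are hypothetical helpers for this example.

```python
# Partial codon table (standard genetic code); "*" marks a stop codon.
CODON_TABLE = {
    "ATG": "M",  # methionine (start)
    "GCT": "A",  # alanine
    "GGT": "G",  # glycine
    "CGT": "R",  # arginine
    "CTG": "L",  # leucine
    "TAA": "*",  # stop
}

def split_codons(dna: str) -> list:
    """Cut a sequence into consecutive triplets, dropping any incomplete tail."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]

def translate(dna: str) -> str:
    """Translate codon by codon until a stop codon is reached."""
    protein = []
    for codon in split_codons(dna.upper()):
        amino_acid = CODON_TABLE.get(codon, "X")  # X: codon not in this partial table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCTGGTCGTCTGTAA"))  # MAGRL
```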
comparative genomics:
comparing the sequences of 2 or more organisms, used in identifying gene
functions and evolutionary studies
computational biology:
analyzing and interpreting
biological data using computational methods
COT analysis: uses
the principles of DNA renaturation kinetics, where the rate at which a
particular sequence reassociates (returns to the double-stranded state) is
proportional to the number of times it is found in the genome (Cot stands for
the initial nucleotide concentration, C0, multiplied by the reassociation time,
t). This is used as a way of filtering out highly repetitive sequences, better
enabling the sequencing of low copy sequences (more likely to be genes)
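The kinetics can be sketched numerically. The Python example below assumes the ideal second-order renaturation model, in which the fraction of DNA still single-stranded is 1 / (1 + k * Cot). The rate constants are hypothetical values chosen only to contrast a repetitive sequence with a single-copy one, and the function names are illustrative.

```python
def cot_value(initial_conc: float, seconds: float) -> float:
    """Cot: initial nucleotide concentration (mol/L) times reassociation time (s)."""
    return initial_conc * seconds

def fraction_single_stranded(cot: float, k: float) -> float:
    """Fraction not yet reassociated, assuming ideal second-order kinetics."""
    return 1.0 / (1.0 + k * cot)

# Hypothetical rate constants: repeats reassociate faster because each strand
# finds a complementary partner more often.
K_REPEAT = 100.0
K_SINGLE_COPY = 0.01

cot = cot_value(0.001, 1000)
print(fraction_single_stranded(cot, K_REPEAT))       # mostly double-stranded
print(fraction_single_stranded(cot, K_SINGLE_COPY))  # mostly single-stranded
```

At an intermediate Cot value, the repetitive fraction has largely reassociated while low copy sequences remain single-stranded, which is what makes the separation possible.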
database: a
collection of data. See also relational database
DNA: deoxyribonucleic
acid, the carrier molecule of genetic information. Made up of 2 long
chains of nucleotides, each of which consists of a sugar (deoxyribose), a
phosphate group and one of 4 nitrogenous bases (see also base).
DNA chip: see microarray
DNA fingerprinting:
the creation of a unique DNA profile of an individual using molecular
techniques
DNA sequence: the
sequence of nucleotide bases that are in a DNA molecule. Expressed by the
sequence of the letters representing each of the 4 nucleotide bases, for
example GCATATTGCT. This sequence is specific to each living organism.
ESTs: expressed
sequence tags; partial gene sequences of the expressed part of the genome; used
for gene discovery, particularly in organisms that have not yet been sequenced
exon: the part of a
DNA sequence that codes for a protein (usually in conjunction with other exons)
FASTA: the first widely used search algorithm
for database similarity searching; now sometimes used simply to denote the file
format that sequences are commonly expressed in
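For readers unfamiliar with the file format, here is a minimal Python sketch of a FASTA reader. It assumes the common convention that each record starts with a ">" header line followed by one or more lines of sequence; `parse_fasta` and the record names are hypothetical.

```python
def parse_fasta(text: str) -> dict:
    """Map each record's identifier (first word after '>') to its sequence."""
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}

EXAMPLE = """>seq1 hypothetical example record
GCATATTGCT
ACGT
>seq2
TTGACA
"""

print(parse_fasta(EXAMPLE))
```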
functional genomics:
studies of the structure, organization, and function of the genome in
developmental and other life processes of an organism
gap: a space
introduced into an alignment to compensate for insertions and deletions in one
sequence relative to another
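The sketch below shows one way gaps might be handled when scoring an alignment in Python. It assumes gaps are written as "-" and that columns containing a gap are excluded from the identity calculation. Other conventions exist, so this is only one possible definition, and `percent_identity` is a hypothetical helper.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free columns in which the two aligned sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# The second sequence has one deletion (the gap) and one mismatch.
print(round(percent_identity("GCATATTGCT", "GCAT-TTGGT"), 1))  # 88.9
```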
GenBank: the most
used public database for sequence data and related information. Managed by
NCBI, supported by the National Library of Medicine and NIH, available at
http://www.ncbi.nlm.nih.gov
gene: the functional
subunit of heredity, a sequence of nucleotides at a particular position on a
chromosome which usually encodes a specific functional product
gene expression:
when a gene is "turned on", making a product
gene family: a group
of closely related genes that produces similar protein products
genetic engineering:
the technique of copying a gene from one living thing, such as a bacterium,
plant or animal, and adding it to another. Most commonly used to add a new gene
to a crop plant, giving it traits that may be beneficial to the farmer or consumer
genome: the entire
genetic endowment of an organism. Genome sizes vary widely among organisms
genotype: the
genetic constitution of an organism, see also phenotype
GMO: genetically
modified organism. Although technically this could refer to any organism that
has been genetically modified, even through traditional breeding and selection
methods, typically the term now refers to an organism that has been modified
through genetic engineering methods. Also called transgenics
haplotype: a
collection of variable DNA sequences that tend to be inherited together
heuristic: a
procedure that derives an approximation to the real answer of a problem in a
more economical or faster way than a mathematically "strict" (exact)
algorithm. However, a heuristic is not guaranteed to find the true answer. In
computer science, heuristics are applied when finding the exact
solution to a problem via strict algorithms is computationally impractical.
homology: having a
common evolutionary origin, relatedness (now often used simply to describe
similarity in DNA sequence)
imprinting: The
phenomenon in which a gene may be expressed differently in an offspring
depending on whether it was inherited from the father or the mother.
intron: A DNA
sequence that interrupts the sequences coding for a gene product (exons).
junk DNA: non-coding
DNA; DNA that does not directly code for proteins. May function as a
structural stabilizer, in controlling gene expression, or in other roles
library: a set of
sequences or clones, usually generated at once for a specific purpose. Examples
include EST libraries, BAC libraries, etc.
mapping: identifying
the location of a gene or DNA segment along a chromosome
metabolomics: the
study of the unique chemical fingerprints that specific cellular processes
leave behind; the study of an organism's small-molecule metabolite profiles
microarray (or DNA
chip, gene chip): device where tens of thousands of genes or DNA segments are
attached to a small, thumb-sized chip, and can be simultaneously assessed to
detect specific genes or gene activity or expression
minimal tiling path:
the minimum number of overlapping clones in a physical map needed to generate a
sequence of the whole genome
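Choosing a tiling path resembles the classic interval-covering problem, which a greedy strategy solves. The Python sketch below is a simplification: it treats each clone as a (name, start, end) interval on the map and ignores the minimum overlap that real projects require between neighbouring clones. The clone names and coordinates are hypothetical.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy cover: repeatedly take the clone that extends coverage furthest."""
    ordered = sorted(clones, key=lambda clone: clone[1])  # sort by start
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Consider every clone that starts at or before the covered boundary.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError("gap in clone coverage at position %d" % covered_to)
        path.append(best[0])
        covered_to = best[2]
    return path

CLONES = [
    ("BAC1", 0, 100),
    ("BAC2", 50, 120),
    ("BAC3", 80, 200),
    ("BAC4", 150, 260),
    ("BAC5", 190, 300),
]
print(minimal_tiling_path(CLONES, 0, 300))  # ['BAC1', 'BAC3', 'BAC5']
```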
molecular marker:
gene or DNA segment with a known location on a chromosome. (For a good tutorial
on the uses of markers, see the downloadable training materials available from
the International Plant Genetic Resources Institute, http://www.ipgri.cgiar.org/)
mutation: abrupt
change in the genotype of an organism that is not the result of recombination
NCBI: National
Center for Biotechnology Information, which manages GenBank, PubMed (a database
of publications), and other databases (available at
http://www.ncbi.nlm.nih.gov)
nucleic acids: see base/base
pair and DNA
nucleotide: contains
one base, one phosphate molecule, and the sugar molecule deoxyribose. The bases
in DNA nucleotides are adenine, thymine, guanine, and cytosine (abbreviated A,
T, G, and C). See also base/base pair
orthologs:
homologous genes from different species that are derived from a common
ancestral gene at the time of the last common ancestor
paralogs: genes
within a species that arose from
gene duplication
PCR: polymerase
chain reaction, the process by which a small fragment of DNA can be replicated
into millions of copies. It is done in a small desktop machine called a
thermal cycler, which, through temperature cycles, drives the DNA synthesis
process
phenotype: the
traits displayed by an organism as a result of its genetic constitution (genotype)
phylogenetics: the
field of biology that deals with identifying and understanding the
relationships between the different kinds of life on earth.
phylogenomics: a
method of assigning a function to a gene based on its evolutionary history in a
phylogenetic tree; phylogenomics uses knowledge
of the evolution of a gene to improve function prediction.
physical map: the
linear order of sequences along a chromosome, often generated by the use of
overlapping clones such as BAC clones
polyploidy: the
occurrence of whole genome or large scale duplications within a genome,
typically in plants
promoter: the part
of a gene that contains the information to turn the gene on or off
proteins: large
complex molecules made up of amino acids that make up most cellular structures
and catalyze most reactions
proteome: the set of
all proteins in a cell. Unlike the relatively unchanging genome, the dynamic
proteome changes from minute to minute in response to tens of thousands of
intra- and extracellular environmental signals
proteomics: the
large-scale analysis of an organism's proteins to reveal expression and
functions
recombination: formation
in offspring of genetic combinations not present in the parents through the
physical exchange of genetic material during cell division
regulatory DNA: DNA
that controls the activity of genes. These DNA sequences tend to be short and
located near the genes they control (but not always).
relational database:
a database that cross-references the different types of data it contains, and
allows queries of any type (a sequence, the sequence name, etc.) to retrieve
data
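A toy example of cross-referencing can be built with Python's standard `sqlite3` module. The table names, column names, and the accession "DEMO0001" below are all hypothetical; real sequence databases use far richer schemas.

```python
import sqlite3

def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two cross-referenced tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sequences (
            accession TEXT PRIMARY KEY,
            name      TEXT,
            residues  TEXT
        );
        CREATE TABLE organisms (
            accession TEXT REFERENCES sequences(accession),
            species   TEXT
        );
    """)
    conn.execute("INSERT INTO sequences VALUES (?, ?, ?)",
                 ("DEMO0001", "example gene", "GCATATTGCT"))
    conn.execute("INSERT INTO organisms VALUES (?, ?)",
                 ("DEMO0001", "Arabidopsis thaliana"))
    return conn

def species_for_sequence(conn: sqlite3.Connection, residues: str):
    """Query by sequence; the shared accession links the two tables."""
    row = conn.execute(
        "SELECT o.species FROM sequences AS s "
        "JOIN organisms AS o ON o.accession = s.accession "
        "WHERE s.residues = ?",
        (residues,),
    ).fetchone()
    return row[0] if row else None

print(species_for_sequence(build_demo_db(), "GCATATTGCT"))
```

The same join could just as easily start from the sequence name or the accession, which is the sense in which queries of any type retrieve the linked data.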
RNA: ribonucleic
acid, the molecule responsible for translating the genetic information into
proteins. Made up of one long chain of nucleotides, the bases of which are the
same as in DNA except that uracil is used instead of thymine. There are three
main types of RNA: messenger RNA, transfer RNA, and ribosomal RNA.
RNA interference (RNAi): a system in cells for "turning off," or silencing, particular genes.
Scientists can now mimic this process to help identify the functions of
individual genes.
sequencing:
determining the order (sequence) of nucleotide bases in a segment of DNA (or,
less commonly, RNA or protein). In gel-based methods, samples are run on an
electrophoresis gel, on which the 4 bases give distinctive banding patterns. By
ordering the overlapping fragments, the sequence of the entire DNA segment can
be deduced
SNPs: single
nucleotide polymorphisms (pronounced "snips"). A SNP is a single-nucleotide
difference between 2 or more sequences, caused by allelic variation or
mutations. Can be used as genetic markers to track inheritance in families or
species
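Spotting candidate single-nucleotide differences between two sequences can be sketched in a few lines of Python. The example assumes the sequences are already aligned, have the same length, and contain no gaps; `snp_positions` is a hypothetical helper, and real SNP calling also weighs sequencing error and population frequency.

```python
def snp_positions(seq_a: str, seq_b: str) -> list:
    """List (1-based position, base in A, base in B) wherever the sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return [
        (index + 1, a, b)
        for index, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]

print(snp_positions("GCATATTGCT", "GCATGTTGCT"))  # [(5, 'A', 'G')]
```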
structural genomics:
identifying the 3-D structures of proteins, which will help identify their
functions and provide targets for drug design
syntenic:
orthologous loci from different species that are in the same order in their
respective species (originally meant only that they were located on the same
chromosome, but is now used to mean colinear as well). For example, genes on a
human chromosome that are found in the same order on a mouse chromosome
transgenic: an
organism containing genetic material from another organism transferred by
genetic engineering. See also GMO.
transcription: the
process by which RNA is synthesized from a DNA template; the first step in
expressing a gene
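At the level of sequence, the outcome of transcription is simple to sketch in Python: the RNA carries the same base order as the coding (non-template) DNA strand, with uracil (U) in place of thymine (T). The function name `transcribe` is hypothetical, and the sketch ignores promoters, splicing, and every other biological detail.

```python
def transcribe(coding_strand: str) -> str:
    """Return the RNA sequence matching a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")

print(transcribe("GCATATTGCT"))  # GCAUAUUGCU
```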
transcriptome: the
sum of all the regions of a genome that are transcribed
transcriptomics: the
study of the expression levels of genes,
often using techniques capable of sampling tens of thousands of different mRNA
molecules at a time, such as microarrays.
transformation: the
process of adding a gene from one organism into another
transposon: a
genetic element that can move within the genome
unigene:
non-redundant set of gene-oriented clusters, often generated by clustering
large numbers of ESTs
universal primers:
primers that will amplify orthologous sequences in different species
UTR: untranslated
region, the part of a gene's transcript (mRNA), flanking the coding sequence at
either end, that is not translated into protein.
Main Sources and other glossaries
Chemis
Interactive Molecular Library: nucleic acids http://www.geneticengineering.org/chemis/Chemis-NucleicAcid/DNA.htm,
2000, Dr Didier Collomb 2/13/02
Friend, S.H. and
Stoughton, R.B. (2002, February). The magic of microarrays. Scientific
American, pp. 44-53
Hartwell, L.H.,
Hood, L., Goldberg, M., Reynolds, A.E., Silver, L.M., & Veres, R.C. (2000).
Genetics: from genes to genomes. New
York: McGraw-Hill Companies, Inc.
Interagency
Working Group on Plant Genomes (2000). National Plant Genome Initiative.
Washington, D.C.: National Science and Technology Council
Genomics
Initiative, a supplement to the Cornell Chronicle. (1999, January). Cornell
University
Human Genome
Management Information System (HGMIS) (2001). Genomics and its impact on
medicine and society: a primer, [pdf]. HGMIS at Oak Ridge National Laboratory,
Oak Ridge, TN, for the U.S. Department of Energy Human Genome Program.
Available at http://www.ornl.gov/hgmis
National Center
for Biotechnology Information, http://www.ncbi.nlm.nih.gov/
National
Institutes of Health, National Institute of General Medical Sciences (2001)
Genetics Basics. NIH Publication No. 01-662. Also available at: http://publications.nigms.nih.gov/genetics/
Genome News
Network glossary http://www.genomenewsnetwork.org/
Wikipedia, the
free encyclopedia http://en.wikipedia.org/
For reviews of
some online glossaries in genomics and biotechnology, see http://www.sciencegenomics.org