118
BIOLOGY
6.9 HUMAN GENOME PROJECT
In the preceding sections you have learnt that it is the sequence of bases in
DNA that determines the genetic information of a given organism. In other
words, the genetic make-up of an organism or an individual lies in its DNA
sequences. If two individuals differ, then their DNA sequences should also
be different, at least at some places. These assumptions led to the quest to
find out the complete DNA sequence of the human genome. With the
establishment of genetic engineering techniques that made it possible to
isolate and clone any piece of DNA, and the availability of simple and fast
techniques for determining DNA sequences, a very ambitious project of
sequencing the human genome was launched in the year 1990.
The Human Genome Project (HGP) was called a mega project. You can
imagine the magnitude and the requirements of the project if we simply
consider the following estimates.
The human genome is said to have approximately 3 × 10⁹ bp, and if the
cost of sequencing required is US $ 3 per bp (the estimated cost in the
beginning), the total estimated cost of the project would be approximately
9 billion US dollars. Further, if the obtained sequences were to be stored
in typed form in books, and if each page of the book contained 1000
letters and each book contained 1000 pages, then 3300 such books would
be required to store the information of DNA sequence from a single human
cell. The enormous amount of data expected to be generated also
necessitated the use of high-speed computational devices for data storage,
retrieval and analysis. HGP was closely associated with the rapid
development of a new area in biology called Bioinformatics.
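The estimates above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the round figures quoted in the text; the variable names are illustrative.

```python
# Back-of-the-envelope figures; every input is a round number quoted in the text.
GENOME_BP = 3 * 10**9        # approximate size of the human genome in base pairs
COST_PER_BP_USD = 3          # estimated sequencing cost per bp at the start of the project
LETTERS_PER_PAGE = 1000      # one letter per base
PAGES_PER_BOOK = 1000

total_cost_usd = GENOME_BP * COST_PER_BP_USD
letters_per_book = LETTERS_PER_PAGE * PAGES_PER_BOOK
books_needed = GENOME_BP // letters_per_book

print(f"Estimated cost: {total_cost_usd / 10**9:.0f} billion US dollars")
print(f"Books needed: {books_needed}")
```

With the round figure of 3 × 10⁹ bp this gives 9 billion US dollars and 3000 books. The figure of 3300 books quoted above would correspond to about 3.3 × 10⁹ letters, so it presumably rests on a slightly larger estimate of genome size.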
Goals of HGP
Some of the important goals of HGP were as follows:
(i) Identify all the approximately 20,000-25,000 genes in human DNA;
(ii) Determine the sequences of the 3 billion chemical base pairs that
make up human DNA;
(iii) Store this information in databases;
(iv) Improve tools for data analysis;
(v) Transfer related technologies to other sectors, such as industries;
(vi) Address the ethical, legal, and social issues (ELSI) that may arise
from the project.
The Human Genome Project was a 13-year project coordinated by
the U.S. Department of Energy and the National Institutes of Health. During
the early years of the HGP, the Wellcome Trust (U.K.) became a major
partner; additional contributions came from Japan, France, Germany,
China and others. The project was completed in 2003. Knowledge about
the effects of DNA variations among individuals can lead to revolutionary
new ways to diagnose, treat and someday prevent the thousands of
2022-23
MOLECULAR BASIS OF INHERITANCE
disorders that affect human beings. Besides providing clues to
understanding human biology, learning about the DNA sequences of
non-human organisms can lead to an understanding of their natural capabilities
that can be applied toward solving challenges in health care, agriculture,
energy production and environmental remediation. Many non-human model
organisms, such as bacteria, yeast, Caenorhabditis elegans (a free-living
non-pathogenic nematode), Drosophila (the fruit fly) and plants (rice and
Arabidopsis), have also been sequenced.
Methodologies: The methods involved two major approaches. One
approach focused on identifying all the genes that are expressed as
RNA (referred to as Expressed Sequence Tags, or ESTs). The other took
the blind approach of simply sequencing the whole genome, containing
all the coding and non-coding sequences, and later assigning
functions to different regions in the sequence (a process referred to as
Sequence Annotation). For sequencing, the total DNA from a cell is
isolated and converted into random fragments of relatively smaller sizes
(recall that DNA is a very long polymer, and there are technical limitations in
sequencing very long pieces of DNA) and cloned in a suitable host using
specialised vectors. The cloning resulted in amplification of each
DNA fragment so that it could subsequently be sequenced with ease.
The commonly used hosts were bacteria and yeast, and the vectors were
called BAC (bacterial artificial chromosomes) and YAC (yeast artificial
chromosomes).
The fragments were sequenced using automated DNA sequencers that
worked on the principle of a method developed by Frederick Sanger.
(Remember, Sanger is also credited with developing the method for
determining amino acid sequences in proteins.) These sequences were then
arranged based on overlapping regions present in them. This required the
generation of overlapping fragments for sequencing. Aligning these
sequences was not humanly possible. Therefore, specialised computer-based
programs were developed (Figure 6.15). These sequences were subsequently
annotated and assigned to each chromosome. The sequence of
chromosome 1 was completed only in May 2006 (this was the last of the
24 human chromosomes, 22 autosomes plus X and Y, to be
Figure 6.15 A representative diagram of human
genome project
sequenced). Another challenging task was assigning the genetic and
physical maps on the genome. These were generated using information on
polymorphism of restriction endonuclease recognition sites and of some
repetitive DNA sequences known as microsatellites (one application
of polymorphism in repetitive DNA sequences is explained in the next
section, on DNA fingerprinting).
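The idea of arranging sequenced fragments by their overlapping regions can be illustrated with a small program. This is a toy sketch, not the software used in the project: the function names `overlap` and `assemble`, the greedy merging strategy and the example fragments are all assumptions made for illustration. Real assembly programs must also cope with sequencing errors, repeats and very large numbers of fragments.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Greedily merge the pair of fragments with the largest overlap.

    Repeats until no pair overlaps by at least min_len bases, then
    returns whatever merged sequences remain.
    """
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return reads
        # Join the two fragments, writing the shared region only once.
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Three made-up overlapping fragments of one short stretch of DNA.
fragments = ["TACGGA", "ATGCGT", "CGTTAC"]
print(assemble(fragments))
```

For these three fragments the sketch rebuilds the single sequence ATGCGTTACGGA, because ATGCGT ends with CGT, CGTTAC begins with CGT and ends with TAC, and TACGGA begins with TAC.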
6.9.1 Salient Features of Human Genome
Some of the salient observations drawn from the Human Genome Project are
as follows:
(i) The human genome contains 3164.7 million bp.
(ii) The average gene consists of 3000 bases, but sizes vary greatly, with
the largest known human gene being dystrophin at 2.4 million bases.
(iii) The total number of genes is estimated at 30,000, much lower
than previous estimates of 80,000 to 1,40,000 genes. Almost all
(99.9 per cent) nucleotide bases are exactly the same in all people.
(iv) The functions are unknown for over 50 per cent of the discovered
genes.
(v) Less than 2 per cent of the genome codes for proteins.
(vi) Repeated sequences make up a very large portion of the human genome.
(vii) Repetitive sequences are stretches of DNA sequences that are
repeated many times, sometimes a hundred to a thousand times. They
are thought to have no direct coding functions, but they shed light
on chromosome structure, dynamics and evolution.
(viii) Chromosome 1 has most genes (2968), and the Y has the fewest (231).
(ix) Scientists have identified about 1.4 million locations where single-
base DNA differences (SNPs, single nucleotide polymorphisms,
pronounced 'snips') occur in humans. This information promises
to revolutionise the processes of finding chromosomal locations for
disease-associated sequences and tracing human history.
6.9.2 Applications and Future Challenges
Deriving meaningful knowledge from the DNA sequences will define
research through the coming decades leading to our understanding of
biological systems. This enormous task will require the expertise and
creativity of tens of thousands of scientists from varied disciplines in both
the public and private sectors worldwide. One of the greatest impacts of
having the human genome sequence may well be enabling a radically new approach
to biological research. In the past, researchers studied one or a few genes
at a time. With whole-genome sequences and new high-throughput
technologies, we can approach questions systematically and on a much
broader scale. Researchers can study all the genes in a genome, for example, all
the transcripts in a particular tissue, organ or tumor, or how tens of
thousands of genes and proteins work together in interconnected networks
to orchestrate the chemistry of life.
6.10 DNA FINGERPRINTING
As stated in the preceding section, 99.9 per cent of the base sequence is
the same among humans. Taking the human genome as 3 × 10⁹ bp, at how
many bases would there be differences? It is these differences
in DNA sequence that make every individual unique in
phenotypic appearance. If one aims to find out genetic differences
between two individuals or among individuals of a population,
sequencing the DNA every time would be a daunting and expensive
task. Imagine trying to compare two sets of 3 × 10⁶ base pairs. DNA
fingerprinting is a very quick way to compare the DNA sequences of any
two individuals.
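The question posed above is a one-line calculation. The sketch below uses only the two figures given in the text; the percentage is written in parts per thousand so that the arithmetic stays in whole numbers.

```python
GENOME_BP = 3 * 10**9        # genome size assumed in the text
IDENTICAL_PER_1000 = 999     # 99.9 per cent, expressed as parts per thousand

# Bases expected to differ between two individuals.
differing_bp = GENOME_BP * (1000 - IDENTICAL_PER_1000) // 1000
print(differing_bp)
```

The result is 3 × 10⁶, that is, about three million bases out of three billion.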
DNA fingerprinting involves identifying differences in some specific
regions of the DNA sequence called repetitive DNA, because in these
sequences a small stretch of DNA is repeated many times. Repetitive
DNA is separated from bulk genomic DNA as different peaks during
density gradient centrifugation. The bulk DNA forms a major peak and
the other, smaller peaks are referred to as satellite DNA. Depending on
base composition (A:T rich or G:C rich), length of segment and number
of repetitive units, satellite DNA is classified into many categories,
such as micro-satellites, mini-satellites, etc. These sequences normally
do not code for any proteins, but they form a large portion of the human
genome. These sequences show a high degree of polymorphism and form
the basis of DNA fingerprinting. Since DNA from every tissue of an
individual (such as blood, hair follicle, skin, bone, saliva, sperm, etc.)
shows the same degree of polymorphism, these sequences become a very useful
identification tool in forensic applications. Further, as the polymorphisms
are inheritable from parents to children, DNA fingerprinting is the basis
of paternity testing in case of disputes.
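Polymorphism in repetitive DNA can be pictured as two individuals carrying different numbers of copies of the same short repeat at the same region. The sketch below counts back-to-back copies of a repeat unit in a sequence. The repeat unit GATA, the flanking bases, the copy numbers and the function name `max_tandem_repeats` are all invented for illustration, and the code is a conceptual aid rather than a laboratory procedure.

```python
import re


def max_tandem_repeats(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of motif in sequence."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Made-up sequences: the same flanking bases around different copy numbers.
person_a = "TTGACA" + "GATA" * 4 + "CCGT"
person_b = "TTGACA" + "GATA" * 6 + "CCGT"

print(max_tandem_repeats(person_a, "GATA"))
print(max_tandem_repeats(person_b, "GATA"))
```

Here the two made-up individuals carry 4 and 6 copies of the repeat, a difference in copy number at an otherwise identical region.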
As polymorphism in DNA sequence is the basis of genetic mapping
of the human genome as well as of DNA fingerprinting, it is essential that we
understand what DNA polymorphism means in simple terms.
Polymorphism (variation at the genetic level) arises due to mutations. (Recall
the different kinds of mutations and their effects that you have already
studied in Chapter 5, and in the preceding sections of this chapter.)
New mutations may arise in an individual either in somatic cells or in
the germ cells (cells that generate gametes in sexually reproducing
organisms). If a germ cell mutation does not seriously impair the individual's
ability to have offspring who can transmit the mutation, it can spread to