Science and Reason: Human gene count drops again

Before the human genome was sequenced and the results published in February 2001, some biologists speculated that there might be 100,000 or more different human genes. Later in 2001, estimates were still sometimes between 60,000 and 90,000. More conservative estimates at the time were around 35,000, and the figure gradually fell to about 25,000 over the next several years.

The problem is that there are no dependably unambiguous markers within the DNA itself to identify where a potentially active gene starts and ends. Remember that every strand of DNA has a directionality established by the two ends of the strand, which are chemically distinct and called the 5′ and 3′ ends. The enzymes that transcribe DNA into RNA always build the RNA from its 5′ end toward its 3′ end (reading the template strand in the opposite direction), so on the coding strand the start of a gene is closer to the 5′ end than to the 3′ end. Every gene also has a "promoter" region, which is a short DNA sequence located in the 5′ direction ("upstream") from the gene itself. Transcription factors attach themselves to the promoter region in order to enable gene transcription.

Although promoter regions are relatively easy to recognize, they can be located far from the gene itself, so it is still often difficult to predict where the corresponding gene (if any) actually begins.

Initial high estimates of gene numbers were made indirectly, based on the number of distinct proteins that exist in human cells. It was known that this number was at least 100,000, possibly a lot more. But this was misleading, because it wasn't well understood that a single gene could code for multiple proteins, through the process of alternative splicing.

One clue to the actual start of a gene is the presence of a start codon, a specific three-letter sequence (ATG). In eukaryotes this codon encodes the amino acid methionine, and it is usually preceded by a "5′ untranslated region" ("5′ UTR"), which serves as further identification. The end of a gene's coding region is marked by a stop codon (TAA, TAG, or TGA). The portion of DNA between the start and stop codons is called an "open reading frame" (ORF). A further condition is that the ORF contain enough codons (about 100, i.e. 300 nucleotides) to encode a working protein.
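The ORF criteria just described (a start codon, an in-frame stop codon, and a minimum length) can be illustrated with a short Python sketch. It is only a toy scan of one strand in its three reading frames, not a real gene finder: the function name `find_orfs` and the `min_codons` parameter are made up for this illustration, the stop codons TAA, TAG and TGA are the standard ones rather than something stated in the text above, and it ignores introns, the reverse strand, and ORFs nested inside other ORFs.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumption: standard genetic code)

def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) tuples for ORFs on the given strand.

    start is the index of the A in ATG; end is exclusive and includes
    the stop codon. min_codons counts codons before the stop codon.
    Hypothetical helper for illustration only.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                       # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):  # step codon by codon
            codon = dna[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i                    # first ATG opens a candidate ORF
            elif codon in STOP_CODONS:
                n_codons = (i - start) // 3      # codons from ATG up to the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                     # look for the next ATG
    return orfs

# Tiny demonstration with a lowered threshold so the toy sequence qualifies.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(toy, min_codons=5))
```

With the default threshold of 100 codons the toy sequence yields nothing, which is the point of the length condition: short ATG-to-stop stretches occur by chance all over a genome.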

It is still not true that all sufficiently long open reading frames correspond to actual genes. There are various heuristics used to identify ORFs that do not really correspond to genes, but uncertainty remains, because there's no unambiguous way to tell from the DNA itself that a particular ORF actually corresponds to a working gene. It might instead, for example, have been an actual gene in some distant human ancestor that is no longer functional in humans. (Such things are known as "pseudogenes".)

But now that we know the DNA sequences of various other mammals, it is possible to identify more pseudogenes in human DNA, further reducing the total count of actual functioning genes.

Human Gene Count Tumbles Again

Estimates of the number of genes in the human genome have ranged wildly over the past two decades, from 20,000 all the way up to 150,000. By the time the working draft of the human genome was published in 2001, the best approximation stood at 35,000, yet even that number has fallen. A new analysis, one that harnesses the power of comparing genome sequences of various organisms, now reveals that the true number of human genes is about 20,500, thousands fewer than what is currently listed in human gene catalogs.

The initial clue that not all of the 25,000 sequences that had been settled upon as "real" human genes actually were genes was that many did not correspond to genes identified in the mouse genome. This was suspicious, since working, useful genes in the common ancestor of humans and mice ought to be conserved in both later species.

A supposed human gene which did not correspond to a mouse gene might have appeared in the time since humans and mice diverged from their common ancestor, or the gene might have been lost by mice (but not humans) sometime after the common ancestor. On the other hand it might not be a real human gene at all (having lost functionality along the way). One method to distinguish between these possibilities is to check whether an analogue of the supposed gene can be found in a primate genome, since the primates whose genomes have been cataloged (macaques and chimpanzees) are much more closely related to humans than mice are.

Ultimately, almost 5,000 pseudogenes have been removed from the earlier list of 25,000 human genes. Sequences of human DNA that appear to be genes but do not correspond to genes in mice and dogs, yet do correspond to genes in macaques and chimpanzees, are considered real. The remainder, mostly, are considered pseudogenes:

To distinguish such misidentified genes from true ones, the research team, led by Clamp and Broad Institute director Eric Lander, developed a method that takes advantage of another hallmark of protein-coding genes: conservation by evolution. The researchers considered genes to be valid if and only if similar sequences could be found in other mammals – namely, mouse and dog. Applying this technique to nearly 22,000 genes in the Ensembl gene catalog, the analysis revealed 1,177 “orphan” DNA sequences. These orphans looked like proteins because of their open reading frames, but were not found in either the mouse or dog genomes.

Although this was strong evidence that the sequences were not true protein-coding genes, it was not quite convincing enough to justify their removal from the human gene catalogs. Two other scenarios could, in fact, explain their absence from other mammalian genomes. For instance, the genes could be unique among primates, new inventions that appeared after the divergence of mouse and dog ancestors from primate ancestors. Alternatively, the genes could have been more ancient creations — present in a common mammalian ancestor — that were lost in mouse and dog lineages yet retained in humans. ...

After extending the analysis to two more gene catalogs and accounting for other misclassified genes, the team’s work invalidated a total of nearly 5,000 DNA sequences that had been incorrectly added to the lists of protein-coding genes, reducing the current estimate to roughly 20,500.
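The decision procedure described in this post can be summarized in a few lines of Python. This is only a sketch of the logic as paraphrased here, not the researchers' actual pipeline: the function `classify`, the category labels, and the example candidates are all invented for illustration, and treating a match in either primate as sufficient is an assumption.

```python
from collections import Counter

NON_PRIMATES = ("mouse", "dog")
PRIMATES = ("macaque", "chimpanzee")

def classify(hits):
    """Classify a candidate human ORF from the set of species in which
    a similar sequence was found. Hypothetical labels, for illustration."""
    if any(species in hits for species in NON_PRIMATES):
        return "conserved"      # found in mouse or dog: treated as a valid gene
    if any(species in hits for species in PRIMATES):
        return "primate_only"   # orphan, but supported by a primate genome
    return "unsupported"        # orphan with no support: likely not a real gene

# Made-up candidates; the names and hit sets are not real data.
candidates = {
    "orf_A": {"mouse", "dog", "chimpanzee"},
    "orf_B": {"macaque", "chimpanzee"},
    "orf_C": set(),
    "orf_D": {"dog"},
}

labels = {name: classify(hits) for name, hits in candidates.items()}
tally = Counter(labels.values())
print(labels)
print(tally)
```

In this sketch only the "unsupported" category would be dropped from a gene catalog; the "primate_only" category corresponds to the two alternative scenarios (a recent primate invention, or an ancient gene lost in the mouse and dog lineages) that the quoted passage says had to be ruled out first.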