Debunking myths on genetics and DNA

Monday, February 27, 2012

Mendelian puzzles

A Mendelian disorder is a disease caused by a single-gene mutation that's usually inherited according to Mendel's laws. Despite being for the most part quite debilitating, these disorders persist in the population, though at a very low prevalence. This is due in part to an effect called heterozygote advantage, in which carrying a single copy of the mutated allele confers some benefit. In addition, recessive mutations cause no symptoms in carriers of a single copy, so they are passed on from one generation to the next until an individual with two mutated alleles is born.
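To get a feel for why a recessive allele can linger unnoticed, here is a minimal sketch that assumes Hardy-Weinberg proportions (random mating, no selection). The allele frequency `q = 0.01` is a made-up illustrative value, not a figure from the review, and heterozygote advantage is not modeled.

```python
def genotype_frequencies(q):
    """Return (non-carriers, carriers, affected) under Hardy-Weinberg proportions.

    q is the frequency of the recessive disease allele; p = 1 - q is the
    frequency of the normal allele.
    """
    p = 1.0 - q
    return p * p, 2 * p * q, q * q


q = 0.01  # hypothetical allele frequency, chosen only for illustration
non_carriers, carriers, affected = genotype_frequencies(q)

print(f"carriers: {carriers:.4f}")   # carriers: 0.0198
print(f"affected: {affected:.4f}")   # affected: 0.0001
print(f"carriers per affected individual: {carriers / affected:.0f}")  # 198
```

Under these assumptions, symptom-free carriers outnumber affected individuals by roughly two hundred to one, which is why most copies of the allele are never exposed to selection.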

A Perspective in the latest issue of Science [1] gives a good review of Mendelian disorders and the puzzles yet to unravel around them. The mystery is two-fold: first, given a disorder, not all subjects carry the specific mutation; second, given the specific mutation, not all subjects are affected in the same way, and in fact, some aren't affected at all.

The first puzzle (an affected person who doesn't carry the mutation) can easily be attributed to the fact that the specific disease could indeed have many causes. Just because it has been found to be monogenic (caused by one gene), it doesn't mean that it has one cause only. Environmental factors could be playing a role too, for example, as well as additional interactions between genes. Different environmental exposures could also explain the other side of the puzzle, i.e. why carriers of the same mutations can be affected in such different ways, from no symptoms at all to severe conditions. Ultimately, things like gene-to-gene interactions and variation in an individual's regulatory sequences could explain both puzzles.

DNA transcription is controlled by proteins called transcription factors. These proteins bind to a non-coding region "upstream" of the gene, called a "regulatory sequence" because it affects whether the gene is silenced or expressed.
"For an autosomal dominant disorder in which individuals harbor one normal (wild-type) and one mutant gene copy (allele), additional variants at a physically close silencer or enhancer can modulate the wild-type versus mutant transcripts so as to yield more or less mutant transcript and protein, thus leading to more severe or less severe disease. The specific outcomes depend on the function of the regulatory sequence and whether the regulatory variant is physically linked to the wild-type or the mutant allele."
In the Science review, Chakravarti and Kapoor hypothesize that mutations in the regulatory sequences are far more common than mutations in the adjacent structural gene that codes for the affected transcript or protein, and that it is the combination of the two that could explain the wide range of disease penetrance we observe. Furthermore, while disease mutations are kept at a low prevalence by strong selection, selection is much weaker on regulatory sequences. As a consequence, mutations that arise in these regions have a higher chance of becoming common in the population.
"The combination of rare coding mutations with common regulatory variants can lead to complex patterns of inheritance and thus provide a singular mechanistic explanation of Mendelian families that do not carry a mutation in the coding gene associated with the disease, as well as variable disease penetrance in individuals that carry the Mendelian gene mutation."

[1] Chakravarti, A., & Kapoor, A. (2012). Mendelian Puzzles Science, 335 (6071), 930-931 DOI: 10.1126/science.1219301

Thursday, February 23, 2012

Modifying gene expression through riboswitches

Messenger RNA (mRNA), the RNA transcribed from a DNA template in order to make proteins, contains elements able to sense and bind to specific target molecules (metabolites or metal ions). In bacteria, fungi and plants, these binding mechanisms are used to control gene expression, and therefore act as genetic "switches", which is why these RNA elements are called "riboswitches". They are often found at the 5' end of the mRNA, in the untranslated region (the stretch that precedes the start codon): this way, they are the first domain to be synthesized and can therefore influence expression before the entire mRNA is transcribed.

Riboswitches have two components: the domain that binds to the ligand is called "aptamer" and it's highly conserved from an evolutionary point of view, as it has to "sense" a precise type of molecule. The other component, called "expression platform," is what regulates gene expression, and, contrary to the aptamer, it can vary greatly in order to affect the different processes of transcription, translation, and RNA processing.

In order to understand how riboswitches bind to their specific ligands, it is vital to decipher their "secondary structure," in other words, the pattern of base pairing that determines how they fold into the 3-D shape that allows them to "sense" and "capture" the target molecules. Common elements of RNA secondary structure are "helices" (similar to those found in DNA) and "hairpins," which form when the RNA folds back onto itself. "Some riboswitches are surprisingly complex, and they rival protein factors in their structural and functional sophistication [1]."
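To make the idea of a hairpin concrete, here is a toy sketch that scans an RNA string for a stretch of bases that can pair, antiparallel, with a later stretch, leaving an unpaired loop in between. The function name `find_hairpins`, its parameters, and the example sequences are all hypothetical. Real structure prediction relies on energy models and allows G-U wobble pairs, both of which this sketch ignores.

```python
# Watson-Crick pairs only; G-U wobble pairs are deliberately left out.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def find_hairpins(rna, stem_len=3, min_loop=3):
    """Return (stem_start, loop_start, loop_end, stem_end) for simple hairpins.

    A hairpin here is a run of `stem_len` bases that pairs, antiparallel,
    with a later run of the same length, separated by at least `min_loop`
    unpaired bases. Indices follow Python slice conventions.
    """
    hits = []
    n = len(rna)
    for i in range(n - 2 * stem_len - min_loop + 1):
        for j in range(i + 2 * stem_len + min_loop, n + 1):
            left = rna[i:i + stem_len]
            right = rna[j - stem_len:j]
            # The first base of the left arm pairs with the last base of the right arm.
            if all((a, b) in PAIRS for a, b in zip(left, reversed(right))):
                hits.append((i, i + stem_len, j - stem_len, j))
    return hits


print(find_hairpins("GGGAAACCC"))  # [(0, 3, 6, 9)]: stem GGG/CCC around loop AAA
print(find_hairpins("GGGAAAGGG"))  # []: the two arms cannot pair
```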

The following figure, from this Scitable article, illustrates the kind of change in secondary structure a riboswitch can undergo upon binding to a molecule.

A riboswitch can adopt different secondary structures to effect gene regulation depending on whether ligand is bound. This schematic is an example of a riboswitch that controls transcription. When metabolite is not bound (-M), the expression platform incorporates the switching sequence into an antiterminator stem-loop (AT) and transcription proceeds through the coding region of the mRNA. When metabolite binds (+M), the switching sequence is incorporated into the aptamer domain, and the expression platform folds into a terminator stem-loop (T), causing transcription to abort. aptamer domain (red), switching sequence (purple), and expression platform (blue).

Because they affect gene expression, particularly that of genes involved in biosynthetic pathways, riboswitches are natural targets for drug development.
"First, many riboswitches repress the expression of genes whose protein products are involved in the transport or biosynthesis of essential metabolites. Therefore, compounds that trick riboswitches by mimicking the natural ligand might inhibit bacterial growth by starving the cells for that essential metabolite. Second, medicinal chemists already have a 'hit' compound (the natural ligand) for each validated riboswitch class that they can begin to chemically alter to create new antibiotics. In this regard, riboswitches are almost unique among noncoding RNAs classes because they have evolved pockets to purposefully bind a small molecule, and therefore should be more easily drugged [1]."
From the Scitable article:
"Their role in regulating transcription in bacteria makes them enticing targets for the development of novel antibiotics aimed at stopping bacterial pathogens from flourishing inside the people they infect. Because riboswitches control genes essential for bacterial survival, or genes that control the ability of bacteria to succeed at infection, a drug designed to affect a riboswitch could be a powerful tool for shutting down pathogenic bacteria."
Synthetic riboswitches have been developed and shown to activate or repress gene expression in bacteria [2]. While I couldn't find any studies done in humans yet (though if you guys know of some, please let me know!), I did find a Nature letter reporting the first ever human RNA switch analogous to riboswitches [3].

[1] Breaker, R. (2011). Prospects for Riboswitch Discovery and Analysis Molecular Cell, 43 (6), 867-879 DOI: 10.1016/j.molcel.2011.08.024

[2] Topp, S., Reynoso, C., Seeliger, J., Goldlust, I., Desai, S., Murat, D., Shen, A., Puri, A., Komeili, A., Bertozzi, C., Scott, J., & Gallivan, J. (2010). Synthetic Riboswitches That Induce Gene Expression in Diverse Bacterial Species Applied and Environmental Microbiology, 76 (23), 7881-7884 DOI: 10.1128/AEM.01537-10

[3] Ray, P., Jia, J., Yao, P., Majumder, M., Hatzoglou, M., & Fox, P. (2008). A stress-responsive RNA switch regulates VEGFA expression Nature, 457 (7231), 915-919 DOI: 10.1038/nature07598

Monday, February 20, 2012

Is cancer contagious? Sometimes. But it may not be a bad thing.

About 15% of all cancers worldwide are caused by infectious pathogens such as viruses, bacteria, or parasites [1]. Viruses that are capable of inducing cancer are called oncoviruses -- HPV is an example. The pathogen is transmitted from a donor to a recipient, starts the infection, and the infection eventually causes the cancer. But did you know there existed such a thing as a transmissible cancer? In this case, it's not the pathogen, but the cancer cell line itself that gets transmitted from one individual to another.

Yes, it's scary, but there is some good news.

For one thing, "relatively common" cases have been observed in animals only. Canine transmissible venereal tumor (CTVT) is quite common in dogs. It's transmitted during mating and eventually rejected by the host dog who then acquires lifelong immunity.
"In man, only scattered case reports exist about such communicable cancers, most often in the setting of organ or hematopoietic stem cell transplants and cancers arising during pregnancy that are transmitted to the fetus. In about one third of cases, transplant recipients develop cancers from donor organs from individuals who were found to harbor malignancies after the transplantation. The fact that two thirds of the time cancer does not develop, along with the fact that cancer very rarely is transmitted from person to person, supports the notion that natural immunity prevents such cancers from taking hold in man. These observations might hold invaluable clues to the immunobiology and possible immunotherapy of cancer [1]."
CTVT is particularly interesting to study because it has evolved some ingenious mechanisms to escape the immune system. Every nucleated cell has a class of molecules, called MHC, whose function is to display fragments of proteins that act as "flags" signaling whether the cell is healthy or harbors some pathogen. Once in the host, CTVT
"downregulates its MHC I expression, thereby reducing its initial visibility to the host's immune system. This allows it to not only to escape T-cell mediated immunity (which would occur if MHC I were fully expressed) but also natural killer cells (which would eradicate the cells were they completely devoid of MHC I)."
Despite this type of "defense", eventually the dog's immune system recognizes the tumor and clears it, and understanding this mechanism is of interest for a possible cancer vaccine (I talked about cancer vaccines here).

A recent study published in Science [2] looked at two regions in the mitochondrial genome (mtDNA) from 37 CTVT samples, and compared them with sequences from the mtDNA of 15 hosts. Through phylogenetic analysis Rebbeck et al. showed a high variability in the sequenced regions, suggesting that CTVT periodically acquires mtDNA from infected hosts. The reason for this, the researchers hypothesize, is that CTVT mitochondria, due to a high metabolic rate, tend to accumulate deleterious mutations and therefore, transfers of mtDNA from the host may have the benefit of restoring CTVT mitochondrial function.

[1] Welsh, J. (2011). Contagious Cancer The Oncologist, 16 (1), 1-4 DOI: 10.1634/theoncologist.2010-0301

[2] Rebbeck, C., Leroi, A., & Burt, A. (2011). Mitochondrial Capture by a Transmissible Cancer Science, 331 (6015), 303-303 DOI: 10.1126/science.1197696

Friday, February 17, 2012

Avian influenza, ferrets, and bioterrorism: fear versus science

I learned about this last week, when Science published a short article on how the National Science Advisory Board for Biosecurity had recommended that two research groups NOT publish details on how avian influenza strains were modified in order to make them transmissible through aerosol in ferrets.

You can read that story here.

The first thing that struck me was: is this censorship? Because for as long as I've been a scientist I've known that the great bulk of scientific progress is made through the free exchange of ideas and results. The very core of scientific validation is in the reproducibility of an experiment, and you can't reproduce an experiment unless whoever conducted it shares the details.

Why then the recommendation?

The World Health Organization currently lists the case fatality of avian influenza (H5N1) somewhere between 50% and 80%. This is the percentage of deaths among all cases that present at a hospital and have been confirmed through lab work. Currently, the virus is transmitted through contact with infected birds and their fluids. The two studies under scrutiny here, by Ron Fouchier at Erasmus Medical Center in Rotterdam and by Yoshihiro Kawaoka at the University of Wisconsin, have been submitted to Science and Nature respectively but not yet published. Though different, they both show that it takes a relatively small number of mutations for the virus to become transmissible through aerosol in ferrets.

Why the fear? With a fatality rate anywhere above 50%, if you can make the virus transmissible through aerosol, you've got a deadly weapon. But is it so obvious one can make it?

First, ferrets are not humans and currently we have no way to predict whether what has been observed in ferrets is likely to happen in humans. For example, there are many subtypes of avian influenza, and they have all been circulating in birds for many decades. However, of all these subtypes, only three (H1, H2, and H3) have been able to circulate in humans. There is a natural bottleneck in the way a virus is able to adapt from one organism to another.

One may object we don't know for sure, so, theoretically, it could be possible. But in that case, is censorship the answer? I honestly don't think so and I was quite happy to find a PNAS paper [1] in complete agreement with my thoughts:
"Why Is it Important to Have the Full Data Published? With respect to the specific papers by Fouchier and Kawaoka, it would be important for other scientists to replicate portions of these works to test new vaccines/therapeutic agents and for continued studies on the molecular aspects of influenza transmission, a topic that is extremely important yet relatively poorly understood."
And, most importantly:
"It would be very difficult for a bioterrorist to come up with a human virus strain that is transmissible and still highly virulent. Under natural conditions, however, there is virtually unlimited allowance for generation of capable viruses, the opportunities for infection of humans are plentiful, and the evolutionary pressures of selection are great. If anyone could do it, Nature could."
And that's exactly why we need to be prepared. And the way we are prepared is by sharing results and having multiple groups worldwide brainstorm and join forces to find a vaccine.

What do you guys think?

[1] Palese, P., & Wang, T. (2012). H5N1 influenza viruses: Facts, not fear Proceedings of the National Academy of Sciences, 109 (7), 2211-2213 DOI: 10.1073/pnas.1121297109


Thursday, February 16, 2012

Large intergenic noncoding RNAs affect gene expression

I learned this amazing fact from a talk I went to last week: currently, somewhere between 70% and 90% of DNA is estimated to be transcribed into RNA but not translated into proteins. So, the question is: if it's not making proteins, what's all this non-coding RNA doing?

In mammals in particular, more than a thousand large (over 5 kb) intergenic noncoding RNAs (lincRNA) have been identified [1] and, by looking at expression patterns, researchers were able to see that they are involved in many different biological processes. They are evolutionarily conserved across species, indicating that they are indeed functional, yet very little is known of their function. Two 2009 papers [1,2] investigated whether lincRNA are involved in the establishment of chromatin states by creating "genome-wide chromatin-state maps."

We need a refresher here. I discussed chromatin (the package of DNA and proteins inside the nucleus) in this post. The structure and topology of the chromatin changes from cell line to cell line and also during a cell's life, allowing for different genes to be activated or deactivated as needed (for example during cell differentiation). These modifications in the way the DNA is packaged are called chromatin states and are key to understanding how and which genes are expressed inside the cell. In particular, there exists a whole family of proteins, called chromatin-modifying complexes, that modify the structure of chromatin to promote or inhibit access to genes.

In [1], Guttman et al. looked at a particular chromatin domain called K4-K36 in genome-wide chromatin-state maps of 4 mouse cell lines. This chromatin signature marks actively transcribed genes; hence, they were able to find lincRNAs "by identifying K4-K36 structures that reside outside protein-coding gene loci."
"These lincRNAs show similar expression levels as protein-coding genes, but lack any protein-coding capacity. Importantly, lincRNAs show significant evolutionary conservation relative to neutral sequences, providing strong evidence that they have been functional in the mammalian lineage [2]."
In [2], Khalil et al. extended the results found in [1] by mapping the K4-K36 domain in 6 human cell types. They found 1,703 new human lincRNA genes and estimated the total number of human lincRNAs to be roughly 4,500. Of all newly discovered lincRNAs, a substantial fraction was found to be associated with PRC2, one of the chromatin-modifying complexes I mentioned above.
"Collectively, these results suggest that many lincRNAs collaborate with chromatin-modifying proteins to repress gene expression at specific loci. [...] Our results suggest an intriguing hypothesis that lincRNAs bind to chromatin-modifying complexes to guide them to specific locations in the genome. [...] Under our model, differentially expressed lincRNAs could bind to these complexes and help establish cell type specific epigenetic states."
The specific experiments conducted by Khalil et al. identified associations with chromatin-modifying complexes that have a repressive role, but the researchers suggest that, with different experiments, one could find additional lincRNA that instead are associated with activating chromatin-modifying complexes.

[1] Guttman, M., Amit, I., Garber, M., French, C., Lin, M., Feldser, D., Huarte, M., Zuk, O., Carey, B., Cassady, J., Cabili, M., Jaenisch, R., Mikkelsen, T., Jacks, T., Hacohen, N., Bernstein, B., Kellis, M., Regev, A., Rinn, J., & Lander, E. (2009). Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals Nature, 458 (7235), 223-227 DOI: 10.1038/nature07672

[2] Khalil, A., Guttman, M., Huarte, M., Garber, M., Raj, A., Rivea Morales, D., Thomas, K., Presser, A., Bernstein, B., van Oudenaarden, A., Regev, A., Lander, E., & Rinn, J. (2009). Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression Proceedings of the National Academy of Sciences, 106 (28), 11667-11672 DOI: 10.1073/pnas.0904715106

Monday, February 13, 2012

The "not-so-universal" genetic code, its origin and its evolution

From [1]:
"Until relatively recently, the [genetic] code was thought to be invariable, frozen, in all organisms, because of the way in which any change would produce widespread alteration in the amino acid sequences of proteins. The universality of the genetic code was first challenged in 1979, when mammalian mitochondria were found to use a code that deviated somewhat from the universal."
A brief refresher: proteins are chains of amino acids. They are made from messenger RNA by assigning each triplet of RNA nucleotides (a codon) to one amino acid. For example, in the sequence AUGCCCAAGCUG each triplet codes for an amino acid: AUG becomes M, CCC becomes P, AAG becomes K, and CUG becomes L. All together: AUG|CCC|AAG|CUG -> MPKL.
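The triplet-by-triplet reading described above can be sketched in a few lines of Python. The names `STANDARD_CODE` and `translate` are illustrative choices of mine. The 64-letter string is the standard codon table written out in U, C, A, G order, with `*` marking stop codons.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna, code=STANDARD_CODE):
    """Read an mRNA string three bases at a time and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = code[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the protein ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGCCCAAGCUG"))  # MPKL
```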

So, what does "universal" mean in the above quote? It means that the above sequence gets translated into the same amino acids in every organism, from bacteria to humans. Is this true? Not always.

Take a stop codon, for example. A stop codon is a triplet of RNA nucleotides that ends translation. Think of it as a flag that says, "The protein code ends here." If the genetic code were a universal one, a stop codon would always be a stop codon, in all organisms. The first exception to this outside mitochondria was discovered in 1985, when the stop codon UGA was found to actually code for an amino acid (tryptophan) in the bacterium Mycoplasma capricolum. More exceptions to the "universal" conception (other triplets that code for different amino acids instead of always the same one) have also been found in other organisms and in mitochondrial DNA. A more realistic theory is that, because DNA is dynamic, codons can "disappear" from a genome and later undergo reassignment, taking on a new meaning.
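Here is a small sketch of what such a reassignment means in practice. It builds the standard codon table, then a variant in which UGA codes for tryptophan (W), as in Mycoplasma. The mRNA string is a made-up example, and the names `MYCOPLASMA_CODE` and `translate` are my own.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in U, C, A, G order; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Variant code: identical to the standard one except that UGA means tryptophan.
MYCOPLASMA_CODE = dict(STANDARD_CODE)
MYCOPLASMA_CODE["UGA"] = "W"


def translate(mrna, code):
    """Translate codon by codon, stopping at the first stop codon in `code`."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = code[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "AUGUGGUGAAAG"  # hypothetical message: AUG | UGG | UGA | AAG
print(translate(mrna, STANDARD_CODE))    # MW   (UGA read as stop)
print(translate(mrna, MYCOPLASMA_CODE))  # MWWK (UGA read as tryptophan)
```

The same message yields a truncated protein under one code and a longer one under the other, which is why a change of this kind was long assumed to be lethal.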

The "universal" view prevailed for many years on the basis that present-day proteins are so evolved that changes would most likely be lethal. The first deviations from universality were found in the late 1970s in mitochondrial DNA. It was argued that mtDNA is considerably smaller than nuclear DNA and hence tolerates changes better.

In [1], Ohama et al. list various code changes reported in the nuclear DNA in the past three decades, and then discuss the origin of the genetic code:
"The theories to explain the early evolution of the genetic code are numerous, all of which include speculations that the coding system arose with one or a limited number of amino acids, and that others were added until a total of 20 was reached. Most of these theories are aesthetically pleasing but cannot be verified."
They assume that the most ancient genetic code had to have a minimum number of codons covering all 20 amino acids and a minimum number of corresponding tRNAs -- transfer RNA molecules that act as mediators between the mRNA and the amino acids. This first genetic code had to have very little tolerance for change. With time, however, the development of synonymous codons (different triplets that code for the same amino acid) allowed for flexibility and therefore proved to be an advantageous addition.

Finally, they conclude:
"It should be stressed however that there are no organisms which use the genetic code system for more than, or less than, 20 amino acids. What were frozen are 20 amino acids (magic 20!) and not the genetic code that assigns them. Thus the genetic code is still in the state of evolution."
I'm including below a second reference [2] that goes a bit more in depth on how these codon reassignments happen, for those of you who might be interested. In this case, the authors looked at the evolution of the genetic code in yeast.

[1] Ohama T, Inagaki Y, Bessho Y, & Osawa S (2008). Evolving genetic code. Proceedings of the Japan Academy. Series B, Physical and biological sciences, 84 (2), 58-74 PMID: 18941287

[2] Miranda, I., Silva, R., & Santos, M. (2006). Evolution of the genetic code in yeasts Yeast, 23 (3), 203-213 DOI: 10.1002/yea.1350

Sunday, February 12, 2012

Happy Darwin Day!

Happy Darwin Day everybody!

Charles Darwin was born on this day in 1809, and ... well, I don't need to tell you who Darwin was, right?

The International Darwin Day Foundation has a collection of links to Darwin lectures happening all over the world this month. If you live in a big city, chances are there's something going on nearby and you may want to check it out.

Whatever you do today, please keep one thing in mind: Charles Darwin was a great scientist and a great man and he left us an immense legacy. This is not a religious debate. And it's not political either. Anybody who mentions either religion or politics when discussing evolution is missing the point. As the Poet once said, "Non ragioniam di lor ma guarda e passa" ("Let us not speak of them, but look and pass on").

I leave you with Carl Sagan's hymn to evolution:

Friday, February 10, 2012

Migrating genes

I've been talking quite a lot about mitochondria lately. The fact that these organelles contain their own DNA (called mtDNA) and are the result of an ancient endosymbiosis is simply fascinating. And I know many of you agree, as proved by the wonderful questions my last post on mitochondria sparked (thank you Hollis and Marleen!).

Plastids, plant organelles that are responsible for photosynthesis, also have a circular, double-stranded DNA molecule (called ptDNA). Like mitochondria, plastids originated through endosymbiosis and, in most plants, are inherited from one parent only. Now, here's another fascinating fact:
"During the early phase of organelle evolution, organelle-to-nucleus DNA transfer resulted in a massive relocation of functional genes to the nucleus: in yeast, as many as 75% of all nuclear genes could derive from protomitochondria, whereas ~4500 genes in the nucleus of Arabidopsis are of plastid descent. Cases of present-day organelle-to-nucleus DNA transfer, revealed by the presence of NUMTs and NUPTs [the fraction of nuclear DNA that derives from mitochondria and plastids respectively], are known in most species studied so far. [...] Mitochondrial chromosomes contain segments homologous to chloroplast sequences, as well as sequences of nuclear origin, providing indirect evidence for plastid-to-mitochondrion and nucleus-to-mitochondrion transfer of DNA [1]."
Throughout evolution, transfers of genes between plastids and mitochondria have been documented, although in present organisms these transfers gave rise to non-functional sequences. In plants, the transfer of genes from organelles to nucleus seems to be still active, as documented by the RPS10 gene, which is present in the mitochondria of some angiosperms, and in the nucleus of other plants [Henze and Martin, 2001]. In fact, organelle-to-nucleus gene transfers have been studied extensively in plants: their cells have both plastids and mitochondria and, as a consequence, they are in general more informative than animal eukaryotes.

In [1], Leister reviews studies that show that mtDNA in Homo sapiens integrates continuously into the nuclear genome, as both de novo and pre-existing nuclear insertions of mtDNA have been documented. Recent acquisitions of nuclear mtDNA have been documented by comparison with chimpanzee genomes.

So, how do these transfers happen? Though it was originally thought that genes would migrate as RNA transcripts, new studies have shown that it's the DNA itself that "escapes" the organelle:
"Escape of organelle DNA and its uptake into the nucleus has been experimentally demonstrated in yeast and tobacco."
Once these bits of DNA arrive in the nucleus, however, they are subject to a much lower mutation rate than they were in their original location. What this means is that mutations appear more rarely in the nucleus than they do in the organelles. As a consequence, they become "conserved," undergo very few changes, and, to all effects, become "molecular fossils," allowing researchers to retrace phylogenies between species.
"Moreover, nuclear organelle DNA insertion polymorphisms, as a subclass of insertion-deletion polymorphisms, are valuable markers for population and evolutionary studies."
Since the process of migration from organelle to nucleus is a constant one, studies have been directed at measuring the rate of continued colonization of organelle DNA into the germline. The rate in humans has been found to be of the order of 10e-05, and even though these insertions had in the past been thought to be essentially harmless, recent studies have confirmed associations with certain types of hereditary diseases. As I was discussing last week, more studies are under way to investigate possible associations between nuclear mitochondrial polymorphisms and certain types of cancers.

[1] Leister, D. (2005). Origin, evolution and genetic effects of nuclear insertions of organelle DNA Trends in Genetics, 21 (12), 655-663 DOI: 10.1016/j.tig.2005.09.004

Wednesday, February 8, 2012

You're being watched. That's okay, though, we do it for your own benefit. Or so we'd like you to think...

Starting March 1st, Google's much anticipated new privacy policy will take effect. Of course, how much it will or will not affect your life depends upon your own personal choices. It strikes me, though, how much the Internet has become a place like those Italian marketplaces I used to love growing up: lots to see, stands full of goodies, lots of people, lots of entertaining distractions, yet if you don't keep a constant eye on your wallet next thing you know it'll be gone.

What can you lose on the Internet?

Well, privacy, of course. It's a subtle question. Google offers me a service, and in a way, they have a right to access certain information that, by accepting their services, I am voluntarily giving up. Where's the boundary, though? For one thing, I'm bugged by the fact that they present it as yet another service they are offering me: they gather information so they can make my searches easier and provide me with a better service, tailored to my needs.

Please. It's called marketing, and we all know it.

I am indeed grateful for all the services Google is offering me. I love the Blogger platform, and, as I have stated before, I am thrilled with G+ and the community there. I also understand that no service is ever free, rather it comes at a cost. Having said that, I think it's worth giving the whole thing some thought because, as a Google user, I feel I have to make a choice of how much of my information I want to share.

Check out what Leonhardt and Magee had to say back in 1998 (Remember 1998? Gmail didn't even exist back then!):
"[...] location services will often become repositories of potentially sensitive personal and corporate information. Where you are and who you are with are closely correlated with what you are doing. To leave this information unprotected for everybody to see is clearly undesirable. People would feel uncomfortable if their every move could be watched anonymously [1]."
Do you like to be watched anonymously?
From Google's new privacy policy:
"Location data: Google offers location-enabled services, such as Google Maps and Latitude. If you use those services, Google may receive information about your actual location (such as GPS signals sent by a mobile device) or information that can be used to approximate a location (such as a cell ID)."
You may argue it's a machine, not a person watching you. You're still being watched, though, and the way it's done -- as I understand it -- is not that you choose what to make public and what not to. Email or GPS signals are not something people typically post publicly, yet those pieces of info are apparently up for grabs as well. And that, to me, doesn't sound right.

Leonhardt and Magee predicted the future when they wrote:
"We are especially concerned with the balance between security imposed by the system (mandatory security), and security specified by individuals (discretionary security). [...] We expect that [a global location] service would be provided by a network of loosely cooperating providers, very similar to today's mobile telephone system. Customers would subscribe to one or more service providers. The providers would have roaming agreements with each other. [...] Further, there is scope for third-party location-aware services. For example, such a service might be responsible to automatically inform emergency services when a distress signal from a subscriber is received. On the other hand, users will often have to trust the service providers to obey the security policy laid down in the service contract."
Another quote, from a 2006 paper this time (yes, I did a lot of research on this!):
"These technologies can be applied for private and public goals, and can be used in private and public situations. Although it is possible to make a distinction between private and public on an analytical level, in reality, it is difficult to draw a clear line between private and public situations, and between private and public goals [2]."
That's exactly the issue here. Where do we draw the line between public service, hence available, and private data, hence "hands-off"? What are the dangers of not being able to draw such a line? In the above paper, titled "Privacy invasions," philosophy professor Karsten Weber explains:
"In principle, leaving a physical place means leaving it forever; by contrast, being in cyberspace means being there forever, because all of an individual's actions are stored immediately, and can be tracked and analyzed. [...] The technology could be used to track individuals and monitor related characteristics, such as whether the person gathers in groups or prefers solitude. Even if the reader cannot imagine a use for such information, rest assured that marketing experts would find it highly valuable."
Some people seem not to be bothered by any of this. And most likely tomorrow I'll wake up and I'll no longer be bothered by it either. Still. I find it paradoxical that I live in a country where, once kids are in college, parents no longer have access to their grades, or where one can't access the health record of an elderly relative because of privacy issues. Maybe next time you need to access any of that data you should ask Google.

[1] Leonhardt, U., & Magee, J. (1998). Security Considerations for a Distributed Location Service. Journal of Network and Systems Management, 6 (1), 51-70. DOI: 10.1023/A:1018777802208

[2] Weber, K. (2006). Privacy invasions: New technology that can identify anyone anywhere challenges how we balance individuals' privacy against public goals. EMBO reports, 7. DOI: 10.1038/sj.embor.7400684

Monday, February 6, 2012

The first tree of life

I came to learn the meaning of the word phylogenetics in 2006, when I started working on HIV. With a highly variable virus like HIV, it is convenient to be able to reconstruct its molecular evolution through a graph called a phylogenetic tree. The tree gives researchers a visual sense of the genetic diversity found in the sample of viral sequences and lets them infer what the infecting strain (the "patriarch", so to speak) might have looked like.

These trees are not specific to virology. In fact, they are used in all fields of evolutionary biology to infer genealogical and evolutionary relationships. A recent paper in PNAS [1] discusses the "Scientific, historical, and conceptual significance of the first tree of life." From the abstract:
"In 1977, Carl Woese and George Fox published a brief paper in PNAS [2] that established, for the first time, that the overall phylogenetic structure of the living world is tripartite. We describe the way in which this monumental discovery was made, its context within the historical development of evolutionary thought, and how it has impacted our understanding of the emergence of life and the characterization of the evolutionary process in its most general form."
By comparing molecular sequences of different organisms, Woese and Fox constructed the very first tree of life and showed that all species are phylogenetically related. Using the tree, they divided all cellular life into three major groups: eukaryotes (organisms whose cells have a nucleus), eubacteria (non-nucleated cells, or prokaryotes), and archaebacteria (a kind of prokaryote that shares similarities with eukaryotes -- I know, it gets complicated!). Interestingly, the paper went almost unnoticed at first, and then, when it did get noticed, it was highly criticized, as revolutionary thinking often is:
"The manuscript received severe criticisms when it was submitted to PNAS in the summer of 1977. One reviewer recommended that it not be published on methodological grounds that their claim for a tripartite division of the microbial world was as unfounded as their claims in regard to symbiosis and the origin of eukaryotic organelles."
It should be said that comparing genetic sequences back then wasn't as straightforward as it is today (hence the skepticism), and that, though not systematically proven, the general belief prior to this paper had been that life could be divided into two, not three, major groups.

Woese realized very early that the only way to quantify evolutionary change was to study the conservation and variation of molecular sequences across different organisms. So, together with Fox, he looked at small subunit ribosomal RNA from different organisms. In all cells protein synthesis is carried out in the ribosomes, which create proteins by reading the information from the messenger RNA (mRNA). Ribosomes have an RNA component and a protein component, and ribosomal RNA, or rRNA, as you may have already guessed by now, is the RNA component of the ribosome.
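To give a feel for what "comparing molecular sequences" can look like in practice, here is a minimal Python sketch. It is not the procedure Woese and Fox used: the four sequences are made-up toy fragments (not real rRNA), the distance is simply the fraction of aligned positions that differ, and the tree is built with UPGMA (average-linkage clustering), a simple method I'm swapping in purely for illustration. The names SEQS, p_distance and upgma are my own.

```python
from itertools import combinations

# Hypothetical, made-up aligned fragments (NOT real rRNA data).
SEQS = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACTTAT",
    "D": "TCGAACTTGT",
}


def p_distance(s, t):
    """Fraction of aligned positions at which two sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)


def upgma(seqs):
    """Average-linkage clustering; returns the tree as nested tuples."""
    # Map each cluster label (a leaf name or a nested tuple) to its leaves.
    clusters = {name: [name] for name in seqs}
    leaf_dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def average_distance(c1, c2):
        # Mean distance over all leaf pairs spanning the two clusters.
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(leaf_dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters into one new internal node.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda p: average_distance(*p))
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members
    return next(iter(clusters))


if __name__ == "__main__":
    print(upgma(SEQS))
```

On this toy input the sequences that differ at the fewest positions get joined first, which is the basic intuition behind reading relatedness off a tree. Real analyses use proper alignments and more careful evolutionary models.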

Woese and Fox set the foundations that, years later, led to the discovery that the root of the tree of life was to be found in the eubacterial line and settled the question of whether chloroplasts and mitochondria originated from a symbiotic event. Pace et al. conclude in [1]:
"Modern versions of the techniques used by Woese and Fox are now routinely used to sample environments as varied as geothermal hot springs and gastrointestinal microbiomes, providing unprecedented insight into community structure and dynamics. The results challenged the foundations of classical evolutionary theory, requiring new modes of evolution to be considered, indicating the presence of an unexpectedly large microbial pangenome (field of genes, to use Woese's favorite phrase), and forcing us to reconsider basic concepts such as the nature of species. Perhaps no other paper in evolutionary biology has left a richer legacy of accomplishments and promise for the future."

[1] Pace, N., Sapp, J., & Goldenfeld, N. (2012). Classic Perspective: Phylogeny and beyond: Scientific, historical, and conceptual significance of the first tree of life. Proceedings of the National Academy of Sciences, 109 (4), 1011-1018. DOI: 10.1073/pnas.1109716109

[2] Woese, C., & Fox, G. (1977). Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proceedings of the National Academy of Sciences, 74 (11), 5088-5090. DOI: 10.1073/pnas.74.11.5088

Thursday, February 2, 2012

Missing heritability: the humble opinion of a mathematician

Tomorrow, February 3, is the birthday of Eric Lander, the director of the Broad Institute (the well-known MIT/Harvard genomic research center) and the first author of the historic 2001 Nature paper that presented the initial sequence of the human genome from the Human Genome Project [1]. I heard him once speak at USC and without ever getting technical he managed to engage the whole audience and share his passion for genetics. As you know, I've been honoring famous geneticists by discussing one of their papers on their birthday and today I'm facing a conundrum. You see, the natural choice would be to pick his latest PNAS paper, titled "The mystery of missing heritability" [2]. I want to talk about this paper and at the same time I don't want to talk about this paper.

I'm not a geneticist. I'm a computational biologist, which means my background is mostly analytical, not biological. I used to work on SNP associations and cancer epidemiology and now I work on HIV. I am NOT one of the players in this game. Hence, what does my opinion count when it comes to a paper as highly debated as this one?

The thing is, this paper resonates with me. It makes a great point about a mathematical model that's been "assumed" for years now in the world of genetics. Often people don't get mathematical models. They don't get that mathematical models are tools, not the truth. Hence when one says "I present this model," you get two possible reactions: those who have seen data concordant with your model will smile and happily welcome your model. Those who instead have seen the opposite will boo you and challenge you. Problem is, models are neither right nor wrong. Models are tools. Do they help describe what we see? Fine, we keep the model. When they don't, we go back to the data and try to understand which of our assumptions failed. We use the model to discern the situations that meet the assumptions stated in the model from those that don't. Models help us shape our thinking, not the data! For example, evolution is a model, too. Go tell that to creationists and followers of intelligent design. They can challenge evolution as much as they want, but until they hand me a model that explains the genetic diversity we observe today better than evolution does, I will stick with evolution.

Back to the PNAS paper. It's a hot topic right now, and I'm kind of late discussing this particular paper in the blogosphere. Razib Khan discussed it here, Luke Jostins here and here, and I'm sure many others whom I don't know have talked about it too.

So, what is the missing heritability? Since I've already defined it in an earlier post of mine, for the time being, let me just quote Razib Khan:
"The issue is basically that there are traits where patterns of inheritance within the population strongly imply that most of the variation is due to genes, but attempts to ascertain which specific genetic variants are responsible for this variation have failed to yield much. For example, with height you have a trait which is ~80-90 percent heritable in Western populations, which means that the substantial majority of the population wide variation is attributable to genes. But geneticists feel very lucky if they detect a variant which can account for 1 percent of the variance."
The implications of this are clear: we want to find risk alleles to predict common diseases, but given the missing heritability, we can't predict common diseases.

Is this surprising?

Given the reactions I saw on the internet, apparently it is. People claim we still haven't found all variants and that's where the missing heritability's hiding. Maybe. However, after reading so much about epigenetics, RNA editing, and epistasis, allow me to be skeptical. Traits (proteins, diseases, etc.) are not genes. The path from genes to traits is long and convoluted.

So, what's Lander's point in this PNAS paper? Something I've also previously discussed: epistasis, or the way genes interact with one another. We're missing heritability because we think of risks as additive, but additivity doesn't account for interactions. If you take into account interactions between genes, the total heritability is much smaller than anticipated, and hence the percentage that the variants explain (all together) is much larger.
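To make the "smaller total, larger percentage" argument concrete, here is the arithmetic as I read it, using the Crohn's disease figures the paper reports (quoted further down in this post). The symbols are my own shorthand and an assumption about how the quantities relate: h^2_known is the heritability explained by the known variants, h^2_pop is the apparent total estimated from population data, h^2_all is the true total, and pi_phantom is the fraction of h^2_pop that is phantom.

```latex
\begin{align*}
\pi_{\text{explained}} &= \frac{h^2_{\text{known}}}{h^2_{\text{pop}}} = 0.215
  && \text{(additive reading; missing: } 1 - 0.215 = 0.785\text{)} \\
h^2_{\text{all}} &= (1 - \pi_{\text{phantom}})\, h^2_{\text{pop}}
  = (1 - 0.628)\, h^2_{\text{pop}} = 0.372\, h^2_{\text{pop}} \\
\frac{h^2_{\text{known}}}{h^2_{\text{all}}} &= \frac{0.215}{0.372} \approx 0.578 \\
\frac{\pi_{\text{phantom}}}{1 - \pi_{\text{explained}}} &= \frac{0.628}{0.785} = 0.80
\end{align*}
```

Under these assumptions, the same variants would go from explaining about 21.5% of the apparent heritability to roughly 58% of the true one, and the phantom part would account for 80% of what is currently labeled missing.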
"Quantitative geneticists have long known that genetic interactions can affect heritability calculations. However, human genetic studies of missing heritability have paid little attention to the potential impact of genetic interactions."
Now here's the beauty of this paper. They do not deny the additive risk model. They extend it:
"We thus introduce the limiting pathway (LP) model, in which a trait depends on the rate-limiting value of k inputs, each of which is a strictly additive trait that depends on a set of variants (that may be common or rare). When k = 1, the LP model is simply a standard additive trait. For k > 1, we show that LP(k) traits can have substantial phantom heritability."
Again, mathematician thinking here, but that's exactly what models are for: some traits may very well be additive. However, the additive model does not fit all the data we observe. Hence we need a better model, one that encompasses the old one and at the same time goes beyond it. Gene-gene interactions need not explain all missing heritability. But since they've been observed, we need to account for them in those situations where they may be real.
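For the curious, here is a small simulation sketch of the idea, not the paper's actual analysis. I'm assuming each of the k inputs is a standard normal trait made of an additive genetic part (variance h2) plus independent environmental noise, that the LP(k) trait is the minimum of the k inputs, and that heritability is estimated from twins with the classic formula 2 * (r_MZ - r_DZ). All function names and parameter values are mine.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def lp_trait(genetic_values, h2, rng):
    """LP(k) trait: the rate-limiting (minimum) value of k additive inputs."""
    env_sd = (1.0 - h2) ** 0.5
    return min(g + rng.gauss(0.0, env_sd) for g in genetic_values)


def twin_correlation(k, h2, shared, n_pairs, rng):
    """Trait correlation between twins whose genetic values share a
    fraction `shared` of their variance (1.0 for MZ, 0.5 for DZ)."""
    common_sd = (h2 * shared) ** 0.5
    own_sd = (h2 * (1.0 - shared)) ** 0.5
    first, second = [], []
    for _ in range(n_pairs):
        g1, g2 = [], []
        for _ in range(k):
            common = rng.gauss(0.0, common_sd)
            g1.append(common + rng.gauss(0.0, own_sd))
            g2.append(common + rng.gauss(0.0, own_sd))
        first.append(lp_trait(g1, h2, rng))
        second.append(lp_trait(g2, h2, rng))
    return pearson(first, second)


def falconer_estimate(k, h2=0.8, n_pairs=20000, seed=1):
    """Return (r_MZ, r_DZ, 2 * (r_MZ - r_DZ)) for a simulated LP(k) trait."""
    rng = random.Random(seed)
    r_mz = twin_correlation(k, h2, 1.0, n_pairs, rng)
    r_dz = twin_correlation(k, h2, 0.5, n_pairs, rng)
    return r_mz, r_dz, 2.0 * (r_mz - r_dz)


if __name__ == "__main__":
    for k in (1, 3):
        r_mz, r_dz, estimate = falconer_estimate(k)
        print(k, round(r_mz, 3), round(r_dz, 3), round(estimate, 3))
```

With k = 1 the trait is plainly additive, so the twin-based estimate should land near the h2 I plugged in, up to sampling noise. With k > 1 the DZ correlation drops below half the MZ correlation, which is the signature of non-additivity that an additive reading of the same twin data would misinterpret.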
"The potential magnitude of phantom heritability can be illustrated by considering Crohn's disease, for which GWAS have so far identified 71 risk associated loci (13). Under the usual assumption that the disease arises from a strictly additive genetic architecture, these loci explain only 21.5% of the estimated heritability. However, if Crohn's disease instead follows an LP(3) model, the phantom heritability is 62.8%, thus genetic interactions could account for 80% of the currently missing heritability."
"In short, genetic interactions may greatly inflate the apparent heritability without being readily detectable by standard methods. Thus, current estimates of missing heritability are not meaningful, because they ignore genetic interactions."
"The results show that mistakenly assuming that a trait is additive can seriously distort inferences about missing heritability. From a biological standpoint, there is no a priori reason to expect that traits should be additive. Biology is filled with nonlinearity: The saturation of enzymes with substrate concentration and receptors with ligand concentration yields sigmoid response curves; cooperative binding of proteins gives rise to sharp transitions; the outputs of pathways are constrained by rate-limiting inputs; and genetic networks exhibit bistable states."
Mother Nature did not create mathematics. We created mathematics to describe Mother Nature. We start with a simple model and build on it. The data is always the reality check; we should never forget that.

[1] Lander, E., Linton, L., Birren, B., Nusbaum, C., Zody, M., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., et al. (2001). Initial sequencing and analysis of the human genome. Nature, 409 (6822), 860-921. DOI: 10.1038/35057062

[2] Zuk, O., Hechter, E., Sunyaev, S., & Lander, E. (2012). The mystery of missing heritability: Genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences, 109 (4), 1193-1198. DOI: 10.1073/pnas.1119675109