Monday, May 20, 2013

Parsing the DNA Crazy Quilt

A measure of how little we know about the real-world workings of evolution is that science still can't explain why some organisms have huge imbalances in the chemical composition of their DNA. If you look at the genome of Clostridium botulinum (the botulism germ), 72% of the bases in its DNA are either 'A' or 'T': adenine or thymine. (The four possibilities are, of course, adenine, thymine, guanine, and cytosine.) Conversely, you can find many examples of organisms in which the DNA is mostly 'G' or 'C.' The question is why A, T, G, and C don't occur in roughly equal proportions (which is what you'd expect after millions of years of genetic averaging; you'd expect some sort of regression to the mean).

Just to give you an idea of what GC/AT imbalance really looks like, here's the gene for the enzyme adenine deaminase from Clostridium botulinum, with all the A and T values in red:

ATGTATAAAAATATACAAAGAGAAATCTATAAAAATACAAAAGGAGACGGGGATATGTTTAATAAATTTGATACAAAGCCTCTTTGGGAGGTAAGTAAA ACTTTATCAAGTGTAGCACAGGGGCTTGAACCGGCTGATATGGTTATTATAAATTCAAGGCTTATAAATGTCTGTACAAGAGAAGTCATAGAAAACACA GATGTAGCAATTAGCTGTGGAAGAATTGCTTTAGTAGGTGATGCAAAACATTGCATAGGGGAAAACACAGAGGTAATTGATGCAAAAGGACAATATATT GCACCAGGTTTTTTAGATGGTCATATTCATGTTGAATCATCAATGTTAAGTGTAAGCGAATATGCTCGTTCAGTAGTTCCACATGGTACTGTCGGAATA TATATGGATCCACATGAAATTTGTAATGTACTCGGATTAAATGGTGTACGTTATATGATTGAAGATGGCAAGGGTACTCCACTTAAAAATATGGTGACC ACACCATCCTGTGTACCAGCAGTTCCAGGTTTTGAAGATACAGGAGCGGCTGTAGGACCAGAAGATGTTAGAGAAACAATGAAGTGGGATGAAATAGTT GGATTAGGAGAAATGATGAACTTCCCAGGTATACTTTATTCTACAGATCATGCTCATGGAGTAGTAGGAGAAACTTTAAAAGCTAGTAAAACAGTAACA GGACATTATTCTTTACCTGAAACAGGAAAAGGATTAAATGGATATATTGCATCAGGTGTAAGATGTTGTCATGAATCCACAAGAGCGGAAGATGCTCTT GCTAAAATGCGCCTTGGAATGTATGCAATGTTTAGAGAAGGATCTGCATGGCATGACTTAAAGGAAGTAAGTAAAGCCATTACAGAAAATAAGGTAGAT AGTAGATTTGCTGTTTTAATATCTGATGATACTCACCCACACACATTGCTTAAGGATGGACATTTAGATCATATTATAAAACGTGCTATAGAAGAAGGG ATAGAGCCATTAACTGCAATTCAAATGGTAACAATAAATTGTGCACAATGTTTCCAAATGGATCATGAATTAGGTTCTATAACTCCAGGAAAATGTGCA GATATTGTATTTATAGAAGATTTAAAAGATGTAAAAATAACAAAGGTTATTATAGATGGAAATTTAGTTGCAAAGGGTGGACTATTAACTACTTCAATA GCTAAATATGATTATCCTGAAGATGCTATGAATTCAATGCATATTAAGAATAAAATAACACCAGATTCCTTTAATATTATGGCTCCTAATAAAGAAAAA ATAACTGCAAGGGTTATTGAAATTATACCTGAAAGAGTTGGTACATATGAGAGACATGTTGAACTTAATGTTAAAGATGATAAAGTTCAATGTGATCCA AGTAAAGATGTTTTAAAAGCAGTTGTATTTGAAAGACACCATGAAACAGGAACAGCAGGATATGGTTTTGTTAAAGGTTTTGGTATTAAGAGAGGAGCT ATGGCTGCAACAGTTGCCCATGATGCTCACAACTTATTAGTTATAGGAACAAATGATGAAGATATGGCATTAGCTGCTAATACATTAATAGAATGTGGT GGAGGAATGGTAGCCGTACAAGATGGTAAAGTATTAGGCTTAGTTCCATTACCAATAGCAGGACTTATGAGTAATAAGCCTTTAGAAGAAATGGCAGAA ATGGTAGAAAAACTAGATAGTGCATGGAAAGAAATAGGATGTGATATAGTTTCACCATTTATGACAATGGCACTTATTCCACTTGCCTGCCTACCAGAA TTAAGACTAACTAATAGAGGGTTAGTTGATTGTAATAAGTTTGAATTTGTATCATTATTTGTAGAAGAATAA

View gene at FastaView.


The organism Actinomyces oris (which occurs in the film that builds up on teeth) has an adenine deaminase gene that looks like this:

ATGGCCGATCAACCGTCCGCAGACCTGCTTATCAAGGACGCGCGCATCGTCCCTTTCCGGTCCCGTACCGAACTGGGTGCGCTGCGCCGAGGTGACCCT CACCCCGGCGCCTTGGCCGCGCCGCCGCCCCCGGGTGAGCCCGTGGATGTGCGTATCAAGGCGGGCCGGGTCGTCGAGGTGGGACAGGGGCTGAGTGCT CCCGGGACACGGGTCCTTGAGGCCGAGGGCTCCTTCCTCATTCCCGGCCTGTGGGACGCTCACGCCCACCTGGACATGGAGGCGGCGCGCTCGGCACGC ATCGACACGCTGGCCACCCGCAGCGCGGAGGAGGCCCTGGAGCTGGTGGCACGGGCGCTGCGGGATCATCCGGCCGGTTCGCCTCCGGCCACGATCCAG GGCTTCGGGCACCGCCTGTCCAACTGGCCCCGGGTGCCCACGGTGGCCGAGCTCGACGCCGTCACCGGGGAGGTTCCCACGCTGCTCATCTCCGGGGAC GTGCACTCCGGGTGGCTGAACTCGGCGGCGCTGCGTGTCTTCGGCCTGCCGGGGGCCAGCGCCCAGGACCCGGGAGCACCGATGAAGGAGGACCCGTGG TTCGCCCTACTCGACCGCCTCGATGAGGTCCCGGGGACACGCGAGCTGCGGGAGTCCGGCTACCGACAGGTCCTGGCCGACATGCTGTCCCGGGGCGTC ACCGGCGTGGTGGACATGAGCTGGTCGGAGGATCCCGATGACTGGCCGCGGCGCCTGCGGGCCATGGCGGACGAGGGCGTACTCCCCCAGGTGCTGCCC CGCATCCGCATCGGGGTCTACCGCGACAAGCTGGAACGGTGGATCGCCCGGGGCCTGCGCACCGGGACCGCGCTGGCAGGCTCACCCCGCCTGCCCGAC GGTTCCCCGGTGCTGGTGCAGGGGCCGCTCAAGGTGATCGCAGACGGCTCGATGGGCTCGGGCAGCGCACACATGTGCGAGCCCTATCCCGCCGAGCTG GGCCTGGAGCACGCCTGCGGCGTGGTCAACATCGACCGGGCCGAGCTCACCGACCTCATGGCCCACGCCTCCCGGCAGGGTTATGAGATGGCCATCCAC GCCATCGGGGACGCGGCGGTCGACGACGTCGCCGCGGCCTTCGCGCACTCGGGTGCCGCCGGGCG

For whatever reason (and that's the point: we have no idea why), Actinomyces has chosen an AT-poor dialect for its DNA, even though it has to make many of the same types of genes as Clostridium.
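To put a number on how lopsided a sequence is, the G+C fraction can be computed directly from the text of a gene. The following is a minimal Python sketch; the function name `gc_content` and the two variable names are mine, and the two strings are just the first 30 bases of each gene shown above.

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C.

    Whitespace (spaces, line breaks) is ignored, so a sequence can be
    pasted in as it appears on the page.
    """
    bases = "".join(seq.split()).upper()
    if not bases:
        raise ValueError("empty sequence")
    gc = sum(1 for b in bases if b in "GC")
    return gc / len(bases)


# First 30 bases of each adenine deaminase gene, copied from the post.
clostridium_start = "ATGTATAAAAATATACAAAGAGAAATCTAT"
actinomyces_start = "ATGGCCGATCAACCGTCCGCAGACCTGCTT"

if __name__ == "__main__":
    print(gc_content(clostridium_start))
    print(gc_content(actinomyces_start))
```

Pasting in a complete gene (spaces and all) gives the figure for the whole gene rather than for a 30-base sample.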

Some people don't see this as a major puzzle: One organism evolved its DNA to a super-AT-rich state, another one didn't. So what? It's all random drift.

I disagree. It's not drift. We know of two strong forces that should keep organisms like Actinomyces from developing high G+C content. First is "AT pressure." It's known that mutations naturally tend to go in the GC-->AT direction. (One study found that in Salmonella typhimurium, GC-->AT mutations outnumbered AT-->GC mutations 50 to 1.) In the absence of corrective measures, natural mutations would very quickly lead all organisms in the direction of DNA with a very low G+C content.

A second important force is that of lateral gene transfer, which we know is common in microorganisms; common enough, certainly, to "even out" GC/AT ratios over evolutionary timescales. Random uptake of foreign genes by cells should tend to make A, G, C, and T levels equal, over time. For organisms like Clostridium and Actinomyces (and many others), this clearly hasn't happened.

In an earlier post I mentioned one possible reason organisms drift away from the 50-50 GC/AT centerline. DNA replication is more efficient when the template is biased toward one extreme (GC) or the other (AT), assuming endogenous nucleotide levels can be regulated in a similarly biased fashion (which they presumably are, in these organisms).

One might speculate that GC/AT extremism also simplifies DNA maintenance and repair. Imagine that your DNA is 70% G+C. A super-simple DNA repair tactic for deaminated purines would be to just replace every defective purine with a guanine. Seven out of ten times, blind replacement of defective purines with guanine would be the correct repair, if you're Actinomyces. And one out of three times, mistakes wouldn't matter anyway, because high-GC codons tend to be fourfold degenerate. (In a fourfold degenerate codon, you can replace the third base with anything, whether A, G, C, or T, without changing the codon's meaning.) Blind guanine substitution would thus have roughly an 80% success rate (0.7 + 0.3 × 1/3 = 0.8) in a high-GC organism that needed to replace defective purines.

It turns out there are other reasons to live "away from centerline," if you're a bacterium. I'll talk about those in another post.

Saturday, May 18, 2013

Information Theory in Three Minutes

Claude Shannon, the father of information theory, used to play an interesting game at cocktail parties. He'd grab a book, open it to a random page, and cover up all but the first letter on the page, then ask someone to guess the next letter. If the person couldn't guess, he'd uncover the letter, then ask the person to guess the next letter. (Suppose the first two letters are 'th'. A reasonable guess for the next letter might be 'e'.) Shannon would continue in this manner, keeping score, until a good deal of text had been guessed. The further along one goes in this game, the easier it becomes (of course) to guess downstream letters, because the upstream letters provide valuable context.

What Shannon consistently found from experiments of this sort is that well over half of English letters are redundant, because they can be guessed in advance. In fact, Shannon found that when all forms of redundancy are taken into account, English is more than 75% redundant, with the average information content of a letter being approximately one bit per symbol. (Yes, one bit. See Shannon's "Prediction and Entropy of Printed English.")

Claude Shannon
Shannon became intrigued by questions involving the efficiency of information transfer. What is the nature of redundancy in an information stream? Are some encodings more redundant than others? How can you quantify the redundancy? Eventually, Shannon elaborated a mathematical theory around the encoding and decoding of messages. That theory has since become extremely important for understanding questions of encryption, compression, detection of faint signals in the presence of noise, recovery of damaged signals, and so on.

A central concept in Shannon's theory is that of entropy. "Shannon entropy" is very widely misunderstood and/or misinterpreted, so it's important to be clear on what it's not. It's not disorder: Entropy, in information theory, is not the same as entropy in thermodynamics, even though the mathematics are similar. Shannon liked to consider entropy a statistical parameter reflecting the amount of information (or resolved uncertainty) encoded, on average, by a symbol. We think of the English alphabet as having 26 symbols. Since 26 values can be encoded in log2(26) == 4.7 bits, we say that the channel bandwidth for 26-letter English is 4.7 bits per symbol, but this is not the entropy. Shannon found that the entropy (the actual bits used per symbol) was closer to 1.0 than to 4.7. How can this be? The answer has to do with the fact that some symbols are used far more often than others; and also (as noted), some symbols are redundant by virtue of context.

Entropy gets to the actual (rather than ideal) information content of a message by taking into account actual frequencies of usage of symbols. If English text used all letters of the alphabet equally (and unpredictably), then the entropy of text would be exactly log2(26), or about 4.7 bits per symbol. Each symbol would contribute 1/26th of -log2(1/26) to the total. But because some letters are used more often than others, each letter instead contributes its own frequency p times -log2(p), and that total adds up to less than 4.7.
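Written out, the quantity being described is the standard order-zero Shannon entropy. The symbols here are my notation, not the post's: p_i is the frequency of the i-th symbol and H is the entropy in bits per symbol.

```latex
H = -\sum_{i=1}^{n} p_i \log_2 p_i

% Uniform 26-letter alphabet: every p_i = 1/26, so
H_{\text{uniform}} = -\sum_{i=1}^{26} \frac{1}{26}\log_2\frac{1}{26}
                   = \log_2 26 \approx 4.70 \text{ bits per symbol}
```

Any departure from equal frequencies makes H smaller than log2(26), which is the ceiling for a 26-symbol alphabet.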

It's easy to visualize this with a simple example involving coin-tossing. Suppose, for sake of example, that a series of coin tosses comprises a message. As a medium of communication, the coin toss is capable of expressing only two states: heads, or tails. This could be represented in binary form as 1 and 0. If half of all tosses are heads and half are tails, then the total entropy in the message is 0.5 * log2(0.5) for heads plus 0.5 * log2(0.5) for tails, or one bit per symbol. (Note: If you actually do the math you'll come up with a negative-1. Hence, in entropy calculations, the result is usually multiplied by -1 so it can be expressed as a positive number.)

Consider now the situation of a two-headed coin. In this case, there is no "tails" term and the heads term is 1.0 * log2(1.0), or zero. This means the tossing of a two-headed coin resolves no uncertainty and carries no information.

Continuing the example, consider the case of a weighted penny that falls heads-up two-thirds of the time. Intuitively, we know that this kind of coin toss can't possibly convey as much information as a "fair" coin toss. And indeed, if we calculate 2/3 * log2(2/3) for heads plus 1/3 * log2(1/3) for tails, we get an entropy value of 0.9183 bits per symbol, which means that each toss is (on average) 1.0 - 0.9183 == .0817 or 8.17% redundant. If one were to take a large number of coin tosses involving the weighted penny and convert those tosses into symbols ('h' for heads and 't' for tails, say), the resulting data stream would be compressible to 91.83% of its fully expanded size, and then it wouldn't compress any more beyond that, because that's the entropy limit.
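The three coins (fair, two-headed, weighted) can be checked with a few lines of Python. This is only a sketch; the function name `entropy` is mine, and it computes the order-zero entropy from a list of probabilities, using log2(1/p) so the result comes out positive without a separate sign flip.

```python
import math


def entropy(probabilities):
    """Order-zero Shannon entropy, in bits per symbol.

    Outcomes with probability 0 contribute nothing and are skipped.
    """
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


fair_coin = entropy([0.5, 0.5])          # heads, tails
two_headed_coin = entropy([1.0])         # heads only
weighted_penny = entropy([2 / 3, 1 / 3])  # heads two-thirds of the time

if __name__ == "__main__":
    print(fair_coin, two_headed_coin, round(weighted_penny, 4))
    print("redundancy of weighted penny:", round(1.0 - weighted_penny, 4))
```

The same function works for any number of symbols, so passing it 26 equal probabilities reproduces the 4.7-bit ceiling for English letters.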

Actually, that last statement needs to be qualified. We're assuming, throughout this example, that the result of any given coin toss does not depend on the outcome of the preceding toss. If that rule is violated, then the true entropy of the "message" could be much lower than 0.9183 bits per symbol. For example, suppose the result of 12 successive coin-tosses were: h-h-t-h-h-t-h-h-t-h-h-t. There's a recurring pattern, and the pattern makes the stream predictable. Predictability reduces entropy; remember Shannon's cocktail-party experiment. (You might ask yourself what a message with all possible redundancy removed would look like, and in what way or ways, if any, it would differ from apparent randomness.)

Technically speaking, when symbols represent independent choices (not depending on what came before), the entropy can be calculated as before, and it's called the order-zero entropy. But if any given symbol depends on the value of the immediately preceding symbol, we have to distinguish between order-zero and order-one entropy. There are also order-two, order-three, and higher-order entropies, representing contexts of contexts.
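To see how context lowers entropy, here is a rough Python sketch that estimates order-k entropy from a sample string by counting how often each symbol follows each k-symbol context. The function name `order_k_entropy` is mine, and the estimate is only as good as the sample: on a short string like the 12-toss example, the first k symbols have no full context and are simply left out.

```python
import math
from collections import Counter


def order_k_entropy(seq, k):
    """Estimate the entropy of the next symbol given the previous k symbols.

    k = 0 gives the ordinary order-zero entropy. Result is in bits per symbol.
    """
    context_counts = Counter()
    pair_counts = Counter()
    for i in range(k, len(seq)):
        context = seq[i - k:i]
        context_counts[context] += 1
        pair_counts[(context, seq[i])] += 1
    total = len(seq) - k
    h = 0.0
    for (context, _symbol), count in pair_counts.items():
        # P(context, symbol) * log2(1 / P(symbol | context))
        h += (count / total) * math.log2(context_counts[context] / count)
    return h


tosses = "hht" * 4  # the h-h-t-h-h-t-h-h-t-h-h-t example

if __name__ == "__main__":
    for k in (0, 1, 2):
        print(k, round(order_k_entropy(tosses, k), 4))
```

For the repeating h-h-t pattern the order-zero figure is the familiar 0.918 bits, the order-one estimate drops to 8/11 (about 0.727) because a tail is always followed by a head, and the order-two estimate is zero because two tosses of context fully determine the next one.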

Suppose now I tell you that an organism's DNA can contain only two types of base-pairs: GC and AT. (You should be thinking "coin toss.") Suppose, further, I tell you that a particular organism's DNA is 70% GC. Disregarding higher-order entropy, does the DNA contain redundancy? If so, how much? Answer: 0.7 * log2(0.7) for GC plus 0.3 * log2(0.3) for AT, multiplied by -1 as before, equals 0.8813 bits per base pair, meaning redundancy is about 12%. Could the actual redundancy be higher? Yes. It depends what kinds of recurring patterns exist in the actual sequence of A, G, C, and T values. There might be recurring motifs of many kinds. Each would send entropy lower.

Further Reading
Shannon's best-known paper, "A Mathematical Theory of Communication," Bell System Technical Journal, July and October 1948
"A Symbolic Analysis of Relay and Switching Circuits," Shannon's unpublished master's thesis
Claude Shannon's contribution to computer chess
Shannon-Fano coding
Nyquist-Shannon Sampling Theorem





Wednesday, May 15, 2013

Back Pain: An Infectious Process?

I wrote a piece for Big Think the other day about the recent finding that many cases of back pain are septic in nature: the pain comes from propionic acid (and other acids) released by anaerobic bacteria that have found their way into spinal-disc tissues. The principal offender is something called Propionibacterium acnes, a common mouth and skin germ that also often can be found in lung tissue.

Propionibacterium acnes can take on an intracellular lifestyle.
Propionibacterium is, for most of us, a harmless stowaway. It is characterized as a "low virulence" organism, meaning it doesn't aggressively pathologize the host by default, the way (for example) a tuberculosis bacterium does. But for certain individuals, under certain conditions, Propionibacterium can be a major hazard. In addition to causing acne (and in severe cases, an accompanying arthritis), P. acnes is also seen in post-operative infections, prosthesis failure, breast-implant infection, corneal infection, sarcoidosis, bacteremia, and inflammation of lumbar nerves. Its involvement in sarcoidosis is controversial. What seems to be happening is that a special protein (a "trigger factor"), secreted by P. acnes, stimulates a cellular immune response in sensitive individuals. The macrophages that arrive to attack P. acnes become overwhelmed by the bacteria as they go into intracellular-parasite mode. Granulomas then form as P. acnes takes up residence in the aggregated macrophages. (More about P. acnes's role in sarcoidosis can be found in the March 2013 paper by Eishi in Respiratory Investigation.)

The immunological response triggered by P. acnes can be far-reaching. In the 1980s, back when P. acnes was called Corynebacterium parvum, researchers found that killed suspensions of the bacteria, injected into mice, caused 80% to 100% suppression of tumor growth. The dead bacteria stimulated the murine immune system to the point where mice could fight off cancer. Why this technique has not been used for human cancer treatment, I don't know. (It might be because it's too cheap and too easy. What do you think?)

In recent years, researchers have been finding P. acnes in the lumbar discs of back patients, typically at the rate of 40% to 50%. (About half of patients don't have the bacterium.) Simply finding the bacterium in discs doesn't prove a causal role for P. acnes in back pain, of course, but in a double-blind randomized controlled trial involving back patients who got either placebo or amoxicillin for 100 days, the amoxicillin-treated patients did better (both over the 100 days and a year later), which tends to suggest that P. acnes might well be playing a causal role in back pain.

People hurt their backs (to a greater or lesser degree) all the time without experiencing huge pain or lasting damage, but in a certain proportion of cases, disc herniation leads to Type 1 Modic Change (so-called bone edema) in nearby vertebrae, and at that point you're almost guaranteed to be in excruciating pain. But antibiotics might obviate the need for surgery, in at least some cases.

The nuclear material of intervertebral discs is an ideal place for P. acnes (an anaerobe) to grow, because it's warm, nutrient-rich, and (with no vascular content) oxygen-depleted. The question of how P. acnes finds its way into a disc in the first place is an interesting one (which I discuss in my post at Big Think). The short answer is, there's a ton of P. acnes in your mouth, especially if you happen to be (how shall we say?) not very attentive to oral hygiene, and bacteria can enter the bloodstream directly via the gums when you brush your teeth or have them professionally cleaned (or when a dentist picks and pokes at your teeth with one of those sharp pointy thingies). Almost any dental event, even vigorous brushing, can lead to a transient bacteremia. Your spleen and white blood cells will clear bacterial cells from your blood very quickly, of course, and there are factors in your blood that are chemotoxic to most bacteria, but if a few P. acnes cells happen to stay in your blood long enough to find an inflammation zone in your body (where they can take up residency), you could be in trouble. By "inflammation zone," I mean an inflamed joint, a catheter or shunt, an implant of any kind, or any irritated tissue, really. Did you recently hurt your back? That counts.

Because even tooth brushing poses a significant risk of bacteremia, you may want to consider investing in a stock of mouthwash and using it before every brushing, to cut the live-bacteria count down and thus reduce your risk of lumbar disc infection, endocarditis, sarcoidosis, acne, and other bacteremic sequelae involving P. acnes. If you think I'm being alarmist, fine; you're entitled to your opinion. For me, it's mouthwash five times a day.

Saturday, May 11, 2013

DNA G+C Content and Survival Value

One of biology's big open questions is why organisms differ so much with regard to the relative amounts of GC and AT in their DNA. You'd think that if there are only two kinds of DNA base pairs (see diagram) they'd be more-or-less equally abundant. Not so. There are organisms with DNA that's mostly GC (and/or CG) pairs; there are organisms with very-AT-rich DNA; and within the chromosomes of higher organisms you find large GC-rich regions (isochores) in the midst of great swaths of AT-rich DNA.
DNA contains adenine and thymine in equal amounts, and guanine and cytosine in equal amounts, but it does not usually contain GC pairs and AT pairs in equal amounts. And it doesn't seem as if there is an "optimum" GC:AT ratio. The GC:AT ratio varies by species. Within a species, it's constant.

There are two really odd facts at work here:

1. The GC content of DNA varies by species, and it varies a lot.

2. Evolution doesn't seem to trend toward an "optimum GC:AT ratio" of any kind.

If there were such a thing as an optimum GC:AT ratio for DNA, surely microorganisms would've figured it out by now. Instead, we find huge diversity: There are bacteria on every point in the GC% spectrum, running from 16% GC for the DNA of Candidatus Carsonella ruddii (a symbiont of the jumping plant louse) to 75% for Anaeromyxobacter dehalogenans 2CP-C (a soil bacterium). At each end of the spectrum you find aerobes and anaerobes; extremophiles and blandophiles; pathogens and non-pathogens. About the only generalization you can make is that the smaller an organism's genome is, the more likely it is to be rich in A+T (low GC%).

Genome size correlates loosely with GC content. The very smallest bacteria tend to have AT-rich (low GC%) DNA.
The huge diversity in GC:AT ratios among bacteria is impressive. But does it simply represent a random walk all over the possibility-space of DNA? Or do the various points on the spectrum constitute special niches with important advantages? What advantage could there be for having high-GC% DNA? Or high-AT% DNA?

Some subtle clues tell us that this is not just random deviation from the mean. First, suppose we agree for sake of argument that lateral gene transfer (LGT) is common in the microbial world (a point of view I happen to agree with). Over the course of millions of years, with pieces of DNA of all kinds (high GC%, low GC%) flying back and forth, LGT should force a regression to the mean: It should make genomes tend toward a 50-50 GC:AT ratio. That clearly hasn't happened.

And then there are ordinary mutational pressures. It's beginning to be fairly well accepted (see Hershberg and Petrov, "Evidence That Mutation is Universally Biased Toward AT in Bacteria," PLoS Genetics, 2010, 6:9, e1001115, full version here) that natural mutation is strongly biased in the direction of AT by virtue of the fact that deamination of cytosine and methylcytosine (which occurs spontaneously at high frequency) leads to replacement of 'C' with 'T', hence GC pairs becoming AT pairs. The strong natural mutational bias toward AT says that all DNA should creep in the direction of low GC% and end up well below 50% GC. But again, this is not what we see. We see that high-GC organisms like Anaeromyxobacter (and many others) maintain their DNA's unusually high (75%) GC content across millions of generations. Even middle-of-the-road organisms like E. coli (with 50% GC content) don't slowly slip in the direction of high-AT/low-GC.

Clearly, something funny is going on. For a super-high-GC organism like Anaeromyxobacter to maintain its DNA's super-high GC content against the constant tug of mutations in the AT direction, it must be putting significant energy into maintaining that high GC percentage. But why? Why pay extra to maintain a high GC%? And how does the cost get paid?

I think I've come up with a possible answer. It has to do with DNA replication cost, where "cost" is figured in terms of time needed to synthesize a new copy of the DNA (for cell division). Anything that favors low replication cost (high replication speed) should favor survival; that's my main assumption.

My other assumption is that DNA polymerases (the enzymes involved in replication) are not clairvoyant. They can't know, until the need arises, which of the four deoxyribonucleotide triphosphates (dATP, dTTP, dGTP, dCTP) will be needed at a given moment, to elongate the new strand of DNA. When the need arises for (let's say) an 'A', the 'A' (in the form of dATP) has to come from an existing endogenous pool of dNTPs containing all four bases (dATP, dTTP, dGTP, dCTP) in whatever concentrations they're in. The enzyme has to wait until a dATP (if that's what's needed) randomly happens to lock into the active site. Odds are only one in four (assuming equal concentrations of dNTPs) of a dATP coming along at exactly the right moment. Odds are 3 out of 4 that some incorrect dNTP (either dGTP, dTTP, or dCTP) will try, and fail, to fit the active site first, before dATP comes along.

But imagine that your DNA is 75% G+C. And suppose you've regulated your intracellular metabolism to maintain dGTP and dCTP in a 3:1 ratio over dATP and dTTP. The odds of a good random "first hit" go up.
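The "first hit" odds are easy to work out explicitly. The sketch below is mine, not the author's code: it assumes G and C are equally common with each other, A and T likewise, in both the template and the pool, and the names `base_fractions` and `first_hit_probability` are made up for illustration. The chance that the very first base fetched is the right one is the sum, over the four bases, of how often that base is needed times how often it turns up in the pool.

```python
def base_fractions(gc_fraction):
    """Per-base fractions for a given G+C fraction (assumes G == C, A == T)."""
    at_each = (1.0 - gc_fraction) / 2.0
    gc_each = gc_fraction / 2.0
    return {"A": at_each, "T": at_each, "G": gc_each, "C": gc_each}


def first_hit_probability(template_gc, pool_gc):
    """Probability that the first dNTP fetched matches the base called for."""
    needed = base_fractions(template_gc)  # how often each base is called for
    pool = base_fractions(pool_gc)        # how often each base is drawn
    return sum(needed[b] * pool[b] for b in "ATGC")


if __name__ == "__main__":
    print(first_hit_probability(0.50, 0.50))  # unbiased DNA, unbiased pool
    print(first_hit_probability(0.75, 0.75))  # 75% GC DNA, 3:1 GC-biased pool
```

With equal concentrations the first-hit odds are 0.25, the one-in-four figure from the previous paragraph; with a 75%-GC template and a matching 3:1 pool they rise to 0.3125. First-hit odds are only part of the story, though, since the rarer bases then take longer to show up when they are needed.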

To simulate the various possibilities, I wrote software (in JavaScript) that simulates DNA replication, where the template DNA molecule is 1000 base-pairs in length and the dNTP pool size is 10000 bases. The software allows you to set the organism's genome GC% to whatever you want, and also set the dNTP pool's relative GC percentage to whatever you want. The template DNA is just a random string of A, T, G, and C bases (1000 total), reflecting their relative abundances as set in the GC% parameter. The pool of dNTPs is set up to be a randomized array (again reflecting abundances set in a GC% parameter).

The way the software works is this. Read a base off the template. Fetch a base randomly from the base pool. If the base happens to be the one (out of four) that's called for, score '1' for the timing parameter, and continue to read another base off the template. If the base was not the one that's called for, put it back in the pool array in a random location, then randomly fetch another base from the pool; and increment the timing parameter. (For each fetch, the timing parameter goes up by 1.) Keep fetching (and throwing back bases) until the proper base comes up, incrementing the time parameter as appropriate. (The time parameter keeps track of the number of fetch attempts.) When the correct base turns up, the pool shrinks by one base. In other words, replication consumes the pool, but as I said earlier, the pool contains ten times as many bases (to start) as the DNA template. So the pool ends up 10% smaller at the end of replication.
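For readers who want to try this themselves, here is a rough re-creation of the procedure in Python. It is not the original JavaScript, and all names in it are hypothetical. Two simplifications are mine: the pool is stored as a count per base rather than as a shuffled array (drawing in proportion to the counts is statistically the same as picking a random slot, and throwing a wrong base back leaves the counts unchanged), and the base "called for" is taken to be the complement of the template base.

```python
import random

BASES = "ATGC"
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def base_fractions(gc_fraction):
    """Per-base fractions for a given G+C fraction (assumes G == C, A == T)."""
    at_each = (1.0 - gc_fraction) / 2.0
    gc_each = gc_fraction / 2.0
    return {"A": at_each, "T": at_each, "G": gc_each, "C": gc_each}


def make_template(length, gc_fraction, rng):
    """Random template with the requested G+C fraction (on average)."""
    w = base_fractions(gc_fraction)
    return rng.choices(BASES, weights=[w[b] for b in BASES], k=length)


def make_pool(size, gc_fraction):
    """dNTP pool as a dict of counts per base."""
    w = base_fractions(gc_fraction)
    return {b: round(size * w[b]) for b in BASES}


def replicate(template, pool, rng):
    """Copy the template once; return the total number of fetches used."""
    fetches = 0
    for template_base in template:
        needed = COMPLEMENT[template_base]
        if pool[needed] == 0:
            raise ValueError("pool ran out of " + needed)
        while True:
            fetches += 1  # every fetch costs one time unit
            drawn = rng.choices(BASES, weights=[pool[b] for b in BASES])[0]
            if drawn == needed:
                pool[needed] -= 1  # correct base is consumed
                break
            # wrong base: thrown back, so the pool is unchanged
    return fetches


def fetches_per_base(template_gc, pool_gc, runs=100, length=1000,
                     pool_size=10000, seed=0):
    """Average fetches per template base over a number of Monte Carlo runs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        template = make_template(length, template_gc, rng)
        pool = make_pool(pool_size, pool_gc)  # fresh pool for every run
        total += replicate(template, pool, rng)
    return total / (runs * length)


if __name__ == "__main__":
    for pool_gc in (0.35, 0.50, 0.65):
        print(pool_gc, fetches_per_base(0.75, pool_gc))
```

Sweeping `pool_gc` from 0.15 to 0.85 for template GC fractions of 0.25, 0.50, and 0.75 should trace out curves of the kind discussed below, though exact values will vary with the random seed.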

Each point on this graph represents the average of 100 Monte Carlo runs, each run representing complete replication of a 1000-bp DNA template, drawing from a pool of 10,000 bases. The blue points are runs that used a DNA template containing 25% G+C content. The red points are runs that used DNA with 75% G+C. The X-axis represents different base-pool compositions. See text for details. Click for larger image.

I ran Monte Carlo simulations for DNA templates having GC contents of 75%, 50%, and 25%, using base pools set up to have anywhere from 15% GC to 85% (in 2.5% increments). The results for the 75% GC and 25% GC templates (representing high- and low-GC organisms) are shown in the above graph. Each point on the graph represents the average of 100 complete replication runs. The Y-axis shows the average number of fetches per DNA base (so, a low value means fast replication; a high value means slower DNA replication). The X-axis shows the percentage of GC in the base-pool, in recognition of the fact that relative dNTP abundances in an organism may vary, in accordance with environmental constraints as well as with organism-specific homeostatic setpoints.

Maximal replication speed (the low point of each curve) happens at a base-pool GC percentage that is displaced in the direction of the DNA's own GC%. So, for the 25%-GC organism (blue data points), max replication efficiency comes when the base-pool is about 33% GC. For the 75% GC organism (red points) the sweet spot is at a base-pool GC concentration of 65%. (Why this is not exactly symmetrical with the other curve, I don't know; but bear in mind, these are Monte Carlo runs. Some variation is to be expected.)

The interesting thing to note is that max replication efficiency, for each organism, comes at 3.73 fetches per base-pair (Y-axis). Cache that thought. It'll be important in a minute.

The real jaw-dropper is what happens when you plot a curve for template DNA with 50% GC content. In the graph below, I've shown the 50%-GC runs as black points. (The red and blue points are exactly as before.)

This is the same graph as before, but with replication data for a 50%-GC genome (black points). Again, each data point represents the average of 100 Monte Carlo runs. Notice that the black curve bottoms out at a higher level (4.0) than the red or blue curves (3.73). This means replication is less efficient for the 50%-GC genome.

Notice that the best replication efficiency comes in the middle of the graph (no big surprise), but check the Y-value: 4.00. The very fastest DNA replication, when the DNA template is 50% GC, requires 4 fetches per base, compared to best-case base-fetching efficiency of 3.73 for the 25%-GC and 75%-GC DNAs. What does this mean? It means DNA replication, in a best-case scenario, needs about 6.75% fewer fetches for the skewed-GC organisms. (The difference between 3.73 and 4.00 is 0.27, which is 6.75% of 4.00.)
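The two floor values, 3.73 and 4.00, can also be checked on paper under a simplifying assumption that is mine and not part of the simulation: treat every fetch as an independent draw from a pool whose composition stays fixed (in other words, ignore the 10% drawdown), with G = C and A = T in both template and pool. The symbols are introduced just for this sketch: t is the template's GC fraction, g is the pool's GC fraction, and E(g) is the expected number of fetches per base. Waiting for a base that makes up a fraction p of the pool takes 1/p fetches on average, so summing over the four bases gives:

```latex
E(g) = \frac{2t}{g} + \frac{2(1-t)}{1-g}

\frac{dE}{dg} = -\frac{2t}{g^{2}} + \frac{2(1-t)}{(1-g)^{2}} = 0
\;\Longrightarrow\;
g^{*} = \frac{\sqrt{t}}{\sqrt{t} + \sqrt{1-t}},
\qquad
E(g^{*}) = 2\left(\sqrt{t} + \sqrt{1-t}\right)^{2}

t = 0.50:\quad g^{*} = 0.5,\quad E(g^{*}) = 4
t = 0.75:\quad g^{*} \approx 0.634,\quad E(g^{*}) = 2 + \sqrt{3} \approx 3.732
t = 0.25:\quad g^{*} \approx 0.366,\quad E(g^{*}) = 2 + \sqrt{3} \approx 3.732
```

Under this idealization the two skewed curves are mirror images of each other, with minima near 37% and 63% pool GC, and the floor of 3.73 versus 4.00 falls out exactly. The curves are quite flat near their minima, so some scatter in where the simulated low point lands is to be expected.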

This goes a long way toward explaining why GC extremism is stable in organisms that pursue it. There is replication efficiency to be had in keeping your DNA biased toward high or low GC. (It doesn't seem to matter which.)

Consider the dynamics of an ATP drawdown. The energy economy of a cell revolves around ATP, which is both an energy molecule and a source for the adenine that goes into DNA and RNA. One would expect normal endogenous concentrations of ATP to be high relative to other NTPs. For a low-GC% organism, that's also a near-ideal situation for DNA replication, because high AT in the base pool puts you near the max-replication-speed part of the curve (see blue points). A sudden drawdown in ATP (when the cell is in crisis) shifts replication speed to the right-hand part of the blue curve, slowing replication significantly. This is what you want if you're an intracellular symbiont (or a mitochondrion, incidentally). You want to stop dividing when the host cell is unable to divide because of an energy crisis.

Consider the high-GC organism (red dots), on the other hand. If ATP levels are high during normal metabolism, replication is not as efficient as it could be, but so what? It just means you're willing to tolerate less-efficient replication in good times. But as ATP draws down (perhaps because nutrients are becoming scarce), DNA replication actually becomes more efficient. This is what you want if you're a free-living organism in the wild. You want to be able to continue replicating your DNA even as ATP becomes scarce. And indeed that's what happens (according to the red data points): As the base pool becomes more GC-rich, replication efficiency increases. The best efficiency comes when base-pool A+T is down around 35%.

I think these simulations are meaningful and I think they help explain the DNA-composition extremism seen among microorganisms. If you're a professional scientist and you find these results tantalizing, and you'd like to co-author a paper for PLoS Genetics (or another journal), please get in touch. (My Google mail is kas-dot-e-dot-thomas.) I'd like to coauthor with someone who is good with statistics, who can contribute more ideas to this line of investigation. I think these results are worth sharing with the scientific community at large.


Monday, May 06, 2013

Hydrogen Peroxide Powers Evolution

I'm about to offer a conjecture that is a bit preposterous-sounding but could well hold true. I actually think it does.

I propose that evolution, at the level of bacteria (though probably not at higher levels), is driven by hydrogen peroxide.

This theory rests on three assumptions. The first is that the creation of new bacterial species happens almost entirely via lateral gene transfer, not heritable point-mutations. The second is that bacteria (marine and terrestrial) are regularly exposed to challenges by hydrogen peroxide in the environment. The third is that those challenges drive lateral gene transfer.

Evidence for the first assumption is embarrassingly abundant. If you're not up to speed on the subject, I suggest you read the excellent paper, "Lateral Gene Transfer," by Olga Zhaxybayeva and W. Ford Doolittle in Current Biology, April 2011, 21:7, pp. R242-246 (unlocked copy here). It's now common to find that any given bacterial species can trace a good percentage of its protein base to "ancestors" that are too far removed horizontally to be ancestors in the conventional sense.

Consider E. coli. There are hundreds of strains of E. coli, with genes ranging in number from 4,100 to about 5,300 per strain. The problem is, the various strains of E. coli have only about 900 genes in common (and that's far too few genes to render a fully functional E. coli). The E. coli pan-genome actually takes in more than 15,000 gene families, total. Certainly, you can draw a family tree of E. coli based on 16S ribosomal polymorphisms, but that doesn't explain where the 15,000 pan-genome genes came from. The "family tree" metaphor quickly breaks down if you start drawing trees based on proteins. You get many conflicting trees—all of them correct.

Trees like this are fiction where bacteria are concerned. The tree of life is more like a net of life or web of life than a directed acyclic graph.

Where are all of the genes coming from? Other species, of course. They arrive by way of mechanisms like transformation, transduction, and conjugation, all of which allow direct entry of foreign DNA into a bacterial cell. At one time it was thought that conjugation could only occur between bacteria of the same species, but it is now known that cross-species conjugation also occurs (as, for example, between E. coli and Streptomyces or Mycobacterium).

Transduction, which is where viruses package up an infected host's genes in virus capsules that are then taken up by another cell, occurs naturally in bacterial populations in response to environmental factors like ultraviolet light and hydrogen peroxide. Exposure of a virus-carrying (lysogenic) cell to UV light or peroxide can induce runaway production of virus, and in fact this mechanism is used by Streptococcus to kill competitive Staphylococcus cells, in a clever bit of chemical warfare. It's been known for years that hydrogen peroxide can cause many types of bacteria to shed DNA. Now we know why: Hydrogen peroxide is a signalling molecule. It signals (among other things) lysogenic bacteria to go into a lytic cycle. It also signals cells to mount what's known as the SOS response, which is a global response to oxidative challenge. Years ago, Bruce Ames and his colleagues showed that exposing Salmonella to very dilute (60 micromolar) hydrogen peroxide caused the cells to differentially express 30 "SOS" proteins, including heat-shock proteins and low-fidelity DNA-repair systems. We know that hydrogen peroxide as dilute as 0.1 micromolar can induce phage (virus) production in up to 11% of marine bacteria. This is significant, because rainwater contains hydrogen peroxide in concentrations of 2 to 40 micromolar, and ocean water has been known to reach millimolar levels of H2O2 after a rain storm.

If you're wondering why rain contains hydrogen peroxide, the peroxide gets there in two ways. One is UV-frequency photochemistry (where water is cleaved to H and OH, then reforms as H2 and H2O2); the other is via ionization reactions caused by lightning. (Lightning is energetic enough to bring airborne oxygen and water to a plasma state. The resulting ionization and rearrangement of free atoms yields a certain amount of hydrogen peroxide.) The presence of H2O2 in rainwater has been confirmed many times, and in fact there's a well-preserved "fossil record" of it in polar icepacks, going back centuries. (Polar snowpacks contain from 10 to 900 ppb of H2O2; it varies seasonally, the max coming in summer.)
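To compare the snowpack figures with the micromolar figures quoted for rainwater, the ppb values can be converted to micromolar. This is a rough sketch, assuming "ppb" means micrograms of H2O2 per kilogram of meltwater and that a liter of meltwater weighs one kilogram; the function name is mine, and the molar mass of H2O2 is taken as about 34.01 g/mol.

```python
H2O2_MOLAR_MASS = 34.01  # g/mol, which is the same as micrograms per micromole

def ppb_to_micromolar(ppb, molar_mass=H2O2_MOLAR_MASS):
    """Convert a mass-based ppb (ug per kg, taken as ug per L) to umol per L."""
    micrograms_per_liter = ppb  # assumption: 1 L of meltwater weighs 1 kg
    return micrograms_per_liter / molar_mass

for ppb in (10, 900):
    print(f"{ppb} ppb is about {ppb_to_micromolar(ppb):.2f} micromolar")
```

Under those assumptions, 10 to 900 ppb works out to roughly 0.3 to 26 micromolar, which overlaps the 2 to 40 micromolar range given above for rainwater.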

Bottom line, every rain event (over land, over sea) constitutes a hydrogen peroxide challenge for microbes. Which induces viral transduction (and a release of whole-cell DNA through lysis, some of which will inevitably be used in transformation). It also induces low-fidelity DNA repair (which is guaranteed to help evolution along). Every rain event, in other words, is a chance for evolution to do its thing. For bacteria, that means gene-sharing within and across species lines.

Darwin's theory of a tree-like ancestor basis for all living things is dead wrong, at least for bacteria.

W. Ford Doolittle (who wrote a classic book chapter about lateral gene transfer called "If the Tree of Life Fell, Would We Recognize the Sound?") estimates that if a horizontal gene transfer occurs once every ten billion vertical replications, "it would be enough to ensure that no gene in any modern genome has an unbroken history of vertical descent back to some hypothetical last universal common ancestor." (See this article.)
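One way to get a feel for why so rare an event could matter is to ask how likely a single gene lineage is to escape transfer entirely. The sketch below is my own back-of-envelope arithmetic, not Doolittle's calculation: it treats each replication as an independent trial, and the generations-per-year and time-span values are hypothetical placeholders chosen only for illustration.

```python
import math

def prob_never_transferred(p_transfer, n_replications):
    """Chance that a lineage sees zero transfers in n independent replications."""
    # (1 - p)^n, computed via log1p so that a tiny p does not lose precision.
    return math.exp(n_replications * math.log1p(-p_transfer))

p = 1e-10                    # one transfer per ten billion replications (from the quote)
generations_per_year = 100   # hypothetical, for illustration only
years = 3.5e9                # hypothetical time span, for illustration only
n = generations_per_year * years

print(f"{prob_never_transferred(p, n):.1e}")
```

With these placeholder numbers the probability of an unbroken vertical history comes out on the order of 10^-16; different assumptions about generation time change the exponent but not the basic picture.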

It's obvious (to me, at least) that every rain event carries with it the potential to cause far more gene transfers than are necessary (according to Doolittle) to make vertical inheritance fade into insignificance as an evolutionary bringer of change. The hydrogen peroxide in rain has been driving lateral gene transfer in bacteria for eons. In fact, it is arguably the dominant driver of evolution in bacteria.

Sorry, Mr. Darwin. Point mutations handed down to sons and daughters just aren't cutting it.

Sunday, May 05, 2013

More Science on the Desktop

Not to keep harping on the amazing power of desktop omics tools, but I thought I'd share a tip for those of you into genome-mining. The tip in a nutshell is that if you gang-load a bunch of FASTA sequences (DNA sequence data) into the FeatView form at http://genomevolution.org, then click the rather inconspicuous button labeled "Phylogeny.fr" at the bottom left of the FeatView page, you'll be taken automatically to http://www.phylogeny.fr, where you'll get a realtime-generated phylogenetic tree based on the sequence data you provided in FeatView, with no effort on your part (it's truly a one-click operation). Copy and paste DNA sequences into FeatView, click one button, and 30 seconds later a tree shows up on your screen, looking (perhaps) something like this:


The reason I made this tree is that I wasn't satisfied with my knowledge of the relatedness of certain weird microorganisms I've recently run into. Namely:
  • Ralstonia (which I mentioned yesterday), WEIRD BECAUSE: It turns hydrogen gas and CO2 into plastic.
  • Bordetella, a bronchial infection agent, WEIRD BECAUSE: It turns out to be very similar, genetically, to Ralstonia.
  • Burkholderia, a soil organism (and human and animal pathogen), WEIRD BECAUSE: It has an unexpectedly large amount of genetic similarity to Ralstonia and Polynucleobacter.
  • Polynucleobacter, a ditch-water bacterium, WEIRD BECAUSE: It can live as an intracellular parasite of freshwater ciliates or it can live independently in soil (making it potentially a great study organism for determining the genetic bases of intracellular symbiosis).
  • Thiomicrospira, a very tiny CO2- and sulfur-loving organism, WEIRD BECAUSE: It can only be found near deep-sea thermal vents (see my previous writeup).
  • Polaromonas, a relatively newly discovered and still poorly understood bacterium, WEIRD BECAUSE: It is abundant in glacier ice on multiple continents. Plus it has an amazing (and totally unexpected) amount of genetic overlap with our good friend Bordetella, the whooping-cough bug.
If you're not familiar with how bacterial classification works, let's just say it's a mess. There's a long historical tradition of classifying microorganisms based on a hodgepodge of ad hoc methods involving everything from physical appearance under the microscope (especially after staining with crystal violet), to the habitat of the organism, to its ability to metabolize various substances, its ability to make spores, adaptation to oxygen or lack of oxygen, serological characteristics, etc. It's always been an error-prone system, resulting in many misclassifications and later corrections, owing to its inconsistency and basic irrationality, to put it bluntly. With the advent of molecular genetic techniques, it's now possible to create accurate phylogenies based on little more than DNA sequence differences, usually involving the 16S ribosomal RNA (more here).

Freshwater ciliates (like this Euplotes) are home for Polynucleobacter endosymbionts.

As big an advance as ribosome-based phylogeny is, it's pretty far from ideal (IMHO), mainly because it ignores phenotypes. In fact it's pretty far removed from anything at all having to do with an organism's ecology, metabolism, mode of living, etc. What are we really measuring when we measure relatedness according to a 16S ribosomal yardstick? Just the rate of random mutation accumulation in a pretty uninteresting cell artifact. I'd rather have a yardstick that's tied to phenotypic reality than to a slow-to-change, "highly conserved" piece of cold dead scaffolding.

So to create my own "family tree" of two dozen or so microbes, I said to hell with 16S ribosomes and decided to use, as my yardstick, genetic variation in the GroEL gene, which codes for the 60-kiloDalton heat-shock protein. I chose this protein (or rather, the gene for it) as my phylo-yardstick for a number of reasons. First, the DNA sequence is sizable, at about 1643 nucleotides (making it somewhat bigger than the 16S rDNA). It's important to have a large yardstick gene when looking for faint genetic signals. Secondly, this protein is essentially universal in prokaryotes. It's ubiquitous but not necessarily highly conserved, in the same sense that rRNA is highly conserved. ("Highly conserved" is not what you want. Think about it. Taken to the extreme, a "highly conserved" sequence is invariant. It never changes. And is therefore useless for phylogenetics.) Thirdly, the GroEL heat-shock protein has multiple intracellular touchpoints: It's known to interact with GroES, ALDH2, and dihydrofolate reductase, and it's involved in signal transduction (it's induced not just by heat but by hydrogen peroxide). Not to overlook the obvious, but it is also a touchpoint protein for any enzyme that can be repaired by the 60kDa heat shock protein. That's probably dozens if not hundreds of enzymes. Why is that important? Think about it: A protein that is sensitive to the 3D conformational requirements of other proteins has to evolve in response to the needs of all the proteins it services. A thermophile (Thiomicrospira) is going to need a different heat-shock repair system than a psychrophile (Polaromonas). A salt-lover needs a different one than a freshwater-lover. GroEL has to reflect, in its own structure, the many shifting requirements of the host proteome. These considerations make GroEL a highly appropriate basis gene for phylogenetic analysis.
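The kind of signal a tree-builder works from can be illustrated with a simple pairwise distance between aligned sequences. The sketch below computes a p-distance (the fraction of compared sites that differ). It is not what phylogeny.fr does internally, the function name is mine, and the short sequences are made-up stand-ins rather than real GroEL data.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip positions where either sequence has an alignment gap
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differing / compared

# Toy aligned sequences (hypothetical, for illustration only).
toy = {"org1": "ATGGCAGCTAAA", "org2": "ATGGCTGCTAAG", "org3": "ATGTCTGC-AAG"}

names = sorted(toy)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, round(p_distance(toy[first], toy[second]), 3))
```

A perfectly conserved gene would give a distance of zero for every pair, which is the sense in which an invariant sequence carries no phylogenetic information.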

And frankly, I think the GroEL-based phylo-tree phylogeny.fr spit out for me (see illustration further above) speaks for itself. It's a remarkably informative (and accurate) tree. GroEL evolutionary differences not only accurately grouped endosymbionts together, soil organisms together, aquatic organisms, etc., it also correctly grouped the "enteric-alike" Erwinia with E. coli and Shigella, and it cannily put Polaromonas with soil organisms (rather than aquatics), which I think is correct, based on recent Polaromonas isolates being found in soil rather than snow. Likewise, it's good to see Bdellovibrio (a freshwater bug) clustered with Polynucleobacter (which is symbiotic with a ciliate protozoan), with Thiomicrospira (the saltwater hydro-vent organism) a very nearby out-node.

If you get an infection while in a hospital, pray it's not Clostridium difficile, which is often deadly.

A harder call to make is Clostridium difficile, which is present in 1% to 5% of non-ill people's intestines. Is it an enteric (a la E. coli)? Definitely not. The Clostridia (botulism, tetanus, etc.) are spore-forming soil bacteria. Their placement in the tree not far from the soil-dwelling spore-former, Bacillus thuringiensis, is thus eminently correct. Bacillus is a proximal out-node relative to Clostridium, which is understandable in that Bacillus is aerobic whereas Clostridia are strict anaerobes.

Buchnera (an aphid symbiont) comes at an odd location, much further away from the insect-dwelling Wolbachia than I would have predicted, but then again Buchnera's host feeds on cold sap where Wolbachia's hosts typically feed on warm blood. All the organisms around Wolbachia in the tree are hemophiles.

Our good friend Bordetella (of pertussis fame) is placed firmly in the soil group. I think that's real and significant. When you start to look at Bordetella's high DNA sequence similarity with Ralstonia and Burkholderia, it would be surprising, actually, if it fell anywhere else in the tree.

Honestly, when I took Bacterial Ecology 201 in college, many years ago, it was under duress and I hated the experience. But now, decades later, I'm starting to like it. With tools like those available for free at http://genomevolution.org and http://www.phylogeny.fr, what's not to like?

Saturday, May 04, 2013

A Tale of Two Microbes

One area where Big Data has started to pay big dividends is in genome research, and you can begin to taste the payoff yourself, right now, if you want to come along as I show you how to mine genetic data from public databases in the service of a little desktop microbial genetics. You'll be amazed at what you can do.

No one knows why, but when Ralstonia eutropha eats too much, it produces plastic granules instead of, say, starch or fat. Go figure.

For today's experiment, we're going to compare the genomes of two bacteria, one of which you know very well, the other of which you don't, unless you've got way too much time on your hands. The germ you already know is Bordetella, the whooping cough bug. The bug you haven't heard of is Ralstonia eutropha, a soil organism that has the amazing ability to subsist only on hydrogen gas, nitrate, and carbon dioxide. In return, it produces wicked-crazy quantities of plastic (yes, plastic—it stores carbon as polyhydroxybutyrate), and because it's potentially useful to industry, Ralstonia's DNA, like Bordetella's, has been fully sequenced.

If you go right now to http://genomevolution.org/r/8o1x, you'll see that I've set up a little experiment for you. You shouldn't have to press the pink "Generate SynMap" button on that page. It should run automatically (but if you don't see an image like the one below, hit the button).

Every dot in this dot-plot represents a match between a gene in Bordetella bronchiseptica and a gene in Ralstonia eutropha. See text for discussion.

What has happened is that the SynMap server has been instructed to go find the complete DNA sequence of Ralstonia eutropha Strain H16 as well as the complete DNA sequence for Bordetella bronchiseptica Strain RB50, and run a comparison of one against the other. It so happens Bordetella has a single chromosome with 5,339,179 base pairs, whereas our hydrogen-loving, plastic-storing friend Ralstonia has 3 chromosomes totalling 7,416,678 base pairs. (It has one main chromosome, and two small auxiliary chromosomes called plasmids.)

Every point on the above graph represents a match between a gene in Bordetella and a gene in Ralstonia. The X-axis represents locations on the Bordetella genome (starting from one end and going to the other). The Y-axis plots locations on the Ralstonia genome. All we're doing is mapping one genome to another and tallying the significant matches.

This is a massive number of matches (well over 10,000), just to let you know. Usually, when you compare organisms, you don't see this many dots. I chose Bordetella and Ralstonia because I knew there'd be a lot of hits, based on my own prior experiments. And by the way, I don't think most microbiologists are aware (yet) that Bordetella and Ralstonia are extremely closely related. This is new information I'm sharing with you.

It's one thing to get a bunch of points on a dot-plot, but how do we really know these two organisms are related? This is where synteny comes in. Synteny is the degree to which two chromosomes share blocks of order. The key intuition is that merely sharing genes isn't enough; what counts is whether matching genes are in the same arrangements. If genome A has genes X, Y, and Z, in that order, and genome B also has genes X, Y, and Z (in the same order), we say that A and B share a syntenous triplet. The genomes have a degree of synteny.

The SynMap tool is very powerful because it lets you find syntenous regions in DNA, and it's tunable. If you go to the Analysis Options tab on the SynMap page, you'll see that you can set two parameters called Maximum Distance Between Two Matches, and Minimum Number of Aligned Pairs. The URL that I sent you to (for our experiment) has values of 50 and 2, respectively, already dialed in. That means the graph is plotting every occurrence of 2 gene-pair matches that occurred between genes no more than 50 genes apart. That's a pretty liberal setting. If two organisms are related, you can expect to see a lot of matches.
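To show how two settings like these interact, here is a deliberately simplified sketch of chaining gene matches into ordered blocks. It is not SynMap's actual algorithm: the function and parameter names are mine (loosely mirroring the two settings), the gene indices are invented, and it only follows blocks that run in the same direction on both genomes.

```python
def chain_matches(matches, max_dist, min_pairs):
    """Group gene-pair matches into collinear chains.

    matches: list of (index_in_A, index_in_B) gene positions.
    A match extends a chain if it lies further along BOTH genomes than the
    chain's last match, by no more than max_dist genes in each genome.
    Chains with fewer than min_pairs matches are discarded.
    """
    chains = []
    for a, b in sorted(matches):
        for chain in chains:
            last_a, last_b = chain[-1]
            if 0 < a - last_a <= max_dist and 0 < b - last_b <= max_dist:
                chain.append((a, b))
                break
        else:
            chains.append([(a, b)])  # no existing chain fits, so start a new one
    return [chain for chain in chains if len(chain) >= min_pairs]

# Hypothetical matches: two ordered runs plus two isolated hits.
matches = [(1, 10), (2, 11), (3, 12), (40, 5), (41, 90), (100, 50), (101, 51)]

print(chain_matches(matches, max_dist=50, min_pairs=2))
print(chain_matches(matches, max_dist=50, min_pairs=3))
```

Raising the minimum number of pairs throws away the short run and the isolated hits, which is the same trade-off you make when tightening the settings on the Analysis Options tab.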

But what I propose you try (if you want) is setting "Maximum Distance Between Two Matches" to 500 and "Minimum Number of Aligned Pairs" to 250. (Then click the Generate SynMap button to refresh the graph.) This is a much more stringent requirement: It tells SynMap to try to find 250 matched genes within any given 500-gene region, do it for all regions of both genomes, and plot the results, if any. A 250-gene chunk is a pretty large syntenous region for a creature that has only 10,000-or-so genes to begin with.

The result of our hunt for super-large 250-gene syntenous regions is shown in the first graph below. The red dots represent the regions. They run from the top of the Y-axis to the lower right corner. Remember that the axes map directly to positions on the genome. What the diagonal line says is that there's a near-linear mapping of syntenous regions from one genome to the other.

The second graph below shows what happens when we re-tune our DNA-matching parameters to find blocks of 200 ordered genes within each 500-gene domain. We're looking for shorter runs of genes (200 instead of 250), which should be more plentiful. And they are. This time our graph looks like an 'X'. Why? Bacterial chromosomes do a lot of rearranging, and one of the most common events is a symmetric inversion around the origin of replication (and/or the terminus of replication). If you get enough of these inversions of various sizes, you end up with pieces of DNA that used to be near the start of the chromosome ending up near the end, and vice versa. (Repeat for all intermediate locations as well.) If you want to know more about how and why this ends up making an X-pattern on a dot-plot, be sure to read the classic paper by Eisen et al. called "Evidence for symmetric chromosomal inversions around the replication origin in bacteria," Genome Biology 2000, 1(6):research0011.1–0011.9 (unlocked PDF here).
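The X-pattern can be reproduced with a toy simulation. The sketch below is my own simplification, not the analysis in Eisen et al.: it linearizes the chromosome with the origin at the midpoint of a list, applies inversions that are exactly symmetric about that point, and uses made-up sizes throughout.

```python
import random

def symmetric_inversion(genome, half_width):
    """Reverse the segment centered on the origin (placed at the list midpoint)."""
    center = len(genome) // 2
    lo, hi = center - half_width, center + half_width + 1
    return genome[:lo] + genome[lo:hi][::-1] + genome[hi:]

def simulate(n_genes=1001, n_inversions=25, seed=1):
    """Return dot-plot points: (ancestral position, position after inversions)."""
    rng = random.Random(seed)
    derived = list(range(n_genes))  # each gene is labeled by its ancestral position
    for _ in range(n_inversions):
        derived = symmetric_inversion(derived, rng.randint(1, n_genes // 2))
    return [(gene, pos) for pos, gene in enumerate(derived)]

dots = simulate()
center = 1001 // 2
on_diagonal = sum(1 for x, y in dots if y == x)
on_antidiagonal = sum(1 for x, y in dots if y == 2 * center - x)
print(on_diagonal, "dots on the diagonal,", on_antidiagonal, "on the anti-diagonal")
```

Because each inversion mirrors a gene's position about the origin, every gene ends up either where it started or at its mirror image, so the dots fall on the two arms of an X. Real inversions are only approximately symmetric, which is why real dot-plots look fuzzier.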

Graph 1: Genomes compared with synteny-block size 250.
Graph 2: Synteny block size 200.
Graph 3: Block size 175.
Graph 4: Block size 120, max domain size 180 genes.
Graph 5: Block size 90, max domain 130.
Graph 6: Block size 2, max domain size 50.

The third and fourth graphs in this series show what happens when we tune our match for smaller block sizes. In the third graph, we've set "Maximum Distance Between Two Matches" to 500 and "Minimum Number of Aligned Pairs" to 175, which produces what looks like two really poorly drawn X's superimposed on each other. As we get more permissive with our synteny matches, we start to see the results of more inversion events. It makes sense that shorter synteny blocks will be swept up in more successful inversions, because an inversion that cuts across a large synteny block is probably fatal in many cases. (Some large groups of genes need to be kept together, for proper gene regulation. If an inversion event cuts through a critical regulon at the wrong spot, the cell might not go on to reproduce.)

As we keep tuning the "Minimum Number of Aligned Pairs" downward, the graphs become more cluttered as we see the results of many thousands of inversion events in the history of the chromosomes.

The fourth graph uses values of 180 and 120 for Max Distance and Minimum Number of Aligned Pairs, then in graph five we have values of 130 and 90. And finally, in the last graph, we have 50 and 2. The final graph is mostly noise. But buried in the noise are many faint signals that can be seen by twiddling the knobs on the synteny settings.

I hope this bit of desktop genomics has convinced you that the field has reached an exciting stage. (I've only scratched the surface, here, of what the tools at http://genomevolution.org can do.) I also hope I've convinced any microbial geneticists who might be reading this that Bordetella and Ralstonia are very closely related. (Which should come as news. I don't think it's been reported.) You wouldn't think a hydrogen-loving soil organism would have much in common with a throat-dwelling pathogen, but as I like to say: DNA doesn't lie!