Saturday, November 5, 2011

The China Study II: How gender takes us to the elusive and deadly factor X

The graph below shows the mortality in the 35-69 and 70-79 age ranges for men and women for the China Study II dataset. I discussed other results in my two previous posts () (), all taking us to this post. The full data for the China Study II study is publicly available (). The mortality numbers are actually averages of male and female deaths by 1,000 people in each of several counties, in each of the two age ranges.


Men do tend to die earlier than women, but the difference above is too large.

Generally speaking, when you look at a set time period that is long enough for a good number of deaths (not to be confused with “a number of good deaths”) to be observed, you tend to see around 5-10 percent more deaths among men than among women. This is when other variables are controlled for, or when men and women do not adopt dramatically different diets and lifestyles. One of many examples is a study in Finland (); you have to go beyond the abstract on this one.

As you can see from the graph above, in the China Study II dataset this difference in deaths is around 50 percent!

This huge difference could be caused by there being significantly more men than women per county included the dataset. But if you take a careful look at the description of the data collection methods employed (), this does not seem to be the case. In fact, the methodology descriptions suggest that the researchers tried to have approximately the same number of women and men studied in each county. The numbers reported also support this assumption.

As I said before, this is a well executed research project, for which Dr. Campbell and his collaborators should be commended. I may not agree with all of their conclusions, but this does not detract even a bit from the quality of the data they have compiled and made available to us all.

So there must be another factor X causing this enormous difference in mortality (and thus longevity) among men and women in the China Study II dataset.

What could be this factor X?

This situation helps me illustrate a point that I have made here before, mostly in the comments under other posts. Sometimes a variable, and its effects on other variables, are mostly a reflection of another unmeasured variable. Gender is a variable that is often involved in this type of situation. Frequently men and women do things very differently in a given population due to cultural reasons (as opposed to biological reasons), and those things can have a major effect on their health.

So, the search for our factor X is essentially a search for a health-relevant variable that is reflected by gender but that is not strictly due to the biological aspects that make men and women different (these can explain only a 5-10 percent difference in mortality). That is, we are looking for a variable that shows a lot of variation between men and women, that is behavioral, and that has a clear impact on health. Moreover, as it should be clear from my last post, we are looking for a variable that is unrelated to wheat flour and animal protein consumption.

As it turns out, the best candidate for the factor X is smoking, particularly cigarette smoking.

The second best candidate for factor X is alcohol abuse. Alcohol abuse can be just as bad for one’s health as smoking is, if not worse, but it may not be as good a candidate for factor X because the difference in prevalence between men and women does not appear to be just as large in China (). But it is still large enough for us to consider it a close second as a candidate for factor X, or a component of a more complex factor X – a composite of smoking, alcohol abuse and a few other coexisting factors that may be reflected by gender.

I have had some discussions about this with a few colleagues and doctoral students who are Chinese (thanks William and Wei), and they mentioned stress to me, based on anecdotal evidence. Moreover, they pointed out that stressful lifestyles, smoking, and alcohol abuse tend to happen together - with a much higher prevalence among men than women.

What an anti-climax for this series of posts eh?

With all the talk on the Internetz about safe and unsafe starches, animal protein, wheat bellies, and whatnot! C’mon Ned, give me a break! What about insulin!? What about leucine deficiency … or iron overload!? What about choline!? What about something truly mysterious, related to an obscure or emerging biochemistry topic; a hormone du jour like leptin perhaps? Whatever, something cool!

Smoking and alcohol abuse!? These are way too obvious. This is NOT cool at all!

Well, reality is often less mysterious than we want to believe it is.

Let me focus on smoking from here on, since it is the top candidate for factor X, although much of the following applies to alcohol abuse and a combination of the two as well.

One gets different statistics on cigarette smoking in China depending on the time period studied, but one thing seems to be a common denominator in these statistics. Men tend to smoke in much, much higher numbers than women in China. And this is not a recent phenomenon.

For example, a study conducted in 1996 () states that “smoking continues to be prevalent among more men (63%) than women (3.8%)”, and notes that these results are very similar to those in 1984, around the time when the China Study II data was collected.

A 1995 study () reports similar percentages: “A total of 2279 males (67%) but only 72 females (2%) smoke”. Another study () notes that in 1976 “56% of the men and 12% of the women were ever-smokers”, which together with other results suggest that the gap increased significantly in the 1980s, with many more men than women smoking. And, most importantly, smoking industrial cigarettes.

So we are possibly talking about a gigantic difference here; the prevalence of industrial cigarette smoking among men may have been over 30 times the prevalence among women in the China Study II dataset.

Given the above, it is reasonable to conclude that the variable “SexM1F2” reflects very strongly the variable “Smoking”, related to industrial cigarette smoking, and in an inverse way. I did something that, grossly speaking, made the mysterious factor X explicit in the WarpPLS model discussed in my previous post. I replaced the variable “SexM1F2” in the model with the variable “Smoking” by using a reverse scale (i.e., 1 and 2, but reversing the codes used for “SexM1F2”). The results of the new WarpPLS analysis are shown on the graph below. This is of course far from ideal, but gives a better picture to readers of what is going on than sticking with the variable “SexM1F2”.


With this revised model, the associations of smoking with mortality in the 35-69 and 70-79 age ranges are a lot stronger than those of animal protein and wheat flour consumption. The R-squared coefficients for mortality in both ranges are higher than 20 percent, which is a sign that this model has decent explanatory power. Animal protein and wheat flour consumption are still significantly associated with mortality, even after we control for smoking; animal protein seems protective and wheat flour detrimental. And smoking’s association with the amount of animal protein and wheat flour consumed is practically zero.

Replacing “SexM1F2” with “Smoking” would be particularly far from ideal if we were analyzing this data at the individual level. It could lead to some outlier-induced errors; for example, due to the possible existence of a minority of female chain smokers. But this variable replacement is not as harmful when we look at county-level data, as we are doing here.

In fact, this is as good and parsimonious model of mortality based on the China Study II data as I’ve ever seen based on county level data.

Now, here is an interesting thing. Does the original China Study II analysis of univariate correlations show smoking as a major problem in terms of mortality? Not really.

The table below, from the China Study II report (), shows ALL of the statistically significant (P<0.05) univariate correlations with mortality in 70-79 age range. I highlighted the only measure that is directly related to smoking; that is “dSMOKAGEm”, listed as “questionnaire AGE MALE SMOKERS STARTED SMOKING (years)”.


The high positive correlation with “dSMOKAGEm” does not even make a lot of sense, as one would expect a negative correlation here – i.e., the earlier in life folks start smoking, the higher should be the mortality. But this reverse-signed correlation may be due to smokers who get an early start dying in disproportionally high numbers before they reach age 70, and thus being captured by another age range mortality variable. The fact that other smoking-related variables are not showing up on the table above is likely due to distortions caused by inter-correlations, as well as measurement problems like the one just mentioned.

As one looks at these univariate correlations, most of them make sense, although several can be and probably are distorted by correlations with other variables, even unmeasured variables. And some unmeasured variables may turn out to be critical. Remember what I said in my previous post – the variable “SexM1F2” was introduced by me; it was not in the original dataset. “Smoking” is this variable, but reversed, to account for the fact that men are heavy smokers and women are not.

Univariate correlations are calculated without adjustments or control. To correct this problem one can adjust a variable based on other variables; as in “adjusting for age”. This is not such a good technique, in my opinion; it tends to be time-consuming to implement, and prone to errors. One can alternatively control for the effects of other variables; a better technique, employed in multivariate statistical analyses. This latter technique is the one employed in WarpPLS analyses ().

Why don’t more smoking-related variables show up on the univariate correlations table above? The reason is that the table summarizes associations calculated based on data for both sexes. Since the women in the dataset smoked very little, including them in the analysis together with men lowers the strength of smoking-related associations, which would probably be much stronger if only men were included. It lowers the strength of the associations to the point that their P values become higher than 0.05, leading to their exclusion from tables like the one above. This is where the aggregation process that may lead to ecological fallacy shows its ugly head.

No one can blame Dr. Campbell for not issuing warnings about smoking, even as they came mixed with warnings about animal food consumption (). The former warnings, about smoking, make a lot of sense based on the results of the analyses in this and the last two posts.

The latter warnings, about animal food consumption, seem increasingly ill-advised. Animal food consumption may actually be protective in regards to the factor X, as it seems to be protective in terms of wheat flour consumption ().