
26 May 2011

The most well-read cities in the United States

From Berkeleyside: Berkeley is the third most well-read city in the US, according to amazon.com data.

This is among cities with population 100,000 or greater. Number 1 is Cambridge, Massachusetts (105K people); number 2 is Alexandria, Virginia (140K); number 3 is Berkeley, California (112K); number 4 is Ann Arbor, Michigan (114K); number 5 is Boulder, Colorado (100K). There are 275 cities of population greater than 100,000 in the US; Alexandria, the most populous of these five, is ranked 177.

My first thought upon seeing this is that these are all small cities, and of course you expect to see more extreme results in small cities than in large cities. Small cities are perhaps more likely to be homogeneous. (This seems especially likely to be true for small cities that are part of larger metropolitan areas.) Actually, my quick analysis of the top five doesn't hold up for the top twenty; the average rank of the top twenty cities listed at amazon is 127.1, which is LOWER than (although not significantly different from) the 138.5 you'd expect if being on this top-twenty list were independent of size. But it's certainly possible that, say, some 100,000-person section of the city of San Francisco actually has higher amazon.com sales than Berkeley. (There are surprisingly many bookstores in the Mission.)
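As a rough check on that comparison, here's a back-of-the-envelope calculation of my own, assuming the twenty listed cities were a uniform random draw from ranks 1 through 275:

    # Mean and standard error of the average population rank of 20 cities
    # drawn at random, without replacement, from ranks 1..275.
    N, n = 275, 20

    mean_rank = (N + 1) / 2                    # expected rank of a single random city (138)
    var_rank = (N**2 - 1) / 12                 # variance of a single uniform rank
    fpc = (N - n) / (N - 1)                    # finite-population correction
    se_mean = (var_rank * fpc / n) ** 0.5      # standard error of the sample mean, about 17.1

    observed = 127.1
    print((observed - mean_rank) / se_mean)    # about -0.6 standard errors: not significant
    # (The 138.5 quoted above presumably comes from a slightly different city count.)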

Also, people in college towns tend to read a lot -- that's no surprise (although one does hear that students don't read much these days). Four of the top five (all but Alexandria) are college towns; also in the top 20 are Gainesville (Florida), Knoxville (Tennessee), and Columbia (South Carolina). And in case you're wondering, Alexandria is not named after the city in Egypt with the Great Library.

02 April 2011

A street-fighting approach to the variance of a hypergeometric random variable

So you all[1] know that if I have a biased coin with probability p of coming up heads, and I flip it n times, then the expected number of heads is np and the variance is npq. That's the binomial distribution. Alternatively, if I have an urn containing pN white balls and qN black balls, with p + q = 1, and I draw n balls with replacement, then the distribution of the number of white balls has that mean and variance.

Some of you know that if I sample without replacement from that same urn -- that is, if I take balls out and don't put them back -- then the expected number of white balls is np and the variance is npq(N-n)/(N-1). The distribution of the number of white balls is the hypergeometric distribution.

So it makes sense, I think, to think of (N-n)/(N-1) as a "correction factor" for going from sampling with replacement to sampling without replacement. This is the approach taken in Freedman, Pisani, and Purves, for example, which is the book I'm teaching intro stats from this semester.

How do you prove this? On this, FPP are silent. The proof I know -- see, for example, Pitman -- is as follows. Write the number of white balls, when sampling without replacement, as

S_n = I_1 + ... + I_n

where I_k is 1 if the kth draw gives a white ball and 0 otherwise. Then E(I_k) is just the probability of getting a white ball on the kth draw, and so it's equal to p by symmetry. By linearity of expectation E(S_n) = np. To get the variance, it's enough to get E(S_n^2). And by expanding out that sum of indicators there, you get

S_n^2 = (I_1^2 + ... + I_n^2) + (I_1 I_2 + I_1 I_3 + ... + I_{n-1} I_n).

There are n terms inside the first set of parentheses, and n(n-1) inside the second set, which includes every product I_j I_k where j and k aren't equal. By linearity of expectation and symmetry,

E(S_n^2) = nE(I_1) + n(n-1)E(I_1 I_2).

The first term, we already know, is np. The second term is n(n-1) times the probability that both the first and second draws yield white balls. The first draw yields a white ball with probability p. For the second draw there are N-1 balls left, of which pN-1 are white, so that draw yields a white ball with probability (pN-1)/(N-1). The probability is the product of these. Do the algebra, let the dust settle, and you get the formula I claimed.
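Spelling out that last bit of algebra (my own expansion of the step being waved at):

\mathrm{Var}(S_n) = E(S_n^2) - (np)^2 = np + n(n-1)p{pN-1 \over N-1} - n^2p^2
= {np\left[(N-1) + (n-1)(pN-1) - np(N-1)\right] \over N-1} = {np(N-n)(1-p) \over N-1} = npq \, {N-n \over N-1}.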

But this doesn't explain things in terms of the correction factor. It doesn't refer back to the binomial distribution at all! But in the limit where your sample is small compared to your population, sampling without replacement and sampling with replacement are the same! So can we use this somehow? Let's try to guess the correction factor without writing down any random variables. We'll write

Variance without replacement = f(N,n) npq

where n is the sample size and N is the population size, and think about what we know about f(N,n).

First, f(N,1) = 1. If you have a sample of size 1, sampling with and without replacement are actually the same thing.

Second, f(N,N) = 0. If your sample is the entire population, you always get the same result.

But most important is that if we sample without replacement, and take samples of size n or of size N-n, we should get the same variance! Taking a sample of size N-n is the same as taking a sample of size n and deciding to take all the other balls instead. So for each sample of size n with w white balls, there's a corresponding sample of size N-n with pN-w white balls. The distributions of numbers of white balls are mirror images of each other, so they have the same variance. So you get

nf(N,n)pq = (N-n)f(N, N-n)pq.

Of course the pq factors cancel. For ease of notation, let g(x) = f(N,x). Then we need to find some function g such that g(1) = 1, g(N) = 0, and ng(n) = (N-n)g(N-n). Letting n = 1 you get g(1) = (N-1)g(N-1), so g(N-1) = 1/(N-1). The three values of g that we have so far are consistent with the guess that g is linear. So let's assume it is -- why should it be anything more complicated? The linear function through those three points is g(x) = (N-x)/(N-1), and that gives you the formula. This strikes me as the Street-Fighting Mathematics approach to this problem.
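And if you don't trust the guess, a quick simulation (a sketch of my own, using numpy's hypergeometric sampler) agrees with it:

    import numpy as np

    # Check Var = n p q (N-n)/(N-1) for sampling without replacement.
    # The parameters are arbitrary choices for illustration.
    N, n, p = 50, 12, 0.3
    q = 1 - p
    white, black = int(p * N), N - int(p * N)

    draws = np.random.hypergeometric(white, black, n, size=200_000)  # white balls per sample
    print(draws.var())                      # empirical variance
    print(n * p * q * (N - n) / (N - 1))    # the claimed formula, about 1.95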

Question: Is there a way to rigorize this "guess" -- some functional equation I'm not seeing, for example?

[1] I use "all" in the mathematician's sense. This means I wish you knew this, or I think you should know it. Some of you probably don't. That's okay.

06 February 2011

Correlation in betting on the NFL.

Nate Silver points out that just because the spread in today's Super Bowl is small (the Packers are something like a three-point favorite) doesn't mean that the game will necessarily be close. It just means that it's almost equally likely to be a blowout in one team's favor as in the other's.

Not surprisingly, though, the regression line for margin of victory, as predicted from point spread, is very close to having slope 1 and passing through the origin. As it should, because otherwise bettors would be able to take advantage of it! Say that 7-point favorites won, on average, by 9 points. Assume that the distribution of actual margin of victory, conditioned on point spread, is symmetrical; then half of 7-point favorites would win by 9 points or more, so more than half would win by 7 points or more, and one could make money by betting on them. On the other hand, say that 7-point favorites won, on average, by 5 points; then you could make money by betting against them.
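To put numbers on the exploitability argument, here's a toy calculation of my own, assuming purely for illustration that actual margins are roughly normal with a standard deviation of about 13 points:

    from scipy.stats import norm

    # If 7-point favorites really won by 9 on average, how often would they cover?
    spread, mean_margin, sd = 7.0, 9.0, 13.0   # sd is a rough guess, for illustration only

    p_cover = norm.sf(spread, loc=mean_margin, scale=sd)
    print(p_cover)   # about 0.56 -- enough of an edge that bettors would pile in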

(For what it's worth, I don't have a particular interest in this game. In fact I probably won't even watch it. I have no connection to either Pittsburgh or Green Bay, and as longtime readers know, I'm a baseball fan.)

09 March 2010

People round their incomes to the nearest $5,000?

Here's something interesting: lots of people, when asked by the US Census Bureau "how much money do you make?", round to the nearest five thousand dollars.

See the data tables from the 2006 census. These give the number of people whose personal income is in each interval of the form [2500N, 2500N+2499], for integer N.

One sees, for instance, that the number of people making between $27,500 and $29,999 (which is near the mode of the distribution) is less than both those making $25,000 to $27,499 and those making $30,000 to $32,499. Something similar occurs at all income levels -- the number of people making between 2500N and 2500(N+1)-1 dollars is smaller if N is odd (and thus this interval doesn't contain a multiple of 5000) than if N is even (and so it does).

Surprisingly, the effect occurs even at very low levels of earnings. If you make $87,714 in a year I can see rounding to $90,000 -- but is the person who makes $7,714 in a year really rounding to $10,000?
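If you want to redo the check yourself from the linked tables, here's roughly what it looks like in code (a sketch; the bin counts are placeholders to be filled in from the Census data, not actual values):

    # Flag $2,500-wide bins that don't contain a multiple of $5,000 (odd N) but
    # are smaller than both neighbors (which do contain one).
    # counts[N] = number of people with income in [2500*N, 2500*N + 2499],
    # copied from the Census tables.  Placeholder only:
    counts = {}  # e.g. {10: ..., 11: ..., 12: ...}

    for N in sorted(counts):
        if N % 2 == 1 and (N - 1) in counts and (N + 1) in counts:
            if counts[N] < counts[N - 1] and counts[N] < counts[N + 1]:
                print(f"dip at ${2500*N:,}-${2500*N + 2499:,}")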

(I found this while trying to answer a question at Metafilter: How many people in the United States make more than $10,000,000 per year? I seem to recall reading somewhere that personal income roughly follows a power law in the tails, but can't actually find a reference for this.)

There also seems to be a preference for multiples of $10,000 over multiples of $5,000 that are not multiples of $10,000. But I have work to do, so I'm not going to do the statistics.

08 December 2009

Distribution of Putnam scores

The distributions of Putnam exam scores are interesting. See, for example, the 2001 distribution. It takes a bit of number-crunching to get an actual distribution of scores from the data; they report the "rank" of the people getting each score. The rank corresponding to a given score is, I assume, A+(B+1)/2 where A is the number of people scoring higher than that score and B is the number of people scoring that particular score. For example, in 2001 -- which happens to be one of the years in which I took the Putnam -- the table begins




Score    101   100    86    80    79    77    73    72    71    70    69    68
Rank       1     2     3   4.5     6   7.5     9    11    14  16.5    19  23.5
Number     1     1     1     2     1     2     1     3     3     2     3     6

where the first two rows are provided by the organizers, and the third row can be worked out by working from left to right. For example, once we know 17 people got 70 or better, the fact that the score 69 corresponds to rank 19 means that the people scoring 69 must have been the 18th, 19th, and 20th-best; so there were three of them. (Incidentally, most increasing sequences of half-integers, when interpreted as sequences of ranks, don't appear to correspond to legitimate score distributions; the number of people getting certain scores ends up negative if you're not careful.)
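Here is the left-to-right calculation in code form (my own sketch; it just inverts the rank formula A + (B+1)/2 from above, using the 2001 values in the table):

    # Recover the number of people at each score from the reported ranks.
    # rank = A + (B+1)/2, so B = 2*(rank - A) - 1, where A is the running
    # count of people with strictly higher scores.
    scores = [101, 100, 86, 80, 79, 77, 73, 72, 71, 70, 69, 68]
    ranks  = [1, 2, 3, 4.5, 6, 7.5, 9, 11, 14, 16.5, 19, 23.5]

    higher = 0   # A: people scoring strictly better so far
    for score, rank in zip(scores, ranks):
        b = 2 * (rank - higher) - 1
        if b != int(b) or b < 0:
            raise ValueError(f"rank sequence impossible at score {score}")
        print(score, int(b))
        higher += int(b)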

Anyway, if you crunch the numbers on a typical Putnam score distribution you observe two things:
- the scores follow, roughly, a power law; the number of people scoring 10n decays like some power of n, for integer n.
- once you remove this decay (which I haven't actually done; I've just eyeballed it), there are "spikes" at multiples of 10. For example, the number of people scoring 18, 19, 20, 21, 22, 23 in 2001 were 8, 23, 99, 60, 39, 11. Twenty-four people scored 50; seven scored each of 49 and 51.

I can't explain the first one (and it may just be an artifact of the way I'm doing the plotting; lots of things look close to linear when plotted on a logarithmic scale). But the second one is actually easy to explain; Putnam problems are worth ten points each, and most scores are 0 or 10 with a smattering of 1, 2, 8, or 9. Scores between 3 and 7 on a problem are exceedingly rare. So to get a score of, say, 55, one has to get five problems right and have made a bit of progress on three to five more, which is less likely than straight-out solving five or six problems (for 50 or 60, respectively).

Incidentally, I haven't looked at the problems from the 2009 Putnam, because I have work to do.

15 July 2009

Batting under .200

Stat of the day (from baseball-reference.com) has a list of players who played an entire season, had enough at bats to qualify for the batting title (I forget the exact requirement, but it basically means they had to play regularly), and batted under .200.

Most of them are from a long time ago. Why? Because .200 is well below average and always has been (which is why the list was worth compiling) and the variance in batting averages has gone down as the standard of play has improved. Stephen Jay Gould wrote about this in Full House: The Spread of Excellence from Plato to Darwin; the argument is roughly that as baseball scouting and training has gotten better, there are not as many bad pitchers in the major leagues as there were in the past, so players can't inflate their batting average that way. (I'm in Ithaca and my copy of the book is in Philadelphia, so I can't check if I'm stating this correctly.)

23 June 2009

The Iranian election

The Devil Is in the Digits, an op-ed by Bernd Beber and Alexandra Scacco in Saturday's Washington Post.

This piece claims that the distribution of insignificant digits in vote totals in the recent Iranian election looks funny, and that there's a good chance this is because the numbers were made up.

I haven't looked at the numbers myself, but this seems like an avenue worth pursuing.

22 January 2009

Tall people are smarter?

Men are more intelligent than women because they're taller, say psychologists Satoshi Kanazawa and Diane J. Reyniers.

I don't feel like picking this one apart. Have fun.

07 January 2009

Dictionaries -- not just for looking up words!

Robert W. Jernigan, A photographic view of cumulative distribution functions, Journal of Statistics Education Volume 16, Number 1 (2008). via reddit. Many dictionaries have little squares on the outside of the page, positioned according to the location of the words on that page in the alphabet. These are an approximation of the cumulative distribution function of the first letter of a randomly selected word. (I say an approximation; this would be exactly true if words with each first letter had equally long definitions on average.)
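If you want to see roughly what that staircase looks like, here's a sketch of my own; it assumes a Unix word list at /usr/share/dict/words and counts words rather than dictionary pages, so it's only the idealized version:

    from collections import Counter
    from string import ascii_lowercase

    # Empirical CDF of the first letter of a word drawn uniformly from a word list --
    # roughly the curve the dictionary's thumb-index squares trace out.
    words = []
    with open("/usr/share/dict/words") as f:
        for line in f:
            w = line.strip().lower()
            if w and w[0].isalpha():
                words.append(w)

    counts = Counter(w[0] for w in words)
    total = sum(counts.values())

    cumulative = 0
    for letter in ascii_lowercase:
        cumulative += counts.get(letter, 0)
        print(letter, round(cumulative / total, 3))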

Jernigan also has a blog, statpics, which is "devoted to images that illustrate statistical ideas"; not surprisingly, he wrote about this.

05 November 2008

The notion of "typical" doesn't behave nicely

Matt Yglesias makes an interesting point. The "typical" American is white, in that more than half of all Americans are white. The "typical" American is Christian, in that more than half of all Americans are Christian. But does this mean that the "typical" American is a white Christian, in that more than half of all Americans are white Christians? Not necessarily; I don't have the numbers.

Moreover, the "typical" white Christian votes Republican. Thus typical people vote Republican, so the Republicans should have won last night. But they didn't.

The point is that most people are "typical" in some ways, but few people are "typical" in all ways. And a party that is based around just people who are "typical" in all ways (note that I'm not saying this describes the Republican party) is doomed to fail, because most people are unusual along some dimension. I don't think this deserves the name of "paradox", but it's just something worth keeping in mind about How Statistics Work.
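A toy version of the arithmetic (with numbers invented purely for illustration): even if each single trait is shared by 70 percent of people, and the traits were independent, very few people would be typical across the board.

    # Fraction of people "typical" in all of k independent traits,
    # each shared by 70% of the population (invented numbers).
    share = 0.70
    for k in (1, 3, 5, 10):
        print(k, round(share**k, 3))   # 0.7, 0.343, 0.168, 0.028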

28 October 2008

People are not balls

The margin of error is only the beginning of political polling: "If one or more of the above statements [about certain red and blue balls] are true, then the formula for margin of error simplifies to Margin of Error = Who the hell knows?"

05 September 2008

McCain's life span?

Here and here, people attempt to answer the question: what are the chances that John McCain will die in the next four or eight years?

A quick look at mortality tables says roughly 15% in four years, 30% in eight years, which are roughly ten times the corresponding figures for Obama -- although it gets more complicated than that pretty quickly. Obama smoked for a while; McCain had cancer. Obama's parents died relatively young, which seems bad for him -- but his father from an automobile accident and his mother from ovarian cancer, which Obama himself is obviously not at risk for. McCain's mother, on the other hand, is still alive at 96. The presidency is a very stressful job -- but it comes with great health care! (I don't actually know what sort of health care the president has, but somehow I don't see doctors turning away the president for inability to pay.) And so on.

And if we're talking about the probability that a president will survive his term, we also have to think about assassination. Four of the 43 US presidents so far have been assassinated; what are the probabilities that either of the nominees would be assassinated? I don't even know how one would begin to assess that.

Finally, as meep points out in the first post linked above, "The central limit theorem doesn't kick in at one person". We don't get to elect a president, branch off a large number of parallel worlds, and see in what proportion of those worlds he survives four or eight years. (Unless you subscribe to the many-worlds interpretation, that is.) We get one shot.

(Well, we Americans get one shot. My statistics have historically shown that about half my readers are reading from outside the US.)

12 August 2008

Variance in Olympic events

It's often claimed that the reason that there are many more men than women in certain academic disciplines (mathematics is one, but that's not the point of this post) is not that men and women have different mean abilities, but rather that the standard deviation of male ability is larger than the standard deviation of female ability. (Of course, it is unwise to espouse these views publicly, for political reasons; that's what got Larry Summers in a lot of trouble.)

It occurs to me, having watched lots of the Olympics in the last few days, that something similar might be true in athletic events. I'm not claiming that men and women are physically identical (I'm not blind), or that their average performance in physical feats is the same. But it may be the case that the difference between the very best men and the very best women in physical feats (say, times in some sort of race, because these are the most easily quantified) is larger than the difference between the average man and the average woman, because there could be more variance among men than women.
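Here's a quick simulation of the point (mine, with invented numbers: identical means, a modest difference in spread):

    import numpy as np

    # Two populations with the same mean ability but different spreads.
    # The averages barely differ; the best-of-a-large-group differ noticeably.
    rng = np.random.default_rng(0)
    n = 1_000_000

    group_a = rng.normal(0.0, 1.0, n)   # smaller spread
    group_b = rng.normal(0.0, 1.1, n)   # 10% larger standard deviation (invented)

    print(group_a.mean(), group_b.mean())   # both essentially 0
    print(group_a.max(), group_b.max())     # roughly 4.8 versus 5.3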

Is there any evidence for this? I'm obviously not a student of this sort of thing (in fact, I don't even know what "this sort of thing" is called, although it's clearly some subfield of biology or medicine).

Oh, and Jordan Ellenberg wrote an explanation of why the new gymnastics scoring system is good. I'm glad he did, because I'd had a feeling it was better than the old system but was having trouble articulating why.

10 July 2008

Why medians are dangerous

Greg Mankiw provides a graph of the salaries of newly minted lawyers, originally from Empirical Legal Studies.

There are two peaks, one centered at about $45,000 and one centered at about $145,000. The peak at the higher salary corresponds to people working for Big Law Firms; the one at the lower salary to people working for nonprofits, the government, etc.

The median is reported at $62,000, just to the right of the first peak, since the first peak contains slightly more people. But one gets the impression that if a few more people were to shift from the left peak to the right peak, the median would jump drastically upwards. We usually hear that it's better to look at the median than the mean when looking at distributions of incomes, house prices, etc. because these distributions are heavily skewed towards the right. But even that starts to break down when the distribution is bimodal.
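A small numerical illustration of how touchy the median is here (my own sketch, with made-up mixture parameters loosely inspired by the graph):

    import numpy as np

    # Bimodal salaries: one peak near $45K, one near $145K.
    # Watch the median jump as a small share of mass moves to the right peak.
    rng = np.random.default_rng(0)
    n = 100_000

    def salaries(frac_big_law):
        big = rng.random(n) < frac_big_law
        return np.where(big, rng.normal(145_000, 15_000, n), rng.normal(45_000, 15_000, n))

    for frac in (0.45, 0.50, 0.55):
        print(frac, int(np.median(salaries(frac))))   # the median moves by tens of thousands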

30 June 2008

A tail bound for the normal distribution

Often one wants to know the probability that a random variable with the standard normal distribution takes value above x for some positive constant x.

(Okay, I'll be honest -- by "one" I mean "me", and the main reason I'm writing this post is to fix this idea in my head so I don't have to go looking for my copy of Durrett's text Probability: Theory and Examples every time I want this result. Durrett gives a much shorter proof -- two lines -- on page 6 of that book, but it involves an unmotivated-seeming change of variables, which is why I have trouble remembering it.)

The probability density function of the standard normal is ${1 \over \sqrt{2\pi}} \exp( -x^2/2)$, and so the probability in question is
f(x) = \int_x^\infty {1 \over \sqrt{2\pi}} \exp (-t^2/2) \, dt
It's a standard fact, but one that I can never remember, that this is bounded above by ${1 \over \sqrt{2\pi} x} \exp(-x^2/2)$ (and furthermore bounded below by $1 - 1/x^2$ times the upper bound, so the upper bound's not a bad estimate).

How to prove this? Well, here's an idea -- approximate the tail of the standard normal distribution's density function by an exponential. Which exponential? The exponential of the linearization of the exponent at t = x. The exponent has negative second derivative, so the new exponent is larger (less negative) than the old one and this is an overestimate. That is,
f(x) < \int_x^\infty {1 \over \sqrt{2\pi}} \exp(-x^2/2-x(t-x)) \, dt
where the new exponent is the linearization of $-t^2/2$ at t = x.

Then pull out factors which don't depend on t to get
{\exp(x^2/2) \over \sqrt{2\pi}}\int_x^\infty  \exp(-xt) \, dt
and doing that last integral gives the desired bound.
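A numerical sanity check of the two bounds (my own, using scipy's exact tail function):

    from math import exp, pi, sqrt
    from scipy.stats import norm

    # Compare the exact tail P(Z > x) with exp(-x^2/2)/(x*sqrt(2*pi))
    # and with (1 - 1/x^2) times that.
    for x in (1.0, 2.0, 3.0, 5.0):
        upper = exp(-x**2 / 2) / (x * sqrt(2 * pi))
        lower = (1 - 1 / x**2) * upper
        print(x, lower, norm.sf(x), upper)   # lower <= exact <= upper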

Basically, the idea is that since the density to the right of x is dropping off as the exponential of a quadratic, most of it is concentrated very close to x, so we might as well approximate the density by the exponential of a linear function, which is easier to work with.

By similar means one can show that the expectation of a real number selected from the standard normal distribution, given that it's greater than x, is something like x + 1/x. The tail to the right of x looks like an exponential random variable with mean 1/x. For example, the expectation of a real number selected from the standard normal distribution, conditioned on being larger than 10, is 10.09809.... But this is probably useless, because the probability of a real number selected from the standard normal distribution being larger than 10 is, by the previous bound, smaller than 1 in $10(2\pi)^{1/2}e^{50}$, or about one in 1.3 × 10^{23}.

25 June 2008

White People hate math, but like statistics

Did you know White People hate math, but like statistics?

This is from Stuff White People Like. There are three things you should know about SWPL, if you don't already. First, it's SATIRICAL. Second, it is not actually about "white people" (i. e. people whose ancestors originally hail from Europe) but "White People". These are best defined as people who like things on this list, like irony, Netflix, Wes Anderson movies, indie music, having two last names, Oscar parties, having black friends, indie music, The Wire, and the idea of soccer. (This is actually a randomly chosen sample from the list, which is conveniently numbered; random.org gave me 41 twice, and indie music is #41, hence the duplication. I was going to just pick a few things at "random", but I realized that I was kind of biased towards the things that I like.)

By "statistics" is meant not the mathematical field but various interesting-sounding numbers. For example, if each White Person has a favorite thing from the list of Stuff White People Like, and you pick ten white people at random, there's a 36% chance that two of them will have the same favorite White Person Thing. (This is the White Person version of the birthday paradox.)

Also of interest there: the entry on graduate school, which I think pretty clearly refers to grad school in the humanities.

14 June 2008

A consequence of seven not dividing thirty

I use Google Reader to read a large number of blogs and other RSS feeds. It displays charts of the number of posts I've read in each day of the last thirty, and also the number of posts I've read in each hour of the day and day of the week over a 30-day period.

There's pronounced periodicity in the number of posts read on each day, which comes from periodicity in posting frequency (I probably read blogs twice a day or so). In particular certain days of the week see more posts than others -- in most weeks I read the most posts on Wednesday; then Tuesday and Thursday; then Monday and Friday; then Saturday and Sunday. I've known this for a while.

But in the last thirty days, I have read more posts on Friday than any other day. The numbers are as follows: Sunday, 369; Monday, 636; Tuesday, 719; Wednesday, 784; Thursday, 724; Friday, 883; Saturday, 448.

Where are all these extra Friday posts coming from?

Well, it's Saturday. The last 30 days are a period running from a Friday (May 16) to a Saturday (June 14), which includes five Fridays and five Saturdays but only four of each other day.

If I consider the last twenty-eight days instead, I get 696 for Fridays and 378 for Saturdays, and the profile for the month looks like the profile for any given week.
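Alternatively, divide each day's total by the number of times that weekday appears in the 30-day window (a quick sketch using the counts above):

    # Normalize the 30-day counts by weekday occurrences in a window
    # that runs Friday to Saturday: five Fridays and Saturdays, four of the rest.
    posts = {"Sun": 369, "Mon": 636, "Tue": 719, "Wed": 784,
             "Thu": 724, "Fri": 883, "Sat": 448}
    occurrences = {"Fri": 5, "Sat": 5}   # every other day appears 4 times

    for day, total in posts.items():
        print(day, round(total / occurrences.get(day, 4), 1))   # Wednesday is back on top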

But the fact that thirty isn't divisible by seven has an interesting effect here.

16 May 2008

Correlation coefficients and the popularity gap

The Popularity Gap (Sarah Kliff, Newsweek, May 15 issue).

Apparently, the people who end up being successful later in life are the ones who think people like them in middle school, not necessarily the ones who are actually well-liked in middle school. This reports on a study by Kathleen Boykin McElhaney, who is not particularly important to what I'm going to say, because I'm going to comment on something that I assume was introduced by the folks at Newsweek.

The Newsweek article continues:
One of McElhaney's most interesting findings is that self-perceived and peer-perceived popularity don't line up too well; most of the well-liked kids do not perceive themselves as well liked and visa versa. The correlation between self-perceived and peer-ranked popularity was .25, meaning about a quarter of the kids who were popular according to their classmates also thought they were popular. For the other three quarters, there was a disconnect between how the teen saw themselves and what their peers thought.
I can't read the original journal article (the electronic version doesn't become available for a year after publication, and I'm not going to go to campus in the rain and look around an unfamiliar library just to track this down!) but the Newsweek article says enough to make it clear that the study wasn't using a two-point "popular/unpopular" scale. I'm inclined to think that the "correlation" here is what's usually referred to as the "correlation coefficient" -- and this is usually explained in popular media by saying that "one-fourth of the variation in how popular students believed they were was due to how popular they actually were" or some such similar phrase. I'm not a statistician, so I won't try to explain why that phrase might be wrong; if you are, please feel free to weigh in!

But let's assume that half of students are actually popular, and half of students think they're popular. (This might be a big assumption; recall the apocryphal claim that 75 percent of students at [insert elite college here] come in thinking they'll be in the top 25 percent of their class.) Then if only 25 percent of the students who are actually popular think they're popular, there's actually a negative correlation between actual popularity and perceived popularity! More formally, let X be a random variable which is 0 if someone's not (objectively) popular and 1 if they are; let Y play the same role for their self-assessed popularity. Then E(XY) is the probability that a randomly chosen student both is popular and thinks they are, which is 1/8 in this case; E(X) E(Y) = 1/4, which is larger.
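For concreteness, here is the correlation coefficient in that toy two-by-two setup (my own calculation of the example just described):

    from math import sqrt

    # X = 1 if actually popular, Y = 1 if self-assessed popular.
    # Half are popular, half think they are, but only 25% of the popular kids know it.
    p_x, p_y = 0.5, 0.5
    p_xy = 0.25 * p_x     # P(X = 1 and Y = 1) = 1/8

    cov = p_xy - p_x * p_y
    corr = cov / (sqrt(p_x * (1 - p_x)) * sqrt(p_y * (1 - p_y)))
    print(cov, corr)      # -0.125 and -0.5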

Then again, if there actually were a negative correlation -- if people were so bad at self-assessment as to be worse than useless at it -- then that would be quite interesting. As it is, there seems to be in general a weak positive correlation between how P someone is (where P is some desirable trait, say popularity in this case) and how P they think they are.

And the fact that I bothered to write this post probably will lead you to guess -- correctly -- that I wasn't all that popular in high school.

15 May 2008

Well-intentioned money advice

Suze Orman, "internationally acclaimed personal finance expert" (actually the title of her web page!), said yesterday on myphl17 News At Ten something like: "You are not to spend your economic stimulus check. You must save it." (Don't ask me why I ever watch this newscast. It consists of recycled press releases, news that someone got shot and somebody's house burned down, and sports scores. The only score I care about is the Phillies' and I usually know how that turned out anyway.)

Anyway, Orman's advice seemed to be based on the idea that because the economy as an aggregate is doing poorly, we all must be suffering. There are surely some people who have had a very good year and don't need the six hundred bucks. And there are surely some people who have had a very bad year and for whom six hundred bucks is just a drop in the bucket.

I'll call this the "distributional fallacy" (does it have another name?) -- assuming that any individual must be representative of the population from which they're drawn. Not a horrible assumption in the absence of other information -- but I know more about my financial situation than someone appearing on my television does!

But "if times have been bad and you don't have money saved up, you should save the money -- and maybe you should save it even if things have been good for you, because they might turn bad" doesn't have the same ring to it.

I'm not arguing that people shouldn't save their money, because life has a way of causing people trouble. But to assume that everybody is going through hard times is kind of short-sighted. Then again, if you tell people "some people should save their money", that American instinct to consume will kick in and people will assume that "some people" doesn't include them.

(For the record, I will be saving my economic stimulus check. I think. It's hard to say, because money's fungible, and I stand to have negative cash flow this summer because I won't be teaching like I have the last two summers. So it'll go into savings, but then I'll spend "it" later. Money is money, it all mixes together. It's a scalar, not some sort of crazy vector in a non-Euclidean space as some people would probably like you to think.)

Simpson's paradox and climate change

Solving the climate change attitude mystery, from Statistical Modeling, Causal Inference, and Social Science, originally from Wired.

The facts appear to be as follows: 19 percent of college-educated Republicans believe that human activities cause global warming; 75 percent of college-educated Democrats believe the same. Among the non-college-educated, 31 percent of Republicans have that belief and 52 percent of Democrats.

One conclusion is that college-educated people are more likely to toe the party line.

But here's another idea. The assumption underlying this seems to be that "Republican" and "Democrat" are fixed labels -- so clearly college makes you think that global warming is more likely if you're a Democrat, but less likely if you're a Republican. But of course those labels are not fixed; people can switch parties! So maybe what happens is that education doesn't change your beliefs -- but if you are the sort of person who would believe that humanity causes global warming, but otherwise tend to agree with Republicans, a college education would flip you to the Democratic side. (And vice versa for Democratic-leaning global-warming skeptics.) I'm not saying that there are people doing indoctrination at our colleges, but that such an education changes the way one looks at things.

It basically seems to be Simpson's paradox in a different guise. You've got to be careful when the groups you're analyzing change size!
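Here's a toy version of the party-switching story (invented numbers; it's not a fit to the survey, just the mechanism): nobody changes their mind about the science, but the within-party percentages move apart anyway.

    # Education changes party labels, not beliefs.
    # Non-college numbers: 100 Republican-leaning people (31 believers),
    # 100 Democratic-leaning people (52 believers).
    rep, rep_believers = 100, 31
    dem, dem_believers = 100, 52

    # Suppose college moves 15 of the Republican-leaning believers into the
    # Democratic column and changes nobody's belief.
    s = 15
    print((rep_believers - s) / (rep - s))   # ~0.19: college Republicans look more skeptical
    print((dem_believers + s) / (dem + s))   # ~0.58: college Democrats look more convinced
    print((rep_believers + dem_believers) / (rep + dem))   # 0.415 overall, unchanged by the switch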