20 May 2009

Binomial probabilities

This also goes under the name of 'Bernoulli trials'. The idea is that you have a circumstance in which one of two things will happen, and you're going to try out the process many times. This could be tossing coins, or dice, or a batter going to the plate (or wicket) many times, a basketball or hockey player taking a number of shots, and so on. After a bunch of trials, you then ask what the chances are that you got that many heads (or sixes, hits, baskets, ...) or more. Even better is when you try to figure out the chances before you start the trials.

But let's be concrete. I find it easier to think of a particular example first, and then think about more general cases. (Some people prefer the other way around; people are different. But they're not writing this blog :-) Consider tossing a coin 5 times. What are the chances that we get 3 heads?

There's only 1 way to get 5 heads -- get a head on the first throw, and the second, and the third, and the fourth, and the fifth. 'And' is an important word in probability, meaning that we should multiply the probability of the individual events involved. Since we're using a fair coin with a 1/2 chance of turning up heads, this means 1/2 * 1/2 * 1/2 * 1/2 * 1/2. So, 1/32 chance of turning up 5 heads in 5 throws.

But let's think about getting 4 heads in 5 throws. We have 5 different ways of getting that. I'll list them off with 'H' meaning a heads, and 'T' meaning tails. They are:
THHHH or HTHHH or HHTHH or HHHTH or HHHHT.
Each one of these has a 1/32 chance of happening. Here the other important probability word shows up: 'or' means to add the chances. So there is a 5/32 chance of getting 4 heads in 5 throws of a fair coin. (Or for your team to win 4 games out of 5 between evenly matched teams, or for a player to hit 4 shots out of 5 when he has a 50% shooting percentage, and so on.)
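The counting above can be checked by brute force. What follows is a minimal sketch in Python, not anything from the original spreadsheet; the function name `count_sequences` is made up for illustration. It lists all 32 equally likely sequences of 5 tosses and counts those with a given number of heads.

```python
from itertools import product

def count_sequences(tosses, heads):
    """Count the H/T sequences of length `tosses` with exactly `heads` heads."""
    return sum(1 for seq in product("HT", repeat=tosses) if seq.count("H") == heads)

tosses = 5
total = 2 ** tosses  # 32 equally likely sequences for a fair coin

for heads in (5, 4, 3):
    ways = count_sequences(tosses, heads)
    # 'or' means add: each sequence has chance 1/total, so the chance is ways/total
    print(f"{heads} heads: {ways} ways, chance {ways}/{total}")
```

Listing every sequence is fine for 5 tosses, but the number of sequences doubles with each extra toss, so this approach is hopeless for hundreds of trials.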

It gets more complicated with 3 heads, 2 tails. And a lot more complicated when it's 493 tosses of the coin. That's where we want the general formula to do the legwork for us. When we get to cases where one outcome is more likely than the other, rather than an even 50/50, it again becomes nice to have a more general formula. I've made up a little spreadsheet in Open Document format, if there's interest.
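For reference, the standard general formula here is the binomial distribution. The symbols are labels chosen for this note, not taken from the spreadsheet: n is the number of trials, k the number of successes, and p the chance of success on any one trial.

```latex
P(\text{exactly } k) = \binom{n}{k}\, p^{k} (1-p)^{n-k},
\qquad
\binom{n}{k} = \frac{n!}{k!\,(n-k)!}

P(k \text{ or more}) = \sum_{j=k}^{n} \binom{n}{j}\, p^{j} (1-p)^{n-j}
                     = 1 - \sum_{j=0}^{k-1} \binom{n}{j}\, p^{j} (1-p)^{n-j}
```

The binomial coefficient counts the 'or' cases (the number of different orderings), and the powers of p and 1-p are the 'and' part. For 3 heads in 5 throws of a fair coin, the coefficient is 5!/(3! 2!) = 10, and each ordering has chance 1/32, giving 10/32.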

For the case at hand, about 'Grumbine scientists', we're wondering what the chances are that there'd be 5 or more scientists in a group of 493 people. By the way, speaking of family odds, the Bernoullis had an extraordinary run of mathematicians (hence the name for this bit of probability) and mathematical physicists ('Bernoulli effect' in fluid dynamics). In the case of 'scientist', the coin is weighted heavily against coming up that way. I made up a probability of 99.9% that a given person (in the US) was not a scientist who had published in the scientific literature in the last 20 years. Don't know that this is correct, which we'll return to. Nor, as we've already discussed and had comments on, are we confident that the number 493 is right or even very close.

With only 1 person in 1000 qualifying, we wouldn't be surprised to see 0 of 493 turn out to be scientists. The actual calculation says we'd expect 0 about 61% of the time, and exactly 1 about 30% of the time. Conversely, the chance of something happening is 100% minus the chance of it not happening. Having 2 or more scientists show up, then, is only about a 9% chance. We shouldn't worship at the altar of the 5% level, but it's a good rule of thumb for getting started. With a 9% chance, we're not very impressed to see 2 or more scientists in this group of 493. But the chance of getting exactly 2 scientists is 7.4%. Subtracting that from the 9%, and allowing for some rounding in the earlier figures, there's only a 1.4% chance of finding 3 or more scientists in a group of 493 people. And that beats the 5% requirement handily. It's 0.2% for 4 or more, and 0.016% for 5 or more. This gets to a level where, as a matter of the probability, we'd be pretty confident that something real was going on.
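These figures can be reproduced with a few lines of Python. This is a sketch using the standard binomial formula, not the original spreadsheet, and the function names `binomial_pmf` and `chance_at_least` are made up for illustration. The inputs are the post's own: 493 people and the made-up 0.1% chance per person.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Chance of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def chance_at_least(k, n, p):
    """Chance of k or more successes: 100% minus the chance of fewer than k."""
    return 1.0 - sum(binomial_pmf(j, n, p) for j in range(k))

n = 493      # group size used in the post
p = 0.001    # made-up chance that a given person is a published scientist

for k in range(5):
    print(f"exactly {k}: {100 * binomial_pmf(k, n, p):.3f}%")
for k in range(1, 6):
    print(f"{k} or more: {100 * chance_at_least(k, n, p):.3f}%")
```

Working from the complement (fewer than k) keeps the sum short, since only a handful of terms are needed rather than all 493.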

But there's a joker in this deck, and I'm it. There is a selection effect problem. Namely, this group contains me -- because I started looking at the subject on the grounds that I was in it. Any group that contains me is guaranteed to have at least 1 scientist, 1 left-handed person, 1 runner, and so on. If I'm the person selecting the group, then we have to not count me. That leaves us with only 4 identified Grumbine scientists for the purpose of our research. As that's still at the 0.2% level, we're still pretty confident that something real is going on. Or at least if not, it's a pretty surprising coincidence.

Suppose that the number I made up for the fraction of people who are scientists is too low. Let's say it's 0.3% instead of 0.1%. Then our chances of getting 4 or more scientists in the sample of 493 rise to 6.3%, which would not pass our standard for 'probably not chance'. So it is important to get a good idea of the figure. Now I'm pretty sure that the true figure isn't as high as that. It would mean about 1 million people in the US had published in the last 20 years, and that just doesn't seem plausible. Even 300,000 (the 0.1%) strikes me as high, but I was trying to err on the high side in the first place.
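The sensitivity to the assumed fraction is easy to explore. The sketch below uses the standard binomial formula again; the helper name `chance_at_least` is made up, and the intermediate value of 0.2% is added purely for illustration and does not appear in the post.

```python
from math import comb

def chance_at_least(k, n, p):
    """Chance of k or more successes in n trials, via the complement."""
    fewer = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k))
    return 1.0 - fewer

n = 493  # group size used in the post

# 0.1% and 0.3% are the post's values; 0.2% is an illustrative in-between case
for p in (0.001, 0.002, 0.003):
    print(f"p = {p:.1%}: chance of 4 or more in {n} is {chance_at_least(4, n, p):.2%}")
```

Tripling the assumed fraction moves the answer from well under the 5% rule of thumb to just over it, which is why the made-up number matters so much.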

Some odds and ends:
  • We had some difficulty in finding data to work with.
  • Once found, we saw that the data had some serious quality control problems.
  • Our starting point included a selection bias problem.
  • In trying to evaluate our conclusion, we discovered that the conclusion depended on an assumption we'd made without much evidence (the 0.1%).
All of this is common. To do science, you have to be ready to go back over your whole process to verify that each step, where it was not extremely strong, at least doesn't change your conclusions. If it could, you have to mention this. Something that happens in science, though, is that once an issue has been mentioned, it doesn't necessarily get rehashed in every paper for ever after. Inexperienced readers of science sometimes complain, because of this, about scientists 'hiding' problems. It isn't hidden; it's there in the scientific literature. It's just that the scientific literature assumes you're all big kids and have done your homework in reading the previous work. Citations aren't there for decoration.

Partly, this set of notes is to illustrate the Central Skill of a Scientist note. Although it turned out that the original idea, of there being surprisingly many Grumbines in science, is probably acceptable, it could have turned out otherwise. At that point, move on to other ideas. Scientists have many ideas, which is one of the secondary skills. So moving on to others is not a big deal.

Then again, having passed this far, we are in a position to ask -- again -- "So what?". More politely, "What would be shown even if the idea were true?" If there really were some exceptional number of Grumbines (or Bernoullis or Darwins to name some much better known families) in science, what would that mean? Unfortunately, nothing particular. Maybe it is a sign that there's a genetic contribution to entering science. Maybe it's a sign that there are family environment features which, for some reason, are common in this crowd. And maybe it really is just chance. These are more reasons to not get too wedded to ideas.

3 comments:

Alastair said...

“Science is organized common sense where many a beautiful theory was killed by an ugly fact.” Thomas Henry Huxley (Darwin's bulldog.)

Fleeming Jenkins pointed out to Darwin that any trait that evolved would very soon be bred out, so natural selection could not work. In your case, you only have 1/(2^6) = 1/64 of the genes of your 5th Grandfather! Of course it may be that intelligence is a dominant gene, and so it has been passed down to the male Grumbines with a greater frequency than 1/64.

OTOH, perhaps intelligent Grumbines have married intelligent wives, thus improving the odds!

Cheers, Alastair.

Robert Grumbine said...

Do you have a source for where Huxley said that? I've been looking casually for some time to find a source.

As we now know, Jenkins was wrong. But the arguments about how inheritance worked did set the stage for understanding the importance of inheritance being by way of discrete things (genes, it turned out) rather than blending.

Looks like we count generations differently. I wonder if that's a UK vs. US thing. The common ancestor of US Grumbines is my 5 times great-grandfather, so 7 generations back. Makes 1/128th of my ancestors and even less of my genome. If it turned out that there was a genetic link, one way to avoid losing the genes to chance would be to put the genes on the Y chromosome, which is inherited in full by men from their fathers. Given the daughters of Grumbines I've known and know of, this seems unlikely. They're a sharp crowd.

The social side seems much more likely. The women who married my Grumbine male ancestors have typically been accomplished and educated themselves -- several school teachers, a mathematician here (who had to abandon the field because of discrimination against women), a PhD there (in science). One family whose women I don't know much about (married in 6 generations back) has a very suggestive name as to family interest in education -- Buchs (German for books, and the name probably was adopted by her grandfather; I'll guess that a family that adopts such a name in about 1730 values learning).

A different prompter of these notes was one of my sisters observing that we had the scientist side of the family (our father's), and the teacher side (our mother's). We're even more loaded with teachers on my mother's side.

From where I sit, good teachers and good scientists are very similar personality types. A good scientist is passionate about learning about the universe and sharing what they discover. A good teacher is passionate about sharing what they know about the universe and finding out more. It's only a slight shift in emphasis.

Alastair said...

I found the previous quote on the web using Google. However I found the following in Oxford Compendium of Quotations loaded on my computer, which may be more accurate.

The great tragedy of Science - the slaying of a beautiful hypothesis by an ugly fact.
Huxley, T.H. Collected Essays (1893-4) 'Biogenesis and Abiogenesis'

My mix-up with 6 generations was just my poor reading; I thought you wrote 5th grandfather, not great-grandfather.

I have been pointed to this page as a result of posting 1709: The year that Europe froze.

Don't know whether they are of any relevance.

Cheers, Alastair.