07 September 2008

First number not in Google?

Walking Randomly asks: what's the smallest positive integer that gives no Google hits?

I don't know exactly, but it's in the high eight digits. Why, you ask?

Well, I searched a series of numbers, starting with 1 and roughly doubling. This gave the following data:




Search stringNumber of hits
13033319 158
26066638 37
52133277 17
104266555 12
208533110 4

(Each number in the left column is either twice the previous one, or twice the previous one plus one.)

So here's what I'm thinking; numbers around 26 million, have, on average, 37 google hits. Since containing a given number is a Rare Event, I'm guessing that we can treat the number of hits of numbers around that magnitude as Poisson with parameter 37. The probability that a Poisson(37) random variable is equal to zero is exp(-37), or about one in 1016. So the first integer with no Google hits is probably larger than 26 million.

But by the same argument, numbers around 52 million have probability exp(-17), or about one in 25 million, of having no Google hits. So we expect one between, say, 52 million and 77 million.

And by the time we get up near 100 million, we should be seeing these numbers at a frequency of one in exp(12), or six in a million; they should become commonplace.

To get a better model, I'd take more samples. The number of Google hits for a number seems to follow a power law; the number of hits for n is a constant times n for some exponent α, somewhere between 1 and 1.5. (There are issues around saying that things follow power laws, though; it's easy to see them even when they're not there.) And there are various complications -- for example, powers of ten and powers of two are more common than the numbers around them. And how do we know that the occurrence of a number on various web pages is actually independent? To be honest, we don't; if a number exists on a web page, it's there for a reason, and if one person has something to say about it, why not someone else?

15 comments:

Anonymous said...

And if you knew exactly, you couldn't really say. (If you did, Google might pick it up...)

Jonathan

Anonymous said...

Well, there may be a bunch of 8-digit numbers not in Google, but when you get to the 9-digit range, you'll pick up a lot of numbers of the form 617 253 8800. Some prefixes won't be represented, of course, but there are a lot more 9-digit numbers than there would otherwise be.

Michael Lugo said...

plam, I assume your choice of number was not accidental, although I didn't recognize the last four digits at first. (And on a related note, I'd expect 617253xxxx to be seriously overrepresented. Although maybe not -- MIT people would perhaps be unlikely to put their phone numbers on their web pages. Nobody uses phones there.)

On a similar note, while playing around with this I noticed that numbers equal to or slightly less than 2008 tend to be overrepresented, because they often refer to years. (2008 is more common than 6, and twenty times as common as 2009.) And five-digit numbers have interpretations as US zip codes, so for example 90210 is much more common than 90209 or 90211. And 02138, 02139, 02142 are more common than 02140, 02141 -- which makes sense if you look at a map of Cambridge zip codes and think about which areas of Cambridge are likely to be more internet-heavy, but the leading zero means that wouldn't show up in the search I was talking about.

niklas said...

348494233 doesn't have any hits on Google, let's see how that lasts.

Anonymous said...

Wouldn't it make more sense to fit some kind of power law distribution over the numbers rather than a Poisson distribution?

To quickly test this, I took the number 13033319 and repeatedly incremented the first digit to see if the resulting distribution fit with Benford's law. Here's what I got:

13033319: 157

23033319: 81

33033319: 33

43033319: 39

53033319: 10

63033319: 46

73033319: 65

83033319: 23

93033319: 23

So not a great fit for Benford's law but it's suggestive.

Michael Lugo said...

Mark, I wasn't entirely clear. What I meant is that there should be a power law for the "expected" number of google hits for a certain number (where the expectation is taken over some ensemble of Internets) and then the fluctuations from that expectation are given by the Poisson distribution.

Anonymous said...

Thanks for clarifying that. I think I follow your argument now.

I'm half tempted to write a script that performs a search for the smallest number but considering the sheer number of queries it would have to make I'm sure it would be against Google's terms of service.

Anonymous said...

niklas... nice. We used to have an upper boung....

Anonymous said...

michael, you might recognize that number in the encoding 'DEF TUV TUV OPER OPER', as seen on posters everywhere at MIT.

mark, that might be a script best run from within Google...

Michael Lugo said...

yes, after I googled it I realized that the "DEF TUV TUV OPER OPER" form was more memorable. It's fortunate that they have a number such that all those strings are pronounceable; if their phone number has, say, 7s in it, people would be wondering "um, how do I pronounce 'PQRS'?"

Unknown said...

So, I took the number in the origianl post that indicated that it had 4 hits. And entered it into Google search (used double quotes around the number). I expected it to have 5 hits (since the original post would account for the 5th hit).

And sure enough there were 5 hits.

:o)

Unknown said...
This comment has been removed by the author.
Anonymous said...

The distribution is skewed in non-mathematical ways. 5-digit numbers are zipcodes, 7-digit numbers are phone numbers, 8-digit with leading zero are also phone numbers, 10-digit are phone numbers with area codes (or Social Security numbers or ISBNs), 13-digit are EANs, 16-digit are credit card numbers. If you allow leading zeroes, 9-digit "numbers" not in Google's database are common. That makes for a lumpy distribution.

Anonymous said...

scaliger, your SSN has 10 digits? Dammit, I knew the gummint was cheating me!

Anonymous said...

The following paper provides some intersting insights: Frequency of occurrence of numbers in the World Wide Web