26 August 2007

how many Scrabble racks are there?

I've been playing a lot of Scrabble on Facebook lately. While Googling around for some stuff on Scrabble strategy (I'm not the type to memorize words, mostly because that crosses some sort of invisible line between "fun" and "work", but knowing about things like how to try to keep a good mix of vowels and consonants on my rack makes the game more interesting just because I'm less likely to get stuck with a bad rack, which is frustrating), I found a couple things of interest.

First, this handout from the MIT Scrabble club. (I actually knew the guy who started it. He lived down the hall from me my senior year, when he was a freshman.) On that handout they quote Joel Sherman, who says:
Accept the idea that Scrabble is a math game just as much as it is a word game. All the strategic theory of the game is based on statistical analysis, probabilities, spatial relationships on the board, maximizing the value of small-numbered tiles by playing bingos (using all 7 tiles on your rack to earn the 50-point bonus), and large-numbered# tiles by causing them to interact with the colored premium squares on the board, or with other words on the board. (I.e.: you can score just as much for placing a 4-letter word in a place where it lies parallel to 4 other letters or whole words without hitting any premium squares as you might for the same word hitting a double or even triple word square but forming only one or two words on the play.)


Indeed, the conventional wisdom is that people who are good at math do well at Scrabble, and that having a large vocabulary is not particularly useful. (I suspect a lot of this is because large vocabularies tend to consist of long words, and words much longer than seven letters are quite rare in Scrabble.) The same thing is said to be true of crossword puzzles, although there it's not as obvious why there should be a correlation, and my belief is that it's not as pronounced.)

I also found at Scrabble on the Brain a question: "How many games of Scrabble would you have to play before you see every possible rack?"

How to answer this question is not obvious, because the racks one gets in a game aren't independent. I'll answer the easier question of how many games of Scrabble you'd have to play before you see every possible rack on the opening play. This is an instance of the "coupon collector's problem", which says: if we pick a sequence of elements from {1, 2, ..., n} at random, with replacement, how long will it take until we've picked each of the n integers at least once? The standard version of the problem has one picking uniformly at random, so this models, for example, the number of rolls of a standard six-sided die that would be necessary to have each number come up at least once. The answer in this case is n(1 + 1/2 + 1/3 + ... + 1/n), which is asymptotically equal to n(log n + γ), where γ = 0.5772156649... is the Euler-Mascheroni constant. I suspect that in the case where the distribution is non-uniform, as it is for Scrabble racks, it takes longer to observe all possiblilities;

and so the important thing is to compute the number of possible Scrabble racks. (If this has been done before, I can't find it!)

If Scrabble tiles could be differentiated (so instead of having nine tiles labeled A, we had tiles labeled A1 through A9), then the answer would just be the number of ways of picking seven tiles from 100, which is C(100,7) = 16,007,560,800. But they're not. I can't see a nice way to compute the actual number (although that doesn't mean it's not there, and I'd appreciate knowing it!) But the following sampling procedure should yield an estimate. Let X be a random variable defined as follows: pick a rack R of seven Scrabble tiles uniformly at random from all 7-sets of (distinguished) Scrabble tiles, and let X(R) be the number of different 7-sets of distinguished Scrabble tiles that collapse to the same set of non-distinguished tiles. So, for example, if we picked the tiles EXAMPLE, there are 12 E's, 1 X, 9 A's, 2 M's, 2 P's, and 4 L's in the tile set, so the number of possible sets of distinguished tiles reading EXAMPLE is C(12,2) C(1,1) C(9,1) C(2,1) C(2,1) C(4,1) = 9504. Then the number of possible racks of undistinguished tiles is the sum of 1/X(R) over all possible racks of distinguished tiles, or C(100,7) times the expectation of 1/X(R).

(Compare if we were trying to find the number of possible one-letter racks, that is, letters; we'd have, say, 9 terms which are each 1/9 corresponding to the labeled tiles A1 through A9, 2 terms each 1/2 corresponding to B1 and B2, and so on. The terms corresponding to each tile type sum to 1, so the sum is just the number of tile types, or 27 including the blank.)

And the expectation of 1/X(R) can be found by sampling. I generated a million racks at random and evaluated 1/X(R) (with some hacked-together Maple code); the expectation of 1/X(R) calculated from this sample is 0.0002014090957, and I approximate the number of possible Scrabble racks to be this quantity times C(10,7), or 3224068. (I'm not interested enough to try to calculate the standard error on this -- and I hope I can get an exact answer.) The number of games of Scrabble you'd have to play before you see every possible rack on the opening play is probably at least 3224068 log 3224068, or around fifty million.

6 comments:

Anonymous said...

I also found that memorizing words for Scrabble crosses the line between fun and work. I also have a degree in math.

Thus, I changed the game to eliminate the utility of memorization.

Hope you will take a look.

Peter

http://www.wildwords.us

Brett said...

I wrote a perl program to calculate the number of unique scrabble racks. There are 3199724 racks - your estimation was very close!

My program assumes a rack size of 7 and these tile counts:
A=>9, B=>2, C=>2, D=>4, E=>12, F=>2, G=>3, H=>2, I=>9, J=>1, K=>1, L=>4, M=>2, N=>6, O=>8, P=>2, Q=>1, R=>6, S=>4, T=>6, U=>4, V=>2, W=>2, X=>1, Y=>2, Z=>1, '#'=>2

Michael Lugo said...

Brett,

Indeed, a couple days later I came up with the same number.

smh62 said...

I get 3199724 too. But I managed to find a reasonably simple way of computing this by hand. I'll try to describe it.

The first part is how you think about the rack. In this calculation, I think about it in terms of the form of it's repetition. So the rack specified by the frequency list [3,2,1] refers to a rack where there are three different letters. The first occurs 3 times, the second appears 2 times and the third appears 1 time only. There are only 14 such possible rack specs in a scrabble rack (listed later).

In order to know how many ways a spec can be satisfied, it is necessary to know how many different letters can be repeated a given number of times. I denote this fequency by F[] and I have these values:

F[1] = 27 (see NOTE A)
F[2] = 22
F[3] = 12
F[4] = 11
F[5] = 7
F[6] = 7
F[7] = 4 (see NOTE B)

NOTE A: This counts the blank as a letter.
NOTE B: There are only four letters which can be repeated seven times.

A further consideration is the number of permutations of the frequencies in a rack spec which would leave the spec unchanged. I denote this number as P[] and list the values of P[] for all possible rack specs:

P[7] = 1
P[6,1] = 1
P[5,2] = 1
P[5,1,1] = 2 (see NOTE A)
P[4,3] = 1
P[4,2,1] = 1
P[4,1,1,1] = 6 (see NOTE B)
P[3,3,1] = 2
P[3,2,2] = 2
P[3,2,1,1] = 2
P[3,1,1,1,1] = 24
P[2,2,2,1] = 6
P[2,2,1,1,1] = 12 (see NOTE C)
P[2,1,1,1,1,1] = 120
P[1,1,1,1,1,1,1] = 5040

NOTE A: since the second and third letter choices can be swapped.
NOTE B: six ways of permuting the 1s)
NOTE C: That's 2*6.

Now the question is, given a spec, how many different letter choices does that specification correspond to? Let's take [4,1,1,1] as an example spec in order to demonstrate how to calculate N[4,1,1,1] the number of possible letter choices. First we'll calculate a necessary interim quantity X[]:

X[4,1,1,1] = (F[4] - 0) * (F[1] - 1) * (F[1] - 2) * (F[1] - 3)
= (11 - 0) * (27 - 1) * (27 - 2) * (27 - 3)
= 11 * 26 * 25 * 24
= 171600

and in general we have:

X[n0,n1,...,nK] = (F[n0] - 0) * (F[n1] - 1) * ... * (F[nK] - K)

At first glance, this looks like the number we're looking for but not quite. The calculation considers the letter choices (L,A,B,C) as distinct from the choices (L,A,C,B), (L,B,A,C), (L,B,C,A), (L,C,A,B) and (L,C,B,A) but which all actually correspond to the same rack. The number of possible equivalent choices is given by P[4,1,1,1] so in order to get N[4,1,1,1] we must divide X[4,1,1,1] by P[4,1,1,1]. So:

N[4,1,1,1] = X[4,1,1,1] / P[4,1,1,1]
= 28600

and for some general rack spec, (S), we have:

N[(S)] = X[(S)] / P[(S)]

Now we're in a position to calculate N[] for all valid rack specs:

N[7] = 4
N[6,1] = 182
N[5,2] = 147
N[5,1,1] = 2275
N[4,3] = 121
N[4,2,1] = 5775
N[4,1,1,1] = 28600
N[3,3,1] = 1650
N[3,2,2] = 2520
N[3,2,1,1] = 75600
N[3,1,1,1,1] = 179400
N[2,2,2,1] = 36960
N[2,2,1,1,1] = 531300
N[2,1,1,1,1,1] = 1447160
N[1,1,1,1,1,1,1] = 888030

And totalling all of this to get, N[*], the number of all possible racks is:

N[*] = N[7] + N[6,1] + ... + N[1,1,1,1,1,1,1]
= 3199724.

Tada!!! :)

Unknown said...

Generating function solution:

It's coefficient of x^7 for the following polynomial:

(1 + x)^5 *
(1 + x + x^2)^10 *
(1 + x + x^2 + x^3) *
(1 + x + x^2 + x^3 + x^4)^4 *
(1 + x + x^2 + x^3 + x^4 + x^5 + x^6)^3 *
(1 + x + x^2 + x^3 + x^4 + x^5 + x^6 + x^7 + x^8) *
(1 + x + x^2 + x^3 + x^4 + x^5 + x^6 + x^7 + x^8 + x^9)^2 *
(1 + x + x^2 + x^3 + x^4 + x^5 + x^6 + x^7 + x^8 + x^9 + x^10 + x^11 + x^12)

Some explanation:
There are 5 letters: J, K, Q, X, and Z that have frequency = 1. So, each rack can have 0 or 1 of each letter. (Those five letters contribute (1+x)^5 to the polynomial)

There are 10 letters (including the blank) which occur twice in the set of tiles. So, each rack can have 0, 1 or 2 of each letter.
(Those 10 letters contribute (1+x+x^2)^10 to the polynomial).

Etc, for other letters of other frequencies.

So, the coefficient of x^7 of the expanded polynomial gives you the number of distinct racks with 7 tiles.

If you run this in Mathematica or similar, you get the figure 3199724. (As a byproduct, you get every other distinct rack count for different rack sizes.)

1 + 27x + 373x^2 + 3509x^3 + 25254x^4 + 148150x^5 + 737311x^6 + 3199724x^7 + 12353822x^8 + 43088473x^9 + 137412392x^10 ...
+ x^100

As expected, the polynomial is symmetric in that the coefficient of x^y = the coefficient of x^(100-y).

So, there's one distinct set of zero tiles and one distinct set of 100 tiles.
There are 27 sets of one tile (26 letters + the blank) and 27 ways distinct sets of 99 tiles.

The maximum occurs at 50 (as expected): 913,201,857,455,724 (almost 1 quadrillion)

This almost fits into the input box for www.wolframalpha.com. If you only care about rack sizes <=7,
you can run this modified polynomial (the exponents top out at 7 because you couldn't have more than 7 tiles in
a rack of 7)

expand (1+x)^5*(1+x+x^2)^10*(1+x+x^2+x^3)*(1+x+x^2+x^3+x^4)^4*(1+x+x^2+x^3+x^4+x^5+x^6)^3*(1+x+x^2+x^3+x^4+x^5+x^6+x^7)^4

At time of writing though, WolframAlpha "doesn't know what to do with the input".

Unknown said...

This is interesting. The real question of course, is not how many unique racks are there?, or how long it would take to observe them all on the opening draw from the tile bag? but: "How much luck is involved in playing SCRABBLE(R)?"

If there are 3,199,724 unique racks, then a player playing 10 games/day with 15 draws from the tile bag/game would need a minimum of 58.4 years to draw all the unique racks. That's a big IF of course, since it would require the player to draw a new unique rack every time.

Using the approximation above to derive some estimate of the sampling time required to actually draw all of the unique racks (n(log(n)) this gives a value of approximately 380 years of random draws, 10 games/day, 15 draws/game. It might require more time, since experienced players manage their racks by saving certain tiles or combinations such as the blank, or ERS.

So a hardcore player, playing ten games per day over a 38 year career might see 1/10 of all possible racks. This means that any single player's draws over the players' entire career could possibly all fall at the extreme of one tail of a normal probability distribution, especially if the player played fewer than 10 games/day every day for 38 years.

Speculating a little, and assuming an sampling unit of an entire SCRABBLE career, (and not the more appropriate unit of a single draw from the tile bag- a big assumption!) we can get an intuitive feel for the amount of luck involved.

If the probability distribution mentioned above was that of drawing "good racks" (i.e. racks of tiles likely to score well given the current configuration of tiles on a game board- the very best racks being distributed toward the right tail of the distribution from the mean), and even if "good racks" comprised 70% of all possible draws over a career (and, again, if it is reasonable to suggest that an entire SCRABBLE career is an appropriate sampling unit) then it is very possible that the hardcore players spending most of their SCRABBLE(R) careers at the extreme left tail of the probability distribution would rarely see a "good rack", and be considered much unluckier than others.

Of course, we do not play randomly. The combined effects of rack management skills, the genius of Alfred Butts' choice of tile frequency distribution, and the distribution of letters in words acceptable for play, together very likely increase the probability of spending your career at the more pleasant right-hand side of the "good rack" probability distribution. This analysis highlights the importance of rack management! If you are consistently leaving "bad" tile combinations on your rack you are increasing the likelihood of dwelling in the less pleasant left-hand tail of the distribution!