15 January 2009

The synonym-following game

An interesting random fact: form a graph whose vertices are the words that the dictionary says have at least one synonym, and whose edges are synonym pairs. Then the resulting graph has a giant component. In particular, if "the dictionary" is Merriam-Webster's dictionary, there are 23,279 words that have at least one synonym, and the resulting graph has a component of size 22,311. So if you've ever played that game of following synonyms in a dictionary and ending up at words that seem to have nothing to do with where you started, this is why.

The graph also has a clustering coefficient of 0.7. The clustering coefficient is the probability that if we pick a vertex (word) u uniformly at random, and then pick two of its neighbors (synonyms) v and w uniformly at random, then v and w are themselves neighbors (synonyms). It's not surprising that this is high for the dictionary network; it seems consistent with synonyms being words that are "near" each other in some "semantic space".

I'm also kind of curious whether the results are different for different dictionaries; a dictionary that's less aggressive in declaring things "synonyms" might not show this behavior, and in particular I suspect there's a critical point at which small perturbations of aggressiveness lead to large perturbations in the size of the giant component.
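For anyone who wants to try this on their own word list, here is a minimal sketch in Python of the two quantities mentioned above: the size of the largest component and the clustering coefficient. The word pairs are made up for illustration (they are not from Merriam-Webster), all the function names are my own, and I'm assuming the convention that a word with fewer than two synonyms contributes 0 to the average; the paper may well use a different convention.

```python
from collections import deque
from itertools import combinations


def build_graph(pairs):
    """Build an undirected graph (word -> set of synonyms) from synonym pairs."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def components(graph):
    """Return the connected components as sets, largest first (breadth-first search)."""
    seen = set()
    result = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        comp = set()
        while queue:
            u = queue.popleft()
            comp.add(u)
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        result.append(comp)
    return sorted(result, key=len, reverse=True)


def local_clustering(graph, u):
    """Fraction of pairs of u's neighbors that are themselves neighbors.

    Assumed convention: a vertex with fewer than two neighbors gets 0.
    """
    nbrs = graph[u]
    k = len(nbrs)
    if k < 2:
        return 0.0
    linked = sum(1 for v, w in combinations(nbrs, 2) if w in graph[v])
    return linked / (k * (k - 1) / 2)


def average_clustering(graph):
    """Average the local clustering over all vertices (u chosen uniformly at random)."""
    return sum(local_clustering(graph, u) for u in graph) / len(graph)


# Hypothetical toy data, invented for this example.
TOY_PAIRS = [
    ("big", "large"), ("large", "huge"), ("big", "huge"),
    ("huge", "vast"),
    ("quick", "fast"),
]

graph = build_graph(TOY_PAIRS)
comps = components(graph)
giant_size = len(comps[0])
avg_cc = average_clustering(graph)
print(len(graph), giant_size, round(avg_cc, 3))
```

In this toy graph the largest component contains 4 of the 6 words. With a real synonym list you'd feed the (word, synonym) pairs to build_graph in the same way.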

I'm paraphrasing this from "Statistical mechanics of complex networks", by Reka Albert and Albert-Laszlo Barabasi (cond-mat/0106096); apparently it comes from an unpublished manuscript of Yook, Jeong, and Barabasi. (The article, from 2001, called it a "preprint", but I can't find anything by that set of authors that fits the description. Also, does anybody else find the habit of not including titles of articles in citations supremely annoying? There are actually two "preprints" by that three-author set cited in this article, both from 2001; these are distinguished only as "2001a" and "2001b".) If you can point me to this paper (or a similar study done by someone else), I'll appreciate it and will publicly thank you.

5 comments:

Anonymous said...

What happens if you expand to signed synonyms?

Anonymous said...

I want to know what two words are the farthest, and the path connecting them ...

Anonymous said...

If you are interested in the graph theoretic properties of synonyms check out the WordNet bibliography. My own current interesting mathematics of words paper is Toward a statistical mechanics of four letter words.

Aaron said...

"Also, does anybody else find the habit of not including titles of articles in citations supremely annoying?"

YES. The practice probably made sense when "searching" was something you did in a library and article lengths were limited by the price of dead trees... but it's a hell of a lot easier to Google titles than it is to Google names and dates!

Michael Lugo said...

Aaron,

your guess about the "price of dead trees" being the cause of title-free citation is good, but I don't think it's correct. In most old mathematics papers (by "old" I mean, say, between 1900 and 1950), citations are given with titles. Of course, I have not made any sort of systematic sampling of old math journals -- I'm just talking about places where I've dipped into the early twentieth-century literature to check a particular classical result, or where I've read some review paper from the time because I wanted to get an idea of how people thought about some concept in the past. So I suspect it's possible that there are portions of the literature where title-free citation was done and those papers are just not as widely known -- which would in itself be interesting.

But I actually believe that the difference is not one of time but one of (academic) discipline. Most of the papers I run across that use title-free citation are from the physics literature -- for example, the paper of Albert and Barabasi that I referenced in the original post. I don't ever remember running across a paper in pure mathematics that didn't give titles in the citations. (And since it's a pet peeve of mine, I think I would remember.) Also, I suspect title-free citation would have been more annoying in the past, not less; although a page number and set of authors isn't easily googlable, it's at least easy to find the paper if one has access to an electronic version of the journal or to match up with the online CV of one of the authors. It still makes following the citation harder, though, and I'll admit I'm less likely to read a paper if it's given in a title-free citation. For one thing, the title of the paper gives me information about whether the paper is likely to interest me!

On the third hand, you might be on to something about the age of the paper being a factor -- the practice might date from before the modern explosion of the literature, when there were few enough journals that it might be reasonable that an institution's library might have all of them. I'm not well-versed enough in the history of academic libraries to comment more intelligently than that.

And now I find myself wondering if some social scientist has actually studied patterns in citation styles -- perhaps someone working near a disciplinary boundary, after being frustrated by variations like this between disciplines.