Anyway, Joel Spolsky comments today in passing on how Google does things fundamentally differently than other companies. Here's the cite that caught my eye:
"Look at how Google does spell checking: it's not based on dictionaries; it's based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn't."
As anyone who's mistyped a phrase in Google knows, Google is eerily good at DWIM searching.
What's interesting to me about Joel's comment is that from Google's perspective, the accuracy of a term -- specifically its spelling -- is effectively a democratic process. Put another way, Google does not care what any given authority might say about the correctness of a particular spelling; instead, it is the ultimate in descriptivist empiricism -- the term with the most usage is the "more correct" term.
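To make the "most usage wins" idea concrete, here is a minimal sketch of a corrector that picks the most-used known spelling within one edit of whatever you typed. To be clear, this is not Google's actual algorithm, which I have no inside knowledge of; the names `USAGE`, `edits1`, and `most_used` are my own inventions, and the counts are simply the hit counts I quote further down in this post.

```python
import string

# Hit counts as quoted in this post; the tiny "corpus" is an assumption
# made purely for illustration.
USAGE = {
    "dependent": 131_000_000, "dependant": 7_430_000,
    "vendor": 172_000_000, "vender": 8_980_000,
    "collectible": 15_900_000, "collectable": 7_600_000,
    "cacheable": 362_000, "cachable": 95_300,
    "canceling": 5_210_000, "cancelling": 3_950_000,
}

def edits1(word):
    """All strings one deletion, substitution, or insertion away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + replaces + inserts)

def most_used(word, usage=USAGE):
    """Return the known spelling with the highest count within one edit.

    No dictionary decides what is "correct" here; only the counts do.
    Words with no known neighbor are returned unchanged.
    """
    candidates = {w for w in edits1(word) | {word} if w in usage}
    if not candidates:
        return word
    return max(candidates, key=usage.get)

if __name__ == "__main__":
    for typed in ("dependant", "cachable", "cancelling", "vender"):
        print(typed, "->", most_used(typed))
```

Note that nothing in the code knows which spelling an authority would bless. It "corrects" cancelling to canceling simply because canceling has more hits in the table.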
Given this hypothesis, let's see if I can devise a way to test it. Using GoogleFight, I'll compare some terms whose official spelling, speaking very broadly, might be open to debate:
light (468,000,000) versus lite (53,100,000)
Not a close contest, but that's still a respectable number of hits for a comparatively new variant. (One in nine, right?)
night (421,000,000) versus nite (8,780,000)
Clearly lite has made more inroads into light than nite has into night.
dependent (131,000,000) versus dependant (7,430,000)
I guess I can tell my writers that empirical evidence overwhelmingly favors the first. (And, may I add, whew.)
checkbox (9,880,000) versus check box (7,670,000)
Aha. Now I can go back to our editorial committee and tell them to get with the 21st century.
Update: Someone pointed out (see Comments) that I was searching for "check+box" (two words on the same page), not for the literal string "check box". Stats updated, conclusion updated; my original conclusion was that common high-tech usage had not yet had broad influence. (I thought I'd looked it up as a literal, but I guess not.)
collectable (7,600,000) versus collectible (15,900,000)
Bet you didn't think that one would be this close, did you?
canceling (5,210,000) versus cancelling (3,950,000)
Ooh, close one. Brits, obviously.
through (2,350,000,000) versus thru (35,600,000)
I'm sure purists everywhere are relieved.
cachable (95,300) versus cacheable (362,000)
This, if I read it right, contradicts what we're told in our corporate styleguide (2,100,000) or style guide (25,900,000).
dialog (71,900,000) versus dialogue (124,000,000)
The former, I would bet, is both more American and far, far more prevalent in computing ("dialog box").
vendor (172,000,000) versus vender (8,980,000)
hiccup (2,060,000) versus hiccough (166,000)
Well, that seems pretty clear.
donut (3,330,000) versus doughnut (1,760,000)
The historical spelling is the loser here.
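For anyone who wants to check my arithmetic (the "one in nine" guess, say), here is a small sketch that ranks the contests above from closest to most lopsided. The counts are the ones listed in this post; the names `PAIRS` and `rank_contests` are just my own, and the ratio is simply the winner's hits divided by the loser's.

```python
# (spelling, hits, rival spelling, rival hits), as listed in this post.
PAIRS = [
    ("light", 468_000_000, "lite", 53_100_000),
    ("night", 421_000_000, "nite", 8_780_000),
    ("dependent", 131_000_000, "dependant", 7_430_000),
    ("checkbox", 9_880_000, "check box", 7_670_000),
    ("collectable", 7_600_000, "collectible", 15_900_000),
    ("canceling", 5_210_000, "cancelling", 3_950_000),
    ("through", 2_350_000_000, "thru", 35_600_000),
    ("cachable", 95_300, "cacheable", 362_000),
    ("styleguide", 2_100_000, "style guide", 25_900_000),
    ("dialog", 71_900_000, "dialogue", 124_000_000),
    ("vendor", 172_000_000, "vender", 8_980_000),
    ("hiccup", 2_060_000, "hiccough", 166_000),
    ("donut", 3_330_000, "doughnut", 1_760_000),
]

def rank_contests(pairs):
    """Return (winner, loser, ratio) tuples, closest contest first."""
    results = []
    for a, hits_a, b, hits_b in pairs:
        # Order each pair so the more-used spelling comes first.
        (winner, win_hits), (loser, lose_hits) = sorted(
            [(a, hits_a), (b, hits_b)], key=lambda item: item[1], reverse=True
        )
        results.append((winner, loser, win_hits / lose_hits))
    return sorted(results, key=lambda item: item[2])

if __name__ == "__main__":
    for winner, loser, ratio in rank_contests(PAIRS):
        print(f"{winner} beats {loser} by about {ratio:.1f} to 1")
```

By this measure, light outpolls lite by a bit under nine to one, so "one in nine" was about right.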
I don't think there are too many surprises in there, and nothing that would contradict what a reasonably contemporary authority would say. But of course the point is that Google cares not a whit for what authorities say (the learnèd opinions of an august "Usage Panel," say); it's going entirely by what people actually use. Then again, do people actually use what an authority (AHD, for example) says they should? Well, mostly, but people do what they want, and no one is going to put lite back into the can.
In finding a link for DWIM, I ran across the Wikipedia entry, which says this: "Obviously, no real-life implementations of DWIM exist for any platform." I wonder if one could say that Google goes some way toward refuting this statement.