Anyway, Joel Spolsky comments today in passing on how Google does things fundamentally differently than other companies. Here's the cite that caught my eye:
Look at how Google does spell checking: it's not based on dictionaries; it's based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn't.
As anyone who's mistyped a phrase in Google knows, Google is eerily good at DWIM searching.[1]
What's interesting to me about Joel's comment is that from Google's perspective, the accuracy of a term -- specifically its spelling -- is effectively a democratic process. Put another way, Google does not care what any given authority might say about the correctness of a particular spelling; instead, it is the ultimate in descriptivist empiricism -- the term with the most usage is the "more correct" term.
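To make the "most usage wins" idea concrete, here is a minimal sketch of a frequency-based corrector. It is not Google's actual method, which neither Joel's comment nor this post describes in any detail; it is a toy, and the tiny corpus, the names CORPUS, COUNTS, edits1, and correct, and the one-edit search radius are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus standing in for "the entire Internet".
CORPUS = "joel spolsky wrote that joel spolsky likes software and spolsky blogs".split()
COUNTS = Counter(CORPUS)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts=COUNTS):
    """Pick the most-used spelling among the word itself and its one-edit neighbors."""
    candidates = [w for w in {word} | edits1(word) if w in counts]
    # No dictionary is consulted: usage counts alone decide which spelling wins.
    return max(candidates, key=counts.get) if candidates else word


print(correct("spolskey"))
```

Nothing in the sketch knows what a "correct" word is; a misspelled name gets fixed only because the right spelling happens to show up more often in the (made-up) corpus.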
Given this hypothesis, let's see if I can devise a way to test it. Using GoogleFight, I'll compare some terms whose official spelling, speaking very broadly, might be open to debate:
light (468,000,000) versus lite (53,100,000)
Not a close contest, but that's still a respectable number of hits for a comparatively new variant. (One in nine, right?)
night (421,000,000) versus nite (8,780,000)
Clearly lite has made more inroads into light than nite has into night.
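For anyone who wants to check my arithmetic, here is a small sketch using the hit counts quoted above. The helper names one_in and share are mine, invented for this example. By these numbers there are roughly 8.8 light hits for every lite hit, so "one in nine" holds if you measure lite against light alone; measured against the two spellings combined, lite is closer to one hit in ten.

```python
# Hit counts as quoted in the post.
HITS = {
    "light": 468_000_000,
    "lite": 53_100_000,
    "night": 421_000_000,
    "nite": 8_780_000,
}


def one_in(variant, standard, hits=HITS):
    """Number of standard-spelling hits per single variant hit."""
    return hits[standard] / hits[variant]


def share(variant, standard, hits=HITS):
    """Variant's fraction of the combined hits for both spellings."""
    return hits[variant] / (hits[variant] + hits[standard])


for variant, standard in [("lite", "light"), ("nite", "night")]:
    print(f"{variant}: one per {one_in(variant, standard):.1f} {standard} hits, "
          f"{share(variant, standard):.1%} of the combined total")
```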
dependent (131,000,000) versus dependant (7,430,000)
I guess I can tell my writers that empirical evidence overwhelmingly favors the first. (And, may I add, whew.)
checkbox (9,880,000) versus check box (7,670,000)
[Original conclusion, since revised:] It appears that common high-tech usage has not yet had broad influence.
Aha. Now I can go back to our editorial committee and tell them to get with the 21st century.
Update: Someone pointed out (see Comments) that I was searching for "check+box" (two words on the same page), not for the literal string "check box". Stats updated, conclusion updated. (I thought I'd looked it up as a literal, but I guess not.)
collectable (7,600,000) versus collectible (15,900,000)
Bet you didn't think that one would be this close, did you?
canceling (5,210,000) versus cancelling (3,950,000)
Ooh, close one. Brits, obviously.
through (2,350,000,000) versus thru (35,600,000)
I'm sure purists everywhere are relieved.
cachable (95,300) versus cacheable (362,000)
This, if I read it right, contradicts what we're told in our corporate styleguide (2,100,000) or style guide (25,900,000).
dialog (71,900,000) versus dialogue (124,000,000)
The former, I would bet, is both more American and far, far more prevalent in computing ("dialog box").
vendor (172,000,000) versus vender (8,980,000)
hiccup (2,060,000) versus hiccough (166,000)
Well, that seems pretty clear.
donut (3,330,000) versus doughnut (1,760,000)
The historical spelling is the loser here.
I don't think there are too many surprises in there, and nothing that would contradict what a reasonably contemporary authority would say. But of course the point is that Google cares not a whit for what authorities say (the learnèd opinions of an august "Usage Panel," say); it's going entirely by what people actually use. Then again, do people actually use what an authority (AHD, for example) says they should? Well, mostly, but people do what they want, and no one is going to put lite back into the can.
[1] In finding a link for DWIM, I ran across the Wikipedia entry, which says this: "Obviously, no real-life implementations of DWIM exist for any platform." I wonder if one could say that Google goes some way toward refuting this statement.
10 comments:
English, at least when I learnt (UK) / learned (US) it, distinguishes between "dependent":
the value of this post is dependent upon your reading it;
and "dependant":
my son is my dependant.
So, simply counting usages is not a complete test of the "accepted" spelling!
In the US, "dependant" is considered a variation of "dependent":
http://dictionary.reference.com/search?q=dependant
Merriam-Webster's Dictionary of Law (http://dictionary.reference.com/search?q=dependent) defines "dependent" as:
"relying on another for esp. financial support b : lacking the necessary means of support or protection and in need of aid from others (as a public agency) 'have the child declared dependent and taken away from his or her parents —L. H. Tribe'"
The basis for these comparisons is not what anyone would consider rigorous. It's just for fun ...
I noticed that The New Yorker magazine (I'm not even sure I should capitalize "the"!) started using "vender" vice "vendor" a few years ago. I don't know why, and I just cannot get used to it. Hot dog vender? Street vender? "The Peanut Vender"? I don't think so.
Not surprisingly, Google wins its own Googlefight against googol.
Re: Light vs. Lite and Night vs. Nite
"Light" has many more meanings than "Night" does. One general shortcoming of the "googlefight" method is that homonymical usage ("turn on the lights!", "new light whipped cream") isn't differentiated. "Lite" may be winning over "light" as an adjective, but that may be invisible to Google (and GoogleFight).
I realize this post was made years ago, but just wanted to add that "dialog" is also the spelling for the same word in the German language.
Hence one would have to consider omitting the results of 'dialog' on websites written in German.
Obviously, the point of your post was that this does not happen in the results used for comparison - but it can mislead results.
The same is likely to be true for any other words from another language sharing spelling.
If you select a preferred language (e.g. in Google), don't the results of the search reflect this filter?
When I originally commented I clicked the "Notify me when new comments are added" checkbox and now each time a comment is added I get four e-mails with the same comment. Is there any way you can remove people from that service? Appreciate it!
Hi -- sorry about the notification repetition. I think that must be some flaw with Blogger; I don't really know of any setting to control it, and certainly don't know of any setting that specifies that notifications should go out more than once. :-(
i don't usually write posts or comments on articles but your blog was so convincing and is written with such diligence i had to praise. great work !