Using Gzip To Identify Authors Of Text
from the odd-uses dept
ABCNews has an article about some researchers who have figured out a way to use Gzip to identify the authors of text files. They point out that in compressing a document, Gzip has to learn about it to figure out what it can compress, and it can then use what it has learned to identify similar documents. The researchers ran a test in which this process correctly identified the authors of sample texts 93% of the time, but it only had to choose from 11 different authors. Something about it sounds fishy to me, though. It’s unclear from the description of the study whether there was any sort of control group. Even the guy who wrote Gzip is skeptical.
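Here is one plausible reading of that description, and not necessarily the researchers' actual procedure. Compress a known sample from each candidate author, then compress the same sample with the unknown text appended, and count how many extra bytes the unknown text costs. The author whose sample makes the unknown text cheapest to compress is the guess. The sketch below uses Python's zlib module, which implements the same DEFLATE algorithm Gzip uses; the function names are made up for illustration.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of the DEFLATE-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: bytes, sample: bytes) -> int:
    """Extra compressed bytes needed when `sample` is appended to `reference`.

    A small value suggests the patterns found in `reference` also occur
    in `sample`.
    """
    return compressed_size(reference + sample) - compressed_size(reference)


def guess_author(references: Mapping[str, bytes], sample: bytes) -> str:
    """Return the candidate whose reference text makes `sample` cheapest to compress."""
    return min(references, key=lambda name: extra_bytes(references[name], sample))
```

Note that DEFLATE only looks back over a 32 KB window, so in this sketch only the tail end of each reference text can influence the result. With just a handful of candidate authors, even a crude measure like this has a decent chance of picking the right one, which is part of why the 93% figure is hard to judge.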
Comments on “Using Gzip To Identify Authors Of Text”
utter crap.
Correct me if I’m wrong, but gzip doesn’t “learn” anything. While it may do some simple pattern matching, I don’t think it has anywhere near the “learning capabilities” to distinguish more than a few “trained” samples.
This is just a statistical anomaly. Gzip is not a magic alternative to artificial intelligence. Shame on masnick for not slamming this article harder. Didn’t you TA a stat class? Isn’t this just a case of a too-small sample size?
Re: utter crap.
The capabilities sound a bit oversold to me, but there’s probably something to it. Unfortunately trying to connect to the web servers at a university in Italy hasn’t been very fruitful, but I can surmise that the gzip compression ratio is used as a measure of the entropy in the text. With only a single number to go on, you couldn’t pick the author out of a large population, but it might be useful in deciding whether something was written by Author X or Author Y.
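As a rough illustration of that single number, the compression ratio can be computed as below. This is only a sketch using Python's zlib module (DEFLATE, the algorithm gzip uses). The function name is hypothetical, and the ratio is a crude stand-in for entropy, not a description of what the study actually did.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size; lower means more redundancy."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must not be empty")
    return len(zlib.compress(raw, 9)) / len(raw)
```

Very short texts can come out with a ratio above 1 because of the fixed header overhead, so the number only means something for samples of reasonable and comparable length.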