Using Gzip To Identify Authors Of Text

from the odd-uses dept

An article at ABCNews describes how some researchers have figured out a way to use Gzip to identify the authors of text files. They point out that in compressing a document, Gzip has to learn about it to figure out what it can compress, and it can then use what it has learned to identify similar documents. The researchers ran a test in which they were 93% successful in identifying the authors of sample texts using this process, but the program only had to choose from 11 different authors. Something about it sounds fishy to me, though. It’s unclear from the description of the study whether there was any sort of control group. Even the guy who wrote Gzip is skeptical.
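For a rough idea of how this might work, here is a minimal sketch in Python. It is not the researchers' code, and the article does not spell out their exact procedure. The assumption here is that you append the unknown sample to a known text by each candidate author, compress the result, and see how many extra bytes the sample costs; the author whose text makes the sample cheapest to compress is the guess. The function names (`compressed_size`, `extra_bytes`, `guess_author`) are made up for illustration, and Python's `zlib` module stands in for the gzip command since both use the same DEFLATE compression.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Return the number of bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: str, sample: str) -> int:
    """Estimate how many additional compressed bytes `sample` costs
    when it is appended to `reference`.

    Assumption: a smaller value means the compressor found more of the
    sample's patterns already present in the reference text.
    """
    ref = reference.encode("utf-8")
    both = (reference + sample).encode("utf-8")
    return compressed_size(both) - compressed_size(ref)


def guess_author(references: Mapping[str, str], sample: str) -> str:
    """Pick the candidate whose reference text makes the sample cheapest to compress."""
    return min(references, key=lambda name: extra_bytes(references[name], sample))


if __name__ == "__main__":
    # Toy, made-up "authors" purely for demonstration.
    known = {
        "author_a": "the quick brown fox jumps over the lazy dog " * 20,
        "author_b": "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20,
    }
    print(guess_author(known, "the quick brown fox jumps over the lazy dog"))
```

One thing to keep in mind with this sketch: DEFLATE only looks back over a 32 KB window, so only the tail end of a long reference text can actually help compress the sample.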



Comments on “Using Gzip To Identify Authors Of Text”

2 Comments
mhh5 says:

utter crap.

Correct me if I’m wrong, but gzip doesn’t “learn” anything. While it may do some simple pattern matching, I don’t think it has anywhere near the “learning capabilities” to distinguish more than a few “trained” samples.

This is just a statistical anomaly. Gzip is not a magic alternative to artificial intelligence. Shame on masnick for not slamming this article harder. Didn’t you TA a stat class? Isn’t this just a case of poor sampling size?

Ed says:

Re: utter crap.

The capabilities sound a bit oversold to me, but there’s probably something to it. Unfortunately trying to connect to the web servers at a university in Italy hasn’t been very fruitful, but I can surmise that the gzip compression ratio is used as a measure of the entropy in the text. With only a single number to go on, you couldn’t pick the author out of a large population, but it might be useful in deciding whether something was written by Author X or Author Y.
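The "single number" idea in the comment above can be sketched in a few lines of Python. This is only a guess at what such a measure might look like, not anything taken from the study: it treats compressed size divided by original size as a crude stand-in for the entropy of a text. The function name `compression_ratio` is hypothetical, and `zlib` is used in place of the gzip command because both rely on DEFLATE.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size divided by original size for `text`.

    Assumption: a lower ratio means more redundancy (lower entropy),
    a higher ratio means the text is harder to compress.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a ratio for empty text")
    return len(zlib.compress(raw, 9)) / len(raw)


if __name__ == "__main__":
    repetitive = "abc " * 500
    varied = "The quick brown fox jumps over the lazy dog while the cat sleeps."
    print(compression_ratio(repetitive))
    print(compression_ratio(varied))
```

Because this boils a whole text down to one number, two very different writers could easily land on similar ratios, which is in line with the doubt about picking an author out of a large population.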
