Using Gzip To Identify Authors Of Text

from the odd-uses dept

Wed, Jan 30th 2002 03:13pm - Mike Masnick

ABCNews has an article about some researchers who have figured out a way to use Gzip to identify the authors of text files. They point out that in compressing a document, Gzip has to learn about it to figure out what it can compress, and it can then use what it learns to identify similar documents. The researchers ran a test where they were 93% successful in identifying the authors of sample texts using this process, but it only had to choose from 11 different authors. Something about it sounds fishy to me, though. It’s unclear from the description of the study whether there was any sort of control group. Even the guy who wrote Gzip is skeptical.
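The article does not spell out the researchers' exact procedure, so the following is only a rough sketch of one way the idea could work, not their method. It assumes you append the unknown sample to a known text by each candidate author and see how many extra compressed bytes the sample costs; the fewest extra bytes suggests the closest match. The names `extra_bytes` and `guess_author` are made up for illustration, and Python's `zlib` stands in for Gzip because both use the DEFLATE algorithm.

```python
import zlib


def extra_bytes(reference: bytes, sample: bytes) -> int:
    """Compressed bytes the sample adds when appended to the reference."""
    base = len(zlib.compress(reference, 9))
    combined = len(zlib.compress(reference + sample, 9))
    return combined - base


def guess_author(corpora: dict[str, bytes], sample: bytes) -> str:
    """Pick the author whose reference text makes the sample cheapest to compress.

    `corpora` maps a (hypothetical) author name to known text by that author.
    """
    return min(corpora, key=lambda author: extra_bytes(corpora[author], sample))
```

One caveat with this sketch: DEFLATE only looks back over a 32 KB window, so only the tail end of a long reference text would actually influence the result.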

Comments on “Using Gzip To Identify Authors Of Text”

mhh5

January 30, 2002 at 5:59 pm

utter crap.

Correct me if I’m wrong, but gzip doesn’t “learn” anything. While it may do some simple pattern matching, I don’t think it has anywhere near the “learning capabilities” to distinguish more than a few “trained” samples.

This is just a statistical anomaly. Gzip is not a magic alternative to artificial intelligence. Shame on masnick for not slamming this article harder. Didn’t you TA a stat class? Isn’t this just a case of poor sampling size?

January 30, 2002 at 7:01 pm

Re: utter crap.

The capabilities sound a bit oversold to me, but there’s probably something to it. Unfortunately trying to connect to the web servers at a university in Italy hasn’t been very fruitful, but I can surmise that the gzip compression ratio is used as a measure of the entropy in the text. With only a single number to go on, you couldn’t pick the author out of a large population, but it might be useful in deciding whether something was written by Author X or Author Y.
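For what it's worth, the compression-ratio guess above is easy to try. This is a minimal sketch under that assumption only, not the researchers' actual code: `compression_ratio` is a made-up helper name, and `zlib` is used in place of the gzip command because both rely on DEFLATE.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means more redundancy."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(zlib.compress(data, 9)) / len(data)
```

Highly repetitive input yields a ratio close to zero, while input with no repeated patterns stays near (or slightly above) one. Two authors could easily land on the same number, which is why a single ratio seems too weak to separate many candidates.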
