Can A Computer Pick Out Fake Online Reviews When Humans Can't?

from the sounds-like-it dept

It's no surprise that there are a ton of "fake" reviews online of just about anything that can be reviewed. Businesses, hotels, authors, musicians, etc., all want to make sure that people see good reviews when they go searching for whatever is being sold. But, of course, that's a problem for consumers who rely on such reviews... and for the sites that host them and want them to be as accurate as possible. So it's fascinating to see that some researchers at Cornell (yes, my alma mater) were able to come up with an algorithmic way to figure out which reviews are fake. You can read the full paper here (pdf). It's only 11 pages.

The method was pretty clever. First, they used Mechanical Turk to create 400 faked 5-star reviews of Chicago hotels:
To solicit gold-standard deceptive opinion spam using AMT, we create a pool of 400 Human-Intelligence Tasks (HITs) and allocate them evenly across our 20 chosen hotels. To ensure that opinions are written by unique authors, we allow only a single submission per Turker. We also restrict our task to Turkers who are located in the United States, and who maintain an approval rating of at least 90%. Turkers are allowed a maximum of 30 minutes to work on the HIT, and are paid one US dollar for an accepted submission.

Each HIT presents the Turker with the name and website of a hotel. The HIT instructions ask the Turker to assume that they work for the hotel's marketing department, and to pretend that their boss wants them to write a fake review (as if they were a customer) to be posted on a travel review website; additionally, the review needs to sound realistic and portray the hotel in a positive light. A disclaimer indicates that any submission found to be of insufficient quality (e.g., written for the wrong hotel, unintelligible, unreasonably short, plagiarized, etc.) will be rejected.
Then, of course, they need "real" reviews. But since part of the issue is that many "real" reviews are faked, the team did their best to find a bunch of real reviews from TripAdvisor, by narrowing them down based on a few factors:
For truthful opinions, we mine all 6,977 reviews from the 20 most popular Chicago hotels on TripAdvisor. From these we eliminate:
  • 3,130 non-5-star reviews;
  • 41 non-English reviews;
  • 75 reviews with fewer than 150 characters since, by construction, deceptive opinions are at least 150 characters long...
  • 1,607 reviews written by first-time authors -- new users who have not previously posted an opinion on TripAdvisor -- since these opinions are more likely to contain opinion spam, which would reduce the integrity of our truthful review data...
Finally, we balance the number of truthful and deceptive opinions by selecting 400 of the remaining 2,124 truthful reviews, such that the document lengths of the selected truthful reviews are similarly distributed to those of the deceptive reviews. Work by Serrano et al. (2009) suggests that a log-normal distribution is appropriate for modeling document lengths. Thus, for each of the 20 chosen hotels, we select 20 truthful reviews from a log-normal (left-truncated at 150 characters) distribution fit to the lengths of the deceptive reviews.
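For the curious, the numbers in that excerpt do add up, and the length-matching step is easy to sketch. What follows is a rough Python illustration, not the researchers' code: the function names are made up, the log-normal fit is a naive one (mean and standard deviation of the log lengths, ignoring the truncation), and picking the nearest-length review for each sampled target is just one plausible way to do the selection.

```python
import math
import random
import statistics

# Counts quoted from the paper's filtering steps.
TOTAL_REVIEWS = 6977
REMOVED = {
    "non-5-star": 3130,
    "non-English": 41,
    "under 150 characters": 75,
    "first-time authors": 1607,
}
remaining = TOTAL_REVIEWS - sum(REMOVED.values())  # 2124, matching the excerpt

MIN_CHARS = 150  # left-truncation point mentioned in the excerpt


def fit_lognormal(lengths):
    """Naive log-normal fit: mean and sample std dev of the log lengths."""
    logs = [math.log(n) for n in lengths]
    return statistics.mean(logs), statistics.stdev(logs)


def sample_truncated_length(mu, sigma, rng, minimum=MIN_CHARS):
    """Draw one length from the fit, rejecting draws below the minimum."""
    while True:
        value = rng.lognormvariate(mu, sigma)
        if value >= minimum:
            return value


def select_matching(truthful_lengths, deceptive_lengths, k, seed=0):
    """Pick k distinct truthful reviews (by index) whose lengths track
    targets sampled from the fit to the deceptive lengths.

    Assumption: "nearest available length" stands in for whatever
    selection rule the authors actually used.
    """
    rng = random.Random(seed)
    mu, sigma = fit_lognormal(deceptive_lengths)
    available = dict(enumerate(truthful_lengths))
    chosen = []
    for _ in range(k):
        target = sample_truncated_length(mu, sigma, rng)
        best = min(available, key=lambda i: abs(available[i] - target))
        chosen.append(best)
        del available[best]  # each review can be selected only once
    return chosen
```

Run once per hotel with k=20, something like this would yield the 400 truthful reviews (20 hotels times 20 reviews) described in the excerpt.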
They then test how well humans can tell the two kinds of reviews apart, and discover that they can't: human accuracy was only slightly above 50%, about what you'd get from guessing on an evenly split set. However, they then work out algorithmic ways of distinguishing the "real" reviews from the fake ones, and come up with a system that is 90% accurate in picking out which reviews are which. Apparently, while humans can't pick out the differences, faked reviews have some common characteristics:
We observe that truthful opinions tend to include more sensorial and concrete language than deceptive opinions; in particular, truthful opinions are more specific about spatial configurations (e.g., small, bathroom, on, location). This finding is also supported by recent work by Vrij et al. (2009) suggesting that liars have considerable difficulty encoding spatial information into their lies. Accordingly, we observe an increased focus in deceptive opinions on aspects external to the hotel being reviewed (e.g., husband, business, vacation)...

[....]

... we find increased first person singular to be among the largest indicators of deception, which we speculate is due to our deceivers attempting to enhance the credibility of their reviews by emphasizing their own presence in the review.
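To make those cues concrete, here is a toy Python sketch that counts them in a review. It is emphatically not the classifier from the paper, which this post doesn't describe; the word lists are assumptions seeded with the examples in the quote, and `cue_rates` is a hypothetical name.

```python
import re

# Illustrative word lists. Only the spatial and external examples come from
# the quoted passage; the pronoun list is an assumption.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
SPATIAL = {"small", "bathroom", "on", "location"}
EXTERNAL = {"husband", "business", "vacation"}


def cue_rates(review):
    """Return the share of tokens that fall in each cue category."""
    tokens = re.findall(r"[a-z']+", review.lower())
    total = len(tokens) or 1  # avoid dividing by zero on empty input

    def rate(vocab):
        return sum(token in vocab for token in tokens) / total

    return {
        "first_person": rate(FIRST_PERSON_SINGULAR),
        "spatial": rate(SPATIAL),
        "external": rate(EXTERNAL),
    }
```

Going by the quoted findings, a higher first-person or external share would lean toward "deceptive" and a higher spatial share toward "truthful", though counts this crude would only ever be a weak signal on their own.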
Obviously, it's just one bit of research, but apparently those involved in it have been contacted by... well, just about everyone doing online reviews. Hopefully this means that we're not too far off from better quality online reviews.


Reader Comments


  1.  
    Rich Kulawiec, Aug 22nd, 2011 @ 4:06pm

    There are other ways to do this...

    ...using methods that these researchers either overlooked, omitted, or are not aware of. Hint: combinations of IP, network block, ASN, passive OS fingerprinting, browser fingerprinting, etc. are quite effective in dealing with comment/blogspam. The problem isn't that these methods don't work; the problem is that very few people know how to use them.

     


  2.  
    Squirrel Brains (profile), Aug 22nd, 2011 @ 4:25pm

    I experimented with being a Mechanical Turk (wanted to see what it was all about). I saw several HITs like this one, but I never did them. There are a lot of shady HITs and I never wanted to do any that I thought would contribute to spam (there are a lot of HITs where you read something and then type it into various fields). The higher the payout, the more suspicious the tasks (usually). It is interesting to see it used this way, though I would never participate.

     


  3.  
    Pitabred (profile), Aug 22nd, 2011 @ 4:26pm

    Re: There are other ways to do this...

    What if a company tells its employees to go home at night and write a positive review for them? Not sure how all your fancy filtering would stop that, because every employee would likely pass all of those tests you set up. At least not unless you made it block all kinds of legitimate feedback.

     


  4.  
    mischab1, Aug 22nd, 2011 @ 4:29pm

    And this will be true until the fake reviewers find out and modify their behavior to match. :-)

     


  5.  
    Lawrence D'Oliveiro, Aug 22nd, 2011 @ 4:29pm

    Now All We Need ...

    ... is a computer algorithm to generate can't-be-distinguished-from-genuine fake reviews.

     


  6.  
    out_of_the_blue, Aug 22nd, 2011 @ 4:49pm

    "...we eliminate: * 3,130 non-5-star reviews;"

    Nothing elitist here.

    Anyone who even reads online "reviews" is a ninny and a sucker.

     


  7.  
    blown_out, Aug 22nd, 2011 @ 4:59pm

    Re:"...we eliminate: * 3,130 non-5-star reviews;"

    Anyone who doesn't read online "reviews" is a ninny and sucker. There, that makes as much sense and is as backed-up with data as your post.

     


  8.  
    blaktron (profile), Aug 22nd, 2011 @ 5:37pm

    What they are really doing is learning how to build a better lie...

     


  9.  
    Anonymous Coward, Aug 22nd, 2011 @ 5:42pm

    Re: Re: There are other ways to do this...

    I'd sell my mother for enough money.

     


  10.  
    That Anonymous Coward (profile), Aug 22nd, 2011 @ 6:08pm

    I find this review to be amazing, 5 stars.
    From the gleaming exterior to the spacious inside this is the right blog for you.
    My partner was impressed with the attentiveness of the staff and the expansiveness of the offerings.

     


  11.  
    Anonymous Coward, Aug 22nd, 2011 @ 6:48pm

    Re: "...we eliminate: * 3,130 non-5-star reviews;"

    Are you an idiot? They eliminated non-5-star reviews because they wanted to compare real 5-star reviews to fake 5-star reviews.

    I know you want to shoehorn elitist conspiracies into every single comment you make about every single post, but it doesn't work.

     


  12.  
    Overcast (profile), Aug 22nd, 2011 @ 10:08pm

    What they are really doing is learning how to build a better lie...

    I kinda had that thought too. Who's going to govern what the rules are on the filtering? Probably whoever's paying for it, which makes perfect sense.

    This may convert opinions and reviews into conceptual paid content from the company who hosts them. Immediately this brings the question of bias into the picture.

    Consumers will see it as 'controlled' and as factual as any paid advertisement... might be. Then we get to company reputation - and many are lacking sorely there. It seems the concept now is if they all suck - then consumers have no choice.

     


  13.  
    Slicerwizard, Aug 23rd, 2011 @ 1:32am

    Re: Re: "...we eliminate: * 3,130 non-5-star reviews;"

    "Are you an idiot?"

    The evidence strongly suggests that yes, old blueballs certainly is.

     


  14.  
    Mike Masnick (profile), Aug 23rd, 2011 @ 2:18am

    Re: "...we eliminate: * 3,130 non-5-star reviews;"

    Nothing elitist here.


    You're correct. There's nothing elitist there at all. The whole project was around comparing 5-star reviews, so of course it makes sense to eliminate all other reviews.

    Not sure how anyone could turn that into a statement on elitism.

    Perhaps you misread?

     


  15.  
    Kevin (profile), Aug 23rd, 2011 @ 5:27am

    The Human Coefficient

    This is interesting (at least for the time being, until other commenters are proven correct and the fake-review cottage industry learns how to "beat" this) for humans as well, because it provides some interesting human-applicable strategies for bumping up that 50% (i.e., look for first-person, weight spatial descriptions more heavily, etc.).

     


  16.  
    Anonymous Coward, Aug 23rd, 2011 @ 5:51am

    Slightly Off-Topic

    Is the term "Turk" important/meaningful when it comes to AI?

    I ask this because the only "Turk" I know (that's related to AI - I know that a country called Turkey exists) is from Terminator: The Sarah Connor Chronicles, so I was wondering if the term was inspired by the series.

     


  17.  
    Anonymous Coward, Aug 23rd, 2011 @ 7:03am

    Re: Slightly Off-Topic

    TSCC was referencing the same thing that Amazon is:

    http://en.wikipedia.org/wiki/The_Turk

    (Also, that show was terrible. Terminator ended with T2. Cameron said so)

     


  18.  
    Rich Kulawiec, Aug 23rd, 2011 @ 7:08am

    Re: Re: There are other ways to do this...

    Of course, a single paragraph is only enough to just mention the techniques used -- it's not a full exposition of the methodology. Suffice it to say that these used in combination are quite, quite effective when used properly.

     


  19.  
    The Groove Tiger (profile), Aug 23rd, 2011 @ 8:04am

    Re: "...we eliminate: * 3,130 non-5-star reviews;"

    They also eliminated reviews with less than 150 words. They're clearly trying to discriminate against people who are too poor to afford more words.

     


  20.  
    Anonymous Coward, Aug 23rd, 2011 @ 9:52am

    I love it!

    If the same sort of standards were used to decide what was and was not copyright violations on a file site, you guys would be shitting yourselves and yelling about free speech and all that.

    It's amazing to watch you guys go sometimes!

     


  21.  
    Dohn Joe, Aug 23rd, 2011 @ 11:19pm

    Another Leap Forward

    This is actually a huge leap forward if you think about the abstract implications: computer algorithms which can tell you if someone is lying!!

     


  22.  
    hhotelconsult, Aug 24th, 2011 @ 2:16pm

    Filter bubbles

    Anyone engaged in this conversation, ESPECIALLY the author, needs to read Eli Pariser's book "The Filter Bubble: What the internet is hiding from you", basically about auto-filtering and personalization.

    It's vital, it's on point, and it addresses just how frightening this concept is.... especially if the algorithm automates and doesn't really provide any real logic.

    Just read it.

     


