Faux Randomness Strikes Again: How Researchers Realized Research 2000's Daily Kos Data Looked Faked

from the random-ain't-so-random dept

You may have heard by now that the political website Daily Kos has come out and explained that it believes the polling firm it has used for a while, Research 2000, was faking its data. While it's nice to see a publication come right out and bluntly admit that it had relied on data that it now believes was not legit, what's fascinating if you're a stats geek is how a team of stats geeks figured out there were problems with the data. As any good stats nerd knows, the concept of "randomness" isn't quite as random as some people think, which is why faking randomness almost always leaves tell-tale signs that the data was faked or manipulated. For example, one very common test is to use Benford's Law to look at the first digit of each value in a data set, because in many real-world data sets that span several orders of magnitude, the leading digits are not evenly distributed the way people usually expect: 1 shows up as the first digit about 30% of the time.
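
For the curious, here is a rough sketch of what a Benford's Law check looks like. The expected share of values with leading digit d is log10(1 + 1/d). The function names and the sample data (powers of two, a sequence known to follow the law closely) are my own illustrative choices and have nothing to do with the R2K polls. Note also that poll percentages, which sit in a narrow range, would not necessarily be expected to follow Benford's Law at all.

```python
import math
from collections import Counter


def benford_expected():
    """Benford's Law: P(first digit = d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_shares(values):
    """Observed share of each leading digit among positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Illustrative data only: powers of two, not anything from the R2K polls.
powers_of_two = [2 ** n for n in range(1000)]
expected = benford_expected()
observed = first_digit_shares(powers_of_two)

for d in range(1, 10):
    print(f"{d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

Someone inventing numbers by hand tends to spread the leading digits out far more evenly than this, which is the kind of mismatch such a test looks for.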

In this case, the three guys who had problems with the data (Mark Grebner, Michael Weissman, and Jonathan Weissman) zeroed in on just a few clues that the data was faked or manipulated. The first thing they noticed was that when R2K did polls that tested how men and women viewed certain politicians or political parties (favorable/unfavorable) there was an odd pattern: if the percentage of men that rated a particular politician favorable or unfavorable was an even number, so was the percentage of women. The two numbers almost always matched up. If the male percentage was even, the female percentage was even. If the male percentage was odd, the female percentage was odd. Yet, as you should know, the men and the women in a poll are separate samples, so whether one rounded percentage comes out odd or even should be independent of the other. That 34% of men find a particular politician favorable should have no bearing on whether an even percentage of women find that politician favorable. In fact, the match showed up in almost every such poll that R2K did, often enough that getting it by chance is as close to impossible as you can imagine:

"Common sense says that that result is highly unlikely, but it helps to do a more precise calculation. Since the odds of getting a match each time are essentially 50%, the odds of getting 776/778 matches are just like those of getting 776 heads on 778 tosses of a fair coin. Results that extreme happen less than one time in 10^228. That's one followed by 228 zeros. (The number of atoms within our cosmic horizon is something like 1 followed by 80 zeros.) For the Unf, the odds are less than one in 10^231. (Having some Undecideds makes Fav and Unf nearly independent, so these are two separate wildly unlikely events.)

"There is no remotely realistic way that a simple tabulation and subsequent rounding of the results for M's and F's could possibly show that detailed similarity. Therefore the numbers on these two separate groups were not generated just by independently polling them."

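
If you want to check the coin-toss arithmetic in that quote yourself, a minimal sketch follows. It assumes, as the quote does, that each match is a 50/50 event, and it computes the exact chance of at least 776 matches in 778 tries. The function name is mine, not something from the original analysis.

```python
from fractions import Fraction
from math import comb, log10


def tail_probability(matches, trials):
    """Exact P(X >= matches) for X ~ Binomial(trials, 1/2)."""
    favorable = sum(comb(trials, k) for k in range(matches, trials + 1))
    return Fraction(favorable, 2 ** trials)


p = tail_probability(776, 778)

# math.log10 accepts arbitrarily large ints, so take the log of the
# numerator and denominator separately instead of converting p to a float.
log10_p = log10(p.numerator) - log10(p.denominator)

print(f"log10(probability) = {log10_p:.2f}")
print("less than 1 in 10^228:", p < Fraction(1, 10 ** 228))
```

Using exact fractions instead of floating point keeps the comparison against 1 in 10^228 free of rounding concerns.
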
The other statistical analysis that I found fascinating was that when you looked at weekly changes in favorability ratings, the R2K data almost always changed a bit. But, if you look at other data, no change is the most common result. As they point out, if you look at, say, Gallup data, you get this nice typical bell curve:
But if you look at the R2K data, you get things like the following:

Notice that substantial dip at the 0% mark. That seems to indicate a likelihood of faked or manipulated data, from someone who thinks that "random" data means the data has to keep changing. Back to Grebner, Weissman and Weissman:
"How do we know that the real data couldn't possibly have many changes of +1% or -1% but few changes of 0%? Let's make an imaginative leap and say that, for some inexplicable reason, the actual changes in the population's opinion were always exactly +1% or -1%, equally likely. Since real polls would have substantial sampling error (about +/-2% in the week-to-week numbers even in the first 60 weeks, more later) the distribution of weekly changes in the poll results would be smeared out, with slightly more ending up rounding to 0% than to -1% or +1%. No real results could show a sharp hole at 0%, barring yet another wildly unlikely accident."

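
That thought experiment is easy to simulate. The sketch below is my own illustration, not the researchers' code, and it rests on some assumptions: I read the "+/-2%" as a standard deviation of 2 points on the week-to-week difference, I model the sampling error as Gaussian, and I pick the underlying favorability level at random between 30% and 70%. The true change each week is forced to be exactly +1 or -1, and each week's poll number is rounded to a whole percent before the change is taken.

```python
import math
import random
from collections import Counter


def round_half_up(x):
    """Round to the nearest whole percent."""
    return math.floor(x + 0.5)


def simulate_weekly_changes(weeks=200_000, true_step=1.0, diff_noise_sd=2.0, seed=0):
    """Count reported week-to-week changes when the true change is always +/-1."""
    rng = random.Random(seed)
    # Two independent polls each contribute noise, so each one gets sd / sqrt(2).
    per_poll_sd = diff_noise_sd / math.sqrt(2)
    counts = Counter()
    for _ in range(weeks):
        level = rng.uniform(30.0, 70.0)             # assumed true level last week
        step = rng.choice((-true_step, true_step))  # true change is exactly +/-1
        last_week = round_half_up(level + rng.gauss(0, per_poll_sd))
        this_week = round_half_up(level + step + rng.gauss(0, per_poll_sd))
        counts[this_week - last_week] += 1
    return counts


counts = simulate_weekly_changes()
for change in range(-4, 5):
    print(f"{change:+d}%: {counts[change] / sum(counts.values()):.3f}")
```

Even with a true change that is never zero, the reported changes pile up around 0%, with 0% coming out a bit more common than either +1% or -1%. That is the opposite of the dip in the R2K numbers.
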
Kos is apparently planning legal action, and so far R2K hasn't responded in much detail other than to claim that its polls were conducted properly. I'm not all that interested in that part of the discussion, however. I just find it neat how the "faux randomness" may have exposed the problems with the data.


Reader Comments

  •  
    interval, Jun 29th, 2010 @ 3:29pm

    Look at the poll results

    They're obviously what R2K thought DKos wanted to see. It speaks to DKos' credit that they immediately dis-associated themselves from R2K. Good for them, even if I don't agree with their politics.

     

  •  
    Stuart, Jun 29th, 2010 @ 3:40pm

    All polls are bullshit. Whether they just fake the numbers or they rig the questions. Left polls are normally good for the left and right polls are normally good for the right. Fuck pollsters.

     

  •  
    Anonymous Coward, Jun 29th, 2010 @ 3:57pm

    Update

    Nate Silver was one of the people involved in this statistical evaluation of the polls, and he was promptly sent a cease and desist order regarding his efforts to "discredit and damage [Research 2000] by posting negative comments."

    http://www.fivethirtyeight.com/2010/06/research-2000-issues-cease-desist.html

    So - if you do a statistical analysis of the problems with a company, and publish those results, you may damage "the company's reputation and the company's existing and prospective business relationships." Actually, I would imagine that is exactly what will happen -- but it sure isn't illegal to do that.

     

    •  
      Anonymous Coward, Jun 29th, 2010 @ 4:04pm

      Re: Update

      Let's hope the Streisand Effect kicks into full gear.

       

    •  
      Anonymoose, Jun 29th, 2010 @ 4:35pm

      Re: Update

      Streisand effect indeed. I would not have heard about this but for the horrible letter from Howrey sent to Nate Silver. You should really do a post on that letter, although that will likely prompt them to send you a cease and desist letter in response.

      Seriously, though - Howrey? R2K is lawyering-up big time. Hope that Kos and Nate have good lawyers of their own.

       

      •  
        Richard, Jun 30th, 2010 @ 1:58am

        Re: Re: Update

        "Seriously, though - Howrey? R2K is lawyering-up big time. Hope that Kos and Nate have good lawyers of their own."

        The fact that you (and R2K) think that hiring expensive lawyers could have any possible effect on the outcome here shows how ridiculous the legal system is.

        A lawyer giving a guilty criminal the best possible defence is one thing.

        A Lawyer standing up in court to defend the proposition that 2+2=5 is quite another.

        One would hope that the expensive lawyers would just tell R2K that they have no chance and refuse to take the case - but of course they won't.

         

    •  
      abd gum, Jun 29th, 2010 @ 5:55pm

      Re: Update

      "if you do a statistical analysis of the problems with a company, and publish those results, you may damage"

      I would think the damage was already done.

       

  •  
    bob, Jun 29th, 2010 @ 3:57pm

    LOL

    The KOS was taken by BS that matched their BS, LOL

     

  •  
    Anonymous Coward, Jun 29th, 2010 @ 4:03pm

    "Yet, as you should know, these are independent variables, not influenced by each other."

    I wouldn't say they are independent variables, after all, both men and women are people and have similar brains (I know, I know, I'm going to get criticized for this one. Men are from mars and women are from Venus, blah blah blah). Still, if they are really really close to each other consistently that is certainly suspicious.

     

    •  
      Richard, Jun 30th, 2010 @ 2:09am

      Re:

      "I wouldn't say they are independent variables,"

      The oddness/evenness of the percentages certainly are - because they depend on your choice of percentages (base 10 to two significant digits) as a mechanism to express the data.

      That choice (although standard) is certainly independent of the data itself and effectively decouples the two pieces of data from each other in that respect.

      Supposing the sample was 100000 of each gender and the actual numbers were 66124 and 26223. If you choose percentages both are even, if you choose parts per thousand (3 digits) one is even and one is odd. If you choose parts per 10000 (4 digits) then both are even again.

      Whilst you are correct in saying that the actual data are interdependent, that particular aspect of the statistical expression of the data is independent.

       

      •  
        Richard, Jun 30th, 2010 @ 2:12am

        Re: Re:

        A simpler way to put the above is to say that whilst the data are potentially correlated the even/odd property of the percentages comes from the noise in the data. Whilst the data are correlated the noise is not.

         

    •  
      Drew, Jun 30th, 2010 @ 5:12am

      Re:

      They're independent in the sense that one result isn't directly influenced by the other. If I go out on the street and poll a hundred women, and then go out a day later and poll a hundred men, you wouldn't call my results dependent upon each other. They might be (in the general sense) somewhat similar or different based on your "mens and womens brains" theory, but that would have NOTHING to do with the trailing digit of each of my individual polls being odd or even...

       

    •  
      no one, Jun 30th, 2010 @ 8:44am

      Re:

      Look at what they are calling the variables again. The problem is not that an increase in one correlated with an increase in the other, but that they were both odd percentages or both even percentages...regardless of the direction of a shift in popularity. This is in no way influenced by the similarity of people's brains, as you put it. Also, if talking about polls, where human opinions are the variables, please refrain from making the argument "we're all alike because we're human."

       

    •  
      W Klink, Jun 30th, 2010 @ 9:23am

      Re:

      If I flip two quarters, the outcome of the first flip will not have any effect on the results of the second flip. The two flips are independent.

       

  •  
    Beta, Jun 29th, 2010 @ 5:25pm

    not even competent liars

    I find it very telling that not only did they cook the data, they did so in a way that any competent statistician would expect to leave obvious clues. It would have taken very little effort to model the real data (assuming there was any!) and then do random draws by computer; that would have eliminated both of these features. Instead they seem to have done the cooking by hand, and anyone who can pass a first semester course on the subject knows that doesn't give realistic results.

     

    •  
      Richard, Jun 30th, 2010 @ 2:16am

      Re: not even competent liars

      Yes - it shows that they were statistically illiterate. All they needed to do was to generate the fakes with a decent quality random number generator (many such are available in the GSL, for example) and no one would have known.

       

  •  
    CC, Jun 29th, 2010 @ 5:30pm

    improper design

    This didn't work because they dealt with the problem in an improper fashion. It probably would have worked if they made a fake population of voters who had set tastes, with a small chance to change their vote. A simple compilation of voting history per state could model this. Then, running a fake vote generation each week would give better fake results. They failed because they didn't model the system correctly. Voting isn't a random action.

     

  •  
    Jakob, Jun 30th, 2010 @ 1:43am

    Alternative explanation

    Alternative explanation: They only ask 50 men and 50 women each week, so all percentages are initially even. To hide their small sample size they add 1% to the favorable and deduct 1% from unfavorable every other week.

    This would explain both the odd/even coincidence and the data changing by at least 1% per week way too often.

     

  •  
    Robert Speirs, Jun 30th, 2010 @ 2:28am

    The rest of the story

    So, where's the rest of the story? What was the import of the faked numbers? I know! It favored George W. Bush. Otherwise kos never would have become suspicious.

     

    •  
      Drew, Jun 30th, 2010 @ 5:16am

      Re: The rest of the story

      Well, the thrust of this is that the Daily Kos probably devoted resources to certain races based upon the information they were getting from their pollster. And if they did that on the basis of fraudulent information from their pollster, Kos can definitely claim damages. And on top of that, they were paying a polling firm to poll, which it looks like they weren't actually doing...

       

    •  
      Lonnie, Jun 30th, 2010 @ 5:48pm

      Re: The rest of the story

      Actually, R2k tended to be a few percent higher in favor of Obama than most other polls.

       

  •  
    bobby b, Jun 30th, 2010 @ 3:31am

    Non-random hysteria

    "All polls are bullshit . . . Fuck pollsters."

    See, now, just like those R2K people, you're throwing facially bad data at us and then aggressively asserting unsupported conclusions supposedly made obvious by that data, and you're expecting us to . . . what? . . . sign on as your Mouseketeers?

    No, not all polls are BS. Just today, I asked my two kids if they would enjoy raw oysters for breakfast. They both said no. (Really loudly, too.) Served 'em anyway, and, Lo!, the poll turned out to be completely accurate! And that's all it took to disprove your data.

    And as for your conclusion - well, maybe I will, but it's going to depend on things like the gender of the pollster at hand, their relative desirability, their willingness and interest . . . point is, your assertion about "all polls" is not going to have one iota of influence on how part two comes out. (Ooo, bad pun. Sorry.)

     

    •  
      Richard, Jun 30th, 2010 @ 5:09am

      Re: Non-random hysteria

      "No, not all polls are BS. Just today, I asked my two kids if they would enjoy raw oysters for breakfast. They both said no. (Really loudly, too.) Served 'em anyway, and, Lo!, the poll turned out to be completely accurate! And that's all it took to disprove your data."

      Strictly that wasn't actually a poll (in the usual sense of opinion poll) - since your sample was the whole population.

      One could argue the meanings of words but...

       

      •  
        nasch, Jul 1st, 2010 @ 10:15am

        Re: Re: Non-random hysteria

        That's the most accurate poll possible, right? And why are we on the net in the first place if not to argue semantics, really? ;-)

         

  •  
    Daniel Feenberg, Jun 30th, 2010 @ 6:32am

    Lack of zero changes

    The lack of zero changes could easily be explained if they were rounding away from zero, rather than the more usual rounding towards zero.

     

    •  
      ChrisB, Jun 30th, 2010 @ 6:44am

      Re: Lack of zero changes

      "Rounding" usually refers to going to the nearest integer. "Truncating" is "rounding towards zero" and, in my experience, is rarely used.

       

    •  
      Richard, Jun 30th, 2010 @ 6:57am

      Re: Lack of zero changes

      "The lack of zero changes could easily be explained if they were rounding away from zero, rather than the more usual rounding towards zero."

      But the changes are not being rounded - it is the values that are being rounded. The changes are derived from the values that are already rounded. So that explanation does not work.

       

  •  
    Chester White, Jun 30th, 2010 @ 7:47am

    "I find it very telling that not only did they cook the data, they did so in a way that any competent statistician would expect to leave obvious clues."

    You don't need to be a statistician to know this was crap. I trained as a chemist and know squat about stats, but had I ever looked at this data (not being a Kossack I am never over there), I'd have seen it in an instant.

    All it takes is experience in the real world (in business, investing, whatever) to know that these data are obviously faked (or run through some totally bogus algorithm that forces the odd/even results [though why there were a couple of counterexamples if it was an honest algorithm error is pretty curious]).

     

  •  
    Challeron, Jun 30th, 2010 @ 12:01pm

    Re W Klink @ 28

    I don't understand why so many "geeks", who are supposed to be all about numbers, think that, even after 773 coin-flips turning up Heads, the odds of Flip #774 will be *astronomical*, instead of remaining 1:1.

    Markov, anyone?...

     

    •  
      Anonymous Coward, Jun 30th, 2010 @ 2:09pm

      Re: Re W Klink @ 28

      If someone flipped a coin heads 773 times in a row, I would bet that it is not a fair coin.

       

    •  
      nasch, Jul 1st, 2010 @ 10:18am

      Re: Re W Klink @ 28

      You're not saying this is related to the statistical analysis in this article, right? Just a vaguely related comment? Just making sure, because what you describe is not what these statisticians are talking about.

      To your question, I think good understanding of probability is really rare. Almost non-existent in the general American populace IMO, and possibly a minority even among geeks.

       

  •  
    nasch, Jul 1st, 2010 @ 10:20am

    Smarter crooks

    How long before people like this start cooking up numbers that include artifacts like appropriate number of zeros, more low starting digits than high, etc, and statistical analysis won't catch them anymore? I'm sure it's a *lot* harder to do, but there will be some situation that makes it worthwhile for someone. Maybe it's happening already and we just don't know it.

     

  •  
    fodder99, Jun 30th, 2012 @ 10:22am

    Exactly !

     
