Google To Newspapers: Here, Let Me Introduce You To Robots.txt

from the snappy dept

After last week's silly introduction of the AP's weird and totally unnecessary new data feed, meant to keep out aggregators and search engines, it seems that Google has gotten fed up. Google execs and employees have made similar statements on various panels and discussions, but now Senior Business Product Manager Josh Cohen has put up a blog post directed at newspapers that can be summarized as: Dear newspapers: let me introduce you to a tool that's been around forever. It's called robots.txt. If you don't like us indexing you, use it. Otherwise, shut up. Only in slightly nicer language.
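
For anyone who hasn't looked at one lately, here is a rough sketch of how a well-behaved crawler reads robots.txt, using the parser in Python's standard library. The two-line "block everyone" file is the one quoted from Google's post; the Googlebot-only variant, the `may_fetch` helper, and the example.com URL are made up for illustration and are not anything the AP or Google published.

```python
from urllib.robotparser import RobotFileParser

# The two-line file quoted in Google's post: every crawler, whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical variant: turn away one named crawler, let the rest in.
BLOCK_GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file contents as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story.html"  # placeholder URL
    print(may_fetch(BLOCK_ALL, "Googlebot", story))
    print(may_fetch(BLOCK_GOOGLEBOT_ONLY, "Googlebot", story))
    print(may_fetch(BLOCK_GOOGLEBOT_ONLY, "SomeOtherBot", story))
```

Note that this only models a crawler that chooses to ask. Nothing in the file forces anyone to check it.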


Reader Comments

  1.  
    Petréa Mitchell, Jul 16th, 2009 @ 9:17am

    The response I expect to hear...

    ...is that robots.txt is only a request. A polite aggregator will respect it, but the wicked, devious, pirating bloggers and scrapers that the AP is fighting an urgent battle against won't.

    And you have to acknowledge that there are occasional unscrupulous people who don't pay attention to robots.txt. So then the AP goes, "Ha, we were right!"

    So I think Google allowing itself to be drawn into an argument about technical details is a mistake here. Better to keep the focus on the disproportionate harm to positive, legitimate use caused by an attempt to guard against a small number of true pirates.

     

  2.  
    technomage, Jul 16th, 2009 @ 9:30am

    Re: The response I expect to hear...

    While I agree with your overall assumption, I know that those "unscrupulous" people will find ways around the paywalls as well. The newspapers are trying to make a blanket rule against everyone, but the problem is any blanket always has holes. The fact that these newspaper corps refuse to use the tools already at hand, but would rather force the government to get involved to make new rules and tools, shows just how out of touch they really are concerning technology. Creating new standards to benefit only one thing does not solve the underlying issue: People want news, they want it now, and they don't want to jump through hoops to get it.

     

  3.  
    Anonymous Coward, Jul 16th, 2009 @ 9:34am

    The reason the robots.txt argument is flawed is that the newspapers don't want Google to remove them from search. They want to show up in Google. They just want Google to pay them as well. It's nothing more than a money grab. Google says, "If you have good, useful content, we will rate you high in our index. You will get traffic and we will make a little money from ads on the search page." Newspapers say, "Sounds like a good deal. Except, we want you to pay us as well." Google says, "No, we don't think we should have to pay you. If you don't like the deal you can opt out." Newspapers say, "No, we like the deal, but you still have to pay us." Google says, "WTF."

     

  4.  
    Ryan Z, Jul 16th, 2009 @ 9:37am

    Re:

    Well, that doesn't make the argument flawed, does it? It just means the AP doesn't have a leg to stand on, except the politicians they've paid off, of course.

     

  5.  
    Yakko Warner, Jul 16th, 2009 @ 9:45am

    Re: The response I expect to hear...

    That may be true, but their argument will be with the aggregators that do not respect robots.txt, not with Google (which, according to the point of the blog, is a "polite aggregator" and respects robots.txt).

    Not that you can trust the newspapers not to confuse "any other random (misbehaving, non-robots.txt-respecting) search engine" with "Google".

    I more expect to see one or more of the following:

    * Newspapers write an invalid robots.txt file that ends up allowing Google to index their site, and they blame Google for their own technical ineptitude.

    * Newspapers complain that they have to write a robots.txt file or META tag on their page, and demand Google adopt an "opt-in" policy for indexing (rather than this "opt-out" policy that is robots.txt).

    * Some newspapers do everything correctly, stop their content from being indexed, and then blame Google when their traffic goes down (especially compared to the sites that aren't blocking Google, which end up getting more traffic)

    * Newspapers block their content from being indexed, but other papers or sites take the same stories and republish them and allow them to be indexed, and the original papers blame Google for indexing the republishing sites.

    * Newspapers ignore this blog post completely and simply continue to blame Google for indexing content they don't want indexed.

     

  6.  
    stat_insig, Jul 16th, 2009 @ 9:46am

    Re: The response I expect to hear...

    Well......... There is an easy solution if you don't want people to access your content..... keep it offline!

     

  7.  
    John Doe, Jul 16th, 2009 @ 9:46am

    Re:

    This is exactly what is going on. Newspapers are trying to get the government to point their guns at Google to make them hand over bags full of money. It is an attempt at "legal" extortion.

     

  8.  
    Ryan, Jul 16th, 2009 @ 9:51am

    yeah but

    the problem with robots.txt is that nothing in the specification involves people paying the AP.

    The AP doesn't want people to stop using their content - they want to change the way the web works so that they can be paid whenever they think they should be.

    They know that blocking google would be devastating to their industry, so instead they bitch and whine hoping that somebody will pay them to shut up.

     

  9.  
    duderino, Jul 16th, 2009 @ 9:54am

    sad

    It's just sad that Google has to keep making these kinds of public statements while the AP doesn't read them and keeps digging itself a bigger hole.

     

  10.  
    Anonymous Coward, Jul 16th, 2009 @ 9:55am

    Re:

    I completely agree with google here, these newspaper companies are being evil and selfish. If they don't want Google linking to them then MAKE A ROBOTS.TXT file or TELL google to remove them from their index. I'm sure Google will be more than glad to remove them from their index. But you can't force someone to use your product and then force them to pay for it. That's like when the RIAA tried to force people to buy their music (and not boycott it) and then they tried to force them to pay for it ( http://www.techdirt.com/articles/20090616/1527385253.shtml ). Nonsense.

     

  11.  
    Ryan, Jul 16th, 2009 @ 10:01am

    i wish

    I wish that Google would call the papers' bluff and completely remove them from search results, offering to re-include them only if and when they put up a robots.txt file.

    They'll never do it, users would complain about not being able to find their news, but man would it be hilarious.

     

  12.  
    Anonymous Coward, Jul 16th, 2009 @ 10:10am

    Where can I write to the politicians that these newspapers are lobbying? Everyone should write to them and explain the technology, and how absurd it is for these stupid newspapers to come crying to them for a money grab from Google.

     

  13.  
    Anonymous Coward, Jul 16th, 2009 @ 10:17am

    ACAP "protocol"

    have you looked through the ACAP document? it's like 40 pages long, and all they do is explain how to use robots.txt for the first 35 pages or so. then they introduce a few new tags for inline meta types and markup for robots.txt... while disallowing *.

    i encourage these guys to disallow * just so they can die faster and get replaced by better news outlets (newscientist, courthousenews, etc).

    besides, nothing prevents someone from using a spider that sets the user agent as one of the standard IE/FF user agent strings. then you're stuck taking a javascript or IP address route which are also both unreliable.

     

  14.  
    Anonymous Coward, Jul 16th, 2009 @ 10:18am

    Fine block your sites from Google - and I'll just find another - no big deal. That's the POINT of Google - finding another site.

     

  15.  
    Anonymous Coward, Jul 16th, 2009 @ 10:30am

    Re: i wish

    I have a better idea. Remove them from the search results/news/everything and make them pay to come back in.

     

  16.  
    Anonymous Coward, Jul 16th, 2009 @ 10:31am

    Google also does not respect robots.txt 100% of the time either, often indexing internal pages that are blocked by the robots file because of direct external links.

    Do no evil. Right.

     

  17.  
    Hulser, Jul 16th, 2009 @ 10:34am

    Re: Re:

    Well, that doesn't make the argument flawed, does it? It just means the AP doesn't have a leg to stand on

    Exactly. Google knows that the AP knows about robots.txt. So the purpose of the Google blog post is not to let the AP know about robots.txt, it's to let everyone else know that the AP knows about robots.txt, which will result in undermining the AP's arguments for a legislative "solution".

     

  18.  
    Anonymous Coward, Jul 16th, 2009 @ 10:43am

    Re:

    "Google also does not respect robots.txt 100% of the time either"

    Do you have any examples of this? Or are you just making things up.

     

  19.  
    Hulser, Jul 16th, 2009 @ 10:44am

    Re:

    Google also does not respect robots.txt 100% of the time

    You tell me which is a more compelling argument...

    A) We're really pissed off that Google is linking to our web site but we can't be bothered to implement a simple technical solution that would stop this.

    B) We don't want Google linking to our web site, but they're ignoring our configuration and linking to it anyway.

    Because the AP is choosing option A, it's all but irrelevant whether Google respects robots.txt 100% of the time. Right now, the ball is in the AP's court.

     

  20.  
    Anonymous Coward, Jul 16th, 2009 @ 10:46am

    Re: Re:

    It's time for everyone to boycott the AP.

     

  21.  
    The Buzz Saw, Jul 16th, 2009 @ 10:47am

    Re:

    I'd be interested in seeing proof of this statement. I'm not trying to be obnoxious or anything; I'm just genuinely interested to see this happen. Several people have mentioned that Google does not always honor robots.txt, but I have never seen any matching evidence.

    Source?

     

  22.  
    Ryan, Jul 16th, 2009 @ 10:56am

    Re: Re:

    Yeah, I don't see this...Google is going to code their bots the same way, so it'll treat every site the same way. Unless they added in exceptions to specific sites, although I don't know why they'd do that. Do they have a shit list of webmasters they don't like that they periodically update in their scrapers? Seems to me like an exception would be an improperly used robots.txt file.

     

  23.  
    Ryan, Jul 16th, 2009 @ 11:22am

    google DOES follow robots.txt

    "Google also does not respect robots.txt 100% of the time either"

    I think you misunderstand crawling vs indexing. Robots.txt says don't crawl my site. It doesn't mean Google can't index it - it just means they won't cache it, or visit it, or anything.

    They will still show it in the search results, but only as a URL - with no snippet or text under it.

    You're thinking of the noindex tag if you don't want to be listed.

     

  24.  
    Anonymous Coward, Jul 16th, 2009 @ 11:53am

    Re: google DOES follow robots.txt

    Robots.txt has many commands that can say many different things, INCLUDING don't index my site. See the post on Google's blog.

    "Webmasters who do not wish their sites to be indexed can and do use the following two lines to deny permission:

    User-agent: *
    Disallow: /"

    http://googlepublicpolicy.blogspot.com/2009/07/working-with-news-publishers.html

    They can have their website not INDEXED on Google if they so choose, just with a simple robots.txt file.

     

  25.  
    william, Jul 16th, 2009 @ 1:08pm

    Okay, let's put it this way. problem = opportunity = money.

    The Internet and search engines are fine the way they are with REP, etc. However, newspapers want a share of THAT "internet money" without having to do any work or use their brains to come up with something new and novel.

    What do they do? Create an artificial problem by pretending they know nothing about the current Internet technology. Create another standard that's inferior to what we have right now. Whine to create pressure to make people use them.

    Then everyone will have to PAY THEM to NOT USE that sh*t standard.

    Business model or extortion? You tell me.

     

  26.  
    MattP, Jul 16th, 2009 @ 2:50pm

    Re: The response I expect to hear...

    FTA: "REP isn't specific to Google; all major search engines honor its commands."

     

  27.  
    MattP, Jul 16th, 2009 @ 2:57pm

    Opportunity Lost

    "Today, more than 25,000 news organizations across the globe make their content available in Google News and other web search engines. They do so because they want their work to be found and read -- Google delivers more than a billion consumer visits to newspaper web sites each month." I'm sure there are plenty of sources wanting a share of 1 billion visitors a month. Let AP die and move on.
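
On the crawling-versus-indexing point raised in the thread: robots.txt governs whether a crawler fetches a page, while the "noindex" directive mentioned by one commenter lives in the page itself, in a robots meta tag. Below is a minimal sketch of how an indexer might look for that tag, assuming it has been allowed to fetch the page in the first place (a crawler blocked by robots.txt never sees it). The class name, helper name, and sample HTML are hypothetical, and real search engines certainly do more than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def asks_not_to_be_indexed(html: str) -> bool:
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    # Made-up sample page for illustration only.
    page = (
        '<html><head><meta name="robots" content="noindex, nofollow">'
        "</head><body>story text</body></html>"
    )
    print(asks_not_to_be_indexed(page))
```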

     
