Stories filed under: "search engines"

A Bunch Of Authors Sue OpenAI Claiming Copyright Infringement, Because They Don’t Understand Copyright

from the not-how-any-of-this-works dept

Tue, Jul 11th 2023 09:29am - Mike Masnick

You may have seen some headlines recently about some authors filing lawsuits against OpenAI. The lawsuits (plural, though I’m confused why it’s separate attempts at filing a class action lawsuit, rather than a single one) began last week, when authors Paul Tremblay and Mona Awad sued OpenAI and various subsidiaries, claiming copyright infringement in how OpenAI trained its models. They got a lot more attention over the weekend when another class action lawsuit was filed against OpenAI with comedian Sarah Silverman as the lead plaintiff, along with Christopher Golden and Richard Kadrey. The same day the same three plaintiffs (though with Kadrey now listed as the top plaintiff) also sued Meta, though the complaint is basically the same.

All three cases were filed by Joseph Saveri, a plaintiffs class action lawyer who specializes in antitrust litigation. As with all too many class action lawyers, the goal is generally enriching the class action lawyers, rather than actually stopping any actual wrong. Saveri is not a copyright expert, and the lawsuits… show that. There are a ton of assumptions about how Saveri seems to think copyright law works, which is entirely inconsistent with how it actually works.

The complaints are basically all the same, and what it comes down to is the argument that AI systems were trained on copyright-covered material (duh) and that somehow violates their copyrights.

Much of the material in OpenAI’s training datasets, however, comes from copyrighted works—including books written by Plaintiffs—that were copied by OpenAI without consent, without credit, and without compensation

But… this is both wrong and not quite how copyright law works. Training an LLM does not require “copying” the work in question, but rather reading it. To some extent, this lawsuit is basically arguing that merely reading a copyright-covered work is, itself, copyright infringement.

Under this definition, all search engines would be copyright infringing, because effectively they’re doing the same thing: scanning web pages and learning from what they find to build an index. But we’ve already had courts say that’s not even remotely true. If the courts have decided that search engines scanning content on the web to build an index is clearly transformative fair use, so too would be scanning internet content for training an LLM. Arguably the latter case is way more transformative.

And this is the way it should be, because otherwise, it would basically be saying that anyone reading a work by someone else, and then being inspired to create something new would be infringing on the works they were inspired by. I recognize that the Blurred Lines case sorta went in the opposite direction when it came to music, but more recent decisions have really chipped away at Blurred Lines, and even the recording industry (the recording industry!) is arguing that the Blurred Lines case extended copyright too far.

But, if you look at the details of these lawsuits, they’re not arguing any actual copying (which, you know, is kind of important for there to be copyright infringement), but just that the LLMs have learned from the works of the authors who are suing. The evidence there is, well… extraordinarily weak.

For example, in the Tremblay case, they asked ChatGPT to “summarize” his book “The Cabin at the End of the World,” and ChatGPT does so. They do the same in the Silverman case, with her book “The Bedwetter.” If those are infringing, so is every book report by every schoolchild ever. That’s just not how copyright law works.

The lawsuit tries one other tactic here to argue infringement, beyond just “the LLMs read our books.” It also claims that the corpus of data used to train the LLMs was itself infringing.

For instance, in its June 2018 paper introducing GPT-1 (called “Improving Language Understanding by Generative Pre-Training”), OpenAI revealed that it trained GPT-1 on BookCorpus, a collection of “over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.” OpenAI confirmed why a dataset of books was so valuable: “Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information.” Hundreds of large language models have been trained on BookCorpus, including those made by OpenAI, Google, Amazon, and others.

BookCorpus, however, is a controversial dataset. It was assembled in 2015 by a team of AI researchers for the purpose of training language models. They copied the books from a website called Smashwords that hosts self-published novels, that are available to readers at no cost. Those novels, however, are largely under copyright. They were copied into the BookCorpus dataset without consent, credit, or compensation to the authors.

If that’s the case, then they could make the argument that BookCorpus itself is infringing on copyright (though, again, I’d argue there’s a very strong fair use claim under the Perfect 10 cases), but that’s separate from the question of whether or not training on that data is infringing.

And that’s also true of the other claims of secret pirated copies of books that the complaint insists OpenAI must have relied on:

As noted in Paragraph 32, supra, the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community: for instance, an AI training dataset published in December 2020 by EleutherAI called “Books3” includes a recreation of the Bibliotik collection and contains nearly 200,000 books. On information and belief, the OpenAI Books2 dataset includes books copied from these “shadow libraries,” because those are the most sources of trainable books most similar in nature and size to OpenAI’s description of Books2.

Again, think of the implications if this is copyright infringement. If a musician were inspired to create music in a certain genre after hearing pirated songs in that genre, would that make the songs they created infringing? No one thinks that makes sense except the most extreme copyright maximalists. But that’s not how the law actually works.

This entire line of cases is just based on a total and complete misunderstanding of copyright law. I completely understand that many creative folks are worried and scared about AI, and in particular that it was trained on their works, and can often (if imperfectly) create works inspired by them. But… that’s also how human creativity works.

Humans read, listen, watch, learn from, and are inspired by those who came before them. And then they synthesize that with other things, and create new works, often seeking to emulate the styles of those they learned from. AI systems and LLMs are doing the same thing. It’s not infringing to learn from and be inspired by the works of others. It’s not infringing to write a book report style summary of the works of others.

I understand the emotional appeal of these kinds of lawsuits, but the legal reality is that these cases seem doomed to fail, and possibly in a way that will leave the plaintiffs having to pay legal fees (since fee awards are much more common in copyright cases).

That said, if we’ve learned anything at all in the past two plus decades of lawsuits about copyright and the internet, courts will sometimes bend over backwards to rewrite copyright law to pretend it says what they want it to say, rather than what it does say. If that happens here, however, it would be a huge loss to human creativity.

Filed Under: ai, christopher golden, copyright, inspiration, joseph saveri, llms, mona awad, paul tremlay, richard kadrey, sarah silverman, search engines, training
Companies: meta, openai

If It’s Impossible To Compete With Google, How Come New Search Engines Keep Launching?

Venture Capital

from the the-internet-is-quite-the-dynamic-place dept

Fri, Jul 8th 2022 03:51pm - Mike Masnick

We’re talking a lot these days about competition and antitrust, and the narrative over the past few years is that four companies — Facebook, Apple, Amazon, and Google — have basically sewn up the entire internet market, and no new entrants can ever succeed. Of course, we keep seeing that argument challenged by reality. First off, for a while people were including Netflix in that list, but over the last few years, Netflix has been facing competition from all different directions and is now struggling. On the social media front, TikTok certainly showed that it’s possible for other entrants to become very big, very fast, even if Facebook wants to kill them. And, of course, basically every month now we hear about this or that new social network that is gaining ground, especially among younger generations who don’t trust Facebook.

But, on search, we’ve been told that there really can’t be a new entrant, since Google has such control over the market. Of course, Bing is out there, and DuckDuckGo has carved out a pretty healthy slice of the market.

Perhaps most interesting to me, however, is how I keep hearing about new entrants in the search market. Last fall, privacy-protecting browser Brave announced that it was launching its own search engine, for example. However, in the last few weeks I’ve heard about two other brand new search engines as well. First up, Russ Roberts interviewed former Google exec Sridhar Ramaswamy, who recently launched the new search engine Neeva, which appears to be a search engine with a freemium model that promises not just no tracking (a la DDG), but also no ads ever.

Last year, the company raised $40 million from two top VC firms, Sequoia and Greylock, which, again, goes against the narrative that VCs won’t invest in these spaces. In just four months since the site launched, it has half a million monthly active users. That’s pretty tiny, but it’s still a starting point.

Then, just about the same time I learned about Neeva, I learned about another new search engine, called Yep (I wonder how much that domain cost!). Yep was just launched a few weeks ago, after the big search engine optimization company Ahrefs spent an apparent $60 million building it.

With Yep, their attempted differentiator is (like so many others) no tracking of personal info, including search history, and then a weird “profit-sharing” model, in which they promise to share 90% of ad profits with content publishers. I’ll be honest: I don’t quite understand what that means or how it works. First off, it seems unlikely that they’ll be making any “profits” in the short run (and perhaps longer) so is this just a future promise?

And, second, how are they going to (1) keep track of which content providers they owe money to and how much, and (2) get hooked up with those content providers to give them the money. The company’s “hypothetical” is that they would fund a ton for Wikipedia:

“Let’s say that the biggest search engine in the world makes $100B a year. Now, imagine if they gave $90B to content creators and publishers.

Wikipedia would probably earn a few billion dollars a year from its content. They’d be able to stop asking for donations and start paying the people who polish their articles a decent salary.

There would be no more need for paywalls and affiliate links, so publishers who’ve had to resort to chasing traffic with clickbait articles and filling their pages with ads would be able to get back to doing investigative pieces and quality analysis. A citizen journalist uncovering corruption on the side of a full-time job could get compensated without having to spend time trying to monetize content.”

Again, this is not clear at all. How are they tracking that? How do they prevent gaming the system? Hell, they’re an SEO firm, they know that everyone tries to game search engines to get an indirect benefit. When you switch it to cold, hard cash, I imagine it’ll get that much worse. Perhaps the people at the company think their experience with SEO will help them spot the gamers, but it’s quite a challenge.

So, yes, neither of these may succeed. Both seem to have some pretty big challenges ahead. But I’m just generally fascinated that they exist at all, despite the narrative that it’s so impossible to build a search engine that there are “Venture Capital Kill Zones” where no VC would invest, and that those zones include search.

Yet, just here, within a week, I found out about approximately $100 million being spent on building two separate competing search engines, both with at least some plans to differentiate themselves in the market.

The internet is incredibly dynamic. There may be policy options for increasing competition, but it’s hard to argue that some companies have so dominated the field that no one even dares attempt to build competitors any more. They seem to be happening all around us.

Filed Under: competition, investing, kill zones, search, search engines, vcs
Companies: google, neeva, yep

Performative Conservatives Are Mad That A Search Engine Wants To Downrank Disinformation

Overhype

from the you-want-what-now? dept

Fri, Mar 11th 2022 12:09pm - Mike Masnick

DuckDuckGo, of course, is a popular “alternative” search engine, using Microsoft’s Bing as its underlying search engine but then doing a bunch of generally good stuff for the wider internet/public, such as not trying to collect as much information on you as possible for tracking-based ads, but focusing instead on intention-based ads around your search (like Google did in the early days). I regularly use it and appreciate the more privacy protective approach.

Like lots of internet companies over the last few weeks, apparently DuckDuckGo has been trying to figure out how to deal with the Russian invasion of Ukraine, as well as the concerted effort by certain sources to push a Russia-driven narrative about the invasion. A few days ago, DDG’s founder and CEO Gabriel Weinberg announced on Twitter that the company was rolling out a search update that downranked “sites associated with Russian disinformation.”

https://twitter.com/yegg/status/1501717193855283201

Now, there are (of course!) reasonable questions to be asked about what any particular company considers to be “disinformation.” As we’ve spent years detailing, defining disinformation is a lot more difficult than most people think. It’s also prone to abuse by governments looking to censor. And, quite frequently, disinformation flows are really more closely related to the issue of confirmation bias.

Still, the job of a search engine is to rank websites based on what that search engine thinks will provide the searcher with the most relevant information. It is, inherently, biased. It can’t not be. This is why the entire concept of “search neutrality” is nonsense. A “neutral” search engine is a search engine that just returns random results, rather than useful results. Every search engine is biased, because that bias is what determines what results will be ranked first, second, third, etc.

But… a whole bunch of overly performative Trumpists who must always play the victim, responded to Gabriel’s announcement by falling on their fainting couches to bemoan the fact that a search engine was downranking false information. This includes a Peter Thiel-backed Senate candidate, Blake Masters, who is shocked, shocked, shocked, that a search engine might try to minimize false information:

But there were lots more, and all seemed to be based on the idea that before this, DDG’s results were somehow… pure and untouched by any bias.

“I will determine for myself what is quality”

“just show all results and let people decide”

How dare a website try to determine what’s credible!

“manipulating search results” as if there’s a natural order of search results handed down from God.

If you don’t want “filtered” results, don’t use a search engine, Brad.

Yeah! How dare a search engine try to determine quality results!

They *are* returning search results, dude.

Search engines rank things?!? Since when?!?

Can’t believe a search engine would dare to try to figure out what’s relevant. What is the world coming to?

Just give me all the results in no conceivable order and let me decide for myself!

That’s kind of the whole point of a search engine, Grant.

Apparently trying to show you relevant information above less relevant information is now censorship.

You had ONE JOB: to provide me with random results in no particular order!

I’m an adult: please make sure you don’t rank any search results and force me to wade through garbage to find anything useful. Like all adults.

If you want a search engine where you get to “make your own decisions about information” then you don’t want a search engine, Glen.

It really is unfortunate when a *search engine* tries to prioritize more relevant results.

Search engines must return all results and let me rank them.

Time to find a search engine that **doesn’t** try to find what’s best for me.

There are SO MANY more tweets like this, nearly all of which seem to think that there’s a divine set of search results that are perfect, unbiased, and untouched by human hands, and that somehow DDG’s latest search ranking modification (something every search engine ever has always done as they attempt to continually rank information in a more relevant fashion) is against the norm.

It truly is incredible how little people understand how any of this works.

Filed Under: bias, disinformation, search, search engines, search neutrality
Companies: duckduckgo

Other Big CJEU Case Says Google Must Put Certain Links At The Top Of Search Results

Legal Issues

from the must-carry dept

Tue, Sep 24th 2019 12:10pm - Mike Masnick

While most of the attention today was focused on the CJEU “right to be forgotten” ruling concerning global censorship, the court actually released another ruling concerning the right to be forgotten, again around disagreement between French regulators and Google. And, as intermediary liability expert Daphne Keller notes, this ruling may turn out to be more interesting in the long run.

This case involved how Google should deal with “sensitive data,” when it’s a part of a RTBF request. The court does decide that a “notice and takedown” regime makes sense for such sensitive content, which is better than the possible alternative advanced by some: that the law requires Google to pro-actively stop the indexing of such sensitive information (or even to first get consent). The court points out that this wouldn’t make any sense at all, given how search engines work:

In practice, it is scarcely conceivable – nor, moreover, does it appear from the documents before the Court – that the operator of a search engine will seek the express consent of data subjects before processing personal data concerning them for the purposes of his referencing activity.

But what really stands out is what appears to be a totally uncalled for random aside by the court towards the end of the ruling:

It must, however, be added that, even if the operator of a search engine were to find that that is not the case because the inclusion of the link in question is strictly necessary for reconciling the data subject’s rights to privacy and protection of personal data with the freedom of information of potentially interested internet users, the operator is in any event required, at the latest on the occasion of the request for de-referencing, to adjust the list of results in such a way that the overall picture it gives the internet user reflects the current legal position, which means in particular that links to web pages containing information on that point must appear in first place on the list.

This seems like it could be a very big deal. It’s the court saying that Google is required to make sure that top search results “reflects the current legal position.” In other words, if someone was exonerated after being accused of a crime, that must now be the top link. As Keller notes, this is going to have some strange consequences that probably won’t be very good:

– If the top results on searches for the data subject's name would otherwise not be about the criminal case, this creates a serious Streisand effect.
– This invites all kinds of abuse by SEOs and reputation management companies.

— Daphne Keller (@daphnehk) September 24, 2019

Having a court come in and tell a search engine what is “required” to be the top search result, tossed off without much detail or thought in a world where getting certain links to certain places on search engines is literally an entire industry, is going to have pretty significant consequences — nearly all of them I can guarantee you the court did not even begin to think about.

Filed Under: cjeu, cnil, eu, links, must carry, right to be forgotten, rtbf, search engines, search results, seo
Companies: google

TurboTax Did Everything It Could To Hide The Free-Filing It’s Supposed To Offer

(Mis)Uses of Technology

from the hide-and-seek dept

Mon, Apr 29th 2019 11:57am - Timothy Geigner

For years, advocates for the non-wealthy public have put forward plans to simplify the tax-preparation process by having the IRS pre-prepare a tax filing with the information it already has, sending it to citizens, and allowing those citizens to either sign and return it or do their own tax preparation if they think there are errors. Several politicians have put versions of this plan forward, including Elizabeth Warren. The idea is that, for the vast majority of Americans, the IRS already has all the information it needs for the tax filing. Why make most people do tax prep when they don’t have to?

Well, for just as many years, the companies that make money by doing this tax prep work have lobbied heavily in Congress to keep this from becoming law. Intuit, makers of TurboTax software, has been particularly active on this front, with novel arguments that amount to, “But if you make this law, then we’ll make less money.” When that messaging became a PR disaster, the company tricked a bunch of mouth-pieces into saying all this for it.

Now, if all of that seems like shady shit, you ain’t seen nothing yet. One of the ways companies like Intuit hand-wave concerns that its lobbying efforts are coercing the poor and middle class to pay for tax prep that is so simple it should be free is by pointing out that it entered into an agreement with the IRS to offer their own free-to-file programs for anyone that makes less than $66k in a given year. While that’s true, ProPublica has a nice write up of just how far Intuit in particular goes to hide this program from the very public it’s supposed to be serving.

Intuit and other tax software companies have spent millions lobbying to make sure that the IRS doesn’t offer its own tax preparation and filing service. In exchange, the companies have entered into an agreement with the IRS to offer a “Free File” product to most Americans — but good luck finding it.

Here’s what happened when we went looking.

Our first stop was Google. We searched for “irs free file taxes.” And we thought we found what we were looking for: Ads from TurboTax and others directing us to free products.

Spoiler alert: those products didn’t end up being free. Despite ads that mentioned “free” several times over, when the researchers created a profile of a house cleaner making $29k for the year, TurboTax’s site declared that free to file wasn’t an option because the fictional citizen was an independent contractor. Instead, the tax prep would cost $119.99. ProPublica continued:

Then we tried with a second scenario. We went back to TurboTax.com and clicked on “FREE Guaranteed.” This time, we went through the process as a Walgreens cashier without health insurance, entering personal information and giving the company lots of sensitive data.

Again, TurboTax told us we had to pay — this time because there’s an extra form if you don’t have insurance. The charge? $59.99.

Per the article, both instances are not kosher based on the agreement with the IRS. That agreement is quite simple: if you make less than $66k in the year, you get to file for free, period. From there, the researchers dug into TurboTax’s source code.

Even though we clicked on the “FREE Guaranteed” option and met all the requirements to file for free, the company had tagged us as a potential paying customer. In the source code, TurboTax had branded us as “NONFFA.” That stands for “Non Free File Alliance.” In other words, we were not on track to file for free after all. Even though TurboTax could tell we were eligible to file for free, the company never told us about the truly free version.

It turns out that if you start the process from TurboTax.com, it’s impossible to find the truly free version. The company itself admits this.

So, despite that site being laced with as many “free”s as could be mustered, you can’t actually get to free filing at all. How many folks using the site to file for free do you think make it all the way to the FAQ page and realize their mistake compared with how many accept what the site tells them and pay up to file instead? Especially when “free” appears all over the sites on which you cannot file for free, but the actual free filing site is called, sigh, TurboTax Freedom?

But let’s pretend most people do get to that FAQ. The researchers threw “turbotax freedom” into Google to see what popped up.

The first link was from TurboTax and said “Free File Program” right in the text. We clicked, and it brought us to this new page. While the orange “See If You Qualify” link did take us to the real Free File program, the blue “Start for Free” link brought us back to the version of TurboTax where we ended up having to pay.

Whatever this is, it clearly isn’t Intuit comporting with the spirit of the agreement it signed with the IRS. All of these shady tactics are quite obviously designed to keep people from ever finding the free to file site, to trick the lower classes into paying for tax prep work when they should not be, and to depress the number of people that actually use it as much as possible.

The reward for all of that shady behavior is calls from the same Congress that receives Intuit’s lobbying dollars to end the program as it’s not achieving its goals. And, it seems Congress is also considering legally barring the IRS from offering any free to file program itself, because that certainly serves common people.

Congress is now moving to put the Free File program into law, including its restriction on the IRS creating its own free service. We wrote about that earlier this month and the opposition to this provision by freshman Democratic Reps. Katie Hill, Katie Porter, Alexandria Ocasio-Cortez and others. The House ultimately passed the bipartisan Taxpayer First Act, which also contains some provisions that consumer advocates support, such as restrictions on private debt collection of unpaid taxes.

Now the Senate is considering the bill. Its sponsors have argued that it doesn’t tie the IRS’ hands, but outside legal experts we’ve spoken to disagree. The text in the bill codifying the Free File program has long been sought by lobbyists for Intuit.

In addition to all of this, to make matters way, way worse, more information has come to light since the original ProPublica post and the original writing of this piece. In a follow-up post, ProPublica has unearthed that Intuit specifically and actively de-indexed the free-to-file website from Google’s search engine with a robots “noindex” directive in the site’s code.

The code on TurboTax’s Free File site says “noindex,nofollow” — instructions for it not to show up in search results.

In contrast, the TurboTax page that puts many users on track to pay signals to Google that it should be listed in search results.
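
For anyone wondering what that looks like in practice: “noindex,nofollow” is the kind of value that goes in a robots meta tag in a page’s HTML. Here is a minimal sketch, assuming a standard `<meta name="robots">` tag, of how a crawler-side check might read it. The two sample pages and the helper names are made up for illustration; they are not Intuit’s actual markup or anyone’s real crawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta robots>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical pages for illustration only.
hidden_page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
listed_page = "<html><head><title>Start for free</title></head></html>"

print("noindex" in robots_directives(hidden_page))  # True
print("noindex" in robots_directives(listed_page))  # False
```

A page with no such tag is indexable by default, which is why this kind of directive has to be added on purpose.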

Sen. Ron Wyden, the ranking Democratic member of the Senate Finance Committee, said in a statement that he plans to raise Intuit’s misleading marketing with the IRS. “Intuit’s tactics to reduce access to the Free File program and confuse taxpayers are outrageous,” he said.

Don’t expect Wyden to be the only member of Congress to get into the act. Several of the Democratic Presidential candidates are members of Congress as well, and you can bet that this is the kind of subject many of them will be all over. Elizabeth Warren in particular, as one of those pushing the IRS to do its own free to file program via legislation, should be an interesting watch here.

What the end result of all of this will be is unknown at the moment, but it looks very, very bad for Intuit. Again, Intuit signed an agreement with the IRS not only to offer free to file itself, but to take action to increase the use of it. Delisting the website where it can be done is about as counter to that promise as can be had.

Filed Under: congress, free tax filing, free taxes, irs, low income, search engines, tax prep, taxes, turbotax
Companies: intuit

There's A Reason That Misleading Claims Of Bias In Search And Social Media Enjoy Such Traction

Overhype

from the but-it's-not-a-good-reason dept

Tue, Sep 4th 2018 11:59am - Tarleton Gillespie

President Trump’s tweets charging that Google search results are biased, against him and against conservatives, are the loudest and latest version of a growing attack on search engines and social media platforms. It is potent, and it’s almost certainly wrong. But it comes at an unfortunate time, just as a more thoughtful and substantive challenge to the impact of Silicon Valley tech companies has finally begun to emerge. If someone were truly concerned about free speech, news, and how platforms subtly reshape public participation, they would be engaging these deeper questions. But these simplistic and ill-informed claims of deliberate political bias are the wrong questions, and they risk undermining and crowding out the right ones. Trump’s charges against Google, Twitter, and Facebook reveal a basic misunderstanding of how search and social media work, and they continue to confuse “fake news” with bad news, all in the service of scoring political points. However, even if these companies are not responsible for silencing conservative speech, they may be partly responsible for allowing this charge to gain purchase, by being so secretive for so long about how their algorithms and moderation policies work.

So what do search engines actually do when users access them for information or news? Search engines deliver relevant results, nothing more. That judgment of relevance is based on hundreds of factors: including popularity, topic relevance, and timeliness. Results are fluid and personalized. There’s plenty of room in this complex process for overemphasis and oversight, and these are important questions to examine. But serious researchers who actually already study this are careful to take into account the effects of personalization, changes over time, and the powerful feedback effects of users. This is a far cry from looking at your own search results and being troubled by what you see. (Even the author of the report Trump was likely reacting to acknowledges that it was unscientific and disagrees with the suggestion that regulation of search should follow.)
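
To make the idea of a multi-factor relevance judgment concrete, here is a deliberately toy sketch. It assumes just three signals (popularity, topic relevance, timeliness) combined with made-up weights; real search engines use far more signals, personalize results, and do not publish their weights, so none of the names or numbers below describe any actual system.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    popularity: float       # 0..1, e.g. share of clicks and links (assumed scale)
    topic_relevance: float  # 0..1, how well the page matches the query
    timeliness: float       # 0..1, how recent the page is


# Hypothetical weights chosen only for illustration.
WEIGHTS = {"popularity": 0.3, "topic_relevance": 0.5, "timeliness": 0.2}


def score(page):
    """Weighted sum of the three toy signals."""
    return (WEIGHTS["popularity"] * page.popularity
            + WEIGHTS["topic_relevance"] * page.topic_relevance
            + WEIGHTS["timeliness"] * page.timeliness)


def rank(pages):
    """Order pages from highest to lowest score."""
    return sorted(pages, key=score, reverse=True)


pages = [
    Page("https://example.com/old-but-popular", 0.9, 0.4, 0.1),
    Page("https://example.com/fresh-and-on-topic", 0.3, 0.9, 0.9),
    Page("https://example.com/off-topic", 0.5, 0.1, 0.5),
]

for page in rank(pages):
    print(page.url)
```

Even in this tiny version, changing a single weight reorders the list without anyone targeting a particular site, which is part of why a glance at one's own results says little about intent.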

To understand, for instance, the results for “Trump” in Google News, or “Trump news” in Google — different things, by the way — we would need to consider some much more likely explanations than deliberate political manipulation: major outlets like CNN may publish a lot more content a lot more often; more users may click on, read, and forward links from these sources; outspoken right-wing sites like Gateway Pundit may have much less trust outside of their devoted base than they imagine; CNN may be much more congruent with centrist political leanings than Trump and conservative critics admit; well-established news sources may already circulate more widely and successfully on social media platforms like Facebook and Twitter, boosting their rankings on search engines; users may simply be more convinced by these news sources, “voting” for them with their clicks and links in ways that Google picks up on.

In truth, there are important questions to be asked about search engines, social media platforms, and the circulation of news online. There are profound concerns about the economic sustainability of journalism itself when it has to compete on social media platforms. There are profound concerns about the subtle effects of how algorithms work. But the noise that right-wing critics are stirring up is not subtle, it is not helpful, it is not well informed — and more than that, it is clearly about scoring political points. Those claiming political bias seem wholly uninterested in acknowledging the inquiries already underway.

Charges of left-leaning bias are not new, of course. They come from a very old playbook conservatives have used against newspapers and broadcasters for decades. Unfortunately, Silicon Valley is partly to blame for why it is working so well today. Search engines and social media platforms have been too secretive about how their algorithms work, and too secretive about how content moderation works. In the absence of substantive explanations, users have been left to wonder why search results look the way they do, or why some posts get removed and others don’t. This uncertainty breeds suspicion, and that suspicion goes looking for other explanations. This leaves room for trolls, conspiracy mongers, and demagogues to suggest that the platforms are silencing them for their political speech — conveniently overlooking the fact that they have been suspended for making hateful threats, or can’t reach the first page of search results because readers trust other sources. And Silicon Valley has bruised its users’ trust for so long that even its genuine explanations sound suspect.

Some of the press coverage, when it’s not careful, can inadvertently make the very same easy assumptions that these critics do. Search results, trending lists, and content moderation are not the same thing, they are not managed by the same people, and they are not handled in the same way. Too often, a critic will thread together ill-informed charges against search, one outdated incident regarding trending, and continued uncertainty about moderation practices, and lace them together into a blanket charge of bias. But they are simply different things.

It is unnerving to feel like an apologist for these tech companies. There are real and concerning questions about how search and social media work. I ask some of these questions in my own research, and my field has been thinking about them for years. The ways these companies have addressed, or often failed to address, the public ramifications of search algorithms and moderation policies have been deeply problematic. But these questions of bias distract us from the deeper problems.

It is also disconcerting, just as the public is finally grasping the subtle ways in which search and social media platforms matter, that we are ready to fall back on so simplistic a charge as deliberate political bias. I feel a bit like critics of mainstream news media, who for years have tried to highlight the way contemporary US news organizations are subtly centrist, structurally cautious, founded by commercial imperatives, and underattentive to marginalized voices — who now have to bracket those critiques and come to the defense of CNN when the President dismisses them as “fake news.” Those of us who ask hard questions about search and social media should do so, but we must also steadfastly refuse to lump these real concerns in with facile, politically motivated charges of bias that miss the deeper point.

Tarleton Gillespie is the author of Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. He is a principal researcher at Microsoft Research and an affiliated associate professor at Cornell University.

Filed Under: bias, hype, politics, search engines, social media, transparency
Companies: facebook, google, twitter

108 Comments

Five Senators Agree: Search Engines Should Censor Drug Information

Free Speech

from the foot-in-the-door-for-greater-government-control-of-web-content dept

Thu, Mar 8th 2018 09:30am - Tim Cushing

The US government would like to be involved in the web censorship business. The anti-sex trafficking bill recently passed by the House would do just that, forcing service providers to pre-censor possibly harmless content out of fear of being sued for the criminal acts of private citizens. Much has been made recently of “fake news” and its distribution via Russian bots, with some suggesting legislation is the answer to a problem no one seems to be able to define. This too would be a form of censorship, forcing social media platforms to make snap decisions about new users and terminate accounts that seem too automated or too willing to distribute content Congressional reps feel is “fake.”

For the most part, legislation isn’t in the making. Instead, reps are hoping to shame, nudge, and coerce tech companies into self-censorship. This keeps the government’s hands clean, but there’s always the threat of a legal mandate backing legislators’ suggestions.

Key critic of Russian bots and social media companies in general — Senator Dianne Feinstein — has signed a handful of letters asking four major tech companies to start censoring drug-related material. Her co-signers on these ridiculous letters are Chuck Grassley, Amy Klobuchar, John Kennedy, and Sheldon Whitehouse. As members of the Senate Caucus on International Narcotics Control, they apparently believe Microsoft, Yahoo (lol), Pinterest, and Google should start preventing users from searching for drug information. (h/t Tom Angell)

The letters [PDFs here: Google, Yahoo, Microsoft, Pinterest] all discuss the search results returned when people search for information on buying drugs. (For instance, “buy percocet online.”) But the letters don’t limit themselves to asking these companies to ensure only legitimate sites show up in the search results. They actually ask the companies to censor all results for drug information.

The senators specifically urge Google, Microsoft, Yahoo and Pinterest to take the following steps in helping us fight the opioid crisis:

Directing users to legal and legitimate pharmacies that require a valid prescription as a condition of sale when users search for medicines on each platforms;

Disabling the ability to search for illicit drugs through each platform;

Requiring each platform to report to law enforcement when that platform receives information indicating that a company wants to advertise the use of or sale of illicit narcotics;

Establishing a 24/7 telephone point of contact with whom law enforcement can communicate directly; and

Incorporating training for each platform’s security reviewers to enable them to better recognize these threats when they first arise.

It’s the second bullet point that’s key. It simply says “disable the ability to search for illicit drugs.” There’s no way to comply with that directive that won’t result in the disappearance of useful information needed by thousands of search engine users. As Angell points out in this tweet, this would possibly cause information about drug interactions to be delisted. On top of that, students often need to research illegal drugs for class assignments and term papers. Authors and journalists also need access to a variety of drug info, including various ways they can be purchased online. Law enforcement Googles stuff just like the rest of us and its ability to track down purveyors of illegal drugs would be harmed if it was all pushed off the open web.

Those seeking to buy illegal drugs would find other ways of accomplishing this even if the info disappears. The so-called dark web is an off-the-radar option that many are using already. A whole host of useful info is in danger of being removed simply because questionable purveyors of prescription drugs have found a way to game search engine algorithms.

All of the companies receiving letters already have policies in place to restrict the illicit sale of drugs. They also have policies in place to forward pertinent info to law enforcement agencies. So, companies are already doing much of what is asked, but these senators feel the mere existence of questionable sites in search results makes these companies “facilitators” of illegal drug sales.

If SESTA is signed into law, it will make it that much easier for the government to demand similar legislation targeting opioid distribution. It will allow the government to claw back more of the immunity granted to service providers with the passage of the Communications Decency Act. The more holes drilled into Section 230 by legislation, the easier it is to remove it entirely, and paint targets on the back of search engines and social media platforms.

It’s also dangerous to suggest companies need to set up dedicated 24/7 service for law enforcement agencies. This will only encourage law enforcement to bypass legal protections set up by previous legislation and lean on companies already feeling the heat from the government’s increasingly-insane reaction to opioid overdoses. Warrants will seem unnecessary when legislators in DC are saying tech companies must be more responsive to law enforcement than they already are.

A suggestion from the government to start censoring search results is exactly that: censorship. The government may not be mandating it, but this is nothing like a concerned citizens group asking for more policing of search results. There’s the threat of legislation and other government action propelling it. Even if these senators aren’t mandating policy changes, they’re still using the weight of their position to compel alteration of search results.

Filed Under: amy klobuchar, censorship, chuck grassley, dianne feinstein, drugs, first amendment, free speech, john kennedy, search, search engines, sheldon whitehouse
Companies: google, microsoft, pinterest, yahoo

European News Agencies Again Demand Google, Facebook, Etc. Pay Up For Sending Them Traffic

Say That Again

from the definition-of-insanity dept

Fri, Dec 15th 2017 10:41am - Tim Cushing

Because it’s worked oh so well in the past, European news agencies are (again!) calling for service providers like Google and Facebook to start paying them money for sending them business.

Nine European press agencies, including AFP, called Wednesday on internet giants to be forced to pay copyright for using news content on which they make vast profits.

The call comes as the EU is debating a directive to make Facebook, Google, Twitter and other major players pay for the millions of news articles they use or link to.

“Facebook has become the biggest media in the world,” the agencies said in a plea published in the French daily Le Monde.

“Yet neither Facebook nor Google have a newsroom… They do not have journalists in Syria risking their lives, nor a bureau in Zimbabwe investigating Mugabe’s departure, nor editors to check and verify information sent in by reporters on the ground.”

“Access to free information is supposedly one of the great victories of the internet. But it is a myth,” the agencies argued.

“At the end of the chain, informing the public costs a lot of money.”

This is a doomed idea. First off, if the demands are a pain to implement, news agencies can expect to start seeing referral traffic drop as other news sources not tied to payment demands see their search engine stock rise. If they continue to press for a cut of these companies’ “billions,” they can expect to be cut off completely. This isn’t hypothetical.

Second, any agency that wants to cut off the search engines supposedly bleeding them dry can always block the engines’ crawlers. But this obviously isn’t about killing off search engine hits and Facebook sharing — it’s about dipping a hand into pockets of service providers for having the audacity to expand the reach of European news agencies.
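For anyone wondering what “blocking the engines’ crawlers” looks like in practice, it’s usually a couple of lines in a site’s robots.txt file. Here’s a rough sketch, using Python’s standard `urllib.robotparser` module to check what such a file would permit. The robots.txt contents, the `news.example` URL, the `SomeOtherBot` name, and the `is_allowed` helper are all made up for illustration; no agency’s actual configuration is being quoted here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away Google's crawler, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder article URL on a made-up domain.
ARTICLE_URL = "https://news.example/2017/12/some-story"

googlebot_allowed = is_allowed(ROBOTS_TXT, "Googlebot", ARTICLE_URL)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", ARTICLE_URL)

print("Googlebot allowed:", googlebot_allowed)
print("Other crawler allowed:", other_allowed)
```

Compliance with robots.txt is voluntary on the crawler’s side, but the major search engines say they honor it, which is the point: the opt-out already exists for anyone who actually wants it.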

Finally, there’s nothing in it for news agencies even if they succeed in getting a snippet tax implemented. They see companies worth billions and think skimming a little off the top will put them back in the black permanently. But anyone who knows anything about ad payouts knows CPM “taxes” aren’t the road to riches. In reality, any implemented scheme would involve hundreds of news sites divvying up fractions of cents between themselves for search result impressions. Payouts might be slightly higher for more direct clicks from referrers like Facebook, but at best, news agencies should expect a few bucks a month from a link tax, rather than the thousands (or millions) they envision.

The news agencies supporting this move are complaining about declining ad revenue and think charging platforms for sending them traffic is the solution. This has been tried and it hasn’t worked, but hope springs eternal when you’re all out of innovative ideas.

Filed Under: aggregators, eu, europe, google tax, linking, news, newspapers, reporting, search engines, snippet tax

30 Comments

Shouldn't Federal Judges Understand That Congress Did Not Pass SOPA?

from the hello-prior-restraint dept

Wed, Oct 4th 2017 11:55am - Mike Masnick

We’ve discussed in the past the completely ridiculous attacks on Sci-Hub, a site that should be celebrated as an incredible repository of all the world’s academic knowledge. It’s an incredible and astounding achievement… and, instead of celebrating it, we have big publishers attacking it. Because copyright. And even though the purpose of copyright was supposedly to advance “learning” and Sci-Hub serves that purpose amazingly well, so many people have bought into the myth that copyrights must “exclude” usage, that we’re in a time where one of the most amazing libraries in the world is being attacked. Sci-Hub lost its big case earlier this year, and almost immediately others piled on. Specifically, back in June, the American Chemical Society (ACS) jumped in with a similar “us too!” lawsuit, knowing full well that Sci-Hub would likely ignore it.

ACS has moved for a default judgment against Sci-Hub (what you tend to get when the defendant ignores the lawsuit), which it would likely get. However, in an extremely troubling move, the magistrate judge reviewing the case for the Article III judge who will make the final ruling has recommended forcing ISPs and search engines to block access to Sci-Hub. After recommending the standard (and expected) injunction against Sci-Hub, the recommendation then says:

In addition, the undersigned recommends that it be ordered that any person or entity in privity with Sci-Hub and with notice of the injunction, including any Internet search engines, web hosting and Internet service providers, domain name registrars, and domain name registries, cease facilitating access to any or all domain names and websites through which Sci-Hub engages in unlawful access to, use, reproduction, and distribution of ACS’s trademarks or copyrighted works. Finally, the undersigned recommends that it be ordered that the domain name registries and/or registrars for Sci-Hub’s domain names and websites, or their technical administrators, shall place the domain names on registryHold/serverHold or such other status to render the names/sites non-resolving.

So, this is kind of incredible. Because, as you might remember, there was a big fight a little over five years ago about a pair of bills in Congress called SOPA and PIPA that proposed allowing for such an order being issued to third parties like search engines, ISPs, domain registrars and the like, demanding they block all access to certain websites. And, following quite a public outcry (which also explained why this approach would do serious harm to certain security standards and other technical aspects of how the internet works), Congress backed down and decided it did not want to enable courts to issue such orders.

So why the hell is Magistrate Judge John F. Anderson recommending such an order?

At the very least, it seems problematic. Even if you ignore the Sci-Hub part of the equation (since it ignored the lawsuit, a default judgment was basically inevitable), you should be concerned about this. Here’s a court order binding a very large number of non-parties to the lawsuit to completely block access to a variety of websites, without any sort of due process. One hopes that ISPs, domain registrars and search engines will push back on such an overbroad order — one that even Congress realized was a step too far and never authorized.

Filed Under: copyright, dns, injunctions, intermediary liability, john f. anderson, search engines, site blocking, sopa
Companies: acs, sci-hub

UK Search Engines Will Sign Up To A 'Voluntary' Code On Piracy — Or Face The Consequences

from the and-who-cares-what-you-think? dept

Thu, Feb 9th 2017 11:55am - Glyn Moody

As Techdirt readers know, the copyright industry has almost no means to tackle infringement, or to demand that pirated materials are removed from Internet sites. At least, that’s the impression you would get as a result of the constant whining you hear from the entertainment companies that they are doomed and terribly neglected by the lawmakers. Indeed, not content with the copyright ratchet that constantly makes copyright laws longer, stronger and broader, the film, music and publishing industries are always pushing for “voluntary” agreements with the Internet industry that don’t require anything so tiresome as actual laws to be passed… or pesky things like “due process.”

One example of this approach is the “six strikes” scheme in the US. As Techdirt noted recently, the approach was a complete failure, and has just been dropped. Unfortunately, the idea lives on around the world — the EFF has an entire section on its site about what it calls “shadow regulation,” and it has just published a global review of copyright enforcement agreements. Particularly troubling are the EU’s proposals for a new copyright directive, which would require:

large user-generated content platforms to reach agreements with copyright holders to adopt automated technologies that would scan content that users upload, and either block that content or pay royalties for it.

As the EFF notes, the reason why these would be “voluntary” deals is pretty clear:

The Commission is likely taking that approach because that it knows that it can’t directly require Internet platforms to scan content that users upload — an existing law, Article 14 of the Directive 2000/31 on electronic commerce (E-commerce Directive), expressly prohibits any such requirement.

That is, it would be impossible to make this a legal requirement, because it is forbidden by another key EU directive, but “voluntary” agreements can skirt that law, which is another reason they are so insidious. The EU’s revised copyright directive is still at an early stage of discussion, so there is some hope that this harmful proposal can be fought and removed. Sadly, that’s not the case in the UK, where it seems that search engines have had their arms twisted to sign up to another “voluntary” agreement, with the threat of new laws being brought in if they don’t. As a post on TorrentFreak explains:

Google and other search companies are close to striking a voluntary agreement with entertainment companies to tackle the appearance of infringing content links in search results. Following roundtable discussions chaired by the UK’s Intellectual Property Office, all parties have agreed that the code should take effect by June 1, 2017.

TorrentFreak quotes a revealing comment made by the UK government minister who has been leading the talks, Baroness Buscombe:

“The search engines involved in this work have been very co-operative, making changes to their algorithms and processes, but also working bilaterally with creative industry representatives to explore the options for new interventions, and how existing processes might be streamlined,” she said.

The fact that the talks were “bilateral,” involving only entertainment companies and search engines, exposes one of the worst features of these so-called “voluntary” agreements: that there is no open debate of the kind that would be standard when actual legislation was involved, nor any opportunity for ordinary people to contribute. Instead, closed-door discussions produce deals that may be satisfactory for the copyright industry, and bearable for the Internet companies, but which are uniformly bad for the general public, whose views are simply not considered relevant.

Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+

Filed Under: copyright, due process, removing content, search engines, shadow regulations, uk, voluntary agreements

9 Comments
