Distributed Search Engines, And Why We Need Them In The Post-Snowden World

from the easier-said-than-done dept

One of the many important lessons from Edward Snowden's leaks is that centralized services are particularly vulnerable to surveillance, because they offer a single point of weakness. The solution is obvious, in theory at least: move to decentralized systems where subversion of one node poses little or no threat to the others. Of course, putting this into practice is not so straightforward. That's especially true for search engines: creating distributed systems that are nonetheless capable of scaling so that they can index most of the Web is hard. Despite that challenge, distributed search engines do already exist, albeit in a fairly rudimentary state. Perhaps the best-known is YaCy:

YaCy is a free search engine that anyone can use to build a search portal for their intranet or to help search the public internet. When contributing to the world-wide peer network, the scale of YaCy is limited only by the number of users in the world and can index billions of web pages. It is fully decentralized, all users of the search engine network are equal, the network does not store user search requests and it is not possible for anyone to censor the content of the shared index. We want to achieve freedom of information through a free, distributed web search which is powered by the world's users.


The resulting decentralized web search currently has about 1.4 billion documents in its index (and growing -- download and install YaCy to help out!) and more than 600 peer operators contribute each month. About 130,000 search queries are performed with this network each day.
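The "fully decentralized" shared index that YaCy describes is typically built as a distributed hash table: each peer is responsible for the slice of the inverted index whose terms hash to it, so a query for a term is routed only to the peer that holds that term's posting list. A minimal Python sketch of that idea (illustrative only, not YaCy's actual hash or peer-selection scheme):

```python
import hashlib

def peer_for_term(term, peers):
    """Map an index term to a peer by hashing: the term's hash decides
    which node stores that term's posting list. (Illustrative scheme,
    not YaCy's real one.)"""
    h = int(hashlib.sha1(term.lower().encode("utf-8")).hexdigest(), 16)
    return peers[h % len(peers)]

peers = ["peer-a", "peer-b", "peer-c", "peer-d"]

# Build a tiny distributed inverted index: term -> (responsible peer, doc ids)
docs = {1: "distributed search engine", 2: "search privacy"}
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, (peer_for_term(term, peers), set()))[1].add(doc_id)

# A query for "search" is routed only to the peer responsible for that term.
peer, doc_ids = index["search"]
print(peer, sorted(doc_ids))
```

Because no single node holds the whole index, taking over (or subpoenaing) one peer reveals only its slice, which is exactly the property that makes this architecture attractive post-Snowden.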
Another is Faroo, which has an interesting FAQ that includes this section explaining why even privacy-conscious non-distributed search engines are problematic:
Some search engines promise privacy, and while they look like real search engines, they are just proxies. Their results don't come from their own index, but from the big incumbents (Google, Bing, Yahoo) instead (the query is forwarded to the incumbent, and the results from the incumbent are relayed back to the user).

Not collecting logfiles (of your IP address and query) and using HTTPS encryption at the proxy search engine doesn't help if the search is forwarded to the incumbent. As revealed by Edward Snowden, the NSA has access to the US-based incumbents via PRISM. If the search is routed over a proxy (aka "search engine"), the IP address logged at the incumbent is that of the proxy and not of the user. So the incumbent doesn't have the user's IP address, the search engine proxy promises not to log/reveal the user's IP, and HTTPS prevents eavesdropping on the way from the user to the search engine proxy.

Sounds good? By observing the traffic between user and search engine proxy (IP, time and size are not protected by HTTPS) via PRISM, Tempora (GCHQ taps the world's communications) et al., and combining that with the traffic between search engine proxy and incumbent (query, time and size are accessible via PRISM), all that seemingly private and protected information can be revealed. This is a common method known as traffic analysis.

The NSA system XKeyscore allows search engine keywords and other communication content to be recovered just by observing connection data (metadata) and combining it with the backend data sourced from the incumbents. The system is also used by the German intelligence services BND and BfV. Neither encryption with HTTPS, nor the use of proxies, nor restricting the observation to metadata protects your search queries or other communication content.
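The attack Faroo outlines, matching the metadata of two traffic streams to deanonymize users, can be sketched in a few lines. All the records below are made up, and real traffic analysis works on far noisier data, but the principle is the same: HTTPS hides *what* you sent, not *when* or *how much*.

```python
# Hypothetical sketch of the traffic-analysis attack described above:
# an observer sees (time, size) of encrypted user->proxy traffic, and
# (time, size, query) of proxy->incumbent traffic. Matching the two
# streams by timing and size links each user IP to a plaintext query.

user_side = [  # (user_ip, timestamp, bytes) visible despite HTTPS
    ("10.0.0.5", 100.00, 512),
    ("10.0.0.9", 103.20, 640),
]
incumbent_side = [  # (timestamp, bytes, query) visible at the backend
    (100.05, 512, "whistleblower hotline"),
    (103.24, 640, "cheap flights"),
]

def correlate(users, backend, max_delay=0.5):
    """Link user IPs to queries when a backend request follows a user
    request within max_delay seconds and carries the same payload size."""
    linked = []
    for ip, t_u, size_u in users:
        for t_b, size_b, query in backend:
            if 0 <= t_b - t_u <= max_delay and size_u == size_b:
                linked.append((ip, query))
    return linked

# Each user IP is now paired with the "private" query it sent.
print(correlate(user_side, incumbent_side))
```

A distributed engine resists this because there is no single proxy-to-incumbent link to tap: queries are resolved across many peers, none of which sees the whole picture.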
Unfortunately, unlike YaCy, Faroo is not open source, which means that its code can't be audited -- an essential prerequisite in the post-Snowden world. Another distributed search engine that is fully open source is Scholar Ninja, a new project from Jure Triglav:
I’ve started building a distributed search engine for scholarly literature, which is completely contained within a browser extension: install it from the Chrome Web Store. It uses WebRTC and magic, and is currently, like, right now, used by 42 people. It’s you who can be number 43. This project is 20 days old and early alpha software; it may not work at all.
As that indicates, Scholar Ninja is domain-specific at the moment, although presumably once the technology is more mature it could be adapted for other uses. It's also very new -- barely a month old at the time of writing -- and very small-scale, which shows that distributed search has a long way to go before it becomes mainstream. Given the serious vulnerabilities of traditional search engines, that's a pity. Let's hope more people wake up to the need for a completely new approach, and start to help create it.
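For anyone who takes up the suggestion above and installs YaCy, a local peer answers searches over a simple JSON interface as well as through its web page. The sketch below builds a query URL for that interface and parses a response of the expected shape; the default port, endpoint, and field names follow YaCy's yacysearch.json interface as I understand it, so check your own peer's API pages if they differ.

```python
import json
from urllib.parse import urlencode

# A local YaCy peer serves its web interface on port 8090 by default and
# exposes results as JSON. Endpoint and parameter names are a sketch
# based on YaCy's yacysearch.json interface; verify against your peer.
def search_url(query, host="localhost", port=8090, count=10):
    return f"http://{host}:{port}/yacysearch.json?" + urlencode(
        {"query": query, "maximumRecords": count})

print(search_url("distributed search"))

# Parsing a simplified, made-up response of the expected shape:
sample = json.loads("""
{"channels": [{"items": [
    {"title": "YaCy home", "link": "https://yacy.net/"}
]}]}
""")
for item in sample["channels"][0]["items"]:
    print(item["title"], "->", item["link"])
```

Because the request goes to your own node rather than a central server, the query never appears in anyone else's logs, which is the whole point.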

Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+

Filed Under: distributed computing, privacy, search engines

Reader Comments

    Gwiz (profile), 8 Jul 2014 @ 7:14am

    YaCy Tips

    Some tips and tricks I've learned to make YaCy run better:

    - Increase the RAM setting. The default is 600MB. My machine has 4GB, so I give YaCy 1200MB. I would give more if this were a dedicated node, but since it's my laptop, 1200MB seems to play nice with the other stuff I'm running.

    - Limit the crawl maximum. The default is 6000 PPM (pages per minute), which is pretty large. I share my internet connection with other people and devices, so I limit it to 300 PPM so I don't hog all the bandwidth and piss anyone off.

    - Increase the language ranking to 15 (the max). I tend to like reading stuff in English, but that's just me.

    - Turn on the Heuristics settings so YaCy automatically crawls one level deep on every page returned in the results. That way, if you do a search and the results kind of suck, wait ten minutes and do the search again: the results will be better because it was "learning" about what you just searched for.

    I also turn on the "site operator shallow crawl": when you enter a search query in the format "site:somewebsite.com", it automatically crawls that site one level deep.
