Anonymous Coward

July 8, 2014 at 2:18 am

I agree. And I think such search engines will start being used soon, even if their search results aren’t as relevant as Google’s, because they offer other advantages, such as not being censored or being more privacy-friendly.

In the end, if billions of people index all pages, it could get better than Google, too. The power of the crowd vs a single entity.

Ninja (profile)

July 8, 2014 at 3:19 am

Re: Re:

Google could build into their own system the power to provide results from the YaCy network, for instance (while helping them). When a takedown notice comes they can say “sorry, we can’t take it down, it’s beyond our power.” Maybe if the single entity joins the crowd it can empower such crowd even more.

Anonymous Coward

July 8, 2014 at 4:22 am

Re: Re: Re:

Nice idea, but Google is already using a database of links to be blocked, and would therefore be able, and expected, to filter results that they obtain from elsewhere and pass on to users. This is the big problem with censorship mechanisms: once implemented at a choke point, they can filter everything passing through it.

Ninja (profile)

July 8, 2014 at 3:17 am

YaCy was pretty crappy a while back but maybe with the scale it’s usable now. In any case since I’ve upgraded my connection I’ve been assisting them and I truly hope they become mainstream.

Distributed solutions are the future.

Anonymous Coward

July 8, 2014 at 4:15 am

Unfortunately there seem to be several problems that will significantly limit the utility of peer to peer search engines. Ranking algorithms require access to large parts of the index, such as a count of all unique links to a page. Also users want to search the whole index for a given term. If this requires them to find and connect to thousands of nodes there is the recursive problem of finding the nodes, and the problem of managing thousands of connections, or keeping track of pending responses to UDP requests.
Note, almost all file sharing systems rely on a centralized index to allow searching and finding peers. In essence, finding torrents is a smaller-scale search problem, and although the actual file transfer is done on a decentralized basis, finding the file is usually centralized. File sharers are more aware than most people of the hazards of centralized systems, and include many programmers in their ranks, yet they are still struggling to come up with a way of decentralizing the search to avoid the problems of trackers being blocked and domains being seized. Theirs is a significantly easier problem to solve, as the indexes are much smaller than a full index of all of the Internet that is publicly available.
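To make the link-count point concrete, here is a minimal sketch of what a peer-to-peer network would have to do just to count unique inbound links to a page. The node layout, the example URLs and the function names are all assumptions made for illustration; this is not how any particular engine does it.

```python
from collections import defaultdict

def local_inbound_links(shard):
    """Map each target URL to the set of source pages on this node that link to it."""
    inbound = defaultdict(set)
    for source, targets in shard.items():
        for target in targets:
            inbound[target].add(source)
    return inbound

def global_inbound_counts(shards):
    """Merge every node's partial view; set union counts each linking page once."""
    merged = defaultdict(set)
    for shard in shards:
        for target, sources in local_inbound_links(shard).items():
            merged[target] |= sources
    return {target: len(sources) for target, sources in merged.items()}

# Three hypothetical nodes, each holding the outgoing links of pages it crawled.
node_a = {"a.example/1": ["news.example/story"], "a.example/2": ["news.example/story"]}
node_b = {"b.example/1": ["news.example/story", "blog.example/post"]}
node_c = {"a.example/1": ["news.example/story"]}  # overlaps with node_a

counts = global_inbound_counts([node_a, node_b, node_c])
print(counts)
```

Note that the merge has to hear from every node that might hold a link to the page, which is exactly the "find and connect to thousands of nodes" problem.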

Whatever (profile)

July 8, 2014 at 4:46 am

Re: Re:

Agreed. The biggest poison for any search engine is “SEO” people knowing exactly what to do to rank well. Once they know that, they will repeat it as many times as needed to totally dominate results and render the searches effectively worthless.

A system where the ranking process is open source is pretty much doomed to an early death, as the results will be almost entirely spam within hours of it reaching a reasonable level of user searches.

Anonymous Coward

July 8, 2014 at 6:09 am

Re: Re: Re:

You have a comprehension fail, as the point I was making was that it is very difficult to do search optimization with a distributed index. This makes it difficult to rank results, and one consequence of this would be that SEO does not work, except possibly if all sites influencing one node are used, and then it only affects users of that node.

Gwiz (profile)

July 8, 2014 at 6:49 am

Re: Re: Re:

The biggest poison for any search engine is “SEO” people knowing exactly what to do to rank well. Once they know that, they will repeat it as many times as needed to totally dominate results and render the searches effectively worthless.

With YaCy the user controls the ranking, since it’s done at the client. The user also controls their own blacklist of results. I’ve been running a YaCy node for over a year and really have had to blacklist only two entities – one was a porn link spammer and the other was an annoying link farm without any actual content.

A system where the ranking process is open source is pretty much doomed to an early death, as the results will be almost entirely spam within hours of it reaching a reasonable level of user searches.

Not at all. YaCy doesn’t seem to be useless because of spam at all. I’m migrating to using YaCy almost exclusively now. Since I have it set up to crawl based on what I search for, my results are very relevant to me.
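For anyone wondering what "ranking done at the client" with a personal blacklist might look like, here is a rough sketch. The scoring rule, the weight names and the example hosts are made up for illustration and are not YaCy's actual ranking code.

```python
from urllib.parse import urlparse

def rank_results(results, query_terms, blacklist, weights):
    """Drop blacklisted hosts, then order the rest by a locally computed score."""
    ranked = []
    for r in results:
        host = urlparse(r["url"]).hostname
        if host in blacklist:
            continue  # the user's own blacklist wins, whatever the peers sent
        text = (r["title"] + " " + r["url"]).lower()
        term_hits = sum(term.lower() in text for term in query_terms)
        score = weights["term"] * term_hits
        if r.get("language") == weights["preferred_language"]:
            score += weights["language"]
        ranked.append((score, r["url"]))
    # Highest score first; ties broken by URL so the order is stable.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in ranked]

# Hypothetical results as they might arrive from peers.
results = [
    {"url": "http://linkfarm.example/yacy", "title": "yacy yacy yacy", "language": "en"},
    {"url": "http://wiki.example/yacy-setup", "title": "YaCy setup guide", "language": "en"},
    {"url": "http://forum.example/thread", "title": "YaCy Einrichtung", "language": "de"},
]
weights = {"term": 1, "language": 15, "preferred_language": "en"}
ordered = rank_results(results, ["yacy", "setup"], {"linkfarm.example"}, weights)
print(ordered)
```

Because the weights live on the user's machine, a spammer cannot tune a page against one known global formula.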

Anonymous Coward

July 8, 2014 at 11:50 am

Re: Re: Re: Re:

Don’t feed the troll. If this article was about puppies and rainbows, the troll would find a way to shit on it.

Anonymous Coward

July 8, 2014 at 5:25 am

Re: Re:

Ranking algorithms require access to large parts of the index, such as a count of all unique links to a page. Also users want to search the whole index for a given term. If this requires them to find and connect to thousands of nodes there is the recursive problem of finding the nodes, and the problem of managing thousands of connections, or keeping track of pending responses to UDP requests.

You know, Google has the same problems. Did you really think Google’s search engine is centralized? It’s not, it’s distributed between thousands of nodes, each one having only part of the index. So things like computing the ranking and distributing the queries are already known to be solved.

What Google has that is centralized is trust. Google’s nodes know they can trust other nodes, which simplifies things. File sharing systems usually do not have that trust. This leads to the most visible problem with decentralized search: nodes returning faked (usually spam) results.

Anonymous Coward

July 8, 2014 at 6:00 am

Re: Re: Re:

There is a huge difference between a server farm and the Internet when it comes to connecting thousands of nodes and thousands of disks. A server farm has a huge number of switches, allowing for massive parallelism in connections and in access to storage. Any node can, and usually will, have several connections to several networks, and algorithms can be optimized to maximize network locality for connections. Also, within such a farm, latency is much, much lower than over the Internet.
The huge difference between a supercomputer or server farm and the Internet is the communications bandwidth available to the system, which is higher by several orders of magnitude, aided by specialized networking support at each node, like the ability to bypass the kernel when accessing the network and to use a local network addressing scheme to link nodes within the system. When comparing performance, all the nodes in one big barn are a centralized system compared to having the nodes spread all around the world.

Ninja (profile)

July 8, 2014 at 6:44 am

Re: Re: Re: Re:

I think he meant connections between other data centers around the world.

Ninja (profile)

July 8, 2014 at 6:51 am

Re: Re:

Tbh there is a tool to search the torrent files themselves (or the hash, whatever) using DHT. I’ve yet to use it so I can’t attest to its efficiency.

I’d say that it is feasible or will be in a matter of a few years or even months. It may take a few more seconds instead of nanoseconds as it does on a standard search engine but that’s a price I wouldn’t mind paying. As for fake/spammy nodes there are tools to handle them already. On bittorrent for instance bad nodes get isolated and eventually ignored in the swarm (they were forced into such measures due to the MAFIAA poisoning swarms) so the technology is there. There will be tradeoffs for sure but it can be achieved.
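The "bad nodes get isolated and eventually ignored" idea can be sketched as a simple local score per peer. This is a toy illustration with an assumed scoring rule and threshold, not the mechanism any real BitTorrent client or search network uses.

```python
class PeerReputation:
    """Track how often each peer's answers check out, and ignore repeat offenders."""

    def __init__(self, ban_threshold=-3):
        self.ban_threshold = ban_threshold  # assumed cut-off, chosen for the example
        self.scores = {}

    def record(self, peer, verified):
        # A bad answer costs more than a good one earns, so spam adds up quickly.
        delta = 1 if verified else -2
        self.scores[peer] = self.scores.get(peer, 0) + delta

    def is_ignored(self, peer):
        return self.scores.get(peer, 0) <= self.ban_threshold

    def usable(self, peers):
        return [p for p in peers if not self.is_ignored(p)]

rep = PeerReputation()
rep.record("spammer", verified=False)
rep.record("spammer", verified=False)   # score is now -4, below the threshold
rep.record("honest", verified=True)
rep.record("honest", verified=False)    # one bad answer is not enough to be dropped
print(rep.usable(["honest", "spammer", "newcomer"]))
```

The tradeoff is that every node has to spend some effort checking answers before it can score anyone.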

Anonymous Coward

July 8, 2014 at 5:48 am

Unfortunately there seem to be several problems that will significantly limit the utility of peer to peer search engines. Ranking algorithms require access to large parts of the index,

Isn’t that what BigCouch was supposed to solve?

http://bigcouch.cloudant.com/

Also, Google Omega Cluster does the same thing: it doesn’t matter where the server is in the world, they all act like one big machine.

https://research.google.com/pubs/pub41684.html

Anonymous Coward III

July 8, 2014 at 6:32 am

Re: Re:

To me both Anonymous Coward I & II have missed the basic problem.

Any system that can be manipulated will be manipulated until results are completely useless to users.

Google has the best search engine; Yahoo has the second best; all others are worse than Google and Yahoo.

Google search results are manipulated by filters.

Some of these filters remove what governments and pressure groups consider to be inappropriate material; others manipulate what is appropriate as deemed by commercial interests; all filters except those initiated by the end users manipulate what the end user is allowed to see in an endless process of censorship.

The problem is that Google does not produce the end user’s desired results while recording the end user’s every action, which can then be manipulated for the betterment of others.

Google and search engines need to be replaced by something that produces the results the end user wants without the constant surveillance.

Kenneth Michaels (profile)

July 8, 2014 at 5:51 am

Paying the peers

I’ve seen the idea of Torcoin, a bitcoin-like protocol to reward those who provide bandwidth to a Tor network. Perhaps we need YaCyCoin to reward those who provide index and bandwidth to the distributed search engine.

Of course, I have no idea how to do that.

private frazer

July 8, 2014 at 5:58 am

we contribute 5% of hardware for Distributive programs

Search, social media, email. make it all distributive and kill google and facebook etc.. anyone with a pipe to NSA/GCHQ

Gwiz (profile)

July 8, 2014 at 7:14 am

YaCy Tips

Some tips and tricks I’ve learned to make YaCy run better:

– Increase the RAM setting. Default is 600MB. I have 4GB so I give YaCy roughly 1GB (1200MB). I would give more if this was a dedicated node, but since it’s my laptop, 1GB seems to play nice with other stuff that I’m running.

– Limit crawl maximum. Default is 6000 PPM (pages per minute) and that is pretty large. I share my internet connection with other people and devices so I limit it to 300 PPM so I don’t hog all the bandwidth and piss anyone off.

– Increase language ranking to 15 (max). I tend to like reading stuff in English, but that’s just me.

– Turn on Heuristics settings so it automatically crawls one level deep on every page returned in the results. This way if you do a search and the results kind of suck – wait ten minutes, do the search again and the results are better because it was “learning” about what you just searched for.

I also turn on the “site operator shallow crawl”. When you enter a search query in the format “site:somewebsite.com” it automatically crawls that site one level deep.

Vidiot (profile)

July 8, 2014 at 8:36 am

I used to use distributed search engines. One was called AltaVista, one was called AskJeeves, and there were these other up-and-comers called Google and Yahoo. Seldom saw the same results; and, devoid of AI algorithms, you could search for literal phrases, booleans and directory paths. Ahhh, the good ol’ days…

Anonymous Coward

July 8, 2014 at 8:55 am

Yet more evidence that this website is nothing but a front for Google.

toyotabedzrock (profile)

July 8, 2014 at 12:39 pm

Where is the index for distributed search stored?

Gwiz (profile)

July 8, 2014 at 1:21 pm

Re: Re:

Where is the index for distributed search stored?

For YaCy it’s a DHT (distributed hash table) and it’s stored and shared in little bits and pieces from each user’s hard drive.

Basically, it’s “stored” the same way a torrent is “stored” in the swarm.
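A generic way to picture "little bits and pieces on each user’s hard drive" is a hash ring: hash the word, hash the peers, and store the word’s entries on the peers that follow it on the ring. The sketch below shows that general DHT idea only; the hash choice, the redundancy of 2 and the peer names are assumptions, and YaCy’s real placement scheme may differ.

```python
import hashlib
from bisect import bisect_left

def ring_position(key):
    """Place a word or a peer ID on a 32-bit ring using the first bytes of SHA-1."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def responsible_peers(word, peers, redundancy=2):
    """Return the peers that follow the word's position on the ring, wrapping around."""
    ring = sorted((ring_position(p), p) for p in peers)
    positions = [pos for pos, _ in ring]
    start = bisect_left(positions, ring_position(word))
    count = min(redundancy, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

peers = ["peer-alpha", "peer-bravo", "peer-charlie", "peer-delta"]  # hypothetical IDs
owners = responsible_peers("snowden", peers)
print(owners)
```

Any peer can run the same calculation to work out who to ask for a word, without a central directory, as long as it knows a reasonably current list of peers.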

Anonymous Coward

July 8, 2014 at 3:38 pm

Re: Re: Re:

I think that this approach is naive, in that the Internet is far larger than most people can conceive. All the works published by the labels, studios and book publishers are but a pebble on the beach when compared to all the web pages that exist on the Internet, and this size is multiplied several times when public email archives are added to the indexes, and multiplied several more times if you wish to include individual tweets.
Let’s look at a grain of sand on the beach of the Internet. A Google search for Barack Obama gives:
About 58,900,000 results (0.30 seconds)
That is almost 2GB of data just for the links, assuming 30 characters per link. If you want a descriptive paragraph, à la Google, that would be more like 40-50 GB of data. Throw in the rest of the indexes needed to support more refined searches, and that is looking at several hundred GB just to do a decent index for one man. When distributed to user level machines, that part of the index could be spread over several hundred machines. Start scaling up the Internet, and tens of millions of machines are likely required, which makes finding which machines to query a major search in its own right.
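The arithmetic behind those figures can be checked in a few lines. The result count and the 30 characters per link come from the comment above; treating one character as one byte and 1 GB as 10^9 bytes are assumptions made for the check.

```python
results = 58_900_000      # result count quoted from the search above
bytes_per_link = 30       # the comment's assumed link length, one byte per character

link_bytes = results * bytes_per_link
link_gb = link_bytes / 1e9
print(f"links only: {link_gb:.2f} GB")

# The 40-50 GB estimate implies this many characters of description per result.
low_chars = 40e9 / results
high_chars = 50e9 / results
print(f"implied description size: {low_chars:.0f} to {high_chars:.0f} characters")
```

So the links alone come to about 1.77 GB, and the 40-50 GB estimate corresponds to roughly 680 to 850 characters of description per result.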

Gwiz (profile)

July 9, 2014 at 6:52 am

Re: Re: Re: Re:

YaCy’s DHT index only stores what they term a Reverse Word Index (RWI). The entries only associate a word with URLs that contain that word.

When you search, the client receives a list of URLs that contain your search word from your own index and your peers. It then verifies that the word is on each of the resulting URLs’ pages and creates the snippets at that point. The snippets aren’t saved anywhere in the index. Yes, this approach adds some time when waiting for results, but it assures the resulting pages exist and removes bad links from the index.

YaCy seems to be scaling up just fine with over 350 thousand words and almost 2 billion URLs currently.
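Here is a small sketch of the lookup-then-verify flow described above: gather candidate URLs for a word from the local and peer indexes, fetch each page, and only keep the ones where the word really appears, building the snippet at that moment. The function names and the in-memory "pages" are stand-ins for illustration, and the plain substring match is a simplification; this is not YaCy’s code.

```python
def search(word, local_index, peer_indexes, fetch_page, snippet_width=40):
    """Collect candidate URLs for a word, then verify each page before returning it."""
    candidates = set(local_index.get(word, ()))
    for peer_index in peer_indexes:
        candidates |= set(peer_index.get(word, ()))

    hits = {}
    for url in sorted(candidates):
        text = fetch_page(url)          # hypothetical fetcher; None means page is gone
        if text is None:
            continue
        pos = text.lower().find(word.lower())
        if pos == -1:
            continue                    # index entry was stale or fake, so drop it
        start = max(0, pos - snippet_width)
        hits[url] = text[start:pos + len(word) + snippet_width].strip()
    return hits

# Stand-in for the web: a dict lookup instead of real HTTP requests.
pages = {
    "http://a.example/": "YaCy is a peer to peer search engine.",
    "http://b.example/": "Nothing relevant here.",
}
local_index = {"yacy": ["http://a.example/"]}
peer_indexes = [{"yacy": ["http://b.example/", "http://gone.example/"]}]

hits = search("yacy", local_index, peer_indexes, pages.get)
print(hits)
```

Only the index entry that survives verification comes back; the stale and the missing pages are dropped, which is also why nothing but the word-to-URL mapping needs to be stored.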


Distributed Search Engines, And Why We Need Them In The Post-Snowden World

from the easier-said-than-done dept

Comments on “Distributed Search Engines, And Why We Need Them In The Post-Snowden World”
