Can We Clean The Dead Sites Out Of Search Engines, Please?
from the thanks dept
Tech columnist James Derk has written up a fun column detailing a bunch of recent tech annoyances, some of which are amusing and/or dead on. One example is computer companies that overload your computer with useless, intrusive and annoying software that is nearly impossible to remove (such as Quickbooks, which comes with some computers; as Derk points out: “Most large businesses don’t use Quickbooks, most small businesses already have it, and most consumers don’t want it.”) Another amusing (and unfortunate) annoyance he lists is a computer he bought with a non-standard power supply, which the company has run out of and no longer has a supplier for. In other words, when his existing power supply broke, that was it for the computer. However, the most interesting may be his request to clean out the dead sites from the internet. He’s sick of sites that no longer exist clogging up search engines: “I am getting very annoyed that no one has cleaned the Web lately. Lots of the sites I find in search engines no longer exist. I would like a house-cleaning day where we purge all of the search databases and start over. We all ought to have a “do-over” day.” Wouldn’t that be nice?
Comments on “Can We Clean The Dead Sites Out Of Search Engines, Please?”
Re-indexing the web?
How long would it take to re-index the web? What benefits would a re-index provide to the end user? Would the SEO industry be better or worse for it?
Re: Re-indexing the web?
Google constantly crawls web pages. It doesn’t take very long for Google to re-index the entire internet, and they do it quite often. When their spiders hit a dead web page, they don’t index it. Google would return the link only if it had a cache of the page from some time ago.
Re: Re-indexing the web?
It takes about 6 months to index the web, depending on the search engine. This is going on continuously, so, as the author put it, every day is web clean-up day. The web is too big to check for old sites that fast. All I can say is, suck it up and consider using the cached version if you have to.
Why not just delete on visiting?
Why not, instead of re-indexing the web, just come up with a script that deletes sites after someone clicks the link and gets a 404? Or, to prevent all slashdotted sites from being dumped, make it then check them every day for a week or something and see if they’re still down.
Re: Why not just delete on visiting?
The best way to do it is by having a link next to the real one in, say, Google. If it’s broken, you click the link.
Use AJAX to avoid users having to navigate through multiple pages once they click the “Broken Link” link.
Then just have Google automatically re-index those pages and if they don’t exist, then dump them.
I do agree with Urza9814’s idea that you check every day for a week… this ensures that the site wasn’t just down for an hour or so for maintenance.
Re: Re: Reindexing and cleaning the Internet is a non-issu
Google and the others reindex constantly. They do incremental reindexing, hitting pages that change often very frequently and seldom-changed pages only once in a great while. Cleanup does happen, eventually. I agree that dead pages are not the issue on the internet. Link spamming is. They’ve thought about the problem in 100 times the depth that anyone else (including us) has. I used to work for one (search engine).
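To make the incremental reindexing idea concrete, here is a minimal sketch of one way a crawler could adapt how often it revisits a page. It is an illustration only, not a description of how Google or any other engine actually schedules crawls: the function name `next_interval`, the halving/doubling rule, and the bounds are all assumptions.

```python
# Hypothetical bounds on the revisit interval (assumed values).
MIN_HOURS = 1          # never revisit more often than hourly
MAX_HOURS = 24 * 30    # never wait longer than about a month


def next_interval(current_hours: float, changed: bool) -> float:
    """Return the next revisit interval for a page, in hours.

    If the page changed since the last visit, come back sooner (halve
    the interval). If it did not change, back off (double the interval).
    The result is clamped to [MIN_HOURS, MAX_HOURS].
    """
    if changed:
        return max(MIN_HOURS, current_hours / 2)
    return min(MAX_HOURS, current_hours * 2)
```

Under this rule a page that keeps changing drifts toward hourly visits, while a static page drifts toward a monthly check, which is roughly the "often for busy pages, rarely for quiet ones" behavior described above.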
Don't they already?
I think the guy is trying to hype something that’s not a significant issue. I ran a web page in high school that I discovered on Google a few years ago. Since then it’s vanished (it was some free hosting from a buddy’s school; I had no business using it at all). When I googled shortly after that, it didn’t come up anymore (except for archive.org links).
I think the indices are being cleaned up, though probably less frequently, to prevent servers with extensive outages from being dumped just because they had a string of bad luck when the spiders came knocking.
I can imagine a counter for every entry in the search index that increments each time the spider visits unsuccessfully, and resets when it’s successful.
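A rough sketch of that counter, purely as an illustration: the `IndexEntry` class, the `record_visit` function, and the threshold of 5 consecutive failures are made-up names and values, not anything a real search engine is known to use.

```python
from dataclasses import dataclass

# Assumed threshold: drop an entry after this many failed visits in a row.
MAX_FAILURES = 5


@dataclass
class IndexEntry:
    url: str
    failures: int = 0  # consecutive unsuccessful spider visits


def record_visit(entry: IndexEntry, success: bool) -> bool:
    """Update the failure counter and report whether to keep the entry.

    A successful visit resets the counter; a failed one increments it.
    Returns False once the entry has failed MAX_FAILURES times in a row.
    """
    if success:
        entry.failures = 0
    else:
        entry.failures += 1
    return entry.failures < MAX_FAILURES
```

Because one good visit resets the count, a server with a short string of bad luck stays in the index, and only a page that is down visit after visit gets dumped.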
*My* gripe is those http://www.instructionsonhowtocleanyourdigitalcamera.ac links that have nothing to do with what I’m looking for. It’s just a spam site with a shitload of popups and a barrage of search words.
Re: Don't they already?
My biggest gripe (and I do have many!) is with search engines who take advertising dollars and then return those paid ads on searches for which they meet none of the search criteria. Google seems to be the biggest culprit but Ask Jeeves is a close second. Google used to be my go-to spot to find whatever I wanted and I’d get it in the first 6 listings of the first page. Now I have to wade through dozens of pages of returns to find what I was looking for in the first place. Yes, I understand search logic and it isn’t me. Perhaps it’s nostalgia but I really miss the internet of 5-10 years ago!
Re: Re: ***PDF and CGI search indexing are mostly USELESS*
Over the past years, the search engines have gotten progressively worse. For example, I type my search key (or keywords) into Google or MSN or Yahoo or whatever, and what is returned is nothing but useless web pages that have NOTHING to do with any of my search keys. In fact, many of the pages returned are XLS, PDF or even CGI pages that are worse than useless, since now my computer needs to load up an external program to view the content. And if that content is a PDF (thanks to ADOBE), the 4GHz computer halts for close to a full minute just to load an application that I will instantly close, because it is NOT what I wanted.
And what is the point of indexing “CGI” pages without their identifiers? What’s the point of going to page.cgi if all it will do is tell you, “page not found, please use your back button”? Whose brilliant idea was it to index these or any pages that need qualifiers? Message forums, I can understand. But if you are going to index them in your search engine, please index them with their full qualifiers (the full URL *PLUS* the stuff that comes after the “?” and “&” commands).
Re: Re: Re: ***PDF and CGI search indexing are mostly USEL
The reason that is not done is that the qualifiers frequently include identifying information about the user’s status or login. Usually this is harmless and can’t be used; simply copying the link and giving it to someone else won’t make them log in as you. But it can lead to duplication of entries in a search engine. Think about it: if the engine indexed a separate page every time it saw a unique ID in a link, you would flood the search engines with billions of redundant links.
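One middle ground, sketched here only as an illustration, is to keep the qualifiers that select content and drop the ones that merely identify a session, so the same page always maps to one index key. The parameter names in `SESSION_PARAMS` and the `canonicalize` function are assumptions for the example; real sites use many different names, and nothing here reflects what any particular search engine does.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that only carry session identity.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def canonicalize(url: str) -> str:
    """Return the URL with session-only parameters removed.

    Content-selecting parameters are kept (and sorted, so parameter
    order does not create duplicates). The fragment is dropped.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    kept.sort()
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )
```

With this, two visits to the same forum topic under different session IDs collapse to a single URL that still carries the `topic` qualifier and therefore still links to the right page.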
Re: Re: Re:2 ***PDF and CGI search indexing are mostly USEL
My gripe is that these search engines index the CGI and PHP qualifiers (the “?” and “&”), but then they strip them from their links when they include the descriptions in their index. That makes the descriptions USELESS, because they no longer link to the correct pages.
And this thing you mentioned about “logging in as you” is crazy, because 1) if the message forum saves user information (login name/password) in the qualifier portion of .CGI? or .PHP? strings, then what good is even having a login? And 2) how would the search engine even know to log in as any particular user? Most search engines index (or try to) as an anonymous visitor… since most people who use the link the search engine gathers will also ultimately log in as anonymous for their first visit.
Indexing a website by its URL and then modifying it (by removing its qualifiers) and including that new URL in your search engine makes your search engine’s data useless. This is the JUNK that should be removed from all engines, as it does nothing more than clutter the results with “irrelevant” results.
Re: Don't they already?
When I first checked my Google I found a guy named Andrew Strasser who was a felon from California. If you check now you’ll see that you can’t even find that person. I notice changes every day, not just in mine, but in others that I particularly pay attention to. They clean the parts that are used the most, which is wisest, I would think. Google at least; the others I tend not to use much…
No Subject Given
Instead of removing them (and I’m surprised Google doesn’t do this already) maybe they could just add a drop-down list to choose whether to show pages found or re-discovered within the last week, 2 weeks, etc. Maybe save a cookie for those of us who use the Google portal… that’d be a very nice feature.
I see they have options to only show web pages updated in the last 3 months, 6 months, or year on their advanced page, but that’s too tedious for me when I’m feeling lazy… and isn’t exactly going to do what I’d like to see.
This is way off topic, but do you guys ever change the “techdirt poll” on the main page? WTF, it’s been forever.
Re: Re: OFFTOPIC
I actually use the cached results for pages that have disappeared. So, maybe just say “this page is gone, here’s the cached version”. But hey, Google, if you’re listening: keep the cached version, no matter what the guy’s quote in italics way up there in the article says.
Easy solution + fix for a bigger problem
I find that most indexed sites that no longer exist fall into one of two scenarios. Either the page moved, with an automatic forward that leads to a 404, or the page leads directly to a 404 but is usually linked to a lot, so when I do a backward search on who linked to it I find sites that talk about what I’m looking for, and they sometimes even turn out to be helpful in finding it. This is really a nonexistent problem; it’s not worth the aesthetics or space of adding a link next to every result. At most they could put in a dead link submission program that checks daily for a week, then weekly for a month, before completely deleting the record.
If this were offered, it could serve a higher purpose and be used against a much bigger problem I find with search engines. When using general keywords, I can’t stand how many sites with absolutely zero content, tons of banners, and some fake links cheat their way to the top of the list. A submission form should be available where people with, say, Gmail accounts could submit these sites. Following company review, these sites could be removed or at least bumped down. Best of all, most of the work is done by users who have the time and desire to do Google’s job of ensuring quality results. If a user’s submissions turn up too many false positives, they can script ignoring him after a threshold. If this becomes rampant, I’m sure they can come up with an easy algorithm that uses the relationship nature of Gmail to predict future abuse. And they probably won’t offer any “case tracking” for submitters, so they don’t even have to tell you they’re ignoring you.
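The "ignore him after a threshold" part could be as small as the sketch below. Everything in it is hypothetical: the `ReporterStats` class, the minimum of 5 reviewed reports, and the 50% false-positive cutoff are assumed for illustration and do not come from any real search engine.

```python
from collections import defaultdict

# Assumed policy values, chosen only for the example.
MIN_REPORTS = 5        # do not judge a reporter on fewer reviewed reports
MAX_FALSE_RATE = 0.5   # ignore reporters whose false-positive rate exceeds this


class ReporterStats:
    """Tracks how often each reporter's submissions were false positives."""

    def __init__(self):
        self.total = defaultdict(int)
        self.false = defaultdict(int)

    def record(self, reporter: str, was_false_positive: bool) -> None:
        """Record the outcome of one reviewed submission."""
        self.total[reporter] += 1
        if was_false_positive:
            self.false[reporter] += 1

    def is_ignored(self, reporter: str) -> bool:
        """True once a reporter has enough reports and too many were wrong."""
        count = self.total.get(reporter, 0)
        if count < MIN_REPORTS:
            return False
        return self.false.get(reporter, 0) / count > MAX_FALSE_RATE
```

The minimum report count is there so that a new reporter is not shut out over one or two early mistakes.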
I can’t imagine this being more than a one-week project for a small team of talented people, and once the right algorithms are in place and tested, only statistical quality control will need to be done. Team of pigeons, anyone? A false addition to the dead link list could be handled through a form that first compares the URL to the deleted list, to prevent its use as a general submission form. Submissions that pass are automatically reindexed within… 3 or so hours?
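For what it’s worth, the recheck schedule proposed in this comment (daily for a week, then weekly for a month, then delete) fits in a few lines. This is a sketch under assumptions: the function names are made up, and "a month" is read as four weekly checks.

```python
from datetime import date, timedelta
from typing import List


def recheck_dates(reported: date) -> List[date]:
    """Dates on which a reported dead link gets rechecked.

    Seven daily checks starting the day after the report, followed by
    four weekly checks (an assumed reading of "weekly for a month").
    """
    daily = [reported + timedelta(days=d) for d in range(1, 8)]
    weekly = [daily[-1] + timedelta(weeks=w) for w in range(1, 5)]
    return daily + weekly


def should_delete(alive_results: List[bool], total_checks: int = 11) -> bool:
    """Delete the record only if every scheduled check found the page dead.

    alive_results holds one entry per completed check, True if the page
    responded. A single successful check keeps the record.
    """
    return len(alive_results) == total_checks and not any(alive_results)
```

A site that was merely slashdotted or down for maintenance would answer at least one of the eleven checks and stay in the index.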
driver software annoyances
New graphics cards and printers come with driver software that bundles useless “interface panel” software that runs in the background. In some cases, manufacturers offer “minimal” versions of driver software that can be downloaded from their web sites; more often, the entries have to be manually deleted via msconfig.
it would help
Obviously doing so would help… but how long would it take? And even if we were to use a script that searches for any 404 errors, many times websites are only temporarily down. Even so, the scripting idea is pretty good. I believe we should start doing so as quickly as possible with any possible technique. Later on we can always use a more filtered technique after we have already started.
Early but seems fitting..
DO NOT CONNECT TO THE INTERNET FROM MARCH 31st 23:59pm(GMT) UNTIL 12:01am(GMT) APRIL 2nd.
*** Attention ***
It’s that time again!
As many of you know, each year the Internet must be shut down for 24 hours in order to allow us to clean it. The cleaning process, which eliminates dead email and inactive ftp, www and gopher sites, allows for a better-working and faster Internet.
This year, the cleaning process will take place from 23:59pm(GMT) on March 31st until 00:01am(GMT) on April 2nd. During that 24-hour period, five powerful Internet-crawling robots situated around the world will search the Internet and delete any data that they find.
In order to protect your valuable data from deletion we ask that you do the following:
1. Disconnect all terminals and local area networks from their Internet connections.
2. Shut down all Internet servers, or disconnect them from the Internet.
3. Disconnect all disks and hardrives from any connections to the Internet.
4. Refrain from connecting any computer to the Internet in any way.
We understand the inconvenience that this may cause some Internet users, and we apologize. However, we are certain that any inconveniences will be more than made up for by the increased speed and efficiency of the Internet, once it has been cleared of electronic flotsam and jetsam. We thank you for your cooperation.
Fu Ling Yu
Interconnected Network Maintenance staff
Main branch, Massachusetts Institute of Technology
Sysops and others: Since the last Internet cleaning, the number of Internet users has grown dramatically. Please assist us in alerting the public of the upcoming Internet cleaning by posting this message where your users will be able to read it. Please pass this message on to other sysops and Internet users as well.
Re: Early but seems fitting..
About that internet cleaning thing. I’ll have to assume you posted that as a joke, not really seriously believing it. In which case I’ll come off as stupid for even noticing 🙂
(Isn’t April 1 a dead giveaway?)
Re: Re: Early but seems fitting..
(by the way I was referring to the comment above mine, not the original post)
Re: Re: Early but seems fitting..
Hence the title, “Early but seems fitting”. The “early” he’s talking about refers to April 1.
You would think that someone would have already developed some sort of bot to scour the web and gobble up these dead sites. In fact, that gives rise to an idea: Gobble.com. Just type in the site, and if there has not been any activity for some time, goodbye.
Come on techies!
No Subject Given
Well it will be spring soon. Perhaps a spring clean of search sites?