Sebastian's Pamphlets: Blogger to rule search engine visibility?

Wednesday, July 25, 2007

Blogger to rule search engine visibility?

Via Google's Webmaster Forum I found this curiosity:
http://www.stockweb.blogspot.com/robots.txt

User-agent: *
Disallow: /search
Disallow: /

A standard robots.txt at *.blogspot.com looks different:

User-agent: *
Disallow: /search
Sitemap: http://*.blogspot.com/feeds/posts/default?orderby=updated
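To see what the extra "Disallow: /" line means for a crawler, here is a minimal sketch that feeds both robots.txt variants from above to Python's standard urllib.robotparser. The helper name can_crawl and the sample search URL are my own illustrations, and a real crawler like Googlebot may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt found at stockweb.blogspot.com (blocks everything).
BLOCKED = """\
User-agent: *
Disallow: /search
Disallow: /
"""

# The standard *.blogspot.com robots.txt (blocks only /search).
STANDARD = """\
User-agent: *
Disallow: /search
Sitemap: http://*.blogspot.com/feeds/posts/default?orderby=updated
"""

def can_crawl(robots_txt, url, agent="Googlebot"):
    """Hypothetical helper: may `agent` fetch `url` under `robots_txt`?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

POST = "http://stockweb.blogspot.com/2007/07/prague-energy-exchange-starts-trading.html"
SEARCH = "http://stockweb.blogspot.com/search?q=pfts"  # illustrative URL

if __name__ == "__main__":
    for name, rules in (("standard", STANDARD), ("blocked", BLOCKED)):
        print(name, "post:", can_crawl(rules, POST),
              "search:", can_crawl(rules, SEARCH))
```

Under the standard file only the /search pages are off limits. With "Disallow: /" every URL on the blog is blocked, for every crawler that matches "User-agent: *".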

According to the blogger the blog is not private, which would otherwise explain the crawler blocking:

It is a public blog. In the past it had a standard robots.txt, but 10 days ago it changed to "Disallow: /"

Copyscape thinks that the blog in question shares a fair amount of content with other Web pages. So does blog search:
http://stockweb.blogspot.com/2007/07/ukraine-stock-index-pfts-gained-97-ytd.html
has a duplicate, posted by the same author, at
http://business-house.net/nokia-nok-gains-from-n-series-smart-phones/,
http://stockweb.blogspot.com/2007/07/prague-energy-exchange-starts-trading.html
is reprinted at
http://business-house.net/prague-energy-exchange-starts-trading-tomorrow/
and so on. Further investigation would probably reveal more duplicated content.

It's understandable that Blogger is not interested in wasting Google's resources by letting Ms. Googlebot crawl the same contents from different sources. But why do they block other search engines too? And why do they block the source (the posts reprinted at business-house.net state "Originally posted at [blogspot URL]")?

Is this really censorship, or just a software glitch, or is it all the blogger's fault?

Update 07/26/2007: The robots.txt reverted to standard contents for unknown reasons. However, with a shabby link neighborhood as shown in the blog's footer I doubt the crawlers will enjoy their visits. At least the indexers will consider this sort of spider fodder nauseating.

Labels: Blogger, duplicate content, Google, robots.txt


11 Comments:

At Wednesday, July 25, 2007, JLH said…

As much as I'd like to blame the blogger, I've come to expect Blogger (the site) to be buggy and it's probably a bug. They've inserted the NOFOLLOW,NOINDEX tag before on questionable blogs, but then the owners were warned to submit their site for a manual review after fixing it.

PS. If you could get your captcha to about 56 letters that would be just perfect (sarcasm)
At Wednesday, July 25, 2007, Sebastian said…

Yep, but sometimes a Blogger bug is a feature, there is someone who hates crawlers at Blogger. For example they recently inserted NOINDEX,NOFOLLOW on all Google branch blogs, throw rel-nofollow on links like confetti, and there are many more not so funny stories about Blogger and blocking crawlers. Probably they have a saboteur in their team ;)

As for the captcha, you may get what you ask for, so be careful with your blogger wish list :)
At Thursday, July 26, 2007, Anonymous said…

Yeah, probably a Blogger bug. I saw that guy's post in the Google group, and that's the first and only one I had seen like that.

Besides a couple of the NOINDEX snafus, at one point Blogger seemed to be experimenting (or it was a bug then too) with putting a NOINDEX meta tag in suspected spam blogs (ones they required word verification for posting). I wonder if this could have been similar (consider where it occurred) or if it was just a straight-out bug.

But also interesting, in your list of MyBlogLog recent readers, I see a Blogger Programmer in the list. Maybe you caught his attention with your post about it. :-)
At Thursday, July 26, 2007, Sebastian said…

Thanks Dave! It's possible that Pete Hopkins saw this post, "repaired" the robots.txt and forgot to drop a comment. It's also possible that the blogger in question fucked up his settings and silently changed it.

Welcome Pete Hopkins! What do you think of an open discussion of some submitted Blogger bugs and flaws that did not make it onto your "known bugs" blog? For example the extremely illogical handling of rel-nofollow in comments, which complies neither with the rel-nofollow semantics nor with the official Google position on rel-nofollow? Or an easier procedure to label old posts? ... Thanks for your visit :)
At Thursday, July 26, 2007, Vlada, Czech Republic said…

Hello Sebastian and others,
the blog in question is mine. I put some questions for Sebastian at that Google webmasters forum. Can you recommend what I should improve in my HTML code?

1. Are those meta tags useful or not?
2. How can I block this guy (www.business-house.net) who is stealing my feed? I'd like to provide a full feed for readers.
3. I created reciprocal links in the footer to improve PR. Do you think they are useful?

Thank you in advance
At Thursday, July 26, 2007, Sebastian said…

Vlada, if you don't want to block crawlers you don't need robots and googlebot meta tags. "index, follow" is the default.
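As a rough illustration of that default, here is a small sketch of how the content of a robots meta tag resolves to indexing and link-following permissions. The function name effective_robots is hypothetical, and the logic only covers the common values (index, noindex, follow, nofollow, none), not every directive an engine may support.

```python
def effective_robots(content=None):
    """Return (index, follow) for a robots meta tag's content attribute.

    A missing or empty tag means (True, True): "index, follow" is the default.
    """
    tokens = {t.strip().lower() for t in (content or "").split(",") if t.strip()}
    if "none" in tokens:  # "none" is shorthand for "noindex, nofollow"
        return (False, False)
    index = "noindex" not in tokens
    follow = "nofollow" not in tokens
    return (index, follow)

if __name__ == "__main__":
    print(effective_robots())                    # no meta tag at all
    print(effective_robots("index, follow"))     # same result as no tag
    print(effective_robots("NOINDEX,NOFOLLOW"))  # the tag Blogger has inserted
```

So an explicit "index, follow" tag tells a crawler nothing it wouldn't assume anyway.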

I think you've emailed that guy already. If a polite request to stop publishing your feed gets ignored, you can switch to headlines and snippets until he loses interest. Unfortunately, many RSS scrapers reprint even partial feeds. In the meantime you can try a DMCA complaint, but make sure that your blog states clearly that you do not permit reprinting your feed's contents. Technically you can't block that guy from sucking a blogspot feed.

I won't go so far as to say that your reciprocal links to directories and 'services' smelling like link farms are responsible for the changed robots.txt, but they're definitely not useful and I'd remove these useless links as soon as possible.

Reciprocal links put for the sole purpose of boosting PageRank violate Google's guidelines. Even if you don't get penalized for a dozen reciprocal links, these tend to nullify each other. You gain nothing except unwanted attention of spam seeking algos, and you can lose a lot when you participate in public link scams, err schemes.

Reciprocal linkage is not a bad thing in general. Say another blogger mentions your blog in a post and you write a piece linking to him because he has got compelling contents, that's reciprocal but perfectly Ok. Natural linkage can't hurt your rankings.

Now a question for you: do you positively know why your robots.txt reverted to Blogger-standard yesterday?
At Saturday, July 28, 2007, Anonymous said…

Ah, here's a situation that would describe what was seen here, only the owner can confirm it tho.

Let's cover the cases: setting a blog to unlisted in the settings will put in the noindex meta tag if you are using the standard template tags (as it has for years), but doesn't change the new robots.txt.

If you make a blog private (login) then it does make the robots.txt like the one in question, but that can't be it here as we were able to view the blog.

What I have just verified: if the blog is open BUT Blogger has locked the blog from posting until it's been reviewed and cleared as a possible spam blog, then the robots.txt does revert to how that guy's was, and as soon as the review clears, it goes back to normal.

I've verified this on someone else's site that was locked for review. Of course only the owner can verify whether he was under review at the time or not.
At Saturday, July 28, 2007, Sebastian said…

Dave, that would be a good explanation, but he wasn't locked from posting, in fact he posted quite frequently while blocked via robots.txt for two weeks (if that was so). Perhaps he had that captcha thingy applied to his blog and didn't mention it in his various posts, that would explain it too.

Thanks for stopping by to post this report. Perhaps some day we will get this curiosity cleared up :)

Vlada, if you read that, please clarify. Thanks.
At Monday, July 30, 2007, Vlada, Czech Republic said…

Hello Dave and Sebastian,

Yes, I think I was reviewed under Google's spam policy, because I couldn't publish for several days. Posts were only saved and published later on.

I wanted to ask you: at Google Webmasters I still see many URLs restricted by robots.txt (191), like:

http://stockweb.blogspot.com/2007/04/another-offer-for-abn.html
http://stockweb.blogspot.com/2007/04/bmw-grows-in-china.html

Even though robots.txt is not blocking them now?

Thank you
At Monday, July 30, 2007, Sebastian said…

Vlada, you should have told us that in the first place.

As for the crawl errors in your Webmaster console: Look at the date in "last calculated", it tells you when the page in question was blocked by robots.txt. If you find a date shortly after the lift, that may be a caching issue.
At Wednesday, August 01, 2007, Anonymous said…

Vlada, you should have told us that in the first place.

Oh, having all the relevant info just takes all the fun out of it. :-)

I was suspecting that, as I'm seeing more of those site-restricting robots.txt files on blogs that are under review.