Google indexing despite robots.txt block

zeepartner

Hi

This subdomain has about 4'000 URLs indexed in Google, although it's blocked via robots.txt: https://www.google.com/search?safe=off&q=site%3Awww1.swisscom.ch&oq=site%3Awww1.swisscom.ch

This has been the case for almost a year now, and Google does not appear to respect the block in http://www1.swisscom.ch/robots.txt

Any clues why this is or what I could do to resolve it?

Thanks!

john4math

It sounds like Martijn solved your problem, but I still wanted to add that robots.txt exclusions only keep search bots from reading the disallowed pages; they do not stop those pages from being returned in search results. When those pages do appear, they'll often have a page description along the lines of "A description of this page is not available due to this site's robots.txt".

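If you want to ensure that pages are kept out of search engine results, you have to use the noindex meta tag on each page. Keep in mind that a crawler has to be able to fetch a page to see that tag, so the page must not also be blocked in robots.txt.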
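For anyone who wants to check their pages for that tag, here is a minimal Python sketch using only the standard library. The names `RobotsMetaParser` and `is_noindex` are made up for this illustration, and the sample markup is hypothetical. It looks at the generic `robots` meta tag as well as a bot-specific one such as `googlebot`, and it does not cover the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the comma-separated directives of every named <meta> tag."""

    def __init__(self):
        super().__init__()
        # Maps a lowercased meta name (e.g. "robots") to its set of directives.
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        content = attributes.get("content") or ""
        if name:
            values = {v.strip().lower() for v in content.split(",") if v.strip()}
            self.directives.setdefault(name, set()).update(values)


def is_noindex(html, bot="googlebot"):
    """Return True if the page asks all robots, or the given bot, not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    generic = parser.directives.get("robots", set())
    specific = parser.directives.get(bot.lower(), set())
    return "noindex" in generic or "noindex" in specific


# Hypothetical page head carrying the generic tag.
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(is_noindex(page))
```

A page that only carries `<meta name="googlebot" content="noindex">` would be reported as noindex for Googlebot, but not for a call such as `is_noindex(page, bot="bingbot")`.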

zeepartner

Yes, I think the crucial point is that addressing googlebot wouldn't resolve the specific problem I have here.

I would have tried addressing googlebot otherwise. But to be honest, I wouldn't have expected a much different result than specifying all user agents. Googlebot should be part of that exclusion in any case.

Martijn_Scheijbeler

I thought that value was a bit outdated, but it turns out it's still accepted. It would probably only address the issue for him in Google, though, and I assume it would remain an issue in other search engines.

Besides that, the actual problem offered a much better solution: not allowing Google on the HTTPS site.

Kingof5

Specifically for Googlebot. I'm pretty surprised people would disagree - Stephan Spencer recommended this in a personal conversation with me.

Martijn_Scheijbeler

Did you mean a noindex tag for robots in general or a specific one for googlebot? With the second one I'd probably get the downvotes.

Kingof5

People who are disagreeing with this, explain your reasoning.

Kingof5

A noindex tag specific to Googlebot would also be a good idea.

Martijn_Scheijbeler

You're welcome. I mostly found it by noticing that the first result, the homepage, had no snippet while the rest of the pages did have one. That led me to look at their URL structure. Good luck fixing it!

zeepartner

100 points for you Martijn, thanks! I'm pretty sure you've found the problem and I'll go about fixing it. Gotta get used to having https used more frequently now...

Martijn_Scheijbeler

Hi Phillipp,

You almost got me with this one, but it's fairly simple. In your question you're pointing at the robots.txt of your HTTP site. But it's mostly your HTTP**S** pages that are indexed, and if you look at that robots.txt file (https://www1.swisscom.ch/robots.txt) it's pretty clear why these pages are indexed: all the pages that are indexed match either one of your Allow statements or the complete Disallow. Hopefully that provides you with the insight on how to fix your issue.
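To illustrate why the protocol matters: robots.txt is looked up per protocol and host, so the HTTP and HTTPS versions of a site can serve different rules. Below is a rough Python sketch using the standard library's `urllib.robotparser`. The two rule sets and the `www1.example.com` host are made up for illustration and are not the actual contents of either swisscom.ch file; `build_parser` and `crawlable` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical file served over HTTP: everything is blocked.
HTTP_RULES = """\
User-agent: *
Disallow: /
"""

# Hypothetical file served over HTTPS: one directory is allowed, the rest blocked.
HTTPS_RULES = """\
User-agent: *
Allow: /products/
Disallow: /
"""

http_parser = build_parser(HTTP_RULES)
https_parser = build_parser(HTTPS_RULES)


def crawlable(parser, url, agent="Googlebot"):
    """Return True if the parsed rules let the given agent fetch the URL."""
    return parser.can_fetch(agent, url)


for label, parser, url in [
    ("http  /products/tv", http_parser, "http://www1.example.com/products/tv"),
    ("https /products/tv", https_parser, "https://www1.example.com/products/tv"),
    ("https /", https_parser, "https://www1.example.com/"),
]:
    print(label, crawlable(parser, url))
```

One caveat: Python's parser applies the first matching rule in file order, while Google picks the most specific matching rule. For the simple rules in this sketch both approaches agree, but for a real file it is safer to verify with Google's own robots.txt testing tools.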
