
Block dynamic URLs from Googlebot using your robots.txt file

I’ve been trying to figure out how to block some dynamic URLs from Googlebot. Yahoo!’s Slurp and MSNBot accept the same or very similar syntax for blocking dynamic URLs. As an example, I have a line in my .htaccess file that lets me serve static-looking pages instead of dynamic ones, but I found that Googlebot sometimes keeps crawling the dynamic pages anyway. The result is duplicate content, which none of the major search engines look kindly on.

I’m trying to clean up my personals site, as it currently ranks well in Yahoo but not in Google. I suspect MSN Live uses algorithms similar to Google’s; that isn’t scientifically proven in any way, just my own experience from SEO work on my sites and my clients’. I think I’ve found some answers about ranking well with Google, MSN and possibly Yahoo, and I’m in the middle of testing them right now; I’ve already managed to rank a client’s site well in Google for relevant keywords. Anyway, here’s how to block dynamic pages from Googlebot using your robots.txt file. First, the following is an excerpt from my .htaccess file:

RewriteRule ^personals-dating-([0-9]+)\.html$ /index.php?page=view_profile&id=$1 [L]

This rule, in case you’re wondering, lets me serve static pages like personals-dating-4525.html in place of the dynamic link index.php?page=view_profile&id=4525. However, it has caused problems, because Googlebot can and did end up crawling both versions, loading me up with duplicate content. Duplicate content is frowned upon: it creates extra work for Googlebot, which has to crawl the same content under additional URLs, and it may be treated as spam by the algorithm. The bottom line is that duplicate content should be avoided at all costs.
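
If it helps to see what the rule does outside of Apache, here is a quick Python sketch of the same mapping (the pattern is the one from my .htaccess above; Apache matches it against the URL path without the leading slash):

import re

# Same pattern as the .htaccess rule above; Apache matches the
# path without its leading slash.
rewrite = re.compile(r'^personals-dating-([0-9]+)\.html$')

def to_dynamic(path):
    m = rewrite.match(path)
    if m:
        return '/index.php?page=view_profile&id=' + m.group(1)
    return path

print(to_dynamic('personals-dating-4525.html'))
# -> /index.php?page=view_profile&id=4525

Both addresses serve the exact same profile page, which is precisely why Googlebot ends up with duplicates.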

The following is an excerpt from my robots.txt file:

User-agent: Googlebot

Disallow: /index.php?page=view_profile&id=*

Note the “*” (asterisk) at the end of the second line. It tells Googlebot that any run of characters can appear in place of the asterisk: index.php?page=view_profile&id=4525, or the same URL with any other id value, falls under the rule, so none of these dynamic pages will be crawled or indexed. (Strictly speaking, robots.txt rules already match by prefix, so the trailing asterisk is redundant, but it makes the intent explicit.) You can check whether the rules in your robots.txt file work correctly by logging into your Google Webmaster control panel. If you don’t have a Google account, simply create one through Gmail, AdWords, or AdSense and you’ll get access to Google’s webmaster tools; if you want higher rankings, you should have one. Setup is pretty simple and it’s free. Once signed in, click the “Diagnostics” tab, then the “robots.txt analysis tool” link under the Tools section in the left column.
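
If you would rather sanity-check a rule without leaving the command line, here is a small Python sketch of the wildcard matching described above. Note that Python’s built-in urllib.robotparser follows the original robots.txt specification and does not understand the “*” extension, which is why this rolls its own check:

import re

def blocked_by_rule(url_path, disallow_rule):
    # Googlebot-style matching: '*' matches any run of characters,
    # a trailing '$' anchors the end, and everything else is a
    # prefix match, so the pattern is only anchored at the start.
    pattern = re.escape(disallow_rule).replace(r'\*', '.*')
    if pattern.endswith(r'\$'):
        pattern = pattern[:-2] + '$'
    return re.match(pattern, url_path) is not None

print(blocked_by_rule('/index.php?page=view_profile&id=4525',
                      '/index.php?page=view_profile&id=*'))   # True
print(blocked_by_rule('/personals-dating-4525.html',
                      '/index.php?page=view_profile&id=*'))   # False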

By the way, your robots.txt file must sit in your webroot folder so that it is served from the top level of your domain. Googlebot checks your site’s robots.txt file about once a day, and the copy it last fetched is shown in your Google Webmaster control panel in the “robots.txt analysis tool” section.
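
A quick way to confirm the file really is being served from the webroot is to fetch it yourself. A sketch using my domain (swap in your own):

import urllib.request

# Fetch the live robots.txt and make sure it is the version you uploaded.
with urllib.request.urlopen('http://www.personals1001.com/robots.txt') as resp:
    print(resp.status)             # expect 200
    print(resp.read().decode())    # should include the Disallow rules above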

To test your robots.txt file and validate that your rules will work correctly with Googlebot, simply type the URL you want to test into the “Test URL with this robots.txt file” field. I entered the following URL in this field:

http://www.personals1001.com/index.php?page=view_profile&id=4235

I then clicked the “Check” button at the bottom of the page, and the tool confirmed that Googlebot would be blocked from this URL under the rule above. I think this is a better way to keep Googlebot away from these pages than the “URL Removal” tool, which you’ll find in the left column of your Google Webmaster control panel; I’ve read several reports on Google Groups of people having problems with it.
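
For the record, the same check can be reproduced locally with the toy matcher from earlier, or as a self-contained snippet with the rule hand-translated to a regex (anchored at the start only, since robots.txt rules are prefix matches):

import re

# The Disallow rule as a regex: '*' becomes '.*'.
rule = re.compile(r'^/index\.php\?page=view_profile&id=.*')
url = '/index.php?page=view_profile&id=4235'
print('blocked' if rule.match(url) else 'allowed')   # blocked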
