PEAR Articles


Managing robots.txt using PHP: Generating Dynamic Syntax
By: Codex-M
    2010-07-14

    Table of Contents:
  • Managing robots.txt using PHP: Generating Dynamic Syntax
  • Robots.txt using PHP example
  • Creating the PHP file and the static syntax
  • Upload the complete myrobots.txt.php to the website's root directory
  • Revise .htaccess to rewrite myrobots.txt.php to robots.txt


    Managing robots.txt using PHP: Generating Dynamic Syntax - Robots.txt using PHP example


    (Page 2 of 5)

    To make this tutorial easier to follow, I will illustrate it with a real-world example.

    A WordPress-based website has the existing robots.txt shown below (it is not yet dynamically generated using PHP):

    User-agent: *
    Disallow: */trackback
    Disallow: */feed
    Disallow: /searchresultpages
    Disallow: /wp-
    Disallow: /*?
    Disallow: /xmlrpc.php
    Disallow: /blockedbyrobots.php
    Allow: /wp-content/uploads/scripts/PHP-Server-Array-Variables.php
    Disallow: /postpdfcreator.php
    Disallow: /search/
    Disallow: /search
    Disallow: /2009/
    Disallow: /*.js$
    Disallow: /antibot.php
    Disallow: /ajaxwebform/captcha.php
    Disallow: /*.jpg$
    Disallow: /ajaxwebform/ajaxvalidate.php
    Disallow: /hiddentextexample.php
    Disallow: /searchresultpages/
    Allow: /index.php?page_id=123&pg=2
    Allow: /site-map/?pg=2
    Sitemap: http://www.php-developer.org/sitemap.xml

    The problematic line is this one, because it requires periodic manual updating of robots.txt:

    Disallow: /2009/

    This year archive directory needs to be blocked because it contains duplicate posts, and it's not useful for search engines to index it. Crawling it only consumes bandwidth and can result in duplicate content issues.

    However, what if the year is already 2010, or 2011 or even later? The webmaster needs to edit the robots.txt file periodically, which might be inconvenient. It's much worse if he or she forgets to edit the file.

    Unfortunately, you cannot easily block these year directories using wildcards in robots.txt, because wildcards can be dangerous if your website is big. A single wildcard mistake can be disastrous.

    You could block the year archives with a pattern like /20*, but once the year reaches 2100 it would require manual editing again. Using /2* is riskier still, because it would also block any future URL whose path begins with /2.
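
To see why these wildcard patterns are awkward, here is a minimal PHP sketch of how a wildcard-aware crawler might test a path against a pattern. It assumes the common convention that * matches any run of characters and a trailing $ anchors the end of the path. The function name robotsPatternMatches and the sample path /2-column-layouts/ are made up for illustration and are not from the original site. The type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the article): tests whether a URL path
// matches a robots.txt pattern, where '*' matches any run of characters
// and a trailing '$' anchors the end of the path.
function robotsPatternMatches(string $pattern, string $path): bool
{
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // Quote regex metacharacters, then turn the escaped '*' back into '.*'.
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));
    $regex = '#^' . $regex . ($anchored ? '$' : '') . '#';

    return preg_match($regex, $path) === 1;
}

// Sample paths; '/2-column-layouts/' is an invented future URL.
$paths = ['/2009/', '/2010/05/some-post/', '/2100/', '/2-column-layouts/', '/about/'];

foreach (['/20*', '/2*'] as $pattern) {
    foreach ($paths as $path) {
        $verdict = robotsPatternMatches($pattern, $path) ? 'blocked' : 'allowed';
        printf("%-5s %-22s %s\n", $pattern, $path, $verdict);
    }
}
```

Under these assumptions, /20* leaves /2100/ unblocked, while /2* also blocks the unrelated /2-column-layouts/ path. Real crawlers may differ in how they interpret wildcards, so treat this only as an illustration.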

    To make this process as efficient and safe as possible, you will use PHP to generate the robots.txt content automatically.
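
As a rough preview of the idea, and not the script this tutorial builds on the following pages, the year rule could be generated along these lines. The function name yearDisallowLines is hypothetical, and the sketch assumes that 2009 is the first year with posts and that every year archive from then until the current year should be blocked.

```php
<?php
// Hypothetical helper (not from the article): builds one Disallow line per
// year archive, from the first year with posts through $currentYear.
function yearDisallowLines(int $firstYear, int $currentYear): array
{
    $lines = [];
    for ($year = $firstYear; $year <= $currentYear; $year++) {
        $lines[] = "Disallow: /{$year}/";
    }

    return $lines;
}

// Assumption: 2009 is the site's first year of posts.
// date('Y') returns the server's current four-digit year.
$lines = yearDisallowLines(2009, (int) date('Y'));

// Send the plain-text header only when running under a web server.
if (PHP_SAPI !== 'cli') {
    header('Content-Type: text/plain; charset=utf-8');
}

echo "User-agent: *\n";
echo implode("\n", $lines), "\n";
```

Because the year list is computed on each request, no one has to remember to edit the file when a new year starts. The static rules from the existing robots.txt would still need to be printed alongside these lines.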
