Managing robots.txt using PHP: Generating Dynamic Syntax - Creating the PHP file and the static syntax
Let's name the file myrobots.txt.php (you can name it anything you like). The following is the initial syntax, taken from the existing robots.txt shown earlier -- except for the year syntax, which needs special processing:
This script assigns the part of the existing robots.txt syntax that does not require dynamic editing to a PHP variable, $currentsyntax, and then echoes it to the browser as a text file.
This script is not yet complete, as it does not yet block the problematic "year" directory.
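The static part described above can be sketched as follows. This is a minimal sketch, not the site's actual file: the directives are placeholder rules, static_robots_syntax() is a hypothetical helper name, and the header() call is an addition that the original script is not known to make.

```php
<?php
// Sketch of the static part of myrobots.txt.php.
// The rules below are placeholders, not the real robots.txt of any site.
function static_robots_syntax(): string
{
    $currentsyntax  = "User-agent: *\r\n";
    $currentsyntax .= "Disallow: /wp-admin/\r\n";    // placeholder rule
    $currentsyntax .= "Disallow: /wp-includes/\r\n"; // placeholder rule
    return $currentsyntax;
}

// Assumption: declare the response as plain text so it is not treated as HTML.
if (!headers_sent()) {
    header('Content-Type: text/plain; charset=utf-8');
}

echo static_robots_syntax();
```

Keeping the static rules in one variable means the dynamic "year" rules can simply be echoed after them.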
Dynamically generate syntax to block the "Year" directory
In WordPress, you can use PHP to query the MySQL database and retrieve post dates from the wp_posts table. You can then use string manipulation to extract the year from each date. If you are not using WordPress, you can do the same thing by following techniques similar to those discussed in this tutorial.
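The year extraction can be illustrated with a minimal sketch. It assumes the post dates have already been fetched and are in the MySQL DATETIME format (YYYY-MM-DD HH:MM:SS); post_year_range() is a hypothetical helper, and the sample dates stand in for real database rows.

```php
<?php
// Hypothetical helper: returns [first year, latest year] for a list of
// DATETIME strings such as '2005-03-14 10:22:00'.
function post_year_range(array $postdates): array
{
    if (count($postdates) === 0) {
        throw new InvalidArgumentException('At least one post date is required.');
    }

    $years = [];
    foreach ($postdates as $postdate) {
        // The first four characters of the DATETIME string are the year.
        $years[] = (int) substr($postdate, 0, 4);
    }

    return [min($years), max($years)];
}

// Sample values standing in for rows fetched from the database.
$sampledates = ['2007-06-01 09:15:00', '2005-03-14 10:22:00', '2010-01-20 18:40:00'];
list($firstpostyear, $latestpostyear) = post_year_range($sampledates);
```

The two resulting values play the role of $firstpostyear and $latestpostyear in the loop that generates the Disallow lines.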
Once the year has been extracted, it is concatenated with the robots.txt Disallow directive. The comments in the code below explain each step:
//Initialize the while loop, using the year of the first post as the starting value.
//The loop runs until the latest post year has been reached.
$i=$firstpostyear;
echo strip_tags(nl2br("\r\n# Start of dynamically generated robots.txt syntax"));
while ($i<=$latestpostyear) {
echo strip_tags(nl2br("\r\nDisallow: /".$i++."/"));
}
echo strip_tags(nl2br("\r\n# End of dynamically generated robots.txt syntax"));
Because the output is rendered as a text file in the browser, each string is wrapped in strip_tags(nl2br("Contents to render in the browser...")). The \r\n sequence adds a line break, which keeps the robots.txt syntax clean and readable, and strip_tags() removes HTML tags (e.g. the <br /> that nl2br() inserts) so that they do not appear in the text output.
Finally, you can add the last two remaining pieces of PHP code for the sitemap reference in the robots.txt file:
To add space or line breaks (the same function as <br /> in HTML output) to the robots.txt file, this line is used:
echo strip_tags(nl2br("\r\n"));
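A minimal sketch of the closing sitemap reference follows. The sitemap URL is a hypothetical placeholder (https://www.example.com/sitemap.xml), and sitemap_reference() is an illustrative helper rather than part of the original script.

```php
<?php
// Illustrative helper: a blank line followed by the Sitemap directive.
function sitemap_reference(string $sitemapurl): string
{
    // The first "\r\n" ends the previous line, the second leaves a blank line.
    return "\r\n\r\nSitemap: " . $sitemapurl;
}

// Placeholder URL; replace it with the real location of your sitemap.
echo sitemap_reference('https://www.example.com/sitemap.xml');
```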
The most important line in the PHP script mentioned above is this:
echo strip_tags(nl2br("\r\nDisallow: /".$i++."/"));
This will generate the actual robots.txt "Disallow" syntax for the year directory. So if your WordPress-based site has been in existence since 2005 and you've been updating its content through the present (2010), the generated syntax will be:
Disallow: /2005/
Disallow: /2006/
Disallow: /2007/
Disallow: /2008/
Disallow: /2009/
Disallow: /2010/
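To check the generated lines without loading them in a browser, the loop can be wrapped in a function that returns a string. This is a minimal sketch: year_disallow_lines() is a hypothetical name, and it builds plain strings without the strip_tags(nl2br()) wrapper.

```php
<?php
// Hypothetical helper: builds one "Disallow: /YYYY/" line per year,
// from the first post year through the latest post year (inclusive).
function year_disallow_lines(int $firstpostyear, int $latestpostyear): string
{
    $lines = '';
    for ($year = $firstpostyear; $year <= $latestpostyear; $year++) {
        $lines .= "\r\nDisallow: /" . $year . "/";
    }
    return $lines;
}

// Produces the six lines for 2005 through 2010.
echo year_disallow_lines(2005, 2010);
```

If the first year is later than the latest year, the function returns an empty string, so no Disallow lines are emitted.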
Warning: Do not forget to back up any existing files (robots.txt, .htaccess, etc.) before editing or uploading anything that could overwrite them.