Using PCREs - Section 2
The above five examples are the best way of introducing the rules I'm about to give. Just as a reminder, I have set up a regular expression tester at: http://samuelfullman.com/team/php/tools/regular_expression_tester_p.php
This tester is great because you can build a pattern string, then either paste in text to search against or specify a URL on the Web. Here's Sam's Rule #2 of Regular Expressions:
BUILD YOUR REGULAR EXPRESSIONS UP STEP BY STEP, TESTING VARIATIONS OF SEARCHED STRINGS AT EVERY STEP.
The tester I designed will allow you to do that.
Let's go back to the simple href examples above:
Case 1
<A HREF = http://compasspointmedia.com>Click here</a>
Again, this is perfectly valid in any browser. The problem is that we have spaces. We could also have tabs or newline characters.
Enter Sam's Rule #3 of Regular Expressions:
ALWAYS COMPENSATE AND ACCOUNT FOR WHITESPACE!
As you may know, browsers collapse whitespace: a run of more than one whitespace character is displayed as a single space. In Perl-compatible regexes, whitespace characters (including tab chr(9), newline chr(10), carriage return chr(13) and the space) are designated by \s. So let's rewrite our regex to handle this:
/<a(\s)+href(\s)*=(\s)*"[^"]+"(\s)*>.*<\/a(\s)*>/i
I've added (\s) where whitespace could conceivably be in the string. Notice that after the first <a, there must be at least one whitespace character, hence the + sign afterwards. The whitespace in the </a> tag is unlikely but again, it's legal for browsers and we want to account for its possible presence.
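Following Rule #2, here is a minimal sketch for trying this pattern against a few variations. It uses Python's re module rather than the tester, on the assumption that the pattern behaves the same there; the /.../i delimiters and flag become re.IGNORECASE, and the names LINK_RE and samples are made up for this illustration. Note that the pattern expects the URL in double quotes, so an unquoted URL is included as a non-matching case.

```python
import re

# The whitespace-tolerant pattern from above, minus the /.../i delimiters.
# LINK_RE and samples are illustrative names, not part of the article.
LINK_RE = re.compile(
    r'<a(\s)+href(\s)*=(\s)*"[^"]+"(\s)*>.*<\/a(\s)*>',
    re.IGNORECASE,
)

samples = [
    # no extra whitespace
    '<a href="http://compasspointmedia.com">Click here</a>',
    # upper case, spaces around the equals sign
    '<A HREF = "http://compasspointmedia.com">Click here</a>',
    # tabs and newlines inside the tag, space inside the closing tag
    '<a\thref\n=\n"http://compasspointmedia.com" >Click here</a >',
    # no whitespace after <a, so (\s)+ cannot match
    '<ahref="http://compasspointmedia.com">Click here</a>',
    # unquoted URL, which this pattern does not handle
    '<A HREF = http://compasspointmedia.com>Click here</a>',
]

results = [bool(LINK_RE.search(s)) for s in samples]
for s, ok in zip(samples, results):
    print(ok, repr(s))
```

The first three strings should match and the last two should not.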
Case 2
<a name="link" href="http://compasspointmedia.com">Click here</a>
This is pretty obvious; attributes don't have to be in any order. Great for writing HTML, hard for regexes. You have to think strategically on this one. Here's how we account for it:
/<a(\s)+[^>]*href(\s)*=(\s)*"[^"]+"[^>]*(\s)*>.*<\/a(\s)*>/i
Basically, I've added the string [^>]*, which means, in English, "anything except a close bracket (>) character, zero or more times." In other words, the first > closes the opening <a> tag, so anything past it is no longer inside the tag. Since we don't know where the href attribute will be declared in the tag, this works.
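To check that attribute order no longer matters, here is a small sketch in Python's re module (an assumption for illustration; the /.../i delimiters and flag become re.IGNORECASE). ANY_ORDER_RE and the sample strings are hypothetical names and inputs, not part of the tester.

```python
import re

# The Case 2 pattern from above, minus the /.../i delimiters.
# ANY_ORDER_RE is an illustrative name.
ANY_ORDER_RE = re.compile(
    r'<a(\s)+[^>]*href(\s)*=(\s)*"[^"]+"[^>]*(\s)*>.*<\/a(\s)*>',
    re.IGNORECASE,
)

samples = [
    # name before href
    '<a name="link" href="http://compasspointmedia.com">Click here</a>',
    # href before name
    '<a href="http://compasspointmedia.com" name="link">Click here</a>',
    # no href at all: [^>]* cannot cross the first > to go looking for one
    '<a name="link">Click here</a>',
]

results = [bool(ANY_ORDER_RE.search(s)) for s in samples]
for s, ok in zip(samples, results):
    print(ok, repr(s))
```

The first two strings should match regardless of attribute order; the third should not, because there is no href before the closing bracket of the tag.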
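A little thought will tell you that if we were requiring TWO attributes like href AND name, it might get a little ugly. The regex for this is fallible: it matches the cases where both are present, but because the catch-all alternative [^>]* can also match an empty string, it does not strictly reject tags that are missing one of them. This is WAY complex. You can skip this next one if you want but here it is: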
/<a(\s)+[^>]*((href(\s)*=(\s)*"[^"]+")|(name(\s)*=(\s)*"[^"]+")|([^>]*)){2,}(\s)*>.*<\/a(\s)*>/i
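Testing is worth the effort on this one. In the sketch below, again using Python's re module as a stand-in, the pattern above also accepts a tag that has only href, since the {2,} can be satisfied by the empty-matching alternative. One possible alternative, which is my substitution here and not the pattern above, is a lookahead assertion per required attribute. ORIGINAL_RE, BOTH_RE and the sample strings are illustrative names.

```python
import re

# The two-attribute pattern from above on one line, minus the /.../i delimiters.
ORIGINAL_RE = re.compile(
    r'<a(\s)+[^>]*'
    r'((href(\s)*=(\s)*"[^"]+")|(name(\s)*=(\s)*"[^"]+")|([^>]*)){2,}'
    r'(\s)*>.*<\/a(\s)*>',
    re.IGNORECASE,
)

# Alternative sketch (not the article's pattern): each (?=...) lookahead
# checks for one attribute somewhere before the first >, consuming nothing,
# so both attributes are required and their order does not matter.
BOTH_RE = re.compile(
    r'<a(?=[^>]*\shref\s*=\s*"[^"]+")'
    r'(?=[^>]*\sname\s*=\s*"[^"]+")'
    r'\s[^>]*>.*<\/a\s*>',
    re.IGNORECASE,
)

both = '<a name="link" href="http://compasspointmedia.com">Click here</a>'
href_only = '<a href="http://compasspointmedia.com">Click here</a>'

original_results = [bool(ORIGINAL_RE.search(s)) for s in (both, href_only)]
lookahead_results = [bool(BOTH_RE.search(s)) for s in (both, href_only)]

print("original: ", original_results)
print("lookahead:", lookahead_results)
```

The lookahead version is still only a sketch: it can be fooled by attribute values that themselves contain text like href="..." or a > character.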