Using PCREs - Section 1
OK, so hang on to your hats. Take, for example, trying to find a hyperlink on a web page (an <a> tag with an href attribute). Here is a link in its simplest form:
<a href="http://compasspointmedia.com">click here</a>
And here is the minimum regular expression that would find this using PCREs:
/<a href="[^"]+">.*<\/a>/i
You'll notice the "wrapper" of slashes and the 'i' on the end, that is /....../i. This is how it's done in Perl. The i stands for case-insensitive, by the way. You actually aren't constrained to use a '/' as your delimiter, but I usually do. Since I use a forward slash as my wrapper, I must "escape" any forward slash character inside the delimiters with a backslash, like this: \/, so the regex engine doesn't think the pattern has ended.
Now, the brackets [] enclose a character or set of characters, and a ^ right after the opening bracket means NOT or EXCEPT FOR, so this part [^"]+ means "any character except a double quote, at least once." Together with the literal text around it, this covers the opening <a> tag and its href. Then we specify any character (.), which by default means any character other than a newline, and the * means "zero or more times". Finally we want the closing tag (<\/a>).
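To see those pieces on their own, here is a minimal sketch in Python. Python's re module is not PCRE itself, but for these particular constructs I'd expect it to behave the same way. The variable names (not_quote, empty_text) and the sample strings are made up for illustration.

```python
import re

# [^"]+ : one or more characters that are NOT a double quote
not_quote = re.compile(r'[^"]+')
m = not_quote.match('http://compasspointmedia.com">click here')
print(m.group())  # everything up to, but not including, the first quote

# The + demands at least one character, so a string that starts
# with a double quote gives no match at all.
print(not_quote.match('"') is None)

# .* is happy with zero characters, so an empty link text still matches.
empty_text = re.compile(r'>.*</a>')
print(empty_text.search('<a href="x"></a>') is not None)
```

Note that Python has no delimiters around the pattern, so the forward slash in </a> needs no backslash here.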
Again, the "i" at the end means a case-insensitive search. Basically, in English, this regex is saying the following: "Find an opening angle bracket, an 'a', then a space, then href=, then a double quote, then ANYTHING EXCEPT FOR A DOUBLE QUOTE, ONE OR MORE TIMES. Then find another double quote, then a closing angle bracket. Then find any characters you want, but you have to end it with a closing tag </a>". (whew!)
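Here is a rough sketch of the whole pattern at work, again in Python rather than a true PCRE engine (an assumption on my part that the two agree for a pattern this simple). The delimiters are dropped and the trailing i becomes the re.IGNORECASE flag. LINK_RE and the sample html string are hypothetical names, not anything from a library.

```python
import re

# The article's /<a href="[^"]+">.*<\/a>/i, written the Python way:
# no delimiters, so "/" needs no escaping, and "i" becomes re.IGNORECASE.
LINK_RE = re.compile(r'<a href="[^"]+">.*</a>', re.IGNORECASE)

# Upper-case tag and attribute names, to exercise the case-insensitive flag.
html = '<p>Visit <A HREF="http://compasspointmedia.com">click here</A> today.</p>'

match = LINK_RE.search(html)
print(match.group(0))  # the whole link, from <A ... to </A>
```

The match covers only the link itself, not the surrounding paragraph text.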
Regex is cool! Reason: You don't have to know what the href is, OR what the link text is (click here) for that matter. However, if I were looking for links on a web page, I would NEVER use the above pattern for finding an href. Here's why:
Case 1: <a href = "http://compasspointmedia.com">click here</a>
Case 2: <a name="link" href="http://compasspointmedia.com">Click here</a>
Case 3: <a href='http://compasspointmedia.com'>click here</a>
Case 4: <a href="http://compasspointmedia.com"> click here or anywhere in this paragraph </a>
Case 5: <A HREF=http://compasspointmedia.com>Click here</a><a href="http://amazon.com">go to amazon</a>
These are five examples of VALID hrefs (try them in a web page and see) that your browser would recognize, but which your regex string would not. So that brings me to Sam's Rule #1 of Regular Expressions:
ALWAYS DESIGN YOUR REGULAR EXPRESSION TO BE AS SMART AS YOUR BROWSER!
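If you want to check this for yourself, here is a small sketch that runs the simple pattern against some of the cases. It uses Python's re module as a stand-in for PCRE, which I'm assuming is close enough for this pattern, and the names LINK_RE, cases, results and found are just for illustration. Case 4 is left out because its outcome depends on whether the link text contains line breaks (by default the dot does not match a newline).

```python
import re

# Same simple pattern as in the article, with the "i" flag.
LINK_RE = re.compile(r'<a href="[^"]+">.*</a>', re.IGNORECASE)

cases = {
    "Case 1": '<a href = "http://compasspointmedia.com">click here</a>',
    "Case 2": '<a name="link" href="http://compasspointmedia.com">Click here</a>',
    "Case 3": "<a href='http://compasspointmedia.com'>click here</a>",
}

# True means the pattern found a link, False means it missed it.
results = {label: LINK_RE.search(html) is not None for label, html in cases.items()}
for label, hit in results.items():
    print(label, "matched" if hit else "missed")

# Case 5 holds two links: the first has an unquoted href, the second is "simple".
case5 = ('<A HREF=http://compasspointmedia.com>Click here</a>'
         '<a href="http://amazon.com">go to amazon</a>')
found = LINK_RE.findall(case5)
print(found)  # only the quoted amazon.com link turns up
```

Extra spaces around the equals sign, another attribute before href, single quotes and missing quotes are each enough to trip up the simple pattern.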