Using PCREs - Section 3
Case 3:
<a href='http://compasspointmedia.com'>click here</a>
What's different here? The single quotes instead of double quotes. By the way, things are getting complicated, so please excuse me if my lines start to run over; in a real regex string you MUST keep everything on one line, or the regex engine will treat the newline as a literal character in the pattern.
Back to quotes. You know that the pipe (|) character means "or". Quotes around an href attribute value are, I believe, optional according to the W3C; either way, unquoted values work in a browser. So I handle this by putting an optional either-quote group, ('|")*, on each side of the value, which would get "value", 'value', or value. (It would also get "value' and so on, but if anyone writes hrefs that way they're an idiot.)
So let's expand our all-inclusive href regex a bit more:
/<a(\s)+[^>]*href(\s)*=(\s)*('|")*[^"]+('|")*[^>]*(\s)*>.*<\/a(\s)*>/i
So that's Sam's Rule #4 for Regular Expressions:
ALWAYS REMEMBER TO ACCOUNT FOR DIFFERENT QUOTE TYPES!
Believe me, this hurts me to write as much as it hurts you to read, but if we want to obey Sam's Rule #1, we write it this way. If you're like me, you will soon develop a library of these expressions so you can reuse them.
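As a quick sanity check of the quote-tolerant pattern above, here is a minimal sketch in Python. It assumes Python's re module behaves like PCRE for this particular pattern; the /.../i delimiters and modifier become re.IGNORECASE, and the names HREF_LINK, samples, and results are made up for the illustration.

```python
import re

# The article's expanded href pattern, minus the /.../i delimiters.
# re.IGNORECASE plays the role of the trailing "i" modifier.
HREF_LINK = re.compile(
    r"""<a(\s)+[^>]*href(\s)*=(\s)*('|")*[^"]+('|")*[^>]*(\s)*>.*<\/a(\s)*>""",
    re.IGNORECASE,
)

# Hypothetical test strings: three quoting styles plus one non-link anchor.
samples = [
    '<a href="http://compasspointmedia.com">click here</a>',  # double quotes
    "<a href='http://compasspointmedia.com'>click here</a>",  # single quotes
    "<a href=http://compasspointmedia.com>click here</a>",    # no quotes
    '<a name="top">not a link</a>',                           # no href at all
]

results = [HREF_LINK.search(s) is not None for s in samples]
print(results)  # [True, True, True, False]
```

All three quoting styles match, and the anchor without an href does not.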
Now on to case 4 (guaranteed to save you some real frustration):
Case 4:
<a href="http://compasspointmedia.com">
click here or anywhere in this paragraph
</a>
Believe me, you CAN get HTML output like this on the web (especially when a scripting language like PHP or ASP is outputting it).
This brings up one of the biggest revelations I had when searching for strings: the PCRE string .* (any character, any number of TIMES) does NOT match newline characters by default. The dot is happy to match spaces and tabs, but it stops dead at a line break.
In other words, the following will NOT work:
/<a href="[^"]+">.*<\/a>/i
That dot just doesn't get it, sorry pal. Wherever you'd use .* and need to span lines, what you need instead is (.|\s)* (any character OR any whitespace, newlines included). So then this WOULD work:
/<a href="[^"]+">(.|\s)*<\/a>/i
This will get a whole multi-line paragraph of text; the dot-star alone will only get text up to the end of the line.
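The difference is easy to see side by side. The sketch below uses Python's re module as a stand-in for PCRE (an assumption; the default dot behaviour is the same in both), and the line breaks inside the sample link are assumed for illustration.

```python
import re

# Case 4 with line breaks inside the link text (break positions assumed).
html = (
    '<a href="http://compasspointmedia.com">\n'
    "click here or anywhere in this paragraph\n"
    "</a>"
)

# Plain dot-star: the dot refuses to cross a newline.
dot_star = re.compile(r'<a href="[^"]+">.*<\/a>', re.IGNORECASE)
# Dot OR whitespace: \s covers the newlines the dot skips.
dot_or_space = re.compile(r'<a href="[^"]+">(.|\s)*<\/a>', re.IGNORECASE)

dot_star_match = dot_star.search(html)
dot_or_space_match = dot_or_space.search(html)

print(dot_star_match)                      # None
print(dot_or_space_match.group(0) == html) # True
```

In PCRE, and in Python via re.DOTALL, the s modifier is another way to let the dot match newlines, but the (.|\s)* form works without any modifier.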
That yields Sam's Rule #5 for Regular Expressions:
REMEMBER TO USE (.|\s)* TO FIND LARGE BLOCKS OF HTML.
Let's take a look at the final case:
Case 5:
<A HREF=http://compasspointmedia.com>Click here</a><a href="http://amazon.com">go to amazon</a>
The problem here is greed. Yes, computer greed. And greed is not good for computers. This means that our friend below:
/<a href="[^"]+">(.|\s)*<\/a>/i
is going to look at case 5 (once the quote handling from Rule #4 is folded in) as one huge href to compasspointmedia.com, with everything in between (Click here through amazon) being the text for the href. You and I know better, and so does the browser, so make sure your regex can tell by adding a question mark character:
/<a href="[^"]+">(.|\s)*?<\/a>/i
What the question mark character is really telling the regex engine, in English, is: "Get as few instances as you can of the previous character or set of characters, and STOP GETTING MORE as soon as you encounter your first instance of the remainder of the string." In this case the remainder of the string is <\/a>, so using a ? will find TWO matches.
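Here is a small sketch of greedy versus non-greedy matching, again using Python's re module as a stand-in for PCRE. To keep the simplified pattern applicable, it assumes both links in case 5 use double-quoted hrefs; the variable names are made up for the example.

```python
import re

# Case 5, simplified: two links back to back, both double-quoted (assumed).
html = (
    '<a href="http://compasspointmedia.com">Click here</a>'
    '<a href="http://amazon.com">go to amazon</a>'
)

greedy = re.compile(r'<a href="[^"]+">(.|\s)*<\/a>', re.IGNORECASE)
lazy = re.compile(r'<a href="[^"]+">(.|\s)*?<\/a>', re.IGNORECASE)

# group(0) is the whole matched text for each match found.
greedy_matches = [m.group(0) for m in greedy.finditer(html)]
lazy_matches = [m.group(0) for m in lazy.finditer(html)]

print(len(greedy_matches))  # 1 (one huge match spanning both links)
print(len(lazy_matches))    # 2 (one match per link)
```

The greedy pattern runs to the LAST </a> it can find, while the version with the ? stops at the first one and then starts over for the next link.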
This of course is Sam's Rule #6 for Regular Expressions:
GREED IS NOT GOOD (USE THE ? TO AVOID SPANNING MULTIPLE INSTANCES)
Depending on the complexity of what you're looking for, regexes can get more complicated. If you'd like some links to my regex string library, email me and I'd be happy to provide them. I have started to develop this library on the regular expression tester I've provided you.