Using PCREs
By: Codewalkers
    2002-11-16

    Table of Contents:
  • Using PCREs
  • Section 1
  • Section 2
  • Section 3
  • Wrapping Things Up

    Using PCREs - Section 3


    (Page 4 of 5 )

    Case 3
    <a href='http://compasspointmedia.com'>click here</a>

    What's different here? The single quote vs. the double quote. By the way, we're getting complicated here, so please excuse me if my lines start to run over; in a real regex string, you MUST put everything on the same line, or the regex engine will interpret the newline character as a character in the pattern.

    Back to quotes. You know that the pipe (|) character means "or". Quotes around an href attribute value are, I believe, optional according to the W3C; in any case, unquoted values work in a browser, so I handle this the following way:

    /('|")*value('|")*/i

    which would get "value", 'value', or value. (Would also get "value' etc. but if anyone writes hrefs that way they're an idiot.)
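
    As a quick check of that claim, here is a minimal sketch in Python. Python's re module is not PCRE, but it treats this pattern the same way; the sample strings are made up for illustration.

```python
import re

# The article's pattern, with "value" as the literal text being looked for.
QUOTED_VALUE = re.compile(r"""('|")*value('|")*""", re.IGNORECASE)

# Hypothetical attribute strings: double-quoted, single-quoted, unquoted.
samples = ['href="value"', "href='value'", "href=value"]

# Collect the full text matched in each sample, including any quotes.
matched = [QUOTED_VALUE.search(s).group(0) for s in samples]
print(matched)
```

    All three forms are found, and the surrounding quotes (when present) are part of the match.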

    So let's expand our all-inclusive href regex a bit more:

    /<a(\s)+[^>]*href(\s)*=(\s)*('|")*[^"]+('|")*[^>]*(\s)*>.*<\/a(\s)*>/i

    So that's Sam's Rule #4 for Regular Expressions:

    ALWAYS REMEMBER TO ACCOUNT FOR DIFFERENT QUOTE TYPES!
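
    To see Rule #4 at work, the expanded href pattern can be tried against one link in each quoting style. This is only a sketch in Python (assumed to behave like PCRE for these constructs); the three test links reuse the article's example URL.

```python
import re

# The expanded href pattern from the article, copied as-is into a raw string.
HREF = re.compile(
    r"""<a(\s)+[^>]*href(\s)*=(\s)*('|")*[^"]+('|")*[^>]*(\s)*>.*<\/a(\s)*>""",
    re.IGNORECASE,
)

links = [
    '<a href="http://compasspointmedia.com">click here</a>',  # double quotes
    "<a href='http://compasspointmedia.com'>click here</a>",  # single quotes
    "<A HREF=http://compasspointmedia.com>Click here</a>",    # no quotes, upper case
]

# True for every link the pattern recognizes.
results = [HREF.search(link) is not None for link in links]
print(results)
```

    Each of the three links is recognized, which is the whole point of making the quotes optional.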

    Believe me, this hurts me to write as much as it hurts you to read, but if we want to obey Sam's Rule #1, we write it this way. If you're like me, you will soon develop a library of these expressions so you can reuse them.

    Now on to case 4 (guaranteed to save you some real frustration):

    Case 4
    <a href="http://compasspointmedia.com">
    click
    here
    or 
    anywhere in
    this paragraph
    </a>

    Believe me, you CAN get HTML output like this on the web (especially when a scripting language like PHP or ASP is outputting it).

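    This brings up one of the biggest revelations I had when looking for strings: the PCRE string .* (any character, any number of TIMES) does NOT include newlines. By default the dot matches any character except a newline (spaces and tabs are fine), so .* stops at the end of the line.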

    In other words, the following will NOT work:

    /<a href="[^"]+">.*<\/a>/i

    That dot just doesn't get it, sorry pal. Wherever you'd use .*, what you need is the following:

    (.|\s)*

    So then this WOULD work:

    /<a href="[^"]+">(.|\s)*<\/a>/i

    This will get text that runs over several lines; the plain dot-star will only get text up to the end of the current line.
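
    The difference is easy to demonstrate on Case 4. The sketch below uses Python's re module as a stand-in for PCRE. The third search is an aside that is not in the article: Python's re.DOTALL flag (the counterpart of PCRE's s modifier) lets the plain dot match newlines too.

```python
import re

# Case 4 from the article: the link text is spread over several lines.
case4 = (
    '<a href="http://compasspointmedia.com">\n'
    "click\nhere\nor\nanywhere in\nthis paragraph\n"
    "</a>"
)

# Plain dot-star: the dot will not cross a newline, so this finds nothing.
dot_only = re.search(r'<a href="[^"]+">.*<\/a>', case4, re.IGNORECASE)

# The article's fix: any character OR any whitespace (newlines included).
dot_or_space = re.search(r'<a href="[^"]+">(.|\s)*<\/a>', case4, re.IGNORECASE)

# Alternative (not from the article): let the dot itself match newlines.
dotall = re.search(r'<a href="[^"]+">.*<\/a>', case4, re.IGNORECASE | re.DOTALL)

print(dot_only is None, dot_or_space is not None, dotall is not None)
```

    Only the first search comes back empty; the other two match the whole multi-line link.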

    That yields Sam's Rule #5 for Regular Expressions:

    REMEMBER TO USE (.|\s)* TO FIND LARGE BLOCKS OF HTML.

    Let's take a look at the final case:

    Case 5
    <A HREF=http://compasspointmedia.com>Click here</a><a href="http://amazon.com">go to amazon</a>

    The problem here is greed. Yes, computer greed. And greed is not good for computers. This means that our friend below (with the quotes made optional, per Rule #4, so that it also matches the unquoted HREF):

    /<a href=('|")*[^'">]+('|")*>(.|\s)*<\/a>/i

    is going to look at case 5 as one huge href to compasspointmedia.com, with everything in between (Click here through amazon) being the text for the href. You and I know better, and so does the browser, so make sure your regex can tell by using a question mark character:

    /<a href=('|")*[^'">]+('|")*>(.|\s)*?<\/a>/i

    What the question mark is really telling the regex engine, in English, is: "Get every instance you can of the previous character or set of characters, but STOP GETTING MORE when you encounter your first instance of the remainder of the string." In this case the remainder of the string is <\/a>, so using a ? will find TWO matches.
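
    Here is a small sketch of greedy versus non-greedy matching on Case 5, again in Python as a stand-in for PCRE. It assumes a variant of the pattern in which the quotes around the href value are optional, so that the unquoted HREF in Case 5 is matched as well.

```python
import re

# Case 5 from the article: two links back to back on one line.
case5 = (
    "<A HREF=http://compasspointmedia.com>Click here</a>"
    '<a href="http://amazon.com">go to amazon</a>'
)

# Greedy: (.|\s)* grabs as much as it can before the final </a>.
GREEDY = re.compile(r"""<a href=('|")*[^'">]+('|")*>(.|\s)*<\/a>""", re.IGNORECASE)

# Non-greedy: the trailing ? stops at the first </a> it meets.
LAZY = re.compile(r"""<a href=('|")*[^'">]+('|")*>(.|\s)*?<\/a>""", re.IGNORECASE)

greedy_matches = [m.group(0) for m in GREEDY.finditer(case5)]
lazy_matches = [m.group(0) for m in LAZY.finditer(case5)]

print(len(greedy_matches), len(lazy_matches))
```

    The greedy pattern returns a single match spanning both links, while the non-greedy one returns each link separately.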

    This of course is Sam's Rule #6 for Regular Expressions:

    GREED IS NOT GOOD (USE THE ? TO AVOID SPANNING MULTIPLE INSTANCES)

    Depending on the complexity of what you're looking for, Regexes can be more complicated. If you'd like some links to my regex string library, email me and I'd be happy to provide them. I have started to develop this on the regular expression tester I've provided you.
