Using PCREs
By: Codewalkers
    2002-11-16

    Table of Contents:
  • Using PCREs
  • Section 1
  • Section 2
  • Section 3
  • Wrapping Things Up


    Using PCREs - Section 2



    The above five examples are the best way of introducing the rules I'm about to give. Just as a reminder, I have set up a regular expression tester at: http://samuelfullman.com/team/php/tools/regular_expression_tester_p.php

    This tester is great because you can build a pattern, then either paste in the text you want to search or specify a URL on the Web to search instead. Here's Sam's Rule #2 of Regular Expressions:

    BUILD YOUR REGULAR EXPRESSIONS UP STEP BY STEP, TESTING VARIATIONS OF SEARCHED STRINGS AT EVERY STEP.

    The tester I designed will allow you to do that.
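    If the online tester isn't handy, Rule #2 can be sketched in a few lines of code. This is only an illustration, written in Python rather than PHP: the sample strings, the intermediate patterns, and the helper name run_steps are all made up for the example. Each step adds one small piece to the pattern, and every step is run against every variation of the searched string.

```python
import re

# Hypothetical test strings: variations of the same link.
SAMPLES = [
    '<a href="http://compasspointmedia.com">Click here</a>',
    '<A HREF = "http://compasspointmedia.com">Click here</a>',
    '<a name="link" href="http://compasspointmedia.com">Click here</a>',
]

# Hypothetical build-up: each step extends the previous pattern a little.
STEPS = [
    r'<a',
    r'<a\s+href',
    r'<a\s+href="[^"]+"',
    r'<a\s+href="[^"]+">.*</a>',
]

def run_steps(steps, samples):
    """For each pattern, record whether it matches each sample string."""
    results = []
    for pattern in steps:
        compiled = re.compile(pattern, re.IGNORECASE)
        results.append([compiled.search(s) is not None for s in samples])
    return results

# Print one row per step so it is obvious where a variation stops matching.
for pattern, row in zip(STEPS, run_steps(STEPS, SAMPLES)):
    print(f"{pattern!r:35} {row}")
```

    The second sample drops out as soon as the pattern insists on href=" with no spaces, and the third drops out as soon as href has to come first. Those are exactly the two cases handled below.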

    Let's go back to the simple href examples above:

    Case 1
    <A HREF = http://compasspointmedia.com>Click here</a>

    Again, this is perfectly valid in any browser. The problem is that we have spaces. We could also have tabs or newline characters.

    Enter Sam's Rule #3 of Regular Expressions:

    ALWAYS COMPENSATE AND ACCOUNT FOR WHITESPACE!

    As you may know, browsers collapse whitespace: a run of several spaces, tabs, or newlines is rendered as a single space. In Perl regexes, whitespace characters (tab chr(9), newline chr(10), carriage return chr(13), and the space, plus the form feed chr(12)) are designated by \s. So let's rewrite our regex to handle this:

    /<a(\s)+href(\s)*=(\s)*"[^"]+"(\s)*>.*<\/a(\s)*>/i

    I've added (\s) where whitespace could conceivably be in the string. Notice that after the first <a, there must be at least one whitespace character, hence the + sign afterwards. The whitespace in the </a> tag is unlikely but again, it's legal for browsers and we want to account for its possible presence.
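    Here is a quick check of that pattern, sketched in Python instead of PHP (the / delimiters are dropped and the /i modifier becomes re.IGNORECASE). The helper name is_link and the test strings are assumptions for the example. One thing the check brings out: the pattern still requires the URL to be in double quotes, so an unquoted href value does not match.

```python
import re

# The whitespace-tolerant pattern from the article, minus the / delimiters.
LINK = re.compile(
    r'<a(\s)+href(\s)*=(\s)*"[^"]+"(\s)*>.*<\/a(\s)*>',
    re.IGNORECASE,
)

def is_link(html):
    """Return True if the pattern finds a link anywhere in html."""
    return LINK.search(html) is not None

# Spaces around the equals sign, URL in double quotes.
quoted = '<A HREF = "http://compasspointmedia.com">Click here</a>'
# Tabs and newlines inside the opening tag, a space inside the closing tag.
multiline_tag = '<a\n\thref\t=\n"http://compasspointmedia.com" >Click here</a >'
# Same as the first string but with no quotes around the URL.
unquoted = '<A HREF = http://compasspointmedia.com>Click here</a>'
# No whitespace at all between <a and href.
no_space = '<ahref="http://compasspointmedia.com">Click here</a>'

for name, text in [("quoted", quoted), ("multiline_tag", multiline_tag),
                   ("unquoted", unquoted), ("no_space", no_space)]:
    print(name, is_link(text))
```

    The first two strings match and the last two do not. Handling unquoted values would take a further change to the "[^"]+" part of the pattern.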

    Case 2
    <a name="link" href="http://compasspointmedia.com">Click here</a>

    This is pretty obvious; attributes don't have to be in any order. Great for writing HTML, hard for regexes. You have to think strategically on this one. Here's how we account for it:

    /<a(\s)+[^>]*href(\s)*=(\s)*"[^"]+"[^>]*(\s)*>.*<\/a(\s)*>/i

    Basically, I've added the string [^>]*, which means, in English, "anything except a close bracket (>) character, zero or more times." The first > closes the <a> tag, so once we hit it we're no longer in the tag. Since we don't know where in the tag the href attribute will be declared, allowing any run of non-> characters before and after it does the job.
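    To see the attribute-order pattern at work, here is a small sketch in Python (again standing in for PHP, with the / delimiters removed and /i expressed as re.IGNORECASE). The helper name has_href_link and the sample tags are invented for the example.

```python
import re

# The article's pattern with [^>]* on either side of the href attribute.
HREF_ANYWHERE = re.compile(
    r'<a(\s)+[^>]*href(\s)*=(\s)*"[^"]+"[^>]*(\s)*>.*<\/a(\s)*>',
    re.IGNORECASE,
)

def has_href_link(html):
    """Return True if html contains an <a> tag with a quoted href."""
    return HREF_ANYWHERE.search(html) is not None

href_last = '<a name="link" href="http://compasspointmedia.com">Click here</a>'
href_first = '<a href="http://compasspointmedia.com" name="link">Click here</a>'
href_middle = ('<a name="link" href="http://compasspointmedia.com" '
               'target="_blank">Click here</a>')
name_only = '<a name="link">Click here</a>'

for name, text in [("href_last", href_last), ("href_first", href_first),
                   ("href_middle", href_middle), ("name_only", name_only)]:
    print(name, has_href_link(text))
```

    The first three tags match wherever href sits; the anchor with only a name attribute does not, because [^>]* cannot run past the closing bracket to look for an href.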

    A little thought will tell you that if we were requiring TWO attributes like href AND name, it might get a little ugly. The regex for this is fallible but will get most cases where we need both. This is WAY complex. You can skip this next one if you want but here it is:

    /<a(\s)+[^>]*((href(\s)*=(\s)*"[^"]+")|(name(\s)*=(\s)*"[^"]+")|([^>]*)){2,}(\s)*>.*<\/a(\s)*>/i
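    The sketch below shows where that pattern is fallible. It is in Python rather than PHP, with the pattern written on one line. Because the third alternative, ([^>]*), can match anything (including nothing), the {2,} count can be met without either attribute being present. For comparison, it also tries a different technique that is not the one used above: two lookahead assertions, one per required attribute. The names TWO_ATTRS and WITH_LOOKAHEADS and the sample tags are assumptions for the example.

```python
import re

# The article's two-attribute pattern, joined onto a single line.
TWO_ATTRS = re.compile(
    r'<a(\s)+[^>]*'
    r'((href(\s)*=(\s)*"[^"]+")|(name(\s)*=(\s)*"[^"]+")|([^>]*)){2,}'
    r'(\s)*>.*<\/a(\s)*>',
    re.IGNORECASE,
)

# Alternative sketch (not from the article): each (?=...) looks ahead
# inside the tag for one required attribute without consuming any text.
WITH_LOOKAHEADS = re.compile(
    r'<a(?=[^>]*\shref\s*=\s*"[^"]+")'
    r'(?=[^>]*\sname\s*=\s*"[^"]+")'
    r'\s[^>]*>.*<\/a\s*>',
    re.IGNORECASE,
)

both = '<a name="link" href="http://compasspointmedia.com">Click here</a>'
reversed_order = '<a href="http://compasspointmedia.com" name="link">Click here</a>'
name_only = '<a name="link">Click here</a>'
href_only = '<a href="http://compasspointmedia.com">Click here</a>'

for label, text in [("both", both), ("reversed_order", reversed_order),
                    ("name_only", name_only), ("href_only", href_only)]:
    print(label,
          TWO_ATTRS.search(text) is not None,
          WITH_LOOKAHEADS.search(text) is not None)
```

    The original pattern accepts all four tags, including the two that carry only one of the attributes. The lookahead version accepts only the two tags that have both. Lookaheads are part of Perl-compatible syntax, so the same idea should carry over to a PCRE pattern, but test it step by step as per Rule #2.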

