Friday, February 12, 2010

Perfect regex for removing links when parsing HTML

After a few long hours:

PHP Version:


Actual Regex


That regex was designed for deVolf's new RSS Import feature. It takes an <a> link and pulls out the href link and the text inside the <a></a>. It allows for empty links as well as links without an href. The regex returns the following matches:
  1. match 1 is the quote character used (single or double); it is required later on in the regex and is not useful after the regex is run
  2. match 2 contains the href link
  3. match 3 contains the text between the <a></a>
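To make the three matches concrete, here is a minimal sketch. The pattern below is an illustrative stand-in written for this example, not the regex this post is about, and the helper name `extract_link` is hypothetical. It only aims to show the same match layout: quote character, href, link text.

```php
<?php
// Illustrative pattern, assumed for this sketch; NOT the post's original regex.
//   group 1: quote character around the href value (single or double)
//   group 2: the href value
//   group 3: the text between <a> and </a>
$pattern = '/<a\b(?:[^>]*?\shref\s*=\s*(["\'])(.*?)\1)?[^>]*>(.*?)<\/a>/i';

// Hypothetical helper: returns the three matches, or null when no link is found.
function extract_link($html, $pattern)
{
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    // Groups that did not take part in the match (a link without an href)
    // come back from preg_match() as empty strings.
    return array('quote' => $m[1], 'href' => $m[2], 'text' => $m[3]);
}

$link = extract_link('<p><A HREF=\'http://example.com/\' class="x">Example</A></p>', $pattern);
print_r($link);
```

The `\1` backreference is what makes match 1 necessary: it forces the closing quote to be the same character as the opening one.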
Things to consider:
  • The regex matches anything after <a> until it hits </a>
  • Inside the href="" it looks for a closing quote (that matches the quote used to start it), a space or another HTML attribute. Therefore, I recommend checking the end of the URL for a quote or space before working with it.
  • It will NOT match newlines anywhere in the link. If you want it to, add an s after the i at the end.
  • It works with PHP 5.3. I have not tested other versions.
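Two of the points above can be sketched in a few lines: the effect of the s modifier on newlines, and trimming a stray quote or space off the end of a captured URL. The pattern here is again an illustrative stand-in, not the original regex, and the variable names are assumptions made for this example.

```php
<?php
// Illustrative pattern body, assumed for this sketch; NOT the post's original regex.
$base = '<a\b(?:[^>]*?\shref\s*=\s*(["\'])(.*?)\1)?[^>]*>(.*?)<\/a>';

// A link whose text spans two lines.
$html = "<a href=\"http://example.com/\">two\nlines</a>";

// With only the i modifier, "." stops at the newline, so the link is not matched.
$matchedWithoutS = preg_match('/' . $base . '/i', $html, $m1);

// Adding s after the i lets "." match newlines as well.
$matchedWithS = preg_match('/' . $base . '/is', $html, $m2);

// Defensive cleanup of a captured URL: strip any trailing quote or space.
$clean = rtrim('http://example.com/" ', "\"' ");

var_dump($matchedWithoutS, $matchedWithS, $clean);
```

Using rtrim() with an explicit character list keeps the cleanup to the end of the string, so quotes or spaces that legitimately appear earlier in the URL are left alone.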

