Friday, February 12, 2010

Perfect regex for removing links when parsing HTML

After a few long hours:

PHP version (escaped for use inside a double-quoted string):

/\<a.*?href=('|\")(.*?)(?:(?<!\\\)\\1|\w+(?=\=)|.(?=\s))[^\>]*?>(.*?)(?:\<\/)(?=[a]).*?(?=\>)\>/i


Actual regex:

/\<a.*?href=('|\")(.*?)(?:(?<!\\)\1|\w+(?=\=)|.(?=\s))[^\>]*?>(.*?)(?:\<\/)(?=[a]).*?(?=\>)\>/i


That regex was designed for deVolf's new RSS Import feature. It takes an <a> link and removes the href link and the text inside the <a></a>. It allows for empty links as well as links without hrefs. The regex return matches are as follows:
  1. match 1 is the quote character that was used (single or double); the regex needs it later on as a backreference, and it is not useful after the regex is run
  2. match 2 contains the href link
  3. match 3 contains the text between the <a></a>
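
To make the three matches listed above concrete, here is a minimal sketch that runs the PHP version of the pattern over a small string. The sample HTML and the variable names ($html, $links, $stripped) are made up for illustration and are not taken from the RSS importer.

```php
<?php
// The pattern exactly as given in the "PHP Version" above (double-quoted string).
$pattern = "/\<a.*?href=('|\")(.*?)(?:(?<!\\\)\\1|\w+(?=\=)|.(?=\s))[^\>]*?>(.*?)(?:\<\/)(?=[a]).*?(?=\>)\>/i";

// Hypothetical sample input with one double-quoted and one single-quoted href.
$html = "See <a href=\"http://example.com/page\" title=\"x\">the page</a> and <a href='/about'>About</a>.";

// PREG_SET_ORDER returns one entry per link:
// [0] full match, [1] quote character, [2] href, [3] text between the tags.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = array();
foreach ($matches as $m) {
    $links[] = array('url' => $m[2], 'text' => $m[3]);
}

// Replacing every match with capture 3 drops the tag but keeps the link text.
$stripped = preg_replace($pattern, '$3', $html);

print_r($links);        // two entries: http://example.com/page and /about
echo $stripped, "\n";   // See the page and About.
```

Match 1 is simply skipped here, since it only matters inside the regex itself.
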
Things to consider:
  • The regex matches anything after <a> until it hits </a>
  • After the opening quote of the href="", it looks for a closing quote (one that matches the quote used to start it), a space or another HTML attribute. Therefore, I recommend checking the end of the URL for a stray quote or space before working with it.
  • It will NOT match links that contain a newline anywhere. If you want it to, add an s after the i at the end (so the regex ends in /is).
  • It works with PHP 5.3. I have not tested other versions.
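
The two practical points above (cleaning up the end of the URL, and the s modifier for newlines) could be handled roughly like this. This is only a sketch: clean_href() is a hypothetical helper name, and the sample link is invented.

```php
<?php
// Same pattern as the "PHP Version" above, but ending in /is so that . also matches newlines.
$pattern = "/\<a.*?href=('|\")(.*?)(?:(?<!\\\)\\1|\w+(?=\=)|.(?=\s))[^\>]*?>(.*?)(?:\<\/)(?=[a]).*?(?=\>)\>/is";

// Hypothetical helper: strip a stray trailing quote or whitespace from match 2.
function clean_href($url) {
    return rtrim($url, "'\" \t\r\n");
}

// Hypothetical input: the link text is split across two lines.
$html = "<a href='/about'>About\nus</a>";

$url = null;
$text = null;
if (preg_match($pattern, $html, $m)) {
    $url  = clean_href($m[2]); // "/about"
    $text = $m[3];             // "About\nus"
}

// Without the s modifier the same input does not match at all.
$matchedWithoutS = preg_match(substr($pattern, 0, -1), $html); // 0
```

One more case worth testing against your own feeds: because of the \w+(?=\=) branch, a URL with a query string such as ?id=5 is cut off just before the parameter name, so match 2 may need more than trimming.
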

Thanks,
James Hartig

2 comments:

p1nky said...

Hi, do you happen to be the same James Hartig who created this: http://userscripts.org/scripts/show/79598 ?
If yes, please be so kind as to update your script, because your script is very useful :)
I'm so sorry to be going so far and search for your blog >.<

DeveloWare LLC said...

After looking at at least 20 other pages for a regexp to remove links that have javascript in them, I finally found yours. Thank you very much!
Joe Mas