Why does this regular expression work? - regex

OK, I'm thoroughly confused about why this regular expression works. The text I'm working with is this:
<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsaasdf
<img src="withalt2" alt="blah" />
</body>
</html>
Using the following regular expression (tested in PHP, but I'm assuming it's true for all Perl-style regular expressions), it will return all img tags which do not contain an alt attribute:
/<img(?:(?!alt=).)*?>/
Returns:
<img src="noalt" />
So based on that I would think that simply removing the non-capturing group would return the same:
/<img(?!alt=).*?>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />
<img src="withalt2" alt="blah" />
As you see instead it just returns all image tags. Then to make things even more confusing, removing the ? (simply a wildcard as far as I'm aware) after the * returns up to the final >
/<img(?!alt=).*>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />fdsaasdf
<img src="withalt2" alt="blah" />
So anyone care to inform me, or at least point me in the right direction of what's going on here?

/<img(?:(?!alt=).)*?>/
This regex applies negative look-ahead for each character it matches after img. So, as soon as it finds alt=, it stops. So, it will only match the img tag, that does not have an alt attribute.
/<img(?!alt=).*?>/
This regex applies the negative look-ahead only once, immediately after img. So it will match everything up to the first > for every img tag where the text directly after <img is not alt=, no matter whether alt= appears anywhere further down the string. That part is simply consumed by .*?
/<img(?!alt=).*>/
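This is the same as the previous one, but it uses greedy matching, so it matches everything up to the last > it can reach. Without the s (DOTALL) modifier, . does not match newlines, so that is the last > on the same line, which is why the output is still one match per line. With the s modifier you would get everything up to the final > of </html>.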
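As a rough check of the explanation above, here is a minimal sketch that runs the three patterns in Python instead of PHP (an assumption on my part; look-ahead and greedy/lazy quantifiers behave the same way for these patterns). The variable names are only illustrative.

```python
import re

# Sample text from the question.
html = '''<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsaasdf
<img src="withalt2" alt="blah" />
</body>
</html>'''

# Look-ahead is re-checked before every character, so "alt=" can never be consumed.
tempered = re.findall(r'<img(?:(?!alt=).)*?>', html)

# Look-ahead is checked once, right after "<img"; ".*?" then runs to the first ">".
lazy = re.findall(r'<img(?!alt=).*?>', html)

# Greedy: runs to the last ">" it can reach. "." stops at newlines by default.
greedy = re.findall(r'<img(?!alt=).*>', html)

# With DOTALL, "." also matches newlines, so the match runs to the end of </html>.
greedy_dotall = re.findall(r'<img(?!alt=).*>', html, re.DOTALL)

print(tempered)
print(lazy)
print(greedy)
print(greedy_dotall)
```

Only the first pattern filters out tags that contain alt=; the other two differ only in how far the .* part is allowed to run.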
Now forget everything that happened there, and move to an HTML parser for parsing HTML. Parsers are specifically designed for this task. So don't bother using regex, because you can't parse every kind of HTML with regex.
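To illustrate the parser route, here is a small sketch using Python's standard-library html.parser (my choice for the example; the question itself uses PHP, where you would pick a PHP parser). The class name ImgWithoutAlt and the missing list are hypothetical names made up for this sketch.

```python
from html.parser import HTMLParser

class ImgWithoutAlt(HTMLParser):
    """Collects the src of every img tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags such as
        # <img ... /> also end up here via the default handle_startendtag.
        attributes = dict(attrs)
        if tag == 'img' and 'alt' not in attributes:
            self.missing.append(attributes.get('src'))

html = '''<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsaasdf
<img src="withalt2" alt="blah" />
</body>
</html>'''

parser = ImgWithoutAlt()
parser.feed(html)
print(parser.missing)
```

Unlike the regex, this does not care about attribute order, line breaks inside a tag, or the text alt= appearing inside another attribute's value.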

Related

Regex Pattern to Match A Href and Remove

I am trying to create a regex to match all a href links that contain my domain and I will end up removing the links. It is working fine until I run into an a href link that has another HTML tag within the tag.
Regex Statement:
(<a[^<]*coreyjansen\.com[^<]*>)([^"]*?)(<\/a>)
It matches the a href links in this statement with no problem
Need a lawyer? Contact <span style="color: #000000">Random text is great Corey is awesome</span>
It is unable to match both of the a href links in this statement:
<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /></a>
I have been trying to play with the negated character set with no luck. If I remove the negated character set, what ends up happening is that it will match two links that are right after each other, such as in example 2, as one match.
The issue here is that [^<]*> matches everything up until the last >. That's the greedy behaviour of the * quantifier. You can make it non-greedy by appending ? after the asterisk (which you already do in another part of your regex). It will then match everything until the first occurrence of >. Then you have to change the middle part of your regex too, i.e. to catch everything until the first </a> tag, like this:
(<a[^<]*coreyjansen\.com[^<]*?>)(.*?)(<\/a>)
Use the regex below, which matches only the a tag:
(<a[^>]*coreyjansen\.com[^>]*>)
Example data
<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /><a href="http://coreyjansen.com/"/>
The above regex will match all three a tags with your required domain.
Try above on regex
I'm playing with the following regex and it seems to be working:
<a.*coreyjansen\.com.*</a>
It captures anything between anchor tags that contain your site name. I am using JavaScript pattern matching from www.regexpal.com, so depending on the language it could be slightly different.
You need to match the start of the tag, <a, then match the address before the > character. You are excluding the wrong character. Once that matches, everything between <a> and </a> is the displayed link. I don't know why you exclude quotes there; every tag attribute (in HTML5) has its value inside quotes, so you need to match everything except the link's closing tag </a>. That is done with ((?!string to not match).)*, which should be followed by </a>. The resulting regex is:
(<a[^>]*coreyjansen\.com[^>]*>)((?!<\/a>).)*(<\/a>)
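A minimal sketch of that last pattern in Python (the question does not say which language is in use, so this is an assumption). Two deliberate changes for the demo: the repeated group is wrapped as ((?:(?!</a>).)*) so the whole link body is captured instead of only its last character, and re.DOTALL is set because the sample tag spans several lines. The second link with the text "second" is made up so that two adjacent links are tested.

```python
import re

# First link is the sample from the question; the second link is hypothetical.
text = (
    '<strong><a href="http://coreyjansen.com/"><img class="alignright size-full\n'
    'wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"\n'
    'alt="lawyers" width="250" height="250" /></a>'
    '<a href="http://coreyjansen.com/">second</a></strong>'
)

# Opening tag containing the domain, then any characters that do not start "</a>",
# then the closing tag. Group 1 holds the link body.
pattern = re.compile(
    r'<a[^>]*coreyjansen\.com[^>]*>((?:(?!</a>).)*)</a>',
    re.DOTALL,
)

# Drop the anchor tags but keep whatever was inside them.
unlinked = pattern.sub(r'\1', text)
print(unlinked)
```

Because the look-ahead refuses to step over </a>, the two adjacent links are matched separately rather than as one long match.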

Notepad++ RegEx replace with pattern

I want to find the following pattern:
Image not found: /Images/IMG-20160519-WA0015.jpg
And replace with some markup, including the image name from the above text like:
<a href="IMG-20160519-WA0015.jpg"><img src="IMG-20160519-WA0015.jpg" width="180" height="240" alt="IMG-20160519-WA0015.jpg" class="image" />
Is it possible with some kind of Regex or plugin or I'm simply burning neurones?
Thanks.
Try finding ^Image not found: \/Images\/(IMG-.*\.jpg) and replacing with <a href="\1"><img src="\1" width="180" height="240" alt="\1" class="image" />
Note that the caret (^) in the regex says that it must be at the beginning of the line, not sure if that's the case for you but I suspect that it is. I also assumed that the "IMG-" prefix is constant, if not then you can just remove those four characters from the regex.
If you're not aware of it, RegExr is a nice interactive way to build and test regular expressions.
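For anyone who wants to try the same find/replace outside Notepad++, here is a small sketch in Python (an assumption; the answer above targets Notepad++'s own dialog). The pattern and replacement are the ones given above, with re.MULTILINE so that ^ anchors at the start of each line. The second input line is made up to show that other lines are left alone.

```python
import re

log = (
    "Image not found: /Images/IMG-20160519-WA0015.jpg\n"
    "Some other line\n"
)

pattern = re.compile(r'^Image not found: /Images/(IMG-.*\.jpg)', re.MULTILINE)

# \1 is the captured file name, reused three times in the markup.
replacement = r'<a href="\1"><img src="\1" width="180" height="240" alt="\1" class="image" />'

result = pattern.sub(replacement, log)
print(result)
```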

Need to replace dynamic image tag with text

I need to replace the URL below (including the img tags) with text. I am not very good with regex... As you can see, it's dynamic with dates, and it ends in two different ways:
with alt=";)"> and sometimes with class="wp-smiley" />
<img src="http://thailandsbloggare.se/wp-content/uploads/2012/10/icon_wink.gif" alt=";)">
and sometimes with class="wp-smiley" at the end
<img src="http://thailandsbloggare.se/wp-content/uploads/2012/09/icon_wink.gif" alt=";)" class="wp-smiley" />
So any time this image is posted I want the complete string to replaced to text ";)"
I have managed to write the regex for everything up to alt=";)"> (or sometimes class="wp-smiley" />), but then I am stuck; I presume I need some OR functionality here.
<img src="http://thailandsbloggare.se/wp-content/uploads/20\d\d/\d+/icon_wink\.gif
Updated information after replies below
<img src="http://thailandsbloggare.se/wp-content/uploads/20[0-9]{2}/[01][0-9]/icon_wink.gif" alt=";\)" *(|class="wp-smiley")?>
and
Both fail, returning strings with class="wp-smiley" /> included
It's a site built in WordPress using PHP and I am using http://urbangiraffe.com/plugins/search-regex/
Thanks in advance!
Normally, in a regex, you can create alternative sub-regexes:
(match this|or this)
In your case
(alt=";\)"|class="wp-smiley")
If alt=";)" is always there, do:
alt=";\)" *(|class="wp-smiley")
Of course, we don't know in which editor or programming language you are operating, and the actual regex implementation can be different from the above example.
Try the following pattern search:
<img src="http://thailandsbloggare.se/wp-content/uploads/20[0-9]{2}/[01][0-9]/icon_wink.gif" alt=";\)"(\sclass="wp-smiley")?>
Please refer to the syntax supported by the regex engine you are using. But, for most engines the above pattern should work. Note the character class used for date ranges, you should change it appropriately.
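One thing the patterns above do not cover is the " />" ending of the second variant, which may be why the strings with class="wp-smiley" /> were still failing. Here is a sketch of a pattern that accepts both endings, tested in Python rather than in the Search Regex plugin (an assumption; the plugin may need delimiters or different escaping). The optional class attribute is followed by \s*/?> so that both > and /> close the tag.

```python
import re

samples = [
    '<img src="http://thailandsbloggare.se/wp-content/uploads/2012/10/icon_wink.gif" alt=";)">',
    '<img src="http://thailandsbloggare.se/wp-content/uploads/2012/09/icon_wink.gif" alt=";)" class="wp-smiley" />',
]

pattern = re.compile(
    r'<img src="http://thailandsbloggare\.se/wp-content/uploads/20\d\d/\d+/icon_wink\.gif"'
    r' alt=";\)"'                   # the alt attribute is always present
    r'(?:\s+class="wp-smiley")?'    # the class attribute is optional
    r'\s*/?>'                       # tag may end with ">" or " />"
)

replaced = [pattern.sub(';)', s) for s in samples]
print(replaced)
```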

Regular expression to remove <p> tags around elements wrapped in [...]'s

I'm a total regexp noob. I'm working with WordPress and I'm desperately trying to deal with WordPress's wpautop, which I hate and love (more hate!). Anyway, I'm trying to remove <p> tags around certain commands.
Here's what I get:
<p>
[hide]
<img.../>
[/hide]
</p>
or
<p>
[imagelist]
<img .../>
<img .../>
[/imagelist]
</p>
Here's what I'd like:
[hide]
<img.../>
[/hide]
or
[imagelist]
<img .../>
<img .../>
[/imagelist]
I've tried:
preg_replace('/<p[^>]*>(\[[^>]*\])<\/p[^>]*>/', '$1', $content); // No luck!
EDIT:
When I am doing the regexp it is still just a variable containing text.. It is not parsed as html yet. I know it is possible because I already did it with getting rid of p tags around an image tag. So I just need a regexp to handle text that will be parsed as html at some point in the future.
Here's a similar question
Thanks!
Matt Mueller
You can't use regular expressions to parse HTML, because HTML is, by definition, a non-regular language. Period, end of discussion.
The language of matching HTML tags is context-free, not regular. This means regular expressions are probably not the right tool to use here. Context-free languages require parsers rather than regular expressions. So, you can either remove ALL <p> and </p> tags with a regular expression, or you can use an HTML parser to remove matching tags from certain parts of your document.
Try this regex:
'%<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>%s'
Explanation:
\[([^\[\]]+)\] matches the opening bbcode tag and captures the tag name in group #2.
\[/\2\] matches a corresponding closing tag.
.*? matches anything, reluctantly. Thanks to the s flag at the end, it also matches newlines. The effect of the reluctant .*? is that it stops matching the first time it finds a closing bbcode tag with the right name. If tags are nested (within tags with the same name) or improperly balanced, it won't work correctly. I wouldn't expect that to be a problem, but I have no experience with WordPress, so YMMV.
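Here is the same regex as a quick sketch in Python, purely to show its behaviour (the question uses PHP's preg_replace; re.DOTALL plays the role of the s flag). The sample content, including the img file names and the plain paragraph, is made up for the demo.

```python
import re

content = (
    '<p>\n[hide]\n<img src="a.png"/>\n[/hide]\n</p>\n'
    '<p>plain paragraph</p>\n'
    '<p>\n[imagelist]\n<img src="b.png"/>\n<img src="c.png"/>\n[/imagelist]\n</p>'
)

pattern = re.compile(
    r'<p[^>]*>\s*'                      # opening <p> and leading whitespace
    r'(\[([^\[\]]+)\].*?\[/\2\])'       # [tag] ... [/tag], tag name in group 2
    r'\s*</p>',                         # trailing whitespace and closing </p>
    re.DOTALL,
)

# Keep only the bracketed block (group 1), dropping the surrounding <p>...</p>.
cleaned = pattern.sub(r'\1', content)
print(cleaned)
```

Paragraphs that do not start with a bracketed tag are left untouched, because the pattern requires [ right after the opening <p> and its whitespace.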

Regex Syntax - Help

I need to process HTML content and replace the IMG SRC value with the actual data. For this I have chosen regular expressions.
In my first attempt I need to find the IMG tags. For this I am using the following expression:
<img.*src.*=\s*".*"
Then within the IMG tag I am looking for SRC="..." and replace it with the new SRC value. I am using the following expression to get the SRC:
src\s*=\s*".*"\s*
The second expression has issues:
For the following text it works:
<img alt="3D""" hspace=
"3D0" src="3D"cid:TDCJXACLPNZD.hills.jpg"" align=
"3dbaseline" border="3d0" />
But for the following it does not:
<img alt="3D""" hspace="3D0" src=
"3D"cid:UHYNUEWHVTSH.lilies.jpg"" align="3dbaseline"
border="3d0" />
What happens is the expression returns
src="3D"cid:TDCJXACLPNZD.hills.jpg"" align=
"3dbaseline"
It does not return only the src part as expected.
I am using C++ Boost regex library.
Please help me to figure out the problem.
Thanks,
Hilmi.
The problem is that .* is a "greedy" match - it will grab as much text as it possibly can while still allowing the regex to match. What you probably want is something like this:
src\s*=\s*"[^"]*"\s*
which will only match non-doublequote characters inside the src string, and thus not go past the ending doublequote.
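The difference is easy to see on a simplified tag. The sketch below uses Python rather than Boost (an assumption; greedy quantifiers and negated character classes behave the same way for these patterns), and the tag itself is a made-up, cleaned-up version of the samples in the question. The file name new.jpg is a placeholder.

```python
import re

# Hypothetical single-line tag, simpler than the quoted-printable samples above.
tag = '<img alt="x" src="a.jpg" align="baseline" border="0" />'

# Greedy ".*" runs to the LAST double quote on the line.
greedy = re.search(r'src\s*=\s*".*"\s*', tag).group(0)

# "[^\"]*" cannot cross a double quote, so it stops at the end of the src value.
negated = re.search(r'src\s*=\s*"[^"]*"\s*', tag).group(0)

# Replacing only the value: keep the prefix and the closing quote via groups.
swapped = re.sub(r'(src\s*=\s*")[^"]*(")', r'\g<1>new.jpg\g<2>', tag)

print(repr(greedy))
print(repr(negated))
print(swapped)
```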
Your first regex doesn't work on your sample text for me. I usually use this instead, when looking for specific HTML tags:
<img[^>]*>
Also, try this for your second expression:
src\s*=\s*"[^"]*"\s*
Does that help?