Basically, I want to use a regex match to extract this:
/xpto/uuuu/That name tho [1080p].mp4
from
<section class="video">
<video id="video" autoplay="">
<source src="/xpto/uuuu/That name tho [1080p].mp4" type="video/mp4">
</video>
</section>
What I want is to get the relative path ending in .mp4 out of a large HTML page.
Can someone help me with this?
Thanks
SOLVED BY RAJ:
"(?<=src="")[^""]+"
Use a lookbehind to match all the characters that come just after src=" up to the next " symbol (i.e., the value of the src attribute):
(?<=src=")[^"]+
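As a rough illustration of the lookbehind above, here is a minimal Python sketch. The sample HTML is the snippet from the question plus one extra `<img>` line that is purely hypothetical, added to show the difference between matching any src value and only those ending in .mp4. The variable names are illustrative assumptions, not part of the original answer.

```python
import re

# Snippet from the question, plus one hypothetical <img> line for contrast.
html = '''<section class="video">
<video id="video" autoplay="">
<source src="/xpto/uuuu/That name tho [1080p].mp4" type="video/mp4">
</video>
<img src="/xpto/poster.jpg">
</section>'''

# The answer's pattern: everything after src=" up to the next quote.
any_src = re.findall(r'(?<=src=")[^"]+', html)

# A stricter variant (an assumption): only values that end in .mp4
# right before the closing quote.
mp4_only = re.findall(r'(?<=src=")[^"]+\.mp4(?=")', html)

print(any_src)
print(mp4_only)
```

The stricter variant may be closer to what the question asks for when the page contains many other src attributes.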
I want to find all the JPGs in a big static site and add WebP versions,
so I need to find
<img src="image/path/my be/long/new_2.jpg" alt="descriptive tag blah" />
and replace with
<picture>
<source srcset="image/path/my be/long/new_2.webp" type="image/webp">
<img src="image/path/my be/long/new_2.jpg" alt="descriptive tag blah" />
</picture>
I've tried all kinds of things, but I'm really unfamiliar with regex, so the closest I've got is
<img src="([^"]+)" alt="(.*?)" />
and replace with
<picture>
<source srcset="$1.webp" type="image/webp">
<img src="$1" alt="$2" />
</picture>
but that produces the file extension .jpg.webp.
Regex is such a huge topic; any help on this from anyone with some experience would be very welcome.
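To adjust the regex so that it captures the source without the .jpg extension, you can use this regex:
<img src="([a-z0-9\/\s_]+)\.jpg" alt="(.*?)" \/>
The capture group matches:
a-z: lowercase characters
0-9: numbers
\/: slash
\s: whitespace
_: underscore
and stops at .jpg" (the dot is escaped so that it matches a literal dot). Paths containing other characters, such as uppercase letters or hyphens, need those characters added to the class.
Because $1 no longer contains the extension, the img line of the replacement needs src="$1.jpg".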
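The same idea can be sketched in Python. This is only an illustration under a few assumptions: it swaps the explicit character class for the question's own `[^"]+` (anything up to the closing quote), escapes the dot before jpg, and uses Python's `\g<1>` group syntax in place of `$1`. The variable names are hypothetical.

```python
import re

html = '<img src="image/path/my be/long/new_2.jpg" alt="descriptive tag blah" />'

# Capture the path WITHOUT the extension; \. matches a literal dot.
pattern = re.compile(r'<img src="([^"]+)\.jpg" alt="(.*?)" />')

# Since group 1 no longer holds ".jpg", it is re-added on the <img> line.
replacement = "\n".join([
    r'<picture>',
    r'<source srcset="\g<1>.webp" type="image/webp">',
    r'<img src="\g<1>.jpg" alt="\g<2>" />',
    r'</picture>',
])

result = pattern.sub(replacement, html)
print(result)
```

Keeping the extension outside the capture group is what avoids the .jpg.webp result described in the question.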
I am trying to get my regular expression to match any image url with certain optionals.
In my set that matches image file extensions everything is fine until I put in the gif extension. When I do that the pdf urls get matched for some reason.
Could anyone shed light on this?
I am using this within PHP with the preg_match_all function.
Rules for matching
Can be either src or href link
Can be relative or absolute link
Protocol can be http or https if given
Select only the link if matched
Case insensitive and global
Pattern (if gif is taken out, the PDFs are skipped)
[src|href]="([(https|http):\/\/]?[^"]*.[jpg|png|jpeg|gif])"
Test strings
Should match <a href="http://blog.mysite.com/wp-content/uploads/2014/04/13061-someimage.jpg">
Should match <a href="/wp-content/uploads/2014/04/13061-someimage.jpg">
No match
No match
Should match <img href="http://blog.mysite.com/wp-content/uploads/2014/04/13061-someimage.jpg"/>
Should match <img href="/wp-content/uploads/2014/04/13061-someimage.gif"/>
Should match <img href="http://blog.mysite.com/wp-content/uploads/2014/04/13061-someimage.jpg" />
Should match <img href="/wp-content/uploads/2014/04/13061-someimage.jpg" />
www.regex101.com fiddle: https://regex101.com/r/x3vVSx/1
Thanks to @Micha Wiedenmann for this.
Quote/Unquote
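You mixed up [ and (: you want (jpg|png|jpeg|gif) instead of [jpg|png|...], and similarly (src|href) instead of [src|href]. Square brackets define a character class, which matches a single character from the set, so [jpg|png|jpeg|gif] matches exactly one of j, p, g, |, n, e, i or f. Adding gif puts f into that set, which is why URLs ending in .pdf start to match. The . before the extension should also be escaped as \. so that it matches a literal dot.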
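To see the difference between the two bracket types in practice, here is a small Python sketch (the question uses PHP's preg_match_all, but these constructs behave the same way in Python's `re`, as far as I know). The PDF test string is a hypothetical stand-in, since the question's "No match" examples are not shown, and the corrected pattern is one possible reading of the advice above rather than the asker's final regex.

```python
import re

samples = [
    '<a href="http://blog.mysite.com/wp-content/uploads/2014/04/13061-someimage.jpg">',
    '<img href="/wp-content/uploads/2014/04/13061-someimage.gif"/>',
    '<a href="/wp-content/uploads/2014/04/some-document.pdf">',  # hypothetical PDF link
]

# Original pattern: every [...] is a character class matching ONE character.
broken = re.compile(
    r'[src|href]="([(https|http):\/\/]?[^"]*.[jpg|png|jpeg|gif])"', re.IGNORECASE)

# Corrected sketch: groups for alternation, escaped dot before the extension.
fixed = re.compile(
    r'(?:src|href)="((?:https?://)?[^"]*\.(?:jpg|png|jpeg|gif))"', re.IGNORECASE)

broken_hits = [bool(broken.search(s)) for s in samples]
fixed_hits = [bool(fixed.search(s)) for s in samples]

print(broken_hits)  # the PDF link matches because "f" is in the last class
print(fixed_hits)   # only the two image links match
```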
I want to find the following pattern:
Image not found: /Images/IMG-20160519-WA0015.jpg
And replace with some markup, including the image name from the above text like:
<a href="IMG-20160519-WA0015.jpg"><img src="IMG-20160519-WA0015.jpg" width="180" height="240" alt="IMG-20160519-WA0015.jpg" class="image" />
Is this possible with some kind of regex or plugin, or am I simply burning neurones?
Thanks.
Try finding ^Image not found: \/Images\/(IMG-.*\.jpg) and replacing with <a href="\1"><img src="\1" width="180" height="240" alt="\1" class="image" />
Note that the caret (^) in the regex says that the match must be at the beginning of the line. I'm not sure whether that's the case for you, but I suspect it is. I also assumed that the "IMG-" prefix is constant; if not, you can just remove those four characters from the regex.
If you're not aware of it, RegExr is a nice interactive way to build and test regular expressions.
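For anyone who prefers a script to an editor's find/replace dialog, the same substitution can be sketched in Python. This assumes the text is available as a string and that each "Image not found" message starts its own line (hence `re.MULTILINE`); the variable names are illustrative only.

```python
import re

text = "Image not found: /Images/IMG-20160519-WA0015.jpg"

# ^ anchors at the start of each line when re.MULTILINE is set.
pattern = re.compile(r'^Image not found: /Images/(IMG-.*\.jpg)', re.MULTILINE)

# \1 re-inserts the captured file name three times.
replacement = r'<a href="\1"><img src="\1" width="180" height="240" alt="\1" class="image" />'

result = pattern.sub(replacement, text)
print(result)
```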
EDIT: Since you mentioned having trouble in the comments, here's an image of my settings:
This is an example string.
<p style="text-align: center;"><img class="aligncenter wp-image-22582 size-full" src="http://the7.dream-demo.com/main/wp-content/uploads/sites/9/2014/05/show-04.png" alt="" width="372" height="225" /></p>
There are two URLs in a row.
One is for a PNG, the other is for a web page. I want to get the PNG URL, matching the pattern "http:.....png".
I simply used "http://.*?png", but it matches a string running from the first "http://" URL through to the second URL, the one with the .png file extension.
I can currently do it by using href and src as conditions to identify which URL is the PNG one, but that misses a lot of PNG URLs that appear in other patterns, such as <png>Png url</png>.
How can this be solved? Thanks.
Don't parse HTML with regex, as Biffen commented, but you can extract bits of it, e.g.:
(?<=href=")[^"]+\.png
This does a lookbehind for href=" at the start of the pattern, then matches every character that isn't a " up to the .png at the end.
Spending an hour learning regex will save you time coming here.
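A short Python sketch may make the problem and two possible fixes concrete. The sample string below is hypothetical (the question's two-URL example is not shown in full), and the third pattern, which excludes whitespace, quotes and angle brackets from the URL, is my own assumption about how to cover cases like <png>...</png>, not something taken from the answer.

```python
import re

# Hypothetical sample: a page link followed by a PNG, plus a <png> element.
sample = ('<a href="http://example.com/some-page/">'
          '<img src="http://example.com/uploads/show-04.png" alt="" /></a>'
          '<png>http://example.com/other.png</png>')

# The question's pattern: .*? happily runs across the quote and the tag.
naive = re.findall(r'http://.*?png', sample)

# Lookbehind on the attribute, as in the answer (src here, since the PNG sits in src).
by_attr = re.findall(r'(?<=src=")[^"]+\.png', sample)

# Context-free variant: a URL may not contain whitespace, quotes or < >.
anywhere = re.findall(r'https?://[^\s"\'<>]+?\.png', sample)

print(naive[0])
print(by_attr)
print(anywhere)
```

The negated character class is what stops a match from spilling over from one URL into the next.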
OK, I'm thoroughly confused about why this regular expression works. The text I'm working with is this:
<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsaasdf
<img src="withalt2" alt="blah" />
</body>
</html>
Using the following regular expression (tested in PHP, but I'm assuming it's true for all Perl-compatible regular expressions), it returns all img tags that do not contain an alt attribute:
/<img(?:(?!alt=).)*?>/
Returns:
<img src="noalt" />
So based on that, I would think that simply removing the non-capturing ("no backreference") group would return the same:
/<img(?!alt=).*?>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />
<img src="withalt2" alt="blah" />
As you can see, it instead returns all the image tags. Then, to make things even more confusing, removing the ? after the * (simply a wildcard, as far as I'm aware) returns everything up to the final >:
/<img(?!alt=).*>/
Returns:
<img src="withalt" alt="hi"/>
<img src="noalt" />fdsaasdf
<img src="withalt2" alt="blah" />
So anyone care to inform me, or at least point me in the right direction of what's going on here?
/<img(?:(?!alt=).)*?>/
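This regex applies the negative lookahead before every character it matches after <img. As soon as it reaches a position where alt= begins, it cannot consume any further, and since the > has not been reached yet, the match fails for that tag. So it only matches img tags that do not have an alt attribute.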
/<img(?!alt=).*?>/
This regex applies the negative lookahead only once, immediately after <img. It therefore matches everything up to the first > for every img tag in which <img is not directly followed by alt=, no matter whether alt= appears further along in the tag; that part is simply consumed by .*?
/<img(?!alt=).*>/
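This is the same as the previous one, except that it uses greedy matching, so it matches everything up to the last > it can reach. Since . does not match a newline by default, that is the last > on the same line; with the s (dot-all) modifier it would run all the way to the > of </html>. I don't know why your output includes the text after the tag, though.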
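The three variants can be compared side by side with a small Python sketch. It uses Python's `re` rather than PHP, on the assumption that lookaheads, lazy/greedy quantifiers and the dot behave the same way in both for these patterns. Note that in Python the dot does not match a newline unless `re.DOTALL` is given, which is why the greedy version stays within one line by default.

```python
import re

html = """<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsaasdf
<img src="withalt2" alt="blah" />
</body>
</html>"""

# Lookahead checked before EVERY character: tags containing alt= cannot match.
tempered = re.findall(r'<img(?:(?!alt=).)*?>', html)

# Lookahead checked ONCE, right after "<img": it always passes here.
once_lazy = re.findall(r'<img(?!alt=).*?>', html)

# Greedy: runs to the last ">" on the line (dot stops at newlines by default).
once_greedy = re.findall(r'<img(?!alt=).*>', html)

# With DOTALL the greedy match runs through to the ">" of </html>.
greedy_dotall = re.findall(r'<img(?!alt=).*>', html, re.DOTALL)

print(tempered)
print(once_lazy)
print(once_greedy)
print(len(greedy_dotall))
```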
Now forget everything above and move to an HTML parser for parsing HTML. Parsers are designed specifically for this task, so don't bother with regex: you can't parse every kind of HTML with it.
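As one possible illustration of the parser route, here is a minimal sketch using Python's standard-library `html.parser` (the question is about PHP, so this is a swapped-in example, not the answerer's suggestion of a specific library). The class name `ImgWithoutAlt` and its `sources` attribute are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class ImgWithoutAlt(HTMLParser):
    """Collects the src of every <img> tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags such as
        # <img ... /> also end up here via the default handle_startendtag.
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.sources.append(attributes.get("src"))


html = """<html>
<body>
hello
<img src="withalt" alt="hi"/>asdf
<img src="noalt" />fdsaasdf
<img src="withalt2" alt="blah" />
</body>
</html>"""

parser = ImgWithoutAlt()
parser.feed(html)
print(parser.sources)
```

Unlike the regex versions, this does not depend on attribute order, spacing, or where the line breaks fall.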