Regular Expression to extract src attribute from img tag - regex

I am trying to write a pattern for extracting the path for files found in img tags in HTML.
String string = "<img src=\"file:/C:/Documents and Settings/elundqvist/My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";
My Pattern:
src\\s*=\\s*\"(.+)\"
The problem is that my pattern also captures the border="0" part of the img tag.
What pattern would match the URI path for this file without including border="0"?

Your pattern should be (unescaped):
src\s*=\s*"(.+?)"
The important part is the added question mark: it makes the + quantifier lazy, so the group matches as few characters as possible and stops at the first closing quote.
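To make the difference concrete, here is a minimal Java sketch that runs the greedy and the lazy pattern against the string from the question. The class name GreedyVsLazyDemo and the helper firstGroup are made up for this illustration; only java.util.regex from the standard library is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class GreedyVsLazyDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<img src=\"file:/C:/Documents and Settings/elundqvist/"
                + "My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";

        // Greedy (.+) runs to the LAST double quote, so border="0 is swallowed too.
        System.out.println(firstGroup("src\\s*=\\s*\"(.+)\"", html));

        // Lazy (.+?) stops at the FIRST closing quote after src=".
        System.out.println(firstGroup("src\\s*=\\s*\"(.+?)\"", html));
    }
}
```

The same helper can be pointed at any of the other patterns in this thread to compare what they capture.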

This one grabs the src only if it is inside an img tag, not when it appears anywhere else as plain text. It also allows other attributes before or after the src attribute.
It also accepts either single (') or double (") quotes around the value.
\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>
So for PHP you would do:
preg_match("/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/", $string, $matches);
echo "$matches[1]";
For JavaScript you would do:
var match = text.match(/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/)
alert(match[1]);
Hopefully that helps.

Try this expression:
src\s*=\s*"([^"]+)"

I solved it by using this regex.
/<img.*?src="(.*?)"/g
Validated in https://regex101.com/r/aVBUOo/1

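You want to use the non-greedy (lazy) form of the quantifier inside the capture group. Something like
src\\s*=\\s*\"(.+?)\"
By default the regex will try to match as much as possible; the ? placed directly after the + makes it match as little as possible instead.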

> I am trying to write a pattern for extracting the path for files found in img tags in HTML.
Can we have an autoresponder for "Don't use regex to parse [X]HTML"?
> Problem is that my pattern will also include the border="0" part of the img tag.
Not to mention any time 'src="' appears in plain text!
If you know in advance the exact format of the HTML you're going to be parsing (eg. because you generated it yourself), you can get away with it. But otherwise, regex is entirely the wrong tool for the job.

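I'd like to expand on this topic, because the src attribute sometimes comes unquoted. A regex that accepts both a quoted and an unquoted src attribute is:
src\s*=\s*"?(.+?)["\s]
Note that | inside a character class is a literal pipe, not alternation, so it is left out here. Because the capture stops at the first whitespace, a quoted value that contains spaces gets truncated.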
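One way to handle quoted and unquoted values without cutting quoted values at a space is to give each quoting style its own alternative. The Java sketch below swaps in that technique; it is not the pattern from the answer above. The class name SrcValueExtractor and the method extractSrc are hypothetical, and, as elsewhere in this thread, this is only reasonable when you know the shape of the HTML in advance.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class SrcValueExtractor {
    // Three alternatives, tried in order:
    //   group 1: double-quoted value, group 2: single-quoted value,
    //   group 3: unquoted value (runs until whitespace or '>').
    private static final Pattern SRC = Pattern.compile(
            "src\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    // Returns the src value of the first match, or null if there is none.
    static String extractSrc(String tag) {
        Matcher m = SRC.matcher(tag);
        if (!m.find()) {
            return null;
        }
        for (int i = 1; i <= 3; i++) {
            if (m.group(i) != null) {
                return m.group(i); // exactly one alternative took part in the match
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractSrc("<img src=\"my picture.jpg\" border=\"0\" />"));
        System.out.println(extractSrc("<img src='logo.png'>"));
        System.out.println(extractSrc("<img src=logo.png border=0>"));
    }
}
```

An unquoted value directly followed by /> would keep the trailing slash, so this sketch still assumes reasonably tidy markup.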

Related

RegExp find wrong tags

I have some URLs saved in the DB like hello world
with break tags inside them, so I need to delete those. The problem is that <br/> tags also appear in other places, so I can't delete all of them.
I wrote the RegExp <*"*<br\/?>"> but it selects not only the <br> but the quotes too.
You really shouldn't be using regular expressions for parsing HTML or XML.
Having said that, as I understand it, you have br tags inside the href attribute of a tags.
Try:
href\s*?=\s*?\"(.*?)(<br\/?\>)\"
If you are trying to find the affected lines in the database, this is your regex extended to match the whole line:
<.*\".*<br\/>\">.*>
After this you can match the '<br/>' directly in those lines. Is there a language you use to edit your DB?
Some of the other answers here are okay. I'll offer an alternative:
https://regex101.com/r/uG5PBA/2
This'll put the break tags in a capture group -- group 1, so that you can simply nix them.
Regex:
<a[\s\S]*?(\<br\/>)[\s\S]*?<\/a>
Test String:
hello worldhello world
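The idea running through these answers, touching only break tags that sit inside an href value and leaving all others alone, could be sketched in Java roughly as follows. The class HrefBreakCleaner, the method clean, and the sample markup in main are assumptions made for this illustration (the original test data is not shown above). Matcher.replaceAll with a function argument requires Java 9 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class HrefBreakCleaner {
    // Captures the value of a double-quoted href attribute in group 1.
    private static final Pattern HREF = Pattern.compile("href\\s*=\\s*\"([^\"]*)\"");
    // Matches <br>, <br/> and <br />.
    private static final Pattern BR = Pattern.compile("<br\\s*/?>");

    // Removes break tags found inside href values; other break tags are kept.
    static String clean(String html) {
        return HREF.matcher(html).replaceAll(m -> {
            String value = BR.matcher(m.group(1)).replaceAll("");
            // quoteReplacement keeps any '$' or '\' in the URL from being
            // interpreted as replacement syntax.
            return Matcher.quoteReplacement("href=\"" + value + "\"");
        });
    }

    public static void main(String[] args) {
        // Made-up sample: one <br/> inside the href, one outside the link.
        String html = "<a href=\"http://example.com/<br/>\">hello world</a><br/>";
        System.out.println(clean(html));
    }
}
```

Working on the attribute value in a second step avoids having to express "a break tag, but only between these quotes" in a single pattern.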

How to extract FirstName and LastName from html tags with regex?

I have response body which contains
"<h3 class="panel-title">Welcome
First Last </h3>"
I want to fetch 'First Last' as a output
The regular expression I have tried are
"Welcome(\s*([A-Za-z]+))(\s*([A-Za-z]+))"
"Welcome \s*([A-Za-z]+)\s*([A-Za-z]+)"
But I am not able to get a result. If I remove the newline and write it as
"<h3 class="panel-title">Welcome First Last </h3>", an online regex tester finds the match.
I suspect your problem is the line break between "Welcome" and the user name. The "single-line mode" flag (?s) makes . match line terminators as well; it does not skip newlines, it only lets a pattern that uses . span them (\s already matches a newline on its own). Try these:
(?s)Welcome(\s*([A-Za-z]+))(\s*([A-Za-z]+))
(?s)Welcome \s*([A-Za-z]+)\s*([A-Za-z]+)
(This works in JMeter and any other Java- or PHP-based regex engine, but not in JavaScript. In the comments on the question you say you're using JavaScript and also JMeter. If this is a JMeter question, this will help; if it is JavaScript, try one of the other answers.)
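A small Java sketch of what the (?s) flag changes, using a response body shaped like the one in the question. The class WelcomeNameDemo and the method extract are invented for this example, and the exact whitespace in the real response is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class WelcomeNameDemo {
    // Returns group 1 of the first match with surrounding whitespace trimmed,
    // or null if the pattern does not match.
    static String extract(String regex, String body) {
        Matcher m = Pattern.compile(regex).matcher(body);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Assumed body: a single \n between "Welcome" and the name.
        String body = "<h3 class=\"panel-title\">Welcome\nFirst Last </h3>";

        // Without (?s), '.' cannot cross the line break, so this finds nothing.
        System.out.println(extract("Welcome(.*?)</h3>", body));

        // With (?s), '.' also matches the line break and the name is captured.
        System.out.println(extract("(?s)Welcome(.*?)</h3>", body));

        // \s matches the line break with or without the flag.
        System.out.println(extract("Welcome\\s*([A-Za-z]+\\s+[A-Za-z]+)", body));
    }
}
```

If a \s-based pattern still fails against the real response, the cause is likely something other than the line break, for example a literal space in the pattern that is not present in the data.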
Well, usually I don't recommend regex for this kind of work; DOM manipulation is the better fit.
But you can use the following regex to pull out the text:
/(?:<h3.*?>)([^<]+)(?:<\/h3>)/i
See demo at https://regex101.com/r/wA2sZ9/1
This will extract First and Last names including extra spacing. I'm sure you can easily deal with spaces.
In the JMeter Regular Expression Extractor you can use:
<h3 class="panel-title">Welcome(.*?)</h3>
Then take the value using $1$.
In the data you have shown, Welcome is followed by a line break. If that line break is actually part of the response, you have to match it with \n:
<h3 class="panel-title">Welcome\n(.*?)</h3>
Otherwise the first pattern is enough.
First verify this in JMeter using the regular expression tester on the response body.
Welcome([\s\S]+?)<
Try this, it will definitely work.
Regular expressions are greedy by default, try this
Welcome\s*([A-Za-z]+)\s*([A-Za-z]+)
Groups 1 and 2 contain your data

remove multiple style tags using regular expression in C#

I have the following HTML source, which contains two style tags. Using regular expressions we are able to remove all the HTML tags from the file, but we are not able to remove the content of the second style tag.
<style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>
C# Code
Regex test = new Regex(@"<[^\>]*>{}");
strText = test.Replace(strText, String.Empty);
Output:
The expected result is blank, but we get P {margin-top:0;margin-bottom:0;}
Do you want to remove the style tag?
<style.*?</style>
I would not generally recommend using regex to match HTML/XML unless you are sure that it always has a certain structure. There are better tools for manipulating XML.
But I want the attributes/values of the style tag to be removed as well.
You can try a back-reference, which matches the same text that was previously matched by a capturing group.
To remove everything from <...> to </...>, use the regex below, which looks for an opening tag and the closing tag with the same name.
<(\w+)[^>]*>.*<\/\1>
Here (\w+) is capturing group 1 (the tag name) and \1 is the back-reference to that first matched group.
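As a rough illustration of the back-reference idea, here is a Java sketch (the question itself is about C#, so treat it as a transliteration). Two deliberate changes from the pattern above: the body uses lazy .*? so that two separate elements are not merged into one match, and (?s) is added so the body may span lines. The class TagContentRemover and the method strip are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class TagContentRemover {
    // <(\w+)[^>]*>  opening tag, tag name captured in group 1
    // .*?           element body, as short as possible ((?s) lets it span lines)
    // </\1>         closing tag with the same name as group 1
    private static final Pattern ELEMENT =
            Pattern.compile("(?s)<(\\w+)[^>]*>.*?</\\1>");

    // Removes every matched element together with its content.
    static String strip(String html) {
        return ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<style id=\"owaParaStyle\" type=\"text/css\">"
                + "P {margin-top:0;margin-bottom:0;}</style>";
        // Prints an empty line: the tag and its CSS content are both removed.
        System.out.println(strip(html));
    }
}
```

Note that this removes any paired element, not just style, and that elements nested inside an element of the same name will not be handled correctly; restricting the pattern to <style ...>...</style> is safer if that is all you need.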

Remove after .jpg

I'm getting a value like this:
myimage.jpg123456jpg
and I need to remove everything after .jpg
how can I write this in razor?
I don't know anything about razor but this regex would match the part you'd like to save in the first result group:
(.+\.jpg)
You can see it in action here: http://regexr.com?2v7ki
Just match on .+\.jpg, which will give you the myimage.jpg section of the text.
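For reference, the same idea as a small Java sketch (Razor itself would use C#, so this is only an illustration of the pattern, not Razor code). It uses a lazy .+? anchored at the start of the value so the cut happens at the first ".jpg"; that is a slight variation on the greedy pattern above, which gives the same result for the sample value. The class JpgTrimmer and the method trimAfterJpg are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class JpgTrimmer {
    // From the start of the value, capture as little as possible up to and
    // including the first ".jpg".
    private static final Pattern UP_TO_JPG = Pattern.compile("^(.+?\\.jpg)");

    // Returns the value cut off after the first ".jpg", or the value unchanged
    // if it contains no ".jpg".
    static String trimAfterJpg(String value) {
        Matcher m = UP_TO_JPG.matcher(value);
        return m.find() ? m.group(1) : value;
    }

    public static void main(String[] args) {
        System.out.println(trimAfterJpg("myimage.jpg123456jpg"));
    }
}
```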

2 VB RegEx Issues

I need some help with a VB RegEx.
I've got two RegEx that I need to do two specific things.
RegEx one - I am not exactly sure how to do this, but I need to get everything within a Href tag. i.e.
String = "<a href=""test.html"">"
I need the RegEx to return .... test.html
RegEx Two - I have partly got this working.
I've got tags like
RegEx = "<div class=""top""(.*?)</div>"
String = "<div class=""top""><a><b><div class=""bottom""></div></b></a></div>"
The problem I have is that this isn't returning anything. It should return everything within "top", but it returns nothing.
Neither use-case can be solved well with regular expressions.
Use an HTML parser instead, e.g. the HTML Agility Pack.
Well, if your HTML doesn't contain nested tags, you can do the first part with regex (as long as you control the source you are searching, you can be much more certain of your results).
\<a href=""([^""]+)""\>
The test.html will be found in the capturing group referred to as $1.
For the second part, I'm concerned that you have nested tags in there and it's failing on that. Regex does not cope well with nested markup, including HTML that renders as expected but isn't well formed.
Can you post some search source for the second case so we can look?