2 VB RegEx Issues - regex

I need some help with a VB RegEx.
I've got two RegEx that I need to do two specific things.
RegEx one - I am not exactly sure how to do this, but I need to get the value of an href attribute, e.g.
String = "<a href=""test.html"">"
I need the RegEx to return .... test.html
RegEx Two - I have partly got this working.
I've got tags like
RegEx = "<div class=""top""(.*?)</div>"
String = "<div class=""top""><a><b><div class=""bottom""></div></b></a></div>"
The problem I have is that this isn't returning anything; it should return everything within "top", but it returns nothing.

Neither use-case can be solved well with regular expressions.
Use an HTML parser instead, e.g. the HTML Agility Pack.
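As a rough illustration of the parser approach, here is a minimal sketch in Python rather than VB. It uses the standard library's html.parser in place of the HTML Agility Pack, so the class and function names (HrefCollector, extract_hrefs) are assumptions made for this example, not part of any library mentioned above.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    # Hypothetical helper: feed the markup and return the collected hrefs
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


print(extract_hrefs('<a href="test.html">'))
```

Because the parser tracks tags itself, nesting inside the surrounding markup does not affect the result.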

Well, if your HTML doesn't contain nested tags, you can do the first part with a regex (the more control you have over the source you are searching, the more certain you can be of the results).
<a href=""([^""]+)"">
test.html will be captured in group 1, referred to as $1.
For the second part, I suspect the nested tags are what makes it fail. Regular expressions cope poorly with nested markup, especially HTML that browsers render as expected but that isn't well formed.
Can you post some search source for the second case so we can look?
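To make both points concrete, here is a small Python sketch (not VB) under a few assumptions: the names HREF_RE, TOP_DIV_RE and first_href are made up for the example, and the href pattern only handles an href that directly follows the tag name and uses double quotes. The second half shows what a lazy match does with the nested div from the question.

```python
import re

# Capture whatever sits between the double quotes of an href attribute.
HREF_RE = re.compile(r'<a\s+href="([^"]+)"', re.IGNORECASE)


def first_href(html):
    # Returns the first href value, or None when nothing matches
    match = HREF_RE.search(html)
    return match.group(1) if match else None


print(first_href('<a href="test.html">'))

# With nested divs, a lazy (.*?) stops at the FIRST </div>, not the matching one.
TOP_DIV_RE = re.compile(r'<div class="top">(.*?)</div>')
sample = '<div class="top"><a><b><div class="bottom"></div></b></a></div>'
inner = TOP_DIV_RE.search(sample).group(1)
print(inner)
```

The second print shows the capture ending inside the inner div, which is why a parser is the safer choice for nested markup.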

Related

RegExp find wrong tags

I have some urls saved in DB like hello world
with break tags in them, so I need to delete those. The problem is that <br/> also appears in other places, so I can't delete all of them.
I wrote the RegExp <*"*<br\/?>"> but it selects not only the <br> but the quotes too.
You really shouldn't be using regular expressions for parsing HTML or XML.
Having said that. As I understand it, you have br tags inside the href attribute of a tags.
try :
href\s*?=\s*?\"(.*?)(<br\/?\>)\"
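One way to act on this idea is to rewrite only the href values and leave every other break tag alone. The following Python sketch assumes double-quoted href attributes; the sample string, the example.com URL and the names HREF_VALUE_RE, BR_RE and strip_br_in_hrefs are invented for illustration, since the original data is not shown.

```python
import re

# Group 1: 'href="', group 2: the attribute value, group 3: the closing quote
HREF_VALUE_RE = re.compile(r'(href\s*=\s*")([^"]*)(")', re.IGNORECASE)
BR_RE = re.compile(r'<br\s*/?>', re.IGNORECASE)


def strip_br_in_hrefs(html):
    def clean(match):
        # Remove break tags from the attribute value only
        return match.group(1) + BR_RE.sub('', match.group(2)) + match.group(3)

    return HREF_VALUE_RE.sub(clean, html)


# Hypothetical sample: one <br/> inside the href, one legitimate <br/> after the link
sample = '<a href="http://example.com/page<br/>">hello world</a><br/>'
print(strip_br_in_hrefs(sample))
```

Using a replacement function keeps the break tags outside the attribute untouched, which a single global replace could not do.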
If you are trying to find the affected rows in the database, this is your regex extended to match the whole line:
<.*\".*<br\/>\">.*>
After that you can match the '<br/>' directly in those lines. Is there a language you can use to edit your DB?
Some of the other answers here are okay. I'll offer an alternative:
https://regex101.com/r/uG5PBA/2
This'll put the break tags in a capture group -- group 1, so that you can simply nix them.
Regex:
<a[\s\S]*?(\<br\/>)[\s\S]*?<\/a>
Test String:
hello worldhello world

Regular Expression to exclude a string?

I have a regular expression that runs through html tags and grabs values.
I currently have this to grab all values within the tag.
<title\b[^>]*>(.*\s?)</title>
It works perfectly. So if I have a bunch of pages that have titles:
<title>Index</title>
<title>Artwork</title>
<title>Theory</title>
The values returned are:
Index,
Artwork,
Theory
How can I make this regular expression ignore all tags with the value Theory inside them?
Thanks in Advance
A basic negative lookahead would probably handle that. Put the value you want to exclude inside the lookahead and repeat it for every character of the title:
<title\b[^>]*>((?:(?!Theory).)*)<\/title>
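A negative lookahead of this kind can be tried out as follows. This is a Python sketch that uses Theory from the question as the excluded value; the names TITLE_RE, pages and titles are assumptions for the example.

```python
import re

# (?:(?!Theory).)* consumes one character at a time, but only while
# the text at that position does not start with "Theory".
TITLE_RE = re.compile(r'<title\b[^>]*>((?:(?!Theory).)*)</title>')

pages = [
    '<title>Index</title>',
    '<title>Artwork</title>',
    '<title>Theory</title>',
]

titles = [m.group(1) for page in pages for m in TITLE_RE.finditer(page)]
print(titles)
```

Note that this rejects any title that contains Theory anywhere, not only titles that equal it exactly.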

Regexp for finding tags without nested tags

I'm trying to write a regexp which will help to find non-translated texts in html code.
Translated text goes through a special tag, <fmt:message>, or through the construct ${...}
Ex. non-translated:
<h1>Hello</h1>
Translated texts are:
<h1><fmt:message key="hello" /></h1>
<button>${expression}</button>
I've written the following expression:
\<(\w+[^>])(?:.*)\>([^\s]+?)\</\1\>
It finds correct strings like:
<p>text</p>
Correctly skips
<a><fmt:message key="common.delete" /></a>
But also catches:
<li><p><fmt:message key="common.delete" /></p></li>
And I can't figure out how to add exception for ${...} strings in this expression
Can anybody help me?
If I understand you properly, you want to ensure the data inside the "tag" doesn't contain fmt:message or ${....}
You might be able to use a negative lookahead in conjunction with a . to assert that the characters matched by the . do not start one of those cases:
/<(\w+)[^>]*>(?:(?!<fmt:message|\$\{|<\/\1>).)*<\/\1>/i
If you want to avoid matching any "tags" inside the tag, you can drop the <fmt:message portion and use [^<] instead of . so that only non-< characters match:
/<(\w+)[^>]*>(?:(?!\$\{)[^<])*<\/\1>/i
Added from a comment: if you also want to exclude "empty" tags, add another negative lookahead, this time (?!\s*<), to ensure that the content of the tag is not empty or whitespace only:
/<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*<\/\1>/i
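The last pattern can be checked against the examples from the question with a short Python sketch. The names UNTRANSLATED_RE and find_untranslated are assumptions for this example, and the pattern is carried over without the surrounding slashes, with the i flag expressed as re.IGNORECASE.

```python
import re

UNTRANSLATED_RE = re.compile(
    r'<(\w+)[^>]*>'       # opening tag, tag name captured in group 1
    r'(?!\s*<)'           # content must not be empty or start with another tag
    r'(?:(?!\$\{)[^<])*'  # text with no "<" and no "${"
    r'</\1>',             # matching closing tag
    re.IGNORECASE,
)


def find_untranslated(html):
    # Returns every element whose body is plain, untranslated text
    return [m.group(0) for m in UNTRANSLATED_RE.finditer(html)]


print(find_untranslated('<h1>Hello</h1>'))
print(find_untranslated('<li><p><fmt:message key="common.delete" /></p></li>'))
print(find_untranslated('<button>${expression}</button>'))
```

Only the first call should report anything; the nested and ${...} cases are skipped because the text part is not allowed to contain < or ${.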
If the format is simple as in your examples you can try this:
<(\w+)>(?:(?!<fmt:message).)+</\1>
Rewritten into a more formal question:
Can you match
aba
but not
aca
without catching
abcba ?
Yes.
FSM:
Start->A->B->A->Terminate
Insert abcba and run it
Start is ready for input.
a -> MATCH, transition to A
b -> MATCH, transition to B
c -> FAIL, return fail.
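The state machine walk-through above can be sketched in a few lines of Python. The state names and the accepts_aba function are assumptions chosen for the example; the final state is called "accept" here to keep it distinct from the first A state.

```python
def accepts_aba(text):
    # Transition table: (current state, input character) -> next state
    transitions = {
        ("start", "a"): "A",
        ("A", "b"): "B",
        ("B", "a"): "accept",
    }
    state = "start"
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            # No transition defined for this input: fail immediately
            return False
    return state == "accept"


for candidate in ("aba", "aca", "abcba"):
    print(candidate, accepts_aba(candidate))
```

Running abcba through it fails on the c, exactly as in the trace: there is no transition out of state B for that character.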
I've used a simple one like this with success,
<([^>]+)[^>]*>([^<]*)</\1>
of course if there is any CDATA with '<' in those it's not going to work so well. But should do fine for simple XML.
also see
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
for a discussion of using regex to parse html
executive summary: don't

Regex to find string inside string inside string

Let's say I need to get a string inside some h1, h2, or h3 tags
/<[hH][1-3][^>]*>(.*?)<\/[hH][1-3]>/
This works great if the user decides to take a sane approach to headers:
<h1>My Header</h1>
but knowing my users, they want bold, italic, underlined h1's. And they have that coding quagmire tinyMCE to help them do it. TinyMCE would output:
<h1><b><span style='text-decoration: underline'><i>My Hideous Header</i></span></b></h1>
So my question is:
How do I get a string inside h1, h2, or h3, and then inside any number of other surrounding tags as well?
Thanks,
Joe
/<(h[1-3])[^>]*>(?:.*?>)?([^<]+)(?:<.*?)?<\/\1>/i
It will not be too hard to make cases that break it hideously, since (as I'm sure people will tell you) parsing HTML is a job for an HTML parser, not a regex, but it works for your given case and various similar ones.
If you're in php you can use your regex:
/<[hH][1-3][^>]*>(.*?)<\/[hH][1-3]>/
then pass the captured result through strip_tags() function to get rid of all the insanity inside.
If you are not on php you can pass the result through regexp replace that removes tags. Something like replace
/<\/?[^>]+?>/
with empty string.
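For languages without strip_tags(), the capture-then-strip approach might look like this in Python. It is a sketch with assumed names (HEADER_RE, TAG_RE, header_texts), and the header pattern is a slight variation that uses a backreference so the closing tag has to match the opening one.

```python
import re

# Group 1: h1, h2 or h3; group 2: everything up to the matching closing tag
HEADER_RE = re.compile(r'<(h[1-3])\b[^>]*>(.*?)</\1>', re.IGNORECASE | re.DOTALL)
# Any opening or closing tag
TAG_RE = re.compile(r'</?[^>]+?>')


def header_texts(html):
    # Capture each header body, then replace every tag in it with nothing
    return [TAG_RE.sub('', m.group(2)) for m in HEADER_RE.finditer(html)]


sample = ("<h1><b><span style='text-decoration: underline'>"
          "<i>My Hideous Header</i></span></b></h1>")
print(header_texts(sample))
```

Like the regexes above, this is easy to break with malformed markup or with a > inside an attribute value, so it only suits input of roughly this shape.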
If you only want to capture the ultimately nested text you could just drop all tags inside the header tag with:
/<([hH][1-3]).*>(.*?)<.*\/\1>/
Untested, but I think it should work.

Regular Expression to extract src attribute from img tag

I am trying to write a pattern for extracting the path for files found in img tags in HTML.
String string = "<img src=\"file:/C:/Documents and Settings/elundqvist/My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";
My Pattern:
src\\s*=\\s*\"(.+)\"
Problem is that my pattern will also include the 'border="0" part of the img tag.
What pattern would match the URI path for this file without including the 'border="0"?
Your pattern should be (unescaped):
src\s*=\s*"(.+?)"
The important part is the added question mark, which makes the quantifier lazy so that the group matches as few characters as possible.
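The difference between the greedy and lazy forms is easy to see on the tag from the question. The Python sketch below also tries a negated character class, which is another common way to stop at the first closing quote; the variable names are assumptions for the example.

```python
import re

tag = ('<img src="file:/C:/Documents and Settings/elundqvist/My Documents/'
       'My Pictures/import dialog step 1.JPG" border="0" />')

# Greedy: .+ runs to the LAST double quote in the tag
greedy = re.search(r'src\s*=\s*"(.+)"', tag).group(1)
# Lazy: .+? stops at the first double quote that lets the match succeed
lazy = re.search(r'src\s*=\s*"(.+?)"', tag).group(1)
# Negated class: [^"]+ can never cross a double quote at all
negated = re.search(r'src\s*=\s*"([^"]+)"', tag).group(1)

print(greedy)
print(lazy)
print(negated)
```

The greedy capture spills over into the border attribute, while the other two end at the file name.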
This one grabs the src only if it is inside an <img> tag, not when it appears elsewhere as plain text. It also allows for other attributes before or after the src attribute.
Also, it accepts either single (') or double (") quotes around the value.
\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>
So for PHP you would do:
preg_match("/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/", $string, $matches);
echo "$matches[1]";
for JavaScript you would do:
var match = text.match(/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/)
alert(match[1]);
Hopefully that helps.
Try this expression:
src\s*=\s*"([^"]+)"
I solved it by using this regex.
/<img.*?src="(.*?)"/g
Validated in https://regex101.com/r/aVBUOo/1
You want the non-greedy form of the quantifier inside the capture group. Something like
src\\s*=\\s*\"(.+?)\"
By default the regex will try to match as much as possible.
"I am trying to write a pattern for extracting the path for files found in img tags in HTML."
Can we have an autoresponder for "Don't use regex to parse [X]HTML"?
"Problem is that my pattern will also include the 'border="0" part of the img tag."
Not to mention any time 'src="' appears in plain text!
If you know in advance the exact format of the HTML you're going to be parsing (eg. because you generated it yourself), you can get away with it. But otherwise, regex is entirely the wrong tool for the job.
I'd like to expand on this topic: the src attribute sometimes comes unquoted, so a regex that accepts both quoted and unquoted src attributes is:
src\s*=\s*"?(.+?)["\s]
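A pattern of that shape stops at the first space even inside a quoted value, which matters for paths such as the one in the question. One alternative, sketched here in Python with assumed names (SRC_RE, src_value), spells out the three cases separately: double-quoted, single-quoted and unquoted.

```python
import re

SRC_RE = re.compile(
    r'''src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))''',
    re.IGNORECASE,
)


def src_value(tag):
    # Returns the src value from whichever alternative matched, or None
    match = SRC_RE.search(tag)
    if not match:
        return None
    return next(group for group in match.groups() if group is not None)


print(src_value('<img src="import dialog step 1.JPG" border="0" />'))
print(src_value("<img src='a.jpg'>"))
print(src_value('<img src=a.jpg border=0>'))
```

This is still a regex over raw markup, so it will also fire on src= in plain text or inside other attribute names; it is only reasonable when the input format is known in advance.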