Let's say I need to get a string inside some h1, h2, or h3 tags:
/<[hH][1-3][^>]*>(.*?)<\/[hH][1-3]>/
This works great if the user decides to take a sane approach to headers:
<h1>My Header</h1>
But knowing my users, they want bold, italic, underlined h1s. And they have that coding quagmire TinyMCE to help them do it. TinyMCE would output:
<h1><b><span style='text-decoration: underline'><i>My Hideous Header</i></span></b></h1>
So my question is:
How do I get a string inside h1, h2, or h3, and then inside any number of other surrounding tags as well?
Thanks,
Joe
/<(h[1-3])[^>]*>(?:.*?>)?([^<]+)(?:<.*?)?<\/\1>/i
It will not be too hard to make cases that break it hideously, since (as I'm sure people will tell you) parsing HTML is a job for an HTML parser, not a regex, but it works for your given case and various similar ones.
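For comparison, here is a minimal sketch of the parser route that answer alludes to, written in Python with the standard library's html.parser. The class name HeadingText and its attributes are made up for illustration; the idea is simply to collect all text seen while inside an h1, h2, or h3, whatever tags are nested in between.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3")

class HeadingText(HTMLParser):
    """Hypothetical helper: collects the text content of h1-h3 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading strings
        self._depth = 0      # > 0 while we are inside a heading
        self._buf = []       # text pieces of the current heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> arrives as "h1".
        if tag in HEADINGS:
            self._depth += 1
            if self._depth == 1:
                self._buf = []

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.headings.append("".join(self._buf).strip())

    def handle_data(self, data):
        # Text inside <b>, <span>, <i>, ... still lands here.
        if self._depth:
            self._buf.append(data)

parser = HeadingText()
parser.feed("<h1><b><span style='text-decoration: underline'>"
            "<i>My Hideous Header</i></span></b></h1>")
print(parser.headings)
```

Because the parser tracks nesting itself, the amount of markup TinyMCE wraps around the text does not matter.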
If you're using PHP, you can use your regex:
/<[hH][1-3][^>]*>(.*?)<\/[hH][1-3]>/
then pass the captured result through the strip_tags() function to get rid of all the insanity inside.
If you are not using PHP, you can pass the result through a regex replace that removes tags. Something like replacing
/<\/?[^>]+?>/
with empty string.
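For anyone outside PHP, the two-step idea (capture the header contents, then strip the tags) might look like this in Python. This is only a sketch: the function name header_texts is made up, and the two patterns are the ones quoted above, with the case-insensitive flag standing in for [hH].

```python
import re

# Step 1: the original pattern, capturing everything between the header tags.
HEADER = re.compile(r"<h[1-3][^>]*>(.*?)</h[1-3]>", re.IGNORECASE | re.DOTALL)
# Step 2: the tag-removal pattern from the answer.
TAG = re.compile(r"</?[^>]+?>")

def header_texts(html):
    """Return the text of each h1-h3 with any inner tags removed."""
    return [TAG.sub("", inner) for inner in HEADER.findall(html)]

sample = ("<h1><b><span style='text-decoration: underline'>"
          "<i>My Hideous Header</i></span></b></h1>")
print(header_texts(sample))
```

Like strip_tags(), this leaves HTML entities undecoded and will trip over a literal > inside an attribute value.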
If you only want to capture the ultimately nested text you could just drop all tags inside the header tag with:
/<([hH][1-3]).*>(.*?)<.*\/\1>/
Untested, but I think it should work.
I have some URLs saved in the DB like hello world
with break tags in them, so I need to delete those tags. The problem is that <br/> tags also appear in other places, so I can't delete all of them.
I wrote the RegExp <*"*<br\/?>"> but it selects not only the <br> but the quotes too.
You really shouldn't be using regular expressions for parsing HTML or XML.
Having said that, as I understand it, you have br tags inside the href attribute of <a> tags.
Try:
href\s*?=\s*?\"(.*?)(<br\/?\>)\"
If you are trying to find the affected lines in the database, this is your regex extended to match the whole line:
<.*\".*<br\/>\">.*>
After this you can match the '<br/>' directly in those lines. Which language do you use to edit your DB?
Some of the other answers here are okay. I'll offer an alternative:
https://regex101.com/r/uG5PBA/2
This'll put the break tags in a capture group -- group 1, so that you can simply nix them.
Regex:
<a[\s\S]*?(\<br\/>)[\s\S]*?<\/a>
Test String:
hello worldhello world
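As a rough illustration of the "only touch the <br/> inside href" idea, here is a Python sketch. The original test string did not survive, so the sample markup and the function name clean_hrefs are assumptions; the approach is to match each href value first and then delete break tags only within that value.

```python
import re

# Assumes double-quoted href values with no embedded double quotes.
HREF = re.compile(r'href\s*=\s*"([^"]*)"')
BR = re.compile(r"<br\s*/?>")

def clean_hrefs(html):
    """Remove <br> / <br/> inside href values, leaving other breaks alone."""
    return HREF.sub(lambda m: 'href="' + BR.sub("", m.group(1)) + '"', html)

# Hypothetical sample: one break inside the href, one legitimate break after the link.
sample = '<a href="http://example.com/<br/>">hello world</a><br/>'
print(clean_hrefs(sample))
```

The break tag after the closing </a> is untouched because the substitution only runs on the captured attribute value.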
I have strings that look like this:
"Grand Theft Auto V (5)" border="0" src="/product_images/Gaming/Playstation4 Software/5026555416986_s.jpg" title="Grand... (the string continues for a while here)
I want to use regex to grab this: /product_images/Gaming/Playstation4 Software/5026555416986_s.jpg
Basically, everything in src="..."
At the moment I produce a list using re.findall(r'"([^"]*)"', line) and grab the appropriate one, but there's a lot of quotes in the full string and I'd like to be more efficient.
Can anyone help me put together an expression for this please?
Try with this
(?<=src=").+?(?=" )
Use this as the regex:
src="(.+?)"
This will return the result you want:
re.findall('src="(.+?)"', text_to_search_from)
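A small runnable version, for reference. The sample line is an abbreviated stand-in for the string in the question (the title value there is cut off), and [^"]* is used instead of .+? purely as a design choice so the match can never run past the closing quote.

```python
import re

# Abbreviated stand-in for the line in the question.
line = ('"Grand Theft Auto V (5)" border="0" '
        'src="/product_images/Gaming/Playstation4 Software/5026555416986_s.jpg" '
        'title="Grand"')

match = re.search(r'src="([^"]*)"', line)
src = match.group(1) if match else None
print(src)
```

re.search stops at the first src attribute, which avoids building the full list of quoted strings and picking one out by index.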
Here are some example strings:
<option value="20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg">20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg</option>
<option value="20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg">20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg</option>
Expected output:
20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg
20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg
I need to extract only the first match of the file name from each string, without the quote characters. How can I do it?
The pattern is: starts with 8 digits and ends with jpg.
I am programming in D.
I have tried a lot of variants, and none of them work properly. One of them:
(="[0-9]{8}).+(\")
I know you don't want to use an HTML parser, but I want to show how simple that is for people in the future who find this question.
Regex kinda sorta works for HTML sometimes, but there are a lot of things it doesn't do: it would leave HTML entities (&amp; for example) undecoded, and extracting the right tag can be hard. An HTML parser makes it easy and correct (and IMO more readable):
My dom.d does a decent job on HTML, so I'll show how to use it.
Grab dom.d from my github:
https://github.com/adamdruppe/arsd/blob/master/dom.d
( and if you are parsing tag soup from random non UTF-8 websites, characterencodings.d too: https://github.com/adamdruppe/arsd/blob/master/characterencodings.d )
Then you can do it like this:
import arsd.dom;
import std.stdio;

void main() {
    auto document = new Document("your html string here");
    foreach(option; document.querySelectorAll("option"))
        writeln(option.value); // or option.innerText
}
Compile with dmd yourfile.d dom.d. (Add characterencodings.d to the command line if you need to handle non-UTF-8 too.)
querySelectorAll works like CSS selectors, similar to the same function in JavaScript and in jQuery, so you can put in context too to extract the option tags from the rest of the HTML document.
You can try this regex:
(?<=>)([0-9]{8}.+)(?=<)
\b\d{8}[^" ]*\.jpg(?![^"]*"(?:[^"]*"[^"]*")*[^"]*$)
Try this. See demo:
https://regex101.com/r/fA6wE2/20
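To make the "first occurrence only" requirement concrete, here is a sketch in Python rather than D (the pattern itself should carry over to std.regex with little change, though that is not verified here). Anchoring on value=" picks the attribute copy of the file name and ignores the duplicate in the element text; the variable names are illustrative.

```python
import re

html = (
    '<option value="20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg">'
    '20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg</option>\n'
    '<option value="20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg">'
    '20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg</option>\n'
)

# 8 digits, then anything up to the closing quote, ending in .jpg
FILENAME = re.compile(r'value="(\d{8}[^"]*\.jpg)"')

names = FILENAME.findall(html)
for name in names:
    print(name)
```

The capture group excludes the quotes, so each file name comes back exactly once per option tag.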
I need some help with a VB RegEx.
I've got two RegEx that I need to do two specific things.
RegEx one - I am not exactly sure how to do this, but I need to get everything within an href attribute, i.e.
String = "<a href=""test.html"">"
I need the RegEx to return .... test.html
RegEx Two - I have partly got this working.
I've got tags like
RegEx = "<div class=""top""(.*?)</div>"
String = "<div class=""top""><a><b><div class=""bottom""></div></b></a></div>"
The problem I have is that this isn't returning anything. It should return everything within "top", but it returns nothing.
Neither use-case can be solved well with regular expressions.
Use an HTML parser instead, e.g. the HTML Agility Pack.
Well, if your HTML doesn't contain nested tags, you can do the first part with regex (as long as you control the source you are searching, you can be much more certain of your results).
\<a href=""([^""]+)""\>
The test.html will be found in the capturing group referred to as $1.
For the second part, I'm concerned that the nested tags are what make it fail. The trouble with regex and HTML is that regex can't handle nesting well, and HTML allows nested markup that isn't well formed or best practice but still renders as expected.
Can you post some search source for the second case so we can look?
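Both points can be tried quickly in Python (the regex syntax is close enough to .NET's for this purpose, though the doubled quotes of VB string literals are not needed here). The sketch below uses the strings from the question; the variable names are illustrative. It shows the href capture working on a flat tag, and the lazy div pattern stopping at the first closing </div>, which belongs to the inner "bottom" div.

```python
import re

# Part one: capture the href value of a simple, non-nested tag.
link = '<a href="test.html">'
href = re.search(r'<a href="([^"]+)">', link).group(1)
print(href)

# Part two: the lazy match ends at the FIRST </div>, i.e. the inner one,
# so the capture is cut short instead of covering the whole "top" div.
html = '<div class="top"><a><b><div class="bottom"></div></b></a></div>'
inner = re.search(r'<div class="top"(.*?)</div>', html).group(1)
print(inner)
```

A regex has no way to know which </div> closes which <div>, which is why a parser is the usual recommendation for the second case.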
I'm trying to write a regexp which will help to find non-translated texts in html code.
Translated texts are those that go through a special tag, <fmt:message>, or through the construction ${...}
Ex. non-translated:
<h1>Hello</h1>
Translated texts are:
<h1><fmt:message key="hello" /></h1>
<button>${expression}</button>
I've written the following expression:
\<(\w+[^>])(?:.*)\>([^\s]+?)\</\1\>
It finds correct strings like:
<p>text</p>
Correctly skips
<a><fmt:message key="common.delete" /></a>
But also catches:
<li><p><fmt:message key="common.delete" /></p></li>
And I can't figure out how to add exception for ${...} strings in this expression
Can anybody help me?
If I understand you properly, you want to ensure the data inside the "tag" doesn't contain fmt:message or ${....}
You might be able to use a negative lookahead in conjunction with a . to assert that the characters captured by the . are not one of those cases:
/<(\w+)[^>]*>(?:(?!<fmt:message|\$\{|<\/\1>).)*<\/\1>/i
If you want to avoid capturing any "tags" inside the tag, you can ignore the <fmt:message portion, and just use [^<] instead of a . - to match only non <
/<(\w+)[^>]*>(?:(?!\$\{)[^<])*<\/\1>/i
Added from comment: if you also want to exclude "empty" tags, add another negative lookahead, this time (?!\s*<), to ensure that the stuff inside the tag is not empty or whitespace only:
/<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*<\/\1>/i
If the format is simple as in your examples you can try this:
<(\w+)>(?:(?!<fmt:message).)+</\1>
Rewritten into a more formal question:
Can you match
aba
but not
aca
without catching
abcba ?
Yes.
FSM:
Start->A->B->A->Terminate
Insert abcba and run it
Start is ready for input.
a -> MATCH, transition to A
b -> MATCH, transition to B
c -> FAIL, return fail.
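The same machine as a few lines of Python, as a sketch. The states are represented by how many characters of "aba" have been consumed so far (0 = Start, 3 = Terminate); the function name matches_aba is made up.

```python
def matches_aba(text):
    """Run the Start -> A -> B -> A -> Terminate machine over text."""
    expected = "aba"
    state = 0  # number of expected characters consumed so far
    for ch in text:
        # Any input after Terminate, or any unexpected character, is a FAIL.
        if state == len(expected) or ch != expected[state]:
            return False
        state += 1
    # Accept only if we ended exactly in the Terminate state.
    return state == len(expected)

for candidate in ("aba", "aca", "abcba"):
    print(candidate, matches_aba(candidate))
```

On abcba the machine fails at the third character, exactly as in the trace above.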
I've used a simple one like this with success:
<([^>]+)[^>]*>([^<]*)</\1>
Of course, if there is any CDATA with '<' in it, this is not going to work so well. But it should do fine for simple XML.
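A short Python check of that pattern, including the CDATA caveat. The XML snippets and variable names are invented for the demonstration.

```python
import re

SIMPLE = re.compile(r"<([^>]+)[^>]*>([^<]*)</\1>")

# Plain leaf elements, with and without attributes: (tag, text) pairs come back.
simple_xml = '<name>Joe</name><age unit="years">42</age>'
pairs = SIMPLE.findall(simple_xml)
print(pairs)

# A CDATA section containing '<' defeats the [^<]* part, so nothing matches.
cdata_xml = "<expr><![CDATA[1 < 2]]></expr>"
cdata_pairs = SIMPLE.findall(cdata_xml)
print(cdata_pairs)
```

Note that for the element with an attribute, the first group only settles on the bare tag name after backtracking, since the closing tag has to repeat it exactly.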
Also see
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
for a discussion of using regex to parse HTML.
Executive summary: don't.