How do we create such a regular expression to extract data? - regex

<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>
I'd like to create a regexp that safely matches these:
<br>Aggie<br>
<br>John<br>
<br>Mary<br>
<br><b>Peter</b><br>
It is possible that there are other tags (e.g. <i>, <strike>, etc.) between each pair of <br>, and they have to be collected just like <br><b>Peter</b><br>.
What should the regexp look like?

If you learn one thing on SO, let it be: "Do not parse HTML with a regex". Use an HTML parser.

<br>.*?<br>
will match anything from one <br> tag to the closest following one.
The main problem with parsing HTML using regexes is that regexes can't handle arbitrarily nested structures. This is not a problem in your example.
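As a minimal sketch of the non-greedy pattern above, assuming Python's re module (the question does not name a language), applied to the sample string from the question:

```python
import re

# Sample input copied from the question.
html = "<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>"

# Non-greedy: each match runs from one <br> to the closest following <br>.
matches = re.findall(r"<br>.*?<br>", html)

for match in matches:
    print(match)
```

Note that . does not match newlines by default in Python, so content that spans several lines would need the re.DOTALL flag.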

Split the string at (<br>)+. You'll get empty strings at the beginning and the end of the result, so you need to remove them, too.
If you want to preserve the <br>, then this is not possible unless you know that there is one before and after each element in the result.
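The split approach can be sketched like this, again assuming Python. The group is written as non-capturing, (?:<br>)+, because Python's re.split would otherwise return the separators as well:

```python
import re

# Sample input copied from the question.
html = "<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>"

# Split on runs of one or more <br> tags.
parts = re.split(r"(?:<br>)+", html)

# Drop the empty strings produced at the start and end of the input.
items = [part for part in parts if part]
print(items)
```

On this input the <p>Hello world</p> fragment also shows up in the result, since it sits between two <br> tags; it would have to be filtered out separately if it is not wanted.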

Related

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two results: the overall match, which covers everything inside the tags, and the capturing group, which matches a single "key:value" occurrence. The regex engine cannot return each individual occurrence of a repeated group; it keeps only the last one.
In Python terms, the match object obtained after applying your regex gives you matcher.group(0) (the whole match) and matcher.group(1) (the single ( ) group in the regex).
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
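A minimal two-step sketch of that idea in Python, using the asker's look-around pattern to isolate the tag content and then \w+:\w+ to pull out each pair:

```python
import re

xml = "<aaa>key0:val0 key1:val1 key2:val2</aaa>"

# Step 1: isolate everything between the tags (the asker's first pattern).
inner = re.search(r"(?<=<aaa>).*(?=</aaa>)", xml)

# Step 2: run the simple pair pattern over that inner text only.
pairs = re.findall(r"\w+:\w+", inner.group(0)) if inner else []
print(pairs)
```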
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.
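For the parser-plus-loop route, here is a rough sketch. It swaps in Python's standard-library xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages; the <root> wrapper element is a hypothetical stand-in for the rest of the XML document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the <aaa> element comes from the question.
document = "<root><aaa>key0:val0 key1:val1 key2:val2</aaa></root>"

root = ET.fromstring(document)

pairs = {}
for element in root.iter("aaa"):
    # Split the element text on whitespace, then each token on the first colon.
    for token in (element.text or "").split():
        key, _, value = token.partition(":")
        pairs[key] = value

print(pairs)
```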

Exclude a certain String from variable in regex

Hi, I have a stylesheet where I use xsl:analyze-string with the following regex:
(&journal_abbrevs;)[\s ]*([0-9]{{4}})[,][\s ][S]?[\.]?[\s ]?([0-9]{{1,4}})([\s ][(][0-9]{{1,4}}[)])?
You don't need to look at the whole thing :)
&journal_abbrevs; looks like this:
"example-String1|example-String2|example-String3|..."
What I need to do now is exclude one of the strings in &journal_abbrevs; from this regex. E.g. I don't want example-String1 to be matched.
Any ideas on how to do that ?
It seems XSLT regex does not support look-around. So I don't think you'll be able to get a solution for this that does not involve writing out all strings from journal_abbrevs in your regex. Related question.
To minimize the amount of writing out, you could split journal_abbrevs into, say, journal_abbrevs1, journal_abbrevs2 and journal_abbrevs3 (or however many you decide to use) and only write out whichever one contains the string you wish to exclude. If journal_abbrevs1 contains the string, you'd then end up with something like:
((&journal_abbrevs2;)|(&journal_abbrevs3;)|example-String2|example-String3|...)...
If it supported look-around, you could've used a very simple:
(?!example-String1)(&journal_abbrevs;)...
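To illustrate what the negative lookahead would do in an engine that does support it, here is a small sketch in Python (not XSLT). The abbreviation list reuses the placeholder strings from the question, the tail of the pattern is shortened to the four-digit year, and the sample text is made up.

```python
import re

# Placeholder alternation from the question, standing in for &journal_abbrevs;.
journal_abbrevs = "example-String1|example-String2|example-String3"

# (?!example-String1) rejects any match that would start with the excluded string.
pattern = re.compile(r"(?!example-String1)(" + journal_abbrevs + r")\s*([0-9]{4})")

# Hypothetical input text.
text = "example-String1 2001, example-String2 2002, example-String3 2003"
hits = pattern.findall(text)
print(hits)
```

One caveat with this form: the lookahead also rejects any longer abbreviation that merely starts with example-String1.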

Last Matched String Issue

I'm using the following regular expression to pull out some html:
(?i)(?:\<tr\s*class='list'[^\>]*\>)[^$+]*\</tr\>
The problem is that it's not separating the TRs correctly. I'm trying to use $+ to reference the tag selector again to ensure that the contents of the match don't contain the start tag again. Here is the sample html:
http://www.pastie.org/1311827
There are multiple <tr>s in some matches. Please help.
I don't know what you think [^$+]* means, but it defines a negated character class that matches zero or more times. In other words, it matches an empty string, or one or more characters that aren't a literal dollar sign or plus.
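A quick check of that reading, sketched in Python (the question does not say which engine is used, but $ and + are literal inside a character class in the common regex flavors):

```python
import re

# Inside [...] the characters $ and + are literals, so [^$+]* stops at either one.
first_run = re.match(r"[^$+]*", "abc$def+ghi").group(0)

# When the input starts with an excluded character, the match is the empty string.
empty_run = re.match(r"[^$+]*", "$start").group(0)

print(repr(first_run), repr(empty_run))
```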
HTML cannot be trivially parsed by regex (unless it is known ahead of time what the structure will look like) because, in order to properly parse a document, you need to be able to recurse: elements within the document can be nested within themselves (for instance, a <div> can contain another <div>). While some languages (you didn't specify what you're using) support recursive regular expressions (Perl and PHP, for instance), it would likely be more efficient to use a proper DOM parser than a recursive regex anyway, the complexity of the latter notwithstanding.
Use document.getElementsByTagName in your favorite DOM library and iterate through the nodeList with a loop, then parse the getAttribute('class').
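As one possible sketch of that loop, Python's standard-library xml.dom.minidom exposes getElementsByTagName and getAttribute under the same names. Two assumptions to keep in mind: minidom is an XML parser, so this only works on well-formed markup, and the table below is a made-up sample rather than the markup behind the question's pastie link.

```python
from xml.dom import minidom

# Hypothetical, well-formed sample markup.
markup = """<table>
<tr class='list'><td>first</td></tr>
<tr class='header'><td>skip me</td></tr>
<tr class='list'><td>second</td></tr>
</table>"""

doc = minidom.parseString(markup)

list_texts = []
for row in doc.getElementsByTagName("tr"):
    # Keep only the rows whose class attribute is exactly 'list'.
    if row.getAttribute("class") == "list":
        cell = row.getElementsByTagName("td")[0]
        list_texts.append(cell.firstChild.data)

print(list_texts)
```

For real-world HTML that is not well-formed, a dedicated HTML parser would be the safer choice.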
I suggest not using regex because it's only a matter of time before the regex breaks, unless you're dealing with very trivial markup. In addition, the DOM is made for exactly that purpose.

Which regular expression to use to extract some words from an HTML text?

I am having a hard time building a regular expression to grab some words from an HTML text.
Let's say I have the following:
<p style="padding-left :12px">SOME_TEXT_I_WANT</p><p>SOME_OTHER_TEXT</p>
*SOME_TEXT_I_WANT* and *SOME_OTHER_TEXT* can be either a bunch of words like "SOME RANDOM TEXT" or HTML text like "<strong>SOME BOLD TEXT</strong>"
My goal is to extract those texts with one regex.
Which language do you intend to use? Does an HTML parser exist for this language? If yes, consider using a parser.
However, if this is a "one-off", you may be able to get through with something along the lines of:
#<p[^>]*>(.*?)</p>#
The above has certain limitations; most notably, it does not correctly match <p data-something="a > b">...</p> or nested <p>s. (I am not able to tell whether the mark-up you're trying to parse actually allows nested <p>s. I am just informing you of possible pitfalls.)
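A short sketch of that pattern in Python (without the PHP # delimiters), run against the sample from the question and against the attribute pitfall mentioned above:

```python
import re

pattern = re.compile(r"<p[^>]*>(.*?)</p>")

# Sample input copied from the question.
html = '<p style="padding-left :12px">SOME_TEXT_I_WANT</p><p>SOME_OTHER_TEXT</p>'
texts = pattern.findall(html)

# Pitfall: a '>' inside an attribute value ends [^>]* too early.
tricky = '<p data-something="a > b">text</p>'
broken = pattern.findall(tricky)

print(texts)
print(broken)
```

In the second case the capture starts right after the first > character, so part of the attribute value leaks into the extracted text.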
Assuming you are using PHP:
$html = "<p>some text here</p>";
// Strip every tag and keep only the text content.
$text = preg_replace("/<.+?>/", "", $html);
Don't use regex. If you ask why, there is a very popular SO post that describes what can happen if you try to use regex for parsing HTML.
Use your language's HTML or XML parser and extract what you need using existing functionality.

How can I match a number inside a given HTML tag?

I would like to match the numbers inside an HTML tag such as:
Sometext<sometag><htmltag>123123</htmltag></sometag>
I would like to create a regex that finds the number that is inside the HTML tag of my choice, for example the 123123 inside <htmltag>.
No, you don't need to "match", you need to extract an HTML node. Use an HTML parser. An HTML parser is simpler to use, more robust against changes, and easier to extend (e.g. grabbing more parts of the same document). A regular expression, on the other hand, is just the wrong tool, because HTML is not a regular language.
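As a rough illustration of the parser route, here is a sketch using Python's standard-library html.parser (the question names no language, so this choice is an assumption). TagTextCollector is a hypothetical helper written for this example, not part of any library.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the text found inside every occurrence of one chosen tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many chosen tags are currently open
        self.chunks = []    # text pieces seen while inside the chosen tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


collector = TagTextCollector("htmltag")
collector.feed("Sometext<sometag><htmltag>123123</htmltag></sometag>")
collector.close()

# Keep only the chunks that consist purely of digits.
numbers = [chunk.strip() for chunk in collector.chunks if chunk.strip().isdigit()]
print(numbers)
```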
If all there is between those two tags is the number, and absolutely no white space or anything, you can simply use this regex:
/<htmltag>([0-9]+)<\/htmltag>/
Or this if there might be whitespace:
/<htmltag>\s*([0-9]+)\s*<\/htmltag>/
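Both patterns can be tried out in a few lines. This sketch assumes Python, so the / delimiters are dropped; the padded input is a made-up variation of the question's sample.

```python
import re

text = "Sometext<sometag><htmltag>123123</htmltag></sometag>"

# Strict form: nothing but digits between the tags.
strict = re.search(r"<htmltag>([0-9]+)</htmltag>", text)

# Tolerant form: optional whitespace around the digits.
padded_text = "<htmltag> 42 </htmltag>"
padded = re.search(r"<htmltag>\s*([0-9]+)\s*</htmltag>", padded_text)

# The strict form finds nothing once whitespace is present.
strict_on_padded = re.search(r"<htmltag>([0-9]+)</htmltag>", padded_text)

print(strict.group(1), padded.group(1), strict_on_padded)
```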