RegEx: Get content from multiple concatenated HTML-Files - regex

I have a bunch of HTML files that I concatenate, and I want to keep only the actual contents.
However, I'm having some trouble finding the correct regex for that. Basically, I'm trying to remove everything before, in between, and after certain boundaries. It's somewhat similar to "Regular expression to match a line that doesn't contain a word?", but I feel my case is more complex. I'm having no luck.
Source-Data:
Stuff I dont need before
<div id="start">
blablabla11
blablabla12
<div id="end">
Stuff I dont need in the middle1
<div id="start">
blablabla21
blablabla22
<div id="end">
Stuff I dont need in the middle2
<div id="start">
blablabla31
blablabla32
<div id="end">
Stuff I dont need in the end
Desired result:
<div id="start">
blablabla11
blablabla12
<div id="end">
<div id="start">
blablabla21
blablabla22
<div id="end">
<div id="start">
blablabla31
blablabla32
<div id="end">
Context:
I'm working in Sublime (Mac) -> Perl Regex
My current approach is based on inverse matching / regex lookarounds (I know there is a lot of discussion about wording, methods, ugliness, etc. around this topic, but I can't afford to care, as I need to get the job done):
Find: (?s)^((?!(<div id="start">)(?s)(.*?)(<div id="end">)).)*$
Replace: $3
And many more variants; I've been testing and playing around.
However, it yields:
blablabla11
blablabla12
<div id="start">
blablabla21
blablabla22
<div id="start">
blablabla31
blablabla32
<div id="start">
Nice, but not there yet. And whatever I'm trying I'm stumbling into other problems. Noob at work I guess.
Thanks a gazillion for your help guys!
Chris
EDIT:
Thank you for the first answers! However, I must admit that my minimal example is a bit misleading (because it is too easy). In reality I am facing hundreds of complex and diverse HTML files concatenated into one single large file.
The only common bits are that the content of every HTML file starts with a known string (here simplified as <div id="start">) and ends with a known string (here simplified as <div id="end">). The content itself obviously has loads of different tags etc., so just testing for opening and closing tags sadly won't cut it.

You may look for
(?s).*?(<div id="start">.*?<div id="end">)(?:(?:(?!<div id="start">).)*$)?
and replace with $1\n\n. See regex demo.
Details
(?s) - DOTALL modifier, . now matches any char
.*? - any 0+ chars, as few as possible
(<div id="start">.*?<div id="end">) - Group 1: <div id="start">, any 0+ chars as few as possible, and <div id="end">
(?:(?:(?!<div id="start">).)*$)? - an optional non-capturing group matching 1 or 0 occurrence of
(?:(?!<div id="start">).)* - any char, 0 or more occurrences, that does not start a <div id="start"> char sequence (aka tempered greedy token)
$ - end of string.
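As a rough check of the pattern above outside Sublime, here is a minimal Python sketch. It assumes Python's `re` module behaves like the Perl-style engine for this pattern, writes the group reference as `\1` instead of `$1`, and uses a single `\n` in the replacement so the output matches the desired result from the question. The `source` and `cleaned` names are only illustrative.

```python
import re

# Sample data from the question, joined into one string.
source = "\n".join([
    "Stuff I dont need before",
    '<div id="start">', "blablabla11", "blablabla12", '<div id="end">',
    "Stuff I dont need in the middle1",
    '<div id="start">', "blablabla21", "blablabla22", '<div id="end">',
    "Stuff I dont need in the middle2",
    '<div id="start">', "blablabla31", "blablabla32", '<div id="end">',
    "Stuff I dont need in the end",
])

pattern = re.compile(
    r'(?s).*?(<div id="start">.*?<div id="end">)'   # junk, then one block (group 1)
    r'(?:(?:(?!<div id="start">).)*$)?'             # trailing junk after the last block
)

# Keep only group 1 of every match, followed by a newline.
cleaned = pattern.sub(r"\1\n", source)
print(cleaned)
```

The optional tail only succeeds after the last block, because `$` can only match once no further `<div id="start">` lies ahead.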

Related

How do I conditionally add a space in a regex replace

When I woke up this morning, I didn’t know a stroke of regex. By the time I went to Mass, I’d been able to cobble together this regex to find occurrences of ‘Mph’ in an html document.
(?i)(?<=[\s|\d])mph+
If I run it against the following test data:
<div class="vsMph">
<p>95 Mph</p>
</div>
<div class="vsMph">
<p>95Mph</p>
</div>
It correctly matches:
‘ Mph’ and
‘Mph’
And equally correctly leaves the ‘vsMph’ alone, which is exactly what I want. Eventually, I'm going to use the same technique to match knots, ft, in, km and so on.
I’m executing this expression in Sublime Text 3 using RegReplace. Ultimately, I hope to use this regular expression to find all occurrences of ‘Mph’ preceded by a space or a digit and:
1. Enclose ‘Mph’ in <abbr> tags.
2. Add a space between the digit and the opening <abbr> tag if there was no space between the last digit and 'Mph' originally.
In other words, I want to convert the above test data to:
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
I can get RegReplace to add the <abbr> tags as described in 1. above, but I’ve searched around on Google and I can’t find anything that tells me how to conditionally insert a space in a regex replace.
So I’m wondering. Is it possible in the first place to conditionally add a space in a regex replacement and if so how do I do it, or do I have to search for ‘\sMph’ and ‘\dMph’ and replace them separately?
Regards.
I would suggest using groups to match Mph. You could search for simply the following regex:
(\d)(\s)?(Mph)
Then replace using groups
$1 <abbr title="Miles per hour">$3</abbr>
output:
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
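The same idea can be tried in Python as a small sketch. It assumes Python's `re.sub`, where group references are written `\1` and `\3` rather than `$1` and `$3`; the variable names are illustrative only.

```python
import re

html = """<div class="vsMph">
<p>95 Mph</p>
</div>
<div class="vsMph">
<p>95Mph</p>
</div>"""

# Group 2 (the optional whitespace) is matched but deliberately left out of
# the replacement, so exactly one space always follows the digit.
result = re.sub(
    r'(\d)(\s)?(Mph)',
    r'\1 <abbr title="Miles per hour">\3</abbr>',
    html,
)
print(result)
```

Because the pattern requires a digit before the optional space, the `vsMph` class names are left untouched.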

Regex - match every possible char and space

I want to extract data from HTML. The problem is that I can't extract two of the strings, which sit at the top and at the bottom of my pattern.
I want to extract 23423423423 and 1234523453245, but only if the string Allan appears between them:
<h4>###### </h4> said12:49:32
</div>
<a href="javascript:void(0)" onclick="replyAnswer(##########,'GET','');" class="reportLink">
report </a>
</div>
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>
Of course, I can do something like this: Profile\/(\d+).*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*Allan.*\s*.*\s*.*AddAnswer(\d+). But the code is horrible. Is there any way to make it shorter?
I was thinking about:
Profile\/(\d+)(.\sAllan)*AddAnswer(\d+)
or
Profile\/(\d+)(.*Allan\s*)*AddAnswer(\d+)
but neither of them works properly. Do you have any ideas?
You can construct a character class that matches any character, including newlines, by using [\S\s]. Whitespace and non-whitespace characters together cover every character.
Then, your attempts were reasonably close
/Profile\/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)/
This looks for the profile, the number that comes after it, any characters before Allan, any characters before AddAnswer, and the number that comes after it. If you have single-line mode available (/s) then you can use dots instead.
/Profile\/(\d+).*Allan.*AddAnswer(\d+)/s
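You can use m to make . match newlines, but only in Ruby; in PCRE, Perl, Python, and JavaScript that job belongs to the s flag, and m only changes how ^ and $ behave. Note that this pattern no longer requires Allan to appear.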
/Profile\/(\d+).+AddAnswer(\d+)/m
Better use a parser instead. If you must use regular expressions for whatever reason, you might get along with a tempered greedy solution:
Profile/(\d+) # Profile followed by digits
(?:(?!Allan)[\S\s])+ # any character except when there's Allan ahead
Allan # Allan literally
(?:(?!AddAnswer)[\S\s])+ # same construct as above
AddAnswer(\d+) # AddAnswer, followed by digits
See a demo on regex101.com
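For illustration, the commented pattern above can be run in Python with `re.VERBOSE`. This is only a sketch: the `sample` text is hypothetical, and in particular the `Profile/...` link is an assumption, because the excerpt in the question does not show that line.

```python
import re

pattern = re.compile(r"""
    Profile/(\d+)              # Profile followed by digits
    (?:(?!Allan)[\S\s])+       # any character, as long as Allan does not start here
    Allan                      # Allan literally
    (?:(?!AddAnswer)[\S\s])+   # same construct, stopping before AddAnswer
    AddAnswer(\d+)             # AddAnswer followed by digits
""", re.VERBOSE)

# Hypothetical input: the Profile link is assumed, not taken from the question.
sample = '''<a href="/Profile/23423423423">user</a>
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>'''

match = pattern.search(sample)
print(match.groups() if match else None)

# Without Allan in between, nothing is extracted.
no_match = pattern.search(sample.replace("Allan", "Bob"))
print(no_match)
```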

Matching from the last occurrence of a character in a string with Regex

Yes I know, don't parse html with regex. That said:
I am trying to capture content between any tag with the word "Title" in the first tag.
I started with:
(?P<QUALIFY_TITLE><(.*?)(title)(.*?)>)(.*?)?(?<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)
Where the Named Group Capture is a known word/string I am looking for. I also capture for research sake the QUALIFY_TITLE Name group. I do this because I don't want the string/term unless I 'qualify' it in this way.
However, if I have part of an html that looks like this:
<div class="wwm"><div class="inbox"><input name="language-id" type="hidden" id="language-id" value="" /><input name="widget-page-handle" type="hidden" id="widget-page-handle" value="wwm4widget_post" /><input name="email-page-handle" type="hidden" id="email-page-handle" value="wwm4widget_emailpopup" /><div id="divWidget" style="display: block;" class="vhWidget"> <div id="divShareLink" style="display: block;" class="shareLink"><div id="divTitle" class="title">KnownTermIWant</title>
Although I get the CAPTURE string I want (KnownTermIWant), the QUALIFY_TITLE string starts from the very first "<".
I am trying to have QUALIFY_TITLE start from the last "<" before the title, not the first. In other words, QUALIFY_TITLE should be:
<div id="divTitle
or even
<div id="divTitle" class="title">
but I am currently getting
<div class="wwm"><div class="inbox"><input name="language-id" type="hidden" id="language-id" value="" /><input name="widget-page-handle" type="hidden" id="widget-page-handle" value="wwm4widget_post" /><input name="email-page-handle" type="hidden" id="email-page-handle" value="wwm4widget_emailpopup" /><div id="divWidget" style="display: block;" class="vhWidget"> <div id="divShareLink" style="display: block;" class="shareLink"><div id="divTitle" class="title"
The problem is that a regex-search will try to match at the first possible opportunity, and non-greedy quantifiers (*? instead of *) do not affect whether something is a match. For example, given the string abcd, the regex .*?d will match the whole thing, because .*? will still match as much as it needs to in order to ensure that the regex matches.
Do you see what I mean?
So you need to make your subexpressions more precise; for example, instead of <(.*?)(title)(.*?)>, you should write <([^>]*)(title)([^>]*)>.
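A short Python sketch makes the difference visible. The `html` string is a shortened version of the markup from the question (with a matching closing tag added), and the variable names are only illustrative.

```python
import re

# A lazy quantifier does not move the start of the match.
whole = re.search(r'.*?d', 'abcd').group()
print(whole)

# Shortened version of the markup from the question.
html = ('<div class="wwm"><div class="inbox">'
        '<div id="divTitle" class="title">KnownTermIWant</div>')

# .*? may cross tag boundaries, so the match starts at the very first "<".
lazy = re.search(r'<(.*?)(title)(.*?)>', html).group()

# [^>]* cannot cross a ">", so the match must stay inside a single tag.
bounded = re.search(r'<([^>]*)(title)([^>]*)>', html).group()

print(lazy)
print(bounded)
```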
The problem
There's only one problem here, you are matching exactly what you've asked for :)
The process
If you want to match only the last tag, ask yourself this question:
"What is inside every preceding tag, but not inside the one I want?"
The conclusion
The answer is the open/close tags themselves:
(?P<QUALIFY_TITLE><([^<>]*?)(title)(.*?)>)(.*?)?(?<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)
                    ^^^^^
Your code was quite a big mess, but I'm going to answer the question in the title, in a much more simplified way:
In this sample code:
<div>Example text<div>Foo bar</div> Hello world <div>Lorem ipsum</div></div> hi
if you want to match from the first <div> to the last </div>, you could just use a greedy quantifier, such as + or *:
/<div>(.*)<\/div>/
That will match the whole string, until the very last </div>.
If this doesn't answer your question, be aware that the complexity of the regular expression grows very quickly with every extra requirement, so, like you said in your very first line, just use a parser.
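To see the greedy behaviour next to its lazy counterpart, here is a minimal Python sketch using the sample string from this answer; the variable names are illustrative only.

```python
import re

text = ('<div>Example text<div>Foo bar</div> Hello world '
        '<div>Lorem ipsum</div></div> hi')

# Greedy: .* runs to the end and backtracks to the LAST </div>.
greedy = re.search(r'<div>(.*)</div>', text).group(1)

# Lazy: .*? stops at the FIRST </div> it can reach.
lazy = re.search(r'<div>(.*?)</div>', text).group(1)

print(greedy)
print(lazy)
```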

Match word anywhere within two markers, delimited by whitespace or the markers

Using GNU sed, can I replace a word only if it appears between two markers (anywhere between them), and only if that word is delimited on the left by the starting marker or whitespace and on the right by the ending marker or whitespace? This is very similar to using \b on either side of the word (between the markers), but allowing only whitespace (or nothing, if adjacent to the starting/ending marker) as a delimiter. \b marks the boundary between "word" and "non-word" characters and treats - as a non-word character, which is undesired in this case. Work and results so far, and test cases, are below.
[Detail: Specifically, I'm trying to replace classes within the class="..." text in HTML files with other classes. This may be yet another example of "don't use regex to work with HTML," but the problem is so contained (I don't care if it happens to match outside a start tag, for instance; I don't care about nesting), it feels like it should be possible and, if possible, would be preferred to my next option, Jsoup (however cool and enticing that is). And it feels like a regex and/or sed learning opportunity.]
The starting marker is:
\(\sclass\s*=\s*"\)
(yes, I need to capture it).
The ending marker is:
"
...where no " are allowed in-between (whether escaped in some way or not). So nice and contained, not requiring proper parsing. (I'll use a second command to handle the single quotes version.)
I want to match things like this (for example, there are several of them):
span\([0-9]\+\)
Here's what I have so far, changing spanN to col-md-N (but using \b, and thus not quite working correctly):
s/\(\sclass\s*=\s*"\)\([^"]*\)\bspan\([0-9]\+\)\b\([^"]*\)"/\1\2col-md-\3\4"/g
And it works nicely for this sample data:
<div class="blah span3 arg">This has span3 in it</div>
<div class="span3">This has span3 in it</div>
<div class="span3 arg">This has span3 in it</div>
Giving me the desired:
<div class="blah col-md-3 arg">This has span3 in it</div>
<div class="col-md-3">This has span3 in it</div>
<div class="col-md-3 arg">This has span3 in it</div>
But of course it would also change the following:
<div class="blah x-span3 arg">This has x-span3 in it</div>
<div class="x-span3">This has x-span3 in it</div>
<div class="x-span3 arg">This has x-span3 in it</div>
<div class="blah span3-x arg">This has span3-x in it</div>
<div class="span3-x">This has span3-x in it</div>
<div class="span3-x arg">This has span3-x in it</div>
...which is not desired. And it goes without saying that xxxspan3 should also be left alone (which the \b version does, of course).
Is it possible to make it not change those? Without repeating the expression three times for the "at the beginning", "in the middle", and "at the end" cases? (Six times, if you count the single quotes permutations. Dozens of times if you count everything else I need to change.)
If the answer really is "no, you can't," well, that's a perfectly acceptable answer and I'll get a bigger hammer.
Epilogue: FYI, this was indeed Yet Another Case of "don't try to process HTML with regular expressions." While Jerry's answer did indeed do what I needed, the further I got into it the clearer it became that I needed more context than regex could give me. I ended up using NodeJS with the cheerio DOM parser, because cheerio is very good at being minimal with its changes to the markup.
You can try this regex:
s/\(\sclass\s*=\s*"\)\(\([^"]*\)\( \)\)\?span\([0-9]\+\)\(\( \)\([^"]*\)\)\?"/\1\3\4col-md-\5\7\8"/g
[Sorry it's a bit long]
I started out with (with highlighted changes):
s/\(\sclass\s*=\s*"\?\)\([^"]*\)\([" ]\)span\([0-9]\+\)\([" ]\)\([^"]*\)/\1\2\3col-md-\4\5\6/g
^^ ^^^^^^^^ ^^^^^^^^ ^
Here I attempted to capture the " or the space preceding span, and whichever of the two followed the digits after span. That also called for more backreferences in the replacement and for removing the last quote, so the regex had to be adjusted. But since class=span shouldn't qualify as a match, I realised I couldn't just make the first quote optional or remove the last quote.
I thus removed the quotes from the capture groups:
s/\(\sclass\s*=\s*"\)\([^"]*\)\( \)span\([0-9]\+\)\(" \)\([^"]*\)"/\1\2\3col-md-\4\5\6"/g
^^^^^ ^^^^^
Now, there were only the quotes to deal with. Since we can have only "span ... or span\d+", it meant that everything in between could be made optional:
s/\(\sclass\s*=\s*"\)\(\(\([^"]*\)\( \)\)\?span\([0-9]\+\)\(\(" \)\([^"]*\)\)\?"/\1\2\3col-md-\4\5\6"/g
^^ ^^^^ ^^ ^^^^
Only thing that was left was to adjust the backreferences for the different capture groups:
s/\(\sclass\s*=\s*"\)\(\([^"]*\)\( \)\)\?span\([0-9]\+\)\(\( \)\([^"]*\)\)\?"/\1\3\4col-md-\5\7\8"/g
^^^^ ^^^^
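For comparison only, here is a Python sketch of the same replacement. It is not the sed approach above but a swapped-in technique: sed's regex syntax has no lookarounds, while Python's `re` does, so the "whitespace or attribute edge" rule can be written as `(?<!\S)` and `(?!\S)` applied to the attribute value alone. The names `CLASS_ATTR`, `SPAN_TOKEN`, and `rename_spans` are hypothetical.

```python
import re

# The class attribute: opening marker, value without quotes, closing quote.
CLASS_ATTR = re.compile(r'(\sclass\s*=\s*")([^"]*)(")')

# spanN as a whole whitespace-delimited token inside the attribute value.
SPAN_TOKEN = re.compile(r'(?<!\S)span([0-9]+)(?!\S)')

def rename_spans(html):
    """Rewrite spanN to col-md-N, but only inside class="..." values."""
    def fix_attr(m):
        value = SPAN_TOKEN.sub(r'col-md-\1', m.group(2))
        return m.group(1) + value + m.group(3)
    return CLASS_ATTR.sub(fix_attr, html)

print(rename_spans('<div class="blah span3 arg">This has span3 in it</div>'))
print(rename_spans('<div class="x-span3 arg">This has x-span3 in it</div>'))
```

Because the token pattern runs on the attribute value in isolation, the start and end of that value count as delimiters without needing separate "beginning", "middle", and "end" cases.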

A regular expression question

I have content something like
<div class="c2">
<div class="c3">
<p>...</p>
</div>
</div>
What I want is to match the inner HTML of div.c2. Its contents may vary a lot. The only problem I am facing is how to make sure that the right closing div is taken.
You can't. This problem is unsolvable with classic regular expressions, and with most of the existing regex implementations.
However, some regex engines have special support for balanced pair matching. See, e.g., here (.NET). Though even in this case your regex will be able to parse only a subset of syntactically correct texts (e.g., what if a < /div > is embedded in a comment?). You need an HTML parser to get reliable results.
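As a rough illustration of the parser route, here is a sketch using Python's standard-library `html.parser`. The `InnerHtml` class and `inner_html` helper are hypothetical names. The sketch counts nested `<div>` tags to find the matching close, and it makes simplifying assumptions: the class attribute must equal the given value exactly, comments are dropped, and character references come back decoded rather than in their original form.

```python
from html.parser import HTMLParser

class InnerHtml(HTMLParser):
    """Collect the markup inside the first <div> with the given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0       # open <div> count inside the target; 0 = outside
        self.parts = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif not self.done and tag == "div" and ("class", self.css_class) in attrs:
            self.depth = 1   # entered the target div; do not record its own tag

    def handle_startendtag(self, tag, attrs):
        if self.depth:       # self-closing tags such as <br/> do not nest
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:   # this is the target's own closing tag
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def inner_html(markup, css_class):
    parser = InnerHtml(css_class)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

sample = '<div class="c2">\n<div class="c3">\n<p>...</p>\n</div>\n</div>'
print(inner_html(sample, "c2"))
```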
Any chance this will always be valid XHTML? If so, you'd be better off parsing it as XML than trying to regex this.
Delete the first line, delete the last line. Problem solved. No need for RegEx.
The following pattern works well with .Net RegEx implementation:
\<div class="c2"\>{[\n a-z.<>="0-9/]+}\</div\>
And we replace that with \1.
Input:
<div class="c2">
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>
</div>
Output:
<div class="c3">
<p>...</p>
</div></div></div></div></div></div></div></div>