RegEx - Match only when a certain word occurs - regex

It's probably super simple.
I want to match a header only when a certain word occurs somewhere between the opening and closing <h1> to <h6> tags.
This is what I have so far:
(<h[d{1-6}](.itemprop.)(.*?)</h[d{1-6}]>)
I want it to match
<h1 class="test" itemprop="name">Test</h1>
AND
<h2 itemprop="name" class="test">Test</h2>
AND
<h6 class="test"><strong itemprop="Price">9,99</strong>Test</h6>
As it is now, it only matches when itemprop immediately follows the tag name, e.g. <h2 itemprop ...

How about:
<h([1-6]).*?\bitemprop\b.*?</h\1>
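A minimal Python sketch of that suggestion, with the pattern written with balanced parentheses. The sample list and the names HEADER_WITH_ITEMPROP, samples, and matched are made up for illustration; the fourth sample is an added negative case that is not in the question.

```python
import re

# Group 1 captures the header level so that \1 forces the closing tag
# to use the same level as the opening tag.
HEADER_WITH_ITEMPROP = re.compile(r"<h([1-6]).*?\bitemprop\b.*?</h\1>")

samples = [
    '<h1 class="test" itemprop="name">Test</h1>',
    '<h2 itemprop="name" class="test">Test</h2>',
    '<h6 class="test"><strong itemprop="Price">9,99</strong>Test</h6>',
    '<h3 class="test">No marker here</h3>',  # hypothetical negative case
]

# Keep only the headers that contain the word "itemprop" somewhere inside.
matched = [s for s in samples if HEADER_WITH_ITEMPROP.search(s)]
print(matched)
```

Because .*? can run past one header into the next, this is only safe on input where each header is tested on its own, as in the list above.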

Related

RegEx: Get content from multiple concatenated HTML-Files

I have a bunch of html-files that I concat and want to get the actual contents only.
However, I'm having some trouble finding the correct regex for that. Basically, I'm trying to remove everything before, in between, and after certain boundaries. It's somewhat similar to "Regular expression to match a line that doesn't contain a word?", but it feels more complex, and I'm having no luck.
Source-Data:
Stuff I dont need before
<div id="start">
blablabla11
blablabla12
<div id="end">
Stuff I dont need in the middle1
<div id="start">
blablabla21
blablabla22
<div id="end">
Stuff I dont need in the middle2
<div id="start">
blablabla31
blablabla32
<div id="end">
Stuff I dont need in the end
Desired result:
<div id="start">
blablabla11
blablabla12
<div id="end">
<div id="start">
blablabla21
blablabla22
<div id="end">
<div id="start">
blablabla31
blablabla32
<div id="end">
Context:
I'm working in Sublime (Mac) -> Perl Regex
My current approach is based on inverse matching / regex lookarounds (I know there is a lot of discussion about the wording, methods, ugliness etc. around this topic, but I can't afford to care, as I need to get the job done):
Find: (?s)^((?!(<div id="start">)(?s)(.*?)(<div id="end">)).)*$
Replace: $3
And many more variants; I've been testing and playing around.
However, it yields:
blablabla11
blablabla12
<div id="start">
blablabla21
blablabla22
<div id="start">
blablabla31
blablabla32
<div id="start">
Nice, but not there yet. And whatever I'm trying I'm stumbling into other problems. Noob at work I guess.
Thanks a gazillion for your help guys!
Chris
EDIT:
Thank you for the first answers! However, I must admit that my minimal example is a bit misleading (because it is too easy). In reality I am facing hundreds of complex and diverse HTML files concatenated into one single large file.
The only common bits are that the content of every HTML file starts with a known string (here simplified as <div id="start">) and ends with a known string (here simplified as <div id="end">). The content itself obviously has loads of different tags etc., so just testing for opening and closing tags sadly won't cut it.
You may look for
(?s).*?(<div id="start">.*?<div id="end">)(?:(?:(?!<div id="start">).)*$)?
and replace with $1\n\n. See regex demo.
Details
(?s) - DOTALL modifier, . now matches any char
.*? - any 0+ chars, as few as possible
(<div id="start">.*?<div id="end">) - Group 1: <div id="start">, any 0+ chars as few as possible, and <div id="end">
(?:(?:(?!<div id="start">).)*$)? - an optional non-capturing group matching 1 or 0 occurrence of
(?:(?!<div id="start">).)* - any char, 0 or more occurrences, that does not start a <div id="start"> char sequence (aka tempered greedy token)
$ - end of string.
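The same find/replace can be tried outside Sublime. The sketch below assumes Python's re module, whose replacement syntax uses \1 where the answer writes $1; the variable names source, pattern, cleaned, and blocks are illustrative only.

```python
import re

source = """Stuff I dont need before
<div id="start">
blablabla11
blablabla12
<div id="end">
Stuff I dont need in the middle1
<div id="start">
blablabla21
blablabla22
<div id="end">
Stuff I dont need in the middle2
<div id="start">
blablabla31
blablabla32
<div id="end">
Stuff I dont need in the end"""

# Same pattern as in the answer: skip leading junk lazily, capture one
# start..end block, and optionally swallow trailing junk up to end of string.
pattern = re.compile(
    r'(?s).*?(<div id="start">.*?<div id="end">)'
    r'(?:(?:(?!<div id="start">).)*$)?'
)

# Each match is replaced by the captured block plus two newlines.
cleaned = pattern.sub(r"\1\n\n", source)
blocks = cleaned.strip().split("\n\n")
print(cleaned)
```

Only the last match reaches the end of the string, so the optional trailing group is what removes "Stuff I dont need in the end".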

How do I conditionally add a space in a regex replace

When I woke up this morning, I didn’t know a stroke of regex. By the time I went to Mass, I’d been able to cobble together this regex to find occurrences of ‘Mph’ in an html document.
(?i)(?<=[\s|\d])mph+
If I run it against the following test data:
<div class="vsMph">
<p>95 Mph</p>
</div>
<div class="vsMph">
<p>95Mph</p>
</div>
It correctly matches:
‘ Mph’ and
‘Mph’
And equally correctly leaves the ‘vsMph’ alone, which is exactly what I want. Eventually, I'm going to use the same technique to match knots, ft, in, km and so on.
I’m executing this expression in Sublime Text 3 using RegReplace. Ultimately, I hope to use it to find all occurrences of ‘Mph’ preceded by a space or a digit and:
1. Enclose ‘Mph’ in <abbr> tags.
2. Add a space between the digit and the opening <abbr> tag if there was no space between the last digit and 'Mph' originally.
In other words, I want to convert the above test data to:
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
I can get RegReplace to add the <abbr> tags as described in 1. above, but I’ve searched around on Google and I can’t find anything that tells me how to conditionally insert a space in a regex replace.
So I’m wondering: is it possible in the first place to conditionally add a space in a regex replacement, and if so, how do I do it? Or do I have to search for ‘\sMph’ and ‘\dMph’ and replace them separately?
Regards.
I would suggest using groups to match Mph. You could simply search for the following regex:
(\d)(\s)?(Mph)
Then replace using groups
$1 <abbr title="Miles per hour">$3</abbr>
output:
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
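For anyone who wants to check this outside RegReplace, here is a small sketch using Python's re module (an assumption; the question uses Sublime Text). Python writes group references as \1 and \3 instead of $1 and $3, and the names html, pattern, and result are illustrative.

```python
import re

html = """<div class="vsMph">
<p>95 Mph</p>
</div>
<div class="vsMph">
<p>95Mph</p>
</div>"""

# Group 1: the last digit, group 2: an optional space, group 3: the unit.
pattern = re.compile(r"(\d)(\s)?(Mph)")

# The replacement always writes exactly one space after the digit, so it
# does not matter whether group 2 matched or not.
result = pattern.sub(r'\1 <abbr title="Miles per hour">\3</abbr>', html)
print(result)
```

The trick is that the optional space is matched but never reused: the replacement drops group 2 and writes a literal space instead. The class name vsMph is left alone because the character before Mph is not a digit.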

Regex find and replace between <div class="customclass"> and </div> tag

I can't find a working regex anywhere to find and replace the text between the div tags.
So there is this HTML where I want to select everything between the <div class="info"> and </div> tags and replace it with some other text:
<div class="extraUserInfo">
<p>Hello World! This is a sample text</p>
<javascript>.......blah blah blah etc etc
</div>
and replace it with
My custom text with some codes
<tags> asdasd asdasdasdasdasd</tags>
so it would look like
<div class="extraUserInfo">
My custom text with some codes
<tags> asdasd asdasdasdasdasd</tags>
</div>
Here is a refiddle with all my code; as you can see, I want to replace the whole bunch of code between the <div> and </div> tags:
http://refiddle.com/1h6j
Hope you get what I mean :)
If there's no nesting, I would just do a plain non-greedy (lazy) match:
(?s)<div class="extraUserInfo">.*?</div>
.*? matches any amount of any character (as few as possible) until it reaches the first </div>
The s modifier makes the dot match newlines too.
Edit: here is a JavaScript version without the s modifier:
/<div class="extraUserInfo">[\s\S]*?<\/div>/g
And replace with new content:
<div class="extraUserInfo">My custom...</div>
See example at regex101; Regex FAQ
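The lazy match can also be sketched in Python, assuming the re module, where re.DOTALL plays the role of the s modifier. The names html, replacement, pattern, and result are illustrative, and the trailing <p>untouched</p> line is added only to show where the match stops.

```python
import re

html = """<div class="extraUserInfo">
<p>Hello World! This is a sample text</p>
<javascript>.......blah blah blah etc etc
</div>
<p>untouched</p>"""

replacement = """<div class="extraUserInfo">
My custom text with some codes
<tags> asdasd asdasdasdasdasd</tags>
</div>"""

# Lazy .*? stops at the first </div>; re.DOTALL lets "." cross newlines.
pattern = re.compile(r'<div class="extraUserInfo">.*?</div>', re.DOTALL)

# A function is used as the replacement so that any backslashes in the
# new content are inserted literally instead of being read as escapes.
result = pattern.sub(lambda m: replacement, html)
print(result)
```

As the answer notes, this only works when the div contains no nested </div>, because the lazy match ends at the first closing tag it finds.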

Regex find CamelCase strings and dashes, replace with lowercase_underscored

In PHPStorm, I need to find/replace some mixed case strings which are used for CSS class names and for the DOM id's. I can't change attributes like onClick and image names need to remain. Here is what I have:
<div class="ThumbContainer" id="Source-Data4-Thumb">
<div class="ThumbTitleArea">
<div class="DataTitleDiv"> GYR Performance <img src="images/someImage.png" onClick="someFunc()" /></div>
</div>
<div class="dataDetailArea">
<div class="DataThumbArea"> Data Source:Client<br>
Last refreshed:12/05/2013 <br>
Records:206<br>
<br>
Used for the following reports<br>
- GYR Performance<br>
</div>
</div>
</div>
Here is what I need:
<div class="thumb_container" id="source_data4_thumb">
<div class="thumb_title_area">
<div class="data_title_div"> GYR Performance <img src="images/someImage.png" onClick="someFunc()" /></div>
</div>
<div class="data_detail_area">
<div class="data_thumb_area"> Data Source:Client<br>
Last refreshed:12/05/2013 <br>
Records:206<br>
<br>
Used for the following reports<br>
- GYR Performance<br>
</div>
</div>
</div>
Notice the dataDetailArea starts with a lowercase.. bleh. This will be a one-time find/replace so it doesn't need to be in PHPStorm. It can be in any online tool even, like http://gskinner.com/RegExr/
The actual backbone template I need to find/replace on is about 3100 lines of code, otherwise I'd provide it all here for you.
Here's what I have so far. It doesn't seem to match names like Camel-Case3-Foo:
(class|id|data-[?!=])="\b([A-Za-z][a-z-]*){2,}\b"
This regex should find the locations where underscores should be placed:
((?<=\w)(?=[A-Z])|-)
It would seem to make sense to do a replacement with this to insert the underscores, then convert the string to lower case.
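A rough Python sketch of that two-step idea, restricted to class and id attribute values so that onClick and image names stay untouched. Python is an assumption here (the question mentions PHPStorm), and BOUNDARY, ATTR, to_snake, and convert are hypothetical names.

```python
import re

# Zero-width position before an uppercase letter that follows a word
# character, or a literal dash: the places where an underscore belongs.
BOUNDARY = re.compile(r"(?<=\w)(?=[A-Z])|-")

# Only class="..." and id="..." values are rewritten.
ATTR = re.compile(r'\b(class|id)="([^"]*)"')


def to_snake(value):
    """Insert underscores at the boundaries, then lowercase everything."""
    return BOUNDARY.sub("_", value).lower()


def convert(html):
    """Rewrite every class/id value in the given markup."""
    return ATTR.sub(lambda m: f'{m.group(1)}="{to_snake(m.group(2))}"', html)


line = '<div class="ThumbContainer" id="Source-Data4-Thumb">'
print(convert(line))
```

Doing the lowercasing in a callback avoids relying on case-conversion tokens in the replacement string, which not every editor supports.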
I would search for something like this:
"[a-z0-9_]*\([A-Z]\)
A quote mark, followed by any run of lowercase, numeric, or underscore characters.
Followed by an uppercase letter.
The uppercase letter is captured as sub-expression 1.
Replace sub-expression 1 with an underscore plus the result of a tolower() function.
You will need to apply this multiple times to each line, since it only finds one uppercase letter per pass.

Reg exp: string NOT in pattern

I have problems constructing a reg exp. I think I should use lookahead/lookbehind, but I just can't get it to work.
I want to make a reg-exp that catches all HTML tags that do NOT contain a string ('rabbit').
For example, the following tags should be matched
<a XXX> <span yyy> </div x zz> </li qwerty=ab cd> <div hello=stackoverflow>
But not the following
<a XXrabbitX> <span yyyrabbit> </div xrabbitzz> </li rabbit=abcd hippo=9876> <div hello=rabbit>
(My next step is to make a substitution so that the word rabbit enters the tags, but that will hopefully come easy.)
(I use PHP5-preg_replace.)
Thanks.
I guess you're matching the HTML tags with a regex something like this:
/<[^>]*>/
You can add a negative look-ahead assertion in there to assert that "rabbit" cannot be found in the tag:
/<(?![^>]*rabbit)[^>]*>/
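A quick check of the lookahead against the tags from the question. This sketch uses Python's re module rather than preg_replace (an assumption made for illustration); the pattern itself is unchanged apart from dropping the / delimiters, and the names TAG_WITHOUT_RABBIT, should_match, and should_not_match are made up.

```python
import re

# Right after "<", the negative lookahead scans up to (but not past) the
# closing ">" and rejects the tag if "rabbit" appears anywhere inside it.
TAG_WITHOUT_RABBIT = re.compile(r"<(?![^>]*rabbit)[^>]*>")

should_match = "<a XXX> <span yyy> </div x zz> </li qwerty=ab cd> <div hello=stackoverflow>"
should_not_match = (
    "<a XXrabbitX> <span yyyrabbit> </div xrabbitzz> "
    "</li rabbit=abcd hippo=9876> <div hello=rabbit>"
)

print(TAG_WITHOUT_RABBIT.findall(should_match))
print(TAG_WITHOUT_RABBIT.findall(should_not_match))
```

Because [^>]* cannot cross a ">", the lookahead only ever inspects the current tag, so a "rabbit" in a later tag does not block an earlier one.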