Regex to exclude multiple strings - regex

Could use some help with Regex searching with NetBeans 7.01's find function.
I'm trying to exclude multiple strings. Specifically, the target lines:
<div class="table_left">
<div class="table_right">
<div class="table_clear">
I need to match only the third and other Div classes that are not either table_left or table_right.
I've tried:
class="table_(((?!left).*)|((?!right).*))
and
class="table_(left|right){0}
I realized while pasting my first Regex line that I'm matching not right OR not left, which is returning both. What is the proper way to specify two conditions? The and operator?
The joys of searching for words that are also Boolean operators...

Try this pattern:
<div\s+class="(?!table_(left|right))[^"]+"
which wouldn't match:
<div class="table_left">
<div class="table_right">
but would match:
<div class="table_clear">
<div class="foo">
EDIT
The HT wrote:
I need to match only classes that begin with table, but are not right or left
Ah, okay, that would look like:
<div\s+class="table_(?!left|right)[^"]+"
or
<div\s+class="table(?!_left|_right)[^"]+"
as you already found yourself (but I included it in my answer for completeness' sake).
A quick explanation of the pattern <div\s+class="table_(?!left|right)[^"]+":
<div # match '<div'
\s+ # match one or more whitespace chars
class="table_(?!left|right) # match 'class="table_' only if it is not followed by 'left' or 'right'
[^"]+ # match one or more characters other than '"'
" # match a '"'

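For anyone who wants to check the patterns above outside the IDE, here is a minimal sketch in Python. Python's re module is an assumption on my part (the question is about NetBeans' find dialog), but it treats these negative lookaheads the same way. The variable names are made up for illustration. It also shows why the first attempt from the question matches every line: whenever one alternative is blocked by its lookahead, the other alternative still succeeds.

```python
import re

lines = [
    '<div class="table_left">',
    '<div class="table_right">',
    '<div class="table_clear">',
    '<div class="foo">',
]

# First attempt from the question: "not left" OR "not right" is always true.
original_attempt = re.compile(r'class="table_(((?!left).*)|((?!right).*))')
# Any class that is not table_left or table_right.
any_other_class = re.compile(r'<div\s+class="(?!table_(left|right))[^"]+"')
# Only classes starting with table_ that do not continue with left or right.
table_other = re.compile(r'<div\s+class="table_(?!left|right)[^"]+"')

attempt_hits = [l for l in lines if original_attempt.search(l)]
any_hits = [l for l in lines if any_other_class.search(l)]
table_hits = [l for l in lines if table_other.search(l)]

print(attempt_hits)  # all three table_ lines
print(any_hits)      # table_clear and foo
print(table_hits)    # table_clear only
```

Note that (?!left|right) also rejects longer names that merely start with those words, such as table_leftover.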
Related

RegEx: Get content from multiple concatenated HTML-Files

I have a bunch of html-files that I concat and want to get the actual contents only.
However, I'm having some trouble finding the correct regex for that. Basically I'm trying to remove everything before, in between and after certain boundaries. It's somewhat similar to "Regular expression to match a line that doesn't contain a word?", but I feel it is more complex, and I'm having no luck.
Source-Data:
Stuff I dont need before
<div id="start">
blablabla11
blablabla12
<div id="end">
Stuff I dont need in the middle1
<div id="start">
blablabla21
blablabla22
<div id="end">
Stuff I dont need in the middle2
<div id="start">
blablabla31
blablabla32
<div id="end">
Stuff I dont need in the end
Desired result:
<div id="start">
blablabla11
blablabla12
<div id="end">
<div id="start">
blablabla21
blablabla22
<div id="end">
<div id="start">
blablabla31
blablabla32
<div id="end">
Context:
I'm working in Sublime (Mac) -> Perl Regex
My current approach is based on inverse matching / regex lookarounds (I know there is lots of discussion about wording, methods, ugliness etc. around this topic, but I can't afford to care as I need to get the job done):
Find: (?s)^((?!(<div id="start">)(?s)(.*?)(<div id="end">)).)*$
Replace: $3
And many more variants that I've been testing and playing around with.
However, it yields:
blablabla11
blablabla12
<div id="start">
blablabla21
blablabla22
<div id="start">
blablabla31
blablabla32
<div id="start">
Nice, but not there yet. And whatever I'm trying I'm stumbling into other problems. Noob at work I guess.
Thanks a gazillion for your help guys!
Chris
EDIT:
Thank you for the first answers! However, I must admit that my minimal example is a bit misleading (because it is too easy). In reality I am facing hundreds of complex and diverse HTML files concatenated into one single large file.
The only common bits are that the content of every HTML file starts with a known string (here simplified as <div id="start">) and ends with a known string (here simplified as <div id="end">). The content itself obviously has loads of different tags etc., so just testing for opening and closing tags sadly won't cut it.
You may look for
(?s).*?(<div id="start">.*?<div id="end">)(?:(?:(?!<div id="start">).)*$)?
and replace with $1\n\n. See regex demo.
Details
(?s) - DOTALL modifier, . now matches any char
.*? - any 0+ chars, as few as possible
(<div id="start">.*?<div id="end">) - Group 1: <div id="start">, any 0+ chars as few as possible, and <div id="end">
(?:(?:(?!<div id="start">).)*$)? - an optional non-capturing group matching 1 or 0 occurrence of
(?:(?!<div id="start">).)* - any char, 0 or more occurrences, that does not start a <div id="start"> char sequence (aka tempered greedy token)
$ - end of string.
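To try the same find/replace outside Sublime, here is a rough sketch using Python's re module (an assumption on my part; the question uses Sublime's Perl-style engine, and Python writes the replacement as \1 rather than $1). The keep_blocks helper and the sample text are made up for illustration.

```python
import re

# (?s) is expressed through the re.DOTALL flag here.
BLOCK = re.compile(
    r'.*?(<div id="start">.*?<div id="end">)'
    r'(?:(?:(?!<div id="start">).)*$)?',
    re.DOTALL,
)

def keep_blocks(text):
    # Keep only Group 1 of every match, followed by a blank line.
    return BLOCK.sub(r"\1\n\n", text)

sample = (
    'junk before\n'
    '<div id="start">\na1\na2\n<div id="end">\n'
    'junk in the middle\n'
    '<div id="start">\nb1\n<div id="end">\n'
    'junk at the end\n'
)

print(keep_blocks(sample))
```

The optional tail group only succeeds after the last block, which is what removes the trailing junk.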

Regex to find a class in a list of files

I have a list of files (HTML) that contains classes.
1 <div class="homepage">...</div>
2 <div class="abc homepage">...</div>
3 <div class="abc homepage2">...</div>
4 <div class="abc homepage four five">...</div>
5 <div class="abc homepagenot five">...</div>
6 <div class="abc homepage-not five">...</div>
I'm trying to find the matching lines in Visual Studio using regular expressions.
I've been trying to use
class=".*homepage.*"
as the search criteria, but that also returns items 5 and 6.
Essentially I just want items 1, 2 and 4.
What am I missing in regex?
You could check for the existence of a word boundary with \b which looks for a non-word character (and can be zero length). I also include a negative lookahead for the hyphen, because a hyphen is a non-word character that will match on \b but you don't want item 6 in your list.
\bhomepage\b(?!-)
Here's the Regex101 page.
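A quick way to check this against the six sample lines, sketched with Python's re module (an assumption; the question is about Visual Studio's search, but \b and negative lookahead behave the same here). The variable names are illustrative only.

```python
import re

whole_word = re.compile(r"\bhomepage\b(?!-)")

lines = [
    '<div class="homepage">...</div>',
    '<div class="abc homepage">...</div>',
    '<div class="abc homepage2">...</div>',
    '<div class="abc homepage four five">...</div>',
    '<div class="abc homepagenot five">...</div>',
    '<div class="abc homepage-not five">...</div>',
]

# Collect the 1-based item numbers of the lines that match.
matching_items = [n for n, line in enumerate(lines, start=1) if whole_word.search(line)]
print(matching_items)  # [1, 2, 4]
```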
Try class=".*homepage[ "]. This looks for either a space or quote right after homepage.
I assume that you want to find single word homepage within class="...".
Please see this regex.
class=\"(.+ )*homepage( .+)*\"
Assuming you want only first 4 divs as matches, this expression will work for you.
homepage(?!not|-)(.*?)
var divs ="<div class='homepage'></div>"
+"<div class='abc homepage'></div>"
+"<div class='abc homepage2'></div>"
+"<div class='abc homepage four five'></div>"
+"<div class='abc homepagenot five'></div>"
+"<div class='abc homepage-not five'></div>"
+"<div class='abc homepage2'></div>";
var matches = divs.match(/(homepage)(?!not|-)(.+?)/g);
console.log(matches, matches.length);

Regex - match every possible char and space

I want to extract data from HTML. The problem is that I can't extract the two strings at the top and the bottom of my pattern.
I want to extract 23423423423 and 1234523453245, but only if the string Allan appears between them:
<h4>###### </h4> said12:49:32
</div>
<a href="javascript:void(0)" onclick="replyAnswer(##########,'GET','');" class="reportLink">
report </a>
</div>
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>
Of course, I can do something like this: Profile\/(\d+).*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*Allan.*\s*.*\s*.*AddAnswer(\d+). But the code is horrible. Is there any way to make it shorter?
I was thinking about:
Profile\/(\d+)(.\sAllan)*AddAnswer(\d+)
or
Profile\/(\d+)(.*Allan\s*)*AddAnswer(\d+)
but neither of them works properly. Do you have any ideas?
You can construct a character class that matches any character, including newlines, by using [\S\s]. Whitespace and non-whitespace characters together cover every character.
Then, your attempts were reasonably close
/Profile\/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)/
This looks for the profile, the number that comes after it, any characters before Allan, any characters before AddAnswer, and the number that comes after it. If you have single-line mode available (/s) then you can use dots instead.
/Profile\/(\d+).*Allan.*AddAnswer(\d+)/s
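Here is a small sketch of both variants using Python's re module, where the /s flag is spelled re.DOTALL. The sample post is hypothetical: the Profile link does not appear in the snippet from the question, so I made one up.

```python
import re

# Hypothetical post; the Profile link is assumed, not taken from the question.
post = (
    '<a href="/Profile/23423423423">user</a>\n'
    '<p class="content">\n'
    'Hi there, Allan.\n'
    '</p>\n'
    '<div id="AddAnswer1234523453245"></div>\n'
)

char_class = re.compile(r"Profile/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)")
dotall = re.compile(r"Profile/(\d+).*Allan.*AddAnswer(\d+)", re.DOTALL)
plain_dot = re.compile(r"Profile/(\d+).*Allan.*AddAnswer(\d+)")

print(char_class.search(post).groups())
print(dotall.search(post).groups())
print(plain_dot.search(post))  # None: without DOTALL the dot stops at newlines
```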
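In Ruby, you can use the m flag to make . match newlines (in PCRE, JavaScript and Python that job belongs to the s / DOTALL flag, while m only changes ^ and $). Note that this pattern does not check for Allan: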
/Profile\/(\d+).+AddAnswer(\d+)/m
Better use a parser instead. If you must use regular expressions for whatever reason, you might get along with a tempered greedy solution:
Profile/(\d+) # Profile followed by digits
(?:(?!Allan)[\S\s])+ # any character except when there's Allan ahead
Allan # Allan literally
(?:(?!AddAnswer)[\S\s])+ # same construct as above
AddAnswer(\d+) # AddAnswer, followed by digits
See a demo on regex101.com
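The commented pattern above can be pasted almost verbatim into Python with re.VERBOSE, which ignores whitespace and # comments inside the pattern. This is only a sketch: both sample posts are hypothetical (the Profile link is not shown in the question), and the names are made up.

```python
import re

allan_post = re.compile(r"""
    Profile/(\d+)               # Profile followed by digits
    (?:(?!Allan)[\S\s])+        # any character, unless Allan starts here
    Allan                       # Allan literally
    (?:(?!AddAnswer)[\S\s])+    # any character, unless AddAnswer starts here
    AddAnswer(\d+)              # AddAnswer followed by digits
""", re.VERBOSE)

# Hypothetical posts; the Profile links are assumed for illustration.
with_allan = (
    '<a href="/Profile/23423423423">user</a>\n'
    '<p class="content">\nHi there, Allan.\n</p>\n'
    '<div id="AddAnswer1234523453245"></div>\n'
)
without_allan = (
    '<a href="/Profile/111">user</a>\n'
    '<p class="content">\nHello, Bob.\n</p>\n'
    '<div id="AddAnswer999"></div>\n'
)

hit = allan_post.search(with_allan)
miss = allan_post.search(without_allan)
print(hit.groups())
print(miss)
```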

Search and Replace with Regular Expression

I have the following HTML snippet and there's a bunch more divs on the page.
I'd like to surround all labels (Name, Current Position and Birth Place in this case) with strong tags. I can't use css in this case.
So I was thinking would a regular expression work in this case? More specifically, I'd like to use Visual Studio Search and Replace with Regular Expressions option to do this. So find all data to left of colon and replace value with <strong>value found</strong>
<div class="col-6">
Name:<br/>blah
</div>
<div class="col-6">
Current Position:<br/>blah
</div>
<div class="col-6">
Birth Place:<br/>blah
</div>
In the search tool, just find this:
([a-z ]+):
and replace with this:
<strong>$1</strong>:
Note: the VS search & replace is not case-sensitive by default
Search for the beginning of a line (^) followed by optional whitespace (\s*), then one or more characters that are neither a colon nor a line break ([^:\n]+), followed by a colon, and wrap the second capture group in <strong> tags.
Search:
^(\s*)([^:\n]+:)
Replace:
\1<strong>\2</strong>
See this fiddle for more details: http://regex101.com/r/xB8tD5/2
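To preview what this search/replace does before running it over a whole solution, here is a minimal sketch with Python's re module (an assumption; the question is about Visual Studio). re.MULTILINE is needed so that ^ matches at every line start, and the indented sample is made up for illustration.

```python
import re

label = re.compile(r"^(\s*)([^:\n]+:)", re.MULTILINE)

html = (
    '<div class="col-6">\n'
    '    Name:<br/>blah\n'
    '</div>\n'
    '<div class="col-6">\n'
    '    Current Position:<br/>blah\n'
    '</div>\n'
)

# Group 1 keeps the indentation, group 2 is the label including its colon.
result = label.sub(r"\1<strong>\2</strong>", html)
print(result)
```

Any other line with a colon before its first line break (an inline style or a URL, for example) would be wrapped too, so it is worth reviewing the matches first.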

How to change all title attribute's value in Title Case in sublime text

I have 500 HTML files in my project where the casing and quotes (" or ') of the title attribute vary across pages. See a few examples below:
<button title="Next" id="next"> Next</button>
<button title="next"> Next </buton>
<button title=""please go back">Check</button>
I want to change all title attributes in Title Case
<button title="Next" id="next"> Next</button>
<button title="Next"> Next </buton>
<button title="Please Go Back">Check</button>#
I have tried Find and Replace with the Regular Expression and Case Sensitive buttons enabled:
Find What: title=(".*")\s
Replace With: title="\u$"
But it didn't work. Please tell me what I am doing wrong.
UPDATED: I also want to remove the extra ' or " (see the line marked with #).
To further my comment, first it's the issue of .* being 'greedy' instead of 'lazy', meaning it is matching as much as possible (i.e. Next"> Next</button><button title="Next in your example).
The quick fix is using a 'lazy' .* instead, aka .*? (I added a ? to indicate possible presence of space because there's none in your examples):
title=(".*?")\s?
To improve performance, you would use a negated class:
title=("[^"]+")\s?
Where [^"]+ matches any character except ".
And to cope with the different quotes, you can use:
title=("[^"]+"|'[^']+')\s?
Which basically means either "[^"]+" or '[^']+' for the part within the parentheses.
For the replace and consecutive quotes issue:
title=(?:"+([^"]+)"+|'+([^']+)'+)
Replace with:
title="\u$1$2"
The only thing is that the last line will be <button title="Please go back">Check</button>, if that's not an issue...
EDIT: \G actually works. Use a second replace:
(?:(?<=title=")|(?<!^)\G)[^\s"]+\s?
Replace with:
\u$0
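If the files can also be processed by a script, the two-step replace can be collapsed into one pass. The following is only a sketch in Python (an assumption; the question is about Sublime Text). Python's re has neither the \u case operator nor \G, so a replacement function does the capitalising. The helper names are made up, and the pattern is the one from above without a trailing \s?, so the space before the next attribute is kept.

```python
import re

# One or more opening quotes, the value, one or more closing quotes.
TITLE_ATTR = re.compile(r"""title=(?:"+([^"]+)"+|'+([^']+)'+)""")

def capitalize_words(text):
    # Uppercase only the first letter of each space-separated word.
    return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))

def fix_title(match):
    # Exactly one of the two groups is set, depending on the quote style.
    value = match.group(1) or match.group(2)
    return 'title="' + capitalize_words(value) + '"'

def normalize_titles(html):
    return TITLE_ATTR.sub(fix_title, html)

print(normalize_titles('<button title=""please go back">Check</button>'))
```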
(?<=title=('|")).+?(?=('|"))
This should give you the matches Next, next and please go back, which you can then work with.
You can use the index of each match to locate it in the original string if you want to uppercase the lowercase letters.
Or use title=('|").+?(\1) to find any title attribute in your text, including the quotation marks.