Complicated regex to match anything NOT within quotes - regex

I have this regex which scans a text for the word very: (?i)(?:^|\W)(very)[\W$] which works. My goal is to upgrade it and avoid doing a match if very is within quotes, standalone or as part of a longer block.
Now, I have this other regex which is matching anything NOT inside curly quotes: (?<![\S"])([^"]+)(?![\S"]) which also works.
My problem is that I cannot seem to combine them. For example the string:
Fred Smith very loudly said yesterday at a press conference that fresh peas will "very, very definitely not" be served at the upcoming county fair.
In this bit we have 3 instances of very, but I'm only interested in matching the first one and ignoring the whole Smith quotation.

What you describe is kind of tricky to handle with a regular expression. It's difficult to determine whether you are inside a quote. Your second regex is not effective as it only ignores the first very that is directly to the right of the quote and still matches the second one.
Drawing inspiration from this answer, that in turn references another answer that describes how to regex match a pattern unless ... I can capture the matches you want.
The basic idea is to use alternation | and match all the things you don't want and then finally match (and capture) what you do want in the final clause. Something like this:
"[^"]*"|(very)
We match quoted strings in the first clause but we don't capture them in a group and then we match (and capture) the word very in the second clause. You can find this match in the captured group. How you reference a captured group depends on your regex environment.
See this regex101 fiddle for a test case.
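As a minimal sketch of how the captured group can be read in one environment, here is the same alternation in Python's re module. The \b anchors, the IGNORECASE flag, and the helper name unquoted_matches are assumptions added for this illustration; they are not part of the pattern above.

```python
import re

# Left branch consumes whole quoted spans; right branch captures the word.
# The \b anchors and IGNORECASE are additions made for this sketch.
PATTERN = re.compile(r'"[^"]*"|\b(very)\b', re.IGNORECASE)

def unquoted_matches(text):
    """Return (start, word) for every capture that lies outside double quotes."""
    return [
        (m.start(1), m.group(1))
        for m in PATTERN.finditer(text)
        if m.group(1) is not None  # None means the quoted branch matched
    ]

sample = ('Fred Smith very loudly said yesterday at a press conference that '
          'fresh peas will "very, very definitely not" be served at the '
          'upcoming county fair.')
hits = unquoted_matches(sample)
print(hits)
```

Only matches where group 1 is set are kept; the quoted span is still matched, which is what stops the engine from looking inside it.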

This regex
(?i)(?<!(((?<DELIMITER>[ \t\r\n\v\f]+)(")(?<FILLER>((?!").)*))))\bvery\b(?!(((?<FILLER2>((?!").)*)(")(?<DELIMITER2>[ \t\r\n\v\f]+))))
could work under two conditions:
your regex engine allows unlimited lookbehind
quotes are delimited by spaces
Try it on http://regexstorm.net/tester

Related

Regular expression to exclude tag groups or match only (.*) in between tags

I am struggling with this regex for a while now.
I need to match the text which is in between the <ns3:OutputData> data</ns3:OutputData>.
Note: after ns there could be 1 or 2 digits
Note: the data is in one line just as in the example
Note: the ... preceding and ending is just to mention there are more tags nested
My regex so far: (ns\d\d?:OutputData>)\b(.*)(\/\1)
Sample text:
...<ns3:OutputData>foo bar</ns3:OutputData>...
I have tried (?:(ns\d\d?:OutputData>)\b)(.*)(?:(\/\1)) in an attempt to exclude group 1 and 3.
I want to exclude the tags which are matched, as in the images:
start
end
Any help is much appreciated.
EDIT
There might be some regex interpretation issue with Grep Console for IntelliJ, in which I intend to use the regex.
Here is the latest image with the best match so far...
Your regex is almost there. All you need to do is to make the inside-matcher non-greedy. I.e. instead of (.*) you can write (.*?).
Another, xml-specific alternative is the negated character-class: ([^<]*).
So, this is the regex: (ns\d\d?:OutputData>)\b(.*?)(\/\1) You can experiment with it here.
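To illustrate the greedy versus non-greedy difference, here is a small Python sketch. The pattern is a simplified variant written for this example (it includes the angle brackets and drops the \b), and the two-element sample line is an assumption, not data from the question.

```python
import re

# Hypothetical sample: two elements on one line make the difference visible.
line = ("<ns3:OutputData>foo</ns3:OutputData>"
        "<ns3:OutputData>bar</ns3:OutputData>")

# Group 1 is the tag name, reused by the \1 backreference in the closing tag.
greedy = re.compile(r"<(ns\d\d?:OutputData)>(.*)</\1>")
lazy = re.compile(r"<(ns\d\d?:OutputData)>(.*?)</\1>")

# Group 2 holds the text between the tags.
greedy_content = [m.group(2) for m in greedy.finditer(line)]
lazy_content = [m.group(2) for m in lazy.finditer(line)]

print(greedy_content)  # one match running up to the last closing tag
print(lazy_content)    # one match per element
```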
Update
To make sure that the only group is the one that matches the text, then you have to make it work without backreferences: (?:ns\d\d?:OutputData>)\b(.*?)<
Update 2
It's possible to match only the required parts using lookbehind. Check the regex here:
(?<=ns\d:OutputData>)\b([^<]*)|(?<=ns\d\d:OutputData>)\b([^<]*)
Explanation:
The two alternatives are almost identical. The only difference is the number of digits. This is important because some flavors support only fixed-length lookbehinds.
Checking alternative one, we put the starting tag into one lookbehind (?<=...) so it won't be included into the full match.
Then we match every character that is not < greedily: [^<]*. This stops matching at the first <, which is the start of the closing tag.
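Python's re module is one of the flavors that accepts only fixed-width lookbehinds, so it can serve as a quick check of the two-alternative approach. This is a sketch under that assumption; the helper name first_content and the second sample line are made up for illustration.

```python
import re

# Each alternative has a lookbehind of one fixed width (1 digit or 2 digits).
content = re.compile(
    r"(?<=ns\d:OutputData>)\b[^<]*"
    r"|(?<=ns\d\d:OutputData>)\b[^<]*"
)

def first_content(line):
    """Return the text after the first opening tag, or None if there is none."""
    m = content.search(line)
    return m.group(0) if m else None

# A single lookbehind with an optional digit is variable-width; Python's re
# refuses to compile it.
try:
    re.compile(r"(?<=ns\d\d?:OutputData>)[^<]*")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(first_content("...<ns3:OutputData>foo bar</ns3:OutputData>..."))
print(variable_width_ok)
```

Since the lookbehind only checks the text before the match, it would also succeed right after a closing tag; search() is used here so that only the first hit is taken.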
Essentially, you need a look behind and a look ahead with a back reference to match just the content, but variable length look behinds are not allowed. Fortunately, you have only 2 variations, so an alternation deals with that:
(?<=<(ns\d:OutputData>)).*?(?=<\/\1)|(?<=<(ns\d\d:OutputData>)).*?(?=<\/\2)
The entire match is the target content between the tags, which may contain anything (including left angle brackets etc).
Note also the reluctant quantifier .*?, so the match stops at the next matching end tag, rather than greedy .* that would match all the way to the last matching end tag.
See live demo.
This was the answer in my case:
(?<=(ns\d:OutputData)>)(.*?)(?=<\/\1)
The answer is based on the 3 solutions given by #WiktorStribiżew (in comments).
The last one worked and I have made a slight modification of it.
Thanks all for the effort and especially #WiktorStribiżew!
EDIT
Ok, yes #Bohemian it does not match 2-digits, I forgot to update:
(?<=(ns\d{0,2}:OutputData)>)(.*?)(?=<\/\1)

Is it possible to say in Regex "if the next word does not match this expression"?

I'm trying to detect occurrences of words italicized with *asterisks* around it. However I want to ensure it's not within a link. So it should find "text" in here is some *text* but not within http://google.com/hereissome*text*intheurl.
My first instinct was to use look aheads, but it doesn't seem to work if I use a URL regex such as John Gruber's:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
And put it in a look ahead at the beginning of the pattern, followed by the rest of the pattern.
(?=URLPATTERN)\*[a-zA-Z\s]\*
So how would I do this?
You can use the alternation technique: on the left-hand side, match everything that you want to discard; on the right-hand side, use a captured group to match the desired text.
https?:\/\/\S*|(\*\S+\*)
You can then use captured group #1 for your emphasized text.
RegEx Demo
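A minimal Python sketch of reading captured group #1 with this pattern; the sample sentence is assembled from the two fragments in the question and the variable names are illustrative.

```python
import re

# Left branch swallows URLs; right branch captures *emphasized* text.
pattern = re.compile(r"https?://\S*|(\*\S+\*)")

text = ("here is some *text* but not "
        "http://google.com/hereissome*text*intheurl")

# Matches from the URL branch leave group 1 unset (None), so skip them.
emphasized = [m.group(1) for m in pattern.finditer(text) if m.group(1)]
print(emphasized)
```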
The following regexp:
^(?!http://google.com/hereissome.*text.*intheurl).*
Matches any line that does not start with http://google.com/hereissome*text*intheurl. This is called negative lookahead. Some regexp libraries may not support it; Python's does.
Here is a link to Mastering Lookahead and Lookbehind.
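For what it is worth, the negative lookahead above can be tried in Python like this. It is only a sketch, and the variable names are assumptions for the example.

```python
import re

# The lookahead is checked at the start of the string; if the URL follows,
# the whole match is rejected.
not_that_url = re.compile(
    r"^(?!http://google.com/hereissome.*text.*intheurl).*"
)

plain = "here is some *text*"
url = "http://google.com/hereissome*text*intheurl"

plain_matches = not_that_url.match(plain) is not None
url_matches = not_that_url.match(url) is not None
print(plain_matches, url_matches)
```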

Regex for capitalizing first letter in a tag, alt=", etc

I've found regular expressions that capitalize the first letter in a sentence. But does anyone know a regex that capitalizes the first letter inside a tag, including URL and image attributes (e.g. title="antelope" or alt="antelope").
I used another regex to change all my image paths to lower case, and it zapped a bunch of my tags as well (alt, title, h2, etc.). So now I'd like to get a head start fixing them by capitalizing the first letters.
I'm working on a Mac, using Dreamweaver and TextWrangler as my text editors.
Before...
alt="antelope" title="antelope" <h2>antelope
After...
alt="Antelope" title="Antelope" <h2>Antelope
Regex
(="\w|>\w)
Replace Regex
\U$1\E
Description: This will work for your example, depending on the regex engine you are using.
Debuggex Demo
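Not every engine understands \U and \E in the replacement string. Python's re.sub, for example, does not, but a replacement function gives the same result. A sketch, with capitalize_values as a made-up helper name:

```python
import re

def capitalize_values(html):
    """Uppercase the first word character after =" or > (same find pattern)."""
    # The punctuation in the match is unaffected by upper().
    return re.sub(r'(="\w|>\w)', lambda m: m.group(1).upper(), html)

before = 'alt="antelope" title="antelope" <h2>antelope'
after = capitalize_values(before)
print(after)
```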
This replaces the value in parameters in a url. NOT in html, as I now see that is what you mean. Oh well.
Find what: (\?|\&)([a-z_]+=)([a-z])([^&]+)
Replace (all) with: $1$2\u$3$4
Free spaced:
(\?|\&)
Capture group 1: Either the literal question mark or ampersand.
([a-z_]+=)
Capture group 2: One or more of any lowercase letter or underscore, followed by the equals sign.
([a-z])
Capture group 3: The first letter in the value of the url parameter. Note this does not even notice parameters whose values don't start with a letter.
([^&]+)
Capture group 4: Every other character in the value. Or more specifically, one or more of any character as long as it's not an ampersand. This is a negated character class.
The \u in the replace-with is an option in TextWrangler (and in TextPad, which is what I use...so TextWrangler might also use the Boost regex engine) replacement that uppercases the immediately-following character. I'm not sure if this would work if capture groups 3 and 4 were merged.
Try it (although it doesn't have the \u option.)
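Where \u is not available in the replacement, the four capture groups can be reassembled by a replacement function instead. Here is a sketch in Python; the sample URLs and the name capitalize_params are hypothetical.

```python
import re

# Same four groups as above: separator, "name=", first letter, rest of value.
PARAM = re.compile(r"(\?|&)([a-z_]+=)([a-z])([^&]+)")

def capitalize_params(url):
    """Uppercase the first letter of each query-parameter value."""
    return PARAM.sub(
        lambda m: m.group(1) + m.group(2) + m.group(3).upper() + m.group(4),
        url,
    )

result = capitalize_params("page.php?name=antelope&kind=bird")
print(result)
```

As noted for capture group 3, values that do not start with a lowercase letter are left alone.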
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. There's a lot of helpful information in it, including a list of online regex testers (in the bottom section), so you can try things out yourself. All the links in this answer come from the FAQ.

Notepad++ regex group capture

I have such txt file:
ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua
Trying to delete all subdomains with such regex:
Find: .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1
Receive:
prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua
Why last line becomes com.ua instead of jwbefw.com.ua ?
This works without look around:
Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2
It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
Incorrect
Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:
Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2
It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.
The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
This answer still uses the specific domain names that the original question was looking at. As some TLD (top level domains) have a period in them, and you could theoretically have a list including multiple subdomains, whitelisting the TLD in the regex is a good idea if it works with your data set. Both current answers (from 2013) will not handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.
Here is a quick explanation of why psnig's original regex isn't working as intended:
The + is greedy.
.+ will zip all the way to the right at the end of the line capturing everything,
then work its way backwards (to the left) looking for a match from here:
(ru|ua|com\.ua|com|net|info)
With srfsf.jwbefw.com.ua the regex engine will first fail to match a,
then it will move the token one place to the left to look at "ua"
At that point, ua from the regex (the second option) is a match.
The engine will not keep looking to find "com.ua" because ".ua" met that requirement.
Niet the Dark Absol's answer tells the regex to be "lazy"
.+? will match any character (at least one) and then try to find the next part of the regex. If that fails, it will advance the token, with .+? matching one more character, and then evaluate the rest of the regex again.
The .+? will eventually consume: srfsf.jwbefw before matching the period, and then matching com.ua.
But the implementation of ? also creates issues.
Adding the question mark makes that first .+ lazy, but then causes group 1 to match bb.prontube.ru instead of prontube.ru.
This is because the period after the xx will match, then inside group 1 (.*?) will match bb.prontube before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
To avoid this, change the second group from (.*?) to ([\w-]*?) so it cannot match a period, only letters, digits, underscores, or a dash.
resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$
Note that you don't need to capture any groups other than the first. Adding ?: makes the TLD options non-capturing.
last change:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
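The three stages walked through above can be compared side by side. The sketch below uses Python's re rather than Notepad++, so it is only an approximation of what the editor does; the host list is the one from the question with a plain ASCII xxx, and the helper name strip_subdomains is an assumption.

```python
import re

hosts = "\n".join([
    "xxx.prontube.ru", "salo.ru", "bbb.antichat.ru", "yyy.ru",
    "xx.bb.prontube.ru", "zzz.com", "srfsf.jwbefw.com.ua",
])

# MULTILINE lets $ match at the end of every line, as in a line-based editor.
original = re.compile(r".+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$", re.M)
lazy_only = re.compile(r".+?\.((.*?)\.(ru|ua|com\.ua|com|net|info))$", re.M)
final = re.compile(r".+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$", re.M)

def strip_subdomains(pattern, text):
    """Replace every matching line with capture group 1."""
    return pattern.sub(r"\1", text).split("\n")

print(strip_subdomains(original, hosts))   # last line becomes com.ua
print(strip_subdomains(lazy_only, hosts))  # fifth line keeps bb.prontube.ru
print(strip_subdomains(final, hosts))      # both cases come out as intended
```

Lines with only one period (salo.ru, yyy.ru, zzz.com) never match and are left untouched in all three runs.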
Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1
Look at the picture above to see what each part of the regex captures.
It is color-coded, so you should not need a further explanation of the regex.

regex to find instance of a word or phrase -- except if that word or phrase is in braces

First, a disclaimer. I know a little about regex's but I'm no expert. They seem to be something that I really need twice a year so they just don't stay "on top" of my brain.
The situation: I'd like to write a regex to match a certain word, let's call it "Ostrich". Easy. Except Ostrich can sometimes appear inside of a curly brace. If it's inside of a curly brace it's not a match. The trick here is that there can be spaces inside the curly braces. Also the text is typically inside of a paragraph.
This should match:
I have an Ostrich.
This should not match:
My Emu went to the {Ostrich Race Name}.
This should be a match:
My Ostrich went to the {Ostrich Race Name}.
This should not be a match:
My Emu went to the {Race Ostrich Place}. My Emu went to the {Race Place Ostrich}.
It seems like this is possible with a regex, but I sure don't see it.
I'll offer an alternative solution to doing this, which is a bit more robust (not using regex assertions).
First, remove all the bracketed items, using a regex like {[^}]+} (use replace to change it to an empty string).
Now you can just search for Ostrich (using regex or simple string matching, depending on your needs).
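The two steps above can be sketched in Python as follows. The function name contains_outside_braces is made up for this example, and the braces are escaped in the pattern for safety.

```python
import re

def contains_outside_braces(text, word="Ostrich"):
    """Step 1: drop every {...} item. Step 2: plain substring search."""
    stripped = re.sub(r"\{[^}]+\}", "", text)
    return word in stripped

print(contains_outside_braces("I have an Ostrich."))
print(contains_outside_braces("My Emu went to the {Ostrich Race Name}."))
```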
While regular expressions can certainly be written to do what you ask, they're probably not the best tool for this particular type of thing.
One major problem with regular expressions is that they're very good at pattern matching for things that are there, but not so much when you start adding except into the mix.
Regular expressions are not stateful enough to handle this properly without a lot of work, so I would try to find a different path towards a solution.
A character tokenizer that handles the braces would be easy enough to write.
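One possible shape for such a tokenizer, as a rough Python sketch: it walks the text once and tracks brace depth. The function name and the choice to ignore an unmatched closing brace are assumptions made for this example.

```python
def find_outside_braces(text, word="Ostrich"):
    """Return the start index of each occurrence of word outside {...}."""
    positions = []
    depth = 0  # how many unclosed '{' we are currently inside
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)  # ignore a stray closing brace
        elif depth == 0 and text.startswith(word, i):
            positions.append(i)
            i += len(word)  # skip past the word just recorded
            continue
        i += 1
    return positions

print(find_outside_braces("My Ostrich went to the {Ostrich Race Name}."))
```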
I believe this will work, using lookahead and lookbehind assertions:
(?<!{[^}]*)Ostrich(?![^{]*})
I also tested the case My {Ostrich} went to the Ostrich Race. (where the second "Ostrich" does match)
Note that the lookahead assertion: (?![^{]*}) is optional.. but without it:
My {Ostrich has a missing bracket won't match
My Ostrich also} has a missing bracket will match
which may or may not be desirable.
This works in the .NET regex engine, however, it is not PCRE-compatible because it uses non-fixed-length assertions which are not supported.
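For engines without variable-length lookbehind, a lookahead-only variant may be enough. This is a different, weaker pattern than the one above, not a port of it: it assumes braces are balanced and not nested, and it is shown here with Python's re as a sketch.

```python
import re

# Reject an occurrence if a '}' follows with no '{' in between, i.e. the
# word sits inside an open brace pair. There is no lookbehind at all.
outside = re.compile(r"Ostrich(?![^{]*\})")

def count_outside(text):
    """Count occurrences that are not followed by an unopened '}'."""
    return len(outside.findall(text))

print(count_outside("My Ostrich went to the {Ostrich Race Name}."))
```

With an unbalanced input such as My Ostrich also} has a missing bracket, this variant skips the word, so it is only suitable when the braces are known to be well formed.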
Here's a very large regex that almost works.
It will return each "raw" occurrence of the word in a group.
However, the group for the last one will be empty; I'm not sure why. (Tested with .Net)
Parse without whitespace
^(?:
(?:
[^{]
|
(?:\{.*?\})
)*?
(?:\W(Ostrich)\W)?
)*$
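Using a positive lookahead with a negated character class appears to properly match all the test cases as well as multiple Ostriches (note that [^}]* can match the empty string, so the lookahead always succeeds and the lookbehind does the actual filtering):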
(?<!{[^}]*)Ostrich(?=[^}]*)