Regex group is matching quotes when I don't want it to - regex

I have this regular expression:
"([^"\\]|\\.)*"|(\S+)
But the problem is, when I have an input like "foo" and I use a matcher to go through the groups, the first group it finds is "foo" when I want it to be foo. What am I doing wrong?
EDIT:
I'm using Java and I just fixed it
"((?:[^"\\]|\\.)*)"|(\S+)
The original capturing group sat inside the * quantifier, so it only ever held the last repetition rather than the whole string. I wrapped the repeated part in a new capturing group and made the existing inner group non-capturing.
EDIT: Actually no... it's working in the online regex debuggers but not in my program...

Capture the contents of the double-quoted literal (branch 1) in group 1. If that group matched, use it; otherwise use group 2.
Also, consider unrolling the pattern:
 "([^"\\]*(?:\\.[^\\"]*)*)"|(\S+)
In Java:
String pat = "\"([^\"\\\\]*(?:\\\\.[^\\\\\"]*)*)\"|(\\S+)";
Note that patterns like (A|B)* can cause a StackOverflowError in Java on long inputs, because the engine recurses for each repetition of the group. That's why the unrolled version is preferable.
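To illustrate the "grab group 1 if it matched" step, here is a minimal sketch in Python rather than Java. The helper name tokenize is made up for this example. In Java the equivalent check would be whether Matcher.group(1) is null.

```python
import re

# Unrolled pattern from the answer above:
# group 1 = contents of a double-quoted literal, group 2 = a bare token.
TOKEN = re.compile(r'"([^"\\]*(?:\\.[^\\"]*)*)"|(\S+)')

def tokenize(text):
    """Return every token, with the surrounding quotes stripped (hypothetical helper)."""
    tokens = []
    for m in TOKEN.finditer(text):
        # group(1) is None when the \S+ branch matched instead of the quoted branch.
        tokens.append(m.group(1) if m.group(1) is not None else m.group(2))
    return tokens

print(tokenize(r'"foo" bar "a \"b\" c"'))
```

The test is "is not None" rather than a truthiness check, so an empty quoted string "" still comes back as an empty token instead of falling through to group 2.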

Related

Regex for the string at between the last quotes?

I want to take DDEERR as a result in regex. My sample string is:
("NNNS" lllsds 4.5 ddsdsd "DDEERR")
I used (?<=\s*\s*").*?(?=") for all strings between "", but I couldn't take the last one only (or before the right parentheses).
Do you have any ideas? Thanks.
I would just make good use of greedy dot here:
^.*"(.*?)".*$
The idea here is that the first .* is greedy, so it consumes everything up to the last term appearing in double quotes. Then we capture the text inside those double quotes as the first (and only) capture group.
Edit:
If you really need to do this without any capture groups at all, then we can try writing a pattern with lookarounds:
(?<=")[^"]+(?="[^"]*$)
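Both patterns can be tried side by side. The following is a small Python sketch. The function names are made up for illustration, and the sample string is the one from the question.

```python
import re

def last_quoted_greedy(text):
    """Greedy approach: the leading .* backs off only as far as the last "..." pair."""
    m = re.search(r'^.*"(.*?)".*$', text)
    return m.group(1) if m else None

def last_quoted_lookaround(text):
    """Lookaround approach: the whole match is the text, no capture group needed."""
    m = re.search(r'(?<=")[^"]+(?="[^"]*$)', text)
    return m.group(0) if m else None

sample = '("NNNS" lllsds 4.5 ddsdsd "DDEERR")'
print(last_quoted_greedy(sample))
print(last_quoted_lookaround(sample))
```

Both calls should print DDEERR for this input.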

Complicated regex to match anything NOT within quotes

I have this regex which scans a text for the word very: (?i)(?:^|\W)(very)[\W$] which works. My goal is to upgrade it and avoid doing a match if very is within quotes, standalone or as part of a longer block.
Now, I have this other regex which is matching anything NOT inside curly quotes: (?<![\S"])([^"]+)(?![\S"]) which also works.
My problem is that I cannot seem to combine them. For example the string:
Fred Smith very loudly said yesterday at a press conference that fresh peas will "very, very definitely not" be served at the upcoming county fair.
In this bit we have 3 instances of very, but I'm only interested in matching the first one and ignoring the whole Smith quotation.
What you describe is kind of tricky to handle with a regular expression. It's difficult to determine whether you are inside a quote. Your second regex is not effective as it only ignores the first very that is directly to the right of the quote and still matches the second one.
Drawing inspiration from this answer, that in turn references another answer that describes how to regex match a pattern unless ... I can capture the matches you want.
The basic idea is to use alternation | and match all the things you don't want and then finally match (and capture) what you do want in the final clause. Something like this:
"[^"]*"|(very)
We match quoted strings in the first clause but we don't capture them in a group and then we match (and capture) the word very in the second clause. You can find this match in the captured group. How you reference a captured group depends on your regex environment.
See this regex101 fiddle for a test case.
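As a rough illustration of reading the captured group, here is a Python sketch of the alternation trick. The function name and the shortened sample sentence are assumptions made for this example.

```python
import re

# First branch swallows whole quoted strings without capturing them;
# second branch captures the word we actually care about.
PATTERN = re.compile(r'"[^"]*"|(very)', re.IGNORECASE)

def unquoted_very_positions(text):
    """Return the start index of every 'very' that is outside double quotes."""
    return [m.start(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

text = 'Fred Smith very loudly said that peas will "very, very definitely not" be served.'
print(unquoted_very_positions(text))
```

Matches where group 1 is None are the quoted strings, so they are simply skipped.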
This regex
(?i)(?<!(((?<DELIMITER>[ \t\r\n\v\f]+)(")(?<FILLER>((?!").)*))))\bvery\b(?!(((?<FILLER2>((?!").)*)(")(?<DELIMITER2>[ \t\r\n\v\f]+))))
could work under two conditions:
your regex engine allows unlimited lookbehind
quotes are delimited by spaces
Try it on http://regexstorm.net/tester

use case for ?: in tcl regexp

I read the documentation of ?: in tcl regexp. Which says that it matches an expression without capturing it.
I tried and it worked fine.
My question is: what is the proper use case for this option? If we do not want to capture something, we would simply not put parentheses there.
Is it just an alternate way, or have some special condition, where we should use this? Kindly clarify.
Easy: You need to group several elements in your Regex, but you don't need them as a capturing group for reference.
a+ (b+|c+) OR (a+ b+)|c+
I need parentheses for grouping. But written like this, the engine also captures what each group matched, which costs some extra memory and time. If I don't need the captured text later for reference, I can use ?: to get grouping without that overhead:
a+ (?:b+|c+) OR (?:a+ b+)|c+
First, have a look at the Tcl regex reference:
(expression)
Parentheses surrounding an expression specify a nested expression. The substring matching expression is captured and can be referred to via the back reference mechanism, and also captured into any corresponding match variable specified as an argument to the command.
(?:expression)
matches expression without capturing it.
While the first part describing capturing group ability to capture subtext to be referred to with backreferences is universal, the second part dwelling on initializing variables based on the capturing group is specific to Tcl.
Bearing that in mind, Tcl regex usage can be greatly simplified with non-capturing groups in case you have a pattern with a number of capturing groups, and you want to modify it by adding another group in-between existing groups.
Say, you want to match strings like abc 1234 (comment) and use {(\w+)\s+(\d+)\s+\(([^()]+)\)}:
regexp {(\w+)\s+(\d+)\s+\(([^()]+)\)} $a - body num comment
However, you were asked to also match strings with any number of word+space+digits in-between 1234 and comment. If you write
set a1 "abc 1234 more 5678 text 890 here 678 (comment)"
regexp {(\w+)\s+(\d+)(\s+\w+\s+\d+)*\s+\(([^()]+)\)} $a1 - body1 num1 comment1
                     ^^^^^^^^^^^^^^^
then $comment1 will hold a value you would not expect: the text captured by the newly added third group, not the comment.
Turning it into a non-capturing group fixes the issue.
For other common uses of a non-capturing group, please refer to Are optional non-capturing groups redundant post.
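The group-shifting problem from the Tcl example is not specific to Tcl. As a sketch, the same two patterns can be compared in Python, where the inserted capturing group pushes the comment from group 3 to group 4:

```python
import re

a1 = "abc 1234 more 5678 text 890 here 678 (comment)"

# With a capturing group in the middle, the comment moves to group 4
# and group 3 holds the last repetition of the inserted group.
capturing = re.search(r'(\w+)\s+(\d+)(\s+\w+\s+\d+)*\s+\(([^()]+)\)', a1)
print(capturing.groups())

# With a non-capturing group, the numbering of the original groups is unchanged.
non_capturing = re.search(r'(\w+)\s+(\d+)(?:\s+\w+\s+\d+)*\s+\(([^()]+)\)', a1)
print(non_capturing.groups())
```

With the non-capturing version, code that reads the third group as the comment keeps working after the pattern is extended.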
You can use a non-capturing group (?:...) when matching one of several word options that you do not want to capture.
(?:one|two|three)

Is it possible to say in Regex "if the next word does not match this expression"?

I'm trying to detect occurrences of words italicized with *asterisks* around it. However I want to ensure it's not within a link. So it should find "text" in here is some *text* but not within http://google.com/hereissome*text*intheurl.
My first instinct was to use look aheads, but it doesn't seem to work if I use a URL regex such as John Gruber's:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
And put it in a look ahead at the beginning of the pattern, followed by the rest of the pattern.
(?=URLPATTERN)\*[a-zA-Z\s]\*
So how would I do this?
You can use an alternation: on the left-hand side, match everything you want to discard (here, URLs); on the right-hand side, use a capturing group to match the desired text.
https?:\/\/\S*|(\*\S+\*)
You can then use captured group #1 for your emphasized text.
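A short Python sketch of how group #1 could be read. The function name is invented for this example, and the pattern is the one above without the unneeded slash escapes.

```python
import re

# Left branch consumes URLs so they are skipped; right branch captures *emphasis*.
EMPHASIS = re.compile(r'https?://\S*|(\*\S+\*)')

def emphasized(text):
    """Return every *word* that is not part of an http(s) URL."""
    return [m.group(1) for m in EMPHASIS.finditer(text) if m.group(1) is not None]

text = 'here is some *text* but not http://google.com/hereissome*text*intheurl'
print(emphasized(text))
```

Because the URL branch is tried first and consumes the whole URL, the asterisks inside it never get a chance to match the second branch.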
The following regexp:
^(?!http://google.com/hereissome.*text.*intheurl).*
matches any string that does not start with http://google.com/hereissome*text*intheurl. This is called a negative lookahead. Some regex libraries may not support it; Python's does.
Here is a link to Mastering Lookahead and Lookbehind.

TR1 regex: capture groups?

I am using TR1 Regular Expressions (for VS2010) and what I'm trying to do is search for specific pattern for a group called "name", and another pattern for a group called "value". I think what I want is called a capture group, but I'm not sure if that's the right terminology. I want to assign matches to the pattern "[^:\r\n]+):\s" to a list of matches called "name", and matches of the pattern "[^\r\n]+)\r\n)+" to a list of matches called "value".
The regex pattern I have so far is
string pattern = "((?<name>[^:\r\n]+):\s(?<value>[^\r\n]+)\r\n)+";
But the TR1 regex header keeps throwing an exception when the program runs. What's wrong with the syntax of the pattern I have? Can someone show an example pattern that would do what I'm trying to accomplish?
Also, how would it be possible to include a substring within the pattern to match, but not actually include that substring in the results? For example, I want to match all strings of the pattern
"http://[[:alpha:]]\r\n"
, but I don't want to include the substring "http://" in the returned results of matches.
The C++ TR1 and C++11 regular expression grammars don't support named capture groups. You'll have to do unnamed capture groups.
Also, make sure you don't run into escaping issues. You'll have to escape some characters twice: once for the C++ string literal and once for the regex. The pattern (([^:\r\n]+):\s\s([^\r\n]+)\r\n)+ can be written as a C++ string literal like this:
"(([^:\\r\\n]+):\\s\\s([^\\r\\n]+)\\r\\n)+"
// or in C++11
R"xxx((([^:\r\n]+):\s\s([^\r\n]+)\r\n)+)xxx"
Lookbehinds are not supported either. You'll have to work around this limitation by using capture groups: use the pattern (http://)([[:alpha:]]\r\n) and grab only the second capture group.