Regex - Find everything between OR-operators except OR between quotes

Regex - Find everything between OR-operators except OR between quotes - regex

I need some help with a Regex. I have a query, that should be splitted between all OR-operators. But if the OR is inside of quotes, it should not splitted.
Example:
This is the query:
"test1" OR "test2.1 OR test2.2" OR test3 OR test4:"test4.1 OR test4.2"
Expression 1: I need everything between the OR-operators or start/end of line... (This is not working)
(^|OR).*?(OR|$)
Expression 2: ...except of the ORs between quotes:
"(.*?)"
The result should be:
"test1"
"test2.1 OR test2.2"
test3
test4:"test4.1 OR test4.2"
How can I make the first expression work and how can I combine these both expressions?
Thank you for help!

It's unclear what the grammar of your expression is, so I just make a bunch of assumptions and come up with this regex to match the tokens between OR:
\G(\w+(?::"[^"]*")?|"[^"]*")(?:(\s+OR\s+)|\s*$)
Demo at regex101
I assume that between OR, it can be an identifier \w+, an identifier with some string \w+:"[^"]*", or a string literal "[^"]*".
Feel free to substitute your own definition of string literal - I'm using the simplest (and broken) specification "[^"]*" as example.
In every match, the regex starts from where the last match left off (or the beginning of the string) and matches one token (as described above), followed by OR or the end of the input string.
The capturing groups at (\s+OR\s+) is deliberate - you will need this to check whether the last match actually terminates at the end of the string or not, or whether the input is malformed.
Caveat
Do note that while my solution produces the expected result for this case, without a full specification of the grammar of the expression, it's not possible to cater for all possible cases you may want to handle.

(?:^|OR(?=(?:[^"]*"[^"]*")*+[^"]*$))([\s\S]*?)(?=OR(?=(?:[^"]*"[^"]*")*+[^"]*$)|$)
You can use this and capture the groups.See demo.
https://regex101.com/r/xC4rJ3/12

Try to match everything in quotes or not-OR with:
(?:"[^"]+"|\b(?:(?!\bOR\b)[^"])+)+
DEMO

This regex works optimally (though it be subject to improvement with a more detailed specification):
(?<!\S)(?!OR\s)[^\s"]*(?:"[^"]*"[^\s"]*)*
DEMO
(?<!\S) ensures the match starts at the beginning of the string or after a whitespace character.
(?!OR\s) prevents it from matching OR
[^\s"]*(?:"[^"]*"[^\s"]*)* matches a contiguous series of, in any order:
sequences of non-whitespace, non-quote characters, or
a pair of quotes enclosing anything except quotes.
However, I notice that all the tokens in your example consist of:
a non-quote, non-whitespace sequence (NQ),
a quoted sequence (Q), or
an NQ followed immediately by a Q.
If you expect all tokens to match that pattern, you can change the regex to this:
(?<!\S)(?!OR\s)(?:[^\s"]*"[^"]*"|[^\s"]+)
According to Regex101, it's slightly more efficient (but probably not enough to matter).
DEMO

Related

Check array syntax with Regex

I'm trying to create a regex that checks if a string is a valid path for Firestore document.
I will find a regex that testing if a string:
start with a char ^([a-z]{1})
after first char, there will be only letter/digit and/or a dot \w*(.?\w+){0,}
last chars in the string could be an index of an array (\[{1}\d+\]{1})?$
First and second points work well but the last group doesn't work. I test a string like data.images[11 and the regex return true.

first of all you can shorten some quantifiers in your regex:
{1} -> can be ignored completely
{0,} -> *
Your second part could be expressed like this, this will also support readability:
[\w.]* meaning: take any character inside the brackets 0 to n-times. The bracket expression also supports predefined classes, so we are using \w here. The dot INSIDE the brackets doesn't need to be escaped, it simply means the one character dot.
So your parts would be:
^([a-z])
[\w.]*
(\[\d+\])?$
I hope this helps. According to regexpal it matches data.images[11], but not data.images[11. Also it seems to support all your demands.
EDIT:
Your second part doesn't work because (like Asocia stated in the answer) you would need to escape the dot. The dot itself is a class meaning "any character" (depending on regex engine and settings sometimes even line breaks). As you mean the dot as a character you need to escape it.

Regex to match one or two quotes but not three in a row

For the life of me I can't figure this one out.
I need to search the following text, matching only the quotes in bold:
Don't match: """This is a python docstring"""
Match: " This is a regular string "
Match: "" ← That is an empty string
How can I do this with a regular expression?
Here's what I've tried:
Doesn't work:
(?!"")"(?<!"")
Close, but doesn't match double quotes.
Doesn't work:
"(?<!""")|(?!"")"(?<!"")|(?!""")"
I naively thought that I could add the alternates that I don't want but the logic ends up reversed. This one matches everything because all quotes match at least one of the alternates.
(Please note: I'm not running the code, so solutions around using __doc__ won't help, I'm just trying to find and replace in my code editor.)

You can use /(?<!")"{1,2}(?!")/
DEMO
Autopsy:
(?<!") a negative look-behind for the literal ". The match cannot have this character in front
"{1,2} the literal " matched once or twice
(?!") a negative look-ahead for the literal ". The match cannot have this character after
Your first try might've failed because (?!") is a negative look-ahead, and (?<!") is a negative look-behind. It makes no sense to have look-aheads before your match, or look-behinds after your match.

I realized that my original problem description was actually slightly wrong. That is, I need to actually only match a single quote character, unless if it's part of a group of 3 quote characters.
The difference is that this is desirable for editing so that I can find and replace with '. If I match "one or two quotes" then I can't automatically replace with a single character.
I came up with this modification to h20000000's answer that satisfies that case:
(?<!"")(?<=(?!""").)"(?!"")
In the demo, you can see that the "" are matched individually, instead of as a group.
This works very similarly to the other answer, except:
it only matches a single "
that leaves us with matching everything we want except it still matches the middle quotes of a """:
Finally, adding the (?<=(?!""").) excludes that case specifically, by saying "look back one character, then fail the match if the next three characters are """):
I decided not to change the question because I don't want to hijack the answer, but I think this can be a useful addition.

regular expression no characters

I have this regular expression
([A-Z], )*
which should match something like
test, (with a space after the comma)
How to I change the regex expression so that if there are any characters after the space then it doesn't match.
For example if I had:
test, test
I'm looking to do something similar to
([A-Z], ~[A-Z])*
Cheers

Use the following regular expression:
^[A-Za-z]*, $
Explanation:
^ matches the start of the string.
[A-Za-z]* matches 0 or more letters (case-insensitive) -- replace * with + to require 1 or more letters.
, matches a comma followed by a space.
$ matches the end of the string, so if there's anything after the comma and space then the match will fail.
As has been mentioned, you should specify which language you're using when you ask a Regex question, since there are many different varieties that have their own idiosyncrasies.

^([A-Z]+, )?$
The difference between mine and Donut is that he will match , and fail for the empty string, mine will match the empty string and fail for ,. (and that his is more case-insensitive than mine. With mine you'll have to add case-insensitivity to the options of your regex function, but it's like your example)

I am not sure which regex engine/language you are using, but there is often something like a negative character groups [^a-z] meaning "everything other than a character".

Regular expression to split a string but consider multi-digit escape sequences

I could need some help on the following problem with regular expressions and would appreciate any help, thanks in advance.
I have to split a string by another string, let me call it separator. However, if an escape sequence preceeds separatorString, the string should not be split at this point. The escape sequence is also a string, let me call it escapeSequence.
Maybe it is better to start with some examples
separatorString = "§§";
escapeSequence = "###";
inputString = "Part1§§Part2" ==> Desired output: "Part1", "Part2"
inputString = "Part1§§Part2§§ThisIs###§§AllPart3" ==> Desired output: "Part1", "Part2", "ThisIs###§§AllPart3"
Searching stackoverflow, I found Splitting a string that has escape sequence using regular expression in Java and came up with the regular expression
"(?<!(###))§§".
This is basically saying, match if you find "§§", unless it is preceeded by "###".
This works fine with Regex.Split for the examples above, however, if inputString is "Part1###§§§§Part2" I receive "Part1###§", "§Part2" instead of "Part1###§§", "Part2".
I understand why, as the second "§" gives a match, because the proceeding chars are "##§" and not "###". I tried several hours to modify the regex, but the result got only worse. Does someone have an idea?

Let's call the things that appear between the separators, tokens. Your regex needs to stipulate what the beginning and end of a token looks like.
In the absence of any stipulation, in other words, using the regex you have now, the regex engine is happy to say that the first token is Part1###§ and the second is §Part2.
The syntax you used, (?<!foo) , is called a zero-width negative look-behind assertion. In other words, it looks behind the current match, and makes an assertion that it must match foo. Zero-width just indicates that the assertion does not advance the pointer or cursor in the subject string when the assertion is evaluated.
If you require that a new token start with something specific (say, an alphanumeric character), you can specify that with a zero-width positive lookahead assertion. It's similar to your lookbehind, but it says "the next bit has to match the following pattern", again without advancing the cursor or pointer.
To use it, put (?=[A-Z]) following the §§. The entire regex for the separator is then
(?<!###)§§(?=[A-z]).
This would assert that the character following a separator sequence needs to be an uppercase alpha, while the characters preceding the separator sequence must not be ###. In your example, it would force the match on the §§ separator to be the pair of chars before Part2. Then you would get Part1###§§ and Part2 as the tokens, or group captures.
If you want to stipulate what a token is in the negative - in other words to stipulate the a token begins with anything except a certain pattern, you can use a negative lookahead assertion. The syntax for this is (?!foo). It works just as you would expect - like your negative lookbehind, only looking forward.
The regular-expressions.info website has good explanations for all things regex, including for the lookahead and lookbehind constructs.
ps: it's "Hello All", not "Hello Together".

How about doing the opposite: Instead of splitting the string at the separators match non-separator parts and separator parts:
/(?:[^§#]|§[^§#]|#(?:[^#]|#(?:[^#]|#§§)))+|§§/
Then you just have to remove every matched separator part to get the non-separator parts.

What does /([^.])\.(.)/ mean?

When I searched about something, I found an answered question in this site. 2 of the answers contain
/([^.]*)\.(.*)/
on their answer.
The question is located at Find & replace jquery. I'm newbie in javascript, so I wonder, what does it mean? Thanks.

/([^.]*)\.(.*)/
Let us deconstruct it. The beginning and trailing slash are delimiters, and mark the start and end of the regular expression.
Then there is a parenthesized group: ([^.]*) The parentheseis are there just to group a string together. The square brackets denote a "character group", meaning that any character inside this group is accepted in its place. However, this group is negated by the first character being ^, which reverse its meaning. Since the only character beside the negation is a period, this matches a single character that is not a period. After the square brackets is a * (asterisk), which means that the square brackets can be matched zero or more times.
Then we get to the \.. This is an escaped period. Periods in regular expressions have special meaning (except when escaped or in a character group). This matches a literal period in the text.
(.*) is a new paranthesized sub-group. This time, the period matches any character, and the asterisk says it can be repeated as many times as needs to.
In summary, the expression finds any sequence of characters (that isn't a period), followed by a single period, again followed by any character.
Edit: Removed part about shortening, as it defeats the assumed purpose of the regular expression.

It's a regular expression (it matches non-periods, followed by a period followed by anything (think "file.ext")). And you should run, not walk, to learn about them. Explaining how this particular regular expression works isn't going to help you as you need to start simpler. So start with a regex tutorial and pick up Mastering Regular Expressions.

Original: /([^.]*)\.(.*)/
Split this as:
[1] ([^.]*) : It says match all characters except . [ period ]
[2] \. : match a period
[3] (.*) : matches any character
so it becomes
[1]Match all characters which are not . [ period ] [2] till you find a .[ period ] then [3] match all characters.

Anything except a dot, followed by a dot, followed by anything.
You can test regex'es on regexpal

It's a regular expression that roughly searches for a string that doesn't contain a period, followed by a period, and then a string containing any characters.

That is a regular expression. Regular expressions are powerful tools if you use them right.
That particular regex extracts filename and extension from a string that looks like "file.ext".

It's a regular expression that splits a string into two parts: everything before the first period, and then the remainder. Most regex engines (including the Javascript one) allow you to then access those parts of the string separately (using $1 to refer to the first part, and $2 for the second part).

This is a regular expression with some advanced use.
Consider a simpler version: /[^.]*\..*/ which is the same as above without parentheses. This will match just any string with at least one dot. When the parentheses are added, and a match happens, the variables \1 and \2 will contain the matched parts from the parentheses. The first one will have anything before the first dot. The second part will have everything after the first dot.
Examples:
input: foo...bar
\1: foo
\2: ..bar
input: .foobar
\1:
\2: foobar

This regular expression generates two matching expressions that can be retrieved.
The two parts are the string before the first dot (which may be empty), and the string after the first dot (which may contain other dots).
The only restriction on the input is that it contain at least one dot. It will match "." contrary to some of the other answers, but the retrived groups will be empty.

IMO /.*\..*/g Would do the same thing.
const senExample = 'I am test. Food is good.';
const result1 = senExample.match(/([^.]*)\.(.*)/g);
console.log(result1); // ["I am test. Food is good."]
const result2 = senExample.match(/^.*\..*/g);
console.log(result2); // ["I am test. Food is good."]

the . character matches any character except line break characters the \r or \n.
the ^ negates what follows it (in this case the dot)
the * means "zero or more times"
the parentheses group and capture,
the \ allows you to match a special character (like the dot or the star)
so this ([^.]*) means any line break repeated zero or more times (it just eats up carriage returns).
this (.*) part means any string of characters zero or more times (except the line breaks)
and the \. means a real dot
so the whole thing would match zero or more line breaks followed by a dot followed by any number of characters.
For more information and a really great reference on Regular Expressions check out: http://www.regular-expressions.info/reference.html

It's a regular expression, which basically is a pattern of characters that is used to describe another pattern of characters. I once used regexps to find an email address inside a text file, and they can be used to find pretty much any pattern of text within a larger body of text provided you write the regexp properly.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js