Regex to select NOT and operand - regex

I am trying to break a string into an array using Regex in C#.
I have for example the string
{([Field] = '100' OR [LaneDescription] LIKE '%DENTINPALEUW%'
OR [LaneDescription] = 'asdf' OR ([ObjectID] = 1) AND [ITEM_HEIGHT] >=
10 AND [SENDER_COMPANY] NOT LIKE '%DHL%'}
(Generated from Telerik RadFilter)
and I need it broken up so I can pass it to a custom object with the types: open parenthesis, field, comparator, value, close parenthesis.
So far, with the help of http://regexr.com, I have arrived at
\[([^\[\]]*)\]+|[\w'%]+|[()=]
but I need to get '>=' and 'NOT LIKE' as single tokens (and likewise similar operators such as <>, !=, etc.).
You can see my late night attempts at http://regexr.com/39g6b
Any help would be much appreciated.
(PS: There are no newline characters at the string)

Try
\(|\)|\[[a-zA-Z0-9_]+\]|'.*?'|\d+|NOT LIKE|\w+|[=><!]+
Demo.
Explanation:
\( // match "(" literally
| // or
\) // ")"
| // or
\[[a-zA-Z0-9_]+\] // any words inside square braces []
|
'.*?' // strings enclosed in single quotes '' (escape sequences can easily trip this up though)
|
\d+ // digits
|
NOT LIKE // "NOT LIKE", because this is the only token that can contain whitespace
|
\w+ // words like "NOT", "AND", etc
|
[=><!]+ // operators like ">", "!=", etc
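As a quick sanity check, here is a minimal sketch that runs the pattern above over part of the sample filter string. It uses Python's re module instead of C# purely for illustration; the constructs involved (alternation, character classes, a lazy quantifier) are common to both engines. The names TOKEN_RE and tokenize are made up for this example.

```python
import re

# Same alternation as in the answer; order matters, so "NOT LIKE" is tried before \w+.
TOKEN_RE = re.compile(r"\(|\)|\[[a-zA-Z0-9_]+\]|'.*?'|\d+|NOT LIKE|\w+|[=><!]+")

def tokenize(expr):
    """Return the list of tokens found in a filter expression (whitespace is skipped)."""
    return TOKEN_RE.findall(expr)

sample = "([ITEM_HEIGHT] >= 10 AND [SENDER_COMPANY] NOT LIKE '%DHL%')"
tokens = tokenize(sample)
print(tokens)
```

Because "NOT LIKE" appears before \w+ in the alternation, it comes out as one token, while a bare NOT or LIKE would still be picked up by \w+.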

Pyspark - Regex - Extract value from last brackets

I created the following regular expressions with the idea of extracting the last element in parentheses. With only one pair of parentheses it works fine, but with two pairs it either extracts the first one (which is wrong) or extracts the value with the parentheses included.
Do you know how to solve it?
tmp= spark.createDataFrame(
[
(1, 'foo (123) oiashdj (hi)'),
(2, 'bar oiashdj (hi)'),
],
['id', 'txt']
)
tmp = tmp.withColumn("old", regexp_extract(col("txt"), "(?<=\().+?(?=\))", 0));
tmp = tmp.withColumn("new", regexp_extract(col("txt"), "\(([^)]+)\)?$", 0));
tmp.show()
+---+--------------------+---+----+
| id|                 txt|old| new| needed
+---+--------------------+---+----+
|  1|foo (123) oiashdj...|123|(hi)| hi
|  2|    bar oiashdj (hi)| hi|(hi)| hi
+---+--------------------+---+----+
To extract the substring between parentheses with no other parentheses inside at the end of the string you may use
tmp = tmp.withColumn("new", regexp_extract(col("txt"), r"\(([^()]+)\)$", 1));
Details
\( - matches (
([^()]+) - captures into Group 1 any 1+ chars other than ( and )
\) - a ) char
$ - at the end of the string.
The 1 argument tells the regexp_extract to extract Group 1 value.
See the regex demo online.
NOTE: To allow trailing whitespace, add \s* right before $: r"\(([^()]+)\)\s*$"
NOTE2: To match the last occurrence of such a substring in a longer string, with exactly the same code as above, use
r"(?s).*\(([^()]+)\)"
The .* will grab all the text up to the end, and then backtracking will do the job.
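The two patterns above can be tried outside Spark with a small sketch in plain Python. This only assumes that Python's re treats these particular patterns the same way as Spark's Java regex engine; the helper names are hypothetical, and the empty string on no match mimics what regexp_extract returns.

```python
import re

def trailing_parenthesized(text):
    """Group 1 of \\(([^()]+)\\)$ : parenthesized text at the very end of the string."""
    m = re.search(r"\(([^()]+)\)$", text)
    return m.group(1) if m else ""

def last_parenthesized(text):
    """Group 1 of (?s).*\\(([^()]+)\\) : the last parenthesized text anywhere in the string."""
    m = re.search(r"(?s).*\(([^()]+)\)", text)
    return m.group(1) if m else ""

for row in ["foo (123) oiashdj (hi)", "bar oiashdj (hi)", "foo (123) tail"]:
    print(row, "->", repr(trailing_parenthesized(row)), repr(last_parenthesized(row)))
```

The third row shows the difference between the two: the end-anchored pattern finds nothing, while the greedy .* version backtracks to the last pair of parentheses.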
This should work. Use it with the single line flag.
\([^\(\)]*?\)(?!.*\([^\(\)]*?\))
https://regex101.com/r/Qrnlf3/1

Match the word "bar" if found anywhere in a field

I am trying to use a CASE statement in Google Data Studio to return a Boolean result if a given string is found within an existing field.
As Google Data Studio uses RE2 RegEx syntax, I believe the following would work, but it returns a could not parse formula error:
CASE
WHEN REGEXP_MATCH(Foo, '(\W|^)bar(\W|$)') THEN 1
ELSE 0
END
I have tried many different combinations of RegEx syntax, but can't work it out. Any help would be much appreciated as this should be a simple REGEXP_MATCH?
The Boolean result should be true if the string is found anywhere within the field:
+---------------------------+----------------+
| Foo | Boolean Result |
+---------------------------+----------------+
| blah bar / boo doo | True |
| but is / should not match | False |
| but match / here bar | True |
+---------------------------+----------------+
You need to make sure you match the whole string with the pattern that you want to use in a REGEXP_MATCH and when using regex escapes, make sure to double escape them:
CASE WHEN REGEXP_MATCH(Foo, '(.*\\W|^)bar(\\W.*|$)') THEN 1 ELSE 0 END
If there are line breaks in Foo, add (?s) at the start of the pattern.
Details
(.*\\W|^) - either any 0+ chars as many as possible followed with a non-word char or start of a string
bar - the word
(\\W.*|$) - either a non-word char followed with any 0+ chars as many as possible or end of a string
See the regex demo.
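To see why the leading and trailing .* are needed, here is a rough sketch in Python. It is only an approximation: Python's re is not RE2, and re.fullmatch is used here as a stand-in for the whole-string requirement of REGEXP_MATCH described above. The function name regexp_match is hypothetical.

```python
import re

# Pattern from the answer (single backslashes, since this is a Python raw string).
WHOLE_STRING = re.compile(r"(.*\W|^)bar(\W.*|$)")
# The question's original pattern, which can only cover "bar" plus one character on each side.
ORIGINAL = re.compile(r"(\W|^)bar(\W|$)")

def regexp_match(pattern, value):
    """Imitate a match that must cover the entire value."""
    return pattern.fullmatch(value) is not None

rows = ["blah bar / boo doo", "but is / should not match", "but match / here bar"]
results = [regexp_match(WHOLE_STRING, r) for r in rows]
print(results)
```

Under the same whole-string rule, the original pattern fails on "blah bar / boo doo" because it cannot consume the text surrounding the word.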
A Boolean field can be created using the single REGEXP_MATCH Calculated Field below, where \\b on either side of bar represents a Word Boundary thus matching bar but not bark, embark or embar:
REGEXP_MATCH(Foo, ".*(\\bbar\\b).*")

Regular expression in R: gsub pattern

I'm learning R's regular expression and I am having trouble understanding this
gsub example:
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
So far I think I get:
if x is alphanumeric it doesn't match, so nothing is modified
if x contains a . or | or ( or { or } or + or $ or ? it adds \\ in front of it
I can't explain:
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10\1')
[1] "10\001"
or
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10/1')
[1] "10/1"
I am also confused about why the replacement "\\\\\\1" adds only two brackets.
I'm supposed to figure out what this function does, and I think it's supposed to escape certain special characters?
The entire pattern is wrapped in parentheses which allows back-references. This part:
[.|()\\^{}+$*?]
... is a "character class", so it matches any one of the characters inside the square brackets, and as you say it is changing the way the pattern syntax will interpret what would otherwise be meta-characters within the pattern definition.
The next part is a "pipe" character, which is the regex OR, followed by an escaped open-square-bracket, another "OR" pipe, and then an escaped close-square-bracket. Since both R and regex use backslashes as escapes, you need to double them to get an R+regex escape in patterns ... but not in replacement strings. The close-square-bracket can only be entered in a character class if it is placed first in the class, so that entire pattern could have been more compactly formed with:
"[][.|()\\^{}+$*?]" # without the "|\\[|\\])"
In replacement strings the form "\\n" refers to whatever matched the n-th parenthetical portion of the 'pattern'; in this case '\1' is the second portion of the replacement. The first position is "\" which forms an escape and the second "\" forms the backslash. Now get ready for the even weirder part ... how many characters are in that result?
> nchar( gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\1", '10\1') )
[1] 3
And then of course none of the items in the match is equal to '\1'. Somebody writing whatever tutorial you have before you (which I do not think is the gsub help page) has a weird sense of humor. Here are a couple of functions that may be useful if you need to create characters that would otherwise be intercepted by the system readline function:
> intToUtf8(1)
[1] "\001"
> ?intToUtf8
> 0x0
[1] 0
> intToUtf8(0)
[1] ""
> utf8ToInt("")
integer(0)
And do look at ?Quotes where a lot of useful information can be found (under what I would consider a rather unlikely title) about how R handles octal, hexadecimal and other numbers and special characters.
The first regex broken down is this
( # (1 start)
[.|()\^{}+$*?]
| \[
| \]
) # (1 end)
It captures whatever is in the 'class', or '[' or ']', and then it looks like it replaces it with \\\1, which is an escape plus whatever was in capture 1.
So, basically it just escapes a single occurrence of one of those chars.
The regex could be better written as ([.|()^{}\[\]+$*?]) or within a
string as "([.|()^{}\\[\\]+$*?])"
Edit (promoting a comment) -
The regex won't match the string 10\1, so there should be no replacement. There must be an interpolation (by the language) on the printout. It looks like it's converting it to octal \001: since it can't show binary 1, it shows its octal equivalent.
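For comparison, the same escaping idea can be sketched in Python. This is an analogue, not R: Python's string and regex escaping rules differ from R's, so a raw string is used and the backslash counts do not line up with the R call. The names SPECIALS and escape_specials are made up for this example.

```python
import re

# One capture group: any single regex metacharacter, or a literal [ or ].
SPECIALS = re.compile(r"([.|()\\^{}+$*?]|\[|\])")

def escape_specials(text):
    """Put one backslash in front of every regex metacharacter in text."""
    # In the replacement template, \\ is a literal backslash and \1 is the captured character.
    return SPECIALS.sub(r"\\\1", text)

print(escape_specials("a.b"))      # a dot is a metacharacter, so it gets a backslash
print(escape_specials("10/1"))     # nothing to escape
control = escape_specials("10\x01")  # '1', '0' and the control character U+0001
print(len(control))
```

The last call mirrors the '10\1' case from the question: the string holds three characters, none of which is a metacharacter, so it comes back unchanged.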

Regular expression to find unescaped double quotes in CSV file

What would a regular expression be to find sets of 2 unescaped double quotes that are contained in columns set off by double quotes in a CSV file?
Not a match:
"asdf","asdf"
"", "asdf"
"asdf", ""
"adsf", "", "asdf"
Match:
"asdf""asdf", "asdf"
"asdf", """asdf"""
"asdf", """"
Try this:
(?m)""(?![ \t]*(,|$))
Explanation:
(?m) // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
"" // match two successive double quotes
(?! // start negative look ahead
[ \t]* // zero or more spaces or tabs
( // open group 1
, // match a comma
| // OR
$ // the end of the line or string
) // close group 1
) // stop negative look ahead
So, in plain English: "match two successive double quotes, only if they DON'T have a comma or end-of-the-line ahead of them with optionally spaces and tabs in between".
(i) besides being the normal start-of-the-string and end-of-the-string meta characters.
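A minimal sketch to check this pattern against the sample rows from the question, using Python's re module (an assumption of convenience; the question does not name an engine). The helper name has_unescaped_pair is hypothetical.

```python
import re

# Two consecutive quotes NOT followed by optional spaces/tabs and then a comma or end of line.
DOUBLED_QUOTE = re.compile(r'(?m)""(?![ \t]*(,|$))')

def has_unescaped_pair(line):
    """True if the line contains a doubled quote inside a quoted column."""
    return DOUBLED_QUOTE.search(line) is not None

no_match = ['"asdf","asdf"', '"", "asdf"', '"asdf", ""', '"adsf", "", "asdf"']
match = ['"asdf""asdf", "asdf"', '"asdf", """asdf"""', '"asdf", """"']

for line in no_match + match:
    print(line, "->", has_unescaped_pair(line))
```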
Due to the complexity of your problem, the solution depends on the engine you are using. This is because solving it requires lookbehind and lookahead, and engines differ in how they support these.
My answer uses the Ruby engine. The check is just one RegEx, but I put the whole code here to explain it better.
NOTE that, due to the Ruby RegEx engine (or my knowledge of it), optional lookahead/lookbehind is not possible. So I need a small preprocessing step to deal with spaces before and after commas.
Here is my code:
orgTexts = [
'"asdf","asdf"',
'"", "asdf"',
'"asdf", ""',
'"adsf", "", "asdf"',
'"asdf""asdf", "asdf"',
'"asdf", """asdf"""',
'"asdf", """"'
]
orgTexts.each { |orgText|
  # Preprocessing: eliminate spaces before and after commas.
  # This is needed if there may be spaces before and after a valid comma.
  orgText = orgText.gsub(Regexp.new('\" *, *\"'), '","')
  # Replace every valid character (non-quote or valid quote) with '-'
  resText = orgText.gsub(Regexp.new('([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")'), '-')
  # resText = orgText.gsub(Regexp.new('([^\"]|(^|(?<=,)|(?<=\\\\))\"|\"($|(?=,)))'), '-')
  # [^\"] ===> a non-quote
  # | ===> or
  # ^\" ===> beginning quote
  # | ===> or
  # \"$ ===> ending quote
  # | ===> or
  # (?<=,)\" ===> quote just after a comma
  # \"(?=,) ===> quote just before a comma
  # (?<=\\\\)\" ===> escaped quote
  # This part shows the invalid non-escaped quotes (puts, so each item ends with a newline)
  puts orgText
  puts resText.gsub(Regexp.new('"'), '^')
  # This part determines whether there are non-escaped quotes.
  # Here is the actual matching; use this one if you don't need to know which quote is unescaped.
  isMatch = ((orgText =~ /^([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")*$/) != 0).to_s
  # Basically, it matches from start to end (^...$) only if every character is valid
  puts orgText + ": " + isMatch
  puts
}
When executed the code prints:
"asdf","asdf"
-------------
"asdf","asdf": false
"","asdf"
---------
"","asdf": false
"asdf",""
---------
"asdf","": false
"adsf","","asdf"
----------------
"adsf","","asdf": false
"asdf""asdf","asdf"
-----^^------------
"asdf""asdf","asdf": true
"asdf","""asdf"""
--------^^----^^-
"asdf","""asdf""": true
"asdf",""""
--------^^-
"asdf","""": true
I hope this gives you some ideas that you can use with other engines and languages.
".*"(\n|(".*",)*)
should work, I guess...
For single-line matches:
^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"
or for multi-line:
(^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"
Edit/Note: Depending on the regex engine used, you could use lookbehinds and other stuff to make the regex leaner. But this should work in most regex engines just fine.
Try this regular expression:
"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"
That will match any quoted string with at least one pair of unescaped double quotes.

Regex: how to determine odd/even number of occurrences of a char preceding a given char?

I would like to replace the | with OR only in unquoted terms, eg:
"this | that" | "the | other" -> "this | that" OR "the | other"
Yes, I could split on space or quote, get an array and iterate through it, and reconstruct the string, but that seems ... inelegant. So perhaps there's a regex way to do this by counting "s preceding | and obviously odd means the | is quoted and even means unquoted. (Note: Processing doesn't start until there is an even number of " if there is at least one ").
It's true that regexes can't count, but they can be used to determine whether there's an odd or even number of something. The trick in this case is to examine the quotation marks after the pipe, not before it.
str = str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, "OR");
Breaking that down, (?:[^"]*"){2} matches the next pair of quotes if there is one, along with the intervening non-quotes. After you've done that as many times as possible (which might be zero), [^"]*$ consumes any remaining non-quotes until the end of the string.
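The lookahead can be tried out with a short sketch. It is written in Python rather than JavaScript only for illustration; the pattern itself is unchanged and uses nothing engine-specific. The names UNQUOTED_PIPE and pipes_to_or are made up for this example.

```python
import re

# A pipe counts as unquoted when an even number of quotes follows it up to the end of the string.
UNQUOTED_PIPE = re.compile(r'\|(?=(?:(?:[^"]*"){2})*[^"]*$)')

def pipes_to_or(text):
    """Replace every pipe that sits outside double quotes with OR."""
    return UNQUOTED_PIPE.sub("OR", text)

print(pipes_to_or('"this | that" | "the | other"'))
print(pipes_to_or('"this | that" | "the | other" | yet | another'))
```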
Of course, this assumes the text is well-formed. It doesn't address the problem of escaped quotes either, but it can if you need it to.
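One possible way to cope with backslash-escaped quotes, sketched in Python, is a different technique from the lookahead: match whole quoted strings, escape sequences, and pipes in a single alternation, and only rewrite the pipes. This assumes quotes are escaped with a backslash and that the text is otherwise well-formed; the names are hypothetical.

```python
import re

# Alternatives, tried in order: a complete quoted string (which may contain
# backslash escapes), a lone escape sequence, or a bare pipe.
QUOTED_OR_PIPE = re.compile(r'"(?:[^"\\]|\\.)*"|\\.|\|')

def pipes_to_or_escaped(text):
    """Replace pipes outside quoted strings with OR, leaving quoted text untouched."""
    return QUOTED_OR_PIPE.sub(
        lambda m: "OR" if m.group() == "|" else m.group(),
        text,
    )

print(pipes_to_or_escaped(r'" this \" | that" | "the | other"'))
```

Because a quoted string is consumed as one match, any pipe inside it is never seen by the third alternative.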
Regexes do not count. That's what parsers are for.
You might find the Perl FAQ on this issue relevant.
#!/usr/bin/perl
use strict;
use warnings;
my $x = qq{"this | that" | "the | other"};
print join('" OR "', split /" \| "/, $x), "\n";
You don't need to count, because you don't nest quotes. This will do:
#!/usr/bin/perl
my $str = '" this \" | that" | "the | other" | "still | something | else"';
print "$str\n";
while($str =~ /^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/) {
$str =~ s/^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/$1OR/;
}
print "$str\n";
Now, let's explain that expression.
^ -- means you'll always match everything from the beginning of the string, otherwise
the match might start inside a quote, and break everything
(...)\| -- this means you'll match a certain pattern, followed by a |, which appears
escaped here; so when you replace it with $1OR, you keep everything, but
replace the |.
(?:...)* -- This is a non-capturing group, which can be repeated multiple times; we
use a group here so we can repeat the alternative patterns multiple times.
[^"|\\]* -- This is the first pattern. Anything that isn't a pipe, an escape character
or a quote.
\\. -- This is the second pattern. Basically, an escape character and anything
that follows it.
"(?:...)*" -- This is the third pattern. Open quote, followed by another
non-capturing group repeated multiple times, followed by a closing
quote.
[^\\"] -- This is the first pattern in the second non-capturing group. It's anything
except an escape character or a quote.
\\. -- This is the second pattern in the second non-capturing group. It's an
escape character and whatever follows it.
The result is as follows:
" this \" | that" | "the | other" | "still | something | else"
" this \" | that" OR "the | other" OR "still | something | else"
Another approach (similar to Alan M's working answer):
str = str.replace(/(".+?"|\w+)\s*\|\s*/g, '$1 OR ');
The part inside the first group (spaced for readability):
".+?" | \w+
... basically means, something quoted, or a word. The remainder means that it was followed by a "|" wrapped in optional whitespace. The replacement is that first part ("$1" means the first group) followed by " OR ".
Perhaps you're looking for something like this:
(?<=^([^"]*"[^"]*")+[^"|]*)\|
Thanks everyone. Apologies for neglecting to mention this is in javascript and that terms don't have to be quoted, and there can be any number of quoted/unquoted terms, eg:
"this | that" | "the | other" | yet | another -> "this | that" OR "the | other" OR yet OR another
Daniel, it seems that's in the ballpark, ie basically a matching/massaging loop. Thanks for the detailed explanation. In js, it looks like a split, a forEach loop on the array of terms, pushing a term (after changing a | term to OR) back into an array, and a re join.
@Alan M, works nicely; escaping is not necessary due to the sparseness of sqlite FTS capabilities.
@epost, accepted solution for brevity and elegance, thanks. It merely needed to be put in a more general form for unicode etc.
(".+?"|[^\"\s]+)\s*\|\s*
My solution in C# to count the quotes and then regex to get the matches:
// Count the number of quotes.
var quotesOnly = Regex.Replace(searchText, @"[^""]", string.Empty);
var quoteCount = quotesOnly.Length;
if (quoteCount > 0)
{
// If the quote count is an odd number there's a missing quote.
// Assume a quote is missing from the end - executive decision.
if (quoteCount%2 == 1)
{
searchText += @"""";
}
// Get the matching groups of strings. Exclude the quotes themselves.
// e.g. The following line:
// "this and that" or then and "this or other"
// will result in the following groups:
// 1. "this and that"
// 2. "or"
// 3. "then"
// 4. "and"
// 5. "this or other"
var matches = Regex.Matches(searchText, @"([^\""]*)", RegexOptions.Singleline);
var list = new List<string>();
foreach (var match in matches.Cast<Match>())
{
var value = match.Groups[0].Value.Trim();
if (!string.IsNullOrEmpty(value))
{
list.Add(value);
}
}
// TODO: Do something with the list of strings.
}