What is the right regex for extracting string after first symbol? - regex

In google data studio I would like to make a REGEXP_EXTRACT for getting the string after the first | symbol (https://regex101.com/r/w3BqW4/2). I've tried the regex:
[|].*?$
but this is returning:
' | Leren afzaklaarzen Elisio | kalk'
So I still need to lose the first ' | '. Can anyone help me?
Example:
Input: Toral | Leren afzaklaarzen Elisio | kalk
Output: Leren afzaklaarzen Elisio | kalk

Adding a capturing group () to the initial RegEx does the trick:
REGEXP_EXTRACT(Field, "[|](.*)?$")
Adding the suggestion by cricket_007 "You cannot have "zero or one" of "zero or more" characters":
REGEXP_EXTRACT(Field, "[|](.*)$")
Google Data Studio Report to demonstrate:

If you want to capture every character after the first vertical bar, that would look like this
\\|(.*)$
You don't need question mark after a .*

Related

REGEX Replacing with exception

my first problem here is my nemesis regex.
I need a regex to replace every , with a "," from a text without replacing existing ,".
It looks like this:
Before:
abcd,efgh,ijkl,"","",mnop
After:
abcd","efgh","ijkl","","","mnop
I hope you can help me.
Solving a problem using regular expressions is nice but now you have two problems.
A simple solution that does not involve the usage of regular expressions to is do three simple string replacements: first replace , with "," then replace ","" with "," and in the end ""," with ",".
Let's see why this works:
| after 1st | after 2nd | after 3rd
original | replacement | replacement | replacement
----------+-------------+-------------+-------------
a,b | a","b | a","b | a","b
m",n | m"","n | m"","n | m","n
x,"y | x",""y | x","y | x","y
See it in action:
const input = 'abcd,efgh,ijkl,"","",mnop';
const output = input.replace(/,/g, '","').replace(/",""/g, '","').replace(/"","/g, '","');
console.log(output);
N.B. The code snippet above uses regular expressions because this is how JavaScript implements the "replace all" functionality. When the first argument of String.replace() is a string it replaces only its first occurrence.
I could use String.replaceAll() instead (it works with strings) but it is not widely supported by browsers yet.
Crudely, I think you are after something like:
(?:(?<!"),"|(?<=")",(?!")|(?<!"),(?!"))
Note: As mention by #WiktorStribiżew in the comments you could get rid of the outer non-capturing group: (?<!"),"|(?<=")",(?!")|(?<!"),(?!")
See the online Demo

Match the word "bar" if found anywhere in a field

I am trying to use a CASE statement in Google Data Studio to return a Boolean result if a given string is found within an existing field.
As Google Data Studio uses RE2 RegEx syntax, I believe the following would work, but it returns a could not parse formula error:
CASE
WHEN REGEXP_MATCH(Foo, '(\W|^)bar(\W|$)') THEN 1
ELSE 0
END
I have tried many different combinations of RegEx syntax, but can't work it out. Any help would be much appreciated as this should be a simple REGEXP_MATCH?
The Boolean result should be true if the string is found anywhere within the field:
+---------------------------+----------------+
| Foo | Boolean Result |
+---------------------------+----------------+
| blah bar / boo doo | True |
| but is / should not match | False |
| but match / here bar | True |
+---------------------------+----------------+
You need to make sure you match the whole string with the pattern that you want to use in a REGEXP_MATCH and when using regex escapes, make sure to double escape them:
CASE WHEN REGEXP_MATCH(Foo, '(.*\\W|^)bar(\\W.*|$)') THEN 1 ELSE 0 END
If there are line breaks in Foo, add (?s) at the start of the pattern.
Details
(.*\\W|^) - either any 0+ chars as many as possible followed with a non-word char or start of a string
bar - the word
(\\W.*|$) - either a non-word char followed with any 0+ chars as many as possible or end of a string
See the regex demo.
A Boolean field can be created using the single REGEXP_MATCH Calculated Field below, where \\b on either side of bar represents a Word Boundary thus matching bar but not bark, embark or embar:
REGEXP_MATCH(Foo, ".*(\\bbar\\b).*")
Google Data Studio Report and a GIF to elaborate:

Looking for single occurrence between '{' and ':' in a large text

I'm new to the Regex world, so please be kind on the tantrums :-)
I would like to print only the first occurrence of a string between { and :.
Example in the following string:
({TRIGGER.VALUE}=0 and {Zabbix windows:zabbix[process,discoverer,avg,busy].avg(10m)}>75)
or
({TRIGGER.VALUE}=1 and {Zabbix windows:zabbix[process,discoverer,avg,busy].avg(10m)}>65)
I want it to output only Zabbix windows
how is that possible?
I tried {([a-zA-Z0-9 ]*): it is printing : and doing it twice.
Thanks for reading!
Srini
You may use a PCRE regex with -o option (extracting the matches rather than returning the whole lines) to grab the text you need and use head -1 to only have the first match:
s='({TRIGGER.VALUE}=0 and {Zabbix windows:zabbix[process,discoverer,avg,busy].avg(10m)}>75) or ({TRIGGER.VALUE}=1 and {Zabbix windows:zabbix[process,discoverer,avg,busy].avg(10m)}>65)'
echo $s | grep -oP '(?<={)[\w\s]+(?=:)' | head -1
See an online demo
Pattern details:
(?<={) - there must be a { immediately to the left of the current location
[\w\s]+ - 1+ word and/or whitespace chars
(?=:) - there must be a : immediately to the right of the current location.

extract a variable value from the middle of a string

I have been trying to figure out for quite sometime. how do I get the PID value from the following string using powershell? I thought REGEX was the way to go but I can't quite figure out the syntax.
For what it is worth everything except for the PID will remain the same.
$foo = <VALUE>I am just a string and the string is the thing. PID:25973. After this do that and blah blah.</VALUE>
I have tried the following in regex
[regex]::Matches($foo, 'PID:.*') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>(.+).') | % {$_.Captures[0].Groups[1].value}
For your regex you'll want to indicate what's before and after the portion you're looking for. PID:.* will find everything from the PID to the end of the string.
And to use a capture group you'll want to have some ( and ) in your regex, which defines a group.
So try this on for size:
[regex]::Matches($foo,'PID:(\d+)') | % {$_.Captures[0].Groups[1].value}
I'm using a regex of PID:(\d+). The \d+ means "one or more digits". The parentheses around that (\d+) identifies it as a group I can access using Captures[0].Groups[1].
Here's another option. Basically it replaces everything with the first capture group (which is the digits after 'pid:':
$foo -replace '^.+PID:(\d+).+$','$1'

Regex: how to determine odd/even number of occurrences of a char preceding a given char?

I would like to replace the | with OR only in unquoted terms, eg:
"this | that" | "the | other" -> "this | that" OR "the | other"
Yes, I could split on space or quote, get an array and iterate through it, and reconstruct the string, but that seems ... inelegant. So perhaps there's a regex way to do this by counting "s preceding | and obviously odd means the | is quoted and even means unquoted. (Note: Processing doesn't start until there is an even number of " if there is at least one ").
It's true that regexes can't count, but they can be used to determine whether there's an odd or even number of something. The trick in this case is to examine the quotation marks after the pipe, not before it.
str = str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, "OR");
Breaking that down, (?:[^"]*"){2} matches the next pair of quotes if there is one, along with the intervening non-quotes. After you've done that as many times as possible (which might be zero), [^"]*$ consumes any remaining non-quotes until the end of the string.
Of course, this assumes the text is well-formed. It doesn't address the problem of escaped quotes either, but it can if you need it to.
Regexes do not count. That's what parsers are for.
You might find the Perl FAQ on this issue relevant.
#!/usr/bin/perl
use strict;
use warnings;
my $x = qq{"this | that" | "the | other"};
print join('" OR "', split /" \| "/, $x), "\n";
You don't need to count, because you don't nest quotes. This will do:
#!/usr/bin/perl
my $str = '" this \" | that" | "the | other" | "still | something | else"';
print "$str\n";
while($str =~ /^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/) {
$str =~ s/^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/$1OR/;
}
print "$str\n";
Now, let's explain that expression.
^ -- means you'll always match everything from the beginning of the string, otherwise
the match might start inside a quote, and break everything
(...)\| -- this means you'll match a certain pattern, followed by a |, which appears
escaped here; so when you replace it with $1OR, you keep everything, but
replace the |.
(?:...)* -- This is a non-matching group, which can be repeated multiple times; we
use a group here so we can repeat multiple times alternative patterns.
[^"|\\]* -- This is the first pattern. Anything that isn't a pipe, an escape character
or a quote.
\\. -- This is the second pattern. Basically, an escape character and anything
that follows it.
"(?:...)*" -- This is the third pattern. Open quote, followed by a another
non-matching group repeated multiple times, followed by a closing
quote.
[^\\"] -- This is the first pattern in the second non-matching group. It's anything
except an escape character or a quote.
\\. -- This is the second pattern in the second non-matching group. It's an
escape character and whatever follows it.
The result is as follow:
" this \" | that" | "the | other" | "still | something | else"
" this \" | that" OR "the | other" OR "still | something | else"
Another approach (similar to Alan M's working answer):
str = str.replace(/(".+?"|\w+)\s*\|\s*/g, '$1 OR ');
The part inside the first group (spaced for readability):
".+?" | \w+
... basically means, something quoted, or a word. The remainder means that it was followed by a "|" wrapped in optional whitespace. The replacement is that first part ("$1" means the first group) followed by " OR ".
Perhaps you're looking for something like this:
(?<=^([^"]*"[^"]*")+[^"|]*)\|
Thanks everyone. Apologies for neglecting to mention this is in javascript and that terms don't have to be quoted, and there can be any number of quoted/unquoted terms, eg:
"this | that" | "the | other" | yet | another -> "this | that" OR "the | other" OR yet OR another
Daniel, it seems that's in the ballpark, ie basically a matching/massaging loop. Thanks for the detailed explanation. In js, it looks like a split, a forEach loop on the array of terms, pushing a term (after changing a | term to OR) back into an array, and a re join.
#Alan M, works nicely, escaping not necessary due to the sparseness of sqlite FTS capabilities.
#epost, accepted solution for brevity and elegance, thanks. it needed to merely be put in a more general form for unicode etc.
(".+?"|[^\"\s]+)\s*\|\s*
My solution in C# to count the quotes and then regex to get the matches:
// Count the number of quotes.
var quotesOnly = Regex.Replace(searchText, #"[^""]", string.Empty);
var quoteCount = quotesOnly.Length;
if (quoteCount > 0)
{
// If the quote count is an odd number there's a missing quote.
// Assume a quote is missing from the end - executive decision.
if (quoteCount%2 == 1)
{
searchText += #"""";
}
// Get the matching groups of strings. Exclude the quotes themselves.
// e.g. The following line:
// "this and that" or then and "this or other"
// will result in the following groups:
// 1. "this and that"
// 2. "or"
// 3. "then"
// 4. "and"
// 5. "this or other"
var matches = Regex.Matches(searchText, #"([^\""]*)", RegexOptions.Singleline);
var list = new List<string>();
foreach (var match in matches.Cast<Match>())
{
var value = match.Groups[0].Value.Trim();
if (!string.IsNullOrEmpty(value))
{
list.Add(value);
}
}
// TODO: Do something with the list of strings.
}