REGEX Replacing with exception

My first problem here is my nemesis, regex.
I need a regex to replace every , with "," in a text without replacing existing ,".
It looks like this:
Before:
abcd,efgh,ijkl,"","",mnop
After:
abcd","efgh","ijkl","","","mnop
I hope you can help me.

Solving a problem using regular expressions is nice but now you have two problems.
A simple solution that does not involve regular expressions is to do three simple string replacements: first replace , with ",", then replace ","" with ",", and finally replace ""," with ",".
Let's see why this works:
original | after 1st replacement | after 2nd replacement | after 3rd replacement
---------+-----------------------+-----------------------+----------------------
a,b      | a","b                 | a","b                 | a","b
m",n     | m"","n                | m"","n                | m","n
x,"y     | x",""y                | x","y                 | x","y
See it in action:
const input = 'abcd,efgh,ijkl,"","",mnop';
const output = input.replace(/,/g, '","').replace(/",""/g, '","').replace(/"","/g, '","');
console.log(output);
N.B. The code snippet above uses regular expressions because this is how JavaScript implements the "replace all" functionality. When the first argument of String.replace() is a string it replaces only its first occurrence.
I could use String.replaceAll() instead (it works with strings) but it is not widely supported by browsers yet.
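For reference, the same three-step chain can be written with plain strings where String.prototype.replaceAll() is available (it was standardized in ES2021). This is only a sketch under that assumption; the variable names are illustrative.

```javascript
// Assumes a runtime that provides String.prototype.replaceAll (ES2021).
const csvLine = 'abcd,efgh,ijkl,"","",mnop';
const quoted = csvLine
  .replaceAll(',', '","')     // step 1: wrap every comma in quotes
  .replaceAll('",""', '","')  // step 2: drop the doubled quote after a comma
  .replaceAll('"","', '","'); // step 3: drop the doubled quote before a comma
console.log(quoted); // abcd","efgh","ijkl","","","mnop
```

Because the patterns are plain strings here, no regex escaping is involved at all.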

Crudely, I think you are after something like:
(?:(?<!"),"|(?<=")",(?!")|(?<!"),(?!"))
Note: As mentioned by @WiktorStribiżew in the comments, you could get rid of the outer non-capturing group: (?<!"),"|(?<=")",(?!")|(?<!"),(?!")
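As a quick check, the pattern can be run against the sample from the question. The sketch below assumes a JavaScript engine with lookbehind support and uses "," as the replacement string, which the answer implies but does not spell out. The variable names are illustrative.

```javascript
// The alternation without the outer non-capturing group.
const commaPattern = /(?<!"),"|(?<=")",(?!")|(?<!"),(?!")/g;
const sample = 'abcd,efgh,ijkl,"","",mnop';
const result = sample.replace(commaPattern, '","');
console.log(result); // abcd","efgh","ijkl","","","mnop
```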

Related

Match a word in a list of words regex

I want the user to only be able to enter the values in the following regex:
^[AB | BC | MB | NB | NL | NS | NT | NU | ON |QC | PE | SK | YT]{2}$
My problem is that inputs such as PP, AA, and QQ are accepted.
I am not sure how I can prevent that. Thank you.
The site I use to verify the expression: https://regex101.com/

In most regex flavors, square brackets [] denote a character class; that is, a set of individual characters, any one of which can match at that position.
Because P is in this character class (and so are the spaces and the | characters), the {2} quantifier lets PP match.
Instead, you seem to want a group with alternatives; for that, you'd use parentheses () (while also removing the whitespace, which doesn't appear to have been intentional on your part):
^(AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT){2}$
With the {2} quantifier kept, this matches two codes in a row, such as ABBC, ABAB, or NLBC. To accept exactly one code, drop the quantifier: ^(AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT)$
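To see the difference between the character class and the group, the sketch below tests a few inputs in JavaScript. It leaves out the {2} quantifier on the assumption that exactly one province code should be accepted; the helper name isProvinceCode is made up for illustration.

```javascript
// Original attempt: a character class, so any two characters from the set match.
const charClass = /^[AB | BC | MB | NB | NL | NS | NT | NU | ON |QC | PE | SK | YT]{2}$/;
// Grouped alternatives with no quantifier: exactly one listed code.
const oneCode = /^(AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT)$/;

const isProvinceCode = (value) => oneCode.test(value);

console.log(charClass.test('PP'));   // true
console.log(isProvinceCode('PP'));   // false
console.log(isProvinceCode('QC'));   // true
console.log(isProvinceCode('ABBC')); // false
```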

What is the right regex for extracting string after first symbol?

In google data studio I would like to make a REGEXP_EXTRACT for getting the string after the first | symbol (https://regex101.com/r/w3BqW4/2). I've tried the regex:
[|].*?$
but this is returning:
' | Leren afzaklaarzen Elisio | kalk'
So I still need to lose the first ' | '. Can anyone help me?
Example:
Input: Toral | Leren afzaklaarzen Elisio | kalk
Output: Leren afzaklaarzen Elisio | kalk

Adding a capturing group () to the initial RegEx does the trick:
REGEXP_EXTRACT(Field, "[|](.*)?$")
Adding the suggestion by cricket_007 "You cannot have "zero or one" of "zero or more" characters":
REGEXP_EXTRACT(Field, "[|](.*)$")

If you want to capture every character after the first vertical bar, that would look like this
\\|(.*)$
You don't need a question mark after the .*
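The capture-group idea can be tried outside Data Studio as well. The sketch below uses JavaScript rather than REGEXP_EXTRACT, so the backslash is not doubled, and it adds \s* after the bar to skip the leading space; that addition is an assumption about the desired output, not part of the answers above.

```javascript
const title = 'Toral | Leren afzaklaarzen Elisio | kalk';
// First "|", optional whitespace, then capture the rest of the string.
const match = title.match(/\|\s*(.*)$/);
const afterFirstBar = match ? match[1] : null;
console.log(afterFirstBar); // Leren afzaklaarzen Elisio | kalk
```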

How to use regex with cson

I want to capture logical operators from ooRexx with a regex in a .cson file, because I want to support syntax highlighting of ooRexx in the Atom editor. These are the operators I am trying to cover:
>= <= \> \< \= >< <> == \== // && || ** ¬> ¬< ¬= ¬== >> << >>= \<< ¬<< \>> ¬>> <<=
And this is the regex part in the cson file:
'match': '\\+ | - | [\\\\] | \\/ | % | \\* | \\| | & |=|¬|>|<|
>= | <= | ([\\\\]>) | ([\\\\]<) | ([\\\\]=) | >< | <> | == | ([\\\\]==) |
\\/\\/ | && | \\|\\| | \\*\\* | ¬> | ¬< | ¬= | ¬== | >> | << | >>= | ([\\\\]<<) | ¬<< |
([\\\\]>>) | ¬>> | <<='
I'm struggling with the slashes (forward and backward) and also with the double **. My knowledge of regex is very basic, to put it nicely. Can somebody help me with that?

You have spaces around the pipe bars: these spaces are counted in the regular expression. So when you write something like | \*\* |, the double asterisks get caught, but only if they are surrounded by a space on each side, and not if they're affixed to a word or at the beginning/end of a line. Same issue with the slashes — I have tested it, and it does seem to catch them for me, but only as long as your slashes (or asterisks) are between two spaces.
A few other things to keep in mind:
You shouldn't need the square brackets around backslashes; they're useful to provide classes of possible characters to match. For instance, [<>]= will catch both >= and <=. Writing [\\] is equivalent to writing \\ directly because \\ counts as a single character, due to the first escaping backslash. Similarly, your parentheses here are not being used; see grouping.
Also think of using repetition operators like + and *. So \\>+ will catch both \> and \>>.
Finally, the question mark will help you avoid repetition, by marking the previous character (or group of characters, in square brackets) as optional. ==? will match both = and ==.
You can group together a LOT of your statements with these three tricks combined… I'll leave that exercise to you!
Just another hint when developing long regular expressions — use a tester like Regex101 or similar with a test file to see your changes in real time, and debuggers like Regexper will help you understand how your regular expression is parsed.
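If consolidating the alternatives by hand gets error-prone, a different route is to generate the pattern from the operator list. The JavaScript sketch below is not the hand-grouped pattern suggested above; it escapes each operator and sorts longer ones first so that >>= wins over >>. The names operators, escapeRegex and operatorRegex are illustrative, and the printed pattern would presumably still need its backslashes doubled when pasted into a single-quoted string in the .cson file.

```javascript
// Operators from the question ("\\" is one literal backslash in a JS string).
const operators = [
  '>=', '<=', '\\>', '\\<', '\\=', '><', '<>', '==', '\\==', '//', '&&', '||',
  '**', '¬>', '¬<', '¬=', '¬==', '>>', '<<', '>>=', '\\<<', '¬<<', '\\>>',
  '¬>>', '<<=',
];

// Escape characters that have a special meaning inside a regex.
const escapeRegex = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Longest operators first, so ">>=" is tried before ">>".
const pattern = [...operators]
  .sort((a, b) => b.length - a.length)
  .map(escapeRegex)
  .join('|');
const operatorRegex = new RegExp(pattern, 'g');

console.log(pattern);
const found = 'a>>=b \\== c**d || e'.match(operatorRegex);
console.log(found); // the four operators in the sample, in order
```

Note that there are no spaces around the | separators in the generated pattern, which is the main issue described above.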

extract a variable value from the middle of a string

I have been trying to figure this out for quite some time. How do I get the PID value from the following string using PowerShell? I thought regex was the way to go, but I can't quite figure out the syntax.
For what it is worth, everything except the PID will remain the same.
$foo = '<VALUE>I am just a string and the string is the thing. PID:25973. After this do that and blah blah.</VALUE>'
I have tried the following in regex
[regex]::Matches($foo, 'PID:.*') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>') | % {$_.Captures[0].Groups[1].value}
[regex]::Matches($foo, 'PID:*?>(.+).') | % {$_.Captures[0].Groups[1].value}

For your regex you'll want to indicate what's before and after the portion you're looking for. PID:.* will find everything from the PID to the end of the string.
And to use a capture group you'll want to have some ( and ) in your regex, which defines a group.
So try this on for size:
[regex]::Matches($foo,'PID:(\d+)') | % {$_.Captures[0].Groups[1].value}
I'm using a regex of PID:(\d+). The \d+ means "one or more digits". The parentheses around that (\d+) identifies it as a group I can access using Captures[0].Groups[1].

Here's another option. Basically, it replaces the whole string with the first capture group (the digits after 'PID:'):
$foo -replace '^.+PID:(\d+).+$','$1'

Regex: how to determine odd/even number of occurrences of a char preceding a given char?

I would like to replace the | with OR only in unquoted terms, eg:
"this | that" | "the | other" -> "this | that" OR "the | other"
Yes, I could split on space or quote, get an array and iterate through it, and reconstruct the string, but that seems ... inelegant. So perhaps there's a regex way to do this by counting "s preceding | and obviously odd means the | is quoted and even means unquoted. (Note: Processing doesn't start until there is an even number of " if there is at least one ").

It's true that regexes can't count, but they can be used to determine whether there's an odd or even number of something. The trick in this case is to examine the quotation marks after the pipe, not before it.
str = str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, "OR");
Breaking that down, (?:[^"]*"){2} matches the next pair of quotes if there is one, along with the intervening non-quotes. After you've done that as many times as possible (which might be zero), [^"]*$ consumes any remaining non-quotes until the end of the string.
Of course, this assumes the text is well-formed. It doesn't address the problem of escaped quotes either, but it can if you need it to.
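Applied to an input that mixes quoted and unquoted terms, the lookahead behaves as in the sketch below. The variable names are illustrative, and well-formed (balanced) quotes are assumed.

```javascript
// A pipe is replaced only if an even number of quotes follows it,
// i.e. the pipe itself sits outside any quoted term.
const unquotedPipe = /\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g;
const query = '"this | that" | "the | other" | yet | another';
const rewritten = query.replace(unquotedPipe, 'OR');
console.log(rewritten);
// "this | that" OR "the | other" OR yet OR another
```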

Regexes do not count. That's what parsers are for.

You might find the Perl FAQ on this issue relevant.
#!/usr/bin/perl
use strict;
use warnings;
my $x = qq{"this | that" | "the | other"};
print join('" OR "', split /" \| "/, $x), "\n";

You don't need to count, because you don't nest quotes. This will do:
#!/usr/bin/perl
my $str = '" this \" | that" | "the | other" | "still | something | else"';
print "$str\n";
while($str =~ /^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/) {
    $str =~ s/^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/$1OR/;
}
print "$str\n";
Now, let's explain that expression.
^ -- means you'll always match everything from the beginning of the string; otherwise
     the match might start inside a quote and break everything.
(...)\| -- this means you'll match a certain pattern, followed by a |, which appears
     escaped here; so when you replace it with $1OR, you keep everything but
     replace the |.
(?:...)* -- This is a non-capturing group, which can be repeated multiple times; we
     use a group here so we can repeat the alternative patterns multiple times.
[^"|\\]* -- This is the first pattern. Anything that isn't a pipe, an escape character
     or a quote.
\\. -- This is the second pattern. Basically, an escape character and anything
     that follows it.
"(?:...)*" -- This is the third pattern. Open quote, followed by another
     non-capturing group repeated multiple times, followed by a closing
     quote.
[^\\"] -- This is the first pattern in the second non-capturing group. It's anything
     except an escape character or a quote.
\\. -- This is the second pattern in the second non-capturing group. It's an
     escape character and whatever follows it.
The result is as follows:
" this \" | that" | "the | other" | "still | something | else"
" this \" | that" OR "the | other" OR "still | something | else"

Another approach (similar to Alan M's working answer):
str = str.replace(/(".+?"|\w+)\s*\|\s*/g, '$1 OR ');
The part inside the first group (spaced for readability):
".+?" | \w+
... basically means, something quoted, or a word. The remainder means that it was followed by a "|" wrapped in optional whitespace. The replacement is that first part ("$1" means the first group) followed by " OR ".
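A quick check of this pattern in JavaScript, using an input that mixes quoted and unquoted terms. One caveat that seems worth knowing: when the last term is quoted and contains a pipe, the \w+ alternative can match a word inside the quotes, so the sketch prints that case too. The names termThenPipe and toOr are illustrative.

```javascript
const termThenPipe = /(".+?"|\w+)\s*\|\s*/g;
const toOr = (text) => text.replace(termThenPipe, '$1 OR ');

console.log(toOr('"this | that" | "the | other" | yet | another'));
// "this | that" OR "the | other" OR yet OR another

// Trailing quoted term: the inner pipe is rewritten as well.
console.log(toOr('"this | that" | "the | other"'));
// "this | that" OR "the OR other"
```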

Perhaps you're looking for something like this:
(?<=^([^"]*"[^"]*")+[^"|]*)\|

Thanks everyone. Apologies for neglecting to mention this is in javascript and that terms don't have to be quoted, and there can be any number of quoted/unquoted terms, eg:
"this | that" | "the | other" | yet | another -> "this | that" OR "the | other" OR yet OR another
Daniel, it seems that's in the ballpark, ie basically a matching/massaging loop. Thanks for the detailed explanation. In js, it looks like a split, a forEach loop on the array of terms, pushing a term (after changing a | term to OR) back into an array, and a re join.

@Alan M, works nicely; escaping is not necessary due to the sparseness of SQLite FTS capabilities.
@epost, accepted your solution for brevity and elegance, thanks. It merely needed to be put in a more general form for Unicode etc.
(".+?"|[^\"\s]+)\s*\|\s*

My solution in C# to count the quotes and then regex to get the matches:
// Requires: using System.Collections.Generic; using System.Linq;
//           using System.Text.RegularExpressions;
// Count the number of quotes.
var quotesOnly = Regex.Replace(searchText, @"[^""]", string.Empty);
var quoteCount = quotesOnly.Length;
if (quoteCount > 0)
{
    // If the quote count is an odd number there's a missing quote.
    // Assume a quote is missing from the end - executive decision.
    if (quoteCount % 2 == 1)
    {
        searchText += @"""";
    }
    // Get the runs of text between the quotes. Exclude the quotes themselves.
    // e.g. The following line:
    //   "this and that" or then and "this or other"
    // will result in the following strings after trimming:
    //   1. this and that
    //   2. or then and
    //   3. this or other
    var matches = Regex.Matches(searchText, @"([^\""]*)", RegexOptions.Singleline);
    var list = new List<string>();
    foreach (var match in matches.Cast<Match>())
    {
        // Skip the empty matches the pattern produces at each quote.
        var value = match.Groups[0].Value.Trim();
        if (!string.IsNullOrEmpty(value))
        {
            list.Add(value);
        }
    }
    // TODO: Do something with the list of strings.
}
