Regular Expressions in Google Sheets

I'm trying to use regular expressions within Google Sheets. In that environment some functionality seems to be missing or, potentially, just different.
I would like to use a regexmatch function that returns true if the range in question contains any of the following strings:
"string1"
"string2"
"string3"
I tried =regexmatch(range,"([Ss]tring1|[Ss]tring2|[Ss]tring3)")
This works.
But my developer colleague said he would usually just end the expression with /i to say "be case insensitive":
=regexmatch(range,"/(String1|String2|String3)/i")
But since Google Sheets does not use "/" to delimit a regular expression, is there another way to tell the function to ignore case?
Also, is there a way to negate the expression? That is, instead of:
=NOT(regexmatch(range,"([Ss]tring1|[Ss]tring2|[Ss]tring3)"))
can you do something like:
=regexmatch(range,"!=([Ss]tring1|[Ss]tring2|[Ss]tring3)")

You can try wrapping your range in the LOWER function, so that the values are compared as if they were all lower case, regardless of their actual case.
=REGEXMATCH(lower(range),"string1|string2|string3")

is there another way to tell the function to ignore case?
Please try:
=regexmatch(range,"(?i)string1|string2|string3")
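For illustration only, here is a rough Python sketch of the two suggestions above. Python's re module is just a stand-in for the engine Sheets uses, and the sample cell values are made up. It also shows negation done by inverting the boolean result, the way NOT(...) does in the question.

```python
import re

# Hypothetical cell values standing in for a Sheets range.
cells = ["String1 here", "no match", "has STRING3"]

# Approach 1: lowercase the input first, then use a lowercase pattern.
lowered = [bool(re.search(r"string1|string2|string3", c.lower())) for c in cells]

# Approach 2: leave the input alone and turn on case-insensitivity
# with the inline (?i) flag at the very start of the pattern.
flagged = [bool(re.search(r"(?i)string1|string2|string3", c)) for c in cells]

# Negation: invert the boolean result, like wrapping the call in NOT().
negated = [not hit for hit in flagged]

print(lowered)
print(flagged)
print(negated)
```

Both approaches should agree on every cell; the inline flag just avoids touching the input.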

Related

Regular Expression groups ignoring comma inside parenthesis

I know that there are plenty of regular expression questions around here similar to what I am going to ask, but I couldn't find one that actually helps me.
This one got close, but it uses Java split method, but I need to capture the values using only regular expressions:
Java: splitting a comma-separated string but ignoring commas in quotes
So, what I need to do is, given the below input:
string,string([a-zA-Z]{0,9}),integer
I would like to capture 3 matches:
string
string([a-zA-Z]{0,9})
integer
Note that inside the parentheses we can have a regular expression, which means almost any characters, even a comma.
I can't use split here, because I am not using Java, but an internal declarative programming that uses ICU regular expressions and has an API for capturing groups, but not a regex based split method.
Any help would be appreciated. And I am really sorry if there exists other posts that could be duplicated as this one, but I have spent a few hours looking around, and even played with the post I mentioned, but couldn't get to a solution.
Thanks
EDIT
The input I provided is just an example, but other inputs are also possible.
Besides, after @sin's comments, I have reviewed the input, and we can actually assume we'll have quotes inside the parentheses, like this:
string("[\w]{0,9}"),integer,string
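One possible direction, sketched in Python rather than ICU and under stated assumptions: treat a field as a run of characters that are neither a comma nor an opening parenthesis, or a whole parenthesised group taken as one unit. This assumes the parentheses are not nested and that the group itself contains no closing parenthesis. The helper name `fields` is hypothetical.

```python
import re

# One field = characters that are neither "," nor "(",
# or a complete "( ... )" group consumed as a unit.
# Assumption: no nested parentheses and no ")" inside the group.
FIELD = re.compile(r"(?:[^,(]|\([^)]*\))+")

def fields(text):
    """Return every field found in text, in order of appearance."""
    return FIELD.findall(text)

first = fields("string,string([a-zA-Z]{0,9}),integer")
second = fields(r'string("[\w]{0,9}"),integer,string')
print(first)
print(second)
```

The pattern uses only character classes, alternation and a non-capturing group, so it should carry over to most engines, but that has not been checked against ICU here.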

Using SIMILAR TO for a regex?

Why is the following instruction returning FALSE?
SELECT '[1-3]{5}' SIMILAR TO '22222' ;
I can't find what is wrong with that, according to the Postgres doc ...

Your basic error has already been answered.
More importantly, don't use SIMILAR TO at all. It's completely pointless:
Query performance in PostgreSQL using 'similar to'
Difference between LIKE and ~ in Postgres
Use LIKE, or the regular expression match operator ~, or other pattern matching operators:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
For 5 digits between 1 and 3 use the expression @Thomas provided.
If you actually want 5 identical digits between 1 and 3, as your example suggests, I suggest a back reference:
SELECT '22222' ~ '([1-3])\1{4}';
Related answer with more explanation:
Deleting records with number repeating more than 5
sqlfiddle demonstrating both.
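To make the difference between "five digits from 1 to 3" and "the same digit five times" concrete, here is a small Python sketch. It is not Postgres: Python's re.fullmatch is used as a stand-in and requires the whole string to match, and the sample strings are invented.

```python
import re

# Any five digits, each between 1 and 3 (the digits may differ).
any_five = re.compile(r"[1-3]{5}")
# One digit between 1 and 3, then that same digit four more times.
same_five = re.compile(r"([1-3])\1{4}")

samples = ["22222", "12312", "44444"]
for s in samples:
    # fullmatch succeeds only if the entire string matches the pattern.
    print(s, bool(any_five.fullmatch(s)), bool(same_five.fullmatch(s)))
```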

The operator is defined as:
string SIMILAR TO pattern
so the first parameter is the string that you want to compare. The second parameter is the regex to compare against.
You need:
SELECT '22222' SIMILAR TO '[1-3]{5}';
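The effect of swapping the operands can be mimicked in Python. The helper `similar_to` below is hypothetical and only approximates the operator: it uses Python regex syntax, which overlaps with SQL regular expressions for this simple pattern but is not the same language.

```python
import re

def similar_to(string, pattern):
    """Hypothetical helper: whole-string match, string first, pattern second."""
    return re.fullmatch(pattern, string) is not None

correct = similar_to("22222", "[1-3]{5}")   # string on the left, pattern on the right
swapped = similar_to("[1-3]{5}", "22222")   # operands swapped, as in the question
print(correct)
print(swapped)
```

With the operands swapped, the literal pattern 22222 is compared against the text [1-3]{5}, which cannot match.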

Try:
SELECT '22222' ~ '[1-3]{5}'
SIMILAR TO does not use POSIX regular expressions. From the Postgres documentation:
The SIMILAR TO operator returns true or false depending on whether its pattern matches the given string. It is similar to LIKE, except that it interprets the pattern using the SQL standard's definition of a regular expression. SQL regular expressions are a curious cross between LIKE notation and common regular expression notation.
...
POSIX regular expressions provide a more powerful means for pattern matching than the LIKE and SIMILAR TO operators. Many Unix tools such as egrep, sed, or awk use a pattern matching language that is similar to the one described here.
http://www.postgresql.org/docs/9.2/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP

REGEXP_LIKE AND REGEXP_INSTR which one to use

I have the following code which uses regexp_like(), but when I pass the expression ^[0-5]\.[\d]+$ to regexp_like(), it doesn't return the correct result.
Should I use regexp_instr?
How do I get to know, which one to use?

They're two different functions with different goals so you should use the one most appropriate to your situation.
REGEXP_LIKE() is a condition that evaluates to true or false, so it is used wherever a condition is allowed, most commonly the WHERE clause. Use it when you want to return rows that match a pattern.
REGEXP_INSTR() returns an integer that indicates the beginning or end position of the matched substring. It does not have to be used in the WHERE clause.
Essentially, where regexp_instr(...,...) > 0 is identical to a REGEXP_LIKE but it can be used in a lot more situations.
Please read the linked documentation on both.
As to why your condition doesn't return the correct result it'll be because your regular expression doesn't adequately describe the rows you want returned.
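As a loose analogy in Python (not Oracle), the two helpers below mimic the distinction: one answers yes or no, the other returns a 1-based position with 0 meaning no match. Both function names are hypothetical stand-ins and ignore the extra arguments the real functions accept.

```python
import re

def regexp_like(text, pattern):
    """Hypothetical stand-in: True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

def regexp_instr(text, pattern):
    """Hypothetical stand-in: 1-based start of the first match, 0 if none."""
    m = re.search(pattern, text)
    return m.start() + 1 if m else 0

rows = ["abc123", "no digits", "42"]
for row in rows:
    # A position greater than 0 carries the same information as the boolean.
    print(row, regexp_like(row, r"[0-9]+"), regexp_instr(row, r"[0-9]+"))
```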

Just a guess here, but I think regexp_like is already anchored at the start and end, otherwise regexp_instr would be redundant.

I am not sure that your regex engine supports the \d character class. Try these syntaxes instead (with regexp_like):
^[0-5]\.[0-9]+$
or
^[0-5]\.[[:digit:]]+$
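To see what the first pattern accepts, here is a quick check in Python. Only the [0-9] form is used, because Python's re module does not support POSIX classes such as [[:digit:]]; the sample values are invented.

```python
import re

# One digit from 0 to 5, a literal dot, then one or more ASCII digits.
pattern = re.compile(r"^[0-5]\.[0-9]+$")

samples = ["3.14", "0.5", "6.1", "5.", "12.3"]
results = {s: bool(pattern.search(s)) for s in samples}
print(results)
```

Values such as 6.1 (leading digit out of range), 5. (no digits after the dot) and 12.3 (two digits before the dot) are rejected.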

In Oracle REGEXP_LIKE is a condition and REGEXP_INSTR is a function.
You typically use conditions in the WHERE clause and a few other places. You use functions in any expression.
Without more details it's hard to tell which one is more suitable in your case, but ultimately both of them do essentially the same thing. The representation of the result is of course different, as per the links above, and you have to account for that.

How to create regular expression to get all functions from code

I have a problem with my regular expression. I need to find all functions in a text. I have this regular expression: \w*\([^(]*\). It works fine as long as the text does not contain brackets without a function name. For example, for the string 'hello world () testFunction()' it returns () and testFunction(), but I need only testFunction(). I want to use it in my C# application to parse a string passed to my method. Can anybody help me?
Thanks!

Programming languages have a hierarchical structure, which means that they cannot be parsed by simple regular expressions in the general case. If you want to write correct code that always works, you need to use a real parser (for example, an LR parser). If you simply want to apply a hack that will pick up most functions, use something like:
\w+\([^)]*\)
But keep in mind that this will fail in some cases. E.g. it cannot differentiate between a function definition (signature) and a function call, because it does not look at the context.
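A quick comparison on the string from the question, written in Python as a stand-in for the C# regex engine (the constructs used here behave the same way in both, as far as this example goes):

```python
import re

text = "hello world () testFunction()"

# Pattern from the question: \w* allows an empty name, so "()" matches too.
original = re.findall(r"\w*\([^(]*\)", text)
# Suggested pattern: \w+ requires at least one word character before "(".
fixed = re.findall(r"\w+\([^)]*\)", text)

print(original)
print(fixed)
```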

Try \w+\([^(]*\)
Here I have changed \w* to \w+. This means that the match must contain at least one word character before the bracket.
Hope that helps

Change the * to + (if it exists in your regex implementation, otherwise do \w\w*). This will ensure that \w is matched one or more times (rather than the zero or more that you currently have).

It largely depends on the definition of "function name". For example, based on your description you only want to filter out the "empty" names, not find all valid names.
If your current solution is mostly good enough and your only problem is these empty names, then try changing the * to a +, which requires at least one word character right before the bracket.
\w+\([^(]*\)
OR
\w\w*\([^(]*\)
Depending on your regexp application's syntax.

(\w+)\(
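The regex groups will hold the names of the functions without any parentheses; you can add them back later if you want. I assumed you don't need the parameters.
If you do need the parameters then use:
\w+\(.*\)
for a greedy match that runs to the last closing parenthesis (it spans nested function calls, but also everything between two separate calls on the same line)
or...
\w+\([^)]*\)
for a match that stops at the first closing parenthesis (it does not handle nested function calls: the match is cut off at the closing parenthesis of the inner call)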
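The behaviour of these three patterns on nested calls is easy to misjudge, so here is a small Python sketch with an invented input string. Python is used only as a stand-in for the C# engine.

```python
import re

text = "outer(inner(1), 2) + other(3)"

# Names only: capture the word characters right before each "(".
names = re.findall(r"(\w+)\(", text)
# Greedy .* runs to the last ")" in the text, swallowing both calls.
greedy = re.findall(r"\w+\(.*\)", text)
# [^)]* stops at the first ")", so the nested call is cut short.
bounded = re.findall(r"\w+\([^)]*\)", text)

print(names)
print(greedy)
print(bounded)
```

Neither parameter-matching pattern returns balanced calls for nested input, which is the limitation a real parser avoids.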

Is stringing together multiple regular expressions with "or" safe?

We have a configuration file that lists a series of regular expressions used to exclude files for a tool we are building (it scans .class files). The developer has appended all of the individual regular expressions into a single one using the OR "|" operator like this:
rx1|rx2|rx3|rx4
My gut reaction is that there will be an expression that will screw this up and give us the wrong answer. He claims no; they are ORed together. I cannot come up with a case to break this but still feel uneasy about the implementation.
Is this safe to do?

Not only is it safe, it's likely to yield better performance than separate regex matching.
Take the individual regex patterns and test them. If they work as expected then OR them together and each one will still get matched. Thus, you've increased the coverage using one regex rather than multiple regex patterns that have to be matched individually.

As long as they are valid regexes, it should be safe. Unclosed parentheses, brackets, braces, etc would be a problem. You could try to parse each piece before adding it to the main regex to verify they are complete.
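A minimal sketch of that validation step, in Python rather than the Java the tool presumably uses. The function name `combine` and the sample patterns are hypothetical; each piece is compiled on its own first so that a malformed one is rejected before the pieces are joined.

```python
import re

def combine(patterns):
    """Compile each piece alone first, then join the pieces with "|"."""
    for p in patterns:
        re.compile(p)  # raises re.error if this piece is malformed
    # A non-capturing group around each piece keeps it self-contained.
    return re.compile("|".join(f"(?:{p})" for p in patterns))

good = combine([r"Foo\d+", r"Bar[A-Z]"])
matched = bool(good.search("xFoo12"))

try:
    combine([r"Foo\d+", r"Bar("])  # second piece has an unclosed parenthesis
    rejected = False
except re.error:
    rejected = True

print(matched)
print(rejected)
```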
Also, some engines have escapes that can toggle regex flags within the expression (like case sensitivity). I don't have enough experience to say if this carries over into the second part of the OR or not. Being a state machine, I'd think it wouldn't.

It's as safe as anything else in regular expressions!

As far as regexes go, Google Code Search provides regexes for searches, so it's possible to have safe regexes.

I don't see any possible problem either.
I guess by saying "safe" you mean that it will match as you need (because I've never heard of a regex security hole). Safe or not, we can't tell from this. You need to give us more detail, like what the full regex is. Do you wrap it in a group and allow multiple matches? Do you wrap it with start and end anchors?
If you want to match a few class file names, make sure you use start and end anchors so that the match covers the whole name from start to end, like this: "^(file1|file2)\.class$". Without start and end anchors, you may end up matching 'my_file1.class' too.
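The effect of the anchors can be checked with a short Python sketch. The file names are invented, and Python's re.search is used as a stand-in for a "find anywhere" match.

```python
import re

names = ["file1.class", "file2.class", "my_file1.class", "file1.class.bak"]

# Without anchors the pattern may match anywhere inside the name.
loose = re.compile(r"(file1|file2)\.class")
# With anchors the whole name has to match.
anchored = re.compile(r"^(file1|file2)\.class$")

loose_hits = [n for n in names if loose.search(n)]
anchored_hits = [n for n in names if anchored.search(n)]

print(loose_hits)
print(anchored_hits)
```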

The answer is that yes this is safe, and the reason why this is safe is that the '|' has the lowest precedence in regular expressions.
That is:
regexpa|regexpb|regexpc
is equivalent to
(regexpa)|(regexpb)|(regexpc)
with the obvious exception that the second would produce capture groups whereas the first would not; the two match exactly the same input. Or to put it another way, in Java parlance:
String.matches("regexpa|regexpb|regexpc");
is equivalent to
String.matches("regexpa") | String.matches("regexpb") | String.matches("regexpc");
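The same equivalence can be spot-checked in Python, where re.fullmatch plays the role of Java's whole-string matches. The three sample patterns and the test strings are made up, and a handful of samples is of course a sanity check, not a proof.

```python
import re

parts = ["ab+", "c[0-9]", "xyz"]
combined = re.compile("|".join(parts))

def any_part_matches(s):
    """True if at least one individual pattern matches the whole string."""
    return any(re.fullmatch(p, s) is not None for p in parts)

samples = ["abbb", "c7", "xyz", "abc", "c77", ""]
agreement = [
    (combined.fullmatch(s) is not None) == any_part_matches(s) for s in samples
]
print(agreement)
```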

We Keep Coding
