REGEXP_LIKE AND REGEXP_INSTR which one to use

REGEXP_LIKE AND REGEXP_INSTR which one to use - regex

I have the following code which uses regexp_like(), but when I write the exp: ^[0-5]\.[\d]+$ to the regexp_like(), it doesn't return me the correct result.
Should I use regexp_instr?
How do I get to know, which one to use?

They're two different functions with different goals so you should use the one most appropriate to your situation.
REGEXP_LIKE() returns a Boolean and can only be used in the WHERE clause, it's used when you want to return rows that match a condition.
REGEXP_INSTR() returns an integer, which indicates the beginning or or end of the matched substring. It does not have to be used in the WHERE clause.
Essentially, where regexp_instr(...,...) > 0 is identical to a REGEXP_LIKE but it can be used in a lot more situations.
Please read the linked documentation on both.
As to why your condition doesn't return the correct result it'll be because your regular expression doesn't adequately describe the rows you want returned.

Just a guess here, but I think regexp_like is already anchored at the start and end, otherwise regexp_instr would be redundant.

I am not sure that your regex engine support the \d character class. Try these syntaxes instead (with regexp_like):
^[0-5]\.[0-9]+$
or
^[0-5]\.[[:digit:]]+$

In Oracle REGEXP_LIKE is a condition and REGEXP_INSTR is a function.
You typically use conditions in the WHERE clause and a few other places. You use functions in any expression.
Without more details it's hard to tell which one is more suitable in your case, but ultimately both of them do exactly the same. The representation of result is of course different as per the links above and you have to account for that.

Related

REGEX: PCRE atomic group doesn't work

In my PCRE regular expression I used an atomic group to reduce backtracks.
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
REGEX101
But in the example debug, I see that after f checking ends, it goes further for other options. I'm trying to stop it after f check fails so it doesn't check the rest of expression. What's wrong?

I will assume you know what you're doing by using regex here, since there's probably an argument to be made that PCRE is not the best approach to implementing this sort of matching in a "tree"-like fashion. But I'm not fussed about that.
The idea of using conditionals isn't bad, but it adds extra steps in the form of the conditions themselves. Also, you can only branch off in two directions per conditional.
PCRE has a feature called "backtracking control verbs" which allow you to do precisely what you want. They have varying levels of control, and the one I would suggest in this case is the strongest:
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(*COMMIT)(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
https://regex101.com/r/p572K8/2
Just by adding a single (*COMMIT) verb after the 'f' branch, it's cut the number of steps required to find a failure in this case by half.
(*COMMIT) tells the engine to commit to the match at that point. It won't even re-attempt the match starting from </ again if no match is found.
To fully optimize the expression, you'll have to add (*COMMIT) at every point after branching has occurred.
Another thing you can do is try to re-order your alternatives in such a way as to prioritize those that are found most commonly. That might be something else to consider in your optimization process.

Because that's how atomic group works. The idea is:
at the current position, find the first sequence that matches the pattern inside atomic grouping and hold on to it.
(Source: Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?)
So if there is no match inside an atomic group, it will iterate through all options.
You can use conditionals instead:
</?\s*\b(?(?=a)a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|(?(?=b)b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|(?(?=c)c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|(?(?=d)d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|(?(?=e)em(?:bed)?|(?(?=f)f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|(?(?=h)h(?:[1-6r]|ead(?:er)?|tml)|(?(?=i)i(?:frame|mg|nput|ns)?|(?(?=k)kbd|(?(?=l)l(?:abel|egend|i(?:nk)?)|(?(?=m)m(?:a(?:in|p|rk)|et(?:a|er))|(?(?=n)n(?:av|o(?:frames|script))|(?(?=o)o(?:bject|l|pt(?:group|ion)|utput)|(?(?=p)p(?:aram|icture|re|rogress)?|(?(?=q)q|(?(?=r)r[pt]|(?(?=r)ruby|(?(?=s)s|(?(?=s)s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|(?(?=t)t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|(?(?=u)ul?|(?(?=v)v(?:ar|ideo)|wbr))))))))))))))))))))))\b
Regex101

Regular Expressions in Google Sheets

I'm trying to use regular expressions within Google Sheets. Given that the environment is within GSheets some functionality seems to be missing or, potentially just different.
I would like to use a regexmatch function that returns true if the range in question contains any of the following strings:
"string1"
"string2"
"string3"
I tried =regexmatch(range,"([Ss]tring1|[[Ss]tring2|[Ss]tring3)"
This works.
But my developer colleague said he would usually just end the expression /i to say "Be case insensitive"
=regexmatch(range,"/(String1|String2|String3)/i"
But since Gsheets does not use "/" to open a regular expression, is there another way to tell the function to ignore case?
Also, is there a way to negate the expression? That is, instead of:
=NOT(regexmatch(range,"([Ss]tring1|[[Ss]tring2|[Ss]tring3)")
Can you do something like
=regexmatch(range,"!=([Ss]tring1|[[Ss]tring2|[Ss]tring3)"

you can try wrapping your range with the "lower" function, so compares the values as if they are all lower case regardless of whether they really are or not.
=REGEXMATCH(lower(range),"string1|string2|string3")

is there another way to tell the function to ignore case?
Please try:
=regexmatch(range,"(?i)string1|string2|string3")

Finding match between optional tokens?

For the strings:
text::handle:e#ma.il::text
text::chat_identifier:chat0123456789&text
I have the current regex:
m/(handle:|chat_identifier:)(.+?)(:{2}|&)/
And I am currently using $2 in order to obtain the value I wish (in the first string e#ma.il and in the second, chat0123456789).
Is there a better/faster/simpler way to solve this problem, though?

Whether it's "better" or not depends on the context, but you could take this approach: split the string on ":" and take the fourth element of the resulting list. That's arguably more readable than the regex and more robust if the third field can be something other than "handle" or "chat_identifier".
I think the speed would be very similar for either approach but probably for almost any implementation in perl. I'd want to show that speed was critical for this step before worrying about it...

For a regex solution, this one is slightly simpler and doesn't need to backtrack:
m/(handle|chat_identifier):([^:&]+)/
Note the slight difference: yours allows single colons within the value, mine doesn't (it stops at the first colon encountered). If that is not a problem, you can use my variant. Or as I mentioned in a comment, split at : and use the fourth element in the result.
An equivalent version that does only stop at double colons is this:
m/(handle|chat_identifier):((?:(?!::|&).)+)/
Not so beautiful, but it still avoids backtracking (the lookahead might make it slower, though... you will need to profile that, if speed matters at all).

Looks like you have allot of good solutions already here. The split method seems like the simplest. But depending on your requirements you could also use a more generic regex that breaks the string in its basic pieces. It will work for other datatypes and property names than in your examples.
([^:]+)::([^:]+):([^:&]+)(?:::|&)\1
The captures groups are as follows:
Group 1: the datatype. (the keyword "text" from your examples.)
Group 2: The property name. (The keywords "handle" and "chat_identifier"
from your examples.)
Group 3: The property value.

If the values you want are always in the same position and it's safe to split on : and &, then perhaps the following will work for you:
use Modern::Perl;
say +( split /[:&]+/ )[2] for <DATA>;
__DATA__
text::handle:e#ma.il::text
text::chat_identifier:chat0123456789&text
Output:
e#ma.il
chat0123456789

How to create regular expression to get all functions from code

I have some problem with my regular expression. I need to find all functions in text. I have this regular expression \w*\([^(]*\). It works fine until text does not contais brackets without function name. For example for this string 'hello world () testFunction()' it returns () and testFunction(), but I need only testFunction(). I want to use it in my c# application to parse passed to my method string. Can anybody help me?
Thanks!

Programming languages have a hierarchical structure, which means that they cannot be parsed by simple regular expressions in the general case. If you want to write correct code that always works, you need to use an LR-parser. If you simply want to apply a hack that will pick up most functions, use something like:
\w+\([^)]*\)
But keep in mind that this will fail in some cases. E.g. it cannot differentiate between a function definition (signature) and a function call, because it does not look at the context.

Try \w+\([^(]*\)
Here I have changed \w* to \w+. This means that the match will need to contain atleast one text character.
Hope that helps

Change the * to + (if it exists in your regex implementation, otherwise do \w\w*). This will ensure that \w is matched one or more times (rather than the zero or more that you currently have).

It largely depends on the definition of "function name". For example, based on your description you only want to filter out the "empty"names, and not want to find all valid names.
If your current solution is largely enough, and you have problems with this empty names, then try to change the * to a +, requiring at least one word character right before the bracket.
\w+([^(]*)
OR
\w\w*([^(]*)
Depending on your regexp application's syntax.

(\w+)\(
regex groups would have the names of variables without any parentesis, you can add them later if you want, i supposed you don't need the parameters.
If you do need the parameters then use:
\w+\(.*\)
for a greedy regex (it would match nested functions calls)
or...
\w+\([^)]*\)
for a non-greedy regex (doesn't match nested function calls, will match only the inner one)

Is stringing together multiple regular expressions with "or" safe?

We have a configuration file that lists a series of regular expressions used to exclude files for a tool we are building (it scans .class files). The developer has appended all of the individual regular expressions into a single one using the OR "|" operator like this:
rx1|rx2|rx3|rx4
My gut reaction is that there will be an expression that will screw this up and give us the wrong answer. He claims no; they are ORed together. I cannot come up with case to break this but still fee uneasy about the implementation.
Is this safe to do?

Not only is it safe, it's likely to yield better performance than separate regex matching.
Take the individual regex patterns and test them. If they work as expected then OR them together and each one will still get matched. Thus, you've increased the coverage using one regex rather than multiple regex patterns that have to be matched individually.

As long as they are valid regexes, it should be safe. Unclosed parentheses, brackets, braces, etc would be a problem. You could try to parse each piece before adding it to the main regex to verify they are complete.
Also, some engines have escapes that can toggle regex flags within the expression (like case sensitivity). I don't have enough experience to say if this carries over into the second part of the OR or not. Being a state machine, I'd think it wouldn't.

It's as safe as anything else in regular expressions!

As far as regexes go , Google code search provides regexes for searches so ... it's possible to have safe regexes

I don't see any possible problem too.
I guess by saying 'Safe' you mean that it will match as you needed (because I've never heard of RegEx security hole). Safe or not, we can't tell from this. You need to give us more detail like what the full regex is. Do you wrap it with group and allow multiple? Do you wrap it with start and end anchor?
If you want to match a few class file name make sure you use start and end anchor to be sure the matching is done from start til end. Like this "^(file1|file2)\.class$". Without start and end anchor, you may end up matching 'my_file1.class too'

The answer is that yes this is safe, and the reason why this is safe is that the '|' has the lowest precedence in regular expressions.
That is:
regexpa|regexpb|regexpc
is equivalent to
(regexpa)|(regexpb)|(regexpc)
with the obvious exception that the second would end up with positional matches whereas the first would not, however the two would match exactly the same input. Or to put it another way, using the Java parlance:
String.matches("regexpa|regexpb|regexpc");
is equivalent to
String.matches("regexpa") | String.matches("regexpb") | String.matches("regexpc");

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

REGEXP_LIKE AND REGEXP_INSTR which one to use - regex

I have the following code which uses regexp_like(), but when I write the exp: ^[0-5]\.[\d]+$ to the regexp_like(), it doesn't return me the correct result. Should I use regexp_instr? How do I get to know, which one to use?

Just a guess here, but I think regexp_like is already anchored at the start and end, otherwise regexp_instr would be redundant.

I am not sure that your regex engine support the \d character class. Try these syntaxes instead (with regexp_like): ^[0-5]\.[0-9]+$ or ^[0-5]\.[[:digit:]]+$

Related

REGEX: PCRE atomic group doesn't work

Regular Expressions in Google Sheets

Finding match between optional tokens?

How to create regular expression to get all functions from code

Is stringing together multiple regular expressions with "or" safe?

Categories

Resources