I'm evaluating multiple expressions and the result looks like this. Single variable replacement works; I'm not sure whether there is an issue with some cases.
I am not looking for a specific regular expression, but for software that finds them.
Let us say I have a file A and a file B: how can I find a regexp that matches all words of A, but does not match any of the words in B?
If A contains "truit fruit" and B contains "ridiculous", then the software could return something like ".ru." but '.r.' only would be invalid.
It is the "practical" aspect of another question [1], though what interests me is finding actual software that solves it in practice.
Thanks for your help,
Nathann
[1] https://cstheory.stackexchange.com/questions/1854/is-finding-the-minimum-regular-expression-an-np-complete-problem
There is no algorithm to somehow "cleverly derive" a regular expression from examples. You can only implement a brute-force attempt that iterates through all permutations of common substrings of the words in A and tests each candidate against B until you find a solution. You are not guaranteed to find a solution, though.
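A minimal Python sketch of that brute-force idea, under simplifying assumptions: it only tries single common substrings of A as literal patterns (no permutations, no "or"), it expects A to be non-empty, and the function names are made up for illustration.

```python
import re

def common_substrings(words):
    """Return every substring that occurs in all of the given words."""
    first = words[0]
    candidates = {
        first[i:j]
        for i in range(len(first))
        for j in range(i + 1, len(first) + 1)
    }
    return {s for s in candidates if all(s in w for w in words)}

def find_separating_literal(words_a, words_b):
    """Try each common substring of A as a literal regex, shortest first.

    Returns the first pattern that matches no word of B, or None."""
    ordered = sorted(common_substrings(words_a), key=lambda s: (len(s), s))
    for candidate in ordered:
        pattern = re.compile(re.escape(candidate))
        if not any(pattern.search(w) for w in words_b):
            return pattern.pattern
    return None

print(find_separating_literal(["truit", "fruit"], ["ridiculous"]))  # t
print(find_separating_literal(["abc", "xyz"], ["q"]))               # None
```

The second call shows the "no guarantee" part: when the words of A share no substring, this sketch simply gives up.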
For the case that there are no common substrings of all words in A, you could then extend that approach to introduce the "or" operator in regular expressions. But that gets really ugly and slow.
If that does not lead to a solution, then you'd have to go on extending your attempts so that exclusion rules are also added to the expression, by iterating through all words in B and creating anti-patterns from them. A horrible approach.
And as said: you are never guaranteed to find a solution.
There is one thing though:
If you are not interested in what the final regular expression looks like, you can do this: create a regex that simply combines all words in a "whitespace-padded version of A" with an "or" operation (so \struit\s|\sfruit\s in your example). Obviously that approach creates huge expressions. You would then have to take care to exclude exact substrings that might occur in B. Which may lead to much longer expressions still.
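Roughly, in Python, that could look like the sketch below. One assumption on my side: instead of padding with \s, it tests one word at a time with re.fullmatch, which has the same "whole word only" effect. The helper name is hypothetical.

```python
import re

def alternation_of_words(words):
    """Build one pattern that ORs together the (escaped) words themselves."""
    return "|".join(re.escape(w) for w in sorted(set(words)))

pattern = re.compile(alternation_of_words(["truit", "fruit"]))
print(pattern.pattern)  # fruit|truit

for word in ["truit", "fruit", "ridiculous", "fruits"]:
    # fullmatch only succeeds if the entire word is matched
    print(word, bool(pattern.fullmatch(word)))
```

The pattern grows with every word in A, which is the "huge expressions" problem mentioned above.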
Bottom line: there is no really elegant solution for this. Simply because the question does not allow for that. Question is: why does it have to be a regular expression? Why can't you simply do string comparisons? That would probably not be more expensive anyway in such a vaguely defined scenario...
I'm trying to use regular expressions within Google Sheets. Given that the environment is GSheets, some functionality seems to be missing or, potentially, just different.
I would like to use a regexmatch function that returns true if the range in question contains any of the following strings:
"string1"
"string2"
"string3"
I tried =regexmatch(range,"([Ss]tring1|[Ss]tring2|[Ss]tring3)")
This works.
But my developer colleague said he would usually just end the expression with /i to say "Be case insensitive"
=regexmatch(range,"/(String1|String2|String3)/i")
But since Gsheets does not use "/" to open a regular expression, is there another way to tell the function to ignore case?
Also, is there a way to negate the expression? That is, instead of:
=NOT(regexmatch(range,"([Ss]tring1|[Ss]tring2|[Ss]tring3)"))
Can you do something like
=regexmatch(range,"!=([Ss]tring1|[Ss]tring2|[Ss]tring3)")
You can try wrapping your range with the "lower" function, so that it compares the values as if they were all lower case, regardless of whether they really are.
=REGEXMATCH(lower(range),"string1|string2|string3")
is there another way to tell the function to ignore case?
Please try:
=regexmatch(range,"(?i)string1|string2|string3")
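For comparison outside Sheets, here is a small sketch of the same two ideas in Python's re module, which also accepts a leading (?i). The sample values are made up, and this says nothing about Sheets beyond the formulas already shown.

```python
import re

values = ["String1 here", "no match", "STRING3"]

# Inline flag at the very start: the whole pattern ignores case.
inline = re.compile(r"(?i)string1|string2|string3")
print([bool(inline.search(v)) for v in values])  # [True, False, True]

# Alternative: lower-case the input and match a lower-case pattern.
lowered = re.compile(r"string1|string2|string3")
print([bool(lowered.search(v.lower())) for v in values])  # [True, False, True]
```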
Well, I got it working, but somehow it looks slow and inefficient (or maybe not).
What I've got is a sequence of characters; for simplicity's sake, let's just say it's
123456789
What I want to do is to make sure the input begins the same way, and is in the same sequence, but doesn't need to be the complete sequence.
What I've got is this:
^1(2(3(4(5(6(7(8(9)?)?)?)?)?)?)?)?
This looks pretty horrid, but is there a better way to do this?
Edit: Added the ^ that was in the original code; I forgot to include it here.
A ? quantifier is like a spare part. Think of an engine that runs fine without it. It will try to ignore it if possible.
Sure, x?x?x?x?x? looks pretty bad. But it's almost meaningless unless used with some context around it.
Assuming your groupings are just to denote options, you could factor out the innermost group using this: 1(2(3(4(5(6(7(89?)?)?)?)?)?)?)?.
Example:
1(2(3(4(5(6(7(8(9)?)?)?)?)?)?)?)? will globally match this
987654321 1111111111111112121211112121121212312111 multiple times.
So, it's all relative.
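If the goal is only "the input is a leading part of 123456789", a sketch like the following may help. It builds the nested pattern from the sequence instead of typing it by hand, and compares it with a plain startswith check. The helper names are hypothetical, and re.fullmatch is used so the pattern is anchored at both ends (the question only showed ^).

```python
import re

def prefix_pattern(sequence):
    """Build the 1(2(3(...)?)?)? nesting for any non-empty sequence."""
    nested = ""
    for ch in reversed(sequence[1:]):
        nested = "(" + re.escape(ch) + nested + ")?"
    return re.escape(sequence[0]) + nested

SEQUENCE = "123456789"
pattern = re.compile(prefix_pattern(SEQUENCE))
print(pattern.pattern)

def is_prefix(text):
    """Plain string check with the same meaning, for comparison."""
    return text != "" and SEQUENCE.startswith(text)

for text in ["1", "1234", "123456789", "1235", "234", ""]:
    print(repr(text), bool(pattern.fullmatch(text)), is_prefix(text))
```

Both checks agree on these inputs, so if no other regex features are needed, the string comparison is arguably the easier one to read.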
Here's what I'm using: ".+/@[^/]+$". Can you think of a reason why this might not work?
This is actually a very subtle problem and I think a great question.
My understanding is that an (abbreviated) XPath points to an attribute if and only if its last @ is not within a predicate, that is, something of the form [...], and has no steps after it (something like /...). I think this has the relatively simple regular expression @[^]/]*$, that is, there must be an @ that has no ]s or /s after it. Also, if you want to cover unabbreviated XPaths, you can use (@|attribute::)[^]/]*$
I've included a test harness that may prove useful in checking this or other tests. Note also that there may be whitespace between tokens, which can complicate some regexes.
Positive (an attribute)
@* or @a or ../@a or a/@b
a[@b and @c]/@d
a[b[@c="d"]/e[@f and @g]]/h[@i="j"]/@k
Negative (not an attribute)
a[@b] or a[@b and @c]
a[b[@c and @d]/@e]
a[b[@c="d"]/e[@f and @g]]/h[@i="j"]/k[5][@l="m"]
I can't think of a legal example where there is a / but not a ] after the last example, but I think there might be one.
Hopefully these examples make it at least a little clear that there can be arbitrary nesting of [ and ] together with @s anywhere in between. Luckily, I think only the very last @ and its nesting level matter.
(For reference, the OP's regex fails on @a. My original regex failed on a[@b and @c].)
Edit: It turns out that there are more corner cases, which convinces me that there is no perfectly correct regular expression. For example, once you have an attribute node, there are many ways of keeping it, e.g. //@a// or //@a/. in the abbreviated syntax. There are also a variety of more creative ways, such as //@f//[node()]. All in all, it seems that if you want to cover these cases, you need to be able to match up [ and ], which a basic regular expression cannot do. On the other hand, you could decide this is too contrived ...
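The positive and negative examples listed earlier can be checked with a small harness along these lines. This is only a sketch in Python: it writes the attribute marker as @ (the abbreviated XPath attribute axis), it covers just the listed examples and none of the corner cases from the edit, and the variable names are mine.

```python
import re

# Suggested pattern: an @ with no ] and no / anywhere after it.
ATTRIBUTE_RE = re.compile(r"@[^]/]*$")
# The pattern from the question, for comparison.
OP_RE = re.compile(r".+/@[^/]+$")

POSITIVE = [
    "@*", "@a", "../@a", "a/@b",
    "a[@b and @c]/@d",
    'a[b[@c="d"]/e[@f and @g]]/h[@i="j"]/@k',
]
NEGATIVE = [
    "a[@b]", "a[@b and @c]",
    "a[b[@c and @d]/@e]",
    'a[b[@c="d"]/e[@f and @g]]/h[@i="j"]/k[5][@l="m"]',
]

for xpath in POSITIVE + NEGATIVE:
    expected = xpath in POSITIVE
    print(xpath, expected,
          bool(ATTRIBUTE_RE.search(xpath)), bool(OP_RE.search(xpath)))
```

Each line prints the expected answer followed by what the two patterns say, which makes disagreements easy to spot.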
We have a configuration file that lists a series of regular expressions used to exclude files for a tool we are building (it scans .class files). The developer has appended all of the individual regular expressions into a single one using the OR "|" operator like this:
rx1|rx2|rx3|rx4
My gut reaction is that there will be an expression that will screw this up and give us the wrong answer. He claims no; they are ORed together. I cannot come up with a case to break this but still feel uneasy about the implementation.
Is this safe to do?
Not only is it safe, it's likely to yield better performance than separate regex matching.
Take the individual regex patterns and test them. If they work as expected then OR them together and each one will still get matched. Thus, you've increased the coverage using one regex rather than multiple regex patterns that have to be matched individually.
As long as they are valid regexes, it should be safe. Unclosed parentheses, brackets, braces, etc. would be a problem. You could try to parse each piece before adding it to the main regex to verify that it is complete.
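A rough sketch of that "parse each piece first" idea, in Python rather than whatever the tool is written in. The function name and the two sample exclusion patterns are invented for illustration; wrapping each piece in (?:...) is my own addition so every piece stays self-contained.

```python
import re

def combine_patterns(pieces):
    """Compile each piece on its own first, then OR them all together."""
    for piece in pieces:
        try:
            re.compile(piece)
        except re.error as exc:
            raise ValueError(f"invalid exclusion pattern {piece!r}: {exc}") from exc
    # Non-capturing groups keep each piece separate inside the alternation.
    return re.compile("|".join(f"(?:{piece})" for piece in pieces))

combined = combine_patterns([r".*Test\.class", r"com/example/.*"])
print(bool(combined.fullmatch("FooTest.class")))        # True
print(bool(combined.fullmatch("org/other/Foo.class")))  # False

try:
    combine_patterns([r"(unclosed"])
except ValueError as exc:
    print("rejected:", exc)
```

Checking pieces individually matters because two broken pieces can happen to form a valid pattern once joined: "a(" and ")b" are both invalid alone, yet "a(|)b" compiles.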
Also, some engines have escapes that can toggle regex flags within the expression (like case sensitivity). I don't have enough experience to say if this carries over into the second part of the OR or not. Being a state machine, I'd think it wouldn't.
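This differs between engines, so I can only show one of them with confidence. In Python's re module (3.6 or later) a flag can be scoped to a group with (?i:...), and then it does not leak into the other branches. A tiny sketch with made-up patterns:

```python
import re

# (?i:...) turns on case-insensitivity only inside that group.
combined = re.compile(r"(?i:foo)|bar")

for text in ["foo", "FOO", "bar", "BAR"]:
    print(text, bool(combined.fullmatch(text)))
```

Here "FOO" matches but "BAR" does not. For any other engine, it is worth testing the same thing before relying on it.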
It's as safe as anything else in regular expressions!
As far as regexes go, Google Code Search provides regexes for searches, so it's possible to have safe regexes.
I don't see any possible problem either.
I guess by 'safe' you mean that it will match as you need (because I've never heard of a regex security hole). Safe or not, we can't tell from this. You need to give us more detail, like what the full regex is. Do you wrap it in a group and allow repetition? Do you wrap it with start and end anchors?
If you want to match a few class file names, make sure you use start and end anchors so that the match runs from start to end, like this: "^(file1|file2)\.class$". Without start and end anchors, you may end up matching 'my_file1.class' too.
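To see the difference the anchors make, here is a short Python sketch using the pattern above, once without and once with ^ and $. The third file name is an extra example I added.

```python
import re

unanchored = re.compile(r"(file1|file2)\.class")
anchored = re.compile(r"^(file1|file2)\.class$")

for name in ["file1.class", "my_file1.class", "file2.class.bak"]:
    print(name, bool(unanchored.search(name)), bool(anchored.search(name)))
```

Only "file1.class" passes the anchored version, while the unanchored one accepts all three.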
The answer is that yes this is safe, and the reason why this is safe is that the '|' has the lowest precedence in regular expressions.
That is:
regexpa|regexpb|regexpc
is equivalent to
(regexpa)|(regexpb)|(regexpc)
with the obvious exception that the second would end up with positional (captured group) matches whereas the first would not; the two would still match exactly the same input. Or to put it another way, in Java parlance, with input being the string under test:
input.matches("regexpa|regexpb|regexpc");
is equivalent to
input.matches("regexpa") || input.matches("regexpb") || input.matches("regexpc");
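The same equivalence can be spot-checked in Python, where re.fullmatch plays the role of Java's whole-string matches. The three pieces and the test strings below are arbitrary examples, so this is a sanity check, not a proof.

```python
import re

pieces = [r"ab+", r"c.d", r"[0-9]{3}"]
combined = re.compile("|".join(pieces))

def matches_any(text):
    """True if at least one piece matches the whole text on its own."""
    return any(re.fullmatch(p, text) for p in pieces)

for text in ["abbb", "cxd", "123", "abcd", "12", ""]:
    # The ORed pattern and the separate checks should always agree.
    assert bool(combined.fullmatch(text)) == matches_any(text)
    print(repr(text), bool(combined.fullmatch(text)))
```

One assumption worth stating: none of the pieces uses numbered backreferences such as \1, because group numbers are counted across the whole combined pattern and would shift once the pieces are joined.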