Escaping Asterisk Grabs wrong character - regex

When I run the code below I get the unexpected result where \* also captures É. Is there a way to make it only capture * like I wanted?
let s =
"* A
ÉR
* B"
let result = System.Text.RegularExpressions.Regex.Replace(s, "\n(?!\*)", "", System.Text.RegularExpressions.RegexOptions.Multiline)
printfn "%s" result
Result After Running Replace
* AÉR
* B
Expected Result
"* A
ÉR
* B"
UPDATE
This seems to be working, when I use a pattern like so \n(?=\*). I guess I needed a positive lookahead.

You're right that you need a positive lookahead rather than a negative lookahead to get the result you want. However, to clarify an issue that came up in the comments: in F#, a string delimited by just "" is not quite like either a plain C# string delimited by "" or a C# verbatim string delimited by @"". If you want the latter, you should also use @"" in F#. The difference is that in a normal F# string, a backslash is treated as the start of an escape sequence only when it appears in front of a valid character to escape (see the table towards the top of Strings (F#)). Otherwise, it is treated as a literal backslash character. So, since '*' is not a valid character to escape, you luckily see the behavior you expected (in C#, by contrast, this would be a syntax error because it's an unrecognized escape). I would recommend that you not rely on this and use a verbatim @"" string instead.
In other words, in F# the following three strings are all equivalent:
let s1 = "\n\*"
let s2 = "\n\\*"
let s3 = @"
\*"
I think that the C# design is more sensible because it prevents confusion on what exactly is being escaped.
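For illustration, the difference between the two lookaheads can be reproduced outside .NET. The sketch below uses Python's re module rather than F# (an assumption made purely for demonstration; the lookahead syntax is the same in both engines), with raw strings so that the backslash reaches the regex engine untouched:

```python
import re

s = "* A\nÉR\n* B"

# Negative lookahead: drop every newline that is NOT followed by '*'.
negative = re.sub(r"\n(?!\*)", "", s)

# Positive lookahead: drop only the newlines that ARE followed by '*'.
positive = re.sub(r"\n(?=\*)", "", s)

print(repr(negative))  # '* AÉR\n* B'
print(repr(positive))  # '* A\nÉR* B'
```

In both calls \* matches only a literal asterisk. The É ends up joined to the previous line because the newline in front of it was removed, not because the asterisk pattern matched it.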

Related

How do I do regex substitutions with multiple capture groups?

I'm trying to allow users to filter strings of text using a glob pattern whose only control character is *. Under the hood, I figured the easiest way to filter the list of strings would be to use Js.Re.test (https://rescript-lang.org/docs/manual/latest/api/js/re#test_), and it is (easy).
Ignoring the * on the user filter string for now, what I'm having difficulty with is escaping all the RegEx control characters. Specifically, I don't know how to replace the capture groups within the input text to create a new string.
So far, I've got this, but it's not quite right:
let input = "test^ing?123[foo";
let escapeRegExCtrl = searchStr => {
let re = [%re("/([\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+][^\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+]*)/g")];
let break = ref(false);
while (!break.contents) {
switch (Js.Re.exec_ (re, searchStr)) {
| Some(result) => {
let match = Js.Re.captures(result)[0];
Js.log2("Matching: ", match)
}
| None => {
break := true;
}
}
}
};
input -> escapeRegExCtrl
If I disregard the "test" portion of the string being skipped, the above output will produce:
Matching: ^ing
Matching: ?123
Matching: [foo
With the above example, at the end of the day, what I'm trying to produce is this (with leading and trailing .*):
.*test\^ing\?123\[foo.*
But I'm unsure how to achieve creating a contiguous string from the matched capture groups.
(echo "test^ing?123[foo" | sed -r 's_([\^\?\[])_\\\1_g' would get the work done on the command line)
EDIT
Based on Chris Maurer's answer, there is a method in the JS library that does what I was looking for. A little digging exposed the ReasonML proxy for that method:
https://rescript-lang.org/docs/manual/latest/api/js/string#replacebyre
Let me see if I have this right; you want to implement a character matcher where everything is literal except *. Presumably the * is supposed to work like that in Windows dir commands, matching zero or more characters.
Furthermore, you want to implement it by passing a user-entered character string directly to a Regexp match function after suitably sanitizing it to only deal with the *.
If I have this right, then it sounds like you need to do two things to get the string ready for Js.Re.test:
Quote all the special regex characters, and
Turn all instances of * into .* or maybe .*?
Let's keep this simple and process the string in two steps, each one using a regex-based replace (String.prototype.replace in JavaScript). The special characters in regex include [ \ ^ $ . | ? * + ( ). Suitably quoting these for replace, and leaving * out so the second step can handle it:
str.replace(/[\[\\\^\$\.\|\?\+\(\)]/g, '\\$&')
This is just all those special characters quoted. The $& in the replacement specification says to insert whatever matched, and the doubled backslash in front of it produces the literal backslash that does the quoting.
Then pass that result to a second replace for the * to .*? transformation. The asterisk has to be escaped in the pattern, otherwise the regex is invalid:
str.replace(/\*+/g, '.*?')
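The two-step idea above (quote everything that is special, then turn * into a wildcard) can be sketched in Python, used here only for illustration rather than ReasonML. The names glob_to_regex and filter_by_glob are hypothetical helpers, not part of any library. Instead of hand-listing the special characters, this sketch splits on * first and relies on re.escape for the literal pieces:

```python
import re

def glob_to_regex(glob: str) -> str:
    """Escape everything except '*', which becomes a lazy wildcard."""
    # Runs of '*' collapse into a single wildcard, like the *+ in the second replace.
    literal_pieces = re.split(r"\*+", glob)
    return ".*?".join(re.escape(piece) for piece in literal_pieces)

def filter_by_glob(items, glob):
    """Keep the items that contain a match for the glob anywhere inside them."""
    pattern = re.compile(glob_to_regex(glob))
    # search() scans the whole string, so no leading or trailing .* is needed.
    return [item for item in items if pattern.search(item)]

print(glob_to_regex("test^ing?123[foo"))  # test\^ing\?123\[foo
print(filter_by_glob(["test^ing?123[foo", "testing"], "test^*foo"))
```

Splitting before escaping avoids having to tell a user-typed * apart from an escaped one afterwards.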

Regular expression with backslash in Python3

I'm trying to match a specific substring in a string with a regular expression, like matching "\ue04a" in "\ue04a abc". But something seems to be wrong. Here's my code:
m = re.match('\\([ue]+\d+[a-z]+)', "\ue04a abc")
The returned m is an empty object, even when I tried using three backslashes in the pattern. What's wrong?
Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t" represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.
Generally, using r"\t" is strongly recommended, because this removes the Python string parsing aspect. This string, with the r prefix, undergoes no interpretation by Python -- every character in the string simply represents itself, including backslash. So this particular example represents a string of length two, containing the literal characters backslash \ and t.
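A quick way to see the two levels of interpretation is to compare string lengths. This is a minimal sketch using only built-in functions; the variable names are made up for the example:

```python
# Regular string: Python turns \t into one tab character before re ever sees it.
tab = "\t"
# Raw string: the backslash and the 't' stay as two separate characters.
raw_tab = r"\t"

print(len(tab))      # 1
print(len(raw_tab))  # 2

# The same distinction applied to the string from the question.
as_code_point = "\ue04a abc"  # U+E04A, space, a, b, c
as_literal = r"\ue04a abc"    # backslash, u, e, 0, 4, a, space, a, b, c

print(len(as_code_point))  # 5
print(len(as_literal))     # 10
```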
It's not clear from your question whether the target string "\ue04a abc" should be interpreted as a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, aka PUA, meaning it doesn't have any specific standard use) followed by space, a, b, c -- in which case you would use something like
m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")
to capture any single code point in the traditional Basic Multilingual Plane PUA; -- or if you want to match a literal string which begins with the two characters backslash \ and u, followed by four hex digits:
m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")
where the former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.
The above shows how to match the "mystery sequence" "\ue04a"; it should not then be hard to extend the code to match a longer string containing this sequence.
This should help.
import re
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
    print(m.group())
Output:
\ue04a

regex that will check expression is valid arithmetic expression, for example 'a+b*c=d' is valid and accepted

I made separate regexes for identifiers and operators, but they are not giving the desired result. The check should look at the whole input string and report valid if it is valid, or invalid if it is not.
import re
identifiers = r'^[^\d\W]\w*\Z'
operators = r'[\+\*\-\/=]'
a = re.compile(identifiers, re.UNICODE)
b = re.compile(operators, re.UNICODE)
c = re.findall(a, 'a+b*c=d')
d = re.findall(b, 'a+b*c=d')
print(c, 'identifiers')
print(d, 'operators')
The result of this snippet is
[] identifiers
['+', '*', '='] operators
I want a result saying whether the input string is valid or invalid, obtained by checking all characters of the input string against both regexes.
I think the issue you're having with your current code is that your identifiers pattern only works if it matches the whole string.
The problem is that the current pattern requires that both the beginning and end of the input be matched (by the ^ and \Z respectively). That's why you're usually not finding any identifiers: only an input like "foo" would be matched, since it's a single identifier that touches both the start and end of the string. (I'd also note that it is a bit odd to mix ^ and \Z together, though it is not invalid. It would just be more natural to pair ^ with $ or \Z with \A.)
I suspect that you don't actually want ^ and \Z in your pattern, but rather should be using \b in place of both. The \b escape matches "word breaks", which means either the start or end of the input, or a change between word-characters and non-word characters.
>>> re.findall(r'\b[^\d\W]\w*\b', 'a+b*c=d', re.U)
['a', 'b', 'c', 'd']
This still isn't going to do what you say you ultimately want (testing the string to ensure it's a valid expression). That's a much more difficult task, and regular expressions are not up to it in general. Certain specific forms of expressions can perhaps be matched with regex, but supporting things like parentheses will break the whole system in a hurry. To identify arbitrary arithmetic expressions, you'd need a more sophisticated parser, which might use regex in some of its steps, but not for the whole thing.
For the simple cases like your example an expression like this will work:
^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$
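As a rough sketch of how that pattern could be used for a valid/invalid check (is_valid and SIMPLE_EXPRESSION are names invented for this example), the whole string is tested against it in one go:

```python
import re

# Pattern from above: operand, one or more (operator operand) pairs, '=', operand.
SIMPLE_EXPRESSION = re.compile(r"^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$")

def is_valid(expression: str) -> bool:
    """Return True when the whole string fits the simple pattern."""
    return SIMPLE_EXPRESSION.match(expression) is not None

for text in ["a+b*c=d", "a+*b=c", "a+b", "(a+b)*c=d"]:
    print(text, "valid" if is_valid(text) else "invalid")
```

Only the first input is reported as valid. Note that this pattern requires at least one operator before the = sign, so a plain a=d is rejected, and parentheses are not supported at all.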

Regex to capture VBA comments

I'm trying to capture VBA comments. I have the following so far
'[^";]+\Z
This captures anything that starts with a single quote and does not contain any double quotes until the end of the string, i.e. it will not match single quotes within a double-quoted string.
dim s as string ' a string variable -- works
s = "the cat's hat" ' quote within string -- works
But fails if the comment contains a double quote string
i.e.
dim s as string ' string should be set to "ten"
How can I fix my regex to handle that too?
The pattern in @Jeff Wurz's comment (^\'[^\r\n]+$|''[^\r\n]+$) doesn't even match any of your test samples, and the linked question is useless: the regex in there will only match that specific comment in the OP's question, not "the VBA comment syntax".
The regex you have come up with works even better than what I had when I gave up the regex approach.
Well done!
The problem is that you can't parse VBA comments with a regex.
In Lexers vs Parsers, @SasQ's answer does a good job at explaining Chomsky's grammar levels:
Level 3: Regular grammars
They use regular expressions, that is, they can consist only of the
symbols of alphabet (a,b), their concatenations (ab,aba,bbb etc.), or
alternatives (e.g. a|b). They can be implemented as finite state
automata (FSA), like NFA (Nondeterministic Finite Automaton) or better
DFA (Deterministic Finite Automaton). Regular grammars can't handle
with nested syntax, e.g. properly nested/matched parentheses
(()()(()())), nested HTML/BBcode tags, nested blocks etc. It's because
state automata to deal with it should have to have infinitely many
states to handle infinitely many nesting levels.
Level 2: Context-free grammars
They can have nested, recursive, self-similar branches in their syntax
trees, so they can handle with nested structures well. They can be
implemented as state automaton with stack. This stack is used to
represent the nesting level of the syntax. In practice, they're
usually implemented as a top-down, recursive-descent parser which uses
machine's procedure call stack to track the nesting level, and use
recursively called procedures/functions for every non-terminal symbol
in their syntax. But they can't handle with a context-sensitive
syntax. E.g. when you have an expression x+3 and in one context this x
could be a name of a variable, and in other context it could be a name
of a function etc.
Level 1: Context-sensitive grammars
Regular Expressions simply aren't the appropriate tool for solving this problem, because whenever there's more than a single quote (/apostrophe), or when double quotes are involved, you need to figure out whether the left-most apostrophe in the code line is inside double quotes, and if it is, then you need to match the double quotes and find the left-most apostrophe after the closing double quote - actually, the left-most apostrophe that isn't part of a string literal, is your comment marker.
My understanding is that VBA comment syntax is a context-sensitive grammar (level 1), because the apostrophe is only your marker if it's not part of a string literal. To figure out whether an apostrophe is part of a string literal, the easiest approach is probably to walk your string left to right and toggle some IsInsideQuote flag as you encounter double quotes... but only if they're not escaped (doubled-up). Actually you don't even check whether there's an apostrophe inside the string literal: you just keep walking until open quotes are closed, and only when the "in-quotes flag" is False have you found a comment marker when you encounter a single quote.
Good luck!
Here's a test case you're missing:
s = "abc'def ""xyz""'nutz!" 'string with apostrophes and escaped double quotes
If you don't care about capturing the string literals, you can simply ignore the escaped double quotes and see 3 string literals here: "abc'def ", "xyz" and "'nutz!".
This C# code outputs 'string with apostrophes and escaped double quotes (all in-string double quotes are escaped with a backslash in the code), and works with all the test strings I gave it:
static void Main(string[] args)
{
    var instruction = "s = \"abc'def \"\"xyz\"\"'nutz!\" 'string with apostrophes and escaped double quotes";
    // var instruction = "s = \"the cat's hat\" ' quote within string -- works";
    // var instruction = "dim s as string ' string should be set to \"ten\"";
    int? commentStart = null;
    var isInsideQuotes = false;
    for (var i = 0; i < instruction.Length; i++)
    {
        if (instruction[i] == '"')
        {
            isInsideQuotes = !isInsideQuotes;
        }
        if (!isInsideQuotes && instruction[i] == '\'')
        {
            commentStart = i;
            break;
        }
    }
    if (commentStart.HasValue)
    {
        Console.WriteLine(instruction.Substring(commentStart.Value));
    }
    Console.ReadLine();
}
Then if you want to capture all legal comments, you need to handle the legacy Rem keyword, and consider line continuations:
Rem this is a legal comment
' this _
is also _
a legal comment
In other words, \r\n in itself isn't enough to correctly identify all end-of-statement tokens.
A proper lexer+parser seems the only way to capture all comments.
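For readers who prefer Python, the left-to-right scan with an in-quotes flag described above can be sketched as follows. This is only an illustration of the flag-toggling idea: find_comment is a made-up name, and the sketch deliberately ignores the Rem keyword and line continuations mentioned above.

```python
from typing import Optional

def find_comment(line: str) -> Optional[str]:
    """Return the trailing apostrophe comment of one VBA line, or None."""
    inside_quotes = False
    for index, char in enumerate(line):
        if char == '"':
            # A doubled quote ("") toggles twice, so the flag ends up unchanged.
            inside_quotes = not inside_quotes
        elif char == "'" and not inside_quotes:
            # First apostrophe outside a string literal starts the comment.
            return line[index:]
    return None

print(find_comment('s = "the cat\'s hat" \' quote within string -- works'))
print(find_comment('dim s as string \' string should be set to "ten"'))
print(find_comment('s = "abc\'def ""xyz""\'nutz!" \'string with apostrophes'))
```

Each call prints only the part of the line from the comment marker onwards, including comments that themselves contain double quotes.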

How to replace all the numbers with literal \d in scala?

I want to write a function, to replace all the numbers in a string with literal \d. My code is:
val r = """\d""".r
val s = r.replaceAllIn("123abc", """\d""")
println(s)
I expect the result is \d\d\dabc, but get:
dddabc
Then I change my code (line 2) to:
val s = r.replaceAllIn("123abc", """\\d""")
The result is correct now: \d\d\dabc
But I don't understand why the method replaceAllIn converts the string, not use it directly?
There was a toList in my previous code, which is not what I want. I have just updated the question. Thanks to everyone.
Just remove the toList.
val r = """\d""".r
val list = r.replaceAllIn("123abc", """\\d""")
println(list)
Strings are (implicitly, via WrappedString, convertible to) Seq[Char]. If you invoke toList, you will have a List[Char].
Scala's Regex uses java.util.regex underneath (at least on the JVM). Now, if you look up replaceAll on Java docs, you'll see this:
Note that backslashes (\) and dollar
signs ($) in the replacement string
may cause the results to be different
than if it were being treated as a
literal replacement string. Dollar
signs may be treated as references to
captured subsequences as described
above, and backslashes are used to
escape literal characters in the
replacement string.
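The same kind of two-level interpretation is not specific to the JVM. As an illustration only, switching to Python's re.sub (not Scala's API), a replacement template also gives backslashes a special meaning, so a literal backslash has to be doubled, or the template has to be bypassed with a function:

```python
import re

# In a replacement template, \\ stands for one literal backslash.
via_template = re.sub(r"\d", r"\\d", "123abc")

# A callable replacement is used verbatim, with no template processing.
via_function = re.sub(r"\d", lambda match: r"\d", "123abc")

print(via_template)  # \d\d\dabc
print(via_function)  # \d\d\dabc
```

On the JVM, java.util.regex.Matcher.quoteReplacement plays a role similar to the callable here: it returns a replacement string in which backslashes and dollar signs are treated literally.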