OCaml regex being buggy when trying to use escape characters

I'm trying to write a lexer for a variation on C using OCaml. For the lexer I need to match the strings "^" and "||" (as the exponentiation and or symbols respectively). Both of these are special characters in regex, and when I try to escape them using the backslash, nothing changes and the code runs as if "\^" was still beginning of line and "\|\|" was still "or or". What can I do to fix this?

Backslash characters in string literals have to be doubled to make it past the OCaml string parser:
# let r = Str.regexp "\\^" in
Str.search_forward r "FOO^BAR" 0;;
- : int = 3
If you are using OCaml 4.02 or later, you can also use quoted strings ({| ... |}), which do not handle a backslash character specially. This may result in more readable code because backslash characters do not have to be doubled:
# let r = Str.regexp {|\^|} in
Str.search_forward r "FOO^BAR" 0;;
- : int = 3
Or you may consider using Str.regexp_string (or Str.quote), which creates a regular expression that will match all characters in its argument literally:
# let r = Str.regexp_string "^" in
Str.search_forward r "FOO^BAR" 0;;
- : int = 3
The Str module does not take | as a special regex character, so you do not have to worry about quoting when you want to use it literally:
# let r = Str.regexp "||" in
Str.search_forward r "FOO||BAR" 0;;
- : int = 3
| has to be quoted only when you want to use it as the "or" construct:
# let r = Str.regexp "BAZ\\|BAR" in
Str.search_forward r "FOOBAR" 0;;
- : int = 3
You might want to refer to Str.regexp for the full syntax of regular expressions.

Related

What is the regex for the escaping []?

I am trying to match square brackets i.e. [] in regex VBA in excel. I am trying with the below code but it is not working.
Public Function IsSpecial(s As String) As Long
    Dim L As Long, LL As Long
    Dim sCh As String
    IsSpecial = 0
    For L = 1 To Len(s)
        sCh = Mid(s, L, 1)
        If sCh Like "[0-9a-zA-Z/;#%,'‚.+&/\(): ]" Or sCh = "_" Or sCh Like "[-]" Or sCh Like "\[" Then
        Else
            IsSpecial = 1
            Exit Function
        End If
    Next L
End Function
According to Using the Like operator and wildcard characters in string comparisons:
You can use a group of one or more characters (charlist) enclosed in brackets ([ ]) to match any single character in expression, and charlist can include almost any characters in the ANSI character set, including digits. You can use the special characters opening bracket ([), question mark (?), number sign (#), and asterisk (*) to match themselves directly only if enclosed in brackets. You cannot use the closing bracket (]) within a group to match itself, but you can use it outside a group as an individual character.
So, you need to use
sCh Like "[[]"
However, the function you have is not following your logic, since it checks each char individually, and you want to make sure [] is checked as a char sequence.
With a regex, it will look like
Public Function IsSpecial(s As String) As Long
    Dim L As Long, LL As Long
    Dim rx As New regExp
    rx.Pattern = "^(?:[0-9a-zA-Z/;#%,'‚.+&/\\(): _-]|\[])*$"
    IsSpecial = 0
    If Not rx.Test(s) Then IsSpecial = 1
End Function

Formatting regex in Dart on several lines

I have
Pattern pattern = r'^((?:19|20)\d\d)[- /.]
(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$';
My editor shows an error on this regexp.
How can I fix it?
You entered a line break inside a string literal, that is why you get a syntax issue.
If you want to split a pattern into several lines, just use string concatenation:
Pattern pattern = r'^((?:19|20)\d\d)[- /.]' +
r'(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$';
Or, since string literals separated only with whitespace characters are concatenated automatically:
Pattern pattern = r'^((?:19|20)\d\d)[- /.]'
r'(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$';
Or, if you plan to re-use a long pattern, you may define this part as a variable, and just use string interpolation:
String y = r'((?:19|20)\d\d)';
String M = r'(0[1-9]|1[012])';
String d = r'(0[1-9]|[12][0-9]|3[01])';
String sep = r'[- /.]';
Pattern pattern = '^$y$sep$M$sep$d\$';

In Scala how can I split a string on whitespaces accounting for an embedded quoted string?

I know Scala can split strings on regex's like this simple split on whitespace:
myString.split("\\s+").foreach(println)
What if I want to split on whitespace, accounting for the possibility that there may be a quoted string in the input (which I wish to be treated as 1 thing)?
"""This is a "very complex" test"""
In this example I want the resulting substrings to be:
This
is
a
very complex
test
While handling quoted expressions with split can be tricky, doing so with Regex matches is quite easy. We just need to match all non-whitespace character sequences with ([^\\s]+) and all quoted character sequences with \"(.*?)\" (toList added in order to avoid reiteration):
import scala.util.matching._
val text = """This is a "very complex" test"""
val regex = new Regex("\"(.*?)\"|([^\\s]+)")
val matches = regex.findAllMatchIn(text).toList
val words = matches.map { _.subgroups.flatMap(Option(_)).fold("")(_ ++ _) }
words.foreach(println)
/*
This
is
a
very complex
test
*/
Note that the solution also counts quote itself as a word boundary. If you want to inline quoted strings into surrounding expressions, you'll need to add [^\\s]* from both sides of the quoted case and adjust group boundaries correspondingly:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*\".*?\"[^\\s]*)|([^\\s]+)")
...
/*
This
is
a
["very complex"]
test
*/
You can also omit quote symbols when inlining a string by splitting a regex group:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*)\"(.*?)\"([^\\s]*)|([^\\s]+)")
...
/*
This
is
a
[very complex]
test
*/
In more complex scenarios, when you have to deal with CSV strings, you'd better use a CSV parser (e.g. scala-csv).
For a string like the one in question, when you do not have to deal with escaped quotation marks, nor with any "wild" quotes appearing in the middle of the fields, you may adapt a known Java solution (see Regex for splitting a string using space when not surrounded by single or double quotes):
val text = """This is a "very complex" test"""
val p = "\"([^\"]*)\"|[^\"\\s]+".r
val allMatches = p.findAllMatchIn(text).map(
  m => if (m.group(1) != null) m.group(1) else m.group(0)
)
println(allMatches.mkString("\n"))
See the online Scala demo, output:
This
is
a
very complex
test
The regex is rather basic as it contains 2 alternatives, a single capturing group and a negated character class. Here are its details:
\"([^\"]*)\" - ", followed with 0+ chars other than " (captured into Group 1) and then a "
| - or
[^\"\\s]+ - 1+ chars other than " and whitespace.
You only grab .group(1) if Group 1 participated in the match, else, grab the whole match value (.group(0)).
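For anyone who wants to try the pattern outside Scala, here is a minimal Python sketch of the same idea. It is only a cross-check under the same assumptions (no escaped quotes, no stray quotes inside fields); the names TOKEN and split_quoted are made up for this example.

```python
import re

# Same two alternatives as above: a quoted run (captured in group 1),
# or a run of characters that are neither quotes nor whitespace.
TOKEN = re.compile(r'"([^"]*)"|[^"\s]+')

def split_quoted(text):
    """Split on whitespace, keeping double-quoted phrases as one item."""
    result = []
    for m in TOKEN.finditer(text):
        # Group 1 is None when the unquoted alternative matched.
        result.append(m.group(1) if m.group(1) is not None else m.group(0))
    return result

print("\n".join(split_quoted('This is a "very complex" test')))
```

The `is not None` check plays the role of the `!= null` test in the Scala version, so an empty quoted string `""` still comes back as an empty item.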
This should work for input like the example, where each quoted phrase is exactly two words:
val xx = """This is a "very complex" test"""
var x = xx.split("\\s+")
for (i <- 0 until x.length) {
  if (x(i) contains "\"") {
    x(i) = x(i) + " " + x(i + 1)
    x(i + 1) = ""
  }
}
val newX = x.filter(_ != "")
for (i <- newX) {
  println(i.replace("\"", ""))
}
Rather than using split, I used a recursive approach. Treat the input string as a List[Char], then step through, inspecting the head of the list to see if it is a quote or whitespace, and handle accordingly.
def fancySplit(s: String): List[String] = {
  def recurse(s: List[Char]): List[String] = s match {
    case Nil => Nil
    case '"' :: tail =>
      val (quoted, theRest) = tail.span(_ != '"')
      quoted.mkString :: recurse(theRest drop 1)
    case c :: tail if c.isWhitespace => recurse(tail)
    case chars =>
      val (word, theRest) = chars.span(c => !c.isWhitespace && c != '"')
      word.mkString :: recurse(theRest)
  }
  recurse(s.toList)
}
If the list is empty, the recursion is finished.
If the first character is a ", grab everything up to the next quote and recurse with what's left (after throwing out that second quote).
If the first character is whitespace, throw it out and recurse from the next character.
In any other case, grab everything up to the next split character, then recurse with what's left.
Results:
scala> fancySplit("""This is a "very complex" test""") foreach println
This
is
a
very complex
test

regex with all components optionals, how to avoid empty matches

I have to process a comma-separated string which contains triplets of values and translate them to runtime types. The input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
A, B and C are replaced with the correct letter for each case. The expression almost works, but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
It works better than the first version, but it still matches the empty string, and I would first have to tokenize the input and pass each token to the expression so that the test string can match the begin (^) and end ($) anchors.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading about and trying to understand the regex lookahead concept, and with the help of Casimir et Hippolyte's answer, I tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! It detects complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h", which weren't intended to be matched at all). Unfortunately, it captures an empty match every time an invalid token is found, so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688". I therefore modified the lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}, so the empty matches just before "adfank" and "12322134445688" are gone, but the ones just before "1s1m1h" and "1h1h1h" are still detected.
So the question is: Is there a regular expression that matches a triplet of values in a given order, where every component is optional but at least one must be present, and that does not match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the beginning to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
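As a quick cross-check of the anchored pattern outside C++, here is a minimal Python sketch. It assumes the input is first split on commas and uses h, m, s for A, B, C; the names make_triplet_regex and valid_tokens are made up for illustration and are not part of any library.

```python
import re

def make_triplet_regex(a, b, c):
    """Build the anchored pattern for one triplet type (letters a, b, c)."""
    # (?=.) requires at least one character, so the empty string is rejected.
    return re.compile(rf"^(?=.)((?:\d+{a})?(?:\d+{b})?(?:\d+{c})?)$")

TIME = make_triplet_regex("h", "m", "s")

def valid_tokens(text, regex):
    """Split on commas and keep only the tokens the pattern accepts."""
    return [tok for tok in text.split(",") if regex.match(tok)]

test = "48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
print(valid_tokens(test, TIME))
```

With the sample string from the question, the wrongly ordered "1s1m1h", the repeated "1h1h1h", and the two junk tokens are dropped, and no empty match is reported.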
If you need to find this kind of substring in a larger string (without tokenizing first), you can remove the anchors and use a more explicit subpattern in a lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positives (since you are looking for very small strings that can be part of something else), you can add word boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
I think this might do the trick.
I am keying on either the beginning of the string (^) or the comma separator (,) to fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <string>
#include <iostream>
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");
int main()
{
    std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
    std::sregex_iterator iter(test.begin(), test.end(), r);
    std::sregex_iterator end_iter;
    for(; iter != end_iter; ++iter)
        std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");
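To see the enumerated alternation at work without compiling anything, here is a rough Python sketch of the same construction. The helper names are hypothetical, and the trailing (?=,|$) lookahead is an addition that is not in the C++ line above; it is there so that a token with leftover characters is rejected instead of being matched by its prefix.

```python
import re
from itertools import combinations

# One optional component per position, same letter classes as the C++ code.
PARTS = [r"(?:\d+[xrh])", r"(?:\d+[ygm])", r"(?:\d+[zbs])"]

def non_empty_alternation(parts):
    """Join every non-empty, order-preserving subset of parts with '|'."""
    alternatives = []
    for size in range(len(parts), 0, -1):  # longest alternatives first
        for combo in combinations(parts, size):
            alternatives.append("".join(combo))
    return "|".join(alternatives)

# Assumption: (?=,|$) is added here and is not part of the original answer.
TRIPLET = re.compile(r"(?:^|,)(" + non_empty_alternation(PARTS) + r")(?=,|$)")

def find_triplets(text):
    """Return the captured triplet of every match in a comma-separated string."""
    return [m.group(1) for m in TRIPLET.finditer(text)]

print(find_triplets("1x2y3z,80r160g255b,48h30m50s,1x3z,255b"))
```

Because every alternative contains at least one component, the capture group can never be empty. Like the C++ version, this does not stop letters from different types being mixed in one token.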

How to grab a letter after ';' with regular expressions?

How can I grab a letter after ; using regular expressions? For example:
c ; d
e ; f ; m ; k ; s
import re
f = open('file.txt')
regex = re.compile(r"(?<=\; )\w+")
for line in f:
    match = regex.search(line)
    if match:
        print(match.group())
This code only grabs d and f. I need the outcome to look like:
d
f
m
k
s
Replace all occurrences of "; " with a newline character and trim all spaces from the ends of every line.
Use a regex similar to this if you want to "blacklist" the ";" character:
[;]
I don't know much about Python, but here is how you would use it in JavaScript:
var desired_chars = myString.replace(/[;]/gi, '')
Instead of regex.search use regex.findall. That'll give you a list of matches for each line which you can then manipulate and print on separate lines.
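A minimal sketch of that suggestion, using the two sample lines from the question inline instead of reading file.txt (the helper name letters_after_semicolon is made up for the example):

```python
import re

# Same lookbehind idea as in the question: a word preceded by "; ".
regex = re.compile(r"(?<=; )\w+")

def letters_after_semicolon(line):
    """Return every word that directly follows '; ' in the line."""
    return regex.findall(line)

# Sample lines from the question.
for line in ["c ; d", "e ; f ; m ; k ; s"]:
    for letter in letters_after_semicolon(line):
        print(letter)
```

The difference from the original code is that search stops at the first hit on each line, while findall returns all of them as a list.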