OCaml regexp "any" matching, where "]" is one of the characters - regex

I'd like to match a string containing any of the characters "a" through "z", or "[" or "]", but nothing else. The regexp should match
"b"
"]abc["
"ab[c"
but not these
"2"
"(abc)"
I tried this:
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[\]]*$") s 0;;
content_check "]abc[";;
and got warned that the "escape" before the "]" was illegal, although I'm pretty certain that the equivalent in, say, sed or awk would work fine.
Anyhow, I tried un-escaping the bracket, but
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[]]*$") s 0;;
doesn't work at all, since it should match any of a-z or "[", then the first "]" closes the "any" selection, after which there must be any number of "]"s. So it should match
[abc]]]]
but not
]]]abc[
In practice, that's not what happens at all; I get the following:
# let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-zA-Z[]]*$") s 0;;
content_check "]abc[";;
content_check "[abc]]]";;
content_check "]abc[";;
val content_check : string -> bool = <fun>
# - : bool = false
# - : bool = false
# - : bool = false
Can anyone explain/suggest an alternative?
@Tim Pietzker's suggestion sounded really good, but appears not to work:
# #load "str.cma" ;;
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[\\]]*$") s 0;;
content_check "]abc[";;
# val content_check : string -> bool = <fun>
# - : bool = false
#
nor does it work when I double-escape the "[" in the pattern, just in case. :(
Indeed, here's a MWE:
#load "str.cma" ;;
let content_check(s:string):bool =
Str.string_match (Str.regexp "[\\]]") s 0;;
content_check "]";; (* should be true *)

This is not going to really answer your question, but it will solve your problem. With the re library:
let re_set = Re.(rep (* "rep" is the star *) @@ alt [
  rg 'a' 'z' ; (* the range from a to z *)
  set "[]" ; (* the set composed of [ and ] *)
])
(* version that matches the whole text *)
let re = Re.(compile @@
  seq [ start ; re_set ; stop ])
let content_check s =
Printf.printf "%s : %b\n" s (Re.execp re s)
let () =
List.iter content_check [
"]abc[" ;
"[abc]]]" ;
"]abc[" ;
"]abc[" ;
"abc##"
]
As you noticed, str from the stdlib is awkward, to put it mildly. re is a very good alternative, and it comes with several regexp syntaxes as well as combinators (which I tend to use, because I find them easier to work with than regexp syntax).

I'm an idiot. (But perhaps the designers of Str weren't so clever either.)
From the "Str" documentation: "To include a ] character in a set, make it the first character in the set."
With this, it's not so clear how to search for "anything except a ]", since you'd have to place the "^" in front of it. Sigh.
:(
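For what it's worth, here is a minimal sketch of that rule applied to the original problem. It assumes the str library is linked (e.g. #load "str.cma" in the toplevel, as in the question). As far as I can tell, a backslash has no special meaning inside a Str set, which would explain why "[\\]]" did not behave like an escaped bracket. The complemented set also seems to work if the "]" comes right after the "^". The name no_close_bracket is just made up for this illustration.

```ocaml
(* Needs the str library linked, e.g. #load "str.cma" in the toplevel. *)

(* "]" comes first inside the set, so it is taken literally;
   "[" needs no escape inside a set. *)
let content_check (s : string) : bool =
  Str.string_match (Str.regexp "^[]a-z[]*$") s 0

(* Complemented set: "^" first, then "]" immediately after it.
   (Hypothetical helper, not from the question.) *)
let no_close_bracket (s : string) : bool =
  Str.string_match (Str.regexp "^[^]]*$") s 0
```

With these definitions, content_check accepts "b", "]abc[" and "ab[c" and rejects "2" and "(abc)".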

Related

How can I have String.tokens use more than one delimiter?

Because String.tokens is a curried function, I know I can change
String.tokens (fn c => c = #" ") "hello world";
so that it tests against a string containing all the delimiters, but I am confused about how to actually write that.
One of the forms that I tried was:
fun splitter nil = nil
| splitter str =
let
val c = " ,.;?:!\t\n"
val s = String.tokens (fn (c:string,x:char) => c=Char.toString c x) str
in
s
end;
With c being the string of the delimiters, but I know something is very wrong. If anyone could point me into the right direction that would be greatly appreciated.
String.tokens takes two arguments: a predicate that decides whether a character is a delimiter, and a string to split. The first argument is the important part. We don't have to specify a single character to split on, just a rule that identifies the delimiter characters.
If you turn a string containing the delimiter characters into a list with String.explode, then it's easy to use List.exists to find out whether a character is in that delimiter string.
fun splitOn(str, tokens) =
let
val tokens' = String.explode tokens
fun isToken c = List.exists (fn c' => c = c') tokens'
in
String.tokens isToken str
end;
splitOn("hello world | wooble. foo? bar!", " |.?!");
(* ["hello", "world", "wooble", "foo", "bar"] *)

Haskell, regex, TDFA: match (and remove) quoted substrings

There is a regular expression matching quoted substrings: "/\"(?:[^\"\\]|\\.)*\"/" (originally /"(?:[^"\\]|\\.)*"/, see Here). Tested on regex101, it works.
With TDFA, it's a syntax error:
*** Exception: Explict error in module Text.Regex.TDFA.String : Text.Regex.TDFA.String died:
parseRegex for Text.Regex.TDFA.String failed:"/"(?:[^"\]|\.)*"/" (line 1, column 4):
unexpected "?"
expecting empty () or anchor ^ or $ or an atom
Is there a way to correct it?
Test string: Is big "problem", no?
Expected result: "problem"
UPD:
This is full context:
removeQuotedSubstrings :: String -> [String]
removeQuotedSubstrings str =
let quoteds = concat (str =~ ("/\"(?:[^\"\\]|\\.)*\"/" :: String) :: [[String]])
in quoteds
No improvement, just an acceptable solution, albeit lacking in elegance:
import qualified Data.Text as T
import Text.Regex.TDFA
-- | Removes all double quoted substrings, if any, from a string.
--
-- Examples:
--
-- >>> removeQuotedSubstrings "alfa"
-- "alfa"
-- >>> removeQuotedSubstrings "ngoro\"dup\"lai \"ming\""
-- "ngoro lai "
removeQuotedSubstrings :: String -> String
removeQuotedSubstrings str =
let quoteds = filter (('"' ==) . head)
$ concat (str =~ ("\"(\\.|[^\"\\])*\"" :: String) :: [[String]])
in T.unpack $ foldr (\quoted acc -> T.replace (T.pack quoted) " " acc)
(T.pack str) quoteds
Yes, the final purpose has always been to remove the quoted substrings.

How to validate if a string only contains number chars in Ocaml

I'm using Str.regexp, I want to know how to check if undetermined length string contains only number characters.
This is what I'm doing:
Str.string_match (Str.regexp "[0-9]+") "1212df3124" 0;;
The problem is that it evaluates to true, but it should return false because the string contains the substring 'df'. (This is not the same as C# regexps; it's OCaml.)
The Str.string_match function checks whether the pattern matches starting at the index you supply. As long as there's at least one digit at the beginning of the string, your pattern will match. If the string starts with something other than a digit, your pattern will fail to match:
# Str.string_match (Str.regexp "[0-9]+") "df3124" 0;;
- : bool = false
To check against the whole string, you need to "anchor" the pattern to the end with $. I.e., you need to make sure the match goes to the end of the string.
# Str.string_match (Str.regexp "[0-9]+") "1212df3124" 0;;
- : bool = true
# Str.string_match (Str.regexp "[0-9]+$") "1212df3124" 0;;
- : bool = false
# Str.string_match (Str.regexp "[0-9]+$") "3141592" 0;;
- : bool = true
# Str.string_match (Str.regexp "[0-9]+$") "" 0;;
- : bool = false
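One caveat: per the Str documentation, $ matches at the end of a line, i.e. also just before a '\n', so a string such as "123\nabc" would still pass the anchored pattern. If that matters, here is a minimal sketch that skips Str and checks each character directly. This is a swapped-in technique, not the regexp approach above, and is_all_digits is a name invented for this illustration.

```ocaml
(* True only for non-empty strings made up entirely of '0'..'9'.
   (Hypothetical helper; plain stdlib, no Str needed.) *)
let is_all_digits (s : string) : bool =
  let n = String.length s in
  (* Walk the string left to right, stopping at the first non-digit. *)
  let rec go i =
    i >= n
    || (match s.[i] with
        | '0' .. '9' -> go (i + 1)
        | _ -> false)
  in
  n > 0 && go 0
```

Like "[0-9]+$", it rejects the empty string, and it also rejects anything containing a newline.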
Another solution is to use int_of_string to see if it raises an exception:
let check_str s =
try int_of_string s |> ignore; true
with Failure _ -> false
If you are going to convert your string to an integer anyway, you can use that.
Beware, it will allow everything that OCaml's parser considers to be an integer:
check_str "10";; (* true *)
check_str "0b10";; (* true: 0b10 = 2 *)
check_str "0o10";; (* true: 0o10 = 8 *)
check_str "0x10";; (* true: 0x10 = 16 *)
So if you want to allow only decimal representation you can do:
let check_str s =
try (int_of_string s |> string_of_int) = s
with Failure _ -> false
as string_of_int returns the string representation of an integer, in decimal.
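A short sketch of what the round-trip version accepts and rejects may help. The inputs are picked for illustration only. Note two side effects of this approach: leading zeros are rejected (since "007" does not round-trip), while a leading minus sign is accepted, and digit strings too large for an OCaml int fail because int_of_string raises Failure.

```ocaml
(* Accept s only if parsing it and printing it back in decimal
   gives exactly the same string. *)
let check_str (s : string) : bool =
  try string_of_int (int_of_string s) = s
  with Failure _ -> false

(* Print the verdict for a few illustrative inputs. *)
let () =
  List.iter
    (fun s -> Printf.printf "%S -> %b\n" s (check_str s))
    [ "10"; "0x10"; "007"; "-5"; "" ]
```

So this is not quite the same test as "only digit characters": use it when a canonical decimal integer is what you actually want.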

syntax error in ocaml because of String.concat

Let's say I have a list of type integer [blah;blah;blah;...], I don't know the size of the list, and I want to pattern match and not print the first element of the list. Is there any way to do this without using an if/else or getting a syntax error?
All I'm trying to do is parse a file that looks like a/path/to/blah/blah/../file.c
and only print the path/to/blah/blah
for example, can it be done like this?
let out x = Printf.printf " %s \n" x
let _ = try
while true do
let line = input_line stdin in
...
let rec f (xpath: string list) : ( string list ) =
begin match Str.split (Str.regexp "/") xpath with
| _::rest -> out (String.concat "/" _::xpath);
| _ -> ()
end
but if i do this i have a syntax error at the line of String.concat!!
String.concat "/" _::xpath doesn't mean anything because _ is a pattern, not a value. _ can be used on the left-hand side of a pattern match but not on the right-hand side.
What you want to do is String.concat "/" rest.
Even if _::xpath were correct, String.concat "/" _::xpath would be interpreted as (String.concat "/" _)::xpath whereas you want it to be interpreted as String.concat "/" (_::xpath).
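Putting both points together, here is a minimal sketch of "drop the first component and rejoin the rest". It assumes the path is a plain string and uses the stdlib's String.split_on_char instead of Str.split, purely to keep the example self-contained. The name drop_first_component is invented for this illustration.

```ocaml
(* Drop the first path component and rejoin the rest with "/".
   The wildcard appears only in the pattern; the bound name [rest]
   is what gets passed to String.concat. *)
let drop_first_component (path : string) : string =
  match String.split_on_char '/' path with
  | _ :: rest -> String.concat "/" rest
  | [] -> ""  (* split_on_char never returns [], kept for exhaustiveness *)
```

For instance, drop_first_component "a/path/to/file.c" gives "path/to/file.c".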

matching exact string in Ocaml using regex

How do I find an exact match using a regular expression in OCaml? For example, I have code like this:
let contains s1 s2 =
let re = Str.regexp_string s2
in
try ignore (Str.search_forward re s1 0); true
with Not_found -> false
where s2 is "_X_1" and s1 feeds strings like "A_1_X_1", "A_1_X_2", ....and so on to the function 'contains'. The aim is to find the exact match when s1 is "A_1_X_1". But the current code finds match even when s1 is "A_1_X_10", "A_1_X_11", "A_1_X_100" etc.
I tried with "[_x_1]", "[_X_1]$" as s2 instead of "_X_1" but does not seem to work. Can somebody suggest what can be wrong?
You can use the $ metacharacter to match the end of the line (which, assuming the string doesn't contain multiple lines, is the end of the string). But you can't put that through Str.regexp_string; that just escapes the metacharacters. You should first quote the actual substring part, then append the $, and then make a regexp from that:
let endswith s1 s2 =
let re = Str.regexp (Str.quote s2 ^ "$")
in
try ignore (Str.search_forward re s1 0); true
with Not_found -> false
Str.match_end is what you need:
let ends_with patt str =
let open Str in
let re = regexp_string patt in
try
let len = String.length str in
ignore (search_backward re str len);
match_end () = len
with Not_found -> false
With this definition, the function works as you require:
# ends_with "_X_1" "A_1_X_10";;
- : bool = false
# ends_with "_X_1" "A_1_X_1";;
- : bool = true
# ends_with "_X_1" "_X_1";;
- : bool = true
# ends_with "_X_1" "";;
- : bool = false
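Since the pattern here is a literal string, the same test can also be sketched without any regexp at all, by comparing the tail of the string directly. This is a swapped-in technique, and has_suffix is a name made up for this illustration. Recent OCaml versions (4.13 and later) also ship String.ends_with ~suffix, which does the same job.

```ocaml
(* True if [s] ends with [suffix]; plain stdlib, no Str needed.
   (Hypothetical helper, not from the answers above.) *)
let has_suffix ~(suffix : string) (s : string) : bool =
  let ls = String.length s and lx = String.length suffix in
  (* The length check guards String.sub against a negative offset. *)
  ls >= lx && String.sub s (ls - lx) lx = suffix
```

It gives the same results as the session above: true for "A_1_X_1" and "_X_1", false for "A_1_X_10" and "".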
A regex will match anywhere in the input, so the behaviour you see is normal.
You need to anchor your regex: ^_X_1$.
Also, [_x_1] will not help: [...] is a character class, here you ask the regex engine to match a character which is x, 1 or _.