How can I have String.tokens use more than one delimiter? - sml

Because String.tokens is a curried function, I know I can change the predicate in
String.tokens (fn c => c = #" ") "hello world";
so that it checks against a string containing all the delimiters, but I am confused about how to actually write that.
One of the forms that I tried was:
fun splitter nil = nil
| splitter str =
let
val c = " ,.;?:!\t\n"
val s = String.tokens (fn (c:string,x:char) => c=Char.toString c x) str
in
s
end;
With c being the string of delimiters, but I know something is very wrong. If anyone could point me in the right direction, that would be greatly appreciated.

String.tokens takes two arguments, in curried form: a predicate that decides whether a character is a delimiter, and the string to split. The first argument is the important part. We don't have to name a single character to split on, just give a rule that identifies delimiter characters.
If you turn a string containing the delimiter characters into a list with String.explode, then it's easy to use List.exists to find out whether a character is in that delimiter string. (The code below calls the delimiters tokens.)
fun splitOn(str, tokens) =
  let
    val tokens' = String.explode tokens
    fun isToken c = List.exists (fn c' => c = c') tokens'
  in
    String.tokens isToken str
  end;
splitOn("hello world | wooble. foo? bar!", " |.?!");
(* ["hello", "world", "wooble", "foo", "bar"] *)
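For comparison, here is a minimal sketch of the same idea in Python rather than SML: a predicate decides which characters are delimiters, and the tokens are the maximal non-empty runs of non-delimiter characters. The names tokens and split_on are hypothetical helpers written for this illustration, not library functions, and the sketch only assumes that String.tokens drops empty tokens between adjacent delimiters.

```python
def tokens(is_delim, s):
    """Split s into maximal non-empty runs of characters for which
    is_delim returns False (modelled on the behaviour of String.tokens)."""
    result = []
    current = []
    for ch in s:
        if is_delim(ch):
            # A delimiter ends the current token, if there is one.
            if current:
                result.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush the last token when the string does not end with a delimiter.
    if current:
        result.append("".join(current))
    return result


def split_on(s, delimiters):
    # Set membership plays the role of List.exists over the exploded string.
    delimiter_set = set(delimiters)
    return tokens(lambda ch: ch in delimiter_set, s)


print(split_on("hello world | wooble. foo? bar!", " |.?!"))
# ['hello', 'world', 'wooble', 'foo', 'bar']
```

Passing the predicate separately keeps the splitting logic independent of how the delimiters are described, which is the same design the SML answer relies on.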

Related

Write a SML function that take the name of a file and return a list of char without spaces

In an exam I found this exercise:
"Write a function that takes a file name (e.g. "text.txt") and returns a list of chars without blanks."
For example:
"text.txt" contains "ab e ad c"
the function must return -> [#"a",#"b",#"e",#"a",#"d",#"c"]
What is the easiest way to solve the exercise?
I've tried to use the TextIO library and the function input1, but I got stuck. I don't know how to implement the function recursively. Could someone help?
fun chars filename =
  let
    val f = TextIO.openIn filename
    val s = TextIO.inputAll f
  in
    TextIO.closeIn f;
    List.filter (fn c => c <> #" ") (explode s)
  end

haskell read a file and convert it map of list

The input file is a text file:
000011S\n
0001110\n
001G111\n
0001000\n
The desired result is:
[["0","0","0","0","1","1","S"], ["0","0","0","1","1","1","0"], [...]]
I read the text file with
file <- openFile nameFile ReadMode
and the final output, for example
[["a","1","0","b"],["d","o","t","2"]]
should be a map represented as a list of lists of characters.
I tried:
convert x = map (map read . words) $ lines x
but it returns [[String]].
How can I get the output I want, [[Char]]?
Is there an equivalent of words that splits into characters?
one solution
convert :: String -> [[String]]
convert = map (map return) . lines
should do the trick
remark
the return here is a neat trick to write \c -> [c] - wrapping a Char into a singleton list as lists are a monad
how it works
Let me try to explain this:
lines will split the input into lines: a [String] with each element of the list being one line
the outer map (...) . lines will then apply the function in (...) to each of these lines
the function inside, map return, will in turn go over each character of a line (remember: a String is just a list of Char) and apply return to each of these characters
now return here will just take a character and put it into a singleton list: 'a' -> ['a'] = "a", which is exactly what you wanted
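The steps above can also be sketched in Python, purely as an illustration in a different language: splitlines stands in for lines, and each character of a line becomes its own one-character string. The function name convert mirrors the Haskell one; note that splitlines also breaks on a few separators besides "\n", which does not matter for input like the example.

```python
def convert(text):
    # Outer comprehension: one entry per line (the role of lines).
    # Inner comprehension: one single-character string per character
    # (the role of map return on each line).
    return [[ch for ch in line] for line in text.splitlines()]


grid = convert("000011S\n0001110\n001G111\n0001000\n")
print(grid[0])  # ['0', '0', '0', '0', '1', '1', 'S']
```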
your example
Prelude> convert "000011S\n0001110\n001G111\n0001000\n"
[["0","0","0","0","1","1","S"]
,["0","0","0","1","1","1","0"]
,["0","0","1","G","1","1","1"]
,["0","0","0","1","0","0","0"]]
concerning your comment
if you expect convert :: String -> [[Char]] (which is just String -> [String]), then all you need is convert = lines!
[[Char]] == [String]
Prelude> map (map head) [["a","1","0","b"],["d","o","t","2"]]
["a10b","dot2"]
will fail for empty Strings though.
or map concat [[...]]

OCaml regexp "any" matching, where "]" is one of the characters

I'd like to match a string containing any of the characters "a" through "z", or "[" or "]", but nothing else. The regexp should match
"b"
"]abc["
"ab[c"
but not these
"2"
"(abc)"
I tried this:
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[\]]*$") s 0;;
content_check "]abc[";;
and got warned that the "escape" before the "]" was illegal, although I'm pretty certain that the equivalent in, say, sed or awk would work fine.
Anyhow, I tried un-escaping the bracket, but
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[]]*$") s 0;;
doesn't work at all, since it should match any of a-z or "[", then the first "]" closes the "any" selection, after which there must be any number of "]"s. So it should match
[abc]]]]
but not
]]]abc[
In practice, that's not what happens at all; I get the following:
# let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-zA-Z[]]*$") s 0;;
content_check "]abc[";;
content_check "[abc]]]";;
content_check "]abc[";;
val content_check : string -> bool = <fun>
# - : bool = false
# - : bool = false
# - : bool = false
Can anyone explain/suggest an alternative?
@Tim Pietzker's suggestion sounded really good, but appears not to work:
# #load "str.cma" ;;
let content_check(s:string):bool =
Str.string_match (Str.regexp "^[a-z[\\]]*$") s 0;;
content_check "]abc[";;
# val content_check : string -> bool = <fun>
# - : bool = false
#
nor does it work when I double-escape the "[" in the pattern, just in case. :(
Indeed, here's a MWE:
#load "str.cma" ;;
let content_check(s:string):bool =
Str.string_match (Str.regexp "[\\]]") s 0;;
content_check "]";; (* should be true *)
This is not going to really answer your question, but it will solve your problem. With the re library:
let re_set = Re.(rep (* "rep" is the star *) @@ alt [
    rg 'a' 'z' ; (* the range from a to z *)
    set "[]" ; (* the set composed of [ and ] *)
  ])
(* version that matches the whole text *)
let re = Re.(compile @@
  seq [ start ; re_set ; stop ])
let content_check s =
  Printf.printf "%s : %b\n" s (Re.execp re s)
let () =
  List.iter content_check [
    "]abc[" ;
    "[abc]]]" ;
    "]abc[" ;
    "]abc[" ;
    "abc##"
  ]
As you noticed, str from the stdlib is awkward, to put it mildly. re is a very good alternative, and it comes with several regexp syntaxes as well as combinators (which I tend to use, because I think they are easier to use than regexp syntax).
I'm an idiot. (But perhaps the designers of Str weren't so clever either.)
From the "Str" documentation: "To include a ] character in a set, make it the first character in the set."
With this, it's not so clear how to search for "anything except a ]", since you'd have to place the "^" in front of it. Sigh.
:(
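As an aside, Python's re module follows the same convention of putting ] first in a set, so the rule quoted above can be tried out there. This is a sketch in Python, not OCaml: content_check mirrors the function name from the question, and the variable names are made up for the example. Whether Str accepts the negated form in the same way is an assumption to check against the Str documentation.

```python
import re

# The ] comes first inside the set, so it is read as a literal character;
# [ needs no escaping here because it does not start the set.
allowed = re.compile(r"[]a-z[]*")


def content_check(s):
    # fullmatch requires the pattern to cover the whole string.
    return allowed.fullmatch(s) is not None


for s in ["b", "]abc[", "ab[c", "2", "(abc)"]:
    print(s, content_check(s))

# In Python's re, a negated set keeps ^ first and puts ] directly after it.
not_bracket = re.compile(r"[^]]")
print(not_bracket.fullmatch("a") is not None)
print(not_bracket.fullmatch("]") is not None)
```

The first three strings are accepted and the last two rejected, matching the examples in the question.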

apache-spark regex extract words from rdd

I try to extract words from a textfile.
Textfile:
"Line1 with words to extract"
"Line2 with words to extract"
"Line3 with words to extract"
The following works well:
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
val all = data.flatMap(a => "[a-zA-Z]+".r findAllIn a)
scala> data.count
res14: Long = 3
scala> all.count
res11: Long = 1419
But I want to extract the words for every line.
If I type
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
I get
scala> val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
<console>:17: error: type mismatch;
found : Char
required: CharSequence
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
What am I doing wrong?
Thanks in advance
Thank you for your answer.
The goal was to count the occurrences of words in a positive/negative word list.
This seems to work:
// load inputfile
val file_in = "/path/to/teststring.txt"
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
// load wordlists
val pos_file = "/path/to/pos_list.txt"
val neg_file = "/path/to/neg_list.txt"
val pos_words = sc.textFile(pos_file).cache().collect().toSet
val neg_words = sc.textFile(neg_file).cache().collect().toSet
// RegEx
val regexpr = """[a-zA-Z]+""".r
val separated = data.map(line => regexpr.findAllIn(line).toList)
// #_of_words - #_of_pos_words_ - #_of_neg_words
val counts = separated.map(list => (list.size,(list.filter(pos => pos_words contains pos)).size, (list.filter(neg => neg_words contains neg)).size))
Your problem is not really about Apache Spark. Your first map hands you one line at a time, but calling flatMap on that line (a String) iterates over its characters, so the regex receives a Char instead of a CharSequence. Spark or not, your code won't compile, for example in a Scala REPL:
> val lines = List("Line1 with words to extract",
"Line2 with words to extract",
"Line3 with words to extract")
> lines.map( line => line.flatMap("[a-zA-Z]+".r findAllIn _))
<console>:9: error: type mismatch;
found : Char
required: CharSequence
So if you want all the words from your lines using your regexp, just use flatMap once:
scala> lines.flatMap("[a-zA-Z]+".r findAllIn _)
res: List[String] = List(Line, with, words, to, extract, Line, with, words, to, extract, Line, with, words, to, extract)
Regards,
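To make the difference between the two shapes concrete, here is a small sketch in Python (not Scala or Spark), with a plain list standing in for the RDD. It contrasts one list of words per line, which is what the question asked for, with a single flat list, which is what flatMap produces. The variable names are chosen for the illustration only.

```python
import re

word = re.compile(r"[a-zA-Z]+")
lines = [
    "Line1 with words to extract",
    "Line2 with words to extract",
    "Line3 with words to extract",
]

# One list of words per line: apply the regex to each whole line (map).
per_line = [word.findall(line) for line in lines]

# A single flat list of all words: the per-line results concatenated (flatMap).
flat = [w for line in lines for w in word.findall(line)]

print(per_line[0])  # ['Line', 'with', 'words', 'to', 'extract']
print(len(flat))    # 15
```

In both cases the regex is applied to a whole line, never to a single character, which is the step the failing code got wrong.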

syntax error in ocaml because of String.concat

Let's say I have a list of type integer [blah;blah;blah;...] and I don't know the size of the list, and I want to pattern match on it so that the first element of the list is not printed. Is there any way to do this without using an if/else or getting a syntax error?
All I'm trying to do is parse a file whose lines look like a/path/to/blah/blah/../file.c
and print only path/to/blah/blah
For example, can it be done like this?
let out x = Printf.printf " %s \n" x
let _ = try
while true do
let line = input_line stdin in
...
let rec f (xpath: string list) : ( string list ) =
begin match Str.split (Str.regexp "/") xpath with
| _::rest -> out (String.concat "/" _::xpath);
| _ -> ()
end
But if I do this, I get a syntax error at the line with String.concat!
String.concat "/" _::xpath doesn't mean anything because _ is a pattern, not a value. _ can be used on the left-hand side of a pattern-matching case but not on the right-hand side.
What you want to do is String.concat "/" rest.
Even if _::xpath were correct, String.concat "/" _::xpath would be interpreted as (String.concat "/" _)::xpath whereas you want it to be interpreted as String.concat "/" (_::xpath).