OCaml: How to remove all non-alphabetic characters from a string?

How do I remove all the non-alphabetic characters from a string?
E.g.
"Wë_1ird?!" -> "Wëird"
In Perl, I'd do this with =~ s/[\W\d_]+//g. In Python, I'd use
re.sub(ur'[\W\d_]+', u'', u"Wë_1ird?!", flags=re.UNICODE)
Etc.
AFAICT, Str.regexp does not support \W, \d, etc. (I can't
tell whether it supports Unicode, but somehow I doubt it).

Str doesn't support Unicode. Assuming you are dealing with UTF-8 encoded data, you can use Uutf and Uucp as follows:
let keep_alpha s =
  let b = Buffer.create 255 in
  let add_alpha () _ = function
    | `Malformed _ -> Uutf.Buffer.add_utf_8 b Uutf.u_rep
    | `Uchar u -> if Uucp.Alpha.is_alphabetic u then Uutf.Buffer.add_utf_8 b u
  in
  Uutf.String.fold_utf_8 add_alpha () s;
  Buffer.contents b
# keep_alpha "Wë_1ird?!";;
- : string = "Wëird"
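For comparison, the same per-code-point filter can be sketched in Python 3, where strings are already sequences of Unicode code points. This is only an analogy to the Uutf/Uucp version, not a port of it: str.isalpha tests the Unicode letter categories, which is close to but not identical to the Alphabetic property that Uucp.Alpha.is_alphabetic checks.

```python
def keep_alpha(s: str) -> str:
    """Keep only the code points that Python classifies as letters."""
    return "".join(ch for ch in s if ch.isalpha())


# "\u00eb" is the precomposed letter e-with-diaeresis from the question.
print(keep_alpha("W\u00eb_1ird?!"))
```

If the input used a decomposed form (a plain e followed by a combining mark), the combining mark would be dropped, so normalizing the input first may be worth considering.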

I'm not an expert in regexes and utf, but if I were in your shoes, then I would use re2 library, and this is my first approximation:
open Core.Std
open Re2.Std
open Re2.Infix
let drop _match = ""
let keep_alpha s = Re2.replace ~/"\\PL" ~f:drop s
The first three lines open libraries and bring their definitions into scope. You do not need to open a library to use it, but otherwise you have to prefix each definition with its module name. The OCaml Core library is designed so that a user opens its Std submodule to bring all the necessary definitions into scope. The Re2 library comes from the same authors and follows consistent conventions. open Re2.Infix brings infix (and prefix) operators into scope, namely ~/, which creates a regex from a string. The drop function just ignores its argument and returns an empty string. I've prefixed the parameter with an underscore, since that is the convention for unused parameters (and it is respected by the compiler). You can also use a plain underscore as a wildcard instead, as in let drop _ = "". Next comes the keep_alpha function, which substitutes an empty string for every Unicode character that doesn't belong to the letter class, i.e., removes it from the output.
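The same delete-by-substitution idea can be sketched in Python 3 for comparison. Python's standard re module has no \PL class, so this sketch falls back on the character class from the question; the name keep_alpha_re is just for illustration.

```python
import re

# Runs of non-word characters, decimal digits, or underscores.
# In Python 3, str patterns are Unicode-aware by default.
NON_LETTERS = re.compile(r"[\W\d_]+")


def keep_alpha_re(s: str) -> str:
    """Delete every run matched by NON_LETTERS."""
    return NON_LETTERS.sub("", s)


print(keep_alpha_re("W\u00eb_1ird?!"))
```

Note that \d only covers decimal digits, so a few numeric characters that count as word characters (superscript digits, for instance) would survive this filter.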
Update
I've checked my code and fixed the errors. I would also like to show how to play with this code in the toplevel. You have several options, but the easiest is to use the coretop script that ships with the core library. It uses the utop toplevel, so make sure that you have installed it:
$ opam install -y utop
Once that is done, you can start the toplevel:
$ coretop -require re2
The -require re2 flag will automatically find the re2 library and load it into your toplevel. You can load additional libraries without restarting utop with the following command:
# #require "libname";;
The first # is the toplevel's prompt, so you shouldn't type it, but the second one starts a directive, so make sure that you actually type it. Every directive starts with the # symbol. There are other useful directives in utop, namely:
# #use "filename.ml";; (* will load and evaluate filename.ml *)
# #list;; (* will list all available packages *)
# #typeof "keep_alpha";; (* will infer and print type of expression *)
The toplevel will not evaluate your code until you terminate it with the ;; sequence. You may sometimes see this ugly ;; in real code, but it is not needed there; it only tells the toplevel that you want it to evaluate your code right at this point and show you the result.

Related

What is the best way to specify a wildcard or regex in a "test" statement in configure.ac?

I am writing a configure.ac script for gnu autotools. In my code I have some if test statements where I want to set flags based on the compiler name. My original test looked like this:
if test "x$NETCDF_FC" = xifort; then
but sometimes the compiler name is more complicated (e.g., mpifort, mpiifort, path prepended, etc...), and so I want to check if the string ifort is contained anywhere within the variable $NETCDF_FC.
As far as I can understand, to set up a comparison using a wildcard or regex, I cannot use test but instead need to use the double brackets [[ ]]. But when configure.ac is parsed by autoconf to create configure, square brackets are treated like quotes and so one level of them is stripped from the output. The only solution I could get to work is to use triple brackets in my configure.ac, like this:
if [[[ $NETCDF_FC =~ ifort ]]]; then
Am I doing this correctly? Would this be considered best practices for configure.ac or is there another way?

Use a case statement. Either directly as shell code:
case "$NETCDF_FC" in
*ifort*)
do_whatever
;;
*)
do_something_else
;;
esac
or as m4sh code:
AS_CASE([$NETCDF_FC],
[*ifort*], [do_whatever],
[do_something_else])
I would not want to rely on a shell capable of interpreting [[ ]] or [[[ ]]] being present at configure runtime (you need to escape those a bit with [] to have the double or triple brackets make it into configure).
If you need a character class within a case pattern (such as e.g. *[a-z]ifort*), I would advise you to check the generated configure file for the case statement patterns which actually end up being used until you have enough [] quotes added around the pattern in the source configure.ac file.
Note that the explicit case statements often contain # ( shell comments at the end of the lines directly before the ) patterns to avoid editors becoming confused about non-matching opening/closing parentheses.
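To see which compiler names a pattern like *ifort* would accept, here is a small sketch in Python, whose fnmatch module implements similar shell-style wildcards. It is only an approximation of what the shell's case statement does, and the compiler names and paths below are made up for illustration.

```python
from fnmatch import fnmatchcase


def is_ifort(compiler: str) -> bool:
    """Mimic the `*ifort*` case pattern: true if 'ifort' occurs anywhere."""
    return fnmatchcase(compiler, "*ifort*")


# Hypothetical values of $NETCDF_FC.
for name in ["ifort", "mpiifort", "/opt/intel/bin/ifort", "gfortran"]:
    print(name, is_ifort(name))
```

As in a case pattern, the * here also matches across / characters, so a prepended path does not get in the way.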

What is this error? - FAILED DURING THE BUILDING PHASE

I got this error while building:
dist/package.conf.inplace:
inappropriate type
FAILED DURING THE BUILDING PHASE. The exception was: ExitFailure 1
How do I use subRegex in package Text.Regex?
I have written:
import Text.Regex.Posix
But I got this error:
_.hs:13:5: Not in scope: ‘subRegex’
_.hs:13:15:
Not in scope: ‘mkRegex’
Perhaps you meant ‘makeRegex’ (imported from Text.Regex.Posix)
So, I went to Text.Regex's [page][1], and there it said:
Uses the POSIX regular expression interface in Text.Regex.Posix.
So why aren't these functions in scope?

Here are some steps you can perform to make it work.
Download from http://hackage.haskell.org/package/regex-compat-0.92, unzip to <Haskell Platform INSTALL FOLDER>\2014.2.0.0\lib\
Run Haskell.
Type :mod +Text.Regex to load the package.
Type, e.g. subRegex (mkRegex "[0-9]+") "foobar567" "123"
Result is "foobar123" (after all packages are loaded).
Here is the subRegex description:
subRegex :: Regex    -- Search pattern
         -> String   -- Input string
         -> String   -- Replacement text
         -> String   -- Output string
Replaces every occurance of the given regexp with the replacement
string.
In the replacement string, "\1" refers to the first substring; "\2" to
the second, etc; and "\0" to the entire match. "\\" will insert a
literal backslash.
This does not advance if the regex matches an empty string. This
misfeature is here to match the behavior of the the original
Text.Regex API.
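For readers who know Python better than Haskell, the behaviour described above can be mirrored with re.sub. This is an analogy rather than the Text.Regex implementation: the argument order differs, and Python spells the whole match as \g<0> instead of \0.

```python
import re

# Same replacement as the subRegex call above: digits become "123".
print(re.sub(r"[0-9]+", "123", "foobar567"))

# \1 and \2 refer to the first and second captured groups.
print(re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20"))

# \g<0> stands for the entire match.
print(re.sub(r"\d+", r"<\g<0>>", "a1b22"))
```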
Some cool links that can help you delve deeper:
http://www.serpentine.com/blog/2007/02/27/a-haskell-regular-expression-tutorial/, and
https://wiki.haskell.org/Cookbook/Pattern_matching.
I am using it on Windows.

You shouldn't import Text.Regex.Posix, but rather just Text.Regex, because the two functions you want are there.
Have a look at the Hackage page: you were almost there, but the functions were actually in that module.

How to replace characters in string Erlang?

I have this piece of code that gets the session ID, turns it into a string, and then creates a set in Redis with a key such as {{1401,873063,143916},<0.16443.0>}. I'm trying to replace the { characters in this session ID with the letter "a".
OldSessionID= io_lib:format("~p",[OldSession#session.sid]),
StringForOldSessionID = lists:flatten(OldSessionID),
ejabberd_redis:cmd([["SADD", StringForSessionID, StringForUserInfo]]);
I've tried this:
re:replace(N,"{","a",[global,{return,list}]).
Is this a good way of doing this? I read that regexp in Erlang is not an advised way of doing things.

Your solution works, and if you are comfortable with it, you should keep it.
Personally, I prefer a list comprehension: [case X of ${ -> $a; _ -> X end || X <- StringForOldSessionID ]. (just because I don't have to check the function documentation :o)
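The per-character mapping in that comprehension can be sketched in Python for comparison. This is only an illustration of the idea; the sample string is the key from the question.

```python
sid = "{{1401,873063,143916},<0.16443.0>}"

# Map each character: "{" becomes "a", everything else is kept as-is.
mapped = "".join("a" if ch == "{" else ch for ch in sid)

print(mapped)
```

Only the opening braces are touched, exactly as in the re:replace call from the question; the closing braces stay in the result.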

re:replace(N,"{","a",[global,{return,list}]).
Is this a good way of doing this? I read that regexp in Erlang is not
an advised way of doing things.
According to official documentation:
2.5 Myth: Strings are slow
Actually, string handling could be slow if done improperly. In Erlang, you'll have to think a little more about how the strings are used and choose an appropriate representation and use the re module instead of the obsolete regexp module if you are going to use regular expressions.
So, either you use re for strings, or:
avoid the { characters altogether by using pattern matching.
If, say, N is {{1401,873063,143916},<0.16443.0>}, then
{{A,B,C},Pid} = N
and then format A, B, C and Pid into a string.
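A rough Python sketch of that destructure-then-format idea follows. The tuple layout mirrors the Erlang term, but representing the pid as a string and joining the fields with underscores are assumptions made purely for this illustration.

```python
def format_sid(session_id) -> str:
    """Unpack ((A, B, C), Pid) and build a brace-free key from the parts."""
    (a, b, c), pid = session_id
    # The "_" separator is an arbitrary choice for this sketch.
    return f"{a}_{b}_{c}_{pid}"


# Hypothetical stand-in for the Erlang term {{1401,873063,143916},<0.16443.0>}.
print(format_sid(((1401, 873063, 143916), "<0.16443.0>")))
```

Because the braces never end up in the string in the first place, no replacement step is needed afterwards.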

Since Erlang/OTP 20.0 you can use the string:replace/4 function from the string module.
string:replace/4 replaces SearchPattern in String with Replacement. The fourth parameter indicates whether the leading, the trailing or all occurrences of SearchPattern are to be replaced. Note that the result is a list of chardata pieces rather than a flat string, so it may need flattening before further use.
string:replace(Input, "{", "a", all).

C++ Application using TCL API to enable package autoloading

I want my C++ application to enable package autoloading for all ActiveTcl packages in C:\Tcl\lib. I pass the Tcl command below to Tcl_Eval() in my C++ code, and I expect that "package require <package name>" will then automatically find the package and load it.
set ::auto_path [file join {C:\Tcl\lib}]
But it doesn't work the way it does in the Tcl shell: the Tcl shell looks for pkgIndex.tcl in auto_path, so "package require" can find the right package or shared libs. Is it possible to make this work in a C++ application? Or is there a better way?

OK, I think I know what the problem might be. The auto_path is a Tcl list of directories — the code that uses it iterates over the list with foreach — that are searched for packages (and auto-loaded scripts, an old mechanism that I think is rather more trouble than it's worth). Yet you're using a single element, the output of that file join. This wouldn't usually matter on platforms other than Windows, as the directory separator is / there (and that's just an ordinary non-whitespace character to Tcl) but on Windows the directory separator is \ and that's a list metasyntax character to Tcl.
What does this mean? Well, after:
set ::auto_path [file join {C:\Tcl\lib}]
We can ask what the things of that list are. For example, we can print the first element of the list…
puts [lindex $::auto_path 0]
What does that output? Probably this:
C:Tcllib
OOooops! The backslashes have been taken as quoting characters, leaving an entirely non-functioning path. That won't work.
The fix is to use a different way to construct the auto_path. I think you'll find that this does what you actually want:
set ::auto_path [list {C:\Tcl\lib}]
Though this is an alternative (still using list; it's best for trouble-free list construction in all cases):
set ::auto_path [list [file normalize {C:\Tcl\lib}]]
(My bet is that you're trying to use file join as a ghetto file normalize. Don't do that. It's been poor practice for a long time now, especially now that we have a command that actually does do what you want.)
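The backslash problem described above is not unique to Tcl. As a loose analogy only (Python's shlex is a shell-style splitter, not Tcl's list parser), the sketch below shows the same effect: splitting a raw string into words eats the backslashes, while quoting the element first keeps the path intact, much as list does in Tcl.

```python
import shlex

raw = r"C:\Tcl\lib"

# Treating the raw string as a word list consumes the backslashes as escapes.
print(shlex.split(raw))

# Quoting the element before it goes into the "list" preserves it.
print(shlex.split(shlex.quote(raw)))
```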

PCRE in Haskell - what, where, how?

I've been searching for some documentation or a tutorial on Haskell regular expressions for ages. There's no useful information on the HaskellWiki page. It simply gives the cryptic message:
Documentation
Coming soonish.
There is a brief blog post which I have found fairly helpful, however it only deals with Posix regular expressions, not PCRE.
I've been working with Posix regex for a few weeks and I'm coming to the conclusion that for my task I need PCRE.
My problem is that I don't know where to start with PCRE in Haskell. I've downloaded regex-pcre-builtin with cabal but I need an example of a simple matching program to help me get going.
Is it possible to implement multi-line matching?
Can I get the matches back in this format: [(MatchOffset,MatchLength)]?
What other formats can I get the matches back in?
Thank you very much for any help!

There's also regex-applicative which I've written.
The idea is that you can assign some meaning to each piece of a regular expression and then compose them, just as you write parsers using Parsec.
Here's an example -- simple URL parsing.
import Text.Regex.Applicative
data Protocol = HTTP | FTP deriving Show
protocol :: RE Char Protocol
protocol = HTTP <$ string "http" <|> FTP <$ string "ftp"
type Host = String
type Location = String
data URL = URL Protocol Host Location deriving Show
host :: RE Char Host
host = many $ psym $ (/= '/')
url :: RE Char URL
url = URL <$> protocol <* string "://" <*> host <* sym '/' <*> many anySym
main = print $ "http://stackoverflow.com/questions" =~ url

There are two main options when wanting to use PCRE-style regexes in Haskell:
regex-pcre uses the same interface as described in that blog post (and also in RWH, which I think contains an expanded version of that blog post); this can be optionally extended with pcre-less. regex-pcre-builtin seems to be a pre-release snapshot of this and probably shouldn't be used.
pcre-light is bindings to the PCRE library. It doesn't provide the return types you're after, just all the matchings (if any). However, the pcre-light-extras package provides a MatchResult class, for which you might be able to provide such an instance. This can be enhanced using regexqq which allows you to use quasi-quoting to ensure that your regex pattern type-checks; however, it doesn't work with GHC-7 (and unless someone takes over maintaining it, it won't).
So, assuming that you go with regex-pcre:
Multi-line matching: according to this answer, yes.
Matches as [(MatchOffset,MatchLength)]: I think so, via the MatchArray type (it returns an array, which you can then get the list out from).
Other formats: see here for all possible results from a regex.
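To make the requested result shape concrete, here is a small Python sketch that returns matches as (offset, length) pairs and also shows multi-line anchoring. Python's re module is neither PCRE nor regex-pcre, so take this as an illustration of the output format only; match_spans is a made-up helper name.

```python
import re


def match_spans(pattern: str, text: str, flags: int = 0):
    """Return every match as an (offset, length) pair."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(pattern, text, flags)]


print(match_spans(r"\d+", "foo 12 bar 345"))

# With MULTILINE, ^ and $ anchor at every line, not just the whole string.
print(match_spans(r"^\w+$", "ab\ncd", re.MULTILINE))
```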

Well, I wrote much of the wiki page and may have written "Coming soonish". The regex-pcre package was my wrapping of PCRE using the regex-base interface, where regex-base is used as the interface for several very different regular expression engine backends. Don Stewart's pcre-light package does not have this abstraction layer and is thus much smaller.
The blog post on Text.Regex.Posix uses my regex-posix package which is also on top of regex-base. Thus the usage of regex-pcre will be very very similar to that blog post, except for the compile & execution options of PCRE being different.
For configuring regex-pcre the Text.Regex.PCRE.Wrap module has the constants you need. Use makeRegexOptsM from regex-base to specify the options.

regexpr is another PCRE-ish lib that's cross-platform and quick to get started with.

I find rex to be quite nice too, its ViewPatterns integration is a nice idea I think.
It can be verbose though but that's partially tied to the regex concept.
parseDate :: String -> LocalTime
parseDate [rex|(?{read -> year}\d+)-(?{read -> month}\d+)-
(?{read -> day}\d+)\s(?{read -> hour}\d+):(?{read -> mins}\d+):
(?{read -> sec}\d+)|] =
LocalTime (fromGregorian year month day) (TimeOfDay hour mins sec)
parseDate v@_ = error $ "invalid date " ++ v
That said I just discovered regex-applicative mentioned in one of the other answers and it may be a better choice, could be less verbose and more idiomatic, although rex has basically zero learning curve if you know regular expressions which can be a plus.
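The bind-and-convert idea behind the rex example can be sketched in Python with named groups, for readers who want to compare. This is an analogy rather than a translation of rex; DATE_RE and parse_date are names invented for this sketch.

```python
import re
from datetime import datetime

# Each named group corresponds to one datetime() keyword argument.
DATE_RE = re.compile(
    r"(?P<year>\d+)-(?P<month>\d+)-(?P<day>\d+)\s"
    r"(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+)"
)


def parse_date(text: str) -> datetime:
    """Parse 'YYYY-MM-DD hh:mm:ss', raising ValueError on bad input."""
    m = DATE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"invalid date {text}")
    return datetime(**{name: int(value) for name, value in m.groupdict().items()})


print(parse_date("2014-05-30 12:34:56"))
```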
