Racket raw strings - regex

In Racket you have to escape backslashes in strings, so Windows paths and regexes become verbose.
For example, the regular expression (.*)\1 can be represented with the string "(.*)\\1" or the regexp constant #rx"(.*)\\1"; the \ in the regular expression must be escaped to include it in a string or regexp constant. [Source: Regexp Syntax]
In many languages, such as Perl and Ruby, regexes have their own literal syntax, e.g. /\([a-z]+\)/; others offer optional raw strings, like Python's r"\([a-z]+\)". It seems that Racket doesn't natively support raw strings, where you don't need to escape backslashes. Is there any way to get them: a third-party library, a proposal, whatever?
See also:
Regular Expressions # The Racket Guide
Regular Expressions # The Racket Reference
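For comparison, the Python raw-string form mentioned in the question can be checked directly. This is only a small sketch of what a raw string buys you; the variable names are illustrative and nothing here is Racket-specific.

```python
import re

# The same pattern written with escaped backslashes and as a raw string.
escaped = "(.*)\\1"
raw = r"(.*)\1"

# Both literals denote the identical six-character string.
same = escaped == raw

# (.*)\1 matches text that consists of some substring repeated twice.
match = re.fullmatch(raw, "abcabc")
repeated_part = match.group(1) if match else None
```

The raw string changes only how the literal is read; the regex engine receives the same characters either way.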

As Chris mentioned, a custom reader can do this.
An example of a reader that Racket already supplies, that you could use, is at-exp:
#lang at-exp racket
@~a{C:\Windows\win.ini}
;; "C:\\Windows\\win.ini"
@~a{This is a string
with newlines.}
;; "This is a string\nwith newlines."
I like to use ~a with this because it converts anything to a string, and it's only two characters to type.
However for your regexp example, you can't use ~a or #rx. Instead you should use regexp:
@regexp{(.*)\1}
;; #rx"(.*)\\1"
In all of these examples, @function{string} is read as (function "string"), basically. There are some nuances you can read about in the documentation for at-exp and Scribble.

Related

String matching in emacs lisp matching arbitrary string

In emacs lisp I only know the functions string-match[-p], but I know no method for matching a literal string to a string.
E.g. assume that I have a string generated by some function and want to know if another string contains it. In many cases string-match-p will work fine, but when the generated string contains regexp syntax, it will result in unexpected behaviour, maybe even crash if the regular expression syntax contained is invalid (e.g. unbalanced quoted parentheses \(, \)).
Is there some function in emacs lisp that is similar to string-match-p but doesn't interpret regular expression syntax?
As regexp-matching is implemented in C, I assume that matching the correct regexp is faster than some substring/string= loop; is there some method to escape an arbitrary string into a regular expression that matches that string and only that string?
Are you looking for regexp-quote?
The docs say:
(regexp-quote STRING)
Return a regexp string which matches exactly STRING and nothing else.
And I don't know that your assumption in #2 is correct, string= should be faster...
Either use regexp-quote as recommended by @trey-jackson, or don't use strings at all.
Emacs is not optimized for string handling; it is optimized for buffers. So, if you manipulate text, you might find it faster to create a temporary buffer, insert your text there, and then use search-forward to find your fixed string (non-regexp) in that buffer.
Perhaps cl-mismatch, an analogue to Common Lisp mismatch function? Example usage below:
(require 'cl-lib)
(cl-mismatch "abcd" "abcde")
;; 4
(cl-mismatch "abcd" "aabcd" :from-end t)
;; -1
(cl-mismatch "abcd" "aabcd" :start2 1)
;; nil
Ah, sorry, I didn't understand the question the first time. If you want to know whether the string is a substring of another string (may start at any index in the searched string), then you could use cl-search, again, an analogue of Common Lisp search function.
(cl-search "foo\\(bar" "---foo\\(bar")
;; 3

"Raw" string in Haskell for Regular Expression

I appear to be having trouble creating a regular expression in Haskell, what I'm trying to do is convert this string (which matches a URL in a piece of text)
\b(((\S+)?)(#|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b
Into a regular expression, the trouble is I keep getting this error in ghci
Prelude Text.RegExp> let a = fromString "\b(((\S+)?)(#|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b"
<interactive>:1:27:
lexical error in string/character literal at character 'S'
I'm guessing it's failing because Haskell doesn't understand \S as an escape code. Are there any ways to get around this?
In Scala you can surround a string with 3 double quotes, I was wondering if you could achieve something similar in Haskell?
Any help would be appreciated.
Every backslash in your string has to be written as a double backslash inside the double quotes. So
"\\b(((\\S+)?)(#|mailto\\:|(news|(ht|f)tp(s?))\\://)\\S+)\\b"
A more general remark: you'd be better off writing a proper parser rather than using regular expressions. Regular expressions rarely do exactly the right thing.
Haskell doesn't support raw strings out of the box, however, in GHC it's very easy to implement them using quasiquotation:
r :: QuasiQuoter
r = QuasiQuoter {
quoteExp = return . LitE . StringL
...
}
Usage:
ghci> :set -XQuasiQuotes
ghci> let s = [r|\b(((\S+)?)(#|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b|]
ghci> s
"\\b(((\\S+)?)(#|mailto\\:|(news|(ht|f)tp(s?))\\://)\\S+)\\b"
I've released a slightly more expanded and documented version of this code as the raw-strings-qq library on Hackage.
I'm a big fan of the Rex library:
http://hackage.haskell.org/package/rex
http://hackage.haskell.org/packages/archive/rex/0.4.2/doc/html/Text-Regex-PCRE-Rex.html
Which not only uses quasiquoting for nice regex entry (no double backslashes), it also uses perl-like regular expressions and not the default annoying POSIX regular expressions, and even allows you to use regular expressions as pattern matching your method parameters, which is genius.

Seeking quoted string generator

Does anyone know of a free website or program which will take a string and render it quoted with the appropriate escape characters?
Let's say, for instance, that I want to quote
It's often said that "devs don't know how to quote nested "quoted strings"".
And I would like to specify whether that gets enclosed in single or double quotes. I don't personally care for any escape character other than backslash, but others might.
If none of the double quotes of the string is already escaped, you can simply do:
str = str.replace(/"/g, "\\\"");
Otherwise, you should check whether it is already escaped and replace only if it isn't. You can use lookbehind for that. The following is what came to my mind first, but it would fail for strings like an escaped backslash followed by a quote, \\" :(
str = str.replace(/(?<!\\)"/g, "\\\"");
The following makes sure that the second last character, if exists, is not a backslash.
str = str.replace(/(?<!(^|[^\\])\\)"/g, "\\\"");
Update: Just remembered that JavaScript doesn't support look-behind (it was only added in ES2018); on older engines you can use the same regex with a look-behind supporting regex engine like perl/php/.net etc.
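A way to sidestep look-behind entirely is to consume each backslash together with the character it escapes, so that only bare quotes get rewritten. Here is a minimal Python sketch of that idea (Python's re module only accepts fixed-width look-behind, so the patterns above cannot be reused as-is). The function name escape_unescaped_quotes is made up for illustration.

```python
import re

# Match either a backslash plus the character it escapes, or a bare double quote.
_TOKEN = re.compile(r'\\.|"', re.DOTALL)

def escape_unescaped_quotes(text: str) -> str:
    """Backslash-escape double quotes that are not already escaped."""
    def fix(match: re.Match) -> str:
        token = match.group(0)
        # Escaped pairs (including \\ and \") pass through unchanged.
        return '\\"' if token == '"' else token
    return _TOKEN.sub(fix, text)
```

Because an escaped backslash is consumed as a pair, a quote that follows it is correctly treated as unescaped, which is the case the first look-behind attempt gets wrong.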
Any decent regex library in any decent programming language will have a function to do this - not that it's hard to write one yourself (as the other answers have indicated). So having a separate website or program to do it would be mostly useless.
Perl has the quotemeta function
PCRE's C++ wrapper has a function RE::QuoteMeta (warning: giant file at that link) which does the same thing
PHP has preg_quote if you're using Perl-compatible regexes
Python's re module has an escape function
In Java, the java.util.regex.Pattern class has a quote method
Perl and most of the other regular expression engines based on Perl have metacharacters \Q...\E, meaning that whatever comes between \Q and \E is interpreted literally
Most tools that use POSIX regular expressions (e.g. grep) have an option that makes them interpret their input as a literal string (e.g. grep -F)
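To make one of the entries above concrete, here is a short sketch using the escape function from Python's re module to search for a string literally. The helper name contains_literal is an assumption for the example, not a library function.

```python
import re

def contains_literal(needle: str, haystack: str) -> bool:
    """Return True if haystack contains needle, with no regex interpretation."""
    # re.escape backslash-escapes the regex metacharacters in needle.
    return re.search(re.escape(needle), haystack) is not None

# An unbalanced "\(" would be a problem as a raw pattern, but is fine once escaped.
found = contains_literal("foo\\(bar", "---foo\\(bar")

# Unescaped, "a.c" would also match "abc"; escaped, the dot is literal.
dot_is_literal = contains_literal("a.c", "abc")
```

For a plain containment test, Python's `needle in haystack` does the same job without a regex; escaping matters mostly when the literal text is one piece of a larger pattern.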
In Python, for enclosing in single quotes:
import re
mystr = """It's often said that "devs don't know how to quote nested "quoted strings""."""
print("""'%s'""" % re.sub("'", r"\'", mystr))
Output:
'It\'s often said that "devs don\'t know how to quote nested "quoted strings"".'
You could easily adapt this into a more general form, and/or wrap it in a script for command-line invocation.
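One possible more general form is sketched below. It is an assumption about what "general" should mean here: the caller picks the quote character, and backslashes are escaped as well (which the one-liner above does not do) so that the result can be unescaped unambiguously. The function name quote is illustrative.

```python
def quote(text: str, quote_char: str = '"') -> str:
    """Wrap text in quote_char, backslash-escaping backslashes and that quote."""
    if quote_char not in ("'", '"'):
        raise ValueError("quote_char must be a single or double quote")
    # Escape backslashes first so the ones added for quotes are not doubled.
    escaped = text.replace("\\", "\\\\").replace(quote_char, "\\" + quote_char)
    return quote_char + escaped + quote_char

sample = 'It\'s "quoted"'
single = quote(sample, "'")   # escapes only the apostrophe
double = quote(sample, '"')   # escapes only the double quotes
```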
So, I guess the answer is "no". Sorry, guys, but I didn't learn anything that I didn't already know. Probably my fault for not phrasing the question correctly.
+1 for everyone who posted

Is the syntax for writing regular expression standardized

Is the syntax for writing regular expression standardized? That is, if I write a regular expression in C++ it will work in Python or Javascript without any modifications.
No, there are several dialects of Regular Expressions.
They generally have many elements in common.
Some popular ones are listed and compared here.
Simple regular expressions, mostly yes. However, across the spectrum of programming languages, there are differences.
No, here are some differences that come to mind:
JavaScript lets you write inline regex (where \ in \s need not be escaped as \\s), that are delimited by the / character. You can specify flags after the closing /. JS also has RegExp constructor that takes the escaped string as the first argument and an optional flag string as second argument.
/^\w+$/i and new RegExp("^\\w+$", "i") are valid and the same.
In PHP, you can enclose the regex string inside an arbitrary delimiter of your choice (not sure of the super set of characters that can be used as delimiters though). Again you should escape backslashes here.
"|[0-9]+|" is same as #[0-9]+#
Python and C# support raw strings (not limited to regex, but really helpful for writing regex) that let you write unescaped backslashes in your regex.
"\\d+\\s+\\w+" can be written as r'\d+\s+\w+' in Python and @"\d+\s+\w+" in C#
Anchors and boundary markers like \<, \A etc. are not supported everywhere.
JavaScript did not support lookbehind or the DOTALL flag before ES2018.
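As a small illustration of such dialect differences, here is how Python's re module treats two of the constructs above. This is only a sketch of one engine's behaviour; other engines differ, which is the point.

```python
import re

# \b is Python's word boundary. \< (a GNU-style word start) is not special
# in Python's re module: it simply matches a literal "<".
boundary_hit = re.search(r"\bdog", "hot dog") is not None
gnu_style_hit = re.search(r"\<dog", "hot dog") is not None
literal_hit = re.search(r"\<dog", "hot <dog") is not None

# By default "." does not match a newline; the DOTALL flag changes that.
default_hit = re.search(r"a.b", "a\nb") is not None
dotall_hit = re.search(r"a.b", "a\nb", re.DOTALL) is not None
```

A pattern written for one engine can therefore compile without error under another and still mean something different.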

Regular expression opening and closing characters

When I learned regular expressions I learned they should start and end with a slash character (followed by modifiers).
For example /dog/i
However, in many examples I see them starting and ending with other characters, such as @, #, and |.
For example |dog|
What's the difference?
This varies enormously from one regex flavor to the next. For example, JavaScript only lets you use the forward-slash (or solidus) as a delimiter for regex literals, but in Perl you can use just about any punctuation character--including, in more recent versions, non-ASCII characters like « and ». When you use characters that come in balanced pairs like braces, parentheses, or the double-arrow quotes above, they have to be properly balanced:
m«\d+»
s{foo}{bar}
Ruby also lets you choose different delimiters if you use the %r prefix, but I don't know if that extends to the balanced delimiters or non-ASCII characters. Many languages don't support regex literals at all; you just write the regexes as string literals, for example:
r'\d+' // Python
@"\d+" // C#
"\\d+" // Java
Note the double backslash in the Java version. That's necessary because the string gets processed twice: once by the Java compiler and once by the compile() method of the Pattern class. Most other languages provide a "raw" or "verbatim" form of string literal that all but eliminates such backslash-itis.
And then there's PHP. Its preg regex functions are built on top of the PCRE library, which closely imitates Perl's regexes, including the wide variety of delimiters. However, PHP itself doesn't support regex literals, so you have to write them as if they were regex literals embedded in string literals, like so:
'/\d+/i' // match modifiers go after the slash but inside the quotes
"{\\d+}" // double-quotes may or may not require double backslashes
Finally, note that even those languages which do support regex literals don't usually offer anything like Perl's s/…/…/ construct. The closest equivalent is a function call that takes a regex literal as the first argument and a string literal as the second, like so:
s = s.replace(/foo/i, 'bar') // JavaScript
s.gsub!(/foo/i, "bar") // Ruby
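Python, which has no regex literals at all, follows the same function-call shape. A minimal sketch with made-up sample text:

```python
import re

text = "Foo fighters eat FOOD"

# Rough counterpart of Perl's s/foo/bar/gi: pattern, replacement, subject.
replaced_all = re.sub(r"foo", "bar", text, flags=re.IGNORECASE)

# count=1 stops after the first match, like s/foo/bar/i without the g.
replaced_first = re.sub(r"foo", "bar", text, count=1, flags=re.IGNORECASE)
```

Flags are passed as a separate argument rather than as trailing modifier letters, since there is no delimiter for them to follow.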
Some RE engines will allow you to use a different character so as to avoid having to escape those characters when used in the RE.
For example, with sed, you can use either of:
sed 's/\/path\/to\/directory/xx/g'
sed 's?/path/to/directory?xx?g'
The latter is often more readable. The former is sometimes called "leaning toothpicks". With Perl, you can use either of:
$x =~ /#!\/usr\/bin\/perl/;
$x =~ m!#\!/usr/bin/perl!;
but I still contend the latter is easier on the eyes, especially as the REs get very complex. Well, as easy on the eyes as any Perl code could be :-)
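In languages where the pattern is an ordinary string rather than a delimited literal, the problem does not arise in the first place. A short Python sketch of the two examples above, with illustrative input strings:

```python
import re

line = "cd /path/to/directory && ls"
# No delimiter, so the slashes in the path need no escaping.
replaced = re.sub(r"/path/to/directory", "xx", line)

script = "#!/usr/bin/perl -w"
# Neither "#" nor "!" is special in Python's default regex syntax.
has_shebang = re.search(r"#!/usr/bin/perl", script) is not None
```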