"Raw" string in Haskell for Regular Expression - regex

I'm having trouble creating a regular expression in Haskell. I'm trying to convert this string (which matches a URL in a piece of text)
\b(((\S+)?)(#|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b
into a regular expression, but I keep getting this error in GHCi:
Prelude Text.RegExp> let a = fromString "\b(((\S+)?)(#|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b"
<interactive>:1:27:
lexical error in string/character literal at character 'S'
I'm guessing it's failing because Haskell doesn't understand \S as an escape code. Are there any ways to get around this?
In Scala you can surround a string with three double quotes to get a raw string. Can you achieve something similar in Haskell?
Any help would be appreciated.

Every backslash in your string has to be written as a double backslash inside the double quotes. So
"\\b(((\\S+)?)(#|mailto\\:|(news|(ht|f)tp(s?))\\://)\\S+)\\b"
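If you have a lot of patterns to convert, the doubling rule can be applied mechanically. Here is a minimal sketch in Python (used purely for illustration; the helper name to_haskell_literal is made up) that turns raw regex text into a Haskell string literal. It assumes the pattern is printable ASCII and only deals with backslashes and double quotes.

```python
def to_haskell_literal(raw: str) -> str:
    """Return `raw` as a double-quoted Haskell string literal.

    Hypothetical helper: only backslashes and double quotes are escaped,
    so the input is assumed to contain no control or non-ASCII characters.
    """
    escaped = raw.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


if __name__ == "__main__":
    # The raw pattern from the question, written as a Python raw string.
    url_regex = r"\b(((\S+)?)(#|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b"
    print(to_haskell_literal(url_regex))
```

The printed text can be pasted into Haskell source as-is.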
A more general remark: you'd be better off writing a proper parser rather than using regular expressions. Regular expressions rarely do exactly the right thing.

Haskell doesn't support raw strings out of the box. In GHC, however, it's very easy to implement them using quasiquotation:
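import Language.Haskell.TH (Exp (LitE), Lit (StringL))
import Language.Haskell.TH.Quote (QuasiQuoter (..))

-- The quoted text becomes a string literal expression, unchanged.
r :: QuasiQuoter
r = QuasiQuoter {
    quoteExp = return . LitE . StringL
    ...
}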
Usage:
ghci> :set -XQuasiQuotes
ghci> let s = [r|\b(((\S+)?)(#|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b|]
ghci> s
"\\b(((\\S+)?)(#|mailto\\:|(news|(ht|f)tp(s?))\\://)\\S+)\\b"
I've released a slightly more expanded and documented version of this code as the raw-strings-qq library on Hackage.

I'm a big fan of the Rex library:
http://hackage.haskell.org/package/rex
http://hackage.haskell.org/packages/archive/rex/0.4.2/doc/html/Text-Regex-PCRE-Rex.html
It uses quasiquoting for nice regex entry (no double backslashes), it uses Perl-like regular expressions rather than the annoying default POSIX ones, and it even lets you use regular expressions as patterns for matching your function parameters, which is genius.

Related

Remove text between two characters (parenthesis) in a string

I'm working on a project and I want to remove text between two parentheses in a string.
Example:
std::string str = "I want to remove (this).";
How would I go about doing that?
I've searched Google and Stack Overflow and haven't found anything.
I'd use a regular expression for that. Check out the link I provided. As for the expression itself, the following one worked for me:
(\()(?:[^\)\\]*(?:\\.)?)*\)
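To make the idea concrete, here is a small sketch in Python (the question is about C++, so treat this as an illustration of the pattern rather than drop-in code). It uses a simpler pattern than the one above: it ignores backslash escapes and does not handle nested parentheses in a single pass. The function name remove_parenthesized is made up.

```python
import re

# Optional whitespace, an opening parenthesis, anything that is not a
# parenthesis, then the closing parenthesis.
PAREN_GROUP = re.compile(r"\s*\([^()]*\)")


def remove_parenthesized(text: str) -> str:
    """Drop every non-nested "(...)" group together with the space before it."""
    return PAREN_GROUP.sub("", text)


if __name__ == "__main__":
    print(remove_parenthesized("I want to remove (this)."))
```

For nested groups you could call the function repeatedly until the string stops changing.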
Conditionally replace regex matches in string
Do not confuse regular expressions with common expressions. This is not like the more common expressions :-) or :-O or >:( which, although effective, are mutually exclusive expressions that not many languages understand but that are more commonly used.

How to match Regular Expression with String containing a wildcard character?

Regular expression:
/Hello .*, what's up?/i
String which may contain any number of wildcard characters (%):
"% world, what's up?" (matches)
"Hello world, %?" (matches)
"Hello %, what's up?" (matches)
"Hey world, what's up?" (no match)
"Hello %, blabla." (no match)
I have thought of a solution myself, but I'd like to see what you are able to come up with (considering performance is a high priority). A requirement is the ability to use any regular expression; I only used .* in the example, but any valid regular expression should work.
A little automata theory might help you here. You say
this is a simplified version of matching a regular expression with a regular expression[1]
Actually, that does not seem to be the case. Instead of matching the text of a regular expression, you want to find regular expressions that can match the same string as a given regular expression.
Luckily, this problem is solvable :-) To see whether such a string exists, you would need to compute the intersection of the two regular languages and test whether the result is not the empty language. This might be a non-trivial problem, and solving it efficiently [enough] may be hard, but standard algorithms for it already exist. Basically, you would translate each expression into an NFA, convert each NFA into a DFA, and then intersect the two DFAs (for example with the product construction).
[1]: Indeed, the wildcard strings you're using in the question build some kind of regular language, and can be translated to corresponding regular expressions
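The final step can be sketched like this, assuming you already have both languages as DFAs: some string is accepted by both exactly when a pair of accepting states is reachable in the product automaton. This is only a rough Python illustration; the DFA representation (a start state, a set of accepting states, and a transition dict) and all names are my own choices, and building the DFAs from regular expressions is not shown.

```python
from collections import deque


def intersection_nonempty(dfa_a, dfa_b, alphabet):
    """Return True if some string is accepted by both DFAs.

    Each DFA is a tuple (start, accepting, delta) where delta maps
    (state, symbol) -> state; a missing entry means "reject".
    """
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    start = (start_a, start_b)
    seen = {start}
    queue = deque([start])
    # Breadth-first search over pairs of states (the product automaton).
    while queue:
        qa, qb = queue.popleft()
        if qa in accept_a and qb in accept_b:
            return True
        for sym in alphabet:
            na = delta_a.get((qa, sym))
            nb = delta_b.get((qb, sym))
            if na is None or nb is None:
                continue  # one side rejects, so the pair is a dead end
            if (na, nb) not in seen:
                seen.add((na, nb))
                queue.append((na, nb))
    return False


# Hand-built example DFAs over the alphabet "ab".
STARTS_WITH_A = (0, {1}, {(0, "a"): 1, (1, "a"): 1, (1, "b"): 1})
STARTS_WITH_B = (0, {1}, {(0, "b"): 1, (1, "a"): 1, (1, "b"): 1})
ENDS_WITH_B = (0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1})
```

The search visits at most |A| * |B| state pairs, so the expensive part in practice is the NFA-to-DFA conversion, not this check.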
Not sure that I fully understand your question, but if you're looking for performance, avoid regular expressions. Instead you can split the string on %. Then, take a look at the first and last matches:
// Anything before % should match at start of the string
targetString.indexOf(splits[0]) === 0;
// Anything after % should match at the end of the string
targetString.endsWith(splits[1]); // indexOf would only find the first occurrence, which may not be the one at the end
If you can use % multiple times within the string, then the first and last splits should follow the above rules. The remaining pieces just need to occur in the string, in order, between those two, and .indexOf (with a start position) is how you can check that.
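Put together, the split-based check might look like the sketch below. It is written in Python rather than JavaScript just to keep it short; wildcard_match and its parameter names are invented for this example, and it treats % as "any run of characters, possibly empty".

```python
def wildcard_match(pattern: str, target: str) -> bool:
    """Match `target` against `pattern`, where % stands for any substring."""
    parts = pattern.split("%")
    if len(parts) == 1:
        return pattern == target  # no wildcard at all
    first, last = parts[0], parts[-1]
    # The first piece must be a prefix and the last piece a suffix,
    # and the two must not overlap inside the target.
    if len(first) + len(last) > len(target):
        return False
    if not target.startswith(first) or not target.endswith(last):
        return False
    # Middle pieces must appear in order between the prefix and the suffix.
    pos = len(first)
    end = len(target) - len(last)
    for piece in parts[1:-1]:
        idx = target.find(piece, pos, end)
        if idx == -1:
            return False
        pos = idx + len(piece)
    return True


if __name__ == "__main__":
    print(wildcard_match("Hello %, what's up?", "Hello world, what's up?"))
```

Taking the leftmost occurrence of each middle piece is safe here because it leaves the most room for the pieces that follow.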
I came to realize that this is impossible with a regular language, and therefore the only solution to this problem is to replace the wildcard symbol % with .* and then match the two regular expressions against each other. This, however, cannot be done with traditional regular expressions; look at this SO question and its answers for details.
Or perhaps you could modify the underlying regular expression engine to support wildcard-based strings. An answer that solves this by extending the default implementation will be accepted ;-)

VB6 and C# regexes

I need to convert a VB6 project (a language I'm not familiar with) to C# 4.0. The project contains some regexes for string validation.
I need to know whether the regexes behave the same in both cases: if I just copy the regex string from the VB6 project to the C# project, will it work the same?
I have a basic knowledge of regexes and can just about read what one does, but flavors and such are a bit over my head at the moment.
For example, are these 2 lines equivalent?
VB6:
isStringValid = (str Like "*[!0-9A-Z]*")
C#:
isStringValid = Regex.IsMatch(str, "*[!0-9A-Z]*");
Thanks!
The old VB Like operator, despite appearances, is not a regular expression interface. It's more of a glob pattern matcher. See http://msdn.microsoft.com/en-us/library/swf8kaxw.aspx
In your example:
Like "*[!0-9A-Z]*"
matches any string that contains at least one character other than a digit or an uppercase letter: zero or more arbitrary characters, then one character outside 0-9 and A-Z, then zero or more arbitrary characters. The regular expression for this would be:
/.*[^0-9A-Z].*/
EDIT To answer your question: No, the two can't be used interchangeably. However, it's fairly easy to convert Like's operand into a proper regular expression:
Like        RegEx
==========  ==========
?           .
*           .*
#           \d
[abc0-9]    [abc0-9]
[!abc0-9]   [^abc0-9]
There are a few caveats to this, but that should get you started and cover most cases.
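As a rough illustration of the table, here is a sketch of the conversion in Python (the same logic ports directly to C#). The function names are made up, and it makes some simplifying assumptions: bracket groups are non-empty, an unmatched [ is taken literally, and Like's case-sensitivity settings (Option Compare) are ignored.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a VB Like pattern into an (unanchored) regex string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        end = pattern.find("]", i + 1) if ch == "[" else -1
        if ch == "?":
            out.append(".")
        elif ch == "*":
            out.append(".*")
        elif ch == "#":
            out.append(r"\d")
        elif ch == "[" and end != -1:
            body = pattern[i + 1:end]
            negate = body.startswith("!")
            if negate:
                body = body[1:]
            # Escape the few characters that are special inside a regex class.
            body = body.replace("\\", "\\\\").replace("[", "\\[").replace("^", "\\^")
            out.append("[" + ("^" if negate else "") + body + "]")
            i = end  # skip past the closing bracket
        else:
            out.append(re.escape(ch))  # everything else is a literal
        i += 1
    return "".join(out)


def like_matches(text: str, pattern: str) -> bool:
    """Like compares the whole string, so use a full match."""
    return re.fullmatch(like_to_regex(pattern), text, re.DOTALL) is not None


if __name__ == "__main__":
    print(like_to_regex("*[!0-9A-Z]*"))
```

Note that Like always tests the entire string, which is why the helper uses a full match instead of a search.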
In a word, yes.
These are the same. Some quick googling should give you answers to more complex issues.
http://social.msdn.microsoft.com/Forums/en-US/csharpgeneral/thread/bce145b8-95d4-4be4-8b07-e8adee7286f1/
http://www.regular-expressions.info/dotnet.html

Negation of a regular expression

I am not sure what it is called: negation, complement or inversion. The concept is this. For example, given the alphabet "ab":
R = 'a'
!R = the regexp that matches everything except what R matches
In this simple example it should be something like
!R = 'b*|[ab][ab]+'
What is such a regexp called? I remember from my studies that there is a way to calculate it, but it is complicated and generally too hard to do by hand. Is there a nice online tool (or regular software) to do that?
jbo5112's answer gives good practical help. However, on the theoretical side: a regular expression corresponds to a regular language, so the term you're looking for is complementation.
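To complement a regex:
1. Convert it into the equivalent NFA. This is a well-known and well-defined process.
2. Convert the NFA to a DFA via the powerset construction.
3. Complement the DFA by making accept states non-accepting and vice versa. The DFA must be complete for this to work, i.e. every state needs a transition for every symbol (add a non-accepting trap state if necessary).
4. Convert the DFA back to a regular expression.
You now have the complement of the original regular expression!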
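The complementation step itself is tiny once you have a DFA. Here is a hedged sketch in Python for the question's example R = 'a' over the alphabet "ab"; the DFA is written out by hand, the regex-to-DFA and DFA-to-regex conversions are not shown, and the state name "TRAP" is an arbitrary choice assumed not to clash with existing states.

```python
def complement_dfa(states, alphabet, start, accepting, delta):
    """Complete the DFA with a trap state, then swap accepting states."""
    trap = "TRAP"  # hypothetical name for the added dead state
    all_states = set(states) | {trap}
    full = dict(delta)
    for state in all_states:
        for sym in alphabet:
            # Any missing transition goes to the trap state.
            full.setdefault((state, sym), trap)
    return all_states, start, all_states - set(accepting), full


def accepts(start, accepting, delta, word):
    """Run a complete DFA on `word`."""
    state = start
    for sym in word:
        state = delta[(state, sym)]
    return state in accepting


# DFA for R = 'a': state 0 is the start, state 1 accepts after one 'a'.
R_STATES, R_START, R_ACCEPT = {0, 1}, 0, {1}
R_DELTA = {(0, "a"): 1}

_, NOT_R_START, NOT_R_ACCEPT, NOT_R_DELTA = complement_dfa(
    R_STATES, "ab", R_START, R_ACCEPT, R_DELTA
)

if __name__ == "__main__":
    for word in ["", "a", "b", "ab", "aa"]:
        print(repr(word), accepts(NOT_R_START, NOT_R_ACCEPT, NOT_R_DELTA, word))
```

On short words the complemented DFA agrees with the hand-written 'b*|[ab][ab]+' from the question, which is a handy sanity check.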
If all you're doing is searching, then some software/languages for regular expressions have a way to negate the match built in. For example, with grep you can use a '-v' option to get lines that don't match and the SQL variants I've seen allow you to use a 'not' qualifier to negate the match.
Another option that many regex dialects support is "negative lookahead". You may have to look up your specific syntax, but it's an interesting tool that is well worth reading about. Generally it's something like this: if R='<regex>', then Negative_of_R='(?!<regex>)'. Unfortunately, it can vary with the peculiarities of your language (e.g. vim uses \(<regex>\)\@!).
A word of caution: If you're not careful, a negated regular expression will match more than you expect. If you have the text This doesn't match 'mystring'. and search for (?!mystring), then it will produce a (zero-width) match at every position except the one where mystring begins, i.e. right before the 'm'.
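A quick way to see this, using Python's re module as one example dialect: the first part lists where the bare lookahead matches, and the second part shows one common way to express "the whole line does not contain the word", by anchoring the lookahead at the start. The helper name lacks is made up.

```python
import re

text = "This doesn't match 'mystring'."

# (?!mystring) is zero-width: it succeeds at every position where
# "mystring" does not start, so nearly every position is a match.
positions = [m.start() for m in re.finditer(r"(?!mystring)", text)]


def lacks(word: str, line: str) -> bool:
    """True if `line` does not contain `word` anywhere."""
    # Anchored at the start, the lookahead scans the entire line once.
    pattern = r"(?!.*" + re.escape(word) + r").*"
    return re.fullmatch(pattern, line, re.DOTALL) is not None


if __name__ == "__main__":
    print(text.index("mystring") in positions)
    print(lacks("mystring", text), lacks("mystring", "nothing to see here"))
```

If all you need is a yes/no answer, a plain substring test negated in code is usually clearer than either pattern.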

Substring match by regular expression

I am not very familiar with regular expressions. I want to do the following comparison using a regular expression.
Source word is : Hello124
In a list, I have following strings
Hello12
Hello
Hel
Hel123
Her
The output I want is (Hello12, Hello, Hel), i.e. starting from the source string, I drop the last character one at a time and look for a match in the list. Is it possible to use a regular expression to optimize this?
I am using C++ with the std::tr1 library.
You could try this:
^H(?:e(?:l(?:l(?:o(?:1(?:24?)?)?)?)?)?)?$
But in most languages it would be easier just to evaluate query.StartsWith(word) for each word.
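Both options are easy to try side by side. The sketch below is in Python rather than C++ to keep it short; prefix_regex builds the same kind of nested optional pattern as above for any source word, and matching_prefixes is the plain "starts with" filter. Both names are invented for this example.

```python
import re


def prefix_regex(source: str) -> str:
    """Build a regex that matches every non-empty prefix of `source`."""
    if not source:
        raise ValueError("source must be non-empty")
    tail = ""
    # Wrap from the last character inwards: ...(?:2(?:4)?)?
    for ch in reversed(source[1:]):
        tail = "(?:" + re.escape(ch) + tail + ")?"
    return "^" + re.escape(source[0]) + tail + "$"


def matching_prefixes(source, candidates):
    """Keep the candidates that are non-empty prefixes of `source`."""
    return [word for word in candidates if word and source.startswith(word)]


if __name__ == "__main__":
    words = ["Hello12", "Hello", "Hel", "Hel123", "Her"]
    print(matching_prefixes("Hello124", words))
    pattern = re.compile(prefix_regex("Hello124"))
    print([w for w in words if pattern.match(w)])
```

Both lines should select the same words; the prefix test avoids building and compiling a pattern at all.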
Of course, you can solve this problem with a regular expression, for example an anchored alternation of every prefix: ^(H|He|Hel|Hell|Hello|Hello1|Hello12|Hello124)$.
However, this is not very nice, and it is overkill. As far as I know, every language supporting regex also supports querying for substrings (you may want to look here if you find yours).