PCRE in Haskell - what, where, how?

PCRE in Haskell - what, where, how? - regex

I've been searching for some documentation or a tutorial on Haskell regular expressions for ages. There's no useful information on the HaskellWiki page. It simply gives the cryptic message:
Documentation
Coming soonish.
There is a brief blog post which I have found fairly helpful, however it only deals with Posix regular expressions, not PCRE.
I've been working with Posix regex for a few weeks and I'm coming to the conclusion that for my task I need PCRE.
My problem is that I don't know where to start with PCRE in Haskell. I've downloaded regex-pcre-builtin with cabal but I need an example of a simple matching program to help me get going.
Is it possible to implement multi-line matching?
Can I get the matches back in this format: [(MatchOffset,MatchLength)]?
What other formats can I get the matches back in?
Thank you very much for any help!

There's also regex-applicative which I've written.
The idea is that you can assign some meaning to each piece of a regular expression and then compose them, just as you write parsers using Parsec.
Here's an example -- simple URL parsing.
import Text.Regex.Applicative
data Protocol = HTTP | FTP deriving Show
protocol :: RE Char Protocol
protocol = HTTP <$ string "http" <|> FTP <$ string "ftp"
type Host = String
type Location = String
data URL = URL Protocol Host Location deriving Show
host :: RE Char Host
host = many $ psym $ (/= '/')
url :: RE Char URL
url = URL <$> protocol <* string "://" <*> host <* sym '/' <*> many anySym
main = print $ "http://stackoverflow.com/questions" =~ url

There are two main options when wanting to use PCRE-style regexes in Haskell:
regex-pcre uses the same interface as described in that blog post (and also in RWH, as I think an expanded version of that blog post); this can be optionally extended with pcre-less. regex-pcre-builtin seems to be a pre-release snapshot of this and probably shouldn't be used.
pcre-light is bindings to the PCRE library. It doesn't provide the return types you're after, just all the matchings (if any). However, the pcre-light-extras package provides a MatchResult class, for which you might be able to provide such an instance. This can be enhanced using regexqq which allows you to use quasi-quoting to ensure that your regex pattern type-checks; however, it doesn't work with GHC-7 (and unless someone takes over maintaining it, it won't).
So, assuming that you go with regex-pcre:
According to this answer, yes.
I think so, via the MatchArray type (it returns an array, which you can then get the list out from).
See here for all possible results from a regex.

Well, I wrote much of the wiki page and may have written "Coming soonish". The regex-pcre package was my wrapping of PCRE using the regex-base interface, where regex-base is used as the interface for several very different regular expression engine backends. Don Stewart's pcre-light package does not have this abstraction layer and is thus much smaller.
The blog post on Text.Regex.Posix uses my regex-posix package which is also on top of regex-base. Thus the usage of regex-pcre will be very very similar to that blog post, except for the compile & execution options of PCRE being different.
For configuring regex-pcre the Text.Regex.PCRE.Wrap module has the constants you need. Use makeRegexOptsM from regex-base to specify the options.

regexpr is another PCRE-ish lib that's cross-platform and quick to get started with.

I find rex to be quite nice too, its ViewPatterns integration is a nice idea I think.
It can be verbose though but that's partially tied to the regex concept.
parseDate :: String -> LocalTime
parseDate [rex|(?{read -> year}\d+)-(?{read -> month}\d+)-
(?{read -> day}\d+)\s(?{read -> hour}\d+):(?{read -> mins}\d+):
(?{read -> sec}\d+)|] =
LocalTime (fromGregorian year month day) (TimeOfDay hour mins sec)
parseDate v#_ = error $ "invalid date " ++ v
That said I just discovered regex-applicative mentioned in one of the other answers and it may be a better choice, could be less verbose and more idiomatic, although rex has basically zero learning curve if you know regular expressions which can be a plus.

Related

How to parse parameters to a hubot script

New to hubot/coffeescript and inheriting and existing script.
I googled and found some unhelpful stuff like this: Hubot matching on multiple tokens per line?
What I want to do is be able to parse parameters to my Hubot message. For example:
startPlaceOrderListener = () ->
robot.respond /order me (.*)/i, (res) ->
and then follow it with what you want to order.
I can obviously re-invent the wheel and parse res.match[1] myself, but hubot already seems to have some regular expression parsing built in for its own use and I was wondering if there's a way to leverage that for my own nefarious purposes.

It turns out the coffeescript has regular expressions built in. So
/order me (.*)/i
is straight coffeescript.
To match a regular expression you can do:
/order me (.*)/i.test("Bob")
Where the i can be left out if you don't want to ignore case.

To parse the input value in CoffeeScript you can do something like:
robot.respond /open the (.*) doors/i, (res) ->
doorType = res.match[1]
if doorType is "pod bay"
res.reply "I'm afraid I can't let you do that."
else
res.reply "Opening #{doorType} doors"

How to replace characters in string Erlang?

I have this piece of code that gets sessionid, make it a string, and then create a set with key as e.g. {{1401,873063,143916},<0.16443.0>} in redis. I'm trying replace { characters in this session with letter "a".
OldSessionID= io_lib:format("~p",[OldSession#session.sid]),
StringForOldSessionID = lists:flatten(OldSessionID),
ejabberd_redis:cmd([["SADD", StringForSessionID, StringForUserInfo]]);
I've tried this:
re:replace(N,"{","a",[global,{return,list}]).
Is this a good way of doing this? I read that regexp in Erlang is not a advised way of doing things.

Your solution works, and if you are comfortable with it, you should keep it.
On my side I prefer list comprehension : [case X of ${ -> $a; _ -> X end || X <- StringForOldSessionID ]. (just because I don't have to check the function documentation :o)

re:replace(N,"{","a",[global,{return,list}]).
Is this a good way of doing this? I read that regexp in Erlang is not
a advised way of doing things.
According to official documentation:
2.5 Myth: Strings are slow
Actually, string handling could be slow if done improperly. In Erlang, you'll have to think a little more about how the strings are used and choose an appropriate representation and use the re module instead of the obsolete regexp module if you are going to use regular expressions.
So, either you use re for strings, or:
leave { behind(using pattern matching)
if, say, N is {{1401,873063,143916},<0.16443.0>}, then
{{A,B,C},Pid} = N
And then format A,B,C,Pid into string.

Since Erlang OTP 20.0 you can use string:replace/3 function from string module.
string:replace/3 - replaces SearchPattern in String with Replacement. 3rd function parameter indicates whether the leading, the trailing or all encounters of SearchPattern are to be replaced.
string:replace(Input, "{", "a", all).

Are my regex just wrong or is there a buggy behaviour in td-agent's format behaviour?

I am using fluentd, elasticsearch and kibana to organize logs. Unfortunately, these logs are not written using any standard like apache, so I had to come up with the regex for the format myself. I used this site here to verify that they are working: http://fluentular.herokuapp.com/ .
The logs have roughly this format here:
DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.
the format regex I am using is as follows:
format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/
Now, judging by that website that is supposed to test specifically fluentd's behaviour with regexes, the output SHOULD be this one:
Record
Key Value
pri DEBUG
date 24.04.2014
subject SingleActivityStrategy
msg Start Activitiy 'barbecue' zu verabeiten.
Instead though, I have this ?bug? that pri is always shortened to DEBU. Same for ERROR which becomes ERRO, only INFO stays INFO. I am not very experienced with regular expressions and I find it hard to believe that this is a bug, still it confuses me and any help is greatly appreciated.
I'm not sure I can link the complete config file because I dont personally own these log files and I am trying to keep it on a level that my boss won't get mad at me for posting sensitive information, but should it definately be needed, I will post them later on after having asked him how much I can reveal.
In general, the logs always look roughly like this:
First the priority, which is either DEBUG, ERROR or INFO, next the date , next what we call the subject which is always written in [ ] and finally just a message.
Here is a link to fluentular with the format I am using and a teststring that produces the right result in fluentular, but not in my config file:
Fluentular
Sorry I couldn't make it work like a regular link to just click on.
Another link to test out regex with my format and test string is this one:
http://rubular.com/r/dfXOkQYNXP
tl;dr version:
my td-agent format regex cuts off the last letter, although fluentular says it shouldn't. My fault or a bug?

How the regex would look if you're trying to match the data specifically:
(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)
In your format string, you were using . and ... for where your spaces and colon should be. I'm not to sure on why this works in Fluentular, but you should have matched the \: explicitly and each space between the values.
So you'd be looking at the following regular expression with the Fluentd fields (which are grouping names):
(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))
Meaning your td-agent.conf should look like:
<source>
type tail
path /var/log/foo/bar.log
pos_file /var/log/td-agent/foo-bar.log.pos
tag foo.bar
format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
</source>
I would also take a look into comparing Logstash vs. Fluentd. I like Logstash far more because you create Grok filters to match the type of data you want, and it makes formatting your fields much easier because you are providing an abstraction layer, but you essentially will get the same data.
And I would watch out when you're using sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like Regexr which gives immediate feedback and you can set global and multiline matching as well.

How to create Gmail filter searching for text only at start of subject line?

We receive regular automated build messages from Jenkins build servers at work.
It'd be nice to ferret these away into a label, skipping the inbox.
Using a filter is of course the right choice.
The desired identifier is the string [RELEASE] at the beginning of a subject line.
Attempting to specify any of the following regexes causes emails with the string release in any case anywhere in the subject line to be matched:
\[RELEASE\]*
^\[RELEASE\]
^\[RELEASE\]*
^\[RELEASE\].*
From what I've read subsequently, Gmail doesn't have standard regex support, and from experimentation it seems, as with google search, special characters are simply ignored.
I'm therefore looking for a search parameter which can be used, maybe something like atstart:mystring in keeping with their has:, in: notations.
Is there a way to force the match only if it occurs at the start of the line, and only in the case where square brackets are included?
Sincere thanks.

Regex is not on the list of search features, and it was on (more or less, as Better message search functionality (i.e. Wildcard and partial word search)) the list of pre-canned feature requests, so the answer is "you cannot do this via the Gmail web UI" :-(
There are no current Labs features which offer this. SIEVE filters would be another way to do this, that too was not supported, there seems to no longer be any definitive statement on SIEVE support in the Gmail help.
Updated for link rot The pre-canned list of feature requests was, er canned, the original is on archive.org dated 2012, now you just get redirected to a dumbed down page telling you how to give feedback. Lack of SIEVE support was covered in answer 78761 Does Gmail support all IMAP features?, since some time in 2015 that answer silently redirects to the answer about IMAP client configuration, archive.org has a copy dated 2014.
With the current search facility brackets of any form () {} [] are used for grouping, they have no observable effect if there's just one term within. Using (aaa|bbb) and [aaa|bbb] are equivalent and will both find words aaa or bbb. Most other punctuation characters, including \, are treated as a space or a word-separator, + - : and " do have special meaning though, see the help.
As of 2016, only the form "{term1 term2}" is documented for this, and is equivalent to the search "term1 OR term2".
You can do regex searches on your mailbox (within limits) programmatically via Google docs: http://www.labnol.org/internet/advanced-gmail-search/21623/ has source showing how it can be done (copy the document, then Tools > Script Editor to get the complete source).
You could also do this via IMAP as described here:
Python IMAP search for partial subject
and script something to move messages to different folder. The IMAP SEARCH verb only supports substrings, not regex (Gmail search is further limited to complete words, not substrings), further processing of the matches to apply a regex would be needed.
For completeness, one last workaround is: Gmail supports plus addressing, if you can change the destination address to youraddress+jenkinsrelease#gmail.com it will still be sent to your mailbox where you can filter by recipient address. Make sure to filter using the full email address to:youraddress+jenkinsrelease#gmail.com. This is of course more or less the same thing as setting up a dedicated Gmail address for this purpose :-)

Using Google Apps Script, you can use this function to filter email threads by a given regex:
function processInboxEmailSubjects() {
var threads = GmailApp.getInboxThreads();
for (var i = 0; i < threads.length; i++) {
var subject = threads[i].getFirstMessageSubject();
const regex = /^\[RELEASE\]/; //change this to whatever regex you want, this one should cover OP's scenario
let isAtLeast40 = regex.test(subject)
if (isAtLeast40) {
Logger.log(subject);
// Now do what you want to do with the email thread. For example, skip inbox and add an already existing label, like so:
threads[i].moveToArchive().addLabel("customLabel")
}
}
}
As far as I know, unfortunately there isn't a way to trigger this with every new incoming email, so you have to create a time trigger like so (feel free to change it to whatever interval you think best):
function createTrigger(){ //you only need to run this once, then the trigger executes the function every hour in perpetuity
ScriptApp.newTrigger('processInboxEmailSubjects').timeBased().everyHours(1).create();
}

The only option I have found to do this is find some exact wording and put that under the "Has the words" option. Its not the best option, but it works.

I was wondering how to do this myself; it seems Gmail has since silently implemented this feature. I created the following filter:
Matches: subject:([test])
Do this: Skip Inbox
And then I sent a message with the subject
[test] foo
And the message was archived! So it seems all that is necessary is to create a filter for the subject prefix you wish to handle.

Emacs-style Regex in Info-reader?

I am a Vim-user lost in the Emacs-style Regex of Info-reader. I want to match:
$ info find
?How-in-Info-reader? :%s#\(\\;.*\\+\)\|\(\\+.*\\;\)#WORKS!#g
INFO: "C-X n" to go through the matches
I am looking for the Emacs-counterpart for the Vim-command marked with "?How-in-Info-reader?".
How can you find the matches in Info-reader?

For the standalone info reader, your choices are more limited than when using Emacs proper for browsing *info* pages.
I'm not familiar with the details of ?How-in-Info-reader, but there are two ways (I can see to search in the standalone info browser.
M-x index-apropos SOMESTRING
will give you a list of all the index nodes which contain SOMESTRING.
And the other searches C-s (for interactive search) and / or s (non-interactive search) for a particular string in the current view (they don't drop down into the nodes).

I think you're trying to replace either backslash-semi-anystring-backslashes or backslashes-anystring-backslash-semi with "WORKS!" everywhere in the file. It doesn't look like info is an editor. it doesn't even look like it has regex searching. In emacs, I'd type esc-control-s (to get incremental regular expression search, which means you can try out expressions and see how they work).
Once you're in emacs, the search string you presented should work just fine if I've understood your question. You can also type Esc-r, and then type the first string ("\(\\;.*\\+\)\|\(\\+.*\\;\)"), a RETURN, and the replacement string ("#WORKS!").

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js