Regexp languages and replacements in Emacs - regex

When I use the regexp-builder, I need to escape things in a different way from the way I do it when using replace-regexp. Now, this thread explains that these two commands use a different syntax, but why is that?
Also, I went through this blog post: Re-builder: The Interactive Regexp Builder, and I added
(require 're-builder)
(setq reb-re-syntax 'string)
to my .emacs file following the advice on the site. However, I still need to type " around my regexp to make it work. I thought changing the syntax language would take care of this but it doesn't.
With this, my actual questions are:
Is it sill the case that Emacs does not support PCRE? Are there any workarounds to this?
Once I have the right regexp in regex-builder, is there any way to directly send the regexp to replace-regexp and enter the replacement string?

There's a package in the MELPA repository called pcre2el that adds PCRE support to many parts of Emacs, including regexp-builder and replace-regexp.

Regarding question #2: No (at least not by default), but there's another way to do that without using re-builder.
Start by doing a regexp isearch for your pattern. Because it's an isearch, you'll see the matches interactively, a bit like re-builder (albeit without coloured groupings).
Still in isearch, once you're happy with the pattern, type C-M-% to call isearch-query-replace-regexp which will prompt you for the replacement.
You can of course simply copy your re-builder string from its buffer and yank it as a replacement string (but that's undoubtedly not news).
I was curious about the need for quotes in re-builder with string syntax. It seems that's it's just a formality of the system, and reb-read-regexp returns everything between the first and last " when using that syntax. Maybe it's intended to ensure that leading or trailing whitespace can't confuse matters -- re-builder does use leading whitespace for improved visibility, and trailing whitespace would be harder to spot. Or maybe it just made some of the code more convenient/consistent.

No, Emacs doesn't support PCRE, and as far as I know there is no work-around for that.
I don't think so.
To answer your first question, why does re-builder use a different syntax than replace-regexp:
By default, re-builder uses the syntax that is appropriate for writing elisp programs. In the context of a written program, regexps are entered within strings. Inside a string, backslashes have a special meaning which conflicts with using the backslash as part of a regexp. Consequently, within a string, you need to double a backslash to use it to signify part of the regexp syntax.
replace-regexp, on the other hand, is designed to be used interactively by the user, and it explicitly expects the input to be a regexp. As a convenience, it interprets backslashes as regexp syntax, not as string escapes. Which is why you can use single backslashes in this context.

Related

Regex Replacement Syntax for number of replacent group occurences

Take the sample string:
__________Hello
I want to replace lines starting with 10 x _ with 20 x _
Desired output:
____________________Hello
I can do this a number of ways, i.e:
/^(_{10})/\1\1/
/^_{10}/____________________/
/^(__________)/\1\1/
etc...
Question:
Is there a way within the regex specification/expression itself - say PCRE (or any regex library/engine for that matter) - to specify the replacement occurence of a character ?
For example:
/_{10}/_{20}/
I don't know if I'm having a mind blank or if I've just never done this, but I cannot seem to find any such thing in the regex specification docs.
It can't be done within the Regex itself.
If I have the input "39572a4872" and I want to replace it with "39572aaaaa4872", there are many simple ways to achieve that, which can include Regular expressions, but as Wiktor explained in the comment thread, the actual quantifier of the replacement is not something itself that is achieved through regex.
It may seem unimportant, since in this example I could simply just apply the replacement 5 times manually or programatically, but one of the benefits of standardized technologies is applying the same concepts in different environments, languages, even within programs.
I as well as many others have had a lot of success with the portability of my regex because of this.
This question was to see if specifying quantifiers for replacement strings was possible within the syntax of a regex itself. Which it is surely not.

PCRE Regex Syntax

I guess this is more or less a two-part question, but here's the basics first: I am writing some PHP to use preg_match_all to look in a variable for strings book-ended by {}. It then iterates through each string returned, replaces the strings it found with data from a MySQL query.
The first question is this: Any good sites out there to really learn the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best one I've been able to find so far is http://www.regular-expressions.info/. In my opinion, the information there is not well-organized and since I'd rather not get hung up having to ask for help whenever I need to write a complex regex, please point me at a couple sites (or a couple books!) that will help me not have to bother you folks in the future.
The second question is this: I have this regex
"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"
and I need it to catch instances such as {first_name}, {last_name}, {email}, etc. I have three problems with this regex.
The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to solve this by checking for the existence of the space, then exploding on the space. Messy, but it works.
The second problem is that it includes punctuation as part of the captured string. So, if you have "{first_name} {last_name},", then it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semi-colons. While it works for those punctuation items, my logic is unable to handle exclamation points, question marks, and everything else.
The third problem I have with this regex is that it is not seeing instances of {email} at all.
Now, if you can, are willing, and have time to simply hand me the solution to this problem, thank you as that will solve my immediate problem. However, even if you can do this, please please provide an lmgfty that provides good web sites as references and/or a book or two that would provide a good education on this subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system is unable to procure said volume).
Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php
Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.
A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'
This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{([a-z]+(_[a-z]+)*\}/
This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
So now that we have something to compare your attempt to, let's have a look at what caused your problems:
Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
Here's a complete description in plain language of what your regex matches:
Match a {.
Match a _.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match a single letter.
Instead of that _ and the single letter, absolutely anything is okay, too.
Match a }.
This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.
This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.
Good luck! :)
For PCRE, I simply digested the PCRE manpages, but then my brain works that way anyway...
As for matching delimited stuff, you generally have 2 approaches:
Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
Match the first delimiter, match anything ungreedily, match the closing delimiter.
E.g. for your case:
\{([^}]+)\}
\{(.+?)\} - Note the ? after the +
I added a group around the content you'd likely want to extract too.
Note also that in the case of #1 in particular but also for #2 if "dot matches anything" is in effect (dotall, singleline or whatever your favourite regex flavour calls it), that they would also match linebreaks within - you'd need to manually exclude that and anything else you don't want if that would be a problem; see the above answer for if you want something more like a whitelist approach.
Here's a good regex site.
Here's a PCRE regex that will work: \{\w+\}
Here's how it works:
It's basically looking for { followed by one ore more word characters followed by }. The interesting part is that the word character class actually includes an underscore as well. \w is essentially shorthand for [A-Za-z0-9_]
So it will basically match any combination of those characters within braces and because of the plus sign will only match braces that are not empty.

How can I fix Emacs' syntax highlighting of qr"\.[^.]+$"?

Emacs is confusing syntax after a line like
fileparse($file, qr"\.[^.]+$");
And thinks the rest of the file is a string. How do I fix this?
Emacs is confused by the quotes in the qr expression. The standard regular expression delimiter (/) works:
fileparse($file, qr/\.[^.]+$/);
In fact, almost anything else seems to work.
fileparse($file, qr{(\.[^.]+$/});
fileparse($file, qr*\.[^.]+$*);
fileparse($file, qr#\.[^.]+$#);
My version of VIM doesn't get confused by the quotes, but I know that older versions of VIM did. Putting quotes is really confusing because it makes the qr expression look like a string (which it isn't). It's usually bad policy to use quotes as delimiters in regular expressions even though it's technically legal.
However, the really important question is what's fileparse? That's not a standard Perl function. I am assuming that this is imported from File::Basename? Would that be correct?
According to the documentation I have, the second argument in fileparse is suppose to be an array and not a quoted regular expression.
David W. probably has the right answer. Again, probably best to avoid using " as delimiter, but sometimes you need to use ' as the delimiter to prevent interpolation, therefore this may be useful to someone in that case.
So, if for some reason you REALLY want the quote delimiter (or to prevent interpolation using '), you can always do something like:
fileparse($file, qr"\.[^.]+$"); #"# highlight fix
an example using one that SO does badly:
s'hi'by'; #'# highlight fix
note: the second # is to trigger the comment highlighting in the editor so that highlight fix shows as a comment.

differentiating and testing regex variants

Several implementations of regular expressions differ from each other in subtle ways which is the source of much confusion when I try to use them.
Most of these differences include the semantics related to whether a character is escaped or not. This is most often an issue with parentheses, but can apply to curly brackets and others. This is probably a consequence of the syntax of the language or environment in which the implementation is found. For instance, if the $ symbol indicates a variable name in some language, one can expect regular expressions represented in that language would require escaping the "end of line" anchor to \$ or some such. But what gets confusing at this point is how you would represent an actual dollar sign. I believe Perl gets around this by wrapping a regex inside forward slashes /.
Similarly there are escapes for specific characters themselves, for instance non printing characters such as \n and \t. Then there are the similar looking generic character groups such as \d for digits, \s for whitespace, and \w which I just learned covers underscores as well as digits. I found myself on several occasions trying to use \a for a "alphabetical" group but this only ended up matching the bell character 0x07.
It's pretty clear that there is no simple and one-shot solution to knowing all of the differences in features and syntax offered by the myriad of implementations of regular expressions out there, short of somebody doing all the hard work and putting results in a well organized table. Here is one example of exactly this, but of course it doesn't cover several of the programs that I use extensively myself, which include vim, sed, Notepad++, Eclipse, and believe it or not MS Word (at least version 2010, I suspect 2007 also has this, they call it "wildcards") has a simple regex implementation too.
I guess what I want is to be as lazy as possible (in a certain sense) by trying to come up with a way to determine for any given regex implementation what its "escape settings" are beyond any doubt by applying one (or a few) queries.
I'm thinking I can make a file which contains test cases, along with a huge regex query, and somehow engineer it so that running it once will show me exactly what syntax I need to use subsequently without doubting myself any further. (as opposed to having to edit files and use multiple queries to figure out the same thing which gets terribly old after a while).
If nobody else has attempted to construct such a monstrosity, I may undertake this task myself. If it's even possible. Is this possible?
I tried to come up with an example (it was just to figure out if EOL anchor is $ or \$) but in every case I had to use a multitude of different search/replace queries in order to determine how the program will respond to the input.
Edit: I came up with something using capturing and backtracking. I gotta work on it a little more.
Update: Well, Notepad++ does not implement the OR operator commonly denoted by the pipe |. Word's "wildcards" is a poor substitute also, it doesn't have | or *. I'm fairly certain that missing any of the regular expression operators (union, concat, star) means it cannot generate a regular grammar, so those two are ruled out.
I can create an input file like this:
$
*
]
EOL
and query
(\$)|(\*)|(\[)|($)
replacing with
escDollar:\1:escStar:\2:escSQBrL:\3:Dollar:\4:
yields a result of (assuming unescaped parens is group and unescaped pipe is or)
escDollar:$:escStar::escSQBrL::Dollar::
escDollar::escStar:*:escSQBrL::Dollar::
]escDollar::escStar::escSQBrL::Dollar::
EOLescDollar::escStar::escSQBrL::Dollar::
I ran this in vim. This output would demonstrate the single characters that are matched by each item specified next to it, i.e. the escaped dollar sign item is seen to match the actual dollar sign character rather than the non escaped dollar sign item at the end.
It's difficult to see what's going on with the $ anchor since it matches zero characters, but it shouldn't be hard to find a solution for it. Besides it's not a commonly mistaken one. The ones I'm particularly worried about are pipe and parens and the different brackets. When you've got 4 different types in there there are 2^4 combinations of escaped and non-escaped versions of them you can use. Trial-and-error with that is horrific.
This output isn't too hard to parse at a glance, and is also seriously easy to process as part of a script. The one glaring problem that remains is figuring out whether parens and pipe need to be escaped. Because the functionality of the whole thing depends on them.
It would seem like that will require multiple queries. It may be possible with a cleverly engineered jumble of backslashes, parens, and pipes to figure out the combination (only 4 possibilities after all) with an initial query, then choose the subsequent matrix generator query based on it.
Something like this shows it can work:
(e)
(f)
querying
\((f\))|\|\((e\))
replace with
\1:\2
would produce:
:(e if escaped parens is group and escaped pipe is or
:e) if parens is group and escaped pipe is or
(f: if escaped parens is group and pipe is or
f): if parens is group and pipe is or
I still don't really like this though because it requires a second query on a second set of input. Too much setting up. I may just make 4 copies of the "matrix" thing.
The table on this page summarizes quite nicely which features are available in which regex implementations:
http://www.regular-expressions.info/refflavors.html

Seeking quoted string generator

Does anyone know of a free website or program which will take a string and render it quoted with the appropriate escape characters?
Let's say, for instance, that I want to quote
It's often said that "devs don't know how to quote nested "quoted strings"".
And I would like to specify whether that gets enclosed in single or double quotes. I don't personally care for any escape character other than backslash, but other's might.
If none of the double quotes of the string is already escaped, you can simply do:
str = str.replace(/"/g, "\\\"");
Otherwise, you should check if it is already escaped and replace only if it isn't; You can use lookbehind for that. The following is what came to my mind first but it would fail for strings like escaped backslash followed by quotes \\" :(
str = str.replace(/(?<!\\)"/g, "\\\"");
The following makes sure that the second last character, if exists, is not a backslash.
str = str.replace(/(?<!(^|[^\\])\\)"/g, "\\\"");
Update: Just remembered that JavaScript doesn't support look-behind; you can use the same regex on a look-behind supporting regex engine like perl/php/.net etc.
Any decent regex library in any decent programming language will have a function to do this - not that it's hard to write one yourself (as the other answers have indicated). So having a separate website or program to do it would be mostly useless.
Perl has the quotemeta function
PCRE's C++ wrapper has a function RE::QuoteMeta (warning: giant file at that link) which does the same thing
PHP has preg_quote if you're using Perl-compatible regexes
Python's re module has an escape function
In Java, the java.util.regex.Pattern class has a quote method
Perl and most of the other regular expression engines based on Perl have metacharacters \Q...\E, meaning that whatever comes between \Q and \E is interpreted literally
Most tools that use POSIX regular expressions (e.g. grep) have an option that makes them interpret their input as a literal string (e.g. grep -F)
In Python, for enclosing in single quotes:
import re
mystr = """It's often said that "devs don't know how to quote nested "quoted strings""."""
print("""'%s'""" % re.sub("'", r"\'", mystr))
Output:
'It\'s often said that "devs don\'t know how to quote nested "quoted strings"".'
You could easily adapt this into a more general form, and/or wrap it in a script for command-line invocation.
so, I guess the answer is "no". Sorry, guys, but I didn't learn anything that I don't know. Probably my fault for not phrasing the question correctly.
+1 for everyone who posted