Can someone please explain this elisp regexp - regex

Can someone please explain the following regexp, which I found in ediff-trees.el as a specification for which files/directories to exclude from its comparison process?
"\\`\\(\\.?#.*\\|.*,v\\|.*~\\|\\.svn\\|CVS\\|_darcs\\)\\'"
Although I am somewhat familiar with regular expressions, encountering this elisp string-based variant has thrown me off.

First thing, remember that elisp's regexes have to be string-escaped, which creates a lot of extra backslashes. Removing them, we get
\`\(\.?#.*\|.*,v\|.*~\|\.svn\|CVS\|_darcs\)\'
Then, \( and \) mean grouping, "foo\|bar" means "either foo, or bar".
So, piece by piece, this regexp matches: either an emacs temporary file (something starting with #, possibly preceded by a period: \.?#.*), or an RCS file (ending in ,v: .*,v), or an emacs backup file (ending in ~: .*~), or an svn directory (\.svn), or a cvs directory (CVS), or a darcs directory (_darcs).
Edit to correct: as andre-r correctly points out, the backtick \` and single quote \' basically mean "beginning and end of the string" (respectively). So this means that the regexp finds strings which match exactly one of the choices I've outlined above (i.e., the string starts, then comes one of those choices, then the string ends). I had previously said they meant quoting, I don't know what I was thinking :). Thanks andre-r!
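For readers more at home with Perl-style syntax, here is a rough Python translation of the same pattern. It is only a sketch for checking the explanation above, not anything ediff-trees.el does itself: Python's \A and \Z stand in for \` and \', and the names EXCLUDE and is_excluded as well as the sample file names are made up for illustration.

```python
import re

# Python translation of the elisp pattern. \A and \Z anchor at the very
# start and end of the string, like \` and \' in Emacs.
EXCLUDE = re.compile(r"\A(\.?#.*|.*,v|.*~|\.svn|CVS|_darcs)\Z")


def is_excluded(name):
    """Return True if the whole name matches one of the alternatives."""
    return EXCLUDE.match(name) is not None


if __name__ == "__main__":
    samples = ["#notes.txt#", ".#lock", "main.c,v", "main.c~",
               ".svn", "CVS", "_darcs", "main.c", "CVS2"]
    for sample in samples:
        print(sample, is_excluded(sample))
```

Because of the anchors, a name such as CVS2 is not excluded: the CVS alternative matches, but the end-of-string anchor then fails.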

Sorry, this isn't really an answer; it's merely a comment to rbp's answer. But I can't figure out how to get the code sample to render nicely inside a comment, whereas it looks fine here in this answer.
Anyway:
I dunno about you, but I find
(rx bos (group (or (and (zero-or-one ".") "#" (zero-or-more nonl))
                   (and (zero-or-more nonl) ",v")
                   (and (zero-or-more nonl) "~")
                   ".svn"
                   "CVS"
                   "_darcs"))
    eos)
a lot easier to read -- and it's exactly equivalent.

Parentheses in elisp regexes need to be escaped. Backslashes in strings need to be escaped, so you end up with \\( and \\) when any sensible regex parser would just use ( and ). Don't get me wrong, I love Emacs, but having to escape parentheses in a regex was a really bad idea. The pipes and periods and backticks are also being escaped - that's why you've got this hell of double backslashes. Strip out those and you get (in regex literal form):
\A(\.?#.*|.*,v|.*~|\.svn|CVS|_darcs)\z
See this question for more discussion on the subject of escaped parens in elisp.

Related

Vim S&R to remove number from end of InstallShield file

I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES&gt
PATH_TO_MY_FILES1&gt
...
PATH_TO_MY_FILES22&gt
PATH_TO_MY_FILES_ELSEWHERE&gt
PATH_TO_MY_FILES_ELSEWHERE1&gt
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s\(PATH_TO_MY_FILES\w*\)\(\d+\)&gt:gc
But this doesn't work, i.e. if I just do a vim search on that, it doesn't find anything. However, if I use this:
:%s\(PATH_TO_MY_FILES\w*\)\(\d\)&gt:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22&gt to get replaced with PATH_TO_MY_FILES2&gt. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22&gtgt
So basically, it looks like it finds PATH_TO_MY_FILES22&gt, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping, and it looked correct, but maybe a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)&gt
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.
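To see the grouping issue outside Vim, here is a small Python sketch. Python's syntax differs from Vim's on both points raised above: + is a quantifier without a backslash, and & has no special meaning in the replacement string. The names strip_number and STRIP_NUMBER are hypothetical, and using a lazy \w*? is just one way to keep the digits out of the first group.

```python
import re

line = "PATH_TO_MY_FILES22&gt"

# Greedy \w* grabs the digits too, then gives back just one of them so
# that \d+ can match. This reproduces the (PATH_TO_MY_FILES2)(2) grouping.
greedy_groups = re.match(r"(PATH_TO_MY_FILES\w*)(\d+)&gt", line).groups()

# A lazy \w*? leaves all trailing digits to \d+.
STRIP_NUMBER = re.compile(r"(PATH_TO_MY_FILES\w*?)\d+&gt")


def strip_number(text):
    """Drop the trailing number from each file location link."""
    # In Python's replacement string, \1 is group 1 and "&" is literal.
    return STRIP_NUMBER.sub(r"\1&gt", text)


if __name__ == "__main__":
    print(greedy_groups)
    print(strip_number(line))
    print(strip_number("PATH_TO_MY_FILES_ELSEWHERE1&gt"))
```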

How to highlight words beginning with ‘#’ in Vim syntax?

I have a very simple Vim syntax file for personal notes. I would like to highlight people's name and I chose a Twitter-like syntax #jonathan.
I tried:
syntax match notesPerson "\<#\S\+"
To mean: words beginning with # and having at least one non-whitespace character. The problem is that # seems to be a special character in Vim regular expressions.
I tried to escape \# and enclose in brackets [#], the usual tricks, but that didn't work. I could try something like (^|\s) (beginning of line or whitespace) but that's exactly the problem that word-boundary tries to solve.
Highlighting works on simplified regular expressions, so this is more a question of finding the right regex than anything else. What am I missing?
# is a special character only if you have enabled the “very magic”
mode by having \v somewhere in the pattern prior to that #.
You have another problem here: # does not start a new word. \< is
not just “word boundary” like perl/PCRE’s \b, but “left word
boundary” (in help: “beginning of the word”) meaning that \< must be
followed by some keyword character. As # is not normally a keyword
character, pattern \<# will never match. (And even if it was like
\b, it would match constructs like abc#def which is definitely not
what you want for the aforementioned reasons.)
You should use \k\@<!#\k\S* instead: \k\@<! ensures that # is not preceded by any keyword character, and \k\S* makes sure that the first character of the name is a keyword one (you could probably also use #\<\S\+).
There is another solution: include # into 'iskeyword' option and leave the regex as is:
:setlocal iskeyword+=#-#
See :help 'isfname' for the explanation why #-# is used here.
(The 'iskeyword' option has exactly the same syntax and will,
in fact, redirect you there for the explanation.)
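The same idea can be tried outside Vim. The Python sketch below is only an approximation: (?<!\w) plays the role of the negative lookbehind \k\@<!, and \w stands in for \k, whose real meaning in Vim depends on 'iskeyword'. The names PERSON and find_people are made up for the example.

```python
import re

# "#" not preceded by a word character, followed by one word character
# and then any run of non-whitespace characters.
PERSON = re.compile(r"(?<!\w)#\w\S*")


def find_people(text):
    """Return every #name tag that does not start in the middle of a word."""
    return PERSON.findall(text)


if __name__ == "__main__":
    print(find_people("ask #jonathan about abc#def and #ann"))
```

Here abc#def is skipped because the # is preceded by a word character, which is exactly the case the lookbehind is meant to rule out.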

Parsing a game's console commands?

I need to be able to handle data that can look like:
set setting1 "bind button_x +actionslot1;bind button_y \" bind button_x +stance \" "
bind button_a jump
set setting2 1 1 0 1
toggle setting_3 " \"value 1\" \"value 2\" \"value 3\" "
These are what some of the commands for the console of a game look like, and I'm trying to write an emulator of sorts that will interpret the code the same way the game will.
The first thing that comes to mind is regex, but I'm not sure it's the best option. For example, when matching the value of a setting, I might try something like /set [\w_]+ "?(.+)"?/, but the wildcard matches the ending quote because it's not lazy, and if I make it lazy, it matches the quote inside the value. If I make it greedy and stop it from matching the quotes, it won't match the escaped quotes in the values.
Even if there are possible regex solutions, they seem like the wrong option. I had asked before about how programs like Visual Studio and Notepad++ know which parentheses and curly braces matched, and I was told there was something similar to regex in some ways but much more powerful.
The only other thing I can think of is to go through the lines of code character by character and use booleans to track the state of the current character.
What are my options here? What do game developers use to handle console commands?
edit: Here's another possible command which strongly deters me from using regex:
set setting4 "bind button_a \" bind button_b "\" set setting1 0 \" " \" "
The commands include not just escaped quotes, but quotes of the manner "\" inside escaped quotes.
I would suggest you read about lexical analysis, the process of tokenizing your text using a grammar.
I think it will help you with what you are trying to do.
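As a rough illustration of what a hand-written lexer for these commands might look like, here is a minimal character-by-character tokenizer in Python. It is a sketch under stated assumptions, not how any particular game does it: tokens are separated by whitespace, double quotes delimit a string, and \" inside a string is a literal quote. The name tokenize is hypothetical, and the sketch handles neither the nested "\" form from the edited question nor the splitting of ; separated commands.

```python
def tokenize(line):
    """Split one console line into tokens, honoring "..." and \\" escapes."""
    tokens = []
    current = []        # characters of the token being built
    in_quotes = False   # True while inside a double-quoted string
    has_token = False   # True once the current token has started
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == "\\" and i + 1 < len(line) and line[i + 1] == '"':
                current.append('"')   # escaped quote becomes a literal quote
                i += 2
                continue
            if ch == '"':
                in_quotes = False     # closing quote ends the string
            else:
                current.append(ch)
        elif ch == '"':
            in_quotes = True          # opening quote starts a string
            has_token = True
        elif ch.isspace():
            if has_token:             # whitespace ends the current token
                tokens.append("".join(current))
                current = []
                has_token = False
        else:
            current.append(ch)
            has_token = True
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted string")
    if has_token:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(tokenize("bind button_a jump"))
    print(tokenize(r'toggle setting_3 " \"value 1\" \"value 2\" \"value 3\" "'))
```

A quoted value that itself contains commands could then be passed through tokenize again when it is executed, which is one way to deal with one level of nesting.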
I don't want to keep you on the path of regex -- you are correct that there are non-regex solutions that may be more appropriate (I just don't know what they are). However, here is one possible regex that should fix your quotes issue:
/set [\w_]+ "?((\\"|[^"])+)"?/
I changed .+ to (\\"|[^"])+. Basically it's matching occurrences of \" OR of anything that isn't a quote. In other words, it will match anything except quotes that aren't escaped.
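To check the pattern against the sample lines, here it is in Python, written as a raw string so the backslashes stay readable. The names SET_VALUE and setting_value are made up for this sketch.

```python
import re

# Same pattern as above; group 1 is the value with escaped quotes kept.
SET_VALUE = re.compile(r'set [\w_]+ "?((\\"|[^"])+)"?')


def setting_value(line):
    """Return the value part of a set command, or None if it is not one."""
    match = SET_VALUE.match(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(setting_value("set setting2 1 1 0 1"))
    print(setting_value(
        r'set setting1 "bind button_x +actionslot1;bind button_y \" bind button_x +stance \" "'
    ))
```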
Again, if someone can suggest a more sophisticated non-regex solution, you should strongly consider it.
Edit: The updated example you've provided breaks this solution, and I think it would break any regex solution.
Edit 2: Here is a C# string version of your regex. It uses @ to tell the compiler to treat the string as a verbatim literal, which means \ is not treated as an escape character. The only caveat is that in order to represent " in a verbatim literal you have to type it as "", but it's still better than having slashes everywhere. Given the prevalence of escape sequences in regexes, I recommend using verbatim literals anywhere that you have to type a regex in a string.
string pattern = @"set [\w_]+ ""?((\\""|[^""])+)""?";

How many backslashes are required to escape regexps in emacs' "Customize" mode?

I'm trying to use emacs' customize-group packages to tweak some parts of my setup, and I'm stymied. I see things like this in my .emacs file after I make changes with customize:
'(tramp-backup-directory-alist (quote (("\\\\`.*\\\\'" . "~/.emacs.d/autobackups"))))
This was the result of putting the following into the customize text field:
Regexp matching filename: \\`.*\\'
This is a representative sample: I'm actually trying to change several things that want a regexp, and they all show this same problem. How many layers of quoting are there, really? I can't seem to find the magic number of backslashes to get the gosh-dang thing to do what I'm asking it to, even for the simplest regular expressions like .*. Right now, the given customization produces - nothing. It makes no change from emacs' default behavior.
Better yet, where on earth is this documented? It's a little difficult to Google for, but I've been trying quite a few things there as well as in the official documentation and the Emacs wiki. Where is an authoritative source for how many dang backslashes one needs to make a regular expression in customize-mode actually work - or at the very least, to fail with some kind of warning instead of failing silently?
EDIT: As so often happens with questions asked in anger, I was asking the wrong question. Fortunately, the answers below led me to the answer to the question that I needed, which was about quoting rules. I'm going to try to write down what I learned here, because I find the documentation and Googleable resources to be maddeningly obscure about this. So here are the quoting rules I found by trial and error, and I hope that they help someone else, inspire correction, or both.
When an emacs customize-mode buffer asks you for a "Regexp matching filename", it is being, as emacs often is, both terse and idiosyncratic (how often the creator's personality is imparted to the creation!). It means, for one thing, a regexp that will be compared to the whole path of the file in search of a match, not just to the name of the file itself as you might assume from the term "filename". This is the same sense of "filename" used in emacs' buffer-file-name function, for example.
Further, although if you put foo in the field, you'll see "foo" (with double-quotes) written to the actual file, that's not enough quoting and not the right quoting. You need to quote your regexp with the quoting style that, as far as I can tell, only emacs uses: the `backtick-foo-single-quote' scheme. And then you need to escape that, making it \`backslash-backtick-foo-backslash-single-quote\' (and if you think that's a headache to input in Markdown, it's more so in emacs).
On top of this, emacs appears to have a rule that the . regexp special character does not match a / at the beginning of filenames, so, as was happening to me above, the classic .* pattern will appear to match nothing: to match "all files", you actually need the regexp /.*, which then you stuff into the quote format of customize-mode to produce \`/.*\', after which customize paints another layer of escaping onto it and writes it to the customization file.
The final result for one of my efforts - a setting such that #autosave# files don't gunk up the directory you're working in, but instead all live in one place:
(custom-set-variables
'(auto-save-file-name-transforms (quote (
("\\`/[^/]*:\\([^/]*/\\)*\\([^/]*\\)\\'" "~/.emacs.d/autobackups/\\2" t)
("\\`/.*/\\(.*?\\)\\'" "~/.emacs.d/autobackups/\\1" t)
))))
Backslashes in elisp are a far greater threat to your sanity than parentheses.
EDIT 2: Time for me to be wrong again. I finally found the relevant documentation (through reading another Stack Overflow question, of course!): Regexp Backslash Constructs. The crucial point of confusion for me: the backtick and single quote are not quoting in this context: they're the equivalent of perl's ^ and $ special characters. The backslash-backtick construct matches an empty string anchored at the beginning of the string being checked for a match, and the backslash-single-quote construct matches the empty string at the end of the string-under-consideration. And by "string under consideration," I mean "buffer, which just happens to contain only a file path in this case, but you need to match the whole dang thing if you want a match at all, since this is elisp's global regexp behavior."
Swear to god, it's like dealing with an alien civilization.
EDIT 3: In order to avoid confusing future readers -
\` is the emacs regex for "the beginning of the buffer." (cf Perl's \A)
\' is the emacs regex for "the end of the buffer." (cf Perl's \z)
^ is the common-idiom regex for "the beginning of the line." It can be used in emacs.
$ is the common-idiom regex for "the end of the line." It can be used in emacs.
Because regex searches across multi-line bodies of text are more common in emacs than elsewhere (e.g. M-x occur), the backtick and single-quote special characters are used in emacs, and as best as I can tell, they're used in the context of customize-mode because if you are considering generic unknown input to a customize-mode field, it could contain newlines, and therefore you want to use the beginning-of-buffer and end-of-buffer special characters because the beginning and end of the input are not guaranteed to be the beginning and end of a line.
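The difference between the two kinds of anchors is easy to see on a two-line string. The sketch below uses Python rather than elisp: \A and \Z are Python's whole-string anchors, and re.MULTILINE makes ^ and $ match at every line boundary. The sample text is made up.

```python
import re

text = "first line\nsecond line"

# ^ and $ match at each line boundary when MULTILINE is set ...
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# ... while \A and \Z only ever match at the ends of the whole text.
text_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)
text_end = re.findall(r"\w+\Z", text, flags=re.MULTILINE)

if __name__ == "__main__":
    print(line_starts, line_ends)
    print(text_start, text_end)
```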
I am not sure whether to regret hijacking my own Stack Overflow question and essentially turning it into a blog post.
In the customize field, you'd enter the regexp according to the syntax described here. When customize writes the regexp into a string, any backslashes or double-quote chars in the regexp will be escaped, as per regular string escaping conventions.
So in short, just enter single backslashes in the regexp field, and they'll get correctly doubled up in the resulting custom-set-variables clause written to your .emacs.
Also: since your regexp is for matching filenames, you might try opening up a directory containing files you'd like to match, and then run M-x re-builder RET. You can then enter the regexp in string-escaped format to confirm that it matches those files. By typing % m in a dired buffer, you can enter a regexp in unescaped format (ie. just like in the customize field), and dired will mark matching filenames.
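The escaping step described above can be modeled in a few lines. This is a Python sketch of ordinary string-literal escaping, not Emacs's actual code, and to_string_literal is a hypothetical helper. It shows why typing doubled backslashes into the field ends up as four backslashes in .emacs.

```python
def to_string_literal(regexp):
    """Escape backslashes and double quotes, then wrap in double quotes."""
    # Backslashes go first so the ones added for quotes are not doubled again.
    escaped = regexp.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


typed_single = r"\`.*\'"    # single backslashes typed into the field
typed_double = r"\\`.*\\'"  # what the question actually entered

if __name__ == "__main__":
    print(to_string_literal(typed_single))
    print(to_string_literal(typed_double))
```

The second result has four backslashes before the backtick and the quote, matching the tramp-backup-directory-alist line quoted in the question.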

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of files a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group the numbers after the "x", group everything following until a 3-letter uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
cià
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
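Since a single match cannot both span the dots and drop them, one workaround is to capture the dotted title and replace the dots in a second step. The Python sketch below assumes a scripting language is available, which the renaming app in the question may not offer; TITLE and new_name are made-up names.

```python
import re

# Number after the "x", then the dotted title up to a 3-letter uppercase word.
TITLE = re.compile(r"x(\d+)\.(.*?)\.[A-Z]{3}")


def new_name(filename):
    """Return 'NUMBER Title With Spaces', or None if the name does not fit."""
    match = TITLE.search(filename)
    if match is None:
        return None
    number, dotted_title = match.groups()
    # Second step: turn the dots inside the captured title into spaces.
    return f"{number} {dotted_title.replace('.', ' ')}"


if __name__ == "__main__":
    print(new_name(
        "Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
    ))
```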
since you are on Mac, you could use the shell
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo $filename | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
    newfile=${file#*x}
    newfile=${newfile%.[A-Z][A-Z][A-Z].*}
    # or
    # newfile=$(echo $file | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
    mv "$file" "$newfile"
done
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
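The effect of the leading .* is easy to reproduce. In this Python sketch the input string is made up so that x(\d+)\. can match in two places:

```python
import re

s = "ax1.bx2."

# Without padding, the search stops at the first place the pattern fits.
without_padding = re.search(r"x(\d+)\.", s).group(1)

# With a greedy leading .*, the engine runs to the end of the string and
# backtracks, so the captured number comes from the last possible place.
with_padding = re.search(r".*x(\d+)\.", s).group(1)

if __name__ == "__main__":
    print(without_padding, with_padding)
```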
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
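The distinction between capturing and grouping shows up directly in the match object. A small Python sketch, with a made-up input string:

```python
import re

name = "Chapter.021x212.Title"

# Both pairs of parentheses capture, so there are two groups.
capturing = re.search(r"(\d+)x(\d+)", name)

# (?:...) only groups; the digits before the "x" are consumed but not captured.
grouping_only = re.search(r"(?:\d+)x(\d+)", name)

if __name__ == "__main__":
    print(capturing.group(0), capturing.groups())
    print(grouping_only.group(0), grouping_only.groups())
```

Both searches consume the same text, 021x212; only the number of captured groups differs.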
If you haven't read it yet, check out this excellent tutorial.