Perl, Change the Case of Letter at { character - regex

I am a perl newb, and just need to get something done quick and dirty.
I have lines of text (from .bib files) such as
Title = {{the Particle Swarm - Explosion, Stability, and Convergence in a Multidimensional Complex Space}},
How can I write a regex such that the first letter after the second { becomes capitalised?
Thanks

One way, for the question as asked
$string =~ s/{{\K(\w)/uc($1)/ge;
whereby /e makes it evaluate the replacement side as code. The \K makes it drop all previous matches so {{ aren't "consumed" (and thus need not be retyped in the replacement side).
If you wish to allow for possible spaces: $string =~ s/{{\s*\K(\w)/uc($1)/ge;. And, as far as I know BibTeX, why not allow for spaces between the curlies as well, so {\s*{.
If simple capitalization is all you need then \U$1 in the replacement side suffices and there is no need for the /e modifier with it, per comment by Grinnz. The \U is a case-modification escape that works in any interpolating (double-quote-like) context, and thus also in the replacement side of a regex; see under Escape sequences in perlre, and in perlretut.
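For comparison, here is a minimal sketch of the same substitution in Python (used only for illustration; the answer itself is Perl). Python's re module has no \K, so this version captures the braces and puts them back. The function name capitalize_after_braces is made up for the example.

```python
import re

def capitalize_after_braces(line: str) -> str:
    """Upper-case the first word character after '{{' (optional spaces allowed)."""
    # Group 1 keeps the braces and any spaces; group 2 is the letter to upper-case.
    return re.sub(
        r"(\{\s*\{\s*)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        line,
    )

line = "Title = {{the Particle Swarm - Explosion, Stability, and Convergence}},"
print(capitalize_after_braces(line))
```

Like the /g version above, re.sub changes every {{ on the line, so the same caution about unleashing it on a whole file applies.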
I recommend a good read through the tutorial perlretut. That will go a long way.
However, I must also ask: Are you certain that you may indeed just unleash that on your whole file? Will it catch all cases you need? Will it not clip something else you didn't mean to?

Related

auto-generating substitution in perl

I'm trying to autogenerate a regex pattern in perl based on some input, to handle various variables that are created by token pasting in a Makefile... So, for example, I might have a pattern such as:
foo_1_$(AB)_$(CB)
Given this pattern, I need to create a regex that will convert all instances of foo_1_\$(\w+)_\$(\w+) to bar_1_\$($1)_\$($2). The main issue I'm having is with the "to" side of the pattern -- I need to increment the $ number reference each time -- notice that there may be a variable number of tokens in any given pattern.
So... I'm thinking something like this:
foreach my $pattern (@patterns) {
    my $from = $pattern;
    # foo_1_$(AB)_$(CD)
    $from =~ s/\$\(\w+\)/\$\(\\w\\\+\)/g;
    # foo_1_$(\w+)_$(\w+)
    my $to = $pattern =~ s/foo/bar/r;
    # bar_1_$(AB)_$(CD);
    $to =~ s/\$\(\w+\)/\\\$\(\$?)/g; #???
    # bar_1_\$($1)_\$($2)
    #           ^      ^
    # this next part is done outside of this loop, but for the example code:
    $line =~ s/\Q$from\E/$to/;
}
How do I cause each subsequent replacement in my $to to have an incremental index?
Writing code to generate regex off of a given pattern is a complex undertaking (except in simplest cases), and that's when it is precisely specified what that pattern can be. In this case I also don't see why one can't solve the problem by writing the regex for a given type of a pattern (instead of writing code that would write regex).†
In either case one would need those regex so here's some of that. Since no precise rules for what the patterns may be are given, I use some basic assumptions drawn from hints in the question.
I take it that the pattern to replace (foo_) is followed by a number, and then by the pattern _$(AB) (literal dollar and parens with chars inside), repeated any number of times ("there may be a variable number of tokens").
One way to approach this is by matching the whole following pattern (all repetitions), with a lookahead
s/[a-z]+_([0-9]+)(?=(?:_\$\(\w+\))+)/XXX_$1/;
A simple-minded test in a one-liner
perl -wE'$_=q{foo_1_$(AB)_$(CB)}; s/[a-z]+_([0-9]+)(?=(?:_\$\(\w+\))+)/XXX_$1/; say'
replaces foo with XXX. It also works with only one group _$(AB), and with more than two.
This does not match the lone foo_1, without following _$(AB), decided based on the "spirit" of the question (since such a requirement is not spelled out). If such a case in fact should be matched as well then that is possible with a few small changes (mostly related to moving _ into the pattern to be replaced, as optional ([a-z]+_[0-9]+_?))
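A minimal Python sketch of the same lookahead idea, for readers who want to experiment outside Perl. The names RULE and rename_prefix are made up for the example, and the token format _$(NAME) is the assumption stated above.

```python
import re

# Replace the leading word only when "_<number>" is followed by
# at least one "_$(NAME)" token; the lookahead consumes nothing.
RULE = re.compile(r"[a-z]+_([0-9]+)(?=(?:_\$\(\w+\))+)")

def rename_prefix(line: str, new: str = "XXX") -> str:
    # count=1 mirrors the Perl s/// without the /g modifier.
    return RULE.sub(lambda m: f"{new}_{m.group(1)}", line, count=1)

for sample in ("foo_1_$(AB)_$(CB)", "foo_1_$(AB)", "foo_1"):
    print(sample, "->", rename_prefix(sample))
```

The lone foo_1 is left untouched because the lookahead requires at least one token after the number.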
Update: If the "tokens" that follow foo_ (to be replaced) can in fact be anything (so not necessarily $(..)), except that they are strung together with _, then we can use a modification like
s/[a-z]+_(\d?)(?=(_[^_]+)*)/XXX_$1/;
where the number after foo_ is optional, per example given in a comment. But then it's simpler
s/[a-z]+(?=(_[^_]+)*)/XXX/;
Example
perl -wE'
$_=q{foo_$(AB)_123_$(CD)_foo_$(EF)}; say;
s/[a-z]+(?=(_[^_]+)*)/XXX/; say'
prints
foo_$(AB)_123_$(CD)_foo_$(EF)
XXX_$(AB)_123_$(CD)_foo_$(EF)
Note: what the above regex does is also done by s/[a-z]+(?=_)/XXX/. However, the more detailed regex above can be tweaked and adapted for more precise requirements and I'd use that, or its variations, as a main building block for complete solutions.
If the rules for what may be a pattern are less structured (less than "any tokens connected with _") then we need to know them, and probably very precisely.
This clearly doesn't generate the regex from a given pattern, as asked, but is a regex to match such a (class of) patterns. That can solve the problem given sufficient specification for what those patterns may be like -- which would be necessary for regex generation as well.
† Another option is that some templating system is used but then you are again directly writing regex to match given types of patterns.
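For the question as literally asked (numbering the references on the "to" side), one possible approach is to count tokens while building the replacement. The sketch below is in Python rather than Perl and is not the approach this answer recommends. It assumes every token looks like $(NAME), that the prefix to rename occurs once, and that the literal parts contain no backslashes. TOKEN and build_rule are hypothetical names.

```python
import re
from itertools import count

TOKEN = re.compile(r"\$\(\w+\)")  # a Makefile-style token such as $(AB)

def build_rule(pattern: str, old: str = "foo", new: str = "bar"):
    """Return (compiled 'from' regex, 'to' template) for one pattern."""
    # "from" side: escape the literal pieces and capture each token's name.
    literals = TOKEN.split(pattern)
    from_re = r"\$\((\w+)\)".join(re.escape(piece) for piece in literals)
    # "to" side: number the tokens \1, \2, ... in order of appearance.
    index = count(1)
    renamed = pattern.replace(old, new, 1)
    to_template = TOKEN.sub(lambda m: "$(\\%d)" % next(index), renamed)
    return re.compile(from_re), to_template

from_re, to_template = build_rule("foo_1_$(AB)_$(CB)")
print(to_template)
print(from_re.sub(to_template, "x := foo_1_$(X)_$(Y)"))
```

The counter is the whole trick: each call of the replacement function hands out the next group number, so any number of tokens is handled without knowing the count in advance.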

EditPad: How to replace multiple search criteria with multiple values?

I did some searching and found tons of questions about multiple replacements with Regex, but I'm working in EditPadPro and so need a solution that works with the regex syntax of that environment. Hoping someone has some pointers as I haven't been able to work out the solution on my own.
Additional disclaimer: I suck with regex. I mean really... it's bad. Like I barely know wtf I'm doing. So that being said, here is what I need to do and how I'm currently approaching it...
I need to replace two possible values, with their corresponding replacements. My two searches are:
(.*)-sm
(.*)-rad
Currently I run these separately and replace each with simple strings:
sm
rad
Basically I need to lop off anything that comes prior to "sm" so I just detect everything up to and including sm, and then replace it all with that string (and likewise for "rad").
But it seems like there should be a way to do this in a single search/replace operation. I can do the search part fine with:
(.*)-sm|(.*)-rad
But then how to replace each with its matching value? That's where I'm stuck. I tried:
sm|rad
but alas, that just becomes the literal complete string that is used for replacement.
Jonathan, first off let me congratulate you for using EPP Pro for regex in your text. It's my main text editor, and the main reason I chose it, as a regex lover, is that its support of regex syntax is vastly superior to competing editors. For instance Notepad++ is known for its shoddy support of regular expressions. The reason of course is that EPP's author Jan Goyvaerts is the author of the legendary RegexBuddy.
A picture is worth a thousand words... So here is how I would do your replacement. Just hit the "Replace All" button. The expression in the regex box assumes that anything before the dash that is not a whitespace character can be stripped, so if this is not what you want, we need to tune it.
Search for:
(.*)-(sm|rad)
Now, when you put something in parentheses in a regex, whatever it matches is stored in a numbered capture group. So whatever matched (.*) is stored in \1 and whatever matched (sm|rad) is stored in \2. Therefore, you want to replace with:
\2
Note that the replacement variable may be different depending on what programming language you are using. In Perl, for example, I would have to use $2 instead.
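The same single-pass replacement can be sketched in Python, where the backreference is written \1 because the leading (.*) does not need to be captured. The function name strip_prefix and the sample inputs are made up for illustration.

```python
import re

def strip_prefix(value: str) -> str:
    # Group 1 is whichever suffix matched; the greedy .*- eats everything before it.
    return re.sub(r".*-(sm|rad)", r"\1", value)

for sample in ("btn-large-sm", "corner-rad", "plain"):
    print(sample, "->", strip_prefix(sample))
```

Values that contain neither -sm nor -rad pass through unchanged.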

How many backslashes are required to escape regexps in emacs' "Customize" mode?

I'm trying to use emacs' customize-group packages to tweak some parts of my setup, and I'm stymied. I see things like this in my .emacs file after I make changes with customize:
'(tramp-backup-directory-alist (quote (("\\\\`.*\\\\'" . "~/.emacs.d/autobackups"))))
This was the result of putting the following into the customize text field:
Regexp matching filename: \\`.*\\'
This is a representative sample: I'm actually trying to change several things that want a regexp, and they all show this same problem. How many layers of quoting are there, really? I can't seem to find the magic number of backslashes to get the gosh-dang thing to do what I'm asking it to, even for the simplest regular expressions like .*. Right now, the given customization produces - nothing. It makes no change from emacs' default behavior.
Better yet, where on earth is this documented? It's a little difficult to Google for, but I've been trying quite a few things there as well as in the official documentation and the Emacs wiki. Where is an authoritative source for how many dang backslashes one needs to make a regular expression in customize-mode actually work - or at the very least, to fail with some kind of warning instead of failing silently?
EDIT: As so often happens with questions asked in anger, I was asking the wrong question. Fortunately the answers below, led me to the answer to the question that I needed, which was about quoting rules. I'm going to try to write down what I learned here, because I find the documentation and Googleable resources to be maddeningly obscure about this. So here are the quoting rules I found by trial and error, and I hope that they help someone else, inspire correction, or both.
When an emacs customize-mode buffer asks you for a "Regexp matching filename", it is being, as emacs often is, both terse and idiosyncratic (how often the creator's personality is imparted to the creation!). It means, for one thing, a regexp that will be compared to the whole path of the file in search of a match, not just to the name of the file itself as you might assume from the term "filename". This is the same sense of "filename" used in emacs' buffer-file-name function, for example.
Further, although if you put foo in the field, you'll see "foo" (with double-quotes) written to the actual file, that's not enough quoting and not the right quoting. You need to quote your regexp with the quoting style that, as far as I can tell, only emacs uses: the `backtick-foo-single-quote' scheme. And then you need to escape that, making it \`backslash-backtick-foo-backslash-single-quote\' (and if you think that's a headache to input in Markdown, it's more so in emacs).
On top of this, emacs appears to have a rule that the . regexp special character does not match a / at the beginning of filenames, so, as was happening to me above, the classic .* pattern will appear to match nothing: to match "all files", you actually need the regexp /.*, which then you stuff into the quote format of customize-mode to produce \`/.*\', after which customize paints another layer of escaping onto it and writes it to the customization file.
The final result for one of my efforts - a setting such that #autosave# files don't gunk up the directory you're working in, but instead all live in one place:
(custom-set-variables
 '(auto-save-file-name-transforms (quote (
   ("\\`/[^/]*:\\([^/]*/\\)*\\([^/]*\\)\\'" "~/.emacs.d/autobackups/\\2" t)
   ("\\`/.*/\\(.*?\\)\\'" "~/.emacs.d/autobackups/\\1" t)
 ))))
Backslashes in elisp are a far greater threat to your sanity than parentheses.
EDIT 2: Time for me to be wrong again. I finally found the relevant documentation (through reading another Stack Overflow question, of course!): Regexp Backslash Constructs. The crucial point of confusion for me: the backtick and single quote are not quoting in this context: they're the equivalent of perl's ^ and $ special characters. The backslash-backtick construct matches an empty string anchored at the beginning of the string being checked for a match, and the backslash-single-quote construct matches the empty string at the end of the string-under-consideration. And by "string under consideration," I mean "buffer, which just happens to contain only a file path in this case, but you need to match the whole dang thing if you want a match at all, since this is elisp's global regexp behavior."
Swear to god, it's like dealing with an alien civilization.
EDIT 3: In order to avoid confusing future readers -
\` is the emacs regex for "the beginning of the buffer." (cf Perl's \A)
\' is the emacs regex for "the end of the buffer." (cf Perl's \z)
^ is the common-idiom regex for "the beginning of the line." It can be used in emacs.
$ is the common-idiom regex for "the end of the line." It can be used in emacs.
Because regex searches across multi-line bodies of text are more common in emacs than elsewhere (e.g. M-x occur), the backtick and single-quote special characters are used in emacs, and as best as I can tell, they're used in the context of customize-mode because if you are considering generic unknown input to a customize-mode field, it could contain newlines, and therefore you want to use the beginning-of-buffer and end-of-buffer special characters because the beginning and end of the input are not guaranteed to be the beginning and end of a line.
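The line-versus-whole-text distinction can be tried out in any regex engine that has both kinds of anchor. Below is a small Python sketch, offered only as an analogy: Python's \A and \Z play roughly the role of emacs' \` and \', and ^ and $ behave per line only when re.MULTILINE is set. The sample text is made up.

```python
import re

text = "first line\nsecond line"

# ^ and $ anchor at every line when MULTILINE is set ...
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# ... while \A and \Z anchor only at the very start and end of the whole text.
text_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)
text_end = re.findall(r"\w+\Z", text, flags=re.MULTILINE)

print(line_starts, line_ends)
print(text_start, text_end)
```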
I am not sure whether to regret hijacking my own Stack Overflow question and essentially turning it into a blog post.
In the customize field, you'd enter the regexp according to the syntax described here. When customize writes the regexp into a string, any backslashes or double-quote chars in the regexp will be escaped, as per regular string escaping conventions.
So in short, just enter single backslashes in the regexp field, and they'll get correctly doubled up in the resulting custom-set-variables clause written to your .emacs.
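To see where the extra backslashes come from, here is a small Python sketch that imitates only the string-escaping step described above (it is not how customize is implemented). The function name as_lisp_string is made up for the example.

```python
def as_lisp_string(regexp: str) -> str:
    """Escape backslashes and double quotes, then wrap in double quotes."""
    escaped = regexp.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

typed_single = r"\`.*\'"     # single backslashes typed into the field
typed_double = r"\\`.*\\'"   # what the question typed: already doubled

print(as_lisp_string(typed_single))
print(as_lisp_string(typed_double))
```

The second line printed has four backslashes in each spot, which is exactly the form that showed up in the .emacs file in the question: the field already contained doubled backslashes, and writing the string doubled them again.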
Also: since your regexp is for matching filenames, you might try opening up a directory containing files you'd like to match, and then run M-x re-builder RET. You can then enter the regexp in string-escaped format to confirm that it matches those files. By typing % m in a dired buffer, you can enter a regexp in unescaped format (ie. just like in the customize field), and dired will mark matching filenames.

Regex to replace all occurrences of a given character, ONLY after a given match

For the sake of simplicity, let's say that we have input strings with this format:
*text1*|*text2*
So, I want to leave text1 alone, and remove all spaces in text2.
This could be easy if we didn't have text1, a simple search and replace like this one would do:
%s/\s//g
but in this context I don't know what to do.
I tried with something like:
%s/\(.*|\S*\).\(.*\)/\1\2/g
which works, but removing only the first character, I mean, this should be run on the same line one time for each offending space.
So, a preferred restriction, is to solve this with only one search and replace. And, although I used Vim syntax, use the regular expression flavor you're most comfortable with to answer, I mean, maybe you need some functionality only offered by Perl.
Edit:
My solution for Vim:
%s:\(|.*\)\@<=\s::g
One way, in perl:
s/(^.*\||(?=\s))\s*/$1/g
Certainly much greater efficiency is possible if you allow more than just one search and replace.
So you have a string with one pipe (|) in it, and you want to replace only those spaces that don't precede the pipe?
s/\s+(?![^|]*\|)//g
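The negative-lookahead pattern carries over unchanged to other flavors. A minimal Python sketch, assuming (as above) exactly one pipe per line; the function name strip_spaces_after_pipe and the sample text are made up.

```python
import re

def strip_spaces_after_pipe(line: str) -> str:
    # Whitespace is removed only if no "|" follows it later on the line.
    return re.sub(r"\s+(?![^|]*\|)", "", line)

print(strip_spaces_after_pipe("text 1 here| text 2 here"))
```

On a line with no pipe at all, the lookahead never finds one, so every run of whitespace is removed.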
You might try embedding Perl code in a regular expression (using the (?{...}) syntax), which is, however, rather an experimental feature and might not work or even be available in your scenario.
This
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s:\s::g })/$1$x/
should theoretically work, but I got an "Out of memory!" failure, which can be fixed by replacing '\s' with a space:
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s: ::g })/$1$x/

how to eliminate dots from filenames, except for the file extension

I have a bunch of files that look like this:
A.File.With.Dots.Instead.Of.Spaces.Extension
Which I want to transform via a regex into:
A File With Dots Instead Of Spaces.Extension
It has to be in one regex (because I want to use it with Total Commander's batch rename tool).
Help me, regex gurus, you're my only hope.
Edit
Several people suggested two-step solutions. Two steps really make this problem trivial, and I was really hoping to find a one-step solution that would work in TC. I did, BTW, manage to find a one-step solution that works as long as there's an even number of dots in the file name. So I'm still hoping for a silver bullet expression (or a proof/explanation of why one is strictly impossible).
It appears Total Commander's regex library does not support lookaround expressions, so you're probably going to have to replace a number of dots at a time, until there are no dots left. Replace:
([^.]*)\.([^.]*)\.([^.]*)\.([^.]*)$
with
$1 $2 $3.$4
(Repeat the sequence and the number of backreferences for more efficiency. You can go up to $9, which may or may not be enough.)
It doesn't appear there is any way to do it with a single, definitive expression in Total Commander, sorry.
Basically:
/\.(?=.*?\.)//
will do it in pure regex terms. This means: replace with nothing any period that is followed by a string of characters (matched non-greedily) and then another period. The (?=...) part is a positive lookahead.
In PHP this is done as:
$output = preg_replace('/\.(?=.*?\.)/', '', $input);
Other languages vary but the principle is the same.
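As one more illustration of the principle, here is a short Python sketch that replaces the dots with spaces (as the question asks) rather than removing them. It needs a regex engine with lookahead, so it does not help inside Total Commander itself. The function name dots_to_spaces is made up.

```python
import re

def dots_to_spaces(filename: str) -> str:
    # A dot is replaced only when another dot follows somewhere later,
    # so the last dot (before the extension) is left alone.
    return re.sub(r"\.(?=.*\.)", " ", filename)

print(dots_to_spaces("A.File.With.Dots.Instead.Of.Spaces.Extension"))
```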
Here's one based on your almost-solution:
/\.([^.]*(\.[^.]+$)?)/\1/
This is, roughly, "any dot stuff, minus the dot, and maybe plus another dot stuff at the end of the line." I couldn't quite tell if you wanted the dots removed or turned to spaces - if the latter, change the substitution to " \1" (minus the quotes, of course).
[Edited to change the + to a *, as Helen's below.]
Or substitute all dots with space, then substitute [space][Extension] with .[Extension]
A.File.With.Dots.Instead.Of.Spaces.Extension
to
A File With Dots Instead Of Spaces Extension
to
A File With Dots Instead Of Spaces.Extension
Another pattern to find all dots but the last in a (windows) filename that I've found works for me in Mass File Renamer is:
(?!\.\w*$)\.
I don't know how useful that is to other users, but this page was an early search result and if that had been on here it would have saved me some time.
It excludes the result if it's followed by an uninterrupted sequence of alphanumeric characters leading to the end of the input (filename) but otherwise finds all instances of the dot character.
You can do that with Lookahead. However I don't know which kind of regex support you have.
/\.(?=.*\.)//
Which roughly translates to: any dot (/\./) that has something and another dot after it. Obviously the last dot is the only one not complying. I leave out the "optionality" of something between dots, because the data looks like something will always be in between and the "optionality" has a performance cost.
Check:
http://www.regular-expressions.info/lookaround.html