How to create regex search-and-replace with comments? - regex

I have a bit of a strange problem: I have some source text (it's LaTeX, but that does not matter here) that contains long lines made up of several sentences, each ending in a period.
For better version control I want to put each of these sentences on its own line.
This can be achieved via sed 's/\. /.\n/g'.
Now the problem arises if there are comments with potential periods as well.
These comments must not be altered, otherwise they will be parsed as LaTeX code and this might result in errors etc.
As a pseudo example you can use
Foo. Bar. Baz. % A. comment. with periods.
The result should be
Foo.
Bar.
Baz. % ...
Alternatively the comment might go on the next line without any problems.
It would be fine to use Perl if that works out better. I tried a few ideas with different programs (sed and perl), but none did what I expected: either the comment was altered as well, or only the first period was replaced (perl -pe 's/^([^%]*?)\. /\1.\n/g').
Can you point me in the right direction?

This is tricky as you're essentially trying to match all occurrences of ". " that don't follow a "%". A negative look-behind would be useful here, but Perl doesn't support variable-width negative look-behind. (Though there are hideous ways of faking it in certain situations.) We can get by without it here using backtracking control verbs:
s/(?:%(*COMMIT)(*FAIL))|\.\K (?!%)/\n/g;
The (?:%(*COMMIT)(*FAIL)) forces replacement to stop the first time it sees a "%" by committing to a match and then unconditionally failing, which prevents back-tracking. The "real" match follows the alternation: \.\K (?!%) looks for a space that follows a period and isn't followed by a "%". The \K causes the period to not be included in the match so we don't have to include it in the replacement. We only match and replace the space.
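For comparison, the same result can be reached without backtracking verbs by cutting the line at the first % and substituting only in the part before it. The Python sketch below is not the Perl approach described above but a swapped-in two-step technique. The function name split_sentences is made up for illustration, and the sketch assumes that every % starts a comment (an escaped \% is not handled).

```python
import re

def split_sentences(line: str) -> str:
    """Put each sentence on its own line, leaving a trailing % comment untouched."""
    # Everything from the first '%' onward is treated as the comment
    # (assumption: no escaped \% occurs in the text part).
    text, percent, comment = line.partition("%")
    # Replace '. ' with '.\n', except when the space is the last character of
    # the text part, so the comment stays on the line of the final sentence.
    text = re.sub(r"\. (?!$)", ".\n", text)
    return text + percent + comment

print(split_sentences("Foo. Bar. Baz. % A. comment. with periods."))
```

A line without any % is simply split at every ". " because partition then returns the whole line as the text part.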

Putting the comment by itself on a following line can be done with sed pretty easily, using the hold space:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/[^%]*%/%/;x;s/ *%.*//;s/\. /.\n/g;G'
Or if you want the comment by itself before the rest:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/ *%.*//;s/\. /.\n/g;x;s/[^%]*%/%/;G'
Or finally, it is possible to combine the comment with the last line also:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/[^%]*%/%/;x;s/ *%.*//;s/\. /.\n/g;G;s/\n\([^\n]*\)$/ \1/'

Related

regex substitution no global modifiers available

I'm using software with a built-in regex implementation that does not support global modifiers, so I have to get it working without /g.
My test string is (the number of sections can be unlimited):
aaa%2dbbb%2dccc%2dddd%2deee
I want it to be: aaa-bbb-ccc-ddd-eee
normally I would write (%2d) and g flag and substitute with -
I managed to write this to match unlimited number of occurrences
(\w)((%2d)(\w+))+
but I have problems with the substitution rule, because my group 2 has two subgroups and I cannot figure out how to handle them.
Can anyone help with the substitution rule?
Since the comments reach the same conclusion I had come to before posting, I decided to post an answer to close the question cleanly instead of deleting it; even a negative answer is an answer, and it may save someone an hour or more of research (that actually happened to me). The general conclusion is that it's not possible to solve this with regex. I'm quoting the two best comments by @ltux here:
This problem can't be solved with a regular expression in one go. If a capture group is used with a quantifier such as +, the content of the capture group will always be the last match found. In your case, the content of the 2nd capture group will be %2deee, and you can't get %2dbbb, %2dccc and so on, so there is no chance for you to substitute them. – ltux 2 days ago
Regular expression can't solve your problem. You have to try to bypass the limitations of the software by yourself, unless you tell us which software you are using. – ltux 2 days ago
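The behaviour the comments describe is easy to observe. The sketch below uses Python's re module as a stand-in, since the software in question is not named; most regex engines treat a repeated capture group the same way, but that is an assumption about the unnamed tool.

```python
import re

s = "aaa%2dbbb%2dccc%2dddd%2deee"

# The pattern from the question, unchanged.
m = re.search(r"(\w)((%2d)(\w+))+", s)

# The repeated group matched four times, but each group number only
# remembers the text from its final repetition.
print(m.group(0))  # the whole match
print(m.group(2))  # group 2 after the last repetition
print(m.group(4))  # group 4 after the last repetition
```

Because the earlier repetitions are no longer reachable through any group number, a single replacement template has nothing to refer to for %2dbbb, %2dccc and %2dddd.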
Create a file containing the line type that you want to process:
cat << EOF >> abcde.txt
aaa%2dbbb%2dccc%2dddd%2deee
EOF
Use this sed snippet as follows using the global substitution you mention as being the way you usually perform such a substitution.
sed -e "s#%2d#-#g" abcde.txt
aaa-bbb-ccc-ddd-eee
Basically you don't have to think about the characters that appear around the '%2d' sequence (a percent-encoded hyphen); concentrate only on the sequence itself. Replacing it every time it occurs solves the issue quite simply. In other words, pattern matching around the text you want to change is not necessary. This is a common trap that many of us fall into when dealing with regular expressions.
Basically the substitution is saying: find an occurrence of '%2d', replace it with a hyphen '-', and (because of the g flag) repeat for the rest of the string.
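Where no global flag exists, one possible workaround is to apply the same single replacement again and again until nothing changes. Whether the software from the question allows a rule to be re-applied is unknown, so this is only a sketch of the idea in Python, with count=1 standing in for a non-global replace. The helper name replace_without_global is hypothetical.

```python
import re

def replace_without_global(text: str, needle: str = "%2d", repl: str = "-") -> str:
    """Replace every occurrence of needle using only first-match replacements."""
    # Note: the loop would never end if repl itself contained needle.
    pattern = re.compile(re.escape(needle))
    while True:
        # count=1 replaces only the first occurrence, like a substitution
        # without the g flag.
        text, count = pattern.subn(repl, text, count=1)
        if count == 0:
            return text

print(replace_without_global("aaa%2dbbb%2dccc%2dddd%2deee"))
```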

How to search and replace from the last match of a until b?

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It matches from the second line of my example through to the end of the quoting environment; I guess the greedy quantifier would match up to the last \end{quoting} in the file. Is there a way to do this with search and replace, or should I write a macro for it?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% before \end{quoting} up to that \end{quoting}. In that case you don't really want "any character" (\_.); you want "any character that doesn't start a \\}% sequence" (yes, I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\@!\_.\)\{-} instead of \_.\{-}; this means the matched text cannot contain another \\}% sequence, which achieves your aim (as far as I can determine it).
This uses the negative zero-width look-ahead \@! to ensure that each "any character" match cannot begin the specific text we want to avoid (anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\@!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
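The same "any character, as long as the forbidden sequence does not start here" idea can be tried outside Vim. The following is a rough Python translation of the pattern, offered only as a sketch: the sample text is the excerpt from the question, and the variable names are illustrative.

```python
import re

sample = "\n".join([
    r"\myverse{some text \\",
    r"some more text \\}%",
    r"%",
    r"\myverse{again some text \\",
    r"this is my last line \\}%",
    r"\footnote{possibly some footnotes here}%",
    r"%",
    r"\end{quoting}",
])

pattern = re.compile(
    r"\\\\\}%"                     # a literal \\}% ...
    r"((?:(?!\\\\\}%)[\s\S])*?)"   # ... then any text that never starts another \\}%
    r"\\end\{quoting\}"            # ... up to \end{quoting}
)

# A function is used as the replacement so that the backslash in
# \end{quoting} is not interpreted as a template escape.
result = pattern.sub(lambda m: "}%" + m.group(1) + r"\end{quoting}", sample)
print(result)
```

Only the \\ on the "this is my last line" row is removed; the earlier \\}% is left alone because a match starting there would have to cross the later one.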
You’re on the right track using the non-greedy multi. The Vim help files state that
"\{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered:
BUT: A match that starts earlier is preferred over a shorter match: "a\{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.
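The "earlier start wins over shorter match" rule is not specific to Vim. A small Python check with the same example string from the help text shows the same behaviour for a lazy quantifier:

```python
import re

# a*? is lazy, yet the match still begins at the first position where the
# whole pattern can succeed, so all three a's end up in the match.
lazy = re.search(r"a*?b", "xaaab")
print(lazy.group(0), lazy.start())
```

Laziness only decides how far the match extends once a starting position has been fixed; it never moves the starting position to the right.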

Vim - sed like labels or replacing only within pattern

While doing some HTML editing I've come up with a problem that needs help from a Vim master out there.
I want to achieve a simple task: I have an HTML file with mangled URLs.
Just description Just description
...
Just description
Unfortunately it's not "one URL per line".
I am aware of three approaches:
1. I would like to be able to replace only within the '"http://[^"]*"' regex (similar to replacing only in matching lines, but this time only the matched text rather than the whole line should be affected).
2. Or use sed-like labels. I can do this task with sed -e :a -e 's#\("http://[^" ]*\) \([^"]*"\)#\1_\2#g;ta'
3. Also, I know that there is something like "\@<=", but I am not a native speaker and the Vim manual on this is beyond my comprehension.
All help is greatly appreciated.
If possible I would like to know answer on all three problems (as those are pretty interesting and would be helpful in other tasks) but either of those will do.
Re: 1. You can replace recursively by combining vim's "evaluate replacement as an expression" feature (:h :s\=) with the substitute function (:h substitute()):
:%s!"http://[^"]*"!\=substitute(submatch(0), ' ', '_', 'g')!g
Re: 2. I don't know sed so I can't help you with that.
Re: 3. I don't see how \@<= would help here. As for what it does: It's equivalent to Perl's (?<=...) feature, also known as "positive look-behind". You can read it as "if preceded by":
:%s/\%(foo\)\@<=bar/X/g
"Replace bar by X if it's preceded by foo", i.e. turn every foobar into fooX (the \%( ... \) are just for grouping here). In Perl you'd write this as:
s/(?<=foo)bar/X/g;
More examples and explanation can be found in perldoc perlretut.
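Both ideas in this answer, a replacement computed from the matched text and a look-behind, have direct counterparts in other regex libraries. The Python sketch below mirrors them; the sample HTML line and the helper name fix_urls are invented for illustration.

```python
import re

URL = re.compile(r'"http://[^"]*"')

def fix_urls(html: str) -> str:
    """Replace spaces with underscores, but only inside quoted http:// URLs."""
    # The replacement is a function: it receives each match and returns the
    # text to put in its place, much like an expression replacement in Vim.
    return URL.sub(lambda m: m.group(0).replace(" ", "_"), html)

line = '<a href="http://example.com/a b c">Just description</a> x y'
print(fix_urls(line))

# Positive look-behind: replace bar only if it is preceded by foo.
print(re.sub(r"(?<=foo)bar", "X", "foobar bar"))
```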
I think what you want to do is to replace all spaces in your http:// URLs with _.
To achieve the goal, @melpomene's solution is straightforward. You could try it in your Vim.
On the other hand, if you want to simulate your sed line, you could try the following:
:let @s=':%s#\("http://[^" ]*\)\@<= #_#g^M'
^M means Ctrl-V then Enter
then
200@s
This works the same way as your sed line (label, do the replacement, back to the label, ...), and \@<= is used as well.
One problem is that, done this way, Vim cannot detect when all matches have been replaced. Therefore a relatively large number (200 in my example) is given, and at the end the error message "E486: Pattern not found..." is shown.
A script is needed to avoid the message.

replacing _ to - with sed , but only within href-attribute

I would like to replace in text-fragments like:
<strong>Media Event "New Treatment Options on November 4–5, 2010, in Paris, France<br /></strong>>> more
all underscores with dashes, but only in the href attribute. As there are hundreds of files, the best approach is to work on them with sed or a small shell script.
I started with
\shref=\"([^_].+?)([_].+?)\"
but this matches only one _, and I don't know how many underscores there will be. I'm stuck on how to replace the underscores dynamically with an unknown number of back-references.
A tool that's specifically geared toward working with HTML is by far preferable since trying to work with it using regexes can lead to madness.
However, assuming that there's only one href per line, you might be able to use this divide-and-conquer technique:
sed 's/\(.*href="\)\([^"]*\)\(".*\)/\1\n\2\n\3/;:a;s/\(\n.*\)_\(.*\n\)/\1-\2/;ta;s/\n//g' inputfile
Explanation:
s/\(.*href="\)\([^"]*\)\(".*\)/\1\n\2\n\3/ - put newlines around the contents of the href
:a;s/\(\n[^\n]*\)_\([^\n]*\n\)/\1-\2/;ta - replace the underscores one-by-one in the text between the newlines, t branches to label :a if a substitution was made
s/\n//g - remove the newlines added in the first step
Regular expressions are simply fundamentally the wrong tool for this job. There is too much context that must be matched.
Instead, you'll need to write something that goes character-by-character, with two modes: one in which it just copies all input, and one in which it replaces underscore with dash. On finding the start of an href it enters the second mode, on leaving an href it returns to the first. This is essentially a limited form of a tokenizer.
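The two-mode scanner described here could look roughly like the following Python sketch. It assumes a double-quoted attribute written exactly as href=" (lower case, no spaces around the equals sign); the function name dash_hrefs and the sample input are made up for illustration.

```python
def dash_hrefs(text: str, marker: str = 'href="') -> str:
    """Copy text, turning '_' into '-' only between href=" and the closing quote."""
    out = []
    i = 0
    inside = False  # False: plain copy mode, True: inside an href value
    while i < len(text):
        if not inside and text.startswith(marker, i):
            # Start of an href value: copy the marker and switch modes.
            out.append(marker)
            i += len(marker)
            inside = True
            continue
        ch = text[i]
        if inside:
            if ch == '"':
                inside = False  # closing quote: back to plain copy mode
            elif ch == "_":
                ch = "-"
        out.append(ch)
        i += 1
    return "".join(out)

print(dash_hrefs('<a href="my_file_name.html">some_text</a>'))
```

Unlike the sed one-liner, this does not depend on there being only one href per line, because the mode is tracked per character rather than per line.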

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of files that a client of mine sent me, named in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a three-letter uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
cià
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
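The difference between the two attempts can be checked directly. The sketch below uses Python's re module as a stand-in for the renaming app, whose regex flavor is not stated in the question.

```python
import re

name = "Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"

# First attempt: the second group may contain dots, so it can stretch
# all the way to ".DOC".
first = re.match(r".*x(\d+)\.(.*?)\.[A-Z]{3}.*", name)
print(first.group(1), first.group(2))

# Second attempt: [^.]*? can cover at most "The", and the text that follows
# is ".Actual", not a dot plus three uppercase letters, so nothing matches.
second = re.match(r".*x(\d+)\.([^.]*?)\.[A-Z]{3}.*", name)
print(second)
```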
Since you are on a Mac, you could use the shell:
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo "$filename" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
newfile=${file#*x}
newfile=${newfile%.[A-Z][A-Z][A-Z].*}
# or
# newfile=$(echo "$file" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
mv "$file" "$newfile"
done
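The snippets above stop at 212.The.Actual.Title.Of.the.Chapter, with the dots still in place. Turning the dots into spaces is a separate second step after the match. A minimal Python sketch of the two steps follows; the function name new_name is hypothetical, it only computes the new base name and does not rename anything, and re-attaching the extension is left out because the question says the app preserves it.

```python
import re

PATTERN = re.compile(r"x(\d+)\.(.*?)\.[A-Z]{3}")

def new_name(filename: str) -> str:
    """Return '<number> <title with spaces>', or the input unchanged if no match."""
    m = PATTERN.search(filename)
    if m is None:
        return filename
    number, title = m.group(1), m.group(2)
    # Second step: the regex only extracts; the dots are replaced afterwards.
    return number + " " + title.replace(".", " ")

print(new_name("Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"))
```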
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
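Both points, the effect of a leading .* and the implicit anchoring some APIs apply, can be seen in a short Python session. The test string "1x2.3x4." is invented so that x(\d+)\. can match in two places.

```python
import re

s = "1x2.3x4."

# With a greedy leading .*, the capture comes from the LAST place the rest
# of the pattern can match; without it, from the first.
padded = re.search(r".*x(\d+)\.", s).group(1)
plain = re.search(r"x(\d+)\.", s).group(1)
print(padded, plain)

# search() looks anywhere, match() is anchored at the start of the string,
# and fullmatch() is anchored at both ends.
anywhere = re.search(r"x(\d+)", s)
at_start = re.match(r"x(\d+)", s)
whole_ok = re.fullmatch(r"x(\d+)\.", "x2.")
whole_bad = re.fullmatch(r"x(\d+)", "x2.")
print(anywhere is not None, at_start, whole_ok is not None, whole_bad)
```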
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
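The distinction between capturing, grouping and consuming shows up in what a match object reports. A small Python illustration, using a fragment of the file name from the question:

```python
import re

fragment = "021x212"

# Two capturing groups: both numbers are captured.
capturing = re.search(r"(\d+)x(\d+)", fragment)
print(capturing.groups())

# (?:...) groups without capturing: the first number is still consumed
# (it is part of the overall match) but is not available as a group.
grouping_only = re.search(r"(?:\d+)x(\d+)", fragment)
print(grouping_only.groups(), grouping_only.group(0))
```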
If you haven't read it yet, check out this excellent tutorial.