Perl regex start of line anchor fails - regex

I want to change the 'this' on the second row of the example, but it's not happening. Most grateful for an idea where I'm going wrong.
echo "not this but
this one" > test.txt
perl -0777 -i -pe 's/^this/rhinoceros/igs' test.txt
cat test.txt
not this but
this one

You have all the wrong modifiers on your substitution. You presumably only want to make a single change, so the /g is unnecessary; the text to be matched is exactly this, so the /i is unnecessary; and you have no dot (.) characters in your pattern, so /s doesn't do anything.
What you do need is the /m (multi-line) modifier, so that ^ matches at the beginning of lines in the middle of the string as well as at the start of the string.
This should work for you:
perl -0777 -i -pe 's/^this/rhinoceros/m' test.txt

You are using the /s flag, but you actually want the /m flag.
perl -0777 -i -pe 's/^this/rhinoceros/igm' test.txt
/s makes . match newlines as well (treating the whole string as a single line), whereas /m makes ^ and $ match at the start and end of every line within the string.
Edit: See http://perldoc.perl.org/perlre.html for a more detailed treatment of the s and m modifiers that @ThisSuitIsBlackNot comments on below. For practical purposes, s "treat[s] string as single line".

Related

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7
With your shown samples, please try following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: Adding detailed explanation for above code.
^TIME\[ ##Matching string TIME from starting of value here.
\d+\.\d+ms\] ##Matching digits(1 or more occurrences) followed by dot digits(1 or more occurrences) followed by ms ] here.
\s+-\(\d+\)-\.+ ##Matching spaces (1 or more occurrences) followed by - digits (1 or more occurrences) - and 1 or more dots.
\K ##Using \K (a PCRE feature, available with grep -P) to drop everything matched so far from the reported match, so that only the part matched after it is printed.
.* ##to match till end of the value.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.
This awk using its sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
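sub() returns the number of replacements it made (0 or 1), so used as a pattern it is true, and the line is printed, only when a replacement happened.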
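A quick way to check this, using the sample line plus a made-up second line that does not match (that second line is an assumption for illustration only):

```shell
# Two input lines: the sample from the question and one that does not match.
out=$(printf '%s\n' \
  'TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"' \
  'unrelated line' |
  awk 'sub(/^TIME[[][^]]*].*\.+/, "")')

# Only the line where sub() made a replacement is printed.
printf '%s\n' "$out"
```

If the unrelated lines should be kept too, appending a bare 1 as a second pattern (awk '{sub(...)} 1') would print every line, changed or not.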
$ cut -d'"' -f2 file
TEXT I WANT TO KEEP
You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//' <<< "$s"
"TEXT I WANT TO KEEP"
The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.
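For comparison, here is a rough sketch of the same substitution in both syntaxes. The sample line is written with two spaces after the closing bracket, which is an assumption based on the \s\s in the question; the ERE variant is only one possible way to write it.

```shell
# Sample line, assuming two spaces between "]" and "-(3)-".
line='TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"'

# BRE (sed default): unescaped ( ) are literal, and \.* means zero or more dots.
bre=$(printf '%s\n' "$line" |
  sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//')

# ERE (sed -E): ( ) must be backslashed to be literal, and + is a repetition operator.
ere=$(printf '%s\n' "$line" |
  sed -E 's/^TIME\[[^][]*\][[:space:]]+-\(3\)-\.+//')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands should leave only the quoted text.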
A simpler one for sed:
sed 's/^[^"]*//' myfile.txt
If the text you want to keep is always surrounded by quotes like this, and those are the only quotes in the lines starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should get the line starting with "TIME..." and print the text within the quotes.
Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
Cheers,
Pedro

How to refer to the matched pattern in perl search and replace from bash command line?

I want to do search and replace(foo to foobar) with bash using perl command:
sudo perl -0777 -i -pe s/'foo'/'foobar'/gs a.txt
but I don't always know what is 'foo', so I want a variable kind of thing which stores the matched pattern.
Also can I get a substring of the matched pattern? Like 'foo' is replaced with 'oobar'(foo becomes oo)?
Use the expression evaluation modifier e.
perl -0777 -i -pe's/(foo)/substr($1, 1) . "bar"/egs' a.txt
$& refers to the entire matched text. Instead of taking a substring of it, I used a negative lookbehind, (?<!u), which asserts that the match is not immediately preceded by u:
sudo perl -i -0777 -pe 's/(?<!u)oo/u$&/gs' a.txt
This matches not only foo but any occurrence of oo that is not preceded by u, and replaces it with uoo.
In the example you gave, you don't need to use the matched text.
perl -pe's/foo\K/bar/g'
In the scenario you described, you don't need to use the matched text.
perl -pe's/f\Koo/oobar/g'
That said, $& contains the matched text.
perl -pe's/foo/$&bar/g'
And $1 contains the text matched by the earliest capture, $2 contains the text matched by the second earliest capture, etc.
perl -pe's/(f)oo/$1oobar/g'
And /e can be used to treat the replacement expression as code to execute for each match.
perl -pe's/foo/ substr($&,0,1)."oobar" /eg'
There's no point in using /s since the pattern doesn't contain a dot (.).
There's no point in using -0777 since your pattern can't span lines.
The quotes you used were useless, and it's less noisy to quote the entire program instead of individual sections of it.

Delete blank line before a pattern. What's wrong? Currently using Perl but open to sed/AWK

In a long document, I want to selectively delete the particular newlines that precede the exact string \begin{enumerate*}, ideally with a one-liner in bash or zsh.
That is, I want to convert test.tex:
Text in paragraphs.
More text
\begin{enumerate*} \item thing
to
Text in paragraphs.
More text \begin{enumerate*} \item thing
with a one-liner like
cat test.tex | perl -p -e 's/\n(?=(\\begin\{enumerate\*\}))/ /'
or
cat test.tex | perl -p -e 's/\n\\begin\{enumerate\*\}/\\begin{enumerate*}/'
but I must be missing something because it doesn't make any change.
I also clearly don't need a regular expression here. If there's a way to do this with exact string matching instead of regex, I'd rather use that way. For instance, in R I could do this with sub("\n\\begin{enumerate*}", "\\begin{enumerate*}", fixed = TRUE).
You can use the -0 (digit zero) switch with Perl to specify the line separator. Traditionally -0777 is used to read the entire file
You also need to be careful about regex metacharacters in your search string. Characters like *, {, } and \ mean something special within a regex pattern, and you should escape them — usually with a \Q ... \E construct
Taking these points into account, this should work for you
perl -0777 -pe' s/\n+(?=\Q\begin{enumerate*}\E)/ / ' myfile
perl -p processes a file line by line, so you can't expect this regex to match: the newline and the \begin that follows it are never in the same string.
I would recommend something like
perl -e '$text = join "", <>; $text =~ s/your_regex_here//; print $text' test.txt
Mind that it loads the whole file to memory.
Also, if you want to modify file immediately, you can't just say > test.txt, see this question.
I found a solution with sed (number 25 on this page) that doesn't read the entire file into memory:
sed -i bak -n '/^\\begin{enumerate\*}/{x;d;};1h;1!{x;p;};${x;p;}' test.tex
The downside is that this doesn't actually join the two lines; instead it produces
Text in paragraphs.
More text
\begin{enumerate*} \item thing
which is good enough for what I need (latex treats single newlines the same as regular spaces)

Perl match newline in `-0` mode

Question
Suppose I have a file like this:
I've got a loverly bunch of coconut trees.
Newlines!
Bahahaha
Newlines!
the end.
I'd like to replace an occurrence of "Newlines!" that is surrounded by blank lines with (say) NEWLINES!. So, ideal output is:
I've got a loverly bunch of coconut trees.
NEWLINES!
Bahahaha
Newlines!
the end.
Attempts
Ignoring "surrounded by newlines", I can do:
perl -p -e 's#Newlines!#NEWLINES!#g' input.txt
Which replaces all occurrences of "Newlines!" with "NEWLINES!".
Now I try to pick out only the "Newlines!" surrounded with \n:
perl -p -e 's#\nNewlines!\n#\nNEWLINES!\n#g' input.txt
No luck (note - I don't need the s switch because I'm not using . and I don't need the m switch because I'm not using ^ and $; regardless, adding them doesn't make this work). Lookaheads/behinds don't work either:
perl -p -e 's#(?<=\n)Newlines!(?=\n)#NEWLINES!#g' input.txt
After a bit of searching, I see that perl reads in the file line-by-line (makes sense; sed does too). So, I use the -0 switch:
perl -0p -e 's#(?<=\n)Newlines!(?=\n)#NEWLINES!#g' input.txt
Of course this doesn't work -- -0 replaces new line characters with the null character.
So my question is -- how can I match this pattern (I'd prefer not to write any perl beyond the regex 's#pattern#replacement#flags' construct)?
Is it possible to match this null character? I did try:
perl -0p -e 's#(?<=\0)Newlines!(?=\0)#NEWLINES!#g' input.txt
to no effect.
Can anyone tell me how to match newlines in perl? Whether in -0 mode or not? Or should I use something like awk? (I started with sed but it doesn't seem to have lookahead/behind support even with -r. I went to perl because I'm not at all familiar with awk).
cheers.
(PS: this question is not what I'm after because their problem had to do with a .+ matching newline).
Following should work for you:
perl -0pe 's#(?<=\n\n)Newlines!(?=\n\n)#NEWLINES!#g'
I think the way you went about things caused you to combine possible solutions in a way that didn't work.
If you use the in-place editing flag, you can do it like this:
perl -0p -i.bk -e 's/\n\nNewlines!\n\n/\n\nNEWLINES!\n\n/g' input.txt
I have doubled the \n's to make sure you only get the ones with empty lines above and below.
If the file is small enough to be slurped into memory all at once:
perl -0777 -pe 's/\n\nNewlines!(?=\n\n)/\n\nNEWLINES!/g'
Otherwise, keep a buffer of the last three lines read:
perl -ne 'push @buffer, $_; $buffer[1] = "NEWLINES!\n" if @buffer == 3 && ' \
-e 'join("", @buffer) eq "\nNewlines!\n\n"; ' \
-e 'print shift @buffer if @buffer == 3; END { print @buffer }'

perl -pe regex problem

I use perl to check some text input for a regex pattern, but one pattern doesn't work with perl -pe.
Following pattern doesn't work with the command call:
s![a-zA-Z]+ +(?:.*?)/(?:.*)Comp-(.*)/.*!$1!
I use the linux shell. Following call I use to test my regex:
cat test | perl -pe 's![a-zA-Z]+ +(?:.*?)/(?:.*)Comp-(.*)/.*!$1!'
File test:
A MaintanceGie?\195?\159mannFlock/System/Comp-Database.cpp
A MaintanceGie?\195?\159mannFlock/System/Comp-Cache/abc.h
Result:
A MaintanceGie?\195?\159mannFlock/System/Comp-Database.cpp
Cache
How can I remove the first result?
Thanks for any advice.
That last slash after "Comp-(.*)" may be what's doing it. Your file content in the "Database" doesn't have a slash. Try replacing Comp-(.*)/.* with Comp-(.*)[/.].* so you can match either the subdirectory or the file extension.
$ cat input
A MaintanceGie?\195?\159mannFlock/System/Comp-Database.cpp
A MaintanceGie?\195?\159mannFlock/System/Comp-Cache/abc.h
$ perl -ne 'print if s![a-zA-Z]+ +(?:.*?)/(?:.*)Comp-(.*)/.*!$1!' input
Cache
The problem is the last slash in the regex. It is a literal slash character, which the first input line does not have after Comp-Database (it has a dot there instead). Try this:
s![a-zA-Z]+ +(?:.*?)/(?:.*)Comp-(.*)[./].*!$1!
Edit: Updated to match new input data and added another option:
On the other hand, your replacement regex might be replaced by something like:
perl -ne 'print "$1\n" if /Comp-(.*?)[.\/]/'
Then there is no need to parse full line with whatever it contains.
\s matches whitespace (spaces, tabs, and line breaks), and + means one or more of the preceding item. So \s+ matches one or more whitespace characters.
cat test
A MaintanceGie?\195?\159mannFlock/System/Comp-Database.cpp
A MaintanceGie?\195?\159mannFlock/System/Comp-Cache/abc.h
perl -ne 'print "$1\n" if /\w+?\d+?\d+\w+\/\w+\/Comp-(\w+)[\/]/' test