I'm trying to modify a perl filter to insert a line that may be missing.
I might have inputs that are
A
B
C
or
A
C
A and B are fixed and known in advance. C can vary from file to file.
The real data are more complicated - callstacks generated as part of
a regression test. Depending on the compiler used (and hence the
optimization) there may be tail call elimination which can remove
the 'B' frame. After filtering the files are simply diffed.
In the second case I want to insert the 'B' line; in the first case I don't want to insert a duplicate. I thought this was a job for a negative lookahead, using the following
s/A.(?!B)/A\nB/s;
However, this seems to mean "if any part of A.(?!B) matches the input text then substitute it with A\nB", whereas I need "substitute only if all of A.(?!B) matches".
No matter what I try, it either always substitutes or never substitutes.
In a one-liner for a ready test
perl -0777 -wpe's/ ^A.*\n \K (?!B.*\n) /B-line\n/xgm' file
The \K makes the engine discard everything matched before it, so we don't have to capture that part and copy it back in on the replacement side. With the -0777 switch the whole file is slurped into a string, available in $_.
In order to match all such A-B?-C groups of lines we need the /g modifier (match "globally"), and for the anchor ^ to also match on internal newlines we need the /m modifier ("multiline").
The /x modifier makes it disregard literal spaces (and newlines and comments), which allows spacing things out for readability.
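For readers more comfortable with Python, the same insert-if-missing idea can be sketched with a lookahead (Python's re has no \K, so a capture group stands in for it; "B-line" is a stand-in for the real B frame):

```python
import re

text = "A\nC\nA\nB\nC\n"

# Insert "B-line" after any A-line that is not already followed by a B-line.
# (^A.*\n) captures the A-line; (?!B) refuses the substitution when B follows.
fixed = re.sub(r"(?m)(^A.*\n)(?!B)", r"\1B-line\n", text)
print(fixed, end="")
```

Only the second A-group, the one missing its B-line, gets the insertion; the first group is left alone, just as with the \K one-liner.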
On the other hand, if a line starting with A must be followed by either a line starting with B, or by a line starting with C if the B-line isn't there, then it's simpler, with no need for a lookahead
perl -0777 -wpe's/ ^A.*\n \K (^C.*\n) /B-line\n$1/xgm' file
Both these work correctly in my (basic) tests.
In either case, the rest of the file is printed as is, so you can use the -i switch to change the input file "in-place," if desired, and with -i.bak you get a backup as well. So
perl -i.bak -0777 -wpe'...' file
Or, if this runs from inside a script, you can redirect the output into the same file to overwrite it, since the whole file has already been read by then.
Reading the file line by line is of course far more flexible. For example
use warnings;
use strict;
use feature 'say';

my $just_saw_A_line;

while (<>) {
    if ($just_saw_A_line and not /^B/) {
        say "B-line";
    }
    $just_saw_A_line = /^A/;
    print;
}
This also handles multiple A-(B?)-C line groups. It's far more easily adjusted for variations.
The program acts like a filter, taking either STDIN or lines from file(s) given on the command line, and printing lines to STDOUT. The output can then be redirected to a file, but not to the input file itself. (If the input file needs to be changed in place, the code needs to be modified for that.)
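If it helps to compare, here is a rough Python equivalent of the same line-by-line filter; the function name and the "B-line" text are placeholders, as above:

```python
def insert_missing_b(lines):
    """Yield lines, inserting "B-line" wherever an A-line
    is not immediately followed by a B-line."""
    just_saw_a = False
    for line in lines:
        if just_saw_a and not line.startswith("B"):
            yield "B-line\n"
        just_saw_a = line.startswith("A")
        yield line

# Demo on the A-C case from the question; to use it as a filter,
# feed it sys.stdin and write the result to sys.stdout.
print("".join(insert_missing_b(["A\n", "C\n"])), end="")
```

The single boolean of state is all that's needed, which is what makes this version so easy to adjust for variations.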
Related
I have a text file named "data" with the content:
a
b
c
abc
I'd like to find every "abc" (it doesn't need to be on the same line) and replace the leading "a" with "A". Here "b" can be any character (one or more), but not "c".
(This is a simplification of my actual use case.)
I thought this perl command would do
perl -pi.bak -e 's/a([^c]+?)c/A\1c/mg' data
With this 'data' was changed to:
a
b
c
Abc
I was expecting:
A
b
c
Abc
I'm not sure why perl missed the first occurrence (on lines 1-3).
Let me know if you spot anything wrong with my perl command or you know a working alternative. Much appreciated.
You're reading a line at a time, applying the code to that one line. It can't possibly match across multiple lines. The simple solution is to tell perl to treat the entire file as one line using -0777.
perl -i.bak -0777pe's/a([^c]+c)/A$1/g' data
Replaced the incorrect \1 with $1.
Removed the useless /m. It only affects ^ and $, but you don't use those.
Removed the useless non-greedy modifier.
Moved the c into the capture to avoid repeating it.
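The line-at-a-time trap is easy to reproduce in Python, if that's a more familiar way to see it (the data and pattern are the ones from the question):

```python
import re

data = "a\nb\nc\nabc\n"
pattern = r"a([^c]+c)"

# Applied line by line, the pattern can never span the first three lines,
# so only the "abc" on the last line is rewritten.
per_line = "".join(re.sub(pattern, r"A\1", ln) for ln in data.splitlines(True))

# Applied to the whole file at once, the a..c spanning lines 1-3 matches too.
whole = re.sub(pattern, r"A\1", data)
```

`per_line` reproduces the output the asker saw; `whole` reproduces the output they expected.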
I have a file that contains several strings bounded by single quotations ('). These strings can contain whitespace and sometimes occur over multiple lines; however, no string contains a quotation (') mark. I'd like to create a regex that finds strings containing the character "$". The regex I had in mind: '[^']*\$[^']* can't search over multiple lines. How can I get it to do so?
Your regex can search over multiple lines. If it doesn't, there is a mistake in your code outside of it (hint: [^'] does match newlines).
How about this expression (it prevents useless backtracking):
'([^'$]*\$[^']*)'
You are not telling us which language you are using, so we are left to speculate. There are two issues here, really:
Many regex engines only process one line at a time by default
Some regex engines cannot process more than one line at a time
If you are in the former group, we can help you. But the problem isn't with the regex, so much as it's with how you are applying it. (But I added the missing closing single quote to your regex, below, and the negation to prevent backtracking as suggested in Tomalak's answer.)
In Python 2.x:
# doesn't work
import re

with open('file', 'r') as f:
    for line in f:
        # This is broken because it examines a single line of input
        if re.search(r"'[^'$]*\$[^']*'", line):
            print "match"
# works
import re

s = ''
with open('file', 'r') as f:
    for line in f:
        s += line
# We have collected all the input lines. Now examine them.
if re.search(r"'[^'$]*\$[^']*'", s):
    print "match"
(That is not the idiomatic, efficient, correct way to read in an entire file in Python. I'm using clumsy code to make the difference obvious.)
Now, more idiomatically, what you want could be
perl -0777 -ne 'while (m/\x27[^\x27$]*\$[^\x27]*\x27/g) { print "$&\n" }' file
(the \x27 is a convenience so I can put the whole script in single quotes for the shell, and not strictly necessary if you write your Perl program in a file), or
#!/usr/bin/env python
import re
with open('file', 'r') as f:
    for match in re.finditer(r"'[^'$]*\$[^']*'", f.read()):
        print match.group()
Similar logic can be applied in basically any scripting language with a regex engine, including sed. If you are using grep or some other simple, low-level regex tool, there isn't really anything you can do to make it examine more than one line at a time (but some clever workarounds are possible, or you could simply switch to a different tool -- pcregrep comes to mind as a common replacement for grep).
If you have really large input files, reading it all into memory at once may not be a good idea; perhaps you can devise a way to read only as much as necessary to perform a single match at a time. But that already goes beyond this simple answer.
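As a concrete illustration of the slurp-then-search approach, here is the pattern applied to a small hypothetical input in which one quoted string crosses a line break:

```python
import re

# Hypothetical slurped file contents: two quoted strings, one spanning lines.
text = "x 'plain string' y 'has $var\nacross lines' z"

# [^'] happily matches newlines, so the match crosses the line break;
# only the string containing "$" is found.
matches = re.findall(r"'[^'$]*\$[^']*'", text)
print(matches)
```

Only `'has $var\nacross lines'` is returned; the dollar-free string is skipped entirely.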
I regularly use regex to transform text.
To transform giant text files from the command line, perl lets me do this:
perl -pe 's/find/replace/g' < in.txt > out.txt
But this is inherently on a line-by-line basis. Occasionally, I want to match on multi-line things.
How can I do this in the command-line?
To slurp a file instead of doing line by line processing, use the -0777 switch:
perl -0777 -pe 's/.../.../g' in.txt > out.txt
As documented in perlrun #Command Switches:
The special value -00 will cause Perl to slurp files in paragraph mode. Any value -0400 or above will cause Perl to slurp files whole, but by convention the value -0777 is the one normally used for this purpose.
Obviously, for large files this may not work well, in which case you'll need to code some type of buffer to do this replacement. We can't advise any better though without real information about your intent.
Grepping across line boundaries
So you want to grep across line boundaries...
You quite possibly already have pcregrep installed. As you may know, PCRE stands for Perl-Compatible Regular Expressions, and the library is definitely Perl-style, though not identical to Perl.
To match across multiple lines, you have to turn on multi-line mode with -M, which is not the same as (?m).
Running pcregrep -M "(?s)^b.*\d+" text.txt
On this text file:
a
b
c11
The output will be
b
c11
whereas grep would return empty.
Excerpt from the doc:
-M, --multiline
Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence the output ends at the end of that line.
When this option is set, the PCRE library is called in "multiline" mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered.)
This question already has answers here:
How can I search for a multiline pattern in a file?
(11 answers)
Closed 1 year ago.
I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.
I've tried a few variations on the following:
$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"
This, however, just runs forever. Can anyone help me with the correct syntax please?
Without the need to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P activate perl-regexp for grep (a powerful extension of regular expressions)
-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware this also adds a trailing NUL char if used with -o, see comments.
-o print only the matching part. Because we're using -z the whole file is like a single big line, so without -o a match would cause the entire file to be printed; this prevents that.
In regexp:
(?s) activate PCRE_DOTALL, which means that . finds any character or newline
\N find anything except newline, even with PCRE_DOTALL activated
.*? matches . in non-greedy mode, that is, it stops as soon as possible.
^ find start of line
\1 backreference to the first group (\s*). This tries to match the closing brace at the same indentation as the method's opening line.
As you can imagine, this search prints the main method in a C (*.c) source file.
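Python's re supports the same tricks except \N; a character class [^\n] can stand in for it. Here is a rough equivalent of the pattern above, run against a tiny hypothetical C snippet:

```python
import re

src = "int main()\n{\n    return 0;\n}\nint other()\n{\n}\n"

# (?s): . matches newlines too; (?m): ^ anchors at internal line starts.
# (\s*) captures the leading indentation; \1 requires the closing brace
# to sit at the same indentation. [^\n] replaces PCRE's \N.
m = re.search(r"(?sm)^(\s*)[^\n]*main.*?\{.*?^\1\}", src)
print(m.group())
```

The match stops at the first closing brace whose indentation mirrors the function header, so only the main function is printed, not the rest of the file.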
I am not very good at grep, but your problem can be solved using the awk command.
Just see
awk '/select/,/from/' *.sql
The above command prints everything from the first occurrence of select through the first following occurrence of from. You then need to verify whether the returned statements contain customerName; for that you can pipe the result into awk or grep again.
Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.
Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.
I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.
In outline, for a single file:
$/ = "\n\n";    # Paragraphs
while (<>)
{
    if (m/SELECT.*?customerName.*?FROM/si)
    {
        print "$ARGV\n";    # print the file name
        close ARGV;         # move on to the next file
    }
}
That needs to be wrapped into a sub that is then invoked by the methods of File::Find.
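The same paragraph-at-a-time outline can be sketched in Python; the splitting on blank lines mimics Perl's $/ = "\n\n", and the SELECT/customerName/FROM pattern is the one from the question:

```python
import re

def file_matches(text):
    """True if any blank-line-separated paragraph contains
    SELECT ... customerName ... FROM, possibly across lines."""
    for para in re.split(r"\n{2,}", text):
        # re.S lets . cross the newlines inside a paragraph;
        # re.I matches SQL keywords case-insensitively.
        if re.search(r"SELECT.*?customerName.*?FROM", para, re.S | re.I):
            return True
    return False

print(file_matches("select id,\n  customerName\nfrom orders;"))
print(file_matches("select id from orders;"))
```

A directory walk (os.walk here, File::Find in the Perl version) would then call this on each *.sql file.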
I'm having a hard time selecting from a file using a regular expression. I'm trying to replace a specific piece of text in a file that is full of lines like this.
/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc
I'm trying to replace the bolded text. The problem is that I don't know how to select only the bolded text, since I need to use .wav in my regexp and the filename and location of the file will differ from line to line.
Hope you can help
Best regards,
Jökull
This assumes that what you want to replace is the string between the last two slashes in the first path.
sed 's|\([^/]*/\)[^/]*\(/[^/]* .*\)|\1FOO\2|' filename
produces:
/home/user/test2/data/FOO/train38.wav /home/user/test2/data/train/train38.mfc
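The sed pattern relies on backtracking: the (/[^/]* .*) tail can only match at the last segment of the first path (the one followed by a space), which forces the bare [^/]* onto the segment to be replaced. The same expression behaves identically in Python, which may make it easier to experiment with:

```python
import re

line = ("/home/user/test2/data/train/train38.wav"
        " /home/user/test2/data/train/train38.mfc")

# ([^/]*/) grabs the segment before the one to replace, the bare [^/]*
# is the segment being replaced, and (/[^/]* .*) pins the match to the
# final slash-delimited segment of the first path.
out = re.sub(r"([^/]*/)[^/]*(/[^/]* .*)", r"\1FOO\2", line)
print(out)
```

Only the "train" directory in the first path is replaced; the second path, with no trailing space after its last segment, is untouched.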
sed processes lines one at a time, so you can omit the global option and it will only change the first 'train' on each line:
sed 's/train/FOO/' testdat
vs
sed 's/train/FOO/g' testdat
which is a global replace
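Python's re.sub has the same first-match/global distinction via its count argument, shown here on a shortened version of the line from the question (FOO as the stand-in replacement):

```python
import re

line = "data/train/train38.wav data/train/train38.mfc"

# count=1 behaves like sed 's/train/FOO/': first occurrence only.
first_only = re.sub("train", "FOO", line, count=1)

# The default replaces every occurrence, like sed 's/train/FOO/g'.
everywhere = re.sub("train", "FOO", line)
```

`first_only` leaves the later "train" occurrences alone, while `everywhere` rewrites all four, including those inside the filenames.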
This is quite a bit more readable and less error-prone than some of the other possibilities, but of course there are applications which will not simplify quite as readily.
sed 's;\(\(/[^/]\+\)*\)/train\(\(/[^/]\+\)*\)\.wav;\1/FOO\3.wav;'
You can do it like this
sed -e 's/\<train\>/plane/g'
The \< tells sed to match the beginning of the word and the \> tells it to match the end of the word.
The g at the end means global, so it performs the match and replace across the entire line rather than stopping after the first successful match as it would without g.
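The same word-boundary idea in Python regex uses \b on both sides of the word:

```python
import re

s = "train the trainer on the train"

# \b matches at a word boundary, so "trainer" is left alone
# while both standalone "train" words are replaced.
out = re.sub(r"\btrain\b", "plane", s)
print(out)
```

This mirrors sed's \<train\> exactly: whole words only, not substrings of longer words.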