How do I match across newlines in a perl regex?

I'm trying to work out how to match across newlines with perl (from the shell). With the following:
(echo a b c d e; echo f g h i j; echo l m n o p) | perl -pe 's/(c.*)/[$1]/'
I get this:
a b [c d e]
f g h i j
l m n o p
Which is what I expect. But when I place an /s at the end of my regex, I get this:
a b [c d e
]f g h i j
l m n o p
What I expect and want it to print is this:
a b [c d e
f g h i j
l m n o p
]
Is the problem with my regex somehow, or my perl invocation flags?

-p loops over input line-by-line, where "lines" are separated by $/, the input record separator, which is a newline by default. If you want to slurp all of STDIN into $_ for matching, use -0777.
$ echo "a b c d e\nf g h i j\nl m n o p" | perl -pe 's/(c.*)/[$1]/s'
a b [c d e
]f g h i j
l m n o p
$ echo "a b c d e\nf g h i j\nl m n o p" | perl -0777pe 's/(c.*)/[$1]/s'
a b [c d e
f g h i j
l m n o p
]
See Command Switches in perlrun for information on both those flags. -l (dash-ell) will also be useful.

The problem is that your one-liner works one line at a time; your regex is fine:
use strict;
use warnings;
use 5.014;
my $s = qq|a b c d e
f g h i j
l m n o p|;
$s =~ s/(c.*)/[$1]/s;
say $s;

There's More Than One Way To Do It: since you're reading "the entire file at a time" anyway, I'd personally drop the -p modifier, slurp the entire input explicitly, and go from there:
echo -e "a b c d e\nf g h i j\nl m n o p" | perl -e '$/ = undef; $_ = <>; s/(c.*)/[$1]/s; print;'
This solution is longer, but it may be easier for other readers to follow (which will be you in three months' time ;-D )

Actually, with -p your one-liner is equivalent to this:
while (<>) {
    $_ =~ s/(c.*)/[$1]/s;
} continue {
    print;
}
This means the regex only ever sees one line of your input at a time.

You're reading a line at a time, so how do you think it can possibly match something that spans more than one line?
Add -0777 to redefine "line" to "file" (and don't forget to add /s to make . match newlines).
$ (echo a b c d e; echo f g h i j; echo l m n o p) | perl -0777pe's/(c.*)/[$1]/s'
a b [c d e
f g h i j
l m n o p
]

Related

egrep the line ended with

$ cat file
c f t e, u y r s p I y
p A w p d. R i
G e w o a l n o v s.
P G e a o c f s p
k e i c w a p p e.
$ od -c file
0000000 c f t e , u y r s
0000020 p I y \r \n p A w p
0000040 d . R i \r \n G e w o
0000060 a l n o v s . \r \n P
0000100 G e a o c f s p
0000120 \r \n k e i c w a p
0000140 p e . \r \n
0000146
I tried to use egrep to find all lines that end with a period (.), but I was not able to do it.
For example:
$ egrep '.*\.' file
p A w p d. R i
G e w o a l n o v s.
k e i c w a p p e.
That is not the output I want.
I also tried using $ to anchor the dot, \r, and \n, but none of them worked.
Any suggestions will help.

Start by converting your file to Unix format:
dos2unix file
Then you can simply use this command:
egrep "[.]$" file

Your file is in DOS format (with carriage return / line feed endings). Either convert it to unix format first and use
egrep '\.$'
or leave the file unchanged and search for a literal carriage return
egrep $'\\.\r$'
(using bash trickery because grep doesn't understand \r).
egrep '.*\.' just finds all lines that contain a . anywhere.
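Both options can be tried without touching the original file. The sketch below assumes a POSIX shell with mktemp available and uses a hypothetical three-line sample with CRLF endings. It swaps dos2unix for tr -d '\r' and builds the carriage return with printf instead of bash's $'...' quoting, so these are stand-ins for the commands above rather than the same commands.

```shell
# Hypothetical sample file with DOS (CRLF) line endings.
sample=$(mktemp)
printf 'p A w p d. R i\r\nG e w o a l n o v s.\r\nP G e a o c f s p\r\n' > "$sample"

# Option 1: strip carriage returns on the fly, then anchor on the period.
stripped=$(tr -d '\r' < "$sample" | grep -E '\.$')

# Option 2: leave the file alone and allow an optional CR before end of line.
cr=$(printf '\r')
in_place=$(grep -E "\.${cr}?\$" "$sample" | tr -d '\r')

rm -f "$sample"
printf '%s\n' "$stripped"
```

Making the carriage return optional means the same pattern also works once the file has been converted.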

Improve Performance of Last Occurrence Match in Perl Regex

I need to find the last occurrence of a match from an array of acceptable values. Below is the source code in Perl. The answer is Q because it is the last occurrence among the acceptable values A, Q, I and J.
The challenge is how to change my code to make the regex faster. It is currently a bottleneck because I have to run it millions of times.
my $input = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z";
my $regex = qr/(A|Q|I|J)/;
my @matches = $input =~ m/\b$regex\b/g;
print $matches[$#matches];
I would like to see new code that improves the matching speed but still finds the Q match.

You can find the last match by simply adding a .* before the matching pattern.
Like this
my $input = "APPLE B C D E F G H INDIGO JACKAL K L M N O P QUIVER R S T U V W X Y Z";
my $regex = qr/APPLE|QUIVER|INDIGO|JACKAL/;
my ($last) = $input =~ /.*\b($regex)\b/;
print $last, "\n";
output
QUIVER

Use \K to drop everything matched before it from the final match, so that $& holds only the last occurrence.
my $input = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z";
my $regex = qr/.*\K\b[AQIJ]\b/;
if ($input =~ m/$regex/) {
print $&."\n";
}
Use capturing group.
my $input = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z";
my $regex = qr/.*\b([AQIJ])\b/;
if ($input =~ m/$regex/) {
print $1."\n";
}
Update:
my $input = "Apple Orange Mango Apple";
my $regex = qr/.*\K\b(?:Apple|Range|Mango)\b/;
if ($input =~ m/$regex/) {
print $&."\n";
}

Solve a puzzle using bash tools such as grep

I need to solve a puzzle using shell script. I tried to combine grep with rev and saved the output into a temporary text file but still don't know how to solve it entirely.
That's the puzzle to solve :
j s e t f l
a l s f e l
g a a n p l
e p f d p k
r e g e l a
f n e t e n
The file that contains the wordlist to use is in http://pastebin.com/DP4mFZAr
I know how to tell grep to take its patterns as fixed strings from a text file, using $ grep -Ff wordlist puzzle, and
how to search for mirrored words using $ rev puzzle | grep -Ff wordlist, thus dealing with the horizontal lines. But how do I deal with vertical words too?

I am covering horizontal and vertical matching. The main idea is to remove the spaces and then use grep -f with the given list of words, stored in words file.
With grep -f, the results are shown within the line. If you just want to see the matched text, use grep -of.
Horizontal matching
$ cat puzzle | tr -d ' ' | grep -f words
alsfel
gaanpl
regela
fneten
$ cat puzzle | tr -d ' ' | grep -of words
als
gaan
regel
eten
Vertical matching
For this, we first have to transpose the content of the file. I use a function from another answer of mine:
transpose () {
    awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
         END {for (i=1; i<=max; i++)
                  {for (j=1; j<=NR; j++)
                       printf "%s%s", a[i,j], (j<NR?OFS:ORS)
                  }
             }'
}
And let's see:
$ cat puzzle | transpose | tr -d ' ' | grep -f words
jagerf
slapen
esafge
tfndet
lllkan
$ cat puzzle | transpose | tr -d ' ' | grep -of words
jager
slapen
af
ge
de
kan
You can then use rev (as you suggest in your question) for mirrored words. Also tac can be interesting for vertically mirrored words.
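The horizontal and vertical passes can be combined into one small script. This is only a sketch: it assumes a POSIX shell with mktemp and a grep that supports -o, the four-word list is a hypothetical subset (the real wordlist from the question is not reproduced here), and transpose_letters is a simplified stand-in for the transpose function above that drops the spaces while it transposes.

```shell
puzzle=$(mktemp)
words=$(mktemp)
printf '%s\n' 'j s e t f l' 'a l s f e l' 'g a a n p l' \
              'e p f d p k' 'r e g e l a' 'f n e t e n' > "$puzzle"
# Hypothetical subset of the wordlist, for illustration only.
printf '%s\n' gaan regel slapen jager > "$words"

# Build one string per column by appending each field to its column.
transpose_letters() {
  awk '{ for (i = 1; i <= NF; i++) col[i] = col[i] $i; if (NF > max) max = NF }
       END { for (i = 1; i <= max; i++) print col[i] }'
}

horizontal=$(tr -d ' ' < "$puzzle" | grep -oFf "$words")
vertical=$(transpose_letters < "$puzzle" | grep -oFf "$words")

rm -f "$puzzle" "$words"
printf 'horizontal:\n%s\nvertical:\n%s\n' "$horizontal" "$vertical"
```

Piping either stream through rev before grep would cover the mirrored directions in the same way.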
Diagonal matching
For the diagonal matching, I think that an interesting approach would be to move every single line a little bit to the left/right. This way,
e x x x x
x g x x x
x x g x x
can become
e x x x x
g x x x
g x x
and you can use the vertical/horizontal approaches.
For this, you can use printf as described in Using variables in printf format:
$ cat a
e x x x x
x g x x x
x x g x x
$ awk -v c=20 '{printf "%*s\n", c, $0; c-=2}' a
           e x x x x
         x g x x x
       x x g x x
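As an alternative to shifting the lines, the diagonals can be read off directly by grouping cells whose column index minus row index is the same. This is a different technique from the printf padding above, sketched here under the assumption of a POSIX shell and awk; diagonals is a hypothetical function name.

```shell
# Print each top-left to bottom-right diagonal as one string, starting with
# the diagonal in the bottom-left corner.
diagonals() {
  awk '{ for (i = 1; i <= NF; i++) d[i - NR] = d[i - NR] $i
         if (NF > w) w = NF }
       END { for (k = 1 - NR; k <= w - 1; k++) if (k in d) print d[k] }'
}

out=$(printf 'e x x x x\nx g x x x\nx x g x x\n' | diagonals)
printf '%s\n' "$out"
```

The resulting strings can be fed to grep -of words like the rows and columns. Grouping by i + NR instead of i - NR would give the diagonals running in the other direction.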

How do I properly match unicode characters with awk's regex?

I have the following statement in a script, to retrieve the domain portion of an email address from a variety of email logs with a reliably formatted To: line:
awk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
This matches lines such as To: doc@bequerelint.net (Omer). However, it does not match the lines To: andy.vitrella@uol.com.br (André) or To: boggers@operamail.com (Pål), nor any other line with a non-ascii character within the trailing parentheses after the email address.
Incidentally, od -c for the first non-matching example gives:
0000000 T o : a n d y . v i t r e l l
0000020 a @ u o l . c o m . b r ( A n
0000040 d r 351 ) \n
0000045
I surmise there is something going on with awk's regex's . not matching the non-ascii character in (André). What is the correct regex statement to match such a line?

I give my comment as an answer to have the code formatted correctly,
$ echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' | gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}'
uol.com.br
operamail.com
$ echo 'To: andy.vitrella@uol.com.br (André)
To: boggers@operamail.com (Pål)' > fileee12
$ gawk '/^To: / { r = gensub(/^To: .+@(.+) .*$/, "\\1", "g"); print r}' fileee12
uol.com.br
operamail.com
$ env | grep -e '\(LOC\)\|\(LAN\)'
LANG=C
XTERM_LOCALE=C
$
As you can see, your command works both when reading from stdin and when reading from a file, using a C locale. So on my computer I can rule out both the locale and the difference between stdin and a file as the cause.
My computer runs Linux and my gawk is 4.1.1. What is your setup?
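One possible cause, which the thread does not confirm, is a UTF-8 locale on the asker's machine: the od -c dump shows é as the single Latin-1 byte \351, which is not a valid UTF-8 character, so . may refuse to match it. The sketch below assumes a POSIX shell and any awk, forces the C locale for the awk call only, and uses two sub calls instead of the gawk-only gensub.

```shell
# \351 is the single Latin-1 byte for é seen in the od -c dump.
line=$(printf 'To: andy.vitrella@uol.com.br (Andr\351)')

# LC_ALL=C makes awk treat the input as plain bytes, so . can match \351.
domain=$(printf '%s\n' "$line" |
  LC_ALL=C awk '/^To: / { sub(/^To: .+@/, ""); sub(/ .*$/, ""); print }')

printf '%s\n' "$domain"
```

If the same command fails without LC_ALL=C on the affected machine, that would point to the locale as the culprit.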

Simplifying further, in a way where the locale setting doesn't matter (this works with mawk, mawk2, or gawk, with or without -b):
awk 'BEGIN { FS = "\100" }   # octal 100 is the at sign (@)
/^To: / && NF > 1 {          # NF > 1 skips lines with no at sign
    # if there is no " (Omer)" part, $2 is already the domain
    print (($2 !~ / /) ? $2 : substr($2, 1, index($2, " ") - 1))
}'
Since spaces aren't valid in an email address (unless URI-encoded (?)), and you're forcing the delimiter to be @, this substr alone does the job without any gensub or Unicode handling.

How do I get GNU grep to match exactly "H" and not things that just start with "H"?

I have a file and I would like to find the total number of occurrences of a passed-in word, while supporting regex:
grep -e "Hello*" filename | wc -w
But there are a few bugs. Let's say I do something like
grep -e "H" filename | wc -w
It should only match EXACTLY H and not count things that start with H, the way grep does it right now.
Anyone know how?

try this:
grep '\bH\b'
e.g.:
kent$ echo "Hello
IamH
we need this H
and this H too"|grep '\bH\b'
we need this H
and this H too
Note that if you want to count only the matched words, you need to use the -o option of grep. (thx fotanus)
EDIT
You can get all matched words with grep -o. In this case -c doesn't help, because it counts matching lines rather than matches. Instead, pipe the output of grep -o to wc -l.
for example:
kent$ echo "No Hfoo will be counted this line
this line has many: H H H H H H H (7)
H (8 starting)
foo bar (9 ending) H
H"|grep -o '\bH\b'|wc -l
10
or simpler, single process solution with awk:
awk '{s+=gsub(/\<H\>/,"")}END{print s}' file
same example:
kent$ echo "No Hfoo will be counted this line
this line has many: H H H H H H H (7)
H (8 starting)
foo bar (9 ending) H
H"|awk '{s+=gsub(/\<H\>/,"")}END{print s}'
10
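Since the word is passed in, the grep variant can be wrapped in a small function. This is a sketch assuming a POSIX shell, mktemp, and a grep that supports -o and -w (GNU and BSD grep both do). count_word is a hypothetical name, and -w is used in place of the explicit \b anchors.

```shell
# Print how many times the word or regex $1 occurs as a whole word in file $2.
count_word() {
  # -o prints each match on its own line, -w requires whole-word matches;
  # tr strips the padding that some wc implementations add.
  grep -ow -e "$1" "$2" | wc -l | tr -d ' '
}

sample=$(mktemp)
printf '%s\n' 'No Hfoo will be counted this line' \
              'this line has many: H H H H H H H (7)' \
              'H (8 starting)' \
              'foo bar (9 ending) H' \
              'H' > "$sample"
total=$(count_word 'H' "$sample")
rm -f "$sample"
printf '%s\n' "$total"
```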
