Perl writing to a 'memory file' plays tricks with pattern matching - regex

When I run this code, I get "no" printed out:
my $memory_file;
my $fh;
open ($fh, '>', \$memory_file);
print $fh "abc";
if( $memory_file =~ m/^.*$/ )
{ print "yes\n" }
else
{ print "no\n" }
If I print out $memory_file, the contents are indeed "abc".
If I change the pattern to .* (no ^ or $) it works as expected.
If I put the line $memory_file = "abc" before the match, I get 'yes' printed out (as originally expected).
What on earth is going on here?
(This is perl 5.14.1)
Update: Some more discussion on PerlMonks. It seems to be a bug; I will log it.
Update 2: The fine Perl developers have fixed this bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=108398

It is the end of line character that is messing things up. While a regular assignment works:
my $str = "abc";
print "Works" if $str =~ /^.*$/;
...the code in the question does not. This regex should match any string, since it also matches the empty string. Even undefined values would match (though it would cause a warning). Also, ^.* does match.
The only reasonable explanation is that for some reason, whatever check is performed to match an end of string, it is not finding it. The end of the string is missing.
Curiously, replacing $ with \z works. But not \Z.
Adding a newline also works. Which would sort of make sense, as adding a newline would imply that an end of the string is also added, in the non-multiline regex sense.
I don't know the inner workings of why this happens, but I suspect that when using this particular form of "assignment", an end-of-the-string marker is never placed on the string. A sort of "raw" assignment, which confuses the check for end of string in the regex.
It feels like a bug. Perhaps this particular feature has not been properly maintained.
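For reference, the \z workaround mentioned above can be wrapped in a small check. This is only a sketch: the helper name write_to_memory is made up for illustration, and because it returns a copy of the scalar it also behaves like the "auxiliary variable" approach, so it should print "yes" whether or not the perl in use has the bug.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): write $text through an
# in-memory filehandle and return a copy of the scalar that backs it.
sub write_to_memory {
    my ($text) = @_;
    my $memory_file = '';
    open(my $fh, '>', \$memory_file)
        or die "Cannot open in-memory file: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close in-memory file: $!";
    return $memory_file;
}

my $contents = write_to_memory("abc");

# \z asserts the very end of the string, with no special treatment of a
# trailing newline, unlike $ and \Z.
print $contents =~ /^.*\z/ ? "yes\n" : "no\n";
```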

Just use an auxiliary variable.
#$| = 1; # AutoFlush
my $memory_file;
open ( my $fh, '>>', \$memory_file) or die $!;
print $fh "abc";
my $buff = $memory_file;
if( $buff =~ m/^abc$/ ){ # or m/^.*$/
print "yes\n";
}else{
print "no\n";
}

I think the problem is the newline and the $ anchor. It checks the string after the newline, and your string does not have one.
EDIT:
See the detailed explanation of the problem in the other answers, because mine was incorrect. But I will give two options to solve it:
Write "abc\n" to your file and check with $.
Check for the end of the string with \z.

Well, my first instinct is to say that files in memory do not act the same as physical files. Try changing print $fh "abc"; to print $fh "abc\n"; to see if your input 'becomes' a line.
My other instinct is that your file isn't actually getting written to before you read from it. Flush your buffer with $|++;
So, try:
my $memory_file;
my $fh;
open ($fh, '>', \$memory_file);
print $fh "abc";
$|++;
if( $memory_file =~ m/^.*$/ )
{ print "yes\n" }
else
{ print "no\n" }

Perl - Uninitialised variable warning

I'm struggling to work out what I'm doing wrong here.
The goal of my code is to read a file, movie_script.txt, and then use regexes to sort each line into an array based on the character speaking the line. It does work, but I'm getting output preceded by these warnings:
Use of uninitialized value $char in string eq at filter.pl line 24, <$fh> line 13.
Use of uninitialized value $char in string eq at filter.pl line 26, <$fh> line 13.
Use of uninitialized value $char in string eq at filter.pl line 28, <$fh> line 13.
[...]
Hello, mother.
Oh. Well-- well, I, uh--
Well, uh, I think they must have popped by for something.
Mm, they-- they started following me yesterday.
Here's the code:
use strict;
use warnings;
my $filename = "movie_script.txt";
unless (-e $filename) {
    print "Error: File does not exist.";
}
my @brian;
my @mandy;
my @followers;
open(my $fh, '<', $filename);
my $match = qr/^(\w+):(.+)$/i;
while (my $line = <$fh>) {
    my $char = "";
    my $scriptline = "";
    if ($line) {
        ($char, $scriptline) = $line =~ $match;
        if ($char eq "BRIAN") {
            push(@brian, $scriptline);
        } elsif ($char eq "MANDY") {
            push(@mandy, $scriptline);
        } elsif ($char eq "FOLLOWERS") {
            push(@followers, $scriptline);
        } else {
            print($line);
        }
    }
}
foreach (@brian) {
    print "$_\n";
}
I suspect that the problem is a line doesn't fit my regex, and it's causing problems for the variables $char and $scriptline, but I don't know how to confirm whether that's true, or how to find out which line is causing the problem.
I've tried running the Perl debugger using perl -d, but when I proceed through each line, I can't find the error. I've tried to set a breakpoint around the else { print($line) } line, but I can't work out how to run the debugger until it reaches that line.
Is there an obvious reason in my code why I'm getting an uninitialised value problem?
If you've got lines in the (movie) script that do not have a character speaking them, your regex will not match, and $char and $scriptline will not have values. You will want to skip those lines in some way.
There are many ways to do that, but one way would be to move the match operator to the if condition:
if (($char, $scriptline) = $line =~ $match) {
    if ($char eq "BRIAN") {
        push(@brian, $scriptline);
    } elsif ($char eq "MANDY") {
        push(@mandy, $scriptline);
    } elsif ($char eq "FOLLOWERS") {
        push(@followers, $scriptline);
    } else {
        print($line);
    }
}
The script will now ignore all lines that aren't dialog, push lines spoken by these specific cast members, and print lines spoken by others.
Consider asking Perl to tell you what the problem is.
if ($line) {
    if (my ($char, $scriptline) = $line =~ $match) {
        # Your existing code here
    } else {
        warn "Line [$line] doesn't match the regex [$match]\n";
    }
}
Note I've also moved the declaration of $char and $scriptline into the smallest possible scope. There's no point in declaring them any earlier or pre-populating them (as you're going to overwrite that data in the match line).
The reason for your uninitialized warnings is that some lines in your file do not match the regex, but you still use the variables. The solution is to not use the variables when the regex does not match by checking with an if statement as shown in the example below.
A few quick points.
Consider using a hash instead of a number of arrays to store your lines. This will make the script re-usable and flexible.
It is unnecessary to hard code a file name. You can simply supply the file name on the command line, and use the diamond operator <> to read the file: while (my $line = <>) { ....
And run it with
$ lines.pl movie_script.txt
You do not need to add variables for the things you match with your regex. That is already taken care of since you use parentheses () in your regex. The matches are stored in pre-defined variables $1 and $2.
You capture leading spaces in your lines, which can be fixed by adding \s* in the regex in front of the second parenthesis.
You are using the /i modifier in your regex, which is unused, since you do not have any letters in your regex. (E.g. if you did /foo/i it would match FOO)
You can use the /s modifier to allow .+ to match the newline, so you do not have to add it back later.
Your program is a very basic, typical perl one-liner, and it doesn't need to be complicated. In this example, I am using the Data::Dumper module to show you what the resulting data structure looks like:
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Useqq = 1; # show newline in Dumper print
my %lines;
my $match = qr/^(\w+):\s*(.+)/s; # remove unused /i, add /s
while (my $line = <DATA>) {
    if ($line =~ $match) {       # Check before you use the variables
        push @{$lines{$1}}, $2;  # Store the matches in your hash
    } # <--- if you want to do something with unmatched lines, use else
}
print for (@{$lines{BRIAN}});
print Dumper \%lines;
__DATA__
BRIAN: Hello, mother.
MANDY: Hi
BRIAN: Oh. Well-- well, I, uh--
FOLLOWERS: Hello
(I used the <DATA> filehandle to simulate your text file inside my program, just replace <DATA> with <>)
This program will output
Hello, mother.
Oh. Well-- well, I, uh--
$VAR1 = {
"BRIAN" => [
"Hello, mother.\n",
"Oh. Well-- well, I, uh--\n"
],
"MANDY" => [
"Hi\n"
],
"FOLLOWERS" => [
"Hello\n"
]
};
It will work with different file names, it will capture lines of any named characters, not just the 3 you hard coded.
The following Perl script
takes a filename for processing, or uses the default filename 'movie_script.txt'
looks in the file for the pattern CHARACTER: LINE and fills a hash with the data
sorts the characters and prints their lines
use strict;
use warnings;
use feature 'say';
my $filename = shift || 'movie_script.txt';
my $match = qr/^(\w+):(.+)$/i;
my %script;
open my $fh, '<', $filename
or die "Couldn't open $filename";
while(<$fh>) {
    next if /^\s*\Z/;
    push @{$script{$1}}, $2 if /$match/;
}
close $fh;
for my $char ( sort keys %script ) {
    say $char;
    say "\t$_" for @{$script{$char}};
}
Output
BRIAN
Oh. Well-- well, I, uh--
Well, uh, I think they must have popped by for something.
Mm, they-- they started following me yesterday.
FOLLOWERS
The Messiah! The Messiah! Show us the Messiah!
The Messiah!
The Messiah! The Messiah!
Show us the Messiah! The Messiah! The Messiah! Show us the Messiah!
MANDY
Don't you 'hello mother' me. What are all those people doing out there?!
Come on! What have you been up to, my lad?!
'Popped by'?! 'Swarmed by', more like! There's a multitude out there!
Well, they can stop following you right now. Now, stop following my son! You ought to be ashamed of yourselves.
The who?
Huh, there's no Messiah in here. There's a mess, all right, but no Messiah. Now, go away!
Ooooh.
Now, you listen here! He's not the Messiah. He's a very naughty boy! Now, go away!
RIAN
Hello, mother.

Searching /etc/passwd for username

I'm trying to search the /etc/passwd file for a username that the user inputs, but I'm not sure about the if statement syntax. My current code is:
print "Username?";
$username = <STDIN>;
open (PASSWD, '/etc/passwd')
while (<PASSWD>);
if (PASSWD ((split(/:/)[0] =~ $username)) {
print ((split(/:/)[5]), "\n");
}
close (PASSWD);
Something is wrong with the syntax and I'm having difficulty finding the correct way despite searching stackoverflow and google. Any help would be appreciated.
Perl has built-in functions for that, see getpwnam or User::pwent:
use warnings;
use strict;
print "Username? ";
chomp( my $username = <STDIN> );
die "Unknown user $username\n" unless getpwnam($username);
my $dir = (getpwnam($username))[7];
print $dir, "\n";
# - or -
use User::pwent;
print getpwnam($username)->dir, "\n";
I assume the missing semicolon at the end of your open() line is a typo.
Your while statement needs a block, not a semicolon.
while (<PASSWD>) {
... # stuff
}
When you run this line:
$username = <STDIN>;
Then $username will end up containing all of the characters the user has typed at the command line. Crucially, that includes the newline character that was generated when they pressed the "Enter" key.
You then go on to compare that variable with the first field from the records in /etc/passwd. Those fields don't contain a newline character, so the match never succeeds.
You'll want to remove the newline from the end of $username. That's what chomp() is for.
chomp($username = <STDIN>);
Also, the PASSWD in your if statement is very strange. I'm not sure why you think it's necessary. It's not. Note too that a list slice needs its own pair of parentheses around the split call.
if ((split(/:/))[0] =~ $username) {
But actually, a regex check is overkill here. You should be checking for string equality.
if ((split(/:/))[0] eq $username) {
A couple of other tips:
Always use strict and use warnings.
Use lexical variables for filehandles, use the three-argument version of open(), and always check the return value from open():
open my $passwd_fh, '<', '/etc/passwd'
or die "Cannot open /etc/passwd: $!\n";
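Putting the points above together, the lookup might be sketched as follows. The helper name home_dir_for and its second parameter are assumptions made for illustration; the field positions (0 for the login name, 5 for the home directory) are the ones used in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the home directory (field 5) recorded for
# $username in a passwd-style file, or undef if there is no such user.
sub home_dir_for {
    my ($username, $passwd_path) = @_;

    open my $passwd_fh, '<', $passwd_path
        or die "Cannot open $passwd_path: $!\n";

    while (my $line = <$passwd_fh>) {
        chomp $line;
        my @fields = split /:/, $line;
        next unless @fields;                            # skip empty lines
        return $fields[5] if $fields[0] eq $username;   # exact match, not a regex
    }
    return;    # no matching user found
}
```

It could then be called with something like home_dir_for($username, '/etc/passwd') once $username has been chomped.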

Using Perl to print multiple lines

This code grabs a keyword 'fun' from text files that I have and then prints the 20 characters before and after the keyword. However, I also want it to print the previous 2 lines and the next two lines, and I'm not sure how to do that. I wasn't sure if it is easier to change the code with this or just read the whole file at one time.
{my $inputfile = "file";
$searchword = 'fun';
open (INPUT, '<', $inputfile) or die "fatal error reading the file \n";
while ($line1=<INPUT>)
{
#read in a line of the file
if ($line1 =~m/$searchword/i)
{print "searchword found\n";
$keepline = $line1;
$goodline =1;
$keepline =~/(.{1,20})(fun)(.{1,20})/gi;
if ($goodline==1)
{&write_excel};
$goodline =0;
}
Your code as it stands seems to:
Take 20 chars each side of 'pledge', not $searchword;
Have an unmatched '{' at the start;
Not print any file contents, save for &write_excel, which we can't examine; and
Have a logic problem, in that if $searchword is found, $goodline is unconditionally set to '1', then tested to see if it's '1', and finally reset to '0'.
Putting that aside, the question of whether to read in the whole file depends somewhat on your circumstances: how big are the files you're going to be searching, does your machine have plenty of memory, is the machine a shared resource, and so on. I'm going to presume you can read in the whole file, as that's the more common position in my experience (those who disagree, please keep in mind that (a) I've acknowledged that it's debatable; and (b) it's very dependent on circumstances that only the OP knows).
Given that, there are several ways to read in a whole file but the consensus seems to be to go with the module File::Slurp. Given those parameters, the answer looks like this;
#!/usr/bin/env perl
use v5.12;
use File::Slurp;
my $searchword = 'fun';
my $inputfile = "file.txt";
my $contents = read_file($inputfile);
my $line = '\N*\n';
if ( $contents =~ /(
$line?
$line?
\N* $searchword \N* \n?
$line?
$line?
)/x) {
say "Found:\n" . $1 ;
}
else {
say "Not found."
}
File::Slurp prints a reasonable error message if the file isn't present (or something else goes wrong), so I've left out the typical or die.... Whenever you work with regexes, particularly if you're trying to match stuff on multiple lines, it pays to use "extended mode" (by putting an 'x' after the final '/') to allow insignificant whitespace in the regex. This allows a clearer layout.
I've also separated out the definition of a line for added clarity: it consists of zero or more non-newline characters, \N*, followed by a newline, \n. However, if your target is on the first, second, second-last or last line, I presume you still want the information, so the requested preceding and following pairs of lines are matched optionally, with $line?.
Please note that regular expressions are pedantic, and there are inevitably 'fine details' that affect the definition of a successful match vs an unwanted match, i.e. don't expect this to do exactly what you want in all circumstances. Expect that you'll have to experiment and tweak things a bit.
I'm not sure I understand your code block (what purpose does "pledge" have? what is &write_excel?), but I can answer your question itself.
First, is this grep command acceptable? It's far faster and cleaner:
grep -i -C2 --color "fun" "file"
The -C NUM flag tells grep to provide NUM lines of context surrounding each pattern match. Obviously, --color is optional, but it may help you find the matches on really long lines.
Otherwise, here's a bit of perl:
#!/usr/bin/perl
my $searchword = "fun";
my $inputfile = "file";
my $blue = "\e[1;34m"; # change output color to blue
my $green = "\e[1;32m"; # change output color to green
my $nocolor = "\e[0;0m"; # reset output to no color
my $prev1 = my $prev2 = my $result = "";
open (INPUT, '<', $inputfile) or die "fatal error reading the file \n";
while(<INPUT>) {
if (/$searchword/i) {
$result .= $prev2 . $prev1 . $_; # pick up last two lines
$prev2 = $prev1 = ""; # prevent reusing last two lines
for (1..2) { # for two more non-matching lines
while (<INPUT>) { # parse them to ensure they don't match
$result .= $_; # pick up this line
last unless /$searchword/i; # reset counting if it matched
}
}
} else {
$prev2 = $prev1; # save last line as $prev2
$prev1 = $_; # save current line as $prev1
}
}
close INPUT;
exit 1 unless $result; # return with failure if without matches
$result =~ # add colors (okay to remove this line)
s/([^\e]{0,20})($searchword)([^\e]{0,20})/$blue$1$green$2$blue$3$nocolor/g;
print "$result"; # print the result
print "\n" unless $result =~ /\n\Z/m; # add newline if there wasn't already one
Bug: this assumes that the two lines before and the two lines after are actually 20+ characters. If you need to fix this, it goes in the else stanza.
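If the file is small enough to hold in memory, another way to get the surrounding lines is to read everything into an array and work with indices. The following is a rough sketch rather than a drop-in replacement for either answer above; the helper name lines_with_context and its parameters are invented for illustration, and it does not do the 20-character trimming.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of @$lines that match $searchword
# (case-insensitively), plus up to $context lines before and after each
# match, in their original order and without duplicates.
sub lines_with_context {
    my ($lines, $searchword, $context) = @_;
    my %keep;    # indices of the lines to return

    for my $i (0 .. $#$lines) {
        next unless $lines->[$i] =~ /\Q$searchword\E/i;

        # Clamp the window to the bounds of the array.
        my $first = $i - $context < 0        ? 0        : $i - $context;
        my $last  = $i + $context > $#$lines ? $#$lines : $i + $context;
        $keep{$_} = 1 for $first .. $last;
    }

    return map { $lines->[$_] } sort { $a <=> $b } keys %keep;
}
```

With the file read into @lines (for example my @lines = <$fh>;), print lines_with_context(\@lines, 'fun', 2); would print each match with its two neighbouring lines on either side.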

Removing newline character from a string in Perl

I have a string that is read from a text file, but in Ubuntu Linux, and I try to delete its newline character from the end.
I have tried everything I could think of. With s/\n|\r/-/ (to check whether it finds and replaces any newline character) the substitution does happen, but the output still goes to the next line when I print it. Moreover, when I use chomp or chop, the string is completely deleted. I could not find any other solution. How can I fix this problem?
use strict;
use warnings;
use v5.12;
use utf8;
use encoding "utf-8";
open(MYINPUTFILE, "<:encoding(UTF-8)", "file.txt");
my @strings;
my @fileNames;
my @erroredFileNames;
my $delimiter;
my $extensions;
my $id;
my $surname;
my $name;
while (<MYINPUTFILE>)
{
my ($line) = $_;
my ($line2) = $_;
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
#chop($line2);
$line2 =~ s/^\n+//;
print $line2 . " WRONG FORMAT!\n";
}
else {
#print "INSERTED:".$13."\n";
my($id) = $13;
my($name) = $2;
print $name . "\t" . $id . "\n";
unshift(@fileNames, $line2);
unshift(@strings, $line2 =~ /[^\W_]+/g);
}
}
close(MYINPUTFILE);
The correct way to remove Unicode linebreak graphemes, including CRLF pairs, is using the \R regex metacharacter, introduced in v5.10.
The use encoding pragma is strongly deprecated. You should either use the use open pragma, or use an encoding in the mode argument on 3-arg open, or use binmode.
use v5.10; # minimal Perl version for \R support
use utf8; # source is in UTF-8
use warnings qw(FATAL utf8); # encoding errors raise exceptions
use open qw(:utf8 :std); # default open mode, `backticks`, and std{in,out,err} are in UTF-8
while (<>) {
s/\R\z//;
...
}
You are probably experiencing a line ending from a Windows file causing issues. For example, a string such as "foo bar\n", would actually be "foo bar\r\n". When using chomp on Ubuntu, you would be removing whatever is contained in the variable $/, which would be "\n". So, what remains is "foo bar\r".
This is a subtle, but very common error. For example, if you print "foo bar\r" and add a newline, you would not notice the error:
my $var = "foo bar\r\n";
chomp $var;
print "$var\n"; # Remove and put back newline
But when you concatenate the string with another string, you overwrite the first string on screen, because \r moves the output position back to the beginning of the line. For example:
print "$var: WRONG\n";
It would effectively be "foo bar\r: WRONG\n", but the text after \r would cause the following text to wrap back on top of the first part:
foo bar\r # \r resets position
: WRONG\n # Second line prints and overwrites
This is more obvious when the first line is longer than the second. For example, try the following:
perl -we 'print "foo bar\rbaz\n"'
And you will get the output:
baz bar
The solution is to remove the bad line endings. You can do this with the dos2unix command, or directly in Perl with:
$line =~ s/[\r\n]+$//;
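To see the difference between chomp and the substitution side by side, here is a small sketch. The helper name strip_eol is made up for illustration, and it uses \z rather than $ so that only characters at the very end of the string are removed.

```perl
use strict;
use warnings;

# Hypothetical helper: remove any trailing CR/LF characters, whichever
# platform the file came from.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;
    return $line;
}

my $var = "foo bar\r\n";

# With the default $/ ("\n"), chomp removes only the newline and leaves
# the carriage return behind.
chomp(my $chomped = $var);

printf "chomp leaves %d characters, strip_eol leaves %d\n",
    length $chomped, length strip_eol($var);
```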
Also, be aware that your other code is somewhat horrific. What, for example, do you think $13 contains? It would be the string captured by the 13th pair of parentheses in your previous regular expression. I'm fairly sure that value will always be undefined, because you do not have 13 pairs of parentheses.
You declare two sets of $id and $name. One outside the loop and one at the top. This is very poor practice, IMO. Only declare variables within the scope they need, and never just bunch all your declarations at the top of your script, unless you explicitly want them to be global to the file.
Why use $line and $line2 when they have the same value? Just use $line.
And seriously, what is up with this:
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
That looks like an attempt to obfuscate, no offence. Three nested negations and a bunch of unnecessary parentheses?
First off, since it is an if-else, just swap the branches around and reverse the regular expression. Second, the double negation [^\W_] is rather confusing. Why not just use [A-Za-z0-9]? You can split this up to make it easier to parse:
if ($line =~ /^(.+)(\.docx)\s*$/) {
my $pre = $1;
my $ext = $2;
You can wipe the linebreaks with something like this:
$line =~ s/[\n\r]//g;
When you do that though, you'll need to change the regex in your if statement to not look for them. I also don't think you want a /g in your if. You really shouldn't have a $line2 either.
I also wouldn't do this type of thing:
print $line2." WRONG FORMAT!\n";
You can do
print "$line2 WRONG FORMAT!\n";
... instead. Also, print accepts a list, so instead of concatenating your strings, you can just use commas.
You can do something like:
$line =~ tr/\n//d;
But really chomp should work:
while (<filehandle>){
chomp;
...
}
Also s/\n|\r// only replaces the first occurrence of \r or \n. If you wanted to replace all occurrences you would want the global modifier at the end s/\r|\n//g.
Note: if you're including \r for windows it usually ends its line as \r\n so you would want to replace both (e.g. s/(?:\r\n|\n)//), of course the statement above (s/\r|\n//g) with the global modifier would take care of that anyways.
$variable = join('',split(/\n/,$variable))

In Perl, how can I print lines read from a file with optional leading whitespace removed?

#!/usr/bin/perl
open(SARAN,"first.txt") or die "Can't Open: $!\n";
while($line=<SARAN>)
{
print "$line\n";
}
close SARAN;
Hi,
In the above Perl script, I need one more piece of functionality.
In first.txt, each line starts with some spaces.
I need to print the lines without the leading spaces.
How can I do that?
Thanks.
Your question is ambiguous: Do you want to print the lines that do not start
with space(s) or print all the lines after removing any leading space(s)?
@codaddict showed how to do the latter. I will show how to do the former:
#!/usr/bin/perl
use strict;
use warnings;
open my $SARAN, '<', "first.txt"
or die "Can't open 'first.txt': $!";
while (my $line = <$SARAN>)
{
print $line unless $line =~ /^\s/;
}
close $SARAN;
Note the following:
use strict will help you catch programming errors.
use warnings will alert you to dubious constructs.
Bareword filehandles such as SARAN are package globals. Use lexical filehandles.
Prefer the three-argument form of open, especially if the filename is not hardcoded.
Include the filename in the error message.
Since you are not chomping $line, print "$line\n" would cause newlines to be doubled.
You can do:
while($line=<SARAN>) {
$line=~s/^\s+//; # delete the leading whitespace.
print "$line\n";
}
We use Perl's substitution operator s///, which finds a match using a regex and replaces the part of the string that matched with another string.
Here we use the regex ^\s+
^ is the start anchor.
\s is any whitespace character.
+ is the quantifier meaning one or more.
Basically, we match the leading run of one or more whitespace characters and replace it with nothing, which effectively deletes it.
#!/usr/bin/perl
open(SARAN,"first.txt") or die "Can't Open: $!\n";
@line=<SARAN>;
$str_line = join("", @line);
print "$str_line";   # print to STDOUT; SARAN is only open for reading
close SARAN;