How do I omit lines that contain Unicode NULL (U+0000)? - regex

I am reading a file and am wondering how to skip lines that have Unicode NULL, U+0000? I have tried everything below, but none works:
if($line)
chomp($line)
$line =~ s/\s*$//g;

Your list of "everything" does not seem to include the obvious $line =~ m/\000/.

Because you asked about Unicode NULL (identical to ASCII NUL when encoded in UTF-8), let’s use the \N{U+...} form, described in the perlunicode documentation.
Unicode characters can also be added to a string by using the \N{U+...} notation. The Unicode code for the desired character, in hexadecimal, should be placed in the braces, after the U. For instance, a smiley face is \N{U+263A}.
You can also match against \N{U+...} in regexes. See below.
#! /usr/bin/env perl
use strict;
use warnings;
my $contents =
"line 1\n" .
"\N{U+0000}\n" .
"foo\N{U+0000}bar\n" .
"baz\N{U+0000}\n" .
"\N{U+0000}quux\n" .
"last\n";
open my $fh, "<", \$contents or die "$0: open: $!";
while (defined(my $line = <$fh>)) {
next if $line =~ /\N{U+0000}/;
print $line;
}
Output:
$ ./filter-nulls
line 1
last

Perl strings can contain arbitrary data, including NUL characters. Your if only checks for true or false (where "" and "0" are the two false strings, everything else being true including a string containing a single NUL "\x00"). Your chomp only removes the line separator, not NULs. A NUL character is not whitespace, so doesn't match \s.
You can explicitly match a NUL character by specifying it in a regex using octal or hex notation ("\000" or "\x00", respectively).
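A minimal sketch of that approach, assuming the lines are already in a list. The helper name `drop_nul_lines` is made up for illustration and does not come from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the lines that contain no NUL character.
sub drop_nul_lines {
    my @lines = @_;
    return grep { !/\x00/ } @lines;   # /\000/ matches the same character
}

my @kept = drop_nul_lines("line 1\n", "foo\x00bar\n", "\x00\n", "last\n");
print @kept;   # prints "line 1" and "last", each on its own line
```

Note that a line consisting of "0" or an empty line is kept, because the test is the regex match, not the truth value of the string.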

Related

perl regex that captures substring between tick marks

I am trying to find a solution in perl that captures the filename in the following string -- between the tick marks.
my $str = "Saving to: ‘wapenc?T=mavodi-7-13b-2b-3-96-1e3431a’";
(my $results) = $str =~ /‘(.*?[^\\])‘/;
print $results if $results;
I need to end up with wapenc?T=mavodi-7-13b-2b-3-96-1e3431a
The final tick in your regex is different from the one in the input string: the string ends with char 8217 (RIGHT SINGLE QUOTATION MARK, U+2019), but the regex ends with char 8216 (LEFT SINGLE QUOTATION MARK, U+2018). Also, when using Unicode characters in the source, be sure to include
use utf8;
and save the file UTF-8 encoded.
After fixing these two issues, the code worked for me:
#! /usr/bin/perl
use warnings;
use strict;
use utf8;
my $str = "Saving to: ‘wapenc?T=mavodi-7-13b-2b-3-96-1e3431a’";
(my $results) = $str =~ /‘(.*?[^\\])’/;
print $results if $results;
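If you would rather not depend on how the source file is saved, one alternative (a sketch, not what the code above does) is to write the quotation marks as code point escapes. The helper name `between_quotes` is hypothetical, and the `[^\\]` part of the original pattern is left out for simplicity:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between U+2018 and U+2019, or undef.
sub between_quotes {
    my ($str) = @_;
    my ($inner) = $str =~ /\x{2018}(.*?)\x{2019}/;
    return $inner;
}

# The test string is built with escapes as well, so it holds decoded
# characters no matter which encoding this source file uses.
my $str = "Saving to: \x{2018}wapenc?T=mavodi-7-13b-2b-3-96-1e3431a\x{2019}";
print between_quotes($str), "\n";
```

This assumes the input has already been decoded into characters; raw UTF-8 bytes read from a file would not match these escapes.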
Your tick characters aren't in the 7-bit ASCII character set, so there is a whole character-encoding rabbit hole to go down here. But the quick and dirty solution is to capture everything in between extended characters.
($result) = $str =~ /[^\0-\x7f]+(.*?)[^\0-\x7f]/;
[^\0-\x7f] matches characters with values outside the range 0 to 127, i.e., anything that is not a 7-bit ASCII character. (Newlines, tabs, and other control characters are ASCII, so they are not matched.) This regular expression will work whether your input is UTF-8 encoded or has already been decoded, and may work for other character encodings, too.

How to remove the whitespaces in fasta file using perl?

My fasta file
>1a17_A a.118.8 TPR-like
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
Else try this http://www.ncbi.nlm.nih.gov/nuccore/?term=keratin for fasta files.
open(fas,'d:\a4.fas');
$s=<fas>;
@fasta = <fas>;
@r1 = grep{s/\s//g} @fasta; #It is not remove the white space
@r2 = grep{s/(\s)$//g} @fasta; #It is not working
@r3 = grep{s/.$//g} @fasta; #It is remove the last character, but not remove the last space
print "@r1\n@r2\n@r3\n";
This code gives the following output:
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRT
ECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAG
DEHKRSVVDSLDIES MTIEDEYS
I expect the whitespace to be removed from line two onward. How can I do it?
Using perl one liner,
perl -i -pe 's|[ \t]||g' a4.fas
removing all white spaces, including new lines,
perl -i -pe 's|\s||g' a4.fas
use strict;
use warnings;
while(my $line = <DATA>) {
$line =~ s/\s+//g;
print $line;
}
__DATA__
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
grep is the wrong choice to make changes to an array. It filters the elements of the input array, passing as output only those elements for which the expression in the braces { .. } is true.
A substitution s/// is true unless it made no changes to the target string, so of your grep statements,
@r1 = grep { s/\s//g } @fasta
This removes all spaces, including newlines, from the strings in @fasta. It puts in @r1 only those elements that originally contained whitespace, which is probably all of them as they all ended in newline.
@r2 = grep { s/(\s)$//g } @fasta
Because of the anchor $, this removes the character before the newline at the end of the string if it is a whitespace character. It also removes the newline. Any whitespace before the end of the string is untouched. It puts in @r2 only those elements that end in whitespace, which is probably all of them as they all ended in newline.
@r3 = grep { s/.$//g } @fasta;
This removes the character before the newline, whether it is whitespace or not. It leaves the newline, as well as any whitespace before the end. It puts in @r3 only those elements that contain more than just a newline, which again is probably all of them.
I think you want to retain the newlines (which are normally considered as whitespace).
This example will read the whole file, apart from the header, into the variable $data, and then use tr/// to remove spaces and tabs.
use strict;
use warnings;
use 5.010;
use autodie;
my $data = do {
open my $fas, '<', 'D:\a4.fas';
<$fas>; # Drop the header
local $/;
<$fas>;
};
$data =~ tr/ \t//d;
print $data;
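If you want the result as a list of lines rather than one string, the same tr///d can be applied to each element with a loop instead of grep. This is only a sketch; the helper name `strip_blanks` and the sample lines are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return copies of the lines with spaces and tabs
# removed. Newlines are kept and the caller's array is not modified.
sub strip_blanks {
    my @copy = @_;
    tr/ \t//d for @copy;    # tr///d deletes the listed characters in place
    return @copy;
}

my @fasta = ("PADGALKRAEEL KTQAND \n", "MTIE DEYS\t\n");
print strip_blanks(@fasta);
```

Unlike grep, the loop never drops an element: every line comes back, changed or not.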
Per perlrecharclass:
\h matches any character considered horizontal whitespace; this includes the platform's space and tab characters and several others listed in the table below. \H matches any character not considered horizontal whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.
Therefore the following will display your file with horizontal spacing removed:
perl -pe "s|\h+||g" d:\a4.fas
If you don't want to display the header, just add a condition with $.
perl -ne "s|\h+||g; print if $. > 1" d:\a4.fas
Note: I used double quotes in the above commands since your D:\ volume implies you're likely on Windows.

Perl substitute for white space substitutes at newline also

I have a text file, its content as follows:
a b
c
and I use the Perl code below to substitute an underscore ('_') wherever a space character appears in the input line:
while (<>) {
$_ =~ s/\s/_/;
print $_;
}
and I get output like this:
a_b
c_
So my question is: why does Perl also substitute an underscore in place of the newline character '\n', as the output for the input line containing 'c' shows?
When I use chomp in the code it works as expected.
\s matches all white space chars [ \t\r\n\f], so use space if you want to replace plain spaces
$_ =~ s/ /_/g;
# or just
s/ /_/g;
Translation could also be used for such simple substitutions, eg. tr/ /_/;
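A small sketch of the tr/// variant, wrapped in a hypothetical helper (`underscore_spaces` is not a name from the question) so the effect on both input lines is easy to see:

```perl
use strict;
use warnings;

# Hypothetical helper: replace each plain space with an underscore,
# leaving tabs and the trailing newline alone.
sub underscore_spaces {
    my ($line) = @_;
    (my $copy = $line) =~ tr/ /_/;
    return $copy;
}

print underscore_spaces("a b\n");   # a_b
print underscore_spaces("c\n");     # c
```

Because tr/ /_/ names the space character explicitly, the newline after "c" is never a candidate for replacement.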

Removing newline character from a string in Perl

I have a string that is read from a text file on Ubuntu Linux, and I am trying to delete the newline character from its end.
I have tried everything I could think of. With s/\n|\r/-/ (to check whether it finds and replaces any newline), the substitution happens, but the output still moves to the next line when I print it. Moreover, when I use chomp or chop, the string is completely deleted. I could not find any other solution. How can I fix this problem?
use strict;
use warnings;
use v5.12;
use utf8;
use encoding "utf-8";
open(MYINPUTFILE, "<:encoding(UTF-8)", "file.txt");
my @strings;
my @fileNames;
my @erroredFileNames;
my $delimiter;
my $extensions;
my $id;
my $surname;
my $name;
while (<MYINPUTFILE>)
{
my ($line) = $_;
my ($line2) = $_;
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
#chop($line2);
$line2 =~ s/^\n+//;
print $line2 . " WRONG FORMAT!\n";
}
else {
#print "INSERTED:".$13."\n";
my($id) = $13;
my($name) = $2;
print $name . "\t" . $id . "\n";
unshift(@fileNames, $line2);
unshift(@strings, $line2 =~ /[^\W_]+/g);
}
}
close(MYINPUTFILE);
The correct way to remove Unicode linebreak graphemes, including CRLF pairs, is using the \R regex metacharacter, introduced in v5.10.
The use encoding pragma is strongly deprecated. You should either use the use open pragma, or use an encoding in the mode argument on 3-arg open, or use binmode.
use v5.10; # minimal Perl version for \R support
use utf8; # source is in UTF-8
use warnings qw(FATAL utf8); # encoding errors raise exceptions
use open qw(:utf8 :std); # default open mode, `backticks`, and std{in,out,err} are in UTF-8
while (<>) {
s/\R\z//;
...
}
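To see what \R\z does with different line endings, here is a minimal sketch; the helper name `strip_eol` and the sample strings are made up for illustration:

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: strip one trailing line break of any kind.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\R\z//;    # \R treats "\r\n" as a single line break
    return $line;
}

for my $raw ("unix\n", "dos\r\n", "oldmac\r", "none") {
    printf "[%s]\n", strip_eol($raw);
}
```

Only the final line break is removed, so a string ending in two newlines keeps the first one.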
You are probably experiencing a line ending from a Windows file causing issues. For example, a string such as "foo bar\n", would actually be "foo bar\r\n". When using chomp on Ubuntu, you would be removing whatever is contained in the variable $/, which would be "\n". So, what remains is "foo bar\r".
This is a subtle, but very common error. For example, if you print "foo bar\r" and add a newline, you would not notice the error:
my $var = "foo bar\r\n";
chomp $var;
print "$var\n"; # Remove and put back newline
But when you concatenate the string with another string, you overwrite the first string, because \r moves the output position back to the beginning of the line. For example:
print "$var: WRONG\n";
It would effectively be "foo bar\r: WRONG\n", but the text after \r would cause the following text to wrap back on top of the first part:
foo bar\r # \r resets position
: WRONG\n # Second line prints and overwrites
This is more obvious when the first line is longer than the second. For example, try the following:
perl -we 'print "foo bar\rbaz\n"'
And you will get the output:
baz bar
The solution is to remove the bad line endings. You can do this with the dos2unix command, or directly in Perl with:
$line =~ s/[\r\n]+$//;
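To confirm that a stray carriage return is the culprit, it can help to print the string with its control characters made visible. This is only a debugging sketch, and `visible` is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: show CR and LF as the two-character
# sequences \r and \n so they can be seen on a terminal.
sub visible {
    my ($s) = @_;
    $s =~ s/\r/\\r/g;
    $s =~ s/\n/\\n/g;
    return $s;
}

my $var = "foo bar\r\n";
chomp $var;                   # removes "\n" only, since $/ is "\n"
print visible($var), "\n";    # prints: foo bar\r
```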
Also, be aware that your other code is somewhat horrific. What do you for example think that $13 contains? That'd be the string captured by the 13th parenthesis in your previous regular expression. I'm fairly sure that value will always be undefined, because you do not have 13 parentheses.
You declare two sets of $id and $name: one at the top of the script, outside the loop, and one inside the loop. This is very poor practice, IMO. Only declare variables within the scope they need, and never just bunch all your declarations at the top of your script, unless you explicitly want them to be global to the file.
Why use $line and $line2 when they have the same value? Just use $line.
And seriously, what is up with this:
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
That looks like an attempt to obfuscate, no offence. Three nested negations and a bunch of unnecessary parentheses?
First off, since it is an if-else, just swap it around and reverse the regular expression. Second, the double negation [^\W_] is rather confusing. Why not just use [A-Za-z0-9] (assuming the names are ASCII-only; in decoded text \w also matches non-ASCII letters and digits)? You can split this up to make it easier to parse:
if ($line =~ /^(.+)(\.docx)\s*$/) {
my $pre = $1;
my $ext = $2;
You can wipe the linebreaks with something like this:
$line =~ s/[\n\r]//g;
When you do that though, you'll need to change the regex in your if statement to not look for them. I also don't think you want a /g in your if. You really shouldn't have a $line2 either.
I also wouldn't do this type of thing:
print $line2." WRONG FORMAT!\n";
You can do
print "$line2 WRONG FORMAT!\n";
... instead. Also, print accepts a list, so instead of concatenating your strings, you can just use commas.
You can do something like:
=~ tr/\n//d
But really chomp should work:
while (<filehandle>){
chomp;
...
}
Also s/\n|\r// only replaces the first occurrence of \r or \n. If you wanted to replace all occurrences you would want the global modifier at the end s/\r|\n//g.
Note: if you're including \r because of Windows, remember that Windows usually ends its lines with \r\n, so you would want to replace both (e.g. s/(?:\r\n|\n)//). Of course, the statement above (s/\r|\n//g) with the global modifier takes care of that anyway.
$variable = join('',split(/\n/,$variable))

Removing CRLF (0D 0A) from string in Perl

I've got a Perl script which consumes an XML file on Linux, and occasionally there are CRLFs (hex 0D0A, DOS newlines) in some of the node values.
The system which produces the XML file writes it all as a single line, and it looks as if it occasionally decides that this is too long and writes a CRLF into one of the data elements. Unfortunately there's nothing I can do about the providing system.
I just need to remove these from the string before I process it.
I've tried all sorts of regex replacement using the perl char classes, hex values, all sorts and nothing seems to work.
I've even run the input file through dos2unix before processing and I still can't get rid of the erroneous characters.
Does anyone have any ideas?
Many Thanks,
Typical, After battling for about 2 hours, I solved it within 5 minutes of asking the question..
$output =~ s/[\x0A\x0D]//g;
Finally got it.
$output =~ tr/\x{d}\x{a}//d;
These are both whitespace characters, so if the terminators are always at the end, you can right-trim with
$output =~ s/\s+\z//;
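For the case described in the question, where the CRLF can turn up in the middle of a value, here is a sketch of the tr///d approach. The helper name `strip_crlf` and the sample XML fragment are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: delete every CR and LF, wherever it occurs.
sub strip_crlf {
    my ($text) = @_;
    $text =~ tr/\r\n//d;
    return $text;
}

my $xml = "<node>some va\r\nlue</node>\r\n";
print strip_crlf($xml), "\n";   # <node>some value</node>
```

A right-trim such as s/\s+\z// would leave the break inside the node value in place, which is why deleting the characters everywhere fits this input better.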
A few options:
1. Replace all occurrences of cr/lf with lf: $output =~ s/\r\n/\n/g; #instead of \r\n might want to use \015\012
2. Remove all trailing whitespace: $output =~ s/\s+$//g;
3. Slurp and split:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
sub main{
createfile();
outputfile();
}
main();
sub createfile{
(my $file = $0)=~ s/\.pl/\.txt/;
open my $fh, ">", $file;
print $fh "1\n2\r\n3\n4\r\n5";
close $fh;
}
sub outputfile{
(my $filei = $0)=~ s/\.pl/\.txt/;
(my $fileo = $0)=~ s/\.pl/out\.txt/;
open my $fin, "<", $filei;
local $/; # slurp the file
my $text = <$fin>; # store the text
my @text = split(/(?:\r\n|\n)/, $text); # split on dos or unix newlines
close $fin;
local $" = ", "; # change array scalar separator
open my $fout, ">", $fileo;
print $fout "@text"; # should output numbers separated by comma space
close $fout;
}