Perl substitute for white space substitutes at newline also - regex

I have a text file, its content as follows:
a b
c
and I use the Perl code below to substitute an underscore '_' character wherever a space character appears in the input line:
while (<>) {
$_ =~ s/\s/_/;
print $_;
}
and I get output like this:
a_b
c_
So my question is why Perl substitutes underscore in the place of newline '\n' char too which is evident from the input line which contains 'c'?
When I use chomp in the code it works as expected.

\s matches any whitespace character, [ \t\r\n\f], which includes the newline at the end of each line. Use a literal space if you want to replace plain spaces only:
$_ =~ s/ /_/g;
# or just
s/ /_/g;
Translation could also be used for such simple substitutions, eg. tr/ /_/;
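A minimal sketch of the difference, applied to the two sample lines from the question. The helper name `variants` is made up for illustration, and `s///r` assumes Perl 5.14 or later:

```perl
use strict;
use warnings;

# Hypothetical helper: apply each variant to its own copy of the line.
sub variants {
    my ($line) = @_;
    (my $first_ws = $line) =~ s/\s/_/;     # first whitespace character only
    (my $all_ws   = $line) =~ s/\s/_/g;    # all whitespace, newline included
    (my $spaces   = $line) =~ s/ /_/g;     # plain spaces only
    (my $by_tr    = $line) =~ tr/ /_/;     # same effect as the line above
    return ($first_ws, $all_ws, $spaces, $by_tr);
}

for my $line ("a b\n", "c\n") {
    # Show newlines as \n so they stay visible in the output.
    print join(' | ', map { s/\n/\\n/gr } variants($line)), "\n";
}
```

For the line that contains only c, the two \s variants turn the newline into an underscore, while the literal-space variants leave it alone.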

Related

perl match consecutive newlines: `echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"`

This works:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n/z/gm"
aaazzzbbbz
This doesn't match anything:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"
aaa
bbb
How do I fix, so the regex matches two consecutive newlines?
A linefeed is matched by \n
echo "a\n\nb" | perl -pe's/\n/z/'
This prints azzbz, without a trailing newline, so the next prompt appears on the same line. Note that the program is fed one line at a time, so there is no need for the /g modifier (and that is also why \n\n doesn't match). The /m modifier is likewise unrelated to this example.†
I don't know in what form this is used but I'd imagine not with echo feeding the input? Then better test it with input in a file, or in a multi-line string (in which case /g may be needed).
An example
use warnings;
use strict;
use feature 'say';
# Test with multiline string
my $ml_str = "a\n\nb\n";
$ml_str =~ s/\n/z/g; #--> azzbz (no newline at the end)
print $ml_str;
say ''; # to terminate the line above
# Or to replace two consecutive newlines (everywhere)
$ml_str = "a\n\nb\n"; # restore the example string
$ml_str =~ s/\n\n/z/g; #--> azb\n
print $ml_str;
# To replace the consecutive newlines in a file read it into a string
# (the data below has two empty lines between "two" and "last")
my $file = join '', <DATA>; # lines of data after __DATA__
$file =~ s/\n\n/z/g;
print $file;
__DATA__
one
two


last
This prints
azzbz
azb
one
twoz
last
As a side note, I'd like to mention that with the modifier /s the . matches a newline as well. (For example, this is handy for matching substrings that may contain newlines by .* (or .+); without /s modifier that pattern stops at a newline.)
See perlrebackslash and search for newline.
† The /m modifier makes ^ and $ also match at the beginning and end of lines inside a multi-line string. Then
$multiline_string =~ s/$/z/mg;
will add z at the end of every line inside the string. However, this example has a subtlety: $ is a zero-width assertion, so the newlines themselves stay.
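A small sketch of what the footnote's substitution does to the sample string used earlier in this answer. The variable names `$marked` and `$visible` are only for illustration:

```perl
use strict;
use warnings;

my $multiline_string = "a\n\nb\n";

# With /m, $ matches before every newline and at the very end of the
# string; it consumes nothing, so z is inserted and no newline is removed.
(my $marked = $multiline_string) =~ s/$/z/mg;

# Make the newlines visible for printing.
(my $visible = $marked) =~ s/\n/\\n/g;
print "$visible\n";    # az\nz\nbz\nz
```

To actually remove the newlines, match \n itself, as in the examples above.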
You are applying substitution to only one line at a time, and one line will never have two newlines. Apply the substitution to the entire file instead:
perl -0777 -pe 's/\n\n/z/g'
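Inside a script, roughly the same effect as -0777 can be had by undefining the input record separator before reading. This is a sketch only: the helper name `slurp` is hypothetical, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle in one go.
sub slurp {
    my ($fh) = @_;
    local $/;                 # undef record separator: no splitting into lines
    my $content = <$fh>;
    return $content;
}

# An in-memory filehandle stands in for a real file here.
my $input = "aaa\n\n\nbbb\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
my $text = slurp($fh);
close $fh;

$text =~ s/\n\n/z/g;          # adjacent newlines are now in the same string
print $text;
```

With three newlines in a row, the first two become z and the third is left in place.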

Perl regular expression that only keeps characters until first newline

When I try with a regular character instead of a newline such as X, the following works:
my $str = 'fooXbarXbaz';
$str =~ s/X.*//;
print $str;
→ foo
However, when the character to look for is \n, the code above fails:
my $str = "foo\nbar\nbaz";
$str =~ s/\n.*//;
print $str;
→ foo
baz
Why is this happening? It looks like . only matches b, a, and r, but not the second \n and the rest of the string.
The . metacharacter does not match a newline unless you add the /s switch. This way it matches everything after the newline even if there is another newline.
$str =~ s/\n.*//s;
Another way to do this is to match a character class that matches everything, such as digits and non-digits:
$str =~ s/\n[\d\D]*//;
Or, an alternation of a newline and a non-newline (\N is not allowed inside a bracketed character class, so [\n\N] would not compile):
$str =~ s/\n(?:\n|\N)*//;
I'd be tempted to do this with simpler string operations. Split into lines but only save the first one:
$str = ( split /\n/, $str, 2 )[0];
Or, a substring up to the first newline:
$str = substr $str, 0, index($str, "\n");
And, I'm not recommending this, but sometimes I'll open a file handle on a reference to a scalar so I can read its contents line by line:
open my $string_fh, '<', \ $str;
my $line = <$string_fh>;
close $string_fh;
say $line;
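A short sketch comparing the regex with and without /s, plus a guarded version of the substr approach. The helper name `first_line` is made up; the guard is there because index returns -1 when the string has no newline, and substr with a length of -1 would then drop the last character.

```perl
use strict;
use warnings;

# Hypothetical helper: keep everything up to the first newline.
sub first_line {
    my ($str) = @_;
    my $pos = index $str, "\n";
    return $pos < 0 ? $str : substr($str, 0, $pos);
}

my $str = "foo\nbar\nbaz";

(my $without_s = $str) =~ s/\n.*//;     # . stops at the second newline
(my $with_s    = $str) =~ s/\n.*//s;    # /s lets . match newlines too

print first_line($str), "\n";           # foo
print first_line("no newline"), "\n";   # no newline
```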

Perl regex multiline match without dot

There are numerous questions on how to do a multiline regex in Perl. Most of them mention the s switch that makes a dot match a newline. However, I want to match an exact phrase (so, not a pattern) and I don't know where the newlines will be. So the question is: can you ignore newlines, instead of matching them with .?
MWE:
$pattern = "Match this exact phrase across newlines";
$text1 = "Match\nthis exact\nphrase across newlines";
$text2 = "Match this\nexact phra\nse across\nnewlines";
$text3 = "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match";
$text1 =~ s/$pattern/replacement text/s;
$text2 =~ s/$pattern/replacement text/s;
$text3 =~ s/$pattern/replacement text/s;
print "$text1\n---\n$text2\n---\n$text3\n";
I can put dots in the pattern instead of spaces ("Match.this.exact.phrase") but that does not work for the second example. I can delete all newlines as preprocessing but I would like to keep newlines that are not part of the match (as in the third example).
Desired output:
replacement text
---
replacement text
---
Keep any newlines
replacement text
outside
of the match
Just replace the literal spaces with a character class that matches a space or a newline:
$pattern = "Match[ \n]this[ \n]exact[ \n]phrase[ \n]across[ \n]newlines";
Or, if you want to be more lenient, use \s or \s+ instead, since \s also matches newlines.
Most of the time, you are treating newlines as spaces. If that's all you wanted to do, all you'd need is
$text =~ s/\n/ /g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Then there's the one time you want to ignore it. If that's all you wanted to do, all you'd need is
$text =~ s/\n//g;
$text =~ /\Q$text_to_find/ # or $text =~ /$regex_pattern_to_match/
Doing both is next to impossible if you have a regex pattern to match. But you seem to want to match literal text, so that opens up some possibilities.
( my $pattern = $text_to_find )
=~ s/(.)/ $1 eq " " ? "[ \\n]" : "\\n?" . quotemeta($1) /seg;
$pattern =~ s/^\\n\?//;
$text =~ /$pattern/
It sounds like you want to change your "exact" pattern to match newlines anywhere, and also to allow newlines instead of spaces. So change your pattern to do so:
$pattern = "Match this exact phrase across newlines";
$pattern =~ s/\S\K\B/\n?/g;
$pattern =~ s/ /[ \n]/g;
It certainly is ugly, but it works:
M\n?a\n?t\n?c\n?h\st\n?h\n?i\n?s\se\n?x\n?a\n?c\n?t\sp\n?h\n?r\n?a\n?s\n?e\sa\n?c\n?r\n?o\n?s\n?s\sn\n?e\n?w\n?l\n?i\n?n\n?e\n?s
For every pair of letters inside a word, allow a newline between them with \n?. And replace each space in your regex with \s.
May not be usable, but it gets the job done ;)
Check it out at regex101.
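The answers above all build such a pattern by hand or with a substitution. As a rough sketch of the same idea, the hypothetical helper `phrase_regex` below builds it character by character from the literal phrase: a space may be a space or a newline, and an optional newline is allowed between two adjacent non-space characters. It assumes the phrase is literal text, not a regex.

```perl
use strict;
use warnings;

# Hypothetical helper: build a newline-tolerant regex from a literal phrase.
sub phrase_regex {
    my ($phrase) = @_;
    my $pattern       = '';
    my $prev_was_char = 0;
    for my $ch (split //, $phrase) {
        if ($ch eq ' ') {
            $pattern .= '[ \n]';                      # space or newline
            $prev_was_char = 0;
        }
        else {
            $pattern .= '\n?' if $prev_was_char;      # optional line break
            $pattern .= quotemeta $ch;                # literal character
            $prev_was_char = 1;
        }
    }
    return qr/$pattern/;
}

my $re    = phrase_regex('Match this exact phrase across newlines');
my @texts = (
    "Match\nthis exact\nphrase across newlines",
    "Match this\nexact phra\nse across\nnewlines",
    "Keep any newlines\nMatch this exact\nphrase across newlines\noutside\nof the match",
);
for my $text (@texts) {
    (my $copy = $text) =~ s/$re/replacement text/;
    print "$copy\n---\n";
}
```

Newlines outside the matched phrase are untouched, which is what the third example in the question asks for.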

Match multiple line string in Perl

I'm new to Perl and I was wondering if someone can help me.
I have an input like this:
a,b,
c,d,e,f,g,h,
i,j,q // Letras
I'm trying to get the letters before // separately and then print them between {} separated by :.
I tried with this RE ([\w,;:\s\t]*)(\n|\/\/)/m and I could get in $1 all letters for each line (as a string including separators) but not what I want.
I need to match that pattern more than one time in the same file so I was using /g.
Edit:
Here is my code block:
while ( <> ) {
if ( /([\w,;:\s\t]*)(\n|\/\/)/m ) {
print "$1\n";
}
}
/m is for using ^, and $ to match by line in a string with multiple lines.
On the other hand, you are reading the input line by line. You cannot expect to match across lines with a single expression if you only look at one line at a time.
Instead, read by chunks by setting $/ to an appropriate value. If the chunks always end in the exact string "// Letras\n\n", the task is even simpler.
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = '//';
while (my $chunk = <DATA>) {
chomp $chunk;
my @fields = ($chunk =~ /([a-z])[, ]/g);
next unless @fields;
printf "{%s}\n", join(':', @fields);
}
__DATA__
a,b,
c,d,e,f,g,h,
i,j,q // Letras
a,b,
c,d,e,f,g,h,
i,j,q // Metras
Output:
{a:b:c:d:e:f:g:h:i:j:q}
{a:b:c:d:e:f:g:h:i:j:q}
You can also use File::Stream:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Stream;
my $stream = File::Stream->new(
\*DATA,
separator => qr{ (?: \s+ // [^\n]+ ) \n\n }x
);
while (my $chunk = <$stream>) {
$chunk =~ s{ \s+ // .* \z }{}sx;
$chunk =~ s{ ,\n? }{:}gx;
print "{$chunk}\n";
}
__DATA__
a,b,
c,d,e,f,g,h,
i,j,q // Letras
a,b,
c,d,e,f,g,h,
i,j,q // Metras
I think what you're aiming for is to remove comments (denoted by a double slash) from each line, and print it out enclosed by braces, with a colon : separator instead of commas.
First of all you should remove the trailing linefeed character from each line using chomp.
Then all you need to remove any trailing comment is s|\s*//.*||. That removes any spaces before the // as well. I'm using a pipe character | as the delimiter so as to avoid having to escape the slashes within the regex pattern. And the data is being processed one line at a time, so there is no need for the global /g modifier.
This program reads from the file specified on the command line, which I've set up to contain the data you show in the question.
use strict;
use warnings;
while ( <> ) {
chomp;
s|\s*//.*||;
print "{$_}\n";
}
output
{a,b,}
{c,d,e,f,g,h,}
{i,j,q}
Update
Thanks to Sinan Ünür's solution I notice that you've asked to "print [the letters] between {} separated by :"
This is a modification of the while loop above, which finds all substrings within the current line that don't contain commas, and joins them together again using colons :
while ( <> ) {
chomp;
s|\s*//.*||;
my $values = join ':', /[^,]+/g;
print "{$values}\n";
}
output
{a:b}
{c:d:e:f:g:h}
{i:j:q}
I am sure the true solution is much more simple, but unless you elaborate your question we have to cater for all possibilities
Are you looking to combine the letters on all 3 lines into the output, or convert each line?
In other words, is your desired output
{a:b}
{c:d:e:f:g:h}
{i:j:q}
or
{a:b:c:d:e:f:g:h:i:j:q}
?
If you want the former, Borodin's answer works.
If you want the latter, then you should collect the letters into an array and print them with a single join. To do that, I've modified Borodin's answer:
my @values; # letters from all lines
while ( <> ) { # read each line
chomp; # remove \n from line
s|\s*//.*||; # remove comment
push @values, /[^,]+/g; # store letters in array
}
my $values = join ':', @values; # convert array to string
print "{$values}\n"; # print the results
my $str = "a,b,
c,d,e,f,g,h,
i,j,q // Letras";
$str = join "",map {s/,/:/g ;(split)[0]} split '\n', $str;
print "{$str}";
Sample output
{a:b:c:d:e:f:g:h:i:j:q}
Here I treat the input as a single string whose lines are separated by newline characters.
join "",map {s/,/:/g ;(split)[0]} split '\n', $str
This is evaluated from right to left.
split '\n', $str produces 3 elements, which are the input for map.
Inside map, each , is first replaced with :. The trailing commas on the first two lines become the colons that link the lines together.
(split)[0]: the default delimiter for split is whitespace, so each element is split on whitespace and only element 0 is kept; the others are discarded.
For example, the line i,j,q // Letras (after the substitution, i:j:q // Letras) is split into 3 elements, of which only element 0, the letters, is kept.
join combines all the resulting elements from map.

How to remove the whitespaces in fasta file using perl?

My fasta file
>1a17_A a.118.8 TPR-like
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
Else try this http://www.ncbi.nlm.nih.gov/nuccore/?term=keratin for fasta files.
open(fas,'d:\a4.fas');
$s=<fas>;
@fasta = <fas>;
@r1 = grep{s/\s//g} @fasta; # This does not remove the whitespace
@r2 = grep{s/(\s)$//g} @fasta; # This does not work
@r3 = grep{s/.$//g} @fasta; # This removes the last character, but not the last space
print "@r1\n@r2\n@r3\n";
This code gives the following output:
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRT
ECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAG
DEHKRSVVDSLDIES MTIEDEYS
I expect the whitespace to be removed from line two onwards. How can I do it?
Using perl one liner,
perl -i -pe 's|[ \t]||g' a4.fas
or, to remove all whitespace, including newlines,
perl -i -pe 's|\s||g' a4.fas
use strict;
use warnings;
while(my $line = <DATA>) {
$line =~ s/\s+//g;
print $line;
}
__DATA__
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
grep is the wrong choice to make changes to an array. It filters the elements of the input array, passing as output only those elements for which the expression in the braces { .. } is true.
A substitution s/// is true unless it made no changes to the target string, so of your grep statements,
@r1 = grep { s/\s//g } @fasta
This removes all whitespace, including newlines, from the strings in @fasta. It puts in @r1 only those elements that originally contained whitespace, which is probably all of them as they all ended in newline.
@r2 = grep { s/(\s)$//g } @fasta
Because of the anchor $, this removes the character before the newline at the end of the string if it is a whitespace character. It also removes the newline. Any whitespace before the end of the string is untouched. It puts in @r2 only those elements that end in whitespace, which is probably all of them as they all ended in newline.
@r3 = grep { s/.$//g } @fasta;
This removes the character before the newline, whether it is whitespace or not. It leaves the newline, as well as any whitespace before the end. It puts in @r3 only those elements that contain more than just a newline, which again is probably all of them.
I think you want to retain the newlines (which are normally considered as whitespace).
This example will read the whole file, apart from the header, into the variable $data, and then use tr/// to remove spaces and tabs.
use strict;
use warnings;
use 5.010;
use autodie;
my $data = do {
open my $fas, '<', 'D:\a4.fas';
<$fas>; # Drop the header
local $/;
<$fas>;
};
$data =~ tr/ \t//d;
print $data;
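If you would rather keep the lines in an array, as in the question, the point about grep can be sketched like this: grep filters, while map builds a new list. The sample data is made up, and s///r assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Made-up sample lines standing in for the sequence lines of the file.
my @fasta = ("PAD GAL\n", "MTIEDEYS\n");

# grep only filters: it returns the elements for which the block is true.
my @with_space = grep { / / } @fasta;

# map builds a new list; s///r returns the changed copy and leaves the
# original element alone. Only spaces and tabs go, newlines are kept.
my @cleaned = map { s/[ \t]//gr } @fasta;

print @cleaned;
```

Here @with_space holds just the first line, @cleaned holds both lines without the space, and @fasta itself is unchanged.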
Per perlrecharclass:
\h matches any character considered horizontal whitespace; this includes the platform's space and tab characters and several others listed in the table below. \H matches any character not considered horizontal whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.
Therefore the following will display your file with horizontal spacing removed:
perl -pe "s|\h+||g" d:\a4.fas
If you don't want to display the header, just add a condition with $.
perl -ne "s|\h+||g; print if $. > 1" d:\a4.fas
Note: I used double quotes in the above commands since your D:\ volume implies you're likely on Windows.