How to extract 2+ character words from string in perl - regex

I assume some sort of regex would be used to accomplish this?
I need to get it where each word consists of 2 or more characters, start with a letter, and the remaining characters consist of letters, digits, and underscores.
This is the code I currently have, although it isn't very close to my desired output:
while (my $line=<>) {
# remove leading and trailing whitespace
$line =~ s/^\s+|\s+$//g;
$line = lc $line;
#array = split / /, $line;
foreach my $a (#array){
$a =~ s/[\$##~!&*()\[\];.,:?^ `\\\/]+//g;
push(#list, "$a");
}
}
A sample input would be:
#!/usr/bin/perl -w
use strict;
# This line will print a hello world line.
print "Hello world!\n";
exit 0;
And the desired output would be (alphabetical order):
bin
exit
hello
hello
line
perl
print
print
strict
this
use
usr
will
world

my #matches = $string =~ /\b([a-z][a-z0-9_]+)/ig;
If case-insensitive operation need be applied only to a subpattern, can embed it
/... \b((?i)[a-z][a-z0-9_]+) .../
(or, it can be turned off after the subpattern, (?i)pattern(?-i))
That [a-zA-Z0-9_] goes as \w, a "word character", if that's indeed exactly what is needed.
The above regex picks words as required without a need to first split the line on space, done in the shown program. Can apply it on the whole line (or on the whole text for that matter), perhaps after the shown stripping of the various special characters.†
There is a question of some other cases -- how about hyphens? Apostrophes? Tilde? Those aren't found in identifiers, while this appears to be intended to process programming text, but comments are included; what other legitimate characters may there be?
Note on split-ing on whitespace
The shown split / /, $line splits on exactly that one space. Better is split /\s+/, $line -- or, better yet is to use split's special pattern split ' ', $line: split on any number of any consecutive whitespace, and where leading and trailing spaces are discarded.
† The shown example is correctly processed as desired by the given regex alone
use strict;
use warnings;
use feature 'say';
use Path::Tiny qw(path); # convenience, to slurp the file
my $fn = shift // die "Usage: $0 filename\n";
my #matches = sort map { lc }
path($fn)->slurp =~ /\b([a-z][a-z0-9_]+)/ig;
say for #matches;
I threw in sorting and lower-casing to match the sample code in the question but all processing is done with the shown regex on the file's content in a string.
Output is as desired (except that line and world here come twice, what is correct).
Note that lc can be applied on the string with the file content, which is then processed with the regex, what is more efficient. While this is in principle not the same in this case it may be
perl -MPath::Tiny -wE'$f = shift // die "Need filename\n";
#m = sort lc(path($f)->slurp) =~ /\b([a-z]\w+)/ig;
say for #m'
Here I actually used \w. Adjust to the actual character to match, if different.

Curiously, this can be done with one of those long, typical Perl one-liners
$ perl -lwe'print for sort grep /^\pL/ && length > 1, map { split /\W+/ } map lc, <>' a.txt
bin
exit
hello
hello
line
line
perl
print
print
strict
this
use
usr
will
world
world
Lets go through that and see what we can learn. This line reads from right to left.
a.txt is the argument file to read
<> is the diamond operator, reading the lines from the file. Since this is list context, it will exhaust the file handle and return all the lines.
map lc, short for map { lc($_) } will apply the lc function on all the lines and return the result.
map { split /\W+/ } is a multi-purpose operation. It will remove the unwanted characters (the non-word characters), and also split the line there, and return a list of all those words.
grep /^\pL/ && length > 1 sorts out strings that begin with a letter \pL and are longer than 1 and returns them.
sort sorts alphabetically the list coming in from the right and returns it left
for is a for-loop, applied to the incoming list, in the post-fix style.
print is short for print $_, and it will print once for each list item in the for loop.
The -l switch in the perl command will "fix" line endings for us (remove them from input, add them in output). This will make the print pretty at the end.
I won't say this will produce a perfect result, but you should be able to pick up some techniques to finish your own program.

Related

Perl: Trying to see if the Nth column of line X matches the Nth column of line X-1

I have a perl script that reads text file line by line and splits the line into 4 different columns (shown by dashes & referred to as $cols[0-3] in code; important parts are bolded). For each distinct value before the decimal point in column 0, it should randomly generate a hex color.
Essentially, I need to compare if the Xth column in the current line matches that of the previous line.
A----last_column----221----18
A----last_column----221----76
A----last_column----221----42
B----last_column----335----18
C----last_column----467----83
So far, I am randomly generating a new #random_hex_color for every line, but desired output is below:
221.18-------#EB23AE1-------#$some/random/path/A.txt-------last_column
221.76-------#EB23AE1-------#$some/random/path/A.txt-------last_column
221.42-------#EB23AE1-------#$some/random/path/A.txt-------last_column
335.18-------#AC16D6E-------#$some/random/path/B.txt-------last_column
467.83-------#FD89A1C-------#$some/random/path/C.txt-------last_column
[Image of input file and desired output][1]
my #cols;
my $row;
my $color = color_gen();
my $path = "\t#\some_random_path/";
my $newvar = dir_contents();
my #array = ($color, $path, $newvar);
my %hash;
while ($row = <$fh>){
next if $row =~ /^(#|\s|\t)/; #skip lines beginning with comments and spaces
#cols = split(" ", $row);
%hash = (
"$cols[2]" => ["$color", "$path", "$newvar"]
);
say Dumper (\%hash);
print("$cols[2].$cols[3]\t#");
print(color_gen());
printf("%-65s", $path.dir_contents());
print("\t\t$cols[0]_"."$cols[1]"." 1 1\n");
}
Use a hash to store, and thus be able to check for, the distinct values in the first column.
I assume that color_gen() returns a new random color at each invocation. The desired output is unclear to me so it is only indicated in the code.
use warnings;
use strict;
my $file = shift #ARGV;
die "Usage: $0 filename\n" if not $file or not -f $file;
open my $fh, '<', $file or die "Can't open $file: $!";
my %c0;
while (<$fh>) {
next if /^(?:\s*$|\s*#)/; # skip: spaces only or empty, comment
my #cols = split;
my ($num) = $cols[0] =~ /^([0-9]+)/;
if (not exists $c0{$num}) { # this number not seen yet; assign color
$c0{$num} = color_gen();
}
# write line of output, with $c0{$num} and #cols
}
The value "before the decimal point in column 0" is extracted using regex as the leading number in that string and stored in $num. The parens around are needed to provide the list context for the match operator, in which case it returns the captured values. See perlretut.
This number is stored as a key in a hash with its value being the generated color. Unless it already exists that is, in which case it has been seen and a color for it generated. This way you can keep track of distinct numbers in that column. Then you can write output using $c0{$num}.
This can be written far more compactly but I hoped for clarity.
The skipped lines here aren't those "beginning with comments and spaces" but are ones with only spaces (or empty), or comments. If you really mean to skip lines that merely start with whitespace (or #) then indeed use /^(?:\s|#)/, where ?: makes () only group and not capture.
A few comments on the code
Always have use warnings; and use strict; at the beginning of each program
The \s in regex matches most types of whitespace; no need for a separate pattern for tab
A variable can be declared right in the while condition, which makes it perfectly scoped -- to that loop. However, you can also omit it and use $_
If while condition has only the input read, such as <$fh>, then the value is assigned to $_ variable; also see I/O in perlop.
I use that here since then the regex is simpler (match on $_ by default) and so is split
The split without arguments has default of split ' ', $_;, where ' ' stands for any amount of any whitespace (and leading spaces are removed before splitting)
Please provide exact samples of input and desired output for a more complete example.

Repeating regex pattern

I have a string such as this
word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>
where, if there is one ore more words enclosed in tags. In those instances where there are more than one words (which are usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually so that the resulting string would be:
word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>
So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:
perl -pe 's///g;' in out
This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :
/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/
It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated?
EDIT:
This is part of a workflow that does some other transformations within a shell script:
#!/bin/sh
perl -pe '#
s/replace/me/g;
s/replace/me/g;
' $1 > tmp
... some other commands ...
This needs a mini nested-parser and I'd recommend a script, as easier to maintain
use warnings;
use strict;
use feature 'say';
my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>);
my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x; # / (stop markup highlighter)
$str =~ s{$tag_re}{
my ($o, $t, $c) = ($1, $2, $3); # open (tag), text, close (tag)
$t =~ s/(\w+)/$o$1$c/g;
$t;
}ge;
say $str;
The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code.
One way to provide input for a script is via command-line arguments, available in #ARGV global array in the script. For the use indicated in the question's "Edit" replace the hardcoded
my $str = q(...);
with
my $str = shift #ARGV; # first argument on the command line
and then use that script in your shell script as
#!/bin/sh
...
script.pl $1 > output_file
where $1 is the shell variable as shown in the "Edit" to the question.
In a one-liner
echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>" |
perl -wpe'
s{(<[^>]+>) (.+?) (</[^>]+>)}
{($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
'
what in your shell script becomes echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from #ARGV and drop the -n switch, and add a print
#!/bin/sh
...
perl -wE'$_=shift; ...; say' $1 > output_file
where ... in one-liner indicate the same code as above, and say is now needed since we don't have the -p with which the $_ is printed out once it's processed.
The shift takes an element off of an array's front and returns it. Without an argument it does that to #ARGV when outside a subroutine, as here (inside a subroutine its default target is #_).
This will do it:
s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word".
The s/ at the beginning is a substitution, so the command would be something like:
cat in | perl -pne 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;$_' > out
You need the $_ at the end because the result of the global substitution is the number of substitutions made.
This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like:
my $e = $in;
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
my $p = $1;
my $s = $2;
print($p);
$s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
print($s);
$e = $'; # ' (stop markup highlighter)
}
print($e);
You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.

How to remove the whitespaces in fasta file using perl?

My fasta file
>1a17_A a.118.8 TPR-like
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
Else try this http://www.ncbi.nlm.nih.gov/nuccore/?term=keratin for fasta files.
open(fas,'d:\a4.fas');
$s=<fas>;
#fasta = <fas>;
#r1 = grep{s/\s//g} #fasta; #It is not remove the white space
#r2 = grep{s/(\s)$//g} #fasta; #It is not working
#r3 = grep{s/.$//g} #fasta; #It is remove the last character, but not remove the last space
print "#r1\n#r2\n#r3\n";
These codes are give the outputs is:
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRT
ECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAG
DEHKRSVVDSLDIES MTIEDEYS
I expect Remove the whitespaces from line two and above the lines. How can i do it?
Using perl one liner,
perl -i -pe 's|[ \t]||g' a4.fas
removing all white spaces, including new lines,
perl -i -pe 's|\s||g' a4.fas
use strict;
use warnings;
while(my $line = <DATA>) {
$line =~ s/\s+//g;
print $line;
}
__DATA__
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
grep is the wrong choice to make changes to an array. It filters the elements of the input array, passing as output only those elements for which the expression in the braces { .. } is true.
A substitution s/// is true unless it made no changes to the target string, so of your grep statements,
#r1 = grep { s/\s//g } #fasta
This removes all spaces, including newlines, from the strings in #fasta. It puts in #r1 only those elements that originally contained whitespace, which is probably all of them as they all ended in newline.
#r2 = grep { s/(\s)$//g } #fasta
Because of the anchor $, this removes the character before the newline at the end of the string if it is a whitespace character. It also removes the newline. Any whitespace before the end of the string is untouched. It puts in #r2 only those elements that end in whitespace, which is probably all of them as they all ended in newline.
#r3 = grep { s/.$//g } #fasta;
This removes the character before the newline, whether it is whitespace or not. It leaves the newline, as well as any whitespace before the end. It puts in #r3 only those elements that contain more than just a newline, which again is probably all of them.
I think you want to retain the newlines (which are normally considered as whitespace).
This example will read the whole file, apart from the header, into the variables $data, and then use tr/// to remove spaces and tabs.
use strict;
use warnings;
use 5.010;
use autodie;
my $data = do {
open my $fas, '<', 'D:\a4.fas';
<$fas>; # Drop the header
local $/;
<$fas>;
};
$data =~ tr/ \t//d;
print $data;
Per perlrecharclass:
\h matches any character considered horizontal whitespace; this includes the platform's space and tab characters and several others listed in the table below. \H matches any character not considered horizontal whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.
Therefore the following will display your file with horizontal spacing removed:
perl -pe "s|\h+||g" d:\a4.fas
If you don't want to display the header, just add a condition with $.
perl -ne "s|\h+||g; print if $. > 1" d:\a4.fas
Note: I used double quotes in the above commands since your D:\ volume implies you're likely on Windows.

Perl, match one pattern multiple times in the same line delimited by unknown characters

I've been able to find similar, but not identical questions to this one. How do I match one regex pattern multiple times in the same line delimited by unknown characters?
For example, say I want to match the pattern HEY. I'd want to recognize all of the following:
HEY
HEY HEY
HEYxjfkdsjfkajHEY
So I'd count 5 HEYs there. So here's my program, which works for everything but the last one:
open ( FH, $ARGV[0]);
while(<FH>)
{
foreach $w ( split )
{
if ($w =~ m/HEY/g)
{
$count++;
}
}
}
So my question is how do I replace that foreach loop so that I can recognize patterns delimited by weird characters in unknown configurations (like shown in the example above)?
EDIT:
Thanks for the great responses thus far. I just realized I need one other thing though, which I put in a comment below.
One question though: is there any way to save the matched term as well? So like in my case, is there any way to reference $w (say if the regex was more complicated, and I wanted to store it in a hash with the number of occurrences)
So if I was matching a real regex (say a sequence of alphanumeric characters) and wanted to save that in a hash.
One way is to capture all matches of the string and see how many you got. Like so:
open (FH, $ARGV[0]);
while(my $w = <FH>) {
my #matches = $w =~ m/(HEY)/g;
my $count = scalar(#matches);
print "$count\t$w\n";
}
EDIT:
Yes, there is! Just loop over all the matches, and use the capture variables to increment the count in a hash:
my %hash;
open (FH, $ARGV[0]);
while (my $w = <FH>) {
foreach ($w =~ /(HEY)/g) {
$hash{$1}++;
}
}
The problem is you really don't want to call split(). It splits things into words, and you'll note that your last line only has a single "word" (though you won't find it in the dictionary). A word is bounded by white-space and thus is just "everything but whitespace".
What you really want is to continue to do look through each line counting every HEY, starting where you left off each time. Which requires the /g at the end but to keep looking:
while(<>)
{
while (/HEY/g)
{
$count++;
}
}
print "$count\n";
There is, of course, more than one way to do it but this sticks close to your example. Other people will post other wonderful examples too. Learn from them all!
None of the above answers worked for my similar problem. $1 does not seem to change (perl 5.16.3) so $hash{$1}++ will just count the first match n times.
To get each match, the foreach needs a local variable assigned, which will then contain the match variable. Here's a little script that will match and print each occurrence of (number).
#!/usr/bin/perl -w
use strict;
use warnings FATAL=>'all';
my (%procs);
while (<>) {
foreach my $proc ($_ =~ m/\((\d+)\)/g) {
$procs{$proc}++;
}
}
print join("\n",keys %procs) . "\n";
I'm using it like this:
pstree -p | perl extract_numbers.pl | xargs -n 1 echo
(except with some relevant filters in that pipeline). Any pattern capture ought to work as well.

How can I do a multi-line match on the data returned from Perl's diamond operator

Is there some trick to do multi-line regular expression matches with <>, and loop over them? This example results in no matches when run on files with \n as the newline separator:
while (<> =~ m/\n./) {
print($.);
}
I need to know the line of the start of the match inside the while loop, as in the example.
The goal is to find all lines which have less than 75 characters which are followed by a line starting with a space (the standard vCard way of splitting long lines):
while (<> =~ m/(^|\n).{0,74}\n /)
What are you tring to do in that regex? It looks like you are trying to find any case where a newline is followed by at least one character, and then that leads you to print the line number ($.) of whatever matches that criterion.
If you don't mind my asking, what's the larger purpose here?
In any case, see this article for a clear discussion of multiline matching: Regexp Power
Edited after the move to SO: If what you really want is to find the lines with less than 75 characters and a next line beginning with a space, I wouldn't use one regex. The description points to an easier and clearer (I think) solution: (1) filter out all lines with less than 75 characters (the length function is good for that). For the lines that remain, (2) check if the next line starts with a space. That gives you clear logic and an easy regex to write.
In response to the question about getting the "next" line. Think of it the other way around: you want to check every next line, but only if the previous line was less than 75 characters. So how about this:
my $prev = <>; # Initialize $prev with the first line
while (<>) {
# Add 1 to 75 for newline or chomp it perhaps?
if (length $prev < 76) {
print "$.: $_" if $_ =~ m/^\s/;
}
$prev = $_;
}
(Note that I don't know anything about vCard format and that \s is broader than literally "a single space." So you may need to adjust that code to fit your problem better.)
Did you remember to put the handle in multi-line mode by setting $/ to the empty string or the undefined value?
The following program does what you want:
#! /usr/bin/perl
use warnings;
use strict;
$/ = "";
*ARGV = *DATA;
while (<>) {
while (/^(.{0,75}\n(^[ \t].{1,75}\n)*)/mg) {
my $vcard = $1;
$vcard =~ s/\r?\n[ \t]//g;
print $vcard;
}
}
__DATA__
DESCRIPTION:This is a long description that exists on a long line.
DESCRIPTION:This is a long description
that exists on a long line.
DESCRIPTION:This is a long descrip
tion that exists o
n a long line.
Output:
$ ./try
DESCRIPTION:This is a long description that exists on a long line.
DESCRIPTION:This is a long description that exists on a long line.
DESCRIPTION:This is a long description that exists on a long line.
Do you have a file with arbitrary text mixed with vCards?
If all you have is a bunch of vCards in file and you want to parse them, there some vCard parsing modules on CPAN.
See, for example, Text::vCard, specifically Text::vCard::Addressbook.
Regarding,
while (<> =~ m/\n./) {
print($.);
}
This would indeed not match anything because of the simple fact that input is read line-by-line meaning there cannot be anything in $_ after the newline.
If there never be more than single continuation line following each line shorter than 76 characters, the following might fulfill the requirements:
#!/usr/bin/perl
use strict; use warnings;
for
(
my $this = <>, my $next = <>;
defined ($next = <>);
close ARGV if eof
)
{
printf "%s : %d\n", $ARGV, $. - 1 if 76 > length $this and $next =~ /^ /;
}