How can I extract abbreviations from a file using Perl?

How can I extract abbreviations from a file using Perl? - regex

I need to extract certain Abbreviations from a file such as ABS,TVS,and PERL. Any abbreviations which are in uppercase letters. I'd preferably like to do this with a regular expression. Any help is appreciated.

It would have been nice to hear what part you were particularly having trouble with.
my %abbr;
open my $inputfh, '<', 'filename'
or die "open error: $!\n";
while ( my $line = readline($inputfh) ) {
while ( $line =~ /\b([A-Z]{2,})\b/g ) {
$abbr{$1}++;
}
}
for my $abbr ( sort keys %abbr ) {
print "Found $abbr $abbr{$abbr} time(s)\n";
}

Reading text to be searched from standard input and writing
all abbreviations found to standard output, separated by spaces:
my $text;
# Slurp all text
{ local $/ = undef; $text = <>; }
# Extract all sequences of 2 or more uppercase characters
my #abbrevs = $text =~ /\b([[:upper:]]{2,})\b/g;
# Output separated by spaces
print join(" ", #abbrevs), "\n";
Note the use of the POSIX character class [:upper:], which will match
all uppercase characters, not just English ones (A-Z).

Untested:
my %abbr;
open (my $input, "<", "filename")
|| die "open: $!";
for ( < $input > ) {
while (s/([A-Z][A-Z]+)//) {
$abbr{$1}++;
}
}
Modified it to look for at least two consecutive capital letters.

#!/usr/bin/perl
use strict;
use warnings;
my %abbrs = ();
while(<>){
my #words = split ' ', $_;
foreach my $word(#words){
$word =~ /([A-Z]{2,})/ && $abbrs{$1}++;
}
}
# %abbrs now contains all abreviations

Related

A non-greedy Perl regular expression

I need to write a script which does the following:
$ cat testdata.txt
this is my file containing data
for checking pattern matching with a patt on the back!
only one line contains the p word.
$ ./mygrep5 pat th testdata.txt
this is my file containing data
for checking PATTERN MATCHING WITH a PATT ON THe back!
only one line contains the p word.
I have been able to print the line which is amended with the "a" capitalized as well. I have no idea how to only take what is needed.
I have been messing around (below is my script so far) and all I manage to return is the "PATT ON TH" part.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Data::Dump 'pp';
my ($f, $s, $t) = #ARGV;
my #output_lines;
open(my $fh, '<', $t);
while (my $line = <$fh>) {
if ($line =~ /$f/ && $line =~ /$s/) {
$line =~ s/($f.+?$s)/$1/g;
my $sub_phrase = uc $1;
$line =~ s/$1/$sub_phrase/g;
print $line;
}
#else {
# print $line;
#}
}
close($fh);
which returns: "for checking pattern matching with a PATT ON THe back!"
How can I fix this problem?

It sounds like you want to capitalize from pat to th except for instances of a surrounded by spaces. The easiest way is to uppercase the whole thing, and then fix any instances of A surrounded by spaces.
sub capitalize {
my $s = shift;
my $uc = uc($s);
$uc =~ s/ \s \K A (?=\s) /a/xg;
return $uc;
}
s{ ( \Q$f\E .* \Q$s\E ) }{ capitalize($1) }xseg;
The downside is that will replacing any existing A surrounded by spaces with a. The following is more complicated, but it doesn't suffer from that problem:
sub capitalize {
my $s = shift;
my #parts = $s =~ m{ \G ( \s+ | \S+ ) }xg;
for (#parts) {
$_ = uc($_) if $_ ne "a";
}
return join('', #parts);
}
s{ ( \Q$f\E .* \Q$s\E ) }{ capitalize($1) }xseg;
The rest of the code can be simplified:
#!/usr/bin/perl
use strict;
use warnings;
sub capitalize { ... }
my $f = shift;
my $s = shift;
while (<>) {
s{ ( \Q$f\E .* \Q$s\E ) }{ capitalize($1) }xseg;
print;
}

So, if you want to match each sequence that starts with pat and ends with th, non-greedily, and uppercase that sequence, you can simply use an expression on the right side of your substitution:
$line =~ s/($f.+?$s)/uc($1)/eg;
And that's it.

Search for a pattern in a file

I have a file eg.txt with contents of this sort :
....text...
....text...
COMP1 = ../../path1/path2/path3
COMP2 = ../../path4/path5/path6
and so on, for a large number of application names (the "COMP"s). I need to get the path -- the stuff including and after the second slash -- for a user-specified application.
This is the code I've been trying :
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
$app = <STDIN>;
print $app;
open my $fh, '<', "eg.txt" or die "Cannot open $!";
while (<$fh>) {
$line = <$fh>;
if ( $line && $line =~ /($app)( = )(..\/)(..)(.*)/ ) {
print $5;
}
}
This prints the name of the user-input application, and does nothing else. Any help would be greatly appreciated!

There are two main problems with your program
The $app variable contains a newline at the end from the enter key you pressed when you typed it in. That will prevent the pattern from matching so you need to use chomp to remove it. The same applies to lines read from your file
The <$fh> in your while statement reads a line from your file into the default variable $_, and then $line = <$fh> reads another, so you are ignoring alternate lines from the file
Here is a version of your program that I think should work although I am unable to test it at present. I have dropped your $line variable altogether and hope that doesn't confuse you. $_ is the default variable for the pattern match so it isn't mentioned explicitly anywhere
use strict;
use warnings;
print "Enter the app: ";
my $app = <STDIN>;
chomp $app;
open my $fh, '<', 'eg.txt' or die "Cannot open: $!";
while ( <$fh> ) {
if ( /$app\s*=\s*(.+)/ ) {
my $path = $1;
$path =~ s/.*\.\.//;
print $path, "\n";
}
}

The input did not matched in regex because newlines were coming along with them, so better use chomp to trim them. In while loop you are displacing two times the file handle, I don't know why. So after corrections this should work:
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
chomp($app = <STDIN>);
print "$app: ";
open my $fh, '<', "eg.txt" or die "Cannot open $!";
while($line = <$fh>)
{
chomp $line;
if($line && $line =~ /($app)( = )(..\/)(..)(.*)/)
{
print "$5 \n";
}
}
close($fh);

Try this code:
use strict;
use warnings;
my $line = "";
my $app = "";
print "Enter the app";
$app = <STDIN>;
print $app;
open my $fh, '<', "eg.txt" or die "Cannot open $!";
my #line = <$fh>;
my #fetch = map { /COMP\d+\s\=\s(\..\/\..\/.*)/g } #line ;
$, = "\n";
print #fetch;
and then please send your response.

You are accessing <$fh> twice in your loop. This will have the effect of interpreting only every other line. You might want to change the top of the loop to something like this:
while (defined(my $line = <$fh>)) {
and remove the my $line ... at the top of the program.
Also, you might want to consider chomping your input line so that you don't have to think about the trailing newline character:
while (defined(my $line = <$fh>)) {
chomp $line;
Your regular expression is also a bit dicey. You probably want to bind it to the beginning and end of the search space and escape the literal dots. You may also want $app to be interpreted as a string rather than a regexp, which can be done by wrapping it with \Q...\E. Also unless your file format specifies single spaces around the equals, I'd be tempted to make those flexible to zero or more occurrences. Also, if you aren't going to use the earlier captures, I would say don't do them, so:
if ($line && $line =~ /^\Q$app\E *= *\.\.\/\.\.(.*)$/)
{
print $1;
(Some may say you should use \A and \z rather than ^ and $. That choice is left as an exercise to the reader.)

Perl parsing JavaScript file regex, to catch quotes only at the beginning and end of the returned string

I'm just starting to learn Perl. I need to parse JavaScript file. I came up with the following subroutine, to do it:
sub __settings {
my ($_s) = #_;
my $f = $config_directory . "/authentic-theme/settings.js";
if ( -r $f ) {
for (
split(
'\n',
$s = do {
local $/ = undef;
open my $fh, "<", $f;
<$fh>;
}
)
)
{
if ( index( $_, '//' ) == -1
&& ( my #m = $_ =~ /(?:$_s\s*=\s*(.*))/g ) )
{
my $m = join( '\n', #m );
$m =~ s/[\'\;]//g;
return $m;
}
}
}
}
I have the following regex, that removes ' and ; from the string:
s/[\'\;]//g;
It works alright but if there is a mentioned chars (' and ;) in string - then they are also removed. This is undesirable and that's where I stuck as it gets a bit more complicated for me and I'm not sure how to change the regex above correctly to only:
Remove only first ' in string
Remove only last ' in string
Remove ont last ; in string if exists
Any help, please?

You can use the following to match:
^'|';?$|;$
And replace with '' (empty string)
See DEMO

Remove only first ' in string
Remove only last ' in string
^[^']*\K'|'(?=[^']*$)
Try this .See demo.
https://regex101.com/r/oF9hR9/8
Remove ont last ; in string if exists
;(?=[^;]*$)
Try this.See demo.
https://regex101.com/r/oF9hR9/9
All three in one
^[^']*\K'|'(?=[^']*$)|;(?=[^;]*$)
See Here

You can use this code:
#!/usr/bin/perl
$str = "'string; 'inside' another;";
$str =~ s/^'|'?;?$//g;
print $str;
IDEONE demo
The main idea is to use anchors: ^ beginning of string, $ end of string and ;? matches the ";" symbol at the end only if it is present (? quantifier is making the pattern preceding it optional).EDIT: Also, ; will get removed even if there is no preceding '.

I suggest that your original code should look more like this. It is much more idiomatic Perl and I think more straightforward to follow
sub __settings {
my ($_s) = #_;
my $file = "$config_directory/authentic-theme/settings.js";
return unless -r $file;
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
my #file = <$fh>;
chomp #file;
for ( #file ) {
next if m{//};
if ( my #matches = $_ =~ /(?:$_s\s*=\s*(.*))/g ) {
my $matches = join "\n", #matches;
$matches =~ tr/';//d;
return $matches;
}
}
}

Find and get the location of a list of words in a text

I'm sure this is simple but I just can't figure out what to do...
I have a text file with a bunch of words in it (let's call it "wordlist") organized in a single column. Then I have a big text file (let's call it "essay"). What I want to do is to look in the "essay" file for the words in my "wordlist".
The trick is that I want to know the position of the matched word in the "essay" (meaning, match found after X characters).
I'm actually able to do it when I look for a single word (so wordlist containing just 1 word) but I can't get it to work when working with a list of words...
Any advice ?
thanks a lot
Ok so I just realized it would just tell me "no match found" anyway...Here is the code
use strict;
use warnings;
open (my $wordlist, "<", "/wordlist.txt")
or die "cannot open < wordlist.txt $!";
open (my $essay, "<", "/essay.txt")
or die "cannot open < essay.txt $!";
while (<$essay>) { print "match found\n" if ($essay =~ m/$wordlist/) ; }
{ print "no match found\n" if ($essay !~ m/$wordlist/) ; }
Help please...?

perl index function basically matches substring which does not ensure the match of a full string. A regular expression based match is more useful here imho.
Explanation:
Read whole text of essay in a string. => $essay
For each word from wordlist.txt => $_
-- Keep matching $_ within $essay with proper regex. The one used here is b$_\b
-- For each match, collect the value of #-[0]
\b: is the word boundary character here which ensures that it only matches with complete words not substrings.
#-: is a special variable that contains the start position of the last regex match.
Here is a sample code:
use strict;
use warnings;
use 5.010;
my $wordlist_file = 'wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
my %pos;
my $essay_file = 'essay.txt';
my $essay = do {
local $/ = undef;
open my $fh, "<", $essay_file
or die "could not open $essay_file: $!";
<$fh>;
};
while (<$wordlist_fh>) {
chomp;
$pos{$_} = [] unless $pos{$_};
while($essay =~ m/\b$_\b/g){
push #{$pos{$_}}, #-;
}
}
use Data::Dumper;
print Dumper(\%pos);
the wordlist file and essay files are similar as mentioned by ThisSuitIsBlackNot.
wordlist.txt
I
Perl
hacker
essay.txt
I want to be just another Perl hacker when I grow up
I want to be just another Perl hacker when I grow up
The %pos hash now contains all the positions of your each word. I just showed them through dumper
$VAR1 = {
'hacker' => [
'31',
'84'
],
'Perl' => [
'26',
'79'
],
'I' => [
'0',
'43',
'53',
'96'
]
};
Note that the counts are including the newline characters at the end of each line.

Maybe you can use index() function.
Here is the link: Using the Perl index() function
This is my sample. The performance may be not too well. Hope it helps~:)
open (my $wordlist, "<", "files/wordlist.txt")
or die "cannot open < wordlist.txt $!";
open (my $essay, "<", "files/essay.txt")
or die "cannot open < essay.txt $!";
my $words = {};
while (<$wordlist>) {
chomp($_);
$words->{$_} = 1;
}
my $row_count = 0;
while (<$essay>) {
$row_count++;
chomp($_);
foreach my $word (keys %{$words}) {
my $offset = 0;
my $r = index($_, $word, $offset);
while ($r != -1) {
print "Found [$word] in line $row_count at $r\n";
$offset = $r + 1;
$r = index($_, $word, $offset);
}
}
}

In your code, $essay and $wordlist are both filehandles. When you say
print "match found\n" if ($essay =~ m/$wordlist/);
You're trying to match the stringification of one filehandle to the stringification of another filehandle. When a filehandle is stringified, it looks something like this:
GLOB(0x9a26c38)
So your code actually does something like:
print "match found\n" if ('GLOB(0x9a26c38)' =~ m/GLOB(0x94bbc38)/);
This is not what you want. You need to read the contents of your files and compare those, not the filehandles themselves.
Essay words each on their own line
The following code assumes that your "essay" consists of one word per line. We read the contents of the essay file into a hash of arrays, with the lines as keys and an array of positions as values. We use an array in case the same word appears multiple times in the file. The position of the first word is zero. We then loop through the word list file, printing the word and the first matching position, if there is one.
use strict;
use warnings;
use 5.010;
my $essay_file = 'files/essay.txt';
open my $essay_fh, '<', $essay_file or die "Failed to open '$essay_file': $!";
my $pos = 0;
my %essay;
while (<$essay_fh>) {
chomp;
push #{ $essay{$_} }, $pos;
$pos += length $_;
}
my $wordlist_file = 'files/wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
while (<$wordlist_fh>) {
chomp;
say "$_: $essay{$_}[0]" if exists $essay{$_};
}
essay.txt
I
want
to
be
just
another
Perl
hacker
when
I
grow
up
wordlist.txt
I
Perl
hacker
Output
I: 0
Perl: 20
hacker: 24
Note that I'm ignoring newline characters when computing the position values. You can adjust this as necessary.
Essay words more than one per line
If your essay file can have more than one word per line, we can use a regex to check for matches:
use strict;
use warnings;
use 5.010;
# Slurp entire essay file into a variable
my $essay = do {
local $/;
my $essay_file = 'files/essay.txt';
open my $essay_fh, '<', $essay_file or die "Failed to open '$essay_file': $!";
<$essay_fh>;
};
my $wordlist_file = 'files/wordlist.txt';
open my $wordlist_fh, '<', $wordlist_file or die "Failed to open '$wordlist_file': $!";
while (<$wordlist_fh>) {
chomp;
say "$_: ", pos($essay) - length($_) if $essay =~ /\b$_\b/g;
}
essay.txt
I want to be just another Perl hacker when I grow up
wordlist.txt
I
Perl
hacker
hack
Output
I: 0
Perl: 26
hacker: 31
Note that the results are a little bit different from our other program, because now there are spaces between words. Also note that there is no output for the word hack, since we're only checking for whole word matches.

Match different variant of a word using regex Perl

I am splitting sentences at individual space characters, and then matching these terms against keys of hashes. I am getting matches only if the terms are 100% similar, and I am struggling to find a perfect regex that could match several occurrences of the same word. Eg. Let us consider I have a term 'antagon' now it perfectly matches with the term 'antagon' but fails to match with antagonists, antagonistic or pre-antagonistic, hydro-antagonist etc. Also I need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7 silencing the effect of special characters and so on.
This is the code that I have till now; thr commented part is where I am struggling.
(Note: Terms in the hash are stemmed to root form of a word).
use warnings;
use strict;
use Drug;
use Stop;
open IN, "sample.txt" or die "cannot find sample";
open OUT, ">sample1.txt" or die "cannot find sample";
while (<IN>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my #full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/,/ , /g;
$string =~ s/\./ \. /g;
$string =~ s/;/ ; /g;
$string =~ s/\(/ ( /g;
$string =~ s/\)/ )/g;
$string =~ s/\:/ : /g;
$string =~ s/\::/ :: )/g;
my #array = split / /, $string;
foreach my $word (#array) {
chomp $word;
if ( $word =~ /\,|\;|\.|\(|\)/g ) {
push( #full, $word );
}
if ( $Stop_words{$word} ) {
push( #full, $word );
}
if ( $Values{$word} ) {
my $term = "<Drug>$word<\/Drug>";
push( #full, $term );
}
else {
push( #full, $word );
}
# if($word=~/.*\Q$Values{$word}\E/i)#Changed this
# {
# $term="<Drug>$word</$Drug>";
# print $term,"\n";
# push(#full,$term);
# }
}
}
my $mod_str = join( " ", #full );
print OUT $mod_str, "\n";
}

I need a regex to match occurances of words like MCF-7 with MCF7 or
MC-F7
The most straightforward approach is just to strip out the hyphenss i.e.
my $ignore_these = "[-_']"
$word =~ s{$ignore_these}{}g;
I am not sure what is stored in your Value hash, so its hard to tell what you expect to happen
if($word=~/.*\Q$Values{$word}\E/i)
However, the kind of thing I imagin you want is (simplified your code somewhat)
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.10.0;
use Data::Dumper;
while (<>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my #full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/([,\.;\(\)\:])/ $1 /g; # squished these together
$string =~ s/\:\:/ :: )/g; # typo in original
my #array = split /\s+/, $string; # split on one /or more/ spaces
foreach my $word (#array) {
chomp $word;
my $term=$word;
my $word_chars = "[\\w\\-_']";
my $word_part = "antagon";
if ($word =~ m{$word_chars*?$word_part$word_chars+}) {
$term="<Drug>$word</Drug>";
}
push(#full,$term); # push
}
}
my $mod_str = join( " ", #full );
say "<Sentence>$mod_str</Sentence>";
}
This gives me the following output, which is my best guess at what you expect:
$ cat tmp.txt
<Sentence>This in antagonizing the antagonist's antagonism pre-antagonistically.</Sentence>
$ cat tmp.txt | perl x.pl
<Sentence>this in <Drug>antagonizing</Drug> the <Drug>antagonist's</Drug> <Drug>antagonism</Drug> <Drug>pre-antagonistically</Drug> .</Sentence>
$

perl -ne '$things{$1}++while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//;END{print "$_\n" for sort keys %things}' FILENAME
If the file contains the following:
he was an antagonist
antagonize is a verb
why are you antagonizing her?
this is an alpha-antagonist
This will return:
alpha-antagonist
antagonist
antagonize
antagonizing
Below is the a regular (not one-liner) version:
#!/usr/bin/perl
use warnings;
use strict;
open my $in, "<", "sample.txt" or die "could not open sample.txt for reading!";
open my $out, ">", "sample1.txt" or die "could not open sample1.txt for writing!";
my %things;
while (<$in>){
$things{$1}++ while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//
}
print $out "$_\n" for sort keys %things;

You may want to take another look at your assumptions on your approach. What it sounds like to me is that you are looking for words which are within a certain distance of a list of words. Take a look at the Levenshtein distance formula to see if this is something you want. Be aware, however, that computing this might take exponential time.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How can I extract abbreviations from a file using Perl? - regex

I need to extract certain Abbreviations from a file such as ABS,TVS,and PERL. Any abbreviations which are in uppercase letters. I'd preferably like to do this with a regular expression. Any help is appreciated.

Untested: my %abbr; open (my $input, "<", "filename") || die "open: $!"; for ( < $input > ) { while (s/([A-Z][A-Z]+)//) { $abbr{$1}++; } } Modified it to look for at least two consecutive capital letters.

#!/usr/bin/perl use strict; use warnings; my %abbrs = (); while(<>){ my #words = split ' ', $_; foreach my $word(#words){ $word =~ /([A-Z]{2,})/ && $abbrs{$1}++; } } # %abbrs now contains all abreviations

Related

A non-greedy Perl regular expression

Search for a pattern in a file

Perl parsing JavaScript file regex, to catch quotes only at the beginning and end of the returned string

Find and get the location of a list of words in a text

Match different variant of a word using regex Perl

Categories

Resources