How to extract the text between two patterns using REGEX perl

How to extract the text between two patterns using REGEX perl - regex

In the following lines how can I store the lines between "Description:" and "Tag:" in a variable using REGEX PERL and what would be a good datatype to use, string or list or something else?
(I am trying to write a program in Perl to extract the information of a text file with Debian package information and convert it into a RDF(OWL) file(ontology).)
Description: library for decoding ATSC A/52 streams (development)
liba52 is a free library for decoding ATSC A/52 streams. The A/52 standard is
used in a variety of applications, including digital television and DVD. It is
also known as AC-3.
This package contains the development files.
Homepage: http://liba52.sourceforge.net/
Tag: devel::library, role::devel-lib
The code I have written so far is:
#!/usr/bin/perl
open(DEB,"Packages");
open(ONT,">>debianmodelling.txt");
$i=0;
while(my $line = <DEB>)
{
if($line =~ /Package/)
{
$line =~ s/Package: //;
print ONT ' <package rdf:ID="instance'.$i.'">';
print ONT ' <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</name>'."\n";
}
elsif($line =~ /Priority/)
{
$line =~ s/Priority: //;
print ONT ' <priority rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</priority>'."\n";
}
elsif($line =~ /Section/)
{
$line =~ s/Section: //;
print ONT ' <Section rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Section>'."\n";
}
elsif($line =~ /Maintainer/)
{
$line =~ s/Maintainer: //;
print ONT ' <maintainer rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</maintainer>'."\n";
}
elsif($line =~ /Architecture/)
{
$line =~ s/Architecture: //;
print ONT ' <architecture rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</architecture>'."\n";
}
elsif($line =~ /Version/)
{
$line =~ s/Version: //;
print ONT ' <version rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</version>'."\n";
}
elsif($line =~ /Provides/)
{
$line =~ s/Provides: //;
print ONT ' <provides rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</provides>'."\n";
}
elsif($line =~ /Depends/)
{
$line =~ s/Depends: //;
print ONT ' <depends rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</depends>'."\n";
}
elsif($line =~ /Suggests/)
{
$line =~ s/Suggests: //;
print ONT ' <suggests rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</suggests>'."\n";
}
elsif($line =~ /Description/)
{
$line =~ s/Description: //;
print ONT ' <Description rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Description>'."\n";
}
elsif($line =~ /Tag/)
{
$line =~ s/Tag: //;
print ONT ' <Tag rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Tag>'."\n";
print ONT ' </Package>'."\n\n";
}
$i=$i+1;
}

my $desc = "Description:";
my $tag = "Tag:";
$line =~ /$desc(.*?)$tag/;
my $matched = $1;
print $matched;
or
my $desc = "Description:";
my $tag = "Tag:";
my #matched = $line =~ /$desc(.*?)$tag/;
print $matched[0];
or
my $desc = "Description:";
my $tag = "Tag:";
(my $matched = $line) =~ s/$desc(.*?)$tag/$1/;
print $matched;
Additional
If your Description and Tag may be on separate lines, you may need to use the /s modifier, to treat it as a single line, so the \n won't wreck it. Example:
$_=qq{Description:foo
more description on
new line Tag: some
tag};
s/Description:(.*?)Tag:/$1/s; #notice the trailing slash
print;

Assuming:
my $example; # holds the example text above
You could:
(my $result=$example)=~s/^.*?\n(Description:)/$1/s; # strip up to first marker
$result=~s/(\nTag:[^\n]*\n).+$/$1/s; # strip everything after second marker line
Or
(my $result=$example)=~s/^.*?\n(Description:.+?Tag:[^\n]*\n).*$/$1/s;
Both assume the Tag: value is contained on a single line.
If this is not the case, you might try:
(my $result=$example)=~s/
( # start capture
Description: # literal 'Description:'
.+? # any chars (non-greedy) up to
Tag: # literal 'Tag:'
.+? # any chars up to
)
(?: # either
\n[A-Z][a-z]+\: # another tagged value name
| # or
$ # end of string
)
/$1/sx;

I believe that the problem is caused by using a line reading loop for data structured by paragraphs. If you can slurp the file into memory and and apply split with a captured delimiter, the processing will be much smoother:
#!/usr/bin/perl -w
use strict;
use diagnostics;
use warnings;
use English;
# simple sample sub
my $printhead = sub {
printf "%5s got the tag '%s ...'\n", '', substr( shift, 0, 30 );
};
# map keys/tags? to functions
my %tagsoups = (
'PackageName' => sub {printf "%5s got the name '%s'\n", '', shift;}
, 'Description' => sub {printf "%5s got the description:\n---------\n%s\n----------\n", '', shift;}
, 'Tag' => $printhead
);
# slurp Packages (fallback: parse using $INPUT_RECORD_SEPARATOR = "Package:")
open my $fh, "<", './Packages-00.txt' or die $!;
local $/; # enable localized slurp mode
my $all = <$fh>;
my #pks = split /^(Package):\s+/ms, $all;
close $fh;
# outer loop: Packages
for (my $p = 1, my $n = 0; $p < scalar #pks; $p +=2) {
my $blk = "PackageName: " . $pks[$p + 1];
my #inf = split /\s*^([\w-]+):\s+/ms, $blk;
printf "%3d %s named %s\n", ++$n, $pks[$p], $inf[ 2 ];
# outer loop: key-value-pairs (or whatever they are called)
for (my $x = 1; $x < scalar #inf; $x += 2) {
if (exists($tagsoups{$inf[ $x ]})) {
$tagsoups{$inf[ $x ]}($inf[$x + 1]);
}
}
}
output for a shortened Packages file from my Ubuntu Linux:
3 Package named abrowser-3.5-branding
got the PackageName:
---------
abrowser-3.5-branding
----------
got the Description:
---------
dummy upgrade package for firefox-3.5 -> firefox
This is a transitional package so firefox-3.5 users get firefox on
upgrades. It can be safely removed.
----------
4 Package named casper
got the PackageName:
---------
casper
----------
got the Description:
---------
Run a "live" preinstalled system from read-only media
----------
got the Tag:
---------
admin::boot, admin::filesystem, implemented-in::shell, protocol::smb, role::plugin, scope::utility, special::c
ompletely-tagged, works-with-format::iso9660
----------
Using a hash for the functions to apply to the extracted parts will keep the details of generating xml out of the parser loops.

Related

Perl Regex Match and loop HTML Comments

I have a log file with data in format :
<!-- 12/15/16 01:02:27:950.125
DATA1 -->
<!-- 12/15/16 01:02:27:950.373
DATA2 -->
<!-- 12/15/16 01:02:27:950.921
DATA3: Text1 -->
<!-- 12/15/16 01:02:27:951.066
DATA4: Text2 -->
I need to extract and loop all the data inside the comments.
I am reading the file and saving data as one string.
I have tried a few solutions but getiing "undef" on match
use strict;
use Data::Dumper;
use File::Basename;
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );
use Date::Format;
use DateTime;
use warnings;
.
.
.
if ( open(ORIGFILE, $filepath) ) {
my #wrp_record_content = <ORIGFILE>;
# my $content = join('', #wrp_record_content);
# my #matches = $content =~ s/<!--(.*)-->//g;
# my $data;
# while ( <ORIGFILE> ) {
# $data .= $_;
# }
# while ( $data =~ m/<!--(.*)-->/g ) {
# print Dumper('===DATA===');
# print Dumper($data);
# }
my $content = join('', #wrp_record_content);
#print Dumper('------CONTENT------');
#print Dumper($content);
#print Dumper('------ CONTENT ENDED ------');
my #matches;
while ($content =~ /<!--.*?-->/gs) {
push #matches, $1;
}
foreach my $m (#matches) {
print Dumper('===MATCH===', "\n");
print Dumper($m);
}
}
Can someone please guide on where it is going wrong?

There is nothing in $1. You must add capturing parentheses to your regex pattern
$content =~ /<!--(.*?)-->/gs
You have done it correctly in the loop that you commented out!

perl count line in double looping, if match regular expression plus 1

I open a file by putting the line to an array. Inside this file based on the regular expression that contains a duplicate value. If the regular expression is a match I want to count it. The regular expression may look like this
$b =~ /\/([^\/]+)##/. I want to match $1 value.
my #array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is the way I do, it will get the same line in my file. Thank for help.
foreach my $b (#array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (#array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is the some sample data, two sentences are duplicates
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3

It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the variables that have more than one (e.g. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3

The problem has been solved by the previous answer - I just want to offer an alternate flavour that;
Spells out the regex
Uses the %seen hash to record the line the pattern first appears; to enable
slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
## # Must be immdiately followed by literal ##
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Ran on data above will produce;
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11

Perl parsing JavaScript file regex, to catch quotes only at the beginning and end of the returned string

I'm just starting to learn Perl. I need to parse JavaScript file. I came up with the following subroutine, to do it:
sub __settings {
my ($_s) = #_;
my $f = $config_directory . "/authentic-theme/settings.js";
if ( -r $f ) {
for (
split(
'\n',
$s = do {
local $/ = undef;
open my $fh, "<", $f;
<$fh>;
}
)
)
{
if ( index( $_, '//' ) == -1
&& ( my #m = $_ =~ /(?:$_s\s*=\s*(.*))/g ) )
{
my $m = join( '\n', #m );
$m =~ s/[\'\;]//g;
return $m;
}
}
}
}
I have the following regex, that removes ' and ; from the string:
s/[\'\;]//g;
It works alright but if there is a mentioned chars (' and ;) in string - then they are also removed. This is undesirable and that's where I stuck as it gets a bit more complicated for me and I'm not sure how to change the regex above correctly to only:
Remove only first ' in string
Remove only last ' in string
Remove ont last ; in string if exists
Any help, please?

You can use the following to match:
^'|';?$|;$
And replace with '' (empty string)
See DEMO

Remove only first ' in string
Remove only last ' in string
^[^']*\K'|'(?=[^']*$)
Try this .See demo.
https://regex101.com/r/oF9hR9/8
Remove ont last ; in string if exists
;(?=[^;]*$)
Try this.See demo.
https://regex101.com/r/oF9hR9/9
All three in one
^[^']*\K'|'(?=[^']*$)|;(?=[^;]*$)
See Here

You can use this code:
#!/usr/bin/perl
$str = "'string; 'inside' another;";
$str =~ s/^'|'?;?$//g;
print $str;
IDEONE demo
The main idea is to use anchors: ^ beginning of string, $ end of string and ;? matches the ";" symbol at the end only if it is present (? quantifier is making the pattern preceding it optional).EDIT: Also, ; will get removed even if there is no preceding '.

I suggest that your original code should look more like this. It is much more idiomatic Perl and I think more straightforward to follow
sub __settings {
my ($_s) = #_;
my $file = "$config_directory/authentic-theme/settings.js";
return unless -r $file;
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
my #file = <$fh>;
chomp #file;
for ( #file ) {
next if m{//};
if ( my #matches = $_ =~ /(?:$_s\s*=\s*(.*))/g ) {
my $matches = join "\n", #matches;
$matches =~ tr/';//d;
return $matches;
}
}
}

match using regex in perl

HI I am trying to exract some data from a text file in perl. My file looks like this
Name:John
FirstName:Smith
Name:Alice
FirstName:Meyers
....
I want my string to look like John Smith and Alice Meyers
I tried something like this but I'm stuck and I don't know how to continue
while (<INPUT>) {
if (/^[Name]/) {
$match =~ /(:)(.*?)(\n) /
$string = $string.$2;
}
if (/^[FirstName]/) {
$match =~ /(:)(.*?)(\n)/
$string = $string.$2;
}
}
What I try to do is that when I match Name or FirstName to copy to content between : and \n but I get confused which is $1 and $2

This will put you first and last names in a hash:
use strict;
use warnings;
use Data::Dumper;
open my $in, '<', 'in.txt';
my (%data, $names, $firstname);
while(<$in>){
chomp;
($names) = /Name:(.*)/ if /^Name/;
($firstname) = /FirstName:(.*)/ if /^FirstName/;
$data{$names} = $firstname;
}
print Dumper \%data;

Through perl one-liner,
$ perl -0777 -pe 's/(?m).*?Name:([^\n]*)\nFirstName:([^\n]*).*/\1 \2/g' file
John Smith
Alice Meyers

while (<INPUT>) {
/^([A-Za-z])+\:\s*(.*)$/;
if ($1 eq 'Name') {
$surname = $2;
} elsif ($1 eq 'FirstName') {
$completeName = $2 . " " . $surname;
} else {
/* Error */
}
}
You might want to add some error handling, e.g. make sure that a Name is always followed by a FirstName and so on.

$1 $2 $3 .. $N , it's the capture result of () inside regex.
If you do something like that , you cant avoid using $1 like variables.
my ($matched1,$matched2) = $text =~ /(.*):(.*)/
my $names = [];
my $name = '';
while(my $row = <>){
$row =~ /:(.*)/;
$name = $name.' '.$1;
push(#$names,$name) if $name =~ / /;
$name = '' if $name =~ / /;
}
`while(<>){
}
`

open (FH,'abc.txt');
my(%hash,#array);
map{$_=~s/.*?://g;chomp($_);push(#array,$_)} <FH>;
%hash=#array;
print Dumper \%hash;

Match different variant of a word using regex Perl

I am splitting sentences at individual space characters, and then matching these terms against keys of hashes. I am getting matches only if the terms are 100% similar, and I am struggling to find a perfect regex that could match several occurrences of the same word. Eg. Let us consider I have a term 'antagon' now it perfectly matches with the term 'antagon' but fails to match with antagonists, antagonistic or pre-antagonistic, hydro-antagonist etc. Also I need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7 silencing the effect of special characters and so on.
This is the code that I have till now; thr commented part is where I am struggling.
(Note: Terms in the hash are stemmed to root form of a word).
use warnings;
use strict;
use Drug;
use Stop;
open IN, "sample.txt" or die "cannot find sample";
open OUT, ">sample1.txt" or die "cannot find sample";
while (<IN>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my #full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/,/ , /g;
$string =~ s/\./ \. /g;
$string =~ s/;/ ; /g;
$string =~ s/\(/ ( /g;
$string =~ s/\)/ )/g;
$string =~ s/\:/ : /g;
$string =~ s/\::/ :: )/g;
my #array = split / /, $string;
foreach my $word (#array) {
chomp $word;
if ( $word =~ /\,|\;|\.|\(|\)/g ) {
push( #full, $word );
}
if ( $Stop_words{$word} ) {
push( #full, $word );
}
if ( $Values{$word} ) {
my $term = "<Drug>$word<\/Drug>";
push( #full, $term );
}
else {
push( #full, $word );
}
# if($word=~/.*\Q$Values{$word}\E/i)#Changed this
# {
# $term="<Drug>$word</$Drug>";
# print $term,"\n";
# push(#full,$term);
# }
}
}
my $mod_str = join( " ", #full );
print OUT $mod_str, "\n";
}

I need a regex to match occurances of words like MCF-7 with MCF7 or
MC-F7
The most straightforward approach is just to strip out the hyphenss i.e.
my $ignore_these = "[-_']"
$word =~ s{$ignore_these}{}g;
I am not sure what is stored in your Value hash, so its hard to tell what you expect to happen
if($word=~/.*\Q$Values{$word}\E/i)
However, the kind of thing I imagin you want is (simplified your code somewhat)
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.10.0;
use Data::Dumper;
while (<>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my #full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/([,\.;\(\)\:])/ $1 /g; # squished these together
$string =~ s/\:\:/ :: )/g; # typo in original
my #array = split /\s+/, $string; # split on one /or more/ spaces
foreach my $word (#array) {
chomp $word;
my $term=$word;
my $word_chars = "[\\w\\-_']";
my $word_part = "antagon";
if ($word =~ m{$word_chars*?$word_part$word_chars+}) {
$term="<Drug>$word</Drug>";
}
push(#full,$term); # push
}
}
my $mod_str = join( " ", #full );
say "<Sentence>$mod_str</Sentence>";
}
This gives me the following output, which is my best guess at what you expect:
$ cat tmp.txt
<Sentence>This in antagonizing the antagonist's antagonism pre-antagonistically.</Sentence>
$ cat tmp.txt | perl x.pl
<Sentence>this in <Drug>antagonizing</Drug> the <Drug>antagonist's</Drug> <Drug>antagonism</Drug> <Drug>pre-antagonistically</Drug> .</Sentence>
$

perl -ne '$things{$1}++while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//;END{print "$_\n" for sort keys %things}' FILENAME
If the file contains the following:
he was an antagonist
antagonize is a verb
why are you antagonizing her?
this is an alpha-antagonist
This will return:
alpha-antagonist
antagonist
antagonize
antagonizing
Below is the a regular (not one-liner) version:
#!/usr/bin/perl
use warnings;
use strict;
open my $in, "<", "sample.txt" or die "could not open sample.txt for reading!";
open my $out, ">", "sample1.txt" or die "could not open sample1.txt for writing!";
my %things;
while (<$in>){
$things{$1}++ while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//
}
print $out "$_\n" for sort keys %things;

You may want to take another look at your assumptions on your approach. What it sounds like to me is that you are looking for words which are within a certain distance of a list of words. Take a look at the Levenshtein distance formula to see if this is something you want. Be aware, however, that computing this might take exponential time.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to extract the text between two patterns using REGEX perl - regex

Related

Perl Regex Match and loop HTML Comments

perl count line in double looping, if match regular expression plus 1

Perl parsing JavaScript file regex, to catch quotes only at the beginning and end of the returned string

match using regex in perl

Match different variant of a word using regex Perl

Categories

Resources