Retrieve string between two string delimiters using regex in perl

I have been working on this for a little while now and can't seem to figure it out. I have a file containing a bunch of lines all structured like the one below meaning each line starts with "!" and has three separators "<DIV>".
!the<DIV>car<DIV>drove down the<DIV>road off into the distance
I am interested in retrieving the last string, "road off into the distance", but I can't seem to get it to work. Below is my current code.
while($line = <INFILE>) {
$line =~ /<SEP>{3}(.*)/;
print $1;
}
Any help would be greatly appreciated!

The statement
@b = $a =~ /^!(.*?)<DIV>(.*?)<DIV>(.*?)<DIV>(.*)/
will split the string into a list, and you can then extract the 4th element with
$b[3]
If you really want only the last one, do this instead:
($text) = $a =~ /^!.*<DIV>(.*)/

I don't know whether you insist on a regex or simply didn't think of anything else, but split will do this nicely:
$text = (split '<DIV>', $str)[-1];
If you regularly have such repeating patterns split may well be better for the job than a pure regex. (Split also uses full regular expressions in its pattern, of course.)
ADDED
All this can be done directly, if you simply only need to pull the last thing from each line:
open my $fh, '<', $file or die "Can't open $file: $!";
my @text = map { chomp; (split '<DIV>')[-1] } <$fh>;
close $fh;
print "$_\n" for @text;
The split by default uses $_, which inside the map is the current element processed. For lines without a <DIV> this returns the whole line. A file handle in the list context serves all lines as a list; the list context is imposed by map here.
In case you want all text between delimiters you can do
my @rlines = map { [ split '<DIV>' ] } <$fh>;
where [ ] takes a reference to the list returned by split and thus @rlines contains references to arrays, each with text in between <DIV>s on a line. The leading ! is there though and to drop it a little more processing is needed.
Of course, for the map block you can use { (/.*<DIV>(.*)/)[0] } from Jim Garrison's answer for a single match, or modify the regex a little to catch'em all.
If performance is a factor then that's a little different question.
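As a minimal sketch of that extra processing (the sample line is the one from the question; the variable names are only illustrative), the leading ! can be stripped before splitting, after which the last field or all of the fields are available:

```perl
use strict;
use warnings;

# Sample line taken from the question.
my $line = '!the<DIV>car<DIV>drove down the<DIV>road off into the distance';

# Work on a copy and drop the leading "!" marker.
(my $body = $line) =~ s/^!//;

# Split on the literal delimiter; every field ends up in @fields.
my @fields = split /<DIV>/, $body;

print "last field: $fields[-1]\n";
print "all fields: ", join(' | ', @fields), "\n";
```

Whether split or a single regex reads better is mostly a matter of taste here; split has the advantage of returning every field at once.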

A simple substitution could work too:
use feature 'say';
while (<DATA>) {
    chomp;
    s/.*<DIV>//;   # greedy .* removes everything up to and including the last <DIV>
    say;
}

Simple regex which answers your question:
while (my $line = <INFILE>) {
    my ($match) = $line =~ /.*<DIV>(.*)/;
    # print inside the loop so every line's match is shown, not only the last one
    print "$match\n" if defined $match;
}

Related

Replace only the second occurance of string in a line in perl regex

I have a string like "ven|ven|vett|vejj|ven|ven", where each "|" delimits a column.
I split the string on "|", save the columns in an array, and read each column into $str.
So, I'm trying to do this:
$string =~ s/$str/venky/g if $str =~ /ven/i; # replaces globally
This does not meet the requirement.
I need to replace the string only at a particular occurrence, chosen on demand.
For example, I have a request to change the 2nd occurrence of "ven" to "venky".
How can I meet this requirement simply? Is there something like
$string =~ s/ven/venky/2;
As far as I know, we have 'o' for replace once and 'g' for globally. I'm struggling to find a way to replace a particular occurrence. I should not use pos() to get the position, because the string keeps changing and it becomes difficult to track it every time.
Please help me with this.

There is no flag that you can add to the regex that will do this.
The easiest way would be to split and loop. However, if you insist to use one regex, it is doable:
/^(?:[^v]|v[^e]|ve[^n])*ven(?:[^v]|v[^e]|ve[^n])*\Kven/
If you want to replace the Nth occurrence instead of the second, you can do:
/^(?:(?:[^v]|v[^e]|ve[^n])*ven){N-1}(?:[^v]|v[^e]|ve[^n])*\Kven/
The general idea:
(?:[^v]|v[^e]|ve[^n])* - matches a run of text that does not contain ven
\K (available since Perl 5.10) drops everything matched so far from the match, so you can sort of use it as a lookbehind with variable length
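A short sketch of how that pattern might be used in a substitution, assuming the string from the question and Perl 5.10 or later for \K (the variable names are made up for illustration):

```perl
use strict;
use warnings;

my $string = 'ven|ven|vett|vejj|ven|ven';

# A run of text that cannot contain "ven", as described above.
my $not_ven = qr/(?:[^v]|v[^e]|ve[^n])*/;

# Skip the first "ven" and whatever follows it; \K then discards that
# prefix from the match, so only the second "ven" is replaced.
(my $result = $string) =~ s/^${not_ven}ven${not_ven}\Kven/venky/;

print "$result\n";
```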

Currently you're replacing every instance of 'ven' with 'venky' if your string contains a match for ven, which of course it does.
What I assume you're trying to do is replace 'ven' with 'venky' within your string if it's the second element:
my $string = 'ven|ven|vett|vejj|ven|ven';
my @elements = split(/\|/, $string);
my $count;
foreach (@elements) {
    $count++;
    s/$_/venky/g if /ven/i and $count == 2;
}
print join('|', @elements);
print "\n";

Your approach was already pretty good. What you described makes sense, but I think you are having trouble implementing it.
I created a function to do the work. It takes 4 arguments:
$string is the string we want to work on
$n is the nth occurrence you want to replace
$needle is the thing you want to replace (think needle in a haystack)
Note that right now $needle is used as a regular expression, so any metacharacters in it are interpreted. To match it literally, use quotemeta on it or match with /\Q$needle\E/
$replacement is the replacement for the $needle
The idea is to split up the string, then check each element if it matches the pattern ($needle) and keep track of how many have matched. If the nth one is reached, replace it and stop processing. Then put the string back together.
use strict;
use warnings;
use feature 'say';
say replace_nth_occurance("ven|ven|vett|vejj|ven|ven", 2, 'ven', 'venky');
sub replace_nth_occurance {
    my ($string, $n, $needle, $replacement) = @_;
    # take the string apart
    my @elements = split /\|/, $string;
    my $count = 0; # keep track of ...
    foreach my $e (@elements) {
        $count++ if $e =~ m/$needle/; # ... how many matches we've found
        if ($count == $n) {
            $e =~ s/$needle/$replacement/; # replace
            last; # and stop processing
        }
    }
    # put it back into the pipe-separated format
    return join '|', @elements;
}
Output:
ven|venky|vett|vejj|ven|ven

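To replace the n'th occurrence of "ven" with "venky":
my $n = 3;
my $test = "seven given ravens";
$test =~ s/ven/--$n == 0 ? "venky" : $&/eg;
This uses the /e flag, which evaluates the replacement part as an expression. The counter $n is decremented on every match; only the match that brings it to zero is replaced, and every other match is put back unchanged via $&.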
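The same idea could be wrapped in a small helper. This is only a sketch: replace_nth is a hypothetical name, it counts upward instead of decrementing, and it uses a capture group rather than $& and quotes the needle with \Q...\E so it is matched literally.

```perl
use strict;
use warnings;

# Replace only the $n-th occurrence of the literal text $needle.
sub replace_nth {
    my ($string, $n, $needle, $replacement) = @_;
    my $seen = 0;
    # /e evaluates the replacement as code; every match except the
    # $n-th is put back unchanged via $1.
    $string =~ s/(\Q$needle\E)/++$seen == $n ? $replacement : $1/ge;
    return $string;
}

print replace_nth('ven|ven|vett|vejj|ven|ven', 2, 'ven', 'venky'), "\n";
print replace_nth('seven given ravens', 3, 'ven', 'venky'), "\n";
```

Because the count is over matches rather than over columns, this works on strings with any delimiter, or none.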

Perl: how to supply regexp list from file

So, I need to parse a file and, if something matches a pattern, replace it with something:
while(<$ifh>) {
s/(.*pattern_1*)/$1\nsome more stuff/ ;
s/(.*pattern_2*)/$1\neven more stuff/;
s/(.*pattern_3.*)// ;
# and so on ...
print $ofh $_;
}
Question: what would be the simplest way to keep this list of regexp rules in a file (something similar to the sed '-f' option)?
EDIT: perhaps I need to clarify a bit. We need to have the regexp rules in the separate file (not in the parsed file - although this was nice, thanks!), so they are not hardcoded. So, basically the external file should consist of 's//' lines.
Of course this can probably be done with foreach loop and eval, or even with external call to sed, but I suppose there can be something nicer.
regards, Wojtek

You could create patterns and store them in a Perl list, then simply iterate through your patterns in your input loop.
my @patterns = ( qr/pattern1/, qr/pattern2/, qr/pattern3/ ); # and so on
while (<$ifh>) {
    for my $pattern (@patterns) {
        s/($pattern)/$1\neven more stuff/;
    }
    print $ofh $_;
}

Based on your edit, here's a way to store regular expressions in an external file. Note that I would not recommend storing whole s/.../.../ statements in an external file, but you could use the eval function to accomplish it. Here's how I would solve this, however:
my (@regexes, $i);
$i = 0;
# each rule line has the form  pattern::replacement
while (my $line = <$rifh>) {
    chomp($line);
    if (index($line, "::") > -1) {
        my @texts = split(/::/, $line);
        $regexes[$i]{"to find"} = qr!$texts[0]!;
        $regexes[$i]{"replace with"} = $texts[1];
        $i++;
    }
}
while (my $line = <$ifh>) {
    chomp($line);
    foreach my $regex (@regexes) {
        $line =~ s!$regex->{"to find"}!$regex->{"replace with"}!;
    }
    print $ofh $line . "\n";
}
(My use of ! as a regexp delimiter is my own style mechanism, often useful for allowing bare / characters to be used directly in a regexp definition.)
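Here is a self-contained sketch of the same approach, assuming a rules file with one pattern::replacement pair per line. To keep it runnable on its own, the rules are read from an in-memory string instead of a real file, and the names $rules_text and apply_rules are invented for the example. Replacements are treated as literal text, so backreferences such as $1 in the rules file would not be expanded.

```perl
use strict;
use warnings;

# Stand-in for the contents of an external rules file (an assumption).
my $rules_text = "colou?r::COLOR\n\\bfoo\\b::bar\n";
open my $rules_fh, '<', \$rules_text or die "Can't read rules: $!";

my @rules;
while (my $rule = <$rules_fh>) {
    chomp $rule;
    my ($find, $replace) = split /::/, $rule, 2;
    next unless defined $replace;    # skip lines without a "::" separator
    push @rules, { find => qr/$find/, replace => $replace };
}
close $rules_fh;

# Apply every rule, in file order, to one line of input.
sub apply_rules {
    my ($line) = @_;
    for my $rule (@rules) {
        $line =~ s/$rule->{find}/$rule->{replace}/g;
    }
    return $line;
}

print apply_rules('foo likes the colour of food'), "\n";
```

Compiling each pattern once with qr// avoids recompiling it for every input line, and it keeps arbitrary code out of the rules file, which evaluating whole s/// statements would not.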

String replace in Perl

I am trying to deobfuscate code. This code uses a lot of long variable names which are substituted with meaningful names at the time of running the code.
How do I preserve the state while searching and replacing?
For instance, with an obfuscated line like this:
${${"GLOBALS"}["ttxdbvdj"]}=_hash(${$urqboemtmd}.substr(${${"GLOBALS"}["wkcjeuhsnr"]},${${"GLOBALS"}["gjbhisruvsjg"]}-${$rrwbtbxgijs},${${"GLOBALS"}["ibmtmqedn"]}));
There are multiple mappings in mappings.txt which match above obfuscated line like:
$rrwbtbxgijs = hash_length;
$urqboemtmd = out;
At the first run, it will replace $rrwbtbxgijs with hash_length in the obfuscated line above. Now, when it comes across the second mapping during the next iteration of the outer while loop, it will replace $urqboemtmd with out in the obfuscated line.
The problem is:
When it comes across first mapping, it does the substitution. However, when it comes across next mapping in the same line for a different matching string, the previous search/replace result is not there.
It should preserve the previous substitution. How do I do that?
I wrote a Perl script, which would pick one mapping from mapping.txt and search the entire obfuscated code for all the occurrences of this mapping and replace it with the meaningful text.
Here is the code I wrote:
#! /usr/bin/perl
use warnings;
($mapping, $input) = @ARGV;
open MAPPING, '<', $mapping
or die "couldn't read from the file, $mapping with error: $!\n";
while (<MAPPING>) {
chomp;
$line = $_;
($key, $value) = split("=", $line);
open INPUT, '<', $input;
while (<INPUT>) {
chomp;
if (/$key/) {
$_=~s/\Q$key/$value/g;
print $_,"\n";
}
}
close INPUT;
}
close MAPPING;

To match the literal meta characters inside your string, you can use quotemeta or:
s/\Q$key\E/$replace/

Just tell Perl not to interpret the characters in $key:
s/\Q$key/$value/g

Consider using B::Deobfuscate and gradually enter variable names into its configuration file as you figure out what they do.
I'm a little confused about your request to save state. What exactly are you doing/do you intend to do with the output? Here's an (untested) example of doing all the substitutions in one pass, if that helps?
my %map;
while ( my $line = <MAPPING> ) {
chomp $line;
my ($key, $value) = split("=", $line);
$map{$key} = $value;
}
close MAPPING;
my $search = qr/(@{[ join '|', map quotemeta, sort { length $b <=> length $a } keys %map ]})/;
while ( my $line = <INPUT> ) {
$line =~ s/$search/$map{$1}/g;
print OUTPUT $line;
}
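A self-contained sketch of that one-pass idea, with the mapping hard-coded instead of read from mappings.txt (the two entries come from the question; the sample line is a shortened, made-up variant of the obfuscated code). Because every key is substituted on the same line in a single s///g, no earlier replacement is lost:

```perl
use strict;
use warnings;

# Mapping from obfuscated names to meaningful ones (normally read from a file).
my %map = (
    '$rrwbtbxgijs' => 'hash_length',
    '$urqboemtmd'  => 'out',
);

# One alternation of all keys, longest first, each quoted so that
# "$" is matched literally rather than as an end-of-line anchor.
my $alternation = join '|', map { quotemeta } sort { length $b <=> length $a } keys %map;
my $search = qr/($alternation)/;

my $line = 'substr(${$urqboemtmd}, 0, ${$rrwbtbxgijs})';

# Replace every key found on the line in a single pass.
(my $clean = $line) =~ s/$search/$map{$1}/g;

print "$clean\n";
```

If the keys are read from a file with lines like "$rrwbtbxgijs = hash_length;", the whitespace around "=" and the trailing ";" would need to be trimmed before the keys are used.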

Regular expression statement inside a while loop only matching and printing one of several expected matches

I've been struggling with this for a while and I was wondering if there was something obvious I've missed.
As programming learning/practice, I'm trying to put together a simple script for calculating the components of a restriction enzyme digest mix. However, first I need to get a list of enzyme stock concentrations.
I pulled all the individual pages from the New England Biolabs enzyme page, and my goal with this current script is to pull out the name of the enzyme and the concentrations available from the company.
This example works with a local copy of EcoRI (link included at bottom of submission).
use warnings;
use strict;
open(FILE,'productR0101.asp');
my $line;
my $counter;
my $array1;
my $array2;
my $array3;
my $concentration;
my @array4;
$counter = 1;
while ($line = <FILE>) {
chomp($line);
if ($counter == 6 ){
$array1 = $line;
$counter++;
}
else{
$counter++;
}
if ($line =~ m/.{8}units.ml/g) {
(@array4) =$line =~ m/.{8}units.ml/g;
print @array4;
}
}
print "\n".$array1;
exit;
Every file has the enzyme name on the sixth line of the file, so I just pulled that whole line. However, the concentrations are in different locations, so my approach was to read in the file one line at a time, and match to the units/ml tag.
My thinking was that it should print out the match for each line, if there was one, every time the while loop runs, effectively resulting in a string of separate print statements.
This is where I get messed up. There are six different locations in this file with a units/ml tag: three for 20,000 and three for 100,000.
I was expecting six different results printed, but when I run this, only one 100,000 units/ml result is returned.
I've tried all sorts of fixes. I tried concatenating strings, I tried storing it as a string, I tried concatenating it onto another array that never gets touched by the (@array4) = $line =~ m/.{8}units.ml/g line, and it either breaks it or gives the same result.
And finally, I apologize for any weird conventions. I'm still learning Perl, and my first experience programming was with MATLAB.
Also, the $array1, $array2, etc. exist because I was trying to keep track of exactly what was getting put where; my intention is to clean it up once I get it functional.
So does anyone have any ideas about what I'm doing wrong?
EDIT: the data source is the source code to each individual enzyme page. For this example, if you view the page source you get the complete input file I gave to the script.

Are the 20,000 units/ml at the start of the line? Because in that case, .{8} would fail to match - the dot doesn't match newlines, and 20,000_ is only 7 characters.

We really need to see the data you are processing, but it looks like you are storing only the last occurrence of /units.ml/ in @array4 because you are reading the file line by line.
I will add to this answer if you supplement your question, but for now I need to know
What your data looks like
What the mysterious /.{8}/ is for
Are you aware that $array1, $array2, and $array3, are scalars, as well as being very bad names for variables?
For now, here is a rewrite of your code using idiomatic Perl, and the $. variable that evaluates to the line number of the file most recently read
use strict;
use warnings;
open my $file, '<', 'productR0101.asp' or die $!;
my $array1;
my @array4;
while (my $line = <$file>) {
    chomp $line;
    $array1 = $line if $. == 6;
    if ($line =~ m/.{8}units.ml/) {
        @array4 = $line =~ m/.{8}units.ml/g;
        print "@array4\n";
    }
}
print "\n".$array1;

I can't exactly reproduce the behavior you've reported of only getting one of the 100,000 units/ml results, as I'm not exactly sure what your input data is. However, I think the problem is with the regular expression not having any captures. You should put parentheses around the part of the regex match that you want to be returned to @array4. So instead of this:
@array4 = $line =~ m/.{8}units.ml/g;
Try this:
@array4 = $line =~ m/(.{8})units.ml/g;
or, if each line holds at most one match:
@array4 = $line =~ /(.{8})units.ml/;
EDIT:
The m is optional when the delimiters are slashes, and /g is only needed when a line can contain several matches.
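To make the effect of captures and /g in list context concrete, here is a small sketch. The sample line is hypothetical, and [\d,]+ is used in place of .{8} so that numbers of any length are picked up; neither comes from the original page data.

```perl
use strict;
use warnings;

# Hypothetical line with two concentrations on it.
my $line = 'sizes: 20,000 units/ml and 100,000 units/ml';

# No capture group: /g in list context returns each whole match.
my @whole = $line =~ /[\d,]+ units.ml/g;

# With a capture group: /g returns only the captured part of each match.
my @numbers = $line =~ /([\d,]+) units.ml/g;

print "whole:   @whole\n";
print "numbers: @numbers\n";
```

Either way, assigning to the same array on every line overwrites the previous line's results; pushing onto the array keeps them all.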

Perl, match one pattern multiple times in the same line delimited by unknown characters

I've been able to find similar, but not identical questions to this one. How do I match one regex pattern multiple times in the same line delimited by unknown characters?
For example, say I want to match the pattern HEY. I'd want to recognize all of the following:
HEY
HEY HEY
HEYxjfkdsjfkajHEY
So I'd count 5 HEYs there. So here's my program, which works for everything but the last one:
open ( FH, $ARGV[0]);
while(<FH>)
{
foreach $w ( split )
{
if ($w =~ m/HEY/g)
{
$count++;
}
}
}
So my question is how do I replace that foreach loop so that I can recognize patterns delimited by weird characters in unknown configurations (like shown in the example above)?
EDIT:
Thanks for the great responses thus far. I just realized I need one other thing though, which I put in a comment below.
One question though: is there any way to save the matched term as well? So like in my case, is there any way to reference $w (say if the regex was more complicated, and I wanted to store it in a hash with the number of occurrences)
So if I was matching a real regex (say a sequence of alphanumeric characters) and wanted to save that in a hash.

One way is to capture all matches of the string and see how many you got. Like so:
open (FH, $ARGV[0]);
while (my $w = <FH>) {
    my @matches = $w =~ m/(HEY)/g;
    my $count = scalar(@matches);
    print "$count\t$w\n";
}
EDIT:
Yes, there is! Just loop over all the matches, and use the capture variables to increment the count in a hash:
my %hash;
open (FH, $ARGV[0]);
while (my $w = <FH>) {
foreach ($w =~ /(HEY)/g) {
$hash{$1}++;
}
}

The problem is you really don't want to call split(). It splits things into words, and you'll note that your last line only has a single "word" (though you won't find it in the dictionary). A word is bounded by white-space and thus is just "everything but whitespace".
What you really want is to keep looking through each line, counting every HEY and starting where you left off each time. That requires the /g at the end, used in a while condition so the search resumes after the previous match:
while(<>)
{
while (/HEY/g)
{
$count++;
}
}
print "$count\n";
There is, of course, more than one way to do it but this sticks close to your example. Other people will post other wonderful examples too. Learn from them all!

None of the above answers worked for my similar problem. $1 does not seem to change (perl 5.16.3), so $hash{$1}++ will just count the same match n times. The foreach evaluates the whole list of matches before the loop body runs, so by then $1 holds only the final capture.
To get each match, the foreach needs a loop variable, which will then contain each captured value in turn. Here's a little script that will match and print each occurrence of (number).
#!/usr/bin/perl -w
use strict;
use warnings FATAL=>'all';
my (%procs);
while (<>) {
foreach my $proc ($_ =~ m/\((\d+)\)/g) {
$procs{$proc}++;
}
}
print join("\n",keys %procs) . "\n";
I'm using it like this:
pstree -p | perl extract_numbers.pl | xargs -n 1 echo
(except with some relevant filters in that pipeline). Any pattern capture ought to work as well.
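For comparison, a sketch that combines the while (/.../g) loop from the earlier answer with a hash: in scalar context each iteration performs one match, so $1 is fresh every time. The sample lines are the ones from the question, held in an array (an assumption, to keep the example self-contained) instead of being read from a file.

```perl
use strict;
use warnings;

# The three sample lines from the question.
my @lines = ('HEY', 'HEY HEY', 'HEYxjfkdsjfkajHEY');

my %count;
for my $line (@lines) {
    # Scalar-context /g finds one match per iteration and resumes
    # where the previous one ended, so $1 is updated each time.
    while ($line =~ /(HEY)/g) {
        $count{$1}++;
    }
}

print "$_: $count{$_}\n" for sort keys %count;
```

Swapping (HEY) for a more general capture such as (\w+) would tally every distinct word instead.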
