Perl Extracting Text

Perl Extracting Text - regex

I have been working on this for so long!
I'd appreciate your help...
What my doc will look like:
<text>
<text> command <+>= "stuff_i_need" <text>
<text>
<text> command <+>= stuff <text>
<text>
<text> command <+>= -stuff <text>
<text>
Anything with tangle brackets around it is optional
stuff could be anything (apple, orange, banana) but it is what I need to extract
the command is fixed
My code so far:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::Diff;
# File Handlers
open(my $ofh, '>in.txt');
open(my $ifh, '<out.txt');
while (<$ifh>)
{
# Read in a line
my $line = $_;
chomp $line;
# Extract stuff
my $extraction = $line;
if ($line =~ /command \+= /i) {
$extraction =~ s/.*"(.*)".*/$1/;
# Write to file
print $ofh "$extraction\n";
}
}

Based on the example input:
if ($line =~ /command\d*\s*\+?=\s*["-]?(\w+)"?/i) {
$extraction = $1;
print "$extraction\n";
}

A few things:
For extraction, don't use substitution (i.e., use m// and not s///). If you use a match, the parenthetical groups inside the match will be returned as a list (and assigned to $1, $2, $3, etc. if you prefer).
The =~ binds the variable you want to match. So you want $extraction to actually be $line.
Your .* match is too greedy and will prevent the match from succeeding the way you want. What I mean by "greedy" is that .* will match the trailing " in your lines. It will consume the rest of the input on the line and then try match that " and fail because you've reached the end of the line.
You want to specify what the word could be. For example, if it's letters, then match [a-zA-Z]
my ($extraction) = $line =~ /command \+= "([a-zA-Z]*)"/;
If it's a number, you want [0-9]:
my ($extraction) = $line =~ /command \+= "([0-9]*)"/;
If it could be anything except ", use [^"], which means "anything but "":
my ($extraction) = $line =~ /command \+= "([^"]*)"/;
It usually helps to try to match against just what you are looking for rather than the blanket .*.

The following regular expression would help you:
m{
(?<= = ) # Find an `=`
\s* # Match 0 or more whitespaces
(?: # Do not capture
[ " \- ] # Match either a `"` or a `-`
)? # Match once or never
( # Capture
[^ " \s ]+ # Match anything but a `"` or a whitespace
)
}x;

The following one-liner will extract a word (a sequence of characters without spaces) that follows an equal sign prefixed by an optional plus sign, surrounded by optional quotes. It will read from in.txt and write to out.txt.
perl -lne 'push #a, $1 if /command\s*\+?=\s*("?\S+"?)/ }{
print for #a' in.txt > out.txt
The full code - if you prefer script form - is:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
push #a, $1 if /command\s*\+?=\s*("?\S+"?)/;
}
{
print $_ foreach (#a);
}
Courtesy of the Deparse function of the O module.

A light solution.
#!/usr/bin/env perl
use warnings;
use strict;
open my $ifh, '<','in.txt';
open my $ofh, '>', 'out.txt';
while (<$ifh>)
{
if (/
\s command\s\+?=\s
(?:-|("))? # The word can be preceded by an optional - or "
(\w+)
(?(1)\1)\s+ # If the word is preceded by a " it must be end
# with a "
/x)
{
print $ofh $2."\n";
}
}

Related

How to get the string which is starting after front slash in Perl regex?

For example, I have a below string starting with two front slashes. Now I want to get the string "foo_foo". How do I do that? Thanks in advance.
my $str = "// filename : foo_foo";
if ($_ =~ m/^filename\s+:\s+(.+)/) {print "regex $1 \n";}

You populate $str but bind the match against $_.
Use a different delimiter so you don't have to escape the slashes.
my $str = "// filename : foo_foo";
if ($str =~ m{^/+\s+filename\s+:\s+(.+)}) {
print "regex: '$1'\n";
}

You can use
my $str = "// filename : foo_foo";
if ($str =~ m{^//\h*filename\s*:\s*(.+)}) {
print "regex $1 \n";
}
See the online Perl demo. Here, I used {...} regex delimiters instead of /.../ and the pattern looks like ^//\h*filename\s*:\s*(.+) now, matching
^ - start of string
// - a // substring
\h* - zero or more horizontal whitespaces
filename - some fixed string
\s*:\s* - a : char enclosed with zero or more whitespaces
(.+) - Group 1: one or more chars other than line break chars as many as possible (greedy dot).

Something in line with following sample code should produce desired result.
use strict;
use warnings;
use feature 'say';
my $str = "// filename : foo_foo";
my($fname) = $str =~ m|// filename : (.*)\z|;
say $fname;
Output
foo_foo

Perl Regex To Remove Commas Between Quotes?

I am trying to remove commas between double quotes in a string, while leaving other commas intact? (This is an email address which sometimes contains spare commas). The following "brute force" code works OK on my particular machine, but is there a more elegant way to do it, perhaps with a single regex?
Duncan
$string = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
print "Initial string = ", $string, "<br>\n";
# Extract stuff between the quotes
$string =~ /\"(.*?)\"/;
$name = $1;
print "name = ", $1, "<br>\n";
# Delete all commas between the quotes
$name =~ s/,//g;
print "name minus commas = ", $name, "<br>\n";
# Put the modified name back between the quotes
$string =~ s/\"(.*?)\"/\"$name\"/;
print "new string = ", $string, "<br>\n";

You can use this kind of pattern:
$string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
pattern details:
(?: # two possible beginnings:
\G(?!\A) # contiguous to the previous match
| # OR
[^"]*" # all characters until an opening quote
)
[^",]* #"# all that is not a quote or a comma
\K # discard all previous characters from the match result
(?: # two possible cases:
, # a comma is found, so it will be replaced
| # OR
"(*SKIP)(*FAIL) #"# when the closing quote is reached, make the pattern fail
# and force the regex engine to not retry previous positions.
)
If you use an older perl version, \K and the backtracking control verbs may be not supported. In this case you can use this pattern with capture groups:
$string =~ s/((?:\G(?!\A)|[^"]*")[^",]*)(?:,|("[^"]*(?:"|\z)))/$1$2/g;

One way would be to use the nice module Text::ParseWords to isolate the specific field and perform a simple transliteration to get rid of the commas:
use strict;
use warnings;
use Text::ParseWords;
my $str = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
my #row = quotewords(',', 1, $str);
$row[2] =~ tr/,//d;
print join ",", #row;
Output:
06/14/2015,19:13:51,"Mrs Nkolika Nebedom" <ubabankoffice93#gmail.com>,1,2
I assume that no commas can appear legitimately in your email field. Otherwise some other replacement method is required.

How to match thing that is not in capture buffer in perl

Here is the sample script with my problem:
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
use feature 'say';
my $string = "aaabc";
my $re = qr/
^ # Start of line
(.) # Now \1 has 'a'
.*? #
([^\1]) # This is incorrect. It does not work as I need
# Here I need to match the thing that is not \1
# (in this case I need to match 'b')
/x;
if ($string =~ $re) {
say $1;
say $2;
} else {
say 'no match';
}

you need a Negative Lookahead. This will find the pattern and start the rest of the search from there. Meaning the next capture will be the one you seek.
my $re = qr/
^ # Start of line
(.) # Now \1 has 'a'
.*? # also (.)+? works as first expression.
(?!\1) # Negative Lookahead is non-capturing
(.) # $2 is b
/x;

As suggested by #DeVadder, you could make use of (?>pattern) which is:
an "independent" subexpression, one which matches the substring that a
standalone pattern would match if anchored at the given position, and
it matches nothing other than this substring.
my $re = qr/
^ # Start of line
(.) # Now \1 has 'a'
(?>\1*) # Matches \1
(.)
/x;
This would handle both cases as expected.

The regex searches captures first character and use it as \1*. Finally get a character that might be same as \1 or different if exists and check if $1 and $2 are same. If they are same then there is no character other than $1. If we have a character then we have a match and $1 ne $2.
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
use feature 'say';
while(<DATA>){
my $re = qr/^(.)\1*(.)/x;
if ($_=~$re && $1 ne $2) {
say $1;
say $2;
} else {
say 'no match';
}
}
__DATA__
aaaa
aaabc
abc
baacd
Output :
no match
a
b
a
b
b
a

Printing first instance of match in each line of file (Perl)

I have the following in an executable .pl file:
#!/usr/bin/env perl
$file = 'TfbG_peaks.txt';
open(INFO, $file) or die("Could not open file.");
foreach $line (<INFO>) {
if ($line =~ m/[^_]*(?=_)/){
#print $line; #this prints lines, which means there are matches
print $1; #but this prints nothing
}
}
Based on my reading at http://goo.gl/YlEN7 and http://goo.gl/VlwKe, print $1; should print the first match in each line, but it doesn't. Help!

No, $1 should print the string saved by so-called capture groups (created by the bracketing construct - ( ... )). For example:
if ($line =~ m/([^_]*)(?=_)/){
print $1;
# now this will print something,
# unless string begins from an underscore
# (which still matches the pattern, as * is read as 'zero or more instances')
# are you sure you don't need `+` here?
}
The pattern in your original code didn't have any capture groups, that's why $1 was empty (undef, to be precise) there. And (?=...) didn't count, as these were used to add a look-ahead subexpression.

$1 prints what the first capture ((...)) in the pattern captured.
Maybe you were thinking of
print $& if $line =~ /[^_]*(?=_)/; # BAD
or
print ${^MATCH} if $line =~ /[^_]*(?=_)/p; # 5.10+
But the following would be simpler (and work before 5.10):
print $1 if $line =~ /([^_]*)_/;
Note: You'll get a performance boost when the pattern doesn't match if you add a leading ^ or (?:^|_) (whichever is appropriate).
print $1 if $line =~ /^([^_]*)_/;

Perl Regex to match only 1 domain

I'm trying to create a regex which matches the following:
part1#domain.com
part1: where part1 is any 5 digit number from 0-9
part2: [optional] where #domain.com are all domains except #yahoo.com
example: 12345#yahoo.com
I'm not able to find how to insert a conditional into the regex. Now only my regex match digits + domain. Still need to figure out:
how to match only the digits
conditional to accept all domains except #yahoo.com
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $regex1 = '^(\d{5})([#]([a-zA-Z0-9_-]+?\.[a-zA-Z]{2,6})+?)';
while ( my $line = <DATA> ) {
chomp $line;
if ($line =~ /$regex1/)
{
print "MATCH FOR:\t$line \n";
}
}
Sample data:
1234
12345#
12345#tandberg
A12345#tandberg.com
12345
12345#tandberg.com
12345#cisco.com
12345#tandberg.amer.com
12345#tandberg.demo

why not simply first check for yahoo.com and if you get a match go to the next line:
while ( my $line = <DATA> ) {
chomp $line;
next if ($line =~ /yahoo\.com$/);
if ($line =~ /$regex1/)
{
print "MATCH FOR:\t$line \n";
}
}

How about this?
\d{5}(?:#(?!yahoo)[a-zA-Z0-9.]+\.[a-zA-Z]{2,3})?
In expanded form:
\d{5} # 5 digits
(?: # begin a grouping
# # literal # symbol
(?!yahoo\.com) # don't allow something that matches 'yahoo.com' to match here
[a-zA-Z0-9.]+ # one or more alphanumerics and periods
\. # a literal period
[a-zA-Z]{2,3} # 2-3 letters
) # end grouping
? # make the previous item (the group) optional
(?!yahoo\.com) is what's called a "negative lookahead assertion".

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Perl Extracting Text - regex

Based on the example input: if ($line =~ /command\d\s\+?=\s*["-]?(\w+)"?/i) { $extraction = $1; print "$extraction\n"; }

The following regular expression would help you: m{ (?<= = ) # Find an `=` \s* # Match 0 or more whitespaces (?: # Do not capture [ " \- ] # Match either a `"` or a `-` )? # Match once or never ( # Capture [^ " \s ]+ # Match anything but a `"` or a whitespace ) }x;

Related

How to get the string which is starting after front slash in Perl regex?

Perl Regex To Remove Commas Between Quotes?

How to match thing that is not in capture buffer in perl

Printing first instance of match in each line of file (Perl)

Perl Regex to match only 1 domain

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Perl Extracting Text - regex

Based on the example input: if ($line =~ /command\d*\s*\+?=\s*["-]?(\w+)"?/i) { $extraction = $1; print "$extraction\n"; }

The following regular expression would help you: m{ (?<= = ) # Find an `=` \s* # Match 0 or more whitespaces (?: # Do not capture [ " \- ] # Match either a `"` or a `-` )? # Match once or never ( # Capture [^ " \s ]+ # Match anything but a `"` or a whitespace ) }x;

Related

How to get the string which is starting after front slash in Perl regex?

Perl Regex To Remove Commas Between Quotes?

How to match thing that is not in capture buffer in perl

Printing first instance of match in each line of file (Perl)

Perl Regex to match only 1 domain

Categories

Resources

Based on the example input: if ($line =~ /command\d\s\+?=\s*["-]?(\w+)"?/i) { $extraction = $1; print "$extraction\n"; }