Perl Regex, get strings between two strings - regex

I am new to Perl and trying to use Regex to get a piece of string between two tags that I know will be there in that string. I already tried various answers from stackoverflow but none of them seems to be working for me. Here's my example...
The required data is in $info variable out of which I want to get the useful data
my $info = "random text i do not want\n|BIRTH PLACE=Boston, MA\n|more unwanted random text";
The Useful Data in the above string is Boston, MA. I removed the newlines from the string by $info =~ s/\n//g;. Now $info has this string "random text i do not want|BIRTH PLACE=Boston, MA|more unwanted random text". I thought doing this will help me capture the required data easily.
Please help me in getting the required data. I am sure that the data will always be preceded by |BIRTH PLACE= and followed by |. Everything before and after that is unwanted text. If a question like this has already been answered, please point me to it as well. Thanks.

Instead of replacing everything around it, you could search for /\|BIRTH PLACE=([^\|]+)\n\|/ while the newlines are still in the string, [^\|]+ being one or more of anything that is not a pipe.

$info =~ m{\|BIRTH PLACE=(.*?)\|} or die "There is no data in \$info?!";
my $birth_place = $1;
That should do the trick.
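As a minimal sketch of the same idea in reusable form, the helper below returns whatever sits between two literal markers. The name between and its argument order are hypothetical, not something from the answers above; \Q...\E is used so that the pipe in the marker is matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between the first $start and the
# next $end, or undef when the pair is not found.
sub between {
    my ($text, $start, $end) = @_;
    # \Q...\E quotes metacharacters such as '|' so the markers match literally;
    # /s lets '.' cross newlines in case they were not stripped.
    return $text =~ /\Q$start\E(.*?)\Q$end\E/s ? $1 : undef;
}

my $info  = "random text i do not want|BIRTH PLACE=Boston, MA|more unwanted random text";
my $place = between($info, '|BIRTH PLACE=', '|');
print defined $place ? "$place\n" : "not found\n";
```

If the newlines are left in the string, the captured value would also include the trailing newline, so it may need a chomp.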

You know, actually, those newlines might have helped you. I would have gone for an initial regular expression of:
/^\|BIRTH PLACE=(.*)$/m
Using the multiline modifier (m) to match ^ at the beginning of a line and $ at the end of it, instead of just matching at the beginning and end of the string. Heck, you can even get really crazy and match:
/(?<=^\|BIRTH PLACE=).+$/m
To capture only the information you want, using lookbehind ((?<= ... )) to assert that it's the birth place information.
Why curse the string twice when you can do it once?
So, in perl:
if ($info =~ m/(?<=^\|BIRTH PLACE=).+$/m) {
print "Born in $&.\n";
} else {
print "From parts unknown";
}
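A capture group gives the same result without relying on $&, which older Perl versions penalised in performance. The sketch below assumes the original string with its newlines intact; the helper name birth_place is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the birth place out of the original,
# newline-separated string using /m and a capture group.
sub birth_place {
    my ($text) = @_;
    # With /m, ^ and $ match at line boundaries; '.' stops at the newline.
    return $text =~ /^\|BIRTH PLACE=(.*)$/m ? $1 : undef;
}

my $info  = "random text i do not want\n|BIRTH PLACE=Boston, MA\n|more unwanted random text";
my $place = birth_place($info);
print defined $place ? "Born in $place.\n" : "From parts unknown\n";
```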

You have presumably read this data from a file, which is a bad start. Your program should look like this:
use strict;
use warnings;
use autodie;
open my $fh, '<', 'myfile';
my $pob;
while (<$fh>) {
if (/BIRTH PLACE=(.+)/) {
$pob = $1;
last;
}
}
print $pob;
output
Boston, MA

Related

Extracting first two words in perl using regex

I want to extract the first two words from a sentence using a Perl function in PostgreSQL. In PostgreSQL, I can do this with:
text = "I am trying to make this work";
Select substring(text from '(^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)');
It would return "I am".
I tried to build a Perl function in Postgresql that does the same thing.
CREATE OR REPLACE FUNCTION extract_first_two (text)
RETURNS text AS
$$
my $my_text = $_[0];
my $temp;
$pattern = '^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)';
my $regex = qr/$pattern/;
if ($my_text=~ $regex) {
$temp = $1;
}
return $temp;
$$ LANGUAGE plperl;
But I receive a syntax error near the regular expression. I am not sure what I am doing wrong.
Extracting words is non-trivial even in English. Take the following contrived example using Locale::CLDR:
use Locale::CLDR;
my $locale = Locale::CLDR->new('en');
my @words = $locale->split_words('adf543. 123.25');
@words now contains
adf543
.
123.25
Note that the full stop after adf543 is split into a separate word, but the one between 123 and 25 is kept as part of the number 123.25, even though the '.' is the same character.
It gets worse when you look at non-English languages, and much worse when you use non-Latin scripts.
You need to define precisely what you think a word is, otherwise the following French gets split incorrectly.
Je avais dit «Elle a dit «Il a dit «Ni» il ya trois secondes»»
The parentheses are mismatched in your regex pattern. It has three opening parentheses and four closing ones.
Also, you have two single quotes in the middle of a singly-quoted string, so
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)'
is parsed as two separate strings
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'
and
')?(\s+)?\w+)'
But I can't suggest how to fix it as I don't understand your intention.
Did you mean a double quote perhaps? In which case (!|,|\&|")? can be written as [!,&"]?
Update
At a rough guess I think you want this
my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
$temp = $1 if $my_text=~ /($regex)/;
but I can't be sure. If you describe what you're looking for in English then I can help you better. For instance, it's unclear why you don't have question marks, full stops, and semicolons in the list of intervening punctuation.
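To see how that guessed pattern behaves outside PL/Perl, here is a small sketch in plain Perl. The function name first_two_words is hypothetical, and the pattern is only the rough guess given above, so it shares its limits (for example, it allows no full stops or semicolons between the words).

```perl
use strict;
use warnings;

# Hypothetical helper built around the guessed pattern: a word, optional
# punctuation from a small set, then a second word.
sub first_two_words {
    my ($text) = @_;
    my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
    return $text =~ /($regex)/ ? $1 : undef;
}

for my $sentence ('I am trying to make this work', 'well-known fact', 'Tom & Jerry show') {
    my $words = first_two_words($sentence);
    print defined $words ? "$words\n" : "(no match)\n";
}
```

The body of the sub should carry over to a PL/Perl function largely unchanged, since it contains no single quotes that need doubling.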

Find the occurrences of a particular string in a given string?

Hi, I am working in Perl and need to check whether a given string occurs in a set of arrays. I have tried a lot but can't get it to work. Can anybody help?
open FILE, "<", "tab_name.txt" or die $!;
my @tabName=<FILE>;
my @all_files= <*>;
foreach my $tab(@tabName){
$_=$tab;
my $pgr="PGR Usage";
if(m/$pgr/)
{
for(my $t=0;scalar @all_files;$t++){
my $file_name='PGR.csv';
$_=$all_files[$t];
if(m\$file_name\)
{
print $file_name;
}
}
print "\n$tab\n";
}
}
Here is a problem:
for(my $t=0;scalar @all_files;$t++){
The second part of the for loop needs to be a condition that eventually becomes false, such as:
for(my $t=0;$t < @all_files;$t++){
Your code as written will never end, because scalar @all_files is true for any non-empty array.
However, this is much better:
foreach (@all_files){
In addition, you have a problem with your regex. A variable in a regex is treated as a regular expression. . is a special character matching anything. Thus, your code would match PGR.csv, but also PGRacsv, etc. And it would also match filenames where that is a part of the name, such as FOO_PGR.csvblah. To solve this:
Use quote literal (\Q...\E) to make sure the filename is treated literally.
Use anchors to match the beginning and end of the string (^, $).
Also, backslashes are valid as regex delimiters, but they are a strange choice.
The corrected regex looks like this:
m/^\Q$file_name\E$/
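The difference between the loose and the anchored, quoted pattern can be checked with a short sketch. The helper name is_exact_name and the list of candidate names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when $candidate is exactly $file_name.
sub is_exact_name {
    my ($candidate, $file_name) = @_;
    # \Q...\E makes the '.' literal; ^ and $ anchor both ends of the string.
    return $candidate =~ /^\Q$file_name\E$/ ? 1 : 0;
}

my @candidates = ('PGR.csv', 'PGRacsv', 'FOO_PGR.csvblah');
for my $name (@candidates) {
    my $loose  = $name =~ /PGR.csv/ ? 'yes' : 'no';    # '.' matches any character
    my $strict = is_exact_name($name, 'PGR.csv') ? 'yes' : 'no';
    print "${name}: loose=$loose strict=$strict\n";
}
```

Note that $ also matches just before a trailing newline; \z could be used instead if that matters.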
Also, you should put this at the top of every script you write:
use warnings;
use strict;
This line:
for(my $t=0;scalar @all_files;$t++){
produces an infinite loop; you'd better use:
for(my $t=0;$t < @all_files;$t++){
Aside from the problems you have going through the array, are you looking for substr?

Perl: string manipulation - surrounding a word with a character '@'

I am trying to extract email addresses from a txt file. I've thought about surrounding words that contain the '@' character. Does anybody know an expression to do that?
Whenever you need a reasonably common matching problem solved in Perl, you should always first check the Regexp::Common family on CPAN. In this case: Regexp::Common::Email::Address. From the POD synopsis:
use Regexp::Common qw[Email::Address];
use Email::Address;
while (<>) {
my (@found) = /($RE{Email}{Address})/g;
my (@addrs) = map $_->address, Email::Address->parse("@found");
print "X-Addresses: ", join(", ", @addrs), "\n";
}
Here's a very quick and dirty regex which will match non-whitespace characters on either side of an @:
/\S+@\S+/
This will match john.smith@example.com in
some rubbish text john.smith@example.com more rubbish text
Hope this helps.
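A minimal sketch of using the quick pattern to collect every match on a line, with the at sign as the separator. The helper name find_addresses is hypothetical, and the pattern does no validation: it will also pick up trailing punctuation stuck to an address.

```perl
use strict;
use warnings;

# Hypothetical helper: return every whitespace-delimited token containing '@'.
sub find_addresses {
    my ($text) = @_;
    # \@ is escaped so Perl never tries to interpolate an array here;
    # /g in list context returns all captured matches.
    my @found = $text =~ /(\S+\@\S+)/g;
    return @found;
}

my $line = 'some rubbish text john.smith@example.com more rubbish text';
print "$_\n" for find_addresses($line);
```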

Perl - Search for specific string in csv and extract characters that immediately follow it

I have a csv file that has 2 columns: an ID and a free text column. The ID column contains a 16-character alphanumeric id but it may not be the only data present in the cell: it may be a blank cell, or a cell that contains only the 16-character id, or contain a bunch of stuff along with the following buried in it - "user_id=xxxxxxxxxxxxxxxx"
What I want is to somehow extract the 16-character id from whichever cells have it. So I need to:
(a) ignore blank cells
(b) extract the whole cell's content if all it has is a continuous 16-character string with no spaces in between
(c) look for the pattern "user_id=" and then extract the 16 characters that immediately follow it
I see a lot of Perl scripts for either pattern matching or find/replace string etc., but I am not sure how I can do different kinds of parsing/pattern searching and extraction one after the other on the same column. As you may have already realized, I am fairly new to Perl.
I understand that you want to (1) skip lines that contain nothing, or that fail to match your spec. (2) Capture 16 non-space characters if they are the only content of the cell. (3) Capture 16 non-space characters following the literal pattern "user_id=".
If it's ok to capture space characters too, if they follow a "user_id=" literal, you can change \S to . in the appropriate place.
My solution uses Text::CSV to handle the details of dealing with a CSV file. Here's how you might do it:
use strict;
use warnings;
use autodie;
use open ':encoding(utf8)';
use utf8;
use feature 'unicode_strings';
use Text::CSV;
binmode STDOUT, ':utf8';
my $csv = Text::CSV->new( {binary => 1} )
or die "Cannot use CSV: " . Text::CSV->error_diag;
while( my $row = $csv->getline( \*DATA ) ) {
my $column = $row->[0];
if( $column =~ m/^(\S{16})$/ || $column =~ m/user_id=(\S{16})/ ) {
print $1, "\n";
}
}
__DATA__
abcdefghijklmnop
user_id=abcdefghijklmnop
abcd fghij lmnop
randomdatAuser_id=abcdefghijklmnopMorerandomdata
user_id=abcd fghij lmnop
randomdatAuser_id=abcd fghij lmnopMorerandomdata
In your own code you would not be using the DATA filehandle, but I assume you know how to open a file already.
CSV is a format that is deceptively simple. Don't confuse its high readability with parsing simplicity though. When dealing with CSV, it's best to use a well-proven module to extract the columns. Other solutions can fail on quote-embedded commas, escaped commas, unbalanced quotes, and other irregularities that our brain fixes for us on the fly, but that make a pure-regex solution fragile.
Well I can set you up with a basic file and regex commands that might do what you need (in a sorta basic format for someone not familiar with perl):
use strict;
use warnings;
open FILE, "<:utf8", "myfile.csv" or die "Cannot open myfile.csv: $!";
#"slurp" the file into an array, each element is a line
my @lines = <FILE>;
my @idArray;
foreach my $line (@lines){
#make two captures, the first we can ignore and both are optional;
#skip lines that do not match, so $2 is only read after a successful match
next unless $line =~ /^(user_id=|)([A-Za-z0-9]{16}|),/;
#for display purposes, this is just the second captured group
my $id = $2;
#if the group actually has something in it, add it to your final array
if($id){ push @idArray, $id; }
}
For example, in the data below only lines 2 and 3 are valid, because the first cell (column 1) must either
be a string that is exactly 16 characters long, or
have the form "user=16charshere".
Anything else is not valid.
use 5.014;
use warnings;
while(<DATA>) {
chomp;
my($col1, @remainder) = split /\t/;
say $2 if $col1 =~ m/^(|user=)(.{16})$/;
}
__DATA__
ToShort col2 not_valid
a123456789012345 col2 valid
user=b123456789012345 col2 valid
TooLongStringHereSoNotValidOne col2 not_valid
In this example the columns are TAB separated.
Please provide (a) some example data which can be used for testing solutions and (b) please try supplying code you have written so far for this problem.
However, you will probably want to go through all rows of your table, split each row into fields, perform all your operations on a certain field, apply your business logic, and then write everything back.
Problem (c) is solved by $idField =~ /user_id=(.{16})/; my $id = $1;
If the user_id always appears at the beginning of a line, this does the trick: for (<FILE>) {/^user_id=(.{16})/; ...}
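The three rules from the question can also be sketched as one function applied to a single cell, independent of how the CSV itself is read. The name extract_id and the sample cells are hypothetical, and \S{16} is an assumption about what counts as an id character.

```perl
use strict;
use warnings;

# Hypothetical helper applying the three rules to the contents of one cell.
sub extract_id {
    my ($cell) = @_;
    return undef unless defined $cell && length $cell;    # (a) blank cell
    return $1 if $cell =~ /^(\S{16})$/;                   # (b) cell is a bare 16-character id
    return $1 if $cell =~ /user_id=(\S{16})/;             # (c) id follows "user_id="
    return undef;                                         # nothing usable in this cell
}

for my $cell ('', 'abcdefghijklmnop', 'junk user_id=abcdefghijklmnop more', 'abcd fghij lmnop') {
    my $id = extract_id($cell);
    print defined $id ? "$id\n" : "(no id)\n";
}
```

Trying the rules in order with early returns keeps each pattern simple; the cell itself would still best come from a CSV module rather than a hand-written split.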

Perl: Parse maillog to get date/recipient in a single regex statement

I'm trying to parse my maillog, which contains a number of lines which look similar to the following line:
Jun 6 17:52:06 host sendmail[30794]: p569q3sX030792: to=<person@recipient.com>, ctladdr=<apache@host.com> (48/48), delay=00:00:03, xdelay=00:00:03, mailer=esmtp, pri=121354, relay=gmail-smtp-in.l.google.com. [1.2.3.4], dsn=2.0.0, stat=Sent (OK 1307354043 x8si28599066ict.63)
The rules I'm trying to apply are:
The date is always the first 2 words
The email address always occurs between " to=person@recipient.com, " however the email address might be surrounded by <>
There are some lines in the log which do not relate to a recipient, so I'd like to ignore those lines entirely.
The following code works for either rule individually, however I'm having trouble combining them:
if($_ =~ m/\ to=([<>a-zA-Z0-9\.\@]*),\ /g) {
print "$1\n";
}
if($_ =~ /^+(\S+\s+\S+\s)/g) {
print "$1\n";
}
As always, I'm not sure whether the regex I'm using above is "best practice" so feel free to point out anything I'm doing badly there too :)
Thanks!
print substr($_, 0, 7), "$1\n" if / to=(.+?), /;
Your date is in a fixed-length format, you don't need a regular expression to match it.
For the address, what you need is the part between to= and the next ,, so a non-greedy match is just what you need.
To match either with one regex, OR them together using the syntax (regex1|regex2):
((?<=\ to=)[<>a-zA-Z0-9\.\@]*(?=,\ )|^\S+\s+\S+\s)
The outer parentheses keep the match assigned to $1.
The lookbehind (?<=\ to=) and lookahead (?=,\ ) do not capture anything, so this regex captures only your target string.
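Another way to combine the two rules is a single pattern with two capture groups, one for the date and one for the recipient, which also skips lines that have no recipient. This is only a sketch: the helper name parse_maillog_line is hypothetical and the sample line is a shortened version of the one in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return (date, recipient) for a delivery line,
# or an empty list for lines without a " to=" field.
sub parse_maillog_line {
    my ($line) = @_;
    # Group 1: the first two words. Group 2: the address after " to=",
    # with optional angle brackets left out of the capture.
    return $line =~ /^(\S+\s+\S+)\s.*?\sto=<?([^<>,\s]+)>?,\s/
        ? ($1, $2)
        : ();
}

my $sample = 'Jun 6 17:52:06 host sendmail[30794]: p569q3sX030792: '
           . 'to=<person@recipient.com>, ctladdr=<apache@host.com> (48/48), stat=Sent';
if (my ($date, $to) = parse_maillog_line($sample)) {
    print "$date $to\n";
}
```

Because the function returns an empty list on failure, the list assignment inside the if is false for lines that do not relate to a recipient.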