Perl Arrays and grep

Perl Arrays and grep - regex

I think it's more a question about characters. Anyway, I have a text file that consists of something like this:
COMPANY NAME

City

Addresss,
Address number

Email

phone number
and so on... (it repeats itself, but with different data). Let's assume this text is now in the $string variable.
I want to have an array (@row), for example:
$row[0] = "COMPANY NAME";
$row[1] = "City";
$row[2] = "Addresss,
Address number";
$row[3] = "Email";
$row[4] = "phone number";
At first I thought this could easily be done with grep, something like this:
1) @row = grep (/^^$/, $string);
No go!
2) @row = grep (/\n/, $string);
Still no go. I also tried split and such, still no go.
Any ideas?
Thanks,

FM has given an answer that works using split, but I wanted to point out that Perl makes this really easy if you're reading this data from a filehandle. All you need to do is to set the special variable $/ to an empty string. This puts Perl into "paragraph mode". In this mode each record returned by the file input operator will contain a paragraph of text rather than the usual line.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
local $/ = '';
my @row = <DATA>;
chomp @row;
print Dumper(\@row);
__DATA__
COMPANY NAME

City

Addresss,
Address number

Email

phone number
The output is:
$ ./addr
$VAR1 = [
          'COMPANY NAME',
          'City',
          'Addresss,
Address number',
          'Email',
          'phone number'
        ];

The way I understand your question, you want to grab the items separated by at least one blank line. Although /\n{2,}/ would be correct in a literal sense (split on two or more newlines), I would suggest the regex below, because it also handles nearly blank lines (those containing only whitespace characters).
use strict;
use warnings;
my $str = 'COMPANY NAME

City

Addresss,
Address number

Email

phone number';
my @items = split /\n\s*\n/, $str;

use strict;
use warnings;
my $string = "COMPANY NAME

City

Addresss,
Address number

Email

phone number";
my @string_parts = split /\n\n+/, $string;
foreach my $test (@string_parts) {
    print "$test\n";
}
OUTPUT:
COMPANY NAME
City
Addresss,
Address number
Email
phone number

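grep filters a list; it does not search inside a string. Given a single scalar such as $string, it sees a one-element list and returns either that whole string or nothing.
This is why you need to split the string on the token that you're after (as FM shows).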
While it isn't clear what you need this for, I would strongly recommend considering the Tie::File module:
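To illustrate the point about grep working on lists, here is a minimal sketch: split first to get a list, then let grep filter it. The sample text and the sub name records_from are made up for illustration and are not part of the original answers.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into blank-line-separated records,
# then use grep to drop any record with no non-whitespace character.
sub records_from {
    my ($text) = @_;
    return grep { /\S/ } split /\n\s*\n/, $text;
}

my $text = "COMPANY NAME\n\nCity\n\nAddresss,\nAddress number\n\nEmail\n\nphone number";
my @row  = records_from($text);

print scalar(@row), " records\n";
print "[$_]\n" for @row;
```

Here grep receives the list produced by split, so each record is tested on its own.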

Related

Perl regex to extract multiple matches from string

I have a string for example
id:123,createdby:'testuser1',"lastmodifiedby":'testuser2'.....
I want to extract the 2 user names (testuser1, testuser2) and save it to an array.

You don't need to do everything in one pattern. Do something simple in multiple matches:
my $string = qq(id:123,createdby:'testuser1',"lastmodifiedby":'testuser2');
my( $created_by ) = $string =~ /,createdby:'(.*?)'/;
my( $last_modified_by ) = $string =~ /,"lastmodifiedby":'(.*?)'/;
print <<"HERE";
Created: $created_by
Last modified by: $last_modified_by
HERE
But this looks like comma-separated data, and the data that you show is inconsistently quoted. I don't know if that's from you typing it out or if it's your actual data.
It also looks like it might have come from JSON. If that's true, there are much better ways to extract data.
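If the data really is JSON, a parser avoids regexes entirely. The following is a minimal sketch using JSON::PP, which ships with Perl since 5.14. The JSON text is an assumed, well-formed rewrite of the sample, since the string as shown in the question is not valid JSON.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Assumed input: the question's sample rewritten as valid JSON.
my $json = '{"id":123,"createdby":"testuser1","lastmodifiedby":"testuser2"}';

my $data = decode_json($json);

# A hash slice pulls both fields out in the order we name them.
my @users = @{$data}{qw(createdby lastmodifiedby)};

print "@users\n";
```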

Try this
use strict;
use warnings;
my $string = q[id:123,createdby:'testuser1',"lastmodifiedby":'testuser2'....];
my @matches = ($string =~ /,createdby:'(.+?)',"lastmodifiedby":'(.+?)'/);
print "@matches\n";
Outputs
testuser1 testuser2
The requirements then changed to allow for missing fields. To deal with that, try this
use strict;
use warnings;
my $string1 = q[id:123,createdby:'testuser1',"lastmodifiedby":'testuser2'....];
my $string2 = q[id:123,createdby:'testuser1'....];
for my $s ($string1, $string2)
{
    my @matches = ( $s =~ /(?:createdby|"lastmodifiedby"):'(.+?)'/g );
    print "@matches\n";
}
Outputs
testuser1 testuser2
testuser1

The problem description does not give enough detail, and the quoting inside the string is not consistent.
As already stated, the string may be part of a JSON block, in which case it should be handled by other means. Perhaps this assumption is correct, but it is not clearly stated in the question.
Please read "How do I ask a good question?" and "How to create a Minimal, Reproducible Example".
Otherwise, I assume that the quoting is just a typing error. A bigger data sample and a better problem description would significantly improve the question.
The following code sample demonstrates one possible approach to get the desired result. It assumes that the data fields do not include , or : (otherwise a different approach to processing the data is needed).
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my ($str, %data, @arr);
$str = "id:123,createdby:'testuser1','lastmodifiedby':'testuser2'";
$str =~ s/'//g;
%data = split(/[:,]/, $str);
say Dumper(\%data);
@arr = ($data{createdby}, $data{lastmodifiedby});
say Dumper(\@arr);
Output
$VAR1 = {
'id' => '123',
'createdby' => 'testuser1',
'lastmodifiedby' => 'testuser2'
};
$VAR1 = [
'testuser1',
'testuser2'
];
Another approach could be as follows
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my ($str, $re, @data, @arr);
$str = "id:123,createdby:'testuser1',\"lastmodifiedby\":'testuser2'";
@data = split(',', $str);
$re = qr/(createdby|lastmodifiedby)/;
for ( @data ) {
    next unless /$re/;
    s/['"]//g;
    my ($k, $v) = split(':', $_);
    push @arr, $v;
}
say Dumper(\@arr);
Output
$VAR1 = [
'testuser1',
'testuser2'
];

Perl Regex regular expression to split //

I went through s'flow and other sites for simple solution with regex in perl.
$str = q(//////);#
Say I have six or seven slashes, or other characters like q(aaaaa).
I want to split them into pairs, like ['//','//'].
I tried @my_split = split ( /\/\/,$str); but it didn't work.
Is it possible with a regex?
The reason for this question is that I have this domain name:
$site_name = q(http://www.yahoo.com/blah1/blah2.txt);
I wanted to split on a single slash to get the domain name, but I couldn't do it.
I tried
split( '/'{1,1}, $sitename); # didn't work. I expected it to split on one slash rather than two.
Thanks.

The question is rather unclear.
To break a string into pairs of consecutive characters:
my @pairs = $string =~ /(..)/g;
or to split a string on a double slash:
my @parts = split /\/\//, $string;
The separator pattern, in /.../, is an actual regex, so we need to escape any / inside it.
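As a side note on the escaping, here is a small sketch showing that choosing a different delimiter for the match operator removes the need for backslashes. The sample string is invented for illustration.

```perl
use strict;
use warnings;

my $string = 'a//b//c';

# Same regex, two delimiters: escaping is only needed when the
# delimiter character itself appears inside the pattern.
my @escaped = split /\/\//, $string;
my @braces  = split m{//},  $string;

print "@escaped\n";
print "@braces\n";
```

Both calls split on the same two-slash separator; the m{...} form is simply easier to read.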
But then you say you want to parse URI?
Use a module, please. For example, there is URI
use warnings;
use strict;
use feature 'say';
use URI;
my $string = q(http://www.yahoo.com/blah1/blah2.txt);
my $uri = URI->new($string);
say "Scheme: ", $uri->scheme;
say "Path: ", $uri->path;
say "Host: ", $uri->host;
# there's more, see docs
and then there's URI::Split
use URI::Split qw(uri_split uri_join);
my ($scheme, $auth, $path, $query, $frag) = uri_split($uri);
A number of other modules or frameworks, which you may already be using, nicely handle URIs.

Here's a quick way to split the full URL into its components:
my $u = q(http://www.yahoo.com/blah1/blah2.txt);
my ($protocol, $server, $path) = split(/:\/\/([^\/]+)/, $u);
print "($protocol, $server, $path)\n";
h/t @Mike

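The following code does the trick:
use strict;
use warnings;
use Data::Dumper;
while ( my $line = <DATA> ) {
    chomp $line;
    # Capture scheme, host, directory path and file name;
    # skip any line that does not have that shape.
    next unless $line =~ m{^(\w+)://([\w.-]+)/(.+)/([^/]+)$};
    my %url;
    @url{qw(proto dn path file)} = ($1, $2, $3, $4);
    print Dumper(\%url);
}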
__DATA__
http://www.yahoo.com/blah1/blah2.txt
http://www.google.com/dir1/dir2/dir3/file.ext
ftp://www.server.com/dir1/dir2/file.ext
https://www.inter.net/dir/file.ext

So it seems you want to simply get the Domain name:
my $url = q(http://www.yahoo.com/blah1/blah2.txt);
my @vars = split /\//, $url;
print $vars[2];
results:
www.yahoo.com

Pattern match in perl

my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
my $name = "";
@name = ( $line =~ m/Name:([\w\s\_\,/g );
foreach (@name) {
print $name."\n";
}
I want to capture the word between Name: and ,Region wherever it occurs in the whole line. The main catch is that the name can be in any format:
Amanda_Marry_Rose
Amanda.Marry.Rose
Amanda Marry Rose
Amanda/Marry/Rose
I need a help in capturing such a pattern every time it occurs in the line. So for the line I provided, the output should be
Amanda_Marry_Rose
Raghav.S.Thomas
Does anyone have any idea how to do this? I tried the line below, but it gives me the wrong output.
@name=($line=~m/Name:([\w\s\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\´]+)\,/g);
Output
Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE

To capture between Name: and the first comma, use a negated character class:
/Name:([^,]+)/g
This says to match one or more characters following Name: which isn't a comma:
while (/Name:([^,]+)/g) {
print $1, "\n";
}
This is more efficient than a non-greedy quantifier, e.g.:
/Name:(.+?),/g
as it doesn't require backtracking.
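As a quick check, the sketch below runs both patterns on a shortened version of the question's line (the shortening is mine) and confirms that they capture the same names.

```perl
use strict;
use warnings;

my $line = 'Name:Amanda_Marry_Rose,Region:US,Name:Raghav.S.Thomas,Region:UAE,';

my @negated   = $line =~ /Name:([^,]+)/g;   # stop at the first comma
my @nongreedy = $line =~ /Name:(.+?),/g;    # expand until a comma follows

print "$_\n" for @negated;
print "same result\n" if "@negated" eq "@nongreedy";
```

One practical difference: the negated class would still match a final name that has no trailing comma, while the non-greedy version requires the comma to be present.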

Regex corrected:
my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
my @name = ($line =~ /Name\:([\w\s_.\/]+)\,/g);
foreach my $name (@name) {
    print $name."\n";
}

What you have there is comma separated data. How you should parse this depends a lot on your data. If it is full-fledged csv data, the most safe approach is to use a proper csv parser, such as Text::CSV. If it is less strict data, you can get away with using the light-weight parser Text::ParseWords, which also has the benefit of being a core module in Perl 5. If what you have here is rather basic, user entered fields, then I would recommend split -- simply because when you know the delimiter, it is easier and safer to define it, than everything else inside it.
use strict;
use warnings;
use Data::Dumper;
my $line = "Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,";
# Simple split
my @fields = split /,/, $line;
print Dumper map /^Name:(.*)/, @fields;
use Text::ParseWords;
print Dumper map /^Name:(.*)/, quotewords(',', 0, $line);
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
});
$csv->parse($line);
print Dumper map /^Name:(.*)/, $csv->fields;
Each of these options gives the same output, save for the one that uses Text::CSV, which also issues an undefined warning, quite correctly, because your data has a trailing comma (meaning an empty field at the end).
Each of these has different strengths and weaknesses. Text::CSV can choke on data that does not conform with the CSV format, and split cannot handle embedded commas, such as Name:"Doe, John",....
The regex we use to extract the names very simply just captures the entire rest of the lines that begin with Name:. This also allows you to perform sanity checks on the field names, for example issue a warning if you suddenly find a field called Doe;Name:

The simple way is to look for all sequences of non-comma characters after every instance of Name: in the string.
use strict;
use warnings;
my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,';
my @names = $line =~ /Name:([^,]+)/g;
print "$_\n" for @names;
output
Amanda_Marry_Rose
Raghav.S.Thomas
However, it may well be useful to parse the data into an array of hashes so that related fields are gathered together.
use strict;
use warnings;
my $line = 'Name:Amanda_Marry_Rose,Region:US,host:USE,cardType:DebitCard,product:Satin,Name:Raghav.S.Thomas,Region:UAE,';
my %info;
my @persons;
while ( $line =~ / ([a-z]+) : ([^:,]+) /gix ) {
    my ($key, $val) = (lc $1, $2);
    if ($info{$key}) {
        push @persons, { %info };
        %info = ();
    }
    $info{$key} = $val;
}
push @persons, { %info };
use Data::Dump;
dd \@persons;
print "\nNames:\n";
print "$_\n" for map $_->{name}, @persons;
output
[
{
cardtype => "DebitCard",
host => "USE",
name => "Amanda_Marry_Rose",
product => "Satin",
region => "US",
},
{
name => "Raghav.S.Thomas",
region => "UAE",
},
]
Names:
Amanda_Marry_Rose
Raghav.S.Thomas

Count occurrences of an email address in a text file

I have a .txt file with many emails including headers. I'm just wondering how I would use perl to find out how many occurrences of the same email address are found in this text file?
Would it involve regular expressions?

You might find cpan: Email::Find useful. You could store the addresses you find in a hash table with email as the key and counter as value. You should be able to do that with the callback. Can you get started with this?
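The hash-counter idea can be sketched with core Perl only. Note that this does not use Email::Find: the pattern below is a deliberately simple stand-in and not a full address parser, and the sub name count_addresses and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Count how often each address-like token appears in a block of text.
# Keys are lower-cased so that Alice@... and alice@... count together.
sub count_addresses {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/g;
    return \%count;
}

my $text  = "From: alice\@example.com\nTo: bob\@example.com\nCc: Alice\@example.com\n";
my $count = count_addresses($text);

printf "%-20s %d\n", $_, $count->{$_} for sort keys %$count;
```

With a module such as Email::Find, the same increment would go inside the callback instead of the while loop.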

How about this script:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @email_list = ();
my %count;
while (my $line = <>) {
    foreach my $email (split /\s+/, $line) {
        if ( $email =~ /^[-\w.]+\@([a-z0-9][a-z-0-9]+\.)+[a-z]{2,4}$/i ) {
            push(@email_list, $email);
        }
    }
}
print "Total Email Count: ".scalar(@email_list)."\n\n";
$count{$_}++ for @email_list;
print Dumper(\%count);
Save it to a file such as email.pl and make it executable with chmod +x email.pl.
./email.pl file.txt
It will print the total number of email addresses found and count per email address.

If you want to find all email addresses, I recommend trying a module rather than writing your own regex. Correctly matching all email addresses gets quite complicated.
However, if you simply want to search for a given email address, you can accomplish this with a fairly simple regex:
#!/usr/bin/perl
use strict;
use warnings;
my $count = 0;
my $email = 'foo@bar.com';
while (<DATA>)
{
    $count++ while (m/(^|\s)\K\Q$email\E(?=\s|$)/g);
}
print "Found $email $count times";
__DATA__
foo@bar.com foo@bar.com
mr-foo@bar.com #not a match
old.foo@bar.com #not a match
blah blah blah foo@bar.com blah blah
foo@bar.commmm #not a match
Note that this requires the email address to be separated from any other content by whitespace.
A couple of notes:
\Q...\E is the quote-literal escape. It ensures that nothing in the email address is treated as a special regex character (without this, the . would match any character rather than a literal period).
(?=...) is a look-ahead assertion. It checks that its contents follow without including them in the actual match. This is important because a single space may come after one occurrence of the email and before another. In order to match both, you don't want the first match to "eat up" that space.
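A small sketch of the effect of \Q...\E, using an invented test string in which one token differs from the address only where the period should be:

```perl
use strict;
use warnings;

my $email = 'foo@bar.com';
my $text  = 'foo@barXcom foo@bar.com';

# Without \Q...\E the "." in $email is a wildcard, so "foo@barXcom" matches too.
my $unquoted = () = $text =~ /(?:^|\s)\K$email(?=\s|$)/g;

# With \Q...\E every character of $email is matched literally.
my $quoted = () = $text =~ /(?:^|\s)\K\Q$email\E(?=\s|$)/g;

print "unquoted: $unquoted, quoted: $quoted\n";
```

The second match in the unquoted case also relies on the look-ahead: the space between the two tokens is left unconsumed, so it can serve as the leading whitespace of the next match.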

Perl RegEx to find the portion of the email address before the @

I have the following issue in Perl. I have a file in which I get a list of emails as input.
I would like to extract the string before the '@' in each email address. (Later I will store all of these strings in an array.)
For example, given abcdefgh@gmail.com, I would like to parse the email address and extract abcdefgh.
My intention is to get only the string before the '@'. The question is how to do this with a regular expression. Or is there another method, using substr?
When I use the regular expression $mail =~ "\@" in Perl, it doesn't give me the result.
Also, how can I find the index of the character '@' in the string $mail?
I would appreciate it if anyone could help me out.
#!/usr/bin/perl
$mail = "abcdefgh@gmail.com";
if ($mail =~ "\@" ) {
    print("my name = You got it!");
}
else
{
    print("my name = Try again!");
}
In the above code, $mail =~ "\@" does not give me the desired output, but ($mail =~ "abc" ) does.
$mail =~ "@" works only if the string is given as $mail = "abcdefgh\@gmail.com";
But in my case, I will be getting the input with the email address as is,
not with an escape character.
Thanks,
Tom

Enabling warnings would have pointed out your problem:
#!/usr/bin/perl
use warnings;
$mail = "abcdefgh@gmail.com";
__END__
Possible unintended interpolation of @gmail in string at - line 3.
Name "main::gmail" used only once: possible typo at - line 3.
and enabling strict would have prevented it from even compiling:
#!/usr/bin/perl
use strict;
use warnings;
my $mail = "abcdefgh@gmail.com";
__END__
Possible unintended interpolation of @gmail in string at - line 4.
Global symbol "@gmail" requires explicit package name at - line 4.
Execution of - aborted due to compilation errors.
In other words, your problem wasn't the regex working or not working, it was that the string you were matching against contained "abcdefgh.com", not what you expected.

The @ sign is a metacharacter in double-quoted strings: it introduces array interpolation. If you put your email address in single quotes, you won't get that problem.
Also, I should add the obligatory comment that this is fine if you're just experimenting, but in production code you should not parse email addresses using regular expressions, but instead use a module such as Mail::Address.
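A minimal sketch of the two safe ways to write the literal, using the address from the question:

```perl
use strict;
use warnings;

my $single  = 'abcdefgh@gmail.com';     # single quotes: no interpolation
my $escaped = "abcdefgh\@gmail.com";    # double quotes: backslash keeps the @ literal

print "identical\n" if $single eq $escaped;
print "contains \@\n" if $single =~ /\@/;
```

When the address is read from a file or other input rather than written in the source, no escaping is needed at all, because interpolation only applies to string literals in the code.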

What if you tried this:
my $email = 'user@email.com';
$email =~ /^(.+?)@/;
print $1;
$1 will be everything before the @.

If you want the index of a character within a string, you can use the index() function, i.e.
my $email = 'foo@bar';
my $index = index($email, '@');
If you want to return the first half of the email, I'd use split() over regular expressions.
my $email = 'foo@bar';
my @result = split '@', $email;
my $username = $result[0];
Or even better, with substr:
my $username = substr($email, 0, index($email, '@'));
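One caveat with the substr form is that index() returns -1 when the character is not found, and substr would then silently return everything except the last character. A minimal sketch that guards against this, with local_part as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: return the part before the first "@",
# or undef when the string contains no "@" at all.
sub local_part {
    my ($email) = @_;
    my $at = index($email, '@');
    return undef if $at < 0;    # index() returns -1 when nothing is found
    return substr($email, 0, $at);
}

print local_part('abcdefgh@gmail.com'), "\n";
print defined(local_part('no-at-sign')) ? "found\n" : "no \@ found\n";
```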

$mail = 'abcdefgh@gmail.com';
$mail =~ /^([^@]*)@/;
print "$1\n";
