I need to match multiple patterns in the same line. For example, in this file:
Hello, Chester [McAllister;Scientist] lives in Boston [Massachusetts;USA;Fenway Park] # McAllister works in USA
I'm now working in New-York [NYC;USA] # I work in USA
...
First, I want to match every string inside the brackets, knowing that a line can contain more than one bracketed group and that each group holds 1 to n strings, always separated by semicolons.
Finally, for each line I need to compare those values to the string located after the #. For example, in the first sentence I want to compare:
[McAllister;Scientist] & [Massachusetts;USA;Fenway Park] TO "McAllister works in USA"
The tidiest way is probably to use a regex to find all the embedded sequences delimited by square brackets, and then use map with split to separate those sequences into terms.
This program demonstrates.
Note that I have assumed that all of the data in the file has been read into a single scalar variable. You can alter this to process a single line at a time, but only if the bracketed subsequences are never split across multiple lines.
use strict;
use warnings;
my $s = <<END_TEXT;
Hello, Chester [McAllister;Scientist] lives in Boston [Massachusetts;USA;Fenway Park] # McAllister works in USA
I'm now working in New-York [NYC;USA] # I work in USA
END_TEXT
my @data = map [ split /;/ ], $s =~ / \[ ( [^\[\]]+ ) \] /xg;
use Data::Dump;
dd \@data;
output
[
["McAllister", "Scientist"],
["Massachusetts", "USA", "Fenway Park"],
["NYC", "USA"],
]
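The second step of the question, comparing the bracketed terms on each line with the text after the #, is not covered above. A minimal sketch, assuming the data is processed one line at a time and that "compare" means a literal, case-sensitive substring check (both are assumptions, and the variable names are illustrative):

```perl
use strict;
use warnings;

my @lines = (
    'Hello, Chester [McAllister;Scientist] lives in Boston [Massachusetts;USA;Fenway Park] # McAllister works in USA',
    q{I'm now working in New-York [NYC;USA] # I work in USA},
);

my @found;    # one array ref of matching terms per input line
for my $line (@lines) {
    # Split once on the first '#': text before it, comment after it.
    my ($text, $comment) = split /\s*#\s*/, $line, 2;
    next unless defined $comment;

    # Flatten every bracketed group on the line into one list of terms.
    my @terms = map { split /;/ } $text =~ / \[ ( [^\[\]]+ ) \] /xg;

    # Keep the terms that occur literally in the comment (assumed rule).
    my @hits = grep { index($comment, $_) >= 0 } @terms;
    push @found, \@hits;
    print join(', ', @hits), "\n";
}
```

If the comparison should be case-insensitive or word-based instead, only the grep condition needs to change.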
Try this. It also gives what you expect.
use strict;
use warnings;
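open(my $fh, '<', 'file.txt') or die "Cannot open file.txt: $!";
# In list context, m//g without capture groups returns every full match.
my @z = map { m/\[[\w;\s]+\]/g } <$fh>;
print "$_ ," foreach @z;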
You actually need to match the words separated by the ; within the [].
I am trying to split texts into "steps"
Let's say my text is
my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";
I'd like the output to be:
"1.Do this."
"2.Then do that."
"3.And then maybe that."
"4.Complete!"
I'm not really that good with regex so help would be great!
I've tried many combinations like:
split /(\s\d.)/
But it splits the numbering away from the text.
I would indeed use split. But you need to exclude the digit from the match by using a lookahead.
my @steps = split /\s+(?=\d+\.)/, $steps;
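A quick check of that split on the sample string from the question (a sketch; only the print loop is added):

```perl
use strict;
use warnings;

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

# Split on whitespace only where a number and a period follow. The
# lookahead (?=...) is zero-width, so "2." stays attached to its step.
my @steps = split /\s+(?=\d+\.)/, $steps;

print qq{"$_"\n} for @steps;
```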
All step-descriptions start with a number followed by a period and then have non-numbers, until the next number. So capture all such patterns
my @s = $steps =~ / [0-9]+\. [^0-9]+ /xg;
say for @s;
This works only if the step descriptions are certain to contain no numbers, as is true of any approach that relies on matching a number (even one followed by a period, because of decimal numbers).†
If there may be numbers in there, we'd need to know more about the structure of the text.
Another delimiting pattern to consider is the punctuation that ends a sentence (. and ! in these examples), provided that no such characters occur inside a step's description and that no step contains multiple sentences:
my @s = $steps =~ / [0-9]+\. .*? [.!] /xg;
Augment the list of patterns that end an item's description as needed, say with a ?, and/or a ." sequence, since punctuation often goes inside quotes.‡
If an item can have multiple sentences, or can use end-of-sentence punctuation mid-sentence (as part of a quotation, perhaps), then tighten the condition for an item's end by combining the two footnotes: end-of-sentence punctuation that is followed by number+period or by the end of the string.
my @s = $steps =~ /[0-9]+\. .*? (?: \."|\!"|[.\!]) (?=\s+[0-9]+\. | \z)/xg;
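To illustrate the tightened pattern, here is a sketch run on a made-up string (not from the question) in which one step has two sentences and another ends with a quoted exclamation:

```perl
use strict;
use warnings;

# Hypothetical input: step 1 has two sentences, step 2 ends inside quotes.
my $steps = q{1.Do this. Then rest. 2.Say "stop!" 3.Done!};

# An item ends at sentence punctuation only when the next item's
# number+period, or the end of the string, follows.
my @s = $steps =~ /[0-9]+\. .*? (?: \."|\!"|[.\!]) (?=\s+[0-9]+\. | \z)/xg;

print "$_\n" for @s;
```

The period after "Do this" does not end the first item because no number+period follows it.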
If this isn't good enough either then we'd really need a more precise description of that text.
† An approach using a "numbers-period" pattern to delimit an item's description, like
/ [0-9]+\. .*? (?=\s+[0-9]+\. | \z) /xg;
(or in a lookahead in split) fails with text like
1. Only $2.50 or 1. Version 2.4.1 ...
‡ To include text like 1. Do "this." and 2. Or "that!" we'd want
/ [0-9]+\. .*? (?: \." | !" | [.!?]) /xg;
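The following sample code demonstrates the power of regex by filling the %steps hash in one line of code.
Once the data is obtained, you can slice and dice it any way your heart desires.
Inspect the sample for compliance with your problem. Note that the pattern requires a trailing period, so step 4, which ends in !, is not captured.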
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my($str,%steps,$re);
$str = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';
$re = qr/(\d+)\.(\D+)\./;
%steps = $str =~ /$re/g;
say Dumper(\%steps);
say "$_. $steps{$_}" for sort keys %steps;
Output
$VAR1 = {
'1' => 'Do this',
'2' => 'Then do that',
'3' => 'And then maybe that'
};
1. Do this
2. Then do that
3. And then maybe that
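Because the pattern above requires a closing period, step 4 is absent from that output. A hedged variant, assuming every step ends in . or ! immediately before whitespace or the end of the string, and that descriptions contain no digits (the adjusted pattern and the numeric sort are additions for illustration, not part of the answer):

```perl
use strict;
use warnings;

my $str = '1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!';

# (\d+)   step number       -> hash key
# (\D+?)  description, lazy -> hash value
# [.!]    closing punctuation, which must precede whitespace or the end
my %steps = $str =~ /(\d+)\.(\D+?)[.!](?=\s|\z)/g;

# Sort numerically so that a step 10 would follow step 9, not step 1.
print "$_. $steps{$_}\n" for sort { $a <=> $b } keys %steps;
```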
I want to split a string using repeating letters as delimiter, for example,
"123aaaa23a3" should be split as ('123', '23a3') while "123abc4" should be left unchanged.
So I tried this:
@s = split /([[:alpha:]])\1+/, '123aaaa23a3';
But this returns '123', 'a', '23a3', which is not what I wanted. Now I know that this is because the first 'a' in 'aaaa' is captured by the parentheses and thus preserved by split(). But anyway, I can't add something like ?: since [[:alpha:]] must be captured for the back reference.
How can I resolve this situation?
Hmm, it's an interesting one. My first thought would be: the captured delimiter will always land at the odd-numbered indices of the result, so you can just discard any odd-numbered array elements.
Something like this perhaps?:
my %s = (split (/([[:alpha:]])\1+/, '123aaaa23a3'), '' );
print Dumper \%s;
This'll give you:
$VAR1 = {
'23a3' => '',
'123' => 'a'
};
So you can extract your pattern via keys.
Unfortunately my second approach of 'selecting out' the pattern matches via %+ doesn't help particularly (split doesn't populate the regex stuff).
But something like this:
my @delims = '123aaaa23a3' =~ m/(?<delim>[[:alpha:]])\g{delim}+/g;
print Dumper \%+;
By using a named capture, we identify that a is from the capture group. Unfortunately, this doesn't seem to be populated when you do this via split - which might lead to a two-pass approach.
This is the closest I got:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $str = '123aaaa23a3';
#build a regex out of '2-or-more' characters.
my $regex = join ( "|", map { $_."{2,}"} $str =~ m/([[:alpha:]])\1+/g);
#make the regex non-capturing
$regex = qr/(?:$regex)/;
print "Using: $regex\n";
#split on the regex
my @s = split m/$regex/, $str;
print Dumper \@s;
We first process the string to extract "2-or-more" character patterns to use as our delimiters. Then we assemble a regex out of them, using a non-capturing group, so we can split.
One solution would be to use your original split call and throw away every other value. Conveniently, List::Util::pairkeys is a function that keeps the first of every pair of values in its input list:
use List::Util 1.29 qw( pairkeys );
my @vals = pairkeys split /([[:alpha:]])\1+/, '123aaaa23a3';
Gives
Odd number of elements in pairkeys at (eval 6) line 1.
[ '123', '23a3' ]
That warning comes from the fact that pairkeys wants an even-sized list. We can solve that by adding one more value at the end:
my @vals = pairkeys split( /([[:alpha:]])\1+/, '123aaaa23a3' ), undef;
Alternatively, and maybe a little neater, is to add that extra value at the start of the list and use pairvalues instead:
use List::Util 1.29 qw( pairvalues );
my @vals = pairvalues undef, split /([[:alpha:]])\1+/, '123aaaa23a3';
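If List::Util 1.29 is not available, the same "keep every other element" idea can be sketched in core Perl with an array slice over the even indices (the variable names here are illustrative):

```perl
use strict;
use warnings;

my %fields;    # input string => array ref of the fields kept
for my $str ('123aaaa23a3', '123abc4') {
    # split returns field, captured letter, field, captured letter, ...
    my @all = split /([[:alpha:]])\1+/, $str;

    # The fields sit at the even indices; the captures at the odd ones.
    my @vals = @all[ grep { $_ % 2 == 0 } 0 .. $#all ];

    $fields{$str} = \@vals;
    print "$str => (", join(', ', map { "'$_'" } @vals), ")\n";
}
```

A string without repeated letters yields a one-element list, so it passes through unchanged.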
The 'split' can be made to work directly by using the delayed execution assertion (aka postponed regular subexpression), (??{ code }), in the regular expression:
@s = split /[[:alpha:]](??{"$&+"})/, '123aaaa23a3';
(??{ code }) is documented on the 'perlre' manual page.
Note that, according to the 'perlvar' manual page, the use of $& anywhere in a program imposes a considerable performance penalty on all regular expression matches. I've never found this to be a problem, but YMMV.
I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger")
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (ie cat_dog) and end (ie chicken-pig) of the comma-separated list of animals as they are not prefixed/suffixed with a comma.
Also, as the variables will contain webpage content, it is inevitable that there will be instances where a comma is immediately followed by a space and then another word, as that is the correct way to use commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
(Note the space after each of the two commas.)
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to @matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my #matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}
print join("\n",@matches),"\n";
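A variant of the loop above that avoids $& by capturing each whole list in a group and collecting the matches in list context (a sketch; the second test string is the prose sentence from the question):

```perl
use strict;
use warnings;

my $str  = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my $text = "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.";

# The outer group captures each whole list; the inner group is made
# non-capturing so that only whole lists are returned by the match.
my $list_re = qr/([^,\s]+(?:,[^,\s]+)+)/;

my @matches = map { split /,/ } $str  =~ /$list_re/g;
my @none    = map { split /,/ } $text =~ /$list_re/g;

print join("\n", @matches), "\n";
print scalar(@none), " terms found in the prose sentence\n";
```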
Though you can probably construct a single regex, a combination of regexes, split, grep and map looks decent:
my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;
Going from right to left:
Split the line on spaces (split)
Leave only elements having no comma at the either end but having one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
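Put together on one of the question's sample strings, the pipeline might look like this (a sketch that passes $var to split explicitly instead of relying on $_):

```perl
use strict;
use warnings;

my $var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf";

my @array = map  { split /,/ }                   # 3. break each list into terms
            grep { !/^,/ && !/,$/ && /,/ }      # 2. keep tokens with inner commas only
            split ' ', $var;                     # 1. split the line on whitespace

print join("|", @array), "\n";
```

Tokens from ordinary prose such as "planet," end in a comma, so the grep stage drops them.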
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my #matches = map {
$_ =~ /$regex/;
if ($1) {
my $comma_sep_list = $1;
[split ',', $comma_sep_list];
}
else {
[]
}
} @strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
    push (@arr, $&);
}
The regular expression matches three cases:
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig
I have a CSV file that has 2 columns: an ID column and a free-text column. The ID column contains a 16-character alphanumeric id, but it may not be the only data present in the cell: the cell may be blank, may contain only the 16-character id, or may contain a bunch of stuff with the following buried in it: "user_id=xxxxxxxxxxxxxxxx"
What I want is to somehow extract the 16-character id from whichever cells have it. So I need to:
(a) ignore blank cells
(b) extract the whole cell's content if all it has is a continuous 16-character string with no spaces in between
(c) look for the pattern "user_id=" and then extract the 16 characters that immediately follow it
I see a lot of Perl scripts for either pattern matching or find/replace string etc., but I am not sure how I can do different kinds of parsing/pattern searching and extraction one after the other on the same column. As you may have already realized, I am fairly new to Perl.
I understand that you want to (1) skip lines that contain nothing, or that fail to match your spec. (2) Capture 16 non-space characters if they are the only content of the cell. (3) Capture 16 non-space characters following the literal pattern "user_id=".
If it's ok to capture space characters too, if they follow a "user_id=" literal, you can change \S to . in the appropriate place.
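The difference between \S and . can be seen on one of the sample cells (a small sketch; $strict and $loose are illustrative names):

```perl
use strict;
use warnings;

my $cell = 'user_id=abcd fghij lmnop';

# \S{16} needs 16 consecutive non-space characters, so this cell fails.
my ($strict) = $cell =~ /user_id=(\S{16})/;

# .{16} accepts any 16 characters except newline, spaces included.
my ($loose) = $cell =~ /user_id=(.{16})/;

print "strict: ", (defined $strict ? $strict : '(no match)'), "\n";
print "loose:  ", (defined $loose  ? $loose  : '(no match)'), "\n";
```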
My solution uses Text::CSV to handle the details of dealing with a CSV file. Here's how you might do it:
use strict;
use warnings;
use autodie;
use open ':encoding(utf8)';
use utf8;
use feature 'unicode_strings';
use Text::CSV;
binmode STDOUT, ':utf8';
my $csv = Text::CSV->new( {binary => 1} )
or die "Cannot use CSV: " . Text::CSV->error_diag;
while( my $row = $csv->getline( \*DATA ) ) {
my $column = $row->[0];
if( $column =~ m/^(\S{16})$/ || $column =~ m/user_id=(\S{16})/ ) {
print $1, "\n";
}
}
__DATA__
abcdefghijklmnop
user_id=abcdefghijklmnop
abcd fghij lmnop
randomdatAuser_id=abcdefghijklmnopMorerandomdata
user_id=abcd fghij lmnop
randomdatAuser_id=abcd fghij lmnopMorerandomdata
In your own code you would not be using the DATA filehandle, but I assume you know how to open a file already.
CSV is a format that is deceptively simple. Don't confuse its high readability with parsing simplicity though. When dealing with CSV, it's best to use a well-proven module to extract the columns. Other solutions can fail on quote-embedded commas, escaped commas, unbalanced quotes, and other irregularities that our brain fixes for us on the fly, but that make a pure-regex solution fragile.
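As an illustration of that fragility, the sketch below splits a made-up row naively and then parses it with the core module Text::ParseWords. That module is used here only to keep the example free of non-core dependencies; it is not the Text::CSV approach recommended above and does not cover every CSV dialect.

```perl
use strict;
use warnings;
use Text::ParseWords qw( parse_line );

# Hypothetical row: the first cell contains a comma inside quotes.
my $row = q{"note, user_id=abcdefghijklmnop","free text"};

my @naive  = split /,/, $row;             # breaks the quoted cell apart
my @parsed = parse_line(',', 0, $row);    # honours the quotes, then strips them

printf "naive split: %d fields\n", scalar @naive;
printf "quote-aware: %d fields\n", scalar @parsed;
```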
Well I can set you up with a basic file and regex commands that might do what you need (in a sorta basic format for someone not familiar with perl):
use strict;
use warnings;
open FILE, "<:utf8", "myfile.csv" or die "Cannot open myfile.csv: $!";
#"slurp" the file into an array, each element is a line
my @lines = <FILE>;
my @idArray;
foreach my $line (@lines){
    #make two captures, the first we can ignore and both are optional
    $line =~ /^(user_id=|)([A-Za-z0-9]{16}|),/;
    #for display purposes, this is just the second captured group
    my $id = $2;
    #if the group actually has something in it, add it to your final array
    if($id){ push @idArray, $id; }
}
}
For example, in the next example only lines 2 and 3 are valid, because cell 1 (column 1) either
is a string that is exactly 16 characters long, or
has the form "user=16charshere".
Anything else is not valid.
use 5.014;
use warnings;
while(<DATA>) {
chomp;
    my($col1, @remainder) = split /\t/;
say $2 if $col1 =~ m/^(|user=)(.{16})$/;
}
__DATA__
ToShort col2 not_valid
a123456789012345 col2 valid
user=b123456789012345 col2 valid
TooLongStringHereSoNotValidOne col2 not_valid
In this example the columns are TAB separated.
Please provide (a) some example data which can be used for testing solutions and (b) please try supplying code you have written so far for this problem.
However, you will probably want to go through all rows of your table, split each one into fields, perform all your operations on a certain field, apply your business logic, and then write everything back.
Problem (c) is solved by $idField =~ /user_id=(.{16})/; my $id = $1;
If the user_id always appears at the beginning of a line, this does the trick: for (<FILE>) {/^user_id=(.{16})/; ...}
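Requirements (a) to (c) can be combined in one small helper. A sketch, assuming the cell has already been extracted from the row and that the id consists of non-space characters; extract_id is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 16-character id, or undef if there is none.
sub extract_id {
    my ($cell) = @_;
    return undef unless defined $cell && length $cell;    # (a) blank cell
    return $1 if $cell =~ /\A(\S{16})\z/;                 # (b) the cell is the id
    return $1 if $cell =~ /user_id=(\S{16})/;             # (c) id after the marker
    return undef;                                         # nothing usable
}

for my $cell ('', 'abcdefghijklmnop', 'junk user_id=ABCDEFGHIJKLMNOP more', 'too short') {
    my $id = extract_id($cell);
    printf "%-40s => %s\n", "'$cell'", defined $id ? $id : '(none)';
}
```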
I need a rather complicated regular expression that will select words with one space between them and that can include the '-' symbol in them, it should not however select continuous whitespace.
'KENEDY JOHN G JR E' 'example' 'D-54'
I have tried the following regular expression:
\'([\s\w-]+)\'
but it selects continuous whitespace which I don't want it to do.
I want the expression to select
'KENEDY JOHN G JR E'
'example'
'D-54'
Perhaps,
\'([\w-]+(?:\s[\w-]+)*)\'
?
EDIT
If leading/trailing dashes (on the word boundaries) are not allowed, this should read:
/\'(\w+(?:[\s-]\w+)*)\'/
An expression like this should do it:
'[\w-]+(?:\s[\w-]+)*'
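A quick way to check that this expression skips quoted strings containing consecutive whitespace (a sketch; the middle item in the test string is made up):

```perl
use strict;
use warnings;

# Hypothetical input: the second quoted item contains two spaces in a row.
my $data = "'KENEDY JOHN G JR E' 'two  spaces' 'D-54'";

# Words of [\w-] characters, each separated by exactly one whitespace
# character. Without capture groups, m//g returns the full matches.
my @found = $data =~ /'[\w-]+(?:\s[\w-]+)*'/g;

print "$_\n" for @found;
```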
Try this:
my $data = "'KENEDY JOHN G JR E' 'example' 'D-54'";
# Sets of
# one or more word characters or dash
# followed by an optional space
# enclosed in single quotes
#
# The outermost ()s are optional. They're just
# there so I can print the match easily as $1.
while ($data =~ /(\'([\w-]+\s?)+\')/g)
{
print $1, "\n";
}
outputs
'KENEDY JOHN G JR E'
'example'
'D-54'
Not sure if this applies to you, since you asked for a regex specifically. However, if you want strings separated by two or more whitespace or dashes, you can use split
use strict;
use warnings;
use v5.10;
my $str = q('KENEDY JOHN G JR E' 'example' 'D-54');
my @match = split /\s{2,}/, $str;
say for @match;
A regex with similar functionality would be
my @match = $str =~ /(.*?)(?:\s{2,}|$)/g;
Note that you'll need the edge case of finding end of line $.
The benefit of using split or the wildcard . is that you rely on whitespace to define your fields, not the content of the fields themselves.
Your code actually works as is.
use feature qw( say );
$_ = "'KENEDY JOHN G JR E' 'example' 'D-54'";
say for /\'([\s\w-]+)\'/g;
output:
KENEDY JOHN G JR E
example
D-54
(Move the parens if you want the quotes too.)
I would simply use
my @data = /'([^']*)'/g;
If you have any validation to do, do it afterwards.
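One way to read "do it afterwards" is to extract everything between quotes first and filter second. A sketch, assuming the validation rule is the one from the question (words of \w and - separated by single spaces); the sample string and variable names are made up:

```perl
use strict;
use warnings;

my $input = "'KENEDY JOHN G JR E' 'bad  spacing' 'D-54'";

# Step 1: grab whatever sits between each pair of single quotes.
my @data = $input =~ /'([^']*)'/g;

# Step 2: validate separately, rejecting runs of whitespace.
my @valid = grep { /\A[\w-]+(?: [\w-]+)*\z/ } @data;

print "$_\n" for @valid;
```

Keeping the two steps apart means the extraction regex stays trivial and the validation rule can change without touching it.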