Regular expression for selecting single spaced phrases but not whitespace - regex

I need a rather complicated regular expression that selects words separated by a single space, where the words may include the '-' symbol. It should not, however, select continuous whitespace.
'KENEDY JOHN G JR E' 'example' 'D-54'
I have tried the following regular expression:
\'([\s\w-]+)\'
but it selects continuous whitespace which I don't want it to do.
I want the expression to select
'KENEDY JOHN G JR E'
'example'
'D-54'

Perhaps,
\'([\w-]+(?:\s[\w-]+)*)\'
?
EDIT
If leading/trailing dashes (on the word boundaries) are not allowed, this should read:
/\'(\w+(?:[\s-]\w+)*)\'/

An expression like this should do it:
'[\w-]+(?:\s[\w-]+)*'
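A quick way to check a pattern like this is to feed it a phrase that contains a double space. The sketch below is only an illustration: the 'bad  spacing' entry is a made-up test case, not part of the original data, and the capturing parentheses are added so that only the text inside the quotes is returned.

```perl
use strict;
use warnings;

# Hypothetical sample: the second phrase contains a double space on purpose.
my $data = "'KENEDY JOHN G JR E' 'bad  spacing' 'D-54'";

# Words made of [\w-] characters, joined by exactly one whitespace character.
my @phrases = $data =~ /'([\w-]+(?:\s[\w-]+)*)'/g;

# The double-spaced phrase is skipped; the other two are printed.
print "$_\n" for @phrases;
```

Because each word must be followed either by a single whitespace character and another word or by the closing quote, a run of two spaces can never be part of a match.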

Try this:
my $data = "'KENEDY JOHN G JR E' 'example' 'D-54'";
# Sets of
# one or more word characters or dash
# followed by an optional space
# enclosed in single quotes
#
# The outermost ()s are optional. They're just
# there so I can print the match easily as $1.
while ($data =~ /(\'([\w-]+\s?)+\')/g)
{
print $1, "\n";
}
outputs
'KENEDY JOHN G JR E'
'example'
'D-54'

Not sure if this applies to you, since you asked for a regex specifically. However, if your strings are separated by two or more whitespace characters, you can use split:
use strict;
use warnings;
use v5.10;
my $str = q('KENEDY JOHN G JR E' 'example' 'D-54');
my @match = split /\s{2,}/, $str;
say for @match;
A regex with similar functionality would be
my @match = $str =~ /(.*?)(?:\s{2,}|$)/g;
Note that you'll need the edge case of finding end of line $.
The benefit of using split or the wildcard . is that you rely on whitespace to define your fields, not the content of the fields themselves.

Your code actually works as is.
use feature qw( say );
$_ = "'KENEDY JOHN G JR E' 'example' 'D-54'";
say for /\'([\s\w-]+)\'/g;
output:
KENEDY JOHN G JR E
example
D-54
(Move the parens if you want the quotes too.)
I would simply use
my @data = /'([^']*)'/g;
If you have any validation to do, do it afterwards.
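As a rough sketch of that "extract first, validate afterwards" idea: the validation rule below (words of [\w-] separated by single spaces) is my reading of the question, and the 'two  spaces' entry is an invented test case.

```perl
use strict;
use warnings;

# Hypothetical input; the middle phrase is deliberately invalid.
my $line = "'KENEDY JOHN G JR E' 'two  spaces' 'D-54'";

# Step 1: take whatever sits between each pair of single quotes.
my @data = $line =~ /'([^']*)'/g;

# Step 2: validate separately (assumed rule: single spaces between words).
my @valid = grep { /\A[\w-]+(?: [\w-]+)*\z/ } @data;

print "$_\n" for @valid;
```

Keeping the two steps apart means the extraction regex stays trivial, and the validation rule can change without touching it.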

Related

Extracting first two words in perl using regex

I want to extract the first two words from a sentence using a Perl function in PostgreSQL. In PostgreSQL, I can do this with:
text = "I am trying to make this work";
Select substring(text from '(^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)');
It would return "I am".
I tried to build a Perl function in Postgresql that does the same thing.
CREATE OR REPLACE FUNCTION extract_first_two (text)
RETURNS text AS
$$
my $my_text = $_[0];
my $temp;
$pattern = '^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)';
my $regex = qr/$pattern/;
if ($my_text=~ $regex) {
$temp = $1;
}
return $temp;
$$ LANGUAGE plperl;
But I receive a syntax error near the regular expression. I am not sure what I am doing wrong.
Extracting words is non-trivial even in English. Take the following contrived example using Locale::CLDR
use Locale::CLDR;
my $locale = Locale::CLDR->new('en');
my @words = $locale->split_words('adf543. 123.25');
@words now contains
adf543
.
123.25
Note that the full stop after adf543 is split into a separate word, but the one between 123 and 25 is kept as part of the number 123.25, even though the '.' is the same character.
It gets worse when you look at non-English languages, and much worse when you use non-Latin scripts.
You need to define precisely what you think a word is, otherwise the following French gets split incorrectly.
Je avais dit «Elle a dit «Il a dit «Ni» il ya trois secondes»»
The parentheses are mismatched in your regex pattern. It has three opening parentheses and four closing ones.
Also, you have two single quotes in the middle of a singly-quoted string, so
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)'
is parsed as two separate strings
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'
and
')?(\s+)?\w+)'
But I can't suggest how to fix it as I don't understand your intention.
Did you mean a double quote perhaps? In which case (!|,|\&|")? can be written as [!,&"]?
Update
At a rough guess I think you want this
my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
$temp = $1 if $my_text=~ /($regex)/;
but I can't be sure. If you describe what you're looking for in English then I can help you better. For instance, it's unclear why you don't have question marks, full stops, and semicolons in the list of intervening punctuation.
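To try that guess outside PostgreSQL, here is a minimal plain-Perl sketch. The helper name first_two_words is hypothetical, and the pattern is only the rough guess from above, so it inherits the same uncertainty about which punctuation should be allowed between the words.

```perl
use strict;
use warnings;

# Hypothetical helper; returns the first two words or undef if there is no match.
sub first_two_words {
    my ($text) = @_;
    # First word (possessive, so it is never split), optional spaces and
    # punctuation, then the second word.
    my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
    return $text =~ /($regex)/ ? $1 : undef;
}

print first_two_words('I am trying to make this work'), "\n";   # I am
print first_two_words('well-known facts are boring'), "\n";     # well-known
```

The possessive \w++ needs Perl 5.10 or later; it stops a single word such as "Hello" from being split into "Hell" and "o" to satisfy the second \w+.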

Need to match multiple pattern in the same line - Perl

I need to match multiple patterns in the same line. For example, in this file:
Hello, Chester [McAllister;Scientist] lives in Boston [Massachusetts;USA;Fenway Park] # McAllister works in USA
I'm now working in New-York [NYC;USA] # I work in USA
...
First, I want to match every string inside the brackets, knowing that a line can contain more than one bracketed group and that each group holds 1 to n strings, always separated by semicolons.
Finally, for each line I need to compare those values to the string located after the #. For example, in the first sentence I want to compare:
[McAllister;Scientist] & [Massachusetts;USA;Fenway Park] TO "McAllister works in USA"
The tidiest way is probably to use a regex to find all the embedded sequences delimited by square brackets, and then use map with split to separate those sequences into terms.
This program demonstrates.
Note that I have assumed that all of the data in the file has been read into a single scalar variable. You can alter this to process a single line at a time, but only if the bracketed subsequences are never split across multiple lines.
use strict;
use warnings;
my $s = <<END_TEXT;
Hello, Chester [McAllister;Scientist] lives in Boston [Massachusetts;USA;Fenway Park] # McAllister works in USA
I'm now working in New-York [NYC;USA] # I work in USA
END_TEXT
my @data = map [ split /;/ ], $s =~ / \[ ( [^\[\]]+ ) \] /xg;
use Data::Dump;
dd \@data;
output
[
["McAllister", "Scientist"],
["Massachusetts", "USA", "Fenway Park"],
["NYC", "USA"],
]
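The comparison step from the question is not covered above, so here is one possible sketch of it, processing a line at a time. It assumes that "compare" means "check whether each bracketed term occurs literally in the text after the #"; the helper name matching_terms is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the bracketed terms that also occur after the '#'.
sub matching_terms {
    my ($line) = @_;

    # Split once on the first '#': bracketed text first, trailing remark second.
    my ($body, $remark) = split /\s*#\s*/, $line, 2;

    # Flatten every bracketed group into its semicolon-separated terms.
    my @terms = map { split /;/, $_ } $body =~ /\[([^\[\]]+)\]/g;

    # Assumed meaning of "compare": keep terms found literally in the remark.
    return grep { index($remark, $_) >= 0 } @terms;
}

my @lines = (
    'Hello, Chester [McAllister;Scientist] lives in Boston [Massachusetts;USA;Fenway Park] # McAllister works in USA',
    q{I'm now working in New-York [NYC;USA] # I work in USA},
);

print join(', ', matching_terms($_)), "\n" for @lines;
```

For the two sample lines this prints "McAllister, USA" and "USA". A plain substring test is crude (a term such as "US" would also match "USA"), so a word-boundary regex may be a better fit depending on the data.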
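Try this. It also gives what you expect.
use strict;
use warnings;
# Open the input file with a lexical filehandle and fail loudly if it is missing.
open(my $fh, '<', 'file.txt') or die "file.txt: $!";
# Collect every bracketed group, brackets included, from every line.
my @z = map { m/\[[\w;\s]+\]/g } <$fh>;
print "$_ ," foreach @z;
You actually need to match the words separated by the ; within the [].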

Perl - split command with regex - split numeric and strings

My data look as follows:
20110627 ABC DBE EFG
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
I am trying to split in such a way that I keep numeric values in one cell, and strings in one cell.
Thus, I want "20110627" in one cell, "ABC DBE EFG" in another, "0.811585416264778" in another, "-0.157081048309312" in another, etc.
I have the following split command in perl with a regex
my @Fld = split(/[\d+][\s][\w+]/, $_);
But that doesn't seem to do what I want.. Can someone tell me which regex to use? Thanks in advance
EDIT: Following vks's suggestion, I changed his regex a little to get rid of whitespace and to allow for strings that contain commas (,), slashes (/) or dashes (-), but now the negative sign (-) in numbers seems to be taken as a separate token:
(-?\d+(\.\d+)?)|([\/?,?\.?\-?a-zA-Z\/ ]+)
20110627 A B C
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
19950725 A C
16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933
19980323 G C I /DE/
20130516 A - E, INC.
33019 398 197 1.205366607105 0.596626184923832 0.608740422181168
20130516 A - E, INC.
24094 134 137 0.556155059350876 0.56860629202291 -0.0124512326720345
19960327 A F C /DE 38905 503 169 1.29289294435163 0.434391466392495 0.858501477959131
Expected output :
20110627 in one token
A B C in one token
-0.170130027949933 in one token
G C I /DE/ in one token
A - E, INC. in one token (of course all the others should be in separate tokens as well; in other words, each string in one token and each number in one token. I cannot write out every single one of them, but I think it is straightforward)
2nd EDIT:
Brian found the right regex: /(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/ (see below). Thanks, Brian! I now have a follow-up question: I am writing the results of the regex split to an Excel file, using the following code:
use warnings;
use strict;
use Spreadsheet::WriteExcel;
use Scalar::Util qw(looks_like_number);
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::SaveParser;
use Spreadsheet::ParseExcel::Workbook;
if (($#ARGV < 1) || ($#ARGV > 2)) {
die("Usage: tab2xls tabfile.txt newfile.xls\n");
};
open (TABFILE, $ARGV[0]) or die "$ARGV[0]: $!";
my $workbook = Spreadsheet::WriteExcel->new($ARGV[1]);
my $worksheet = $workbook->add_worksheet();
my $row = 0;
my $col = 0;
while (<TABFILE>) {
chomp;
# Split
my @Fld = split(/(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/, $_);
$col = 0;
foreach my $token (@Fld) {
$worksheet->write($row, $col, $token);
$col++;
}
$row++;
}
The problem is I get empty cells when I use that code:
> "EMPTY CELL" "1000" "EMPTY CELL" "EMPTY CELL" "ABC DEG" "EMPTY CELL"
> "2500" "EMPTY CELL" "EMPTY CELL" "1500" "3500"
Why am I getting these empty cells? Any way to avoid that? Thanks a lot
This is a broad scoped regex that does whitespace trim.
For some reason Perl always inserts the captures.
Since the regex is basically \d or \D, it matches everything,
so running split results through grep removes empty elements.
I'm using Perl 5.10, they probably have a noemptyelements flag by now.
Regex
# \s*([-\d.]+|\D+)(?<!\s)\s*
\s*
( [-\d.]+ | \D+ )
(?<! \s )
\s*
Perl
use strict;
use warnings;
$/ = undef;
my $data = <DATA>;
my @ary = grep { length($_) > 0 } split m/\s*([-\d.]+|\D+)(?<!\s)\s*/, $data;
for (@ary) {
print "'$_'\n";
}
__DATA__
20110627 A B C
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
19950725 A C
16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933
19980323 G C I /DE/
20130516 A - E, INC.
33019 398 197 1.205366607105 0.596626184923832 0.608740422181168
20130516 A - E, INC.
24094 134 137 0.556155059350876 0.56860629202291 -0.0124512326720345
19960327 A F C /DE 38905 503 169 1.29289294435163 0.434391466392495 0.858501477959131
Output
'20110627'
'A B C'
'217722'
'1425'
'1767'
'0.654504367955466'
'0.811585416264778'
'-0.157081048309312'
'19950725'
'A C'
'16458'
'63'
'91'
'0.38279256288735'
'0.552922590837283'
'-0.170130027949933'
'19980323'
'G C I /DE/'
'20130516'
'A - E, INC.'
'33019'
'398'
'197'
'1.205366607105'
'0.596626184923832'
'0.608740422181168'
'20130516'
'A - E, INC.'
'24094'
'134'
'137'
'0.556155059350876'
'0.56860629202291'
'-0.0124512326720345'
'19960327'
'A F C /DE'
'38905'
'503'
'169'
'1.29289294435163'
'0.434391466392495'
'0.858501477959131'
Using your revised requirements that allow for /, ,, -, etc., here's a regex that will capture all numeric tokens in capture group #1 and alpha in capture group #2:
(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)
(see regex101 example)
Breakdown:
(-?\d+(?:\.\d+)?) (capture group #1) matches numbers, with possible negative sign and possible decimal places (in non-capturing group)
([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*) (capture group #2) matches alpha strings with possible embedded whitespace
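One way to use this pattern without getting empty cells is a global match instead of split. split treats the pattern as a separator, returns the text between separators, and adds one extra field per capture group, with undef for the group that did not take part in the match, which is where the empty cells come from. The sketch below is only an illustration; the helper name tokens and the sample line are made up.

```perl
use strict;
use warnings;

# Group 1: numbers; group 2: alpha strings with embedded whitespace.
my $token_re = qr/(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/;

# Hypothetical helper: walk the line with /g and keep whichever group matched.
sub tokens {
    my ($line) = @_;
    my @tokens;
    while ($line =~ /$token_re/g) {
        push @tokens, defined $1 ? $1 : $2;
    }
    return @tokens;
}

print "'$_'\n" for tokens('20130516 A - E, INC. 33019 -0.0124512326720345');
```

For the sample line this prints '20130516', 'A - E, INC.', '33019' and '-0.0124512326720345', one per line. Each returned token can then be written to a worksheet cell in turn.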
(-?\d+(\.\d+)?)|([a-zA-Z ]+)
Try this. See demo. Grab the captures. Remove the empty ones.
http://regex101.com/r/lZ5mN8/35

How to change nested quotes?

I'm looking for a way to change quotes to fancy ones: "abc" -> «abc».
It works for me in simple situations. The next step is to get it working with nested quotes as well: "abc "d e f" ghi" -> «abc «d e f» ghi»
$pk =~ s/
"( # first quote, start capture
[\p{Word}\.]+? # at least one word-char or point
.*?\b[\.,?!]*? # any char followed boundary + opt. punctuation
)" # stop capture, ending quote
/«$1»/xg; # change to fancy
I hoped the regex would match the 1st and 3rd quotes and change them, and it does. The problem is that I hoped it would then match the 2nd and 4th, but it won't, because the 2nd is already behind the current match position. One solution is to run the same replacement again until fewer than two quote characters remain.
Is there a better way to achieve my goal? My approach won't work with a third level of nesting, but that is not my goal; I am staying with 2 levels.
NB! Changing the start quote and end quote in separate replacements won't work, because then lone double quotes will be replaced too. I need to replace them only when they appear as a pair!
MORE examples:
"abc "d e f" -> «abc "d e f»
"abc"d e f" -> «abc"d e f»
This seems impossible:
"abc" d e f" -> «abc" d e f»
There is no general way to pair up nested double quotes. If your quotes are always next to the beginning or end of a word then this may work. It replaces a double quote that precedes a non-space character with an open quote, and one that follows a non-space character with a close quote.
use strict;
use warnings;
use utf8;
my $string = '"abc "d e f" ghi"';
$string =~ s/"(?=\S)/«/g;
$string =~ s/(?<=\S)"/»/g;
print $string;
output
«abc «d e f» ghi»
You can use negative lookaround assertions to find the matching directions on your fancy quotes. The double negations help handle the edge cases (e.g. end/beginning of line). I used << and >> instead of your fancy quotes here for simplicity.
use strict;
use warnings;
while (<DATA>) {
s/(?<!\S)"(?!\s)/<</g;
s/(?<!\s)"(?!\S)/>>/g;
print;
}
__DATA__
"abc "d e f" ghi"
Output:
<<abc <<d e f>> ghi>>

Perl regular expression question

Suppose I have variables
$x1 = 'XX a b XX c d XX';
$x2 = 'XX a b XX c d XX e f XX';
I want a regular expression that will find each instance of letters between XX. I'm looking for a general solution, because I don't know how many XX's there are.
I tried using /XX(.*?)XX/g but this only matches "a b" for x1 and "a b", "e f" for x2 because once the first match is found, the engine has already read the second "XX".
Thanks for any help.
Try using a positive lookahead:
/XX(.*?)(?=XX)/
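A minimal sketch of the lookahead idea, using $x2 from the question. The /g flag and the \s* trimming around the capture are my additions; without /g only the first match is returned.

```perl
use strict;
use warnings;

my $x2 = 'XX a b XX c d XX e f XX';

# The lookahead checks for the closing XX without consuming it, so the
# same XX can open the next match.
my @between = $x2 =~ /XX\s*(.*?)\s*(?=XX)/g;

print "'$_'\n" for @between;   # 'a b', 'c d', 'e f'
```

Because the closing XX is only asserted, not consumed, the next search starts right at it, which is exactly what the original /XX(.*?)XX/g could not do.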
Like knittl, I would suggest split. But you might want to remove the whitespace as well:
my @stuff = split /\s*XX\s*/, $line;
You could also use lookaheads, but you don't really need them, because a reasonably complex alternation works as well.
The version that does not trim whitespace would just be:
my @stuff = $line =~ m/XX((?:[^X]|X[^X])*)/g;
The alternation says that you'll take anything if it's not an 'X'--but you will take an 'X' if it's not followed by another 'X'. There will be one character of lookahead, but it can consume characters aggressively, without backtracking.
The trimming version will have to backtrack to get rid of space characters, so the expression is uglier.
my @stuff = $line =~ m/XX\s*((?:[^X]|X[^X])*?(?:[^X\s]|X[^X]))/g;
You can use split
@stuff_between_xx = split /XX/, $x1;
To store the number of matches in a scalar variable:
$stuff_between_xx = split /XX/, $x1;
my $x2 = 'XX a b XX c d XX e f XX';
my @parts = grep { $_ ne '' } split /\s*XX\s*/, $x2;