How to change nested quotes? - regex

I'm looking for a way to change quotes to fancy ones: "abc" -> «abc».
It works for me in simple situations; the next step is to make it work with nested quotes as well: "abc "d e f" ghi" -> «abc «d e f» ghi»
$pk =~ s/
"( # first quote, start capture
[\p{Word}\.]+? # at least one word-char or point
.*?\b[\.,?!]*? # any char followed boundary + opt. punctuation
)" # stop capture, ending quote
/«$1»/xg; # change to fancy
I hoped the regex would match the 1st and 3rd quotes and change them, and it does. The problem is that I hoped it would then match the 2nd and 4th, but it won't, because the 2nd has already been passed. One solution is to run the same replacement repeatedly until fewer than 2 quote characters are left.
Is there a better way to achieve my goal? My approach won't work with a third level of nesting, but that is not my goal; I am staying with 2 levels.
NB! Changing the opening and closing quotes in separate replacements won't work, because then lone double quotes would be replaced too. I need to replace them only when they appear as a pair!
MORE examples:
"abc "d e f" -> «abc "d e f»
"abc"d e f" -> «abc"d e f»
This seems impossible:
"abc" d e f" -> «abc" d e f»

There is no general way to pair up nested double quotes. If your quotes are always next to the beginning or end of a word then this may work. It replaces a double quote that precedes a non-space character with an open quote, and one that follows a non-space character with a close quote.
use strict;
use warnings;
use utf8;
my $string = '"abc "d e f" ghi"';
$string =~ s/"(?=\S)/«/g;
$string =~ s/(?<=\S)"/»/g;
print $string;
output
«abc «d e f» ghi»

You can use negative lookaround assertions to find the matching directions on your fancy quotes. The double negations help handle the edge cases (e.g. end/beginning of line). I used << and >> instead of your fancy quotes here for simplicity.
use strict;
use warnings;
while (<DATA>) {
s/(?<!\S)"(?!\s)/<</g;
s/(?<!\s)"(?!\S)/>>/g;
print;
}
__DATA__
"abc "d e f" ghi"
Output:
<<abc <<d e f>> ghi>>
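As a minimal sketch, the two lookaround substitutions above can be wrapped in a small helper that emits the real « and » characters. The name fancy_quotes is hypothetical and not from either answer. Note the limits of this heuristic: it decides direction from the neighbouring whitespace only, so it does not check that quotes come in pairs, and a closing quote followed directly by punctuation is left untouched.

```perl
use strict;
use warnings;
use utf8;                               # the source contains literal « and »
binmode STDOUT, ':encoding(UTF-8)';     # avoid "Wide character" warnings

# fancy_quotes is a hypothetical helper name used for illustration.
sub fancy_quotes {
    my ($text) = @_;
    # Opening quote: not preceded by a non-space, not followed by a space.
    $text =~ s/(?<!\S)"(?!\s)/«/g;
    # Closing quote: not preceded by a space, not followed by a non-space.
    $text =~ s/(?<!\s)"(?!\S)/»/g;
    return $text;
}

print fancy_quotes($_), "\n" for (
    '"abc "d e f" ghi"',
    'say "hello" twice',
);
```

A quote with whitespace on both sides matches neither pattern and is left as a plain double quote.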

Related

Perl regex exclude optional word from match

I have strings and need to extract only the icnnumbers/numbers from them.
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
I need to extract below data from above example.
9876AB54321
987654321FR
987654321YQ
Here is my regex, but it only works for the first line of data.
(icnnumber|number):(\w+)(?:_IN)
How can I write an expression that matches all three sets of data?
Given your strings to extract are only upper case and numeric, why use \w when that also matches _?
How about just matching:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
print "$1\n" if m/number:([A-Z0-9]+)/;   # print only when the match succeeds
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Another alternative to get only the values as a match using \K to reset the match buffer
\b(?:icn)?number:\K[^\W_]+
For example
my $str = 'icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ';
while($str =~ /\b(?:icn)?number:\K[^\W_]+/g ) {
print $& . "\n";
}
Output
9876AB54321
987654321FR
987654321YQ
You may replace \w (that matches letters, digits and underscores) with [^\W_] that is almost the same, but does not match underscores:
(icnnumber|number):([^\W_]+)
If you want to make sure icnnumber and number are matched as whole words, you may add a word boundary at the start:
\b(icnnumber|number):([^\W_]+)
^^
You may even refactor the pattern a bit in order not to repeat number using an optional non-capturing group, see below:
\b((?:icn)?number):([^\W_]+)
   ^^^^^^^^
Pattern details
\b - a word boundary (immediately to the left, there must be start of string or a char other than letter, digit or _)
((?:icn)?number) - Group 1: an optional sequence of icn substring and then number substring
: - a : char
([^\W_]+) - Group 2: one or more letters or digits.
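The pattern described above can be tried against the three sample lines with a short script. This is only a sketch; extract_id is a hypothetical helper name, and it returns undef for lines that do not match.

```perl
use strict;
use warnings;

# extract_id is a hypothetical helper: it returns the letters/digits after
# "number:" or "icnnumber:", or undef when the line does not match.
sub extract_id {
    my ($line) = @_;
    return $line =~ /\b((?:icn)?number):([^\W_]+)/ ? $2 : undef;
}

for my $line ('icnnumber:9876AB54321_IN', 'number:987654321FR', 'icnnumber:987654321YQ') {
    my $id = extract_id($line);
    print "$id\n" if defined $id;
}
```

Because [^\W_] excludes the underscore, the capture stops before _IN without needing to name that suffix in the pattern.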
Just another suggestion maybe, but if your strings are always valid, you may consider just to split on a character class and pull the second index from the resulting array:
my $string= "number:987654321FR";
my @part = (split /[:_]/, $string)[1];
print @part;
Or for the whole array of strings:
my @Array = ("icnnumber:9876AB54321_IN", "number:987654321FR", "icnnumber:987654321YQ");
foreach (@Array)
{
my $el = (split /[:_]/, $_)[1];
print "$el\n"
}
Results in:
9876AB54321
987654321FR
987654321YQ
The regular expression can treat 'icn' as optional; the part of interest is the 11 characters after the colon.
my $re = qr/(icn)?number:(.{11})/;
Test code snippet
use strict;
use warnings;
use feature 'say';
my $re = qr/(icn)?number:(.{11})/;
while(<DATA>) {
say $2 if /$re/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Output
9876AB54321
987654321FR
987654321YQ
You already have good answers here, but here is my attempt anyway.
Read the whole input into one string:
my $str = do { local $/; <DATA> }; #print $str;
You can capture everything up to an _ or a word boundary with the line below:
my @arrs = ($str =~ m/number\:((?:(?!\_).)*)(?:\b|\_)/ig);
(or)
You can exclude non-word characters (\W) and _ from the first group, collecting the matches in an array:
my @arrs = ($str =~ m/number\:([^\W\_]+)(?:\_|\b)/ig);
Print the output:
print join "\n", @arrs;
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ

Regex to match certain words but not others

I am working with regexes in perl, and I'm trying to make a regex that finds two words where one ends with d and the next word starts with p (but not ph). Here is my regex, which works:
m{d\s(p[^h])}
However, I'd also like this to exclude the word "and" (but only within this pattern) so I have tried to use a negative lookahead, so my code looks like this:
if ($text =~ m{d\s(p[^h])} && $text =~ m{(?:(?!\sand\s))}) {
print "Yes\n";
} else {
}
However, this doesn't seem to work.
Here are some sample input/output:
sand pet -> yes
sand phone -> no
go and pet -> no
sand pet and -> yes
Any help with this is greatly appreciated!
You can accomplish what you need with a single regex:
/(?<!\ban)d\s(p[^h]\w+)/
Where:
\b is the word boundary anchor. It doesn't consume any characters, but it ensures that the excluded word is and and not sand. It matches between \w (word chars: [a-zA-Z0-9_]) and \W (non-word chars), and at the start or end of the string when a word character is adjacent.
(?<!\ban)d is a d not preceded by an an that starts a word; technically it is almost equivalent to (?<!\Wan).
If you don't need to extract the first and second words separately, you can also remove the capturing groups and add some tolerance (one or more spaces between the words):
if ( $input =~ m/(?<!\ban)d\s+p(?!h)/ ) {
    print "Yes\n";
}
else {
    print "No\n";
}
Note: this regex actually searches for a d (not preceded by an an at the start of a word) separated by one or more spaces from a p that is not followed by an h. It says nothing about words. If you want to make sure the words are more than one character long, you can add a leading and trailing \w+.
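To check the single-regex approach against the sample inputs from the question, a small test loop helps. This is a sketch only; matches_pair is a hypothetical name for the check.

```perl
use strict;
use warnings;

# matches_pair is a hypothetical helper: true when a d (not the d of a
# standalone "and") is followed by whitespace and a p that does not start "ph".
sub matches_pair {
    my ($text) = @_;
    return $text =~ /(?<!\ban)d\s+p(?!h)/ ? 1 : 0;
}

for my $text ('sand pet', 'sand phone', 'go and pet', 'sand pet and') {
    printf "%s -> %s\n", $text, matches_pair($text) ? 'yes' : 'no';
}
```

In 'sand pet' the lookbehind finds no word boundary before "an" (it is preceded by s), so the d is accepted; in 'go and pet' the boundary is there, so the d is rejected.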
You're getting too complicated. That negative lookahead is applied to the string, and matches against any substring. So it will match any substring which doesn't contain \sand\s which is always going to work, because zero length substrings are 'ok'.
You can see this at work with turning on debugging:
#!/usr/bin/env perl
use strict;
use warnings;
use re 'debug';
while ( <DATA> ) {
print if m{(?:(?!\sand\s))};
}
__DATA__
sand pet
sand phone
go and pet
sand pet and
empty
That lookahead is used with another pattern to say 'match this, but only if this is (or isn't) next'.
So something like:
m{d\s(p[^h])} and not m{\sand\s};
Might do what you want - or alternatively, just break it down into phases:
#!/usr/bin/env perl
use strict;
use warnings;
#use re 'debug';
while (<DATA>) {
my ($capture) = m{d\s(p[^h])};
if ( $capture and not $capture =~ m/\sand\s/ ) {
print $capture, " => ", $_, "\n";
}
}
__DATA__
sand pet
sand phone
go and pet
sand pet and
empty
It's often inappropriate to try to get everything working in a single regex. This program has a subroutine ok_words that checks a pair of words to see if your criteria apply. The calling code takes every pair of adjacent words in the string and prints yes if the test is true for any pair, otherwise no.
These are your tests, together with the Perl code that checks for them
The first ends in a d: /d\z/
...but isn't and: ne 'and'
The second starts with a p, but not ph: /\Ap(?!h)/
And this is the program that applies them
use strict;
use warnings 'all';
use List::MoreUtils qw/ any /;
while ( <DATA> ) {
chomp;
my @w = split;
if ( any { ok_words( $w[$_], $w[$_+1] ) } 0 .. $#w-1 ) {
print "$_ -> yes\n";
}
else {
print "$_ -> no\n";
}
}
sub ok_words {
my ($this, $next) = map lc, @_;
$this =~ /d\z/ and $this ne 'and' and $next =~ /\Ap(?!h)/;
}
__DATA__
sand pet
sand phone
go and pet
sand pet and
output
sand pet -> yes
sand phone -> no
go and pet -> no
sand pet and -> yes

Perl - split command with regex - split numeric and strings

My data look as follows:
20110627 ABC DBE EFG
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
I am trying to split in such a way that I keep numeric values in one cell, and strings in one cell.
Thus, I want "20110627" in one cell, "ABC DBE EFG" in another, "0.811585416264778" in another, "-0.157081048309312" in another, etc.
I have the following split command in perl with a regex
my @Fld = split(/[\d+][\s][\w+]/, $_);
But that doesn't seem to do what I want.. Can someone tell me which regex to use? Thanks in advance
EDIT : Following vks suggestion, I changed his regex a little bit to get rid of whitespace, take into account the string might have commas (,) or slash (/) or a dash (-) but then the negative sign (-) seems to be taken as a separate token in numbers:
(-?\d+(\.\d+)?)|([\/?,?\.?\-?a-zA-Z\/ ]+)
20110627 A B C
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
19950725 A C
16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933
19980323 G C I /DE/
20130516 A - E, INC.
33019 398 197 1.205366607105 0.596626184923832 0.608740422181168
20130516 A - E, INC.
24094 134 137 0.556155059350876 0.56860629202291 -0.0124512326720345
19960327 A F C /DE 38905 503 169 1.29289294435163 0.434391466392495 0.858501477959131
Expected output :
20110627 in one token
A B C in one token
-0.170130027949933 in one token
G C I /DE/ in one token
A - E, INC. in one token (of course all the others should be in separate tokens as well; in other words, each string in one token and each number in one token. I cannot write out every single one of them, but I think it is straightforward)
2nd EDIT:
Brian found the right regex: /(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/ (see below). Thanks Brian! I now have a follow-up question: I am writing the results of the regex split to an Excel file, using the following code:
use warnings;
use strict;
use Spreadsheet::WriteExcel;
use Scalar::Util qw(looks_like_number);
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::SaveParser;
use Spreadsheet::ParseExcel::Workbook;
if (($#ARGV < 1) || ($#ARGV > 2)) {
die("Usage: tab2xls tabfile.txt newfile.xls\n");
};
open (TABFILE, $ARGV[0]) or die "$ARGV[0]: $!";
my $workbook = Spreadsheet::WriteExcel->new($ARGV[1]);
my $worksheet = $workbook->add_worksheet();
my $row = 0;
my $col = 0;
while (<TABFILE>) {
chomp;
# Split
my @Fld = split(/(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/, $_);
$col = 0;
foreach my $token (@Fld) {
$worksheet->write($row, $col, $token);
$col++;
}
$row++;
}
The problem is I get empty cells when I use that code:
> "EMPTY CELL" "1000" "EMPTY CELL" "EMPTY CELL" "ABC DEG" "EMPTY CELL"
> "2500" "EMPTY CELL" "EMPTY CELL" "1500" "3500"
Why am I getting these empty cells? Any way to avoid that? Thanks a lot
This is a broad scoped regex that does whitespace trim.
For some reason Perl always inserts the captures.
Since the regex is basically \d or \D, it matches everything,
so running split results through grep removes empty elements.
I'm using Perl 5.10, they probably have a noemptyelements flag by now.
Regex
# \s*([-\d.]+|\D+)(?<!\s)\s*
\s*
( [-\d.]+ | \D+ )
(?<! \s )
\s*
Perl
use strict;
use warnings;
$/ = undef;
my $data = <DATA>;
my @ary = grep { length($_) > 0 } split m/\s*([-\d.]+|\D+)(?<!\s)\s*/, $data;
for (@ary) {
print "'$_'\n";
}
__DATA__
20110627 A B C
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
19950725 A C
16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933
19980323 G C I /DE/
20130516 A - E, INC.
33019 398 197 1.205366607105 0.596626184923832 0.608740422181168
20130516 A - E, INC.
24094 134 137 0.556155059350876 0.56860629202291 -0.0124512326720345
19960327 A F C /DE 38905 503 169 1.29289294435163 0.434391466392495 0.858501477959131
Output
'20110627'
'A B C'
'217722'
'1425'
'1767'
'0.654504367955466'
'0.811585416264778'
'-0.157081048309312'
'19950725'
'A C'
'16458'
'63'
'91'
'0.38279256288735'
'0.552922590837283'
'-0.170130027949933'
'19980323'
'G C I /DE/'
'20130516'
'A - E, INC.'
'33019'
'398'
'197'
'1.205366607105'
'0.596626184923832'
'0.608740422181168'
'20130516'
'A - E, INC.'
'24094'
'134'
'137'
'0.556155059350876'
'0.56860629202291'
'-0.0124512326720345'
'19960327'
'A F C /DE'
'38905'
'503'
'169'
'1.29289294435163'
'0.434391466392495'
'0.858501477959131'
Using your revised requirements that allow for /, ,, -, etc., here's a regex that will capture all numeric tokens in capture group #1 and alpha in capture group #2:
(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)
Breakdown:
(-?\d+(?:\.\d+)?) (capture group #1) matches numbers, with possible negative sign and possible decimal places (in non-capturing group)
([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*) (capture group #2) matches alpha strings with possible embedded whitespace
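The empty cells come from using this pattern as a split separator: split returns the text between separators (here mostly empty strings or whitespace) and, for every separator, one field per capture group, with undef for the group that did not take part in the match. One way around that, sketched below under the assumption that only the captured tokens are wanted, is to collect matches in a loop instead of calling split. The name tokenize is hypothetical.

```perl
use strict;
use warnings;

# tokenize is a hypothetical helper: it returns the numeric and text tokens
# of one line, in order, with no empty or undefined elements.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    while ($line =~ /(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/g) {
        # Exactly one of the two groups is defined for each match.
        push @tokens, defined $1 ? $1 : $2;
    }
    return @tokens;
}

print "'$_'\n" for tokenize('20130516 A - E, INC.');
print "'$_'\n" for tokenize('16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933');
```

Each returned token could then be passed to $worksheet->write as in the question's loop, one cell per token.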
(-?\d+(\.\d+)?)|([a-zA-Z ]+)
Try this. See demo. Grab the captures. Remove the empty ones.
http://regex101.com/r/lZ5mN8/35

Regular expression for selecting single spaced phrases but not whitespace

I need a rather complicated regular expression that will select words with one space between them and that can include the '-' symbol in them, it should not however select continuous whitespace.
'KENEDY JOHN G JR E' 'example' 'D-54'
I have tried the following regular expression:
\'([\s\w-]+)\'
but it selects continuous whitespace which I don't want it to do.
I want the expression to select
'KENEDY JOHN G JR E'
'example'
'D-54'
Perhaps,
\'([\w-]+(?:\s[\w-]+)*)\'
?
EDIT
If leading/trailing dashes (on the word boundaries) are not allowed, this should read:
/\'(\w+(?:[\s-]\w+)*)\'/
An expression like this should do it:
'[\w-]+(?:\s[\w-]+)*'
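A quick way to see that this expression rejects runs of whitespace is to feed it one extra phrase containing a double space. The sketch below adds a capture group to drop the quotes; quoted_phrases and the 'two  spaces' sample are illustrative assumptions, not part of the question's data.

```perl
use strict;
use warnings;

# quoted_phrases is a hypothetical helper: it returns the contents of every
# single-quoted phrase whose words are separated by exactly one whitespace char.
sub quoted_phrases {
    my ($text) = @_;
    return $text =~ /'([\w-]+(?:\s[\w-]+)*)'/g;
}

my $data = "'KENEDY JOHN G JR E' 'example' 'D-54' 'two  spaces'";
print "$_\n" for quoted_phrases($data);
```

The phrase with two consecutive spaces is skipped because \s is allowed only once between words.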
Try this:
my $data = "'KENEDY JOHN G JR E' 'example' 'D-54'";
# Sets of
# one or more word characters or dash
# followed by an optional space
# enclosed in single quotes
#
# The outermost ()s are optional. They're just
# there so I can print the match easily as $1.
while ($data =~ /(\'([\w-]+\s?)+\')/g)
{
print $1, "\n";
}
outputs
'KENEDY JOHN G JR E'
'example'
'D-54'
Not sure if this applies to you, since you asked for a regex specifically. However, if you want strings separated by two or more whitespace or dashes, you can use split
use strict;
use warnings;
use v5.10;
my $str = q('KENEDY JOHN G JR E' 'example' 'D-54');
my @match = split /\s{2,}/, $str;
say for @match;
A regex with similar functionality would be
my @match = $str =~ /(.*?)(?:\s{2,}|$)/g;
Note that you'll need the edge case of finding end of line $.
The benefit of using split or the wildcard . is that you rely on whitespace to define your fields, not the content of the fields themselves.
Your code actually works as is.
use feature qw( say );
$_ = "'KENEDY JOHN G JR E' 'example' 'D-54'";
say for /\'([\s\w-]+)\'/g;
output:
KENEDY JOHN G JR E
example
D-54
(Move the parens if you want the quotes too.)
I would simply use
my @data = /'([^']*)'/g;
If you have any validation to do, do it afterwards.

Perl regular expression question

Suppose I have variables
$x1 = 'XX a b XX c d XX';
$x2 = 'XX a b XX c d XX e f XX';
I want a regular expression that will find each instance of letters between XX. I'm looking for a general solution, because I don't know how many XX's there are.
I tried using /XX(.*?)XX/g but this only matches "a b" for x1 and "a b", "e f" for x2 because once the first match is found, the engine has already read the second "XX".
Thanks for any help.
Try using a positive lookahead:
/XX(.*?)(?=XX)/
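To collect every instance, the lookahead pattern needs the /g modifier in list context. Since the lookahead does not consume the closing XX, that same XX can open the next match. The sketch below also trims surrounding whitespace, which is an added assumption; between_xx is a hypothetical helper name.

```perl
use strict;
use warnings;

# between_xx is a hypothetical helper: it returns the trimmed text found
# between each pair of consecutive XX markers.
sub between_xx {
    my ($text) = @_;
    return $text =~ /XX\s*(.*?)\s*(?=XX)/g;
}

my $x1 = 'XX a b XX c d XX';
my $x2 = 'XX a b XX c d XX e f XX';
print join('|', between_xx($x1)), "\n";
print join('|', between_xx($x2)), "\n";
```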
I would suggest split as well as knittl. But you might want to remove the whitespace as well:
my @stuff = split /\s*XX\s*/, $line;
Also you could use lookaheads, but you really don't need them, because you can use reasonably complex alternations as well:
Non-ws version would just be:
my @stuff = $line =~ m/XX((?:[^X]|X[^X])*)/g;
The alternation says that you'll take anything if it's not an 'X'--but you will take an 'X' if it's not followed by another 'X'. There will be one character of lookahead, but it can consume characters aggressively, without backtracking.
The trimming version will have to backtrack to get rid of space characters, so the expression is uglier.
my @stuff = $line =~ m/XX\s*((?:[^X]|X[^X])*?(?:[^X\s]|X[^X]))/g;
You can use split
@stuff_between_xx = split /XX/, $x1;
To store the number of matches in a scalar variable:
$stuff_between_xx = split /XX/, $x1;
my $x2 = 'XX a b XX c d XX e f XX';
my @parts = grep { $_ ne '' } split /\s*XX\s*/, $x2;