Perl - split command with regex - split numeric and strings - regex

My data look as follows:
20110627 ABC DBE EFG
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
I am trying to split in such a way that I keep numeric values in one cell, and strings in one cell.
Thus, I want "20110627" in one cell, "ABC DBE EFG" in another, "0.811585416264778" in another, "-0.157081048309312" in another, etc.
I have the following split command in perl with a regex
my #Fld = split(/[\d+][\s][\w+]/, $_);
But that doesn't seem to do what I want.. Can someone tell me which regex to use? Thanks in advance
EDIT : Following vks suggestion, I changed his regex a little bit to get rid of whitespace, take into account the string might have commas (,) or slash (/) or a dash (-) but then the negative sign (-) seems to be taken as a separate token in numbers:
(-?\d+(\.\d+)?)|([\/?,?\.?\-?a-zA-Z\/ ]+)
20110627 A B C
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
19950725 A C
16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933
19980323 G C I /DE/
20130516 A - E, INC.
33019 398 197 1.205366607105 0.596626184923832 0.608740422181168
20130516 A - E, INC.
24094 134 137 0.556155059350876 0.56860629202291 -0.0124512326720345
19960327 A F C /DE 38905 503 169 1.29289294435163 0.434391466392495 0.858501477959131
Expected output :
20110627 in one token
A B C in one token
-0.170130027949933 in one token
G C I /DE/ in one token
A - E, INC. in one token.. (of course all the other should be in separate tokens, in other words the strings in one token and the numbers in one token.. I cannot write every single one of them but I think it it straightforward)
2nd EDIT:
Brian found the right regex: /(-?\d+(?:.\d+)?)|([/,.-a-zA-Z]+(?:\s+[/,.-a-zA-Z]+)*)/ (see below). Thanks Brian ! I now have a follow up question: I am writing the results of the regex split to an Excel file, using the following code:
use warnings;
use strict;
use Spreadsheet::WriteExcel;
use Scalar::Util qw(looks_like_number);
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::SaveParser;
use Spreadsheet::ParseExcel::Workbook;
if (($#ARGV < 1) || ($#ARGV > 2)) {
die("Usage: tab2xls tabfile.txt newfile.xls\n");
};
open (TABFILE, $ARGV[0]) or die "$ARGV[0]: $!";
my $workbook = Spreadsheet::WriteExcel->new($ARGV[1]);
my $worksheet = $workbook->add_worksheet();
my $row = 0;
my $col = 0;
while (<TABFILE>) {
chomp;
# Split
my #Fld = split(/(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/, $_);
$col = 0;
foreach my $token (#Fld) {
$worksheet->write($row, $col, $token);
$col++;
}
$row++;
}
The problem is I get empty cells when I use that code:
> "EMPTY CELL" "1000" "EMPTY CELL" "EMPTY CELL" "ABC DEG" "EMPTY CELL"
> "2500" "EMPTY CELL" "EMPTY CELL" "1500" "3500"
Why am I getting these empty cells? Any way to avoid that? Thanks a lot

This is a broad scoped regex that does whitespace trim.
For some reason Perl always inserts the captures.
Since the regex is basically \d or \D, it matches everything,
so running split results through grep removes empty elements.
I'm using Perl 5.10, they probably have a noemptyelements flag by now.
Regex
# \s*([-\d.]+|\D+)(?<!\s)\s*
\s*
( [-\d.]+ | \D+ )
(?<! \s )
\s*
Perl
use strict;
use warnings;
$/ = undef;
my $data = <DATA>;
my #ary = grep { length($_) > 0 } split m/\s*([-\d.]+|\D+)(?<!\s)\s*/, $data;
for (#ary) {
print "'$_'\n";
}
__DATA__
20110627 A B C
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
19950725 A C
16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933
19980323 G C I /DE/
20130516 A - E, INC.
33019 398 197 1.205366607105 0.596626184923832 0.608740422181168
20130516 A - E, INC.
24094 134 137 0.556155059350876 0.56860629202291 -0.0124512326720345
19960327 A F C /DE 38905 503 169 1.29289294435163 0.434391466392495 0.858501477959131
Output
'20110627'
'A B C'
'217722'
'1425'
'1767'
'0.654504367955466'
'0.811585416264778'
'-0.157081048309312'
'19950725'
'A C'
'16458'
'63'
'91'
'0.38279256288735'
'0.552922590837283'
'-0.170130027949933'
'19980323'
'G C I /DE/'
'20130516'
'A - E, INC.'
'33019'
'398'
'197'
'1.205366607105'
'0.596626184923832'
'0.608740422181168'
'20130516'
'A - E, INC.'
'24094'
'134'
'137'
'0.556155059350876'
'0.56860629202291'
'-0.0124512326720345'
'19960327'
'A F C /DE'
'38905'
'503'
'169'
'1.29289294435163'
'0.434391466392495'
'0.858501477959131'

Using your revised requirements that allow for /, ,, -, etc., here's a regex that will capture all numeric tokens in capture group #1 and alpha in capture group #2:
(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)
(see regex101 example)
Breakdown:
(-?\d+(?:\.\d+)?) (capture group #1) matches numbers, with possible negative sign and possible decimal places (in non-capturing group)
([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*) (capture group #2) matches alpha strings with possible embedded whitespace

(-?\d+(\.\d+)?)|([a-zA-Z ]+)
Try this.See demo.Grab the captures.Remove the empty ones.
http://regex101.com/r/lZ5mN8/35

Related

perl Regex replace for specific string length

I am using Perl to do some prototyping.
I need an expression to replace e by [ee] if the string is exactly 2 chars and finishes by "e".
le -> l [ee]
me -> m [ee]
elle -> elle : no change
I cannot test the length of the string, I need one expression to do the whole job.
I tried:
`s/(?=^.{0,2}\z).*e\z%/[ee]/g` but this is replacing the whole string
`s/^[c|d|j|l|m|n|s|t]e$/[ee]/g` same result (I listed the possible letters that could precede my "e")
`^(?<=[c|d|j|l|m|n|s|t])e$/[ee]/g` but I have no match, not sure I can use ^ on a positive look behind
EDIT
Guys you're amazing, hours of search on the web and here I get answers minutes after I posted.
I tried all your solutions and they are working perfectly directly in my script, i.e. this one:
my $test2="le";
$test2=~ s/^(\S)e$/\1\[ee\]/g;
print "test2:".$test2."\n";
-> test2:l[ee]
But I am loading these regex from a text file (using Perl for proto, the idea is to reuse it with any language implementing regex):
In the text file I store for example (I used % to split the line between match and replace):
^(\S)e$% \1\[ee\]
and then I parse and apply all regex like that:
my $test="le";
while (my $row = <$fh>) {
chomp $row;
if( $row =~ /%/){
my #reg = split /%/, $row;
#if no replacement, put empty string
if($#reg == 0){
push(#reg,"");
}
print "reg found, reg:".$reg[0].", replace:".$reg[1]."\n";
push #regs, [ #reg ];
}
}
print "orgine:".$test."\n";
for my $i (0 .. $#regs){
my $p=$regs[$i][0];
my $r=$regs[$i][1];
$test=~ s/$p/$r/g;
}
print "final:".$test."\n";
This technique is working well with my other regex, but not yet when I have a $1 or \1 in the replace... here is what I am obtaining:
final:\1\ee\
PS: you answered to initial question, should I open another post ?
Something like s/(?i)^([a-z])e$/$1[ee]/
Why aren't you using a capture group to do the replacement?
`s/^([c|d|j|l|m|n|s|t])e$/\1 [ee]/g`
If those are the characters you need and if it is indeed one word to a line with no whitespace before it or after it, then this will work.
Here's another option depending on what you are looking for. It will match a two character string consisting of one a-z character followed by one 'e' on its own line with possible whitespace before or after. It will replace this will the single a-z character followed by ' [ee]'
`s/^\s*([a-z])e\s*$/\1 [ee]/`
^(\S)e$
Try this.Replace by $1 [ee].See demo.
https://regex101.com/r/hR7tH4/28
I'd do something like this
$word =~ s/^(\w{1})(e)$/$1$2e/;
You can use following regex which match 2 character and then you can replace it with $1\[$2$2\]:
^([a-zA-Z])([a-zA-Z])$
Demo :
$my_string =~ s/^([a-zA-Z])([a-zA-Z])$/$1[$2$2]/;
See demo https://regex101.com/r/iD9oN4/1

Matching incremental digits in numbers

After googling many days about the issue, finally I am posting this question here and hoping to get it solved by experts here; I am looking for the regex pattern that can match incremental back references. Let me explain:
For number 9422512322, the pattern (\d)\1 will match 22 two times, and I want the pattern (something like (\d)\1+1) that matches 12 (second digit is equal to first digit + 1)
In short the pattern should match all occurrence like 12, 23, 34, 45, 56, etc... There is no replacement, just matches required.
What about something like this?
/01|12|23|34|45|56|67|78|89/
It isn't sexy but it gets the job done.
You can use this regex:
(?:0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9))+.
This will match:
Any 0s which are followed by 1s, or
Any 1s which are followed by 2s, or
Any 2s which are followed by 3s, ...
Multiple times +, then match the corresponding character ..
Here is a regex demo, and the match is:
12345555567877785
You can run code within Perl regular expressions that can
control the regex execution flow. But, this is not likely
to be implemented anywhere else to this degree.
PCRE has some program variable interaction, but not like Perl.
(Note - to do overlap finds, replace the second ( \d ) with (?=( \d ))
then change print statement to print "Overlap Found $1$3\n";
If you use Perl, you can do all kinds of math-character relationships that can't be
done with brute force permutations.
- Good luck!
Perl example:
use strict;
use warnings;
my $dig2;
while ( "9342251232288 6709090156" =~
/
(
( \d )
(?{ $dig2 = $^N + 1 })
( \d )
(?(?{
$dig2 != $^N
})
(?!)
)
)
/xg )
{
print "Found $1\n";
}
Output:
Found 34
Found 12
Found 67
Found 01
Found 56
Here is one way to do it in Perl, using positive lookahead assertions:
#!/usr/bin/env perl
use strict;
use warnings;
my $number = "9422512322";
my #matches = $number =~ /(0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9))/g;
# We have only matched the first digit of each pair of digits.
# Add "1" to the first digit to generate the complete pair.
foreach my $first (#matches) {
my $pair = $first . ($first + 1);
print "Found: $pair\n";
}
Output:
Found: 12
Found: 23

How to change nested quotes?

I'm looking for way how to change quotes for fancy ones: "abc" -> «abc».
It works for me in simple situations and next step i am looking for is how to get it work also with nested quotes: "abc "d e f" ghi" -> «abc «d e f» ghi»
$pk =~ s/
"( # first qoute, start capture
[\p{Word}\.]+? # at least one word-char or point
.*?\b[\.,?!]*? # any char followed boundary + opt. punctuation
)" # stop capture, ending quote
/«$1»/xg; # change to fancy
I hoped regex will match 1st and 3rd quote and changes them. And it does. Problem is: i hoped then match again 2nd and 4th, but it wont, because 2nd is already left behind. One solution is to run same replacement again until there is less than 2 quote chars in.
Is there better way to achieve my goal? My approach won't work when there will be third level of nesting and this is not my goal, i stay with 2 levels.
NB! Changing startquote and enquote in separate replacement wont work because then will single doublequotes replaced too. I need to replace only when they appear as couple!
MORE examples:
"abc "d e f" -> «abc "d e f»
"abc"d e f" -> «abc"d e f»
This seems impossible:
"abc" d e f" -> «abc" d e f»
There is no general way to pair up nested double quotes. If your quotes are always next to the beginning or end of a word then this may work. It replaces a double quote that precedes a non-space character with an open quote, and one that succeeds a non-space character with an close quote.
use strict;
use warnings;
use utf8;
my $string = '"abc "d e f" ghi"';
$string =~ s/"(?=\S)/«/g;
$string =~ s/(?<=\S)"/»/g;
print $string;
output
«abc «d e f» ghi»
You can use negative lookaround assertions to find the matching directions on your fancy quotes. The double negations help handle the edge cases (e.g. end/beginning of line). I used << and >> instead of your fancy quotes here for simplicity.
use strict;
use warnings;
while (<DATA>) {
s/(?<!\S)"(?!\s)/<</g;
s/(?<!\s)"(?!\S)/>>/g;
print;
}
__DATA__
"abc "d e f" ghi"
Output:
<<abc <<d e f>> ghi>>

Count the number of matches of a particular character in a string matched by a regex wildcard

Can I keep a count of each different character matched in the regex itself ?
Suppose the regex goes looks like />(.*)[^a]+/
Can I keep a count of the occurrences of, say the letter p in the string captured by the group (.*)?
You would have to capture the string matched and process it separately.
This code demonstrates
use strict;
use warnings;
my $str = '> plantagenetgoosewagonattributes';
if ($str =~ />(.*)[^a]+/) {
my $substr = $1;
my %counts;
$counts{$_}++ for $substr =~ /./g;
print "'$_' - $counts{$_}\n" for sort keys %counts;
}
output
' ' - 1
'a' - 4
'b' - 1
'e' - 4
'g' - 3
'i' - 1
'l' - 1
'n' - 3
'o' - 3
'p' - 1
'r' - 1
's' - 1
't' - 5
'u' - 1
'w' - 1
Outside of the regex :
my $p_count = map /p/g, />(.*)[^a]/;
Self-contained:
local our $p_count;
/
(?{ 0 })
>
(?: p (?{ $^R + 1 })
| [^p]
)*
[^a]
(?{ $p_count = $^R; })
/x;
In both cases, you can easily expand this to count all letters. For example,
my %counts;
if (my ($seq = />(.*)[^a]/) {
++$counts{$_} for split //, $seq;
}
my $p_count = $counts{'p'};
AFAIK, you can't. You can only capture some group by parentheses and later check the length of data captured by that group.
Going along the lines of Borodin's solution , here is a pure bash one :
let count=0
testarray=(a b c d e f g h i j k l m n o p q r s t u v w x y z)
string="> plantagenetgoosewagonattributes" # the string
pattern=">(.*)[^a]+" # regex pattern
limitvar=${#testarray[#]} #array length
[[ $string =~ $pattern ]] &&
( while [ $count -lt $limitvar ] ; do sub="${BASH_REMATCH[1]//[^${testarray[$count]}]}" ; echo "${testarray[$count]} = ${#sub}" ; ((count++)) ; done )
Staring from bash 3.0 , bash has introduced the capture groups which can be accessed through BASH_REMATCH[n].
The Solution declares the characters to be counted as arrays [ Check out declare -a for array declaraton in complex cases] .A single character count would require no count variables ,no while construct but a variable for the character instead of an array .
If you are including ranges as in the code above , this array declaration does the exact thing .
testarray=(`echo {a..z}`)
An introduction of an if
loop will account for the display of 0 count characters . I wanted to keep the solution as simple as possible .
There is the experimental, don't-use-me, (?{ code }) construct...
From man perlre:
"(?{ code })"
WARNING: This extended regular expression feature is considered experimental, and may be
changed without notice. Code executed that has side effects may not perform identically
from version to version due to the effect of future optimisations in the regex engine.
If that didn't scare you off, here's an example that counts the number of "p"s
my $p_count;
">pppppbca" =~ /(?{ $p_count = 0 })>(p(?{$p_count++})|.)*[^a]+/;
print "$p_count\n";
First a remark: Due to the greediness of *, the last [^a]+ will never match more than one non-a character -- i.e., you might as well drop the +.
And as #mvf said, you need to capture the string that the wildcard matches to be able to count the characters in it. Perl regular expressions do not have a way to return a count of how many times a specific group matches -- the engine probably keeps the number around to support the {,n} mechanism, but you can't get at it.

Regular expression for selecting single spaced phrases but not whitespace

I need a rather complicated regular expression that will select words with one space between them and that can include the '-' symbol in them, it should not however select continuous whitespace.
'KENEDY JOHN G JR E' 'example' 'D-54'
I have tried the following regular expression:
\'([\s\w-]+)\'
but it selects continuous whitespace which I don't want it to do.
I want the expression to select
'KENEDY JOHN G JR E'
'example'
'D-54'
Perhaps,
\'([\w-]+(?:\s[\w-]+)*)\'
?
EDIT
If leading/trailing dashes (on the word boundaries) are not allowed, this should read:
/\'(\w+(?:[\s-]\w+)*)\'/
An expression like this should do it:
'[\w-]+(?:\s[\w-]+)*'
Try this:
my $data = "'KENEDY JOHN G JR E' 'example' 'D-54'";
# Sets of
# one or more word characters or dash
# followed by an optional space
# enclosed in single quotes
#
# The outermost ()s are optional. There just
# so i can print the match easily as $1.
while ($data =~ /(\'([\w-]+\s?)+\')/g)
{
print $1, "\n";
}
outputs
'KENEDY JOHN G JR E'
'example'
'D-54'
Not sure if this applies to you, since you asked for a regex specifically. However, if you want strings separated by two or more whitespace or dashes, you can use split
use strict;
use warnings;
use v5.10;
my $str = q('KENEDY JOHN G JR E' 'example' 'D-54');
my #match = split /\s{2,}/, $str;
say for #match;
A regex with similar functionality would be
my #match = $str =~ /(.*?)(?:\s{2,}|$)/g;
Note that you'll need the edge case of finding end of line $.
The benefit of using split or the wildcard . is that you rely on whitespace to define your fields, not the content of the fields themselves.
Your code actually works as is.
use feature qw( say );
$_ = "'KENEDY JOHN G JR E' 'example' 'D-54'";
say for /\'([\s\w-]+)\'/g;
output:
KENEDY JOHN G JR E
example
D-54
(Move the parens if you want the quotes too.)
I would simply use
my #data = /'([^']*)'/g;
If you have any validation to do, do it afterwards.