Removing spaces between single letters - regex

I have a string that may contain an arbitrary number of single-letters separated by spaces. I am looking for a regex (in Perl) that will remove spaces between all (unknown number) of single letters.
For example:
ab c d should become ab cd
a bcd e f gh should become a bcd ef gh
a b c should become abc
and
abc d should be unchanged (because there are no single letters followed by or preceded by a single space).
Thanks for any ideas.

Your description doesn't really match your examples. It looks to me like you want to remove any space that is (1) preceded by a letter which is not itself preceded by a letter, and (2) followed by a letter which is not itself followed by a letter. Those conditions can be expressed precisely as nested lookarounds:
/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))/
tested:
use strict;
use warnings;
use Test::Simple tests => 4;
sub clean {
(my $x = shift) =~ s/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))//g;
$x;
}
ok(clean('ab c d') eq 'ab cd');
ok(clean('a bcd e f gh') eq 'a bcd ef gh');
ok(clean('a b c') eq 'abc');
ok(clean('ab c d') eq 'ab cd');
output:
1..4
ok 1
ok 2
ok 3
ok 4
I'm assuming you really meant one space character (U+0020); if you want to match any whitespace, you might want to replace the space with \s+.

You can do this with lookahead and lookbehind assertions, as described in perldoc perlre:
use strict;
use warnings;
use Test::More;
is(tran('ab c d'), 'ab cd');
is(tran('a bcd e f gh'), 'a bcd ef gh');
is(tran('a b c'), 'abc');
is(tran('abc d'), 'abc d');
sub tran
{
my $input = shift;
(my $output = $input) =~ s/(?<![[:lower:]])([[:lower:]]) (?=[[:lower:]])/$1/g;
return $output;
}
done_testing;
Note the current code fails on the second test case, as the output is:
ok 1
not ok 2
# Failed test at test.pl line 7.
# got: 'abcd efgh'
# expected: 'a bcd ef gh'
ok 3
ok 4
1..4
# Looks like you failed 1 test of 4.
I left it like this as your second and third examples seem to contradict each other as to how leading single characters should be handled. However, this framework should be enough to allow you to experiment with different lookaheads and lookbehinds to get the exact results you are looking for.

This piece of code
#!/usr/bin/perl
use strict;
my #strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
foreach my $string (#strings) {
print "$string --> ";
$string =~ s/\b(\w)\s+(?=\w\b)/$1/g; # the only line that actually matters
print "$string\n";
}
prints this:
a b c --> abc
ab c d --> ab cd
a bcd e f gh --> a bcd ef gh
abc d --> abc d
I think/hope this is what you're looking for.

This should do the trick:
my $str = ...;
$str =~ s/ \b(\w) \s+ (\w)\b /$1$2/gx;
That removes the space between all single nonspace characters. Feel free to replace \S with a more restrictive character class if needed. There also may be some edge cases related to punctuation characters that you need to deal with, but I can't guess that from the info you have provided.
As Ether helpfully points out, this fails on one case. Here is a version that should work (though not quite as clean as the first):
s/ \b(\w) ( (?:\s+ \w\b)+ ) /$1 . join '', split m|\s+|, $2/gex;
I liked Ether's test based approach (imitation is the sincerest form of flattery and all):
use warnings;
use strict;
use Test::Magic tests => 4;
sub clean {
(my $x = shift) =~ s{\b(\w) ((?: \s+ (\w)\b)+)}
{$1 . join '', split m|\s+|, $2}gex;
$x
}
test 'space removal',
is clean('ab c d') eq 'ab cd',
is clean('a bcd e f gh') eq 'a bcd ef gh',
is clean('a b c') eq 'abc',
is clean('abc d') eq 'abc d';
returns:
1..4
ok 1 - space removal 1
ok 2 - space removal 2
ok 3 - space removal 3
ok 4 - space removal 4

It's not a regex but since I am lazy by nature I would it do this way.
#!/usr/bin/env perl
use warnings;
use 5.012;
my #strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
for my $string ( #strings ) {
my #s; my $t = '';
for my $el ( split /\s+/, $string ) {
if ( length $el > 1 ) {
push #s, $t if $t;
$t = '';
push #s, $el;
} else { $t .= $el; }
}
push #s, $t if $t;
say "#s";
}
OK, my way is the slowest:
no_regex 130619/s -- -60% -61% -63%
Alan_Moore 323328/s 148% -- -4% -8%
Eric_Storm 336748/s 158% 4% -- -5%
canavanin 352654/s 170% 9% 5% --
I didn't include Ether's code because ( as he has tested ) it returns different results.

Now I have the slowest and the fastest.
#!/usr/bin/perl
use 5.012;
use warnings;
use Benchmark qw(cmpthese);
my #strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
cmpthese( 0, {
Eric_Storm => sub{ for my $string (#strings) { $string =~ s{\b(\w) ((?: \s+ (\w)\b)+)}{$1 . join '', split m|\s+|, $2}gex; } },
canavanin => sub{ for my $string (#strings) { $string =~ s/\b(\w)\s+(?=\w\b)/$1/g; } },
Alan_Moore => sub{ for my $string (#strings) { $string =~ s/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))//g; } },
keep_uni => sub{ for my $string (#strings) { $string =~ s/\PL\pL\K (?=\pL(?!\pL))//g; } },
keep_asc => sub{ for my $string (#strings) { $string =~ s/[^a-zA-Z][a-zA-Z]\K (?=[a-zA-Z](?![a-zA-Z]))//g; } },
no_regex => sub{ for my $string (#strings) { my #s; my $t = '';
for my $el (split /\s+/, $string) {if (length $el > 1) { push #s, $t if $t; $t = ''; push #s, $el; } else { $t .= $el; } }
push #s, $t if $t;
#say "#s";
} },
});
.
Rate no_regex Alan_Moore Eric_Storm canavanin keep_uni keep_asc
no_regex 98682/s -- -64% -65% -66% -81% -87%
Alan_Moore 274019/s 178% -- -3% -6% -48% -63%
Eric_Storm 282855/s 187% 3% -- -3% -46% -62%
canavanin 291585/s 195% 6% 3% -- -45% -60%
keep_uni 528014/s 435% 93% 87% 81% -- -28%
keep_asc 735254/s 645% 168% 160% 152% 39% --

This will do the job.
(?<=\b\w)\s(?=\w\b)

Hi I have written simple javascript to do this it's simple and you can convert into any language.
function compressSingleSpace(source){
let words = source.split(" ");
let finalWords = [];
let tempWord = "";
for(let i=0;i<words.length;i++){
if(tempWord!='' && words[i].length>1){
finalWords.push(tempWord);
tempWord = '';
}
if(words[i].length>1){
finalWords.push(words[i]);
}else{
tempWord += words[i];
}
}
if(tempWord!=''){
finalWords.push(tempWord);
}
source = finalWords.join(" ");
return source;
}
function convertInput(){
let str = document.getElementById("inputWords").value;
document.getElementById("firstInput").innerHTML = str;
let compressed = compressSingleSpace(str);
document.getElementById("finalOutput").innerHTML = compressed;
}
label{
font-size:20px;
margin:10px;
}
input{
margin:10px;
font-size:15px;
padding:10px;
}
input[type="button"]{
cursor:pointer;
background: #ccc;
}
#firstInput{
color:red;
font-size:20px;
margin:10px;
}
#finalOutput{
color:green;
font-size:20px;
margin:10px;
}
<label for="inputWords">Enter your input and press Convert</label><br>
<input id="inputWords" value="check this site p e t z l o v e r . c o m thanks">
<input type="button" onclick="convertInput(this.value)" value="Convert" >
<div id="firstInput">check this site p e t z l o v e r . c o m thanks</div>
<div id="finalOutput">check this site petzlover.com thanks</div>

Related

Regex matching characters of one string in another in any order

Consider the following string wizard. I want to find if it is in another string in any order and in any case.
I tried the following
while(<>){if($_=~/(?:([wizard])(?!.*\1)){6}/i){print"0"}else{print"1"}}
For the inputs
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora
It printed 111111 but it must have been 111011.
So, I tried this instead (for the same inputs)
while(<>){if($_=~/(?=[wizard]{6})(?!.*(.).*\1).*/i){print"0"}else{print"1"}}
It again printed 111111.
In input number 4, we can make WaDriaz but only one a is needed. Anyway, we can spell wizard by rearranging and in any case. Why is it not working?
What is wrong with my code?
The following should be quite fast (especially if you inline the subs):
use feature qw( fc say );
sub make_key {
my %counts;
++$counts{$_} for split //, fc($_[0]) =~ s/\PL//rg;
return \%counts;
}
sub search {
my ($substr, $str) = #_;
$str = make_key($str);
no warnings qw( uninitialized );
return !( grep { $str->{$_} < $substr->{$_} } keys(%$substr) );
}
my $substr = make_key("wizard");
while (<>) {
chomp;
say search($substr, $_) ? 0 : 1;
}
Unlike virtually all of the previous solutions, this one doesn't consider latte to be in late.
The following is a regex based solution (with some prep). This should also be quite fast (especially if you inline the subs).
use feature qw( fc say );
sub make_re {
my $pat = join ".*?", map quotemeta, sort split //, fc($_[0]) =~ s/\PL//rg;
return qr/$pat/s;
}
sub search {
my ($substr, $str) = #_;
return ( join "", sort split //, $str ) =~ $substr;
}
my $substr = make_re("wizard"); # qr/a.*?d.*?i.*?r.*?w.*?z/is
while (<>) {
chomp;
say search($substr, $_) ? 0 : 1;
}
Finally, a purely regexp-based solution.
use feature qw( fc say );
sub make_re {
my %counts;
++$counts{$_} for split //, fc($_[0]) =~ s/\PL//rg;
my $pat =
join "",
map { "(?=".( ( ".*?" . quotemeta($_) ) x $counts{$_} ).")" }
#sort
keys(%counts);
return qr/^$pat/s;
}
my $re = make_re("wizard"); # qr/^(?=.*?a)(?=.*?d)(?=.*?i)(?=.*?r)(?=.*?w)(?=.*?z)/is
while (<>) {
say /$re/ ? 0 : 1;
}
Unlike virtually all of the previous solutions, none of mine consider latte to be in late.
Here is a pure regex: do a positive lookahead for each character
use warnings;
use strict;
use feature 'say';
use List::Util qw(uniq); # before v. 1.45 in List::MoreUtils
my $string = shift // q(wizard);
my $patt = join '', map { qq{(?=.*\Q$_\E)} } uniq split //, $string;
# say $patt;
#--> (?=.*w)(?=.*i)(?=.*z)(?=.*a)(?=.*r)(?=.*d) (for wizard)
while (<DATA>) {
say "Found '$string' in: $_" if /^$patt/is;
}
__DATA__
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora
Being all in one regex, with an anchored lookahead and no overhead, this should be very fast.
The \Q...\E are there in case the search string contains regex-sensitive characters.
This code considers words with repeated characters (latte, rare) to "fit" a word without that (later). It is clarified in comments that this is indeed the wanted behavior: repeated characters need only be found once in the target (letter matches later etc).
I find canonicalization of the inputs and the pattern to be a more generalizable and understandable approach:
#!/usr/bin/env perl
use strict;
use warnings;
sub canonchars {
my %c;
$c{$_} = undef for map lc, grep /\S/, split //, $_[0];
sort keys %c;
}
sub pattern {
map "$_.*", canonchars($_[0]);
}
my %canonical;
while (my $line = <DATA>) {
last unless $line =~ /\S/;
push $canonical{join '', canonchars($line)}->#*, $line;
}
my $pat = qr/#{[join '', pattern('wizard')]}/;
for my $k (keys %canonical) {
if ($k =~ $pat) {
print for $canonical{$k}->#*;
}
}
__DATA__
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora
Output:
C:\Temp> perl t.pl
Wayne Drimaz
There is a lot of logic you are trying to fit into a regular expression pattern, and every edge case you find and fix will make it more complicated and more fragile.
It's just a simple matter of using a positive lookahead for every character.
my #stars = (
'Garry Kasparov',
'Bobby Fischer',
'Vladimir Kramnik',
'Wayne Drimaz',
'Lionel Messi',
'La Signora'
);
say /^(?=.*w)(?=.*i)(?=.*z)(?=.*a)(?=.*r)(?=.*d)/i ? 0 : 1 for #stars;
This outputs 111011.
No need for regular expressions... they just complicate things, especially if you're not looking for a string you know ahead of time. Just look for each character in turn after normalizing their cases.
#!/usr/bin/env perl
use strict;
use warnings;
sub contains_chars {
my ($needle, $haystack) = #_;
$haystack = lc $haystack;
my %positions;
for my $char (split //, lc $needle) {
my $p = index $haystack, $char, $positions{$char}//0;
return 1 if $p < 0;
$positions{$char} = $p + 1;
}
return 0;
}
while (<DATA>) {
print contains_chars("wIzArD", $_);
}
print "\n";
__DATA__
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora
This a black art way that out of order matching can occur.
Each character in the source string is visited only once.
No need to recurse the string from the beginning for each letter.
use strict;
use warnings;
while (my $line = <DATA>) {
if ( $line =~ /
(?:
.*?
(?:
(?(1)(?!))(w)
| (?(2)(?!))(i)
| (?(3)(?!))(z)
| (?(4)(?!))(a)
| (?(5)(?!))(r)
| (?(6)(?!))(d)
)
){6}/ix ) { print $line, "\n" }
}
__DATA__
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora
Wayne Drimaz
Captures can exist in two states, defined or undefined.
The essence of this black art is to use the state of captures as flags
to insure all the items are matched in the out-of-order state.
The above can also be written this way with the same result.
use strict;
use warnings;
while (my $line = <DATA>) {
if ( $line =~ /
(?im)
^
(?>
.*?
(?:
w ( ) # (1)
| i ( ) # (2)
| z ( ) # (3)
| a ( ) # (4)
| r ( ) # (5)
| d ( ) # (6)
)
)+
(?= \1 \2 \3 \4 \5 \6 )
/x ) { print $line, "\n" }
}
__DATA__
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora
Just adding a simple variant of /w/ && /i/ && /z/ ... solution I mentioned in comments. If you want this solution to match a lot of different strings, instead of lining the regexes together with &&, you can simply loop over the characters. A useful tool is to use the &&= operator to mimic the behaviour of a long string of conditions. This will also allow us to short-circuit the matching if we find a mismatch, giving us a speed benefit.
For example:
/a/ && /b/ && /c/
Equivalent to
my $match = 1;
for my $w (qw(a b c)) {
$match &&= (/$w/); # $match = ($match && /$w/)
}
To remember the count of letters, i.e. whether latte should be considered to be a substring of late, you can simply use the substitution operator s/// instead of match operator m//. I added the multi-letter criteria, and added two test cases to demonstrate.
I like this solution because of the simplicity, but thereby not said it is the best one.
use strict;
use warnings;
my $word = "wizzard";
while (<DATA>) {
print search($_, $word), " $_";
}
sub search {
my ($str, $substr) = #_;
my $match = 1; # assume true
for my $w (split //, $substr) { # for each char in substr...
$match &&= ($str =~ s/\Q$w//i); # ...remove character
return 0 if not $match; # ...return false if no match found
}
return 1 if $match;
}
__DATA__
wizard
wizzard
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Wayne Drimazz
Lionel Messi
La Signora
Output:
0 wizard
1 wizzard
0 Garry Kasparov
0 Bobby Fischer
0 Vladimir Kramnik
0 Wayne Drimaz
1 Wayne Drimazz
0 Lionel Messi
0 La Signora
If you do not care about multi-letter matching, just replace s/// with m//.

Delete everything except what matches a regex pattern

I have a string and I want to replace everything but the pattern.
Right now I know what I want to do is
$line =~ s/[A-Z]{4}[0-9]{4}//g;
but inverted so that it replaces everything with nothing except the pattern.
It is a bit unclear what you are asking, but you may wish to try something like the following, which captures the pattern and then replaces the line with the capture:
#!/usr/bin/env perl
use warnings;
use strict;
my #lines = (
'HELLO WORLD',
'HELLO ABCD1234 WORLD',
'HELLOABCD1234WORLD',
'H E L LO ABCD1234 WORLD',
);
my $re_match = qr([A-Z]{4}[0-9]{4});
for my $line (#lines) {
print "$line => ";
if ($line =~ $re_match) {
$line =~ s|^.*($re_match).*$|$1|;
print $line . "\n";
} else {
print "does not match pattern $re_match \n";
}
}
Output
HELLO WORLD => does not match pattern (?^:[A-Z]{4}[0-9]{4})
HELLO ABCD1234 WORLD => ABCD1234
HELLOABCD1234WORLD => ABCD1234
H E L LO ABCD1234 WORLD => ABCD1234
perl -E '$_="xxABCD1234xxABCD1234xx"; #m = $_ =~ /[A-Z]{4}[0-9]{4}/g; #m and $_ = join "", #m; say'
Output:
ABCD1234ABCD1234

Finding indexes of non-empty fields in text files

I have a very large text file whose lines are comma-separated values. Some values are missing. For each line, I'd like to print the index and value of all the non-empty fields.
For example, a line might look like
,,10.3,,,,5.2,3.1,,,,,,,
in which case the output I want is
2,10.3,6,5.2,7,3.1
I know how to accomplish this by first splitting the input into an array, and then going through the array with a for loop, but these are huge files (multi-gigabyte) and I'm wondering if there is a faster way. (e.g. using some advanced regexp)
I haven't benchmarked it yet, but I would assume
my $line = ",,10.3,,,,5.2,3.1,,,,,,,";
my $index = 0;
print join ",",
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
is faster than some advanced regexp.
The problem is that as long as you have to know the index, you still have to keep track of those missing entries somehow.
Something like this might not be too slow though:
my ($i, #vars);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
print join ",", #vars;
You could probably leave out the first capturing group and use pos() to work out the index.
Here's a comparison of my two suggestions and sin's with 1M iterations:
Rate flesk1 sin flesk2
flesk1 87336/s -- -8% -27%
sin 94518/s 8% -- -21%
flesk2 120337/s 38% 27% --
Seems like my regex works better than I thought.
You might be able to mix and match regex and code -
$line =~ /(?{($cnt,#ary)=(0,)})^(?:([^,]+)(?{push #ary,$cnt; push #ary,$^N})|,(?{$cnt++}))+/x
and print join( ',', #ary);
expanded -
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
and print join( ',', #ary);
some benchmarks
With a slight tweak of flesk's and sln's (look for fleskNew and slnNew),
the winner is the fleskNew when the substitution operator is removed.
code -
use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;
cmpthese( -5, {
flesk1 => sub{
$index = 0;
join ",",
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
},
flesk2 => sub{
($i, #vars) = (0,);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
$line = $samp;
},
fleskNew => sub{
($i, #vars) = (0,);
while ($line =~ /(,*)([^,]+)/g) {
push #vars, $i += length($1), $2;
}
},
sln1 => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
},
slnNew => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
(?:
(,*) (?{$cnt += length($^N)})
([^,]+) (?{push #ary, $cnt,$^N})
)+
/x
},
} );
numbers -
Rate flesk1 sln1 flesk2 slnNew fleskNew
flesk1 20325/s -- -51% -52% -56% -60%
sln1 41312/s 103% -- -1% -10% -19%
flesk2 41916/s 106% 1% -- -9% -17%
slnNew 45978/s 126% 11% 10% -- -9%
fleskNew 50792/s 150% 23% 21% 10% --
some benchmarks 2
Adds Birei's in-line replacment and trim (all-in-one) solution.
Abberations:
Flesk1 is modified to remove the final 'join' as it is not included in
the other regex solutions. This gives it a chance to bench better.
Birei deviates in the bench as it modifies the original string to be the final solution.
That aspect can't be taken out. The difference between Birei1 and BireiNew is that the
new one removes the final ','.
Flesk2, Birei1 and BireiNew have the additional overhead of restoring the original string
due to the substitution operator.
The winner still looks like FleskNew ..
code -
use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;
cmpthese( -5, {
flesk1a => sub{
$index = 0;
map {join ",", #$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
},
flesk2 => sub{
($i, #vars) = (0,);
while ($line =~ s/^(,*)([^,]+)//) {
push #vars, $i += length($1), $2;
}
$line = $samp;
},
fleskNew => sub{
($i, #vars) = (0,);
while ($line =~ /(,*)([^,]+)/g) {
push #vars, $i += length($1), $2;
}
},
sln1 => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
^(?:
([^,]+) (?{push #ary,$cnt; push #ary,$^N})
| , (?{$cnt++})
)+
/x
},
slnNew => sub{
$line =~ /
(?{($cnt,#ary)=(0,)})
(?:
(,*) (?{$cnt += length($^N)})
([^,]+) (?{push #ary, $cnt,$^N})
)+
/x
},
Birei1 => sub{
$i = -1;
$line =~
s/
(?(?=,+)
( (?: , (?{ ++$i }) )+ )
| (?<no_comma> [^,]+ ,? ) (?{ ++$i })
)
/
defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]
/xge;
$line = $samp;
},
BireiNew => sub{
$i = 0;
$line =~
s/
(?: , (?{++$i}) )*
(?<data> [^,]* )
(?: ,*$ )?
(?= (?<trailing_comma> ,?) )
/
length $+{data} ? "$i,$+{data}$+{trailing_comma}" : ""
/xeg;
$line = $samp;
},
} );
results -
Rate BireiNew Birei1 flesk1a flesk2 sln1 slnNew fleskNew
BireiNew 6030/s -- -18% -74% -85% -86% -87% -88%
Birei1 7389/s 23% -- -68% -82% -82% -84% -85%
flesk1a 22931/s 280% 210% -- -44% -45% -51% -54%
flesk2 40933/s 579% 454% 79% -- -2% -13% -17%
sln1 41752/s 592% 465% 82% 2% -- -11% -16%
slnNew 47088/s 681% 537% 105% 15% 13% -- -5%
fleskNew 49563/s 722% 571% 116% 21% 19% 5% --
Using a regex (although I'm sure it can be simpler):
s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
Explanation:
s/PATTERN/REPLACEMENT/ge # g -> Apply to all occurrences
# e -> Evaluate replacement as a expression.
(?
(?=,+) # Check for one or more commas.
((?:,(?{ ++$i }))+) # If (?=,+) was true, increment variable '$i' with each comma found.
|
(?<no_comma>[^,]+,?)(?{ ++$i }) # If (?=,+) was false, get number between comma and increment the $i variable only once.
)
defined $+{no_comma} # If 'no_comma' was set in 'pattern' expression...
$i . qq[,] . $+{no_comma} # insert the position just before it.
qq[] # If wasn't set, it means that pattern matched only commas, so remove then.
My test:
Content of script.pl:
use warnings;
use strict;
while ( <DATA> ) {
our $i = -1;
chomp;
printf qq[Orig = $_\n];
s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
# s/,\Z//;
printf qq[Mod = $_\n\n];
}
__DATA__
,,10.3,,,,5.2,3.1,,,,,,,
10.3,,,,5.2,3.1,,,,,,,
,10.3,,,,5.2,3.1
,,10.3,5.2,3.1,
Run the script like:
perl script.pl
And output:
Orig = ,,10.3,,,,5.2,3.1,,,,,,,
Mod = 2,10.3,6,5.2,7,3.1,
Orig = 10.3,,,,5.2,3.1,,,,,,,
Mod = 0,10.3,4,5.2,5,3.1,
Orig = ,10.3,,,,5.2,3.1
Mod = 1,10.3,5,5.2,6,3.1
Orig = ,,10.3,5.2,3.1,
Mod = 2,10.3,3,5.2,4,3.1,
As you can see, it keeps last comma. I don't know how to remove it without an extra regex, just uncomment s/,\Z//; in previous code.

How can a Perl regex re-use part of the previous match for the next match?

I need some Perl regular expression help. The following snippet of code:
use strict;
use warnings;
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my #results = ();
1 while $str =~ s/(.{2}\b$word\b.{2})/push(#results,"$1\n")/e;
print #results;
Produces the following output:
A plus B
D plus E
2 plus F
H plus I
4 plus J
5 plus K
What I want to see is this, where a character already matched can appear in a new match in a different context:
A plus B
D plus E
E plus F
H plus I
I plus J
J plus K
How do I change the regular expression to get this result? Thanks --- Dan
General advice: Don't use s/// when you want m//. Be specific in what you match.
The answer is pos:
#!/usr/bin/perl -l
use strict;
use warnings;
my $str = 'In this example, ' . 'A plus B equals C, ' .
'D plus E plus F equals G ' .
'and H plus I plus J plus K equals L';
my $word = "plus";
my #results;
while ( $str =~ /([A-Z] $word [A-Z])/g ) {
push #results, $1;
pos($str) -= 1;
}
print "'$_'" for #results;
Output:
C:\Temp> b
'A plus B'
'D plus E'
'E plus F'
'H plus I'
'I plus J'
'J plus K'
You can use a m//g instead of s/// and assign to the pos function to rewind the match location before the second term:
use strict;
use warnings;
my $str = 'In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L';
my $word = 'plus';
my #results;
while ($str =~ /(.{2}\b$word\b(.{2}))/g) {
push #results, "$1\n";
pos $str -= length $2;
}
print #results;
Another option is to use a lookahead:
use strict;
use warnings;
my $str = "In this example, A plus B equals C, D plus E "
. "plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my $chars = 2;
my #results = ();
push #results, $1
while $str =~ /(?=((.{0,$chars}?\b$word\b).{0,$chars}))\2/g;
print "'$_'\n" for #results;
Within the lookahead, capturing group 1 matches the word along with a variable number of leading and trailing context characters, up to whatever maximum you've set. When the lookahead finishes, the backreference \2 matches "for real" whatever was captured by group 2, which is the same as group 1 except that it stops at the end of the word. That sets pos where you want it, without requiring you to calculate how many characters you actually matched after the word.
Given the "Full Disclosure" comment (but assuming .{0,35}, not .{35}), I'd do
use List::Util qw/max min/;
my $context = 35;
while ( $str =~ /\b$word\b/g ) {
my $pre = substr( $str, max(0, $-[0] - $context), min( $-[0], $context ) );
my $post = substr( $str, $+[0], $context );
my $match = substr( $str, $-[0], $+[0] - $-[0] );
$pre =~ s/.*\n//s;
$post =~ s/\n.*//s;
push #results, "$pre$match$post";
}
print for #results;
You'd skip the substitutions if you really meant (?s:.{0,35}).
Here's one way to do it:
use strict;
use warnings;
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my #results = ();
my $i = 0;
while (substr($str, $i) =~ /(.{2}\b$word\b.{2})/) {
push #results, "$1\n";
$i += $-[0] + 1;
}
print #results;
It's not terribly Perl-ish, but it works and it doesn't use too many obscure regular expression tricks. However, you might have to look up the function of the special variable #- in perlvar.
don't have to use regex. basically, just split up the string, use a loop to go over each items, check for "plus" , then get the word from before and after.
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
#s = split /\s+/,$str;
for($i=0;$i<=scalar #s;$i++){
if ( "$s[$i]" eq "plus" ){
print "$s[$i-1] plus $s[$i+1]\n";
}
}

Replace multiple instances in a string in perl

I have the following use case, input present in the file as:
Line1 : AA BB CC DD EE
I want to replace this with
1 2 3 4 5
Output
Line1: 1 2 3 4 5
In one regular expression in Perl, can I do this
I was trying this but was unsucessful
my #arr1 = ("AA", "BB", "CC", "DD", "EE");
open F2, $file;
my $count = 0;
while (<F2>) {
my $str = $_;
$str =~ s/$arr[$count]/$count+1/g;
print to file
}
close(F2);
This doesn't do the trick any ideas
If I understand correctly, you want to replace every word with number (incremented by 1 after every word). Here is program with tests:
#!/usr/bin/perl
use strict;
use warnings;
use Test::More qw(no_plan);
sub replace {
my $str=shift;
my $count=1;
$str=~s/\w+/$count++/ge;
return $str;
}
is(replace('AA AA DD EE'),'1 2 3 4');
is(replace('A B C D E'),'1 2 3 4 5');
You need to do something to modify the file in place, which you are not currently doing. The easiest option would be to use File::Inplace (or to output to a second file).
Additionally you are not looping over the array, but over the lines on the file so it'll replace only $arr[0] for 1 on each line.
use strict;
use warnings;
use File::Inplace;
my #replacees = ("AA", "BB", "CC", "DD", "EE");
my $editor = new File::Inplace(file => "file.txt", regex => "\n");
while (my ($line) = $editor->next_line) {
my $count = 1
for my $replacee (#replacees) {
if ($line =~ m/$replacee/) {
$line =~ s/$replacee/$count/g;
}
$count = $count + 1;
}
$editor->replace_line($line);
}
$editor->commit;
As for writing to the same file, please note Vinko answer. As for replacing strings, please check this snippet:
my #arr1 = ("AA", "BB", "CC", "DD", "EE");
my %replacements = map { ($arr1[$_] => $_ + 1) } (0..$#arr1);
my $regexp = join( '|', sort { length($b) <=> length($a) } #arr1);
open F2, $file;
while (<F2>) {
my $str = $_;
$str =~ s/($regexp)/$replacements{$1}/ge;
print $str;
}
close(F2);
Important parts:
my %replacements = map { ($arr1[$_] => $_ + 1) } (0..$#arr1);
It builds hash with keys from #arr1, and values are the index of given value in #arr1 incremented by 1.
For example, for #arr1 = ("a", "b", "d", "c"); %replacements will be: ("a" => 1, "b", => 2, "c" => 4, "d" => 3);
my $regexp = join( '|', sort { length($b) <=> length($a) } #arr1);
This builds base regexp for finding all words from #arr1. The sort part orders words by their length descending. So, for #arr1 = ("a", "ba", "bac") $regexp will be 'bac|ba|a'.
This ordering is important as otherwise there would be problems if any of the words would be prefix of any other word (as with "ba" and "bac" in my example).
As a last word, usage of filehandles as FH is rather discouraged, as these are globals, and generate "interesting" problems in more complex programs. Instead use open like this:
open my $fh, 'filename';
or better yet:
open my $fh, '<', 'filename';
First, a correction:
while (<F2>) {
my $str = $_;
If you want the line read to end up in $str, there is no reason to involve $_ in the process:
while ( my $str = ) {
which also brings up the point made by depesz that you should use lexical filehandles rather than package global bareword filehandles.
Now, looking at your loop:
my $count = 0;
while (my $str = <$input_fh>) {
$str =~ s/$arr[$count]/$count+1/g;
# ...
}
there seems to be an implicit assumption that there cannot be more lines in the file than the number of elements in #foo. In which case, you need not use $count: $. would do just fine. Say you are on the second line. Your code says you want to replace all occurrences of BB on that line with 2 which is different than what you describe verbally.
This is an important point: Any code you post ought to be consistent with the verbal description.
Anyway, here is one way:
rty.pl
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
my ($input) = #ARGV;
write_file(
$input, [
map { s/( ([A-Z]) \2 )/ord($2) - ord('A') + 1/gex; $_ } read_file $input
]
);
__END__
test.data:
Line1 : AA BB CC DD EE
Line1 : AA BB CC DD EE
Line1 : AA BB CC DD EE
Line1 : AA BB CC DD EE
$ rty.pl test.data
test.data after script invocation:
Line1 : 1 2 3 4 5
Line1 : 1 2 3 4 5
Line1 : 1 2 3 4 5
Line1 : 1 2 3 4 5
Either way it will work
my %arr2 = ('AA'=>1, 'BB'=>2,'CC'=>3,'DD'=>4,'EE'=>5,'FF'=>6);
open F2, "t1.txt";
open F3, ">out.txt";
while () {
my $str = $;
print F3 join ' ' ,map {s/$/$arr2{$}/g; $} split / /,$str;
print F3 "\n";
}
close(F2);
close(F3);
or
my #arr1 = ("AA", "BB", "CC", "DD", "EE","FF");
my %hashArr = map { ($arr1[$] => $ + 1) } (0..$#arr1);
open F2, "t1.txt";
open F3, ">out.txt";
while () {
my $str = $;
print F3 join ' ' ,map {s/$/$hashArr{$}/g; $} split / /,$str;
print F3 "\n";
}
close(F2);
close(F3);