Replace multiple instances in a string in perl - regex

I have the following use case, input present in the file as:
Line1 : AA BB CC DD EE
I want to replace this with
1 2 3 4 5
Output
Line1: 1 2 3 4 5
In one regular expression in Perl, can I do this
I was trying this but was unsucessful
my #arr1 = ("AA", "BB", "CC", "DD", "EE");
open F2, $file;
my $count = 0;
while (<F2>) {
my $str = $_;
$str =~ s/$arr[$count]/$count+1/g;
print to file
}
close(F2);
This doesn't do the trick any ideas

If I understand correctly, you want to replace every word with number (incremented by 1 after every word). Here is program with tests:
#!/usr/bin/perl
use strict;
use warnings;
use Test::More qw(no_plan);
sub replace {
my $str=shift;
my $count=1;
$str=~s/\w+/$count++/ge;
return $str;
}
is(replace('AA AA DD EE'),'1 2 3 4');
is(replace('A B C D E'),'1 2 3 4 5');

You need to do something to modify the file in place, which you are not currently doing. The easiest option would be to use File::Inplace (or to output to a second file).
Additionally you are not looping over the array, but over the lines on the file so it'll replace only $arr[0] for 1 on each line.
use strict;
use warnings;
use File::Inplace;
my #replacees = ("AA", "BB", "CC", "DD", "EE");
my $editor = new File::Inplace(file => "file.txt", regex => "\n");
while (my ($line) = $editor->next_line) {
my $count = 1
for my $replacee (#replacees) {
if ($line =~ m/$replacee/) {
$line =~ s/$replacee/$count/g;
}
$count = $count + 1;
}
$editor->replace_line($line);
}
$editor->commit;

As for writing to the same file, please note Vinko answer. As for replacing strings, please check this snippet:
my #arr1 = ("AA", "BB", "CC", "DD", "EE");
my %replacements = map { ($arr1[$_] => $_ + 1) } (0..$#arr1);
my $regexp = join( '|', sort { length($b) <=> length($a) } #arr1);
open F2, $file;
while (<F2>) {
my $str = $_;
$str =~ s/($regexp)/$replacements{$1}/ge;
print $str;
}
close(F2);
Important parts:
my %replacements = map { ($arr1[$_] => $_ + 1) } (0..$#arr1);
It builds hash with keys from #arr1, and values are the index of given value in #arr1 incremented by 1.
For example, for #arr1 = ("a", "b", "d", "c"); %replacements will be: ("a" => 1, "b", => 2, "c" => 4, "d" => 3);
my $regexp = join( '|', sort { length($b) <=> length($a) } #arr1);
This builds base regexp for finding all words from #arr1. The sort part orders words by their length descending. So, for #arr1 = ("a", "ba", "bac") $regexp will be 'bac|ba|a'.
This ordering is important as otherwise there would be problems if any of the words would be prefix of any other word (as with "ba" and "bac" in my example).
As a last word, usage of filehandles as FH is rather discouraged, as these are globals, and generate "interesting" problems in more complex programs. Instead use open like this:
open my $fh, 'filename';
or better yet:
open my $fh, '<', 'filename';

First, a correction:
while (<F2>) {
my $str = $_;
If you want the line read to end up in $str, there is no reason to involve $_ in the process:
while ( my $str = ) {
which also brings up the point made by depesz that you should use lexical filehandles rather than package global bareword filehandles.
Now, looking at your loop:
my $count = 0;
while (my $str = <$input_fh>) {
$str =~ s/$arr[$count]/$count+1/g;
# ...
}
there seems to be an implicit assumption that there cannot be more lines in the file than the number of elements in #foo. In which case, you need not use $count: $. would do just fine. Say you are on the second line. Your code says you want to replace all occurrences of BB on that line with 2 which is different than what you describe verbally.
This is an important point: Any code you post ought to be consistent with the verbal description.
Anyway, here is one way:
rty.pl
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
my ($input) = #ARGV;
write_file(
$input, [
map { s/( ([A-Z]) \2 )/ord($2) - ord('A') + 1/gex; $_ } read_file $input
]
);
__END__
test.data:
Line1 : AA BB CC DD EE
Line1 : AA BB CC DD EE
Line1 : AA BB CC DD EE
Line1 : AA BB CC DD EE
$ rty.pl test.data
test.data after script invocation:
Line1 : 1 2 3 4 5
Line1 : 1 2 3 4 5
Line1 : 1 2 3 4 5
Line1 : 1 2 3 4 5

Either way it will work
my %arr2 = ('AA'=>1, 'BB'=>2,'CC'=>3,'DD'=>4,'EE'=>5,'FF'=>6);
open F2, "t1.txt";
open F3, ">out.txt";
while () {
my $str = $;
print F3 join ' ' ,map {s/$/$arr2{$}/g; $} split / /,$str;
print F3 "\n";
}
close(F2);
close(F3);
or
my #arr1 = ("AA", "BB", "CC", "DD", "EE","FF");
my %hashArr = map { ($arr1[$] => $ + 1) } (0..$#arr1);
open F2, "t1.txt";
open F3, ">out.txt";
while () {
my $str = $;
print F3 join ' ' ,map {s/$/$hashArr{$}/g; $} split / /,$str;
print F3 "\n";
}
close(F2);
close(F3);

Related

Perl Regex - Getting Text Before and After Match

I am parsing a tab delimited file line by line:
Root rootrank 1 Bacteria domain .72 Firmicutes phylum 1 Clostridia class 1 etc.
=
while (my $line = <$fh>) {
chomp($line);
}
On every line, I want to capture the 1st entry before and after a particular match. For example, for the match phylum, I want to capture the entries Firmicutes and 1. For the match domain, I want to capture the entries Bacteria and .72. How would I write the regex to do this?
Sidenote: I can't simply split the line by tab into an array and use the index because sometimes a category is missing or there are extra categories, and that causes the entries to be shifted by one or two indices. And I want to avoid writing blocks of if statements.
You can still split the input, then map the words to indices, and use than use the indices corresponding to the matches to extract the neighbouring cells:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my #matches = qw( phylum domain );
while (<>) {
chomp;
my #cells = split /\t/;
my %indices;
#indices{ #cells } = 0 .. $#cells;
for my $match (#matches) {
if (defined( my $index = $indices{$match} )) {
say join "\t", #cells[ $index - 1 .. $index + 1 ];
}
}
}
What's missing:
You should handle the case when $index == 0 or $index == $#cells.
You should handle the case where some words are repeated in one line.
my $file = "file2.txt";
open my $fh, '<', $file or die "Unable to Open the file $file for reading: $!\n";
while (my $line = <$fh>) {
chomp $line;
while ($line =~ /(\w+)\s+(\w+)\s+(\.?\d+)/g) {
my ($before, $match, $after) = ($1, $2, $3);
print "Before: $before Match: $match After: $after\n";
}
}
You can just simply use the following regex to capture the words before and after of a matched word:
(?<LSH>[\w.]+)[\s\t](?<MATCH>.*?)[\s\t](?<RHS>[\w.]+)
see demo / explanation
You could do:
#!/usr/bin/perl
use Modern::Perl;
my #words = qw(phylum domain);
while(<DATA>) {
chomp;
for my $word (#words) {
my ($before, $after) = $_ =~ /(\S+)(?:\t\Q$word\E\t)(\S+)/i;
say "word: $word\tbefore: $before\tafter: $after";
}
}
__DATA__
Root rootrank 1 Bacteria domain .72 Firmicutes phylum 1 Clostridia class 1 etc.
Output:
word: phylum before: Firmicutes after: 1
word: domain before: Bacteria after: .72

Perl: Matching 3 pairs of numbers from 4 consecutive numbers

I am writing some code and I need to do the following:
Given a 4 digit number like "1234" I need to get 3 pairs of numbers (the first 2, the 2 in the middle, and the last 2), in this example I need to get "12" "23" and "34".
I am new to perl and don't know anything about regex. In fact, I am writing a script for personal use and I've started reading about Perl some days ago because I figured it was going to be a better language for the task at hand (need to do some statistics with the numbers and find patterns)
I have the following code but when testing I processed 6 digit numbers, because I "forgot" that the numbers I would be processing are 4 digits, so it failed with the real data, of course
foreach $item (#totaldata)
{
my $match;
$match = ($item =~ m/(\d\d)(\d\d)(\d\d)/);
if ($match)
{
($arr1[$i], $arr2[$i], $arr3[$i]) = ($item =~ m/(\d\d)(\d\d)(\d\d)/);
$processednums++;
$i++;
}
}
Thank you.
You can move last matching position with pos()
pos directly accesses the location used by the regexp engine to store the offset, so assigning to pos will change that offset..
my $item = 1234;
my #arr;
while ($item =~ /(\d\d)/g) {
push #arr, $1;
pos($item)--;
}
print "#arr\n"; # 12 23 34
The simplest way would be to use a global regex pattern search
It is nearly always best to separate verificaton of the input data from processing, so the program below first rejects any values that are not four characters long or that contain a non-digit character
Then the regex pattern finds all points in the string that are followed by two digits, and captures them
use strict;
use warnings 'all';
for my $val ( qw/ 1234 6572 / ) {
next if length($val) != 4 or $val =~ /\D/;
my #pairs = $val =~ /(?=(\d\d))/g;
print "#pairs\n";
}
output
12 23 34
65 57 72
Here's a pretty loud example demonstrating how you can use substr() to fetch out the portions of the number, while ensuring that what you're dealing with is in fact exactly a four-digit number.
use warnings;
use strict;
my ($one, $two, $three);
while (my $item = <DATA>){
if ($item =~ /^\d{4}$/){
$one = substr $item, 0, 2;
$two = substr $item, 1, 2;
$three = substr $item, 2, 2;
print "one: $one, two: $two, three: $three\n";
}
}
__DATA__
1234
abcd
a1b2c3
4567
891011
Output:
one: 12, two: 23, three: 34
one: 45, two: 56, three: 67
foreach $item (#totaldata) {
if ( my #match = $item =~ m/(?=(\d\d))/ ) {
($heads[$i], $middles[$i], $tails[$i]) = #match;
$processednums++;
$i++;
}
}

Find n occurrences from group of characters

Given a string, I am suppose to print "two" if i find exactly two characters from the group xyz.
Given jxyl print two
Given jxyzl print nothing
Given jxxl print two
I am very new to perl so this is my approach.
my $word = "jxyl";
#char = split //, $word;
my $size = $#char;
for ( $i = 0; $i < $size - 1; $i++ ) {
if ( $char[i] eq "x" || $char[i] eq "y" || $char eq "z" ) {
print "two";
}
}
Can anyone tell me why this is isn't working correctly?
From the FAQ:
perldoc -q count
How can I count the number of occurrences of a substring within a string?
use warnings;
use strict;
while (<DATA>) {
chomp;
my $count = () = $_ =~ /[xyz]/g;
print "$_ two\n" if $count == 2;
}
__DATA__
jxyl
jxyzl
jxxl
Outputs:
jxyl two
jxxl two
You basically want to count the number of specific characters in a string.
You can use tr:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my $count = $_ =~ tr/xyz//;
print "$_ - $count\n";
}
__DATA__
jxyl
jxyzl
jxxl
Outputs:
jxyl - 2
jxyzl - 3
jxxl - 2
Determining if there are exactly 2 can be done after the counting.
Definitely not the best way to do it, but here is a regex for fun and to show there is more than one way to do things.
perl -e'$word = "jxyl"; print "two" if $word =~ /^[^xyz]*[xyz][^xyz]*[xyz][^xyz]*$/'

replace {x} with param in string

I want to replace {x} where x is a number from 1-10 with a string from an array.
The array is populated by splitting a string with whitespace.
I have put together some code but the regex is probably wrong.
my #params = split(' ', "Paramtest: {0} {1} {2}");
my $count = #params;
for (my $i = 0; $i <= $count; $i++) {
my $param = #params->[$i];
$cmd_data =~ s/{"$i"}/"$param"/;
if(!$cmd_data) {
$server->command(sprintf("msg $target %s incorrect syntax for %s.", $nick, "!params p1 p2 p3"));
return;
}
}
$server->command(sprintf("msg $target %s.", $cmd_data));
Update
I've tried using the below code as a modified version of Miller's (the first answer)
my #params = split(' ', "!fruit oranges apples");
my $cmd_data = "Fruits: {0} {1}";
$cmd_data =~ s{\{(\d+)\}}{
$params[$1] // die "Not found $1" #line 160
}eg;
$server->command(sprintf("msg $target %s.", $cmd_data));
Output
Not found 1 at myscript.pl line 160.
Perhaps a more generalized search and replace will serve you better:
use strict;
use warnings;
my #params = qw(zero one two three four five six seven eight);
my $string = 'My String: {0} {1} {2}';
$string =~ s{\{(\d+)\}}{
$params[$1] // die "Not found $1"
}eg;
print $string;
Outputs:
My String: zero one two

Removing spaces between single letters

I have a string that may contain an arbitrary number of single-letters separated by spaces. I am looking for a regex (in Perl) that will remove spaces between all (unknown number) of single letters.
For example:
ab c d should become ab cd
a bcd e f gh should become a bcd ef gh
a b c should become abc
and
abc d should be unchanged (because there are no single letters followed by or preceded by a single space).
Thanks for any ideas.
Your description doesn't really match your examples. It looks to me like you want to remove any space that is (1) preceded by a letter which is not itself preceded by a letter, and (2) followed by a letter which is not itself followed by a letter. Those conditions can be expressed precisely as nested lookarounds:
/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))/
tested:
use strict;
use warnings;
use Test::Simple tests => 4;
sub clean {
(my $x = shift) =~ s/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))//g;
$x;
}
ok(clean('ab c d') eq 'ab cd');
ok(clean('a bcd e f gh') eq 'a bcd ef gh');
ok(clean('a b c') eq 'abc');
ok(clean('ab c d') eq 'ab cd');
output:
1..4
ok 1
ok 2
ok 3
ok 4
I'm assuming you really meant one space character (U+0020); if you want to match any whitespace, you might want to replace the space with \s+.
You can do this with lookahead and lookbehind assertions, as described in perldoc perlre:
use strict;
use warnings;
use Test::More;
is(tran('ab c d'), 'ab cd');
is(tran('a bcd e f gh'), 'a bcd ef gh');
is(tran('a b c'), 'abc');
is(tran('abc d'), 'abc d');
sub tran
{
my $input = shift;
(my $output = $input) =~ s/(?<![[:lower:]])([[:lower:]]) (?=[[:lower:]])/$1/g;
return $output;
}
done_testing;
Note the current code fails on the second test case, as the output is:
ok 1
not ok 2
# Failed test at test.pl line 7.
# got: 'abcd efgh'
# expected: 'a bcd ef gh'
ok 3
ok 4
1..4
# Looks like you failed 1 test of 4.
I left it like this as your second and third examples seem to contradict each other as to how leading single characters should be handled. However, this framework should be enough to allow you to experiment with different lookaheads and lookbehinds to get the exact results you are looking for.
This piece of code
#!/usr/bin/perl
use strict;
my #strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
foreach my $string (#strings) {
print "$string --> ";
$string =~ s/\b(\w)\s+(?=\w\b)/$1/g; # the only line that actually matters
print "$string\n";
}
prints this:
a b c --> abc
ab c d --> ab cd
a bcd e f gh --> a bcd ef gh
abc d --> abc d
I think/hope this is what you're looking for.
This should do the trick:
my $str = ...;
$str =~ s/ \b(\w) \s+ (\w)\b /$1$2/gx;
That removes the space between all single nonspace characters. Feel free to replace \S with a more restrictive character class if needed. There also may be some edge cases related to punctuation characters that you need to deal with, but I can't guess that from the info you have provided.
As Ether helpfully points out, this fails on one case. Here is a version that should work (though not quite as clean as the first):
s/ \b(\w) ( (?:\s+ \w\b)+ ) /$1 . join '', split m|\s+|, $2/gex;
I liked Ether's test based approach (imitation is the sincerest form of flattery and all):
use warnings;
use strict;
use Test::Magic tests => 4;
sub clean {
(my $x = shift) =~ s{\b(\w) ((?: \s+ (\w)\b)+)}
{$1 . join '', split m|\s+|, $2}gex;
$x
}
test 'space removal',
is clean('ab c d') eq 'ab cd',
is clean('a bcd e f gh') eq 'a bcd ef gh',
is clean('a b c') eq 'abc',
is clean('abc d') eq 'abc d';
returns:
1..4
ok 1 - space removal 1
ok 2 - space removal 2
ok 3 - space removal 3
ok 4 - space removal 4
It's not a regex but since I am lazy by nature I would it do this way.
#!/usr/bin/env perl
use warnings;
use 5.012;
my #strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
for my $string ( #strings ) {
my #s; my $t = '';
for my $el ( split /\s+/, $string ) {
if ( length $el > 1 ) {
push #s, $t if $t;
$t = '';
push #s, $el;
} else { $t .= $el; }
}
push #s, $t if $t;
say "#s";
}
OK, my way is the slowest:
no_regex 130619/s -- -60% -61% -63%
Alan_Moore 323328/s 148% -- -4% -8%
Eric_Storm 336748/s 158% 4% -- -5%
canavanin 352654/s 170% 9% 5% --
I didn't include Ether's code because ( as he has tested ) it returns different results.
Now I have the slowest and the fastest.
#!/usr/bin/perl
use 5.012;
use warnings;
use Benchmark qw(cmpthese);
my #strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
cmpthese( 0, {
Eric_Storm => sub{ for my $string (#strings) { $string =~ s{\b(\w) ((?: \s+ (\w)\b)+)}{$1 . join '', split m|\s+|, $2}gex; } },
canavanin => sub{ for my $string (#strings) { $string =~ s/\b(\w)\s+(?=\w\b)/$1/g; } },
Alan_Moore => sub{ for my $string (#strings) { $string =~ s/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))//g; } },
keep_uni => sub{ for my $string (#strings) { $string =~ s/\PL\pL\K (?=\pL(?!\pL))//g; } },
keep_asc => sub{ for my $string (#strings) { $string =~ s/[^a-zA-Z][a-zA-Z]\K (?=[a-zA-Z](?![a-zA-Z]))//g; } },
no_regex => sub{ for my $string (#strings) { my #s; my $t = '';
for my $el (split /\s+/, $string) {if (length $el > 1) { push #s, $t if $t; $t = ''; push #s, $el; } else { $t .= $el; } }
push #s, $t if $t;
#say "#s";
} },
});
.
Rate no_regex Alan_Moore Eric_Storm canavanin keep_uni keep_asc
no_regex 98682/s -- -64% -65% -66% -81% -87%
Alan_Moore 274019/s 178% -- -3% -6% -48% -63%
Eric_Storm 282855/s 187% 3% -- -3% -46% -62%
canavanin 291585/s 195% 6% 3% -- -45% -60%
keep_uni 528014/s 435% 93% 87% 81% -- -28%
keep_asc 735254/s 645% 168% 160% 152% 39% --
This will do the job.
(?<=\b\w)\s(?=\w\b)
Hi I have written simple javascript to do this it's simple and you can convert into any language.
function compressSingleSpace(source){
let words = source.split(" ");
let finalWords = [];
let tempWord = "";
for(let i=0;i<words.length;i++){
if(tempWord!='' && words[i].length>1){
finalWords.push(tempWord);
tempWord = '';
}
if(words[i].length>1){
finalWords.push(words[i]);
}else{
tempWord += words[i];
}
}
if(tempWord!=''){
finalWords.push(tempWord);
}
source = finalWords.join(" ");
return source;
}
function convertInput(){
let str = document.getElementById("inputWords").value;
document.getElementById("firstInput").innerHTML = str;
let compressed = compressSingleSpace(str);
document.getElementById("finalOutput").innerHTML = compressed;
}
label{
font-size:20px;
margin:10px;
}
input{
margin:10px;
font-size:15px;
padding:10px;
}
input[type="button"]{
cursor:pointer;
background: #ccc;
}
#firstInput{
color:red;
font-size:20px;
margin:10px;
}
#finalOutput{
color:green;
font-size:20px;
margin:10px;
}
<label for="inputWords">Enter your input and press Convert</label><br>
<input id="inputWords" value="check this site p e t z l o v e r . c o m thanks">
<input type="button" onclick="convertInput(this.value)" value="Convert" >
<div id="firstInput">check this site p e t z l o v e r . c o m thanks</div>
<div id="finalOutput">check this site petzlover.com thanks</div>