Select characters that appear only once in a string - regex

Is it possible to select characters that appear only once?
I am familiar with negative look-behind, and tried the following
/(.)(?<!\1.*)/
but could not get it to work.
examples:
given AXXDBD it should output ADBD
^^ - this is unacceptable
given 123558 it should output 1238
^^ - this is unacceptable
thanks in advance for the help

There are probably a lot of approaches to this, but I think you're looking for something like
(.)\1{1,}
That is, any character followed by the same character at least once.
Your question is tagged with both PHP and JS, so:
PHP:
$str = preg_replace('/(.)\1{1,}/', '', $str);
JS:
str = str.replace(/(.)\1{1,}/g, '');
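As a quick check of the JS variant against the two examples from the question, here is a minimal sketch. The function name removeRepeatedRuns is made up for illustration, and \1+ is used as an equivalent spelling of \1{1,}.

```javascript
// Hypothetical helper: delete every run of two or more identical
// consecutive characters in a single pass.
function removeRepeatedRuns(str) {
  // (.) captures one character, \1+ requires the same character
  // to follow at least once more.
  return String(str).replace(/(.)\1+/g, '');
}

console.log(removeRepeatedRuns('AXXDBD')); // ADBD
console.log(removeRepeatedRuns('123558')); // 1238
```

Note that this is a single pass: characters that only become neighbours after a run is removed are not re-examined, so 'ABBA' comes out as 'AA'.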

Without using a regular expression:
function not_twice ($str) {
$str = (string)$str;
$new_str = '';
$prev = false;
for ($i=0; $i < strlen($str); $i++) {
if ($str[$i] !== $prev) {
$new_str .= $str[$i];
}
$prev = $str[$i];
}
return $new_str;
}
Collapses each run of consecutive identical characters to a single character (so AXXDBD becomes AXDBD, not ADBD) and casts numbers to strings in case you need that too.
Testing:
$string = [
'AXXDBD',
'123558',
12333
];
$string = array_map('not_twice', $string);
echo '<pre>' . print_r($string, true) . '</pre>';
Outputs:
Array
(
[0] => AXDBD
[1] => 12358
[2] => 123
)


RegEx for capturing multiple groups between two words

Consider the following strings:
targethelloluketestlukeluketestluktestingendtarget
sourcehelloluketestlukeluketestluktestingendsource
I want to replace all instances of luke with something else, but only if it's between target...endtarget, not when it's between source...endsource. The result should be that all three instances of luke in the top string are replaced with whatever I want.
I got this far, but this only captures one instance of luke. How do I replace all of them?
(?<=target)(?:.*?(luke).*?)(?=target)
SOLUTION
Thanks to the help of this great community, I arrived at the following solution. I find RegEx really convoluted when it comes to this, but in PHP the following works great and is a lot easier to understand:
function replaceBetweenTags($starttag, $endtag, $replace, $with, $text) {
$starttag = escapeStringToRegEx($starttag);
$endtag = escapeStringToRegEx($endtag);
$text = preg_replace_callback(
'/' . $starttag . '.*?' . $endtag . '/',
function ($matches) use ($replace, $with) {
return str_replace($replace, $with, $matches[0]);
},
$text
);
return $text;
}
function escapeStringToRegEx($string)
{
$string = str_replace('\\', '\\\\', $string);
$string = str_replace('.', '\.', $string);
$string = str_replace('^', '\^', $string);
$string = str_replace('$', '\$', $string);
$string = str_replace('*', '\*', $string);
$string = str_replace('+', '\+', $string);
$string = str_replace('-', '\-', $string);
$string = str_replace('?', '\?', $string);
$string = str_replace('(', '\(', $string);
$string = str_replace(')', '\)', $string);
$string = str_replace('[', '\[', $string);
$string = str_replace(']', '\]', $string);
$string = str_replace('{', '\{', $string);
$string = str_replace('}', '\}', $string);
$string = str_replace('|', '\|', $string);
$string = str_replace(' ', '\s', $string);
$string = str_replace('/', '\/', $string);
return $string;
}
I'm aware that escapeStringToRegEx is really quick and dirty, and maybe not even entirely correct, but it's a good starting point to work from. (PHP's built-in preg_quote($string, '/') covers most of the same ground.)
Here is a solution using a PHP regex callback function:
$input = "luke is here and targethelloluketestlukeluketestluktestingendtarget and luke is also here";
$output = preg_replace_callback(
"/target.*?endtarget/",
function ($matches) {
return str_replace("luke", "peter", $matches[0]);
},
$input
);
echo $output;
This prints:
luke is here and targethellopetertestpeterpetertestluktestingendtarget and luke is also here
Note that occurrences of luke have been replaced with peter only inside the target ... endtarget bounds.
You can use
(?:\G(?!\A)|target)(?:(?!luke|(?:end)?target).)*\Kluke(?=(?:(?!(?:end)?target).)*endtarget)
See the regex demo. If the string has line breaks, you need to use the s flag, or prepend the pattern with (?s) inline PCRE_DOTALL modifier.
Regex details:
(?:\G(?!\A)|target) - either the end of the previous successful match or target string
(?:(?!luke|(?:end)?target).)* - any one char, zero or more occurrences but as many as possible, that is not a starting point for the endtarget, target or luke char sequence
\K - a match reset operator that discards the text matched so far
luke - string to replace
(?=(?:(?!(?:end)?target).)*endtarget) - a positive lookahead that matches a location that must be immediately followed with
(?:(?!(?:end)?target).)* - any one char, zero or more occurrences but as many as possible that is not a starting point for the endtarget or target char sequence
endtarget - an endtarget string.
If you can use preg_replace_callback, use it:
preg_replace_callback('/target.*?endtarget/s', function ($m) {
return str_replace("luke", "<SOME>", $m[0]);
}, $input)
Or, unrolling the loop:
preg_replace_callback('/target[^e]*(?:e(?!ndtarget)[^e]*)*endtarget/', function ($m) {
return str_replace("luke", "<SOME>", $m[0]);
}, $input)
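For comparison, the same "match the whole span, then replace inside the callback" idea can be sketched in JavaScript. This is an illustration under assumptions rather than a port of the PHP above: replaceBetween and escapeRegExp are hypothetical names, and [\s\S]*? stands in for .*? with the s flag.

```javascript
// Hypothetical helper: escape regex metacharacters so a tag is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace `needle` with `replacement`, but only inside startTag...endTag spans.
function replaceBetween(text, startTag, endTag, needle, replacement) {
  // Lazy match from each start tag to the nearest end tag, line breaks included.
  const span = new RegExp(
    escapeRegExp(startTag) + '[\\s\\S]*?' + escapeRegExp(endTag), 'g');
  // The callback sees one whole span; plain string replacement happens inside it.
  return text.replace(span, (m) => m.split(needle).join(replacement));
}

const input = 'luke is here and targethelloluketestlukeluketestluktestingendtarget and luke is also here';
console.log(replaceBetween(input, 'target', 'endtarget', 'luke', 'peter'));
// luke is here and targethellopetertestpeterpetertestluktestingendtarget and luke is also here
```

Keeping the span regex and the inner replacement separate avoids the \G and \K machinery, at the cost of a second pass over each matched span.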

Dynamically capture regular expression match in Perl

I'm trying to dynamically capture regex matches in Perl. I know that eval can help me do this, but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?
I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", @_[3,1,2] },
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}
I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are using eval with a BLOCK of code. That's like a try block: if the code dies inside the eval block, eval catches the error. It will not run your string as if it were code. You need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the pattern says: take the first argument (1$) and format it as a string (%s). You don't need to localize or assign to $_ to do a match. The =~ operator does that on other variables for you. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because of the {} as delimiter, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
qr{(\d+)/(\d+)/(\d+)} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d\d\d\d)} => '%3$4d-%2$02d-%1$02d',
qr{\d} => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
my @captures = ( $str =~ $pattern );
say "pattern: $pattern";
if ($#+ == 0) {
say " no capture groups";
next;
}
unless (@captures) {
say " no match";
next;
}
# debug-output
for my $i ( 1 .. $#- ) {
say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
}
say sprintf $testHash{$pattern}, @captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in @captures by looking at it in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element ($# as the sigil for arrays that usually have an @ sigil) in @+, which holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if this only has one element, we don't have capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.
Apologies! I realized that both my question and my sample code were vague. But after reading your suggestions I came up with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
foreach my $key (keys %testHash) {
if ( $str =~ $key ) {
my @matchArr = ($str =~ $key); # Capture all matches
# Search and replace (limited from $1 to $9)
for ( my $i = 0; $i < @matchArr; $i++ ) {
my $num = $i+1;
$testHash{$key} =~ s/\$$num/$matchArr[$i]/;
}
$result = $testHash{$key};
last;
}
}
print "$result\n";
Evaluating the regexp in list context returns the matches. So in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
my @a = ($str =~/$pattern/);
print Dumper(\@a);
}
would do the job.
HTH
Georg
Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my @patterns = map qr{$_}, qw{
(\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( @patterns ) {
my @captured = ($str =~ $pattern)
or next;
print "'$_'\n" for @captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my @tests = (
{
pattern => qr{(\d+)/(\d+)/(\d+)},
ncaptures => 3,
}
);
my $str = '1/12/2016';
foreach my $test ( @tests ) {
my @captured = ($str =~ $test->{pattern})
or next;
unless (@captured == $test->{ncaptures}) {
# handle failure
}
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my @tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
(\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( @tests ) {
my @captured = ($str =~ $test->{pattern});
ok @captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
"" =~ /|$_[0]/;
return $#+;
}
Output:
ok 1
1..1

Join, split and map using perl for creating new attribs

my $str = "<SampleElement oldattribs=\"sa1 sa2 sa3\">";
$str =~ s#<SampleElement[^>]*oldattribs="([^"]*)"#
my $fulcnt=$&;
my $afids=$1;
my @affs = ();
if($afids =~ m/\s+/) {
@affs = split /\s/, $afids;
my $jnafs = join ",", map { $_=~s/[a-z]*//i, } @affs;
($fulcnt." newattribs=\"$jnafs\"");
}
else {
($fulcnt);
}
#eg;
My Output:
<SampleElement oldattribs="sa1 sa2 sa3" newattribs="1,1,1">
Expected Output:
<SampleElement oldattribs="sa1 sa2 sa3" newattribs="1,2,3">
Could someone point out where I am going wrong? Thanks in advance.
Where you're going wrong is earlier than you think - you're parsing XML using regular expressions. XML is contextual, and regex isn't, so it's NEVER going to be better than a dirty hack.
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> parse ( \*DATA );
my $sample_elt = $twig -> get_xpath('//SampleElement',0);
my @old_att = split ( ' ', $sample_elt -> att('oldattribs') );
$sample_elt -> set_att('newattribs', join ",", map { /(\d+)/ } @old_att);
$twig -> set_pretty_print ( 'indented_a' );
$twig -> print;
__DATA__
<XML>
<SampleElement oldattribs="sa1 sa2 sa3">
</SampleElement>
</XML>
But to answer the core of your problem - you're misusing map as an iterator here.
map { $_=~s/[a-z]*//i, } @affs;
Because what that is doing is iterating all the elements in @affs and modifying those... but map is just returning the result of the expression - which is 1 because it worked.
If you want to change @affs you'd:
s/[a-z]*//i for @affs;
But if you didn't want to, then the easy answer is to use the r regex flag:
map { s/[a-z]*//ir } @affs;
Or as I've done in my example:
map { /(\d+)/ } @affs;
Which regex matches and captures the numeric part of the string, but as a result the 'captured' text is what's returned.
Here is a simple way to build shown output from the input $str.
Note: The input is in single quotes, not double. Then the \" isn't a problem in the regex.
my $str = '<SampleElement oldattribs=\"sa1 sa2 sa3\">';
# Pull 'sa1 sa2 sa3' string out of it
my ($attrs) = $str =~ /=\\"([^\\]+)/; # " # (turn off bad syntax highlight)
# Build '1,2,3' string from it
my $indices = join ',', map { /(\d+)/ } split ' ', $attrs;
# Extract content between < > so to add to it, put it back together
my ($content) = $str =~ /<(.*)>/;
my $output = '<' . $content . " newattribs=\"$indices\"" . '>';
This gives the required output.
Some of these can be combined into single statements, if you are into that. For example
my $indices =
join ',', map { /(\d+)/ } split ' ', ($str =~ /"([^\\]+)/)[0]; # "
$str =~ s/<(.*)>/<$1 newattribs=\"$indices\">/;
All of this can be rolled into one regex, but it becomes just unwieldy and hard to maintain.
Above all – this appears to be XML or such ... please don't do it by hand, unless there is literally just a snippet or two. There are excellent parsers.
I found a solution by reading up on the map function:
my $str = "<SampleElement oldattribs=\"sa1 sa2 sa3\">";
$str=~s#<SampleElement[^>]*oldattribs="([^"]*)"#my $fulcnt=$&; my $afids=$1;
my @affs = ();
if($afids=~m/\s+/)
{
@affs = split /\s/, $afids;
my @newas = join ",", map { (my $foo = $_) =~ s/[a-z]*//i; $foo; } @affs ;
($fulcnt." newattribs=\"@newas\"");
}
else
{
($fulcnt);
}
#eg;
I now use this line in my code:
my @newas = join ",", map { (my $foo = $_) =~ s/[a-z]*//i; $foo; } @affs ;
Instead of
my $jnafs = join ",", map { $_=~s/[a-z]*//i, } @affs;
It's working now. Thanks to all.

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will greedily match 'a' at least three times. It may match five times, 10 times, etc. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?
Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}
Use @LAST_MATCH_END and @LAST_MATCH_START (the English-module names for @+ and @-):
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.
Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.
If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on v14 & v16).
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

Regular expression (/<(\w+)\s+(.*?)>/) need improvement

There is a sub to handle the Type and Value.
sub parse_type_value_specifier {
my $tvs = shift;
my ($type, $value) = $tvs =~ /<(\w+)\s+(.*?)>/;
return $type, $value;
}
It should suit for three formats below.
<B 0> - works, return $type = (B) and $value = (0)
<A[1..80] ""> - doesn't work; it needs to return $type = A[1..80] and $value = () # empty
<A[1..80] "hello"> - doesn't work; it needs to return $type = A[1..80] and $value = (hello)
/<(\w+)\s+(.*?)>/ Thank you.
How about
/<([\w\[\].]+)\s*"?([^">]*)"?>/
or /<(\w+)\s*"?([^">]*)"?>/ if your A[1..80] means \w length 1 to 80
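The first pattern can be checked against the three inputs from the question. The sketch below uses JavaScript only because the constructs involved (character classes, optional quotes, two capture groups) behave the same way there; parseTypeValue is a hypothetical stand-in for parse_type_value_specifier.

```javascript
// Pattern from the answer above: the type may contain word characters,
// square brackets and dots; optional double quotes around the value
// are matched but kept out of the second capture group.
const spec = /<([\w\[\].]+)\s*"?([^">]*)"?>/;

// Hypothetical helper mirroring parse_type_value_specifier from the question.
function parseTypeValue(tvs) {
  const m = spec.exec(tvs);
  return m ? { type: m[1], value: m[2] } : null;
}

for (const s of ['<B 0>', '<A[1..80] "">', '<A[1..80] "hello">']) {
  console.log(s, parseTypeValue(s));
}
```

The three inputs yield B / 0, A[1..80] / empty string, and A[1..80] / hello respectively.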
The following "works" for the input you show but you should provide a more complete spec:
#!/usr/bin/perl
use strict; use warnings;
while ( <DATA> ) {
if ( my ($type, $value) = /^<([A-Z])(?:\[.+\])?\s+"?(\w*)"?>/ ) {
print "\$type = $type\t\$value = $value\n";
}
}
__DATA__
<B 0>
<A[1..80] "">
<A[1..80] "hello">
Output:
$type = B $value = 0
$type = A $value =
$type = A $value = hello
Try this:
/<(\w{1,80})\s*(?:\s([^\s">]+|"[^"]*"))?>/
Now if the match of the second grouping starts with a ", remove it from the start and the end and you have the plain value.
It sounds like you want to ignore "s. Run it through another regex to strip those out first.
Try this
<(.+) +"?(.*?)"?>
Your regex is 99% correct; the problem is that \w will not match literal square brackets []. Just replace \w with a suitable character class, [\w\[\]\.]+:
<([\w\[\]\.]+)\s+(.*?)>