Perl regular expression removing duplicate consecutive substrings in a string

I tried to do a search on this particular problem, but all I get is either removal of duplicate lines or removal of repeated strings where they are separated by a delimiter.
My problem is slightly different. I have a string such as
"comp name1 comp name2 comp name2 comp name3"
where I want to remove the repeated comp name2 and return only
"comp name1 comp name2 comp name3"
They are not consecutive duplicate words, but consecutive duplicate substrings. Is there a way to solve this using regular expressions?

s/(.*)\1/$1/g
Be warned that the running time of this regular expression is at least quadratic in the length of the string, because the engine tries every possible length of .* at every starting position.

This works for me (MacOS X 10.6.7, Perl 5.13.4):
use strict;
use warnings;
my $input = "comp name1 comp name2 comp name2 comp name3" ;
my $output = "comp name1 comp name2 comp name3" ;
my $result = $input;
$result =~ s/(.*)\1/$1/g;
print "In: <<$input>>\n";
print "Want: <<$output>>\n";
print "Got: <<$result>>\n";
The key point is the backreference \1 in the pattern, which matches a second copy of whatever (.*) captured.

To avoid collapsing repeated characters inside a term (e.g. comm1 -> com1), bracket .* in the regular expression with \b word boundaries:
s/(\b.*\b)\1/$1/g
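For comparison, here is a minimal Python sketch of the two substitutions above. It assumes Python's re module treats these patterns the same way Perl does for the sample inputs; the function names are made up for illustration and do not come from the original answers.

```python
import re

def collapse_repeats(text):
    # Counterpart of s/(.*)\1/$1/g: wherever some X is immediately
    # followed by a second copy of X, keep a single X.
    return re.sub(r"(.*)\1", r"\1", text)

def collapse_repeats_bounded(text):
    # Counterpart of s/(\b.*\b)\1/$1/g: X must start and end on a word
    # boundary, so doubled letters inside a word are left alone.
    return re.sub(r"(\b.*\b)\1", r"\1", text)

if __name__ == "__main__":
    s = "comp name1 comp name2 comp name2 comp name3"
    print(collapse_repeats(s))
    print(collapse_repeats_bounded(s))
    print(collapse_repeats("comm1"), collapse_repeats_bounded("comm1"))
```

The last line shows the difference between the two: the unbounded pattern shortens comm1 to com1, while the bounded one leaves it unchanged.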

I don't usually work with languages that support backreferences, but since you are using Perl, they are available to you.
See this section of the linked reference:
Useful Example: Checking for Doubled Words
When editing text, doubled words such as "the the" easily creep in. Using the regex \b(\w+)\s+\1\b in your text editor, you can easily find them. To delete the second word, simply type in \1 as the replacement text and click the Replace button.
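The doubled-word pattern quoted above can be tried outside a text editor too. This is a small Python sketch, assuming the same regex and a replacement of \1; the helper name is hypothetical.

```python
import re

# Same pattern as in the quoted passage: a word, whitespace, the same word again.
DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b")

def remove_doubled_words(text):
    # Keep the first word of each doubled pair and drop the repeat.
    return DOUBLED_WORD.sub(r"\1", text)

if __name__ == "__main__":
    print(remove_doubled_words("the the cat sat on on the mat"))
```

Note that a single pass only removes one repeat per match, so a tripled word such as "the the the" still leaves "the the" behind and would need a second pass.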

If you need something running in linear time, you could split the string and iterate through the list:
#!/usr/bin/perl
use strict;
use warnings;
my $str = "comp name1 comp name2 comp name2 comp name3";
my @elems = split("\\s", $str);
my $prevComp;
my $prevFlag = -1;
foreach my $elemIdx (0..(scalar @elems - 1)) {
    if ($elemIdx % 2 == 1) {
        if (defined $prevComp) {
            if ($prevComp ne $elems[$elemIdx]) {
                print " $elems[$elemIdx]";
                $prevFlag = 0;
            }
            else {
                $prevFlag = 1;
            }
        }
        else {
            print " $elems[$elemIdx]";
        }
        $prevComp = $elems[$elemIdx];
    }
    elsif ($prevFlag == -1) {
        print "$elems[$elemIdx]";
        $prevFlag = 0;
    }
    elsif ($prevFlag == 0) {
        print " $elems[$elemIdx]";
    }
}
print "\n";
Dirty, perhaps, but should run faster.
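The same split-and-iterate idea can be sketched more compactly in Python. This is only an illustration under a loud assumption: the input is a sequence of two-word items ("comp nameN") with an even number of words, so any trailing unpaired word would be dropped. The function name is hypothetical.

```python
from itertools import groupby

def drop_repeated_pairs(text):
    words = text.split()
    # Pair the words up: ("comp", "name1"), ("comp", "name2"), ...
    # Assumption: an even number of words; zip drops an unpaired last word.
    pairs = list(zip(words[0::2], words[1::2]))
    # groupby groups runs of equal adjacent pairs; keep one key per run.
    kept = [key for key, _run in groupby(pairs)]
    return " ".join(" ".join(pair) for pair in kept)

if __name__ == "__main__":
    print(drop_repeated_pairs("comp name1 comp name2 comp name2 comp name3"))
```

Each word is visited a constant number of times, so the work grows linearly with the number of words, unlike the backtracking regex.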

Related

Regex - How to capture all iterations of a repeating pattern? [duplicate]

I'm using the C++ tr1::regex with the ECMA regex grammar. What I'm trying to do is parse a header and return values associated with each item in the header.
Header:
-Testing some text
-Numbers 1 2 5
-MoreStuff some more text
-Numbers 1 10
What I would like to do is find all of the "-Numbers" lines and put each number into its own result with a single regex. As you can see, the "-Numbers" lines can have an arbitrary number of values on the line. Currently, I'm just searching for "-Numbers([\s0-9]+)" and then tokenizing that result. I was just wondering if there was any way to both find and tokenize the results in a single regex.

No, there is not.

I was about to ask this exact same question, and I kind of found a solution.
Let's say you have an arbitrary number of words you want to capture.
"there are four lights"
and
"captain picard is the bomb"
You might think that the solution is:
/((\w+)\s?)+/
But this will only match the whole input string and the last captured group.
What you can do is use the "g" switch.
So, an example in Perl:
use strict;
use warnings;
my $str1 = "there are four lights";
my $str2 = "captain picard is the bomb";
foreach ( $str1, $str2 ) {
    my @a = ( $_ =~ /(\w+)\s?/g );
    print "captured groups are: " . join( "|", @a ) . "\n";
}
Output is:
captured groups are: there|are|four|lights
captured groups are: captain|picard|is|the|bomb
So, there is a solution if your language of choice supports an equivalent of "g" (and I guess most do...).
Hope this helps someone who was in the same position as me!
S
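The difference between a repeated capture group and a global scan can also be seen in Python, where re.findall plays the role of Perl's "g" switch. A minimal sketch, using one of the sample strings from above:

```python
import re

text = "there are four lights"

# A repeated group matches the whole string but keeps only its final iteration.
m = re.match(r"((\w+)\s?)+", text)
whole_match = m.group(0)
last_word = m.group(2)

# Scanning the string repeatedly returns every capture instead.
all_words = re.findall(r"(\w+)\s?", text)

if __name__ == "__main__":
    print(whole_match)
    print(last_word)
    print("|".join(all_words))
```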

The problem is that the desired solution insists on using capture groups. C++ provides std::regex_token_iterator, which handles this better (C++11 example):
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main() {
    std::regex e (R"((?:^-Numbers)?\s*(\d+))");
    string input;
    while (getline(cin, input)) {
        std::regex_token_iterator<std::string::iterator> a{
            input.begin(), input.end(),
            e, 1,
            regex_constants::match_continuous
        };
        std::regex_token_iterator<std::string::iterator> end;
        while (a != end) {
            cout << *a << " - ";
            ++a;
        }
        cout << '\n';
    }
    return 0;
}
https://wandbox.org/permlink/TzVEqykXP1eYdo1c

regular expression help: catch this: |TrxId=475665|

For example I have a string:
MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|
and I want to catch this: |TrxId=475665|
after TrxId= it could be any numbers and any amount of them, so regex should catch as well:
|TrxId=111333| and |TrxId=0000011112222| and |TrxId=123|

TrxId=(\d+)
That would give a group (1) with the TrxId.
PS: Use global modifier.

The regex should look somewhat like this:
TrxId=[0-9]+
It will match TrxId= followed by at least one digit.

An example solution in Python:
In [107]: data = 'MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|'
In [108]: m = re.search(r'\|TrxId=(\d+)\|', data)
In [109]: m.group(0)
Out[109]: '|TrxId=475665|'
In [110]: m.group(1)
Out[110]: '475665'

/MsgNam\=.*?\|(TrxId\=\d+)\|.*/
For example, in Perl:
$a = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100111|";
$a =~ /MsgNam\=.*?\|(TrxId\=\d+)\|.*/;
print $1;
This will print TrxId=475665

You know what your delimiters look like, so you don't need a regex, you need to split. Here's an implementation in Perl.
use strict;
use warnings;
my $input = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|";
my @first_array = split(/\|/, $input); # splitting $input on "|"
# split drops trailing empty fields by default, so the final "|" in $input
# does not produce an extra element. Filtering on defined is a harmless safeguard.
@first_array = grep { defined } @first_array;
# Also filter out elements that do not contain an equals sign.
@first_array = grep { /=/ } @first_array;
# Now, put these elements into an associative array (hash):
my %assoc_array;
foreach (@first_array)
{
    if (/^([^=]+)=(.+)$/)
    {
        $assoc_array{$1} = $2;
    }
    else
    {
        # Something weird may be happening...
        # we may have an element starting with "=" or one with an
        # empty value (such as "WepNr="), for example.
        # Do what you want: throw a warning, die, silently move on, etc.
    }
}
if (exists $assoc_array{TrxId})
{
    print "|TrxId=" . $assoc_array{TrxId} . "|\n";
}
else
{
    print "Sorry, TrxId not found!\n";
}
The code above yields the expected output:
|TrxId=475665|
Now, obviously this is more complex than some of the other answers, but it's also a bit more robust in that it allows you to search for more keys as well.
This approach does have a potential issue if your keys appear more than once. In that case, it's easy enough to modify the code above to collect an array reference of values for each key.
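The modification for repeated keys mentioned above could look roughly like this in Python. It is a sketch under stated assumptions, not a port of the Perl script: the function name is hypothetical, and unlike the Perl version it keeps keys whose value is empty (such as "WepNr=").

```python
from collections import defaultdict

def parse_fields(record):
    # Map each key to the list of all values seen for it, in order.
    fields = defaultdict(list)
    for part in record.split("|"):
        key, sep, value = part.partition("=")
        # Skip pieces without "=" (including the empty piece after the
        # trailing "|") and pieces that start with "=".
        if sep and key:
            fields[key].append(value)
    return dict(fields)

if __name__ == "__main__":
    record = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|WepNr=|"
    parsed = parse_fields(record)
    if "TrxId" in parsed:
        print("|TrxId=" + parsed["TrxId"][0] + "|")
```

Storing a list per key means a second TrxId in the same record is appended rather than silently overwriting the first.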

How do I exclude a directory using regular expressions?

I asked a question a little while ago about using regular expressions to extract a match from a URL in a particular directory.
eg: www.domain.com/shop/widgets/match/
The solution given was ^/shop.*/([^/]+)/?$
This would return "match"
However, my file structure has changed and I now need an expression that instead returns "match" in any directory excluding "pages" and "system"
Basically I need an expression that will return "match" for the following:
www.domain.com/shop/widgets/match/
www.domain.com/match/
But not:
www.domain.com/pages/widgets/match/
www.domain.com/pages/
www.domain.com/system/widgets/match/
www.domain.com/system/
I've been struggling for days without any luck.
Thanks

This is just an alternative to Graham's great answer above. The code is in C# (but for the regex part, that doesn't matter):
void MatchDemo()
{
    var reg = new Regex("( " +
        " (\\w+[.]) " +
        " | " +
        " (\\w+[/])+ " +
        ") " +
        "(shop[/]|\\w+[/]) " + //either "shop/" or any other single path segment
        "(match) " ,
        RegexOptions.IgnorePatternWhitespace);
    var url = @"www.domain.com/shop/widgets/match/";
    var retVal = reg.Match(url).Groups[5]; //do we have anything in the fifth parentheses?
    Console.WriteLine(retVal);
    Console.ReadLine();
}
/Hans

BRE and ERE do not provide a way to negate a portion of the RE, except within a square bracket expression. That is, you can write [^a-z], but you can't express "not /(abc|def)/". If your regex dialect is ERE, then you must use two regexps. If you're using PCRE (as PHP's preg_* functions do), you can use a negative lookahead.
For example, here's some PHP:
#!/usr/local/bin/php
<?php
$re = '/^www\.example\.com\/(?!(system|pages)\/)([^\/]+\/)*([^\/]+)\/$/';
$test = array(
    'www.example.com/foo/bar/baz/match/',
    'www.example.com/shop/widgets/match/',
    'www.example.com/match/',
    'www.example.com/pages/widgets/match/',
    'www.example.com/pages/',
    'www.example.com/system/widgets/match/',
    'www.example.com/system/',
);
foreach ($test as $one) {
    preg_match($re, $one, $matches);
    // $matches is empty when nothing matched, so fall back to ''
    printf(">> %-50s\t%s\n", $one, $matches[3] ?? '');
}
And the output:
[ghoti@pc ~]$ ./phptest
>> www.example.com/foo/bar/baz/match/ match
>> www.example.com/shop/widgets/match/ match
>> www.example.com/match/ match
>> www.example.com/pages/widgets/match/
>> www.example.com/pages/
>> www.example.com/system/widgets/match/
>> www.example.com/system/
Is that what you're looking for?
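The same negative lookahead works in other engines that support it. Here is a minimal Python sketch, assuming the example.com host from the PHP test; the helper name is hypothetical, and the groups that only serve as grouping are written as non-capturing so the wanted segment is group 1.

```python
import re

# (?!(?:system|pages)/) rejects URLs whose first directory is system or pages.
PATTERN = re.compile(
    r"^www\.example\.com/(?!(?:system|pages)/)(?:[^/]+/)*([^/]+)/$"
)

def last_segment(url):
    # Return the final path segment, or None if the URL is excluded.
    m = PATTERN.match(url)
    return m.group(1) if m else None

if __name__ == "__main__":
    for url in (
        "www.example.com/shop/widgets/match/",
        "www.example.com/match/",
        "www.example.com/pages/widgets/match/",
        "www.example.com/system/",
    ):
        print(url, last_segment(url))
```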

RegEx and split camelCase

I want to get an array of all the capitalized words contained in a string, but only if the string begins with "set".
For example:
- string "setUserId", result array("User", "Id")
- string "getUserId", result false
Without the "set" restriction, the regex looks like /([A-Z][a-z]+)/

$str ='setUserId';
$rep_str = preg_replace('/^set/','',$str);
if($str != $rep_str) {
    $array = preg_split('/(?<=[a-z])(?=[A-Z])/',$rep_str);
    var_dump($array);
}
Your regex will also work:
$str = 'setUserId';
if(preg_match('/^set/',$str) && preg_match_all('/([A-Z][a-z]*)/',$str,$match)) {
    var_dump($match[1]);
}
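For readers outside PHP, the same check-then-extract approach can be sketched in Python. The function name is hypothetical, and it returns None where the question asks for false.

```python
import re

def set_words(name):
    # Only names that begin with "set" qualify.
    if not name.startswith("set"):
        return None
    # Collect each capitalized word that follows the "set" prefix.
    return re.findall(r"[A-Z][a-z]*", name[len("set"):])

if __name__ == "__main__":
    print(set_words("setUserId"))
    print(set_words("getUserId"))
```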
