How to remove duplicate substrings from an undelimited string in perl? - regex

I have an odd situation where I want to remove all but the first match of a substring inside of a very long undelimited string. I have found some similar topics here, but none quite like mine.
For simplicities sake, here are some sudo before and after strings.
I have an undelimited file where "c" could be thousands of random characters but "bbb" is a unique string:
aaabbbbbbccccccbbbccccccbbbccccccaaa
I want to remove all but the first bbb:
aaabbbccccccccccccccccccaaa
Also, I would like to be able to use this as a perl script I can pipe through:
cat file.in | something | perl -pe 's/bbb//g' | somethingelse > file.out
But, unlike my example above, I want to leave the first occurrence of "bbb" intact."
This seems like it should be fairly easy, but it is stumping me.
Any ideas?
Thanks in advance!

Perhaps the following will be helpful:
use strict;
use warnings;
my $string = 'aaabbbbbbccccccbbbccccccbbbccccccaaa';
$string =~ s/(?<=bbb).*?\Kbbb//g;
print $string;
Output:
aaabbbccccccccccccccccccaaa

my $string = 'aaabbbbbbccccccbbbccccccbbbccccccaaa';
my $seen;
sub first {
$seen++;
return $_[0] if $seen eq 1;
return '';
}
$string =~ s/(bbb)/first($1)/ge;
say $string;
Outputs:
aaabbbccccccccccccccccccaaa

Related

Replace strings only within a regex match in perl

I have an XML document with text in attribute values. I can't change how the the XML file is generated, but need to extract the attribute values without loosing \r\n. The XML parser of course strips them out.
So I'm trying to replace \r\n in attribute values with entity references
I'm using perl to do this because of it's non-greedy matching. But I need help getting the replace to happen only within the match. Or I need an easier way to do this :)
Here's is what I have so far:
perl -i -pe 'BEGIN{undef $/;} s/m_description="(.*?)"/m_description="$1"/smg' tmp.xml
This matches what I need to work with: (.*?). But I don't know to expand that pattern to match \r\n inside it, and do the replacement in the results. If I knew how many \r\n I have I could do it, but it seems I need a variable number of capture groups or something like that? There's a lot to regex I don't understand and it seems like there should be something do do this.
Example:
preceding lines
stuff m_description="Over
any number
of lines" other stuff
more lines
Should go to:
preceding lines
stuff m_description="Over
any number
of lines" other stuff
more lines
Solution
Thanks to Ikegam and ysth for the solution I used, which for 5.14+ is:
perl -i -0777 -pe's/m_description="\K(.*?)(?=")/ $1 =~ s!\n!
!gr =~ s!\r!
!gr /sge' tmp.xml
. should already match \n (because you specify the /s flag) and \r.
To do the replacement in the results, use /e:
perl -i -0777 -pe's/(?<=m_description=")(.*?)(?=")/ my $replacement=$1; $replacement=~s!\n!
!g; $replacement=~s!\r!
!g; $replacement /sge' tmp.xml
I've also changed it to use lookbehind/lookahead to make the code simpler and to use -0777 to set $/ to slurp mode and to remove the useless /m.
OK, so whilst this looks like an XML problem, it isn't. The XML problem is the person generating it. You should probably give them a prod with a rolled up copy of the spec as your first port of call for "fixing" this.
But failing that - I'd do a two pass approach, where I read the text, find all the 'blobs' that match a description, and then replace them all.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $text = do { local $/ ; <DATA> };
#filter text for 'description' text:
my #matches = $text =~ m{m_description=\"([^\"]+)\"}gms;
print Dumper \#matches;
#Generate a search-and-replace hash
my %replace = map { $_ => s/[\r\n]+/
/gr } #matches;
print Dumper \%replace;
#turn the keys of that hash into a search regex
my $search = join ( "|", keys %replace );
$search = qr/\"($search)\"/ms;
print "Using search regex: $search\n";
#search and replace text block
$text =~ s/m_description=$search/m_description="$replace{$1}"/mgs;
print "New text:\n";
print $text;
__DATA__
preceding lines
stuff m_description="Over
any number
of lines" other stuff
more lines

Perl - Match string between two colons

My string looks like this
important stuff: some text 2: some text 3.
I want to only print "important stuff". So basically I want to print everything up to the first colon. I'm sure this is simple, but my regex foo is not so good.
Edit: Sorry I was doing something stupid and gave you a bad example line. It has been corrected.
Just restrict what you're matching to non-colons, [^:]*. Note, the ^ and : boundaries aren't actually needed, but they help document the intent behind the regex.
my $text = "important stuff: some text 2: some text 3."
if ($text =~ /^([^:]*):/) {
print "$1";
}
Consider just splitting on the colon:
use strict;
use warnings;
my $string = 'important stuff: some text 2: some text 3.';
my $important = ( split /:/, $string )[0];
print $important;
Output:
important stuff
Well, assume its a string
$test = "sass sg22gssg 22222 2222: important important :"
Assume you want all characters between.
Wrong answer: $test =~ /:(.+):/; # thank you for the change from .{1,}
Corrected.
$test =~ /:([^:]*):/;
print $1; #perl memory u can assign to a string ;
$found = $1;
As a cheat sheet of regex in perl. cheat sheet
I did test it.

How do I return all characters that begin and end with certain characters in Perl (Or C++)?

note: I'm running Perl 5 on Linux
I'm currently doing a project where I have to input a few words and then return words that begin with "d" and end with "e". I'm not using a pre-done list, for example I input into the console Done, Dish, Dome, and Death. I want it to return Done and Dome, but not the other words. I hope to receive help how to do this in Perl, but C++ would help if Perl doesn't work out.
perl -ne ' print if /^d/i && /e$/i ' < words
Since you are using Linux, it may be simpler to use grep(1):
grep -i '^d.*e$' < words
That's almost trivial in Perl:
$ perl -nE 'say "ok" if /^d.*e$/i'
Done
ok
Dish
Dome
ok
Death
It reads from STDIN and says ok if the line matched. This is useful while debugging regular expressions. You just want to output matching lines, so you could simply replace say "ok" by say
$ perl -nlE 'say if /^d.*e$/i' words
while words is the filename of your words file. It magically reads its lines. Short explanation of that regular expression match:
^ # start of the line
d # the literal character 'd' (case-insensitive because of the i switch)
.* # everything allowed here
$ # end of the line
Not often I answer perl questions, but I think this does the trick.
my #words = ...;
#words = grep(/^d.*e$/i, #words);
grep uses a regular expression to filter the words.
How about:
#!/usr/bin/perl -Tw
use strict;
use warnings;
for my $word (#ARGV) {
if ( $word =~ m{\A d .* e \z}xmsi ) {
print "$word\n";
}
}

Need to replace part of a string with another string

I'm still pretty new to perl and regex and need some help getting started. I would love to provide some code, but that's kinda where I'm stuck.
What I'm trying to do is that I have this string in a file like this:
dn: CN=doe\, john,OU=Users,DC=domain,DC=com
and a string like this:
uid: d12345
I need to do a search and replace to get the following result.
dn: uid= d12345,OU=Users,DC=domain,DC=com
Can anyone help me get started with this one? Much thanks!
So you want to replace CN=doe\, john with uid= d12345? Try this:
$uidString = "uid: d12345";
$dnString = "dn: uid= d12345,OU=Users,DC=domain,DC=com";
if( $uidString =~ /uid: (\w+)/ ) {
$uid = $1;
$dnString =~ s/CN=.+?[^\\],/uid= $uid,/;
}
That will replace everything from CN= to the first unescaped comma with the uid.
Won't a one line regex do the trick?
use strict;
use warnings;
my $a = "dn: CN=doe\, john,OU=Users,DC=domain,DC=com";
my $b= "uid: d12345";
#the regex
$a =~ s/CN(.*?), .*?,/$b,/;
print "$a";
I suspect your DNs and uids will be dynamic. Here is something that will help. The regex will substitute CN= all the way until the comma with whatever string you put in $uid.
#!/usr/bin/env perl
use strict;
use warnings;
my $string = 'dn: CN=doe\, john,OU=Users,DC=domain,DC=com';
my $uid_str = 'uid: d12345';
my ($uid) = $uid_str =~ m/^uid:(.+)$/;
$string =~ s/CN=.+(,OU=.+$)/uid=$uid$1/;
print "String is: $string\n";
Output: String is: dn: uid= d12345,OU=Users,DC=domain,DC=com

Question about reg exps in perl

I need to write regular expression that will parse strings like this:
Build-Depends: cdbs, debhelper (>=5), smthelse
I want to extract package names (without version numbers and brackets).
I wrote something like this:
$line =~ /^Build-Depends:\s*(\S+)\s$/
But it's not exactly what I want.
Does someone know how to manage it?
P.S. I just want to get the list: "cdbs debhelper smthelse" as a result
This regex should do what you want: /\s(\S*)(?:\s\(.*?\))?(?:,|$)/g
Edit: You'd call it like this to loop through all the results:
while ($str =~ /\s(\S*)(?:\s\(.*?\))?(?:,|$)/g) {
print "$1 is one of the packages.\n";
}
With your regex /^Build-Depends:\s*(\S+)\s$/ you are matching until the end of string.
Try /^Build-Depends:\s*(\S+)\s/ instead.
This will work for the types of package names listed here.
use warnings;
use strict;
my #packs;
my $line = "Build-Depends: cdbs, debhelper (>=5), smthelse";
if ( $line =~ /^Build-Depends: (.+)$/ ) { # get everything
#packs = split /,+\s*/, $1;
s/\([^)]+\)//g for #packs; # remove version stuff
}
print "$_\n" for #packs;
How about splitting the input on whitespace and print each element if a ( is not present?
Something like this perhaps
perl -lane 'foreach $_ (#F[1..scalar(#F)]) {print if not m/\(/}'
cdbs,
debhelper
smthelse