Question about regular expressions in Perl

I need to write a regular expression that will parse strings like this:
Build-Depends: cdbs, debhelper (>=5), smthelse
I want to extract package names (without version numbers and brackets).
I wrote something like this:
$line =~ /^Build-Depends:\s*(\S+)\s$/
But it's not exactly what I want.
Does anyone know how to do this?
P.S. I just want to get the list "cdbs debhelper smthelse" as a result.

This regex should do what you want: /\s(\S*)(?:\s\(.*?\))?(?:,|$)/g
Edit: You'd call it like this to loop through all the results:
while ($str =~ /\s(\S*)(?:\s\(.*?\))?(?:,|$)/g) {
print "$1 is one of the packages.\n";
}

With your regex /^Build-Depends:\s*(\S+)\s$/ you are matching until the end of string.
Try /^Build-Depends:\s*(\S+)\s/ instead.
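Note that dropping the trailing $ still leaves a single capture group, so the match returns only the first token (cdbs, with its comma). A minimal sketch of one way to collect every name, assuming version constraints always appear in parentheses and ignoring alternatives (|) and architecture qualifiers; the helper name package_names is made up for illustration:

```perl
use strict;
use warnings;

# package_names is a hypothetical helper, not part of any module.
sub package_names {
    my ($line) = @_;
    # Keep only the field value; give up on any other kind of line.
    my ($deps) = $line =~ /^Build-Depends:\s*(.*)$/ or return;
    # Drop parenthesised version constraints such as (>=5).
    $deps =~ s/\([^)]*\)//g;
    # What remains is names separated by commas and whitespace.
    return $deps =~ /([^\s,]+)/g;
}

my @names = package_names('Build-Depends: cdbs, debhelper (>=5), smthelse');
print "@names\n";    # prints: cdbs debhelper smthelse
```

The sub is meant to be called in list context; in scalar context the final match would return only a true/false value.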

This will work for the types of package names listed here.
use warnings;
use strict;
my @packs;
my $line = "Build-Depends: cdbs, debhelper (>=5), smthelse";
if ( $line =~ /^Build-Depends: (.+)$/ ) { # get everything
    @packs = split /,+\s*/, $1;
    s/\s*\([^)]+\)//g for @packs; # remove version stuff and the space before it
}
print "$_\n" for @packs;

How about splitting the input on whitespace and print each element if a ( is not present?
Something like this perhaps
perl -lane 'foreach $_ (@F[1..$#F]) {print if not m/\(/}'
cdbs,
debhelper
smthelse

Related

Replace strings only within a regex match in perl

I have an XML document with text in attribute values. I can't change how the XML file is generated, but I need to extract the attribute values without losing \r\n. The XML parser of course strips them out.
So I'm trying to replace \r\n in attribute values with entity references.
I'm using Perl to do this because of its non-greedy matching. But I need help getting the replacement to happen only within the match. Or I need an easier way to do this :)
Here is what I have so far:
perl -i -pe 'BEGIN{undef $/;} s/m_description="(.*?)"/m_description="$1"/smg' tmp.xml
This matches what I need to work with: (.*?). But I don't know how to expand that pattern to match \r\n inside it and do the replacement in the results. If I knew how many \r\n I have I could do it, but it seems I need a variable number of capture groups or something like that? There's a lot about regex I don't understand, and it seems like there should be something to do this.
Example:
preceding lines
stuff m_description="Over
any number
of lines" other stuff
more lines
Should go to:
preceding lines
stuff m_description="Over&#13;&#10;any number&#13;&#10;of lines" other stuff
more lines
Solution
Thanks to Ikegam and ysth for the solution I used, which for 5.14+ is:
perl -i -0777 -pe's/m_description="\K(.*?)(?=")/ $1 =~ s!\n!&#10;!gr =~ s!\r!&#13;!gr /sge' tmp.xml
. should already match \n (because you specify the /s flag) and \r.
To do the replacement in the results, use /e:
perl -i -0777 -pe's/(?<=m_description=")(.*?)(?=")/ my $replacement=$1; $replacement=~s!\n!&#10;!g; $replacement=~s!\r!&#13;!g; $replacement /sge' tmp.xml
I've also changed it to use lookbehind/lookahead to make the code simpler and to use -0777 to set $/ to slurp mode and to remove the useless /m.
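The same /e idea can be sketched as a script rather than a one-liner. This is only an illustration: the sub names encode_value and encode_newlines are hypothetical, and it assumes attribute values are double-quoted and contain no embedded double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: turn CR and LF into numeric character references.
sub encode_value {
    my ($value) = @_;
    $value =~ s/\r/&#13;/g;
    $value =~ s/\n/&#10;/g;
    return $value;
}

# Hypothetical helper: rewrite only the text between the quotes of
# m_description="..."; everything outside the match is left alone.
sub encode_newlines {
    my ($xml) = @_;
    $xml =~ s{(?<=m_description=")(.*?)(?=")}{ encode_value($1) }sge;
    return $xml;
}

my $in = qq{stuff m_description="Over\r\nany number\r\nof lines" other stuff\r\n};
print encode_newlines($in);
# prints: stuff m_description="Over&#13;&#10;any number&#13;&#10;of lines" other stuff
```

Moving the inner substitutions into their own sub keeps the replacement expression short and copies $1 before any further regex runs.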
OK, so whilst this looks like an XML problem, it isn't. The XML problem is the person generating it. You should probably give them a prod with a rolled up copy of the spec as your first port of call for "fixing" this.
But failing that - I'd do a two pass approach, where I read the text, find all the 'blobs' that match a description, and then replace them all.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $text = do { local $/ ; <DATA> };
#filter text for 'description' text:
my @matches = $text =~ m{m_description=\"([^\"]+)\"}gms;
print Dumper \@matches;
#Generate a search-and-replace hash
my %replace = map { $_ => s/[\r\n]+/&#13;&#10;/gr } @matches;
print Dumper \%replace;
#turn the keys of that hash into a search regex
my $search = join ( "|", map { quotemeta } keys %replace ); # escape regex metacharacters in the keys
$search = qr/\"($search)\"/ms;
print "Using search regex: $search\n";
#search and replace text block
$text =~ s/m_description=$search/m_description="$replace{$1}"/mgs;
print "New text:\n";
print $text;
__DATA__
preceding lines
stuff m_description="Over
any number
of lines" other stuff
more lines

How to remove duplicate substrings from an undelimited string in perl?

I have an odd situation where I want to remove all but the first match of a substring inside of a very long undelimited string. I have found some similar topics here, but none quite like mine.
For simplicity's sake, here are some pseudo before-and-after strings.
I have an undelimited file where "c" could be thousands of random characters but "bbb" is a unique string:
aaabbbbbbccccccbbbccccccbbbccccccaaa
I want to remove all but the first bbb:
aaabbbccccccccccccccccccaaa
Also, I would like to be able to use this as a perl script I can pipe through:
cat file.in | something | perl -pe 's/bbb//g' | somethingelse > file.out
But, unlike my example above, I want to leave the first occurrence of "bbb" intact.
This seems like it should be fairly easy, but it is stumping me.
Any ideas?
Thanks in advance!
Perhaps the following will be helpful:
use strict;
use warnings;
my $string = 'aaabbbbbbccccccbbbccccccbbbccccccaaa';
$string =~ s/(?<=bbb).*?\Kbbb//g;
print $string;
Output:
aaabbbccccccccccccccccccaaa
use feature 'say';
my $string = 'aaabbbbbbccccccbbbccccccbbbccccccaaa';
my $seen = 0;
sub first {
    $seen++;
    return $_[0] if $seen == 1; # keep only the first occurrence
    return '';
}
$string =~ s/(bbb)/first($1)/ge;
say $string;
Outputs:
aaabbbccccccccccccccccccaaa
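A regex-free way to locate the first occurrence is index plus substr: split the string just after the first marker, then strip the marker from the remainder only. This is a sketch under that assumption; keep_first is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every occurrence of $marker except the first.
sub keep_first {
    my ($string, $marker) = @_;
    my $at = index($string, $marker);
    return $string if $at < 0;                 # marker never occurs
    my $cut  = $at + length($marker);
    my $head = substr($string, 0, $cut);       # up to and including first marker
    my $tail = substr($string, $cut);          # everything after it
    $tail =~ s/\Q$marker\E//g;                 # \Q...\E treats the marker literally
    return $head . $tail;
}

print keep_first('aaabbbbbbccccccbbbccccccbbbccccccaaa', 'bbb'), "\n";
# prints: aaabbbccccccccccccccccccaaa
```

To use something like this in a pipeline, the whole input would have to be read into one string first, since the first occurrence must be tracked across the entire stream rather than per line.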

perl regex replace only part of string

I need to write a perl regex to convert
site.company.com => dc=site,dc=company,dc=com
Unfortunately I am not able to remove the trailing "," using the regex I came up with below. I could of course remove the trailing "," in the next statement, but would prefer that to be handled as part of the regex.
$data="site.company.com";
$data =~ s/([^.]+)\.?/dc=$1,/g;
print $data;
This above code prints:
dc=site,dc=company,dc=com,
Thanks in advance.
When handling urls it may be a good idea to use a module such as URI. However, I do not think it applies in this case.
This task is most easily solved with a split and join, I think:
my $url = "site.company.com";
my $string = join ",", # join the parts with comma
map "dc=$_", # add the dc= to each part
split /\./, $url; # split into parts
$data =~ s/\./,dc=/g; $data =~ s/^/dc=/; # both substitutions must bind to $data
tested below:
> echo "site.company.com" | perl -pe 's/\./,dc=/g&&s/^/dc=/g'
dc=site,dc=company,dc=com
Try doing this:
my $x = "site.company.com";
my @a = split /\./, $x;
s/^/dc=/ for @a;
print join ",", @a;
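Just put the comma in front of each part, then strip the leading one:
$data = "site.company.com";
$data =~ s/([^.]+)\.?/,dc=$1/g; # ,dc=site,dc=company,dc=com
$data =~ s/^,//; # drop the leading comma
print $data;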
I'm going to try the /ge route:
$data =~ s{^|(\.)}{
    ( defined $1 ? ',' : '' ) . 'dc='
}ge;
e = evaluate the replacement as Perl code.
So it says: at the start of the string, or at a dot, make the following replacement. If it captured a period, emit a ','. Either way, insert 'dc='. The defined check avoids an "uninitialized value" warning when the start of the string matches and $1 is unset.
Note that I like to use a brace style of delimiter on all my evaluated replacements.

Opposite of (foo|bar|baz)

I'd like a regex to match everything but a few specific options within a broader expression.
The following example will match test_foo.pl or test_bar.pl or test_baz.pl:
/test_(foo|bar|baz)\.pl/
But I'd like just the opposite:
match test_.*\.pl except for where .* = (foo|bar|baz)
I'm kind of limited in my options for this because this is not directly into a perl program, but an argument to cloc, a program that counts lines of code (that happens to be written in perl). So I'm looking for an answer that can be done in one regex, not multiple chained together.
You should be able to accomplish this by using a negative lookahead:
/test_(?!foo|bar|baz).*\.pl/
This will fail if foo, bar, or baz immediately follows test_.
Note that this could still match something like test_notfoo.pl, and would fail on test_fool.pl. If you do not want this behavior, please clarify by adding some examples of what exactly should and should not match.
If you want to accept something like test_fool.pl or test_bart.pl, then you could change it to the following:
/test_(?!(foo|bar|baz)\.pl).*\.pl/
#!/usr/bin/env perl
use strict; use warnings;
my $pat = qr/\Atest_.+(?<!foo|bar|baz)[.]pl\z/;
while (my $line = <DATA>) {
chomp $line;
printf "%s %s\n", $line, $line =~ $pat ? 'matches' : "doesn't match";
}
__DATA__
test_bar.pl
test_foo.pl
test_baz.pl
test baz.pl
0test_bar.pl
test_me.pl
test_me_too.txt
Output:
test_bar.pl doesn't match
test_foo.pl doesn't match
test_baz.pl doesn't match
test baz.pl doesn't match
0test_bar.pl doesn't match
test_me.pl matches
test_me_too.txt doesn't match
(?:(?!STR).)*
is to
STR
as
[^CHAR]
is to
CHAR
So you want
if (/^test_(?:(?!foo|bar|baz).)*\.pl\z/s)
More readable:
my %bad = map { $_ => 1 } qw( foo bar baz );
if (/^test_(.*)\.pl\z/s && !$bad{$1})
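For comparison, here is a small sketch that puts the hash lookup next to an anchored single-regex variant, which may be closer to what a one-pattern tool needs. It assumes the pattern is applied to the whole file name; the sub names are hypothetical.

```perl
use strict;
use warnings;

my %bad = map { $_ => 1 } qw( foo bar baz );

# Two-step check: capture the middle part, then look it up.
sub wanted_by_hash {
    my ($file) = @_;
    return ( $file =~ /^test_(.*)\.pl\z/s && !$bad{$1} ) ? 1 : 0;
}

# Single-regex variant: reject only when an excluded word is followed
# directly by ".pl" at the end of the name.
sub wanted_by_regex {
    my ($file) = @_;
    return $file =~ /^test_(?!(?:foo|bar|baz)\.pl\z).*\.pl\z/s ? 1 : 0;
}

for my $file (qw( test_foo.pl test_fool.pl test_notfoo.pl test_me.pl test_me_too.txt )) {
    printf "%-16s hash=%d regex=%d\n",
        $file, wanted_by_hash($file), wanted_by_regex($file);
}
```

Both subs reject test_foo.pl and test_me_too.txt and accept test_fool.pl, test_notfoo.pl and test_me.pl, so the two formulations agree on these names.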
Hmm, I might have misunderstood your question. Anyway, maybe this is helpful ...
You would negate the match operator. For example:
perl -lwe "print for grep ! m/(lwp|archive).*\.pl/, glob q(*.pl)"
# Note you'd use single-quotes on Linux but double-quotes on Windows.
# Nothing to do with Perl, just different shells (bash vs cmd.exe).
The ! negates the match. The above is shorthand for:
perl -lwe "print for grep ! ($_ =~ m/(lwp|archive).*\.pl/), glob q(*.pl)"
Which can also be written using the negated match operator !~, as follows:
perl -lwe "print for grep $_ !~ m/(lwp|archive).*\.pl/, glob q(*.pl)"
In case you're wondering, the glob is simply used to get an input list of filenames as per your example. I just substituted another match pattern suitable for the files I had handy in a directory.

Using a regular expression in Perl to list variables from another Perl script

My thoughts on how to grab all scalars and arrays out of a Perl file went along the lines of:
open (InFile, "SomeScript.pl");
@InArray = <InFile>;
@OutArray = {};
close (InFile);
$ArrayCount = @InArray;
open (OutFile, ">outfile.txt");
for ($x=0; $x<=$ArrayCount; $x++){
    $Testline = @InArray[$x];
    if($Testline =~ m/((@|\$)[A-Z]+)/i){
        $Outline = "$1\n";
        push @OutArray, $Outline;
    }
}
print OutFile @OutArray;
close(OutFile);
...and this works fairly well. The problem is that if multiple variables appear on a line it will only grab the first variable. An example might be:
$FirstVar = $SecondVar + $ThirdVar;
The script would only grab $FirstVar and output to a file. This might still work though because $SecondVar and $ThirdVar have to be initialized somewhere else before the proceeding line has any meaning. I guess the exception to the rule would be a line in which multiple variables are initialized at the same time.
Could an example in real Perl code break this script?
Also, how to grab multiple items that match my regular expression's criteria from the same line?
Don't do that
You can't really parse Perl with regexes, so I wouldn't even try.
You can't even properly parse it without actually running it, but you can get close with PPI.
perl-variables.pl
#! /usr/bin/env perl
use strict;
use warnings;
use 5.10.1;
use PPI;
use PPI::Find;
my($filename) = (@ARGV, $0); # checks itself by default
my $Doc = PPI::Document->new($filename);
my $Find = PPI::Find->new( sub{
return 0 unless $_[0]->isa('PPI::Token::Symbol');
return 1;
});
$Find->start($Doc);
while( my $symbol = $Find->match ){
my $raw = $symbol->content;
my $var = $symbol->symbol;
if( $raw eq $var ){
say $var;
} else {
say "$var\t($raw)";
}
}
print "\n";
my @found = $Find->in($Doc);
my %found;
$found{$_}++ for @found;
say for sort keys %found;
Running it against itself, produces:
$filename
@ARGV
$0
$Doc
$filename
$Find
@_ ($_)
$Find
$Doc
$symbol
$Find
$raw
$symbol
$var
$symbol
$raw
$var
$var
@found
$Find
$Doc
%found
%found ($found)
$_
@found
%found
$0
$Doc
$Find
$_
$filename
$found
$raw
$symbol
$var
%found
@ARGV
@found
It looks like this will miss fully qualified variable names ($My::Package::Foo) and the rare but valid variable names enclosed with braces (${variable}, ${"varname!with#special+chars"}). Your script will also match element accesses of hashes and arrays ($array[4] ==> $array, $hash{$key} ==> $hash), and object method calls ($object->method() ==> $object), which may or may not be what you want.
You also mismatch variables with underscores ($my_var) and numbers ($var3), and you could get false positives from comments, quoted strings, pod, etc. (# report bugs to bob@company.org).
Matching multiple expressions is a matter of using the /g modifier, which will return a list of matches:
@vars = $Testline =~ /[@\$]\w+/gi;
if (@vars > 0) {
    push @OutArray, @vars;
}
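To make the caveats above concrete, here is a short sketch that runs the same kind of sigil-based /g match over a few sample lines. The sub name naive_vars and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: naive sigil-based extraction using /g in list context.
sub naive_vars {
    my ($line) = @_;
    return $line =~ /[\@\$]\w+/g;
}

my @samples = (
    '$FirstVar = $SecondVar + $ThirdVar;',
    '$hash{$key} = $array[4];',
    '# report bugs to bob@company.org',
);

for my $line (@samples) {
    print join(' ', naive_vars($line)), "\n";
}
```

The first line yields all three scalars, the second reports $hash and $array even though the real variables are %hash and @array, and the third reports @company from an e-mail address inside a comment.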
The simple-minded answer is to use the /g flag on your regexp.
The complex answer is that this sort of code analysis is very difficult for perl. Look at the module PPI for a better, more full featured, semantic analysis of perl code.
I can't answer either of your questions directly, but I will offer this: I don't know why you're trying to extract scalars, but the debugger package that comes with perl has to "know" about all variables, and the last time I looked it was written in Perl. You may be better off trying to evaluate a perl script using the debugger package or techniques borrowed from that package rather than reinventing the wheel.
Despite the limitations with the method, here is a slightly simpler version of the script above that reads from stdin.
#!/usr/bin/perl
use strict;
use warnings;
my %vars;
while (<>) {
    $vars{$_}++ for (m'([$@]\w+)'g);
}
my @vars = keys %vars;
print "@vars\n";