How to replace ',' without replacing it in unit in a string - regex

I would like to replace "," with # in the following strings, but without changing it inside numbers that use it as a thousands separator (the 10,000 format).
x,y,z to x#y#z
x1,y1,z1 to x1#y1#z1
x1,y1 10,000,z1 to x1#y1 10,000#z1
I used s/(\D),/\1#/g, but it doesn't work for cases 2 and 3. How can I exclude only the commas that have a digit on both sides? Can someone help? Thanks so much.

You need a regex that matches a comma lacking a digit on its left or on its right, that is, any comma not surrounded by digits on both sides.
s/(?<!\d),|,(?!\d)/#/g
The negative lookbehind assertion (?<!\d) allows matches such as x,, since x is not a digit. Because the assertion is negative, it also succeeds at the beginning of the line, e.g. ,x. The negative lookahead assertion (?!\d) allows matches against commas that are not followed by a digit. Neither of these alternatives will match a comma surrounded by digits.
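As a quick check, here is a minimal sketch that runs the substitution above over the three strings from the question. The helper name hash_separators is made up for this illustration.

```perl
use strict;
use warnings;

# hash_separators is a hypothetical helper name, not part of the answer.
# It replaces every comma that lacks a digit on at least one side.
sub hash_separators {
    my ($s) = @_;
    (my $out = $s) =~ s/(?<!\d),|,(?!\d)/#/g;
    return $out;
}

for my $in ('x,y,z', 'x1,y1,z1', 'x1,y1 10,000,z1') {
    print hash_separators($in), "\n";
}
# x#y#z
# x1#y1#z1
# x1#y1 10,000#z1
```

Only the comma inside 10,000 has a digit on both sides, so it is the only one left alone.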

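Try the following alternative, which replaces any comma that is not followed by a digit:
s/,(?!\d)/#/g;
A lookbehind placed after the comma would only ever see the comma itself, so the lookahead alone does the work. Note that this leaves a comma untouched whenever a digit follows it, even if no digit precedes it (as in x,1).
Sample script:
use strict;
use warnings;
my @array = ( 'x,y,z', 'x1,y1,z1', 'x1,y1 10,000,z1');
for my $string (@array) {
    $string =~ s/,(?!\d)/#/g;
    print "$string\n";
}
# OUTPUT
# x#y#z
# x1#y1#z1
# x1#y1 10,000#z1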

Related

PERL Regular Expression - Return result not including conditional statement

I'm new to regex and I have a scenario where regex will be useful.
My requirement is quite simple: I want to detect if the word NET is present in a string, and extract the digits that follow it without including the word NET or the spaces that follow it.
In my particular case following the word NET are several white space characters, and the number of these can vary as they're used as padding.
My Input string is as follows
NET 4.800 g
The regex I have concocted is as follows
(?<=NET)\s*(\d{0,4}\.\d{1,3})
This produces a result close to what I'm attempting to do.
It performs a positive look-behind on the characters NET and then matches any white space characters that follow. Finally I select up to four digits, a period and up to three more digits.
The problem is that I'm grabbing the indeterminate number of padding spaces before the number. All I actually want is the number itself.
I did attempt putting \s* into the look-behind, but this failed. Does anyone have any suggestions as to where I'm going wrong here?
I suspect that you are using $& to capture your string, and not $1. The variable $& contains the entire matched string, which includes your spaces but not the text covered by your lookbehind assertion. That fits your problem description: you need to exclude a variable number of spaces, but when you move \s* into the lookbehind you get the error about "variable length lookbehind assertions are not supported".
This would be quite an easy question to answer if you had included your code. You should always do that: always show your code.
So... I assume you have something like:
if (/your_regex/) {
    $match = $&;
}
Then you should change it to:
if (/your_regex/) {
    $match = $1;
}
This way, only the string inside the parentheses will be captured, and the \s* outside them will be discarded.
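To make the difference concrete, here is a small sketch that applies the regex from the question and returns both variables. The padded input string and the helper name net_parts are assumptions made for this illustration.

```perl
use strict;
use warnings;

# net_parts is a hypothetical helper: it returns ($&, $1) for the
# regex from the question, or an empty list when nothing matches.
sub net_parts {
    my ($s) = @_;
    return unless $s =~ /(?<=NET)\s*(\d{0,4}\.\d{1,3})/;
    return ($&, $1);
}

# Assumed input with four padding spaces after NET.
my ($whole, $group) = net_parts("NET    4.800 g");
print "whole match: [$whole]\n";   # [    4.800]  padding included
print "group 1:     [$group]\n";   # [4.800]      number only
```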
Once you capture with $1 you no longer need the lookbehind, so the regex can be simplified. Here are a strict and a flexible version:
use strict;
use warnings;
use Data::Dumper;
my $str = "NET 4.800 g";
my ($number) = $str =~ /^NET\s*(\d{0,4}\.\d{0,3})\sg$/; # strict match
print Dumper $number; # $VAR1 = '4.800';
my ($simple) = $str =~ /NET\s*([\d.]+)/; # flexible match
print Dumper $simple; # $VAR1 = '4.800';
In the strict match, we use anchors at beginning ^ and end $. We make sure that the string starts with NET and ends with g, and account for the exact numbers and spaces we expect to find between.
The flexible match simply looks for NET and captures the number that comes after it. This can take place anywhere in the string, and even match partially.

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The following seems to match ,
Can someone explain why?
I would like to match one or more digits or points, terminated by a comma.
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints , too
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g) {
    print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As defined here.
Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g
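The list-context rule quoted above can be checked with a short sketch that runs three variants of the pattern over the string from the question. The array names are made up for this illustration.

```perl
use strict;
use warnings;

my $text = "241.000,00";

# With a capture group, //g in list context returns the captures only.
my @with_group = $text =~ /[0-9.]+(,)/g;

# Without groups, it returns each whole match (comma included here).
my @without_group = $text =~ /[0-9.]+,/g;

# A lookahead requires the comma but keeps it out of the match.
my @lookahead = $text =~ /[0-9.]+(?=,)/g;

print "[@with_group]\n";      # [,]
print "[@without_group]\n";   # [241.000,]
print "[@lookahead]\n";       # [241.000]
```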
Your match in the foreach loop is in list context. In list context, a match returns what it captured. Parens indicate a capture, not the whole regex. You have parens around a comma. You want it the other way around: put the parens around the bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000
If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g
You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
You can replace the comma with a lookahead, or just leave it outside the parentheses, since it isn't part of what you want to capture; it won't make a difference in this case. However, the pattern as written puts the comma, not the number, into capture group 1, so the list-context match returns the comma.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if ($mystring =~ m/start(.*)end/) {
    print $1;
}

Replace specific capture group instead of entire regex in Perl

I've got a regular expression with capture groups that matches what I want in a broader context. I then take capture group $1 and use it for my needs. That's easy.
But how to use capture groups with s/// when I just want to replace the content of $1, not the entire regex, with my replacement?
For instance, if I do:
$str =~ s/prefix (something) suffix/42/
prefix and suffix are removed. Instead, I would like something to be replaced by 42, while keeping prefix and suffix intact.
As I understand, you can use look-ahead or look-behind that don't consume characters. Or save data in groups and only remove what you are looking for. Examples:
With look-ahead:
s/your_text(?=ahead_text)//;
Grouping data:
s/(your_text)(ahead_text)/$2/;
If you only need to replace one capture then using @LAST_MATCH_START and @LAST_MATCH_END (with use English; see perldoc perlvar) together with substr might be a viable choice:
use English qw(-no_match_vars);
$your_string =~ m/aaa (bbb) ccc/;
substr $your_string, $LAST_MATCH_START[1], $LAST_MATCH_END[1] - $LAST_MATCH_START[1], "new content";
# replaces "bbb" with "new content"
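The same idea also works without the English module, since @- and @+ are the built-in names for @LAST_MATCH_START and @LAST_MATCH_END. A minimal sketch, assuming the prefix/something/suffix pattern from the question; the helper name replace_group1 is hypothetical.

```perl
use strict;
use warnings;

# replace_group1 is a hypothetical helper. It overwrites only the text
# matched by capture group 1, leaving the rest of the match in place.
sub replace_group1 {
    my ($str, $new) = @_;
    if ($str =~ /prefix (something) suffix/) {
        # $-[1] is the start offset of group 1, $+[1] its end offset.
        substr($str, $-[1], $+[1] - $-[1], $new);
    }
    return $str;
}

print replace_group1("a prefix something suffix b", "42"), "\n";
# a prefix 42 suffix b
```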
This is an old question, but I found the approach below easier for changing lines that start with >something into >something_else. It is handy for changing the headers of FASTA sequences.
while ($filelines =~ />(.*)\s/g) {
    unless ($1 =~ /else/i) {
        # \Q...\E keeps any regex metacharacters in the header literal
        $filelines =~ s/(\Q$1\E)/$1\_else/;
    }
}
I use something like this:
s/(?<=prefix)(group)(?=suffix)/$1 =~ s|text|rep|gr/e;
Example:
In the following text I want to normalize the whitespace, but only after ::=:
some  text ::=  a   b  c d    e ;
Which can be achieved with:
s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e
This results in:
some  text ::= a b c d e ;
Explanation:
(?<=::=): Look-behind assertion to match ::=
(.*): Everything after ::=
$1 =~ s|\s+| |gr: With the captured group normalize whitespace. Note the r modifier which makes sure not to attempt to modify $1 which is read-only. Use a different sub delimiter (|) to not terminate the replacement expression.
/e: Treat the replacement text as a perl expression.
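Put together, the pieces explained above can be tried with a short sketch. The input line and the helper name squeeze_after_marker are invented for this illustration, and the r modifier needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# squeeze_after_marker is a hypothetical helper. It collapses runs of
# whitespace to one space, but only in the text that follows "::=".
sub squeeze_after_marker {
    my ($line) = @_;
    (my $out = $line) =~ s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e;
    return $out;
}

# Assumed sample: irregular spacing both before and after the marker.
print squeeze_after_marker("rule   ::=  term   op    term ;"), "\n";
# rule   ::= term op term ;
```

The spacing before ::= is left as it was, because the look-behind only lets the match start after the marker.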
Use lookaround assertions. Quoting the documentation:
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $&. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
If the prefix has a fixed length, you can thus do:
s/(?<=prefix)(your capture)(?=suffix)/$1/
However, ?<= does not work for variable length patterns (starting from Perl 5.30, it accepts variable length patterns whose length is smaller than 255 characters, which enables the use of |, but still prevents the use of *). The work-around is to use \K instead of (?<=):
s/.*prefix\K(your capture)(?=suffix)/$1/
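As an illustration of \K, here is a small sketch where the prefix is followed by a variable amount of whitespace, which is the kind of thing a plain look-behind cannot express with \s+. The sample strings and the helper name replace_between are assumptions; \K itself requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# replace_between is a hypothetical helper. Everything matched before
# \K is kept, so only "something" is replaced by 42.
sub replace_between {
    my ($str) = @_;
    (my $out = $str) =~ s/prefix\s+\Ksomething(?=\s+suffix)/42/;
    return $out;
}

print replace_between("prefix    something suffix"), "\n";
# prefix    42 suffix
```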

Regular expression for number search

I need a regular expression that will find any numbers that are not inside parentheses.
Example abcd 1 (35) (df)
It would only see the 1.
Is this very complex? I've tried and had no luck.
Thanks for any help
An easy solution is to first remove the unwanted values:
my $string = "abcd 12 (35) (df) 2311,22";
$string =~ s/\(\d+\)//g; # remove numbers within parens
my @numbers = $string =~ /\d+/g; # extract the numbers
This is quite hard but something like this will probably do:
^(?:\()(\d+)(?:[^)])|(?:[^(0-9]|^)(\d+)(?:[^)0-9]|^)|(?:[^(])(\d+)(?:\))$
The problem is to match (123, 123) and also to not match the string 123 as the number 2 between the non-parentheses characters 1 and 3. Also there are probably some edge cases for start of and end of string.
My suggestion is to not use a regex for this. Maybe a regex that matches numbers and then use the capture info to check if the surrounding characters are not parentheses.
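A minimal sketch of that suggestion: match every run of digits, then use the match offsets to inspect the neighbouring characters. It assumes that "inside parentheses" means a number directly wrapped as (35); the helper name numbers_outside_parens is made up.

```perl
use strict;
use warnings;

# numbers_outside_parens is a hypothetical helper. It returns every
# digit run that is not directly enclosed as "(digits)".
sub numbers_outside_parens {
    my ($s) = @_;
    my @found;
    while ($s =~ /\d+/g) {
        my ($start, $end) = ($-[0], $+[0]);   # offsets of this match
        my $before = $start > 0         ? substr($s, $start - 1, 1) : '';
        my $after  = $end < length($s)  ? substr($s, $end, 1)       : '';
        next if $before eq '(' && $after eq ')';
        push @found, substr($s, $start, $end - $start);
    }
    return @found;
}

print join(',', numbers_outside_parens("abcd 1 (35) (df)")), "\n";
# 1
```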
The regular expression would be:
^[a-z]+ ([0-9]+) \([0-9]+\) \([a-z]+\)$
The result is the first (and only) matching group of the regex.
You may want to remove the ^ and $ if the number can appear inside a longer line rather than making up the whole line. You can also use [a-zA-Z] or [[:alpha:]] instead of [a-z]. This depends on the regular expression engine you use and, of course, on the content you want to match.
Example Perl code:
if (m/^[a-z]+ ([0-9]+) \([0-9]+\) \([a-z]+\)$/) {
    print("$1\n");
}
Please note that your question does not contain enough information to make a good answer possible (you did not say anything about the general format of your expression, for example whether you want to match integers or floating-point numbers).
How about
/(?:^|[^\d(])(\d+)(?:[^\d)]|$)/
? This matches a string of digits (\d+) that is
preceded by the beginning of the string, or by a character that is not a digit or an open parenthesis ((?:^|[^\d(]))
followed by the end of the string, or by a character that is not a digit or a close parenthesis ((?:[^\d)]|$))
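One thing to watch when this pattern is used with /g: it consumes the character after each number, so two numbers separated by a single character yield only the first. The sketch below compares it with a zero-width lookaround variant that applies the same two conditions; that variant and the helper name extract are additions for illustration, not part of the answer.

```perl
use strict;
use warnings;

# The pattern from the answer, plus an assumed lookaround rewrite that
# checks the same neighbours without consuming them.
my $consuming  = qr/(?:^|[^\d(])(\d+)(?:[^\d)]|$)/;
my $lookaround = qr/(?<![\d(])(\d+)(?![\d)])/;

# extract is a hypothetical helper returning all group-1 captures.
sub extract {
    my ($re, $s) = @_;
    my @m = $s =~ /$re/g;
    return @m;
}

print join(',', extract($consuming,  'abcd 1 (35) (df)')), "\n";  # 1
print join(',', extract($consuming,  '1 2 (3)')), "\n";           # 1
print join(',', extract($lookaround, '1 2 (3)')), "\n";           # 1,2
```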

Insertion with Regex to format a date (Perl)

Suppose I have a string 04032010.
I want it to be 04/03/2010. How would I insert the slashes with a regex?
To do this with a regex, try the following:
my $var = "04032010";
$var =~ s{ (\d{2}) (\d{2}) (\d{4}) }{$1/$2/$3}x;
print $var;
\d matches a single digit, and {n} repeats the preceding element n times. Combined, \d{2} matches two digits and \d{4} matches four digits. Surrounding each set with parentheses stores what it matched in a variable: $1, $2, $3, and so on.
Some of the prior answers used a . to match. That is not a good idea, because . matches any character. The regex built here is much stricter about what it accepts.
You'll notice I used extra spacing in the regex; the x modifier tells the engine to ignore whitespace in the pattern. It can make the regex quite a bit more readable.
Compare s{(\d{2})(\d{2})(\d{4})}{$1/$2/$3}x; vs s{ (\d{2}) (\d{2}) (\d{4}) }{$1/$2/$3}x;
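To show the stricter behaviour in action, here is a small sketch built on the same substitution. The \A and \z anchors and the helper name add_slashes are additions for this illustration; they make the substitution apply only when the whole string is exactly eight digits.

```perl
use strict;
use warnings;

# add_slashes is a hypothetical helper. With /x the spaces inside the
# pattern are ignored, so they only serve readability.
sub add_slashes {
    my ($date) = @_;
    (my $out = $date) =~ s{ \A (\d{2}) (\d{2}) (\d{4}) \z }{$1/$2/$3}x;
    return $out;
}

print add_slashes("04032010"), "\n";   # 04/03/2010
print add_slashes("04-03-10"), "\n";   # 04-03-10 (unchanged, not 8 digits)
```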
Well, a regular expression just matches, but you can try something like this:
s/(..)(..)(....)/$1\/$2\/$3/
#!/usr/bin/perl
$var = "04032010";
$var =~ s/(..)(..)(....)/$1\/$2\/$3/;
print $var, "\n";
Works for me:
$ perl perltest
04/03/2010
I always prefer to use a different delimiter if / is involved so I would go for
s| (\d\d) (\d\d) |$1/$2/|x ;