How to replace special characters with an underscore (_) in Perl - regex

my @folder = ('s,c%','c__pp_p','Monday_øå_Tuesday, Wednesday','Monday & Tuesday','Monday_Tuesday___Wednesday');
if ($folder =~ s/[^\w_*\-]/_/g ) {
$folder =~ s/_+/_/g;
print "$folder : Got %\n" ;
}
Using the code above, I am not able to handle this: "Monday_øå_Tuesday_Wednesday"
The output should be :
s_c
c_pp_p
Monday_øå_Tuesday_Wednesday
Monday_Tuesday
Monday_Tuesday_Wednesday

You can use \W to negate the \w character class, but the problem you've got is that \w doesn't match your non-ascii letters.
So you need to do something like this instead:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @folder = ('s,c%','c__pp_p','Monday_øå_Tuesday, Wednesday','Monday & Tuesday','Monday_Tuesday___Wednesday');
s/[^\p{Alpha}]+/_/g for @folder;
print Dumper \@folder;
Outputs:
$VAR1 = [
's_c_',
'c_pp_p',
'Monday_øå_Tuesday_Wednesday',
'Monday_Tuesday',
'Monday_Tuesday_Wednesday'
];
This uses a Unicode property (these are documented in perldoc perluniprop). The long and short of it is that \p{Alpha} is the Unicode alphabetic set, so it is much like \w but internationalised, and without the decimal digits and underscore that \w also matches.
Although, it does have a trailing _ on the first line. From your description, that seems to be what you wanted. If not, then... it's probably easier to:
s/_$// for @folder;
than make a more complicated pattern.
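The two steps above can be combined into one small routine. This is only a sketch: the name sanitize_name is made up for illustration, and it assumes the script itself is saved as UTF-8, which is why it adds use utf8 and an output encoding layer that the answer's script does not have.

```perl
use strict;
use warnings;
use utf8;                              # assumption: this source file is UTF-8
binmode(STDOUT, ':encoding(UTF-8)');   # print decoded characters as UTF-8

# Hypothetical helper: turn each run of non-letters into a single underscore,
# then drop a trailing underscore if one is left over.
sub sanitize_name {
    my ($name) = @_;
    $name =~ s/\P{Alpha}+/_/g;   # \P{Alpha} is the same as [^\p{Alpha}]
    $name =~ s/_\z//;            # remove the trailing underscore, if any
    return $name;
}

print sanitize_name($_), "\n"
    for ('s,c%', 'c__pp_p', 'Monday_øå_Tuesday, Wednesday', 'Monday & Tuesday');
```

Like the pattern it is based on, this also turns digits into underscores, which may or may not be what you want for folder names.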

Related

Regex searching and adding characters

I'm trying to use regex to add $ to the start of words in a string such that:
Answer = partOne + partTwo
becomes
$Answer = $partOne + $partTwo
I'm using / [a-z]/ to locate them but not sure what I'm meant to replace it with.
Is there any way to do it with a regex, or am I supposed to just split up my string and put in the $?
I'm using perl right now.
You can match word boundary \b, followed by word class \w
my $s = 'Answer = partOne + partTwo';
$s =~ s|\b (?= \w)|\$|xg;
print $s;
output
$Answer = $partOne + $partTwo
You could use a lookahead to match only a whitespace character or the start-of-line anchor that is immediately followed by a letter. Replace the matched whitespace character or start anchor with itself plus a $ symbol.
use strict;
use warnings;
while(my $line = <DATA>) {
$line =~ s/(^|\s)(?=[A-Za-z])/$1\$/g;
print $line;
}
__DATA__
Answer = partOne + partTwo
Output:
$Answer = $partOne + $partTwo
Perl's regexes have a word character class \w that is meant for exactly this sort of thing. It matches upper-case and lower-case letters, decimal digits, and the underscore _.
So if you prefix all occurrences of one or more such characters with a dollar sign, it will achieve what you ask. It would look like this
use strict;
use warnings;
my $str = 'Answer = partOne + partTwo';
$str =~ s/(\w+)/\$$1/g;
print $str, "\n";
output
$Answer = $partOne + $partTwo
But please note that, if the text you're processing is a programming language, this will also process all comments and string literals in a way you probably don't want.
(\w+)
You can use this. Replace with \$$1.
See demo.
http://regex101.com/r/lS5tT3/40
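Because \w also matches decimal digits, the \w-based answers above would put a $ in front of a bare number as well. If that matters, one possible variation (a sketch only; add_sigils is a made-up name) is to insert the $ at a word boundary only when the word starts with a letter or underscore:

```perl
use strict;
use warnings;

# Hypothetical helper: put a '$' before each word that starts with a letter
# or underscore, leaving words that start with a digit alone.
sub add_sigils {
    my ($text) = @_;
    $text =~ s/\b(?=[A-Za-z_])/\$/g;   # zero-width match, so nothing is lost
    return $text;
}

print add_sigils('Answer = partOne + 2 * partTwo'), "\n";
# prints: $Answer = $partOne + 2 * $partTwo
```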

Match string in Perl with space or with no-space in it

$search_buffer="this text has teststring in it, it has a Test String too";
@waitfor=('Test string','some other string');
foreach my $test (@waitfor)
{
eval ('if (lc $search_buffer =~ lc ' . $test . ') ' .
'{' .
' $prematch = $`;' .
' $match = $&; ' .
' $postmatch = ' . "\$';" .
'}');
print "prematch=$prematch\n";
print "match=$match\n"; #I want to match both "teststring" and "Test String"
print "postmatch=$postmatch\n";
}
I need to print both teststring and Test String, can you please help? thanks.
my $search_buffer="this text has teststring in it, it has a Test String too";
my $pattern = qr/test ?string/i;
say "Match found: $1" while $search_buffer =~ /($pattern)/g;
That is a horrible piece of code you have there. Why are you using eval and trying to concatenate strings into code, remembering to interpolate some variables and forgetting about some? There is no reason to use eval in that situation at all.
I assume that by using lc you are trying to make the match case-insensitive. This is best done by using the /i modifier on your regex:
$search_buffer =~ /$test/i; # case insensitive match
In your case, you are trying to match some strings against another string, and you want to compensate for case and for possible whitespace inside. I assume that your strings are generated in some way, and not hard coded.
What you could do is simply make use of the /x modifier, which causes literal whitespace inside your regex to be ignored.
Something that you should take into consideration is meta characters inside your strings. If you for example have a string such as foo?, the meta character ? will alter the meaning of your regex. You can disable meta characters inside a regex with the \Q ... \E escape sequence.
So the solution is
use strict;
use warnings;
use feature 'say';
my $s = "this text has teststring in it, it has a Test String too";
my @waitfor= ('Test string','some other string', '#test string');
for my $str (@waitfor) {
if ($s =~ /\Q$str/xi) {
say "prematch = $`";
say "match = $&";
say "postmatch = $'";
}
}
Output:
prematch = this text has teststring in it, it has a
match = Test String
postmatch = too
Note that I use
use strict;
use warnings;
These two pragmas are vital to learning how to write good Perl code, and there is no (valid) reason you should ever write code without them.
This would work for your specific example.
test\s?string
Basically it marks the space as optional [\s]?.
The problem that I'm seeing with this is that it requires you to know where exactly there might be a space inside the string you're searching.
Note: You might also have to use the case-insensitive flag which would be /Test[\s]?String/i
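If the search strings are generated rather than hard coded, the optional whitespace can be built into the pattern automatically. Note that \Q escapes spaces as well, so a \Q$str match still requires the space to be present, which is consistent with the output shown above finding only "Test String". The following is a sketch under that assumption; flexible_pattern is a made-up name, and it only makes the whitespace that is already in the phrase optional.

```perl
use strict;
use warnings;

# Hypothetical helper: build a case-insensitive pattern from a plain phrase,
# treating each run of whitespace in the phrase as optional whitespace.
sub flexible_pattern {
    my ($phrase) = @_;
    # Escape each word so metacharacters are literal, then join with \s*.
    my $body = join '\s*', map { quotemeta($_) } split ' ', $phrase;
    return qr/$body/i;
}

my $text    = 'this text has teststring in it, it has a Test String too';
my $pattern = flexible_pattern('Test string');
my @found   = $text =~ /($pattern)/g;   # list context returns every match
print "match = $_\n" for @found;
```

With the sample text this finds both teststring and Test String.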

Regex: How could I avoid a character class?

Is it possible to write this without the [^...] but with using the \P{...}?
#!/usr/bin/env perl
use warnings;
use 5.012;
use utf8;
my $string = '_${Hello}?${World}!';
$string =~ s/[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}]/-/g;
say "<$string>";
Well, it's possible but I don't think I'd call it an improvement:
#!/usr/bin/env perl
use warnings;
use 5.012;
use utf8;
my $string = '_${Hello}?${World}!';
$string =~ s/(?=\P{Alphabetic})
(?=\P{Mark})
(?=\P{Decimal_Number})
(?=\P{Connector_Punctuation}) . /-/xgs;
say "<$string>";
With multiple positive lookaheads, they all have to succeed. So it matches one character (the .) that is not Alphabetic and not Mark and not Decimal_Number and not Connector_Punctuation, just like the negated character class would.
I added the /s modifier because the original regex would match a newline (although your sample string doesn't have one). I added /x so I could add some whitespace and break it over multiple lines.
What do you have against character classes, anyway?
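To check that the two forms really behave the same, they can be run side by side. This is just a sketch; the helper names dash_with_class and dash_with_lookaheads are made up for the comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: the negated character class from the question.
sub dash_with_class {
    my ($text) = @_;
    $text =~ s/[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}]/-/g;
    return $text;
}

# Hypothetical helper: the chained lookaheads from the answer, on one line.
sub dash_with_lookaheads {
    my ($text) = @_;
    $text =~ s/(?=\P{Alphabetic})(?=\P{Mark})(?=\P{Decimal_Number})(?=\P{Connector_Punctuation})./-/gs;
    return $text;
}

my $string = '_${Hello}?${World}!';
print dash_with_class($string), "\n";   # prints: _--Hello----World--
print dash_with_class($string) eq dash_with_lookaheads($string)
    ? "same result\n"
    : "different result\n";
```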

How can I extract a substring up to the first digit?

How can I find the first substring until I find the first digit?
Example:
my $string = 'AAAA_BBBB_12_13_14' ;
Result expected: 'AAAA_BBBB_'
Judging from the tags you want to use a regular expression. So let's build this up.
We want to match from the beginning of the string so we anchor with a ^ metacharacter at the beginning
We want to match anything but digits so we look at the character classes and find out this is \D
We want 1 or more of these so we use the + quantifier which means 1 or more of the previous part of the pattern.
This gives us the following regular expression:
^\D+
Which we can use in code like so:
my $string = 'AAAA_BBBB_12_13_14';
$string =~ /^\D+/;
my $result = $&;
Most people got half of the answer right, but they missed several key points.
You can only trust the match variables after a successful match. Don't use them unless you know you had a successful match.
The $&, $`, and $' variables have well-known performance penalties across all regexes in your program.
You need to anchor the match to the beginning of the string. Since Perl now has user-settable default match flags, you want to stay away from the ^ beginning of line anchor. The \A beginning of string anchor won't change what it does even with default flags.
This would work:
my $substring = $string =~ m/\A(\D+)/ ? $1 : undef;
If you really wanted to use something like $&, use Perl 5.10's per-match version instead. The /p switch provides non-global-performance-sucking versions:
my $substring = $string =~ m/\A\D+/p ? ${^MATCH} : undef;
If you're worried about what might be in \D, you can specify the character class yourself instead of using the shortcut:
my $substring = $string =~ m/\A[^0-9]+/p ? ${^MATCH} : undef;
I don't particularly like the conditional operator here, so I would probably use the match in list context:
my( $substring ) = $string =~ m/\A([^0-9]+)/;
If there must be a number in the string (so you don't match an entire string that has no digits), you can throw in a lookahead, which won't be part of the capture:
my( $substring ) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
$str =~ /(\d)/; print $`;
This code prints the part of the string that precedes the match.
perl -le '$string=q(AAAA_BBBB_12_13_14);$string=~m{(\D+)} and print $1'
AAAA_BBBB_
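The list-context match with a lookahead can be wrapped in a small function so the failure cases are easy to see. This is a sketch; prefix_before_digit is a made-up name, and returning undef on failure is just one possible convention.

```perl
use strict;
use warnings;

# Hypothetical helper: return the leading run of non-digits, or undef when
# the string starts with a digit or contains no digit at all.
sub prefix_before_digit {
    my ($string) = @_;
    my ($prefix) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
    return $prefix;   # undef if the match failed
}

for my $input ('AAAA_BBBB_12_13_14', '12_AAAA', 'no digits here') {
    my $prefix = prefix_before_digit($input);
    print "$input => ", defined $prefix ? "'$prefix'" : 'undef', "\n";
}
```

Only the first input yields a prefix ('AAAA_BBBB_'); the other two give undef because the match never succeeds.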

How to match '(' using a regex?

When I do this
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $s = 'dfgdfg5 )';
my $a = '5 )';
my $b = '567';
$s =~ s/$a/$b/g;
print Dumper $s;
I get
Unmatched ) in regex; marked by <-- HERE in m/5 ) <-- HERE / at ./test.pl line 11.
The problem is that $a contains a ).
How do I prevent the regex from failing?
Update
I get the string in $a from a database query, so I can't change it. Or would it be possible to make an $a2 where "something" searches for ) and replaces them with \)?
You need to escape it. Either manually by adding backslash in front of it, or by using quotemeta or the \Q sequence inside the regex:
$a = quotemeta($a);
Or
$s =~ s/\Q$a/$b/g;
ETA: This is a good option if you want to match literal strings from a database query.
You should also be aware that it is not a good idea to use $a and $b as variables, since they will mask the predefined variables that are used with sort. E.g. sort { $a <=> $b } @foo.
The simple answer is to backslash escape the paren. my $a = '5 \)'; In your case, as your post mentions, you aren't the one creating the strings, so literally escaping them isn't an option.
It may be simpler to just wrap the variable that's being interpolated by the regex inside of a \Q ... \E.
$s =~ s/\Q$a\E/$b/g;
The quotemeta() function may also be helpful to you, depending on how your code is factored. With that option you would pass $a through quotemeta before interpolating it in the regex. \Q...\E is probably easier in this situation, but if your code is simplified by using quotemeta instead, it's there for you.
Use \) instead of just ). A ) is special because it normally closes a capture group, so you need to escape it first.
Escape the parentheses with a backslash:
my $a = '5 \)';
Or use \Q inside the regexp:
$s =~ s/\Q$a/$b/g;
Also when storing regexps in a variable, you should look into the regexp quote operator: http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators
my $a = qr/5 \)/oi;
In Perl regular expressions you need to escape special characters with a backslash \.
Try
my $a = '5 \)';
my $b = '567';
$s =~ s/$a/$b/g;
For details and a good start see perldoc perlretut
Update: I didn't know the RE came from a database. Well, the code above works nevertheless. The hint for the tutorial still applies.
I think you just need to escape the brackets, ie replace ) with \)
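For a search string that comes from somewhere you don't control, such as the database query mentioned in the question, the \Q...\E approach can be wrapped up as below. This is a sketch; replace_literal is a made-up name, and the variables are deliberately not called $a and $b.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every literal occurrence of $find in $text.
sub replace_literal {
    my ($text, $find, $replacement) = @_;
    $text =~ s/\Q$find\E/$replacement/g;   # \Q...\E disables metacharacters
    return $text;
}

print replace_literal('dfgdfg5 )', '5 )', '567'), "\n";   # prints: dfgdfg567
```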