Regex: How could I avoid a character class? - regex

Is it possible to write this without the [^...], using \P{...} instead?
#!/usr/bin/env perl
use warnings;
use 5.012;
use utf8;
my $string = '_${Hello}?${World}!';
$string =~ s/[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}]/-/g;
say "<$string>";

Well, it's possible but I don't think I'd call it an improvement:
#!/usr/bin/env perl
use warnings;
use 5.012;
use utf8;
my $string = '_${Hello}?${World}!';
$string =~ s/(?=\P{Alphabetic})
(?=\P{Mark})
(?=\P{Decimal_Number})
(?=\P{Connector_Punctuation}) . /-/xgs;
say "<$string>";
With multiple positive lookaheads, they all have to succeed. So it matches one character (the .) that is not Alphabetic and not Mark and not Decimal_Number and not Connector_Punctuation, just like the negated character class would.
I added the /s modifier because the original regex would match a newline (although your sample string doesn't have one). I added /x so I could add some whitespace and break it over multiple lines.
What do you have against character classes, anyway?
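For comparison, here is a minimal sketch of a slightly shorter class-free variant: a single negative lookahead over an alternation instead of four positive lookaheads on \P{...}. The variable names are made up for illustration, and the sample string is the one from the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.012;

my $input = '_${Hello}?${World}!';

# Form 1: the negated character class from the question.
(my $with_class = $input)
    =~ s/[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}]/-/g;

# Form 2: no character class. The negative lookahead rejects any position
# where one of the four properties matches, then . consumes the character.
(my $with_lookahead = $input) =~ s/
    (?! \p{Alphabetic} | \p{Mark} | \p{Decimal_Number} | \p{Connector_Punctuation} )
    .
/-/xgs;

say "<$with_class>";
say "<$with_lookahead>";
```

Both substitutions should print the same line, which is a quick way to check that a rewrite like this really is equivalent on your own data.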

Related

How to replace special characters to underscore(_) perl

my @folder = ('s,c%','c__pp_p','Monday_øå_Tuesday, Wednesday','Monday & Tuesday','Monday_Tuesday___Wednesday');
if ($folder =~ s/[^\w_*\-]/_/g ) {
$folder =~ s/_+/_/g;
print "$folder : Got %\n" ;
}
Using the above code I am not able to handle this: "Monday_øå_Tuesday_Wednesday"
The output should be :
s_c
c_pp_p
Monday_øå_Tuesday_Wednesday
Monday_Tuesday
Monday_Tuesday_Wednesday
You can use \W to negate the \w character class, but the problem you've got is that \w doesn't match your non-ASCII letters.
So you need to do something like this instead:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @folder = ('s,c%','c__pp_p','Monday_øå_Tuesday, Wednesday','Monday & Tuesday','Monday_Tuesday___Wednesday');
s/[^\p{Alpha}]+/_/g for @folder;
print Dumper \@folder;
Outputs:
$VAR1 = [
's_c_',
'c_pp_p',
'Monday_øå_Tuesday_Wednesday',
'Monday_Tuesday',
'Monday_Tuesday_Wednesday'
];
This uses a Unicode property (these are documented in perldoc perluniprop). The long and short of it is that \p{Alpha} is the Unicode alphabetic set: much like \w but internationalised, except that it does not match digits or the underscore.
It does leave a trailing _ on the first item, though, which your expected output (s_c) doesn't have. To get rid of it, it's probably easier to:
s/_$// for @folder;
than make a more complicated pattern.
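Putting the two steps together, a minimal sketch might look like the following. It assumes the script itself is saved as UTF-8 (hence use utf8) and that digits should be treated as "special" too, since \p{Alpha} does not match them. Neither assumption comes from the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                              # the source contains literal ø and å
binmode STDOUT, ':encoding(UTF-8)';    # encode those characters on output

my @folder = (
    's,c%', 'c__pp_p', 'Monday_øå_Tuesday, Wednesday',
    'Monday & Tuesday', 'Monday_Tuesday___Wednesday',
);

for (@folder) {
    s/\P{Alpha}+/_/g;    # each run of non-letters becomes one underscore
    s/_\z//;             # drop a trailing underscore, if there is one
}

print "$_\n" for @folder;
```

\P{Alpha}+ is just the class-free spelling of [^\p{Alpha}]+. If digits should survive, something like [^\p{Alpha}\p{Nd}]+ would be the place to start.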

Perl - Removing all special characters except a few

So I came across a Perl regex "term" which allows you to remove all punctuation. Here is the code:
$string =~ s/[[:punct:]]//g;
However, this removes all special characters. Is there a way that regex can be modified so that, for example, it removes all special characters except hyphens? As I stated in my previous question about Perl, I am new to the language, so obvious things aren't obvious to me. Thanks for all the help.
Change your code as below to remove all punctuation except hyphens:
$string =~ s/(?!-)[[:punct:]]//g;
DEMO
use strict;
use warnings;
my $string = "foo;\"-bar'.,...*(){}[]----";
$string =~ s/(?!-)[[:punct:]]//g;
print "$string\n";
Output:
foo-bar----
You may also use a Unicode property:
$string =~ s/[^-\PP]+//g;
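The two patterns are not quite interchangeable, though. As perlrecharclass describes it, [[:punct:]] also covers the ASCII symbol characters such as $ and +, while \p{P} covers only characters Unicode classes as punctuation. A small sketch showing the difference (the sample string is made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Single quotes so that $b is not interpolated.
my $sample = 'a$b+c-d;e';

# POSIX class guarded by a lookahead that spares the hyphen.
(my $posix = $sample) =~ s/(?!-)[[:punct:]]//g;

# Unicode property: "not a hyphen and not a non-punctuation character".
(my $unicode = $sample) =~ s/[^-\PP]+//g;

print "posix:   $posix\n";
print "unicode: $unicode\n";
```

Which one you want depends on whether characters like $ and + count as "special" for your purposes.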

Perl simple regex uppercase words separated by underscore

Say I have a string like print_this_text_in_camel_case and I want to uppercase the first letter of the first word and of every word after an underscore, so the result will be Print_This_Text_In_Camel_Case. The test below does not work on the first word.
#!/usr/bin/perl
my $str = "print_this_text_in_camel_case";
$str =~ s/(_.)/uc($1)/ge;
print $str, "\n";
Just modify the regex to match the first char as well:
#!/usr/bin/perl
my $str = "print_this_text_in_camel_case";
$str =~ s/(_.|^.)/uc($1)/ge;
print $str, "\n";
will print out:
Print_This_Text_In_Camel_Case
You need to add a beginning-of-string anchor as an alternative to the underscore.
For Perl 5.10+, I'd use a \K (keep) escape to emulate variable-width look-behind and only uppercase the letter. I'd also use \U to perform the uppercasing in the replacement text instead of uc and the /e (eval) modifier.
$str =~ s/(?:^|_)\K(.)/\U$1/g;
If you're using an older version of Perl (without \K) you could do it this way:
$str =~ s/(^|_)(.)/$1\U$2/g;
Another alternative is using split and join instead of a regex:
$str = join '_', map { ucfirst } split /_/, $str;
It is tidiest to use a negative look-behind. This code fragment upper-cases all letters that aren't preceded by a letter.
my $str = "print_this_text_in_camel_case";
$str =~ s/ (?<!\p{alpha}) (\p{alpha}) /uc $1/xgei;
print $str, "\n";
output
Print_This_Text_In_Camel_Case
If you prefer, or if you have a very old copy of Perl that doesn't support Unicode properties, you can use [a-z] instead of \p{alpha}, like this:
$str =~ s/ (?<![a-z]) ([a-z]) /uc $1/xige;
which produces the same result.
You could also use ucfirst
use feature 'say';
my $str = "print_this_text_in_camel_case";
my @split = map(ucfirst, (split/(_)/, $str));
say @split;
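To see that these approaches agree on the sample input, here is a minimal side-by-side sketch. The variable names are invented for the comparison, and \K needs Perl 5.10 or later, as noted above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $input = 'print_this_text_in_camel_case';

# \K: match start-of-string or an underscore, keep it, uppercase what follows.
(my $keep = $input) =~ s/(?:^|_)\K(.)/\U$1/g;

# Negative look-behind: uppercase any letter not preceded by a letter.
(my $lookbehind = $input) =~ s/(?<!\p{Alpha})(\p{Alpha})/\U$1/g;

# No substitution at all: split on underscores, ucfirst each piece, rejoin.
my $joined = join '_', map { ucfirst } split /_/, $input;

print "$_\n" for $keep, $lookbehind, $joined;
```

They can differ on edge cases. For instance, split drops trailing empty fields, so the split and join version would lose trailing underscores that the substitutions leave alone.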

How can I use regex to remove /1 or /2?

Regex gurus,
Here is the following line of code I want to parse with regex:
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1
I want to obtain the following:
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0
I have written the following regex on rubular.com:
(@.* *.)(!?(\/.))
My idea is to use negation to remove /1 by (!?(\/.)). However, this produces the entire line?
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1
Why is (?!thisismystring) not removing /1? I googled the fire out of this, but they seemed to suggest similar things I am already trying? I deeply appreciate your help.
I think what you are trying to write is /(\@.* .*)(?=\/\d)/ (you need to escape the at sign @ to prevent Perl from treating it as an array). You need a positive look-ahead, not a negative one, because you want to match everything up to the point where the following characters are a slash followed by a digit.
Here is a program that demonstrates.
use strict;
use warnings;
use 5.010;
my $s = '@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1';
$s =~ /(\@.* .*)(?=\/.)/;
print $1, "\n";
But you would be much better off copying the whole string and removing the slash and everything after it, like this
use strict;
use warnings;
my $s = '@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1';
(my $fixed = $s) =~ s{/\d+$}{};
print $fixed, "\n";
output
@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0
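If only a trailing /1 or /2 should ever be removed, the pattern can be made stricter. This is a minimal sketch under that assumption. The second header is invented for illustration, the header line is assumed to begin with an at sign (as the remark about escaping implies), and the /r modifier needs Perl 5.14 or later.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.014;    # for the non-destructive /r modifier

# Single quotes matter here: inside double quotes Perl would try to
# interpolate @ERR030882 as an array.
my @headers = (
    '@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1',
    '@ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/2',
);

# Remove a final /1 or /2 only; \z anchors at the very end of the string.
my @trimmed = map { s{/[12]\z}{}r } @headers;

say for @trimmed;
```

Because of /r, @headers is left untouched and @trimmed holds the cleaned copies.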

Perl Regular expression to replace the last matching string

I have a string as below:
$str = "/dir1/dir2/dir3/file.txt"
I want to remove the /file.txt from this string.
So that the $str will become.
$str = "/dir1/dir2/dir3"
I am using the following regex. But it is replacing everything.
$str =~ s/\/.*\.txt//;
How can I make regex to look for last '/' instead of first.
What is the correct regular expression for this?
Please note that file.txt is not fixed name. It can be anything like file1.txt, file2.txt, etc.
If you want to get the path from that string, you can use File::Basename. It has been a core module since Perl 5.
perl -MFile::Basename -le '$str = "/dir2/dir3/file.txt"; print dirname($str);'
In script form:
use strict;
use warnings; # always use these
use File::Basename;
my $str = "/dir1/dir2/dir3/file.txt";
print dirname($str), "\n";
Your regex does not work because it is not anchored, and .* is greedy, so it matches as much as it can, starting from the first slash / it encounters. A working regex would look something like this:
$str =~ s#/[^/]*?\.txt$##;
Note the use of a non-greedy quantifier *?, which will match the smallest possible string. Also note that I use another delimiter for the substitution to avoid "leaning toothpick syndrome", e.g. s/\/\/\///.
Very simple regex : s/\/[^\/]*$//
In this regex
m/(.*)\/[^\/]*$/
the first submatch is the path you are looking for.
EDIT:
If you are looking for a substitution, user1215106's solution is the way to go:
s/\/[^\/]*$//
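The regex and File::Basename agree on the path from the question but not on every input, so here is a minimal sketch comparing them. The helper name strip_last_component is made up for this example, and the extra paths are invented test cases.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Hypothetical helper: remove the last slash and everything after it.
sub strip_last_component {
    my ($path) = @_;
    $path =~ s{/[^/]*\z}{};    # alternate delimiters avoid escaping the slashes
    return $path;
}

for my $path ('/dir1/dir2/dir3/file.txt', '/dir1/dir2/dir3/file2.txt', 'file.txt') {
    printf "%-28s regex: %-18s dirname: %s\n",
        $path, strip_last_component($path), dirname($path);
}
```

For a bare file name with no slash, the regex leaves the string unchanged while dirname returns ".", so which one is right depends on what the caller expects in that case.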