#!/usr/bin/perl -T
use strict;
use warnings;
use utf8;
my $s = shift || die;
$s =~ s/[^A-Za-z ]//g;
print "$s\n";
exit;
> ./poc.pl "El Guapö"
El Guap
Is there a way to modify this Perl code so that various umlauts and character accents are not stripped out? Thanks!
For the direct question, you may simply need the \p{L} (Letter) Unicode character property.
However, more importantly, decode all input and encode output.
use warnings;
use strict;
use feature 'say';
use utf8; # allow non-ascii (UTF-8) characters in the source
use open ':std', ':encoding(UTF-8)'; # for standard streams
use Encode qw(decode_utf8); # @ARGV escapes the above
my $string = 'El Guapö';
if (@ARGV) {
$string = join ' ', map { decode_utf8($_) } @ARGV;
}
say "Input: $string";
$string =~ s/[^\p{L} ]//g;
say "Processed: $string";
When run as script.pl 123 El Guapö=_
Input: 123 El Guapö=_
Processed: El Guapö
I've used the "blanket" \p{L} property (Letter), as specific description is lacking; adjust if/as needed. The Unicode properties provide a lot, see the link above and the complete list at perluniprops.
The space between 123 El remains, perhaps strip leading (and trailing) spaces in the end.
Note that there is also \P{L}, where the capital P indicates negation.
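The filtering and the trimming mentioned above can be combined in one place. This is only a minimal sketch: the helper name keep_letters is made up for illustration, and it assumes the text has already been decoded to characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): keep letters and spaces,
# then tidy the whitespace that is left behind.
# Assumes $text is a decoded character string, not raw UTF-8 bytes.
sub keep_letters {
    my ($text) = @_;
    $text =~ s/[^\p{L} ]//g;    # drop everything that is not a letter or a space
    $text =~ s/ {2,}/ /g;       # collapse runs of spaces
    $text =~ s/^ +| +$//g;      # strip leading and trailing spaces
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print keep_letters("123 El Guap\x{F6}=_"), "\n";
```

Collapsing inner runs of spaces is an extra step beyond what the answer describes; drop that line if the original spacing matters.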
The above simple-minded \pL won't work with Combining Diacritical Marks, as the mark will be removed as well. Thanks to jm666 for pointing this out.
This happens when an accented "logical" character (extended grapheme cluster, what appears as a single character) is written using separate characters for its base and for non-spacing mark(s) (combining accents). Often a single character for it with its codepoint also exists.
Example: in niño the ñ is U+00F1 but it can also be written as "n\x{303}".
To keep accents written this way add \p{Mn} (\p{NonspacingMark}) to the character class:
my $string = "El Guapö=_ ni\N{U+00F1}o.* nin\x{303}o+^";
say $string;
(my $nodiac = $string) =~ s/[^\pL ]//g; #/ naive, accent chars get removed
say $nodiac;
(my $full = $string) =~ s/[^\pL\p{Mn} ]//g; # add non-spacing mark
say $full;
Output
El Guapö=_ niño.* niño+^
El Guapö niño nino
El Guapö niño niño
So you want s/[^\p{L}\p{Mn} ]//g in order to keep the combining accents.
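Since a single precomposed character often exists, another option is to normalize the text to NFC before filtering. This is a different technique from the one in the answer, shown as a sketch only; it uses the core Unicode::Normalize module, and the helper name keep_letters_nfc is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: compose base letter + combining mark where a
# precomposed character exists, then filter. \p{Mn} stays in the class
# for marks that have no precomposed form.
# Assumes $text is a decoded character string.
sub keep_letters_nfc {
    my ($text) = @_;
    $text = NFC($text);
    $text =~ s/[^\p{L}\p{Mn} ]//g;
    return $text;
}

my $decomposed = "nin\x{303}o+^";    # n followed by COMBINING TILDE
my $clean      = keep_letters_nfc($decomposed);
printf "%d characters\n", length $clean;
```

After normalization the two spellings of niño compare equal, which the plain \p{Mn} approach does not give you.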
Related
I am trying to find a solution in perl that captures the filename in the following string -- between the tic marks.
my $str = "Saving to: ‘wapenc?T=mavodi-7-13b-2b-3-96-1e3431a’";
(my $results) = $str =~ /‘(.*?[^\\])‘/;
print $results if $results;
I need to end up with wapenc?T=mavodi-7-13b-2b-3-96-1e3431a
The final tick in your regex is different from the one in the input string: the input ends with char 8217 (RIGHT SINGLE QUOTATION MARK, U+2019), but your regex uses 8216 (LEFT SINGLE QUOTATION MARK, U+2018) a second time. Also, when using Unicode characters in the source, be sure to include
use utf8;
and save the file UTF-8 encoded.
After fixing these two issues, the code worked for me:
#! /usr/bin/perl
use warnings;
use strict;
use utf8;
my $str = "Saving to: ‘wapenc?T=mavodi-7-13b-2b-3-96-1e3431a’";
(my $results) = $str =~ /‘(.*?[^\\])’/;
print $results if $results;
Your tic characters aren't in the 7-bit ASCII character set, so there is a whole character-encoding rabbit hole to go down here. But the quick and dirty solution is to capture everything in between extended characters.
($result) = $str =~ /[^\0-\x7f]+(.*?)[^\0-\x7f]/;
[^\0-\x7f] matches characters with character values not between 0 and 127, i.e., anything that is not a 7-bit ASCII character (the ASCII range itself includes newlines, tabs, and other control characters). This regular expression will work whether your input is UTF-8 encoded or has already been decoded, and may work for other character encodings, too.
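If the line arrives as raw bytes (for example, read from another program's output), the rabbit hole can be kept shallow by decoding first and naming the two quote characters explicitly. A minimal sketch, assuming the bytes are UTF-8; the helper name quoted_name is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw UTF-8 bytes and returns the text between
# U+2018 and U+2019, or undef if no such pair is present.
sub quoted_name {
    my ($bytes) = @_;
    my $line = decode('UTF-8', $bytes);
    my ($name) = $line =~ /\N{U+2018}(.*?)\N{U+2019}/;
    return $name;
}

# The same line as in the question, spelled out as UTF-8 bytes.
my $bytes = "Saving to: \xE2\x80\x98wapenc?T=mavodi-7-13b-2b-3-96-1e3431a\xE2\x80\x99";
print quoted_name($bytes), "\n";
```

Writing the quotes as \N{U+...} escapes keeps the source file pure ASCII, so it does not depend on use utf8 or on how the file was saved.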
my @folder = ('s,c%','c__pp_p','Monday_øå_Tuesday, Wednesday','Monday & Tuesday','Monday_Tuesday___Wednesday');
if ($folder =~ s/[^\w_*\-]/_/g ) {
$folder =~ s/_+/_/g;
print "$folder : Got %\n" ;
}
Using the above code, I am not able to handle "Monday_øå_Tuesday_Wednesday".
The output should be :
s_c
c_pp_p
Monday_øå_Tuesday_Wednesday
Monday_Tuesday
Monday_Tuesday_Wednesday
You can use \W to negate the \w character class, but the problem you've got is that \w doesn't match your non-ascii letters.
So you need to do something like this instead:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @folder = ('s,c%','c__pp_p','Monday_øå_Tuesday, Wednesday','Monday & Tuesday','Monday_Tuesday___Wednesday');
s/[^\p{Alpha}]+/_/g for @folder;
print Dumper \@folder;
Outputs:
$VAR1 = [
's_c_',
'c_pp_p',
'Monday_øå_Tuesday_Wednesday',
'Monday_Tuesday',
'Monday_Tuesday_Wednesday'
];
This uses a Unicode property - these are documented in perldoc perluniprops - but the long and short of it is, \p{Alpha} is the Unicode alphabetic set, so it matches alphabetic characters in any script but, unlike \w, not digits or the underscore.
Although, it does have a trailing _ on the first line. From your description, that seems to be what you wanted. If not, then... it's probably easier to:
s/_$// for @folder;
than make a more complicated pattern.
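Both substitutions can be wrapped in one small function. This is a sketch under the assumption that the names are decoded character strings; the helper name sanitize_folder is hypothetical, and the non-ASCII letters are written as escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper; assumes $name is a decoded character string.
sub sanitize_folder {
    my ($name) = @_;
    $name =~ s/[^\p{Alpha}]+/_/g;    # each run of non-alphabetic characters becomes one underscore
    $name =~ s/^_+|_+$//g;           # drop underscores at either end
    return $name;
}

binmode STDOUT, ':encoding(UTF-8)';
print sanitize_folder($_), "\n"
    for 's,c%', "Monday_\x{F8}\x{E5}_Tuesday, Wednesday", 'Monday & Tuesday';
```

Note that this also strips a leading underscore, which goes slightly beyond the s/_$// shown above.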
$search_buffer="this text has teststring in it, it has a Test String too";
@waitfor=('Test string','some other string');
foreach my $test (@waitfor)
{
eval ('if (lc $search_buffer =~ lc ' . $test . ') ' .
'{' .
' $prematch = $`;' .
' $match = $&; ' .
' $postmatch = ' . "\$';" .
'}');
print "prematch=$prematch\n";
print "match=$match\n"; #I want to match both "teststring" and "Test String"
print "postmatch=$postmatch\n";
}
I need to print both teststring and Test String, can you please help? thanks.
use feature 'say';
my $search_buffer="this text has teststring in it, it has a Test String too";
my $pattern = qr/test ?string/i;
say "Match found: $1" while $search_buffer =~ /($pattern)/g;
That is a horrible piece of code you have there. Why are you using eval and trying to concatenate strings into code, remembering to interpolate some variables and forgetting about some? There is no reason to use eval in that situation at all.
I assume that by using lc you are trying to make the match case-insensitive. This is best done by using the /i modifier on your regex:
$search_buffer =~ /$test/i; # case insensitive match
In your case, you are trying to match some strings against another string, and you want to compensate for case and for possible whitespace inside. I assume that your strings are generated in some way, and not hard coded.
What you could do is make use of the /x modifier, which causes unescaped literal whitespace inside your regex to be ignored.
Something that you should take into consideration is meta characters inside your strings. If you for example have a string such as foo?, the meta character ? will alter the meaning of your regex. You can disable meta characters inside a regex with the \Q ... \E escape sequence. Be aware that \Q also escapes spaces, and an escaped space is still matched literally under /x.
So the solution is
use strict;
use warnings;
use feature 'say';
my $s = "this text has teststring in it, it has a Test String too";
my @waitfor = ('Test string','some other string', '#test string');
for my $str (@waitfor) {
if ($s =~ /\Q$str/xi) {
say "prematch = $`";
say "match = $&";
say "postmatch = $'";
}
}
Output:
prematch = this text has teststring in it, it has a
match = Test String
postmatch = too
Note that I use
use strict;
use warnings;
These two pragmas are vital to learning how to write good Perl code, and there is no (valid) reason you should ever write code without them.
This would work for your specific example.
test\s?string
Basically it marks the space as optional [\s]?.
The problem that I'm seeing with this is that it requires you to know where exactly there might be a space inside the string you're searching.
Note: You might also have to use the case-insensitive flag which would be /Test[\s]?String/i
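One way around having to know where the space might be is to build the pattern from the phrase itself, making the whitespace between its words optional. A minimal sketch, assuming the phrases are plain text whose words are separated by whitespace; the helper name loose_pattern is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a phrase into a case-insensitive pattern in
# which the whitespace between words is optional. Each word goes through
# quotemeta so that metacharacters such as ? stay literal.
sub loose_pattern {
    my ($phrase) = @_;
    my $body = join '\s*', map { quotemeta($_) } split ' ', $phrase;
    return qr/$body/i;
}

my $buffer  = "this text has teststring in it, it has a Test String too";
my $pattern = loose_pattern('Test string');
my @found   = $buffer =~ /($pattern)/g;    # list context: all matches
print "match = $_\n" for @found;
```

This only relaxes whitespace at the word boundaries of the phrase; it will not find a match where the buffer has a space in the middle of a word.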
I am reading a file and am wondering how to skip lines that have Unicode NULL, U+0000? I have tried everything below, but none works:
if($line)
chomp($line)
$line =~ s/\s*$//g;
Your list of "everything" does not seem to include the obvious $line =~ m/\000/.
Because you asked about Unicode NULL (identical to ASCII NUL when encoded in UTF-8), let’s use the \N{U+...} form, described in the perlunicode documentation.
Unicode characters can also be added to a string by using the \N{U+...} notation. The Unicode code for the desired character, in hexadecimal, should be placed in the braces, after the U. For instance, a smiley face is \N{U+263A}.
You can also match against \N{U+...} in regexes. See below.
#! /usr/bin/env perl
use strict;
use warnings;
my $contents =
"line 1\n" .
"\N{U+0000}\n" .
"foo\N{U+0000}bar\n" .
"baz\N{U+0000}\n" .
"\N{U+0000}quux\n" .
"last\n";
open my $fh, "<", \$contents or die "$0: open: $!";
while (defined(my $line = <$fh>)) {
next if $line =~ /\N{U+0000}/;
print $line;
}
Output:
$ ./filter-nulls
line 1
last
Perl strings can contain arbitrary data, including NUL characters. Your if only checks for true or false (where "" and "0" are the two false strings, everything else being true including a string containing a single NUL "\x00"). Your chomp only removes the line separator, not NULs. A NUL character is not whitespace, so doesn't match \s.
You can explicitly match a NUL character by specifying it in a regex using octal or hex notation ("\000" or "\x00", respectively).
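For a plain yes/no check, counting with tr is an alternative to a regex match. The following is a sketch only; the helper name lines_without_nul is hypothetical, and it works the same on bytes and on decoded text because NUL is code point 0 in both.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the lines that contain no NUL character.
# tr/\0// with an empty replacement list counts NULs without changing $_.
sub lines_without_nul {
    my @lines = @_;
    return grep { !tr/\0// } @lines;
}

my @kept = lines_without_nul("line 1\n", "foo\0bar\n", "\0\n", "last\n");
print @kept;
```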
I'm seeking a solution to splitting a string which contains text in the following format:
"abcd efgh 'ijklm no pqrs' tuv"
which will produce the following results:
['abcd', 'efgh', 'ijklm no pqrs', 'tuv']
In other words, it splits by whitespace unless inside of a single quoted string. I think it could be done with .NET regexps using "Lookaround" operators, particularly balancing operators. I'm not so sure about Perl.
Use Text::ParseWords:
#!/usr/bin/perl
use strict; use warnings;
use Text::ParseWords;
my @words = parse_line('\s+', 0, "abcd efgh 'ijklm no pqrs' tuv");
use Data::Dumper;
print Dumper \@words;
Output:
C:\Temp> ff
$VAR1 = [
'abcd',
'efgh',
'ijklm no pqrs',
'tuv'
];
You can look at the source code for Text::ParseWords::parse_line to see the pattern used.
use strict; use warnings;
my $text = "abcd efgh 'ijklm no pqrs' tuv 'xwyz 1234 9999' 'blah'";
my @out;
my @parts = split /'/, $text;
for my $i ( 0 .. $#parts ) {
# even-indexed parts lie outside the quotes: split them on whitespace
# (split ' ' also skips leading whitespace); odd-indexed parts were quoted
push @out, $i % 2 ? $parts[$i] : split( ' ', $parts[$i] );
}
use Data::Dumper;
print Dumper \@out;
So you've decided to use a regex? Now you have two problems.
Allow me to infer a little bit. You want an arbitrary number of fields, where a field is either text that contains no spaces, or text that begins and ends with a quote (possibly with spaces in between).
In other words, you want to do what a command line shell does. You really should just reuse something. Failing that, you should capture a field at a time, with a regex something like:
^ *('[^']*'|[^ ]+)(.*)
Where you append group one to your list, and continue the loop with the contents of group 2.
A single pass through a regex wouldn't be able to capture an arbitrarily large number of fields. You might be able to split on a regex (Python will do this, not sure about Perl), but since you are matching the stuff outside the spaces, I'm not sure that is even an option.
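In Perl, the field-at-a-time loop described above can be written with the /g modifier and the \G anchor, so that each match resumes where the previous one ended. This is a minimal sketch rather than a full shell-style parser: the helper name split_quoted is made up, and escaped quotes or quotes embedded in a word are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace, but keep single-quoted runs whole.
sub split_quoted {
    my ($text) = @_;
    my @fields;
    # Try the quoted alternative first; otherwise take a run of characters
    # that are neither whitespace nor a quote. \G continues from the end of
    # the previous match.
    while ( $text =~ /\G\s*(?:'([^']*)'|([^\s']+))/gc ) {
        push @fields, defined $1 ? $1 : $2;
    }
    return @fields;
}

print "[$_]\n" for split_quoted("abcd efgh 'ijklm no pqrs' tuv");
```

For anything beyond simple input, Text::ParseWords remains the safer choice, since it already deals with the escaping rules this sketch ignores.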