Malformed UTF-8 character when matching Non Breaking Space - regex

I am using utf8 in my perl program and I have got the following code line:
$$pstring =~ s/\xA0/ /g;
which should clean out non breaking spaces from the string.
Under Ubuntu 16.04 and perl v5.22.1 this is not an issue, but under Ubuntu 14.04 and v5.18.2 I get this error:
Malformed UTF-8 character (fatal)
Then I inspected the string I was trying to match and found that there were non-breaking spaces in there, which could be deleted by the regex
$$pstring =~ s/[\xC2\xA0]/ /g;
but not with
$$pstring =~ s/\xC2\xA0/ /g;
My question is: what is the difference between the last two (why does it only work with brackets), and is there another way of solving this?

My guess is that you are dealing with a raw, UTF-8 encoded string. You haven't shown how you got it or said why you'd want to do that. A small and complete demonstration program that shows how you get the input, how you change it, and what ultimately complains, would help people find the problem. If you add that small demonstration program to your question I might be able to give a better (or even different) answer.
The non-breaking space has code number U+00A0. Under UTF-8 it encodes to the two octets \xC2 and \xA0. Everything with a code number above U+007F has a multi-octet encoding under UTF-8. Everything at or below U+007F is really just ASCII, so ASCII works as UTF-8.
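To see the encoding described above, here is a minimal sketch using the core Encode module. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00A0 as a single character in a decoded Perl string.
my $nbsp = "\x{00A0}";

# Encode to UTF-8 octets and render each octet as two hex digits.
my $octets = encode('UTF-8', $nbsp);
my $hex    = join ' ', map { sprintf '%02X', ord($_) } split //, $octets;

print length($nbsp), " character, ", length($octets), " octets: $hex\n";
# prints "1 character, 2 octets: C2 A0"

# ASCII characters encode to themselves.
my $ascii_unchanged = encode('UTF-8', 'A') eq 'A' ? 1 : 0;
```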
If you have the UTF-8 encoded text with the non-breaking space and remove just the \xA0 octet, there's a lonely \xC2 left over. Depending on what comes after it, that may be a problem. UTF-8 is designed to recognize where the problem is and correct itself though. It could pick up at the next legally encoded character and leave a substitution character to mark the error. Or, the program can complain and give up.
When you use the character class [\xC2\xA0], I'm guessing that it gets rid of either of those octets anywhere they appear. Since you don't report any other errors, I'm guessing that \xC2 doesn't appear anywhere else. Otherwise, other characters might change. Or, you're dealing with extended ASCII and removing the \xC2 leaves the right Latin-1 encoding. Does the number of substitutions reported by s/// equal the number (or double that) of non-breaking spaces?
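A small sketch of that difference on a raw (undecoded) octet string. The sample input is an assumption, not the asker's data, and this does not reproduce the error reported on v5.18.2. It only shows that s/// returns the number of substitutions, which answers the counting question above.

```perl
use strict;
use warnings;

# Assumed sample: raw UTF-8 octets for "a<NBSP>b", never decoded.
my $raw = "a\xC2\xA0b";

# Character class: each of the two octets is replaced on its own.
my $class       = $raw;
my $class_count = ($class =~ s/[\xC2\xA0]/ /g);

# Sequence: the two-octet pair is replaced as one unit.
my $seq       = $raw;
my $seq_count = ($seq =~ s/\xC2\xA0/ /g);

print "class: $class_count substitutions, sequence: $seq_count\n";
# prints "class: 2 substitutions, sequence: 1"
```

With the character class, every non-breaking space turns into two spaces, so the count is double the number of non-breaking spaces.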
If you have UTF-8 encoded text, read it as UTF-8:
open my $fh, '<:utf8', $filename or die ...
After you've read the data, don't worry about the encoding. Use the code numbers and Perl will figure it out. Or use the code names so future programmers know what you are doing without looking up the character:
$string =~ s/\x{00A0}/ /g;
$string =~ s/\N{NO-BREAK SPACE}/ /g;
When you are done, write it as UTF-8 text:
open my $fh, '>:utf8', $filename or die ...
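The read, change, write cycle can be sketched end to end like this. To keep it self-contained it uses in-memory filehandles and an assumed sample string instead of real files, and it uses the stricter :encoding(UTF-8) layer rather than :utf8; both choices are mine, not part of the answer above.

```perl
use strict;
use warnings;

# Assumed sample: UTF-8 octets for "price:<NBSP>10", standing in for a file.
my $input_octets = "price:\xC2\xA010\n";

# Read: the layer decodes the octets into characters.
open my $in, '<:encoding(UTF-8)', \$input_octets or die "Cannot read: $!";
my $text = do { local $/; <$in> };
close $in;

# Change: work with code points, not octets.
$text =~ s/\x{00A0}/ /g;

# Write: the layer encodes the characters back to UTF-8 octets.
my $output_octets = '';
open my $out, '>:encoding(UTF-8)', \$output_octets or die "Cannot write: $!";
print {$out} $text;
close $out;

print $output_octets;   # prints "price: 10"
```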
The latest Learning Perl has a Unicode primer in the back that covers quite a bit of this.
Good luck!

Related

Case insensitive regex for non-English characters

I need to perform a regular expression match on text that includes non-English characters (Spanish, French, German, and Russian).
I want the match to ignore case, so with English characters I would just use the /i modifier, but that doesn't work with words like übermäßig.
What is the simplest way to write a regex that will match both, say, übermäßig and ÜBERMÄßig? And can the same approach be used to convert upper case non-English letters to their lowercase equivalents in Perl?
It works perfectly fine
$ perl -E'use utf8; say "ÜBERMÄẞIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match
$ perl -E'use utf8; say "ÜBERMÄSSIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match
(The use utf8; says the source code is encoded using UTF-8. It would be impossible to have those characters in the script any other way.)
I suspect an encoding problem, meaning you think you gave Perl "ß" when you didn't. It could also be that you're using an older version of Perl that can't handle multi-char folds correctly. Generally speaking, it could help to use /u, but it shouldn't make a difference for this example.
The /i modifier works nicely if the strings use Perl's internal encoding.
For example, this prints "yes":
perl -le 'use utf8; print "yes" if "ÜBERMäßig" =~ /überMÄßiG/i'
The "use utf8" tells Perl that my source code is encoded in UTF-8, and therefore Perl decodes all literal strings in my source code from UTF-8 into its internal encoding. This example will not work without use utf8.
If your strings come from somewhere else then you may need to apply Encode::decode -- or tell your source to generate properly decoded strings (e.g. possible with most DBI drivers).
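As a rough illustration of why decoding matters, the sketch below matches the same pattern against raw octets and against the decoded string. The input is an assumed sample, the pattern is written with escapes so the example does not depend on the source file's encoding, and the /u flag assumes Perl 5.14 or later.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed external input: UTF-8 octets for "ÜBERMÄßIG".
my $octets = "\xC3\x9CBERM\xC3\x84\xC3\x9FIG";
my $text   = decode('UTF-8', $octets);

# "übermäßig" spelled with code points.
my $pattern = "\x{FC}berm\x{E4}\x{DF}ig";

my $decoded_match = $text   =~ /^$pattern\z/iu ? 1 : 0;
my $raw_match     = $octets =~ /^$pattern\z/iu ? 1 : 0;

print "decoded: $decoded_match, raw octets: $raw_match\n";
# prints "decoded: 1, raw octets: 0"
```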
It works for me. Do you need to use utf8;, maybe?
(Disclaimer: I don't know Perl.)
If you set the locale to the appropriate value in your Perl script, then the /i modifier will work on non-English characters--as will other features like regex matching of word boundaries and the uc and lc functions.
Note that if you need to handle multiple foreign character sets, the linked documentation shows you how to switch locales within your script as needed, using setlocale().
Edit: I should have mentioned that this method is deprecated in most cases. Things should just work with UTF-8. But it can still be useful sometimes.
use locale;
use POSIX qw(locale_h);
setlocale (LC_ALL, $locale{German}) or die "failed to load locale!";

search document for non-ascii

An application on my computer needs to read in a text file. I have several, and one doesn't work; the program fails to read it and tells me that there is a bad character in it somewhere. My first guess is that there's a non-ascii character in there somewhere, but I have no idea how to find it. Perl or any generic regex would be nice. Any ideas?
You can use [^\x20-\x7E] to match any character outside the printable ASCII range (this also flags ASCII control characters such as tab).
e.g. grep -P '[^\x20-\x7E]' suspicious_file
perl -wne 'printf "byte %02X in line $.\n", ord $& while s/[^\t\n\x20-\x7E]//;'
will find every character that is not an ASCII glyphic character, tab, space, or newline.
If it reports 0Ds (carriage-returns) in files that are O.K., then change \t\n to \t\n\r.
If it only reports 0Ds in files that are bad, then you can probably fix those files by running dos2unix on them.
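The same check can be written as a small script. This is only a sketch: find_suspect_bytes is a hypothetical helper name and the sample data is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [line_number, hex_byte] pairs for every byte
# that is not a tab, a newline, or a printable ASCII character.
sub find_suspect_bytes {
    my ($data) = @_;
    my @hits;
    my $line_no = 0;
    for my $line (split /^/, $data) {    # split into lines, keeping newlines
        $line_no++;
        while ($line =~ /([^\t\n\x20-\x7E])/g) {
            push @hits, [ $line_no, sprintf('%02X', ord $1) ];
        }
    }
    return @hits;
}

# Assumed sample: a clean line, then a line with a UTF-8 NBSP and a CR.
my $sample = "ok line\nbad \xC2\xA0 here\r\n";
my @hits   = find_suspect_bytes($sample);

printf "byte %s in line %d\n", $_->[1], $_->[0] for @hits;
```

For this sample it reports C2, A0 and 0D, all in line 2.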
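If you use tabs in your source code as well, try this pattern. Note that it also lets through every control character from backspace (\x08) to \x1F, including tab, newline and carriage return: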
[^\x08-\x7E]
Works also in Notepad++

word boundaries of UTF-8 text in perl

My perl script is provided with a string of characters in UTF-8 which could be in any language. I need to capitalize the first character of each word, and the remaining characters of the word converted to lower case. This must be done while leaving the text in UTF-8 format.
The following seems to work well enough when the text only contains latin characters
$my_string =~ s/([\w']+)/\u\L$1/g;
How can I get this to work in a UTF-8 string?
See perlunicode for an overview of the facilities you need to be familiar with. Basically, you are looking for something like \p{LC}.
Your problem space is not well-defined, though; not all scripts have a concept of character case. The LC property will only match on scripts which do, so it should get you there.
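One possible way to apply the original substitution to UTF-8 input is to decode first, so that \w, \u and \L follow Unicode rules, and encode again afterwards. This is a sketch under that assumption: title_case_utf8 is a hypothetical name and the sample text is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 octets, returns UTF-8 octets.
sub title_case_utf8 {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);   # octets -> characters
    $text =~ s/([\w']+)/\u\L$1/g;           # lowercase the word, then uppercase its first letter
    return encode('UTF-8', $text);          # characters -> octets
}

# Assumed sample: UTF-8 octets for "éCOLE de ÉTÉ".
my $in  = "\xC3\xA9COLE de \xC3\x89T\xC3\x89";
my $out = title_case_utf8($in);
# $out now holds the UTF-8 octets for "École De Été"
```

Scripts without case pass through unchanged, which fits the caveat above.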

How to represent Unicode characters in an ASCII regex pattern?

RegEx flavor: wxRegEx in C++.
One of the strings that I need to match contains characters like '…' (U+2026, Horizontal Ellipsis) which translates to \205 when pasted to Emacs and '»' (U+00BB, Right-Pointing Double Angle Quotation Mark) which remains » when pasted to Emacs (ASCII source code mode).
In the regex pattern itself I tried representing '…' as both \205 and \\205 to no avail.
What is the right way of approaching this problem?
Update: The wxRegEx documentation states that \uwxyz (where wxyz is exactly four hexadecimal digits) represents the Unicode character U+wxyz in the local byte ordering.
I tried that, but for some reason it doesn't work for me (yet).
It depends on the language. In many languages there’s no need to escape non-ASCII, but you may have to tell the compiler what encoding the source is in. For example:
$ javac -encoding UTF-8 SomeThing.java
or
$ perl -Mutf8 somescript
Although with things like Perl, Python, and Ruby, you can put the declaration inside the file, providing it’s upwards compatible with ASCII. For example:
#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use autodie;
my $s = "Où se trouve mon élève?";
if ($s =~ /élève/) { ... }
# although of course this also works fine:
while ($s =~ /\b(\w+)\b/g) {
    print "Found <$1>\n";
}
That’s the easiest way to do it, and I highly recommend it: just put the real UTF-8 characters in your source code. If you have to figure out how to escape things, well, it’s far less convenient.
If you are going to use escapes, well, how you specify non-ASCII symbolically also varies by language. In Java you can use the asquerous Java preprocessor via \uXXXX:
String s = "e\u0301le\u0300ve";
although I do not recommend that way. If it’s going to be used in a pattern, you can delay interpolation, which is cleaner and messier at the same time:
String s = "e\\u0301le\\u0300ve";
That second mechanism spares you from trying to figure out what it is after the Java preprocessor has its way with it (you can’t use \u0022 but can use \\u0022), but then it screws up your Pattern.CANON_EQ flag.
Most other languages have a more straightforward way to do it than Java, which also insists on ugly UTF-16 unless you use javac -encoding UTF-8 for your source. Hardcoding UTF-16 surrogates is absolutely idiotic. Do not do it!!
In Perl you could use:
my $s_nfd = "e\x{301}le\x{300}ve"; # NFD form
my $s_nfc = "\xE9l\xE8ve"; # NFC form
but you can also name them symbolically
use charnames qw< :full >;
my $s_as_NFD = "e\N{COMBINING ACUTE ACCENT}le\N{COMBINING GRAVE ACCENT}ve";
my $s_as_NFC = "\N{LATIN SMALL LETTER E WITH ACUTE}l\N{LATIN SMALL LETTER E WITH GRAVE}ve";
The last one can be made much shorter if you’d prefer:
use charnames qw< :full latin >;
my $s_as_NFC = "\N{e WITH ACUTE}l\N{e WITH GRAVE}ve";
All of those are just about infinitely superior to hardcoding magic numbers into your code.
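The NFD and NFC spellings above are different code point sequences for the same text. A short sketch with the core Unicode::Normalize module (my addition, not something the answer itself uses) shows how to convert between them:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $nfd = "e\x{301}le\x{300}ve";   # base letters plus combining accents
my $nfc = "\xE9l\xE8ve";           # precomposed letters

# Different lengths, and not equal as plain strings.
print length($nfd), " vs ", length($nfc), " code points\n";   # prints "7 vs 5 code points"

# After normalizing to a common form they compare equal.
my $same_as_nfc = NFC($nfd) eq $nfc ? 1 : 0;
my $same_as_nfd = NFD($nfc) eq $nfd ? 1 : 0;
```

If a pattern is written in one form and the data arrives in the other, normalizing both sides first avoids a silent mismatch.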
This all assumes your language supports Unicode, but many do not.

(grep) Regex to match non-ASCII characters?

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?
This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
You can also use the POSIX shorthands:
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char
[^[:print:]] will probably suffice for you.
No, [^\x20-\x7E] is not ASCII.
This is real ASCII:
[^\x00-\x7F]
Otherwise, it will trim out newlines and other special characters that are part of the ASCII table!
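To make the difference between the two classes concrete, here is a minimal sketch that counts names in an assumed list. In practice the names would come from find or readdir; these are made up.

```perl
use strict;
use warnings;

# Assumed sample filenames as raw bytes (two contain UTF-8, one contains a tab).
my @names = ("report.txt", "caf\xC3\xA9.txt", "tab\there.txt", "na\xC3\xAFve.md");

# Truly non-ASCII: at least one byte above 0x7F.
my @non_ascii = grep { /[^\x00-\x7F]/ } @names;

# Outside printable ASCII: this also catches the tab, which is still ASCII.
my @non_printable = grep { /[^\x20-\x7E]/ } @names;

printf "%d non-ASCII, %d outside printable ASCII\n",
    scalar @non_ascii, scalar @non_printable;
# prints "2 non-ASCII, 3 outside printable ASCII"
```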
You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
You can use this regex:
[^\w \xC0-\xFF]
In case you ask, the option is Multiline.
[^\x00-\x7F] and [^[:ascii:]] miss some control bytes, so strings can be the better option sometimes. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.
To validate that a text box accepts ASCII only, use this pattern:
[\x00-\x7F]+
I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.
You don't really need a regex.
printf "%s\n" *[!\ -~]*
This will show file names with control characters in their names, too, but I consider that a feature.
If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
This turned out to be very flexible and extensible.
$field =~ s/[^\x00-\x7F]//g;
Thus all non-ASCII characters, or the specific items in question, can be cleaned out. Very nice either in selection or in pre-processing of items that will eventually become hash keys.