How to represent Unicode characters in an ASCII regex pattern?

RegEx flavor: wxRegEx in C++.
One of the strings that I need to match contains characters like '…' (U+2026, Horizontal Ellipsis) which translates to \205 when pasted to Emacs and '»' (U+00BB, Right-Pointing Double Angle Quotation Mark) which remains » when pasted to Emacs (ASCII source code mode).
In the regex pattern itself I tried representing '…' as both \205 and \\205 to no avail.
What is the right way of approaching this problem?
Update: The wxRegEx documentation states that \uwxyz (where wxyz is exactly four hexadecimal digits) represents the Unicode character U+wxyz in the local byte ordering.
I tried that, but for some reason it doesn't work for me (yet).

It depends on the language. In many languages there’s no need to escape non-ASCII, but you may have to tell the compiler what encoding the source is in. For example:
$ javac -encoding UTF-8 SomeThing.java
or
$ perl -Mutf8 somescript
Although with things like Perl, Python, and Ruby, you can put the declaration inside the file, providing it’s upwards compatible with ASCII. For example:
#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use autodie;
my $s = "Où se trouve mon élève?";
if ($s =~ /élève/) { ... }
# although of course this also works fine:
while ($s =~ /\b(\w+)\b/g) {
print "Found <$1>\n";
}
That’s the easiest way to do it, and I highly recommend it: just put the real UTF-8 characters in your source code. If you have to figure out how to escape things, well, it’s far less convenient.
If you are going to use escapes, well, how you specify non-ASCII symbolically also varies by language. In Java you can use the asquerous Java preprocessor via \uXXXX:
String s = "e\u0301le\u0300ve";
although I do not recommend that way. If it’s going to be used in a pattern, you can delay interpolation, which is cleaner and messier at the same time:
String s = "e\\u0301le\\u0300ve";
That second mechanism spares you from trying to figure out what it is after the Java preprocessor has its way with it (you can’t use \u0022 but can use \\u0022), but then it screws up your Pattern.CANON_EQ flag.
Most other languages have a more straightforward way to do it than Java, which also insists on ugly UTF-16 unless you use javac -encoding UTF-8 for your source. Hardcoding UTF-16 surrogates is absolutely idiotic. Do not do it!!
In Perl you could use:
my $s = "e\x{301}le\x{300}ve"; # NFD form
my $s = "\xE9l\xE8ve"; # NFC form
but you can also name them symbolically
use charnames qw< :full >;
my $s_as_NFD = "e\N{COMBINING ACUTE ACCENT}le\N{COMBINING GRAVE ACCENT}ve";
my $s_as_NFC = "\N{LATIN SMALL LETTER E WITH ACUTE}l\N{LATIN SMALL LETTER E WITH GRAVE}ve";
The last one can be made much shorter if you’d prefer:
use charnames qw< :full latin >;
my $s_as_NFC = "\N{e WITH ACUTE}l\N{e WITH GRAVE}ve";
All of those are just about infinitely superior to hardcoding magic numbers into your code.
This all assumes your language supports Unicode, but many do not.
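The difference between the NFD and NFC spellings above matters when the pattern and the text use different forms. As a minimal sketch in Python (not wxRegEx or Perl; the variable names are only illustrative), the same word can be written with numeric escapes or with symbolic names, and a pattern written in one form only matches text in the other form after normalization:

```python
import re
import unicodedata

# The same word written three ways: decomposed (NFD) escapes,
# precomposed (NFC) escapes, and symbolic character names (NFC).
nfd = "e\u0301le\u0300ve"
nfc = "\u00e9l\u00e8ve"
named = "\N{LATIN SMALL LETTER E WITH ACUTE}l\N{LATIN SMALL LETTER E WITH GRAVE}ve"

same_spelling = (nfc == named)   # names and numbers give the same string
differ_raw = (nfd != nfc)        # NFD and NFC are different code point sequences
equal_after_nfc = (unicodedata.normalize("NFC", nfd) == nfc)

# A pattern written in NFC does not match NFD text until the text is normalized.
pattern = re.compile(nfc)
match_raw = pattern.search(nfd) is not None
match_normalized = pattern.search(unicodedata.normalize("NFC", nfd)) is not None
```

Normalizing both pattern and input to one form before matching is one way to sidestep the issue; whether that is acceptable depends on the application.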

Related

Malformed UTF-8 character when matching Non Breaking Space

I am using utf8 in my perl program and I have got the following code line:
$$pstring =~ s/\xA0/ /g;
which should clean out non breaking spaces from the string.
Under Ubuntu 16.04 and perl v5.22.1 this is not an issue but under Ubuntu 14.04 and v5.18.2 I get this Error:
Malformed UTF-8 character (fatal)
Then I inspected the string I was trying to match and found that there were non-breaking spaces in there, which could be deleted by the regex
$$pstring =~ s/[\xC2\xA0]/ /g;
but not with
$$pstring =~ s/\xC2\xA0/ /g;
My question is : What is the difference between the last two (Why does it only work with brackets) and is there another way of solving this?
My guess is that you are dealing with a raw, UTF-8 encoded string. You haven't shown how you got it or said why you'd want to do that. A small and complete demonstration program that shows how you get the input, how you change it, and what ultimately complains, would help people find the problem. If you add that small demonstration program to your question I might be able to give a better (or even different) answer.
The non-breaking space has code number U+00A0. Under UTF-8 it encodes to the two octets \xC2 and \xA0. Everything with a code number above U+007F has a multi-octet encoding under UTF-8. Everything up to and including U+007F is really just ASCII, so ASCII works as UTF-8.
If you have the UTF-8 encoded text with the non-breaking space and remove just the \xA0 octet, there's a lonely \xC2 left over. Depending on what comes after it, that may be a problem. UTF-8 is designed to recognize where the problem is and correct itself though. It could pick up at the next legally encoded character and leave a substitution character to mark the error. Or, the program can complain and give up.
When you use the character class [\xC2\xA0], I'm guessing that it gets rid of either of those octets anywhere they appear. Since you don't report any other errors, I'm guessing that \xC2 doesn't appear anywhere else. Otherwise, other characters might change. Or, you're dealing with extended ASCII and removing the \xC2 leaves the right Latin-1 encoding. Does the number of substitutions reported by s/// equal the number (or double that) of non-breaking spaces?
If you have UTF-8 encoded text, read it as UTF-8:
open my $fh, '<:utf8', $filename or die ...
After you've read the data, don't worry about the encoding. Use the code numbers and Perl will figure it out. Or use the code names so future programmers know what you are doing without looking up the character:
$string =~ s/\x{00A0}/ /g;
$string =~ s/\N{NO-BREAK SPACE}/ /g;
When you are done, write it as UTF-8 text:
open my $fh, '>:utf8', $filename or die ...
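The same decode, edit, encode cycle can be sketched in Python rather than Perl. This is only an illustration under the assumption that the input really is UTF-8; the sample bytes and variable names are made up for the example. It also shows why removing just the \xA0 octet from the raw bytes leaves invalid UTF-8 behind:

```python
NBSP = "\N{NO-BREAK SPACE}"

# Bytes as they would sit in a UTF-8 encoded file: "a", NBSP, "b".
raw = "a\u00a0b".encode("utf-8")

# Removing only the \xA0 octet leaves a lonely \xC2, which is not valid UTF-8.
broken = raw.replace(b"\xa0", b" ")
try:
    broken.decode("utf-8")
    broken_is_valid = True
except UnicodeDecodeError:
    broken_is_valid = False

# Decode first, replace by character, and encode again on the way out.
text = raw.decode("utf-8")
cleaned = text.replace(NBSP, " ")
out = cleaned.encode("utf-8")
```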
The latest Learning Perl has a Unicode primer in the back that covers quite a bit of this.
Good luck!

word start with uppercase (unicode) in laravel validation [duplicate]

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern:
// unicode letters, apostrophe, hyphen, space
$namePattern = "/^([\\p{L}'\\- ])+$/";
This is eventually passed to a call to preg_match(). As far as I can tell, this works with your vanilla ASCII alphabet, but seems to trip up on spicier characters like Ă or 张.
Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than I think it does?
Or does it have something to do with the way input is being passed in? I'm not sure if it's relevant, but I did make sure to specify a UTF8 encoding on the form page.
I think the problem is much simpler than that: You forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode.
Your regex should be:
// unicode letters, apostrophe, hyphen, space
$namePattern = '/^[-\' \p{L}]+$/u';
If you want to replace an old pattern containing Unicode characters with a new one, write:
$text = preg_replace('/\bold pattern\b/u', 'new pattern', $text);
So the key here is the u modifier.
Note: your server's PHP version should be at least PHP 4.3.5,
as mentioned on php.net under Pattern Modifiers:
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
Thanks to AgreeOrNot, who gave me that key in the question "preg_replace match whole word in arabic".
I tried it and it worked on localhost, but it didn't work on the remote server. Then I found the PHP 4.3.5 note on php.net, upgraded PHP, and it worked.
It's important to know that this method is very helpful for Arabic users (عربي) because, I believe, Unicode is the best encoding for the Arabic language, and the replacement will not work if you don't use the u modifier. The next example should work for you:
$text = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);
First of all, your life would be a lot easier if you'd use single apostrophes instead of double quotes when writing these -- you need only one backslash. Second, combining marks \pM should also be included. If you find a character not matched please find out its Unicode code point and then you can use http://www.fileformat.info/info/unicode/ to figure out where it is. I found http://hsivonen.iki.fi/php-utf8/ an invaluable tool when doing debugging with UTF-8 properties (don't forget to convert to hex before trying to look up: array_map('dechex', utf8ToUnicode($text))).
For example, Ă turns out to be http://www.fileformat.info/info/unicode/char/0102/index.htm and to be in Lu and so L should match it and it does match for me. The other character is http://www.fileformat.info/info/unicode/char/5f20/index.htm and is also isLetter and indeed matches for me. Do you have the Unicode character tables compiled in?
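Looking up a character's code point and general category does not require an online table. A minimal sketch using Python's standard unicodedata module (Python instead of PHP; describe and is_permissive_name are hypothetical helper names, not part of any library) reports the same facts as above and applies the "letters, combining marks, apostrophe, hyphen, space" rule directly:

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

def is_permissive_name(s):
    """Hypothetical validator: letters (L*), combining marks (M*),
    apostrophe, hyphen, and space are allowed; empty input is rejected."""
    return bool(s) and all(
        ch in "'- " or unicodedata.category(ch)[0] in "LM" for ch in s
    )

a_breve = describe("\u0102")   # the character Ă
zhang = describe("\u5f20")     # the character 张
```

Including the M* categories means a decomposed name such as "e" followed by a combining acute accent is accepted as well.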
Anyone else looking here and not getting this to work, please note that /u will not produce consistent results with Unicode scripts across different PHP versions.
See example: https://3v4l.org/4hB9e
Related: Incosistent regex result for Thai characters across different PHP version
<?php preg_match('/[a-zığüşöç]/u',$title) ?>

Perl: replace binary content

I need to replace a binary content with some other content (textual or binary) in Perl.
If the content is textual I can use s/// to replace content.
my $txt = "tomturbo";
$txt =~ s/t/T/g; # TomTurbo
But that is not working if the content is binary.
Is there an easy way to handle that??
Thanks!
Perl's regex engine will work with arbitrary strings.
s/t/T/g
is no different than
s/\x{74}/\x{54}/g
meaning it will change the value of characters from 0x74 to 0x54. The regex engine doesn't care whether value 0x74 means t or something else.
In binary data, character 0x0A doesn't indicate the end of a line, so you'll want to use the s flag, avoid the m flag, and avoid using $.
Furthermore, you'll find the i flag and the built-in character classes (e.g. \d, \p{Letter}, etc) useless unless you are matching against strings of decoded text (strings of Unicode Code Points), but you may otherwise use any feature of the regex engine.
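The same points can be illustrated with byte strings in Python (a sketch in a different language than the Perl above; the sample data is invented). Python's re.DOTALL plays the role of the s flag, letting . match the byte 0x0A like any other byte:

```python
import re

data = b"\x00tomturbo\n\xff"

# A fixed byte sequence can be replaced without a regex at all.
replaced = data.replace(b"\x74", b"\x54")

# With a bytes pattern, DOTALL lets "." match the newline byte 0x0A.
with_dotall = re.sub(rb"o.\xff", b"!", data, flags=re.DOTALL)

# Without DOTALL, "." refuses to match 0x0A, so nothing is replaced.
without_dotall = re.sub(rb"o.\xff", b"!", data)
```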

Case insensitive regex for non-English characters

I need to perform a regular expression match on text that includes non-English characters (Spanish, French, German, and Russian).
I want the match to ignore case, so with English characters I would just use the /i modifier, but that doesn't work with words like übermäßig.
What is the simplest way to write a regex that will match both, say, übermäßig and ÜBERMÄßig? And can the same approach be used to convert upper case non-English letters to their lowercase equivalents in Perl?
It works perfectly fine
$ perl -E'use utf8; say "ÜBERMÄẞIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match
$ perl -E'use utf8; say "ÜBERMÄSSIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match
(The use utf8; says the source code is encoded using UTF-8. It would be impossible to have those characters in the script any other way.)
I suspect an encoding problem, meaning you think you gave Perl "ß" when you didn't. It could also be that you're using an older version of Perl that can't handle multi-char folds correctly. Generally speaking, it could help to use /u, but it shouldn't make a difference for this example.
The /i modifier works nicely if the strings use Perl's internal encoding.
For example, this prints "yes":
perl -le 'use utf8; print "yes" if "ÜBERMäßig" =~ /überMÄßiG/i'
The "use utf8" tells Perl that my source code is encoded in UTF-8, and therefore Perl decodes all literal strings in my source code from UTF-8 into its internal encoding. This example will not work without use utf8.
If your strings come from somewhere else then you may need to apply Encode::decode -- or tell your source to generate properly decoded strings (e.g. possible with most DBI drivers).
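For comparison, here is how the same two cases behave in Python, as a sketch rather than a statement about Perl. Python's re.IGNORECASE handles one-to-one case pairs such as ü/Ü on decoded strings, but it does not perform the multi-character fold ß to ss; str.casefold does. The helper name same_ignoring_case is made up for the example:

```python
import re

pattern = re.compile("übermäßig", re.IGNORECASE)

# One-to-one case pairs (ü/Ü, ä/Ä) match on decoded text.
simple = pattern.fullmatch("ÜBERMÄßig") is not None

# Python's re does no multi-character folding, so SS does not match ß.
multi = pattern.fullmatch("ÜBERMÄSSIG") is not None

def same_ignoring_case(a, b):
    """Compare two strings using full Unicode case folding."""
    return a.casefold() == b.casefold()

lowered = "ÜBERMÄSSIG".lower()
```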
It works for me. Do you need to use utf8;, maybe?
(Disclaimer: I don't know Perl.)
If you set the locale to the appropriate value in your Perl script, then the /i modifier will work on non-English characters--as will other features like regex matching of word boundaries and the uc and lc functions.
Note that if you need to handle multiple foreign character sets, the linked documentation shows you how to switch locales within your script as needed, using setlocale().
Edit: I should have mentioned that this method is deprecated in most cases. Things should just work with UTF-8. But it can still be useful sometimes.
use locale;
use POSIX qw(locale_h);
setlocale (LC_ALL, $locale{German}) or die "failed to load locale!";

(grep) Regex to match non-ASCII characters?

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?
This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
You can also use the POSIX-style bracket classes (note that [:ascii:] is a Perl/PCRE extension rather than a standard POSIX class):
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char
[^[:print:]] will probably suffice for you.
No, [^\x20-\x7E] is not ASCII.
This is real ASCII:
[^\x00-\x7F]
Otherwise, it will trim out newlines and other special characters that are part of the ASCII table!
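The difference between the two ranges is easy to check. A small Python sketch (the sample string is invented for the illustration) shows that [^\x20-\x7E] also catches tabs and newlines, which are ASCII, while [^\x00-\x7F] catches only the non-ASCII character:

```python
import re

non_ascii = re.compile(r"[^\x00-\x7F]")            # anything outside ASCII
non_printable_ascii = re.compile(r"[^\x20-\x7E]")  # anything outside printable ASCII

sample = "tab\there, caf\u00e9\n"

only_non_ascii = non_ascii.findall(sample)
also_controls = non_printable_ascii.findall(sample)

# str.isascii() (Python 3.7+) answers the yes/no question without a regex.
sample_is_ascii = sample.isascii()
```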
You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
You can use this regex:
[^\w \xC0-\xFF]
If asked, the option to set is Multiline.
[^\x00-\x7F] and [^[:ascii:]] miss some control bytes, so the strings utility can be the better option sometimes. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.
To validate that a text box accepts ASCII only, use this pattern:
[\x00-\x7F]+
I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.
You don't really need a regex.
printf "%s\n" *[!\ -~]*
This will show file names with control characters in their names, too, but I consider that a feature.
If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
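To get the count the question asks for, the same "anything outside space through tilde" test can be applied to a list of names. This is a Python sketch, not a replacement for the glob; affected_names is a hypothetical helper, and the names would in practice come from something like os.listdir:

```python
import re

# Same character set as the glob above: anything outside space..tilde.
OUTSIDE_PRINTABLE = re.compile(r"[^ -~]")

def affected_names(names):
    """Return the names containing a non-ASCII or control character."""
    return [name for name in names if OUTSIDE_PRINTABLE.search(name)]

# Invented sample names standing in for a directory listing.
names = ["plain.txt", "caf\u00e9.txt", "tab\tname", "r\u00e9sum\u00e9.pdf"]
affected = affected_names(names)
count = len(affected)
```

Like the glob, this also reports names with control characters in them.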
This turned out to be very flexible and extensible.
$field =~ s/[^\x00-\x7F]//g ; # thus all non ASCII or specific items in question could be cleaned. Very nice either in selection or pre-processing of items that will eventually become hash keys.