(grep) Regex to match non-ASCII characters?

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?

This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
You can also use POSIX-style bracket classes (PCRE and Perl support these; [:ascii:] is an extension and not part of POSIX itself):
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char
[^[:print:]] will probably suffice for you.

No, [^\x20-\x7E] is not ASCII.
This is real ASCII:
[^\x00-\x7F]
Otherwise, it will trim out newlines and other special characters that are part of the ASCII table!
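The difference between the two ranges can be checked byte by byte. Here is a minimal sketch, assuming a POSIX shell with tr and wc; the function names count_non_ascii_bytes and count_non_printable_bytes are made up for this illustration. It deletes the bytes inside each range and counts what is left.

```shell
# Count bytes outside the full 7-bit ASCII range 0x00-0x7F (octal 000-177).
count_non_ascii_bytes() {
    printf '%s' "$1" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# Count bytes outside printable ASCII 0x20-0x7E (octal 040-176).
count_non_printable_bytes() {
    printf '%s' "$1" | LC_ALL=C tr -d '\040-\176' | wc -c | tr -d ' '
}

# "a", TAB, "b", then U+00E9 encoded as the two UTF-8 bytes C3 A9.
sample=$(printf 'a\tb\303\251')
count_non_ascii_bytes "$sample"      # prints 2 (only the two UTF-8 bytes)
count_non_printable_bytes "$sample"  # prints 3 (the TAB is counted as well)
```

The tab is valid ASCII, so only the narrower printable range flags it.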

You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.

You can use this regex:
[^\w \xC0-\xFF]
If your tool asks for options, select Multiline.

[^\x00-\x7F] and [^[:ascii:]] miss some control bytes, so strings can be the better option sometimes. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.

To validate that a text box accepts only ASCII, use this pattern:
[\x00-\x7F]+

I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.

You don't really need a regex.
printf "%s\n" *[!\ -~]*
This will show file names with control characters in their names, too, but I consider that a feature.
If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
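To get the count the question asks for, the same idea can be wrapped in a loop. This is only a sketch under a few assumptions: a POSIX shell with tr, a non-recursive scan, dotfiles ignored, and a made-up function name count_non_ascii_names. Unlike the glob above, it flags only bytes above 0x7F, not control characters.

```shell
# Print how many entries directly inside a directory have a name
# containing at least one byte outside 0x00-0x7F.
count_non_ascii_names() {
    dir=${1:-.}
    count=0
    for path in "$dir"/*; do
        # An unmatched glob stays literal; skip it unless it really exists.
        if [ ! -e "$path" ] && [ ! -L "$path" ]; then
            continue
        fi
        name=${path##*/}
        # Delete every ASCII byte; anything left over is non-ASCII.
        leftover=$(printf '%s' "$name" | LC_ALL=C tr -d '\000-\177')
        if [ -n "$leftover" ]; then
            count=$((count + 1))
        fi
    done
    printf '%s\n' "$count"
}
```

Calling count_non_ascii_names /some/dir prints a single number, and names containing spaces or newlines are handled because each name is processed as one shell word rather than as a line of text.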

This turned out to be very flexible and extensible.
$field =~ s/[^\x00-\x7F]//g;
This removes every non-ASCII character, and the class can be changed to target specific characters instead. It is very nice either for selection or for pre-processing items that will eventually become hash keys.

Related

Why does \d not match digits when using awk?

I'm finding a behaviour I can't really explain with awk. Maybe it's a silly mistake but I can't figure it out.
I have a file called files with some random filenames.
$ cat -e files
3beds.txt$
file4.txt$
file3.txt$
dedo$
file5.txt$
texto5.txt$
metoo.txt$
34lions$
texto2.txt$
file1.txt$
7hello$
summer$
missing$
hello.mundo$
helloWorld.txt$
texto3$
awkvars$
texto4$
yes$
file2.txt$
I want to print only the filenames containing digits. I used the command:
awk '/\d/{print $0}' files
But my result was:
$ awk '/\d/{print $0}' files
3beds.txt
dedo
hello.mundo
helloWorld.txt
I'd really appreciate it if someone can explain to me why those lines are being printed. Thank you!
Clue: the four lines that matched were the four lines that contained "d".
So, clearly \d was being interpreted as a literal "d".
Why? Because awk's regex syntax is POSIX Extended Regular Expressions, not the Perl, PCRE or Ecma you might be used to. So \d does not stand for "digit" as you were expecting. You ended up using a backslash escape to force a literal "d".
The equivalent for \d in awk depends on the semantics you want[1]. [0-9] will match only the ten ASCII digits. You could also use the POSIX character class for digit inside a POSIX Bracket Expression, [[:digit:]]:
When used on strings with non-ASCII characters, the [:digit:] class may include digits in other scripts, depending on the locale.
My quotations are from regular-expressions.info, which has a wealth of info on the many syntaxes. This page took info from that page and turned it into a convenient table that compares 15 of them in great detail.
[1]: Even for regex engines that support the shorthand \d, the semantics can differ:
Since certain character classes are used often, a series of shorthand character classes are available. \d is short for [0-9]. In most flavors that support Unicode, \d includes all digits from all scripts. Notable exceptions are Java, JavaScript, and PCRE. These Unicode flavors match only ASCII digits with \d.
With awk, if you want to print only the lines containing digits, you can use this regexp instead:
awk '/[[:digit:]]/' file
3beds.txt
file4.txt
file3.txt
file5.txt
texto5.txt
34lions
texto2.txt
file1.txt
7hello
texto3
texto4
file2.txt
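For reuse, the digit filter can be wrapped in a small shell function. This is a sketch, and the name lines_with_digits is made up for illustration. It uses [0-9] rather than [[:digit:]] on the assumption that some older awk builds do not implement POSIX character classes.

```shell
# Print only the lines that contain at least one ASCII digit.
# Reads the files given as arguments, or stdin when there are none.
lines_with_digits() {
    awk '/[0-9]/' "$@"
}

printf '%s\n' 3beds.txt dedo file4.txt summer 7hello | lines_with_digits
# prints:
# 3beds.txt
# file4.txt
# 7hello
```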

Perl: replace binary content

I need to replace a binary content with some other content (textual or binary) in Perl.
If the content is textual I can use s/// to replace content.
my $txt = "tomturbo";
$txt =~ s/t/T/g; # TomTurbo
But that is not working if the content is binary.
Is there an easy way to handle that??
Thanks!
Perl's regex engine will work with arbitrary strings.
s/t/T/g
is no different than
s/\x{74}/\x{54}/g
meaning it will change the value of characters from 0x74 to 0x54. The regex engine doesn't care whether value 0x74 means t or something else.
In binary data, character 0x0A doesn't indicate the end of a line, so you'll want to use the s flag, avoid the m flag, and avoid using $.
Furthermore, you'll find the i flag and the built-in character classes (e.g. \d, \p{Letter}, etc.) useless unless you are matching against strings of decoded text (strings of Unicode Code Points), but you may otherwise use any feature of the regex engine.

How to check for and replace non UTF-8 characters in tcl?

What's the best way to search if a given string contains non UTF-8 characters in tcl? Is regexp'ing "^[\x00-\x7f]+$" the only way forward?
I'm trying to write a tcl proc to check if a given variable contains non UTF-8 characters and if it does replace it with "Not supported"
All Tcl's characters are Unicode characters.
OK, that's not helpful. You actually appear to be asking about non-ASCII characters. Supposing you wanted to replace each non-ASCII character with a ?, you might use a regular expression substitution, like this:
regsub -all {[\u0080-\uffff]} $inputString "?" outputString
The key here is that the RE is in braces (virtually always strongly recommended) and that we're using \uXXXX escape sequences (which the RE engine also understands). That'll put many ?s in potentially, but I'm sure you can adjust.

Case insensitive regex for non-English characters

I need to perform a regular expression match on text that includes non-English characters (Spanish, French, German, and Russian).
I want the match to ignore case, so with English characters I would just use the /i modifier, but that doesn't work with words like übermäßig.
What is the simplest way to write a regex that will match both, say, übermäßig and ÜBERMÄßig? And can the same approach be used to convert upper case non-English letters to their lowercase equivalents in Perl?
It works perfectly fine
$ perl -E'use utf8; say "ÜBERMÄẞIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match
$ perl -E'use utf8; say "ÜBERMÄSSIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match
(The use utf8; says the source code is encoded using UTF-8. It would be impossible to have those characters in the script any other way.)
I suspect an encoding problem, meaning you think you gave Perl "ß" when you didn't. It could also be that you're using an older version of Perl that can't handle multi-char folds correctly. Generally speaking, it could help to use /u, but it shouldn't make a difference for this example.
The /i modifier works nicely if the strings use Perl's internal encoding.
For example, this prints "yes":
perl -le 'use utf8; print "yes" if "ÜBERMäßig" =~ /überMÄßiG/i'
The "use utf8" tells Perl that my source code is encoded in UTF-8, and therefore Perl decodes all literal strings in my source code from UTF-8 into its internal encoding. This example will not work without use utf8.
If your strings come from somewhere else then you may need to apply Encode::decode -- or tell your source to generate properly decoded strings (e.g. possible with most DBI drivers).
It works for me. Do you need to use utf8;, maybe?
(Disclaimer: I don't know Perl.)
If you set the locale to the appropriate value in your Perl script, then the /i modifier will work on non-English characters--as will other features like regex matching of word boundaries and the uc and lc functions.
Note that if you need to handle multiple foreign character sets, the linked documentation shows you how to switch locales within your script as needed, using setlocale().
Edit: I should have mentioned that this method is deprecated in most cases. Things should just work with UTF-8. But it can still be useful sometimes.
use locale;
use POSIX qw(locale_h);
setlocale (LC_ALL, $locale{German}) or die "failed to load locale!";

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?
If your regex flavor supports it, you can use \L. For example, with GNU sed (\L in the replacement is a GNU extension, not part of POSIX sed):
sed -r 's/(^.*)/\L\1/'
In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.
If you're using an editor like SublimeText or TextMate[1], there's a good chance you can use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example[2], here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ not as escaping a special character to make it a literal, but as marking the beginning of a special character (as with \s, \w, \b and so forth). So it may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
[1] I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
[2] I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this is the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.
Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*\.m4a)/$1\L$2\E$3/g' *
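Where \L and \E are not available, a similar result can be previewed with awk's tolower(). This sketch does not use \L at all and only prints the new names without renaming anything. It assumes the names have exactly the four "_-_"-separated parts shown above, and the function name lowercase_title is made up.

```shell
# Lowercase the third "_-_"-separated part of each name read from stdin.
# Lines with fewer than four parts are printed unchanged.
lowercase_title() {
    awk -F'_-_' 'BEGIN { OFS = "_-_" } NF >= 4 { $3 = tolower($3) } { print }'
}

printf '%s\n' 'artist_-_album_-_Song_Title_-_MultiCaseHash.m4a' | lowercase_title
# prints artist_-_album_-_song_title_-_MultiCaseHash.m4a
```

Assigning to $3 makes awk rebuild the line with OFS, which is why OFS is set to the same separator.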
In Perl, there's tr, which handles ASCII letters only (it takes plain ranges, so the square brackets are not needed):
$string =~ tr/A-Z/a-z/;
Most Regex implementations allow you to pass a callback function when doing a replace, hence you can simply return a lowercase version of the match from the callback.