How can I change extended latin characters to their unaccented ASCII equivalents? - regex

I need a generic transliteration or substitution regex that will map extended latin characters to similar looking ASCII characters, and all other extended characters to '' (empty string) so that...
é becomes e
ê becomes e
á becomes a
ç becomes c
Ď becomes D
and so on, but things like ‡ or Ω or ‰ just get stripped away.

Use Unicode::Normalize to get the NFD($str). In this form all the characters with diacritics will be turned into a base character followed by a combining diacritic character. Then simply remove all the non-ASCII characters.
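The same decompose-then-strip idea can be sketched outside Perl. Below is a minimal Python version, assuming the input is already a decoded string; the function name strip_to_ascii is made up for this illustration. Note that letters with no decomposition (such as æ) are simply dropped by this approach.

```python
import unicodedata

def strip_to_ascii(text):
    """Decompose accented characters (NFD), then drop everything non-ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    # After NFD, "é" is "e" followed by U+0301; encoding with errors="ignore"
    # discards the combining mark and any other non-ASCII character.
    return decomposed.encode("ascii", "ignore").decode("ascii")

# é ê á ç Ď, written as escapes to keep the source file pure ASCII
print(strip_to_ascii("\u00e9\u00ea\u00e1\u00e7\u010e"))
```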

Maybe a CPAN module might be of help?
Text::Unidecode looks promising, though it does not strip ‡ or Ω or ‰. Rather these are replaced by ++, O and %o. This might or might not be what you want.
Text::Unaccent is another candidate, but it only handles the part about getting rid of the accents.

Text::Unaccent or alternatively Text::Unaccent::PurePerl sounds like what you're asking for, at least the first half of it.
$unaccented = unac_string($charset, $string);
Removing all non-ASCII characters would be relatively simple.
s/[^\000-\177]+//g;

All brilliant answers, but none of them really worked for me. Putting extended characters directly in the source code caused problems when working in terminal windows or various code/text editors across platforms. I was able to try out Unicode::Normalize, Text::Unidecode and Text::Unaccent, but wasn't able to get any of them to do exactly what I want.
In the end I just enumerated all the characters I wanted transliterated myself for UTF-8 (which is the most frequent code page found in my input data).
I needed two extra substitutions to take care of æ and Æ, which I want mapped to two characters.
For interested parties, the final code is (the tr is a single line):
$word =~ tr/\xC0\xC1\xC2\xC3\xC4\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF
\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xE0\xE1\xE2\xE3\xE4
\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF8
\xF9\xFA\xFB\xFC\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoo
oooouuuuyy/;
$word =~ s/\xC6/AE/g;
$word =~ s/\xE6/ae/g;
$word =~ s/[^\x00-\x7F]+//g;
Since things like Ď are not part of UTF-8, they don't occur nearly so often in my input data. For non-UTF-8 input, I chose to just lose everything above 127.
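For comparison, an enumerated table like the one above could look roughly like this in Python. This is only a sketch: it assumes the text has already been decoded into a string, it lists just a subset of the letters, and the names ACCENTED, PLAIN and transliterate are invented here.

```python
# A subset of accented letters (written as escapes) and their ASCII stand-ins.
ACCENTED = "\xc0\xc1\xc2\xc3\xc4\xc5\xc7\xc8\xc9\xca\xcb\xe0\xe1\xe2\xe3\xe4\xe5\xe7\xe8\xe9\xea\xeb"
PLAIN = "AAAAAACEEEEaaaaaaceeee"

table = str.maketrans(ACCENTED, PLAIN)
# One-to-many mappings go in separately, like the two extra s/// substitutions.
table.update({ord("\xc6"): "AE", ord("\xe6"): "ae"})

def transliterate(word):
    """Apply the table, then drop whatever is still above 127."""
    word = word.translate(table)
    return "".join(ch for ch in word if ord(ch) < 128)

print(transliterate("\xc6sop's caf\xe9"))
```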

When I want to translate whole strings, not only single characters, I use this approach:
my %trans = (
'é' => 'e',
'ê' => 'e',
'á' => 'a',
'ç' => 'c',
'Ď' => 'D',
map +($_=>''), qw(‡ Ω ‰)
);
my $re = qr/${ \(join'|', map quotemeta, keys %trans)}/;
s/($re)/$trans{$1}/ge;
If you want something more complicated, you can use functions instead of string constants. With this approach you can do anything you want. But for your case tr should be more efficient:
tr/éêáçĎ/eeacD/;
tr/‡Ω‰//d;
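The lookup-table-plus-alternation technique carries over to other regex engines too. Here is a rough Python sketch of it; the TRANS table and the translate name are illustrative, and the æ entry is added only to show a replacement longer than one character.

```python
import re

# Keys are what to find, values are what to put in their place.
TRANS = {
    "\u00e9": "e", "\u00ea": "e", "\u00e1": "a", "\u00e7": "c", "\u010e": "D",
    "\u00e6": "ae",                             # one character in, two out
    "\u2021": "", "\u03a9": "", "\u2030": "",   # these just disappear
}

# Longest keys first, so a multi-character key is tried before its prefixes.
PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(TRANS, key=len, reverse=True))
)

def translate(text):
    """Replace every occurrence of a key with its mapped value."""
    return PATTERN.sub(lambda m: TRANS[m.group(0)], text)

print(translate("caf\u00e9 \u2021\u010e"))
```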

Related

Wide character in print when involve using special characters

I want to split a long sentence with the dot . character as long as the dot is not wrapped in any kinds of brackets, like (), (), 【】, 〔〕, etc. and there should be at least three words on its left. I use the following code. But it gives the Wide character in print error.
my $a = "hi hello world. ";
$a .= "【 hi hello world. 】";
my @list = split /(?<!\.)(?<=(?:[\w'’]{1,30} ){2}[\w'’]{1,30})\. (?![^()〈〉【】()\[\]〔〕\{\}]*[\))\]〉】〕\}])/, $a;
The expected result would be $a splits into:
hi hello world
and
【 hi hello world. 】
I'm using perl v5.31.3 on macOS Big Sur.
p.s. In the project, I'm also using XML::LibXML::Reader. I'm not sure whether adding use utf8::all; is allowed.
Decode your inputs, encode your outputs
The warning is the result of your attempt to write something other than bytes[1] to a file handle.
You need to encode your outputs, either explicitly, or by adding an encoding layer to the file handle.
use open ':std', ':encoding(UTF-8)';
If your source code is encoded using UTF-8, you need to tell perl that by using use utf8;. Otherwise, it assumes the source code is encoded using ASCII.[2]
If you accept arguments, these are also inputs that need to be decoded. You can use the following:
use Encode qw( decode_utf8 );
@ARGV = map { decode_utf8($_) } @ARGV;
[1] For this purpose, a byte is a value between 0 and 255 inclusive. And since we're talking about printing, we're talking about a character (which is to say string element) with such a value.
[2] Although string and regex literals are 8-bit clean.
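The decode-on-input, encode-on-output rule is not specific to Perl. As a loose illustration in Python (the split on ". " is a stand-in, not the bracket-aware regex from the question, and split_sentences is a made-up name):

```python
def split_sentences(raw):
    """Decode UTF-8 bytes, split on '. ', and re-encode each piece."""
    text = raw.decode("utf-8")       # input: bytes -> characters
    pieces = text.split(". ")        # do the real work on characters
    return [p.encode("utf-8") for p in pieces]  # output: characters -> bytes

raw = "hi hello world. \u3010 hi \u3011".encode("utf-8")
for piece in split_sentences(raw):
    print(piece)
```

Working on the decoded string means a bracket like 【 counts as one character rather than three bytes.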

How can I substitute in strings in Perl 6 by codepoint rather than by grapheme?

I need to remove diacritical marks from a string using Perl 6. I tried doing this:
my $hum = 'חוּם';
$hum.subst(/<-[\c[HEBREW LETTER ALEF] .. \c[HEBREW LETTER TAV]]>/, '', :g);
I am trying to remove all the characters that are not in the range between HEBREW LETTER ALEF (א) and HEBREW LETTER TAV (ת). I'd expected the following code to return "חום", however it returns "חם".
I guess that what happens is that by default Perl 6 works by graphemes, considers וּ to be one grapheme, and removes all of it. It's often sensible to work by graphemes, but in my case I need it to work by codepoints.
I tried to find an adverb that would make it work by codepoint, but couldn't find it. Perhaps there is also a way in Perl 6 to use Unicode properties to exclude diacritics, or to include only letters, but I couldn't find that either.
Thanks!
My regex-fu is weak, so I'd go with a less magical solution.
First, you can remove all marks via samemark:
'חוּם'.samemark('a')
Second, you can decompose the graphemes via .NFD and operate on individual codepoints - eg only keeping values with property Grapheme_Base - and then recompose the string:
Uni.new('חוּם'.NFD.grep(*.uniprop('Grapheme_Base'))).Str
In case of mixed strings, stripping marks from Hebrew characters only could look like this:
$str.subst(:g, /<:Script<Hebrew>>+/, *.Str.samemark('a'));
Here is a simple approach:
my $hum = 'חוּם';
my $min = "\c[HEBREW LETTER ALEF]".ord;
my $max = "\c[HEBREW LETTER TAV]".ord;
my @ords;
for $hum.ords {
@ords.push($_) if $min ≤ $_ ≤ $max;
}
say join('', @ords.map: { .chr });
Output:
חום
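For contrast, here is roughly the same codepoint filter in Python, where strings are already sequences of codepoints rather than graphemes. The function name keep_hebrew_letters is invented for this sketch.

```python
import unicodedata

# Look the range bounds up by name, as the \c[...] escapes do above.
ALEF = ord(unicodedata.lookup("HEBREW LETTER ALEF"))
TAV = ord(unicodedata.lookup("HEBREW LETTER TAV"))

def keep_hebrew_letters(text):
    """Keep only codepoints between ALEF and TAV inclusive."""
    return "".join(ch for ch in text if ALEF <= ord(ch) <= TAV)

# HET, VAV, DAGESH (a combining point), FINAL MEM
hum = "\u05d7\u05d5\u05bc\u05dd"
print(keep_hebrew_letters(hum))
```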

How to extract line numbers from a multi-line string in Vim?

In my opinion, Vimscript does not have a lot of features for manipulating strings.
I often use matchstr(), substitute(), and less often strpart().
Perhaps there is more than that.
For example, what is the best way to remove all text between line numbers in the following string a?
let a = "\%8l............\|\%11l..........\|\%17l.........\|\%20l...." " etc.
I want to keep only the digits and put them in a list:
['8', '11', '17', '20'] " etc.
(Note that the text between line numbers can be different.)
You're looking for split()
echo split(a, '[^0-9]\+')
EDIT:
Given the new constraint: only the numbers from \%d\+l, I'd do:
echo map(split(a, '|'), "matchstr(v:val, '^%\\zs\\d\\+\\zel')")
NB: your Vim variable is incorrectly formatted. To use only one backslash, you'd need to write your string with single quotes; with double quotes, you'd need two backslashes here.
So, with
let b = '\%8l............\|\%11l..........\|\%17l.........\|\%20l....'
it becomes
echo map(split(b, '\\|'), "matchstr(v:val, '^\\\\%\\zs\\d\\+\\zel')")
One can take advantage of the substitute with an expression feature (see
:help sub-replace-\=) to run over all of the target matches, appending them
to a list.
:let l=[] | call substitute(a, '\\%\(\d\+\)l', '\=add(l,submatch(1))[1:0]', 'g')
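If the string ever has to be processed outside Vim, the capture-group idea translates directly. A small Python sketch, assuming the string holds literal backslashes as in the single-quoted variable b above (line_numbers is a hypothetical helper):

```python
import re

def line_numbers(pattern):
    """Return the digits of every \\%<digits>l atom, in order, as strings."""
    # The regex matches a literal backslash, a percent sign, digits, then "l".
    return re.findall(r"\\%(\d+)l", pattern)

b = r"\%8l............\|\%11l..........\|\%17l.........\|\%20l...."
print(line_numbers(b))
```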

regular expression matching all 8 character strings except "00000000"

I am trying to figure out a regular expression which matches any string with 8 symbols, which doesn't equal "00000000".
Can anyone help me?
Thanks
In Perl regexps, at least, you can use a negative lookahead assertion: ^(?!0{8}).{8}$, but personally I'd rather write it like so:
length $_ == 8 and $_ ne '00000000'
Also note that if you do use the regexp, depending on the language you might need a flag to make the dot match newlines as well, if you want that. In perl, that's the /s flag, for "single-line mode".
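To make the comparison concrete, here are both variants sketched in Python, where re.DOTALL plays the role of Perl's /s flag. The function names are made up for illustration.

```python
import re

# The lookahead rejects eight zeros; DOTALL lets "." match a newline as well.
PATTERN = re.compile(r"(?!0{8}).{8}", re.DOTALL)

def is_valid_regex(s):
    """True if s is exactly 8 characters long and not eight zeros (regex)."""
    return PATTERN.fullmatch(s) is not None

def is_valid_plain(s):
    """The same check without a regex."""
    return len(s) == 8 and s != "00000000"

for candidate in ["00000000", "00A00000", "000000001", "010000"]:
    print(candidate, is_valid_regex(candidate), is_valid_plain(candidate))
```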
Unless you are being forced into it for some reason, this is not a regex problem. Just use len(s) == 8 && s != "00000000" or whatever your language uses to compare strings and lengths.
If you need a regex, ^(?!0{8})[A-Za-z0-9]{8}$ will match a string of exactly 8 alphanumeric characters that is not all zeros. Changing the values inside the [] will allow you to set the accepted characters.
As mentioned in the other answers, regular expressions are not the right tool for this task. I suspect it is a homework, thus I'll only hint a solution, instead of stating it explicitly.
The regexp "any 8 symbols except 00000000" may be broken down as a sum of eight regexps in the form "8 symbols with non-zero symbol on the i-th position". Try to write down such an expression and then combine them into one using alternative ("|").
Unless you have unspecified requirements, you really don't need a regular expression for this:
if len(myString) == 8 and myString != "00000000":
...
(in the language of your choice, of course!)
If you need to extract all eight character strings not equal to "00000000" from a larger string, you could use
"(?=.{8})(?!0{8})."
to identify the first character of each sequence and extract eight characters starting with its index.
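A sketch of how that extraction might look in Python, assuming overlapping windows are wanted (the windows helper is a name invented here):

```python
import re

# Matches a single character that starts an 8-character run which is not all zeros.
START = re.compile(r"(?=.{8})(?!0{8}).")

def windows(text):
    """Return every 8-character substring that is not eight zeros."""
    return [text[m.start():m.start() + 8] for m in START.finditer(text)]

print(windows("000000001"))
```

Because the pattern consumes only one character per match, consecutive start positions are all examined and the windows may overlap.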
Of course, one would simply check
if stuff != '00000000'
...
but for the record, one could easily employ
heavyweight regex (in Perl) for that ;-)
...
use re 'eval';
my @strings = qw'00000000 00A00000 10000000 000000001 010000';
my $L = 8;
print map "$_ - ok\n",
grep /^(.{$L})$(??{$^Nne'0'x$L?'':'^$'})/,
@strings;
...
prints
00A00000 - ok
10000000 - ok
go figure ;-)
Regards
rbo
Wouldn't ([1-9]**|\D*){8} do it? Or am I missing something here (which is actually just the inverse of ndim's, which seems like it oughta work).
I am assuming the characters were chosen to include more than digits.
Ok so that was wrong, so Professor Bolo did I get a passing grade? (I love reg expressions so I am really curious).
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '00000000'):
... print 'match'
...
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '10000000'):
... print 'match'
match
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '10011100'):
... print 'match'
match
>>>
That work?

what do I use to match MS Word chars in regEx

I need to find and delete all the non standard ascii chars that are in a string (usually delivered there by MS Word). I'm not entirely sure what these characters are... like the fancy apostrophe and the dual directional quotation marks and all that. Is that unicode? I know how to do it ham-handed [a-z etc. etc.] but I was hoping there was a more elegant way to just exclude anything that isn't on the keyboard.
Probably the best way to handle this is to work with character sets, yes, but for what it's worth, I've had some success with this quick-and-dirty approach, the character class
[\x80-\x9F]
this works because the problem with "Word chars" for me is the ones which are illegal in Unicode, and I've got no way of sanitising user input.
Microsoft apps are notorious for using fancy characters like curly quotes, em-dashes, etc., that require special handling without adding any real value. In some cases, all you have to do is make sure you're using one of their extended character sets to read the text (e.g., windows-1252 instead of ISO-8859-1). But there are several tools out there that replace those fancy characters with their plain-but-universally-supported equivalents. Google for "demoronizer" or "AsciiDammit".
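As a rough illustration of reading the text as windows-1252 and then swapping the fancy punctuation for plain equivalents, here is a Python sketch. The SMART table is a small hand-picked sample, not the list any of those tools use, and plain_punctuation is a made-up name.

```python
# A few windows-1252 "smart" characters and plain ASCII replacements.
SMART = str.maketrans({
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "-",    # en dash, em dash
    "\u2026": "...",                 # horizontal ellipsis
})

def plain_punctuation(raw):
    """Decode bytes as windows-1252, then replace the characters in SMART."""
    # errors="replace" guards against the few byte values cp1252 leaves undefined.
    text = raw.decode("cp1252", errors="replace")
    return text.translate(SMART)

print(plain_punctuation(b"\x93hi\x94 \x96 it\x92s\x85"))
```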
I usually use a JEdit macro that replaces the most common of them with a more ascii-friendly version, i.e.:
hyphens and dashes to minus sign;
suspension dots (single char) to multiple dots;
list item dot to asterisk;
etc.
It is easily adaptable to Word/Openoffice/whatever, and of course modified to suit your needs. I wrote an article on this topic:
http://www.megadix.it/node/138
Cheers
What you are probably looking at are Unicode characters in UTF-8 format. If so, just escape them in your regular expression language.
My solution to this problem is to write a Perl script that gives me all of the characters that are outside of the ASCII range (0 - 127):
#!/usr/bin/perl
use strict;
use warnings;
my %seen;
while (<>) {
for my $character (grep { ord($_) > 127 } split //) {
$seen{$character}++;
}
}
print "saw $_ $seen{$_} times, its ord is ", ord($_), "\n" for keys %seen;
I then create a mapping of those characters to what I want them to be and replace them in the file:
#!/usr/bin/perl
use strict;
use warnings;
my %map = (
chr(128) => "foo",
#etc.
);
while (<>) {
s/([\x{80}-\x{FF}])/exists $map{$1} ? $map{$1} : $1/ge;
print;
}
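The same survey-then-replace workflow can be sketched in Python as well. This assumes the text is already decoded; survey and replace_mapped are illustrative names, and characters missing from the mapping are left untouched.

```python
from collections import Counter

def survey(text):
    """Count every character above the ASCII range (0 to 127)."""
    return Counter(ch for ch in text if ord(ch) > 127)

def replace_mapped(text, mapping):
    """Replace non-ASCII characters found in mapping; leave the rest alone."""
    return "".join(
        mapping.get(ch, ch) if ord(ch) > 127 else ch
        for ch in text
    )

sample = "it\u2019s \u2014 it\u2019s"
for ch, count in survey(sample).items():
    print("saw U+%04X %d times" % (ord(ch), count))
print(replace_mapped(sample, {"\u2019": "'", "\u2014": "-"}))
```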
What I would do is, use AutoHotKey, or python SendKeys or some sort of visual basic that would send me all possible keys (also with shift applied and unapplied) to a Word document.
In SendKeys it would be a script of the form
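chars = ''.join([chr(i) for i in range(ord('a'), ord('z') + 1)])  # range() excludes its end, so add 1 to include 'z'
nums = ''.join([chr(i) for i in range(ord('0'), ord('9') + 1)])
specials = ''.join(['-', '=', '\\', '/', '.', ',', '`'])  # joined into a string so it can be concatenated below
all = chars + nums + specials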
SendKeys.SendKeys("""
{LWIN}
{PAUSE .25}
r
winword.exe{ENTER}
{PAUSE 1}
%(all)s
+(%(all)s)
"testQuotationAndDashAutoreplace"{SPACE}-{SPACE}a{SPACE}{BS 3}{LEFT}{BS}
{Alt}{PAUSE .25}{SHIFT}
changeLanguage
%(all)s
+%(all)s
"""%{'all':all})
Then I would save the document as text, and use it as a database of all displayable keys in your keyboard layout (you might want to replace the default input language more than once to receive absolutely all displayable characters).
If the char is in the result text document - it is displayable, otherwise not. No need for regexp. You can of course afterward embed the characters range within a script or a program.