Amazon Polly: how to read foreign languages with non-ASCII characters - amazon-web-services

So I am creating an app that lets you convert text to a desired language and submit it to AWS Polly. The issue is that when the text contains unusual characters from other languages, such as é or Japanese characters, polly.synthesize_speech doesn't like them and the call fails. How do you submit text for synthesis when it uses non-ASCII characters?
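For reference, Polly itself accepts Unicode text, so one thing worth checking is that the text reaches synthesize_speech as a Unicode string rather than pre-encoded bytes. A minimal boto3 sketch (the voice choice and file name here are assumptions, not taken from the question):

import boto3

# a minimal sketch, assuming boto3 and default AWS credentials;
# Text must be a Unicode string, not a pre-encoded byte string
polly = boto3.client('polly')
response = polly.synthesize_speech(
    Text=u'こんにちは、café au lait',  # non-ASCII text passed as Unicode
    OutputFormat='mp3',
    VoiceId='Mizuki',  # a Japanese voice; pick one matching the language
)
with open('speech.mp3', 'wb') as f:
    f.write(response['AudioStream'].read())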

Related

Trim unprintable characters using Python

I have text content coming in from different languages, such as Chinese and Hebrew. I convert the text to 'en' using the Google Translate API. The problem is that the translator fails when it encounters certain special characters, like \x11 or \x01 (I am unable to display those characters here), and drops that set of records. Please suggest a safe way to do this conversion without dropping records. My current attempt:
data = ''.join(c for c in data if c.isprintable())
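Note that str.isprintable() also rejects newlines and tabs. If those should survive, a sketch using unicodedata to drop only control characters (the function name is mine):

import unicodedata

def strip_control_chars(text):
    # remove control characters (Unicode category Cc, e.g. \x11, \x01)
    # while keeping accented letters, Hebrew, Chinese, and so on;
    # whitelist \n and \t so line structure survives
    return ''.join(c for c in text
                   if c in '\n\t' or unicodedata.category(c) != 'Cc')

print(strip_control_chars(u'\x11Hello\x01 \u05e9\u05dc\u05d5\u05dd'))  # Hello שלום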

U-SQL acute accents

I am new to U-SQL. I'm trying to run some basic queries and I have found a problem with how acute accents are handled.
When my data has acute accents, I get an error and I can't continue. I'm Spanish, so most of the data I work with has acute accents.
Any ideas? Do I need to follow some special encoding protocol?
You are most likely running into an encoding issue.
Please check what encoding the file you are extracting from is in (you can use Notepad++, for example).
E.g., if the file is in some ANSI encoding, you will have to convert it to UTF-8 before uploading it into the Data Lake.
The currently supported encodings are ASCII (which does not support accented characters), UTF-8, and Unicode (UTF-16) LE and BE. Support for ANSI code pages is on our backlog. If you can provide the code page on the following UserVoice item https://feedback.azure.com/forums/327234-data-lake/suggestions/13077555-add-ansi-code-page-support-for-built-in-extractors and vote for it, that would help us prioritize the feature.
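For the conversion step itself, a minimal sketch in Python, assuming the source file is in Windows-1252 (a common ANSI code page for Spanish; the file names and encoding are assumptions):

# re-encode an ANSI (here: Windows-1252) file as UTF-8 before upload
with open('input.csv', 'r', encoding='cp1252') as src:
    text = src.read()
with open('input_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)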

Tokenizing a Japanese string and converting it to hiragana

I am using the string tokenizer and transform APIs to convert kanji characters to hiragana.
The code in the question (What is the replacement for Language Analysis framework's Morpheme analysis deprecated APIs) converts most kanji characters to hiragana, but these APIs fail on kanji words of 3-4 characters.
For example:
a) 現人神 is converted to Latin 'gen ren shen' and hiragana 'げんじんしん',
whereas it should be Latin 'Arahitogami' and hiragana 'あらひとがみ'.
b) 安本丹 is converted to Latin 'an ben dan' and hiragana 'やすもとまこと',
whereas it should be Latin 'Yasumoto makoto' and hiragana 'あんぽんたん'.
My main purpose is to obtain ruby text for given Japanese text. I can't use the Language Analysis framework as it's unavailable in 64-bit.
Any suggestions? Are there other APIs that can perform such string conversion?
So in both cases your API uses on'yomi readings but shouldn't. I assume it just guesses: "3 or more characters? On'yomi should be more appropriate in most cases, so I'll use it." It sounds like your problem needs an actual dictionary, which you can download.
Names (as in b)) will still be a problem, though. I don't see how a computer could reliably get the correct name reading from kanji, as even native Japanese speakers sometimes fail at it. jisho.org doesn't even list a single name reading for 安本丹.
(By the way, you mixed up the hiragana in b) with the Latin for 'あんぽんたん'. I can't write comments yet with my rep, so I'm leaving this here.)
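To illustrate the dictionary-based approach, a sketch in Python assuming the pykakasi package, which ships a reading dictionary instead of guessing on'yomi:

import pykakasi

# dictionary-backed kanji-to-hiragana conversion; proper names such as
# 安本丹 may still come out wrong, as discussed above
kks = pykakasi.kakasi()
for item in kks.convert(u'現人神'):
    print(item['orig'], item['hira'], item['hepburn'])
# with a sufficiently complete dictionary: 現人神 あらひとがみ arahitogami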

hyphen character and apostrophe character - the same ASCII code in different languages?

I need to specify a regex for validation of user input that allows the user to enter a hyphen character or apostrophe character on Windows Desktop operating systems or Mac OS/X desktop operating systems.
The user may have configured for the following languages:
English
French
Spanish
Portuguese
Hawaiian
I want to understand whether, if I use a standard ASCII regex for hyphen and apostrophe (e.g. ['-]), that will catch the hyphen or apostrophe keys typed by the user in most cases. I appreciate my definition is quite loose, as there are many different keyboard layouts, OS versions, and language definitions (e.g. fr_FR, ca_FR).
I have checked the following resources and generally searched on Google, but could not find anything definitive saying that a hyphen key or apostrophe key will always generate ASCII code 45 or ASCII code 39, respectively.
http://en.wikipedia.org/wiki/Keyboard_layout
http://en.wikipedia.org/wiki/Hyphen
http://en.wikipedia.org/wiki/Apostrophe
NOTE: If you feel this question is badly worded, please add a comment to help me improve it.
You're mixing up a couple of things:
keyboard layout is what determines what value gets assigned to a scancode.
localization settings determine in what language you should address the user, and whether the user expects a decimal point or a decimal comma.
character encoding is how a glyph is encoded into bits in memory and, in reverse, how bits are decoded into glyphs.
If you're validating user input, you shouldn't be interested in scancodes. A Dvorak-layout user on a QWERTY keyboard will be pressing the Q key to input a '. And you shouldn't mess with that. So you have no business dealing with keyboard layouts.
The existence of exotic and custom keyboards should remind you that what the keys do is not your headache, but up to the user.
The localization settings will matter to you, but not for your regex. They will, however, tell you in what language you should put your error message in case the user input is invalid. A good coding practice is to use a library like gettext to manage this.
What matters most when you are validating input is just these two things: what is valid, and what the input is.
You (or your domain expert) decide what is valid: whether a hyphen-minus is just as acceptable as a hyphen or an en dash.
The input will be encoded; computers work with bits, not strings of glyphs. It could be ASCII, but I'd steer towards Unicode if I could help it.
As for your real concern, if I may rephrase it: "Can all users easily enter ' and -?" I guess they probably can. Many important programming languages use those glyphs to denote strings and the subtraction operator, respectively. And if your application needs to (dis)allow certain glyphs, you can put Unicode code points or categories in your regex.
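For example, a sketch of such a regex in Python (the accepted set of code points is an assumption; extend it for your domain):

import re

# accept ASCII apostrophe/hyphen plus common Unicode variants users may
# type or paste; accented letters would need additional ranges
VALID = re.compile(u"^[A-Za-z"
                   u"'\u2019"          # U+0027 and U+2019 right single quote
                   u"\\-\u2010\u2013"  # U+002D hyphen-minus, U+2010 hyphen, U+2013 en dash
                   u" ]+$")

print(bool(VALID.match(u"O'Brien")))        # True
print(bool(VALID.match(u'Jean\u2013Luc')))  # True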

Pulling bad chars from an XML feed

I need to figure out a way to remove any non-standard ASCII chars from a feed I am getting from a partner's system.
The issue is they are sending a mix of data: HTML-formatted text and hard returns, along with bad chars.
I am already using the UDF DeMoronize(text) to pull out any Microsoft Latin-1 "extensions" chars.
The feed is coming over in UTF-8, and we have a tag on the page to ensure we are processing the feed as UTF-8.
Is there a simple way to code this to remove any non-UTF-8 chars?
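One simple approach, sketched in Python (the function name is mine): decode the raw bytes as UTF-8 and drop anything that does not decode:

def strip_invalid_utf8(raw):
    # decode the feed, silently discarding byte sequences that are not
    # valid UTF-8; use errors='replace' to keep a visible U+FFFD marker
    return raw.decode('utf-8', errors='ignore')

print(strip_invalid_utf8(b'<item>caf\xc3\xa9\xff</item>'))  # <item>café</item>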