Taiwanese language and country codes - country-codes

I'm a bit uncertain between the two variations below:
zh-cht and zh-tw - it's for a site in traditional Chinese, mostly in Taiwan, but presence in Maccao and Hong Kong.
So zh-cht and zh-tw seem to represent the same language.
Possibly their are vernacular differences?
But zh-cht - seems to be an umbrella for the various vernacular differences?
If I try to compare to Spanish, it's difficult as it seems Spanish has less recent geopolitical upheavals.
I.e. es-co - is Spanish in Colombia but no one has to worry about whether we are speaking of "Grand Colombia - which would include Ecuador and Venezuela" that geopolitical issue is so far behind us, you know, they are now different countries officially and have been for a long time, so their's no issue so we all know es-co - refers to the country of Colombia and the fairly individual dialect they speak. No? Their is (googling this more) ES-419 which covers a range of Spanish's which is used to describe spanish of Latin America and the Carribean.
So how does this apply to zh-tw and zh-cht?
Is zh-cht the ES-419 of traditional Chinese?

In case it's useful:
zh-Hant is the correct code.
https://www.w3.org/International/articles/language-tags/
(Thank you andrewJames)

Related

NER model to recognize Indian names

I am planning to use Named Entity Recognition (NER) technique to identify person names (most of which are Indian names) from a given text. I have already explored the CRF-based NER model from Stanford NLP, however it is not quite accurate in recognizing Indian names. Hence I decided to create my own custom NER model via supervised training. I have a fair idea of how to create own NER model using the Stanford NER CRF, but creating a large training corpus with manual annotation is something I would like to avoid, as it is a humongous effort for an individual and secondly obtaining diverse people names from different states of India is also a challenge. Could anybody suggest any automation/programmatic way to prepare a labelled training corpus with at least 100k Indian names?
I have already looked into Facebook and LinkedIn API, but did not find a way to extract 100k number of user's full name from a given location (e.g. India).
I ended up doing the following to create NER model to identify Indian names. This may be useful for anybody looking for creating a custom NER model to recognize non-English person names, since most of the publicly available NER models such as the ones from Stanford NLP were trained with English names and hence are more accurate in identifying English (British/American) names.
Find an Indian celebrity with Twitter account and having a huge number of followers in Twitter (for my case, I chose Sachin Tendulkar).
Create a program in the language of your choice to call the Twitter REST API (GET followers/list) to get the names of all the followers of the celebrity and save to a file. We can safely assume most of the followers would be Indians. Note that there is an API Rate Limit in place (30 requests per 15 minute window), so the program should be built in to handle that. For our case, we developed the program as a Windows Service which runs every 15 minutes.
Since some Twitter users' names may not be valid person names, it is advisable to add some rule-based logic (like RegEx) to filter seemingly real names and add only those to the file.
Once the file with real names is generated, create another program to create the training data file containing these names labelled/annotated as PERSON as well as non-entity names annotated as OTHER. If you are using Stanford NER CRF Classifier, the program should generate a training (TSV) file having two columns - one containing the word (token) and the second column mentioning the label.
Once the training corpus is generated programmatically, you can follow the below link to create your custom NER model to recognize Indian names:
http://nlp.stanford.edu/software/crf-faq.shtml#a
This website has done this for us!It provides with the solution for these problems:
Challenges in Indian Language NER
Indian languages belong to several language families, the major ones being the Indo-European languages, Indo-Aryan and the Dravidian languages.
The challenges in NER arise due to several factors. Some of the main factors are listed below
Morphologically rich - identification of root is difficult, require use of morphological analysers
No Capitalization feature - In English, capitalization is one of the main features, whereas that is not there in Indian languages
Ambiguity - ambiguity between common and proper nouns. Eg: common words such as "Roja" meaning Rose flower is a name of a person
Spell variations - In the web data is that we find different people spell the same entity differently - for example : In Tamil person name -Roja is spelt as "rosa", "roja".
The whole corpus is provided.
Named Entity Recognition for Indian Languages and English
Best of luck for getting passwords for the zip files!
cheers!
A proposition: you could try to exploite the India version of Wikipedia for training or to create automatically gazetteer.
I don't know if it is the efficient/quick solution but a lot of research exploits Wikipedia and his semi-structured content (for example, each page is annotated with several categories).
You can have a look at these articles to find an interesting idea for you:
https://scholar.google.fr/scholar?q=named+entity+recognition+using+wikipedia&btnG=&hl=fr&as_sdt=0%2C5

Match browsers set to Scandinavian languages based on "Accept-Language"

Question
I am trying to match browsers set to Scandinavian languages based on HTTP header "Accept-Language".
My regex is:
^(nb|nn|no|sv|se|da|dk).*
My question is if this is sufficient, and if anyone know about any other odd scandinavian (but "valid") language codes or obscure browser bugs causing false positives?
Used for
The regex is used for displaying a english link in the top of the Norwegian web pages (which is the primary language and the root of the domain and sub-domains) that takes you to the English web pages (secondary language and folder under root) when the browser language is not Scandinavian. The link can be closed / "opted-out" with hash stored in JavaScript localStorage if the user don't want to see the link again. We decided not to use IP geo-location because of limited time to implement.
Depending on the language you are working in there may be code in place you can use to parse this easily, e.g. this post: Parse Accept-Language header in Java <-- Also provides a good code example
Further - are you sure you want to limit your regex to the start of the string, as several lanaguages can be provided (the first is intended to be "I prefer x but also accept the following") : http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4
Otherwise your regex should work fine based on the what you were asking and here is a list of all browser language codes: http://www.metamodpro.com/browser-language-codes
I would also - in your shoes, make the "switch to X language" link easy to find for all users until they had opted not to see it again. I would expect many people may have a preference set by default in their browser but find a site actually using it to be unexpected i.e. a user experience like:
I prefer english but don't know enough to change this setting and have never had a reason to before as so few sites make use of it.
That regular expression is enough if you are testing each item in accept-language individually.
If not individually, there are 2 problems:
One of the expected languages could not appear at the beginning of the header, but after.
Some of the expected languages abbreviations could appear as qualifier of a completely different language.

Gettext/Django for german translations: formal/informal salutations

I maintain a pluggable Django app that contains translations. All strings in Python and HTML code are written in English. When translating the strings to German, I'm always fighting with the problem that German differentiates between formal and informal speech (see T–V distinction). Because the app is used on different sites, ranging from a social network to a banking website, I can't just support either the formal or informal version. And since the translations can differ quite a bit, there's no way I can parameterize it. E.g. the sentence "Do you want to log out?" would have these two translations:
Wollen Sie sich abmelden? (formal)
Willst du dich abmelden? (informal)
Is there anything in Gettext that could help me with this?
You can use contextual markers to give your translations additional context.
logout = pgettext('casual', 'Do you want to log out?')
...
logout = pgettext('formal', 'Do you want to log out?')
The best approach, used in other similar situations by gettext as well as UNIX is to use locale variants. For example, sr_RS is (or was, because Serbian is considered a metalanguage these days...) code used for Serbian written in Cyrillic. But it’s sometimes written in Latin script too, so sr_RS#latin is used as the language name and of course, the MO filename or directory as well.
Here, have a look at some translations I have present on my system:
$ find /usr/local/share/locale | grep /sr
/usr/local/share/locale/sr
/usr/local/share/locale/sr/LC_MESSAGES
/usr/local/share/locale/sr/LC_MESSAGES/bash.mo
/usr/local/share/locale/sr/LC_MESSAGES/bfd.mo
/usr/local/share/locale/sr/LC_MESSAGES/binutils.mo
/usr/local/share/locale/sr/LC_MESSAGES/gettext-runtime.mo
/usr/local/share/locale/sr/LC_MESSAGES/gettext-tools.mo
/usr/local/share/locale/sr/LC_MESSAGES/glib20.mo
/usr/local/share/locale/sr/LC_MESSAGES/wget.mo
/usr/local/share/locale/sr#ije
/usr/local/share/locale/sr#ije/LC_MESSAGES
/usr/local/share/locale/sr#ije/LC_MESSAGES/glib20.mo
/usr/local/share/locale/sr#latin
/usr/local/share/locale/sr#latin/LC_MESSAGES
/usr/local/share/locale/sr#latin/LC_MESSAGES/glib20.mo
/usr/local/share/locale/sr_RS
/usr/local/share/locale/sr_RS/LC_MESSAGES
/usr/local/share/locale/sr_RS/LC_MESSAGES/mkvtoolnix.mo
/usr/local/share/locale/sr_RS#latin
/usr/local/share/locale/sr_RS#latin/LC_MESSAGES
/usr/local/share/locale/sr_RS#latin/LC_MESSAGES/mkvtoolnix.mo
$
So the best way to handle German variants is the same: use de (or de_DE) for the base informal variant and have a separate translation file de_DE#formal with the formal variant of the translation.
This is basically what WordPress does too. Of course, being WordPress, they have their own special flavour and don’t use the variant syntax, but instead add a third component to the filename: de_DE.mo is the informal (and also fallback, because it lacks any further specification) variant and de_DE_formal.mo contains the formal variant.

Country and language code detecting

I need to detect user's language and country code in Qt. That codes must be matching standards at http://standards.freedesktop.org/desktop-entry-spec/latest/ar01s04.html.
I've tried QLocale, but it returned full country and language name in countryToString and languageToString. (I need short code, like "en" instead of "English".)
One of the ways is creating map of QLocale::Language and QString. But is there any faster and simpler way?
See QLocale::name()
Returns the language and country of this locale as a string of the
form "language_country", where language is a lowercase, two-letter ISO 639
language code, and country is an uppercase, two- or three-letter
ISO 3166 country code.
In addition to Paul's answer, there are QLocale::uiLanguages() and QLocale::bcp47Name() which should give variations.
When we talk about correct detection of country actually set in user preferences (Control Panel/Location on Windows, Preferences/Region on OS X), you should be using https://github.com/crystalidea/qt-detect-user-country

Using flag to identify spoken language

In the webapp I am doing, I need to identify language people are speaking.
I wanted to use flag to do that. But I have some problems.
For example, if you speak French, you can put the French flag. But if you speak English you can put either the US or UK flag or a mix of both.
Which flag to choose for Arabic language ? Saudi Arabian flag ? Algeria ? Morocco ?
I think it's usual to use fragments of the language as a kind of graphic (text, instead of flags), for example:
english
français
русский язык
العربية
中文
The answer is to not use flags to identify languages. Not only there isn't a one-to-one mapping, and you won't cover all languages that way (Kurdish?), but some flags may be controversial (consider Taiwan flag for Traditional Chinese).
As many other answers stated, it's clearly a bad idea to use flags for languages.
See arguments here: Flag as a symbol of language - stupidity or insult?
Language and nationality are different terms, if your English translation is American English, you should use American flag, for British English use England flag and so on. There are lots of dialects in Arabic so which flag you should use depends on which language/dialect you use.
You know that the browser sends a list of locales that the user likes? And you can choose from them inside your webserver to select the one the person likes the most?
You can see here how the Debian project has solved this issue: http://www.debian.org/intro/cn