Detecting country and language codes - C++

I need to detect the user's language and country codes in Qt. The codes must match the format specified at http://standards.freedesktop.org/desktop-entry-spec/latest/ar01s04.html.
I've tried QLocale, but countryToString and languageToString return the full country and language names. (I need the short codes, like "en" instead of "English".)
One option is to build a map from QLocale::Language to QString. But is there a faster and simpler way?

See QLocale::name()
Returns the language and country of this locale as a string of the
form "language_country", where language is a lowercase, two-letter ISO 639
language code, and country is an uppercase, two- or three-letter
ISO 3166 country code.
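As a rough illustration of the string shape involved, here is a minimal Python sketch that splits a name of the form lang_COUNTRY.ENCODING@MODIFIER (the form the Desktop Entry specification uses, with everything after lang optional) into its parts. The function name split_locale_name is hypothetical, and the sketch only looks at the string; it does not call Qt.

```python
import re

# lang_COUNTRY.ENCODING@MODIFIER, where every part after lang is optional.
_LOCALE_RE = re.compile(
    r"^(?P<lang>[A-Za-z]{2,3})"
    r"(?:_(?P<country>[A-Za-z0-9]{2,3}))?"
    r"(?:\.(?P<encoding>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def split_locale_name(name):
    """Return (lang, country, encoding, modifier); missing parts are None."""
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"not a lang_COUNTRY style locale name: {name!r}")
    return match.group("lang", "country", "encoding", "modifier")
```

A string such as "en_US", which is the shape QLocale::name() is documented to return, yields "en" and "US" with the remaining two parts set to None.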

In addition to Paul's answer, there are QLocale::uiLanguages() and QLocale::bcp47Name() which should give variations.

To correctly detect the country actually set in the user's preferences (Control Panel/Location on Windows, Preferences/Region on OS X), you can use https://github.com/crystalidea/qt-detect-user-country

Related

Taiwanese language and country codes

I'm a bit uncertain between the two variations below:
zh-cht and zh-tw. This is for a site in Traditional Chinese, aimed mostly at Taiwan but with some presence in Macau and Hong Kong.
So zh-cht and zh-tw seem to represent the same language.
Possibly there are vernacular differences?
But zh-cht seems to be an umbrella for the various vernacular differences?
It's difficult to compare this with Spanish, which seems to have had fewer recent geopolitical upheavals.
For example, es-co is Spanish in Colombia, and no one has to worry about whether that means "Gran Colombia", which would include Ecuador and Venezuela. That geopolitical issue is far behind us: they have officially been different countries for a long time, so there's no ambiguity, and we all know es-co refers to the country of Colombia and the fairly distinct dialect spoken there. No? There is also (googling this more) es-419, which covers the range of Spanish used in Latin America and the Caribbean.
So how does this apply to zh-tw and zh-cht?
Is zh-cht the es-419 of Traditional Chinese?
In case it's useful:
zh-Hant is the correct code.
https://www.w3.org/International/articles/language-tags/
(Thank you andrewJames)
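To make the difference between a script subtag (Hant) and a region subtag (TW, 419) concrete, here is a minimal Python sketch that splits a simple language tag into its language, script and region parts. The function name split_language_tag is hypothetical, and the sketch only handles the language[-script][-region] prefix of a tag; it stops at anything else and is not a full BCP 47 parser.

```python
def split_language_tag(tag):
    """Split a simple language tag into language, script and region subtags.

    Only the language[-script][-region] prefix is handled; parsing stops
    at the first subtag that fits neither slot.
    """
    subtags = tag.split("-")
    result = {"language": subtags[0].lower(), "script": None, "region": None}
    for subtag in subtags[1:]:
        is_script = len(subtag) == 4 and subtag.isalpha()
        is_region = (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        )
        if is_script and result["script"] is None and result["region"] is None:
            result["script"] = subtag.title()   # four letters, e.g. Hant
        elif is_region and result["region"] is None:
            result["region"] = subtag.upper()   # two letters or three digits
        else:
            break
    return result
```

With this split, zh-Hant names a writing system without naming a place, zh-tw names a place without naming a writing system, and es-419 uses a numeric region code rather than a country.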

Identify keywords and commands in natural text

I am trying to build a system which identifies various commands and inputs in written, human-entered text. I'll start with an example to make things clearer. Suppose the user inputs the following text:
My name is John Doe, my age is 28 years old, my address is Barkley Street no. 7 Havana. I like chocolate cake with strawberries and vanilla.
Based on a set of predefined markers (e.g. "name is", "age is", "address is", "I like"), I would like to detect their corresponding value (e.g. "John Doe", "28", "Barkley Street... Havana", "chocolate cake ... vanilla").
My current attempt tackles this with regex patterns: for each marker I built a regex saying something along the lines of "if you find marker X, take all the text between it and any of the X, Y, Z markers you could find". That extracts the text between markers, but building everything on regexes is going to be very cumbersome, especially if I start taking inflection and small variations into account.
I don't have much experience with NLP, so I'm not really sure where I should start for a proper solution. What are some appropriate approaches/solutions/libraries for tackling this problem?
What you are actually trying to do is "information extraction", particularly named entity recognition (NER) to detect the mentions of interest. For an overview, see:
https://en.wikipedia.org/wiki/Information_extraction
To start solving your problem with something approaching the state of the art, I would suggest looking into the Stanford NLP Toolkit (http://nlp.stanford.edu/software/) for your basic NLP tasks (tokenization, POS tagging), but their NER toolkit won't take you very far with your specific requirements. You could try their SPIED tool, but I haven't used it and can't vouch for it. Ultimately, if you are serious about this task (which on the face of it sounds quite hard), you will have to write your own NER system for all the entities you want to extract. You may want to incorporate some of your regular expressions as machine learning features (start with a simple ML library like LibSVM or Mallet), but regardless it will be a lot of work.
Good luck!
If the requirement is to identify named entities such as person, place or organisation, then one could use the StanfordNER library in Python. Additionally, it is possible to train one's own custom entity recognition model using the CRF algorithm in Python. Here is an article explaining the same.
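For reference, the marker-based regex approach described in the question can be sketched in a few lines of Python. This is only a baseline under stated assumptions, not an NER system: the name extract_fields is hypothetical, the marker list comes from the question, and the optional leading "my" that is swallowed before a marker is an assumption added here so that it does not leak into the previous value.

```python
import re

MARKERS = ["name is", "age is", "address is", "I like"]  # from the question


def extract_fields(text, markers=MARKERS):
    """Map each marker found in text to the text that follows it,
    up to the next marker (or the end of the input)."""
    alternation = "|".join(re.escape(m) for m in markers)
    # Assumption: an optional leading "my" belongs to the marker, not the value.
    pattern = re.compile(r"(?:\bmy\s+)?\b(" + alternation + r")\b", re.IGNORECASE)
    hits = list(pattern.finditer(text))
    fields = {}
    for i, hit in enumerate(hits):
        end = hits[i + 1].start() if i + 1 < len(hits) else len(text)
        value = text[hit.end():end]
        # Keys are lowercased markers; surrounding punctuation is trimmed.
        fields[hit.group(1).lower()] = value.strip(" ,.")
    return fields
```

On the example sentence this recovers "John Doe" and the address, but the age comes back as "28 years old" rather than "28", which shows the kind of variation that pushes this problem towards the information extraction techniques mentioned above.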

Error while filtering English-language tweets only

I am extracting tweets written only in English, and I used the following filter:
stream.filter(stall_warnings=True, track=['#brain'], languages=['en'])
Unfortunately, this filter returns a tweet which is a combination of English and some other language.
Please see the tweet here
How can I extract only tweets written entirely in English?
Note: I apologize if linking to someone else's tweet is inappropriate.
Twitter classifies each tweet as one language or another. Its classification isn't always correct, and if a tweet uses multiple languages it is simply assigned to one of them.
So you will need to filter the tweets in your app, either against a dictionary or with a language detection library, to be sure that only English is used in the tweets you receive.
Source: https://blog.twitter.com/2013/introducing-new-metadata-for-tweets
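As a very crude first pass at the client-side filtering suggested above, the sketch below rejects text containing letters from a non-Latin script. This is an assumption-laden stand-in, not real language detection: the name is_latin_only is hypothetical, and the check cannot tell English from other Latin-script languages such as Spanish or French, so a dictionary or a language detection library would still be needed for those.

```python
import unicodedata


def is_latin_only(text):
    """Return True if every alphabetic character in text is a Latin letter.

    Digits, punctuation, hashtags' '#' and emoji are ignored because
    str.isalpha() is False for them.
    """
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True
```

It could be applied to the text of each tweet delivered by the stream before any further processing, dropping those for which it returns False.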

Microsoft sublanguage string to locale identifier

I can't seem to find a way to convert, or find, a locale identifier from a sublanguage string. This site shows the mappings:
http://msdn.microsoft.com/en-us/library/dd318693(v=VS.85).aspx
I want the user to enter a sublanguage string, such as "France (FR)", and to get the locale identifier from this, which in this case would be 0x0484. Or the other way around: if a user enters 0x0480, then to return French (FR).
Has anyone encountered this problem before and can point me in the right direction?
Otherwise I'm going to be writing a few mapping statements to hard code it and maintain future releases if anything changes.
BTW, I'm coding in C++ for Windows platform.
Cheers
A good starting point would be the LCIDToLocaleName function and its opposite, LocaleNameToLCID. Note that these convert between an LCID and an RFC 4646 locale name; to get the human-readable country and language names, use GetLocaleInfoEx with the LOCALE_SENGLISH* constants. If you need localized names instead of English ones, use the LOCALE_SLOCALIZED* constants instead.
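To illustrate what a static LCID table looks like (the hard-coded fallback the question mentions), here is a small Python sketch built on the locale.windows_locale dictionary from Python's standard library. It is not the Windows API described above: the helper names are hypothetical, the table stores underscore-style names such as "fr_FR" rather than RFC 4646 names, and it may be less complete or current than what Windows itself reports.

```python
import locale


def lcid_to_name(lcid):
    """Look up the locale name stored for a Windows LCID, or None."""
    return locale.windows_locale.get(lcid)


def name_to_lcid(name):
    """Reverse lookup: the first LCID whose table entry equals name, or None."""
    for lcid, candidate in locale.windows_locale.items():
        if candidate == name:
            return lcid
    return None
```

For example, lcid_to_name(0x040C) looks up the entry for French (France), and name_to_lcid performs the opposite lookup by scanning the same table.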

Using flags to identify spoken language

In the web app I am building, I need to indicate which language people speak.
I wanted to use flags to do that, but I have some problems.
For example, if you speak French, you can use the French flag. But if you speak English, you can use either the US or UK flag, or a mix of both.
Which flag should I choose for Arabic? The Saudi Arabian flag? Algeria? Morocco?
I think it's usual to use fragments of the language as a kind of graphic (text, instead of flags), for example:
english
français
русский язык
العربية
中文
The answer is not to use flags to identify languages. Not only is there no one-to-one mapping, so you won't cover all languages that way (Kurdish?), but some flags may be controversial (consider the Taiwan flag for Traditional Chinese).
As many other answers stated, it's clearly a bad idea to use flags for languages.
See arguments here: Flag as a symbol of language - stupidity or insult?
Language and nationality are different things. If your English translation is American English, you should use the American flag; for British English, use the flag of England, and so on. There are many dialects of Arabic, so which flag you should use depends on which language or dialect you use.
Did you know that the browser sends a list of the locales the user prefers (the Accept-Language header)? You can choose from that list on your web server to select the one the person likes the most.
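Choosing from the browser's list can be sketched as follows in Python, assuming the list arrives in the standard Accept-Language request header with optional q weights. The function names, the supported set and the "en" default are hypothetical, and the matching is deliberately simple (primary language subtag only).

```python
def parse_accept_language(header):
    """Return the language tags from an Accept-Language header, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unparsable weight as "not wanted"
        if quality > 0:
            # Sort by descending quality, then by original order.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def pick_language(header, supported, default="en"):
    """Pick the first preferred tag whose primary language is supported."""
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0].lower()
        if primary in supported:
            return primary
    return default
```

A header such as "ar-MA,ar;q=0.9,en;q=0.5" with Arabic and English both supported selects Arabic without the site ever having to choose a flag for it.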
You can see here how the Debian project has solved this issue: http://www.debian.org/intro/cn