Undocumented outputs with aspell spell checking?

I am writing a spell checker for a large set of language files in a big project, using Java and aspell.
The problem is that the application is built in Swedish with an optional English client, so I must run checks for both languages. This works fine with aspell-en and aspell-sv, but one problem appears when checking the Swedish language files.
According to the aspell documentation page: http://aspell.net/0.50-doc/man-html/6_Writing.html
under the "Format of the Data Stream" headline, the possible outputs from a word check are '*', '&' or '#'. For some reason I receive the output '-' for some words from the Swedish aspell checker.
Does anybody have experience with using aspell in languages other than English, and do you know what this output means? It is also worth mentioning that the '-' comes without any explanation or "failure message", so I don't know what it is supposed to signify.
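A minimal sketch of how the result lines could be classified, written in Python for brevity. It assumes the pipe mode follows the ispell `-a` line protocol, whose man page lists '-' as "compound" (a word accepted as a run-together of dictionary words) and '+' as "root" in addition to the three flags named above. Whether that is what the Swedish dictionary triggers here is an assumption, and the function name `classify` is made up for illustration.

```python
def classify(line):
    """Classify one result line from an ispell-style `-a` pipe session.

    Returns (kind, word, suggestions). The '-' and '+' cases come from the
    ispell man page, not from the aspell page cited in the question.
    """
    flag, _, rest = line.partition(" ")
    if flag == "*":
        return ("ok", None, [])
    if flag == "-":
        return ("compound", None, [])      # assumed meaning: run-together word
    if flag == "+":
        return ("root", rest, [])          # found via affix removal
    if flag == "&":
        # "& original count offset: miss, miss, ..."
        head, _, tail = rest.partition(": ")
        return ("miss", head.split(" ")[0], tail.split(", "))
    if flag == "#":
        # "# original offset"
        return ("none", rest.split(" ")[0], [])
    return ("unknown", None, [])
```

Treating "compound" like "ok" would then be a one-line decision in the calling code.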

hyphen character and apostrophe character - the same ASCII code in different languages?

I need to specify a regex for validation of user input that allows the user to enter a hyphen character or apostrophe character on Windows Desktop operating systems or Mac OS/X desktop operating systems.
The user may have configured for the following languages:
English
French
Spanish
Portuguese
Hawaiian
I want to understand whether a standard ASCII regex for hyphen and apostrophe (e.g. ['-]) will catch the hyphen or apostrophe keys typed by the user in most cases. I appreciate my definition is quite loose, as there are many different keyboard layouts, OS versions, and locale definitions (e.g. fr_FR, ca_FR).
I have checked the following resources and searched on Google, but could not find anything stating that the character generated by a hyphen key or apostrophe key will always be ASCII code 45 or ASCII code 39 respectively.
http://en.wikipedia.org/wiki/Keyboard_layout
http://en.wikipedia.org/wiki/Hyphen
http://en.wikipedia.org/wiki/Apostrophe
NOTE: If you feel this question is badly worded, please add a comment to help me improve it.
You're mixing up a couple of things:
The keyboard layout determines what value gets assigned to a scancode.
Localization settings determine in what language you should address the user, and whether the user expects a decimal point or a comma.
Character encoding is how a glyph is encoded into bits in memory and, in reverse, how bits are decoded into glyphs.
If you're validating user input, you shouldn't be interested in scancodes. A DVORAK layout user on a QWERTY keyboard will be pressing the Q key to input an '. And you shouldn't mess with that. So you have no business dealing with keyboard layouts.
The existence of this keyboard should remind you that what the keys do is not your headache, but up to the user.
The localization settings will matter to you, but not for your regex. They will, however, tell you in what language you should put your error message, in case the user input is invalid. A good coding practice is to use a library like gettext to manage this.
What matters most when you are validating input is just two things: what is valid, and what the input is.
You (or your domain expert) decide what is valid, for example whether a hyphen-minus is just as acceptable as a hyphen or an en dash.
The input will be encoded; computers work with bits, not strings of glyphs. It could be ASCII, but I'd steer towards Unicode if I could help it.
As for your real concern, if I may rephrase it: "Can all users easily enter ' and -?". I guess they probably can. Many important programming languages use those glyphs to denote strings and as a subtraction operator, respectively. And if your application needs to (dis)allow certain glyphs, you can put Unicode code points or categories in your regex.
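A minimal sketch of a regex built from explicit Unicode code points, in Python. The set of accepted apostrophe-like and hyphen-like characters below is an assumption for illustration; which ones count as valid is the domain decision described above. The names `APOSTROPHES`, `HYPHENS` and `is_valid` are made up.

```python
import re

# Assumed whitelist; extend or shrink it to match what your domain accepts.
APOSTROPHES = "\u0027\u2019"              # apostrophe, right single quotation mark
HYPHENS = "\u002D\u2010\u2011\u2013"      # hyphen-minus, hyphen, non-breaking hyphen, en dash

_JOINERS = re.escape(APOSTROPHES + HYPHENS)

# One or more letters, optionally followed by (joiner + letters) groups.
# [^\W\d_] means "a Unicode letter". The Hawaiian okina (U+02BB) is itself a
# letter (category Lm), so it is matched as a letter and needs no special case.
WORD_RE = re.compile(rf"[^\W\d_]+(?:[{_JOINERS}][^\W\d_]+)*")


def is_valid(text):
    """Return True if text is letters joined by single apostrophes or hyphens."""
    return WORD_RE.fullmatch(text) is not None
```

With this whitelist, is_valid("O'Brien") and is_valid("Jean-Luc") hold, as does the typographic form "O\u2019Brien", while "a--b" is rejected.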

How to get the same result in C or C++ as toLowerCase in Java or string.lower() in Python?

I need a C or C++ function (or library) that works like String.toLowerCase() in Java.
I know that I can use tolower() for English, but what I need is a function (or library) that covers other languages too. (Specifically, it needs to cover the 9 languages listed below.)
language list
Dutch
English
French
German
Italian
Portuguese
Russian
Spanish
Ukrainian
In addition, the characters in the first line below are the input, and the second line is the expected result.
LINE 1:
AÁÀÂÄĂĀÃÅÆBCĆČÇDEÉÈÊËĚĘFGHℏIÍÌÎÏJKLŁMNŃŇÑOÓÒÔÖÕØŒPQRŘSŚŠŞTŢUÚÙÛÜŪVWXYÝZŹŽΑΔΕΘΛΜΝΠΡΣΣΦΩАБВГҐДЕЁЄЖЗИЙІЇКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
LINE 2:
aáàâäăāãåæbcćčçdeéèêëěęfghℏiíìîïjklłmnńňñoóòôöõøœpqrřsśšştţuúùûüūvwxyýzźžαδεθλμνπρσςφωабвгґдеёєжзийіїклмнопрстуфхцчшщъыьэюя
I verified the results from Java toLowerCase() and Python string.lower(),
and both are correct.
Is there any way to convert to lowercase letters in C or C++?
An important point is that the letters are read from a UTF-8 encoded file!
Please help me. My English is not very good, so please use simple English as much as you can.
I think you will find what you need in the Boost libraries - see http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/conversions.html
Quoting from their website:
Boost.Locale gives powerful tools for development of cross platform
localized software - the software that talks to user in its language.
Provided Features:
Correct case conversion, case folding and normalization.
Collation (sorting), including support for 4 Unicode collation levels.
....
You get the idea, I hope. The function you need is
boost::locale::to_lower(yourUTF8String)
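For comparison, a minimal sketch of the Python reference behaviour the question mentions. The point it illustrates is that the UTF-8 bytes are decoded to text first and only then lowercased, so the conversion works on characters rather than on single bytes (which is where a byte-wise tolower() falls short). The function name `lower_utf8` is made up.

```python
def lower_utf8(raw: bytes) -> str:
    """Decode UTF-8 bytes to text, then lowercase the decoded characters."""
    return raw.decode("utf-8").lower()


# Example with a few letters from the question's input line.
print(lower_utf8("ÀÉŁЁҐ".encode("utf-8")))
```

Any C or C++ solution would need the same two steps: a UTF-8 aware decode, then a Unicode aware case mapping.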

Is there a set of "Lorem ipsums" files for testing character encoding issues?

For layout work we have the famous "Lorem ipsum" text to test how things look.
What I am looking for is a set of files containing text in several different encodings that I can use in my JUnit tests for methods that deal with character encoding when reading text files.
Example:
Take an ISO 8859-1 encoded test file and a Windows-1252 encoded test file. The Windows-1252 file has to exercise the differences in the range 0x80 to 0x9F. In other words, it must contain at least one character from this range to distinguish it from ISO 8859-1.
Maybe the best set of test files is one where the file for each encoding contains all of its characters once. But maybe I am missing something; we all like this encoding stuff, right? :-)
Is there such a set of test-files for character-encoding issues out there?
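If no ready-made set turns up, such files can be generated. The following is a minimal sketch in Python (the same idea carries over to Java's Charset API) that lists every byte an 8-bit encoding defines and finds the bytes on which two encodings disagree. The function names are made up, and the codec names are the ones Python ships.

```python
def defined_chars(encoding):
    """Map each single byte (0..255) to its character, skipping undefined bytes."""
    table = {}
    for b in range(256):
        try:
            table[b] = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            pass  # this byte has no meaning in the encoding
    return table


def distinguishing_bytes(enc_a, enc_b):
    """Return the bytes that decode differently (or only in one) of the two encodings."""
    a, b = defined_chars(enc_a), defined_chars(enc_b)
    return sorted(x for x in range(256) if a.get(x) != b.get(x))


def write_test_file(path, encoding):
    """Write a file containing every byte the encoding defines, once each."""
    with open(path, "wb") as f:
        f.write(bytes(sorted(defined_chars(encoding))))
```

For "cp1252" against "latin-1", distinguishing_bytes returns exactly the range 0x80 to 0x9F described in the example.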
The Wikipedia article on diacritics is pretty comprehensive; unfortunately, you have to extract these characters manually. There may also be mnemonics for each language. For instance, in Polish we use:
Zażółć gęślą jaźń
which contains all 9 Polish diacritics in one correct sentence. Another useful search term is pangrams: sentences that use every letter of the alphabet at least once:
in Spanish, "El veloz murciélago hindú comía feliz cardillo y kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja." (all 27 letters and diacritics).
in Russian, "Съешь же ещё этих мягких французских булок, да выпей чаю" (all 33 Russian Cyrillic alphabet letters).
List of pangrams contains an exhaustive summary. Anyone care to wrap this in a simple:
public interface NationalCharacters {
String spanish();
String russian();
//...
}
library?
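A minimal sketch of how such pangrams could drive encoding tests, in Python rather than Java for brevity. It round-trips two of the sentences above through legacy encodings and shows that decoding with the wrong one produces garbage. The pairing of languages with encodings is an assumption for illustration.

```python
PANGRAMS = {
    "polish": "Zażółć gęślą jaźń",
    "russian": "Съешь же ещё этих мягких французских булок, да выпей чаю",
}

# Assumed legacy encodings per language (Python codec names).
LEGACY = {
    "polish": ["iso8859_2", "cp1250"],
    "russian": ["koi8_r", "cp1251", "iso8859_5"],
}


def round_trips(text, encoding):
    """True if encoding and decoding with the same codec returns the text unchanged."""
    return text.encode(encoding).decode(encoding) == text


def mojibake(text, written_as, read_as):
    """Simulate reading a file with the wrong encoding."""
    return text.encode(written_as).decode(read_as, errors="replace")


for language, encodings in LEGACY.items():
    for enc in encodings:
        print(language, enc, round_trips(PANGRAMS[language], enc))
```

A test for a charset-detecting reader could assert that mojibake(text, "cp1251", "koi8_r") differs from the original text, while the matching pair does not.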
How about trying to use the ICU test suite files? I don't know if they are what you need for your test, but they seem to have pretty complete from/to UTF mapping files at least: Link to the repo for ICU test files
I don't know of any complete text documents, but if you can start with a simple overview of all character sets there are some files available at the ftp.unicode.org server
Here's the Windows-1250 mapping, for example (CP1252.TXT for Windows-1252 sits in the same directory). The first column is the hexadecimal byte value, and the second is the Unicode code point.
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT
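A minimal sketch of a parser for that two-column layout, in Python. The sample lines are typed in by hand in the layout these mapping files use (byte, code point, comment after '#', with undefined bytes left without a code point); the name `parse_mapping` is made up.

```python
# Hand-typed sample in the layout of the vendor mapping files.
SAMPLE = """\
# byte\tUnicode\t#name
0x80\t0x20AC\t#EURO SIGN
0x81\t      \t#UNDEFINED
0x82\t0x201A\t#SINGLE LOW-9 QUOTATION MARK
"""


def parse_mapping(text):
    """Return {byte value: character} for every defined line of a mapping file."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue                           # byte listed but undefined
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return table


print(parse_mapping(SAMPLE))
```

The resulting table can be used to build a test file that contains each defined character exactly once.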
There are a few comprehensive, ready-to-use Unicode test files that can be downloaded directly.
From w3c
Here, there's a nice test file by w3.org including: maths, linguistics, Greek, Georgian, Russian, Thai, Runes, Braille among many others in a single file:
https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html
Coming from w3.org should be nice to use, shouldn't it?
Cutting out the HTML part
If you want to get the "original txt file" without risk of your editor messing it up, 1) download, 2) tail+head it, 3) Check with a diff:
wget https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html
tail -n +8 UTF-8-demo.html | head -n -3 > UTF-8-demo.txt
diff UTF-8-demo.html UTF-8-demo.txt
This generates UTF-8-demo.txt without human intervention and without the risk of losing data.
More from w3c
There are many more files one level up in the directory structure, still inside the dir utf-8-test:
https://www.w3.org/2001/06/utf-8-test/
From github
There's a very interesting file here too with ALL printable chars (including Chinese, Braille, Arabic, etc.)
https://raw.githubusercontent.com/bits/UTF-8-Unicode-Test-Documents/master/UTF-8_sequence_separated/utf8_sequence_0-0x10ffff_assigned_printable.txt
Want non-printable characters too?
There are also many more test files in the same repo:
https://github.com/bits/UTF-8-Unicode-Test-Documents
and also a generator if you don't trust the committed file and you want to generate it by yourself.
My personal choice
I have decided that for my projects I'll start with 2 files: the specific one I pointed out from w3c and the specific one I pointed out from the GitHub repo by bits.
Well, I used an online tool to create my text character sets from Lorem Ipsum. I believe it can help you. I don't have one that has all the different charsets on a single page.
http://generator.lorem-ipsum.info/

How do I work with a C++ program containing non-Latin characters?

I have a C++ program that was written by a Russian-speaking developer, so it contains Cyrillic characters. When I open the sources they are displayed as garbage. How do I solve this on Windows?
The actual problem is your IDE/editor doesn't display Cyrillic characters correctly. You solve this by changing the IDE/editor settings to use a font that contains Cyrillic characters - for example, Courier New if you're on Windows.
Well, assuming they've actually used ISO C and not some weird Russian variant, the language constructs and standard library calls will be in English (or its strange cousin, American).
The only thing you'll really need to convert are the strings (such as for user output or logging), code comments and variable names.
And even the comments and variable names may not have to change. They may make the code harder to understand to a non-Russian reader however.
If the code contains characters that your current editor doesn't understand, well, you need to get yourself an editor that does. Or get your Russian friends to turn it into English for you.
I don't think there is a separate C++ programming language in Russia, so you just need to replace the strings with ones in the other language, i.e. English. Take care when the code processes input, since that is where you may find handling of single characters.
A better approach would be to prepare a localization. You can read all strings from a resource or file. In that case you can select the resource that matches your target language.
If you mean that the strings of the program are written in Russian and you want to add English texts, you need to first internationalize (i18n) your program, using instead of static strings a library like Gettext; then you need to add support for the English locale.
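A minimal sketch of the gettext pattern, shown with Python's standard gettext module (GNU gettext in C++ follows the same idea: wrap each string in a lookup call). The domain name "hypothetical_app" and the "locale" directory are placeholders, and no catalogue is assumed to exist, so the lookup falls back to the original string.

```python
import gettext


def make_translator(language, domain="hypothetical_app", localedir="locale"):
    """Return a lookup function for the given language.

    With fallback=True a missing catalogue does not raise; the returned
    function then hands back the original string unchanged.
    """
    translation = gettext.translation(
        domain, localedir=localedir, languages=[language], fallback=True
    )
    return translation.gettext


_ = make_translator("en")
# Every user-visible string goes through _(), instead of being used directly.
print(_("Файл не найден"))
```

Once an English catalogue is compiled under locale/en/LC_MESSAGES/, the same call would return the translated text without any change to the program code.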
If you mean that the variables and the comments are in Russian and you want them in English, well.. find a translator ;)
Find a translator and give him the code.

How can I detect Russian spam posts with Perl?

I have an English-language forum site written in Perl that is continually bombarded with spam in Russian. Is there a way to detect Russian text using Perl and a regex so I can block it?
You can use the following to detect Cyrillic characters (used in Russian):
[\x{0400}-\x{04FF}]+
If you really just want Russian characters, you can take a look at the aforesaid document, which contains the exact range used for the basic Russian alphabet, [\x{0410}-\x{044F}]. Of course you'd also need to consider the extension Cyrillic characters that are used in Russian (Ё and ё, U+0401 and U+0451, fall outside that range), also mentioned in the document.
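A minimal sketch of the same range check, written in Python for illustration and using the whole Cyrillic block U+0400 to U+04FF. Instead of blocking on a single Cyrillic character, it measures what share of the letters is Cyrillic; the 0.5 threshold and the function names are assumptions, and the text is assumed to be decoded already.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def cyrillic_ratio(text):
    """Fraction of the letters in text that fall in the Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if CYRILLIC.match(c)) / len(letters)


def looks_russian(text, threshold=0.5):
    """Heuristic: mostly-Cyrillic text is treated as Russian."""
    return cyrillic_ratio(text) >= threshold
```

A ratio rather than a plain match keeps an English post that quotes one Russian word from being blocked.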
Using the Unicode Cyrillic range as suggested by JG is fine if everything is encoded as such. However, this is spam, and for the most part it is not. Additionally, spammers very often use a mix of charsets in one message, which further undermines this approach.
I find that the best way (or at least the preliminary step in the process) of detecting Russian spam is to grep for the most commonly used charsets:
koi8-r
windows-1251
iso-8859-5
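A minimal sketch of that grep, in Python for illustration. It assumes the raw post (or its headers) is available as text and simply looks for one of the three charset names above; the function name is made up.

```python
import re

RUSSIAN_CHARSETS = re.compile(
    r"\b(koi8-r|windows-1251|iso-8859-5)\b", re.IGNORECASE
)


def declared_russian_charset(raw_post):
    """Return the first Russian charset name found in the raw post, or None."""
    match = RUSSIAN_CHARSETS.search(raw_post)
    return match.group(1).lower() if match else None
```

For example, a header such as `Content-Type: text/plain; charset="KOI8-R"` yields "koi8-r", while a post declared as UTF-8 yields None and moves on to the next check.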
The next step after that would be to try some language detection algorithms on what remains. If it's a big enough problem, use a paid service such as Google Translate (which also "detects") or Xerox. These services provide, IMO, the best language detection around.