C++ Parsing Library with UTF-8 support

Let's say I want to write a parser for a programming language (the EBNF is already known), and I want it done with as little fuss as possible. I also want to support identifiers made of any UTF-8 letters. And I want it in C++.
flex/bison have non-existent UTF-8 support, from what I've read. ANTLR doesn't seem to have a working C++ output.
I've considered boost::spirit, but they state on their site that it's not actually meant for a full parser.
What else is left? Rolling it entirely by hand?

If you don't find something with the support you want, don't forget that flex is largely independent of the encoding. It lexes an octet stream, and I've used it to lex pure binary data. Something encoded in UTF-8 is an octet stream and can be handled by flex if you're willing to do some of the work manually. I.e., instead of having
idletter [a-zA-Z]
if you want to accept as a letter everything in the Latin-1 Supplement range except NBSP (in other words, the range U+00A1-U+00FF), you have to do something like this (I may have messed up the encoding, but you get the idea):
idletter [a-zA-Z]|\xC2[\xA1-\xBF]|\xC3[\x80-\xBF]
You could even write a preprocessor which does most of the work for you (i.e. replaces \u00A1 with \xC2\xA1 and [\u00A1-\u00FF] with \xC2[\xA1-\xBF]|\xC3[\x80-\xBF]). How much work the preprocessor is depends on how generic you want your input to be; at some point you'd probably be better off integrating the work into flex and contributing it upstream.
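The core step of such a preprocessor is mechanical: split a code-point range on UTF-8 lead-byte boundaries and emit one byte-class alternative per lead byte. Here is a minimal C++ sketch for ranges that fit entirely in two-byte UTF-8 (U+0080 to U+07FF); the function name is made up, and a real tool would also have to handle one-, three- and four-byte sequences.

#include <cstdio>

// Expand a code-point range lying entirely in U+0080..U+07FF (two-byte UTF-8)
// into flex byte-class alternatives, e.g. [U+00A1-U+00FF] becomes
// \xC2[\xA1-\xBF]|\xC3[\x80-\xBF].
void emit_two_byte_range(unsigned lo, unsigned hi) {
    unsigned lead_lo = 0xC0 | (lo >> 6);
    unsigned lead_hi = 0xC0 | (hi >> 6);
    for (unsigned lead = lead_lo; lead <= lead_hi; ++lead) {
        unsigned first = (lead == lead_lo) ? (0x80 | (lo & 0x3F)) : 0x80;
        unsigned last  = (lead == lead_hi) ? (0x80 | (hi & 0x3F)) : 0xBF;
        if (lead != lead_lo) std::printf("|");
        std::printf("\\x%02X[\\x%02X-\\x%02X]", lead, first, last);
    }
    std::printf("\n");
}

int main() { emit_two_byte_range(0xA1, 0xFF); }   // prints \xC2[\xA1-\xBF]|\xC3[\x80-\xBF]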

A parser works with tokens; it's not its duty to know about the encoding. It will usually just compare token IDs, and where you code your special rules you can compare the underlying UTF-8 strings the way you would anywhere else.
So you need a UTF-8 lexer? Well, it highly depends on how you define your problem. If you define your identifiers to consist of ASCII alphanumerics plus anything non-ASCII, then flex will suit your needs just fine. If you want to actually feed Unicode ranges to the lexer, you'll need something more complicated. You could look at Quex. I've never used it myself, but it claims to support Unicode. (Although I would kill somebody for "free tell/seek based on character indices".)
EDIT: Here is a similar question; it claims that flex won't work because of a bug that ignores that char may be signed on some implementations. It may be outdated, though.
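For the first case ("ASCII alphanumerics plus anything non-ASCII"), the classification can be done directly on octets, which is exactly why plain flex copes. A rough C++ sketch of that rule, with made-up helper names, assuming the input is already valid UTF-8:

#include <cstdint>

// "ASCII letters plus any non-ASCII byte" identifier rule, applied
// byte-by-byte to UTF-8 input.
inline bool is_ident_start(std::uint8_t b) {
    return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') || b == '_'
           || b >= 0x80;   // any UTF-8 lead or continuation byte
}

inline bool is_ident_continue(std::uint8_t b) {
    return is_ident_start(b) || (b >= '0' && b <= '9');
}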

Related

Get all available characters?

Sometimes you might have a string/text that is encoded, decrypted, Base64-converted and so on. Most of the time it works, but from time to time you might have missed that some transformation loses a bit of data. This can be hard to find because the truncation happens only with some characters. To make sure that everything works as intended, a unit test that "packs" and "unpacks" the string/text is a good validation, but to be really sure you want a test string that contains all possible characters, no matter the culture.
Is there any way in C# to easily get all available characters into a string, from all cultures?
Regards
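No answer is recorded here, but the underlying idea is language-agnostic: enumerate every Unicode scalar value (skipping the surrogate range) and append its encoded form to one big test string. A rough sketch in C++ producing UTF-8, to keep with the rest of this thread rather than C# as asked; the function name is made up, and note it also includes unassigned code points and ignores normalization issues.

#include <string>

// Append every Unicode scalar value (all code points except surrogates)
// to one UTF-8 test string.
std::string all_scalar_values_utf8() {
    std::string out;
    for (char32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;   // surrogates are not scalar values
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}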

What to replace std::stringstream and boost::format with for std::u16string?

The std::iostream classes lack specializations for char16_t and char32_t, and boost::format depends on streams. What should I replace streams with for UTF-16 strings (preferably with localization support)?
The fundamental entities streams work on are characters, not encoded characters. For Unicode it was decided that one character can be split across multiple such entities, which makes Unicode text inherently incompatible with the stream abstraction.
The addition of the new character types was intended to provide a standard way to deal with Unicode characters, but it was deemed too complex to also redo the behavior of IOStreams and locales to keep up with the added complexities. This is partly due to people not quite loving streams and partly due to it being a large and non-trivial task. I would think that the required facets could be defined to deal with simple situations, but I'm not sure whether this would result in a fast solution or whether it would cover the languages where Unicode is really needed: I can see how it could be made to work for European text, but I don't know whether things would really work for Asian text.
This is good. The encodings argument is over and pretty much settled: you do not want UTF-16 strings anywhere in your program except when communicating with legacy APIs, which is when you convert the whole formatted string, best done by boost::narrow and widen. Unless, of course, you are doing some rare edge-case optimizations.
See http://utf8everywhere.org.
The current streams are implemented as templates (I don't have a copy of the standard here, but I'm pretty sure they have to be), so making them wide-string aware should be a simple matter of instantiating the templates with the appropriate character type.
Most likely your implementation will already have predefined specialisations for wide strings. Have a look for something like std::wstringstream.
That said, the various character types in C++ don't make any assumptions about the encoding of the strings you put in them, so you'd have to handle this by convention - as in, your wide strings are encoded as UTF-16 by convention, but there is nothing in the library that enforces it.
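Following the advice above, one pragmatic pattern is to keep all formatting in plain char-based streams (UTF-8 by convention) and convert to UTF-16 only at the boundary to the legacy API. A rough sketch; the function name is made up, and std::wstring_convert/std::codecvt_utf8_utf16 are deprecated since C++17 but still illustrate the idea.

#include <codecvt>
#include <locale>
#include <sstream>
#include <string>

// Format with ordinary char streams, then convert the finished UTF-8 string
// to UTF-16 at the boundary.
std::u16string format_for_legacy_api(int id, const std::string& utf8_name) {
    std::ostringstream oss;
    oss << "user #" << id << ": " << utf8_name;
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(oss.str());
}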

What are the key decisions to get right when designing a fully Unicode-aware language or library?

Looking at Tom Christiansen's talk
🔫 Unicode Support Shootout
👍 The Good, the Bad, & the (mostly) Ugly 👎
working with text seems to be so incredibly hard that there is no programming language (except Perl 6) which gets it even remotely right.
What are the key design decisions to make in order to have a chance of implementing Unicode support correctly from a clean slate (i.e. with no backward-compatibility requirements)?
What about default file encodings? Which transfer format and normalization form should be used internally and for strings? What about case-mapping and case-folding? What about locale and RTL support? What about regex engines as defined by UTS #18? What should common APIs look like?
EDIT: I'll add more as I think of them.
You need to have no existing code that you must support. A legacy code base that requires everything to be in 8- or 16-bit code units is a royal pain. It makes even libraries awkward when you have to support pre-existing models that don't take this into account.
You have to work with blind people only, so fonts are no issue. :)
You have to follow the Unicode rules for identifier characters and pattern-syntax characters. You should normalize your identifiers internally. If your language is itself written LTR, you may not wish to allow RTL identifiers; it's unclear what's best here.
You need to provide primitives in your language that map to Unicode concepts: instead of just uppercase and lowercase, you need uppercase, titlecase, lowercase, and foldcase (or lc, uc, tc, and fc).
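For comparison, ICU already exposes these four primitives on icu::UnicodeString in C++; a small sketch (the example string and the root-locale choice are arbitrary assumptions):

#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    // "straße" in UTF-8; the sharp s exercises full case mapping and folding.
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("stra\xC3\x9F" "e");

    auto show = [](const icu::UnicodeString& u, const char* label) {
        std::string out;
        u.toUTF8String(out);
        std::cout << label << ": " << out << "\n";
    };

    icu::UnicodeString uc(s); uc.toUpper(icu::Locale::getRoot());  // "STRASSE"
    icu::UnicodeString lc(s); lc.toLower(icu::Locale::getRoot());  // "straße"
    icu::UnicodeString tc(s); tc.toTitle(nullptr);                 // "Straße"
    icu::UnicodeString fc(s); fc.foldCase();                       // "strasse"

    show(uc, "uc"); show(lc, "lc"); show(tc, "tc"); show(fc, "fc");
    return 0;
}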
You need to give full access to the Unicode Character Database, including all character properties, so that the various tech reports' algorithms can be easily built up using them.
You need a clear logical model that is easily extensible to graphemes as needed. Just as people have come to realize a code point interface is vastly more important than a code unit one, you have to be able to deal with graphemes, etc. For example, nobody in their right mind should be forced to rewrite:
printf "%-10.10s", $string;
as this every time:
# This library treats strings as sequences of
# extended grapheme clusters for indexing purposes, etc.
use Unicode::GCString;

my $gcstring = Unicode::GCString->new($string);
my $colwidth = $gcstring->columns();
if ($colwidth > 10) {
    print $gcstring->substr(0, 10);
} else {
    print $gcstring;
    print " " x (10 - $colwidth);   # pad on the right, like %-10.10s
}
You have to do it that way, BTW, because you have to have a notion of print columns, which can be 0 for combining and control characters, or 2 for characters with certain East Asian Width properties, etc. It would be much better if there were no existing printf code, so you could start from scratch and do it right. I have no idea what to do about the widths of RTL scripts.
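In C++ the usual way to get at grapheme clusters today is ICU's character (grapheme) break iterator. A rough sketch that only counts clusters; computing print columns would additionally need the East Asian Width property and the zero-width rules mentioned above.

#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <memory>
#include <string>

// Count extended grapheme clusters in a UTF-8 string
// (not code units, not code points).
int grapheme_count(const std::string& utf8) {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
    if (U_FAILURE(status)) return -1;
    icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);
    bi->setText(text);
    int count = 0;
    bi->first();
    while (bi->next() != icu::BreakIterator::DONE) ++count;
    return count;
}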
The operating system is a pre-existing code-unit library.
You need to not interact with the filesystem namespace, as you have no control over whether filesystem A runs things through NFD (Linux, I believe), filesystem B runs things through NFC (HFS+, nearly), or filesystem C (traditional Unix) doesn't do any normalization at all. Alternately, you might be able to provide an abstraction layer here, with local filters to hide some of that from the user where possible. Operating systems always have code-unit limits, not code-point ones, which is going to annoy you.
Other things with code-unit stipulations include databases that allocate fixed-size records. Fixed sizes just don't work: they're grapheme-hostile and normalization-form-hostile.

Word language detection in C++

After searching on Google I haven't found any standard way or library for detecting which language a particular word is in.
Suppose I have a word - how could I find which language it is: English, Japanese, Italian, German, etc.?
Is there any library available for C++? Any suggestion in this regard will be greatly appreciated!
Simple language recognition from words is easy. You don't need to understand the semantics of the text, and you don't need any computationally expensive algorithms, just a fast hash map. The problem is that you need a lot of data. Fortunately, you can probably find dictionaries of words in each language you care about. Define a bit mask for each language; that will allow you to mark words like "the" as recognized in multiple languages. Then read each language's dictionary into your hash map. If a word is already present from a different language, just mark the current language as well.
Suppose a given word is in both English and French. Then when you look it up, e.g. "commercial" will map to ENGLISH|FRENCH; suppose ENGLISH = 1, FRENCH = 2, ...; you'll find the value 3. If you want to know whether a word is in your language only, you would test:
int langs = dict["the"];
if ((langs | mylang) == mylang)   // parentheses needed: == binds tighter than |
// no other language
Since there will be other languages, probably a more general approach is better.
For each bit set in the looked-up value, add 1 to the corresponding language. Do this for n words. After about n = 10 words, in a typical text, you'll have 10 for the dominant language, maybe 2 for a language that it is related to (like English/French), and you can determine with high probability that the text is English. Remember, even if you have a text that is in one language, it can still have a quote in another, so the mere presence of a foreign word doesn't mean the document is in that language. Pick a threshold; it will work quite well (and very, very fast).
Obviously the hardest thing about this is reading in all the dictionaries. This isn't a code problem, it's a data collection problem. Fortunately, that's your problem, not mine.
To make this fast, you will need to preload the hash map, otherwise loading it up initially is going to hurt. If that's an issue, you will have to write store and load methods for the hash map that block-load the entire thing efficiently.
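A rough sketch of that dictionary-plus-bitmask approach in C++; the word-list file names and the language set are placeholders, and persisting/preloading the map is left out.

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

enum : unsigned { ENGLISH = 1u << 0, FRENCH = 1u << 1, GERMAN = 1u << 2 };
const char* kLangNames[] = { "English", "French", "German" };

using Dict = std::unordered_map<std::string, unsigned>;

// Read one word per line and OR this language's bit into the entry.
void load_wordlist(Dict& dict, const std::string& path, unsigned lang_bit) {
    std::ifstream in(path);
    std::string word;
    while (in >> word)
        dict[word] |= lang_bit;
}

// Vote: each dictionary hit adds one point per language whose bit is set.
std::string guess_language(const Dict& dict, const std::vector<std::string>& words) {
    int counts[3] = {0, 0, 0};
    for (const auto& w : words) {
        auto it = dict.find(w);
        if (it == dict.end()) continue;
        for (int i = 0; i < 3; ++i)
            if (it->second & (1u << i)) ++counts[i];
    }
    int best = 0;
    for (int i = 1; i < 3; ++i)
        if (counts[i] > counts[best]) best = i;
    return kLangNames[best];
}

int main() {
    Dict dict;
    load_wordlist(dict, "english.txt", ENGLISH);   // placeholder word lists
    load_wordlist(dict, "french.txt", FRENCH);
    load_wordlist(dict, "german.txt", GERMAN);
    std::cout << guess_language(dict, {"the", "commercial", "offer"}) << "\n";
    return 0;
}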
I have found Google's CLD very helpful, it's written in C++, and from their web site:
"CLD (Compact Language Detector) is the library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It's implemented in C++, with very basic Python bindings."
Well, statistically trained language detectors work surprisingly well on single-word inputs, though there are obviously some cases where they can't possibly work, as observed by others here.
In Java, I'd send you to Apache Tika. It has an open-source statistical language detector.
For C++, you could use JNI to call it. Now, time for a disclaimer warning: since you specifically asked for C++, and since I'm unaware of a free C++ alternative, I will now point you at a product of my employer, which is a statistical language detector, natively in C++.
http://www.basistech.com, the product name is RLI.
This will not work well one word at a time, as many words are shared. For instance, in several languages "the" means "tea."
Language processing libraries tend to be more comprehensive than just this one feature, and as C++ is a "high-performance" language it might be hard to find one for free.
However, the problem might not be too hard to solve yourself. See the Wikipedia article on the problem for ideas. Also a small support vector machine might do the trick quite handily. Just train it with the most common words in the relevant languages, and you might have a very effective "database" in just a kilobyte or so.
I wouldn't hold my breath. It is difficult enough to determine the language of a text automatically. If all you have is a single word, without context, you would need a database of all the words of all the languages in the world... the size of which would be prohibitive.
Basically you need a huge database of all the major languages. To auto-detect the language of a piece of text, pick the language whose dictionary contains the most words from the text. This is not something you would want to have to implement on your laptop.
Spell-check the first 3 words of your text in all languages (the more words you spell-check, the better). The spelling with the lowest number of spelling errors "wins". With only 3 words it is technically possible to have the same spelling in a few languages, but with each additional word it becomes less probable. It is not a perfect method, but I figure it would work in most cases.
Otherwise, if there is an equal number of errors in all languages, use the default language. Or randomly pick another 3 words until you have a clearer result, or expand the number of spell-checked words to more than 3 until you get a clearer result.
As for spell-checking libraries, there are many; I personally prefer Hunspell. Nuspell is probably also good. It is a matter of personal opinion and/or technical capabilities which one to use.
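As a rough illustration of the spell-check idea using Hunspell's C API: the dictionary paths below are placeholders that depend on your installation, and the "no clear winner" threshold is left out.

#include <hunspell/hunspell.h>
#include <string>
#include <vector>

struct LangDict { const char* name; const char* aff; const char* dic; };

// Return the name of the dictionary that accepts the most of the given words.
std::string detect_by_spelling(const std::vector<std::string>& words,
                               const std::vector<LangDict>& dicts) {
    std::string best = "unknown";
    int best_hits = -1;
    for (const auto& d : dicts) {
        Hunhandle* h = Hunspell_create(d.aff, d.dic);
        int hits = 0;
        for (const auto& w : words)
            if (Hunspell_spell(h, w.c_str()))   // non-zero means "spelled correctly"
                ++hits;
        Hunspell_destroy(h);
        if (hits > best_hits) { best_hits = hits; best = d.name; }
    }
    return best;
}

// Usage sketch (paths are placeholders):
// detect_by_spelling({"guten", "Morgen", "zusammen"},
//     {{"English", "/usr/share/hunspell/en_US.aff", "/usr/share/hunspell/en_US.dic"},
//      {"German",  "/usr/share/hunspell/de_DE.aff", "/usr/share/hunspell/de_DE.dic"}});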
I assume that you are working with text, not with speech.
If you are working with Unicode, then it allocates a block ("slot") of code points to each script.
So you can check whether all the characters of a particular word fall into that script's block.
For more help about Unicode blocks and scripts, you can look here.
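Note that this really identifies the script (Unicode block), which only narrows things down: kana implies Japanese, but Latin letters cover English, Italian, German and many more. A tiny, deliberately incomplete C++ sketch of the idea:

#include <cstdint>

// Map a code point to the script its Unicode block belongs to.
// Ranges are illustrative, not exhaustive.
const char* script_of(char32_t cp) {
    if (cp >= 0x3040 && cp <= 0x30FF) return "Japanese kana";
    if (cp >= 0x4E00 && cp <= 0x9FFF) return "CJK ideograph";
    if (cp >= 0xAC00 && cp <= 0xD7A3) return "Hangul";
    if (cp >= 0x0400 && cp <= 0x04FF) return "Cyrillic";
    if (cp >= 0x0370 && cp <= 0x03FF) return "Greek";
    if (cp >= 0x0590 && cp <= 0x05FF) return "Hebrew";
    if (cp >= 0x0600 && cp <= 0x06FF) return "Arabic";
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z') ||
        (cp >= 0x00C0 && cp <= 0x024F))
        return "Latin";
    return "unknown";
}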

Regular Expressions in Ada?

I'm very new to Ada, and I'm trying to do some simple work with some text. All I want to do is read in a file and strip out anything that isn't a letter, space, or newline - so removing all the punctuation and numbers. In other languages I would just create a simple [^a-zA-Z] regular expression, look at each character and delete it if it matched the regex, but I can't seem to find any documentation on regexes in Ada. So, are there regexes in Ada? If not, what's the best way for me to go about simple text editing like this?
thanks much,
-jb
If you are using the GNAT compiler, there is a set of packages called GNAT.RegExp, GNAT.RegPat and GNAT.Spitbol made for this task.
Beware that GNAT.Spitbol is not standard Perl-style regexp but is based on SNOBOL4 (GNAT.RegPat is closer to the familiar Perl syntax). Either way, it should not be very difficult to convert from one type of regular expression to another.
You may want to go through this example, and just look for the characters you want to ignore and don't put them into the new string.
Which version of Ada are you using?
http://www.adaic.com/docs/95style/html/sec_8/8-4-7.html
I'd probably look at the GNAT SNOBOL stuff in your shoes.
However, there is a project available for general lexical analysis (somewhat like Boost's Spirit) called OpenToken. For slightly more complex tasks, you may find it useful.
I haven't worked with the modern incarnation, but back when I was the lead on it, the project was compiler-agnostic.