How to compare a "basic_string" using an arbitrary locale

I'm re-posting a question I submitted earlier today but I'm now citing a specific example in response to the feedback I received. The original question can be found here (note that it's not a homework assignment):
I'm simply trying to determine if C++ makes it impossible to perform an (efficient) case-INsensitive comparison of a basic_string object that also factors in any arbitrary locale object. For instance, it doesn't appear to be possible to write an efficient function such as the following:
bool AreStringsEqualIgnoreCase(const string &str1, const string &str2, const locale &loc);
Based on my current understanding (but can someone confirm this), this function has to call both ctype::toupper() and collate::compare() for the given locale (extracted as always using use_facet()). However, because collate::compare() in particular requires 4 pointer args, you either need to pass these 4 args for every char you need to compare (after first calling ctype::toupper()), or alternatively, convert both strings to uppercase first and then make a single call to collate::compare().
The 1st approach is obviously inefficient (4 pointers to pass for each char tested), and the 2nd requires you to convert both strings to uppercase in their entirety (requiring allocation of memory and needless copying/converting of both strings to uppercase). Am I correct about this, i.e., is it not possible to do it efficiently (because there's no way around collate::compare())?
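For reference, the second approach described in the question might look roughly like the sketch below. It only restates what the question proposes (copy, upper-case with ctype::toupper(), one call to collate::compare()); it is not a recommended implementation, and per-character upper-casing is inexact for some writing systems.

```cpp
#include <locale>
#include <string>

// Sketch of the "convert both strings, then compare once" approach.
// Assumption: per-character upper-casing is good enough for the data at hand.
bool AreStringsEqualIgnoreCase(const std::string& str1,
                               const std::string& str2,
                               const std::locale& loc)
{
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);

    // Temporary copies: this is the allocation the question objects to.
    std::string a(str1);
    std::string b(str2);
    if (!a.empty()) ct.toupper(&a[0], &a[0] + a.size());
    if (!b.empty()) ct.toupper(&b[0], &b[0] + b.size());

    // A single call that takes the four pointers.
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size()) == 0;
}
```

With std::locale::classic() this treats "Hello" and "hELLO" as equal; results under other locales depend on the facets installed on the platform.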

One of the little annoyances about trying to deal in a consistent way with all the world's writing systems is that practically nothing you think you know about characters is actually correct. This makes it tricky to do things like "case-insensitive comparison". Indeed, it is tricky to do any form of locale-aware comparison, and case-insensitivity is additionally thorny.
With some constraints, though, it is possible to accomplish. The algorithm needed can be implemented "efficiently" using normal programming practices (and precomputation of some static data), but it cannot be implemented as efficiently as an incorrect algorithm. It is often possible to trade off correctness for speed, but the results are not pleasant. Incorrect but fast locale implementations may appeal to those whose locales are implemented correctly, but are clearly unsatisfactory for the part of the audience whose locales produce unexpected results.
Lexicographical ordering doesn't work for human beings
Most locales (other than the "C" locale) for languages which have case already handle letter case in the manner expected, which is to use case differences only after all other differences have been taken into account. That is, if a list of words is sorted in the locale's collation order, then words in the list which differ only in case are going to be consecutive. Whether the words with upper case come before or after words with lower case is locale-dependent, but there won't be other words in between.
That result cannot be achieved by any single-pass left-to-right character-by-character comparison ("lexicographical ordering"). And most locales have other collation quirks which also don't yield to naïve lexicographical ordering.
Standard C++ collation should be able to deal with all of these issues, if you have appropriate locale definitions. But it cannot be reduced to lexicographical comparison just using a comparison function over pairs of wchar_t, and consequently the C++ standard library doesn't provide that interface.
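The interface the standard library does provide compares whole strings. As a minimal sketch, a std::locale object can itself be passed to std::sort as the comparison, because its operator() compares two strings through the locale's collate facet. The function name SortWithLocale is made up for this example.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Sort whole strings using the collation rules of `loc`.
// std::locale::operator() forwards to the locale's collate facet.
std::vector<std::string> SortWithLocale(std::vector<std::string> words,
                                        const std::locale& loc)
{
    std::sort(words.begin(), words.end(), loc);
    return words;
}
```

In the classic "C" locale this is plain byte order, so "Banana" sorts before "apple". With a named locale (for instance std::locale("en_US.UTF-8"), if the platform happens to provide one under that name) the order is whatever that locale's collation data defines.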
The following are just a few examples of why locale-aware collation is complicated; a longer explanation, with a lot more examples, is found in Unicode Technical Standard 10.
Where do the accents go?
Most Romance languages (and also English, when dealing with borrowed words) consider accents over vowels to be a secondary characteristic; that is, words are first sorted as though the accents weren't present, and then a second pass is made in which unaccented letters come before accented letters. A third pass is necessary to deal with case, which is ignored in the first two passes.
But that doesn't work for Northern European languages. The alphabets of Swedish, Norwegian and Danish have three extra vowels, which follow z in the alphabet. In Swedish, these vowels are written å, ä, and ö; in Norwegian and Danish, these letters are written å, æ, and ø, and in Danish å is sometimes written aa, making Aarhus the last entry in an alphabetical list of Danish cities.
In German, the letters ä, ö, and ü are generally alphabetised as with Romance accents, but in German phonebooks (and sometimes other alphabetical lists), they are alphabetised as though they were written ae, oe and ue, which is the older style of writing the same phonemes. (There are many pairs of common surnames, such as "Müller" and "Mueller", which are pronounced the same and are often confused, so it makes sense to intercollate them. A similar convention was used for Scottish names in Canadian phonebooks when I was young; the spellings M', Mc and Mac were all clumped together since they are all phonetically identical.)
One symbol, two letters. Or two letters, one symbol
German also has the symbol ß which is collated as though it were written out as ss, although it is not quite identical phonetically. We'll meet this interesting symbol again a bit later.
In fact, many languages consider digraphs and even trigraphs to be single letters. The 44-letter Hungarian alphabet includes Cs, Dz, Dzs, Gy, Ly, Ny, Sz, Ty, and Zs, as well as a variety of accented vowels. However, the language most commonly referenced in articles about this phenomenon -- Spanish -- stopped treating the digraphs ch and ll as letters in 1994, presumably because it was easier to force Hispanic writers to conform to computer systems than to change the computer systems to deal with Spanish digraphs. (Wikipedia claims it was pressure from "UNESCO and other international organizations"; it took quite a while for everyone to accept the new alphabetization rules, and you still occasionally find "Chile" after "Colombia" in alphabetical lists of South American countries.)
Summary: comparing character strings requires multiple passes, and sometimes requires comparing groups of characters
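To make the "multiple passes" idea concrete, here is a deliberately tiny sketch restricted to ASCII letters. The names AsciiLower and TwoLevelCompare are invented for the example, and the choice that lower case sorts first is arbitrary; real collation uses per-locale weight tables with more levels.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical ASCII-only helper; real collation uses locale weight tables.
inline char AsciiLower(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Returns <0, 0 or >0. Level 1 ignores case; level 2 breaks ties by case.
int TwoLevelCompare(const std::string& a, const std::string& b)
{
    const std::size_t n = std::min(a.size(), b.size());

    // Level 1: compare the letters with case ignored.
    for (std::size_t i = 0; i < n; ++i) {
        const char ca = AsciiLower(a[i]);
        const char cb = AsciiLower(b[i]);
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    if (a.size() != b.size()) return a.size() < b.size() ? -1 : 1;

    // Level 2: same letters, so the first case difference decides.
    // "Lower case first" is an arbitrary choice in this sketch.
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return a[i] == AsciiLower(a[i]) ? -1 : 1;
    }
    return 0;
}
```

Sorting with this comparison keeps "a" and "A" adjacent and both before "b" and "B", which a single left-to-right pass over character codes cannot do.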
Making it all case-insensitive
Since locales handle case correctly in comparison, it should not really be necessary to do case-insensitive ordering. It might be useful to do case-insensitive equivalence-class checking ("equality" testing), although that raises the question of what other imprecise equivalence classes might be useful. Unicode normalization, accent deletion, and even transcription to Latin are all reasonable in some contexts, and highly annoying in others. But it turns out that case conversions are not as simple as you might think, either.
Because of the existence of di- and trigraphs, some of which have Unicode codepoints, the Unicode standard actually recognizes three cases, not two: lower-case, upper-case and title-case. The last is what you use to upper case the first letter of a word, and it's needed, for example, for the Croatian digraph ǆ (U+01C6; a single character), whose uppercase is Ǆ (U+01C4) and whose title case is ǅ (U+01C5). The theory of "case-insensitive" comparison is that we could transform (at least conceptually) any string in such a way that all members of the equivalence class defined by "ignoring case" are transformed to the same byte sequence. Traditionally this is done by "upper-casing" the string, but it turns out that that is not always possible or even correct; the Unicode standard prefers the use of the term "case-folding", as do I.
C++ locales aren't quite up to the job
So, getting back to C++, the sad truth is that C++ locales do not have sufficient information to do accurate case-folding, because C++ locales work on the assumption that case-folding a string consists of nothing more than sequentially and individually upper-casing each codepoint in the string using a function which maps a codepoint to another codepoint. As we'll see, that just doesn't work, and consequently the question of its efficiency is irrelevant. On the other hand, the ICU library has an interface which does case-folding as correctly as the Unicode database allows, and its implementation has been crafted by some pretty good coders so it is probably just about as efficient as possible within the constraints. So I'd definitely recommend using it.
If you want a good overview of the difficulty of case-folding, you should read sections 5.18 and 5.19 of the Unicode standard (PDF for chapter 5). The following are just a few examples.
A case transform is not a mapping from single character to single character
The simplest example is the German ß (U+00DF), which has no upper-case form because it never appears at the beginning of a word, and traditional German orthography didn't use all-caps. The standard upper-case transform is SS (or in some cases SZ) but that transform is not reversible; not all instances of ss are written as ß. Compare, for example, grüßen and küssen (to greet and to kiss, respectively). In v5.1, ẞ, an "upper-case ß", was added to Unicode as U+1E9E, but it is not commonly used except in all-caps street signs, where its use is legally mandated. The normal expectation of upper-casing ß would be the two letters SS.
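A small sketch of what that means for an interface: the mapping has to go from string to string, because one code point can become two. FullUpper is a made-up name, and the "table" here covers only ASCII letters and U+00DF.

```cpp
#include <string>

// Hypothetical upper-casing limited to ASCII letters plus U+00DF.
// The result is a string because one code point may become several.
std::u32string FullUpper(const std::u32string& in)
{
    std::u32string out;
    for (char32_t c : in) {
        if (c == U'\u00DF') {
            out += U"SS";  // sharp s expands to two letters
        } else if (c >= U'a' && c <= U'z') {
            out += static_cast<char32_t>(c - U'a' + U'A');
        } else {
            out += c;      // everything else is left untouched here
        }
    }
    return out;
}
```

For example, the six code points of "straße" come back as the seven code points of "STRASSE", which no char-to-char function such as ctype::toupper() can express.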
Not all ideographs (visible characters) are single character codes
Even when a case transform maps a single character to a single character, it may not be able to express that as a wchar_t→wchar_t mapping. For example, ǰ can easily be capitalized to J̌, but the former is a single combined glyph (U+01F0), while the second is a capital J with a combining caron (U+030C).
There is a further problem with glyphs like ǰ:
Naive character by character case-folding can denormalize
Suppose we upper-case ǰ as above. How do we capitalize ǰ̠ (which, in case it doesn't render properly on your system, is the same character with a bar underneath, another IPA convention)? That combination is U+01F0,U+0320 (j with caron, combining minus sign below), so we proceed to replace U+01F0 with U+004A,U+030C and then leave the U+0320 as is: J̠̌. That's fine, but it won't compare equal to a normalized capital J with caron and minus sign below, because in the normal form the minus sign diacritic comes first: U+004A,U+0320,U+030C (J̠̌, which should look identical). So sometimes (rarely, to be honest, but sometimes) it is necessary to renormalize.
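The reordering step of normalization can be sketched as follows: every run of combining marks is stably sorted by canonical combining class (220 for U+0320, 230 for U+030C). The two-entry class table is an assumption made to keep the example small; a real implementation takes the classes from the Unicode Character Database, or simply calls a library such as ICU.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Canonical combining classes for the two marks used in the text.
// Assumption: a real implementation reads these from the Unicode database.
inline int CombiningClass(char32_t c)
{
    switch (c) {
        case 0x0320: return 220;  // COMBINING MINUS SIGN BELOW
        case 0x030C: return 230;  // COMBINING CARON
        default:     return 0;    // treated as a starter in this sketch
    }
}

// Stable-sort each run of combining marks by combining class.
std::u32string CanonicalReorder(std::u32string s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        if (CombiningClass(s[i]) == 0) { ++i; continue; }
        std::size_t j = i;
        while (j < s.size() && CombiningClass(s[j]) != 0) ++j;
        std::stable_sort(s.begin() + static_cast<std::ptrdiff_t>(i),
                         s.begin() + static_cast<std::ptrdiff_t>(j),
                         [](char32_t a, char32_t b) {
                             return CombiningClass(a) < CombiningClass(b);
                         });
        i = j;
    }
    return s;
}
```

Applied to U+004A,U+030C,U+0320 this yields U+004A,U+0320,U+030C, the order in which the two strings compare equal code point by code point.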
Leaving aside Unicode weirdness, sometimes case-conversion is context-sensitive
Greek has a lot of examples of how marks get shuffled around depending on whether they are word-initial, word-final or word-interior -- you can read more about this in chapter 7 of the Unicode standard -- but a simple and common case is Σ, which has two lower-case versions: σ and ς. Non-Greeks with some maths background are probably familiar with σ, but might not be aware that it cannot be used at the end of a word, where you must use ς.
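The sigma rule can be sketched with a lower-casing function that looks at its neighbours. This is a simplification of the condition Unicode calls Final_Sigma: it only handles the basic Greek letters and treats "preceded by a Greek letter and not followed by one" as word-final. The helper names are invented for the example.

```cpp
#include <cstddef>
#include <string>

// Basic Greek capitals U+0391..U+03A9 (U+03A2 is unassigned).
inline bool IsGreekCapital(char32_t c)
{
    return c >= 0x0391 && c <= 0x03A9 && c != 0x03A2;
}

// Basic Greek small letters U+03B1..U+03C9, including final sigma U+03C2.
inline bool IsGreekSmall(char32_t c) { return c >= 0x03B1 && c <= 0x03C9; }

inline bool IsGreekLetter(char32_t c)
{
    return IsGreekCapital(c) || IsGreekSmall(c);
}

// Lower-case basic Greek text, choosing between the two small sigmas.
std::u32string GreekLower(const std::u32string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        if (c == 0x03A3) {  // capital sigma: the result depends on context
            const bool after_letter = i > 0 && IsGreekLetter(in[i - 1]);
            const bool before_letter =
                i + 1 < in.size() && IsGreekLetter(in[i + 1]);
            out += (after_letter && !before_letter) ? char32_t(0x03C2)
                                                    : char32_t(0x03C3);
        } else if (IsGreekCapital(c)) {
            out += static_cast<char32_t>(c + 0x20);
        } else {
            out += c;
        }
    }
    return out;
}
```

So ΟΔΟΣ becomes οδος with a final ς, while the first sigma of ΣΟΣ becomes σ. A function from one code point to one code point has no way to make that distinction.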
In short
The best available correct way to case-fold is to apply the Unicode case-folding algorithm, which requires creating a temporary string for each source string. You could then do a simple bytewise comparison between the two transformed strings in order to verify that the original strings were in the same equivalence class. Doing a collation ordering on the transformed strings, while possible, is rather less efficient than collation ordering the original strings, and for sorting purposes, the untransformed comparison is probably as good or better than the transformed comparison.
In theory, if you are only interested in case-folded equality, you could do the transformations linearly, bearing in mind that the transformation is not necessarily context-free and is not a simple character-to-character mapping function. Unfortunately, C++ locales don't provide you the data you need to do this. The Unicode CLDR comes much closer, but it's a complex data structure.
All of this stuff is really complicated, and rife with edge cases. (See the note in the Unicode standard about accented Lithuanian i's, for example.) You're really better off just using a well-maintained existing solution, of which the best example is ICU.

Related

Is there a way to restrict string manipulation e.g substring?

The problem is that I'm processing some UTF8 strings and I would like to design a class or a way to prevent string manipulations.
String manipulation is not desirable for strings of multibyte characters as splitting the string at a random position (which is measured in bytes) may split a character half way.
I have thought about using const std::string& but the user/developer can create a substring by calling std::string::substr.
Another way would be create a wrapper around const std::string& and expose only the string through getters.
Is this even possible?

"Another way would be create a wrapper around const std::string& and expose only the string through getters."
You need a class wrapping a std::string or std::u8string, not a reference to one. The class then owns the string and its contents, basically just using it as storage, and can provide an interface as you see fit to operate on Unicode code points or characters instead of modifying the storage directly.
However, there is nothing in the standard library that will help you implement this. So a better approach would be to use a third party library that already does this for you. Operating on code points in a UTF-8 string is still reasonably simple and you can implement that part yourself, but if you want to operate on characters (in the sense of grapheme clusters or whatever else is suitable) implementation is going to be a project in itself.

I would use a wrapper where your external interface provides access to either code points, or to characters. So, foo.substr(3, 4) (for example) would skip the first 3 code points, and give you the next 4 code points. Alternatively, it would skip the first 3 characters, and give you the next 4 characters.
Either way, that would be independent of the number of bytes used to represent those code points or characters.
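A minimal sketch of the code-point flavour of such a wrapper is below. The class name Utf8Text and its members are invented for illustration, and the code assumes the bytes it is given are already valid UTF-8 (no validation is attempted). Working on characters in the grapheme sense would need considerably more than this.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper: owns UTF-8 bytes, but indexes by code point.
// Assumes the input is already valid UTF-8; nothing is validated.
class Utf8Text {
public:
    explicit Utf8Text(std::string bytes) : bytes_(std::move(bytes)) {}

    // Code points = bytes that are not continuation bytes (10xxxxxx).
    std::size_t length() const
    {
        std::size_t n = 0;
        for (unsigned char b : bytes_) {
            if (!IsContinuation(b)) ++n;
        }
        return n;
    }

    // Skip `pos` code points, then take up to `count` code points.
    Utf8Text substr(std::size_t pos, std::size_t count) const
    {
        const std::size_t first = ByteOffset(0, pos);
        const std::size_t last = ByteOffset(first, count);
        return Utf8Text(bytes_.substr(first, last - first));
    }

    // Read-only access to the storage; there is no mutable access.
    const std::string& bytes() const { return bytes_; }

private:
    static bool IsContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }

    // Byte offset reached after advancing `n` code points from byte `from`.
    std::size_t ByteOffset(std::size_t from, std::size_t n) const
    {
        std::size_t i = from;
        while (i < bytes_.size() && n > 0) {
            ++i;
            while (i < bytes_.size() &&
                   IsContinuation(static_cast<unsigned char>(bytes_[i]))) {
                ++i;
            }
            --n;
        }
        return i;
    }

    std::string bytes_;
};
```

With the three code points "a", "é", "z" stored as four bytes, substr(1, 1) returns the two bytes of "é" rather than half of it.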
Quick aside on terminology for anybody unaccustomed to Unicode terminology: ISO 10646 is basically a long list of code points, each assigned a name and a number from 0 to (about) 2^20-1 (the actual upper limit is 0x10FFFF). UTF-8 encodes a code point number in a sequence of 1 to 4 bytes.
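For anyone who wants to check byte counts by hand, a bare-bones encoder looks roughly like this. It is a sketch only: the function name Utf8Encode is made up, and surrogates and out-of-range values are not rejected.

```cpp
#include <string>

// Encode one code point (0..0x10FFFF) as UTF-8.
// Sketch only: surrogates and out-of-range values are not rejected.
std::string Utf8Encode(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {            // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, Utf8Encode(0x61) is the single byte 0x61, Utf8Encode(0x0300) is the two bytes 0xCC 0x80, and Utf8Encode(0x20AC) is the three bytes 0xE2 0x82 0xAC.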
A character can consist of a (more or less) arbitrary number of code points. It will consist of a base character (e.g., a letter) followed by some number of combining diacritical marks. For example, à can be encoded as an a followed by a "combining grave accent" (U+0300).
The a and the U+0300 are each a code point. When encoded in UTF-8, the a would be encoded in a single byte and the U+0300 would be encoded in two bytes. So, it's one character composed of two code points encoded in 3 bytes.
That's not quite all there is to characters (as opposed to code points) but it's sufficient for quite a few languages (especially, for the typical European languages like Spanish, German, French, and so on).
There are a fair number of other points that become non-trivial though. For example, German has a letter "ß". This is one character, but when you're doing string comparison, it should (at least normally) compare as equal to "ss". I believe there's been a move to change this but at least classically, it hasn't had an upper-case equivalent either, so both comparison and case conversion with it get just a little bit tricky.
And that's fairly mild compared to situations that arise in some of the more "exotic" languages. But it gives a general idea of the fact that yes, if you want to deal intelligently with Unicode strings, you basically have two choices: either have your code use ICU [1] to do most of the real work, or else resign yourself to this being a multi-year project in itself.
[1] In theory, you could use another suitable library--but in this case, I'm not aware of such a thing existing.

How to achieve unicode-agnostic case insensitive comparison in C++

I have a requirement wherein my C++ code needs to do case insensitive comparison without worrying about whether the string is encoded or not, or the type of encoding involved. The string could be ASCII or non-ASCII; I just need to store it as is and compare it with a second string without being concerned about whether the right locale is set and so forth.
Use case: Suppose my application receives a string (let's say it's a file name) initially as "Zoë Saldaña.txt" and it stores it as is. Subsequently, it receives another string "zoë saLdañA.txt", and the comparison between this and the first string should result in a match, by using a few APIs. Same with file name "abc.txt" and "AbC.txt".
I read about IBM's ICU and how it uses UTF-16 encoding by default. I'm curious to know:
1. If ICU provides a means of solving my requirement by seamlessly handling the strings regardless of their encoding type?
2. If the answer to 1. is no, then, using ICU's APIs, is it safe to normalize all strings (both ASCII and non-ASCII) to UTF-16 and then do the case-insensitive comparison and other operations?
3. Are there alternatives that facilitate this?
I read this post, but it doesn't quite meet my requirements.
Thanks!

The requirement is impossible. Computers don't work with characters, they work with numbers. But "case insensitive" comparisons are operations which work on characters. Locales determine which numbers correspond to which characters, and are therefore indispensable.
The above isn't just true for all programming languages, it's even true for case-sensitive comparisons. The mapping from character to number isn't always unique. That means that comparing two numbers doesn't work. There could be a locale where character 42 is equivalent to character 43. In Unicode, it's even worse. There are number sequences which have different lengths and still are equivalent (precomposed and decomposed characters in particular).

Without knowing the encoding, you cannot do that. I will take one example using French accented characters and 2 different encodings: cp850, used as the OEM code page for Windows in the Western European zone, and the well-known ISO-8859-1 (also known as latin1, not very different from the win1252 ANSI character set for Windows).
in cp850, 0x96 is 'û', 0xca is '╩', 0xea is 'Û'
in latin1, 0x96 is non printable(*), 0xca is 'Ê', 0xea is 'ê'
so if string is cp850 encoded, 0xea should be the same as 0x96 and 0xca is a different character
but if string is latin1 encoded, 0xea should be the same as 0xca, 0x96 being a control character
You could find similar examples with other ISO-8859-x encodings, but I only speak of languages I know.
(*) in cp1252 0x96 is '–' unicode character U+2013 not related to 'ê'
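The same point as a sketch in code: the bytes have to be decoded to characters before any case rule can be applied, and the outcome depends on which decoder is used. The cp850 table below contains only the three bytes listed above, and all function names are invented for the example.

```cpp
#include <cstdint>

// Excerpt of cp850 limited to the bytes discussed above.
// Returns the Unicode code point, or 0 for bytes outside this excerpt.
inline char32_t DecodeCp850(std::uint8_t b)
{
    switch (b) {
        case 0x96: return 0x00FB;  // û
        case 0xCA: return 0x2569;  // ╩
        case 0xEA: return 0x00DB;  // Û
        default:   return 0;
    }
}

// ISO-8859-1 maps every byte to the code point with the same value.
inline char32_t DecodeLatin1(std::uint8_t b) { return b; }

// Simple case fold for the Latin-1 capitals U+00C0..U+00DE
// (U+00D7 is the multiplication sign, not a letter).
inline char32_t FoldLatin1Range(char32_t c)
{
    const bool capital = c >= 0x00C0 && c <= 0x00DE && c != 0x00D7;
    return capital ? static_cast<char32_t>(c + 0x20) : c;
}

inline bool SameLetterCp850(std::uint8_t a, std::uint8_t b)
{
    return FoldLatin1Range(DecodeCp850(a)) == FoldLatin1Range(DecodeCp850(b));
}

inline bool SameLetterLatin1(std::uint8_t a, std::uint8_t b)
{
    return FoldLatin1Range(DecodeLatin1(a)) == FoldLatin1Range(DecodeLatin1(b));
}
```

SameLetterCp850(0xEA, 0x96) is true and SameLetterCp850(0xEA, 0xCA) is false, while for latin1 it is exactly the other way round.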

For UTF-8 (or other Unicode) encodings, it is possible to perform a "locale neutral" case-insensitive string comparison. This type of comparison is useful in multi-locale applications, e.g. network protocols (e.g. CIFS), international database data, etc.
The operation is possible due to Unicode metadata which clearly identifies which characters may be "folded" to/from which upper/lower case characters.
As of 2007, when I last looked, there were fewer than 2000 upper/lower case character pairs. It was also possible to generate a perfect hash function to convert upper to lower case (most likely vice versa, as well, but I didn't try it).
At the time, I used Bob Jenkins' perfect hash generator. It worked great in a CIFS implementation I was working on at the time.
There aren't many smallish, fixed sets of data out there you can point a perfect hash generator at. But this is one of 'em. :--)
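As an illustration of the table-driven idea, here is a sketch that uses a sorted array and a binary search in place of a perfect hash. The table is a hand-picked excerpt; the assumption is that a real one would be generated from the "C" and "S" lines of CaseFolding.txt. The names FoldPair, kFoldTable, SimpleFold and EqualsIgnoreCase are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

struct FoldPair {
    char32_t from;
    char32_t to;
};

// Hand-picked excerpt, sorted by `from`. Assumption: a real table would be
// generated from the "C" and "S" lines of the UCD file CaseFolding.txt.
static const FoldPair kFoldTable[] = {
    {0x00CB, 0x00EB},  // Ë -> ë
    {0x00D1, 0x00F1},  // Ñ -> ñ
    {0x03A3, 0x03C3},  // Σ -> σ
    {0x03C2, 0x03C3},  // ς -> σ
};

// Locale-neutral simple fold: one code point in, one code point out.
char32_t SimpleFold(char32_t c)
{
    if (c >= U'A' && c <= U'Z') return static_cast<char32_t>(c + 0x20);
    const FoldPair* end = std::end(kFoldTable);
    const FoldPair* it = std::lower_bound(
        std::begin(kFoldTable), end, c,
        [](const FoldPair& p, char32_t v) { return p.from < v; });
    return (it != end && it->from == c) ? it->to : c;
}

bool EqualsIgnoreCase(const std::u32string& a, const std::u32string& b)
{
    if (a.size() != b.size()) return false;  // simple folding is 1:1
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (SimpleFold(a[i]) != SimpleFold(b[i])) return false;
    }
    return true;
}
```

Once both file names from the question have been decoded to code points, "Zoë Saldaña.txt" and "zoë saLdañA.txt" compare equal under this fold. Normalization (precomposed versus decomposed forms) is a separate step that this sketch does not perform.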
Note: this is locale-neutral. So it will not support applications like German telephone books. There are a great many applications where you should definitely use locale-aware folding and collation. But there are a large number where locale-neutral is actually preferable, especially now that folks are sharing data across so many time zones and, necessarily, cultures. The Unicode standard does a good job of defining a good set of shared rules.
If you're not using Unicode, the presumption is that you have a really good reason. As a practical matter, if you have to deal with other character encodings, you have a highly locale aware application. In which case, the OP's question doesn't apply.
See also:
The Unicode® Standard, Chapter 4, section 4.2, Case
The Unicode® Standard, Chapter 5, section 5.18, Case Mappings, subsection Caseless Matching.
UCD - CaseFolding.txt

Well, first I must say that any programmer dealing with natural language text has the utmost duty to know and understand Unicode well. Other ancient 20th century encodings still exist, but things like EBCDIC and ASCII are not able to encode even a simple English text, which may contain words like façade, naïve or fiancée, or even a geographical sign, a mathematical symbol or emojis (conceptually, they are similar to ideograms). The majority of the world population does not use Latin characters to write text. UTF-8 is now the prevalent encoding on the Internet, and UTF-16 is used internally by all present day operating systems, including Windows, which unfortunately still does it wrong. (For example, NTFS has a decade-long reported bug that allows a directory to contain 2 files with names that look exactly the same but are encoded with different normal forms. I get this a lot when synchronising files via FTP between Windows and MacOS or Linux; all my files with accented characters get duplicated because, unlike the other systems, Windows uses a different normal form and only normalises the file names at the GUI level, not at the file system level. I reported this in 2001 for Windows 7 and the bug is still present today in Windows 10.)
If you still don't know what a normal form is, start here: https://en.wikipedia.org/wiki/Unicode_equivalence
Unicode has strict rules for lower- and uppercase conversion, and these should be followed to the letter in order for things to work nicely. First, make sure both strings use the same normal form (you should do this in the input process; the Unicode standard has the algorithm). Please do not reinvent the wheel: use ICU's normalising and comparison facilities. They have been extensively tested and they work correctly. Use them, IBM has made it gratis.
A note: if you plan on comparing strings for ordering, please remember that collation is locale-dependent, and highly influenced by the language and the context. For example, in a dictionary these Portuguese words would have this exact order: sabia, sabiá, sábia, sábio. The same ordering rules would not work for an address list, which would use phonetic rules to place names like Peçanha and Pessanha adjacently. The same phenomenon happens in German with ß and ss. Yes, natural language is not logical, or rather, its rules are not simple.
C'est la vie. これが私たちの世界です。

Getting the upper or lower case of a unicode code point (as uint32_t)

Is there a way to get the upper or lower case character for a given Unicode code point (or the equivalent UTF-8 code unit sequence)?
I read that this could be done with ICU, but that would be the only thing I'd need ICU for, so I don't want to import a whole huge library (with its licences and dependencies, if any) for a single feature.
I also read that upper and lower case depend on the locale. What does this mean exactly?
Thanks for your help.
PS: Can't use C++11, using VS2005

ICU is the right tool for this. Case-folding (the idea that multiple symbols represent the same "letter") is a tricky concept in the general form.
What's the uppercase form of i? What country are we in and what language are we writing? English has the pair Ii. Turkish has two pairs: İi and Iı. So it's not so simple, and explains the "locale matters" part of the problem.
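In code, "locale matters" means the mapping needs an extra input besides the code point. The sketch below handles only the letter i; the typedef CodePoint, the function names and the boolean flag are all invented for the example, and it avoids C++11 so that it should also build with older compilers.

```cpp
// Assumption: unsigned int is at least 32 bits wide on the target platform.
typedef unsigned int CodePoint;

// Upper-case the letter i only; `turkic` selects Turkish-style rules.
CodePoint UpperI(CodePoint c, bool turkic)
{
    if (c == 'i') return turkic ? 0x0130u : 'I';  // İ (dotted) or plain I
    if (c == 0x0131u) return 'I';                 // dotless ı maps to I
    return c;
}

// Lower-case the letter I only; `turkic` selects Turkish-style rules.
CodePoint LowerI(CodePoint c, bool turkic)
{
    if (c == 'I') return turkic ? 0x0131u : 'i';  // ı (dotless) or plain i
    return c;
}
```

The same input 'i' gives U+0049 with the flag off and U+0130 with it on, so a function that takes nothing but a code point has to pick one convention.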
Another interesting case is the capital for the German ß (Eszett or "sharp S" in English). Its capital form is two letters, SS. So there's no promise that the uppercase form of a string will even have the same number of letters in it.
It's possible that there's some small library that just focuses on case folding, but I'm not aware of it. Generally to do Unicode reasonably, you have to do a lot of Unicode.

How can I determine Levenshtein distance for Mandarin Chinese characters?

We are developing a system to do fuzzy matching on over 50 international languages using the UTF-8, UTF-16, and UTF-32 Unicode character standard. So far, we have been able to use Levenshtein distance to detect misspellings of German Unicode extended character words.
We would like to extend this system to handle Mandarin Chinese ideographs represented in Unicode. How would we perform Levenshtein distance calculation between similar Chinese characters?

Firstly, just to clarify: A Chinese character is not as such equivalent to a German or English word. Most of the things you'd consider as words (using a semantic or syntactic definition of "word") consist of 1-3 characters. It is straightforward to apply Levenshtein distance to such character sequences by representing them as sequences of UCS-2 or UCS-4 code points. As most words are short (esp. words of length 1 or 2 characters), it may be of limited use, though.
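Applying the distance to code point sequences is the straightforward part. A minimal sketch over std::u32string (one char32_t per code point) could look like this; the function name is arbitrary.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Classic Levenshtein distance, one char32_t (code point) per symbol.
std::size_t Levenshtein(const std::u32string& a, const std::u32string& b)
{
    std::vector<std::size_t> prev(b.size() + 1);
    std::vector<std::size_t> cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            // Cheapest of deletion, insertion and substitution.
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        prev.swap(cur);
    }
    return prev[b.size()];
}
```

Two two-character words that share their first character, such as 中国 and 中文, are at distance 1. With words this short, most pairs end up at distance 1 or 2, which is the "limited use" mentioned above.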
However, as your question is specifically about the edit distance between individual characters, I believe a different approach is required, and it may be very difficult indeed.
For a start, you'd have to represent each character as a sequence of the components / strokes it consists of. There are two problems:
1. Some components consist themselves of even smaller components, so how to break a character down into "atomic" components is not uniquely defined. If you do it down to the level of individual strokes, you'd need a characterisation of every single stroke (position within the character, shape, direction etc.). I don't think anyone has ever done this (I'd be most interested if anyone tells me otherwise).
2. You'd need to put the strokes or components into an order. The obvious candidate is the canonical stroke order of the character, which is described in lexica, and there are even dictionary websites with animated stroke order diagrams. However, the data sources I know (for Japanese) generate these animations as sequences of bitmap graphics; I have never seen human or machine readable codes that represent the sequence of strokes (or even the names of individual strokes) in a form that is suitable for edit distance calculation.
One final thing you could try, though, is to render the character glyphs and calculate the edit distance based on how many pixels (or vectors) need to be changed to turn one character into another. I once did this for Latin characters and character combinations (on pixel basis) in the context of OCR post-correction, and the results were quite encouraging.
A quick answer to larsmans' comment below: There are two related concepts defined by the Unicode Standard (in the below I refer to the 6.0 version, chapter 12):
An index based on radicals and stroke counts. Each Han character consists of several components, one of which is the radical. A radical/stroke count index is a character list sorted by radical (i.e. all characters that share the same radical grouped together), and each radical-specific group internally sorted by the number of strokes used in the rest of the character. Unfortunately, even this is not uniquely defined – there are characters whose radical is defined differently by different traditional lexica, and stroke counting can also be difficult. Here is what the Unicode Standard says:
To expedite locating specific Han ideographic characters in the code charts, radical-stroke indices are provided on the Unicode web site. [...]
The most influential authority for radical-stroke information is the eighteenth-century KangXi dictionary, which contains 214 radicals. The main problem in using KangXi radicals today is that many simplified characters are difficult to classify under any of the 214 KangXi radicals. As a result, various modern radical sets have been introduced. None, however, is in general use, and the 214 KangXi radicals remain the best known. [...]
The Unicode radical-stroke charts are based on the KangXi radicals. The Unicode Standard follows a number of different sources for radical-stroke classification. Where two sources are at odds as to radical or stroke count for a given character, the character is shown in both positions in the radical-stroke charts.
Note that even if we assume the radical/stroke index to be unambiguous and correct, it wouldn't suffice as a source of information to transform a character into a sequence of components, because the only component of the character fully described by this is the radical.
Ideographic description sequences (section 12.2): Unicode defines code points for the basic components of characters (most of them can themselves be used as standalone characters anyway), and there are codepoints used to glue those together to form a sequence of components that describes the composition of a more complex character. So this works in a way similar to combining characters, but there are important differences:
The order of components is not uniquely defined
There is no definition of a rendering mechanism for such sequences
There is no mapping from ordinary characters to corresponding ideographic description sequences (although the Standard mentions that such mappings, to some extent, exist in the sources they used to compile the Han character set).
The Standard suggests that ideographic description sequences be used to describe complex or rare characters that are not represented by any existing code point; but it explicitly discourages the use of description sequences in place of ordinary characters:
In particular, Ideographic Description Sequences should not be used to provide alternative graphic representations of encoded ideographs in data interchange. Searching, collation, and other content-based text operations would then fail.

I wrote a python package fuzzychinese to correct misspellings of Chinese words.
As @jogojapan has said, if you really want to calculate Levenshtein distance, it makes more sense to use sub-character structures such as radicals or strokes. You can use the Stroke() or Radical() classes from fuzzychinese to break down characters, and then calculate Levenshtein distance.
However, I am not sure Levenshtein distance works well for correcting misspelled Chinese words. In the package I wrote, I calculated tf–idf vectors for n-gram strokes and used cosine similarity to match words.

Damerau–Levenshtein distance for language specific quirks

To Dutch speaking people the two characters "ij" are considered to be a single letter that is easily exchanged with "y".
For a project I'm working on I would like to have a variant of the Damerau–Levenshtein distance that calculates the distance between "ij" and "y" as 1 instead of the current value of 2.
I've been trying this myself but failed. My problem is that I do not have a clue on how to handle the fact that both texts are of different lengths.
Does anyone have a suggestion/code fragment on how to solve this?
Thanks.

The Wikipedia article is rather loose with terminology. There are no such things as "strings" in "natural language". There are phonemes in natural language which can be represented by written characters and character-combinations.
Some character-combinations are vestiges of historical conventions which have survived into modern times, as in modern English "rough" where the "gh" can sound like -f- or make no sound at all. It seems to me that in focusing on raw "strings" the algorithm must be agnostic about the historical relationship of language and orthographic convention, which leads to some arbitrary metrics whenever character-combinations correlate to a single phoneme. How would it measure "rough" to "ruf"? Or "through" to "thru"?
Or German o-umlaut to "oe"?
In your case the -y- can be exchanged phonetically and orthographically with -ij-. So what is that according to the algorithm, two deletions followed by an insertion, or a single deletion of the -j- or of the -i- followed by a transposition of the remaining character to -y-? Or is -ij- being coalesced and the coalescence is followed by a transposition?
I would recommend that you substitute a single, otherwise unused character for -ij- before applying the algorithm, perhaps U+00EC, Latin small letter i with grave accent.
How does the algorithm handle multi-codepoint characters?

Well, the D-L distance itself isn't going to handle it for you, due to the way it measures distances.
As there is no code (or language) involved here, I can only leave you with a suggestion to ensure all strings adhere to the same structure.
To clarify the situation, since you're asking in general terms: bear in mind that the D-L distance compares character for character and doesn't actually read your strings in themselves. As such, you'll have to parse before you compare, because cases where ij shouldn't be exchanged with y will cause other issues instead.

An idea is to translate each string into some sort of constructed orthographemic representation, where digraphs such as "ij" and the English "gh", "th" and friends are only one character long. The distance metric does not have to be equal for all types of replacements when doing Damerau-Levenshtein, so you can use whatever penalties you want, but the table needs to be filled locally, therefore you really want each sound to be one cell in the table.
This however breaks when the "ij" was not intended as "ij" but a misspelling or at a word-segmentation border (I don't know if that can happen in Dutch), or in any other situation it is not actually (meant as) a digraph.
Otherwise you will need to do some lookaround. This will complicate things but should not change the growth order of the algorithm (I believe), provided you only look at a constant number of cells around. The constant factors will still be much bigger, though.
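A sketch of the substitution idea, for what it is worth: every "ij" is collapsed into one placeholder symbol before a standard distance is computed. The placeholder (the private-use code point U+E000) and all function names are arbitrary choices for this example, the distance is the optimal-string-alignment variant of Damerau-Levenshtein with unit costs, and the caveat about an "ij" that is not really a digraph applies unchanged.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Placeholder for the digraph: a private-use code point, chosen arbitrarily.
const char32_t kIJ = 0xE000;

// Replace every "ij" with the single placeholder symbol.
std::u32string CollapseIJ(const std::u32string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'i' && i + 1 < in.size() && in[i + 1] == U'j') {
            out += kIJ;
            ++i;  // skip the 'j' that was just consumed
        } else {
            out += in[i];
        }
    }
    return out;
}

// Optimal string alignment variant of Damerau-Levenshtein, unit costs.
std::size_t OsaDistance(const std::u32string& a, const std::u32string& b)
{
    const std::size_t n = a.size();
    const std::size_t m = b.size();
    std::vector<std::vector<std::size_t> > d(
        n + 1, std::vector<std::size_t>(m + 1));
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = i;
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = j;

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                                d[i][j - 1] + 1,          // insertion
                                d[i - 1][j - 1] + cost}); // substitution
            if (i > 1 && j > 1 &&
                a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1);  // swap
            }
        }
    }
    return d[n][m];
}

// Distance after collapsing the digraph on both sides.
std::size_t DutchDistance(const std::u32string& a, const std::u32string& b)
{
    return OsaDistance(CollapseIJ(a), CollapseIJ(b));
}
```

On the raw strings, "ijs" and "ys" are 2 apart; after collapsing, the placeholder is substituted by "y" in a single step, so DutchDistance gives 1.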
