Handling case-insensitivity without taking locale into account - C++

I am investigating the handling of case-insensitivity in my application. So far I have identified two distinct cases:
data that is visualized to the user
data that is handled internally
For case 1 you should use the user's locale, always. This means that e.g. when sorting items in a list and you want this to happen case-insensitively, then you should use locale-aware case-insensitive string compare functions.
For case 2 it seems logical that you don't want to use the user's locale, since that can have undesirable effects when users with different locales share the same data set. E.g. if you are writing library software, you could use a book's name as the key for the book instance in the database, and want to handle that key case-insensitively (this is a simplification, I know).
When using STL containers (like std::map) I noticed that it's much more efficient to store the key in uppercase and then perform the lookup on an uppercased search value. This is more efficient than performing a case-insensitive compare inside the map's comparison function. For std::unordered_map such a trick is probably required, since hashing needs a canonical form of the key.
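For illustration, here is a minimal sketch of that idea. Note that it assumes plain single-byte ASCII keys (std::toupper is not adequate beyond ASCII, which is exactly the complication discussed below), and all names are made up:

#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Canonicalize a key; correct only for single-byte ASCII letters.
std::string toUpperAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::toupper(c); });
    return s;
}

std::map<std::string, int> books;

void insertBook(const std::string& name, int id) {
    books[toUpperAscii(name)] = id;            // store the canonical key
}

int findBook(const std::string& name) {
    auto it = books.find(toUpperAscii(name));  // canonicalize the probe too
    return it == books.end() ? -1 : it->second;
}

The same canonical form also gives std::unordered_map a consistent hash, which a comparison-based trick cannot.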
However, I realized that this may have strange effects as well, and I am wondering how Windows (which also uses case-insensitive file names) handles these situations.
E.g. the German character ß (sharp S) becomes SS when put in uppercase. This seems to imply that ß.txt and SS.txt should denote the same file, and hence that ßs.txt, sß.txt, sss.txt and SSS.txt should all denote the same file as well. But in my experiments this doesn't seem to be the case on Windows.
So my questions:
Which C++ and/or Windows functions should be used to perform locale-independent case-insensitive string compares?
Which C++ and/or Windows functions should be used to make a string case-less (e.g. put it in uppercase) so compares are then more efficient when performing a lookup in an std::map (or even making hashing possible when using an std::unordered_map)?
Any other experience (or links to documents) regarding case-insensitive string handling for internal (i.e. non-visualization-related) data?

Related

How to achieve unicode-agnostic case insensitive comparison in C++

I have a requirement wherein my C++ code needs to do case-insensitive comparison without knowing which encoding is involved. The string could be ASCII or non-ASCII; I just need to store it as is and compare it with a second string without worrying whether the right locale is set, and so forth.
Use case: Suppose my application initially receives a string (let's say a file name) as "Zoë Saldaña.txt" and stores it as is. Subsequently it receives another string, "zoë saLdañA.txt", and a comparison between this and the first string should result in a match. Same with the file names "abc.txt" and "AbC.txt".
I read about IBM's ICU and how it uses UTF-16 encoding by default. I'm curious to know:
1. Does ICU provide a means of solving my requirement by seamlessly handling the strings regardless of their encoding type?
2. If the answer to 1 is no, then, using ICU's APIs, is it safe to normalize all strings (both ASCII and non-ASCII) to UTF-16 and then do the case-insensitive comparison and other operations?
3. Are there alternatives that facilitate this?
I read this post, but it doesn't quite meet my requirements.
Thanks!
The requirement is impossible. Computers don't work with characters, they work with numbers. But "case insensitive" comparisons are operations on characters. Locales determine which numbers correspond to which characters, and are therefore indispensable.
The above isn't just true for all programming languages; it's even true for case-sensitive comparisons. The mapping from character to number isn't always unique, which means that comparing two numbers simply doesn't work. There could be a locale in which character 42 is equivalent to character 43. In Unicode it's even worse: there are number sequences of different lengths which are nevertheless equivalent (precomposed and decomposed characters in particular).
Without knowing the encoding, you cannot do that. I will take one example using French accented characters and two different encodings: cp850, used as the OEM character set for Windows in the Western European zone, and the well-known ISO 8859-1 (also known as latin1, not very different from the Windows-1252 ANSI character set).
in cp850, 0x96 is 'û', 0xca is '╩', 0xea is 'Û'
in latin1, 0x96 is non-printable (*), 0xca is 'Ê', 0xea is 'ê'
so if the string is cp850-encoded, 0xea should compare equal (case-insensitively) to 0x96, while 0xca is a different character;
but if the string is latin1-encoded, 0xea should compare equal to 0xca, with 0x96 being a control character
You can find similar examples with other ISO 8859-x encodings, but I only speak of languages I know.
(*) in cp1252, 0x96 is '–' (Unicode character U+2013), not related to 'ê'
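To make the ambiguity concrete, here is a sketch that runs the same bytes through the locale-aware std::toupper under two single-byte locales. The locale names are hypothetical and platform-dependent (they may well not be installed on your system); the point is that whether 0x96 uppercases to 0xEA depends entirely on which encoding the locale data assumes:

#include <cstdio>
#include <locale>

int main() {
    const char bytes[] = { '\x96', '\xCA', '\xEA' };
    // Hypothetical locale names; availability varies by platform.
    for (const char* name : {"fr_FR.ISO-8859-1", "fr_FR.IBM850"}) {
        try {
            std::locale loc(name);
            for (char c : bytes)
                std::printf("%s: 0x%02X -> toupper 0x%02X\n", name,
                            (unsigned)(unsigned char)c,
                            (unsigned)(unsigned char)std::toupper(c, loc));
        } catch (const std::runtime_error&) {
            std::printf("%s: locale not installed on this system\n", name);
        }
    }
}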
For UTF-8 (or other Unicode) encodings, it is possible to perform a "locale neutral" case-insensitive string comparison. This type of comparison is useful in multi-locale applications, e.g. network protocols (e.g. CIFS), international database data, etc.
The operation is possible due to Unicode metadata which clearly identifies which characters may be "folded" to/from which upper/lower case characters.
As of 2007, when I last looked, there were fewer than 2,000 upper/lower case character pairs. It was also possible to generate a perfect hash function converting upper case to lower case (and most likely vice versa as well, but I didn't try it).
At the time, I used Bob Jenkins' perfect hash generator (from burtleburtle.net). It worked great in the CIFS implementation I was working on.
There aren't many smallish, fixed sets of data out there you can point a perfect hash generator at. But this is one of 'em. :--)
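To illustrate the shape of such a table, here is a sketch with a tiny hand-picked sample of simple fold pairs; a real implementation would generate the full table (and, as described above, ideally a perfect hash instead of a binary search) from CaseFolding.txt:

#include <algorithm>
#include <iterator>

// Sample (from, to) simple case-fold pairs, sorted by 'from'. The real
// table, generated from CaseFolding.txt, has a couple thousand entries.
struct FoldPair { char32_t from, to; };
constexpr FoldPair kFolds[] = {
    { U'A',      U'a'      },  // one of the 26 ASCII pairs
    { U'\u00C0', U'\u00E0' },  // À -> à
    { U'\u03A3', U'\u03C3' },  // Σ -> σ
    { U'\u1E9E', U'\u00DF' },  // ẞ -> ß
};

// Binary search; a generated perfect hash would make this a single probe.
char32_t foldCodepoint(char32_t c) {
    auto it = std::lower_bound(std::begin(kFolds), std::end(kFolds), c,
        [](const FoldPair& p, char32_t v) { return p.from < v; });
    return (it != std::end(kFolds) && it->from == c) ? it->to : c;
}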
Note: this is locale-neutral, so it will not support applications like German telephone books. There are a great many applications where you should definitely use locale-aware folding and collation. But there are a large number where locale-neutral is actually preferable, especially now that folks are sharing data across so many time zones and, necessarily, cultures. The Unicode standard does a good job of defining a good set of shared rules.
If you're not using Unicode, the presumption is that you have a really good reason. As a practical matter, if you have to deal with other character encodings, you have a highly locale-aware application, in which case the OP's question doesn't apply.
See also:
The Unicode® Standard, Chapter 4, section 4.2, Case
The Unicode® Standard, Chapter 5, section 5.18, Case Mappings, subsection Caseless Matching.
UCD - CaseFolding.txt
Well, first I must say that any programmer dealing with natural-language text has the utmost duty to know and understand Unicode well. Other ancient 20th-century encodings still exist, but things like EBCDIC and ASCII are not able to encode even simple English text, which may contain words like façade, naïve or fiancée, or a geographical sign, a mathematical symbol or even emojis (conceptually, these are similar to ideograms). The majority of the world's population does not use Latin characters to write text. UTF-8 is now the prevalent encoding on the Internet, and UTF-16 is used internally by all present-day operating systems, including Windows, which unfortunately still does it wrong.
(For example, NTFS has a long-standing reported bug that allows a directory to contain two files with names that look exactly the same but are encoded in different normal forms. I get this a lot when synchronising files via FTP between Windows and macOS or Linux: all my files with accented characters get duplicated because, unlike the other systems, Windows only normalises file names at the GUI level, not at the file-system level. I reported this back in the Windows 7 days, and the bug is still present today in Windows 10.)
If you still don't know what a normal form is, start here: https://en.wikipedia.org/wiki/Unicode_equivalence
Unicode has strict rules for lower- and uppercase conversion, and these should be followed to the letter in order for things to work nicely. First, make sure both strings use the same normal form (you should do this during input processing; the Unicode standard specifies the algorithm). Please do not reinvent the wheel: use ICU's normalisation and comparison facilities. They have been extensively tested and they work correctly. Use them; IBM has made them gratis.
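A minimal sketch of that recipe with ICU's C++ API, normalising both inputs to NFC and then comparing with the default (locale-independent) Unicode case folding; the strings are the ones from the question, and you would link with -licuuc:

#include <unicode/normalizer2.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);

    icu::UnicodeString a = icu::UnicodeString::fromUTF8("Zoë Saldaña.txt");
    icu::UnicodeString b = icu::UnicodeString::fromUTF8("zoë saLdañA.txt");

    // First bring both strings to the same normal form...
    icu::UnicodeString na = nfc->normalize(a, status);
    icu::UnicodeString nb = nfc->normalize(b, status);

    // ...then compare with Unicode case folding (no locale involved).
    if (U_SUCCESS(status) && na.caseCompare(nb, U_FOLD_CASE_DEFAULT) == 0)
        std::cout << "match\n";  // this prints "match"
}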
A note: if you plan on comparing strings for ordering, please remember that collation is locale-dependent and highly influenced by the language and the setting. For example, in a dictionary these Portuguese words would appear in exactly this order: sabia, sabiá, sábia, sábio. The same ordering rules would not work for an address list, which would use phonetic rules to place names like Peçanha and Pessanha adjacently. The same phenomenon happens in German with ß and ss. Yes, natural language is not logical; or, better said, its rules are not simple.
C'est la vie. This is the world we live in.

How to compare a "basic_string" using an arbitrary locale

I'm re-posting a question I submitted earlier today but I'm now citing a specific example in response to the feedback I received. The original question can be found here (note that it's not a homework assignment):
I'm simply trying to determine if C++ makes it impossible to perform an (efficient) case-INsensitive comparison of a basic_string object that also factors in any arbitrary locale object. For instance, it doesn't appear to be possible to write an efficient function such as the following:
bool AreStringsEqualIgnoreCase(const string &str1, const string &str2, const locale &loc);
Based on my current understanding (can someone confirm this?), this function has to call both ctype::toupper() and collate::compare() for the given locale (both extracted, as always, using use_facet()). However, because collate::compare() in particular requires 4 pointer args, you either need to pass these 4 args for every character you compare (after first calling ctype::toupper()), or alternatively convert both strings to uppercase first and then make a single call to collate::compare().
The 1st approach is obviously inefficient (4 pointers passed for each character tested), and the 2nd requires you to convert both strings to uppercase in their entirety (requiring allocation of memory and needless copying/conversion of both strings). Am I correct that it's not possible to do this efficiently (because there's no way around collate::compare())?
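For reference, here is a sketch of the second approach described above (uppercase both strings via the ctype facet, then make a single collate::compare call). As the answer below explains, per-character toupper is not a correct case-fold in general, so treat this as an illustration of the facet interfaces rather than a recommendation:

#include <locale>
#include <string>

bool AreStringsEqualIgnoreCase(const std::string& str1, const std::string& str2,
                               const std::locale& loc) {
    const auto& ct   = std::use_facet<std::ctype<char>>(loc);
    const auto& coll = std::use_facet<std::collate<char>>(loc);

    std::string a = str1, b = str2;
    // ctype::toupper has an in-place range overload: toupper(first, last).
    ct.toupper(&a[0], &a[0] + a.size());
    ct.toupper(&b[0], &b[0] + b.size());

    // collate::compare takes begin/end pointers for both strings.
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size()) == 0;
}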
One of the little annoyances about trying to deal in a consistent way with all the world's writing systems is that practically nothing you think you know about characters is actually correct. This makes it tricky to do things like "case-insensitive comparison". Indeed, it is tricky to do any form of locale-aware comparison, and case-insensitivity is additionally thorny.
With some constraints, though, it is possible to accomplish. The algorithm needed can be implemented "efficiently" using normal programming practices (and precomputation of some static data), but it cannot be implemented as efficiently as an incorrect algorithm. It is often possible to trade off correctness for speed, but the results are not pleasant. Incorrect but fast locale implementations may appeal to those whose locales are implemented correctly, but are clearly unsatisfactory for the part of the audience whose locales produce unexpected results.
Lexicographical ordering doesn't work for human beings
Most locales (other than the "C" locale) for languages which have case already handle letter case in the expected manner, which is to use case differences only after all other differences have been taken into account. That is, if a list of words is sorted in the locale's collation order, then words in the list which differ only in case will be consecutive. Whether the words in upper case come before or after words in lower case is locale-dependent, but there won't be other words in between.
That result cannot be achieved by any single-pass left-to-right character-by-character comparison ("lexicographical ordering"). And most locales have other collation quirks which also don't yield to naïve lexicographical ordering.
Standard C++ collation should be able to deal with all of these issues, if you have appropriate locale definitions. But it cannot be reduced to lexicographical comparison using a comparison function over pairs of wchar_t, and consequently the C++ standard library doesn't provide that interface.
The following are just a few examples of why locale-aware collation is complicated; a longer explanation, with a lot more examples, can be found in Unicode Technical Standard #10.
Where do the accents go?
Most romance languages (and also English, when dealing with borrowed words) consider accents over vowels to be a secondary characteristic; that is, words are first sorted as though the accents weren't present, and then a second pass is made in which unaccented letters come before accented letters. A third pass is necessary to deal with case, which is ignored in the first two passes.
But that doesn't work for Northern European languages. The alphabets of Swedish, Norwegian and Danish have three extra vowels, which follow z in the alphabet. In Swedish, these vowels are written å, ä, and ö; in Norwegian and Danish, these letters are written å, æ, and ø, and in Danish å is sometimes written aa, making Aarhus the last entry in an alphabetical list of Danish cities.
In German, the letters ä, ö, and ü are generally alphabetised like romance accents, but in German phonebooks (and sometimes other alphabetical lists) they are alphabetised as though they were written ae, oe and ue, which is the older way of writing the same phonemes. (There are many pairs of common surnames, such as "Müller" and "Mueller", which are pronounced the same and are often confused, so it makes sense to intercollate them. A similar convention was used for Scottish names in Canadian phonebooks when I was young; the spellings M', Mc and Mac were all clumped together since they are all phonetically identical.)
One symbol, two letters. Or two letters, one symbol
German also has the symbol ß which is collated as though it were written out as ss, although it is not quite identical phonetically. We'll meet this interesting symbol again a bit later.
In fact, many languages consider digraphs and even trigraphs to be single letters. The 44-letter Hungarian alphabet includes Cs, Dz, Dzs, Gy, Ly, Ny, Sz, Ty, and Zs, as well as a variety of accented vowels. However, the language most commonly referenced in articles about this phenomenon -- Spanish -- stopped treating the digraphs ch and ll as letters in 1994, presumably because it was easier to force Hispanic writers to conform to computer systems than to change the computer systems to deal with Spanish digraphs. (Wikipedia claims it was pressure from "UNESCO and other international organizations"; it took quite a while for everyone to accept the new alphabetization rules, and you still occasionally find "Chile" after "Colombia" in alphabetical lists of South American countries.)
Summary: comparing character strings requires multiple passes, and sometimes requires comparing groups of characters
Making it all case-insensitive
Since locales handle case correctly in comparison, it should not really be necessary to do case-insensitive ordering. It might be useful to do case-insensitive equivalence-class checking ("equality" testing), although that raises the question of what other imprecise equivalence classes might be useful. Unicode normalization, accent deletion, and even transcription to Latin are all reasonable in some contexts and highly annoying in others. But it turns out that case conversions are not as simple as you might think, either.
Because of the existence of di- and trigraphs, some of which have Unicode codepoints, the Unicode standard actually recognizes three cases, not two: lower-case, upper-case and title-case. The last is what you use to upper case the first letter of a word, and it's needed, for example, for the Croatian digraph dž (U+01C6; a single character), whose uppercase is DŽ (U+01C4) and whose title case is Dž (U+01C5). The theory of "case-insensitive" comparison is that we could transform (at least conceptually) any string in such a way that all members of the equivalence class defined by "ignoring case" are transformed to the same byte sequence. Traditionally this is done by "upper-casing" the string, but it turns out that that is not always possible or even correct; the Unicode standard prefers the use of the term "case-folding", as do I.
C++ locales aren't quite up to the job
So, getting back to C++, the sad truth is that C++ locales do not have sufficient information to do accurate case-folding, because C++ locales work on the assumption that case-folding a string consists of nothing more than sequentially and individually upper-casing each codepoint in the string using a function which maps a codepoint to another codepoint. As we'll see, that just doesn't work, and consequently the question of its efficiency is irrelevant. On the other hand, the ICU library has an interface which does case-folding as correctly as the Unicode database allows, and its implementation has been crafted by some pretty good coders so it is probably just about as efficient as possible within the constraints. So I'd definitely recommend using it.
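For instance, here is a sketch of a case-ignoring equivalence check with ICU's UnicodeString::foldCase, which applies the Unicode case-folding data rather than a per-codepoint toupper:

#include <unicode/unistr.h>

// Locale-neutral "equal ignoring case" via Unicode case folding.
// For full correctness the inputs should already be in the same normal
// form (see the renormalization caveat below).
bool equalsIgnoreCase(const icu::UnicodeString& a, const icu::UnicodeString& b) {
    icu::UnicodeString fa = a, fb = b;  // foldCase mutates in place, so copy
    fa.foldCase();                      // default folding, U_FOLD_CASE_DEFAULT
    fb.foldCase();
    return fa == fb;
}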
If you want a good overview of the difficulty of case-folding, you should read sections 5.18 and 5.19 of the Unicode standard (PDF of chapter 5). The following are just a few examples.
A case transform is not a mapping from single character to single character
The simplest example is the German ß (U+00DF), which has no upper-case form because it never appears at the beginning of a word, and traditional German orthography didn't use all-caps. The standard upper-case transform is SS (or in some cases SZ), but that transform is not reversible; not all instances of ss are written as ß. Compare, for example, grüßen and küssen (to greet and to kiss, respectively). In v5.1, ẞ, an "upper-case ß", was added to Unicode as U+1E9E, but it is not commonly used except on all-caps street signs, where its use is legally mandated. The normal expectation when upper-casing ß would be the two letters SS.
Not all glyphs (visible characters) are single character codes
Even when a case transform maps a single character to a single character, it may not be able to express that as a wchar→wchar mapping. For example, ǰ can easily be capitalized to J̌, but the former is a single combined glyph (U+01F0), while the second is a capital J with a combining caron (U+030C).
There is a further problem with glyphs like ǰ:
Naive character-by-character case-folding can denormalize
Suppose we upper-case ǰ as above. How do we capitalize ǰ̠ (which, in case it doesn't render properly on your system, is the same character with a bar underneath, another IPA convention)? That combination is U+01F0,U+0320 (j with caron, combining minus sign below), so we proceed to replace U+01F0 with U+004A,U+030C and then leave the U+0320 as is: J̠̌. That's fine, but it won't compare equal to a normalized capital J with caron and minus sign below, because in the normal form the minus-sign diacritic comes first: U+004A,U+0320,U+030C (J̠̌, which should look identical). So sometimes (rarely, to be honest, but sometimes) it is necessary to renormalize.
Leaving aside Unicode weirdness, sometimes case conversion is context-sensitive
Greek has a lot of examples of how marks get shuffled around depending on whether they are word-initial, word-final or word-interior -- you can read more about this in chapter 7 of the Unicode standard -- but a simple and common case is Σ, which has two lower-case forms: σ and ς. Non-Greeks with some maths background are probably familiar with σ, but may not be aware that it cannot be used at the end of a word, where you must use ς.
In short
The best available correct way to case-fold is to apply the Unicode case-folding algorithm, which requires creating a temporary string for each source string. You could then do a simple bytewise comparison between the two transformed strings in order to verify that the original strings were in the same equivalence class. Doing a collation ordering on the transformed strings, while possible, is rather less efficient than collation ordering the original strings, and for sorting purposes, the untransformed comparison is probably as good or better than the transformed comparison.
In theory, if you are only interested in case-folded equality, you could do the transformations linearly, bearing in mind that the transformation is not necessarily context-free and is not a simple character-to-character mapping function. Unfortunately, C++ locales don't provide the data you need to do this. The Unicode CLDR comes much closer, but it's a complex data structure.
All of this stuff is really complicated, and rife with edge cases. (See the note in the Unicode standard about accented Lithuanian i's, for example.) You're really better off just using a well-maintained existing solution, of which the best example is ICU.

How to parse numbers like "3.14" with scanf when locale expects "3,14"

Let's say I have to read a file containing a bunch of floating-point numbers. The numbers can be like 1e+10, 5, -0.15 etc., i.e., any generic floating-point number using a decimal point (this is fixed!). However, my code is a plugin for another application, and I have no control over the current locale. It may be Russian, for example, and the LC_NUMERIC rules there call for a decimal comma. Thus Pi is expected to be spelled as "3,1415...", and
sscanf("3.14", "%f", &x);
returns "1", and x contains "3.0", since it refuses to parse past the '.' in the string.
I need to ignore the locale for such number-parsing tasks.
How does one do that?
I could write a parseFloat function, but this seems like a waste.
I could also save the current locale, reset it temporarily to "C", read the file, and restore the saved one. What are the performance implications of this? Could setlocale() be very slow on some OS/libc combination? What does it really do under the hood?
Yet another way would be to use iostreams, but again their performance isn't stellar.
My personal preference is to never use LC_NUMERIC, i.e. to call setlocale only with other categories, or, after calling setlocale with LC_ALL, to call setlocale(LC_NUMERIC, "C"). Otherwise, you're completely out of luck if you want to use the standard library for printing or parsing numbers in a standard form for interchange.
If you're lucky enough to be on a POSIX 2008 conforming system, you can use the uselocale and *_l family of functions to make the situation somewhat better. There are at least two basic approaches (a sketch of the second follows the list):
1. Leave the default locale unset (at least the troublesome parts like LC_NUMERIC; LC_CTYPE should probably always be set), and pass a locale_t object for the user's locale to the appropriate *_l functions only when you want to present things to the user in a way that meets their own cultural expectations; otherwise use the default C locale.
2. Have your code that needs to work with data for interchange keep around a locale_t object for the C locale, and either switch back and forth using uselocale when you need to work with data in a standard form for interchange, or use the appropriate *_l functions (but there is no scanf_l).
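A minimal sketch of the second approach (the function name is made up); uselocale switches the locale for the calling thread only, which makes it much safer than a global setlocale round-trip:

#include <locale.h>  // newlocale, uselocale, freelocale (POSIX 2008)
#include <stdlib.h>  // strtod

// Parse a C-locale-formatted number ("3.14") regardless of LC_NUMERIC.
double parse_interchange_double(const char *s, char **end) {
    locale_t c_loc = newlocale(LC_NUMERIC_MASK, "C", (locale_t)0);
    locale_t old = uselocale(c_loc);  // affects only the calling thread
    double v = strtod(s, end);
    uselocale(old);                   // restore the previous locale
    freelocale(c_loc);
    return v;
}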
Note that implementing your own floating point parser is not easy and is probably not the right solution to the problem unless you're an expert in numerical computing. Getting it right is very hard.
POSIX.1-2008 specifies isalnum_l(), isalpha_l(), isblank_l(), iscntrl_l(), isdigit_l(), isgraph_l(), islower_l(), isprint_l(), ispunct_l(), isspace_l(), isupper_l(), and isxdigit_l().
Here's what I've done with this stuff in the past.
The goal is to use locale-dependent numeric converters with a C-locale numeric representation. The ideal, of course, would be to use non-locale-dependent converters, or not to change the locale, etc., etc., but sometimes you just have to live with what you've got. <rant>Locale support is seriously broken in several ways, and this is one of them.</rant>
First, extract the number as a string using something like the C grammar's simple pattern for numeric preprocessing tokens. For use with scanf, I do an even simpler one:
" %1[-+0-9.]%[-+0-9A-Za-z.]"
This could be simplified even more, depending on what else you might expect in the input stream. The only thing you need to do is avoid reading beyond the end of the number; as long as you don't allow numbers to be immediately followed by letters without intervening whitespace, the above will work fine.
Now, get the struct lconv (man 7 locale) representing the current locale using localeconv(3). The first entry in that struct is const char* decimal_point; replace all of the '.' characters in your string with that value. (You might also need to replace '+' and '-' characters, although most locales don't change them, and the sign fields in the lconv struct are documented as only applying to currency conversions.) Finally, feed the resulting string through strtod and see if it passes.
This is not a perfect algorithm, particularly since it's not always easy to know how locale-compliant a given library actually is, so you might want to do some autoconf stuff to configure it for the library you're actually compiling with.
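Putting the pieces together, a sketch (error handling kept minimal, and assuming a single-character decimal point, which covers real-world locales):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

// Read one C-locale-formatted number from stdin even when LC_NUMERIC
// expects a decimal comma.
int read_interchange_double(double *out) {
    char buf[64];
    if (scanf(" %1[-+0-9.]%62[-+0-9A-Za-z.]", buf, buf + 1) < 1)
        return 0;                        // no number-shaped token found
    const char *dp = localeconv()->decimal_point;
    for (char *p = buf; *p; ++p)
        if (*p == '.')
            *p = dp[0];                  // '.' -> the locale's decimal point
    char *end;
    *out = strtod(buf, &end);
    return end != buf && *end == '\0';   // accept only a fully consumed token
}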
I am not sure how to solve this in C, but C++ streams can have their own locale object:
#include <iostream>
#include <locale>
#include <sstream>

int main() {
    std::stringstream dataStream;
    // Note: imbue the stream before doing anything else with it; once any
    // operation has been performed, an imbue() can be silently ignored
    // (which is a pain to debug).
    dataStream.imbue(std::locale("C"));
    dataStream << "3.14";
    float x;
    dataStream >> x;
    std::cout << x << '\n';  // 3.14 regardless of the global locale
}

Parse an XML in standard C/C++ without additional libraries

I have an XML (assuming it is valid) and I must parse it and store it in a tree.
What is the best approach to parse it, without using other libraries, just basic manipulation of strings?
Keep in mind that I don't have to validate it, just parse and memorize it into a tree.
The basic structure of XML is quite simple:
<tagname [attribute[="value"] ...]>content</tagname>
where the content may contain both normal text and more XML structures, or the special form
<tagname [attribute[="value"] ...]/>
which is equivalent to
<tagname [attribute[="value"] ...]></tagname>
that is, empty content.
So if you don't need to interpret a DTD or do other fancy things, you can do the following (a code sketch of these steps follows the list):
1. Check that the first non-whitespace character is <. If not, you don't have XML and can just give an error and exit.
2. Now follows the tag name, up to the first whitespace, / or > character. Store that.
3. If the next non-whitespace character is /, check that it is followed by >. If so, you've finished parsing and can return your result. Otherwise, you've got malformed XML and can exit with an error.
4. If the character is >, then you've found the end of the begin tag. Now follows the content. Continue at step 6.
5. Otherwise what follows is an attribute. Parse that, store the result, and continue at step 3.
6. Read the content until you find a < character.
7. If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >, and if so, return the result. Otherwise, throw an error.
8. If you get here, you've found the beginning of a nested XML element. Parse that with this algorithm, and then continue at step 6.
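Here is a minimal sketch of those steps (all names made up; no entities, comments, processing instructions, CDATA, DTDs, or encoding handling, for exactly the reasons given in the next answer):

#include <cctype>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::string tag;
    std::vector<std::pair<std::string, std::string>> attrs;
    std::string text;  // character data, concatenated
    std::vector<std::unique_ptr<Node>> children;
};

struct Parser {
    const std::string& s;
    size_t i = 0;

    void skipWs() { while (i < s.size() && std::isspace((unsigned char)s[i])) ++i; }
    [[noreturn]] void fail(const char* m) { throw std::runtime_error(m); }

    std::string name() {
        size_t b = i;
        while (i < s.size() && !std::isspace((unsigned char)s[i]) &&
               s[i] != '>' && s[i] != '/' && s[i] != '=') ++i;
        if (b == i) fail("expected a name");
        return s.substr(b, i - b);
    }

    std::unique_ptr<Node> element() {
        skipWs();                                       // step 1: expect '<'
        if (i >= s.size() || s[i] != '<') fail("expected '<'");
        ++i;
        auto n = std::make_unique<Node>();
        n->tag = name();                                // step 2: tag name
        for (;;) {                                      // steps 3-5
            skipWs();
            if (i >= s.size()) fail("unexpected end of input");
            if (s[i] == '/') {                          // the <tag/> form
                if (++i >= s.size() || s[i++] != '>') fail("expected '>'");
                return n;
            }
            if (s[i] == '>') { ++i; break; }            // end of the begin tag
            std::string key = name();                   // an attribute
            std::string val;
            skipWs();
            if (i < s.size() && s[i] == '=') {
                ++i; skipWs();
                if (i >= s.size() || s[i] != '"') fail("expected '\"'");
                size_t b = ++i;
                while (i < s.size() && s[i] != '"') ++i;
                if (i >= s.size()) fail("unterminated attribute value");
                val = s.substr(b, i - b);
                ++i;
            }
            n->attrs.emplace_back(std::move(key), std::move(val));
        }
        for (;;) {                                      // steps 6-8: content
            size_t b = i;
            while (i < s.size() && s[i] != '<') ++i;    // step 6: text
            n->text += s.substr(b, i - b);
            if (i >= s.size()) fail("missing end tag");
            if (i + 1 < s.size() && s[i + 1] == '/') {  // step 7: end tag
                i += 2;
                if (name() != n->tag) fail("mismatched end tag");
                skipWs();
                if (i >= s.size() || s[i++] != '>') fail("expected '>'");
                return n;
            }
            n->children.push_back(element());           // step 8: recurse
        }
    }
};

// Usage: Parser p{xmlText}; std::unique_ptr<Node> root = p.element();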
Reading XML looks simple, but doing it correctly involves a few complexities you don't really want to deal with. Indeed, writing a simple XML parser effectively amounts to creating yet another XML library. I have done it, and an incomplete version of it is sitting somewhere on my disk. Even if you don't need to validate your XML structure:
whether you validate or not, you need to deal with entity references like &lt; and the variety of character entity references like &#65; and &#x41;
the plain body of an XML document is relatively simple, but the header is a major pain to deal with, in particular the DTD: there are two versions thereof which are slightly different, and you probably need to process the inline DTD
even the body isn't entirely trivial because of those annoying CDATA segments
even without validation you may need to support external entity references
the characters to be accepted and/or rejected for various parts of XML are also somewhat interesting
note that XML is defined in terms of Unicode, and proper handling of this isn't entirely trivial either: using just char or wchar_t doesn't cut it.
The first version I implemented was a nice little iterator intended to pop out all the elements encountered. This allowed the nice feature of easily stopping and continuing the parse at the iterator user's choice. Unfortunately, I didn't get it to fly when trying to cope with the various entity references. It would parse simple XML files nicely and fast, but there were some quirks in the specification I just didn't get right.
What worked best for me was creating a simple recursive descent parser combined with a suitable stack of buffers to deal somewhat transparently with entity references. However, to finish it completely I'd still need to deal with some encoding issues, and in the end I just had higher-priority projects to work on (in my spare time, that is).
In summary: it can be done, obviously, as others have done. But it is probably a somewhat pointless exercise unless you have a really bright idea that makes your implementation uniquely better suited than the alternatives.
The best and only approach is to re-implement such a library from scratch without using any other libraries...
You're welcome to use existing libraries like pugixml, for example. Its installation is as simple as adding its files to your project, and you can start using it right away. It's lightweight compared to validating parsers such as Xerces.
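For instance (the file name and element names here are made up):

#include <iostream>
#include "pugixml.hpp"

int main() {
    pugi::xml_document doc;
    pugi::xml_parse_result result = doc.load_file("books.xml");
    if (!result) {
        std::cerr << "parse error: " << result.description() << '\n';
        return 1;
    }
    // Walk the <book> children of the root <library> element.
    for (pugi::xml_node book : doc.child("library").children("book"))
        std::cout << book.attribute("title").value() << '\n';
}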

Create a safe, escaped path base/file name, check if safe

I wonder if there is a generic, portable way to produce filesystem-safe filenames. That is, I have a user-entered string and would like to produce a file whose name resembles the user's choice as closely as possible. The resulting name must not include any path reference or other file-system-specific special name or tag.
Currently I just replace a bunch of known bad characters with other characters, or empty strings. For example, given the name ABC / DEF* : A Company? I'd produce the string ABC - DEF - A Company. My choice of replacement characters is totally arbitrary, as I don't know of a generic escape symbol.
So my related questions are:
1. Is there a method (perhaps in boost::filesystem) that can tell me if a name refers strictly to a file without a path?
2. Is there a function that tells me if the name is "safe" to use as a file (this may be an additional check beyond #1 for some filesystems)?
3. Is there a function to convert a string into a reasonably safe name?
Additional Notes
For #1 I thought to just compare boost path::filename() to the original object; if they are the same, then I have a plain file name. However, this still allows things like '..' and '.', but that might be okay if there is a good solution for #2.
In theory I'd have to provide a directory in which the file would reside, since different file-systems may have different requirements. But a global solution for the OS would also be okay.
I already have a function that just replaces a bunch of commonly known unsafe characters.
Common file dialogs cannot be used to do the filtering, since the interface may not always allow them, and in some cases the user isn't directly aware of the relationship to the file (advanced users would be, however).
According to POSIX fully portable filenames, the only portable filenames are those that contain only characters from A-Z a-z 0-9 . _ - and are at most 14 characters long.
That said, a more practical approach is to assume that modern filesystems can cope with longer filenames and simply replace all characters not explicitly marked as "safe" with _. Sometimes, instead of replacing with _, those characters are hex-encoded, as in URLs: sample%20file.txt. KDE applications do this, for example.
As for implementation, it's as simple as s/[^A-Za-z0-9._-]/_/g.
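As a sketch in C++ (a hypothetical helper; extend the safe set and the reserved-name check as needed for your target filesystems):

#include <cctype>
#include <string>

// Replace everything outside a conservative safe set with '_'.
std::string sanitizeFilename(std::string name) {
    for (char& c : name)
        if (!std::isalnum((unsigned char)c) && c != '.' && c != '_' && c != '-')
            c = '_';
    // "." and ".." are directory references, not usable file names.
    if (name.empty() || name == "." || name == "..")
        name = "_";
    return name;
}

// sanitizeFilename("ABC / DEF* : A Company?") yields "ABC___DEF____A_Company_"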
How portable is portable? Many systems had limits on length, and some probably still do. Is distinguishing between names an issue? Some systems distinguish case, and others don't. What about a final .xxx? For some systems it is significant; for others, it's just text.
Length aside, the safest bet is to take the opposite approach: create a set of known safe characters, and convert everything outside of that set to a specific character. ASCII alphanumerics and '_' seem pretty safe, and you're probably OK (today) with '-', but I doubt the list goes much further. And depending on what you're doing with these names, you might want to force them to a single case, either upper or lower.