Reasons for differences due to locale - SAS

One of the 'features' of the SAS Stored Process server is that the locale setting can change according to the context of the client. In my case, the session may be en_gb or en_us depending on whether the same request was sent by Excel or Chrome.
This can cause different results for the same report, e.g. when using ANYDTDTE.-style informats. The quick fix is to set options locale=xxx; explicitly, but in order to gauge the significance of this it would be good to understand:
What are the main (non-cosmetic) ways in which the same program can give different results according to the session locale?

The major ways that locale affects a program are in the character set/encoding and the date/time defaults.
Character set (or encoding) is determined in part by the locale, and can make a huge difference: for example, if one session ends up with a single-byte encoding (as an en_US locale typically implies) while another uses UTF-8. Not only does SAS often default to the session encoding when reading files (if no encoding is specified explicitly in the program or in the file's header), but how SAS handles characters once they are read into a SAS dataset also changes. DBCS encodings store two bytes per character rather than one, and if the session is en_US but you expect UTF-8, you may not be able to handle some characters that do not transcode between the two.
Date defaults are also highly relevant. An en_GB session will likely read 10/8/2015 as August 10, 2015, while en_US reads it as October 8, 2015. This is a good reason to avoid ANYDTDTE. when you can, of course. You can avoid these issues by explicitly setting the DATESTYLE system option. You may also see differences in default output formats, such as the separator in DDMMYY10. or similar.
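The ambiguity itself is easy to reproduce outside SAS. A minimal standalone C++ sketch, where the two explicit format strings stand in for what DATESTYLE (or the locale default) decides for you in SAS:
    #include <ctime>
    #include <iomanip>
    #include <iostream>
    #include <sstream>

    // Parse the same string under two day/month conventions. In SAS,
    // the DATESTYLE system option plays the role of these explicit formats.
    int main() {
        const char* input = "10/8/2015";
        const char* const formats[] = { "%d/%m/%Y",   // en_GB-style: day first
                                        "%m/%d/%Y" }; // en_US-style: month first
        for (const char* fmt : formats) {
            std::tm tm{};
            std::istringstream ss(input);
            ss >> std::get_time(&tm, fmt);
            std::cout << fmt << " -> month " << tm.tm_mon + 1
                      << ", day " << tm.tm_mday << '\n';
        }
        // The same bytes mean August 10 under one convention
        // and October 8 under the other.
    }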
For the differences that are possible due to locale, see the documentation for the LOCALE system option, which mentions four settings:
DATESTYLE
DFLANG (similar to DATESTYLE, affects how dates are read)
ENCODING
PAPERSIZE
Also, TRANTAB is set as part of setting ENCODING.

Related

MFC CEdit converts non-ascii characters to ascii

We have an MFC Windows Application, written originally in VC++ 6 and over the years updated for newer IDE, currently developed in VS2017.
The application is built with MBCS (not Unicode). Trying to switch to Unicode causes 3806 compile errors, and that is probably just the tip of the iceberg.
However, we want to be able to run the application with a different code page, i.e. 1250 (Central European).
I tried to build a small test application, and managed to get it to work with special characters (čćšđž). I did this by setting dialog font to Microsoft Sans Serif with code page 1250.
The same approach in our application does not work. Note: dialogs in our application are created dynamically, and font is set using SetFont.
There is a difference how the special characters are treated in these two applications.
In the test application, the special characters are displayed in the edit control, and GetWindowText retrieves the right bytes. However, trying to type some characters from other languages renders them as "????".
In our application, all special characters are rendered properly, but GetWindowText (or WM_GETTEXT) converts the special characters to their closest ASCII counterparts (čćđ -> ccd).
I believe that the edit control in our application displays Unicode text, but GetWindowText converts it to ASCII.
Does anyone have any idea what is happening here, and how I might solve it?
Note: I know how to convert the project to Unicode. We are choosing not to commit resources to it at the moment, as it would probably take weeks or months to implement. The question is how I might get it to work with MBCS, and why the edit control is converting Č to C.
I believe it is absolutely possible to port the application to other languages/codepages: you only need to modify the .rc (resource) files, basically having one resource file for each language. You may well want to do that anyway, as strings in menus and/or string tables would be in a different language. This is actually the only change needed, as far as the application part is concerned.
The other part is the system you are running it on. A window can be Unicode or non-Unicode. You can see this with the Spy++ utility: it tells you whether a window (procedure) is Unicode or not (Window Properties, General tab). While Unicode windows work properly, non-Unicode ones have to convert between Unicode and MBCS when getting or setting the text. The conversion is based on the system (default) code page, which can only be set globally (for the whole machine), not per application or window. And of course, setting the font's code page is not enough (in my opinion it's not needed at all, if you are running the application on a machine with the "correct" code page). That is, for non-Unicode applications, only one code page will work properly; the others won't.
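If Spy++ is not at hand, the same information is available programmatically. A tiny sketch (hwnd is assumed to be a valid handle to one of your application's windows):
    #include <windows.h>
    #include <cstdio>

    // Report whether a window (procedure) is Unicode or ANSI -
    // the same information Spy++ shows on the General tab.
    void ReportWindowCharset(HWND hwnd) {
        std::printf("Window %p is %s\n",
                    static_cast<void*>(hwnd),
                    IsWindowUnicode(hwnd) ? "Unicode" : "ANSI (non-Unicode)");
    }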
I can see two options:
If you only need to update a small number of controls, it may be possible to change only these controls to Unicode, and use the "wide" versions of the get/set window-text functions or messages; you will have to convert the text between Unicode and your desired code page (a sketch follows after the two options). It requires writing some code, but has the advantage that the conversion is independent of the system default code page, e.g. you can have the code page in a configuration file, in the registry, or as a command-line option (in the application's shortcut). Some control types can be changed to Unicode, some others cannot, so please check the documentation. I used this technique successfully in an MBCS application displaying/editing translated strings in many different languages, but I only had one control, a List-View, which by the way offers the LVM_SETUNICODEFORMAT message, thus allowing for Unicode text even in an MBCS application.
The easiest method is simply to run the application as is, but it will only work on machines with the proper default code page, as most non-Unicode applications do.
The system default code page can be changed by setting the "Language for non-Unicode programs" option, available in the regional settings, Administrative tab; this requires a reboot. Changing the Windows UI language will change this option as well, but by setting this option directly you don't need to change the UI language, e.g. you can have an English UI and an East-European code page.
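For option 1 above, a hedged sketch of what the round trip could look like: read the control's text with the wide API and convert it to a code page chosen by the application rather than by the system (1250 is only an example; error handling is trimmed for brevity):
    #include <windows.h>
    #include <string>
    #include <vector>

    // Fetch a control's text via the wide API and convert it to an
    // application-chosen code page instead of the system default.
    std::string GetControlTextInCodePage(HWND hwnd, UINT codePage /* e.g. 1250 */) {
        int wideLen = GetWindowTextLengthW(hwnd);
        std::wstring wide(wideLen + 1, L'\0');
        GetWindowTextW(hwnd, &wide[0], wideLen + 1);

        int bytes = WideCharToMultiByte(codePage, 0, wide.c_str(), -1,
                                        nullptr, 0, nullptr, nullptr);
        if (bytes <= 0) return std::string();
        std::string narrow(bytes, '\0');
        WideCharToMultiByte(codePage, 0, wide.c_str(), -1,
                            &narrow[0], bytes, nullptr, nullptr);
        narrow.resize(bytes - 1);  // drop the terminating NUL
        return narrow;
    }

    // The reverse direction: decode from the chosen code page and set
    // the control's text via the wide API.
    void SetControlTextFromCodePage(HWND hwnd, const std::string& text, UINT codePage) {
        int wideLen = MultiByteToWideChar(codePage, 0, text.c_str(), -1, nullptr, 0);
        std::vector<wchar_t> wide(wideLen);
        MultiByteToWideChar(codePage, 0, text.c_str(), -1, wide.data(), wideLen);
        SetWindowTextW(hwnd, wide.data());
    }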
See a very similar post here.
Late to the party:
In our application, all special characters are rendered properly, but GetWindowText (or WM_GETTEXT) converts the special characters to their closest ASCII counterparts (čćđ -> ccd).
That sounds like the ES_OEMCONVERT flag has been set for the control:
Converts text entered in the edit control. The text is converted from the Windows character set to the OEM character set and then back to the Windows character set. This ensures proper character conversion when the application calls the CharToOem function to convert a Windows string in the edit control to OEM characters. This style is most useful for edit controls that contain file names that will be used on file systems that do not support Unicode.
To change this style after the control has been created, use SetWindowLong.
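A minimal sketch of doing exactly that (hEdit is assumed to be the handle of the affected edit control):
    #include <windows.h>

    // If the edit control was created with ES_OEMCONVERT, remove the style
    // so the text is no longer round-tripped through the OEM code page
    // (that round trip is what folds characters like č/ć into c/c).
    void ClearOemConvert(HWND hEdit) {
        LONG style = GetWindowLong(hEdit, GWL_STYLE);
        if (style & ES_OEMCONVERT)
            SetWindowLong(hEdit, GWL_STYLE, style & ~ES_OEMCONVERT);
    }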

Visual C++/MFC: getting Japanese characters to work without UNICODE

I have software originally developed 20 years ago in Visual C++ using MFC without UNICODE. Currently strings are held either in char[] or CString, and it works on English and Japanese Windows PCs until Japanese characters are used, as these tend to get converted to strange characters or empty boxes.
Setting UNICODE is presumably the way forward but will require a massive code change, whereas quite a lot seems to work simply by setting the System Locale to Japan (in Windows' “Language for non-Unicode programs” setting). I have no idea how Windows does this, but some Japanese character things now work on my English Windows PC, e.g. I can open and save Japanese filenames with no code changes. And in Japan they set the System Locale to English and again much works, but not everything.
I get the impression the problems are due to using a font that doesn't include Japanese characters. Currently I am using Arial / MS Sans Serif with the charset set to ANSI_CHARSET or DEFAULT_CHARSET. Is there a different font I should be using, or can I extend these fonts to include Japanese characters? Or am I barking up the wrong tree, in which case what do I do next? I am very new to all this, unfortunately…
That's a common question (OK I guess not so common any more in 2015, as MBCS programs luckily are a dying breed - I still maintain several though...)
Either way, I'm afraid that, depending on your definition of 'working', to get this working you'll have to bite the bullet and convert to a Unicode build. If you can't make a business case for that, then you'll have to set the right locale (well, worse, have the user set the 'right' one) and test what works and what doesn't, and ask more specific questions on what doesn't.
If your goal is to make one application that correctly displays strings in various encodings in the 'right' way regardless of the locale settings on the computer, and compatible with every input data set / database content without the user having to be aware of encoding issues, then you're out of luck with an MBCS build.
Missing characters in the font are most likely not the problem. Before you go any further and/or ask further questions, you should read http://www.joelonsoftware.com/articles/Unicode.html, read it again, sleep on it, read it again, and explain to somebody else what the relationship is between 'encoding', 'locale', 'character set', 'font' and 'Unicode code point', because only after you can do that can you decide how to progress with your application. Sorry, it's not what you want to hear, but it's the reality if you've been tasked with handling internationalization.

How to correctly display characters from different languages?

I am finishing an application in Visual C++/Windows API and I am using the MySQL C Connector.
The whole application code uses ANSI, and the MySQL C Connector is in ANSI too.
This program will be used on Polish and German computers with Windows XP/Vista/7 or 8.
I want to correctly display German umlauts and Polish accented characters in:
DialogBox controls (strings are loaded from language files)
Generated XHTML documents
Strings retrieved from MySql database displayed on controls and in XHTML documents
I have heard about MultiByteToWideChar and Unicode functions (MessageBoxW etc.), but the application code is nearly finished, and converting would be a lot of work...
How can I get the character encoding right with the least work and time?
Maybe by changing the system code page for non-Unicode programs?
First, of course: what code set is MySQL returning? Or perhaps: what code set was used when writing the data into the database? Other than that, I don't think you'll be able to avoid using either wide characters or multibyte characters: for single-byte characters, German would use ISO 8859-1 (code page 1252) or ISO 8859-15, and Polish ISO 8859-2 (code page 1250). But what are you doing with the characters in your own code? You may be able to get away with UTF-8 (code page 65001) without many changes. The real question is where the characters originally come from (although it might not be too difficult to translate them into UTF-8 immediately at the source); I don't think that Windows respects the code page for input.
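If translating to UTF-8 at the source is the route taken, the Win32 conversion step could look like this hedged sketch (code page 1250 is just an example for the Polish case; error handling omitted):
    #include <windows.h>
    #include <string>

    // Convert text from a legacy single-byte code page (e.g. 1250 for
    // Polish, 1252 for German) to UTF-8, going through UTF-16.
    std::string ToUtf8(const std::string& legacy, UINT sourceCodePage) {
        // Legacy bytes -> UTF-16.
        int wideLen = MultiByteToWideChar(sourceCodePage, 0,
                                          legacy.c_str(), -1, nullptr, 0);
        std::wstring wide(wideLen, L'\0');
        MultiByteToWideChar(sourceCodePage, 0, legacy.c_str(), -1,
                            &wide[0], wideLen);

        // UTF-16 -> UTF-8.
        int utf8Len = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                                          nullptr, 0, nullptr, nullptr);
        std::string utf8(utf8Len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                            &utf8[0], utf8Len, nullptr, nullptr);
        utf8.resize(utf8Len - 1);  // drop the terminating NUL
        return utf8;
    }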
Although it doesn't help you much to know it, you're dealing with an almost impossible problem, since so much depends on things outside your program: things like the encoding of the display font, or the keyboard driver, for example. In fact, it's not rare for programs to display one thing on the screen and something different when outputting to the printer, or to display one thing on the screen but something different if the data is written to a file and read with another program. The situation is improving: modern Unix and the Internet are gradually (very gradually) standardizing on UTF-8, everywhere and for everything, and Windows normally uses UTF-16 for everything that is pure Windows (but needs to support UTF-8 for the Internet). But even using the platform standard won't help if the human client has installed (and is using) fonts which don't have the characters you need.

Language agnostic cookie encoding / decoding standards

I'm having difficulties figuring out what the standard is (or whether there is one) for encoding/decoding cookie values regardless of backend platform.
According to RFC 2109:
The VALUE is opaque to the user agent and may be anything the origin server chooses to send, possibly in a server-selected printable ASCII encoding. "Opaque" implies that the content is of interest and relevance only to the origin server. The content may, in fact, be readable by anyone that examines the Set-Cookie header.
which sounds like "the server is the boss" and it decides whatever encoding applies. This makes it quite difficult to set a cookie from, say, a PHP backend and read it from Python or Java or whatever, without writing manual encode/decode handling on both sides.
Let's say we have a value that needs to be encoded. The Russian /"печенье (*} значения"/ means "cookie value", with some additional non-alphanumeric characters in it.
Python:
Almost every WSGI server does the same and uses Python's SimpleCookie class, which encodes to / decodes from octal literals, even though many say that octal literals are deprecated in ECMA-262 strict mode. Wtf?
So, our raw cookie value becomes "/\"\320\277\320\265\321\207\320\265\320\275\321\214\320\265 (*} \320\267\320\275\320\260\321\207\320\265\320\275\320\270\321\217\"/"
Node.js:
I haven't tested it at all, but I'm guessing a JavaScript backend would do it with the native encodeURIComponent and decodeURIComponent functions, which use hexadecimal escaping/unescaping?
PHP:
PHP applies urlencode to cookie values, which is similar to encodeURIComponent but not exactly the same.
So the raw value becomes %2F%22%D0%BF%D0%B5%D1%87%D0%B5%D0%BD%D1%8C%D0%B5+%28%2A%7D+%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D1%8F%22%2F, which is not even wrapped in double quotes.
However, if the JavaScript value variable holds the PHP-encoded value above, decodeURIComponent(value) gives /"печенье+(*}+значения"/; note the "+" characters instead of spaces.
What is the situation in Java, Ruby, Perl and .NET? Which language follows (or comes closest to) the desired behaviour? Actually, is there any standard for this defined by the W3C?
I think you've got things a bit mixed up here. The server's encoding does not matter to the client, and it shouldn't. That is what RFC 2109 is trying to say here.
The concept of cookies in HTTP is similar to this in real life: upon paying the entrance fee to a club you get an ink stamp on your wrist. This allows you to leave and reenter the club without paying again. All you have to do is show your wrist to the bouncer. In this real life example, you don't care what it looks like; it might even be invisible in normal light. All that is important is that the bouncer recognises the thing. If you were to wash it off, you would lose the privilege of reentering the club without paying again.
In HTTP the same thing is happening. The server sets a cookie with the browser. When the browser comes back to the server (read: the next HTTP request), it shows the cookie to the server. The server recognises the cookie, and acts accordingly. Such a cookie could be something as simple as a "WasHereBefore" marker. Again, it's not important that the browser understands what it is. If you delete your cookie, the server will just act as if it has never seen you before, just like the bouncer in that club would if you washed off that ink stamp.
Today, a lot of cookies store just one important piece of information: a session identifier. Everything else is stored server-side and associated with that session identifier. The advantage of this system is that the actual data never leaves the server and as such can be trusted. Everything that is stored client-side can be tampered with and shouldn't be trusted.
Edit: After reading your comment and reading your question yet again, I think I finally understood your situation, and why you're interested in the cookie's actual encoding rather than just leaving it to your programming language: If you have two different software environments on the same server (e.g.: Perl and PHP), you may want to decode a cookie that was set by the other language. In the above example, PHP has to decode the Perl cookie or vice versa.
There is no standard for how data is stored in a cookie. The standard only says that a browser will send the cookie back exactly as it was received. The encoding scheme used is whatever your programming language sees fit.
Going back to the real life example, you now have two bouncers, one speaking English, the other speaking Russian. The two will have to agree on one type of ink stamp. More likely than not, this will involve at least one of them learning the other's language.
Since the browser behaviour is standardized, you can either imitate one language's encoding scheme in all the other languages used on your server, or simply create your own standardized encoding scheme in all the languages being used. You may have to use lower-level routines, such as PHP's header(), instead of higher-level routines, such as session_start(), to achieve this.
BTW: In the same manner, it is the server side programming language that decides how to store server side session data. You cannot access Perl's CGI::Session by using PHP's $_SESSION array.
Regardless of the cookie being opaque to the client, it still needs to conform to the HTTP spec. RFC 2616 specifies that all HTTP headers should be ASCII (ISO-8859-1). RFC 5987 extends that to support other character sets, but I don't know how widely supported it is.
I prefer to encode into UTF8 and wrap with base64 encoding. It's fast, ubiquitous, and will never mangle your data at either end.
You will need to ensure an explicit conversion into UTF8 even when wrapping it. Other languages & runtimes, while supporting Unicode, may not store strings as UTF8 internally... like many Windows APIs. Python 2.x, in my experience, rarely gets Unicode strings right without explicit conversion.
ENCODE: nativeString -> utfEncode() -> base64Encode()
DECODE: base64Decode() -> utfDecode() -> nativeString
Almost every language I know of, these days, supports this. You can look for a universal single-function encode, but I err on the side of caution and choose the two-step approach... especially with foreign character sets.
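A hedged, self-contained C++ sketch of that two-step pipeline. Base64Encode/Base64Decode are hypothetical helper names, the value is assumed to already hold UTF-8 bytes (the explicit conversion into UTF-8 is assumed to have happened upstream), and this is a minimal base64, not a production-hardened one:
    #include <cstring>
    #include <iostream>
    #include <string>

    // Minimal base64 wrap/unwrap for cookie values.
    static const char kB64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string Base64Encode(const std::string& in) {
        std::string out;
        unsigned val = 0;
        int bits = 0;
        for (unsigned char c : in) {
            val = (val << 8) | c;
            bits += 8;
            while (bits >= 6) {
                bits -= 6;
                out += kB64[(val >> bits) & 0x3F];
            }
        }
        if (bits > 0)                        // flush the leftover bits
            out += kB64[(val << (6 - bits)) & 0x3F];
        while (out.size() % 4) out += '=';   // pad to a multiple of 4
        return out;
    }

    std::string Base64Decode(const std::string& in) {
        std::string out;
        unsigned val = 0;
        int bits = 0;
        for (char c : in) {
            if (c == '=') break;
            const char* p = std::strchr(kB64, c);
            if (!p || *p == '\0') continue;  // skip non-alphabet bytes
            val = (val << 6) | static_cast<unsigned>(p - kB64);
            bits += 6;
            if (bits >= 8) {
                bits -= 8;
                out += static_cast<char>((val >> bits) & 0xFF);
            }
        }
        return out;
    }

    int main() {
        // In practice this would hold the UTF-8 bytes of the Russian
        // example value; a plain ASCII stand-in keeps the sketch portable.
        std::string value = "/\"cookie value (*} ...\"/";
        std::string wire = Base64Encode(value);  // avoids the troublesome chars
        std::cout << wire << '\n'
                  << (Base64Decode(wire) == value ? "round trip OK" : "mismatch")
                  << '\n';
    }
Because every backend only has to agree on "UTF-8, then base64", this works as the shared scheme across PHP, Python, Java and the rest.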

What is the native narrow string encoding on Windows?

The Subversion API has a number of functions for converting from "natively-encoded" strings to strings that are encoded in UTF-8. My question is: what is this native encoding on Windows? Does it depend on locale?
"Natively encoded" strings are strings written in whatever code page the user is using. That is, they are numbers that are translated to the appropriate glyphs based on the correct code page. Assuming the file was saved that way and not as a UTF-8 file.
This is a candidate question for Joel's article on Unicode.
Specifically:
Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.
Windows 1252. Jukka Korpela has an excellent page on character encodings, with an extensive discussion of the Windows character set.
From the header svn_string.h you can see that the relevant svn_strings are just plain old const char* + a length element.
I would guess that the "natively encoded" svn strings are interpreted according to your system locale (I do not know this for sure, but this is the convention). On Windows 7 you can check your locale by selecting "Start-->Control Panel-->Region and Language-->Administrative-->Change system locale" where any value of English would probably entail the character encoding Windows 1252. However, a different system locale, for example Hebrew (Israel), would entail a different character encoding (Windows 1255 for the case of Hebrew).
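That code page can also be checked programmatically; a tiny sketch using the Win32 GetACP call:
    #include <windows.h>
    #include <cstdio>

    // Print the active ANSI code page - the one "natively encoded"
    // narrow strings are interpreted with (1252 on most English
    // systems, 1255 when the system locale is Hebrew, and so on).
    int main() {
        std::printf("ANSI code page: %u\n", GetACP());
    }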
Sadly, the MSVC version of the C library does not support UTF-8 and uses legacy code pages only, but Cygwin provides a UTF-8 locale as part of its emulation layer. If your svn is built on Cygwin, you should be able to use UTF-8 just fine.