One of the 'features' of the SAS Stored Process server is that the locale setting can change according to the context of the client. In my case, the session may be en_gb or en_us depending on whether the same request was sent by Excel or Chrome.
This can cause different results for the same report, e.g. when using ANYDTDTE.-style informats. The quick fix is to explicitly set options locale=xxx;, but to gauge how significant this is, it would be good to understand:
What are the main (non-cosmetic) ways in which the same program can give different results according to the session locale?
The major ways that locale affects a program are in the character set/encoding and the date/time defaults.
Character set (or encoding) is determined in part by the locale, and can make a huge difference if, for example, one session ends up with a single-byte encoding (as is typical for en-us) and the other with utf8. Not only will SAS often default to the session encoding when reading files (if the encoding is not specified explicitly in the program or in the file's header), but the way SAS handles characters once they are in a SAS dataset also changes. Multi-byte encodings (DBCS, or utf8) can need more than one byte of storage per character, and if the session is single-byte while you expect utf8, you may not be able to handle characters that do not transcode between the two.
Date defaults are also highly relevant. en-gb likely assumes 10/8/2015 is August 10, 2015, while en-us assumes it is October 8, 2015. This is a good reason not to use anydtdte. when you can avoid it, of course. You can avoid issues with this by explicitly setting the DATESTYLE system option. You may also have some differences in default output formats, such as the separator in ddmmyy10. or similar.
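To make the ambiguity concrete, here is a minimal sketch in Python (not SAS, purely as an illustration) that reads the same string both ways. The two format strings stand in for what a day-first and a month-first session would assume; they are my assumption, not something SAS exposes under these names.

```python
from datetime import datetime

raw = "10/8/2015"

# Day-first reading, as a British-style session would typically assume.
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

# Month-first reading, as a US-style session would typically assume.
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()

print(as_day_first)    # 2015-08-10
print(as_month_first)  # 2015-10-08
```

Same input, two valid dates two months apart, and no error in either case. That is why pinning the interpretation explicitly is safer than relying on the session default.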
To see the differences that are possible due to locale, see the documentation for the LOCALE system option. This mentions four settings:
DATESTYLE
DFLANG (similar to DATESTYLE, affects how dates are read)
ENCODING
PAPERSIZE
Also, TRANTAB is set as part of setting ENCODING.
I am trying to take advantage of the regex functionality \p{UNICODE PROPERTY NAME}.
However, I am struggling to understand the mapping of those property names.
I went directly to the Unicode.org website (http://www.unicode.org/Public/UCD/latest/ucd/) and downloaded the file 'UnicodeData.txt', which has the category listed... but this only shows 27,268 character values.
But I understand there are 65k characters in utf-8 or ucs-2, so I am confused why the Unicode.org download only has about 27k rows.
Am I missing a point here somewhere?
I am sure I'm just being blind to something simple here. If someone can help me understand, I'd be grateful!
Everything is fine so far. The characters you see are all but the CJK ones (Chinese-Japanese-Korean). The Unicode consortium left those out of the main UnicodeData file to keep it at a reasonable size; large blocks such as the CJK ideographs appear there only as a pair of range marker lines (a "First" and a "Last" entry) rather than as one row per character.
If you want to look up properties for single characters only (and not in bulk), you can use websites that prepare that data for you, like Graphemica, FileFormat or (my own) Codepoints.net.
If, however, you need bulk lookups, Unicode also provides the data as an XML file with a specific syntax that groups codepoints together. That might be the best choice for processing the data.
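As a rough illustration of why the row count is lower than the number of characters, here is a small Python sketch that counts code points from UnicodeData.txt-style lines, expanding the First/Last range markers. The three sample lines are hand-copied in the file's semicolon-separated layout and the end of the CJK range differs between Unicode versions, so treat them as illustrative data; count_code_points is a helper name I made up.

```python
import unicodedata

# Illustrative sample in UnicodeData.txt layout (not the full file).
SAMPLE_LINES = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]

def count_code_points(lines):
    """Count code points, treating each First/Last pair as a full range."""
    total = 0
    range_start = None
    for line in lines:
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            range_start = code_point
        elif name.endswith(", Last>"):
            total += code_point - range_start + 1
            range_start = None
        else:
            total += 1
    return total

print(len(SAMPLE_LINES))                # 3 rows in the file...
print(count_code_points(SAMPLE_LINES))  # ...covering 20993 code points

# Python ships its own copy of the database for single lookups.
print(unicodedata.category("中"))        # Lo
```

So three rows can stand for about twenty thousand characters, which is the effect you are seeing in the real file.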
I am starting work on a new piece of software that will end up needing some robust and expandable file IO. There are a lot of formats out there. XML, JSON, INI, etc. However, there are always plusses and minuses so I thought I would ask for some community input.
Here are some rough requirements:
The format is a "standard"...I don't want to reinvent the wheel if I don't have to. It doesn't have to be a formal IEEE standard, but something you could Google and get some information on as a new user, may have some support tools (editors) beyond vi. (Though the software users will generally be computer savvy and happy to use vi.)
Easily integrates with C++. I don't want to have to pull along a 100 MB library and three different compilers to get it up and running.
Supports tabular input (2d, n-dimensional)
Supports POD types
Can expand as more inputs are required, binds well to variables, etc.
Parsing speed is not terribly important
Ideally, as easy to write (reflect) as it is to read
Works well on Windows and Linux
Supports compositing (one file referencing another file to read, and so on.)
Human Readable
In a perfect world, I would use a header-only library or some clean STL implementation, but I'm fine with leveraging Boost or some small external library if it works well.
So, what are your thoughts on various formats? Drawbacks? Advantages?
Edit
Options to consider? Anything else to add?
XML
YAML
SQLite
Google Protocol Buffers
Boost Serialization
INI
JSON
There is one excellent format that meets all your criteria:
SQLite!
Please read the article about using SQLite as an application file format. Also, please watch the Google Tech Talk by D. Richard Hipp (the SQLite author) about this very topic.
Now, let's see how SQLite meets your requirements:
The format is a "standard"
SQLite has become the format of choice for most mobile environments, and for many desktop apps (Firefox, Thunderbird, Google Chrome, Adobe Reader, you name it).
Easily integrates with C++
SQLite has a standard C interface, which is only one source file and one header file. There are C++ wrappers too.
Supports tabular input (2d, n-dimensional)
An SQLite table is as tabular as you could possibly imagine. To represent, say, 3-dimensional data, create a table with columns x,y,z,value and store your data as a set of rows like this:
x1,y1,z1,value1
x2,y2,z2,value2
...
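A minimal sketch of that layout, using Python's built-in sqlite3 module only to keep it short (the same SQL works through the C interface). The table name grid and the sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk

# One row per cell; the coordinates together form the key.
conn.execute(
    "CREATE TABLE grid (x INTEGER, y INTEGER, z INTEGER, value REAL, "
    "PRIMARY KEY (x, y, z))"
)

# Fill a 2x2x2 block with made-up values.
cells = [
    (x, y, z, float(x + 10 * y + 100 * z))
    for x in range(2) for y in range(2) for z in range(2)
]
conn.executemany("INSERT INTO grid VALUES (?, ?, ?, ?)", cells)
conn.commit()

# Look up a single cell by its coordinates.
row = conn.execute(
    "SELECT value FROM grid WHERE x = ? AND y = ? AND z = ?", (1, 0, 1)
).fetchone()
print(row[0])  # 101.0
```

Adding a fourth dimension is just another column, which is why this shape scales to n dimensions without changing the approach.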
Supports POD types
I assume by POD you meant Plain Old Data, or BLOB. SQLite lets you store BLOB fields as is.
Can expand as more inputs are required, binds well to variables
This is where it really shines.
Parsing speed is not terribly important
But SQLite speed is superb. In fact, parsing is basically transparent.
Ideally, as easy to write (reflect) as it is to read
Just use INSERT to write and SELECT to read - what could be easier?
Works well on Windows and Linux
You bet, and all other platforms as well.
Supports compositing (one file referencing another file to read)
You can ATTACH one database to another.
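For example, a rough sketch with Python's sqlite3 module. The file names, the lookup alias and the materials table are all hypothetical.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    main_path = os.path.join(tmp, "main.db")
    shared_path = os.path.join(tmp, "shared.db")

    # Build the file that will be referenced from the main one.
    shared = sqlite3.connect(shared_path)
    shared.execute("CREATE TABLE materials (name TEXT, density REAL)")
    shared.execute("INSERT INTO materials VALUES (?, ?)", ("steel", 7850.0))
    shared.commit()
    shared.close()

    # Open the main file and pull the other one in under an alias.
    conn = sqlite3.connect(main_path)
    conn.execute("ATTACH DATABASE ? AS lookup", (shared_path,))
    rows = conn.execute("SELECT name, density FROM lookup.materials").fetchall()
    conn.close()

print(rows)  # [('steel', 7850.0)]
```

Note that the reference lives in your application logic (or in a table of paths you store yourself), not in the file format: SQLite does not remember attachments between connections.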
Human Readable
Not in binary, but there are many excellent SQLite browsers/editors out there. I like SQLite Expert Personal on Windows and sqliteman on Linux. There is also an SQLite editor plugin for Firefox.
There are other advantages that SQLite gives you for free:
Data is indexable, which makes it very fast to search. You just cannot do this using XML, JSON or any other text-only format.
Data can be edited partially, even when the amount of data is very large. You do not have to rewrite a few gigabytes just to edit one value.
SQLite is fully transactional: it guarantees that your data is consistent at all times. Even if your application (or the whole computer) crashes, your data will be automatically restored to the last known consistent state on the next attempt to connect to the database.
SQLite stores your data verbatim: you do not need to worry about escaping junk characters in your data (including zero bytes embedded in your strings) - simply always use prepared statements, that's all it takes to make it transparent. This can be a big and annoying problem when dealing with text data formats, XML in particular.
SQLite stores all strings in Unicode: UTF-8 (default) or UTF-16. In other words, you do not need to worry about text encodings or international support for your data format.
SQLite allows you to process data in small chunks (row by row in fact), thus it works well in low memory conditions. This can be a problem for any text-based format, because often they need to load all text into memory to parse it. Granted, there are a few efficient stream-based XML parsers out there, but in general any XML parser will be quite memory greedy compared to SQLite.
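To illustrate the "stores your data verbatim" point from the list above, here is a small sketch with Python's sqlite3 module, where the ? placeholders play the role of prepared-statement parameters. The notes table and the awkward sample values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT, payload BLOB)")

# Text that would need escaping in XML, JSON or hand-built SQL.
tricky_text = "O'Reilly said: <tag attr=\"x\"> & done; DROP TABLE notes;--"
# Binary data with embedded zero bytes.
tricky_blob = b"\x00\x01binary\x00tail"

# Values are bound as parameters, never pasted into the SQL text.
conn.execute("INSERT INTO notes VALUES (?, ?)", (tricky_text, tricky_blob))
conn.commit()

body, payload = conn.execute("SELECT body, payload FROM notes").fetchone()
print(body == tricky_text)     # True
print(payload == tricky_blob)  # True
```

Both values come back byte for byte, with no escaping step on the way in or out.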
Having worked quite a bit with both XML and json, here's my rather subjective opinion of both as extendable serialization formats:
The format is a "standard": Yes for both
Easily integrates with C++: Yes for both. In each case you'll probably wind up with some kind of library to handle it. On Linux, libxml2 is a standard, and libxml++ is a C++ wrapper for it; you should be able to get both of those from your distro's package manager. It will take some small effort to get those working on Windows. There appears to be some support in Boost for json, but I haven't used it; I've always dealt with json using libraries. Really, the library route is not very onerous for either.
Supports tabular input (2d, n-dimensional): Yes for both
Supports POD types: Yes for both
Can expand as more inputs are required: Yes for both - that's one big advantage to both of them.
Binds well to variables: If what you mean is some way inside the file itself to say "This piece of data must be automatically deserialized into this variable in my program", then no for both.
As easy to write (reflect) as it is to read: Depends on the library you use, but in my experience yes for both. (You can actually do a tolerable job of writing json using printf().)
Works well on Windows and Linux: Yes for both, and ditto Mac OS X for that matter.
Supports one file referencing another file to read: If you mean something akin to a C #include, then XML has some ability to do this (e.g. document entities), while json doesn't.
Human readable: Both are typically written in UTF-8, and permit line breaks and indentation, and thus can be human-readable. However, I've just been working with a 479 KB XML file that's all on one line, so I had to run it through a prettyprinter to make sense of it. json can also be pretty unreadable, but in my experience is often formatted better than XML.
When starting new projects, I generally prefer json; it's more compact and more human-readable. The main reason I might select XML over json would be if I were worried about receiving badly-formed documents, since XML supports automated document format validation, while you have to write your own validation code with json.
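To give a feel for what "write your own validation code" means in practice, here is a minimal sketch in Python. The columns/rows layout is just one hypothetical way to hold tabular data in json, and validate_table is a name I made up.

```python
import json

def validate_table(doc):
    """Hand-written checks standing in for what an XML schema would automate."""
    if not isinstance(doc, dict):
        raise ValueError("top level must be an object")
    columns = doc.get("columns")
    rows = doc.get("rows")
    if not isinstance(columns, list) or not all(isinstance(c, str) for c in columns):
        raise ValueError("'columns' must be a list of strings")
    if not isinstance(rows, list):
        raise ValueError("'rows' must be a list")
    for i, r in enumerate(rows):
        if not isinstance(r, list) or len(r) != len(columns):
            raise ValueError(f"row {i} must have {len(columns)} entries")
    return doc

# Write a small table, then read it back and validate it.
text = json.dumps(
    {"columns": ["x", "y", "value"], "rows": [[0, 0, 1.5], [0, 1, 2.5]]},
    indent=2,
)
table = validate_table(json.loads(text))
print(table["rows"][1][2])  # 2.5
```

Every new field means another hand-written check, which is the maintenance cost I had in mind when comparing this with XML's automated validation.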
Check out Google Protocol Buffers. They handle most of your requirements.
From their documentation, the high level steps are:
Define message formats in a .proto file.
Use the protocol buffer compiler.
Use the C++ protocol buffer API to write and read messages.
For my purposes, I think the way to go is XML.
The format is a standard, but allows for modification and flexibility for the schema to change as the program requirements evolve.
There are several library options. Some are larger (Xerces-C) some are smaller (ezxml), but there are many options, so we won't be locked in to a single provider or very specific solution.
It can support tabular input (2d, n-dimensional). This requires more parsing work on "our" end, and is likely the weakest point for XML.
Supports POD types: Absolutely.
Can expand as more inputs are required, binds well to variables, etc. through schema modifications and parser modifications.
Parsing speed is not terribly important, so processing a text file or files is not an issue.
XML can be programmatically written just as easily as read.
Works well on Windows and Linux or any other OS that supports C and text files.
Supports compositing (one file referencing another file to read, and so on.)
Human Readable with many text editors (Sublime, vi, etc.) supporting syntax highlighting out of the box. Many web browsers display the data well.
Thanks for all the great feedback! I think if we wanted a purely binary solution, Protocol Buffers or boost::serialization is likely the way that we would go.
Does anyone know if there is a free open-source solution to convert KORMARC (Korean MARC) into MARC21 (aka USMARC)?
While I'm not certain it has KORMARC support, you may want to try USEMARCON if you can find a mapping. From the USEMARCON page:
USEMARCON facilitates the conversion of catalogue records from one MARC format to another e.g. from UKMARC to UNIMARC. The software was designed as a toolbox-style application, allowing users with detailed knowledge of the source and target MARC formats to develop rules governing the behaviour of the conversion. Rules files may be supplemented by additional tables for more accurate conversion of MARC-specific character sets or coded information. The tables and rules files are simple ASCII text files and can be created using any standard text editor such as MS Windows Notepad.
Also, this thread from the Ask a Korean Studies Librarian Google Group might be useful, particularly the following message:
Library of Congress once tried to download records from the National
Library of Korea (NLK) to use as order records. LC wrote a
specification and developed a in-house program to convert KORMARC to
USMARC. Since NLK records only provide script, LC used a
transliterator to provide romanization for Voyager system developed by
non-LC programmer. The feedback of this method is not very positive
by LC staff. ... In stead of converting KORMARC to USMARC, a few research libraries
including LC is currently using MarcEdit with Excel spreadsheets which
are provided by Korean vendors based on contract. Vendors provide
both Korean script and romanization for several elements of MARC
fields (ISBN, title, author, publisher, place, series, etc.) in
different columns of spreadsheet for your order items. It sounds a
lot simpler to set up initially. And once MarcEdit is set up
properly, it creates MARC records.
I want to keep my replacement strings (German, French, etc) in a file format that is standard-ish and useable across Windows and Linux platforms. Thus VC++ resource files are ruled out right away.
What file format do others prefer to use for keeping these l10n resources? Two more features I'd like the format to support are:
the "key" for indexing l10n strings is itself an English string, rather than an enum.
the format can carry a message digest, so I could verify there has been no tampering.
My intent would be to use a function (e.g. wstring foo = GetString(L"I am %1% years old");) that feeds the boost::format or boost::wformat functions. Notice that the key fed to GetString is a string, not an enum.
Obviously I can use whatever XML format (or otherwise) I'd like to dream up. But I'd rather use something that is somewhat standard.
For a standard format, use gettext and msgfmt to make binary .mo files. The format comes from Unix, but is usable cross platform. Audacity, which is Linux/Mac/Windows, uses it.
1) The key is the English string.
2) The standard format doesn't come with an anti-tamper approach, so you will need to cook up your own.
There is also an editor, poEdit, and an emacs mode for working with the translations in the intermediate textual .po format.
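Since the anti-tamper part is left to you, here is one rough way it could look, sketched in Python. This is not part of gettext: JSON stands in for the .mo file purely to keep the sketch self-contained, and SECRET_KEY, sign_catalog, load_catalog and get_string are all hypothetical names.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; keep it out of the catalog

def sign_catalog(catalog_bytes):
    """Keyed digest (HMAC-SHA256) over the raw catalog bytes."""
    return hmac.new(SECRET_KEY, catalog_bytes, hashlib.sha256).hexdigest()

def load_catalog(catalog_bytes, expected_digest):
    """Refuse to load a catalog whose digest does not match."""
    actual = sign_catalog(catalog_bytes)
    if not hmac.compare_digest(actual, expected_digest):
        raise ValueError("catalog digest mismatch")
    return json.loads(catalog_bytes.decode("utf-8"))

def get_string(catalog, key):
    """The English string is the key; fall back to it when untranslated."""
    return catalog.get(key, key)

german = json.dumps(
    {"I am %1% years old": "Ich bin %1% Jahre alt"}, ensure_ascii=False
).encode("utf-8")
digest = sign_catalog(german)

catalog = load_catalog(german, digest)
print(get_string(catalog, "I am %1% years old"))  # Ich bin %1% Jahre alt
print(get_string(catalog, "Not translated yet"))  # Not translated yet
```

A plain hash only detects accidental corruption; a keyed digest like this one also resists deliberate edits, as long as the key itself is not shipped alongside the catalog.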
We used a library called I18N (I'm sure there are a ton of implementations named the same way all over). The keys and translations were stored in a .txt file, and used some hash for faster lookups. We modified it some to improve the context - it used a context of filename, so you could not use multiple translations of the same string literal in the same file.
Usage was something like Translated = String::Format(I18N("I am %d years old"), years);
We would periodically run a separate executable against our source code to parse out all the various I18N entries, rehash them, and update the file with any new additions.
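I no longer have that tool, but the extraction step is easy to approximate. Here is a rough sketch in Python that pulls the string literal out of each I18N("...") call; the regular expression and the extract_keys name are mine, not the original library's, and it only handles a single plain literal per call.

```python
import re

# Matches I18N("...") and captures the literal, allowing escaped quotes inside.
I18N_CALL = re.compile(r'I18N\(\s*"((?:[^"\\]|\\.)*)"\s*\)')

def extract_keys(source_text):
    """Return the unique I18N keys found in a piece of source code, sorted."""
    return sorted(set(I18N_CALL.findall(source_text)))

sample_source = '''
Translated = String::Format(I18N("I am %d years old"), years);
Title = I18N("Settings");
Again = I18N("Settings");
'''

keys = extract_keys(sample_source)
print(keys)  # ['I am %d years old', 'Settings']
```

Running something like this over every source file and merging the result into the translations file gives the "update with any new additions" step.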
Unfortunately I can't find any attribution to the author in the source.
Not sure if there is one, but if there is chances are it is mentioned somewhere at lisa.org: http://www.lisa.org/Standards.30.0.html