I'm using ICHAR in some code which will be used across different systems. The sample code is:
write(20,rec=iwrite) ichar(some_string(index))
where I'm trying to write some text to a binary file.
Looking at the description on the GCC website, it says:
ICHAR(C) returns the code for the character in the first character position of C in the system's native character set. The correspondence between characters and their codes is not necessarily the same across different GNU Fortran implementations.
It works fine on the few systems I've tried, e.g. a few computers with different versions of GCC, and a couple of clusters. But it seems that I won't even know it is writing the wrong data until I look at it, which is worrying.
Is there more detail about what is meant by different GNU Fortran implementations?
Is this compiler specific?
Is there a more robust way to write characters to a binary file, i.e. one that will always write the same code to the file, with that code always referring to the same character?
If there is a binary-mode std::ostream, how could one output a platform-specific line-terminator string (e.g. \r\n on Windows)? Alternatively, is there a standard way to get that terminator as a string?
So far, I have considered the following options:
Just use std::endl: this would just output \n regardless of the platform, and the implied std::flush hurts performance quite severely.
The old C-style #ifdef _WIN32 ... approach, with one branch per possible terminator string. Fragile and exhausting.
A more modern branching alternative with constexpr conditionals. Slightly less fragile, still exhausting (a sketch follows this list).
Output std::endl to a std::stringstream and get the string: this does not seem to work when using MinGW to compile a 64-bit Windows executable on Linux and testing under WINE - I still only get a single \n. That might be an environment problem, though.
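For concreteness, here is a minimal sketch of the branching idea behind options 2 and 3 (the platform macros are the usual predefined ones; the function name, file name, and non-Windows fallback are my own assumptions):

    #include <fstream>
    #include <string_view>

    // Pick the terminator at compile time from predefined platform macros.
    // Only the platforms tested for below are covered; everything else
    // falls through to a guess, which is what makes this approach fragile.
    constexpr std::string_view line_terminator()
    {
    #if defined(_WIN32)
        return "\r\n";
    #else
        return "\n";   // assumed for POSIX-like platforms
    #endif
    }

    int main()
    {
        std::ofstream out("out.bin", std::ios::binary); // placeholder name
        out << "first line"  << line_terminator();
        out << "second line" << line_terminator();
    }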
Is there a somewhat portable way that would work?
UPDATE:
It was pointed out in the comments that having a line terminator is not a given; there are platforms that delimit text lines in other ways, such as length-prefixed string records. So a truly generic solution would be one that answers the question:
How to write a single platform-specific text line to a binary-mode output stream?
Is there, for example, a way to wrap a text-mode stream around an existing binary stream?
It turns out that something like this may not actually be possible - and intentionally so. C++ needs to support a multitude of platforms, and some of them have fundamentally different models for text files.
E.g. on OpenVMS text files can be stored in a record-per-line format using the Record Management Services subsystem. This makes the concept of line terminator strings irrelevant, and it also means that text-mode files are not just binary files with some additional assumptions on top of them.
I have tried searching Stack Overflow for an answer to this, but the questions and answers I've found are around ten years old, and I can't seem to find a consensus on the subject, given how much has changed and possibly progressed since.
There are several libraries I know of outside of the standard library that are supposed to handle Unicode:
http://userguide.icu-project.org/
https://github.com/nemtrif/utfcpp
https://github.com/CaptainCrowbar/unicorn-lib
There are a few standard library features (std::wstring, std::codecvt_utf8) that were included, but people seem ambivalent about using them because they deal with UTF-16, which this site (UTF-8 Everywhere) says shouldn't be used, and many people online seem to agree with that premise.
The only thing I'm looking for is the ability to do four things with Unicode strings:
Read a string into memory
Search the string with a regex using Unicode or ASCII, and concatenate or do text replacement/formatting on it with either ASCII or Unicode numbers or characters.
Convert to ASCII plus the Unicode number format (e.g. U+XXXX) for characters that don't fit in the ASCII range (see the sketch after this list).
Write a string to disk or send it wherever.
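To make item 3 concrete, here is a minimal sketch of the kind of conversion I mean, assuming UTF-32 input (the function name and the U+XXXX output format are my own choices):

    #include <cstdio>
    #include <string>

    // Pass ASCII through unchanged; render anything else as U+XXXX.
    std::string to_ascii_escaped(const std::u32string& in)
    {
        std::string out;
        char buf[16];
        for (char32_t cp : in) {
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else {
                std::snprintf(buf, sizeof(buf), "U+%04X",
                              static_cast<unsigned>(cp));
                out += buf;
            }
        }
        return out;
    }

    int main()
    {
        std::puts(to_ascii_escaped(U"caf\u00E9").c_str()); // prints cafU+00E9
    }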
From what I can tell, ICU handles this and more. What I would like to know is whether there is a standard way of handling this on Linux, Windows, and macOS.
Thank you for your time.
I will try to throw out some ideas here:
Most C++ programs and programmers just assume that text is an almost opaque sequence of bytes. UTF-8 is probably responsible for that, and it is no surprise that many comments boil down to: don't worry about Unicode, just process UTF-8-encoded strings.
Files only contain bytes. If at some point you internally process true Unicode code points, you will have to serialize them back to bytes; here again, UTF-8 wins.
As soon as you go outside the Basic Multilingual Plane (16-bit code points), things become more and more complex. Emoji are especially awful to process: an emoji can be followed by a variation selector (U+FE0E VARIATION SELECTOR-15 (VS15) for text style or U+FE0F VARIATION SELECTOR-16 (VS16) for emoji style) to alter its display, more or less like the old "i BS ^" overstrike trick used with 1970s ASCII when one wanted to print î. That's not all: the characters U+1F3FB to U+1F3FF are used to provide a skin color for 102 human emoji spread across six blocks: Dingbats, Emoticons, Miscellaneous Symbols, Miscellaneous Symbols and Pictographs, Supplemental Symbols and Pictographs, and Transport and Map Symbols.
That simply means that up to three consecutive Unicode code points can represent one single glyph... so the idea that one character is one char32_t is still an approximation.
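To make that concrete, here is a small sketch (the example sequence is my own) showing that a single "thumbs up, medium skin tone" glyph is two char32_t code points, which serialize to eight UTF-8 bytes:

    #include <cstdio>
    #include <string>

    int main()
    {
        // U+1F44D THUMBS UP SIGN followed by
        // U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4:
        // two code points, one glyph on a capable renderer.
        std::u32string glyph = U"\U0001F44D\U0001F3FD";
        std::printf("code points: %zu\n", glyph.size()); // prints 2

        // The same glyph serialized to UTF-8: four bytes per code point.
        std::string utf8 = "\xF0\x9F\x91\x8D\xF0\x9F\x8F\xBD";
        std::printf("UTF-8 bytes: %zu\n", utf8.size());  // prints 8
    }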
My conclusion is that Unicode is a complex thing, and really requires a dedicated library like ICU. You can try to use simple tools like the converters of the standard library when you only deal with the BMP, but full support is far beyond that.
BTW: even other languages like Python, which pretend to have native Unicode support (which is IMHO far better than the current C++ support), often fail in places:
the tkinter GUI library cannot display any code point outside the BMP - yet it is what the standard IDLE Python tool is built on
different modules of the standard library are dedicated to Unicode in addition to the core language support (codecs and unicodedata), and other modules are available from the Python Package Index, like the emoji package, because the standard library does not meet all needs
So Unicode support has been poor for more than ten years, and I do not really expect things to get much better in the next ten...
I'm trying to write a parser for "text" files which I know will be encoded in one of the Windows single byte code pages. These files contain text representations of basic data types, and the spec I have for these representations is lacking, to say the least.
I noticed in Windows-874 ten little inconspicuous characters near the end called THAI DIGIT ZERO to THAI DIGIT NINE.
I'm trying to write this parser to be pretty robust but I'm working a bit in the dark as there are many different programs which can generate these data files and I don't have access to the sources.
What I want to know is: do any functions in Microsoft C++ libraries convert real-number data types into a std::string or char const * (i.e. serialization) that would contain non-Arabic numerals?
I don't use Microsoft C++ libraries so can't reference any in particular but a made-up example could be char const * IntegerFunctions::ToString(int i).
These digits certainly could be created by Microsoft libraries. The locale settings LOCALE_IDIGITSUBSTITUTION and LOCALE_SNATIVEDIGITS determine whether numbers formatted by the OS will use native (i.e. non-ASCII) digits. Those strings are initially Unicode, because that's how Windows creates strings internally; when you have a Thai locale and convert the Unicode output to CP874, those characters will be kept.
A simple function that demonstrates this behavior is GetNumberFormatA.
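For example, a rough sketch (untested; whether native digits actually appear depends on the locale's LOCALE_IDIGITSUBSTITUTION setting, and the input value is arbitrary):

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // Format "1234.5" according to the Thai (Thailand) locale.
        LCID thai = MAKELCID(MAKELANGID(LANG_THAI, SUBLANG_THAI_THAILAND),
                             SORT_DEFAULT);
        char buf[64] = {0};
        if (GetNumberFormatA(thai, 0, "1234.5", nullptr, buf, sizeof(buf)) > 0) {
            // With native digit substitution active, buf now holds
            // Thai digits encoded in the ANSI code page (CP874).
            std::printf("%s\n", buf);
        }
    }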
Sort of the inverse answer, but this page seems to indicate that Microsoft's runtime libraries understand quite a few (but not all) non-Latin numerals when doing what you want to do, i.e. parse a string into a number.
Thai is included, which seems to indicate that it's a good idea to support it in custom code, too.
To include more information here, the linked-to page states that Microsoft's msvcr100 runtime supports decoding numerals from the following character sets:
ASCII
Arabic-Indic
Extended Arabic
Devanagari
Bengali
Gurmukhi
Gujarati
Oriya
Telugu
Kannada
Malayalam
Thai
Lao
Tibetan
Myanmar
Khmer
Mongolian
Full Width
The full page includes more programming environments and more languages (there are plenty of negatives, too).
According to various sources (for example, the SE radio episode with Kevlin Henney, if I remember correctly), "C with classes" was implemented with preprocessor technology (with the output then being fed to a C compiler), whereas C++ has always been implemented with a compiler (that just happened to spit out C in the early days). This seems to cause some confusion, so I was wondering:
Where exactly is the boundary between a preprocessor and a compiler? When do you call a piece of software that implements a language "a preprocessor", and when do you call it "a compiler"?
By the way, is "a compiled language" an established term? If so, what exactly does it mean?
This is an interesting question. I don't know a definitive answer, but would say this, if pressed for one:
A preprocessor doesn't parse the code, but instead scans for embedded patterns and expands them
A compiler actually parses the code by building an AST (abstract syntax tree) and then transforms that into a different language
The language of the output of the preprocessor is a subset of the language of the input.
The language of the output of the compiler is (usually) very different (machine code) from the language of the input.
From a simplified, personal point of view:
I consider the preprocessor to be any form of textual manipulation that has no concept of the underlying language (i.e. its semantics or constructs), and thus only relies on its own set of rules to perform its duties.
The compiler starts when rules and regulations are applied to what is being processed (yes, that makes 'my' preprocessor a compiler, but why not :P); this includes semantic and lexical checking, and the transforms from x (textual) to y (binary/intermediate form). As one of my professors would say: "it's a system with inputs, processes and outputs".
The C/C++ compiler cares about type-correctness while the preprocessor simply expands symbols.
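A small sketch of that distinction (the macro is my own example):

    #include <cstdio>

    // The preprocessor expands SQUARE by pure token substitution,
    // with no idea what the argument means.
    #define SQUARE(x) ((x) * (x))

    int main()
    {
        int ok = SQUARE(3);        // expands to ((3) * (3)), which is fine
        // int bad = SQUARE("hi"); // expands to (("hi") * ("hi")):
                                   // the preprocessor accepts it,
                                   // the compiler rejects the types
        std::printf("%d\n", ok);   // prints 9
    }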
A compiler consists of several processes (components). The preprocessor is only one of these, and a relatively simple one.
From the Wikipedia article, Division of compiler processes:
All but the smallest of compilers have more than two phases. However, these phases are usually regarded as being part of the front end or the back end. The point at which these two ends meet is open to debate.
The front end is generally considered to be where syntactic and semantic processing takes place, along with translation to a lower level of representation (than source code).
The middle end is usually designed to perform optimizations on a form other than the source code or machine code. This source code/machine code independence is intended to enable generic optimizations to be shared between versions of the compiler supporting different languages and target processors.
The back end takes the output from the middle end. It may perform more analysis, transformations and optimizations that are for a particular computer. Then, it generates code for a particular processor and OS.
Preprocessing is only a small part of the front end's job.
The first C++ compiler was made by attaching an additional process in front of an existing C compiler toolset - not because that was a good design, but because of limited time and resources.
Nowadays, I don't think such a non-native C++ compiler could survive in the commercial field.
I dare say a cfront for C++11 would be impossible to make.
The answer is pretty simple.
A preprocessor works on text as input and has text as output. Examples are the old Unix commands m4 and cpp (the C preprocessor), and also Unix programs like roff, nroff, and troff, which were used (and still are) to format man pages (the Unix man command) or to format text for printing or typesetting.
Preprocessors are very simple; they don't know anything about the "language of the text" they process. In other words, they usually process natural languages. The C preprocessor, despite its name, only recognizes #define, #include, #ifdef, #ifndef, #else, etc., and if you use #define MACRO it tries to "expand" that macro everywhere it finds it. But the input does not need to be C or C++ program text; it could just as well be a novel written in Italian or Greek.
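A classic illustration of that ignorance (the macro is my own example): since expansion is pure text substitution, the preprocessor knows nothing about operator precedence in the surrounding expression:

    #include <cstdio>

    #define DOUBLE(x) x + x   // naive macro, deliberately unparenthesized

    int main()
    {
        // Expands to 3 * 2 + 2, which is 8, not 3 * (2 + 2) = 12.
        std::printf("%d\n", 3 * DOUBLE(2));
    }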
Compilers that compile into a different language are usually called translators. So the old cfront "compiler" for C++, which emitted C code, was a C++ translator.
Preprocessors, and later translators, were historically used because old machines simply lacked the memory to do everything in one program; instead, the work was split across specialized programs, working from disk to disk.
A typical C program would be compiled from various sources, and the build process would be managed with make. (These days the C preprocessor is usually built directly into the C/C++ compiler.) A typical make run would call the CPP on the *.c files and write the output to a different directory; from there, the C compiler CC would either compile it straight to machine code or, more commonly, output assembler code as text. Note: the C compiler of that era only checked syntax; it did not really care about type safety etc. The assembler would then take that assembler code and output a *.o file, which later could be linked with other *.o files and *.lib files into an executable program.
OTOH, you likely also had a make rule that would call not the C compiler but the lint command, the C language analyser, which looks for typical mistakes and errors (which are ignored by the C compiler).
It is quite interesting to read up on lint, nroff, troff, m4, etc. on Wikipedia (or in your machine's terminal, using man) ;D
I'm currently learning C++, and there are some (basic) things I don't really know about and for which I couldn't find anything useful via various search engines.
All operating systems have different "binary formats" for their executables (Windows/Linux/Mac) - what are the differences? I mean, all of them are binary, but is there anything (besides all the OS APIs) that really differs?
(Windows) This is a dumb question - but are all the applications there really just binary (and I mean just 0s and 1s)? In what format are they stored? (You don't see 0s and 1s when opening them in text editors, but mainly non-displayable characters.)
Best regards,
lamas
Executable file formats for Windows (PE), Linux (ELF), OS X (Mach-O), etc., tend to be designed to solve common problems, so they all share common features. However, each platform specifies a different standard, so the files are not compatible across platforms, even if the platforms use the same type of CPU.
Executable file formats are not only used for executable files, but also for libraries, which also contain code but are never run directly by the user - they are only loaded into memory to satisfy the needs of directly executable binaries.
Common Features of an executable file format:
One or more blocks of executable code
One or more blocks of read-only data such as text and numbers
One or more blocks of read/write data
Instructions on where to place these blocks in memory when the application is run
Instructions on what libraries (which are also in an 'executable file format') need to be loaded as well, and how they connect (link) up to this executable file.
One or more tables mapping code and data locations to strings or ids that describe them, useful for linking and debugging.
It's interesting to compare such formats to more basic formats, such as the venerable DOS .com file, which simply describes 64K of assorted 'stuff' to be loaded at the next available location, and has few of the features listed above.
Binary in this sense is used to contrast them with 'source' files, which are written in text format. Binary format simply says that they are encoded in a non-text way and doesn't really relate to the 0s-and-1s sense of binary.
Executables for Windows/Linux differ in:
The format of the file headers, i.e. the part of the file that indexes where and what's what in the rest of the file;
the instructions required for system calls (interrupts, register contents, etc)
the actual format in which binary code is linked together; there are several different ones for Linux, and I think also for Windows.
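As a concrete illustration of how the headers differ, here is a rough sketch (the file name is a placeholder) that tells the major formats apart purely by their leading magic bytes:

    #include <cstdio>
    #include <cstring>
    #include <fstream>

    int main()
    {
        std::ifstream f("a.out", std::ios::binary);
        unsigned char magic[4] = {0};
        f.read(reinterpret_cast<char*>(magic), sizeof(magic));

        if (std::memcmp(magic, "\x7F" "ELF", 4) == 0)
            std::puts("ELF (Linux and most Unix-likes)");
        else if (magic[0] == 'M' && magic[1] == 'Z')
            std::puts("PE (Windows; begins with the old DOS 'MZ' stub)");
        else if (std::memcmp(magic, "\xCF\xFA\xED\xFE", 4) == 0)
            std::puts("Mach-O 64-bit, little-endian (OS X)");
        else
            std::puts("unrecognized format");
    }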
Applications are data and machine language opcodes, crammed into a file. Most bytes in an executable file don't contain text and can therefore contain values between 0 and 255 inclusive, i.e. all possible values. People would say that's binary. There are 8 bits in a byte, so each of those bytes could be said to contain 8 binary digits, some of which will be 0 and some 1.
When you get down to it, every single file in a computer is "binary" in the sense that it is stored as a sequence of 1s and 0s on disk (even text files). When you open a file in a text editor, it groups those bits into characters based on various encoding rules. If the file actually is a text file, this gives you readable text. If it is not, the text editor will still faithfully try to decode the stream of bits, but will most likely end up with lots of non-displayable characters, since the bits are not encoded characters at all, but CPU instructions and data.
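A tiny hex-dump sketch shows this directly (the file name is a placeholder); the right-hand column is roughly what a text editor ends up displaying:

    #include <cctype>
    #include <cstdio>
    #include <fstream>

    int main()
    {
        std::ifstream f("a.out", std::ios::binary);
        unsigned char buf[16] = {0};
        f.read(reinterpret_cast<char*>(buf), sizeof(buf));

        for (unsigned char b : buf)
            std::printf("%02X ", b);                 // raw byte values
        std::printf(" |");
        for (unsigned char b : buf)
            std::putchar(std::isprint(b) ? b : '.'); // editor's-eye view
        std::printf("|\n");
    }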
As for the other part of your question, about "binary formats": there are multiple formats for how to lay out the various parts of an executable, such as ELF or the Windows DLL/EXE format. These all specify exactly where in the file various parts of the executable are (i.e. where the metadata is, where the symbol table is, where the entry point is, where the static data and resources are, etc.)
The most common file format on Windows is PE; on Linux it is ELF. They both contain mostly the same things (data segment, code segment, etc.) and differ mainly because they were designed separately.
It should be noted that even if both Windows and Linux used the same file-format, they would still not be able to run each others' binaries, because the system APIs and available DLLs/SOs are completely different.