I need to read (and write) some binary little-endian files.
I am writing my Fortran code on a PC using Intel FC and Intel MPI.
I/O works fine on the PC, but the end goal is to run the program on Blue Gene/P.
The Blue Gene/P (XL Fortran compiler) is big-endian, and when I need non-parallel I/O operations (like Fortran READ & WRITE)
I am using
call SETRTEOPTS('ufmt_littleendian=8')
Unfortunately, when I need parallel I/O, for example MPI_FILE_READ, "SETRTEOPTS('ufmt_littleendian=8')" is ignored.
I am setting the view with:
call MPI_FILE_SET_VIEW(ifile, offset, MPI_FLOAT, MPI_FLOAT, 'native', MPI_INFO_NULL, ierr)
What should I do? I don't want to create my own DATAREP. Is there any other way? Speed is very important.
You need to use parallel-netcdf or HDF5. The learning curve for Parallel-HDF5 is a bit steep but you will get a self describing portable file format. It will help you down the road in ways you do not yet understand.
Performance overhead is negligible. Pnetcdf has a bit of an edge if you have lots of tiny datasets, but that's a rather pathological situation.
Some applications do byte-swapping, but since you mention Fortran you will have to be very careful that the PC Fortran compiler and the Blue Gene Fortran compiler agree exactly on how much (if any) record padding they put in their unformatted output. Bleah.
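For illustration only, application-level byte-swapping of a buffer of 32-bit values might look roughly like the sketch below. It uses the MPI C API rather than your Fortran code, and the function name read_le_floats is just a placeholder; it is not a drop-in replacement for the runtime option.

// Sketch: swap the byte order of a buffer of 32-bit words after reading it
// with MPI_File_read. Assumes the file holds little-endian IEEE floats and
// the host (Blue Gene/P) is big-endian; function names are illustrative.
#include <mpi.h>
#include <stdint.h>
#include <string.h>

static inline uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

void read_le_floats(MPI_File fh, float *buf, int count) {
    MPI_Status status;
    MPI_File_read(fh, buf, count, MPI_FLOAT, &status);
    for (int i = 0; i < count; ++i) {
        uint32_t w;
        memcpy(&w, &buf[i], sizeof w);   // copy through memcpy to avoid aliasing issues
        w = bswap32(w);
        memcpy(&buf[i], &w, sizeof w);
    }
}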
You can try adding "call setrteopts('ufmt_littleendian=8')" to your program instead of setting the environment variable.
Alternatively, you can instruct the Intel compiler to generate big endian data files. See the CONVERT= specifier in the OPEN statement, or the convert compiler option.
Conversion (either from big-endian to little-endian on the BG/P side, or from little-endian to big-endian on Intel side) has a runtime performance cost. So if you want the read side (BG/P) to be as fast as possible, creating big-endian data files on the write side (Intel) is best.
I have the following question: in parts of my software (mostly C++) I rely on reading precalculated binary data from a file to be used in a numerical simulation. These files can be quite big (like 32 GB and upwards).
At the moment I read in the data with fread and store/navigate some important file positions with fgetpos/fsetpos.
Mostly my software runs on HPC clusters, and at the moment I'm trying to implement a restart feature in case I run out of wall-clock time, so that I can resume my calculations. To that end I dump a few key parameters in a binary file and would also need to store the position of my file pointer before my code is aborted.
So I checked around the forum and I'm not quite sure what's the best solution for this.
I guess I can't just write the whole fpos_t struct to disk, as this can produce nonsense when I read it in again. ftell is limited to 2 GB files, if I'm correct?
Would ftello be a better option? Is it compatible with different compilers and OSes (like Intel, Cray and so on)?
Thanks in advance
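For reference, a sketch of what saving and restoring a large-file offset could look like on a POSIX system, where ftello/fseeko work with a 64-bit off_t (on 32-bit Linux this typically requires compiling with _FILE_OFFSET_BITS=64). The file names are placeholders and error checking is omitted.

// Sketch: record the current read position of a >2 GB data file so a
// restarted run can resume where it left off. Assumes POSIX ftello/fseeko
// with a 64-bit off_t; checkpoint_path is a placeholder name.
#include <stdio.h>
#include <sys/types.h>   // off_t

void save_position(FILE *data, const char *checkpoint_path) {
    off_t pos = ftello(data);               // 64-bit offset, unlike the long returned by ftell
    FILE *cp = fopen(checkpoint_path, "wb");
    fwrite(&pos, sizeof pos, 1, cp);        // raw off_t is fine as long as the same platform restarts the job
    fclose(cp);
}

void restore_position(FILE *data, const char *checkpoint_path) {
    off_t pos = 0;
    FILE *cp = fopen(checkpoint_path, "rb");
    fread(&pos, sizeof pos, 1, cp);
    fclose(cp);
    fseeko(data, pos, SEEK_SET);            // jump back to the saved offset
}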
I'm currently writing a video game console emulator based on the ARM7TDMI processor, and I am almost at the stage where I want to test whether the processor is functioning correctly. I have only developed the CPU and memory parts of the entire console, so the only possible way to debug the processor is by using a logging (console) system. So far, I've only tested it simply by fetching dummy opcodes and executing random instructions. Is there an actual ARM7 program (or other methodology) that is specifically designed for this kind of purpose, to make sure the processor is functioning correctly? Thanks in advance.
I used Dummy Opcodes such as,
ADD r0, r0, r1, LSL#2
MOV r1, #r0
But in 32 bit Opcode format.
I also wrote some tests and found some bugs in a GBA emulator. I have also written my own emulators (as well as worked in the processor business testing processors and boards).
I have a few things that I do regularly. These are my general test methodologies.
There are a number of open source libraries out there, for example zlib and other compression libraries, jpeg, mp3, etc. It is not hard to bare-metal these: fake an fopen, fread, fwrite with chunks of data and a pointer. The compression libs, as well as encryption and hashes, you can self-test on the target processor: compress something, decompress it, and compare the original with the uncompressed output. I often also run the code under test on a host, compute the checksum of the compressed and decompressed versions, and get a hardcoded check value which I then run against on the target platform. For jpeg or mp3 or hash algorithms I use a host version of the code under test to produce a golden value that I then compare against on the target platform.
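As a rough sketch of that self-test idea, assuming you have zlib built for the target (the buffer sizes and test pattern here are arbitrary):

// Sketch: compress a known buffer, decompress it, and compare with the
// original. If the emulated CPU mishandles an instruction that zlib uses,
// the round trip (or the comparison) tends to break.
#include <zlib.h>
#include <string.h>
#include <stdio.h>

int self_test(void) {
    unsigned char original[4096];
    unsigned char compressed[8192];
    unsigned char restored[4096];
    for (unsigned i = 0; i < sizeof original; ++i)
        original[i] = (unsigned char)(i * 13 + 7);   // arbitrary but repeatable pattern

    uLongf clen = sizeof compressed;
    if (compress(compressed, &clen, original, sizeof original) != Z_OK)
        return 1;

    uLongf rlen = sizeof restored;
    if (uncompress(restored, &rlen, compressed, clen) != Z_OK)
        return 1;

    if (rlen != sizeof original || memcmp(original, restored, rlen) != 0)
        return 1;

    printf("zlib round trip OK\n");
    return 0;
}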
Before doing any of that, though: the flags are very tricky to get right, the carry flag in particular (and signed overflow). Some processors invert the carry-out flag when the operation is a subtract (subtract is an add with the second operand ones-complemented and the carry-in set: a normal add without carry uses a carry-in of zero, so a subtract without borrow is an add with the second operand inverted and a carry-in of 1). And that inversion of the carry-out affects the carry on the way in if the instruction set has a subtract-with-borrow, depending on whether or not carry is inverted going in.
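A minimal sketch of that add-with-inverted-operand view of subtraction, for a 32-bit ALU. It deliberately does not assume which carry convention your target ISA presents for subtracts; check the architecture manual and adjust.

// Sketch: compute result, carry-out and signed overflow for a 32-bit add
// with carry-in, then express SUB as a + ~b + 1. Whether the ISA exposes
// the raw carry or an inverted "borrow" for subtracts must be verified
// against the architecture manual.
#include <stdint.h>

struct AluResult { uint32_t value; int carry; int overflow; };

AluResult adc32(uint32_t a, uint32_t b, int carry_in) {
    uint64_t wide = (uint64_t)a + (uint64_t)b + (uint64_t)(carry_in ? 1 : 0);
    AluResult r;
    r.value    = (uint32_t)wide;
    r.carry    = (int)((wide >> 32) & 1);                      // unsigned overflow (carry-out)
    r.overflow = (int)(((~(a ^ b) & (a ^ r.value)) >> 31) & 1); // signed overflow
    return r;
}

AluResult sub32(uint32_t a, uint32_t b) {
    return adc32(a, ~b, 1);   // a - b == a + ~b + 1
}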
It is sometimes obvious from the conditional branch definitions (if C is this and V is that, if C is this and Z is that, for the unsigned and signed variations of less than, greater than, etc.) how that particular processor manages the carry (unsigned overflow) and the signed overflow flags, without having to experiment on real silicon. I don't memorize what processor does what; I figure it out per instruction set, so I don't know what ARM does.
ARM has nuances with the shift operations that you have to be careful were implemented properly; read the pseudocode under each instruction: if the shift amount == 32 then do this, if the shift amount == 0 then do that, otherwise do this other thing. With the ARM7 you could do unaligned accesses if the alignment fault was disabled, and it would rotate the data around within the 32 bits, or something like that. If the 32 bits at address 0 were 0x12345678, then a 16-bit read at address 1 would give you something like 0x78123456 on the bus and the destination would then get 0x3456. Hopefully most folks didn't rely on that. But that and other "UNPREDICTABLE RESULTS" comments in the ARM ARM changed from one ARM ARM revision to the next (if you have some of the different hardcopy manuals this will be more obvious: the old white-covered one, the skinny one as well as the thick one, and the blue-covered one). So depending on the manual you read (for those ARMv4 processors) you were sometimes allowed to do something and sometimes not. So you might find code/binaries that do things you think are unpredictable if you only rely on one manual.
Different compilers generate different code sequences, so if you can find different ARM compilers (clang/llvm and gcc being the obvious first choices), get eval copies of other compilers if you can (Keil is probably a good choice; now owned by ARM, I think it contains both the Keil and the RVCT ARM compilers). Compile the same test code with different optimization settings, test every one of those variations, and repeat that for each compiler. If you only use one compiler for testing you will leave gaps in instruction sequences, as well as a number of instructions or variations that will never be tested because the compiler never generates them. I hit this exact problem once. Using open source code you get different programmer habits too; whether it is asm or C or other languages, different individuals have different programming habits and as a result generate different instruction sequences and mixes of instructions, which can hide or expose processor bugs. If this is a single-person hobby project you will eventually rely on others. The good thing here, being a GBA or DS or whatever emulator: once you start using ROMs you will have a large volume of other people's code, although unfortunately debugging that is difficult.
I heard some hearsay once that intel/x86 design validation folks use operating systems, various ones, to beat on their processors; it creates a lot of chaos and variation. Beats up on the processor, but like the ROMs, it is extremely difficult to debug if something goes wrong. I have personal experience with that, with caches and such, running Linux on the processors I have worked on. Didn't find the bug until we had Linux ported and booting, and the bug was crazy hard to find... Fortunately the ARM7TDMI does not have a cache. If you have a cache, then take those combinations I described above (test code multiplied by optimization level multiplied by different compilers), and then add to that, in the bootstrap or other places, versions compiled with one, two, three, four nops or other data, such that the alignment of the binary changes relative to the cache lines, causing the same program to exercise the cache differently.
In this case, where there is real hardware you are trying to emulate, you can do things like have a program that generates random ALU machine code: generate dozens of instructions with randomly chosen source and destination registers; randomize add, subtract, and, or, not, etc.; randomize the flags on and off, etc. Pre-load all the registers, set the flags to a known state, run that chunk of code, and then capture the registers and flags and see how they compare to real hardware. You can produce an infinite number of separate tests of various lengths, etc. It is easier to debug this than to debug a code sequence that does some data or flag thing as part of a more complicated program.
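A sketch of what such a generator might look like. The 32-bit encodings below are the ARMv4 register-form data-processing instructions as I read them out of the ARM ARM, so double-check them against your own decoder before trusting the tests.

// Sketch: fill a buffer with random register-to-register data-processing
// instructions (flags-setting forms), ending with a branch-to-self so the
// harness knows when to stop. Register fields Rn (bits 19:16), Rd (15:12)
// and Rm (3:0) are randomized. Caller provides count+1 words of space.
#include <stdint.h>
#include <stdlib.h>

static const uint32_t templates[] = {
    0xE0900000,   // ADDS Rd, Rn, Rm
    0xE0500000,   // SUBS Rd, Rn, Rm
    0xE0100000,   // ANDS Rd, Rn, Rm
    0xE0300000,   // EORS Rd, Rn, Rm
    0xE1900000,   // ORRS Rd, Rn, Rm
};

void generate_alu_test(uint32_t *code, int count) {
    for (int i = 0; i < count; ++i) {
        uint32_t t  = templates[rand() % (sizeof templates / sizeof templates[0])];
        uint32_t rn = (uint32_t)(rand() % 13) << 16;   // r0-r12, leave sp/lr/pc alone
        uint32_t rd = (uint32_t)(rand() % 13) << 12;
        uint32_t rm = (uint32_t)(rand() % 13);
        code[i] = t | rn | rd | rm;
    }
    code[count] = 0xEAFFFFFE;   // B . (branch to self) marks the end of the block
}
// The harness then pre-loads r0-r12 and the flags with known values, runs the
// block on the emulator and on real hardware, and compares registers + CPSR.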
Take that combination of test programs, multiplied by optimization setting, multiplied by compiler, etc., and beat on it with interrupts. Vary the rate of the interrupts. Since this is a simulator, you can do something I once had hardware for: in the interrupt, examine the return address, compute an address that is some number of instructions ahead of that address, and remember it. Return from the interrupt; when you see that address being fetched, fire a prefetch abort; when the abort fires (in the simulation), stop watching that address, and have the prefetch abort handler return to where the abort happened (per the ARM ARM) and let it continue. I was able to create a fair amount of pain on the processor under test with this setup... particularly with the caches on... which you don't have on an ARM7TDMI.
Note that a high percentage of GBA games are Thumb mode because on that platform, which used mostly 16-bit-wide data buses, Thumb mode ran (much) faster than ARM mode even though Thumb code takes about 10-15% more instructions, as well as taking less ROM space for the binary. Carefully examine the BLX instruction, as I think there are different implementations based on architecture; ARMv4 is different from ARMv6 or v7, so if you are using an ARMv6 or v7 manual as a reference, or such hardware for validating against, understand those differences.
blah, blah, blah, TL;DR. Sorry for rambling, this is a fun topic for me...
For space efficiency, I have decided to encode my save files in binary. Every byte represents an ID for the type of tile. Would this cause an issue on machines with different endianness?
Also, out of curiosity, is it the CPU or the operating system that sets the endianness?
Additional information: I am using C++ and building an x-platform game. I do not want to use an additional API such as Boost.
Yes, it will cause issues if a file saved on a BE machine gets loaded on an LE machine or vice versa. This is why some Unicode encodings, such as UTF-16 and UTF-32, have a so-called byte order mark.
If your code is usually compiled on BE, you will still have to make sure that the LE build swaps the byte order before making use of the data.
The CPU sets the endianness, and some chips (e.g. some MIPS CPUs) allow it to be switched when bootstrapping the system.
We could use a little more info. Cross-platform is one thing, but which platforms? If you mean cross-platform like x86 Mac, x86 Linux and x86 Windows, then no, you won't need to worry about it (although struct packing may still be an issue if you try to just fwrite structs to disk and compile with different compilers on different platforms). Even if you have a couple of different OS/CPU combos, you can make a list of everything you want to support, and if they all have the same endianness, don't worry about it.
If you have no expectation that save data will be moved from platform to platform, you also don't need to worry about it. Endianness is only an issue when you want to create data on a big-endian machine and then read it on a little-endian machine or vice versa. If these are just local data files, no big deal, although it's probably safe to assume that if your users can copy their saves from one platform to another, they will, as they will do pretty much anything you don't want them to do and don't support.
Additionally, since you only mention bytes: if a byte array is as complex as your data is going to get, you actually don't need to worry about endianness. It's only an issue for multi-byte data types. So if you are just saving byte arrays, and the rest of your bookkeeping data also fits in bytes, there's nothing to worry about, but as soon as you save a short, int or float you'll have potential endian issues.
My personal opinion is: whenever you serialize, take endianness into account, but I have an extremely multi-platform background (i.e. shipping the same product on 5 game systems). It's pretty easy, the swap macros are already there, and when you inevitably decide to move to another endianness you won't have to rewrite stuff. If the data is more complex or structured, maybe consider a library like Protocol Buffers or BSON.
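For example, here is a sketch of endian-independent serialization of a 32-bit value, writing it byte by byte in a fixed order so no swap is ever needed at read time. The function names are just illustrative.

// Sketch: write and read a uint32_t in little-endian order regardless of
// the host's endianness. Because the code addresses individual bytes, the
// file layout is the same on every platform.
#include <stdint.h>
#include <stdio.h>

void write_u32_le(FILE *f, uint32_t v) {
    unsigned char b[4] = {
        (unsigned char)(v & 0xFF),
        (unsigned char)((v >> 8) & 0xFF),
        (unsigned char)((v >> 16) & 0xFF),
        (unsigned char)((v >> 24) & 0xFF),
    };
    fwrite(b, 1, 4, f);
}

uint32_t read_u32_le(FILE *f) {
    unsigned char b[4] = {0, 0, 0, 0};
    fread(b, 1, 4, f);   // error checking omitted in this sketch
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}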
Both the CPU and the operating system may be responsible for endianness. Historically it was baked into the CPU, and while x86 is still hardwired as little-endian, most modern RISC derivatives can operate in either mode, making it the choice of the hardware and OS developers.
I'm confused with the byte order of a system/cpu/program.
So I must ask some questions to make my mind clear.
Question 1
If I only use type char in my C++ program:
void main()
{
char c = 'A';
char* s = "XYZ";
}
Then I compile this program to an executable binary file called a.out.
Can a.out both run on little-endian and big-endian systems?
Question 2
If my Windows XP system is little-endian, can I install a big-endian Linux system in VMWare/VirtualBox?
What makes a system little-endian or big-endian?
Question 3
If I want to write a byte-order-independent C++ program, what do I need to take into account?
Can a.out both run on little-endian and big-endian system?
No, because pretty much any two CPUs that are so different as to have different endianness will not run the same instruction set. C++ isn't Java; you don't compile to a bytecode that then gets compiled or interpreted. You compile to the assembly for a specific CPU. And endianness is part of the CPU.
But that's outside of endian issues. You can compile that program for different CPUs and those executables will work fine on their respective CPUs.
What makes a system little-endian or big-endian?
As far as C or C++ is concerned, the CPU. Different processing units in a computer can actually have different endians (the GPU could be big-endian while the CPU is little endian), but that's somewhat uncommon.
If I want to write a byte-order independent C++ program, what do I need to take into account?
As long as you play by the rules of C or C++, you don't have to care about endian issues.
Of course, you also won't be able to load files directly into POD structs. Or read a series of bytes, pretend it is a series of unsigned shorts, and then process it as a UTF-16-encoded string. All of those things step into the realm of implementation-defined behavior.
There's a difference between "undefined" and "implementation-defined" behavior. When the C and C++ specs say something is "undefined", it basically means all manner of brokenness can ensue. If you keep doing it (and your program doesn't crash), you could get inconsistent results. When they say that something is defined by the implementation, you will get consistent results for that implementation.
If you compile for x86 in VC2010, what happens when you pretend a byte array is an unsigned short array (ie: unsigned char *byteArray = ...; unsigned short *usArray = (unsigned short*)byteArray) is defined by the implementation. When compiling for big-endian CPUs, you'll get a different answer than when compiling for little-endian CPUs.
In general, endian issues are things you can localize to input/output systems. Networking, file reading, etc. They should be taken care of in the extremities of your codebase.
Question 1:
Can a.out both run on little-endian and big-endian system?
No. Because a.out is already compiled for whatever architecture it is targeting. It will not run on another architecture that it is incompatible with.
However, the source code for that simple program has nothing that could possibly break on different endian machines.
So yes, it (the source) will work properly (well... aside from void main(); you should be using int main() instead).
Question 2:
If my Windows XP system is little-endian, can I install a big-endian Linux system in VMWare/VirtualBox?
Endianness is determined by the hardware, not the OS. So whatever (native) VM you install on it will have the same endianness as the host (since x86 is all little-endian).
What makes a system little-endian or big-endian?
Here's an example of something that will behave differently on little vs. big-endian:
uint64_t a = 0x0123456789abcdefull;
uint32_t b = *(uint32_t*)&a;
printf("b is %x",b)
*Note that this violates strict-aliasing, and is only for demonstration purposes.
Little Endian : b is 89abcdef
Big Endian : b is 1234567
On little-endian, the lower bits of a are stored at the lowest address. So when you access a as a 32-bit integer, you will read the lower 32 bits of it. On big-endian, you will read the upper 32 bits.
Question 3:
If I want to write a byte-order independent C++ program, what do I need to take into account?
Just follow the standard C++ rules and don't do anything ugly like the example I've shown above. Avoid undefined behavior, avoid type-punning...
Little-endian / big-endian is a property of the hardware. In general, binary code compiled for one piece of hardware cannot run on another, except in virtualization environments that interpret the machine code and emulate the target hardware. There are bi-endian CPUs (e.g. ARM, IA-64) that have a switch to change endianness.
As far as byte-order-independent programming goes, the only case where you really need to do it is when dealing with networking. There are functions such as ntohl and htonl to help you convert between your hardware's byte order and network byte order.
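A quick sketch of those conversion functions in use (POSIX <arpa/inet.h>; on Windows the same functions come from winsock):

// Sketch: convert a 32-bit value to network byte order (big-endian) before
// sending or storing it, and convert it back after receiving. On a
// big-endian host these calls are no-ops; on a little-endian host they swap.
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

int main() {
    uint32_t host_value = 0x0A0B0C0D;
    uint32_t wire_value = htonl(host_value);   // host -> network order
    uint32_t back       = ntohl(wire_value);   // network -> host order
    printf("host %08x wire %08x back %08x\n",
           (unsigned)host_value, (unsigned)wire_value, (unsigned)back);
    return 0;
}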
The first thing to clarify is that endianness is a hardware attribute, not a software/OS attribute, so WinXP and Linux are not big-endian or little-endian; rather, the hardware on which they run is either big-endian or little-endian.
Endianness is a description of the order in which the bytes of a data type are stored. A big-endian system stores the most significant byte (the "biggest" part of the value) first, and a little-endian system stores the least significant byte first. It is not mandatory for every data type on a system to use the same order, so you can have mixed-endian systems.
A program compiled for a little-endian system would not run on a big-endian system, but that has more to do with the instruction set available than with the endianness of the system on which it was compiled.
If you want to write a byte-order independent program you simply need to not depend on the byte order of your data.
1: The output of the compiler will depend on the options you give it and on whether you use a cross-compiler. By default, it should run on the operating system you are compiling it on and not others (perhaps not even other installs of the same type; not all Linux binaries run on all Linux installs, for example). In large projects this will be the least of your concerns, as libraries, etc., will need to be built and linked differently on each system. Using a proper build system (like make) will take care of most of this without you needing to worry.
2: Virtual machines abstract the hardware in such a way as to allow essentially anything to run within anything else. How the operating systems manage their memory is unimportant as long as they both run on the same hardware and support whatever virtualization model is in use. Endianness means the byte order: whether it is read left-to-right or right-to-left (or some other format). Some hardware supports both, and virtualization allows both to coexist in that case (although I am not aware of how this would be useful, except that it is possible in theory). However, Linux works on many different architectures (and Windows on some besides x86), so the situation is more complicated.
3: If you monkey with raw memory, such as with bitwise operators, you might put yourself in a position of depending on endianness. However, most modern programming is at a higher level than this, so you are likely to notice if you get into something which may impose endianness-based limitations. If that is ever required, you can always implement options for both endiannesses using the preprocessor.
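As a complement to preprocessor switches, a common runtime check looks roughly like the sketch below; it simply inspects which byte of a known 16-bit value comes first in memory.

// Sketch: determine the host byte order at run time by looking at the
// lowest-addressed byte of a multi-byte integer.
#include <stdint.h>
#include <string.h>
#include <stdio.h>

bool host_is_little_endian() {
    uint16_t probe = 0x0102;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);   // read the lowest-addressed byte
    return first_byte == 0x02;        // little-endian stores the low byte first
}

int main() {
    printf("host is %s-endian\n", host_is_little_endian() ? "little" : "big");
    return 0;
}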
The endianness of a system determines how the bytes of a multi-byte value are interpreted, i.e. which byte is considered the "first" and which the "last".
You need to care about it only when loading from or saving to sources external to your program, like disk or the network.
Is there a way I could make a C or C++ program that would run without an operating system and draw something like a red pixel in the top left corner? I have always wondered how these kinds of applications are made. Since Windows is written in C, I imagine there is a way to do this.
Thanks
If you're writing for a bare processor, with no library support at all, you'll have to get all the hardware manuals, figure out how to access your video memory, and perform whatever operations that hardware requires to get a pixel drawn onto the display (or a sound on the beeper, or a block of memory read from the disk, or whatever).
When you're using an operating system, you'll rely on device drivers to know all this for you. Programs are still written, every day, for platforms without operating systems, but rarely for a bare processor. Many small MPUs come with a support library, usually a set of routines that lets you manipulate whatever peripheral devices they support.
It can certainly be done. You typically write the code in C, and you pretty much have to do everything on your own, with no standard library. To set your pixel, you'd usually load a pointer to the physical address of the screen, and write the correct value to that pointer. Alternatively, on a PC you could consider using the VESA BIOS. In all honesty, it's fairly similar to the way most code for MS-DOS was written (most used MS-DOS to read and write data on disk, but little else).
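As a rough illustration of that "pointer to video memory" approach, here is a sketch assuming the machine has already been put into the classic VGA mode 13h (320x200, 256 colors) by the BIOS or your bootstrap code, and that the framebuffer at physical address 0xA0000 is directly addressable (no paging, or identity-mapped paging):

// Sketch: plot one red pixel in the top-left corner in VGA mode 13h.
// Assumes mode 13h is already set (e.g. via BIOS int 0x10 with AX=0x0013
// from the real-mode bootstrap) and that 0xA0000 is directly addressable.
#include <stdint.h>

void draw_red_pixel_top_left() {
    volatile uint8_t *vga = (volatile uint8_t *)0xA0000;   // start of the mode 13h framebuffer
    vga[0] = 4;   // offset y*320+x = 0 is the top-left pixel; palette index 4 is red in the default VGA palette
}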
The core bootloader and the part of the kernel that bootstraps the OS are written in assembly. See http://en.wikipedia.org/wiki/Booting for a brief write-up of how an operating system boots. There is no way I'm aware of to write a bootloader or kernel purely in a higher-level language such as C or C++ without using at least some assembly.
You need to write a bootstrapper and a loader combination followed by a payload which involves setting the VGA mode manually by interrupt, grabbing a handle to the basic video buffer and then writing a value to the 0th byte.
Start here: http://en.wikipedia.org/wiki/Bootstrapping_(computing)
Without an OS it's difficult to have a loader, which means no dynamic libc. You'd have to link statically, as well as have a decent amount of bootstrap code written in assembly (although it could be provided as object files which you could then link with). Also, since you'd be at the mercy of whatever the system has, you'd be stuck with the VESA video modes (unless you want to write your own graphics driver and subsystem, which you don't).
There is, but not generally from within the OS. Initially there is an asm stub that's executed from the MBR on the drive (see MBR). For x86 processors this is generally 16-bit real-mode code; it jumps into the operating system code from there and switches to 32-bit/64-bit mode depending on the operating system and chipset.