I have a large piece of legacy code that is written in Fortran 77. I am compiling it and running it with the Intel Fortran Compiler (Version 11?). I recently ran into an issue where the output file reached just shy of 2GB in size, and the output stopped getting written to disk.
I have hunted around looking to see if this is part of the Fortran 77 standard, or if I am simply missing a compiler flag or something, but haven't found anything that points towards my problem.
Changing the write statements isn't an option, since the legacy code is on the order of several hundred thousand lines. The worst-case scenario is that every few days I go in and truncate the earlier portions of the output into a different file, but I would prefer not to have to do this.
The most probable reason for this kind of behaviour is the memory model in use. In 64-bit mode there are three memory models, distinguished by the addressing mode used:
small model - RIP-relative addressing is used for everything, from calling functions to accessing data. RIP is the 64-bit instruction pointer register of x64 (the 64-bit extension of EIP), but the relative offset can only be a signed 32-bit number (and there are some restrictions that prevent the use of the full signed integer range), hence the combined code + static data size is limited to about 2 GiB.
medium model - program code is limited to 2 GiB, hence RIP-relative function calls, but data symbols are split into two types. Small data symbols are those that fit together with the code in the first 2 GiB, and they are addressed using the same RIP-relative method as in the small model. Large data symbols are accessed using register addressing, with the absolute address of the symbol loaded into a register; this is slower, but there is no limit on the addressable memory.
large model - all symbols are accessed using absolute addressing. There are no restrictions on the code or data size.
Most compilers that can target x64, ifort included, accept the -mcmodel=model option, which lets you control the memory model used. The default model is small. The size of your object file means that there is an enormous quantity of initialised static data, possibly some very large initialised arrays (think DATA or BLOCK DATA statements) or many smaller arrays (I doubt that even a million code statements would generate 2 GiB of instruction code). Compiling with -mcmodel=medium or -mcmodel=large should solve the problem with the large object file size.
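For illustration, here is a minimal sketch (in C++ rather than Fortran, with a made-up array size and file names) of static data large enough to exceed the small-model limit, together with possible build commands:

// Hypothetical example: ~3 GiB of zero-initialised static data. With the
// default small memory model the link step fails with "relocation
// truncated to fit" errors, because the data cannot be reached with
// 32-bit RIP-relative offsets; the medium or large model is required.
#include <cstddef>

static double big_table[3ULL * 1024 * 1024 * 1024 / sizeof(double)];

int main() {
    big_table[0] = 1.0;   // touch the data so the compiler keeps it
    return static_cast<int>(big_table[0]);
}

// Possible build commands (the Intel compilers accept the same option):
//   g++   -mcmodel=medium big_data.cpp -o big_data
//   ifort -mcmodel=medium big_data.f   -o big_data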
Note that linking together object files compiled with different memory models is a recipe for disaster - the whole application should be compiled with the same memory model.
Let's say I have a C++ program and I compile it using g++. I get an executable file which, as an example, has a size of 100 kB. I then add a couple of lines of C++ code and compile again, and the size of the executable has increased to 101 kB. Then I add the exact same block of C++ code and compile a third time. This time the executable has increased to 106 kB. Why does the same code sometimes increase the size of the executable by a small amount and other times by something much greater?
Also, the big increase only happens every couple of builds; most of the time the size increases by the same small amount.
There are a variety of reasons why the size change of the resulting binary is not linear with the code size change. This is particularly true if some kind of optimization is enabled.
Even in debug mode (no optimizations), the following things could cause this to happen:
The code size in the binary typically needs to be rounded up to a certain alignment (dependent on the hardware), so the size can only grow in multiples of the alignment (see the sketch after this list).
The same applies to metadata tables (relocation tables, debug information).
The compiler may reserve extra space for debug information based solely on the number of methods/variables in use.
With some compilers (not sure about gcc), code in a binary can be updated in place when only minor changes were made, instead of performing a full link on each build. This would result in different binary sizes when adding code and rebuilding versus deleting the binary before each build.
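A small illustration of the alignment point above (the 512-byte alignment is just an example; the real value depends on the file format and target):

#include <cstddef>
#include <cstdio>
#include <initializer_list>

// Round a section size up to the next multiple of the alignment
// (alignment must be a power of two).
constexpr std::size_t align_up(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

int main() {
    const std::size_t alignment = 512;   // e.g. a typical PE file alignment
    for (std::size_t code_bytes : {1000u, 1020u, 1300u, 1600u}) {
        std::printf("%zu bytes of code -> %zu bytes on disk\n",
                    code_bytes, align_up(code_bytes, alignment));
    }
    return 0;
}

Note how 1000 and 1020 bytes of code both occupy 1024 bytes on disk, while crossing an alignment boundary causes a visible jump.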
If optimizations are enabled, it gets even more confusing, due to possible optimization strategies:
The compiler may remove code it finds to be unreachable.
When optimizing for speed, loop unrolling is a good thing to do, but only up to a certain degree. If you add more code inside the loop, the compiler might decide that the extra code size is no longer worth the speed gain.
Other optimizations likewise work only up to a certain level, after which they do more harm than good. This can even result in the binary file getting smaller when you add code.
These are just a few possible reasons; there may be many more.
I need to create a matrix whose size is 10000x100000. My RAM is 4 GB. It works until the 25th iteration (in debug), but after the 25th iteration I get a "bad allocation" error, even though only 25% of my RAM is used, which suggests the problem is not the amount of memory. So what can I do?
EDIT:
int **arr = new int*[10000];
for (int i = 0; i < 10000; i++)
    arr[i] = new int[100000];
My allocation is above.
If you're compiling for x64, you shouldn't have any problems.
If you're compiling for x86 (most likely), you can enable the /LARGEADDRESSAWARE linker flag if you're using Visual C++, or something similar for other compilers. For Visual C++, the option can also be found in the Linker -> System -> Enable Large Addresses property in the IDE.
This sets a flag in the resulting EXE file telling the OS that the code can handle addresses over 2 GB. When running such an executable on x64 Windows (your case), the OS gives it 4 GB of address space to play with, as opposed to just 2 GB normally.
I tested your code on my system, Windows 7 x64, 8 GB, compiled with Visual C++ Express 2013 (x86, of course) with the linker flag, and the code ran fine - allocated almost 4 GB with no error.
Anyway, the 25th iteration is far too early for it to fail, regardless of where it runs and how it's compiled: 25 rows × 100,000 ints × 4 bytes is only about 10 MB, so there's something else going wrong in there.
By the way, the HEAP linker option doesn't help in this case, as it doesn't increase the maximum heap size, it just specifies how much address space to reserve initially and in what chunks to increase the amount of committed RAM. In short, it's mostly for optimization purposes.
A possible solution would be to use your hard drive.
Just open a file and store the data you need there, then copy only the data you currently need into a buffer.
Even if you succeed in allocating this amount of data on the heap, you will load it up with data you most likely won't be using most of the time. Eventually you might run out of space, and that will lead to either decreased performance or unexpected behaviour.
If you are worried that using the hard drive will hinder performance, then a procedural approach might fit your problem: if you can produce the data you need at any given moment instead of storing it all, that would solve your problem as well.
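A rough sketch of that file-backed idea (the file name and sizes are placeholders, not taken from the question):

#include <cstddef>
#include <fstream>
#include <vector>

int main() {
    const std::size_t rows = 10000, cols = 100000;
    std::vector<int> row(cols);   // one row fits easily in memory (~400 KB)

    // Write the matrix row by row (here just zero-filled rows).
    std::ofstream out("matrix.bin", std::ios::binary);
    for (std::size_t r = 0; r < rows; ++r)
        out.write(reinterpret_cast<const char*>(row.data()),
                  static_cast<std::streamsize>(row.size() * sizeof(int)));
    out.close();

    // Later: load only the row that is needed into the buffer.
    const std::size_t wanted = 1234;
    std::ifstream in("matrix.bin", std::ios::binary);
    in.seekg(static_cast<std::streamoff>(wanted * cols * sizeof(int)));
    in.read(reinterpret_cast<char*>(row.data()),
            static_cast<std::streamsize>(row.size() * sizeof(int)));
    return 0;
}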
If you are using VS, you'll probably want to try out the HEAP linker option and make sure you compile for an x64 target, because otherwise you'll run out of address space. The size of your physical memory should not be a limiting factor, as Windows can use the pagefile to provide additional memory.
However, from a performance point of view it is probably a horrible idea to just allocate a matrix of this size. Maybe you should consider using a sparse matrix, or (as suggested by LifePhilPsyPro) generating the data on demand.
For allocating extremely large buffers you are best off using the operating system services for mapping pages to the address space rather than new/malloc.
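For example, on Windows that might look roughly like the following (a sketch only, assuming a 64-bit build; error handling is minimal):

#include <windows.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t rows = 10000, cols = 100000;
    const std::size_t bytes = rows * cols * sizeof(int);   // ~4 GB, needs x64

    // Reserve and commit one contiguous region instead of 10000 new[] calls.
    int *matrix = static_cast<int*>(
        VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (!matrix) {
        std::printf("VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }

    matrix[0] = 42;   // access elements as matrix[row * cols + col]

    VirtualFree(matrix, 0, MEM_RELEASE);
    return 0;
}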
You are trying to allocate more than 4GB.
I'm trying to reserve some bytes of SRAM whose address MUST be known at load time so that it can be fitted into PROGSPACE. Until now I have tested my code OK with a tricky allocation on the Arduino Nano board by setting the address to (0x1F6), and later in the program I do
volatile byte shifty_data[3];
to ensure it's not overwritten in the heap...
The code is working OK, but I'm not happy with this because it's not compatible with other cores and possibly not with environment changes.
So far I've considered changing malloc's __heap_start (with no success, because it's not constant and the address is not known at load time, I think). I've also looked at avr/io.h and specifically at iom328p.h through the RAMSTART defines; this might work, but it seems too low-level in the system, since I want to use hardware SPI on top of it, and there might be a better way of doing this at a higher level, preferably within the Arduino files.
Any ideas?
I am not quite sure what you are asking here, but there are two crucial things that I believe you are misunderstanding. Program space is separate from SRAM; they are on two different address buses. AVR actually provides an instruction to copy data from program space into RAM because of this separation [making it a modified Harvard architecture]. Also, any globally declared variable will reside in either the .bss or the .data section of SRAM [this is actually part of the C standard]. The __do_copy_data and __do_clear_bss routines in the final executable take care of that [they are automatically added to the .init4 section]. You can override this mechanism using compiler flags; however, the address of each global variable is known from the time the program starts executing [something that happens out of flash memory, not SRAM].
Now, for placing things at a fixed location in SRAM, I suggest you take a look at the memory-sections page of the avr-libc manual. It deals with memory sections in general and how to tweak them. Cheers.
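As a concrete (and hedged) sketch of what that page describes: you can give the buffer its own named section and pin that section to a fixed SRAM address at link time. The section name and address here are examples only; pick an address that does not collide with .data, .bss, or the stack on your particular part:

#include <Arduino.h>

// Placed in a custom section instead of .bss/.data; the linker flag below
// pins the section to a fixed SRAM address that is known at link time.
volatile byte shifty_data[3] __attribute__((section(".fixed_buf")));

void setup() {
    shifty_data[0] = 0x55;   // the address is fixed by the linker, not malloc
}

void loop() {}

// Extra linker flag for the build (avr-gcc views SRAM at an 0x800000 offset
// in the linker's address space):
//   -Wl,--section-start=.fixed_buf=0x8008FD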
Disclaimer: I apologize for the verbosity of this question (I think it's an interesting problem, though!), yet I cannot figure out how to more concisely word it.
I have done hours of research into the apparent myriad of ways to solve the problem of accessing multi-GB files in a 32-bit process on 64-bit Windows 7, ranging from /LARGEADDRESSAWARE to VirtualAllocEx AWE. I am somewhat comfortable writing a multi-view memory-mapped system in Windows (CreateFileMapping, MapViewOfFile, etc.), yet I can't quite escape the feeling that there is a more elegant solution to this problem. Also, I'm quite aware of Boost's interprocess and iostream templates, although they appear to be rather lightweight, requiring a similar amount of effort to writing a system using only Windows API calls (not to mention that I already have a memory-mapped architecture semi-implemented using Windows API calls).
I'm attempting to process large datasets. The program depends on pre-compiled 32-bit libraries, which is why, for the moment, the program itself is also running in a 32-bit process, even though the system is 64-bit, with a 64-bit OS. I know there are ways in which I could add wrapper libraries around this, yet, seeing as it's part of a larger codebase, it would indeed be a bit of an undertaking. I set the binary headers to allow for /LARGEADDRESSAWARE (at the expense of decreasing my kernel space?), such that I get up to around 2-3 GB of addressable memory per process, give or take (depending on heap fragmentation, etc.).
Here's the issue: the datasets are 4+ GB, and have DSP algorithms run on them that require essentially random access across the file. A pointer to the object generated from the file is handled in C#, yet the file itself is loaded into memory (with this partial memory-mapped system) in C++ (it's P/Invoked). Thus, I believe the solution is unfortunately not as simple as adjusting the windowing to access the portion of the file I need, as essentially I still want the entire file abstracted behind a single pointer, from which I can call methods to access data almost anywhere in the file.
Apparently, most memory-mapped architectures rely upon splitting the single process into multiple processes... so, for example, I'd access a 6 GB file with 3 processes, each holding a 2 GB window into the file. I would then need to add a significant amount of logic to pull and recombine data from across these different windows/processes. VirtualAllocEx apparently provides a method of increasing the virtual address space, but I'm still not entirely sure whether this is the best way of going about it.
But let's say I want this program to function just as "easily" as a single 64-bit process on a 64-bit system. Assume that I don't care about thrashing; I just want to be able to manipulate a large file on the system, even if only, say, 500 MB were loaded into physical RAM at any one time. Is there any way to obtain this functionality without having to write a somewhat ridiculous manual memory system by hand? Or is there some better way than what I have found by combing SO and the internet thus far?
This lends itself to a secondary question: is there a way of limiting how much physical RAM would be used by this process? For example, what if I wanted to limit the process to only having 500 MB loaded into physical RAM at any one time (whilst keeping the multi-GB file paged on disk)?
I'm sorry for the long question, but I feel as though it's a decent summary of what appear to be many questions (with only partial answers) that I've found on SO and the net at large. I'm hoping that this can be an area wherein a definitive answer (or at least some pros/cons) can be fleshed out, and we can all learn something valuable in the process!
You could write an accessor class to which you give a base address and a length. It returns the data, or throws an exception (or however else you want to signal error conditions) if an error condition arises (out of bounds, etc.).
Then, any time you need to read from the file, the accessor object can call SetFilePointerEx() before calling ReadFile(). You can pass the accessor to the constructor of whatever objects you create when you read the file. Those objects use the accessor to read the data from the file, and the accessor returns the data to each object's constructor, which parses it into object data.
If, later down the line, you're able to compile to 64-bit, you can just change (or extend) the accessor class to read from memory instead.
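A bare-bones sketch of such an accessor might look like this (the names and error handling are mine, purely illustrative):

#include <windows.h>
#include <stdexcept>
#include <vector>

class FileAccessor {
public:
    explicit FileAccessor(const wchar_t *path)
        : file_(CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr)) {
        if (file_ == INVALID_HANDLE_VALUE)
            throw std::runtime_error("cannot open file");
    }
    ~FileAccessor() { CloseHandle(file_); }

    // Read `length` bytes starting at absolute offset `base` in the file.
    std::vector<char> read(long long base, DWORD length) const {
        LARGE_INTEGER pos;
        pos.QuadPart = base;
        if (!SetFilePointerEx(file_, pos, nullptr, FILE_BEGIN))
            throw std::runtime_error("seek failed (out of bounds?)");

        std::vector<char> buffer(length);
        DWORD got = 0;
        if (!ReadFile(file_, buffer.data(), length, &got, nullptr) ||
            got != length)
            throw std::runtime_error("short read");
        return buffer;
    }

private:
    HANDLE file_;
};

An object being constructed from the file would then call something like accessor.read(offset, size) for each chunk it needs to parse.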
As for limiting the amount of RAM used by the process... that's mostly a matter of making sure that
A) you don't have memory leaks (especially obscene ones), and
B) you destroy objects you don't need at that very moment. Even if you will need an object later down the line, as long as its data won't change, just destroy it. Then recreate it later when you do need it, allowing it to re-read the data from the file.
The following question is a head-scratcher for me. Assume that I have two platforms with identical hardware, the same OS, and the same compiler on them. If I compile exactly the same application, can I be sure that the memory layout on both machines will be exactly the same? In other words, do both applications have exactly the same virtual address space, or is there a high chance that this is not the case?
Thanks for your thoughts about this!
You can't count on it. As a security feature, some OS's (including Windows) randomize memory layout to some extent.
(Here's a supporting link: http://blogs.msdn.com/b/winsdk/archive/2009/11/30/how-to-disable-address-space-layout-randomization-aslr.aspx)
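You can see this for yourself with a few lines of code; the printed addresses typically differ between runs on Linux with PIE enabled, and the stack address differs between runs on Windows as well (image bases on Windows are re-randomized at boot):

#include <cstdio>

static int some_global = 42;
static void marker() {}

int main() {
    int local = 0;
    // With ASLR, code, data and stack addresses are not stable across runs.
    std::printf("code %p, data %p, stack %p\n",
                reinterpret_cast<void*>(&marker),
                static_cast<void*>(&some_global),
                static_cast<void*>(&local));
    return 0;
}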
It is highly improbable that an application will be loaded at the same addresses twice on the same machine, let alone on another computer. Other applications may be running, which will affect where the OS loads your application.
Another point to consider is that some applications load run-time libraries (a.k.a. DLLs & shared libraries) on demand. An application may have a few DLLs loaded or not when your application is running.
On non-embedded platforms, the majority of applications don't care about exact physical memory locations, nor is it a concern that they are loaded at the same location each time. Most embedded platforms load their applications at the same place each time, as they don't have enough memory to move them around.
Because of these cases and the situations other people have mentioned, DO NOT code constant-memory-location assumptions into your program. Very bad things will happen, and they will be especially difficult to trace and debug.
Apart from dynamic questions such as stack addresses, which Steven points out, there is also the aspect of compile time and static layout.
To begin with, two machines being exact clones of each other is already a very particular situation, since you may have tiny differences in CPU version, libraries, etc. Then some compilers (perhaps depending on certain options) also put the compile time and date in the executable. If, e.g., your two hostnames have different lengths, or a date format that varies in length is used, not only will these strings be different, but all other static variables might be slightly shifted in the address space.
I remember that gcc had difficulties on some architectures with its bootstrap build, since the compiler produced in stage 2 differed from the one built in stage 3 for exactly such silly reasons.
The __TIME__ macro expands to (the start of) the compilation time. Furthermore, it's determined
independently for each and every .cpp file that you compile, and the linker can eliminate duplicate strings.
As a result, depending on the compile speed, your executables may end up not just with different __TIME__ strings, but even a different number of __TIME__ strings.
If you're working late, you could see the same with __DATE__ strings ;)
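For example, each translation unit embeds its own timestamp, which is why two .cpp files built a second apart can carry different strings:

#include <cstdio>

int main() {
    // __DATE__ and __TIME__ are fixed when *this* file is compiled.
    std::printf("this translation unit was compiled on %s at %s\n",
                __DATE__, __TIME__);
    return 0;
}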
Is it possible for them to have the same memory layout? Yes, it is a possibility. Is it probable? Not really.
As others have pointed out, things like address space layout randomization and __TIME__ macros can cause the address space to differ (whether the changes are introduced at compile time or at run time). In my experience, many compilers don't produce identical output when run twice on the same machine with exactly the same input (functions are laid out in memory in different orders, etc.).
Is this a rhetorical/intellectual question, or is this causing you to run into some kind of problem with a program you are writing?