I'm using a proprietary library to import data, it uses the GPU (OpenGL) to transform the data.
Running my program through Valgrind (memcheck) causes the data import to take 8-12 hours (instead of a fraction of a second). I need to do my Valgrind sessions overnight (and leave my screen unlocked all night, since the GPU stuff pauses while the screen is locked). This is causing a lot of frustration.
I'm not sure if this is related, but Valgrind shows thousands of out-of-bound read/write errors in the driver for my graphics card:
==10593== Invalid write of size 4
==10593== at 0x9789746: ??? (in /usr/lib/x86_64-linux-gnu/dri/i965_dri.so)
(I know how to suppress those warnings).
I have been unable to find any ways of selectively instrumenting code, or excluding certain shared libraries from being instrumented. I remember using a tool on Windows 20 years or so ago that would skip instrumenting selected binaries. It seems this is not possible with memcheck:
Is it possible to make valgrind ignore certain libraries? -- 2010, says this is not possible.
Can I make valgrind ignore glibc libraries? -- 2010, solutions are to disable warnings.
Restricting Valgrind to a specific function -- 2011, says it's not possible.
using valgrind at specific point while running program -- 2013, no answers.
Valgrind: disable conditional jump (or whole library) check -- 2013, solutions are to disable warnings.
...unless things have changed in the last 6 or 7 years.
My question is: Is there anything at all that can be done to speed up the memory check? Or to not check memory accesses in certain parts of the program?
Right now the only solution I see is to modify the program to read data directly from disk, but I'd rather test the actual program I'm planning to deploy. :)
No, this is not possible. When you run an application under Valgrind it is not running natively under the OS but rather in a virtual environment.
Some of the tools, like Callgrind, have options to control the instrumentation. However, even with instrumentation off, the application under test is still running inside the Valgrind virtual environment.
There are a few things you can do to make things less slow:
Test an optimized build of your application. You will lose some line-number information as a result, however.
Turn off leak detection (--leak-check=no) if you don't need it.
Avoid costly options like --track-origins=yes.
The sanitizers are faster and can also detect stack overflows, but at the cost of requiring compile-time instrumentation, i.e. rebuilding your code.
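One more thing: if the goal is merely to silence the driver noise around the import call (it will not make anything faster), recent Valgrind versions provide client-request macros you can compile into the program. A minimal sketch, assuming a hypothetical wrapper for the proprietary call:

#include <valgrind/valgrind.h>

extern void proprietary_gpu_import(void);  /* hypothetical name for the library call */

void importData(void)
{
    /* Stops Memcheck from *reporting* errors on this thread; the code
       still runs on Valgrind's synthetic CPU, so it is just as slow. */
    VALGRIND_DISABLE_ERROR_REPORTING;
    proprietary_gpu_import();
    VALGRIND_ENABLE_ERROR_REPORTING;
}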
I've already posted this question here, but since it's maybe not that Qt-specific, I thought I might try my chance here as well. I hope it's not inappropriate to do that (just tell me if it is).
I’ve developed a small scientific program that performs some mathematical computations. I’ve tried to optimize it so that it’s as fast as possible. Now I’m almost done deploying it for Windows, Mac and Linux users. But I have not been able to test it on many different computers yet.
Here’s what troubles me: To deploy for Windows, I’ve used a laptop which has both Windows 7 and Ubuntu 12.04 installed on it (dual boot). I compared the speed of the app running on these two systems, and I was shocked to observe that it’s at least twice as slow on Windows! I wouldn’t have been surprised if there were a small difference, but how can one account for such a difference?
Here are a few details:
The computations I make the program do are just some brutal and stupid mathematical calculations: basically, it computes products and cosines in a loop that is called a billion times. On the other hand, the computation is multi-threaded: I launch something like 6 QThreads.
The laptop has two cores @ 1.73 GHz. At first I thought that Windows was probably not using one of the cores, but then I looked at the processor activity: according to the small graph, both cores are running at 100%.
Then I thought the C++ compiler for Windows wasn't using the optimization options (things like -O1, -O2) that the C++ compiler for Linux automatically uses in release builds, but apparently it does.
I'm bothered that the app is so much slower (2 to 4 times) on Windows, and it's really weird. On the other hand, I haven't tried it on other computers with Windows yet. Still, do you have any idea why the difference?
Additional info: some data…
Even though Windows seems to be using the two cores, I'm thinking this might have something to do with thread management; here's why:
Sample Computation n°1 (this one launches 2 QThreads):
PC1-windows: 7.33s
PC1-linux: 3.72s
PC2-linux: 1.36s
Sample Computation n°2 (this one launches 3 QThreads):
PC1-windows: 6.84s
PC1-linux: 3.24s
PC2-linux: 1.06s
Sample Computation n°3 (this one launches 6 QThreads):
PC1-windows: 8.35s
PC1-linux: 2.62s
PC2-linux: 0.47s
where:
PC1-windows = my 2-core laptop (@ 1.73 GHz) with Windows 7
PC1-linux = my 2-core laptop (@ 1.73 GHz) with Ubuntu 12.04
PC2-linux = my 8-core laptop (@ 2.20 GHz) with Ubuntu 12.04
(Of course, it's not shocking that PC2 is faster. What's incredible to me is the difference between PC1-windows and PC1-linux).
Note: I've also tried running the program on a recent PC (4 or 8 cores @ ~3 GHz, I don't remember exactly) under Mac OS; the speed was comparable to PC2-linux (or slightly faster).
EDIT: I'll answer here a few questions I was asked in the comments.
I just installed Qt SDK on Windows, so I guess I have the latest version of everything (including MinGW?). The compiler is MinGW. Qt version is 4.8.1.
I use no optimization flags because I noticed that they are automatically used when I build in release mode (with Qt Creator). It seems to me that if I write something like QMAKE_CXXFLAGS += -O1, this only has an effect in debug builds.
Lifetime of threads etc.: this is pretty simple. When the user clicks the "Compute" button, 2 to 6 threads are launched simultaneously (depending on what is being computed), and they are terminated when the computation ends. Nothing too fancy. Every thread just does brutal computations (except one, actually, which makes a (not so) small computation every 30 ms, basically checking whether the error is small enough).
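For reference, a minimal sketch of that setup (the class and function names are invented, not my actual code):

#include <QThread>
#include <QList>
#include <cmath>

// Each worker grinds through products and cosines, as described above.
class ComputeThread : public QThread
{
protected:
    void run()
    {
        double acc = 1.0;
        for (long i = 0; i < 100000000L; ++i)
            acc *= 1.0 + 1e-9 * std::cos(i * 1e-6);
        result = acc;
    }
public:
    double result;
};

// Called from the "Compute" button's slot.
void startComputation(int threadCount)
{
    QList<ComputeThread *> threads;
    for (int i = 0; i < threadCount; ++i) {
        ComputeThread *t = new ComputeThread;
        threads.append(t);
        t->start();            // workers run concurrently
    }
    foreach (ComputeThread *t, threads) {
        t->wait();             // block until this worker finishes
        delete t;
    }
}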
EDIT: latest developments and partial answers
Here are some new developments that provide answers about all this:
I wanted to determine whether the difference in speed really had something to do with threads or not. So I modified the program so that the computation only uses 1 thread; that way we are pretty much comparing the performance of "pure C++ code". It turned out that Windows was now only slightly slower than Linux (something like 15%). So I guess that a small (but not insignificant) part of the difference is intrinsic to the system, but the largest part is due to thread management.
As someone (Luca Carlon, thanks for that) suggested in the comments, I tried building the application with the Microsoft Visual Studio compiler (MSVC) instead of MinGW. And surprise: the computation (with all the threads and everything) was now "only" 20% to 50% slower than on Linux! I think I'm going to go ahead and be content with that. Weirdly, though, I noticed that the "pure C++" computation (with only one thread) was now even slower (than with MinGW), which must account for the overall difference. So as far as I can tell, MinGW is slightly better than MSVC except that it handles threads like a moron.
So I'm thinking either there's something I can do to make MinGW (ideally I'd rather use it than MSVC) handle threads better, or it just can't. I would be amazed if that were the case: how could it not be well known and documented? Although I guess I should be careful about drawing conclusions too quickly; I've only compared things on one computer (for the moment).
Another possibility: on Linux the Qt libraries may already be loaded (for example if you use KDE), while on Windows they must be loaded from scratch, which slows down the computation. To check how much time library loading costs your application, you could write a dummy test with pure C++ code.
I have noticed exactly the same behavior on my PC.
I am running Windows 7 (64-bit), Ubuntu (64-bit) and OSX (Lion, 64-bit), and my program compares 2 XML files (more than 60 MB each). It uses multithreading too (2 threads):
- Windows: 40 sec
- Linux: 14 sec (!!!)
- OSX: 22 sec
I use a personal class for threads (not the Qt one) which uses "pthread" on Linux/OSX and Windows threads on Windows.
I use the Qt/MinGW compiler as I need the XML classes from Qt.
I have found no way (for now) to get the 3 OSes to deliver similar performance... but I hope I will!
I think that another reason may be memory: my program uses about 500 MB of RAM. So I think Unix manages memory best, because even single-threaded, Windows is exactly 1.89 times slower, and I don't think thread handling alone could explain a factor of almost 2!
I have heard of one case where Windows was extremely slow with writing files if you do it wrongly. (This has nothing to do with Qt.)
The problem in that case was that the developer used a SQLite database, wrote some 10000 datasets, and did a SQL COMMIT after each insert. This caused Windows to write the whole DB file to disk each time, while Linux would only update the buffered version of the filesystem inode in the RAM. The speed difference was even worse in that case: 1 second on Linux vs. 1 minute on Windows. (After he changed SQLite to commit only once at the end, it was also 1 second on Windows.)
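In code, the difference looks roughly like this (a sketch against the sqlite3 C API; the table and row count are invented):

#include <sqlite3.h>
#include <stdio.h>

void insertAll(sqlite3 *db)
{
    /* One transaction around all inserts: one flush to disk at the end,
       instead of one full flush per row. */
    sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
    for (int i = 0; i < 10000; ++i) {
        char sql[128];
        snprintf(sql, sizeof sql, "INSERT INTO results VALUES (%d)", i);
        sqlite3_exec(db, sql, NULL, NULL, NULL);
    }
    sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);  /* the only commit */
}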
So if you're writing the results of your computation to disk, you might want to check if you're calling fsync() or fflush() too often. If your writing code comes from a library, you can use strace for this (Linux-only, but should give you a basic idea).
You might experience performance differences by how mutexes run on Windows and Linux.
Pure mutex code on Windows can incur a 15 ms wait every time there is contention for the resource when locking. A better-performing synchronization mechanism on Windows is the Critical Section: it doesn't incur the locking penalty that regular mutexes experience in most cases.
I have found that on Linux, regular mutexes perform the same as Critical Sections on Windows.
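For illustration, the two variants look like this in Win32 (standard calls, error handling omitted):

#include <windows.h>

CRITICAL_SECTION g_cs;   /* user-mode object; spins briefly before sleeping      */
HANDLE g_mutex;          /* kernel object; a contended wait means a system call  */

void initLocks(void)
{
    InitializeCriticalSection(&g_cs);
    g_mutex = CreateMutex(NULL, FALSE, NULL);
}

void withCriticalSection(void)
{
    EnterCriticalSection(&g_cs);
    /* ...touch shared state... */
    LeaveCriticalSection(&g_cs);
}

void withMutex(void)
{
    WaitForSingleObject(g_mutex, INFINITE);
    /* ...touch shared state... */
    ReleaseMutex(g_mutex);
}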
It's probably the memory allocator; try using jemalloc or tcmalloc from Google. Glibc's ptmalloc3 is significantly better than the old, crusty allocator in MSVC's CRT. The comparable option from Microsoft is the Concurrency Runtime, but you cannot simply drop it in as a replacement.
I have a problem with Qt Creator, or one of its components.
I have a program which needs a lot of memory (about 4 GB) and I use calloc to allocate it. If I compile the C code with MinGW/gcc (without using the Qt framework) it works, but if I compile it within Qt Creator (with the C code embedded in the Qt framework, using C++), using the MinGW/gcc toolchain, calloc returns a null pointer.
I already searched and found the .pro option QMAKE_LFLAGS += -Wl,--large-address-aware, which worked for some cases (around 3.5 GB), but if I go above 4 GB, it only works with the C code compiled with gcc, not with Qt.
How can I allocate the needed amount of memory using calloc when compiling with Qt Creator?
So your Cygwin toolchain builds 64-bit applications for you. The amount of memory that a 64-bit application can address is 2^64 bytes, which far exceeds 4 GB. But Qt Creator (if you installed it from the Qt SDK and did not reconfigure it manually) uses Qt's toolchain, which builds 32-bit applications. A 32-bit application can theoretically allocate 4 GB of memory, but do not forget that all libraries will also be loaded into that address space. In practice, you can allocate about 3 GB of memory, and not in one contiguous chunk.
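A quick way to confirm which kind of binary each toolchain produces is to check the pointer size at run time and try the allocation; a small illustrative test:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    printf("pointer size: %u bits\n", (unsigned)(8 * sizeof(void *)));

    unsigned long long want = 4ULL * 1024 * 1024 * 1024;  /* 4 GB */
    if (want > (size_t)-1) {
        /* The request doesn't even fit in size_t: this is a 32-bit target. */
        printf("4 GB doesn't fit in size_t -> 32-bit build\n");
        return 0;
    }
    void *p = calloc((size_t)want, 1);
    printf("calloc(4 GB): %s\n", p ? "succeeded" : "failed");
    free(p);
    return 0;
}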
You have 3 ways to solve your problem:
Reconsider your algorithm. Do not allocate 4 GB of RAM; use smarter data structures, a disk cache, etc. I believe that if your problem actually required more than 4 GB of memory to solve, you wouldn't be asking this question.
Separate your Qt code from your C program. Then you can still use a 64-bit-target compiler for the C program and a 32-bit-target compiler for the Qt/C++ part. You can communicate with your C program through any interprocess communication mechanism (standard input/output streams are often enough); see the sketch after this list.
Move to 64 bit. I mean, use a 64-bit-target compiler for both the C and the C++ code. But it is not as simple as one might think: you'll need to rebuild Qt in 64-bit mode. It is possible with some modules turned off and some code fixups (I've tried it once), but 64-bit Windows is not officially supported.
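For option 2, the Qt side of the stdin/stdout route could look roughly like this (a sketch; the helper name bigcalc64.exe is invented):

#include <QProcess>
#include <QByteArray>

// The 32-bit Qt GUI hands the heavy work to a separate
// 64-bit C program and reads the answer back.
QByteArray runBigComputation(const QByteArray &input)
{
    QProcess proc;
    proc.start("bigcalc64.exe");      // hypothetical 64-bit helper
    if (!proc.waitForStarted())
        return QByteArray();
    proc.write(input);                // request goes out on its stdin
    proc.closeWriteChannel();
    proc.waitForFinished(-1);         // -1 = no timeout
    return proc.readAllStandardOutput();
}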
So I take my C++ program in Visual Studio, compile it, and it spits out a nice little EXE file. But EXEs will only run on Windows, and I hear a lot about how C/C++ compiles into assembly language, which runs directly on a processor. The EXE runs with the help of Windows, or I could have a program that makes an executable that runs on a Mac. But aren't I compiling C++ code into assembly language, which is processor-specific?
My Insights:
I'm guessing I'm probably not. I know there's an Intel C++ compiler, so would it make processor-specific assembly code? EXEs run on Windows, so they take advantage of tons of things already set up, from graphics packages to the massive .NET framework. A processor-specific executable would be literally starting from scratch, with just the instruction set of the processor.
Would this executable be a file type? We could be running Windows and open it, but would control then switch to the processor only? I assume this executable would be something like an operating system, in that it would have to be run before anything else was booted up, and would have only the processor's instruction set to "use".
Let's think about what "run" means...
Something has to load the binary codes into memory. That's an OS feature. The .EXE or binary executable file or bundle or whatever, is formatted in a very OS-specific way so that the OS can load it into memory.
Something has to turn control over to those binary codes. There's the OS, again.
The I/O routines (in C++, but this is true in most places) are just a library that encapsulate OS API's. Drat that OS, it's everywhere.
Reminiscing.
In the olden days (yes, I'm this old) I worked on machines that didn't have OS's. We also didn't have C.
We wrote machine codes using tools like "assemblers" and "linkers" to create big binary images that we could load into the machine. We had to load these binary images through a painful bootstrap process.
We'd use front panel keys to load enough code into memory to read a handy device like a punched paper-tape reader. This would load a small piece of fairly standard boot linking loader software. (We used mylar tape so it wouldn't wear out.)
Then, when we had this linking loader in memory, we could feed the tape we'd prepared earlier with the assembler.
We wrote our own device drivers. Or we used library routines that were in source form, punched on paper tapes.
A "patch" was actually patched pieces of paper tape. Plus, since there were also little bugs, we'd have to adjust the memory image based on hand-written instructions -- patches that hadn't been put into the tape.
Later, we had simple OS's that had simple API's, simple device drivers, and a few utilities like a "file system", an "editor" and a "compiler". It was for a language called Jovial, but we also used Fortran sometimes.
We had to solder serial interface boards so we could plug in a device. We had to write device drivers.
Bottom Line.
You can easily write C++ programs that don't require an OS.
Learn about the hardware BIOS (or BIOS-like) facilities that are part of your processor's chipset. Most modern hardware has a simple OS wired into ROM that does power-on self-test (POST), loads a few simple drivers, and locates boot blocks.
Learn how to write your own boot block. That is the first proper "software" thing that's loaded after POST. This isn't all that hard. You can use various partitioning tools to force your boot block program onto a disk and you'll have complete control over the hardware. No OS.
Learn how GRUB, LILO or BootCamp launch an OS. It's not complicated. Once they're booted, they can load your program and you're off and running. This is slightly simpler because you create the kind of partition that a boot loader wants to load. Base yours on the Linux kernel and you'll be happier. Don't try to figure out how Windows boots -- it's too complicated.
Read up on ELF. http://en.wikipedia.org/wiki/Executable_and_Linkable_Format
Learn how device drivers are written. If you don't use an OS, you'll need to write device drivers.
The problem is that the OS really does a lot to start your programs. The EXE file itself has header information on it that Windows recognizes, identifying itself as an EXE file. Your app does everything, from filesystem access to memory allocations, through the OS.
But yes, you CAN run apps compiled for Windows/Intel on other platforms without emulation. If you want to run your EXE on a Mac or UNIX, you will need to install a bit more software to do the work that Windows would do to run your program; take a look at the "Wine" project.
What you're talking about is what's known in the embedded world as a "bare-metal" application. They're very common for things like an ARM Cortex-M3 that goes in (say) a debit-card validator box or an interactive toy and doesn't have enough memory or capability to run a full operating system. So, instead of getting an "ARM/Linux" compiler that would compile an application to run on Linux on an ARM processor, you get an "ARM bare-metal" compiler that compiles things to run on an ARM processor without an operating system. (I'm using ARM rather than x86 as an example because x86 bare-metal applications are really quite rare these days.)
As stated in your question and the other answers, your application will need to do some things that would otherwise be taken care of by the operating system.
First, it needs to initialize the memory system, the interrupt vectors, and various other bits of board goo. Typically this is something that a bare-metal compiler will do for you, though if you have a weird board, you may need to tell it how to do that. This gets things from the point where the board turns on to the point where your main() function starts.
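To make that concrete, here is roughly what such startup code does before main() gets control. This is a generic sketch; the _sidata/_sdata-style symbols are a common linker-script convention and will differ per toolchain:

/* Symbols defined by the linker script (names are the usual convention). */
extern unsigned int _sidata, _sdata, _edata, _sbss, _ebss;
int main(void);

void Reset_Handler(void)
{
    unsigned int *src = &_sidata;
    unsigned int *dst;

    for (dst = &_sdata; dst < &_edata; )  /* copy initialized data from flash */
        *dst++ = *src++;
    for (dst = &_sbss; dst < &_ebss; )    /* zero out the BSS */
        *dst++ = 0;

    main();                               /* finally, enter the application */
    for (;;) ;                            /* nowhere to return to on bare metal */
}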
Then, you need to interact with things outside the CPU and RAM. An operating system includes all sorts of functions for doing this -- disk I/O, screen output, keyboard and mouse input, networking, etc., so forth, and so on. Without an operating system, you have to get that from somewhere else. You may get some of that from libraries from your hardware manufacturer; for instance, a board I was recently playing with has a 40x200-pixel LED screen, and it came with a library with the code to turn that on and set individual pixel values on it. And there are several companies selling libraries to implement a TCP/IP stack and things like that, for doing networking or whatnot.
Consider, for example, that this makes it difficult to do even a basic printf. When you have an operating system, printf just sends a message to the operating system that says "put this string on the console", and the operating system finds the current cursor position on the console, and does all the stuff to figure out what pixels to change on the screen, and what CPU instructions to use to change those pixels, in order to do that.
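For contrast, a bare-metal "console" often boils down to poking a memory-mapped device register yourself. A minimal sketch (the UART register address is made up; real hardware would also poll a transmitter-ready flag before each write):

/* Hypothetical memory-mapped UART transmit register. */
#define UART0_DR (*(volatile unsigned int *)0x4000C000)

void uart_putc(char c)
{
    /* A real driver would first wait until the UART is ready to send. */
    UART0_DR = (unsigned int)c;
}

void uart_puts(const char *s)
{
    while (*s)
        uart_putc(*s++);
}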
Oh, and did we mention that you first have to figure out how to get the program into the CPU? A typical computer has a bit of programmable ROM that it will load instructions from when it starts up. On an x86, this is the BIOS, and it usually already contains a handy program that gets the CPU started, sets up the display, looks for disks, and loads a program off the disk that it finds. On an embedded system, that's typically where your program goes -- which means you need some way to put your program there. Often, that means you have a device called a "debugger" that's physically attached to your embedded board that loads the program -- and can also do things that allow you to pause the processor and determine what its state is, so that you can step through your program just as if you were running it in a software debugger on your computer. But I digress.
Anyway, to answer your second question, this executable that you'd create is something that gets stored in that ROM on your embedded board -- or perhaps you'd just store a bit of it in ROM (which is, after all, pretty small) and store the rest on a flash drive, and the bit in ROM would include the instructions to get the rest of it off the flash drive. It would probably be stored as a file on your main computer (that is, the Linux or Windows computer where you're creating it), but that's just for storage, it wouldn't run there.
You'll notice that when you've got a lot of these libraries together, they're doing a fair bit of what an operating system does, and there's sort of this space between the pile of libraries and a real operating system. In that space goes what's called an RTOS -- "real-time operating system". The smaller ones of these are really just collections of libraries that work together to do all the operating-systemy things, and sometimes also include stuff so you can run multiple threads at once (and then you can have different threads act like different programs) -- though all of this is all compiled into the same compiled "program", and the RTOS is really nothing more than a library you've included. Larger ones start storing parts of the code in separate places, and I think some of them can even load pieces of code off of disks -- just like Windows and Linux do when running a program. It's sort of a continuum, rather than an either/or.
The FreeRTOS system is an open-source RTOS that's towards the smaller end of the RTOS space; they might be a good place to look at some of this if you're more interested. They do have some examples of x86 applications, which would give you an idea of what sort of x86 systems would run a bare-metal or RTOS-based program and how you'd compile something to run on one; link here: http://www.freertos.org/a00090.html#186.
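To give a flavor of the "RTOS as a library" idea, here is a minimal FreeRTOS-style sketch (the task, its name, and the timing are invented; configMINIMAL_STACK_SIZE comes from the project's FreeRTOSConfig.h):

#include "FreeRTOS.h"
#include "task.h"

/* A task is just a function you hand to the scheduler. */
static void blinkTask(void *params)
{
    (void)params;
    for (;;) {
        /* toggle an LED, poll a sensor, ... */
        vTaskDelay(pdMS_TO_TICKS(500));  /* sleep 500 ms, letting others run */
    }
}

int main(void)
{
    xTaskCreate(blinkTask, "blink", configMINIMAL_STACK_SIZE, NULL, 1, NULL);
    vTaskStartScheduler();  /* does not return if startup succeeded */
    for (;;) ;
}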
The computer is not the CPU. To do anything useful, the CPU has to be connected to memory and IO controllers and other devices. An OS takes care of abstracting all of that from running programs. So, if you want to write a program that runs without an OS, your program will have to replicate at least some features of an OS: Taking over from the BIOS during the boot process, initializing devices, communicating with the disk controller to load code and data, communicating with the display controller to show information to the user, communicating with the keyboard controller and the mouse controller to read user input etc etc etc.
Unless you are building an embedded system with specialized hardware, there is no point in doing this. Besides, running your program would mean the user would have to give up running other programs. While this may be acceptable for an ATM today or WordStar in 1984, these days people frown on not being able to check email while listening to music.
Sure, they exist. They are called cross compilers. For example, that's how I can program for the iPhone platform using Xcode.
A related type of compiler is one that compiles for a virtual platform. That's how Java works.
Any given compiler/toolset produces code for a particular processor/OS combination. So your Visual Studio compile example produces code for x86/Windows. That .EXE will only run on x86/Windows and not on (for example) ARM/Windows (as used by some cellphones).
To produce code for a processor/OS combination other than what you're running the compiler on requires what is generally referred to as a cross-compiler. If you have a full professional Visual Studio subscription, you can get the ARM cross compiler, which will allow you to produce ARM/Windows .EXE files which won't run on your desktop machine, but WILL run on an ARM/Windows based cellphone or palmtop.
Yes, you can make an executable that runs on the 'bare metal' of a processor. Obviously that's how operating system kernels work. The main thing you need to do is create an executable that uses no libraries whatsoever. However, the "no libraries" restriction includes the C standard library! So that means no malloc, no printf, etc. You have to basically be your own OS and manage memory and I/O yourself. This will inevitably require a fair bit of work directly in assembly at some stage.
You also lose several other luxuries, such as main(), which can't be the starting point of your program since main() is something that is invoked by the OS and the C runtime environment.
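A tiny illustration of that point, built with something like GCC's -ffreestanding and -nostdlib flags (the entry-point name is whatever your linker is told to use):

/* No C runtime and no main(): the linker's entry symbol replaces it. */
extern "C" void _start(void)
{
    /* No malloc, no printf -- only what you write yourself.         */
    /* On bare metal there is nothing to return to, so spin forever. */
    for (;;)
        ;
}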
Absolutely! That is what embedded programming is. As many have probably said already, the operating system does quite a bit for you. And even in the embedded world without an operating system, a number of the development tools will provide the startup code to get the processor running enough to jump to your program. Some/many provide full-blown C/C++ libraries so that you can call functions like memcpy() and sometimes even malloc() and printf().
You are welcome to provide every line of code and every instruction, and not use a development tool package, but still use a compiler like gcc. Some binary formats are common to those run on operating systems, ELF for example. You can execute ELF files on Linux, but you can also have your embedded program result in an ELF binary. The processor cannot execute ELF in that format, but whatever programs the boot PROM (or RAM, in some cases) will extract the binary program from the ELF file, not unlike an operating system extracting the program to run from an ELF file. EXE is not one of those file formats.
Your favorite Windows application compiler is probably not an embedded compiler either, although you can sometimes use one to do the high-level-language work and then use an alternative assembler and linker. That is usually more work than it is worth: for example, you write a function in C (one that does NOT make any library or system calls) and compile it to an object file; you write (or find) a utility to extract the compiled binary from that object, convert it to another object format or to assembly (disassemble), add your startup code and other assembly to it, then assemble and link everything together as an embedded program. I did it once with Microsoft's embedded Visual C just to see how it measured up to other compilers; it wasn't horrible, but it certainly wasn't worth the effort of hacking to get at the output.
Every processor, from the one in your computer to the one in your cell phone or microwave, has to have some boot-up code. That code is not running on an operating system. That code is built with the same or similar compilers as operating-system applications use. For some devices, that code puts the processor, memory, and on- and off-chip peripherals into a state where the operating system can be started; from there the operating system takes over. On your computer this would be the BIOS followed by the bootloader, then eventually the operating system: DOS, Windows, Linux, etc.
The main problem is the file format. PE is very different from ELF (used in Unix-like systems); a valid PE program cannot be a valid ELF one. So you either load the binary dynamically with different loader stubs, or you have to give up.
Other than that, with knowledge of OS services, the values of registers at startup, etc., your code can probably detect easily and reliably which OS it is running under and act accordingly (some malware does just that). Another challenge is then reusing code instead of having two or more different programs in the same binary. Basically you would have to write an emulator, at least for the services that you need.
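The format check itself is trivial, since both formats announce themselves in their first bytes; a sketch:

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 1;
    unsigned char magic[4] = {0};
    fread(magic, 1, 4, f);
    fclose(f);

    if (magic[0] == 0x7F && magic[1] == 'E' && magic[2] == 'L' && magic[3] == 'F')
        printf("ELF binary\n");
    else if (magic[0] == 'M' && magic[1] == 'Z')   /* the DOS stub of every PE file */
        printf("PE binary\n");
    else
        printf("something else\n");
    return 0;
}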
Also, don't forget about the windowing libraries: look into Qt and GTK+.
I'm currently working on a legacy app (win32, Visual C++ 2005) that allocates memory using LocalAlloc (in a supplied library I can't change). The app keeps very large state in fixed memory (created at the start with multiple calls to LocalAlloc( LPTR, size)). I notice that in release mode I run out of memory at about 1.8gb but in debug it happily goes on to over 3.8gb. I'm running XP64 with the /3gb switch. I need to increase the memory used in the app and I'm hitting the memory limit in release (debug works ok). Any ideas?
You probably have the Debug configuration linking with /LARGEADDRESSAWARE and the Release configuration linking with /LARGEADDRESSAWARE:NO (or missing altogether).
Check Linker->System->Enable Large Addresses in the project's configuration properties.
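To double-check which way a given build was actually linked, you can read the flag back out of the running EXE's PE header; a small sketch (standard Win32 calls, no error handling):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* The module handle of the EXE is also its load address,
       so the PE headers can be read directly from memory. */
    BYTE *base = (BYTE *)GetModuleHandle(NULL);
    IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)base;
    IMAGE_NT_HEADERS *nt = (IMAGE_NT_HEADERS *)(base + dos->e_lfanew);

    if (nt->FileHeader.Characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE)
        printf("linked with /LARGEADDRESSAWARE\n");
    else
        printf("NOT large address aware\n");
    return 0;
}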
Sounds like your Release build is also compiled as x86. If not, then there must be something in your code which treats pointers as signed 32-bit integers, and this code is only active in Release.
How does the running out of memory manifest itself?
Also, there is no reason to use the /3GB flag on XP64 when running 64-bit applications: it doesn't change anything in that scenario.
One suggestion: have a look at the base addresses of the DLLs that get loaded into the process space in release and debug mode, and see if there is much difference. It's possible that, in the release case, there are DLLs loaded at addresses such that, while there's enough free space in total to support a LocalAlloc() call, there isn't enough contiguous address space to satisfy it. (For a contrived example, suppose there was a DLL loaded at 0x40000000 (1 GB), another at 0x80000000 (2 GB), and another at 0xC0000000 (3 GB). Even if these DLLs were really small, the process couldn't allocate more than 1 GB at a time, as there's no contiguous block of address space left free that's big enough.)
You could also get a variation on this problem if the memory allocations happen in a different order in debug and release builds.
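One way to check this theory is to walk the process address space with VirtualQuery and print the largest free region in each configuration; a minimal sketch (Win32 only, error handling omitted):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    MEMORY_BASIC_INFORMATION mbi;
    BYTE *p = NULL;
    SIZE_T largestFree = 0;

    /* Scan every region from address 0 upward. */
    while (VirtualQuery(p, &mbi, sizeof mbi) == sizeof mbi) {
        if (mbi.State == MEM_FREE && mbi.RegionSize > largestFree)
            largestFree = mbi.RegionSize;
        p = (BYTE *)mbi.BaseAddress + mbi.RegionSize;
    }
    printf("largest free block: %lu MB\n",
           (unsigned long)(largestFree / (1024 * 1024)));
    return 0;
}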