How can I find out, via WMI or other means in C++, whether the user has an integrated or a dedicated GPU?
I have gone over Win32_VideoController and could not find anything that will help me to differentiate between the two.
Thanks in advance.
It is surprising that no one has brought this idea up after so many years. Most people say it is impossible, and it is true that Windows does not natively provide any means to detect whether a GPU is an iGPU or a dGPU. However, I managed to make it work to a certain extent, if you can live with the limitations.
The general idea is that you can use wmic to get the name of the installed CPU, and maintain a list of all CPUs that have integrated graphics (which may be very short, depending on what you need this feature for).
For newer CPU models, such as 9th-gen or newer Intel desktop processors and AMD Ryzen 1000 and newer, you can tell simply from the CPU name. Intel processors without integrated graphics end with the letter F, while AMD processors with integrated graphics end with the letter G. Then you can use wmic to list all GPUs (including the iGPU), and by counting the number of GPUs installed you can tell whether the user has an iGPU or a dGPU, as it is impossible to have more than one iGPU.
Thus, if you detect that the CPU comes with an iGPU and wmic reports only one GPU, you know that it is the iGPU. On the other hand, if multiple GPUs are reported, you know that one of them is definitely a dGPU. Of course, this does not work if the user has manually disabled the iGPU, which is why I say this approach has limitations.
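The decision logic above can be sketched as plain C++, leaving out the WMI/wmic query itself. This is only an illustration under assumptions: the names GpuVerdict, model_token, cpu_has_igpu and classify are made up here, the CPU name is assumed to look like the Name string reported for the processor (for example "Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz"), and the suffix rule is exactly the F/G naming rule described above, so it is only meaningful for the CPU generations mentioned.

```cpp
#include <cctype>
#include <sstream>
#include <string>

enum class GpuVerdict { IntegratedOnly, HasDedicated, Unknown };
enum class Igpu { Yes, No, Unknown };

// Hypothetical helper: first whitespace-separated token containing at least
// four consecutive digits, e.g. "i5-9400F" or "3400G"; empty if none.
std::string model_token(const std::string& cpu_name) {
    std::istringstream in(cpu_name);
    std::string token;
    while (in >> token) {
        int run = 0;
        for (unsigned char c : token) {
            run = std::isdigit(c) ? run + 1 : 0;
            if (run >= 4) return token;
        }
    }
    return std::string();
}

// Applies the naming rule from the answer: Intel models ending in 'F' have
// no iGPU; AMD models ending in 'G' have one. Anything else is Unknown and
// would need the hand-maintained CPU list.
Igpu cpu_has_igpu(const std::string& cpu_name) {
    const std::string model = model_token(cpu_name);
    if (model.empty()) return Igpu::Unknown;
    const char last = model.back();
    if (cpu_name.find("Intel") != std::string::npos)
        return last == 'F' ? Igpu::No : Igpu::Yes;
    if (cpu_name.find("AMD") != std::string::npos)
        return last == 'G' ? Igpu::Yes : Igpu::No;
    return Igpu::Unknown;
}

// gpu_count is assumed to be the number of video controllers reported.
GpuVerdict classify(const std::string& cpu_name, int gpu_count) {
    if (gpu_count <= 0) return GpuVerdict::Unknown;
    if (gpu_count > 1) return GpuVerdict::HasDedicated;  // at most one iGPU
    switch (cpu_has_igpu(cpu_name)) {
        case Igpu::Yes: return GpuVerdict::IntegratedOnly;
        case Igpu::No:  return GpuVerdict::HasDedicated;
        default:        return GpuVerdict::Unknown;
    }
}
```

As noted, a single reported GPU on an iGPU-capable CPU is classified as integrated even if the user has disabled the iGPU and is really running a dGPU, so treat the result as a hint.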
I read an interesting paper, entitled "A High-Resolution Side-Channel Attack on Last-Level Cache", and wanted to find out the index hash function for my own machine—i.e., Intel Core i7-7500U (Kaby Lake architecture)—following the leads from this work.
To reverse-engineer the hash function, the paper mentions the first step as:
for (n=16; ; n++)
{
    // ignore any miss on first run
    for (fill=0; !fill; fill++)
    {
        // set pmc to count LLC miss
        reset_pmc();
        for (a=0; a<n; a++)
            // set_count*line_size=2^19
            load(a*2^19);
    }
    // get the LLC miss count
    if (read_pmc()>0)
    {
        min = n;
        break;
    }
}
How can I code the reset_pmc() and read_pmc() in C++? From all that I read online so far, I think it requires inline assembly code, but I have no clue what instructions to use to get the LLC miss count. I would be obliged if someone can specify the code for these two steps.
I am running Ubuntu 16.04.1 (64-bit) on VMware workstation.
P.S.: I found mention of these LONGEST_LAT_CACHE.REFERENCES and LONGEST_LAT_CACHE.MISSES in Chapter-18 Volume 3B of the Intel Architectures Software Developer's Manual, but I do not know how to use them.
You can use perf as Cody suggested to measure the events from outside the code, but I suspect from your code sample that you need fine-grained, programmatic access to the performance counters.
To do that, you need to enable user-mode reading of the counters, and also have a way to program them. Since those are restricted operations, you need at least some help from the OS kernel. Rolling your own solution is going to be pretty difficult, but luckily there are several existing solutions for Ubuntu 16.04:
Andi Kleen's jevents library, which among other things lets you read PMU events from user space. I haven't personally used this part of pmu-tools, but the stuff I have used has been high quality. It seems to use the existing perf_events syscalls for counter programming, so it doesn't need a kernel module.
The libpfc library is a from-scratch implementation of a kernel module and userland code that allows userland reading of the performance counters. I've used this and it works well. You install the kernel module which allows you to program the PMU, and then use the API exposed by libpfc to read the counters from userspace (the calls boil down to rdpmc instructions). It is the most accurate and precise way to read the counters, and it includes "overhead subtraction" functionality which can give you the true PMU counts for the measured region by subtracting out the events caused by the PMU read code itself. You need to pin to a single core for the counts to make sense, and you will get bogus results if your process is interrupted.
Intel's open-sourced Processor Counter Monitor library. I haven't tried this on Linux, but I used its predecessor library, the very similarly named [1] Performance Counter Monitor, on Windows, and it worked. On Windows it needs a kernel driver, but on Linux it seems you can either use a driver or have it go through perf_events.
Use the likwid library's Marker API functionality. Likwid has been around for a while and seems well supported. I have used likwid in the past, but only to measure whole processes in a manner similar to perf stat, and not with the marker API. To use the marker API you still need to run your process as a child of the likwid measurement process, but you can programmatically read the counter values within your process, which is what you need (as I understand it). I'm not sure how likwid sets up and reads the counters when the marker API is used.
So you've got a lot of options! I think all of them could work, but I can personally vouch for libpfc since I've used it myself for the same purpose on Ubuntu 16.04. The project is actively developed and probably the most accurate (least overhead) of the above. So I'd probably start with that one.
All of the solutions above should be able to work for Kaby Lake, since the functionality of each successive "Performance Monitoring Architecture" seems to generally be a superset of the prior one, and the API is generally preserved. In the case of libpfc, however, the author has restricted it to only support Haswell's architecture (PMA v3), but you just need to change one line of code locally to fix that.
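If you would rather not install a kernel module, the perf_events route mentioned above can also be used directly. What follows is a minimal sketch, not the method from the paper or from any of the libraries listed: the class name LlcMissCounter and its methods are made up for illustration, it uses the generic PERF_COUNT_HW_CACHE_MISSES event rather than a raw LONGEST_LAT_CACHE.MISSES encoding, and every read goes through a syscall, so it is far coarser than rdpmc. It also assumes the kernel exposes the PMU to your process; inside a VM, or with a restrictive perf_event_paranoid setting, the open call may simply fail.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Hypothetical wrapper: counts last-level cache misses for the calling
// thread through the Linux perf_events interface (Linux only).
class LlcMissCounter {
public:
    LlcMissCounter() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;  // generic LLC-miss event
        attr.disabled = 1;        // start stopped; reset() enables it
        attr.exclude_kernel = 1;  // count user-space misses only
        attr.exclude_hv = 1;
        // pid = 0, cpu = -1: this thread, on whichever CPU it runs.
        fd_ = static_cast<int>(
            syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    }
    ~LlcMissCounter() { if (fd_ >= 0) close(fd_); }
    LlcMissCounter(const LlcMissCounter&) = delete;
    LlcMissCounter& operator=(const LlcMissCounter&) = delete;

    // False if the kernel refused to open the counter.
    bool ok() const { return fd_ >= 0; }

    // Plays the role of reset_pmc(): zero the counter and start counting.
    void reset() {
        if (!ok()) return;
        ioctl(fd_, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_, PERF_EVENT_IOC_ENABLE, 0);
    }

    // Plays the role of read_pmc(): stop counting and return the count,
    // or -1 if the counter is unavailable.
    long long read_count() {
        if (!ok()) return -1;
        ioctl(fd_, PERF_EVENT_IOC_DISABLE, 0);
        std::uint64_t value = 0;
        if (::read(fd_, &value, sizeof(value)) !=
            static_cast<ssize_t>(sizeof(value)))
            return -1;
        return static_cast<long long>(value);
    }

private:
    int fd_ = -1;
};
```

In the loop from the question you would call reset() where reset_pmc() appears and read_count() where read_pmc() appears, checking ok() once up front. As with libpfc, pinning the thread to one core keeps the counts meaningful.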
1 Indeed, they are both commonly called by their acronym, PCM, and I suspect that the new project is simply the officially open sourced continuation of the old PCM project (which was also available in source form, but without a mechanism for community contribution).
I would use PAPI, see http://icl.cs.utk.edu/PAPI/
This is a cross-platform solution that has a lot of support, especially from the HPC community.
I have two PCs:
a new high-end desktop PC, OpenCL compatible CPU and GPU, 32GB RAM
a very old laptop, Intel Celeron CPU, 512MB RAM, Ati M200 GPU
I am writing OpenCL/C++ software on my desktop PC. But when I travel, I continue the work on my old-school laptop. Programming C++ on this laptop is fine, but I can't try the OpenCL parts of my code. So at those times I am writing OpenCL code without knowing whether it works.
Is there a way to virtualize an OpenCL-compatible CPU/GPU? I don't need high performance; I just want to try my code, and it doesn't matter if it is very slow (even slower than running it single-threaded on my Celeron CPU).
I guess, the answer is no.
(BTW, my plan is, there will be an option in my program, and you can run it with or without OpenCL. This is also needed to measure performance, and compare OpenCL CPU/GPU, and CPU in 1-thread mode without OpenCL.)
almost an answer, but not completely what I am looking for: http://www.acooke.org/cute/Developing0.html
For all existing OpenCL implementations, you need some form of SSE.
A website gathering all this info is here.
The lowest requirements are provided by the AMD OpenCL drivers, which require SSE3. As the list shows, that goes all the way back to late Pentium 4's.
In order to be sure about your CPU's capabilities, you'll need to use something like CPU-Z which can show the capabilities of your processor.
All that aside, I searched for laptops with your GPU, and ended up with processors like the Intel Celeron M 420, which according to Intel doesn't even have 64-bit support (which would imply SSE2).
I currently know of no other OpenCL implementations that are worth anything, so the answer would be no.
On the other hand, some websites claim that processor has SSE3 support, so that would mean AMD's OpenCL SDK is your option of choice.
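Since the sources disagree, the SSE3 question can also be settled by asking the CPU itself instead of relying on a spec sheet. The sketch below is one way to do that, assuming a GCC or Clang build for x86; the function name cpu_has_sse3 is made up here, and on other compilers or architectures it conservatively reports false.

```cpp
// Hypothetical helper: true if the running CPU advertises SSE3.
// Uses the GCC/Clang x86 builtins; elsewhere it returns false rather
// than guessing.
bool cpu_has_sse3() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // initialise the compiler's CPU feature data
    return __builtin_cpu_supports("sse3") != 0;
#else
    return false;
#endif
}
```

If this returns true on the laptop, the SSE3 requirement quoted above for AMD's CPU OpenCL implementation is met; it says nothing about the other requirements such as RAM or OS support.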
Recently I needed to run some experiments that execute multiple different kernels on AMD hardware. I have several questions before I start coding, and I would really appreciate your help.
First, I am not sure whether AMD hardware supports concurrent kernel execution on one device. The OpenCL spec says a command queue can be created as in-order or out-of-order, but I don't think "out-of-order" means "concurrent execution". Does anyone know more about this? My hardware is an AMD A8-3870K APU. If this processor does not support it, do any other AMD products?
Second, I know there is an extension, "device fission", which can be used to partition one device into two devices. This works only on the CPU for now. But in the OpenCL spec I also saw clCreateSubDevices, which is likewise used to partition one device. So my question is: is there any difference between these two techniques? My understanding is that device fission can only be used on the CPU, while clCreateSubDevices can be used on both the CPU and the GPU. Is that correct?
Thanks for any kind reply!
Truly concurrent kernels are not an essential feature, and they cause a lot of trouble for driver developers. As far as I know, AMD does not support this feature without the subdevice split. As you mentioned, "out-of-order" is not concurrent; it is just out-of-order execution of the queue.
But what is the point of running both kernels in parallel at half speed instead of sequentially at full speed? You will probably lose overall performance if you do it that way.
I recommend using more devices (several GPUs, or GPU + CPU) if you run out of resources on one GPU. Optimizing could be a good option too. But splitting is never a good option in a real scenario, only for academic purposes or testing.
Recently, I have been spending a lot of time researching GPUs, and have come across several articles about how PC games are having a hard time staying ahead of console games due to limitations of the APIs. For example, on the Xbox 360, it is my understanding that games run in kernel mode, and that because the hardware is always the same, games can be programmed "closer to the metal" and the DirectX API has less abstraction. On PC, however, making the same number of draw calls with DirectX or OpenGL may take more than twice as long as on console, due to switching to kernel mode and more layers of abstraction. I am interested in hearing possible solutions to this problem.
I have heard of a few solutions, such as programming directly on the hardware, but while (from what I understand) ATI has released the specifications of their low-level API, nVidia keeps theirs secret, so that wouldn't work too well, not to mention the added development time of making different profiles.
Would programming an entire "software rendering" solution in OpenCL and running that on a GPU be any better? My understanding is that games with a lot of draw calls are CPU bound and the calls are single threaded (on PC, that is), so is OpenCL a viable option?
So the question is:
What are possible methods to increase the efficiency of, or even remove the need for, graphics APIs such as OpenGL and DirectX?
The general solution is to not make as many draw calls. Texture atlases via array textures, instancing, and various other techniques make this possible.
Or to just use the fact that modern computers have a lot more CPU performance than consoles. Or even better, make yourself GPU bound. After all, if your CPU is your bottleneck, then that means you have GPU power to spare. Use it.
OpenCL is not a "solution" to anything related to this. OpenCL has no access to any of the many things one would need to do to actually use a GPU to do rendering. In order to use OpenCL for graphics, you would have to not use the GPU's rasterizer/clipper, its specialized buffers for transferring information from stage to stage, the post T&L cache, or the blending/depth comparison/stencil/etc hardware. All of that is fixed function and extremely fast and specialized. And completely unavailable to OpenCL.
And even then, it doesn't actually make it not CPU bound anymore. You still have to marshal what you're rendering and so forth. And you probably won't have access to the graphics FIFO, so you'll have to find another way to feed your shaders.
Or, to put it another way, this is a "problem" that doesn't need solving.
If you try to write a renderer in OpenCL, you will end up with something resembling OpenGL and DirectX. You will also most likely end up with something much slower than these APIs which were developed by many experts over many years. They are specialized to handle efficient rasterizing and use internal hooks not available to OpenCL. It could be a fun project, but definitely not a useful one.
Nicol Bolas already gave you some good techniques to increase the load of the GPU relative to the CPU. The final answer is of course that the best technique will depend on your specific domain and constraints. For example, if your rendering needs call for lots of pixel overdraw with complicated shaders and lots of textures, the CPU will not be the bottleneck. However, the most important general rule with modern hardware is to limit the number of OpenGL calls through better batching.
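To make "better batching" concrete, here is a rough CPU-side sketch of the grouping step only. It is an illustration under assumptions, not taken from any engine: DrawItem, Batch and build_batches are made-up names, and the actual submission (for example one instanced draw per batch) is left out because it depends on your renderer.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical per-object record: which mesh and material it uses,
// plus a position standing in for its per-instance data.
struct DrawItem {
    int mesh_id;
    int material_id;
    float x, y, z;
};

// One batch = one draw call covering every item that shares mesh + material.
struct Batch {
    int mesh_id;
    int material_id;
    std::vector<std::size_t> instances;  // indices into the item list
};

// Groups items by (mesh, material) so that N objects sharing state cost
// one draw call instead of N.
std::vector<Batch> build_batches(const std::vector<DrawItem>& items) {
    std::map<std::pair<int, int>, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < items.size(); ++i)
        groups[{items[i].mesh_id, items[i].material_id}].push_back(i);

    std::vector<Batch> batches;
    batches.reserve(groups.size());
    for (auto& entry : groups)
        batches.push_back({entry.first.first, entry.first.second,
                           std::move(entry.second)});
    return batches;
}
```

With five objects using two distinct mesh/material pairs, this yields two batches, so the API is called twice instead of five times; the per-instance data would then be uploaded once per batch.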
… APIs. For example, on the Xbox 360, it is my understanding that games run in kernel mode, and that because the hardware is always the same, games can be programmed "closer to the metal" and the DirectX API has less abstraction. On PC, however, making the same number of draw calls with DirectX or OpenGL may take more than twice as long as on console, due to switching to kernel mode and more layers of abstraction.
The benefits of close-to-metal operation on consoles are more than offset on PCs by their much greater CPU performance and available memory. Add to this that the HDDs of consoles are not nearly as fast as modern PC ones (SATA-1 vs SATA-3, or even just PATA), and many games load their content from an optical drive, which is even slower.
The PS3, for example, offers only 256MiB of memory for game logic and another 256MiB of RAM for graphics, and that is all you get to work with. The Xbox 360 offers 512MiB of unified RAM, so you have to squeeze everything into that. Now compare this with a low-end PC, which easily comes with 2GiB of RAM for the program alone. And even the cheapest graphics cards offer at least 512MiB of RAM. A gamer's machine will have several GiB of RAM, and the GPU will offer somewhere between 1GiB and 2GiB.
This severely limits the possibilities for a game developer, and many PC gamers lament that so many games are "consoleish" when their PCs could do so much more.
It's been a couple of decades since I've done any programming. As a matter of fact, the last time I programmed was in an MS-DOS environment, before Windows came out. I've had a programming idea that I have wanted to try for a few years now, and I thought I would give it a go. The amount of calculations are enormous. Consequently I want to run it in the fastest environment available to a general hobby programmer.
I'll be using a 64-bit machine. Currently it is running Windows 7. Years ago a program ran much slower in the Windows environment than in MS-DOS mode. My personal programming experience has been in Fortran, Pascal, Basic, and machine language for the 6800 Motorola series processors. I'm basically willing to try anything. I've fooled around with Ubuntu also. No objections to learning something new. I just want to take advantage of speed. I'd prefer to spend no money on this project, so I'm looking for a free or very close to free compiler. I've downloaded Microsoft Visual Studio C++ Express. But I've got a feeling that the compiled code will have to be run in the Windows environment, which I'm sure slows the processing speed considerably.
So I'm looking for ideas or pointers to what is available.
Thank you,
Have a Great Day!
Jim
Speed generally comes at the price of either portability or complexity.
If your programming idea involves lots of computation and you're using an Intel CPU, you might want to use Intel's compiler, which might exploit processor-specific features to make your program faster. Otherwise, if portability is your goal, use GCC (GNU Compiler Collection), which can cross-compile well-optimized executables for practically any platform on earth. If your computation is parallelizable, you might want to look at SIMD (Single Instruction, Multiple Data) and GPGPU/CUDA/OpenCL (using the graphics card for computation) techniques.
However, I'd recommend you just try your idea in a simpler language first, e.g. Python, Java, C#, or Basic, and see if the speed is good enough. Since you haven't programmed for decades, it's likely that what you perceive as an enormous computation is minuscule today, thanks to the increase in processor speed and RAM. Nowadays there is not much noticeable difference between running in a GUI environment and in a command-line environment.
There is no substantial performance penalty to operating under Windows, and a large number of extremely high-performance applications do so. With new compiler advances and new optimization techniques, Windows is no longer the up-and-coming, new, poorly optimized technology it was twenty years ago.
The simple fact is that if you haven't programmed for 20 years, then you won't have any realistic performance picture at all. You should do what most people do: start with an easy-to-learn but not very fast programming language like C#, create the program, then prove that it runs too slowly, then make several optimization passes with tools such as profilers, and only then decide whether the language is too slow. If you haven't written a line of code in two decades, the overwhelming probability is that any program you write will be slow because you're a novice programmer by modern standards, not because of your choice of language or environment. Creating very high-performance applications requires a detailed understanding of the target platform as well as the language of choice, AND the operations of the program.
I'd definitely recommend Visual C++. The Express Edition is free and Visual Studio 2010 can produce some unreasonably fast code. Windows is not a slow platform - even if you handwrote your own OS, it'd probably be slower, and even if you produced one that was faster, the performance gain would be negligible unless your program takes days or weeks to complete a single execution.
The OS does not make your program magically run slower. True, the OS does eat a few clock cycles here and there, but it's really not enough to be at all noticeable (and it does so in order to provide you with services you most likely need, and would need to re-implement yourself otherwise).
Windows doesn't, as some people seem to believe, eat 50% of your CPU. It might eat 0.5%, but so does Linux and OSX. And if you were to ditch all existing OS'es and instead write your own from scratch, you'd end up with a buggy, less capable OS which also eats a bit of CPU time.
So really, the environment doesn't matter.
What does matter is what hardware you run the program on (and here, running it on the GPU might be worth considering) and how well you utilize the hardware (concurrency is pretty much a must if you want to exploit modern hardware).
What code you write, and how you compile it does make a difference. The hardware you're running on makes a difference. The choice of OS does not.
A digression: the claim that the OS doesn't matter for performance is, in general, obviously false. Citing CPU utilization when idle seems quite a "peculiar" idea to me: of course one hopes that when no jobs are running the OS is not wasting energy. Rather, one measures the speed/throughput of an OS when it is providing a service (i.e. mediating access to hardware/resources).
To avoid an annoying MS Windows vs Linux vs Mac OS X battle, I will refer to a research OS concept: exokernels. The point of exokernels is that a traditional OS is not just a mediator for resource access; it also implements policies. Such policies do not always favor the performance of your application-specific access pattern to a resource. With the exokernel concept, researchers proposed to "exterminate all operating system abstractions" while retaining the OS's multiplexer role. In this way:
… The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). …
So, by bypassing the usual OS access policies, they gained a roughly eightfold performance improvement for a customized web server.
Returning to the original question: it's generally true that an application is executed with no or negligible OS overhead when:
it has a compute-intensive kernel, where such kernel does not call the OS API;
memory is enough or data is accessed in a way that does not cause excessive paging;
all inessential services running on the same systems are switched off.
There are possibly other factors, depending on the hardware/OS/application.
I assume that the OP is correct in his rough estimation of the computing power required. The OP does not specify the nature of this intensive computation, so it's difficult to give suggestions. But he wrote:
The amount of calculations are enormous
"Calculations" seems to allude to compute-intensive kernels, for which I think you need a compiled language or a fast interpreted language with native array operators, like APL or a modern variant such as J, A+ or K (potentially, at least: I do not know whether they take advantage of modern hardware).
Anyway, the first advice is to spend some time in researching fast algorithms for your specific problem (but when comparing algorithms remember that asymptotic notation disregards constant factors that sometimes are not negligible).
For the sequential part of your program a good utilization of CPU caches is crucial for speed. Look into cache conscious algorithms and data structures.
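As a small illustration of what cache-conscious means in practice, the two functions below compute the same sum over a matrix stored row-major in one contiguous vector. This is a generic sketch with made-up names (sum_row_major, sum_column_major), not something specific to the OP's problem: the first walks memory in address order, the second jumps by a whole row on every step, which on matrices much larger than the cache typically causes far more cache misses.

```cpp
#include <cstddef>
#include <vector>

// Element (r, c) of a rows x cols matrix lives at index r * cols + c.

// Cache-friendly: the inner loop touches consecutive addresses.
double sum_row_major(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same arithmetic, but the inner loop strides by `cols` elements,
// so each access may land on a different cache line.
double sum_column_major(const std::vector<double>& m,
                        std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Both have the same asymptotic cost; the difference is purely in the constant factor, which is exactly the kind of thing asymptotic notation hides.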
For the parallel part, if such program is amenable to parallelization (remember both Amdahl's law and Gustafson's law), there are different kinds of parallelism to consider (they are not mutually exclusive):
Instruction-level parallelism: it is taken care of by the hardware/compiler;
data parallelism:
bit-level: sometimes the acronym SWAR (SIMD Within A Register) is used for this kind of parallelism. It applies to problems (or parts of them) for which a data representation can be formulated that maps to bit vectors (where a value is represented by 1 or more bits), so that each instruction of the instruction set is potentially a parallel instruction operating on multiple data items (SIMD). Especially interesting on a machine with 64-bit (or larger) registers. Possible on CPUs and some GPUs. No compiler support required;
fine-grain medium parallelism: ~10 operations in parallel on x86 CPUs with SIMD instruction set extensions like SSE, successors, predecessors and similar; compiler support required;
fine-grain massive parallelism: hundreds of operations in parallel on GPGPUs (using common graphic cards for general-purpose computations), programmed with OpenCL (open standard), CUDA (NVIDIA), DirectCompute (Microsoft), BrookGPU (Stanford University) and Intel Array Building Blocks. Compiler support or use of a dedicated API is required. Note that some of these have back-ends for SSE instructions also;
coarse-grain modest parallelism (at the level of threads, not single instructions): it's not unusual for CPUs in current desktops/laptops to have more than one core (2/4) sharing the same memory pool (shared memory). The standard for shared-memory parallel programming is the OpenMP API, where, for example in C/C++, #pragma directives are used around loops. If I am not mistaken, this can be considered data parallelism emulated on top of task parallelism;
task parallelism: each core in one (or multiple) CPU(s) has its independent flow of execution and possibly operates on different data. Here one can use the concept of "thread" directly or a more high-level programming model which masks threads.
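To show what the bit-level (SWAR) item in the list above looks like in code, here is a small sketch that adds eight independent 8-bit values packed into one 64-bit word with ordinary integer instructions. The function name add_bytes_swar is made up for this example; the masking trick itself is the standard one for keeping carries from spilling into the neighbouring lane.

```cpp
#include <cstdint>

// Adds eight packed 8-bit lanes at once; each lane wraps modulo 256 and
// no carry crosses into the neighbouring lane.
std::uint64_t add_bytes_swar(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t high = 0x8080808080808080ULL;  // top bit of each lane
    // Add the low 7 bits of every lane: these sums cannot overflow a lane.
    const std::uint64_t low_sum = (a & ~high) + (b & ~high);
    // Restore the top bit of each lane with XOR (addition without carry-out).
    return low_sum ^ ((a ^ b) & high);
}
```

For example, adding 0xFF and 0x01 in the lowest lane wraps to 0x00 and leaves the second lane untouched, whereas a plain 64-bit addition would carry into it. No special compiler support is involved, which is the point made in the list.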
I will not go into details of these programming models here because apparently it is not what the OP needs.
I think this is enough for the OP to evaluate by himself how various languages and their compilers/run-times / interpreters / libraries support these forms of parallelism.
Just my two cents about DOS vs. Windows.
Years ago (something like 1998?), I had the same assumption.
I had a program written in QBasic (this was before I discovered C), which did intense calculations (neural network back-propagation). And it took time.
A friend offered to rewrite the thing in Visual Basic. I objected, because, you know, all those gizmos, widgets and fancy windows, you know, would slow down the execution of, you know, the important code.
The Visual Basic version so much outperformed the QBasic one that it became the default application (I won't mention the "hey, even in Excel's VBA, you are outperformed" because of my wounded pride, but...).
The point here, is the "you know" part.
You don't know.
The OS here is not important. As others explained in their answers, choose your hardware, and choose your language. And write your code in a clear way, because nowadays compilers are better at optimizing code than developers are, unless you're John Carmack (premature optimization is the root of all evil).
Then, if you're not happy with the result, use a profiler to test your code. Consider multithreading (which will help you if you have multiple cores... TBB comes to mind).
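Before reaching for a full profiler, even a crude timer replaces "you know" with a number. The sketch below is just that, a sketch: elapsed_ms and sum_of_squares are names made up for this example, and the workload is a stand-in for whatever your real computation is.

```cpp
#include <chrono>
#include <cstdint>

// Runs `work` once and returns the wall-clock time it took, in milliseconds.
template <typename Work>
double elapsed_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Stand-in workload (hypothetical): sum of i*i for i in [0, n).
std::uint64_t sum_of_squares(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) total += i * i;
    return total;
}
```

You would call it as elapsed_ms([&] { result = sum_of_squares(n); }) and compare the figure between builds, languages, or operating systems. A single run is noisy, so repeat it a few times and make sure the result is actually used, otherwise the optimizer may remove the work entirely.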
What are you trying to do? I believe all the stuff should be compiled in 64bit mode by default. Computers have gotten a lot faster. Speed should not be a problem for the most part.
Side note: for computation-intensive stuff you may want to look into OpenCL or CUDA. Both take advantage of the GPU, which can process far more data elements in parallel than the CPU.
If your last points of reference are M68K and PCs running DOS then I'd suggest that you start with C/C++ on a modern processor and OS. If you run into performance problems and can prove that they are caused by running on Linux / Windows or that the compiler / optimizer generated code isn't sufficient, then you could look at other OSes and/or hand coded ASM. If you're looking for free, Linux / gcc is a good place to start.
I am the original poster of this thread.
I want to emphasize once again that this program will perform an enormous number of calculations.
Windows & Ubuntu are multi-tasking environments. There are processes running, and many of them are using processor resources. True, many of them are seen as inactive. But the Windows environment, by the nature of multi-tasking, is still constantly monitoring the need to start up each process. For example, there are currently 62 processes showing in the Windows Task Manager. According to the Task Manager, three are consuming CPU resources. So we have three ongoing processes that are consuming CPU time. There are an additional 59 showing as active but consuming no CPU time. So that is 62 being monitored by Windows, and then there is Windows itself, which is also checking on various things.
I was hoping to find some way to just be able to run a program on the bare machine level. Side stepping all the Windows (Ubuntu) involvement.
The idea is very calculation intensive.
Thank you all for taking the time to respond.
Have a Great Day,
Jim