Reliable way to programmatically get the number of hardware threads on Windows - c++

I'm struggling to find a reliable way to get the number of hardware threads on Windows. I am running Windows 7 Professional SP1 64-bit on a machine with two Intel Xeon E5-2699 v3 CPUs @ 2.30GHz, for a total of 36 cores and 72 hardware threads.
I have tried different methods to get the number of cores, and I have found that only two of them report it accurately from both a 32-bit and a 64-bit process. Here are my results:
+------------------------------------------------+----------------+----------------+
| Methods | 32-bit process | 64-bit process |
+------------------------------------------------+----------------+----------------+
| GetSystemInfo->dwNumberOfProcessors | 32 | 36 |
| GetNativeSystemInfo->dwNumberOfProcessors | 36 | 36 |
| GetLogicalProcessorInformation | 36 | 36 |
| GetProcessAffinityMask.processAffinityMask | 32 | 32 |
| GetProcessAffinityMask.systemAffinityMask | 32 | 32 |
| omp_get_num_procs | 32 | 36 |
| getenv("NUMBER_OF_PROCESSORS") | 36 | 36 |
| GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) | 64 | 72 |
| GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS) | 64 | 72 |
| boost::thread::hardware_concurrency() | 32 | 36 |
| Performance counter API | 36 | 36 |
| WMI | 72 | 72 |
| HARDWARE\DESCRIPTION\System\CentralProcessor | 72 | 72 |
+------------------------------------------------+----------------+----------------+
I cannot explain why all these functions return different values. The only two methods that seem reliable to me are either using WMI (fairly complicated) or simply reading the following key in the Windows registry: HARDWARE\DESCRIPTION\System\CentralProcessor.
What do you think?
Can you confirm that the WMI and registry key methods are the only reliable ones?
Thanks in advance

The API function that you need is GetLogicalProcessorInformationEx. Since you have more than 64 logical processors, your processors are split into processor groups. GetLogicalProcessorInformation only reports the processors in the processor group to which the calling thread is currently assigned. You need to use GetLogicalProcessorInformationEx to get past that limitation.
The documentation says:
On systems with more than 64 logical processors, the GetLogicalProcessorInformation function retrieves logical processor information about processors in the processor group to which the calling thread is currently assigned. Use the GetLogicalProcessorInformationEx function to retrieve information about processors in all processor groups on the system.

Late answer with code:
#include <windows.h>
#include <cstdlib>
#include <memory>

size_t myHardwareConcurrency() {
    size_t concurrency = 0;
    DWORD length = 0;
    // First call with a null buffer: it must fail with ERROR_INSUFFICIENT_BUFFER
    // and report the required buffer size in `length`.
    if (GetLogicalProcessorInformationEx(RelationAll, nullptr, &length) != FALSE) {
        return concurrency;
    }
    if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
        return concurrency;
    }
    std::unique_ptr<void, void (*)(void*)> buffer(std::malloc(length), std::free);
    if (!buffer) {
        return concurrency;
    }
    unsigned char* mem = reinterpret_cast<unsigned char*>(buffer.get());
    if (GetLogicalProcessorInformationEx(
            RelationAll,
            reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(mem),
            &length) == FALSE) {
        return concurrency;
    }
    // Walk the variable-length records; count one bit per logical processor
    // in every core's group affinity masks (this covers all processor groups).
    for (DWORD i = 0; i < length;) {
        auto* proc = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(mem + i);
        if (proc->Relationship == RelationProcessorCore) {
            for (WORD group = 0; group < proc->Processor.GroupCount; ++group) {
                for (KAFFINITY mask = proc->Processor.GroupMask[group].Mask; mask != 0; mask >>= 1) {
                    concurrency += mask & 1;
                }
            }
        }
        i += proc->Size;
    }
    return concurrency;
}
It worked on my dual Xeon Gold 6154 running 64-bit Windows (2 processors * 18 cores/processor * 2 threads/core = 72 threads). The result is 72 both for 32-bit processes and for 64-bit processes.
I do not have access to a system with a 32 bit Windows though.
In case of error, it returns zero like std::thread::hardware_concurrency does.
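For comparison, a hypothetical caller could print the result next to std::thread::hardware_concurrency(), which (like boost::thread::hardware_concurrency() in the question's table) may only see the calling process's processor group:

#include <iostream>
#include <thread>

int main() {
    // myHardwareConcurrency() is the function defined above.
    std::cout << "myHardwareConcurrency():             " << myHardwareConcurrency() << '\n';
    std::cout << "std::thread::hardware_concurrency(): " << std::thread::hardware_concurrency() << '\n';
    return 0;
}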

You can use the CPUID instruction to query the processor directly (it is platform independent, though since you can no longer write inline asm in MSVC, you'll need compiler intrinsics - and different ones per compiler - to access it). The only downside is that Intel and AMD handle this instruction differently, and you'll need to do a fair amount of work to make sure you are reading the information correctly. In fact, not only can you get a core count, you can get all kinds of processor topology information. I'm not sure how it behaves in a VM, though, if you are using that environment.
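As a rough sketch of what that involves (using the MSVC __cpuid/__cpuidex intrinsics and Intel's extended topology leaf 0Bh only; AMD uses different leaves, and the counts are per package rather than per system, so treat this as an illustration, not a drop-in solution):

#include <intrin.h>
#include <cstdio>
#include <cstring>

int main() {
    int regs[4] = {0}; // EAX, EBX, ECX, EDX

    // Leaf 0: the vendor string is returned in EBX, EDX, ECX (in that order).
    __cpuid(regs, 0);
    char vendor[13] = {0};
    std::memcpy(vendor + 0, &regs[1], 4); // EBX
    std::memcpy(vendor + 4, &regs[3], 4); // EDX
    std::memcpy(vendor + 8, &regs[2], 4); // ECX
    std::printf("Vendor: %s\n", vendor);

    // Leaf 0Bh (Intel extended topology enumeration): walk the sub-leaves
    // until the level type (ECX bits 8-15) reads 0 (invalid).
    for (int subleaf = 0;; ++subleaf) {
        __cpuidex(regs, 0x0B, subleaf);
        int levelType = (regs[2] >> 8) & 0xFF;   // 1 = SMT, 2 = Core
        if (levelType == 0) break;
        int logicalAtLevel = regs[1] & 0xFFFF;   // EBX bits 0-15
        std::printf("Level type %d: %d logical processors per package\n",
                    levelType, logicalAtLevel);
    }
    return 0;
}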


Build Gstreamer gst-plugins-bad 1.18.6 with support for Nvidia cuda, encoding/decoding, etc

Similar questions have been asked elsewhere on Stack Overflow. However, I do not believe any posts here give an answer relevant to the more recent releases of gst-plugins-bad.
I'd like to encode and decode H264 video streams with gstreamer using hardware support via my GTX1080 video card. I was able to get this to work previously by following this guide for gst-plugins-bad 1.16.3, but my goal now is to access features available in 1.18.6. However, starting with 1.17.0, the build system for gst-plugins-bad changed from autoconf to Meson. I don't have ANY experience with Meson at all (honestly, I hadn't even heard of it until this point), and as such have no idea how to pass the proper arguments to build with nvidia support. I also haven't been able to find any documentation covering what I'm trying to do here for the later versions of the gstreamer plugins; as far as I can tell, there isn't any.
I am on Ubuntu 22.04 with Cuda 11.8, Gstreamer 1.20.3, gst-plugins-bad 1.18.6.
For reference, here is the output from nvidia-smi:
Wed Jan 4 14:42:19 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:42:00.0 On | N/A |
| 0% 54C P2 40W / 240W | 652MiB / 8192MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1879 G /usr/lib/xorg/Xorg 354MiB |
| 0 N/A N/A 2034 G /usr/bin/gnome-shell 92MiB |
| 0 N/A N/A 12603 G ...1/usr/lib/firefox/firefox 163MiB |
| 0 N/A N/A 20393 G ...RendererForSitePerProcess 38MiB |
+-----------------------------------------------------------------------------+
Any help is appreciated, thanks in advance.

Which sample type (not size) to choose in QtMultimedia's QAudioFormat for 24, 32 and 64 bit audio?

I am writing a media player with Qt, but I'm now facing an unfamiliar situation. I'm trying to use QAudioOutput and QAudioDecoder to play high-res music (24-, 32- or even 64-bit audio). But QAudioFormat (the glue between all the audio classes) specifies a sampleType as in the following table:
| Constant | Value | Description |
|---------------------------|-------|--------------------------------|
| QAudioFormat::Unknown | 0 | Not Set |
| QAudioFormat::SignedInt | 1 | Samples are signed integers |
| QAudioFormat::UnSignedInt | 2 | Samples are unsigned integers |
| QAudioFormat::Float | 3 | Samples are floats |
Now, the problem arises when I also set the sample size to something greater than 16 bits. I have one hypothesis that I need confirmed:
assuming ints are 32 bits in size, if I want to support up to 32-bit sample sizes I have to use QAudioFormat::SignedInt with PCM audio for both 24- and 32-bit audio (padding with 0 for 24-bit audio).
But what if there is a higher sample size (e.g. 64-bit audio from DSD converted to PCM)? Should I assume that I still set the sample type to QAudioFormat::SignedInt but that each 64-bit "sample" is stored in two ints? Or is it simply not supported by QtMultimedia?
I'm open to any enlightenment 😙!
From the documentation for QAudioFormat::setSampleSize():
void QAudioFormat::setSampleSize(int sampleSize)
Sets the sample size to the sampleSize specified, in bits.
This is typically 8 or 16, but some systems may support higher sample
sizes.
Therefore, to use 64-bit samples, you'd need to call setSampleSize(64). That could be called in combination with a call to setSampleType() to specify whether the samples will be fixed-point-signed vs fixed-point-unsigned vs floating-point -- note that the values in setSampleType() do not imply any particular sample size.
For 64-bit audio, each sample will be stored as 64 bits of data; you could access each sample as a long long int, or alternatively as an int64_t (or unsigned long long int or uint64_t for unsigned samples, or as a double for floating-point samples).
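As a minimal sketch of what that looks like (Qt 5 QtMultimedia; the chosen sample rate and channel count are just illustrative, and whether the backend accepts the format still has to be checked at runtime):

#include <QAudioDeviceInfo>
#include <QAudioFormat>
#include <QDebug>

int main() {
    QAudioFormat format;
    format.setSampleRate(192000);
    format.setChannelCount(2);
    format.setCodec("audio/pcm");
    format.setByteOrder(QAudioFormat::LittleEndian);
    format.setSampleSize(64);                       // one sample = 64 bits
    format.setSampleType(QAudioFormat::SignedInt);  // read each sample as an int64_t

    QAudioDeviceInfo device = QAudioDeviceInfo::defaultOutputDevice();
    qDebug() << "64-bit signed PCM supported:" << device.isFormatSupported(format);
    return 0;
}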
(Of course none of this guarantees that your Qt library's QtMultimedia actually supports 64-bit samples; it may or may not, but at least the API supports telling Qt what you want :) )

Inconsistent results in compressing with zlib between Win32 and Linux-64 bit

Using zlib in a program, I noticed a one-byte difference in how "foo" is compressed on Windows (1F8B080000000000000A4BCBCF07002165738C03000000) and on Linux (1F8B08000000000000034BCBCF07002165738C03000000). Both decompress back to "foo".
I decided to check outside our code to see if the implementation was correct and used the test programs in the zlib repository to double check. I got the same results:
Linux: echo -n foo | ./minigzip64 > text.txt
Windows: echo|set /p="foo" | minigzip > text.txt
What would account for this difference? Is it a problem?
1F8B 0800 0000 0000 00 *0A/03* 4BCB CF07 0021 6573 8C03 0000 00
First off, if it decompresses to what was compressed, then it's not a problem. Different compressors, or the same compressor at different settings, or even the same compressor with the same settings, but different versions, can produce different compressed output from the same input.
Second, the compressed data in this case is identical. Only the last byte
of the gzip header that precedes the compressed data is different. That byte identifies the originating operating system. Hence it rightly varies between Linux and Windows.
Even on the same operating system, the header can vary since it carries a modification date and time. However, in both of your cases the modification date and time were left out (set to zeros).
Just to add to the accepted answer here: I got curious and tried it out for myself, saving the raw data and opening both files with 7-Zip (screenshots of the Windows and Linux archives omitted). You can immediately notice that the only field that differs is the Host OS.
What the data means
Header               | Data       | Footer
1F8B080000000000000A | 4BCBCF0700 | 2165738C03000000
Let's break that down.
Header
First, from this answer I realize it's actually a gzip instead of a zlib header:
Level | ZLIB  | GZIP
1     | 78 01 | 1F 8B
9     | 78 DA | 1F 8B
Further searching led me to an article about Gzip on forensics wiki.
The values in this case are:
Offset | Size | Value    | Description
0      | 2    | 1f8b     | Signature (or identification byte 1 and 2)
2      | 1    | 08       | Compression Method (deflate)
3      | 1    | 00       | Flags
4      | 4    | 00000000 | Last modification time
8      | 1    | 00       | Compression flags (or extra flags)
9      | 1    | 0A       | Operating system (TOPS-20)
Footer
Offset | Size | Value    | Description
0      | 4    | 2165738C | Checksum (CRC-32) (little endian)
4      | 4    | 03000000 | Uncompressed data size in bytes (little endian) = 3
The interesting thing to note here is that even if the Last modification time and Operating system fields in the header are different, the data compresses to the same bytes, with the same checksum in the footer.
The IETF RFC 1952 has a more detailed summary of the format.
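To make the byte layout above concrete, here is a small self-contained sketch that walks the same "foo" gzip stream and pulls out the header and footer fields discussed (the deflate payload itself is left alone):

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // 1F8B 0800 0000 0000 00 0A | 4BCB CF07 00 | 2165 738C 0300 0000
    std::vector<uint8_t> gz = {
        0x1F, 0x8B, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0A, // header
        0x4B, 0xCB, 0xCF, 0x07, 0x00,                               // deflate data
        0x21, 0x65, 0x73, 0x8C, 0x03, 0x00, 0x00, 0x00              // footer
    };

    std::printf("magic: %02X %02X, method: %02X, os byte: %02X\n",
                gz[0], gz[1], gz[2], gz[9]);

    // Footer: CRC-32 of the uncompressed data, then its size, both little endian.
    size_t f = gz.size() - 8;
    uint32_t crc   = gz[f]     | (gz[f + 1] << 8) | (gz[f + 2] << 16) | (uint32_t(gz[f + 3]) << 24);
    uint32_t isize = gz[f + 4] | (gz[f + 5] << 8) | (gz[f + 6] << 16) | (uint32_t(gz[f + 7]) << 24);
    std::printf("crc32: %08X, uncompressed size: %u bytes\n", crc, isize);
    return 0;
}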

What is IACA and how do I use it?

I've found this interesting and powerful tool called IACA (the Intel Architecture Code Analyzer), but I have trouble understanding it. What can I do with it, what are its limitations and how can I:
Use it to analyze code in C or C++?
Use it to analyze code in x86 assembler?
2019-04: Reached EOL. Suggested alternative: LLVM-MCA
2017-11: Version 3.0 released (latest as of 2019-05-18)
2017-03: Version 2.3 released
What it is:
IACA (the Intel Architecture Code Analyzer) is a (2019: end-of-life) freeware, closed-source static analysis tool made by Intel to statically analyze the scheduling of instructions when executed by modern Intel processors. This allows it to compute, for a given snippet,
In Throughput mode, the maximum throughput (the snippet is assumed to be the body of an innermost loop)
In Latency mode, the minimum latency from the first instruction to the last.
In Trace mode, prints the progress of instructions through their pipeline stages.
when assuming optimal execution conditions (All memory accesses hit L1 cache and there are no page faults).
IACA supports computing schedulings for Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell and Skylake processors as of version 2.3 and Haswell, Broadwell and Skylake as of version 3.0.
IACA is a command-line tool that produces ASCII text reports and Graphviz diagrams. Versions 2.1 and below supported 32- and 64-bit Linux, Mac OS X and Windows and analysis of 32-bit and 64-bit code; Version 2.2 and up only support 64-bit OSes and analysis of 64-bit code.
How to use it:
IACA's input is a compiled binary of your code, into which have been injected two markers: a start marker and an end marker. The markers make the code unrunnable, but allow the tool to find quickly the relevant pieces of code and analyze them.
You do not need the ability to run the binary on your system; In fact, the binary supplied to IACA can't run anyways because of the presence of the injected markers in the code. IACA only requires the ability to read the binary to be analyzed. Thus it is possible, using IACA, to analyze a Haswell binary employing FMA instructions on a Pentium III machine.
C/C++
In C and C++, one gains access to marker-injecting macros with #include "iacaMarks.h", where iacaMarks.h is a header that ships with the tool in the include/ subdirectory.
One then inserts the markers around the innermost loop of interest, or the straight-line chunk of interest, as follows:
/* C or C++ usage of IACA */

while(cond){
    IACA_START
    /* Loop body */
    /* ... */
}
IACA_END
The application is then rebuilt as it otherwise would be, with optimizations enabled (in Release mode for users of IDEs such as Visual Studio). The output is a binary that is identical in all respects to the Release build, except for the presence of the marks, which make the application non-runnable.
IACA relies on the compiler not reordering the marks excessively; As such, for such analysis builds certain powerful optimizations may need to be disabled if they reorder the marks to include extraneous code not within the innermost loop, or exclude code within it.
Assembly (x86)
IACA's markers are magic byte patterns injected at the correct location within the code. When using iacaMarks.h in C or C++, the compiler handles inserting the magic bytes specified by the header at the correct location. In assembly, however, you must manually insert these marks. Thus, one must do the following:
; NASM usage of IACA

    mov ebx, 111            ; Start marker bytes
    db 0x64, 0x67, 0x90     ; Start marker bytes

.innermostlooplabel:
    ; Loop body
    ; ...
    jne .innermostlooplabel ; Conditional branch backwards to top of loop

    mov ebx, 222            ; End marker bytes
    db 0x64, 0x67, 0x90     ; End marker bytes
It is critical for C/C++ programmers that the compiler achieve this same pattern.
What it outputs:
As an example, let us analyze the following assembler example on the Haswell architecture:
.L2:
    vmovaps      ymm1, [rdi+rax]        ; L1
    vfmadd231ps  ymm1, ymm2, [rsi+rax]  ; L2
    vmovaps      [rdx+rax], ymm1        ; S1
    add          rax, 32                ; ADD
    jne          .L2                    ; JMP
We add immediately before the .L2 label the start marker and immediately after jne the end marker. We then rebuild the software, and invoke IACA thus (On Linux, assumes the bin/ directory to be in the path, and foo to be an ELF64 object containing the IACA marks):
iaca.sh -64 -arch HSW -graph insndeps.dot foo
, thus producing an analysis report of the 64-bit binary foo when run on a Haswell processor, and a graph of the instruction dependencies viewable with Graphviz.
The report is printed to standard output (though it may be directed to a file with a -o switch). The report given for the above snippet is:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ../../../tests_fma
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.55 Cycles Throughput Bottleneck: FrontEnd, PORT2_AGU, PORT3_AGU
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 1.5 1.0 | 1.5 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
# - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | CP | vmovaps ymm1, ymmword ptr [rdi+rax*1]
| 2 | 0.5 | 0.5 | | 1.0 1.0 | | | | | CP | vfmadd231ps ymm1, ymm2, ymmword ptr [rsi+rax*1]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | CP | vmovaps ymmword ptr [rdx+rax*1], ymm1
| 1 | | | | | | | 1.0 | | | add rax, 0x20
| 0F | | | | | | | | | | jnz 0xffffffffffffffec
Total Num Of Uops: 6
The tool helpfully points out that currently, the bottleneck is the Haswell frontend and Port 2 and 3's AGU. This example allows us to diagnose the problem as the store not being processed by Port 7, and take remedial action.
Limitations:
IACA does not support a certain few instructions, which are ignored in the analysis. It does not support processors older than Nehalem and does not support non-innermost loops in throughput mode (having no ability to guess which branch is taken how often and in what pattern).

Detecting memory leaks in C++ Qt combine?

I have an application that interacts with external devices using serial communication. There are two versions of the device differing in their implementations.
-->One is developed and tested by my team
-->The other version by a different team.
Since the other team has left, our team is looking after its maintenance. The other day while testing the application I noticed that it takes up 60 MB of memory at startup and, to my horror, its memory usage starts increasing in 200 KB chunks; over 60 hours it shoots up to 295 MB, though there is no slowdown in the responsiveness or usability of the application. I tested it again and again and the same memory usage pattern was repeated.
The application is made in C++,Qt 4.2.1 on RHEL4.
I used mtrace to check for memory leaks and it shows no such leaks. I then used the valgrind memcheck tool, but the messages it gives are cryptic and not very conclusive; it shows leaks in graphical elements of Qt, which on scrutiny can be rejected straight away.
I am in a fix as to what other tools/methodologies can be adopted to pinpoint the source of these memory leaks if any.
-->Also, in a larger context, how can we detect and debug the presence of memory leaks in a C++/Qt application?
-->How can we check, how much memory a process uses in Linux?
I had used gnome-system-monitor and the top command to check the memory used by the application, but I have heard that the results given by the above-mentioned tools are not absolute.
EDIT:
I used ccmalloc for detecting memory leaks and this is the error report I got after I closed the application. During application execution, there were no error messages.
|ccmalloc report|
=======================================================
| total # of| allocated | deallocated | garbage |
+-----------+-------------+-------------+-------------+
| bytes| 387325257 | 386229435 | 1095822 |
+-----------+-------------+-------------+-------------+
|allocations| 1232496 | 1201351 | 31145 |
+-----------------------------------------------------+
| number of checks: 1 |
| number of counts: 2434332 |
| retrieving function names for addresses ... done. |
| reading file info from gdb ... done. |
| sorting by number of not reclaimed bytes ... done. |
| number of call chains: 3 |
| number of ignored call chains: 0 |
| number of reported call chains: 3 |
| number of internal call chains: 3 |
| number of library call chains: 1 |
=======================================================
|
| 3.1% = 33.6 KB of garbage allocated in 47 allocations
| |
| | 0x???????? in
| |
| | 0x081ef2b6 in
| | at src/wrapper.c:489
| |
| | 0x081ef169 in <_realloc>
| | at src/wrapper.c:435
| |
| `-----> 0x081ef05c in
| at src/wrapper.c:318
|
| 0.8% = 8722 Bytes of garbage allocated in 35 allocations
| |
| | 0x???????? in
| |
| | 0x081ef134 in
| | at src/wrapper.c:422
| |
| `-----> 0x081ef05c in
| at src/wrapper.c:318
|
| 0.1% = 1144 Bytes of garbage allocated in 5 allocations
| |
| | 0x???????? in
| |
| | 0x081ef1cb in
| | at src/wrapper.c:455
| |
| `-----> 0x081ef05c in
| at src/wrapper.c:318
|
`------------------------------------------------------
free(0x09cb650c) after reporting
(This can happen with static destructors.
When linking put `ccmalloc.o' at the end (for gcc) or
in front of the list of object files.)
free(0x09cb68f4) after reporting
free(0x09cb68a4) after reporting
free(0x09cb6834) after reporting
free(0x09cb6814) after reporting
free(0x09cb67a4) after reporting
free(0x09cb6784) after reporting
free(0x09cb66cc) after reporting
free(0x09cb66ac) after reporting
free(0x09cb65e4) after reporting
free(0x09cb65c4) after reporting
free(0x09cb653c) after reporting
ccmalloc_report() called in non valid state
I have no clue what this means; it doesn't seem to indicate any memory leaks to me, but I may be wrong. Has anyone come across such a scenario?
Valgrind can be a bitch if you don't really read the manuals or whatever documentation is actually available (the man page, for starters) - but they are worth it.
Basically, you could start by running Valgrind on your application with --gen-suppressions=all, then create a suppression for each block that originates from Qt itself, and use the resulting suppression file to silence those errors; you should then be left only with the errors in your own code.
Also, you could try to use Valgrind through the Alleyoop frontend if that makes things easier for you.
There are also bunch of other tools that can be used to detect memory leaks and Linux Journal has article about those here: http://www.linuxjournal.com/article/6556
And last, in some cases, static analysis tools can spot memory errors too.
I'd like to make the minor point that just because the memory used by a process is increasing, it does not follow that you have a memory leak. Take a word processor as an example - as you write text, the memory usage increases, but there is no leak. Most processes in fact increase their memory usage as they run, often until they reach some sort of near steady state, where objects being created are balanced by old objects being destroyed.
You said you tried Valgrind's memcheck tool; you should also try the massif tool, which should be able to graph the heap usage over time, and tell you where the memory was allocated from.
One of the reasons why top isn't too useful for measuring memory usage is that it doesn't take into account that memory is often shared between processes. For the best overview of where a process has allocated memory, I recommend using a recent Linux kernel and checking /proc/<pid>/smaps for your process. This shows what memory is mapped into that process and from where. For example, here's a snippet from konqueror on my system.
b732a000-b7a20000 r-xp 00000000 fd:05 205437 /usr/lib/qt3/lib/libqt-mt.so.3.3.8
Size: 7128 kB
Rss: 3456 kB
Pss: 347 kB
Shared Clean: 3452 kB
Shared Dirty: 0 kB
Private Clean: 4 kB
Private Dirty: 0 kB
Referenced: 3452 kB
The important thing here is that, although the resident set resulting from the load of libqt-mt.so.3.3.8 is 3456kB, all but 4kB of that is shared between all processes which loaded the library, so it's a one-off system-wide cost. top doesn't expose this information, so just reading the RSS from top is misleading.
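If you want the process to report this itself, here is a minimal sketch (Linux-specific, and it assumes a kernel new enough to expose Pss in smaps) that sums the Rss and Pss lines of /proc/self/smaps:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream smaps("/proc/self/smaps");
    std::string line;
    long rssKb = 0, pssKb = 0;
    while (std::getline(smaps, line)) {
        std::istringstream fields(line);
        std::string key;
        long valueKb = 0;
        fields >> key >> valueKb;
        if (key == "Rss:") rssKb += valueKb;        // resident set, counts shared pages fully
        else if (key == "Pss:") pssKb += valueKb;   // proportional set, shared pages split between processes
    }
    std::cout << "Rss: " << rssKb << " kB, Pss: " << pssKb << " kB\n";
    return 0;
}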