I'm currently writing a video game console emulator based on the ARM7TDMI processor, and I am almost at the stage where I want to test whether the processor is functioning correctly. I have only implemented the CPU and memory parts of the console so far, so the only way to debug the processor is through a logging (console) system. Until now I have tested it simply by fetching dummy opcodes and executing random instructions. Is there an actual ARM7 program (or some other methodology) specifically designed for this purpose, to make sure the processor is functioning correctly? Thanks in advance.
I used dummy opcodes such as:
ADD r0, r0, r1, LSL#2
MOV r1, r0
but in their 32-bit opcode format.
I also wrote some tests for a GBA emulator and found some bugs with them. I have written my own emulators as well (and I work in the processor business, testing processors and boards).
I have a few things that I do regularly. These are my general test methodologies.
There are a number of open source libraries out there, for example zlib and other compression libraries, jpeg, mp3, and so on. It is not hard to run these bare metal: fake fopen, fread, and fwrite with chunks of data and a pointer. The compression libraries, as well as encryption and hash code, can self-test on the target processor: compress something, decompress it, and compare the original with the decompressed output. I often also run the code under test on a host, compute checksums of the compressed and decompressed versions, and hard-code those check values so I can verify them on the target platform. For jpeg, mp3, or hash algorithms I use a host build of the code under test to produce a golden value that I then compare against on the target platform.
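As a rough illustration of the "fake fopen/fread/fwrite" idea, here is a minimal sketch in C++; the buffer contents and names are placeholders, and a real port would wire these stubs into whatever I/O calls the library actually makes.

#include <cstddef>
#include <cstring>

// Hypothetical in-memory "file": the test payload is linked into the image.
static const unsigned char test_input[] = { 0x12, 0x34, 0x56 /* test data baked into the ROM */ };
static std::size_t read_pos = 0;

// Stand-ins for fopen/fread/fwrite: just walk a pointer through the buffer.
void *fake_fopen(const char *, const char *) { read_pos = 0; return &read_pos; }  // dummy handle

std::size_t fake_fread(void *dst, std::size_t size, std::size_t count, void *) {
    std::size_t want = size * count;
    std::size_t have = sizeof(test_input) - read_pos;
    std::size_t n = want < have ? want : have;
    std::memcpy(dst, test_input + read_pos, n);
    read_pos += n;
    return n / size;
}

std::size_t fake_fwrite(const void *src, std::size_t size, std::size_t count, void *) {
    // On bare metal, "writing" can simply checksum the output instead of storing it.
    (void)src;
    return count;
}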
Before doing any of that, though: the flags are very tricky to get right, the carry flag in particular (and signed overflow). Some processors invert the carry-out flag when the operation is a subtract. Subtract is an add with the second operand ones-complemented and the carry-in inverted: a normal add without carry uses a carry-in of zero, so a subtract without borrow is an add with the second operand inverted and a carry-in of 1. That inversion of the carry-out also affects a subtract-with-borrow instruction, if the instruction set has one, depending on whether or not the carry is inverted on the way in.
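Here is a minimal sketch, assuming the ARM convention (the carry out of a subtract is not inverted, i.e. C=1 means no borrow), of how an emulator might compute C and V for ADD and SUB by routing both through one adder; the struct and function names are placeholders for whatever your core uses.

#include <cstdint>

struct Flags { bool n, z, c, v; };

// One adder handles both ADD and SUB: for SUB, invert op2 and feed carry_in = 1.
uint32_t add_with_flags(uint32_t a, uint32_t b, uint32_t carry_in, Flags &f) {
    uint64_t wide = (uint64_t)a + (uint64_t)b + carry_in;
    uint32_t result = (uint32_t)wide;
    f.n = (result >> 31) & 1;
    f.z = (result == 0);
    f.c = (wide >> 32) & 1;                       // unsigned overflow (ARM: no inversion on SUB)
    f.v = ((~(a ^ b) & (a ^ result)) >> 31) & 1;  // signed overflow: operands agree, result differs
    return result;
}

uint32_t do_add(uint32_t a, uint32_t b, Flags &f) { return add_with_flags(a, b, 0, f); }
uint32_t do_sub(uint32_t a, uint32_t b, Flags &f) { return add_with_flags(a, ~b, 1, f); }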
It is sometimes obvious from the conditional branch definitions (if C is this and V is that, if C is this and Z is that, for the unsigned and signed variations of less than, greater than, etc.) how a particular processor manages the carry (unsigned overflow) and signed overflow flags, without having to experiment on real silicon. I don't memorize which processor does what; I figure it out per instruction set, so I don't remember offhand what ARM does.
ARM has nuances in the shift operations that you have to be careful to implement properly; read the pseudocode under each instruction: if the shift amount == 32 do this, if the shift amount == 0 do that, otherwise do this other thing. With the ARM7 you could also do unaligned accesses if the alignment fault was disabled, and it would rotate the data around within the 32 bits, or something like that: if the 32 bits at address 0 were 0x12345678, then a 16-bit read at address 1 would give you something like 0x78123456 on the bus, and the destination would then get 0x3456. Hopefully most folks didn't rely on that. That and other "UNPREDICTABLE RESULTS" notes changed from one ARM ARM to the next (if you have some of the different hardcopy manuals this will be more obvious: the old white-covered one, both the skinny and the thick versions, and the blue-covered one). So depending on which manual you read (for those ARMv4 processors), you were sometimes allowed to do something and sometimes not, and you might find code/binaries that do things you think are unpredictable if you only rely on one manual.
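A minimal sketch of that rotate-on-unaligned-load behavior, assuming a little-endian bus and a hypothetical read32_aligned() accessor in your emulator's memory system; check the ARM ARM pseudocode before treating this as authoritative.

#include <cstdint>

// Hypothetical aligned 32-bit bus read provided by the emulator's memory system.
uint32_t read32_aligned(uint32_t addr);

// Unaligned LDR on ARM7TDMI: the bus returns the aligned word rotated right
// by 8 * (addr & 3) bits, so the low bits hold the addressed byte.
uint32_t ldr_unaligned(uint32_t addr) {
    uint32_t word = read32_aligned(addr & ~3u);
    unsigned rot  = (addr & 3u) * 8;
    return rot ? (word >> rot) | (word << (32 - rot)) : word;
}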
Different compilers generate different code sequences, so if you can, find several ARM compilers (clang/llvm and gcc being the obvious first choices), and get eval copies of others if you can (Keil is probably a good choice; it is now owned by ARM and I think it contains both the Keil and the RVCT ARM compilers). Compile the same test code with different optimization settings, test every one of those variations, and repeat that for each compiler. If you only use one compiler for testing you will leave gaps: instruction sequences, and even whole instructions or variations, that are never tested because that compiler never generates them. I hit this exact problem once. Using open source code you also get different programmer habits; whether it is asm, C, or another language, different individuals have different habits and as a result generate different instruction sequences and mixes of instructions, which can hide or expose processor bugs. If this is a single-person hobby project you will eventually have to rely on others. The good thing with a GBA or DS or whatever emulator is that once you start using ROMs you will have a large volume of other people's code; unfortunately, debugging that is difficult.
I heard some hearsay once that Intel/x86 design validation folks use operating systems, various ones, to beat on their processors; it creates a lot of chaos and variation. It beats up the processor, but like the ROMs it is extremely difficult to debug if something goes wrong. I have personal experience with that, with caches and such, running Linux on the processors I have worked on: we didn't find the bug until we had Linux ported and booting, and the bug was crazy hard to find. Fortunately the ARM7TDMI does not have a cache. If you do have a cache, then take the combinations I described above (test code multiplied by optimization level multiplied by compiler) and add to that, in the bootstrap or elsewhere, versions compiled with one, two, three, or four nops or other padding, so that the alignment of the binary changes relative to the cache lines and the same program exercises the cache differently.
In this case, where there is real hardware you are trying to emulate, you can do things like write a program that generates random ALU machine code: generate dozens of instructions with randomly chosen source and destination registers; randomize add, subtract, and, or, not, etc.; randomize the flag-setting bit on and off; and so on. Pre-load all the registers, set the flags to a known state, run that chunk of code, then capture the registers and flags and see how they compare to real hardware. You can produce an effectively infinite number of separate tests of various lengths, and it is easier to debug one of these than a code sequence that does some data or flag thing as part of a more complicated program. A rough sketch of such a generator follows.
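This is only a sketch, under the assumption that the ARMv4 data-processing encoding is cond | 00 | I | opcode | S | Rn | Rd | operand2; the opcode values below are from memory, so verify them against the ARM ARM before trusting the generated words.

#include <cstdint>
#include <random>
#include <vector>

// Encode ARM data-processing instructions, register form (I bit = 0),
// condition AL (0xE), with a random S bit and random registers.
std::vector<uint32_t> random_alu_block(std::size_t count, std::mt19937 &rng) {
    // AND, SUB, ADD, ORR (a subset of the 4-bit opcode field, values assumed from the ARM ARM)
    const uint32_t opcodes[] = { 0x0, 0x2, 0x4, 0xC };
    std::uniform_int_distribution<uint32_t> reg(0, 12);   // avoid SP/LR/PC for simplicity
    std::uniform_int_distribution<uint32_t> op(0, 3);
    std::uniform_int_distribution<uint32_t> bit(0, 1);

    std::vector<uint32_t> block;
    for (std::size_t i = 0; i < count; ++i) {
        uint32_t insn = 0xE0000000                 // cond = AL, I = 0
                      | (opcodes[op(rng)] << 21)   // opcode field
                      | (bit(rng)         << 20)   // S bit: randomly update flags or not
                      | (reg(rng)         << 16)   // Rn
                      | (reg(rng)         << 12)   // Rd
                      |  reg(rng);                 // Rm (no shift applied)
        block.push_back(insn);
    }
    return block;
}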
Take that combination of test programs, multiplied by optimization setting, multiplied by compiler, etc., and beat on it with interrupts. Vary the rate of the interrupts. Since this is a simulator, you can do something I once needed hardware for: in the interrupt handler, examine the return address, compute an address some number of instructions ahead of it, and remember that address. Return from the interrupt, and when you see that address being fetched, fire a prefetch abort; when the prefetch abort fires, stop watching that address (in the simulation), and have the prefetch abort handler return to where the abort happened (per the ARM ARM) and let execution continue. I was able to create a fair amount of pain on the processor under test with this setup, particularly with the caches on, which you don't have on an ARM7TDMI.
Note that a high percentage of GBA games are Thumb mode, because on that platform, which mostly used 16-bit-wide data buses, Thumb code ran (much) faster than ARM code even though Thumb takes about 10-15% more instructions, and it also takes less ROM space for the binary. Carefully examine the BLX instruction, as I think there are different implementations based on architecture: ARMv4 is different from ARMv6 or v7, so if you are using an ARMv6 or v7 manual (or hardware) as a reference for validation, understand those differences.
Blah, blah, blah, TL;DR. Sorry for rambling; this is a fun topic for me...
Intel's intrinsics guide lists a number of intrinsics for the AVX-512 K* mask instructions, but there seem to be a few missing:
KSHIFT{L/R}
KADD
KTEST
The Intel developer manual claims that intrinsics are not necessary as they are auto generated by the compiler. How does one do this though? If it means that __mmask* types can be treated as regular integers, it would make a lot of sense, but testing something like mask << 4 seems to cause the compiler to move the mask to a regular register, shift it, then move back to a mask. This was tested using Godbolt's latest GCC and ICC with -O2 -mavx512bw.
It is also interesting to note that the intrinsics only deal with __mmask16 and not the other types. I haven't tested much, but it looks like ICC doesn't mind taking in an incorrect type, whereas GCC does seem to try to ensure that there are only 16 bits in the mask, if you use the intrinsics.
Am I simply overlooking the correct intrinsics for the above instructions (and the other __mmask* type variants), or is there another way to achieve the same thing without resorting to inline assembly?
Intel's documentation saying, "not necessary as they are auto generated by the compiler" is in fact correct. And yet, it's unsatisfying.
But to understand why it is the way it is, you need to look at the history of AVX512. While none of this information is official, it's strongly implied by the evidence.
The reason the mask intrinsics got into the mess they are in now is probably that AVX512 was "rolled out" in multiple phases without sufficient forward planning for the next phase.
Phase 1: Knights Landing
Knights Landing added 512-bit registers which only have 32-bit and 64-bit data granularity. Therefore the mask registers never needed to be wider than 16 bits.
When Intel was designing this first set of AVX512 intrinsics, they went ahead and added intrinsics for almost everything, including the mask registers. This is why the mask intrinsics that do exist are only 16 bits, and why they only cover the instructions that exist in Knights Landing. (Though I can't explain why KSHIFT is missing.)
On Knights Landing, mask operations were fast (2 cycles), but moving data between mask registers and general registers was really slow (5 cycles). So it mattered where the mask operations were done, and it made sense to give the user finer-grained control over moving data back and forth between mask registers and GPRs.
Phase 2: Skylake Purley
Skylake Purley extends AVX512 to cover byte-granular lanes, which increased the width of the mask registers to the full 64 bits. This second round also added KADD and KTEST, which didn't exist in Knights Landing.
These new mask instructions (KADD, KTEST, and 64-bit extensions of existing ones) are the ones that are missing their intrinsic counterparts.
While we don't know exactly why they are missing, there is some strong evidence suggesting why:
Compiler/Syntax:
On Knights Landing, the same mask intrinsics were used for both 8-bit and 16-bit masks; there was no way to distinguish between them. Extending them to 32-bit and 64-bit would have made the mess worse. In other words, Intel didn't design the mask intrinsics correctly to begin with, and they decided to drop them completely rather than fix them.
Performance Inconsistencies:
Bit-crossing mask instructions on Skylake Purley are slow. While all bit-wise instructions are single-cycle, KADD, KSHIFT, KUNPACK, etc. are all 4 cycles. But moving between mask and GPR is only 2 cycles.
Because of this, it's often faster to move masks into GPRs, operate on them there, and move them back. But the programmer is unlikely to know this. So rather than giving the user full control of the mask registers, Intel opted to just have the compiler make this decision.
Making the compiler make this decision means the compiler needs to have such logic. The Intel Compiler currently does: it will generate kadd and family in certain (rare) cases. But GCC does not; on GCC, all but the most trivial mask operations will be moved to GPRs and performed there instead.
Final Thoughts:
Prior to the release of Skylake Purley, I personally had a lot of AVX512 code written up, which included a lot of AVX512 mask code. It was written with certain performance assumptions (single-cycle latency) that turned out to be false on Skylake Purley.
From my own testing on Skylake X, some of my mask-intrinsic code that relied on bit-crossing operations turned out to be slower than the compiler-generated versions that moved them to GPRs and back. The reason, of course, is that KADD and KSHIFT were 4 cycles instead of 1.
Of course, I would prefer that Intel provide the intrinsics and give us the control we want. But it's very easy to go wrong here (in terms of performance) if you don't know what you're doing.
Update:
It's unclear when this happened, but the latest version of the Intel Intrinsics Guide has a new set of mask intrinsics with a new naming convention that covers all the instructions and widths. These new intrinsics supersede the old ones.
So this solves the entire problem. Though the extent of compiler support is still uncertain.
Examples:
_kadd_mask64()
_kshiftri_mask32()
_cvtmask16_u32() supersedes _mm512_mask2int()
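A minimal usage sketch of the new-style mask intrinsics, assuming a compiler recent enough to expose them in <immintrin.h> and built with AVX512BW/DQ enabled (e.g. -mavx512bw -mavx512dq); whether a given compiler version actually emits kaddq/kshiftrd rather than a GPR round-trip still has to be checked in the generated assembly.

#include <immintrin.h>
#include <cstdint>

// Combine two 64-bit masks and shift a 32-bit mask, staying in mask registers
// (at least as far as the intrinsics express it).
uint32_t mask_demo(__mmask64 a, __mmask64 b, __mmask32 m) {
    __mmask64 sum     = _kadd_mask64(a, b);        // KADDQ k, k, k
    __mmask32 shifted = _kshiftri_mask32(m, 4);    // KSHIFTRD k, k, imm8
    // _cvtmask16_u32 supersedes the old _mm512_mask2int for 16-bit masks.
    uint32_t low16    = _cvtmask16_u32((__mmask16)sum);
    return low16 ^ _cvtmask32_u32(shifted);
}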
Is it possible to schedule a given task to run exactly n machine instructions before control is returned to the user?
The motivation for this question is the debugging of multithreaded programs, where this could be helpful to reliably reproduce certain bugs or undefined behaviour.
I'm particularly interested in the case of x86_64-linux running on an Intel CPU, but solutions for other architectures or operating systems would also be interesting.
The documentation for the kernel perf suite says
Performance counters are special hardware registers available on most modern CPUs. These registers count the number of certain types of hw events: such as instructions executed, cache-misses suffered, or branches mis-predicted - without slowing down the kernel or applications. These registers can also trigger interrupts when a threshold number of events have passed.
so it seems like the hardware could support this in principle, but I'm not sure if this is exposed in any way to the user.
Of course it's possible to just use ptrace to single-step the program n times, but that would make all but the most simple programs impossibly slow.
One simple option to ensure an exact count of the instructions executed is to instrument the assembly code and maintain an execution counter. I believe the easiest way to do instrumentation is Pin ( https://software.intel.com/en-us/articles/pintool ).
High level idea:
- interpret the machine code and maintain a counter of the number of instructions executed,
- after each instruction, increment the counter and check whether it is time for a breakpoint,
- reset the counter after each breakpoint (a minimal Pin-style sketch follows this list).
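As a sketch of that idea with Pin, here is roughly the classic instruction-counting pintool (modeled on Pin's inscount example); the breakpoint action is left as a placeholder (a hypothetical take_breakpoint()), and the threshold value is arbitrary.

#include "pin.H"
#include <iostream>

static UINT64 icount = 0;
static const UINT64 kBreakEvery = 1000000;   // hypothetical threshold: "n instructions"

// Called before every executed instruction.
static VOID docount() {
    if (++icount == kBreakEvery) {
        icount = 0;
        // Placeholder: stop the program, raise SIGTRAP, dump state, etc.
        // take_breakpoint();
    }
}

// Instrumentation callback: insert docount() before each instruction.
static VOID Instruction(INS ins, VOID *) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
}

static VOID Fini(INT32, VOID *) {
    std::cerr << "instructions executed since last break: " << icount << std::endl;
}

int main(int argc, char *argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();   // never returns
    return 0;
}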
The interpretation idea would introduce quite a bit of overhead. I see a number of straightforward optimizations:
Instrument the binary statically (create a new binary where all these increments/checks are hard-coded). Such an approach would eliminate the instrumentation/interpretation overhead. You can either count the monitoring/breakpoint instructions as extra instructions executed or exclude them from the count.
The increments/checks can be implemented more cleverly: for a straight-line block of instructions with no jumps/branches, a single increment and a single check suffice. The idea is simple but might prove pretty tricky in practice, especially if you need an absolutely accurate breakpoint.
Each CPU instruction consumes a number of bytes. The smaller the size, the more instructions can be held in the CPU cache.
What techniques are available when writing C++ code which allow you to reduce CPU instruction sizes?
One example could be reducing the number of far jumps (jumps to code at distant addresses). Because the offset is a smaller number, the encoding used for it is smaller and the overall instruction is smaller.
I thought GCC's __builtin_expect may reduce jump instruction sizes by putting unlikely instructions further away.
I think I have seen somewhere that it's better to use an int32_t rather than an int16_t, because it is the native CPU integer size and therefore yields more efficient CPU instructions.
Or is this something which can only be done whilst writing assembly?
Now that we've all fought over micro/macro optimization, let's try to help with the actual question.
I don't have a full, definitive answer, but you might be able to start here. GCC has some macro hooks for describing performance characteristics of the target hardware. You could theoretically set up a few key macros to help gcc favor "smaller" instructions while optimizing.
Based on very limited information from this question and its one reply, you might be able to get some gain from the TARGET_RTX_COSTS costs hook. I haven't yet done enough follow-up research to verify this.
I would guess that hooking into the compiler like this will be more useful than any specific C++ idioms.
Please let us know if you manage any performance gain. I'm curious.
If a processor has various length (multi-byte) instructions, the best you can do is to write your code to help the compiler make use of the smaller instruction sizes.
Get the Code Working Robustly & Correctly First
Debugging optimized code is more difficult than debugging code that is not optimized: with unoptimized code, the symbols used by the debugger line up with the source code better. During optimization, the compiler can eliminate or rearrange code, which gets the executable out of sync with the source listing.
Know Your Assembly Instructions
Not all processors have variable-length instructions. Become familiar with your processor's instruction set. Find out which instructions are small (one byte) versus multi-byte.
Write Code to Use Small Assembly Instructions
Help out your compiler and write your code to take advantage of the small length instructions.
Print out the assembly language code to verify that the compiler uses the small instructions.
Change your code if necessary to help out the compiler.
There is no guarantee that the compiler will use small instructions. The compiler emits instructions that it thinks will have the best performance according to the optimization settings.
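As a small illustration of the "print out the assembly and verify" step: the function below can be compiled to assembly and inspected to see which instruction forms the compiler actually picked. The file name and flags are only an example; any optimizing compiler will do.

// example.cpp -- compile to assembly with, e.g.:
//   g++ -Os -S example.cpp -o example.s     (optimize for size, emit assembly)
//   g++ -O2 -S example.cpp -o example.s     (optimize for speed, compare the output)
#include <cstdint>

// Summing with the native register width usually avoids extra size-extension
// instructions; compare the generated assembly against a version using int16_t.
int32_t sum(const int32_t *data, int32_t count) {
    int32_t total = 0;
    for (int32_t i = 0; i < count; ++i)
        total += data[i];
    return total;
}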
Write Your Own Assembly Language Function
After generating the assembly language source code, you are now better equipped to replace the high level language with an assembly language version. You have the freedom to use small instructions.
Beware the Jabberwocky
Smaller instructions may not be the best solution in all cases. For example, Intel processors have block instructions (which perform operations on blocks of data). These block instructions perform better than loops of small instructions. However, the block instructions take up more bytes than the smaller instructions.
The processor will fetch as many bytes as necessary, depending on the instruction, into its instruction cache. If you can write loops or code that fits into the cache, the instruction sizes become less of a concern.
Also, many processors will use large instructions to communicate with other processors, such as a floating-point processor. Reducing the floating-point math in your program may reduce the quantity of these instructions.
Trim the Code Tree & Reduce the Branches
In general, branching slows down processing. Branches are changes of execution to a new location, such as loops and function calls. Processors love data instructions, because they don't have to reload the instruction pipeline. Increasing the proportion of data instructions and reducing the number of branches will improve performance, usually regardless of instruction size.
I'd confidently say 99% of applications we write don't need to address more than 2 GB of memory. Of course, there's a lot of obvious benefit to the OS running 64-bit to address more RAM, but is there any particular reason a typical application would be compiled 64-bit?
There are performance improvements that you might see with 64-bit. A good example is that some function-call parameters are passed via registers (fewer things to push onto the stack).
Edit
I looked up some of my old notes from when I was studying the differences between running our product as a 64-bit build versus a 32-bit build. I ran the tests on a quad-core 64-bit machine, so there is a question of comparing apples to oranges, since the 32-bit build was obviously running under emulation. However, many of the things I have read, such as this, consistently say that the speed hit for WOW64 is not significant. And even if that statement is not true, your application will almost certainly be run on a 64-bit OS, so a comparison of a 32-bit build versus a 64-bit build on a 64-bit machine has value.
In the testing I performed (certainly not comprehensive), I did not find any cases where the 32-bit build was faster. However, many of the SQL-intensive operations I ran (high CPU and high I/O) were 20% to 50% faster with the 64-bit build. These tests involved some fairly “ugly” SQL statements and also some TPCC tests with high concurrency. Of course, a lot depends on compiler switches, so you need to do your own testing.
Building them as 64-bit now, even if you never release the build, can help you find and repair problems that you will encounter later when you're forced to build and release as 64-bit.
x64 has eight more general purpose registers that aren't available when running 32-bit code. That's three times as many (twice as many if you count ESI, EDI, EBP and ESP as general purpose; I don't). That can save a lot of loads and stores in functions that use more than four variables.
Don't underestimate the marketing value of offering a native 64-bit version of your product.
Also, you might be surprised just how many people work on apps that require as much memory as they can get.
I'd say only do it if you need more that 2GB.
One thing is that 64-bit compilation means (obviously) 64-bit pointers. That means code and data structures get a bit bigger, so the app will benefit a little less from the cache and will hit virtual memory a bit more often, etc.
So, if you don't need it, the basic effect is to make your app a bit slower and more bloated for no reason.
That said, as time goes on, you'll care more about 64-bit anyway, just because that's what all the tools, libraries, etc. will be written for. Even if your app could live quite happily in 64K, you're unlikely to use 16-bit code: the gains don't really matter (it's a small, fast app anyway) and are certainly outweighed by the hassle involved. In time, we'll see 32-bit much the same way.
You could consider it future-proofing. It may be a long way off, but consider some years into the future, when 64-bit OSes and CPUs are ubiquitous (consider how 16-bit faded away when 32-bit took over). If your application is 32-bit and all your competitors have moved on to 64-bit by then, your application could be seen as (or accused by your competitors of being) out of date, slower, or incapable of change. Maybe one day support for 32-bit applications will even be dropped or incomplete (can Windows 7 run 16-bit apps properly?). If you're already building a 64-bit version of your application today, you avoid these problems. If you put it off till later, you might write a lot more code between now and when you port, and then your port will be even harder.
For a lot of applications there aren't many compelling technical reasons, but if it's easy, porting now might save you effort in future.
If you don't need the extended address space, delivering in 64-bit mode offers nothing and has some disadvantages, such as increased memory consumption and cache pressure.
While we offer 64-bit builds, our customers who are at the limit are pushing us to reduce memory consumption so that they get those benefits.
All applications that may need lots of memory: database servers that want to cache lots of data in memory, scientific applications that handle lots of data, ...
I've recently read this article, Optimizing software in C++. In chapter 2.3, Choice of operating system, there is a comparison of the advantages and disadvantages of 64- and 32-bit systems, with some specific observations regarding Windows.
Mark Wilkins already noted in this thread that more registers are available for function calls. Another interesting property of a 64-bit system is this:
The SSE2 instruction set is supported on all 64-bit CPUs and operating systems.
SSE2 instructions can provide excellent optimizations and they are being increasingly used, so in my opinion this is a notable feature.
Fastcall makes calling subroutines faster by keeping the first four parameters in registers.
When you say that 99% of apps won't benefit from 64-bit, that may well be true for you personally, but during the day I use Visual Studio and Xcode to compile C++ with a large codebase, search the multi-Gb repositories with Google Desktop and Spotlight. Then I come home to write music using a sequencer using several Gb of sound libraries, and do some photoshopping on my 20Gb of photos, and maybe do a bit of video editing with my holiday clips.
So for me (and I dare say many other users), having 64-bit versions of many of these apps will be a great advantage. Word processor, web browser, email client: maybe not. But anything involved with large media will really benefit.
More data can be processed per clock cycle, which can deliver performance improvements to applications such as crypto, video encoding, etc.
Is there a way in C++ to determine the CPU's cache size? I have an algorithm that processes a lot of data and I'd like to break this data down into chunks such that they fit into the cache. Is this possible?
Can you give me any other hints on programming with cache size in mind (especially with regard to multithreaded/multicore data processing)?
Thanks!
According to "What Every Programmer Should Know About Memory" by Ulrich Drepper, you can do the following on Linux:
Once we have a formula for the memory requirement we can compare it with the cache size. As mentioned before, the cache might be shared with multiple other cores. Currently {There definitely will sometime soon be a better way!} the only way to get correct information without hardcoding knowledge is through the /sys filesystem. In Table 5.2 we have seen what the kernel publishes about the hardware. A program has to find the directory:
/sys/devices/system/cpu/cpu*/cache
This is listed in Section 6: What Programmers Can Do.
He also describes a short test right under Figure 6.5 which can be used to determine L1D cache size if you can't get it from the OS.
There is one more thing I ran across in his paper: sysconf(_SC_LEVEL2_CACHE_SIZE) is a call available on Linux that is supposed to return the L2 cache size, although it doesn't seem to be well documented.
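A minimal sketch of that sysconf approach, assuming glibc on Linux (these _SC_LEVEL* constants are a glibc extension and, as noted, not well documented; they may return 0 or -1 on systems that don't fill them in):

#include <unistd.h>
#include <cstdio>

int main() {
    long l1_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  // L1 data cache line size in bytes
    long l1_size = sysconf(_SC_LEVEL1_DCACHE_SIZE);      // L1 data cache size in bytes
    long l2_size = sysconf(_SC_LEVEL2_CACHE_SIZE);       // L2 cache size in bytes
    std::printf("L1d line: %ld bytes, L1d size: %ld bytes, L2 size: %ld bytes\n",
                l1_line, l1_size, l2_size);
    return 0;
}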
C++ itself doesn't "care" about CPU caches, so there's no support for querying cache sizes built into the language. If you are developing for Windows, there's the GetLogicalProcessorInformation() function, which can be used to query information about the CPU caches.
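A rough sketch of how GetLogicalProcessorInformation() can be queried for cache descriptors (Windows-only; error handling is minimal and which fields you report is up to you):

#include <windows.h>
#include <vector>
#include <cstdio>

int main() {
    DWORD len = 0;
    // First call fails with ERROR_INSUFFICIENT_BUFFER and reports the required size.
    GetLogicalProcessorInformation(nullptr, &len);
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(info.data(), &len)) return 1;

    for (const auto &entry : info) {
        if (entry.Relationship == RelationCache) {
            const CACHE_DESCRIPTOR &c = entry.Cache;
            std::printf("L%u cache: %lu bytes, line size %u bytes\n",
                        c.Level, c.Size, c.LineSize);
        }
    }
    return 0;
}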
Preallocate a large array. Then access each element sequentially and record the time for each access. Ideally there will be a jump in access time when a cache miss occurs. Then you can calculate the size of your L1 cache. It might not work, but it's worth trying.
Read the CPUID of the CPU (x86) and then determine the cache size from a look-up table. The table has to be filled with the cache sizes that the manufacturer of the CPU publishes in its programming manuals.
Depending on what you're trying to do, you might also leave it to some library. Since you mention multicore processing, you might want to have a look at Intel Threading Building Blocks.
TBB includes cache-aware memory allocators. More specifically, check cache_aligned_allocator (in the reference documentation; I couldn't find any direct link).
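For example, a cache_aligned_allocator can simply be dropped into a standard container; this is a minimal sketch assuming TBB's headers are available (the padding to cache-line boundaries costs some memory, so use it where false sharing actually matters):

#include <tbb/cache_aligned_allocator.h>
#include <vector>

int main() {
    // Each allocation is aligned (and padded) to a cache line, so buffers used
    // by different threads won't share a line with neighboring allocations.
    std::vector<float, tbb::cache_aligned_allocator<float>> per_thread_sums(1024, 0.0f);
    per_thread_sums[0] = 1.0f;
    return 0;
}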
Interestingly enough, I wrote a program to do this a while ago (in C, though I'm sure it will be easy to incorporate into C++ code).
http://github.com/wowus/CacheLineDetection/blob/master/Cache%20Line%20Detection/cache.c
The get_cache_line function is the interesting one: it returns the location right before the biggest spike in the timing data of the array accesses. It guessed correctly on my machine! If nothing else, it can help you make your own.
It's based on this article, which originally piqued my interest: http://igoro.com/archive/gallery-of-processor-cache-effects/
You can see this thread: http://software.intel.com/en-us/forums/topic/296674
The short answer is in this other thread:
On modern IA-32 hardware, the cache line size is 64. The value 128 is a legacy of the Intel Netburst Microarchitecture (e.g. Intel Pentium D) where 64-byte lines are paired into 128-byte sectors. When a line in a sector is fetched, the hardware automatically fetches the other line in the sector too. So from a false sharing perspective, the effective line size is 128 bytes on the Netburst processors. (http://software.intel.com/en-us/forums/topic/292721)
IIRC, GCC has a __builtin_prefetch hint.
http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html
has an excellent section on this. Basically, it suggests:
__builtin_prefetch (&array[i + LookAhead], rw, locality);
where rw is a 0 (prepare for read) or 1 (prepare for a write) value, and locality uses the number 0-3, where zero is no locality, and 3 is very strong locality.
Both are optional. LookAhead would be the number of elements to look ahead. If a memory access takes 100 cycles and the unrolled loop iterations are two cycles apart, LookAhead could be set to 50 or 51.
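A minimal sketch of that prefetch pattern inside a loop; the look-ahead distance of 50 is only the illustrative number from above and would need tuning for the target machine:

// Sum an array while prefetching data roughly 50 elements ahead of the
// current position (0 = prefetch for read, 3 = high temporal locality).
double sum_with_prefetch(const double *array, long n) {
    const long kLookAhead = 50;   // illustrative value; tune per machine
    double total = 0.0;
    for (long i = 0; i < n; ++i) {
        if (i + kLookAhead < n)
            __builtin_prefetch(&array[i + kLookAhead], 0, 3);
        total += array[i];
    }
    return total;
}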
There are two cases that need to be distinguished. Do you need to know the cache sizes at compile time or at runtime?
Determining the cache-size at compile-time
For some applications, you know the exact architecture that your code will run on, for example if you can compile the code directly on the host machine. In that case, simply looking up the size and hard-coding it is an option (this could be automated in the build system). On most machines today, the L1 cache line is 64 bytes.
If you want to avoid that complexity, or if you need to support compilation on unknown architectures, you can use the C++17 feature std::hardware_constructive_interference_size as a good fallback. It provides a compile-time estimate of the cache line size, but be aware of its limitations: the compiler cannot guess perfectly when it creates the binary, as the cache line size is, in general, architecture-dependent.
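A short sketch of how that constant can be used, assuming a standard library that actually defines it (support varies, so a fallback constant is included behind the feature-test macro):

#include <new>
#include <cstddef>

// Fall back to 64 bytes when the library doesn't define the C++17 constants
// (the feature-test macro below is only set when they are available).
#if defined(__cpp_lib_hardware_interference_size)
constexpr std::size_t cache_line = std::hardware_constructive_interference_size;
#else
constexpr std::size_t cache_line = 64;
#endif

// Example: process data in chunks that are a whole number of cache lines.
constexpr std::size_t chunk_bytes = 64 * cache_line;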
Determining the cache-size at runtime
At runtime, you have the advantage that you know the exact machine, but you will need platform-specific code to read the information from the OS. A good starting point is the code snippet from this answer, which supports the major platforms (Windows, Linux, macOS). In a similar fashion, you can also read the L2 cache size at runtime.
I would advise against trying to guess the cache line size by running benchmarks at startup and measuring which variant performed best. It might well work, but it is also error-prone if the CPU is being used by other processes.
Combining both approaches
If you have to ship one binary, and the machines it will later run on feature a range of different architectures with varying cache sizes, you could create specialized code parts for each cache size and then dynamically (at application startup) choose the best-fitting one.
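A minimal sketch of that dispatch idea: compile the kernel for a few candidate cache sizes and pick one function pointer at startup based on the detected size. Here detect_l2_bytes() stands in for whichever runtime query from above you use, and the kernel itself is assumed to be defined elsewhere.

#include <cstddef>

// Hypothetical runtime query (e.g. the sysconf or Windows code shown earlier).
std::size_t detect_l2_bytes();

// The same kernel specialized for different chunk sizes at compile time.
template <std::size_t ChunkBytes>
void process(const char *data, std::size_t n);   // defined elsewhere

using ProcessFn = void (*)(const char *, std::size_t);

// Chosen once at startup, then used for the rest of the run.
ProcessFn select_kernel() {
    std::size_t l2 = detect_l2_bytes();
    if (l2 >= (1 << 20)) return &process<(1 << 20)>;   // 1 MiB chunks
    if (l2 >= (1 << 19)) return &process<(1 << 19)>;   // 512 KiB chunks
    return &process<(1 << 18)>;                        // 256 KiB chunks
}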
The cache will usually do the right thing. The only real worry for the normal programmer is false sharing, and you can't take care of that at runtime because it requires compile-time directives.