TI-83 Emulator question with the ROM - C++

I have been building knowledge of computers and C++ for quite a while now, and I've decided I want to try making an emulator to get an even better understanding. I want to try making a TI-83 Emulator (runs on a Zilog Z80 CPU). I currently have two problems:
The first is that the "PC" register that points to the current instruction is only 16 bits, but the TI-83 ROM I downloaded is 256 KB. How is a 16-bit value supposed to point to an address beyond ~64 KB?
Secondly, where is the entry point on the ROM? Does the execution just begin at 0x0000?
Thanks, and hopefully you can help me understand a bit on how this works.

There is most likely a programmable paging register outboard of the processor core that can be set to map a portion of the 256 KB at a time into part of the 64 KB address space. You will need to emulate that. Hopefully you can find out about this in official or unofficial documentation. If you have a schematic or PCB it might even be visible as an external PAL or collection of logic chips.
I forget off the top of my head where a Z80 starts executing on reset, but I'm sure you will find it in the processor manual, which is a necessary tool for writing an emulator anyway.
You'll want to make sure the core used is truly a Z80 and not some kind of custom extended version thereof.
And of course I'm sure someone has already done this, so your project is likely to be more about learning - though in the end you might surpass any available solution if you work on it long enough.

The Developer Guide describes how memory is arranged, although it doesn't actually describe how the mapping works.
Short version: the address space is divided into four 16K pages, the first of which always maps page 0 of the 32-page flash ROM.
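To make the paging idea concrete, here is a minimal sketch of how an emulator might translate a 16-bit Z80 address through a bank register. The class name, the window layout, and the way the page register is set are illustrative assumptions, not the real TI-83 hardware interface - check the hardware docs for the actual port numbers and mapping.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical sketch of a bank-switched Z80 address space: a 16-bit
    // address is translated through a page register into a larger ROM.
    class PagedMemory {
    public:
        explicit PagedMemory(std::vector<uint8_t> rom)
            : rom_(std::move(rom)), ram_(0x8000, 0x00) {}

        // Writing the (assumed) paging register selects which 16 KB ROM page
        // appears in the 0x4000-0x7FFF window.
        void setRomPage(uint8_t page) {
            romPage_ = page % (rom_.size() / 0x4000);
        }

        uint8_t read(uint16_t addr) const {
            if (addr < 0x4000)               // 0x0000-0x3FFF: ROM page 0, fixed
                return rom_[addr];
            if (addr < 0x8000)               // 0x4000-0x7FFF: switchable ROM page
                return rom_[romPage_ * 0x4000 + (addr - 0x4000)];
            return ram_[addr - 0x8000];      // upper 32 KB: RAM (simplified)
        }

    private:
        std::vector<uint8_t> rom_;           // e.g. 256 KB = sixteen 16 KB pages
        std::vector<uint8_t> ram_;
        std::size_t romPage_ = 0;
    };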

Intel IA32 cheat sheet

I am looking for a document or textbook that covers the technical specifications, magic numbers etc for the IA32 architecture.
The Intel manuals are all well and good, but I am looking for something much more concise.
The particular project I am working on (a new type of OS) requires an intimate knowledge of hardware addresses and basic systems architecture.
I have no need to make use of much of the detail covered in the Intel manuals - just the technical details required for implementing task switching and virtual memory - from scratch!
Could anyone point me in the direction of some good resources?
Thanks.
The Intel and AMD manuals are probably the best resources for this. You obviously don't need to read all of it, just the relevant sections - for example AMD's "AMD64 Architecture Programmer's Manual, Volume 2", where chapter 5 covers "page translation", which is the basis for virtual memory.
Edit: Declaring bias: I have worked for AMD and I still prefer AMD to Intel - both when it comes to the literature and the actual product.
Task switching is typically done by simply saving the context of one process and restoring the new process's context using, mainly, regular instructions, plus a move to CR3 to install the page tables for the new process (you usually don't have to save the old CR3 value, since it is "fixed" per process; you just load the new one from wherever it is stored in the data for that process).
In 32-bit mode, the x86 architecture does have "built in" task switching, but it's not used by any modern OS, and is quite a bit slower due to its "save everything, restore everything" approach. Manually writing the task save/restore code is generally not that hard, and you can clearly avoid saving and restoring a lot of data. You still need to use the "Task-state segment" (Chapter 12 in AMD's literature) to allow for stack switching between Kernel and User mode.
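Kernel-mode task switching can't be demonstrated in ordinary user code, but the core idea of saving one execution context and restoring another can be sketched in user space with the POSIX ucontext API. This is a rough analogy only - a real kernel saves registers into its own structures and reloads CR3 itself - and the stack size and messages below are arbitrary.

    #include <ucontext.h>
    #include <cstdio>

    // Rough user-space analogy of a context switch using POSIX ucontext.
    static ucontext_t mainCtx, taskCtx;
    static char taskStack[64 * 1024];

    static void taskBody() {
        std::puts("task: running, now switching back");
        swapcontext(&taskCtx, &mainCtx);     // save task context, resume main
        std::puts("task: resumed a second time");
    }

    int main() {
        getcontext(&taskCtx);                     // initialize the context
        taskCtx.uc_stack.ss_sp = taskStack;       // give the task its own stack
        taskCtx.uc_stack.ss_size = sizeof(taskStack);
        taskCtx.uc_link = &mainCtx;               // return here when task ends
        makecontext(&taskCtx, taskBody, 0);

        std::puts("main: switching to task");
        swapcontext(&mainCtx, &taskCtx);          // save main, run task
        std::puts("main: back, switching to task again");
        swapcontext(&mainCtx, &taskCtx);          // resume task where it left off
        std::puts("main: done");
        return 0;
    }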
And of course, you also need to look at some of the interrupt and exception handling, how to deal with hardware registers for PCI access, and so on. I'm afraid that's something I look up in books that I don't have links for, and they are currently stacked away in a box after a recent move, so I can't give you exact titles.

C++ Ways to transfer large amounts of data between 32bit applications for video playback

I am aware of the basics of shared memory and inter process communication, but since my application is fairly specific I'm asking this question for general feedback.
I am working on 64 bit machines (MacOS and Win 64), using a 32bit visual coding toolkit. It is not practical to port the toolkit to 64bit at this time so I have memory limitations.
I am working on an application which must be able to scrub (go back and forth based on user input) high quality video at fast speeds. The obvious solutions are:
1 - Keep it all in memory.
2 - Stream from disk.
Putting it all in memory at the moment requires lowering the video quality to an unacceptable point, and streaming from disk causes the scrub to hang while loading.
My current train of thought is to run a master and multiple slave programs. Each slave will load up a segment of the video into ram, and when the master program needs to load a different section of the video it will request this data from the slave and have it transferred over.
My question is, what is an appropriate way to do this?
I suspect shared memory will not allow me to get past the 32bit memory limitations my application currently has. I could do something as simple as pipes, but I was wondering if there is something else that is more suitable.
Ideally this solution would be Mac/Win portable, but since the final solution must reside on a windows box I will opt for windows solutions. Also the easier the better, as I'm not looking to spend weeks in dev time on this.
Thanks in advance.
I'm going to guess you are (or at least can be) using a 64-bit machine with a 64-bit OS, even though it's impractical to port all your code to 64 bits. I'm also assuming that your machine has enough memory available to hold the data you care about -- the real problem is getting access to enough of that memory from 32-bit code.
If that's the case, then I'd look at Windows' Address Windowing Extensions (AWE) functions, such as AllocateUserPhysicalPages and MapUserPhysicalPages. These work quite a bit like file mapping except that when you map data into your address space, it's already in physical memory instead of having to be read from the disk (i.e., the mapping is much faster).
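A minimal sketch of that AWE pattern is below. Error handling is abbreviated, the 64 MB figure is arbitrary, and the process needs the "Lock pages in memory" privilege for AllocateUserPhysicalPages to succeed.

    #include <windows.h>
    #include <iostream>

    int main() {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        const SIZE_T pageSize = si.dwPageSize;

        // Ask for 64 MB of physical pages (arbitrary amount for illustration).
        ULONG_PTR numPages = (64 * 1024 * 1024) / pageSize;
        ULONG_PTR* pfnArray = new ULONG_PTR[numPages];

        if (!AllocateUserPhysicalPages(GetCurrentProcess(), &numPages, pfnArray)) {
            std::cerr << "AllocateUserPhysicalPages failed: " << GetLastError() << "\n";
            return 1;
        }

        // Reserve a window in the 32-bit address space to map pages into.
        void* window = VirtualAlloc(nullptr, numPages * pageSize,
                                    MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);

        // Map the physical pages into the window; remap later to reach other data.
        if (!MapUserPhysicalPages(window, numPages, pfnArray)) {
            std::cerr << "MapUserPhysicalPages failed: " << GetLastError() << "\n";
            return 1;
        }

        // ... use the memory through `window`, then unmap and free ...
        MapUserPhysicalPages(window, numPages, nullptr);   // null PFNs = unmap
        FreeUserPhysicalPages(GetCurrentProcess(), &numPages, pfnArray);
        delete[] pfnArray;
        return 0;
    }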
I would embed or install, depending on your requirements for distribution, one or more instances of Memcached, and have one thread (or more, if necessary) feed blocks from disk into the cache.
Once you have moved your data into memcached, you are pretty much immune to the 32-bit limitations, especially if memcached itself runs as a 64-bit process.
Basically you would in your program read from a socket instead of a file, and memcached would be a fancy file cache.
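As a rough sketch of that socket-instead-of-file idea - this assumes the libmemcached client library and a memcached instance already running on localhost; the key names, payload, and port are illustrative:

    #include <libmemcached/memcached.h>
    #include <cstring>
    #include <cstdio>
    #include <cstdlib>

    // Rough sketch: treat memcached as a fancy file cache. A feeder thread
    // would memcached_set() video segments; the player memcached_get()s them.
    int main() {
        memcached_st* memc = memcached_create(nullptr);
        memcached_server_add(memc, "localhost", 11211);   // assumed host/port

        // Feeder side: store one segment of video under a key such as "seg:42".
        const char* key = "seg:42";
        const char* segment = "...raw frame data...";     // placeholder payload
        memcached_return_t rc = memcached_set(memc, key, std::strlen(key),
                                              segment, std::strlen(segment),
                                              (time_t)0, (uint32_t)0);
        if (rc != MEMCACHED_SUCCESS)
            std::fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

        // Player side: fetch the segment back when the user scrubs to it.
        size_t valueLen = 0;
        uint32_t flags = 0;
        char* value = memcached_get(memc, key, std::strlen(key),
                                    &valueLen, &flags, &rc);
        if (value) {
            std::printf("got %zu bytes back\n", valueLen);
            std::free(value);          // memcached_get allocates with malloc
        }

        memcached_free(memc);
        return 0;
    }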

How to build an application layer pre-fetching system

I'm working in a C/C++ mixed project that has the following situation.
I need to iterate through very small chunks (and occasionally larger ones) in a file, one by one. Ideally, I should only read through them once, consecutively. I think a better solution in this case would be to read a big chunk into a buffer and consume it from there, rather than reading each chunk from the file at the moment I need it.
The problem is, how do I balance the cache size? Is there any well-known algorithm/library that I can take advantage of?
UPDATE: (changed the title)
Thanks for your replies; I understand there are several levels of caching already at work in our boxes, but that is not enough in my case.
I think I missed something important here. I'm actually building an application on top of an existing framework, in which frequently requesting reads from the engine will cost too much for me. (Yes, I believe the engine does take advantage of the OS and disk-level caches.) What I'm trying to do is indeed build an application-level pre-fetching system.
Thoughts?
In general you should try to use what the OS gives you, rather than creating your own cache (because you run the risk of caching twice). On Linux, you can request OS-level read-ahead via readahead(); I don't know what the Windows equivalent would be.
Looking into this some more, there is also a block-level (i.e. disk) parameter, set via blockdev --setra. It's probably not a good idea to change that on your system (unless it is dedicated to just this one task), but if the value there (blockdev --getra) is already larger than your typical chunk size then you may not need to do anything else.
[And just to address the other point mentioned in the question comments - while an OS will cache file data in free memory, I don't believe it will pre-emptively read an otherwise unread file (apart from to meet the read-ahead settings above). But if anyone knows otherwise, please post details...]
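readahead() itself is Linux/glibc-specific; here is a sketch expressing the same intent with posix_fadvise(), which is the more portable hint. The file name and window sizes are placeholders.

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    // POSIX sketch: hint the kernel to pre-fetch the region we will read next,
    // so the data is already in the page cache when we ask for it.
    int main() {
        int fd = open("data.bin", O_RDONLY);       // placeholder file name
        if (fd < 0) { std::perror("open"); return 1; }

        // Tell the kernel we will read sequentially, so it can read ahead more.
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        const off_t kChunk = 1 << 20;              // pre-fetch window: 1 MB
        std::vector<char> buf(64 * 1024);
        off_t pos = 0;
        for (;;) {
            if (pos % kChunk == 0)                 // at each chunk boundary,
                posix_fadvise(fd, pos + kChunk, kChunk, POSIX_FADV_WILLNEED);
            ssize_t n = pread(fd, buf.data(), buf.size(), pos);
            if (n <= 0) break;
            // ... consume n bytes from buf ...
            pos += n;
        }
        close(fd);
        return 0;
    }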
Have you tried mmap()ing the file instead of read()ing from it? In some cases this might be more efficient, in some cases this might not. However it is usually best to let the system optimize for you, since it knows more about the hardware than an application. mmap() will let the system know that you need the whole file, so it might just be more optimal.
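A minimal sketch of that mmap() approach (POSIX; the file name is a placeholder), with an madvise() hint so the kernel knows the access pattern:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    // POSIX sketch: map the whole file and let the OS page it in as needed.
    int main() {
        int fd = open("data.bin", O_RDONLY);        // placeholder file name
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { std::perror("mmap"); return 1; }
        madvise(base, st.st_size, MADV_SEQUENTIAL); // hint: sequential access

        const char* p = static_cast<const char*>(base);
        long long sum = 0;
        for (off_t i = 0; i < st.st_size; ++i)      // touch every byte; the kernel
            sum += p[i];                            // faults pages in behind the scenes
        std::printf("checksum: %lld\n", sum);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }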

C++ cache aware programming

Is there a way in C++ to determine the CPU's cache size? I have an algorithm that processes a lot of data and I'd like to break this data down into chunks such that they fit into the cache. Is this possible?
Can you give me any other hints on programming with cache-size in mind (especially in regard to multithreaded/multicore data processing)?
Thanks!
According to "What every programmer should know about memory", by Ulrich Drepper you can do the following on Linux:
Once we have a formula for the memory requirement we can compare it with the cache size. As mentioned before, the cache might be shared with multiple other cores. Currently {There definitely will sometime soon be a better way!} the only way to get correct information without hardcoding knowledge is through the /sys filesystem. In Table 5.2 we have seen what the kernel publishes about the hardware. A program has to find the directory:
/sys/devices/system/cpu/cpu*/cache
This is listed in Section 6: What Programmers Can Do.
He also describes a short test right under Figure 6.5 which can be used to determine L1D cache size if you can't get it from the OS.
There is one more thing I ran across in his paper: sysconf(_SC_LEVEL2_CACHE_SIZE) is a call you can make on Linux that is supposed to return the L2 cache size, although it doesn't seem to be well documented.
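For example, on Linux with glibc a quick check looks like this. The _SC_LEVEL* constants are glibc extensions, so they may not exist (or may return 0 or -1) on other systems.

    #include <unistd.h>
    #include <cstdio>

    // glibc-specific sketch: query cache sizes via sysconf().
    int main() {
        long l1d_size = sysconf(_SC_LEVEL1_DCACHE_SIZE);
        long l1d_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        long l2_size  = sysconf(_SC_LEVEL2_CACHE_SIZE);

        std::printf("L1d size: %ld bytes, L1d line: %ld bytes, L2 size: %ld bytes\n",
                    l1d_size, l1d_line, l2_size);
        return 0;
    }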
C++ itself doesn't "care" about CPU caches, so there's no support for querying cache sizes built into the language. If you are developing for Windows, there's the GetLogicalProcessorInformation() function, which can be used to query information about the CPU caches.
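A minimal sketch of using it to list the caches (the usual two-call pattern to size the buffer, with error handling abbreviated):

    #include <windows.h>
    #include <cstdio>
    #include <vector>

    // Windows sketch: enumerate CPU caches via GetLogicalProcessorInformation().
    int main() {
        DWORD len = 0;
        GetLogicalProcessorInformation(nullptr, &len);   // ask for required size
        std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
            len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
        if (!GetLogicalProcessorInformation(info.data(), &len)) {
            std::fprintf(stderr, "query failed: %lu\n", GetLastError());
            return 1;
        }
        for (const auto& e : info) {
            if (e.Relationship == RelationCache) {
                std::printf("L%u cache: %lu bytes, line size %u bytes\n",
                            (unsigned)e.Cache.Level,
                            (unsigned long)e.Cache.Size,
                            (unsigned)e.Cache.LineSize);
            }
        }
        return 0;
    }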
Preallocate a large array, then access each element sequentially and record the time for each access. Ideally there will be a jump in access time when a cache miss occurs, and from that you can work out your L1 cache size. It might not work, but it's worth trying.
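Here is a variation on that idea, offered as a rough heuristic rather than a definitive method: instead of timing individual accesses, measure the average access time while growing the working set; the size at which the time jumps approximates the L1 data cache size. The sizes and repetition count are arbitrary.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Rough heuristic: stride through working sets of increasing size and watch
    // for the jump in average access time when the set no longer fits in L1.
    int main() {
        const size_t kMax = 8 * 1024 * 1024;          // sweep up to 8 MB
        std::vector<char> buf(kMax, 1);

        for (size_t ws = 4 * 1024; ws <= kMax; ws *= 2) {
            auto start = std::chrono::steady_clock::now();
            volatile long sum = 0;                     // keep the loop alive
            const int reps = 64;                       // touch each set many times
            for (int r = 0; r < reps; ++r)
                for (size_t i = 0; i < ws; i += 64)    // one access per cache line
                    sum = sum + buf[i];
            auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                          std::chrono::steady_clock::now() - start).count();
            std::printf("%7zu bytes: %.2f ns per access\n",
                        ws, double(ns) / (double(reps) * (ws / 64)));
        }
        return 0;
    }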
Read the CPUID of the CPU (x86) and then determine the cache size from a look-up table. The table has to be filled with the cache sizes that the manufacturer of the CPU publishes in its programming manuals.
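For instance (GCC/Clang on x86), extended CPUID leaf 0x80000006 reports the L2 size and line size directly on most modern AMD and Intel CPUs, which avoids maintaining the full descriptor table - a different approach from the look-up table above, sketched here under that assumption:

    #include <cpuid.h>
    #include <cstdio>

    // x86, GCC/Clang sketch: query L2 cache info from CPUID leaf 0x80000006.
    int main() {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (!__get_cpuid(0x80000006, &eax, &ebx, &ecx, &edx)) {
            std::puts("extended CPUID leaf not supported");
            return 1;
        }
        unsigned l2SizeKB  = (ecx >> 16) & 0xFFFF;   // bits 31:16 - L2 size in KB
        unsigned lineBytes = ecx & 0xFF;             // bits  7:0  - line size
        std::printf("L2: %u KB, line size: %u bytes\n", l2SizeKB, lineBytes);
        return 0;
    }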
Depending on what you're trying to do, you might also leave it to some library. Since you mention multicore processing, you might want to have a look at Intel Threading Building Blocks.
TBB includes cache aware memory allocators. More specifically, check cache_aligned_allocator (in the reference documentation, I couldn't find any direct link).
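Typical usage is just swapping the allocator in, assuming TBB is installed (the element type and count are arbitrary):

    #include <tbb/cache_aligned_allocator.h>
    #include <vector>

    int main() {
        // Each allocation starts on its own cache line, which helps avoid false
        // sharing between data used by different threads.
        std::vector<float, tbb::cache_aligned_allocator<float>> samples(1024, 0.0f);
        samples[0] = 1.0f;
        return static_cast<int>(samples[0]);
    }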
Interestingly enough, I wrote a program to do this a while ago (in C though, but I'm sure it will be easy to incorporate into C++ code).
http://github.com/wowus/CacheLineDetection/blob/master/Cache%20Line%20Detection/cache.c
The get_cache_line function is the interesting one: it returns the location right before the biggest spike in the timing data of the array accesses. It guessed correctly on my machine! If nothing else, it can help you make your own.
It's based on this article, which originally piqued my interest: http://igoro.com/archive/gallery-of-processor-cache-effects/
You can see this thread: http://software.intel.com/en-us/forums/topic/296674
The short answer is in this other thread:
On modern IA-32 hardware, the cache line size is 64. The value 128 is a legacy of the Intel Netburst Microarchitecture (e.g. Intel Pentium D) where 64-byte lines are paired into 128-byte sectors. When a line in a sector is fetched, the hardware automatically fetches the other line in the sector too. So from a false sharing perspective, the effective line size is 128 bytes on the Netburst processors. (http://software.intel.com/en-us/forums/topic/292721)
IIRC, GCC has a __builtin_prefetch hint.
http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html
has an excellent section on this. Basically, it suggests:
__builtin_prefetch (&array[i + LookAhead], rw, locality);
where rw is a 0 (prepare for read) or 1 (prepare for a write) value, and locality uses the number 0-3, where zero is no locality, and 3 is very strong locality.
Both are optional. LookAhead would be the number of elements to look ahead to. If memory access were 100 cycles, and the unrolled loops are two cycles apart, LookAhead could be set to 50 or 51.
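Put together, a prefetching loop might look like the sketch below. The look-ahead distance of 16 is just an illustrative value, not a tuned one, and __builtin_prefetch is a GCC/Clang extension.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // GCC/Clang sketch: prefetch future elements while summing the current ones.
    double sumWithPrefetch(const double* array, std::size_t n) {
        constexpr std::size_t kLookAhead = 16;   // illustrative distance
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + kLookAhead < n)
                __builtin_prefetch(&array[i + kLookAhead], /*rw=*/0, /*locality=*/1);
            sum += array[i];
        }
        return sum;
    }

    int main() {
        std::vector<double> v(1 << 20, 1.0);
        std::printf("%f\n", sumWithPrefetch(v.data(), v.size()));
        return 0;
    }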
There are two cases that need to be distinguished. Do you need to know the cache sizes at compile time or at runtime?
Determining the cache-size at compile-time
For some applications, you know the exact architecture that your code will run on, for example, if you can compile the code directly on the host machine. In that case, simply looking up the size and hard-coding it is an option (this could be automated in the build system). On most machines today, the L1 cache line should be 64 bytes.
If you want to avoid that complexity or if you need to support compilation on unknown architectures, you can use the C++17 feature std::hardware_constructive_interference_size as a good fallback. It will provide a compile-time estimation for the cache line, but be aware of its limitations. Note that the compiler cannot guess perfectly when it creates the binary, as the size of the cache-line is, in general, architecture dependent.
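For example, a small sketch of querying it at compile time; this requires a C++17 standard library that actually defines the constants, which is why the feature-test macro guard and the 64-byte fallback are there:

    #include <new>
    #include <cstddef>
    #include <cstdio>

    // C++17 sketch: use the compile-time cache line estimate, with a fallback
    // for standard libraries that do not provide it.
    int main() {
    #ifdef __cpp_lib_hardware_interference_size
        constexpr std::size_t line = std::hardware_constructive_interference_size;
    #else
        constexpr std::size_t line = 64;   // common L1 line size fallback
    #endif
        std::printf("assumed cache line size: %zu bytes\n", line);
        return 0;
    }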
Determining the cache-size at runtime
At runtime, you have the advantage that you know the exact machine, but you will need platform specific code to read the information from the OS. A good starting point is the code snippet from this answer, which supports the major platforms (Windows, Linux, MacOS). In a similar fashion, you can also read the L2 cache size at runtime.
I would advise against trying to guess the cache line by running benchmarks at startup and measuring which one performed best. It might well work, but it is also error-prone if the CPU is used by other processes.
Combining both approaches
If you have to ship one binary and the machines that it will later run on feature a range of different architectures with varying cache sizes, you could create specialized code paths for each cache size, and then dynamically (at application startup) choose the best-fitting one.
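A very small sketch of that pattern - the two kernel variants, the threshold, and the "detected" size are placeholders for whatever your application actually specializes on:

    #include <cstddef>
    #include <cstdio>

    // Sketch: pick a blocking/tiling variant once at startup, based on the
    // cache size detected at runtime.
    void processSmallCache(const float*, std::size_t) { std::puts("small-cache path"); }
    void processLargeCache(const float*, std::size_t) { std::puts("large-cache path"); }

    using ProcessFn = void (*)(const float*, std::size_t);

    ProcessFn chooseKernel(std::size_t l2Bytes) {
        return (l2Bytes >= 1024 * 1024) ? processLargeCache : processSmallCache;
    }

    int main() {
        std::size_t detectedL2 = 512 * 1024;   // pretend this came from the OS query
        ProcessFn process = chooseKernel(detectedL2);
        process(nullptr, 0);
        return 0;
    }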
The cache will usually do the right thing. The only real worry for a normal programmer is false sharing, and you can't take care of that at runtime because it requires layout decisions (padding and alignment) made at compile time.

Threading on bootloader

Where can I find resources/tutorials on how to implement threads in an x86 architecture bootloader... let's say I want to load resources in the background while displaying a progress bar...
That is a very unusual question...so allow me to provide my opinion on it...
Bootloaders are really a limited bunch of assembly code: 446 bytes of code to be exact, 64 bytes for partition information, and a final two bytes for the magic marker to indicate the end of the boot loader - that is 512 bytes in total.
Bootloaders such as GRUB get around this limitation by implementing a two-phase bootloader: the first phase is the 512 bytes mentioned above, which then loads a second phase in which further options etc. are handled.
Generally, the bootloader code is 16-bit assembly, because the original BIOS code is 16-bit and real mode is what every processor from the 386 up to the modern processors of today boots up in.
With a two-phase bootloader, the first 512 bytes are 16-bit code; the second phase then switches the processor into 32-bit mode, setting up the registers and gate selectors in preparation, and in turn jumps to the entry code of the actual program that does the booting up - taking into account that it has to read from a specific location on disk, or read a configuration file which says where the boot code is stored.
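For reference, the classic first-sector layout described above can be written down as a struct - purely a descriptive sketch, not something a real first-phase bootloader would compile this way:

    #include <cstdint>

    // Descriptive sketch of the classic 512-byte MBR sector layout.
    #pragma pack(push, 1)
    struct MasterBootRecord {
        uint8_t  bootCode[446];        // first-phase loader code
        uint8_t  partitionTable[64];   // four 16-byte partition entries
        uint16_t signature;            // 0xAA55 magic marker
    };
    #pragma pack(pop)
    static_assert(sizeof(MasterBootRecord) == 512, "MBR must be exactly one sector");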
Implementing threads in 32-bit mode is something that will be tricky, as you will have to create some kind of scheduler in assembly (since you mentioned implementing threads in an x86 architecture bootloader).
You may get around this by implementing the second phase of the bootloader in C (but the tricky bit is that no standard libraries can be used, as the runtime environment has not been set up yet!).
You may be better off using GRUB, or check out the open-source BIOS bootloaders here. Nowadays BIOSes are flashable, so you may be able to get an EFI (Extensible Firmware Interface) here, which is a pure 32-bit BIOS - this will depend on your processor. There is also another website here which might provide further info.
The progress bar you see at boot is, unfortunately, written in C/C++ at the kernel level, after the boot-up procedure is complete - by then the 32-bit environment, the task scheduler, threading, and the virtual memory manager are already set up. At that point a process exists in which a background thread runs, tracking hardware detection and further environment setup, and the progress bar is simply a way of telling the user "wait, the system is loading".
This book might help you somewhat -- it describes various aspects of the Linux kernel, including initialization. You might also want to look at GRUB; it's pretty standard across UNIX flavours.
The book I mentioned should be your resource of choice: the kernel doesn't consider the machine thread-capable until quite late in the initialization cycle, and by this I mean that setting up the data structures for threading is well documented there.
Although I can't think of any real benefit of allowing threading constructs in a boot loader -- firstly, it's simpler to set up your basic hardware using single-threaded procedural code, and secondly, you expect the code to be bullet-proof, so threading as a defence mechanism isn't needed.
So I'd expect you're looking at emulating a progress bar :D