Where do memory mapped I/O addresses come from? - osdev

I am messing around with some hobbyist OS development, and I am a little confused about memory mapped I/O addresses. I understand the memory mapped I/O concept, but I am trying to figure out how developers get the addresses they use to manipulate hardware.
Are the addresses specified by the hardware vendors, or are they some sort of standard addresses for all computers? For example, VGA memory for text printing starts at address 0xB8000. Is that standard for every x86 machine? If so, who set that standard? And if I wanted to talk to an Ethernet card, for example, how would I know the addresses or ports it uses for communication?
Thanks in advance.

I'm not 100% sure about who sets the addresses, but as far as I'm aware, hardware vendors can set their memory map however they want.
For what it's worth, Linux lets you see how memory is currently mapped on your machine by doing cat /proc/iomem:
00000000-0000ffff : reserved
00010000-0009f3ff : System RAM
0009f400-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000ca000-000cbfff : reserved
000ca000-000cafff : Adapter ROM
000cc000-000cffff : PCI Bus 0000:00
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000fffff : reserved
000f0000-000fffff : System ROM
00100000-3fedffff : System RAM
01000000-01536143 : Kernel code
01536144-017c007f : Kernel data
01875000-0194bfff : Kernel bss
3fee0000-3fefefff : ACPI Tables
....

You get the addresses through a hardware detection mechanism such as PCI bus scanning, USB enumeration, or ACPI.
For example, if you find a supported display card on the PCI bus, you query its BARs (base address registers) and get back a physical memory address, and/or an I/O port base, and/or an IRQ number. The same goes for a NIC or any other card.
For things that are not on any bus, e.g. the PS/2 controller, detection is a lot more difficult and involves parsing the ACPI tables.
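As a concrete illustration of the PCI route, here is a minimal sketch (not a full bus scanner) that reads a device's BAR0 using the legacy x86 configuration mechanism #1 on ports 0xCF8/0xCFC. It assumes ring 0 (or iopl(3) under Linux) and GCC-style inline assembly; the helper names are my own.

/* Read a dword from PCI configuration space via the legacy 0xCF8/0xCFC ports. */
#include <stdint.h>

static inline void outl(uint16_t port, uint32_t val)
{
    __asm__ volatile ("outl %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint32_t inl(uint16_t port)
{
    uint32_t val;
    __asm__ volatile ("inl %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

static uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off)
{
    uint32_t addr = (1u << 31)              /* enable bit           */
                  | ((uint32_t)bus  << 16)
                  | ((uint32_t)dev  << 11)
                  | ((uint32_t)func << 8)
                  | (off & 0xFC);           /* dword-aligned offset */
    outl(0xCF8, addr);
    return inl(0xCFC);
}

/* BAR0 lives at configuration offset 0x10. Bit 0 tells you whether the BAR
 * describes an I/O port range (1) or a memory-mapped range (0). */
static uint32_t pci_read_bar0(uint8_t bus, uint8_t dev, uint8_t func)
{
    return pci_cfg_read32(bus, dev, func, 0x10);
}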

In computer architecture, I/O devices are mapped either into a separate I/O address space (I/O-mapped, or port-mapped, I/O) or into the memory address space (memory-mapped I/O).
I/O Mapped I/O:
The processor distinguishes between memory and I/O devices: devices are mapped into a dedicated I/O address space, which is considerably smaller than the memory address space.
For example, if the I/O address space uses 8-bit addresses, there are 2^8 = 256 addresses.
That gives the possibility of connecting 256 I/O device registers to the system.
The processor uses separate control signals for memory and I/O, four in total:
MR (memory read), MW (memory write), IOR (input-output read), IOW
(input-output write)
Usage: home computers, small offices...
Memory Mapped I/O:
The processor does not differentiate between memory and I/O devices: device registers are assigned addresses within the ordinary memory address space.
For example, consider a system with a 16-bit address bus: 2^16 = 65,536 (64K) addresses,
i.e., up to 65,536 locations can be shared between memory and I/O device registers.
Microcontrollers typically use this scheme, managing both memory and devices with just two control signals:
RD (read) and WR (write)
Usage: industrial applications
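To make the difference concrete, here is a small C sketch contrasting the two styles on x86, assuming ring 0 and GCC inline assembly; 0x3F8 (the COM1 data register) and 0xB8000 (VGA text memory) are the traditional PC addresses.

#include <stdint.h>

/* Port-mapped I/O: a separate address space, reached with in/out instructions. */
static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

void serial_putc(char c)
{
    outb(0x3F8, (uint8_t)c);        /* COM1 data register */
}

/* Memory-mapped I/O: the device appears at an ordinary address, accessed
 * through a volatile pointer so the compiler cannot optimize the writes away. */
void vga_putc_at(int row, int col, char c)
{
    volatile uint16_t *vga = (volatile uint16_t *)0xB8000;
    vga[row * 80 + col] = (uint16_t)(uint8_t)c | (0x07 << 8);   /* grey on black */
}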

Related

Implementing user-space network card "bus-mastering" in C++ on Linux

I am interested in accessing network packets via "bus-mastering" in a C++ application on Linux. I have a few questions relating to this overall topic:
1) How will I know which memory address range the "bus-mastering"-enabled Network card is writing the data to and would this be kernel or user space?
2) If the answer to #1 is "Kernel space", how could I change the card so that it writes to memory in user space?
3a) How can I access this particular user-space memory area from C++?
3b) I understand you cannot just start accessing memory areas of other processes from one application, only those explicitly "shared"- so how do I ensure the memory area written to directly by the network card is explicitly for sharing?
4) How do I know whether a network card implements "bus-mastering"?
I have come across the term PACKET_MMAP - is this going to be what I need?
If you mmap a region of memory, and give the address of that to the OS, the OS can lock that region (so that it doesn't become swapped out) and get the physical address of the memory.
It is not used for exactly that purpose, but the code in drivers/xen/privcmd.c gives an example: the function mmap_mfn_range is called from privcmd_ioctl_mmap (indirectly, via traverse_map), and this in turn calls remap_area_mfn_pte_fn from xen_remap_domain_mfn_range.
So, if you do something along those lines in the driver, such that the pages are locked into memory and belong to the application, you can program the physical address(es) of the mmap'd region into the network card's hardware and get the data delivered directly into the user-mode memory that was mmap'd by the user code.
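For the user-space half of that scheme, the idea would look roughly like this; /dev/mycard and the buffer size are hypothetical and stand in for whatever your driver actually exposes.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define RING_SIZE (1 << 20)   /* 1 MiB, assumed to match the driver */

int main(void)
{
    int fd = open("/dev/mycard", O_RDWR);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* The driver's mmap handler decides which physical pages back this range. */
    void *ring = mmap(NULL, RING_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return EXIT_FAILURE;
    }

    /* ... poll the ring for packets written by the NIC via bus mastering ... */

    munmap(ring, RING_SIZE);
    close(fd);
    return EXIT_SUCCESS;
}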

glMapBuffer() and glBuffers, how does the access with a (void*) work with hardware?

Reading through the OpenGL Programming Guide, 8th Edition.
This is really a hardware question, actually...
I come to a section on OpenGL buffers, and as far as I understand they are memory spaces allocated in graphics card memory, is this correct?
If so, how are we able to get a pointer to read or modify that memory using glMapBuffer() ? As far as I was aware, all possible memory addresses (eg on a 64bit system there are uint64_t num = 0x0; num = ~num; possible addresses) were used for system memory as in RAM / CPU side Memory.
glMapBuffer() returns a void* to some memory. How can that pointer point to memory inside the graphics card? Particularly if I had a 32-bit system, more than 4GB of RAM, and then a graphics card with say 2GB/4GB of memory. Surely there aren't enough addresses?!
This is really a hardware question, actually...
No it's not. You'll see why in a moment.
I come to a section on OpenGL buffers, and as far as I understand they are memory spaces allocated in graphics card memory, is this correct?
Not quite. You must understand that while OpenGL gets you really close to the actual hardware, you're still very far from touching it directly. What glMapBuffer does is set up a virtual address range mapping. On modern computer systems the software doesn't operate on physical addresses. Instead a virtual address space (of some size) is used. This virtual address space looks like one large contiguous block of memory to the software, while in fact it's backed by a patchwork of physical pages. Those pages can be implemented in any way: they can be actual physical memory, they can be I/O memory, they can even be created on the fly by another program. The mechanism for that is provided by the CPU's memory management unit (MMU) in collaboration with the OS.
So for each process the OS manages a table of which part of the process's virtual address space maps to which page handler. If you're running Linux, have a look at /proc/$PID/maps. If you have a program that uses glMapBuffer, read /proc/self/maps from within your program (don't shell out with system()) before and after mapping the buffer and look for the differences.
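A tiny helper along the lines of that suggestion; call it before and after glMapBuffer and diff the output to see the new mapping appear (Linux only, the function name is my own).

#include <stdio.h>

void dump_maps(const char *tag)
{
    char line[512];
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return;

    printf("---- /proc/self/maps (%s) ----\n", tag);
    while (fgets(line, sizeof line, f) != NULL)
        fputs(line, stdout);
    fclose(f);
}

/* Usage (assuming a current OpenGL context and a bound buffer object):
 *     dump_maps("before");
 *     void *p = glMapBuffer(GL_ARRAY_BUFFER, GL_READ_WRITE);
 *     dump_maps("after");
 */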
As far as I was aware, all possible memory addresses (eg on a 64bit system there are uint64_t num = 0x0; num = ~num; possible addresses) were used for system memory as in RAM / CPU side Memory.
What makes you think that? Whoever told you that (if somebody told you that) should be slapped in the face… hard.
What you have is a virtual address space. And this address space is completely different from the physical address space on the hardware side. In fact the size of the virtual address space and the size of the physical address space can differ greatly. For example, for a long time there were 32-bit CPUs and 32-bit operating systems around. But even then it was desirable to have more than 4 GiB of system memory. So while the CPU would support only 32 bits of address space for a process (the maximum size of a pointer), it may have provided 36 bits of physical address lines to memory, to support some 64 GiB of system RAM; it would then be the OS's job to switch those extra bits, so that while each process sees only some 3 GiB of RAM at most, processes in total could use more. A technique like that became known as Physical Address Extension (PAE).
Furthermore, not all of the address space in a process is backed by RAM. As I already explained, address space mappings could be backed by anything. Often the page-fault handler will also implement swapping, i.e. if there's not enough free RAM around it will use HDD storage (in fact on Linux all userspace requests for memory are backed by the disk I/O cache handler). Also, since the address space mappings are per process, some part of the address space is mapped kernel memory, which is (physically) the same for all processes and also resides at the same place in all of them. From user space this mapping is not accessible, but as soon as a syscall makes a transition into kernel space it becomes accessible; yes, the OS kernel uses virtual memory internally, too. It just can't choose as broadly from the available backings (for example, it would be very difficult for a network driver to operate if its memory were backed by the network itself).
Anyway: on modern 64-bit systems you get a 64-bit pointer size, while current hardware implements 48 address bits. That leaves plenty of space, namely 2^16 copies of a 48-bit address space, for virtual mappings where there's no RAM around at all. And because there's so much to go around, each and every PCI card gets its very own address range that behaves a little bit like RAM to the CPU (remember the PAE I mentioned earlier; well, in the good old 32-bit times something like that had to be done to talk to extension cards already).
Now here comes the OpenGL driver. It simply provides a new address-mapping handler, usually built on top of the PCI address space handler, which maps a portion of the process's virtual address space. Whatever happens in that address space is reflected by that mapping handler into a buffer that is ultimately accessed by the GPU. However, the GPU itself may also be accessing CPU memory directly. And what AMD plans is for GPU and CPU to live on the same die and access the same memory, so that there's no longer a physical distinction there.
glMapBuffer() returns a pointer in the virtual memory space of the application; that's why the pointer can point to something above 4GB on a 64-bit system.
The memory you manipulate through the mapped pointer could be a CPU-side copy (shadow) of the texture (or buffer) allocated on the GPU, or it could be the actual buffer moved to system memory. It's usually the operating system and driver that decide whether a buffer resides in system memory or GPU memory; they can move it from one location to the other and can make a shadow copy of it.
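For reference, the usual usage pattern looks something like this sketch; it assumes a current OpenGL context and an extension loader (e.g. GLEW) already initialized, and upload_vertices is just an illustrative name.

#include <GL/glew.h>
#include <string.h>

GLuint upload_vertices(const float *vertices, size_t bytes)
{
    GLuint buf;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_ARRAY_BUFFER, buf);

    /* Reserve storage; the driver decides where it physically lives. */
    glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STATIC_DRAW);

    /* The returned pointer is a virtual address in this process; the pages
     * behind it are whatever the driver chose to map there. */
    void *p = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (p != NULL) {
        memcpy(p, vertices, bytes);
        glUnmapBuffer(GL_ARRAY_BUFFER);   /* the mapping is only valid until here */
    }
    return buf;
}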
Computers have multiple layers of memory mapping. First, all physically addressable components (including RAM, PCI memory windows, other device registers, etc.) are assigned a physical address. The size of this address varies. Even on 32-bit Intel devices the physical address might be 36 bits wide (to the CPU) while it is a 64-bit address in the PCI memory space.
On top of that mapping is a virtual mapping. There are many different virtual mappings, including one for each process. Within that space (which on a 64-bit system is as huge as you say) any combination (including repeats!) of physical space can be mapped as long as it fits. On a 64-bit system the available virtual space is so large that every possible physical device could easily be mapped.

How can I share user space static memory with a PCI device?

Context
Hi, I'm porting an ancient 1977 flight simulator program from a SEL computer to a Windows 7 x64 PC system. The program is 500,000 lines of Fortran, with a large /common/ memory block that is accessed across all modules. This memory is allocated statically.
Additionally, and there my problems begin, there is also a hardware device, that used to access the /common/ block using DMA. We've successfully ported the hardware device to a FPGA PCI device, written a device driver for it and DMA works well.
The problem:
I want to share the static memory of the Fortran application with the PCI device.
Possible solutions
Things I have considered:
Allocate memory in the driver and re-map the user space Fortran common block to that area.
Lock the user space common block in physical memory and tell the PCI device where to read/write.
My preference would be the first option, because that will avoid lifetime issues. So far I haven't found an acceptable solution. Any tips you could share with me?
Henk.
Note: we have full control over hardware and driver, since we built it ourselves, so exotic ideas might help too...
For those who wish to know: eventually I found no good solution for this and worked around it with read-copy-modify-write operations. Quite expensive, but since the original program is 45 years old, we had some CPU power to spare :-)
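For anyone facing the same problem, here is a rough sketch of what the driver side of the first option could look like in a WDM-style driver: allocate contiguous non-paged memory, map it into the calling process, and hand the physical address to the FPGA. It does not solve the harder part, relocating the statically allocated Fortran /common/ block into that mapping, which is why the workaround above ended up being used. The SHARED_BLOCK/MapSharedBlock names are hypothetical, and the routine is assumed to run in the context of the requesting user process (e.g. from an IOCTL handler).

#include <ntddk.h>

typedef struct _SHARED_BLOCK {
    PVOID            KernelVa;   /* kernel-mode view of the buffer       */
    PVOID            UserVa;     /* user-mode view in the calling process */
    PMDL             Mdl;        /* describes the pages for mapping      */
    PHYSICAL_ADDRESS PhysAddr;   /* what the FPGA gets programmed with   */
    SIZE_T           Size;
} SHARED_BLOCK;

NTSTATUS MapSharedBlock(SHARED_BLOCK *Blk, SIZE_T Size)
{
    PHYSICAL_ADDRESS low, high, skip;
    low.QuadPart  = 0;
    high.QuadPart = 0xFFFFFFFF;   /* assume a 32-bit-capable DMA engine */
    skip.QuadPart = 0;

    Blk->Size     = Size;
    Blk->KernelVa = MmAllocateContiguousMemorySpecifyCache(Size, low, high, skip,
                                                           MmCached);
    if (Blk->KernelVa == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    Blk->PhysAddr = MmGetPhysicalAddress(Blk->KernelVa);

    Blk->Mdl = IoAllocateMdl(Blk->KernelVa, (ULONG)Size, FALSE, FALSE, NULL);
    if (Blk->Mdl == NULL) {
        MmFreeContiguousMemory(Blk->KernelVa);
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    MmBuildMdlForNonPagedPool(Blk->Mdl);

    /* Map the same pages into the calling process; the application would then
     * have to work in this view (or copy to/from the static /common/ block). */
    __try {
        Blk->UserVa = MmMapLockedPagesSpecifyCache(Blk->Mdl, UserMode, MmCached,
                                                   NULL, FALSE, NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(Blk->Mdl);
        MmFreeContiguousMemory(Blk->KernelVa);
        return GetExceptionCode();
    }
    return STATUS_SUCCESS;
}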

libpcap to capture 10 Gbps NIC

I want to capture packets from 10Gbps network card with 0 packet loss.
I am using libpcap for a 100Mbps NIC and it is working fine.
Will libpcap be able to handle 10Gbps NIC traffic?
If not, what are the other alternative ways to achieve this?
Whether or not libpcap will handle 10Gbps with 0 packet loss is a matter of the machine that you are using and libpcap version. If the machine, CPU and HDD I/O are fast enough, you may get 0 packet loss. Otherwise you may need to perform the following actions:
Update your libpcap to the most recent version. Libpcap 1.0.0 or later supports a zero-copy (memory-mapped) mechanism. It means that there is a buffer that's in both the kernel's address space and the application's address space, so that data doesn't need to be copied from a kernel-mode buffer to a user-mode buffer. Packets are still copied from the skbuff (Linux) into the shared buffer, so it's really more like "one-copy", but that's still one fewer copy, so it can reduce the CPU time required to receive captured packets. Moreover, more packets can be fetched from the buffer per application wake-up.
If you observe a high CPU utilization, it is probably your CPU that cannot handle the packet arrival rate. You can use xosview (a system load visualization tool) to check your system resources during the capture.
If the CPU drops packets, you can use PF_RING. PF_RING is an extension of libpcap with a circular buffer: http://www.ntop.org/products/pf_ring/. It is way faster and can capture with 10Gbps with commodity NICs http://www.ntop.org/products/pf_ring/hardware-packet-filtering/.
Another approach is to get a NIC that has an on-board memory and a specific HW design for packet capturing, see http://en.wikipedia.org/wiki/DAG_Technology.
If the CPU is no longer your problem, you need to test disk data transfer speed. hdparm is the simplest tool on Linux. Some distros come with a GUI; otherwise:
$ sudo hdparm -tT /dev/hda
If you are developing your own application based on libpcap:
Use pcap_stats to identify (a) the number of packets dropped because there was no room in the operating system's buffer when they arrived, i.e. because packets weren't being read fast enough, and (b) the number of packets dropped by the network interface or its driver.
a) Libpcap 1.0.0 has an API that lets an application set the buffer size, on platforms where the buffer size can be set (see the sketch below).
b) If you find it hard to set the buffer, you can use Libpcap 1.1.0 or later, in which the default capture buffer size has been increased from 32K to 512K.
c) If you are just using tcpdump, use 4.0.0 or later and use the -B flag to set the size of the buffer.
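A minimal sketch putting pcap_set_buffer_size and pcap_stats together; the interface name "eth0" and the 64 MiB buffer size are arbitrary choices, and libpcap 1.0.0 or later is assumed.

#include <pcap/pcap.h>
#include <stdio.h>

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];

    pcap_t *p = pcap_create("eth0", errbuf);
    if (p == NULL) {
        fprintf(stderr, "pcap_create: %s\n", errbuf);
        return 1;
    }

    pcap_set_snaplen(p, 65535);
    pcap_set_promisc(p, 1);
    pcap_set_timeout(p, 1000);                 /* milliseconds */
    pcap_set_buffer_size(p, 64 * 1024 * 1024); /* needs libpcap >= 1.0.0 */

    if (pcap_activate(p) < 0) {
        fprintf(stderr, "pcap_activate: %s\n", pcap_geterr(p));
        return 1;
    }

    /* ... capture loop (pcap_dispatch / pcap_loop) would go here ... */

    struct pcap_stat st;
    if (pcap_stats(p, &st) == 0)
        printf("received %u, dropped by OS %u, dropped by NIC/driver %u\n",
               st.ps_recv, st.ps_drop, st.ps_ifdrop);

    pcap_close(p);
    return 0;
}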
You don't say which operating system or CPU. It doesn't matter whether you choose libpcap or not, the underlying network performance is still burdened by the operating system's memory management and its network driver. libpcap has kept pace and can handle 10Gbps, but there's more.
If you want the best CPU so that you can do number-crunching and run virtual machines while capturing packets, go with an AMD Opteron CPU, which still outperforms the Intel Xeon quad-core 5540 2.53GHz (despite Intel's XIO/DDIO introduction, and mostly because of Intel's dual-core sharing of the same L2 cache). For the best ready-made OS, go with the latest FreeBSD as-is (which still outperforms Linux 3.10 networking on basic hardware). Otherwise, Intel and Linux will work just fine for basic drop-free 10Gbps capture, provided you are eager to roll up your sleeves.
If you're pushing for breakneck speed all the time while doing financial-like or stochastic or large matrix predictive computational crunching (or something), then read-on...
As RedHat has found, 67.2 nanoseconds is what it takes to process one minimal-sized packet at a 10Gbps rate. I assert it's closer to 81.6 nanoseconds for a 64-byte Ethernet payload, but they are talking about the 46-byte theoretical minimum.
To cut it short, you WON'T be able to do or use any of the following if you want 0% packet drop at full rate, because you have to stay under 81.6 ns for each packet:
Making an SKB call for each packet (to minimize that overhead, amortize it over several hundreds of packets)
TLB misses (translation lookaside buffer; to avoid them, use HUGE page allocations, as in the sketch after this list)
Low latency (you did say 'capture', so latency is irrelevant here); use interrupt coalescing instead
(ethtool -C <iface> rx-frames 1024+)
Floating processes across multiple CPUs (you must pin them down, one per network interface interrupt)
libc malloc() (you must replace it with a faster, preferably hugepage-based, allocator)
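A sketch of the hugepage point from the list: back a packet buffer with 2 MiB huge pages so buffer accesses burn far fewer TLB entries. It assumes Linux with huge pages reserved (e.g. via /proc/sys/vm/nr_hugepages) and falls back to normal pages if MAP_HUGETLB fails; the buffer size is arbitrary.

#define _GNU_SOURCE
#include <sys/mman.h>

#define BUF_SIZE (256u * 1024 * 1024)   /* arbitrary 256 MiB ring */

void *alloc_packet_ring(void)
{
    void *p = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* No huge pages reserved: fall back, at the cost of more TLB misses. */
        p = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p == MAP_FAILED ? NULL : p;
}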
So, Linux has an edge over FreeBSD for capturing at the 10Gbps rate with a 0% drop rate AND running several virtual machines (and other overheads). It just requires a new memory management (MM) scheme of some sort for the specific network device, not necessarily for the whole operating system. Most new super-high-performance network drivers now have devices use HUGE memory pages allocated in userland, and then use driver calls to pass a bundle of packets at a time.
Many new network drivers with repurposed MM are out (in no particular order):
netmap
PF-RING
PF-RING+netmap
OpenOnload
DPDK
PacketShader
The maturity level of each code is highly dependent on which Linux (or distro) version you choose. I've tried a few of them and once I understood the basic design, it became apparent what I needed. YMMV.
Updated: White paper on high speed packet architecture: https://arxiv.org/pdf/1901.10664.pdf
Good luck.
PF_RING is a good solution; an alternative can be netsniff-ng (http://netsniff-ng.org/).
For both projects the performance gain comes from zero-copy mechanisms. Obviously, the bottleneck can then be the hard disk and its data transfer rate.
If you have the time, then move to Intel DPDK. It allows zero-copy access to the NIC's hardware registers. I was able to achieve 0% drops at 10Gbps, 1.5Mpps, on a single core.
You'll be better off in the long run.

What real platforms map hardware ports to memory addresses?

I sometimes see statements that on some platforms the following C or C++ code:
int* ptr;
*ptr = 0;
can result in writing to a hardware input-output port if ptr happens to store the address to which that port is mapped. Usually they are called "embedded platforms".
What are real examples of such platforms?
Most systems in my experience use memory-mapped I/O. The x86 platform has a separate, non-memory-mapped I/O address space (that uses the in/out family of processor op-codes), but the PC architecture also extensively uses the standard memory address space for device I/O, which has a larger address space, faster access (generally), and easier programming (generally).
I think that the separate I/O address space was used initially because the memory address space of processors was sometimes quite limited and it made little sense to use a portion of it for device access. Once the memory address space was opened up to megabytes or more, that reason to separate I/O addresses from memory addresses became less important.
I'm not sure how many processors provide a separate I/O address space like the x86 does. As an indication of how the separate I/O address space has fallen out of favor, when the x86 architecture moved into the 32-bit realm, nothing was done to increase the I/O address space beyond 64KB (though the ability to move 32-bit chunks of data in one instruction was added). When x86 moved into the 64-bit realm, the I/O address space remained at 64KB and they didn't even add the ability to move data in 64-bit units...
Also note that modern desktop and server platforms (or other systems that use virtual memory) generally don't permit an application to access I/O ports, whether they're memory-mapped or not. That access is restricted to device drivers, and even device drivers will have some OS interface to deal with virtual memory mappings of the physical address and/or to set up DMA access.
On smaller systems, like embedded systems, I/O addresses are often accessed directly by the application. For systems that use memory-mapped addresses, that will usually be done by simply setting a pointer with the physical address of the device's I/O port and using that pointer like any other. However, to ensure that the access occurs and occurs in the right order, the pointer must be declared as pointing to a volatile object.
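A typical embedded-style sketch of that pattern: treat a block of device registers as a volatile struct at a fixed physical address. The address 0x40001000 and the register layout are invented for illustration; the real values come from the specific microcontroller's datasheet.

#include <stdint.h>

typedef struct {
    volatile uint32_t DATA;     /* write: transmit a byte            */
    volatile uint32_t STATUS;   /* bit 0: transmitter busy (assumed) */
    volatile uint32_t CTRL;     /* bit 0: enable (assumed)           */
} uart_regs_t;

#define UART0 ((uart_regs_t *)0x40001000u)

void uart_putc(char c)
{
    while (UART0->STATUS & 1u)      /* spin until the transmitter is free    */
        ;
    UART0->DATA = (uint8_t)c;       /* volatile access: never optimized away */
}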
To access a device that uses something other than a memory-mapped I/O port (like the x86's I/O address space), a compiler will generally provide an extension that allows you to read or write to that address space. In the absence of such an extension, you'd need to call an assembly language function to perform the I/O.
This is called Memory-mapped I/O, and a good place to start is the Wikipedia article.
Modern operating systems usually protect you from this unless you're writing drivers, but this technique is relevant even on PC architectures. Remember the DOS 640Kb limit? That's because the memory addresses from 640K to 1Mb were reserved for video memory, adapter ROMs, and other device I/O.
PlayStation. That was how we got some direct optimized access to low-level graphics (and other) features of the system.
An NDIS driver on Windows is an example. This is called memory mapped I/O and the benefit of this is performance.
See embedded systems for examples of devices that use memory-mapped I/O, e.g. routers, ADSL modems, microcontrollers, etc.
It is mostly used when writing drivers, since most peripheral devices communicate with the main CPU through memory mapped registers.
Motorola 68k series and PowerPC are the big ones.
You can do this in modern Windows (and I'm pretty sure Linux offers it too). It's called memory mapped files. You can load a file into memory on Windows and then write/alter it just by manipulating pointers.
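On Windows that is the CreateFileMapping/MapViewOfFile API (the Linux counterpart is mmap). A small sketch, with an arbitrary file name:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE file = CreateFileA("example.bin", GENERIC_READ | GENERIC_WRITE, 0,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) {
        printf("CreateFileA failed: %lu\n", GetLastError());
        return 1;
    }

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE, 0, 0, NULL);
    if (mapping == NULL) {
        CloseHandle(file);
        return 1;
    }

    unsigned char *view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    if (view != NULL) {
        view[0] ^= 0xFF;                 /* alter the file through the pointer */
        UnmapViewOfFile(view);
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}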