TrustZone of the Cortex-M23/33 vs. TrustZone of the Cortex-A

What is the difference between the TrustZone of Cortex M23/33 and the TrustZone of Cortex A? May I start to prototype my Cortex M23 application on a Cortex A processor and then migrate to Cortex M23 when chips with this core are available?

Disclaimer: I am not a TrustZone expert; I have read some articles and experimented a bit with Arm Trusted Firmware on an Armv8-A processor in AArch64 state, at the EL3/EL2 exception levels. According to this link, they seem very different:
Cortex-A processors use the SMC instruction for switching between the non-secure and secure worlds, and specific pieces of software need to be written, such as a trusted boot loader, a secure world switch monitor, a small trusted OS and trusted apps.
Cortex-M processors use dedicated hardware for faster and more power-efficient transitions between the non-secure and secure worlds. There is no need for any secure monitor software.
Bottom line: you should probably not use a Cortex-A to start developing your Cortex-M23 software.
You should rather have a look at the Arm MPS2+ FPGA Prototyping Board, verify that it is well suited to your needs, and buy one: according to Arm, it "is supplied with fixed encrypted FPGA implementations of all the Cortex-M processors", including Cortex-M23 and Cortex-M33 implementations.
There obviously will be differences in terms of performance between the FPGA implementation and a real Cortex-M23 implementation, but from a TrustZone-aware software point of view, there should be none.
If you think about it, USD 495.00 is less than 10 hours of an embedded software developer costing USD 50 per hour. That is not a huge price for removing a huge risk from your project - my two cents.

I got an answer from ARM on this question through another channel and since the topic might be interesting for the community I want to share it here. Here is what ARM says:
While both of them are called TrustZone and at high level the
concepts are similar, at low level of the architecture there are many
differences between TrustZone on Cortex-M23/M33 and Cortex-A. The
following website summarized the key differences:
https://developer.arm.com/technologies/trustzone
Due to those architectural differences, you cannot use a Cortex-A
platform to develop TrustZone software for Cortex-M.

TrustZone on Cortex-A uses a dedicated processor mode, Monitor mode, to handle the switch between security states.
In Monitor mode the processor is always in the secure state and has access to the NS bit in the SCR register; this bit defines the security state of the mode the CPU will switch to after leaving Monitor mode.
As a result, any switch between the secure and non-secure states goes through a single entry point, which is Monitor mode.
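As an illustration only (not part of the original answer), a non-secure OS or hypervisor typically reaches that single entry point by executing an SMC instruction. A minimal AArch64 sketch following the SMC Calling Convention (function ID in x0, arguments in x1/x2, result back in x0) might look like this; the wrapper itself is hypothetical and the function ID would come from whichever secure service you are calling:

```c
#include <stdint.h>

/* Issue an SMC from the non-secure world following the SMC Calling Convention:
 * function ID in x0, arguments in x1..x2, result returned in x0.
 * Execution traps to the monitor at EL3, which switches worlds via SCR_EL3.NS. */
static inline uint64_t smc_call(uint64_t fid, uint64_t arg0, uint64_t arg1)
{
    register uint64_t x0 __asm__("x0") = fid;
    register uint64_t x1 __asm__("x1") = arg0;
    register uint64_t x2 __asm__("x2") = arg1;

    __asm__ volatile("smc #0"
                     : "+r"(x0), "+r"(x1), "+r"(x2)
                     :
                     : "x3", "x4", "x5", "x6", "x7", "memory");
    return x0;  /* result from the secure world, per the calling convention */
}
```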
Armv8-M with the Security Extension takes a different approach. The concept is the same in that there are two states, secure and non-secure, but you can implement multiple entry points for switching between CPU states. There are three types of memory attributes: secure, non-secure and non-secure callable, and non-secure callable memory holds the entry points that ensure the transition from non-secure to secure.
An entry point in non-secure callable memory has a unique structure: it must start with an SG (Secure Gateway) instruction; once that is executed, the CPU switches to the secure state.
Switching back to the non-secure state is handled by executing other dedicated instructions: BXNS and BLXNS.
In short, Armv8-M follows a different approach by allowing multiple user-defined entry points and dividing the addressable memory space between three memory attributes.
For more details you can refer to the following course:
https://www.udemy.com/course/arm-cortex-m33-trust-zone/?referralCode=6BDA6DF1E47A7CF53175
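To make the Armv8-M entry-point mechanism above more concrete, here is a minimal sketch using the CMSE support in Arm toolchains (arm_cmse.h and the cmse_nonsecure_entry attribute, compiled with -mcmse); the function name and the "service" it implements are made up for illustration, and the exact project layout depends on your toolchain and device:

```c
/* secure_api.c - part of the secure image, compiled with -mcmse, e.g.:
 *   arm-none-eabi-gcc -mcpu=cortex-m33 -mcmse -c secure_api.c
 * The linker places a small veneer for each cmse_nonsecure_entry function in
 * non-secure-callable (NSC) memory; the veneer starts with the SG instruction.
 */
#include <arm_cmse.h>
#include <stdint.h>

/* Entry point callable from the non-secure world. A non-secure caller branches
 * to the SG veneer in NSC memory, the SG instruction switches the CPU to the
 * secure state, and the compiler returns with BXNS so execution resumes in the
 * non-secure state. */
int32_t __attribute__((cmse_nonsecure_entry)) secure_increment(int32_t value)
{
    return value + 1;   /* trivial secure "service" used only for illustration */
}
```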

Related

C/C++ technologies involved in sending data across networks very fast

In terms of low latency (I am thinking about financial exchanges/co-location- people who care about microseconds) what options are there for sending packets from a C++ program on two Unix computers?
I have heard about kernel bypass network cards, but does this mean you program against some sort of API for the card? I presume this would be a faster option in comparison to using the standard Unix Berkeley sockets?
I would really appreciate any contribution, especially from persons who are involved in this area.
EDITED from milliseconds to microseconds
EDITED I am kinda hoping to receive answers based more upon C/C++, rather than network hardware technologies. It was intended as a software question.
UDP sockets are fast, low latency, and reliable enough when both machines are on the same LAN.
TCP is much slower than UDP, but when the two machines are not on the same LAN, UDP on its own is not reliable.
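For reference, the plain Berkeley-sockets UDP path being discussed looks roughly like this minimal sketch (the address and port are placeholders):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Plain blocking UDP sender: one datagram per send, no connection setup. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(9000);                        /* placeholder port */
    inet_pton(AF_INET, "192.168.1.20", &peer.sin_addr);   /* placeholder address */

    const char msg[] = "tick";
    if (sendto(fd, msg, sizeof(msg), 0,
               (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("sendto");
    }

    close(fd);
    return 0;
}
```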
Software profiling will expose obvious problems within your program. However, when you are talking about network performance, network latency is likely to be your largest bottleneck. If you are using TCP, then you want to do things that avoid congestion and loss on your network to prevent retransmissions. There are a few things you can do to cope:
Use a network with bandwidth and reliability guarantees.
Properly size your TCP parameters to maximize utilization without incurring loss (see the sketch just after this list).
Use error correction in your data transmission to correct for the small amount of loss you might encounter.
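For the TCP-parameter point above, here is a minimal sketch of typical per-socket tuning; the buffer sizes are arbitrary examples, not recommendations:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Apply common latency-oriented options to an already-created TCP socket. */
int tune_tcp_socket(int fd)
{
    int one = 1;
    /* Disable Nagle's algorithm so small writes go out immediately. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
        return -1;

    /* Example buffer sizes; the right values depend on your
     * bandwidth-delay product and loss characteristics. */
    int sndbuf = 256 * 1024;
    int rcvbuf = 256 * 1024;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
        return -1;

    return 0;
}
```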
Or you can avoid using TCP altogether. But if reliability is required, you will end up implementing much of what is already in TCP.
But, you can leverage existing projects that have already thought through a lot of these issues. The UDT project is one I am aware of, that seems to be gaining traction.
At some point in the past, I worked with a packet sending driver that was loaded into the Windows kernel. Using this driver it was possible to generate a stream of packets with something like 10-15 times the throughput (I do not remember the exact number) of an app that was using the sockets layer.
The advantage is simple: the sending request comes directly from the kernel and bypasses multiple layers of software: sockets, protocol (even for a UDP packet, simple protocol driver processing is still needed), context switches, etc.
Usually reduced latency comes at the cost of reduced robustness. Compare for example the (often greatly advertised) fastpath option for ADSL: the reduced latency due to shorter packet transfer times comes at the cost of increased error susceptibility. Similar technologies might exist for a large number of network media. So it very much depends on the hardware technologies involved. Your question suggests you're referring to Ethernet, but it is unclear whether the link is Ethernet-only or something else (ATM, ADSL, …), and whether some other network technology would be an option as well. It also very much depends on geographical distances.
EDIT:
I got a bit carried away with the hardware aspects of this question. To provide at least one aspect tangible at the level of application design: have a look at zero-copy network operations like sendfile(2). They can be used to eliminate one possible cause of latency, although only in cases where the original data came from some source other than the application memory.
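A minimal sketch of that sendfile(2) pattern on Linux, assuming you already have a connected TCP socket and an open file descriptor (both hypothetical here):

```c
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>

/* Send an entire file over a connected socket without copying the data
 * through user space. Returns 0 on success, -1 on error. */
int send_whole_file(int sock_fd, int file_fd)
{
    struct stat st;
    if (fstat(file_fd, &st) < 0) { perror("fstat"); return -1; }

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (sent <= 0) { perror("sendfile"); return -1; }
        /* sendfile advances 'offset' by the number of bytes actually sent. */
    }
    return 0;
}
```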
As my day job, I work for a certain stock exchange. The answer below is my own opinion, based on the software solutions which we provide exactly for this kind of high throughput, low latency data transfer. It is not intended in any way to be taken as a marketing pitch (please, I am a dev). It is just to show the essential components of the software stack in such a solution for this kind of fast data (the data could be stock/trading market data or, in general, any data):
1] Physical layer - a network interface card in the case of a TCP-UDP/IP based Ethernet network, or a very fast / high bandwidth interface called an InfiniBand Host Channel Adapter. In the case of the IP/Ethernet software stack, this is part of the OS. For InfiniBand the card manufacturers (Intel, Mellanox) provide their drivers, firmware and an API library against which one has to implement the socket code (even InfiniBand uses its own 'socketish' protocol for network communications between 2 nodes).
2] The next layer above the physical layer is a middleware which basically abstracts the lower network protocol nitty-gritty and provides some kind of interface for data I/O from the physical layer to the application layer. This layer also provides some kind of network data quality assurance (if using TCP).
3] The last layer is the application which we provide on top of the middleware. Anyone who gets 1] and 2] from us can develop a low latency / high throughput 'data transfer over network' kind of app for stock trading or algorithmic trading kinds of applications, using a choice of programming language interfaces - C, C++, Java, C#.
Basically a client like you can develop their own application in C or C++ using the APIs we provide, which will take care of interacting with the NIC or HCA (i.e. the actual physical network interface) to send and receive data fast, really fast.
We have a comprehensive solution catering to the different quality and latency profiles demanded by our clients - some are fine with microsecond latency but need very high data quality / very few errors; some can tolerate a few errors but need nanosecond latency; some need microsecond latency with no errors tolerable; ...
If you need/or are interested in any way in this kind of solution , ping me offline at my contacts mentioned here at SO.

Need help to choose real-time OS and Hardware

I heard and read that Windows/Linux OS machines are not real-time.
I have read this article. It listed Windows CE as one of the RTOSes. That's kind of confusing to me since I always thought Windows CE was for mobile or embedded devices.
I need a real-time application running 24/7, processing signals from various sensors on each fast-moving object into a database, and monitoring them by running several machine learning algorithms.
What would be proper real-time hardware and OS for this kind of application? The development environment would be MFC or Qt C++. I really need opinions from experienced developers. Thanks
QNX has served me well in the past. I should warn you that it was only for training purposes (real-time industrial process control), and that while I have implemented real-time control programs with this OS, I've never really deployed one.
The first rule with real-time systems is to specify your real-time constraints, such as:
the system must be able to process up to 600 signals per minute; or
the system must spend no more than 1/10 second per signal.
The difference is subtle, but these are different constraints.
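To make the distinction concrete, here is a minimal sketch (using clock_gettime on a POSIX system; the numbers simply mirror the hypothetical constraints above) of checking the per-signal deadline, which is not the same as counting signals per minute:

```c
#include <stdio.h>
#include <time.h>

#define DEADLINE_NS 100000000L   /* 1/10 second per signal, as in the example above */

static void process_signal(void) { /* placeholder for the real work */ }

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    process_signal();
    clock_gettime(CLOCK_MONOTONIC, &end);

    long elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000L
                    + (end.tv_nsec - start.tv_nsec);

    /* A per-signal constraint must hold for every signal, not just on average;
     * meeting 600 signals per minute on average could still miss this deadline. */
    if (elapsed_ns > DEADLINE_NS)
        printf("deadline missed: %ld ns\n", elapsed_ns);

    return 0;
}
```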
Just keep in mind that there is absolutely no way to decide if any hardware/OS/library combination is good enough for you unless you specify these constraints.
For that, do you think QNX might be appropriate? What would be its advantages over Windows/Linux systems with a high priority setting?
If you look at the QNX documentation for many POSIX system calls, you will notice they specify extra constraints on performance, which are possibly required to guarantee your real-time constraints. The OS is specifically designed to match these constraints. You won't get this on a system that is not officially an RTOS. If you are going to write real-time software, I recommend that you buy a good book on the subject. There is considerable literature, given how sensitive the subject is.
Get yourself a good book on real-time system design to get a feel for what questions to ask, and then read the technical documentation of each product you will use to see if it can match your constraints. An example of things to look for in software libraries like Qt is when they allocate memory. If this is not documented in each class interface, there is no way to guarantee meeting your constraints, since there is hidden algorithmic complexity.
Development environment would be MFC or Qt C++.
I would think that Qt compiles on QNX, but I'm not sure if Qt provides the guarantees required to match your real-time constraints. Libraries that abstract away too much stuff are a risk, since it's difficult to determine if they satisfy your requirements. Hidden memory management is often problematic, but there are other questions you should ask about too.
It seems to me that people say Real-time systems == embedded systems. Am I wrong?
Real-time system definitely does not equal "embedded system", though many embedded systems have real-time constraints.
How real time do you need?
Remember real time is about responsiveness, not speed. In fact most RTOSes will be slower on average than a general-purpose OS.
Do you need to guarantee a certain average number of transactions/second, or do you need to always respond within n seconds of an event?
Do you have custom hardware or are you relying on inputs over ethernet, USB, etc?
Are drivers for the hardware available on the RTOS or will you have to write them yourself ?
Windows and Linux are possible RTOSes. Windows Embedded allows you to turn off services to give much more reliable response rates, and there are both real-time kernels and real-time add-ons for Linux which give pretty much the same real-time performance as something like VxWorks.
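As an illustration of the kind of thing those real-time kernels and add-ons are used with, here is a minimal sketch of requesting a real-time scheduling class and locking memory on Linux (requires appropriate privileges; the priority value is only an example):

```c
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* Lock all current and future pages in RAM to avoid page-fault latency. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");

    /* Ask for the SCHED_FIFO real-time class with an example priority. */
    struct sched_param sp;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 80;   /* example value in the 1..99 range */

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* ... time-critical work goes here ... */
    return 0;
}
```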
It also depends on how many tasks you need to handle. A lot of the complexity of a true RTOS (like VxWorks) is that it can control many tasks at the same time while allowing each a guaranteed latency and CPU share - important for a Mars rover, but not for a single data collection PC.

How to program in Windows 7.0 to make it more deterministic?

My understanding is that Windows is non-deterministic and can be trouble when using it for data acquisition. Using a 32-bit bus and dual core, is it possible to use inline asm to work with interrupts in Visual Studio 2005, or at least to set some kind of flags to be consistent in time with little jitter?
Going in the direction of an RTOS (real-time operating system): Windows CE with programming in kernel mode may get too expensive for us.
Real time solutions for Windows such as LabVIEW Real-time or RTX are expensive; a stand-alone RTOS would often be less expensive (or even free), but if you need Windows functionality as well, you are perhaps no further forward.
If cost is critical, you might run a free or low-cost RTOS in a virtual machine. This can work, though there is no cooperation over hardware access between the RTOS and Windows, and no direct communication mechanism (you could use TCP/IP over a virtual (or real) network, I suppose).
Another alternative is to perform the real-time data acquisition on stand-alone hardware (a microcontroller development board or SBC for example) and communicate with Windows via USB or TCP/IP for example. It is possible that way to get timing jitter down to the microsecond level or better.
There are third-party realtime extensions to Windows. See, e. g. http://msdn.microsoft.com/en-us/library/ms838340(v=winembedded.5).aspx
Windows is not an RTOS, so there is no magic answer. However, there are some things you can do to make the system more "real time friendly".
Disable background processes that can steal system resources from you.
Use a multi-core processor to reduce the impact of context switching
If your program does any disk I/O, move that to its own spindle.
Look into process priority. Make sure your process is running as High or Realtime (a short sketch follows this list).
Pay attention to how your program manages memory. Avoid doing things that will lead to excessive disk paging.
Consider a real-time extension to Windows (already mentioned).
Consider moving to a real RTOS.
Consider dividing your system into two pieces: (1) real time component running on a microcontroller/DSP/FPGA, and (2) The user interface portion that runs on the Windows PC.
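Regarding the process-priority point in the list above, here is a minimal Win32 sketch (raising priority this far can starve the rest of the system, so treat it as an experiment rather than a recommendation):

```c
#include <stdio.h>
#include <windows.h>

int main(void)
{
    /* Raise the whole process and the current thread to the highest classes.
     * REALTIME_PRIORITY_CLASS requires sufficient privileges; otherwise
     * Windows silently falls back to a lower class. */
    if (!SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS))
        printf("SetPriorityClass failed: %lu\n", GetLastError());

    if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL))
        printf("SetThreadPriority failed: %lu\n", GetLastError());

    /* ... acquisition loop goes here ... */
    return 0;
}
```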

in virtualbox, what happens when you allocate more than one virtual core?

If I'm using Oracle's virtualbox, and I assign more than one virtual core to the virtual machine, how are the actual cores assigned? Does it use both real cores in the virtual machine, or does it use something that emulates cores?
Your question is almost like asking: How does an operating system determine which core to run a given process/thread on? Your computer is making that type of decision all the time - it has far more processes/threads running than you have cores available. This specific answer is similar in nature but also depends on how the guest machine is configured and what support your hardware has available to accelerate the virtualization process - so this answer is certainly not definitive and I won't really touch on how the host schedules code to be executed, but let's examine two relatively simple cases:
The first would be a fully virtualized machine - this would be a machine with no or minimal acceleration enabled. The hardware presented to the guest is fully virtualized even though many CPU instructions are simply passed through and executed directly on the CPU. In cases like this, your guest VM more-or-less behaves like any process running on the host: The CPU resources are scheduled by the operating system (to be clear, the host in this case) and the processes/threads can be run on whatever cores they are allowed to. The default is typically any core that is available, though some optimizations may be present to try and keep a process on the same core to allow the L1/L2 caches to be more effective and minimize context switches. Typically you would only have a single CPU allocated to the guest operating system in these cases, and that would roughly translate to a single process running on the host.
In a slightly more complex scenario, a virtual machine is configured with all available CPU virtualization acceleration options. In Intel speak these are referred to as VT-x; for AMD it is AMD-V. These primarily support privileged instructions that would normally require some binary translation / trapping to keep the host and guest protected. As such, the host operating system loses a little bit of visibility. Include in that hardware accelerated MMU support (such that memory page tables can be accessed directly without being shadowed by the virtualization software) - and the visibility drops a little more. Ultimately though it still largely behaves as the first example: it is a process running on the host and is scheduled accordingly - only that you can think of a thread being allocated to run the instructions (or pass them through) for each virtual CPU.
It is worth noting that while you can (with the right hardware support) allocate more virtual cores to the guest than you have available, it isn't a good idea. Typically this will result in decreased performance as the guest potentially thrashes the CPU and can't properly schedule the resources that are being requested - even if the CPU is not fully taxed. I bring this up as a scenario that shares certain similarities with a multi-threaded program that spawns far more threads (that are actually busy) than there are idle CPU cores available to run them. Your performance will typically be worse than if you had used fewer threads to get the work done.
In the extreme case, VirtualBox even supports hot-plugging CPU resources - though only a few operating systems properly support it: Windows 2008 Data Center edition and certain Linux kernels. The same rules generally apply where one guest CPU core is treated as a process/thread on a logical core for the host, however it is really up to the host and hardware itself to decide which logical core will be used for the virtual core.
With all that being said - your question of how VirtualBox actually assigns those resources... well, I haven't dug through the code, so I certainly can't answer definitively, but it has been my experience that it generally behaves as described. If you are really curious, you could experiment with tagging the VirtualBox VBoxSvc.exe and associated processes in Task Manager, choosing the "Set Affinity" option, limiting their execution to a single CPU, and seeing if those settings are honored. Whether they are honored by the host probably depends on what level of HW assist you have available, as the guest probably isn't really running as part of those processes.

Low latency trading systems using C++ in Windows?

It seems that all the major investment banks use C++ in Unix (Linux, Solaris) for their low latency/high frequency server applications. Why is Windows generally not used as a platform for this? Are there technical reasons why Windows can't compete?
The performance requirements on the extremely low-latency systems used for algorithmic trading are extreme. In this environment, microseconds count.
I'm not sure about Solaris, but in the case of Linux, these guys are writing and using low-latency patches and customisations for the whole kernel, from the network card drivers on up. It's not that there's a technical reason why that couldn't be done on Windows, but there is a practical/legal one - access to the source code, and the ability to recompile it with changes.
Technically, no. However, there is a very simple business reason: the rest of the financial world runs on Unix. Banks run on AIX, the stock market itself runs on Unix, and therefore it is simply easier to find programmers in the financial world who are used to a Unix environment rather than a Windows one.
(I've worked in investment banking for 8 years)
In fact, quite a lot of what banks call low latency is done in Java. And not even Real-Time Java - just normal Java with the GC turned off. The main trick here is to make sure you've exercised all of your code enough for the JIT to have run before you switch a particular VM into prod (so you have some startup looping that runs for a couple of minutes - and hot failover).
The reasons for using Linux are:
Familiarity
Remote administration is still better, and also low impact - it will have a minimal effect on the other processes on the machine. Remember, these systems are often co-located at the exchange, so the links to the machines (from you/your support team) will probably be worse than those to your normal datacentres.
Tunability - the ability to set swappiness to 0, get the JVM to preallocate large pages, and other low level tricks is quite useful.
I'm sure you could get Windows to work acceptably, but there is no huge advantage to doing so - as others have said, any employees you poached would have to rediscover all their latency busting tricks rather than just run down a checklist.
Linux/UNIX are much more usable for concurrent remote users, making it easier to script around the systems, use standard tools like grep/sed/awk/perl/ruby/less on logs... ssh/scp... all that stuff's just there.
There are also technical issues. For example, to measure elapsed time on Windows you can choose between a set of functions based on the Windows clock tick, and the hardware-based QueryPerformanceCounter(). The former increments every 10 to 16 milliseconds (note: some documentation implies more precision - e.g. the values from GetSystemTimeAsFileTime() measure to 100ns, but they report the same 100ns edge of the clock tick until it ticks again). The latter - QueryPerformanceCounter() - has show-stopping issues where different cores/CPUs can report clocks-since-startup that differ by several seconds due to being warmed up at different times during system boot. MSDN documents this as a possible BIOS bug, but it's common. So, who wants to develop low-latency trading systems on a platform that can't be instrumented properly? (There are solutions, but you won't find any software ones sitting conveniently in boost or ACE).
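For reference, the QueryPerformanceCounter() pattern being criticised looks roughly like this minimal sketch; pinning the thread to one core is one common workaround for the cross-core drift mentioned above:

```c
#include <stdio.h>
#include <windows.h>

int main(void)
{
    /* Pin to one core so both counter reads come from the same time source. */
    SetThreadAffinityMask(GetCurrentThread(), 1);

    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);   /* counter ticks per second */

    QueryPerformanceCounter(&start);
    Sleep(5);                            /* stand-in for the work being timed */
    QueryPerformanceCounter(&end);

    double microseconds = (double)(end.QuadPart - start.QuadPart)
                          * 1000000.0 / (double)freq.QuadPart;
    printf("elapsed: %.1f us\n", microseconds);
    return 0;
}
```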
Many Linux/UNIX variants have lots of easily tweakable parameters to trade off latency for a single event against average latency under load, time slice sizes, scheduling policies etc.. On open source Operating Systems, there's also the assurance that comes with being able to refer to the code when you think something should be faster than it is, and the knowledge that a (potentially huge) community of people have been and are doing so critically - with Windows it's obviously mainly going to be the people who're assigned to look at it.
On the FUD/reputation side - somewhat intangible but an important part of the reasons for OS selection - I think most programmers in the industry would just trust Linux/UNIX more to provide reliable scheduling and behaviour. Further, Linux/UNIX has a reputation for crashing less, though Windows is pretty reliable these days, and Linux has a much more volatile code base than Solaris or FreeBSD.
The reason is simple: 10-20 years ago, when such systems emerged, "hardcore" multi-CPU servers were ONLY available on some sort of UNIX. Windows NT was in kindergarten in those days. So the reason is "historical".
Modern systems might be developed on Windows, it's just a matter of taste these days.
PS: I am currently working on one such system :-)
I partially agree with most of the answers above. Though what I have realized is that the biggest reason to use C++ is because it is relatively fast and has a very vast standard library (STL).
Apart from that, Linux/Unix systems are also used to boost performance. I know many low latency teams which go to the extent of tweaking the Linux kernel. Obviously this level of freedom is not provided by Windows.
Other reasons like legacy systems, license cost and resources count as well, but are lesser driving factors. As "rjw" mentioned, I have seen teams use Java as well, with a modified JVM.
There are a variety of reasons, but the reason is not only historical. In fact, it seems as if more and more server-side financial applications run on *nix these days than ever before (including big names like the London Stock Exchange, who switched from a .NET platform). For client-side or desktop apps, it would be silly to target anything other than Windows, as that is the established platform. However, for server-side apps, most places that I have worked at deploy to *nix.
I second the opinions about history and access to kernel manipulation.
Apart from those reasons, I also believe that, just as they turn off garbage collection in .NET and the similar mechanism in Java when using these technologies for some low latency work, they might avoid Windows because of the high-level APIs that sit between the application, the low-level OS and the kernel.
The core is of course the kernel, which can be interacted with through the low-level OS. The high-level APIs are provided just to make the common user's life easier. But in the case of low latency this turns out to be a fat layer, losing fractions of a second around each operation. So avoiding it is a lucrative option for gaining a few fractions of a second per call.
Apart from this, another thing to consider is integration. Most of the servers, data centres and exchanges use UNIX, not Windows, so using clients of the same family makes integration and communication easier.
Then you have security issues (many people out there might not agree with this point, though): hacking UNIX is not as easy as hacking Windows. I don't agree that licensing is the issue for banks, because they shower money on every single piece of hardware and software and the people who customize them, so buying licenses is not as big an issue when weighed against what they gain by purchasing.