Problem
I originally posted this question, which apparently did not meet my customer's spec. Hence I am redefining the problem:
To understand the problem a bit more, the timing diagram in the original post can be used. The delayer needs to be platform independent. To be precise, I run a job scheduler, and my current delayer is not going to be compatible with it. What I am stuck with is the "independent" bit of the delayer. I have already knocked out a delayer in Simulink using Probe (probing for sample time) and Variable Integer Delay blocks. However, during our acceptance phase we realised that the scheduler does not comply with such a configuration and needs something more intrinsic and basic - something like a while loop running in a C/C++ application.
Initial Solution
The solution I can think of is the following:
Define a global, static time-slice variable called tslc. Basically, this is how often the scheduler runs; the unit could be seconds.
Define a function with the following body:
void hold_for_secs(float* tslc, float* _delay, float* _tmr, char* _flag) {
    /* Subtract one scheduler time slice from the remaining delay. */
    _delay[0] -= tslc[0];
    /* Once the remaining delay drops below the tolerance, raise the flag. */
    if (_delay[0] < (float)(1e-5)) {
        _flag[0] = '1';
    } else {
        _flag[0] = '0';
    }
}
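To clarify the intent, here is a rough sketch of how I expect the scheduler to drive this function once per time slice (the names wait_time, timer and expired are made up for the example):

/* called once per scheduler tick of tslc seconds */
static float tslc      = 0.01f;  /* scheduler period, e.g. 10 ms          */
static float wait_time = 2.0f;   /* remaining delay, counts down to zero  */
static float timer     = 0.0f;   /* placeholder matching the signature    */
static char  expired   = '0';    /* set to '1' once the delay has elapsed */

hold_for_secs(&tslc, &wait_time, &timer, &expired);
if (expired == '1') {
    /* the requested delay has elapsed - run the delayed action here */
}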
Please forgive my poor function-coding skills; I merely tried to come up with a solution. I would really appreciate it if people could help me out a little bit with suggestions here!
Computing Platform
Windows 2000 server
Target computing platform
An embedded system card - something similar to a modern graphics card or sound card that goes into one of the PCI slots. We do testing on a testbed and finally implement the solution on that embedded system card.
Related
I am trying to control a robot using a template-based controller class written in C++. Essentially I have a UDP connection set up with the robot to receive its state and send new torque commands. I receive new observations at a higher frequency (say 2000 Hz), and my controller takes about 1 ms (1000 Hz) to calculate new torque commands to send to the robot. The problem I am facing is that I don't want my main code to wait to send the old torque commands while my controller is still calculating new ones. From what I understand, I can use Ubuntu with an RT-Linux kernel, multi-thread the code so that my getTorques() method runs in a different thread, set priorities for the process, and use mutexes and locks to avoid data races between the two threads, but I was hoping to learn what the best strategies are for writing hard real-time code for such a problem.
// main.cpp
#include "CONTROLLER.h"
#include "llapi.h"

int main() {
    ...
    CONTROLLERclass obj;
    ...
    double new_observation;
    double u;
    ...
    while (communicating) {
        get_newObs(new_observation);        // Get new state of the robot (2000 Hz)
        obj.getTorques(new_observation, u); // Takes about 1 ms to calculate new torques
        send_newCommands(u);                // Send the new torque commands to the robot
    }
    ...
}
Thanks in advance!
Okay, so first of all, it sounds to me like you need to deal with the fact that you receive input at 2 kHz, but can only compute results at about 1 kHz.
Based on that, you're apparently going to have to discard roughly half the inputs, or else somehow (in a way that makes sense for your application) quickly combine the inputs that have arrived since the last time you processed the inputs.
But as the code is structured right now, you're going to fetch and process older and older inputs, so even though you're producing outputs at ~1 kHz, those outputs are constantly being based on older and older data.
For the moment, let's assume you want to receive inputs as fast as you can, and when you're ready to do so, you process the most recent input you've received, produce an output based on that input, and repeat.
In that case, you'd probably end up with something on this general order (using C++ threads and atomics for the moment):
std::atomic<double> new_observation;

std::thread receiver([&] {
    while (communicating) {
        double d;
        get_newObs(d);        // blocks until the next state arrives (~2000 Hz)
        new_observation = d;  // atomically publish the latest observation
    }
});

std::thread sender([&] {
    while (communicating) {
        double input = new_observation;  // snapshot the most recent observation
        auto u = get_torques(input);     // ~1 ms of computation (~1000 Hz)
        send_newCommands(u);             // send the freshly computed command
    }
});
I've assumed that you'll always receive input faster than you can consume it, so the processing thread can always process whatever input is waiting, without receiving anything to indicate that the input has been updated since it was last processed. If that's wrong, things get a little more complex, but I'm not going to try to deal with that right now, since it sounds like it's unnecessary.
As far as the code itself goes, the only thing that may not be obvious is that instead of passing a reference to new_observation to either of the existing functions, I've read new_observation into a variable local to the thread, then passed that.
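That said, if you do end up needing to know whether a genuinely new observation has arrived, one possible sketch is to add an atomic flag next to new_observation (the fresh flag is an addition just for this example; everything else stays as above):

std::atomic<bool> fresh{false};

std::thread receiver([&] {
    while (communicating) {
        double d;
        get_newObs(d);
        new_observation = d;
        fresh = true;                   // mark the sample as not yet consumed
    }
});

std::thread sender([&] {
    while (communicating) {
        if (!fresh.exchange(false))     // atomically claim the newest sample
            continue;                   // nothing new yet, poll again
        double input = new_observation;
        auto u = get_torques(input);
        send_newCommands(u);
    }
});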
I'm trying to speed up the inference in a people-counter application. In order to use the GPU, I've set the inference engine configuration as described below:
device_name = "GPU"
ie.SetConfig({ {PluginConfigParams::KEY_CONFIG_FILE, "./cldnn_global_custom_kernels/cldnn_global_custom_kernels.xml"} }, device_name);
and when loading the network on the inference engine I've set the target device as described below:
CNNNetwork network = netReader.getNetwork();
TargetDevice t_device = InferenceEngine::TargetDevice::eGPU;
network.setTargetDevice(t_device);
const std::map<std::string, std::string> dyn_config = { { PluginConfigParams::KEY_DYN_BATCH_ENABLED, PluginConfigParams::YES } };
ie_.LoadNetwork(network, device_name, dyn_config);
but the inference engine still uses the CPU, and this slows down the inference time. Is there a way to use the Intel GPU at maximum power to do inference on a particular network? I'm using the person-detection-retail-0013 model.
Thanks.
Did you mean person-detection-retail-0013? I ask because I haven't found pedestrian-detection-retail-013 in the open_model_zoo repo.
It might be expected that you see a slowdown while using the GPU. The network you tested has the following layers as part of its topology: PriorBox and DetectionOutput. Those layers are executed on the CPU, as the documentation says: https://docs.openvinotoolkit.org/latest/_docs_IE_DG_supported_plugins_CL_DNN.html
My guess is that this may be the reason for the slowdown.
But to be 100% sure, I would suggest running the benchmark_app tool to benchmark the model. It can print detailed performance information about each layer, which should help shed light on the real root cause of the slowdown. More information about benchmark_app can be found here: https://docs.openvinotoolkit.org/latest/_inference_engine_samples_benchmark_app_README.html
PS: Just a piece of advice regarding usage of the IE API: setTargetDevice (as in network.setTargetDevice(t_device);) is a deprecated method. It is enough to set the device using LoadNetwork, like in your example: ie_.LoadNetwork(network, device_name, dyn_config);
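In other words, the device selection can be reduced to something like this (a sketch based on the snippets in your question, using the Core API):

InferenceEngine::Core ie;
CNNNetwork network = netReader.getNetwork();          // no setTargetDevice call
const std::map<std::string, std::string> dyn_config =
    { { PluginConfigParams::KEY_DYN_BATCH_ENABLED, PluginConfigParams::YES } };
ExecutableNetwork exec_net = ie.LoadNetwork(network, "GPU", dyn_config);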
Hope this helps.
So we are using a stack consisting of C++ Media Foundation code in order to play back video files. An important requirement is the ability to play these videos in constantly repeating sequences, so every single video slot will periodically change the video it is playing. In our current example we create 16 HWNDs to render video into and 16 corresponding player objects. The main application loops over all of them, one after another, and does the following:
Shutdown the last player
Release the object
CoCreateInstance for a new player
Initialize the player with the (old) HWND
Start Playback
The media player is called "MediaPlayer2"; it needs to be built and registered as a COM server (regsvr32). The main application can be found in the TAPlayer2 project. It searches for the CLSID of the player in the registry and instantiates it. As the current test file we use a test.mp4 that has to reside on disk, like C:\test.mp4.
Now everything goes fine initially. The program loops through the players and the video keeps restarting and playing. The memory footprint is normal and everything runs smoothly. After a timeframe of anything between 20 minutes and 4 days, all of a sudden things get weird. At this point it seems as if calls to "InitializeRenderer" by the EVR slow down and eventually don't go through anymore at all. With this, the thread count and memory footprint also start to increase drastically, and after a certain amount of time, depending on the available RAM, all the memory is exhausted and our application crashes, usually somewhere in the GPU driver or near the EVR DLL.
I am happy to try out any other code examples that propose to solve my issue: displaying multiple video windows at the same time and looping through them like in a playlist. It needs to run on Windows 10!
I have been going at this for quite a while now and am pretty hard stuck. I uploaded the above-mentioned code example and added the link to this post. It should work out of the box, afaik. I can also provide code excerpts here in the thread if that is preferred.
Any help is appreciated, thanks
Thomas
Link to demo project (VS2015): https://filebin.net/l8gl79jrz6fd02vt
Edit: the following code from the end of winmain.cpp is used to restart the players:
do
{
    for (int i = 0; i < PLAYER_COUNT; i++)
    {
        hr = g_pPlayer[i]->Shutdown();
        SafeRelease(&g_pPlayer[i]);

        hr = CoCreateInstance(CLSID_AvasysPlayer,      // CLSID of the coclass
                              NULL,                    // no aggregation
                              CLSCTX_INPROC_SERVER,    // the server is in-proc
                              __uuidof(IAvasysPlayer), // IID of the interface we want
                              (void**)&g_pPlayer[i]);  // address of our interface pointer

        hr = g_pPlayer[i]->InitPlayer(hwndPlayers[i]);
        hr = g_pPlayer[i]->OpenUrl(L"C:\\test.mp4");
    }
} while (true);
Some Media Foundation interfaces, like
IMFMediaSource
IMFMediaSession
IMFMediaSink
need to be shut down before you release them.
At this point it seems as if calls to "InitializeRenderer" by the EVR slow down and eventually don't go through anymore at all.
... usually somewhere in the GPU driver or near the EVR DLL.
These quotes are a good lead for a precise search in your code.
In your file PlayerTopoBuilder.cpp, at CPlayerTopoBuilder::AddBranchToPartialTopology:
if (bVideo)
{
    if (false) {
        BREAK_ON_FAIL(hr = CreateMediaSinkActivate(pSD, hVideoWnd, &pSinkActivate));
        BREAK_ON_FAIL(hr = AddOutputNode(pTopology, pSinkActivate, 0, &pOutputNode));
    }
    else {
        //// try directly create renderer
        BREAK_ON_FAIL(hr = MFCreateVideoRenderer(__uuidof(IMFMediaSink), (void**)&pMediaSink));

        CComQIPtr<IMFVideoRenderer> pRenderer = pMediaSink;
        BREAK_ON_FAIL(hr = pRenderer->InitializeRenderer(nullptr, nullptr));

        CComQIPtr<IMFGetService> getService(pRenderer);
        BREAK_ON_FAIL(hr = getService->GetService(MR_VIDEO_RENDER_SERVICE, __uuidof(IMFVideoDisplayControl), (void**)&pVideoDisplayControl));
        BREAK_ON_FAIL(hr = pVideoDisplayControl->SetVideoWindow(hVideoWnd));

        BREAK_ON_FAIL(hr = pMediaSink->GetStreamSinkByIndex(0, &pStreamSink));
        BREAK_ON_FAIL(hr = AddOutputNode(pTopology, 0, &pOutputNode, pStreamSink));
    }
}
You create an IMFMediaSink with MFCreateVideoRenderer and store it in pMediaSink. pMediaSink is released because of the use of CComPtr, but it is never shut down.
You must keep a reference to the media sink and Shutdown/Release it when the player shuts down.
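A minimal sketch of that idea, assuming the sink is kept in a member such as m_pMediaSink and shut down in the player's shutdown path (the names are illustrative):

// somewhere in the player class: CComPtr<IMFMediaSink> m_pMediaSink;

HRESULT CPlayer::Shutdown()
{
    HRESULT hr = S_OK;
    if (m_pMediaSink)
    {
        hr = m_pMediaSink->Shutdown();   // break the sink's internal circular references
        m_pMediaSink.Release();          // then drop our own reference
    }
    // ... shut down the session and release the other objects as before ...
    return hr;
}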
Or you can use a different approach with MFCreateVideoRendererActivate.
IMFMediaSink::Shutdown
If the application creates the media sink, it is responsible for calling Shutdown to avoid memory or resource leaks.
In most applications, however, the application creates an activation object for the media sink, and the Media Session uses that object to create the media sink.
In that case, the Media Session — not the application — shuts down the media sink. (For more information, see Activation Objects.)
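For reference, the activate-based branch could look roughly like this (reusing the variables from your AddBranchToPartialTopology code), so that the session owns the sink's lifetime:

CComPtr<IMFActivate> pSinkActivate;
BREAK_ON_FAIL(hr = MFCreateVideoRendererActivate(hVideoWnd, &pSinkActivate));
BREAK_ON_FAIL(hr = AddOutputNode(pTopology, pSinkActivate, 0, &pOutputNode));
// With an activation object, the Media Session creates and shuts down the EVR sink itself.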
I also suggest using this kind of code at the end of CPlayer::CloseSession (after releasing all other objects):
if (m_pSession != NULL) {
    hr = m_pSession->Shutdown();
    ULONG ulMFObjects = m_pSession->Release();
    m_pSession = NULL;
    assert(ulMFObjects == 0);
}
For the use of MFCreateVideoRendererActivate, you can look at my MFNodePlayer project:
MFNodePlayer
EDIT
I rewrote your program, but I tried to keep your logic and original source code, like CComPtr/Mutex...
MFMultiVideo
Tell me if this program has memory leaks.
It will depend on your answer, but then we can talk about best practices with MediaFoundation.
Another thought:
Your program uses 1 to 16 IMFMediaSession objects. On a good computer configuration, I think you could use only one IMFMediaSession (never try to aggregate 16 MFSources).
Visit:
CustomVideoMixer
to understand the other way to do it.
I think your approach of using 16 IMFMediaSession objects is not the best approach on a modern computer. VuVirt talked about this.
EDIT2
I've updated MFMultiVideo using Work Queues.
I think the problem may be that you call MFStartup/MFShutdown for each player.
Just call MFStartup/MFShutdown once, in winmain.cpp for example, like my program does.
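Something along these lines (sketch only):

int WINAPI wWinMain(HINSTANCE hInstance, HINSTANCE, PWSTR, int nCmdShow)
{
    HRESULT hr = MFStartup(MF_VERSION);      // once per process, at startup
    if (SUCCEEDED(hr))
    {
        // ... create the windows and run the player creation/shutdown loop ...
        MFShutdown();                        // once per process, at exit
    }
    return 0;
}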
I'm currently building a robot which has some sensors attached to it. The control unit on the robot is an ARM Cortex-M3, all sensors are attached to it and it is connected via Ethernet to the "ground station".
Now I want to read and write settings on the robot via the ground station. Therefore I thought about implementing a "virtual register" on the robot that can be manipulated by the ground station.
It could be made up of structs and look like this:
// accelerometer register
struct accel_reg {
    // accelerations
    int32_t accelX;
    int32_t accelY;
    int32_t accelZ;
};

// infrared distance sensor register
struct ir_reg {
    uint16_t dist; // distance
};

// robot's register table
struct {
    uint8_t status;          // current state
    uint32_t faultFlags;     // some fault flags
    accel_reg accelerometer; // accelerometer register
    ir_reg ir_sensors[4];    // 4 IR sensors connected
} robot;

// usage example:
robot.accelerometer.accelX = -981;
robot.ir_sensors[1].dist = 1024;
On the robot the registers will be constantly filled with new values and configuration settings are set by the ground station and applied by the robot.
The ground station and the robot will be written in C++ so they both can use the same struct datatype.
The question I have now is: how do I encapsulate the read/write operations in a protocol without writing tons of metadata?
Let's say I want to read the register robot.ir_sensors[2].dist. How would I address this register in my protocol?
I already thought about transmitting a relative offset in bytes (i.e. the relative position in memory inside the struct), but I think memory alignment and padding may cause problems, especially because the ground station runs on an x86_64 architecture and the robot runs on a 32-bit ARM processor.
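A small illustration of the padding concern (the struct here is just an example):

// example only: layout depends on the compiler/ABI
struct example {
    uint8_t  status;   // 1 byte
    uint32_t flags;    // usually preceded by 3 padding bytes
};
// sizeof(example) is typically 8, but padding and alignment are
// implementation-defined, so a byte offset computed on the x86_64 ground
// station may not match the offset on the ARM target.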
Thanks for any hints! :)
I'm also going to suggest Google Protocol Buffers.
In the simplest case, you could implement one message RobotState like this:
message RobotState {
    optional int32 status = 1;
    optional int32 distance = 2;
    optional int32 accelX = 3;
    ...
}
Then when the robot receives the message, it will take new values from any optional field that is present. It will then reply with a message containing the current values of all fields.
This way it is quite easy to implement field updates using the "merge message" functionality of most protobuf implementations. Also, you can keep it very simple at the start because you only have one message type, but if you need to expand later you can add submessages.
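For illustration, the merge step on the robot side could look roughly like this with the C++ protobuf API (the buffer names are made up):

RobotState current;                      // the robot's live state
RobotState update;                       // partial message from the ground station

update.ParseFromArray(rx_buf, rx_len);   // decode the received bytes
current.MergeFrom(update);               // only fields present in 'update' are overwritten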
It is true that protobuf does not support int8_t or int16_t. Just use int32 instead.
I think Google protocol buffers are an excellent session/presentation-layer tool to use. Actually, Google protocol buffers do not support the syntax I was thinking of, so I will change this part of my answer to recommend XSD by Code Synthesis. Although it is primarily used with XML, it supports different presentation layers such as XDR and may be more efficient than protocol buffers with large amounts of optional data. The generated code is also very nice to work with. XSD is free to use with open-source software and even for commercial use with limited message structures.
I don't believe you want to read/write register sets at random. You can prefix a message with an enum that denotes a message type, such as IR update, distance, accel, etc. These are register groups. Then the robot responds with the register set. All the registers you've given so far are sensors. The writable ones must be motor control?
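As a sketch, the message-id prefix could be as simple as this (names invented for the example):

enum MsgId : uint8_t {
    MSG_STATUS = 0,   // status/fault register group
    MSG_ACCEL  = 1,   // accelerometer group
    MSG_IR     = 2,   // IR distance group
    MSG_MOTOR  = 3,   // motor control group (write)
};

struct MsgHeader {
    uint8_t  id;      // one of MsgId
    uint16_t length;  // payload length in bytes
};
// The ground station asks for a group by id; the robot replies with the
// whole group, so individual fields never need byte-offset addressing.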
You want to think about what control you want to perform and the type of telemetry you would like to receive, then come up with a message structure and bundle the information together. You could use sequence diagrams and remote procedure APIs like SOA/SOAP, RPC, REST, etc. I don't mean these RPC frameworks directly, but the concepts, such as request/response and perhaps messages that are just sent periodically (telemetry) without specific requests. So there would be a telemetry request from the ground station with some sort of interval, and then the robot would respond periodically with unsolicited data. You always need a message id (the enum above), unless your protocol is going to be stateful, which I would discourage for robustness reasons.
You haven't described how the control system might work or if you wish to do this remotely. Describing that may lead to more ideas on the protocol. I believe we are talking about layers 5,6,7 of OSI. Have fun.
Basic Question
Is there any way to do event recording and playback within a time-sensitive (framerate-independent) system?
Any help - including a simple "No sorry it is impossible" - would be greatly appreciated. I have spent almost 20 hours working on this over the past few weekends and am driving myself crazy.
Full Details
This is currently aimed at a game, but the libraries I'm writing are designed to be more general, and this concept applies to more than just my C++ coding.
I have some code that looks functionally similar to this... (it is written in C++0x, but I'm taking some liberties to make it more compact)
void InputThread()
{
    InputAxisReturn AxisState[IA_SIZE];

    while (Continue)
    {
        Threading()->EventWait(InputEvent);
        Threading()->EventReset(InputEvent);

        pInput->GetChangedAxis(AxisState);

        //REF ALPHA
        if (AxisState[IA_GAMEPAD_0_X].Changed)
        {
            X_Axis = AxisState[IA_GAMEPAD_0_X].Value;
        }
    }
}
And I have a separate thread that looks like this...
//REF BETA
while (Continue)
{
    //Is there a message to process?
    StandardWindowsPeekMessageProcessing();

    //GetElapsedTime() returns a float of time in seconds since its last call
    UpdateAll(LoopTimer.GetElapsedTime());
}
Now I'd like to record input events for playback for testing and some limited replay functionality.
I can easily record the events with precision timing by simply inserting the following code where I marked //REF ALPHA
//REF ALPHA
EventRecordings.push_back(EventRecording(TimeSinceRecordingBegan, AxisState));
The real issue is playing these back. My LoopTimer is extremely high precision, using the high-performance counter (QueryPerformanceCounter). This means that it is nearly impossible to hit the same time difference using code like the one below in place of //REF BETA:
// REF BETA
NextEvent = EventRecordings.pop_back();

Time TimeSincePlaybackBegan;

while (Continue)
{
    //Is there a message to process?
    StandardWindowsPeekMessageProcessing();

    //Did we reach the next event?
    if (TimeSincePlaybackBegan >= NextEvent.TimeSinceRecordingBegan)
    {
        if (NextEvent.AxisState[IA_GAMEPAD_0_X].Changed)
        {
            X_Axis = NextEvent.AxisState[IA_GAMEPAD_0_X].Value;
        }
        NextEvent = EventRecordings.pop_back();
    }

    //GetElapsedTime() returns a float of time in seconds since its last call
    Time elapsed = LoopTimer.GetElapsedTime();
    UpdateAll(elapsed);
    TimeSincePlaybackBegan += elapsed;
}
The issue with this approach is that you will almost never hit the exact same time so you will have a few microseconds where the playback doesn't match the recording.
I also tried event snapping. Kind of a confusing term, but basically if TimeSincePlaybackBegan > NextEvent.TimeSinceRecordingBegan, then TimeSincePlaybackBegan is set to NextEvent.TimeSinceRecordingBegan and ElapsedTime is altered to suit.
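In code, the snapping looked roughly like this (a reconstruction, not the exact original):

Time elapsed = LoopTimer.GetElapsedTime();

if (TimeSincePlaybackBegan + elapsed > NextEvent.TimeSinceRecordingBegan)
{
    // Shrink this step so the playback clock lands exactly on the recorded
    // event time instead of overshooting it.
    elapsed = NextEvent.TimeSinceRecordingBegan - TimeSincePlaybackBegan;
}

UpdateAll(elapsed);
TimeSincePlaybackBegan += elapsed;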
It had some interesting side effects which you would expect (like some slowdown) but it unfortunately still resulted in the playback de-synchronizing.
For some more background - and possibly a reason why my time snapping approach didn't work - I'm using BulletPhysics somewhere down that UpdateAll call. Kind of like this...
void Update(float diff)
{
    static const float m_FixedTimeStep = 0.005f;
    static const uint32 MaxSteps = 200;

    //Updates the fps
    cWorldInternal::Update(diff);

    if (diff > MaxSteps * m_FixedTimeStep)
    {
        Warning("cBulletWorld::Update() diff > MaxTimestep. Determinism will be lost");
    }

    pBulletWorld->stepSimulation(diff, MaxSteps, m_FixedTimeStep);
}
But I also tried pBulletWorld->stepSimulation(diff, 0, 0), which according to http://www.bulletphysics.org/mediawiki-1.5.8/index.php/Stepping_the_World should have done the trick, but still to no avail.
Answering my own question for anyone else who stumbles upon this.
Basically, if you want deterministic recording and playback, you need to lock the frame rate. If the system cannot handle the frame rate, you must introduce slowdown or risk desynchronization.
After two weeks of additional research I've decided it is just not possible due to floating-point inadequacies and the fact that floating-point arithmetic is not necessarily associative.
The only way to have a deterministic engine that relies on floating-point numbers is to have a stable, fixed frame rate. Any change in the frame rate will - over the long term - result in the playback becoming desynchronized.
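For anyone wanting the shape of it, a minimal sketch of such a locked, fixed-timestep loop (illustrative only, reusing the names from above):

const float FixedStep = 1.0f / 60.0f;    // one deterministic tick
float accumulator = 0.0f;

while (Continue)
{
    StandardWindowsPeekMessageProcessing();
    accumulator += LoopTimer.GetElapsedTime();

    // Only ever step by FixedStep; if the machine falls behind, the game
    // slows down instead of taking variable-sized (non-deterministic) steps.
    while (accumulator >= FixedStep)
    {
        UpdateAll(FixedStep);
        accumulator -= FixedStep;
    }
}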