I am on Linux using an NVIDIA GPU.
I am doing a GPGPU thing using a compute shader that takes a long time; the whole run might take a few minutes.
I am getting this error:
Compute waiting for fence failed with error ErrorDeviceLost.
This happens whenever the computation takes longer than a certain amount of time. Is there a way for me to configure my machine/driver to extend the amount of time my compute shader is allowed to run?
I tried asking on the NVIDIA Discord, but it is a nightmare merely to be granted access.
Related
I've been writing some basic OpenCL programs and running them on a single device. I have been timing the performance of the code, to get an idea of how well each program is performing.
I have been looking at getting my kernels to run on the platform's GPU device and CPU device at the same time. The cl::Context constructor can be passed a std::vector of devices, to initialise a context with multiple devices. I have a system with a single GPU and a single CPU.
Is constructing a context with a vector of available devices all you have to do for kernels to be distributed to multiple devices? I noticed a significant performance increase when I did construct the context with 2 devices, but it seems too simple.
There is a DeviceCommandQueue object, perhaps I should be using that to create a queue for each device explicitly?
I did some testing on my system. Indeed you can do something like this:
using namespace cl;
Context context({ devices[0], devices[1] });
CommandQueue queue(context); // queue to push commands to; no device is specified, just hand over the context
Program::Sources source;
std::string kernel_code = get_opencl_code();
source.push_back({ kernel_code.c_str(), kernel_code.length() });
Program program(context, source);
program.build("-cl-fast-relaxed-math -w");
I found that if the two devices are from different platforms (like one Nvidia GPU and one Intel GPU), either clCreateContext throws a read-access-violation error at runtime or program.build fails at runtime. If the two devices are from the same platform, the code compiles and runs, but it won't run on both devices. I tested with an Intel i7-8700K CPU and its integrated Intel UHD 630 GPU, and no matter the order of the devices in the vector the context is created with, the code is always executed on the CPU in this case. I checked this with the Windows Task Manager and also with kernel execution time measurements (execution times are characteristic of each device).
You could also monitor device usage with a tool like Task Manager to see which device is actually running. Let me know if it is any different on your system than what I observed.
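(For reference, the kernel execution time measurement mentioned above can be done with OpenCL event profiling. Below is a minimal sketch assuming a context, device and built kernel already exist; the helper name and global-size parameter are placeholders of mine.)
#include <CL/cl2.hpp> // OpenCL C++ bindings; the header name may differ between SDK versions

// Hypothetical helper: returns the kernel's execution time in milliseconds on one device.
double time_kernel_on_device(cl::Context& context, cl::Device& device,
                             cl::Kernel& kernel, size_t global_size) {
    cl::CommandQueue queue(context, device, CL_QUEUE_PROFILING_ENABLE);
    cl::Event event;
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(global_size),
                               cl::NullRange, nullptr, &event);
    event.wait();
    cl_ulong start = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong end = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
    return (end - start) * 1e-6; // profiling timestamps are in nanoseconds
}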
Generally, parallelization across multiple devices is not done by handing the context a vector of devices. Instead, you give each device a dedicated context and queue and explicitly handle which kernels are executed on which queue. This gives you full control over memory transfers and execution order / synchronization points.
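A minimal sketch of that dedicated-context-per-device approach is below, reusing get_opencl_code() from the snippet above. The kernel name "my_kernel", the problem size N and the 50/50 work split are illustrative placeholders of mine.
// One context, queue, program and kernel per device; split the work explicitly.
// `gpu` and `cpu` are cl::Device objects obtained from their respective platforms.
cl::Context gpu_context(gpu);
cl::Context cpu_context(cpu);
cl::CommandQueue gpu_queue(gpu_context, gpu);
cl::CommandQueue cpu_queue(cpu_context, cpu);

cl::Program gpu_program(gpu_context, get_opencl_code());
gpu_program.build("-cl-fast-relaxed-math -w");
cl::Program cpu_program(cpu_context, get_opencl_code());
cpu_program.build("-cl-fast-relaxed-math -w");

cl::Kernel gpu_kernel(gpu_program, "my_kernel"); // kernel name is a placeholder
cl::Kernel cpu_kernel(cpu_program, "my_kernel");
// (set each kernel's arguments with buffers created in its own context)

const size_t N = 1 << 20; // illustrative total problem size
gpu_queue.enqueueNDRangeKernel(gpu_kernel, cl::NullRange, cl::NDRange(N / 2));
cpu_queue.enqueueNDRangeKernel(cpu_kernel, cl::NDRange(N / 2), cl::NDRange(N - N / 2));
gpu_queue.finish();
cpu_queue.finish();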
I was seeing a performance increase when passing in the vector of devices. I downloaded a CPU/GPU profiler to actually check the activity of my GPU and CPU while running the code, and it seemed I was seeing activity on both devices. The CPU was registering around 95-100% activity and the GPU was getting up to 30-40%, so OpenCL must be splitting the kernels between the two devices. My computer has a CPU with an integrated GPU, which may play a role in why the kernels are being shared across the devices: it's not like it has a CPU and a completely separate GPU; they're part of the same package.
I want to see programmatically how much GPU time a part of my application consumes on macOS and iOS. On OpenGL and D3D I can use GPU timer query objects. I searched and couldn't find anything similar for Metal. How do I measure GPU time on Metal without using Instruments etc.? I'm using Objective-C.
There are a couple of problems with measuring GPU time by taking CPU timestamps in a command buffer's scheduled and completed handlers (the approach described below):
1) Most of the time you really want to know the GPU-side latency within a command buffer, not the round trip to the CPU. This is better measured as the time difference between running 20 instances of the shader and 10 instances of the shader (per-instance time ≈ (T_20 - T_10) / 10). However, that approach can add noise, since the error is the sum of the errors of the two measurements.
2) Waiting for completion causes the GPU to clock down when it stops executing. When it starts back up again, the clock is in a low power state and may take quite a while to come up again, skewing your results. This can be a serious problem and may understate your performance in benchmark vs. actual by a factor of two or more.
3) If you start the clock on scheduled and stop it on completed, but the GPU is busy running other work, then your elapsed time includes time spent on the other workload. If the GPU is not busy, then you get the clock-down problems described in (2).
This problem is considerably harder to do right than most benchmarking cases I've worked with, and I have done a lot of performance measurement.
The best way to measure these things is to use on-device performance monitor counters (PMCs), as they are a direct measure of what is going on, using the machine's own notion of time. I favor counters that report cycles over wall-clock time, because that tends to weed out clock slewing, but there is no universal agreement about that. (Not all parts of the hardware run at the same frequency, etc.) I would look to the developer tools for methods to measure based on PMCs, and if you don't find them, ask for them.
You can add scheduled and completed handler blocks to a command buffer. You can take timestamps in each and compare. There's some latency, since the blocks are executed on the CPU, but it should get you close.
With Metal 2.1, Metal now provides "events", which are more like fences in other APIs. (The name MTLFence was already used for synchronizing shared heap stuff.) In particular, with MTLSharedEvent, you can encode commands to modify the event's value at particular points in the command buffer(s). Then, you can either wait for the event to reach that value or ask for a block to be executed asynchronously when the event reaches a target value.
That still has problems with latency, etc. (as Ian Ollmann described), but is more fine grained than command buffer scheduling and completion. In particular, as Klaas mentions in a comment, a command buffer being scheduled does not indicate that it has started executing. You could put commands to set an event's value at the beginning and (with a different value) at the end of a sequence of commands, and those would only notify at actual execution time.
Finally, on iOS 10.3+ but not macOS, MTLCommandBuffer has two properties, GPUStartTime and GPUEndTime, with which you can determine how much time a command buffer took to execute on the GPU. This should not be subject to latency in the same way as the other techniques.
As an addition to Ken's comment above, GPUStartTime and GPUEndTime are now available on macOS too (10.15+):
https://developer.apple.com/documentation/metal/mtlcommandbuffer/1639926-gpuendtime?language=objc
How can I fully utilize each of my EC2 cores?
I'm using a c4.4xlarge AWS Ubuntu EC2 instance and TensorFlow to build a large convolutional neural network. nproc says that my EC2 instance has 16 cores. When I run my convnet training code, the top utility says that I'm only using 400% CPU. I was expecting it to use 1600% CPU because of the 16 cores. The AWS EC2 monitoring tab confirms that I'm only using 25% of my CPU capacity. This is a huge network; on my new Mac Pro it consumes about 600% CPU and takes a few hours to build, so I don't think the reason is that my network is too small.
I believe the line below ultimately determines CPU usage:
sess = tf.InteractiveSession(config=tf.ConfigProto())
I admit I don't fully understand the relationship between threads and cores, but I tried increasing the number of threads, as below. It had the same effect as the line above: still 400% CPU.
NUM_THREADS = 16
sess = tf.InteractiveSession(config=tf.ConfigProto(intra_op_parallelism_threads=NUM_THREADS))
EDIT:
htop shows that I am actually using all 16 of my EC2 cores, but each core is only at about 25%.
top shows that my total CPU usage is around 400%, but occasionally it will shoot up to 1300% and then almost immediately drop back down to ~400%. This makes me think there could be a deadlock problem.
Several things you can try:
Increase the number of threads
You already tried changing the intra_op_parallelism_threads. Depending on your network it can also make sense to increase the inter_op_parallelism_threads. From the doc:
inter_op_parallelism_threads: Nodes that perform blocking operations are enqueued on a pool of inter_op_parallelism_threads available in each process. 0 means the system picks an appropriate number.
intra_op_parallelism_threads: The execution of an individual op (for some op types) can be parallelized on a pool of intra_op_parallelism_threads. 0 means the system picks an appropriate number.
(Side note: the values in the configuration file referenced above are not the actual default values TensorFlow uses, just example values. You can see the actual default configuration by manually inspecting the object returned by tf.ConfigProto().)
By default TensorFlow uses 0 for both options, meaning it tries to choose appropriate values itself. I don't think TensorFlow picked poor values that caused your problem, but you can try out different values for these options to be on the safe side.
Extract traces to see how well your code parallelizes
Have a look at
tensorflow code optimization strategy
It gives you a trace of which ops run on which threads. In such a trace you can often see that the actual computation happens on far fewer threads than are available; this could also be the case for your network. At potential synchronization points all threads are briefly active at once, which may be the reason for the sporadic peaks in CPU utilization that you experience.
Miscellaneous
Make sure you are not running out of memory (htop)
Make sure you are not doing a lot of I/O or something similar
// (Each call is wrapped in an HRESULT-checking helper that throws on failure; the wrapper is omitted here.)
m_audioEngine->CreateMasteringVoice(
    &m_masteringVoice,
    XAUDIO2_DEFAULT_CHANNELS,
    sampleRate,
    0,
    NULL
);

m_audioEngine->CreateSourceVoice(
    &implData->sourceVoice,
    format,
    0,
    XAUDIO2_DEFAULT_FREQ_RATIO,
    reinterpret_cast<IXAudio2VoiceCallback*>(&implData->callbackHander),
    nullptr,
    nullptr
);
When I have my earphones plugged in, the above code always seems to run fine.
If I start my game without earphones plugged in, one of the calls above sometimes (not always) fails. It always throws the same HRESULT: 0x88890017.
Any ideas?
If I put a breakpoint directly after this, it seems not to throw an error... Does this task run asynchronously?
EDIT---------------------------------
My IXAudio2SourceVoice keeps getting lost randomly.
What can cause it to be lost?
This is why my program crashes...
It only gets lost when earphones are not plugged in (when creating the XAudio2 objects).
What does it mean?
This error code is known as AUDCLNT_E_CPUUSAGE_EXCEEDED and occurs when the audio engine is taking too long to process audio packets. This typically happens when the CPU usage of the audio engine exceeds a certain threshold. The audio engine will fail to create new streams if its CPU usage exceeds this threshold.
Resolving: The User
CPU usage is subject to various things: the processing power of your CPU, the number of channels you're using, and the audio device enhancements you have enabled at the system level. Some possible solutions are to ensure a decent CPU (check the minimum system requirements specification), to lower the number of channels in use in the application/game settings, or to disable some system-level audio device enhancements in your operating system. For the latter, check your task manager for CPU usage, and if one of the suspicious processes is "audiodg.exe", go into the Sound control panel, double-click each of your playback devices in turn, go to the Enhancements tab, and check the "disable all enhancements" box. This should lower the required CPU usage and solve your problem.
Resolving: The Coder
Keep in mind that the more your audio code is doing, the more CPU cycles it will require. If you have an IXAudio2 device created with a ton of effect processors in the chain, 1000 SubmixVoices and hundreds of SourceVoices, that's asking for trouble. Before you point your finger at the CPU or at the system-level audio device enhancements, make sure it isn't just your code being inefficient.
Your big friend here is IXAudio2::GetPerformanceData, which will query the device and fill in an XAUDIO2_PERFORMANCE_DATA structure for you. This gives you some information about the CPU cycles used. Chances are good you can intercept this error before it actually occurs. When you detect heavy CPU usage, or when the error actually occurs, it's not necessarily a reason to have things fail in your game/engine/framework. You could retry. Or you could adjust the number of SubmixVoices. Or you could choose not to create a SourceVoice. Or you could temporarily suspend audio/switch to a null device, and inform the user about all of this.
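For illustration, a minimal sketch of polling it is below; the 90% threshold and where/how often you poll are arbitrary choices of mine, not part of the API.
#include <xaudio2.h>

// Poll the engine's performance counters, e.g. once per frame or on a timer.
// m_audioEngine is the IXAudio2* created earlier.
XAUDIO2_PERFORMANCE_DATA perf = {};
m_audioEngine->GetPerformanceData(&perf);

// Fraction of available CPU cycles spent inside the audio engine since the last query.
double engineCpuUsage = perf.TotalCyclesSinceLastQuery != 0
    ? static_cast<double>(perf.AudioCyclesSinceLastQuery) / perf.TotalCyclesSinceLastQuery
    : 0.0;

if (engineCpuUsage > 0.9 || perf.GlitchesSinceEngineStarted > 0)
{
    // Getting close to the limit: retry voice creation later, reduce the number of
    // SubmixVoices/SourceVoices or effects, or warn the user before creation fails
    // with AUDCLNT_E_CPUUSAGE_EXCEEDED (0x88890017).
}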
You could set up an event or callback to flag heavy CPU usage in the audio engine. This enables the application to inform the user, so they can lower the number of channels in the settings (alternatively, your application can adjust things automatically) or turn off some system-level audio device enhancements.
In my current project I have two levels of tasking in a VxWorks system: a higher-priority (100) task for number crunching and other work, and a lower-priority (200) task for background data logging to on-board flash memory. Logging is done using the fwrite() call, to a file stored on a TFFS file system. The high-priority task runs at a periodic rate and then sleeps to allow background logging to be done.
My expectation was that the background logging task would run when the high priority task sleeps and be preempted as soon as the high priority task wakes.
What appears to be happening is a significant delay in suspending the background logging task once the high priority task is ready to run again, when there is sufficient data to keep the logging task continuously occupied.
What could delay the pre-emption of a lower priority task under VxWorks 6.8 on a Power PC architecture?
You didn't quantify "significant", so the following is just speculation...
You mention writing to flash. One of the issues is that writing to flash typically requires the driver to poll the status of the hardware to make sure the operation completes successfully.
It is possible that during certain operations the file system temporarily disables preemption to ensure that no corruption occurs. Coupled with having to wait for the hardware to complete, this might account for the delay.
If you have access to the System Viewer tool, that would go a long way towards identifying the cause of the delay.
I second the suggestion of using the System Viewer. It'll show all the tasks involved in the TFFS stack, and you may be surprised how many layers there are. If you're doing an fwrite with a large block of data, the flash access may be large (and slow, as Benoit said). You could try a bunch of smaller fwrites. I suggest doing a test to see how long fwrite() takes for various sizes; you may also see differences from test to test with the same size as you cross flash block boundaries.
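A rough sketch of such a size-sweep test is below. The file path, block sizes, iteration count and the use of clock_gettime() are placeholders/assumptions of mine; on VxWorks you could equally time with tickGet()/sysClkRateGet().
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time fwrite() of various block sizes to the TFFS volume. */
static double elapsedMs(struct timespec t0, struct timespec t1)
{
    return (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

void fwriteSizeTest(void)
{
    const size_t sizes[] = { 512, 4096, 32768, 131072 }; /* illustrative block sizes */
    char *buf = (char *)malloc(131072);
    memset(buf, 0xA5, 131072);

    for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); ++i)
    {
        FILE *fp = fopen("/tffs0/timing.bin", "wb"); /* placeholder path */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0); /* or CLOCK_REALTIME if MONOTONIC is unavailable */
        for (int n = 0; n < 32; ++n)
            fwrite(buf, 1, sizes[i], fp);
        fflush(fp);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        fclose(fp);
        printf("block %6u bytes: %.2f ms for 32 writes\n",
               (unsigned)sizes[i], elapsedMs(t0, t1));
    }
    free(buf);
}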