I have read "Batch, Batch, Batch".
In the batching process, there are two main operations:
1. Submitting n triangles
2. Setting state (SetState)
So which one consumes more CPU time?
Or does the SetState call itself not matter at all, and it matters only because once the state has changed, we have to submit triangles again?
All in all, it doesn't really matter (as you say at the end of your question).
If you do a SetState without submitting data to draw with that state, the call is wasted work. Don't do the SetState.
If you draw several batches with the same state, you should probably have submitted them as one single batch.
What a "set state" does is very driver dependent, and it also depends on which state you change. Some changes might need a lot of validation, and that could be done when you set the state or when it is actually sent to the GPU; there is no way to know for sure.
In general, I would count each submitted draw as one batch, whether or not the state was changed before it.
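To make the "same state, one batch" advice concrete, here is a minimal sketch that merges draws sharing a state so that each state is set only once. The names `build_batches`, `draws` and the state labels are hypothetical, and the sketch assumes draw order does not affect the result (which is not true for, say, blended geometry).

```python
from collections import defaultdict


def build_batches(draws):
    """Merge (state, triangles) pairs so each state is submitted as one batch.

    Assumes reordering draws is safe; `draws` is a hypothetical input format.
    """
    by_state = defaultdict(list)
    for state, triangles in draws:
        by_state[state].extend(triangles)
    return dict(by_state)


# Three draw requests, two of which share the same state.
draws = [("opaque", [1, 2]), ("glow", [3]), ("opaque", [4])]
batches = build_batches(draws)
```

With this input, three draw requests collapse into two batches, so only two state changes and two submissions are needed.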
I have some big performance issues here.
So I would like to take some measurements on the GPU side.
After reading this thread, I wrote this code around my draw functions, including the GL error check and the swapBuffers() call (auto swapping is indeed disabled):
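gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]);
// ... draw calls, GL error check and swapBuffers() go here ...
gl4.glEndQuery(GL4.GL_TIME_ELAPSED);
gl4.glGetQueryObjectiv(queryId[0], GL4.GL_QUERY_RESULT, frameGpuTime, 0);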
Since OpenGL rendering commands are supposed to be asynchronous (the driver can buffer up to X commands before sending them all together in one batch), my questions are essentially whether:
the code above is correct
I am right to assume that at the beginning of a new frame all the previous GL commands (from the previous frame) have been sent, executed and completed on the GPU
I am right to assume that when I get the query result with glGetQueryObjectiv and GL_QUERY_RESULT, all the GL commands so far have completed. That is, will OpenGL wait until the result becomes available (as the thread says)?
Yes, when you query the timer it will block until the data is available, i.e. until the GPU has finished everything that happened between beginning and ending the query. To avoid synchronising with the GPU, you can use GL_QUERY_RESULT_AVAILABLE to check whether the results are already available and only read them then. That requires less straightforward code, because you have to keep tabs on open queries and check them periodically, but it has the least performance impact. Waiting for the value every time is a sure way to kill your performance.
Edit: To address your second question, swapping the buffers doesn't necessarily block until the operation has finished. You may see that behaviour, but it's just as likely that the swap is only an implicit glFlush and the command buffer is not empty yet. That is also the more desirable behaviour, because ideally you want to start on your next frame right away and keep the command buffer filled. Check the implementation's documentation for more information, though, as this is implementation defined.
Edit 2: Checking for errors might end up being an implicit synchronisation, by the way, so you will probably see the command buffer emptying while you wait for the error check.
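The bookkeeping for non-blocking timer queries can be sketched without a GL context. The following is only a simulation: `FakeTimerQuery` is a hypothetical stand-in for a GL query object, its `result_available()` plays the role of reading GL_QUERY_RESULT_AVAILABLE, and `result()` plays the role of reading GL_QUERY_RESULT. None of these names are real OpenGL or JOGL calls.

```python
from collections import deque


class FakeTimerQuery:
    """Hypothetical stand-in for a GL timer query; not a real OpenGL binding."""

    def __init__(self, frame, elapsed_ns, ready_after_polls):
        self.frame = frame
        self._elapsed_ns = elapsed_ns
        self._polls_left = ready_after_polls

    def result_available(self):
        # Never blocks: answers "is the GPU done with this query yet?"
        if self._polls_left > 0:
            self._polls_left -= 1
            return False
        return True

    def result(self):
        # Only called once the result is known to be available.
        return self._elapsed_ns


def drain_ready(pending, timings):
    """Collect finished queries from the front of the queue without waiting."""
    while pending and pending[0].result_available():
        query = pending.popleft()
        timings[query.frame] = query.result()


pending = deque()
timings = {}
for frame in range(4):
    # One query per frame; in this simulation it becomes ready two polls later.
    pending.append(FakeTimerQuery(frame, 1_000_000 * (frame + 1), ready_after_polls=2))
    drain_ready(pending, timings)
```

The point of the design is that the frame loop never stalls: results simply arrive a few frames late, and queries that are not ready stay in the queue.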
Solved: For when simple profiling isn't effective enough, I have written a tool to show me where performance hits occur. Basic information about how the tool works is in the accepted answer below. The source can be found here: (be sure to turn debugging symbols on in the program you're testing)
I've built a game engine in C++ and I have noticed that in one particular area of a level there is a brief performance hit. The game will stop completely for about half a second, and then continue on merrily. I've tried to profile this, but it's difficult to isolate the condition since I also have to load the map and perform the in-game task which causes the performance hit. I can make a map load automatically and skip showing menus, etc., and compare those profile results against a set of similar control data (all the same steps but without actually initiating the performance hit), but it doesn't show anything obvious.
I'm using gmon to profile.
This is a large application with many, many classes and functions. The performance hit only happens once, so there's no way to just trigger the problem many times during one execution to saturate my profiling results in order to make the offending functions more obvious in the profiling results.
What else can I do?
What I would do is try to grab a stack sample in that half second when it's frozen.
This would require an alarm clock timer set to go off some small time in the future, like 100ms.
Then in some loop, like the frame display loop, that normally takes less than 100ms to repeat, keep resetting the timer.
That way, it will act as a watchdog that barks if you don't keep petting it.
Then, stick a breakpoint in the timer interrupt handler.
When it gets there, you know you're in the bad slice of time.
Then just display the call stack, and it should show you what the problem is.
You might have to repeat the process a few times.
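As an illustration of the watchdog idea only, here is a minimal sketch that uses a timer thread instead of a timer interrupt and a breakpoint, and records the stalled thread's call stack instead of stopping in a debugger. The names `Watchdog`, `slow_asset_load` and `run_frames` are hypothetical, and the 100 ms timeout and the 0.5 s stall are assumed values.

```python
import sys
import threading
import time
import traceback


class Watchdog:
    """Calls on_stall(stack_text) if pet() is not called again within timeout seconds."""

    def __init__(self, timeout, on_stall):
        self.timeout = timeout
        self.on_stall = on_stall
        self._watched = threading.get_ident()  # thread running the frame loop
        self._timer = None

    def pet(self):
        # Cancel the pending alarm and arm a fresh one.
        self.stop()
        self._timer = threading.Timer(self.timeout, self._bark)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None

    def _bark(self):
        # Runs on the timer thread: sample the stack of the stalled thread.
        frame = sys._current_frames().get(self._watched)
        if frame is not None:
            self.on_stall("".join(traceback.format_stack(frame)))


def slow_asset_load():
    time.sleep(0.5)  # hypothetical stand-in for the half-second freeze


def run_frames(frame_count, stall_frame, samples):
    dog = Watchdog(0.1, samples.append)
    for frame_no in range(frame_count):
        dog.pet()  # normal frames return here long before the alarm fires
        if frame_no == stall_frame:
            slow_asset_load()
    dog.stop()


samples = []
run_frames(frame_count=5, stall_frame=2, samples=samples)
```

After the run, `samples` holds the stack text captured while the loop was stuck, which names the function responsible for the stall.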
You don't say whether your application is threaded, but I will assume that it is not.
As Mike suggested, get some insight by capturing stack traces and seeing where the program is freezing. With a bit of luck you can do that using pstack:
# Sample the stack roughly every 100 ms; replace processid with the PID of the game.
while sleep 0.1; do
    pstack processid
done >/tmp/stack.log
Should give you some output to go on -- my guess is that you are calling a blocking IO operation, like reading some assets from disk.
A heads up: I am very new to audio, so please bear with me =)
I am trying to feed audio signals into an AVR (it's a classic myAVR MK2 board). Normally, the interrupt signal always comes from some kind of switch: if I press the switch, the program goes into the interrupt.
My goal is to feed audio signals from a microphone into the board and have the board react to them. My first question is: when sending the microphone signal, do I have to put it through the A/D converter, since technically it is an analog signal?
My second and more complicated question is: how would I actually interpret the incoming audio signal?
For example, if I scream "GREEN", then whatever the program was doing should be stopped, the interrupt should be called and the green LED should come on. Now, the mic is pretty much always on, so how do I make sure that the interrupt signal is sent only if "GREEN" is said? I don't want the board constantly going in and out of the interrupt just because someone made some noise.
Would I have to save "GREEN" as a bit pattern somewhere and compare the incoming signal with the saved bits, or something else?
Some answers:
"Do I have to put it through the A/D converter, since technically it is an analog signal?"
Yes, digital chips may fry when exposed to analog signals.
Be aware that you may have to wait some time after starting the ADC before the readings are accurate.
"How would I actually interpret the incoming audio signal?"
Basically you have digital values coming in at a frequency. You will need to store those values and then analyze them. You must trade memory capacity/usage for accuracy. The more samples you take, the better your data and results; but this occupies more memory.
You will also need to filter out noise from the signal and from layered sounds.
You may get some benefits from researching on FFTs.
You should compare using "fuzzy logic" because in the real world, nothing is exact; for example your voice signal could be +/- 30 counts and still be "correct".
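The tolerance idea can be sketched as follows. This is only an illustration of comparing stored samples against incoming ones within +/- 30 counts; it is not a speech recogniser, and the names `matches_template`, `tolerance` and `min_fraction` as well as the sample values are hypothetical.

```python
def matches_template(samples, template, tolerance=30, min_fraction=0.9):
    """Return True if enough samples lie within `tolerance` counts of the template.

    Assumes both sequences are already aligned and have the same length.
    """
    if not template or len(samples) != len(template):
        return False
    close = sum(1 for s, t in zip(samples, template) if abs(s - t) <= tolerance)
    return close >= min_fraction * len(template)


template = [100, 200, 300, 400]  # stored reference ADC counts
good = matches_template([110, 185, 325, 400], template)
bad = matches_template([110, 185, 380, 400], template)
```

Here `good` is accepted because every sample is within 30 counts, while `bad` is rejected because one of the four samples is 80 counts off. In practice such a comparison would more likely be applied to filtered or frequency-domain features than to raw samples.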
I set up this algorithm to share data between different processors, and it has worked so far, but I'm trying to throw a much larger problem at it and I'm witnessing some very strange behavior. I'm losing pieces of data between MPI_Isend and MPI_Recv calls.
I present a snippet of the code below. It is basically comprised of three stages. First, a processor will loop over all elements in a given array. Each element represents a cell in a mesh. The processor checks if the element is being used on other processors. If yes, it does a non-blocking send to that process using the cell's unique global ID as the tag. If no, it checks the next element, and so on.
Second, the processor then loops over all elements again, this time checking if the processor needs to update the data in that cell. If yes, then the data has already been sent out by another process. The current process simply does a blocking receive, knowing who owns the data and the unique global ID for that cell.
Finally, MPI_Waitall is used for the request codes that were stored in the 'req' array during the non-blocking sends.
The issue I'm having is that this entire process completes---there is no hang in the code. But some of the data being received by some of the cells just isn't correct. I check that all data being sent is right by printing each piece of data prior to the send operation. Note that I'm sending and receiving a slice of an array. Each send will pass 31 elements. When I print the array from the process that received it, 3 out of the 31 elements are garbage. All other elements are correct. The strange thing is that it is always the same three elements that are garbage---the first, second and last element.
I want to rule out that something is drastically wrong in my algorithm which would explain this. Or perhaps it is related to the cluster I'm working on? As I mentioned, this worked on all other models I threw at it, using up to 31 cores. I'm getting this behavior when I try to throw 56 cores at the problem. If nothing pops out as wrong, can you suggest a means to test why certain pieces of a send are not making it to their destination?
send = 1 ! index of the next free slot in req
do i = 1, num_cells
! skip cells with data that isn't needed by other processors
if (.not.needed(i)) cycle
tag = gid(i) ! The unique ID of this cell in the entire system
ghoster = ghosts(i) ! The processor that needs data from this cell
call MPI_Isend(data(i,1:tot_levels),tot_levels,mpi_datatype,ghoster,tag,MPI_COMM,req(send),mpierr)
send = send + 1
end do
sends = send-1
do i = 1, num_cells
! skip cells that don't need a data update
if (.not.needed_here(i)) cycle
tag = gid(i)
src = owner(i) ! The processor that owns this cell (scalar renamed so it does not clash with the owner array)
call MPI_Recv(data(i,1:tot_levels),tot_levels,mpi_datatype,src,tag,MPI_COMM,MPI_STATUS_IGNORE,mpierr)
end do
call MPI_Waitall(sends,req,MPI_STATUSES_IGNORE,mpierr)
Is your problem that you're not receiving all of the messages? Note that just because an MPI_SEND or MPI_ISEND completes, doesn't mean that the corresponding MPI_RECV was actually posted/completed. The return of the send call only means that the buffer can be reused by the sender. That data may still be buffered internally somewhere on either the sender or the receiver.
If it's critical that you know that the message was actually received, you need to use a different variety of the send like MPI_SSEND or MPI_RSEND (or the nonblocking versions if you prefer). Note that this won't actually solve your problem. It will probably just make it easier for you to figure out which messages aren't showing up.
I figured out a way to get my code to work, but I'm not entirely sure why, so I'm going to post the solution here and maybe somebody could comment on why this is the case and possibly offer a better solution.
As I indicated in my question and as we have discussed in the comments, it appeared that pieces of data were being lost between sends/receives. The concept of the buffer is a mystery to me, but I thought that maybe there wasn't enough space to hold my Isends, allowing for them to get lost before they could be received. So I swapped out the MPI_Isend calls with MPI_Bsend calls. I figure out how big my buffer needs to be using MPI_Pack_size. This way, I know I will have ample space for all my messages I send. I allocate my buffer size using MPI_Buffer_attach. I got rid of the MPI_Waitall, since it is no longer needed, and I replaced it with a call to MPI_Buffer_detach.
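As a rough illustration of the buffer sizing described above: the buffer attached for buffered sends has to hold, for every message that may be pending at once, its packed size (as reported by MPI_Pack_size) plus MPI_BSEND_OVERHEAD. The sketch below only does that arithmetic. The function name and the numbers (1000 pending messages, 248 packed bytes for 31 values assumed to be 8 bytes each, 96 bytes of overhead) are assumptions for illustration, since the real values come from the MPI implementation.

```python
def bsend_buffer_bytes(num_messages, packed_bytes_per_message, bsend_overhead):
    """Bytes to attach so that all pending buffered sends fit at once.

    packed_bytes_per_message: value reported by MPI_Pack_size for one message.
    bsend_overhead: the implementation's MPI_BSEND_OVERHEAD constant.
    """
    return num_messages * (packed_bytes_per_message + bsend_overhead)


# Assumed example values, not measurements from the code in question.
buffer_bytes = bsend_buffer_bytes(
    num_messages=1000, packed_bytes_per_message=248, bsend_overhead=96
)
```

With these assumed numbers the buffer would need 344000 bytes.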
The code runs without issue and arrives at identical results to the serial case. I'm able to scale the problem size up to what I tried before and it works now. So based on these results, I'd have to assume that pieces of messages were being lost due to insufficient buffer space.
I have concerns about the impact on code performance. I did a scaling study on different problem sizes. See the image below. The x-axis gives the size of the problem (5 means the problem is 5 times bigger than 1). The y-axis gives the time to finish executing the program. There are three lines shown. Running the program in serial is shown in blue. The size=1 case is extrapolated out linearly with the green line. We see that the code execution time is linearly correlated with problem size. The red line shows running the program in parallel---we use a number of processors that matches the problem size (e.g. 2 cores for size=2, 4 cores for size=4, etc.).
You can see that the parallel execution time increases very slowly with problem size, which is expected, except for the largest case. I feel that the poor performance for the largest case is being caused by an increased amount of message buffering, which was not needed in smaller cases.
I hope the title did not mislead you.
My problem is the following: I am currently trying to speed up a raytracer with the help of the graphics card. It works fine, apart from the fact that it actually got slower. :)
This is because I trace one ray against the whole geometry at a time on the graphics card (my "tracing server") and then fetch the result, which is awfully slow. So I have to gather a number of rays, calculate them together and fetch the results together to speed this up.
The next problem is that I am not allowed to rewrite the surrounding framework, which should know nothing, or as little as possible, about this parallelization.
So here is my approach:
I thought about using several threads, where each one gets a ray and asks my "tracing server" to calculate the intersections. The thread is then suspended until enough rays have been gathered to calculate the intersections on the graphics card and get the results back efficiently. This means that each thread waits until the results have been fetched.
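A minimal sketch of this gather-then-fetch approach, assuming a fixed batch size and a batch function that stands in for the "tracing server": each caller blocks in `trace()` until the batch is full, and the last arrival runs the whole batch for everyone. The names `BatchingTracer` and `fake_gpu_trace` are hypothetical, and a real version would also need a way to flush a final, partially filled batch.

```python
import threading


class BatchingTracer:
    """Blocks each caller until batch_size rays are gathered, then traces them together."""

    def __init__(self, batch_size, trace_batch):
        self.batch_size = batch_size
        self.trace_batch = trace_batch  # callable: list of rays -> list of results
        self._cond = threading.Condition()
        self._rays = []
        self._results = {}  # batch generation -> list of results
        self._generation = 0

    def trace(self, ray):
        with self._cond:
            gen = self._generation
            index = len(self._rays)
            self._rays.append(ray)
            if len(self._rays) == self.batch_size:
                # The last arrival submits the whole batch and wakes the others.
                self._results[gen] = self.trace_batch(self._rays)
                self._rays = []
                self._generation += 1
                self._cond.notify_all()
            else:
                while gen not in self._results:
                    self._cond.wait()
            return self._results[gen][index]


calls = []


def fake_gpu_trace(rays):
    # Hypothetical stand-in for the tracing server: one expensive call per batch.
    calls.append(len(rays))
    return [ray * 2 for ray in rays]


tracer = BatchingTracer(4, fake_gpu_trace)
results = {}


def worker(ray):
    results[ray] = tracer.trace(ray)


threads = [threading.Thread(target=worker, args=(ray,)) for ray in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The calling code only sees a blocking `trace(ray)` call, so the surrounding framework does not need to know about the batching.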
As you can see, I already have a plan, but I do not know the following:
Which threading framework should I use to be platform independent?
Should I use a thread pool of fixed size or create threads as needed?
Can any given thread library handle at least 1000 waiting threads (because that is the number of rays I need to gather for my fetch to be efficient)?
But I could also imagine doing this with a single thread that dumps its load (a new ray) to the "tracing server" and fetches the next load, until enough rays have been gathered to fetch the results.
Then the thread would take the results one by one and do the further calculations until all results are processed, then go back to step one until all rays are done.
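The single-threaded variant could be sketched like this, assuming all rays are known up front and the batch function stands in for the "tracing server". The names `trace_in_batches` and `fake_gpu_trace` are hypothetical.

```python
def trace_in_batches(rays, batch_size, trace_batch):
    """Submit rays in chunks of batch_size and collect the results in order."""
    results = []
    for start in range(0, len(rays), batch_size):
        batch = rays[start:start + batch_size]
        results.extend(trace_batch(batch))  # one fetch per batch
    return results


batch_sizes = []


def fake_gpu_trace(batch):
    # Hypothetical stand-in for the tracing server.
    batch_sizes.append(len(batch))
    return [ray * 2 for ray in batch]


traced = trace_in_batches(list(range(10)), 4, fake_gpu_trace)
```

Ten rays with a batch size of four result in three fetches, the last one partially filled. This avoids the need for many waiting threads, but it only works if the framework can hand over rays without waiting for each individual result.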
Also if you have some better idea how to parallelize this, tell me about it.
If you need this information: The two platforms I want to use are Linux and Windows.
Use either Threading Building Blocks (TBB) or boost::thread.
As for thread pool versus on-demand threads: a thread pool is generally the better idea, as it avoids the overhead of creating threads.
The number of waiting threads you can have depends on the underlying system more than anything else:
Maximum number of threads per process in Linux?