btrfs-transacti kills my VMWare guests, because it makes I/O unresponsive

Most of the time when BTRFS maintenance starts with the btrfs-transacti process, my VMWare guest cannot access its disk and restarts. Is there anything I can do? I tried giving the guest machine high I/O priority with ionice, but that does not help.

OK, no one has answered, so my own answer may come in handy. The CPU usage of btrfs-transacti (and the cleaner process) can be limited with the cpulimit tool. It sits in the background and throttles btrfs-transacti if it takes too much CPU. In theory, that should leave more room for the I/O of the VMWare guests.

Related

Client-server: what is the limit on the run time of a server's processes?

I have a program that receives an input and returns an output, with a run time of about 2 seconds.
Now I want to run this program online on a server that can handle multiple connections (let's say up to 100k).
The program will launch on each client-server session.
The client will hand the server the program's input and wait for the program to end to receive the server's response (the program's output).
Let's say the server's host is a very powerful machine, e.g. 16 cores.
Can this work, or is it too much run time for each client?
What is the maximum run time this kind of program can have?

I'm posting this as an answer because it's too large to place as a comment.
Can this work? It depends. It depends because there are a lot of variables in this problem. Let's look at some of them:
you say it takes 2 seconds to compute a result. Where and how are those seconds spent? Is this pure computation or are you accessing a database, or the file system? Is this CPU bound or I/O bound? If you run computations for the full 2 seconds then you are consuming CPU, which means that you can simultaneously serve only 16 clients, one per core. Are you hitting a database? Is this on the powerful server or on some other machine? If the database is the bottleneck then move it to the powerful server and put SSD drives in it.
can you improve processing for one client? What's more efficient, doing the processing on one core or spread it across all the cores? If you can parallelize, can you limit thread contention?
is CPU all you need? How about memory? Any backend service you access? Are those over the network? Do you have enough bandwidth?
related to memory, what language/platform are you using? Does it have a garbage collector? Do you generate a lot of objects to compute a result? Does the GC kick in and pause your application while it cleans up and compacts the memory? Do you allocate enough memory for the application to run?
can you cache responses and serve them to other clients or are responses custom to each client? Can you precompute the results and then just serve them to clients or can't you predict the inputs?
Have you tried running some performance tests and profile the application to see where hotspots might show up? Can you do something about them?
have you any imposed performance criteria? How many clients do you want to support simultaneously? Is 2 seconds too much? Can clients live with more? How much more? How many seconds counts as an unacceptable response time?
do you need a big server to run this setup or smaller ones work better (i.e. scale horizontally instead of vertically)?
etc
Nobody can answer this for you. You have to do an analysis of your application, run some tests, profile it, optimize it, then repeat until you are satisfied with the results.
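To put rough numbers on the purely CPU-bound case from the first point, here is a back-of-the-envelope sketch in Python. It assumes every request keeps one core busy for the full 2 seconds and that nothing else (I/O, memory, network) is a limit; the function names are made up for illustration.

```python
def max_throughput(cores: int, seconds_per_request: float) -> float:
    """Requests per second if each request keeps one core busy the whole time."""
    return cores / seconds_per_request


def time_to_drain(clients: int, cores: int, seconds_per_request: float) -> float:
    """Seconds until the last of `clients` simultaneous requests completes."""
    # Requests are processed in waves of `cores` at a time.
    waves = -(-clients // cores)  # ceiling division
    return waves * seconds_per_request


if __name__ == "__main__":
    print(max_throughput(16, 2.0))          # 8.0 requests per second
    print(time_to_drain(100_000, 16, 2.0))  # 12500.0 seconds, about 3.5 hours
```

Under these assumptions, 100k clients arriving at once would leave the last one waiting for hours, which is why the questions about where the 2 seconds go, caching, and horizontal scaling matter so much.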

QProcess ProcessState sufficient for Blocked Processes?

I want to know if a process (started with the QProcess class) doesn't respond anymore. For instance, my process is an application that only prints 1 every second.
My problem is that I want to know if (for some mystical reason), that process is blocked for a short period of time (more than 1 second, something noticeable by a human).
However, the different states of a QProcess (Not Running, Starting, Running) don't include a "Blocked" state.
By blocked I mean "not answering the OS", as when the Task Manager shows the "Not Responding" message, or when a Windows MMI (like explorer.exe) is blocked and its window turns white.
But I want to detect that "Not Responding" state for ANY process, not just MMIs.
Is there a way to detect such a state?

Qt doesn't provide any API for that. You'd need to use platform-specific mechanisms. On some platforms (Windows!), there is no notion of a hung application, merely that of a hung window. You can have one application that has both responsive and unresponsive windows :)
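On Windows, you'd enumerate all windows using EnumWindows, check whether they belong to your process by comparing the pid from GetWindowThreadProcessId to process->processId() (in Qt versions before 5.3, where processId() is unavailable, use process->pid()->dwProcessId, since pid() returns a PROCESS_INFORMATION pointer on Windows), and finally check whether the window is hung using IsHungAppWindow.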
Caveats
Generally, there is no such thing as an all-encompassing notion of a "non responding" process.
Suppose you have a web server. What does it mean that it's not responding? It's under heavy load, so it may deny some incoming connections. Is that "non responding" from your perspective? It may be, but there's nothing you can do about it - killing and restarting the process won't fix it. If anything, it will make things worse for the already connected clients.
Suppose you have a process that is blocking on a filesystem read because the particular drive it tries to access is slow, or under heavy load. Does it mean that it's not responding? Will killing and restarting it always fix this? If the process then retries the read from the beginning of the file, it may well make things worse.
Suppose you have a poorly designed process with a GUI. It's doing blocking serial port reads in the GUI thread. The read it's doing takes a long time, and the GUI is nonresponsive for several seconds. You kill the process, it restarts and tries that long read again - you've only made things worse.
You have to tread very carefully here.
Solution Ideas
There are multiple approaches to determining what is a "responsive" process. It was already mentioned that processes with a GUI are monitored by the operating system on both Windows and OS X. Thus one can use native APIs that can query whether a window or a process is hung or not. This makes sense for applications that offer a UI, and subject to caveats above.
If the process is providing a service, you may periodically use the service to determine if it's still available, subject to some deadlines. Any decision as to what to do with a "hung" process should take into account the CPU and I/O load of the system.
It may be worthwhile to keep a history of the latency of the service's responses to requests. Only "large" changes to the latency should be taken as an indication of a problem. Suppose you're keeping track of the average latency. One could set an ultimate deadline at 50x the previous average latency. If this deadline is missed, the service is presumed dead and up for forced recycling. An "action flag" deadline may be set to 5-10x the average latency. A human would then be given the option of an orderly restart of the service. The flag would be automatically removed when latency backs down to, say, 30% below the deadline that triggered the flag.
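The latency-history idea can be sketched in Python roughly as follows. This is only an illustration under assumptions of mine: the class and method names are invented, the average is a plain running mean, the flag factor is fixed at 5x (the low end of the 5-10x range), and samples taken while the flag is raised are kept out of the average so that a slow spell does not drag the baseline up.

```python
class LatencyMonitor:
    """Classifies service response latencies against a running average."""

    def __init__(self, flag_factor=5.0, dead_factor=50.0, clear_fraction=0.7):
        self.flag_factor = flag_factor        # "action flag" deadline multiplier
        self.dead_factor = dead_factor        # ultimate deadline multiplier
        self.clear_fraction = clear_fraction  # 0.7 = 30% below the flag deadline
        self.count = 0
        self.average = 0.0
        self.flagged = False
        self.flag_deadline = None  # the deadline that raised the current flag

    def observe(self, latency):
        """Record one latency sample and return 'ok', 'flagged', or 'dead'."""
        if self.count == 0:
            self._update(latency)  # the first sample only seeds the average
            return "ok"
        if latency > self.dead_factor * self.average:
            return "dead"  # ultimate deadline missed: candidate for recycling
        if self.flagged:
            # Remove the flag once latency drops well below the deadline
            # that triggered it.
            if latency <= self.clear_fraction * self.flag_deadline:
                self.flagged = False
                self.flag_deadline = None
        elif latency > self.flag_factor * self.average:
            self.flagged = True
            self.flag_deadline = self.flag_factor * self.average
        if not self.flagged:
            self._update(latency)  # assumption: only healthy samples count
        return "flagged" if self.flagged else "ok"

    def _update(self, latency):
        self.count += 1
        self.average += (latency - self.average) / self.count
```

A caller would feed each measured response time to observe() and leave the decision about a "flagged" service to a human, reserving forced restarts for "dead".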
If you are the developer of the monitored process, then you can invert the monitoring aspect and become a passive watchdog of the monitored process. The monitored process must then periodically, actively "wake" the watchdog to indicate that it's alive. The emission of the wake signal (in generic terms) should be performed in strategic location(s) in the code. Periodic reception of wake "signals" should allow you to reason that the process is still alive. You may have multiple wake signals, tagged with the location in the watched process. Everything depends on how many threads the process has, what it is doing, etc.
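A minimal sketch of the passive watchdog with tagged wake signals, in Python. The names are hypothetical, and for brevity everything lives in one process; in a real setup the wake signals would travel over a pipe, socket, or similar channel between the watched process and the watchdog.

```python
import time


class Watchdog:
    """Passive watchdog: monitored code calls wake(); the monitor calls stale()."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout  # seconds of silence tolerated per tag
        self.clock = clock      # injectable so the logic can be tested
        self.last_wake = {}     # tag -> time of the most recent wake signal

    def wake(self, tag):
        """Called from a strategic location in the watched code, named by tag."""
        self.last_wake[tag] = self.clock()

    def stale(self):
        """Return the sorted tags whose last wake signal is older than the timeout."""
        now = self.clock()
        return sorted(tag for tag, t in self.last_wake.items()
                      if now - t > self.timeout)
```

Tagging each signal with its location (for example one tag per thread or per processing stage) tells you not only that something stopped, but which part of the process went quiet.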

How to figure out why UDP is only accepting packets at a relatively slow rate?

I'm using Interix on Windows XP to port my C++ Linux application to Windows XP more readily. My application sends and receives packets over a socket to and from a nearby machine running Linux. When sending, I'm only getting throughput of around 180 KB/sec and when receiving I'm getting around 525 KB/sec. The same code running on Linux gets closer to 2,500 KB/sec.
When I attempt to send at a higher rate than 180 KB/sec, packets get dropped to bring the rate back down to about that level.
I feel like I should be able to get better throughput on sending than 180 KB/sec but am not sure how to go about determining what is the cause of the dropped packets.
How might I go about investigating this slowness in the hopes of improving throughput?
--Some More History--
To reach the above numbers, I have already improved the throughput a bit by doing the following (none of which made a difference on Linux, but all of which helped throughput on Interix):
I changed SO_RCVBUF and SO_SNDBUF from 256KB to 25MB; this improved throughput by about 20%.
I ran an optimized build instead of a debug build; this improved throughput by about 15%.
I turned off all logging messages going to stdout and a log file; this doubled throughput.
So it would seem that CPU is a limiting factor on Interix, but not on Linux. Further, I am running on a Virtual Machine hosted in a hypervisor. The Windows XP is given 2 cores and 2 GB of memory.
I notice that the profiler shows the CPU on the two cores never exceeding 50% utilization on average. This happens even when I have two instances of my application running: utilization still hovers around 50% on both cores. My application is multi-threaded, with one dedicated thread reading from the UDP socket and another writing to it (only one is active at any given time). Perhaps it is not being scheduled well on Interix, and that is why my packets are dropping?

In answering your question, I am making the following assumptions based on your description of the problem:
(1) You are using the exact same program in Linux when achieving the throughput of 2,500 KB/sec, other than the socket library, which is of course, going to be different between Windows and Linux. If this assumption is correct, we probably shouldn't have to worry about other pieces of your code affecting the throughput.
(2) When using Linux to achieve 2,500 KB/sec throughput, the node is in the exact same location in the network. If this assumption is correct, we don't have to worry about network issues affecting your throughput.
Given these two assumptions, I would say that you likely have a problem in your socket settings on the Windows side. I would suggest checking the size of the send-buffer first. The size of the send-buffer is 8192 bytes by default. If you increase this, you should see an increase in throughput. Use setsockopt() to change this. Here is the usage manual: http://msdn.microsoft.com/en-us/library/windows/desktop/ms740476(v=vs.85).aspx
EDIT: It looks like I misread your post going through it too quickly the first time. I just noticed you're using Interix, which means you're probably not using a different socket library. Nevertheless, I suggest checking the send buffer size first.
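For illustration, here is how the buffer check looks in Python (the question's application is C++, but the setsockopt()/getsockopt() calls and option names map one to one). It is a sketch, not a diagnosis: the helper name is made up, and the point it shows is that the size you request is not necessarily the size you get, so it is worth reading the value back.

```python
import socket


def set_send_buffer(sock, requested_bytes):
    """Ask the OS for a send buffer size and return the size actually in effect."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested_bytes)
    # The OS may clamp, round, or (on Linux) double the requested value,
    # so read the option back instead of trusting the request.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        before = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        after = set_send_buffer(s, 256 * 1024)
        print("send buffer:", before, "->", after)
```

If the value read back is much smaller than what was requested, the platform is capping the buffer and a very large request is not having the effect it appears to.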

Can I set a single thread's priority above 15 for a normal priority process?

I have a data acquisition application running on Windows 7, using VC2010 in C++. One thread is a heartbeat which sends out a change every .2 seconds to keep-alive some hardware which has a timeout of about .9 seconds. Typically the heartbeat call takes 10-20ms and the thread spends the rest of the time sleeping.
Occasionally however there will be a delay of 1-2 seconds and the hardware will shut down momentarily. The heartbeat thread is running at THREAD_PRIORITY_TIME_CRITICAL which is 15 for a normal priority process. My other threads are running at normal priority, although I use a DLL to control some other hardware and have noticed with Process Explorer that it starts several threads running at level 15.
I can't track down the source of the slowdown, but other threads in my application are seeing the same kind of delays when this happens. I have made several optimizations to the heartbeat code even though it is quite simple, but the occasional failures are still happening. Now I wonder if I can increase the priority of this thread beyond 15 without specifying REALTIME_PRIORITY_CLASS for the entire process. If not, are there any downsides I should be aware of to using REALTIME_PRIORITY_CLASS? (Other than this heartbeat thread, the rest of the application doesn't have real-time timing needs.)
(Or does anyone have any ideas about how to track down these slowdowns...not sure if the source could be in my app or somewhere else on the system).
Update: So I hadn't actually tried passing 31 into my AfxBeginThread call and turns out it ignores that value and sets the thread to normal priority instead of the 15 that I get with THREAD_PRIORITY_TIME_CRITICAL.
Update: Turns out running the Disk Defragmenter is a good way to cause lots of thread delays. Even running the process at REALTIME_PRIORITY_CLASS and the heartbeat thread at THREAD_PRIORITY_TIME_CRITICAL (level 31) doesn't seem to help. Next thing to try is calling AvSetMmThreadCharacteristics("Pro Audio")
Update: Scheduling heartbeat thread as "Pro Audio" does work to increase the thread's priority beyond 15 (Base=1, Dynamic=24) but it doesn't seem to make any real difference when defrag is running. I've been able to correlate many of the slowdowns with the disk defragmenter so turned off the weekly scan. Still can't explain some delays so we're going to increase to a 5-10 second watchdog timeout.

Even if you could, increasing the priority will not help. The highest priority runnable thread gets the processor at all times.
Most likely there is some extended interrupt processing occurring while interrupts are disabled. Interrupts effectively work at a higher priority than any thread.
It could be video, network, disk, serial, USB, etc. It will take some insight to selectively disable drivers, or swap in alternate ones, and see whether the hesitation changes. Once you find the culprit, preventing it might range from trivial to impossible depending on what it is.
Without more knowledge about the system, it is hard to say. Have you tried running it on a different PC?

Officially you can't use REALTIME threads in a process which does not have the REALTIME_PRIORITY_CLASS.
Unofficially you could play with the undocumented NtSetInformationThread
see:
http://undocumented.ntinternals.net/UserMode/Undocumented%20Functions/NT%20Objects/Thread/NtSetInformationThread.html
But since I have not tried it, I don't have any more info about this.
On the other hand, as was said before, you can never be sure that the OS will not take its time when your thread's quantum expires. Poorly written drivers are often the cause of such latency.
There is also software which can tell you whether you have misbehaving kernel components:
http://www.thesycon.de/deu/latency_check.shtml

I would try using CreateWaitableTimer() & SetWaitableTimer() and see if they are subject to the same preemption problems.

Conceptual question on (a tool like) LoadRunner

I'm using LoadRunner to stress-test a J2EE application.
I have got: 1 MySQL DB server, and 1 JBoss App server. Each is a 16-core (1.8GHz) / 8GB RAM box.
Connection Pooling: The DB server is using max_connections = 100 in my.cnf. The App Server too is using min-pool-size and max-pool-size = 100 in mysql-ds.xml and mysql-ro-ds.xml.
I'm simulating a load of 100 virtual users from a 'regular', single-core PC. This is a 1.8GHz / 1GB RAM box.
The application is deployed and being used on a 100 Mbps ethernet LAN.
I'm using rendezvous points in sections of my stress-testing script to simulate real-world parallel (and not concurrent) use.
Question:
The CPU utilization on this load-generating PC never reaches 100% and memory too, I believe, is available. So, I could try adding more virtual users on this PC. But before I do that, I would like to know 1 or 2 fundamentals about concurrency/parallelism and hardware:
With only a single-core load generator such as this one, can I really simulate a parallel load of 100 users (with each user operating from a dedicated PC in real life)? My possibly incorrect understanding is that 100 threads on a single-core PC will run concurrently (interleaved, that is) but not in parallel... Which means I cannot really simulate a real-world load of 100 parallel users (on 100 PCs) from just one single-core PC! Is that correct?
Network bandwidth limitations on user parallelism: Even assuming I had a 100-core load-generating PC (or alternatively, let's say I had 100 single-core PCs sitting on my LAN), won't the way Ethernet works permit only concurrency, and not parallelism, of users on the Ethernet wire connecting the load-generating PC to the server? In fact, it seems this issue (the absence of user parallelism) will persist even in real-world application usage (with 1 PC per user), since the user requests reaching the app server on a multi-core box can only arrive interleaved. That is, the only time the multi-core server could process user requests in parallel would be if each user had her own dedicated physical layer connection to the server!!
Assuming parallelism is not achievable (due to the above 'issues') and only the next best thing, concurrency, is possible, how would I go about selecting the hardware and network specification to use for my simulation? For example: (a) How powerful should my load-generating PCs be? (b) How many virtual users should I create on each of these PCs? (c) Does each PC on the LAN have to be connected to the server via a switch (to avoid the broadcast traffic that would occur if a hub were used instead)?
Thanks in advance,
/HS

Not only are you using Ethernet; assuming you're writing web services, you're talking over HTTP(S), which sits atop TCP sockets, a reliable, ordered protocol with the built-in round trips inherent to reliable protocols. Sockets sit on top of IP, and if your IP packets don't line up with your Ethernet frames you'll never fully utilize your network. Even if you were using UDP, had shaped your datagrams to fit your Ethernet frames, and had 100 load generators and 100 1Gbit Ethernet cards on your server, they'd still be operating on interrupts and you'd have time multiplexing a little bit further down the stack.
Each level here can be thought of in terms of transactions, but it doesn't make sense to think at every level at once. If you're writing a SOAP application that operates at level 7 of the OSI model, then this is your domain. As far as you're concerned your transactions are SOAP HTTP(S) requests, they are parallel and take varying amounts of time to complete.
Now, to actually get around to answering your question: it depends on your test scripts, the amount of memory they use, even the speed your application responds. 200 or more virtual users should be okay, but finding your bottlenecks is a matter of scientific inquiry. Do the experiments, find them, widen them, repeat until you're happy. Gather system metrics from your load generators and system under test and compare with OS provider recommendations, look at the difference between a dying system and a working system, look for graphs that reach a plateau and so on.

It sounds to me like you're overthinking this a bit. Your servers are fast and new, and are more than suited to handle lots of clients. Your bottleneck (if you have one) is going to be either your application itself or your 100 Mbps network.
1./2. You're testing the server, not the client. In this case, all the client is doing is sending and receiving data - there's no overhead for client processing (rendering HTML, decoding images, executing JavaScript and whatever else it may be). A recent single-core machine can easily saturate a gigabit link; a 100 Mbit pipe should be cake.
Also - The processors in newer/fancier ethernet cards offload a lot of work from the CPU, so you shouldn't necessarily expect a CPU hit.
3. Don't use a hub. There's a reason you can buy a 100m hub for $5 on craigslist.

Without a better understanding of your application it's tough to answer some of this, but generally speaking you are correct that for a "true" stress test of your server it would be ideal to have 100 cores (for a target of 100 concurrent users), i.e. 100 PCs. Various issues, though, will probably show this as a no-brainer.
I have a communication engine I built a couple of years back (.NET / C#) that uses asynchronous sockets - we needed the fastest speeds possible, so we had to forget about adding any additional layers on top of the socket, like HTTP or any other higher abstraction. Running on a quad core 3.0GHz computer with 4GB of RAM, that server easily handles the traffic of ~2,200 concurrent connections. There's a Gb switch and all the PCs have Gb NICs. Even with all PCs communicating at the same time it's rare to see processor loads > 30% on that server. I assume this is because of all the latency that is inherent in the "total system."
We have a new requirement to support 50,000 concurrent users that I'm currently implementing. The server has dual quad core 2.8GHz processors, a 64-bit OS, and 12GB of RAM. Our modeling shows this computer is more than enough to handle the 50K users.
Issues like the network latency I mentioned (don't forget the CAT 3 vs. CAT 5 vs. CAT 6 issue), database connections, types of data being stored and mean record sizes, referential issues, backplane and bus speeds, hard drive speeds and sizes, etc. play as much a role as anything in slowing down a platform "in total." My guess would be that you could have 500, 750, 1,000, or even more users on your system.
The goal in the past was to never leave a thread blocked for too long ... the new goal is to keep all the cores busy.
I have another application that downloads and analyzes the content of ~7,800 URLs daily. Running on a dual quad core 3.0GHz machine (Windows 7 Ultimate 64-bit edition) with 24GB of RAM, that process used to take ~28 minutes to complete. By simply switching the loop to a Parallel.ForEach() the entire process now takes < 5 minutes. The processor load we've seen is always less than 20%, with a maximum network load of only 14% (CAT 5 on a Gb NIC through a standard Gb dumb hub and a T-1 line).
Keeping all the cores busy makes a huge difference, especially in applications that spend a lot of time waiting on I/O.
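The same loop-to-parallel-loop switch can be sketched in Python, using a thread pool as a rough analogue of .NET's Parallel.ForEach() (it is not the original C# code). The fetch_and_analyze function is a stand-in I made up: the sleep simulates time spent waiting on the network, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_and_analyze(url):
    """Stand-in for a download plus analysis; the sleep simulates waiting on I/O."""
    time.sleep(0.05)
    return url, len(url)


def run_sequential(urls):
    return [fetch_and_analyze(u) for u in urls]


def run_parallel(urls, workers=8):
    # Threads overlap the waiting; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_and_analyze, urls))


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(16)]
    for runner in (run_sequential, run_parallel):
        start = time.perf_counter()
        runner(urls)
        print(runner.__name__, round(time.perf_counter() - start, 2), "seconds")
```

Because the work is dominated by waiting rather than computing, the parallel version finishes in a fraction of the sequential time while CPU load stays low, which matches the low processor figures quoted above.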

As you are representing users, disregard the rendezvous unless you have either an engineering requirement to maintain simultaneous behavior or your agents are processes and not human users and these agents are governed by a clock tick. Humans are chaotic computing units with variant arrival and departure windows based upon how quickly one can or cannot read, type, converse with friends, etc. A great book on the subject of population behavior is "Chaos" by James Gleick.
The odds of your 100 decoupled users being highly synchronous in their behavior on an instant basis in observable conditions is zero. The odds of concurrent activity within a defined time window however, such as 100 users logging in within 10 minutes after 9:00am on a business morning, can be quite high.
As a side note, a resume with rendezvous emphasized on it is the #1 marker for a person with poor tool understanding and a poor performance test process. This comes from a folio of over 1500 interviews conducted over the past 15 years (I started as a Mercury employee on April 1, 1996).
James Pulley
Moderator
-SQAForums WinRunner, LoadRunner
-YahooGroups LoadRunner, Advanced-LoadRunner
-GoogleGroups lr-LoadRunner
-Linkedin LoadRunner (owner), LoadrunnerByTheHour (owner)
Mercury Alum (1996-2000)
CTO, Newcoe Performance Engineering