ColdFusion Template Request count optimization - coldfusion

In ColdFusion, under Request Tuning in the administrator, how do I determine what is an optimal number (or at least a good guess) for the Maximum Number of Simultaneous Template Requests?
Environment:
CF8 Standard
IIS 6
Win2k3
SQL2k5 on a separate box

The way of finding the right number of requests is load testing. That is, measuring changes in throughput under load when you vary the request number. Any significant change would require retesting. But I suspect most folks are going to baulk at that amount of work.
I think a good rule of thumb is about 8 threads per CPU (core).
In terms of efficiency, the lower the thread count (up to a point), the less context switching will be going on as the CPU processes your requests. If your pages execute very quickly then a lower number of requests is optimal.
If you have longer running requests, and especially if you have requests that are waiting on third-parties (like a database) then increasing the number of working threads will improve your throughput. That is, if your CPU is not tied up processing stuff you can afford to have more simultaneous requests working on the tasks at hand.
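One common back-of-the-envelope way to turn that reasoning into a number is to scale the core count by the ratio of wait time to compute time. This is a general sizing heuristic, not something taken from the ColdFusion documentation, and the function name and sample timings below are made up for illustration. Treat the result as a starting value to load test around, nothing more.

```python
import math

def suggested_threads(cores, wait_ms, compute_ms):
    """Starting guess for simultaneous requests: cores * (1 + wait / compute).

    wait_ms    -- average time a request spends blocked (database, web service)
    compute_ms -- average time a request actually spends on the CPU
    Both are assumed to come from your own measurements.
    """
    if cores < 1 or compute_ms <= 0 or wait_ms < 0:
        raise ValueError("need cores >= 1, compute_ms > 0, wait_ms >= 0")
    return math.ceil(cores * (1 + wait_ms / compute_ms))

# Pages that never wait: one thread per core is enough.
print(suggested_threads(cores=2, wait_ms=0, compute_ms=20))    # 2
# Pages that wait on the database three times longer than they compute.
print(suggested_threads(cores=2, wait_ms=60, compute_ms=20))   # 8
```

The more of each request is spent waiting, the higher the suggested count goes, which matches the point about third-party waits above.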
Although it's a little dated, many of the principles on request tuning in Grant Straker's book on CF Performance & Troubleshooting are still worth reading.

I would say at least 8 per core, not per CPU. And I think 8 is a little low given modern CPU cores; I would say at least 12.

Related

In the worst case, how much can QPI latency slow down an arbitrary application?

I'm developing a low-latency HFT trading application.
I'm using a single-CPU machine because it's much easier to configure and maintain (no need to tune NUMA). Also, assuming we have enough resources, it should definitely not be slower than a dual-CPU setup, and it will likely be a little faster because there is no QPI/NUMA latency.
HFT requires a lot of resources, and now I realize I want many more cores. Also, colocating two 1U single-CPU machines is much more expensive than colocating one 1U dual-CPU machine, so even assuming I can "split" my program in two, it still makes sense to use a 1U dual-CPU machine.
So how much should I fear QPI/NUMA latency? If I move my application from a single-CPU machine to a dual-CPU machine, how much slower can it be? At most I can afford a delay of several microseconds, but not more. Can QPI/NUMA introduce a significant delay if not tuned correctly, and how significant would this delay be?
Is it possible to write an application that runs much slower (more than several microseconds slower) on a dual-CPU setup than on a single-CPU setup? I.e. one that runs much slower on a faster computer? (Of course assuming we have the same processors, memory, network card and everything else.)
This is not trivially answerable, since it depends on so many factors. Is the code written for NUMA?
Is the code doing mostly reads, mostly writes or about equal? How much data is shared between threads that run on separate CPUs? How often is such data written to, forcing cache-refresh?
How do tasks get scheduled? How and when does the OS decide to move threads from one CPU socket to the next?
Does the code and data fit in cache?
Those are just a few factors that will change the results dramatically between a "works really well" and "gives really poor performance".
As with EVERYTHING performance-related, details can make a huge difference, and reading answers like this one on the internet will not give you a reliable answer that applies to YOUR situation. Benchmark your application, check performance counters and tweak based on that. [Given the price for a machine of the specs you describe in comments above, I'd expect the supplier would allow some sort of test, demo, "try before you buy", etc.]
Assuming you have a worst-case scenario, a memory access will straddle two cache lines (an unaligned access of an 8-byte value, for example) that are split between your worst-placed CPUs, and the MMU needs reloading, with each of those page-table entries also sitting on the worst possible CPU. Since the memory for that pair of locations is in different places, new TLB entries are needed for each of the two 4-byte reads that load your 64-bit value. (Each TLB entry is a separate location.)
This means 2 x 4 x n, where n is something like 50-100 ns, so 400-800 ns by that formula. Allowing some margin on top, one memory access could, at least in theory, take 1600 ns. So 1.6 microseconds. It's unlikely that you will get MUCH worse than this for a single operation. The overhead is a lot less than, for example, swapping to disk, which can add milliseconds to your execution time.
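The 2 x 4 x n estimate is easy to evaluate for the quoted range of n. A minimal sketch of that arithmetic in Python follows. The function and parameter names are made up for illustration, and the model is just the multiplication described above, not a measurement.

```python
def worst_case_access_ns(cache_lines, lookups_per_line, remote_latency_ns):
    """Every lookup for every cache line pays the remote latency once."""
    return cache_lines * lookups_per_line * remote_latency_ns

for n in (50, 100):
    total = worst_case_access_ns(2, 4, n)
    print(f"n = {n} ns -> {total} ns ({total / 1000} us)")
```

Both ends of the range stay under a microsecond, inside the 1.6 microsecond ceiling quoted above.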
It is not very hard to write code that updates the same cache-line on multiple CPUs and thus causes a dramatic reduction in performance. I remember, a long time back when I first had an Athlon SMP system, running a simple benchmark where the author did this for a Dhrystone benchmark:
int numberOfRuns[MAX_CPUS];
Now, numberOfRuns is the outer loop-counter, and updating that for each loop, on either CPU, would cause "false sharing" (so each time the counter was updated, the other CPU had to flush that cache-line).
Running this on a 2-core SMP system gave 30% of the single-CPU performance. So 3 times SLOWER than the one CPU, rather than faster as you'd expect. (This was some 12 or so years ago, so memory may be a little "off" on the exact details, but the essence of this story is still true: a badly written application can run slower on multiple cores compared to a single core.)
I'd expect performance at least that bad on a modern system where you have false sharing of commonly used variables.
In comparison, well-written code should run near N times faster, if there is little or no sharing between CPU cores. I have a highly CPU-bound, multithreaded, calculator for weird numbers, which gives near n-times performance gain both on my single-socket system at home and my two-socket system at work.
$ time ./weird -t 1 -e 100000
real 0m22.641s
user 0m22.660s
sys 0m0.003s
$ time ./weird -t 6 -e 100000
real 0m5.096s
user 0m25.333s
sys 0m0.005s
So about 12% overhead in CPU time (user time goes from 22.66 s to 25.33 s). That is sharing one variable [current number] which is atomically updated between threads (using C++ standard atomics). Unfortunately, I don't have a good example of "badly written code" to contrast this against.
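The figures from the two `time` runs can be reduced to a wall-clock speedup, a parallel efficiency and a CPU-time overhead. A small Python sketch of that calculation, using only the numbers printed above (the function name is made up for illustration):

```python
def scaling_summary(real_1, user_1, real_n, user_n, threads):
    speedup = real_1 / real_n            # wall-clock gain over the 1-thread run
    efficiency = speedup / threads       # fraction of the ideal n-times gain
    cpu_overhead = user_n / user_1 - 1   # extra CPU time burned by the threaded run
    return speedup, efficiency, cpu_overhead

speedup, efficiency, overhead = scaling_summary(
    real_1=22.641, user_1=22.660, real_n=5.096, user_n=25.333, threads=6)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}, CPU overhead {overhead:.1%}")
```

The overhead here is measured on user (CPU) time, while the speedup is measured on real (wall-clock) time, so the two do not have to add up to the thread count.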

how to design threading for many short tasks

I want to use multi-threads to accelerate my program, but not sure which way is optimal.
Say we have 10000 small tasks, each of which takes maybe only 0.1 s to finish. Now I have a CPU with 12 cores and I want to use 12 threads to make it faster.
So far as I know, there are two ways:
1. Task pool
There are always 12 threads running, and each of them gets a new task from the task pool after it has finished its current work.
2. Separate tasks
Split the 10000 tasks into 12 parts and have each thread work on one part.
The problem is, if I use tasks pool it is a waste of time for lock/unlock when multiple threads try to access the tasks pool. But the 2nd way is not ideal because some of the threads finish early, the total time depends on the slowest thread.
I am wondering how you deal with this kind of work, and whether there is a better way to do it. Thank you.
EDIT: Please note that the number 10000 is just an example; in practice it may be 1e8 or more tasks, and 0.1 s per task is also an average time.
EDIT2: Thanks for all your answers :] It is good to know the different options.
So one midway between the two approaches is to break the work into, say, 100 batches of 100 tasks each and let a core pick a batch of 100 tasks at a time from the task pool.
Perhaps if you model the randomness in execution time in a single core for a single task, and get an estimate of mutex locking time, you might be able to find an optimal batch size.
But without too much work we at least have the following lemma :
The slowest thread can only take at max 100*.1 = 10s more than others.
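A rough Python sketch of that midway approach: the pool hands out whole batches, so the queue lock is taken once per batch rather than once per task. The function name, the default batch size and the squaring workload are all made up for illustration. Note also that in CPython the GIL prevents pure-Python tasks from actually running in parallel, so this shows the structure, not the speedup.

```python
import queue
import threading

def run_batched(tasks, worker_fn, num_threads=12, batch_size=100):
    """Process tasks with a shared pool that hands out whole batches."""
    batches = queue.Queue()
    for start in range(0, len(tasks), batch_size):
        batches.put((start, tasks[start:start + batch_size]))

    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                # One locked queue operation per batch, not per task.
                start, batch = batches.get_nowait()
            except queue.Empty:
                return
            for offset, task in enumerate(batch):
                # Each index is written by exactly one thread.
                results[start + offset] = worker_fn(task)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

squares = run_batched(list(range(10_000)), lambda x: x * x)
print(squares[:5], len(squares))
```

With this layout the last thread to finish is at most one batch behind the others, which is where the 100 * 0.1 s bound comes from.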
Task pool is always the best solution here. It's not just optimum time, it's also comprehensibility of code. You should never force your tasks to conform to the completely unrelated criteria of having the same number of subtasks as cores - your tasks have nothing to do with that (in general), and such a separation doesn't scale when you change machines, etc. It requires overhead to collaborate on combining results in subtasks for the final task, and just generally makes an easy task hard.
But you should not be worrying about the use of locks for task pools. There are lockfree queues available if you ever determine that they are necessary. But determine that first. If time is your concern, use the appropriate methods of speeding up your task, and put your effort where you will get the most benefit. Profile your code. Why do your tasks take 0.1 s? Do they use an inefficient algorithm? Can loop unrolling help? If you find the hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find everything is running as fast as possible, and you want that extra second from removing locks, search the internet with your favorite search engine for "lockfree queue" and "waitfree queue". Compare and swap makes atomic lists easy.
Both ways suggested in the question will perform well and similarly to each other (in simple cases with predictable and relatively long task durations). If the target system type is known and available (and if performance is really a top concern), the approach should be chosen based on prototyping and measurements.
Do not assume that the optimal number of threads matches the number of cores. If this is a regular server or desktop system, there will be various system processes kicking in now and then, and you may see your 12 threads floating between processors, which hurts memory caching.
There are also crucial non-measurement factors you should check: do those small tasks require any resources to execute? Do these resources impose additional potential delays (blocking) or competition? Are there additional apps competing for the CPU power? Will the application need to grow to accommodate different execution environments, task types, or user interaction models?
If the answer to all is negative, here are some additional approaches that you can measure and consider.
Use only 10 or 11 threads. You will observe a small slowdown, or even
a small speedup (the additional core will serve OS processes, so that
thread affinity of the rest will become more stable compared to 12
threads). Any concurrent interactive activity on the system will see
a big boost in responsiveness.
Create exactly 12 threads but explicitly set a different processor
affinity mask for each, to impose a 1-1 mapping between threads and processors.
This is good in the simplest, near-academic case
where there are no resources other than CPU and shared memory
involved; you will see no chronic migration of threads across
processors. The drawback is an
algorithm closely coupled to a particular machine; on another machine
it could behave so poorly as to never finish at all (because of an
unrelated real-time task that
blocks one of your threads forever).
Create 12 threads and split the tasks evenly. Have each thread
downgrade its own priority once it is past 40% and again once it is
past 80% of its load. This will improve load balancing inside your
process, but it will behave poorly if your application is competing
with other CPU-bound processes.
100ms/task - pile 'em on as they are - pool overhead will be insignificant.
OTOH..
1E8 tasks x 0.1s/task = 10,000,000 seconds
= 2777.8 hours
= 115.7 days
That's much more than the interval between patch Tuesday reboots.
Even if you run this on Linux, you should batch up the output and flush it to disk in such a manner that the job is restartable.
Is there a database involved? If so, you should have told us!
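A minimal Python sketch of what "restartable" could look like: results are appended one line per task, each batch is flushed to disk before the next one starts, and a restart skips however many lines are already in the file. The file format, function name and batch size are assumptions made for illustration, and the sketch does not handle a half-written last line after a crash.

```python
import json
import os
import tempfile

def run_restartable(tasks, worker_fn, out_path, batch_size=1000):
    """Append results batch by batch; on restart, skip what is already on disk."""
    done = 0
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = sum(1 for _ in f)          # one line per finished task
    with open(out_path, "a") as f:
        for start in range(done, len(tasks), batch_size):
            batch = tasks[start:start + batch_size]
            f.writelines(json.dumps(worker_fn(t)) + "\n" for t in batch)
            f.flush()
            os.fsync(f.fileno())              # make the batch durable before moving on
    return done                               # number of tasks skipped on this run

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.jsonl")
    run_restartable(list(range(2500)), lambda x: x * x, path)        # stands in for an interrupted run
    skipped = run_restartable(list(range(10_000)), lambda x: x * x, path)
    print(skipped)                                                    # 2500
```

A larger batch size means fewer fsync calls but more work to redo after an interruption.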
Each worker thread may have its own small task queue with a capacity of no more than one or two memory pages. When the queue runs low (half of capacity), it should signal a manager thread to populate it with more tasks. If the queue is organized in batches, then worker threads do not need to enter critical sections as long as the current batch is not empty. Avoiding critical sections gives you extra cycles for the actual job. Two batches per queue are enough, and in this case one batch can take one memory page, so the queue takes two.
The point of memory pages is that a thread does not have to jump all over memory to fetch data. If all the data is in one place (one memory page), you avoid cache misses.

How can I measure how my multithreaded code scales (speedup)?

What would be the best way to measure the speedup of my program assuming I only have 4 cores? Obviously I could measure it up to 4, however it would be nice to know for 8, 16, and so on.
Ideally I'd like to know the amount of speedup per number of thread, similar to this graph:
Is there any way I can do this? Perhaps a method of simulating multiple cores?
I'm sorry, but in my opinion, the only reliable measurement is to actually get an 8, 16 or more cores machine and test on that.
Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.
You could try to predict what will happen, but there are a lot of factors that need to be taken into account:
caches - size, number of layers, shared / non-shared
memory bandwidth
number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine
interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
memory access - again, a lower number of cores works well with the SMP (symmetric multiprocessing) model, while a higher number of cores needs a NUMA (non-uniform memory access) model.
I don't think there is a real way to do this either, but one thing that comes to mind is that you could use a virtual machine to simulate more cores. In VirtualBox, for example, you can select up to 16 cores from the standard menu, but I am fairly confident that there are some hacks which can push that higher, and other virtual machines like VMware might support even more out of the box.
bamboon and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.
Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (n = 10k/thread) and varying the number of threads available for computation.
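Both definitions reduce to simple ratios once you have timings per thread count. A small Python sketch follows. The timing values are made up for illustration and the function names are hypothetical.

```python
def strong_scaling(times):
    """times: {threads: seconds} for a FIXED total problem size.

    Returns {threads: (speedup, efficiency)} relative to the 1-thread run.
    """
    t1 = times[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in sorted(times.items())}

def weak_scaling(times):
    """times: {threads: seconds} for a FIXED problem size per thread.

    Ideal weak scaling keeps the time flat, i.e. efficiency 1.0.
    """
    t1 = times[1]
    return {p: t1 / t for p, t in sorted(times.items())}

# Made-up timings for illustration only.
strong = strong_scaling({1: 40.0, 2: 21.0, 4: 12.5})
weak = weak_scaling({1: 10.0, 2: 10.4, 4: 11.1})
for p, (speedup, eff) in strong.items():
    print(f"strong  p={p}: speedup {speedup:.2f}, efficiency {eff:.2f}")
for p, eff in weak.items():
    print(f"weak    p={p}: efficiency {eff:.2f}")
```

Plotting either efficiency against the thread count gives the kind of curve the question asks about, but only for thread counts you can actually run.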
It's true there are a lot of variables at work in any program; however, if you have some basic input size n, it's possible to get some semblance of scaling. On an n-body simulator I developed a few years back, I varied the threads for a fixed size and the input size per thread, and was able to calculate a rough measure of how well the multithreaded code scaled.
Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to heavily threaded loads. But this may not be an issue if your application is only used on machines with small core counts.
You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.
Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with increasing threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation though, this won't be the case.
I don't believe this is possible, since there are too many variables to be able to accurately extrapolate performance, even assuming you are 100% parallel. There are other factors like bus speed and cache misses that might limit your performance, not to mention peripheral performance. How all of these factors affect your code can only be determined through measuring on your specific hardware platform.
I take it you are asking about measurement, so I won't address the issue of predicting the effect on higher numbers of cores.
This question can be viewed another way: how busy can you keep each thread, and what do they total up to? So for six threads, running at say 50% utilization each, means you have 3 equivalent processors running. Dividing that by say four processors, means that your methods are achieving 75% utilization. Comparing that utilization, against the clock-time of actual speedup, tells you how much of your utilization is new overhead, and how much is real speed up. Isn't that what you are really interested in?
The processor utilization can be computed in real time a couple of different ways. Threads can independently ask the system for their thread times, compute ratios and maintain global totals. If you have total control over your blocking states, you don't even need the system calls, because you can just keep track of the ratio of blocking to nonblocking machine cycles to compute utilization. A real-time multithreading instrumentation package I developed uses such methods and they work well. The CPU clock counter in newer CPUs can be read in under 20 machine cycles.
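The utilization arithmetic from the six-thread example can be written out in a few lines of Python. The function names are made up for illustration, and the 2.4x measured speedup is an invented number used only to show the comparison.

```python
def equivalent_processors(busy_fractions):
    """Sum of per-thread busy fractions, e.g. six threads at 50% -> 3.0."""
    return sum(busy_fractions)

def utilization(busy_fractions, processors):
    """Share of the machine that the threads keep busy."""
    return equivalent_processors(busy_fractions) / processors

def useful_share(measured_speedup, busy_fractions):
    """Part of the busy time that shows up as real speedup; the rest is overhead."""
    return measured_speedup / equivalent_processors(busy_fractions)

threads = [0.5] * 6
print(equivalent_processors(threads))          # 3.0
print(utilization(threads, processors=4))      # 0.75
# A measured wall-clock speedup of 2.4x is a made-up number for illustration.
print(f"{useful_share(2.4, threads):.0%} useful, "
      f"{1 - useful_share(2.4, threads):.0%} overhead")
```

In practice the busy fractions would come from per-thread CPU times divided by elapsed time, collected however your platform exposes them.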

What factors should decide the number of threads on my C++ webserver?

I understand that generally the number of worker threads should be equal to the number of CPUs on your server (in my case it is 8), unless the thread is doing some kind of I/O etc.
My webserver provides services that require an extensive search in my MySQL database; some queries take about 16 seconds in the worst case. So two questions here:
1. How do I decide on the number of threads that will be optimal?
2. How can I simulate thousands of users and test my server against thousands of requests?
That depends on your definition of "threads".
If your threads are "blocking"... as in they only handle one client at a time -- 8 threads would be a terrible choice. If your worker threads all do non-blocking I/O, then yes, matching the threads to the cpu count would be a good choice.
The other thing to consider is whether your DB calls are non-blocking. I am not sure what language you are doing your scripting in (or are you doing that in C++ too?), but the C API for MySQL, for example, does blocking queries.
If you are looking for massive performance, I would take a look at NGiNX and G-WAN -- they are the leaders in this area.
http://nginx.org/
http://www.trustleap.com/
Also, for load testing: http://httpd.apache.org/docs/2.0/programs/ab.html or http://www.hpl.hp.com/research/linux/httperf/
As with any performance measure, there is really no substitute for performance testing. You could consider theoretical measures like n * number of CPUs where n is some small constant, but there's really no substitute for empirical validation.
A couple of additional notes:
Because of the cost of CPU Cache synchronization between Cores, it makes sense to limit your HTTP worker threads to the physical CPU Cores (ignoring hyper-threading).
Further, weighttp (a multi-threaded client made by Lighty's team) is more relevant than the (single-threaded) ApacheBench to test multicore servers:
One single thread for the client cannot saturate several threads on
the server (when the server is fast).
You can compare the results given by AB to weighttp's with this public-domain ab.c wrapper which has been used to test more than 30 Web servers, cache servers and application servers.
As a bonus, ab.c also collects RAM and CPU resource usage in addition to requests/second.

BerkeleyDB Concurrency

What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support?
How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention?
I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency.
My application is pretty simple, I'll be doing gets and puts of records that are about 1KB each. No cursors, no deleting.
It depends on what kind of application you are building. Create a representative test scenario, and start hammering away. Then you will know the definitive answer.
Besides your use case, it also depends on CPU, memory, front-side bus, operating system, cache settings, etcetera.
Seriously, just test your own scenario.
If you need some numbers (that actually may mean nothing in your scenario):
Oracle Berkeley DB: Performance Metrics and Benchmarks
Performance Metrics & Benchmarks: Berkeley DB
I strongly agree with Daan's point: create a test program, and make sure the way in which it accesses data mimics as closely as possible the patterns you expect your application to have. This is extremely important with BDB because different access patterns yield very different throughput.
Other than that, these are general factors I found to be of major impact on throughput:
Access method (which in your case I guess is BTREE).
Level of persistence with which you configured BDB (for example, in my case the 'DB_TXN_WRITE_NOSYNC' environment flag improved write performance by an order of magnitude, but it compromises persistence).
Does the working set fit in cache?
Number of Reads Vs. Writes.
How spread out your access is (remember that BTREE has a page level locking - so accessing different pages with different threads is a big advantage).
Access pattern - meaning how likely threads are to lock one another, or even deadlock, and what your deadlock resolution policy is (this one may be a killer).
Hardware (disk & memory for cache).
This amounts to the following point:
Scaling a solution based on BDB so that it offers greater concurrency comes down to two key approaches: either minimize the number of locks in your design or add more hardware.
Doesn't this depend on the hardware as well as number of threads and stuff?
I would make a simple test and run it with an increasing number of threads hammering away, and see what seems best.
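Such a test can be very small. Here is a Python sketch that runs the same number of operations at several thread counts and reports operations per second. `fake_get` is a made-up stand-in that just sleeps for a millisecond; in a real test it would be replaced by an actual get or put against your database, with an access pattern that resembles your application.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def sweep(operation, thread_counts, ops_per_run=200):
    """Run the same number of operations at each thread count; return ops/second."""
    throughput = {}
    for n in thread_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            # list() forces all operations to complete before the clock stops.
            list(pool.map(lambda _: operation(), range(ops_per_run)))
        throughput[n] = ops_per_run / (time.perf_counter() - start)
    return throughput

def fake_get():
    time.sleep(0.001)   # hypothetical stand-in for one ~1 KB get or put

for n, rate in sweep(fake_get, [1, 2, 4, 8]).items():
    print(f"{n:2d} threads: {rate:8.0f} ops/s")
```

The thread count where the ops/s figure stops climbing (or starts falling) is the number to look at; the absolute values from the sleep-based stand-in mean nothing.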
What I did when working against a database of unknown performance was to measure turnaround time on my queries. I kept upping the thread count until turnaround time got worse, then dropping the thread count until turnaround time improved (well, it was processes in my environment, but whatever).
There were moving averages and all sorts of metrics involved, but the take-away lesson was: just adapt to how things are working at the moment. You never know when the DBAs will improve performance or hardware will be upgraded, or perhaps another process will come along to load down the system while you're running. So adapt.
Oh, and another thing: avoid process switches if you can - batch things up.
Oh, I should make this clear: this all happened at run time, not during development.
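For what it's worth, here is a rough Python sketch of that kind of run-time adaptation, written as a simple hill climb over a moving average. It is not the code from the system described above. The class name, the smoothing factor and the 5% noise tolerance are all assumptions chosen for illustration.

```python
class AdaptiveConcurrency:
    """Nudge the worker count up or down based on a moving average of turnaround time."""

    def __init__(self, start=4, low=1, high=64, smoothing=0.2, tolerance=0.05):
        self.workers = start
        self.low, self.high = low, high
        self.smoothing = smoothing      # weight of the newest sample in the moving average
        self.tolerance = tolerance      # relative change that is treated as noise
        self.average = None
        self.previous = None
        self.step = 1                   # +1 adds a worker next time, -1 removes one

    def record(self, turnaround_seconds):
        """Feed one measured turnaround time into the moving average."""
        if self.average is None:
            self.average = turnaround_seconds
        else:
            self.average += self.smoothing * (turnaround_seconds - self.average)

    def adjust(self):
        """Call periodically; returns the worker count to use next."""
        if self.average is None:
            return self.workers
        if self.previous is not None and self.average > self.previous * (1 + self.tolerance):
            self.step = -self.step      # turnaround got worse, so reverse direction
        self.previous = self.average
        candidate = self.workers + self.step
        if candidate < self.low or candidate > self.high:
            self.step = -self.step      # bounce off the configured limits
            candidate = self.workers + self.step
        self.workers = max(self.low, min(self.high, candidate))
        return self.workers

controller = AdaptiveConcurrency(start=4)
for sample in (1.0, 1.0, 3.0, 1.4):     # made-up turnaround times in seconds
    controller.record(sample)
    print(controller.adjust())
```

Because it keeps probing in both directions, a controller like this follows changes such as a hardware upgrade or a competing process without being retuned.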
The way I understand things, Samba created tdb to allow "multiple concurrent writers" for any particular database file. So if your workload has multiple writers your performance may be bad (as in, the Samba project chose to write its own system, apparently because it wasn't happy with Berkeley DB's performance in this case).
On the other hand, if your workload has lots of readers, then the question is how well your operating system handles multiple readers.