Hi i have started to work on a project where i use parallel computing to separate job loads among multiple machines, such as hashing and other forms of mathematical calculations. Im using C++
it is running on a Master/slave or Server/Client model if you prefer where every client connects to the server and waits for a job. The server can than take a job and seperate it depending on the number of clients
1000 jobs -- > 3 clients
IE: client 1 --> calculate(0 to 333)
Client 2 --> calculate(334 to 666)
Client 3 --> calculate(667 to 999)
I wanted to further enhance the speed by creating multiple threads on every running client. But since every machine are not likely (almost 100%) not going to have the same hardware, i cannot arbitrarily decide on a number of threads to run on every client.
i would like to know if one of you guys knew a way to evaluate the load a thread has on the cpu and extrapolate the number of threads that can be run concurently on the machine.
there are ways i see of doing this.
I start threads one by one, evaluating the cpu load every time and stop when i reach a certain prefix ceiling of (50% - 75% etc) but this has the flaw that ill have to stop and re-separate the job every time i start a new thread.
(and this is the more complex)
run some kind of test thread and calculate its impact on the cpu base load and extrapolate the number of threads that can be run on the machine and than start threads and separate jobs accordingly.
any idea or pointer are welcome, thanks in advance !
I have a certain program that recieves input and returnes an output with a run-time of about 2 seconds,
Now, i want to run this program online on a server that can handle multiple connections (lets say up to 100k),
on each client-server session the program will launch,
the client will hand the server the program's input and will wait for the program to end to recieve the server's respond (program's output),
Lets say the server's host is a very powerful machine - e.g 16 cores,
Can this work or it is to much runtime for each client?
What is the maximum runtime this kind of program can have?
I'm posting this as an answer because it's too large to place as a comment.
Can this work? It depends. It depends because there are a lot of variables in this problem. Let's look at some of them:
you say it takes 2 seconds to compute a result. Where and how are those seconds spent? Is this pure computation or are you accessing a database, or the file system? Is this CPU bound or I/O bound? If you run computations for the full 2 seconds then you are consuming CPU which means that you can simultaneously serve only 16 clients, one per core. Are you hitting a database? Is this on the powerful server or on some other machine? If the database is the bottleneck than move this to the powerful server and have SSD drives on it.
can you improve processing for one client? What's more efficient, doing the processing on one core or spread it across all the cores? If you can parallelize, can you limit thread contention?
is CPU all you need? How about memory? Any backend service you access? Are those over the network? Do you have enough bandwidth?
related to memory, what language/platform are you using? Does it have a garbage collector? Do you generate a lot of object to compute a result? Does the GC kick in and pauses your application so it cleans up and compacts the memory? Do you allocate enough memory for the application to run?
can you cache responses and serve them to other clients or are responses custom to each client? Can you precompute the results and then just serve them to clients or can't you predict the inputs?
Have you tried running some performance tests and profile the application to see where hotspots might show up? Can you do something about them?
have you any imposed performance criteria? How many clients do you want to support simultaneously? Is 2 seconds too much? Can clients live with more? How much more? How many seconds does it mean an unacceptable response time?
do you need a big server to run this setup or smaller ones work better (i.e. scale horizontally instead of vertically)?
etc
Nobody can answer this for you. You have to do an analysis of your application, run some tests, profile it, optimize it, then repeat until you are satisfied with the results.
I'm using LoadRunner to stress-test a J2EE application.
I have got: 1 MySQL DB server, and 1 JBoss App server. Each is a 16-core (1.8GHz) / 8GB RAM box.
Connection Pooling: The DB server is using max_connections = 100 in my.cnf. The App Server too is using min-pool-size and max-pool-size = 100 in mysql-ds.xml and mysql-ro-ds.xml.
I'm simulating a load of 100 virtual users from a 'regular', single-core PC. This is a 1.8GHz / 1GB RAM box.
The application is deployed and being used on a 100 Mbps ethernet LAN.
I'm using rendezvous points in sections of my stress-testing script to simulate real-world parallel (and not concurrent) use.
Question:
The CPU utilization on this load-generating PC never reaches 100% and memory too, I believe, is available. So, I could try adding more virtual users on this PC. But before I do that, I would like to know 1 or 2 fundamentals about concurrency/parallelism and hardware:
With only a single-core load generator as this one, can I really simulate a parallel load of 100 users (with each user using operating from a dedicated PC in real-life)? My possibly incorrect understanding is that, 100 threads on a single-core PC will run concurrently (interleaved, that is) but not parallely... Which means, I cannot really simulate a real-world load of 100 parallel users (on 100 PCs) from just one, single-core PC! Is that correct?
Network bandwidth limitations on user parallelism: Even assuming I had a 100-core load-generating PC (or alternatively, let's say I had 100, single-core PCs sitting on my LAN), won't the way ethernet works permit only concurrency and not parallelism of users on the ethernet wire connecting the load-generating PC to the server. In fact, it seems, this issue (of absence of user parallelism) will persist even in a real-world application usage (with 1 PC per user) since the user requests reaching the app server on a multi-core box can only arrive interleaved. That is, the only time the multi-core server could process user requests in parallel would be if each user had her own, dedicated physical layer connection between it and the server!!
Assuming parallelism is not achievable (due to the above 'issues') and only the next best thing called concurrency is possible, how would I go about selecting the hardware and network specification to use my simulation. For example, (a) How powerful my load-generating PCs should be? (b) How many virtual users to create per each of these PCs? (c) Does each PC on the LAN have to be connected via a switch to the server (to avoid) broadcast traffic which would occur if a hub were to be used in instead of a switch?
Thanks in advance,
/HS
Not only are you using Ethernet, assuming you're writing web services you're talking over HTTP(S) which sits atop of TCP sockets, a reliable, ordered protocol with the built-in round trips inherent to reliable protocols. Sockets sit on top of IP, if your IP packets don't line up with your Ethernet frames you'll never fully utilize your network. Even if you were using UDP, had shaped your datagrams to fit your Ethernet frames, had 100 load generators and 100 1Gbit ethernet cards on your server, they'd still be operating on interrupts and you'd have time multiplexing a little bit further down the stack.
Each level here can be thought of in terms of transactions, but it doesn't make sense to think at every level at once. If you're writing a SOAP application that operates at level 7 of the OSI model, then this is your domain. As far as you're concerned your transactions are SOAP HTTP(S) requests, they are parallel and take varying amounts of time to complete.
Now, to actually get around to answering your question: it depends on your test scripts, the amount of memory they use, even the speed your application responds. 200 or more virtual users should be okay, but finding your bottlenecks is a matter of scientific inquiry. Do the experiments, find them, widen them, repeat until you're happy. Gather system metrics from your load generators and system under test and compare with OS provider recommendations, look at the difference between a dying system and a working system, look for graphs that reach a plateau and so on.
It sounds to me like you're over thinking this a bit. Your servers are fast and new, and are more than suited to handle lots of clients. Your bottleneck (if you have one) is either going to be your application itself or your 100m network.
1./2. You're testing the server, not the client. In this case, all the client is doing is sending and receiving data - there's no overhead for client processing (rendering HTML, decoding images, executing javascript and whatever else it may be). A recent unicore machine can easily saturate a gigabit link; a 100 mbit pipe should be cake.
Also - The processors in newer/fancier ethernet cards offload a lot of work from the CPU, so you shouldn't necessarily expect a CPU hit.
3. Don't use a hub. There's a reason you can buy a 100m hub for $5 on craigslist.
Without having a better understanding of your application it's tough to answer some of this, but generally speaking you are correct that to achieve a "true" stress test of your server it would be ideal to have 100 cores (using a target of a 100 concurrent users), i.e. 100 PC's. Various issues, though, will probably show this as a no-brainer.
I have a communication engine I built a couple of years back (.NET / C#) that uses asyncrhonous sockets - needed the fastest speeds possible so we had to forget adding any additional layers on top of the socket like HTTP or any other higher abstractions. Running on a quad core 3.0GHz computer with 4GB of RAM that server easily handles the traffic of ~2,200 concurrent connections. There's a Gb switch and all the PC's have Gb NIC's. Even with all PC's communicating at the same time it's rare to see processor loads > 30% on that server. I assume this is because of all the latency that is inherent in the "total system."
We have a new requirement to support 50,000 concurrent users that I'm currently implementing. The server has dual quad core 2.8GHz processors, a 64-bit OS, and 12GB of RAM. Our modeling shows this computer is more than enough to handle the 50K users.
Issues like the network latency I mentioned (don't forget CAT 3 vs. CAT 5 vs. CAT 6 issue), database connections, types of data being stored and mean record sizes, referential issues, backplane and bus speeds, hard drive speeds and size, etc., etc., etc. play as much a role as anything in slowing down a platform "in total." My guess would be that you could have 500, 750, a 1,000, or even more users to your system.
The goal in the past was to never leave a thread blocked for too long ... the new goal is to keep all the cores busy.
I have another application that downloads and analyzes the content of ~7,800 URL's daily. Running on a dual quad core 3.0GHz (Windows Ultimate 7 64-bit edition) with 24GB of RAM that process used to take ~28 minutes to complete. By simply swiching the loop to a Parallel.ForEach() the entire process now take < 5 minutes. My processor load that we've seen is always less than 20% and maximum network loading of only 14% (CAT 5 on a Gb NIC through a standard Gb dumb hub and a T-1 line).
Keeping all the cores busy makes a huge difference, especially true on applications that spend allot of time waiting on IO.
As you are representing users, disregard the rendezvous unless you have either an engineering requirement to maintain simultaneous behavior or your agents are processes and not human users and these agents are governed by a clock tick. Humans are chaotic computing units with variant arrival and departure windows based upon how quickly one can or cannot read, type, converse with friends, etc... A great book on the subject of population behavior is "Chaos" by James Gleik (sp?)
The odds of your 100 decoupled users being highly synchronous in their behavior on an instant basis in observable conditions is zero. The odds of concurrent activity within a defined time window however, such as 100 users logging in within 10 minutes after 9:00am on a business morning, can be quite high.
As a side note, a resume with rendezvous emphasized on it is the #1 marker for a person with poor tool understanding and poor performance test process. This comes from a folio of over 1500 interviews conducted over the past 15 years (I started as a Mercury Employee on april 1, 1996)
James Pulley
Moderator
-SQAForums WinRunner, LoadRunner
-YahooGroups LoadRunner, Advanced-LoadRunner
-GoogleGroups lr-LoadRunner
-Linkedin LoadRunner (owner), LoadrunnerByTheHour (owner)
Mercury Alum (1996-2000)
CTO, Newcoe Performance Engineering
Given: multithreaded (~20 threads) C++ application under RHEL 5.3.
When testing under load, top shows that CPU usage jumps in range 10-40% every second.
The design mostly pretty simple - most of the threads implement active object design pattern: thread has a thread-safe queue, requests from other queues are pushed to the queue, while the thread only polling on the queue and process incomming requests. Processed request causes to a new request to be pushed to next processing thread.
The process has several TCP/UDP connection over each a data is received/sent in a high load.
I know I did not provided sufficiant data. This is pretty big application, and I'n not familiar well with all it's parts. It's now ported from Windows on Linux over ACE library (used for networking part).
Suppusing the problem is in the application and not external one, what are the techicues/tools/approaches can be used to discover the problem. For example I suspect that this maybe caused by some mutex contention.
I have faced similar problem some time back and here are the steps that helped me.
1) Start with using strace to see where the application is spending the time executing system calls.
2) Use OProfile to profile both the application and the kernel.
3) If you are using an SMP system , look at the numa settings,
In my case that caused a havoc .
/proc/appPID/numa_maps will give a quick look at how the access to the memory is happening.
numa misses can cause the jumps.
4) You have mentioned about TCP connections in your app.
Look at the MTU size and see its set to right value and
Depending upon the type of Data getting transferred use the Nagles Delay appropriately.
Nagles Delay
I'm building my first web application after many years of desktop application development (I'm using Django/Python but maybe this is a completely generic question, I'm not sure). So please beware - this may be an ultra-newbie question...
One of my user processes involves heavy processing in the server (i.e. user inputs something, server needs ~10 minutes to process it). On a desktop application, what I would do it throw the user input into a queue protected by a mutex, and have a dedicated background thread running in low priority blocking on the queue using that mutex.
However in the web application everything seems to be oriented towards synchronization with the HTTP requests.
Assuming I will use the database as my queue, what is best practice architecture for running a background process?
There are two schools of thought on this (at least).
Throw the work on a queue and have something else outside your web-stack handle it.
Throw the work on a queue and have something else in your web-stack handle it.
In either case, you create work units in a queue somewhere (e.g. a database table) and let some process take care of them.
I typically work with number 1 where I have a dedicated windows service that takes care of these things. You could also do this with SQL jobs or something similar.
The advantage to item 2 is that you can more easily keep all your code in one place--in the web tier. You'd still need something that triggers the execution (e.g. loading the web page that processes work units with a sufficiently high timeout), but that could be easily accomplished with various mechanisms.
Since:
1) This is a common problem,
2) You're new to your platform
-- I suggest that you look in the contributed libraries for your platform to find a solution to handle the task. In addition to queuing and processing the jobs, you'll also want to consider:
1) status communications between the worker and the web-stack. This will enable web pages that show the percentage complete number for the job, assure the human that the job is progressing, etc.
2) How to ensure that the worker process does not die.
3) If a job has an error, will the worker process automatically retry it periodically?
Will you or an operations person be notified if a job fails?
4) As the number of jobs increase, can additional workers be added to gain parallelism?
Or, even better, can workers be added on other servers?
If you can't find a good solution in Django/Python, you can also consider porting a solution from another platform to yours. I use delayed_job for Ruby on Rails. The worker process is managed by runit.
Regards,
Larry
Speaking generally, I'd look at running background processes on a different server, especially if your web server has any kind of load.
Running long processes in Django: http://iraniweb.com/blog/?p=56