How to profile a web crawler? - profiling

I have two slightly different versions of a web crawler. I want to compare their performance (specifically the time taken to crawl a given domain). I have considered these two options:
1. Run them one at a time and compare the time taken.
2. Run both of them at the same time and compare the time taken.
The drawback of 1 is that the network can be slower or faster when the second one runs. The drawback of 2 is that one crawler can hijack most of the bandwidth and appear to work faster, while the other could work better given the same bandwidth.
I don't know how to limit bandwidth (and maybe CPU usage?) per process, if that is even possible. If I could do that, I would give each a fair share and run them at the same time, so that approach could work.
Any ideas how to do this?

Select Option 1 and take a lot of samples. Run one for a week, then run the other for a week. The network bandwidth will of course vary, but should average out.
On another note, you'll probably want to find a way to throttle your crawler so it doesn't consume all your resources. Once you have that, option 2 becomes a better choice.
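If you go with option 1, it's worth automating the sampling. A minimal sketch, assuming the two crawler versions are separate executables (the names ./crawler_a and ./crawler_b and the target URL are hypothetical): it alternates which version runs first in each round, so slow and fast network periods hit both roughly equally, and reports each version's average wall-clock time.

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>

// Run a command and return its wall-clock duration in seconds.
static double time_command(const std::string& cmd) {
    auto start = std::chrono::steady_clock::now();
    std::system(cmd.c_str());                      // blocks until the crawler exits
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

int main() {
    const int samples = 20;                        // more samples -> network noise averages out
    double total_a = 0.0, total_b = 0.0;
    for (int i = 0; i < samples; ++i) {
        // Alternate the order each round so neither version always
        // runs in the "better" half of the network conditions.
        if (i % 2 == 0) {
            total_a += time_command("./crawler_a http://example.com");
            total_b += time_command("./crawler_b http://example.com");
        } else {
            total_b += time_command("./crawler_b http://example.com");
            total_a += time_command("./crawler_a http://example.com");
        }
    }
    std::cout << "crawler_a avg: " << total_a / samples << " s\n";
    std::cout << "crawler_b avg: " << total_b / samples << " s\n";
}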

Related

Are materialized views in Redshift worth their costs?

I'm working on a project in AWS Redshift with a few billion rows where the main queries are rollups on time units. The current implementation has MVs (materialized views) for all these rollups. It seems to me that if Redshift is all it's cracked up to be, and the dist and sort keys are defined correctly, the MVs should not be necessary, and their costs in extra storage and maintenance (refresh) could be avoided. I'm wondering if anyone has analyzed this in a similar application.
You're thinking along the right path but the real world doesn't always allow for 'just do it better'.
You are correct that sometimes MVs are just used to forego the effort of optimizing a complex query, but sometimes not. The selection of keys, especially the distribution key, is a compromise between optimizing different workloads. Distribute one way and query A gets faster but query B gets slower. But if the results of query B don't need to be completely up to date, one can make an MV out of B and only pay the price on refresh.
Sometimes queries are very complex and time consuming (and not because they aren't optimized). The results of such a query don't need to include the latest info to be valid, so an MV can make the cost of this query infrequent. [In reality MVs often represent complex subqueries that are referenced by a number of other queries, which accentuates the frequent vs. infrequent value of the MV.]
Sometimes query types don't match well to Redshift's distributed, columnar nature and just don't perform well. Again, current-ness of data can be played off against cluster workload and these queries can be run at low usage times.
With all that said, I think you are on the right path, as I've also been trying to get people to see that many, many queries are just poorly written. Too often in the data world, functionally correct equals done, when in reality this is only half done. I've rewritten queries that were taking 90 minutes to execute (browning out the cluster when they ran) and got them down to 17 seconds. So keep up the good fight, but use MVs as a last resort, when compromise is the only solution.

Redshift CPU utilisation is 100 percent most of the time

I have a 96 vCPU Redshift ra3.4xlarge 8-node cluster. Most of the time the CPU utilisation is 100 percent. It was a dc2.large 3-node cluster before, which was also always at 100 percent; that's why we moved to ra3. We are doing most of our computation on Redshift, but the data is not that much! I read somewhere that no matter how much you increase the compute, unless the increase is significant there will only be a slight improvement. Can anyone explain this?
I can give it a shot. Having 100% CPU for long stretches of time is generally not a good (optimal) thing in Redshift. You see, Redshift is made for performing analytics on massive amounts of structured data. To do this it utilizes several resources - disks/disk IO bandwidth, memory, CPU, and network bandwidth. If your workload is well matched to Redshift, your utilization of all these things will average around 60%. Sometimes CPU bound, sometimes memory bound, sometimes network bandwidth bound, etc. Lots of data being read means disk IO bandwidth is at a premium; lots of redistribution of data means network IO bandwidth is constraining. If you are using all these factors above 50% capacity, you are getting what you paid for. Once any of these factors gets to 100%, there is a significant drop-off in performance, as working around the oversubscribed item steals performance.
Now you are in a situation where you are seeing 100% for a significant portion of the operating time, right? This means you have all these other attributes you have paid for but are not using, AND inefficiencies are being incurred to manage through this (though of all the factors, high CPU has the least overhead). The big question is why.
There are a few possibilities but the most likely, in my experience, is inefficient queries. An example might be the best way to explain this. I've seen queries that are intended to find all the combinations of certain factors from several tables. So they cross join these tables, but this produces lots of repeats, so they add DISTINCT - problem solved. But this still creates all the duplicates and then reduces the set down. All the work is being done and most of the results thrown away. However, if they pared down the factors in the tables first, then cross joined them, the total work would be significantly lower. This example will do exactly what you are seeing: high CPU as it spins making repeat combinations and then throwing most of them away.
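The same shape is easy to see outside SQL. A rough C++ analogue of "cross join everything, then DISTINCT" versus "reduce each input to its distinct values first, then combine" (the data here is made up purely to show the difference in work done):

#include <iostream>
#include <set>
#include <utility>
#include <vector>

int main() {
    // Two "tables" with many repeated factor values.
    std::vector<int> a(100000, 1), b(100000, 2);   // 100k rows each, all duplicates

    // Naive (DISTINCT after cross join): build every combination, then deduplicate.
    // That touches 100k * 100k = 10 billion pairs and throws nearly all of them away.
    // Left commented out, since it is exactly the work we want to avoid:
    // std::set<std::pair<int, int>> all_combos;
    // for (int x : a) for (int y : b) all_combos.insert({x, y});

    // Better: reduce each input to its distinct values first, then combine.
    std::set<int> da(a.begin(), a.end()), db(b.begin(), b.end());
    std::set<std::pair<int, int>> combos;
    for (int x : da)
        for (int y : db)
            combos.insert({x, y});                 // only 1 * 1 = 1 combination here

    std::cout << "distinct combinations: " << combos.size() << "\n";
}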
If you have many of this type of "fat in the middle" query, where lots of extra data is made and immediately reduced, you won't get a lot of benefit from adding CPU resources. Things will get 2X faster with 2X the cluster size, but you are buying 2X of all these other resources that aren't helping you. You would expect that buying 2X CPU and 2X memory and 2X disk IO etc. would give you much more than a 2X improvement. Being constrained on one thing makes scaling costly. Also, you are unlikely to see the CPU utilization come down, as your queries just "spin the tires" of the CPU. More CPUs will just mean you can run more queries, resulting in more spinning tires.
Now the above is just my #1 guess based on my consulting experience. It could be that your workload just isn't right for Redshift. I've seen people try to put many small database problems into Redshift, thinking that it's powerful so it must be good at this too. They turn up the slot count to try to pump more work into Redshift but just create more issues. Or I've seen people try to run transactional workloads. Or ... If you have the wrong tool for the job, it may not work well. One 6-ton dump truck isn't the same thing as a 50-motorcycle delivery team - each has its purpose, but they aren't interchangeable.
Another possibility is that you have a very unusual workload but Redshift is still the best tool for the job. You don't need all the strengths of Redshift, but this is ok; you are getting the job done at an appropriate cost. In this case 100% CPU is just how your workload uses Redshift. It's not a problem, just reality. Now I doubt this is the case, but it is possible. I'd want to be sure I'm getting all the value from the money I'm spending before assuming everything is ok.

Profiling a multiprocess system

I have a system that I need to profile.
It is comprised of tens of processes, mostly C++, some comprised of several threads, that communicate with the network and with one another through various system calls.
I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in userspace code, inefficient use of syscalls, or something else.
What would be the best way to approach profiling a system like this?
I have thought of the following strategy:
Manually logging the round-trip times of various code sequences (for example, processing an incoming packet or a CLI command) and seeing which process takes the most time. After that, profiling that process, fixing the problem and repeating.
This method seems hacky and based on guesswork. I don't like it.
How would you suggest approaching this problem?
Are there tools that would help me out (a multi-process profiler?)?
What I'm looking for is more of a strategy than just specific tools.
Should I profile every process separately and look for problems? If so, how do I approach this?
Do I try to isolate the problematic processes and go from there? If so, how do I isolate them?
Are there other options?
I don't think there is a single answer to this sort of question, and every type of issue has its own problems and solutions.
Generally, the first step is to figure out WHERE in the big system is the time spent. Is it CPU-bound or I/O-bound?
If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent. The next question is of course whether that time is actually necessary or not: no automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps and one that does a matrix multiplication with a million elements very efficiently - it takes the same amount of CPU time to do both, but only one is actually achieving anything. However, knowing which program takes most of the time in a multi-program system can be a good starting point for figuring out IF that code is well written, or can be improved.
If the system is I/O bound, such as network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, expecting the tool to tell you what packet response or disk access time you should expect is a different matter - whether you contact Google to search for "kerflerp" or a local webserver that is a meter away will have a dramatic impact on the time of a reasonable response.
There are lots of other issues - running two pieces of code in parallel that use LOTS of memory can cause both to run slower than if they are run in sequence, because the high memory usage causes swapping, or because the OS isn't able to use spare memory for caching file I/O, for example.
On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.
Adding logging to your applications such that you can see WHERE they are spending time is another method that works reasonably well, particularly if you KNOW what the use-case is where it takes time.
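For the logging approach, a small scoped timer that you can drop around suspect code sequences goes a long way. A minimal sketch (the label and the timed function are placeholders; a real system would write to its existing logging facility rather than stderr):

#include <chrono>
#include <iostream>
#include <string>

// Logs the wall-clock time spent inside a scope when the scope exits.
class ScopedTimer {
public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
        std::cerr << label_ << " took " << us << " us\n";
    }
private:
    std::string label_;
    std::chrono::steady_clock::time_point start_;
};

void handle_packet() {
    ScopedTimer t("handle_packet");   // placeholder for a real code sequence to measure
    // ... actual packet processing goes here ...
}

int main() {
    handle_packet();
}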
If you have a use-case where you know "this should take no more than X seconds", running regular pre- or post-commit tests to check that the code is behaving as expected, and that no one has added a lot of code to slow it down, would also be useful.

Running time of two programs run separately and then together

I was recently asked this question in an interview, and while I did alright on the first two parts [I am assuming], I struggled a bit on the third. Here's the question:
You have two Linux programs, A and B. When run separately, A and B each take one minute to complete on a system that has just been restarted. [ie: fresh system: you reboot it, log in, get a shell prompt, run the program.]
What can you tell me about the programs if:
a) when run together, they take 2 minutes
b) when run together, they take 1 minute
c) when run together, they take 30 seconds
I said for a) that if they take exactly double the time when run together, they share no mutual exclusion and are vying for all the same resources, probably don't share any cached data or instructions [and thus don't help each other out from a cache perspective], and each program needs the full utilization of said resource to complete, such that the OS cannot parallelize them.
For b), I said that if they can run just as fast together, they probably share some spatial/temporal locality in the cache, and may lend themselves to being properly pipelined in such a way that while program A is waiting on something, program B can run in between those stages, and vice versa - effectively running them both in 1 minute.
For c), I was a bit stuck. In retrospect, I probably should have said that perhaps programs A and B were both doing a common task, where two of them running at once could complete said task faster than one running alone - such as a garbage collector. But the best that I could come up with was that perhaps they loaded out of the same sector on the hard disk, and that helped them both run quickly together.
I am just looking for some input from some of the smarties here on things I probably missed. The position was for a platforms/systems role that requires a good understanding of hardware/software and operating systems, and namely the interactions between them, which is why [I'm assuming] the question was asked.
I was also trying to think of examples that I could apply to each part to help show my knowledge of the question's real-life applications, but on the spot I was coming up short.
Together they take 2 minutes to complete
In this case, I think that each program is fully CPU-bound and can saturate 100% of the CPUs available on the machine. Therefore when the programs run together, each runs at half speed.
It's also possible that this would be the observed behavior if both programs were able and willing to saturate some other resource apart from the CPU, for example some I/O device. However, since in practice the performance of oversaturated I/O devices usually does not decrease linearly with the load applied to them, I would consider that a less likely scenario and go with CPU-bound as a first guess.
Together they take 1 minute to complete
The two programs do not contest the same resources, or there are ample resources in the system to satisfy the demands of both. Therefore, they end up not interfering with each other.
Together they take half a minute to complete
The programs operate on the same input, and both can tell when all the input is used up, so each ends up doing half the work it would do if launched alone, in half the running time. Also, the system obviously has the capacity to supply double the amount of whatever resource these programs are constrained by.
Since in this case the running time decreases linearly with the amount of processes (perfect scaling), it seems more likely that the resource constraining the programs is CPU for the same reasons explained in the "2 minutes" scenario. This also fits in well with the "common input" assumption, as the input would not be very likely to be coming from one source if there were e.g. different I/O devices supplying it.
Therefore, the first guess in this case is that each program is CPU-bound and written such that it consumes at most half the CPU resources in the system.
For A, they're programs that are in competition for a mutually exclusive resource.
For B, they're independent programs that don't really interact.
For C, which is the one you're struggling with, it seems they both have the same pool of work to pick from. For example, there's a queue of tasks to do, both programs are capable of doing the tasks, and they know which tasks have been done. So if they both run at the same time (assuming a multi-core machine, but even then not necessarily - all that's important is that they don't have a resource bottleneck), they get the work done in half the time.
See Performance in multithreaded Java application for another possible reason why processes can run faster when you have more than one.
Although I admit that a queue of tasks that can be performed concurrently is a much simpler explanation for the reduced running time.
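To make the shared-queue idea concrete, here is a minimal sketch using two threads inside one process instead of two separate programs (the "task" is just a stand-in for CPU-bound work): each worker claims the next undone task, so two workers get through the same queue in roughly half the time one would.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<int> next_task{0};        // index of the next task still to claim
const int total_tasks = 1000;

void fake_work() {
    // Stand-in for one CPU-bound task.
    volatile double x = 0;
    for (int i = 0; i < 100000; ++i) x += i * 0.5;
}

void worker() {
    // Keep claiming tasks until the queue is exhausted.
    while (true) {
        int task = next_task.fetch_add(1);
        if (task >= total_tasks) return;
        fake_work();
    }
}

int main() {
    auto start = std::chrono::steady_clock::now();
    std::thread w1(worker), w2(worker);   // two workers splitting the same queue
    w1.join();
    w2.join();
    auto secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "finished " << total_tasks << " tasks in " << secs << " s\n";
}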

Predict C++ program running time

How can I predict a C++ program's running time if the program executes different functions (working with a database, reading files, parsing XML and others)? How do installers do it?
They do not predict the time. They calculate the number of operations completed out of a total number of operations.
You can predict the time by using measurement and estimation. Of course the quality of the predictions will differ. And BTW: The word "predict" is correct.
You split the workload into small tasks and create an estimation rule for each task, e.g.: if copying the first ten files took 10s, then the remaining 90 files may take another 90s. Measure the time that these tasks take at runtime, and update your estimations.
Each new measurement will make the prediction a bit more precise.
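In code, that rule is just "average time per completed unit, times units remaining". A minimal sketch, assuming the total number of units is known in advance (the file copies here are only simulated with a sleep):

#include <chrono>
#include <iostream>
#include <thread>

int main() {
    const int total_files = 100;
    auto start = std::chrono::steady_clock::now();

    for (int done = 1; done <= total_files; ++done) {
        // Stand-in for copying one file.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));

        // Re-estimate after every unit: elapsed / done gives seconds per file,
        // so the remaining files should take roughly that rate times the remainder.
        double elapsed = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        double eta = (elapsed / done) * (total_files - done);
        std::cout << done << "/" << total_files
                  << " copied, ~" << eta << " s remaining\n";
    }
}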
There really is no way to do this in any sort of reliable way, since it depends on thousands of factors.
Progress bars typically measure this in one of two ways:
Overall progress - I have n number of bytes/files/whatever to transfer, and so far I have done m.
Overall work divided by current speed - I have n bytes to transfer, and so far I have done m and it took t seconds, so if things continue at this rate it will take u seconds to complete.
Short answer:
No, you can't. For progress bars and such, most applications simply increase the bar length by a percentage based on the overall tasks done. Some pseudo-code:
for (int i = 0; i < num_files_to_load; ++i) {
    files.push_back(File(filepath[i]));
    // Report the fraction of files loaded so far, in the range (0, 1].
    SetProgressBarLength((float)(i + 1) / (float)num_files_to_load);
}
This is a very simplified example. A for-loop like this would surely block the window system's event/message queue, so you would probably add a timed event or something similar instead.
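One way to avoid blocking the event loop is to do one unit of work per timer tick and update the bar there. A rough sketch with a fake event loop standing in for the real timer (the file list, the commented-out File type and SetProgressBarLength are placeholders for whatever the application actually uses):

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

const std::vector<std::string> filepaths = {"a.dat", "b.dat", "c.dat"};
std::size_t next_file = 0;            // index of the next file still to load

void SetProgressBarLength(float fraction) {
    std::cout << "progress: " << fraction * 100.0f << "%\n";   // stand-in for a real widget
}

// Called by the GUI framework's timer; loads at most one file per call
// so the event/message queue keeps being serviced between files.
bool on_timer_tick() {
    if (next_file < filepaths.size()) {
        // files.push_back(File(filepaths[next_file]));  // the real loading step
        ++next_file;
        SetProgressBarLength(static_cast<float>(next_file) / filepaths.size());
    }
    return next_file < filepaths.size();   // false once everything is loaded
}

int main() {
    // Fake event loop standing in for the window system's timer.
    while (on_timer_tick()) {}
}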
Longer answer:
Given N known parameters, the problem of finding whether a program completes at all is undecidable. This is called the Halting problem. You can, however, find the time it takes to execute a single instruction. Some very old games actually depended on exact cycle timings, and failed to execute correctly on newer computers due to race conditions that occur because of subtle differences in runtime. Also, on architectures with data and instruction caches, the cycles the instructions consume are not constant anymore, so caching makes cycle-counting unpredictable.
Raymond Chen discussed this issue in his blog.
"Why does the copy dialog give such horrible estimates? Because the copy dialog is just guessing. It can't predict the future, but it is forced to try. And at the very beginning of the copy, when there is very little history to go by, the prediction can be really bad."
In general it is impossible to predict the running time of a program. It is even impossible to predict whether a program will even halt at all. This is undecidable.
http://en.wikipedia.org/wiki/Halting_problem
As others have said, you can't predict the time. Approaches suggested by Partial and rmn are valid solutions.
What you can do in addition is assign weights to certain operations (for instance, if you know a DB call takes roughly twice as long as some processing step, you can adjust accordingly).
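A minimal sketch of that weighting idea (the phase names and weights below are made up): overall progress is the total weight of the phases already finished divided by the sum of all weights, so a DB-heavy phase simply gets a larger weight.

#include <iostream>
#include <string>
#include <vector>

struct Phase {
    std::string name;
    double weight;        // relative cost, e.g. a DB phase weighted ~2x a processing phase
};

int main() {
    std::vector<Phase> phases = {
        {"read files", 1.0}, {"database calls", 2.0}, {"process data", 1.0}};
    double total = 0.0;
    for (const auto& p : phases) total += p.weight;

    double done = 0.0;
    for (const auto& p : phases) {
        // Pretend each phase runs to completion here; a real program would
        // also report fractional progress inside long phases.
        done += p.weight;
        std::cout << p.name << " finished, overall progress "
                  << (done / total) * 100.0 << "%\n";
    }
}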
A cool installer compiler would execute a faux install, time each op, then save this to disk for the future.
I used such a technique for a 3D application once, which had a pretty dead-on progress bar for loading and mashing data, after you've run it a few times. It wasn't that hard, and it made development much nicer. (Since we had to see that bar 10-15 times/day, startup was 10-20 secs)
You can't predict it entirely.
What you can do is wait until a fraction of the work is done, say 1%, and estimate the remaining time from that - just time how long the 1% takes and multiply by 100, for example. That is easily done if you can enumerate everything you have to do in advance, or have some kind of loop going on.
As I mentioned in a previous answer, it is impossible in general to predict the running time.
However, empirically it may be possible to predict with good accuracy.
Typically all of these programs are approximately linear in some input.
But if you wanted a more sophisticated approach, you could define a large number of features (database size, file size, OS, etc. etc.) and input those feature values + running time into a neural network. If you had millions of examples (obviously you would have an automated method for gathering data, e.g. some discovery programs) you might come up with a very flexible and intelligent prediction algorithm.
Of course this would only be worth doing for fun, as I'm sure the value to your company over some crude guessing algorithm will probably be nil :)
You should make an estimation of the time needed for the different phases of the program. For example: reading files - 50, working with the database - 30, working with the network - 20. Ideally you would also provide a progress callback during each of those phases, but that requires coding the progress calculation into the iterations of the algorithm.