I have a special sort of problem.
I have some research code that I developed on my MacBook using CUDA 4.1, in particular its batched GEMM routines. I now have to run it on a cluster of GPUs on loan from another institution.
My problem is that the cluster only has CUDA 4.0 installed, and the administrators are reluctant to upgrade quickly.
Does anyone know if I can get the source for the batched GEMM routines somewhere and compile it to work under 4.0?
I've written my own kernel for doing batched multiplications, but it performs about an order of magnitude slower than the library one - I would like to stand on the shoulders of great men instead of on their toes.
I understand the reluctance to upgrade a production cluster quickly. Many clusters use a module system, so multiple versions of the CUDA toolkit can coexist. The driver, however, needs to be upgraded to a version that supports the newest toolkit in use, and that is the likely source of the reluctance: the administrators would have to re-test their users' production codes and applications to avoid regressions or failures.
Since CUBLAS is not open source, I recommend you develop your code on a separate machine and, if you get a large speed-up from batching, present that to the administrators as a reason to upgrade.
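In the meantime, one workaround that sometimes narrows the gap on CUDA 4.0 is to spread the many small GEMMs over a handful of streams so they can overlap on the device (Fermi-class GPUs or newer). A rough sketch, assuming column-major m x k and k x n matrices already resident on the GPU and host-side arrays of device pointers:

    // Hedged sketch: emulate a batched DGEMM on CUDA 4.0 by round-robining many
    // small cublasDgemm calls over a few streams so they can overlap on the GPU.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void dgemm_batched_fallback(cublasHandle_t handle,
                                int m, int n, int k,
                                double alpha, double beta,
                                double **dA, double **dB, double **dC, // host arrays of device pointers
                                int batch, int nstreams)
    {
        cudaStream_t *streams = new cudaStream_t[nstreams];
        for (int s = 0; s < nstreams; ++s)
            cudaStreamCreate(&streams[s]);

        for (int i = 0; i < batch; ++i) {
            cublasSetStream(handle, streams[i % nstreams]);   // let the small GEMMs overlap
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha, dA[i], m, dB[i], k, &beta, dC[i], m);
        }

        cudaDeviceSynchronize();                              // wait for the whole batch
        for (int s = 0; s < nstreams; ++s)
            cudaStreamDestroy(streams[s]);
        delete[] streams;
    }

It won't match the 4.1 batched kernel for very small matrices, but it can be noticeably better than issuing the GEMMs serially on a single stream.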
I want to extract features from images in MS COCO dataset using a fine-tuned VGG-19 network.
However, it takes about 6-7 seconds per image, roughly 2 hours per 1,000 images (and even longer for other fine-tuned models).
There are 120k images in the MS COCO dataset, so it will take at least 10 days.
Is there any way that I can speed up the feature extraction process?
Well, this is not just a matter of running a command. First, you must check whether your GPU is powerful enough to handle deep CNNs; knowing your GPU model answers that question.
Second, you have to compile and build the Caffe framework with CUDA and GPU support enabled (CPU_ONLY disabled) in Makefile.config (or CMakeLists.txt).
Once you have completed all the required steps (installing the NVIDIA driver, installing CUDA, etc.), you can build Caffe for GPU use. Then, by passing the GPU device ID on the command line, you can benefit from the speed the GPU provides.
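For reference, the GPU-related lines in Makefile.config typically look something like this (the CUDA path varies per system; the point is that CPU_ONLY stays commented out for a GPU build):

    # CPU-only switch - keep this commented out to build with GPU support
    # CPU_ONLY := 1
    CUDA_DIR := /usr/local/cuda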
Follow the official Caffe installation instructions for building with GPU support.
Hope it helps
This IPython notebook example explains the steps to extract features from any Caffe model really well: https://github.com/BVLC/caffe/blob/master/examples/00-classification.ipynb
In pycaffe, you can enable GPU mode simply by calling caffe.set_mode_gpu().
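If you are driving Caffe from C++ rather than pycaffe, the same idea looks roughly like the sketch below. The file names, the "fc7" blob name, the 224x224 geometry, and the batch size are placeholders for your fine-tuned VGG-19 deploy files; batching several images per forward pass is usually a noticeable win on top of switching to the GPU.

    // Hedged sketch of GPU-mode, batched feature extraction with the Caffe C++ API.
    #include <caffe/caffe.hpp>
    #include <boost/shared_ptr.hpp>
    #include <cstdio>

    int main() {
        caffe::Caffe::set_mode(caffe::Caffe::GPU);
        caffe::Caffe::SetDevice(0);                          // GPU device ID

        caffe::Net<float> net("deploy.prototxt", caffe::TEST);
        net.CopyTrainedLayersFrom("model.caffemodel");

        // Process several images per forward pass instead of one at a time.
        const int batch = 32;
        boost::shared_ptr<caffe::Blob<float> > input = net.blob_by_name("data");
        input->Reshape(batch, 3, 224, 224);
        net.Reshape();

        float *in = input->mutable_cpu_data();
        // ... fill 'in' with batch * 3 * 224 * 224 preprocessed pixel values here ...
        (void) in;

        net.Forward();

        boost::shared_ptr<caffe::Blob<float> > feat = net.blob_by_name("fc7");
        std::printf("features per image: %d\n", feat->count() / batch);
        return 0;
    }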
I am writing a parallel build farm to build C++ cross-platform applications against various platforms / environments. Every time new code is pushed to a git repo, I build and test the latest code against all the platforms.
I've set up parallel to correctly distribute the jobs among several hosts using the --sshlogin option.
I transfer files, collect output and results. It's all working more than fine and I love the tool.
Since build times can be quite long for some platforms, I would like the builds to be as incremental as possible.
My only issue is that a build is only incremental if the scheduler sends the job to the same machine, so that it can reuse the artefacts of the previous build on that specific host.
Say I have 3 hosts: I have a 1-in-3 chance of the build being incremental. If a host hasn't built a given platform in a while, the build might take a long time.
Is it possible to gain control over which host a specific input source will run on, and only fall back to the other hosts if that host is busy?
Ideally, I would love to see a tag system where I tag an input source with a name and tag several hosts with the same name, creating pools of jobs and pools of machines specialized in that type of build.
But even a very simple implementation, where the input sources are distributed in the same order as the sshlogins are defined, would be a quick fix in my situation.
I tried to find the source code to implement it myself, but I only see documentation generation when I browse the code on Savannah.
Any ideas?
Thanks,
M
There is currently no support for prioritizing a given argument to a given sshlogin. The source code is at https://savannah.gnu.org/git/?group=parallel
Feel free to join the mailing list and discuss the idea: https://lists.gnu.org/mailman/listinfo/parallel
The only prioritization in the code today is that when a job has failed on an sshlogin, GNU Parallel prefers to retry that job on another sshlogin. Maybe that could be extended?
If a job were marked as having failed -1 times for a given sshlogin, then GNU Parallel ought to prefer to run that job on that sshlogin.
I've been trying to discuss this idea on the mailing list as you suggested, but never got any response in more than 10 days... I guess you must be busy with other things at the moment. So I went ahead and forked the source code to make the necessary changes and get my solution working.
I pushed it there a week ago:
http://michakfromparis.github.io/gnu-parallel-sticky/
the source code is available on github here:
https://github.com/michaKFromParis/gnu-parallel-sticky
It wasn't exactly easy without any guidance, as the source code has a lot of history, so I tried to keep the changes surgical to ease merging with your future releases.
I've been using it in production for more than a week now and it works perfectly in my configuration.
It is also compatible with the existing options, so it should be a drop-in replacement for usual parallel usage, with the extra features on the side.
I would love to get feedback from other users, though, as it might not be completely battle-tested.
Thanks for sharing the original source code.
Best Regards,
M
I notice that device is not part of the 3.0 API... what do I use instead?
zmq::device (ZMQ_QUEUE, clients, workers);
I found that the devices have been moved here: https://github.com/zeromq/libzfl
It's a little confusing, so here's the story.
When I inherited maintenance of 0MQ/2.x, it had a zmq_device() function, and a set of external device apps, small main programs with XML configuration.
I'd previously tried to improve and document these two layers, which people were playing with, but the patches were refused by the maintainers. We then moved the external apps to the zdevices project, with more flexible configuration, etc. In the end these got no adoption and were abandoned. zdevices used libzfl (and XML) for its configuration. Most of libzfl got refactored into the CZMQ API (which people do use, a lot).
Sustrik then decided to remove the zmq_device call from 0MQ/3.0, which I explained to the list with the "less is more" argument. People didn't really like this, since it broke a lot of existing applications for a fairly weak reason.
So after the XS fork, I patched zmq_device back into 0MQ/3.1. The C++ API isn't part of the core library, but anyone using it is welcome to patch a device method back into that.
HTH.
AFAIK, there are currently no devices available for 3.x, but according to the README:
Less is More
Pre-built devices and zmq_device() removed. Should be made available
as a separate project(s).
Exactly one year ago, pieterh wrote the following on the site about the reasons for removing the devices:
It's mostly about being able to improve the device layers independently from the libzmq core. It's been hard to improve these device layers as part of libzmq core, mainly because the core API is considered sacred in ways that other stuff isn't. I.e. one does not touch a core API except between major versions. So, one does not touch devices if they are part of the core, except between major versions.
Just use C API for now:
zmq_device (ZMQ_QUEUE, clients, workers);
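To make that concrete, a minimal standalone queue device with the C API looks roughly like this (the endpoints are placeholders; on 0MQ/2.x the socket types are spelled ZMQ_XREP/ZMQ_XREQ):

    /* Rough sketch of a queue device using the plain C API:
     * ROUTER socket facing the clients, DEALER socket facing the workers. */
    #include <zmq.h>

    int main (void)
    {
        void *ctx = zmq_init (1);

        void *clients = zmq_socket (ctx, ZMQ_ROUTER);
        zmq_bind (clients, "tcp://*:5555");        /* REQ clients connect here  */

        void *workers = zmq_socket (ctx, ZMQ_DEALER);
        zmq_bind (workers, "tcp://*:5556");        /* REP workers connect here  */

        zmq_device (ZMQ_QUEUE, clients, workers);  /* blocks until the context is terminated */

        zmq_close (clients);
        zmq_close (workers);
        zmq_term (ctx);
        return 0;
    }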
I am working on Debian and I have this server we want to monitor.
The application is ours and there are around a hundred real-time counters we want to export for monitoring purposes, graphs and alarms.
I've been looking at the Debian way of doing this because we do use Debian packaging to install the app, and Debian uses snmpd daemon, based on net-snmp, to export SNMP.
So far every approach I've seen looks very complicated, from recompiling snmpd, to loading a dynamic library into it, to compiling a form of subagent that replicates what snmpd does.
While all of those options make me think I should go for something other than SNMP, I don't want to give up that early, and I was wondering whether anybody has found a feasible implementation.
Ideally it should be coded in C or C++, as the app is in C++, but I'm open to wrappers or other kinds of suggestions.
net-snmp supports both the smux and agentx agent extension protocols, allowing sub-agents to live in different processes. They also have a tutorial on writing AgentX subagents in C.
An often overlooked solution is the Agent++ API, which to me looks pretty nice and is under the Apache license. As far as I understand, you can extend that agent to answer for your own MIBs.
That said, doing a subagent isn't such a bad choice. You start the standard unpatched snmpd (from net-snmp). Then you connect to it with your subagent, which only adds those OIDs you want it to add. The net-snmp kit for coding AgentX (as the protocol is called) sub-agents is not dead simple to use, but not very hard either. There is also a Perl module for sub-agent development: https://metacpan.org/pod/NetSNMP::agent
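To give a feel for how much code an AgentX subagent actually takes, here is a rough, untested sketch that exposes a single read-only integer counter through the net-snmp agent library (the enterprise OID 99999 and all names are placeholders, and error handling is omitted). You also need a "master agentx" line in snmpd.conf so the master agent accepts subagent connections.

    /* Hedged sketch of a net-snmp AgentX subagent exposing one read-only scalar.
     * Build along the lines of: cc subagent.c -o subagent `net-snmp-config --agent-libs`
     */
    #include <net-snmp/net-snmp-config.h>
    #include <net-snmp/net-snmp-includes.h>
    #include <net-snmp/agent/net-snmp-agent-includes.h>

    static long counter_value = 0;    /* the real-time counter your app would update */

    static int
    counter_handler(netsnmp_mib_handler *handler,
                    netsnmp_handler_registration *reginfo,
                    netsnmp_agent_request_info *reqinfo,
                    netsnmp_request_info *requests)
    {
        if (reqinfo->mode == MODE_GET)
            snmp_set_var_typed_value(requests->requestvb, ASN_INTEGER,
                                     (u_char *) &counter_value,
                                     sizeof(counter_value));
        return SNMP_ERR_NOERROR;
    }

    int main(void)
    {
        static oid counter_oid[] = { 1, 3, 6, 1, 4, 1, 99999, 1, 1, 0 };

        /* Run as an AgentX subagent of the master snmpd, not as a master agent. */
        netsnmp_ds_set_boolean(NETSNMP_DS_APPLICATION_ID,
                               NETSNMP_DS_AGENT_ROLE, 1);
        init_agent("my-subagent");

        netsnmp_register_scalar(
            netsnmp_create_handler_registration("myCounter", counter_handler,
                                                counter_oid,
                                                OID_LENGTH(counter_oid),
                                                HANDLER_CAN_RONLY));

        init_snmp("my-subagent");        /* connects to snmpd over AgentX */

        for (;;)
            agent_check_and_process(1);  /* block, waiting for requests */

        return 0;
    }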
The traditional way to do this on Linux is to use the net-snmp package. Make sure you write the MIB first; everything is based on the MIB, and changes to the MIB usually result in lots of changes to the code. Coding for net-snmp is not difficult, and there is plenty of documentation to get you started, e.g.: http://www.net-snmp.org/wiki/index.php/Tutorials#Coding_Tutorials
Have you tried net-snmp?
I just downloaded and built the libraries/executables of Google Performance Tools. Before I run the CPU profiler on the application that I want to investigate, I want to learn how to use the tools properly, perhaps on a sample application. What would be a good example to run the Google CPU profiler on? Thanks in advance.
The following paragraph appears in the README.windows file distributed with perftools 1.3:
The heap-profiler has had a preliminary port to Windows. It has not been well tested, and probably does not work at all when Frame Pointer Optimization (FPO) is enabled -- that is, in release mode. The other features of perftools, such as the cpu-profiler and leak-checker, have not yet been ported to Windows at all.
In my experience, for performance tuning, stack-sampling is the method of choice.
Google perftools contains a stack-sampler, and I believe its visual analyzer can be made to show the cost of individual statements, not just functions.
What you need to know is the percentage of time the stack contains that statement, because that is how much time would be saved if the statement were removed.
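If you just want something to practice on, a tiny program with one deliberately hot function makes the output easy to interpret. A hedged sketch (link with -lprofiler; recent releases use the <gperftools/profiler.h> header, the 1.x series used <google/profiler.h>):

    // A small practice target for the CPU profiler: hot() should dominate the
    // profile and cold() should barely register.
    #include <gperftools/profiler.h>
    #include <cmath>
    #include <cstdio>

    double hot(int n)
    {
        double s = 0.0;
        for (int i = 1; i < n; ++i)
            s += std::sqrt((double) i) * std::sin((double) i);
        return s;
    }

    double cold(int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += i;
        return s;
    }

    int main()
    {
        ProfilerStart("cpu.prof");    // or run with CPUPROFILE=cpu.prof set and skip these calls
        double a = hot(50000000);
        double b = cold(1000000);
        ProfilerStop();
        std::printf("%f %f\n", a, b); // keep the results live so nothing is optimized away
        return 0;
    }

Running pprof --text on the binary and cpu.prof should then attribute almost all samples to hot(), which is exactly the kind of answer you are looking for on your real application.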