kyoto cabinet scan_parallel not really parallel? - c++

I just spent a day creating an abstraction layer over Kyoto Cabinet to remove global locks from my code. I was busy porting my algorithms to this new abstraction layer when I discovered that scan_parallel isn't really parallel: it only maxes out one core. For jollies I stuck a billion-int-countdown spin-loop into my code (empty stubs as I port) to try and simulate some processing time; still only one core maxed. Do I need to move to Berkeley DB or LevelDB? I thought Kyoto Cabinet was meant for internet-scale problems :/ I must be doing something wrong or missing some gotchas.
top never goes above 100% and iostat never above 25% (in iostat, one maxed CPU = 1/number of cores * 100) :/ This is on a quad-core i5.
The source DB is a 10 GB corpus of protocol-buffer-encoded data (TreeDB) with the following tuning flags (picked these up from the documentation):
index_db.tune_options(TreeDB::TLINEAR | TreeDB::TCOMPRESS);
index_db.tune_buckets(1LL * 1000);
index_db.tune_defrag(8);
index_db.tune_page(32768);
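For reference, here is roughly how I'm driving the scan. This is a simplified sketch, not my exact code: the visitor body is a stub and the spin-loop just fakes per-record processing time.

#include <kctreedb.h>
using namespace kyotocabinet;

// Stub visitor: burns CPU per record to simulate processing.
class CountVisitor : public DB::Visitor {
  const char* visit_full(const char* kbuf, size_t ksiz,
                         const char* vbuf, size_t vsiz, size_t* sp) {
    volatile long n = 0;
    for (long i = 0; i < 1000000000L; ++i) n += i;  // billion-int countdown stand-in
    return NOP;  // read-only scan, leave the record untouched
  }
};

int main() {
  TreeDB index_db;
  index_db.open("index.kct", TreeDB::OREADER);
  CountVisitor visitor;
  index_db.scan_parallel(&visitor, 4);  // request 4 worker threads
  index_db.close();
  return 0;
}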
Edit:
Do not remove the IR TAG. Please think before you wave around the detag bat.
This IS an IR-related question: it's about creating GINORMOUS (40 GB+) inverted files ONLINE, inverted indices are the basis of IR data access methods, and inverted index creation has a unique transactional profile. By removing the IR tag you rob me of the wisdom of IR researchers who have used a database library to create such large database files.

Related

Need a kind of 'interval ranges' computation

I need a library for my C++ program.
The problem is, I don't know the name of the data type I'm looking for.
I have an NPAPI plugin (I know this API is deprecated and removed from modern browsers) which issues HTTP range requests to a server. The requests are asynchronous and the data may arrive in any order, in chunks of any size.
So I need to track the ranges I have already requested from the server.
For example, if I initially request bytes [10-20] (inclusive) and then request [30-40], the data type I need should keep them as two intervals:
[10-20],[30-40]
But if I then request [21-29] or even [15-35], they should be merged into one interval:
[10-20],[30-40] + [15-35] = [10-40]
I also need subtraction when a requested block arrives:
[10-40] - [20-30] = [10-19],[31-40]
(requested - arrived = what we're still waiting for)
I had a look at the boost::numeric::interval library, but at first glance it is too big for this task (1583 files, 13 MB of sources after './dist/bin/bcp numeric/interval ~/boost').
Also, GNU ddrescue has some similar arithmetic inside, but the code isn't a library there; it is coupled too tightly to the application's specifics.
UPDATE:
Here is what I've found on my way:
A container for integer intervals, such as RangeSet, for C++
https://en.wikipedia.org/wiki/Interval_tree
Boost.ICL (see the sketch below)
NCBI C++ Toolkit, CIntervalTree
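Of these, Boost.ICL looks closest to what I described. A minimal sketch of the merge/subtract behaviour with its interval_set (untested, assuming closed integer intervals):

#include <boost/icl/interval_set.hpp>
#include <iostream>

int main() {
  using namespace boost::icl;
  interval_set<int> pending;

  // Requested ranges merge automatically: [10,20] + [30,40] + [15,35] -> [10,40]
  pending += discrete_interval<int>::closed(10, 20);
  pending += discrete_interval<int>::closed(30, 40);
  pending += discrete_interval<int>::closed(15, 35);

  // Subtract a block once it arrives: [10,40] - [20,30] -> [10,19], [31,40]
  pending -= discrete_interval<int>::closed(20, 30);

  std::cout << pending << std::endl;  // two intervals remain, equivalent to [10,19] and [31,40]
  return 0;
}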

Maximize tensorflow multi gpu performance

I was wondering if anybody could advise on how to get peak performance out of TensorFlow in a 4-GPU setting.
As a test I created two copies of the same network (an 18-ish-layer residual network with small filter banks (ranging from 16 to 128) on 32x32 inputs; batch size 512, 128 per GPU): one in MXNet and one modelled off of the Inception example.
My MXNet network can train at around 7k examples a second, whereas TensorFlow is only capable of 4.2k with dummy data and 3.7k with real data.
(when running on 1 GPU the numbers are 1.2k examples a second vs 2.1k)
From these experiments I have a few questions, in the hope of speeding things up.
GPU utilization seems quite low when training. I noticed that in the TensorFlow white paper there is support for running multiple streams on the same GPU. Is this possible in the public release?
Is there any way to perform multiple train operations in one execution of session.run()? Or to have async execution? This would allow the weight updates to be done at the same time as the next batch's forward pass. I have tried using 2 threads (both system threads and QueueRunners), but this only resulted in a slowdown. MXNet is able to increase speeds by running weight updates on the CPU so that the GPUs can be used for the next batch.
Will the new distributed run time get around some of these issues by letting me run more than one worker on a single machine?
Is there something else that can be done?
I know there are a number of similar questions here on Stack Overflow, but through my searching I couldn't find a solution to my problems that I have not already tried.
Edit:
I did a little bit of CUDA profiling to see what the expensive kernels were. According to my run, 21.4% of the time is spent inside:
void Eigen::internal::EigenMetaKernel_NonVectorizable<Eigen::TensorEvaluator
<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, long>, int=16>,
Eigen::TensorPaddingOp<Eigen::array<std::pair<int, int>,
unsigned long=4> const, Eigen::TensorMap<Eigen::Tensor<float const,
int=4, int=1, long>, int=16> const > const > const, Eigen::GpuDevice>, long>(float, int=4)
and 20.0% of the time was spent in
void Eigen::internal::EigenMetaKernel_NonVectorizable<Eigen::TensorEvaluator
<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, long>, int=16>,
Eigen::TensorBroadcastingOp<Eigen::array<int, unsigned long=4>
const, Eigen::TensorMap<Eigen::Tensor<float const, int=4, int=1, long>,
int=16> const > const > const, Eigen::GpuDevice>, long>(float, int=4)
From the signatures I am not exactly sure what these are doing. Do they make sense?
In addition to this, the analysis reports low kernel concurrency, 0%, as expected.
It also reports low compute utilization, 34.9% (granted, this includes start-up time and a little bit of Python in the train loop, around 32 seconds total out of 91; this comes out to around 50% utilization inside TensorFlow).
Edit 2:
I have attached a copy of the trimmed-down source code. In general though I am more concerned about questions 1-3 and don't want to take too much of everybody's time.
In addition I am running on tensorflow built from: f07234db2f7b316b08f7df25417245274b63342a
Edit 3:
Updated to the most recent TensorFlow (63409bd23facad471973b110df998782c0e19c06), same code, default data format (NHWC), and that seemed to speed things up a lot.
On fake data: 6.7k-6.8k examples a second on 4 GPUs (thermal dependence, I think?); 2.0k examples a second on 1 GPU.
Real data performance is around 4.9k examples a second on 4 GPUs; 1.7k examples a second on 1 GPU.
Edit 4:
In addition I tried switching the data format to NCHW. I made the conversion modelled off of Soumith's benchmarks. The convolution parts were indeed faster, but batch norm appears to be messing everything up. With a naive implementation (fixing the axis, and making the weights [1,C,1,1] instead of [C,]) I am only able to get 1.2k examples a second on 4 GPUs (fake data). With a transpose before and after the batch norm op I am able to get 6.2k examples a second (fake data), but that is still slower than the NHWC data_format.
It's a bit hard to diagnose your program's performance problem without seeing the code. Is it possible for us to read your test code somehow?
TensorPadding showing up at the top is a bit strange; I'd expect cuDNN calls to be at the top of the profile. Anyway, showing us the test code would be helpful.

Is there a maximum concurrency for AWS s3 multipart uploads?

Referring to the docs, you can specify the number of concurrent connections when pushing large files to Amazon Web Services S3 using the multipart uploader. While it does say the concurrency defaults to 5, it does not specify a maximum, or whether or not the size of each chunk is derived from the total filesize / concurrency.
I trawled the source code and the comment is pretty much the same as the docs:
Set the concurrency level to use when uploading parts. This affects
how many parts are uploaded in parallel. You must use a local file as
your data source when using a concurrency greater than 1
So my functional build looks like this (the vars are defined, by the way; this is just condensed for the example):
use Aws\Common\Exception\MultipartUploadException;
use Aws\S3\Model\MultipartUpload\UploadBuilder;
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bucket)
->setKey($file)
->setConcurrency(30)
->setOption('CacheControl', 'max-age=3600')
->build();
Works great, except a 200 MB file takes 9 minutes to upload... with 30 concurrent connections? Seems suspicious to me, so I upped the concurrency to 100 and the upload time was 8.5 minutes. Such a small difference could just be the connection and not the code.
So my question is whether or not there's a concurrency maximum, what it is, and whether you can specify the size of the chunks or whether chunk size is automatically calculated. My goal is to get a 500 MB file to transfer to AWS S3 within 5 minutes, so I have to optimize this wherever possible.
Looking through the source code, it looks like 10,000 is the maximum number of concurrent connections. There is no automatic calculation of chunk size based on concurrent connections, but you can set that yourself if needed for whatever reason.
I set the chunk size to 10 MB with 20 concurrent connections and it seems to work fine. On a real server I got a 100 MB file to transfer in 23 seconds, much better than the 3.5 to 4 minutes it was getting in the dev environments. Interesting, but them's the stats, should anyone else come across this same issue.
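(For the record, 100 MB in 23 seconds is roughly 4.3 MB/s, while 200 MB in 9 minutes is only about 0.37 MB/s, so the earlier numbers were almost certainly limited by the dev box's upstream bandwidth rather than by the concurrency setting.)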
This is what my builder ended up being:
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bucket)
->setKey($file)
->setConcurrency(20)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
I may need to up that cache max-age, but as of yet this works acceptably. The key was moving the upload code to the server and not relying on the weakness of my dev environments, no matter how powerful the machine or how high-class the internet connection.
We can abort the upload at any point during the process, halting all operations. We can also set the concurrency and minimum part size:
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource('/path/to/large/file.mov')
->setBucket('mybucket')
->setKey('my-object-key')
->setConcurrency(3)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
try {
    $uploader->upload();
    echo "Upload complete.\n";
} catch (MultipartUploadException $e) {
    $uploader->abort();
    echo "Upload failed.\n";
}

in depth explanation of the side effects interface in clojure overtone generators

I am new to Overtone/SuperCollider. I know how sound forms physically. However, I don't understand the magic inside Overtone's sound-generating functions.
Let's say I have a basic sound:
(definst sin-wave [freq 440 attack 0.01 sustain 0.4 release 0.1 vol 0.4]
  (* (env-gen (lin-env attack sustain release) 1 1 0 1 FREE)
     (+ (sin-osc freq)
        (sin-osc (* freq 2))
        (sin-osc (* freq 4)))
     vol))
I understand the ASR cycle of the sound envelope, the sine wave, the frequency and the volume here. They describe the amplitude of the sound over time. What I don't understand is the time. Since time is absent from the inputs of all the functions here, how do I work stuff like echo and other cool effects into the thing?
If I were to write my own sin-osc function, how would I specify the amplitude of my sound at a specific point in time? Let's say my sin-osc has to ensure that at 1/4 of the cycle the output reaches the peak amplitude of 1.0; what is the interface I can code against to control that?
Without knowing this, all the sound synth generators in Overtone don't make sense to me, and they look like strange functions with unknown side effects.
Overtone does not specify the individual samples or shapes over time for each signal; it is really just an interface to the SuperCollider server (which defines a protocol for interaction, of which the SuperCollider language is the canonical client, and Overtone is another). For that reason, all Overtone is doing behind the scenes is sending the server instructions for how to construct a synth graph. The SuperCollider server is the thing that actually calculates which samples get sent to the DAC, based on the definitions of the synths that are playing at any given time. That is why you are given primitive synth elements like sine oscillators, square waves and filters: these are invoked on the server to actually calculate the samples.
I got an answer from droidcore on the #supercollider Freenode IRC channel:
d: time is really like wallclock time, it's just going by
d: the ugen knows how long each sample takes in terms of milliseconds, so it knows how much to advance its notion of time
d: so in an adsr, when you say you want an attack time of 1.0 seconds, it knows that it needs to take 44100 samples (say) to get there
d: the sampling rate is fixed and is global. it's set when you start the synthesis process
d: yeah well that's like doing a lookup in a sine wave table
d: they'll just repeatedly look up the next value in a table that
represents one cycle of the wave, and then just circle around to
the beginning when they get to the end
d: you can't really do sample-by sample logic from the SC side
d: Chuck will do that, though, if you want to experiment with it
d: time is global and it's implicit it's available to all the oscillators all the time
but internally it's not really like it's a closed form, where you say "give me the sample for this time value"
d: you say "time has advanced 5 microseconds. give me the new value"
d: it's more like a stream
d: you don't need to have random access to the oscillators values, just the next one in time sequence
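To make that "stream of samples" idea concrete for myself, here is a rough C++ sketch of what a table-lookup oscillator like the one d describes boils down to. This is not SuperCollider's actual implementation, just the general shape of it:

#include <cmath>
#include <vector>

// Toy wavetable oscillator: one cycle of a sine in a table, advanced sample by sample.
struct WavetableOsc {
  std::vector<double> table;
  double phase;       // current position in the table, in table samples
  double increment;   // how far to advance per output sample

  WavetableOsc(double freq, double sample_rate, size_t size = 4096)
      : table(size), phase(0.0), increment(freq * size / sample_rate) {
    const double two_pi = 6.283185307179586;
    for (size_t i = 0; i < size; ++i)
      table[i] = std::sin(two_pi * i / size);
  }

  // "Time has advanced one sample; give me the new value."
  double next() {
    double value = table[static_cast<size_t>(phase)];
    phase += increment;
    if (phase >= table.size()) phase -= table.size();  // wrap back to the start of the cycle
    return value;
  }
};

There is no "give me the value at time t" call anywhere; the oscillator just produces the next sample each time it is asked, which is exactly the stream-like model described above.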

How to read a multi-session DVD disk size in Windows?

Trying to read the sizes of disks that were created in multiple sessions using GetDiskFreeSpaceEx() gives the size of the last session only. How do I correctly read the number and sizes of all sessions in C/C++?
Thanks.
You might want to look at the DeviceIoControl API function. See here for control codes. Here is a code example that retrieves the size of a CD disk. Substitute
CreateFile(TEXT("\\\\.\\PhysicalDrive0")
for e.g.
CreateFile(TEXT("\\\\.\\F:") /* Drive is F: */
if you wish.
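The call itself looks roughly like this (a sketch using IOCTL_DISK_GET_LENGTH_INFO, error handling mostly omitted):

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main() {
    // Open the drive; use \\.\PhysicalDrive0 or a drive letter such as \\.\F:
    HANDLE hDevice = CreateFile(TEXT("\\\\.\\F:"), GENERIC_READ,
                                FILE_SHARE_READ | FILE_SHARE_WRITE,
                                NULL, OPEN_EXISTING, 0, NULL);
    if (hDevice == INVALID_HANDLE_VALUE) return 1;

    GET_LENGTH_INFORMATION info;
    DWORD bytesReturned = 0;
    if (DeviceIoControl(hDevice, IOCTL_DISK_GET_LENGTH_INFO,
                        NULL, 0, &info, sizeof(info),
                        &bytesReturned, NULL)) {
        printf("Disk size: %lld bytes\n", (long long)info.Length.QuadPart);
    }
    CloseHandle(hDevice);
    return 0;
}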
Note: The page says that DeviceIoControl can be used to "retrieve information about a floppy disk drive, hard disk drive, tape drive, or CD-ROM drive", but I have also tested it on a DVD, and it seemed to work perfectly. I did not have access to any multisession DVDs to test, so you'll have to test if that works yourself. If it doesn't work, I'd try some of the other control codes, at least IOCTL_DISK_GET_DRIVE_GEOMETRY_EX, IOCTL_DISK_GET_DRIVE_LAYOUT_EX, IOCTL_DISK_GET_LENGTH_INFO and IOCTL_DISK_GET_PARTITION_INFO_EX.
If all else fails with DeviceIoControl, you could possibly make use of the Windows Image Mastering API (IMAPI). You'll need v2 of the API (included with Vista & later, can be added to XP & 2003 too, see here: What's new in IMAPIv2) for DVD support. This API is primarily for CD burning, but perhaps it contains some functionality for retrieving disk size; I'd find it weird if it didn't. In particular, this example seems interesting. I do not know if this one works for multisession disks either, but since it can create them, I guess it's likely.
Here are some resources for IMAPI:
MSDN - IMAPI
MSDN - IMAPI interfaces
MSDN - Creating multisession disks with IMAPI (note: example with VB, not C or C++)
Hey I got at least 2 solutions for you:
1) Download dvd+rw-mediainfo.exe from http://fy.chalmers.se/~appro/linux/DVD+RW/tools/win32/; it's a tool that reads info about your disc. Then just make a system call from your app and parse the results. Here's example output:
D:\Downloads>"dvd+rw-mediainfo.exe" f:
INQUIRY: [HL-DT-ST][DVDRAM GT30N ][1.01]
GET [CURRENT] CONFIGURATION:
Mounted Media: 10h, DVD-ROM
Current Write Speed: 1.0x1385=1385KB/s
Write Speed #0: 8.0x1385=11080KB/s
Write Speed #1: 4.0x1385=5540KB/s
Write Speed #2: 2.0x1385=2770KB/s
Write Speed #3: 1.0x1385=1385KB/s
Speed Descriptor#0: 00/2292991 R#8.0x1385=11080KB/s W#8.0x1385=11080KB/s
READ DVD STRUCTURE[#0h]:
Media Book Type: 01h, DVD-ROM book [revision 1]
Legacy lead-out at: 2292992*2KB=4696047616
READ DISC INFORMATION:
Disc status: complete
Number of Sessions: 1
State of Last Session: complete
Number of Tracks: 1
READ TRACK INFORMATION[#1]:
Track State: complete
Track Start Address: 0*2KB
Free Blocks: 0*2KB
Track Size: 2292992*2KB
Last Recorded Address: 2292991*2KB
FABRICATED TOC:
Track#1 : 17#0
Track#AA : 17#2292992
Multi-session Info: #1#0
READ CAPACITY: 2292992*2048=4696047616
2) Investigate mciSendString from [DllImport("winmm.dll", EntryPoint = "mciSendStringA", CharSet = CharSet.Ansi)]; I suspect you can send some command and get the desired results.
PS: of course you may download the dvd+rw-mediainfo.exe sources from here and investigate further; I am just giving you ideas to think about.
UPDATE
Link to source code updated, thanks @oystein
There are many ways to do this, since DVD drives have several interfaces for it due to legacy and backward-compatibility issues.
You could send an IOCTL_SCSI_PASS_THROUGH_DIRECT command to the DVD drive (the physical-device handle for it). With it you issue SCSI commands that will be answered by the drive. You can read session information, disk information, disk capacity and more.
I believe that dvd+rw-mediainfo.exe issues these.
Unfortunately, the interface is a bit tricky and obscure, since it is a command within a command. The pass-through has a byte buffer you will have to fill in yourself with the command structure.
Or you can call IOCTL_CDROM_READ_TOC_EX:
http://www.osronline.com/ddkx/storage/k306_2cs2.htm
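A rough sketch of the IOCTL_CDROM_READ_TOC_EX route, querying the session info (untested; struct and constant names are from ntddcdrm.h, so double-check them against your SDK):

#include <windows.h>
#include <winioctl.h>
#include <ntddcdrm.h>
#include <stdio.h>

int main() {
    HANDLE hDevice = CreateFile(TEXT("\\\\.\\F:"), GENERIC_READ,
                                FILE_SHARE_READ | FILE_SHARE_WRITE,
                                NULL, OPEN_EXISTING, 0, NULL);
    if (hDevice == INVALID_HANDLE_VALUE) return 1;

    CDROM_READ_TOC_EX tocEx = {0};
    tocEx.Format = CDROM_READ_TOC_EX_FORMAT_SESSION;  // ask for session data only

    CDROM_TOC_SESSION_DATA session = {0};
    DWORD bytesReturned = 0;
    if (DeviceIoControl(hDevice, IOCTL_CDROM_READ_TOC_EX,
                        &tocEx, sizeof(tocEx),
                        &session, sizeof(session),
                        &bytesReturned, NULL)) {
        printf("First complete session: %u, last complete session: %u\n",
               (unsigned)session.FirstCompleteSession,
               (unsigned)session.LastCompleteSession);
    }
    CloseHandle(hDevice);
    return 0;
}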
I also believe that the exact set of IOCTLs / commands that will work depends on the drive and its firmware.
Older drives will not support the newer interfaces, and some of the newer drives will not support the legacy interfaces.
Thus, some of the libraries & tools might use one or more of these interfaces.
Accessing the older sessions is all quite messy, really, since most OSes do not care about them, only about the most recent one.