How to force page file expansion when using boost::file_mapping - c++

In my current genetic algorithm I'm iterating over a number of rather large files. Right now I'm using boost::file_mapping to access this data.
I have 3 different test cases I can run the program on (my computer has 8 GB RAM and runs Windows 8.1; for my different attempts at page file limits, read below):
a) 1000 files, each about 4 MB in size, so 4 GB total.
This case is a bit sluggish when first executed, but from the second iteration onwards the memory access is no longer the bottleneck and the speed is limited entirely by my CPU.
b) 1000 files, each about 6 MB in size, so 6 GB total.
This is an entirely different scenario... The first iteration is proportionally slow, but even the following iterations do not speed up. I have actually considered trying to load 4 GB into memory and keeping 2 GB mapped... Not sure this would actually work, but it may be worth a test... Even if it did work, though, it would not help with case c).
c) 1000 files, each about 13 MB in size, so 13 GB total.
This is entirely hopeless. The first iteration is incredibly slow (which is understandable considering the amount of data), but even further iterations show no sign of speed improvement. And even a partial load into memory won't help much here.
Now I tried various settings for the page file limits:
1) Managed by Windows: the size of the page file stops at around 5-5.2 GB and never gets bigger. This obviously does not help with cases b) and c), and it actually causes the files to cycle through... (It would actually be helpful if at least the first 4 GB stayed; as it is right now, basically nothing is reused from the page file.)
2) Manual, min 1 GB, max 32 GB: the page file does not grow above 4.5 GB.
3) Manual, min 16 GB, max 32 GB: in case you haven't tried this yourself... don't. It makes booting almost impossible, and nothing runs smoothly anymore... I didn't test my program with this, as it was unacceptable.
So, what I'm looking for is some way to tell Windows that, when using page file settings 1) or 2), I really, really want it to use a very large page file for this program. But I don't want my computer to run entirely on the page file (as basically happens with 3)). Is there any way I could force this?
Or is there any other way to load the data properly, so that at least from the second iteration onwards the access is quick? The data consist only of huge numbers of 64-bit integers that my algorithm bit-checks against (there are a bunch of formatting symbols in between every 200-300 ints), so I only need read access.
In case the info is needed: I'm using VS Pro 2013. Portability of the code isn't an issue; it only has to run on my notebook. And of course it is a 64-bit application, and my processor supports that ;)
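For illustration, a minimal sketch of this kind of read-only mapping (not my actual code; the file name and the bit test are just placeholders, and the formatting symbols between the ints are not handled here):

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstddef>
#include <cstdint>

int main()
{
    using namespace boost::interprocess;

    file_mapping  file("data_0001.bin", read_only);   // open the file mapping (placeholder name)
    mapped_region region(file, read_only);            // map the whole file read-only

    const auto*   ints  = static_cast<const std::uint64_t*>(region.get_address());
    std::size_t   count = region.get_size() / sizeof(std::uint64_t);

    std::uint64_t hits = 0;
    for (std::size_t i = 0; i < count; ++i)
        hits += (ints[i] & 0x1u) != 0;                 // placeholder bit check, not the real algorithm

    return hits != 0 ? 0 : 1;
}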

Related

EBR block in Lattice Diamond

I have a MachXO3 chip. Family datasheet is available here: http://www.latticesemi.com/~/media/LatticeSemi/Documents/DataSheets/MachXO23/DS1047-MachXO3-Family-Data-Sheet.pdf?document_id=50121
The datasheet says on page 2-10 that the EBR is composed of 9-kbit blocks. But table 1-1 on page 1-2 lists numbers that are not divisible by 9 at all...
Also, I have the following code:
reg [7:0] lineB0[1:0][127:0];
reg [7:0] lineB1[1:0][127:0];
and the report says that it takes 4 EBRs. That sounds completely unoptimized. Why is that? How can I craft my table of 2*(2*128) bytes = 512 bytes = 4096 bits = 4 kbit so that it fits in 1 EBR?
The automatic inference algorithm does not always seem to be super efficient. If resource usage is an issue, I would generally recommend using IPexpress to create a RAM or ROM. The tool reports a resource use of 1 EBR for a 512*8 dual-port RAM (RAM_DP). Depending on the organisation/application of your RAM, a 128*(8+8) layout might be a good alternative, under the assumption that you always want to read the same index for lineB0 and lineB1.
A friendly reminder: premature optimization is the root of all (or at least much) evil. So investing your time in other topics might be more worthwhile if the amount of EBR actually used is not a limitation right now.

C++ environment/IDE to avoid multiple reads of big data sets

I am currently working on a big dataset (approximately a billion data points) and have decided to use C++ over R, in particular for convenience in memory allocation.
However, there does not seem to be an equivalent of RStudio for C++ that would let me "store" the data set and avoid having to read the data every time I run the program, which is extremely time consuming...
What kind of techniques do C++ users use for big data in order to read the data "once for all"?
Thanks for your help!
If I understand what you are trying to achieve, i.e. load some data into memory once and use the same data (in memory) across multiple runs of your code, with possible modifications to that code, then there is no such IDE, as IDEs are not meant to store any data.
What you can do is first load your data into some in-memory database and write your C++ program to read the data from that database instead of reading it directly from the data source.
how to avoid multiple reads of a big data set. What kind of techniques do C++ users use for big data in order to read the data "once for all"?
I do not know of any C++ tool with such capabilities, but I doubt that I have ever searched for one... searching seems like something you might do. Keywords appear to be 'data frame' and 'statistical analysis' (and C++).
If you know the 'data set' format and wish to process the raw data no more than once, you might consider using POSIX shared memory.
I can imagine that (a) the 'extremely time consuming' effort could read and transform the 'raw' data, then write it into a 'data set' (a file) suitable for future efforts (i.e. 'once and for all').
Then (b) future efforts can 'simply' "map" the created 'data set' (a file) into the program's memory space, all ready for use with no (or at least much reduced) time-consuming effort.
Expanding the memory map of your program is about using POSIX access to shared memory. (Ubuntu 17.10 has it; I have 'gently' used it in C++.) Terminology includes shm_open, mmap, munmap, shm_unlink, and a few others.
From 'man mmap':
mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in ...
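A minimal sketch of that flow, assuming Linux (the object name and size are made up for illustration; error handling is trimmed to early returns):

#include <fcntl.h>     // O_CREAT, O_RDWR
#include <sys/mman.h>  // shm_open, mmap, munmap, shm_unlink
#include <sys/stat.h>  // mode constants
#include <unistd.h>    // ftruncate, close
#include <cstdio>
#include <cstring>

int main()
{
    const char*  name = "/dataset";      // hypothetical shared memory object name
    const size_t size = 1024 * 1024;     // 1 MiB, purely for illustration

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1) { std::perror("shm_open"); return 1; }
    if (ftruncate(fd, size) == -1) { std::perror("ftruncate"); return 1; }

    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }
    close(fd);                           // the mapping stays valid after close

    std::memcpy(p, "once and for all", 17);  // writer side: store the prepared 'data set'

    // A reader process would shm_open(name, O_RDONLY, 0), mmap with PROT_READ,
    // and reuse the prepared data without re-reading the raw source.

    munmap(p, size);
    // shm_unlink(name);  // call this when the data set is no longer needed
    return 0;
}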
how to avoid multiple reads of a big data set. What kind of techniques do C++ users use for big data in order to read the data "once for all"?
I recently tried my hand again at measuring std::thread context switch duration (on my Ubuntu 17.10, 64-bit desktop). My app captured <30 million entries over 10 seconds of measurement time. I also experimented with longer measurement times and with larger captures.
As part of debugging info capture, I decided to write intermediate results to a text file, for a review of what would be input to the analysis.
The code took only about 2.3 seconds to save this info to the capture text file. My original software would then proceed with the analysis.
But this delay before getting on with testing the analysis results (>12 sec = 10 + 2.3) quickly became tedious.
I found the analysis effort otherwise challenging, and recognized I might save time by capturing intermediate data, thus avoiding most (but not all) of the data measurement and capture effort. So the debug capture to an intermediate file became a convenient split point in the overall effort.
Part 2 of the split app reads the <30 million byte intermediate file in somewhat less than 0.5 seconds, very much reducing the analysis development cycle (edit-compile-link-run-evaluate), which was (usually) no longer burdened with the 12+ seconds of measurement and data generation.
While 28 MB is not BIG data, I valued the time savings for my analysis code development effort.
FYI - My intermediate file contained a single letter for each 'thread entry into the critical section' event. With 10 threads, the letters were 'A', 'B', ... 'J'. (Reminds me of DNA encoding.)
My analysis supported splitting the counts per thread. Where VxWorks would 'balance' the threads blocked at a semaphore, Linux does NOT... which was new to me.
Each thread ran a different number of times through the single critical section, but each thread got about 10% of the opportunities.
Technique: simple encoded text file with captured information ready to be analyzed.
Note: I was expecting to test piping the output of app part 1 into app part 2. Still could, I guess. WIP.
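A small sketch of what the 'part 2' reader could look like (the file name capture.txt is made up; only the one-letter-per-event encoding matches the description above):

#include <array>
#include <cstdio>
#include <fstream>

int main()
{
    std::ifstream in("capture.txt");   // hypothetical intermediate file name
    if (!in) { std::puts("no capture file"); return 1; }

    std::array<long, 10> counts{};     // one slot per thread, letters 'A'..'J'
    char c;
    while (in.get(c))
        if (c >= 'A' && c <= 'J')
            ++counts[c - 'A'];

    long total = 0;
    for (long n : counts) total += n;

    for (int i = 0; i < 10; ++i)       // per-thread share of critical section entries
        std::printf("thread %c: %ld (%.1f%%)\n", 'A' + i, counts[i],
                    total ? 100.0 * counts[i] / total : 0.0);
    return 0;
}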

Is it okay to set reduce_limit = false in the CouchDB configuration?

I am working on a map/reduce view and I always get a reduce_overflow_error each time I run the view. If I set reduce_limit = false in the CouchDB configuration, it works. I want to know whether there is a negative effect if I change this config setting? Thank you.
The setting reduce_limit=true forces CouchDB to control the size of the reduced output on each step of the reduction. If the stringified JSON output of a reduction step has more than 200 chars and is twice or more as long as the input, CouchDB's query server throws an error. Both numbers, 2x and 200 chars, are hard-coded.
Since a reduce function runs inside SpiderMonkey instance(s) with only 64 MB RAM available, the limitation set by default looks somewhat reasonable. Theoretically, reduce must fold the given data, not blow it up.
However, in real life it's quite hard to fly under the limitation in all cases. You cannot control the number of chunks for a (re)reduction step. This means you can run into a situation where your output for a particular chunk is more than twice as long in chars, although the other reduced chunks are much shorter. In this case even one awkward chunk breaks the entire reduction if reduce_limit is set.
So unsetting reduce_limit might be helpful if your reducer can sometimes output more data than it received.
A common case is unrolling arrays into objects. Imagine you receive a list of arrays like [[1,2,3...70], [5,6,7...], ...] as input rows. You want to aggregate your list in the manner {key0:(sum of 0th elts), key1:(sum of 1st elts)...}.
If CouchDB decides to send you a chunk with 1 or 2 rows, you get an error. The reason is simple: object keys are also accounted for when calculating the result length.
A possible (but very hard to hit) negative effect is the SpiderMonkey instance constantly restarting/failing on RAM over-quota when trying to process a reduction step or the entire reduction. Restarting SM is CPU and RAM intensive and generally costs hundreds of milliseconds.

High CPU usage by Django App

I've created a pretty simple Django app which somehow produces a high CPU load: rendering a simple generic view with a list of simple models (20 of them) and 5-6 SQL queries per page produces an Apache process that loads the CPU to 30%-50%. While memory usage is pretty OK (30 MB), the CPU load is not OK to my understanding, and this is not because of Apache/WSGI settings or anything like that; the same CPU load happens when I run the app via runserver.
Since, I'm new to Django I wanted to ask:
1) Are these 30-50% figures a usual thing for a Django app? (Django 1.4, Ubuntu 12.04, Python 2.7.3)
2) How do I profile the CPU load? I used the profiling middleware from here: http://djangosnippets.org/snippets/186/ but it only shows millisecond timings, not CPU load numbers, and there was nothing special there. So how do I identify what eats up so much CPU power?
CPU usage by itself doesn't tell you how efficient your app is. A more important metric for measuring performance is how many requests per second your app can process. The kind of processor your machine has naturally also has a huge effect on the results.
I suggest you run ab with multiple concurrent requests and compare the requests/second number to some benchmarks (there should be many around the net). ab will try to test the maximum throughput, so it's natural that one of the resources will be fully utilized (the bottleneck); usually this is disk I/O. As an example, if you happen to get CPU usage close to 100% it may mean you are wasting CPU somewhere (reqs/second is low) or that you have optimized disk I/O well (reqs/second is high).
Looking at the %CPU column is not very accurate. I certainly see spikes of 50%-100% CPU all the time... it does not indicate how long a CPU is being used, just that we hit that value at that specific moment. These spikes would fall into the min/max figures, not your average CPU usage.
Another important piece: say you have 4 cores, as I do, which means the 30-50% figure in top is out of a maximum of 400%. 50% in top means 50% of one core, 12.5% across all four, etc.
You can press 1 in top to see per-core CPU figures.

Anyone benchmarked virtual machine performance for build servers?

We have been trying to use virtual machines for build servers. Our build servers are all running WinXP32 and we are hosting them on VMware Server 2.0 running on Ubuntu 9.10. We build a mix of C, C++ and Python packages, plus various other deployment tasks (installers, 7z files, archives, etc). Managing the VMware-hosted build servers is great. We can move them around, share system resources on one large 8-core box, remotely access the systems through a web interface, and just generally manage things better.
But the problem is that the performance compared to a physical machine seems to range from bad to horrid depending upon what day it is. It has proven very frustrating. Sometimes the system load for the host will go above 20 and sometimes it will be below 1. It doesn't seem to be based on how much work is actually being done on the systems. I suspect there is a bottleneck in the system, but I can't seem to figure out what it is. (The most recent suspect is I/O, but we have a dedicated 1 TB 7200 RPM SATA 2 drive with 32 MB of cache doing nothing but serving the virtual machines. That seems like enough for 1-2 machines. All the other specs seem sufficient too: 8 GB RAM with 2 GB per VM, and 8 cores with 1 per VM.)
So after exhausting everything I can think of, I wanted to turn to the Stack Overflow community.
Has anyone run, or seen anyone else run, benchmarks of software build performance within a VM?
What should we expect relative to a physical system?
How much performance are we giving up?
What hardware / vm server configurations are people using?
Any help would be greatly appreciated.
Disk IO is definitely a problem here; you just can't do any significant amount of disk IO when you're backing it with a single spindle. The 32 MB cache on a single SATA drive is going to be saturated just by your host and a couple of guest OSes ticking over. If you look at the disk queue length counter in your Ubuntu host OS you should see that it is high (anything above 1 on this system with 2 drives for any length of time means something is waiting for that disk).
When I'm sizing infrastructure for VMs I generally take a ballpark of 30-50 IOPS per VM as an average, and that's for systems that do not exercise the disk subsystem very much. For systems that don't require a lot of IO activity you can drop down a bit, but the IO patterns for build systems will be heavily biased towards lots of very random, fairly small reads. To compound the issue, you want a lot of those VMs building concurrently, which will drive contention for the disk through the roof. Overall disk bandwidth is probably not a big concern (that SATA drive can probably push 70-100 MB/sec when the IO pattern is totally sequential) but when the files are small and scattered you are IO bound by the limits of the spindle, which will be about 70-100 IOs per second on a 7.2k SATA drive. A host OS running a Type 2 hypervisor like VMware Server with a single guest will probably hit that under a light load.
My recommendation would be to build a RAID 10 array with smaller and ideally faster drives. 10k SAS drives will give you 100-150 IOPS each, so a pack of 4 can handle 600 read IOPS and 300 write IOPS before topping out. Also make sure you align all of the data partitions on the drive hosting the VMDKs, and within the guest OSes if you are putting the VM files on a RAID array. For workloads like these that will give you a 20-30% disk performance improvement. Avoid RAID 5 for something like this; space is cheap, and the write penalty on RAID 5 means you need 4 drives in a RAID 5 pack to equal the write performance of a single drive.
One other point I'd add is that VMware Server is not a great hypervisor in terms of performance; if at all possible, move to a Type 1 hypervisor (like ESXi v4, which is also free). It's not trivial to set up and you lose the host OS completely, so that might be an issue, but you'll see far better IO performance across the board, particularly for disk and network traffic.
Edited to respond to your comment.
1) To see whether you actually have a problem on your existing Ubuntu host.
I see you've tried dstat; I don't think it gives you enough detail to understand what's happening, but I'm not familiar with using it so I might be wrong. iostat will give you a good picture of what is going on - this article on using iostat will help you get a better picture of the actual IO pattern hitting the disk: http://bhavin.directi.com/iostat-and-disk-utilization-monitoring-nirvana/ . The avgrq-sz and avgqu-sz columns are the raw indicators of how many requests are queued. High numbers are generally bad, but what is actually bad varies with the disk type and RAID geometry. What you are ultimately interested in is seeing whether your disk IOs are spending more/increasing time in the queue than actually being serviced. The calculation (await-svctim)/await*100 really tells you whether your disk is struggling to keep up: above 50% and your IOs are spending as long queued as being serviced by the disk(s); if it approaches 100% the disk is getting totally slammed. If you do find that the host is not actually stressed and VMware Server is just lousy (which it could well be, I've never used it on a Linux platform) then you might want to try one of the alternatives like VirtualBox before you jump onto ESXi.
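As a tiny illustration of that rule of thumb (the sample numbers are invented):

#include <cstdio>

// What fraction of an IO's total time (await) is spent queued rather than
// being serviced (svctm)?
double queued_percentage(double await_ms, double svctm_ms)
{
    return (await_ms - svctm_ms) / await_ms * 100.0;
}

int main()
{
    // e.g. iostat -x reporting await = 40 ms and svctm = 8 ms for the VM disk
    std::printf("%.0f%% of IO time spent queued\n", queued_percentage(40.0, 8.0));
    return 0;   // prints 80% -> by the rule above the disk is struggling badly
}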
2) To figure out what you need.
Baseline the IO requirements of a typical build on a system that has good/acceptable performance - on Windows look at the IOPS counters, Disk Reads/sec and Disk Writes/sec, and make sure the average queue length is <1. You need to know the peak values for both while the system is loaded; instantaneous peaks could be very high if everything is coming from the disk cache, so watch for sustained peak values over the course of a minute or so. Once you have those numbers you can scope out a disk subsystem that will deliver what you need. The reason you need to look at the IO numbers is that they reflect the actual switching that the drive heads have to go through to complete your reads and writes (the IOs per second, IOPS), and unless you are doing large file streaming or full disk backups they will most accurately reflect the limits your disk will hit under load.
Modern disks can sustain approximately the following:
7.2k SATA drives - 70-100 IOPS
10k SAS drives - 120-150 IOPS
15k SAS drives - 150-200 IOPS
Note these are approximate numbers for typical drives and represent the saturated capability of the drives under maximum load with unfavourable IO patterns. This is designing for worst case, which is what you should do unless you really know what you are doing.
RAID packs allow you to parallelize your IO workload, and with a decent RAID controller an N-drive RAID pack will give you N*(base IOPS for 1 disk) for read IO. For write IO there is a penalty caused by the RAID policy: RAID 0 has no penalty, writes are as fast as reads. RAID 5 requires 2 reads and 2 writes per IO (read parity, read existing block, write new parity, write new block), so it has a penalty of 4. RAID 10 has a penalty of 2 (2 writes per IO). RAID 6 has a penalty of 5. To figure out how many IOPS you need from a RAID array, you take the basic read IOPS number your OS needs and add to that the product of the write IOPS number the OS needs and the relevant penalty factor.
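A quick sketch of that arithmetic (it reuses the 30 read / 20 write IOPS per VM, 8 VM baseline from the worked example below, and the conservative 150 IOPS figure for a 15k SAS drive):

#include <cmath>
#include <cstdio>

// Back-end IOPS the array must deliver = read IOPS + write IOPS * RAID write penalty.
int required_iops(int read_iops, int write_iops, int write_penalty)
{
    return read_iops + write_iops * write_penalty;
}

// Drives needed for a given per-drive IOPS figure (rounded up; RAID 10 would
// additionally need rounding to an even drive count).
int drives_needed(int iops, int iops_per_drive)
{
    return static_cast<int>(std::ceil(static_cast<double>(iops) / iops_per_drive));
}

int main()
{
    const int reads  = 30 * 8;   // 30 read IOPS per build VM, 8 VMs -> 240
    const int writes = 20 * 8;   // 20 write IOPS per build VM, 8 VMs -> 160

    const int raid10 = required_iops(reads, writes, 2);   // penalty 2 -> 560
    const int raid5  = required_iops(reads, writes, 4);   // penalty 4 -> 880

    std::printf("RAID 10: %d IOPS, about %d x 15k SAS drives at 150 IOPS each\n",
                raid10, drives_needed(raid10, 150));
    std::printf("RAID 5:  %d IOPS, about %d x 15k SAS drives at 150 IOPS each\n",
                raid5, drives_needed(raid5, 150));
    return 0;
}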
3) Now work out the structure of the RAID array that will meet your performance needs
If your analysis of a physical baseline system tells you that you only need 4-5 IOPS then your single drive might be OK. I'd be amazed if it does, but don't take my word for it - get your data and make an informed decision.
Anyway, let's assume you measured 30 read IOPS and 20 write IOPS during your baseline exercise and you want to be able to support 8 instances of these build systems as VMs. To deliver this your disk subsystem will need to be able to support 240 read IOPS and 160 write IOPS to the OS. Adjust your own calculations to suit the number of systems you really need.
If you choose RAID 10 (and I strongly encourage it; RAID 10 sacrifices capacity for performance, but when you design for enough performance you can size the disks to get the capacity you need, and the result will usually be cheaper than RAID 5 unless your IO pattern involves very few writes), your disks need to be able to deliver 560 IOPS in total (240 for reads and 320 for writes, in order to account for the RAID 10 write penalty factor of 2).
This would require:
- 4 15k SAS drives
- 6 10k SAS drives (rounded up; RAID 10 requires an even number of drives)
- 8 7.2k SATA drives
If you were to choose RAID 5 you would have to adjust for the increased write penalty and would therefore need 880 IOPS to deliver the performance you want.
That would require:
- 6 15k SAS drives
- 8 10k SAS drives
- 14 7.2k SATA drives
You'll have a lot more space this way but it will cost almost twice as much because you need so many more drives and you'll need a fairly big box to fit those into. This is why I strongly recommend RAID 10 if performance is any concern at all.
Another option is to find a good SSD (like the Intel X25-E, not the X25-M or anything cheaper) that has enough storage to meet your needs. Buy two and set them up in RAID 1; SSDs are pretty good, but their failure rates (even for drives like the X25-E) are currently worse than those of rotating disks, so unless you are prepared to deal with a dead system you want RAID 1 at a minimum. Combined with a good high-end controller, something like the X25-E will easily sustain 6k IOPS in the real world, the equivalent of 30 15k SAS drives. SSDs are quite expensive per GB of capacity, but used appropriately they can deliver much more cost-effective solutions for tasks that are IO intensive.