EBR block in Lattice Diamond

I have a MachXO3 chip. Family datasheet is available here: http://www.latticesemi.com/~/media/LatticeSemi/Documents/DataSheets/MachXO23/DS1047-MachXO3-Family-Data-Sheet.pdf?document_id=50121
The datasheet says on page 2-10 that EBR is composed of 9-kbit blocks. But Table 1-1 on page 1-2 lists numbers that are not divisible by 9 at all...
Also, I have the following code:
reg [7:0] lineB0[1:0][127:0];
reg [7:0] lineB1[1:0][127:0];
and the report says that it takes 4 EBRs. That sounds completely unoptimized. Why is that? How can I craft my table of 2*(2*128) bytes = 512 bytes = 4096 bits = 4 kbit so that it fits in a single EBR?

The automatic inference algorithm is not always super efficient. I would generally recommend using IPexpress to create a RAM or ROM if resource usage is an issue. The tool reports a resource use of 1 EBR for a 512x8 dual-port RAM (RAM_DP). Depending on the organisation/application of your RAM, a 128x(8+8) layout might be a good alternative, under the assumption that you always want to read the same index for lineB0 and lineB1.
A friendly reminder: premature optimization is the root of all (or at least much) evil. So investing your time in other topics might be more worthwhile if the number of EBRs used is not actually a limitation right now.

C++ environment/IDE to avoid multiple reads of big data sets

I am currently working on a big dataset (approximately a billion data points) and I have decided to use C++ over R in particular for convenience in memory allocation.
However, there does not seem to be an equivalent of RStudio for C++ that could "store" the data set and avoid having to read the data every time I run the program, which is extremely time-consuming...
What kind of techniques do C++ users use for big data in order to read the data "once and for all"?
Thanks for your help!
If I understand what you are trying to achieve, i.e. load some data into memory once and use the same data (in memory) across multiple runs of your code, with possible modifications to that code, there is no such IDE, as IDEs are not meant to store any data.
What you can do is first load your data into some in-memory database and write your C++ program to read the data from that database instead of reading it directly from the data source in C++.
How to avoid multiple reads of a big data set. What kind of techniques do C++ users use for big data in order to read the data "once and for all"?
I do not know of any C++ tool with such capabilities, but then I doubt I have ever searched for one ... searching seems like something you might do. Keywords appear to be 'data frame' and 'statistical analysis' (and C++).
If you know the 'data set' format, and wish to process the raw data no more than one time, you might consider using POSIX shared memory.
I can imagine that (a) the 'extremely time-consuming' effort could read and transform the 'raw' data and write it into a 'data set' (a file) suitable for future efforts (i.e. 'once and for all').
Then (b) future efforts can 'simply' "map" the created 'data set' (a file) into the program's memory space, all ready for use with no (or at least much reduced) time-consuming effort.
Expanding the memory map of your program is about using 'POSIX' access to shared memory. (Ubuntu 17.10 has it; I have 'gently' used it in C++.) Terminology includes shm_open, mmap, munmap, shm_unlink, and a few others.
From 'man mmap':
mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in ...
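As a minimal sketch of step (b), assuming the preprocessed 'data set' is a flat binary file of packed doubles named data.bin (both the file name and the record type are illustrative, not from the original post), mapping it read-only could look like this:

    // Minimal sketch: map a preprocessed binary file into memory read-only.
    // Assumes "data.bin" holds tightly packed doubles; names are illustrative.
    #include <fcntl.h>      // open
    #include <sys/mman.h>   // mmap, munmap
    #include <sys/stat.h>   // fstat
    #include <unistd.h>     // close
    #include <cstdio>
    #include <cstdlib>

    int main()
    {
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        struct stat sb;
        if (fstat(fd, &sb) < 0) { perror("fstat"); return EXIT_FAILURE; }

        // Map the whole file; the kernel pages it in on demand and caches it,
        // so repeated runs mostly hit the page cache instead of re-reading.
        void* addr = mmap(nullptr, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

        const double* data  = static_cast<const double*>(addr);
        const size_t  count = sb.st_size / sizeof(double);
        // ... analysis uses data[0] .. data[count - 1] directly ...
        printf("mapped %zu values, first = %f\n", count, count ? data[0] : 0.0);

        munmap(addr, sb.st_size);
        close(fd);
        return EXIT_SUCCESS;
    }

A shm_open/shm_unlink variant works the same way, except the object lives in a RAM-backed filesystem (/dev/shm on Linux) and disappears when unlinked or at reboot.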
How to avoid multiple reads of a big data set. What kind of techniques do C++ users use for big data in order to read the data "once and for all"?
I recently tried my hand again at measuring std::thread context switch duration (on my Ubuntu 17.10, 64-bit desktop). My app captured <30 million entries over 10 seconds of measurement time. I also experimented with longer measurement times, and with larger captures.
As part of debugging info capture, I decided to write intermediate results to a text file, for a review of what would be input to the analysis.
The code spent only about 2.3 seconds saving this info to the capture text file. My original software would then proceed with the analysis.
But this delay before getting on with testing the analysis results (> 12 s = 10 + 2.3) quickly became tedious.
I found the analysis effort otherwise challenging, and recognized I might save time by capturing intermediate data and thus avoiding most (but not all) of the data measurement and capture effort. So the debug capture to an intermediate file became a convenient place to split the overall effort.
Part 2 of the split app reads the <30-million-byte intermediate file in somewhat less than 0.5 seconds, very much reducing the analysis development cycle (edit-compile-link-run-evaluate), which was (usually) no longer burdened with the 12+ second measure-and-generate step.
While 28 MB is not BIG data, I valued the time savings for my analysis code development effort.
FYI - my intermediate file contained a single letter for each 'thread entry into the critical section' event. With 10 threads, the letters were 'A', 'B', ... 'J'. (It reminds me of DNA encoding.)
My analysis supported splitting the counts per thread. Where VxWorks would 'balance' the threads blocked at a semaphore, Linux does NOT ... which was new to me.
Each thread ran a different number of times through the single critical section, but each thread got about 10% of the opportunities.
Technique: simple encoded text file with captured information ready to be analyzed.
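A minimal sketch of what that second stage could look like, assuming the capture file is named capture.txt and uses the 'A'..'J' encoding described above (the file name and everything else here is illustrative, not the original code):

    // Sketch of "part 2": read the single-letter-per-event capture file and
    // count how often each of the 10 threads ('A'..'J') entered the critical
    // section. File name and layout are assumptions for illustration.
    #include <array>
    #include <cstdio>
    #include <fstream>

    int main()
    {
        std::ifstream in("capture.txt");
        if (!in) { std::puts("cannot open capture.txt"); return 1; }

        std::array<long, 10> counts{};   // one counter per thread letter
        long total = 0;
        char c;
        while (in.get(c)) {
            if (c >= 'A' && c <= 'J') {  // ignore formatting characters
                ++counts[c - 'A'];
                ++total;
            }
        }

        for (int i = 0; i < 10; ++i)
            std::printf("thread %c: %ld (%.1f%%)\n", 'A' + i, counts[i],
                        total ? 100.0 * counts[i] / total : 0.0);
        return 0;
    }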
Note: I was expecting to test piping the output of app part 1 into app part 2. Still could, I guess. WIP.

Is it okay to set reduce_limit = false config in couchdb configuration?

I am working on a map/reduce view and I always get a reduce_overflow_error each time I run the view. If I set reduce_limit = false in the CouchDB configuration, it works. I want to know if there is a negative effect if I change this config setting? Thank you.
The setting reduce_limit=true instructs CouchDB to control the size of the reduced output on each step of the reduction. If the stringified JSON output of a reduction step has more than 200 characters and it's twice or more as long as the input, CouchDB's query server throws an error. Both numbers, 2x and 200 chars, are hard-coded.
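Restated in code for clarity (this just mirrors the rule as described above; it is not CouchDB's actual implementation):

    // The reduced output is rejected when it is longer than 200 characters
    // AND at least twice the length of the input it was reduced from.
    #include <string>

    bool reduce_output_overflows(const std::string& input_json,
                                 const std::string& output_json)
    {
        return output_json.size() > 200 &&
               output_json.size() >= 2 * input_json.size();
    }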
Since a reduce function runs inside SpiderMonkey instance(s) with only 64 MB of RAM available, the limitation set by default looks somewhat reasonable. Theoretically, reduce must fold, not blow up, the data it is given.
However, in real life it's quite hard to stay under the limit in all cases. You cannot control the number of chunks for a (re)reduction step. That means you can run into a situation where your output for a particular chunk is more than twice as long in characters, even though the other reduced chunks are much shorter. In that case even one awkward chunk breaks the entire reduction if reduce_limit is set.
So unsetting reduce_limit might be helpful if your reducer can sometimes output more data than it received.
A common case is unrolling arrays into objects. Imagine you receive a list of arrays like [[1,2,3...70], [5,6,7...], ...] as input rows, and you want to aggregate your list in the manner {key0: (sum of 0th elts), key1: (sum of 1st elts), ...}.
If CouchDB decides to send you a chunk with 1 or 2 rows, you get an error. The reason is simple: the object keys are also counted when calculating the result length.
A possible (but very hard to trigger) negative effect is the SpiderMonkey instance constantly restarting/failing on RAM over-quota when trying to process a reduction step or the entire reduction. Restarting SpiderMonkey is CPU- and RAM-intensive and generally costs hundreds of milliseconds.

Training Tensorflow Inception-v3 Imagenet on modest hardware setup

I've been training Inception V3 on a modest machine with a single GPU (GeForce GTX 980 Ti, 6GB). The maximum batch size appears to be around 40.
I've used the default learning rate settings specified in the inception_train.py file: initial_learning_rate = 0.1, num_epochs_per_decay = 30 and learning_rate_decay_factor = 0.16. After a couple of weeks of training, the best accuracy I was able to achieve is as follows (about 500K-1M iterations):
2016-06-06 12:07:52.245005: precision @ 1 = 0.5767 recall @ 5 = 0.8143 [50016 examples]
2016-06-09 22:35:10.118852: precision @ 1 = 0.5957 recall @ 5 = 0.8294 [50016 examples]
2016-06-14 15:30:59.532629: precision @ 1 = 0.6112 recall @ 5 = 0.8396 [50016 examples]
2016-06-20 13:57:14.025797: precision @ 1 = 0.6136 recall @ 5 = 0.8423 [50016 examples]
I've tried fiddling with the settings towards the end of the training session, but couldn't see any improvements in accuracy.
I've started a new training session from scratch with num_epochs_per_decay = 10 and learning_rate_decay_factor = 0.001 based on some other posts in this forum, but I'm sort of groping in the dark here.
Any recommendations on good defaults for a small hardware setup like mine?
TL;DR: There is no known method for training an Inception V3 model from scratch in a tolerable amount of time on a modest hardware setup. I would strongly suggest retraining a pre-trained model on your desired task.
On a small hardware setup like yours, it will be difficult to achieve maximum performance. Generally speaking, for CNNs the best performance comes with the largest batch sizes possible. This means that for CNNs the training procedure is often limited by the maximum batch size that can fit in GPU memory.
The Inception V3 model available for download here was trained with an effective batch size of 1600 across 50 GPUs -- where each GPU ran a batch size of 32.
Given your modest hardware, my number one suggestion would be to download the pre-trained model from the link above and retrain it for the individual task you have at hand. This would make your life much happier.
As a thought experiment (but hardly practical)... if you feel especially compelled to exactly match the training performance of the pre-trained model by training from scratch, you could run the following insane procedure on your 1 GPU:
1. Run with a batch size of 32.
2. Store the gradients from the run.
3. Repeat steps 1-2 a total of 50 times.
4. Average the gradients from the 50 batches.
5. Update all variables with the averaged gradients.
6. Repeat.
I am only mentioning this to give you a conceptual sense of what would need to be accomplished to achieve the exact same performance. Given the speed numbers you mentioned, this procedure would take months to run. Hardly practical.
More realistically, if you are still strongly interested in training from scratch and doing the best you can, here are some general guidelines:
Always run with the largest batch size possible. It looks like you are already doing that. Great.
Make sure that you are not CPU bound. That is, make sure that the input processing queues are always modestly full as displayed on TensorBoard. If not, increase the number of preprocessing threads or use a different CPU if available.
Re: learning rate. If you are always running synchronous training (which must be the case if you only have 1 GPU), then the higher the batch size, the higher the tolerable learning rate. I would try a series of several quick runs (e.g. a few hours each) to identify the highest learning rate that does not lead to NaNs. After you find such a learning rate, knock it down by, say, 5-10% and run with that.
As for num_epochs_per_decay and decay_rate, there are several strategies. The strategy suggested by 10 epochs per decay with a 0.001 decay factor is to hammer the model for as long as possible until the eval accuracy asymptotes, and then lower the learning rate. This is a simple strategy, which is nice. I would verify that this is what you see by monitoring the eval accuracy and confirming that it indeed asymptotes before you allow the model to decay the learning rate. Finally, the decay factor is a bit ad hoc, but lowering by, say, a power of 10 seems to be a good rule of thumb.
Note again that these are general guidelines and others might even offer differing advice. The reason we cannot give you more specific guidance is that CNNs of this size are just not often trained from scratch on a modest hardware setup.
Excellent tips.
There is precedent for training with a setup similar to yours.
Check this out - http://3dvision.princeton.edu/pvt/GoogLeNet/
They trained GoogLeNet, but using Caffe. Still, studying their experience would be useful.

Time required to open very small to very large leveldb databases

I have to give some background first. I want to implement an optimized storage engine for OSM planet data (50 GB+). The purpose of this engine is to enable map area extractions as fast as possible, while also retaining the ability to apply minutely updates. The design I've chosen for several reasons (not mentioning all of them here) is to use one data cell per grid square. E.g. think of all cells on a map being distinct files or databases: http://3.bp.blogspot.com/_CntRFtGsdQo/TTU5UMlLkTI/AAAAAAAAARk/_hW8n33t4Ok/s1600/utmworld.gif
(Just to get the idea though, this is not the actual cell grid I'll be using.)
I have never used leveldb before, but settled on it for its bulk insert and update performance. However, I'd like to know about the "performance characteristics" when opening many very small and very large leveldb databases ("very small" meaning just a few kB, "very large" meaning a few hundred MB).
I expect that I will have to open/close somewhere between 10 and 100 databases per minute. I'd rule out leveldb if opening a database requires significant initialization time.
An answer to this question could be either concrete figures, or insight into what leveldb does during initialization and how it relates to data/index size.
PS. I'll also do my own measurements of course. But as with all tests, I may draw wrong conclusions from my sample data.
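For those measurements, a minimal sketch of timing one open/close cycle with leveldb's stock C++ API might look like this (the database path is a placeholder; link against leveldb):

    // Minimal sketch: time how long leveldb takes to open and close one database.
    #include <chrono>
    #include <cstdio>
    #include <leveldb/db.h>

    int main()
    {
        leveldb::Options options;
        options.create_if_missing = true;

        const auto t0 = std::chrono::steady_clock::now();

        leveldb::DB* db = nullptr;
        leveldb::Status status = leveldb::DB::Open(options, "/tmp/cell_00042", &db);
        if (!status.ok()) {
            std::printf("open failed: %s\n", status.ToString().c_str());
            return 1;
        }

        const auto t1 = std::chrono::steady_clock::now();
        delete db;  // closes the database
        const auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::duration<double, std::milli>;
        std::printf("open: %.2f ms, close: %.2f ms\n",
                    ms(t1 - t0).count(), ms(t2 - t1).count());
        return 0;
    }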

How to force page file expansion when using boost::file_mapping

In my current genetic algorithm I'm iterating over a couple of rather large files. Right now I'm using boost::file_mapping to access this data.
I have 3 different test cases I can launch the program on (my computer has 8 GB RAM and runs Windows 8.1; for my different attempts at page file limits, read below):
a) 1000 files, about 4 MB each, so 4 GB total.
This case is, when executed first, a bit sluggish, but from the second iteration onwards the memory access isn't bottlenecking it anymore and the speed is entirely limited by my CPU.
b) 1000 files, about 6 MB each, so 6 GB total.
This is an entirely different scenario... The first iteration is proportionally slow, but even the following iterations do not speed up. I have actually considered trying to load 4 GB into memory and keep 2 GB mapped... Not sure this would actually work, but it may be worth a test... But even if it did work, it would not help with case c)...
c) 1000 files, about 13 MB each, so 13 GB total.
This is entirely hopeless. The first iteration is incredibly slow (which is understandable considering the amount of data), but even further iterations show no sign of speed improvement. And even a partial load into memory won't help much here.
Now I tried various settings for the page file limits:
1) Managed by Windows: the size of the page file stops at around 5-5.2 GB... it never gets bigger. This obviously does not help with cases b) and c) and actually causes the files to cycle through... (It would actually be helpful if at least the first 4 GB stayed; as it is right now, basically nothing is reused from the page file.)
2) Manual, min 1 GB, max 32 GB: the page file does not grow above 4.5 GB.
3) Manual, min 16 GB, max 32 GB: in case you haven't tried this yourself... don't do it. It makes booting almost impossible, and nothing runs smoothly anymore... Yeah, I didn't test my program with this, as it was unacceptable.
So, what I'm looking for is some way to tell Windows that, when using page file settings 1) or 2), I really, really want my program to be able to use a very large page file. But I don't want my computer to run entirely on the page file (as basically happens with 3)). Is there any way I could force this?
Or is there any other way to properly load the data so that, at least from the second iteration onwards, access is fast? The data consist only of huge numbers of 64-bit integers that my algorithm bit-checks against (there are a bunch of formatting symbols in between every 200-300 ints), so I only need read access.
In case the info is needed: I'm using VS Pro 2013. Portability of the code isn't an issue, it only has to run on my notebook. And of course it is a 64-bit application, and my processor supports that ;)
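For what it's worth, a minimal read-only mapping with Boost.Interprocess looks roughly like the sketch below; the file name and the example bit check are assumptions for illustration, not taken from the program described above:

    // Sketch: map one input file read-only and scan it as 64-bit words.
    // "input_000.bin" is a placeholder; the formatting symbols between the
    // integers (mentioned above) are not handled here. The constructor throws
    // if the file does not exist.
    #include <boost/interprocess/file_mapping.hpp>
    #include <boost/interprocess/mapped_region.hpp>
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        namespace bip = boost::interprocess;

        bip::file_mapping  file("input_000.bin", bip::read_only);
        bip::mapped_region region(file, bip::read_only);   // map the whole file

        const auto* words = static_cast<const std::uint64_t*>(region.get_address());
        const std::size_t count = region.get_size() / sizeof(std::uint64_t);

        // Example bit check: count words with bit 0 set. A read-only mapping is
        // backed by the file itself, so it should not consume page-file quota.
        std::size_t hits = 0;
        for (std::size_t i = 0; i < count; ++i)
            if (words[i] & 1u)
                ++hits;

        std::printf("%zu of %zu words have bit 0 set\n", hits, count);
        return 0;
    }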