Efficient and safe storage of string pairs in C++

I am making a volume (voxel) based terrain generation engine using PolyVox, and will need to store lots of volume information. The PolyVox library makes it easy to pull out the values; however, the large number of chunks makes it infeasible to store each chunk as a separate file.
It would likely be easiest to pull out the volume information as a hexadecimal number for the chunk id and a string for the volume data, but how do I store this information efficiently?
I have considered databases (I really wanted to use Tokyo Cabinet!), but I have not found any C++ libraries that are compatible with Windows and fit my needs. Also, databases can be susceptible to corruption, and I would like to protect the user's world data as much as possible.
Does anyone have a thought on how to organize and save this information effectively? I have been pulling my hair out all day on this one. Does anyone know any good libraries that could help?
Thank you!

Kyoto Cabinet.
http://fallabs.com/kyotocabinet/
Alternatively, Boost.Interprocess has a map container which may solve your problem.
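A minimal sketch of using Kyoto Cabinet's C++ API (HashDB from kchashdb.h) for the chunk-id/value use case; the file name, key format, and payload here are illustrative assumptions, not part of the question:

```cpp
#include <kchashdb.h>
#include <iostream>
#include <string>

int main() {
  kyotocabinet::HashDB db;

  // Open (or create) a single on-disk hash database file holding all chunks.
  if (!db.open("world.kch", kyotocabinet::HashDB::OWRITER |
                            kyotocabinet::HashDB::OCREATE)) {
    std::cerr << "open error: " << db.error().name() << std::endl;
    return 1;
  }

  // Key: hexadecimal chunk id; value: serialized volume data (both made up).
  db.set("0x00A3", "serialized-voxel-data-for-this-chunk");

  std::string value;
  if (db.get("0x00A3", &value)) {
    std::cout << "chunk payload: " << value << std::endl;
  }

  db.close();
  return 0;
}
```

One file per world instead of one file per chunk keeps the directory manageable, and the library handles the on-disk layout and lookups for you.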

Related

Creating maps of maps from a dictionary file in c++

I have a text file containing a list of words (about 35 MB of data). I wrote an application that works pretty much like a Scrabble helper. I find it insufficient to load the whole file into a set, since that takes something like 10 minutes. I am not very experienced in C++, so I want to ask what a better way to achieve this would be.

In my first version of the application I just binary searched through the file. I managed to do this without loading it, just moving the file pointer using seekg. But this solution isn't as fast as using maps of maps. When searching for a word I look up its first letter in a map, then I retrieve a map of possible second letters and do another search (for the second letter), and so on. In that way I can tell whether the word is in the dictionary much faster.

How can I achieve this without loading the whole file into the program to build these maps? Can I save them in a database and read them from there? Would that be faster?
35MB of data is tiny. There's no problem with loading it all into memory, and no reason for it to take 10 minutes to load. If it takes so long, I suspect your loading scheme recopies maps.
However, instead of fixing this, or coming up with your own scheme, perhaps you should try something ready.
Your description sounds like you could use a database of nested structures. MongoDB, which has a C++ interface, is one possible solution.
For improved efficiency, you could go a bit fancier with the scheme: for words of up to, say, 5 letters you could use a multikey index, and beyond that a fully nested structure.
Just don't do it yourself. Concentrate on your program logic.
First, I agree with Ami that 35 MB shouldn't in principle take that long to load and store in memory. Could there be a problem with your loading code (for example, accidentally copying maps, causing lots of allocation/deallocation)?
If I understand your intention correctly, you are building a kind of trie structure (trie, not tree) using maps of maps as you described. This can be very nice if it lives in memory, but if you want to load only part of the maps into memory it becomes very difficult (not technically, but in deciding which maps to load and which not to). You would then risk reading much more data from disk than actually needed, although there are some implementations of persistent tries around.
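For reference, a minimal in-memory sketch of the maps-of-maps trie idea described above; the node layout is illustrative and this is not a persistent or disk-backed implementation:

```cpp
#include <map>
#include <string>

// Each node maps the next letter to a child node; a flag marks complete words.
struct TrieNode {
  std::map<char, TrieNode> children;
  bool is_word = false;
};

void insert(TrieNode& root, const std::string& word) {
  TrieNode* node = &root;
  for (char c : word) node = &node->children[c];  // creates missing children
  node->is_word = true;
}

bool contains(const TrieNode& root, const std::string& word) {
  const TrieNode* node = &root;
  for (char c : word) {
    auto it = node->children.find(c);
    if (it == node->children.end()) return false;
    node = &it->second;
  }
  return node->is_word;
}
```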
If your intent is to have the indexing scheme on disk, I'd rather advise you to use a traditional B-tree data structure, which is designed to optimize loading of partial indexes. You can write your own, but there are already a couple of implementations around (see this SO question).
You could also use something like SQLite, which is a lightweight DBMS that you can easily embed in your application.

Best approach to storing scientific data sets on disk C++

I'm currently working on a project that requires working with gigabytes of scientific data sets. The data sets are in the form of very large arrays (30,000 elements) of integers and floating point numbers. The problem here is that they are too large to fit into memory, so I need an on-disk solution for storing and working with them. To make this problem even more fun, I am restricted to using a 32-bit architecture (as this is for work) and I need to try to maximize performance for this solution.
So far, I've worked with HDF5, which worked okay, but I found it a little too complicated to work with. So, I thought the next best thing would be to try a NoSQL database, but I couldn't find a good way to store the arrays in the database short of casting them to character arrays and storing them like that, which caused a lot of bad pointer headaches.
So, I'd like to know what you guys recommend. Maybe you have a less painful way of working with HDF5 while at the same time maximizing performance. Or maybe you know of a NoSQL database that works well for storing this type of data. Or maybe I'm going in the totally wrong direction with this and you'd like to smack some sense into me.
Anyway, I'd appreciate any words of wisdom you guys can offer me :)
Smack some sense into yourself and use a production-grade library such as HDF5. So you found it too complicated, but did you find its high-level APIs?
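For instance, the HDF5 Lite (H5LT) convenience layer cuts the boilerplate down to a few calls; a rough sketch, with the file name, dataset name, and sizes made up for illustration:

```cpp
#include <hdf5.h>
#include <hdf5_hl.h>
#include <vector>

int main() {
  // One large 1-D dataset of doubles; the size here is just illustrative.
  const hsize_t dims[1] = {30000};
  std::vector<double> data(30000, 1.5);

  hid_t file = H5Fcreate("dataset.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

  // One call to create and write the dataset.
  H5LTmake_dataset_double(file, "/measurements", 1, dims, data.data());

  // Read it back in one call.
  std::vector<double> readback(30000);
  H5LTread_dataset_double(file, "/measurements", readback.data());

  H5Fclose(file);
  return 0;
}
```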
If you don't like that answer, try one of the emerging array databases such as SciDB, rasdaman or MonetDB. I suspect though, that if you have baulked at HDF5 you'll baulk at any of these.
In my view, and experience, it is worth the effort to learn how to properly use a tool such as HDF5 if you are going to be working with large scientific data sets for any length of time. If you pick up a tool such as a NoSQL database, which was not designed for the task at hand, then, while it may initially be easier to use, eventually (before very long would be my guess) it will lack features you need or want and you will find yourself having to program around its deficiencies.
Pick one of the right tools for the job and learn how to use it properly.
Assuming your data sets really are large enough to merit it (e.g., instead of 30,000 elements, a 30,000x30,000 array of doubles), you might want to consider STXXL. It provides interfaces that imitate (and largely succeed at imitating) those of the collections in the C++ standard library, but are designed to work with data too large to fit in memory.
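A rough sketch of STXXL's external-memory vector via its documented VECTOR_GENERATOR pattern; the element count and values are illustrative:

```cpp
#include <stxxl/vector>
#include <iostream>

int main() {
  // An external-memory vector; blocks are paged to and from disk automatically.
  typedef stxxl::VECTOR_GENERATOR<double>::result vector_type;
  vector_type v;

  v.resize(100000000);  // far more than fits comfortably in a 32-bit address space
  for (vector_type::size_type i = 0; i < v.size(); ++i)
    v[i] = static_cast<double>(i) * 0.5;

  std::cout << "last element: " << v[v.size() - 1] << std::endl;
  return 0;
}
```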
I have been working in scientific computing for years, and I think HDF5 or NetCDF is a good data format for you to work with. They provide efficient parallel read/write, which is important for dealing with big data.
An alternate solution is to use an array database, like SciDB, MonetDB, or RasDaMan. However, it will be rather painful if you try to load HDF5 data into an array database. I once tried to load HDF5 data into SciDB, and it required a series of data transformations. You need to know whether you will query the data often or not; if not, the time-consuming loading may not be worthwhile.
You may also be interested in this paper; it describes querying HDF5 data directly using SQL.

Offline embedded realtime routing

I am currently working on a senior design project for school and have come across a design issue that I do not know how to solve. I need real-time, offline routing for an embedded walking application.
I have not been able to find any libraries that suit my needs. I understand I might have to make either my own vectorized map of my local town or my own routing algorithm. I will not go into much detail about what my project entails, but it does not require a large map, maybe a 5x5 mile grid. The maps can be loaded from SD if they need to be changed.
I see there are GpsMid, YOURs, and others all using OpenStreetMap data.
We will have a TI micro-controller for processing and a GPS card for real-time lat/lon. I just do not know how to take the real-time info and route using a static map.
Thanks,
Matt
I'm not well versed in what is typically used for real-time routing with GPS and vectorized maps, but I can recommend some general algorithms that can be used as tools to help you get your project done.
A* search is a pretty typical path finding algorithm. http://en.wikipedia.org/wiki/A_star
Depending on how you organize your data, you may also find Dijkstra's algorithm to be helpful. http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
These algorithms are popular enough that you should be able to find example code in whatever language you want, although I'd be very skeptical of the quality. I'd recommend writing your own, since you are in school, as it'd be beneficial for you to have written and debugged them on your own at least once in your career. When you are done, you'll have a tried and true implementation to call your own.
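As a starting point, a minimal Dijkstra sketch over an adjacency list; the graph representation here is an assumption for illustration, and a real router would build it from the map data:

```cpp
#include <vector>
#include <queue>
#include <limits>
#include <utility>
#include <functional>

// Shortest distances from `start` to every node in a weighted graph.
// adj[u] holds pairs (neighbour, edge_weight).
std::vector<double> dijkstra(
    const std::vector<std::vector<std::pair<int, double>>>& adj, int start) {
  const double INF = std::numeric_limits<double>::infinity();
  std::vector<double> dist(adj.size(), INF);

  // Min-heap of (distance, node).
  std::priority_queue<std::pair<double, int>,
                      std::vector<std::pair<double, int>>,
                      std::greater<std::pair<double, int>>> pq;

  dist[start] = 0.0;
  pq.push({0.0, start});

  while (!pq.empty()) {
    auto [d, u] = pq.top();
    pq.pop();
    if (d > dist[u]) continue;  // stale queue entry, skip it
    for (const auto& [v, w] : adj[u]) {
      if (dist[u] + w < dist[v]) {
        dist[v] = dist[u] + w;
        pq.push({dist[v], v});
      }
    }
  }
  return dist;
}
```

A* is the same loop with the priority biased by a heuristic (e.g., straight-line distance to the goal), which usually matters for larger maps than a 5x5 mile grid.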
Seems to me there are two parts to this:
1 - Identifying map data that tells you what's a road/path (a potential route). I would expect this is already in the data in some way; it could be as simple as which colour any given line is.
2 - Calculating a route over those paths. This is well documented/discussed and there are plenty of algorithms out there for the problem. These days it's hardly worth trying very hard for elegance/efficiency; you can just throw CPU cycles at it until an answer pops out.
Also, should this be tagged [homework]?

c++ pivot table implementation

Similar to this question Pivot Table in c#, I'm looking for an implementation of a pivot table in c++. Due to the project requirements speed is fairly critical, and the rest of the performance-critical part of the project is written in c++, so an implementation in c++ or callable from c++ would be highly desirable. Does anyone know of implementations of a pivot table similar to the one found in Excel or OpenOffice?
I'd rather not have to code such a thing from scratch, but if I were to do this, how should I go about it? What algorithms and data structures would be good to be aware of? Any links to an algorithm would be greatly appreciated.
I am sure you are not asking for the full feature set of Excel's pivot tables. I think you want a simple statistics table based on discrete explanatory variables and a given statistic. If so, this is a case where writing it from scratch might be faster than studying other implementations.
Just update a std::map (or similar data structure) whose key represents the combination of explanatory variables and whose value is the given statistic, as the program reads each data point.
Once the reading is done, it's just a matter of organizing the output table from the map, which might be trivial depending on your goal.
I believe most of the C# examples in the question you linked take this approach anyway.
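A minimal sketch of that approach, aggregating a sum per key combination with std::map; the record fields and the choice of statistic are assumptions for illustration:

```cpp
#include <map>
#include <string>
#include <vector>
#include <iostream>

struct Record {           // illustrative input row
  std::string region;     // explanatory variable 1
  std::string product;    // explanatory variable 2
  double amount;          // value to aggregate
};

int main() {
  std::vector<Record> data = {
    {"North", "Widget", 10.0}, {"North", "Gadget", 4.0},
    {"South", "Widget", 7.5},  {"North", "Widget", 2.5},
  };

  // Key = combination of explanatory variables; value = running statistic (a sum here).
  std::map<std::pair<std::string, std::string>, double> pivot;
  for (const Record& r : data)
    pivot[{r.region, r.product}] += r.amount;

  // Emitting the table is then just a walk over the map.
  for (const auto& cell : pivot)
    std::cout << cell.first.first << " / " << cell.first.second
              << " = " << cell.second << "\n";
  return 0;
}
```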
I'm not aware of an existing implementation that would suit your needs, so, assuming you were to write one...
I'd suggest using SQLite to store your data and SQL to compute the aggregates (note: SQL won't do median, so I'd suggest an abstraction at some stage to allow such behavior). The benefit of using SQLite is that it's pretty flexible and extremely robust, plus it lets you take advantage of their hard work in storing and manipulating data. Wrapping the interface you expect from your pivot table around this concept seems like a good way to start, and should save you quite a lot of time.
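To illustrate, a minimal sketch using the sqlite3 C API with an in-memory database and a GROUP BY aggregate; the table and column names are made up for the example:

```cpp
#include <sqlite3.h>
#include <cstdio>

// Print each aggregated row returned by the query.
static int print_row(void*, int argc, char** argv, char** col) {
  for (int i = 0; i < argc; ++i)
    std::printf("%s=%s%s", col[i], argv[i] ? argv[i] : "NULL",
                i + 1 < argc ? ", " : "\n");
  return 0;
}

int main() {
  sqlite3* db = nullptr;
  sqlite3_open(":memory:", &db);  // in-memory for the sketch; use a file in practice

  const char* setup =
      "CREATE TABLE sales(region TEXT, product TEXT, amount REAL);"
      "INSERT INTO sales VALUES('North','Widget',10.0),('North','Gadget',4.0),"
      "('South','Widget',7.5),('North','Widget',2.5);";
  sqlite3_exec(db, setup, nullptr, nullptr, nullptr);

  // The pivot-style aggregation: one row per (region, product) combination.
  const char* query =
      "SELECT region, product, SUM(amount) AS total "
      "FROM sales GROUP BY region, product;";
  sqlite3_exec(db, query, print_row, nullptr, nullptr);

  sqlite3_close(db);
  return 0;
}
```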
You could then combine this with a model-view-controller architecture for UI components, I anticipate that would work like a charm. I'm a very satisfied user of Qt, so in this regard I'd suggest using Qt's QTableView in combination with QStandardItemModel (if I can get away with it) or QAbstractItemModel (if I have to). Not sure if you wanted this suggestion, but it's there if you want it :).
Hope that gives you a starting point, any questions or additions, don't hesitate to ask.
I think the reason your question didn't get much attention is that it's not clear what your input data is, nor what options for pivot table you want to support.
A pivot table, in its basic form, amounts to running through the data and aggregating values into buckets. For example, say you want to see how many items you shipped each week from each warehouse over the last few weeks:
You would create a multi-dimensional array of buckets (rows are weeks, columns are warehouses), then run through the data, deciding which bucket each record belongs to, adding the amount from the record you're looking at, and moving on to the next record.
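A compact sketch of that bucket approach with a fixed 2-D array; the number of weeks, warehouses, and the shipment fields are made up for illustration:

```cpp
#include <array>
#include <vector>
#include <iostream>

struct Shipment {       // illustrative record
  int week;             // 0..NUM_WEEKS-1
  int warehouse;        // 0..NUM_WAREHOUSES-1
  int items;
};

int main() {
  constexpr int NUM_WEEKS = 4, NUM_WAREHOUSES = 3;
  std::vector<Shipment> records = {
    {0, 1, 50}, {0, 1, 20}, {1, 0, 10}, {3, 2, 75},
  };

  // Rows are weeks, columns are warehouses; each cell is one bucket.
  std::array<std::array<long, NUM_WAREHOUSES>, NUM_WEEKS> buckets{};
  for (const Shipment& s : records)
    buckets[s.week][s.warehouse] += s.items;  // pick the bucket, add the amount

  for (int w = 0; w < NUM_WEEKS; ++w) {
    for (int h = 0; h < NUM_WAREHOUSES; ++h)
      std::cout << buckets[w][h] << "\t";
    std::cout << "\n";
  }
  return 0;
}
```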

Streaming Real and Debug Data To Disk in C++

What is a flexible way to stream data to disk in a c++ program in Windows?
I am looking to create a flexible stream of data that may contain arbitrary values (say time, an average, a flag if reset, etc.), written to disk for later analysis. Data may come in at non-uniform, irregular intervals. Ideally this stream would have minimal overhead and be easily readable in something like MATLAB, so I could easily analyze events and data.
I'm thinking of a binary file with a header describing the types of packets, followed by a raw dump of tagged data. I'm considering a lean, custom format but would also be interested in something like HDF5.
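A minimal sketch of what such a tagged binary record format could look like; the packet layout, tag values, and field names are assumptions for illustration, not a recommendation of a specific format:

```cpp
#include <cstdint>
#include <fstream>

// Illustrative fixed-size record: a type tag, a timestamp, and a value.
#pragma pack(push, 1)
struct Packet {
  uint16_t type;       // e.g. 1 = average sample, 2 = reset flag (made-up tags)
  double   timestamp;  // seconds since start
  double   value;
};
#pragma pack(pop)

int main() {
  std::ofstream out("log.bin", std::ios::binary | std::ios::app);

  Packet sample{1, 12.5, 42.0};
  out.write(reinterpret_cast<const char*>(&sample), sizeof(sample));  // append one record

  Packet reset{2, 13.0, 0.0};
  out.write(reinterpret_cast<const char*>(&reset), sizeof(reset));
  return 0;
}
```

Fixed-layout records like this are easy to read back with MATLAB's fread, but you take on responsibility for padding, endianness, and versioning yourself, which is part of why the answers below lean toward an existing format.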
It is probably better to use an existing file format rather than a custom one. First, you don't reinvent the wheel; second, you benefit from a well-tested and optimized library.
HDF5 seems like a good bet. It is fast and reliable, and easy to read from MATLAB. It has some overhead, but that is the price of great flexibility and compatibility.
This requirement sounds suspiciously like a "database".