What are the advantages and disadvantages of using bitvectors/bit manipulation?

This question came up while I was working with bitvectors in C. From what I understand, they're good to use because they let you store the same data in less memory, although I could be wrong. However, I find the output confusing when printing that data: I don't know what the numbers printed in the terminal represent, since the bits have been manipulated. So why would you, or wouldn't you, use bitvectors?
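For instance, here is a minimal sketch of the kind of code I mean (the flag positions are made up), including the confusing output:

#include <cstdio>

int main() {
    unsigned char flags = 0;   // eight boolean flags packed into one byte

    flags |= 1u << 2;          // set flag 2
    flags |= 1u << 5;          // set flag 5
    flags &= ~(1u << 2);       // clear flag 2 again

    // Printing the raw value is cryptic: every flag is folded into one number.
    std::printf("raw value: %u\n", (unsigned)flags);   // prints 32

    // Printing bit by bit makes the contents readable.
    for (int i = 7; i >= 0; --i)
        std::printf("%d", (flags >> i) & 1);           // prints 00100000
    std::printf("\n");
    return 0;
}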

Related

How to access multiple elements of an array efficiently in C++?

This is my first post; I hope I'll meet the standards...
For reasons of efficiency, I'm translating a program originally written in MATLAB into C++ (at which I'm quite new). The piece of code I'm currently working on condenses accesses to various indices of a matrix into one step. For example, if M1 is a matrix of size, let's say, 10x15, the program would define a new one as follows:
idxs1 = [1 2 3];
idxs2 = [1 2 3 4 5];
M2 = M1 (idxs1 , idxs2);
resulting in M2, a matrix of size 3x5. Now, I guess what MATLAB actually does is access one by one the various places of M1 given by the indices and then construct M2 by rearranging the contents acquired, all very efficiently.
My question is, how can I reproduce such a mechanism in C++? As far as I know, there is no direct way to access several indices of an array at once, and the for loop I'm using seems rather cumbersome. Maybe there's some intelligent way to do it without demanding 'too much' processor time? Also, for the sake of educational purposes, I would be grateful if someone could explain what MATLAB actually does when such an operation is performed.
Thanks in advance, and sorry for any inconvenience!
P.S: Just in case it adds anything to the question, I'm working with MEX files to link both languages.
P.S2: By the way, I found some related questions but regarding other languages:
python: Access multiple elements of an array
perl: How can I use an array slice to access several elements of an array simultaneously?
c: How to access multiple elements of array in one go?
"Armadillo is a high quality C++ linear algebra library, aiming towards a good balance between speed and ease of use
Useful for algorithm development directly in C++, or quick conversion of research code into production environments; the syntax (API) is deliberately similar to Matlab"
Link: http://arma.sourceforge.net/
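For instance, the MATLAB snippet above might translate roughly like this (a sketch assuming a reasonably recent Armadillo; the index values are shifted to C++'s 0-based convention):

#include <armadillo>

int main() {
    arma::mat M1(10, 15, arma::fill::randu);

    // MATLAB's 1-based [1 2 3] and [1 2 3 4 5] become 0-based indices here.
    arma::uvec idxs1 = {0, 1, 2};
    arma::uvec idxs2 = {0, 1, 2, 3, 4};

    arma::mat M2 = M1.submat(idxs1, idxs2);   // 3x5 submatrix, as in the MATLAB code
    M2.print("M2:");
    return 0;
}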
Math program data structures can be some of the most elaborate out there. I can't even figure out what the 3rd line of your example actually does, so I can't guess how MATLAB implements anything.
What I can tell you is that one line of MATLAB is almost certainly hiding a ton of operations. If you want to recreate it, you just need to write a utility function with a couple of for-loops that copies over all the correct indices one by one, as sketched below. Ultimately, this can't be too different from what MATLAB does.
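Something like this hypothetical helper (plain nested vectors; all names are made up):

#include <cstddef>
#include <vector>

// Copy the rows/columns named by `rows`/`cols` out of `src` (a row-of-rows
// matrix), element by element - roughly what the MATLAB one-liner has to do
// under the hood anyway.
std::vector<std::vector<double>>
select(const std::vector<std::vector<double>>& src,
       const std::vector<std::size_t>& rows,
       const std::vector<std::size_t>& cols)
{
    std::vector<std::vector<double>> dst(rows.size(),
                                         std::vector<double>(cols.size()));
    for (std::size_t i = 0; i < rows.size(); ++i)
        for (std::size_t j = 0; j < cols.size(); ++j)
            dst[i][j] = src[rows[i]][cols[j]];
    return dst;
}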
If you have a ton of matrix operations to support and you're on a large project, you might want to look into finding a C++ matrix library. I don't have one to recommend, but Boost is a popular C++ library for many purposes, including matrices. (You can also make your own data structure, but that's not recommended for a novice.)
What MATLAB exactly does is left unspecified, and might very well differ from case to case depending on the indices, and even for a given set of indices it could differ from machine to machine. So let's not speculate.
In particular, it's left unspecified whether MATLAB physically copies M1. Such a copy can be faked, which saves time. The technique is known as "copy on write".
In C++, this would be possible as well, but harder. Furthermore, none of the existing container classes supports it.
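For illustration, a toy copy-on-write wrapper might look like this (a single-threaded sketch, not a real container):

#include <cstddef>
#include <memory>
#include <vector>

// Copies of a CowMatrix share one buffer until somebody writes;
// the writer then detaches by making a private copy.
class CowMatrix {
    std::shared_ptr<std::vector<double>> data_;
    std::size_t rows_, cols_;
public:
    CowMatrix(std::size_t r, std::size_t c)
        : data_(std::make_shared<std::vector<double>>(r * c)), rows_(r), cols_(c) {}

    double operator()(std::size_t r, std::size_t c) const {
        return (*data_)[r * cols_ + c];
    }

    void set(std::size_t r, std::size_t c, double v) {
        if (data_.use_count() > 1)   // buffer is shared: copy before writing
            data_ = std::make_shared<std::vector<double>>(*data_);
        (*data_)[r * cols_ + c] = v;
    }
};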
If you're going to copy the elements, the CPU won't be the bottleneck. Instead, the memory bus will limit you. This is particularly the case when the indices aren't contiguous. For a 3x5 matrix, the time will be dominated by overhead - contiguity doesn't matter yet.

Best approach to storing scientific data sets on disk C++

I'm currently working on a project that requires working with gigabytes of scientific data sets. The data sets are in the form of very large arrays (30,000 elements) of integers and floating point numbers. The problem here is that they are too large to fit into memory, so I need an on-disk solution for storing and working with them. To make this problem even more fun, I am restricted to using a 32-bit architecture (as this is for work) and I need to try to maximize performance for this solution.
So far, I've worked with HDF5, which worked okay, but I found it a little too complicated to work with. So, I thought the next best thing would be to try a NoSQL database, but I couldn't find a good way to store the arrays in the database short of casting them to character arrays and storing them like that, which caused a lot of bad pointer headaches.
So, I'd like to know what you guys recommend. Maybe you have a less painful way of working with HDF5 while at the same time maximizing performance. Or maybe you know of a NoSQL database that works well for storing this type of data. Or maybe I'm going in the totally wrong direction with this and you'd like to smack some sense into me.
Anyway, I'd appreciate any words of wisdom you guys can offer me :)
Smack some sense into yourself and use a production-grade library such as HDF5. So you found it too complicated, but did you find its high-level APIs?
If you don't like that answer, try one of the emerging array databases such as SciDB, rasdaman or MonetDB. I suspect, though, that if you have baulked at HDF5 you'll baulk at any of these.
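To show what "high-level" means here, a minimal sketch with the H5LT ("Lite") API (the filename and dataset name are made up):

#include <hdf5.h>
#include <hdf5_hl.h>   // the H5LT "Lite" high-level API
#include <vector>

int main() {
    std::vector<double> data(30000, 1.0);
    hsize_t dims[1] = { data.size() };

    hid_t file = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    // One call writes the whole array; no dataspace/datatype plumbing needed.
    H5LTmake_dataset_double(file, "/dataset", 1, dims, data.data());
    H5Fclose(file);
    return 0;
}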
In my view, and experience, it is worth the effort to learn how to properly use a tool such as HDF5 if you are going to be working with large scientific data sets for any length of time. If you pick up a tool such as a NoSQL database, which was not designed for the task at hand, then, while it may initially be easier to use, eventually (before very long would be my guess) it will lack features you need or want and you will find yourself having to program around its deficiencies.
Pick one of the right tools for the job and learn how to use it properly.
Assuming your data sets really are large enough to merit it (e.g., instead of 30,000 elements, a 30,000x30,000 array of doubles), you might want to consider STXXL. It provides interfaces that are intended to imitate those of the collections in the C++ standard library (and largely succeed at doing so), but are designed to work with data too large to fit in memory.
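For example, an STXXL vector is used much like std::vector (a rough sketch, assuming STXXL's documented VECTOR_GENERATOR interface):

#include <stxxl/vector>

int main() {
    // An external-memory vector: elements live on disk, cached in RAM blocks.
    stxxl::VECTOR_GENERATOR<double>::result v;

    for (unsigned long long i = 0; i < 1000000000ULL; ++i)
        v.push_back(0.5 * i);     // far more doubles than fit in a 32-bit address space

    double x = v[123456789ULL];   // random access pages the right block in
    (void)x;
    return 0;
}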
I have been working in scientific computing for years, and I think HDF5 or NetCDF is a good data format for you to work with. Both can provide efficient parallel read/write, which is important for dealing with big data.
An alternative solution is to use an array database, like SciDB, MonetDB, or rasdaman. However, it will be kinda painful if you try to load HDF5 data into an array database. I once tried to load HDF5 data into SciDB, and it required a series of data transformations. You need to know whether you will query the data often or not. If not, the time-consuming loading may not be worthwhile.
You may be interested in this paper; it describes an approach that lets you query HDF5 data directly using SQL.

Manipulating blobs in C++

I will be reading and writing large chunks of a large binary file.
Is there a class in standard C++, the upcoming standard, or Boost that will make my task easier?
If not would it be possible to use the string class for this? What would be the dangers of doing so?
PS: A few observations that will clarify things. I expect that the blobs will be passed around a lot, so a container that is reference counted and CoW would probably be preferable.
Also, my resistance to using a string class is twofold: these are blobs, not strings, and "unprintable characters", in particular nulls, may cause difficulties when they appear.
If you have a blob of binary data you can store this easily and efficiently in a std::vector<unsigned char>.
You can increase performance if you know (or can guess) the size of the blobs by calling reserve.
And finally, if you use streams you can easily read into a vector using std::back_inserter.
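Putting those three points together (the filename is a placeholder):

#include <algorithm>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream in("blob.bin", std::ios::binary);

    std::vector<unsigned char> blob;
    blob.reserve(1 << 20);   // optional: reserve() if you can guess the size

    // istreambuf_iterator reads raw bytes, so nulls and other "unprintable"
    // characters in the blob are handled without any fuss.
    std::copy(std::istreambuf_iterator<char>(in),
              std::istreambuf_iterator<char>(),
              std::back_inserter(blob));
    return 0;
}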
Depending on what exactly you want to do, a memory mapped file, such as the one from Boost, is probably a good starting point. For in-memory modification, use an std::vector, as others have suggested.
Don't bother with CoW - it's mostly frowned upon in the C++ world, with the possible exception of everything in Qt.

Is there a good general purpose options class for C++?

Each time I write another one of my small C++ toy programs, I come across the need for a small, easy-to-use options/parameter class. Here is what it should be able to do:
accept at least ints, doubles, string parameters
easy way to add new options
portable, small and fast
free
read options from a file and/or command line
upper and lower bounds for parameters
and all the other neat useful things I am not thinking of right now
What I want to do is pass a pointer to this class to the builder and all of my strategy objects, so they can read the parameters of the algorithm I am running (e.g. which algorithm, maximum number of iterations etc.)
Can someone point me to an implementation that achieves at least some of these things?
Boost Program-Options is pretty slick. I think it does all the things on your list, apart from maybe bounds validation. But even then, you can provide custom validators pretty easily.
Update: As @stefan rightly points out in a comment, this also fails on "small"! It adds quite a significant chunk to your binary if you statically link it in.
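For reference, a minimal sketch of command-line parsing with it (the option names are made up; po::parse_config_file can read the same options from a file):

#include <boost/program_options.hpp>
#include <iostream>
#include <string>

namespace po = boost::program_options;

int main(int argc, char* argv[]) {
    std::string algorithm;
    int max_iter = 0;

    po::options_description desc("Options");
    desc.add_options()
        ("help", "show this help")
        ("algorithm", po::value<std::string>(&algorithm)->default_value("greedy"),
         "which algorithm to run")
        ("max-iter", po::value<int>(&max_iter)->default_value(100),
         "maximum number of iterations");

    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    po::notify(vm);   // fills `algorithm` and `max_iter`

    if (vm.count("help")) { std::cout << desc << "\n"; return 0; }
    std::cout << algorithm << " for " << max_iter << " iterations\n";
    return 0;
}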
You might want to consider storing your configuration in JSON format. While reading JSON from the command-line is slightly awkward, it's still perfectly doable and even reasonably legible. Other than that you get a whole lot of flexibility, including nested configuration options, facilities for deserializing complicated data types etc.
There are numerous libraries for deserializing JSON into C++, see e.g. this discussion and comparison of a few of them. Some are small, many are fast (although you don't actually need them to be fast; configuration data is very small), and most are very portable. A long list and some benchmark results (although not a feature comparison) can be found here; some of these libraries might actually be geared towards reading configuration options, although that's just a wild guess.
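As an illustration, reading a nested option with one such library (nlohmann/json here, purely as an example; the file and key names are made up):

#include <fstream>
#include <string>
#include <nlohmann/json.hpp>   // one JSON library among many

int main() {
    std::ifstream f("config.json");
    nlohmann::json cfg = nlohmann::json::parse(f);

    auto algorithm = cfg["algorithm"].get<std::string>();
    int  max_iter  = cfg["solver"]["max_iterations"].get<int>();  // nested option
    (void)algorithm; (void)max_iter;
    return 0;
}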

Best way to parse a large floating point file stored in ASCII?

What is the best way to parse a large floating point file stored in ASCII?
What would be the fastest way to do it? I remember someone telling me that using ifstream was bad because it worked on a small number of bytes at a time, and that it would be better to just read the file into memory first. Is that true?
Edit: I am running on Windows, and the file format is for a point cloud that is stored in rows like x y z r g b. I am attempting to read them into arrays. Also, the files are around 20 MB each, but I have around 10 GB worth of them.
Second edit: I am going to have to load the files to display every time I want to do a visualization, so it would be nice to have it as fast as possible, but honestly, if ifstream performs reasonably, I wouldn't mind sticking with readable code. It's running quite slowly right now, but that might be more of a hardware I/O limitation than anything I can fix in software; I just wanted to confirm.
I think your first concern should be how large the floating point numbers are. Are they float, or can there be double data too? The traditional (C) way would be to use fscanf with the format specifier for a float, and as far as I know it is rather fast. The iostreams do add a small overhead in terms of parsing the data, but that is rather negligible. For the sake of brevity, I would suggest you use iostreams (not to mention the usual stream features that you'd get with them).
Also, I think it would really help the community if you could add the relevant numbers to your question, e.g., how large a file are you trying to parse? Is this a small-memory-footprint environment (like an embedded system)?
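A minimal fscanf sketch for the x y z r g b rows described in the question (the filename is a placeholder):

#include <cstdio>
#include <vector>

int main() {
    std::FILE* f = std::fopen("cloud.txt", "r");
    if (!f) return 1;

    std::vector<float> points;   // flat x y z r g b records
    float x, y, z, r, g, b;
    while (std::fscanf(f, "%f %f %f %f %f %f", &x, &y, &z, &r, &g, &b) == 6)
        points.insert(points.end(), {x, y, z, r, g, b});

    std::fclose(f);
    return 0;
}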
It all depends on the operating system and on your choice of C and C++ standard libraries.
The days of slow ifstream are pretty much over; however, there is likely some overhead in handling the C++ generic interfaces.
atof/strtod might be the fastest way to deal with it if the string is already in memory.
Finally, any attempt to read the whole file into memory yourself will likely be futile. Modern operating systems usually get in the way (especially if the file is larger than RAM, you will end up swapping code, since the system will treat your (already stored on disk) data as swappable).
If you really need to be ridiculously fast (the only places I can think it will be useful are HPC and Map/Reduce-based approaches), try mmap (Linux/Unix) or MapViewOfFile (Windows) to get the file prefetched into virtual memory in the most sensible way, and then use atof plus custom string handling.
If the file is really well organized for this kind of game, you can even get clever with mmaps and pointers and have the conversion multithreaded. Sounds like a fun exercise if you have over 10 GB of floats to convert on a regular basis.
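A rough single-threaded sketch of the mmap + strtod approach (POSIX shown; MapViewOfFile would be the Windows counterpart; the filename is a placeholder and error checks are mostly elided):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdlib>
#include <vector>

int main() {
    int fd = open("cloud.txt", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    fstat(fd, &st);
    char* data = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    // Walk the mapping with strtod. Caveat: a real implementation must guard
    // the end of the mapping, since the mapped file is not NUL-terminated.
    std::vector<double> values;
    char* p = data;
    while (p < data + st.st_size) {
        char* next;
        double v = std::strtod(p, &next);   // strtod skips leading whitespace
        if (next == p) break;               // nothing more to parse
        values.push_back(v);
        p = next;
    }

    munmap(data, st.st_size);
    close(fd);
    return 0;
}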
The fastest way is probably to use an ifstream, but you can also use fscanf. If you are targeting a specific platform, you could hand-load the file into memory and parse the floats from it manually.