Selecting Appropriate Data Structure - c++

We are reading a book and we have to store every character of this book with its count.
For example, for "This is to test" the output should be: T4 H1 I2 S3 O1 E1
What would be the most appropriate data structure here, and why? And what would the logic be?

An array of ints would do fine here. Create an array, each index is a letter of the alphabet (you'll probably want to store upper and lower case separately for perf reasons when scanning the book). As you scan, increment the int at the array location for that letter. When you're done, print them all out.

Based on your description, you only need a simple hash of characters to their count. This is because you only have a limited set of characters that can occur in a book (even counting punctuation, accents and special characters). So a hash with a few hundred entries will suffice.

The appropriate data structure would be a std::vector (or, if you want something built-in, an array). If you're using C++0x, std::array would work nicely as well.
The logic is pretty simple -- read a character, (apparently convert to upper case), and increment the count for that item in the array/vector.
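A minimal sketch of that logic, assuming an ASCII-compatible character set and that only the letters A-Z are counted (the function name count_letters is made up for illustration):

```cpp
#include <array>
#include <cctype>
#include <string>

// Count how often each letter A-Z occurs in `text`, ignoring case.
// Index 0 holds the count for 'A', index 25 the count for 'Z'.
// Assumption: ASCII-compatible encoding, so 'A'..'Z' are contiguous.
std::array<int, 26> count_letters(const std::string& text) {
    std::array<int, 26> counts{};  // value-initialised: every count starts at 0
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            const int upper = std::toupper(c);
            if (upper >= 'A' && upper <= 'Z') {  // skip locale-specific letters
                ++counts[upper - 'A'];
            }
        }
    }
    return counts;
}
```

To print the result, loop over the 26 slots and output the letter ('A' + index) followed by its count, skipping slots that are still zero.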

The choice of selecting a data structure not only depends on what kind of data you want to store inside the data structure but more importantly on what kind of operations you want to perform on the data.
Have a look at this excellent chart which helps to decide when to use which STL container.
Of course, in your case a std::array (C++0x) or std::vector seems to be a good choice.

Related

Not sure which data structure to use

Assuming I have the following text:
today was a good day and today was a sunny day.
I break this text up into lines, separated at white space, which gives
Today
was
a
good
etc.
Now I use the vector data structure to simply count the number of words in the text via .size(). That's done.
However, I also want to check if a word comes up more than once, and if so, how many times. In my example "today" comes up 2 times.
I want to store that "today" and append a 2/x (depending on how often it comes up in a large text). Now that's not just for "today" but for every word in the text. I want to look up how often a word appears, append a counter, and sort it (the word + counters) in descending order (that's another thing, but not important right now).
I'm not sure which data structure to use here. Map perhaps? But I can't add counters to map.
Edit: This is what I've done so far: http://pastebin.com/JncR4kw9
You should use a map. In fact, you should use an unordered_map.
unordered_map<string,int> will give you a hash table which uses strings as keys, and you can increment the integer to keep count.
unordered_map has the advantage of average O(1) lookup and insertion over the O(log n) lookup and insertion of a map. This is because the former is a hash table (an array of buckets), whereas the latter is a balanced binary search tree (typically a red-black tree).
The only disadvantage of an unordered_map is that, as its name says, you can't iterate over the elements in lexical order. This should be clear from the explanation of their structure above. However, you don't seem to need such a traversal, so it shouldn't be an issue.
std::unordered_map<std::string, int> mymap;
++mymap[word]; // increments the count stored for word; a word not seen before starts at 0
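Here is a slightly fuller sketch of the same idea, which also covers the "sort in descending order" part of the question. The function names count_words and by_frequency are invented for this example, and words are simply split on whitespace (punctuation is not stripped):

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Split `text` on whitespace and count each word.
std::unordered_map<std::string, int> count_words(const std::string& text) {
    std::unordered_map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++counts[word];  // operator[] inserts a zero count the first time a word is seen
    }
    return counts;
}

// Copy the entries into a vector and sort by count, highest first.
// Ties are broken alphabetically so the order is deterministic.
std::vector<std::pair<std::string, int>> by_frequency(
        const std::unordered_map<std::string, int>& counts) {
    std::vector<std::pair<std::string, int>> entries(counts.begin(), counts.end());
    std::sort(entries.begin(), entries.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    return entries;
}
```

The hash table itself cannot be sorted, which is why the entries are copied into a vector once counting is finished.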
Why not use two data structures? The vector you have now, and a map, using the string as the key, and an integer as data, which then will be the number of times the word was found in the text.
Sort the vector in alphabetical order.
Scan it and compare every word to those that follow, until you find a different one, and so on.
a, a, and, day, day, good, sunny, today, today, was, was
2 1 2 1 1 2 2
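A rough sketch of this sort-then-scan approach, assuming the words are already in a std::vector<std::string> (count_runs is a name made up for this example):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sort the words, then count each run of equal neighbours.
// Takes the vector by value so the caller's original order is untouched.
std::vector<std::pair<std::string, int>> count_runs(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    std::vector<std::pair<std::string, int>> result;
    for (std::size_t i = 0; i < words.size();) {
        std::size_t j = i;
        while (j < words.size() && words[j] == words[i]) {
            ++j;  // advance to the first word that differs
        }
        result.emplace_back(words[i], static_cast<int>(j - i));
        i = j;  // continue with the next distinct word
    }
    return result;
}
```

This needs no map at all; the cost is dominated by the O(n log n) sort.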
Another option to consider is a radix tree: https://en.wikipedia.org/wiki/Radix_tree
It is quite memory efficient, and for large text input it may perform better than the alternative data structures.
One can store the frequency of a word in the nodes of the tree. It also benefits from locality of reference, which holds for any text document.

Handling large amounts of data in C++, need approach

So I have a 1GB file in a CSV format like so, that I converted to a SQLite3 database
column1;column2;column3
1212;abcd;20090909
1543;efgh;20120120
Except that I have 12 columns. Now, I need to read and sort this data and reformat it for output, but when I try to do this it seems I run out of RAM (using vectors). I read it in from SQLite and store each line of the file in a struct which is then pushed back to a deque. Like I said, I run out of memory when the RAM usage approaches 2GB, and the app crashes. I tried using STXXL but apparently it does not support vectors of non-POD types (so it has to be long int, double, char etc.), and my vector consists mostly of std::strings, some boost::dates and one double value.
Basically what I need to do is group together all "rows" that have the same value in a specific column; in other words, I need to sort the data based on one column and then work with that.
Any approach as to how I can read in everything or at least sort it? I would do it with SQLite3 but that seems time consuming. Perhaps I'm wrong.
Thanks.
In order of desirability:
1. Don't use C++ at all, just use sort if possible.
2. If you're wedded to using a DB to process a not-very-large csv file in what sounds like a not-really-relational way, shift all the heavy lifting into the DB and let it worry about memory management.
3. If you must do it in C++:
   - Skip the SQLite3 step entirely since you're not using it for anything. Just map the csv file into memory, and build a vector of row pointers. Sort this without moving the data around.
   - If you must parse the rows into structures:
     - Don't store the string columns as std::string: this requires an extra non-contiguous allocation, which will waste memory. Prefer an inline char array if the length is bounded.
     - Choose the smallest integer size that will fit your values (e.g. uint16_t would fit your sample first column values).
     - Be careful about padding: check the sizeof your struct, and reorder members or pack it if it's much larger than expected.
If you want to stick with the SQLite3 approach, I recommend using a list instead of a vector so your operating system doesn't need to find 1GB or more of contiguous memory.
If you can skip the SQLite3 step, here is how I would solve the problem:
Write a class (e.g. MyRow) which has a field for every column in your data set.
Read the file into a std::list<MyRow> where every row in your data set becomes an instance of MyRow
Write a predicate which compares the desired column
Use the sort function of the std::list to sort your data.
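The steps above could look roughly like this. MyRow here has only the three columns from the sample instead of the real twelve, and by_date and sort_rows_by_date are names made up for this sketch; reading the file is left out:

```cpp
#include <list>
#include <string>

// Hypothetical row type: one field per column of the sample data.
struct MyRow {
    long id;
    std::string name;
    std::string date;  // kept as text, e.g. "20090909", which sorts correctly as a string
};

// Predicate that compares two rows on the column we want to group by.
bool by_date(const MyRow& a, const MyRow& b) {
    return a.date < b.date;
}

// std::list::sort is stable and relinks nodes instead of copying rows.
void sort_rows_by_date(std::list<MyRow>& rows) {
    rows.sort(by_date);
}
```

After sorting, rows with the same date sit next to each other, so they can be processed one group at a time.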
I hope this helps you.
There is significant overhead for std::string. If your struct contains a std::string for each column, you will waste a lot of space on char * pointers, malloc headers, etc.
Try parsing all the numerical fields immediately when reading the file, and storing them in your struct as ints or whatever you need.
If your file actually contains a lot of numeric fields like your example shows, I would expect it to use less than the file size worth of memory after parsing.
Create a structure for your records.
The record should have "ordering" functions for the fields you need to sort by.
Read the file as objects and store into a container that has random-access capability, such as std::vector or std::array.
For each field you want to sort by:
Create an index table, std::map, using the field value as the key and the record's index as the value.
To process the fields in order, choose your index table and iterate through the index table. Use the value field (a.k.a. index) to fetch the object from the container of objects.
If the records are of fixed length or can be converted to a fixed length, you could write the objects in binary to a file and position the file to different records. Use an index table, like above, except use file positions instead of indices.
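As a sketch of the in-memory index table: the Record type and function names below are invented for illustration, and std::multimap is swapped in for std::map on the assumption that several records can share the same field value (a plain std::map would keep only one index per key):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record with three of the columns.
struct Record {
    long id;
    std::string name;
    std::string date;
};

// Index table: field value -> position of the record in the container.
std::multimap<std::string, std::size_t> build_date_index(const std::vector<Record>& records) {
    std::multimap<std::string, std::size_t> index;
    for (std::size_t i = 0; i < records.size(); ++i) {
        index.emplace(records[i].date, i);
    }
    return index;
}

// Visit the records in date order without moving them, collecting their ids.
std::vector<long> ids_in_date_order(const std::vector<Record>& records) {
    const std::multimap<std::string, std::size_t> index = build_date_index(records);
    std::vector<long> ids;
    for (const auto& entry : index) {
        ids.push_back(records[entry.second].id);  // entry.second is the record's index
    }
    return ids;
}
```

One such index can be built per sort field; the records themselves are stored only once.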
Thanks for your answers, but I figured out a very fast and simple approach.
I let SQLite3 do the job for me by giving it this query:
SELECT * FROM my_table ORDER BY key_column ASC
For an 800MB file, that took about 70 seconds to process, and then I received all the data in my C++ program, already ordered by the column I wanted it grouped by. I processed the data one group at a time and output each group in my desired output format, keeping my RAM free from overload. Total time for the operation was about 200 seconds, which I'm pretty happy with.
Thank you for your time.

How to find largest values in C++ Map

My teacher in my Data Structures class gave us an assignment to read in a book and count how many words there are. That's not all; we need to display the 100 most common words. My gut says to sort the map, but I only need 100 words from the map. I've googled around without luck: is there a "Textbook Answer" to sorting maps by the value and not the key?
I doubt there's a "Textbook Answer", and the answer is no: you can't sort maps by value.
You could always create another map using the values. However, this is not the most efficient solution. What I think would be better is for you to chuck the values into a priority_queue, and then pop the first 100 off.
Note that you don't need to store the words in the second data structure. You can store pointers or references to the word, or even a map::iterator.
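A sketch of the priority_queue idea, storing a pointer to each word rather than a copy. The names Entry, LowerPriority and top_words are made up for this example, and the map must outlive the call because the queue points into it:

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// (count, pointer to the word stored inside the map)
using Entry = std::pair<int, const std::string*>;

// Orders the queue so the highest count comes out first;
// among equal counts the alphabetically earlier word comes out first.
struct LowerPriority {
    bool operator()(const Entry& a, const Entry& b) const {
        if (a.first != b.first) return a.first < b.first;
        return *a.second > *b.second;
    }
};

// Return up to `n` (word, count) pairs, most frequent first.
std::vector<std::pair<std::string, int>> top_words(
        const std::map<std::string, int>& counts, std::size_t n) {
    std::priority_queue<Entry, std::vector<Entry>, LowerPriority> heap;
    for (const auto& kv : counts) {
        heap.push(Entry(kv.second, &kv.first));
    }
    std::vector<std::pair<std::string, int>> result;
    while (!heap.empty() && result.size() < n) {
        result.emplace_back(*heap.top().second, heap.top().first);
        heap.pop();
    }
    return result;
}
```

For the assignment, n would be 100.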
Now, there's another approach you could consider. That is to maintain a running order of the top 100 candidates as you build your first map. That way there would be no need to do the second pass and build an extra structure which, as you pointed out, is wasteful.
To do this efficiently you would probably use a heap-like approach and do a bubble-up whenever you update a value. Since the word counts only ever increase, this would suit the heap very nicely. However, you would have a maintenance issue on your hands. That is: how you reference the position of a value in the heap, and keeping track of values that fall off the bottom.

Find substring in many objects containing multiple strings

I am dealing with a collection of objects where the reasonable size of it could be anywhere between 1 and 50K (but there's no set upper limit). Each object contains a handful of strings.
I want to implement a search function that can partially, exactly, or RegEx match any one of these strings and subsequently return a list of objects.
If each object only contained a single string then I could simply lexicographically sort them, and pull out ranges fairly easily - but I am reluctant to implement a map-like structure for each of the contained strings due to speed/memory concerns.
Is there a data structure well suited to this kind of operation for speed and memory efficiency? I'm sensing a database maybe on the horizon, but I know little about them, so I want to hold off researching until someone more knowledgeable can nudge me in the right direction!
A map-like collection is probably your best bet: the key will be the string, and the value a reference to the containing object. If your strings are held inside the objects as std::string, then you could store a reference to the data in the key part of the map instead (alternatively, use a shared_ptr for the strings and reference them in both the object and the map).
Searching and sorting then just become a matter of implementing a custom search functor that uses the dereferenced data. The size of the map will be 2 references per entry plus the map overhead, which isn't going to be that bad if you consider that the alternatives will be as large, if not larger.
"partially, exactly, or RegEx match any one of these strings and subsequently return a list of objects"
Well, for exact matches, you could have a std::map<std::string, std::vector<object*> >. The key would be the exact string, and the vector holds pointers to matching objects; many of these pointers may point to a single object instance.
You could have a front-end map from partial strings to full strings: say the string is "dogged", you'd sadly have to put entries in for "dogged", "ogged", "gged", "ged", "ed" and "d" (stop wherever you like if you want a minimum match size), then use lower_bound to search. That way, if you search on "dog", you can still see that there is a match for "dogged" (it doesn't matter if lower_bound lands on, say, "dogfood" first). This would be a simple std::map<string, string>. While you step forwards from the lower_bound position and the key still matches (i.e. from dogfood to dogged to ... until it doesn't start with dog), you can look each full string up in the "exact match" map and aggregate the results.
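A sketch of the front-end map and the lower_bound scan. It uses std::multimap rather than std::map, on the assumption that different full strings can share a suffix ("ed" occurs in both "dogged" and "red"); SuffixIndex, add_string and find_containing are names invented for this example:

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// suffix -> full string it was taken from
using SuffixIndex = std::multimap<std::string, std::string>;

// Register every suffix of `full` ("dogged", "ogged", ..., "d").
void add_string(SuffixIndex& index, const std::string& full) {
    for (std::size_t i = 0; i < full.size(); ++i) {
        index.emplace(full.substr(i), full);
    }
}

// Return every full string that contains `query` as a substring.
// All keys that start with `query` are adjacent, beginning at lower_bound(query).
std::set<std::string> find_containing(const SuffixIndex& index, const std::string& query) {
    std::set<std::string> matches;
    for (auto it = index.lower_bound(query);
         it != index.end() && it->first.compare(0, query.size(), query) == 0;
         ++it) {
        matches.insert(it->second);
    }
    return matches;
}
```

Each returned full string can then be looked up in the exact-match map to get the objects. Note the memory cost: a string of length n contributes n entries to the index.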
For regular expressions, I have no good suggestion... I'd start with a brute force search through all the full strings. If it really isn't good enough, then you do some rough optimisations like checking for a constant substring to filter by before doing the brute force matching, but it's beyond me to imagine how to do this very thoroughly and fast.
(substitute your favourite smart pointers for object*s if useful)
Thanks for all the replies, but following on from techniques mentioned in this post, I've decided to use an enhanced suffix array from the header-only SeqAn project.

Efficient data structure for searching numbers and strings

I have a scenario in which strings and numbers are combined into a single entity. I need to search by either the string or the number. What data structure should I use for this?
I thought of using hashing for the strings and a search tree for the numbers. Can you please comment on my choice and also suggest better structures, if any?
Thanks!
Use two std::maps, one from std::string to a pointer and the other from number to a pointer. The pointers go to your "single entity". See how far you can scale this (millions of entries...) before trying to optimize further.
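A small sketch of the two-map layout. The Entity and EntityIndex types are hypothetical, and the sketch assumes that both the names and the numbers are unique (adding a duplicate would overwrite the earlier map entry):

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical "single entity" combining a string and a number.
struct Entity {
    std::string name;
    long number;
};

class EntityIndex {
public:
    // Store the entity once and index it under both keys.
    void add(const std::string& name, long number) {
        owned_.push_back(std::unique_ptr<Entity>(new Entity{name, number}));
        Entity* e = owned_.back().get();
        by_name_[name] = e;
        by_number_[number] = e;
    }

    // Both lookups return nullptr when nothing matches.
    const Entity* find_by_name(const std::string& name) const {
        auto it = by_name_.find(name);
        return it == by_name_.end() ? nullptr : it->second;
    }

    const Entity* find_by_number(long number) const {
        auto it = by_number_.find(number);
        return it == by_number_.end() ? nullptr : it->second;
    }

private:
    std::vector<std::unique_ptr<Entity>> owned_;  // owns the entities
    std::map<std::string, Entity*> by_name_;      // string -> entity
    std::map<long, Entity*> by_number_;           // number -> entity
};
```

Swapping by_name_ for a std::unordered_map would give the hashing-for-strings variant from the question without changing the interface.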