I have a text file that contains authors and the books written by those authors. I am assigned to write a program in which the user provides the name of an author, and the program must print the names of any books written by that author.
I understand that I am supposed to use an ifstream to read this information. But how can I make it so my program doesn't read the entire file into memory (array, vector, etc.) to perform the search queries?
What would be the best way to approach this? My program should use classes as well.
I don't know the whole answer, or even the syntax, but a good place to start is: what do you know about the format of the input text file? Is it simply a two-column file like [Author Book], separated by a common delimiter? In that case, you could write a loop that goes through the whole file and only stores the entries that match the search string.
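A minimal sketch of that loop, assuming each line holds one Author,Book pair separated by a comma (the file name and the exact format are assumptions):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::string query = "Jane Austen";      // author entered by the user (example)
    std::ifstream in("books.txt");          // assumed file name
    std::string line;
    while (std::getline(in, line)) {        // one record at a time; nothing is kept
        std::size_t comma = line.find(',');
        if (comma == std::string::npos) continue;
        if (line.substr(0, comma) == query)              // author field matches?
            std::cout << line.substr(comma + 1) << '\n'; // print the book title
    }
}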
A lot here depends on how often you're going to look for books in the file. If you're only going to look for one or two, then the most sensible method is probably to just scan through the file, reading pairs of lines, to find the ones you want.
Most other methods assume that you're going to look for data in the file often enough to justify spending some extra time up front to optimize the queries later. Assuming that's correct, one possibility would be to create an index by reading through the file, hashing each author's name, and recording the position in the file of the "record" for that author/book pair.
Then you'll store those hash/file-offset pairs in a separate file. When you want to do a query, you'll read the hash/file-offset pairs into memory, hash the name of the author you're searching for (using the same algorithm), and see which file offsets (if any) have the same hash value. Seek to those spots in the file, and read in the record for the book. At that point, re-compare the author name in the file to the author name that was entered, to guard against a hash collision. Show the records where you get a match.
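Here's a condensed sketch of that idea, assuming the same one-line-per-record Author,Book layout as above; for brevity the index is kept in memory rather than written to a separate file:

#include <fstream>
#include <functional>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Pass 1: build an index of (hash of author, file offset of the record).
    std::vector<std::pair<std::size_t, std::streamoff>> index;
    std::ifstream in("books.txt");                       // assumed file name
    std::string line;
    std::hash<std::string> hasher;
    for (std::streamoff pos = in.tellg(); std::getline(in, line); pos = in.tellg()) {
        std::size_t comma = line.find(',');
        if (comma != std::string::npos)
            index.emplace_back(hasher(line.substr(0, comma)), pos);
    }
    // Query: hash the requested author, seek to matching offsets, re-compare.
    std::string query = "Jane Austen";                   // example input
    std::size_t h = hasher(query);
    in.clear();                                          // clear EOF so seekg works
    for (const auto& [hash, offset] : index) {
        if (hash != h) continue;
        in.seekg(offset);
        std::getline(in, line);
        std::size_t comma = line.find(',');
        if (line.substr(0, comma) == query)              // guard against collisions
            std::cout << line.substr(comma + 1) << '\n';
    }
}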
So conceptually I'm reading in a file with ~2 million lines of data. I'm looking to sort, store and apply other functions to the data later.
I've been told this is referred to as "buckets", but I'm unclear whether that is something pre-defined or a user-defined data type. So I'm curious whether a linked list, an array, or some combination would be advisable?
Do I need to worry about the size of the file? Will most programs be able to handle all of that in memory at once, or will I need to partition the data first (i.e. divide it into buckets, store each bucket in its own file, then process them with further code, etc.)?
If #2 is required, does C++ have the functionality to save multiple files per execution? Meaning: a) create the bucket1 file; b) populate the bucket1 file; c) close the bucket1 file; d) create the bucket2 file; ...
OK, so I gather from your post that you are writing this in C++. However, the details are a bit sparse apart from the sorting requirement. But what are you sorting on? Are all fields interpreted as text? Are some numbers? Are there multiple keys?
If you don't absolutely need to write this in C++, and you are on Linux, just invoke /bin/sort to do the sorting. This may seem like a cop-out, but commercial software like Talend even resorts to that.
But if you must write new code in C++, these are my recommendations:
1) Is the CSV file escaped? In other words, do embedded quotes and delimiters need special treatment? You have to figure this out first.
2) Check this out: http://mybyteofcode.blogspot.com/2010/02/parse-csv-file-with-boost-tokenizer-in.html
3) A simple representation of the scanned input is vector<vector<string>>, but it is unwieldy. Instead, wrap a class around vector<string>, make a vector of pointers to those classes (one per line of input), and sort the pointers instead (see the sketch after this list).
4) You should be able to sort ~2M "medium" rows in memory these days; just use std::sort. But for full generality you will need to consider: what if it doesn't fit into memory? The most common answer is to sort chunks at a time, write the results to disk, and then merge it all using a priority queue or similar structure.
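A minimal sketch of point 3, assuming the rows are already parsed and the sort key is the first field compared as text:

#include <algorithm>
#include <memory>
#include <string>
#include <vector>

// One parsed CSV line.
struct Row {
    std::vector<std::string> fields;
};

int main() {
    std::vector<std::unique_ptr<Row>> rows;     // one entry per input line
    // ... fill `rows` from the CSV parser; two hard-coded rows stand in here ...
    rows.push_back(std::make_unique<Row>(Row{{"beta", "2"}}));
    rows.push_back(std::make_unique<Row>(Row{{"alpha", "1"}}));

    // Sort the cheap-to-move pointers, not the rows themselves.
    std::sort(rows.begin(), rows.end(),
              [](const auto& l, const auto& r) {
                  return l->fields[0] < r->fields[0];   // key: first column, as text
              });
}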
I need the most efficient way to determine whether a string is in a wordlist (a text file of all words).
Obviously I could create an ifstream object and loop through each line to see whether the string is present.
Is there a quicker way? Perhaps by using a map?
Thanks
To find a particular word in a list of many words, I would use a std::unordered_set, which is a hash-table by another name.
Basically:
1. Read the words from the file into the set.
2. Pick a "random combination of letters" to look up.
3. Use find() to see if it's in the set.
4. Go to 2 as required.
Obviously, if you only want to search for a single word, there's no point in loading everything into a set. Just read the file and check as you go (on average you'll only need to read half the file, which is obviously about 50% faster than reading the whole file).
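A minimal sketch of the set-based version, assuming one word per line in a file named wordlist.txt:

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> words;
    std::ifstream in("wordlist.txt");       // assumed file name
    std::string w;
    while (in >> w)                         // step 1: load every word once
        words.insert(w);

    std::string query = "example";          // step 2: the word to look up
    if (words.find(query) != words.end())   // step 3: O(1) average lookup
        std::cout << query << " is in the list\n";
}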
I have over 100 text (.txt) files. The data in these files is comma-separated, for example:
good,bad,fine
The aforementioned items are just an example; the data can contain anything from individual words and email IDs to phone numbers. I have to copy all of these values into a single column of one or more Excel (.xlsx) spreadsheet files.
Now the problem is that I don't want the data to contain duplicates, and there are over 10 million data items. I'm planning to implement this in C++, and I would prefer an efficient algorithm so that I can complete the work quickly.
Separate your task into two steps:
a. getting the list of items into memory while removing duplicates.
b. putting them into Excel.
For a: use a sorted tree to gather all the items; that's how you will find duplicates fast.
For b: once done with the list, I would write everything to a simple file and import it into Excel, rather than trying to drive Excel through its API from C++.
After your comment: if a memory problem arises, you might want to create one tree per first letter and use files to store each list, so you will not get a memory overflow. It is less efficient, but with today's computing power you won't feel it.
The main idea here is to find out fast whether you already have a word or not, and if not, add it to the "list". Searching a sorted tree should do the trick. If you want to avoid the worst-case scenario, there is the AVL tree, if I recall correctly; that tree remains balanced no matter the order of the inserts, yet it is harder to code.
The quickest way to get something working is to simply load all the values (as std::strings) into a std::set. The interface is dead simple, and while it won't be the fastest option, you never know until you measure.
In particular, unordered_set might be faster if you have it available, but it will be harder to use because the large number of strings will cause rehashing if you don't prepare properly (e.g. by reserving buckets up front).
As you read a string, insert it into the set: if (my_set.insert("the string").second) .... That condition will be true for new values only.
10 million is a lot, but if each string is around 20 bytes or so, you may get away with a simplistic approach like this.
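A minimal sketch of that approach, assuming comma-separated values and illustrative input/output file names:

#include <fstream>
#include <set>
#include <sstream>
#include <string>

int main() {
    std::set<std::string> seen;
    std::ofstream out("unique.csv");        // one unique value per line
    for (int i = 1; i <= 100; ++i) {        // assumed naming: in1.txt .. in100.txt
        std::ifstream in("in" + std::to_string(i) + ".txt");
        std::string line, item;
        while (std::getline(in, line)) {
            std::istringstream fields(line);
            while (std::getline(fields, item, ','))     // split each line on commas
                if (seen.insert(item).second)           // true only for new values
                    out << item << '\n';
        }
    }
}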
I have a school task to load a list of names from one text file to another while ordering them, yet I am not allowed to keep them all in memory (in an array, for example) at the same time. What would be the best way to do this? I have to do a binary search on them afterwards.
My first thought was to generate a hash key for each of them, and then write them in a location that is relative to their key, but the fact that I have to do a binary search afterwards makes me think that this is redundant.
The problem is not knowing them all beforehand (that means I have to somehow push some names in the middle).
This is probably the easiest way:
1) read the file line by line and find the name that comes first in your sort order,
e.g.
- read name_1.
- read the next name_2.
- If name_2 < name_1, then set name_1 = name_2, and repeat.
2) read the file line by line again and find the second name.
i.e. the lowest name that is still higher than the first name.
3) write the first name to the output file.
4) now read line by line again for the third name.
5) append the second name to the output file.
etc...
This will not be fast, but it will have virtually no memory overhead. You will never have more than three names stored in memory.
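A minimal sketch of that repeated scan, assuming one name per line in names.txt; note that with this version a name that appears twice is written only once:

#include <fstream>
#include <string>

int main() {
    std::ofstream out("sorted.txt");
    std::string last;                       // last name written out
    bool first = true;
    while (true) {
        std::ifstream in("names.txt");      // re-scan the whole file each pass
        std::string name, best;
        bool found = false;
        while (std::getline(in, name)) {
            if (!first && name <= last) continue;   // already written
            if (!found || name < best) { best = name; found = true; }
        }
        if (!found) break;                  // nothing left to write
        out << best << '\n';
        last = best;
        first = false;
    }
}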
Some ways:
1) You could split the data into multiple temporary files, sort each file separately, and then merge the files (see the sketch below).
2) Call the operating system to sort the file, something like
system ("sort input>output")
OK, I don't know if I used the term 'lexical tree' correctly in my comment, but I would make a tree like a binary tree, except that each node can have a child for every letter of the alphabet rather than only two. I believe this is called a 'trie'.
In each node you keep a counter of how many entries ended on that particular node. You create the nodes dynamically as you need them, so the space consumption is kept low.
Then you can traverse the whole tree and retrieve all the elements in order. That is a non-trivial sort that works very well for entries with common prefixes. It would be fast, as inserts are linear and traversal is also linear, so it takes O(N) time overall, where N is the total number of characters in the set being sorted. And memory consumption would be good if the data set has common prefixes.
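A minimal trie sketch along those lines, assuming lowercase a-z words; the counter records how many entries ended at each node, and a depth-first walk emits the words in sorted order:

#include <array>
#include <iostream>
#include <memory>
#include <string>

struct Node {
    std::array<std::unique_ptr<Node>, 26> kids{};   // one slot per letter
    int count = 0;                                  // entries ending at this node
};

void insert(Node& root, const std::string& word) {
    Node* n = &root;
    for (char c : word) {
        auto& kid = n->kids[c - 'a'];
        if (!kid) kid = std::make_unique<Node>();   // create nodes on demand
        n = kid.get();
    }
    ++n->count;
}

void walk(const Node& n, std::string& prefix) {     // alphabetical traversal
    for (int i = 0; i < n.count; ++i)
        std::cout << prefix << '\n';                // emit duplicates count times
    for (int c = 0; c < 26; ++c)
        if (n.kids[c]) {
            prefix.push_back(static_cast<char>('a' + c));
            walk(*n.kids[c], prefix);
            prefix.pop_back();
        }
}

int main() {
    Node root;
    for (const char* w : {"banana", "app", "apple", "app"})
        insert(root, w);
    std::string prefix;
    walk(root, prefix);     // prints: app, app, apple, banana
}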
So, I have a text file (data.txt). It's a story, so just sentence after sentence, and fairly long. What I'm trying to do is take every individual word from the file and store it in a data structure of some type. As user input I'm going to get a word, and then I need to find the 10 closest words (in data.txt) to that input word, using a function that finds the Levenshtein distance between two strings (I've figured that function out, though). So I figured I'd use getline() with " " as the delimiter to read the words individually. But I don't know what I should store these words in so that I can access them easily. And there's also the fact that I don't know how many words are in the data.txt file.
I may have explained this badly, sorry; I'll answer any questions you have, though, to clarify.
In C++ you can store the words in a vector of strings:
#include <fstream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> wordsArray;
    std::ifstream in("data.txt");
    std::string oneWord;
    while (in >> oneWord)                  // operator>> splits on whitespace
        wordsArray.push_back(oneWord);     // the vector grows as needed
}
You need a data structure capable of "containing" the strings you read.
The standard library offers a number of "container" classes, such as:
vector
deque
list
set
map
Take a look at the containers library at http://en.cppreference.com/w/cpp and find the one that best fits your needs.
The proper answer changes depending not only on the fact that you have to "store" the words, but also on what you have to do with them afterwards.
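For this particular problem (finding the 10 closest words), a plain vector plus std::partial_sort is one reasonable fit. Here's a sketch; the levenshtein function below is only a stand-in for the one you already wrote, and the word list is hard-coded for illustration:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for the question's own Levenshtein function (classic two-row DP).
int levenshtein(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j)
            cur[j] = std::min({prev[j] + 1,                             // deletion
                               cur[j - 1] + 1,                          // insertion
                               prev[j - 1] + (a[i - 1] != b[j - 1])});  // substitution
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

int main() {
    std::vector<std::string> words = {"story", "store", "stone", "sturdy", "star"};
    std::string input = "stores";           // example user input
    std::size_t n = std::min<std::size_t>(10, words.size());
    // Move the n smallest-distance words to the front, in order of distance.
    std::partial_sort(words.begin(), words.begin() + n, words.end(),
                      [&](const std::string& a, const std::string& b) {
                          return levenshtein(a, input) < levenshtein(b, input);
                      });
    for (std::size_t i = 0; i < n; ++i)
        std::cout << words[i] << '\n';
}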