I'm having trouble storing data in a CSV file.
I'm hashing a text file that has ~15,500,000 lines, and I have to make an Excel graph of frequency based on the hashes. The problem is that the maximum number of rows Excel supports is 1,048,576. So I was thinking I could write the hashes in 15 columns of roughly 1,000,000 rows each, but I haven't figured out a way to do this without comparing a bunch of numbers on every write, which would take a lot of time given the huge amount of data.
Do you have any ideas how to solve this? Maybe using a vector of vectors could work as well; I might try that right now. Thanks.
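For reference, the column-splitting idea can be done without comparing anything: each value's row and column follow directly from its index. A minimal sketch, not anyone's final code (the file names, the fixed 1,000,000-row chunk, and std::hash as a stand-in for the real hash are all assumptions):

#include <fstream>
#include <functional>
#include <string>
#include <vector>

int main()
{
    const std::size_t kRowsPerColumn = 1'000'000;            // assumed chunk size per Excel column
    std::vector<std::size_t> hashes;
    std::ifstream in("input.txt");                           // assumed input file
    for (std::string line; std::getline(in, line); )
        hashes.push_back(std::hash<std::string>{}(line));    // placeholder for the real hash

    const std::size_t columns = (hashes.size() + kRowsPerColumn - 1) / kRowsPerColumn;
    std::ofstream out("hashes.csv");
    for (std::size_t row = 0; row < kRowsPerColumn; ++row)
    {
        for (std::size_t col = 0; col < columns; ++col)
        {
            if (col) out << ',';
            const std::size_t i = col * kRowsPerColumn + row;   // index arithmetic picks the cell
            if (i < hashes.size()) out << hashes[i];
        }
        out << '\n';
    }
}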
I need advice before I start writing a program.
I have a huge amount of data (~15M lines in a 300 MB txt file; 10 simple numbers per line separated by spaces, and each line is a unique sequence of numbers).
This data is fixed and does not change.
I need to filter this data under various conditions (for example, search for all sequences that have three of the ten numbers identical, or find all sequences with the same sum, etc.).
What is the recommended way to approach this task in Qt C++?
Where should I begin? What should I do with the data: keep it in a txt file and load it from there, or insert it into SQLite? What are the recommended ways to complete this task?
300 megabytes is not a lot of memory for an application to use. A txt file is easiest, but you can use anything; you can simply read the data in C++ and process it in memory.
Filtering based on criteria can usually be optimized, but each requirement calls for a different optimization, so there isn't one algorithm that is fast for all criteria.
You can start with the simple approach: go through all the lines and check the criteria (check_criteria and save_result below stand for your own code):
#include <fstream>
#include <string>

std::ifstream in("data.txt");       // the 300 MB data file
std::string line;
while (std::getline(in, line))      // go through all ~15,000,000 lines
{
    if (check_criteria(line))       // your filtering predicate
        save_result(line);          // e.g. push_back into a results vector
}
And then research optimization solutions that could work for all cases.
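For example, the "same sum" criterion from the question can be handled by grouping lines by their sum in a single pass instead of comparing sequences pairwise. A rough sketch (the file name and the use of an unordered_map are assumptions, not part of the original suggestion):

#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main()
{
    // Group every line by the sum of its 10 numbers; one pass over the file.
    std::unordered_map<long long, std::vector<std::string>> by_sum;
    std::ifstream in("data.txt");                    // assumed file name
    for (std::string line; std::getline(in, line); )
    {
        std::istringstream fields(line);
        long long sum = 0, value = 0;
        while (fields >> value) sum += value;        // 10 space-separated numbers per line
        by_sum[sum].push_back(line);
    }
    // Every bucket with more than one entry is a set of sequences sharing a sum.
}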
I'm trying to design an application that manipulates a list of thousands of individual words stored in a txt file, for the following tasks:
1- Randomly picking some words.
2- Checking whether words entered by the user are actually in the list.
3- Retrieving the entire list from the txt file and storing it temporarily for subsequent manipulation.
I'm not asking for an implementation or pseudocode; I'm looking for a suitable approach to dealing with a massive list of words. For the time being, I might go with a vector of strings; however, searching thousands of words will take some time. Of course there must be strategies for coping with this kind of task, but since my background is not Computer Science, I don't know which direction to go in. Any suggestions are welcome.
A vector of strings is fine for this problem. Just sort them, and then you can use binary search to find a string in the list.
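A minimal sketch of that idea (assuming one word per line in a words.txt file; the names are illustrative):

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> words;
    std::ifstream in("words.txt");           // assumed: one word per line
    for (std::string w; std::getline(in, w); )
        words.push_back(w);

    std::sort(words.begin(), words.end());   // sort once up front

    // O(log n) membership test for a user-entered word.
    bool found = std::binary_search(words.begin(), words.end(), std::string("example"));
    return found ? 0 : 1;
}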
Radix trees are a good solution for searching through word lists for matches. They reduce storage space, but you'll need some custom code for getting words into and out of the list, and the text file won't necessarily be easy to read unless you rebuild the tree each time you load from the text file. Here's an implementation I committed to GitHub (I can't even remember the source material at this point) that might be of assistance to you.
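For a feel of the structure, here is a minimal sketch of an uncompressed trie; a real radix tree additionally merges single-child chains, but insertion and lookup work the same way (this is not the GitHub implementation mentioned above):

#include <memory>
#include <string>
#include <unordered_map>

struct TrieNode
{
    bool is_word = false;
    std::unordered_map<char, std::unique_ptr<TrieNode>> children;
};

void insert(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word)
    {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->is_word = true;                                // mark the end of a complete word
}

bool contains(const TrieNode& root, const std::string& word)
{
    const TrieNode* node = &root;
    for (char c : word)
    {
        auto it = node->children.find(c);
        if (it == node->children.end()) return false;    // path missing -> not in the list
        node = it->second.get();
    }
    return node->is_word;
}

int main()
{
    TrieNode root;
    insert(root, "apple");
    insert(root, "apply");
    return contains(root, "apple") ? 0 : 1;              // lookup example
}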
I have a CSV file that has about 10 different columns. I'm trying to figure out the best method to go about this.
The data looks like this:
"20070906 1 0 0 NO"
There are about 40,000 records like this to be analyzed. I'm not sure what's best here: split each column into its own vector, or put each whole row into a vector?
Thanks!
I think this is kind of a subjective question, but IMHO a single vector that contains the split-up rows will likely be easier to manage than separate vectors for each column. You could even create a row object for the vector to store, to make accessing and processing the data in the rows/columns friendlier.
That said, if you are only doing processing at the column level and not at the row or entry level, individual column vectors would be easier.
Since the data set is fairly small (assuming you are using a PC and not some other device, like a smartphone), you can read the file line by line into a vector of strings, then parse the elements one by one and populate a vector of structures holding the record data.
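A minimal sketch of the row-object approach both answers describe (the file name is an assumption, and the fields are split on spaces to match the sample line):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// One object per row; fields stay in the order they appear in the file.
struct Record
{
    std::vector<std::string> fields;
};

int main()
{
    std::vector<Record> records;
    std::ifstream in("data.csv");                     // assumed file name
    for (std::string line; std::getline(in, line); )
    {
        Record rec;
        std::istringstream tokens(line);              // sample rows look space-separated
        for (std::string field; tokens >> field; )
            rec.fields.push_back(field);
        records.push_back(std::move(rec));
    }
    // records[i].fields[j] is column j of row i.
}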
I need to export data from 3 maps, preferably to a single CSV, and would like to do so without simply making a column for every possible key (there may be up to 65,024 of them).
The output would be a CSV containing the value at each key at each timestep (there may be several hundred thousand timesteps).
Anyone got any ideas?
Reduce the granularity by categorizing your keys into groups and storing them with one timestep per row. Then you can plot one data point per line.
Let me know if you need clarification; I'd need some more info about your data to say more.
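A rough sketch of that layout (hedged: the grouping function, summing as the aggregation, and the map types are all assumptions, and the three maps are collapsed into one series of snapshots for brevity):

#include <fstream>
#include <map>
#include <vector>

// Assumed shape: one map per timestep, keyed by an integer id, with double values.
using Snapshot = std::map<int, double>;

int group_of(int key) { return key / 256; }       // assumed grouping: 65,024 keys -> 254 groups

int main()
{
    std::vector<Snapshot> timesteps;              // filled elsewhere from the three maps
    const int kGroups = 254;

    std::ofstream out("export.csv");
    for (std::size_t t = 0; t < timesteps.size(); ++t)
    {
        std::vector<double> totals(kGroups, 0.0);
        for (const auto& [key, value] : timesteps[t])
            totals[group_of(key)] += value;       // aggregate each group (sum here)
        out << t;                                 // one row per timestep
        for (double v : totals) out << ',' << v;
        out << '\n';
    }
}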
I have downloaded the yago.n3 dataset.
However, for testing I wish to work on a smaller version of the dataset (as the full dataset is 2 GB), and even a small change takes me a lot of time to debug.
Therefore, I tried to copy a small portion of the data into a separate file; however, this did not work and threw lexical errors.
I saw the earlier posts; however, they are about big datasets, whereas I am looking for a smaller one.
Is there any way I can obtain a smaller version of the same dataset?
If you have an RDF parser at hand to read your yago.n3 file, you can parse it and write to a separate file as many RDF triples as you want/need for the smaller dataset you run your experiments with.
If you find the data in N-Triples format (i.e. one RDF triple per line), you can just take as many lines as you want and make the dataset as small as you like: head -n 10 filename.nt would give you a tiny dataset of 10 triples.
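If you would rather do the truncation from code instead of the shell, it is just copying the first N lines (a trivial sketch; the file names and the line count are assumptions):

#include <fstream>
#include <string>

int main()
{
    std::ifstream in("yago.nt");                  // assumed N-Triples version of the data
    std::ofstream out("yago_small.nt");
    std::string line;
    for (int i = 0; i < 1000 && std::getline(in, line); ++i)   // keep the first 1,000 triples
        out << line << '\n';
}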