C++ Request a specific row from a file

C++ Request a specific row from a file - c++

Is there a way, where I can open a file that contains a large amount of data, and retrieve only one specific row or index, without getting the rest of the content as well?
Update:
Based on what others have mentioned here in the comments, I have some follow-up questions.
Can anyone give me an example of how to put a fixed width on the rows/linebreaks(whatever you want to call it), or show me a good source where I can read more about it?
So if I set this up correctly, I will be able to get a specific line from the file superfast, even if it contains several million rows?

If you want to access a file by records or rows, and the rows are not fixed length, you'll have to create a structure that you can associate (or map) file positions to row indices.
I recommend using std::vector<std::streampos>.
Read through the file.
When the file is at the beginning of a row, read the file position and append to the vector.
If you need to access a row in the file:
1) Use the vector to get the file position of the row.
2) Seek to the row using the file position.
This technique will work with fixed length and variable length rows.

Related

Fixed Length Flat file Parsing

I have a flat file tables say, student.tbl and employee.tbl. Both files are fixed length files. I have a supporting files for both files with the information field code, field description, field Offset and field size.
for example,
ename string 0 10
eage number 10 2
ecity string 12 10
I wrote code to fetch data from the flat files using STL in c++. I am using vector to load those data.
My simple algorithm to load data from Fixed Length file.
1) Read Supporting file.
2) Load supporting file data into a 2D vector string say,
column_records;
3) Read Table file.
4) Get First Line from the Table File, say Data Line.
5) Get First Column Information from the supporting Table Which is
First Row of column_records.
6) Chop Data Line based on the column_record
7) Push the chopped data into a One Dimensional Vector say,
record_vector.
8) Do Step 5, Until the Last Column Information has processed.
9) Push record_vector into 2D vector say,Table_Vector.
10) Do Step 4, Until the last line of the Fixed File has reached.
Well. I did it well. It works fine. But my problem is, in Step 5.
For every fixed length data, I feel there was some repeat Iterations.
I know for a fact, First Fixed Length data itself can have retain the column descriptions for other fixed length data. But I repeatedly doing the Iteration N*M. I wish to my iteration should be 1*M.
I know that I can store my column description in a Structure array. But I have many type of tables. say students.tbl and employee.tbl. Both have different set of columns. So I think it will be bad Idea to have 'N'-struct declaration to load 'N'-supporting Tables.
I wish to use same routine to fetch data from the both tables or 'N' tables. My supporting table format will not be changed. It is in tab delimited format. This is my case.
How do I fetch data from table with 1*M iteration?
I hope I can use enumeration to do this. But I don't know how to do that? will using enumeration or macro solve this issue?
I hope my working source code will not be needed for this Question. If you think source code are needed to answer this question, definitely I will update this question with that source code. I have medium level of English Knowledge. So Sorry for Inconvenience.
Thank You.

How to parse, read, and store only one column of .CSV file into an array in C++

I have a .CSV file that's storing data from a laser. It records the height of the laser beam every second.
The .CSV file ends up having rows for each measurement that are all in this format:
DR,04,#
where the # is the height reading.
For example, if the beam is at a height of 10, the reading would say:
DR,04,10.
I want my program in C++ to read only the height (third column of the .CSV) from each row and put it into an array. I do not want the first two columns at all. That way I end up with an array with just a bunch of height values from each measurement.
How do I do that?

You can use strtok() to separate out the three columns. And then just get the last value.
You could also just take the string and scan for the first comma, and then scan from there for the second comma. What follows is the value you are after.
You could also use sscanf() to parse out the individual values.
This really isn't a difficult problem, and there are many ways to approach it. That is why people are complaining that you probably should've tried something and then ask a question here when you get stuck on a specific question.

A way to retrieve data by address (c++)

Using c++, is it possible to store data to a file, and retrieve that data by address for quicker access? I want to get around having to parse or iterate large files of data, with the ability to gain direct access to a subset of that data. In your answers, it does not matter how the data is stored; whatever works best with the answer you have.

Yes. Assuming you're using iostreams, you can use tellg and tellp to retrieve the current get and put (i.e., read and write) locations respectively. You can later feed the same value back to seekg or seekp to get back to the same location (again, for reading or writing respectively).
You can use these to (for one example) create an index into a file. Before writing each record to your primary data file, you'd use tellp to retrieve the current location. Then you'd store the data to the data file, and save the value tellp returned into the index file. Depending on what sort of index you want, that might just contain a series of locations, so you can seek directly to record #N in the data file (even if the records are of different sizes).
Alternatively, you might store the data for some key field in the index file. For example, you might have a main data file with a set of records about people. Then you might build a number of indices into that, one with last names and a location for each, another with birthdays and a location for each, and so on, so you can search by name or birthday (or do an intersection between them to support things like people older than 18 with a last name starting with "M", "N" or "O").

Splitting File Input Into Numerous Vectors

I have a CSV file that has about 10 different columns. Im trying to figure out whats the best method to go about here.
Data looks like this:
"20070906 1 0 0 NO"
Theres about 40,000 records like this to be analyzed. Im not sure whats best here, split each column into its own vector, or put each whole row into a vector.
Thanks!

I think this is kind of subjective question but imho I think that having a single vector that contains the split up rows will likely be easier to manage than separate vectors for each column. You could even create a row object that the vector stores to make accessing and processing the data in the rows/columns more friendly.
Although if you are only doing processing on a column level and not on a row or entry level having individual column vectors would be easier.

Since the data set is fairly small (assuming you are using a PC and not some other device, like a smartphone), you can read the file line by line into a vector of strings and then parse the elements one by one and populate a vector of some structures holding the records data.

C++ inserting a line into a file at a specific line number

I want to be able to read from an unsorted source text file (one record in each line), and insert the line/record into a destination text file by specifying the line number where it should be inserted.
Where to insert the line/record into the destination file will be determined by comparing the incoming line from the incoming file to the already ordered list in the destination file. (The destination file will start as an empty file and the data will be sorted and inserted into it one line at a time as the program iterates over the incoming file lines.)
Incoming File Example:
1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data
Desired Destination File Example:
2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data
I could do this by performing the sort in memory via a linked list or similar, but I want to allow this to scale to very large files. (And I am having fun trying to solve this problem as I am a C++ newbie :).)
One of the ways to do this may be to open 2 file streams with fstream (1 in and 1 out, or just 1 in/out stream), but then I run into the difficulty that it's difficult to find and search the file position because it seems to depend on absolute position from the start of the file rather than line numbers :).
I'm sure problems like this have been tackled before, and I would appreciate advice on how to proceed in a manner that is good practice.
I'm using Visual Studio 2008 Pro C++, and I'm just learning C++.

The basic problem is that under common OSs, files are just streams of bytes. There is no concept of lines at the filesystem level. Those semantics have to be added as an additional layer on top of the OS provided facilities. Although I have never used it, I believe that VMS has a record oriented filesystem that would make what you want to do easier. But under Linux or Windows, you can't insert into the middle of a file without rewriting the rest of the file. It is similar to memory: At the highest level, its just a sequence of bytes, and if you want something more complex, like a linked list, it has to be added on top.

If the file is just a plain text file, then I'm afraid the only way to find a particular numbered line is to walk the file counting lines as you go.
The usual 'non-memory' way of doing what you're trying to do is to copy the file from the original to a temporary file, inserting the data at the right point, and then do a rename/replace of the original file.
Obviously, once you've done your insertion, you can copy the rest of the file in one big lump, because you don't care about counting lines any more.

A [distinctly-no-c++] solution would be to use the *nix sort tool, sorting on the second column of data. It might look something like this:
cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>
It's not exactly in-place, and it fails the request of using C++, but it does work :)
Might even be able to do:
cat <file> | sort -k 2,2 > <file>
I haven't tried that route, though.
* http://www.ss64.com/bash/sort.html - sort man page

One way to do this is not to keep the file sorted, but to use a separate index, using berkley db (BerkleyDB). Each record in the db has the sort keys, and the offset into the main file. The advantage to this is that you can have multiple ways of sorting, without duplicating the text file. You can also change lines without rewriting the file by appending the changed line at the end, and updating the index to ignore the old line and point to the new one. We used this successfully for multi-GB text files that we had to make many small changes to.
Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files under source/IO.

Try a modifed Bucket Sort. Assuming the id values lend themselves well to it, you'll get a much more efficient sorting algorithm. You may be able to enhance I/O efficiency by actually writing out the buckets (use small ones) as you scan, thus potentially reducing the amount of randomized file/io you need. Or not.

Hopefully, there are some good code examples on how to insert a record based on line number into the destination file.
You can't insert contents into a middle of the file (i.e., without overwriting what was previously there); I'm not aware of production-level filesystems that support it.

I think the question is more about implementation rather than specific algorithms, specifically, handling very large datasets.
Suppose the source file has 2^32 lines of data. What would be an efficent way to sort the data.
Here's how I'd do it:
Parse the source file and extract the following information: sort key, offset of line in file, length of line. This information is written to another file. This produces a dataset of fixed size elements that is easy to index, call it the index file.
Use a modified merge sort. Recursively divide the index file until the number of elements to sort has reached some minimum amount - true merge sort recurses to 1 or 0 elements, I suggest stopping at 1024 or something, this will need fine tuning. Load the block of data from the index file into memory and perform a quicksort on it and then write the data back to disk.
Perform the merge on the index file. This is tricky, but can be done like this: load a block of data from each source (1024 entries, say). Merge into a temporary output file and write. When a block is emptied, refill it. When no more source data is found, read the temporary file from the start and overwrite the two parts being merged - they should be adjacent. Obviously, the final merge doesn't need to copy the data (or even create a temporary file). Thinking about this step, it is probably possible to set up a naming convention for the merged index files so that the data doesn't need to overwrite the unmerged data (if you see what I mean).
Read the sorted index file and pull out from the source file the line of data and write to the result file.
It certainly won't be quick with all that file reading and writing, but is should be quite efficient - the real killer is the random seeking of the source file in the final step. Up to that point, the disk access is usually linear and should therefore be reasonably efficient.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js