VAX Fortran Keyed indexed file - sequential access

Okay, I know I'm going back a few years, but maybe I'll run across some graybeards (like mine) :).
I have an indexed data file with a key field. It's opened like so in the application:
OPEN (FILE='DATA.MAS',STATUS='OLD',
1 ORGANIZATION='INDEXED',ACCESS='KEYED',
1 RECL=28,UNIT=LUNTM,SHARED,
1 KEY=(1:49:CHARACTER),
1 IOSTAT=IOS,ERR=9999)
I need to be able to scan the content of this file sequentially. However, every combination of organization and access options in the open, followed by reads always results in an error, either on the open or the read. Is it even possible to get the nth record of a keyed file?

Okay, found the solution after reading the doc for the umpteenth time. I changed the OPEN statement to use SEQUENTIAL access with INDEXED organization. What I missed was that when you do this, FORTRAN interprets the file as FORMATTED. Adding FORM='UNFORMATTED' and adjusting the record size yields happiness and yuletide greetings.

Related

Best way to read this file to manipulate later?

I am given a config file that looks like this for example:
Start Simulator Configuration File
Version/Phase: 2.0
File Path: Test_2e.mdf
CPU Scheduling Code: SJF
Processor cycle time (msec): 10
Monitor display time (msec): 20
Hard drive cycle time (msec): 15
Printer cycle time (msec): 25
Keyboard cycle time (msec): 50
Mouse cycle time (msec): 10
Speaker cycle time (msec): 15
Log: Log to Both
Log File Path: logfile_1.lgf
End Simulator Configuration File
I am supposed to be able to take this file and output the cycle and cycle times to a log and/or monitor. I am then supposed to pull data from a meta-data file that will tell me how many cycles each of these run (among other things), and then I'm supposed to calculate and log the total time. For example, 5 hard drive cycles would be 75 msec. The config and meta-data files can come in any order.
I am thinking I will put each item in an array and then cycle through, waiting for true when the strings match (this will also help detect file errors). The config file should always be the same size despite a different order. The meta-data file can be any size, so I figured I would do a similar thing but in a vector.
Then I will multiply the cycle times from the config file by the number of cycles in the matching metadata file string. I think the best way to read the data from the vector is in a queue.
Does this sound like a good idea?
I understand most of the concepts, but my data structures knowledge is shaky when it comes to actually coding it. For example, when reading from the files, should I read line by line, or would it be best to separate the ints from the strings to calculate them later? I've never had to do this with a file whose contents can change before.
If I separate them, would I have to use separate arrays/vectors?
I'm using C++, by the way.

Your logic should be:
1. Create two std::map variables, one that maps a string to a string, and another that maps a string to a float.
2. Read each line of the file.
3. If the line contains a :, split the string into two parts:
3a. Part A is the line from index zero up to, but not including, the index of the :
3b. Part B is the rest of the line, starting one past the index of the :
4. Store these two parts in the appropriate std::map, based on the value type.
Now you have read the file properly. When you read the meta-data file, you simply take each key from it, use it to look up the corresponding value in your configuration data, then do whatever mathematical operation is required.
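The steps above could be sketched roughly like this. It is only a minimal sketch: the names Config, trim and parseConfig are made up for illustration, and it assumes that any value which parses completely as a number belongs in the float map and everything else in the string map.

```cpp
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// The two lookup tables from step 1 (hypothetical struct name).
struct Config {
    std::map<std::string, std::string> text;  // e.g. "Log" -> "Log to Both"
    std::map<std::string, float> numbers;     // e.g. "Hard drive cycle time (msec)" -> 15
};

// Strip leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

Config parseConfig(std::istream& in) {
    Config cfg;
    std::string line;
    while (std::getline(in, line)) {                        // step 2
        std::string::size_type colon = line.find(':');      // step 3
        if (colon == std::string::npos) continue;           // "Start ..." / "End ..." lines
        std::string key = trim(line.substr(0, colon));      // step 3a
        std::string value = trim(line.substr(colon + 1));   // step 3b
        // Step 4: assumption: a value that parses fully as a number is numeric.
        char* end = nullptr;
        float number = std::strtof(value.c_str(), &end);
        if (!value.empty() && *end == '\0') {
            cfg.numbers[key] = number;
        } else {
            cfg.text[key] = value;
        }
    }
    return cfg;
}
```

With the config from the question, 5 hard drive cycles would then be 5 * cfg.numbers.at("Hard drive cycle time (msec)"). Because the lookup is by key, the order of the lines in the file does not matter.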

Open statement in fortran

I have question with open command of fortran.
OPEN (UNIT = , FILE=file-name, ACCESS=access, FORM=form, RECL=recl)
Access = sequential, direct
FORM=formatted, unformatted
recl is the record length in bytes for a file
I tried searching a lot, but could not work out the meaning of sequential or direct access, formatted or unformatted file, or the record length of a file. Can someone explain what these terms mean?

File access specifies how the file will be written to (or read from) after opening. Opening with one access mode, but reading/writing consistent with another access mode, often results in a runtime error.
Sequential access, naturally enough, implies reading and writing sequentially. Writing sequentially means that output is placed in the output file in the same order that the program produces it so, if X is output before Y, the file will contain X before Y (that is, closer to the beginning of the file). Reading sequentially means that reading occurs from the start toward the end of the file. Append access is a special form of sequential access which starts at the end of the file (so write operations add to the end of the file).
Direct access means that contents of the file can be accessed in any order. This is also called random access. Essentially, when performing input or output, the program must specify the position in the file where the operation is to occur.
The position in the direct access file in Fortran is specified in terms of "records", which all have exactly the same length (specified by the RECL= clause when the file is opened). So, if a file contains 20 records and has record length equal to 30, the total size of data the program can access from the file is 600 bytes, and every read or write operation will access a record containing 30 bytes.
An unformatted file basically means the contents of the file are read and written in the machine's internal binary representation, without conversion to text. An unformatted sequential access file is roughly the equivalent of a binary file in languages like C that is read from beginning to end (although most compilers also store record-length markers around each record). An unformatted direct access file is also binary, but operations can access the file in any order (under control of the program).
A formatted file essentially means that all reading and writing must involve a format specification. There are also some special treatments such as, when writing, a newline marker written to the file at the end of every write statement.
A straight text file is typically opened as a sequential access formatted file. Every Fortran read or write operation acts on a new line (so two write statements will produce two lines in the file, and two corresponding read statements will be needed to read them back in).
It is possible to have a formatted direct access file. This basically means the read and write statements must specify formats to read/write the records, but records can be accessed in any order. The ends of records are typically marked with newlines.

It's easy to find on the web (including discussion here):
A "record" is data, usually in characters. Some files have records which are all the same length, some do not. In between, there are files which store the length of each record as part of the record. It is simplest to work with files having records which are all the same length, because (for many storage devices) you can compute the beginning of a particular record by knowing the record number and the length of the records. If the records are different lengths, it is more work to keep track of the record locations.
Sequential files are accessed one record at a time, like a tape (see this page for length discussion). As a rule, tapes could be rewound and read forward, but reading at a random point was harder. Doing that is direct access. This page makes it clear that there is a distinct choice between the two - you can have one or the other.
Formatted output is just that - making the output follow some report-style format (on the level of lines), while unformatted output does not follow tidy rules. See Fortran unformatted file format for examples of discussion. On a more technical slant, this page at Oracle goes into more depth.

How to deserialize a file containing multiple records

I've written a Thrift definition and used this definition to serialize multiple records into one file (I've added the size of the whole record at the beginning of each record). This is, in short, what I have done:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
myClass->write(protocol.get());
const std::string & data(transport->getBufferAsString());
Afterwards I just print the string data in binary mode. Now I want to deserialize this file again. I wouldn't have any problem if there was only one record in the file; unfortunately I have to print multiple records, so I guess I have to work with an offset based on the size I saved in the file along with the record itself. However, I can't seem to find any example I can use to achieve my goal, and the official documentation is quite lacking. Does anyone have any tips for me? If I'm missing some information, just ask.
Further information:
Of course I want to use Thrift to deserialize. However, one file can contain multiple records. For example: imagine I have defined a struct in a Thrift definition file that contains car information. Now I serialize multiple car structs into one output file. Serializing is no problem, as I just append the data. If I want to deserialize, however, I have to know where one record ends and the next begins. That is my problem: I don't know how to tell Thrift where one record begins and ends. I've searched the internet, but can't seem to find an example for C++ (I have one for Python so far, but am not able to translate it to C++). The structure of one file can be described as follows: [lengthofrecord1][record1][lengthofrecord2][record2][...]
Thanks in Advance
Michael

How about having a list<records> that you de/serialize as a whole? Or is it an absolute requirement to read them independently and randomly? If yes, I see 1.5 (one and a half) possible solutions:
Have a second file as an index. This holds a map<recordNumber, offset>, or simply a sorted list of integer pairs, to quickly locate records. Since the index is much smaller than the records themselves, you can probably keep it cached in memory all the time.
The half solution: if the record size is fixed, any record's position can be calculated easily as recordSize * (recordNr-1). This way you don't even need the size prefix. If you have strings in the record or other variable-sized entities, this will not work, unless you force a fixed record size by reserving a buffer for each record with a predefined (maximum) size. It's a little ugly, thus the "half" solution, but you don't need the index file.

Although maybe not the perfect solution, this seems to work for me:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
transport->resetBuffer((uint8_t*) buffer, sizeOfEntry);
Buffer is a char array containing the desired record (I used seekg for the offset) and sizeOfEntry is the record's size. Afterwards I can go on with the automatically generated read method of my Thrift-generated class. In fact, I had this solution earlier; I just messed up my offset, so it didn't work.

What's the best way to write to more files than the kernel allows open at a time?

I have a very large binary file and I need to create separate files based on the id within the input file. There are 146 output files and I am using cstdlib and fopen and fwrite. FOPEN_MAX is 20, so I can't keep all 146 output files open at the same time. I also want to minimize the number of times I open and close an output file.
How can I write to the output files effectively?
I also must use the cstdlib library due to legacy code.
The executable must also be UNIX and windows cross-platform compatible.

A couple possible approaches you might take:
Keep a cache of open output file handles that's smaller than FOPEN_MAX. If a write needs to occur on a file that's already open, just do the write. Otherwise, close one of the handles in the cache and open the output file. If the data for a particular set of files is generally grouped together in the input file, this should work nicely with an LRU policy for the file handle cache.
Handle the output buffering yourself instead of letting the library do it for you: keep your own set of 146 (or however many you might need) output buffers and buffer the output to those, and perform an open/flush/close when a particular output buffer gets filled. You could even combine this with the above approach to really minimize the open/close operations.
Just be sure you test well for the edge conditions that can happen on filling or nearly filling an output buffer.
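A minimal sketch of the first approach (an LRU cache of FILE handles) might look like the following. FileCache and the out_<id>.bin naming are invented for illustration. Files are opened in append mode so that a handle which was evicted and later reopened carries on where it left off, which also means stale output files should be deleted before a run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <list>
#include <map>
#include <string>

class FileCache {
public:
    explicit FileCache(std::size_t capacity) : capacity_(capacity) { assert(capacity > 0); }
    ~FileCache() { closeAll(); }
    FileCache(const FileCache&) = delete;
    FileCache& operator=(const FileCache&) = delete;

    // Append `size` bytes to the output for `id`, opening the file
    // (and evicting the least recently used handle) when necessary.
    bool write(int id, const void* data, std::size_t size) {
        std::FILE* f = handleFor(id);
        return f != nullptr && std::fwrite(data, 1, size, f) == size;
    }

    void closeAll() {
        for (auto& e : open_) std::fclose(e.second.file);
        open_.clear();
        order_.clear();
    }

    std::size_t openCount() const { return open_.size(); }

private:
    struct Entry {
        std::FILE* file;
        std::list<int>::iterator pos;  // position of this id in order_
    };

    // Hypothetical naming scheme for the per-id output files.
    static std::string nameFor(int id) { return "out_" + std::to_string(id) + ".bin"; }

    std::FILE* handleFor(int id) {
        auto it = open_.find(id);
        if (it != open_.end()) {
            // Already open: mark as most recently used.
            order_.splice(order_.begin(), order_, it->second.pos);
            return it->second.file;
        }
        if (open_.size() >= capacity_) {
            // Cache full: close the least recently used handle.
            int victim = order_.back();
            std::fclose(open_[victim].file);
            open_.erase(victim);
            order_.pop_back();
        }
        std::FILE* f = std::fopen(nameFor(id).c_str(), "ab");
        if (f == nullptr) return nullptr;
        order_.push_front(id);
        Entry e = {f, order_.begin()};
        open_[id] = e;
        return f;
    }

    std::size_t capacity_;
    std::list<int> order_;       // ids, most recently used first
    std::map<int, Entry> open_;  // id -> open handle
};
```

The capacity would be chosen somewhat below FOPEN_MAX to leave room for the input file and the standard streams.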

It may also be worth scanning the input file, making a list of each output id and sorting it so that you write all the file1 entries first, then all the file2 entries, etc.

If you cannot increase the max FOPEN_MAX somehow, you can create a simple queue of requests and then close and re-open files as needed.
You can also keep track of the last write-time for each file, and try to keep the most recently written files open.

The solution seems obvious - open N files, where N is somewhat less than FOPEN_MAX. Then read through the input file and extract the contents of the first N output files. Then close the output files, rewind the input, and repeat.

First of all, I hope you are running as much in parallel as possible. There is no reason why you can't write to multiple files at the same time. I'd recommend doing what thomask said and queue requests. You can then use some thread synchronization to wait until the entire queue is flushed before allowing the next round of writes to go through.

You haven't mentioned if it's critical to write to these outputs in "real-time", or how much data is being written. Subject to your constraints, one option might be to buffer all the outputs and write them at the end of your software run.
A variant of this is to set up internal buffers of a fixed size; once you hit the internal buffer limit, open the file, append, and close, then clear the buffer for more output. The buffers reduce the number of open/close cycles and give you bursts of writes, which the file system is usually set up to handle nicely. This would be for cases where you need somewhat real-time writes, and/or data is bigger than available memory, and file handles exceed some max in your system.
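Here is one way the fixed-size buffer variant could look; treat it as a sketch. BufferedAppender is a made-up class, the limit is in bytes, and one instance per output file is assumed. The file is only open for the duration of a flush, so at most one output handle is in use at any time.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

class BufferedAppender {
public:
    BufferedAppender(const std::string& path, std::size_t limit)
        : path_(path), limit_(limit), flushes_(0) {}

    // Collect data in memory; once the buffer reaches the limit, write it out.
    bool add(const void* data, std::size_t size) {
        const char* p = static_cast<const char*>(data);
        buffer_.insert(buffer_.end(), p, p + size);
        return buffer_.size() < limit_ || flush();
    }

    // Open, append, close, then clear the buffer. Call once more at the end
    // of the run to write whatever is left over.
    bool flush() {
        if (buffer_.empty()) return true;
        std::FILE* f = std::fopen(path_.c_str(), "ab");
        if (f == nullptr) return false;
        bool ok = std::fwrite(buffer_.data(), 1, buffer_.size(), f) == buffer_.size();
        ok = (std::fclose(f) == 0) && ok;
        if (ok) {
            buffer_.clear();
            ++flushes_;
        }
        return ok;
    }

    std::size_t flushes() const { return flushes_; }  // number of open/close cycles so far

private:
    std::string path_;
    std::size_t limit_;
    std::size_t flushes_;
    std::vector<char> buffer_;
};
```

A larger limit means fewer open/close cycles at the cost of more memory per output file, so the right value is something to measure rather than guess.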

You can do it in 2 steps.
1) Write the first 19 ids to one file, the next 19 ids to the next file and so on. So you need 8 output files (and the input file) opened in parallel for this step.
2) For every so created file create 19 (only 13 for the last one) new files and write the ids to it.
Independent of how large the input file is and how many id-datasets it contains, you always need to open and close 163 files. But you need to write the data twice, so it may only be worth it if the id-datasets are really small and randomly distributed.
I think in most cases it is more efficient to open and close the files more often.

The safest method is to open a file and flush after writing, then close if no more recent writing will take place. Many things outside your program's control can corrupt the content of your file. Keep this in mind as you read on.
I suggest keeping an std::map or std::vector of FILE pointers. The map allows you to access file pointers by an ID. If the ID range is small, you could create a vector, reserving elements, and using the ID as an index. This will allow you to keep a lot of files open at the same time. Beware the concept of data corruption.
The limit of simultaneous open files is set by the operating system. For example, if your OS has a maximum of 10, you will have to make arrangements when the 11th file is requested.
Another trick is to reserve buffers in dynamic memory for each file. When all the data is processed, open a file (or more than one), write the buffer (using one fwrite), close and move on. This may be faster since you are writing to memory during the data processing rather than a file. An interesting side note is that your OS may also page the buffers to the hard drive. The size and quantity of buffers is an optimization issue that is platform dependent (you'll have to adjust and test to get a good combination). Your program will slow down if the OS pages the memory to the disk.

Well, if I was writing it with your listed constraints in the OP, I would create 146 buffers and plop the data into them, then at the end, sequentially walk through the buffers and close/open a single file-handle.
You mentioned in a comment that speed was a major concern and that the naive approach is too slow.
There are a few things that you can start considering. One is a reorganizing of the binary file into sequential strips, which would allow parallel operations. Another is a least-recently used approach to your filehandle collection. Another approach might be to fork out to 8 different processes, each outputting to 19-20 files.
Some of these approaches will be more or less practical to write depending on binary organization (highly fragmented vs. highly sequential).
A major constraint is the size of your binary data. Is it bigger than cache? bigger than memory? streamed out of a tape deck? Continually coming off a sensor stream and only existing as a 'file' in memory? Each of those presents a different optimization strategy...
Another question is usage patterns. Are you doing occasional spike writes to the files, or are you having massive chunks written only a few times? That determines the effectiveness of the different caching/paging strategies of filehandles.

Assuming you are on a *nix system, the limit is per process, not system-wide. So that implies you could launch multiple processes, each responsible for a subset of the id's you are filtering for. Each could keep within the FOPEN_MAX for its process.
You could have one parent process reading the input file then sending the data to various 'write' processes through pipe special files.

"Fewest File Opens" Strategy:
To achieve a minimum number of file opens and closes, you will have to read through the input multiple times. Each time, you pick a subset of the ids that need sorting, and you extract only those records into the output files.
Pseudocode for each thread:
1. Run through the file, collect all the unique ids.
2. fseek() back to the beginning of the input.
3. For every group of 19 IDs:
3a. Open a file for each ID.
3b. Run through the input file, appending matching records to the corresponding output file.
3c. Close this group of 19 output files.
3d. fseek() to the beginning of the input.
This method doesn't work quite as nicely with multiple threads, because eventually the threads will be reading totally different parts of the file. When that happens, it's difficult for the file cache to be efficient. You could use barriers to keep the threads more-or-less in lock-step.
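A single-threaded sketch of the multi-pass idea follows. The question does not describe the record layout, so the fixed-size Record struct, the id_<id>.bin file names and the function name splitInPasses are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Record {  // hypothetical fixed-size record: an id followed by a payload
    std::int32_t id;
    char payload[12];
};

// Hypothetical naming scheme for the per-id output files.
std::string outputName(std::int32_t id) { return "id_" + std::to_string(id) + ".bin"; }

// Splits the input into one file per id, keeping at most `groupSize` outputs
// open at a time. Returns the number of writing passes, or -1 on error.
int splitInPasses(const std::string& inputPath, std::size_t groupSize) {
    if (groupSize == 0) return -1;
    std::FILE* in = std::fopen(inputPath.c_str(), "rb");
    if (in == nullptr) return -1;

    // First run through the file: collect all the unique ids.
    std::set<std::int32_t> ids;
    Record rec;
    while (std::fread(&rec, sizeof rec, 1, in) == 1) ids.insert(rec.id);
    std::vector<std::int32_t> all(ids.begin(), ids.end());

    int passes = 0;
    for (std::size_t start = 0; start < all.size(); start += groupSize) {
        // Open one output file for each id in this group.
        std::map<std::int32_t, std::FILE*> out;
        bool ok = true;
        for (std::size_t i = start; i < all.size() && i < start + groupSize; ++i) {
            std::FILE* f = std::fopen(outputName(all[i]).c_str(), "wb");
            if (f == nullptr) { ok = false; break; }
            out[all[i]] = f;
        }
        if (ok) {
            std::rewind(in);  // back to the beginning of the input
            while (std::fread(&rec, sizeof rec, 1, in) == 1) {
                auto it = out.find(rec.id);
                if (it != out.end() && std::fwrite(&rec, sizeof rec, 1, it->second) != 1) {
                    ok = false;
                }
            }
        }
        for (auto& e : out) std::fclose(e.second);  // close this group
        if (!ok) { std::fclose(in); return -1; }
        ++passes;
    }
    std::fclose(in);
    return passes;
}
```

With 146 ids and groups of 19 this makes 8 writing passes over the input (plus the initial scan), and every output file is opened and closed exactly once.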
"Fewest File Operations" Strategy
You could use multiple threads and a large buffer pool to make only one run-through of the input. This comes at the expense of more file opens and closes (probably). Each thread would, until the whole file was sorted:
1. Choose the next unread page of the input.
2. Sort that input into 2-page buffers, one buffer for each output file. Whenever one buffer page is full:
2a. Mark the page as unavailable.
2b. If this page has the lowest page-counter value, append it to the file using fwrite(). If not, wait until it is the lowest (hopefully, this doesn't happen much).
2c. Mark the page as available, and give it the next page number.
You could change the unit of flushing output files to disk. Maybe you have enough RAM to collect 200 pages at a time, per output file?
Things to be careful about:
Is your data page-aligned? If not, you'll have to be clever about reading "the next page".
Make sure you don't have two threads fwrite()'ing to the same output file at the same time. If that happens, you might corrupt one of the pages.

C++ inserting a line into a file at a specific line number

I want to be able to read from an unsorted source text file (one record in each line), and insert the line/record into a destination text file by specifying the line number where it should be inserted.
Where to insert the line/record into the destination file will be determined by comparing the incoming line from the incoming file to the already ordered list in the destination file. (The destination file will start as an empty file and the data will be sorted and inserted into it one line at a time as the program iterates over the incoming file lines.)
Incoming File Example:
1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data
Desired Destination File Example:
2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data
I could do this by performing the sort in memory via a linked list or similar, but I want to allow this to scale to very large files. (And I am having fun trying to solve this problem as I am a C++ newbie :).)
One of the ways to do this may be to open 2 file streams with fstream (1 in and 1 out, or just 1 in/out stream), but then I run into the difficulty that it's difficult to find and search the file position because it seems to depend on absolute position from the start of the file rather than line numbers :).
I'm sure problems like this have been tackled before, and I would appreciate advice on how to proceed in a manner that is good practice.
I'm using Visual Studio 2008 Pro C++, and I'm just learning C++.

The basic problem is that under common OSs, files are just streams of bytes. There is no concept of lines at the filesystem level. Those semantics have to be added as an additional layer on top of the OS-provided facilities. Although I have never used it, I believe that VMS has a record-oriented filesystem that would make what you want to do easier. But under Linux or Windows, you can't insert into the middle of a file without rewriting the rest of the file. It is similar to memory: at the lowest level, it's just a sequence of bytes, and if you want something more complex, like a linked list, it has to be added on top.

If the file is just a plain text file, then I'm afraid the only way to find a particular numbered line is to walk the file counting lines as you go.
The usual 'non-memory' way of doing what you're trying to do is to copy the file from the original to a temporary file, inserting the data at the right point, and then do a rename/replace of the original file.
Obviously, once you've done your insertion, you can copy the rest of the file in one big lump, because you don't care about counting lines any more.

A [distinctly-no-c++] solution would be to use the *nix sort tool, sorting on the second column of data. It might look something like this:
cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>
It's not exactly in-place, and it fails the request of using C++, but it does work :)
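Might even be able to do:
cat <file> | sort -k 2,2 > <file>
I haven't tried that route, though. The shell truncates <file> before cat gets to read it, so this will most likely lose the data; sort -k 2,2 -o <file> <file> is the safe way to write the result back to the same file.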
* http://www.ss64.com/bash/sort.html - sort man page

One way to do this is not to keep the file sorted, but to use a separate index, using Berkeley DB. Each record in the db has the sort keys and the offset into the main file. The advantage of this is that you can have multiple ways of sorting without duplicating the text file. You can also change lines without rewriting the file, by appending the changed line at the end and updating the index to ignore the old line and point to the new one. We used this successfully for multi-GB text files that we had to make many small changes to.
Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files under source/IO.

Try a modified Bucket Sort. Assuming the id values lend themselves well to it, you'll get a much more efficient sorting algorithm. You may be able to enhance I/O efficiency by actually writing out the buckets (use small ones) as you scan, thus potentially reducing the amount of randomized file I/O you need. Or not.

Hopefully, there are some good code examples on how to insert a record based on line number into the destination file.
You can't insert contents into a middle of the file (i.e., without overwriting what was previously there); I'm not aware of production-level filesystems that support it.

I think the question is more about implementation rather than specific algorithms, specifically, handling very large datasets.
Suppose the source file has 2^32 lines of data. What would be an efficient way to sort the data?
Here's how I'd do it:
1. Parse the source file and extract the following information: sort key, offset of line in file, length of line. This information is written to another file. This produces a dataset of fixed-size elements that is easy to index; call it the index file.
2. Use a modified merge sort. Recursively divide the index file until the number of elements to sort has reached some minimum amount. True merge sort recurses to 1 or 0 elements; I suggest stopping at 1024 or something, and this will need fine tuning. Load the block of data from the index file into memory, perform a quicksort on it, and then write the data back to disk.
3. Perform the merge on the index file. This is tricky, but can be done like this: load a block of data from each source (1024 entries, say). Merge into a temporary output file and write. When a block is emptied, refill it. When no more source data is found, read the temporary file from the start and overwrite the two parts being merged (they should be adjacent). Obviously, the final merge doesn't need to copy the data (or even create a temporary file). Thinking about this step, it is probably possible to set up a naming convention for the merged index files so that the data doesn't need to overwrite the unmerged data (if you see what I mean).
4. Read the sorted index file, pull the corresponding line of data out of the source file, and write it to the result file.
It certainly won't be quick with all that file reading and writing, but it should be quite efficient. The real killer is the random seeking of the source file in the final step. Up to that point, the disk access is usually linear and should therefore be reasonably efficient.
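To show the shape of the index approach, here is a deliberately simplified sketch: the index of (key, offset, length) entries is kept in memory and sorted with std::stable_sort instead of the on-disk merge sort described above, so it only scales as far as the index itself fits in RAM. It assumes the line layout from the question ("id mm/dd/yyyy data"), '\n' line endings, and newest-first order as in the desired destination example; IndexEntry, dateKey and sortByIndex are illustrative names.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;        // date rearranged as yyyymmdd so that strings compare by date
    std::streamoff offset;  // where the line starts in the source file
    std::size_t length;     // line length without the trailing '\n'
};

// "1 10/01/2008 line1data" -> "20081001" (empty string if the line is malformed).
std::string dateKey(const std::string& line) {
    std::istringstream fields(line);
    std::string id, date;
    if (!(fields >> id >> date) || date.size() != 10) return "";
    return date.substr(6, 4) + date.substr(0, 2) + date.substr(3, 2);
}

bool sortByIndex(const std::string& srcPath, const std::string& dstPath) {
    std::ifstream src(srcPath.c_str(), std::ios::binary);
    std::ofstream dst(dstPath.c_str(), std::ios::binary | std::ios::trunc);
    if (!src || !dst) return false;

    // Build the index: sort key, offset of line in file, length of line.
    std::vector<IndexEntry> index;
    std::string line;
    std::streamoff offset = 0;
    while (std::getline(src, line)) {
        IndexEntry e = {dateKey(line), offset, line.size()};
        index.push_back(e);
        offset += static_cast<std::streamoff>(line.size()) + 1;  // +1 for '\n'
    }

    // Simplification: sort the index in memory, newest date first.
    std::stable_sort(index.begin(), index.end(),
                     [](const IndexEntry& a, const IndexEntry& b) { return a.key > b.key; });

    // Pull each line out of the source in index order (this is the random seeking).
    src.clear();
    std::string buf;
    for (const IndexEntry& e : index) {
        buf.resize(e.length);
        src.seekg(e.offset);
        if (e.length > 0) src.read(&buf[0], static_cast<std::streamsize>(e.length));
        dst.write(buf.data(), static_cast<std::streamsize>(buf.size()));
        dst.put('\n');
    }
    return src.good() && dst.good();
}
```

Only the small index entries are moved around during sorting; each source line is read exactly once, in the final step.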
