Unexpected "padding" in a Fortran unformatted file

I don't understand the format of unformatted files in Fortran.
For example:
open (3,file=filename,form="unformatted",access="sequential")
write(3) matrix(i,:)
outputs a column of a matrix into a file. I've discovered that it pads the data with 4 bytes on either end, but I don't really understand why, or how to control this behavior. Is there a way to remove the padding?

For unformatted IO, Fortran compilers typically write the length of the record at the beginning and end of the record. Most but not all compilers use four bytes. This aids in reading records; for example, the length at the end assists with a backspace operation. You can suppress this with the new stream IO mode of Fortran 2003, which was added for compatibility with other languages. Use access='stream' in your open statement.
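If you do need to consume such a file from another language, the markers are easy to handle by hand. Below is a minimal C++ sketch that reads one record of a sequentially written unformatted file; it assumes 4-byte markers in native byte order, and the file name matrix.unf is only a placeholder:
// Minimal sketch: read one record of a sequential unformatted file
// from C++. Assumes 4-byte record markers and native endianness;
// "matrix.unf" is a placeholder name.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::ifstream in("matrix.unf", std::ios::binary);
    if (!in) return 1;
    std::int32_t len = 0;                          // leading record marker
    in.read(reinterpret_cast<char*>(&len), 4);
    std::vector<char> payload(len);                // the actual data
    in.read(payload.data(), len);
    std::int32_t trailer = 0;                      // trailing record marker
    in.read(reinterpret_cast<char*>(&trailer), 4);
    std::cout << "record length: " << len << " (trailer " << trailer << ")\n";
}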

I have never used sequential access with unformatted output for this exact reason. However, it depends on the application, and sometimes it is convenient to have a record length indicator (especially for unstructured data). As suggested by steabert in "Looking at binary output from fortran on gnuplot", you can avoid this by using the keyword argument ACCESS='DIRECT', in which case you need to specify the record length. This method is convenient for efficient storage of large multi-dimensional structured data (constant record length). The following example writes an unformatted file whose size equals the size of the array:
REAL(KIND=4),DIMENSION(10) :: a = 3.141
INTEGER :: reclen
INQUIRE(iolength=reclen)a
OPEN(UNIT=10,FILE='direct.out',FORM='UNFORMATTED',&
ACCESS='DIRECT',RECL=reclen)
WRITE(UNIT=10,REC=1)a
CLOSE(UNIT=10)
END
Note that this is not the ideal approach in terms of portability. In an unformatted file written with direct access, there is no information about the size of each element. A plain-text readme that describes the data size does the job fine for me, and I prefer this method to the padding of sequential mode.
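Since the direct-access file above contains nothing but the raw array (ten 4-byte reals, i.e. 40 bytes), it can be read back from another language with a plain binary read. A small C++ sketch, assuming native byte order:
// Sketch: read the ten REAL(KIND=4) values written by the Fortran
// direct-access example above. Assumes the file holds just the raw
// array, with no record markers, in native byte order.
#include <fstream>
#include <iostream>

int main() {
    std::ifstream in("direct.out", std::ios::binary);
    float a[10] = {};
    in.read(reinterpret_cast<char*>(a), sizeof a);   // 40 bytes
    std::cout << a[0] << " ... " << a[9] << "\n";    // 3.141 ... 3.141
}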

Fortran I/O is record based, not stream based. Every time you write something through write() you are writing not only the data but also a begin marker and an end marker for that record. Both record markers hold the size of that record. This is why writing a bunch of reals in a single write (one record: one begin marker, the reals, one end marker) produces a different file size than writing each real in a separate write (multiple records, each with a begin marker, one real, and an end marker). For example, with 4-byte markers, writing 1000 default reals in one record takes 4008 bytes, while writing them one per record takes 12000 bytes. This is extremely important if you are writing large matrices, as you can balloon the file size if the writes are structured poorly.

I am quite familiar with the differing unformatted output of the Intel and GNU compilers. Fortunately my long experience, dating back to 1970s IBM systems, allowed me to decode things. GNU pads records with 4-byte integer counters giving the record length. Intel uses a 1-byte counter and a number of embedded coding values to signify a continuation record or the end of a count. One can still have very long records even though only 1 byte is used.
I have software compiled by the GNU compiler that I had to modify so it could read an unformatted file generated by either compiler, so it has to detect which format it finds. Reading an unformatted file generated by the Intel compiler (which follows the "old" IBM convention) takes "forever" using GNU's fgetc or opening the file in stream mode. Converting the file to what GNU expects makes it up to 100 times faster. Whether the detection and conversion are worth the bother depends on your file size. I reduced my program's startup time (it opens a large unformatted file) from 5 minutes down to 10 seconds. I had to add options to convert back again if the user wants to take the file back to an Intel-compiled program. It's all a pain, but there you go.

Related

Parallel bzip2 decompression scan may fail?

Parallel bzip2 compression of a byte stream can be done easily with a FIFO queue, where every chunk is processed as a parallel task and streamed into a file.
Parallel decompression, on the other hand, is not so easy, because everything is bit-aligned and the exact bit length of a block is only known after it has been decompressed.
As far as I can see, parallel decompression implementations use the magic numbers for block start and stream end and perform a bit scan. Isn't there a small chance that one of the streams contains such a magic value by coincidence?
Possible block validations:
4 Bytes CRC
6 Bytes "compressed magic"
6 Bytes "end of stream magic"
some bit combinations for the huffman trees are not allowed
max. x Bytes of huffman stream (range to search for next magic)
Per file:
4 Bytes File CRC
padding at the end
I could implement such a scan by just bit-shifting through the stream until I find a magic number. But then, when I read block N and it fails, I should (or maybe should not) take into account that it was a false positive. For a parallel implementation I could then stop all tasks for blocks N, N+1, N+2, ..., try to find the next signature, and go on. That makes everything very complicated, and I don't know if it's worth the effort. I guess maybe not, but is there a chance that a parallel bzip2 implementation fails?
I'm also wondering why a file format uses magic numbers as markers but doesn't include jump hints. I guess the magic numbers are important for filesystem recovery, but still, why can't a block contain, e.g., 16 bits telling how far to jump to the next block?
Yes, the source code you linked notes that the magic 48-bit value can show up in compressed data by chance. It also notes the probability, around 10^-14 (actually 2^-48, closer to 3.55x10^-15). That probability applies at every bit position, so on average one will occur in every 32 terabytes of compressed data. That's about one month of run time on one core on my machine. Not all that long. In a production environment, you should assume that it will happen. Because it will.
Also as noted in the source you linked, due to the possibility of a false positive, you need to then validate the remainder of the block. You would not stop the subsequent possible block processing, since it is extremely likely that they are all valid blocks. Just validate all and keep the validated ones. Verify when combining that the valid blocks exactly covered the input, with no overlaps. A properly implemented parallel bzip2 decompressor will always work on valid bzip2 streams.
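The candidate scan itself is simple: slide a 48-bit window over the data one bit at a time and compare it with the block-header magic 0x314159265359. A rough C++ sketch of such a scanner (the function name is made up); every hit still has to be validated as described above:
// Sketch: find candidate bzip2 block starts by sliding a 48-bit
// window over the data one bit at a time. Hits are only candidates
// and must still be validated (CRC, Huffman tables) because of the
// false-positive chance discussed above.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::size_t> find_candidates(const std::uint8_t* data, std::size_t n) {
    const std::uint64_t kBlockMagic = 0x314159265359ULL;  // 48-bit block-header magic
    const std::uint64_t kMask = (1ULL << 48) - 1;
    std::vector<std::size_t> hits;                         // bit offsets of candidates
    std::uint64_t window = 0;
    for (std::size_t bit = 0; bit < n * 8; ++bit) {
        int b = (data[bit / 8] >> (7 - bit % 8)) & 1;      // bzip2 packs bits MSB-first
        window = ((window << 1) | b) & kMask;
        if (bit >= 47 && window == kBlockMagic)
            hits.push_back(bit - 47);                      // offset of the magic's first bit
    }
    return hits;
}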
It would need to be more than 16 bits, but yes, in principle a block could have contained the offset to the next block, since it already contains a CRC at the start of the block. Julian did consider that in the revision of bzip2, but decided against it:
bzip2-1.0.X, 0.9.5 and 0.9.0 use exactly the same file format as the
original version, bzip2-0.1. This decision was made in the interests
of stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
...
The compressed file format was never designed to be handled by a
library, and I have had to jump though some hoops to produce an
efficient implementation of decompression. It's a bit hairy. Try
passing decompress.c through the C preprocessor and you'll see what I
mean. Much of this complexity could have been avoided if the
compressed size of each block of data was recorded in the data stream.

Is a seek/sync required to switch between reading & writing an fstream?

When using a std::fstream you can read and write to the file. I'm wondering if there is any part of the C++ standard stating the requirements for switching between read & write operations.
What I found is:
The restrictions on reading and writing a sequence controlled by an
object of class basic_filebuf<charT, traits> are the same as for
reading and writing with the C standard library FILEs.
In particular:
— If the file is not open for reading the input sequence cannot be read.
— If the file is not open for writing the output sequence cannot be written.
— A joint file position is maintained for both the input sequence and the output sequence
So going by the bullet points alone: if I open a std::fstream, read 5 bytes, then write 2 bytes, the new bytes will end up at positions 6 and 7. The first two conditions are met, and due to the "joint file position" the read advances the write position by 5 bytes.
However, in practice intermediate buffers are used instead of calling fread/fwrite directly for every single byte. Hence the 5-byte read could pull 512 bytes from the underlying FILE, and if that is not accounted for during the write, the written bytes will end up at position 513, not 6, in the above example.
And indeed libstdc++ (GCC) does correct the offsets: https://github.com/gcc-mirror/gcc/blob/13ea4a6e830da1f245136601e636dec62e74d1a7/libstdc%2B%2B-v3/include/bits/fstream.tcc#L554
I haven't found anything similar in libc++ (Clang) or MS STL, although MS "cheats" by using the buffers from the FILE* inside the filebuf, so it might "work" there.
Anyway: For the C functions I found that switching between read & write requires a seek or flush (after write): https://en.cppreference.com/w/c/io/fopen
So can the wording "The restrictions [...] of class basic_filebuf<charT, traits> are the same as [...] with the C standard library" also be applied here, meaning that a seekg, seekp, tellg, tellp or sync is required to switch between operations?
Context: I maintain a library providing an (enhanced) filebuf class. Not checking whether the FILE* position needs to be adjusted for buffering when a read/write switch happens without an intervening seek/sync would save a potentially expensive conditional, but I don't want to deviate from the standard's requirements and break users expecting a drop-in replacement.
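For reference, the conservative pattern that is always safe under the C rules linked above inserts an explicit positioning call between the two directions. A sketch only; the file name demo.txt and the sizes are made up:
// Sketch of a conservative read->write switch on a std::fstream:
// an explicit seek between the two directions, as the C rules for
// FILE* require. File name and sizes are illustrative only.
#include <fstream>

int main() {
    std::fstream f("demo.txt", std::ios::in | std::ios::out | std::ios::binary);
    char buf[5];
    f.read(buf, 5);               // read 5 bytes

    f.seekp(f.tellg());           // explicit positioning before switching to output
    f.write("ab", 2);             // lands right after the 5 bytes just read

    f.seekg(f.tellp());           // and again before switching back to input
    f.read(buf, 1);
}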

Can a file end with a 0x80 char?

I'm implementing my own version of a Blowfish encoder/decoder. I use the standard padding of 0x80 if necessary.
My question is whether I need to add a padding char even when I don't need one, because in the case of a file that ends naturally with 0x80, the decoding step will remove this character, and in that case it is the wrong action, since the 0x80 is part of the file itself.
This can of course be solved by adding a final char even if the total number of characters is a multiple of the encoding block (64 bits in this case). I can implement this countermeasure, but first I'd prefer to know whether I really need it.
The natural follow-up is to wonder whether this character was chosen because it never occurs at the end of a file (so the wrong situation above never happens), but I'm not sure about that at all.
Thanks, and sorry for the dumb question.
On Linux and other filesystems a file can contain any sequence of bytes, so ideally you cannot depend on any particular byte value to mark the end of a file (even though EOF exists as a concept).
What I am suggesting is what most file formats do: have a specific 4-5 byte magic header for your file format, followed by the size of the remaining bytes. Then you know exactly where the last byte is.
Edit:
With the above suggestion, the encoder needs to update the stored size whenever it appends new data to the file.
If you don't want that, you can encode your data in particular chunks and handle them packet by packet, so the file is just a sequence of packets. Such an approach is used in NAL units.
Blowfish is a block cipher. It always takes 64-bit input and produces 64-bit output. If you want to encrypt a stream that is not a multiple of 64 bits long you will need to add some padding bytes. When you decrypt the encrypted stream you always get a multiple of 64 bits, but you have no information about whether the encrypted stream contained 'real' data or padding bytes. You need to keep track of that yourself. A simple approach would be to store the pair of 'data length' and 'encrypted stream'. Another approach would be to prepend the clear-text stream with a data length value, for example a 64-bit unsigned integer. Then after decrypting the encrypted stream you will have that length as the first value, and then you know how many bytes of the last block are real data and how many are just padding.
And regarding your question about what bytes can be at the end of a file: any. You can have files with arbitrary content. Each byte in the file can be of any value, there is no restriction.
A regular binary file can contain any byte sequence, so a file can end with 0x80, with NULL, or with anything else.
If you are talking about some specific standard, then it depends. However, I don't think there is any file type that cannot contain some specific character at the end. I know of file types that ignore as many trailing characters as are not needed (because the header determines the size), so you could do the same, but I've never heard of byte values being illegal at the end of a file (except in corrupted files).
So, as mentioned, use a header: reserve, for example, 8 bytes that determine the size. That is the easy solution.
Also, before asking such a question, ask yourself why a file should have to end with some special character.
The answer is Yes. On every operating system in current use, a file can end with any possible sequence of bytes. In fact, you should generate such a file to test your implementation.
In the general case you cannot recognise trailing padding characters or remove them reliably without knowing the length of the file. Therefore encoding the length of the file must be part of your cryptographic protocol.
Simply put the length of the file at the beginning and encrypt the whole thing, including any padding bytes you like (random is probably best). Once decrypted, you will have the file length to tell you where to truncate.
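A minimal sketch of that length-prefix framing, with the Blowfish encryption/decryption step itself left out; all names here are illustrative:
// Sketch: frame the plaintext with a 64-bit length prefix and pad it
// to a multiple of the 8-byte Blowfish block size. Only the framing
// logic is shown; the cipher itself is not. Names are illustrative.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

std::vector<std::uint8_t> frame(const std::vector<std::uint8_t>& data) {
    std::uint64_t len = data.size();
    std::vector<std::uint8_t> out(sizeof len);
    std::memcpy(out.data(), &len, sizeof len);      // 8-byte length prefix
    out.insert(out.end(), data.begin(), data.end());
    while (out.size() % 8 != 0)                     // pad to full 64-bit blocks
        out.push_back(0);                           // the padding value does not matter
    return out;                                     // encrypt this, block by block
}

std::vector<std::uint8_t> unframe(const std::vector<std::uint8_t>& decrypted) {
    std::uint64_t len = 0;
    std::memcpy(&len, decrypted.data(), sizeof len);  // recover the original length
    auto first = decrypted.begin() + sizeof len;      // skip the 8-byte prefix
    return std::vector<std::uint8_t>(first, first + static_cast<std::ptrdiff_t>(len));
}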

What's the difference between read() and getc()

I have two code segments:
while((n=read(0,buf,BUFFSIZE))>0)
if(write(1,buf,n)!=n)
err_sys("write error");
while((c=getc(stdin))!=EOF)
if(putc(c,stdout)==EOF)
err_sys("write error");
Some things I've read on the internet confuse me. I know that standard I/O does buffering automatically, but I pass a buf to read(), so read() is also doing buffering, right? And it seems that getc() reads data char by char; how much data will the buffer hold before sending all the data out?
Thanks
While both functions can be used to read from a file, they are very different. First of all, on many systems read is a lower-level function, and may even be a system call directly into the OS. The read function also isn't standard C or C++; it's part of, e.g., POSIX. It can also read arbitrarily sized blocks, not only one byte at a time. There's no buffering (except maybe at the OS/kernel level), and it doesn't distinguish between "binary" and "text" data. On POSIX systems, where read is a system call, it can be used to read from all kinds of devices, not only files.
The getc function is a higher-level function. It usually uses buffered input (input is read in blocks into a buffer, often by using read, and getc returns characters from that buffer). It also returns only a single character at a time. It's part of the C and C++ specifications as part of the standard library. There may also be conversions between the data read and the data returned by the function, depending on whether the file was opened in text or binary mode.
Another difference is that read is always a function, while getc might be implemented as a preprocessor macro.
Comparing read and getc doesn't really make much sense; it would make more sense to compare read with fread.
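To illustrate that last point, the buffered counterpart of the read()/write() loop in the question uses fread/fwrite on whole chunks rather than getc/putc on single characters; a sketch, with an arbitrary chunk size:
// Buffered counterpart of the read()/write() loop: fread/fwrite copy
// stdin to stdout in chunks, with stdio doing its own buffering on top.
#include <cstdio>

int main() {
    char buf[4096];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, stdin)) > 0) {
        if (std::fwrite(buf, 1, n, stdout) != n) {
            std::perror("write error");
            return 1;
        }
    }
}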

Most efficient to read file into separate variables using fstream

I have tons of files which look a little like:
12-3-125-BINARYDATA
What would be the most efficient way to save the 12, 3 and 125 as separate integer variables, and the BINARYDATA as a char-vector?
I'd really like to use fstream, but I don't exactly know how to (got it working with std::strings, but the BINARYDATA part was all messed up).
The most efficient method for reading data is to read many "chunks", or records, into memory using the fewest I/O function calls, and then parse the data in memory.
For example, reading 5 records with one fread call is more efficient than 5 calls to fread that each read one record. Accessing memory is always faster than accessing external data such as files.
Some platforms have the ability to memory-map a file. This may be more efficient than reading the file using I/O functions. Profiling will determine which is most efficient.
Fixed-length records are always more efficient than variable-length records. Variable-length records involve either reading until a fixed size has been read or reading until a terminal (sentinel) value is found. For example, a text line is a variable-length record and must be read one byte at a time until the terminating end-of-line marker is found. Buffering may help in this case.
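Applied to the format in the question, a possible fstream sketch looks like this, assuming the three numbers are plain ASCII separated by '-' and that the binary blob simply runs to the end of the file (the file name input.dat is a placeholder):
// Sketch for the "12-3-125-BINARYDATA" layout: parse three ASCII
// integers separated by '-', then read the rest of the file as raw
// bytes. Assumes the binary part runs to the end of the file.
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream in("input.dat", std::ios::binary);
    int a = 0, b = 0, c = 0;
    char sep;
    in >> a >> sep >> b >> sep >> c >> sep;   // consumes "12-3-125-"

    std::istreambuf_iterator<char> first(in), last;
    std::vector<char> blob(first, last);      // BINARYDATA, byte for byte
}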
What do you mean by binary data? Is it '0' and '1' characters stored char by char, or "real" binary data? If it is real binary data, just read the file as a binary file: read 2 bytes for the first int, then 1 byte for the '-', 1 byte for the 3, and so on, until you reach the first position of the binary data; then get the file length and read all of the rest.