Converting / Flattening RMS indexed files from OpenVMS - Fortran

I am attempting to convert some indexed files created on OpenVMS to plain flat sequential files for use on Windows or Linux.
Each indexed file contains some number (x) of POD structures (2594 bytes each).
I converted the files using a simple program such as this:
      PROGRAM MAKE_FLAT
      BYTE byte_array(2594)
      INTEGER FILE_IN, FILE_OUT
      PARAMETER (FILE_IN = 1, FILE_OUT = 2)
C     Open the RMS indexed input file for sequential reading
      OPEN(UNIT=FILE_IN, FORM='UNFORMATTED',
     1     FILE='input.data',
     1     ORGANIZATION='INDEXED',
     1     ACCESS='SEQUENTIAL',
     1     KEY=(1:8:INTEGER), RECL=649)
C     Open the plain sequential output file
      OPEN(UNIT=FILE_OUT, FORM='UNFORMATTED',
     1     FILE='output.data')
C     Copy every record until end of file
      DO WHILE (.TRUE.)
          READ(FILE_IN, END=999) byte_array
          WRITE(FILE_OUT) byte_array
      END DO
 999  CONTINUE
      CLOSE(FILE_IN)
      CLOSE(FILE_OUT)
      END
If there are 1000 records in the file, I would expect an output file of roughly 1000 * 2594 bytes, but instead the result was roughly 1000 * 2044 bytes, as shown by:
DIR/FULL output.data
Why is the program writing fewer bytes per record? Did I do something wrong?
Using the built-in OpenVMS utilities gives me the expected flat file:
ANAL/RMS/FDL=FILE.FDL input.data
EDIT/FDL/ANALY=FILE.FDL FILE.FDL
After changing the organization from 'INDEXED' to 'SEQUENTIAL' and contiguous to 'YES', the following command gives me a flat file of the correct size (including the padding in each record):
CONVERT/FDL=FILE.FDL input.data output.data

If you do not really need to do this in a program, just use CONVERT:
$ CONVERT/FDL=FIXED.FDL IN-FILE OUT-FILE
You can use $ EDIT/FDL FIXED.FDL and follow the prompts to describe a sequential file.
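For reference, the FDL describing the target file can be quite small. A minimal sketch for fixed-length 2594-byte sequential records might look like this (treat it as an illustration; the full FDL produced by EDIT/FDL or ANALYZE/RMS_FILE will contain more attributes):
FILE
        ORGANIZATION            sequential

RECORD
        FORMAT                  fixed
        SIZE                    2594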

2044 looks like the maximum record size that Fortran on VMS uses when writing the data. If the file size really is 1000*2044, something is wrong.
What is the output of DUMP/HEADER/BLOCKS=COUNT=0 FOR002.DAT in the lines 'Record size', 'End of file block' and 'End of file byte'?
I would expect the 2594 bytes to be written as two records. Given the two bytes of flags in each record, you will see records with lengths 2044 and 554. (You can confirm this with DUMP/RECORD FOR002.DAT/PAGE.) Each record also has a two-byte record-length field, so you should end up with a file size of 1000*(2044+2+554+2) = 2602000 bytes.
You can double-check that with the 'End of file' data from the first DUMP command: (End of file block - 1)*512 + End of file byte.
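For example, if the output file really is 2602000 bytes long, that DUMP/HEADER output should show an 'End of file block' of 5083 and an 'End of file byte' of 16, since (5083 - 1)*512 + 16 = 2602000. (Those two values are only an illustration of the formula, not taken from an actual dump.)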

Related

Problem with writing .dat files using Fortran

I am reading a .txt file and writing to a .dat file for use in GrADS.
The .txt file contains 1D data, and my program reads all 123 lines correctly, as I checked by printing them to the screen. However, the resulting .dat file contains only the very first line of data; the rest are all zeros. How can I fix this? Did I set the dimensions in the write statement incorrectly?
Here is my code:
program convert
  integer, parameter :: nlon=1, nlat=1, nz=123
  real, dimension(nlat,nlon,nz) :: pr
  integer :: ilon, ilat, iz

  open(2, file='2020082100_1.txt', form='formatted', status='old')
  INQUIRE(IOLENGTH=LREC) pr
  open(3, file='2020082100_1.dat', form='unformatted', &
       access='direct', recl=LREC)

  do iz = 1, nz
    do ilat = nlat, 1, -1
      read(2,*) (pr(ilat,ilon,iz), ilon=1,1)
    end do
  end do

  IREC = 1
  do iz = 1, nz
    write(3,rec=irec) ((pr(ilat,ilon,iz), ilon=1,nlon), ilat=1,nlat)
    IREC = IREC + 1
  end do

  close(2)
  close(3)
end program convert
For example, the first few lines of the .txt file are:
30.7
29.4
25.9
24.2
24.4
...
However, the .dat file contains this:
30.7
0
0
0
0
...
After tinkering with the code for a few hours, I figured out how to fix it myself.
Since the data was being written along the wrong axis, I only needed to change a single line of code,
from this: write(3,rec=irec) ((pr(ilat,ilon,iz),ilon=1,nlon),ilat=1,nlat)
to this: write(3,rec=iz) ((pr(iz,ilon,ilat),ilon=1,nlon),ilat=1,nlat)
With this change it works completely. I did indeed set the dimensions in the write statement incorrectly.

CTF Reader throwing error for big files in CNTK

I am using a CTF reader function following the CNTK tutorials on GitHub.
from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs, \
    INFINITELY_REPEAT, FULL_DATA_SWEEP

def create_reader(path, is_training, input_dim, label_dim):
    return MinibatchSource(CTFDeserializer(path, StreamDefs(
        features = StreamDef(field='x', shape=input_dim, is_sparse=True),
        labels = StreamDef(field='y', shape=label_dim, is_sparse=False)
    )), randomize=is_training, epoch_size=INFINITELY_REPEAT if is_training else FULL_DATA_SWEEP)
This works completely fine except when the input file is bigger than a certain (unknown) size. Then it throws an error like this:
WARNING: Sparse index value (269) at offset 8923303 in the input file (C:\local\CNTK-2-0-beta6-0-Windows-64bit-CPU-Only\cntk\Examples\common\data_pos_train_balanced_ctf.txt) exceeds the maximum expected value (268).
attempt: Reached the maximum number of allowed errors while reading the input file (C:\local\CNTK-2-0-beta6-0-Windows-64bit-CPU-Only\cntk\Examples\common\data_pos_train_balanced_ctf.txt)., retrying 2-th time out of 5...
.
.
.
RuntimeError: Reached the maximum number of allowed errors while reading the input file (C:\local\CNTK-2-0-beta6-0-Windows-64bit-CPU-Only\cntk\Examples\common\data_pos_train_balanced_ctf.txt).
I identified that this kind of error is thrown in the file TextParser.cpp:
https://github.com/Microsoft/CNTK/blob/5633e79febe1dc5147149af9190ad1944742328a/Source/Readers/CNTKTextFormatReader/TextParser.cpp
What is the solution or workaround for this?
You need to know the dimensionality of your input, and also that indices start from 0. So if you created an input file mapping your vocabulary to the range 1 to 20000, the dimensionality is 20001.
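Applied to the warning above: the reader complains that a sparse index of 269 exceeds the maximum expected value of 268, i.e. the declared feature dimension is 269 (valid indices 0 to 268) while the file contains index 269, so the dimension passed to create_reader would have to be at least 270 for that file.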

Opening multiple files in Fortran 90

I would like to open 10,000 files with file names running from abc25000 to abc35000 and copy some information into each file. The code I have written is below:
PROGRAM puppy
  IMPLICIT NONE
  integer :: i
  CHARACTER(len=3) :: n1
  CHARACTER(len=5) :: cnum
  CHARACTER(len=8) :: n2

  loop1: do i = 25000, 35000   !in one frame
    n1 = 'abc'
    write(cnum,'(i5)') i
    n2 = n1//cnum
    print*, n2
    open(unit=i, file=n2)
  enddo loop1
end
This code is supposed to generate files from abc25000 up to abc35000, but it stops about halfway through, saying:
At line 17 of file test-openFile.f90 (unit = 26021, file = '')
Fortran runtime error: Too many open files
What do I need to do to fix the above code?
This limit is set by your OS. If you're using a Unix/Linux variant, you can check the limit from the command line with ulimit -n and raise it with ulimit -n 16384. You'll need to set a limit greater than 10000 to allow for all the other files that the shell will have open. You may also need admin privileges to do this.
I regularly bump the limit up to 2048 to run Fortran programs, but never as high as 10000. However, I echo the other answers: if possible, it's better to restructure your program to close each file before opening the next.
You need to work on the files one at a time (or in small groups that do not exceed the limit imposed by the operating system), as in the outline below (a Fortran sketch follows it):
for each file:
    open file
    write
    close file
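A minimal sketch of that structure, reusing a single unit number for every file (the unit number and the text written are illustrative, not from the original code):
PROGRAM puppy_one_at_a_time
  IMPLICIT NONE
  integer :: i
  integer, parameter :: u = 20        ! one unit, reused for every file
  CHARACTER(len=8) :: n2

  do i = 25000, 35000
    write(n2,'(a3,i5)') 'abc', i      ! build the file name, e.g. abc25000
    open(unit=u, file=n2)
    write(u,*) 'some information'     ! whatever needs to go into each file
    close(u)                          ! close before opening the next one
  end do
end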
Operating systems tend to have limits on resources. On Linux, for instance, there is by default a limit of 1024 file descriptors per process. The error message you're getting is just the Fortran runtime library passing on the information that it was unable to open yet another file because of an OS error.

converting binary to text in linux

I have a big binary file that I produced by writing an array of float numbers in binary format.
Now how can I simply convert that binary file to text?
Use the UNIX od command with the -t f4 option to read the file as 4-byte floating-point values. The -A n option is also useful to suppress the printing of file offsets. Here is the output for an example file that I created:
/tmp> od -A n -t f4 b.dump
-999.876 -998.876 -997.876 -996.876
-995.876 -994.876 -993.876 -992.876
-991.876 -990.876 -989.876 -988.876
-987.876 -986.876 -985.876 -984.876
You will need to reverse the process:
Read the file back into an array of floats.
Print the array using printf() or your favorite I/O function.
Any other approach will be ugly and painful; not to say this isn't ugly to start with.
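A minimal Fortran sketch of that reverse step, assuming the file is a raw stream of 4-byte reals with no record markers and is named floats.bin (both assumptions; adjust to match how the file was actually written):
program bin2txt
  implicit none
  integer :: ios
  real :: x   ! default real; 4 bytes on the usual compilers, matching od -t f4

  ! stream access reads the raw bytes with no record structure
  open(unit=10, file='floats.bin', form='unformatted', access='stream', &
       status='old', action='read')

  do
    read(10, iostat=ios) x      ! one 4-byte float at a time
    if (ios /= 0) exit          ! stop at end of file (or on error)
    print *, x                  ! write it out as text
  end do

  close(10)
end program bin2txt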

Markov C++ read from file performance

I have my second assignment for a C++ class, which involves Markov chains. The assignment is simple, but I'm not able to figure out the best implementation for reading chars from a file.
I have a file of around 300k. One of the rules for the assignment is to use the Map and Vector classes; the Map keys are strings and the values are Vectors. While reading from the file, I need to start collecting key pairs.
Example:
File1.txt
1234567890
1234567890
If I select Markov level k=3, I should have in my Map:
key vector
123 -> 4
456 -> 7
789 -> 0
0/n1 -> 2
234 -> 5
567 -> 8
890 -> /n
/n -> NULL
The professor's suggestion is to read char by char, so my algorithm is the following:
while (readchar != EOF) {
    tempstring += readchar
    increment index
    if index == Markovlevel {
        get nextchar if != EOF
        insert nextchar value in vector
        insert tempstring into Map and assign vector
        unget char
    }
}
I omit some other details. My main concern is that with 318,000 characters I will be evaluating the conditional every time, which slows things down a lot (on a brand new Mac Pro). A sample program from the professor processes this file in around 5 seconds.
I'm not able to figure out the best method to read fixed-length words from a text file in C++.
Thanks!
Repeated file reads will slow down the program.
Read the file in blocks of, say, 1024 bytes into a buffer, then process that buffer as the assignment requires. Repeat with the next block until you are done with the file.
Have you actually timed the program? 318,000 conditionals should be a piece of cake for your brand new Mac Pro; that should take only microseconds.
Premature optimization is the root of all evil. Make your program work first; optimization comes second.