CTF Reader throwing error for big files in CNTK - c++

I am using a CTF reader function following the CNTK tutorials on Github.
from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs, INFINITELY_REPEAT, FULL_DATA_SWEEP

def create_reader(path, is_training, input_dim, label_dim):
    return MinibatchSource(CTFDeserializer(path, StreamDefs(
        features = StreamDef(field='x', shape=input_dim, is_sparse=True),
        labels   = StreamDef(field='y', shape=label_dim, is_sparse=False)
    )), randomize=is_training,
        epoch_size=INFINITELY_REPEAT if is_training else FULL_DATA_SWEEP)
This works completely fine except when the input file is bigger than a certain (unknown) size. Then it throws an error like this:
WARNING: Sparse index value (269) at offset 8923303 in the input file (C:\local\CNTK-2-0-beta6-0-Windows-64bit-CPU-Only\cntk\Examples\common\data_pos_train_balanced_ctf.txt) exceeds the maximum expected value (268).
attempt: Reached the maximum number of allowed errors while reading the input file (C:\local\CNTK-2-0-beta6-0-Windows-64bit-CPU-Only\cntk\Examples\common\data_pos_train_balanced_ctf.txt)., retrying 2-th time out of 5...
.
.
.
RuntimeError: Reached the maximum number of allowed errors while reading the input file (C:\local\CNTK-2-0-beta6-0-Windows-64bit-CPU-Only\cntk\Examples\common\data_pos_train_balanced_ctf.txt).
I identified that this kind of error is being thrown in the file TextParser.cpp
https://github.com/Microsoft/CNTK/blob/5633e79febe1dc5147149af9190ad1944742328a/Source/Readers/CNTKTextFormatReader/TextParser.cpp
What is the solution to or work-around for this?

You need to know the dimensionality of your input, and you need to know that indices start from 0. So if you created an input file mapping your vocabulary to the range 1 to 20000, the dimensionality is 20001. In other words, the shape you pass for the features stream must be at least the largest index that appears in the file plus one.
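If you are not sure what the largest index actually present in your file is, you can scan the file once and set the dimension to that value plus one. The following is only a rough sketch in C (the default file name and the |x stream name are assumptions taken from the snippet and error message above, not anything shipped with CNTK):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* Assumed default path; pass the real CTF file as the first argument. */
    const char *path = argc > 1 ? argv[1] : "data_pos_train_balanced_ctf.txt";
    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); return 1; }

    char line[65536];          /* assumes one sample per line and lines that fit */
    long max_index = -1;

    while (fgets(line, sizeof line, f)) {
        int in_x = 0;          /* are we inside the |x stream? */
        char *tok = strtok(line, " \t\r\n");
        for (; tok != NULL; tok = strtok(NULL, " \t\r\n")) {
            if (tok[0] == '|') {                    /* stream marker such as |x or |y */
                in_x = (strcmp(tok, "|x") == 0);
            } else if (in_x && strchr(tok, ':')) {  /* sparse entry "index:value" */
                long idx = strtol(tok, NULL, 10);
                if (idx > max_index) max_index = idx;
            }
        }
    }
    fclose(f);

    printf("largest index in |x: %ld => shape must be at least %ld\n",
           max_index, max_index + 1);
    return 0;
}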

Related

How to determine file size in Fortran 77

I have a Fortran program that needs to read ASCII files; however, the list of files sometimes includes a file of size 0. The program then crashes when trying to read this file. I have not found any way so far that allows me to flag such a file.
I have the following READ statement in my code:
read(10,220,END=320,ERR=195)parm(1:)
Although I expect the code to go to statement 195 or to statement 320 without crashing, it crashes.
This is where the code crashes when the file size is zero, with the following messages:
...
fmt: end of file
apparent state: unit 10 named junko.con
last format: (A)
lately reading sequential formatted external IO
I tried using the INQUIRE statement, inquire (unit=10,SIZE=nsize), but the program would not compile.
The OPEN statement did not give any error when opening the zero-size file, and the value of IOSTAT was the same irrespective of the file size.
As Ian noted, any modern Fortran compiler should have INQUIRE. A simple test of
program foo
   integer sz
   inquire(file='tmp.dat', size=sz)
   print *, sz
end program foo
with an empty tmp.dat file sets sz=0.

Fortran is reading beyond endfile record

I'm trying to read some data from a file, and endfile record detection is important for stopping the read. However, depending on the dimensions of the array used to read the data, I cannot properly detect the endfile record and my Fortran program stops.
The program is below:
!integer, dimension(3) :: x    ! line 1.1
!integer, dimension(3,10) :: x ! line 1.2
integer, dimension(10,3) :: x  ! line 1.3
integer :: status, i = 1
character(len=100) :: error

open( 30, file='data.dat', status='old' )

do
   print *, i
   !read( 30, *, iostat=status, iomsg=error ) x      ! line 2.1
   !read( 30, *, iostat=status, iomsg=error ) x(:,i) ! line 2.2
   read( 30, *, iostat=status, iomsg=error ) x(i,:)  ! line 2.3
   if ( status < 0 ) then
      print *, 'EOF'
      print *, 'total of ', i-1, ' lines read.'
      exit
   else if ( status > 0 ) then
      print *, 'error cod: ', status
      print *, 'error message: ', error
      stop
   else if ( status == 0 ) then
      print *, 'reading ok.'
      i = i + 1
   end if
end do
With the 'data.dat' file being:
10 20 30
30 40 50
When lines 1.3 and 2.3 are uncommented the mentioned error appears:
error cod: 5008
error message: Read past ENDFILE record
However, using lines 1.1 and 2.1, or 1.2 and 2.2, the program works, detecting endfile record.
So, I would like some help understanding why I cannot use lines 1.3 and 2.3 to read this file properly, since I'm giving the correct number of array elements to the read command.
I'm using gfortran compiler, version 6.3.0.
EDIT: simpler example
the following produces a 5008 "Read past ENDFILE record" error:
implicit none
integer x(2,2),s
open(20,file='noexist')
read(20,*,iostat=s)x
write(*,*)s
end
If we make x a scalar or a one-d array (any size) we get the expected -1 EOF flag. It doesn't matter if the file actually doesn't exist or is empty. If the file contains some, but not enough, data it's hard to make sense of which return value you might get.
I am not sure if I am expressing myself correctly, but it has to do with the way Fortran reads and stores 2-d arrays. When you use the notation x(:,i), column i is virtually expanded in-line and the items are read using this one line of code. In the other case, where x(i,:) is used, row i is read as if you called read multiple times.
You may use implied loops if you want to stick with a specific shape and size. For example, you could use something like this: read( 30, *, iostat=status, iomsg=error ) (x(i,j), j=1,3)
In any case you should check that your data are stored properly (as expected at least) in variable x.
Please note this is only a guess. Remember that Fortran stores arrays in column-major order. When gfortran compiles read() x(:,i), the 3 memory locations are next to each other, so in the executable it produces a single call to the operating system to read in 3 values from the file.
Now when read() x(i,:) is compiled, the three data elements x(i,1), x(i,2) and x(i,3) are not in contiguous memory. So I am guessing the executable actually makes 3 read calls to the operating system. The first one would trap the EOF, but the 2nd one gives you the "Read past ENDFILE record" error.
UPDATE: I have confirmed that this does not occur with Intel's ifort. gfortran seems to have had a similar problem before: Bad IOSTAT values when reading NAMELISTs past EOF. Whether this is a bug or not is debatable. The code certainly looks like it should trap an EOF.

Converting / Flattening RMS indexed files from OpenVMS

I was attempting to convert some Indexed files created on the OpenVMS to plain flat sequential files to be used in Windows or Linux.
Each indexed file contains some quantity of POD structures (2594 bytes each).
I have converted the files using a simple program such as this:
      PROGRAM MAKE_FLAT
      BYTE byte_array(2594)
      PARAMETER FILE_IN = 1
      PARAMETER FILE_OUT = 2
      OPEN(UNIT=FILE_IN, FORM='UNFORMATTED',
     1     FILE='input.data',
     1     ORGANIZATION='INDEXED',
     1     ACCESS='SEQUENTIAL',
     1     KEY=(1:8:INTEGER), RECL=649)
      OPEN(UNIT=FILE_OUT, FORM='UNFORMATTED',
     1     FILE='output.data')
      DO WHILE (.TRUE.)
         READ(FILE_IN, END=999) byte_array
         WRITE(FILE_OUT) byte_array
      END DO
 999  CONTINUE
      CLOSE(FILE_IN)
      CLOSE(FILE_OUT)
      END
If there are 1000 records in the file, I should be expecting a file that is
~ 1000*2594 bytes, but instead it resulted in 1000*2044 bytes, shown using:
DIR/FULL output.data
Why is the program writing fewer bytes per record? Did I do something wrong?
Using the built-in utility of OpenVMS gives me the expected flat file.
ANAL/RMS/FDL FILE.FDL input.data
EDIT/FDL/ANALY=FILE.FDL FILE.FDL
After changing the organization from 'INDEXED' to 'SEQUENTIAL' and contiguous to 'YES', performing the following command gives me the flat file of the correct size (including padding per record).
CONVERT/FDL=FILE.FDL input.data output.data
If you do not really need to do this in a program, just use CONVERT
$ CONVERT/FDL=FIXED.FDL IN-FILE OUT-FILE
You can use $ EDIT/FDL FIXED.FDL and follow the prompts for making a sequential file.
2044 looks like the maximum record size FORTRAN on VMS uses to write the data. If the file size is really 1000*2044, something is wrong.
What's the output of DUMP/HEADER/BLOCKS=COUNT=0 FOR002.DAT in the lines 'Record size', 'End of file block' and 'End of file byte'?
I would expect that the 2594 bytes are written in two records. Given that there are two bytes for a flag, you will see records with length 2044 and 554. (You can confirm this with a DUMP/RECORD FOR002.DAT/PAGE.) Each record has a record length field of two bytes. That is, you should have a file size of 1000*(2044+2+554+2) = 2602000.
You can double check that with the "End of file" data from the first DUMP command: (End of file block-1)*512 + End of file byte.

C++ SunOS ofstream error

Folks,
we collect large amounts of data and create error, status, info log files to let us know what's
going on. We use ofstreams to write to these files. After some period of time (days), we get a
file error (indicated by .good() call) on one of the ofstreams. In the affected log file, it
appears that the write of a single line begins but is interrupted by a write of the exact same
line. For example,
### Random Line of Text 1 ###
### Random Line of Text 2 ###
### Random Line of Text 3### Random Line of Text 3 ###
Each file/ofstream has a single thread that does the actual writing. We don't flush for performance
reasons and shouldn't have to.
It's always the same type of error.
It only happens on one of the three machines running the same code. We don't see any I/O errors, but maybe we're not looking in the right place.
Thanks for your time.

In what scenarios would fprintf() not write to the output file

This is a sub-problem of a bigger problem I have posted before. The following is a code snippet from a C++ package I am trying to work with. The intent is to write the values in the float array prob_estimates to the output file. For some seemingly random lines, only some of the values of the array are written. When can that happen? How should I debug it?
int j;
predict_label = predict_probability(model_, x, prob_estimates);
fprintf(output, "%g", predict_label);
for (j = 0; j < model_->nr_class; j++) {
    fprintf(output, " %g", prob_estimates[j]);
    fflush(output);
}
fprintf(output, "\n");
I also want to point out that this seems to happen only when the input size is fairly huge. This is part of a bigger loop that runs once per line of an input file (with about 200,000 lines). The prob_estimates array has 500 values per line. Fewer than 500 values are written for some 20-odd lines of the output file.
I ran this a couple of times on a subset (with 20,000 lines) and everything seemed fine.
Update: I tried checking the return value of fprintf after each attempted write, and it turns out it returns -1 for a lot of lines when trying to write to the output.
fprintf encountered error at 19th value of line 2109359. Returned -1
fprintf encountered error at 373th value of line 2109359. Returned -1
fprintf encountered error at 229th value of line 2109360. Returned -1
fprintf encountered error at 87th value of line 2109361. Returned -1
fprintf encountered error at 439th value of line 2109361. Returned -1
This is when I modified the above code as follows:
for (j = 0; j < model_->nr_class; j++) {
    int e = fprintf(output, " %g", prob_estimates[j]);
    if (e < 0) {
        printf("fprintf encountered error at %dth value of line %d. Returned %d", j, count, e);
    }
}
Here count is a variable that counts the line number. It is incremented at the top of the outer loop (not shown here).
What can I do to figure out why fprintf returns -1?
A few things you could do:
Print everything to the console as well, to see whether the problem is in the file output or somewhere else.
Print model_->nr_class to make sure the number of values is what you expect.
Check the output file only after it is closed. Although you fflush the output, it may be that other places update the file and don't fflush it; fclose would. Instead of flushing the file every line, I suggest opening it in append mode at the beginning of the function and closing it at the end (see the sketch below).
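A minimal sketch of that last suggestion, with made-up names (write_probabilities, path) standing in for whatever the real code uses:

#include <stdio.h>

int write_probabilities(const char *path, double predict_label,
                        const double *prob_estimates, int nr_class)
{
    FILE *output = fopen(path, "a");   /* append mode, opened once per call */
    if (output == NULL)
        return -1;

    fprintf(output, "%g", predict_label);
    for (int j = 0; j < nr_class; j++)
        fprintf(output, " %g", prob_estimates[j]);
    fprintf(output, "\n");

    return fclose(output);             /* flushing happens here; a failed write shows up in the return value */
}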
Hope this helps
Now that you've found that fprintf is returning an error, you need to check the value of errno after the failing call to find out what the actual cause of the error is.
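For example, a hedged sketch of capturing errno right after the failing call (the variable names mirror the snippet in the question, not the package's actual code):

#include <errno.h>
#include <stdio.h>
#include <string.h>

static int write_value(FILE *output, double value, int j, int count)
{
    /* Read errno immediately, before any other library call can overwrite it. */
    int e = fprintf(output, " %g", value);
    if (e < 0) {
        fprintf(stderr, "fprintf failed at value %d of line %d: %s (errno %d)\n",
                j, count, strerror(errno), errno);
    }
    return e;
}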