We have multiple files in a cloud storage location. We read each file from that location, process it based on its file type, and write it to an output topic; which output topic is used depends on the file type. Here is the code:
PCollection<FileIO.ReadableFile> data = pipeline
    .apply(FileIO.match().filepattern(options.getReadDir())
        .continuously(Duration.standardSeconds(10), Watch.Growth.never()))
    .apply(FileIO.readMatches().withCompression(Compression.AUTO));

PCollectionTuple outputData = data.apply(
    ParDo.of(new Transformer(tupleTagsMap, options.getBlockSize()))
        .withOutputTags(TupleTags.customerTag, TupleTagList.of(tupleTagList)));

outputData.get(TupleTags.topicA)
    .apply("Write to customer topic",
        PubsubIO.writeStrings().to(options.topicA));

outputData.get(TupleTags.topicB)
    .apply("Write to transaction topic",
        PubsubIO.writeStrings().to(options.topicB));
processContext.output(TupleTagsMap().get(importContext.getFileType().toString()), jsonBlock);

This line of code is inside the transformer.
The issue is that one of the files is very large: it contains 100 million records.
We call processContext.output in chunks (jsonBlock above is one such chunk), but the chunks are only written to the output topic once processing of the whole file has completed.
Because of this we run into memory errors while processing the large file, since nothing is flushed to the output topic along the way.
How can I solve this issue?
I'm trying to decompress a .zst file the following way:
public byte[] decompress() throws IOException {
    byte[] compressedBytes = Files.readAllBytes(Paths.get(PATH_TO_ZST));
    final long size = Zstd.decompressedSize(compressedBytes);
    return Zstd.decompress(compressedBytes, (int) size);
}
and I'm running into this:
com.github.luben.zstd.ZstdException: Unknown frame descriptor
    at com.github.luben.zstd.ZstdDecompressCtx.decompressByteArray(ZstdDecompressCtx.java:157)
    at com.github.luben.zstd.ZstdDecompressCtx.decompress(ZstdDecompressCtx.java:214)
Has anyone faced something similar? Thanks!
That error means zstd doesn't recognize the first 4 bytes of the frame. This can happen because either:
The data is not in zstd format,
There is excess data at the end of the zstd frame.
You'll also want to check the output of Zstd.decompressedSize() for 0, which means the frame is corrupted, or the size wasn't present in the frame header. See the documentation.
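For what it's worth, here is a minimal sketch of those two checks built on the same zstd-jni calls used above. The magic-byte check, the class name, and the ZstdInputStream fallback (for frames whose header doesn't record the decompressed size) are my own additions, not something the library requires:

import com.github.luben.zstd.Zstd;
import com.github.luben.zstd.ZstdInputStream;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ZstdSafeDecompress {

    // First four bytes of every zstd frame: magic number 0xFD2FB528, stored little-endian.
    private static final byte[] ZSTD_MAGIC = {(byte) 0x28, (byte) 0xB5, (byte) 0x2F, (byte) 0xFD};

    public static byte[] decompress(String path) throws IOException {
        byte[] compressed = Files.readAllBytes(Paths.get(path));

        // Fail early with a clearer message than "Unknown frame descriptor".
        if (compressed.length < ZSTD_MAGIC.length) {
            throw new IOException(path + " is too short to be a zstd frame");
        }
        for (int i = 0; i < ZSTD_MAGIC.length; i++) {
            if (compressed[i] != ZSTD_MAGIC[i]) {
                throw new IOException(path + " does not start with the zstd magic number");
            }
        }

        long size = Zstd.decompressedSize(compressed);
        if (size > 0) {
            // The frame header records the decompressed size: one-shot decompression is fine.
            return Zstd.decompress(compressed, (int) size);
        }

        // Size unknown (0): decompress via the streaming API instead.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZstdInputStream in = new ZstdInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }
}

If the bytes really aren't zstd data (for example the object was re-compressed or truncated on upload), the early check reports that directly instead of failing inside the native decompressor; the streaming fallback avoids relying on Zstd.decompressedSize() at all, at the cost of growing an in-memory buffer.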
I recently started exploring the HDF5 library for Fortran, and I came across some curious behaviour that I don't understand. The code below compiles just fine, but when I try to run the code, I get a very long list of errors originating from the subroutine HDF_init_ot. When I place this subroutine inside the caller subroutine HDF_init (as I did in HDF_init2), everything works perfectly fine. Why does this happen?
The code:
!===============================================================================
! This subroutine makes a call to an external subroutine HDF_init_ot,
! which throws an error
!===============================================================================
subroutine HDF_init()
use hdf5
character(len=11), parameter :: filename = "output.hdf5"
character(len=2), parameter :: group_ot = "ot"
integer(HID_T) :: file_handle, ot_handle
integer :: error
! Create a new HDF file
call h5open_f(error)
call h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_handle, error)
! Create a new group
call h5gcreate_f(file_handle, group_ot, ot_handle, error)
! Call to subroutine
call HDF_init_ot(ot_handle)
! Close group
call h5gclose_f(ot_handle, error)
! Close file
call h5fclose_f(file_handle, error)
call h5close_f(error)
end subroutine HDF_init
!===============================================================================
! The culprit subroutine
!===============================================================================
subroutine HDF_init_ot(ot_handle)
use hdf5
character(len=1), parameter :: dset_t = "t"
integer(HID_T) :: ot_handle, ot_space_handle
integer(HID_T) :: data_handle
integer(HSIZE_T), dimension(1) :: ot_dims, ot_max_dims
integer :: error, rank
! Define rank and dimensions
rank = 1
ot_dims = (/0/)
ot_max_dims = (/H5S_UNLIMITED_F/)
! Create a data space
call h5screate_simple_f(rank, ot_dims, ot_space_handle, error, ot_max_dims)
! Create a data set within the space
call h5dcreate_f( ot_handle, dset_t, H5T_NATIVE_DOUBLE, ot_space_handle, &
data_handle, error)
! Close the data set
call h5dclose_f(data_handle, error)
! Close the data space
call h5sclose_f(ot_space_handle, error)
end subroutine HDF_init_ot
!===============================================================================
! When I inline HDF_init_ot into the subroutine below, everything works
!===============================================================================
subroutine HDF_init2()
use hdf5
character(len=11), parameter :: filename = "output.hdf5"
character(len=2), parameter :: group_ot = "ot"
character(len=1), parameter :: dset_t = "t"
integer(HID_T) :: file_handle, ot_handle, ot_space_handle, data_handle
integer(HSIZE_T), dimension(1) :: ot_dims, ot_max_dims
integer :: error, rank
rank = 1
ot_dims = (/0/)
ot_max_dims = (/H5S_UNLIMITED_F/)
call h5open_f(error)
call h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_handle, error)
call h5gcreate_f(file_handle, group_ot, ot_handle, error)
call h5screate_simple_f(rank, ot_dims, ot_space_handle, error, ot_max_dims)
call h5dcreate_f( ot_handle, dset_t, H5T_NATIVE_DOUBLE, ot_space_handle, &
data_handle, error)
call h5dclose_f(data_handle, error)
call h5sclose_f(ot_space_handle, error)
call h5gclose_f(ot_handle, error)
call h5fclose_f(file_handle, error)
call h5close_f(error)
end subroutine HDF_init2
!===============================================================================
! Main program
!===============================================================================
program test
implicit none
! This one throws errors
call HDF_init()
! This one does not
call HDF_init2()
end program test
The error:
HDF5-DIAG: Error detected in HDF5 (1.8.16) thread 140274482874112:
#000: ../../../src/H5D.c line 194 in H5Dcreate2(): unable to create dataset
major: Dataset
minor: Unable to initialize object
#001: ../../../src/H5Dint.c line 453 in H5D__create_named(): unable to create and link to dataset
major: Dataset
minor: Unable to initialize object
#002: ../../../src/H5L.c line 1638 in H5L_link_object(): unable to create new link to object
major: Links
minor: Unable to initialize object
#003: ../../../src/H5L.c line 1882 in H5L_create_real(): can't insert link
major: Symbol table
minor: Unable to insert object
#004: ../../../src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#005: ../../../src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
major: Symbol table
minor: Callback failed
#006: ../../../src/H5L.c line 1685 in H5L_link_cb(): unable to create object
major: Object header
minor: Unable to initialize object
#007: ../../../src/H5O.c line 3016 in H5O_obj_create(): unable to open object
major: Object header
minor: Can't open object
#008: ../../../src/H5Doh.c line 293 in H5O__dset_create(): unable to create dataset
major: Dataset
minor: Unable to initialize object
#009: ../../../src/H5Dint.c line 1056 in H5D__create(): unable to construct layout information
major: Dataset
minor: Unable to initialize object
#010: ../../../src/H5Dcontig.c line 422 in H5D__contig_construct(): extendible contiguous non-external dataset
major: Dataset
minor: Feature is unsupported
HDF5-DIAG: Error detected in HDF5 (1.8.16) thread 140274482874112:
#000: ../../../src/H5D.c line 415 in H5Dclose(): not a dataset
major: Invalid arguments to routine
minor: Inappropriate type
Edit:
I found that when I remove ot_max_dims from the h5screate_simple_f call, everything runs as expected. Curiously, if I define ot_max_dims within HDF_init and pass this as an extra parameter to HDF_init_ot, things also work. This suggests to me that H5S_UNLIMITED_F is a parameter that is private to HDF_init(), possibly tied to the API instance (i.e. to h5open_f)?
I have an application which writes a new HDF file every 100th iteration of a loop.
There is normal output at steps 0, 100, 200, 300, and 400, but at step 500 I get:
HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
#000: H5F.c line 444 in H5Fcreate(): unable to create file
major: File accessibilty
minor: Unable to open file
#001: H5Fint.c line 1567 in H5F_open(): unable to lock the file
major: File accessibilty
minor: Unable to open file
#002: H5FD.c line 1640 in H5FD_lock(): driver lock request failed
major: Virtual File Layer
minor: Can't update object
#003: H5FDsec2.c line 959 in H5FD_sec2_lock(): unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'
major: File accessibilty
minor: Bad file ID accessed
The error happens in the call to H5Fcreate.
No matter how many outputs I make before step 500 (e.g. output at every step, no output at all, or one output at step 499), all outputs are correct, but at step 500 I get the above error message.
I checked: all HDF files that get created are closed again immediately after writing.
(All HDF5 calls go through wrapper functions which log the handles of the files being created, opened and closed.)
Under what circumstances does this error message occur?
Is there a way to find out what exactly the problem is?
I want to use the SageMaker KMeans built-in algorithm in one of my applications. I have a large CSV file in S3 (raw data) that I split into several parts to make it easier to clean. Before cleaning, I tried to use these parts as the input to KMeans for a training job, but it doesn't work.
My manifest file:
[
{"prefix": "s3://<BUCKET_NAME>/kmeans_data/KMeans-2019-28-07-13-40-00-001/"},
"file1.csv",
"file2.csv"
]
The error I got:
Failure reason: ClientError: Unable to read data channel 'train'. Requested content-type is 'application/x-recordio-protobuf'. Please verify the data matches the requested content-type. (caused by MXNetError)
Caused by: [16:47:31] /opt/brazil-pkg-cache/packages/AIAlgorithmsCppLibs/AIAlgorithmsCppLibs-2.0.1620.0/AL2012/generic-flavor/src/src/aialgs/io/iterator_base.cpp:100: (Input Error) The header of the MXNet RecordIO record at position 0 in the dataset does not start with a valid magic number.
Stack trace returned 10 entries:
[bt] (0) /opt/amazon/lib/libaialgs.so(+0xb1f0) [0x7fb5674c31f0]
[bt] (1) /opt/amazon/lib/libaialgs.so(+0xb54a) [0x7fb5674c354a]
[bt] (2) /opt/amazon/lib/libaialgs.so(aialgs::iterator_base::Next()+0x4a6) [0x7fb5674cc436]
[bt] (3) /opt/amazon/lib/libmxnet.so(MXDataIterNext+0x21) [0x7fb54ecbcdb1]
[bt] (4) /opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) [0x7fb567a1e858]
[bt] (5) /opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call+0x15f) [0x7fb567a1d95f
My question is: is it possible to use multiple CSV files as input to the SageMaker KMeans built-in algorithm, or is that only possible through the GUI? If it is possible, how should I format my manifest?
The manifest looks fine, but based on the error message it looks like you haven't set the right data format for your S3 data. It's expecting protobuf, which is the default format :)
You have to set the CSV data format explicitly. See https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.s3_input.
It should look something like this:
s3_input_train = sagemaker.s3_input(
    s3_data='s3://{}/{}/train/manifest_file'.format(bucket, prefix),
    s3_data_type='ManifestFile',
    content_type='csv')
...
kmeans_estimator = sagemaker.estimator.Estimator(kmeans_image, ...)
kmeans_estimator.set_hyperparameters(...)
s3_data = {'train': s3_input_train}
kmeans_estimator.fit(s3_data)
Please note that the KMeans estimator in the SDK only supports protobuf; see https://sagemaker.readthedocs.io/en/stable/kmeans.html