Feature to download partial file from GCS - C++

The following code works well with the latest version of GCS. I can use it to download the complete file in one go.
gcs::ObjectReadStream stream = client.ReadObject(bucket_name, object_name);
But if my file is too large, I need to download it in segments. Can someone suggest how to read it in smaller chunks? In other words, how can I specify range requests?

As per the API documentation for ReadObject:
auto stream = client.ReadObject(bucket_name, object_name, gcs::ReadRange(0, 100));
gets you the first 100 bytes.
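If you need the whole object in segments, here is a minimal sketch assuming the google-cloud-cpp client; the helper name, chunk size, and error handling are my own. It looks up the object size and issues one ranged read per chunk:

#include "google/cloud/storage/client.h"
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace gcs = google::cloud::storage;

// Hypothetical helper: read the object in fixed-size ranges.
void ReadInChunks(gcs::Client client, std::string const& bucket_name,
                  std::string const& object_name) {
  auto metadata = client.GetObjectMetadata(bucket_name, object_name);
  if (!metadata) throw std::runtime_error(metadata.status().message());

  std::int64_t const chunk_size = 1024 * 1024;  // 1 MiB per request (assumption)
  auto const object_size = static_cast<std::int64_t>(metadata->size());

  std::vector<char> buffer(chunk_size);
  for (std::int64_t offset = 0; offset < object_size; offset += chunk_size) {
    auto const end = std::min(offset + chunk_size, object_size);
    // ReadRange is a half-open interval [offset, end).
    auto stream = client.ReadObject(bucket_name, object_name,
                                    gcs::ReadRange(offset, end));
    stream.read(buffer.data(), end - offset);
    // Process buffer[0 .. stream.gcount()) here.
    std::cout << "read " << stream.gcount() << " bytes at offset " << offset << "\n";
  }
}

Each iteration is an independent ranged GET, so you could also issue the requests in parallel if throughput matters.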

Related

How to identify whether a file with no extension is DICOM or not

I have a few files in my GCP bucket folder, like below:
image1.dicom
image2.dicom
image3
file1
file4.dicom
Now, I want to check whether the files that have no extension, i.e. image3 and file1, are DICOM or not.
I use the pydicom reader to read DICOM files and get the data.
dicom_dataset = pydicom.dcmread("dicom_image_file_path")
Please suggest whether there is a way to validate that the above two files are DICOM or not.
You can use the pydicom.misc.is_dicom function or do:
from pydicom import dcmread
from pydicom.errors import InvalidDicomError

try:
    ds = dcmread(filename)
except InvalidDicomError:
    pass
Darcy has the best general answer if you're looking to check file types prior to processing them. Apart from checking whether the file is a DICOM file, it will also make sure the file doesn't have any DICOM problems itself.
However, another way to quickly check, which may or may not be better depending on your use case, is to simply check the file's signature (or magic number, as it's sometimes known).
See https://en.wikipedia.org/wiki/List_of_file_signatures
Basically, if the bytes at positions 128 to 132 in the file are "DICM", then it should be a DICOM file.
If you only want to check 'is it DICOM?' for a set of files, and do it very quickly, this might be another approach.
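A minimal sketch of that signature check (here in C++, assuming a local file path; pydicom.misc.is_dicom performs essentially the same test):

#include <cstring>
#include <fstream>
#include <string>

// Returns true if the file has the 128-byte DICOM preamble followed by "DICM".
bool LooksLikeDicom(std::string const& path) {
  std::ifstream file(path, std::ios::binary);
  if (!file) return false;
  char magic[4] = {};
  file.seekg(128);                 // skip the preamble
  file.read(magic, sizeof(magic));
  return file.gcount() == 4 && std::memcmp(magic, "DICM", 4) == 0;
}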

Unzip a large file and break into chunks in Google Cloud Storage Python

I have a large zipped file in Google Cloud Storage. It needs to be unzipped and broken into smaller chunks to be uploaded to the same bucket. My memory limit is 2 GB and the file is larger than that, so I cannot unzip it all at once. shutil.copyfileobj(fsrc, fdst[, length]) seems to be a memory-efficient solution, but I cannot make it work with GCS specifically (with a blob).
You're likely going to need to write some custom code for this.
I would look for a library that can take in a streaming zip data source and parse it as it streams in. It sounds like you're using Python, so that might be something like stream-unzip (haven't tried it, but it sounds like it'd solve your problem).
Then, for each of the files, as you unzip them, you need to stream them back up into Cloud Storage. There are a few ways to code that up depending on which client library you're using to write to GCS.
The code would look roughly like this:
def read_chunks_from_gcs(bucket, object_name):
    # Hypothetical helper: open the object as a file-like stream and yield bytes.
    with your_gcs_library.read_file_like_object(bucket, object_name) as r:
        yield from r.iter_bytes(chunk_size=65536)

for file_name, file_size, unzipped_chunks in stream_unzip(read_chunks_from_gcs('mybucket', 'big-zipfile.zip')):
    stream = your_gcs_library.open_file_for_write(bucket, file_name, file_size)
    for chunk in unzipped_chunks:
        stream.write(chunk)
That'd probably work. If you work out an exact solution for some GCS library, I encourage you to post it as an answer. I'd love to see what it looks like.

Read Partial Parquet file

I have a Parquet file and I don't want to read the whole file into memory. I want to read the metadata and then read the rest of the file on demand. That is, for example, I want to read the second page of the first column in the third row group. How would I do that using the Apache Parquet C++ library? I have the offset of the part that I want to read from the metadata and can read it directly from disk. Is there any way to pass that buffer to the Apache Parquet library to uncompress, decode, and iterate through the values? How about the same thing for a column chunk or row group? Basically, I want to read the file partially and then pass it to the Parquet APIs to process it, as opposed to giving the file handle to the API and letting it go through the file. Is it possible?
Behind the scenes, this is what the Apache Parquet C++ library actually does. When you pass in a file handle, it will only read the parts it needs to. As it requires the file footer (the main metadata) to know where to find the segments of data, this will always be read. The data segments will only be read once you request them.
There is no need to write special code for this; the library already has it built in. If you want to know in fine detail how this works, you only need to read the source of the library: https://github.com/apache/arrow/tree/master/cpp/src/parquet
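As an illustration, here is a minimal sketch (the file name and the INT64 column type are assumptions) that opens a file and touches only the first column of the third row group; the library reads the footer plus that single column chunk:

#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

#include "parquet/api/reader.h"

int main() {
  // Opening the file only reads the footer metadata.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("example.parquet");
  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
  std::cout << "row groups: " << metadata->num_row_groups() << "\n";

  // The column chunk is only read when you request it.
  std::shared_ptr<parquet::RowGroupReader> row_group = reader->RowGroup(2);
  std::shared_ptr<parquet::ColumnReader> column = row_group->Column(0);

  // Assumes the first column is a required INT64 column; use the matching
  // TypedColumnReader (and pass definition levels) for other schemas.
  auto* int64_reader = static_cast<parquet::Int64Reader*>(column.get());

  std::vector<int64_t> values(1024);
  while (int64_reader->HasNext()) {
    int64_t values_read = 0;
    int64_reader->ReadBatch(static_cast<int64_t>(values.size()),
                            /*def_levels=*/nullptr, /*rep_levels=*/nullptr,
                            values.data(), &values_read);
    // Process values[0 .. values_read) here.
  }
  return 0;
}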

How should I post a file to an AWS Lambda function, process it, and return a file to the client?

I'm using serverless-http to make an Express endpoint on AWS Lambda - pretty simple in general. The flow is basically:
POST a zip file via a multipart form to my endpoint
Unzip the file (which contains a bunch of Excel files)
Merge the files into a single Excel file
res.sendFile(file) the file back to the user
I'm not stuck on this flow 100%, but that's the gist of what I'm trying to do.
Lambda functions SHOULD give me access to /tmp for storage, so I've tried messing around with Multer to store files there and then read the contents. I've also tried the decompress-zip library, and it seems like the files never "work". I've even tried just uploading an image and immediately sending it back. It sends back a file called incoming.[extension], but it's always corrupt. Am I missing something? Is there a better way to do this?
Typically when working with files the approach is to use S3 as the storage, and there are a few reasons for it, but one of the most important is the fact that Lambda has an event size limit of 6 MB, so you can't easily POST a huge file directly to it.
If your zipped Excel file is always going to be smaller than that, then you are safe in that regard. If not, then you should look into a different flow, maybe something using AWS Step Functions with Lambda and S3.
Concerning your issue with unzipping the file, I have personally used and can recommend adm-zip, which would look something like this:
// unzip and extract file entries
var AdmZip = require("adm-zip");
var zip = new AdmZip(rawZipData);
var zipEntries = zip.getEntries();
console.log("Zip contents: " + zipEntries.toString());
zipEntries.forEach(function (entry) {
    var fileContent = entry.getData().toString("utf8");
});

Chunk download with OneDrive REST API

This is the first time I am writing on Stack Overflow. My question is the following.
I am trying to write a OneDrive C++ API based on the cpprest SDK (Casablanca) project:
https://casablanca.codeplex.com/
In particular, I am currently implementing read operations on OneDrive files.
Actually, I have been able to download a whole file with the following code:
http_client api(U("https://apis.live.net/v5.0/"), m_http_config);
api.request(methods::GET, file_id + L"/content").then([=](http_response response) {
    return response.body();
}).then([=](istream is) {
    streambuf<uint8_t> rwbuf = file_buffer<uint8_t>::open(L"test.txt").get();
    is.read_to_end(rwbuf).get();
    rwbuf.close();
}).wait();
This code basically downloads the whole file to the computer (file_id is the id of the file I am trying to download). Of course, I can extract an input stream from the file and use it to read the file.
However, this could give me issues if the file is big. What I had in mind was to download a part of the file while the caller is reading it (and cache that part in case the caller comes back to it).
Then, my question would be:
Is it possible, using the OneDrive REST API + cpprest, to download a part of a file stored on OneDrive? I have found that uploading files in chunks apparently is not possible (Chunked upload (resumable upload) for OneDrive?). Is this also true for downloads?
Thank you in advance for your time.
Best regards,
Giuseppe
OneDrive supports byte-range reads, so you should be able to request chunks of whatever size you want by adding a Range header.
For example,
GET /v5.0/<fileid>/content
Range: bytes=0-1023
This will fetch the first KB of the file.
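In cpprest that means building an http_request and adding the header yourself. Here is a minimal sketch modeled on the question's own download code (the function name, the range, and the output file name are placeholders):

#include <cpprest/http_client.h>
#include <cpprest/filestream.h>

using namespace web::http;
using namespace web::http::client;
using namespace concurrency::streams;

void DownloadFirstKilobyte(utility::string_t const& file_id,
                           http_client_config const& m_http_config) {
    http_client api(U("https://apis.live.net/v5.0/"), m_http_config);

    http_request request(methods::GET);
    request.set_request_uri(file_id + U("/content"));
    request.headers().add(U("Range"), U("bytes=0-1023"));  // first 1024 bytes

    api.request(request).then([](http_response response) {
        // A 206 Partial Content status means the server honored the range.
        return response.body();
    }).then([](istream is) {
        streambuf<uint8_t> rwbuf = file_buffer<uint8_t>::open(U("chunk.bin")).get();
        is.read_to_end(rwbuf).get();
        rwbuf.close().get();
    }).wait();
}

Repeating the request with successive ranges ("bytes=1024-2047", and so on) gives you the chunked, cache-as-you-go behavior you described.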