Apache Beam fileio write compressed files - google-cloud-platform

I would like to know if it's possible to write compressed files using the fileio module from Apache Beam, Python SDK. At the moment I am using the module to write files to a GCP bucket:
_ = (logs | 'Window' >> beam.WindowInto(window.FixedWindows(60*60))
          | 'Convert to JSON' >> beam.ParDo(ConvertToJson())
          | 'Write logs to GCS file' >> fileio.WriteToFiles(path=gsc_output_path,
                                                            shards=1,
                                                            max_writers_per_bundle=0))
Compression would help in minimizing storage costs.
According to this doc and a comment inside the _MoveTempFilesIntoFinalDestinationFn class, developers still need to implement handling of compression.
Am I right about this or does someone know how to do it?
Thank you!

developers still need to implement handling of compression.
This is correct.
Though there are open feature requests:
https://github.com/apache/beam/issues/19415
https://github.com/apache/beam/issues/19941
At the moment, you can work around it with a DoFn that runs after the write: read the final files, compress them, write the compressed copies, and delete the original final files.
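A minimal sketch of such a DoFn, assuming the elements emitted by WriteToFiles expose the final path via a file_name attribute (as fileio's FileResult does), and using Beam's FileSystems API together with gzip:

import gzip

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.io.filesystems import FileSystems


class CompressFinalFiles(beam.DoFn):
    """Gzips a finalized file, then deletes the uncompressed original."""

    def process(self, file_result):
        src_path = file_result.file_name          # final file produced by WriteToFiles
        dst_path = src_path + '.gz'
        src = FileSystems.open(src_path)
        dst = FileSystems.create(dst_path,
                                 compression_type=CompressionTypes.UNCOMPRESSED)
        with gzip.GzipFile(fileobj=dst, mode='wb') as gz:
            gz.write(src.read())                  # fine for modest files; loop in chunks for big ones
        dst.close()
        src.close()
        FileSystems.delete([src_path])
        yield dst_path

# Hypothetical wiring: feed the output PCollection of WriteToFiles into the step.
# compressed = write_results | 'Compress' >> beam.ParDo(CompressFinalFiles())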

Related

awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory directly. So I have created a bunch of small files in S3 and am writing a script that can read these files and merge them. I am using AWS Wrangler to do this.
My code is as follows:
try:
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'],
                             chunked=True, use_threads=True)
    for df in dfs:
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file, and I can't remove chunked=True because otherwise my program fails with an OOM error.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset=True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for Parquet in general. As per this SO post, it's not possible in a local folder, let alone S3. On top of that, Parquet is compressed, and I don't think it would be easy to add a line to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite them as larger files (a rough sketch is below), but you won't be able to get back to one Parquet file unless you get a machine with enough RAM.
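A minimal sketch of that batching loop, reusing the same awswrangler calls as in the question; the bucket paths are hypothetical and max_rows is just a knob to keep each merged file within your memory budget:

import awswrangler as wr
import pandas as pd

input_folder = "s3://my-bucket/small-files/"    # hypothetical paths
target_path = "s3://my-bucket/merged/"
max_rows = 1_000_000                            # tune to your memory budget

def flush(frames):
    # Concatenate the buffered chunks and write them out as one bigger file.
    merged = pd.concat(frames, ignore_index=True)
    wr.s3.to_parquet(df=merged, dataset=True, path=target_path, mode="append")

batch, batch_rows = [], 0
dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'],
                         chunked=True, use_threads=True)
for df in dfs:
    batch.append(df)
    batch_rows += len(df)
    if batch_rows >= max_rows:
        flush(batch)
        batch, batch_rows = [], 0

if batch:
    flush(batch)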

Unzipping a file in Google Storage Bucket

I am using the Dataflow templates API to decompress a zipped file I have in a Google Storage bucket. This zip file in turn has multiple folders and files. Now the Dataflow API decompresses my zip file but writes the output into a plain text file. What I want is only to unzip my input file and extract all the contents within. How can I do this?
My zip contains the following hierarchy:
file.zip
|
|_folder1
| |
| |_file1
| |_file2
| |_file3
|_file
Thanks in advance!
The pipeline writes only the files that failed into a plain text file. You can see the details of the process here.
Are you sure your files are readable and correctly decompressed?
I was able to compress and decompress files using Dataflow from the console.
On the settings it says: Bulk Decompress Cloud Storage Files template
Required Parameters. The input filepattern to read from (e.g.,
gs://bucket-name/uncompressed/*.gz).
So the compressing/decompressing works at the level of files, by matching the pattern. I do not know how you compressed or decompressed at the level of folders. When I try to input a folder name for the input parameter, I get: "No files matched spec" error.
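If the goal is really to extract a zip archive that contains folders and multiple files (rather than decompressing individual compressed files matched by a pattern, which is what the template is built around), one workaround is a small standalone script outside Dataflow. A rough sketch, assuming the google-cloud-storage client library and hypothetical bucket/object names:

import io
import zipfile
from google.cloud import storage   # pip install google-cloud-storage

BUCKET = "my-bucket"               # hypothetical names
ZIP_OBJECT = "uploads/file.zip"
DEST_PREFIX = "extracted/"

client = storage.Client()
bucket = client.bucket(BUCKET)

# Pull the archive into memory; for very large archives, stream to a temp file instead.
zip_bytes = bucket.blob(ZIP_OBJECT).download_as_bytes()

with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    for member in archive.namelist():
        if member.endswith("/"):   # skip directory entries
            continue
        # Re-create the folder hierarchy under DEST_PREFIX, e.g. extracted/folder1/file1.
        bucket.blob(DEST_PREFIX + member).upload_from_string(archive.read(member))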

Google Cloud Dataflow (Python): function to read from and write to a .csv file?

I am not able to figure out the precise functions in the GCP Dataflow Python SDK that read from and write to CSV files (or any non-text files, for that matter). For BigQuery, I have figured out the following functions:
beam.io.Read(beam.io.BigQuerySource('%Table_ID%'))
beam.io.Write(beam.io.BigQuerySink('%Table_ID%'))
For text files, the ReadFromText and WriteToText functions are known to me.
However, I am not able to find any examples for GCP Dataflow Python SDK in which data is written to or read from csv files. Please could you provide the GCP Dataflow Python SDK functions for reading from and writing to csv files in the same manner as I have done for the functions relating to BigQuery above?
There is a CsvFileSource in the beam_utils PyPI package that reads .csv files, deals with file headers, and can use custom delimiters. More information on how to use this source is in this answer. Hope that helps!
CSV files are text files. The simplest (though somewhat inelegant) way of reading them would be to do a ReadFromText and then split the lines read on the commas (e.g. beam.Map(lambda x: x.split(','))).
For the more elegant option, check out this question, or simply install the beam_utils pip package and read from the beam_utils.sources.CsvFileSource source.
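A minimal sketch of the ReadFromText/WriteToText route described above, with hypothetical gs:// paths:

import apache_beam as beam

with beam.Pipeline() as p:
    rows = (p
            | 'Read CSV' >> beam.io.ReadFromText('gs://my-bucket/input.csv',
                                                 skip_header_lines=1)
            # Simple comma split, as suggested above; switch to the csv module
            # if your fields can contain quoted commas.
            | 'Parse' >> beam.Map(lambda line: line.split(',')))

    _ = (rows
         | 'Format' >> beam.Map(lambda fields: ','.join(fields))
         | 'Write CSV' >> beam.io.WriteToText('gs://my-bucket/output',
                                              file_name_suffix='.csv'))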

Chunk download with OneDrive Rest API

This is the first time I am writing on Stack Overflow. My question is the following.
I am trying to write a OneDrive C++ API based on the cpprest SDK (the Casablanca project):
https://casablanca.codeplex.com/
In particular, I am currently implementing read operations on OneDrive files.
Actually, I have been able to download a whole file with the following code:
http_client api(U("https://apis.live.net/v5.0/"), m_http_config);
api.request(methods::GET, file_id + L"/content").then([=](http_response response){
    return response.body();
}).then([=](istream is){
    streambuf<uint8_t> rwbuf = file_buffer<uint8_t>::open(L"test.txt").get();
    is.read_to_end(rwbuf).get();
    rwbuf.close();
}).wait();
This code basically downloads the whole file to the computer (file_id is the id of the file I am trying to download). Of course, I can extract an input stream from the file and use it to read the file.
However, this could give me issues if the file is big. What I had in mind was to download a part of the file while the caller is reading it (and to cache that part in case they come back to it).
Then, my question would be:
Is it possible, using the OneDrive REST API + cpprest, to download only a part of a file stored on OneDrive? I have found that uploading files in chunks apparently is not possible (Chunked upload (resumable upload) for OneDrive?). Is this true for downloads as well?
Thank you in advance for your time.
Best regards,
Giuseppe
OneDrive supports byte range reads, so you should be able to request chunks of whatever size you want by adding a Range header.
For example,
GET /v5.0/<fileid>/content
Range: bytes=0-1023
This will fetch the first KB of the file.

When will NSURLConnection decompress a compressed resource?

I've read that NSURLConnection will automatically decompress a compressed (zipped) resource; however, I cannot find Apple documentation or official word anywhere that specifies the logic defining when this decompression occurs. I'm also curious to know how this relates to streamed data.
The Problem
I have a server that streams files to my app using a chunked encoding, I believe. This is a WCF service. Incidentally, we're going with streaming because it should alleviate server load during high use and also because our files are going to be very large (100's of MB). The files could be compressed or uncompressed. I think in my case because we're streaming the data, the Content-Encoding header is not available, nor is Content-Length. I only see "Transfer-Encoding" = Identity in my response.
I am using the AFNetworking library to write these files to disk with AFHTTPRequestOperation's inputStream and outputStream. I have also tried using AFDownloadRequestOperation as well with similar results.
Now, the AFNetworking docs state that compressed files will automatically be decompressed (via NSURLConnection, I believe) after download, but this is not happening. I write them to my documents directory with no problems, yet they are still zipped. I can unzip them manually as well, so the file is not corrupted. Do they not auto-unzip because I'm streaming the data and because Content-Encoding is not specified?
What I'd like to know:
Why are my compressed files not decompressing automatically? Is it because of streaming? I know I could use another library to decompress afterward, but I'd like to avoid that if possible.
When exactly does NSURLConnection know when to decompress a downloaded file, automatically? I can't find this in the docs anywhere. Is this tied to a header value?
Any help would be greatly appreciated.
NSURLConnection will decompress automatically when the appropriate Content-Encoding (e.g. gzip) is available in the response header. That's down to your server to arrange.
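For illustration (the paths and values here are just an example), an exchange that NSURLConnection would decompress transparently looks roughly like:
GET /files/bigfile.json HTTP/1.1
Accept-Encoding: gzip, deflate

HTTP/1.1 200 OK
Content-Type: application/json
Content-Encoding: gzip
With "Transfer-Encoding" = Identity and no Content-Encoding header, as described in the question, the body is passed through untouched, which would explain the behaviour being seen.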