Unzipping a file in Google Storage Bucket - google-cloud-platform

I am using the Dataflow templates API to decompress a zipped file I have in a Google Storage bucket. This zip file in turn contains multiple folders and files. The Dataflow API does decompress my zip file, but it writes the output into a single plain text file. What I want is simply to unzip my input file and extract all of its contents. How can I do this?
My zip contains the following hierarchy:
file.zip
|
|_folder1
| |
| |_file1
| |_file2
| |_file3
|_file
Thanks in advance!

The pipeline writes only the files that failed to a plain text file. You can see the details of the process here.
Are you sure that your files are readable and correctly decompressed?

I was able to compress and decompress files using Dataflow from the console.
In the settings it says: Bulk Decompress Cloud Storage Files template.
Required Parameters. The input filepattern to read from (e.g.,
gs://bucket-name/uncompressed/*.gz).
So compressing/decompressing works at the level of individual files, by matching the pattern. I do not know how you compressed or decompressed at the level of folders. When I try to enter a folder name for the input parameter I get a "No files matched spec" error.
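If the template cannot do what you need (it decompresses individual matched files rather than expanding a zip archive that contains a folder tree), one alternative is to extract the archive yourself with a small script. Here is a rough sketch in Python, using the google-cloud-storage client and the standard zipfile module; the bucket, object, and destination prefix names are placeholders:

import io
import zipfile
from google.cloud import storage

# Hypothetical names -- adjust to your bucket and objects.
BUCKET = "my-bucket"
ZIP_OBJECT = "uploads/file.zip"
DEST_PREFIX = "extracted/"

client = storage.Client()
bucket = client.bucket(BUCKET)

# Download the archive into memory (use a temp file for very large archives).
zip_bytes = bucket.blob(ZIP_OBJECT).download_as_bytes()

with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    for member in archive.namelist():
        if member.endswith("/"):  # skip directory entries
            continue
        # Re-upload each member under the destination prefix,
        # preserving the folder structure from the archive.
        bucket.blob(DEST_PREFIX + member).upload_from_string(archive.read(member))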

Related

Apache beam fileio write compressed files

I would like to know if it's possible to write compressed files using the fileio module from Apache Beam, Python SDK. At the moment I am using the module to write files to a GCP bucket:
_ = (logs | 'Window' >> beam.WindowInto(window.FixedWindows(60*60))
| 'Convert to JSON' >> beam.ParDo(ConvertToJson())
| 'Write logs to GCS file' >> fileio.WriteToFiles(path = gsc_output_path, shards=1, max_writers_per_bundle=0))
Compression would help in minimizing storage costs.
According to this doc and comment inside class _MoveTempFilesIntoFinalDestinationFn, developers still need to implement handling of compression.
Am I right about this or does someone know how to do it?
Thank you!
developers still need to implement handling of compression.
This is correct.
Though there are open FRs:
https://github.com/apache/beam/issues/19415
https://github.com/apache/beam/issues/19941
At the moment, you can write a DoFn: read the final files -> compress -> write the compressed final files and delete the original final files.
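A minimal sketch of such a DoFn, assuming it receives the paths of the already-finalized output files as strings (for example obtained by matching the output directory after the write has finished); the .gz naming and the path handling here are assumptions, not part of the fileio API:

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.io.filesystems import FileSystems

class CompressFinalFiles(beam.DoFn):
    """Read a finished output file, write a gzip copy, delete the original."""
    def process(self, file_path):
        # Read the uncompressed file produced by fileio.WriteToFiles.
        src = FileSystems.open(file_path)
        data = src.read()
        src.close()
        # Write the compressed copy next to it (the .gz name is an assumption).
        gz_path = file_path + '.gz'
        dst = FileSystems.create(gz_path, compression_type=CompressionTypes.GZIP)
        dst.write(data)
        dst.close()
        # Remove the original uncompressed file.
        FileSystems.delete([file_path])
        yield gz_path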

Load multiple files, check file name, archive a file

In Data Fusion pipeline:
How do I read all the file names from a bucket, load some based on the file name, and archive the others?
Is it possible to run a gsutil script from the Data Fusion pipeline?
Sometimes more complex logic needs to be put in place to decide which files should be loaded. I need to go through all the files in a location and then load only those with the current date or later. The date is in the file name as a suffix, e.g. customer_accounts_2021_06_15.csv.
Depending on where you are planning on writing the files to, you may be able to use the GCS Source plugin with the logicalStartTime macro in the Regex Path Filter field in order to filter on only files after a certain date. However, this may cause all your file data to be condensed down to record formats. If you want to retain each specific file in its original format, you may want to consider writing your own custom plugin.
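If you do go down the custom plugin route (or any scripted pre-step), the selection logic itself is small. A rough sketch in plain Python with the google-cloud-storage client (the bucket, prefix, and cutoff date are placeholder assumptions): it lists the objects, parses the date suffix from names like customer_accounts_2021_06_15.csv, and splits them into files to load and files to archive.

import re
from datetime import date
from google.cloud import storage

# Hypothetical bucket/prefix and cutoff date.
BUCKET = "my-bucket"
PREFIX = "incoming/"
CUTOFF = date.today()

# Matches a suffix such as customer_accounts_2021_06_15.csv
DATE_SUFFIX = re.compile(r"_(\d{4})_(\d{2})_(\d{2})\.csv$")

client = storage.Client()
to_load, to_archive = [], []
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    m = DATE_SUFFIX.search(blob.name)
    if m and date(int(m.group(1)), int(m.group(2)), int(m.group(3))) >= CUTOFF:
        to_load.append(blob.name)       # current date or later: load
    else:
        to_archive.append(blob.name)    # everything else: archive or skip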

Save compressed files into s3 and load in Athena

Hi, I am writing a program that writes to some files (with multiple processes at a time), like:
with gzip.open('filename.gz', 'a') as f:
    f.write(json.dumps(some dictionary) + '\n')
    f.flush()
After the writing finishes, I upload the files with:
s3.meta.client(filename, bucket, destination, filename without .gz)
Then I want to query the data from Athena. After MSCK REPAIR everything seems fine, but when I try to select data my rows are empty. Does anyone know what I am doing wrong?
EDIT: My mistake. I had forgotten to set the ContentType parameter to 'text/plain'.
Athena detects the file compression format by the appropriate file extension.
So if you upload a GZIP file but remove the '.gz' part (as I would guess from your "s3.meta.client(filename, bucket, destination, filename without .gz)" statement), the SerDe is not able to read the information.
If you rename your files to filename.gz, Athena should be able to read them.
I fixed the problem by first saving bigger chunks of the files locally and then gzipping them. I repeat the process, but appending to the gzipped file. I read that it is better to add bigger chunks of text than to write line by line.
For the upload I used boto3.transfer.upload_file with extra_args={'ContentEncoding': 'gzip', 'ContentType': 'text/plain'}.
I forgot to add ContentType the first time, so S3 saved the files differently and Athena gave me errors saying my JSON was not formatted correctly.
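For reference, a minimal upload sketch along those lines with the boto3 S3 client (the file, bucket, and key names are placeholders); keeping the .gz suffix on the key lets Athena detect the compression, and ExtraArgs sets the metadata described above:

import boto3

s3 = boto3.client("s3")

# Hypothetical local file, bucket, and key.
s3.upload_file(
    Filename="chunk_0001.json.gz",
    Bucket="my-athena-bucket",
    Key="logs/dt=2021-06-15/chunk_0001.json.gz",  # keep the .gz extension
    ExtraArgs={"ContentType": "text/plain", "ContentEncoding": "gzip"},
)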
I suggest you break the problem into several parts.
First, create a single JSON file that is not gzipped. Store it in Amazon S3, then use Athena to query it.
Once that works, manually gzip the file from the command-line (rather than programmatically), put the file in S3 and use Athena to query it.
If that works, use your code to programmatically gzip it, and try it again.
If that works with a single file, try it with multiple files.
All of the above can be tested with the same command in Athena -- you're simply substituting the source file.
This way, you'll know which part of the process is upsetting Athena without compounding the potential causes.

Appending to a file in a zip archive

I have written a zip class that uses functions and code from miniz to: open an archive, close an archive, open a file in the archive, close a file in the archive, and write to the currently open file in the archive.
Currently opening a file in an archive overwrites it if it already exists. I would like to know if it is possible to APPEND to a file within a zip archive that has already been closed?
I want to say that it is possible, but I would have to edit all the offsets in each of the other files' internal states and within the central directory. If it is possible, is this the right path to look into?
Note:
I deal with large files, so decompressing and compressing again is not ideal, and neither is copying files around. I would just like to "open" a file in the zip archive and continue writing compressed data to it.
I would just like to "open" a file in the zip archive to continue writing compressed data to it.
Compressed archives don't work like a file system or folder where you could change individual files. They keep e.g. checksums that need to apply to the whole archive.
So no, you can't do this in place; you have to unpack the compressed file, apply your changes, and compress everything again.
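For what it's worth, here is the unpack -> modify -> repack approach sketched in Python with the standard zipfile module (rather than miniz); the function and names are only illustrative:

import os
import shutil
import tempfile
import zipfile

def append_to_member(archive_path, member, data):
    # Extract everything to a temporary directory.
    tmp_dir = tempfile.mkdtemp()
    try:
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(tmp_dir)
        # Append to the extracted copy of the target member.
        with open(os.path.join(tmp_dir, member), "ab") as f:
            f.write(data)
        # Rebuild the whole archive from the modified tree.
        new_archive = archive_path + ".new"
        with zipfile.ZipFile(new_archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for root, _, files in os.walk(tmp_dir):
                for name in files:
                    full = os.path.join(root, name)
                    zf.write(full, os.path.relpath(full, tmp_dir))
        os.replace(new_archive, archive_path)
    finally:
        shutil.rmtree(tmp_dir)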

How to get information about ZIP files?

I am working on the ClamAV antivirus database.
ZMD is one of the ClamAV database files that stores information about malicious ZIP files.
I need to get this information from a ZIP file, if possible without using any component:
is it encrypted
normal (uncompressed) size
compressed size
CRC32
compression method
Please help me.
You can use unzip -l to list the contents or you can write your own zip format decoder to extract the information from the headers. The format is documented in the .ZIP File Format Specification.
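If a script is acceptable, Python's standard zipfile module exposes those header fields directly per entry; a small sketch (the archive name is a placeholder):

import zipfile

with zipfile.ZipFile("sample.zip") as zf:
    for info in zf.infolist():
        encrypted = bool(info.flag_bits & 0x1)  # bit 0 of the general purpose flags
        print(info.filename,
              info.file_size,            # normal (uncompressed) size
              info.compress_size,        # compressed size
              format(info.CRC, '08x'),   # CRC32
              info.compress_type,        # compression method (0 = stored, 8 = deflated)
              encrypted)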