I am trying to use the FTP method to download a csv.gz file. I would also like to unzip it with the ZIP access method. Is there a way of combining the two in one FILENAME statement?
FILENAME in ZIP "1763.csv.gz" GZIP LRECL=80 ;
Right now I have a local copy of the file but would instead like to pull it down through FTP.
As far as I know, this isn't doable in a single FILENAME statement; you have to have two, one for the FTP and one for the ZIP.
However, you can stream the file to a (temporary) local file, and then access that same file with FILENAME ZIP. One paper that talks about this method is this one from SGF 2019; there are several others that take the same basic approach:
1. FILENAME FTP pointing to your FTP file.
2. A DATA step to read in from the FTP fileref and write out using RECFM=N to a (temporary) local file.
3. FILENAME ZIP to read in the file created in step 2 (see the sketch after this list).
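For illustration only, here is the same two-step pattern sketched in Python rather than SAS (the host, path, and file names below are invented): stream the remote .gz over FTP to a temporary local file in binary mode, then read that local copy as gzip.

import ftplib
import gzip
import tempfile

# Steps 1-2: stream the remote file in binary mode to a temporary local file
with tempfile.NamedTemporaryFile(suffix=".csv.gz", delete=False) as tmp:
    with ftplib.FTP("ftp.example.com") as ftp:
        ftp.login()  # anonymous login; pass user/password if the server requires it
        ftp.retrbinary("RETR /pub/1763.csv.gz", tmp.write)
    local_path = tmp.name

# Step 3: read the local copy as gzip
with gzip.open(local_path, "rt") as f:
    for line in f:
        print(line.rstrip())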
I am using the Dataflow templates API to decompress a zipped file I have in a Google Storage bucket. This zip file in turn has multiple folders and files. The Dataflow API decompresses my zip file, but writes the output into a plain text file. What I want is only the unzipping of my input file, extracting all the contents within. How can I do this?
My zip contains the following hierarchy:
file.zip
|
|_folder1
| |
| |_file1
| |_file2
| |_file3
|_file
Thanks in advance!
The pipeline only prints the files that failed into a plain text file. You can see the details of the process here.
Are you sure that your files are readable and correctly decompressed?
I was able to compress and decompress files using Dataflow from the console.
In the settings it says: Bulk Decompress Cloud Storage Files template
Required parameters: the input filepattern to read from (e.g., gs://bucket-name/uncompressed/*.gz).
So the compressing/decompressing works at the level of files, by matching the pattern. I do not know how you compressed or decompressed at the level of folders. When I try to input a folder name for the input parameter, I get: "No files matched spec Error."
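If the Bulk Decompress template cannot do this, one workaround is to unzip the archive yourself and re-upload each member under its original path. A minimal sketch, assuming the google-cloud-storage client library and placeholder bucket/object names:

import io
import zipfile
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('bucket-name')

# Download the archive into memory (use a temp file for very large zips)
data = bucket.blob('file.zip').download_as_bytes()

with zipfile.ZipFile(io.BytesIO(data)) as zf:
    for member in zf.namelist():
        if member.endswith('/'):  # skip directory entries
            continue
        # Re-upload each file under its original path inside the zip
        bucket.blob('unzipped/' + member).upload_from_string(zf.read(member))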
Hi, I am writing a program that will write to some files (with multiple processes at a time) like:
import gzip
import json

with gzip.open('filename.gz', 'at') as f:  # 'at' = append in text mode
    f.write(json.dumps(some_dict) + '\n')
    f.flush()
After writing finishes I upload files with:
s3.meta.client.upload_file(filename, bucket, destination + filename_without_gz)
Then I want to query the data from Athena; after MSCK REPAIR everything seems fine, but when I try to select the data my rows are empty. Does anyone know what I am doing wrong?
EDIT: My mistake. I forgot to set the ContentType parameter to 'text/plain'.
Athena detects the file compression format from the file extension.
So if you upload a GZIP file but remove the '.gz' part (as I would guess from your upload_file(..., filename_without_gz) statement), the SerDe is not able to read the information.
If you rename your files to filename.gz, Athena should be able to read them.
I fixed the problem by first saving bigger chunks of the files locally and then gzipping them. I repeat the process, but appending to the gzipped file. I read that it is better to add bigger chunks of text than to write line by line.
For the upload I used boto3.transfer.upload_file with extra_args={'ContentEncoding': 'gzip', 'ContentType': 'text/plain'}.
I forgot to add ContentType the first time, so S3 saved the files differently and Athena gave me errors saying my JSON was not formatted correctly.
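For reference, here is a minimal sketch of that upload using the boto3 client (bucket and key names are placeholders); note the key keeps the .gz suffix so Athena can detect the compression:

import boto3

s3 = boto3.client('s3')
s3.upload_file(
    'filename.gz', 'my-bucket', 'prefix/filename.gz',  # keep the .gz suffix
    ExtraArgs={'ContentEncoding': 'gzip', 'ContentType': 'text/plain'},
)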
I suggest you break the problem into several parts.
First, create a single JSON file that is not gzipped. Store it in Amazon S3, then use Athena to query it.
Once that works, manually gzip the file from the command-line (rather than programmatically), put the file in S3 and use Athena to query it.
If that works, use your code to programmatically gzip it, and try it again.
If that works with a single file, try it with multiple files.
All of the above can be tested with the same command in Athena -- you're simply substituting the source file.
This way, you'll know which part of the process is upsetting Athena without compounding the potential causes.
I'm using zlib to compress a stream of text to a gzip (.gz) file, and it's working well. However, it seems to name the file inside the gzip exactly the same as my .gz file name.
I'm wondering is there any way to change the naming of the file that's been compressed?
I would rather it name like the following:
/myfile.gz/myfile
Where myfile is the document that's inside of the compressed gzip file, and myfile.gz is the gzipped file itself.
Is there any way to control these namings?
I think what you're saying is that when you decompress whatever.gz, you get a file named whatever in the current directory. That is the default behavior of the gzip utility, and it is not affected by how the gzip file is made. The contents of the gzip file cannot direct the decompressed data to some other directory. (If it could, that would be a security issue.)
It is possible to store a file name in the gzip header, in which case gzip -dN whatever.gz will decompress to the name in the header as opposed to whatever. However, it will be a file in the current directory using just the base name from the header. Any path information in the file name in the gzip header is ignored.
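For example, here is a minimal sketch in Python (its gzip module wraps zlib; the file names are placeholders) that stores a name in the header's FNAME field:

import gzip

with open('myfile.gz', 'wb') as out:
    # 'filename' is written into the gzip header's FNAME field
    with gzip.GzipFile(filename='myfile', mode='wb', fileobj=out) as gz:
        gz.write(b'some text\n')

# gzip -dN myfile.gz then restores the data to a file named "myfile"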
I am working on the ClamAV antivirus database.
ZMD is one of the ClamAV database files; it stores information about malicious zip files.
I need to get this information from a zip file, if possible without using any components:
is it encrypted
normal size
compressed size
CRC32
compression method
Please help me.
You can use unzip -l to list the contents, or you can write your own zip format decoder to extract the information from the headers. The format is documented in the .ZIP File Format Specification.
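If writing your own decoder is too much, Python's standard zipfile module (no third-party components) already exposes exactly those central directory fields; the archive name below is a placeholder:

import zipfile

with zipfile.ZipFile('archive.zip') as zf:
    for info in zf.infolist():
        print(info.filename)
        print('  encrypted:         ', bool(info.flag_bits & 0x1))
        print('  normal size:       ', info.file_size)
        print('  compressed size:   ', info.compress_size)
        print('  CRC32:             ', hex(info.CRC))
        print('  compression method:', info.compress_type)  # 0=stored, 8=deflated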
I searched for this topic, but I didn't find any relevant clues.
Can anyone give me some tips or demo code that can solve the problem?
Thanks in advance.
---FYI---
What I want to do here is to zip files and upload them to a remote PC.
I think it'll take the following steps:
a) initialize a zipped file header, send it to the remote PC, and save that zipped file header there.
b) open a file to read a portion of its data and zip the file data locally.
c) send the zipped data through a pipe (TCP or UDP, for example) to the remote PC (see the sketch after this list).
d) save the data from the pipe, which is zipped, on the remote PC.
e) if there are multiple files, go back to b).
f) when all files are zipped and transferred to the remote PC, close the zipped file.
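Here is a minimal sketch of steps b) through d) in Python, assuming a TCP connection and placeholder host/port/file names; the receiver is assumed to simply write the incoming bytes to disk:

import socket
import zlib

comp = zlib.compressobj()
with socket.create_connection(('remote-pc', 9000)) as sock:
    with open('file1', 'rb') as f:
        while chunk := f.read(64 * 1024):       # b) read a portion of the file data
            sock.sendall(comp.compress(chunk))  # b)+c) zip locally, send through the pipe
    sock.sendall(comp.flush())                  # finish the zlib stream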
Two questions here:
a) compress/decompress
b) File format
Thanks guys!
zlib compresses a single stream. If you want to zip multiple files, you need to do one of two things:
Define a format (or use an existing format) that combines multiple files into one stream, then zip that; or
Zip each file individually, then use some format to combine those into one output file.
If you take the first option, using the existing tar format to combine the files, you will be producing a .tar.gz file, which can be extracted with standard tools, so this is a good way to go. You can use libtar to generate a tar archive.
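In Python, the first option is a one-liner with the standard tarfile module (file names are placeholders), which combines the files with tar and gzip-compresses the stream:

import tarfile

with tarfile.open('bundle.tar.gz', 'w:gz') as tar:  # 'w:gz' = tar combined, gzip-compressed
    tar.add('file1.txt')
    tar.add('folder1')  # directories are added recursively

# Extract with standard tools: tar -xzf bundle.tar.gz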
I have built a wrapper around minizip, adding some features that I needed and making it nicer to use. It uses the latest C++11 and is developed using Visual Studio 2013 (it should be portable, but I haven't tested it on Unix).
There's a full description here: https://github.com/sebastiandev/zipper
but it is as simple as it gets:
Zipper zipper("ziptest.zip");
zipper.add("somefile.txt");
zipper.add("myFolder");
zipper.close();
You can zip entire folders, streams, vectors, etc. A nice feature is that everything can be done entirely in memory.