Save compressed files into S3 and load in Athena

Hi, I am writing a program that writes to some files (with multiple processes at a time) like:
with gzip.open('filename.gz', 'a') as f:
    f.write(json.dumps(some_dictionary) + '\n')
    f.flush()
After writing finishes, I upload the files with something like:
s3.meta.client.upload_file(filename, bucket, destination + filename_without_gz)
Then I want to query the data from Athena. After MSCK REPAIR everything seems fine, but when I try to select data my rows are empty. Does anyone know what I am doing wrong?
EDIT: My mistake. I had forgotten to set the ContentType parameter to 'text/plain'.

Athena detects the file compression format from the file extension.
So if you upload a GZIP file but remove the '.gz' part from the object key (as I would guess from your upload_file call), the SerDe is not able to read the information.
If you rename your files to filename.gz, Athena should be able to read your files.
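For example, a minimal upload sketch (file, bucket and key names here are just placeholders) that keeps the .gz suffix on the object key:
import boto3

s3 = boto3.resource('s3')

# Keep the .gz suffix in the key so Athena recognizes the compression
s3.meta.client.upload_file(
    'part-0001.json.gz',                   # local gzipped file (placeholder name)
    'my-bucket',                           # placeholder bucket
    'athena/my_table/part-0001.json.gz'    # object key keeps the .gz suffix
)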

I fixed the problem by first saving bigger chunks of the data locally and then gzipping them. I repeat the process, appending to the gzipped file. I read that it is better to add bigger chunks of text than to write line by line.
For the upload I used boto3.s3.transfer upload_file with extra_args={'ContentEncoding': 'gzip', 'ContentType': 'text/plain'}.
I forgot to set ContentType the first time, so S3 stored the objects differently and Athena gave me errors saying my JSON was not formatted correctly.
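For reference, a sketch of that approach (file, bucket and key names are placeholders, and records stands for the dictionaries being dumped):
import gzip
import json
import boto3
from boto3.s3.transfer import S3Transfer

# Append a bigger chunk of records to a gzipped newline-delimited JSON file
with gzip.open('chunk-0001.json.gz', 'at') as f:        # placeholder file name
    for record in records:
        f.write(json.dumps(record) + '\n')

# Upload with the metadata that tells S3 how the object is encoded
transfer = S3Transfer(boto3.client('s3'))
transfer.upload_file(
    'chunk-0001.json.gz',
    'my-bucket',                                        # placeholder bucket
    'athena/my_table/chunk-0001.json.gz',
    extra_args={'ContentEncoding': 'gzip', 'ContentType': 'text/plain'},
)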

I suggest you break the problem into several parts.
First, create a single JSON file that is not gzipped. Store it in Amazon S3, then use Athena to query it.
Once that works, manually gzip the file from the command-line (rather than programmatically), put the file in S3 and use Athena to query it.
If that works, use your code to programmatically gzip it, and try it again.
If that works with a single file, try it with multiple files.
All of the above can be tested with the same command in Athena -- you're simply substituting the source file.
This way, you'll know which part of the process is upsetting Athena without compounding the potential causes.

Related

awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory. So I have created a bunch of small files in S3 and am writing a script that reads these files and merges them. I am using AWS Data Wrangler to do this.
My code is as follows:
try:
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file, and I can't remove chunked=True because otherwise my program fails with an OOM error.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
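A minimal sketch of that suggestion (the output path is a placeholder); without dataset mode the frame is written as a single object at exactly the given path:
import awswrangler as wr

# dataset=False (the default) writes one Parquet object at exactly this path
wr.s3.to_parquet(df=df, path="s3://my-bucket/output/merged.parquet")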
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for Parquet in general. As per this SO post it's not possible in a local folder, let alone S3. To add to this, Parquet is compressed and I don't think it would be easy to add a line to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one Parquet file unless you get a computer with enough RAM.
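A rough sketch of that workaround, reusing the chunked read from the question and buffering chunks until they reach a target row count before rewriting (the row target is arbitrary):
import awswrangler as wr
import pandas as pd

TARGET_ROWS = 1_000_000       # rough size of each combined file; tune to available memory
buffer, buffered_rows = [], 0

# Read the small files in chunks and rewrite them as fewer, bigger files
for chunk in wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True):
    buffer.append(chunk)
    buffered_rows += len(chunk)
    if buffered_rows >= TARGET_ROWS:
        wr.s3.to_parquet(df=pd.concat(buffer), dataset=True, path=target_path, mode="append")
        buffer, buffered_rows = [], 0

if buffer:                    # flush whatever is left
    wr.s3.to_parquet(df=pd.concat(buffer), dataset=True, path=target_path, mode="append")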

Load multiple files, check file name, archive a file

In a Data Fusion pipeline:
How do I read all the file names from a bucket and load some based on the file name, while archiving the others?
Is it possible to run a gsutil script from the Data Fusion pipeline?
Sometimes more complex logic needs to be put in place to decide which files should be loaded. I need to go through all the files in a location and then load only those with the current date or later. The date is in the file name as a suffix, e.g. customer_accounts_2021_06_15.csv
Depending on where you are planning to write the files to, you may be able to use the GCS Source plugin with the logicalStartTime macro in the Regex Path Filter field in order to filter to only files after a certain date. However, this may cause all your file data to be condensed down to record formats. If you want to retain each specific file in its original format, you may want to consider writing your own custom plugin.
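If you do end up needing custom logic, a minimal sketch of the list/filter/archive idea with the google-cloud-storage client (bucket name and prefixes are placeholders, and the date format matches the file names above):
from datetime import date
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')                  # placeholder bucket name
today = date.today().strftime('%Y_%m_%d')

for blob in client.list_blobs(bucket, prefix='incoming/'):
    # File names look like customer_accounts_2021_06_15.csv
    name_date = blob.name.rsplit('.', 1)[0][-10:]    # trailing YYYY_MM_DD
    if name_date >= today:
        pass                                         # load this file (hand it to the pipeline)
    else:
        # Archive everything older: copy to an archive prefix, then delete the original
        bucket.copy_blob(blob, bucket, 'archive/' + blob.name)
        blob.delete()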

Concatenate 1000 CSV files directly in Google Cloud Storage, without duplicated headers?

Is it possible to concatenate 1000 CSV files that each have a header into one file with no duplicated header, directly in Google Cloud Storage? I could easily do this by downloading the files to my local hard drive, but I would prefer to do it natively in Cloud Storage.
They all have the same columns and a header row.
I wrote an article about handling CSV files with BigQuery. To avoid several files, and if the volume is less than 1 GB, the recommended way is the following:
Create a temporary table in BigQuery with all your CSV files.
Use the Export API (not the export function).
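For reference, a minimal sketch of those two steps with the BigQuery Python client (project, dataset, table and bucket names are placeholders); a single-file export only works while the result stays under roughly 1 GB:
from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my-project.tmp_dataset.csv_merge'     # placeholder temporary table

# 1. Load every CSV into one temporary table; the header row is skipped per file
load_job = client.load_table_from_uri(
    'gs://my-bucket/csv/*.csv',
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
load_job.result()

# 2. Export the table back to GCS as a single CSV (one header row)
extract_job = client.extract_table(table_id, 'gs://my-bucket/merged/all_rows.csv')
extract_job.result()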
Let me know if you need more guidance.
The problem with most solutions is that you still end up with a large number of split files, from which you then have to strip the headers, join them, etc.
Any method of avoiding multiple files also tends to be quite a lot of extra work.
It gets to be quite a hassle, especially when BigQuery spits out 3,500 split gzipped CSV files.
I needed a simple method that could be automated from a batch file.
I therefore wrote a CSV merge tool (sorry, Windows only though) to solve exactly this problem.
https://github.com/tcwicks/DataUtilities
Download latest release, unzip and use.
I also wrote an article with scenario and usage examples:
https://medium.com/@TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826
Hope it is of use to someone.
P.S. I recommend tab-delimited over CSV as it tends to have fewer data issues.

How should I post a file to AWS Lambda function, process it, and return a file to the client?

I'm using serverless-http to make an express endpoint on AWS Lambda - pretty simple in general. The flow is basically:
POST a zip file via a multipart form to my endpoint
Unzip the file (which contains a bunch of excel files)
Merge the files into a single Excel file
res.sendFile(file) the file back to the user
I'm not stuck on this flow 100%, but that's the gist of what I'm trying to do.
Lambda functions SHOULD give me access to /tmp for storage, so I've tried messing around with Multer to store files there and then read the contents. I've also tried the decompress-zip library, and it seems like the files never "work". I've even tried just uploading an image and immediately sending it back. It sends back a file called incoming.[extension], but it's always corrupt. Am I missing something? Is there a better way to do this?
Typically when working with files the approach is to use S3 as the storage, and there are a few reasons for it, but one of the most important is the fact that Lambda has an event payload size limit of 6 MB, so you can't easily POST a huge file directly to it.
If your zipped Excel files are always going to be smaller than that, then you are safe in that regard. If not, then you should look into a different flow, maybe something using AWS Step Functions with Lambda and S3.
Concerning your issue with unzipping the file, I have personally used and can recommend adm-zip, which would look something like this:
// unzip and extract file entries
var AdmZip = require('adm-zip');

var zip = new AdmZip(rawZipData);
var zipEntries = zip.getEntries();
console.log("Zip contents : " + zipEntries.toString());
zipEntries.forEach(function (entry) {
    // read each entry's contents as a UTF-8 string
    var fileContent = entry.getData().toString("utf8");
});

AWS S3: distributed concatenation of tens of millions of JSON files in an S3 bucket

I have an S3 bucket with tens of millions of relatively small JSON files, each less than 10 KB.
To analyze them, I would like to merge them into a small number of files, each having one json per line (or some other separator), and several thousands of such lines.
This would allow me to more easily (and performantly) use all kind of big data tools out there.
Now, it is clear to me this cannot be done with one command or function call; rather, a distributed solution is needed because of the number of files involved.
The question is whether there is something ready and packaged, or whether I must roll my own solution.
I don't know of anything out there that can do this out of the box, but you can pretty easily do it yourself. The solution also depends a lot on how fast you need to get this done.
Two suggestions:
1) List all the files, split the list, download sections, merge, and re-upload (a rough sketch of this approach is shown after the note below).
2) List all the files, then go through them one at a time, read/download each one and write it to a Kinesis stream. Configure Kinesis to dump the data to S3 via Kinesis Firehose.
In both scenarios the tricky bit is going to be handling failures and ensuring you don't get the data multiple times.
For completeness, if the files were larger (>5 MB) you could also leverage UploadPartCopy (http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html), which would allow you to merge files in S3 directly without having to download them.
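A minimal single-worker sketch of option 1 with boto3 (bucket, prefix and batch size are placeholders); the distributed version would shard the object listing across many workers:
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-bucket'              # placeholder
PREFIX = 'small-json/'            # placeholder
BATCH_SIZE = 5000                 # how many small files go into each merged object

batch, part = [], 0
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get('Contents', []):
        body = s3.get_object(Bucket=BUCKET, Key=obj['Key'])['Body'].read()
        batch.append(body.decode('utf-8').strip())      # one JSON document per file
        if len(batch) >= BATCH_SIZE:
            part += 1
            s3.put_object(Bucket=BUCKET,
                          Key=f'merged/part-{part:05d}.json',
                          Body=('\n'.join(batch) + '\n').encode('utf-8'))
            batch = []

if batch:                          # flush the final partial batch
    part += 1
    s3.put_object(Bucket=BUCKET,
                  Key=f'merged/part-{part:05d}.json',
                  Body=('\n'.join(batch) + '\n').encode('utf-8'))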
Assuming each JSON file is one line only, I would do:
cat * >> bigfile
This will concatenate all files in the directory into a new file, bigfile.
You can now read bigfile one line at a time, JSON-decode the line, and do something interesting with it.
If your JSON files are formatted for readability, then you will first need to combine all the lines in each file into one line.
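The reading side is then something like:
import json

# Read the merged newline-delimited file one record at a time
with open('bigfile') as f:
    for line in f:
        record = json.loads(line)
        # ... do something interesting with record ...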