AWS S3 file format - amazon-web-services

While writing files to S3 through a Glue job, how can I give a custom file name, including a timestamp (for example, file-name_yyyy-mm-dd_hh-mm-ss)?
By default, Glue writes the output files with names in the format part-0**.

Since Glue uses Spark in the background, it is not possible to change the file names directly.
It is possible to rename the files after you have written them to S3, though. This answer provides a simple code snippet that should work.

Related

Is it possible to unload data in AWS Athena to a single file?

The doc states that
UNLOAD results are written to multiple files in parallel.
I guess this is more efficient for both read and write, so unloading to a single file doesn't make sense. But, if for some reason the end user wants the output as a single file, is it possible?
Running a SELECT query in Athena produces a single result file in Amazon S3 in uncompressed CSV format; this is the default behaviour.
If your query is expected to produce a large result set, significant time is spent writing the results as one single file to Amazon S3. With UNLOAD, you can split the results into multiple files in Amazon S3, which reduces the time spent in the writing phase and hence improves performance, and you can even use compressed formats such as Parquet.
What you are trying to do is not what UNLOAD is meant for. One solution would be to write some kind of post-processor that merges the files after the write is finished, perhaps using a Lambda function triggered on S3 writes.
Assuming your UNLOAD query uses TEXTFILE format and gzip compression, like:
UNLOAD (SELECT * FROM my_table)
TO 's3://your_bucket/your_path/'
WITH (
    format = 'TEXTFILE',
    compression = 'gzip',
    field_delimiter = '\t'
)
A simple solution would be the following:
aws s3 cp --recursive s3://your_bucket/your_path/ .
gzip -d *
cat * > your_file.csv
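If you'd rather do the merge step in code than in the shell, here is a local-file sketch of the same decompress-and-concatenate idea (it assumes the parts have already been downloaded; the file names are hypothetical):

```python
import glob
import gzip


def merge_gzip_parts(pattern: str, out_path: str) -> int:
    """Decompress every gzip file matching `pattern` and concatenate
    the contents into one output file. Returns the number of parts
    merged. Sorting keeps the merge order deterministic."""
    parts = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for part in parts:
            with gzip.open(part, "rb") as f:
                out.write(f.read())
    return len(parts)
```

For example, merge_gzip_parts("downloads/*.gz", "your_file.csv") after the aws s3 cp step.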

How can we rename the generated/output parquet file in PYSPARK or Dynamic Frames in AWS Glue?

This is a generated output parquet file in S3 from AWS Glue with PySpark; we want to give it a specific name like abcd.parquet instead of auto-generated characters. Any help would be great. Thanks!
This is unfortunately not possible. Glue uses Spark under the hood, which assigns those names to your files.
The only thing you can do is rename the file after writing.

process non csv, json and parquet files from s3 using glue

Little disclaimer: I have never used Glue.
I have files stored in S3 that I want to process using Glue, but from what I saw when I tried to start a new job from a plain graph, the only source options were CSV, JSON, and Parquet files from S3, and my files are none of these types. Is there any way to process those files using Glue, or do I need to use another AWS service?
I can run a bash command to turn those files into JSON, but that command is something I need to download to a machine. Is there any way I can do that and then use Glue on the resulting JSON?
Thanks.

No extension while using from_options' in DynamicFrameWriter in AWS Glue spark context

I am new to AWS. I am writing an **AWS Glue job** for some transformation, and I could do it. After the transformation I used **'from_options' in the DynamicFrameWriter class** to write the data frame as a CSV file, but the file was copied to S3 without any extension. Also, is there any way to rename the copied file, using DynamicFrameWriter or anything else? Please help.
Step 1: Triggered an AWS Glue job for transforming files in S3 to an RDS instance.
Step 2: On successful job completion, transfer the contents of the file to another S3 bucket using 'from_options' in the DynamicFrameWriter class. But the file doesn't have any extension.
You have to set the format of the file you are writing,
e.g. format="csv".
This should set the .csv file extension. You cannot, however, choose the name of the file you write. The only option you have is some sort of S3 operation afterwards where you change the key name of the file.
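As a rough sketch, the write call with the format set might look like this in a PySpark Glue job (the function name and path are hypothetical; the glue_context and dynamic frame come from your job):

```python
def write_csv(glue_context, dyf, s3_path):
    """Hypothetical sketch: write a DynamicFrame to S3 as CSV.
    format="csv" gives the files a .csv extension, but the part-*
    base name is still chosen by Spark and cannot be set here."""
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": s3_path},
        format="csv",
    )
```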

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue Crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case it's ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, a relational DB, or a Glue Catalog.
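In a PySpark job, that read might be sketched roughly as follows (this uses create_dynamic_frame_from_options, the Python-side equivalent of getSourceWithFormat(); the paths come from the question, the helper itself is hypothetical):

```python
def read_all_sales(glue_context):
    """Hypothetical sketch: read every sales.json under the bucket
    root into one DynamicFrame. "recurse": True makes Glue descend
    into all the store-*/yyyymmdd/ subfolders."""
    return glue_context.create_dynamic_frame_from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/"], "recurse": True},
        format="json",
    )
```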
Yes, Glue is a great tool for this!
Use a crawler to create a table in the glue data catalog (remember to set Create a single schema for each S3 path under Grouping behavior for S3 data when creating the crawler)
Read more about it here
Then you can use relationalize to flatten out your JSON structure; read more about that here
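A minimal sketch of that relationalize step (the root table name and staging path here are hypothetical):

```python
def flatten_sales(dyf, staging_path):
    """Hypothetical sketch: relationalize() splits nested JSON into a
    DynamicFrameCollection of flat tables -- a "root" table plus one
    table per nested array, joined by generated keys."""
    return dyf.relationalize("root", staging_path)
```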
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one-row-per-newline" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There now appears to be some tooling for JSONL, which is the actual format AWS expects, making this less of an automatic win for CSV. I would say that if your data is already in JSON, it's probably smarter to convert it to JSONL than to convert it to CSV.
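For completeness, converting a parsed JSON array to JSONL is a one-liner; a minimal sketch:

```python
import json


def to_jsonl(records):
    """Convert a list of dicts (a parsed JSON array) into JSON Lines:
    one compact JSON object per line, the layout Glue/Athena expect."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)
```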