AWS Glue crawling JSON lines data in S3 - amazon-web-services

I have this type of data in my S3:
{"version":"0","id":"c1d9e9a4-25a2-a0d8-2fa4-b062efec98c4","detail-type":"OneTypeee","source":"OneSource","account":"123456789","time":"2021-01-17T12:35:17Z","region":"eu-central-1","resources":[],"detail":{"Key1":"Value1"}}
{"version":"0","id":"c13879a4-2h32-a0d8-9m33-b03jsh3cxxj4","detail-type":"OtherType","source":"SomeMagicSource","account":"123456789","time":"2021-01-17T12:36:17Z","region":"eu-central-1","resources":[],"detail":{"Key2":"Value2", "Key22":"Value22"}}
{"version":"0","id":"gi442233-3y44a0d8-9m33-937rjd74jdddj","detail-type":"MoreTypes","source":"SomeMagicSource2","account":"123456789","time":"2021-01-17T12:45:17Z","region":"eu-central-1","resources":[],"detail":{"MagicKey":"MagicValue", "Foo":"Bar"}}
Please note, I have added new lines to make it more readable. In reality, Kinesis Firehose produces these batches with no newlines.
When I try to run an AWS Glue crawler on this type of data, it only crawls the first JSON line and that's it. I know this because when I run Athena SQL queries, I always get only one (first) result.
How do I make a glue crawler correctly crawl through this data and make a correct schema so I could query all of that data?

I wasn't able to run a crawler through JSON lines data, but simply specifying in the Glue Table Serde properties that the data is JSON worked for me. Glue automatically splits the JSON by newline and I can query the data in my Glue Jobs.
Here's what my table's properties look like. Additionally, my json lines data was compressed, so here you can ignore the compressionType property.

I had the same issue and for me the reason was that json records were being written to S3 bucket without next line character: \n.
Make sure your json records are written with \n appended at the end. In case of java, something like this:
PutRecordRequest request = new PutRecordRequest()
.withRecord(new Record().withData(ByteBuffer.wrap((json + "\n").getBytes())))
.withDeliveryStreamName(streamName);
amazonKinesis.putRecordAsync(request);

Related

How to suppress column headers in AWS Athena query result?

I'm running a SELECT Athena query on an S3 bucket manifest. I then want to use the results of that query, in .csv format, in an S3 Batch operation.
My query runs fine and I am able to access the .csv output via S3 Batch, but since the first row is actually column headers, S3 Batch to throws an unrecoverable error because it thinks that the manifest is now referring to multiple buckets.
How can I easily strip the column headers out of my results? I would prefer to just do it in SQL. The file size makes using standard unix tools prohibitive. I could use AWS Glue, but this seems like overkill for just suppressing headers in a SQL query.
Here's a hacky way to get around it
SELECT bucket as "my-bucket-name", key as "fakekey"
from your_athena_table
This will make your header look like the rest of the file which will not break the S3 Batch copy job. You will have just one failed record of fakekey

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in Glue Catalog then there is no need to run Glue Crawler. You can setup a job and use getSourceWithFormat() with recurse option set to true and paths pointing to the root folder (in your case it's ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, relational DB or a Glue Catalog.
Yes, Glue is a great tool for this!
Use a crawler to create a table in the glue data catalog (remember to set Create a single schema for each S3 path under Grouping behavior for S3 data when creating the crawler)
Read more about it here
Then you can use relationalize to flatten our your json structure, read more about that here
Json and AWS Glue may not be the best match. Since AWS Glue is based on hadoop, it inherits hadoop's "one-row-per-newline" restriction, so even if your data is in json, it has to be formatted with one json object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use csv instead of json.
Edit 2022-11-29: There does appear to be some tooling now for jsonl, which is the actual format that AWS expects, making this less of an automatic win for csv. I would say if your data is already in json format, it's probably smarter to convert it to jsonl than to convert to csv.

How handle schema changes in glue and get the expected output in csv?

I am trying to crawl some files having different sachems(Data compatible ) using AWS Glue.
As I read in the AWS documentation that Glue crawlers update the catalog tables for any change in the schema(add new columns and remove missing columns).
I have checked the "Update the table definition in the Data Catalog" and "Create a single schema for each S3 path" while creating the crawler.
Example:
let's say I have a file "File1.csv" as shown below:
name,age,loc
Ravi,12,Ind
Joe,32,US
Say I have another file "File2.csv" as shown below:
name,age,height
Jack,12,160
Jane,32,180
After crawlers run in the schema was updated as:
name,age,loc,height -This is as expcted
but When I tried to read the files using Athena or tried writing the content of both the files to csv using Glue ETL job,I have observed that:
the output looks like:
name,age,loc,height
Ravi,12,Ind,,
Joe,32,US,,
Jack,12,160,,
Jane,32,180,,
last two rows should have blank for loc as the second file didn't have loc column.
where as expected:
name,age,loc,height
Ravi,12,Ind,,
Joe,32,US,,
Jack,12,,160
Jane,32,,180
In short glue is trying to fill up the column in contiguous manner in the combined output.Is there any way I can get the expected output?
I got the expected output with Parquet files. Initially, I was using CSV, but csv deserializer doesn't understand how to put the elements into the correct position when schema changes.
Changing the individual csvs into parquet and then crawling them one after another helped me in incorporating the changing schema.

Athena returns empty results from Firehose > Glue > S3 parquet setup

I have set up a Kinesis Firehose that passes data through glue which compresses to and transforms JSON to parquet and stores it in an S3 bucket. The transformation is successful and I can query the output file normally with apacheDrill. I cannot however get Athena to function. Doing a preview table (select * from s3data limit 10) I get results with the proper headers for the columns but the data is empty.
Steps I have taken:
I already added the newline to my source: JSON.stringify(event) + '\n';
Downloaded the parquet and queried successfully with apacheDrill
Glue puts the parquet file in YY/MM/DD/HH folders. I have tried moving the parquet to the root folder and I get the same empty results.
The end goal is to get data eventaully into Quicksights, so if I'm going about this wrong let me know.
What am I missing?

AWS Glue custom crawler based on file name

So what I am trying to do is to crawl data on S3 bucket with AWS Glue. Data stored as nested json and path looks like this:
s3://my-bucket/some_id/some_subfolder/datetime.json
When running default crawler (no custom classifiers) it does partition it based on path and deserializes json as expected, however, I would like to get a timestamp from the file name as well in a separate field. For now Crawler omits it.
For example if I run crawler on:
s3://my-bucket/10001/fromage/2017-10-10.json
I get table schema like this:
Partition 1: 10001
Partition 2: fromage
Array: JSON data
I did try to add custom classifier based on Grok pattern:
%{INT:id}/%{WORD:source}/%{TIMESTAMP_ISO8601:timestamp}
However, whenever I re-run crawler it skips custom classifier and uses default JSON one. As a solution obviously I could append file name to the JSON itself before running a crawler, but was wondering if I can avoid this step?
Classifiers only analyze the data within the file, not the filename itself. What you want to do is not possible today. If you can change the path where the files land, you could add the date as another partition:
s3://my-bucket/id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json