How to handle schema changes in AWS Glue and get the expected output in CSV? - amazon-web-services

I am trying to crawl some files that have different (data-compatible) schemas using AWS Glue.
According to the AWS documentation, Glue crawlers update the catalog tables for any change in the schema (adding new columns and removing missing columns).
I checked "Update the table definition in the Data Catalog" and "Create a single schema for each S3 path" when creating the crawler.
Example:
Let's say I have a file "File1.csv" as shown below:
name,age,loc
Ravi,12,Ind
Joe,32,US
Say I have another file "File2.csv" as shown below:
name,age,height
Jack,12,160
Jane,32,180
After the crawler ran, the schema was updated to:
name,age,loc,height - this is as expected
But when I read the files using Athena, or wrote the content of both files to CSV using a Glue ETL job,
the output looked like this:
name,age,loc,height
Ravi,12,Ind,,
Joe,32,US,,
Jack,12,160,,
Jane,32,180,,
The last two rows should have a blank loc, because the second file has no loc column.
The expected output is:
name,age,loc,height
Ravi,12,Ind,,
Joe,32,US,,
Jack,12,,160
Jane,32,,180
In short, Glue fills the columns contiguously in the combined output. Is there any way I can get the expected output?

I got the expected output with Parquet files. Initially I was using CSV, but the CSV deserializer reads fields by position rather than by name, so it cannot put the values into the correct columns when the schema changes.
Converting the individual CSVs to Parquet and then crawling them one after another let me incorporate the changing schema.
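To illustrate what "aligning by column name" means, here is a minimal local sketch using only the Python standard library. It is not a Glue or Athena API; the function name merge_csv_by_header is made up for this example, and it assumes each file is small enough to hold in memory.

```python
import csv
import io

def merge_csv_by_header(sources):
    """Merge CSV texts by column name, leaving missing columns blank."""
    readers = [csv.DictReader(io.StringIO(text)) for text in sources]
    # Build the union of all headers, in first-seen order.
    columns = []
    for reader in readers:
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
    out = io.StringIO()
    # restval="" writes an empty field for any column a row does not have.
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    for reader in readers:
        for row in reader:
            writer.writerow(row)
    return out.getvalue()

file1 = "name,age,loc\nRavi,12,Ind\nJoe,32,US\n"
file2 = "name,age,height\nJack,12,160\nJane,32,180\n"
merged = merge_csv_by_header([file1, file2])
print(merged)
```

With the two sample files from the question, Jack's row comes out as Jack,12,,160: the height lands in the fourth column because each value is looked up by its header, which is roughly what a name-based format such as Parquet gives you.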

Related

Athena query error HIVE_BAD_DATA: Not valid Parquet file (.csv / .metadata)

I'm creating an app that works with AWS Athena on compressed Parquet (SNAPPY) data.
It works almost fine. However, after every query execution, two files (a .csv and a .metadata file) get uploaded to the S3_OUTPUT_BUCKET, as they should.
These two files break the execution of the next query.
I get the following error:
HIVE_BAD_DATA: Not valid Parquet file: s3://MY_OUTPUT_BUCKET/logs/QUERY_NAME/2022/08/07/tables/894a1d10-0c1d-4de1-9e61-13b2b0f79e40.metadata expected magic number: PAR1 got: HP
I need to manually delete those files for the next query to work.
Any suggestions on how to make this work?
(I know I cannot exclude those files with a regex etc.. but I don't want to delete the files manually for the app to work)
I read everything about the output files but it didn't help. ( Working with query results, recent queries, and output files )
Any help is appreciated.
When setting up Athena, you specify where the .metadata and .csv files from each query execution are written. This location must be a different folder from the table location.
Go to Athena Query Editor > Settings > Manage
and edit Query Result Location to point to a different S3 bucket than the table, or to a different folder within the same bucket.
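If the app sets the result location itself, a quick sanity check can catch this misconfiguration before any query runs. The sketch below is plain Python with no AWS calls; the function names and the example bucket paths are hypothetical.

```python
from urllib.parse import urlparse

def _split(s3_uri):
    """Return (bucket, prefix) where prefix is '' or ends with '/'."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    # A trailing '/' keeps 'logs' from matching 'logs2'.
    prefix = parsed.path.strip("/")
    return parsed.netloc, (prefix + "/") if prefix else ""

def results_inside_table(table_location, result_location):
    """True if query results would be written under the table's own location."""
    t_bucket, t_prefix = _split(table_location)
    r_bucket, r_prefix = _split(result_location)
    return t_bucket == r_bucket and r_prefix.startswith(t_prefix)

# Hypothetical locations for illustration only.
print(results_inside_table("s3://my-bucket/logs/", "s3://my-bucket/logs/QUERY_NAME/"))
print(results_inside_table("s3://my-bucket/logs/", "s3://my-bucket/athena-results/"))
```

The first call reports an overlap (the results would be read back as table data), the second does not.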

AWS Glue crawling JSON lines data in S3

I have this type of data in my S3:
{"version":"0","id":"c1d9e9a4-25a2-a0d8-2fa4-b062efec98c4","detail-type":"OneTypeee","source":"OneSource","account":"123456789","time":"2021-01-17T12:35:17Z","region":"eu-central-1","resources":[],"detail":{"Key1":"Value1"}}
{"version":"0","id":"c13879a4-2h32-a0d8-9m33-b03jsh3cxxj4","detail-type":"OtherType","source":"SomeMagicSource","account":"123456789","time":"2021-01-17T12:36:17Z","region":"eu-central-1","resources":[],"detail":{"Key2":"Value2", "Key22":"Value22"}}
{"version":"0","id":"gi442233-3y44a0d8-9m33-937rjd74jdddj","detail-type":"MoreTypes","source":"SomeMagicSource2","account":"123456789","time":"2021-01-17T12:45:17Z","region":"eu-central-1","resources":[],"detail":{"MagicKey":"MagicValue", "Foo":"Bar"}}
Please note, I have added new lines to make it more readable. In reality, Kinesis Firehose produces these batches with no newlines.
When I try to run an AWS Glue crawler on this type of data, it only crawls the first JSON line and that's it. I know this because when I run Athena SQL queries, I always get only one (first) result.
How do I make a Glue crawler crawl through this data correctly and produce a correct schema, so that I can query all of the data?
I wasn't able to run a crawler through JSON lines data, but simply specifying in the Glue Table Serde properties that the data is JSON worked for me. Glue automatically splits the JSON by newline and I can query the data in my Glue Jobs.
Here's what my table's properties look like. Additionally, my json lines data was compressed, so here you can ignore the compressionType property.
I had the same issue, and for me the reason was that the JSON records were being written to the S3 bucket without a trailing newline character (\n).
Make sure your JSON records are written with \n appended at the end. In Java, something like this:
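// Append "\n" so that each JSON record ends up on its own line in the S3 object.
PutRecordRequest request = new PutRecordRequest()
    .withRecord(new Record().withData(ByteBuffer.wrap((json + "\n").getBytes(StandardCharsets.UTF_8))))
    .withDeliveryStreamName(streamName);
amazonKinesis.putRecordAsync(request);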
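The effect of the trailing newline can be checked locally without any AWS calls. This is a small Python sketch; the sample records are cut-down versions of the ones in the question, and to_json_lines is a name made up for the example.

```python
import json

def to_json_lines(records):
    """Serialize each record as compact JSON followed by a newline."""
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)

records = [
    {"detail-type": "OneTypeee", "detail": {"Key1": "Value1"}},
    {"detail-type": "OtherType", "detail": {"Key2": "Value2"}},
]

# With newlines: one JSON document per line, each parseable on its own.
batch = to_json_lines(records)
parsed = [json.loads(line) for line in batch.splitlines()]

# Without newlines: everything is glued into a single line.
glued = "".join(json.dumps(r) for r in records)

print(len(batch.splitlines()), len(glued.splitlines()))
```

The glued form has only one line, which is consistent with a line-oriented reader seeing just the first record.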

How to suppress column headers in AWS Athena query result?

I'm running a SELECT Athena query on an S3 bucket manifest. I then want to use the results of that query, in .csv format, in an S3 Batch operation.
My query runs fine and I am able to access the .csv output via S3 Batch, but since the first row is actually column headers, S3 Batch throws an unrecoverable error because it thinks that the manifest now refers to multiple buckets.
How can I easily strip the column headers out of my results? I would prefer to just do it in SQL. The file size makes using standard unix tools prohibitive. I could use AWS Glue, but this seems like overkill for just suppressing headers in a SQL query.
Here's a hacky way to get around it:
SELECT bucket as "my-bucket-name", key as "fakekey"
from your_athena_table
This makes the header row look like the rest of the file, so it will not break the S3 Batch copy job. You will have just one failed record, for fakekey.

AWS Glue custom crawler based on file name

I am trying to crawl data in an S3 bucket with AWS Glue. The data is stored as nested JSON and the path looks like this:
s3://my-bucket/some_id/some_subfolder/datetime.json
When I run the default crawler (no custom classifiers), it partitions the data based on the path and deserializes the JSON as expected. However, I would also like to get a timestamp from the file name in a separate field. For now the crawler omits it.
For example, if I run the crawler on:
s3://my-bucket/10001/fromage/2017-10-10.json
I get table schema like this:
Partition 1: 10001
Partition 2: fromage
Array: JSON data
I tried adding a custom classifier based on a Grok pattern:
%{INT:id}/%{WORD:source}/%{TIMESTAMP_ISO8601:timestamp}
However, whenever I re-run the crawler, it skips the custom classifier and uses the default JSON one. As a workaround I could obviously append the file name to the JSON itself before running the crawler, but I was wondering whether I can avoid this step.
Classifiers only analyze the data within the file, not the filename itself. What you want to do is not possible today. If you can change the path where the files land, you could add the date as another partition:
s3://my-bucket/id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json
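If you do control where the files land, the key rewrite is mechanical. A minimal Python sketch, assuming every key has exactly the id/source/date.json shape from the question (to_partitioned_key is a hypothetical helper, and the actual copy or move in S3 is left out):

```python
import posixpath

def to_partitioned_key(key):
    """Rewrite 'id/source/date.json' as a Hive-style partitioned key."""
    # Raises ValueError if the key does not have exactly three components.
    record_id, source, filename = key.split("/")
    date, ext = posixpath.splitext(filename)
    return f"id={record_id}/source={source}/timestamp={date}/data-file-{date}{ext}"

print(to_partitioned_key("10001/fromage/2017-10-10.json"))
```

For the example file this yields id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json, so the date becomes a partition value the crawler can pick up.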

AWS Athena Returning Zero Records from Tables Created from GLUE Crawler input csv from S3

Part One:
I ran a Glue crawler on a dummy CSV file loaded in S3. It created a table, but when I view the table in Athena and query it, it shows "Zero records returned".
The ELB demo data in Athena works fine, though.
Part Two (scenario):
Suppose I have an Excel file and a data dictionary describing how and in what format the data is stored in that file. I want to load that data into AWS Redshift. What would be the best way to achieve this?
I experienced the same issue. You need to give the crawler the folder path instead of the actual file name and then run it. I tried feeding the folder name to the crawler and it worked. Hope this helps.
I experienced the same issue. Try creating a separate folder for each table in the S3 bucket, then rerun the Glue crawler. You will get a new table in the Glue Data Catalog with the same name as the S3 folder.
Delete the crawler and create it again (make sure only one CSV file is present in S3, then run the crawler).
Important note:
With a single CSV file, we can view the records in Athena after running the crawler.
I was indeed providing the S3 folder path instead of the filename and still couldn't get Athena to return any records ("Zero records returned", "Data scanned: 0KB").
Turns out the problem was that the input files (my rotated log files automatically uploaded to S3 from Elastic Beanstalk) start with underscore (_), e.g. _var_log_nginx_rotated_access.log1534237261.gz! Apparently that's not allowed.
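As far as I know, Athena treats files whose names start with an underscore or a dot as hidden and skips them. A small Python sketch to flag such keys before uploading (hidden_keys is a hypothetical helper; it only inspects names and makes no AWS calls):

```python
import posixpath

def hidden_keys(keys):
    """Return the keys whose file name starts with '_' or '.'."""
    return [k for k in keys if posixpath.basename(k).startswith(("_", "."))]

keys = [
    "logs/_var_log_nginx_rotated_access.log1534237261.gz",
    "logs/access.log1534237261.gz",
    "logs/.tmp_upload",
]
print(hidden_keys(keys))
```

Renaming the flagged files (for example, dropping the leading underscore) should make them visible to queries.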
The structure of the S3 bucket / folder is very important:
s3://<bucketname>/<data-folder>/
/<type-1-[CSVs|Parquets etc]>/<files.[csv or parquet]>
/<type-2-[CSVs|Parquets etc]>/<files.[csv or parquet]>
...
/<type-N-[CSVs|Parquets etc]>/<files.[csv or parquet]>
and specify in the "include path" of the Glue Crawler:
s3://<bucketname e.g my-s3-bucket-ewhbfhvf>/<data-folder e.g data>
Solution: Select the path of the folder, even if the folder contains many files. This will generate one table and the data will be displayed.
In many such cases, using an exclude pattern in the Glue crawler helps me.
We should certainly point the crawler to the directory instead of directly to the file. When even that does not return any records, an exclude pattern comes to the rescue.
You will have to devise a pattern by which only the files you want get crawled and the rest are excluded. (I suggest this instead of creating a separate directory for each file, because in a production bucket such changes are usually not feasible.)
My data was in an S3 bucket with multiple directories, and inside each directory there was a Snappy Parquet file and a JSON file. The JSON file was causing the issue.
So I ran the crawler on the master directory that contained these directories, and as the exclude pattern I gave: */*.json
This time it did not create any table for the JSON files, and I was able to see the table's records using Athena.
for reference - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html
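To try out an exclude pattern before running the crawler, the matching can be approximated locally. The sketch below is my own simplified translation of the glob rules described in the linked documentation ('*' stays within one folder, '**' crosses folders, '?' matches one character); it is not Glue's implementation and ignores brackets and braces. It assumes keys are relative to the crawler's include path and that the pattern in the answer above was meant as */*.json.

```python
import re

def glob_to_regex(pattern):
    """Translate a simplified glob into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # may cross '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # stays inside one path component
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")     # exactly one non-'/' character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))

def is_excluded(relative_key, exclude_patterns):
    """True if the key fully matches any of the exclude patterns."""
    return any(glob_to_regex(p).fullmatch(relative_key) for p in exclude_patterns)

keys = [
    "dir1/part-0000.snappy.parquet",
    "dir1/metadata.json",
    "dir2/part-0000.snappy.parquet",
    "dir2/metadata.json",
]
kept = [k for k in keys if not is_excluded(k, ["*/*.json"])]
print(kept)
```

Only the Parquet keys remain in kept, which matches the outcome described above.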
Pointing the Glue crawler to the S3 folder and not the actual file did the trick.
Here's what worked for me: I needed to move each of my CSVs into its own folder. Just pointing the Glue crawler to the parent folder ('csv/' for me) was not enough.
csv/allergies.csv -> fails
csv/allergies/allergies.csv -> succeeds
Then, I just pointed AWS Glue Crawler to csv/ and everything was parsed out well.
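The "one folder per CSV" layout can be derived from the existing keys. A minimal Python sketch (per_table_key is a hypothetical helper; it only computes the new key and leaves the actual S3 copy to whatever tool you use):

```python
import posixpath

def per_table_key(key):
    """Place a file in a folder named after the file (without extension)."""
    folder, filename = posixpath.split(key)
    table, _ext = posixpath.splitext(filename)
    return posixpath.join(folder, table, filename)

print(per_table_key("csv/allergies.csv"))
```

For csv/allergies.csv this gives csv/allergies/allergies.csv, the layout that succeeded above.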