AWS Glue Crawler - Not picking up Timestamp column correctly (always defined as string)

I've set up an AWS Glue crawler to index a set of bucketed CSV files in S3 (which then creates an Athena DB).
My timestamp is in the "Java" format defined in the documentation, for example:
2019-03-07 14:07:17.651795
I've tried creating a custom classifier (and a new crawler), yet this column keeps being detected as a "string" and not a "timestamp".
I'm at a loss as to why Athena/Glue won't detect this as a timestamp.

I think the problem may be due to the fractional seconds in the timestamp. I found this StackOverflow answer that contains the patterns recognized as timestamps by Glue (but I haven't found where the patterns come from; I can't find them in the Glue docs).
You might have better luck using a custom classifier to make it understand your timestamp format.
I don't know how much it will help you, since you also have to convince Athena to parse your timestamps. You might be better off letting Glue classify them as strings and creating a view where you use DATE_PARSE to convert the strings to timestamps.
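As a rough illustration of that workaround, here is a minimal boto3 sketch that creates such a view through Athena. The table and column names (events, event_ts), the database, and the output bucket are invented, so adjust them to your schema:

    import boto3

    athena = boto3.client("athena")

    # Create a view that parses the string column into a proper timestamp.
    # date_parse's %f specifier covers the fractional-seconds part.
    create_view_sql = """
    CREATE OR REPLACE VIEW events_typed AS
    SELECT
        date_parse(event_ts, '%Y-%m-%d %H:%i:%s.%f') AS event_time,
        some_other_column
    FROM events
    """

    athena.start_query_execution(
        QueryString=create_view_sql,
        QueryExecutionContext={"Database": "my_database"},                  # assumed database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
    )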

Related

AWS Glue table Map data type for arbitrary number of fields and challenges faced

We are working on a data lake project and we get the data from CloudWatch Logs, which is sent to S3 with the help of the Kinesis service. Once it is in S3 (as JSON files), we need to create a table so we can see the data through Athena. We have three fields: one is a timestamp and another is properties, an object which may hold an arbitrary number of fields that differ from case to case. So, while creating the table, I defined that column as map<string,string>, based on some research and advice. Now the challenge is that while querying through Athena, it always says zero records returned, even though there is definitely data. To confirm this, I created a table through a crawler instead, and I am able to see the data through Athena; the only difference is that the crawler defines the properties column as a struct with specific fields inside it, whereas the manual table has map<string,string> to handle the arbitrary incoming fields. I would appreciate any help identifying the root cause. Thank you.
Below is sample JSON line which is sitting in S3.
{"streamedAt":1599012411567,"properties":{"timestamp":1597998038615,"message":"Example Event 1","processFlag":"true"},"event":"aws:kinesis:record"}
Zero records returned usually means the table's location is wrong, or that you have a partitioned table but haven't added any partitions. I assume you would have figured it out if it was the former, so I suspect the latter. When you run a crawler it will add partitions it finds in addition to creating the table.
If you need help adding partitions please edit your question and provide examples of how your data is structured on S3.
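If it does turn out to be missing partitions, a minimal sketch of adding one by hand could look like the following. It assumes a hypothetical table partitioned by a single dt column and a made-up S3 layout; your real partition keys and prefixes will differ:

    import boto3

    athena = boto3.client("athena")

    # Register one partition explicitly; the LOCATION must point at the prefix
    # that holds that partition's files.
    athena.start_query_execution(
        QueryString="""
            ALTER TABLE firehose_events ADD IF NOT EXISTS
            PARTITION (dt = '2020-09-02')
            LOCATION 's3://my-bucket/firehose/2020/09/02/'
        """,
        QueryExecutionContext={"Database": "my_database"},                  # assumed database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
    )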
This is a case where using a Glue crawler will probably not work very well, it will try too hard to figure out the schema of the properties column and it's never going to get it right. Glue crawlers are in general pretty bad at things that aren't very basic (see this question for a similar use case to yours when Glue didn't work out).
I think you'll be fine with a manually created table that uses map<string,string> as the type for the properties column. When you know the type of a property and want to use it as that type, you just cast the value at query time. An alternative is to use string as the type and use the JSON functions to extract values at query time.
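As a sketch of what that could look like: the table name, database, and S3 location are invented, the field names come from the sample JSON above, and I'm assuming the OpenX JSON SerDe (the built-in Hive JSON SerDe also supports maps):

    import boto3

    athena = boto3.client("athena")

    # Manually defined table: "properties" is a map, so arbitrary keys are fine.
    ddl = """
    CREATE EXTERNAL TABLE kinesis_events (
        streamedat  bigint,
        properties  map<string,string>,
        event       string
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-bucket/kinesis-output/'
    """

    # At query time, pull individual properties out of the map and cast them.
    query = """
    SELECT
        from_unixtime(CAST(properties['timestamp'] AS bigint) / 1000.0) AS event_time,
        properties['message']                                           AS message
    FROM kinesis_events
    """

    # Note: start_query_execution is asynchronous; in real code, wait for the DDL
    # to finish (e.g. via get_query_execution) before submitting the query.
    for sql in (ddl, query):
        athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "my_database"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
        )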

Glue Crawler does not recognize Timestamps

I have JSON files in an S3 bucket that may change their schema from time to time. To be able to analyze the data, I want to run a Glue crawler on them periodically; the analysis in Athena works in general.
Problem: My timestamp string is not recognized as timestamp
The timestamps currently have the following format: 2020-04-06T10:37:38+00:00, but I have also tried others, e.g. 2020-04-06 10:37:38 - I have control over this and can adjust the format.
The suggestion to set the SerDe parameters might not work for my application; I want to have the schema recognized completely and not have to define each field individually. (AWS Glue: Crawler does not recognize Timestamp columns in CSV format)
Manual adjustments to the table are generally not wanted, as I would like to deploy Glue automatically within a CloudFormation stack.
Do you have an idea what else I can try?
This is a very common problem. The way we got around it when reading text/JSON files was to have an extra step in between to cast and set proper data types. The crawler's data types are a bit iffy sometimes and are based on the data sample available at that point in time.
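A minimal PySpark sketch of such an intermediate casting step; the column name, format, and S3 paths are made up, and a real Glue job would wrap this in a GlueContext, but the idea is the same:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp

    spark = SparkSession.builder.appName("cast-types").getOrCreate()

    raw = spark.read.json("s3://my-bucket/raw/")  # assumed input prefix

    # Parse the ISO-8601 string (e.g. 2020-04-06T10:37:38+00:00) into a real timestamp.
    typed = raw.withColumn(
        "created_at", to_timestamp("created_at", "yyyy-MM-dd'T'HH:mm:ssXXX")
    )

    # Writing Parquet preserves the timestamp type, so a crawler (or a plain
    # CREATE EXTERNAL TABLE) will see timestamp instead of string.
    typed.write.mode("overwrite").parquet("s3://my-bucket/typed/")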

How to create AWS Athena table via Glue crawler when the s3 data store has both json and .gz compressed files?

I have two problems in my intended solution:
1.
My S3 store structure is as follows:
mainfolder/date=2019-01-01/hour=14/abcd.json
mainfolder/date=2019-01-01/hour=13/abcd2.json.gz
...
mainfolder/date=2019-01-15/hour=13/abcd74.json.gz
All JSON files have the same schema, and I want to make a crawler pointing to mainfolder/ which can then create a table in Athena for querying.
I have already tried with just one file format, e.g. if the files are only json or only gz then the crawler works perfectly, but I am looking for a solution through which I can automate the processing of either type of file. I am open to writing a custom script or using any out-of-the-box solution, but I need pointers on where to start.
2.
The second issue is that my JSON data has a field (column) which the crawler interprets as struct data, but I want that field's type to be string. The reason is that if the type remains struct, the date/hour partitions get a mismatch error, since the struct data obviously does not have the same internal schema across the files. I have tried to make a custom classifier, but there are no options there to describe data types.
I would suggest skipping using a crawler altogether. In my experience Glue crawlers are not worth the problems they cause. It's easy to create tables with the Glue API, and so is adding partitions. The API is a bit verbose, especially adding partitions, but it's much less pain than trying to make a crawler do what you want it to do.
You can of course also create the table from Athena, that way you can be sure you get tables that work with Athena (otherwise there are some details you need to get right). Adding partitions is also less verbose using SQL through Athena, but slower.
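For what it's worth, a minimal sketch of adding a partition through the Glue API; the database, table, and bucket names are invented, and the StorageDescriptor must match the parent table's format (assumed to be JSON via the OpenX SerDe here):

    import boto3

    glue = boto3.client("glue")

    glue.create_partition(
        DatabaseName="my_database",
        TableName="mainfolder_events",
        PartitionInput={
            # Values must be in the same order as the table's partition keys (date, hour).
            "Values": ["2019-01-01", "14"],
            "StorageDescriptor": {
                "Location": "s3://my-bucket/mainfolder/date=2019-01-01/hour=14/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                },
            },
        },
    )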
The crawler will not handle compressed and uncompressed data together, so it will not work out of the box.
It is better to write a Spark job in Glue and use spark.read(), as sketched below.
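A minimal PySpark sketch (bucket and output paths are made up); Spark decompresses .gz files transparently based on the file extension, so plain and gzipped JSON can be read in one pass:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mixed-json").getOrCreate()

    # Both abcd.json and abcd2.json.gz are read together; the date=/hour=
    # directories are picked up as partition columns automatically.
    df = spark.read.json("s3://my-bucket/mainfolder/")

    # Rewriting as Parquet (partitioned the same way) gives Athena one uniform
    # dataset to query, sidestepping the crawler entirely.
    df.write.mode("overwrite").partitionBy("date", "hour").parquet("s3://my-bucket/curated/")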

Loading parquet file from S3 to DynamoDB

I have been looking at options to load (basically empty and restore) a Parquet file from S3 into DynamoDB. The Parquet file itself is created by a Spark job that runs on an EMR cluster. Here are a few things to keep in mind:
I cannot use AWS Data Pipeline
The file is going to contain millions of rows (say 10 million), so I need an efficient solution. I believe the boto API (even with batch writes) might not be that efficient?
Are there any other alternatives ?
Can you just refer to the Parquet files in a Spark RDD and have the workers put the entries into DynamoDB? Ignoring the challenge of caching the DynamoDB client in each worker for reuse across rows, some bit of Scala that takes a row, builds an entry for DynamoDB, and PUTs it should be enough.
BTW: Use DynamoDB on demand here, as it handles peak loads well without you having to commit to some SLA.
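A minimal PySpark sketch of that idea (the same approach works in Scala); the Parquet path, table name, and key schema are made up:

    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-to-dynamodb").getOrCreate()

    df = spark.read.parquet("s3://my-bucket/exports/latest/")  # assumed path

    def write_partition(rows):
        # One client/batch writer per partition, not per row; batch_writer
        # groups items into BatchWriteItem calls and retries unprocessed items.
        table = boto3.resource("dynamodb").Table("my_table")
        with table.batch_writer() as batch:
            for row in rows:
                # Note: DynamoDB rejects Python floats, so numeric values may
                # need converting to Decimal first.
                batch.put_item(Item=row.asDict())

    df.foreachPartition(write_partition)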
Look at the answer below:
https://stackoverflow.com/a/59519234/4253760
To explain the process:
Create desired dataframe
Use .withColumn to create a new column and use psf.collect_list to convert it to the desired collection/JSON format, in the new column in the same dataframe.
Drop all unnecessary (tabular) columns and keep only the JSON-format DataFrame columns in Spark.
Load the JSON data into DynamoDB as explained in the answer.
My personal suggestion: whatever you do, do NOT use RDDs. The RDD interface, even in Scala, is 2-3 times slower than the DataFrame API of any language.
The DataFrame API's performance is programming-language agnostic, as long as you don't use UDFs.

Querying timestamp data in Athena when the timestamp format in the underlying JSON files has changed

I'm querying data in AWS Athena from JSON files stored in S3. I've loaded all the JSON files into Athena using AWS Glue, and it's been working perfectly so far. However, the timestamp formatting has changed in the JSON files from
2018-03-23 15:00:30.998
to
2018-08-29T07:59:50.568Z
So the table ends up having entries like this
2018-08-29T07:59:42.803Z
2018-08-29T07:59:42.802Z
2018-08-29T07:59:32.500Z
2018-03-23 15:03:43.232
2018-03-23 15:03:44.697
2018-03-23 15:04:11.951
This results in parsing errors when I try to run queries against the full DB.
How do I accommodate this in AWS Glue (or Athena), so I don't have to split up the data when querying? I've tried looking into custom classifiers, but I'm unsure of how to use them in this particular case.
Thanks in advance.
Unfortunately, you have to unify the data. If you decide to use the "2018-08-29T07:59:50.568Z" format, you can read such data by using the org.apache.hive.hcatalog.data.JsonSerDe library with the following SerDe property: 'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'
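For reference, a minimal sketch of a table definition that sets that property; the table name, columns, and S3 location are invented, and only the SerDe and the timestamp.formats property come from the answer above:

    import boto3

    # Raw string so the escaped quotes around the literal T reach Athena intact.
    ddl = r"""
    CREATE EXTERNAL TABLE events_unified (
        event_time  timestamp,
        message     string
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    WITH SERDEPROPERTIES (
        'timestamp.formats' = 'yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'
    )
    LOCATION 's3://my-bucket/events/'
    """

    boto3.client("athena").start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "my_database"},                  # assumed database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
    )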