Redshift spectrum timestamp column issues

I have a few files in S3 and used the Glue Data Catalog to get the table definition. There is a field called log_time, and I manually set its data type to timestamp in the Glue catalog. When I query that table from Athena, I can see the timestamp values correctly.
Now I go to Redshift Spectrum and create an external schema pointing to the schema created by the Glue Data Catalog. I can see the tables that are defined there, and when I check the data type of the column, I see that it is defined as timestamp. However, when I run the same query I ran in Athena, the log_time field displays the date part correctly, but the time part is 00:00:00 for all rows.
Any ideas?
**Date value in the file:** 2018-12-16 00:47:20.28
When I manually change the field's data type to timestamp in the Glue Data Catalog and then query in Athena, I see the value: 2018-12-16 00:47:20.280
When I create a Redshift Spectrum schema pointing to the Data Catalog's schema and then query it, I see the value: 2018-12-16 00:00:00
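For reference, the Spectrum side of the setup described above looks roughly like this (a sketch; the external schema, Glue database, table, and role names are all hypothetical):
-- External schema in Redshift pointing at the Glue Data Catalog database
create external schema spectrum_logs
from data catalog
database 'my_glue_database'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole';
-- The same query that returns full timestamps in Athena; via Spectrum the time part comes back as 00:00:00
select log_time from spectrum_logs.my_log_table limit 5;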

Related

AWS Glue - Adding field to a struct field

I have a table defined in AWS Glue. I use AWS Kinesis streams to stream logs into S3 using this table definition, in Parquet file format. It's partitioned by date.
One of the fields in the table, event_payload, is a struct with several fields, one of them an array of structs. Recently I added a new field to the inner struct in the log data. I want to add it to the table definition so that it will be written to S3 and so that I can query it using AWS Athena.
I tried editing the table schema directly in the console. It does write the data to S3, but I get an exception in Athena when querying:
HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'event_payload' in table 'c2s.logs' is declared as type 'struct<...>', but partition 'year=2019/month=201910/day=20191026/hour=2019102623' declared column 'event_payload' as type 'struct<...>'.
I tried deleting all the partitions and repairing the table, as specified here, but I got another error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://.../year=2019/month=201910/day=20191022/hour=2019102216/beaconFirehose-4-2019-10-22-16-34-21-71f183d2-207e-4ae9-98fe-07dda0bab70c.parquet (offset=0, length=801679): Schema mismatch, metastore schema for row column event_payload.markings.element has 8 fields but parquet schema has 7 fields
So the schema has a field which is not present in the data.
Is there a way to specify an optional field? If it's not present, just make it null.
As per the link, schema updates on nested structures are not supported in Athena. One way to make this work is to flatten the struct type with the help of the relationalize operator in Glue. For example:
val frames: Seq[DynamicFrame] = lHistory.relationalize(rootTableName = "hist_root", stagingPath = redshiftTmpDir, JsonOptions.empty)

AWS Glue job to convert table to Parquet w/o needing another crawler

Is it possible to have a Glue job re-classify a JSON table as Parquet instead of needing another crawler to crawl the Parquet files?
Current set up:
JSON files in partitioned S3 bucket are crawled once a day
Glue Job creates Parquet files in specified folder
Run ANOTHER crawler to RECREATE the same table that was made in step 1
I have to believe that there is a way to convert the table classification without another crawler (but I've been burned by AWS before). Any help is much appreciated!
For convenience considerations, two crawlers are the way to go.
For cost considerations, a hacky solution would be:
Get the JSON table's CREATE TABLE DDL from Athena using the SHOW CREATE TABLE <json_table>; command.
In the CREATE TABLE DDL, replace the table name and the SerDe from JSON to Parquet. You don't need the other table properties from the original CREATE TABLE DDL except LOCATION.
Execute the new CREATE TABLE DDL in Athena.
For example:
SHOW CREATE TABLE json_table;
Original DDL:
CREATE EXTERNAL TABLE `json_table`(
`id` int,
`name` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
...
LOCATION
's3://bucket_name/table_data'
...
New DDL:
CREATE EXTERNAL TABLE `parquet_table`(
`id` int,
`name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION
's3://bucket_name/table_data'
You can also do it in the same way with the Glue API methods: get_table() > replace > create_table().
Note: if you want to run this periodically, you would need to wrap it in a script and schedule it with another scheduler (crontab, etc.) after the first crawler runs.

HIVE_CANNOT_OPEN_SPLIT: Schema mismatch when querying parquet files from Athena

I'm getting a schema mismatch error when querying parquet data from Athena.
The error is:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://datalake/googleanalytics/version=0/eventDate=2017-06-11/part-00001-9c9312f7-f9a5-47c3-987e-9348b78aaebc-c000.snappy.parquet (offset=0, length=48653579): Schema mismatch, metastore schema for row column totals has 13 fields but parquet schema has 12 fields
In the AWS Glue crawler I tried enabling "Update all new and existing partitions with metadata from the table", which I thought would resolve this issue; however, I'm still getting the above error.
I did this because of the similar question:
How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')
The table schema for the totals column is:
struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint,sessionQualityDim:bigint>
The parquet file for partition eventDate=2017-06-11 is missing the last field "sessionQualityDim".
You have Parquet files with two different schemas, and the Athena table schema matches the newer one. You can do one of the following:
1) Create two different tables in Athena, one pointing to the data up to 2017 and the other pointing to the data after 2017.
2) If the older data is no longer valid for the current use case, you can simply archive that data and remove the 2017-and-older partitions from your current table.
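A minimal sketch of option 2, assuming the table is partitioned by eventDate as in the error message above (the table name googleanalytics_sessions is hypothetical):
-- Drop an outdated partition from the table's metadata; the files in S3 are not deleted.
ALTER TABLE googleanalytics_sessions DROP PARTITION (eventDate = '2017-06-11');
-- Repeat (or script) the same statement for the remaining 2017-and-older partitions.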

Problems while uploading quoted data to Redshift from S3 using AWS GLUE. How do I insert the data?

I am trying to insert a data set in Redshift with values as :
"2015-04-12T00:00:00.000+05:30"
"2015-04-18T00:00:00.000+05:30"
"2015-05-09T00:00:00.000+05:30"
"2015-05-24T00:00:00.000+05:30"
"2015-07-19T00:00:00.000+05:30"
"2015-08-02T00:00:00.000+05:30"
"2015-09-05T00:00:00.000+05:30"
The crawler which I ran over the S3 data is unable to identify the columns or the data types of the values. I have been tweaking the table settings to get the job to push the data into Redshift, but to no avail. Here is what I have tried so far:
Manually added the column in the table definition in Glue Catalog. There is only 1 column which is mentioned above.
Changed the Serde serialization lib from LazySimpleSerde to org.apache.hadoop.hive.serde2.lazy.OpenCSVSerDe
Added the following Serde parameters - quoteChar ", line.delim \n, field.delim \n
I have already tried different combinations of line.delim and field.delim properties. Including one, omitting another and taking both at the same time as well.
Changed the classification from UNKNOWN to text in the table properties.
Changed the recordCount property to 469 to match the raw data row counts.
The job runs are always successful. After the job runs, when I do select * from table_name, I always get the correct count of rows in the Redshift table as per the raw data, but all the rows are NULL. How do I populate the rows in Redshift?
The table properties have been uploaded to an image album here: Imgur Album
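For reference, the single-column table with the OpenCSV SerDe and quoteChar described in the steps above would look roughly like the following DDL (a sketch; the table name, column name, and bucket are hypothetical):
CREATE EXTERNAL TABLE quoted_dates (
  record_date string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('quoteChar' = '"')
LOCATION 's3://my-bucket/quoted-dates/';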
I was unable to push the data into Redshift using Glue, so I turned to Redshift's COPY command. Here is the command that I executed, in case anyone else needs it or faces the same situation:
copy schema_Name.Table_Name
from 's3://Path/To/S3/Data'
iam_role 'arn:aws:iam::Redshift_Role'
FIXEDWIDTH 'Column_Name:31'
region 'us-east-1';
The fixed width of 31 presumably matches the quoted values above: 29 characters of timestamp plus the two surrounding quote characters.

How to query Parquet data from Amazon Athena?

Athena creates a temporary table using fields in the S3 table. I have done this using JSON data. Could you help me with how to create a table using Parquet data?
I have tried the following:
Converted sample JSON data to parquet data.
Uploaded parquet data to S3.
Created temporary table using columns of JSON data.
By doing this I am able to execute a query, but the result is empty.
Is this approach right, or is there another approach to be followed for Parquet data?
Sample json data:
{"_id":"0899f824e118d390f57bc2f279bd38fe","_rev":"1-81cc25723e02f50cb6fef7ce0b0f4f38","deviceId":"BELT001","timestamp":"2016-12-21T13:04:10:066Z","orgid":"fedex","locationId":"LID001","UserId":"UID001","SuperviceId":"SID001"},
{"_id":"0899f824e118d390f57bc2f279bd38fe","_rev":"1-81cc25723e02f50cb6fef7ce0b0f4f38","deviceId":"BELT001","timestamp":"2016-12-21T13:04:10:066Z","orgid":"fedex","locationId":"LID001","UserId":"UID001","SuperviceId":"SID001"}
If your data has been successfully stored in Parquet format, you would then create a table definition that references those files.
Here is an example statement that uses Parquet files:
CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs_pq (
request_timestamp string,
elb_name string,
request_ip string,
request_port int,
...
ssl_protocol string )
PARTITIONED BY(year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://athena-examples/elb/parquet/'
tblproperties ("parquet.compress"="SNAPPY");
This example was taken from the AWS blog post Analyzing Data in S3 using Amazon Athena that does an excellent job of explaining the benefits of using compressed and partitioned data in Amazon Athena.
If your table definition is valid but you're not getting any rows, try this:
-- The MSCK REPAIR TABLE command will load all partitions into the table.
-- This command can take a while to run depending on the number of partitions to be loaded.
MSCK REPAIR TABLE {tablename}
Steps:
1. Create your my_table_json.
2. Insert data into my_table_json (verify the existence of the created JSON files in the table 'LOCATION').
3. Create my_table_parquet: the same create statement as my_table_json, except you need to add the 'STORED AS PARQUET' clause.
4. Run: INSERT INTO my_table_parquet SELECT * FROM my_table_json
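Put together, the four steps might look roughly like this (a sketch; the column list, bucket, and prefixes are hypothetical, and each table is assumed to have its own S3 location):
-- Step 1: JSON-backed table (columns shortened for the example)
CREATE EXTERNAL TABLE my_table_json (
  deviceId string,
  `timestamp` string,
  orgid string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/data/json/';
-- Step 3: same columns, stored as Parquet in its own location
CREATE EXTERNAL TABLE my_table_parquet (
  deviceId string,
  `timestamp` string,
  orgid string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/data/parquet/';
-- Step 4: Athena writes Parquet files into my_table_parquet's LOCATION
INSERT INTO my_table_parquet SELECT * FROM my_table_json;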