Athena Table Timestamp With Time Zone Not Possible? - amazon-athena

I am trying to create an athena table with a timestamp column that has time zone information. The create sql looks something like this:
CREATE EXTERNAL TABLE `tmp_123` (
`event_datehour_tz` timestamp with time zone
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://...'
TBLPROPERTIES (
'Classification'='parquet'
)
When I run this, I get the error:
line 1:8: mismatched input 'external'. expecting: 'or', 'schema', 'table', 'view' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: b7fa4045-a77e-4151-84d7-1b43db2b68f2; proxy: null)
If I remove the with time zone it will create the table. I've tried this and timestamptz. Is it not possible to create a table in athena that has a timestamp with time zone column?

Unfortunately Athena does not support timestamp with time zone.
What you may do is use the CAST() function around that function call, which will change the type from timestamp with time zone into timestamp.
Or, you can maybe save it as timestamp and use AT TIME STAMP operator as given below:
SELECT event_datehour_tz AT TIME ZONE 'America/Los_Angeles' AS la_time;

Just to give a complete solution after #AswinRajaram answered that Athena does not support timestampo with timezone. Here is how one can CAST the timestamp from a string and use it with timezone.
select
parse_datetime('2022-09-10_00', 'yyyy-MM-dd_H'),
parse_datetime('2022-09-10_00', 'yyyy-MM-dd_H') AT TIME ZONE 'Europe/Berlin',
at_timezone(CAST(parse_datetime('2022-09-10_00', 'yyyy-MM-dd_HH') AS timestamp), 'Europe/Berlin') AS date_partition_berlin,
CAST(parse_datetime('2022-09-10_00', 'yyyy-MM-dd_HH') AT TIME ZONE 'Europe/Berlin' AS timestamp) AS date_partition_timestamp;
2022-09-10 00:00:00.000 UTC
2022-09-10 02:00:00.000 Europe/Berlin // time zone conversion + 2 hours
2022-09-10 02:00:00.000 Europe/Berlin // time zone conversion + 2 hours
2022-09-10 00:00:00.000

Related

Convert text column to timestamp in Amazon redshift

I have a text field "completed_on" with text values "Thu Jan 27 2022 11:55:12 GMT+0530 (India Standard Time)".
I need to convert this into timestamp.
I tried , cast(completed_on as timestamp) which should give me the timestamp but I am getting the following error in REDSHIFT
ERROR: Char/varchar value length exceeds limit for date/timestamp conversions
Since timestamps can be in many different formats, you need to tell Amazon Redshift how to interpret the string.
From TO_TIMESTAMP function - Amazon Redshift:
TO_TIMESTAMP converts a TIMESTAMP string to TIMESTAMPTZ.
select sysdate, to_timestamp(sysdate, 'YYYY-MM-DD HH24:MI:SS') as seconds;
timestamp | seconds
-------------------------- | ----------------------
2021-04-05 19:27:53.281812 | 2021-04-05 19:27:53+00
For formatting, see: Datetime format strings - Amazon Redshift.

AWS Athena: Partition projection using date-hour with mixed ranges

I am trying to create an Athena table using partition projection. I am delivering records to S3 using Kinesis Firehouse, grouped using a dynamic partitioning key. For example, the records look like the following:
period
item_id
2022/05
monthly_item_1
2022/05/04
daily_item_1
2022/05/04/02
hourly_item_1
2022/06
monthly_item_2
I want to partition the data in S3 by period, which can be monthly, daily or hourly. It is guaranteed that period would be in a supported Java date format. Therefore, I am writing these records to S3 in the below format:
s3://bucket/prefix/2022/05/monthly_items.gz
s3://bucket/prefix/2022/05/04/daily_items.gz
s3://bucket/prefix/2022/05/04/02/hourly_items.gz
s3://bucket/prefix/2022/06/monthly_items.gz
I want to run Athena queries for every partition scope i.e. if my query is for a specific day, I want to fetch its daily_items and hourly_items. If I am running a query for a month, I want to its fetch monthly, daily as well as hourly items.
I've created an Athena table using below query:
create external table `my_table`(
`period` string COMMENT 'from deserializer',
`item_id` string COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/prefix/'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.day.type'='integer',
'projection.day.digits' = '2',
'projection.day.range'='01,31',
'projection.hour.type'='integer',
'projection.hour.digits' = '2',
'projection.hour.range'='00,23',
'projection.month.type'='integer',
'projection.month.digits'='02',
'projection.month.range'='01,12',
'projection.year.format'='yyyy',
'projection.year.range'='2022,NOW',
'projection.year.type'='date',
'storage.location.template'='s3://bucket/prefix/${year}/${month}/${day}/${hour}')
However, with this table running below query outputs zero results:
select * from my_table where year = '2022' and month = '06';
I believe the reason is Athena expects all files to be present under the same prefix as defined by storage.location.template. Therefore, any records present under a month or day prefix are not projected.
I was wondering if it was possible to support such querying functionality in a single table with partition projection enabled, when data in S3 is in a folder type structure similar to the examples above.
Would be great if anyone can help me out!

How to transform data in Amazon Athena

I have some data in S3 location in json format. It have 4 columns val, time__stamp, name and type. I would like to create an external Athena table from this data with some transformations given below:
timestamp: timestamp should be converted from unix epoch to UTC, this I did by using the timestamp data type.
name: name should filtered with following sql logic:
name not in ('abc','cdf','fgh') and name not like '%operator%'
type: type should not have values labeled as counter
I would like to add two partition columns date and hour which should be derived from time__stamp column
I started with following:
CREATE EXTERNAL TABLE `airflow_cluster_data`(
`val` string COMMENT 'from deserializer',
`time__stamp` timestamp COMMENT 'from deserializer',
`name` string COMMENT 'from deserializer',
`type` string COMMENT 'from deserializer')
PARTITIONED BY (
date,
hour)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.time_stamp'='#timestamp')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket1/raw/airflow_data'
I tried various things but couldn't figure out the syntax. Using spark could have been easier but I don't want to run Amazon EMR every hour for small data set. I prefer to do it in Athena if possible.
Please have a look at some sample data:
1533,1636674330000,abc,counter
1533,1636674330000,xyz,timer
1,1636674330000,cde,counter
41,1636674330000,cde,timer
1,1636674330000,fgh,counter
231,1636674330000,xyz,timer
1,1636674330000,abc,counter
2431,1636674330000,cde,counter
42,1636674330000,efg,timer
Probably the simplest method is to create a View:
CREATE VIEW foo AS
SELECT
val,
cast(from_unixtime(time__stamp / 1000) as timestamp) as timestamp,
cast(from_unixtime(time__stamp / 1000) as date) as date,
hour(cast(from_unixtime(time__stamp / 1000) as timestamp)) as hour,
name,
type
FROM airflow_cluster_data
WHERE name not in ('abc','cdf','fgh')
AND name not like '%operator%'
AND type != 'counter'
You can create you own UDF for transformation and use it in Athena. https://docs.aws.amazon.com/athena/latest/ug/querying-udf.html

How to calculate gap between 2 timestamps (edited for AWS Athena )

I Have many IOT devices that sends data to my Amazon Athena server, i created a table to store the data and the table contains 2 columns: LocalTime indicate the time that the IOT device capture his status, ServerTime indicate the time the Data arrived to server (sometimes the IOT device doesn't have network connections )
I would like to count the "gaps" in block of hours (let's say 1 hour ) in order to know the deviation of the data arriving, for example:
the result that I would like to get is:
In order to calculate the result i want to calculate how many hours passed between serverTime and LocalTime.
so the first entry (1.1.2019 12:15 - 1.1.2019 10:25 ) = 1-2 hours.
Thanks
If it is MSSQL Server is your database, you can try this below script to get your desired output-
SELECT
CAST(DATEDIFF(HH,localTime,serverTime)-1 AS VARCHAR) +'-'+
CAST(DATEDIFF(HH,localTime,serverTime) AS VARCHAR) [Hours],
COUNT(*) [Count]
FROM your_table
GROUP BY CAST(DATEDIFF(HH,localTime,serverTime)-1 AS VARCHAR) +'-'+
CAST(DATEDIFF(HH,localTime,serverTime) AS VARCHAR)
Oracle
If you using Oracle database as a system, you can use this statement:
select CONCAT(CONCAT (diff_hours,'-') , diff_hours+1) as Hours, count(diff_hours) as Count
from (select 24 * (to_date(LocalTime, 'YYYY-MM-DD hh24:mi') - to_date(ServerTime, 'YYYY-MM-DD hh24:mi')) diff_hours from T_TIMETABLE )
group by diff_hours
order by diff_hours;
Note: This will not display the empty intervals.

copy timestamp from AWS iot rule to Amazon redshift table column

My current iot design is iot > rule > kinesis firehose > redshift
I have iot rule as
SELECT *, timestamp() AS timestamp FROM 'topic/#
I get json message something like below
{
"deviceID": "device6",
"timestamp": 1480926222159
}
In my redshift table I have a column eventtime as Timestamp
Now i want to store the json timestamp value to eventtime column, but it gives me error as it needs
TIMEFORMAT AS 'MM.DD.YYYY HH:MI:SS
for timestamp. So how to covert the iot rules timestamp to redshift timestamp?
There is no direct way to converting epoch date value while inserting it to Redshift table Timestamp datatype column.
I have created a column with Bigint datatype and inserting epoch value directly to this column.
After that I am using Quicksight for analytics so I can edit my dataset and create New calculated field for this column and use Qucksight function as below
epochDate(epoch_date)
which converts the epoch value to timestamp field.
One can use similar functions like
SELECT
(TIMESTAMP 'epoch' + myunixtimeclm * INTERVAL '1 Second ')
AS mytimestamp
FROM
example_table