spark timestamp timezone in JDBC read/write - amazon-web-services

I am creating a Parquet file by reading data from Oracle.
Oracle is running in UTC, which I confirmed using:
SELECT DBTIMEZONE FROM DUAL;
Output:
DBTIMEZONE|
----------|
+00:00    |
Reading from JDBC and writing to S3 as parquet:
df = spark.read.format('jdbc').options(url=url,
                                       dbtable=query,
                                       user=user,
                                       password=password,
                                       fetchsize=2000).load()
df.write.parquet(s3_loc, mode="overwrite")
Now, I checked the value of spark.sql.session.timeZone:
print(spark.conf.get("spark.sql.session.timeZone"))
Output:
UTC
Now, I am reading the data back from the S3 location:
df1 = spark.read.parquet(s3_loc)
df1.show()
Output:
+-------------------+
|               col1|
+-------------------+
|2012-11-11 05:00:00|
|2013-11-25 05:00:00|
|2013-11-11 05:00:00|
|2014-12-25 05:00:00|
+-------------------+
col1 is a DATE in Oracle and is converted to a timestamp in the Spark DataFrame.
Why are 5 hours added in the output? The database is running in UTC and spark.sql.session.timeZone is UTC.
Note:
Both RDS and EMR are running in AWS US-EAST-1.
On all the Spark nodes, I set TZ=UTC.

The timezone is applied by the JDBC driver, which knows nothing about Spark's timezone setting and instead relies on the JVM's default timezone. It also ignores the remote database session's timezone settings. You said you set TZ=UTC; I'm not sure, but it probably didn't take effect. Check what TimeZone.getDefault() tells you.
If, as I suspect, your JVM timezone is US Eastern time (US-EAST-1 is Virginia), then 2012-11-11 00:00:00 read from Oracle over JDBC is interpreted as Eastern time (UTC-05:00 on that date). Displayed by Spark it becomes 2012-11-11 05:00:00 UTC, which is the result you got.
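One way to check the effective JVM default timezone from PySpark is through the py4j gateway; a minimal sketch (it only reports the driver JVM, executors may differ):
# Driver JVM's default timezone, which is what the JDBC driver uses when
# materializing DATE/TIMESTAMP values; compare it with the Spark session zone.
print(spark.sparkContext._jvm.java.util.TimeZone.getDefault().getID())
print(spark.conf.get("spark.sql.session.timeZone"))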
To fix it, override JVM default timezone when running spark-submit:
spark-submit \
--conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" \
--conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC" \
...
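Note that -Duser.timezone must be supplied when the JVMs are launched (via spark-submit as above, or spark-defaults.conf); setting it from inside an already-running application has no effect, because the driver and executor JVMs have already initialized their default timezone.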

Related

Convert text column to timestamp in Amazon redshift

I have a text field "completed_on" with text values like "Thu Jan 27 2022 11:55:12 GMT+0530 (India Standard Time)".
I need to convert this into a timestamp.
I tried cast(completed_on as timestamp), which should give me the timestamp, but I am getting the following error in Redshift:
ERROR: Char/varchar value length exceeds limit for date/timestamp conversions
Since timestamps can be in many different formats, you need to tell Amazon Redshift how to interpret the string.
From TO_TIMESTAMP function - Amazon Redshift:
TO_TIMESTAMP converts a TIMESTAMP string to TIMESTAMPTZ.
select sysdate, to_timestamp(sysdate, 'YYYY-MM-DD HH24:MI:SS') as seconds;
timestamp | seconds
-------------------------- | ----------------------
2021-04-05 19:27:53.281812 | 2021-04-05 19:27:53+00
For formatting, see: Datetime format strings - Amazon Redshift.
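If the raw format proves hard to express with Redshift's datetime format strings, another option is to normalize the value before it reaches Redshift. A minimal Python sketch of that kind of preprocessing, assuming the sample value from the question and that the trailing "(India Standard Time)" label can simply be stripped:
from datetime import datetime, timezone

raw = "Thu Jan 27 2022 11:55:12 GMT+0530 (India Standard Time)"
# Drop the trailing timezone label, then parse the fixed-format remainder.
parsed = datetime.strptime(raw.split(" (")[0], "%a %b %d %Y %H:%M:%S GMT%z")
print(parsed.astimezone(timezone.utc).isoformat(sep=" "))
# 2022-01-27 06:25:12+00:00  -- a value Redshift can ingest as a timestamp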

Get the specific timestamp query on AWS Cloudwatch Logs Insights

I want to get a specific timestamp, so for example I need to query December 31st, 2021 at 8:55 AM:
fields @timestamp, @message
| sort @timestamp desc
| limit 25
Here are some screenshots of where to change it in the top right of the console as well. First click on Custom, then switch to Absolute and specify the exact start and end dates/times you want.
Filter by timestamp query on AWS Cloudwatch Logs Insights may address your question if you want to do it in the query.
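If you prefer to pin the window programmatically instead of in the console, here is a rough boto3 sketch (the log group name is a placeholder, and the 8:55 AM moment is assumed to be UTC; adjust to your zone):
from datetime import datetime, timezone
import boto3

logs = boto3.client("logs")

# Absolute one-minute window around December 31st, 2021 at 8:55 AM (UTC assumed).
start = int(datetime(2021, 12, 31, 8, 55, tzinfo=timezone.utc).timestamp())
end = int(datetime(2021, 12, 31, 8, 56, tzinfo=timezone.utc).timestamp())

response = logs.start_query(
    logGroupName="/my/log/group",   # placeholder
    startTime=start,                # epoch seconds
    endTime=end,
    queryString="fields @timestamp, @message | sort @timestamp desc | limit 25",
)
# Poll logs.get_query_results(queryId=response["queryId"]) until the status is "Complete".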

Athena Table Timestamp With Time Zone Not Possible?

I am trying to create an Athena table with a timestamp column that has time zone information. The CREATE statement looks something like this:
CREATE EXTERNAL TABLE `tmp_123` (
  `event_datehour_tz` timestamp with time zone
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://...'
TBLPROPERTIES (
  'Classification'='parquet'
)
When I run this, I get the error:
line 1:8: mismatched input 'external'. expecting: 'or', 'schema', 'table', 'view' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: b7fa4045-a77e-4151-84d7-1b43db2b68f2; proxy: null)
If I remove the with time zone it will create the table. I've tried this and timestamptz. Is it not possible to create a table in athena that has a timestamp with time zone column?
Unfortunately, Athena does not support timestamp with time zone.
What you can do is wrap the expression in CAST(), which will change the type from timestamp with time zone into timestamp.
Or you can store it as a plain timestamp and use the AT TIME ZONE operator, as given below:
SELECT event_datehour_tz AT TIME ZONE 'America/Los_Angeles' AS la_time FROM tmp_123;
Just to give a complete solution after @AswinRajaram answered that Athena does not support timestamp with time zone, here is how one can CAST the timestamp from a string and use it with a time zone.
select
parse_datetime('2022-09-10_00', 'yyyy-MM-dd_H'),
parse_datetime('2022-09-10_00', 'yyyy-MM-dd_H') AT TIME ZONE 'Europe/Berlin',
at_timezone(CAST(parse_datetime('2022-09-10_00', 'yyyy-MM-dd_HH') AS timestamp), 'Europe/Berlin') AS date_partition_berlin,
CAST(parse_datetime('2022-09-10_00', 'yyyy-MM-dd_HH') AT TIME ZONE 'Europe/Berlin' AS timestamp) AS date_partition_timestamp;
2022-09-10 00:00:00.000 UTC
2022-09-10 02:00:00.000 Europe/Berlin // time zone conversion + 2 hours
2022-09-10 02:00:00.000 Europe/Berlin // time zone conversion + 2 hours
2022-09-10 00:00:00.000

Using Dataprep to write to just a date partition in a date partitioned table

I'm using a BigQuery view to fetch yesterday's data from a BigQuery table and then trying to write into a date partitioned table using Dataprep.
My first issue was that Dataprep would not correctly pick up DATE type columns, but converting them to TIMESTAMP works (thanks Elliot).
However, when using Dataprep and setting an output BigQuery table, you only have 3 options: Append, Truncate, or Drop existing table. If the table is date partitioned and you use Truncate, it will remove all existing data, not just the data in that partition.
Is there another way to do this that I should be using? My alternative is using Dataprep to overwrite a table and then using Cloud Composer to run some SQL pushing this data into a date partitioned table. Ideally, I'd want to do this just with Dataprep but that doesn't seem possible right now.
BigQuery table schema:
Partition details:
The data I'm ingesting is simple. In one flow:
+------------+--------+
| date | name |
+------------+--------+
| 2018-08-08 | Josh1 |
| 2018-08-08 | Josh2 |
+------------+--------+
In the other flow:
+------------+--------+
| date | name |
+------------+--------+
| 2018-08-09 | Josh1 |
| 2018-08-09 | Josh2 |
+------------+--------+
It overwrites the data in both cases.
You can create a partitioned table based on DATE. Data written to a partitioned table is automatically delivered to the appropriate partition:
Data written to a partitioned table is automatically delivered to the appropriate partition based on the date value (expressed in UTC) in the partitioning column.
Append the data to have the new data added to the appropriate partitions.
You can create the table using the bq command:
bq mk --table --expiration [INTEGER1] --schema [SCHEMA] --time_partitioning_field date [DATASET].[TABLE]
time_partitioning_field is what defines which field you will be using for the partitions.
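For reference, here is a rough equivalent with the google-cloud-bigquery Python client (project, dataset, and table names are placeholders), creating the day-partitioned table and appending rows so existing partitions are left untouched:
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("date", "DATE"),
    bigquery.SchemaField("name", "STRING"),
]

# Create the table partitioned on the DATE column, mirroring --time_partitioning_field date.
table = bigquery.Table("my-project.my_dataset.my_table", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="date",
)
client.create_table(table)

# Append instead of truncating, so only the matching partitions receive new rows.
job_config = bigquery.LoadJobConfig(
    schema=schema,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
rows = [{"date": "2018-08-09", "name": "Josh1"}, {"date": "2018-08-09", "name": "Josh2"}]
client.load_table_from_json(rows, "my-project.my_dataset.my_table", job_config=job_config).result()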

SQL Server exception: Received an invalid column length from the bcp client for colid modify_time

I am getting the below error while using the bcp client of SQL Server to load data into Azure SQL Data Warehouse.
Exact exception:
com.microsoft.sqlserver.jdbc.SQLServerException: 107096;Received an invalid column length from the bcp client for colid modify_time.
I am able to load the data correctly into an Azure SQL database, but while loading the data into Azure SQL Data Warehouse this issue happens.
It is happening only for timestamp columns.
When I created the table in Azure SQL data warehouse, it was created like this:
name        | type      | warehouse type | precision | length | java sql type
------------+-----------+----------------+-----------+--------+--------------
modify_time | datetime2 | -9             | 27        | 54     | -9
The bulk load operation is done by the following sample code:
SQLServerBulkCopy copy = new SQLServerBulkCopy(conn);
copy.setDestinationTableName("my_table");
copy.writeToServer(new ISQLServerBulkRecord() {
    // Overridden methods
});