Flink table column name collision between metadata and physical columns - amazon-web-services

We are reading data from an Amazon Kinesis Data Streams (KDS) stream using Apache Flink. The incoming stream records contain a column named "timestamp". We also want to use the approximate arrival time value from the record metadata (which is added automatically by KDS). The metadata key for the approximate arrival time is also named "timestamp". This causes an error when trying to use both columns:
CREATE TABLE source_table
(
`timestamp` VARCHAR(100),
`approximate_arrival_time` TIMESTAMP(3) METADATA FROM 'timestamp'
)
WITH
(
'connector' = 'kinesis',
...
);
Trying to access data from the table:
SELECT * FROM source_table;
Results in this error:
org.apache.flink.table.api.ValidationException: Field names must be unique. Found duplicates: [timestamp]
at org.apache.flink.table.types.logical.RowType.validateFields(RowType.java:272)
at org.apache.flink.table.types.logical.RowType.<init>(RowType.java:157)
at org.apache.flink.table.planner.connectors.DynamicSourceUtils.createProducedType(DynamicSourceUtils.java:215)
at org.apache.flink.table.planner.connectors.DynamicSourceUtils.validateAndApplyMetadata(DynamicSourceUtils.java:443)
at org.apache.flink.table.planner.connectors.DynamicSourceUtils.prepareDynamicSource(DynamicSourceUtils.java:158)
at org.apache.flink.table.planner.connectors.DynamicSourceUtils.convertSourceToRel(DynamicSourceUtils.java:119)
at org.apache.flink.table.planner.plan.schema.CatalogSourceTable.toRel(CatalogSourceTable.java:85)
at org.apache.calcite.sql2rel.SqlToRelConverter.toRel(SqlToRelConverter.java:3585)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertIdentifier(SqlToRelConverter.java:2507)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2144)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2093)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2050)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertSelectImpl(SqlToRelConverter.java:663)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertSelect(SqlToRelConverter.java:644)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertQueryRecursive(SqlToRelConverter.java:3438)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(SqlToRelConverter.java:570)
at org.apache.flink.table.planner.calcite.FlinkPlannerImpl.org$apache$flink$table$planner$calcite$FlinkPlannerImpl$$rel(FlinkPlannerImpl.scala:169)
at org.apache.flink.table.planner.calcite.FlinkPlannerImpl.rel(FlinkPlannerImpl.scala:161)
at org.apache.flink.table.planner.operations.SqlToOperationConverter.toQueryOperation(SqlToOperationConverter.java:989)
at org.apache.flink.table.planner.operations.SqlToOperationConverter.convertSqlQuery(SqlToOperationConverter.java:958)
at org.apache.flink.table.planner.operations.SqlToOperationConverter.convert(SqlToOperationConverter.java:283)
at org.apache.flink.table.planner.delegation.ParserImpl.parse(ParserImpl.java:101)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.sqlQuery(TableEnvironmentImpl.java:704)
at org.apache.zeppelin.flink.sql.AbstractStreamSqlJob.run(AbstractStreamSqlJob.java:102)
at org.apache.zeppelin.flink.FlinkStreamSqlInterpreter.callInnerSelect(FlinkStreamSqlInterpreter.java:89)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.callSelect(FlinkSqlInterrpeter.java:503)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.callCommand(FlinkSqlInterrpeter.java:266)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.runSqlList(FlinkSqlInterrpeter.java:160)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.internalInterpret(FlinkSqlInterrpeter.java:112)
at org.apache.zeppelin.interpreter.AbstractInterpreter.interpret(AbstractInterpreter.java:47)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:110)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:852)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:744)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.ParallelScheduler.lambda$runJobInScheduler$0(ParallelScheduler.java:46)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

Related

How to export hive table data into csv.gz format stored in s3

So I have two Hive queries: one that creates the table, and another that reads parquet data from a different table and inserts the relevant columns into my new table. I would like this new Hive table to export its data to an S3 location in csv.gz format. My Hive queries running on EMR currently output files named 00000_0.gz, and I have to rename them to csv.gz using a bash script. This is quite hacky, as I have to mount my S3 directory in my terminal, and it would be ideal if my queries could do this directly. Could someone please review my queries to see if there's any fault? Many thanks.
CREATE TABLE db.test (
app_id string,
app_account_id string,
sdk_ts BIGINT,
device_id string)
PARTITIONED BY (
load_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION "s3://test_unload/";
set hive.execution.engine=tez;
set hive.cli.print.header=true;
set hive.exec.compress.output=true;
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=1024000000;
set hive.merge.size.per.task=1024000000;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into db.test
partition(load_date)
select
'' as app_id,
'288' as app_account_id,
from_unixtime(CAST(event_epoch as BIGINT), 'yyyy-MM-dd HH:mm:ss') as sdk_ts,
device_id,
'20221106' as load_date
FROM processed_events.test
where load_date = '20221106';

Athena query returns empty result because of timing issues

I'm trying to create and query the Athena table based on data located in S3, and it seems that there are some timing issues.
How can I know when all the partitions have been loaded to the table?
The following code returns an empty result -
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
But when I add some delay, it works great -
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
time.sleep(3)
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
The following is the query for creating the table -
query_create_table = '''
CREATE EXTERNAL TABLE `{athena_db}`.`{athena_db_partition}` (
`time` string,
`user_advertiser_id` string,
`predictions` float
) PARTITIONED BY (
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://{bucket}/path/'
'''
app_query_create_table = query_create_table.format(bucket=bucket,
athena_db=athena_db,
athena_db_partition=athena_db_partition)
I would love to get some help.
The start_query_execution call only starts the query; it does not wait for it to complete. You must run get_query_execution periodically until the status of the execution is successful (or failed).
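For example, a minimal polling sketch along those lines could look like the following; the query string, table name, and output location are placeholders, not values from the question.
import time
import boto3

athena_client = boto3.client('athena')

# start_query_execution returns immediately with an execution id; it does not wait.
response = athena_client.start_query_execution(
    QueryString="MSCK REPAIR TABLE `my_db`.`my_partitioned_table`",
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'})
query_execution_id = response['QueryExecutionId']

# Poll get_query_execution until the query reaches a terminal state.
while True:
    execution = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
    state = execution['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state != 'SUCCEEDED':
    raise RuntimeError('Query {} finished with state {}'.format(query_execution_id, state))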
Not related to your problem per se, but if you create a table with CREATE TABLE … AS there is no need to add partitions with MSCK REPAIR TABLE … afterwards; there will be no new partitions right after the table has been created that way, because it is created with all the partitions produced by the query.
Also, in general, avoid using MSCK REPAIR TABLE, it is slow and inefficient. There are many better ways to add partitions to a table, see https://athena.guide/articles/five-ways-to-add-partitions/
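As one illustration of such an alternative (reusing the client from the sketch above; the table name, partition value, and S3 paths are again made up), a known partition can be registered directly with ALTER TABLE instead of scanning the whole table location:
# Hypothetical example: register a single known partition explicitly.
add_partition_sql = """
ALTER TABLE `my_db`.`my_partitioned_table`
ADD IF NOT EXISTS PARTITION (dt = '2022-11-06')
LOCATION 's3://my-bucket/path/dt=2022-11-06/'
"""
athena_client.start_query_execution(
    QueryString=add_partition_sql,
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'})
# As above, poll get_query_execution before running queries that depend on this partition.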

Stuck with kinesis stagger window

I have a Kinesis Analytics application set up which takes data from a Kinesis stream that has the following schema:
--------------------------
Column ColumnType
--------------------------
Level varchar(10)
RootID varchar(32)
ProcessID varchar(16)
EntityName varchar(64)
Message varchar(512)
Threshold varchar(32)
TriggerTime timestamp
My objective is to create a real-time Kinesis Analytics solution which segregates the records with level "OVERFLOW" and groups them based on the RootID. All records belonging to a RootID are ideally expected to reach Kinesis within a span of 5 minutes, so I am thinking of setting up a stagger window for this. So far I have come up with this SQL:
CREATE OR REPLACE STREAM "OVERFLOW_SQL_STREAM" (
"Level" varchar (10),
"RootID" varchar (32),
"ProcessID" varchar(16),
"EntityName" varchar(64),
"Message" varchar(512),
"Threshold" varchar(32),
"TriggerTime" timestamp
);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "OVERFLOW_SQL_STREAM"
SELECT STREAM
"Level" varchar (10),
"RootID" varchar (32),
"ProcessID" varchar(16),
"EntityName" varchar(64),
"Message" varchar(512),
"Threshold" varchar(32),
"TriggerTime" timestamp
FROM "SOURCE_SQL_STREAM_001"
WHERE "Level" like "OVERFLOW"
WINDOWED BY STAGGER (
PARTITION BY "RootID",FLOOR("TriggerTime" TO MINUTE) RANGE INTERVAL '5' MINUTE);
I received an error in the SQL stating that "PARTITION BY clause doesn't have the column 'Level'". I don't understand why I should add that column to the partition, as I want my records to be partitioned only by the RootID column and not by any other. Adding that column throws an error saying that I should add the next column, and so on. I couldn't understand the error. Kindly help me! Thanks!
There is a workaround for this type of problem.
You can use FIRST_VALUE() or LAST_VALUE() to aggregate the columns instead of passing them directly.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "OVERFLOW_SQL_STREAM"
SELECT STREAM
LAST_VALUE("Level") AS Level,
"RootID" varchar (32),
....
....
....
"TriggerTime" timestamp
FROM "SOURCE_SQL_STREAM_001"
WHERE "Level" like "OVERFLOW"
WINDOWED BY STAGGER (
PARTITION BY "RootID",FLOOR("TriggerTime" TO MINUTE) RANGE INTERVAL '5' MINUTE);
This way you can create the stream pump without adding every column to the PARTITION BY clause.
FIRST_VALUE() -- gets the very first value of the column seen within the
stream partition (here RootID)
LAST_VALUE() -- likewise, but returns the most recent value

Does AWS Athena support Sequence Files

Has anyone tried creating an AWS Athena table on top of Sequence Files? As per the documentation it looks like this is possible. I was able to execute the create table statement below.
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
STORED AS sequencefile
location 's3://bucket/sequencefile/';
The statement executed successfully, but when I try to read data from the table it throws the error below:
Your query has the following error(s):
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://viewershipforneo4j/2017-09-26/000030_0 (offset=372128055, length=62021342) using org.apache.hadoop.mapred.SequenceFileInputFormat: s3://viewershipforneo4j/2017-09-26/000030_0 not a SequenceFile
This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 9f0983b0-33da-4686-84a3-91b14a39cd09.
The sequence files themselves are valid. The issue here is that no delimiter is defined, i.e. ROW FORMAT DELIMITED FIELDS TERMINATED BY is missing.
If, in your case, tab is the column delimiter and each record is on its own row, it would be:
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
row format delimited fields terminated by '\t'
STORED AS sequencefile
location 's3://bucket/sequencefile/';

Amazon Athena: How to store results after querying while skipping column headers?

I ran a simple query using the Athena dashboard on data in CSV format. The result was a CSV with column headers.
When storing the results, Athena stores the column headers in S3 as well. How can I skip storing the header column names, as I have to make a new table from the results and it is repetitive?
Try "skip.header.line.count"="1", This feature has been available on AWS Athena since 2018-01-19, here's a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
`field1` string,
`field2` string,
`field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1')
You can refer to this question:
Aws Athena - Create external table skipping first row
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
Run an Athena query, with the output going to Amazon S3
Created a new table pointing to this output based on "How do I use the results of my Amazon Athena query in another query?", changing the path to the correct S3 location
Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%'
However, subsequent queries store even more data in that S3 directory, so it confuses any subsequent executions.
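For reference, a rough boto3 sketch of those three steps follows; every bucket, database, table, and column name here is invented for illustration, and each call should be polled to completion (as described in the timing answer earlier on this page) before the next one runs.
import boto3

athena_client = boto3.client('athena')
results_location = 's3://my-bucket/first-query-results/'
scratch_location = 's3://my-bucket/athena-scratch/'

# 1. Run the original query, writing its CSV output to a known S3 prefix.
athena_client.start_query_execution(
    QueryString='SELECT event_date, user_id FROM my_db.source_table',
    ResultConfiguration={'OutputLocation': results_location})

# 2. Create a new external table pointing at that output prefix.
athena_client.start_query_execution(
    QueryString="""
        CREATE EXTERNAL TABLE IF NOT EXISTS my_db.query_results (
          `event_date` string,
          `user_id` string
        )
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
        LOCATION 's3://my-bucket/first-query-results/'
    """,
    ResultConfiguration={'OutputLocation': scratch_location})

# 3. Query the new table, filtering out header-like rows as suggested above.
athena_client.start_query_execution(
    QueryString="SELECT * FROM my_db.query_results WHERE event_date NOT LIKE '#%'",
    ResultConfiguration={'OutputLocation': scratch_location})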