Stuck with kinesis stagger window

Stuck with kinesis stagger window - amazon-web-services

I have a kinesis analytics application set up which takes data from a kinesis stream which has the following schema.
--------------------------
Column ColumnType
--------------------------
Level varchar(10)
RootID varchar(32)
ProcessID varchar(16)
EntityName varchar(64)
Message varchar(512)
Threshold varchar(32)
TriggerTime timestamp
My objective is to create a realtime kinesis analytics solution which segregates the records with level "OVERFLOW" and
groups them based on the RootID. All records belonging to a RootID are ideally expected to reach kinesis within a span
of 5 minutes. So I am thinking of setting up a stagger window for this and so far I have come up with this SQL.
CREATE OR REPLACE STREAM "OVERFLOW_SQL_STREAM" (
"Level" varchar (10),
"RootID" varchar (32),
"ProcessID" varchar(16),
"EntityName" varchar(64),
"Message" varchar(512),
"Threshold" varchar(32),
"TriggerTime" timestamp
);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "OVERFLOW_SQL_STREAM"
SELECT STREAM
"Level" varchar (10),
"RootID" varchar (32),
"ProcessID" varchar(16),
"EntityName" varchar(64),
"Message" varchar(512),
"Threshold" varchar(32),
"TriggerTime" timestamp
FROM "SOURCE_SQL_STREAM_001"
WHERE "Level" like "OVERFLOW"
WINDOWED BY STAGGER (
PARTITION BY "RootID",FLOOR("TriggerTime" TO MINUTE) RANGE INTERVAL '5' MINUTE);
I received an error in the SQL stating that "PARTITION BY clause doesn't have the column 'Level'". I don't understand why
should I add that column to partition as I want my records to be partitioned by only by the RootID column and not by any other.
Adding that column throws error saying that I should add the next column and so on. I couldn't understand the error.
Kindly help me!Thanks!

There is a workaround for this type of problem.
You can Use FIRST_VALUE() or LAST_VALUE() to cast the result instead of directly passing them.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "OVERFLOW_SQL_STREAM"
SELECT STREAM
LAST_VALUE("Level") AS Level,
"RootID" varchar (32),
....
....
....
"TriggerTime" timestamp
FROM "SOURCE_SQL_STREAM_001"
WHERE "Level" like "OVERFLOW"
WINDOWED BY STAGGER (
PARTITION BY "RootID",FLOOR("TriggerTime" TO MINUTE) RANGE INTERVAL '5' MINUTE);
This is the way you can create the stream pump without adding into the PARTITION BY clause.
FIRST_VALUE() -- To get the very first value of level matched by the
stream partition (Here RootID)
LAST_VALUE() -- Viceversa

Related

Redshift deadlock on COPY command

I'm doing a simple COPY command that used to work:
echo " COPY table_name
FROM 's3://bucket/<date>/'
iam_role 'arn:aws:iam::123:role/copy-iam'
format as json 's3://bucket/jupath.json'
gzip ACCEPTINVCHARS ' ' TRUNCATECOLUMNS TRIMBLANKS MAXERROR 3;
" | psql
And now I get:
INFO: Load into table 'table_name' completed, 53465077 record(s) loaded successfully.
ERROR: deadlock detected
DETAIL: Process 26999 waits for AccessExclusiveLock on relation 3176337 of database 108036; blocked by process 26835.
Process 26835 waits for ShareLock on transaction 24230722; blocked by process 26999.
The only change is moving from dc2 instance type to ra3. Let me add this is the only command touches this table and there is only one process at a time.

The key detail here is in the error message:
Process 26999 waits for AccessExclusiveLock on relation 3176337 of
database 108036; blocked by process 26835. Process 26835 waits for
ShareLock on transaction 24230722; blocked by process 26999.
Relation 3176337, I assume, is the table in question - the target of the COPY. This should be confirmed by running something like:
select distinct(id) table_id
,trim(datname) db_name
,trim(nspname) schema_name
,trim(relname) table_name
from stv_tbl_perm
join pg_class on pg_class.oid = stv_tbl_perm.id
join pg_namespace on pg_namespace.oid = relnamespace
join pg_database on pg_database.oid = stv_tbl_perm.db_id
;
I don't expect any surprises here but it is good to check. If it is some different table (object) then this is important to know.
Now for the meat. You have 2 processes listed in the error message - PID 26999 and PID 26835. A process is a unique connection to the database or a session. So these are identifying the 2 connections to the database that have gotten locked with each other. So a good next step is to see what each of these sessions (processes or PIDs) are doing. Like this:
select xid, pid, starttime, max(datediff('sec',starttime,endtime)) as runtime, type, listagg(regexp_replace(text,'\\\\n*',' ')) WITHIN GROUP (ORDER BY sequence) || ';' as querytext
from svl_statementtext
where pid in (26999, 26835)
--where xid = 16627013
and sequence < 320
--and starttime > getdate() - interval '24 hours'
group by starttime, 1, 2, "type" order by starttime, 1 asc, "type" desc ;
The thing you might run into is that these logging table "recycle" every few days so the data from this exact failure might be lost.
The next part of the error is about the open transaction that is preventing 26835 from moving forward. This transaction (identified by an XID, or transaction ID) is preventing 26835 progressing and is part of process 26999 but 26999 needs 26835 to complete some action before it a move - a deadlock. So seeing what is in this transaction will be helpful as well:
select xid, pid, starttime, max(datediff('sec',starttime,endtime)) as runtime, type, listagg(regexp_replace(text,'\\\\n*',' ')) WITHIN GROUP (ORDER BY sequence) || ';' as querytext
from svl_statementtext
where xid = 24230722
and sequence < 320
--and starttime > getdate() - interval '24 hours'
group by starttime, 1, 2, "type" order by starttime, 1 asc, "type" desc ;
Again the data may have been lost due to time. I commented out the date range where clause of the last 2 queries to allow for looking back further in these tables. You should also be aware that PID and XID numbers are reused so check the date stamps on the results to be sure that that info from different sessions aren't be combined. You may need a new where clause to focus in on just the event you care about.
Now you should have all the info you need to see why this deadlock is happening. Use the timestamps of the statements to see the order in which statements are being issued by each session (process). Remember that every transaction ends with a COMMIT (or ROLLBACK) and this will change the XID of the following statements in the session. A simple fix might be issuing a COMMIT in the "26999" process flow to close that transaction and let the other process advance. However, you need to understand if such a commit will cause other issues.
If you can find all this info and if you need any help reach out.

Clearly a bug.
Table was cloned from one redshift to another by doing SHOW TABLE table_name, which provided:
CREATE TABLE table_name (
message character varying(50) ENCODE lzo,
version integer ENCODE az64,
id character varying(100) ENCODE lzo ,
access character varying(25) ENCODE lzo,
type character varying(25) ENCODE lzo,
product character varying(50) ENCODE lzo,
)
DISTSTYLE AUTO SORTKEY AUTO ;
After removing the "noise" the command completed as usual without errors:
DROP TABLE table_name;
CREATE TABLE table_name (
message character varying(50),
version integer,
id character varying(100),
access character varying(25),
type character varying(25),
product character varying(50),
);

Why does left join in redshift not working?

We are facing a weird issue with Redshift and I am looking for help to debug it please. Details of the issue are following:
I have 2 tables and I am trying to perform left join as follows:
select count(*)
from abc.orders ot
left outer join abc.events e on **ot.context_id = e.context_id**
where ot.order_id = '222:102'
Above query returns ~7000 records. Looks like it is performing default join as we have only 1 record in [Orders] table with Order ID = ‘222:102’
select count(*)
from abc.orders ot
left outer join abc.events e on **ot.event_id = e.event_id**
where ot.order_id = '222:102'
Above query returns 1 record correctly. If you notice, I have just changed column for joining 2 tables. Event_ID in [Events] table is identity column but I thought I should get similar records even if I use any other column like Context_ID.
Further, I tried following query under the impression it should return all the ~7000 records as I am using default join but surprisingly it returned only 1 record.
select count(*)
from abc.orders ot
**join** abc.events e on ot.event_id = e.event_id
where ot.order_id = '222:102'
Following are the Redshift database details:
Cutdown version of table metadata:
CREATE TABLE abc.orders (
order_id character varying(30) NOT NULL ENCODE raw,
context_id integer ENCODE raw,
event_id character varying(21) NOT NULL ENCODE zstd,
FOREIGN KEY (event_id) REFERENCES events_20191014(event_id)
)
DISTSTYLE EVEN
SORTKEY ( context_id, order_id );
CREATE TABLE abc.events (
event_id character varying(21) NOT NULL ENCODE raw,
context_id integer ENCODE raw,
PRIMARY KEY (event_id)
)
DISTSTYLE ALL
SORTKEY ( context_id, event_id );
Database: Amazon Redshift cluster
I think, I am missing something essential while joining the tables. Could you please guide me in right direction?
Thank you

Flink table column name collision between metadata and physical columns

We are reading data from a Amazon KDS stream using Apache Flink. The incoming stream records contain a column named "timestamp". We also want to use the approximate arrival time value from the record metadata (which is automatically added by KDS). The metadata key for the approximate arrival time is also named "timestamp". This causes an error when trying to use both columns:
CREATE TABLE source_table
(
`timestamp` VARCHAR(100)
`approximate_arrival_time` TIMESTAMP(3) METADATA FROM 'timestamp'
)
WITH
(
'connector' = 'kinesis',
...
);
Trying to access data from the table:
SELECT * FROM source_table;
Results in this error:
org.apache.flink.table.api.ValidationException: Field names must be unique. Found duplicates: [timestamp]
at org.apache.flink.table.types.logical.RowType.validateFields(RowType.java:272)
at org.apache.flink.table.types.logical.RowType.(RowType.java:157)
at org.apache.flink.table.planner.connectors.DynamicSourceUtils.createProducedType(DynamicSourceUtils.java:215)
at org.apache.flink.table.planner.connectors.DynamicSourceUtils.validateAndApplyMetadata(DynamicSourceUtils.java:443)
at org.apache.flink.table.planner.connectors.DynamicSourceUtils.prepareDynamicSource(DynamicSourceUtils.java:158)
at org.apache.flink.table.planner.connectors.DynamicSourceUtils.convertSourceToRel(DynamicSourceUtils.java:119)
at org.apache.flink.table.planner.plan.schema.CatalogSourceTable.toRel(CatalogSourceTable.java:85)
at org.apache.calcite.sql2rel.SqlToRelConverter.toRel(SqlToRelConverter.java:3585)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertIdentifier(SqlToRelConverter.java:2507)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2144)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2093)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertFrom(SqlToRelConverter.java:2050)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertSelectImpl(SqlToRelConverter.java:663)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertSelect(SqlToRelConverter.java:644)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertQueryRecursive(SqlToRelConverter.java:3438)
at org.apache.calcite.sql2rel.SqlToRelConverter.convertQuery(SqlToRelConverter.java:570)
at org.apache.flink.table.planner.calcite.FlinkPlannerImpl.org$apache$flink$table$planner$calcite$FlinkPlannerImpl$$rel(FlinkPlannerImpl.scala:169)
at org.apache.flink.table.planner.calcite.FlinkPlannerImpl.rel(FlinkPlannerImpl.scala:161)
at org.apache.flink.table.planner.operations.SqlToOperationConverter.toQueryOperation(SqlToOperationConverter.java:989)
at org.apache.flink.table.planner.operations.SqlToOperationConverter.convertSqlQuery(SqlToOperationConverter.java:958)
at org.apache.flink.table.planner.operations.SqlToOperationConverter.convert(SqlToOperationConverter.java:283)
at org.apache.flink.table.planner.delegation.ParserImpl.parse(ParserImpl.java:101)
at org.apache.flink.table.api.internal.TableEnvironmentImpl.sqlQuery(TableEnvironmentImpl.java:704)
at org.apache.zeppelin.flink.sql.AbstractStreamSqlJob.run(AbstractStreamSqlJob.java:102)
at org.apache.zeppelin.flink.FlinkStreamSqlInterpreter.callInnerSelect(FlinkStreamSqlInterpreter.java:89)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.callSelect(FlinkSqlInterrpeter.java:503)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.callCommand(FlinkSqlInterrpeter.java:266)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.runSqlList(FlinkSqlInterrpeter.java:160)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.internalInterpret(FlinkSqlInterrpeter.java:112)
at org.apache.zeppelin.interpreter.AbstractInterpreter.interpret(AbstractInterpreter.java:47)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:110)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:852)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:744)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.ParallelScheduler.lambda$runJobInScheduler$0(ParallelScheduler.java:46)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

Kinesis Analytics Session or Stagger Window Batching Without Aggregation

I'm looking to use Kinesis Data Analytics (or some other AWS managed service) to batch records based on a filter criteria. The idea would be that as records come in, we'd start a session window and batch any matching records for 15 min.
The stagger window is exactly what we'd like except we're not looking to aggregate the data, but rather just return the records all together.
Ideally...
100 records spread over 15 min. (20 matching criteria) with first one at 10:02
|
v
At 10:17, the 20 matching records would be sent to the destination
I've tried doing something like:
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"device_id" INTEGER,
"child_id" INTEGER,
"domain" VARCHAR(32),
"category_id" INTEGER,
"posted_at" DOUBLE,
"block" TIMESTAMP
);
-- Create pump to insert into output
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
-- Select all columns from source stream
SELECT STREAM
"device_id",
"child_id",
"domain",
"category_id",
"posted_at",
FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE) as block
FROM "SOURCE_SQL_STREAM_001"
WHERE "category_id" = 888815186
WINDOWED BY STAGGER (
PARTITION BY "child_id", FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE)
RANGE INTERVAL '15' MINUTE);
I continue to get errors for all the columns not in the aggregation:
From line 6, column 5 to line 6, column 12: Expression 'domain' is not being used in PARTITION BY sub clause of WINDOWED BY clause
Kinesis Firehose was a suggested solution, but it's a blind window to all child_id, so it could possibly cut up a session in to multiple and that's what I'm trying to avoid.
Any suggestions? Feels like this might not be the right tool.

try LAST_VALUE("domain") as domain in the select clause.

wso2 cep - SiddhiQL - sum different rows from the same event table

I have one event table called eventCount and has the following values:
ID | eventCount
1 3
2 1
3 5
4 1
I have a stream of data coming in where I count the values of a certain type for a time period (1 second) and depending on the type and time period I will count() and write the value of the count() in the correspondent row.
I need to make a sum of the values within the event table.
I tried to create another event table and join both. Although I am getting the error of you cannot join from 2 static sources.
What is the correct way of doing this from SIddiQL in WSO2 CEP

In your scenario, Sum of the values in the event table is equivalent to the total number of events, doesn't it? So why you need to keep it an event table, can't you just it then and there (like below)?
#Import('dataIn:1.0.0')
define stream dataIn (id int);
#Export('totalCountStream:1.0.0')
define stream totalCountStream (eventCount long);
#Export('perIdCountStream:1.0.0')
define stream perIdCountStream (id int, eventCount long);
partition with (id of dataIn)
begin
from dataIn#window.time(5 sec)
select id, count() as eventCount
insert into perIdCountStream;
end;
from dataIn#window.time(5 sec)
select count() as eventCount
insert into totalCountStream;
ps: if you really need the event tables, you can always persist totalCountStream and perIdCountStream in two separate tables.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Stuck with kinesis stagger window - amazon-web-services

Related

Redshift deadlock on COPY command

Why does left join in redshift not working?

Flink table column name collision between metadata and physical columns

Kinesis Analytics Session or Stagger Window Batching Without Aggregation

wso2 cep - SiddhiQL - sum different rows from the same event table

Categories

Resources