Why is my Snowflake stream data not getting flushed? - amazon-web-services

I am trying to read Snowflake stream data using an AWS Lambda function (Snowflake connector library) and write it into an RDS SQL Server database. After the Lambda run, my stream data is not getting deleted.
I don't want to read the data from the stream, insert it into a temporary Snowflake table, and then read it again to insert the data into SQL Server. Is there a better way to do this?
Lambda code:
for table in table_list:
    sql5 = f"""SELECT "header__stream_position", "header__timestamp" FROM STREAM_{table} WHERE "header__operation" IN ('UPDATE', 'INSERT', 'DELETE');"""
    result = cs.execute(sql5).fetchall()  # cs: Snowflake cursor; a plain SELECT does not consume the stream
    rds_columns = [(c[0], c[1], table[:-4]) for c in result]
    if rds_columns:
        cursor.fast_executemany = True  # cursor: RDS SQL Server (pyodbc) cursor
        sql6 = f"INSERT INTO {RDS_TABLE} (LSNNUMBER, TRANSACTIONTIME, TABLENAME) VALUES (?, ?, ?);"
        data = rds_columns
        cursor.executemany(sql6, data)
        table_write.append(table)
conn.commit()  # commit on the SQL Server connection
ctx.commit()   # commit on the Snowflake connection

A Snowflake Stream requires a successfully committed DML operation to advance its offset, so you can't avoid an intermediate Snowflake table (transient or otherwise) when using Streams.
You could use the CHANGES clause to get the same change information if you can manage the time/query offset within your application code.
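For reference, a minimal sketch of that consumption pattern, reusing the cursor name cs from the question and a hypothetical transient table STREAM_STAGING; only a committed DML statement advances the stream offset:
    cs.execute("BEGIN")
    cs.execute(f"""
        INSERT INTO STREAM_STAGING   -- hypothetical transient table
        SELECT "header__stream_position", "header__timestamp"
        FROM STREAM_{table}
        WHERE "header__operation" IN ('UPDATE', 'INSERT', 'DELETE')
    """)
    cs.execute("COMMIT")  # the stream offset advances only when this DML commits
The staging rows can then be read back and written to RDS exactly as in the existing code.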

The offset on a Stream will only advance if the stream is consumed by a DML statement (INSERT, UPDATE, MERGE). There is a read-only alternative to streams, the CHANGES clause; however, you must keep track of the offsets yourself.
https://docs.snowflake.com/en/sql-reference/constructs/changes.html
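A rough sketch of what the CHANGES approach could look like in the same Lambda, assuming change tracking has been enabled on the base table (ALTER TABLE ... SET CHANGE_TRACKING = TRUE) and that the application stores the last-read offset itself, e.g. as a timestamp column in the RDS table:
    # last_offset is a timestamp string the application persisted after the previous run;
    # {table} is assumed to be the base table the stream was created on
    sql_changes = f"""
        SELECT "header__stream_position", "header__timestamp"
        FROM {table}
        CHANGES(INFORMATION => DEFAULT)
        AT(TIMESTAMP => TO_TIMESTAMP_LTZ('{last_offset}'))
        WHERE "header__operation" IN ('UPDATE', 'INSERT', 'DELETE')
    """
    rows = cs.execute(sql_changes).fetchall()
After loading the rows into RDS, the application would persist the new high-water-mark timestamp for the next run.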

Related

Single source multiple sinks v/s flatmap

I'm using Kinesis Data Analytics on Flink to do stream processing.
The use case I'm working on is to read records from a single Kinesis stream and, after some transformations, write them to multiple S3 buckets. One source record might end up in multiple S3 buckets. We need to write to multiple buckets since the source record contains a lot of information that needs to be split across multiple S3 buckets.
I tried achieving this using multiple sinks.
private static <T> SinkFunction<T> createS3SinkFromStaticConfig(String path, Class<T> type) {
    OutputFileConfig config = OutputFileConfig
            .builder()
            .withPartSuffix(".snappy.parquet")
            .build();
    final StreamingFileSink<T> sink = StreamingFileSink
            .forBulkFormat(new Path(s3SinkPath + "/" + path), createParquetWriter(type))
            .withBucketAssigner(new S3BucketAssigner<T>())
            .withOutputFileConfig(config)
            .withRollingPolicy(new RollingPolicy<T>(DEFAULT_MAX_PART_SIZE, DEFAULT_ROLLOVER_INTERVAL))
            .build();
    return sink;
}

public static void main(String[] args) throws Exception {
    DataStream<PIData> input = createSourceFromStaticConfig(env)
            .map(new JsonToSourceDataMap())
            .name("jsonToInputDataTransformation");

    input.map(value -> value)
            .name("rawData")
            .addSink(createS3SinkFromStaticConfig("raw_data", InputData.class))
            .name("s3Sink");
    input.map(FirstConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("firstOutput", Output1.class));
    input.map(SecondConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("secondOutput", Output2.class));
    input.map(ThirdConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("thirdOutput", Output3.class));
    // and so on; There are around 10 buckets.
}
However, this had a big performance impact: I saw a large CPU spike compared to the version with just one sink. The scale I'm looking at is around 100k records per second.
Other notes:
I'm using a bulk format writer since I want to write files in Parquet format. I tried increasing the checkpointing interval from 1 minute to 3 minutes, assuming that writing files to S3 every minute might be causing issues, but this didn't help much.
As I'm new to Flink and stream processing, I'm not sure whether this much performance impact is expected or whether there is something I can do better.
Would using a flatmap operator and then having a single sink be better?
When you had a very simple pipeline with a single source and a single sink, something like this:
source -> map -> sink
then the Flink scheduler was able to optimize the execution, and the entire pipeline ran as a sequence of function calls within a single task -- with no serialization or network overhead. Flink 1.12 can apply this operator chaining optimization to more complex topologies -- perhaps including the one you have now with multiple sinks -- but I don't believe this was possible with Flink 1.11 (which is what KDA is currently based on).
I don't see how using a flatmap would make any difference.
You can probably optimize your serialization/deserialization. See https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html.

NextShardIterator never returns null when reading data from kinesis stream

I am trying to read records from kinesis stream after a particular timestamp in a lambda function. I get the shards, shard iterators and then the data.
When I get the first iterator, I get the data and keep calling the same function recursively using NextShardIterator (present in the data returned). According to the documentation, the NextShardIterator will return null when there is no more data to read and it has reached $latest.
But it never returns null; the function keeps getting invoked, and eventually I get a ProvisionedThroughputExceededException.
I also tried using MillisBehindLatest to stop reading when the value is zero, but it also fails in some cases.
Is there a correct way to get the data from kinesis based on timestamp?
NextShardIterator will only return null when it reaches the end of a closed shard (i.e. in cases where the shard count has been changed using UpdateShardCount, SplitShard, or MergeShards).
https://docs.amazonaws.cn/en_us/kinesis/latest/APIReference/API_GetRecords.html#API_GetRecords_ResponseSyntax
"NextShardIterator - The next position in the shard from which to start sequentially reading data records. If set to null, the shard has been closed and the requested iterator does not return any more data."
If you want to start reading the stream from a specified timestamp, the best way to do this would be to use an event source mapping with Lambda, specifying StartingPosition as AT_TIMESTAMP.
https://docs.aws.amazon.com/lambda/latest/dg/API_CreateEventSourceMapping.html#SSS-CreateEventSourceMapping-request-StartingPosition
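A hedged sketch of that setup with boto3 (the stream ARN, function name, and timestamp below are placeholders):
    import boto3
    from datetime import datetime, timezone

    lambda_client = boto3.client("lambda")
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",  # placeholder ARN
        FunctionName="my-consumer-function",                                       # placeholder function
        StartingPosition="AT_TIMESTAMP",
        StartingPositionTimestamp=datetime(2021, 1, 1, tzinfo=timezone.utc),       # placeholder timestamp
        BatchSize=100,
    )
Lambda then invokes the function with batches of records starting at that timestamp, so the hand-rolled GetShardIterator/GetRecords loop is no longer needed.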

Alpakka CassandraSource: read data from Cassandra continuously

We are doing a POC to read a Cassandra table continuously using the Alpakka CassandraSource. The following is the sample code:
final Statement stmt = new SimpleStatement("SELECT * FROM testdb.emp1").setFetchSize(20);
final CompletionStage<List<Row>> rows = CassandraSource.create(stmt, session).runWith(Sink.seq(), materializer);
rows.thenAcceptAsync( e -> e.forEach(System.out::println));
The above code fetches the rows from the emp1 table. Since this table grows continuously, we need to keep reading as soon as new data is available. Is there any way to set up a continuous read in CassandraSource?
There is currently no support for continuously reading a table in the Alpakka Cassandra connector. However, you can make it work by wrapping CassandraSource.create in RestartSource.withBackoff, which will restart the Cassandra source after it completes. More about restarting sources can be found in the documentation.

Explicitly lock and unlock a table using ODBC

I have to perform some calculations with data stored in an MSSQL Server database and then save the results in the same database.
I need to load (part of) a table into C++ data structures, perform a calculation (that can take substantial time), and finally add some rows to the same table.
The problem is that several users can access the database concurrently, and I want the table to be locked from the moment the data is loaded into memory until the results of the calculation have been written back to the table.
Using the ODBC SDK, is it possible to explicitly lock and unlock part of a table?
I have tried the following test program, but unfortunately the INSERT statement succeeds before StmtHandle1 is freed:
// Connection 1: take an exclusive table lock.
SQLDriverConnect(ConHandle1, NULL, (SQLCHAR *)"DRIVER={ODBC Driver 13 for SQL Server};"
                                              "SERVER=MyServer;"
                                              "DATABASE=MyDatabase;"/*, ... */);
SQLSetStmtAttr(StmtHandle1, SQL_ATTR_CONCURRENCY, (SQLPOINTER)SQL_CONCUR_LOCK, SQL_IS_INTEGER);
SQLExecDirect(StmtHandle1, (SQLCHAR *)"SELECT * FROM [MyTable] WITH (TABLOCKX, HOLDLOCK)", SQL_NTS);

// Connection 2: this INSERT is expected to block until the lock is released.
SQLDriverConnect(ConHandle2, NULL, (SQLCHAR *)"DRIVER={ODBC Driver 13 for SQL Server};"
                                              "SERVER=MyServer;"
                                              "DATABASE=MyDatabase;"/*, ... */);
SQLSetStmtAttr(StmtHandle2, SQL_ATTR_CONCURRENCY, (SQLPOINTER)SQL_CONCUR_LOCK, SQL_IS_INTEGER);
SQLExecDirect(StmtHandle2, (SQLCHAR *)"INSERT INTO [MyTable] VALUES (...)", SQL_NTS);
"unfortunately the INSERT statement succeeds before StmtHandle1 is freed"
By default SQL Server operates in autocommit mode, i.e. it opens a transaction and commits it for you.
You requested TABLOCKX and the table was locked for the duration of your transaction, but what you want instead is to explicitly open a transaction and not commit/roll it back until you are done with your calculations, i.e. you should use
begin tran; SELECT top 1 * FROM [MyTable] WITH (TABLOCKX, HOLDLOCK);
And you don't need to read the whole table; top 1 * is sufficient to take the table lock.
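To illustrate the same pattern over the same ODBC driver (sketched here with pyodbc rather than the raw ODBC C API; the connection string and inserted values are placeholders), the lock is held until the explicit commit:
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 13 for SQL Server};SERVER=MyServer;DATABASE=MyDatabase;",
        autocommit=False,  # keep everything inside one explicit transaction
    )
    cur = conn.cursor()

    # Take and hold an exclusive table lock for the duration of the transaction.
    cur.execute("SELECT TOP 1 * FROM [MyTable] WITH (TABLOCKX, HOLDLOCK)")
    cur.fetchall()

    # ... load data and run the long calculation here ...

    cur.execute("INSERT INTO [MyTable] VALUES (?)", ("result",))  # placeholder row
    conn.commit()  # releases the lock; other sessions' statements can now proceed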

Send metadata along with Akka stream

Here is my previous question: Send data from InputStream over Akka/Spring stream
I have managed to send a compressed and encrypted file over an Akka stream. Now I am looking for a way to transport metadata along with the data, mainly the filename and a hash (checksum).
My current idea is to use the Flow.prepend function and insert the metadata before the data in this order:
filename, which can vary in size but always ends with a null byte
fixed size hash (checksum)
data
Then, on the receiving end, I would have to use Flow.takeWhile twice, once to read the filename and a second time to read the hash, and then just read the data. It doesn't really look like an elegant solution, and if in the future I want to add more metadata it will become even worse.
I have noticed the method Flow.named; however, the documentation says just:
Add a ``name`` attribute to this Flow.
and I do not know how to use this (or whether it is possible to transport a filename with it).
The question is: is there a better way to transport metadata along with data over an Akka stream than the above?
EDIT: Attaching a drawing of my idea.
I think prepending the metadata makes sense. A simple approach could be to prepend the metadata using the same framing you use to send the data.
The receiving end will need to know how many metadata frames there are, and use this information to split them off. See the example below.
// client end
filenameSrc
  .concat(hashSrc)
  .concat(dataSrc)
  .via(Framing.delimiter(ByteString("\n"), Int.MaxValue, allowTruncation = true))
  .via(Tcp().outgoingConnection(???, ???))
  .runForeach { ??? }

// server end
val printMetadata =
  Flow.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
    import GraphDSL.Implicits._

    val metadataSink = Sink.foreach(println)
    val bcast = builder.add(Broadcast[ByteString](2))

    bcast.out(0).take(2) ~> metadataSink
    FlowShape(bcast.in, bcast.out(1).drop(2).outlet)
  })

val handler =
  Framing.delimiter(ByteString("\n"), Int.MaxValue)
    .via(printMetadata)
    .via(???)
This is only one of the many possible approaches to solve this. But whatever solution you choose, the receiver will need to have knowledge of how to extract the metadata from the raw stream of bytes it reads over TCP.