Alpakka CassandraSource: read data from Cassandra continuously

We are doing a POC to read a Cassandra table continuously using Alpakka CassandraSource. Following is the sample code:
final Statement stmt = new SimpleStatement("SELECT * FROM testdb.emp1").setFetchSize(20);
final CompletionStage<List<Row>> rows = CassandraSource.create(stmt, session).runWith(Sink.seq(), materializer);
rows.thenAcceptAsync( e -> e.forEach(System.out::println));
The above code fetches the rows from the emp1 table. Since this table grows continuously, we need to keep reading new rows as soon as they become available. Is there any way to set up a continuous read in CassandraSource?

There is currently no support for continuously reading a table in the Alpakka Cassandra connector. However, you can make it work by wrapping CassandraSource.create in RestartSource.withBackoff, which will restart the Cassandra source after it completes; a sketch is below. More about restarting sources is in the documentation.
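A minimal sketch of that wrapping, assuming the Akka 2.5.x / Alpakka 1.x Java DSL (class and parameter names may differ slightly in your version):

import java.time.Duration;
import akka.NotUsed;
import akka.stream.javadsl.RestartSource;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

// Re-create the finite CassandraSource every time it completes, with backoff,
// so each pass picks up rows that arrived since the previous query finished.
final Source<Row, NotUsed> continuous =
        RestartSource.withBackoff(
                Duration.ofSeconds(1),   // minBackoff before restarting
                Duration.ofSeconds(30),  // maxBackoff between restarts
                0.2,                     // randomFactor added to the backoff
                () -> CassandraSource.create(
                        new SimpleStatement("SELECT * FROM testdb.emp1").setFetchSize(20),
                        session));

continuous.runWith(Sink.foreach(System.out::println), materializer);

Note that each restart re-runs the full query, so filtering out rows you have already processed (for example by a timestamp or clustering column) is left to your application.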

Related

Why is my Snowflake stream data not getting flushed?

I am trying to read Snowflake stream data using an AWS Lambda (with the Snowflake connector library) and write the data into RDS SQL Server. After the Lambda runs, my stream data is not getting consumed (deleted).
I don't want to read the data from the stream, insert it into a temporary Snowflake table, and then read that table again to insert the data into SQL Server. Is there any better way to do this?
Lambda code:
for table in table_list:
    sql5 = f"""SELECT "header__stream_position","header__timestamp" FROM STREAM_{table} where "header__operation" in ('UPDATE' ,'INSERT' ,'DELETE') ;"""
    result = cs.execute(sql5).fetchall()
    rds_columns = [(c[0], c[1], table[:-4]) for c in result]
    if rds_columns:
        cursor.fast_executemany = True
        sql6 = f"INSERT INTO {RDS_TABLE}(LSNNUMBER,TRANSACTIONTIME,TABLENAME) VALUES (?, ?, ?);"
        data = rds_columns
        cursor.executemany(sql6, data)
        table_write.append(table)
conn.commit()
ctx.commit()
Snowflake Streams require a successfully committed DML operation to advance the stream, so you can't avoid an intermediate Snowflake table (transient or otherwise) with Streams.
You could use the CHANGES clause to get the same change information if you can manage the time/query offset within your application code.
The offset on a Stream will only advance if it is consumed by a DML statement (INSERT, UPDATE, MERGE). CHANGES is a read-only alternative to streams, but you must keep track of the offsets yourself; a sketch is below.
https://docs.snowflake.com/en/sql-reference/constructs/changes.html
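A minimal sketch of the CHANGES-based approach, assuming change tracking is enabled on the source table (ALTER TABLE ... SET CHANGE_TRACKING = TRUE) and reusing the cs cursor from the question; load_last_offset and save_last_offset are hypothetical helpers for the offset bookkeeping your code would have to do, for example against the same RDS database:

from datetime import datetime, timezone

# Record the query time up front; this becomes the next run's offset.
new_offset = datetime.now(timezone.utc).isoformat()
last_read_ts = load_last_offset()   # hypothetical helper: returns the previous offset

sql_changes = f"""
    SELECT *, METADATA$ACTION, METADATA$ISUPDATE
    FROM {table}
      CHANGES(INFORMATION => DEFAULT)
      AT(TIMESTAMP => TO_TIMESTAMP_LTZ('{last_read_ts}'))
"""
change_rows = cs.execute(sql_changes).fetchall()

# ... write change_rows to SQL Server as before, then persist the new offset.
save_last_offset(new_offset)        # hypothetical helper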

Elasticsearch snapshot taking forever

My Elasticsearch cluster has indices named like index_name-YYYYMM. Data is continuously written to Elasticsearch, on the order of 1 TB per hour.
indexA-202102
indexB-202102
indexC-202102
.
.
.
I'm trying to take a snapshot every day using the Python client. If I specify a single index, the snapshot completes in a few seconds, but if I specify multiple indices, it takes forever because new data is being added continuously.
Is there a way to solve this?
def snapshot(self, repository, indices, snapshot_name):
    snap_settings = {'indices': indices,
                     'ignore_unavailable': True,
                     'include_global_state': True}
    return self.es_client.snapshot.create(repository=repository,
                                          snapshot=snapshot_name,
                                          body=snap_settings)

Single source, multiple sinks vs. flatMap

I'm using Kinesis Data Analytics on Flink to do stream processing.
The use case I'm working on is to read records from a single Kinesis stream and, after some transformations, write them to multiple S3 buckets. One source record might end up in multiple S3 buckets. We need to write to multiple buckets because the source record contains a lot of information that needs to be split across them.
I tried achieving this using multiple sinks.
private static <T> SinkFunction<T> createS3SinkFromStaticConfig(String path, Class<T> type) {
    OutputFileConfig config = OutputFileConfig
            .builder()
            .withPartSuffix(".snappy.parquet")
            .build();
    final StreamingFileSink<T> sink = StreamingFileSink
            .forBulkFormat(new Path(s3SinkPath + "/" + path), createParquetWriter(type))
            .withBucketAssigner(new S3BucketAssigner<T>())
            .withOutputFileConfig(config)
            .withRollingPolicy(new RollingPolicy<T>(DEFAULT_MAX_PART_SIZE, DEFAULT_ROLLOVER_INTERVAL))
            .build();
    return sink;
}
public static void main(String[] args) throws Exception {
    DataStream<PIData> input = createSourceFromStaticConfig(env)
            .map(new JsonToSourceDataMap())
            .name("jsonToInputDataTransformation");

    input.map(value -> value)
            .name("rawData")
            .addSink(createS3SinkFromStaticConfig("raw_data", InputData.class))
            .name("s3Sink");
    input.map(FirstConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("firstOutput", Output1.class));
    input.map(SecondConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("secondOutput", Output2.class));
    input.map(ThirdConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("thirdOutput", Output3.class));
    // and so on; there are around 10 buckets.
}
However, I saw a big performance impact from this approach: a large CPU spike compared to the same job with just one sink. The scale I'm looking at is around 100k records per second.
Other notes:
I'm using the bulk format writer since I want to write files in Parquet format. I tried increasing the checkpointing interval from 1 minute to 3 minutes, assuming that writing files to S3 every minute might be causing issues, but this didn't help much.
As I'm new to Flink and stream processing, I'm not sure whether this much of a performance impact is expected, or whether there is something I can do better.
Would using a flatMap operator and then having a single sink be better?
When you had a very simple pipeline with a single source and a single sink, something like this:
source -> map -> sink
then the Flink scheduler was able to optimize the execution, and the entire pipeline ran as a sequence of function calls within a single task -- with no serialization or network overhead. Flink 1.12 can apply this operator chaining optimization to more complex topologies -- perhaps including the one you have now with multiple sinks -- but I don't believe this was possible with Flink 1.11 (which is what KDA is currently based on).
I don't see how using a flatmap would make any difference.
You can probably optimize your serialization/deserialization. See https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html.
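For the serialization point, a couple of the standard Flink ExecutionConfig knobs are sketched below; this is a general Flink sketch, not KDA-specific advice, and whether it helps depends on your record types:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Reuse objects between chained operators instead of deep-copying every record.
// Use with care if any operator mutates its input records.
env.getConfig().enableObjectReuse();

// Fail at job-submission time if a type silently falls back to the slower
// generic Kryo serializer, so you can fix it to be a proper POJO instead.
env.getConfig().disableGenericTypes();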

How to use Airflow to process new data in batches?

We want to use Airflow to process new data in batches. First, our DAG runs a command every 15 minutes to check our CRM system for new data, and then it sends the new data to two other systems, so it's like:
task 1 (check if there is new data) > task 2 (send new data to system 1) > task 3 (send new data to system 2)
The problems are:
the amount of new data is dynamic, so we don't know how much data we might get;
how do we process the new data one by one?
I am not sure exactly what problem you are facing; please be more specific.
The best bet is to create a custom operator (if there is no suitable default one), with a layout like this (a rough sketch follows):
Task 1 (extract the new data and write it to a location, e.g. exported as ndjson or another format) >
Task 2 (check whether there is any data; if the location is dynamic, pass it through XCom) >
Task 3 (same as Task 2; the location may be passed via XCom)
Each run, triggered every 15 minutes, should fetch the new data and push it downstream.
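A minimal sketch of that layout, assuming the Airflow 2.x TaskFlow API; fetch_new_records, push_to_system1 and push_to_system2 are hypothetical placeholders for your own CRM/system integration code, and for large batches you would pass a file location through XCom rather than the records themselves:

from datetime import datetime, timedelta
from airflow.decorators import dag, task

@dag(schedule_interval=timedelta(minutes=15),
     start_date=datetime(2021, 1, 1),
     catchup=False)
def crm_sync():

    @task
    def extract_new_data():
        # Task 1: check the CRM for new records; the return value goes to XCom.
        return fetch_new_records()           # hypothetical helper

    @task
    def send_to_system1(records):
        # Task 2: pull the records (or their file location) from XCom and forward them.
        for record in records:
            push_to_system1(record)          # hypothetical helper

    @task
    def send_to_system2(records):
        # Task 3: same records, second downstream system.
        for record in records:
            push_to_system2(record)          # hypothetical helper

    records = extract_new_data()
    send_to_system1(records) >> send_to_system2(records)

crm_sync_dag = crm_sync()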

Inserting rows on BigQuery: InsertAllRequest Vs BigQueryIO.writeTableRows()

When I'm inserting rows on BigQuery using writeTableRows, performance is really bad compared to InsertAllRequest. Clearly, something is not setup correctly.
Use case 1: I wrote a Java program to process the 'sample' Twitter stream using Twitter4j. When a tweet comes in, I write it to BigQuery using this:
insertAllRequestBuilder.addRow(rowContent);
When I run this program from my Mac, it inserts about 1000 rows per minute directly into the BigQuery table. I thought I could do better by running a Dataflow job on a cluster.
Use case 2: When a tweet comes in, I write it to a Google Pub/Sub topic. I run this from my Mac, which sends about 1000 messages every minute.
I wrote a Dataflow job that reads this topic and writes to BigQuery using BigQueryIO.writeTableRows(). I have an 8-machine Dataproc cluster. I started this job on the master node of that cluster with DataflowRunner. It's unbelievably slow! Like 100 rows every 5 minutes or so. Here's a snippet of the relevant code:
statuses.apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = new TableRow();
Status status = c.element();
row.set("Id", status.getId());
row.set("Text", status.getText());
row.set("RetweetCount", status.getRetweetCount());
row.set("FavoriteCount", status.getFavoriteCount());
row.set("Language", status.getLang());
row.set("ReceivedAt", null);
row.set("UserId", status.getUser().getId());
row.set("CountryCode", status.getPlace().getCountryCode());
row.set("Country", status.getPlace().getCountry());
c.output(row);
}
}))
.apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)//
.withSchema(schema)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
.withNumFileShards(1000)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
What am I doing wrong? Should I use a 'SparkRunner'? How do I confirm that it's running on all nodes of my cluster?
With BigQuery you can either:
Stream data in. Low latency, up to 100k rows per second, has a cost.
Batch data in. Way higher latency, incredible throughput, totally free.
That's the difference you are experiencing. If you only want to ingest 1000 rows, batching will be noticeably slower, while the same load with 10 billion rows will be way faster through batching, and at no cost.
Dataflow/Beam's BigQueryIO.writeTableRows can either stream or batch data in.
With BigQueryIO.Write.Method.FILE_LOADS, the pasted code is choosing batch.
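For comparison, a sketch of the same write configured for streaming inserts; note that withTriggeringFrequency and withNumFileShards only apply to FILE_LOADS, and streaming ingestion is billed per GB:

// Same write as in the question, but using streaming inserts instead of file loads.
.apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)
        .withSchema(schema)
        // Streaming inserts: rows become visible in the table within seconds, at a cost.
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));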