How to use Airflow to process batch new data? - airflow-scheduler

We want to use Airflow to process new data in batches. Every 15 minutes our DAG runs a command to check our CRM system for new data and then sends the new data to two other systems, so it looks like this:
task1 (check if there is new data) > task2 (send new data to system1) > task3 (send new data to system2)
The problems are:
the amount of new data is dynamic; we don't know how many records we might get.
how do we process the new data one by one?

I am not sure what problem you are facing. Please be more specific.
The best bet is to create a custom operator (if there is no default one).
Task1 (extract the new data and write it to a location, exported as NDJSON or another format) >
Task2 (checks whether there is any data; if the location is dynamic, pass it through XCom) >
Task3 (same as Task2; the location may be passed through XCom)
Each run, triggered every 15 min, should fetch new data and push
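
A minimal sketch of the file hand-off described above, written in plain Python so that it runs without Airflow installed. The function names, the file name new_records.ndjson, and the in-memory lists standing in for the CRM response and the two target systems are assumptions for illustration only. In a real DAG each function would be the callable behind a task, and the returned path would travel between tasks through XCom.

```python
import json
import tempfile
from pathlib import Path


def extract_new_records(records, out_dir):
    """Task 1: write whatever the CRM returned to an NDJSON file, return its path."""
    out_path = Path(out_dir) / "new_records.ndjson"  # hypothetical file name
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    # In Airflow, a returned value like this is what would be shared via XCom.
    return str(out_path)


def send_records(path, send_one):
    """Tasks 2 and 3: stream the file and hand each record to one target system."""
    sent = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # skip blank lines
                send_one(json.loads(line))
                sent += 1
    return sent


# Hypothetical stand-ins for one CRM poll and the two downstream systems.
crm_batch = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
system1, system2 = [], []

with tempfile.TemporaryDirectory() as tmp:
    location = extract_new_records(crm_batch, tmp)
    count1 = send_records(location, system1.append)
    count2 = send_records(location, system2.append)
```

Because the downstream functions iterate over the file line by line, the number of records per run does not need to be known in advance; an empty poll simply produces an empty file and zero sends.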

Related

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

We have a Vertex AI model that takes a relatively long time to return a prediction.
When hitting the model endpoint with one instance, things work fine. But batch jobs of size say 1000 instances end up with around 150 504 errors (upstream request timeout). (We actually need to send batches of 65K but I'm troubleshooting with 1000).
I tried increasing the number of replicas assuming that the # of instances handed to the model would be (1000/# of replicas) but that doesn't seem to be the case.
I then read that the default batch size is 64 and so tried decreasing the batch size to 4 like this from the python code that creates the batch job:
model_parameters = dict(batch_size=4)
from google.cloud import aiplatform

def run_batch_prediction_job(vertex_config):
    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )
    model = aiplatform.Model(vertex_config.model_resource_name)
    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        # replica_count is not defined in this snippet; it is set elsewhere.
        starting_replica_count=replica_count,
        max_replica_count=replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params
    )
    batch_prediction_job = model.batch_predict(**batch_params)
    # Block until the batch prediction job finishes.
    batch_prediction_job.wait()
    return batch_prediction_job
I've also tried upgrading the machine type to n1-highcpu-16, and that helped somewhat, but I'm not sure I understand how batches are sent to replicas.
Is there another way to decrease the number of instances sent to the model?
Or is there a way to increase the timeout?
Is there log output I can use to help figure this out?
Thanks
Answering your follow-up question above:
Is that timeout for a single instance request or a batch request? Also, is it in seconds?
This is a timeout for the batch job creation request.
The timeout is in seconds. According to create_batch_prediction_job(), timeout refers to the RPC timeout. Tracing the code leads eventually to the gapic layer, where timeout is properly described:
timeout (float): The amount of time in seconds to wait for the RPC
to complete. Note that if ``retry`` is used, this timeout
applies to each individual attempt and the overall time it
takes for this method to complete may be longer. If
unspecified, the the default timeout in the client
configuration is used. If ``None``, then the RPC method will
not time out.
My suggestion is to stick with whatever is working for your prediction model. If adding the timeout improves your results, build on it along with your initial solution of using a higher-spec machine. You can also try a machine with more memory, such as the n1-highmem-* family.
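
To make the "applies to each individual attempt" wording in the quoted docstring concrete, here is a small arithmetic sketch. It is not the client library's retry implementation: the function name, the plain exponential backoff, and the default delay values are assumptions chosen for illustration, and real retry policies also add jitter, delay caps, and an overall deadline.

```python
def worst_case_total_seconds(timeout, attempts, initial_delay=1.0, multiplier=2.0):
    """Upper bound on wall time when every attempt runs into its own timeout.

    Assumes a plain exponential backoff between attempts (hypothetical values).
    """
    total = 0.0
    delay = initial_delay
    for attempt in range(attempts):
        total += timeout  # each attempt may use the full per-attempt timeout
        if attempt < attempts - 1:
            total += delay  # wait before the next attempt
            delay *= multiplier
    return total


# One attempt is bounded by the timeout itself; three attempts are not.
single = worst_case_total_seconds(timeout=60.0, attempts=1)
retried = worst_case_total_seconds(timeout=60.0, attempts=3)
```

Under these assumptions a 60-second timeout with three attempts can take roughly three times as long overall, which is why the docstring warns that the method may run longer than the timeout value suggests.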

Elasticsearch snapshot taking forever

My Elasticsearch cluster has indices named like index_name-YYYYMM. Data is continuously written to Elasticsearch, on the order of 1 TB per hour.
indexA-202102
indexB-202102
indexC-202102
...
I’m trying to take a snapshot every day using the Python client. If I specify a single index, the snapshot completes in a few seconds. But if I specify multiple indices, it takes forever because new data is being added continuously.
Is there a way we can solve this?
def snapshot(self, repository, indices, snapshot_name):
    snap_settings = {'indices': indices, 'ignore_unavailable': True,
                     'include_global_state': True}
    return self.es_client.snapshot.create(repository=repository,
                                          snapshot=snapshot_name,
                                          body=snap_settings)

alpakka cassandrasource read data from cassandra continuously

We are doing a POC to read a Cassandra table continuously using Alpakka CassandraSource. The following is the sample code:
final Statement stmt = new SimpleStatement("SELECT * FROM testdb.emp1").setFetchSize(20);
final CompletionStage<List<Row>> rows = CassandraSource.create(stmt, session).runWith(Sink.seq(), materializer);
rows.thenAcceptAsync( e -> e.forEach(System.out::println));
The above code fetches the rows from the emp1 table. Since this table grows continuously, we need to keep reading as soon as data is available. Is there any way to set up a continuous read in CassandraSource?
There is currently no support for continuously reading a table in the Alpakka Cassandra connector. However, you can make it work by wrapping CassandraSource.create in a RestartSource.withBackoff, which will restart the Cassandra source after it completes. More about restarting sources in the documentation.

Inserting rows on BigQuery: InsertAllRequest Vs BigQueryIO.writeTableRows()

When I'm inserting rows into BigQuery using writeTableRows, performance is really bad compared to InsertAllRequest. Clearly, something is not set up correctly.
Use case 1: I wrote a Java program to process 'sample' Twitter stream using Twitter4j. When a tweet comes in I write it to BigQuery using this:
insertAllRequestBuilder.addRow(rowContent);
When I run this program from my Mac, it inserts about 1000 rows per minute directly into BigQuery table. I thought I could do better by running a Dataflow job on the cluster.
Use case 2: When a tweet comes in, I write it to a topic of Google's PubSub. I run this from my Mac which sends about 1000 messages every minute.
I wrote a Dataflow job that reads this topic and writes to BigQuery using BigQueryIO.writeTableRows(). I have an 8-machine Dataproc cluster. I started this job on the master node of this cluster with DataflowRunner. It's unbelievably slow! Like 100 rows every 5 minutes or so. Here's a snippet of the relevant code:
statuses.apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        TableRow row = new TableRow();
        Status status = c.element();
        row.set("Id", status.getId());
        row.set("Text", status.getText());
        row.set("RetweetCount", status.getRetweetCount());
        row.set("FavoriteCount", status.getFavoriteCount());
        row.set("Language", status.getLang());
        row.set("ReceivedAt", null);
        row.set("UserId", status.getUser().getId());
        row.set("CountryCode", status.getPlace().getCountryCode());
        row.set("Country", status.getPlace().getCountry());
        c.output(row);
    }
}))
    .apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)
        .withSchema(schema)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
        .withNumFileShards(1000)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
What am I doing wrong? Should I use a 'SparkRunner'? How do I confirm that it's running on all nodes of my cluster?
With BigQuery you can either:
Stream data in. Low latency, up to 100k rows per second, has a cost.
Batch data in. Way higher latency, incredible throughput, totally free.
That's the difference you are experiencing. If you only want to ingest 1000 rows, batching will be noticeably slower. With 10 billion rows, batching will be far faster, and at no cost.
Dataflow/Beam's BigQueryIO.writeTableRows can either stream or batch data in.
With BigQueryIO.Write.Method.FILE_LOADS, the pasted code is choosing batch.
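
As a rough illustration of why batching adds latency for a small trickle of rows, the toy model below holds rows until the end of a fixed interval before writing them. This is a deliberately simplified fixed-window assumption, not how Beam's triggers or BigQuery load jobs actually behave, and it ignores the time the load itself takes. The function name and the sample arrival times are hypothetical.

```python
def batch_flush_times(arrival_times, trigger_seconds):
    """Map each row's arrival time to the time its batch would be flushed.

    Toy model: rows arriving in [k*T, (k+1)*T) are all flushed at (k+1)*T.
    """
    return [(int(t // trigger_seconds) + 1) * trigger_seconds for t in arrival_times]


# Arrival times in seconds, with the 2-minute interval used in the question.
arrivals = [0, 30, 119, 120, 200]
flushes = batch_flush_times(arrivals, trigger_seconds=120)
waits = [flush - arrival for flush, arrival in zip(flushes, arrivals)]
```

In this model a row can sit for up to the full interval before it is written, whereas a streamed row is sent as soon as it arrives. That waiting cost is fixed per interval, so it matters for 1000 rows and becomes negligible for billions.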

Redis SortedSet: Does the ZUNIONSTORE command block other concurrency commands?

I want to periodically create a temporary sorted set based on the original ones, using a timer with an interval of maybe 4 hours. I'm using the spring-data-redis API to do this.
ZUNIONSTORE tmp 2 A B AGGREGATE MAX
While the ZUNIONSTORE command is executing, will it block other commands such as ZADD, ZREM, ZRANGE, or ZINCRBY on sorted set A or B? I don't know whether this will cause concurrency problems. Please give me some suggestions.