How do you ensure it works with Google Cloud Pub/Sub? - google-cloud-platform

I am currently working on a distributed crawling service. While building it, I have run into a few issues that need to be addressed.
First, let me explain how the crawler works and the problems that need to be solved.
The crawler needs to save every post on each and every bulletin board of a particular site.
To do this, it automatically discovers crawling targets and publishes several messages to Pub/Sub. A message looks like this:
{
  "boardName": "test",
  "targetDate": "2020-01-05"
}
When such a message is published, the Cloud Run function is triggered, and the data described by the given JSON is crawled.
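For context, the discovery step's publish call looks roughly like this minimal sketch (the project and topic names are placeholders):

# Minimal sketch of the discovery step publishing crawl targets.
# Project and topic names are placeholders.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic = publisher.topic_path("my-project", "crawl-targets")

def publish_target(board_name: str, target_date: str) -> None:
    message = {"boardName": board_name, "targetDate": target_date}
    publisher.publish(topic, json.dumps(message).encode("utf-8"))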
However, if a duplicate of the same message is published, the same data is crawled again and duplicate records are created. How can I ignore a message when an identical one has already arrived?
Also, are there Pub/Sub or other good features I can refer to for a stable implementation of a distributed crawler?

Because Pub/Sub is, by default, designed to deliver messages at least once, it's better to have idempotent processing. (Exactly-once delivery is coming.)
In any case, your two situations are very similar: the same message delivered twice, or two different messages with the same content, will cause the same issue. There is no magic feature in Pub/Sub for that. You need an external tool, like a database, to store the already-received information.
Firestore/Datastore is a good, serverless place for that. If you need low latency, Memorystore and its in-memory database is the fastest.
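As an illustration, here is a minimal sketch of that pattern in Python, using Firestore's create() call (which fails if the document already exists) as the deduplication guard; the processed_jobs collection and the crawl() helper are hypothetical:

# Minimal sketch: use Firestore as the deduplication store. create()
# raises AlreadyExists if the document exists, so only the first
# delivery of a given message content proceeds. The collection name
# "processed_jobs" and the crawl() helper are hypothetical.
import hashlib
import json

from google.api_core.exceptions import AlreadyExists
from google.cloud import firestore

db = firestore.Client()

def handle_message(payload: dict) -> None:
    # Same content -> same key, whether it is a redelivery or a
    # distinct message with identical fields.
    key = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()
    try:
        db.collection("processed_jobs").document(key).create(payload)
    except AlreadyExists:
        return  # already crawled (or in progress): ack and skip
    crawl(payload["boardName"], payload["targetDate"])  # hypothetical crawler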

Related

Cloud Storage notifications delivery guarantees

I am using Cloud Storage notifications with Pub/Sub in my streaming pipeline.
I read the documentation about the delivery semantics of Cloud Storage notifications. It says they provide at-least-once delivery and do not guarantee that events are delivered in the same order in which the objects were uploaded (as I understand it, this means I can get several events with the same generation. Am I right?).
Notifications are not guaranteed to be published in the order Pub/Sub receives them.
Pub/Sub also offers at-least-once delivery to the recipient, which means that you could receive multiple messages, with multiple IDs, that represent the same Cloud Storage event.
I wrote a stateful DoFn in Apache Beam that keeps the latest (largest) processed generation as state so it can detect generations that arrive out of order or duplicated. I tested it by uploading objects to Cloud Storage once every three seconds, but I didn't catch any duplicated events or out-of-order generations.
My question is: what data volume or velocity would be needed to actually catch duplicated events or out-of-order generations?
Personally, I would not attempt the exercise you are asking about.
The reason is that you may never catch such events during your tests, yet they may still happen in production. And the other way around: you may see them in tests even though they never occur in prod.
That's how it's designed; those duplicates may be very rare, depending on Pub/Sub's running status, usage, network traffic, etc.
You just need to accommodate that behavior by making your event handler's logic idempotent.
Also, have a look at the Pub/Sub release notes: they have recently introduced an "exactly-once delivery" feature (maybe still in beta).
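For reference, a minimal sketch of such idempotent handling as a stateful DoFn in the Beam Python SDK (not the asker's exact code; it assumes the input PCollection is keyed by object name):

# Minimal sketch of a stateful DoFn that drops duplicated or
# out-of-order generations per object. Assumes elements are
# (object_name, generation) pairs; not the asker's exact code.
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class DropStaleGenerations(beam.DoFn):
    LATEST = ReadModifyWriteStateSpec("latest_generation", VarIntCoder())

    def process(self, element, latest=beam.DoFn.StateParam(LATEST)):
        _, generation = element
        seen = latest.read() or 0
        if generation <= seen:
            return  # duplicate or out-of-order event: ignore it
        latest.write(generation)
        yield element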

AWS Lambda best practices for Real Time Tracking

We currently run an AWS Lambda function that primarily just redirects the user to a different URL. The function is invoked via API Gateway.
For tracking purposes, we would like to create a widget on our dashboard that provides real-time insights into how many redirects are performed each second. The creation of the widget itself is not the problem.
My main question is which AWS service is best suited for telling our other services that an invocation took place. We plan to register the invocation in our database.
Some additional requirements:
low latency (< 5 seconds), so the data is real-time
nearly no added wait time for the user; we aim to redirect the user as fast as possible
Many thanks in advance!
Best Regards
Martin
I understand that your goal is to simply persist the information that an invocation happened somewhere with minimal impact on the response time of the Lambda.
For that purpose I'd probably use an SQS standard queue and just send a message to the queue that the invocation happened.
You can then have an asynchronous process (Lambda, Docker, EC2) process the messages from the queue and update your Dashboard.
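A minimal sketch of that approach, assuming an HTTP API (payload v2) event and a hypothetical TRACKING_QUEUE_URL environment variable:

# Minimal sketch: record the invocation in SQS, then redirect.
# TRACKING_QUEUE_URL and the target URL are placeholders.
import json
import os
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["TRACKING_QUEUE_URL"]

def handler(event, context):
    # One send_message call adds only a few milliseconds before the
    # redirect is returned; the dashboard consumer reads the queue.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            "path": event.get("rawPath", "/"),  # HTTP API v2 field
            "ts": time.time(),
        }),
    )
    return {
        "statusCode": 302,
        "headers": {"Location": "https://example.com/destination"},
    }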
Depending on the scalability requirements, looking into Kinesis Data Analytics might also be worth it.
It's a fully managed streaming data solution, and the analytics part allows you to do sliding-window analyses using SQL on data in the stream.
In that case you'd write the info that something happened to the stream, which also has low latency.
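And a sketch of the Kinesis variant, with a hypothetical stream name:

# Minimal sketch: write each redirect as a record to a Kinesis stream
# (stream name is a placeholder); an analytics job consumes it with
# sliding-window SQL.
import json
import time

import boto3

kinesis = boto3.client("kinesis")

def track(redirect_path: str) -> None:
    kinesis.put_record(
        StreamName="redirect-events",
        Data=json.dumps({"path": redirect_path, "ts": time.time()}).encode("utf-8"),
        PartitionKey=redirect_path,  # spreads records across shards
    )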

Concurrency Issues between Google Cloud Functions and Sheets v4 API

I have 2 problems related to managing concurrency between Google Cloud Functions.
The setup is: I have a Slackbot enabling the use of a "checkoff" slash command. This slash command sends another Slack user yes/no buttons asking whether to authorize the checkoff. When the user clicks an option, Slack sends that response to a Google Cloud Function which 1) sends a response back to Slack to close the buttons and 2) records the checkoff, if authorized, in a Google Sheet using the Sheets v4 API (spreadsheets.values.append).
Issue #1: Users who spam the yes/no buttons trigger multiple Slack requests to the Cloud Function before the Function can acknowledge and close the buttons. This leads to multiple Cloud Function instances spawning and multiple checkoffs being recorded in the sheet. If I could maintain state, I could save unique information from the request and check that the request had not already been serviced. Is there a pattern for doing this with Cloud Functions?
Issue #2: Sometimes multiple checkoffs are authorized at similar times by independent users. These requests spawn independent Cloud Function instances which attempt to append to the sheet. There is a rare case where another Function writes in between the first Function's read and its write, causing an overwrite. I would use a read-write lock to deal with this, but there's no way I'm aware of to share concurrency primitives between Cloud Functions.
(Less important) Issue #3: I'd really love to batch the spreadsheet writes, but it seems against the grain of serverless computing in the first place. Is there a way to do this?
Any help is appreciated.
I had a similar issue with Cloud Functions and Firestore. In my case I was receiving notifications about new and updated data in the form of 'order/123' and was then creating a copy of the order in Firestore. The problem was that sometimes multiple notifications arrived at the same time, resulting in duplicated orders because of race conditions.
My solution was to use Google Cloud Tasks (https://console.cloud.google.com/cloudtasks). I have one Cloud Function that receives the notification and adds a message to a queue that is processed with a concurrency of 1; another Cloud Function then takes care of the processing.
Receive notification -> Post message to queue (concurrency 1) -> Process message
In this case I have one queue per customer. I am sure there are better ways, but for now this is good enough. You can later route multiple customers to the same queue, but always keep the same customer on the same queue.
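For illustration, a minimal sketch of the enqueueing side with the google-cloud-tasks client, assuming the queue was created with max_concurrent_dispatches=1; all project, queue, and URL values are placeholders:

# Minimal sketch: push each notification onto a Cloud Tasks queue that
# dispatches to the processing function one task at a time. All names
# and URLs are placeholders.
import json

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
# The queue itself should be created with max_concurrent_dispatches=1.
parent = client.queue_path("my-project", "us-central1", "customer-123")

def enqueue(notification: dict) -> None:
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://us-central1-my-project.cloudfunctions.net/process",
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(notification).encode(),
        }
    }
    client.create_task(parent=parent, task=task)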

What happens to the data when uploading it to GCP BigQuery when there is no internet?

I am using GCP BigQuery to store some data, and I have created a Pub/Sub job for the Dataflow of the events. Currently, I am facing an issue with data loss. Sometimes, due to "no internet connection", the data is not uploaded to BigQuery, and the data for that time span is lost. How can I overcome this situation?
Or, what kind of database should I use to store data offline and then upload it online whenever there is connectivity?
Thank you in advance!
What you need is either a retry mechanism or persistent storage. There are several ways to implement this.
You can use a message queue to store and process the data. The queue can be either cloud-based, like AWS SQS or Cloud Pub/Sub (GCP), or self-hosted, like Kafka or RabbitMQ.
Another, somewhat less optimized, way would be to persist data locally until it is successfully uploaded to the cloud. The local storage can be a buffer, a database, etc. If an upload fails, you retry it from the local storage. This is similar to the classic producer-consumer problem.
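For example, a minimal sketch of that local-buffer idea, using SQLite as the offline store and Pub/Sub as the destination (file, table, and topic names are placeholders):

# Minimal sketch: persist events locally first, then drain the buffer
# to Pub/Sub whenever connectivity allows. Names are placeholders.
import json
import sqlite3

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic = publisher.topic_path("my-project", "events")

db = sqlite3.connect("buffer.db")
db.execute("CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, body TEXT)")

def record(event: dict) -> None:
    # Always write locally first so nothing is lost while offline.
    db.execute("INSERT INTO pending (body) VALUES (?)", (json.dumps(event),))
    db.commit()

def drain() -> None:
    # Call periodically; rows are deleted only after a confirmed publish.
    for row_id, body in db.execute("SELECT id, body FROM pending").fetchall():
        try:
            publisher.publish(topic, body.encode("utf-8")).result(timeout=30)
        except Exception:
            return  # still offline: keep the row and retry later
        db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
        db.commit()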
You can use a Google Compute Engine instance to store your data and always run your data-loading job from there. In that case, if your local internet connection is lost, the data will still continue to load into BigQuery.
From what I understood, you are publishing data to Pub/Sub and Dataflow does the rest to get the data into BigQuery, is that right?
The options I suggest to you:
If your connection loss happens occasionally and for short amounts of time, a retry mechanism could be enough to solve this problem.
If you have frequent connection loss, or connection loss for long periods of time, I suggest that you combine a retry mechanism with some process redundancy. You could, for example, have two processes running on different machines to avoid this kind of situation. It's important to mention that for this case you could also try only a retry mechanism, but it would be more complex, because you would need to determine whether the process failed, save the data somewhere (if it's not already saved), and trigger the process again in the future.
I suggest that you take a look at Apache NiFi. It's a very powerful data flow automation tool that might help you solve this kind of issue. Apache NiFi has specific processors for pushing data directly to Pub/Sub.
As a last suggestion, you could create an automated process to run data quality analysis after the data ingestion. With this process you could more easily determine whether your process failed.

Microservice for notifications

Most of us are familiar with the notifications that pop up at the top right of a Facebook page. We then ack them, and we can still see them afterwards with latest-first ordering, etc.
I have a question about designing such a thing (I'm also trying to do some reading on the side about the architecture for this). So at a high level, let us say there are these microservices running in the system:
S1
S2
S3
So let us say I have services like that. I am thinking of creating a new service, say S4, that can receive messages from these services. I am less worried about how these services will talk to S4; there are options like SQS, Kafka, etc.
My main question is how S4 can do the following:
Maintain these notifications in like what? PSQL?
Have a column with timestamp?
Do I need a timeseries DB for it?
How can I store them based on severity? Fatal, Info, critical?
How can someone ack a notification?
Below are some thoughts, but I don't think there is a one-size-fits-all approach. It really depends on the rest of your architecture, domain requirements, scalability requirements, etc.
Maintain these notifications in like what? PSQL?
Do you really need to store the notifications in addition to your event queues? It may depend on whether your queue design allows just-in-time access based on topics and message content. Many queue systems allow you to store a message indefinitely/until processed (check out Apache Pulsar). In that case there is a backlog of unacknowledged messages that can be accessed and processed whenever ready (e.g. once the client confirms the message was read). After acknowledgement you can archive or delete the events.
If you decide that doesn't work and you need an additional database, the usual suspects come into play: document stores, such as MongoDB, or a relational DB, such as CockroachDB.
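To illustrate the backlog approach, a minimal sketch with the Pulsar Python client (broker URL, topic, and subscription names are placeholders):

# Minimal sketch: unacknowledged messages stay in the subscription
# backlog until the client acks them. Topic/subscription names and the
# broker URL are placeholders.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")
consumer = client.subscribe("notifications", subscription_name="s4-dashboard")

msg = consumer.receive()   # oldest unacknowledged notification
show_to_user(msg.data())   # hypothetical UI call
consumer.acknowledge(msg)  # only now is it removed from the backlog
client.close()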
Have a column with timestamp?
Most queue systems record a timestamp by default; sometimes the queue order alone is enough. Depending on your query requirements, a schema-less format (e.g. JSON or a blob) for the message content, plus some explicit identifiers for finding the messages (by user, etc.), would be the minimum in most cases.
Do I need a timeseries DB for it?
Possibly useful if you need to track a large number of values changing over time, e.g. IoT-related sensor data such as temperature measurements. If you just have a handful of Facebook-style notifications per user, that does not seem like a good fit.
How can I store them based on severity? Fatal, Info, critical?
Here you could utilize separation by topic in the event queues; when using a separate DB, it again depends on whether you use a schema or a schema-less DB.
How can someone ack a notification?
As previously mentioned, modern event queue systems have that built in; if you use a separate DB, you'll have to add it to the schema (or use a schema-less DB). In either case you need to define and implement the behavior for message retention after acknowledgement, e.g. whether to delete or archive.
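Pulling the timestamp, severity, and acknowledgement points together, here is a minimal sketch of such a schema for PostgreSQL via psycopg2; all table and column names are illustrative:

# Minimal sketch of a notifications table for S4 covering timestamp,
# severity, and acknowledgement. All names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=notifications")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS notifications (
            id         BIGSERIAL PRIMARY KEY,
            user_id    TEXT        NOT NULL,
            severity   TEXT        NOT NULL
                       CHECK (severity IN ('info', 'critical', 'fatal')),
            payload    JSONB       NOT NULL,   -- schema-less message content
            created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
            acked_at   TIMESTAMPTZ             -- NULL until acknowledged
        )
    """)

def ack(notification_id: int) -> None:
    # Acknowledging just stamps the row; a periodic job can archive or
    # delete acked rows, per your retention policy.
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE notifications SET acked_at = now() WHERE id = %s",
            (notification_id,),
        )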