I am publishing a message to pub/sub using the gcloud command from the cloud shell like so:
gcloud pubsub topics publish <<some_topic>> \
--message={"ride_id":"3bdc2294-86a5-4f45-bb28-885d3a4c2ada","point_idx":1185,"latitude":40.76384,"longitude":-73.89548,"timestamp":"2022-02-10T02:24:06.11629-05:00","meter_reading":27.502796,"meter_increment":0.02320911,"ride_status":"enroute","passenger_count":3}
Now when I pull the messages in the consumer process using a subscription to the topic, I get a base64-encoded string for the Pub/Sub message (business as usual). But after I decode the message to UTF-8 it comes out as: passenger_count:3
which is only a truncated version of the entire message. Any explanation of this Pub/Sub behavior, as well as a possible fix or workaround, would be very helpful.
I am consuming the message with a Cloud Function that has a Pub/Sub trigger. The code looks something like this:
import base64
def subscribe_topic(event, context):
    # some code
    message = base64.b64decode(event['data']).decode('utf-8')
    print(message)
    # some more code
The subscribe_topic() function serves as the entry point to my CF. When I print the message it gets reflected in the CF logs, where I see the truncated message instead of the entire one.
More on CF with Pub/Sub triggers here
Related
I am setting up a Pub/Sub trigger-based cloud function in GCP; the topic the cloud function listens to is us-pubsub1. When I deployed the cloud function and used the testing panel to send messages like:
{"index":123,
"video_name":'test.mp4'}
the cloud function processed the message with the keys index and video_name without any issue. But when I sent a real message to us-pubsub1 to trigger the cloud function, it always failed because it could not find 'index' in the message body. When reading the Pub/Sub message in the cloud function, it returned a message like:
{'@type': 'type.googleapis.com/google.pubsub.v1.PubsubMessage',
'attributes': None, 'data':
'eyJzZXNzaW9uX2lkIjogImUzYjM0MTJiLWQxNWUtNDM5My05YjEyLWI3ZGY1ZGE4MTQ0NCIsICJzZXNzaW9uX25hbWUiOiAiU0VTU0lPTl9DMjIjMwMTI1VDIwMTI1NCIsICJzaXRlX25hbWUiOiAiVkEgTUVESUNBTCBDRU5URVIgLSBQQUxPIEFMVE8gLSc3RlY3RvbXkiLCAiaHViX3NlcmlhbF9udW1iZXIiOiAiQzIyNC0wMDIwNSIsICJjYXN0X2FwcF92ZXJzaW9uIjogIjEyLjAuMzIuNyIsICJkdl9zeXN0ZW0iOiAiVUFUU0syMDA3IiwgImludGVybmFsX2tleSI6ICJtNjQzY2U1NS0zNjNiLQ=='}
I checked that the message arrives in us-pubsub1 correctly and just fails to be processed in the cloud function.
Is there anything I have missed for fetching the message body in the real cloud function?
It's normal. You have a Pub/Sub message envelope and your content is base64-encoded in the data field. Here is the documentation with the details.
FWIW, here is the real content of your sample:
{"session_id": "e3b3412b-d15e-4393-9b12-b7df5da81444", "session_name": "SESSION_C22#3#UC##SB"'6FUR#%dTD44TDU"Dstectomy", "hub_serial_number": "C224-00205", "cast_app_version": "12.0.32.7", "dv_system": "UATSK2007", "internal_key": "m643ce55-363b-
It is truncated; you might not have shared the whole content ;)
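For completeness, a minimal sketch of how a background Cloud Function with a Pub/Sub trigger can unwrap that envelope (the function name is a placeholder, and the keys index and video_name are taken from the test message above):

import base64
import json

def process_message(event, context):
    # The real payload sits base64-encoded in the 'data' field of the envelope.
    payload = base64.b64decode(event['data']).decode('utf-8')
    body = json.loads(payload)
    # Keys taken from the test message shown above; the real messages must carry them too.
    index = body['index']
    video_name = body['video_name']
    print(index, video_name)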
I have been using this example of creating a Vertex AI monitoring job. It sends an email, and I have adapted it to send a Pub/Sub message, with @Jose Gutierrez Paliza's help.
I have got this working, sort of. But what seems to be happening is that Pub/Sub pushes the log to a function, which errors.
My log sink includes:
When I look at logs I see an INFO entry:
my-fn an_id Event data: {"insertId":"another_id...
followed by a separate ERROR entry:
...
ValueError: The pipeline parameter insertId is not found in the pipeline job input definitions.
So I assume Pub/Sub is sending the log entry to the function, which receives extraneous fields, including insertId.
I can run the pipeline fine via Jupyter:
import json
from google.cloud import pubsub

publish_client = pubsub.PublisherClient()
topic = f'projects/{PROJECT}/topics/{PUBSUB_TOPIC}'
data = {}
message = json.dumps(data)
_ = publish_client.publish(topic, message.encode())
So how do I do the equivalent via Pub/Sub?
The problem was the log entry being used (as data) in the following:
parameter_values = json.dumps(data).encode()
I set:
parameter_values={"project": project, "display_name": "some_name"}
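A minimal sketch of that idea on the subscriber side, under the assumption that the subscriber is a Cloud Function triggered by the log sink's Pub/Sub topic; the function name, project and parameter keys are placeholders, and the point is simply to build explicit parameter_values instead of forwarding the raw log entry:

import base64
import json

PROJECT = "my-project"  # placeholder

def trigger_pipeline(event, context):
    # The log sink delivers the full LogEntry, base64-encoded in 'data'.
    log_entry = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    # Build explicit pipeline parameters rather than passing the log entry through,
    # since it carries fields (insertId, logName, ...) the pipeline does not declare.
    # Fields from log_entry could feed parameter values if needed; here they are fixed.
    parameter_values = {"project": PROJECT, "display_name": "some_name"}
    print(parameter_values)
    # ... submit the pipeline job with parameter_values ...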
I have a cloud function which publishes a message to Pub/Sub, and that triggers a Cloud Run service to perform an archive file process. When there are large files, my Cloud Run Python code takes some time to process the data, and it looks like Pub/Sub is retrying the message after 20 seconds (the default acknowledgement deadline), which triggers another instance of my Cloud Run service. I've increased the acknowledgement deadline to 600s and redeployed everything, but it's still retrying the message after 20 seconds. Am I missing anything?
Cloud Function publishing the message code:
# Publishes a message
try:
    publish_future = publisher.publish(topic_path, data=message_bytes)
    publish_future.result()  # Verify the publish succeeded
    return 'Message published.'
except Exception as e:
    print(e)
    return (e, 500)
Here is the PubSub subscription config:
Logging showing a second instance being triggered after 20s:
Cloud Run code:
@app.route("/", methods=["POST"])
def index():
    envelope = request.get_json()
    if not envelope:
        msg = "no Pub/Sub message received"
        print(f"error: {msg}")
        return f"Bad Request: {msg}", 400

    if not isinstance(envelope, dict) or "message" not in envelope:
        msg = "invalid Pub/Sub message format"
        print(f"error: {msg}")
        return f"Bad Request: {msg}", 400

    pubsub_message = envelope["message"]

    if isinstance(pubsub_message, dict) and "data" in pubsub_message:
        # Decode base64 event['data']
        event_data = base64.b64decode(pubsub_message['data']).decode('utf-8')
        message = json.loads(event_data)

    # logic to process data/archive
    return ("", 204)
You should be able to control the retries by setting the minimumBackoff retry policy. You can set the minimumBackoff time to its maximum of 600 seconds, like your ack deadline, so that redelivered messages will be more than 600 seconds old. This should lower the number of occurrences you see.
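For reference, a minimal sketch of applying that retry policy to an existing subscription, assuming the google-cloud-pubsub 2.x client; the project and subscription names are placeholders:

from google.cloud import pubsub_v1
from google.protobuf import duration_pb2, field_mask_pb2

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")  # placeholders

# Redeliveries of nacked or expired messages will wait at least minimum_backoff.
subscription = pubsub_v1.types.Subscription(
    name=subscription_path,
    retry_policy=pubsub_v1.types.RetryPolicy(
        minimum_backoff=duration_pb2.Duration(seconds=600),
        maximum_backoff=duration_pb2.Duration(seconds=600),
    ),
)
update_mask = field_mask_pb2.FieldMask(paths=["retry_policy"])

with subscriber:
    subscriber.update_subscription(
        request={"subscription": subscription, "update_mask": update_mask}
    )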
To handle duplicates, making your subscriber idempotent is recommended. You need some kind of check in your code to see whether the messageId was processed before.
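A minimal sketch of such a check inside the Cloud Run handler, keyed on the envelope's messageId; the in-memory set is for illustration only (it is per instance), so a durable store such as Firestore or Redis would be needed in practice:

import base64
import json

processed_ids = set()  # illustration only: not shared across Cloud Run instances

def handle_envelope(envelope):  # hypothetical helper called from the Flask route
    message = envelope["message"]
    message_id = message["messageId"]

    if message_id in processed_ids:
        # Duplicate delivery: acknowledge with a 2xx without reprocessing.
        return ("", 204)

    processed_ids.add(message_id)
    event_data = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    # ... archive processing on event_data ...
    return ("", 204)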
You can find the following in the documentation on at-least-once delivery:
Typically, Pub/Sub delivers each message once and in the order in which it was published. However, messages may sometimes be delivered out of order or more than once. In general, accommodating more-than-once delivery requires your subscriber to be idempotent when processing messages. You can achieve exactly once processing of Pub/Sub message streams using the Apache Beam programming model. The Apache Beam I/O connectors let you interact with Cloud Dataflow via controlled sources and sinks. You can use the Apache Beam PubSubIO connector (for Java and Python) to read from Cloud Pub/Sub. You can also achieve ordered processing with Cloud Dataflow by using the standard sorting APIs of the service. Alternatively, to achieve ordering, the publisher of the topic to which you subscribe can include a sequence token in the message.
I have this simple Python function where I am just taking the input from a Pub/Sub topic and then printing it.
import base64, json

def hello_pubsub(event, context):
    """Triggered from a message on a Cloud Pub/Sub topic.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    pubsub_message = base64.b64decode(event['data'])
    data = json.loads(pubsub_message)
    for i in data:
        for k, v in i.items():
            print(k, v)
If I had used the pubsub_v1 library, I could have done the following:
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
def callback(message):
    message.ack()
subscriber.subscribe(subscription_path, callback=callback)
How do I ack the message in a Pub/Sub-triggered function?
Following your latest message, I understood the (common) mistake. With Pub/Sub, you have:
a topic, in which publishers can publish messages
(push or pull) subscriptions. All the messages published in the topic are duplicated in each subscription. The message queue belongs to each subscription.
Now, if you look closely at the subscriptions on your topic, you will have at least 2:
The pull subscription that you have created.
A push subscription created automatically when you deployed your Cloud Function on the topic.
The messages of the push subscription are correctly processed and acknowledged. However, those of the pull subscription aren't, because the Cloud Function doesn't consume and acknowledge them; the subscriptions are independent.
So, your Cloud Function code is correct!
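To check this, a minimal sketch that lists the subscriptions attached to the topic, assuming the google-cloud-pubsub 2.x client; the project and topic names are placeholders:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

# Expect both the pull subscription you created and the push subscription
# created for the Cloud Function to show up here.
for subscription_name in publisher.list_topic_subscriptions(request={"topic": topic_path}):
    print(subscription_name)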
The message should be ack'd automatically if the function terminates normally without an error.
How can I bulk move messages from one topic to another in GCP Pub/Sub?
I am aware of the Dataflow templates that provide this; unfortunately, restrictions do not allow me to use the Dataflow API.
Any suggestions on ad hoc movement of messages between topics (besides copying and pasting them one by one)?
Specifically, the use case is for moving messages in a deadletter topic back into the original topic for reprocessing.
You can't use snapshots, because snapshots can be applied only to subscriptions of the same topic (to avoid message ID overlapping).
The easiest way is to write a function that pulls from your subscription. Here is how I would do it:
Create a topic (named, for example, "transfer-topic") with a push subscription. Set the timeout to 10 minutes
Create an HTTP Cloud Function triggered by the Pub/Sub push subscription (or a Cloud Run service). When you deploy it, set the timeout to 9 minutes for the Cloud Function and to 10 minutes for Cloud Run. The processing does the following:
Read a chunk of messages (for example 1000) from the dead letter pull subscription
Publish the messages (in bulk mode) into the initial topic
Acknowledge the messages of the dead letter subscription
Repeat this until the pull subscription is empty
Return code 200.
The global process:
Publish a message in the transfer-topic
The message triggers the function/Cloud Run via an HTTP push
The process pulls the messages and republishes them into the initial topic
If the timeout is reached, the function crashes and Pub/Sub retries the HTTP request (with exponential backoff).
If all the messages are processed, the HTTP 200 response code is returned and the process stops (and the message in the transfer-topic subscription is acked)
This process allows you to handle a very large number of messages without worrying about the timeout.
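A minimal sketch of the core pull-and-republish loop described above, assuming the google-cloud-pubsub 2.x client; the project, subscription and topic names are placeholders:

from google.cloud import pubsub_v1

PROJECT = "my-project"                      # placeholder
DEADLETTER_SUB = "deadletter-subscription"  # placeholder
ORIGINAL_TOPIC = "original-topic"           # placeholder

subscriber = pubsub_v1.SubscriberClient()
publisher = pubsub_v1.PublisherClient()
subscription_path = subscriber.subscription_path(PROJECT, DEADLETTER_SUB)
topic_path = publisher.topic_path(PROJECT, ORIGINAL_TOPIC)

while True:
    # Read a chunk of messages from the dead letter pull subscription.
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 1000}
    )
    if not response.received_messages:
        break  # subscription looks empty: stop and return HTTP 200 from the handler

    # Republish the messages (with their attributes) into the initial topic.
    futures = [
        publisher.publish(topic_path, msg.message.data, **dict(msg.message.attributes))
        for msg in response.received_messages
    ]
    for future in futures:
        future.result()

    # Acknowledge the dead letter messages only once they have been republished.
    subscriber.acknowledge(
        request={
            "subscription": subscription_path,
            "ack_ids": [msg.ack_id for msg in response.received_messages],
        }
    )

Acknowledging only after the publish futures resolve avoids losing messages if a republish fails.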
I suggest that you use a Python script for that.
You can use the Pub/Sub client library to read the messages and publish to another topic like below:
from google.cloud import pubsub
from google.cloud.pubsub import types
# Defining parameters
PROJECT = "<your_project_id>"
SUBSCRIPTION = "<your_current_subscription_name>"
NEW_TOPIC = "projects/<your_project_id>/topics/<your_new_topic_name>"
# Creating clients for publishing and subscribing. Adjust the max_messages for your purpose
subscriber = pubsub.SubscriberClient()
publisher = pubsub.PublisherClient(
batch_settings=types.BatchSettings(max_messages=500),
)
# Get your messages. Adjust the max_messages for your purpose
subscription_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
response = subscriber.pull(subscription=subscription_path, max_messages=500)
# Publish your messages to the new topic
for msg in response.received_messages:
publisher.publish(NEW_TOPIC, msg.message.data)
# Ack the old subscription if necessary
ack_ids = [msg.ack_id for msg in response.received_messages]
subscriber.acknowledge(subscription=subscription_path, ack_ids=ack_ids)
Before running this code you will need to install the Pub/Sub client library in your Python environment. You can do that by running pip install google-cloud-pubsub
One approach to executing your code is using Cloud Functions. If you decide to use it, pay attention to two points:
The maximum time that your function can take to run is 9 minutes. If this timeout gets exceeded, your function will terminate without finishing the job.
In Cloud Functions you can just put google-cloud-pubsub on a new line of your requirements file instead of running a pip command.