How do I write a Cloud Function to receive, parse, and publish PubSub messages? - google-cloud-platform

This can be considered a follow-up to this thread, but I need more help with moving things along. Hopefully someone can have a look over my attempts below and provide further guidance.
To summarize, I need a Cloud Function that:
1. is triggered by a Pub/Sub message being published to topic A (this can be done in the UI);
2. reads the messy object change notification message from the "push" Pub/Sub topic A;
3. "parses" it;
4. publishes a message to Pub/Sub topic B, with the original message ID as data and other metadata (e.g. file name, size, time) as attributes.
Question 1:
Example of a messy object change notification:
\n "kind": "storage#object",\n "id": "bucketcfpubsub/test.txt/1544681756538155",\n "selfLink": "https://www.googleapis.com/storage/v1/b/bucketcfpubsub/o/test.txt",\n "name": "test.txt",\n "bucket": "bucketcfpubsub",\n "generation": "1544681756538155",\n "metageneration": "1",\n "contentType": "text/plain",\n "timeCreated": "2018-12-13T06:15:56.537Z",\n "updated": "2018-12-13T06:15:56.537Z",\n "storageClass": "STANDARD",\n "timeStorageClassUpdated": "2018-12-13T06:15:56.537Z",\n "size": "1938",\n "md5Hash": "sDSXIvkR/PBg4mHyIUIvww==",\n "mediaLink": "https://www.googleapis.com/download/storage/v1/b/bucketcfpubsub/o/test.txt?generation=1544681756538155&alt=media",\n "crc32c": "UDhyzw==",\n "etag": "CKvqjvuTnN8CEAE="\n}\n
To clarify: is this a message with a blank "data" field, with all the information above carried as attribute pairs (like "attribute name": "attribute value")? Or is it just one long string stuffed into the "data" field, with no "attributes"?
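One quick way I could check this myself, I think, is to log both fields from a throwaway background Cloud Function wired to topic A (the function name inspect_message is just a placeholder):
import base64
import json

def inspect_message(event, context):
    """Logs what actually arrives: the (base64-encoded) data payload and the attributes."""
    if event.get("data"):
        print("data:", base64.b64decode(event["data"]).decode("utf-8"))
    else:
        print("data: <empty>")
    print("attributes:", json.dumps(event.get("attributes", {})))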
Question 2:
In the above thread, a "pull" subscription is used. Is it better than using a "push" subscription? Push sample below:
def create_push_subscription(project_id,
                             topic_name,
                             subscription_name,
                             endpoint):
    """Create a new push subscription on the given topic."""
    # [START pubsub_create_push_subscription]
    from google.cloud import pubsub_v1

    # TODO project_id = "Your Google Cloud Project ID"
    # TODO topic_name = "Your Pub/Sub topic name"
    # TODO subscription_name = "Your Pub/Sub subscription name"
    # TODO endpoint = "https://my-test-project.appspot.com/push"

    subscriber = pubsub_v1.SubscriberClient()
    topic_path = subscriber.topic_path(project_id, topic_name)
    subscription_path = subscriber.subscription_path(
        project_id, subscription_name)

    push_config = pubsub_v1.types.PushConfig(
        push_endpoint=endpoint)

    subscription = subscriber.create_subscription(
        subscription_path, topic_path, push_config)

    print('Push subscription created: {}'.format(subscription))
    print('Endpoint for subscription is: {}'.format(endpoint))
    # [END pubsub_create_push_subscription]
Or do I need further code after this to receive messages?
Also, doesn't this create a new subscription every time the Cloud Function is triggered by a Pub/Sub message being published? Should I add subscription-deletion code at the end of the Cloud Function, or is there a more efficient way to do this?
Question 3:
Next, for parsing, this sample code pulls out a few attributes as follows:
def summarize(message):
    # [START parse_message]
    data = message.data
    attributes = message.attributes

    event_type = attributes['eventType']
    bucket_id = attributes['bucketId']
    object_id = attributes['objectId']
Will this work with the notification shown in Question 1?
Question 4:
How do I separate the topic_name? Steps 1 and 2 use topic A, while this step publishes to topic B. Is it as simple as rewriting the topic_name in the code example below?
# TODO topic_name = "Your Pub/Sub topic name"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)

for n in range(1, 10):
    data = u'Message number {}'.format(n)
    # Data must be a bytestring
    data = data.encode('utf-8')
    # Add two attributes, origin and username, to the message
    publisher.publish(
        topic_path, data, origin='python-sample', username='gcp')

print('Published messages with custom attributes.')
Source where I got most of the sample code from (besides the above thread): python-docs-samples. Will adapting and stringing the above code samples together produce useful code, or will I still be missing things like "import ****"?

You should not attempt to manually create a Subscriber running in Cloud Functions. Instead, follow the documentation here for setting up a Cloud Function which will be called with all messages sent to a given topic by passing the --trigger-topic command line parameter.
To address some of your other concerns:
“Should I add a subscription delete code at the end of the CF?” - Subscriptions are long-lived resources corresponding to a specific backlog of messages. If the subscription is created and deleted at the end of the Cloud Function, messages sent while it does not exist will not be received.
“How do I separate the topic_name?” - The topic_name in this example refers to the last part of a string formatted like projects/project_id/topics/topic_name, which will appear on this page in the Cloud Console for your topic after it has been created.
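Putting this together, here is a minimal sketch of what such a function could look like. It is not the exact sample from the documentation: the function name relay_notification, the placeholders PROJECT_ID and topic-B, and the choice of attributes are all illustrative, and it assumes the function is deployed against topic A so the GCS notification arrives as base64-encoded JSON in event["data"], with the Pub/Sub message ID available as context.event_id.
import base64
import json

from google.cloud import pubsub_v1

PROJECT_ID = "your-project-id"  # placeholder
OUT_TOPIC = "topic-B"           # placeholder for the second topic

publisher = pubsub_v1.PublisherClient()
out_topic_path = publisher.topic_path(PROJECT_ID, OUT_TOPIC)

def relay_notification(event, context):
    """Triggered by a message on topic A; republishes a summary to topic B."""
    # The object change notification arrives base64-encoded in event['data'].
    notification = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Use the original Pub/Sub message ID as the new message's data, and a few
    # notification fields as attributes (attribute values must be strings).
    publisher.publish(
        out_topic_path,
        data=context.event_id.encode("utf-8"),
        file_name=notification.get("name", ""),
        size=notification.get("size", ""),
        time_created=notification.get("timeCreated", ""),
    )
Deployment would then be something along the lines of gcloud functions deploy relay_notification --runtime python37 --trigger-topic topic-A, per the --trigger-topic approach described above; no subscription-management code is needed inside the function itself.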

Related

Can I create Slack subscriptions to an AWS SNS topic?

I'm trying to create an SNS topic in AWS and subscribe a Lambda function to it that will send notifications to Slack apps/users.
I did read this article:
https://aws.amazon.com/premiumsupport/knowledge-center/sns-lambda-webhooks-chime-slack-teams/
which describes how to do it using this Lambda code:
#!/usr/bin/python3.6
import urllib3
import json

http = urllib3.PoolManager()

def lambda_handler(event, context):
    url = "https://hooks.slack.com/services/xxxxxxx"
    msg = {
        "channel": "#CHANNEL_NAME",
        "username": "WEBHOOK_USERNAME",
        "text": event['Records'][0]['Sns']['Message'],
        "icon_emoji": ""
    }
    encoded_msg = json.dumps(msg).encode('utf-8')
    resp = http.request('POST', url, body=encoded_msg)
    print({
        "message": event['Records'][0]['Sns']['Message'],
        "status_code": resp.status,
        "response": resp.data
    })
but the problem is that with that implementation I have to create a Lambda function for every user.
I want to subscribe multiple Slack apps/users to one SNS topic.
Is there a way of doing that without creating a Lambda function for each one?
You really DON'T need Lambda. Just SNS and Slack are enough.
I found a way to integrate AWS SNS with Slack WITHOUT AWS Lambda or AWS Chatbot. With this approach you can confirm the subscription easily.
This video shows all the steps clearly:
https://www.youtube.com/watch?v=CszzQcPAqNM
Steps to follow:
1. Create a Slack channel or use an existing channel.
2. Create a workflow, selecting Webhook as the trigger.
3. Create a variable named "SubscribeURL". The name is very important.
4. Add that variable to the message body of the workflow, then publish the workflow and get the URL.
5. Add that URL as a subscription of the SNS topic. You will see the subscription URL in the Slack channel.
6. Follow the URL and complete the subscription.
7. Go back to the workflow and change the "SubscribeURL" variable to "Message".
8. Then publish a message in SNS; you will see the message in the Slack channel.
Hi, I would say you should go for a for loop and make a list of all the users. Either state them manually in the Lambda or get them with an API call to Slack, e.g. this one: https://api.slack.com/methods/users.list
#!/usr/bin/python3.6
import urllib3
import json

http = urllib3.PoolManager()

def lambda_handler(event, context):
    userlist = ["name1", "name2"]
    for user in userlist:
        url = "https://hooks.slack.com/services/xxxxxxx"
        msg = {
            "channel": "#" + user,  # not sure if the hash has to be here
            "username": "WEBHOOK_USERNAME",
            "text": event['Records'][0]['Sns']['Message'],
            "icon_emoji": ""
        }
        encoded_msg = json.dumps(msg).encode('utf-8')
        resp = http.request('POST', url, body=encoded_msg)
        print({
            "message": event['Records'][0]['Sns']['Message'],
            "status_code": resp.status,
            "response": resp.data
        })
Another solution is to set up email for the Slack users, see this link:
https://slack.com/help/articles/206819278-Send-emails-to-Slack
Then you can just add the emails as subscribers to the SNS topic. You can filter the messages that each receiver gets with a subscription filter policy.
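For reference, a filter policy can be attached to an SNS subscription with boto3; below is a rough sketch where the subscription ARN and the "team" message attribute are made-up placeholders:
import json
import boto3

sns = boto3.client("sns")

# Hypothetical subscription ARN; replace with your own.
subscription_arn = "arn:aws:sns:eu-west-1:123456789012:my-topic:example-subscription-id"

# Only deliver messages whose "team" message attribute equals "platform".
sns.set_subscription_attributes(
    SubscriptionArn=subscription_arn,
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps({"team": ["platform"]}),
)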

Google PubSub does not trigger Function with each new message

I have a Google PubSub stream that has a set of addresses added to it each day. I want each of these addresses processed by a triggered Google Cloud Function. However, what I have seen is that each address is only processed once even though there is a new message added to the stream each day.
My question is, if the same value is added to a stream each day will it be processed as a new message? Or will it be treated as a duplicate message?
This is the scenario I am seeing. Each day the locations Cloud Function is triggered and publishes each location to the locations topic. Most of the time these are the same messages as the previous day. The only time they change is if a location closes or there is a new one added. However, what I see is that many of the locations messages are never picked up by the location_metrics Cloud Function.
The flow of the functions is like this:
A topic (called locations_trigger) is triggered each day at 2 a.m. > triggers the locations Cloud Function > sends to the locations topic > triggers the location_metrics Cloud Function > sends to the location_metrics topic
For the locations Cloud Function, it is triggered and returns all addresses correctly then sends them to the locations topic. I won't put the whole function here because there are no problems with it. For each location it retrieves there is a "publish successful" message in the log. Here is the portion that sends the location details to the topic.
project_id = "project_id"
topic_name = "locations"
topic_id = "projects/project_id/topics/locations"
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)
try:
publisher.publish(topic_path, data=location_details.encode('utf-8'))
print("publish successful: ", location)
except Exception as exc:
print(exc)
An example location payload that is sent is:
{"id": "accounts/123456/locations/123456", "name": "Business 123 Main St Somewhere NM 10010"}
The location_metrics function looks like:
def get_metrics(loc):
    request_body = {
        "locationNames": [loc['id']],
        "basicRequest": {
            "metricRequests": [
                {
                    "metric": 'ALL',
                    "options": ['AGGREGATED_DAILY']
                }
            ],
            "timeRange": {
                "startTime": start_time_range,
                "endTime": end_time_range,
            },
        }
    }

    request_url = <request url>
    report_insights_response = http.request(request_url, "POST", body=json.dumps(request_body))
    report_insights_response = report_insights_response[1]
    report_insights_response = report_insights_response.decode().replace('\\n', '')
    report_insights_json = json.loads(report_insights_response)

    # <bunch of logic to parse the right metrics, am not including because this runs in a separate manual script without issue>

    my_data = json.dumps(my_data)

    project_id = "project_id"
    topic_name = "location-metrics"
    topic_id = "projects/project_id/topics/location-metrics"

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_name)
    print("publisher: ", publisher)
    print("topic_path: ", topic_path)

    try:
        publisher.publish(topic_path, data=gmb_data.encode('utf-8'))
        print("publish successful: ", loc['name'])
    except Exception as exc:
        print("topic publish failed: ", exc)

def retrieve_location(event, context):
    auth_flow()
    message_obj = event.data
    message_dcde = message_obj.decode('utf-8')
    message_json = json.loads(message_dcde)
    get_metrics(message_json)

subFolder is empty when using a Google IoT Core gateway and Pub/Sub

I have a device publishing through a gateway on the events topic (/devices/<dev_id>/events/motion) to PubSub. It's landing in PubSub correctly but subFolder is just an empty string.
On the gateway I'm publishing using the code below. f"mb.{device_id}" is the device ID (not the gateway ID), and attribute could be anything - motion, temperature, etc.
def report(self, device_id, attribute, value):
    topic = f"/devices/mb.{device_id}/events/{attribute}"
    timestamp = datetime.utcnow().timestamp()
    client.publish(topic, json.dumps({"v": value, "ts": timestamp}))
And this is the cloud function listening on the PubSub queue.
def iot_to_bigtable(event, context):
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    timestamp = payload.get("ts")
    value = payload.get("v")
    if not timestamp or value is None:
        raise BadDataException()

    attributes = event.get("attributes", {})
    device_id = attributes.get("deviceId")
    registry_id = attributes.get("deviceRegistryId")
    attribute = attributes.get("subFolder")
    if not device_id or not registry_id or not attribute:
        raise BadDataException()
A sample of the event in Pub/Sub:
{
  #type: 'type.googleapis.com/google.pubsub.v1.PubsubMessage',
  attributes: {
    deviceId: 'mb.26727bab-0f37-4453-82a4-75d93cb3f374',
    deviceNumId: '2859313639674234',
    deviceRegistryId: 'mb-staging',
    deviceRegistryLocation: 'europe-west1',
    gatewayId: 'mb.42e29cd5-08ad-40cf-9c1e-a1974144d39a',
    projectId: 'mb-staging',
    subFolder: ''
  },
  data: 'eyJ2IjogImxvdyIsICJ0cyI6IDE1OTA3NjgzNjcuMTMyNDQ4fQ=='
}
Why is subFolder empty? Based on the docs I was expecting it to be the attribute (i.e. motion or temperature)
This issue has nothing to do with Cloud IoT Core. It is instead caused by how Pub/Sub handles failed messages: it was retrying messages from ~12 hours ago that had failed (and didn't have an attribute).
You can fix this by purging the subscription in Pub/Sub.
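For reference, one way to purge is to seek the subscription to the current time, which marks everything already in the backlog as acknowledged. A minimal sketch, assuming the google-cloud-pubsub v2 Python client and placeholder project/subscription names (the gcloud equivalent is gcloud pubsub subscriptions seek SUBSCRIPTION --time=...):
from google.cloud import pubsub_v1
from google.protobuf import timestamp_pb2

subscriber = pubsub_v1.SubscriberClient()
# Placeholder names; replace with your own project and subscription.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# Seeking to "now" acknowledges every message currently in the backlog.
now = timestamp_pb2.Timestamp()
now.GetCurrentTime()
subscriber.seek(request={"subscription": subscription_path, "time": now})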

How to read and parse data from PubSub topic into a beam pipeline and print it

I have a program which creates a topic in Pub/Sub and also publishes messages to it. I also have an automated Dataflow job (using a template) which saves these messages into my BigQuery table. Now I intend to replace the template-based job with a Python pipeline where my requirement is to read data from Pub/Sub, apply transformations, and save the data into BigQuery / publish it to another Pub/Sub topic. I started writing the script in Python and did a lot of trial and error to achieve it, but to my dismay I could not. The code looks like this:
import apache_beam as beam
from apache_beam.io import WriteToText

TOPIC_PATH = "projects/test-pipeline-253103/topics/test-pipeline-topic"
OUTPUT_PATH = "projects/test-pipeline-253103/topics/topic-repub"

def run():
    o = beam.options.pipeline_options.PipelineOptions()
    p = beam.Pipeline(options=o)
    print("I reached here")
    # Read from PubSub into a PCollection.
    data = (
        p
        | "Read From Pub/Sub" >> beam.io.ReadFromPubSub(topic=TOPIC_PATH)
    )
    data | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
    print("Lines: ", data)

run()
I would really appreciate some help at the earliest.
Note: I have my project set up on Google Cloud and I have my script running locally.
Here is the working code.
import apache_beam as beam

TOPIC_PATH = "projects/test-pipeline-253103/topics/test-pipeline-topic"
OUTPUT_PATH = "projects/test-pipeline-253103/topics/topic-repub"

class PrintValue(beam.DoFn):
    def process(self, element):
        print(element)
        return [element]

def run():
    o = beam.options.pipeline_options.PipelineOptions()
    # Or pass the --streaming execution param instead of setting this here
    standard_options = o.view_as(beam.options.pipeline_options.StandardOptions)
    standard_options.streaming = True
    p = beam.Pipeline(options=o)
    print("I reached here")

    # Read from Pub/Sub, print each element, and republish it.
    data = p | beam.io.ReadFromPubSub(topic=TOPIC_PATH) | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=OUTPUT_PATH)

    # Don't forget to run the pipeline!
    result = p.run()
    result.wait_until_finish()

run()
In summary:
You forgot to run the pipeline. Beam is a graph programming model, so in your previous code you built the graph but never ran it. Here, at the end, we run it (a non-blocking call) and then wait for it to finish (a blocking call).
When you start your pipeline, Beam points out that Pub/Sub only works in streaming mode. So you can either start your code with the --streaming param, or set it programmatically as shown in my code.
Be careful: streaming mode means listening indefinitely on Pub/Sub. If you run this on Dataflow, your pipeline will stay up until you stop it, which can get expensive if you only have a few messages. Make sure that is the model you want.
An alternative is to run your pipeline for a limited period of time (one scheduler starts it, another stops it). But then you have to let messages stack up somewhere. Here you use a topic as the entry of the pipeline: this forces Beam to create a temporary subscription and listen for messages on that subscription, which means messages published before the subscription was created won't be received or processed.
The idea is to create a subscription yourself, so that messages stack up in it (for up to 7 days by default). Then use the subscription name as the entry of your pipeline: beam.io.ReadFromPubSub(subscription=SUB_PATH). The messages will be drained and processed by Beam (ordering is not guaranteed!).
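A minimal sketch of that subscription-based variant, assuming a subscription created beforehand whose path (SUB_PATH below) is a placeholder:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Hypothetical pre-created subscription on the input topic.
SUB_PATH = "projects/test-pipeline-253103/subscriptions/test-pipeline-sub"
OUTPUT_PATH = "projects/test-pipeline-253103/topics/topic-repub"

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

# The context manager runs the pipeline and waits for it on exit.
with beam.Pipeline(options=options) as p:
    (p
     | "Read from subscription" >> beam.io.ReadFromPubSub(subscription=SUB_PATH)
     | "Republish" >> beam.io.WriteToPubSub(topic=OUTPUT_PATH))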
Based on the Beam programming guide, you simply have to add a transform step in your pipeline. Here is an example of a transform:
class PrintValue(beam.DoFn):
    def process(self, element):
        print(element)
        return [element]
Add it to your pipeline
data | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
You can add as many transforms as you want. You can test each value and emit elements to tagged PCollections (to get multiple outputs) for fan-out, or use side inputs to fan in PCollections; see the sketch below.
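For example, a rough sketch of the tagged-output (fan-out) pattern; the 1 KB size threshold and the tag names are made up for illustration:
import apache_beam as beam

class SplitBySize(beam.DoFn):
    def process(self, element):
        # Route large payloads to a separate tagged output;
        # everything else goes to the main ('small') output.
        if len(element) > 1024:
            yield beam.pvalue.TaggedOutput('large', element)
        else:
            yield element

# 'data' is the PCollection read from Pub/Sub in the pipeline above.
results = data | beam.ParDo(SplitBySize()).with_outputs('large', main='small')
small_msgs = results.small
large_msgs = results.large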

Tagging EMR clusters via an AWS Lambda triggered by a CloudWatch event rule

I need to catch the RunJobFlow event in my CloudWatch event rule in order to tag starting AWS EMR clusters.
I'm looking for this event because I need the username and account information.
Any idea?
Thanks
Calls to the ListClusters, DescribeCluster, and RunJobFlow actions generate entries in CloudTrail log files.
Every log entry contains information about who generated the request. For example, if a request is made to create and run a new job flow (RunJobFlow), CloudTrail logs the user identity of the person or service that made the request.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/logging_emr_api_calls.html#understanding_emr_log_file_entries
Here is a sample snippet to get the username using Python Boto3.
import boto3

cloudtrail = boto3.client("cloudtrail")

response = cloudtrail.lookup_events(
    LookupAttributes=[
        {
            'AttributeKey': 'EventName',
            'AttributeValue': 'RunJobFlow'
        }
    ],
)

for event in response.get("Events"):
    print(event.get("Username"))
The username and cluster details can be retrieved from the RunJobFlow event itself. An easier solution would be to use a CloudWatch event rule with a Lambda function as the target to fetch this info, after which further action can be taken as required. Example below:
Event pattern to be used with the CloudWatch event rule:
{
  "source": ["aws.elasticmapreduce"],
  "detail": {
    "eventName": ["RunJobFlow"]
  }
}
Lambda code snippet
def lambda_handler(event, context):
    # print("Received event: " + json.dumps(event, indent=2))
    user = event['detail']['userIdentity']['userName']
    cluster_id = event['detail']['responseElements']['jobFlowId']
    region = event['region']
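To actually tag the cluster from there, the handler could be extended with the EMR AddTags API; a minimal sketch, where the "owner" tag key is chosen purely for illustration:
import boto3

def lambda_handler(event, context):
    user = event['detail']['userIdentity']['userName']
    cluster_id = event['detail']['responseElements']['jobFlowId']
    region = event['region']

    # Tag the freshly started cluster with the user who launched it.
    emr = boto3.client("emr", region_name=region)
    emr.add_tags(
        ResourceId=cluster_id,
        Tags=[{"Key": "owner", "Value": user}],  # "owner" is an example tag key
    )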