I have a Google Pub/Sub topic that has a set of addresses published to it each day. I want each of these addresses to be processed by a triggered Google Cloud Function. However, what I am seeing is that each address is only processed once, even though a new message is added to the topic each day.
My question is: if the same value is published to a topic each day, will it be processed as a new message, or will it be treated as a duplicate?
This is the scenario I am seeing. Each day the locations Cloud Function is triggered and publishes each location to the locations topic. Most of the time these are the same messages as the previous day; they only change if a location closes or a new one is added. However, many of the locations messages are never picked up by the location_metrics Cloud Function.
The flow of the functions is like this:
A message is published to the locations_trigger topic each day at 2 AM > triggers the locations Cloud Function > sends to the locations topic > triggers the location_metrics Cloud Function > sends to the location_metrics topic.
For the locations Cloud Function: it is triggered, returns all addresses correctly, then sends them to the locations topic. I won't include the whole function here because there are no problems with it. For each location it retrieves there is a "publish successful" message in the log. Here is the portion that sends the location details to the topic:
project_id = "project_id"
topic_name = "locations"
topic_id = "projects/project_id/topics/locations"
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)
try:
publisher.publish(topic_path, data=location_details.encode('utf-8'))
print("publish successful: ", location)
except Exception as exc:
print(exc)
An example location payload that is sent is:
{"id": "accounts/123456/locations/123456", "name": "Business 123 Main St Somewhere NM 10010"}
The location_metrics function looks like:
def get_metrics(loc):
request_body = {
"locationNames": [ loc['id'] ],
"basicRequest" : {
"metricRequests": [
{
"metric": 'ALL',
"options": ['AGGREGATED_DAILY']
}
],
"timeRange": {
"startTime": start_time_range,
"endTime": end_time_range,
},
}
}
request_url = <request url>
report_insights_response = http.request(request_url, "POST", body=json.dumps(request_body))
report_insights_response = report_insights_response[1]
report_insights_response = report_insights_response.decode().replace('\\n','')
report_insights_json = json.loads(report_insights_response)
<bunch of logic to parse the right metrics, am not including because this runs in a separate manual script without issue>
my_data = json.dumps(my_data)
project_id = "project_id"
topic_name = "location-metrics"
topic_id = "projects/project_id/topics/location-metrics"
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)
print("publisher: ", publisher)
print("topic_path: ", topic_path)
try:
        publisher.publish(topic_path, data=my_data.encode('utf-8'))
print("publish successful: ", loc['name'])
except Exception as exc:
print("topic publish failed: ", exc)
def retrieve_location(event, context):
auth_flow()
message_obj = event.data
message_dcde = message_obj.decode('utf-8')
message_json = json.loads(message_dcde)
get_metrics(message_json)
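In the Python runtime for background Cloud Functions, the Pub/Sub message usually arrives as a dict whose data field is a base64-encoded string, rather than as an object with a .data attribute. If that is the case here, the trigger handler would look something like this minimal sketch (same payload shape as above assumed):
import base64
import json

def retrieve_location(event, context):
    auth_flow()
    # event['data'] holds the Pub/Sub payload as a base64-encoded string
    message_json = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    get_metrics(message_json)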
I have a few Google Cloud Storage Transfer jobs running in my GCP account, which transfer data from Azure to a GCS bucket.
As per this document - https://cloud.google.com/storage-transfer/docs/reference/rest/v1/transferJobs/get?apix_params=%7B%22jobName%22%3A%22transferJobs%2F213858246512856794%22%2C%22projectId%22%3A%22merlincloud-gcp-preprod%22%7D
the "get" method can fetch details of the job like name, description, bucketName, status, includePrefixes, storageAccount and so on.
Here's the sample output of "get" method.
{
"name": "transferJobs/<job_name>",
"description": "<description given while creating job>",
"projectId": "<project_id>",
"transferSpec": {
"gcsDataSink": {
"bucketName": "<destination_bucket>"
},
"objectConditions": {
"includePrefixes": [
"<prefix given while creating job>"
],
"lastModifiedSince": "2021-06-30T18:30:00Z"
},
"transferOptions": {
},
"azureBlobStorageDataSource": {
"storageAccount": "<account_name>",
"container": "<container_name>"
}
},
"schedule": {
"scheduleStartDate": {
"year": 2021,
"month": 7,
"day": 1
},
"startTimeOfDay": {
"hours": 13,
"minutes": 45
},
"repeatInterval": "86400s"
},
"status": "ENABLED",
"creationTime": "2021-07-01T06:08:19.392111916Z",
"lastModificationTime": "2021-07-01T06:13:32.460934533Z",
"latestOperationName": "transferOperations/transferJobs-<job_name>"
}
Now, how do I fetch the run history details of a particular job in Python?
By "run history details" I mean the metrics (data transferred, number of files, status, size, duration) displayed in the GTS console.
I'm unfamiliar with the transfer service but I'm very familiar with GCP.
The only other resource that's provided by the service is transferOperations.
Does that provide the data you need?
If not (!), it's possible that Google hasn't exposed this functionality beyond the Console. This happens occasionally even though the intent is always to be (public) API first.
One way you can investigate is to check the browser's developer tools 'network' tab to see what REST API calls the Console is making to fulfill the request. Another way is to use the equivalent gcloud command and tack on --log-http to see the underlying REST API calls that way.
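If you want to poke at transferOperations directly from Python, a minimal sketch with the discovery client might look like this (assuming application-default credentials; the project and job names are placeholders, and the filter fields mirror the ones used in the answer below):
import json
from googleapiclient import discovery

service = discovery.build('storagetransfer', 'v1')

# transferOperations.list takes a JSON-encoded filter string
operations = service.transferOperations().list(
    name='transferOperations',
    filter=json.dumps({"projectId": "<your_project_id>",
                       "jobNames": ["transferJobs/<your_job_name>"]})
).execute()

for op in operations.get('operations', []):
    metadata = op['metadata']
    print(metadata.get('status'), metadata.get('counters', {}))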
As @DazWilkin mentioned, I was able to fetch each job's run history details using the transferOperations.list API.
I wrote a Cloud Function to fetch GTS metrics by making API calls.
It first makes a transferJobs.list API call to fetch the list of jobs and picks out only the required job's details. It then makes a transferOperations.list call, passing the job name, to fetch the run history details.
Here's the code:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
from datetime import datetime
import json
import logging
"""
requirements.txt
google-api-python-client==2.3.0
oauth2client==4.1.3
"""
class GTSMetrics:
def __init__(self):
self.project = "<your_gcp_project_name>"
self.source_type_mapping = {"gcsDataSource": "Google Cloud Storage", "awsS3DataSource": "Amazon S3",
"azureBlobStorageDataSource": "Azure Storage"}
self.transfer_job_names = ["transferJobs/<your_job_name>"]
self.credentials = GoogleCredentials.get_application_default()
self.service = discovery.build('storagetransfer', 'v1', credentials=self.credentials)
self.metric_values = {}
def build_run_history_metrics(self, job=None):
try:
if job:
                # The API expects the filter to be a JSON-encoded string
                operation_filters = json.dumps({"projectId": self.project, "jobNames": [job['name']]})
                request = self.service.transferOperations().list(name='transferOperations',
                                                                 filter=operation_filters)
while request is not None:
response = request.execute()
if 'operations' in response:
self.metric_values['total_runs'] = len(response['operations'])
metadata = response['operations'][0]['metadata']
status = metadata['status'] if 'status' in metadata else ""
start_time = metadata['startTime'] if 'startTime' in metadata else ""
end_time = metadata['endTime'] if 'endTime' in metadata else ""
start_time_object = datetime.strptime(start_time[:-4], "%Y-%m-%dT%H:%M:%S.%f")
end_time_object = datetime.strptime(end_time[:-4], "%Y-%m-%dT%H:%M:%S.%f")
gts_copy_duration = end_time_object - start_time_object
self.metric_values['latest_run_status'] = status
self.metric_values['latest_run_time'] = str(start_time_object)
self.metric_values['latest_run_errors'] = ""
self.metric_values['start_time'] = str(start_time_object)
self.metric_values['end_time'] = str(end_time_object)
self.metric_values['duration'] = gts_copy_duration.total_seconds()
if status == "FAILED":
if 'errorBreakdowns' in metadata:
errors = metadata['errorBreakdowns'][0]['errorCount']
error_code = metadata['errorBreakdowns'][0]['errorCode']
self.metric_values['latest_run_errors'] = f"{errors} - {error_code}"
elif status == "SUCCESS":
counters = metadata['counters']
data_bytes = counters['bytesCopiedToSink'] if 'bytesCopiedToSink' in counters else '0 B'
obj_from_src = str(
counters['objectsFoundFromSource']) if 'objectsFoundFromSource' in counters else 0
obj_copied_sink = str(
counters['objectsCopiedToSink']) if 'objectsCopiedToSink' in counters else 0
data_skipped_bytes = counters[
'bytesFromSourceSkippedBySync'] if 'bytesFromSourceSkippedBySync' in counters else '0 B'
data_skipped_files = counters[
'objectsFromSourceSkippedBySync'] if 'objectsFromSourceSkippedBySync' in counters else '0'
self.metric_values['data_transferred'] = data_bytes
self.metric_values['files_found_in_source'] = obj_from_src
self.metric_values['files_copied_to_sink'] = obj_copied_sink
self.metric_values['data_skipped_in_bytes'] = data_skipped_bytes
self.metric_values['data_skipped_files'] = data_skipped_files
break
# request = self.service.transferOperations().list_next(previous_request=request,
# previous_response=response)
except Exception as e:
logging.error(f"Exception in build_run_history_metrics - {str(e)}")
def build_job_metrics(self, job):
try:
transfer_spec = list(job['transferSpec'].keys())
source = ""
source_type = ""
if "gcsDataSource" in transfer_spec:
source_type = self.source_type_mapping["gcsDataSource"]
source = job['transferSpec']["gcsDataSource"]["bucketName"]
elif "awsS3DataSource" in transfer_spec:
source_type = self.source_type_mapping["awsS3DataSource"]
source = job['transferSpec']["awsS3DataSource"]["bucketName"]
elif "azureBlobStorageDataSource" in transfer_spec:
source_type = self.source_type_mapping["azureBlobStorageDataSource"]
frequency = "Once"
schedule = list(job['schedule'].keys())
if "repeatInterval" in schedule:
interval = job['schedule']['repeatInterval']
if interval == "86400s":
frequency = "Every day"
elif interval == "604800s":
frequency = "Every week"
else:
frequency = "Custom"
prefix = ""
if 'objectConditions' in transfer_spec:
obj_con = job['transferSpec']['objectConditions']
if 'includePrefixes' in obj_con:
prefix = job['transferSpec']['objectConditions']['includePrefixes'][0]
self.metric_values['job_description'] = job['description']
self.metric_values['job_name'] = job['name']
self.metric_values['source_type'] = source_type
self.metric_values['source'] = source
self.metric_values['destination'] = job['transferSpec']['gcsDataSink']['bucketName']
self.metric_values['frequency'] = frequency
self.metric_values['prefix'] = prefix
except Exception as e:
logging.error(f"Exception in build_job_metrics - {str(e)}")
def build_metrics(self):
try:
request = self.service.transferJobs().list(pageSize=None, pageToken=None, x__xgafv=None,
ilter={"projectId": self.project})
while request is not None:
response = request.execute()
for transfer_job in response['transferJobs']:
if transfer_job['name'] in self.transfer_job_names:
# fetch job details
self.build_job_metrics(job=transfer_job)
# fetch run history details for the job
self.build_run_history_metrics(job=transfer_job)
request = self.service.transferJobs().list_next(previous_request=request, previous_response=response)
logging.info(f"GTS Metrics - {str(self.metric_values)}")
except Exception as e:
logging.error(f"Exception in build_metrics - {str(e)}")
def build_gts_metrics(request):
gts_metrics = GTSMetrics()
gts_metrics.build_metrics()
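For a quick local smoke test (an assumption on my part; the entry point ignores the request argument), you can call the function directly with application-default credentials configured:
if __name__ == "__main__":
    build_gts_metrics(None)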
I have a device publishing through a gateway on the events topic (/devices/<dev_id>/events/motion) to PubSub. It's landing in PubSub correctly but subFolder is just an empty string.
On the gateway I'm publishing using the code below. f"mb.{device_id}" is the device ID (not the gateway ID), and attribute could be anything - motion, temperature, etc.
def report(self, device_id, attribute, value):
topic = f"/devices/mb.{device_id}/events/{attribute}"
timestamp = datetime.utcnow().timestamp()
client.publish(topic, json.dumps({"v": value, "ts": timestamp}))
And this is the cloud function listening on the PubSub queue.
def iot_to_bigtable(event, context):
payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
timestamp = payload.get("ts")
value = payload.get("v")
if not timestamp or value is None:
raise BadDataException()
attributes = event.get("attributes", {})
device_id = attributes.get("deviceId")
registry_id = attributes.get("deviceRegistryId")
attribute = attributes.get("subFolder")
if not device_id or not registry_id or not attribute:
raise BadDataException()
A sample of the event in Pub/Sub:
{
#type: 'type.googleapis.com/google.pubsub.v1.PubsubMessage',
attributes: {
deviceId: 'mb.26727bab-0f37-4453-82a4-75d93cb3f374',
deviceNumId: '2859313639674234',
deviceRegistryId: 'mb-staging',
deviceRegistryLocation: 'europe-west1',
gatewayId: 'mb.42e29cd5-08ad-40cf-9c1e-a1974144d39a',
projectId: 'mb-staging',
subFolder: ''
},
data: 'eyJ2IjogImxvdyIsICJ0cyI6IDE1OTA3NjgzNjcuMTMyNDQ4fQ=='
}
Why is subFolder empty? Based on the docs I was expecting it to be the attribute (i.e. motion or temperature)
This issue has nothing to do with Cloud IoT Core. It is instead caused by how Pub/Sub handles failed messages. It was retrying messages from ~12 hours ago that had failed (and didn't have an attribute).
You can fix this by purging the subscription in Pub/Sub.
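One way to purge is to seek the subscription to the current time, which acknowledges every message published before that instant. A minimal sketch, assuming google-cloud-pubsub 2.x and hypothetical project/subscription names:
from datetime import datetime, timezone
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# Seeking to "now" acks everything published before this point,
# effectively purging the backlog of old, retried messages.
subscriber.seek(request={"subscription": subscription_path,
                         "time": datetime.now(timezone.utc)})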
I have a .NET Core client application using Amazon Textract with S3, SNS, and SQS as per the AWS document Detecting and Analyzing Text in Multipage Documents (https://docs.aws.amazon.com/textract/latest/dg/async.html).
I created an AWS role with the AmazonTextractServiceRole policy and added the following trust relationship as per the documentation (https://docs.aws.amazon.com/textract/latest/dg/api-async-roles.html):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "textract.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
I subscribed the SQS queue to the SNS topic and gave the SNS topic permission to send messages to the SQS queue as per the AWS documentation.
All resources, including the S3 bucket, SNS topic, and SQS queue, are in the same us-west-2 region.
The following method throws a generic InvalidParameterException:
Request has invalid parameters
But if the NotificationChannel section is commented out, the code works fine and returns the correct job ID.
The error message does not give a clear picture of which parameter is invalid. Any help is highly appreciated.
public async Task<string> ScanDocument()
{
string roleArn = "aws:iam::xxxxxxxxxxxx:instance-profile/MyTextractRole";
string topicArn = "aws:sns:us-west-2:xxxxxxxxxxxx:AmazonTextract-My-Topic";
string bucketName = "mybucket";
string filename = "mytestdoc.pdf";
var request = new StartDocumentAnalysisRequest();
var notificationChannel = new NotificationChannel();
notificationChannel.RoleArn = roleArn;
notificationChannel.SNSTopicArn = topicArn;
var s3Object = new S3Object
{
Bucket = bucketName,
Name = filename
};
request.DocumentLocation = new DocumentLocation
{
S3Object = s3Object
};
request.FeatureTypes = new List<string>() { "TABLES", "FORMS" };
    request.NotificationChannel = notificationChannel; /* Commenting out this line makes the code work */
var response = await this._textractService.StartDocumentAnalysisAsync(request);
return response.JobId;
}
Debugging Invalid AWS Requests
The AWS SDK validates your request object locally, before dispatching it to the AWS servers. This validation can fail with unhelpfully opaque errors, like the one the OP hit.
As the SDK is open source, you can inspect the source to help narrow down the invalid parameter.
Before we look at the code: the SDK (and its documentation) are actually generated from special JSON files that describe the API, its requirements, and how to validate them.
I'm going to use the Node.js SDK as an example, but similar approaches should work for the other SDKs, including .NET.
In our case (AWS Textract), the latest API version is 2018-06-27. Sure enough, the JSON source file is on GitHub, here.
In my case, experimentation narrowed the issue down to the ClientRequestToken. The error was an opaque InvalidParameterException. I searched for it in the SDK source JSON file, and sure enough, on line 392:
"ClientRequestToken": {
"type": "string",
"max": 64,
"min": 1,
"pattern": "^[a-zA-Z0-9-_]+$"
},
A whole bunch of undocumented requirements!
In my case the token I was using violated the regex (pattern in the above source code). Changing my token code to satisfy the regex solved the problem.
I recommend this approach for these sorts of opaque type errors.
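For example, once you know the constraints, a small guard in your own code can catch a bad token before the SDK does. A sketch in Python (the .NET equivalent is analogous; the helper name is mine):
import re
import uuid

TOKEN_PATTERN = re.compile(r'^[a-zA-Z0-9-_]+$')

def make_client_request_token(candidate):
    """Return a token satisfying Textract's constraints (1-64 chars, pattern above),
    falling back to a random token if the candidate is invalid."""
    token = candidate[:64]
    if token and TOKEN_PATTERN.match(token):
        return token
    return uuid.uuid4().hex  # 32 hex chars, always matches the pattern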
After a long day analyzing the issue, I was able to resolve it. As per the documentation, the topic only requires the SendMessage action on the SQS queue, but after changing it to all SQS actions it started working. The AWS error message is still really misleading and confusing.
You would need to change the permissions to allow all SQS actions and then use code like the below:
import time

import boto3

# Assumes a Textract client; region and credentials come from your environment
textract = boto3.client('textract')

def startJob(s3BucketName, objectName):
response = None
response = textract.start_document_text_detection(
DocumentLocation={
'S3Object': {
'Bucket': s3BucketName,
'Name': objectName
}
})
return response["JobId"]
def isJobComplete(jobId):
# For production use cases, use SNS based notification
# Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
time.sleep(5)
response = textract.get_document_text_detection(JobId=jobId)
status = response["JobStatus"]
print("Job status: {}".format(status))
while(status == "IN_PROGRESS"):
time.sleep(5)
response = textract.get_document_text_detection(JobId=jobId)
status = response["JobStatus"]
print("Job status: {}".format(status))
return status
def getJobResults(jobId):
pages = []
response = textract.get_document_text_detection(JobId=jobId)
pages.append(response)
print("Resultset page recieved: {}".format(len(pages)))
nextToken = None
if('NextToken' in response):
nextToken = response['NextToken']
while(nextToken):
response = textract.get_document_text_detection(JobId=jobId, NextToken=nextToken)
pages.append(response)
print("Resultset page recieved: {}".format(len(pages)))
nextToken = None
if('NextToken' in response):
nextToken = response['NextToken']
return pages
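A hedged usage sketch tying the three helpers above together (bucket and object names are placeholders):
jobId = startJob("my-bucket", "mytestdoc.pdf")
if isJobComplete(jobId) == "SUCCEEDED":
    for page in getJobResults(jobId):
        for block in page.get("Blocks", []):
            if block["BlockType"] == "LINE":
                print(block["Text"])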
Invoking Textract with Python, I received the same error until I truncated the ClientRequestToken down to 64 characters:
response = client.start_document_text_detection(
DocumentLocation={
'S3Object':{
'Bucket': bucket,
'Name' : fileName
}
},
ClientRequestToken= fileName[:64],
NotificationChannel= {
"SNSTopicArn": "arn:aws:sns:us-east-1:AccountID:AmazonTextractXYZ",
"RoleArn": "arn:aws:iam::AccountId:role/TextractRole"
}
)
print('Processing started : %s' % json.dumps(response))
This can be considered a follow-up to this thread, but I need more help with moving things along. Hopefully someone can have a look over my attempts below and provide further guidance.
To summarize, I need a cloud function that
Is triggered by a PubSub message being published in topic A (this can be done in UI).
reads a messy object change notification message in "push" PubSub topic A.
"parse" it
publish a message in PubSub topic B, with the original message ID as data, and other metadata (e.g. file name, size, time) as attributes.
1:
Example of a messy object change notification:
\n "kind": "storage#object",\n "id": "bucketcfpubsub/test.txt/1544681756538155",\n "selfLink": "https://www.googleapis.com/storage/v1/b/bucketcfpubsub/o/test.txt",\n "name": "test.txt",\n "bucket": "bucketcfpubsub",\n "generation": "1544681756538155",\n "metageneration": "1",\n "contentType": "text/plain",\n "timeCreated": "2018-12-13T06:15:56.537Z",\n "updated": "2018-12-13T06:15:56.537Z",\n "storageClass": "STANDARD",\n "timeStorageClassUpdated": "2018-12-13T06:15:56.537Z",\n "size": "1938",\n "md5Hash": "sDSXIvkR/PBg4mHyIUIvww==",\n "mediaLink": "https://www.googleapis.com/download/storage/v1/b/bucketcfpubsub/o/test.txt?generation=1544681756538155&alt=media",\n "crc32c": "UDhyzw==",\n "etag": "CKvqjvuTnN8CEAE="\n}\n
To clarify, is this a message with blank "data" field, and all the information above are in attribute pairs (like "attribute name": "attribute data")? Or is it just a long string stuffed into the "data" field, with no "attributes"?
2:
In the above thread, a "pull" subscription is used. Is it better than using a "push" subscription? Push sample below:
def create_push_subscription(project_id,
topic_name,
subscription_name,
endpoint):
"""Create a new push subscription on the given topic."""
# [START pubsub_create_push_subscription]
from google.cloud import pubsub_v1
# TODO project_id = "Your Google Cloud Project ID"
# TODO topic_name = "Your Pub/Sub topic name"
# TODO subscription_name = "Your Pub/Sub subscription name"
# TODO endpoint = "https://my-test-project.appspot.com/push"
subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(project_id, topic_name)
subscription_path = subscriber.subscription_path(
project_id, subscription_name)
push_config = pubsub_v1.types.PushConfig(
push_endpoint=endpoint)
subscription = subscriber.create_subscription(
subscription_path, topic_path, push_config)
print('Push subscription created: {}'.format(subscription))
print('Endpoint for subscription is: {}'.format(endpoint))
# [END pubsub_create_push_subscription]
Or do I need further code after this to receive messages?
Also, doesn't this create a new subscriber every time the Cloud Function is triggered by a pubsub message being published? Should I add a subscription delete code at the end of the CF, or are there more efficient ways to do this?
3:
Next, to parse the message, this sample code reads a few attributes, as follows:
def summarize(message):
# [START parse_message]
data = message.data
attributes = message.attributes
event_type = attributes['eventType']
bucket_id = attributes['bucketId']
object_id = attributes['objectId']
Will this work with my above notification in 1:?
4:
How do I separate the topic_name? Steps 1 and 2 use topic A, while this step publishes into topic B. Is it as simple as re-writing the topic_name in the code example below?
# TODO topic_name = "Your Pub/Sub topic name"
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)
for n in range(1, 10):
data = u'Message number {}'.format(n)
# Data must be a bytestring
data = data.encode('utf-8')
# Add two attributes, origin and username, to the message
publisher.publish(
topic_path, data, origin='python-sample', username='gcp')
print('Published messages with custom attributes.')
Source where I got most of the sample code from (besides the above thread): python-docs-samples. Will adapting and stringing the above code samples together produce useful code? Or will I still be missing stuff like "import ****"?
You should not attempt to manually create a Subscriber running in Cloud Functions. Instead, follow the documentation here for setting up a Cloud Function which will be called with all messages sent to a given topic by passing the --trigger-topic command line parameter.
To address some of your other concerns:
“Should I add a subscription delete code at the end of the CF”- Subscriptions are long-lived resources corresponding to a specific backlog of messages. If the subscription is created and deleted at the end of the cloud function, messages sent when it does not exist will not be received.
“How do I separate the topic_name”- The ‘topic_name’ in this example refers to the last part of the string formatted like this projects/project_id/topics/topic_name that will appear on this page in the cloud console for your topic after it has been created.
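Putting the pieces together, a minimal sketch of a --trigger-topic Cloud Function that parses the notification arriving on topic A and republishes selected fields as attributes to a hypothetical topic B (all names here are placeholders, not your actual resources):
import base64
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical destination topic ("topic B")
topic_b_path = publisher.topic_path("my-project", "topic-b")

def relay_notification(event, context):
    """Triggered by each message published to topic A (deployed with --trigger-topic)."""
    # The object change notification JSON arrives base64-encoded in event['data']
    notification = json.loads(base64.b64decode(event['data']).decode('utf-8'))

    # Use the original Pub/Sub message ID as the payload, metadata as attributes
    publisher.publish(
        topic_b_path,
        data=context.event_id.encode('utf-8'),
        name=notification.get('name', ''),
        size=notification.get('size', ''),
        timeCreated=notification.get('timeCreated', ''),
    )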
I am attempting to load a simple transactions.txt table into a S3 bucket where a Lambda function reads the file and populates DynamoDB tables for Customers and Transactions. This all works fine. However, I also have a Lambda function that is supposed to read the Transactions table as they populate the table and sum up the transaction totals by customer and insert them into another DynamoDB table--TransactionTotal.
My TotalNotifier Lambda function throws a KeyError regarding 'NewImage'. I believe the code is fine, and I have tried changing the stream view type from 'New and old images' to just 'New image' for the Transactions table, but I still encounter the same error.
from __future__ import print_function
import json, boto3
# Connect to SNS
sns = boto3.client('sns')
alertTopic = 'HighBalanceAlert'
snsTopicArn = [t['TopicArn'] for t in sns.list_topics()['Topics'] if t['TopicArn'].endswith(':' + alertTopic)][0]
# Connect to DynamoDB
dynamodb = boto3.resource('dynamodb')
transactionTotalTableName = 'TransactionTotal'
transactionsTotalTable = dynamodb.Table(transactionTotalTableName);
# This handler is executed every time the Lambda function is triggered
def lambda_handler(event, context):
# Show the incoming event in the debug log
print("Event received by Lambda function: " + json.dumps(event, indent=2))
# For each transaction added, calculate the new Transactions Total
for record in event['Records']:
customerId = record['dynamodb']['NewImage']['CustomerId']['S']
transactionAmount = int(record['dynamodb']['NewImage']['TransactionAmount']['N'])
# Update the customer's total in the TransactionTotal DynamoDB table
response = transactionsTotalTable.update_item(
Key={
'CustomerId': customerId
},
UpdateExpression="add accountBalance :val",
ExpressionAttributeValues={
':val': transactionAmount
},
ReturnValues="UPDATED_NEW"
)
Here is a sample error from the CloudWatch log:
'NewImage': KeyError
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 30, in lambda_handler
customerId = record['dynamodb']['NewImage']['CustomerId']['S']
KeyError: 'NewImage'
To elaborate on Oluwafemi's comment, you're likely experiencing this error when receiving a REMOVE event. Regardless of whether your stream view type is new and old images or just new images, a REMOVE record carries no NewImage, since there is no new image. Check out the example events in the AWS docs.
A check on the value of record['eventName'] should solve the issue.
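A minimal sketch of that guard inside the handler (only INSERT and MODIFY records carry a NewImage):
def lambda_handler(event, context):
    for record in event['Records']:
        # REMOVE events have no NewImage, so skip them
        if record['eventName'] not in ('INSERT', 'MODIFY'):
            continue
        new_image = record['dynamodb']['NewImage']
        customerId = new_image['CustomerId']['S']
        transactionAmount = int(new_image['TransactionAmount']['N'])
        # ... update the TransactionTotal table as in the original handler ...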