Properly return a label in post-annotation lambda for AWS SageMaker Ground Truth custom labeling job - amazon-web-services

I'm working on a SageMaker labeling job with custom datatypes. For some reason, the AWS web console isn't showing the correct label: it should show the selected label, which is "Native", but instead it shows the <labelattributename>, which is "new-test-14".
After Ground Truth runs the post-annotation lambda, it seems to modify the metadata before returning a data object. The data object it returns doesn't contain a class-name key inside the metadata attribute, even when I hard-code the lambda to return an object that contains it.
My manifest file looks like this:
{"source-ref" : "s3://<file-name>", "text" : "Hello world"}
{"source-ref" : "s3://"<file-name>", "text" : "Hello world"}
And the worker response looks like this:
{"answers":[{"acceptanceTime":"2021-05-18T16:08:29.473Z","answerContent":{"new-test-14":{"label":"Native"}},"submissionTime":"2021-05-18T16:09:15.960Z","timeSpentInSeconds":46.487,"workerId":"private.us-east-1.ea05a03fcd679cbb","workerMetadata":{"identityData":{"identityProviderType":"Cognito","issuer":"https://cognito-idp.us-east-1.amazonaws.com/us-east-1_XPxQ9txEq","sub":"edc59ce1-e09d-4551-9e0d-a240465ea14a"}}}]}
That worker response gets processed by my post-annotation lambda, which is modeled on this AWS sample Ground Truth recipe. Here's my code:
import json
import sys
import boto3
from datetime import datetime


def lambda_handler(event, context):
    # Event received
    print("Received event: " + json.dumps(event, indent=2))
    labeling_job_arn = event["labelingJobArn"]
    label_attribute_name = event["labelAttributeName"]
    label_categories = None
    if "label_categories" in event:
        label_categories = event["labelCategories"]
        print(" Label Categories are : " + label_categories)
    payload = event["payload"]
    role_arn = event["roleArn"]
    output_config = None  # Output S3 location. You can choose to write your annotation to this location
    if "outputConfig" in event:
        output_config = event["outputConfig"]
    # If you specified a KMS key in your labeling job, you can use the key to write
    # consolidated_output to the S3 location specified in outputConfig.
    # kms_key_id = None
    # if "kmsKeyId" in event:
    #     kms_key_id = event["kmsKeyId"]
    # # Create S3 client object
    # s3_client = S3Client(role_arn, kms_key_id)
    s3_client = boto3.client('s3')
    # Perform consolidation
    return do_consolidation(labeling_job_arn, payload, label_attribute_name, s3_client)


def do_consolidation(labeling_job_arn, payload, label_attribute_name, s3_client):
    """
    Core logic for consolidation
    :param labeling_job_arn: labeling job ARN
    :param payload: payload data for consolidation
    :param label_attribute_name: identifier for labels in output JSON
    :param s3_client: S3 helper class
    :return: output JSON string
    """
    # Extract payload data
    if "s3Uri" in payload:
        s3_ref = payload["s3Uri"]
        payload_bucket, payload_key = s3_ref.split('/', 2)[-1].split('/', 1)
        payload = json.loads(s3_client.get_object(Bucket=payload_bucket, Key=payload_key)['Body'].read())
        # print(payload)

    # Payload data contains a list of data objects.
    # Iterate over it to consolidate annotations for each individual data object.
    consolidated_output = []
    success_count = 0  # Number of data objects that were successfully consolidated
    failure_count = 0  # Number of data objects that failed in consolidation
    for p in range(len(payload)):
        response = None
        try:
            dataset_object_id = payload[p]['datasetObjectId']
            log_prefix = "[{}] data object id [{}] :".format(labeling_job_arn, dataset_object_id)
            print("{} Consolidating annotations BEGIN ".format(log_prefix))
            annotations = payload[p]['annotations']
            # print("{} Received Annotations from all workers {}".format(log_prefix, annotations))

            # Iterate over annotations. Log all annotations to your CloudWatch logs.
            annotationsFromAllWorkers = []
            for i in range(len(annotations)):
                worker_id = annotations[i]["workerId"]
                annotation_data = annotations[i]["annotationData"]
                annotation_content = annotation_data["content"]
                annotation_content_json = json.loads(annotation_content)
                annotation_job = annotation_content_json["new_test"]
                annotation_label = annotation_job["label"]
                consolidated_annotation = {
                    "workerId": worker_id,
                    "annotationData": {
                        "content": {
                            "annotatedResult": {
                                "instances": [{"label": annotation_label}]
                            }
                        }
                    }
                }
                annotationsFromAllWorkers.append(consolidated_annotation)

            consolidated_annotation = {"annotationsFromAllWorkers": annotationsFromAllWorkers}  # TODO: Add your consolidation logic

            # Build the consolidation response object for an individual data object
            response = {
                "datasetObjectId": dataset_object_id,
                "consolidatedAnnotation": {
                    "content": {
                        label_attribute_name: consolidated_annotation,
                        label_attribute_name + "-metadata": {
                            "class-name": "Native",
                            "confidence": 0.00,
                            "human-annotated": "yes",
                            "creation-date": datetime.strftime(datetime.now(), "%Y-%m-%dT%H:%M:%S"),
                            "type": "groundtruth/custom"
                        }
                    }
                }
            }
            success_count += 1
            # print("{} Consolidating annotations END ".format(log_prefix))

            # Append the individual data object response to the list of responses.
            if response is not None:
                consolidated_output.append(response)
        except:
            failure_count += 1
            print(" Consolidation failed for dataobject {}".format(p))
            print(" Unexpected error: Consolidation failed." + str(sys.exc_info()[0]))

    print("Consolidation Complete. Success Count {} Failure Count {}".format(success_count, failure_count))
    print(" -- Consolidated Output -- ")
    print(consolidated_output)
    print(" ------------------------- ")
    return consolidated_output
As you can see above, the do_consolidation method returns an object hard-coded to include a class-name of "Native", and the lambda_handler method returns that same object. Here's the post-annotation function response:
[{
    "datasetObjectId": "4",
    "consolidatedAnnotation": {
        "content": {
            "new-test-14": {
                "annotationsFromAllWorkers": [{
                    "workerId": "private.us-east-1.ea05a03fcd679cbb",
                    "annotationData": {
                        "content": {
                            "annotatedResult": {
                                "instances": [{
                                    "label": "Native"
                                }]
                            }
                        }
                    }
                }]
            },
            "new-test-14-metadata": {
                "class-name": "Native",
                "confidence": 0,
                "human-annotated": "yes",
                "creation-date": "2021-05-19T07:06:06",
                "type": "groundtruth/custom"
            }
        }
    }
}]
As you can see, the post-annotation function return value has the class-name of "Native" in the metadata, so I would expect class-name to be present in the data object metadata, but it's not. Here's a screenshot of the data object summary:
It seems like Ground Truth overwrote the metadata, and now the object doesn't contain the correct label. I think perhaps that's why my label is coming through as the label attribute name "new-test-14" instead of as the correct label "Native". Here's a screenshot of the labeling job in the AWS web console:
The web console is supposed to show the label "Native" in the "Label" column, but instead I'm getting the <labelattributename> "new-test-14" in that column.
Here is the output.manifest file generated by Ground Truth at the end:
{
    "source-ref": "s3://<file-name>",
    "text": "Hello world",
    "new-test-14": {
        "annotationsFromAllWorkers": [{
            "workerId": "private.us-east-1.ea05a03fcd679ert",
            "annotationData": {
                "content": {
                    "annotatedResult": {
                        "label": "Native"
                    }
                }
            }
        }]
    },
    "new-test-14-metadata": {
        "type": "groundtruth/custom",
        "job-name": "new-test-14",
        "human-annotated": "yes",
        "creation-date": "2021-05-18T12:34:17.400000"
    }
}
What should I return from the Post-Annotation function? Am I missing something in my response? How do I get the proper label to appear in the AWS web console?

Related

Variable passing throwing error in BigQueryInsertJobOperator in Airflow

I have written a BigQueryInsertJobOperator in Airflow to select and insert data into a BigQuery table, but I am facing an issue with variable passing. I am getting the error below while executing the Airflow DAG.
File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 911, in to_api_repr
configuration = self._configuration.to_api_repr()
File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 683, in to_api_repr
query_parameters = resource["query"].get("queryParameters")
AttributeError: 'str' object has no attribute 'get'
Here is my Operator code:
dag = DAG(
    'bq_to_sql_operator',
    default_args=default_args,
    schedule_interval="@daily",
    template_searchpath="/opt/airflow/dags/scripts",
    user_defined_macros={"BQ_PROJECT": BQ_PROJECT, "BQ_EDW_DATASET": BQ_EDW_DATASET, "BQ_STAGING_DATASET": BQ_STAGING_DATASET},
    catchup=False
)
t1 = BigQueryInsertJobOperator(
    task_id='bq_write_to_umc_cg_service_agg_stg',
    configuration={
        "query": "{% include 'umc_cg_service_agg_stg.sql' %}",
        "useLegacySql": False,
        "allow_large_results": True,
        "writeDisposition": "WRITE_TRUNCATE",
        "destinationTable": {
            'projectId': BQ_PROJECT,
            'datasetId': BQ_STAGING_DATASET,
            'tableId': UMC_CG_SERVICE_AGG_STG_TABLE_NAME
        }
    },
    params={'BQ_PROJECT': BQ_PROJECT, 'BQ_EDW_DATASET': BQ_EDW_DATASET, 'BQ_STAGING_DATASET': BQ_STAGING_DATASET},
    gcp_conn_id=BQ_CONN_ID,
    location=BQ_LOCATION,
    dag=dag
)
My SQL file looks like this:
select
    faccs2.employer_key employer_key,
    faccs2.service_name service_name,
    gender,
    approximate_age_band,
    state,
    relationship_map_name,
    account_attribute1_name,
    account_attribute1_value,
    account_attribute2_name,
    account_attribute2_value,
    account_attribute3_name,
    account_attribute3_value,
    account_attribute4_name,
    account_attribute4_value,
    account_attribute5_name,
    account_attribute5_value,
    count(distinct faccs2.sf_service_id) total_service_count
from `{{params.BQ_PROJECT}}.{{params.BQ_EDW_DATASET}}.fact_account_cg_case_survey` faccs
inner join `{{params.BQ_PROJECT}}.{{params.BQ_EDW_DATASET}}.fact_account_cg_case_service` faccs2 on faccs.sf_case_id = faccs2.sf_case_id
inner join `{{params.BQ_PROJECT}}.{{params.BQ_EDW_DATASET}}.dim_account` da on faccs2.account_key = da.account_key
left join `{{params.BQ_PROJECT}}.{{params.BQ_STAGING_DATASET}}.stg_account_selected_attr_tmp2` attr on faccs.account_key = attr.account_key
where not da.is_test_account_flag
    and attr.gender is not null
    and coalesce(faccs.case_status,'abc') <> 'Closed as Duplicate'
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16;
Can someone please help me fix this issue?
I think that the query configuration should be in a nested document called query:
t1 = BigQueryInsertJobOperator(
    task_id='bq_write_to_umc_cg_service_agg_stg',
    configuration={
        "query": {
            "query": "{% include 'umc_cg_service_agg_stg.sql' %}",
            "useLegacySql": False,
            "allow_large_results": True,
            "writeDisposition": "WRITE_TRUNCATE",
            "destinationTable": {
                'projectId': BQ_PROJECT,
                'datasetId': BQ_STAGING_DATASET,
                'tableId': UMC_CG_SERVICE_AGG_STG_TABLE_NAME
            }
        }
    },
    params={'BQ_PROJECT': BQ_PROJECT, 'BQ_EDW_DATASET': BQ_EDW_DATASET, 'BQ_STAGING_DATASET': BQ_STAGING_DATASET},
    gcp_conn_id=BQ_CONN_ID,
    location=BQ_LOCATION,
    dag=dag
)
With your provided configuration dict, an internal method tries to access queryParameters, which should live in the dict configuration["query"], but it finds a str instead of a dict.
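To illustrate why, here is a minimal sketch of the failing access; the resource dict and the query string are hypothetical and only mirror the line shown in the traceback:

# Hypothetical re-creation of the failing line from to_api_repr:
resource = {"query": "select 1"}                 # configuration passed a plain SQL string
# resource["query"].get("queryParameters")       # -> AttributeError: 'str' object has no attribute 'get'

resource = {"query": {"query": "select 1"}}      # the nested form the job configuration expects
resource["query"].get("queryParameters")         # -> None, no error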
Consider the script below, which I've used at work.
target_date = '{{ ds_nodash }}'
...
# DAG task
t1 = bq.BigQueryInsertJobOperator(
    task_id='sample_task',
    configuration={
        "query": {
            "query": f"{{% include 'your_query_file.sql' %}}",
            "useLegacySql": False,
            "queryParameters": [
                {
                    "name": "target_date",
                    "parameterType": {"type": "STRING"},
                    "parameterValue": {"value": f"{target_date}"}
                }
            ],
            "parameterMode": "NAMED"
        },
    },
    location='asia-northeast3',
)
-- in your_query_file.sql, the @target_date value is passed as a named parameter.
DECLARE target_date DATE DEFAULT SAFE.PARSE_DATE('%Y%m%d', @target_date);
SELECT ... FROM ... WHERE partitioned_at = target_date;
You can refer to the configuration JSON field specification at the link below.
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#queryrequest
parameterMode string
Standard SQL only. Set to POSITIONAL to use positional (?) query parameters or to NAMED to use named (@myparam) query parameters in this query.
queryParameters[] object (QueryParameter)
jobs.query parameters for Standard SQL queries.
queryParameters is an array of QueryParameter objects, which have the following JSON format.
{
    "name": string,
    "parameterType": {
        object (QueryParameterType)
    },
    "parameterValue": {
        object (QueryParameterValue)
    }
}
https://cloud.google.com/bigquery/docs/reference/rest/v2/QueryParameter

Build Elasticsearch query dynamically by extracting fields to be matched from the Lambda event in AWS Elasticsearch service

I want to write a query to match indexed fields in Elasticsearch. I am using the AWS Elasticsearch service and writing the query as an AWS Lambda function. This lambda function is executed when an event occurs, searches for the fields sent in the event, matches them against the indexed documents, and returns the matched documents.
However, we don't know the fields or the number of fields to be searched ahead of time. So I want to be able to extract the fields from the event in the lambda function and construct the query dynamically to match the fields with the indexed documents.
The event is as follows:
{
    "headers": {
        "Host": "***"
    },
    "queryStringParameters": {
        "fieldA": "abc",
        "fieldB": "def"
    }
}
The lambda function is as follows. This function expects two fields and matches them.
def search(event, context):
    fields = list(event['queryStringParameters'].keys())
    firstField = fields[0]
    secondField = fields[1]
    values = list(event['queryStringParameters'].values())
    firstValue = values[0]
    secondValue = values[1]
    query = {
        "query": {
            "bool": {
                "must": [
                    {"match": {firstField: firstValue}},
                    {"match": {secondField: secondValue}}
                ]
            }
        }
    }
How can I rewrite my query so it dynamically accepts the fields and the number of fields that the event sends (not known ahead of time)?
Not sure what your exact requirements are but you could go with the following:
def search(event, context):
    query = {
        "query": {
            "bool": {
                "query_string": {
                    "query": " OR ".join([
                        "(%s:'%s')" % (k, v) for (k, v) in event["queryStringParameters"].items()
                    ])
                }
            }
        }
    }
    print(query)
which would result in a proper query_string query:
{
    "query": {
        "bool": {
            "query_string": {
                "query": "(fieldB:'def') OR (fieldA:'abc')"
            }
        }
    }
}
You could interchange the OR with an AND. Also keep in mind that when the values are wrapped in quotes, ES will enforce exact matches. Leave them out in case you're after a contains behavior (i.e. a match query).
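If you are after that match behavior, a minimal sketch (assuming the same event shape as in the question) could build one match clause per field inside a bool/must, so it still works for any number of fields:

def search(event, context):
    # One match clause per query-string parameter; the field names and
    # values come straight from the event, so the field count doesn't matter.
    query = {
        "query": {
            "bool": {
                "must": [
                    {"match": {field: value}}
                    for field, value in event["queryStringParameters"].items()
                ]
            }
        }
    }
    print(query)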

AWS Appsync Batch Resolver

I've been struggling with this for some time now. Apologies: I changed the query name for the question to getDeviceReadings, but I have been using getAllUserDevices (sorry for any confusion).
type Device {
  id: String
  device: String!
}

type Reading {
  device: String
  time: Int
}

type PaginatedDevices {
  devices: [Device]
  readings: [Reading]
  nextToken: String
}

type Query {
  getDevicesReadings(nextToken: String, count: Int): PaginatedDevices
}
Then I have a resolver on the query getDevicesReadings, which works fine and returns all the devices a user has. So far so good:
{
    "version": "2017-02-28",
    "operation": "Query",
    "query" : {
        "expression": "id = :id",
        "expressionValues" : {
            ":id" : { "S" : "${context.identity.username}" }
        }
    }
    #if( ${context.arguments.count} )
    ,"limit": ${context.arguments.count}
    #end
    #if( ${context.arguments.nextToken} )
    ,"nextToken": "${context.arguments.nextToken}"
    #end
}
Now I want to return all the readings a device has, based on the source result, so I have a resolver on getDevicesReadings/readings:
#set($ids = [])
#foreach($id in ${ctx.source.devices})
    #set($map = {})
    $util.qr($map.put("device", $util.dynamodb.toString($id.device)))
    $util.qr($ids.add($map))
#end
{
    "version" : "2018-05-29",
    "operation" : "BatchGetItem",
    "tables" : {
        "readings": {
            "keys": $util.toJson($ids),
            "consistentRead": true
        }
    }
}
With a response mapping like so:
$utils.toJson($context.result.data.readings)
I run a query
query getShit{
    getDevicesReadings{
        devices{
            device
        }
        readings{
            device
            time
        }
    }
}
This returns the following results:
{
    "data": {
        "getAllUserDevices": {
            "devices": [
                {
                    "device": "123"
                },
                {
                    "device": "a935eeb8-a0d0-11e8-a020-7c67a28eda41"
                }
            ],
            "readings": [
                null,
                null
            ]
        }
    }
}
As you can see in the image, the primary partition key on the readings table is device. I looked at the logs and found the following.
Sorry if you can't read the log; it basically says that there are unprocessedKeys,
and the following error message:
"message": "The provided key element does not match the schema (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: 0H21LJE234CH1GO7A705VNQTJVVV4KQNSO5AEMVJF66Q9ASUAAJG)",
I'm guessing that somehow my mapping isn't quite correct and I'm passing in readings as my keys?
Any help greatly appreciated
No, you can absolutely use batch resolvers when you have a primary sort key. The error in your example is that you were not providing the primary sort key to the resolver.
This code needs to provide a "time" as well as a "device", because you need both to fully specify the primary key.
#set($ids = [])
#foreach($id in ${ctx.source.devices})
    #set($map = {})
    $util.qr($map.put("device", $util.dynamodb.toString($id.device)))
    $util.qr($ids.add($map))
#end
You should have something like this:
#set($ids = [])
#foreach($id in ${ctx.source.devices})
    #set($map = {})
    ## The table's primary key is made up of "device" AND "time"
    $util.qr($map.put("device", $util.dynamodb.toString($id.device)))
    $util.qr($map.put("time", $util.dynamodb.toString($id.time)))
    $util.qr($ids.add($map))
#end
If you want to get many records that share the same "device" value but that have different "time" values, you need to use a DynamoDB Query operation, not a batch get.
You're correct: the request mapping template you provided doesn't match the primary key on the readings table. A BatchGetItem expects the keys to be full primary keys; however, you are only passing the hash key.
For the BatchGetItem call to succeed you must pass both hash and sort key, so in this case, both device and time attributes.
Maybe a Query on the readings table would be more appropriate?
So you can't have a batch resolver when you have a primary sort key?!
So the answer was to create a lambda function and tack that on as my resolver:
import boto3
from boto3.dynamodb.conditions import Key

def lambda_handler(event, context):
    list = []
    for device in event['source']['devices']:
        dynamodb = boto3.resource('dynamodb')
        readings = dynamodb.Table('readings')
        response = readings.query(
            KeyConditionExpression=Key('device').eq(device['device'])
        )
        items = response['Items']
        list.extend(items)
    return list

How to retrieve multiple items from Dynamo DB using AWS lambda

How do I get multiple items from the DB? The code below throws an error, as it fetches only one item. I am retrieving the items based on the email value.
import json
import os
import boto3
import decimalencoder

dynamodb = boto3.resource('dynamodb')

def get(event, context):
    table = dynamodb.Table(os.environ['DYNAMODB_TABLE'])

    # fetch a person from the database
    result = table.get_item(
        Key={
            'email': event['pathParameters']['email']
        }
    )

    # create a response
    response = {
        "statusCode": 200,
        "body": json.dumps(result['Item'], cls=decimalencoder.DecimalEncoder),
        "headers": {
            "Access-Control-Allow-Origin": "*",
            "Access-Control-Allow-Credentials": "true"
        }
    }
    return response
To retrieve multiple rows from the DB, first query on the id you want the data filtered by.
Then maintain a list to store all the row values in it.
def lambda_handler(event, context):
    # Assumes `table` (a boto3 DynamoDB Table resource), `Key` (from
    # boto3.dynamodb.conditions) and `hubId` are defined elsewhere.
    item = table.query(
        KeyConditionExpression=Key('hubID').eq(hubId)
    )
    if item["Count"] == 0:
        response = {"msg": "Item not exist, can't perform READ"}
    else:
        i = 0
        lst = []
        while i < item["Count"]:
            response = {
                "hubId": item["Items"][i]["hubID"],
                "deviceState": int(item["Items"][i]["deviceState"]),
                "deviceId": item["Items"][i]["deviceID"],
                "deviceType": item["Items"][i]["deviceType"],
                "intensity": int(item["Items"][i]["intensity"])
            }
            lst.append(response)
            i += 1
        print(lst)
        response = lst
    return response
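One caveat not covered in the answer above: a single Query call returns at most 1 MB of data, so larger result sets come back in pages. A rough sketch of paging, reusing the table, Key and hubId names assumed above:

# Keep querying until DynamoDB stops returning a LastEvaluatedKey,
# accumulating every page of items into one list.
items = []
kwargs = {"KeyConditionExpression": Key('hubID').eq(hubId)}
while True:
    page = table.query(**kwargs)
    items.extend(page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]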

babel-plugin-react-intl: Extract strings to a single file

Currently, while using babel-plugin-react-intl, a separate JSON file is created for every component, with 'id', 'description' and 'defaultMessage' fields. What I need is a single JSON file containing a single object, with each 'id' as the key and the corresponding 'defaultMessage' as the value.
Present situation:
ComponentA.json
[
    {
        "id": "addEmoticonA",
        "description": "Add emoticon",
        "defaultMessage": "Insert Emoticon"
    },
    {
        "id": "addPhotoA",
        "description": "Add photo",
        "defaultMessage": "Insert photo"
    }
]
ComponentB.json
[
    {
        "id": "addEmoticonB",
        "description": "Add emoji",
        "defaultMessage": "Insert Emoji"
    },
    {
        "id": "addPhotoB",
        "description": "Add picture",
        "defaultMessage": "Insert picture"
    }
]
What I need for translation.
final.json
{
    "addEmoticonA": "Insert Emoticon",
    "addPhotoA": "Insert photo",
    "addEmoticonB": "Insert Emoji",
    "addPhotoB": "Insert picture"
}
Is there any way to accomplish this task, whether with a Python script or anything else, i.e. to make a single JSON file from the different JSON files? Or to directly produce a single JSON file using babel-plugin-react-intl?
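For what it's worth, the "Python script or anything" route can be a few lines. Here is a minimal sketch; the ./messages/**/*.json glob is an assumption, so point it at wherever the plugin writes its per-component files:

import glob
import json

# Merge every extracted per-component file into one id -> defaultMessage map.
merged = {}
for path in glob.glob('./messages/**/*.json', recursive=True):
    with open(path, encoding='utf-8') as f:
        for descriptor in json.load(f):
            merged[descriptor['id']] = descriptor['defaultMessage']

# Write the flat map, e.g. the final.json shown in the question.
with open('final.json', 'w', encoding='utf-8') as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)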
There is a translations manager that will do this.
Or, for a custom option, see below. The script below, which is based on this script, goes through the translation messages created by babel-plugin-react-intl and creates JS files that contain all messages from all components in JSON format.
import fs from 'fs'
import { sync as globSync } from 'glob'
import { sync as mkdirpSync } from 'mkdirp'
import * as i18n from '../lib/i18n'

const MESSAGES_PATTERN = './_translations/**/*.json'
const LANG_DIR = './_translations/lang/'

// Ensure output folder exists
mkdirpSync(LANG_DIR)

// Aggregates the default messages that were extracted from the example app's
// React components via the React Intl Babel plugin. An error will be thrown if
// there are messages in different components that use the same `id`. The result
// is a flat collection of `id: message` pairs for the app's default locale.
let defaultMessages = globSync(MESSAGES_PATTERN)
  .map(filename => fs.readFileSync(filename, 'utf8'))
  .map(file => JSON.parse(file))
  .reduce((collection, descriptors) => {
    descriptors.forEach(({ id, defaultMessage, description }) => {
      if (collection.hasOwnProperty(id))
        throw new Error(`Duplicate message id: ${id}`)
      collection[id] = { defaultMessage, description }
    })
    return collection
  }, {})

// Sort keys by name
const messageKeys = Object.keys(defaultMessages)
messageKeys.sort()
defaultMessages = messageKeys.reduce((acc, key) => {
  acc[key] = defaultMessages[key]
  return acc
}, {})

// Build the JSON document for the available languages
i18n.en = messageKeys.reduce((acc, key) => {
  acc[key] = defaultMessages[key].defaultMessage
  return acc
}, {})

Object.keys(i18n).forEach(lang => {
  const langDoc = i18n[lang]
  const units = Object.keys(defaultMessages)
    .map((id) => [id, defaultMessages[id]])
    .reduce((collection, [id]) => {
      collection[id] = langDoc[id] || ''
      return collection
    }, {})
  fs.writeFileSync(`${LANG_DIR}${lang}.json`, JSON.stringify(units, null, 2))
})
You can use babel-plugin-react-intl-extractor to aggregate your translations into a single file. It also recompiles the translation files automatically on each change to your messages.