Writing JSON with dict properties to Google Cloud Datastore - python-2.7

Using Apache Beam (Python 2.7 SDK), I am trying to write JSON files as entities into Google Cloud Datastore.
Sample JSON:
{
    "CustId": "005056B81111",
    "Name": "John Smith",
    "Phone": "827188111",
    "Email": "john@xxx.com",
    "addresses": [
        {"type": "Billing", "streetAddress": "Street 7", "city": "Malmo", "postalCode": "CR0 4UZ"},
        {"type": "Shipping", "streetAddress": "Street 6", "city": "Stockholm", "postalCode": "YYT IKO"}
    ]
}
I have written an Apache Beam pipeline with three main steps:
beam.io.ReadFromText(input_file_path)
beam.ParDo(CreateEntities())
WriteToDatastore(PROJECT)
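Wired together, the pipeline looks roughly like this (a sketch; pipeline_options, input_file_path and PROJECT are placeholders for my actual settings):
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'ReadJSONLines' >> beam.io.ReadFromText(input_file_path)
     | 'CreateEntities' >> beam.ParDo(CreateEntities())
     | 'WriteToDatastore' >> WriteToDatastore(PROJECT))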
In step 2, I convert the JSON object (dict) into an entity:
class CreateEntities(beam.DoFn):
    def process(self, element):
        element = element.encode('ascii', 'ignore')
        element = json.loads(element)
        Id = element.pop('CustId')
        entity = entity_pb2.Entity()
        datastore_helper.add_key_path(entity.key, 'CustomerDF', Id)
        datastore_helper.add_properties(entity, element)
        return [entity]
This works fine for the basic properties. However, since addresses is a dict itself, it fails.
I have read a similar post, but it did not give the exact code to convert a dict into an entity.
I tried the following to set the addresses element as an entity, but it does not work:
element['addresses'] = entity_pb2.Entity()
Other References:
https://www.the-swamp.info/blog/uploading-data-cloud-datastore-using-dataflow/
https://gcloud-python.readthedocs.io/en/latest/datastore/entities.html

Are you trying to store this as a repeated structured property?
ndb.StructuredProperty values appear in Dataflow with their keys flattened, and for repeated structured properties each individual property within the structured property object becomes an array. So I think you would need to write it like this:
datastore_helper.add_properties(entity, {
    ...
    "addresses.type": ["Billing", "Shipping"],
    "addresses.streetAddress": ["Street 7", "Street 6"],
    "addresses.city": ["Malmo", "Stockholm"],
    "addresses.postalCode": ["CR0 4UZ", "YYT IKO"],
})
Alternatively, if you're trying to save this as an ndb.JsonProperty, you can do this:
datastore_helper.add_properties(entity, {
    ...
    "addresses": json.dumps(element['addresses']),
})
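For reference, the JsonProperty route pairs with an ndb model roughly like this on the App Engine side (a sketch; the model class and property names are assumptions, only the CustomerDF kind comes from the question):
from google.appengine.ext import ndb

class CustomerDF(ndb.Model):
    name = ndb.StringProperty()
    phone = ndb.StringProperty()
    email = ndb.StringProperty()
    addresses = ndb.JsonProperty()  # stores the list of address dicts as a JSON blob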

I know this is an old question, but I had a similar issue (although with Python 3.6 and NDB) and wrote a function to convert all dicts inside a dict into an Entity. It uses recursion to walk every node, converting as necessary:
def dict_to_entity(data):
    # the data can be a dict or a list, and they are iterated over differently
    # also create a new object to store the child objects
    if type(data) == dict:
        childiterator = data.items()
        new_data = {}
    elif type(data) == list:
        childiterator = enumerate(data)
        new_data = []
    else:
        return
    for i, child in childiterator:
        # if the child is a dict or a list, continue drilling...
        if type(child) in [dict, list]:
            new_child = dict_to_entity(child)
        else:
            new_child = child
        # add the child data to the new object
        if type(data) == dict:
            new_data[i] = new_child
        else:
            new_data.append(new_child)
    # convert the new object to Entity if needed
    if type(data) == dict:
        child_entity = datastore.Entity()
        child_entity.update(new_data)
        return child_entity
    else:
        return new_data
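For example, with the google-cloud-datastore client the function above could be used like this (a minimal sketch; raw_json_line is a placeholder, and the 'CustomerDF' kind comes from the question):
import json
from google.cloud import datastore

client = datastore.Client()
element = json.loads(raw_json_line)            # one JSON record, as in the question
cust_id = element.pop('CustId')
entity = datastore.Entity(key=client.key('CustomerDF', cust_id))
entity.update(dict_to_entity(element))         # nested dicts/lists become embedded Entities
client.put(entity)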

Related

How to rearrange priority field for a django model?

I have a model with a priority field of type positive integer. This field is unique and allows me to manage the priority of objects.
For example, I want the most important object to have priority one, the second most important to have priority two, etc...
Example:
[
    { "name": "object82", "priority": 1 },
    { "name": "object54", "priority": 2 },
    { "name": "object12", "priority": 3 }
]
class MyObject(models.Model):
    name = models.CharField(_("name"), max_length=255)
    priority = models.PositiveSmallIntegerField(_("priority"), unique=True)
I want to override the object serializer so that if I add a new object with an existing priority, it shifts the existing objects down (and the same when patching an existing object).
For example if I take the example above and add:
{ "name": "object22",
"priority": 2
}
I want the following result:
[
    { "name": "object82", "priority": 1 },   // the priority didn't change
    { "name": "object22", "priority": 2 },   // my new object
    { "name": "object54", "priority": 3 },   // the priority has changed
    { "name": "object12", "priority": 4 }    // the priority has changed
]
I think I have to check first if an object with the same priority exists in the database or not.
If not => I save as is
If yes, I have to change the priority of some objects before adding the new object.
How do I do this?
Maybe something like:
class MyObjectSerializer(serializers.ModelSerializer):
    class Meta:
        model = MyObject
        fields = '__all__'

    def update(self, instance, validated_data):
        target_priority = validated_data.get('priority')
        if MyObject.objects.filter(priority=target_priority).exists():
            existing_priorities = MyObject.objects.filter(priority__gte=target_priority)
            for existing_priority in existing_priorities:
                existing_priority.priority += 1
                existing_priority.save(update_fields=['priority'])
        instance.priority = target_priority
        instance.save(update_fields=['priority'])
        return instance
I was facing a similar problem; what I did was use a model form and do the validation in its clean() method:
def clean(self):
    cleaned_data = super().clean()
    priority = cleaned_data.get('priority')
    task = Task.objects.filter(priority__exact=priority)
    while task.exists():
        prev_task_id = task[0].id
        task.update(priority=priority + 1)
        priority += 1
        task = Task.objects.filter(priority__exact=priority).exclude(pk=prev_task_id)
    return cleaned_data
I have used the prev_task_id variable to exclude the model instance that has just been updated. For example, let's say we have this data:
{
title: 'first one',
priority: 3
},
{
title: 'second one',
priority: 4
}
So now if priority 3 comes in, then after updating 'first one' we will have two tasks with priority 4, so we have to exclude the task that was just updated, i.e. 'first one', and only update the second task in the next iteration.
PS: this code is written assuming that no duplicate priorities already exist in the database.
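As a variant of the above, the shift can also be expressed with an F() expression, walking the affected rows from the highest priority down so the unique constraint is never violated mid-update (a sketch against the MyObject model from the question):
from django.db.models import F

def shift_priorities_from(target_priority):
    # Bump every row at or above the target, highest priority first, so no two
    # rows ever hold the same priority value while the updates run.
    for obj in MyObject.objects.filter(priority__gte=target_priority).order_by('-priority'):
        obj.priority = F('priority') + 1
        obj.save(update_fields=['priority'])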

Properly return a label in post-annotation lambda for AWS SageMaker Ground Truth custom labeling job

I'm working on a SageMaker labeling job with custom datatypes. For some reason, though, I'm not getting the correct label in the AWS web console. It should show the selected label, which is "Native", but instead I'm getting the <labelattributename>, which is "new-test-14".
After Ground Truth runs the post-annotation lambda, it seems to modify the metadata before returning a data object. The data object it returns doesn't contain a class-name key inside the metadata attribute, even when I hard-code the lambda to return an object that contains it.
My manifest file looks like this:
{"source-ref" : "s3://<file-name>", "text" : "Hello world"}
{"source-ref" : "s3://<file-name>", "text" : "Hello world"}
And the worker response looks like this:
{"answers":[{"acceptanceTime":"2021-05-18T16:08:29.473Z","answerContent":{"new-test-14":{"label":"Native"}},"submissionTime":"2021-05-18T16:09:15.960Z","timeSpentInSeconds":46.487,"workerId":"private.us-east-1.ea05a03fcd679cbb","workerMetadata":{"identityData":{"identityProviderType":"Cognito","issuer":"https://cognito-idp.us-east-1.amazonaws.com/us-east-1_XPxQ9txEq","sub":"edc59ce1-e09d-4551-9e0d-a240465ea14a"}}}]}
That worker response gets processed by my post-annotation lambda, which is modeled after this AWS sample Ground Truth recipe. Here's my code:
import json
import sys
import boto3
from datetime import datetime


def lambda_handler(event, context):
    # Event received
    print("Received event: " + json.dumps(event, indent=2))
    labeling_job_arn = event["labelingJobArn"]
    label_attribute_name = event["labelAttributeName"]
    label_categories = None
    if "label_categories" in event:
        label_categories = event["labelCategories"]
        print(" Label Categories are : " + str(label_categories))
    payload = event["payload"]
    role_arn = event["roleArn"]
    output_config = None  # Output s3 location. You can choose to write your annotation to this location
    if "outputConfig" in event:
        output_config = event["outputConfig"]
    # If you specified a KMS key in your labeling job, you can use the key to write
    # consolidated_output to s3 location specified in outputConfig.
    # kms_key_id = None
    # if "kmsKeyId" in event:
    #     kms_key_id = event["kmsKeyId"]
    # # Create s3 client object
    # s3_client = S3Client(role_arn, kms_key_id)
    s3_client = boto3.client('s3')
    # Perform consolidation
    return do_consolidation(labeling_job_arn, payload, label_attribute_name, s3_client)


def do_consolidation(labeling_job_arn, payload, label_attribute_name, s3_client):
    """
    Core Logic for consolidation
    :param labeling_job_arn: labeling job ARN
    :param payload: payload data for consolidation
    :param label_attribute_name: identifier for labels in output JSON
    :param s3_client: S3 helper class
    :return: output JSON string
    """
    # Extract payload data
    if "s3Uri" in payload:
        s3_ref = payload["s3Uri"]
        payload_bucket, payload_key = s3_ref.split('/', 2)[-1].split('/', 1)
        payload = json.loads(s3_client.get_object(Bucket=payload_bucket, Key=payload_key)['Body'].read())
        # print(payload)

    # Payload data contains a list of data objects.
    # Iterate over it to consolidate annotations for individual data object.
    consolidated_output = []
    success_count = 0  # Number of data objects that were successfully consolidated
    failure_count = 0  # Number of data objects that failed in consolidation

    for p in range(len(payload)):
        response = None
        try:
            dataset_object_id = payload[p]['datasetObjectId']
            log_prefix = "[{}] data object id [{}] :".format(labeling_job_arn, dataset_object_id)
            print("{} Consolidating annotations BEGIN ".format(log_prefix))

            annotations = payload[p]['annotations']
            # print("{} Received Annotations from all workers {}".format(log_prefix, annotations))

            # Iterate over annotations. Log all annotations to your CloudWatch logs
            annotationsFromAllWorkers = []
            for i in range(len(annotations)):
                worker_id = annotations[i]["workerId"]
                annotation_data = annotations[i]["annotationData"]
                annotation_content = annotation_data["content"]
                annotation_content_json = json.loads(annotation_content)
                annotation_job = annotation_content_json["new_test"]
                annotation_label = annotation_job["label"]
                consolidated_annotation = {
                    "workerId": worker_id,
                    "annotationData": {
                        "content": {
                            "annotatedResult": {
                                "instances": [{"label": annotation_label}]
                            }
                        }
                    }
                }
                annotationsFromAllWorkers.append(consolidated_annotation)

            consolidated_annotation = {"annotationsFromAllWorkers": annotationsFromAllWorkers}  # TODO : Add your consolidation logic

            # Build consolidation response object for an individual data object
            response = {
                "datasetObjectId": dataset_object_id,
                "consolidatedAnnotation": {
                    "content": {
                        label_attribute_name: consolidated_annotation,
                        label_attribute_name + "-metadata": {
                            "class-name": "Native",
                            "confidence": 0.00,
                            "human-annotated": "yes",
                            "creation-date": datetime.strftime(datetime.now(), "%Y-%m-%dT%H:%M:%S"),
                            "type": "groundtruth/custom"
                        }
                    }
                }
            }
            success_count += 1
            # print("{} Consolidating annotations END ".format(log_prefix))

            # Append individual data object response to the list of responses.
            if response is not None:
                consolidated_output.append(response)
        except:
            failure_count += 1
            print(" Consolidation failed for dataobject {}".format(p))
            print(" Unexpected error: Consolidation failed." + str(sys.exc_info()[0]))

    print("Consolidation Complete. Success Count {} Failure Count {}".format(success_count, failure_count))
    print(" -- Consolidated Output -- ")
    print(consolidated_output)
    print(" ------------------------- ")
    return consolidated_output
As you can see above, the do_consolidation method returns an object hard-coded to include a class-name of "Native", and the lambda_handler method returns that same object. Here's the post-annotation function response:
[{
    "datasetObjectId": "4",
    "consolidatedAnnotation": {
        "content": {
            "new-test-14": {
                "annotationsFromAllWorkers": [{
                    "workerId": "private.us-east-1.ea05a03fcd679cbb",
                    "annotationData": {
                        "content": {
                            "annotatedResult": {
                                "instances": [{
                                    "label": "Native"
                                }]
                            }
                        }
                    }
                }]
            },
            "new-test-14-metadata": {
                "class-name": "Native",
                "confidence": 0,
                "human-annotated": "yes",
                "creation-date": "2021-05-19T07:06:06",
                "type": "groundtruth/custom"
            }
        }
    }
}]
As you can see, the post-annotation function return value has the class-name of "Native" in the metadata so I would expect the class-name to be present in the data object metadata, but it's not. And here's a screenshot of the data object summary:
It seems like Ground Truth overwrote the metadata, and now the object doesn't contain the correct label. I think perhaps that's why my label is coming through as the label attribute name "new-test-14" instead of as the correct label "Native". Here's a screenshot of the labeling job in the AWS web console:
The web console is supposed to show the label "Native" inside the "Label" column but instead I'm getting the <labelattributename> "new-test-14" in the label column.
Here is the output.manifest file generated by Ground Truth at the end:
{
    "source-ref": "s3://<file-name>",
    "text": "Hello world",
    "new-test-14": {
        "annotationsFromAllWorkers": [{
            "workerId": "private.us-east-1.ea05a03fcd679ert",
            "annotationData": {
                "content": {
                    "annotatedResult": {
                        "label": "Native"
                    }
                }
            }
        }]
    },
    "new-test-14-metadata": {
        "type": "groundtruth/custom",
        "job-name": "new-test-14",
        "human-annotated": "yes",
        "creation-date": "2021-05-18T12:34:17.400000"
    }
}
What should I return from the Post-Annotation function? Am I missing something in my response? How do I get the proper label to appear in the AWS web console?

Query Django JSONFields that are a list of dictionaries

Given a Django JSONField that is structured as a list of dictionaries:
# JSONField "materials" on MyModel:
[
    {"some_id": 123, "someprop": "foo"},
    {"some_id": 456, "someprop": "bar"},
    {"some_id": 789, "someprop": "baz"},
]
and given a list of values to look for:
myids = [123, 789]
I want to query for all MyModel instances that have a matching some_id anywhere in those lists of dictionaries. I can do this to search in dictionaries one at a time:
# Search inside the third dictionary in each list:
MyModel.objects.filter(materials__2__some_id__in=myids)
But I can't seem to construct a query to search in all dictionaries at once. Is this possible?
Given the clue from Davit Tovmasyan (below) to do this by iterating through the match_targets and building up a set of Q queries, I wrote this function that takes a field name to search, a property name to search against, and a list of target matches. It returns a new list containing the matching dictionaries and the source objects they come from.
from iris.apps.claims.models import Claim
from django.db.models import Q


def json_list_search(
    json_field_name: str,
    property_name: str,
    match_targets: list
) -> list:
    """
    Args:
        json_field_name: Name of the JSONField to search in
        property_name: Name of the dictionary key to search against
        match_targets: List of possible values that should constitute a match

    Returns:
        List of dictionaries: [
            {"claim_id": 123, "json_obj": {"foo": "y"}},
            {"claim_id": 456, "json_obj": {"foo": "z"}}
        ]

    Example:
        results = json_list_search(
            json_field_name="materials_data",
            property_name="material_id",
            match_targets=[1, 22]
        )

        # (results truncated):
        [
            {
                "claim_id": 1,
                "json_obj": {
                    "category": "category_kmimsg",
                    "material_id": 1,
                },
            },
            {
                "claim_id": 2,
                "json_obj": {
                    "category": "category_kmimsg",
                    "material_id": 23,
                }
            },
        ]
    """
    q_keys = Q()
    for match_target in match_targets:
        kwargs = {
            f"{json_field_name}__contains": [{property_name: match_target}]
        }
        q_keys |= Q(**kwargs)

    claims = Claim.objects.filter(q_keys)

    # Now we know which ORM objects contain references to any of the match_targets
    # in any of their dictionaries. Extract *relevant* objects and return them
    # with references to the source claim.
    results = []
    for claim in claims:
        data = getattr(claim, json_field_name)
        for datum in data:
            if datum.get(property_name) and datum.get(property_name) in match_targets:
                results.append({"claim_id": claim.id, "json_obj": datum})

    return results
contains might help you. Should be something like this:
q_keys = Q()
for _id in myids:
    q_keys |= Q(materials__contains={'some_id': _id})
MyModel.objects.filter(q_keys)
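One caveat: when the JSONField's top-level value is a list (as in the question), the fragment may need to be wrapped in a list, the way the function above does it; on PostgreSQL, jsonb containment only matches an object inside an array when the right-hand side is itself an array. A sketch of that variant:
q_keys = Q()
for _id in myids:
    q_keys |= Q(materials__contains=[{'some_id': _id}])
MyModel.objects.filter(q_keys)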

DRF formatting XLSX content

I am trying to set a different color on every second row in an XLSX file. From the documentation I see that I can pass some conditions using the body property or the get_body() method, but this only allows me to set somewhat "static" conditions. Here is the ViewSet config responsible for rendering the XLSX file:
class MyViewSet(XLSXFileMixin, ModelViewSet):
    def get_renderers(self) -> List[BaseRenderer]:
        if self.action == "export":
            return [XLSXRenderer()]
        else:
            return super().get_renderers()

    @action(methods=["GET"], detail=False)
    def export(self, request: Request) -> Response:
        serializer = self.get_serializer(self.get_queryset(), many=True)
        return Response(serializer.data)

    # Properties for XLSX
    column_header = {
        "titles": [
            "Hostname", "Operating System", "OS name", "OS family", "OS version", "Domain", "Serial number",
            "Available patches",
        ],
        "tab_title": "Endpoints",
        "style": {
            "font": {
                "size": 14,
                "color": "FFFFFF",
            },
            "fill": {
                "start_color": "3F803F",
                "fill_type": "solid",
            }
        }
    }
    body = {
        "style": {
            "font": {
                "size": 12,
                "color": "FFFFFF"
            },
            "fill": {
                "fill_type": "solid",
                "start_color": "2B2B2B"
            },
        }
    }
OK. I got the answer after some digging through the source code. The render method of XLSXRenderer has this piece of code:
for row in results:
    column_count = 0
    row_count += 1
    flatten_row = self._flatten(row)
    for column_name, value in flatten_row.items():
        if column_name == "row_color":
            continue
        column_count += 1
        cell = ws.cell(
            row=row_count, column=column_count, value=value,
        )
        cell.style = body_style
    ws.row_dimensions[row_count].height = body.get("height", 40)
    if "row_color" in row:
        last_letter = get_column_letter(column_count)
        cell_range = ws[
            "A{}".format(row_count): "{}{}".format(last_letter, row_count)
        ]
        fill = PatternFill(fill_type="solid", start_color=row["row_color"])
        for r in cell_range:
            for c in r:
                c.fill = fill
So when I added a row_color field to my serializer as a SerializerMethodField, I was able to define a method that colors the rows:
def get_row_color(self, obj: Endpoint) -> str:
    """
    This method returns the color value for a row in the XLSX sheet.
    (*self.instance,) expands the queryset into a tuple (it must be a queryset, not a single Endpoint).
    .index(obj) gets the index of the currently serialized object in that tuple.
    As the last step, one of two values from the list is chosen using a modulo 2 operation on the index.
    """
    return ["353535", "2B2B2B"][(*self.instance,).index(obj) % 2]
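For completeness, wiring this into the serializer looks roughly like this (a sketch; the field list is illustrative, Endpoint is the model behind the ViewSet above, and get_row_color is the method shown above):
from rest_framework import serializers

class EndpointSerializer(serializers.ModelSerializer):
    row_color = serializers.SerializerMethodField()

    class Meta:
        model = Endpoint
        fields = ["hostname", "operating_system", "row_color"]  # assumed field names

    def get_row_color(self, obj: Endpoint) -> str:
        # Alternate two fill colors based on the object's position in the queryset.
        return ["353535", "2B2B2B"][(*self.instance,).index(obj) % 2]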

groovy: create a list of values with all strings

I am trying to iterate through a map and create a new map value. Below is the input:
def map = [[name: 'hello', email: ['on', 'off'] ], [ name: 'bye', email: ['abc', 'xyz']]]
I want the resulting data to be like:
[hello: ['on', 'off'], bye: ['abc', 'xyz']]
The code I have right now -
result = [:]
map.each { key ->
    result[random] = key.email.each { random ->
        "$random"
    }
}
return result
The above code returns
[hello: [on, off], bye: [abc, xyz]]
As you can see from above, the quotes from on, off and abc, xyz have disappeared, which is causing problems for me when I am trying to do checks on the list value [on, off].
It should not matter. If you look at the result in the Groovy console, the values are still Strings.
Below should be sufficient:
map.collectEntries {
    [ it.name, it.email ]
}
If you still need the single quotes (creating a GString instead of a String), then the tweak below would be required:
map.collectEntries {
    [ it.name, it.email.collect { "'$it'" } ]
}
I personally do not see any reason for doing it the latter way. BTW, map is not a Map, it is a List; you can rename it to avoid unnecessary confusion.
You could convert it to a JSON object, and then everything will have quotes.
This does it. There should/may be a groovier way though.
def listOfMaps = [[name: 'hello', email: ['on', 'off'] ], [ name: 'bye', email: ['abc', 'xyz']]]
def result = [:]
listOfMaps.each { map ->
    def list = map.collect { k, v ->
        v
    }
    result[list[0]] = ["'${list[1][0]}'", "'${list[1][1]}'"]
}
println result