When predicting, what are the valid values for dataFormat? - google-cloud-ml

Problem
Using the REST API, I have trained and deployed a model that I now want to use for prediction. I've defined the collections for prediction input and output and uploaded a JSON file, formatted accordingly, to Cloud Storage. However, when trying to create a prediction job I cannot figure out what value to use for the dataFormat field, which is a required parameter. Is there any way to list all valid values?
What I've tried
My requests look like the one below. I've tried JSON, NEWLINE_DELIMITED_JSON (as when importing data into BigQuery), and even the JSON MIME type application/json, in pretty much every casing I can think of (upper and lower combined with snake case, camel case, etc.).
{
  "jobId": "my_predictions_123",
  "predictionInput": {
    "modelName": "projects/myproject/models/mymodel",
    "inputPaths": [
      "gs://model-bucket/data/testset.json"
    ],
    "outputPath": "gs://model-bucket/predictions/0/",
    "region": "us-central1",
    "dataFormat": "JSON"
  },
  "predictionOutput": {
    "outputPath": "gs://my-bucket/predictions/1/"
  }
}
All my attempts have only gotten me this back though:
{
  "error": {
    "code": 400,
    "message": "Invalid value at 'job.prediction_input.data_format' (TYPE_ENUM), \"JSON\"",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.BadRequest",
        "fieldViolations": [
          {
            "field": "job.prediction_input.data_format",
            "description": "Invalid value at 'job.prediction_input.data_format' (TYPE_ENUM), \"JSON\""
          }
        ]
      }
    ]
  }
}

Per the Cloud ML API reference document, https://cloud.google.com/ml/reference/rest/v1beta1/projects.jobs#DataFormat, the dataFormat field in your request should be "TEXT" for all text-based inputs (including JSON, CSV, etc.).
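That is, the request from the question should pass enum validation with just that one value changed:
{
  "jobId": "my_predictions_123",
  "predictionInput": {
    "modelName": "projects/myproject/models/mymodel",
    "inputPaths": [
      "gs://model-bucket/data/testset.json"
    ],
    "outputPath": "gs://model-bucket/predictions/0/",
    "region": "us-central1",
    "dataFormat": "TEXT"
  },
  "predictionOutput": {
    "outputPath": "gs://my-bucket/predictions/1/"
  }
}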

Related

How do I extract a string of numbers from random text in Power Automate?

I am setting up a flow to organize and save emails as PDF in a Dropbox folder. The first email that will arrive includes a 10-digit identification number which I extract along with an address. My flow creates a folder in Dropbox named in this format: 2023568684 : 123 Main St. Over a few weeks, additional emails arrive that I need to put into that folder. The subject always has a 10-digit number in it. I was building around each email and using functions like split, first, last, etc. to isolate the 10-digit ID. The problem is that there is no consistency in the subjects or bodies of the messages, so I cannot easily find the ID with that method. I ended up starting to build around each email format individually, but there are way too many, not to mention the possibility of new senders or format changes.
My idea is to use List files in folder when a new message arrives, which will create an array that I can filter to find the folder ID the message needs to be saved to. I know there is a limitation on this because of the 20-file limit, but that is a different topic and question.
For now, how do I find a random 10-digit number in a randomly formatted email subject line so I can use it with the filter function?
For this requirement, you really need regex, and at present, Power Automate doesn't support the use of regular expressions. The good news is that it looks like support is coming ...
https://powerusers.microsoft.com/t5/Power-Automate-Ideas/Support-for-regex-either-in-conditions-or-as-an-action-with/idi-p/24768
There is a connector but it looks like it's not free ...
https://plumsail.com/actions/request-free-license
To get around it for now, my suggestion would be to create a function app in Azure and let it do the work. This may not be your cup of tea but it will work.
I created a .NET (C#) function with the following code (straight in the portal) ...
#r "Newtonsoft.Json"
using System.Net;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Primitives;
using Newtonsoft.Json;
public static async Task<IActionResult> Run(HttpRequest req, ILogger log)
{
string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
dynamic data = JsonConvert.DeserializeObject(requestBody);
string strToSearch = System.Text.Encoding.UTF8.GetString(Convert.FromBase64String((string)data?.Text));
string regularExpression = data?.Pattern;
var matches = System.Text.RegularExpressions.Regex.Matches(strToSearch, regularExpression);
var responseString = JsonConvert.SerializeObject(matches, new JsonSerializerSettings()
{
ReferenceLoopHandling = ReferenceLoopHandling.Ignore
});
return new ContentResult()
{
ContentType = "application/json",
Content = responseString
};
}
Then in Power Automate, call the HTTP action, passing in a base64-encoded string of the content you want to search ...
This is the expression in the JSON ... base64(variables('String to Search')) ... and this is the JSON you need to pass in ...
{
  "Text": "#{base64(variables('String to Search'))}",
  "Pattern": "[0-9]{10}"
}
This is an example of the response ...
[
  {
    "Groups": {},
    "Success": true,
    "Name": "0",
    "Captures": [],
    "Index": 33,
    "Length": 10,
    "Value": "2023568684"
  },
  {
    "Groups": {},
    "Success": true,
    "Name": "0",
    "Captures": [],
    "Index": 98,
    "Length": 10,
    "Value": "8384468684"
  }
]
Next, add a Parse JSON action and use this schema ...
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "Groups": {
        "type": "object",
        "properties": {}
      },
      "Success": {
        "type": "boolean"
      },
      "Name": {
        "type": "string"
      },
      "Captures": {
        "type": "array"
      },
      "Index": {
        "type": "integer"
      },
      "Length": {
        "type": "integer"
      },
      "Value": {
        "type": "string"
      }
    },
    "required": [
      "Groups",
      "Success",
      "Name",
      "Captures",
      "Index",
      "Length",
      "Value"
    ]
  }
}
Finally, extract the first value that matches the regex pattern. It returns multiple results if found, so you can do something with the others if you need to.
This is the expression ... #{first(body('Parse_JSON'))?['value']}
From this string ...
We're going to search for string 2023568684 within this text and we're also going to try and find 8384468684, this should work.
... the result is 2023568684 (the first ten-digit match).
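One caveat on the pattern itself: [0-9]{10} will also match the first ten digits of a longer run of digits. If that can happen in your subject lines, .NET regex supports lookarounds, so a stricter pattern like the sketch below should limit matches to standalone ten-digit numbers (untested; adjust to your data):
{
  "Text": "#{base64(variables('String to Search'))}",
  "Pattern": "(?<![0-9])[0-9]{10}(?![0-9])"
}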
Don't have a Premium Power Automate licence, so can't use the HTTP action?
You can do this exact same thing using the LogicApps service in Azure. It's the same engine with some slight differences re: connectors and behaviour.
Instead of the HTTP action, use the Azure Functions action.
In relation to your action to fire when an email is received, in LogicApps, it will poll every x seconds/minutes/hours/etc. rather than fire on event. I'm not 100% sure which email connector you're using but it should exist.
Dropbox connectors exist, that's no problem.
You can export your Power Automate flow into a LogicApps format so you don't have to start from scratch.
https://learn.microsoft.com/en-us/azure/logic-apps/export-from-microsoft-flow-logic-app-template
If you're concerned about cost, don't be. Just make sure you use the consumption plan. Costs only really rack up for these services when the apps run for minutes at a time on a regular basis. Just keep an eye on it for your own mental health.
To get the function URL, you can find it in the function itself. You have to be in the function ...

Cannot find the EntityType - SessionEntityType name is composed of Session name and EntityType.display_name

Even though I have provided the correct information in the SessionEntityTypes, I am getting the following errors. I tried from both the REST & Python options; please let me know if there is anything I am missing in the integrations.
Request
HTTP Method: POST
{
  "name": "projects/{projectId}/locations/asia-northeast1/agent/environments/draft/users/-/sessions/c973fe-e44-9b5-34e-b404439b7/entityTypes/speciality_types",
  "entities": [
    {
      "value": "APPLE_KEY",
      "synonyms": [
        "apple",
        "green apple",
        "crabapple"
      ]
    },
    {
      "value": "ORANGE_KEY",
      "synonyms": [
        "orange"
      ]
    }
  ],
  "entityOverrideMode": "ENTITY_OVERRIDE_MODE_SUPPLEMENT"
}
Response
{
  "error": {
    "code": 400,
    "message": "com.google.apps.framework.request.BadRequestException: Cannot find the EntityType of SessionEntityType 'projects/{projectId}/locations/asia-northeast1/agent/environments/draft/users/-/sessions/c973fe-e44-9b5-34e-b404439b7/entityTypes/speciality_types'. Please note that the SessionEntityType name is composed of Session name and EntityType.display_name.",
    "status": "INVALID_ARGUMENT"
  }
}
I am going to paraphrase the issue here in order to ensure I'm not missing any details: you are attempting to create a sessionEntityType using the “Try this API” tool, i.e. the create (POST) method of API v2.
The issue is that the “name” you are passing in the request body does not have a valid format for API v2.
The format you are using for the name is:
projects/<ProjectID>/locations/<LocationID>/agent/environments/<EnvironmentID>/users/<UserID>/sessions/<SessionID>/entityTypes/<EntityTypeDisplayName>
Below I've listed the two valid name formats for v2; as you can see, the locations/<Location ID> segment is not needed:
projects/<Project ID>/agent/sessions/<Session ID>/entityTypes/<Entity Type Display Name>
and
projects/<Project ID>/agent/environments/<Environment ID>/users/<User ID>/sessions/<Session ID>/entityTypes/<Entity Type Display Name>
The below request body works as intended, I tested it in the same “Try this API” tool:
{
  "name": "projects/{projectId}/agent/environments/draft/users/-/sessions/c973fe-e44-9b5-34e-b404439b7/entityTypes/speciality_types",
  "entities": [
    {
      "value": "APPLE_KEY",
      "synonyms": [
        "apple",
        "green apple",
        "crabapple"
      ]
    },
    {
      "value": "ORANGE_KEY",
      "synonyms": [
        "orange"
      ]
    }
  ],
  "entityOverrideMode": "ENTITY_OVERRIDE_MODE_SUPPLEMENT"
}
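Since the question mentions trying the Python option as well, here is a minimal sketch of the same call using the google-cloud-dialogflow v2 client library. The project ID, session ID, and entity values are the placeholders from the request above; note the parent session name again has no locations/<Location ID> segment:

from google.cloud import dialogflow

client = dialogflow.SessionEntityTypesClient()

# v2 session name, without the locations/<Location ID> segment
parent = ("projects/{projectId}/agent/environments/draft"
          "/users/-/sessions/c973fe-e44-9b5-34e-b404439b7")

session_entity_type = dialogflow.SessionEntityType(
    name=parent + "/entityTypes/speciality_types",
    entity_override_mode=(
        dialogflow.SessionEntityType.EntityOverrideMode.ENTITY_OVERRIDE_MODE_SUPPLEMENT
    ),
    entities=[
        dialogflow.EntityType.Entity(
            value="APPLE_KEY", synonyms=["apple", "green apple", "crabapple"]
        ),
        dialogflow.EntityType.Entity(value="ORANGE_KEY", synonyms=["orange"]),
    ],
)

response = client.create_session_entity_type(
    parent=parent,
    session_entity_type=session_entity_type,
)
print(response)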

GCP recommendation data format for catalog

I am currently working on Recommendations AI. Since I am new to GCP recommendations, I have been struggling with the data format for the catalog. I read the documentation, and it says each product item's JSON should be on a single line.
I understand this totally, but it would be really great if I could see what the JSON format looks like in a real example, because the one in the documentation is very ambiguous to me. I am trying to use the console to import the data.
I tried to import data looking like the below, but I got an error saying invalid JSON format 100 times, with lots of reasons such as an unexpected token, something that should be there, and so on.
[
  {
    "id": "1",
    "title": "Toy Story (1995)",
    "categories": [
      "Animation",
      "Children's",
      "Comedy"
    ]
  },
  {
    "id": "2",
    "title": "Jumanji (1995)",
    "categories": [
      "Adventure",
      "Children's",
      "Fantasy"
    ]
  },
  ...
]
Maybe that was because each item was not on a single line, but I am also wondering whether the above is enough for importing. I am not sure if the data should be wrapped in another property, like:
{
  "inputConfig": {
    "productInlineSource": {
      "products": [
        {
          "id": "1",
          "title": "Toy Story (1995)",
          "categories": [
            "Animation",
            "Children's",
            "Comedy"
          ]
        },
        {
          "id": "2",
          "title": "Jumanji (1995)",
          "categories": [
            "Adventure",
            "Children's",
            "Fantasy"
          ]
        }
      ]
    }
  }
}
I can see the above in the documentation, but it says it is for importing inline, which uses a POST request; it does not mention anything about importing with the console. I would guess the format is also used for the console, but I am not 100% sure, which is why I am asking.
Is there anyone who can show me the entire data format for importing data by using the console?
Problem Solved
For those who might have the same question: the exact data format you should import by using the GCP console looks like
{"id":"1","title":"Toy Story (1995)","categories":["Animation","Children's","Comedy"]}
{"id":"2","title":"Jumanji (1995)","categories":["Adventure","Children's","Fantasy"]}
No square brackets wrapping all the items.
No commas between items.
Each item on its own single line.
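If your catalog starts out as a JSON array like the one earlier in the question, a few lines of Python can flatten it into that shape. A sketch using only the standard library (the file names are made up):

import json

# Read a JSON array of catalog items and write one item per line
# (NDJSON), which is the shape the console import accepts.
with open("catalog_array.json") as src, open("catalog_ndjson.json", "w") as dst:
    for item in json.load(src):
        dst.write(json.dumps(item) + "\n")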
Posting this Community Wiki for better visibility.
OP edited the question and added a solution:
The exact data format you should import by using the GCP console looks like
{"id":"1","title":"Toy Story (1995)","categories":["Animation","Children's","Comedy"]}
{"id":"2","title":"Jumanji (1995)","categories":["Adventure","Children's","Fantasy"]}
No square brackets wrapping all the items.
No commas between items.
Each item on its own single line.
However, I'd like to elaborate a bit.
There are a few ways of importing catalog information:
Importing catalog data from Merchant Center
Importing catalog data from BigQuery
Importing catalog data from Cloud Storage
I guess this is what was used by OP, as I was able to import a catalog using the UI and GCS with the below JSON file.
{
  "inputConfig": {
    "catalogInlineSource": {
      "catalogItems": [
        {"id":"111","title":"Toy Story (1995)","categories":["Animation","Children's","Comedy"]},
        {"id":"222","title":"Jumanji (1995)","categories":["Adventure","Children's","Fantasy"]},
        {"id":"333","title":"Test Movie (2020)","categories":["Adventure","Children's","Fantasy"]}
      ]
    }
  }
}
Importing catalog data inline
At the bottom of the Importing catalog information documentation you can find information:
The line breaks are for readability; you should provide an entire catalog item on a single line. Each catalog item should be on its own line.
It means you should use something similar to NDJSON - a convenient format for storing or streaming structured data that may be processed one record at a time.
If you would like to try the inline method, you should use this format; it is logically one item per line, but shown here with line breaks for readability.
data.json file
{
  "inputConfig": {
    "catalogInlineSource": {
      "catalogItems": [
        {
          "id": "1212",
          "category_hierarchies": [ { "categories": [ "Animation", "Children's" ] } ],
          "title": "Toy Story (1995)"
        },
        {
          "id": "5858",
          "category_hierarchies": [ { "categories": [ "Adventure", "Fantasy" ] } ],
          "title": "Jumanji (1995)"
        },
        {
          "id": "321123",
          "category_hierarchies": [ { "categories": [ "Comedy", "Adventure" ] } ],
          "title": "The Lord of the Rings: The Fellowship of the Ring (2001)"
        }
      ]
    }
  }
}
Command
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data @./data.json \
  "https://recommendationengine.googleapis.com/v1beta1/projects/[your-project]/locations/global/catalogs/default_catalog/catalogItems:import"
{
  "name": "import-catalog-default_catalog-1179023525XX37366024",
  "done": true
}
Please keep in mind that the above method requires Service Account authentication, otherwise you will get a PERMISSION_DENIED error:
"message" : "Your application has authenticated using end user credentials from the Google Cloud SDK or Google Cloud Shell which are not supported by the translate.googleapis.com. We recommend that most server applications use service accounts instead. For more information about service accounts and how to use them in your application, see https://cloud.google.com/docs/authentication/.",
"status" : "PERMISSION_DENIED"

How to highlight custom extractions using a2i's crowd-textract-analyze-document?

I would like to create a human review loop for images that have undergone OCR using Amazon Textract and entity extraction using Amazon Comprehend.
My process is:
send image to Textract to extract the text
send text to Comprehend to extract entities
find the Block IDs in Textract's output of the entities extracted by Comprehend
add new Blocks of type KEY_VALUE_SET to Textract's JSON output per the docs
create a Human Task with crowd-textract-analyze-document element in the template and feed it the modified textract output
What fails to work in this process is step 5. My custom entities are not rendered properly. By "fails to work" I mean that the entities are not highlighted on the image when I click them on the sidebar. There is no error in the browser's console.
Has anyone tried such a thing?
Sorry for not including examples. I will remove secrets/PII from my files and attach them to the question.
I used the AWS documentation for the a2i-crowd-textract-detection human task element to generate the value of the initialValue attribute. It appears the doc for that attribute is incorrect. While the doc shows that the value should be in the same format as the output of Textract, namely:
[
  {
    "BlockType": "KEY_VALUE_SET",
    "Confidence": 38.43309020996094,
    "Geometry": { ... },
    "Id": "8c97b240-0969-4678-834a-646c95da9cf4",
    "Relationships": [
      { "Type": "CHILD", "Ids": [...] },
      { "Type": "VALUE", "Ids": [...] }
    ],
    "EntityTypes": ["KEY"],
    "Text": "Foo bar"
  },
  ...
]
the a2i-crowd-textract-detection expects the input to have lowerCamelCase attribute names (rather than UpperCamelCase). For example:
[
  {
    "blockType": "KEY_VALUE_SET",
    "confidence": 38.43309020996094,
    "geometry": { ... },
    "id": "8c97b240-0969-4678-834a-646c95da9cf4",
    "relationships": [
      { "Type": "CHILD", "ids": [...] },
      { "Type": "VALUE", "ids": [...] }
    ],
    "entityTypes": ["KEY"],
    "text": "Foo bar"
  },
  ...
]
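Until the documentation is corrected, a small recursive converter can rewrite Textract's response into that casing before it is handed to the task template. A sketch in Python; note it lower-cases the first letter of every key, whereas the Type keys inside relationships stay capitalized in the example above, so exclude those if the element really requires them as-is:

def to_lower_camel(obj):
    """Recursively lower-case the first letter of every JSON key."""
    # NB: this also rewrites "Type" -> "type" inside relationships;
    # skip keys named "Type" here if they must stay UpperCamelCase.
    if isinstance(obj, dict):
        return {key[:1].lower() + key[1:]: to_lower_camel(value)
                for key, value in obj.items()}
    if isinstance(obj, list):
        return [to_lower_camel(item) for item in obj]
    return obj

# e.g. initial_value = to_lower_camel(textract_response["Blocks"])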
I opened a support case with AWS about this documentation error.

GCP Stackdriver entries list pagination issue

I am sending the following request to the entries list API; here is the link to the API:
https://cloud.google.com/logging/docs/reference/v2/rest/v2/entries/list
{
  "filter": "(jsonPayload.event_type=\"GCE_OPERATION_DONE\" OR protoPayload.serviceName=\"storage.googleapis.com\" OR protoPayload.serviceName=\"clientauthconfig.googleapis.com\" OR protoPayload.serviceName=\"iam.googleapis.com\" OR protoPayload.serviceName=\"compute.googleapis.com\") AND (jsonPayload.event_subtype=\"compute.instances.insert\" OR jsonPayload.event_subtype=\"compute.instances.delete\" OR protoPayload.methodName=\"storage.buckets.create\" OR protoPayload.methodName=\"storage.buckets.delete\" AND protoPayload.resourceOriginalState.direction=\"EGRESS\" AND protoPayload.request.disabled=true)) AND timestamp>=\"2020-05-16T12:52:00.820Z\" AND timestamp < \"2020-05-16T13:52:00.820Z\"",
  "resourceNames": [
    "projects/project1"
  ],
  "orderBy": "timestamp desc",
  "pageSize": 1000,
  "pageToken": "xxxx"
}
I am getting the following response:
{
  "error": {
    "code": 400,
    "message": "page_token doesn't match arguments from the request",
    "status": "INVALID_ARGUMENT"
  }
}
Can anyone suggest what this message implies, with an example?
This error is faced when the page token of some other request (for example, one issued for a different project) is used in place of the one returned for project1. The pageToken must be the nextPageToken from an earlier entries.list response whose other arguments (filter, orderBy, resourceNames) were identical.
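For example, the flow looks like this (the short filter is just illustrative; in practice the filter, orderBy, and resourceNames must be identical across the two calls). First call, with no pageToken:
{
  "resourceNames": ["projects/project1"],
  "filter": "timestamp>=\"2020-05-16T12:52:00.820Z\"",
  "orderBy": "timestamp desc",
  "pageSize": 1000
}
If the response contains a nextPageToken, repeat the exact same body with only that token added:
{
  "resourceNames": ["projects/project1"],
  "filter": "timestamp>=\"2020-05-16T12:52:00.820Z\"",
  "orderBy": "timestamp desc",
  "pageSize": 1000,
  "pageToken": "<nextPageToken from the previous response>"
}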
Try to test the API using the following format for the request body, and also try without any parameters.
{
  "projectIds": [
    string
  ],
  "resourceNames": [
    string
  ],
  "filter": string,
  "orderBy": string,
  "pageSize": integer,
  "pageToken": string
}