Predictionio evaluation fails with Text Classification template - evaluation

I am trying to predict a text field based on other text fields on predictionio. I used this guide for reference. I created a new app using
pio app new MyTextApp
and followed the guide upto evaluation using datasource provided in template. It was all okay upto evaluation. On evaluating data source I am getting error as pasted below.
[INFO] [CoreWorkflow$] runEvaluation started
[WARN] [Utils] Your hostname, my-ThinkCentre-Edge72 resolves to a loopback address: 127.0.0.1; using 192.168.65.27 instead (on interface eth0)
[WARN] [Utils] Set SPARK_LOCAL_IP if you need to bind to another address
[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses :[akka.tcp://sparkDriver#192.168.65.27:59649]
[INFO] [CoreWorkflow$] Starting evaluation instance ID: AU29p8j3Fkwdnkfum_ke
[INFO] [Engine$] DataSource: org.template.textclassification.DataSource#faea4da
[INFO] [Engine$] Preparator: org.template.textclassification.Preparator#69f2cb04
[INFO] [Engine$] AlgorithmList: List(org.template.textclassification.NBAlgorithm#45292ec1)
[INFO] [Engine$] Serving: org.template.textclassification.Serving#1ad9b8d3
Exception in thread "main" java.lang.UnsupportedOperationException: empty.maxBy
at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:223)
at scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
at org.template.textclassification.PreparedData.<init>(Preparator.scala:152)
at org.template.textclassification.Preparator.prepare(Preparator.scala:38)
at org.template.textclassification.Preparator.prepare(Preparator.scala:34)
Do I have to edit any config files to make this work? I have successfully ran tests on movielens data.

So this particular error message occurs when your data isn't getting read properly through the DataSource class. If you're using a different text data set, then make sure that you are correctly reflecting any changes to the eventNames, entityType, and respective property field names in the readEventData method.
The maxBy method is used to pull the class with the highest number of observations. If the category to label Map is empty, it means that there are no classes being recorded, which essentially tells you have no data being fed in.
For example, I just did a spam detector using this engine. My e-mail data is of the form:
{"entityType": "content", "eventTime": "2015-06-04T00:22:39.064+0000", "entityId": 1, "event": "e-mail", "properties": {"label": "spam", "text": "content"}}
To use the engine for this data I made the following changes in the DataSource class:
entityType = Some("source"), // specify data entity type
eventNames = Some(List("documents")) // specify data event name
changes to
entityType = Some("content"), // specify data entity type
eventNames = Some(List("e-mail")) // specify data event name
and
)(sc).map(e => Observation(
e.properties.get[Double]("label"),
e.properties.get[String]("text"),
e.properties.get[String]("category")
)).cache
changes to:
)(sc).map(e => {
val label = e.properties.get[String]("label")
Observation(
if (label == "spam") 1.0 else 0.0,
e.properties.get[String]("text"),
label
)
}).cache
After this, I'm able to go through building, training, and deployment, as well as an evaluation.

Related

AWS Kendra PreHook Lambdas for Data Enrichment

I am working on a POC using Kendra and Salesforce. The connector allows me to connect to my Salesforce Org and index knowledge articles. I have been able to set this up and it is currently working as expected.
There are a few custom fields and data points I want to bring over to help enrich the data even more. One of these is an additional answer / body that will contain key information for the searching.
This field in my data source is rich text containing HTML and is often larger than 2048 characters, a limit that seems to be imposed in a String data field within Kendra.
I came across two hooks that are built in for Pre and Post data enrichment. My thought here is that I can use the pre hook to strip HTML tags and truncate the field before it gets stored in the index.
Hook Reference: https://docs.aws.amazon.com/kendra/latest/dg/API_CustomDocumentEnrichmentConfiguration.html
Current Setup:
I have added a new field to the index called sf_answer_preview. I then mapped this field in the data source to the rich text field in the Salesforce org.
If I run this as is, it will index about 200 of the 1,000 articles and give an error that the remaining articles exceed the 2048 character limit in that field, hence why I am trying to set up the enrichment.
I set up the above enrichment on my data source. I specified a lambda to use in the pre-extraction, as well as no additional filtering, so run this on every article. I am not 100% certain what the S3 bucket is for since I am using a data source, but it appears to be needed so I have added that as well.
For my lambda, I create the following:
exports.handler = async (event) => {
// Debug
console.log(JSON.stringify(event))
// Vars
const s3Bucket = event.s3Bucket;
const s3ObjectKey = event.s3ObjectKey;
const meta = event.metadata;
// Answer
const answer = meta.attributes.find(o => o.name === 'sf_answer_preview');
// Remove HTML Tags
const removeTags = (str) => {
if ((str===null) || (str===''))
return false;
else
str = str.toString();
return str.replace( /(<([^>]+)>)/ig, '');
}
// Truncate
const truncate = (input) => input.length > 2000 ? `${input.substring(0, 2000)}...` : input;
let result = truncate(removeTags(answer.value.stringValue));
// Response
const response = {
"version" : "v0",
"s3ObjectKey": s3ObjectKey,
"metadataUpdates": [
{"name":"sf_answer_preview", "value":{"stringValue":result}}
]
}
// Debug
console.log(response)
// Response
return response
};
Based on the contract for the lambda described here, it appears pretty straight forward. I access the event, find the field in the data called sf_answer_preview (the rich text field from Salesforce) and I strip and truncate the value to 2,000 characters.
For the response, I am telling it to update that field to the new formatted answer so that it complies with the field limits.
When I log the data in the lambda, the pre-extraction event details are as follows:
{
"s3Bucket": "kendrasfdev",
"s3ObjectKey": "pre-extraction/********/22736e62-c65e-4334-af60-8c925ef62034/https://*********.my.salesforce.com/ka1d0000000wkgVAAQ",
"metadata": {
"attributes": [
{
"name": "_document_title",
"value": {
"stringValue": "What majors are under the Exploratory track of Health and Life Sciences?"
}
},
{
"name": "sf_answer_preview",
"value": {
"stringValue": "A complete list of majors affiliated with the Exploratory Health and Life Sciences track is available online. This track allows you to explore a variety of majors related to the health and life science professions. For more information, please visit the Exploratory program description. "
}
},
{
"name": "_data_source_sync_job_execution_id",
"value": {
"stringValue": "0fbfb959-7206-4151-a2b7-fce761a46241"
}
},
]
}
}
The Problem:
When this runs, I am still getting the same field limit error that the content exceeds the character limit. When I run the lambda on the raw data, it strips and truncates it as expected. I am thinking that the response in the lambda for some reason isn't setting the field value to the new content correctly and still trying to use the data directly from Salesforce, thus throwing the error.
Has anyone set up lambdas for Kendra before that might know what I am doing wrong? This seems pretty common to be able to do things like strip PII information before it gets indexed, so I must be slightly off on my setup somewhere.
Any thoughts?
since you are still passing the rich text as a metadata filed of a document, the character limit still applies so the document would fail at validation step of the API call and would not reach the enrichment step. A work around is to somehow append those rich text fields to the body of the document so that your lambda can access it there. But if those fields are auto generated for your documents from your data sources, that might not be easy.

Dealing With Incoming Null Values In Cloud Data Fusion When Building Data Pipeline

I have started trying out google cloud data fusion as a prospect ETL tool that I can finally decide to use.When building a data pipeline to fetch data from a REST API source and load it to a MySQL database am facing this error Expected a string but was NULL at line 1 column 221'. Please check the system logs for more details. and yes it's true I have a field that is null from the JSON response am seeing
"systemanswertime": null
How do I deal with null values since the available dropdown in the cloud data fusion studio string is not working are they other optional data types that I can use?
Below are two screenshots showing my current data pipeline structure
geneneral view
view showing mapping and the output schema
Thank You!!
What you need to do is to tell HTTP plugin that you are expecting a null by checking the null checkbox in front of output on the right side. See below example
You might be getting this error because in the JSON schema you are defining the value properties. You should allow systemanswertime parameter to be NULL.
You could try to parse the JSON value as follow:
"systemanswertime": {
"type": [
"string",
"null"
]
}
In the case you don't have access to the JSON file, you could try to use this plug in in order to enable the HTTP to manage nulleable values by dynamically substituting the configurations that can be served by the HTTP Server. You will need access to the HTTP endpoint in order construct an accessible HTTP endpoint that can serve content similar to:
{
"name" : "output.schema", "type" : "schema", "value" :
[
{ "name" : "id", "type" : "int", "nullable" : true},
{ "name" : "first_name", "type" : "string", "nullable" : true},
{ "name" : "last_name", "type" : "string", "nullable" : true},
{ "name" : "email", "type" : "string", "nullable" : true},
]
},
In case you are facing an error such as: No matching schema found for union type: ["string","null"], you could try the following workaround. The root cause of this errors are when the entries in the response from the API doesn't have all the fields it needs to have. For example, some entries may have callerId, channel, last_channel, last data, etc... but others entries may have not have last_channel or whatever other field from the JSON. This leads to a mismatch in the schema provided in the HTTP source and the pipeline fails right away.
As pear this when nodes encounter null values, logical errors, or other sources of errors, you may use an error handler plugin to catch errors. The way is as following:
In the HTTP source plug-in, change the following:
Output schema to account for custom field.
JSON/XML field mapping to account into custom field.
Changed Non-HTTP Error Handling field to Send to Error. This way it pushes the records through error collector and the pipeline proceeds with subsequent records.
Added Error Collector and a sink to capture the error records.
With this method you will be able to run the pipeline and had the problematic fields detected.
Kind regards,
Manuel

Get tasks status in AWS Step Functions (boto3)

I am currently using boto3 (the Amazon Web Services (AWS) SDK for Python) to create state machines, start executions and also in my workers to retrieve tasks and report their status (completed successfully or failed).
I have another service that needs to know the tasks' status and I would like to do so by retrieving it from AWS. I searched the available methods and it is only possible to get the status of a state machine/execution as a whole (RUNNING|SUCCEEDED|FAILED|TIMED_OUT|ABORTED).
There is also the get_execution_history method but each step is identified by an id numbered sequentially and there is no information about the task itself (only in the "stateEnteredEventDetails" event, where the name of the task is present, but the subsequentially events may not be related to it, so it is impossible to know if the task was successful or not).
Is it really not possible to retrieve the status of a specific task, or am I missing something?
Thank you!
I had the same problem, and it seems that step functions does not consider the states and tasks as entities, and therefore there is not an API to get info about them.
In order to get info about the task's status you need to parse the information in the execution history. In my case I first check the execution status:
import boto3
import json
client = boto3.client("stepfunctions")
response = client.describe_execution(
executionArn=EXECUTION_ARN
)
status = response["status"]
and if it is "FAILED" then I analyze the history and get the most relevant fields for my use case (for events of type "TaskFailed"):
response = client.get_execution_history(
executionArn=EXECUTION_ARN,
maxResults=1000
)
events = response["events"]
while response.get("nextToken"):
response = client.get_execution_history(
executionArn=EXECUTION_ARN,
maxResults=1000,
nextToken=response["nextToken"]
)
events += response["events"]
causes = [
json.loads(e["taskFailedEventDetails"]["cause"])
for e in events
if e["type"] == "TaskFailed"
]
return [
{
"ClusterArn": cause["ClusterArn"],
"Containers": [
{
"ContainerArn": container["ContainerArn"],
"Name": container["Name"],
"ExitCode": container["ExitCode"],
"Overrides": cause["Overrides"]["ContainerOverrides"][i]
}
for i, container in enumerate(cause["Containers"])
],
"TaskArn": cause["TaskArn"],
"StoppedReason": cause["StoppedReason"]
}
for cause in causes
]

AWS Config: Custom rule for required-tags says "no tags" even when "Name" tag exists

I have customised the managed rule - required-tags and modified the lambda function to extend upto 9 tags and more.
The code seems to work and gives me the expected result.
This rule checks for tags given in "Rule Parameters". Empty and partially tagged are non compliant; fully tagged are compliant.
The problem I face is when a new EC2 instance is created and the custom rule triggers with configuration changes gives me "no tags" even when "name" tag is present.
When I re-evaluate for the second time, the expected result is got (missing tags except name tag)
The condition goes like this
if evaluation["compliance_type"] == "NON_COMPLIANT":
print ("NON_COMPLIANT")
if len(evaluation["current_tags"]) > 0:
print ("Non zero tags")
// evaluation report
else:
print ("Zero tags")
// evaluation report
EC2 instance was laucnhed around 12:39 PM
Cloudwatch logs after automatic trigger(configuration changes) around 12:41 PM
{
'current_tags': [],
'compliance_type': 'NON_COMPLIANT',
'annotation': 'Name, Customer, Environment, etc not present'
}
Cloudwatch logs after manual re-evaluation around 12:43 PM
{
'current_tags': [{u'value': u'instance_name', u'key': u'Name'}],
'compliance_type': 'NON_COMPLIANT',
'annotation': 'Customer, Environment, etc not present'
}
The event payload passed to lambda function (trigger: configuration changes) at instance creation time has current_tags: empty (even though name tag is added during instance creation; exists). Is there a way to find out how and when the tags gets added to the instance. Or the trigger can be delayed (not periodic trigger)

Extracting Cache Process information via SOAP in 2008.2

What is the best way to configure Intersystems Cache 2008.2 so that the Web Service interface could be utilized to export a table consisting of the process information?
It will be something like this:
Class SSH.WS Extends %SOAP.WebService [ ProcedureBlock ]
{
/// Name of the WebService.
Parameter SERVICENAME = "SSHTEST";
/// Return process list with some additional info
Method GetProcess() As %XML.DataSet [ WebMethod ]
{
set Dataset=##class(%XML.DataSet).%New()
set Dataset.ClassName="%SYS.ProcessQuery"
set Dataset.QueryName="SS"
quit Dataset
}
}
You may use any query from %SYS.ProcessQuery, or create your own based on one of these.