Continuous evaluation of AI Platform gives a `data_json_key` error - google-cloud-ml

Running continuous evaluation on AI Platform fails with a data_json_key error.
Evaluation Job Inputs
Model objective: Image object detection (box)
IoU (Intersection over Union): 0.1
Data key: b64
Prediction label key: detection_classes
Prediction score key: detection_scores
Bounding box key: detection_boxes
Labeling Service: No
Daily sample percentage: 100%
Daily sample limit: 100
Error log
Partial Failures: [{"code":5,"message":"Can not find the image data under the data_json_key: image_bytes/b64"},{"code":5,"message":"Can not find the image data under the data_json_key: image_bytes/b64"},{"code":5,"message":"Can not find the image data under the data_json_key: image_bytes/b64"},{"message":"Found incorrect number of labeled dataset when preparing evaluation for dataset_id: 5ee3023a_0000_25e5_a9d2_94eb2c19321a"}]
I have data_json_key set to b64 and I think this is the correct predictive key for the model. However, the job returns an error as if it were expecting image_bytes/b64.
Supplement
The model is made by transfer learning from ssd_mobilenetv2_oidv4 [1] with the TensorFlow Object Detection API, following this method [2].
[1] https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
[2] https://cloud.google.com/blog/products/gcp/performing-prediction-with-tensorflow-object-detection-models-on-google-cloud-machine-learning-engine

You're close. The Google documentation for Continuous Evaluation shows that the data key is the key, within the JSON instance submitted for prediction, under which the input data is found. In the AI Platform image classification example, the JSON is shown as:
{
  "instances": [
    {
      "image_bytes": {
        "b64": "iVBORw0KGgoAAAANSUhEUgAAAAYAAAAGCAYAAADgzO9IAAAAhUlEQVR4AWOAgZeONnHvHcXiGJDBqyDTXa+dVC888oy51F9+eRdY8NdWwYz/RyT//znEsAjEt277+syt5VMJw989DM/+H2MI/L8tVBQk4d38xcWp7ctLhi97ZCZ0rXV6yLA4b6dH59sjTq3fnji1fp4AsWS5j7PXstRg+/b3gU7N351AQgA8+jkf43sjaQAAAABJRU5ErkJggg=="
      }
    }
  ]
}
The tutorial then continues:
Then provide the following keys:
Data key: image_bytes/b64
Prediction label key: sentiments
Prediction score key: confidence
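To make the relationship between the JSON and the data key concrete, here is a minimal sketch that walks a slash-separated key through a prediction instance. The lookup helper and the shortened base64 string are hypothetical and only illustrate the idea; this is not how the evaluation service is implemented.

```python
import json


def lookup(instance, data_key):
    """Follow a slash-separated key path through nested dicts.

    Returns None when any part of the path is missing.
    """
    node = instance
    for part in data_key.split("/"):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


# Shortened, made-up payload with the same shape as the example above.
request = json.loads('{"instances": [{"image_bytes": {"b64": "iVBORw0KGgo="}}]}')
instance = request["instances"][0]

print(lookup(instance, "image_bytes/b64"))  # the base64 string
print(lookup(instance, "b64"))              # None: "b64" is not a top-level key
```

Under this reading, a data key of b64 only resolves if the instance has b64 at its top level, while image_bytes/b64 resolves for instances shaped like the documentation example.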

ml::KNearest->findNearest() inconsistent result when category label changes

I am using OpenCV 3.4.1.
I am working on a video classification project and am trying to use KNearest to classify between 2 categories. I have 8 areas of interest in each video frame. To make a decision on each frame, a separate KNearest model is run on the pixel values of each area, and the majority wins (a tie goes to one fixed category). So, I have 8 sets of training data (one for each area of interest).
Problem: The response generated from the knn model changed when I labelled the categories differently.
The training data sets are organized as rows of:
[category label], data0, data1, data2....etc. (different dimensions for each
training set)
where dataX = pixel data of a frame (1 row = 1 frame)
Then, I build the model by:
Ptr<TrainData> tdata = TrainData::loadFromCSV(filename, 0, 0, -1, String("cat"));
Mat raw = tdata->getTrainSamples();
Mat res = tdata->getResponses();
PCA pca(raw, noArray(), PCA::DATA_AS_ROW, 0.99);
Mat knnIn = pca.project(raw);
Ptr<ml::KNearest> knn = ml::KNearest::create();
knn->train(knnIn, ml::ROW_SAMPLE, res);
After that, testing data is passed to the pca and knn to get the response.
To test it, I pass 1 set of testing data to the 8 knn models.
If I use 0 & 1 as the [category label] in the training data sets, the 8 responses from KNN are 1,1,1,1,1,0,0,1.
If I change the labels to 2 & 1 instead (replacing every '0' with '2' in the first column of the training data), the 8 responses become 1,1,1,1,1,1,1,1, while I am expecting 1,1,1,1,1,2,2,1.
Some observations:
No editing errors were introduced in the training data while changing the category labels.
The KNN model reports isClassifier()=true, DefaultK=10, AlgorithmType=1 (BRUTE_FORCE).
The result is consistent when the same training and testing data sets are reused.
I don't see any pattern in the differing responses for the 2 label sets (after trying different training and testing data sets).
Please shed some light. Thank you very much!
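One thing that may be worth ruling out, given DefaultK=10 and only two categories, is a 5/5 tie among the neighbours. The sketch below is plain Python, not OpenCV, and its tie-breaking rule (the smallest label value wins) is purely an assumption for illustration; I have not confirmed how KNearest resolves ties. It only shows that a label-dependent tie-break would produce this kind of change when 0 is renamed to 2.

```python
from collections import Counter


def knn_vote(neighbor_labels):
    """Majority vote over neighbour labels.

    Ties go to the smallest label value (an assumed rule, for illustration).
    """
    counts = Counter(neighbor_labels)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


# Ten nearest neighbours split 5/5 between the two categories (k = 10).
tied_01 = [0] * 5 + [1] * 5
# Same neighbours after replacing every 0 with 2.
tied_21 = [2 if label == 0 else label for label in tied_01]

print(knn_vote(tied_01))  # 0
print(knn_vote(tied_21))  # 1, not 2: the tie now falls to the other category

# A clear majority is unaffected by the relabelling.
print(knn_vote([2] * 6 + [1] * 4))  # 2
```

If ties turn out to be the cause, an odd K (for example via setDefaultK) would avoid them for two categories.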

RAY - RLLIB - Failing to train DQN using offline sample batch - episode_len_mean: .nan value

I am using the Ray RLlib library to train a DQN model on offline batch data. The model fails to learn and episode_len_mean is .nan, both for the CartPole example and for my own domain-specific dataset.
OS: Ubuntu
Library: Ray RLlib
Algorithm: DQN
Mode: offline
Environment: tried with CartPole-v0 as well as with a custom environment example.
episode_len_mean: .nan
episode_reward_max: .nan
episode_reward_mean: .nan
episode_reward_min: .nan
episodes_this_iter: 0
episodes_total: 0
Generate data using PG
rllib train --run=PG --env=CartPole-v0 --config='{"output": "/tmp/cartpole-out", "output_max_file_size": 5000000}' --stop='{"timesteps_total": 100000}'
Train model on offline data
rllib train --run=DQN --env=CartPole-v0 --config='{"input": "/tmp/cartpole-out","input_evaluation": ["is", "wis"],"soft_q": true, "softmax_temp": 1.0}'
Expected :-
episode_len_mean: numerical values
episode_reward_max: numerical values
episode_reward_mean: numerical values
episode_reward_min: numerical values
Actual Results (No improvement observed in tensorboard as well) :-
episode_len_mean: .nan
episode_reward_max: .nan
episode_reward_mean: .nan
episode_reward_min: .nan
I had more or less the same problem and it was linked to the fact that the episode was never finishing because I didn't properly set the "done" value in the step function. Until an episode is "done", Ray doesn't calculate the metrics. In my case I had to specify a counter in the environment init function called self.count_steps and incremented on each step.
def step(self, action):
    # Changes the yaw angles
    self.cur_yaws = action
    self.farm.calculate_wake(yaw_angles=action)
    # power output
    power = self.farm.get_farm_power()
    reward = (power-self.best_power)/self.best_power
    #if power > self.best_power:
    #    self.best_power = power
    self.count_steps +=1
    if self.count_steps > 50:
        done = True
    else:
        done = False
    return self.cur_yaws, reward, done, {}
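For reference, here is a stripped-down sketch of the same step-counter idea that runs on its own. CountedEpisodeEnv, max_steps and run_episode are hypothetical names; the class only mimics the Gym-style reset/step interface and imports neither Ray nor Gym.

```python
class CountedEpisodeEnv:
    """Toy environment that ends every episode after max_steps steps."""

    def __init__(self, max_steps=50):
        self.max_steps = max_steps
        self.count_steps = 0

    def reset(self):
        # The counter must be reset here, otherwise later episodes end at once.
        self.count_steps = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.count_steps += 1
        reward = 1.0  # dummy reward
        done = self.count_steps >= self.max_steps
        return 0.0, reward, done, {}


def run_episode(env):
    """Step the environment until it reports done; return (steps, total reward)."""
    env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        _, reward, done, _ = env.step(action=0)
        total_reward += reward
        steps += 1
    return steps, total_reward


print(run_episode(CountedEpisodeEnv(max_steps=50)))  # (50, 50.0)
```

Without the done flag ever becoming True, the loop in run_episode would never finish, which is the situation in which no episode metrics can be reported.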

Tensorflow return similar images

I want to use Google's Tensorflow to return similar images to an input image.
I have installed Tensorflow from http://www.tensorflow.org (using PIP installation - pip and python 2.7) on Ubuntu14.04 on a virtual machine CPU.
I have downloaded the trained model Inception-V3 (inception-2015-12-05.tgz) from http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz that is trained on ImageNet Large Visual Recognition Challenge using the data from 2012, but I think it has both the Neural network and the classifier inside it (as the task there was to predict the category). I have also downloaded the file classify_image.py that classifies an image in 1 of the 1000 classes in the model.
So I have a random image, image.jpg, that I am running to test the model. When I run the command:
python /home/amit/classify_image.py --image_file=/home/amit/image.jpg
I get the below output: (Classification is done using softmax)
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 3
I tensorflow/core/common_runtime/direct_session.cc:58] Direct session inter op parallelism threads: 3
trench coat (score = 0.62218)
overskirt (score = 0.18911)
cloak (score = 0.07508)
velvet (score = 0.02383)
hoopskirt, crinoline (score = 0.01286)
Now, the task at hand is to find images that are similar to the input image (image.jpg) out of a database of 60,000 images (jpg format, and kept in a folder at /home/amit/images). I believe this can be done by removing the final classification layer from the Inception-v3 model, computing the cosine distance between the feature set of the input image and the feature sets of all 60,000 images, and returning the images with the smallest distance (identical directions give cos 0 = 1, i.e. the highest similarity).
Please suggest a way forward for this problem and how to do this using the Python API.
I think I found an answer to my question:
In the file classify_image.py that classifies the image using the pre trained model (NN + classifier), I made the below mentioned changes (statements with #ADDED written next to them):
def run_inference_on_image(image):
  """Runs inference on an image.
  Args:
    image: Image file name.
  Returns:
    Nothing
  """
  if not gfile.Exists(image):
    tf.logging.fatal('File does not exist %s', image)
  image_data = gfile.FastGFile(image, 'rb').read()
  # Creates graph from saved GraphDef.
  create_graph()
  with tf.Session() as sess:
    # Some useful tensors:
    # 'softmax:0': A tensor containing the normalized prediction across
    #   1000 labels.
    # 'pool_3:0': A tensor containing the next-to-last layer containing 2048
    #   float description of the image.
    # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG
    #   encoding of the image.
    # Runs the softmax tensor by feeding the image_data as input to the graph.
    softmax_tensor = sess.graph.get_tensor_by_name('softmax:0')
    feature_tensor = sess.graph.get_tensor_by_name('pool_3:0') #ADDED
    predictions = sess.run(softmax_tensor,
                           {'DecodeJpeg/contents:0': image_data})
    predictions = np.squeeze(predictions)
    feature_set = sess.run(feature_tensor,
                           {'DecodeJpeg/contents:0': image_data}) #ADDED
    feature_set = np.squeeze(feature_set) #ADDED
    print(feature_set) #ADDED
    # Creates node ID --> English string lookup.
    node_lookup = NodeLookup()
    top_k = predictions.argsort()[-FLAGS.num_top_predictions:][::-1]
    for node_id in top_k:
      human_string = node_lookup.id_to_string(node_id)
      score = predictions[node_id]
      print('%s (score = %.5f)' % (human_string, score))
I ran the pool_3:0 tensor by feeding the image_data to it. Please let me know if I am making a mistake. If this is correct, I believe we can use this tensor for further calculations.
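As a rough sketch of those further calculations, the snippet below ranks stored feature vectors by cosine similarity to a query vector. It is plain Python with toy 2-element vectors standing in for the 2048-float pool_3 features; cosine_similarity, most_similar and the file names are hypothetical, and for 60,000 images a vectorised NumPy version would be the more practical choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, database, top_n=5):
    """Return the names of the top_n database entries closest to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in database.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_n]]


# Toy stand-ins for precomputed feature sets of the database images.
database = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.0, 1.0],
    "c.jpg": [0.9, 0.1],
}
print(most_similar([1.0, 0.0], database, top_n=2))  # ['a.jpg', 'c.jpg']
```

The feature sets of the database images only need to be computed once and stored; each query then costs one forward pass plus the ranking.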
Tensorflow now has a nice tutorial on how to get the activations before the final layer and retrain a new classification layer with different categories:
https://www.tensorflow.org/versions/master/how_tos/image_retraining/
The example code:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py
In your case, yes, you can get the activations from pool_3, the layer below the softmax layer (the so-called bottlenecks), and send them to other operations as input:
Finally, about finding similar images, I don't think ImageNet's bottleneck activations are a very pertinent representation for image search. You could consider using an autoencoder network with direct image inputs.
Your problem sounds similar to this visual search project

How would I approach a lot of structured-but-inconsistent data?

I'm attempting to parse EDGAR documents - they're SEC filings. Specifically, I'm attempting to parse both SEC Schedule 13D and Schedule 13G filings.
There appear to be lots of failed attempts at parsing these filings, and I assume that's because doing so is a behemoth task that an entire team would have to tackle.
I was tasked with parsing those filings. We need the information from the data tables found throughout. The problem is that the filings on record make it hard for me to distinguish between data points, table section headers, etc.
So far, I've only been able to scrape information from around 10% of the Schedule 13D files, and even what I've scraped needs considerable cleaning. In a nutshell, I'm matching a regular expression pattern to text. The pattern takes one known (English) section header and the one that comes next (I set each manually) and extracts what's in between: e.g., CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP(.*?)SEC USE ONLY. Clearly, that's not going to get me very far, and it isn't. Using the same logic, here's what I get based on the following example string:
example text
NAMES OF REPORTING PERSONS I.R.S. IDENTIFICATION NOS. OF ABOVE PERSONS
(ENTITIES ONLY)Robert DePaloCHECK THE APPROPRIATE BOX IF A MEMBER OF A
GROUP(see
instructions)(a)    (b)    SEC
USE ONLYSOURCE OF FUNDS (see instructions)CHECK BOX IF DISCLOSURE OF
LEGAL PROCEEDINGS IS REQUIRED PURSUANT TO ITEMS 2(d) or
2(e)     CITIZENSHIP OR PLACE OF
ORGANIZATIONUnited StatesSOLE VOTING POWER45,119,857 (1)SHARED VOTING
POWER-0-SOLE DISPOSITIVE POWER45,119,857 (1)10.SHARED DISPOSITIVE
POWER-0-11.AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING
PERSON45,119,857 (1)12.CHECK BOX IF THE AGGREGATE AMOUNT IN ROW (11)
EXCLUDES CERTAIN SHARES(see
instructions)    13.PERCENT OF CLASS
REPRESENTED BY AMOUNT IN ROW (11)33.4% (2)14.TYPE OF REPORTING PERSON
(see instructions)(1)  Consists of 44,194,298 shares of
Common Stock held by the Reporting Person and 925,559 shares of Common
Stock held by Arjent Limited UK.  The Reporting Person is
the Chairman of Arjent Limited UK and has voting and investment
authority over shares held by it.  Does not include any
classes of preferred shares that the Reporting Person and an entity
owned by the Reporting Person’s wife are entitled to receive, as
discussed in Item 6 below.(2)  Does not include the voting
interest that the Reporting Person is entitled to receive under the
SPHC Series B Preferred Shares, as discussed in Item 6 of this
Schedule 13D.
example output
key: CHECK THE | v: (a)    (b)    
key: CITIZENSHI | v: United States
key: CHECK BOX | v:      
key: SHARED VOT | v: -0-
key: PERCENT OF | v: PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW \(11\)
key: TYPE OF RE | v: TYPE OF REPORTING PERSON \(see instructions\)
key: CHECK BOX | v:     13.
key: SOLE DISPO | v: 45,119,857
key: SEC USE ON | v: SEC USE ONLY
key: SHARED DIS | v: -0
key: SOLE VOTIN | v: 45,119,857
key: NAMES OF R | v: Robert DePalo
key: AGGREGATE | v: 45,119,857 12.
key: SOURCE OF | v: SOURCE OF FUNDS \(see instructions\)
Are there any other approaches? This doesn't work for most of the 13D filings, and it won't work for 13G. I have a feeling I'm a little too naive in my approach, and I need a common approach to a problem like this. I'm looking to scrape at least 80% of at least 80% of the filings.
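For concreteness, the header-to-header matching described above can be sketched as follows. The HEADERS list, the normalize and extract_between helpers, and the shortened sample are illustrative assumptions rather than my actual code; collapsing whitespace first lets headers that wrap across lines still match.

```python
import re

# Known section headers, in the order they are expected to appear.
HEADERS = [
    "CITIZENSHIP OR PLACE OF ORGANIZATION",
    "SOLE VOTING POWER",
    "SHARED VOTING POWER",
    "SOLE DISPOSITIVE POWER",
]


def normalize(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text)


def extract_between(text, headers):
    """Capture the text between each header and the header that follows it."""
    text = normalize(text)
    fields = {}
    for start, end in zip(headers, headers[1:]):
        pattern = re.escape(start) + r"(.*?)" + re.escape(end)
        match = re.search(pattern, text, flags=re.DOTALL)
        if match:
            fields[start] = match.group(1).strip()
    return fields


sample = """CITIZENSHIP OR PLACE OF
ORGANIZATIONUnited StatesSOLE VOTING POWER45,119,857 (1)SHARED VOTING
POWER-0-SOLE DISPOSITIVE POWER45,119,857 (1)"""

for key, value in extract_between(sample, HEADERS).items():
    print("key:", key[:10], "| v:", value)
```

The weakness is visible in the design: every header must be listed by hand and must appear verbatim, so any filing that rewords, reorders, or omits a header silently loses that field.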

Neo4j regex string matching not returning expected results

I'm trying to use the Neo4j 2.1.5 regex matching in Cypher and running into problems.
I need to implement a full text search on specific fields that a user has access to. The access requirement is key and is what prevents me from just dumping everything into a Lucene instance and querying that way. The access system is dynamic and so I need to query for the set of nodes that a particular user has access to and then within those nodes perform the search. I would really like to match the set of nodes against a Lucene query, but I can't figure out how to do that so I'm just using basic regex matching for now. My problem is that Neo4j doesn't always return the expected results.
For example, I have about 200 nodes with one of them being the following:
( i:node {name: "Linear Glass Mosaic Tiles", description: "Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!"})
This query produces one result:
MATCH (p)-->(:group)-->(i:node)
WHERE (i.name =~ "(?i).*mosaic.*")
RETURN i
> Returned 1 row in 569 ms
But this query produces zero results even though the description property matches the expression:
MATCH (p)-->(:group)-->(i:node)
WHERE (i.description=~ "(?i).*mosaic.*")
RETURN i
> Returned 0 rows in 601 ms
And this query also produces zero results even though it includes the name property which returned results previously:
MATCH (p)-->(:group)-->(i:node)
WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
WHERE (searchText =~ "(?i).*mosaic.*")
RETURN i
> Returned 0 rows in 487 ms
MATCH (p)-->(:group)-->(i:node)
WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
RETURN searchText
>
...
SotoLinear Glass Mosaic Tiles Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!
...
Even more odd, if I search for a different term, it returns all of the expected results without a problem.
MATCH (p)-->(:group)-->(i:node)
WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
WHERE (searchText =~ "(?i).*plumbing.*")
RETURN i
> Returned 8 rows in 522 ms
I then tried to cache the search text on the nodes and I added an index to see if that would change anything, but it still didn't produce any results.
CREATE INDEX ON :node(searchText)
MATCH (p)-->(:group)-->(i:node)
WHERE (i.searchText =~ "(?i).*mosaic.*")
RETURN i
> Returned 0 rows in 3182 ms
I then tried to simplify the data to reproduce the problem, but in this simple case it works as expected:
MERGE (i:node {name: "Linear Glass Mosaic Tiles", description: "Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!"})
WITH i, (
i.name + " " + COALESCE(i.description, "")
) AS searchText
WHERE searchText =~ "(?i).*mosaic.*"
RETURN i
> Returned 1 rows in 630 ms
I tried using the CYPHER 2.1.EXPERIMENTAL tag as well but that didn't change any of the results. Am I making incorrect assumptions on how the regex support works? Is there something else I should try or some other way to debug the problem?
Additional information
Here is a sample call that I make to the Cypher Transactional Rest API when creating my nodes. This is the actual plain text that is sent (other than some formatting for easier reading) when adding nodes to the database. Any string encoding is just standard URL encoding that is performed by Go when creating a new HTTP request.
{"statements":[
{
"parameters":
{
"p01":"lsF30nP7TsyFh",
"p02":
{
"description":"Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!",
"id":"lsF3BxzFdn0kj",
"name":"Linear Glass Mosaic Tiles",
"object":"material"
}
},
"resultDataContents":["row"],
"statement":
"MATCH (p:project { id: { p01 } })
WITH p
CREATE UNIQUE (p)-[:MATERIAL]->(:materials:group {name: \"Materials\"})-[:MATERIAL]->(m:material { p02 })"
}
]}
If it is an encoding issue, why does a search on name work, description not work, and name + description not work? Is there any way to examine the database to see if/how the data was encoded? When I perform searches, the text returned appears correct.
just a few notes:
probably replace create unique with merge (which works a bit differently)
for your fulltext search I would go with the lucene legacy index for performance, if your group restriction is not limiting enough to keep the response below a few ms
I just tried your exact json statement, and it works perfectly.
inserted with
curl -H accept:application/json -H content-type:application/json -d @insert.json \
-XPOST http://localhost:7474/db/data/transaction/commit
json:
{"statements":[
{
"parameters":
{
"p01":"lsF30nP7TsyFh",
"p02":
{
"description":"Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!",
"id":"lsF3BxzFdn0kj",
"name":"Linear Glass Mosaic Tiles",
"object":"material"
}
},
"resultDataContents":["row"],
"statement":
"MERGE (p:project { id: { p01 } })
WITH p
CREATE UNIQUE (p)-[:MATERIAL]->(:materials:group {name: \"Materials\"})-[:MATERIAL]->(m:material { p02 }) RETURN m"
}
]}
queried:
MATCH (p)-->(:group)-->(i:material)
WHERE (i.description=~ "(?i).*mosaic.*")
RETURN i
returns:
name: Linear Glass Mosaic Tiles
id: lsF3BxzFdn0kj
description: Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!
object: material
What you can try to check your data is to look at the json or csv dumps that the browser offers (little download icons on the result and table-result)
Or you use neo4j-shell with my shell-import-tools to actually output csv or graphml and check those files.
Or use a bit of java (or groovy) code to check your data.
There is also the consistency-checker that comes with the neo4j-enterprise download. Here is a blog post on how to run it.
java -cp 'lib/*:system/lib/*' org.neo4j.consistency.ConsistencyCheckTool /tmp/foo
I added a groovy test script here: https://gist.github.com/jexp/5a183c3501869ee63d30
One more idea: regexp flags
Sometimes there is a multiline issue going on; there are two more flags:
multiline (?m), which makes ^ and $ match at the start and end of each line, and
dotall (?s), which allows the dot to also match line terminators such as newlines
So could you try (?ism).*mosaic.*
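A quick way to see what the flags do is the sketch below. It uses Python's re.fullmatch as a stand-in, assuming Cypher's =~ has to match the entire string (as Java's whole-string matching does); the shortened text and the embedded newline are made up for the demonstration, since I don't know whether your stored description actually contains one.

```python
import re

# Hypothetical description containing a line break after the first sentence.
text = "Linear glass mosaic tiles.\nBack in stock end of August."

# Without dotall, '.' stops at the newline, so the whole string cannot match.
print(bool(re.fullmatch(r"(?i).*mosaic.*", text)))   # False

# Multiline only changes ^ and $, so it does not help here.
print(bool(re.fullmatch(r"(?im).*mosaic.*", text)))  # False

# With dotall, '.' also matches the newline and the whole string matches.
print(bool(re.fullmatch(r"(?is).*mosaic.*", text)))  # True
```

If the (?s) variant finds the node while the plain (?i) one does not, a line terminator in the stored description is the likely culprit.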