Problems with Image Label Adjustment Job in Amazon SageMaker Ground Truth

I'm trying to create an Image Label Adjustment Job in Ground Truth and I'm having some trouble. I have a dataset of images with pre-made bounding boxes, and an external Python script that creates the "dataset.manifest" file with the JSON for each image. Here are the first four lines of that manifest file:
{"source-ref": "s3://automatic-defect-detection/LM-WNB1-M-0000126254-camera_2_0022.jpg", "bounding-box": {"image_size": [{"width": 2048, "height": 1536, "depth": 3}], "annotations": [{"class_id": 0, "width": 80, "height": 80, "top": 747, "left": 840}]}, "bounding-box-metadata": {"class-map": {"0": "KK"}, "type": "groundtruth/object-detection", "human-annotated": "yes"}}
{"source-ref": "s3://automatic-defect-detection/LM-WNB1-M-0000126259-camera_2_0028.jpg", "bounding-box": {"image_size": [{"width": 2048, "height": 1536, "depth": 3}], "annotations": [{"class_id": 0, "width": 80, "height": 80, "top": 1359, "left": 527}]}, "bounding-box-metadata": {"class-map": {"0": "KK"}, "type": "groundtruth/object-detection", "human-annotated": "yes"}}
{"source-ref": "s3://automatic-defect-detection/LM-WNB1-M-0000126256-camera_3_0006.jpg", "bounding-box": {"image_size": [{"width": 2048, "height": 1536, "depth": 3}], "annotations": [{"class_id": 3, "width": 80, "height": 80, "top": 322, "left": 1154}, {"class_id": 3, "width": 80, "height": 80, "top": 633, "left": 968}]}, "bounding-box-metadata": {"class-map": {"3": "FF"}, "type": "groundtruth/object-detection", "human-annotated": "yes"}}
{"source-ref": "s3://automatic-defect-detection/LM-WNB1-M-0000126253-camera_2_0019.jpg", "bounding-box": {"image_size": [{"width": 2048, "height": 1536, "depth": 3}], "annotations": [{"class_id": 2, "width": 80, "height": 80, "top": 428, "left": 1058}]}, "bounding-box-metadata": {"class-map": {"2": "DD"}, "type": "groundtruth/object-detection", "human-annotated": "yes"}}
Now the problem is that I'm creating private jobs in Amazon SageMaker to try it out. I have the manifest file and the images in an S3 bucket, and it actually sort of works. I select the input manifest and activate the "Existing-labels display options". The existing labels for the bounding boxes do not appear automatically, so I have to enter them manually (I don't know why), but if I do that and preview the job before creating it, the bounding boxes appear perfectly and I can adjust them. The problem is that, even though I am the only worker invited to the job, the task never shows up for me to work on, and the job just auto-completes. I can see afterwards that the output contains the images with my pre-made bounding boxes, but I never get the chance to adjust those boxes. I don't have the "Automated data labeling" option activated. Is there something missing in my manifest file?

There can be multiple reasons for this. First of all, automated data labeling is not supported for label adjustment and verification tasks, so that is ruled out.
It looks like the adjustment job has not been set up properly. Some things to check:
Have you specified the task timeout and task expiration time? If these values are too low, the tasks may expire before anybody can pick them up.
Have you checked the "I want to display existing labels from the dataset for this job." box? It should be checked for your case.
Are your existing labels fetched properly? If they are not fetched correctly, you either need to review your manifest file (see the validation sketch after this list) or provide the label values manually (which I guess you are doing).
Since you are the only worker in the workforce, do you have the correct permissions to access the labeling task?
How many images do you have? Have you set a minimum batch size while setting up the label adjustment job?
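Independent of the console settings, it can also help to sanity-check the manifest before creating the job. Below is a minimal validation sketch in plain Python (no AWS SDK); the key names are the ones from the manifest lines in the question, and check_manifest is just an illustrative helper, not part of any SageMaker API:
import json

REQUIRED_TOP_LEVEL = ("source-ref", "bounding-box", "bounding-box-metadata")

def check_manifest(path):
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                print(f"line {line_no}: invalid JSON ({err})")
                continue
            # every record should carry the image, the labels and the label metadata
            missing = [key for key in REQUIRED_TOP_LEVEL if key not in record]
            if missing:
                print(f"line {line_no}: missing keys {missing}")
                continue
            class_map = record["bounding-box-metadata"].get("class-map", {})
            for box in record["bounding-box"].get("annotations", []):
                # every class_id used by a box must exist in the class map
                if str(box.get("class_id")) not in class_map:
                    print(f"line {line_no}: class_id {box.get('class_id')} not in class-map")

check_manifest("dataset.manifest")
If this prints nothing, the manifest lines at least have the structure Ground Truth expects for pre-existing bounding-box labels, and the issue is more likely in the job configuration than in the file.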

Related

What is the required data format for Google AutoML ".txt to .jsonl" script?

I'm trying to create a dataset for an entity recognition task in Google AutoML, using their script to convert my .txt files to .jsonl and save them in Google Cloud Storage as explained in this tutorial. The data looks like this (from their example, the NCBI Disease Corpus):
"10021369 Identification of APC2, a homologue of the <category="Modifier">adenomatous polyposis coli tumour<\/category> suppressor . "
After uploading to GCS, the labels are not recognized at all. What data format is required?
I'm not quite sure whether <category="Modifier"> should work, but as far as I know, the right way shown in the Quickstart is to annotate in the following way:
{"annotations": [
{"text_extraction": {"text_segment": {"end_offset": 85, "start_offset": 52}}, "display_name": "Modifier"},
{"text_extraction": {"text_segment": {"end_offset": 144, "start_offset": 103}}, "display_name": "Modifier"},
{"text_extraction": {"text_segment": {"end_offset": 391, "start_offset": 376}}, "display_name": "Modifier"},
{"text_extraction": {"text_segment": {"end_offset": 1008, "start_offset": 993}}, "display_name": "Modifier"},
{"text_extraction": {"text_segment": {"end_offset": 1137, "start_offset": 1131}}, "display_name": "SpecificDisease"}],
"text_snippet": {"content": "10021369\tIdentification of APC2, a homologue of the adenomatous polyposis coli tumour suppressor .\tThe ... APC - / - colon
carcinoma cells . Human APC2 maps to chromosome 19p13 . 3. APC and APC2 may therefore have comparable functions in development and cancer .\n "}
}
After importing the dataset, you will see in the AutoML NL UI the five annotations that are specified in the jsonl.
For more reference on the jsonl structure of the example above, you can take a look at the sample files in the Quickstart:
$ gsutil cat gs://cloud-ml-data/NL-entity/dataset.csv
TRAIN,gs://cloud-ml-data/NL-entity/train.jsonl
TEST,gs://cloud-ml-data/NL-entity/test.jsonl
$ gsutil cat gs://cloud-ml-data/NL-entity/train.jsonl
If you are using the Python script for your own text strings, you will see that it generates a CSV file (dataset.csv) and jsonl files with content like:
{"text_snippet": {"content": "This is a disease\n Second line blah blabh"}, "annotations": []}
So you will need to specify the annotations yourself (using start_offset and end_offset), which can be a bit overwhelming to do by hand (a small helper is sketched below), or you can upload the CSV file in the AutoML UI and label the entities interactively.
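If computing the offsets by hand is too tedious, a small helper can locate each entity in the text snippet and emit the annotation entries. This is only a sketch, assuming each entity occurs once and can be found with a plain substring search; build_record is an illustrative name, not part of the AutoML tooling:
import json

def build_record(text, entities):
    """entities is a list of (entity_text, display_name) pairs."""
    annotations = []
    for entity_text, display_name in entities:
        start = text.find(entity_text)
        if start == -1:
            raise ValueError(f"entity not found in text: {entity_text!r}")
        annotations.append({
            "text_extraction": {
                "text_segment": {
                    "start_offset": start,
                    "end_offset": start + len(entity_text),
                }
            },
            "display_name": display_name,
        })
    return {"annotations": annotations, "text_snippet": {"content": text}}

text = "10021369\tIdentification of APC2, a homologue of the adenomatous polyposis coli tumour suppressor ."
record = build_record(text, [("adenomatous polyposis coli tumour", "Modifier")])
print(json.dumps(record))  # one line of the .jsonl file
With the sample text above, this reproduces the offsets of the first annotation in the Quickstart jsonl (start_offset 52, end_offset 85).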

Batch in Postman

I need to post data from Postman, but there is a results limit (200 results per query). I have 45,000 results, so I need to run the query many times to get all the data.
"select" : "(**)", "start": 0, "count": 200,
"select" : "(**)", "start": 201, "count": 401,
"select" : "(**)", "start": 402, "count": 502,
"select" : "(**)", "start": 503, "count": 603
Do we have any way to run the query using batches of 1000, for example? (A looping approach is sketched below.)
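One way to avoid firing dozens of requests by hand in Postman is to script the pagination. Here is a rough Python sketch; the URL and the response shape are placeholders, not taken from the question, and it assumes count is a page size. If your API treats count as an end index, as the snippets above suggest, pass start + BATCH_SIZE instead; and if the server caps count at 200, keep BATCH_SIZE at 200 and the loop will still collect everything:
import requests

URL = "https://example.com/api/search"  # placeholder endpoint
BATCH_SIZE = 1000                       # rows requested per call, if the server allows it
TOTAL = 45000

all_rows = []
for start in range(0, TOTAL, BATCH_SIZE):
    body = {"select": "(**)", "start": start, "count": BATCH_SIZE}
    response = requests.post(URL, json=body, timeout=60)
    response.raise_for_status()
    rows = response.json()
    if not rows:
        break                           # stop early if the server returns nothing
    all_rows.extend(rows)

print(len(all_rows), "rows fetched")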

BASH: Regex -> empty result

I'm using bash but I do not get a BASH_REMATCH... Every online regex checking tool works fine with this string and regex.
#!/bin/bash
set -x
regex='hd_profile_pic_url_info": {"url": "([0-9a-zA-Z._:\/\-_]*)"'
str='{"user": {"pk": 12345, "username": "dummy", "full_name": "dummy", "is_private": true, "profile_pic_url": "censored", "profile_pic_id": "censored", "is_verified": false, "has_anonymous_profile_picture": false, "media_count": 0, "follower_count": 71114, "following_count": 11111, "biography": "", "external_url": "", "usertags_count": 0, "hd_profile_pic_versions": [{"width": 320, "height": 320, "url": "censored"}, {"width": 640, "height": 640, "url": "censored"}], "hd_profile_pic_url_info": {"url": "https://scontent-frt3-2.cdninstagram.com/vp/censored/censored_a.jpg", "width": 930, "height": 930}, "has_highlight_reels": false, "auto_expand_chaining": false}, "status": "ok"}'
[[ $str =~ $regex ]] && echo ${BASH_REMATCH}
Parsing JSON with bash is not a good idea; as others said, jq is the right tool for the job.
Having said that, I think
regex='hd_profile_pic_url_info": {"url": "[0-9a-zA-Z._:\/_-]*"'
would work. Notice the '-' as the last character in the set, to avoid it being interpreted as a range.
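If stepping outside bash is acceptable, parsing the JSON properly avoids the escaping issues entirely. A minimal Python sketch using only the standard library (the script name in the comment is just an example):
import json
import sys

# read the JSON document from stdin, e.g.  echo "$str" | python3 extract_url.py
data = json.loads(sys.stdin.read())
print(data["user"]["hd_profile_pic_url_info"]["url"])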
You have to remove the duplicate _ at the end of your regex:
regex='"hd_profile_pic_url_info": {"url": "([0-9a-zA-Z._:\/\-]*)"'

Use Sublime3 SFTP on EC2

I am trying to edit files on EC2 remotely. I spent a while setting up the config.json, but I still get a timeout error.
I am using a Mac and I have already run chmod 400 on the .pem file.
{
"type": "sftp",
"sync_down_on_open": true,
"host": "xxx.xx.xx.xxx",
"user": "ubuntu",
"remote_path": "/home/ubuntu/",
"connect_timeout": 30,
"sftp_flags": ["-o IdentityFile=/Users/kevinzhang/Desktop/zhang435_ec2.pem"],
}
I figured it out. Just in case anyone else has the same problem:
I am using macOS and the EC2 instance runs Ubuntu.
The config file I have looks like this:
{
// The tab key will cycle through the settings when first created
// Visit http://wbond.net/sublime_packages/sftp/settings for help
// sftp, ftp or ftps
"type": "sftp",
// "save_before_upload": true,
"upload_on_save": true,
"sync_down_on_open": true,
"sync_skip_deletes": false,
"sync_same_age": true,
"confirm_downloads": false,
"confirm_sync": true,
"confirm_overwrite_newer": false,
"host": "xxxx.compute.amazonaws.com",
"user": "ubuntu",
//"password": "password",
"port": "22",
"remote_path": "/home/ubuntu/",
"ignore_regexes": [
"\\.sublime-(project|workspace)", "sftp-config(-alt\\d?)?\\.json",
"sftp-settings\\.json", "/venv/", "\\.svn/", "\\.hg/", "\\.git/",
"\\.bzr", "_darcs", "CVS", "\\.DS_Store", "Thumbs\\.db", "desktop\\.ini"
],
//"file_permissions": "664",
//"dir_permissions": "775",
//"extra_list_connections": 0,
"connect_timeout": 30,
//"keepalive": 120,
//"ftp_passive_mode": true,
//"ftp_obey_passive_host": false,
"ssh_key_file": "~/.ssh/id_rsa",
"sftp_flags": ["-o IdentityFile=<YOUR.PEM FILE path>"],
//"preserve_modification_times": false,
//"remote_time_offset_in_hours": 0,
//"remote_encoding": "utf-8",
//"remote_locale": "C",
//"allow_config_upload": false,
}
If you have a permission problem:
chmod -R 0777 /home/ubuntu/YOURFILE/
This grants read, write and execute permissions to all users.
You may want to create a new user if the above does not work for you:
https://habd.as/sftp-to-ubuntu-server-sublime-text/
I do not know if this makes a difference, but it looks like it started working for me for both users once I created a new user.

Apache Drill: Not able to query the database

I am using Ubuntu 14.04.
I have started exploring querying HDFS with Apache Drill, installed it on my local system, and configured the storage plugin to point to a remote HDFS. Below is the configuration setup:
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://devlpmnt.mycrop.kom:8020",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "json": {
      "type": "json"
    }
  }
}
After creating a JSON file "rest.json", I ran the query:
select * from hdfs.`/tmp/rest.json` limit 1
I am getting following error:
org.apache.drill.common.exceptions.UserRemoteException: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 'hdfs./tmp/rest.json' not found
I would appreciate it if someone could help me figure out what is wrong.
Thanks in advance!!