Annotation specs - AutoML (GCP) - google-cloud-platform

I'm using the Natural Language module on Google Cloud Platform, and more specifically AutoML for text classification.
I ran into this error, which I do not understand, after I finished importing my data and the text had been processed:
Error: The dataset has too many annotation specs, the maximum allowed number is 5000.
What does it mean? Has anyone else run into it?
Thanks

Take a look at the AutoML Quotas & Limits documentation for a better understanding.
It seems that you are hitting the upper limit of labels per dataset. Check it under AutoML limits --> Labels per dataset --> 2 - 5000 (for classification).
Take into account that limits, unlike quotas, cannot be increased.

I also got this error while I was certain that my number of labels was below 5000. It turned out to be a problem with my CSV formatting.
When you create your text data using to_csv() in Pandas, it only quotes the text fields that contain a comma, while AutoML Text expects every text field to be quoted. I have written up the solution in this Stack Overflow answer.
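For illustration, here is a minimal sketch of the difference, assuming placeholder column names and file paths; the key is passing quoting=csv.QUOTE_ALL so every field is quoted, not only the ones containing commas:

import csv

import pandas as pd

# Hypothetical data; "text" and "label" are placeholder column names.
df = pd.DataFrame({
    "text": ["first example sentence", "second example, with a comma"],
    "label": ["label_a", "label_b"],
})

# Default behaviour: only fields containing special characters (e.g. commas)
# get quoted, which AutoML Text may not accept.
df.to_csv("train_default.csv", index=False, header=False)

# Force quoting of every field so each text value is wrapped in quotes.
df.to_csv("train_quoted.csv", index=False, header=False, quoting=csv.QUOTE_ALL)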

Related

AWS GroundTruth text labeling - hide columns in the data, and checking quality of answers

I am new to SageMaker. I have a large csv dataset which I would like labelled:
sentence_id    sentence            pre_agreed_label
148392         A sentence          0
383294         Another sentence    1
For each sentence, I would like a) a yes/no binary classification in response to a question, and b) on a scale of 1-3, how obvious the classification was. I need the sentence id to map to other parts of the dataset, and will use the pre-agreed labels to assess accuracy.
I have identified SageMaker GroundTruth labelling jobs as a possible way to do this. Is this the best way? In trying to set it up I have run into a few problems.
The first problem is that I can't find a way to display only the sentence column to the labellers, hiding the sentence_id and pre_agreed_label columns.
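One possible way around this, assuming the CSV layout above, is to build the labelling input from just the sentence column and keep the IDs aside for joining results back afterwards. A minimal pandas sketch (the file names are placeholders; Ground Truth text jobs typically read the text to display from the "source" field of a JSON Lines manifest, but check the current documentation for the exact schema):

import json

import pandas as pd

# Hypothetical file name; columns: sentence_id, sentence, pre_agreed_label.
df = pd.read_csv("sentences.csv")

# Write a JSON Lines input manifest containing only the sentence text.
with open("input.manifest", "w") as f:
    for _, row in df.iterrows():
        f.write(json.dumps({"source": row["sentence"]}) + "\n")

# Keep the key columns separately so results can be joined back later
# (e.g. by line order in the output manifest).
df[["sentence_id", "pre_agreed_label"]].to_csv("holdout_keys.csv", index=False)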
The second is that there is either single labelling or multi labelling, but I would like a way to have two sets of single-selection labels:
Select one for binary classification:
Yes
No
Select one for difficulty of classification:
Easy
Medium
Hard
It seems as though this can be done using custom HTML, but I don't know how to do this - the template it gives you doesn't even render.
Finally, having not used Mechanical Turk before, are there ways of ensuring people take the work seriously and don't just select random answers? I can see there's an option to have x number of people answer the same question, but is there also a way to insert, every nth question, an obvious question for which we already have a pre_agreed_label, and to kick people off the task if they get it wrong? There also appears to be a maximum of $1.20 per task, which seems odd.

Google Cloud Vision not automatically splitting images for training/test

It's weird: for some reason GCP Vision won't allow me to train my model. I have met the minimum of 10 images per label, have no unlabelled images, and tried uploading a CSV pointing to 3 of this label's images as VALIDATION images. Yet I get this error:
Some of your labels (e.g. ‘Label1’) do not have enough images assigned to your Validation sets. Import another CSV file and assign those images to those sets.
Any ideas would be appreciated.
This error generally occurs when you have not labelled all the images: AutoML divides your images, including the unlabelled ones, into the sets, and the error is triggered when unlabelled images end up in the VALIDATION set.
According to the documentation, around 1000 images per label are recommended. However, the minimum is 10 images per label, or 50 for complex cases. In addition,
The model works best when there are at most 100x more images for the most common label than for the least common label. We recommend removing very low frequency labels.
Furthermore, AutoML Vision uses 80% of your images for training, 10% for validation, and 10% for testing. Since your images were not divided into these three sets, you should manually assign them to TRAIN, VALIDATION and TEST. You can do that by uploading your images to a GCS bucket and referencing each labelled image in a .csv file, as follows:
TRAIN,gs://my_bucket/image1.jpeg,cat
As you can see above, each row follows the format [SET],[GCS image path],[Label]. Note that you will be dividing your dataset manually, and the split should respect the percentages mentioned above so that you have enough data in each set. You can follow the steps for preparing your training data here and here.
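As a rough illustration of that manual split, here is a minimal sketch, assuming the images are already in a bucket and that the URIs and labels below are placeholders; it splits each label roughly 80/10/10 so every label ends up with images in VALIDATION and TEST:

import csv
import random
from collections import defaultdict

# Hypothetical (GCS URI, label) pairs for images already uploaded to a bucket.
images = [
    ("gs://my_bucket/image1.jpeg", "cat"),
    ("gs://my_bucket/image2.jpeg", "dog"),
    # ...
]

# Group by label so every label gets images in each set.
by_label = defaultdict(list)
for uri, label in images:
    by_label[label].append(uri)

random.seed(42)
rows = []
for label, uris in by_label.items():
    random.shuffle(uris)
    n = len(uris)
    n_val = max(1, n // 10)   # roughly 10% for VALIDATION, at least one image
    n_test = max(1, n // 10)  # roughly 10% for TEST, at least one image
    for i, uri in enumerate(uris):
        if i < n_val:
            split = "VALIDATION"
        elif i < n_val + n_test:
            split = "TEST"
        else:
            split = "TRAIN"
        rows.append([split, uri, label])

# Each row follows the [SET],[GCS image path],[Label] format described above.
with open("all_data.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)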
Note: please be aware that your .csv file is case sensitive.
Lastly, in order to validate your dataset and inspect labelled/unlabelled images, you can export the created dataset and check the exported .csv file, as described in the documentation. After exporting, download it and verify each set (TRAIN, VALIDATION and TEST).
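A quick way to verify the exported file, assuming it keeps the same three-column [SET],[GCS image path],[Label] layout (the file name here is a placeholder):

import pandas as pd

# Hypothetical file name for the exported dataset CSV (no header row assumed).
df = pd.read_csv("exported_dataset.csv", header=None, names=["set", "path", "label"])

# Count images per set, and per label within each set, to confirm that every
# label has images in TRAIN, VALIDATION and TEST.
print(df["set"].value_counts())
print(df.groupby(["set", "label"]).size())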

Clustering Using MapReduce

I have unstructured Twitter data which was retrieved by Apache Flume and stored in HDFS. Now I want to convert this unstructured data into structured form using MapReduce.
Tasks I want to do using MapReduce:
1. Convert the unstructured data into structured form.
2. Keep only the text part that contains the tweet.
3. Identify the tweets for a particular topic and group them by sub-topic.
E.g. I have tweets about Samsung handsets, so I want to group them by handset, such as Samsung Note 4, Samsung Galaxy, etc.
This is my college project, and my guide suggested using the k-means algorithm. I have searched a lot on k-means but failed to understand how to identify the centroids for this; basically, I don't understand how to apply k-means to this situation in MapReduce.
Please guide me if I am doing this wrong, as I am new to this concept.
K-means is a clustering algorithm: it groups similar data points together and computes a common centroid for each group. For the tasks you mention above, you can group the tweets according to their topic.
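To make the idea concrete, here is a minimal, non-MapReduce sketch using scikit-learn (the tweets and the number of clusters are placeholders): the tweets are turned into TF-IDF vectors, and k-means initialises and refines the centroids itself, so you do not pick them by hand.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder tweets; in practice these would be the tweet texts extracted
# from the raw Twitter data stored in HDFS.
tweets = [
    "loving my new samsung note 4 camera",
    "samsung galaxy battery lasts all day",
    "note 4 screen is amazing",
    "galaxy update made my phone faster",
]

# Turn each tweet into a TF-IDF vector; k-means works on these numeric vectors.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

# k (the number of clusters) is chosen up front; centroids are initialised
# automatically (k-means++) and refined iteratively.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)

for tweet, cluster in zip(tweets, labels):
    print(cluster, tweet)

The same logic maps onto MapReduce: mappers assign each tweet vector to its nearest current centroid, reducers recompute the centroids from the assigned points, and the job is repeated until the centroids stop moving.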
A K-means implementation in MapReduce:
https://github.com/himank/K-Means
For using K-means on Twitter datasets, you can check the following links:
https://github.com/JulianHill/R-Tutorials/blob/master/r_twitter_cluster.r
http://www.r-bloggers.com/cluster-your-twitter-data-with-r-and-k-means/
http://rstudio-pubs-static.s3.amazonaws.com/5983_af66eca6775f4528a72b8e243a6ecf2d.html

Google Visualization Annotated Time Line select dot size

Is there any way to change the "select dot" size on a Google Visualization API Annotated Time Line?
I have found that I can set the line size with the thickness property, but can't find anything for the select dot size.
chart.draw(data, {
  displayAnnotations: true,
  displayRangeSelector: false,
  fill: 30,
  thickness: 3,
  colors: ['#59761d', '#1d4376', '#761d1d']
});
I have read the documentation, but don't see anything on it. I had assumed Google themselves used this component in Google Analytics, and Google Analytics definitely has larger select dots. Perhaps they simply borrowed some code for that, and they are indeed different?
You can try to use configuration option names from other visualizations (for example, the option for dot size in a line chart is called "pointSize"), but I doubt they will work.
In my experience, the documentation for the visualizations is very thorough, and if you cannot find it there, it probably does not exist yet. Sorry!

Google Charts data encoding

I have recently started looking into the Google Charts API for possible use within the product I'm working on. When constructing the URL for a given chart, the data points can be specified in three different formats: unencoded, simple encoding, and extended encoding (http://code.google.com/apis/chart/formats.html). However, there seems to be no way around the fact that the highest value that can be specified for a data point is with extended encoding, and in that case it is 4095 (encoded as "..").
Am I missing something here or is this limit for real?
When using the Google Chart API, you will usually need to scale your data yourself so that it fits within the 0-4095 range required by the API.
For example, if you have data values from 0 to 1,000,000, then you could divide all your values by 245 so that they fit within the available range (1,000,000 / 245 ≈ 4081).
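As a small illustration, here is a sketch of scaling values into the 0-4095 range and then producing the extended-encoding string (the data values are placeholders; the two-character alphabet below is the documented extended-encoding alphabet, with ".." corresponding to the maximum value 4095 mentioned above):

# Extended-encoding alphabet: each value 0-4095 maps to two characters.
EXTENDED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-."
)

def scale(values, lo=None, hi=None):
    """Linearly scale values into the 0-4095 range accepted by the API."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = (hi - lo) or 1
    return [round((v - lo) * 4095 / span) for v in values]

def extended_encode(values):
    """Encode already-scaled integers (0-4095) as an extended-encoding data string."""
    chunks = [EXTENDED[v // 64] + EXTENDED[v % 64] for v in values]
    return "e:" + "".join(chunks)

data = [0, 250000, 1000000]
print(extended_encode(scale(data)))  # the maximum value encodes as ".."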
Regarding data scaling, this may also help you:
http://code.google.com/apis/chart/formats.html#data_scaling
Note the chds parameter option.
You may also wish to consider leveraging a wrapper API that abstracts away some of these ugly details. They are listed here:
http://groups.google.com/group/google-chart-api/web/useful-links-to-api-libraries
I wrote charts4j, which has functionality to help you deal with data scaling.