Google Charts data encoding - google-visualization

I have recently started looking into the Google Charts API for possible use within the product I'm working on. When constructing the URL for a given chart, the data points can be specified in three different formats: unencoded, simple encoding, and extended encoding (http://code.google.com/apis/chart/formats.html). However, there seems to be no way around the fact that the highest value a data point can take is 4095, and only when using extended encoding (where it is encoded as "..").
Am I missing something here or is this limit for real?

When using the Google Chart API, you will usually need to scale your data yourself so that it fits within the 0-4095 range required by the API.
For example, if you have data values from 0 to 1,000,000, you could divide all your data by 245 so that it fits within the available range (1,000,000 / 245 ≈ 4081).
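As a rough illustration of the scaling step, here is a minimal Python sketch. It assumes the extended-encoding alphabet is A-Z, a-z, 0-9, "-" and "." (64 symbols, two characters per value), which is consistent with 4095 being encoded as "..". The function names `scale` and `extended_encode` are made up for this example.

```python
# Assumed extended-encoding alphabet: 64 symbols, so two characters
# cover 64 * 64 = 4096 distinct values (0..4095).
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-."
)
MAX_VALUE = len(ALPHABET) ** 2 - 1  # 4095


def scale(values, max_input):
    """Linearly map values from 0..max_input onto the integers 0..4095."""
    return [round(v * MAX_VALUE / max_input) for v in values]


def extended_encode(scaled):
    """Encode integers in 0..4095 as two-character pairs."""
    out = []
    for v in scaled:
        if not 0 <= v <= MAX_VALUE:
            raise ValueError(f"{v} is outside 0..{MAX_VALUE}")
        out.append(ALPHABET[v // 64] + ALPHABET[v % 64])
    return "".join(out)


data = [0, 250_000, 1_000_000]
scaled = scale(data, max_input=1_000_000)
encoded = extended_encode(scaled)
print(scaled, encoded)
```

Scaling by the exact ratio 4095 / max uses the whole range, whereas dividing by a rounded constant such as 245 leaves the top few values unused.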

Regarding data scaling, this may also help you:
http://code.google.com/apis/chart/formats.html#data_scaling
Note the chds parameter option.
You may also wish to consider leveraging a wrapper API that abstracts away some of these ugly details. They are listed here:
http://groups.google.com/group/google-chart-api/web/useful-links-to-api-libraries
I wrote charts4j which has functionality to help you deal with data scaling.
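For the chds option mentioned above, a small sketch of building such a URL in Python might look like this. The base URL, the chart type and the chart size are assumptions picked for illustration, and `chart_url` is a hypothetical helper; check the formats page linked above for the exact parameter rules.

```python
from urllib.parse import urlencode

# Assumed endpoint; substitute whatever base URL you actually use.
BASE_URL = "https://chart.googleapis.com/chart"


def chart_url(values, lo, hi, size="250x100", chart_type="lc"):
    """Build a chart URL that sends raw values and lets chds rescale them."""
    params = {
        "cht": chart_type,                                # chart type (line chart here)
        "chs": size,                                      # chart size in pixels
        "chd": "t:" + ",".join(str(v) for v in values),   # text-format data
        "chds": f"{lo},{hi}",                             # min,max used for scaling
    }
    # Keep ':' and ',' readable instead of percent-encoding them.
    return BASE_URL + "?" + urlencode(params, safe=":,")


url = chart_url([0, 250000, 1000000], lo=0, hi=1000000)
print(url)
```

With this approach the values are sent unscaled in text format and the min/max pair tells the service how to map them, so no manual division is needed.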

Related

Annotation specs - AutoML (GCP)

I'm using the Natural Language module on Google Cloud Platform, and more specifically AutoML for text classification.
I came across this error, which I do not understand, after my data had finished importing and the text had been processed:
Error: The dataset has too many annotation specs, the maximum allowed number is 5000.
What does it mean? Has anyone run into it before?
Thanks
Take a look at the AutoML Quotas & Limits documentation for better understanding.
It seems that you are hitting the upper limit on labels per dataset. Check it in the AutoML limits under Labels per dataset: 2 - 5000 (for classification).
Take into account that limits, unlike quotas, cannot be increased.
I also got this error even though I was certain that my number of labels was below 5000. It turned out to be an error in my CSV formatting.
When you create your text data using to_csv() in Pandas, it only quotes the text fields that contain a comma, while AutoML Text wants every line of text to be quoted. I have written the solution in this Stackoverflow answer
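To illustrate the quoting difference, here is a minimal sketch using Python's standard csv module with made-up rows. pandas' to_csv() accepts the same constant through its quoting argument (quoting=csv.QUOTE_ALL). Note that QUOTE_ALL quotes every field, including the label column; whether AutoML requires that for labels is not something stated here.

```python
import csv
import io

# Hypothetical (text, label) rows for illustration.
rows = [
    ("The service was quick, friendly and cheap", "positive"),
    ("Never again", "negative"),
]

# Default quoting (QUOTE_MINIMAL) only quotes fields that need it,
# such as text containing a comma.
minimal = io.StringIO()
csv.writer(minimal).writerows(rows)

# QUOTE_ALL wraps every field in double quotes.
quoted = io.StringIO()
csv.writer(quoted, quoting=csv.QUOTE_ALL).writerows(rows)

print(minimal.getvalue())
print(quoted.getvalue())
```

In the first output only the row containing a comma is quoted, which matches the behaviour described above. In the second every field is quoted.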

How to use Apache beam to process Historic Time series data?

I have the Apache Beam model to process multiple time series in real time. Deployed on GCP DataFlow, it combines multiple time series into windows, and calculates the aggregate etc.
I now need to perform the same operations over historic data (the same (multiple) time series data) stretching all the way back to 2017. How can I achieve this using Apache beam?
I understand that I need to use the windowing property of Apache Beam to calculate the aggregates etc., but it should accept data from 2 years back onwards.
Effectively, I need the data that would have been available had I deployed the same pipeline 2 years ago. This is needed for testing/model training purposes.
That sounds like a perfect use case of Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as events have timestamps. Without additional context I think you will need to have an explicit step in your pipeline to assign custom timestamps (from 2017) that you will need to extract from the data. To do this you can probably use either:
context.outputWithTimestamp() in your DoFn;
WithTimestamps PTransform;
You might need to configure the allowed timestamp skew if you run into timestamp ordering issues.
See:
outputWithTimestamp example: https://github.com/apache/beam/blob/efcb20abd98da3b88579e0ace920c1c798fc959e/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowingTest.java#L248
documentation for WithTimestamps: https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/WithTimestamps.html#of-org.apache.beam.sdk.transforms.SerializableFunction-
similar question: Assigning to GenericRecord the timestamp from inner object
another question that may have helpful details: reading files and folders in order with apache beam
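The idea behind event-time processing can be sketched in plain Python. This is not Beam code: the record layout, the one-hour window size and all function names are assumptions made for illustration. In the Beam Python SDK, the corresponding step would be wrapping each element in apache_beam.window.TimestampedValue before applying WindowInto.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 3600  # hypothetical one-hour fixed windows


def event_time(record):
    """Extract the event timestamp (epoch seconds, UTC) carried in the record."""
    parsed = datetime.fromisoformat(record["ts"]).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def window_start(ts):
    """Start (epoch seconds) of the fixed window that contains ts."""
    return int(ts // WINDOW_SECONDS) * WINDOW_SECONDS


def aggregate(records):
    """Sum values per (series, window), keyed by event time, not arrival time."""
    sums = defaultdict(float)
    for r in records:
        sums[(r["series"], window_start(event_time(r)))] += r["value"]
    return dict(sums)


historic = [
    {"series": "a", "ts": "2017-01-01T00:10:00", "value": 1.0},
    {"series": "a", "ts": "2017-01-01T00:50:00", "value": 2.0},
    {"series": "a", "ts": "2017-01-01T01:05:00", "value": 4.0},
]
result = aggregate(historic)
print(result)
```

Because each window is derived only from the timestamp stored in the record, replaying 2017 data today, in any order, yields the same per-window aggregates as a pipeline that had been running at the time.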

SGDClassifier with HashingVectorizer and TfidfTransformer

I would like to understand if it is possible to train an online SGDClassifier (with partial_fit) using HashingVectorizer and TfidfTransformer. Simply joining them in a Pipeline will not work as TfidfTransformer is stateful so that would break the online learning process. This post says it's not possible to use tf-idf in an online fashion but a comment on this post suggests that it may somehow be possible: "In particular if you use stateful transformers as TfidfTransformer you will need to do several passes on your data". Is that possible without loading the whole training set into memory? If so, how? If not, is there an alternative solution to combine HashingVectorizer with tf-idf on large datasets?
Is that possible without loading the whole training set into memory?
No. TfidfTransformer needs to have the entire X matrix in memory. You'll need to roll your own tf-idf estimator, use that to compute per-term document frequencies in one pass over the data, then do another pass to produce tf-idf features and fit a classifier to them.
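A minimal sketch of that two-pass approach, in plain Python so it stays self-contained: tokens are hashed into a fixed number of buckets (crc32 is used here as a stand-in for the hashing trick), pass 1 streams over the corpus to collect document frequencies, and pass 2 streams over it again to emit tf-idf weights. The bucket count, the `read_docs` stand-in and all function names are assumptions. The idf uses the smoothed form ln((1 + n) / (1 + df)) + 1, and no vector normalisation is applied.

```python
import math
import zlib
from collections import Counter

N_BUCKETS = 2 ** 10  # hypothetical size of the hashed feature space


def hashed_counts(text):
    """Term counts for one document, with tokens hashed into fixed buckets."""
    return Counter(
        zlib.crc32(tok.encode("utf-8")) % N_BUCKETS for tok in text.lower().split()
    )


def document_frequencies(docs):
    """Pass 1: count how many documents touch each bucket, one document at a time."""
    df, n_docs = Counter(), 0
    for text in docs:
        df.update(hashed_counts(text).keys())
        n_docs += 1
    return df, n_docs


def tfidf_stream(docs, df, n_docs):
    """Pass 2: yield one sparse {bucket: weight} dict per document."""
    for text in docs:
        counts = hashed_counts(text)
        yield {
            b: tf * (math.log((1 + n_docs) / (1 + df[b])) + 1)
            for b, tf in counts.items()
        }


def read_docs():
    """Stand-in for re-reading a large corpus from disk on each pass."""
    yield "spam spam eggs"
    yield "eggs and ham"


df, n_docs = document_frequencies(read_docs())
features = list(tfidf_stream(read_docs(), df, n_docs))
print(features)
```

Only the per-bucket document-frequency table has to stay in memory between passes. In the second pass the yielded rows could be grouped into mini-batches and handed to a classifier's partial_fit.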

Weka: Classifier and ReplaceMissingValues

I am relatively new to the data mining area and have been experimenting with Weka.
I have a dataset which consists of almost 8000 records related to customers and items they have purchased. 58% of this data set has missing values for the "Gender" attribute.
I want to find the missing gender values based on the other data I do have.
I first thought I could do this using a classifier algorithm in Weka, using a training set to build a model. Based on examples I saw online, I tried this with pretty much all the algorithms available in Weka, using a training set that consisted of 60-80% of the data which did not have missing values. This gave me a lower accuracy rate than I wanted (80-86% depending on the algorithm used).
Did I go about this correctly? Is there a way to improve this accuracy? I experimented with using different attributes, different pre-processing of the data etc.
I also tried using the ReplaceMissingValues filter on the complete dataset to see how that would handle the missing values. However, it just changed all the missing values to "Female", which obviously cannot be the case. So I'm also wondering whether I need to use this filter in my situation or not.
It sounds like you went about it in the correct way. The ReplaceMissingValues filter replaces the missing values with the most frequent of the non-missing values, I think, so it is not what you want in this case.
A better way to get an idea of the true accuracy of your gender predictor would be to use cross-validation instead of the training/test split (Weka has a separate option for that). 80-86% may seem low, but keep in mind that random guessing will only get you about 50%, so it's still a lot better than that. To try to get better performance, pick a classifier that performs well and then play with its parameters until the results improve. This is likely to be quite labour-intensive (although you could of course use automated methods for tuning, see e.g. Auto-WEKA), but it is the only way to improve the performance.
You can also combine the algorithm you choose with a separate feature selection step (Weka has a special meta-classifier for this). This may improve performance, but again you'll have to experiment to find the particular configuration that works for you.
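For readers unfamiliar with cross-validation, the splitting it performs can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Weka implements it, and `k_fold_indices` is a hypothetical helper.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train, test) index lists; every row is used for testing exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_rows=10, k=5))
for train, test in splits:
    print(sorted(test), "held out;", len(train), "rows used for training")
```

Averaging the accuracy over the k held-out folds gives an estimate that depends less on one particular training/test split.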

Google Places vs. Qype vs. others

At the moment I am working on a regional evaluation system.
I want to find out, for example, how regions are composed, given
a lat/long coordinate and a radius. I would like to be able to separate the results by type, and the data also needs to be up to date.
So which API based services do you recommend, if the following factors are important:
support for lat/long coordinates with search radius
differentiation by type of location
up to date information
As far as I know, Google Places and qype.com offer APIs which should be able to do so.
Is there a better option, or which of the two do you recommend and why?
As far as I could find out, only Qype and Google Places offer such APIs.
Google offers 1000 requests per day for free while Qype only offers 200,
but one can apply for multiple keys with Qype, which allows more requests per day.
With Qype it is possible to retrieve the full set of commercial establishments in range (bounding box or radius), while Google Places is restricted to 60 places per request.
That is the reason why I decided to use Qype.
I did not evaluate whether the information is up to date,
but Qype shows reasonable results when applied to Munich.