What does global step mean? - google-cloud-ml

I recently completed the Cloud ML Criteo tutorial, and one of the final log messages from the distributed training job on the "small" dataset (~40M examples) was:
Saving dict for global step 7520: accuracy = 0.78864, ...
What does "global step" refer to here? I originally thought it was:
global step = (number of training examples * number of epochs) / batch size
However the training set size is 40.8M, the batch size is 30K, and the number of epochs is 5, so this doesn't lead to the right answer:
(40.8M x 5) / 30K = 6800

I think I understand this now. Even though the training set size is 40.8M examples, there is a line in the code that says it is 45M examples (I don't know why). And
(45M x 5) / 30K = 7500
which basically matches the log message.
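For reference, a couple of lines of Python reproduce the arithmetic (45M here is simply the figure hard-coded in the tutorial, not the actual dataset size):
train_size = 45_000_000   # the value hard-coded in the tutorial code
epochs = 5
batch_size = 30_000
print(train_size * epochs // batch_size)   # 7500, close to the logged 7520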

Related

GCP Console: How are percentile charts calculated?

I do not understand how the charts that show percentiles are calculated inside the Google Cloud Platform Monitoring UI.
Here is how I am creating the standard chart:
Example log events
Creating a log-based metric for request durations
https://cloud.google.com/logging/docs/logs-based-metrics/distribution-metrics
https://console.cloud.google.com/logs/viewer
Here I have configured a histogram of 20 buckets, starting from 0, each bucket 100 ms wide:
0 - 100 ms,
100 - 200 ms,
... and so on up to 2 seconds (2,000 ms)
Creating a chart to show percentiles over time
https://console.cloud.google.com/monitoring/metrics-explorer
I do not understand how these histogram buckets work with "aggregator", "aligner" and "alignment period".
The UI forces using an "aligner" and "alignment period".
Questions
A. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
B. Do the histogram buckets configured for the log-based metric affect these sums?
I've been looking into the same question for a couple of days and found the Understanding distribution percentiles section in the official docs quite helpful.
The percentile value is a computed value. The computation takes into account the number of buckets, the width of the buckets, and the total count of samples. Because the actual measured values aren't known, the percentile computation can't rely on this data.
They have a good example with buckets [0, 1) [1, 2) [2, 4) [4, 8) [8, 16) [16, 32) [32, 64) [64, 128) [128, 256) and only one measurement in the last bucket [128, 256) (none in all other buckets).
You use the bucket counts to determine that the [128, 256) bucket contains the 50th percentile.
You assume that the measured values within the selected bucket are uniformly distributed and therefore the best estimate of the 50th percentile is the bucket midpoint.
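To make that bucket-midpoint estimate concrete, here is a minimal Python sketch (my own simplification, not Cloud Monitoring's actual implementation) that estimates a percentile from nothing but the bucket boundaries and counts:
def estimate_percentile(bounds, counts, p):
    # bounds: list of (lower, upper) bucket boundaries; counts: samples per bucket; p: 0-100.
    total = sum(counts)
    target = p / 100.0 * total                 # rank of the percentile sample
    cumulative = 0
    for (lower, upper), count in zip(bounds, counts):
        cumulative += count
        if count > 0 and cumulative >= target:
            # Values inside a bucket are assumed uniformly distributed,
            # so the bucket midpoint is the best estimate.
            return (lower + upper) / 2.0
    return None

# The docs' example: one sample in [128, 256), none in the other buckets.
bounds = [(0, 1), (1, 2), (2, 4), (4, 8), (8, 16),
          (16, 32), (32, 64), (64, 128), (128, 256)]
counts = [0, 0, 0, 0, 0, 0, 0, 0, 1]
print(estimate_percentile(bounds, counts, 50))   # -> 192.0, the midpoint of [128, 256)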
A. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
I find the GCP Console UI for Metrics Explorer a little misleading/confusing with its wording as well (but maybe it's just me being unfamiliar with their terms). The key concepts here are Alignment and Reduction, I think.
The aligner produces a single value placed at the end of each alignment period.
A reducer is a function that is applied to the values across a set of time series to produce a single value.
The difference between the two is horizontal vs. vertical aggregation. In the UI, the Aggregators (both primary and secondary) are reducers.
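To illustrate the horizontal vs. vertical distinction, here is a toy Python sketch (purely illustrative; this is not how Cloud Monitoring is implemented, and the series and values are made up). The aligner collapses each individual series within every alignment period, and the reducer then combines the aligned series at each aligned timestamp:
from statistics import mean

# Two made-up "duration_ms" time series as (timestamp_seconds, value) samples.
series_a = [(0, 120), (20, 90), (40, 200), (70, 150)]
series_b = [(10, 300), (35, 80), (65, 110)]

def align(series, period, aligner):
    # Horizontal aggregation: one value per alignment period, per series.
    buckets = {}
    for t, v in series:
        buckets.setdefault(t // period, []).append(v)
    return {p: aligner(vs) for p, vs in buckets.items()}

def reduce_across(aligned_series, reducer):
    # Vertical aggregation: combine several aligned series at each period.
    periods = sorted(set().union(*aligned_series))
    return {p: reducer([s[p] for s in aligned_series if p in s]) for p in periods}

aligned = [align(s, 60, mean) for s in (series_a, series_b)]
print(reduce_across(aligned, max))   # -> roughly {0: 190, 1: 150}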
Back to the question: a sum aligner applied before a percentile reducer seems more useful in other use cases than yours. In short, a mean or max aligner may be more useful for your "duration_ms" metric, but they're not available in the dropdown in the UI, and to be honest I haven't figured out how to implement them in the MQL Editor either; I'm just referencing the docs here. There are other aligners that may also be useful, but I'll leave them out for now.
B. Do the histogram buckets configured for the log-based metric affect these sums?
Like Anthony, I'm not quite sure what the question is implying either. I'll assume you're asking whether you can align/reduce log-based metrics using these aligners/reducers, and the answer is yes. However, you'll need to know which metric type you're using (counter vs. distribution) and aggregate it in the corresponding way.
Before we look at your questions, we must understand Histograms.
In the documentation you provided in the post, there is a section that explains histogram buckets. Looking at that section and comparing it with your setup, we can see that you are using the Linear type to specify the boundaries between histogram buckets for distribution metrics.
Furthermore, the Linear type takes three values for its calculations:
offset value (the start value, a)
width value (the bucket width, b)
N value (the number of finite buckets)
Every finite bucket has the same width, and the boundaries are calculated using the following formula: offset + width × i (where i = 0, 1, 2, ..., N).
For example, if the start value is 5, the number of buckets is 4, and the bucket width is 15, then the bucket ranges are as follows:
[-INF, 5), [5, 20), [20, 35), [35, 50), [50, 65), [65, +INF)
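As a quick illustration, a few lines of Python (my own sketch, not an official API) reproduce that bucket layout from the three Linear parameters:
def linear_buckets(offset, width, num_finite_buckets):
    bounds = [offset + width * i for i in range(num_finite_buckets + 1)]
    buckets = [("-INF", bounds[0])]                # underflow bucket
    buckets += list(zip(bounds[:-1], bounds[1:]))  # the finite buckets
    buckets.append((bounds[-1], "+INF"))           # overflow bucket
    return buckets

print(linear_buckets(offset=5, width=15, num_finite_buckets=4))
# [('-INF', 5), (5, 20), (20, 35), (35, 50), (50, 65), (65, '+INF')]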
Now that we understand the formula, we can look at your questions and answer them:
How are percentile charts calculated?
If we look at the documentation on Selecting metrics, there is a section that explains how aggregation works; I would suggest reading that part to understand aggregation in GCP.
The formula to calculate the Percentile is the following:
R = (P / 100) × (N + 1)
where R represents the rank order of the score, P represents the percentile rank, and N represents the number of scores in the distribution.
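For illustration, here is a hedged Python sketch of that rank formula applied to a plain list of scores (a nearest-rank style estimate of my own, not Cloud Monitoring's exact algorithm):
def percentile_from_rank(p, scores):
    scores = sorted(scores)
    n = len(scores)
    r = p / 100.0 * (n + 1)                        # R = (P / 100) x (N + 1)
    index = min(max(int(round(r)) - 1, 0), n - 1)  # clamp to a valid list index
    return scores[index]

print(percentile_from_rank(50, [120, 80, 200, 150, 90]))   # -> 120, the median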
If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
In the same section, it also explains what the Alignment Period is, but for the most part, the alignment period determines the length of time for subdividing the time series. For example, you can break a time series into one-minute chunks or one-hour chunks. The data in each period is summarized so that a single value represents that period. The default alignment period is one minute.
Although you can set the alignment interval for your data, time series might be realigned when you change the time interval displayed on a chart or change the zoom level.
Do the histogram buckets configured for the log-based metric affect these sums?
I am not too sure what you are asking here. Are you asking whether, when logs are created, the sums would be altered by the logs being generated?
I hope this helps!

Google AutoML Importing text items very slow

I'm importing text items into Google's AutoML. Each row contains around 5,000 characters, and I'm adding 70K of these rows. This is a multi-label data set. There is no progress bar or indication of how long this process will take. It's been running for a couple of hours. Is there any way to calculate the time remaining or the total estimated time? I'd like to add additional data sets, but I'm worried that this will be a very long process before the training even begins. Any sort of formula to create even a semi-wild guess would be great.
-Thanks!
I don't think that's possible today, but I filed a feature request [1] that you can follow for updates. I asked about both training and importing data, since it could be useful for training too.
I tried training with 50K records (~300 bytes/record) and the load took more than 20 minutes, after which I killed it. I retried with 1K, which ran for 20 minutes and then emailed me an error message saying I had multiple labels per input (yes, so what? training data is going to have some of those) and more than 100 labels. I simplified the classification buckets and re-ran. It took another 20 minutes and was successful. Then I ran training, which took 3 hours and billed me $11. That maps to about $550 for 50K records, assuming linear behavior. The prediction results were not bad for a first pass, but I got the feeling that it is throwing a super large neural net at the problem. It would help if they said what NN it was and its dimensions. They do say "beta" :)
Don't waste your time trying to use Google for text classification. I am a heavy GCP user, but Microsoft LUIS is far better, more precise, and so much faster that I can't believe both products are trying to solve the same problem.
LUIS has much better documentation, supports more languages, has a much better test interface, and is way faster. I don't know if it is cheaper yet because the pricing model is different, but we are willing to pay more.

same predictions for all the inputs with a fine tuned inception v3 model

I am trying to fine-tune an Inception v3 model with 2 categories. These are the steps I followed:
1. Created sharded files from custom data using build_image_data.py, after changing the number of classes and examples in imagenet_data.py, and used a labelsfile.txt.
2. Changed the values accordingly in flowers_data.py and trained the model using flowers_train.py.
3. Froze the model and got a protobuf file.
4. My input node (x) expects a batch of size 32 and size 299x299x3, so I hacked my way around this by duplicating my test image 32 times to create an input batch.
5. Using the input and output nodes, the input batch, and the script below, I am able to print the prediction scores:
image_data = create_test_batch(args.image_name)   # batch of 32 copies of the test image
graph = load_graph(args.frozen_model_filename)
x = graph.get_tensor_by_name('prefix/batch_processing/Reshape:0')    # input node
y = graph.get_tensor_by_name('prefix/tower_0/logits/predictions:0')  # output node
with tf.Session(graph=graph) as sess:
    y_out = sess.run(y, feed_dict={x: image_data})
    print(y_out)
I got the result which looks like:
[[ 0.02264258  0.16756369  0.80979371]
 [ 0.02351799  0.16782859  0.80865341]
 ...
 [ 0.02205461  0.1794569   0.7984885 ]
 [ 0.02153662  0.16436867  0.81409472]]  (32 rows in total)
For any image as input, I have been getting the maximum score only in column 3 which means I'd get the same prediction for any input.
Is there any point which I am missing in my process? Can anyone help me with this issue?
I am using Python 2.7 on Ubuntu 16.04 in a cloud VM.
Hi, I was also facing the same problem, but I found out that I hadn't preprocessed my test set in the same way as my train set. In my case, I fixed the problem by following the same preprocessing steps for both the test and train sets.
This is the problem in mine:
Training:
import cv2
import numpy as np
from keras.preprocessing import image   # assuming Keras' image utilities are used here

train_image = []
for i in files:                          # files: the list of training image paths
    img = cv2.imread(i)
    img = cv2.resize(img, (100, 100))
    img = image.img_to_array(img)
    img = img / 255                      # normalize pixels to [0, 1]
    train_image.append(img)
X = np.array(train_image)
But while preprocessing the test set I forgot to normalize the image (i.e. the img = img/255 step), so adding img = img/255 to my test preprocessing solved the problem.
Test Set:
test_image = []
for i in files:                          # here, files is the list of test image paths
    img = cv2.imread(i)
    img = cv2.resize(img, (100, 100))
    img = image.img_to_array(img)
    img = img / 255                      # the normalization step I had forgotten
    test_image.append(img)
I had a similar issue recently, and what I found was that I pre-processed my test data differently from the method used in the fine-tune program. Basically the data ranges were different: for the training images the pixels ranged from 0 to 255, while for the test images the pixels ranged from 0 to 1. That's why, when I fed my test data into the model, it output the same predictions every time: the test pixels were in such a small range that they didn't make any difference to the model.
Hope that helps, even though it might not be your case.
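As a quick sanity check for this kind of mismatch (my own suggestion, not part of either answer above), you can print the value ranges of a training batch and a test batch before feeding them to the frozen graph; if one sits in [0, 255] and the other in [0, 1], the preprocessing is inconsistent. The arrays below are hypothetical stand-ins for real preprocessed batches:
import numpy as np

def report_range(name, batch):
    # Print min/max so mismatched scaling (0-255 vs 0-1) is easy to spot.
    batch = np.asarray(batch, dtype=np.float32)
    print("%s: min=%s max=%s" % (name, batch.min(), batch.max()))

report_range("train batch", np.random.randint(0, 256, (32, 299, 299, 3)))  # hypothetical
report_range("test batch", np.random.rand(32, 299, 299, 3))                # hypothetical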

Proper Python data structure for real-time analysis?

Community,
Objective: I'm running a Pi project (i.e. Python) that communicates with an Arduino to get data from a load cell once a second. What data structure should I use to log this data (and do real-time analysis on it) in Python?
I want to be able to do things like:
Slice the data to get the value of the last logged datapoint.
Slice the data to get the mean of the datapoints for the last n seconds.
Perform a regression on the last n data points to get g/s.
Remove from the log data points older than n seconds.
Current Attempts:
Dictionaries: I have appended a new key with a rounded time to a dictionary (see below), but this makes slicing and analysis hard.
import time

log = {}

def log_data():
    log[round(time.time(), 4)] = read_data()  # read_data() returns the latest reading
Pandas DataFrame: this was the one I was hoping for, because it makes time-series slicing and analysis easy, but this (How to handle incoming real time data with python pandas) seems to say it's a bad idea. I can't follow their solution (i.e. storing in a dictionary and df.append()-ing in bulk every few seconds) because I want my rate calculations (regressions) to be in real time.
This question (ECG Data Analysis on a real-time signal in Python) seems to have the same problem as I did, but with no real solutions.
Goal:
So what is the proper way to handle and analyze real-time time-series data in Python? It seems like something everyone would need to do, so I imagine there has to be pre-built functionality for this?
Thanks,
Michael
To start, I would question two assumptions:
You mention in your post that the data comes in once per second. If you can rely on that, you don't need the timestamps at all -- finding the last N data points is exactly the same as finding the data points from the last N seconds.
You have a constraint that your summary data needs to be absolutely 100% real time. That may make life more complicated -- is it possible to relax that at all?
Anyway, here's a very naive approach using a list. It satisfies your needs. Performance may become a problem depending on how many of the previous data points you need to store.
Also, you may not have thought of this, but do you need the full record of past data? Or can you just drop stuff?
from statistics import mean

data = []

# new data comes in as a (timestamp, value) tuple
new_observation = (timestamp, value)
data.append(new_observation)

# Slice the data to get the value of the last logged datapoint.
data[-1]

# Slice the data to get the mean of the datapoints for the last n seconds.
mean(v for t, v in data if current_time - t < n)

# Perform a regression on the last n data points to get g/s.
regression_function(data[-n:])   # regression_function is whatever you choose

# Remove from the log data points older than n seconds.
data = [(t, v) for t, v in data if current_time - t < n]
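If you only ever need the last n seconds of data, another option (a sketch of my own, assuming roughly one sample per second) is a collections.deque with a maxlen, which drops old points automatically and keeps every operation cheap:
from collections import deque
from statistics import mean
import time

WINDOW_SECONDS = 60                    # keep roughly the last minute of samples
log = deque(maxlen=WINDOW_SECONDS)     # old entries fall off the left automatically

def record(value):
    log.append((time.time(), value))

def last_value():
    return log[-1][1] if log else None

def mean_last(n_seconds):
    cutoff = time.time() - n_seconds
    recent = [v for t, v in log if t >= cutoff]
    return mean(recent) if recent else None

def rate_g_per_s(n_points):
    # Least-squares slope over the last n points, i.e. grams per second.
    pts = list(log)[-n_points:]
    if len(pts) < 2:
        return None
    ts = [t for t, _ in pts]
    vs = [v for _, v in pts]
    t_mean, v_mean = mean(ts), mean(vs)
    num = sum((t - t_mean) * (v - v_mean) for t, v in pts)
    den = sum((t - t_mean) ** 2 for t in ts)
    return num / den if den else None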

Using RRD4J and XML to create graph?

I have a homework assignment for which I need to research RRD4J and create a graph using the RRD4J library. My teacher only gave me an XML file. So, can I use the XML with RRD4J to draw a graph, and how?
Without much more information it is difficult to answer your question. These general steps might help you understand what you need to do to solve the problem:
1) Depending on the granularity you would like to have (and the data frequency you have in the XML file), create the RRD.
For example if you would like to have hourly and daily data, your archive creation should look like:
RrdDef rrdDef = new RrdDef(fileName, 60); // 60 is the step: data is expected to arrive at 60-second intervals
rrdDef.setStartTime(...); // set the initial timestamp here (must be a 10-digit epoch timestamp)
// DATASOURCE_NAME is the name of your variable in the time series; DsType is the kind of data
// (always increasing, increasing and decreasing, etc.); 120 is the heartbeat (if no data arrives
// within 120 seconds, NaN is stored); the last two arguments are the min and max values
rrdDef.addDatasource(DATASOURCE_NAME, DsType.GAUGE, 120, 0, Double.NaN);
rrdDef.addArchive(ConsolFun.AVERAGE, 0.99, 1, 60);   // average, xff 0.99, 1 step per row, 60 rows
rrdDef.addArchive(ConsolFun.AVERAGE, 0.99, 24, 240); // average, xff 0.99, 24 steps per row, 240 rows
RrdDb rrdDb = new RrdDb(rrdDef);
rrdDb.close();
(All of these configurations come from a detailed analysis of the time series you are working with; it's really hard to predict anything without looking at the data.)
2) Parse the XML file using SAX (I guess this will be better, since after inserting the values into the RRD database you won't need to access the parsed values anymore).
3) While parsing XML, update RRD
RrdDb rrdDb = new RrdDb(fileName);
Sample sample = rrdDb.createSample();
sample.setAndUpdate(timestamp+":"+value);
rrdDb.close();
4) When all the data is inserted, generate some graphs (check the examples and options on the RRD4J website).
P.S. Consider using the integration with MongoDB, which outperforms plain RRD4J storage many times over; there is also an example of this on their page.
Hope this helped :-)
Is this XML a template?
http://rrd4j.googlecode.com/git/javadoc/org/rrd4j/core/XmlTemplate.html
The best configuration for RRD4J is the File backend with version 2 RRD files.