Evaluating previous checkpoints - TensorBoard

I'm fairly new to TensorFlow and am experimenting with BERT in TensorFlow. I notice that the example scripts store checkpoints every 1000 epochs, saving .data, .index and .meta files for each checkpoint. They also create an eval folder containing an events.out.tfevents.* file.
They store an eval_results.txt file containing the evaluation results for the latest checkpoint.
I want to look at the eval results for previous checkpoints, both to see progress and to check whether I am overfitting.
I had some issues getting TensorBoard running. Is this kind of data stored in the .meta or .index files? Do I need TensorBoard to see it, or are there other ways? Or do I have to rerun predictions manually by loading each individual checkpoint?
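For reference, my understanding is that the per-checkpoint eval scalars live in the events.out.tfevents.* file in the eval folder, not in the .meta or .index files (those hold the graph and the variable index). You don't strictly need the TensorBoard UI to read them; below is a minimal sketch using the event-processing API that ships with TensorBoard (the path and the eval_accuracy tag name are assumptions, list the tags to see what your run actually logged):

# Sketch: read eval scalars straight from the events file (no TensorBoard UI needed).
# The path and the 'eval_accuracy' tag are assumptions about your run.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("./output_dir/eval")   # folder containing events.out.tfevents.*
ea.Reload()                                  # load everything written so far
print(ea.Tags()["scalars"])                  # discover which scalar tags exist

for event in ea.Scalars("eval_accuracy"):    # assumed tag name
    print(event.step, event.value)           # metric value per evaluated checkpoint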

How to use Apache Beam to process historic time series data?

I have an Apache Beam pipeline that processes multiple time series in real time. Deployed on GCP Dataflow, it combines multiple time series into windows and calculates aggregates, etc.
I now need to perform the same operations over historic data (the same (multiple) time series data) stretching all the way back to 2017. How can I achieve this using Apache Beam?
I understand that I need to use Beam's windowing to calculate the aggregates, but it should accept data from two years back onwards.
Effectively, I need the data as it would have been available had I deployed the same pipeline two years ago. This is needed for testing/model-training purposes.
That sounds like a perfect use case for Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as the events have timestamps. Without additional context, I think you will need an explicit step in your pipeline that assigns custom timestamps (from 2017), which you will need to extract from the data. To do this you can probably use either:
context.outputWithTimestamp() in your DoFn;
the WithTimestamps PTransform.
You might also need to configure the allowed timestamp skew if you run into timestamp-ordering issues. A minimal Python-SDK sketch of the same idea follows the links below.
See:
outputWithTimestamp example: https://github.com/apache/beam/blob/efcb20abd98da3b88579e0ace920c1c798fc959e/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowingTest.java#L248
documentation for WithTimestamps: https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/WithTimestamps.html#of-org.apache.beam.sdk.transforms.SerializableFunction-
similar question: Assigning to GenericRecord the timestamp from inner object
another question that may have helpful details: reading files and folders in order with apache beam
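For completeness, here is a rough sketch of the same idea in the Python SDK (the Java equivalents are the outputWithTimestamp/WithTimestamps calls linked above); the record layout, field names and hourly window size are assumptions about your data:

# Sketch (Python SDK): assign historic event timestamps, then window and aggregate.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

def add_timestamp(record):
    # Use the event time stored in the data (e.g. a 2017 unix timestamp),
    # not the processing time.
    return TimestampedValue(record, record["event_time"])   # assumed field name

with beam.Pipeline() as p:
    _ = (
        p
        | "Read" >> beam.Create(
            [{"series": "a", "value": 1.0, "event_time": 1483228800}])  # 2017-01-01
        | "AssignEventTime" >> beam.Map(add_timestamp)
        | "Window" >> beam.WindowInto(FixedWindows(60 * 60))            # hourly windows
        | "KeyBySeries" >> beam.Map(lambda r: (r["series"], r["value"]))
        | "Aggregate" >> beam.CombinePerKey(sum)
    )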

TensorFlow: repeated running of a fully connected model

Question:
How can I "rerun" TensorFlow code that depends on queues? Is the best way really to close the session, build the model again, load the variables and run?
Motivation:
In a still-unanswered question I asked how, in a fully connected model, one could interleave actions (such as generating cumulative summaries, calculating AUC on test data, etc.) with training that reads data from TensorFlow TFRecords files and tf.Queues.
For example, tf.train.string_input_producer returns a filename_queue. As part of the constructor it takes a "num_epochs" arg. Instead of setting "num_epochs" to 100, I'm thinking of just setting "num_epochs" to 2 to generate summaries every other epoch. This requires running the same code 50 times, hence the need for an efficient answer to the above.
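To make this concrete, the rerun loop I'm imagining looks roughly like the TF1-style sketch below; build_model(), the TFRecords file name and the checkpoint path are placeholders:

# TF1-style sketch: rebuild the graph, restore variables, run 2 epochs, save, repeat.
import tensorflow as tf

CHECKPOINT = "./model.ckpt"

def run_two_epochs(first_run):
    graph = tf.Graph()
    with graph.as_default():
        filename_queue = tf.train.string_input_producer(
            ["train.tfrecords"], num_epochs=2)
        train_op = build_model(filename_queue)   # placeholder: builds the FC model
        saver = tf.train.Saver()
        init_ops = [tf.global_variables_initializer(),
                    tf.local_variables_initializer()]  # local init needed for num_epochs

    with tf.Session(graph=graph) as sess:
        sess.run(init_ops)
        if not first_run:
            saver.restore(sess, CHECKPOINT)      # reload weights from the previous run
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        try:
            while not coord.should_stop():
                sess.run(train_op)
        except tf.errors.OutOfRangeError:
            pass                                 # the queue is exhausted after 2 epochs
        finally:
            coord.request_stop()
            coord.join(threads)
        saver.save(sess, CHECKPOINT)

for i in range(50):                              # 50 runs x 2 epochs = 100 epochs
    run_two_epochs(first_run=(i == 0))
    # ...interleave summary generation, AUC on test data, etc. here...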

How to store daily builds in Amazon S3 cost-effectively?

I'm trying to make a daily build machine using EC2 and store the daily releases in S3.
The releases are complete disk images, so they are very bloated (300+ MB total: 95% OS kernel/RFS/libraries, 5% actual software), and they change very little over time.
Ideally, with good compression, the storage cost should be close to O(t), where t is time.
But if I simply add those files to S3 every day, with the version number as part of the file name, or with the same file name each time but with the S3 bucket versioned, the cost would be O(t^2).
That's because, according to this, all versions take up space and I'm charged for the space a new version takes from the moment it is created.
Glacier is cheaper but still O(t^2).
Any suggestions?
Basically, what you're looking for is an incremental file-level backup (i.e. only back up things that change), where you rebuild the current state by taking a full backup and applying the deltas (i.e. the increments).
If you need to use the latest image, you probably need to do incrementals plus keep the latest image. You also probably want to do a full backup from time to time to reduce the time it takes to rebuild from the incrementals (and you are going to need to keep some sort of metadata associated with the backups).
So to sum it up: what you are describing is possible, you just need to do extra work beyond pushing the image. Presumably you have a build process that generates the image, and the extra steps can be inserted between generation and upload. The restore process is going to be more complicated than it is currently.
To get you started, look at binary diff tools like bsdiff/bspatch or xdelta. You could generate the delta and back up only the delta. The image is also compressed, so if you diff the compressed versions you will not get very far; you probably want to diff the uncompressed files. Another way to look at it is to do the diff before generating an image and pick up only the files that changed (probably more complex).
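For example, a rough sketch of the delta-then-upload step using the xdelta3 CLI and boto3 (the bucket name, key layout and file paths are made up):

# Sketch: store only a binary delta of each day's image in S3.
import subprocess
import boto3

BUCKET = "my-daily-builds"   # hypothetical bucket

def upload_delta(previous_image, new_image, build_id):
    delta_path = "%s.vcdiff" % build_id
    # xdelta3 -e: encode the delta that turns previous_image into new_image
    subprocess.run(["xdelta3", "-e", "-s", previous_image, new_image, delta_path],
                   check=True)
    boto3.client("s3").upload_file(delta_path, BUCKET, "deltas/" + delta_path)

def restore_image(base_image, delta_path, output_image):
    # xdelta3 -d: apply the stored delta to the base image to rebuild the release
    subprocess.run(["xdelta3", "-d", "-s", base_image, delta_path, output_image],
                   check=True)

You would still push a full image every so often, so a restore never has to apply an overly long chain of deltas.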

How does Hadoop calculate physical memory and virtual memory during job execution?

I have a few queries related to the counters Hadoop uses to display memory usage.
A MapReduce job executed on a cluster gives me the counter values mentioned below. The input file used is only a few KB, but these counters show 35 GB and 420 GB of usage.
PHYSICAL_MEMORY_BYTES=35110662144
VIRTUAL_MEMORY_BYTES=420121841664
For a different job on the same input file, it shows 309 MB (physical) and 3 GB (virtual) usage:
PHYSICAL_MEMORY_BYTES=309526528
VIRTUAL_MEMORY_BYTES=3435827200
The first job is more CPU-intensive than the other and creates more objects, but its reported usage still seems very high.
So I just wanted to know how this memory usage is calculated. I tried going through some posts and got an overview from the link below, which seems to be the requirements task describing these variables (https://issues.apache.org/jira/i#browse/MAPREDUCE-1218), but I couldn't find how they are calculated. It does give me an idea of how these values are passed to the JobTracker, but no information on how they are determined. So if someone could give some insight on this, it would be really helpful.
You can find a few references here and here. The second link in particular relates to map and reduce jobs and how slots are decided based on memory allocations. Happy learning!
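As far as I understand, on Linux these counters come from the task walking its own process tree under /proc and summing the vsize and rss values from each /proc/<pid>/stat entry (rss is in pages, so it is multiplied by the page size). A simplified, single-process illustration of that reading:

# Illustration only: read vsize and rss from /proc/<pid>/stat, as documented in proc(5).
# (As I understand it, Hadoop aggregates the same numbers over a task's process tree.)
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def memory_bytes(pid):
    with open("/proc/%d/stat" % pid) as f:
        stat = f.read()
    # Fields after the parenthesised command name, as documented in proc(5).
    fields = stat[stat.rindex(")") + 2:].split()
    vsize = int(fields[20])            # field 23: virtual memory size in bytes
    rss_pages = int(fields[21])        # field 24: resident set size in pages
    return vsize, rss_pages * PAGE_SIZE

virtual_bytes, physical_bytes = memory_bytes(os.getpid())
print("VIRTUAL_MEMORY_BYTES=%d" % virtual_bytes)
print("PHYSICAL_MEMORY_BYTES=%d" % physical_bytes)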

Building compatible datasets for Weka for large, evolving data

I have a largish dataset that I am using Weka to explore. It goes like this: today I will analyze as much data as I can, and create a trained classifier. I'll save this model as a file. Then tomorrow I will acquire a new batch of data, and want to use the saved model to predict the class for the new data. This repeats every day. Eventually I will update the saved model, but for now assume that it is static.
Due to the size and frequency of this task, I want to run this automatically, which means the command line or similar. However, my problem exists in the Explorer, as well.
My question has to do with the fact that, as my dataset grows, the list of possible labels for attributes also grows. Weka says such attribute lists cannot change, or the training set and test set are said to be incompatible (see: http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F). But in my world there is no way that I could possibly know today all the attribute labels that I will stumble across next week.
To rectify the situation, it is suggested that I run batch filtering (http://weka.wikispaces.com/How+do+I+generate+compatible+train+and+test+sets+that+get+processed+with+a+filter%3F). Okay, that appears to mean that I need to re-build my model with the refiltered training data each day.
At this point the whole thing seems difficult enough that I fear I am making a horrible, simple newbie mistake, and so I ask for help.
DETAILS:
The model was created by
java -Xmx1280M weka.classifiers.meta.FilteredClassifier ^
-t .\training.arff -d .\my.model -c 15 ^
-F "weka.filters.supervised.attribute.Discretize -R first-last" ^
-W weka.classifiers.trees.J48 -- -C 0.25 -M 2
Naively, to predict I would try:
java -Xmx1280M weka.core.converters.DatabaseLoader ^
-url jdbc:odbc:(database) ^
-user (user) ^
-password (password) ^
-Q "exec (my_stored_procedure) '1/1/2012', '1/2/2012' " ^
> .\NextDay.arff
And then:
java -Xmx1280M weka.classifiers.trees.J48 ^
-T .\NextDay.arff ^
-l .\my.model ^
-c 15 ^
-p 0 ^
> .\MyPredictions.txt
this yields:
java.lang.Exception: training and test set are not compatible
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1035)
at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
at weka.classifiers.trees.J48.main(J48.java:948)
A related question is asked at kdkeys.net/training-and-test-set-are-not-compatible-weka/
An associated problem is that the command-line version of the database extraction requires generating a temporary .arff file, and it appears that JDBC-generated ARFF files do not handle "date" data correctly. My database generates dates in the ISO-8601 format "yyyy-MM-dd'T'HH:mm:ss", but both the Explorer and the .arff files generated from JDBC data represent these as type NOMINAL. As a result, the list of labels for date attributes in the header is very, very long and is never the same from dataset to dataset.
I'm not a java or python programmer, but if that's what it takes, I'll go buy some books! Thanks in advance.
I think you can use incremental classifiers, but only a few classifiers support this option. Classifiers like SMO and J48 won't support it, so you will have to use some other classifier.
To learn more, visit:
http://weka.wikispaces.com/Classifying+large+datasets
http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka
There is a bigger problem with your plan too, it seems. If you have data from day 1 and use it to build a model, then apply it to data from day n that has new, never-before-seen class labels, it will be impossible to predict the new labels because there is no training data for them. Similarly, if you have new attributes, it will be impossible to use those for classification because none of your training data has them to associate with the class labels.
Thus, if you want to use a model trained on data with only a subset of the new data's attributes/classes, then you might as well filter the new data to remove the new classes/attributes, since they wouldn't be used even if you could run Weka without errors on two dissimilar datasets.
If it's not in your training set, exclude it from your test set. Then everything should work. If you need to be able to test/predict on it, then you need to retrain a new model that has examples of the new classes/attributes.
Doing this in your environment might require manually querying data out of the database into ARFF files, so as to pull out only the attributes/classes that were in the training set. Look into SQL and any major scripting language (e.g. Perl, Python) to do this without much fuss; a rough sketch of the filtering step is shown below.
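For example, in Python (the file names, CSV export format and class-column index are assumptions; adapt it to however you extract your data):

# Rough sketch: drop rows whose class label never appeared in the training data.
import csv

CLASS_COLUMN = 14                  # 0-based index of the class attribute (attribute 15)
TRAINING_EXPORT = "training.csv"   # hypothetical CSV dump of the training data
NEW_DATA = "NextDay.csv"
FILTERED = "NextDay_filtered.csv"

# Collect the class labels the model was trained on.
with open(TRAINING_EXPORT, newline="") as f:
    reader = csv.reader(f)
    next(reader)                   # skip header
    known_labels = {row[CLASS_COLUMN] for row in reader}

# Keep only rows whose class label already exists in the training set.
with open(NEW_DATA, newline="") as src, open(FILTERED, "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # copy the header row
    for row in reader:
        if row[CLASS_COLUMN] in known_labels:
            writer.writerow(row)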
The university that maintains Weka also created MOA (Massive Online Analysis) to analyse and solve this kind of problem. All of its classifiers are updatable, and you can compare classifier performance over time on your data stream. It also allows you to detect model change (concept drift/shift) and optimize (i.e. limit) your data window over time (a forget-old-data mechanism).
Once you're done testing and tuning with MOA, you can then use MOA classifiers from within Weka (there is an extension to enable it) and batch your whole process.