Tensorflow runs "Running per image evaluation" indefinitly - python-2.7

I am running my first tensorflow job (object detection training) right now, using the tensorflow API. I am using the ssd mobilenet network from the modelzoo. I used the >>ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync.config<< as a config-file and as a fine tune checkpoint the >>ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03<< checkpoint.
I started my training with the following command:
PIPELINE_CONFIG_PATH='/my_path_to_tensorflow/tensorflow/models/research/object_detection/models/model/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync.config'
MODEL_DIR='/my_path_to_tensorflow/tensorflow/models/research/object_detection/models/model/train'
NUM_TRAIN_STEPS=200000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python object_detection/model_main.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--alsologtostderr
No coming to my problem, I hope the community can help me with. I trained the network over night and it trained for 1400 steps and then started evaluating per image, which was running the entire night. Next morning I saw, that network only evaluated and the training was still at 1400 steps. You can see part of the console output in the image below.
Console output from evaluation
I tried to take control by using the eval config parameter in the config file.
eval_config: {
metrics_set: "coco_detection_metrics"
use_moving_averages: false
num_examples: 5000
}
I added max_evals = 1, because the documentation says that I can limit the evaluation like this. I also changend eval_interval_secs = 3600 because I only wanted one eval every hour. Both options had no effect.
I also tried other config-files from the modelzoo, with no luck. I searched google for hours, only to find answers which told me to change the parameters I already changed. So I am coming to stackoverflow to find help in this Matter.
Can anybody help me, maybe hat the same experience? Thanks in advance for all your help!
Environment information
$ pip freeze | grep tensor
tensorboard==1.11.0
tensorflow==1.11.0
tensorflow-gpu==1.11.0
$ python -V
Python 2.7.12

I figured out a solution for the problem. The problem with tensorflow 1.10 and after is, that you can not set checkpoint steps or checkpoint secs in the config file like before. By default tensorflow 1.10 and after saves a checkpoint every 10 min. If your hardware is not fast enough and you need more then 10 min for evaluation, you are stuck in a loop.
So to change the time steps or training steps till a new checkpoint is safed (which triggers the evaluation), you have to navigate to the model_main.py in the following folder:
tensorflow/models/research/object_detection/
Once you opened model_main.py, navigate to line 62. Here you will find
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)
To trigger the checkpoint save after 2500 steps for example, change the entry to this:
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir,save_checkpoints_steps=2500).
Now the model is saved every 2500 steps and afterwards an evaluation is done.
There are multiple parameters you can pass through this option. You can find a documentation here:
tensorflow/tensorflow/contrib/learn/python/learn/estimators/run_config.py.
From Line 231 to 294 you can see the parameters and documentation.
I hope I can help you with this and you don't have to look for an answer as long as I did.

Could it be that evaluation takes more than 10 minutes in your case? It could be that since 10 minutes is the default interval for making evaluation, it keeps evaluating.
Unfortunately, the current API doesn't easily support altering the time interval for evaluation.
By default, evaluation happens after every checkpoint saving, which by default is set to 10 minutes.
Therefore you can change the time for saving a checkpoint by specifying save_checkpoint_secs or save_checkpoint_steps as an input to the instance of MonitoredSession (or MonitoredTrainingSession). Unfortunately and best to my knowledge, these parameters are not available to be set as flags to model_main.py or from the config file. Therefore, you can either change their value by hard coding, or exporting them out so that they will be available.
An alternative way, without changing the frequency of saving a checkpoint, is modifying the evaluation frequency which is specified as throttle_secs to tf.estimator.EvalSpec.
See my explanation here as to how to export this parameter to model_main.py.

Related

Cannot launch Vertex Hypertune Job (Google Cloud Platform)

I am trying to use Vertex Hypertune from the google cloud console. I filled the forms to indicate my dataset, python package, compute resources, and so on.
Everything seems good up until the point where I submitted the job, I got an instant error (so probably it is not an issue from my code because there is no way it has even been run):
Unable to parse `training_pipeline.training_task_inputs` into hyperparameter tuning task `inputs` defined in the file gs://google-cloud-aiplatform/schema/trainingjob/definition/hyperparameter_tuning_task_1.0.0.yaml
I am really confused about why I get this error as I launched a training job without hyperparameter tuning with the same arguments and it worked just fine.
Any help would be truly appreciated
Note: I used a tabular dataset that comes from a BigQuery table (loaded with the dataset functionality). Default parameters were chosen for this dataset.
I picked the tensorflow 1.15 pre-built container and added my python code in an archive .tar.gz (generated with python setup.py sdist).
I configured only one hyperparameter (learning rate, double, between 0.001 and 0.1 to maximize 'accuracy', as declared in hypertune) and picked the lightest standard machine (n1-standard-4).
EDIT: Following the comment of #Jofre, it works now so it was probably originated by a temporary UI bug.

TSFRESH library for python is taking way too long to process

I came across the TSfresh library as a way to featurize time series data. The documentation is great, and it seems like the perfect fit for the project I am working on.
I wanted to implement the following code that was shared in the quick start section of the TFresh documentation. And it seems simple enough.
from tsfresh import extract_relevant_features
feature_filtered_direct=extract_relevant_features(result,y,column_id=0,column_sort=1)
My data included 400 000 rows of sensor data, with 6 sensors each for 15 different id's. I started running the code, and 17 hours later it still had not finished. I figured this might be too large of a data set to run through the relevant feature extractor, so I trimmed it down to 3000, and then further down to 300. None of these actions made the code run under an hour, and I just ended up shutting it down after an hour or so of waiting. I tried the standard feature extractor as well
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
Along with trying the example dataset that TSfresh presents on their quick start section. Which includes a dataset that is very similar to my orginal data, with about the same amount of data points as I reduced to.
Does anybody have any experience with this code? How would you go about making it work faster? I'm using Anaconda for python 2.7.
Update
It seems to be related to multiprocessing. Because I am on windows, using the multiprocess code requires to be protected by
if __name__ == "__main__":
main()
Once I added
if __name__ == "__main__":
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
To my code, the example data worked. I'm still having some issues with running the extract_relevant_features function and running the extract features module on my own data set. It seems as though it continues to run slowly. I have a feeling its related to the multiprocess freeze as well, but without any errors popping up its impossible to tell. Its taking me about 30 minutes to run to extract features on less than 1% of my dataset.
which version of tsfresh did you use? Which OS?
We are aware of the high computational costs of some feature calculators. There is less we can do about it. In the future we will implement some tricks like caching to increase the efficiency of tsfresh further.
Have you tried calculating only the basic features by using the MinimalFeatureExtractionSettings? It will only contain basic features such as Max, Min, Median and so on but should run way, way faster.
from tsfresh.feature_extraction import MinimalFeatureExtractionSettings
extracted_features = extract_features(timeseries, column_id="id", column_sort="time", feature_extraction_settings = MinimalFeatureExtractionSettings())
Also it is probably a good idea to install the latest version from the repo by pip install git+https://github.com/blue-yonder/tsfresh. We are actively developing it and the master should contain the newest and freshest version ;).
Syntax has changed slightly (see docs), the current approach would be:
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=MinimalFCParameters())
Or
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=EfficientFCParameters())
Since version 0.15.0 we have improved our bindings for Apache Spark and dask.
It is now possible to use the tsfresh feature extraction directly in your usual dask or Spark computation graph.
You can find the bindings in tsfresh.convenience.bindings with the documentation here. For example for dask, it would look something like this (assuming df is a dask.DataFrame, for example the robot failure dataframe from our example)
df = df.melt(id_vars=["id", "time"],
value_vars=["F_x", "F_y", "F_z", "T_x", "T_y", "T_z"],
var_name="kind", value_name="value")
df_grouped = df.groupby(["id", "kind"])
features = dask_feature_extraction_on_chunk(df_grouped, column_id="id", column_kind="kind",
column_sort="time", column_value="value",
default_fc_parameters=EfficientFCParameters())
# or any other parameter set
Using either dask or Spark (or anything alike) might help you with very large data - both for memory as well as speed (as you can distribute the work over multiple machines). Of course, we still support the usual distributors (docu) as before.
Additional to that, it is also possible to run tsfresh together with a task orchestration system, such as luigi. You can create a task to
* read in the data for only one id and kind
* extract the features
* write out the result to disk
and let luigi handle all the rest. You may find a possible implementation of this here on my blog.
I've found, at least on a multicore machine, that a better way to distribute extract_features calculation over independent subgroups (identified by the column_id value) is through joblib.Parallel with the Loky backend.
For example, you define your features extraction function on a single value of columnd_id and you apply it
from joblib import Parallel, delayed
def map_extract_features(df):
return extract_features(
timeseries_container=df,
default_fc_parameters=settings,
column_id="ID",
column_sort="DATE",
n_jobs=1,
disable_progressbar=True
).reset_index().rename({"index":"ID_CONTO"}, axis=1)
out = Parallel(n_jobs=cpu_count()-1)(
delayed(map_extract_features)(
my_dataframe[my_dataframe["ID"]==id]
) for id in tqdm(my_dataframe["ID"].unique())
)
This method takes way less memory than specifying column_id directly in the extract_features function.

Amazon Mechanical Turks Qualification Multiple Locals

I am trying to define a worker qualification for a HIT via command line tool.
The requirements to workers are
Locale: US, CA
Number of accomplished HITs >= 30
Acceptance rate >= 75%
The following is the part of external_hit.properties file
# Worker_PercentAssignmentsApproved > 75%
qualification.1:000000000000000000L0
qualification.comparator.1:greaterthan
qualification.value.1:75
qualification.private.1:false
# Worker_Locale
qualification.2:00000000000000000071
qualification.comparator.2:in
qualification.LocaleValue.2.Country.1=US
qualification.LocaleValue.2.Country.2=CA
# Worker_NumberHITsApproved > 30
qualification.3:00000000000000000040
qualification.comparator.3:GreaterThan
qualification.value.3:30
The problem is I am not sure about the syntax, especially about this part, I made up this part.
qualification.LocaleValue.2.Country.1=US
qualification.LocaleValue.2.Country.2=CA
I didn't find any example in command line tool format with multiple locales.
I appreciate if you could check the syntax.
The format is a set of comma-separated values:
qualification.LocaleValue.2.Country=US,CA
However, AWS does not update the command line tools (and the last time I looked, they appeared to be officially unsupported), so this feature may not work at all, thus requiring you to go through the web API.

Weka - Measuring testing time

I'm using Weka 3.6.8 to carry out some machine learning and I'm want to find the 'time taken to test model on training/testing data'. When I test a predictive model on evaluation data, this parameter seems to be missing. Has this feature been removed from Weka or is it just a setting I'm missing? All I seem to be able to find is the time taken to build the actual predictive model. (I've also checked the Weka Manual but can't find anything)
Thanks in advance
That feature was added to 3.7.7, you need to upgrade. You should be able to get this data by running the test on the command line with the -T parameter.

Profiling for wall-time on Linux

I have an application that I want to profile wrt how much time is spent in various activities. Since this application is I/O intensive, I want to get a report that will summarize how much time is spent in every library/system call (wall time).
I've tried oprofile, but it seems it gives time in terms of Unhalted CPU cycles (thats cputime, not real time)
I've tried strace -T, which gives wall time, but the data generated is huge and getting the summary report is difficult (and awk/py scripts exist for this ?)
Now I'm looking upto SystemTap, but I don't find any script that is close enough and can be modified, and the onsite tutorial didn't help much either. I am not sure if what I am looking for can be done.
I need someone to point me in the right direction.
Thanks a lot!
Judging from this commit, the recently released strace 4.9 supports this with:
strace -w -c
They call it "syscall latency" (and it's hard to see from the manpage alone that's what -w does).
Are you doing this just out of measurement curiosity, or because you want to find time-drains that you can fix to make it run faster?
If your goal is to make it run as fast as possible, then try random-pausing.
It doesn't measure anything, except very roughly.
It may be counter-intuitive, but what it does is pinpoint the code that will result in the greatest speed-up.
See the fntimes.stp systemtap sample script. https://sourceware.org/systemtap/examples/index.html#profiling/fntimes.stp
The fntimes.stp script monitors the execution time history of a given function family (assumed non-recursive). Each time (beyond a warmup interval) is then compared to the historical maximum. If it exceeds a certain threshold (250%), a message is printed.
# stap fntimes.stp 'kernel.function("sys_*")'
or
# stap fntimes.stp 'process("/path/to/your/binary").function("*")'
The last line of that .stp script demonstrates the way to track time consumed in a given family of functions
probe $1.return { elapsed = gettimeofday_us()-#entry(gettimeofday_us()) }