AttributeError: Can't get attribute 'TunerInternal._validate_overwrite_trainable' on <module 'ray.tune.impl.tuner_internal' - ray

When I try to run a basic Ray model on our server, the Tune code breaks. I don't understand the issue and I haven't found any useful information on the internet. Can anyone help me with it? FYI, I am able to run it properly when I use the local Ray cluster that I set up on my local machine.
AttributeError: Can't get attribute 'TunerInternal._validate_overwrite_trainable' on <module 'ray.tune.impl.tuner_internal' from '/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py'>
import ray
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer
trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=2,
        # Whether to use GPU acceleration.
        use_gpu=False,
    ),
    label_column="y",
    num_boost_round=20,
    params={
        # XGBoost specific params
        "objective": "binary:logistic",
        # "tree_method": "gpu_hist",  # uncomment this to use GPUs.
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": traindata_ray},
    # preprocessor=preprocessor,
)
result = trainer.fit()
print(result.metrics)
Reference: https://docs.ray.io/en/latest/train/train.html

This is likely due to a ray version mismatch. _validate_overwrite_trainable is available in nightly wheels but not in the latest 2.2.0 release.
Can you verify that you are using the same version of ray across your local machine and cluster?
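A quick way to confirm the mismatch (a minimal sketch; the cluster address and the pinned version below are assumptions, adjust them to your setup):

import ray

# Version of the locally installed package
print(ray.__version__)

# Connecting to the cluster should also surface a mismatch, since Ray
# normally refuses to connect when driver and cluster versions differ.
ray.init(address="auto")

Then pin the same release on both the local machine and the cluster, e.g. pip install -U "ray[tune]==2.2.0", or install the same nightly wheel everywhere.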

Related

How to pass the experiment configuration to a SagemakerTrainingOperator while training?

Idea:
Use SageMaker experiments and trials to log the training parameters and artifacts while using MWAA as the pipeline orchestrator.
I am using training_config to create the dict that passes the training configuration to the TensorFlow estimator, but there is no parameter to pass the experiment configuration:
tf_estimator = TensorFlow(entry_point='train_model.py',
                          source_dir=source,
                          role=sagemaker.get_execution_role(),
                          instance_count=1,
                          framework_version='2.3.0',
                          instance_type=instance_type,
                          py_version='py37',
                          script_mode=True,
                          enable_sagemaker_metrics=True,
                          metric_definitions=metric_definitions,
                          output_path=output)

model_training_config = training_config(
    estimator=tf_estimator,
    inputs=input,
    job_name=training_jobname,
)
training_task = SageMakerTrainingOperator(
    task_id=test_id,
    config=model_training_config,
    aws_conn_id="airflow-sagemaker",
    print_log=True,
    wait_for_completion=True,
    check_interval=60,
)
You can use the experiment_config argument of estimator.fit. A more detailed example can be found here.
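For instance (a hedged sketch; the experiment, trial, and display names are placeholders, and this assumes you call fit directly rather than going through the Airflow operator):

tf_estimator.fit(
    inputs=input,
    experiment_config={
        "ExperimentName": "my-experiment",        # placeholder
        "TrialName": "my-trial",                  # placeholder
        "TrialComponentDisplayName": "training",  # placeholder
    },
)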
The only way that I found right now is to use the CreateTrainingJob API (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#sagemaker-CreateTrainingJob-request-RoleArn). A few caveats:
I am not sure if this will work with the bring-your-own-script method, e.g. with a TensorFlow estimator.
It does work with a build-your-own-container approach.
Using the CreateTrainingJob API I created the config, which in turn includes all the needed configs - training, experiment, algorithm etc. - and passed that to SageMakerTrainingOperator, as sketched below.
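A minimal sketch of that approach (the ExperimentConfig key names follow the CreateTrainingJob API reference; the experiment, trial, and display names are placeholders):

model_training_config = training_config(
    estimator=tf_estimator,
    inputs=input,
    job_name=training_jobname,
)
# Inject the experiment configuration into the CreateTrainingJob-style dict
model_training_config["ExperimentConfig"] = {
    "ExperimentName": "my-experiment",
    "TrialName": "my-trial",
    "TrialComponentDisplayName": "training",
}

training_task = SageMakerTrainingOperator(
    task_id=test_id,
    config=model_training_config,
    aws_conn_id="airflow-sagemaker",
)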

Error when running Pytest with Delta Lake Tables

I am working in the VDI of a company and they use their own artifactory for security reasons.
Currently I am writing unit tests for a function that deletes entries from a Delta table. When I started, I received an error about unresolved dependencies, because my Spark session was configured to load jars from Maven. I was able to solve this issue by loading these jars locally from /opt/spark/jars. Now my code looks like this:
import os
import unittest

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col


class TestTransformation(unittest.TestCase):
    #classmethod
    def test_ksu_deletion(self):
        self.spark = SparkSession.builder\
            .appName('SPARK_DELETION')\
            .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")\
            .config("spark.jars", "/opt/spark/jars/delta-core_2.12-0.7.0.jar, /opt/spark/jars/hadoop-aws-3.2.0.jar")\
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")\
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")\
            .getOrCreate()
        os.environ["KSU_DELETION_OBJECT"] = "UNITTEST/"
        deltatable = DeltaTable.forPath(self.spark, "/projects/some/path/snappy.parquet")
        deltatable.delete(col("DATE") < get_current())  # get_current() is the project's own helper
However, I am getting the error message:
E py4j.protocol.Py4JJavaError: An error occurred while calling z:io.delta.tables.DeltaTable.forPath.
E : java.lang.NoSuchMethodError: org.apache.spark.sql.AnalysisException.<init>(Ljava/lang/String;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;)V
Do you have any idea what is causing this? I am assuming it has to do with the way I am configuring spark.sql.extensions and/or spark.sql.catalog, but to be honest, I am quite a newb in Spark.
I would greatly appreciate any hint.
Thanks a lot in advance!
Edit:
We are using Spark 3.0.2 (Scala 2.12.10). According to https://docs.delta.io/latest/releases.html, this should be compatible. Apart from the SparkSession, I trimmed down the subsequent code to
df = spark.read.parquet("Path/to/file.snappy.parquet")
and now I am getting the error message
java.lang.IncompatibleClassChangeError: class org.apache.spark.sql.catalyst.plans.logical.DeltaDelete has interface org.apache.spark.sql.catalyst.plans.logical.UnaryNode as super class
As I said, I am quite new to (Py)Spark, so please don't hesitate to mention things you consider completely obvious.
Edit 2: I checked the Python path I am exporting in the Shell before running the code and I can see the following:
Could this cause any problem? I don't understand why I do not get this error when running the code within pipenv (with spark-submit).
It looks like you're using an incompatible version of the Delta Lake library. 0.7.0 was for Spark 3.0, but you're using another version - either lower or higher. Consult the Delta releases page to find the mapping between Delta versions and the required Spark versions.
If you're using Spark 3.1 or 3.2, consider using the delta-spark Python package that will install all necessary dependencies, so you just import the DeltaTable class.
Update: Yes, this happens because of the conflicting versions - you need to remove the delta-spark and pyspark Python packages, and install pyspark==3.0.2 explicitly.
P.S. Also, look into the pytest-spark package that can simplify the specification of configuration for all tests. You can find examples of it + Delta here.
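If you do go the delta-spark route on Spark 3.1/3.2, a minimal sketch of the session setup looks like this (the version pair below is illustrative; check the Delta releases page for the exact match):

# pip install pyspark==3.1.2 delta-spark==1.0.1   (illustrative version pair)
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("SPARK_DELETION")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

# Pulls in the matching delta-core jars so they don't have to be managed by hand
spark = configure_spark_with_delta_pip(builder).getOrCreate()
deltatable = DeltaTable.forPath(spark, "/projects/some/path/snappy.parquet")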

Pytorch Lightning Tensorboard Logger Across Multiple Models

I'm relatively new to Lightning and Loggers vs manually tracking metrics. I am trying to train two distinct models and have their accuracy and loss plotted on the same charts in tensorboard (or any other logger) within Colab.
What I have right now is basically:
trainer1 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, progress_bar_refresh_rate=20, num_sanity_val_steps=0)
trainer2 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, progress_bar_refresh_rate=20, num_sanity_val_steps=0)
trainer1.fit(Model1, train_loader, val_loader)
trainer2.fit(Model2, train_loader, val_loader)
#Then later:
%load_ext tensorboard
%tensorboard --logdir lightning_logs/
What I'd like to see at this point are those logged metrics charted together on the same chart, any help would be appreciated. I've spent some time trying to toy with this but I'm a bit out of my depth on this, thank you!
The exact chart used for logging a specific metric depends on the key name you provide in the .log() call (it's a feature that Lightning inherits from TensorBoard itself):
def validation_step(self, batch, _):
    # This string decides which chart to use in the TB web interface
    #         vvvvvvvvv
    self.log('valid_acc', acc)
Just use the same string for both .log() calls and have both runs saved in the same parent directory:
logger1 = TensorBoardLogger(save_dir='lightning_logs/', name='model1')
logger2 = TensorBoardLogger(save_dir='lightning_logs/', name='model2')
If you run tensorboard --logdir ./lightning_logs pointing at the parent directory, you should be able to see both metrics in the same chart under the key valid_acc.
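For completeness, a minimal sketch of wiring those loggers into the two trainers (model1 and model2 are assumed to be LightningModule instances; everything else follows the question's setup):

from pytorch_lightning.loggers import TensorBoardLogger

logger1 = TensorBoardLogger(save_dir='lightning_logs/', name='model1')
logger2 = TensorBoardLogger(save_dir='lightning_logs/', name='model2')

trainer1 = pl.Trainer(logger=logger1, gpus=n_gpus, max_epochs=n_epochs)
trainer2 = pl.Trainer(logger=logger2, gpus=n_gpus, max_epochs=n_epochs)

trainer1.fit(model1, train_loader, val_loader)
trainer2.fit(model2, train_loader, val_loader)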

Google App Engine deferred.defer task not getting executed

I have a Google App Engine Standard Environment application that has been working fine for a year or more, that, quite suddenly, refuses to enqueue new deferred tasks using deferred.defer.
Here's the Python 2.7 code that is making the deferred call:
# Find any inventory items that reference the product, and change them too.
# because this could take some time, we'll do it as a deferred task, and only
# if needed.
if upd:
updater = deferredtasks.InvUpdate()
deferred.defer(updater.run, product_key)
My app.yaml file has the necessary bits to support deferred.defer:
- url: /_ah/queue/deferred
  script: google.appengine.ext.deferred.deferred.application
  login: admin

builtins:
- deferred: on
And my deferred task has logging in it so I should see it running when it does:
#-------------------------------------------------------------------------------
# DEFERRED routine that updates the inventory items for a particular product. Should be called
# when ANY changes are made to the product, because it should trigger a re-download of the
# inventory record for that product to the iPad.
#-------------------------------------------------------------------------------
class InvUpdate(object):
    def __init__(self):
        self.to_put = []
        self.product_key = None
        self.updcount = 0

    def run(self, product_key, batch_size=100):
        updproduct = product_key.get()
        if not updproduct:
            logging.error("DEFERRED ERROR: Product key passed in does not exist")
            return
        logging.info(u"DEFERRED BEGIN: beginning inventory update for: {}".format(updproduct.name))
        self.product_key = product_key
        self._continue(None, batch_size)
...
When I run this in the development environment on my development box, everything works fine. Once I deploy it to the App Engine server, the inventory updates never get done (i.e. the deferred task is not executed), and there are no errors (and in fact no other logging from the deferred task) in the log files on the server. I know that with the sudden push to get everybody onto Python 3 as quickly as possible, the deferred.defer library has been marked as not recommended because it only works with the Python 2.7 environment, and I planned on moving to task queues for this, but I wasn't expecting deferred.defer to suddenly stop working in the existing Python environment.
Any insight would be greatly appreciated!
I'm pretty sure you can't pass the method of an instance to the App Engine task queue, because that instance will not exist when your task runs, since the task will be running in a different process. I actually don't understand how your task ever worked when running remotely in the first place (and running locally is not an accurate representation of how things will run remotely).
Try changing your code to this:
if upd:
    deferred.defer(deferredtasks.InvUpdate.run_cls, product_key)
and then InvUpdate is the same but has a new function run_cls:
class InvUpdate(object):
    @classmethod
    def run_cls(cls, product_key):
        cls().run(product_key)
And I'm still in the process of migrating to Cloud Tasks myself, and my deferred tasks still work.
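Since moving off deferred.defer comes up in both the question and this answer, here is a minimal sketch of the push-queue alternative on the Python 2.7 runtime (the handler URL /tasks/inv_update and the queue name are placeholders; you would add a matching handler that re-creates InvUpdate and calls run):

from google.appengine.api import taskqueue

if upd:
    taskqueue.add(
        url='/tasks/inv_update',                        # hypothetical handler
        params={'product_key': product_key.urlsafe()},  # pass the key, not the object
        queue_name='default',
    )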

Use TensorBoard with Keras Tuner

I ran into an apparent circular dependency trying to use log data for TensorBoard during a hyper-parameter search done with Keras Tuner, for a model built with TF2. The typical setup for the latter needs to set up the Tensorboard callback in the tuner's search() method, which wraps the model's fit() method.
from kerastuner.tuners import RandomSearch

tuner = RandomSearch(build_model,  # this method builds the model
                     hyperparameters=hp,
                     objective='val_accuracy')
tuner.search(x=train_x, y=train_y,
             validation_data=(val_x, val_y),
             callbacks=[tensorboard_cb])
In practice, the tensorboard_cb callback method needs to set up the directory where data will be logged and this directory has to be unique to each trial. A common way is to do this by naming the directory based on the current timestamp, with code like below.
log_dir = time.strftime('trial_%Y_%m_%d-%H_%M_%S')
tensorboard_cb = TensorBoard(log_dir)
This works when training a model with known hyper-parameters. However, when doing a hyper-parameter search, I have to define and specify the TensorBoard callback before invoking tuner.search(). This is the problem: tuner.search() will invoke build_model() multiple times, and each of these trials should have its own TensorBoard directory. Ideally, defining log_dir would be done inside build_model(), but the Keras Tuner search API forces the TensorBoard callback to be defined outside of that function.
TL;DR: TensorBoard gets data through a callback and requires one log directory per trial, but Keras Tuner requires defining the callback once for the entire search, before performing it, not per trial. How can unique directories per trial be defined in this case?
Keras Tuner creates a subdirectory for each run (this statement is probably version dependent).
I guess finding the right version mix is important.
Here is how it works for me, in jupyterlab.
Prerequisites:
1. pip requirements:
keras-tuner==1.0.1
tensorboard==2.1.1
tensorflow==2.1.0
Keras==2.2.4
jupyterlab==1.1.4
2. jupyterlab installed, built and running [standard compile arguments: production:minimize]
Here is the actual code. First I define the log folder and the callbacks:
import datetime

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

# run parameter
log_dir = "logs/" + datetime.datetime.now().strftime("%m%d-%H%M")

# training meta
stop_callback = EarlyStopping(
    monitor='loss', patience=1, verbose=0, mode='auto')

hist_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    embeddings_freq=1,
    write_graph=True,
    update_freq='batch')

print("log_dir", log_dir)
Then I define my hypermodel, which I do not want to disclose. Afterwards I set up the hyperparameter search:
from kerastuner.tuners import Hyperband

hypermodel = get_my_hpyermodel()
tuner = Hyperband(
    hypermodel,
    max_epochs=40,
    objective='loss',
    executions_per_trial=5,
    directory=log_dir,
    project_name='test'
)
which I then execute:
tuner.search(
    train_data,
    labels,
    epochs=10,
    validation_data=(val_data, val_labels),
    callbacks=[hist_callback],
    use_multiprocessing=True)

tuner.search_space_summary()
While the notebook with this code searches for adequate hyperparameters, I monitor the loss in another notebook. Since TF v2, TensorBoard can be called via a magic function:
Cell 1
import tensorboard
Cell 2
%load_ext tensorboard
Cell 3
%tensorboard --logdir 'logs/'
Side note: since I run JupyterLab in a Docker container, I have to specify the appropriate address and port for TensorBoard and also forward this in the Dockerfile.
The result is not really predictable for me... I have not yet understood when I can expect histograms and distributions in TensorBoard.
On some runs the loading time seems really excessive... so have patience.
Under scalars I find a list of the runs as follows:
"logdir"/"model_hash"/execution[iter]/[train/validation]
E.g.
0101-1010/bb7981e03d05b05106d8a35923353ec46570e4b6/execution0/train
0101-1010/bb7981e03d05b05106d8a35923353ec46570e4b6/execution0/validation
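If you still need one TensorBoard directory per trial (rather than relying on the tuner's own subdirectories), one option is to subclass the tuner and inject a fresh callback per trial. A minimal sketch, assuming a keras-tuner version where run_trial passes the fit arguments through as keyword arguments (MyTuner and the logs/ base path are illustrative):

import os
import tensorflow as tf
from kerastuner.tuners import RandomSearch

class MyTuner(RandomSearch):
    def run_trial(self, trial, *args, **kwargs):
        # Give every trial its own TensorBoard log directory
        callbacks = list(kwargs.pop('callbacks', []))
        callbacks.append(tf.keras.callbacks.TensorBoard(
            log_dir=os.path.join('logs', trial.trial_id)))
        kwargs['callbacks'] = callbacks
        super(MyTuner, self).run_trial(trial, *args, **kwargs)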