save the model and checkpointing for algorithm-Trainers in ray-rllib - ray

Does anyone know how I can do checkpointing and model saving for algorithm-Trainer models in ray-rllib?
I know that it is available for ray.tune, but it seems that it is not directly possible for the rllib algorithms.

The trainer class has a save_checkpoint method as well as a load_checkpoint one.
@override(Trainable)
def save_checkpoint(self, checkpoint_dir: str) -> str:
    checkpoint_path = os.path.join(
        checkpoint_dir, "checkpoint-{}".format(self.iteration)
    )
    pickle.dump(self.__getstate__(), open(checkpoint_path, "wb"))
    return checkpoint_path

@override(Trainable)
def load_checkpoint(self, checkpoint_path: str) -> None:
    extra_data = pickle.load(open(checkpoint_path, "rb"))
    self.__setstate__(extra_data)
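You normally don't call these directly. Since the Trainer class inherits from tune's Trainable, every algorithm trainer also exposes the public save()/restore() wrappers around these two methods. A minimal sketch (assuming PPO on a Gym environment; the exact import path varies between Ray versions):

from ray.rllib.agents.ppo import PPOTrainer  # newer versions: ray.rllib.algorithms.ppo

trainer = PPOTrainer(env="CartPole-v0", config={"framework": "torch"})
for _ in range(3):
    trainer.train()

# save() calls save_checkpoint() under the hood and returns the checkpoint path
checkpoint_path = trainer.save("/tmp/rllib_checkpoints")

# later, possibly in a fresh process: rebuild the trainer the same way,
# then restore() calls load_checkpoint() with the saved state
new_trainer = PPOTrainer(env="CartPole-v0", config={"framework": "torch"})
new_trainer.restore(checkpoint_path)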

Related

Passing Audio Files To Celery Task

I have a music uploading app and believe that it would be smart to pass the files to a celery task to handle uploading. However, when attempting to pass the files, as I will show in my code below, I get a message stating that they are not JSON serializable. What would be the correct way to handle this operation?
Everything below upload_songs in .views.py is my current code that successfully uploads the audio tracks. It doesn't, however, utilize celery yet.
.task.py
import logging

from django.contrib.auth import get_user_model
from django.contrib.auth.models import User

from Beyond_April_Base_Backend.celery import app

@app.task
def upload_songs(songs, user_id):
    try:
        user = User.objects.get(pk=user_id)
        print('user and songs')
        print(user)
        print(songs)
    except User.DoesNotExist:
        logging.warning("Tried to find non-existing user '%s'" % user_id)
.views.py
class ConcertUploadView(APIView):
    permission_classes = [permissions.IsAuthenticated]

    def post(self, request):
        track_files = request.FILES.getlist('files')
        current_user = self.request.user
        upload_songs.delay(track_files, current_user.pk)
        try:
            selected_band = Band.objects.get(name=request.data['band'])
        except ObjectDoesNotExist:
            print('band not received from form')
            selected_band = Band.objects.get(name='Band')
        venue_name = request.data['venue']
        concert_date_str = request.data['concertDate']
        concert_date_split = concert_date_str.split('(')[0]
        concert_date = datetime.strptime(concert_date_split, '%a %b %d %Y %H:%M:%S %Z%z ')
        concert_city = request.data['city']
        concert_state = request.data['state']
        concert_country = request.data['country']
        new_concert = Concert(
            venue=venue_name,
            date=concert_date,
            city=concert_city,
            state=concert_state,
            country=concert_country,
            band=selected_band,
            user=current_user,
        )
        new_concert.save()
        i = 0
        for song in track_files:
            audio_metadata = music_tag.load_file(track_files[i].temporary_file_path())
            temp_path = song.temporary_file_path()
            song_title = str(audio_metadata['title'])
            audio_file_instance = Song(
                title=song_title,
                concert=new_concert,
                user=current_user,
                concert_order=i + 1,
                audio_file=track_files[i],
            )
            audio_file_instance.save()
            i += 1
        return Response(status=status.HTTP_201_CREATED)
When you create a celery task, it serializes the arguments so that it can store the message in the queue backend (RabbitMQ, Redis, etc). The default serializer is JSON, and a binary file is not JSON-serializable. See celery's serialization docs for more info.
You could base64 encode the binary file to text, but you shouldn't: it will increase the size of the data, and you'll be passing around potentially very large messages. With lots of large messages, you could run out of memory/space in your backend, and it will make it hard to inspect or log messages.
Instead, you should store the binary file somewhere, and pass a reference (filename, S3 URL, database key, etc) to the task. The task can then load the file, do what it needs to, and delete the original (if appropriate).
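For example, a minimal sketch of that approach for the code above (the 'uploads/' prefix and the helper name are illustrative, not from the original post): save each file to Django's default storage in the view, and hand the task only the stored names, so upload_songs receives paths instead of file objects:

# views.py (sketch): store the uploads first, then enqueue only their paths
from django.core.files.storage import default_storage

def enqueue_songs(track_files, user_id):
    stored_paths = []
    for f in track_files:
        # save() returns the name actually used (it may be suffixed to avoid collisions)
        stored_paths.append(default_storage.save('uploads/' + f.name, f))
    # a list of strings and an int are JSON-serializable, so Celery accepts this
    upload_songs.delay(stored_paths, user_id)

# tasks.py (sketch): reopen each file from storage inside the task
from django.core.files.storage import default_storage

@app.task
def upload_songs(stored_paths, user_id):
    for path in stored_paths:
        with default_storage.open(path) as f:
            ...  # process the audio, then default_storage.delete(path) if appropriate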

get() in Google Datastore doesn't work as intended

I'm building a basic blog from the Web Development course by Steve Hoffman on Udacity. This is my code -
import os
import webapp2
import jinja2
from google.appengine.ext import db

template_dir = os.path.join(os.path.dirname(__file__), 'templates')
jinja_env = jinja2.Environment(loader=jinja2.FileSystemLoader(template_dir), autoescape=True)

def datetimeformat(value, format='%H:%M / %d-%m-%Y'):
    return value.strftime(format)

jinja_env.filters['datetimeformat'] = datetimeformat

def render_str(template, **params):
    t = jinja_env.get_template(template)
    return t.render(params)

class Entries(db.Model):
    title = db.StringProperty(required=True)
    body = db.TextProperty(required=True)
    created = db.DateTimeProperty(auto_now_add=True)

class MainPage(webapp2.RequestHandler):
    def get(self):
        entries = db.GqlQuery('select * from Entries order by created desc limit 10')
        self.response.write(render_str('mainpage.html', entries=entries))

class NewPost(webapp2.RequestHandler):
    def get(self):
        self.response.write(render_str('newpost.html', error=""))

    def post(self):
        title = self.request.get('title')
        body = self.request.get('body')
        if title and body:
            e = Entries(title=title, body=body)
            length = db.GqlQuery('select * from Entries order by created desc').count()
            e.put()
            self.redirect('/newpost/' + str(length + 1))
        else:
            self.response.write(render_str('newpost.html', error="Please type in a title and some content"))

class Permalink(webapp2.RequestHandler):
    def get(self, id):
        e = db.GqlQuery('select * from Entries order by created desc').get()
        self.response.write(render_str('permalink.html', id=id, entry=e))

app = webapp2.WSGIApplication([('/', MainPage),
                               ('/newpost', NewPost),
                               (r'/newpost/(\d+)', Permalink)
                               ], debug=True)
In the class Permalink, I'm using the get() method on the query that returns all records in descending order of creation, so it should return the most recently added record. But when I add a new record, permalink.html (just a page that shows the title, the body, and the creation date of the new entry) shows the SECOND most recently added one. For example, I already had three records; when I added a fourth record, instead of showing the details of the fourth record, permalink.html showed me the details of the third. Am I doing something wrong?
I don't think my question is a duplicate of this - Read delay in App Engine Datastore after put(). That question is about read delay of put(), while I'm using get(). The accepted answer also states that get() doesn't cause any delay.
This is because of eventual consistency used by default for GQL queries.
You need to read:
https://cloud.google.com/appengine/docs/python/datastore/data-consistency
https://cloud.google.com/appengine/docs/python/datastore/structuring_for_strong_consistency
https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/
and search & read on SO and other sources about strong & eventual consistency in Google Cloud Datastore.
You can specify read_policy=STRONG_CONSISTENCY for your query but it has associated costs that you should be aware of and take into account.
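For this particular code there is also a simpler fix that sidesteps the race entirely (a sketch, not from the original answer): redirect to the datastore-assigned id after put() and fetch the entity by key in Permalink, since lookups by key are strongly consistent:

class NewPost(webapp2.RequestHandler):
    def post(self):
        # ... validation as before ...
        e = Entries(title=title, body=body)
        e.put()
        # use the id the datastore assigned instead of a separate count() query
        self.redirect('/newpost/%d' % e.key().id())

class Permalink(webapp2.RequestHandler):
    def get(self, id):
        # get_by_id() is a lookup by key, which is strongly consistent
        e = Entries.get_by_id(int(id))
        self.response.write(render_str('permalink.html', id=id, entry=e))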

mongoengine know when to delete document

New to django. I'm doing my best to implement CRUD using Django, mongodb, and mongoengine. I'm able to query the database and render my page with the correct information from the database. I'm also able to change some document fields using javascript and do an Ajax POST back to the original Django View class with the correct csrf token.
The data payload I'm sending back and forth is a list of each Document Model (VirtualPageModel) serialized to json (each element contains ObjectId string along with the other specific fields from the Model.)
This is where it starts getting murky. In order to update the original document in my View Class post function I do an additional query using the object id and loop through the dictionary items, setting the respective fields each time. I then call save and any new data is pushed to the Mongo collection correctly.
I'm not sure if what I'm doing to update existing documents is correct or in the spirit of django's abstracted database operations. The deeper I get the more I feel like I'm not using some fundamental facility earlier on (provided by either django or mongoengine) and because of this I'm having to make things up further downstream.
The way my code is now, I would not be able to create a new document (although that's easy enough to fix). However, what I'm really curious about is how I would know when to delete a document which existed in the initial query but was removed by the user/javascript code. Am I overthinking things, and should the contents of my POST contain a list of ObjectIds to delete (sounds like a security risk, although this would be an internal tool)?
I was assuming that my View Class might maintain either the original document objects (or simply their ObjectIds) it queried, and I could do my comparisons off of that set, but I can't seem to get that information to persist (as a class variable in VolumeSplitterView) from its inception to when I receive the POST at the end.
I would appreciate if anyone could take a look at my code. It really seems like the "ease of use" facilities of Django start to break when paired with Mongo and/or a sufficiently complex Model schema which needs to be directly available to javascript as opposed to simple Forms.
I was going to use this dev work to become django battle-hardened in order to tackle a future app which will be much more complicated and important. I can hack on this thing all day and make it functional, but what I'm really interested in is anyone's experience in using Django + MongoDB + MongoEngine to implement CRUD on a Database Schema which is not very Form-centric (think more nested metadata).
Thanks.
model.py: uses mongoengine Field types.
class MongoEncoder(JSONEncoder):
    def default(self, o):
        if isinstance(o, VirtualPageModel):
            data_dict = (o.to_mongo()).to_dict()
            if isinstance(data_dict.get('_id'), ObjectId):
                data_dict.update({'_id': str(data_dict.get('_id'))})
            return data_dict
        else:
            return JSONEncoder.default(self, o)

class SubTypeModel(EmbeddedDocument):
    filename = StringField(max_length=200, required=True)
    page_num = IntField(required=True)

class VirtualPageModel(Document):
    volume = StringField(max_length=200, required=True)
    start_physical_page_num = IntField()
    physical_pages = ListField(EmbeddedDocumentField(SubTypeModel),
                               default=list)
    error_msg = ListField(StringField(),
                          default=list)

    def save(self, *args, **kwargs):
        print('In save: {}'.format(kwargs))
        for k, v in kwargs.items():
            if k == 'physical_pages':
                self.physical_pages = []
                for a_page in v:
                    tmp_pp = SubTypeModel()
                    for p_k, p_v in a_page.items():
                        setattr(tmp_pp, p_k, p_v)
                    self.physical_pages.append(tmp_pp)
            else:
                setattr(self, k, v)
        return super(VirtualPageModel, self).save(*args, **kwargs)
views.py: My attempt at a view
class VolumeSplitterView(View):
    # initial = {'key': 'value'}
    template_name = 'click_model/index.html'
    vol = None
    start = 0
    end = 20

    def get(self, request, *args, **kwargs):
        self.vol = self.kwargs.get('vol', None)
        records = self.get_records()
        records = records[self.start:self.end]
        vp_json_list = []
        img_filepaths = []
        for vp in records:
            vp_json = json.dumps(vp, cls=MongoEncoder)
            vp_json_list.append(vp_json)
            for pp in vp.physical_pages:
                filepath = get_file_path(vp, pp.filename)
                img_filepaths.append(filepath)
        data_dict = {
            'img_filepaths': img_filepaths,
            'vp_json_list': vp_json_list
        }
        return render_to_response(self.template_name,
                                  {'data_dict': data_dict},
                                  RequestContext(request))

    def get_records(self):
        return VirtualPageModel.objects(volume=self.vol)

    def post(self, request, *args, **kwargs):
        if request.is_ajax():
            vp_dict_list = json.loads(request.POST.get('data', '[]'))
            for vp_dict in vp_dict_list:
                o_id = vp_dict.pop('_id')
                original_doc = VirtualPageModel.objects.get(id=o_id)
                try:
                    original_doc.save(**vp_dict)
                except Exception:
                    print(traceback.format_exc())
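As for knowing when to delete: one way to implement the idea floated in the question (a sketch with illustrative names, not code from the post) is to treat the POSTed list as the authoritative state for the volume and delete whatever stored ids are missing from it:

def post(self, request, *args, **kwargs):
    if request.is_ajax():
        # re-read the volume from the URL kwargs; state set in get() does not persist
        vol = self.kwargs.get('vol', None)
        vp_dict_list = json.loads(request.POST.get('data', '[]'))
        posted_ids = {vp_dict['_id'] for vp_dict in vp_dict_list}
        existing_ids = {str(doc.id) for doc in VirtualPageModel.objects(volume=vol)}
        # anything the client no longer sends was removed in the UI; delete it,
        # scoped to this volume so a forged id cannot touch unrelated documents
        for stale_id in existing_ids - posted_ids:
            VirtualPageModel.objects(id=stale_id, volume=vol).delete()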

Transcode video using celery and ffmpeg in django

I would like to transcode user uploaded videos using celery. I think first I should upload the video, and spawn a celery task for transcoding.
Maybe something like this in the tasks.py:
subprocess.call('ffmpeg -i path/.../original path/.../output')
Just completed First Steps with Celery, so I'm confused about how to do this in views.py and tasks.py. Also, is it a good solution? I would really appreciate your help and advice. Thank you.
models.py:
class Video(models.Model):
    user = models.ForeignKey(User)
    title = models.CharField(max_length=100)
    original = models.FileField(upload_to=get_upload_file_name)
    mp4_480 = models.FileField(upload_to=get_upload_file_name, blank=True, null=True)
    mp4_720 = models.FileField(upload_to=get_upload_file_name, blank=True, null=True)
    privacy = models.CharField(max_length=1, choices=PRIVACY, default='F')
    pub_date = models.DateTimeField(auto_now_add=True, auto_now=False)
my incomplete views.py:
@login_required
def upload_video(request):
    if request.method == 'POST':
        form = VideoForm(request.POST, request.FILES)
        if form.is_valid():
            if form.cleaned_data:
                user = request.user
                #
                #
                # No IDEA WHAT TO DO NEXT
                #
                #
                return HttpResponseRedirect('/')
    else:
        form = VideoForm()
    return render(request, 'upload_video.html', {
        'form': form
    })
I guess you have already solved the problem, but I will add a bit more information to what GwynBleidD already said, because I had the same issue.
So, as GwynBleidD said, you need to call Celery tasks, but how do you code those tasks? Here is the structure:
the task gets the video from the database
it encodes it with ffmpeg and outputs it anywhere you want
when done with the encoding, it sets the corresponding attribute on the model and saves it (be careful: if you run several tasks on the same video, do not save with the old instance, as you may lose information from the other tasks running)
First, set a FFMPEG_PATH variable in your settings, then:
import os
import subprocess

from django.conf import settings

from .models import Video

@app.task
def encode_mp4(video_id, height):
    try:
        video = Video.objects.get(id=video_id)
        input_file_path = video.original.path
        # get the filename (without extension)
        filename = os.path.splitext(os.path.basename(input_file_path))[0]
        # path to the new file, change it according to where you want to put it
        output_file_name = os.path.join('videos', 'mp4', '{}.mp4'.format(filename))
        output_file_path = os.path.join(settings.MEDIA_ROOT, output_file_name)
        # 2-pass encoding: ffmpeg expects the pass numbers 1 and 2, passed as strings
        for i in (1, 2):
            subprocess.call([settings.FFMPEG_PATH, '-i', input_file_path,
                             '-s', '{}x{}'.format(height * 16 // 9, height),
                             '-vcodec', 'mpeg4', '-acodec', 'libvo_aacenc',
                             '-b', '10000k', '-pass', str(i), '-r', '30',
                             output_file_path])
        # Save the new file in the database
        video.mp4_720.name = output_file_name
        video.save(update_fields=['mp4_720'])
    except Video.DoesNotExist:
        pass  # the video row was deleted before the task ran
Modify your model so you can save the original (uploaded) video without the transcoded version(s), and maybe add a flag to your model that records whether the video has been transcoded (based on that flag you can show the user that transcoding is still in progress).
After uploading the video and saving its model to the database, run a Celery task, passing the ID of your video into it. In the Celery task, retrieve the video from the database, transcode it, and save it back with the flag changed.
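Putting the two answers together, the missing part of the view might look like this (a sketch; it assumes VideoForm is a ModelForm over Video and encode_mp4 is the task above):

@login_required
def upload_video(request):
    if request.method == 'POST':
        form = VideoForm(request.POST, request.FILES)
        if form.is_valid():
            video = form.save(commit=False)
            video.user = request.user
            video.save()  # the original file is stored; the transcoded fields stay empty
            # queue transcoding only after the row exists, so the task can find it by id
            encode_mp4.delay(video.id, 720)
            return HttpResponseRedirect('/')
    else:
        form = VideoForm()
    return render(request, 'upload_video.html', {'form': form})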

django-photologue upload_to

I have been playing around with django-photologue for a while, and find this a great alternative to all other image handlings apps out there.
One thing though, I also use django-cumulus to push my uploads to my CDN instead of running it on my local machine / server.
When I used imagekit, I could always pass an upload_to='whatever', but I cannot seem to do this with photologue, as it automatically inserts the image field. How would I go about achieving some sort of an override?
Regards
I think you can hook into the pre_save signal of the Photo model, and change the upload_to field, just before the instance is saved to the database.
Take a look at this:
http://docs.djangoproject.com/en/dev/topics/signals/
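A minimal sketch of that idea (the handler name and path prefix are illustrative, and this is untested against photologue's internals):

from django.db.models.signals import pre_save
from django.dispatch import receiver
from photologue.models import Photo

@receiver(pre_save, sender=Photo)
def rewrite_image_path(sender, instance, **kwargs):
    # fires just before the row is written; on first save the uploaded file has
    # not been committed to storage yet, so renaming it here changes where it lands
    if instance.image and not instance.pk:
        instance.image.name = 'whatever/' + instance.image.name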
Managed to find a workaround for it; however, this requires you to make changes in photologue/models.py:
if PHOTOLOGUE_PATH is not None:
    if callable(PHOTOLOGUE_PATH):
        get_storage_path = PHOTOLOGUE_PATH
    else:
        parts = PHOTOLOGUE_PATH.split('.')
        module_name = '.'.join(parts[:-1])
        module = __import__(module_name)
        get_storage_path = getattr(module, parts[-1])
else:
    def get_storage_path(instance, filename):
        dirname = 'photos'
        if hasattr(instance, 'upload_to'):
            if callable(instance.upload_to):
                dirname = instance.upload_to()
            else:
                dirname = instance.upload_to
        return os.path.join(PHOTOLOGUE_DIR,
                            os.path.normpath(force_unicode(datetime.now().strftime(smart_str(dirname)))),
                            filename)
And then in your app's models do the following:
class MyModel(ImageModel):
    text = ...
    name = ...

    def upload_to(self):
        return 'yourdirectorynamehere'
You can use the setting PHOTOLOGUE_PATH to provide your own callable. Define a method which takes instance and filename as parameters, then return whatever you want. For example, in your settings.py:
import os
import photologue.models
...
def PHOTOLOGUE_PATH(instance, filename):
    folder = 'myphotos'  # Add your logic here
    return os.path.join(photologue.models.PHOTOLOGUE_DIR, folder, filename)
Presumably (although I've not tested this) you could find out more about the Photo instance (e.g. what other instances it relates to) and choose the folder accordingly.