I have an IPython notebook that I invoke through Django's shell_plus --notebook command.
I would like to save the notebook, meaning the code cells, without saving the output that follows each code cell.
I use this notebook to do analytics and reporting on sensitive patient data covered by HIPAA and so I'd like to be able to persist the notebook in git without exposing the sensitive patient data in the git repository.
Reposting as an answer:
You can set up a git hook that will strip the output from the notebook whenever you commit: gist.github.com/minrk/6176788
A bit more advanced, but less tested, is a tool I wrote called nbexplode, that splits the notebook up into multiple pieces and recombines them. The advantage of this is that you could keep the output in your local copy and only commit the code. But if the simpler approach works for you, I'd go for that.
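For reference, the core of such a hook is small. This is a minimal sketch, not the code from the gist: it assumes the nbformat 4 layout, where cells sit in a top-level "cells" list, and the function names are made up for illustration.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with all code-cell outputs removed."""
    # Round-trip through JSON as a cheap deep copy; notebooks are plain JSON data.
    stripped = json.loads(json.dumps(notebook))
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def strip_file(path: str) -> None:
    """Rewrite the .ipynb file at `path` in place, without outputs."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1)
        f.write("\n")
```

A pre-commit hook would call strip_file on each staged .ipynb file, so the outputs never reach the repository.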
I'm currently using a SageMaker notebook instance (not SageMaker Studio), and I want to run a notebook that is expected to take around 8 hours to finish. I want to leave it running overnight and see the output from each cell; the output is a combination of print statements and plots.
However, when I start the notebook, make sure the initial cells run, and close the JupyterLab tab in my browser, I find the notebook stopped when I reopen the tab a few minutes later to check on it.
Is there any way I can keep using my notebook as it is, see the output from each cell (prints and plots), and not have to keep the JupyterLab tab open (so that I can turn my laptop off, etc.)?
Jupyter will stop your kernel when you close the tab. If you want to benefit from your jobs running after you close the jupyter tab, I would recommend looking into using SageMaker Processing or Training jobs for your workloads. Alternatively, this link provides some options on how to keep the notebook running with the tab closed.
Answering my own question.
I ended up using SageMaker Processing jobs for this, as initially suggested by the other answer. I found a library developed a few months ago, Sagemaker run notebook, which let me keep my notebook structure and cells as I had them, run the notebook on a bigger instance, and keep editing it on a smaller one.
The output of each cell, along with my plots, was saved to S3 as a Jupyter notebook.
The library does not seem to be actively maintained, but you can fork it, change it, and use it as your requirements dictate, for example by creating a Docker container based on your needs.
Where should I execute a Python script that processes ~7 GB of data available on GCS? The output will be written to GCS as well.
The script was debugged in a Datalab notebook with a small dataset. I would like to scale up the processing. Should I allocate a big machine? I have no idea what machine size (resources) is needed.
Many thanks,
Eila
Just in case: Dataflow can’t work for that kind of data processing.
From what I have read about HDF5, it seems that it is not easily parallelizable (see Parallel HDF5 and the h5py multiprocessing_example), so I'll assume that reading that ~7 GB must be done by one worker.
If there is no workaround to it, and you do not encounter memory issues while processing it on the machine you are already using, I do not see a need to upgrade your datalab instance.
I have been reading about how to do backup and restore in Django.
The best I could come up with was dumpdata,
i.e. python manage.py dumpdata > foo.json
To restore this data we would have to delete or drop the present tables, then restore the JSON file by using it as a fixture, i.e. run syncdb.
Is there a standard way of doing this, i.e. a process that can be used every time we do a backup and restore?
I am looking for a tool like South that can be used for database backup and restore.
I am planning to get my site online, so any help will be highly appreciated.
The django-dbbackup package can do database backups and restores.
For proper backup and restore, use the tools that came with your database.
If you must use django, write your own custom management commands (but again, I question the wisdom of this).
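As an illustration of "the tools that came with your database", here is a rough sketch that assumes PostgreSQL (the question does not say which database is in use). The database name and file path are placeholders, and the helper names are invented for this example.

```python
import subprocess


def dump_command(dbname: str, outfile: str) -> list:
    # -Fc selects pg_dump's custom archive format, which pg_restore can read.
    return ["pg_dump", "-Fc", "-f", outfile, dbname]


def restore_command(dbname: str, infile: str) -> list:
    # --clean drops existing objects before recreating them from the archive.
    return ["pg_restore", "--clean", "-d", dbname, infile]


def backup(dbname: str, outfile: str) -> None:
    # check=True raises CalledProcessError if pg_dump exits with an error.
    subprocess.run(dump_command(dbname, outfile), check=True)


def restore(dbname: str, infile: str) -> None:
    subprocess.run(restore_command(dbname, infile), check=True)
```

Running backup("mysite", "/var/backups/mysite.dump") from a scheduled job would give a repeatable backup step, and restore() reverses it; other databases ship equivalent tools.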
How do people deploy/version control cronjobs to production? I'm more curious about conventions/standards people use than any particular solution, but I happen to be using git for revision control, and the cronjob is running a python/django script.
If you are using Fabric for deployment you could add a function that edits your crontab.
from fabric.api import run  # Fabric 1.x API

def add_cronjob():
    # Dump the current crontab, append the new entry, then reinstall it.
    run('crontab -l > /tmp/crondump')
    run('echo "@daily /path/to/dostuff.sh 2> /dev/null" >> /tmp/crondump')
    run('crontab /tmp/crondump')
This would append a job to your crontab (disclaimer: totally untested and not very idempotent).
Save the crontab to a tempfile.
Append a line to the tmpfile.
Write the crontab back.
This is probably not exactly what you want to do, but along those lines you could think about checking the crontab into git and overwriting it on the server with every deploy (if there's a dedicated user for your project).
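On the idempotency caveat: one way around it is to add the line only when it is not already present. This is a minimal sketch in plain Python; ensure_job is a made-up helper, not part of Fabric.

```python
def ensure_job(crontab_text: str, job_line: str) -> str:
    """Return the crontab text with job_line appended only if it is missing."""
    job = job_line.strip()
    lines = crontab_text.splitlines()
    # Compare stripped lines so trailing whitespace does not cause duplicates.
    if job not in (line.strip() for line in lines):
        lines.append(job)
    return "\n".join(lines) + "\n"
```

The input text would come from crontab -l, and the result would be written to the temp file and installed with crontab, as in the steps above. Running the deploy twice then leaves a single copy of the job.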
Using Fabric, I prefer to keep a pristine version of my crontab locally, that way I know exactly what is on production and can easily edit entries in addition to adding them.
The fabric script I use looks something like this (some code redacted e.g. taking care of backups):
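from fabric.api import put, sudo  # Fabric 1.x API

def deploy_crontab():
    # Upload the local crontab file, then install it on the server.
    put('crontab', '/tmp/crontab')
    sudo('crontab < /tmp/crontab')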
You can also take a look at:
http://django-fab-deploy.readthedocs.org/en/0.7.5/_modules/fab_deploy/crontab.html#crontab_update
django-fab-deploy module has a number of convenient scripts including crontab_set and crontab_update
You can probably use something like CFEngine/Chef for deployment (it can deploy everything - including cron jobs)
However, if you are asking this question, it could be that you have many production servers, each running a large number of scheduled jobs.
If this is the case, you probably want a tool that can not only deploy jobs but also track success and failure, let you easily look at logs from the last run, show run statistics, let you easily change the schedule for many jobs and servers at once (for planned maintenance, say), etc.
I use a commercial tool called "UC4". I don't really recommend it, so I hope you can find a better program that can solve the same problem. I'm just saying that administration of jobs doesn't end when you deploy them.
There are really three options for manually deploying a crontab if you cannot connect your system to a configuration management system like cfengine/puppet.
You could simply use crontab -u user -e, but you run the risk of someone making an error in their copy/paste.
You could also copy the file into the cron directory, but there is no syntax checking for the file, and on Linux you must run touch /var/spool/cron in order for crond to pick up the changes.
Note: everyone will forget the touch command at some point.
In my experience, the following is my favorite manual way of deploying a crontab:
diff /var/spool/cron/<user> /var/tmp/<user>.new
crontab -u <user> /var/tmp/<user>.new
I think the method I mentioned above is the best because you don't run the risk of copy/paste errors which helps you maintain consistency with your version controlled file. It performs syntax checking of the cron tasks inside of the file, and you won't need to perform the touch command as you would if you were to simply copy the file.
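If you want the same review-then-install flow from a script, a rough sketch could look like this. The function names are illustrative, and difflib stands in for the diff command; the install step runs the same crontab -u command shown above.

```python
import difflib
import subprocess


def crontab_diff(current: str, new: str) -> str:
    """Unified diff between the installed crontab text and the new file's text."""
    diff = difflib.unified_diff(
        current.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="installed",
        tofile="new",
    )
    return "".join(diff)


def install(user: str, path: str) -> None:
    # Let crontab itself install the file instead of copying into the spool dir.
    subprocess.run(["crontab", "-u", user, path], check=True)
```

An empty diff means there is nothing to deploy; otherwise you review the output and call install() only once you are happy with it.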
Having your project under version control, including your crontab.txt, is what I prefer. Then, with Fabric, it is as simple as this:
from fabric.api import run, task  # Fabric 1.x API

@task
def crontab():
    run('crontab deployment/crontab.txt')
This will install the contents of deployment/crontab.txt as the crontab of the user you connect to the server as. If you don't have your complete project on the server, you'd want to put (upload) the crontab file first.
If you're using Django, take a look at the jobs system from django-command-extensions.
The benefits are that you can keep your jobs inside your project structure, with version control, write everything in Python and configure crontab only once.
I use Buildout to manage my Django projects. With Buildout, I use z3c.recipe.usercrontab to install cron jobs in deploy or update.
You said:
I'm more curious about conventions/standards people use than any particular solution
But, to be fair, the particular solution will depend on your environment, and there is no universal elegant silver bullet. Given that you happen to be using Python/Django, I recommend Celery. It is an asynchronous task queue for Python, which integrates nicely with Django. And, on top of the features that it gives as an asynchronous task queue, it also has specific features for periodic tasks.
I have personally used the django-celery-beat integration, and it integrates perfectly with Django settings and behaves correctly in distributed environments. If your periodic tasks are related to Django stuff, I strongly recommend taking a look at Celery. I started using it only for certain asynchronous mailing and ended up using it for a lot of asynchronous tasks plus periodic sanity checks and other web application maintenance stuff.
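To give an idea of what a periodic task looks like in version control, here is a minimal sketch of a static schedule in a Django settings module. It assumes the Celery app loads its configuration with namespace="CELERY", and the task path is hypothetical. django-celery-beat keeps schedules in the database instead, so treat this as the plain-settings variant.

```python
from datetime import timedelta

# Settings fragment: Celery beat reads this mapping and enqueues each task
# at the given interval.
CELERY_BEAT_SCHEDULE = {
    "hourly-sanity-check": {
        "task": "myapp.tasks.sanity_check",  # hypothetical task path
        "schedule": timedelta(hours=1),
    },
}
```

Because the schedule lives in settings.py, it is deployed with the code and there is no crontab to keep in sync.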
Overview
I'm building a website in Django. I need to allow people to begin adding flatpages and setting some settings in the admin. These changes should be definitive, since that information comes from the client. However, I'm also developing the backend, and as such am creating and migrating tables. I push these changes to the hub.
Tools
django
git
south
postgres
Problem
How can I ensure that I get the database changes from the online site down to me on my lappy, and also how can I push my database changes up to the live site, so that we have a minimum of co-ordination needed? I am familiar with git hooks, so that option is in play.
Addendum:
I guess I know which tables can be modified via the admin. There should not be much overlap really. As I consider further, the danger really is me pushing data that would overwrite something they have done.
Thanks.
For getting your schema changes up to the server, just use South carefully. If you modify any table they might have data in, make sure you write both a schema migration and as necessary a data migration to preserve the sense of their data.
For getting their updated data back down to you (which doesn't seem critical, but might be nice to work with up-to-date test data as you're developing), I generally just use Django fixtures and the dumpdata and loaddata commands. It's easy enough to dump a fixture and commit it to your repo, then a loaddata on your end.
You could try using git hooks to automate some of this, but if you want automation I do recommend trying something like Fabric instead. Much of this stuff doesn't need to be run every single time you push/pull (in particular, I usually wouldn't want to dump a new data fixture that frequently).
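The dumpdata/loaddata round trip can be scripted along these lines. This is a sketch, assuming it runs from the project root next to manage.py; the helper names are invented, and "flatpages" in the usage note is just the app from the question.

```python
import subprocess
import sys


def dumpdata_command(*app_labels):
    # --indent keeps the JSON readable and diff-friendly once committed to git.
    return [sys.executable, "manage.py", "dumpdata", "--indent", "2", *app_labels]


def loaddata_command(fixture_path):
    return [sys.executable, "manage.py", "loaddata", fixture_path]


def dump_fixture(fixture_path, *app_labels):
    # dumpdata writes to stdout, so redirect it into the fixture file.
    with open(fixture_path, "w") as out:
        subprocess.run(dumpdata_command(*app_labels), stdout=out, check=True)


def load_fixture(fixture_path):
    subprocess.run(loaddata_command(fixture_path), check=True)
```

On the server you would run something like dump_fixture("fixtures/live.json", "flatpages") and commit the file; on your machine, load_fixture("fixtures/live.json") after pulling.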
You should probably take a look at South:
http://south.aeracode.org/
It seems to me that you could probably create a git hook that triggers off South if you are doing some sort of continuous integration system.
Otherwise, every time you do a push you will have to manually execute the migration steps yourself. Don't forget to put up the "site is under maintenance" message. ;)
I recommend that you use mk-table-sync to pull changes from the live server to your laptop.
mk-table-sync takes a lot of parameters, so you can automate this process by using Fabric. You would basically create a Fabric function that executes mk-table-sync on each table that you want to pull from the server.
This means that you cannot make database changes yourself, because they will be overwritten by the pull.
The only changes that you would be making to the live database are through South. You would push the code to the server and then run migrate to update the database schema.