Delaying system shutdown during json DB update in python - python-2.7

So I have a rather large json database that I'm maintaining with python. It's basically scraping data from a website on an hourly basis and I'm running daily restarts on the system (Linux Mint) via crontab. My issue is that if the system happens to restart during the database updating process I get corrupted json files.
My question is if there is anyway to delay the system restart in my script to ensure the system shuts down at a safe time? I could issue the restart command inside the script itself but if I decide to run multiple scripts that are similar to this in the future I'll obviously have a problem.
Any help here would be greatly appreciated. Thanks
Edit: Just to clarify I'm not using the python jsondb package. I am doing all file handling myself

So my solution to this was quite simple (Just protect data integrity):
Before write - backup the file
On successful write - delete the backup (Avoids doubling the size of the DB)
Where ever a corrupted file is encountered - revert to backup
The idea being that if the system closes the script during the file backup, it doesn't matter, we still have the original and if the system closes the script during write to the original file, the backup never gets deleted and we can just use that instead. All and all it was just an extra 6 lines of code and appears to have solved the issue.

Related

AWS Sagemaker: Jupyter Notebook kernel keeps dying

I get disconnect every now and then when running a piece of code in Jupyter Notebooks on Sagemaker. I usually just restart my notebook and run all the cells again. However, I want to know if there is a way to reconnect to my instance without having to lose my progress. At the minute, it shows that there is "No Kernel" at the bottom bar, but my file seems active in the kernel sessions tab. Can I recover my notebook's variables and contents? Also, is there a way to prevent future kernel disconnections?
Note that I reverted back to tornado = 5.1.1, which seems to decrease the number of disconnections, but it still happens every now and then.
Often, disconnections will be caused by inactivity because a job is running for a long time with no user input. If it's pre-processing that's taking a long time, you could increase the instance size of the processing job so that it executes faster, or increase the instance count. If you're using EMR, you can now run an EMR Spark query directly on the EMR cluster since December 2021:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-studio-data-notebook-integration-emr/
There's a useful blog here https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/ which is helpful in getting you up and running.
Please let me know if you need more information, or vote for the answer if it's useful. :-)
For me a quick solution was to open a Terminal instead, save the notebook file as a Pytohn file, and run it from the terminal within Sagemaker.

Selecting google cloud tool for executing demanding python script

Where should I execute a python script that process ~7giga of data that is available on GCS. The output will be writen to GCS as well.
The script was debugged on datalab notebook with small dataset. I would like to scale up the processing. Should I allocate a big machine? I have no idea what size (resources) of machine is needed.
Many thanks,
Eila
Just in case,
Dataflow can’t work for that kind of data processing
From what I have read about HDF5, it seems that it is not easily parallelizable (See Parallel HDF5 and h5py multiprocessing_example) so I'll assume that reading that ~7GB must me done by one worker.
If there is no workaround to it, and you do not encounter memory issues while processing it on the machine you are already using, I do not see a need to upgrade your datalab instance.

Delete file after a certain number of days?

I am making a web-app that auto-deletes an uploaded file after a certain number of hours since it has been uploaded. My question is, what would be the best backend way of implementing this?
Should I poll a folder for files older than X hours then call a script to delete those files? Since it is going to be a web-app, is there something server-side language that I could use for it?
A simple back end shell script scheduled as a cronjob every x minutes will do the job if files are stored directly on file system. If file references , locations etc are stored in DB with time stamp, same can be cleaned up with shell script cronjob too.
This is the way it was being done in one of my previous projects.

Spark application stops when reading file from s3

I have an application which runs on EMR and reads a csv file from s3.
However, the whole thing seems to stop (I've let it run for about an hour) when I try to read in that file from s3. Nothing happens and nothing is written to the logs any more except that the application is still running. The step in which this application is running does not fail!
I've tried copying the file to the cluster via the flag --files of spark-submit and reading it directly within the application with sc.textFile(filename).
Is there anything I am missing?
After a while I finally got back to that problem again and could "solve" it myself (I don't really know what the problem was, though...)
It seems like spark was failing to allocate worker nodes. After setting spark.dynamicAllocation.enabled to true everything is working as expected now.

How to delete files via FTP when directory has over 100,000 files?

I went to upload a new file to my web server only to get a message in return saying that my disk quota was full... I wasn't using up my allotted space but rather my allotted FILE QUANTITY. My host caps my total number of files at about 260,000.
Checking through my folders I believe I found the culprit...
I have a small DVD database application (Video dB By split Brain) that I have installed and hidden away on my web site for my own personal use. It apparently caches data from IMDB, and over the years has secretly amassed what is probably close to a MIRROR of IMDB at this point. I don't know for certain but I did have a 2nd (inactive) copy of the program on the host that I created a few years back that I was using for testing when I was modifying portions of it. The cache folder in this inactive copy had 40,000 files totalling 2.3GB in size. I was able to delete this folder over FTP but it took over an hour. Thankfully it also gave me some much needed breathing room.
...But now as you can imagine the cache folder for the active copy of this web-app likely has closer to 150000 files totalling about 7GB worth of data.
This is where my problem comes in... I use Flash FXP for my FTP client and whenever I try to delete the cache folder, or even just view the contents it will sit and try to load a file list for a good 5 minutes and then lose connection to the server...
my host has a web based file browser and it crashes when trying to do this... as do free online FTP clients like net2ftp.com. I don't have SSH ability on this server so I can't login directly to delete either.
Anyone have any idea how I can delete these files? Is there a different FTP program I can download that would have better success... or perhaps a small script I could run that would be able to take care of it?
Any help would be greatly appreciated.
Anyone have any idea how I can delete
these files?
Submit a support request asking for them to delete it for you?
It sounds like it might be time for a command line FTP utility. One ships with just about every operating system. With that many files, I would write a script for my command-line FTP client that goes to the folder in question and performs a directory listing, redirecting the output to a file. Then, use magic (or perl or whatever) to process that file into a new FTP script that runs a delete command against all of the files. Yes, it will take a long time to run.
If the server supports wildcards, do that instead and just delete ..
If that all seems like too much work, open a support ticket with your hosting provider and ask them to clean it up on the server directly.
Having said all that, this isn't really a programming question and should probably be closed.
We had a question a while back where I ran an experiment to show that Firefox can browse a directory with 10,000 files no problem, via FTP. Presumably 150,000 will also be ok. Firefox won't help you delete, but it might be helpful in capturing the names of the files you need to delete.
But first I would just try the command-line client ncftp. It is well engineered and I have had good luck with it in the past. You can delete a large number of files at once using shell patterns. And it is available for Windows, MacOS, Linux, and many other platforms.
If that doesn't work, you sound like a long-term customer---could you beg your ISP the privilege of a shell account for a week so you can remote login with Putty or ssh and blow away the entire directory with a single rm -r command?
If your ISP provides ssh access, you can use one rm command to remove the files.
If there is no command line access, you can have a try with some powerful FTP client like CrossFTP. It works on win, mac, and linux. When you select to delete the huge amount of files on your server, it can queue in the delete operations, so that you don't need to reload the folder again. When you restart CrossFTP, the queue can also be restored and continued.