tensorflow fully connected control flow per n-epoch summary - python-2.7

When I don't use queues, I like to tally the loss, accuracy, PPV, etc. during an epoch of training and submit that tf.summary at the end of every epoch.
I'm not sure how to replicate this behavior with queues. Is there a signal I can listen to for when an epoch is complete?
(version 0.9)
A typical setup goes as follows:
queue = tf.train.string_input_producer(filenames, num_epochs=7)  # filenames: list of input file paths
...#build graph#...
#training
try:
    while not coord.should_stop():
        sess.run(train_op)
except tf.errors.OutOfRangeError:
    #the files have been read num_epochs times
    #do some stuff.. maybe summaries
    coord.request_stop()
finally:
    coord.join(threads)
So, clearly I could just set num_epochs=1 and create the summaries in the except block, but that would require running my entire program once per epoch, which hardly seems the most efficient approach.

EDIT: Changed to account for edits to the question.
An epoch is not something that is built in or 'known' to TensorFlow. You have to keep track of epochs in your training loop yourself and run the summary ops at the end of each epoch. Pseudocode like the following should work:
num_mini_batches_in_epoch = ... # something like examples_in_file / mini_batch_size
try:
    while True:
        for i in range(num_mini_batches_in_epoch):
            if coord.should_stop():
                raise Exception()
            sess.run(train_op)
        # end of an epoch: run the summary ops
        sess.run([loss_summary, accuracy_summary])
except:
    #file has been read num_epochs times
    #do some stuff.. maybe summaries
    coord.request_stop()
finally:
    coord.join(threads)
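The summary values returned by sess.run still need to be handed to a writer to show up in TensorBoard. A minimal sketch of the end-of-epoch step, assuming the 0.9-era summary API (tf.merge_all_summaries / tf.train.SummaryWriter); the log directory and the epoch counter are illustrative, not from the question:

# Assumes the 0.9-era API; '/tmp/train_logs' and `epoch` are placeholders.
merged_summaries = tf.merge_all_summaries()         # merges every summary op defined in the graph
writer = tf.train.SummaryWriter('/tmp/train_logs')  # hypothetical log directory

# ...at the end of each epoch, inside the training loop...
summary_str = sess.run(merged_summaries)
writer.add_summary(summary_str, global_step=epoch)  # use the epoch index as the step
writer.flush()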

Related

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

We have a Vertex AI model that takes a relatively long time to return a prediction.
When hitting the model endpoint with one instance, things work fine. But batch jobs of, say, 1000 instances end up with around 150 504 errors (upstream request timeout). (We actually need to send batches of 65K, but I'm troubleshooting with 1000.)
I tried increasing the number of replicas, assuming that the number of instances handed to the model would be (1000 / number of replicas), but that doesn't seem to be the case.
I then read that the default batch size is 64, so I tried decreasing the batch size to 4, like this, in the Python code that creates the batch job:
from google.cloud import aiplatform  # import assumed; not shown in the original snippet

def run_batch_prediction_job(vertex_config):
    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )
    model = aiplatform.Model(vertex_config.model_resource_name)
    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        starting_replica_count=replica_count,  # replica_count assumed to be defined elsewhere
        max_replica_count=replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params,
    )
    batch_prediction_job = model.batch_predict(**batch_params)
    batch_prediction_job.wait()
    return batch_prediction_job
I've also tried increasing the machine type to n1-highcpu-16, and that helped somewhat, but I'm not sure I understand how batches are sent to replicas.
Is there another way to decrease the number of instances sent to the model?
Or is there a way to increase the timeout?
Is there log output I can use to help figure this out?
Thanks
Answering your follow-up question above:
Is that timeout for a single instance request or a batch request? Also, is it in seconds?
This is a timeout for the batch job creation request.
The timeout is in seconds. According to create_batch_prediction_job(), timeout refers to the RPC timeout. If we trace the code, we eventually end up in the generated GAPIC client, where timeout is properly described:
timeout (float): The amount of time in seconds to wait for the RPC
    to complete. Note that if ``retry`` is used, this timeout
    applies to each individual attempt and the overall time it
    takes for this method to complete may be longer. If
    unspecified, the default timeout in the client
    configuration is used. If ``None``, then the RPC method will
    not time out.
What I could suggest is to stick with whatever is working for your prediction model. If adding the timeout improves things, you might as well build on it along with your initial solution of using a machine with a higher spec. You can also try a machine with more memory, like the n1-highmem-* family.
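For reference, and only as a rough sketch (the project, region, bucket paths, and formats below are placeholders, not taken from the question), this is roughly where that RPC timeout can be passed if you drop down to the generated aiplatform_v1 client instead of Model.batch_predict():

# A minimal sketch, assuming the aiplatform_v1 generated (GAPIC) client.
from google.cloud import aiplatform_v1

client = aiplatform_v1.JobServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)

job = aiplatform_v1.BatchPredictionJob(
    display_name="my-batch-job",
    model="projects/PROJECT/locations/us-central1/models/MODEL_ID",
    input_config=aiplatform_v1.BatchPredictionJob.InputConfig(
        instances_format="jsonl",
        gcs_source=aiplatform_v1.GcsSource(uris=["gs://my-bucket/input.jsonl"]),
    ),
    output_config=aiplatform_v1.BatchPredictionJob.OutputConfig(
        predictions_format="jsonl",
        gcs_destination=aiplatform_v1.GcsDestination(
            output_uri_prefix="gs://my-bucket/output/"
        ),
    ),
)

response = client.create_batch_prediction_job(
    parent="projects/PROJECT/locations/us-central1",
    batch_prediction_job=job,
    timeout=600.0,  # the RPC timeout discussed above, in seconds
)

Note that this timeout only covers the job-creation RPC itself, as described in the docstring above, not the time each replica spends serving predictions.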

Calling ftplib retrbinary() inside timeit using lambda freezes program

I'm trying to gather some metrics regarding a network issue in a small LAN with SSH disconnecting occasionally and ping exhibiting huge delays (up to a minute instead of less than a second!).
Using timeit (I've read that it's a good way to measure elapsed execution time when calling some code snippet), I try to download some data from an FTP server that also runs locally, measure the time, and store it in a log file.
from ftplib import FTP
from timeit import timeit
from datetime import datetime

ftp = FTP(host='10.0.0.8')
ftp.login(user='****', passwd='****')
ftp.cwd('updates/')
ftp.retrlines('LIST')

# Get timestamp when the download starts
curr_t = datetime.now()

print('Download large file and measure the time')
file_big_t = timeit(lambda f=ftp: f.retrbinary('RETR update_big', open('/tmp/update_big', 'wb').write))

print('Download medium file and measure the time')
file_medium_t = timeit(lambda f=ftp: f.retrbinary('RETR file_medium', open('/tmp/file_medium', 'wb').write))

print('Download small file and measure the time')
file_small_t = timeit(lambda f=ftp: f.retrbinary('RETR update_small', open('/tmp/update_small', 'wb').write))

# Write timestamp and measured timings to file
# ...
If I call retrbinary(...) without timeit, it works without any issues. However, the code above results in the script freezing right after the first timeit call.
In case someone else wants to do what I've described in my question: I eventually found a solution. For some reason, passing the lambda directly to timeit results in the behaviour I've already mentioned. However, if the same lambda is first passed to an instance of timeit.Timer and then that instance's timeit() function is called, it works.
For the example above (let's take just file_big_t) I did
from timeit import Timer

file_big_timer = Timer(lambda f=ftp: f.retrbinary('RETR update_big', open('/tmp/update_big', 'wb').write))
file_big_t = file_big_timer.timeit()
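Worth noting (not part of the original answer): both timeit() and Timer.timeit() default to number=1000000 executions of the statement, which on its own would make a file download appear to hang. A minimal sketch of timing a single download by passing number=1:

from timeit import timeit

# Time one download only; number defaults to 1,000,000 otherwise.
file_big_t = timeit(
    lambda f=ftp: f.retrbinary('RETR update_big', open('/tmp/update_big', 'wb').write),
    number=1,
)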

Stopping a While loop when it ends a cycle in Python

This may be a strange request. I have an infinite while loop; each iteration lasts ~7 minutes, then the program sleeps for a couple of minutes to let the computer cool down, and then starts over.
This is how it looks:
import time as t
t_cooling = 120
while True:
try:
#7 minutes of uninterrupted calculations here
t.sleep(t_cooling)
except KeyboardInterrupt:
break
Right now if I want to interrupt the process, I have to wait until the program sleeps for 2 minutes, otherwise all the calculations done in the running cycle are wasted. Moreover, the calculations involve writing to files and working with multiprocessing, so interrupting during the calculation phase is not only a waste, but can potentially corrupt the output files.
I'd like to know if there is a way to signal to the program that the current cycle is the last one it has to execute, so that there is no risk of interrupting at the wrong moment. To add one more limitation, it has to be a solution that works via command line. It's not possible to add a window with a stop button on the computer the program is running on. The machine has a basic Linux installation, with no graphical environment. The computer is not particularly powerful or new and I need to use the most CPU and RAM possible.
Hope everything is clear enough.
Not so elegant, but it works
#!/usr/bin/env python
import signal
import time as t

stop = False

def signal_handler(signal, frame):
    print('You pressed Ctrl+C!')
    global stop
    stop = True

signal.signal(signal.SIGINT, signal_handler)
print('Press Ctrl+C')

t_cooling = 1
while not stop:
    t.sleep(t_cooling)
    print('Looping')
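A possible addition, not in the original answer: the same handler can also be registered for SIGTERM, so a plain kill <pid> from another shell triggers the same clean stop without needing the process's terminal:

# Also catch the default `kill` signal so the loop can be stopped remotely.
signal.signal(signal.SIGTERM, signal_handler)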
You can use a separate Thread and an Event to signal the exit request to the main thread:
import time
import threading

evt = threading.Event()

def input_thread():
    while True:
        if input("") == "quit":
            evt.set()
            print("Exit requested")
            break

threading.Thread(target=input_thread).start()

t_cooling = 5
while True:
    #7 minutes of uninterrupted calculations here
    print("starting calculation")
    time.sleep(5)
    if evt.is_set():
        print("exiting")
        break
    print("cooldown...")
    time.sleep(t_cooling)
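One small refinement, not part of the original answer: if the input thread is started as a daemon, it won't keep the process alive after the main loop breaks (otherwise the program lingers until something is typed):

# Daemon thread: the interpreter may exit even while this thread still waits on input().
threading.Thread(target=input_thread, daemon=True).start()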
Just for completeness, I post here my solution. It's very raw, but it works.
import time as t

t_cooling = 120
while True:
    #7 minutes of uninterrupted calculations here
    with open('stop', 'r') as f:
        stop = f.readline().strip()
    if stop == '0':
        t.sleep(t_cooling)
    else:
        break
I just have to create a file named stop and write a 0 in it. When that 0 is changed to something else, the program stops at the end of the cycle.
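A variation on the same idea (my own sketch, not from the original answer): poll only for the existence of a sentinel file, so nothing needs to exist up front and the loop can be stopped by creating a file named stop (for example with "touch stop") from the shell:

import os
import time as t

t_cooling = 120
while True:
    #7 minutes of uninterrupted calculations here
    if os.path.exists('stop'):  # create this file to end the loop after the current cycle
        break
    t.sleep(t_cooling)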

Delayed Job Overwhelming DB

I have a method which updates all DNS records for an account with 1 delayed job for each record. There's a lot of workers and queues which is great for getting other jobs done quickly, but this particular job completes quickly and overwhelms the database. Because each job requires DNS to resolve, it's difficult to move this to a process which collects the information then writes once. So I'm instead looking for a way to stagger delayed jobs.
As far as I know, just using sleep(0.1) in the after method should do the trick. I wanted to see if anyone else has specifically dealt with this situation and solved it.
I've created a custom job to test out a few different ideas. Here's some example code:
def update_dns
  Account.active.find_each do |account|
    account.domains.where('processed IS NULL').find_each do |domain|
      begin
        Delayed::Job.enqueue StaggerJob.new(domain.id)
      rescue Exception => e
        self.domain_logger.error "Unable to update DNS for #{domain.name} (id=#{domain.id})"
        self.domain_logger.error e.message
        self.domain_logger.error e.backtrace
      end
    end
  end
end
When a cron job calls Domain.update_dns, the delayed job table floods with tens of thousands of jobs, and the workers start working through them. There are so many workers and queues that even setting the lowest priority overwhelms the database, and other requests suffer.
Here's the StaggerJob class:
class StaggerJob < Struct.new(:domain_id)
  def perform
    domain.fetch_dns_job
  end

  def enqueue(job)
    job.account_id = domain.account_id
    job.owner = domain
    job.priority = 10 # lowest
    job.save
  end

  def after(job)
    # Sleep to avoid overwhelming the DB
    sleep(0.1)
  end

  private

  def domain
    @domain ||= Domain.find(self.domain_id)
  end
end
This may entirely do the trick, but I wanted to verify if this technique was sensible.
It turned out the priority for these jobs was set to 0 (highest). Setting it to 10 (lowest) helped. Sleeping in the job's after method would work, but there's a better way:
Delayed::Job.enqueue StaggerJob.new(domain.id, :fetch_dns!), run_at: (Time.now + (0.2*counter).seconds) # stagger by 0.2 seconds
This ends up pausing outside the job instead of inside. Better!

Limiting the lifetime of a file in Python

Hello there,
I'm looking for a way to limit a lifetime of a file in Python, that is, to make a file which will be auto deleted after 5 minutes after creation.
Problem:
I have a Django based webpage which has a service that generates plots (from user submitted input data) which are showed on the web page as .png images. The images get stored on the disk upon creation.
Image files are created per session basis, and should only be available a limited time after the user has seen them, and should be deleted 5 minutes after they have been created.
Possible solutions:
I've looked at Python's tempfile, but that is not what I need, because the user should be able to return to the page containing the image without waiting for it to be generated again. In other words, the file shouldn't be destroyed as soon as it is closed.
The other way that comes to mind is to call some sort of external bash script which would delete files older than 5 minutes.
Does anybody know a preferred way of doing this?
Ideas can also include changing the logic of showing/generating the image files.
You should write a Django custom management command to delete old files that you can then call from cron.
If you want no files older than 5 minutes, then you need to call it every 5 minutes, of course. And yes, it would run unnecessarily when there are no users, but that shouldn't worry you too much.
OK, that might be a good approach, I guess...
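A minimal sketch of such a management command (the module path, PLOT_DIR, and command name are assumptions, not from the original answer):

# yourapp/management/commands/delete_old_plots.py  (path is an assumption)
import os
import time

from django.core.management.base import BaseCommand

PLOT_DIR = '/path/to/plot/images'  # placeholder: directory where the .png plots are stored
MAX_AGE_SECONDS = 5 * 60

class Command(BaseCommand):
    help = 'Delete generated plot images older than five minutes'

    def handle(self, *args, **options):
        now = time.time()
        for name in os.listdir(PLOT_DIR):
            path = os.path.join(PLOT_DIR, name)
            if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
                os.remove(path)

Scheduling it from cron every five minutes (for example, */5 * * * * python /path/to/manage.py delete_old_plots) would then keep the directory clean.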
You can write a script that checks your directory, deletes outdated files, and picks the oldest of the remaining files. Calculate how much time has passed since that file was created, and from that the remaining time until it should be deleted. Then call the sleep function with that remaining time. When the sleep ends and another loop begins, there will be (at least) one file to delete. If there are no files in the directory, set the sleep time to 5 minutes.
That way each file will be deleted roughly 5 minutes after creation, but when lots of files are created simultaneously the sleep time shrinks and the function starts checking the directory more and more often. To avoid that, add some extra latency to the sleep before starting another loop: for example, if the oldest file is 4 minutes old, sleep for 60+30 seconds (adding 30 seconds to all time calculations).
An example:
from datetime import datetime
import time
import os

WATCH_DIR = '/path/to/your/directory'

def clearDirectory():
    while True:
        _time_list = []
        _now = time.mktime(datetime.now().timetuple())
        for _f in os.listdir(WATCH_DIR):
            _path = os.path.join(WATCH_DIR, _f)  # listdir gives bare names, so build the full path
            if os.path.isfile(_path):
                _f_time = os.path.getmtime(_path)  # get file creation/modification time
                if _now - _f_time > 300:
                    os.remove(_path)  # delete outdated file
                else:
                    _time_list.append(_f_time)  # remember the remaining files' times
        # after checking all files, find the oldest remaining file and sleep until it is
        # due for deletion; if the directory is empty, sleep 300 seconds
        _sleep_time = (300 - (_now - min(_time_list))) if _time_list else 300
        time.sleep(_sleep_time)
But as I said, if files are created frequently, it is better to add some latency to the sleep time:
time.sleep(_sleep_time + 30) # sleep 30 seconds more so some other files might be outdated during that time too...
Also, it is worth reading up on the os.path.getmtime function for details.
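Since clearDirectory() as written blocks forever, one way to run it (an assumption about your setup, not part of the original answer) is in a background daemon thread started once at application startup:

import threading

# Run the cleanup loop in the background so it doesn't block the web process.
threading.Thread(target=clearDirectory, daemon=True).start()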