How to refresh training in chatterbot bot in python? - python-2.7

I made a simple chatbot using the chatterbot library and Python. The way I trained it is: I made it read a few text files containing chat examples, and it learns how to reply to messages based on those training examples. The problem I am facing is that even if I erase the contents of the training text files and run the application, the chatbot still behaves in the same manner as before, i.e. its memory doesn't get refreshed. I tried starting a new file, pasting the same code into it, and changing the name of the program, but that still doesn't help. How do I solve this problem? Here is the code for reference:
from chatterbot.trainers import ListTrainer
from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer
import os

bot = ChatBot('trialBot')
bot.set_trainer(ListTrainer)

# directory containing training text files
mainDir = 'C:\\Users\\xyz\\Desktop\\trainfiles\\'

for _file in os.listdir(mainDir):
    chats = open(mainDir + _file, 'r').readlines()
    bot.train(chats)

while True:
    request = raw_input('You: ')
    response = bot.get_response(request)
    print('Bot: ' + str(response))

It sounds like you might want to use an in-memory database so that the content is only persisted as long as the chat bot is running.
bot = ChatBot(
    'trialBot',
    database_uri=None
)
Setting database_uri to None will cause the chat bot to use a SQLite database that is stored in memory to hold the knowledge that it is trained with. As a result, you will have a fresh database to work with each time you run your program.
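Following that answer, a rough sketch of the training script from the question with the in-memory database (same trainer call and file layout as in the question, kept here purely for illustration):

from chatterbot.trainers import ListTrainer
from chatterbot import ChatBot
import os

# With database_uri=None the knowledge base lives in an in-memory SQLite
# database, so the bot starts fresh on every run of the program.
bot = ChatBot('trialBot', database_uri=None)
bot.set_trainer(ListTrainer)

mainDir = 'C:\\Users\\xyz\\Desktop\\trainfiles\\'
for _file in os.listdir(mainDir):
    bot.train(open(mainDir + _file, 'r').readlines())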

Related

Google App Engine deferred.defer task not getting executed

I have a Google App Engine Standard Environment application that has been working fine for a year or more, that, quite suddenly, refuses to enqueue new deferred tasks using deferred.defer.
Here's the Python 2.7 code that is making the deferred call:
# Find any inventory items that reference the product, and change them too.
# because this could take some time, we'll do it as a deferred task, and only
# if needed.
if upd:
    updater = deferredtasks.InvUpdate()
    deferred.defer(updater.run, product_key)
My app.yaml file has the necessary bits to support deferred.defer:
- url: /_ah/queue/deferred
  script: google.appengine.ext.deferred.deferred.application
  login: admin

builtins:
- deferred: on
And my deferred task has logging in it so I should see it running when it does:
#-------------------------------------------------------------------------------
# DEFERRED routine that updates the inventory items for a particular product. Should be called
# when ANY changes are made to the product, because it should trigger a re-download of the
# inventory record for that product to the iPad.
#-------------------------------------------------------------------------------
class InvUpdate(object):
    def __init__(self):
        self.to_put = []
        self.product_key = None
        self.updcount = 0

    def run(self, product_key, batch_size=100):
        updproduct = product_key.get()
        if not updproduct:
            logging.error("DEFERRED ERROR: Product key passed in does not exist")
            return
        logging.info(u"DEFERRED BEGIN: beginning inventory update for: {}".format(updproduct.name))
        self.product_key = product_key
        self._continue(None, batch_size)
...
When I run this in the development environment on my development box, everything works fine. Once I deploy it to the App Engine server, the inventory updates never get done (i.e. the deferred task is not executed), and there are no errors (and no other logging from the deferred task in fact) in the log files on the server. I know that with the sudden move to get everybody on Python 3 as quickly as possible, the deferred.defer library has been marked as not recommended because it only works with the 2.7 Python environment, and I planned on moving to task queues for this, but I wasn't expecting deferred.defer to suddenly stop working in the existing python environment.
Any insight would be greatly appreciated!
I'm pretty sure you can't pass the method of an instance to the App Engine task queue, because that instance will not exist when your task runs, since the task will be running in a different process. I actually don't understand how your task ever worked when running remotely in the first place (and running locally is not an accurate representation of how things will run remotely).
Try changing your code to this:
if upd:
    deferred.defer(deferredtasks.InvUpdate.run_cls, product_key)
and then InvUpdate is the same but has a new function run_cls:
class InvUpdate(object):
    @classmethod
    def run_cls(cls, product_key):
        cls().run(product_key)
And I'm still in the process of migrating to Cloud Tasks, and my deferred tasks still work.

Right way to delay file download in Django

I have a class-based view which triggers the composition and downloading of a report for a user.
Normally, in the class's def get, I just compile the report, add response['Content-Disposition'] = 'attachment; filename="somefilename.pdf"', and return the response to the user.
The problem is that some reports are large, and the request times out while they are being compiled.
I know that the right way of dealing with this is to delegate it to a background process (like Celery). But the problem is that it means that instead of creating a temporary file which ceases to exist the moment the user downloads a report, I have to store these reports somewhere, and write a cronjob which will regularly clean the reports directory.
Is there any more elegant way in Django to deal with this issue?
One solution less fancy than using Celery is to use Django's StreamingHttpResponse:
https://docs.djangoproject.com/en/2.0/ref/request-response/#django.http.StreamingHttpResponse
With this, you use a generator function, which is a Python function that uses yield to return its results as an iterator. This allows you to return the data as you generate it, rather than all at once after you're finished. You can yield after each line or section of the report, thus keeping a flow of data back to the browser.
But this only works if you are building up the finished file bit by bit, for example a CSV file. If you're returning something that you need to format all at once, for example if you're using something like wkhtmltopdf to generate a PDF file after you're done, then it's not as easy.
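For the simple bit-by-bit case, a minimal sketch of that idea could look like this (the view name, the row data, and the CSV columns are made up for illustration):

from django.http import StreamingHttpResponse

def export_report(request):
    # Illustrative generator: in practice each row would come from your report query.
    def stream():
        yield "name,value\n"  # header row
        for i in range(100000):
            yield "row %d,%d\n" % (i, i * i)

    # The response body is produced lazily, so data starts flowing to the
    # browser immediately instead of after the whole report is built.
    response = StreamingHttpResponse(stream(), content_type="text/csv")
    response['Content-Disposition'] = 'attachment; filename="report.csv"'
    return response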
But there's still a solution:
What you can do in that case is use StreamingHttpResponse along with a generator function to generate your report into a temporary file, instead of back to the browser. But as you are doing this, yield HTML snippets back to the browser which let the user know the progress, e.g.:
def get(self, request, **kwargs):
    # first you need a tempfile name.. do that however you like
    tempfile = "kfjkdsjfksjfks"

    # then you need to create a view which will open that file and serve it
    # but I won't show that here.
    # For security reasons it has to serve only out of one directory
    # that is dedicated to this.
    fetchurl = reverse('reportgetter_url') + '?file=' + tempfile

    def reportgen():
        yield 'Starting report generation..<br>'
        # do some stuff to generate your report into the tempfile
        yield 'Doing this..<br>'
        # do this
        yield 'Doing that..<br>'
        # do that
        yield 'Finished.<br>'
        # when the browser receives this script, it'll go to fetchurl where
        # you will send them the finished report.
        yield '<script>document.location="%s";</script>' % fetchurl

    return http.StreamingHttpResponse(reportgen())
That's not a complete example obviously, but should give you the idea.
When your user fetches this view, they will see the progress of the report as it comes along. At the end, you're sending the javascript which redirects the browser to the other view you will have to write, which returns the response containing the finished file. When the browser gets this javascript, if the view returning the tempfile sets the response Content-Disposition as an attachment before returning it, e.g.:
response['Content-Disposition'] = 'attachment; filename="%s"' % filename
...then the browser will stay on the current page showing your progress, and simply pop up a file-save dialog for the user.
For cleanup, you'll need a cron job regardless, because if people don't wait around, they'll never pick up the report. Sometimes things don't work out... So you could just clean up files older than, say, one hour. For a lot of systems this is acceptable.
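As a rough sketch of that kind of cleanup (the report directory path and the one-hour cutoff are assumptions matching the suggestion above), something like this could be run from cron or a management command:

import os
import time

REPORT_DIR = '/tmp/reports'   # illustrative location of the generated reports
MAX_AGE = 60 * 60             # one hour, as suggested above

now = time.time()
for name in os.listdir(REPORT_DIR):
    path = os.path.join(REPORT_DIR, name)
    # Remove any report file that has been sitting around longer than MAX_AGE.
    if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE:
        os.remove(path)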
But if you want to clean up right away, what you can do, if you are on unix/linux, is to use an old unix filesystem trick: Files which are deleted while they are open are not really gone until they are closed. So, open your tempfile.. then delete it. Then return your response. As soon as the response has finished sending, the space used by the file will be freed.
PS: I should add that if you take this second approach, you could use one view to do both jobs, just:
if 'file' in request.GET:
    # file= was in the url.. they are trying to get an already generated report
    with open(thepathname) as f:
        os.unlink(thepathname)
        # file has been 'deleted' but f is still a valid open file
        response = HttpResponse( etc etc etc )
        response['Content-Disposition'] = 'attachment; filename="thereport"'
        response.write(f.read())
        return response
else:
    # generate the report
    # as above
This is not really a Django question but a general architecture question.
You can always increase your server timeouts, but it would still, IMO, give you a bad user experience if the user has to sit watching the browser just spin.
Doing this in a background task is the only way to do it right. I don't know how large the reports are, but using email can be a good solution. The background task simply generates the report, sends it via email and deletes it.
If the files are too large to send via email, then you will have to store them. Maybe send an email with a link and a message indicating the link will not work after X days/hours. Once you have a background worker, creating a daily or hourly clean up task would be very easy.
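A minimal sketch of that kind of background task, assuming Celery and Django's email utilities (the task name, the generate_report_file helper, and the file handling are all illustrative, not part of the original answer):

import os

from celery import shared_task
from django.core.mail import EmailMessage

@shared_task
def build_and_email_report(user_email, report_params):
    # Hypothetical helper that writes the report to a temporary path and
    # returns that path; replace with your real report generation.
    path = generate_report_file(report_params)
    try:
        message = EmailMessage(
            subject='Your report is ready',
            body='The report you requested is attached.',
            to=[user_email],
        )
        message.attach_file(path)
        message.send()
    finally:
        # Delete the temporary file once it has been emailed (or on failure).
        os.remove(path)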
Hope it helps

PyMongo find query returns empty/partial cursor when running in a Django+uWsgi project

We developed a REST API using Django and MongoDB (the PyMongo driver). The problem is that, on some requests to the API endpoints, the PyMongo cursor returns a partial response which contains fewer documents than it should (but it's a completely valid JSON document).
Let me explain it with an example of one of our views:
def get_data(key):
    return collection.find({'key': key}, limit=24)

def my_view(request):
    key = request.POST.get('key')
    query = get_data(key)
    res = [app for app in query]
    return JsonResponse({'list': res})
We're sure that there are more than 8,000 documents matching the query, but in some calls we get fewer than 24 results (even zero). The first problem we investigated was that we had more than one MongoClient definition in our code. By resolving this, the number of occurrences of the problem decreased, but we still had it in a lot of calls.
After all of these investigations, we designed a test in which we made 16 asynchronous requests to the server at the same time. With this approach, we could reproduce the problem: of these 16 requests, 6-8 returned partial results. After running this test we reduced uWSGI's number of processes to 6 and restarted the server. All results were good, but after applying another heavy load on the server the problem began again. At this point, we restarted the uWSGI service and again everything was OK. With this last experiment we now have a clue that when the uWSGI service starts running, everything works correctly, but after a period of time and heavy load the server begins to return partial or empty results again.
Our latest investigation was to run the API using python manage.py with DEBUG=False, and the problem appeared again after a period of time in this situation as well.
We can't figure out what the problem is or how to solve it. One reason we can think of is that Django closes PyMongo's connections before the query has completed, since the returned result is still valid JSON.
Our stack is:
nginx (with no cache enabled)
uWsgi
MemCached (disabled during debugging procedure)
Django (v1.8 on python 3)
PyMongo (v3.0.3)
Your help is really appreciated.
Update:
Mongo version:
db version v3.0.7
git version: 6ce7cbe8c6b899552dadd907604559806aa2e9bd
We are running a single mongod instance. No sharding/replication.
We create the connection using this snippet:
con = MongoClient('localhost', 27017)
Update 2
Subject thread in Pymongo issue tracker.
PyMongo cursors are not thread-safe. So using them the way I did in a multi-threaded environment will cause what I've described in the question. On the other hand, Python's list operations are mostly thread-safe, and changing the snippet like this will solve the problem:
def get_data(key):
    return list(collection.find({'key': key}, limit=24))

def my_view(request):
    key = request.POST.get('key')
    query = get_data(key)
    res = [app for app in query]
    return JsonResponse({'list': res})
My very speculative guess is that you are reusing a cursor somewhere in your code. Make sure you are initializing your collection within the view stack itself, and not outside of it.
For example, as written, if you are doing something like:
import ...
import con

collection = con.documents

# blah blah code

def my_view(request):
    key = request.POST.get('key')
    query = collection.find({'key': key}, limit=24)
    res = [app for app in query]
    return JsonResponse({'list': res})
You could end up reusing a cursor. Better to do something like:
import ...
import con

# blah blah code

def my_view(request):
    collection = con.documents
    key = request.POST.get('key')
    query = collection.find({'key': key}, limit=24)
    res = [app for app in query]
    return JsonResponse({'list': res})
EDIT at asker's request for clarification:
The reason you need to define the collection within the view stack, and not when the file loads, is that the collection variable holds a cursor, which is basically how the database and your application talk to each other. Cursors do things like keep track of where you are in a long list of data, in addition to a bunch of other stuff, but that's the important part.
When you create the collection cursor outside the view method, it will re-use the cursor on each request if it exists. So, if you make one request, and then another really, really fast right after that (like what happened when you applied high load), the cursor might only be halfway through talking to the database, and so some of your data goes to the first request, and some to the second. The reason you would get NO data in a request would be if a cursor finished fetching data but hadn't been closed yet, so the next request tried to fetch data from the cursor, and there was none left to fetch in the query.
By moving the collection definition (and by association, the cursor definition) into the view stack, you will ALWAYS get a new cursor when you process a new request. You won't get any cross-talk between your cursors and different requests, as each request cycle will have its own.

No headers in HAR response

I'm parsing the website 'http://ok.ru'. To get data from a POST request I need to send a specific token that is generated by JavaScript on the website, and this token is contained in the headers.
So I thought maybe one solution would be to open the website, let it generate the token, grab the headers, and that's it.
One tool that can execute JavaScript is Selenium; however, to get the headers I need to use browsermob-proxy (or an equivalent). That is where I'm stuck.
There are no headers in the response and I can't figure it out. Maybe someone who has worked with browsermob can see what's wrong? I would also be glad to hear other solutions to my task. The code itself is below:
from browsermobproxy import Server
from selenium import webdriver
from ast import literal_eval
import json, os

os.chdir('C:/browsermob-proxy-2.1.0-beta-2/bin')

server = Server()
server.start()
proxy = server.create_proxy()

profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)

proxy.new_har('test')
driver.get('http://ok.ru')
driver.find_element_by_xpath('//input[@name="st.email"]').send_keys('****@****.com')
driver.find_element_by_xpath('//input[@name="st.password"]').send_keys('****')
driver.find_element_by_xpath(u'//input[contains(@value,"Log in")]').click()

result = literal_eval(json.dumps(proxy.har, ensure_ascii=False))
driver.close()

for entry in result['log']['entries']:
    if len(entry['response']['headers']) > 0:
        print entry['response']['headers']
The answer turned out to be easy: just add options to new_har:
proxy.new_har('test', options={'captureHeaders': True})
However, there is no token in headers, which is a new puzzle to me...
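For what it's worth, once captureHeaders is on, both request and response headers appear in the HAR as name/value pairs, so a follow-up check could scan both sides for a token-like header. A hedged sketch (the 'token' substring to search for is just a guess, not something from the thread):

# Look through every captured entry for any header whose name mentions 'token'.
har = proxy.har
for entry in har['log']['entries']:
    url = entry['request']['url']
    for side in ('request', 'response'):
        for header in entry[side]['headers']:
            # HAR headers are dicts with 'name' and 'value' keys.
            if 'token' in header['name'].lower():
                print url, side, header['name'], header['value']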

Getting an 500 internal error when uploading file using cgi and python

Hi, I have a project that involves uploading documents. I am using HTML for the front end and Python for the backend. I've managed to link my HTML and Python files, but I'm having a problem with the server. At first I thought it was a random thing, but I'm pretty sure it's because of what I added to the Python code. I have:
import cgi
import sys
import os

htmlform = cgi.FieldStorage()
file_data = htmlform['myfile']

if not fileitem.file:
    return

(name, ext) = os.path.splitext(fileitem.filename)

#if ext == ".jpg" or ext == ".png" or ext == ".gif":
#ioFlag = "wb"
#else:
#ioFlag = "w"
I was able to log into my page, go to the HTML form, submit the form, and get to a basic success HTML page I had below the above input. Now I'm pretty new to Python and didn't realise that the if statements should be indented, and I got a 500 internal error when I uncommented the if statement. I did it once and then went through commenting out my code, completely confused as to why I was getting the error, but after a while it just started working again. My guess is the incorrect if statement somehow got it stuck. I expect after about an hour it'll be working again, but ideally I'd like to know if I can stop the process on the server if possible. I was following this guide: http://www.alwaysgetbetter.com/blog/2009/01/02/python-file-upload/
Fixed it! The problem seems to be the indentation. If you're ever unsure about this stuff, look at the error logs. I'm using an Apache server and I don't have access to the error logs, so I used
sudo cat /var/log/apache2/error.log
It gave me the answer and this should hopefully help you even if your question is unrelated.
EDIT: And for completeness' sake, file_data should be fileitem.
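Putting the indentation fix and the file_data/fileitem fix together, a minimal corrected sketch of the handler might look like this (the 'myfile' field name comes from the question; the 'uploads' directory and the success/failure messages are assumptions):

#!/usr/bin/env python
import cgi
import os

# CGI scripts must emit headers before any body output.
print "Content-Type: text/html"
print

htmlform = cgi.FieldStorage()
fileitem = htmlform['myfile'] if 'myfile' in htmlform else None

if fileitem is not None and fileitem.file:
    (name, ext) = os.path.splitext(fileitem.filename)
    # Write binary formats in 'wb' mode, everything else in 'w' mode,
    # mirroring the commented-out logic in the question.
    ioFlag = 'wb' if ext in ('.jpg', '.png', '.gif') else 'w'
    outpath = os.path.join('uploads', os.path.basename(fileitem.filename))
    with open(outpath, ioFlag) as out:
        out.write(fileitem.file.read())
    print "<p>Upload succeeded.</p>"
else:
    print "<p>No file was uploaded.</p>"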