Can Django run in a threaded model?

I was looking through the code base, particularly the database connectivity parts, and came across this issue.
First, one gets a cursor to the database using the following statements:
from django.db import connection, transaction
cursor = connection.cursor()
connection is a module attribute, so in a threaded model, all threads would share that variable. Sounds a bit strange. Diving in further, the cursor() method belongs to django.db.backends.BaseDatabaseWrapper and looks like this:
def cursor(self):
    self.validate_thread_sharing()
    if (self.use_debug_cursor or
            (self.use_debug_cursor is None and settings.DEBUG)):
        cursor = self.make_debug_cursor(self._cursor())
    else:
        cursor = util.CursorWrapper(self._cursor(), self)
    return cursor
The key is the call to _cursor(), which executes the backend code for whatever database is being used. In the case of MySQL, it executes the _cursor() method on django.db.backends.mysql.DatabaseWrapper, which looks like:
def _cursor(self):
    new_connection = False
    if not self._valid_connection():
        new_connection = True
        kwargs = {
            'conv': django_conversions,
            'charset': 'utf8',
            'use_unicode': True,
        }
        settings_dict = self.settings_dict
        if settings_dict['USER']:
            kwargs['user'] = settings_dict['USER']
        if settings_dict['NAME']:
            kwargs['db'] = settings_dict['NAME']
        if settings_dict['PASSWORD']:
            kwargs['passwd'] = settings_dict['PASSWORD']
        if settings_dict['HOST'].startswith('/'):
            kwargs['unix_socket'] = settings_dict['HOST']
        elif settings_dict['HOST']:
            kwargs['host'] = settings_dict['HOST']
        if settings_dict['PORT']:
            kwargs['port'] = int(settings_dict['PORT'])
        # We need the number of potentially affected rows after an
        # "UPDATE", not the number of changed rows.
        kwargs['client_flag'] = CLIENT.FOUND_ROWS
        kwargs.update(settings_dict['OPTIONS'])
        self.connection = Database.connect(**kwargs)
        self.connection.encoders[SafeUnicode] = self.connection.encoders[unicode]
        self.connection.encoders[SafeString] = self.connection.encoders[str]
        connection_created.send(sender=self.__class__, connection=self)
    cursor = self.connection.cursor()
    if new_connection:
        # SQL_AUTO_IS_NULL in MySQL controls whether an AUTO_INCREMENT column
        # on a recently-inserted row will return when the field is tested for
        # NULL. Disabling this value brings this aspect of MySQL in line with
        # SQL standards.
        cursor.execute('SET SQL_AUTO_IS_NULL = 0')
    return CursorWrapper(cursor)
So a new connection is not necessarily created: if _cursor() had already been called, the previously opened connection is reused and a new cursor is handed out on it.
In a threaded model, that means multiple threads could be sharing the same database connection, which seems like a no-no.
There are also other signs that indicate that threading is not allowed in Django. For example, this module-level code from django/db/__init__.py:
def close_connection(**kwargs):
    for conn in connections.all():
        conn.close()
signals.request_finished.connect(close_connection)
So whenever any request finishes, all database connections are closed. What if there are concurrent requests?
A lot of state seems to be shared, which suggests that threading is not supported; I didn't see any synchronization code anywhere.
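For what it's worth, one way to probe this empirically is to force a query from several threads and compare the underlying DB-API connection objects; a minimal sketch, assuming a configured Django project (the settings and models don't matter here):
import threading

from django.db import connection

def show_connection():
    # Force a real database connection, then report which raw DB-API
    # connection object this thread ended up with.
    cursor = connection.cursor()
    cursor.execute("SELECT 1")
    print(threading.current_thread().name, id(connection.connection))

threads = [threading.Thread(target=show_connection) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()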
Thanks!

Related

Django when looping over a queryset, when does the db read happen?

I am looping through my database and updating all my Company objects.
for company in Company.objects.filter(updated=False):
    driver.get(company.company_url)
    company.adress = driver.find_element_by_id("address").text
    company.visited = True
    company.save()
My problem is that it's taking too long, so I wanted to run another instance of this same code, but I'm curious when the actual db reads happen. If company.visited gets changed to True while this loop is running, will it still be visited by this loop? What if I added a second check for visited? I don't want to start a second loop if the first instance isn't going to recognize the work of the second instance:
for company in Company.objects.filter(updated=False):
    if company.visited:
        continue
    driver.get(company.company_url)
    company.adress = driver.find_element_by_id("address").text
    company.visited = True
    company.save()
Company.objects.filter(updated=False) translates to an ordinary SQL query:
SELECT * FROM appName_company WHERE updated is false
This SQL query is executed when you start iterating through Company objects. It's executed only once. The second server will not recognize the work of the first one, because they both will go through the same Company objects.
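To make the timing concrete, here is a small sketch (assuming DEBUG=True so that connection.queries is populated) showing that the SELECT is lazy and runs only once per queryset:
from django.db import connection, reset_queries

qs = Company.objects.filter(updated=False)  # lazy: no SQL has run yet
reset_queries()
print(len(connection.queries))              # 0 - nothing executed so far
companies = list(qs)                        # the SELECT runs here, exactly once
print(len(connection.queries))              # 1
list(qs)                                    # re-iterating uses the queryset's cache
print(len(connection.queries))              # still 1 - no second read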
Lock rows to avoid race conditions using atomic transactions and select_for_update():
from django.db import transaction

for company in Company.objects.filter(updated=False):
    with transaction.atomic():
        # Re-fetch the row with a lock so concurrent workers wait here.
        company = Company.objects.select_for_update().get(id=company.id)
        if company.visited:
            continue
        driver.get(company.company_url)
        company.adress = driver.find_element_by_id("address").text
        company.visited = True
        company.save()
You can run this code on multiple servers. Each Company will be processed just once.
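If workers should not block on each other's locks, a possible variation (assuming Django 1.11+ and a backend that supports SKIP LOCKED, such as PostgreSQL or MySQL 8+; not SQLite) is select_for_update(skip_locked=True):
from django.db import transaction

def claim_and_process_one():
    with transaction.atomic():
        company = (Company.objects
                   .filter(updated=False, visited=False)
                   .select_for_update(skip_locked=True)  # skip rows locked by other workers
                   .first())
        if company is None:
            return False  # nothing left to claim
        driver.get(company.company_url)
        company.adress = driver.find_element_by_id("address").text
        company.visited = True
        company.save()
        return True

# Each worker can simply loop until nothing is left:
# while claim_and_process_one():
#     pass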
If you need to execute this code regularly, I highly recommend using Celery. Dispatch a task for each company, and let multiple workers do the work in parallel:
from celery import shared_task
from django.db import transaction

@shared_task
def dispatch_tasks():
    for company in Company.objects.filter(updated=False):
        process_company.delay(company.id)

@shared_task
@transaction.atomic
def process_company(company_id):
    company = Company.objects.select_for_update().get(id=company_id)
    if company.visited:
        return
    driver.get(company.company_url)
    company.adress = driver.find_element_by_id("address").text
    company.visited = True
    company.save()
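As a usage sketch (the module path and project name here are assumptions), the dispatcher is enqueued like any other task and picked up by running workers:
# e.g. from a management command, a view, or a celery beat schedule:
from myapp.tasks import dispatch_tasks  # hypothetical module holding the tasks above

dispatch_tasks.delay()  # workers then receive one process_company task per company

# Workers are started separately, e.g.:
#   celery -A myproject worker --concurrency=4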
Edit: oh, I see that you've tagged the question with the sqlite tag. I recommend switching to PostgreSQL, as SQLite is really bad at concurrency. My answer should work with SQLite, but locks may slow down the database.

How to start a new request after the item_scraped scrapy signal is called?

I need to scrape the data of each item from a website using Scrapy (http://example.com/itemview). I have a list of itemIDs and I need to pass each one through a form on example.com.
There is no URL change for each item, so for each request in my spider the URL will always be the same, but the content will be different.
I don't want a for loop for handling each request, so I followed the steps below:
started spider with the above url
added item_scraped and spider_closed signals
passed through several functions
passed the scraped data to pipeline
triggered the item_scraped signal
After this it automatically calls the spider_closed signal, but I want the above steps to continue until all the itemIDs are finished.
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    itemIDs = [11111, 22222, 33333]
    current_item_num = 0

    def __init__(self, itemids=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.item_scraped, signals.item_scraped)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def start_requests(self):
        request = self.make_requests_from_url('http://example.com/itemview')
        yield request

    def parse(self, response):
        self.driver = webdriver.PhantomJS()
        self.driver.get(response.url)
        first_data = self.driver.find_element_by_xpath('//div[@id="itemview"]').text.strip()
        yield Request(response.url, meta={'first_data': first_data}, callback=self.processDetails, dont_filter=True)

    def processDetails(self, response):
        itemID = self.itemIDs[self.current_item_num]
        # ...form submission with the current itemID goes here...
        # ...the content of the page is updated with the given itemID...
        yield Request(response.url, meta={'first_data': response.meta['first_data']}, callback=self.processData, dont_filter=True)

    def processData(self, response):
        # ...some more scraping goes here...
        item = ExamplecrawlerItem()
        item['first_data'] = response.meta['first_data']
        yield item

    def item_scraped(self, item, response, spider):
        self.current_item_num += 1
        # I need to call the processDetails function here for the next itemID
        # and the process needs to continue until the itemIDs finish
        self.parse(response)
My pipeline:
class ExampleDBPipeline(object):
    def process_item(self, item, spider):
        MYCOLLECTION.insert(dict(item))
        return
I wish I had an elegant solution to this, but all I have is a hackish way of calling into the underlying classes:
self.crawler.engine.slot.scheduler.enqueue_request(scrapy.Request(url, self.yourCallBack))
However, you can yield a request after you yield the item and have it callback to self.processDetails. Simply add this to your processData function:
yield item
self.counter += 1
yield scrapy.Request(response.url, callback=self.processDetails, dont_filter=True, meta={"your": "Dictionary"})
Also, PhantomJS can be nice and make your life easy, but it is slower than regular connections. If possible, find the request for the JSON data, or whatever it is that makes the page unparseable without JS. To do so, open Chrome, right click, click Inspect, go to the Network tab, then enter the ID into the form, and look at the XHR or JS tabs for a JSON response that has the data or the next URL you want. Most of the time there will be some URL built by adding the ID; if you can find it, you can just concatenate your URLs and call that directly without the cost of JS rendering. Sometimes it is randomized, or not there, but I've had fair success with it. You can then also use it to yield many requests at the same time without having to worry about PhantomJS trying to do two things at once or having to initialize many instances of it. You could use tabs, but that is a pain.
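For illustration, if such an endpoint exists the PhantomJS step can be skipped entirely; the API URL and JSON field name below are purely hypothetical:
import json
import scrapy

class ExampleApiSpider(scrapy.Spider):
    name = "example_api"
    itemIDs = [11111, 22222, 33333]

    def start_requests(self):
        for item_id in self.itemIDs:
            # Hypothetical endpoint discovered via the browser's Network tab.
            url = "http://example.com/api/itemview?id={}".format(item_id)
            yield scrapy.Request(url, callback=self.parse_api, meta={"item_id": item_id})

    def parse_api(self, response):
        data = json.loads(response.text)
        item = ExamplecrawlerItem()              # the item class from the question
        item["first_data"] = data.get("first_data")  # field name is an assumption
        yield item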
Also, I would use a Queue of your IDs to ensure thread safety. Otherwise, you could have processDetails called twice on the same ID; and although in the logic of your program everything seems to go linearly, that means you aren't using the concurrency capabilities of Scrapy and your program will run more slowly. To use a Queue, add:
import Queue

# go inside the class definition and add
itemIDQueue = Queue.Queue()

# within __init__ add
[self.itemIDQueue.put(ID) for ID in self.itemIDs]

# within processDetails replace itemID = self.itemIDs[self.current_item_num] with
itemID = self.itemIDQueue.get()
And then there is no need to increment the counter and your program is thread safe.
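If you also want the spider to stop cleanly once the queue is drained, a non-blocking get works (a small sketch building on the snippet above):
# within processDetails, instead of a blocking get():
try:
    itemID = self.itemIDQueue.get(block=False)
except Queue.Empty:
    return  # all IDs handled; let the spider finish naturally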

Scrapy spider closes prematurely after yield request inside start_requests function

I have no idea at all how to do this. Basically I want to run a database insert before running the scraping program. I have tried to do this by placing a yield of a request at the top of the start_requests() function, like so, except what happens is that the pipeline executes and the scrape terminates instead of actually going on to the next line in start_requests.
def start_requests(self):
    db = MyDatabase()
    link = "http://alexanderwhite.se/"
    item = PopulateListingAvItem()
    self.start_urls.append(link)
    yield Request(link, callback=self.listing_av_populate, meta={'item': item}, priority=300)
    # program terminates successfully completing the above request,
    # but I need it to continue to the next line
    query1 = "SELECT listing_id FROM listing_availability WHERE availability=1"
    listing_ids = db.query(query1)
    for lid in listing_ids:
        query2 = "SELECT url from listings where listing_id="+str(lid['listing_id'])
        self.start_urls.append( db.query(query2)[0]['url'] )
    for url in self.start_urls:
        yield Request(url, self.parse, priority=1)
The solution is not obvious and could be helpful to others. In this case, when you need to perform an insert query before doing any scraping, you can use yield item, followed by the rest of your start_url generator code, inside the first callback function, as suggested by ice13berg. The whole key is to use "yield item".
I don't quite understand why it works, but it does, correctly executing the insert with the item and using the result of the insert for the next SELECT queries. One plausible explanation is that Scrapy pulls more requests from the start_requests() generator before the first response (and its pipeline insert) has been processed, so the SELECT queries there find nothing; in the callback, the yield item sends the item through the (synchronous) pipeline before the generator continues on to the SELECT queries.
def listing_av_populate(self, response):
    db = MyDatabase()
    item = response.meta['item']
    item['update_bid_av'] = 3
    yield item
    query1 = "SELECT listing_id FROM listing_availability WHERE availability=1"
    listing_ids = db.query(query1)
    for lid in listing_ids:
        query2 = "SELECT url from listings where listing_id="+str(lid['listing_id'])
        self.start_urls.append( db.query(query2)[0]['url'] )
    for url in self.start_urls:
        yield Request(url, self.parse, priority=1)

def start_requests(self):
    link = "http://alexanderwhite.se/"
    item = PopulateListingAvItem()
    yield Request(link, callback=self.listing_av_populate, meta={'item': item}, priority=300)

django subprocess p.wait() doesn't return

With a Django button, I need to launch multiple music tracks (selected at random).
In my models.py, I have two functions, playmusic and playmusicrandom:
def playmusic(self, music):
    if self.isStarted():
        self.stop()
    command = "sudo /usr/bin/mplayer " + music.path
    p = subprocess.Popen(command, shell=True)
    p.wait()
def playmusicrandom(request):
    conn = sqlite3.connect(settings.DATABASES['default']['NAME'])
    cur = conn.cursor()
    cur.execute("SELECT id FROM webgui_music")
    list_id = [row[0] for row in cur.fetchall()]
    # Get three IDs randomly from the list
    selected_ids = random.sample(list_id, 3)
    for i in selected_ids:
        music = Music.objects.get(id=i)
        player.playmusic(music)
With this code, three tracks are played (one after the other), but the web page just shows "Loading..." during execution.
Is there a way to display the refreshed web page to the user during the loop?
Thanks.
Your view is blocked from returning anything to the web server while it is waiting for playmusicrandom() to finish.
You need to arrange for playmusicrandom() to do its task after you've returned the HTTP response from the view.
This means that you likely need a thread (or a similar solution).
Your view will have something like this:
import threading

t = threading.Thread(target=player_model.playmusicrandom,
                     args=(request,))
t.setDaemon(True)
t.start()
return HttpResponse()
This code snippet came from here, where you will find more detailed information about the issues you face and possible solutions.
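For completeness, a minimal view built around that snippet might look like this (a sketch; the import location of playmusicrandom is an assumption):
import threading

from django.http import HttpResponse

from webgui.models import playmusicrandom  # hypothetical path, as in the question's models.py

def play_random(request):
    t = threading.Thread(target=playmusicrandom, args=(request,))
    t.setDaemon(True)  # don't keep the process alive just for the player thread
    t.start()
    return HttpResponse("Playing three random tracks...")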

How do I deal with this race condition in django?

This code is supposed to get or create an object and update it if necessary. The code is in production use on a website.
In some cases - when the database is busy - it will throw the exception "DoesNotExist: MyObj matching query does not exist".
# Model:
class MyObj(models.Model):
    thing = models.ForeignKey(Thing)
    owner = models.ForeignKey(User)
    state = models.BooleanField()

    class Meta:
        unique_together = (('thing', 'owner'),)
# Update or create myobj
@transaction.commit_on_success
def create_or_update_myobj(owner, thing, state):
    try:
        myobj, created = MyObj.objects.get_or_create(owner=owner, thing=thing)
    except IntegrityError:
        # Will sometimes throw "DoesNotExist: MyObj matching query does not exist"
        myobj = MyObj.objects.get(owner=owner, thing=thing)
    myobj.state = state
    myobj.save()
I use an innodb mysql database on ubuntu.
How do I safely deal with this problem?
This could be an off-shoot of the same problem as here:
Why doesn't this loop display an updated object count every five seconds?
Basically get_or_create can fail - if you take a look at its source, there you'll see that it's: get, if-problem: save+some_trickery, if-still-problem: get again, if-still-problem: surrender and raise.
This means that if there are two simultaneous threads (or processes) running create_or_update_myobj, both trying to get_or_create the same object, then:
first thread tries to get it - but it doesn't yet exist,
so, the thread tries to create it, but before the object is created...
...second thread tries to get it - and this obviously fails
now, because of the default AUTOCOMMIT=OFF for the MySQLdb database connection, and the REPEATABLE READ isolation level, both threads have frozen their views of the MyObj table.
subsequently, first thread creates its object and returns it gracefully, but...
...second thread cannot create anything as it would violate unique constraint
what's funny, subsequent get on the second thread doesn't see the object created in the first thread, due to the frozen view of MyObj table
So, if you want to safely get_or_create anything, try something like this:
@transaction.commit_on_success
def my_get_or_create(...):
    try:
        obj = MyObj.objects.create(...)
    except IntegrityError:
        transaction.commit()
        obj = MyObj.objects.get(...)
    return obj
Edited on 27/05/2010
There is also a second solution to the problem - using the READ COMMITTED isolation level instead of REPEATABLE READ. But it's less tested (at least in MySQL), so there might be more bugs/problems with it - but at least it allows tying views to transactions without committing in the middle.
Edited on 22/01/2012
Here are some good blog posts (not mine) about MySQL and Django, related to this question:
http://www.no-ack.org/2010/07/mysql-transactions-and-django.html
http://www.no-ack.org/2011/05/broken-transaction-management-in-mysql.html
Your exception handling is masking the error. You should pass a value for state in get_or_create(), or set a default in the model and database.
One (dumb) way might be to catch the error and simply retry once or twice after waiting a small amount of time. I'm not a DB expert, so there might be a signaling solution.
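A sketch of that retry idea (the delay and attempt count are arbitrary; note that, as explained above, under REPEATABLE READ a retry inside the same open transaction may still not see the other thread's row):
import time

from django.db import IntegrityError

def create_or_update_myobj_with_retry(owner, thing, state, attempts=3):
    for attempt in range(attempts):
        try:
            # The question saw both IntegrityError and DoesNotExist surface here.
            myobj, created = MyObj.objects.get_or_create(owner=owner, thing=thing)
            break
        except (IntegrityError, MyObj.DoesNotExist):
            if attempt == attempts - 1:
                raise
            time.sleep(0.1)  # brief pause before retrying
    myobj.state = state
    myobj.save()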
Since 2012, Django has had select_for_update, which locks rows until the end of the transaction.
To avoid race conditions in Django + MySQL under the default settings:
REPEATABLE READ in MySQL
READ COMMITTED in Django
you can use this:
with transaction.atomic():
    instance = YourModel.objects.select_for_update().get(id=42)
    instance.evolve()
    instance.save()
The second thread will wait for the first thread (the lock), and only when the first is done will the second read the data saved by the first, so it will work on updated data.
Then together with get_or_create:
def select_for_update_or_create(...):
    instance = YourModel.objects.filter(
        ...
    ).select_for_update().first()
    if instance is None:
        instance = YourModel.objects.create(...)
    return instance
The function must be called inside a transaction block, otherwise you will get this from Django:
TransactionManagementError: select_for_update cannot be used outside of a transaction
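For example, a call site could look like this (the arguments are illustrative, mirroring the model fields from the question):
from django.db import transaction

def update_state(owner, thing, state):
    with transaction.atomic():
        myobj = select_for_update_or_create(owner=owner, thing=thing)
        myobj.state = state
        myobj.save()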
Also, sometimes it's good to use refresh_from_db().
In a case like this:
instance = YourModel.objects.create(**kwargs)
response = do_request_which_lasts_few_seconds(instance)
instance.attr = response.something
you'd like to see:
instance = YourModel.objects.create(**kwargs)
response = do_request_which_lasts_few_seconds(instance)
instance.refresh_from_db() # 3
instance.attr = response.something
and that # 3 will greatly reduce the time window for possible race conditions, and thus the chance of one happening.