I'm working on a Django app that crawls the Illinois General Assembly website to scrape some PDFs. Run on my desktop, it works fine until urllib2 times out. When I deploy it on my Bluehost server, the lxml part of the code throws an error. Any help would be appreciated.
import scraperwiki
from bs4 import BeautifulSoup
import urllib2
import lxml.etree
import re
from django.core.management.base import BaseCommand
from legi.models import Votes
class Command(BaseCommand):
def handle(self, *args, **options):
chmbrs =['http://www.ilga.gov/house/', 'http://www.ilga.gov/senate/']
for chmbr in chmbrs:
site = chmbr
url = urllib2.urlopen(site)
content = url.read()
soup = BeautifulSoup(content)
links = []
linkStats = []
table = soup.find('table', cellpadding=3)
for a in soup.findAll('a',href=True):
if re.findall('Bills', a['href']):
l = (site + a['href']+'&Primary=True')
print x
for link in links:
url = urllib2.urlopen(link)
content = url.read()
soup = BeautifulSoup(content)
table = soup.find('table', cellpadding=3)
for a in table.findAll('a',href=True):
if re.findall('BillStatus', a['href']):
for linkStat in linkStats:
url = urllib2.urlopen(linkStat)
content = url.read()
soup = BeautifulSoup(content)
for a in soup.findAll('a',href=True):
if re.findall('votehistory', a['href']):
vl = 'http://ilga.gov/legislation/'+a['href']
url = urllib2.urlopen(vl)
content = url.read()
soup = BeautifulSoup(content)
for b in soup.findAll('a',href=True):
if re.findall('votehistory', b['href']):
llink = 'http://ilga.gov'+b['href']
u = urllib2.urlopen(llink)
x = scraperwiki.pdftoxml(u.read())
root = lxml.etree.fromstring(x)
pages = list(root)
chamber = str()
for page in pages:
print "working_1"
for el in page:
print "working_2"
if el.tag == 'text':
if int(el.attrib['top']) == 168:
chamber = el.text
if re.findall("Senate Vote", chamber):
if int(el.attrib['top']) >= 203 and int(el.attrib['top']) < 231:
title = el.text
if (re.findall('House', title)):
title = (re.findall('[0-9]+', title))
title = "HB"+title[0]
elif (re.findall('Senate', title)):
title = (re.findall('[0-9]+', title))
title = "SB"+title[0]
if int(el.attrib['top']) >350 and int(el.attrib['top']) <650:
r = el.text
names = re.findall(r'[A-z-\u00F1]{3,}',r)
vs = re.findall(r'[A-Z]{1,2}\s',r)
for name in names:
legi = name
for vote in vs:
v = vote
if Votes.objects.filter(legislation=title).exists() == False:
c = Votes(legislation=title, legislator=legi, vote=v)
print 'saved'
print 'not saved'
elif int(el.attrib['top']) == 189:
chamber = el.text
if re.findall("HOUSE ROLL CALL", chamber):
if int(el.attrib['top']) > 200 and int(el.attrib['top']) <215:
title = el.text
if (re.findall('HOUSE', title)):
title = (re.findall('[0-9]+', title))
title = "HB"+title[0]
elif (re.findall('SENATE', title)):
title = (re.findall('[0-9]+', title))
title = "SB"+title[0]
if int(el.attrib['top']) >385 and int(el.attrib['top']) <1000:
r = el.text
names = re.findall(r'[A-z-\u00F1]{3,}',r)
votes = re.findall(r'[A-Z]{1,2}\s',r)
for name in names:
legi = name
for vote in votes:
v = vote
if Votes.objects.filter(legislation=title).exists() == False:
c = Votes(legislation=title, legislator=legi, vote=v)
print 'saved'
print 'not saved'
Here's the error trace
Traceback (most recent call last):
File "manage.py", line 10, in <module>
File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
self.execute(*args, **options.__dict__)
File "/home7/maythirt/python27/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
output = self.handle(*args, **options)
File "/home7/maythirt/GAB/legi/management/commands/vote.py", line 51, in handle
root = lxml.etree.fromstring(x)
File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src/lxml/lxml.etree.c:68121)
File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102470)
File "parser.pxi", line 1674, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:101299)
File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:96481)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91290)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92476)
File "parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91939)
lxml.etree.XMLSyntaxError: None

As Jonathan mentioned, it may be the output of scraperwiki.pdftoxml() that's causing a problem. You could display or log the value of x to confirm it.
Specifically, pdftoxml() runs an external program pdftohtml and uses temporary files to store the PDF and XML.
What I'd also check for is:
Is pdftohtml correctly set up on your server?
If so, does the conversion to XML work if you directly run it in a shell on the server with the PDF that the code's failing on? The command it's executing is pdftohtml -xml -nodrm -zoom 1.5 -enc UTF-8 -noframes "input.pdf" "output.xml"
If there's an issue when you run the command directly, then that's where your problem lies. Because of the way the scraperwiki code runs pdftohtml, there's no easy way to tell from Python whether the command failed.
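The checks above can also be scripted. The following is a minimal sketch, not the scraperwiki implementation: the helper names build_pdftohtml_command and run_conversion are hypothetical, the flags are copied from the command quoted above, and shutil.which assumes Python 3.3 or later (the question's code targets Python 2.7).

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, xml_path):
    """Return the argument list for the conversion command quoted above."""
    return ["pdftohtml", "-xml", "-nodrm", "-zoom", "1.5",
            "-enc", "UTF-8", "-noframes", pdf_path, xml_path]


def run_conversion(pdf_path, xml_path):
    """Run the conversion and report (ok, message) instead of failing silently."""
    command = build_pdftohtml_command(pdf_path, xml_path)
    # First check: is the external program installed and on PATH at all?
    if shutil.which(command[0]) is None:
        return False, "pdftohtml was not found on PATH"
    # Second check: does the conversion itself exit cleanly?
    process = subprocess.Popen(command, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    _, stderr = process.communicate()
    if process.returncode != 0:
        return False, stderr.decode("utf-8", "replace")
    return True, "converted"
```

Calling run_conversion("input.pdf", "output.xml") on the server with the PDF that fails should show whether the problem is a missing binary or a failing conversion.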

The way I would go about this is to add a try/except clause, and when you get the error, save the XML file and the link to your hard drive. That way you can inspect the XML file separately.
It might be that scraperwiki.pdftoxml produces an invalid XML file for some reason. I've had that happen to me with another PDF-to-XML tool.
And please refactor your code into more functions; it will become a lot easier to read and maintain :).
Another approach would be to download all of the PDFs first and then parse them. That way you avoid hitting the website again whenever a run fails for some reason.
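Here is a rough sketch of the try/except idea. The function name parse_or_dump and the dump directory are made up for illustration. It uses the standard library's xml.etree.ElementTree as a stand-in parser so it runs anywhere; passing parse=lxml.etree.fromstring should work the same way, on the assumption that lxml's XMLSyntaxError is also a SyntaxError subclass.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_or_dump(xml_text, source_url, dump_dir, parse=ET.fromstring):
    """Parse xml_text; on a syntax error, save the XML and its URL for inspection."""
    try:
        return parse(xml_text)
    except SyntaxError as exc:
        if not os.path.isdir(dump_dir):
            os.makedirs(dump_dir)
        # Write the offending XML to a uniquely named file.
        fd, xml_path = tempfile.mkstemp(suffix=".xml", dir=dump_dir)
        data = xml_text if isinstance(xml_text, bytes) else xml_text.encode("utf-8")
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # Record where the PDF came from and what the parser complained about.
        with open(xml_path + ".url.txt", "w") as handle:
            handle.write("%s\n%r\n" % (source_url, exc))
        return None


# In the scraper this would replace the bare fromstring call, for example:
#     root = parse_or_dump(x, llink, "failed_xml", parse=lxml.etree.fromstring)
#     if root is None:
#         continue
```

Returning None lets the loop skip a bad PDF and carry on, while the saved files show what the converter actually produced.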


scrapy TypeError: object() takes no parameters

I am new to Scrapy and trying to crawl a couple of links as a test using Scrapy. Whenever I run scrapy crawl tier1, I get "TypeError: object() takes no parameters" as the following:
Traceback (most recent call last):
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/btaek/TaeksProgramming/adv/crawler/adv_crawler/adv_crawler/spiders/tier1_crawler.py", line 93, in parse
mk_loader.add_xpath('title', 'h1[@class="top_title"]') # Title of the article
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 167, in add_xpath
self.add_value(field_name, values, *processors, **kw)
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 77, in add_value
self._add_value(field_name, value)
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 91, in _add_value
processed_value = self._process_input_value(field_name, value)
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 150, in _process_input_value
return proc(value)
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/scrapy/loader/processors.py", line 28, in __call__
next_values += arg_to_iter(func(v))
TypeError: object() takes no parameters
2017-08-23 17:25:02 [tier1-parse-logger] INFO: Entered the parse function to parse and index: http://news.mk.co.kr/newsRead.php?sc=30000001&year=2017&no=535166
2017-08-23 17:25:02 [tier1-parse-logger] ERROR: Error (object() takes no parameters) when trying to parse <<date>> from a mk article: http://news.mk.co.kr/newsRead.php?sc=30000001&year=2017&no=535166
2017-08-23 17:25:02 [tier1-parse-logger] ERROR: Error (object() takes no parameters) when trying to parse <<author>> from a mk article: http://news.mk.co.kr/newsRead.php?sc=30000001&year=2017&no=535166
2017-08-23 17:25:02 [scrapy.core.scraper] ERROR: Spider error processing <GET http://news.mk.co.kr/newsRead.php?sc=30000001&year=2017&no=535166> (referer: None)
Traceback (most recent call last):
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/btaek/TaeksProgramming/adv/crawler/adv_crawler/adv_crawler/spiders/tier1_crawler.py", line 93, in parse
mk_loader.add_xpath('title', 'h1[@class="top_title"]') # Title of the article
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 167, in add_xpath
self.add_value(field_name, values, *processors, **kw)
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 77, in add_value
self._add_value(field_name, value)
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 91, in _add_value
processed_value = self._process_input_value(field_name, value)
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 150, in _process_input_value
return proc(value)
File "/Users/btaek/TaeksProgramming/adv/crawler/lib/python2.7/site-packages/scrapy/loader/processors.py", line 28, in __call__
next_values += arg_to_iter(func(v))
TypeError: object() takes no parameters
And, my spider file (tier1_crawler.py):
# -*- coding: utf-8 -*-
import sys
import os
import logging
import scrapy
from scrapy.loader import ItemLoader
from adv_crawler.items import AdvCrawlerItem
from datetime import datetime, date, time
t1_parse_logger = logging.getLogger("tier1-parse-logger")
t1_parse_logger.LOG_FILE = "Tier1-log.txt"
content_type_dic = {
    'news': 'news',
}


class Tier1Crawler(scrapy.Spider):
    name = "tier1"

    def start_requests(self):
        urls = [
            'http://news.mk.co.kr/newsRead.php?sc=30000001&year=2017&no=535982',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        t1_parse_logger.info("Entered the parse function to parse and index: %s" % response.url)  # Log at the beginning of the parse function
        item_loader = ItemLoader(item=AdvCrawlerItem(), response=response)
        if 'mk.co.kr' in response.url:
            mk_loader = item_loader.nested_xpath('//div[@id="top_header"]/div[@class="news_title"]/div[@class="news_title_text"]')
            try:
                mk_loader.add_xpath('date', 'div[@class="news_title_author"]/ul/li[@class="lasttime"]')
            except AttributeError:  # if the date is not in "lasttime" li tag
                mk_loader.add_xpath('date', 'div[@class="news_title_author"]/ul/li[@class="lasttime1"]')
            except Exception as e:  # in case the error is not AttributeError
                t1_parse_logger.error("Error "+"("+str(e)+")"+" when trying to parse <<date>> from a mk article: %s" % response.url)
            try:
                mk_loader.add_xpath('author', 'div[@class="news_title_author"]/ul/li[@class="author"]')
            except AttributeError:  # in case there is no author (some mk articles have no author)
                item_loader.add_value('author', "None")  # if error, replace with the line below
                # item['author'] = "None" # if the above gives any error, replace the above with this line
            except Exception as e:  # in case the error is not AttributeError
                t1_parse_logger.error("Error "+"("+str(e)+")"+" when trying to parse <<author>> from a mk article: %s" % response.url)
            item_loader.add_xpath('content', '//div[@id="Content"]/div[@class="left_content"]/div[@id="article_body"]/div[@class="art_txt"]')  # Content of the article (entire contents)
            mk_loader.add_xpath('title', 'h1[@class="top_title"]')  # Title of the article
            item_loader.add_value('content_type', content_type_dic['news'])
        item_loader.add_value('timestamp', str(datetime.now()))  # timestamp of when the document is being indexed
        item_loader.add_value('url', response.url)  # url of the article
        t1_parse_logger.info("Parsed and indexed: %s" % response.url)
        return item_loader.load_item()
And, my items.py file:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags
def filter_date(value):
    if isinstance(value, unicode):
        (year, month, day) = str(value.split(" ")[-2]).split(".")
        return year+"-"+month+"-"+day


def filter_utf(value):
    if isinstance(value, unicode):
        return value.encode('utf-8')


class AdvCrawlerItem(scrapy.Item):
    author = scrapy.Field(input_processor=MapCompose(remove_tags, TakeFirst, filter_utf),)  # Name of the publisher/author
    content = scrapy.Field(input_processor=MapCompose(remove_tags, Join, filter_utf),)  # Content of the article (entire contents)
    content_type = scrapy.Field()
    date = scrapy.Field(input_processor=MapCompose(remove_tags, TakeFirst, filter_date),)
    timestamp = scrapy.Field()  # timestamp of when the document is being indexed
    title = scrapy.Field(input_processor=MapCompose(remove_tags, TakeFirst, filter_utf),)  # title of the article
    url = scrapy.Field()  # url of the article
And, pipelines.py file:
import json
from scrapy import signals
from scrapy.exporters import JsonLinesItemExporter
class AdvCrawlerJsonExportPipeline(object):
    def open_spider(self, spider):
        self.file = open('crawled-articles1.txt', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write each item as one JSON document per line
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
I am aware that the "TypeError: object() takes no parameters" error is usually thrown when a class's __init__ method is not defined at all or is not defined to take parameters.
However, in the case above, how can I fix the error? Am I doing something wrong with the item loader or the nested item loader?
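Scrapy processors such as TakeFirst and Join are classes. MapCompose calls each function it is given with a value, so passing the class itself makes it try to construct TakeFirst(value), which raises the TypeError. Pass instances of the classes instead: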
# wrong
field = Field(output_processor=MapCompose(TakeFirst))
# right
field = Field(output_processor=MapCompose(TakeFirst()))
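The difference can be reproduced without Scrapy. In this illustrative sketch, FirstNonEmpty and apply_processor are hypothetical stand-ins (not Scrapy APIs) for a processor class and for the code that calls it:

```python
class FirstNonEmpty(object):
    """Hypothetical stand-in for a processor class such as TakeFirst."""

    def __call__(self, values):
        # Return the first value that is neither None nor an empty string.
        for value in values:
            if value is not None and value != '':
                return value


def apply_processor(processor, values):
    """Call the processor the way a loader would: processor(values)."""
    return processor(values)


values = ['', 'Title']

# Passing an instance: the instance's __call__ does the processing.
print(apply_processor(FirstNonEmpty(), values))

# Passing the class: processor(values) tries to construct
# FirstNonEmpty(values), and the class accepts no constructor arguments.
try:
    apply_processor(FirstNonEmpty, values)
except TypeError as exc:
    print("TypeError: %s" % exc)
```

The exact wording of the TypeError depends on the Python version; on Python 2.7 it is the "object() takes no parameters" message from the question.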

Log warning from Selenium on Django [duplicate]

Whenever I try to construct a string based on self.live_server_url, I get Python TypeError messages. For example, I've tried the following string constructions (forms 1 and 2 below), and both produce the same TypeError. My desired string is the live server URL with "/lists" appended. NOTE: the actual test does succeed in creating a server, and I can manually access the server and, more specifically, the exact URL that I'm trying to build programmatically (e.g. 'http://localhost:8081/lists').
The TypeErrors occur with these string constructions.
# FORM 1
lists_live_server_url = '%s%s' % (self.live_server_url, '/lists')
# FORM 2
lists_live_server_url = '{0}{1}'.format(self.live_server_url, '/lists')
There is no python error with this form (nothing appended to string), albeit my test fails (as I would expect since it isn't accessing /lists).
Here is the python error that I'm getting.
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/bin/python3.4 /Applications/PyCharm.app/Contents/helpers/pycharm/django_test_manage.py test functional_tests.lists_tests.LiveNewVisitorTest.test_can_start_a_list_and_retrieve_it_later /Users/myusername/PycharmProjects/mysite_proj
Testing started at 11:55 AM ...
Creating test database for alias 'default'...
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/wsgiref/handlers.py", line 137, in run
self.result = application(self.environ, self.start_response)
File "/usr/local/lib/python3.4/site-packages/django/test/testcases.py", line 1104, in __call__
return super(FSFilesHandler, self).__call__(environ, start_response)
File "/usr/local/lib/python3.4/site-packages/django/core/handlers/wsgi.py", line 189, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.4/site-packages/django/test/testcases.py", line 1087, in get_response
return self.serve(request)
File "/usr/local/lib/python3.4/site-packages/django/test/testcases.py", line 1099, in serve
return serve(request, final_rel_path, document_root=self.get_base_dir())
File "/usr/local/lib/python3.4/site-packages/django/views/static.py", line 54, in serve
fullpath = os.path.join(document_root, newpath)
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/posixpath.py", line 82, in join
path += b
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'
Am I unknowingly attempting to modify the live_server_url, which is leading to these TypeErrors? How could I programmatically build a string of live_server_url + "/lists"?
Here is the test that I am attempting...
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from django.test import LiveServerTestCase
class LiveNewVisitorTest(LiveServerTestCase):
    def setUp(self):
        self.browser = webdriver.Chrome()

    def tearDown(self):
        self.browser.quit()

    def test_can_start_a_list_and_retrieve_it_later(self):
        #lists_live_server_url = '%s%s' % (self.live_server_url, '/lists')
        #lists_live_server_url = '{0}{1}'.format(self.live_server_url, '/lists')
        lists_live_server_url = self.live_server_url
        self.browser.get(lists_live_server_url)
        self.assertIn('To-Do', self.browser.title)
        header_text = self.browser.find_element_by_tag_name('h1').text
        self.assertIn('To-Do', header_text)
See this discussion on Reddit featuring the same error Traceback.
Basically, this is not a problem with anything within the Selenium tests but rather with your project's static file configuration.
From your question, I believe the key line within the Traceback is:
File "/usr/local/lib/python3.4/site-packages/django/views/static.py", line 54, in serve
fullpath = os.path.join(document_root, newpath)
This line shows os.path.join failing inside django.views.static. The error message says one of its operands is None, which is consistent with document_root coming from an unset STATIC_ROOT.
Set STATIC_ROOT in your project's settings.py file and you should be good.
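To see why a missing setting surfaces as a TypeError, here is a small sketch that mimics the failing os.path.join call outside Django. The function name build_static_path and the example paths are invented for illustration.

```python
import os


def build_static_path(document_root, relative_path):
    """Mimic the join from the traceback; document_root comes from settings."""
    return os.path.join(document_root, relative_path)


# With a configured root, the join succeeds.
print(build_static_path("/var/www/mysite/static", "css/base.css"))

# With None (what an unset STATIC_ROOT yields), the join raises TypeError.
try:
    build_static_path(None, "css/base.css")
except TypeError as exc:
    print("TypeError: %s" % exc)
```

Any string value for the root avoids the error, which is why setting STATIC_ROOT makes the traceback go away even though the test itself never touches static files directly.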
Using StaticLiveServerTestCase instead of LiveServerTestCase may also help.

Scrapy-Scraper Does Not Run

I can run Python scripts using Beautiful Soup and Mechanize, but for some reason when I try to use Scrapy it just doesn't work. Here's an example of what happens when I attempt to test the scraper with a tutorial:
Project name & BOT name = "tutorial"
The following scripts are the items.py and settings.py that I used.
import scrapy
class DmozSpider(scrapy.Spider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
def parse(self, response):
filename = response.url.split("/")[-2]
with open(filename, 'wb') as f:
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
C:\Users\Turbo>scrapy startproject tutorial
New Scrapy project 'tutorial' created in:
You can start your first spider with:
cd tutorial
scrapy genspider example example.com
C:\Users\Turbo>cd tutorial
C:\Users\Turbo\tutorial>scrapy crawl dmoz
Traceback (most recent call last):
File "C:\Python27\Scripts\scrapy-script.py", line 9, in <module>
load_entry_point('scrapy==0.24.4', 'console_scripts', 'scrapy')()
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\cmdline.py", line 143, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\cmdline.py", line 150, in _run_command
cmd.run(args, opts)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\commands\crawl.py", line 58, in run
spider = crawler.spiders.create(spname, **opts.spargs)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\spidermanager.py", line 44, in create
raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'
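The problem is that you put your spider in items.py. Scrapy only looks for spiders in the modules listed in SPIDER_MODULES, which is tutorial.spiders in your settings.
Instead, create a dmoz.py file inside the spiders package (create the package if it does not exist) and put your spider there.
See the "Our first Spider" section of the tutorial for more.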

How to extract text from PDF uploaded in Google App Engine using PyPDF2?

Is there any way to extract text and documentInfo from a PDF file uploaded via Google App Engine? I want to use PyPDF2, and my code is this:
pdf_file = self.request.POST['file'].file
pdf_reader = pypdf.PdfFileReader(pdf_file)
This gives me error:
Traceback (most recent call last):
File "/myrepo/myproj/main.py", line 154, in post
pdf_text = pypdf.PdfFileReader(pdf_file)
File "lib/PyPDF2/pdf.py", line 649, in __init__
File "lib/PyPDF2/pdf.py", line 1100, in read
raise utils.PdfReadError, "EOF marker not found"
PdfReadError: EOF marker not found
It gives this error for any file, even for files that can be read successfully from disk via open(filename, 'r').
Am I missing something? Thanks in advance!
The solution is to use get_uploads from blobstore_handlers.BlobstoreUploadHandler:
from cStringIO import StringIO

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers
import PyPDF2


class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        upload_files = self.get_uploads('file')  # 'file' is the form field name
        blob_info = upload_files[0]
        # Read the whole blob and wrap it in a seekable, file-like object
        blob_reader = blobstore.BlobReader(blob_info)
        blob_content = StringIO(blob_reader.read())
        pdf_info = PyPDF2.PdfFileReader(blob_content)
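When debugging an "EOF marker not found" error, it can help to confirm that the stream handed to the reader really contains a complete PDF. The sketch below is a rough diagnostic, not PyPDF2's own check: has_eof_marker is a hypothetical helper, the 1024-byte tail size is an arbitrary assumption, and the sample bytes are invented. It targets Python 3's io.BytesIO rather than the cStringIO used above.

```python
import io


def has_eof_marker(stream, tail_size=1024):
    """Return True if the last tail_size bytes of a seekable stream contain %%EOF."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    stream.seek(max(0, size - tail_size))
    tail = stream.read()
    stream.seek(0)  # rewind so the stream can still be passed to a PDF reader
    return b"%%EOF" in tail


# Invented sample data: a stream with a trailer and an empty upload.
complete = io.BytesIO(b"%PDF-1.4\n...body...\n%%EOF\n")
empty = io.BytesIO(b"")

print(has_eof_marker(complete))
print(has_eof_marker(empty))
```

If the check fails for the uploaded object but passes for the same file read from disk, the upload object is probably not delivering the full file contents.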

Django Official Tutorial Part 1 index out of bound error

I started learning Django recently and am having a strange problem with the tutorial. Everything was going fine until I started playing with the interactive shell and then I got an error whenever I tried to call all the objects in one of the tables.
I am using Django 1.1 and Python 2.5 on Mac OS X.
For those unfamiliar with the tutorial, you are making a website to manage polls. You have the following code in the model:
from django.db import models
import datetime


class Poll(models.Model):
    question = models.CharField(max_length=200)
    pub_date = models.DateTimeField('date published')

    def __unicode__(self):
        return self.question

    def was_published_today(self):
        return self.pub_date.date() == datetime.date.today()
    was_published_today.short_description = 'Published today?'


class Choice(models.Model):
    poll = models.ForeignKey(Poll)
    choice = models.CharField(max_length=200)
    votes = models.IntegerField()

    def __unicode__(self):
        return self.choice
After creating the model, you add a poll item and then add some choices to it.
Everything was fine until I tried to see all the objects in the choices table, or all the choices in a particular poll. Then I got an error. Here's an example series of commands in the interactive shell. Please note that the count of the choices is correct (I experimented a bit after running into the error, so the count is a bit high).
>>> from mysite.polls.models import Poll, Choice
>>> Poll.objects.all()
[<Poll: What's up>, <Poll: Yups>]
>>> Choice.objects.count()
>>> Choice.objects.all()
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/Library/Python/2.5/site-packages/django/db/models/query.py", line 68, in __repr__
data = list(self[:REPR_OUTPUT_SIZE + 1])
File "/Library/Python/2.5/site-packages/django/db/models/query.py", line 83, in __len__
File "/Library/Python/2.5/site-packages/django/db/models/query.py", line 238, in iterator
for row in self.query.results_iter():
File "/Library/Python/2.5/site-packages/django/db/models/sql/query.py", line 287, in results_iter
for rows in self.execute_sql(MULTI):
File "/Library/Python/2.5/site-packages/django/db/models/sql/query.py", line 2369, in execute_sql
cursor.execute(sql, params)
File "/Library/Python/2.5/site-packages/django/db/backends/util.py", line 19, in execute
return self.cursor.execute(sql, params)
File "/Library/Python/2.5/site-packages/django/db/backends/sqlite3/base.py", line 193, in execute
return Database.Cursor.execute(self, query, params)
File "/Library/Python/2.5/site-packages/django/db/backends/util.py", line 82, in typecast_timestamp
seconds = times[2]
IndexError: list index out of range
The Django tutorial (part 1) can be found here.
The problem seemed to be that the database was not synchronized with the models. Resetting the database worked fine. Thanks to Alasdair for the suggestion.
It looks like the problem is that was_published_today() is comparing a datetime to a date. Try changing it to:
return self.pub_date.date() == datetime.date.today()
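As a quick illustration of the comparison being discussed, the sketch below uses invented sample values and a fixed "today" so the result is reproducible. The helper was_published_on is hypothetical and only mirrors the model method.

```python
import datetime

pub_date = datetime.datetime(2009, 10, 1, 12, 30)  # invented sample value
today = datetime.date(2009, 10, 1)                 # fixed stand-in for date.today()


def was_published_on(pub_date, day):
    """Compare like with like: reduce the datetime to a date first."""
    return pub_date.date() == day


# Comparing the date part of the datetime with a date.
print(was_published_on(pub_date, today))

# Comparing the datetime itself with a date, for contrast.
print(pub_date == today)
```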
Since the problem seems to be in the code that interprets strings as timestamps, I'd be interested to see the actual data in the db. It looks like there's a timestamp in there that isn't in the proper form. Not sure how it got there without seeing it, but I bet there's a clue there.
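One way to look at the raw stored values is to query the SQLite file directly and flag anything that does not look like a full timestamp. This is only a sketch: find_malformed_timestamps is a hypothetical helper, the regular expression is an assumption about the expected "YYYY-MM-DD HH:MM:SS" form, and the table contents below are invented and held in an in-memory database rather than the real project file.

```python
import re
import sqlite3

# Assumed shape of a well-formed value: date, space, HH:MM:SS, optional fraction.
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(\.\d+)?$")


def find_malformed_timestamps(conn, table, column):
    """Return (id, raw value) pairs whose stored text is not a full timestamp."""
    # Table and column names are trusted here; do not pass user input.
    cursor = conn.execute("SELECT id, %s FROM %s" % (column, table))
    return [(row_id, value) for row_id, value in cursor
            if value is not None and not TIMESTAMP_RE.match(str(value))]


# Invented demo data standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE polls_poll "
             "(id INTEGER PRIMARY KEY, question TEXT, pub_date TEXT)")
conn.executemany(
    "INSERT INTO polls_poll (question, pub_date) VALUES (?, ?)",
    [("What's up", "2009-10-01 12:30:00"), ("Yups", "2009-10-01 12:30")],
)

print(find_malformed_timestamps(conn, "polls_poll", "pub_date"))
```

A value with no seconds part, like the second demo row, is the kind of input that would make an index into the split time fields run out of range.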