I have been trying to get Xapian working with django-haystack for a project I'm working on that requires some search functionality, but I have run into a bit of a wall. Basically, I got everything installed per the instructions, so:
ran make install for xapian-core and the Xapian bindings
ran pip install haystack and pip install xapian-haystack, and everything installed correctly
As I'm using the django CMS app, I simply copied their example over to give the search functionality a test, and ran into this error:
InvalidIndexError at /search/
Unable to open index at /home/mike/sites/xapian_search
I have tried several different paths for the HAYSTACK_XAPIAN_PATH setting and have also encountered another error:
OSError at /
(13, 'Permission denied')
The xapian_search folder has been given full permissions (chmod 777), and there's an xapian_index.php file with full permissions too. I'm not sure what I'm missing here, but I'm desperate to get this working!
My Haystack settings look like:
HAYSTACK_SITECONF = 'lactoseintolerant.lactose_search'
HAYSTACK_SEARCH_ENGINE = 'xapian'
HAYSTACK_XAPIAN_PATH = '/home/mike/sites/xapian_search'
HAYSTACK_SEARCH_RESULTS_PER_PAGE = 50
Any advice would be greatly appreciated!
EDIT:
Hey again.
I think this error relates to the fact that there are no indexes (is that right?). I have run the update_index, rebuild_index, and clear_index commands, all of which don't seem to do anything: no errors are output, but the indexes never seem to get built when the commands are run.
I have an app called lactose_search, which my HAYSTACK_SITECONF points to as projectname.lactose_search. In this app folder I have a file called search_indexes.py. For now I have simply copied and pasted the example from the django CMS site, as it is the CMS app content I want to search.
This file looks like:
from django.conf import settings
from django.utils.translation import string_concat, ugettext_lazy
from haystack import indexes, site
from cms.models.managers import PageManager
from cms.models.pagemodel import Page

def page_index_factory(lang, lang_name):
    if isinstance(lang_name, basestring):
        lang_name = ugettext_lazy(lang_name)

    def get_absolute_url(self):
        return '/%s%s' % (lang, Page.get_absolute_url(self))

    class Meta:
        proxy = True
        app_label = 'cms'
        verbose_name = string_concat(Page._meta.verbose_name, ' (', lang_name, ')')
        verbose_name_plural = string_concat(Page._meta.verbose_name_plural, ' (', lang_name, ')')

    attrs = {'__module__': Page.__module__,
             'Meta': Meta,
             'objects': PageManager(),
             'get_absolute_url': get_absolute_url}

    _PageProxy = type("Page%s" % lang.title(), (Page,), attrs)

    _PageProxy._meta.parent_attr = 'parent'
    _PageProxy._meta.left_attr = 'lft'
    _PageProxy._meta.right_attr = 'rght'
    _PageProxy._meta.tree_id_attr = 'tree_id'

    class _PageIndex(indexes.SearchIndex):
        language = lang

        text = indexes.CharField(document=True, use_template=False)
        pub_date = indexes.DateTimeField(model_attr='publication_date')
        login_required = indexes.BooleanField(model_attr='login_required')
        url = indexes.CharField(stored=True, indexed=False, model_attr='get_absolute_url')
        title = indexes.CharField(stored=True, indexed=False, model_attr='get_title')

        def prepare(self, obj):
            self.prepared_data = super(_PageIndex, self).prepare(obj)
            plugins = obj.cmsplugin_set.filter(language=lang)
            text = ''
            for plugin in plugins:
                instance, _ = plugin.get_plugin_instance()
                if hasattr(instance, 'search_fields'):
                    text += ''.join(getattr(instance, field) for field in instance.search_fields)
            self.prepared_data['text'] = text
            return self.prepared_data

        def get_queryset(self):
            return _PageProxy.objects.published().filter(title_set__language=lang, publisher_is_draft=False).distinct()

    return _PageProxy, _PageIndex

for lang_tuple in settings.LANGUAGES:
    lang, lang_name = lang_tuple
    site.register(*page_index_factory(lang, lang_name))
The example can also be found here: http://docs.django-cms.org/en/2.1.3/extending_cms/searchdocs.html
I hope this extra info makes answering this question a bit easier!
It's more likely that you have not built the index. Use the following command:
python manage.py update_index
The same thing happened to me; I just needed to run the above command.
This is a rather strange issue that I haven't yet encountered (and no one has yet reported it here: https://github.com/notanumber/xapian-haystack/issues).
Older versions of Xapian-Haystack required write permission (to be able to create indexes) and had a check at startup that verified this was the case, but that check was removed.
As long as the process can read the HAYSTACK_XAPIAN_PATH folder, you shouldn't be receiving any Permission Denied errors.
Can you confirm which version of the backend you are using? If possible, I'd also suggest swapping the backend out for Whoosh, just as a sanity check that there's not something hokey going on.
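For reference, the swap might look like this in settings.py (a minimal sketch assuming the Haystack 1.x-style settings shown in the question; the Whoosh path is just an illustration):

HAYSTACK_SITECONF = 'lactoseintolerant.lactose_search'
HAYSTACK_SEARCH_ENGINE = 'whoosh'
# Whoosh keeps its index in a plain directory that the process must be able to write
HAYSTACK_WHOOSH_PATH = '/home/mike/sites/whoosh_index'
HAYSTACK_SEARCH_RESULTS_PER_PAGE = 50

After switching, rebuild_index should recreate the index under the new path.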
I figured out what my issue was here: when I installed the packages to my env, I ran sudo pip install instead of simply using pip. I can't explain why this affected the Haystack install, but once I removed all the packages and reinstalled them, I managed to get Haystack running.
As seems to frequently happen here, I am quite new to Python 2.7 and Scrapy. Our project has us scraping website data, following some links, more scraping, and so on. This was all working fine. Then I updated Scrapy.
Now when I launch my spider, I get the following message:
This wasn't coming up anywhere previously (none of my prior error messages looked anything like this). I am now running Scrapy 1.1.0 on Python 2.7, and none of the spiders that had previously worked on this project are working.
I can provide some example code if need be, but my (admittedly limited) knowledge of Python suggests to me that it's not even getting to my script before bombing out.
EDIT:
OK, so this code is supposed to start at the first authors page for Deakin University academics on The Conversation, then go through and scrape how many articles they have written and how many comments they have made.
import scrapy
from ltuconver.items import ConversationItem
from ltuconver.items import WebsitesItem
from ltuconver.items import PersonItem
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
import bs4

class ConversationSpider(scrapy.Spider):
    name = "urls"
    allowed_domains = ["theconversation.com"]
    start_urls = [
        'http://theconversation.com/institutions/deakin-university/authors']

    # URL grabber
    def parse(self, response):
        requests = []
        people = Selector(response).xpath('//*[@id="experts"]/ul[*]/li[*]')
        for person in people:
            item = WebsitesItem()
            item['url'] = 'http://theconversation.com/' + str(person.xpath('a/@href').extract())[4:-2]
            self.logger.info('parseURL = %s', item['url'])
            requests.append(Request(url=item['url'], callback=self.parseMainPage))

        soup = bs4.BeautifulSoup(response.body, 'html.parser')
        try:
            nexturl = 'https://theconversation.com' + soup.find('span', class_='next').find('a')['href']
            requests.append(Request(url=nexturl))
        except:
            pass
        return requests

    # go to the URLs and grab the info
    def parseMainPage(self, response):
        person = Selector(response)
        item = PersonItem()
        item['name'] = str(person.xpath('//*[@id="outer"]/header/div/div[2]/h1/text()').extract())[3:-2]
        item['occupation'] = str(person.xpath('//*[@id="outer"]/div/div[1]/div[1]/text()').extract())[11:-15]
        item['art_count'] = int(str(person.xpath('//*[@id="outer"]/header/div/div[3]/a[1]/h2/text()').extract())[3:-3])
        item['com_count'] = int(str(person.xpath('//*[@id="outer"]/header/div/div[3]/a[2]/h2/text()').extract())[3:-3])
        return item  # hand the populated item back to Scrapy
And in my Settings, I have:
BOT_NAME = 'ltuconver'
SPIDER_MODULES = ['ltuconver.spiders']
NEWSPIDER_MODULE = 'ltuconver.spiders'
DEPTH_LIMIT=1
Apparently my six.py file was corrupt (or something like that). After swapping it out with the same file from a colleague, it started working again. 8-\
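If you hit something similar, one quick way to check which copy of six your interpreter is actually importing is the following (my suggestion, not from the original answer); force-reinstalling the package with pip is usually cleaner than copying a file from a colleague:

import six
print six.__file__     # path of the module actually being imported
print six.__version__  # the version it reports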
I can't seem to figure out how to access POST data using WSGI. I tried the example on the wsgi.org website and it didn't work. I'm using Python 3.0 right now. Please don't recommend a WSGI framework, as that is not what I'm looking for.
I would like to figure out how to get the POST data into a FieldStorage object.
Assuming you are trying to get just the POST data into a FieldStorage object:

import cgi

# env is the environment handed to you by the WSGI server.
# The query string is removed from the copied env before passing it to
# FieldStorage, so only POST data ends up in there.
post_env = env.copy()
post_env['QUERY_STRING'] = ''
post = cgi.FieldStorage(
    fp=env['wsgi.input'],
    environ=post_env,
    keep_blank_values=True
)
body = b''  # a bytes literal, since wsgi.input yields bytes on Python 3
try:
    length = int(environ.get('CONTENT_LENGTH', '0'))
except ValueError:
    length = 0
if length != 0:
    body = environ['wsgi.input'].read(length)
Note that WSGI is not yet fully specified for Python 3.0, and much of the popular WSGI infrastructure has not been converted (or has been run through 2to3, but not properly tested). (Even wsgiref.simple_server won't run.) You're in for a rough time doing WSGI on 3.0 today.
This worked for me (in Python 3.0):
import urllib.parse
post_input = urllib.parse.parse_qs(environ['wsgi.input'].readline().decode(), True)
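For reference, parse_qs turns an urlencoded body into a dict of lists, with the second argument keeping blank values (a quick illustration of mine, not from the original answer):

import urllib.parse
urllib.parse.parse_qs('user=mike&password=', True)
# {'user': ['mike'], 'password': ['']}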
I had the same issue, and I invested some time researching a solution.
Here is the complete answer, with details and resources (since the answer accepted here didn't work for me on Python 3; there were many errors to correct around the environ handling, etc.):
# the code below is taken from and explained officially here:
# https://wsgi.readthedocs.io/en/latest/specifications/handling_post_forms.html
import cgi

def is_post_request(environ):
    if environ['REQUEST_METHOD'].upper() != 'POST':
        return False
    content_type = environ.get('CONTENT_TYPE', 'application/x-www-form-urlencoded')
    return (content_type.startswith('application/x-www-form-urlencoded')
            or content_type.startswith('multipart/form-data'))

def get_post_form(environ):
    assert is_post_request(environ)
    input = environ['wsgi.input']
    post_form = environ.get('wsgi.post_form')
    if (post_form is not None
            and post_form[0] is input):
        return post_form[2]
    # This must be done to avoid a bug in cgi.FieldStorage
    environ.setdefault('QUERY_STRING', '')
    fs = cgi.FieldStorage(fp=input,
                          environ=environ,
                          keep_blank_values=1)
    new_input = InputProcessed()
    post_form = (new_input, input, fs)
    environ['wsgi.post_form'] = post_form
    environ['wsgi.input'] = new_input
    return fs

class InputProcessed(object):
    def read(self, *args):
        raise EOFError('The wsgi.input stream has already been consumed')
    readline = readlines = __iter__ = read

# The basic and expected application function for WSGI.
# get_post_form(environ) returns a FieldStorage object;
# to access the values, use the .getvalue('the_key_name') method,
# as explained officially here:
# https://docs.python.org/3/library/cgi.html
# If you don't know what the keys are, use the .keys() method and loop through them.
def application(environ, start_response):
    start_response('200 OK', [('Content-type', 'text/plain')])
    form = get_post_form(environ)  # parsed once; repeated calls return the cached form
    user = form.getvalue('user')
    password = form.getvalue('password')
    output = 'user is: ' + user + ' and password is: ' + password
    return [output.encode()]
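To try the application above quickly, you can serve it with the standard library's reference server (a minimal sketch of mine for a modern Python 3; the port is arbitrary):

from wsgiref.simple_server import make_server

# 'application' is the WSGI callable defined above
httpd = make_server('localhost', 8051, application)
httpd.serve_forever()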
I would suggest you look at how some frameworks do it. (I am not recommending any single one, just using them as examples.)
Here is the code from Werkzeug:
http://dev.pocoo.org/projects/werkzeug/browser/werkzeug/wrappers.py#L150
which calls
http://dev.pocoo.org/projects/werkzeug/browser/werkzeug/utils.py#L1420
It's a bit complicated to summarize here, so I won't.
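That said, the user-facing end of what Werkzeug builds there is short (a hedged sketch using Werkzeug's public wrapper API; the field name is a placeholder):

from werkzeug.wrappers import Request

def application(environ, start_response):
    request = Request(environ)
    user = request.form.get('user')  # parsed POST form data, exposed as a MultiDict
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [('user is: %s' % user).encode()]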
I've got an app deployed on Heroku using Django, and so far it seems to be working but I'm having a problem uploading new thumbnails. I have installed Pillow to allow me to resize images when they're uploaded and save the resized thumbnail, not the original image. However, every time I upload, I get the following error: "This backend doesn't support absolute paths." When I reload the page, the new image is there, but it is not resized. I am using Amazon AWS to store the images.
I'm suspecting it has something to do with my models.py. Here is my resize code:
class Projects(models.Model):
    project_thumbnail = models.FileField(upload_to=get_upload_file_name, null=True, blank=True)

    def __unicode__(self):
        return self.project_name

    def save(self):
        if not self.id and not self.project_description:
            return

        super(Projects, self).save()

        if self.project_thumbnail:
            image = Image.open(self.project_thumbnail)
            (width, height) = image.size
            image.thumbnail((200, 200), Image.ANTIALIAS)
            image.save(self.project_thumbnail.path)
Is there something that I'm missing? Do I need to tell it something else?
Working with Heroku and AWS, you just need to change the FileField/ImageField attribute you use from 'path' to 'name'. So in your case it would be:
image.save(self.project_thumbnail.name)
instead of
image.save(self.project_thumbnail.path)
I found the answer with the help of others' Googling as well, since my own searches didn't pull up the answers I wanted. It was a problem with how Pillow uses absolute paths to save, so I switched to using the storage module as a temporary save space, and it's working now. Here's the code:
from django.core.files.storage import default_storage as storage
...
    def save(self):
        if not self.id and not self.project_description:
            return

        super(Projects, self).save()

        if self.project_thumbnail:
            size = 200, 200
            image = Image.open(self.project_thumbnail)
            image.thumbnail(size, Image.ANTIALIAS)
            fh = storage.open(self.project_thumbnail.name, "wb")  # binary mode, since Pillow writes bytes
            format = 'png'  # You need to set the correct image format here
            image.save(fh, format)
            fh.close()
NotImplementedError: This backend doesn't support absolute paths - this can be fixed by replacing file.path with file.name.
How it looks in the console:
>>> c = ContactImport.objects.last()
>>> c.json_file
<FieldFile: protected/json_files/data_SbLN1MpVGetUiN_uodPnd9yE2prgeTVTYKZ.json>
>>> c.json_file.name
'protected/json_files/data_SbLN1MpVGetUiN_uodPnd9yE2prgeTVTYKZ.json'
I'm trying to use Django and MongoEngine to provide the storage backend only, with GridFS. I still have a MySQL database.
I'm running into a strange (to me) error when I'm deleting from the Django admin, and I'm wondering if I am doing something incorrectly.
My code looks like this:
# settings.py
from mongoengine import connect
connect("mongo_storage")

# models.py
from mongoengine.django.storage import GridFSStorage

class MyFile(models.Model):
    name = models.CharField(max_length=50)
    content = models.FileField(upload_to="appsfiles", storage=GridFSStorage())
    creation_time = models.DateTimeField(auto_now_add=True)
    last_update_time = models.DateTimeField(auto_now=True)
I am able to upload files just fine, but when I delete them, something seems to break and the Mongo database gets into an unworkable state until I manually delete all FileDocument objects. When this happens, I can't upload files or delete them from the Django interface.
From the stack trace I have:

/home/projects/vector/src/mongoengine/django/storage.py in _get_doc_with_name
    doc = [d for d in docs if getattr(d, self.field).name == name]

Local vars:
    _[1]   []
    d
    docs   Error in formatting: cannot set options after executing query
    name   u'testfile.pdf'
    self

/home/projects/vector/src/mongoengine/fields.py in __getattr__
    raise AttributeError
Am I using this feature incorrectly?
UPDATE:
Thanks to @zeekay's answer, I was able to get a working GridFS storage plugin. I ended up not using MongoEngine at all. I put my adapted solution on GitHub; there is a clear sample project showing how to use it. I also uploaded the project to PyPI.
Another Update:
I'd highly recommend the django-storages project. It has lots of storage backend options and is used by many more people than my originally proposed solution.
I think you are better off not using MongoEngine for this; I haven't had much luck with it either. Here is a drop-in replacement for mongoengine.django.storage.GridFSStorage, which works with the admin.
from django.core.files.storage import Storage
from django.conf import settings
from pymongo import Connection
from gridfs import GridFS

class GridFSStorage(Storage):
    def __init__(self, host=None, port=None, collection=None):
        # hard defaults first, then settings, then explicit keyword arguments
        self.host, self.port, self.collection = 'localhost', 27017, 'fs'
        for s in ('host', 'port', 'collection'):
            name = 'GRIDFS_' + s.upper()
            if hasattr(settings, name):
                setattr(self, s, getattr(settings, name))
        for s, v in zip(('host', 'port', 'collection'), (host, port, collection)):
            if v:
                setattr(self, s, v)
        self.db = Connection(host=self.host, port=self.port)[self.collection]
        self.fs = GridFS(self.db)

    def _save(self, name, content):
        self.fs.put(content, filename=name)
        return name

    def _open(self, name, *args, **kwargs):
        return self.fs.get_last_version(filename=name)

    def delete(self, name):
        # remove the most recent version stored under this name
        oid = self.fs.get_last_version(filename=name)._id
        self.fs.delete(oid)

    def exists(self, name):
        return self.fs.exists({'filename': name})

    def size(self, name):
        return self.fs.get_last_version(filename=name).length
GRIDFS_HOST, GRIDFS_PORT and GRIDFS_COLLECTION can be defined in your settings or passed as host, port, collection keyword arguments to GridFSStorage in your model's FileField.
I referred to Django's custom storage documentation, and loosely followed this answer to a similar question.
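For instance, wiring it up might look like this (a hypothetical usage sketch of mine; the module path and field names are placeholders):

# settings.py
GRIDFS_HOST = 'localhost'
GRIDFS_PORT = 27017
GRIDFS_COLLECTION = 'fs'

# models.py
from django.db import models
from myapp.storage import GridFSStorage  # wherever the class above lives

class MyFile(models.Model):
    name = models.CharField(max_length=50)
    content = models.FileField(upload_to='appsfiles', storage=GridFSStorage())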
With Django's new multi-db functionality in the development version, I've been trying to work on a management command that lets me synchronize the data from the live site down to a developer machine for extended testing. (Having actual data, particularly user-entered data, allows me to test a broader range of inputs.)
Right now I've got a "mostly" working command. It can sync "simple" model data, but the problem I'm having is that it ignores ManyToMany fields, and I don't see any reason for it to do so. Does anyone have any ideas on how to fix that, or a better way to handle this? Should I export that first query to a fixture and then re-import it?
from django.core.management.base import LabelCommand
from django.db.utils import IntegrityError
from django.db import models
from django.conf import settings

LIVE_DATABASE_KEY = 'live'

class Command(LabelCommand):
    help = ("Synchronizes the data between the local machine and the live server")
    args = "APP_NAME"
    label = 'application name'
    requires_model_validation = False
    can_import_settings = True

    def handle_label(self, label, **options):
        # Make sure we're running the command on a developer machine and that we've got the right settings
        db_settings = getattr(settings, 'DATABASES', {})
        if not LIVE_DATABASE_KEY in db_settings:
            print 'Could not find "%s" in database settings.' % LIVE_DATABASE_KEY
            return
        if db_settings.get('default') == db_settings.get(LIVE_DATABASE_KEY):
            print 'Data cannot synchronize with self. This command must be run on a non-production server.'
            return

        # Fetch all models for the given app
        try:
            app = models.get_app(label)
            app_models = models.get_models(app)
        except:
            print "The app '%s' could not be found or models could not be loaded for it." % label
            return  # bail out, otherwise app_models would be undefined below

        for model in app_models:
            print 'Syncing %s.%s ...' % (model._meta.app_label, model._meta.object_name)

            # Query each model from the live site
            qs = model.objects.all().using(LIVE_DATABASE_KEY)

            # ...and save it to the local database
            for record in qs:
                try:
                    record.save(using='default')
                except IntegrityError:
                    # Skip as the record probably already exists
                    pass
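For context: the skipped ManyToMany data is expected with this approach, because record.save() only writes the model's own table and never touches the join tables, so the relations have to be copied explicitly. A hedged sketch of one way to extend the inner loop above (untested, assuming matching primary keys on both databases; on this era of Django, assigning a list to the M2M attribute replaces the related set):

for record in qs:
    try:
        record.save(using='default')
    except IntegrityError:
        pass  # the record probably already exists
    # copy the many-to-many rows by hand; the related manager on
    # 'record' still reads from the live database it was fetched from
    for m2m_field in model._meta.many_to_many:
        source_pks = [obj.pk for obj in getattr(record, m2m_field.name).all()]
        local_record = model.objects.using('default').get(pk=record.pk)
        setattr(local_record, m2m_field.name, source_pks)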
The dumpscript command from the Django command extensions project should help a lot.
This doesn't answer your question exactly, but why not just do a DB dump and a DB restore?
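If fixtures are acceptable, the built-in commands already follow ManyToMany relations (a hedged illustration; the app label is a placeholder, and --database relies on the same multi-db support the question is about):

python manage.py dumpdata myapp --database=live > myapp.json
python manage.py loaddata myapp.json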