Serve scraped HTML data as an API using Django Rest Framework

I'm trying to build a public facing API that collects data through scraping HTML (the content of the page is what is important, not the pages themselves). I've elected to use Django-Rest-Framework as my backend. My question is: How exactly would I organize the structure of this project so that the Django ORM stores the scraped content and then it can be accessed using Django-Rest-Framework's API?
I've looked into Scrapy, but that seems less focused on content scraping and more focused on web crawling. Additionally, it deploys as its own project, which conflicts with Django's bootstrapping.
Is my best bet just running cronjobs? That seems inelegant.

Use Celery to create asynchronous and periodic tasks.
If you need something lightweight for scraping, you can use BeautifulSoup. Here is a tutorial.
Overall, this is what you need to do:
Start an ordinary Django project.
Add Celery to it.
Write some scraping code.
Call your custom scraping code from celery tasks. Save the scraped content to the database.
Use Django-Rest-Framework to create an API which will serve the content from the database.
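
As a rough illustration of the "write some scraping code" step, here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup, so it runs without extra dependencies. The choice of <h2> as the element to extract and the names HeadingScraper and scrape_headings are assumptions made for the example; a Celery task would call a function like this and save each returned item through the Django ORM.

```python
from html.parser import HTMLParser


class HeadingScraper(HTMLParser):
    """Collect the text of every <h2> element (hypothetical target content)."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self._buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) is also delivered here.
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_heading:
            self.headings.append("".join(self._buffer).strip())
            self._in_heading = False


def scrape_headings(html):
    """Return the heading texts found in an HTML string."""
    parser = HeadingScraper()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = "<html><body><h2>First <em>post</em></h2><p>x</p><h2> Second </h2></body></html>"
print(scrape_headings(sample))
```

Keeping the scraping function free of Django imports makes it easy to test on its own before wiring it into a periodic task.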

Related

Integrating django-import-export with react

I have a postgresql database and I want to add data to it.
I want to upload an excel file containing the data and save it to database.
I have a backend server of django and a frontend server of React.
I am easily able to import the data from the excel sheet to database using django_import_export but from the django admin.
What I want is to do this from React, and for normal users (non-superusers) as well. Is there a way to integrate django_import_export with React? Any other way to implement this functionality is also appreciated.
Presumably your backend uses a REST API to handle requests from the frontend. So, you can write an API handler which receives the Excel data posted to it.
The Django API handler can create an import process to handle upload. Check the documentation for more information.
Note that if you are loading large files, you might want to handle the upload asynchronously. This is a little trickier, but you could look at Celery to help with this.

Communication between Django and React

I'm trying to set up a project using Django for the backend and React for the frontend. The project has several screens, a lot of information in the DB and images generated by the backend, and will include some authentication and user permissions for different screens.
According to what I found - the best way to do it is having Django render an html file:
def index(request):
    return render(request, 'frontend/index.html')
which references a .js file:
<script src="{% static "frontend/main.js" %}"></script>
Which is created using Webpack.
This main.js retrieves the data it needs from Django using a REST api:
fetch("...some Django endpoint..").then(response => ... this.setState(...retrieved data...))
Unlike when just using Django for backend + Django templates for frontend where the backend can just send the context directly to the template:
def index(request):
    context = {'information': .... retrieve info from DB}
    return HttpResponse(loader.get_template('bla/index.html').render(context, request))
and the template can use this info directly, without referencing the backend again:
{% for bla in information %}
I'm wondering if it is a reasonable setup?
It seems excessive to have the frontend use REST to retrieve each piece of information it needs, and to have the backend expose another REST API for each part of the data it needs to supply (instead of just pushing all of the information into a single dict and sending it over along with the template).
Also, it requires at least 2 RTTs to render the full page (which I guess is usually okay).
"According to what I found - the best way to do it is having Django render an html file"
I disagree with this line. I would say it is best to keep the React app and the Django app totally separate.
I believe the Django application should solely provide APIs and the admin site (maybe, depending on your needs).
And the frontend should be a standalone app which can be served through NGINX/ExpressJs/Apache etc.
There are several advantages of this setup.
From Django application's perspective, the advantages are:
Django will not be burdened to serve the Frontend. Use gunicorn or uwsgi to serve the Django APIs.
As Django will provide data through API only, it will provide clarity on how the frontend application will communicate with the backend. I know that you can send data using context when Django serves the react app, but this might cause confusion because of API and context's co-existence.
You can use token-based authentication, JWT, etc. instead of Django's own session-based authentication, which has a lot of other advantages.
Freeing your frontend application from the backend is the best thing that can happen to the frontend. For example:
If Django served the frontend, you would be almost forced to use session-based auth (it's not that you can't use other auth schemes, but what's the point of having multiple auth systems?).
You couldn't have used server side rendering with Django rendering the frontend.
Let's say you have no idea how Django works: you would still be forced to set up a Django application on your local machine, because it serves the frontend.
You couldn't have used ExpressJs to serve the frontend, or use the advantages of using NGINX to serve those contents.
Deployment would be complicated if you have a Docker setup. In this case, you would have to use one Docker container to serve everything; otherwise you could use multiple Docker containers to serve the backend and frontend separately.
Let's say you want to serve the Django application from one server and the frontend from another; with Django tightly coupled to the frontend, you can't do this.
You can easily connect external RESTful API services without bothering about Django. You can even use other frameworks like Tornado, Flask, etc. (but DRF + Django ORM is awesome) to develop APIs.
There are some more generic advantages of having backend and frontend separated.
There is a fantastic tutorial which you can read on medium about setting up separate Django + ReactJs app.
You can use GraphQL. It has several advantages over REST, for example:
only one endpoint for entire app;
ability to fetch data with relations between them;
easy data structure modifications on both sides;
advanced client with cache/normalization/subscriptions/optimistic updates (I prefer apollo to relay);
can be used as datasource for static site generation (SEO);
you can stitch other services/APIs;
... many more.
Using react Server Side Rendering you can get pages without additional requests - 'prefilled'/rehydrated, ready for interactions - better Time to Interactive.
Tutorial/demo: django-graphql-apollo-react-demo

Migration from Wordpress to Django?

I created a website with WordPress, but now I want to migrate my site to Django without losing my data.
So, how can I switch my website from WordPress to Django?
Should be pretty easy. Once you've designed and built your Django app and it's functioning correctly, write a Python script that accesses all the necessary data from your WP database and creates the corresponding Django objects.
If you're not an experienced programmer this will be extremely hard, unfortunately. But you'll learn a lot.
The hardest part is rewriting the WordPress templates for the Django template engine. I don't see any way to do that other than by hand.
The database schema of the WordPress website may differ from that of the Django website you have created. To migrate all the data from WordPress to Django, you first need to export the page details to an XML file from the WordPress dashboard > Tools > Export section.
After exporting the XML data file, you can write a Python script that reads from the XML file and populates the current Django database.
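
A minimal sketch of the reading half of such a script, using only the standard library. It assumes the export follows the usual RSS-style layout, with one <item> per post and the body in a content:encoded element; the function name read_posts and the inline sample document are made up for illustration, and a real export contains many more fields. Each yielded dict would then be turned into an object of your own Django model.

```python
import xml.etree.ElementTree as ET

# Namespace commonly used in WordPress export files for the post body.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}


def read_posts(xml_text):
    """Yield one dict per <item> found in a WordPress export document."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        yield {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "body": item.findtext("content:encoded", default="", namespaces=NS),
        }


# Tiny stand-in for an exported file (hypothetical content).
sample = """<rss xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Hello world</title>
      <link>https://example.com/hello-world/</link>
      <content:encoded><![CDATA[<p>Welcome!</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

for post in read_posts(sample):
    print(post["title"], post["link"])
```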

When to use Angular with Django?

I have a pretty basic question.
Consider a CRUD web application built on Django. You have templates that render data. Those templates might have forms where you submit data to the backend, and that might reload the page to display changes. Sometimes, you can make those requests over AJAX, for example when you need to update data on the UI. You can also submit forms with AJAX and update the HTML with it.
On the other hand you have single page applications. You serve a static file, and there is no reload of pages. You have data that comes from an API and populates some front-end template.
What are some guidelines for when to use what? Not in a mutually exclusive way, but within one Django project, what are some reasons/considerations to use a Django template/forms/AJAX approach and when to use Angular?
Thank you.
Something to consider is how "interactive" you want the client-side to be.
I am in the process of converting an existing Django app to use Angular (and django-rest-framework). The app was highly interactive and relied on a lot of custom jQuery to get various widgets working just right. jQuery's constant looping through the DOM made it pretty slow. I am finding that using Angular instead of jQuery is much faster.
So if you have a lot of complexity in the front-end, I would recommend Angular.

How to port from Drupal to Django?

What would be the best way to port an existing Drupal site to a Django application?
I have around 500 pages (mostly books module) and around 50 blog posts. I'm not using any 3rd party modules.
I would like to keep the current URLS (for SEO purposes) and migrate database to Django. I will create a simple blog application, so migrating blog posts should be ok. What would be the best way to serve 500+ pages with Django? I would like to use Admin to edit/add new pages.
All Django development is similar, and yours will fit the pattern.
Define the Django model for your books and blog posts.
Unit test that model using Django's built-in testing capabilities.
Write some small utilities to load your legacy data into Django. At this point, you'll realize that your Django model isn't perfect. Good. Fix it. Fix the tests. Redo the loads.
Configure the default admin interface to your model. At this point, you'll spend time tweaking the admin interface. You'll realize your data model is wrong. Which is a good thing. Fix your model. Fix your tests. Fix your loads.
Now that your data is correct, you can create templates from your legacy pages.
Create URL mappings and view functions to populate the templates from the data model.
Take the time to get the data model right. It really matters, because everything else is very simple if your data model is solid.
It may be possible to write Django models which work with the legacy database (I've done this in the past; see docs on manage.py inspectdb).
However, I'd follow advice above and design a clean database using Django conventions, and then migrate the data over. I usually write migration scripts which write to the new database through Django and read the old one using the raw Python DB APIs (while it is possible to tie Django to multiple databases simultaneously, too).
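
A minimal sketch of the "read the old database with the raw DB API" half of such a migration script. It uses an in-memory sqlite3 database as a stand-in so that it runs anywhere; the table legacy_page and its columns are hypothetical, and a real Drupal database would need the matching DB-API driver and its actual schema. Each yielded dict would then be written through your new Django model.

```python
import sqlite3


def read_legacy_pages(conn):
    """Yield each legacy row as a dict keyed by column name."""
    cur = conn.cursor()
    cur.execute("SELECT id, title, path FROM legacy_page ORDER BY id")
    columns = [description[0] for description in cur.description]
    for row in cur:
        yield dict(zip(columns, row))


# Stand-in for the old database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legacy_page (id INTEGER PRIMARY KEY, title TEXT, path TEXT)")
conn.executemany(
    "INSERT INTO legacy_page (title, path) VALUES (?, ?)",
    [("About", "about"), ("Contact", "contact")],
)

pages = list(read_legacy_pages(conn))
for page in pages:
    # In the real script this is where the new Django object would be created.
    print(page["path"], page["title"])
```

Keeping the old path alongside each record is what later lets the URL mappings reproduce the existing URLs.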
I also suggest taking a look at the available blogging apps for Django. If the one included in Pinax suits your need, go ahead and use Pinax as a starting point.
S.Lott's answer is still valid years later; I'll try to complete the analysis with the tools and format for the job.
There are many Drupal export tools out there by now, but for this very same requirement I went with Views Datasource, choosing JSON as the format. This module is very solid and available for the latest version of Drupal. The JSON format is very fast in both parsing and encoding, easy to read, and very Python-friendly (import json).
Using Views Datasource you can create a node view sorted by node id (nid), show a limited number of elements per page, configure a view path, add to it a filter identifier and pass to it the nid to read all elements until you get an empty JSON response.
When importing into Django you have a wide set of tools as well, starting with loaddata to load fixtures. Views Datasource exports JSON, but not in the format Django expects for fixtures, so you can write a custom management command to do the import, which gives you full control of the import flow.
You can start your command passing a nid=0 as argument and then let the procedure read, import and then fetch data from the next page passing simply the last nid read in the previous HTTP request. You can even restrict access to the path on view but you need additional configuration on the import side.
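
The loop described above could be sketched roughly as follows. The HTTP call is injected as a function so the example runs without a network; fetch_page, save_node, and the flat list-of-nodes JSON shape are assumptions made for the illustration, and the actual Views Datasource output may be nested differently.

```python
import json


def import_all(fetch_page, save_node, start_nid=0):
    """Request pages until an empty JSON response; return the number of nodes saved."""
    last_nid = start_nid
    imported = 0
    while True:
        nodes = json.loads(fetch_page(last_nid))
        if not nodes:
            break
        for node in nodes:
            save_node(node)
            imported += 1
        # The view is sorted by nid, so the last element has the highest nid.
        last_nid = int(nodes[-1]["nid"])
    return imported


# Fake export endpoint standing in for the Drupal view (hypothetical data).
FAKE_NODES = [{"nid": str(i), "title": f"Node {i}"} for i in range(1, 6)]


def fake_fetch(after_nid, page_size=2):
    page = [n for n in FAKE_NODES if int(n["nid"]) > after_nid][:page_size]
    return json.dumps(page)


saved = []
count = import_all(fake_fetch, saved.append)
print(count, [n["nid"] for n in saved])
```

In a real command, fetch_page would perform the HTTP request against the view path and save_node would create the Django object and log the outcome.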
Regarding performance, as an example, I parsed and imported 15,000+ nodes in less than 10 minutes via a Django 1.8 custom management command on an 8-core / 8 GB Linux virtual machine with PostgreSQL as the DBMS, logging success and error information into a custom model for each node.
These are the basics of import/export between these two platforms. For detailed information, I described all the major steps for exporting from Drupal and then importing into Django in this guide.