Efficient way to transfer data from one django application to another - django

Currently, I'm working on a project where I have a server - client relationship between two django applications running on separate hosts.
The server has to store and provide a large amount of relational data, e.g. Suppliers, Companies, Products, etc.
The client downloads data on request from the server and adds it to their database. Clients can also upload from their station to the database to expand it.
The previous person that developed this used XMLRPC to transfer the vast (13MB typical) XML file from server to client. Really, all we're sending are database-agnostic objects to be stored in a database, so I wondered if there was a more efficient way of doing it?
Please ask for more details if you need them, I wasn't really sure what you'd need to know.
EDIT: Efficient in terms of Networking, and Server Side Processing. Clients can do the heavy lifting.

A shared database design seems more suitable. But of course there may be security, political or organisational reasons ruling that out. Plus there would be significant re-design required.
To reduce network bandwidth, first check that HTTP gzip compression is enabled.
If it's just a dumb data transfer, JSON would generally be a lot more compact than XMLRPC. Does the data look amenable to a straight translation to JSON? This would still require some server-side processing.
For minimal server-side processing (if the database tables are relatively similar) it may be very efficient to just send the client a dump of the relevant db query. Of course unless the tables have the same schema you would have to do some client-side processing of raw SQL, which is not ideal.
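To get a rough feel for the JSON and gzip suggestions, here is a minimal sketch that serializes the same rows as plain XML and as JSON, then gzips both. The row fields (id, name, supplier_id) are made up for illustration, and the XML here is simple element-per-field markup rather than real XML-RPC encoding (which is more verbose still), so treat the numbers as indicative only.

```python
import gzip
import json
import xml.etree.ElementTree as ET

# Hypothetical sample rows; the real Supplier/Product fields are not given in the question.
rows = [{"id": i, "name": f"Product {i}", "supplier_id": i % 10} for i in range(1000)]


def to_xml(rows):
    """Serialize rows as one <product> element per row, one child element per field."""
    root = ET.Element("products")
    for row in rows:
        item = ET.SubElement(root, "product")
        for key, value in row.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root)


def to_json(rows):
    """Serialize rows as compact JSON (no whitespace between tokens)."""
    return json.dumps(rows, separators=(",", ":")).encode("utf-8")


xml_bytes = to_xml(rows)
json_bytes = to_json(rows)
xml_gz = gzip.compress(xml_bytes)
json_gz = gzip.compress(json_bytes)

for label, payload in [("xml", xml_bytes), ("json", json_bytes),
                       ("xml+gzip", xml_gz), ("json+gzip", json_gz)]:
    print(f"{label:10s} {len(payload)} bytes")
```

Running this against a sample of your real data should tell you quickly whether switching formats is worth it, or whether turning on gzip alone gets you most of the way.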

Related

Uploading large files to server

The project I'm working on logs data on distributed devices that needs to be joined in a single database on a remote server.
The logs cannot be streamed as they are recorded (network may not be available etc) so they must be sent in bulky 0.5-1GB text based csv files occasionally.
As far as I understand, this means having a web service receive the data in the form of POST requests is out of the question because of file sizes.
So far I've come up with this approach: Use some file transfer protocol (ftp or similar) to upload files from device to server. Devices would have to figure out a unique filename to do this with. Have the server periodically check for new files, process them by committing them to the database and deleting them afterwards.
It seems like a very naive way to go about it, but simple to implement.
However, I want to avoid any pitfalls before I implement any specifics. Is this approach scalable (more devices, larger files)? Implementation will either be done using a private/company owned server or a cloud service (Azure for instance) - will it work for different platforms?
You could actually do this through web/http as well, after setting a higher limit for POST requests in the web server (post_max_size and upload_max_filesize for PHP). This will allow devices to interact regardless of platform. It shouldn't be too hard to make a POST request to the server from any device. A simple cURL request could get this job done.
FTP is also possible. Or SCP, to make it safer.
Either way, I think this does need some application on the server to be able to fetch and manage these files using a database. Perhaps a small web application? ;)
As for the unique name, you could use a combination of the device's unique ID/name along with the current unix time. You could even hash this (md5/sha1) afterwards if you like.
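A minimal sketch of that naming scheme (device ID plus unix time, optionally hashed), written in Python for illustration. The function name, the underscore separator and the .csv suffix are my own assumptions, not anything the question specifies.

```python
import hashlib
import time


def upload_filename(device_id, timestamp=None, hashed=False):
    """Build an upload name from the device ID and a unix timestamp.

    upload_filename is a hypothetical helper; the separator and the
    ".csv" suffix are illustrative choices.
    """
    if timestamp is None:
        timestamp = int(time.time())  # current unix time in whole seconds
    name = f"{device_id}_{timestamp}"
    if hashed:
        # Optional: replace the readable name with its SHA-1 hex digest.
        name = hashlib.sha1(name.encode("utf-8")).hexdigest()
    return name + ".csv"


print(upload_filename("dev42", 1700000000))
print(upload_filename("dev42", 1700000000, hashed=True))
```

Keeping the unhashed form has the advantage that the server can tell which device and upload time a file belongs to without opening it. Note that one-second resolution only guarantees uniqueness if a device never starts two uploads within the same second.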

Redshift as a Web App Backend?

I am building an application (using Django's ORM) that will ingest a lot of events, let's say 50/s (1-2k per msg). Initially some "real time" processing and monitoring of the events is in scope so I'll be using redis to keep some of that data to make decisions, expunging them when it makes sense. I was going to persist all of the entities, including events in Postgres for "at rest" storage for now.
In the future I will need "analytical" capability for dashboards and other features. I want to use Amazon Redshift for this. I considered just going straight for Redshift and skipping Postgres. But I also see folks say that it should play more of a passive role. Maybe I could keep a window of data in the SQL backend and archive to Redshift regularly.
My question is:
Is it even normal to use something like Redshift as a backend for web applications or does it typically play more of a passive role? If not is it realistic to think I can scale the Postgres enough for the event data to start with only that? Also if not, does the "window of data and archival" method make sense?
EDIT Here are some things I've seen before writing the post:
Some say "yes, go for it" in answer to the "should I use Redshift for this?" question.
Some say "eh, not performant enough for most web apps" and fall into the "front it with a Postgres database" camp.
Redshift (ParAccel) is an OLAP-optimised DB, based on a fork of a very old version of PostgreSQL.
It's good at parallelised read-mostly queries across lots of data. It's bad at many small transactions, especially many small write transactions as seen in typical OLTP workloads.
You're partway in between. If you don't mind a data loss window, then you could reasonably accumulate data points and have a writer thread or two write batches of them to Redshift in decent sized transactions.
If you can't afford any data loss window and expect to be processing 50+ TPS, then don't consider using Redshift directly. The round-trip costs alone would be horrifying. Use a local database - or even a file based append-only journal that you periodically rotate. Then periodically upload new data to Redshift for analysis.
A few other good reasons you probably shouldn't use Redshift directly:
OLAP DBs with column store designs often work best with star schemas or similar structures. Such schemas are slow and inefficient for OLTP workloads as inserts and updates touch many tables, but they make querying the data along various axes for analysis much more efficient.
Using an ORM to talk to an OLAP DB is asking for trouble. ORMs are quite bad enough on OLTP-optimised DBs, with their unfortunate tendency toward n+1 SELECTs and/or wasteful chained left joins, tendency to do many small inserts instead of a few big ones, etc. This will be even worse on most OLAP-optimised DBs.
Redshift is based on a painfully old PostgreSQL with a bunch of limitations and incompatibilities. Code written for normal PostgreSQL may not work with it.
Personally I'd avoid an ORM entirely for this - I'd just accumulate data locally in an SQLite or a local PostgreSQL or something, sending multi-valued INSERTs or using PostgreSQL's COPY to load chunks of data as I received it from an in-memory buffer. Then I'd use appropriate ETL tools to periodically transform the data from the local DB and merge it with what was already on the analytics server.
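As a rough sketch of the "accumulate locally, write in batches" idea, the following buffers events in memory and flushes them to a local SQLite database in one transaction per batch. The EventBuffer class, the events table and its columns are all hypothetical, and it uses sqlite3's executemany inside a single transaction rather than a literal multi-row VALUES list or PostgreSQL's COPY, so it illustrates the batching pattern only.

```python
import sqlite3


class EventBuffer:
    """Collect events in memory and write them to SQLite in batches.

    Hypothetical helper: table name, columns and batch size are
    illustrative assumptions.
    """

    def __init__(self, conn, batch_size=500):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []
        conn.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, payload TEXT)")

    def add(self, ts, payload):
        """Queue one event; flush automatically once the batch is full."""
        self.pending.append((ts, payload))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write all queued events in a single transaction."""
        if not self.pending:
            return
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO events (ts, payload) VALUES (?, ?)", self.pending
            )
        self.pending.clear()


conn = sqlite3.connect(":memory:")
buf = EventBuffer(conn, batch_size=3)
for i in range(7):
    buf.add(float(i), f"event {i}")
buf.flush()  # write whatever is left over
```

Anything still sitting in `pending` when the process dies is lost, which is the "data loss window" trade-off mentioned above. A periodic timer-based flush would narrow it.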
Now forget everything I just said and go do some benchmarks with a simulation of your app's workload. That's the only really useful way to tell.
In addition to Redshift's slow transaction processing (by modern DB standards) there's another big challenge:
Redshift only supports serializable transaction isolation, most likely as a compromise to support ACID transactions while also optimizing for parallel OLAP mostly-read workload.
That can result in all kinds of concurrency-related failures that would not have been failures on typical DBs that support read-committed isolation by default.

Is there a nice way to exchange django objects between 2 servers?

I have 2 django servers, with their own database, I want to exchange some specific objects between them over the http protocol.
Actually, I planed to create some views to generate XML output on one side to be imported on the other side. Is there a nicer way ?
Is there a reason this needs to happen through http?
If you just want to read data from one server to be used on the other, you could create a simple API that returns a representation of the object you queried for (in xml/json or whatever other format you wanted).
If there is going to be a decent amount of processing going on, or slow communication, and you don't need it to happen real time (in the request/response cycle), you could look at a message queue. Something like RabbitMQ for instance.
If you want both servers to have direct access to both databases, you could try to take advantage of Django's multiple database support.
If it's more of a one-off copy of data, just write a small (non-Django) script to do it.

Large Volume Excel Data Pulls - Avoiding ODBC

We have a requirement to provide ad-hoc access to large subsets of a system's data to users to analyse in Excel.
We do not want to grant direct ODBC access. This will curb our ability to make DB layout changes without our users' processes breaking.
Web Services seem ill suited for the volume of data at stake, in the region of hundreds of thousands of records.
What would you suggest as an alternative to direct ODBC access?
There is a database concept of a "view" which does exactly what you need - it lets you expose a large set of data and gives you the freedom to change the DB schema, as long as you take care to keep exposing the same data to the user.
I agree with you regarding web services - it is not only the volume of data, but also the fact that getting web services to work with Excel (2007 and above) is far from trivial. Also you will lock your DB schema as much as you would with a view.
For a really, really huge number of records you can consider data warehousing - a separate db, to which you provide read-only access for reporting purposes and which you feed from your read/write database. The feed can be easily and quickly done via SSIS.
HTH

Cost of serialization in web service

My next project involves the creation of a data API within an enterprise framework. The data will be consumed by several applications running on different software platforms. While my colleagues generally favour SOAP, I would like to use a RESTful architecture.
Most of the applications will only need a few objects at every call. Other applications will however sometimes need to make several sequential calls each involving thousands of records. I'm concerned about performance. Serialization/deserialization & network usage are where I fear to find a bottleneck. If each request involves a large delay, all of the enterprise's applications will be sluggish.
Are my fears realistic? Will serialization to a voluminous format like XML or JSON be a problem? Are there alternatives?
In the past, we've had to do these large data transfers using a "flatter"/leaner file format such as CSV for performance. How can I hope to achieve the performance I need using a web service?
While I'd prefer replies specific to REST, I'm interested in hearing how SOAP users might deal with this as well.
One advantage of REST is that you are free to use whatever media type you like. Why not continue to use text/csv? You could also enable HTTP compression to further reduce bandwidth consumption.
REST services are great for taking advantage of all different kinds of data formats. Whatever format fits your scenario best.
We offer both XML and JSON. The rendering time you mention really can be an issue. On the server side we have JAXB, whose standard Sun implementation is somewhat slow when it comes to marshalling XML. XML has the disadvantage of verbosity, but it is also good for interoperability and has schemas + explicit versioning.
We compensated for the verbosity in several ways (especially by limiting the result-set):
In case you have a container with items in it, offer paging in your xml response (both page-size and page-number, e.g. /items?page=0&size=3). The client can itself reduce the size by reducing the page-size.
Offer collapsing elements: for instance, several clients are only interested in one data field of your whole item. Do this with a parameter (e.g. /items?select=name), then only the nested element 'name' is included inline in your item element. This dramatically decreases size.
Generally give the clients the power to use result-set limiting. They will definitely use it, because it speeds up response time also on their side :)
Also use compression, it reduces verbose XML extremely (in our case the payload got 10 times smaller). From the client side you can do it with the header 'Accept-Encoding: gzip'. If you use Apache, server configuration is also straightforward.
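The paging and collapsing ideas above can be sketched in a few lines. This is an illustration in Python, not the JAXB-based service described in the answer: page_items is a hypothetical helper, and its page, size and select arguments mirror the /items?page=0&size=3 and /items?select=name query parameters.

```python
def page_items(items, page=0, size=3, select=None):
    """Return one page of items, optionally collapsed to a single field.

    Hypothetical helper: `page` is zero-based, `size` is the page size,
    and `select` names the one field to keep from each item.
    """
    start = page * size
    chunk = items[start:start + size]
    if select is not None:
        # Collapse each item to just the requested field.
        chunk = [{select: item[select]} for item in chunk]
    return chunk


# Made-up sample data for illustration.
items = [{"id": i, "name": f"item-{i}", "price": i * 10} for i in range(7)]

print(page_items(items, page=0, size=3))
print(page_items(items, page=2, size=3, select="name"))
```

In a real service the slicing would normally be pushed down into the database query (LIMIT/OFFSET or similar) so the server never loads the full result set.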
I'd like to offer three guidelines:
one is the observation that there are many SOAP Web services out there (especially built with .NET 2.0 "ASMX" technology) that send down their data transfer objects serialized in XML. There are of course many RESTful services that send down XML or JSON. XML serialization/deserialization is rarely the constraining factor.
one common cause of bottlenecks in Web services is an interface that encourages client applications to get data by making those thousands of sequential calls (there is a term for it: a chatty interface). This is what you should avoid when you design your Web service's interface, regardless of what four-letter acronym you decide to go ahead with.
one thing to remember about REST is that it (partially) stands for a transfer of state, which may be ill-suited to some operations where you don't want to transfer the state of a business object from the server to a client application. In those cases, a SOAP Web service (as suggested by your colleagues) is more appropriate; or perhaps a combination of SOAP and REST services, where the REST services would take care of operations where the state transfer is appropriate, and the SOAP services would implement the rest (pun unintended :-)) of the operations.