PostgreSQL skipping some/many write queries when used along with Python crawlers

I am using Postgres with Django (Python), and the application crawls the internet for a specific kind of data. Whenever the crawlers find something useful, they write it to the database. The crawlers run fast and query the database through get_or_create (the Django method that checks whether the data is already in the database and creates a new row if it is not). All the cores on my machine are at almost 99% load. When I trace the process under that load, Postgres skips the write command for some, or many, instances. What can be the reason? Any recommendations for a change in architecture?
I traced the crawler procedure manually: the process printed the data it found, but the data was not added to Postgres. That confirmed the data was getting lost.
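For reference, the check-then-insert behaviour described above can be sketched as follows. This is only an illustration, not Django's implementation and not a diagnosis of the lost writes: it uses the standard-library sqlite3 module as a stand-in for Postgres, and the `page` table, its `url` column, and the `get_or_create` helper are hypothetical names. It assumes a UNIQUE constraint on the lookup column, so that the database rather than the application decides whether a row already exists, and it returns a `created` flag so each call reports whether a row was actually written.

```python
import sqlite3


def get_or_create(conn, url):
    """Return (row_id, created) for url, inserting a row only if it is absent."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT OR IGNORE INTO page (url) VALUES (?)", (url,)
        )
        created = cur.rowcount == 1  # 0 when the UNIQUE constraint suppressed the insert
        row_id = conn.execute(
            "SELECT id FROM page WHERE url = ?", (url,)
        ).fetchone()[0]
    return row_id, created


# Hypothetical schema: one row per crawled URL, uniqueness enforced by the database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE)"
)

first = get_or_create(conn, "http://example.com/a")   # inserts the row
second = get_or_create(conn, "http://example.com/a")  # finds the existing row
```

Logging the `created` flag next to the data the crawler prints is one way to see, per item, whether the write happened or the row was treated as already present.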

Related

Sanity check on AWS Big Data Architecture

We're currently looking to move our AWS architecture over to something that supports large amounts of data and can scale as we gain more customers. When this project started we stuck with what we knew, a Ruby app on an EC2 making RESTful API calls, storing the results in S3, and also storing everything in an RDS. We have a SPA front end written in VueJS to support the stored data.
As our client list has grown, the outbound API calls and the subsequent data we are storing are also growing. I'm currently tasked with looking for a better solution and I wanted to get some feedback on what I was thinking so far. Currently we have around 5 million rows of relational data, which will only increase as our client list does. I could see us being in the low billions of rows in a year or two.
The Ruby app does a great job of handling queuing the outbound API calls, retries, and everything else in-between. For this reason we thought about keeping the app and rather than inserting directly into the RDS, it would simply dump the results into S3 as a CSV.
A trigger in S3 could now convert the raw CSV data into parquet format using a Lambda function (I was looking at something like PyArrow). From here we could move over from the traditional RDS to something like Athena which supports parquet and would allow us to reuse most of our existing SQL queries.
To further optimize the performance for the user we thought about caching commonly used queries in a Dynamo table. Because the data is based on the scheduled external API calls, we could control when to bust the cache of the queries.
Big Data backends aren't really my thing, so any feedback is greatly appreciated. I know I have a lot more research to do into Parquet as it's new to me. Eventually we'd like to do some ML on this data, which I believe Parquet will also support. Thanks.

Sharding existing postgresql database with PostgresXL

We want to shard our PostgreSQL DB because of high disk load. First we looked at the django-sharding library, but:
It requires a lot of rewriting in our backend.
Migrating all tables to 64-bit primary keys is hard work on 300-400 GB tables.
Generating ids with a Postgres-specific algorithm makes it impossible to move data from shard to shard. More than that, we have a large database with old ids. Updating all of them is a big problem too.
Generating ids with special tables forces us to run a special SELECT query against the main database every time we insert data. We have a high write load, so that's not good.
Considering all this, we decided to look at Postgres database sharding solutions. We found two options: Citus and PostgresXL. Citus makes us change the data format too much and rewrite a big chunk of the backend at the same time, so we are about to try PostgresXL as the more transparent solution. But after reading the docs there are some things I can't understand, and I would be grateful for recommendations:
Are there any other sharding workarounds except for Citus and PostgresXL? It would be good not to change much in our database on migrating.
Some questions about PostgresXL:
Do I understand correctly that it's not a Postgres extension but a standalone fork? So I should build all its parts from source and then move the data in some way?
How compatible are Postgres and PostgresXL versions? We have PostgreSQL 9.4. I don't see such a version in PostgresXL (9.2 or 9.5, nothing in between?). So can I use, for example, streaming replication for migration?
Either way, what is the best solution to migrate data? If I have a 2 TB database with heavy writes, can I migrate it somehow without stopping for a long period of time?
Thanks.

First off, to save yourself a LOT of headache, have you looked at options like Amazon's Aurora, DynamoDB, Redshift, etc.? They are VERY cost effective at scale, as well as optimized and managed for you.
Actually, Amazon's straight Postgres databases can handle MASSIVE amounts of reads or writes. We can go to 2,000-6,000 IOPS on reads and another 2,000-6,000 IOPS on writes without issue. I would really look into this as the option. Azure, Oracle, and Google also have competing services.
Also be aware that Postgres-XL beyond all reason has no HA support. If you lose a single node you lose everything. The nodes can not fail over.
it's a standalone fork?
Yes. They are very different apps, developed separately from each other.
How are Postgres and PostgresXL versions compatible?
They aren't compatible. You cannot simply migrate from Postgres to Postgres-XL. They work VERY differently.
Generating ids with Postgres Specific algorithm makes it impossible to move data from shard to shard
Not following this, but with sharding you are not supposed to move data from one shard to another. The key being used generally needs to be something specific and unique to split/segregate your data on, like a date, a "type" field, or some other (hopefully ordered) field(s)/column(s). This breaks things up but has obvious pain-in-the-a$$ limitations.
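As a rough illustration of splitting data on an ordered key such as a date, here is a minimal range-routing sketch. The boundary dates, the shard names, and the `shard_for` function are all hypothetical; a real setup would map each name to a connection and would have to decide what to do when a row's key changes.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical boundaries: a row goes to the shard whose range contains its date.
BOUNDARIES = [date(2015, 1, 1), date(2016, 1, 1), date(2017, 1, 1)]
SHARDS = ["shard_old", "shard_2015", "shard_2016", "shard_current"]


def shard_for(created: date) -> str:
    """Pick the shard for a row from its (ordered) date key."""
    # bisect_right counts how many boundaries are <= created,
    # which is exactly the index of the matching range.
    return SHARDS[bisect_right(BOUNDARIES, created)]


example = shard_for(date(2016, 5, 5))  # falls in the 2016 range
```

Because the route depends only on the key, a row never needs to move between shards as long as its date does not change, which is the property the paragraph above relies on.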
Are there any other sharding workarounds except for Citus and PostgresXL? It would be good not to change much in our database on migrating.
Tons of options, but right off the bat, going from a standard RDS to a NoSQL or MPP database is going to be a major migration, take a lot of effort, and have a LOT of limitations no matter what you do.
Next, Postgres-XL and Citus are MPP (massively parallel processing) clustering apps, not sharding tools specifically. Sharding is part of what they can do, but it is not their focus.
Other options for MPP
Pgpool-II -- (not great for heavy writes)
HAProxy -- (have not done it but read about it. Lots of work to set up and maintain.)
MySQL Cluster -- (huge pain to use the OSS version and major $$$ for the commercial version)
Greenplum
Teradata
Vertica
what is the best solution to migrate data?
You are very unlikely to find a simple migration for this kind of switch. Expect to export the data yourself from the existing RDS and import it into the new DB, and you will likely have to write something yourself to get it the way you want it.

Redshift as a Web App Backend?

I am building an application (using Django's ORM) that will ingest a lot of events, let's say 50/s (1-2k per msg). Initially some "real time" processing and monitoring of the events is in scope so I'll be using redis to keep some of that data to make decisions, expunging them when it makes sense. I was going to persist all of the entities, including events in Postgres for "at rest" storage for now.
In the future I will need "analytical" capability for dashboards and other features. I want to use Amazon Redshift for this. I considered just going straight for Redshift and skipping Postgres. But I also see folks say that it should play more of a passive role. Maybe I could keep a window of data in the SQL backend and archive to Redshift regularly.
My question is:
Is it even normal to use something like Redshift as a backend for web applications or does it typically play more of a passive role? If not is it realistic to think I can scale the Postgres enough for the event data to start with only that? Also if not, does the "window of data and archival" method make sense?
EDIT Here are some things I've seen before writing the post:
Some say "yes, go for it" on the question of whether I should use Redshift for this.
Some say "eh, not performant enough for most web apps" and are in the "front it with a Postgres database" camp.

Redshift (ParAccel) is an OLAP-optimised DB, based on a fork of a very old version of PostgreSQL.
It's good at parallelised read-mostly queries across lots of data. It's bad at many small transactions, especially many small write transactions as seen in typical OLTP workloads.
You're partway in between. If you don't mind a data loss window, then you could reasonably accumulate data points and have a writer thread or two write batches of them to Redshift in decent sized transactions.
If you can't afford any data loss window and expect to be processing 50+ TPS, then don't consider using Redshift directly. The round-trip costs alone would be horrifying. Use a local database - or even a file based append-only journal that you periodically rotate. Then periodically upload new data to Redshift for analysis.
A few other good reasons you probably shouldn't use Redshift directly:
OLAP DBs with column store designs often work best with star schemas or similar structures. Such schemas are slow and inefficient for OLTP workloads as inserts and updates touch many tables, but they make querying the data along various axes for analysis much more efficient.
Using an ORM to talk to an OLAP DB is asking for trouble. ORMs are quite bad enough on OLTP-optimised DBs, with their unfortunate tendency toward n+1 SELECTs and/or wasteful chained left joins, tendency to do many small inserts instead of a few big ones, etc. This will be even worse on most OLAP-optimised DBs.
Redshift is based on a painfully old PostgreSQL with a bunch of limitations and incompatibilities. Code written for normal PostgreSQL may not work with it.
Personally I'd avoid an ORM entirely for this - I'd just accumulate data locally in an SQLite or a local PostgreSQL or something, sending multi-valued INSERTs or using PostgreSQL's COPY to load chunks of data as I received it from an in-memory buffer. Then I'd use appropriate ETL tools to periodically transform the data from the local DB and merge it with what was already on the analytics server.
Now forget everything I just said and go do some benchmarks with a simulation of your app's workload. That's the only really useful way to tell.
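The accumulate-then-write-in-batches idea from this answer can be sketched as below. It is a minimal illustration under stated assumptions: sqlite3 stands in for the local database, `executemany` inside a single transaction stands in for a multi-valued INSERT or COPY, and the `BatchWriter` class, the `event` table, and the batch size are hypothetical. Anything still sitting in the buffer is lost if the process dies, which is the data loss window mentioned above.

```python
import sqlite3


class BatchWriter:
    """Buffer rows in memory and write them in one transaction per batch."""

    def __init__(self, conn, batch_size=500):
        self.conn = conn
        self.batch_size = batch_size
        self.buffer = []

    def add(self, source, payload):
        self.buffer.append((source, payload))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with self.conn:  # one commit for the whole batch
            self.conn.executemany(
                "INSERT INTO event (source, payload) VALUES (?, ?)",
                self.buffer,
            )
        self.buffer.clear()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (source TEXT NOT NULL, payload TEXT NOT NULL)")

writer = BatchWriter(conn, batch_size=500)
for i in range(1200):
    writer.add("sensor", f"event-{i}")

pending = len(writer.buffer)  # rows not yet written: the data loss window
writer.flush()                # write the remainder, e.g. on shutdown or a timer
total = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
```

A real writer would probably also flush on a timer, so that a quiet period does not leave rows buffered indefinitely.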

In addition to Redshift's slow transaction processing (by modern DB standards) there's another big challenge:
Redshift only supports serializable transaction isolation, most likely as a compromise to support ACID transactions while also optimizing for parallel OLAP mostly-read workload.
That can result in all kinds of concurrency-related failures that would not have been failures on a typical DB that supports read-committed isolation by default.

How to move from a database backend to another on a production Django project?

I would like to move a database in a Django project from a backend to another (in this case azure sql to postgresql, but I want to think of it as a generic situation). I can't use a dump since the databases are different.
I was thinking of something at the django level, like dumpdata, but depending on the amount of available memory and the size of the db sometimes it appears unreliable and crashes.
I have seen solutions that try to break the process into smaller parts that the memory can handle but it was a few years ago, so I was hoping to find other solutions.
So far my searches have failed since they always lead to 'south', which refers to schema migration and not moving data.

I have not implemented this before, but what about the following:
Django supports multiple databases, so just configure DATABASES in your settings file to include both the old Azure SQL database and the new PostgreSQL database. Then create a small script that makes use of bulk_create, reading the data from one DB and writing it to the other.
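The read-from-one, write-to-the-other loop might look roughly like this. To keep the sketch self-contained it does not use Django at all: two in-memory sqlite3 connections stand in for the source and target databases, and `copy_table`, the `item` table, and the chunk size are hypothetical. In a Django script the same shape would presumably be a loop over `Model.objects.using(...)` on the source alias feeding `bulk_create` on the target alias, which I have not tested here.

```python
import sqlite3


def copy_table(src, dst, table, columns, chunk_size=1000):
    """Copy rows from src to dst in fixed-size chunks to bound memory use."""
    # table and columns are interpolated into SQL, so they must come from
    # trusted code, never from user input.
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    cur = src.execute(f"SELECT {col_list} FROM {table}")
    copied = 0
    while True:
        rows = cur.fetchmany(chunk_size)  # at most chunk_size rows in memory
        if not rows:
            break
        with dst:  # one transaction per chunk
            dst.executemany(
                f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})",
                rows,
            )
        copied += len(rows)
    return copied


# Two in-memory databases standing in for the old and new backends.
old_db = sqlite3.connect(":memory:")
new_db = sqlite3.connect(":memory:")
for db in (old_db, new_db):
    db.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
with old_db:
    old_db.executemany(
        "INSERT INTO item (id, name) VALUES (?, ?)",
        [(i, f"item-{i}") for i in range(1, 2501)],
    )

copied = copy_table(old_db, new_db, "item", ["id", "name"], chunk_size=1000)
```

Copying in chunks addresses the memory problem raised in the question, since only one chunk is held at a time. Tables with foreign keys would need to be copied in dependency order.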

How to cache MySQL table in C++ web service

I've got a big users table that I'm caching in a C++ web service (BitTorrent tracker). The entire table is refetched every 5 minutes. This has some drawbacks, like data being up to 5 minutes old and refetching lots of data that hasn't changed.
Is there a simple way to fetch just the changes since last time?
Ideally I'd not have to change the queries that update the data.

Two immediate possibilities come to me:
MySQL Query Cache
Memcached (or similar) Caching Layer
I would try the query cache first as it is likely far easier to set up. Do some basic tests/benchmarks and see if it fits your needs. Memcached will likely be very similar to your existing cache but, as you mention, you'll need to find a better way of invalidating stale cache entries (something that the query cache does for you).
