Django PostgreSQL asynchronous commits

PostgreSQL supports asynchronous commits - that is, the database engine can be configured to report success even though it has not yet completed the write-ahead log sync.
http://www.postgresql.org/docs/8.3/static/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT
This is a useful trade-off: the database is still guaranteed to come back up in a consistent state after a crash, but some transactions that were reported as committed may turn out to have been cleanly aborted.
Obviously for some transactions it's critical that commits remain final - which is why the setting can be configured per transaction.
How can I take advantage of this functionality in Django?

First I second Frank's note. That's the way to do it.
However, if you do this you probably want a helper function that sets this on each API call that may commit. That seems error prone to me, so I probably wouldn't mess with it; instead I would try hard to batch the work into the same transaction to the extent that makes sense. I would further suggest adding a method to your models that reports the current setting (SHOW synchronous_commit) so that you can properly unit test it.
Again, because this is a session setting, it strikes me as a bit dangerous to play around with in this way, but it can be done if you take the necessary precautions.
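As a minimal sketch (my addition, using django.db with made-up helper names), a per-transaction override could look roughly like this; SET LOCAL reverts automatically when the transaction ends, and the SHOW query is what a unit test could assert on:

from django.db import connection, transaction

def save_low_value_records(objs):
    # Hypothetical helper: relax durability for this one transaction only.
    with transaction.atomic():
        with connection.cursor() as cursor:
            # SET LOCAL affects only the current transaction.
            cursor.execute("SET LOCAL synchronous_commit TO OFF")
        for obj in objs:
            obj.save()

def current_synchronous_commit():
    # Call this from inside a transaction in a unit test to check the
    # per-transaction value; outside a transaction it shows the session default.
    with connection.cursor() as cursor:
        cursor.execute("SHOW synchronous_commit")
        return cursor.fetchone()[0]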

Related

Has the transaction behavior changed when a conflict occurs in Firestore in Datastore mode?

I created a new Google Cloud Platform project and Datastore.
The Datastore was created as "Firestore in Datastore mode".
But I think Firestore in Datastore mode and the old Datastore behave differently when a conflict occurs.
For example, in the following case:
procA: -> enter transaction -> get -> put -----------------> exit transaction
procB: -----> enter transaction -> get -> put -> exit transaction
Old Datastore:
procB completes and its data is updated.
procA gets a conflict and its data is rolled back.
Firestore in Datastore mode:
procB waits before exiting its transaction until procA has completed; then a conflict occurs.
procA completes and its data is updated.
Is this the intended behavior?
I cannot find it documented in the Google Cloud Platform documentation.
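(For reference, and not part of the original question: both procA and procB presumably run a read-modify-write along these lines, sketched here with the Python google-cloud-datastore client; the "Counter" kind and "value" property are made up.)

from google.cloud import datastore

client = datastore.Client()

def increment_counter(entity_id):
    # enter transaction -> get -> put -> exit transaction
    with client.transaction():
        key = client.key("Counter", entity_id)   # made-up kind
        entity = client.get(key)
        if entity is None:
            entity = datastore.Entity(key=key)
        entity["value"] = entity.get("value", 0) + 1
        client.put(entity)
    # The commit happens when the with-block exits; two processes doing this
    # on the same entity is what triggers the behavior described above.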
I've been giving it some thought and I think the change may actually be intentional.
In the old behaviour that you describe, the shorter transaction is the one that succeeds even if it started after the longer one, preempting the longer transaction and causing it to fail and be retried. Effectively this gives priority to the shorter transactions.
But imagine a peak of activity with a bunch of shorter transactions: they keep preempting the longer one(s), which keep being retried until eventually they reach the maximum retries limit and fail permanently, increasing datastore contention along the way because of the retries. I actually hit such a scenario in my transaction-heavy app and had to adjust my algorithms to work around it.
By contrast, the new behaviour gives all transactions a fair chance of success regardless of their duration or the level of activity - no priority handling. It comes at a price, true: shorter transactions that start after a longer one and overlap it will take longer overall. IMHO the new behaviour is preferable to the old one.
The behavior you describe is caused by the chosen concurrency mode for Firestore in Datastore mode. The default mode is Pessimistic concurrency for newly created databases. From the concurrency mode documentation:
Pessimistic
Read-write transactions use reader/writer locks to enforce isolation
and serializability. When two or more concurrent read-write
transactions read or write the same data, the lock held by one
transaction can delay the other transactions. If your transaction does
not require any writes, you can improve performance and avoid
contention with other transactions by using a read-only transaction.
To get back the 'old' behavior of Datastore, choose "Optimistic" concurrency instead (link to command). This will make the faster transaction win and remove the blocking behavior.
I would recommend you take a look at the Transactions and batched writes documentation. There you will find more information and examples on how to perform transactions with Firestore.
It also clarifies the get(), set(), update(), and delete() operations.
I can highlight the following from the documentation, which is very important to keep in mind when working with transactions:
Read operations must come before write operations.
A function calling a transaction (transaction function) might run more than once if a concurrent edit affects a document that the transaction reads.
Transaction functions should not directly modify application state.
Transactions will fail when the client is offline.
Let me know if the information helped you!
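(A sketch of the read-before-write rule from that documentation, using the Python google-cloud-firestore client; the "players" collection and "points" field are made-up names, not from the question.)

from google.cloud import firestore

db = firestore.Client()

@firestore.transactional
def transfer_points(transaction, from_ref, to_ref, amount):
    # All reads must come before any writes inside the transaction.
    from_snap = from_ref.get(transaction=transaction)
    to_snap = to_ref.get(transaction=transaction)
    transaction.update(from_ref, {"points": from_snap.get("points") - amount})
    transaction.update(to_ref, {"points": to_snap.get("points") + amount})

# The transaction function may be retried if a concurrent edit touches a
# document it read, so it should not modify application state directly.
transfer_points(db.transaction(),
                db.collection("players").document("alice"),
                db.collection("players").document("bob"),
                10)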

Testing Side Effects in BDDs

I have an API that has the following logic:
Consume from Kafka.
Process the record.
Update the database, if the processing was successful.
If the processing fails, then push it to a Kafka topic.
If pushing to Kafka topic failed, then commit.
If the record was processed successfully, then commit.
If the commit fails, then log and move ahead with consuming the next event.
I am writing BDDs for this API. Currently, I feel like I am testing too many scenarios:
ProcessingFailed -> Database is unchanged -> Event should be pushed to Kafka -> Should be committed.
Kafka push failed -> Should be committed.
Commit failed -> (what to do? Should I check if the log is printed correctly?)
Happy path -> Database updated -> Kafka topic does not contain the event -> Commit was successful.
My question is, what's the proper way to test for such side effects?
Now suppose my 'process the record' step is itself made up of smaller steps:
Fetch from the database.
Make an HTTP call.
Suppose I simulate a 'processing failed' case by bringing my database down. Do I also need to test that the HTTP call was not made?
A good general rule for BDD tests is that each test should have only one reason to fail. For Cucumber this translates to only one Then step in each scenario.
With this as guidance I would recommend writing one scenario per step of the process.
# Consume from Kafka
Given a certain thing has happened
# Process the record
When some action is performed successfully
# Update database if processed successfully
Then some result exists in the database
Then your next scenario starts where the first one left off:
Given a certain thing happened
When the action is performed unsuccessfully
# Push failed message to Kafka queue
Then a failed message is sent
The third scenario picks up where the second one leaves off:
Given a certain thing happened
And the action was performed unsuccessfully
When a failure message is sent
Then a thing should not exist in the database
Each scenario builds off the steps verified in the previous scenarios, being careful to ensure scenarios do not share data, or depend on the success of previously executed scenarios.
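(As a sketch of how a single Then step maps to a single assertion - assuming Python with the behave library, which the question does not specify; all the helper functions are hypothetical.)

# features/steps/processing_steps.py  (hypothetical path)
from behave import given, when, then

@given("a certain thing has happened")
def step_given_event(context):
    context.record = publish_test_record()             # hypothetical helper

@when("some action is performed successfully")
def step_when_processed(context):
    context.result = process_record(context.record)    # hypothetical helper

@then("some result exists in the database")
def step_then_db_updated(context):
    # Exactly one assertion, so the scenario has exactly one reason to fail.
    assert fetch_from_database(context.record.id) is not None  # hypothetical helper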
Currently, I feel like I am testing too many scenarios
My question is, what's the proper way to test for such side effects?
Well, it sounds to me like you are describing a state machine; where the transitions are driven by representations of different effects in the protocol.
Given that, I would normally expect to see tests for each target state.
Depending on your evaluation of the risks, it might make sense to run your automated checks at a number of different grains -- lots of decoupled tests exploring the different corner cases of the state machine itself, some checks to make sure that the orchestration of the different effects is correct, a few tests to make sure the whole mess works when you wire it all together.
Now do I also need to test that the HTTP call was not made?
There are probably two important questions to ask yourself here:
What are the risks of not having an automated test?
Why is just adding tests not effortless?
If the test subject is "so simple that there are obviously no deficiencies", then the investment odds tell us that investing time and money into extra testing is not a favorable play.
On the other hand, if you are looking for an excuse not to test the thing, then you might want to turn a critical eye toward your design. That's especially true if you are adding/changing code in a module that "already works". A big payoff for test investment comes from having many easy accurate tests for the code we are changing on a regular basis, so reluctance to add a new test for code that you are changing is a Big Red Flag[tm] that something Is Not According To Plan.

Work on a Django database without modifying it

I'm developing optimization algorithms which operate on data stored in a PostgreSQL Django database. My algorithms have to repeatedly modify the objects in the database and sometimes revert the changes (they are metaheuristic algorithms, for those who know).
The problem is that I don't want to save the modifications to the PostgreSQL database during the process. I would like to save them at the end, when I'm satisfied with the results of the optimization. I think the solution is to load all the concerned objects into memory, work on them, and save them back to the database at the end.
However it seems to be more difficult than I thought...
Indeed, when I make a Django query (i.e. Model.objects.get or Model.objects.filter), I fear that Django will sometimes fetch the objects from the database and sometimes from its cache, and I'm pretty sure that in some cases these will not be the same instances as the ones I manually loaded into memory (which are the ones I want to work on, because they may have changed since being loaded from the database)...
Is there a way to bypass such problems?
I implemented a kind of custom mini-database, which works, but it's becoming too difficult to maintain and, above all, I don't think it's the simplest or most elegant way to proceed. I thought about dumping the concerned models from the PostgreSQL database into an in-memory one (for performance), working on that in-memory DB and, when my algorithm finishes, updating the original database from the in-memory data (which would require Django to keep a link, perhaps through the pk, between the original objects and those in the in-memory database so it can identify which are the same, and I don't know if that's possible).
Does someone have an insight?
Thank you in advance.
What you are looking for is transactions - one of the most powerful features of an RDBMS. Simply use START TRANSACTION before you start playing around with the data. At the end, if you are happy with the result, use COMMIT. If you don't want your Django app to see the changes, use ROLLBACK.
Due to the default transaction isolation level of PostgreSQL, your Django app will not see whatever changes you are making elsewhere until they are committed. At the same time, whatever changes you make in your SQL console or with other code will be visible to that same code even though they are not yet committed.
Read Committed is the default isolation level in PostgreSQL. When a
transaction uses this isolation level, a SELECT query (without a FOR
UPDATE/SHARE clause) sees only data committed before the query began;
it never sees either uncommitted data or changes committed during
query execution by concurrent transactions. In effect, a SELECT query
sees a snapshot of the database as of the instant the query begins to
run. However, SELECT does see the effects of previous updates executed
within its own transaction, even though they are not yet committed
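(In Django terms - my sketch, not part of the answer - the same idea can be expressed with transaction.atomic(): run the whole optimization inside one atomic block and force a rollback unless you decide to keep the result. Rollback, run_optimization() and is_satisfactory are made-up names.)

from django.db import transaction

class Rollback(Exception):
    """Raised only to abort the surrounding atomic block."""

def optimize_and_maybe_save():
    try:
        with transaction.atomic():
            result = run_optimization()       # modify and save model instances freely
            if not result.is_satisfactory:    # made-up attribute
                raise Rollback                # leaving via an exception rolls everything back
    except Rollback:
        pass                                  # nothing was persisted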

How to profile an openedge database?

Is there a Progress profiling tool that allows me to see the queries executing against an OpenEdge database?
We're doing a migration from an OpenEdge database into a SQL database. In order to map the data correctly we'd like to run certain application reports on the OpenEdge database and see what database queries are being executed to retrieve the data.
Is this possible with some kind of Progress profiling tool (a la SQL Server Profiling)? Preferably free...
Progress is record-oriented, not set-oriented like SQL, so your reports aren't a single query or a set of queries; they are more likely a lot of record lookups combined with what you'd consider query-like operations.
Depending on the version you're running, there is a way to send a signal to the client to see what it is currently doing, but doing so will almost certainly not give you enough information to discern what's going on "under the hood."
Long story short, your options are to get a DataServer product so you can attach the Progress client to a SQL database - this will let you use a SQL database without losing the Progress functionality. The second option is to get a copy of the program's source code to find out how the reports are structured.
Tim is quite right -- without the source code, looking at the queries is unlikely to provide you with much insight.
Nonetheless there are some tools and capabilities that will provide information about queries. Probably the most useful for your purpose would be to specify something similar to:
-logentrytypes QryInfo -logginglevel 3 -clientlog "mylog.log"
at session startup.
You can use session triggers to identify almost anything done by any program, without modifying or having access to the source of those programs. Setting this up may be more work than it is worth for your purpose. We have a testing system built around this idea. One big flaw: triggers cannot be fired for CAN-FIND.

Do long running transactions bog down databases?

Note: I'm using Postgres 9.x and Django ORM
I have some functions in my application which open a transaction, run a few queries, then spend a couple of full seconds doing other things (3rd-party API access, etc.), and then run a few more queries. The queries aren't very expensive, but I've been concerned that, by keeping many transactions open for so long, I'll somehow bog down my database eventually or run out of connections or something. How big of a deal is this, performance-wise?
Keeping a transaction open has pros and cons.
On the plus side, every transaction has an overhead cost. If you can do a couple of related things in one transaction, you normally win performance.
However, you acquire locks on rows or whole tables along the way (especially with any kind of write operation). These are automatically released at the end of the transaction. If other processes might wait for the same resources, it is a very bad idea to call external processes while the transaction stays open.
Maybe you can do the call to the 3rd party API before you acquire any locks, and do all queries in swift succession afterwards?
Read about checking locks in the Postgres Wiki.
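(A sketch of the reordering suggested above, assuming Django's transaction.atomic(); call_third_party_api() and the Order model are placeholders, not from the question.)

from django.db import transaction

def process(order_id):
    # Do the slow external work first, with no transaction (and no locks) open.
    api_result = call_third_party_api(order_id)   # placeholder for the slow 3rd-party call

    # Then do all the database work in one short transaction.
    with transaction.atomic():
        order = Order.objects.select_for_update().get(pk=order_id)  # lock held only briefly
        order.status = "processed"
        order.api_reference = api_result
        order.save()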
While not an exact answer, I can't recommend this presentation highly enough.
“PostgreSQL When It’s Not Your Job” at DjangoCon US
It is from this year's DjangoCon, so there should be a video as well, hopefully soon.
Plus, check out the author's blog; it's a gold mine of useful information on Postgres as a whole and Django in particular. You'll find interesting info about transaction handling there.