Inconsistent RDS query results using EBS+flask-security+sqlalchemy - flask

I have a Flask app running using Elastic Beanstalk, using flask-security and plain sqlalchemy. The app's data is stored on a RDS instance, living outside of EB. I am following the flask-security Quick Start with a session, which doesn't use flask-sqlalchemy. Everything is free tier.
Everything is up and running well, but I've encountered two problems:
1. After a db insert of a certain type, a view that reads all objects of that type gives alternating good/bad results. Literally, on odd refreshes I get a read that includes the newly inserted object, and on even refreshes the newly inserted object is missing. This persists for as long as I've kept refreshing (dozens of times, over several minutes).
2. After going away for an afternoon and coming back, the app's db connection is broken. I suspect my free-tier resources are going to sleep, and the example code is not recovering well.
I'm looking for pointers on how to debug, for a more robust starting code example, or for suggestions on what else to try.
I may try to switch to flask-sqlalchemy (perhaps getting better session handling), or drop flask-security for flask-login (a downgrade in functionality... sniff).
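For reference, here is a sketch of the more defensive engine/session setup I'm considering, based on the plain-SQLAlchemy pattern from the quickstart (the connection string and timeout values are placeholders, not my real config):

# db.py -- illustrative sketch only; names and values are placeholders.
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine(
    "postgresql://user:password@my-rds-endpoint:5432/mydb",  # placeholder RDS URL
    pool_pre_ping=True,   # test connections before use so stale ones are replaced
    pool_recycle=280,     # recycle connections before idle timeouts kill them
)
db_session = scoped_session(sessionmaker(bind=engine))

# In the Flask app factory:
# app.teardown_appcontext(lambda exc: db_session.remove())
# Removing the scoped session after each request makes the next request start a
# fresh transaction that can see rows committed by other workers; the alternating
# good/bad reads look consistent with one worker holding an old snapshot open.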

Related

Why do we need to set up AWS and a Postgres DB when we deploy our app using Heroku?

I'm building a web API by following the YouTube video below, and up until the AWS S3 bucket setup I understood everything fine. But he first deploys everything locally, then after making sure everything works he transfers all static files to AWS, and for the DB he switches from SQLite3 to Postgres.
django portfolio
I still don't understand this part: why do we need to put our static files on AWS and create a PostgreSQL database when Django already has a default SQLite3 database? I'm thinking that if I'm the only admin, just connecting my GitHub repo to Heroku should be enough, and any time I change something in the API I just need to push those changes to the GitHub master branch and that should be it.
Why do we need to use AWS to set up a static file location and set up an RDS (relational database) and do these things from the beginning? Still not getting it!
Can anybody help explain this?
Thanks
Databases
There are several reasons a video guide would encourage you to switch from SQLite to a database server such as MySQL or PostgreSQL:
SQLite is great but doesn't scale well if you're expecting a lot of traffic
SQLite doesn't work if you want to distribute your app across multiple servers. Going back to Heroku, if you serve your app with multiple dynos, you'll have a problem because each dyno will use a distinct SQLite database. If you edit something through the admin, it will happen on one of these databases, at random, leading to inconsistencies
Some Django features aren't available on SQLite
SQLite is the default database in Django because it works out of the box, and is extremely fast and easy to use in local/development environments for prototyping.
However, it is usually not suited for production websites. Additionally, while it can be tempting to store your sqlite.db file along with your code, for instance in a git repository, it is considered a bad practice because your database can contain sensitive data (such as passwords, usernames, emails, etc.). Hence, a strict separation between your code and data is a good practice.
Another way to put it is that your code and your data have different lifecycles. You want to be able to edit data in your database without redeploying your code, and update your code without touching your database.
Even if you can remove public access to some files through GitHub, this is not a good practice, because when you work in a team with multiple developers, developers may have access to the code but not to the production data, which is usually sensitive. If you work with 5 people and each one of them has a copy of your database, the risk of losing it or having it stolen is 5x higher ;)
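For reference, switching Django from the default SQLite database to Postgres is mostly a settings change. A minimal sketch (host, name and credentials are placeholders; on Heroku they typically come from the DATABASE_URL environment variable):

# settings.py -- illustrative sketch; all values below are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "myapp",
        "USER": "myapp",
        "PASSWORD": "secret",
        "HOST": "my-db-host.example.com",
        "PORT": "5432",
    }
}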
Static files
When you work locally, Django's built-in runserver command handles the serving of static assets such as CSS, Javascript and images for you.
However, this server is not designed for production use either. It works great in development, but will start to fail very fast on a production website, which has to handle way more requests than your local version.
Because of that, you need to host these static files somewhere else, and AWS is one place where you can do that. AWS will serve those files for you, in a very efficient way. There are other options available, for instance configuring a reverse proxy with Nginx to serve the files for you, if you're using a dedicated server.
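One common way to do that with Django is the django-storages package pointing at an S3 bucket. A sketch of the relevant settings (the bucket name and keys are placeholders and are usually read from environment variables):

# settings.py -- illustrative sketch; assumes django-storages and boto3 are installed.
INSTALLED_APPS = [
    # ...
    "storages",
]

AWS_ACCESS_KEY_ID = "..."              # placeholder
AWS_SECRET_ACCESS_KEY = "..."          # placeholder
AWS_STORAGE_BUCKET_NAME = "my-static-bucket"

STATICFILES_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
STATIC_URL = "https://my-static-bucket.s3.amazonaws.com/"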
As far as I can tell, the progression you describe from the video takes you from a local development environment to a more efficient and scalable production setup. That is to be expected, because it's less daunting to start with something really simple (SQLite, Django's built-in runserver) and move on to more complex and abstract topics and tools later on.

Google Cloud Datastore rolls back previously saved rows

I am using Django with Google Cloud Datastore via Djangae (https://djangae.org/).
I am new to these tech stacks and currently facing one strange issue.
When I persist data by calling Model.save(commit=true), the data gets saved into Cloud Datastore, but after 4-5 minutes it gets reverted.
To test it further I tried to directly change the value in the database, but that also got reverted after some time.
I am confused because I see no error or exception. I am using an atomic transaction and have wrapped my code with try/except to catch any exception, but no luck.
Could someone please advise me on how to debug this further?
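For context, the save-inside-a-transaction wrapping mentioned above looks roughly like this (the form, field names and logging are illustrative, not my exact code):

# Illustrative sketch only; the form and names are hypothetical.
import logging

from django.db import transaction

logger = logging.getLogger(__name__)

def save_entity(form):
    try:
        with transaction.atomic():
            obj = form.save(commit=True)  # commit=True is the ModelForm save argument
        logger.info("Saved %s with pk=%s", type(obj).__name__, obj.pk)
        return obj
    except Exception:
        logger.exception("Save failed; the transaction was rolled back")
        raise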
I got a lead here. I had been pointing at the Datastore with multiple versions of my code, and a few of them were stuck in an infinite loop hitting the same Kind in Datastore. Killing all the stale versions made the DB consistent with my changes. I wanted to post an update so that others can get an idea if something similar happens to them.

Django bulk_upsert keeps queries in memory, causing memory errors

For our Django web server we have quite limited resources, which means we have to be careful with the amount of memory we use. One part of our web server is a cron job (using Celery and RabbitMQ) that parses a ~130MB CSV file into our Postgres database. The CSV file is saved to disk and then read row by row using Python's csv module. Because the CSV file is basically a feed, we use bulk_upsert from the custom Postgres manager in django-postgres-extra to upsert our data and override existing entries. Recently we started experiencing memory errors, and we eventually found out they were caused by Django.
Running mem_top() showed us that Django was storing massive upsert queries (INSERT ... ON CONFLICT DO), including their metadata, in memory. Each bulk_upsert of 15,000 rows would add 40MB of memory used by Python, leading to a total of 1GB of memory used when the job finished, as we upsert 750,000 rows in total. Apparently Django does not release the query from memory after it's finished. Running the cron job without the upsert call leads to a max memory usage of 80MB, of which 60MB is the default for Celery.
We tried running gc.collect() and django.db.reset_queries(), but the queries are still stored in memory. Our DEBUG setting is False and CONN_MAX_AGE is not set. Currently we're out of clues for where to look to fix this issue, and we can't run our cron jobs now. Do you know of any last resorts to try to resolve this issue?
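For reference, the chunked upsert loop, together with the cleanup calls we tried, looks roughly like this (the model, CSV fields and conflict target are placeholders; the bulk_upsert call follows django-postgres-extra's manager as we use it):

# Illustrative sketch only; names are placeholders.
import csv
import gc

from django.db import reset_queries

from myapp.models import FeedEntry  # hypothetical model using django-postgres-extra's manager

CHUNK_SIZE = 15000

def import_feed(path):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append({"external_id": row["id"], "value": row["value"]})
            if len(chunk) >= CHUNK_SIZE:
                FeedEntry.objects.bulk_upsert(conflict_target=["external_id"], rows=chunk)
                chunk = []
                reset_queries()  # drop Django's per-connection query log
                gc.collect()     # did not help in our case; the memory was held elsewhere
        if chunk:
            FeedEntry.objects.bulk_upsert(conflict_target=["external_id"], rows=chunk)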
Some more meta info regarding our server:
django==2.1.3
django-elasticsearch-dsl==0.5.1
elasticsearch-dsl==6.1.0
psycopg2-binary==2.7.5
gunicorn==19.9.0
celery==4.3.0
django-celery-beat==1.5.0
django-postgres-extra==1.22
Thank you very much in advance!
Today I found the solution to our issue, so I thought it would be great to share. It turned out that the issue was a combination of Django and Sentry (which we only use on our production server). Django would log the query and Sentry would then catch this log and keep it in memory for some reason. As each raw SQL query was about 40MB, this ate a lot of memory. For now, we have turned Sentry off on our cron job server and are looking into a way to clear the logs kept by Sentry.
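If turning Sentry off entirely turns out to be too blunt, one option we're looking at is limiting what the SDK keeps per event. A rough sketch, assuming the sentry-sdk package with its Django integration (the DSN is a placeholder):

# Illustrative sketch; adjust to however Sentry is initialised in your project.
import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

def drop_query_breadcrumbs(crumb, hint):
    # SQL query breadcrumbs carry the raw statement text; dropping them should keep
    # the huge INSERT ... ON CONFLICT statements out of memory.
    if crumb.get("category") == "query":
        return None
    return crumb

sentry_sdk.init(
    dsn="https://example@sentry.invalid/1",  # placeholder
    integrations=[DjangoIntegration()],
    max_breadcrumbs=20,                      # keep the breadcrumb buffer small
    before_breadcrumb=drop_query_breadcrumbs,
)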

Neo4j-Neoclipse Concurrent access issue

I am creating a few nodes in neo4j using spring data, and then I am also accessing them via findByPropertyValue(prop, val).
Everything works correctly when I am reading/writing to the embedded DB using spring data.
Now, as per Michael Hunger's book Good Relationships, I opened Neoclipse with a read-only connection to the Neo4j database my Java application was actively using.
But it somehow still says that Neo4j's kernel is actively being used by some other program.
Question 1: What am I doing wrong here?
Also, I have created a few nodes and persisted them. Whenever I restart the embedded neo4j db, I can view all my nodes when I do findAll().
Question 2: When I try to visualize all my nodes in Neoclipse (assuming the db is accessible), I can only see one single node, which is empty and has no properties associated with it, whereas I have a name property defined.
I started my Java app, persisted a few nodes, traversed them, and saw the output in the Java console. Then I shut down the application, started the Neoclipse IDE, connected to my DB, and found that no nodes are present (the problem from Question 2).
After trying again, I went back to my Java app and ran it, and surprisingly I got a Lucene file corruption error (unrecognized file format). I had made no code changes and had not deleted anything, but I still got this error.
Question 3: I'm not sure what I am doing wrong, but since I found this discussion of my bug (Lucene/concurrent db access), I would like to know whether this is a bug or a programming error on my part. (Does it have something to do with Eclipse Juno?)
Any reply would be highly appreciated.
Make sure you are properly committing the transactions.
Data is not immediately flushed to disk by Neo4j, so you might not see the nodes immediately in Neoclipse. I always restart the application that is using Neo4j in embedded mode so that the data is flushed to disk, and then open Neoclipse.
Posting your code would help us to check for any issues.

postgres + GeoDjango - data doesn't seem to save

I have a small python script that pushes data to my django postgres db.
It imports the relevant model from a django project and uses the .save function to save the data to the db without issue.
Yesterday the system was running fine. I started and stopped both my django project and the python script many times over the course of the day, but never rebooted or powered off my computer, until the end of the day.
Today I have discovered that the data is no longer in the db!
This seems silly, as I've probably forgotten to do something obvious, but I thought that when the save function is called on a model, the data is committed to the db.
So this answer covers where to start troubleshooting problems like this, since the question is quite vague and we don't have enough info to troubleshoot effectively.
If this ever happens again, the first thing to do is to turn on statement logging for PostgreSQL and look at the statements as they come in. This should show you begin and commit statements as well as the queries. It's virtually impossible to troubleshoot this sort of problem without access to the queries. Things to look for include missing COMMITs, and missing statements.
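If the app side is Django and you're in a development environment with DEBUG = True, you can also capture the statements on the application side while you enable log_statement = 'all' on the Postgres server. A minimal sketch of the Django logging config (assuming you want the queries echoed to the console):

# settings.py -- illustrative sketch; Django only emits these messages when DEBUG is True.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}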
After that, the next thing to do is to look at the circumstances under which your computer rebooted. Is it possible it did so before an expected commit? Or did it lose power and not have the transaction log flushed to disk in time?
Those two should rule out just about all possible causes on the db side in a development environment. In a production environment for old versions of PostgreSQL you do want to verify that the system has autovacuum running properly and that you aren't getting warnings about xid wraparound. In newer versions this is not a problem because PostgreSQL will refuse to accept queries when approaching xid wraparound.