I'm importing large amounts of data into Django from CSV. I created a script with django-extensions which creates objects like this:
import csv

def process_row(row):
    return {"foo": row[0], "bar": row[1]}

data = []
for row in csv.reader(csvfile):
    if len(data) > 5000:
        MyModel.objects.bulk_create(data)
        data = []
    # build a model instance so bulk_create() can insert it
    data.append(MyModel(**process_row(row)))
if data:
    MyModel.objects.bulk_create(data)  # flush the final partial batch
Basically, this works really well and fast, but the problem is the steadily rising RAM usage of the Docker container. It never goes down. At the beginning the container uses ~1.5 GB; after tens of bulk_create() calls it is up to 10 GB, and I have to restart Docker Desktop to reset it.
P.S. docker stats says it only uses ~3GB, so I think the problem is with Docker Desktop
[docker stats screenshot]
I'm using the NodeJS demo code from here: https://questdb.io/docs/develop/insert-data/ to insert data into QuestDB like this:
setInterval(() => {
  run();  // run() inserts rows, as in the QuestDB demo code linked above
}, 3000);
(2 Docker containers on a bridge network)
And I have a browser window open to run
select count(*) from 'trades'
However, the query keeps returning the same result each time I run it. If I restart the Docker containers, the query returns an updated value, so I assume the values are successfully getting into the database but are not being reflected in the Postgres queries. I see the same behavior when I use the pg client in Node.
Any explanation or theory that would help me root cause this?
In QuestDB, data is not visible to queries until it has been committed.
In the case of the ILP receiver, commits don't happen after each row or even on disconnect. Instead, QuestDB uses a number of properties to determine when to commit efficiently.
In this case the easiest way to reduce the insert-to-commit delay is to lower cairo.max.uncommitted.rows to e.g. 10 in conf/server.conf (plus an instance/container restart) and then insert 10+ records.
You'll find more details on ILP commits at
https://questdb.io/docs/reference/api/ilp/tcp-receiver/#commit-strategy
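For reference, that amounts to a single line in the server config (a sketch; the default is much larger, and the instance must be restarted to pick the change up):
# conf/server.conf
cairo.max.uncommitted.rows=10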
I am running a Django app (Wagtail) in a Kubernetes cluster along with Redis. The two are connected via django-redis. This is how my cache backend configuration looks:
{
    "BACKEND": "django_redis.cache.RedisCache",
    "LOCATION": "redis://redis-service:6379/0",
    "OPTIONS": {
        "CLIENT_CLASS": "django_redis.client.DefaultClient",
        "CONNECTION_POOL_KWARGS": {
            "max_connections": 1000
        }
    }
}
This works just fine. I can see keys getting created in Redis and the app is also blazing fast thanks to Redis.
Now the real issue is that every once in a while the app slows down for some time. Upon investigation we found that the cause is a key of size ~90 MB being created in Redis. Processing this key takes some time and slows the app down.
For comparison, all other keys are always under 1 MB; just this one key gets created randomly on one type of endpoint, but not every time.
I tried to check the contents of the key after unpickling and it is a normal amount of data, just like the other keys, but running
redis-cli --bigkeys
gives output something like this:
Biggest string found '":1:views.decorators.cache.cache_page.XXXXXXX.GET.9b37c27c4fc5f19eb59d8e8ba7e1999e.83bf0aec533965b875782854d37402b7.en-us.UTC"' has 90709641 bytes
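For reference, a minimal sketch of that check using redis-py; the hostname comes from the LOCATION setting above and the key name is the one reported by --bigkeys (without the surrounding quotes):
import pickle
import redis

r = redis.Redis(host="redis-service", port=6379, db=0)

# Key name as reported by redis-cli --bigkeys.
key = ":1:views.decorators.cache.cache_page.XXXXXXX.GET.9b37c27c4fc5f19eb59d8e8ba7e1999e.83bf0aec533965b875782854d37402b7.en-us.UTC"

raw = r.get(key)
print("raw size in bytes:", len(raw) if raw else 0)   # matches what --bigkeys reports
print("memory usage:", r.memory_usage(key))           # includes Redis overhead

# django-redis stores the value pickled by default; the Django project needs to be
# importable so the pickled page/response object can be loaded.
value = pickle.loads(raw)
print("type of cached value:", type(value))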
Has someone seen a similar issue? Any way to debug the root cause?
django-redis version "==4.12.1"
wagtail version "==2.11.1"
django version "==3.1.3"
I am in the process of migrating a database from an external server to Cloud SQL 2nd gen. I have been following the recommended steps; the 2TB mysqldump process completed and replication started. However, I got an error:
'Error ''Access denied for user ''skip-grants user''#''skip-grants host'' (using password: NO)'' on query. Default database: ''mondovo_db''. Query: ''LOAD DATA INFILE ''/mysql/tmp/SQL_LOAD-0a868f6d-8681-11e9-b5d3-42010a8000a8-6498057-322806.data'' IGNORE INTO TABLE seoi_volume_update_tracker FIELDS TERMINATED BY ''^#^'' ENCLOSED BY '''' ESCAPED BY ''\'' LINES TERMINATED BY ''^|^'' (keyword_search_volume_id)'''
Two questions:
1) I'm guessing the error has come about because Cloud SQL requires LOAD DATA LOCAL INFILE instead of LOAD DATA INFILE? However, I am quite sure that on the master we only run LOAD DATA LOCAL INFILE, so I am not sure how LOCAL gets dropped during replication. Is that possible?
2) I can't stop the slave to skip the error and restart, since SUPER privileges aren't available, so I am not sure how to skip this error and also avoid it in the future while the final sync happens. Suggestions?
There was no way to work around the slave replication error in Google Cloud SQL, so I had to come up with another way.
Since replication wasn't going to work, I had to copy all the databases over. However, because the aggregate size of all my DBs was 2 TB, it was going to take a long time.
The final strategy that took the least amount of time:
1) Pre-requisite: you need at least 1.5x the current database size in free disk space on your SQL drive. My 2TB DB was on a 2.7TB SSD, so I needed to temporarily move everything to a 6TB SSD before I could proceed with the steps below. DO NOT proceed without sufficient disk space; you'll waste a lot of your time, as I did.
2) Install cloudsql-import on your server. Without this you can't proceed, and it took me a while to discover it. It makes the transfer of your SQL dumps to Google quick.
3) I had multiple databases to migrate. If you are in a similar situation, pick one at a time, and for the sites that access that DB, prevent any further inserts/updates. I needed to put a "Website under Maintenance" page on each site while I executed the operations outlined below.
4) Run the commands in the steps below in a separate screen. I launched a few processes in parallel on different screens.
screen -S DB_NAME_import_process
5) Run mysqldump using the following command; note that the output is an SQL file, not a compressed file:
mysqldump {DB_NAME} --hex-blob --default-character-set=utf8mb4 --skip-set-charset --skip-triggers --no-autocommit --single-transaction --set-gtid-purged=off > {DB_NAME}.sql
6) (Optional) For my largest DB of around 1.2TB, I also split the DB backup into individual table SQL files using the script mentioned here: https://stackoverflow.com/a/9949414/1396252
7) For each of the files dumped, I converted the INSERT commands into INSERT IGNORE because I didn't want any duplicate errors during the import process.
cat {DB_OR_TABLE_NAME}.sql | sed s/"^INSERT"/"INSERT IGNORE"/g > new_{DB_OR_TABLE_NAME}_ignore.sql
8) Create a database by the same name on Google Cloud SQL that you want to import. Also create a global user that has permission to access all the databases.
9) Now, we import the SQL files using the cloudsql-import plugin. If you split the larger DB into individual table files in Step 6, use the cat command to combine a batch of them into a single file and make as many batch files as you see appropriate.
Run the following command:
cloudsql-import --dump={DB_OR_TABLE_NAME}.sql --dsn='{DB_USER_ON_GCLOUD}:{DB_PASSWORD}@tcp({GCLOUD_SQL_PUBLIC_IP}:3306)/{DB_NAME_CREATED_ON_GOOGLE}'
10) While the process is running, you can detach from the screen session using Ctrl+a then d, and reconnect to the screen later to check on progress. You can create another screen session and repeat the same steps for each of the DBs/batches of tables that you need to import.
Because of the large sizes I had to import, I believe it took a day or two; I don't remember exactly since it's been a few months, but I know it was much faster than any other way I tried. I had also tried using Google's copy utility to move the SQL files to Cloud Storage and then Cloud SQL's built-in visual import tool, but that was slow and nowhere near as fast as cloudsql-import. I would recommend this method until Google fixes the ability to skip slave errors.
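The per-database loop (steps 5, 7 and 9) could also be scripted rather than run by hand in separate screens. A rough Python sketch under that assumption; the database list and credentials are placeholders, and it runs the same commands shown above:
import subprocess

# Placeholders; substitute your own values.
DATABASES = ["mondovo_db"]
GCLOUD_IP = "GCLOUD_SQL_PUBLIC_IP"
DB_USER = "DB_USER_ON_GCLOUD"
DB_PASSWORD = "DB_PASSWORD"

for db in DATABASES:
    dump_file = f"{db}.sql"
    ignore_file = f"new_{db}_ignore.sql"

    # Step 5: dump the database with the same flags as above.
    with open(dump_file, "w") as out:
        subprocess.run(
            ["mysqldump", db, "--hex-blob", "--default-character-set=utf8mb4",
             "--skip-set-charset", "--skip-triggers", "--no-autocommit",
             "--single-transaction", "--set-gtid-purged=off"],
            stdout=out, check=True)

    # Step 7: turn INSERT into INSERT IGNORE, like the sed one-liner.
    with open(dump_file) as src, open(ignore_file, "w") as dst:
        for line in src:
            if line.startswith("INSERT"):
                line = "INSERT IGNORE" + line[len("INSERT"):]
            dst.write(line)

    # Step 9: import with cloudsql-import (Go-style DSN: user:password@tcp(host)/db).
    dsn = f"{DB_USER}:{DB_PASSWORD}@tcp({GCLOUD_IP}:3306)/{db}"
    subprocess.run(["cloudsql-import", f"--dump={ignore_file}", f"--dsn={dsn}"],
                   check=True)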
We are developing an app in Ionic 3 which uses PouchDB and CouchDB. We would like to launch in mid-February, but we are worried that if the database grows too much it could run out of storage on the device.
To test this we'd like to insert thousands of records and check the database size, and here we have the problem: we can't find out how to get the local DB size.
I was digging through the PouchDB documentation and only found how to get info about the remote database size, not the local one. I don't think the remote size is necessarily equal to the local one. Anyone have an idea?
Thanks
I found a partial solution; it is partial because it only works on Android, Chrome and Firefox, but not on iOS and Safari. I get the used and available storage with this:
calculaEspacio() {
  let nav: any = navigator;
  nav.storage.estimate().then((obj) => {
    // usage/quota are reported in bytes; convert to MB with two decimals
    this.usado = Math.round((obj.usage / 1048576) * 100) / 100;
    this.total = Math.round((obj.quota / 1048576) * 100) / 100;
  });
}
Does anyone know the equivalent for WebKit?
Example
npm install pouchdb pouchdb-size
// index.js
var PouchDB = require('pouchdb');
PouchDB.plugin(require('pouchdb-size'));

var db = new PouchDB('test');
db.installSizeWrapper();
db.info().then(function (resp) {
  // resp will contain disk_size
});
pouchdb-size https://github.com/pouchdb/pouchdb-size
Adds disk_size to info()'s output for your *down-backed PouchDBs.
Tested with leveldown, sqldown, jsondown, locket and medeadown. When it can't determine the database size, it falls back to the default info() output.
Full information in the repository linked above.
If you're using Ionic 3, I guess you can use the SQLite Plugin to have unlimited data
We are running a Python process which runs this stored procedure, importing files from a certain directory into the Postgres database. The files are first imported into an in-memory table and then into the disk table. The actual size of the in-memory table should never really grow beyond 30 MB, but as the table is constantly updated, it grows anyway (because of dead tuples). To keep things in check, we need to perform a CLUSTER operation on the table.

I am using the psycopg2 module to run the stored procedure and to CLUSTER the table, but if the import process is running, the size of the table never goes down. If I stop the import process and then run CLUSTER, the size does go down. For performance reasons, I need to be able to run the CLUSTER command without stopping the import process.
I tried manual commits and ISOLATION_LEVEL_AUTOCOMMIT, but none of it has worked.
Below is sample code for the process:
while True:
    # ... get the filenames in the directory into `filenames` ...
    for filepath in filenames:
        conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
        cursor = conn.cursor()
        # Calls a PostgreSQL function that reads a file and imports it into
        # a table via INSERT statements and DELETEs any records that have the
        # same unique key as any of the records in the file.
        cursor.execute("SELECT import('%s', '%s');" % (filepath, str(db_timestamp)))
        conn.commit()
        cursor.close()
        conn.close()
        os.remove(get_media_path(fname))
With a similar conn object, I want to run the CLUSTER command once an hour:
conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cursor = conn.cursor()
cursor.execute("CLUSTER table_name")
conn.commit()
cursor.close()
conn.close()
Also, I tried setting:
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
Another piece of information:
I have all of this running inside a Django environment. I could not use Django connection objects for this task because Django would not release connections with my threading code and soon the database stopped accepting connections. Could this mixed environment have an effect on psycopg2?
A few observations:
Running the CLUSTER command while the import process is running: the size doesn't go down.
When I stop the import process and then run CLUSTER: the size does go down.
When I stop the import process, start it back up, and after that run the CLUSTER command: the size does go down.
Any thoughts on the problem?
From the manual:
When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.
Are you sure you have to CLUSTER every hour? With a better fillfactor and autovacuum, your table won't grow that much and you won't have dead tuples in the table.
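For illustration, that kind of per-table tuning could look roughly like this (a sketch only; the values are placeholders to adjust for your workload, and table_name is the table from the question):
import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
conn.autocommit = True
cursor = conn.cursor()

# Leave free space in each page so updates can stay on the same page (HOT updates),
# and make autovacuum run after roughly 5% of the rows have changed.
cursor.execute("""
    ALTER TABLE table_name SET (
        fillfactor = 70,
        autovacuum_vacuum_scale_factor = 0.05,
        autovacuum_analyze_scale_factor = 0.05
    )
""")

cursor.close()
conn.close()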
OK - I found the culprit.
The problem was that somehow CLUSTER or VACUUM was not removing the dead tuples, because of some odd interaction that happened when we used psycopg2 directly inside the Django environment. After isolating the psycopg2 code and removing the Django-related code from the import process, everything worked fine. This solved the problem and now I can VACUUM or CLUSTER the table without stopping the import process.