QuestDB Line Protocol - questdb

I'm using the NodeJS demo code from here: https://questdb.io/docs/develop/insert-data/ to insert data into QuestDB like this:
setInterval(() => {
  run();
}, 3000);
(2 Docker containers on a bridge network)
And I have a browser window open to run
select count(*) from 'trades'
However, the query keeps returning the same result every time I run it. If I restart the Docker containers, the query returns an updated value, so I assume the values are successfully getting into the database but are not being reflected in the Postgres queries. I see the same behavior when I use the pg client in Node.
Any explanation or theory that would help me root cause this?

In QuestDB, data is not visible to queries until it's committed.
In the case of the ILP receiver, commits don't happen after each row or even on disconnect. Instead, QuestDB uses a number of properties to determine when it is efficient to commit.
In this case, the easiest way to reduce the insert-to-commit delay would be to reduce cairo.max.uncommitted.rows to e.g. 10 in conf/server.conf (plus an instance/container restart) and then insert 10+ records.
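For illustration, the change in conf/server.conf would look roughly like this (10 is just an example value):
# maximum number of uncommitted rows before a commit is triggered
cairo.max.uncommitted.rows=10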
You'll find more details on ILP commits at
https://questdb.io/docs/reference/api/ilp/tcp-receiver/#commit-strategy

Related

GCP Datastore times out on large download

I'm using Objectify to access my GCP Datastore set of Entities. I have a full list of around 22,000 items that I need to load into the frontend:
List<Record> recs = ofy().load().type(Record.class).order("-sync").list();
The number of records has recently increased and I get an error from the backend:
com.google.apphosting.runtime.HardDeadlineExceededError: This request (00000185caff7b0c) started at 2023/01/19 17:06:58.956 UTC and was still executing at 2023/01/19 17:08:02.545 UTC.
I thought that the move to Cloud Firestore in Datastore mode last year would have fixed this problem.
My only solution is to break down the load() into batches using 2 or 3 calls to my Ofy Service.
Is there a better way to grab all these Entities in one go?
Thanks
Tim

Why is my Django app creating huge keys in Redis

I am running a Django app (Wagtail) in a Kubernetes cluster along with Redis. The two pieces are connected by django-redis. This is what my backend configuration looks like:
{
    "BACKEND": "django_redis.cache.RedisCache",
    "LOCATION": "redis://redis-service:6379/0",
    "OPTIONS": {
        "CLIENT_CLASS": "django_redis.client.DefaultClient",
        "CONNECTION_POOL_KWARGS": {
            "max_connections": 1000
        }
    }
}
This works just fine. I can see keys getting created in Redis, and the app is also blazing fast thanks to Redis.
Now the real issue: every once in a while the app slows down for some time. Upon investigation we found that the cause is a key of ~90 MB being created in Redis; processing this key takes some time and slows the app down.
For comparison, the other keys are always less than 1 MB; just this one key gets created randomly on one type of endpoint, but not every time.
I tried to check the contents of the key after unpickling it, and it is a normal amount of data, just like the other keys. But running
redis-cli --bigkeys
gives output something like this
Biggest string found '":1:views.decorators.cache.cache_page.XXXXXXX.GET.9b37c27c4fc5f19eb59d8e8ba7e1999e.83bf0aec533965b875782854d37402b7.en-us.UTC"' has 90709641 bytes
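For reference, the unpickling check mentioned above can be done roughly like this (a sketch; it assumes django-redis' default pickle serializer with no compression, and the key name is pasted from the --bigkeys output):
from django_redis import get_redis_connection
import pickle

con = get_redis_connection("default")  # same connection pool django-redis uses
key = ":1:views.decorators.cache.cache_page.XXXXXXX.GET.9b37c27c4fc5f19eb59d8e8ba7e1999e.83bf0aec533965b875782854d37402b7.en-us.UTC"
raw = con.get(key)
print(len(raw))                 # size of the stored, pickled value (~90 MB here)
response = pickle.loads(raw)    # cache_page stores the whole HttpResponse
print(len(response.content))    # size of the actual response body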
Has someone seen a similar issue? Any way to debug the root cause?
django-redis version "==4.12.1"
wagtail version "==2.11.1"
django version "==3.1.3"

How to skip slave replication errors on Google Cloud SQL 2nd Gen

I am in the process of migrating a database from an external server to Cloud SQL 2nd gen. I have been following the recommended steps, the 2 TB mysqldump process completed, and replication started. However, I got an error:
'Error ''Access denied for user ''skip-grants user''@''skip-grants host'' (using password: NO)'' on query. Default database: ''mondovo_db''. Query: ''LOAD DATA INFILE ''/mysql/tmp/SQL_LOAD-0a868f6d-8681-11e9-b5d3-42010a8000a8-6498057-322806.data'' IGNORE INTO TABLE seoi_volume_update_tracker FIELDS TERMINATED BY ''^#^'' ENCLOSED BY '''' ESCAPED BY ''\'' LINES TERMINATED BY ''^|^'' (keyword_search_volume_id)'''
Two questions:
1) I'm guessing the error has come about because Cloud SQL requires LOAD DATA LOCAL INFILE instead of LOAD DATA INFILE? However, I am quite sure that on the master we only run LOAD DATA LOCAL INFILE, so I am not sure how the LOCAL gets dropped during replication; is that possible?
2) I can't stop the slave to skip the error and restart, since SUPER privileges aren't available, so I am not sure how to skip this error and also avoid it in the future while the final sync happens. Suggestions?
There was no way to work around the slave replication error in Google Cloud SQL, so I had to come up with another way.
Since replication wasn't going to work, I had to copy all of the databases over. However, with the aggregate size of all my DBs at 2 TB, that was going to take a long time.
The final strategy that took the least amount of time:
1) Prerequisite: you need at least 1.5x the current database size in free disk space on your SQL drive. My 2 TB DB was on a 2.7 TB SSD, so I had to temporarily move everything to a 6 TB SSD before I could proceed with the steps below. DO NOT proceed without sufficient disk space; you'll waste a lot of your time, as I did.
2) Install cloudsql-import on your server. Without it you can't proceed, and this took me a while to discover. It facilitates the quick transfer of your SQL dumps to Google.
3) I had multiple databases to migrate, so if you're in a similar situation, pick one at a time and, for the sites that access that DB, prevent any further insertions/updates. I put a "Website under Maintenance" page on each site while I executed the operations outlined below.
4) Run the commands in the steps below in a separate screen. I launched a few processes in parallel on different screens.
screen -S DB_NAME_import_process
5) Run mysqldump using the following command; note that the output is a plain SQL file, not a compressed file:
mysqldump {DB_NAME} --hex-blob --default-character-set=utf8mb4 --skip-set-charset --skip-triggers --no-autocommit --single-transaction --set-gtid-purged=off > {DB_NAME}.sql
6) (Optional) For my largest DB of around 1.2TB, I also split the DB backup into individual table SQL files using the script mentioned here: https://stackoverflow.com/a/9949414/1396252
7) For each of the files dumped, I converted the INSERT commands into INSERT IGNORE because I didn't want any further duplicate-key errors during the import process.
cat {DB_OR_TABLE_NAME}.sql | sed s/"^INSERT"/"INSERT IGNORE"/g > new_{DB_OR_TABLE_NAME}_ignore.sql
8) On Google Cloud SQL, create a database with the same name as the one you want to import. Also create a global user that has permission to access all the databases.
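As a rough sketch, the same can be done with plain SQL along these lines (the names are the placeholders used in the import command below; the Cloud Console works too):
CREATE DATABASE {DB_NAME_CREATED_ON_GOOGLE} CHARACTER SET utf8mb4;
CREATE USER '{DB_USER_ON_GCLOUD}'@'%' IDENTIFIED BY '{DB_PASSWORD}';
GRANT ALL PRIVILEGES ON *.* TO '{DB_USER_ON_GCLOUD}'@'%';
FLUSH PRIVILEGES;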
9) Now, we import the SQL files using the cloudsql-import plugin. If you split the larger DB into individual table files in Step 6, use the cat command to combine a batch of them into a single file and make as many batch files as you see appropriate.
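For example, combining a few of the per-table files from step 7 into one batch (file names are placeholders):
cat new_table_a_ignore.sql new_table_b_ignore.sql new_table_c_ignore.sql > batch_01.sql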
Run the following command:
cloudsql-import --dump={DB_OR_TABLE_NAME}.sql --dsn='{DB_USER_ON_GCLOUD}:{DB_PASSWORD}@tcp({GCLOUD_SQL_PUBLIC_IP}:3306)/{DB_NAME_CREATED_ON_GOOGLE}'
10) While the process is running, you can detach from the screen session using Ctrl+a followed by Ctrl+d, and then reconnect to the screen later to check on progress. You can create another screen session and repeat the same steps for each of the DBs/batches of tables that you need to import.
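To reattach to a detached session later, the standard screen commands apply:
screen -ls                           # list running screen sessions
screen -r DB_NAME_import_process     # reattach to the session started in step 4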
Because of the large sizes I had to import, I believe it took me a day or two; I don't remember exactly now since it's been a few months, but I know it was much faster than any other way. I had tried using Google's copy utility to copy the SQL files to Cloud Storage and then using Cloud SQL's built-in visual import tool, but that was slow and not as fast as cloudsql-import. I would recommend this method until Google adds the ability to skip slave errors.

How to scale concurrent ETL tasks up to an arbitrary number in SSIS?

Problem (see Setup Description below)
How can I scale individual tasks (e.g. downloading and parsing) to an arbitrary number of concurrent executions (e.g. 500) in SSIS?
Setup Description
Our setup is that we have a list of feed URLs we want to visit, fetch all items from, and insert them into the database.
Currently a PHP script downloads them concurrently, parses them sequentially, and dumps them into CSV files, which are later inserted into the database using LOAD DATA INFILE. ETL packages can, one way or another, handle all of the steps above.
Concurrent execution in SSIS is controlled by the package property MaxConcurrentExecutables. The default is -1, which means the number of logical processors plus 2, and usually works well.
You can also affect this by setting EngineThreads on each Data Flow Task.
Here's a good summary: http://blogs.msdn.com/b/sqlperf/archive/2007/05/11/implement-parallel-execution-in-ssis.aspx

CLUSTER not shrinking table size while running other transactions using Psycopg2

We are running a Python process which runs this stored procedure, importing files from a certain directory into the Postgres database. The files are first imported into an in-memory table and then into the disk table. The actual size of the in-memory table should never really grow beyond 30 MB. As this table is constantly updated, it grows (because of dead tuples). To keep things in check, we need to perform a CLUSTER operation on the table. I am using the psycopg2 module to run the stored procedure and to CLUSTER the table, but if the import process is running, the size of the table never goes down. If I stop the import process and run CLUSTER, then the size of the table does go down. For performance reasons, I should be able to run the CLUSTER command without stopping the import procedure.
I tried a manual commit and ISOLATION_LEVEL_AUTOCOMMIT, but none of this has worked.
Below is sample code for the process:
import os
import psycopg2

while True:
    # get the filenames in the directory
    for filepath in filenames:
        conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
        cursor = conn.cursor()
        # Calls a postgresql function that reads a file and imports it into
        # a table via INSERT statements and DELETEs any records that have the
        # same unique key as any of the records in the file.
        cursor.execute("SELECT import('%s', '%s');" % (filepath, str(db_timestamp)))
        conn.commit()
        cursor.close()
        conn.close()
        os.remove(get_media_path(filepath))
With a similar conn object, I want to run the CLUSTER command once an hour:
conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cursor = conn.cursor()
cursor.execute("CLUSTER table_name")
conn.commit()
cursor.close()
conn.close()
Also, I tried setting:
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
Another piece of information -
I have all of this running inside a Django environment. I could not use Django connection objects to do the task because Django could not release connections with my threading code, and soon the database stopped accepting connections. Could this mixed environment have an effect on psycopg2?
A few observations:
Running the CLUSTER command while the import process is running - the size doesn't go down.
When I stop the import process and then run CLUSTER - the size does go down.
When I stop the import process, start it up again, and after that run the CLUSTER command - the size does go down.
Any thoughts on the problem?
From the manual:
When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.
Are you sure you have to CLUSTER every hour? With a better fillfactor and autovacuum, your table won't grow that much and you won't have dead tuples in the table.
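For example, something along these lines (illustrative values on a reasonably recent PostgreSQL; tune them to your update rate):
ALTER TABLE table_name SET (fillfactor = 70);                         -- leave free space in each page for updated row versions
ALTER TABLE table_name SET (autovacuum_vacuum_scale_factor = 0.05);   -- let autovacuum reclaim dead tuples sooner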
OK - I found the culprit.
The problem was that somehow CLUSTER or VACUUM were not deleting the dead tuples, because some weird interaction was happening when we used psycopg2 directly in the Django environment. After isolating the psycopg2 code and removing the Django-related code from the import process, everything worked fine. This solved the problem, and now I can VACUUM or CLUSTER the table without stopping the import process.