I have some 200+ tables in my DynamoDB. Since all my tables have localSecondaryIndexes defined, I have to ensure that no table is in the CREATING status at the time of my CreateTable() call.
While adding a new table, I list all tables and iterate through their names, firing describeTable() calls one by one and checking the TableStatus key in the returned data. Each describeTable() call takes about a second, which means a wait of roughly 3 minutes before each table creation. So if I have to create 50 new tables, it takes me around 4 hours.
How do I go about optimizing this? I think a BatchGetItem() call works on items inside a table, not on table metadata. Can I do a bulk describeTable() call?
It is enough to wait until the last table you created becomes ACTIVE: since you create tables one after another, only the most recently created table can still be in the CREATING state. Run DescribeTable on that last created table at a few seconds' interval.
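A minimal sketch of such a polling loop, assuming boto3 (the question does not name an SDK, and the helper and table names are placeholders):

import time
import boto3

dynamodb = boto3.client("dynamodb")

def wait_until_active(table_name, poll_seconds=5):
    # Poll DescribeTable every few seconds until the table leaves CREATING.
    while dynamodb.describe_table(TableName=table_name)["Table"]["TableStatus"] != "ACTIVE":
        time.sleep(poll_seconds)

# After each CreateTable call, wait only on that one table before creating the next,
# e.g. wait_until_active("my_new_table")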
We have a use case where we need to execute multiple update statements (each updating a different set of rows) on the same table in Snowflake, without the latency caused by the update queries queueing behind one another. Currently a single update statement takes about 1 minute to execute, and all the other concurrent update statements (about 40 of them) are locked and queued, so the total time including the wait is around 1 hour. The expected time is around 2 minutes, assuming all update statements execute at the same time and the warehouse size supports 40 concurrent queries without queueing.
What is the best solution to avoid this lock time? We've considered the following two options:
Make changes in the application code to batch all the update statements and execute them as one query - not possible for our use case.
Have a separate table for each client (each update statement updates rows for a different client) - this way, all the update statements will execute against separate tables and there won't be any locks.
Is the second approach the best way to go, or is there another workaround that would help us reduce the queueing latency?
This scenario is expected, since Snowflake locks the table during an update.
Option 1 is the ideal way to scale within the data model, but since you can't make it happen, go with option 2.
You can also put all the updates into one staging table and upsert in bulk - delete and insert instead of update. Check if you can afford the delay.
But if you ask me, Snowflake should not be used for atomic updates. It has to be an upsert (delete and insert, and in bulk at that); atomic updates will run into limitations. Consider a row-based store like Citus or MySQL if your use case allows it.
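A rough sketch of that staging-table approach using the Snowflake Python connector; the table and key names (target_tbl, staging_tbl, client_id, row_key) and the connection parameters are placeholders, and whether the bulk delete+insert actually beats 40 queued updates has to be measured for your data:

import snowflake.connector

# Placeholder connection parameters.
conn = snowflake.connector.connect(account="...", user="...", password="...",
                                    warehouse="...", database="...", schema="...")
cur = conn.cursor()

# Each client writes its changed rows into the shared staging table instead of
# running its own UPDATE against the target table; one job then applies them in bulk.
cur.execute("BEGIN")
cur.execute("""
    DELETE FROM target_tbl
    USING staging_tbl
    WHERE target_tbl.client_id = staging_tbl.client_id
      AND target_tbl.row_key = staging_tbl.row_key
""")
cur.execute("INSERT INTO target_tbl SELECT * FROM staging_tbl")
cur.execute("TRUNCATE TABLE staging_tbl")
cur.execute("COMMIT")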
I have a table that has to be refreshed daily from an external source. All the recommendations I read say to delete the whole table and re-create it instead of deleting all the items.
I tried the suggested method, but the deleteTable function returns successfully even though the table is still in the "Table is being deleted" state, as seen in the DynamoDB console. Sometimes this takes more than a minute.
What is the proper way of deleting and re-creating a table? Should I just keep retrying createTable until the "already exists" error goes away?
I am using Node.js.
(The table is a list of some 5,000+ bus stops. The source doesn't specify how often the data changes nor give any indicator that there are changes. I found a small number of changes once every few weeks.)
If you are using boto3 (Python), there is a waiter called TableNotExists:
Polls DynamoDB.Client.describe_table() every 20 seconds until a successful state is reached. An error is returned after 25 failed checks.
Or, you could just do that polling yourself.
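For example (a sketch; the table name is a placeholder), the waiter blocks until DescribeTable reports the table gone, after which createTable can safely be issued:

import boto3

dynamodb = boto3.client("dynamodb")

# Blocks until DescribeTable says the table no longer exists
# (polls every 20 seconds, up to 25 attempts).
dynamodb.get_waiter("table_not_exists").wait(TableName="BusStops")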
I would suggest changing the table name each day, using the current date as part of the table name. Then you can create the new table and start populating it without having to wait for the delete of the previous day's table to complete.
If the response from the createTable method is a Table already exists exception, the exception also contains a retryDelay property that is a number.
I can't find documentation on retryDelay but it seems to be a time duration in seconds.
I use the Table already exists exception to detect that the table is not yet completely deleted and, if it isn't, back off for the period specified in the retryDelay property. After a few iterations, the table can be created successfully.
Sometimes the value in retryDelay can be more than 20.
This approach has worked without issues for me every time.
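The same retry pattern, sketched with boto3 for comparison (boto3's exceptions don't carry a retryDelay hint as far as I know, so a fixed back-off stands in for it; the table and key names are placeholders):

import time
import boto3

dynamodb = boto3.client("dynamodb")

# Keep retrying CreateTable while the old table is still being deleted.
for attempt in range(30):
    try:
        dynamodb.create_table(
            TableName="BusStops",
            AttributeDefinitions=[{"AttributeName": "StopId", "AttributeType": "S"}],
            KeySchema=[{"AttributeName": "StopId", "KeyType": "HASH"}],
            BillingMode="PAY_PER_REQUEST",
        )
        break
    except dynamodb.exceptions.ResourceInUseException:
        time.sleep(10)  # stand-in for retryDelay: back off, then try again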
I have a table ActivityLog to which new data is added every second.
I am querying this table every 5 seconds through an API in the following way:
logs = ActivityLog.objects.prefetch_related('activity').filter(login=login_obj, read_status=0)
Now let's say that when I queried this table at 13:20:05 I got 5 objects in logs, and after my query 5 more rows were added to the table at 13:20:06. When I try to update only the queried logs queryset using logs.update(read_status=1), it also updates the newly added rows; that is, instead of updating 5 objects it updates 10. How can I update only the 5 objects that I queried, without looping through them?
Take a look at select_for_update. Just be aware that the rows will be locked in the meantime.
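A minimal sketch of that suggestion; select_for_update must run inside a transaction, and pinning the pk list before the update is an extra step (not part of the answer above) that guarantees rows inserted after the query are left untouched:

from django.db import transaction

with transaction.atomic():
    # Lock and evaluate the currently unread rows.
    logs = (ActivityLog.objects
            .select_for_update()
            .filter(login=login_obj, read_status=0))
    ids = list(logs.values_list('pk', flat=True))  # evaluates the queryset and takes the locks
    # Update only the rows that were actually fetched above.
    ActivityLog.objects.filter(pk__in=ids).update(read_status=1)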
Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a range of entries specified at that time will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
A power outage is possible; data loss in that case is hardly, if ever, tolerable. Any solution that may cause this (in-memory databases, etc.) is not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
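For reference, a manual checkpoint can be issued from the writer connection like this (sketched with Python's sqlite3; the file name and the TRUNCATE mode are my choices, not something stated above):

import sqlite3

con = sqlite3.connect("manager.db")  # placeholder file name
# Flush the WAL into the main database file and truncate the log so it cannot keep growing.
con.execute("PRAGMA wal_checkpoint(TRUNCATE)")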
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind them to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, 'datatable');
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results (sketched below).
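Tying those two steps together, a sketch in Python's sqlite3; the connection and the string-formatted SQL stand in for the precreated per-table statements described above, and attached-schema prefixes are omitted for brevity:

import sqlite3

con = sqlite3.connect("manager.db")  # storage databases are assumed to be attached already

# Step 1: one range (min/max rowid) per data table, over the first 30000 registry entries.
ranges = con.execute(
    "SELECT MIN(rowid), MAX(rowid), tablename "
    "FROM (SELECT rowid, tablename FROM entryreg LIMIT 30000) "
    "GROUP BY tablename"
).fetchall()

# Step 2: fetch each table's datasets with its own range-bounded statement.
rows = []
for minid, maxid, tablename in ranges:
    sql = "SELECT * FROM %s WHERE rowid >= :minid AND rowid <= :maxid" % tablename
    rows.extend(con.execute(sql, {"minid": minid, "maxid": maxid}))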
This is too slow. As soon as I create the registry table, my insertions slow down so much that I can't meet the benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table. This can't be done persistently on attached databases (a same-file variant is sketched after this list).
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
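For what it's worth, here is a sketch of the trigger idea restricted to a single storage file, using Python's sqlite3; all file, table, and trigger names are made up, and it assumes a per-file registry is acceptable, since (as far as I understand) a persistent trigger cannot reference objects in another attached database:

import sqlite3

# Create the trigger while the storage file is opened as the main schema; it is stored
# inside that file, so it keeps firing later when the file is merely ATTACHed.
con = sqlite3.connect("storage1.db")
con.executescript("""
    CREATE TABLE IF NOT EXISTS datatable (idx INTEGER PRIMARY KEY, payload BLOB);
    CREATE TABLE IF NOT EXISTS entryreg (idx INTEGER, tablename TEXT);
    CREATE TRIGGER IF NOT EXISTS datatable_reg AFTER INSERT ON datatable
    BEGIN
        INSERT INTO entryreg VALUES (NEW.idx, 'datatable');
    END;
""")
con.commit()
con.close()

# Later, from the manager connection:
mgr = sqlite3.connect("manager.db")
mgr.execute("ATTACH DATABASE 'storage1.db' AS storage1")
mgr.execute("INSERT INTO storage1.datatable VALUES (1, x'00')")
mgr.commit()
print(mgr.execute("SELECT * FROM storage1.entryreg").fetchall())  # [(1, 'datatable')]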
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice on whether this has any chance of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them: can I create views and/or triggers on the database (as the main schema) that will still work later, when I connect to the database via ATTACH?
From what it looks like, an AFTER INSERT trigger fires after every single inserted row. If it inserts into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow things down horribly.
Is there any chance that a FULL OUTER JOIN (I know I need to emulate it with other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements each (insert) vs. one query of 30000 rows every two seconds (query).
Open the sqlite3 databases in multi-threading mode and handle the insert/update/query/delete functions in separate threads. I prefer to transfer query results into an STL container for processing.
I have a system whereby a central MSSQL database keeps in a table a queue of jobs that need to be done.
Because the processing requirements would not be that high and requests would not be particularly frequent (probably once every few seconds at most), we decided to have the applications that use the queue simply query the database whenever a job is needed; there is no message queue service at this time.
A single fetch is performed by having the client application run a stored procedure, which performs the query (or queries) involved and returns a job ID. The client application then fetches the job information by querying by ID and marks the job as handled.
Performance is fine; the only snag we have hit is that, because the client application has to query for the details and perform a check before the job is marked as handled, on very rare occasions (once every few thousand jobs) two clients pick up the same job.
As a way of solving this, I was thinking of having the initial stored procedure "tag" the record it pulls with the current date and time. When querying for records, the stored procedure would then only pull records where this tag is a certain amount of time, say 5 seconds, in the past. That way, if the stored procedure runs twice within 5 seconds, the second instance will not pick up the same job.
Can anyone foresee any problems with fixing the problem this way or offer an alternative solution?
Use a UNIQUEIDENTIFIER field as your marker. When the stored procedure runs, lock the rows you're reading and update the field with NEWID(). You can mark your polling statement with something like WITH (READPAST) if you're worried about blocking or deadlocking issues.
The reason to use a GUID here is to have a unique identifier that serves to mark a batch. NEWID() is guaranteed to give you a unique value, which prevents you from accidentally picking up the same data twice. GETDATE() wouldn't work here because two calls could resolve to the same time; BIT wouldn't work because it couldn't uniquely mark off batches for pickup or reporting.
For example,
declare @ReadID uniqueidentifier;
declare @BatchSize int = 20;  -- make this a parameter of your procedure
set @ReadID = NEWID();

-- Claim a batch: stamp up to @BatchSize unread rows with this run's GUID.
UPDATE tbl WITH (ROWLOCK)
SET HasBeenRead = @ReadID  -- your UNIQUEIDENTIFIER field
FROM (
    SELECT TOP (@BatchSize) Id
    FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
    WHERE HasBeenRead IS NULL
    ORDER BY [Id]
) AS t1
WHERE tbl.Id = t1.Id;

-- Return only the rows claimed by this run.
SELECT Id, OtherCol, OtherCol2
FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
WHERE HasBeenRead = @ReadID;
And then you can use a polling statement like
SELECT COUNT(*) FROM tbl WITH(READPAST) WHERE HasBeenRead IS NULL
Adapted from here: https://msdn.microsoft.com/en-us/library/cc507804%28v=bts.10%29.aspx