Database polling, prevent duplicate fetches - C++

I have a system in which a central MSSQL database keeps a queue of pending jobs in a table.
Because processing requirements would not be that high and requests would not be particularly frequent (probably once every few seconds at most), we decided to have the applications that use the queue simply query the database whenever a job is needed; there is no message queue service at this time.
A single fetch is performed by having the client application run a stored procedure, which performs the necessary queries and returns a job ID. The client application then fetches the job information by querying by ID and marks the job as handled.
Performance is fine; the only snag we have hit is that, because the client application has to query for the details and perform a check before the job is marked as handled, on very rare occasions (once every few thousand jobs) two clients pick up the same job.
As a way of solving this, I suggested having the stored procedure "tag" the record it pulls with the current date and time. When querying for records, the stored procedure would only pull records whose tag is at least a certain amount of time, say 5 seconds, in the past. That way, if the stored procedure runs twice within 5 seconds, the second run will not pick up the same job.
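A rough sketch of what I have in mind (table and column names here are only placeholders):
-- Proposed fetch: only consider jobs whose tag is missing or more than 5 seconds old,
-- then tag the job we pull with the current date and time
DECLARE @JobId int;

SELECT TOP (1) @JobId = JobId
FROM JobQueue
WHERE Handled = 0
  AND (ClaimedAt IS NULL OR ClaimedAt < DATEADD(SECOND, -5, GETDATE()));

UPDATE JobQueue SET ClaimedAt = GETDATE() WHERE JobId = @JobId;

SELECT @JobId AS JobId;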
Can anyone foresee any problems with fixing the problem this way or offer an alternative solution?

Use a UNIQUEIDENTIFIER field as your marker. When the stored procedure runs, lock the row you're reading and update the field with a NEWID(). You can mark your polling statement using something like WITH(READPAST) if you're worried about deadlocking issues.
The reason to use a GUID here is to have a unique identifier that will serve to mark a batch. Your NEWID() call is guaranteed to give you a unique value, which will be used to prevent you from accidentally picking up the same data twice. GETDATE() wouldn't work here because you could end up having two calls that resolve to the same time; BIT wouldn't work because it wouldn't uniquely mark off batches for picking up or reporting.
For example,
declare @ReadID uniqueidentifier;
declare @BatchSize int = 20; -- make this a parameter of your procedure
set @ReadID = NEWID();

-- Claim a batch: tag up to @BatchSize unread rows with this batch's GUID
UPDATE tbl WITH (ROWLOCK)
SET HasBeenRead = @ReadID -- your UNIQUEIDENTIFIER field
FROM (
    SELECT TOP (@BatchSize) Id
    FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
    WHERE HasBeenRead IS NULL
    ORDER BY [Id]
) AS t1
WHERE tbl.Id = t1.Id;

-- Return the rows claimed by this batch
SELECT Id, OtherCol, OtherCol2
FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
WHERE HasBeenRead = @ReadID;
And then you can use a polling statement like
SELECT COUNT(*) FROM tbl WITH(READPAST) WHERE HasBeenRead IS NULL
Adapted from here: https://msdn.microsoft.com/en-us/library/cc507804%28v=bts.10%29.aspx
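In case it's useful, the table shape the statements above assume is roughly:
-- Sketch of the assumed table; the extra columns are placeholders
CREATE TABLE tbl (
    Id          int IDENTITY(1,1) PRIMARY KEY,
    OtherCol    nvarchar(100),
    OtherCol2   nvarchar(100),
    HasBeenRead uniqueidentifier NULL -- NULL = not yet picked up; set to the batch GUID on pickup
);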

Related

Is it possible to run queries in parallel in Redshift?

I want to do an insert and an update at the same time in Redshift. To do this I insert the data into a temporary table, remove the updated entries from the original table, and then insert all the new and updated entries. Because the operations run concurrently, entries are sometimes duplicated: the delete starts before the insert has finished. With a very long sleep between operations this does not happen, but the script becomes very slow. Is it possible to run queries in parallel in Redshift?
Hope someone can help me, thanks in advance!
You should read up on MVCC (multi-version concurrency control) and transactions. Redshift can only run one query at a time (for a session), but that is not the issue. You want to COMMIT both changes at the same time (COMMIT is the action that makes changes visible to others). You do this by wrapping your SQL statements in a transaction (BEGIN ... COMMIT) executed in the same session (it is not clear whether you are using multiple sessions). All changes made within the transaction will only be visible to the session making them UNTIL COMMIT, when ALL the changes made by the transaction become visible to everyone at the same moment.
A few things to watch out for: if your connection is in AUTOCOMMIT mode you may break out of your transaction early and COMMIT partial results. Also, while you are working inside a transaction your view of the source tables is unchanging (so you see consistent data throughout the transaction), and this view is not allowed to change underneath you. This means that if you have multiple sessions changing table data you need to be careful about the order in which they COMMIT, so that each sees the right version of the data.
begin transaction;
<run the queries in parallel>
end transaction;
In this specific case do this:
-- Stage the incoming rows outside the transaction
create temp table stage (like target);

insert into stage
select * from source
where source.filter = 'filter_expression';

-- Delete and re-insert inside one transaction so both changes become visible at the same COMMIT
begin transaction;

delete from target
using stage
where target.primarykey = stage.primarykey;

insert into target
select * from stage;

end transaction;

drop table stage;
See:
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html
https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html

In Google Spanner, is it possible for the exact same commit timestamp to appear again after it has already been observed?

In Google Spanner, commit timestamps are generated by the server and based on "TrueTime", as discussed in https://cloud.google.com/spanner/docs/commit-timestamp. This page also states that timestamps are not guaranteed to be unique, so multiple independent writers can generate timestamps that are exactly the same.
The documentation on consistency guarantees states that "in addition, if one transaction completes before another transaction starts to commit, the system guarantees that clients can never see a state that includes the effect of the second transaction but not the first."
What I'm trying to understand is the combination of
Multiple concurrent transactions committing "at the same time" resulting in the same commit timestamp (where the commit timestamp forms part of a key for the table)
A reader observing new rows being entered into above table
Under these circumstances, is it possible that a reader can observe some but not all of the rows that will (eventually) be stored with the exact same timestamp? Or, put differently, if a query searches for all rows up to a known exact timestamp while rows are still being inserted with that timestamp, is it possible that the query first returns some of the results but, when executed again, returns more?
The context of this is an attempt to model a stream of events ordered by time in an append only manner - I need to be able to keep what is effectively a cursor to a particular point in time (point in the stream of events) and need to know whether or not having observed events at time T means you can never get more events again at exactly time T.
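For concreteness, the table I have in mind is shaped roughly like this (names are placeholders; the commit timestamp is the leading part of the primary key):
-- Hypothetical Spanner table for the event stream described above
CREATE TABLE Events (
    EventTime TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp = true),
    EventId   STRING(36) NOT NULL,
    Payload   STRING(MAX)
) PRIMARY KEY (EventTime, EventId);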
Spanner is externally consistent, meaning that any reader will only be able to read the results of completed transactions...
As with all externally consistent databases, it is not possible for a reader outside of a transaction to read the 'pending state' of another transaction. So a reader at time T will only be able to see transactions that have been committed before time T.
Multiple simultaneous insert/update transactions at commit time T (which would have to affect different rows, otherwise they could not be simultaneous) would not be seen by a reader at time T, but both would be seen by a reader at T+1.
I ... need to know whether or not having observed events at time T means you can never get more events again at exactly time T.
Yes - ish. Rephrasing slightly as this is nuanced:
Having read events up to and including time T means you will never get any more events occurring with time equal to or before time T
But remember that the commit timestamp column is a simple TIMESTAMP column where any value can be stored -- it is the application that requests that the value stored is the commit timestamp, and there is nothing at the DB level to stop the application storing any value it likes...
As always with Spanner, it is the application which has to enforce/maintain the data integrity.
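Purely as an illustration, and reusing the hypothetical Events table sketched in the question, a cursor-style read that only advances past timestamps you have already fully observed would look like:
-- Fetch everything strictly after the last fully-observed commit timestamp
SELECT EventTime, EventId, Payload
FROM Events
WHERE EventTime > @last_seen_time
ORDER BY EventTime, EventId;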

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to and including the current version, HANA 2 SPS 02).
That means that to detect "changed records since a given point in time", some other approach has to be taken.
Depending on the information available in the tables, different options can be used:
if a table explicitly contains a reference to the last change time, this can be used
if a table has guaranteed update characteristics (e.g. no in-place updates and monotonically increasing ID values), this can be used, e.g. read all records where the ID is larger than the last processed ID (see the sketch after this list)
if the table does not provide intrinsic information about change time, one can maintain a copy of the table containing only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
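As a sketch of the second option (table, column and parameter names are hypothetical; the last processed ID is assumed to be persisted by the replication job between runs):
-- Incremental extraction by a monotonically increasing ID
SELECT *
FROM "MYSCHEMA"."ORDERS"
WHERE "ID" > :last_processed_id
ORDER BY "ID";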
In my experience, efforts to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
It is possible to create a Log table and organize its columns according to your needs, so that by creating triggers on your database tables you can write a log record with timestamp values. You can then query the Log table to determine which records were inserted, updated or deleted in your source tables.
For example, the following is one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
    INSERT INTO SalaryLog (Employee, Salary, Operation, DateTime)
    VALUES (:mynewrow.Employee, :mynewrow.Salary, 'U', CURRENT_DATE);
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE trigger above.
You can organize your Log table so that it can track more than one source table if you wish, simply by keeping the table name, the PK fields and their values, the operation type, timestamp values, etc.
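For example, a multi-table log along those lines could be shaped roughly like this (names are placeholders):
-- Hypothetical generic change-log table covering several source tables
CREATE COLUMN TABLE ChangeLog (
    TableName NVARCHAR(128),
    PkValues  NVARCHAR(512), -- primary-key columns and values, e.g. serialized to a string
    Operation NVARCHAR(1),   -- 'I', 'U' or 'D'
    ChangedAt TIMESTAMP
);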
But it is better and easier to use separate Log tables for each source table.

Efficiency using triggers inside attached database with SQLite

Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a then-given range of entries will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
Energy outage is possible, data loss in this case is hardly, if ever, tolerable. Any solutions that may cause this (in-memory databases etc) are not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind them to an insert into the registry table:
INSERT INTO regtable VALUES (:idx, 'datatable');
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/DELETE on table creation that fill a registry table (sketched after this list). This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
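For reference, the INSERT/DELETE triggers of the second possible solution would look something like this (assuming the registry's columns are (idx, tablename)):
-- Hypothetical: keep the registry in sync with one data table
CREATE TRIGGER datatable_ai AFTER INSERT ON datatable
BEGIN
    INSERT INTO regtable (idx, tablename) VALUES (NEW.rowid, 'datatable');
END;

CREATE TRIGGER datatable_ad AFTER DELETE ON datatable
BEGIN
    DELETE FROM regtable WHERE idx = OLD.rowid AND tablename = 'datatable';
END;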
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice if this has any chances of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them. Can I create views and/or triggers on the database (as main schema) that will work later when I connect to the database via ATTACH?
From what it looks like, an AFTER INSERT trigger will fire for every single inserted row. If it inserts into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow things down horribly.
Is there any chance that a FULL OUTER JOIN (I know that I need to create it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements (insert) vs. one query of 30000 every two seconds (query).
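And for comparison, the view of all indices from the first possible solution would be along these lines (data table names are placeholders):
-- Hypothetical: one view over the shared index of all known data tables
CREATE VIEW all_indices AS
    SELECT rowid AS idx, 'datatable1' AS tablename FROM datatable1
    UNION ALL
    SELECT rowid AS idx, 'datatable2' AS tablename FROM datatable2;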
Open the sqlite3 databases in multi-threaded mode and handle the insert/update/query/delete functions in separate threads. I prefer to transfer query results into an STL container for processing.

Django: Simple rate limiting

Many of my views fetch external resources. I want to make sure that under heavy load I don't blow up the remote sites (and/or get banned).
I only have 1 crawler so having a central lock will work fine.
I want to allow at most 3 queries to a host per second, and have the rest block for a maximum of 15 seconds. How could I do this (easily)?
Use django cache
Seems to only have 1 second resolution
Use a file based semaphore
Easy to do locks for concurrency. Not sure how to make sure only 3 fetches happen a second.
Use some shared memory state
I'd rather not install more things, but will if I have to.
One approach; create a table like this:
class Queries(models.Model):
    site = models.CharField(max_length=200, db_index=True)
    start_time = models.DateTimeField(null=True)
    finished = models.BooleanField(default=False)
This records when each query has either taken place, or will take place in the future if the limiting prevents it from happening immediately. start_time is the time the action is to start; this is in the future if the action is currently blocking.
Instead of thinking in terms of queries per second, let's think in terms of seconds per query; in this case, 1/3 second per query.
Whenever an action is to be performed, do the following:
Create a row for the action. q = Queries.objects.create(site=sitename)
On the object you just created (q.id), atomically set start_time to the greatest start_time for this site plus 1/3 second. If the greatest is 10 seconds in the future, then we can start our action at 10 1/3 seconds. If that time is in the past, clamp it to now().
If the start_time that was just set is in the future, sleep until that time. If it's too far in the future (eg. over 15 seconds), delete the row and error out.
When the query is finished, set finished to True, so the row can be purged later on.
The atomic action is what's important. You can't simply do an aggregate on Queries and then save it, since it'll race. I don't know if Django can do this natively, but it's easy enough in raw SQL:
-- site_name and object_id stand for bound parameters
UPDATE site_queries
SET start_time = GREATEST(
    now(),
    COALESCE(
        (SELECT MAX(start_time) FROM site_queries WHERE site = site_name) + interval '1 second' / 3,
        now()
    )
)
WHERE id = object_id
Then, reload the model and sleep if necessary. You'll also need to purge old rows. Something like Queries.objects.filter(site=site, finished=True).exclude(id=id).delete() will probably work: delete all finished queries except the one you just made. (That way, you never delete the latest query, since later queries need that to be scheduled.)
Finally, make sure the UPDATE doesn't take place in a transaction. Autocommit must be turned on for this to work. Otherwise, the UPDATE won't be atomic: it'd be possible for two requests to UPDATE at the same time, and receive the same result. Django and Python typically have autocommit off, so you need to turn it on and then back off. With Postgres, this is connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT) and ISOLATION_LEVEL_READ_COMMITTED. I don't know how to do this with MySQL.
(I consider the default of having autocommit turned off in Python's DB-API to be a serious design flaw.)
The benefit of this approach is that it's quite simple, with straightforward state; you don't need things like event listeners and wakeups, which have their own sets of problems.
A possible issue is that if the user cancels the request during the delay, the delay is still enforced whether or not you carry out the action. If you never start the action, other requests won't move down into the unused "timeslot".
If you're not able to get autocommit to work, a workaround would be to add a UNIQUE constraint to (site, start_time). (I don't think Django understands that directly, so you'd need to add the constraint yourself.) Then, if the race happens and two requests to the same site end up at the same time, one of them will throw a constraint exception that you can catch, and you can just retry. You could also use a normal Django aggregate instead of raw SQL. Catching constraint exceptions isn't as robust, though.
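The constraint itself is plain SQL; assuming the table is named site_queries, as in the raw SQL above, it would be something like:
-- Enforce at most one query scheduled per (site, start_time) slot
ALTER TABLE site_queries
    ADD CONSTRAINT site_queries_site_start_time_key UNIQUE (site, start_time);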
What about using a different process to handle scraping, and a queue for the communication between it and Django?
This way you would be able to easily change the number of concurrent requests, and it would also automatically keep track of the requests, without blocking the caller.
Most of all, I think it would help lowering the complexity of the main application (in Django).