Combining intra-group sorting and advanced sorting in antv S2 - antv

Language: JS
Set the intra-group sorting, then set the advanced sorting, and then cancel the advanced sorting. The table does not return to the intra-group sorting state but to the original, unsorted state.
I want the previously set intra-group sorting to be restored after the advanced sorting is cancelled.
Thanks

Related

Indexed Range Query with DynamoDB

With DynamoDB, there is simply no straightforward way to perform an indexed range query over a column. The primary key, local secondary indexes, and global secondary indexes all require a partition key before you can run a range query over the sort key.
For example, suppose I have a high-scores table with a numerical score attribute. There is no way to get the top 10 scores, or scores 25 to 50, with an indexed range query.
So, what is the idiomatic or preferred way to perform this incredibly common task?
1) Settle for a table scan.
2) Use a static partition key and take advantage of partition queries.
3) Use a fixed number of static partition keys and use multiple partition queries.
It's either 2) or 3) but it depends on the amount and structure of data as well as the read/write activity.
There's no generic answer here as it's use-case specific.
As long as you can get away with it, you probably want to use 2) as it only requires a single Query API call. If you have lots of data or heavy read/write activity, you'd use some bucketing strategy (very close to your third option) to write to multiple partitions, then do multiple queries and aggregate the results.
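As a concrete illustration of option 2), here is a minimal Java sketch using the AWS SDK for Java v2. It assumes a hypothetical GSI named score-index whose partition key is a constant attribute gsi_pk and whose sort key is score; the table, index, and attribute names are placeholders, not anything from the question.

import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class TopScores {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Query the GSI whose partition key is a constant ("SCORES") and whose
        // sort key is the numeric score, reading in descending score order.
        QueryRequest request = QueryRequest.builder()
                .tableName("high-scores")
                .indexName("score-index")               // hypothetical GSI name
                .keyConditionExpression("gsi_pk = :pk") // hypothetical static partition key attribute
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s("SCORES").build()))
                .scanIndexForward(false)                // highest scores first
                .limit(10)                              // top 10
                .build();

        QueryResponse response = ddb.query(request);
        response.items().forEach(System.out::println);
    }
}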
DDB isn't suited for analytics. As Maurice said, you can facilitate what you need via a secondary index, but there are also other options to consider:
If you are providing this Top N to your customers consistently/frequently and N is fixed, then you can have dedicated item(s) that hold this information, and you would update that/those item(s) upon writing an item to the table. You can have one item for the whole top N, or you can apply some bucketing strategy (a simplified sketch follows this list).
If your system needs this information infrequently (on some singular occasions), then a scan might also be fine.
If this is for analytics/research, consider exporting the table to S3 and using Athena.
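As a simplified sketch of the dedicated-item idea, reduced to tracking only the single highest score instead of a full top N: a conditional UpdateItem can keep a hypothetical leaderboard item fresh on every write. The table, key, and attribute names below are assumptions for illustration only.

import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

public class LeaderboardWriter {

    private final DynamoDbClient ddb = DynamoDbClient.create();

    // Called right after writing a new score item to the main table.
    public void maybeUpdateTopScore(long newScore) {
        UpdateItemRequest request = UpdateItemRequest.builder()
                .tableName("leaderboard")   // hypothetical dedicated table
                .key(Map.of("pk", AttributeValue.builder().s("TOP_SCORE").build()))
                .updateExpression("SET topScore = :s")
                // Only overwrite if the new score beats the stored one (or none is stored yet).
                .conditionExpression("attribute_not_exists(topScore) OR topScore < :s")
                .expressionAttributeValues(Map.of(
                        ":s", AttributeValue.builder().n(Long.toString(newScore)).build()))
                .build();
        try {
            ddb.updateItem(request);
        } catch (ConditionalCheckFailedException e) {
            // The stored top score is already higher; nothing to do.
        }
    }
}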

Siddhi: state persistence and realtime queryable state?

My scenario is like this.
I want to see/query the current aggregation value(s) of a particular query while its window is still actively processing.
I have seen this in Apache Flink.
For example:
Say I have a query to count the total number of failures, windowed to every 12 hours, and I want to ask (from another application) what the current count is for the active aggregating window. Note that the active window is still processing.
The reason is that my application needs to give feedback to the user regarding the current total failure count, so he/she can act based on that. Waiting until the window is processed and only then getting the count is not the desired behavior from the user's perspective.
Is this possible? If so how?
One option is to use a rolling time window. A rolling time window will give you the rolling aggregation (sum, count, etc.) for a given time range, so for every incoming event you will get an output event with the count. You can use that to give feedback. There are two catches to this approach. One is that it is a rolling count, not a batch count. The other is that processing is triggered by events on the count stream, so if you want to trigger the feedback on some other condition (e.g. user initiated, every hour), this approach will not work. For that, you need the join-based approach described further below.
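Before that, a minimal sketch of the rolling-window option, assuming Siddhi 5.x package names and a hypothetical countStream with userid and reason attributes; every event sent to the stream emits the current rolling count on FeedbackStream:

import io.siddhi.core.SiddhiAppRuntime;
import io.siddhi.core.SiddhiManager;
import io.siddhi.core.event.Event;
import io.siddhi.core.stream.output.StreamCallback;

public class RollingFailureCount {
    public static void main(String[] args) throws InterruptedException {
        // Sliding (rolling) 12-hour window: every event on countStream emits the current count.
        String app =
                "define stream countStream (userid string, reason string); " +
                "from countStream#window.time(12 hours) " +
                "select count() as totalFailures " +
                "insert into FeedbackStream;";

        SiddhiManager manager = new SiddhiManager();
        SiddhiAppRuntime runtime = manager.createSiddhiAppRuntime(app);

        runtime.addCallback("FeedbackStream", new StreamCallback() {
            @Override
            public void receive(Event[] events) {
                for (Event event : events) {
                    System.out.println("current failure count: " + event.getData(0));
                }
            }
        });

        runtime.start();
        runtime.getInputHandler("countStream").send(new Object[]{"user-1", "disk failure"});
        Thread.sleep(100);
        runtime.shutdown();
        manager.shutdown();
    }
}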
The other option is to use a time batch window and then join it with another stream which will get triggered depending on the business requirement. Below is a sample; the test cases can be used as a reference.
from countStream#window.timeBatch(12 hours) right outer join
feedbackTriggerStream#window.length(1)
select count() as totalFailures
insert into FeedbackStream;
Another option is to use the query feature. This approach is suitable if you are using Siddhi as a library and you have access to the SiddhiAppRuntime. Below is a code sample for that. Let's assume the following is your window query to calculate the count.
define window countWindow(userid string, reason string) timeBatch(12 hours);
from countStream
select *
insert into countWindow;
Then you can use queries as below to access window data.
Event[] events = siddhiAppRuntime.query(
"from countWindow " +
"select count() as totalCount ");
events will contain a single event carrying the count. The test cases can be used as a reference.

Scanning DynamoDB table while inserting

When we scan a DynamoDB table, we can/should use LastEvaluatedKey to track the progress so that we can resume in case of failures. The documentation says that
LastEvaluatedKey is "The primary key of the item where the operation stopped, inclusive of the previous result set. Use this value to start a new operation, excluding this value in the new request."
My question is if I start a scan, pause, insert a few rows and resume the scan from the previous LastEvaluatedKey, will I get those new rows after resuming the scan?
My guess is I might miss some or all of the new rows, because the new keys will be hashed and their values could be smaller than the LastEvaluatedKey.
Is my guess right? Any explanation or documentation links are appreciated.
The scan goes sequentially through your data, and it does not know about items that were added while it is in progress:
"Scan operations proceed sequentially; however, for faster performance on a large table or secondary index, applications can request a parallel Scan operation by providing the Segment and TotalSegments parameters."
Not only can it miss some of the items that were added after you started scanning, it can also miss some of the items that were added before the scan started if you are using eventually consistent reads:
"Scan uses eventually consistent reads when accessing the data in a table; therefore, the result set might not include the changes to data in the table immediately before the operation began."
If you need to keep track of items that were added after you've started a scan you can use DynamoDB streams for that.
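For completeness, here is a sketch of the resume mechanics in Java (AWS SDK v2): each page's LastEvaluatedKey is fed back as the next request's ExclusiveStartKey, and that key is exactly the state you would persist to survive a failure. The table name is a placeholder.

import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
import software.amazon.awssdk.services.dynamodb.model.ScanResponse;

public class ResumableScan {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();
        Map<String, AttributeValue> lastEvaluatedKey = null;

        do {
            ScanRequest.Builder builder = ScanRequest.builder()
                    .tableName("my-table")   // placeholder table name
                    .limit(100);
            if (lastEvaluatedKey != null) {
                // Resume exactly where the previous page stopped.
                builder.exclusiveStartKey(lastEvaluatedKey);
            }

            ScanResponse page = ddb.scan(builder.build());
            page.items().forEach(System.out::println);

            // Persist this key externally if the scan must survive a crash and resume later.
            lastEvaluatedKey = page.lastEvaluatedKey();
        } while (lastEvaluatedKey != null && !lastEvaluatedKey.isEmpty());
    }
}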

Is there a clever HBase Schema to Aid with Discovering Missing Value?

Let's assume I have billions of rows in my HBase table. The rows in this table change slowly, meaning there will be new rowkeys and some rowkeys get deleted.
I receive lots of events per row. However, there will be very few rows that will not have any events associated with them.
At the end of the day I would like to report on the rows that have not received any events.
My naive solution would be to introduce a cf:c that holds a flag and set the flag to 1 every time I see an event for the row, then do a full scan of the table looking for rowkeys that are missing the column value. That seems like a waste, because I would be looking through 10 billion rows to discover a handful of rowkeys (we are talking about 100s or low 1000s).
Is there a clever way to design the HBase schema such that the rowkeys that are missing events could be found quickly (without going through every row)?
If I understood correctly, you have a rowkey xxxxyyyyzzzz1 ... xxxxyyyyzzzzn.
You have events for some rows and no events for other rows.
c is your flag for knowing whether events are there or not, and you have huge data.
Rule of thumb in HBase: row filters are always faster and more efficient than column value filters (searching on that flag requires a full table scan).
Your approach of scanning the entire table for missing events (a column value filter) will lead to a full table scan and is not efficient.
Conclusion: you have to use a row key filter to scan such a big table.
So I'd suggest you write the flag into the row key. For example:
0 -- no events
1 -- events present
xxxxyyyyzzzz1_0 // row with no events
xxxxyyyyzzzz1_1 // row with events
Now you can use a fuzzy row filter to capture the missing-event rows and produce the report.
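As a rough Java sketch of that idea, assuming purely for illustration a fixed-length rowkey of 13 id bytes followed by an underscore and a one-byte flag, a FuzzyRowFilter can match only the rows whose flag byte is 0:

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class MissingEventRows {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("mytable"))) {

            // Assumed rowkey layout: 13 fixed-width id bytes, an underscore, then 1 flag byte.
            // Mask semantics: 1 = "this position can be anything", 0 = "this position must match".
            byte[] rowPattern = Bytes.toBytes("?????????????_0");
            byte[] mask = new byte[]{1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0};

            Scan scan = new Scan();
            scan.setFilter(new FuzzyRowFilter(
                    Collections.singletonList(new Pair<>(rowPattern, mask))));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}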
See option 2 of your other question, which I answered: Is there a clever HBase Schema to Aid with Discovering Missing Value?
From my experience with HBase, there is no such thing.

Efficiency using triggers inside attached database with SQLite

Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a then-given range of entries will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
Power outages are possible; data loss in this case is hardly, if ever, tolerable. Any solutions that may cause this (in-memory databases etc.) are not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind it to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, 'datatable');
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid, tablename FROM regtable LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table. This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice if this has any chances of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them: can I create views and/or triggers on the database (as the main schema) that will work later when I connect to the database via ATTACH? (See the sketch after these questions.)
From what it looks like, a trigger AFTER INSERT will fire for every single inserted row. If it inserts data into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow things down horribly.
Is there any chance that a FULL OUTER JOIN (I know that I need to create it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements (insert) vs. one query of 30000 every two seconds (query).
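Regarding the question about triggers surviving ATTACH: a (non-TEMP) trigger is stored in the schema of the database file it is created in, so one created through a direct connection keeps firing when that file is later attached elsewhere. Below is a minimal JDBC sketch of that setup; it assumes the sqlite-jdbc driver, and the file, table, and trigger names are placeholders rather than the real schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StorageDbSetup {
    public static void main(String[] args) throws Exception {
        // Create the storage database, its tables, and its registry trigger
        // through a separate connection before the manager ever attaches it.
        try (Connection storage = DriverManager.getConnection("jdbc:sqlite:storage1.db");
             Statement stmt = storage.createStatement()) {

            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS datatable (idx INTEGER PRIMARY KEY, payload BLOB)");
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS regtable (idx INTEGER, tablename TEXT)");

            // The trigger lives in the same database file as datatable, so it keeps
            // firing no matter which connection later ATTACHes this file.
            stmt.executeUpdate(
                    "CREATE TRIGGER IF NOT EXISTS datatable_reg AFTER INSERT ON datatable " +
                    "BEGIN " +
                    "  INSERT INTO regtable VALUES (NEW.idx, 'datatable'); " +
                    "END");
        }

        // The manager connection sees the trigger-maintained registry after ATTACH.
        try (Connection manager = DriverManager.getConnection("jdbc:sqlite:manager.db");
             Statement stmt = manager.createStatement()) {
            stmt.executeUpdate("ATTACH DATABASE 'storage1.db' AS storage1");
            stmt.executeUpdate("INSERT INTO storage1.datatable VALUES (1, x'00')");
            // storage1.regtable now contains (1, 'datatable').
        }
    }
}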
Open the sqlite3 databases in multi-threading mode and handle the insert/update/query/delete functions in separate threads. I prefer to transfer the query results into an STL container for processing.