Calcite: what is the easiest way to get ID filters? - apache-calcite

I would like to find the easiest way to get the filters for id columns of my tables. Currently I use FilterableTable but that returns the filters as an expression tree and I would have to scan for it. I am wondering if there is an easier way to get the filter of my PK columns (the one I declare as keys or as indexed), i.e. get a from-to kind of structure.
EDIT: so what ideally I would expect is to extract a list of id ranges for the query, i.e. from filters to [id1 ... id2] , [id3...id4] and so on, where id1


DynamoDB Query distinct attribute values

I'm trying to query DynamoDB and get a result similar to select distinct(address) from ... in SQL.
I know DynamoDB is a document-oriented DB and maybe I need to change the data structure.
I'm trying to avoid getting all the data first and filtering later.
My data looks like this:
So I want to get the distinct addresses in the entire table.
What is the best way to do it?
Unfortunately, no. You'll need to Scan the entire table (you can use the ProjectionExpression or AttributesToGet options to ask just for the "Address" attribute, but anyway you'll pay for scanning the entire contents of the table).
If you need to do this scan often, you can add a secondary index which projects only the keys and the "Address" attribute, to make it cheaper to scan. But unfortunately, using a GSI whose partition key is the "Address" does not let you eliminate duplicates: each partition will still contain a list of duplicate items, and there is no way to list just the distinct partition keys in an index. Scanning the index will return the same partition key multiple times, once for each item in that partition.
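Since the duplicates cannot be removed on the server side, they have to be removed by the client while it pages through the Scan results. The following is a minimal in-memory sketch of that step only, with no AWS SDK calls: each scan page is modelled as a list of items, each item as a map from attribute name to string value, and `DistinctAddresses` and `distinctValues` are hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DistinctAddresses {
    /**
     * Collects the distinct values of one attribute across all scan pages.
     * Items that lack the attribute are skipped.
     */
    public static Set<String> distinctValues(List<List<Map<String, String>>> pages, String attribute) {
        Set<String> seen = new TreeSet<>();
        for (List<Map<String, String>> page : pages) {
            for (Map<String, String> item : page) {
                String value = item.get(attribute);
                if (value != null) {
                    seen.add(value); // a Set silently drops repeated values
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Two made-up pages standing in for successive Scan responses.
        List<List<Map<String, String>>> pages = List.of(
            List.of(Map.of("Address", "1 Main St"), Map.of("Address", "2 Oak Ave")),
            List.of(Map.of("Address", "1 Main St"), Map.of("Name", "no address here")));
        System.out.println(distinctValues(pages, "Address")); // prints [1 Main St, 2 Oak Ave]
    }
}
```

The whole table is still read, so this saves nothing on read cost; it only keeps the client's memory use proportional to the number of distinct addresses rather than the number of items.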

DynamoDB query by 3 fields

Hi I am struggling to construct my schema with three search fields.
So the two main queries I will use are:
Get all files from a user within a specific folder ordered by date.
Get all files from a user ordered by date.
Maybe there will be an additional query where I want:
All files from a user within a folder ordered by date and itemType == X
All files from a user ordered by date and itemType == X
So the userID has to be the primaryKey.
But what should I use as my sortKey? I tried to use a composite sortKey like: FOLDER${folderID}#FILE{itemID}#TIME{$timestamp} As I don't know the itemID, I can't use the beginsWith expression, right?
What I could do is filter by beginsWith: folderID but then descending sort by date would not work.
Or should I move away from dynamoDB to a relationalDB with those query requirements in mind?
DynamoDB data modeling can be tough at first, but it sounds like you're off to a good start!
When you find yourself requiring an ID and sorting by time, you should know about KSUIDs. KSUIDs are unique IDs that can be lexicographically sorted by time. That means that you can sort KSUIDs and they will be ordered by creation time. This is super useful in DynamoDB. Let's check out an example.
When modeling the one-to-many relationship between Users and Folders, you might do something like this:
In this example, User with ID 1 has three folders with IDs 1, 2, and 3. But how do we sort by time? Let's see what this same table looks like with KSUIDs for the Folder ID.
In this example, I replaced the plain ol' ID with a KSUID. Not only does this give me a unique identifier, but it also ensures my Folder items are sorted by creation date. Pretty neat!
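The sorting property is easy to see with a simplified stand-in. The sketch below is not the real KSUID format. It assumes an ID made of a fixed-width hexadecimal timestamp followed by a suffix (which would be random in practice), and `TimeSortableId` and `newId` are hypothetical names. Because the timestamp prefix has a fixed width, plain string order matches creation order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TimeSortableId {
    /**
     * Builds an ID whose string order follows the timestamp:
     * 8 zero-padded hex digits of epoch seconds, then the suffix.
     * Simplified illustration only, not the KSUID encoding.
     */
    public static String newId(long epochSeconds, String suffix) {
        return String.format("%08x", epochSeconds) + suffix;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        // Inserted out of order on purpose; the suffixes are arbitrary.
        ids.add(newId(1_700_000_300L, "c3"));
        ids.add(newId(1_700_000_100L, "zz"));
        ids.add(newId(1_700_000_200L, "a1"));

        // A plain lexicographic sort puts the oldest ID first,
        // regardless of the suffix.
        Collections.sort(ids);
        System.out.println(ids);
    }
}
```

A sort key in DynamoDB is ordered the same way, so an ID with this shape gives newest-first or oldest-first results without a separate timestamp attribute.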
There are several solutions to filtering by itemType, but I'd probably start with a global secondary index with a partition key of USER#user_id#itemType and FOLDER#folder_id as the sort key. Your base table would then look like this
and your index would look like this
This index allows you to fetch all items or a specific folder for a given user and itemType.
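To make the key layout concrete, here is a small in-memory sketch. It only builds the key strings described above and emulates a begins_with condition on a sorted set of sort keys. `KeyPrefixQuery` and its methods are hypothetical helpers, not part of any AWS SDK, and the sample IDs and the "image" item type are made up.

```java
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

public class KeyPrefixQuery {
    /** Index partition key in the form USER#user_id#itemType. */
    public static String indexPartitionKey(String userId, String itemType) {
        return "USER#" + userId + "#" + itemType;
    }

    /** Index sort key in the form FOLDER#folder_id. */
    public static String indexSortKey(String folderId) {
        return "FOLDER#" + folderId;
    }

    /**
     * Emulates begins_with on the sort keys of one partition: returns, in
     * ascending order, every key that starts with the given prefix.
     * Java compares UTF-16 code units; for ASCII keys the order is the same
     * as a byte-wise comparison.
     */
    public static SortedSet<String> beginsWith(NavigableSet<String> sortKeys, String prefix) {
        return sortKeys.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        // Sort keys of a single partition, e.g. USER#1#image.
        NavigableSet<String> sortKeys = new TreeSet<>();
        sortKeys.add(indexSortKey("a1"));
        sortKeys.add(indexSortKey("a2"));
        sortKeys.add(indexSortKey("b1"));

        System.out.println(indexPartitionKey("1", "image"));
        System.out.println(beginsWith(sortKeys, "FOLDER#a"));
    }
}
```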
These examples might not perfectly match your access patterns, but I hope they can get your data modeling process un-stuck! I don't see any reason why your access patterns can't be implemented in DynamoDB.
If you are sure about using DynamoDB, you should analyze the access patterns for this table in advance and choose the partition key and sort key based on the most frequent pattern. For each of the other patterns, you should add a GSI.
Usually, if the access patterns are unknown, an RDBMS looks better. For high-load systems, a common setup is NoSQL for the high-load workloads, with the data periodically uploaded to something like AWS Redshift.

Negative filtering by filter_box or some other mechanism

Let's say I have a column named Column1. There are more than 10k different values for this column, but my goal is to display on a dashboard all data except a few of them. Is it possible to achieve this in Superset? As far as I understand, the only option for filtering a dashboard is a filter_box, and I have to choose values explicitly in the filter_box, so there is no way to use a negative filter. Is that true, or is there some hidden mechanism?
You can use the limit selector values option to filter out the values you don't need, by specifying the column name and the list of values you would like to ignore, using the appropriate condition such as equals, not equals, etc.

DynamoDB fast search on complex data types

I need to create a new table on AWS DynamoDB that will have a structure like the following:
"email" : String (key),
... : ...,
"someStuff" : SomeType,
... : ...,
"listOfIDs" : Array<String>
This table contains users' data and a list of strings that I'll often query (see listOfIDs).
Since I don't want to scan the table every time in order to get the user linked to that specific ID due to its slowness, and I cannot create an index since it's an Array and not a "simple" type, how could I improve the structure of my table? Should I use a different table where I have all my IDs and the users linked to them in a "flat" structure? Is there any other way?
Thank you all!
Perhaps another table that looks like:
ID string / hash key,
Email string / range key,
Any other attributes you may want to access
The unique combination of ID and email will allow you to search on the "List of IDs". You may want to include other attributes within this table to save you from needing to perform another query.
Should I use a different table where I have all my IDs and the users linked to them in a "flat" structure?
I think this is going to be your best bet if you want to leverage DynamoDB's parallelism for query performance.
Another option might be using a CONTAINS expression in a query if your listOfIDs is stored as a set, but I can't imagine that will scale performance-wise as your table grows.
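The flat lookup table suggested in both answers can be pictured as an inverted map from each ID to the emails that list it. The sketch below builds that mapping in memory from the original per-user lists. It is only an illustration of the data shape (ID as hash key, email as range key), with hypothetical names `IdLookup` and `invert` and made-up sample data; writing the rows to DynamoDB is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class IdLookup {
    /**
     * Turns "email -> listOfIDs" into "ID -> emails".
     * Each (ID, email) pair corresponds to one row of the lookup table.
     */
    public static Map<String, SortedSet<String>> invert(Map<String, List<String>> listOfIdsByEmail) {
        Map<String, SortedSet<String>> emailsById = new TreeMap<>();
        for (Map.Entry<String, List<String>> user : listOfIdsByEmail.entrySet()) {
            for (String id : user.getValue()) {
                emailsById.computeIfAbsent(id, k -> new TreeSet<>()).add(user.getKey());
            }
        }
        return emailsById;
    }

    public static void main(String[] args) {
        Map<String, List<String>> users = Map.of(
            "a@example.com", List.of("id1", "id2"),
            "b@example.com", List.of("id2"));

        // Looking up one ID no longer requires visiting every user.
        System.out.println(invert(users).get("id2")); // prints [a@example.com, b@example.com]
    }
}
```

The trade-off is that every change to a user's listOfIDs now has to be mirrored in the lookup table as well.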

How to Scan HBase Rows efficiently

I need to write a MapReduce job that gets all rows in a given date range (say, the last month). It would have been a cakewalk had my row key started with the date, but my frequent HBase queries are on the starting values of the key.
My row key is exactly A|B|C|20120121|D, where the combination of A/B/C along with the date (in YearMonthDay format) makes a unique row ID.
My HBase tables could have up to a few million rows. Should my Mapper read the whole table and check whether each row falls in the given date range, or can a Scan / Filter help handle this situation?
Could someone suggest a way (or a snippet of code) to handle this situation effectively?
A RowFilter with a RegEx Filter would work, but would not be the most optimal solution. Alternatively you can try to use secondary indexes.
One more solution is to try the FuzzyRowFilter. A FuzzyRowFilter uses a kind of fast-forwarding, hence skipping many rows in the overall scan process and will thus be faster than a RowFilter Scan. You can read more about it here.
Alternatively BloomFilters might also help depending on your schema. If your data is huge you should do a comparative analysis on secondary index and Bloom Filters.
You can use a RowFilter with a RegexStringComparator. You'd need to come up with a RegEx that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.
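As a rough idea of what such a RegEx could look like, the sketch below builds a pattern for row keys of the form A|B|C|yyyyMMdd|D whose date falls in one calendar month, and checks it with plain java.util.regex so it runs without HBase. `RowKeyDateRegex` and `monthRegex` are hypothetical names, and it assumes that none of the fields contains a `|`. The resulting string is the kind of pattern one would hand to a RegexStringComparator; a range that spans several months would need one pattern per month or a wider expression.

```java
import java.util.regex.Pattern;

public class RowKeyDateRegex {
    /**
     * Regex for row keys A|B|C|yyyyMMdd|D whose date is in the given month.
     * Assumes exactly three pipe-delimited fields before the date.
     */
    public static String monthRegex(int year, int month) {
        String yearMonth = String.format("%04d%02d", year, month);
        // Three fields without pipes, the year and month, any two-digit day,
        // then the trailing field.
        return "^[^|]*\\|[^|]*\\|[^|]*\\|" + yearMonth + "\\d{2}\\|.*$";
    }

    public static void main(String[] args) {
        Pattern january2012 = Pattern.compile(monthRegex(2012, 1));

        System.out.println(january2012.matcher("A|B|C|20120121|D").matches()); // true
        System.out.println(january2012.matcher("A|B|C|20120221|D").matches()); // false
    }
}
```

A filter like this is still evaluated against every row the scan visits, which is why the other answers describe it as workable but not optimal.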
I am just getting started with HBase, bloom filters might help.
You can modify the Scan that you send into the Mapper to include a filter. If your date is also the record timestamp, it's easy:
// Only return cells whose timestamp is in [minTime, maxTime).
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
    OutputKey.class, OutputValue.class, job);
If the date in your row key is different, you'll have to add a filter to your scan. This filter can operate on a column or a row key. I think it's going to be messy with just the row key. If you put the date in a column, you can make a FilterList where all conditions must be true and use a CompareOp.GREATER and a CompareOp.LESS. Then use scan.setFilter(filterList) to add your filters to the scan.