https://use-the-index-luke.com/sql/where-clause/functions/case-insensitive-search
This page describes the problem I'm having and a potential solution.
To summarize the issue: I want to get all results matching a query using WHERE UPPER(some_column) = UPPER(#param). I have an index, and the query returns in under 50ms if I don't apply UPPER to some_column. The same query takes 4+ seconds with UPPER, since the table is indexed on some_column alone and not on the UPPER value of that column.
The author proposed this:
To support that query, we need an index that covers the actual search
term. That means we do not need an index on LAST_NAME but on
UPPER(LAST_NAME):
CREATE INDEX emp_up_name
ON employees (UPPER(last_name))
An index whose definition contains functions or expressions is a
so-called function-based index (FBI). Instead of copying the column
data directly into the index, a function-based index applies the
function first and puts the result into the index. As a result, the
index stores the names in all caps notation.
Does Spanner support a way to do this? If not, what is a good alternative?
I've tried creating a function-based index like this, but there's a syntax error, which makes me think functions aren't allowed in Cloud Spanner DDL:
CREATE INDEX some_index
ON Table (
UPPER(Type)
)
As you said, it's not possible to use UPPER in Cloud Spanner DDL, as it's not supported.
You can raise a feature request for that following this link [1].
The only workaround I can think of is transforming the data before writing it, so it's already stored in uppercase.
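For illustration, here is a rough sketch of that workaround in Python with the google-cloud-spanner client: keep an extra column holding the uppercased value, write it alongside the original, index it, and compare against an uppercased parameter. The table, column, and index names (Things, Type, TypeUpper, ThingsByTypeUpper) are made up for the example.

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Assumed schema (names are hypothetical):
#   CREATE TABLE Things (ThingId INT64, Type STRING(MAX), TypeUpper STRING(MAX)) PRIMARY KEY (ThingId)
#   CREATE INDEX ThingsByTypeUpper ON Things (TypeUpper)

def insert_thing(thing_id, type_value):
    # Write the original value and its uppercased copy together.
    with database.batch() as batch:
        batch.insert(
            table="Things",
            columns=("ThingId", "Type", "TypeUpper"),
            values=[(thing_id, type_value, type_value.upper())],
        )

def find_by_type(search_term):
    # Equality on the uppercased column lets the secondary index be used.
    with database.snapshot() as snapshot:
        rows = snapshot.execute_sql(
            "SELECT ThingId, Type FROM Things WHERE TypeUpper = @term",
            params={"term": search_term.upper()},
            param_types={"term": spanner.param_types.STRING},
        )
        return list(rows)

The same idea works from any client; the only requirement is that every writer keeps the uppercase column in sync with the original.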
Related
I am new to AWS. While reading the docs and the example here, I learned that the sort key is not only used to sort the data within a partition but also to narrow the search criteria on a DynamoDB table. But we can do the same with the help of a filter condition, so what is the difference?
Also, according to the example given, we can use the sort/range key like withKeyConditionExpression("CreateDate = :v_date and begins_with(IssueId, :v_issue)"),
but when I tried it, it gave me this exception:
com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: Query key condition not supported
Thanks
The point of the sort key is to limit the Items returned, rather than returning all Items with a particular HASH key.
There are two different ways we can handle this
The ideal way is to build the element we want to query into the RANGE key. This allows us to use Key Expressions to query our data, allowing DynamoDB to quickly find the Items that satisfy our Query.
A second way to handle this is with filtering based on non-key attributes. This is less efficient than Key Expressions but can still be helpful in the right situations. Filter expressions are used to apply server-side filters on Item attributes before they are returned to the client making the call. Filtering is applied after the DynamoDB Query is completed: if you retrieve 100KB of data in the Query step but filter it down to 1KB of data, you still consume the Read Capacity Units for 100KB of data.
Moral is - Filtering and projection expressions aren't a magic bullet - they won't make it easy to quickly query your data in additional ways. However, they can save network transfer time by limiting the number and size of items transferred back to your network. They can also simplify application complexity by pre-filtering your results rather than requiring application-side filtering.
From dynamodbguide
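As a rough sketch of the two approaches (shown in Python with boto3 rather than the Java SDK from the question), assuming a table whose partition key is CreateDate and whose sort key is IssueId, as in the quoted expression; the Priority attribute is invented for the filter example:

import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Issues")

# 1) Key condition: DynamoDB uses the sort key to read only the matching
#    items from the partition.
key_matched = table.query(
    KeyConditionExpression=Key("CreateDate").eq("2014-11-01")
    & Key("IssueId").begins_with("A-")
)["Items"]

# 2) Filter expression: every item under the partition key is read (and
#    billed), then filtered server-side before being returned.
filtered = table.query(
    KeyConditionExpression=Key("CreateDate").eq("2014-11-01"),
    FilterExpression=Attr("Priority").eq("High"),
)["Items"]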
I'm trying to create an array formula to auto-populate the total count of values for each column as columns are added.
I've tried doing this using a combination of count and indirect, as well as tried my hand at query, but I can't seem to get it to show unique value counts for each column.
This is my first time attempting to use query, and at first it seemed possible from reading through the documentation on the query language, but I haven't been able to figure it out.
Here's the shared document: https://docs.google.com/spreadsheets/d/15VwsL7uTsORLqBDrnT3VdwAWlXLh-JgoJVbz7wkoMAo/edit?usp=sharing
I know I can do this by writing a custom function in apps script, but I'd like to use the built-in functions if I can for performance reasons (there is going to be a lot of data), and I want quick refresh rates.
try:
=ARRAYFORMULA(IF(B5:5="",,TRANSPOSE(MMULT(TRANSPOSE(N(B6:99<>"")), SIGN(ROW(B6:99))))))
In B3 try
=ArrayFormula(IF(LEN(B5:5), COUNTIF(IF(B6:21<>"", COLUMN(B6:21)), COLUMN(B6:21)),))
I want to reduce the selected items by their distance from a point using MarkLogic Optic.
I have a table with data and a lat/long pair.
const geoData = op.fromView("namespace","coordinate");
geoData.where(op.le(distance(-28.13, 153.4, geoData.col("lat"), geoData.col("long")), 100))
The distance function I have already written; it utilises geo.distance(cts.point(lat, long), cts.point(lat, long)), but geoData.col("lat") passes an object that describes the full namespace of the column, not the value.
op.schemaCol('namespace', 'coordinate', 'long')
I suspect I need to do a map/reduce function, but the MarkLogic documentation gives the usual simplistic examples that are next to useless.
I would appreciate some help.
FURTHER INFORMATION
I have mostly solved this problem, except that some columns have null values. The data is sparse and not all rows have a lat/long.
So when cts.point runs in the where statement and two null values are passed, it raises an exception.
How do I coalesce, or prevent execution of cts.point, when the data columns are null? I don't want to reduce the data set, as the records with null values still need to be returned; they will just have a null distance.
Where possible, it's best to do filtering by passing a constraining cts.query() to where().
A constraining query matches the indexed documents and filters the set of rows to the rows that TDE projected from those documents before retrieving the filtered rows from the indexes.
If the lat and long columns are each distinct JSON properties or XML elements in the indexed documents, it may be possible to express the distance constraint using techniques similar to those summarized here:
http://docs.marklogic.com/guide/search-dev/geospatial#id_42017
In general, it's best to use map/reduce SJS functions for postprocessing on the filtered result set because the rows have to be retrieved to the enode to process in SJS.
Hoping that helps,
I am writing a simple app in Django that searches for records in a database.
The user inputs a name in the search field, and that query is used to filter records on a particular field, like:
Result = Users.objects.filter(name__icontains=query_from_searchbox)
E.g. -
The database consists of names: Shiv, Shivam, Shivendra, Kashiva, Varun, etc.
A search query 'shiv' returns records in the following order:
Kashiva, Shivam, Shiv and Shivendra
(ordered by primary key).
My question is: how can I achieve the order
Shiv, Shivam, Shivendra and Kashiva,
i.e. the most relevant result first, then the less relevant ones?
It's not possible to do that with standard Django, as that type of thing is outside the ORM's scope and specific to a search app.
When you're interacting with the ORM consider what you're actually doing with the database - it's all just SQL queries.
If you wanted to rearrange the results, you'd have to manipulate the queryset yourself: check exact matches first, then use regular expressions to check for partial matches.
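As one rough sketch of that kind of queryset manipulation, done inside the ORM with Case/When annotations rather than regular expressions (model and field names are taken from the question; the ranking scheme is illustrative):

from django.db.models import Case, IntegerField, Value, When

def search_users(query):
    # Users is the model from the question; import it from your app's models.
    # Rank exact matches first, then prefix matches, then anything else
    # that merely contains the query.
    return (
        Users.objects.filter(name__icontains=query)
        .annotate(
            rank=Case(
                When(name__iexact=query, then=Value(0)),
                When(name__istartswith=query, then=Value(1)),
                default=Value(2),
                output_field=IntegerField(),
            )
        )
        .order_by("rank", "name")
    )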
Search isn't really the kind of thing that is best suited to the ORM, however, so you may wish to consider looking at dedicated search applications. They will usually maintain an index, which avoids database hits, and may also offer a percentage-match ordering like you're looking for.
A good place to start may be with Haystack
I need to pick a document from a collection at random (alternatively - a small number of successive documents from a randomly-positioned "window").
I've found two solutions: 1 and 2. The first is unacceptable since I anticipate large collection size and wish to minimize the document size. The second seems ineffective (I'm not sure about the complexity of skip operation). And here one can find a mention of querying a document with a specified index, but I don't know how to do it (I'm using C++ driver).
Are there other solutions to the problem? Which is the most efficient?
I had a similar issue once. In my case, I had a date property on my documents. I knew the earliest date possible in the dataset so in my application code, I would generate a random date within the range of EARLIEST_DATE_IN_SET and NOW and then query mongodb using a GTE query on the date property and simply limit it to 1 result.
There was a small chance that the random date would be greater than the highest date in the data set, so I accounted for that in the application code.
With an index on the date property, this was a super fast query.
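A minimal sketch of that approach, shown in Python with pymongo rather than the C++ driver the question mentions; the collection name and the created_at field are assumptions:

import random
from datetime import datetime, timedelta

from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]
EARLIEST_DATE_IN_SET = datetime(2015, 1, 1)  # assumed lower bound of the data set

def random_document():
    # Pick a random pivot date between the earliest date and now.
    span = datetime.utcnow() - EARLIEST_DATE_IN_SET
    pivot = EARLIEST_DATE_IN_SET + timedelta(seconds=random.uniform(0, span.total_seconds()))
    # GTE query on the (indexed) date property, limited to one result.
    doc = coll.find_one({"created_at": {"$gte": pivot}}, sort=[("created_at", 1)])
    if doc is None:
        # One possible fallback when the pivot lands past the newest document:
        # return the latest document instead.
        doc = coll.find_one(sort=[("created_at", -1)])
    return doc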
It seems like you could mold solution 1 there (assuming your _id key was an auto-inc value): just do a count on your records, use that as the upper limit for a random int in C++, then grab that row.
Likewise, if you don't have an auto-inc _id key, just create one with your results. Having an additional INT field shouldn't add that much to your document size.
If you don't have an auto-inc field, Mongo talks about how to quickly add one here: Auto Inc Field.
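A quick sketch of that count-and-pick approach, again in Python with pymongo for brevity, assuming every document carries a sequential integer field named seq starting at 1 (the field name is made up):

import random
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]

def random_document_by_seq():
    # Count the records and draw a random value in [1, count],
    # then grab the document with that sequence number.
    count = coll.count_documents({})
    if count == 0:
        return None
    return coll.find_one({"seq": random.randint(1, count)})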