We're storing organization names in a DynamoDB table on AWS, and would like to maintain official capitalization in those business names, for example in "TNT" and "FedEx".
Our use case is that users of the application can search for organizations by name, but we'd like their queries to be interpreted case-insensitively. So, queries for "FedEx", "Fedex" or "fedex" should all return the correct item in the table.
Other databases offer ways to do this: a case-insensitive operator (for example the ILIKE keyword in PostgreSQL), regular-expression matching, or functions applied in the condition (for example the LOWER() function).
How can this be done in DynamoDB? The documentation on Amazon DynamoDB's Query does not provide an answer.
(The best work-around seems to be storing the name twice: once with the official capitalization in effect, and once in another field with the name converted to lowercase. Searching should then be done on the latter field, with the query search term also converted to lowercase. Yes, I know it adds redundancy to the table. It's a work-around, not an optimal solution.)
Yes, exactly. When you add a new item, also add a field, searchName, holding the lowercased value of your name field (possibly reduced further to letters, digits and spaces only), and then search on that searchName field.
Writing duplicate data in DynamoDB is not a good design. A better solution would be to add Elasticsearch alongside DynamoDB. You can connect this component out of the box using the AWS console, then use a custom analyzer in Elasticsearch to get case-insensitive matching.
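A minimal sketch of that normalization, in plain Python with no AWS calls. The helper names `to_search_name` and `build_item` are made up for illustration, and the exact rule (keep letters, digits and single spaces) is an assumption. The same function has to be applied to the user's search term before querying.

```python
import re


def to_search_name(name: str) -> str:
    """Lowercase a name and keep only letters, digits and single spaces."""
    lowered = name.casefold()
    kept = "".join(ch for ch in lowered if ch.isalnum() or ch.isspace())
    # Collapse runs of whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", kept).strip()


def build_item(name: str) -> dict:
    # "name" keeps the official capitalization; "searchName" is the field to query.
    return {"name": name, "searchName": to_search_name(name)}


item = build_item("FedEx")
user_query = to_search_name("Fedex")
print(item["searchName"] == user_query)
```

Because both the stored field and the query go through the same function, "FedEx", "Fedex" and "fedex" all end up as the same lookup value.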
Related
In my DynamoDB table named users, I need a unique identifier, which is easy for users to remember.
In a RDBMS I can use auto increment id to meet the requirement.
As there is no way to have auto increment id in DynamoDB, is there a way to meet this requirement?
I could keep the last used id in another table (lastIdTable), retrieve it before adding a new document, increment it, and save the updated number in both tables (lastIdTable and users), but that would be very inefficient.
UPDATE
Please note that there's no way of using an existing attribute or getting users input for this purpose.
Since it seems you must create a memorable userId without any information about the user, I’d recommend that you create a random phrase of 2-4 simple words from a standard dictionary.
For example, you might generate the phrase correct horse battery staple. (I know this is a userId and not a password, but the memorability consideration still applies.)
Whether you use a random number (which has similar memorability to a sequential number) or a random phrase (which I think is much more memorable), you will need to do a conditional write with the condition that the ID does not already exist, and if it does exist, you should generate a new ID and try again.
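The generate-then-retry loop can be sketched as follows. This is only an illustration: the word list, `allocate_user_id` and `FakeTable` are hypothetical, and `put_if_absent` stands in for the conditional write. Against DynamoDB that would be a put with a condition expression such as `attribute_not_exists(userId)`, assuming `userId` is the key attribute.

```python
import random

# Tiny illustrative word list; a real one would come from a standard dictionary.
WORDS = ["correct", "horse", "battery", "staple", "quiet", "river", "amber", "falcon"]


def random_phrase(rng: random.Random, n_words: int = 3) -> str:
    return "-".join(rng.choice(WORDS) for _ in range(n_words))


def allocate_user_id(put_if_absent, rng=None, max_attempts=10):
    """Try random phrases until the conditional write succeeds."""
    rng = rng or random.Random()
    for _ in range(max_attempts):
        candidate = random_phrase(rng)
        # put_if_absent must return False when the ID already exists.
        if put_if_absent(candidate):
            return candidate
    raise RuntimeError("no free ID found; use more words or a longer phrase")


class FakeTable:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self.ids = set()

    def put_if_absent(self, user_id: str) -> bool:
        if user_id in self.ids:
            return False
        self.ids.add(user_id)
        return True


table = FakeTable()
print(allocate_user_id(table.put_if_absent))
```

The retry budget matters in practice: the fuller the ID space gets, the more collisions occur, so the phrase length should leave plenty of headroom.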
email address seems the best choice...
Either as a partition key, or use a GUID as the partition key and have a Global Secondary Index over email address.
Or as Matthew suggested in a comment, let the users pick a user name.
Docker container naming strategy might give you some idea. https://github.com/moby/moby/blob/master/pkg/namesgenerator/names-generator.go
It produces unique (within a limited namespace) yet human-friendly names.
Examples
awesome_einstein
nasty_weinstein
perv_epstein
A similar one: https://github.com/jjmontesl/codenamize
Using AWS CloudSearch, I need to query 2 separate fields for the same value using a structured (compound) query, e.g.
(and (or name:'john smith') (or curr_addr:'123 someplace' other_addr:'123 someplace'))
This query works, but I'm wondering if it's necessary to repeat the value for each field that I want to search against. Is there some way to specify the value only once e.g. curr_addr+other_addr:'123 someplace'
That is the correct way to structure your compound query. From the AWS documentation, you'll see that they structure their example query the same way:
(and title:'star' (or actors:'Harrison Ford' actors:'William Shatner')(not actors:'Zachary Quinto'))
From Constructing Compound Queries
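Since the value has to be repeated per field, one way to avoid typing it twice is to build the query string in client code. The sketch below is an assumption about how one might do that. `quote` and `any_field` are hypothetical helpers, not part of any AWS SDK, and the escaping follows the structured-syntax rule that single quotes and backslashes inside a string are backslash-escaped.

```python
def quote(value: str) -> str:
    """Wrap a value in single quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def any_field(fields, value: str) -> str:
    """Build an (or ...) clause that matches the same value in any listed field."""
    terms = " ".join(f"{field}:{quote(value)}" for field in fields)
    return f"(or {terms})"


address_clause = any_field(["curr_addr", "other_addr"], "123 someplace")
query = f"(and name:{quote('john smith')} {address_clause})"
print(query)
# (and name:'john smith' (or curr_addr:'123 someplace' other_addr:'123 someplace'))
```

The single-term `(or name:'john smith')` from the question is written here as a bare `name:'john smith'` term, which should match the same documents.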
You may be able to get around this by listing the more repetitive fields in the query options (q.options), and naming a field explicitly only for the remaining terms. The fields list is a fallback for when you don't specify which field you are searching in a compound query. So if you list the address fields there, and then only specify the name field in your query, you may get close to the behavior you're looking for.
Query options
q.options={fields: ['curr_addr','other_addr']}
Query
(and (or name:'john smith') (or '123 someplace'))
But this approach would only work for one set of repetitive fields, so it's not a silver bullet by any means.
From Search API Reference (see q.options => fields)
User has an email address and a display name.
Both of these must be unique.
Both of these must be updatable, as long as the new value is not already in use.
A User table will exist with additional non-key attributes and a guid ID.
How should I model this to efficiently check whether an email address or display name is already in use?
Should I create a table with the guid as Key, no range, and 2 separate GSI one for email and one for display name (each being the key)? Both will also have a second field with the guid id of the user. Or should these be completely separate tables, or ????
Thoughts, is there a better way?
Thanks.
There are 3 ways I can think of to design this:
1. As you mentioned, a table keyed by guid with 2 separate GSIs, one for email and one for name.
2. Since both fields have to be unique, you could make one of them the hash key and create a GSI for the other. (This runs into a problem because you also need to update email and name: to change a hash key you have to delete the old record and add a new record with the same attributes and the updated hash key.) The advantage is that you pay less, as there is only one GSI compared to #1.
3. Another option is to use CloudSearch. Your DynamoDB table can be integrated with CloudSearch; with this option you simply create a table keyed by guid, with no GSIs, and whenever you want to search you query CloudSearch. One more advantage of CloudSearch is that you can query on any attribute of the table and apply different filters to it.
One thing to look at is the price difference between #2 and #3; go with whichever suits you better in terms of price and functionality.
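To make the first option concrete, here is an in-memory sketch of the lookup shape: a guid-keyed table plus one index per unique attribute. `UserDirectory` is a hypothetical stand-in, not DynamoDB code, and it ignores concurrency: in a real table the availability check and the write would be separate requests unless you make them conditional or transactional.

```python
import uuid


class UserDirectory:
    def __init__(self):
        self.users = {}     # guid -> {"email": ..., "display_name": ...}
        self.by_email = {}  # email -> guid (plays the role of the email index)
        self.by_name = {}   # display name -> guid (plays the role of the name index)

    def create(self, email: str, display_name: str) -> str:
        if email in self.by_email or display_name in self.by_name:
            raise ValueError("email or display name already in use")
        user_id = str(uuid.uuid4())
        self.users[user_id] = {"email": email, "display_name": display_name}
        self.by_email[email] = user_id
        self.by_name[display_name] = user_id
        return user_id

    def update(self, user_id: str, email=None, display_name=None) -> None:
        user = self.users[user_id]
        new_email = user["email"] if email is None else email
        new_name = user["display_name"] if display_name is None else display_name
        # A value is free if nobody holds it, or if the holder is this same user.
        if self.by_email.get(new_email, user_id) != user_id:
            raise ValueError("email already in use")
        if self.by_name.get(new_name, user_id) != user_id:
            raise ValueError("display name already in use")
        # Release the old values before claiming the new ones.
        del self.by_email[user["email"]]
        del self.by_name[user["display_name"]]
        self.by_email[new_email] = user_id
        self.by_name[new_name] = user_id
        user.update(email=new_email, display_name=new_name)
```

After an update, the old email or display name becomes available again for other users, which is the behavior the question asks for.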
If you implement this some other way, feel free to share it.
Hope that helps
I am writing a simple app in Django that searches for records in a database.
The user enters a name in the search field, and that query is used to filter records on a particular field, like this:
Result = Users.objects.filter(name__icontains=query_from_searchbox)
E.g. -
Database consists of names- Shiv, Shivam, Shivendra, Kashiva, Varun... etc.
A search query 'shiv' returns records in following order-
Kashiva, Shivam, Shiv and Shivendra
Ordered by primary key.
My question is: how can I achieve the following order?
Shiv, Shivam, Shivendra and Kashiva.
I mean the most relevant results first, then the less relevant ones.
It's not possible to do that with standard Django as that type of thing is outside the scope & specific to a search app.
When you're interacting with the ORM consider what you're actually doing with the database - it's all just SQL queries.
If you wanted to rearrange the results you'd have to manipulate the queryset, check exact matches, then use regular expressions to check for partial matches.
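As a rough illustration of that kind of post-processing, the sketch below ranks plain strings rather than a queryset, so in Django it would run on values already fetched from the database. The three tiers (exact match, prefix match, match elsewhere) and the shorter-first tie-break are assumptions about what "relevant" means here, and `rank_by_relevance` is a hypothetical helper. It uses string prefix checks instead of regular expressions for simplicity.

```python
def rank_by_relevance(names, query):
    """Order case-insensitive substring matches: exact, then prefix, then the rest."""
    q = query.casefold()

    def sort_key(name):
        n = name.casefold()
        if n == q:
            tier = 0
        elif n.startswith(q):
            tier = 1
        else:
            tier = 2
        # Within a tier, shorter names first, then alphabetical.
        return (tier, len(n), n)

    matches = [name for name in names if q in name.casefold()]
    return sorted(matches, key=sort_key)


names = ["Kashiva", "Shivam", "Shiv", "Shivendra", "Varun"]
print(rank_by_relevance(names, "shiv"))
# ['Shiv', 'Shivam', 'Shivendra', 'Kashiva']
```

Sorting in Python like this means loading every match into memory, which is fine for small result sets but is one reason a dedicated search index scales better.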
Search isn't really the kind of thing that is best suited to the ORM, however, so you may wish to consider looking at dedicated search applications. They will usually maintain an index, which avoids database hits and may also offer a percentage-match ordering like the one you're looking for.
A good place to start may be with Haystack
I posted a similar question over on the Adobe Community forums, but it was suggested to ask over here as well.
I'm trying to cache distinct queries associated with a particular database, and need to be able to flush all of the queries for that database while leaving other cached queries intact. So I figured I'd take advantage of ColdFusion's ehcache capabilities. I created a specific cache region to use for queries from this particular database, so I can use cacheRemoveAll(myRegionName) to flush those stored queries.
Since I need each distinct query to be cached and retrievable easily, I figured I'd hash the query parameters into a unique string that I would use for the cache key for each query. Here's the approach I've tried so far:
Create a Struct containing key value pairs of the parameters (parameter name, parameter value).
Convert the Struct to a String using SerializeJSON().
Hash the String using Hash().
Does this approach make sense? I'm wondering how others have approached cache key generation. Also, is the "MD5" algorithm adequate for this purpose, and will it guarantee unique key generation, or do I need to use "SHA"?
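For comparison, the same three steps look like this in Python (not CFML). `cache_key` is a hypothetical helper. Sorting the keys before serializing is an assumption added here so that the same parameters always produce the same string regardless of insertion order. No hash function can guarantee unique keys; it can only make accidental collisions extremely unlikely.

```python
import hashlib
import json


def cache_key(params: dict, algorithm: str = "sha256") -> str:
    """Serialize the parameters deterministically, then hash the result."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.new(algorithm, canonical.encode("utf-8")).hexdigest()


key_a = cache_key({"status": "active", "region": "EU"})
key_b = cache_key({"region": "EU", "status": "active"})
print(key_a == key_b)
```

Passing a different `algorithm` name swaps the hash without changing the rest of the approach.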
UPDATE: use cacheRegion attribute introduced in CF10!
http://help.adobe.com/en_US/ColdFusion/10.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-7fae.html
Then all you need to do is specify cachedAfter or cachedWithin, and forget about how to generate unique keys. CF will do it for you by 'hashing':
query "Name"
SQL statement
Datasource
Username and
password
DBTYPE
reference: http://www.coldfusionmuse.com/index.cfm/2010/9/19/safe.caching
I think this would be the easiest, unless you really need to fetch a specific query by key, in which case you can supply your own hash using cacheID, another new attribute introduced in CF10.