Fulltext Search DynamoDB - amazon-web-services

Following situation:
I'm storing elements in DynamoDB for my customers. The hash key is an element ID and the range key is the customer ID. In addition to these fields I'm storing an array of strings -> tags (e.g. ["Pets", "House"]) and a multiline text.
I want to provide a search function in my application where the user can type free text or select tags and get all related elements.
In my opinion a plain DB query is not the correct solution. I was playing around with CloudSearch, but I'm not really sure if this is the correct solution, because every time the user adds a tag the index must be updated...
I hope you have some hints for me.

DynamoDB is now integrated with Elasticsearch, enabling you to perform
full-text queries on your data.
https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-dynamodb-elasticsearch-integration/
DynamoDB streams are used to keep the search index up-to-date.

You can use an instant-search engine like Typesense to search through data in your DynamoDB table:
https://github.com/typesense/typesense
There's also Elasticsearch, but it has a steep learning curve and can become a beast to manage, given the number of features and configuration options it supports.
At a high level:
Turn on DynamoDB Streams
Set up an AWS Lambda trigger to listen to these change events
Write code inside your Lambda function to index data into Typesense:
import typesense

def lambda_handler(event, context):
    client = typesense.Client({
        'nodes': [{
            'host': '<Endpoint URL>',
            'port': '<Port Number>',
            'protocol': 'https',
        }],
        'api_key': '<API Key>',
        'connection_timeout_seconds': 2
    })

    processed = 0
    for record in event['Records']:
        ddb_record = record['dynamodb']
        if record['eventName'] == 'REMOVE':
            # Delete the document that corresponds to the removed item
            res = client.collections['<collection-name>'].documents[str(ddb_record['OldImage']['id']['N'])].delete()
        else:
            # Format your document here, then use the upsert function to index it
            document = ddb_record['NewImage']
            res = client.collections['<collection-name>'].documents.upsert(document)
        print(res)
        processed = processed + 1
    print('Successfully processed {} records'.format(processed))
    return processed
Here's a detailed article from Typesense's docs on how to do this: https://typesense.org/docs/0.19.0/guide/dynamodb-full-text-search.html

DynamoDB just added PartiQL, a SQL-compatible language for querying data. You can use the contains() function to find a value within a set (or a substring): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-functions.contains.html
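As a rough sketch of what that looks like from Python (the table name Elements and the tags attribute are hypothetical stand-ins for the OP's schema), PartiQL statements can be run through the boto3 client's execute_statement:

import boto3

client = boto3.client('dynamodb')

# contains() matches either a substring of a string attribute or a
# member of a set/list attribute; here it filters on the tags list.
response = client.execute_statement(
    Statement='SELECT * FROM "Elements" WHERE contains("tags", \'Pets\')'
)
for item in response['Items']:
    print(item)

Note that unless the WHERE clause targets a key, PartiQL will still perform a full scan under the hood.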

In your specific case you need Elasticsearch, but you can do wildcard text search on the sort key:
/* Return all of the songs by an artist, matching first part of title */
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE 'Call%';
/* Return all of the songs by an artist, with a particular word in the title...
...but only if the price is less than 1.00 */
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE '%Today%'
AND Price < 1.00;
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SQLtoNoSQL.ReadData.Query.html
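For reference, only the first pattern (a prefix match on the sort key) maps to a native key condition; a minimal boto3 sketch, assuming the same Music table from the docs:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Music')

# LIKE 'Call%' on the sort key corresponds to begins_with in a Query;
# an infix pattern like '%Today%' has no key-condition equivalent and
# would need a filter expression or a full-text index instead.
response = table.query(
    KeyConditionExpression=Key('Artist').eq('No One You Know')
                           & Key('SongTitle').begins_with('Call')
)
print(response['Items'])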

This is the advantage of using DynamoDB as a managed service from AWS: you get multiple components managed for you on top of the managed NoSQL database itself.
If you are using the downloadable version of DynamoDB, then you need to build your own Elasticsearch cluster and index the DynamoDB data yourself.

Related

How to retrieve all the items from DynamoDB using boto3?

I want to retrieve all the items from my table without specifying any particular parameter. I can do it using a key pair, but I want to get all items. How do I do that?
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Email')
response = table.get_item(
    Key={
        "id": "2"
    }
)
item = response['Item']
print(item)
This way I can get one item, but how do I retrieve all items? Is there a method for that?
If you want to retrieve all items you will need to use the Scan command.
You can do this by running
response = table.scan()
Be aware that running this will consume a large number of read capacity units (RCUs). With eventually consistent reads, 1 RCU covers two items (under 4 KB each); with strongly consistent reads, it is one item per RCU (under 4 KB).
Here is the consideration page for scans vs queries in AWS documentation.
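One thing the snippet above glosses over: Scan returns at most 1 MB of data per call, so to really get all items you have to follow LastEvaluatedKey. A minimal sketch, reusing the Email table from the question:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Email')

# Each Scan call returns at most 1 MB of data; keep paging with
# ExclusiveStartKey until LastEvaluatedKey is no longer present.
items = []
response = table.scan()
items.extend(response['Items'])
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    items.extend(response['Items'])

print('Retrieved {} items'.format(len(items)))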

Databricks - Test SQL-created table exists with correct columns

I'm trying to create a test in Databricks that checks that a suite of tables has been correctly created with the correct columns. This feels as if it should be simple, but I can't quite grasp the solution. Everything is migrated from Oracle, and my background is Oracle and SQL rather than Python.
So for example, imagine the following table that will be populated with dashboard data. If it already exists with a different structure, the reporting scripts will fail.
%sql
CREATE TABLE IF NOT EXISTS $report_schema.p23a_com
(AGE_GROUP STRING,
UniqServReqID STRING,
Rec_Code STRING,
Elapsed_days INT)
USING delta
PARTITIONED BY (AGE_GROUP)
Part of the test is as follows, but obviously the assert fails because of the partition-column info that DESCRIBE appends.
I can't seem to make DESCRIBE less wordy. I could remove the # rows from the input list, but that seems messy and makes it more difficult when I extend the test to pick up data types. Is there a better way to capture the schema?
def get_table_schema(dbase, table_name):
    desc_query = "DESCRIBE " + dbase + "." + table_name
    df_tab_cols = sqlContext.sql(desc_query)
    return df_tab_cols

def test_table_schema(tab_cols, list_tab_cols):
    input_col_list = tab_cols.select("col_name").rdd.map(lambda row: row[0]).collect()
    assert set(input_col_list) == set(list_tab_cols)

db = report_schema
table = "p23a_com"
cols = ["AGE_GROUP", "UniqServReqID", "Rec_Code", "Elapsed_days"]

df_tab_cols = get_table_schema(db, table)
test_table_schema(df_tab_cols, cols)
The answer is me being a bit thick and too SQL-focused. Instead of using SQL DESCRIBE, all I needed to do was read the table directly into a dataframe via Spark and then use its columns, i.e.
def get_table_schema(dbase, table_name):
    full_table_name = dbase + "." + table_name
    df_tab_cols = spark.table(full_table_name)
    return df_tab_cols

def test_table_schema(tab_cols, list_tab_cols):
    input_col_list = list(tab_cols.columns)
    assert set(input_col_list) == set(list_tab_cols)
    print(input_col_list)
My issue is that we have migrated a reporting system to Databricks and some builds are failing due to tables already existing, old versions of tables, or creates failing when drops haven't completed; we also don't have full Git integration, and some environments are more Wild West than I would like.
As a result, a schema check is necessary.
As I couldn't get an answer, the following is my final code in case anyone else needs something similar. Since I wanted to check both columns and column types, I switched to dictionaries; it turns out that comparing dictionaries is simple.
I don't think it is very Pythonic, and it isn't efficient when there are a lot of tables. I think the dictionary creation from a dataframe might need an RDD in non-Databricks environments.
Feel free to critique, as I am still learning.
# table_list is a Python dictionary in the form
# {tablename: {column1: column1_type, column2: column2_type, etc: etc}}
# When a table is added or changed in the build it should be added as a dictionary item.
table_list = {
    'p23a_com': {'AGE_GROUP': 'string',
                 'UniqServReqID': 'string',
                 'Rec_Code': 'string',
                 'Elapsed_days': 'int'},
    'p23b_com': {'AGE_GROUP': 'string',
                 'UniqServReqID': 'string',
                 'Org_code': 'string',
                 'Elapsed_days': 'int'}}

# Function takes the database name and the table name and returns
# a dictionary of the columns and their data types.
def get_table_schema(dbase, table_name):
    full_table_name = dbase + "." + table_name
    df_tab_cols = spark.createDataFrame(spark.table(full_table_name).dtypes)
    tab_cols_dict = dict(map(lambda row: (row[0], row[1]), df_tab_cols.collect()))
    return tab_cols_dict

# This is the test: it cycles through the table_list dictionary, fetching each
# table's columns and types and then doing a dictionary compare.
# It will fail on any missing table or any table with incorrect columns.
for tab in table_list:
    tab_cols_d = get_table_schema(db, tab)
    assert tab_cols_d == table_list[tab]

How can we get the total number of items in CloudSearch?

I am working with customer documents (data like fname, lname, orderAmount, etc.) which are indexed in AWS CloudSearch. I am showing this data in a jQuery DataTable, and for pagination I need the total number of items available for search. Is there any way I can get a count of all available documents in CloudSearch?
I am getting the matching count in the response, but not the total number of items in the CloudSearch domain.
I have searched for this under https://docs.aws.amazon.com/cloudsearch/latest/developerguide/what-is-cloudsearch.html but did not find anything useful.
Any trick to get the total document count for a particular CloudSearch domain?
Amazon does not have an easy way to fetch the total record count, so we basically pass a query that matches all the records and read the total match count ('found') from the response, while returning as little actual data as possible.
Key points:
Get only a single item (size=1)
Get the item without any fields (return=_no_fields)
For example:
http://search-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com/2013-01-01/search?q=(and+(id:0))&q.parser=structured&return=_no_fields&size=1
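If you are doing this from Python, here is a minimal sketch of the same trick using boto3's cloudsearchdomain client and the structured query syntax's matchall operator (which matches every document); the endpoint URL below is a placeholder for your own domain's search endpoint:

import boto3

# endpoint_url must be your domain's search endpoint (placeholder below).
client = boto3.client(
    'cloudsearchdomain',
    endpoint_url='https://search-movies-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com'
)

# matchall matches every document; size=1 and _no_fields keep the
# response tiny, since we only care about the count.
response = client.search(
    query='matchall',
    queryParser='structured',
    size=1,
    returnFields='_no_fields'
)
print(response['hits']['found'])  # total number of searchable documents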

Querying nested attributes in Amazon DynamoDB

How can I efficiently query on nested attributes in Amazon DynamoDB?
I have a document structure as below, which lets me store related information in the document itself (rather than referencing it).
It makes sense to store the seminars nested in the course, since they will likely be queried alongside the course (they are all course-specific, i.e. a course has many seminars, and a seminar belongs to a course).
In CouchDB, which I'm migrating from, I could write a view that would project some nested attributes for querying. I understand that I can't project anything that isn't a top-level attribute into a DynamoDB secondary index, so this approach doesn't seem to work.
This brings me back to the question: how can I efficiently query on nested attributes without scanning, if I can’t use them as keys in an index?
For example, if I want to get average attendance at Nelson Mandela Theatre, how can I query for the values of registrations and attendees in all seminars that have a location of “Nelson Mandela Theatre” without resorting to a scan?
{
  "course_id": "ABC-1234567",
  "course_name": "Statistics 101",
  "tutors": ["Cognito-sub-1", "Cognito-sub-2"],
  "seminars": [
    {
      "seminar_id": "XXXYYY-12345",
      "epoch_time": "123456789",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "How to lie with statistics",
      "registrations": "92",
      "attendees": "61"
    },
    {
      "seminar_id": "BBBCCC-44444",
      "epoch_time": "155555555",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "Statistical significance for dog owners",
      "registrations": "244",
      "attendees": "240"
    },
    {
      "seminar_id": "XXXAAA-54321",
      "epoch_time": "223456789",
      "duration": "4000",
      "location": "Starbucks",
      "name": "Is feral cat population growth a leading indicator for the S&P 500?",
      "registrations": "40"
    }
  ]
}
{
  "course_id": "CJX-5553389",
  "course_name": "Cat Health 101",
  "tutors": ["Cognito-sub-4", "Cognito-sub-9"],
  "seminars": [
    {
      "seminar_id": "TTRHJK-43278",
      "epoch_time": "123456789",
      "duration": "5400",
      "location": "Catwoman Hall",
      "name": "Emotional support octopi for cats",
      "registrations": "88",
      "attendees": "87"
    },
    {
      "seminar_id": "BBBCCC-44444",
      "epoch_time": "123666789",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "Statistical significance for cat owners",
      "registrations": "44",
      "attendees": "44"
    }
  ]
}
An index cannot be created on nested attributes (i.e. document data types in DynamoDB).
Document Types – A document type can represent a complex structure
with nested attributes—such as you would find in a JSON document. The
document types are list and map.
Query API:
A Query operation searches only primary key attribute values, and supports a subset of comparison operators on key attribute values to refine the search.
Scan API:
A Scan operation scans the entire table. You can specify filters to apply to the results to refine the values returned to you, after the complete scan.
In order to use the Query API, a hash key value is required. Nothing in the OP indicates that a hash key value is available. As per the OP, the data needs to be queried by the location attribute, which is inside a DynamoDB list data type. So the next option is to look at a GSI.
Kindly read more about GSIs. One of the rules is that a GSI can be created using top-level attributes only, so location can't be used to create the index.
So creating a GSI in order to use the Query API is ruled out as well.
The index key attributes can consist of any top-level String, Number,
or Binary attributes from the base table; other scalar types, document
types, and set types are not allowed.
For the reasons above, the Query API can't be used to get the data based on the location attribute, assuming a hash key value is not available.
If a hash key value is available, a FilterExpression can be used to filter the data. The only way to filter data inside a complex list data type is the CONTAINS function, and CONTAINS requires all the attributes of the list element to match (i.e. seminar_id, location, duration and all the other attributes). So it is definitely not possible to fulfil the use case mentioned in the OP with the current data model.
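To make the CONTAINS limitation concrete, here is a sketch (the table name Courses is hypothetical) showing that the filter only matches when the entire seminar map from the question's data is supplied, not just the location:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Courses')  # hypothetical table name

# contains() on a list of maps only matches whole elements, so the full
# seminar map must be supplied; contains(seminars, 'Nelson Mandela Theatre')
# would not match anything.
response = table.scan(
    FilterExpression='contains(seminars, :sem)',
    ExpressionAttributeValues={
        ':sem': {
            'seminar_id': 'XXXYYY-12345',
            'epoch_time': '123456789',
            'duration': '5400',
            'location': 'Nelson Mandela Theatre',
            'name': 'How to lie with statistics',
            'registrations': '92',
            'attendees': '61'
        }
    }
)
print(response['Items'])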
Proposed alternate solution:
Re-modelling the data structure as mentioned below could be an option to resolve the problem. There is definitely no other solution available to fulfil the use case using the Query API.
Main table:
course_id - hash key
seminar_id - sort key
GSI:
seminar location - hash key
course_id - sort key
In a DynamoDB table, each key value must be unique. However, the key values in a global secondary index do not need to be unique.
Now you can use the Query API on the GSI to get the data where the seminar location equals "Nelson Mandela Theatre". You can include the course id in the Query if you know its value. The Query will potentially return multiple items in the result set, and you can use a FilterExpression if you would like to further filter the data on non-key attributes.
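A minimal boto3 sketch of querying such a GSI (the table name Seminars and index name location-index are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Seminars')  # hypothetical table name

# Query the GSI proposed above, which has the seminar location
# as its hash key.
response = table.query(
    IndexName='location-index',  # hypothetical index name
    KeyConditionExpression=Key('location').eq('Nelson Mandela Theatre')
)
for item in response['Items']:
    print(item['registrations'], item.get('attendees'))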
Here is an example of using a filter expression. It is with a Scan operation, but maybe you can apply something similar for Query instead of Scan (take a look at the API):
{
  "TableName": "MyTable",
  "FilterExpression": "#k_Compatible.#k_RAM = :v_Compatible_RAM",
  "ExpressionAttributeNames": {
    "#k_Compatible": "Compatible",
    "#k_RAM": "RAM"
  },
  "ExpressionAttributeValues": {
    ":v_Compatible_RAM": "RAM1"
  }
}
One workaround to make this work with Scan is to store the object in stringified form, like:
{
  "language": "[{\"language\":\"Male\",\"proficiency\":\"Female\"}]"
}
and then perform a scan operation with a filter such as:
language: {
  contains: "Male"
}
On the client side you can then JSON.parse(language).
I don't have much experience with DynamoDB yet, but I started studying it since I'm planning to use it for my next project.
As far as I could understand from the AWS documentation, the answer to your question is: it's not possible to efficiently query on nested attributes.
Looking at the Best Practices, especially Best Practices for Using Secondary Indexes in DynamoDB, it's possible to understand that the right approach should be using different item types under the same partition key, as sketched below. Under the same course_id you would have a generic sort key (sk). The first item would have sk = 'Details' with the course's data, then other items like "seminar-1" with their data, and so on.
You would then expose the seminar properties you would like to query as GSIs (Global Secondary Indexes), bearing in mind that you can only have 5 GSIs per table.
Hope it helps.
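A rough sketch of that single-table layout, with hypothetical sort-key values built from the question's data:

# Hypothetical adjacency-list style layout: one 'Details' item per course,
# plus one item per seminar, all sharing the course_id partition key.
items = [
    {'course_id': 'ABC-1234567', 'sk': 'Details',
     'course_name': 'Statistics 101'},
    {'course_id': 'ABC-1234567', 'sk': 'seminar-XXXYYY-12345',
     'location': 'Nelson Mandela Theatre',
     'registrations': '92', 'attendees': '61'},
    {'course_id': 'ABC-1234567', 'sk': 'seminar-BBBCCC-44444',
     'location': 'Nelson Mandela Theatre',
     'registrations': '244', 'attendees': '240'},
]
# A GSI with 'location' as its hash key would then let you query
# seminars by venue directly.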
You can use document paths to filter the values. Use seminars.location as the document path.

Select distinct count cloudant/couchdb

I am starting a project using Cloudant.
It's a simple system for logging, so I can track the usage of my apps.
My documents look like this:
{
  "app": "name of the app",
  "type": "page view | login | etc..",
  "owner": "email_of_the_user",
  "device": "iphone | android | etc..",
  "date": "yyyy-mm-dd"
}
I've tried some map-reduce and faceted searches, but so far I couldn't get the result I want.
I want to count the number of distinct documents grouped by owner, date (yyyy-mm-dd), and app.
For example, if the same user logs in to the app twice or 20 times on the same date, it is counted only once. I want to count how many unique users used an app each day, no matter what the type of the log is or which device they used.
If it were SQL, assuming that each key of the document is a column, I would query something like this:
SELECT app, date, count(*) FROM LOGS GROUP BY date, owner, app
and the result would be something like:
'App1', '2015-06-01', 200
'App1', '2015-06-02', 232
'App2', '2015-06-01', 142
'App2', '2015-06-02', 120
How can I get the same result using Cloudant/CouchDB?
You can do this using design documents, as Cesar mentioned. A concrete example would be to create a view whose map function emits the field you want to group on, such as:
function(doc) {
  emit(doc.owner, 1);
}
Then you select your desired reduce function (such as _count). When viewing this in the Cloudant dashboard, make sure you select Reduce under the query options. When accessing the view via its URL, you need to pass the appropriate parameters (reduce=true&group=true).
The documentation on Views here is pretty thorough: https://docs.cloudant.com/creating_views.html
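By way of illustration, querying such a view over HTTP from Python might look like this (the account, database, design document, and view names are all hypothetical placeholders):

import requests

# Hypothetical account/database/design-document/view names.
# group=true collapses rows with the same key, so the _count reduce
# yields one count per distinct key.
url = ('https://ACCOUNT.cloudant.com/logs/'
       '_design/stats/_view/by_owner')
params = {'reduce': 'true', 'group': 'true'}
response = requests.get(url, params=params, auth=('USERNAME', 'PASSWORD'))
for row in response.json()['rows']:
    print(row['key'], row['value'])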
For what you need, there is a feature in Cloudant/CouchDB called design documents. You can check their documentation for this feature for details, or this guide:
http://guide.couchdb.org/draft/design.html
Cloudant documentation:
https://docs.cloudant.com/design_documents.html
Design documents are similar to views in the SQL world.
Regards,
We were able to do this in our project using the Cloudant Java API...
https://github.com/cloudant/java-cloudant
You should be able to get this sort of result by creating a view that has a map function like this...
function(doc) {
  emit([doc.app, doc.date, doc.owner], 1);
}
The reduce function should look like this:
function(keys, values, rereduce) {
  if (rereduce) {
    return sum(values);
  } else {
    return sum(values);
  }
}
Then we used the following query to get the data we wanted.
Database db = ....
db.view(viewName).startKey(startKeys).endKey(endKeys)
    .group(true).includeDocs(false).query(castClass);
We supplied the view name and some start and end keys (since we emitted a compound key, we needed to supply a filter), and then used the group method to get the data back the way you need it.
Revised..
With this new emit key in the map function you should get results like this:
{[
  {[app1, 2015-06-28, john#somewhere.net], 12},  <- john visited 12 times on that day...
  {[app1, 2015-06-29, john#somewhere.net], 10},
  {[app1, 2015-06-28, ann#somewhere.net], 1}
]}
If you use good start and end keys, the amount of data you're querying will stay small, and the number of rows you get back is the number of unique visitors you are seeking. Note that in this scenario you are getting back a bit more than you strictly want, but it does work.