I have a scenario where I have to import data (millions of records) from multiple sources and save it in a database. A user should get results in under 2-3 seconds when they try to search for any information related to that data.
For this, I designed an architecture where I use Go to import data from multiple sources and push it into AWS SQS. I've created a Lambda function that triggers when SQS has data; it then pushes the data into AWS Elasticsearch. I've also created a REST API that serves the results to the user.
A cron job runs this import every morning. Now my problem is: when a new batch of data comes in, I want to delete the existing data and replace all of it with the new data.
I'm stuck on how to achieve this delete-and-replace step.
I thought of creating a temporary index and then swapping it with the original index. But the problem is that I don't know when the import has finished, so I don't know when to make the switch.
The concept you're after is an index alias. The basic workflow would be:
Import today's data into an index named my-index-2019-09-16 (for example).
Make sure the import is complete and worked correctly.
Point the alias to the new index (it's an atomic switch between the indices):
POST /_aliases
{
  "actions": [
    { "remove": { "index": "my-index-2019-09-15", "alias": "my-index" } },
    { "add": { "index": "my-index-2019-09-16", "alias": "my-index" } }
  ]
}
Delete the old index.
You will temporarily need double the disk space during the import, but otherwise this should work without any issues, and you only delete data once it has a proper replacement.
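If the importer itself drives the switch, the whole step can be a small script. Here is a minimal TypeScript sketch (assuming the cluster is reachable at http://localhost:9200 and the example index/alias names above; swapAlias is just an illustrative helper) that performs the atomic alias swap and then deletes the old index:

// Uses the global fetch available in Node 18+; adjust ES_URL and auth to your cluster.
const ES_URL = 'http://localhost:9200';

async function swapAlias(oldIndex: string, newIndex: string, alias: string): Promise<void> {
  // Atomically move the alias from the old index to the new one.
  const res = await fetch(`${ES_URL}/_aliases`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      actions: [
        { remove: { index: oldIndex, alias } },
        { add: { index: newIndex, alias } },
      ],
    }),
  });
  if (!res.ok) throw new Error(`alias swap failed: ${res.status}`);

  // Only delete the old index once the alias points at its replacement.
  await fetch(`${ES_URL}/${oldIndex}`, { method: 'DELETE' });
}

// Example: swapAlias('my-index-2019-09-15', 'my-index-2019-09-16', 'my-index');

Calling this only after the importer has verified the new index (step 2 above) also answers the "I do not know when importing has ended" concern: the switch happens in the same process that knows the import succeeded.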
I have a web application that uses DynamoDB to store my large JSON objects and perform simple CRUD operations on them via a web API. I would like to add a new table that acts as a categorization of these values. The user should be able to select from a selection box which category the object belongs to. If a desirable category does not exist, the user should be able to create a new category by specifying a name, which will then be available to other objects in the future.
It is critical to the application that every one of these categories be given an integer ID, incrementing from 1. These auto-generated numbers will become reproducible serial numbers for back-end reports, which will not use the user-visible text name.
So I would like to have a simple API available from the web frontend that allows me to:
A) GET /category : returns { int : string, ... }, a map of every category ID to its name
B) PUSH /category : accepts a string and stores it under the next integer ID
Here are some ideas for how to handle this kind of project.
Store it in DynamoDB with integer indexes. This has some benefits but leaves a lot to be desired. Firstly, there's no auto-incrementing ID in DynamoDB, but I could read the current state of the table, create a new ID, and store the result. This might have issues with consistency and race conditions, but there's probably a way to achieve it safely. It might, however, be a big anti-pattern to use DynamoDB this way.
Store it in DynamoDB as one object in a table under some arbitrary key, i.e. just store the whole mapping as a single JSON object. This really abandons the notion of tables in DynamoDB and uses it as a simple file. It might also run into issues with race conditions.
Use AWS ElastiCache to have a Redis key-value store. This might be "the right" decision, but the downside is that ElastiCache is an always-on DB offering where you pay per hour. For a low-traffic web site like mine I'd be paying a minimum of about $12/mo, I think, and I would really like for this to be pay-per-access/update due to the low volume. I'm not sure there's an auto-increment feature built into Redis the way I'd need it, but it's pretty trivial to write a transaction that gets the length of the table, adds one, and stores a new value. Race conditions are easily avoided with this solution.
Use a SQL database like AWS Aurora or MySQL. This has the same upsides as Redis, but it's even more overkill, costs a lot more, and is still always on.
Run my own in-memory web service, MongoDB, etc. Still, you're paying for containers running constantly. Writing my own thing is obviously silly, but I'm sure there are services that match this problem perfectly; they would all require an always-on container, though.
Is there a good way to store a simple list or integer mapping like this that doesn't incur a constant monthly cost? Is there a better way to do this with DynamoDB?
Store the maxCounterValue as an item in DynamoDB.
For the PUSH /category, perform the following:
Get the current maxCounterValue.
TransactWrite:
Put the category name and id into a new item with id = maxCounterValue + 1.
Update maxCounterValue to maxCounterValue + 1, adding a ConditionExpression to check that maxCounterValue = :valueFromGetOperation.
If the TransactWrite fails, start again at step 1 and retry up to X more times (see the sketch below).
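A rough TypeScript sketch of those steps with the AWS SDK v3 document client; the single 'categories' table, the 'maxCounterValue' item key, and the attribute names are assumptions made for illustration:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, GetCommand, TransactWriteCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = 'categories'; // assumed table holding both the categories and the counter item

async function createCategory(name: string, maxAttempts = 5): Promise<number> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // 1. Get the current maxCounterValue (0 if the counter item does not exist yet).
    const counter = await ddb.send(new GetCommand({
      TableName: TABLE,
      Key: { pk: 'maxCounterValue' },
    }));
    const current = Number(counter.Item?.value ?? 0);
    const next = current + 1;

    try {
      // 2. TransactWrite: put the new category and bump the counter in one transaction,
      //    conditioned on the counter still holding the value we just read.
      await ddb.send(new TransactWriteCommand({
        TransactItems: [
          {
            Put: {
              TableName: TABLE,
              Item: { pk: `category#${next}`, id: next, name },
            },
          },
          {
            Update: {
              TableName: TABLE,
              Key: { pk: 'maxCounterValue' },
              UpdateExpression: 'SET #v = :next',
              ConditionExpression: 'attribute_not_exists(#v) OR #v = :current',
              ExpressionAttributeNames: { '#v': 'value' }, // "value" is a reserved word in DynamoDB
              ExpressionAttributeValues: { ':next': next, ':current': current },
            },
          },
        ],
      }));
      return next; // this becomes the category's serial number
    } catch {
      // Another writer won the race; loop back to step 1 and retry.
    }
  }
  throw new Error('could not allocate a category id after retries');
}

A GET /category handler would then just be a Query or Scan over the category items, which stays pay-per-request if the table uses on-demand capacity.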
The Apollo docs suggest implementing the merge function in a field policy when implementing pagination logic:
merge(existing = [], incoming) {
  return [...existing, ...incoming];
}
However, when I use the 'cache-and-network' fetch policy for the query, it first loads data from the cache and then goes out to the network, appending the incoming data to the existing list. So if the incoming data is the same as what was in the cache before, every item ends up in the cache twice.
What is the correct way to solve this? Can I differentiate between an initial load and a fetchMore request in the merge function? The merge function should obviously behave differently for an initial fetch, which should overwrite what was loaded from the cache, than for a pagination fetchMore.
In case anyone stumbles into this issue, the solution as of Apollo Client 3 is:
fetchPolicy: 'cache-and-network',
nextFetchPolicy: 'cache-first',
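For context, here is a sketch of where those options go in Apollo Client 3; the query, field name, and variables below are made up for illustration:

import { gql, useQuery } from '@apollo/client';

// Hypothetical paginated query; its field is assumed to use the merge policy from the question.
const GET_ITEMS = gql`
  query GetItems($offset: Int!, $limit: Int!) {
    items(offset: $offset, limit: $limit) {
      id
      title
    }
  }
`;

function useItems() {
  return useQuery(GET_ITEMS, {
    variables: { offset: 0, limit: 20 },
    // First execution: read the cache immediately, then refresh from the network once.
    fetchPolicy: 'cache-and-network',
    // Afterwards fall back to cache-first, so ordinary re-renders don't trigger
    // additional network requests whose results merge would append again.
    nextFetchPolicy: 'cache-first',
  });
}

// Pagination still goes through fetchMore (and therefore through merge):
// const { data, fetchMore } = useItems();
// fetchMore({ variables: { offset: data?.items.length ?? 0, limit: 20 } });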
I have some sample data (simulating real data that I will begin getting soon) that represents user behavior on a website.
The data is broken down into 2 JSON files for every day of usage. (When I'm getting the real data, I will want to fetch it every day at midnight.) At the bottom of this question are example snippets of what this data looks like, if that helps.
I'm no data scientist, but I'd like to be able to do some basic analysis on this data. I want to be able to see things like how many of the user-generated objects existed on any given day, and the distribution of different attributes that they have/had. I'd also like to be able to visualize which objects are getting edited more, by whom, when and how frequently. That sort of thing.
I think I'd like to be able to make dashboards in Google Data Studio (or similar), which basically means getting this data in a usable format into a normal relational database. I'm thinking Postgres on AWS RDS (there isn't so much data that I need something like Aurora, I think, though I'm not terribly opposed).
I want to automate the ingestion of the data (for now the sample data sets are stored on S3, but eventually it will come from an API that can be called daily), and I want to automate any reformatting/processing this data needs in order to get the types of insights I want.
AWS has so many data science/big data tools that it feels to me like there should be a way to automate this type of data pipeline, but the terminology and concepts are too foreign to me, and I can't figure out what direction to move in.
Thanks in advance for any advice that y'all can give.
Data example/description:
One file is a catalog of all user-generated objects that exist at the time the data was pulled, along with their attributes. It looks something like this:
{
  "obj_001": {
    "id": "obj_001",
    "attr_a": "a1",
    "more_attrs": {
      "foo": "fred",
      "bar": null
    }
  },
  "obj_002": {
    "id": "obj_002",
    "attr_a": "b2",
    "more_attrs": {
      "foo": null,
      "bar": "baz"
    }
  }
}
The other file is an array that lists all the user edits to those objects that occurred in the past day, which resulted in the state from the first file. It looks something like this:
[
  {
    "edit_seq": 1,
    "obj_id": "obj_002",
    "user_id": "u56",
    "edit_date": "2020-01-27",
    "times": {
      "foo": null,
      "bar": "baz"
    }
  },
  {
    "edit_seq": 2,
    "obj_id": "obj_001",
    "user_id": "u25",
    "edit_date": "2020-01-27",
    "times": {
      "foo": "fred",
      "bar": null
    }
  }
]
It depends on the architecture that you want to deploy. If you want an event-based trigger, I would use SQS (I have used it heavily): as soon as someone drops a file in S3, it can send a message to SQS, which in turn can trigger a Lambda.
Here is a link which can give you some idea: http://blog.zenof.ai/processing-high-volume-big-data-concurrently-with-no-duplicates-using-aws-sqs/
You could build data pipelines using AWS Data Pipeline, for example if you want to read data from S3, apply some transformations, and then load it into Redshift.
You can also have a look at AWS Glue, which has a Spark backend and can also crawl the schema and perform ETL.
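To make the event-driven option concrete, here is a rough TypeScript sketch of a Lambda that fires when a daily file lands in S3, parses the edits array shown above, and appends rows to a Postgres table on RDS. The bucket layout, the "edits" table, and its columns are assumptions made for illustration (and an SQS-wrapped event would need one extra level of unwrapping):

import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { Client } from 'pg';

const s3 = new S3Client({});

// Assumed shape of one entry in the daily edits file shown above.
interface Edit {
  edit_seq: number;
  obj_id: string;
  user_id: string;
  edit_date: string;
  times: Record<string, string | null>;
}

// Triggered by an S3 "object created" notification.
export async function handler(event: { Records: { s3: { bucket: { name: string }; object: { key: string } } }[] }) {
  const db = new Client({ connectionString: process.env.DATABASE_URL }); // RDS Postgres
  await db.connect();
  try {
    for (const record of event.Records) {
      // Download and parse the day's edits file.
      const obj = await s3.send(new GetObjectCommand({
        Bucket: record.s3.bucket.name,
        Key: record.s3.object.key,
      }));
      const edits: Edit[] = JSON.parse(await obj.Body!.transformToString());

      // Flatten each edit into a row of the assumed "edits" table.
      for (const e of edits) {
        await db.query(
          'INSERT INTO edits (edit_seq, obj_id, user_id, edit_date, times) VALUES ($1, $2, $3, $4, $5)',
          [e.edit_seq, e.obj_id, e.user_id, e.edit_date, JSON.stringify(e.times)],
        );
      }
    }
  } finally {
    await db.end();
  }
}

The catalog file could be loaded the same way into its own table, after which Data Studio (or any BI tool) can point at the Postgres instance directly.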
I have a view like this:
function (doc, meta) {
  if (doc.Tenant) {
    emit([doc.Tenant.Id, doc.Tenant.User.Name], doc);
  }
}
From this view I want all the values where Tenant.Id == 1 and User.Name contains "a".
I can do this search in my C# code by collecting all the tenant data belonging to a particular Tenant Id.
But I have millions of documents for each tenant, so I need to do this filtering on the server side itself.
Is this kind of search possible?
I'm guessing that you want to be able to change which letter you are searching for in the string; unfortunately, Couchbase isn't going to be the best fit for this type of query.
If it will always be the letter 'a' that you want to search for, then you could use a map like this and query on the Tenant Id:
function (doc, meta) {
  if (doc.Tenant) {
    var name = doc.Tenant.User.Name.toLowerCase();
    if (name.indexOf("a") > -1) {
      emit(doc.Tenant.Id, null);
    }
  }
}
If, however, you want to be able to dynamically change which letter or even substring you search for in the name, then you should consider something like Elasticsearch (great for text searching). Couchbase has an Elasticsearch transport plugin that will automatically replicate to your Elasticsearch node(s).
Here is a presentation on ES and Couchbase
http://www.slideshare.net/Couchbase/using-elasticsearch-and-couchbase-together-to-build-large-scale-applications
The documentation for installation and getting started with the ES plugin
http://docs.couchbase.com/couchbase-elastic-search/
And a cool tutorial detailing how to add a GUI on top of your ES data for easy filtering.
http://blog.jeroenreijn.com/2013/07/visitor-analysis-with-couchbase-elasticsearch.html
Let's imagine that we have a Couchbase bucket containing N docs, each of size S bytes, and V views. We need to retrieve those docs, including all the information they contain.
One way:
Create a view with a map function like this:
function (doc, meta) {
  if (meta.type == "json" && doc.type == "mytype") {
    emit([doc.field1, doc.field2], doc);
  }
}
This map function returns all the data we need in one step, but on the other hand it produces quite a large amount of view data.
Another way:
Create a view that returns only document ids, like this (or even the document key, using meta.id):
function (doc, meta) {
  if (meta.type == "json" && doc.type == "mytype") {
    emit([doc.field1, doc.field2], doc.id);
  }
}
Then, after getting this on the client side, we need to fetch each document by the supplied ids, like:
couchbase.getMultiple([key1, key2, ..., keyX]) *
* where each keyX is a doc.id from the view.
In this case we produce a smaller amount of view data, but the operation completes in X+1 requests.
So the first way loads the Couchbase servers and consumes a large amount of disk space for views. The second way consumes less view space, but loads the client and produces more requests to the Couchbase server.
So there are some questions about this:
Which of these ways is more acceptable / frequently used, or does it depend on the values of N, S and V?
If the correct way depends on N, S and V, for which values of those variables (high/medium/low) is one way more acceptable than the other?
Couchbase can be scaled horizontally quite easily. If the client side is harder to scale, is the first way preferred?
Maybe there are some test results that compare these two ways.
Which SDK are you using to access the view?
The good practice is usually to avoid emitting the document ID since it is automatically put in the view index.
The basic rules:
Do not emit the whole doc (or too many values).
Do not emit the document key.
Take a look at:
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html
Then, if your application needs to access the full document, you just need to do the following (Java as an example):
query.setIncludeDocs(true);
In this case the client will automatically do the "get" for you, calling the server and loading the document into the cache (in fact, the SDK does a multi-get).