How to do incremental development with DynamoDB? - amazon-web-services

We're currently investigating a proper key-value store to store feed data. As we're hosting on AWS, their DynamoDB key-value solution looks very tempting. But it seems that a DynamoDB table structure is unmodifiable after creation, making it really cumbersome to do rolling updates. Also, even when the table isn't live, it seems very painful to have to copy all your indexes manually just to add a new index to your table. Maybe I'm missing some kind of automation tool somewhere?

DynamoDB recently announced that online indexing is coming soon, so you will be able to add indexes when you need them rather than having to define them all at table creation time. Here is the relevant section from this source on 08 Oct 2014:
Online Indexing (Available Soon) – We are planning to give you the ability to add and remove indexes for existing DynamoDB tables. This will give you the flexibility to adjust your indexes to match your evolving query patterns. This feature will be available soon, so stay tuned!
Update 2015/1/27: Online Indexing is now available

DynamoDB tables don't have a fixed structure; you only define the primary key and any indexes.
You should design your data model before creating any tables.
Think, then do.
In any case, AWS has a flexible API: you can write your own scripts or use the aws-cli for automation.
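For example, once online indexing is available, adding a global secondary index to an existing table can be scripted with boto3 in a few lines (a sketch only; the table, index, and attribute names below are placeholders, not anything from the question):

```python
# Sketch: add a global secondary index to an existing DynamoDB table.
# Table, index, and attribute names are placeholders.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.update_table(
    TableName="Feeds",
    AttributeDefinitions=[
        {"AttributeName": "authorId", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "authorId-index",
                "KeySchema": [{"AttributeName": "authorId", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }
    ],
)

# The new index builds in the background; poll describe_table until its
# IndexStatus is ACTIVE before querying it.
```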

Related

Simple Search capabilities with NoSql (DynamoDB)

I am new to NoSQL. I am trying to make a simple app which will have products that you can search through. With SQL I would simply have a products table and be able to search any of the columns for substrings with %LIKE% and pull the returned rows. I would like to use DynamoDB, but seemingly there is no way of doing this without introducing AWS OpenSearch (ElasticSearch), which will probably cost more than all my DynamoDB tables. Is there any simple way to do this in DynamoDB without having to scan the whole table and filter with contains?
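For reference, the scan-and-filter approach I'd like to avoid looks roughly like this with boto3 (just a sketch; the table and attribute names are placeholders):

```python
# Sketch of the full-table scan + contains() filter I'd rather not rely on.
# Table and attribute names are placeholders.
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("Products")

# A scan reads every item in the table; the filter is applied after the read,
# so the whole table is consumed even if only a few items match.
response = table.scan(FilterExpression=Attr("name").contains("mug"))
items = response["Items"]

# Keep paging until the scan has covered the whole table.
while "LastEvaluatedKey" in response:
    response = table.scan(
        FilterExpression=Attr("name").contains("mug"),
        ExclusiveStartKey=response["LastEvaluatedKey"],
    )
    items.extend(response["Items"])
```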
No, there is no way to do what you want (search dynamodb) without adding in another layer such as elasticsearch - keep it simple, use a traditional database.
IMO, never assume you need a nosql database - because you rarely do - always assume you need a traditional database until proven otherwise.
OK, so DynamoDB is not what you are looking for; it is designed for a very different use case.
However, ElasticSearch, which is in no way tied to DynamoDB, very much is what you are looking for, and it will greatly simplify what you are trying to do compared to a traditional SQL database. Those who are saying otherwise are providing poor information. A traditional database cannot index a %LIKE% query, whereas this is precisely what ElasticSearch does with every field in your document.
Getting started with ElasticSearch is super easy. Just download the Jar and run it, then start going through examples posting and getting documents from the index. If your experience is anything like mine, you will never really want to use a SQL database again, but as is mentioned they each have their own place, and so I do still use traditional RDBMS but I specialize in ElasticSearch.
I have converted many applications that were unable to reach reasonable performance over to ElasticSearch, where the performance is almost always sub-second, and typically a fraction of that. An RDBMS being asked to do many %LIKE% matches will not be able to give you sub-second results.
There are also a number of tools that will automatically funnel data from your RDBMS db into ElasticSearch so that you can have the benefits of both worlds.
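To give a feel for how little it takes (a sketch only; the index name, fields, and example values are made up), indexing and searching a document with the official Python client looks roughly like this:

```python
# Sketch: index a document and run a full-text search against Elasticsearch.
# Index name, fields, and values are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a product document; every field is analyzed and searchable by default.
es.index(index="products", id="1", document={
    "name": "Blue ceramic coffee mug",
    "description": "Dishwasher-safe 12oz mug",
})
es.indices.refresh(index="products")  # make the new document searchable now

# Full-text match on the analyzed field ("coffee" matches the document above),
# with no need for a %LIKE% substring scan.
result = es.search(index="products", query={"match": {"name": "coffee"}})
for hit in result["hits"]["hits"]:
    print(hit["_source"]["name"])
```

(Depending on your client version the calls may take a body= argument instead of the keyword arguments shown here.)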
NoSQL means a great many things. In general it has been applied to several classes of datastore.
Columnar Datastore - DynamoDB, Hive
Document/Object Database - MongoDB, CouchDB, MarkLogic, and a great many others
Key/Value - Cassandra, MongoDB, Redis, Memcache
Search Index - SOLR, ElasticSearch, MarkLogic
ElasticSearch bridges the gap between Document Database and Search Index, providing the features of both. It also provides the capabilities of a Key/Value data store.
The columnar datastore is much more tuned for doing work across massive amounts of data, generally in aggregate, but query latencies are not the kind of performance you are looking for. These are used for datasets with trillions of rows and hundreds of features/columns.
ElasticSearch, however, provides very rapid search across large numbers of JSON documents, indexing every value in the JSON by default.
The way to do this with DynamoDB is by using ElasticSearch; however, you do not need DynamoDB to do this with ElasticSearch, so you don't need to pay for both.

Selecting the right cloud storage option on GCP

I am an entry-level developer at a startup. I am trying to deploy a text classifier on GCP. For storing the inputs (training data) and outputs, I am struggling to find the right storage option.
My data isn't huge in terms of columns but is fairly huge in terms of instances. It could even be just key-value pairs. My use case is to retrieve each entity from just one particular column of the DB, apply some classification to it, store the result in the corresponding column and update the DB. Our platform requires a DB which can handle a lot of small queries at once without much delay. Also, the data is completely non-relational.
I looked into GCP's article on choosing a storage option but couldn't narrow my options down to any specific answer. Would love to get some advice on this.
You should take a look at Google's "Choosing a Storage Option" guide: https://cloud.google.com/storage-options/
Your data is structured, your main goal is not analytics, your data isn't relational, you don't mostly need mobile SDKs, so you should probably use Cloud Datastore. That's a great choice for durable key-value data.
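To give a rough idea of the programming model (a sketch only; the kind and property names are made up, not taken from your description), storing and updating a key-value entity with the Python client looks like this:

```python
# Sketch: basic key-value reads/writes with Cloud Datastore.
# Kind and property names are placeholders.
from google.cloud import datastore

client = datastore.Client()

# Write: each training instance becomes an entity keyed by its own id.
key = client.key("TrainingInstance", "instance-123")
entity = datastore.Entity(key=key)
entity.update({"text": "some input text", "label": None})
client.put(entity)

# Read a single entity back by key, classify it, and write the result back.
entity = client.get(key)
entity["label"] = "predicted-class"
client.put(entity)
```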

Online update spanner schema is extremely slow

Updating a Spanner schema online takes minutes, even for very small tables (tens of rows),
i.e. adding/dropping/altering columns, adding tables, etc.
This can be quite frustrating for development processes and new version deployments.
Any plans for improvement?
A few more questions:
Does anyone know of a 3rd-party schema comparison tool for Spanner? I couldn't find any.
What about data backups, in order to save historical snapshots?
Thanks in advance
Schema Updates:
Since Cloud Spanner is a distributed database, it has to update all the moving parts of the system, which accounts for the latency described.
As a suggestion, you could batch the schema updates. A batch keeps latency low (it takes roughly as long as executing a single schema update) and can be submitted through the API or the gcloud command-line tool.
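For example, with the Python client several DDL statements can be submitted as a single batch (a sketch only; the instance, database, and statements are placeholders):

```python
# Sketch: batch several schema changes into one update_ddl call so they are
# applied as a single long-running operation. All names are placeholders.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

operation = database.update_ddl([
    "ALTER TABLE Users ADD COLUMN Nickname STRING(64)",
    "CREATE INDEX UsersByNickname ON Users(Nickname)",
])
operation.result()  # block until the whole batch has been applied
```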
Schema Comparison Tool:
You could use the getDatabaseDdl API to maintain a history of your schema changes and use your tool of choice to diff them.
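For example (a sketch with the Python client; the names and output path are placeholders), you could dump the current DDL after every change and diff successive snapshots with whatever tool you prefer:

```python
# Sketch: capture the current schema so successive dumps can be diffed.
# Instance/database names and the output path are placeholders.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")
database.reload()  # refreshes metadata, including the database's DDL

with open("schema-snapshot.sql", "w") as f:
    f.write(";\n\n".join(database.ddl_statements))
```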

Sitecore : Options for storing/querying custom data used by sitecore CD

This (below) has been a common problem/debate on most of my Sitecore projects.
Problem:
A Sitecore web site creates/uses custom data such as polls/quizzes/user journeys/comments, etc.
Solutions:
One option to solve this problem is to create a custom DB table and use Entity Framework for CRUD.
The other option is to make a copy of the master database (as data) and use the Sitecore API for CRUD.
The benefit of the 2nd option would be out-of-the-box API usage, workflow, etc.
Has anyone faced this issue and what's the best way to solve this?
As you know there is no blanket solution for all projects but I believe this option is the best for most projects.
Option 3: Custom DB + Data Provider
Create a custom database as you have said in option 1.
Use a data provider so that the items can be indexed/searched easily (depends on your requirements, see additional benefits below)
Pros:
- CDs do not depend on the custom DB, which is a big advantage over option 1.
- If you need to do any transformation on the items as you publish them you can; the same applies on import (in the case where you are connecting to an external/existing data source that you want to transform).
For more info check out this: http://www.sitecore.net/learn/blogs/technical-blogs/john-west-sitecore-blog/posts/2012/05/when-to-implement-data-providers-in-the-sitecore-aspnet-cms.aspx
We also have this challenge all the time. What I personally learned from experience is that when the requirement is fairly small, you are better off choosing option one.
However, when the requirement is not a small thing, especially in interactive scenarios like the ones you mentioned, you have to ask yourself questions like: "What if later on the client wants it multilingual, or needs some sort of statistics and analytics on the results? Don't I want to take advantage of things like WFFM or Analytics?" In such cases you should think it over and weigh the pros and cons and the possible scaling options for the future (because, practically speaking, Sitecore projects are not for small-scale websites). For example, when collecting a large amount of data you definitely need Lucene/Solr and Item Buckets.
Luckily, in recent versions of Sitecore you have the MongoDB option, which is a good fit for collecting interactive data and things that are not well structured and are prone to changes in structure in the future.
Edit:
There is also an ORM tool called Glass Mapper, similar to EF, if you are interested. While EF works great with SQL Server, Glass Mapper works with the Sitecore data repository in the same way, but it may introduce some performance overhead in your code.

Data Warehouse and Django

This is more of an architectural question than a technological one per se.
I am currently building a business website/social network that needs to store large volumes of data and use that data to draw analytics (consumer behavior).
I am using Django and a PostgreSQL database.
Now my question is: I want to expand this architecture to include a data warehouse. The ideal would be: the operational DB would be the current Django PostgreSQL database, and the data warehouse would be something additional, preferably in a multidimensional model.
We are still in a very early phase; we are going to test with 50 users, so something primitive, such as a one-column table, would be enough for starters.
I would like to know if somebody has experience with this situation and could recommend a framework for creating a data warehouse, all while maintaining the operational DB with the Django models for ease of use (if possible).
Thank you in advance!
Here are some cool Open Source tools I used recently:
Kettle - great ETL tool, you can use this to extract the data from your operational database into your warehouse. Supports any database with a JDBC driver and makes it very easy to build e.g. a star schema.
Saiku - nice Web 2.0 frontend built on Pentaho Mondrian (MDX implementation). This allows your users to easily build complex aggregation queries (think Pivot table in Excel), and the Mondrian layer provides caching etc. to make things go fast. Try the demo here.
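If you would rather start with a plain script before adopting a full ETL tool, the same extract-and-load idea can be sketched in a few lines of Python (a sketch only; the connection strings, tables, and columns are invented for illustration, and psycopg2 stands in for whatever driver you use):

```python
# Sketch: minimal extract-transform-load from the operational Postgres DB
# into a warehouse fact table. Connection details, tables, and columns are
# placeholders.
import psycopg2

source = psycopg2.connect("dbname=app_db user=app")
warehouse = psycopg2.connect("dbname=warehouse user=etl")

with source.cursor() as src, warehouse.cursor() as dst:
    # Extract: pull raw events from the operational schema.
    src.execute("SELECT user_id, action, created_at FROM events")
    for user_id, action, created_at in src:
        # Transform + load: derive the date key and insert into the fact table.
        dst.execute(
            "INSERT INTO fact_user_actions (user_id, action, date_key) "
            "VALUES (%s, %s, %s)",
            (user_id, action, created_at.date()),
        )

warehouse.commit()
```

A GUI tool like Kettle gives you the same flow with scheduling, logging, and dimension lookups handled for you.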
My answer does not necessarily apply to data warehousing. In your case I see the possibility of implementing a NoSQL database solution alongside the OLTP relational storage, which in this case is PostgreSQL.
Why consider NoSQL? In addition to the obvious scalability benefits, NoSQL offers a number of advantages that will probably apply to your scenario. For instance, the flexibility of having records with different sets of fields, and key-based access.
Since you're still in the "trial" stage you might find it easier to decide on a NoSQL database solution depending on your hosting provider. For instance, AWS has SimpleDB, Google App Engine provides its own Datastore, etc. However, there are plenty of other NoSQL solutions you can go for that have nice Python bindings.