Assigning unique IDs to strings using MapReduce

I want to run a MapReduce job that scans multiple columns from a given file and assigns a unique ID (index number) to each distinct value in each column. The main challenge is sharing the same ID for the same value when it is encountered on different nodes or in different Reducer instances.
Currently I am using ZooKeeper to share the unique IDs, but that has a performance impact. I even keep the information in local caches at the reducer level to avoid multiple trips to ZooKeeper for the same value. Is there a better mechanism for doing this?

I can suggest two possible solutions for your problem:
Create the unique ID from the value itself, for example with a hash function that has a low collision rate.
Use faster storage than ZooKeeper. You could try a simple key-value store such as Redis to hold the value-to-ID mapping.
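A minimal sketch of the first suggestion, assuming a truncated SHA-256 digest is an acceptable ID. The function name `value_id` and the 64-bit width are illustrative choices, not something from the question. Because the ID depends only on the column and the value, every reducer computes the same ID without any coordination.

```python
import hashlib


def value_id(column: str, value: str, bits: int = 64) -> int:
    """Derive a deterministic integer ID from a column name and a value."""
    # Hashing the column together with the value keeps IDs distinct per column.
    digest = hashlib.sha256(f"{column}\x00{value}".encode("utf-8")).digest()
    # Keep only the top `bits` bits; fewer bits means a higher collision chance.
    return int.from_bytes(digest, "big") >> (256 - bits)


# The same input gives the same ID on any node.
print(value_id("city", "Paris") == value_id("city", "Paris"))
```

The trade-off is that the IDs are not dense sequential index numbers, and collisions are unlikely rather than impossible.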


What's the cheapest way to store an auto increment indexed list of values in AWS?

I have a DynamoDB-based web application that uses DynamoDB to store my large JSON objects and perform simple CRUD operations on them via a web API. I would like to add a new table that acts like a categorization of these values. The user should be able to select from a selection box which category the object belongs to. If a desirable category does not exist, the user should be able to create a new category specifying a name which will be available to other objects in the future.
It is critical to the application that every one of these categories be given an integer ID that increments, with the first starting at 1. These auto-generated numbers will turn into reproducible serial numbers for back-end reports that will not use the user-visible text name.
So I would like to have a simple API available from the web front end that allows me to:
A) GET /category : produces { int : string, ... } of all categories mapped to an ID
B) PUSH /category : accepts string and stores the string to the next integer
Here are some ideas for how to handle this kind of project.
Store it in DynamoDB with integer indexes. This has some benefits but leaves a lot to be desired. Firstly, there's no auto-incrementing ID in DynamoDB, but I could get the state of the table, create a new ID, and store the result. This might have issues with consistency and race conditions, but there's probably a way to achieve it safely. It might, however, be a big anti-pattern to use DynamoDB this way.
Store it in DynamoDB as one object in a table with some random index. Just store the mapping as a JSON object. This really forgets the notion of tables in DynamoDB and uses it as a simple file. It might also run into some issues with race conditions.
Use AWS ElastiCache to get a Redis key-value store. This might be "the right" decision, but the downside is that ElastiCache is an always-on DB offering where you pay per hour. For a low-traffic web site like mine I'd be paying a minimum of $12/mo, I think, and I would really like this to be pay per access/update given the low volume. I'm not sure Redis has a built-in auto-increment feature of the kind I'd need, but it's pretty trivial to make a transaction that gets the length of the table, adds one, and stores a new value. Race conditions are easily avoided with this solution.
Use a SQL database like AWS Aurora or MySQL. This has the same upsides as Redis, but it's more overkill than Redis, costs a lot more, and is still always on.
Run my own in memory web service or MongoDB etc... still you're paying for constant containers running. Writing my own thing is obviously silly but I'm sure there are services that match this issue perfectly but they'd all require a constant container to run.
Is there a good way to store a simple list or integer mapping like this without a constant monthly cost? Is there a better way to do this with DynamoDB?
Store the maxCounterValue as an item in DynamoDB.
For the PUSH /category, perform the following:
1. Get the current maxCounterValue.
2. TransactWrite:
   Put the category name and ID into a new item with id = maxCounterValue + 1.
   Update maxCounterValue to maxCounterValue + 1, with a ConditionExpression that checks maxCounterValue = :valueFromGetOperation.
3. If the TransactWrite fails, start again at step 1, retrying up to X more times.
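The steps above can be sketched roughly as follows, using the request shapes of the low-level boto3 DynamoDB client (`get_item`, `transact_write_items`). The table name, the key attribute `pk`, the item layout, and the function names are all hypothetical, and the counter item is assumed to already exist with a numeric `maxCounterValue`.

```python
TABLE = "Categories"                      # hypothetical table name
COUNTER_KEY = {"pk": {"S": "counter"}}    # hypothetical key of the counter item


def build_transaction(category_name, current_max):
    """Build TransactItems that store the category and bump the counter together."""
    new_id = current_max + 1
    return [
        {"Put": {
            "TableName": TABLE,
            "Item": {"pk": {"S": f"category#{new_id}"},
                     "categoryId": {"N": str(new_id)},
                     "categoryName": {"S": category_name}},
        }},
        {"Update": {
            "TableName": TABLE,
            "Key": COUNTER_KEY,
            "UpdateExpression": "SET maxCounterValue = :next",
            # Fails if another writer changed the counter since we read it.
            "ConditionExpression": "maxCounterValue = :valueFromGetOperation",
            "ExpressionAttributeValues": {
                ":next": {"N": str(new_id)},
                ":valueFromGetOperation": {"N": str(current_max)},
            },
        }},
    ]


def add_category(client, category_name, max_attempts=5):
    """Read the counter, try the transaction, and retry from the read on conflict."""
    for _ in range(max_attempts):
        item = client.get_item(TableName=TABLE, Key=COUNTER_KEY,
                               ConsistentRead=True)["Item"]
        current_max = int(item["maxCounterValue"]["N"])
        try:
            client.transact_write_items(
                TransactItems=build_transaction(category_name, current_max))
        except client.exceptions.TransactionCanceledException:
            continue  # lost the race; re-read the counter and try again
        return current_max + 1
    raise RuntimeError("could not allocate a category ID")
```

With a real client this would be called as `add_category(boto3.client("dynamodb"), "Books")`. The retry limit of 5 stands in for the "X" in the answer.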

AWS Elasticsearch exceeded limit of total fields in index

I'm running Elasticsearch on AWS, and haven't quite understood how to properly address this issue.
Right now I have the items stored in DynamoDB and use DynamoDB Streams to send the items to a Lambda that then uses dynamodb-stream-elasticsearch to send them to Elasticsearch when they are created/updated.
Some properties can be objects with many nested properties, which can themselves be objects; I first started getting this error when these new fields were added. Due to the nature of these items, these new properties will need to be searchable in the future.
Initially the default index limit had not been changed. After my first search on how to fix this I increased the limit to 5000, and I have now had to increase it to 12000. The instance type is t2.small.elasticsearch. The AWS Elasticsearch console is already reporting the instance health as yellow after I increased the index limit.
Which is the best way to tackle this sort of situation?
Does moving to a larger instance type fix it, or is it a matter of breaking up the item and having multiple separate indexes? If the solution is the latter, is there a good tutorial/guide on how to do this with this setup (AWS DynamoDB/Elasticsearch)?
By default, the maximum number of fields in an index is 1000, but you can increase that by changing the index.mapping.total_fields.limit index setting.
See other settings to prevent mappings explosion: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/mapping.html#mapping-limit-settings
Which is the best way to tackle this sort of situation?
Using the flattened datatype could be a solution. From the Elasticsearch documentation on the nested datatype:
The nested type is a specialised version of the object datatype that
allows arrays of objects to be indexed in a way that they can be
queried independently of each other.
When ingesting key-value pairs with a large, arbitrary set of keys,
you might consider modeling each key-value pair as its own nested
document with key and value fields. Instead, consider using the
flattened datatype, which maps an entire object as a single field and
allows for simple searches over its contents. Nested documents and
queries are typically expensive, so using the flattened datatype for
this use case is a better option.
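As a rough illustration of the two options mentioned above (raising index.mapping.total_fields.limit versus mapping a deeply nested property as flattened), the sketch below builds a create-index request body as a plain Python dict. The field name `attributes` and the function name are hypothetical. It assumes an Elasticsearch version that supports the flattened type; whether a particular AWS domain offers it would need to be checked.

```python
import json


def build_index_body(flattened_fields, total_fields_limit=1000):
    """Build a create-index body that maps the given properties as flattened."""
    return {
        "settings": {"index.mapping.total_fields.limit": total_fields_limit},
        "mappings": {
            # Each flattened property counts as one field, however many
            # nested keys the ingested objects contain.
            "properties": {name: {"type": "flattened"} for name in flattened_fields}
        },
    }


# "attributes" is a hypothetical name for the property holding arbitrary nested keys.
print(json.dumps(build_index_body(["attributes"]), indent=2))
```

The resulting JSON would be sent when creating the index. Keeping the limit at its default while flattening the open-ended property is the point of this design.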

AWS Dynamodb scan ordering?

We have a setup where various worker nodes perform computations and update their relative states in a DynamoDB table. The table acts as a kind of history of activity of the worker nodes. A watchdog node needs to periodically scan through the table, and build an object representing the current state of the worker nodes and their jobs. As such, it's important for our application to be able to scan the table and retrieve data in chronological order (i.e. sorted by timestamp). The table will eventually be too large to scan into local memory for later ordering, so we cannot sort it after scanning.
Reading from the AWS documentation about the primary key:
DynamoDB uses the partition key value as input to an internal hash
function. The output from the hash function determines the partition
(physical storage internal to DynamoDB) in which the item will be
stored. All items with the same partition key are stored together, in
sorted order by sort key value.
Documentation on the scan function doesn't seem to mention anything about the order of the returned results. But can the last sentence of the quote above be interpreted to mean that the results of scans are ordered by the sort key? If I set all partition keys to the same value, say "0", and use my timestamp as the sort key, am I guaranteed that the scan operation will return data in chronological order?
Some notes:
All code is written in Python, and thus I'm using the boto3 module to perform scan operations.
Our system architect is steadfast against the idea of updating any entries in the table to reflect their current state, or deleting items when the job is complete. We can only ever add to the table, and thus we need to scan through the whole thing each time to determine the worker states.
I am using strong read consistency for scan operations.
Technically, Scan never guarantees order (although, as an observation, the lack of an ordering guarantee seems to mean that the partitions come back in arbitrary order, while the items within a partition remain sorted by sort key).
What you've proposed will work, but instead of scanning you'll be doing a Query on partition key == 0, which returns all the items with partition key 0 (up to a limit, and optionally forward or backward), sorted by the sort key.
That said, this is really not the way DynamoDB wants you to use it. For example, it guarantees your partition will run hot (because you've explicitly put everything on the same partition), and the operation will cost you the capacity of reading every item in the table.
I would recommend investigating patterns such as using a DynamoDB stream processed by a Lambda to build and maintain a materialised view of this "current state", rather than polling the table with this expensive scan and the poor key design it requires.
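For reference, the Query described above might look like the sketch below, written against the low-level boto3 client's `query` call. The key attribute name `pk` and the function name are assumptions. Results arrive in pages, so the generator follows `LastEvaluatedKey` and the whole table never has to sit in memory at once.

```python
def iter_partition_in_order(client, table_name, partition_value="0"):
    """Yield every item of one partition, ordered by sort key, page by page."""
    params = {
        "TableName": table_name,
        "KeyConditionExpression": "pk = :pk",   # "pk" is a hypothetical key name
        "ExpressionAttributeValues": {":pk": {"S": partition_value}},
        "ScanIndexForward": True,               # ascending sort key, oldest first
        "ConsistentRead": True,
    }
    while True:
        page = client.query(**params)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return                              # no more pages
        params["ExclusiveStartKey"] = last_key  # resume where the last page ended
```

This still reads every item, so the capacity cost noted above applies unchanged.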
You’re better off using yyyy-mm-dd as the partition key, rather than all 0. There’s a limit of 10 GB of data per partition, which also means you can’t have more than 10 GB of data per partition key value.
If you want to be able to retrieve data sorted by date, take the ISO 8601 timestamp format (roughly yyyy-mm-ddThh:mm:ss.sss), split it somewhere reasonable for your data, and use the first part as the partition key and the second part as the sort key. (Another advantage of this approach is that you can use eventually consistent reads for most of the queries, since it's pretty safe to assume that after a day (or an hour or something) the data is completely replicated.)
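A small sketch of that split, assuming the cut is made at the "T" so the date becomes the partition key and the time of day becomes the sort key. The function name is illustrative, and timestamps are normalised to UTC so that keys compare consistently.

```python
from datetime import datetime, timezone
from typing import Tuple


def split_timestamp(ts: datetime) -> Tuple[str, str]:
    """Split a timestamp into (partition key, sort key) at the ISO 8601 'T'."""
    # Convert to UTC and drop the offset so every key has the same shape.
    utc = ts.astimezone(timezone.utc).replace(tzinfo=None)
    date_part, time_part = utc.isoformat(timespec="milliseconds").split("T")
    return date_part, time_part


stamp = datetime(2024, 5, 1, 12, 30, 45, 123000, tzinfo=timezone.utc)
print(split_timestamp(stamp))  # ('2024-05-01', '12:30:45.123')
```

Because both parts are fixed-width strings, lexicographic order of the sort key matches chronological order within a day.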
If you can manage it, it would be even better to use Worker ID or Job ID as a partition key, and then you could use the full time stamp as the sort key.
As @thomasmichaelwallace mentioned, it would be best to use DynamoDB streams with Lambda to create a materialized view.
Now, that being said, if you’re dealing with jobs being run on workers, then you should also consider whether you can achieve your goal by using a workflow service rather than a database. Workflows will maintain a job history and/or current state for you. AWS offers Step Functions and Simple Workflow.

Indexing notifications table in DynamoDB

I am going to implement a notification system, and I am trying to figure out a good way to store notifications within a database. I have a web application that uses a PostgreSQL database, but a relational database does not seem ideal for this use case; I want to support various types of notifications, each including different data, though a subset of the data is common for all types of notifications. Therefore I was thinking that a NoSQL database is probably better than trying to normalize a schema in a relational database, as this would be quite tricky.
My application is hosted in Amazon Web Services (AWS), and I have been looking a bit at DynamoDB for storing the notifications. This is because it is managed, so I do not have to deal with the operations of it. Ideally, I'd like to have used MongoDB, but I'd really prefer not having to deal with the operations of the database myself. I have been trying to come up with a way to do what I want in DynamoDB, but I have been struggling, and therefore I have a few questions.
Suppose that I want to store the following data for each notification:
An ID
User ID of the receiver of the notification
Notification type
Timestamp
Whether or not it has been read/seen
Meta data about the notification/event (no querying necessary for this)
Now, I would like to be able to query for the most recent X notifications for a given user. Also, in another query, I'd like to fetch the number of unread notifications for a particular user. I am trying to figure out a way that I can index my table to be able to do this efficiently.
I can rule out simply having a hash primary key, as I would not be doing lookups by simply a hash key. I don't know if a "hash and range primary key" would help me here, as I don't know which attribute to put as the range key. Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key? Then perhaps a secondary index could help me to sort by the timestamp, if this is even possible.
I also looked at global secondary indexes, but the problem with these is that when querying the index, DynamoDB can only return attributes that are projected into the index. Since I would want all attributes to be returned, I would effectively have to duplicate all of my data, which seems rather ridiculous.
How can I index my notifications table to support my use case? Is it even possible, or do you have any other recommendations?
Motivation Note: When using a Cloud Storage like DynamoDB we have to be aware of the Storage Model because that will directly impact
your performance, scalability, and financial costs. It is different
than working with a local database because you pay not only for the
data that you store but also the operations that you perform against
the data. Deleting a record is a WRITE operation for example, so if
you don't have an efficient plan for clean up (and your case being
Time Series Data specially needs one), you will pay the price. Your
Data Model will not show problems when dealing with small data volume
but can definitely ruin your plans when you need to scale. That being
said, decisions like creating (or not) an index, defining proper
attributes for your keys, creating table segmentation, etc. will
make the entire difference down the road. Choosing DynamoDB (or more
generically speaking, a Key-Value store) as any other architectural
decision comes with a trade-off, you need to clearly understand
certain concepts about the Storage Model to be able to use the tool
efficiently, choosing the right keys is indeed important but only the
tip of the iceberg. For example, if you overlook the fact that you are
dealing with Time Series Data, no matter what primary keys or index
you define, your provisioned throughput will not be optimized because
it is spread throughout your entire table (and its partitions) and NOT
ONLY THE DATA THAT IS FREQUENTLY ACCESSED, meaning that unused data is
directly impacting your throughput just because it is part of the same
table. This leads to cases where the
ProvisionedThroughputExceededException is thrown "unexpectedly" when
you know for sure that your provisioned throughput should be enough for your
demand, however, the TABLE PARTITION that is being unevenly accessed
has reached its limits (more details here).
The post below has more details, but I wanted to give you some motivation to read through it and understand that although you can certainly find an easier solution for now, it might mean starting from scratch in the near future when you hit a wall (the "wall" might come as high financial costs, limitations on performance and scalability, or a combination of all of these).
Q: Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key?
A: DynamoDB is a Key-Value storage meaning that the most efficient queries use the entire Key (Hash or Hash-Range). Using the Scan operation to actually perform a query just because you don't have your Key is definitely a sign of deficiency in your Data Model in regards to your requirements. There are a few things to consider and many options to avoid this problem (more details below).
Before moving on, I suggest reading this quick post to clearly understand the difference between a Hash Key and a Hash+Range Key:
DynamoDB: When to use what PK type?
Your case is a typical Time Series Data scenario where your records become obsolete as the time goes by. There are two main factors you need to be careful about:
Make sure your tables have even access patterns
If you put all your notifications in a single table and the most recent ones are accessed more frequently, your provisioned throughput will not be used efficiently.
You should group the most accessed items in a single table so the provisioned throughput can be properly adjusted for the required access. Additionally, make sure you properly define a Hash Key that will allow even distribution of your data across multiple partitions.
Make sure obsolete data is deleted in the most efficient way (effort-, performance-, and cost-wise)
The documentation suggests segmenting the data in different tables so you can delete or backup the entire table once the records become obsolete (see more details below).
Here is the section from the documentation that explains best practices related to Time Series Data:
Understand Access Patterns for Time Series Data
For each table that you create, you specify the throughput
requirements. DynamoDB allocates and reserves resources to handle your
throughput requirements with sustained low latency. When you design
your application and tables, you should consider your application's
access pattern to make the most efficient use of your table's
resources.
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with hash and
range type primary key with Customer ID as the hash attribute and
date/time as the range attribute. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
For example, you could have your tables segmented by month:
Notifications_April, Notifications_May, etc
Q: I would like to be able to query for the most recent X notifications for a given user.
A: I would suggest using the Query operation with only the Hash Key (UserId), and using the Range Key to sort the notifications by Timestamp (date and time).
Hash Key: UserId
Range Key: Timestamp
Note: A better solution would be for the Hash Key to contain not only the UserId but also some additional concatenated information that you can calculate before querying, so that your Hash Key gives you even access patterns to your data. For example, you can start to have hot partitions if notifications from specific users are accessed more than others; having additional information in the Hash Key would mitigate this risk.
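A sketch of the "most recent X" query under the key design above (UserId as Hash Key, Timestamp as Range Key), combined with the monthly table segmentation. It uses the low-level boto3 client's `query` call; the function name is hypothetical, and the caller supplies the list of tables from newest to oldest. Pagination within a single table is left out for brevity, so a page cut short by the response size limit would return fewer items than requested.

```python
def fetch_recent_notifications(client, tables_newest_first, user_id, limit):
    """Collect up to `limit` newest notifications, walking tables newest first."""
    results = []
    for table_name in tables_newest_first:
        remaining = limit - len(results)
        if remaining <= 0:
            break
        page = client.query(
            TableName=table_name,
            KeyConditionExpression="UserId = :uid",
            ExpressionAttributeValues={":uid": {"S": user_id}},
            ScanIndexForward=False,   # descending Range Key, newest first
            Limit=remaining,          # stop reading once enough items are found
        )
        results.extend(page.get("Items", []))
    return results
```

For example, `fetch_recent_notifications(client, ["Notifications_May", "Notifications_April"], "user-1", 20)` would only touch the April table if May holds fewer than 20 notifications for that user.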
Q: I'd like to fetch the number of unread notifications for a particular user.
A: Create a Global Secondary Index as a Sparse Index having the UserId as the Hash Key and Unread as the Range Key.
Example:
Index Name: Notifications_April_Unread
Hash Key: UserId
Range Key: Unread
When you query this index by Hash Key (UserId) you would automatically have all unread notifications with no unnecessary scans through notifications which are not relevant to this case. Keep in mind that the original Primary Key from the table is automatically projected into the index, so in case you need to get more information about the notification you can always resort to those attributes to perform a GetItem or BatchGetItem on the original table.
Note: You can explore the idea of using different attributes other than the 'Unread' flag, the important thing is to keep in mind that a Sparse Index can help you on this Use Case (more details below).
Detailed Explanation:
I would have a sparse index to make sure that you can query a reduced dataset to do the count. In your case you can have an attribute "unread" to flag if the notification was read or not, and use that attribute to create the Sparse Index. When the user reads the notification you simply remove that attribute from the notification so it doesn't show up in the index anymore. Here are some guidelines from the documentation that clearly apply to your scenario:
Take Advantage of Sparse Indexes
For any item in a table, DynamoDB will only write a corresponding
index entry if the index range key
attribute value is present in the item. If the range key attribute
does not appear in every table item, the index is said to be sparse.
[...]
To track open orders, you can create an index on CustomerId (hash) and
IsOpen (range). Only those orders in the table with IsOpen defined
will appear in the index. Your application can then quickly and
efficiently find the orders that are still open by querying the index.
If you had thousands of orders, for example, but only a small number
that are open, the application can query the index and return the
OrderId of each open order. Your application will perform
significantly fewer reads than it would take to scan the entire
CustomerOrders table. [...]
Instead of writing an arbitrary value into the IsOpen attribute, you
can use a different attribute that will result in a useful sort order
in the index. To do this, you can create an OrderOpenDate attribute
and set it to the date on which the order was placed (and still delete
the attribute once the order is fulfilled), and create the OpenOrders
index with the schema CustomerId (hash) and OrderOpenDate (range).
This way when you query your index, the items will be returned in a
more useful sort order.[...]
Such a query can be very efficient, because the number of items in the
index will be significantly fewer than the number of items in the
table. In addition, the fewer table attributes you project into the
index, the fewer read capacity units you will consume from the index.
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForGSI.html#GuidelinesForGSI.SparseIndexes
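To make the sparse-index idea concrete, here is a sketch of the two requests involved, built as plain dicts in the shape the low-level boto3 client expects for `update_item` and `query`. The function names are hypothetical; the table key (UserId, Timestamp) and the `Unread` attribute follow the design described above. Attribute names are passed through placeholders so that reserved words cannot cause trouble.

```python
def mark_read_request(table_name, user_id, timestamp):
    """Build an update_item request that drops the item from the sparse index."""
    return {
        "TableName": table_name,
        "Key": {"UserId": {"S": user_id}, "Timestamp": {"N": str(timestamp)}},
        # Removing the attribute removes the item's entry from the index.
        "UpdateExpression": "REMOVE #unread",
        "ExpressionAttributeNames": {"#unread": "Unread"},
    }


def unread_count_request(table_name, index_name, user_id):
    """Build a query request that counts a user's entries in the sparse index."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "#uid = :uid",
        "ExpressionAttributeNames": {"#uid": "UserId"},
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        "Select": "COUNT",   # return only the number of matches, not the items
    }
```

These would be passed as `client.update_item(**mark_read_request(...))` and `client.query(**unread_count_request(...))`. If the result is paginated, the `Count` values of all pages have to be added up.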
Find below some references to the operations that you will need to programmatically create and delete tables:
Create Table
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_CreateTable.html
Delete Table
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DeleteTable.html
I'm an active user of DynamoDB and here is what I would do... Firstly, I'm assuming that you need to access notifications individually (e.g. to mark them as read/seen), in addition to getting the latest notifications by user_id.
Table design:
NotificationsTable
id - Hash key
user_id
timestamp
...
UserNotificationsIndex (Global Secondary Index)
user_id - Hash key
timestamp - Range key
id
When you query the UserNotificationsIndex, you set the user_id of the user whose notifications you want and ScanIndexForward to false, and DynamoDB will return the notification ids for that user in reverse chronological order. You can optionally set a limit on how many results you want returned, or get a max of 1 MB.
With regards to projecting attributes, you'll either have to project the attributes you need into the index, or you can simply project the id and then write "hydrate" functionality in your code that does a look up on each ID and returns the specific fields that you need.
If you really don't like that, here is an alternate solution for you... Set your id as your timestamp. For example, I would use the # of milliseconds since a custom epoch (e.g. Jan 1, 2015). Here is an alternate table design:
NotificationsTable
user_id - Hash key
id/timestamp - Range key
Now you can query the NotificationsTable directly, setting the user_id appropriately and setting ScanIndexForward to false on the sort of the Range key. Of course, this assumes that you won't have a collision where a user gets 2 notifications in the same millisecond. This should be unlikely, but I don't know the scale of your system.
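The timestamp-as-ID idea can be sketched as below, assuming the custom epoch of Jan 1, 2015 from the example and UTC timestamps. The function name is illustrative. As noted, two notifications for the same user in the same millisecond would collide.

```python
from datetime import datetime, timedelta, timezone

CUSTOM_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)


def notification_id(now=None):
    """Return whole milliseconds elapsed since the custom epoch."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Integer division of two timedeltas yields an int.
    return (now - CUSTOM_EPOCH) // timedelta(milliseconds=1)


print(notification_id(datetime(2015, 1, 1, 0, 0, 1, tzinfo=timezone.utc)))  # 1000
```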

Data stores for aggregations of large number of objects identified by attributes

I have somewhat of an interesting problem, and I'm looking for data store solutions for efficient querying.
I have a large (1M+) number of business objects, and each object has a large number of attributes (on the order of 100). The attributes are relatively unstructured: the system has thousands of possible attributes, their number grows over time, and each object has an arbitrary (i.e. sparse) subset of them.
I frequently have to perform the following operation: find all objects with some concrete set of attributes S and perform an aggregation on them. I never know S ahead of time, and so on every request I have to perform an expensive sweep of the database which doesn't scale.
What are some data store solutions for this kind of problem? One possible solution would be to have a data store that parallelizes the aggregations -- maybe Cassandra with Hive/Pig on top?
Thoughts?
At this point, Cassandra + Spark is a likely candidate.
In a pure Cassandra world, you could (in theory) create a manual mapping of all possible S attributes to data objects, and then load those via the app and process them. The name of the S attribute would be the partition key, the value of the S attribute a clustering key, and the data object ID itself another clustering key. That way you can quickly iterate over all objects that have a given S attribute set.
It's not incredibly sexy, but could be made to work.
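The manual mapping described above is essentially an inverted index. The sketch below imitates its shape in memory with plain Python, purely as an illustration and not as Cassandra code: the attribute name plays the role of the partition key, and each (value, object ID) pair plays the role of the clustering columns. The class and method names are made up for this example.

```python
from collections import defaultdict


class AttributeIndex:
    """In-memory stand-in for an attribute -> (value, object ID) mapping table."""

    def __init__(self):
        # attribute name -> set of (value, object_id) pairs
        self._rows = defaultdict(set)

    def add(self, object_id, attributes):
        """Record one row per attribute that the object carries."""
        for name, value in attributes.items():
            self._rows[name].add((value, object_id))

    def objects_with(self, names):
        """Return IDs of objects that have every attribute in `names`."""
        id_sets = [{oid for _, oid in self._rows.get(name, ())} for name in names]
        return set.intersection(*id_sets) if id_sets else set()


index = AttributeIndex()
index.add("obj-1", {"color": "red", "weight": 3})
index.add("obj-2", {"color": "blue"})
print(index.objects_with(["color", "weight"]))  # {'obj-1'}
```

Finding the objects for a set S then means reading one partition per attribute in S and intersecting the IDs, instead of sweeping every object.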