I want to write a map-side join and want to include a reducer code as well. I have a smaller data set which I will send as distributed cache.
Can I write the map-side join with reducer code?
Yes, why not! The reducer is meant for aggregating the key-value pairs emitted from the map, so you can always have a reducer in your code whenever you want to aggregate your result (say you want to count, find an average, or do any other numerical summarization) based on criteria you've set in your code or in accordance with the problem statement. The map is just for filtering the data and emitting some useful key-value pairs out of a LOT of data. A map-side join is only needed when one of the datasets is small enough to fit in the memory of a commodity machine. By the way, a reduce-side join would serve your purpose too!
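For concreteness, here is a minimal sketch of that idea using Hadoop Streaming in Python (the same pattern applies to a Java job reading the small file from the DistributedCache). The mapper loads the small dataset, shipped to every task with -files, into memory and performs the join; the reducer aggregates the joined records. File and field layouts are made up for the example.

```python
#!/usr/bin/env python
# mapper.py - map-side join sketch for Hadoop Streaming.
# The small dataset (lookup.txt) is shipped to every mapper via -files,
# so it can be loaded fully into memory here.
import sys

lookup = {}
with open('lookup.txt') as f:              # hypothetical cached file: key <TAB> name
    for line in f:
        key, name = line.rstrip('\n').split('\t', 1)
        lookup[key] = name

for line in sys.stdin:                     # the large dataset streams through the mapper
    key, rest = line.rstrip('\n').split('\t', 1)
    if key in lookup:                      # the join itself happens map-side
        print('%s\t1' % lookup[key])       # emit the joined key with a count of 1
```

```python
#!/usr/bin/env python
# reducer.py - aggregates (here: counts) the records emitted by the mappers.
# Hadoop Streaming delivers the mapper output grouped and sorted by key.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current_key and current_key is not None:
        print('%s\t%d' % (current_key, count))
        count = 0
    current_key = key
    count += int(value)
if current_key is not None:
    print('%s\t%d' % (current_key, count))
```

A typical submission would look something like: hadoop jar hadoop-streaming.jar -files lookup.txt,mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <big dataset> -output <output dir> (all paths here are placeholders).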
With DynamoDB, there is simply no straightforward way to perform an indexed range query over a column. The primary key, local secondary indexes, and global secondary indexes all require a partition key for a range query.
For example, suppose I have a high-scores table with a numerical score attribute. There is no way to get the top 10 scores, or scores 25 through 50, with an indexed range query.
So, what is the idiomatic or preferred way to perform this incredibly common task?
1. Settle for a table scan.
2. Use a static partition key and take advantage of partition queries.
3. Use a fixed number of static partition keys and use multiple partition queries.
It's either 2) or 3) but it depends on the amount and structure of data as well as the read/write activity.
There's no generic answer here as it's use-case specific.
As long as you can get away with it, you probably want to use option 2, as it only requires a single Query API call. If you have lots of data or heavy read/write activity, you'd use some bucketing strategy (very close to your third option) to write to multiple partitions, then run multiple queries and aggregate the results.
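As a rough illustration of the bucketed variant (with a single bucket it degenerates into option 2), assuming a hypothetical HighScores table with a GSI named ScoresByBucket whose partition key is a small bucket number and whose sort key is the score:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('HighScores')    # hypothetical table name

def top_n_scores(n=10, buckets=4):
    candidates = []
    for bucket in range(buckets):
        resp = table.query(
            IndexName='ScoresByBucket',                    # hypothetical GSI: bucket (hash), score (range)
            KeyConditionExpression=Key('bucket').eq(bucket),
            ScanIndexForward=False,                        # highest scores first
            Limit=n,                                       # n per bucket is enough for a global top n
        )
        candidates.extend(resp['Items'])
    # merge the per-bucket results client-side
    candidates.sort(key=lambda item: item['score'], reverse=True)
    return candidates[:n]
```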
DDB isn't suited for analytics. As Maurice said, you can get what you need via a secondary index, but there are also other options to consider:
- If you are providing this top N to your customers consistently/frequently and N is fixed, then you can have dedicated item(s) that hold this information, and you would update that item (or those items) whenever you write an item to the table. You can have one item for the whole top N, or you can apply some bucketing strategy (see the sketch after this list).
- If your system needs this information infrequently (on some singular occasions), then a scan might also be fine.
- If this is for analytics/research, consider exporting the table to S3 and using Athena.
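A very rough sketch of the first option, keeping a single dedicated leaderboard item up to date as scores are written. The HighScores table and its player_id key are hypothetical, and concurrency is ignored here; a real version would use a conditional write or a transaction instead of this read-modify-write.

```python
import boto3

table = boto3.resource('dynamodb').Table('HighScores')    # hypothetical table, partition key: player_id

def record_score(player_id, score, n=10):
    # write the raw score item as usual
    table.put_item(Item={'player_id': player_id, 'score': score})

    # read-modify-write the single item that holds the current top N
    current = table.get_item(Key={'player_id': 'LEADERBOARD'}).get('Item', {})
    top = list(current.get('top_scores', []))
    top.append({'player_id': player_id, 'score': score})
    top.sort(key=lambda entry: entry['score'], reverse=True)
    table.put_item(Item={'player_id': 'LEADERBOARD', 'top_scores': top[:n]})
```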
We have a setup where various worker nodes perform computations and update their respective states in a DynamoDB table. The table acts as a kind of history of activity of the worker nodes. A watchdog node needs to periodically scan through the table and build an object representing the current state of the worker nodes and their jobs. As such, it's important for our application to be able to scan the table and retrieve data in chronological order (i.e. sorted by timestamp). The table will eventually be too large to scan into local memory for later ordering, so we cannot sort it after scanning.
Reading from the AWS documentation about the primary key:
"DynamoDB uses the partition key value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored. All items with the same partition key are stored together, in sorted order by sort key value."
Documentation on the scan function doesn't seem to mention anything about the order of the returned results. But can the last sentence of the quote above be interpreted to mean that the results of a scan are ordered by the sort key? If I set all partition keys to the same value, say "0", and use my timestamp as the sort key, can I be guaranteed that the scan operation will return data in chronological order?
Some notes:
All code is written in Python, and thus I'm using the boto3 module to perform scan operations.
Our system architect is steadfast against the idea of updating any entries in the table to reflect their current state, or deleting items when the job is complete. We can only ever add to the table, and thus we need to scan through the whole thing each time to determine the worker states.
I am using strong read consistency for scan operations.
Technically, Scan never guarantees order (although, as an observation, the lack of an order guarantee seems to mean that the partitions come back in a random order, while within each partition the items remain, well, sorted by sort key).
What you've proposed will work, however: instead of scanning, you'll be doing a Query on partition-key == "0", which will return all the items with that partition key (up to the limit, and optionally sorted forwards or backwards) ordered by the sort key.
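In boto3 terms, that single-partition query looks roughly like this (table and attribute names are assumed; pagination is included because a query returns at most 1 MB per call):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('WorkerHistory')   # hypothetical table name

items, start_key = [], None
while True:
    kwargs = {
        'KeyConditionExpression': Key('pk').eq('0'),         # assumed partition key attribute
        'ScanIndexForward': True,                            # ascending by the timestamp sort key
        'ConsistentRead': True,
    }
    if start_key:
        kwargs['ExclusiveStartKey'] = start_key
    resp = table.query(**kwargs)
    items.extend(resp['Items'])
    start_key = resp.get('LastEvaluatedKey')
    if not start_key:
        break
# items are now in chronological (sort key) order
```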
That said, this is really not the way that dynamo wants you to use it. For example, it guarantees your partition will run hot (because you've explicitly put everything on the same partition), and this operation will cost you the capacity of reading every item on the table.
I would recommend investigating patterns such as using a dynamodb stream processed by a lambda to build and maintain a materialised view of this "current state", rather than "polling" the table with this expensive scan and resulting poor key design.
You’re better off using yyyy-mm-dd as the partition key, rather than all 0. There’s a limit of 10 GB of data per partition, which also means you can’t have more than 10 GB of data per partition key value.
If you want to be able to retrieve data sorted by date, take the ISO 8601 time stamp format (roughly yyyy-mm-ddThh-mm-ss.sss), split it somewhere reasonable for your data, and use the first part as the partition key and the second part as the sort key. (Another advantage of this approach is that you can use eventually consistent reads for most of the queries, since it's pretty safe to assume that after a day (or an hour or something) the data is completely replicated.)
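A sketch of that write path, assuming the partition key attribute is called day and the sort key attribute is called ts (table and attribute names are hypothetical):

```python
import boto3
from boto3.dynamodb.conditions import Key
from datetime import datetime, timezone

table = boto3.resource('dynamodb').Table('WorkerHistory')   # hypothetical table name

def record_state(worker_id, state):
    # e.g. '2024-05-01T17:42:03.123456+00:00'
    stamp = datetime.now(timezone.utc).isoformat()
    day, time_part = stamp.split('T', 1)
    table.put_item(Item={
        'day': day,          # partition key: yyyy-mm-dd
        'ts': time_part,     # sort key: the rest of the ISO 8601 timestamp
        'worker_id': worker_id,
        'state': state,
    })

# one day's history, already in chronological order, is then a single Query:
resp = table.query(KeyConditionExpression=Key('day').eq('2024-05-01'),
                   ScanIndexForward=True)
```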
If you can manage it, it would be even better to use Worker ID or Job ID as a partition key, and then you could use the full time stamp as the sort key.
As @thomasmichaelwallace mentioned, it would be best to use DynamoDB streams with a Lambda to create a materialized view.
Now, that being said, if you’re dealing with jobs being run on workers, then you should also consider whether you can achieve your goal by using a workflow service rather than a database. Workflows will maintain a job history and/or current state for you. AWS offers Step Functions and Simple Workflow.
I have somewhat of an interesting problem, and I'm looking for data store solutions for efficient querying.
I have a large (1M+) number of business objects, and each object has a large number of attributes (on the order of 100). The attributes are relatively unstructured -- the system has thousands of possible attributes, their number grows over time, and each object has an arbitrary (e.g. sparse) subset of them.
I frequently have to perform the following operation: find all objects with some concrete set of attributes S and perform an aggregation on them. I never know S ahead of time, and so on every request I have to perform an expensive sweep of the database which doesn't scale.
What are some data store solutions for this kind of problem? One possible solution would be to have a data store that parallelizes the aggregations -- maybe Cassandra with Hive/Pig on top?
Thoughts?
At this point, Cassandra + Spark is a likely candidate.
In a pure Cassandra world, you could (in theory) create a manual mapping of all possible S attributes to data objects, and then load and process those through your application (where the name of the S attribute is the partition key, the value of the S attribute is the first clustering key, and the data object ID itself is another clustering key; that way you can quickly iterate over all objects with a given S attribute set).
It's not incredibly sexy, but could be made to work.
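A minimal sketch of that manual mapping with the Python cassandra-driver (keyspace, table, and attribute names are made up): one row per (attribute, value, object) triple, so every object carrying a given attribute lives in a single partition.

```python
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('inventory')        # hypothetical keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS objects_by_attribute (
        attr_name  text,
        attr_value text,
        object_id  uuid,
        PRIMARY KEY ((attr_name), attr_value, object_id)
    )
""")

# read side: all object IDs that carry a given attribute, in one partition read
rows = session.execute(
    "SELECT attr_value, object_id FROM objects_by_attribute WHERE attr_name = %s",
    ('colour',)
)
object_ids = [row.object_id for row in rows]
```

From there, Spark (or your application) can fetch the matching objects and run the aggregation.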
I want to run a MapReduce job that scans multiple columns from a given file and assigns a unique ID (index number) to each distinct value in each column. The main challenge is to share the same ID for the same value when it is encountered on a different node or in a different reducer instance.
Currently, I am using ZooKeeper to share the unique IDs, but that has a performance impact. I have even kept the information in local caches at the reducer level to avoid multiple trips to ZooKeeper for the same value. I wanted to explore whether there is any better mechanism to do the same.
I can suggest two possible solutions for your problem (both sketched below):
1. Create a unique ID based on the value itself. This could be a hash function with a low collision rate.
2. Use faster storage than ZooKeeper. You could try a simple key-value store like Redis to hold the value-to-ID mapping.
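Rough sketches of both ideas in Python (the Redis key names and the ID width are arbitrary choices):

```python
import hashlib
import redis

# Option 1: derive the ID from the value itself, so no shared state is needed.
def hashed_id(value):
    # 64-bit ID taken from an MD5 digest; collisions are possible but rare
    return int(hashlib.md5(value.encode('utf-8')).hexdigest()[:16], 16)

# Option 2: a shared counter in Redis; SETNX makes every reducer agree
# on whichever ID was assigned to a value first.
r = redis.Redis(host='localhost', port=6379)

def redis_id(value):
    key = 'id:' + value
    candidate = r.incr('id_counter')     # provisional new ID
    if r.setnx(key, candidate):          # we won the race: candidate sticks
        return candidate
    return int(r.get(key))               # someone else assigned it first
```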
I want to process all of the data in a column family in a MapReduce job. Ordering is not important.
One approach is to iterate over all the row keys of the column family and use them as the input. This could potentially be a bottleneck and could be replaced with a parallel method.
I'm open to other suggestions, or for someone to tell me I'm wasting my time with this idea. I'm currently investigating the following:
A potentially more efficient way is to assign ranges to the input instead of iterating over all row keys (before the mapper starts). Since I am using RandomPartitioner, is there a way to specify a range to query based on the MD5?
For example, I want to split the task into 16 jobs. Since the RandomPartitioner is MD5 based (from what I have read), I'd like to query everything starting with a for the first range. In other words, how would I query do a get_range on the MD5 with the start of a and ends before b. e.g. a0000000000000000000000000000000 - afffffffffffffffffffffffffffffff?
I'm using the pycassa API (Python) but I'm happy to see Java examples.
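This doesn't settle the pycassa-specific part of the question, but computing the 16 equal-width MD5 ranges themselves is simple arithmetic; a quick sketch:

```python
def md5_token_ranges(num_splits=16):
    """Split the 128-bit MD5 token space into equal, contiguous hex ranges."""
    step = (2 ** 128) // num_splits
    ranges = []
    for i in range(num_splits):
        start = i * step
        finish = (i + 1) * step - 1
        ranges.append(('%032x' % start, '%032x' % finish))
    return ranges

# the 11th range is exactly the a0000000... - afffffff... example above
print(md5_token_ranges()[10])
```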
I'd cheat a little:
1. Create new rows job_(n), with each column representing a row key in the range you want.
2. Pull all the columns from that specific row to determine which rows you should pull from the CF.
I do this with users. Users from a particular country get a column in the country-specific row. Users of a particular age are also added to a specific row.
This allows me to quickly pull the rows I need based on the criteria I want, and it's a little more efficient than pulling everything.
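In pycassa, the pattern looks roughly like this (keyspace and column family names are made up):

```python
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace')              # hypothetical keyspace
data_cf = pycassa.ColumnFamily(pool, 'Data')             # the CF you want to process
index_cf = pycassa.ColumnFamily(pool, 'JobIndex')        # one row per job/bucket

# write side: record each data row key as a column of its bucket row
index_cf.insert('job_3', {'some_row_key': ''})

# read side: pull the bucket row, then fetch only those data rows
row_keys = list(index_cf.get('job_3', column_count=100000).keys())
rows = data_cf.multiget(row_keys)
```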
This is how the Mahout CassandraDataModel example functions:
https://github.com/apache/mahout/blob/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
Once you have the data and can pull the rows you are interested in, you can hand it off to your MR job(s).
Alternatively, if speed isn't an issue, look into using Pig: How to use Cassandra's Map Reduce with or w/o Pig?