MapReduce on HBase

I am executing a MapReduce job that processes 30 rows from an HBase table (MAP_INPUT_RECORDS=30). The table has 11000 regions, but by our region split policy a given record lives in exactly one region (i.e. a single record will never span two or more regions). In the log I see 65 mappers launched (TOTAL_LAUNCHED_MAPS=65). According to the HBase documentation, one mapper is assigned per region, but in my case the number of mappers is larger than the number of regions that hold my records. Please suggest a solution. Thanks in advance.

You have 11000 regions (table regions), so at most you can have 11000 mappers.
Are you confusing table regions with HBase region servers? An HBase cluster can have 10 region servers while a table hosted on that cluster has 1000 regions, with each region server hosting 100 regions.
TableInputFormat spawns one mapper per region of the table, not per HBase region server.
For a better understanding, see http://bytepadding.com/big-data/hbase/hbase-parameter-tuning/

Related

For DynamoDB DAX, are requests charged even when there is a cache hit, i.e. the item is fetched from the DAX cache?

Let's suppose I have a DynamoDB table with 10 frequently accessed items of around 8KB each.
I decided to use DAX in front of the table.
I receive a total of 1 million read requests for the items.
a. Will I be charged for 10 DynamoDB requests, since only 10 requests made it to DynamoDB and the rest were served from the DAX cache itself,
or
b. will I be charged for all 1 million DynamoDB requests?
I had a similar question and asked AWS. The answer I received was:
Whenever DAX has the item available (a cache hit), DAX returns the item to the application without accessing DynamoDB. In that case, the request will not consume read capacity units (RCUs) from the DynamoDB table, and hence there will not be any DynamoDB cost for that request. Therefore, if you have 10k requests and only 2k of them go to DynamoDB, the total charge will be the 2k read request charge for DynamoDB, the running cost of the DAX cluster, and data transfer charges (if applicable).
DynamoDB charges for DAX capacity by the hour and your DAX instances run with no long-term commitments. Pricing is per node-hour consumed and is dependent on the instance type you select. Each partial node-hour consumed is billed as a full hour. Pricing applies to all individual nodes in the DAX cluster. For example, if you have a three-node DAX cluster, you are billed for each of the separate nodes (three nodes in total) on an hourly basis.
https://aws.amazon.com/dynamodb/pricing/on-demand/
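To put the quoted answer in concrete terms, here is a rough back-of-the-envelope sketch for the scenario in the question (10 cache misses vs. 1 million reads of 8KB items). The per-million-read rate and the 4KB read unit below are illustrative assumptions, not current AWS pricing; check the pricing page for real numbers.

```python
# Illustrative cost sketch for the DAX cache-hit scenario above.
# Rates are assumptions for illustration, not current AWS pricing.

ITEM_SIZE_KB = 8          # each item is ~8 KB, per the question
RCU_READ_SIZE_KB = 4      # one strongly consistent read covers up to 4 KB

def rcus_per_read(item_size_kb, read_size_kb=RCU_READ_SIZE_KB):
    """Read units consumed by one strongly consistent read of an item."""
    return -(-item_size_kb // read_size_kb)  # ceiling division

def dynamodb_read_cost(requests, item_size_kb, price_per_million_units=0.25):
    """Read cost in USD for `requests` reads (assumed on-demand rate)."""
    units = requests * rcus_per_read(item_size_kb)
    return units * price_per_million_units / 1_000_000

# Only cache misses reach DynamoDB: 10 items -> ~10 misses out of 1M reads.
misses, total = 10, 1_000_000
cost_with_dax = dynamodb_read_cost(misses, ITEM_SIZE_KB)
cost_without = dynamodb_read_cost(total, ITEM_SIZE_KB)
print(cost_with_dax, cost_without)
```

The point of the sketch is the ratio, not the absolute numbers: with DAX absorbing the hits, only the handful of misses generate DynamoDB read charges, while the DAX node-hours become the dominant cost.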

Cheapest/easiest way to store/read 1 bit of data in AWS

The problem is as the title reads. What is the best way to do this? Currently I'm using DynamoDB, but I'm not convinced it's the cheapest, given how read capacity units are calculated (4KB minimum per read). If you had to store 1 bit of data and query it from a static website (hosted on S3), how would you do it?
Thinking outside the box, some options would be:
Amazon ElastiCache cache.t2.micro ($0.017 per hour)
Roll-your-own cache on an Amazon EC2 instance t3.nano ($0.0052 per hour)
Store it in an AWS IoT Shadow ($2.25 per million messages)
DynamoDB Accelerator (DAX) ($0.04 per hour)
AWS Systems Manager - Parameter Store (Free) but might not be suitable
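For comparison, the hourly options above can be converted to rough monthly figures, assuming ~730 hours in a month; the rates are the ones quoted in the list and may be outdated:

```python
# Rough monthly cost comparison of the hourly options listed above,
# assuming ~730 hours per month (rates as quoted; check current pricing).
HOURS_PER_MONTH = 730

hourly_options = {
    "ElastiCache cache.t2.micro": 0.017,
    "EC2 t3.nano": 0.0052,
    "DAX": 0.04,
}

monthly = {name: round(rate * HOURS_PER_MONTH, 2)
           for name, rate in hourly_options.items()}
print(monthly)  # e.g. EC2 t3.nano comes to under $4/month
```

At these rates a roll-your-own t3.nano cache is the cheapest always-on option, while the pay-per-request options (IoT Shadow, DynamoDB) win when the read volume is very low.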

DynamoDB Global Table Replication System

I am working on benchmarking DynamoDB's performance as part of a university project and have been looking for more details on the replication system used by global tables, as I want to understand its impact on latency/throughput.
I ended up finding two confusing concepts: Regions and Availability Zones. From what I understood here:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.CrossRegionRepl.html
by creating 2 tables, say one in Frankfurt and one in Ireland, I now have
2 multi-master read/write replicas.
But then I found these links:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.Partitions.html
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html
https://aws.amazon.com/blogs/aws/new-for-amazon-dynamodb-global-tables-and-on-demand-backup/
explaining that the data is stored and automatically replicated across multiple Availability Zones in an AWS Region, but not mentioning the number of replicas, whether they can serve read/write requests, and whether they are also multi-master, or master/slave, or just for recovery purposes.
From what I understood there, going back to my example (Frankfurt/Ireland), I would have:
3 multi-master read/write replicas in Frankfurt
3 multi-master read/write replicas in Ireland
Please let me know which one is correct. Thanks in advance.
DynamoDB by default stores your table's data in multiple Availability Zones, regardless of whether it is a global table. This is to ensure higher availability in case one zone goes down. However, these replicas are transparent to the user, and the user doesn't get to choose which one to connect to.
Here is a nice video explaining how it works under the hood.
A global table means that data is replicated across Regions, transparently to the user. I did a benchmark with a table in two Regions, Oregon and Ohio, and it typically took ~1.5 seconds for a write to be replicated. Conflict resolution is managed automatically by AWS: the last writer wins.
A personal suggestion is to write to only one table, so that data collisions are minimized, and in case of a disaster fail writes over to the other region.
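The replication-lag benchmark described above (write an item in one Region, then poll the other Region until it appears) can be sketched like this; the table name, key schema, and Region names in the commented boto3 usage are assumptions for illustration:

```python
# Sketch of a cross-region replication-lag measurement: write a unique
# marker item via one callable, then poll a second callable until it
# reports the item as visible.
import time
import uuid

def measure_replication_lag(write_item, read_item, timeout=30.0, poll=0.05):
    """Write via write_item(marker), then poll read_item(marker) until it
    returns True; returns elapsed seconds, or None on timeout."""
    marker = str(uuid.uuid4())
    write_item(marker)
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if read_item(marker):
            return time.monotonic() - start
        time.sleep(poll)
    return None

# With boto3 against a global table (untested sketch; needs credentials,
# and "bench"/"pk" are hypothetical table/key names):
#   import boto3
#   src = boto3.resource("dynamodb", region_name="us-west-2").Table("bench")
#   dst = boto3.resource("dynamodb", region_name="us-east-2").Table("bench")
#   lag = measure_replication_lag(
#       lambda m: src.put_item(Item={"pk": m}),
#       lambda m: "Item" in dst.get_item(Key={"pk": m}))
```

Averaging over many markers, rather than a single write, gives a more stable estimate of the ~1.5 s figure mentioned above.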

Using AWS DynamoDB or Redshift to store analytics data

I would like to ask which service would suit me best. For example, a Facebook-like mobile app where I need to track every movement of a user, such as the pages visited or links clicked.
I am thinking of using DynamoDB and creating multiple tables to track the different activities. When I run my analytics app, it will query all the data from each table (same hash key but different range keys, so I can query all the data) and compute the results in the app. So the main cost is the read throughput, which can easily be 250 reads/s (~$28/mth) for each table. Since table storage has no limit, is it free?
For Redshift, I would be paying for storage on a 100%-utilized-per-month basis for 160GB, which comes to about $14.62/mth. Although it looks cheaper, I am not familiar with Redshift, so I am not sure what the other hidden costs are.
Thanks in advance!
Pricing for Amazon DynamoDB has several components:
Provisioned Throughput Capacity (the speed of the tables)
Indexed Data Storage (the cost of storing data)
Data Transfer (for data going from AWS to the Internet)
For example, 100GB of data storage would cost around $25.
If you want 250 reads/second, it would cost $0.0065 per hour per 50 units, which is 0.0065 * 5 blocks * 24 hours * 30 days = $23.40 (plus some write capacity units).
Pricing for Amazon Redshift is based upon the number and type of nodes. A 160GB dc1.large node would cost $0.25/hour * 24 hours * 30 days = $180 per node (and only one node is probably required for your situation).
Amazon Redshift therefore comes out as more expensive, but it is also a more-featured system. You can run complex SQL against Amazon Redshift, whereas you would have to write an application to retrieve, join and compute information from DynamoDB. Think of DynamoDB as a storage service, while Redshift is also a querying service.
The real decision, however, should be based on how you are going to use the data. If you can create an application that will work with DynamoDB, then use it. However, many people find the simplicity of using SQL on Redshift to be much easier.
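The arithmetic above can be reproduced in a short script (using the legacy provisioned-throughput rates quoted in the answer; current pricing differs):

```python
# Reproducing the monthly cost arithmetic from the answer above.
# Rates are the legacy ones quoted there, kept only for illustration.

def dynamodb_read_cost(reads_per_sec, price_per_50_units_hr=0.0065,
                       hours=24 * 30):
    """Monthly cost of provisioned read capacity, billed in 50-unit blocks."""
    blocks = -(-reads_per_sec // 50)   # round up to whole 50-unit blocks
    return blocks * price_per_50_units_hr * hours

def redshift_node_cost(nodes=1, price_per_node_hr=0.25, hours=24 * 30):
    """Monthly cost of a Redshift cluster of dc1.large-class nodes."""
    return nodes * price_per_node_hr * hours

print(dynamodb_read_cost(250))   # 5 blocks -> $23.40/month
print(redshift_node_cost())      # one node  -> $180.00/month
```

The comparison makes the trade-off explicit: Redshift's fixed node cost dwarfs the DynamoDB read throughput cost here, so it only pays off if you need its SQL querying.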

Is it possible to use MapReduce

I have several hundred million rows in a database as input data.
Processing every 10000 rows takes approximately 15 minutes, because external API requests are involved.
I am going to divide the data into chunks and process it with a hundred Amazon AWS EC2 instances. Every process launched on an EC2 instance will also save its output to the database.
Is it possible to organize this multi-agent task as a MapReduce cluster using cloud services?
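Whatever service ends up running the workers (EMR, plain EC2, etc.), the "divide into chunks" step above is just range partitioning of the key space. A minimal sketch, using the 10000-row batch size from the question:

```python
# Sketch: split the row key space into half-open ranges that independent
# workers (EC2 instances, or Hadoop/EMR mappers) can process in parallel.
# The 10000-row chunk size matches the batch size in the question.

def make_chunks(total_rows, chunk_size=10_000):
    """Yield (start, end) half-open row ranges covering total_rows rows."""
    for start in range(0, total_rows, chunk_size):
        yield (start, min(start + chunk_size, total_rows))

chunks = list(make_chunks(25_000))
print(chunks)  # [(0, 10000), (10000, 20000), (20000, 25000)]
```

Each chunk is independent, so the map phase is trivially parallel; since each worker writes its own output back to the database, no reduce phase is strictly needed.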