AWS EMR HBase Bulk Load - amazon-web-services

I developed a MapReduce program to perform HBase bulk loading using the technique explained in this Cloudera article: https://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/.
On our previous on-prem Cloudera Hadoop cluster, it worked very well. Now we are moving to AWS, and I can't get this program to work on an AWS EMR cluster.
EMR details :
Release label:emr-5.16.0
Hadoop distribution:Amazon 2.8.4
Applications:Spark 2.3.1, HBase 1.4.4
Master : m4.4xlarge
Nodes : 12 x m4.4xlarge
Here is the code of my Driver
Job job = Job.getInstance(getConf());
job.setJobName("My job");
job.setJarByClass(getClass());
// Input
FileInputFormat.setInputPaths(job, input);
// Mapper
job.setMapperClass(MyMapper.class);
job.setInputFormatClass(ExampleInputFormat.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
// Reducer : Auto configure partitioner and reducer
Table table = HBaseCnx.getConnection().getTable(TABLE_NAME);
RegionLocator regionLocator = HBaseCnx.getConnection().getRegionLocator(TABLE_NAME);
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
// Output
Path out = new Path(output);
FileOutputFormat.setOutputPath(job, out);
// Launch the MR job
logger.debug("Start - Map Reduce job to produce HFiles");
boolean b = job.waitForCompletion(true);
if (!b) throw new RuntimeException("FAIL - Produce HFiles for HBase bulk load");
logger.debug("End - Map Reduce job to produce HFiles");
// Make the output HFiles usable by HBase (permissions)
logger.debug("Start - Set the permissions for HBase in the output dir " + out.toString());
//fs.setPermission(outputPath, new FsPermission(ALL, ALL, ALL)); => not recursive
FsShell shell = new FsShell(getConf());
shell.run(new String[]{"-chmod", "-R", "777", out.toString()});
logger.debug("End - Set the permissions for HBase in the output dir " + out.toString());
// Run complete bulk load
logger.debug("Start - HBase Complete Bulk Load");
LoadIncrementalHFiles loadIncrementalHFiles = new LoadIncrementalHFiles(getConf());
int loadIncrementalHFilesOutput = loadIncrementalHFiles.run(new String[]{out.toString(), TABLE_NAME.toString()});
if (loadIncrementalHFilesOutput != 0) {
throw new RuntimeException("Problem in LoadIncrementalHFiles. Return code is " + loadIncrementalHFilesOutput);
}
logger.debug("End - HBase Complete Bulk Load");
My mapper reads Parquet files and emits:
key: the row key of the Put, as an ImmutableBytesWritable
value: an HBase Put
The issue happens in the Reduce step. In each Reducer's "syslog", I got errors that seem related to Socket connections. Here is a piece of syslog :
2018-09-04 08:21:39,085 INFO [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-04 08:21:39,086 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
2018-09-04 08:21:55,705 ERROR [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts
2018-09-04 08:21:55,705 WARN [main] org.apache.hadoop.hbase.zookeeper.ZKUtil: hconnection-0x3ecedf210x0, quorum=localhost:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2018-09-04 08:21:55,706 ERROR [main] org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: hconnection-0x3ecedf210x0, quorum=localhost:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2018-09-04 08:21:55,706 WARN [main] org.apache.hadoop.hbase.client.ZooKeeperRegistry: Can't retrieve clusterId from Zookeeper
After several Google searches, I found several posts advising to set the quorum IP directly in the Java code. I did that as well, but it did not work. Here is how I currently get the HBase connection:
Configuration conf = HBaseConfiguration.create();
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
// Attempts to set directly the quorum IP in the Java code that did not work
//conf.clear();
//conf.set("hbase.zookeeper.quorum", "...ip...");
//conf.set("hbase.zookeeper.property.clientPort", "2181");
Connection cnx = ConnectionFactory.createConnection(conf);
What I don't understand is that everything else works. I can programmatically create tables and query them (Scan or Get). I can even run an MR job that inserts data with TableMapReduceUtil.initTableReducerJob("my_table", IdentityTableReducer.class, job);. But that is of course much slower than the HBase complete bulk load technique, which directly writes HFiles split according to the existing regions.
Thank you for your help

I've been working on a similar migration. The issue is that the reducer runs in a separate process so you need to set the quorum on the job's configuration instead. That will make the value available to the reducer.
job.getConfiguration().set("hbase.zookeeper.quorum", "...ip...");
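To avoid hardcoding the IP, the quorum value could be read from the same /etc/hbase/conf/hbase-site.xml the question already loads. Below is a minimal sketch using only the JDK, assuming the usual Hadoop <configuration>/<property>/<name>/<value> layout. The class SiteXmlReader, its method names, and the sample host name are hypothetical; in a real driver Hadoop's own Configuration class would normally do this parsing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper: looks up one property in a Hadoop-style *-site.xml stream. */
final class SiteXmlReader {

    /** Returns the value of the named property, or empty if it is not defined. */
    static Optional<String> readProperty(InputStream xml, String name) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse site XML", e);
        }
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            if (name.equals(textOf(property, "name"))) {
                return Optional.ofNullable(textOf(property, "value"));
            }
        }
        return Optional.empty();
    }

    /** Trimmed text of the first child element with the given tag, or null if absent. */
    private static String textOf(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Inline sample so the sketch runs anywhere; the host name is made up.
        String sample = "<configuration><property>"
                + "<name>hbase.zookeeper.quorum</name><value>zk-host.example</value>"
                + "</property></configuration>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        // Falls back to localhost, the value the reducers ended up with in the log above.
        System.out.println(readProperty(in, "hbase.zookeeper.quorum").orElse("localhost"));
    }
}
```

In the driver, the returned value would then be passed to job.getConfiguration().set("hbase.zookeeper.quorum", ...) before the job is submitted, as in the line above.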

Related

Connection to AWS MemoryDB cluster sometimes fails

We have an application that uses AWS MemoryDB for Redis. We have set up a cluster with one shard and two nodes. One of the nodes (named 0001-001) is the primary read/write node while the other one is a read replica (named 0001-002).
After deploying the application, connecting to MemoryDB sometimes fails when we use the cluster endpoint connection string to connect. If we restart the application a few times it suddenly starts working. It seems to be random when it succeeds or not. The error we get is the following:
Endpoint Unspecified/ourapp-memorydb-cluster-0001-001.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com:6379 serving hashslot 6024 is not reachable at this point of time. Please check connectTimeout value. If it is low, try increasing it to give the ConnectionMultiplexer a chance to recover from the network disconnect. IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=0,Free=32767,Min=2,Max=32767), Local-CPU: n/a
If we connect directly to the primary read/write node we get no such errors.
If we connect directly to the read replica it always fails. It even gets the error above, complaining about the "0001-001" node.
We use .NET Core 6
We use Microsoft.Extensions.Caching.StackExchangeRedis 6.0.4 which depends on StackExchange.Redis 2.2.4
The application is hosted in AWS ECS
StackExchangeRedisCache is added to the service collection in a startup file :
services.AddStackExchangeRedisCache(o =>
{
o.InstanceName = redisConfiguration.Instance;
o.ConfigurationOptions = ToRedisConfigurationOptions(redisConfiguration);
});
...where ToRedisConfigurationOptions returns a basic ConfigurationOptions object:
new ConfigurationOptions()
{
EndPoints =
{
{ "clustercfg.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com", 6379 } // Cluster endpoint
},
User = "username",
Password = "password",
Ssl = true,
AbortOnConnectFail = false,
ConnectTimeout = 60000
};
We tried multiple shards with multiple nodes, and it also sometimes fails to connect to the cluster. We even tried updating the StackExchange.Redis dependency to 2.5.43, but no luck.
We could "solve" it by directly connecting to the primary node, but if a failover occurs and 0001-002 becomes the primary node we would have to manually change our connection string, which is not acceptable in a production environment.
Any help or advice is appreciated, thanks!

How to get rid of 'Flow sessions were not provided' exception in corda when run using network bootstrapper?

I am running Corda 4.5. My flows work perfectly when run using the Gradle task deployNodes. But when I run the flow on nodes created using the network bootstrapper, I run into the exception below.
Mon Jul 26 12:43:10 GMT 2021>>> start CreateAccount name: accountOne
▶︎ Starting
✘ Requesting signature by notary service
✘ Requesting signature by Notary service
✘ Validating response from Notary service
✘ Broadcasting transaction to participants
✘ Done
☂ java.lang.IllegalArgumentException: Flow sessions were not provided for the following transaction participants: [O=PartyA, L=New York, C=US]
From the logs:
inMemoryConfigSelectionLogger. - Did not detect a configuration for InMemory selection - enabling memory usage for token indexing. Please set stateSelection.inMemory.enabled to "false" to disable this
inMemoryConfigSelectionLogger. - No indexing method specified. Indexes will be created at run-time for each invocation of selectTokens - Cordapp configuration is incorrect due to exception - empty config: No configuration setting found for key 'stateSelection'
As Sneha said in the comments, it's impossible to be confident about the source of the issue without more context about the code.
Remember that you want to be sure flow sessions are provided in a way similar to this, where you collect the sessions for the other participants in a list and pass them to FinalityFlow:
Party me = getOurIdentity();
Party notary = getServiceHub().getNetworkMapCache().getNotaryIdentities().get(0);
Command<YoContract.Commands.Send> command = new Command<YoContract.Commands.Send>(new YoContract.Commands.Send(), Arrays.asList(me.getOwningKey()));
YoState state = new YoState(me, this.target);
StateAndContract stateAndContract = new StateAndContract(state, YoContract.ID);
TransactionBuilder utx = new TransactionBuilder(notary).withItems(stateAndContract, command);
this.progressTracker.setCurrentStep(VERIFYING);
utx.verify(getServiceHub());
this.progressTracker.setCurrentStep(SIGNING);
SignedTransaction stx = getServiceHub().signInitialTransaction(utx);
this.progressTracker.setCurrentStep(FINALISING);
FlowSession targetSession = initiateFlow(this.target);
return subFlow(new FinalityFlow(stx, Collections.singletonList(targetSession), Objects.requireNonNull(FINALISING.childProgressTracker())));
}
source: https://github.com/corda/samples-java/blob/master/Features/dockerform-yocordapp/workflows/src/main/java/net/corda/samples/dockerform/flows/YoFlow.java
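To make the error message easier to reason about: the check behind it is, roughly, a set difference. Every transaction participant that is not one of the node's own identities needs a matching session. The sketch below is a simplified plain-Java model of that idea, not Corda code. The class SessionCheckModel, the method missingSessions, and the use of plain strings for parties are all hypothetical, and the real FinalityFlow may apply further rules.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Simplified, Corda-free model of the participant check; all names are hypothetical. */
final class SessionCheckModel {

    /** Returns participants that are neither our own identity nor covered by a session. */
    static Set<String> missingSessions(Set<String> participants,
                                       Set<String> ourIdentities,
                                       Set<String> sessionCounterparties) {
        Set<String> missing = new LinkedHashSet<>(participants);
        missing.removeAll(ourIdentities);         // we never need a session with ourselves
        missing.removeAll(sessionCounterparties); // these parties already have a session
        return missing;
    }

    public static void main(String[] args) {
        Set<String> participants = new LinkedHashSet<>(Arrays.asList("PartyA", "PartyB"));
        Set<String> ours = Collections.singleton("PartyB");

        // No sessions passed: PartyA is reported, which is the situation in the error above.
        System.out.println(missingSessions(participants, ours, Collections.<String>emptySet()));

        // One session for PartyA: nothing is missing any more.
        System.out.println(missingSessions(participants, ours, Collections.singleton("PartyA")));
    }
}
```

Under this model, a participant is reported either because no session was created for it or because the node does not treat that party as one of its own identities, so both are worth checking on the bootstrapped network.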

Database Migration Task fails to load the data into the source database

I have created a PostgreSQL RDS instance (target) on AWS and converted the schema using SCT. Now I am trying to move the data with a Database Migration task from a DB2 database hosted on an EC2 instance (source) to the target DB. The data is not loading and the task gives the following error:
Last Error ODBC general error. Task error notification received from subtask 1, thread 0 [reptask/replicationtask.c:2800] [1022502] Error executing source loop; Stream component failed at subtask 1, component st_1_5D3OUPDVTS3BLNMSQGEXI7ARKY ; Stream component 'st_1_5D3OUPDVTS3BLNMSQGEXI7ARKY' terminated [reptask/replicationtask.c:2807] [1022502] Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE
I was getting the same error and the issue was related to database user rights for REPLICATION CLIENT and REPLICATION SLAVE as mentioned in AWS Documentation:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html#CHAP_Source.MySQL.Prerequisites
I resolved it by setting the above mentioned REPLICATION rights using the following statements in MySQL (replacing {dbusername} with the actual database user name which was being used in DMS Endpoint):
GRANT REPLICATION CLIENT ON *.* TO {dbusername}@'%';
GRANT REPLICATION SLAVE ON *.* TO {dbusername}@'%';

How to fix `column "xlog_position" does not exist` error when using AWS DMS for Postgres to Postgres data migration

I'm trying to migrate and synchronize a PostgreSQL database using AWS DMS and I'm getting the following error.
Last Error Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2673] [1020487]
RetCode: "SQL_ERROR SqlState: 42703 NativeError: 1
Message: ERROR: column "xlog_position" does not exist; No query has been executed with that handle; RetCode: SQL_ERROR SqlState: 42P01 NativeError: 1
Message: ERROR: relation "pglogical.replication_set" does not exist; No query has been executed with that handle; RetCode: SQL_ERROR SqlState: 42703 NativeError: 1 Message: ERROR: column "xlog_position" does not exist; No query has been executed with that handle;
Could not find any supported plugins available on source; Could not resolve default plugin; Could not assign a postgres plugin to use for replication; Failure in setting Postgres CDC agent control structure; Error executing command; Stream component failed at subtask 0, component st_0_JX7ONUUGB4A2AR2VQ4FMEZ7PFU ; Stream component 'st_0_JX7ONUUGB4A2AR2VQ4FMEZ7PFU' terminated [reptask/replicationtask.c:2680] [1020487] Stop Reason FATAL_ERROR Error Level FATAL
I'm using two PostgreSQL instances as source and target. I have already tested and verified that both database instances are accessible from the replication instance. The target instance user has full access to the database. Do I need to install any plugins or do any additional configuration to get this migration setup working?
I managed to resolve the issue by following the steps mentioned at
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html.
The issue was due to the fact that I was using DMS engine v3.1.4 which required some additional configuration for the replication process to start. These instructions can be found at https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html#CHAP_Source.PostgreSQL.v10
If you are experiencing this issue, double-check the DMS replication engine version. It can be viewed under Replication Instances in Resource Management.
To enable logical decoding for an Amazon RDS for PostgreSQL DB instance:
1. The user account requires the rds_superuser role to enable logical replication. The user account also requires the rds_replication role to grant permissions to manage logical slots and to stream data using logical slots.
2. Set the rds.logical_replication static parameter to 1. As part of applying this parameter, we also set the parameters wal_level, max_wal_senders, max_replication_slots, and max_connections. These parameter changes can increase WAL generation, so you should only set the rds.logical_replication parameter when you are using logical slots.
3. Reboot the DB instance for the static rds.logical_replication parameter to take effect.
4. Create a logical replication slot as explained in the next section. This process requires that you specify a decoding plugin. Currently we support the test_decoding output plugin that ships with PostgreSQL.
The last item can be done with the following command:
SELECT * FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');
Reference: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.FeatureSupport.LogicalReplication

HdfsRpcException: Failed to invoke RPC call "getFsStats" on server

I've installed a single-node Hadoop cluster on an EC2 instance. I then stored some test data on HDFS and I'm trying to load the HDFS data into SAP Vora. I'm using SAP Vora 2.0 for this project.
To create the table and load the data to Vora, this is the query I'm running:
drop table if exists dims;
CREATE TABLE dims(teamid int, team string)
USING com.sap.spark.engines.relational
OPTIONS (
hdfsnamenode "namenode.example.com:50070",
files "/path/to/file.csv",
storagebackend "hdfs");
When I run the above query, I get this error message:
com.sap.vora.jdbc.VoraException: HL(9): Runtime error.
(could not handle api call, failure reason : execution of scheduler plan failed:
found error: :-1, CException, Code: 10021 : Runtime category : an std::exception wrapped.
Next level: v2 HDFS Plugin: Exception at opening
hdfs://namenode.example.com:50070/path/to/file.csv:
HdfsRpcException: Failed to invoke RPC call "getFsStats" on server
"namenode.example.com:50070" for node id 20
with error code 0, status ERROR_STATUS
Hadoop and Vora are running on different nodes.
You should specify the HDFS NameNode RPC port, which is typically 8020. 50070 is the port of the web UI. See e.g. the question "Default Namenode port of HDFS is 50070. But I have come across at some places 8020 or 9000".
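A small sanity check along these lines can catch the mix-up before the table is created. This is only a sketch: the class NameNodePortCheck and its methods are hypothetical. It assumes the default web UI ports, 50070 on Hadoop 2.x and 9870 on Hadoop 3.x; a cluster may be configured differently.

```java
/** Hypothetical helper that flags a NameNode address pointing at a default web UI port. */
final class NameNodePortCheck {

    // Default NameNode web UI ports (assumption: cluster uses the defaults).
    static final int WEB_UI_PORT_HADOOP_2 = 50070;
    static final int WEB_UI_PORT_HADOOP_3 = 9870;

    /** Extracts the port from a "host:port" string. */
    static int portOf(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        if (colon < 0 || colon == hostPort.length() - 1) {
            throw new IllegalArgumentException("Expected host:port but got: " + hostPort);
        }
        return Integer.parseInt(hostPort.substring(colon + 1));
    }

    /** True if the address uses one of the default web UI ports rather than an RPC port. */
    static boolean looksLikeWebUiPort(String hostPort) {
        int port = portOf(hostPort);
        return port == WEB_UI_PORT_HADOOP_2 || port == WEB_UI_PORT_HADOOP_3;
    }

    public static void main(String[] args) {
        // Value taken from the hdfsnamenode option in the question.
        String configured = "namenode.example.com:50070";
        if (looksLikeWebUiPort(configured)) {
            System.out.println(configured + " looks like the web UI port; try the RPC port (often 8020).");
        }
    }
}
```

The actual RPC port of a given cluster is whatever fs.defaultFS in its core-site.xml specifies, so that file is the place to confirm the value.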