HdfsRpcException: Failed to invoke RPC call "getFsStats" on server - amazon-web-services

I've installed a single-node Hadoop cluster on an EC2 instance. I then stored some test data on HDFS and I'm trying to load the HDFS data into SAP Vora. I'm using SAP Vora 2.0 for this project.
To create the table and load the data into Vora, this is the query I'm running:
drop table if exists dims;
CREATE TABLE dims(teamid int, team string)
USING com.sap.spark.engines.relational
OPTIONS (
hdfsnamenode "namenode.example.com:50070",
files "/path/to/file.csv",
storagebackend "hdfs");
When I run the above query, I get this error message:
com.sap.vora.jdbc.VoraException: HL(9): Runtime error.
(could not handle api call, failure reason : execution of scheduler plan failed:
found error: :-1, CException, Code: 10021 : Runtime category : an std::exception wrapped.
Next level: v2 HDFS Plugin: Exception at opening
hdfs://namenode.example.com:50070/path/to/file.csv:
HdfsRpcException: Failed to invoke RPC call "getFsStats" on server
"namenode.example.com:50070" for node id 20
with error code 0, status ERROR_STATUS
Hadoop and Vora are running on different nodes.

You should specify the HDFS NameNode RPC port, which is typically 8020. Port 50070 is the NameNode web UI, not the RPC endpoint. See, for example, the question "Default Namenode port of HDFS is 50070. But I have come across at some places 8020 or 9000".
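As a sketch, the table definition from the question would then point at the RPC port instead (this assumes the cluster uses the default 8020; check fs.defaultFS in core-site.xml to confirm which port your NameNode actually listens on):
CREATE TABLE dims(teamid int, team string)
USING com.sap.spark.engines.relational
OPTIONS (
hdfsnamenode "namenode.example.com:8020",
files "/path/to/file.csv",
storagebackend "hdfs");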

Related

AWS DMS ERROR : Last Error Load utility network error. Task error notification received from subtask 0, thread 0

I am trying to replicate data from RDS (PostgreSQL 9.6) to Redshift using AWS Database Migration Service (DMS).
I have configured RDS for CDC and ran the DMS task with the Full Load + CDC option.
Replication instance: dms.c5.4xlarge (32 GB).
After the full load completed successfully and CDC was running, the task suddenly failed with the following error message:
Last Error Load utility network error. Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2860] [1020458] Error executing source loop; Stream component failed at subtask 0, component st_0_JJ6J5HNCIGLCUMQJNNBGWCGA2YHJU2CQI6QJG6Y; Stream component 'st_0_JJ6J5HNCIGLCUMQJNNBGWCGA2YHJU2CQI6QJG6Y' terminated [reptask/replicationtask.c:2868] [1020458] Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE
What could be the possible cause? I checked the network throughput as well and everything seems to be in good shape.
TIA

Database Migration Task fails to load the data into the source database

I have created a PostgreSQL (target) RDS instance on AWS, did the schema conversion using SCT, and now I am trying to move data with a Database Migration task from a DB2 database on an EC2 instance (source) to the target DB. The data is not loading and the task is giving the following error:
Last Error ODBC general error. Task error notification received from subtask 1, thread 0 [reptask/replicationtask.c:2800] [1022502] Error executing source loop; Stream component failed at subtask 1, component st_1_5D3OUPDVTS3BLNMSQGEXI7ARKY ; Stream component 'st_1_5D3OUPDVTS3BLNMSQGEXI7ARKY' terminated [reptask/replicationtask.c:2807] [1022502] Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE
I was getting the same error, and the issue was related to the database user rights for REPLICATION CLIENT and REPLICATION SLAVE, as mentioned in the AWS documentation:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html#CHAP_Source.MySQL.Prerequisites
I resolved it by granting the above-mentioned replication rights using the following statements in MySQL (replacing {dbusername} with the actual database user name used in the DMS endpoint):
GRANT REPLICATION CLIENT ON *.* TO {dbusername}@'%';
GRANT REPLICATION SLAVE ON *.* TO {dbusername}@'%';
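To confirm the grants took effect, a quick check (same placeholder user; no FLUSH PRIVILEGES is needed when privileges are changed via GRANT):
SHOW GRANTS FOR {dbusername}@'%';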

How to fix `column "xlog_position" does not exist` error when using AWS DMS for Postgres to Postgres data migration

I'm trying to migrate and synchronize a PostgreSQL database using AWS DMS and I'm getting the following error.
Last Error Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2673] [1020487]
RetCode: "SQL_ERROR SqlState: 42703 NativeError: 1
Message: ERROR: column "xlog_position" does not exist; No query has been executed with that handle; RetCode: SQL_ERROR SqlState: 42P01 NativeError: 1
Message: ERROR: relation "pglogical.replication_set" does not exist; No query has been executed with that handle; RetCode: SQL_ERROR SqlState: 42703 NativeError: 1 Message: ERROR: column "xlog_position" does not exist; No query has been executed with that handle;
Could not find any supported plugins available on source; Could not resolve default plugin; Could not assign a postgres plugin to use for replication; Failure in setting Postgres CDC agent control structure; Error executing command; Stream component failed at subtask 0, component st_0_JX7ONUUGB4A2AR2VQ4FMEZ7PFU ; Stream component 'st_0_JX7ONUUGB4A2AR2VQ4FMEZ7PFU' terminated [reptask/replicationtask.c:2680] [1020487] Stop Reason FATAL_ERROR Error Level FATAL
I'm using two PostgreSQL instances as the source and the target. I have already tested and verified that both database instances are accessible from the replication instance. The target instance user has full access to the database. Do I need to install any plugins or do any additional configuration to get this migration setup working?
I managed to resolve the issue by following the steps mentioned at
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html.
The issue was that I was using DMS engine v3.1.4, which requires some additional configuration for the replication process to start. These instructions can be found at https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html#CHAP_Source.PostgreSQL.v10
If you are experiencing this issue, double-check the DMS replication engine version. It can be viewed under Replication Instances in Resource Management.
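If you prefer the CLI, the engine version can also be listed with the DMS API (a sketch; the query expression simply picks out the identifier and version fields):
aws dms describe-replication-instances \
    --query "ReplicationInstances[].[ReplicationInstanceIdentifier,EngineVersion]" \
    --output table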
To enable logical decoding for an Amazon RDS for PostgreSQL DB instance:
The user account requires the rds_superuser role to enable logical replication. The user account also requires the rds_replication role to grant permissions to manage logical slots and to stream data using logical slots.
1. Set the rds.logical_replication static parameter to 1. As part of applying this parameter, we also set the parameters wal_level, max_wal_senders, max_replication_slots, and max_connections. These parameter changes can increase WAL generation, so you should only set the rds.logical_replication parameter when you are using logical slots.
2. Reboot the DB instance for the static rds.logical_replication parameter to take effect.
3. Create a logical replication slot as explained in the next section. This process requires that you specify a decoding plugin. Currently we support the test_decoding output plugin that ships with PostgreSQL.
The last item can be done with the following command:
SELECT * FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');
Reference: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.FeatureSupport.LogicalReplication
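For step 1 above, a sketch of setting the static parameter with the AWS CLI (the parameter group name my-postgres-params is a placeholder; because the parameter is static, it only takes effect after the reboot in step 2):
aws rds modify-db-parameter-group \
    --db-parameter-group-name my-postgres-params \
    --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"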

AWS EMR HBase Bulk Load

I developed a MapReduce program to do HBase bulk loading using the technique explained in this Cloudera article: https://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/.
On our previous on-prem Cloudera Hadoop cluster it was working very well. Now we are moving to AWS, and I can't get this program to work on an AWS EMR cluster.
EMR details :
Release label: emr-5.16.0
Hadoop distribution: Amazon 2.8.4
Applications: Spark 2.3.1, HBase 1.4.4
Master : m4.4xlarge
Nodes : 12 x m4.4xlarge
Here is the code of my driver:
Job job = Job.getInstance(getConf());
job.setJobName("My job");
job.setJarByClass(getClass());
// Input
FileInputFormat.setInputPaths(job, input);
// Mapper
job.setMapperClass(MyMapper.class);
job.setInputFormatClass(ExampleInputFormat.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
// Reducer : Auto configure partitioner and reducer
Table table = HBaseCnx.getConnection().getTable(TABLE_NAME);
RegionLocator regionLocator = HBaseCnx.getConnection().getRegionLocator(TABLE_NAME);
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
// Output
Path out = new Path(output);
FileOutputFormat.setOutputPath(job, out);
// Launch the MR job
logger.debug("Start - Map Reduce job to produce HFiles");
boolean b = job.waitForCompletion(true);
if (!b) throw new RuntimeException("FAIL - Produce HFiles for HBase bulk load");
logger.debug("End - Map Reduce job to produce HFiles");
// Make the output HFiles usable by HBase (permissions)
logger.debug("Start - Set the permissions for HBase in the output dir " + out.toString());
//fs.setPermission(outputPath, new FsPermission(ALL, ALL, ALL)); => not recursive
FsShell shell = new FsShell(getConf());
shell.run(new String[]{"-chmod", "-R", "777", out.toString()});
logger.debug("End - Set the permissions for HBase in the output dir " + out.toString());
// Run complete bulk load
logger.debug("Start - HBase Complete Bulk Load");
LoadIncrementalHFiles loadIncrementalHFiles = new LoadIncrementalHFiles(getConf());
int loadIncrementalHFilesOutput = loadIncrementalHFiles.run(new String[]{out.toString(), TABLE_NAME.toString()});
if (loadIncrementalHFilesOutput != 0) {
throw new RuntimeException("Problem in LoadIncrementalHFiles. Return code is " + loadIncrementalHFilesOutput);
}
logger.debug("End - HBase Complete Bulk Load");
My mapper reads Parquet files and emits:
a key, which is the row key of a Put, as ImmutableBytesWritable
a value, which is an HBase Put
The issue happens in the reduce step. In each reducer's "syslog" I get errors that seem related to socket connections. Here is a piece of the syslog:
2018-09-04 08:21:39,085 INFO [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-04 08:21:39,086 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
2018-09-04 08:21:55,705 ERROR [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts
2018-09-04 08:21:55,705 WARN [main] org.apache.hadoop.hbase.zookeeper.ZKUtil: hconnection-0x3ecedf210x0, quorum=localhost:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2018-09-04 08:21:55,706 ERROR [main] org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: hconnection-0x3ecedf210x0, quorum=localhost:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2018-09-04 08:21:55,706 WARN [main] org.apache.hadoop.hbase.client.ZooKeeperRegistry: Can't retrieve clusterId from Zookeeper
After several Google searches, I found several posts advising to set the quorum IP directly in the Java code. I did that as well, but it did not work. Here is how I currently get the HBase connection:
Configuration conf = HBaseConfiguration.create();
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
// Attempts to set directly the quorum IP in the Java code that did not work
//conf.clear();
//conf.set("hbase.zookeeper.quorum", "...ip...");
//conf.set("hbase.zookeeper.property.clientPort", "2181");
Connection cnx = ConnectionFactory.createConnection(conf);
What I don't understand is that everything else is working. I can programmatically create tables and query the table (Scan or Get). I can even use an MR job that inserts data with TableMapReduceUtil.initTableReducerJob("my_table", IdentityTableReducer.class, job);, but that is of course much slower than the HBase complete bulk load technique, which directly writes HFiles split according to the existing regions.
Thank you for your help
I've been working on a similar migration. The issue is that the reducer runs in a separate process, so you need to set the quorum on the job's configuration instead. That will make the value available to the reducer:
job.getConfiguration().set("hbase.zookeeper.quorum", "...ip...");
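A minimal sketch of where this fits in the driver from the question (the quorum hosts are placeholders; setting them on the job configuration before configureIncrementalLoad and job submission makes them visible to the reduce tasks running on other nodes):
Job job = Job.getInstance(getConf());
// Point the tasks at the real ZooKeeper quorum instead of the localhost default
job.getConfiguration().set("hbase.zookeeper.quorum", "zk-host-1,zk-host-2,zk-host-3");
job.getConfiguration().set("hbase.zookeeper.property.clientPort", "2181");
// ... mapper/input/output setup as in the question ...
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);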

How to remove banned.users in hadoop Hortonworks

I have an HDP (Hortonworks) 2.5.3 cluster; MapReduce jobs in YARN are failing with the error:
java.io.IOException: DistCp failure: Job job_1498784032636_0015 has
failed:
Application application_1498784032636_0015 failed 2 times due to AM Container for appattempt_1498784032636_0015_000002 exited with
exitCode: -1000 For more detailed output, check the application
tracking page:
http://asterdart0005.labs.teradata.com:8088/cluster/app/application_1498784032636_0015 Then click on links to logs of each attempt. Diagnostics: Application
application_1498784032636_0015 initialization failed (exitCode=255)
with output: main : command provided 0 main : run as user is hdfs main
: requested yarn user is hdfs Requested user hdfs is banned
Later I googled, and it seems the hdfs user is a banned user, per the configuration in the file /etc/hadoop/conf/container-executor.cfg on each node. Here is the content of the file:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=500
I modified the file on all nodes (namenode, edge, and data nodes) as below:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
#banned.users=hdfs,yarn,mapred,bin
min.user.id=500
and restarted all HDFS, YARN, and MapReduce2 services through Ambari. After restarting, my jobs fail with the same error, and when I checked the /etc/hadoop/conf/container-executor.cfg content, it looks like it was reset to its initial state, as below:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=500
Any idea what the solution is here, to remove the users from the banned users list?
The first thing to note is that you cannot comment out the banned.users line; instead, set the correct users in the banned.users list (i.e. if you do not want to ban the hdfs user, change banned.users=hdfs,yarn,mapred,bin to banned.users=yarn,mapred,bin). If you comment out the banned.users list, then hdfs, yarn and mapred are banned by default anyway.
Also, you can follow the steps below to propagate the change to all nodes.
Go to the Ambari server node.
Modify /var/lib/ambari-server/resources/common-services/YARN/<version>/package/templates/container-executor.cfg.j2 to configure the banned users (see the sketch after these steps).
Restart the Ambari server and all Ambari agents.
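As a sketch, the banned.users entry in that template, and therefore in the /etc/hadoop/conf/container-executor.cfg that Ambari pushes out to each node, would end up reading (assuming only hdfs should be unbanned and the other properties stay as they are):
banned.users=yarn,mapred,bin
min.user.id=500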