Which node in HDFS to request data? - hdfs

I am building an HDFS cluster and I want to request data. Do I have to send all requests to the master node, or can I do it through any node?
Thanks

DFS Requests (to read/write data) are made by Hadoop Clients to the Namenode.
Can I do it through any node?
Any node that has:
- Hadoop installed, with the configuration and libraries required to connect to the Hadoop filesystem
- user privileges to access the filesystem
- network access to the cluster
can be used to make DFS requests.
Obviously, the nodes in the Cluster would qualify as Hadoop Clients.

You can use any node in the cluster, and also any "edge" or "gateway" node, which is a client node with the Hadoop libraries and configuration files. You don't address the request to a specific node but to the cluster: internally, your request is sent to the namenode, and the namenode sends back not the data but the locations of the data; your client node then goes to those locations to retrieve it.
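To make that flow concrete, here is a minimal client-side read sketch using pyarrow's HDFS binding (one option among several); the hostname, port, user and path are placeholders, and it assumes the Hadoop client libraries (libhdfs) and configuration are available on the node running it:

from pyarrow import fs

# The client asks the NameNode only for metadata (block locations);
# the actual bytes are streamed straight from the DataNodes.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020, user="hadoop")

with hdfs.open_input_stream("/data/events/part-00000.csv") as f:
    print(f.read(1024))  # first kilobyte of the file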

Related

Why do we need HDFS on EMR when we have S3

In our place, we use AWS services for all our data infrastructure and service needs. Our Hive tables are external tables and the actual data files are stored in S3. We use Apache Spark for data ingestion and transformation. We have an ever-running EMR cluster with 1 master node and 1 core node (both always running); whenever data processing happens, additional core nodes and task nodes are added, and they are removed once processing is done. Our EC2 instances have EBS volumes for temporary storage/scratch space for the executors.
Given this context, I am wondering why we need HDFS in our EMR cluster at all. I also see that the HDFS NameNode service is always running on the master node, and the DataNode service is running on the core node. They do have some blocks they are managing, but I am not able to find which files they belong to. Also, the size of all the blocks is very small (~2 GB).
Software versions used
Python version: 3.7.0
PySpark version: 2.4.7
Emr version: 5.32.0
If you know the answer to this question, can you please help me understand this need for HDFS? Please let me know if you have any questions for me.
HDFS in EMR is a built-in component that is provided to store secondary information, such as credentials when your Spark executors need to authenticate themselves to read a resource; another use is to store log files. In my personal experience, I have used it as a staging area to store partial results in a long computation, so that if something went wrong in the middle I would have a checkpoint from which to resume execution instead of starting the computation from scratch. Storing the final result on HDFS is strongly discouraged.
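As an illustration of that staging-area use, here is a hedged PySpark sketch (the bucket names and paths are made up) that parks an intermediate result on the cluster's HDFS and writes the final output back to S3:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staging-example").getOrCreate()

# Source data lives in S3, as in the question.
df = spark.read.parquet("s3://my-bucket/raw/events/")
stage = df.filter("event_date = '2021-01-01'").repartition(32)

# Park the partial result on the cluster's HDFS; a failed later step can
# resume from here instead of recomputing everything from S3.
stage.write.mode("overwrite").parquet("hdfs:///tmp/staging/events_20210101")

resumed = spark.read.parquet("hdfs:///tmp/staging/events_20210101")
# The final result goes back to S3, not HDFS, since HDFS on EMR is ephemeral.
resumed.write.mode("overwrite").parquet("s3://my-bucket/curated/events/")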
Spark on EMR runs on YARN, which itself uses HDFS. The Spark executors run inside of YARN containers, and Spark distributes the Spark code and config by placing it in HDFS and distributing it to all of the nodes running the Spark executors in the YARN containers. Additionally, the Spark Event Logs from each running and completed application are stored in HDFS by default.

Can not synchronize WSO2 node services

I have 2 WSO2 nodes in a cluster. The log says that both nodes are connected to the cluster, but each node has its own service list. What I want is that when I configure a service on one node, it is synchronized to the other one.
Everything is configured as in this tutorial:
https://docs.wso2.com/display/CLUSTER44x/Setting+up+a+Cluster#SettingupaCluster-Configuringtheloadbalancer
I suppose you have used SVN-based deployment synchronization. Could you please try rsync as mentioned in [1]? That is the more recommended synchronization mechanism.

WSO2 ESB with Load balancer and without clustering

I am trying to set up WSO2 ESB on 2 nodes, both sharing the same DB, with a load balancer handling the load across these 2 nodes.
Wondering if we really need to do clustering based on WKA scheme across these 2 nodes?
In ESB, Synapse configurations are not stored in the DB; they are stored in the file system. So yes, Hazelcast-based clustering is required, since the artifacts are synced between the nodes using an SVN-based deployment synchronizer. When the manager node gets a new artifact (say an API, proxy, etc.), it broadcasts a syncing message to all the worker nodes in the cluster, and the worker nodes then check out any new artifacts from SVN. You can read more about this here

How to configure putHDFS processor in Apache NiFi such that I could transfer file from a local machine to HDFS over the network?

I have data in a file on my local Windows machine. The local machine has Apache NiFi running on it. I want to send this file to HDFS over the network using NiFi. How could I configure the PutHDFS processor in NiFi on the local machine so that I can send data to HDFS over the network?
Thank you!
You need to copy the core-site.xml and hdfs-site.xml from one of your hadoop nodes to the machine where NiFi is running. Then configure PutHDFS so that the configuration resources are "/path/to/core-site.xml,/path/to/hdfs-site.xml". That is all that is required from the NiFi perspective, those files contain all of the information it needs to connect to the Hadoop cluster.
You'll also need to ensure that the machine where NiFi is running has network access to all of the machines in your Hadoop cluster. You can look through those config files and find any hostnames and IP addresses and make sure they can be accessed from the machine where NiFi is running.
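If it helps, here is a rough Python helper along the lines of that advice (the file locations and the 8020 default port are assumptions); it pulls hdfs:// endpoints out of the copied site files and checks whether the NiFi machine can reach them:

import socket
import xml.etree.ElementTree as ET

def properties(path):
    # Yield (name, value) pairs from a Hadoop *-site.xml file.
    for prop in ET.parse(path).getroot().findall("property"):
        yield prop.findtext("name"), prop.findtext("value")

for site in ("/path/to/core-site.xml", "/path/to/hdfs-site.xml"):
    for name, value in properties(site):
        if value and value.startswith("hdfs://"):
            host, _, port = value[len("hdfs://"):].partition(":")
            try:
                socket.create_connection((host, int(port or 8020)), timeout=3).close()
                print(f"{name}: {host} is reachable")
            except OSError:
                print(f"{name}: {host} is NOT reachable from this machine")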
Using the GetFile processor, or the combination of ListFile/FetchFile, you can bring this file from your local disk into NiFi and pass it on to the PutHDFS processor. The PutHDFS processor relies on the associated core-site.xml and hdfs-site.xml files in its configuration.
Just add the Hadoop configuration files to the first field (the configuration resources), e.g.
$HADOOP_HOME/conf/hadoop/hdfs-site.xml, $HADOOP_HOME/conf/hadoop/core-site.xml
then set the HDFS directory where the ingested data should be stored in the Directory field, and leave everything else at its defaults.

ElastiCache - What is the difference between the configuration and node endpoint?

ElastiCache gives you both a configuration endpoint and individual node endpoints.
What is really the difference between the two? And what is a use case where you'd use one versus the other?
I assume the configuration endpoint could point to a group of node endpoints, but I don't quite get it. A use case example would really help me understand when you'd want to use the two differently.
As per my understanding, a node endpoint is associated with a particular node present in the cluster, while the configuration endpoint is for cluster management. Clients connect to the configuration endpoint to get details about the nodes present in that cluster.
The configuration endpoint DNS entry contains the CNAME entries for each of the cache node endpoints; thus, by connecting to the configuration endpoint, your application immediately knows about all of the nodes in the cluster and can connect to all of them. You do not need to hard-code the individual cache node endpoints in your application.
For more information on Auto Discovery, see Node Auto Discovery (Memcached).
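Under the hood, Auto Discovery is just a special command that the configuration endpoint understands. Here is a minimal sketch of asking it for the node list (it assumes a Memcached cluster on engine 1.4.14 or later, and the endpoint name is a placeholder):

import socket

CONFIG_ENDPOINT = ("mycluster.abc123.cfg.use1.cache.amazonaws.com", 11211)

with socket.create_connection(CONFIG_ENDPOINT, timeout=5) as sock:
    # The configuration endpoint answers "config get cluster" with the
    # current list of node hostname|ip|port entries.
    sock.sendall(b"config get cluster\r\n")
    print(sock.recv(4096).decode())

A client library with Auto Discovery support does essentially this on a schedule and refreshes its connections whenever the node list changes.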
My understanding of the AWS docs on this topic is that the configuration endpoint is what you need if you have multiple nodes. It looks like you would plug the configuration endpoint URL into their cache client software, which you download from your ElastiCache AWS Management Console (it looks like it is only available for Java and PHP at the moment).
If you just have one node, then the node endpoint is the one you use with memcache, which in PHP looks like this:
$memcache = memcache_connect('yourECname.tvgtaa.0001.use1.cache.amazonaws.com', 11211);
http://www.php.net/manual/en/memcache.connect.php
P.S. Once you download the cache client, it has a link to installation directions, which seem pretty self-explanatory: http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/Appendix.PHPAutoDiscoverySetup.html