Where are the files stored in Hue in Amazon EMR?

If I go to the Hue link at http://ec2-****:8888/hue/home/ I can access the Hue dashboard and create and save files, etc. However, I'm not able to see those files while browsing the system over SSH. Where are these files stored on the system?

This is not how it works, Alex: you cannot see those files in your local filesystem.
Hue is giving you a view of the underlying Hadoop Distributed File System (HDFS).
The information in this filesystem is spread across several nodes in your Hadoop cluster.
If you need to find something in that filesystem, you cannot use the typical file-manipulation tools provided by the operating system; you have to use their Hadoop counterparts.
For your use case, Hadoop provides the hdfs dfs command, or equivalently, hadoop fs.
Let's say you want to find test1.sql in the Hadoop filesystem. Once you SSH into your node, you can issue the following command:
hadoop fs -ls -R / | grep test1.sql
Or:
hadoop fs -find / -name test1.sql
Please see the complete reference for the available options.
Once you have located the file with the previous commands, you can retrieve it to your local filesystem by issuing:
hadoop fs -get /path/to/test1.sql test1.sql
This operation can also be performed from the Hue File Browser.
In the specific case of Amazon EMR, this distributed filesystem can be backed by different storage systems: basically HDFS, for ephemeral workloads, and EMRFS, an implementation of the Hadoop filesystem on top of Amazon S3:
EMRFS is an implementation of the Hadoop file system used for reading and writing regular files from Amazon EMR directly to Amazon S3.
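Because the same Hadoop tooling resolves s3:// URIs on EMR, you can list EMRFS-backed data the same way you listed HDFS above. A minimal sketch from Python, assuming the hadoop client is on the node's PATH and using a placeholder bucket name:
import subprocess

# List the contents of an EMRFS (S3-backed) location with the same hadoop fs tool.
# "my-bucket" is a placeholder for your bucket name.
subprocess.run(["hadoop", "fs", "-ls", "s3://my-bucket/"], check=True)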

Related

Concat Avro files in Google Cloud Storage

I have some big .avro files in Google Cloud Storage and I want to concatenate all of them into a single file.
I found
java -jar avro-tools.jar concat
However, since my files are in the Google Storage path gs://files.avro, I can't concat them using avro-tools. Any suggestion on how to solve this?
You can use the gsutil compose command. For example:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
Note: For extremely large files and/or very low per-machine bandwidth, you may want to split the file and upload it from multiple machines, and later compose these parts of the file manually.
In my case I tested it with the following values: foo.txt contains the word Hello and bar.txt contains the word World. After running this command:
gsutil compose gs://bucket/foo.txt gs://bucket/bar.txt gs://bucket/baz.txt
baz.txt would contain:
Hello
World
Note: GCS does not support inter-bucket composing.
Just in case you encounter an exception error related to integrity checks, run gsutil help crcmod to get instructions on how to fix it.
Check out https://github.com/spotify/gcs-tools:
A lightweight wrapper that adds Google Cloud Storage (GCS) support to common Hadoop tools, including avro-tools, parquet-cli, proto-tools for Scio's Protobuf-in-Avro files, and magnolify-tools for Magnolify code generation, so that they can be used from regular workstations or laptops, outside of a Google Compute Engine (GCE) instance.

How to use EMRFS as a regular filesystem on Apache Spark over AWS EMR

I am using Amazon EMR and running Python code with spark-submit.
I would like to be able to access the EMRFS filesystem as an actual file system, meaning that I would like to list all the files using something like
import os
os.listdir()
and I would like to save files to local storage and have them persisted in the S3 bucket.
I found this, which is old, seems to work only with Java, and I had no luck using it from Python; I also found this, which is more general but also old and does not fit my needs.
How can I do that?
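One direction, as a rough sketch: skip the filesystem layer and talk to the bucket with the third-party s3fs package (it is not installed on EMR by default, and the bucket name and paths below are placeholders):
import s3fs  # pip install s3fs; picks up the EMR instance profile credentials

# "my-bucket" and the prefixes are placeholders for your own bucket layout.
fs = s3fs.S3FileSystem()
print(fs.ls("my-bucket/input/"))  # roughly the S3 analogue of os.listdir()

# copy a locally written file into the bucket so it persists beyond the cluster
fs.put("results.csv", "my-bucket/output/results.csv")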

Creating DCOS service with artifacts from hdfs

I'm trying to create DCOS services that download artifacts (custom config files, etc.) from HDFS. I was using a simple FTP server for this before, but I wanted to switch to HDFS. Using "hdfs://" in an artifact URI is allowed, but it doesn't work correctly.
The artifact fetch ends with an error because there's no "hadoop" command. Weird. I read that I need to provide my own Hadoop for it.
So I downloaded Hadoop and set up the necessary variables in /etc/profile. I can run "hadoop" without any problem when SSH'ing into the node, but the service still fails with the same error.
It seems that the environment variables configured in the service are applied only after the artifact fetch, because they don't help at all. Also, it looks like services completely ignore the /etc/profile file.
So my question is: how do I set up everything so my service can fetch artifacts stored on hdfs?
The Mesos fetcher supports local Hadoop clients; please check your agent configuration, and in particular your --hadoop_home setting.
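For context, a sketch of the service-definition side, assuming the service is created through Marathon; the Marathon address, NameNode host, and paths are placeholders, and the fetch only succeeds once the agent can find a Hadoop client via --hadoop_home or its PATH:
import requests  # all values below are placeholders

app = {
    "id": "/my-service",
    "cmd": "./start.sh",
    "cpus": 0.5,
    "mem": 256,
    # The Mesos fetcher downloads these before the task starts; the hdfs:// URI
    # is only resolvable if the agent can locate a Hadoop client.
    "fetch": [{"uri": "hdfs://namenode.example.com:8020/configs/app.conf"}],
}
requests.post("http://marathon.mesos:8080/v2/apps", json=app)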

How data gets into the HDFS file system

I am trying to understand how data from multiple sources and systems gets into HDFS. I want to push web server log files from 30+ systems. These logs are sitting on 18 different servers.
Thx
Veer
You can create a map-reduce job. The input for your mapper would be a file sitting on a server, and your reducer would determine which path to put the file at in HDFS. You can either aggregate all of your files in your reducer, or simply write the file as-is at the given path.
You can use Oozie to schedule the job, or you can run it sporadically by submitting the map-reduce job on the server that hosts the JobTracker service.
You could also create a Java application that uses the HDFS API. The FileSystem object can be used to perform standard file system operations, like writing a file to a given path.
Either way, you need to request the creation through the HDFS API, because the NameNode is responsible for splitting the file into blocks and writing them to the distributed servers.
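If Java is not a hard requirement, a rough Python equivalent of the FileSystem approach can be sketched with pyarrow, assuming libhdfs is available locally and using a placeholder NameNode address:
from pyarrow import fs  # needs pyarrow plus a local Hadoop client / libhdfs

# "namenode.example.com" is a placeholder; the NameNode still handles splitting
# the file into blocks and placing the replicas.
hdfs = fs.HadoopFileSystem("namenode.example.com", port=8020)
with open("access.log", "rb") as local, hdfs.open_output_stream("/logs/webserver01/access.log") as out:
    out.write(local.read())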

How to put an input file into HDFS automatically?

In Hadoop we always put the input file manually through the -put command. Is there any way we can automate this process?
There is no built-in automated process for putting a file into the Hadoop filesystem. However, it is possible to -put or -get multiple files with one command.
Here is the reference for the Hadoop shell commands:
http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html
I am not sure how many files you are dropping into HDFS, but one solution for watching for files and then dropping them in is Apache Flume. These slides provide a decent intro.
You can think about automating this process with the Fabric library and Python. Write the hdfs put command in a function and you can call it for multiple files and perform the same operations on multiple hosts in the network. Fabric should be really helpful for automation in your scenario.
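A minimal sketch of that idea with Fabric 2.x; the host names, file names, and HDFS paths are placeholders:
from fabric import Connection  # pip install fabric (2.x API)

# placeholders: the hosts holding the logs and the HDFS destination directory
hosts = ["edge-node-1.example.com", "edge-node-2.example.com"]

def put_into_hdfs(host, local_path, hdfs_dir):
    c = Connection(host)
    c.put(local_path)                                    # copy the file to the remote host
    c.run(f"hdfs dfs -put -f {local_path} {hdfs_dir}")   # then push it into HDFS

for h in hosts:
    put_into_hdfs(h, "access.log", "/data/logs/")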