I am using Apache Spark on an Amazon Web Services (AWS) EC2 cluster to load and process data. I've created one master and two slave nodes. On the master node, I have a directory data containing all the data files, in CSV format, to be processed.
Now, before we submit the driver program (my Python code) to run, we need to copy the data directory data from the master to all the slave nodes. My understanding is that each slave node needs to know the data file location in its own local file system so it can load the data files. For example,
from pyspark import SparkConf, SparkContext
### Initialize the SparkContext
conf = SparkConf().setAppName("ruofan").setMaster("local")
sc = SparkContext(conf = conf)
### Create a RDD containing metadata about files in directory "data"
datafile = sc.wholeTextFiles("/root/data") ### Read data directory
### Collect files from the RDD
datafile.collect()
When each slave node runs its task, it loads the data files from its local file system.
However, before submitting the application to run, we also have to put the directory data into the Hadoop Distributed File System (HDFS) using $ ./ephemeral-hdfs/bin/hadoop fs -put /root/data/ ~.
Now I am confused about this process. Does each slave node load the data files from its own local file system or from HDFS? If it loads data from the local file system, why do we need to put the data into HDFS? I would appreciate it if anyone could help.
Just to clarify for others that may come across this post.
I believe your confusion is due to not providing a protocol in the file location. When you run the following lines:
### Create a RDD containing metadata about files in directory "data"
datafile = sc.wholeTextFiles("/root/data") ### Read data directory
Spark assumes the file path /root/data is in HDFS. In other words, it looks for the files at hdfs:///root/data.
You only need the files in one location: either locally on every node (not the most efficient in terms of storage) or in HDFS, which is distributed across the nodes.
If you wish to read files from the local file system, use file:///path/to/local/file. If you wish to use HDFS, use hdfs:///path/to/hdfs/file.
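For example, a minimal sketch showing both forms (the app name and paths here are just placeholders):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("protocol-example")
sc = SparkContext(conf=conf)

### Read from each node's local file system (the directory must exist on every node)
local_rdd = sc.wholeTextFiles("file:///root/data")

### Read the same directory from HDFS instead
hdfs_rdd = sc.wholeTextFiles("hdfs:///root/data")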
Hope this helps.
One quick suggestion is to load the CSV files from S3 instead of keeping them on local disk.
Here is a sample Scala snippet that can be used to load CSV data from an S3 bucket:
val csvs3Path = "s3n://REPLACE_WITH_YOUR_ACCESS_KEY:REPLACE_WITH_YOUR_SECRET_KEY@REPLACE_WITH_YOUR_S3_BUCKET"
val dataframe = sqlContext.
  read.
  format("com.databricks.spark.csv").
  option("header", "true").
  load(csvs3Path)
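A roughly equivalent PySpark sketch, assuming the spark-csv package is on the classpath and an existing sqlContext (the bucket and credentials remain placeholders):
# Hypothetical PySpark equivalent of the Scala snippet above
csv_s3_path = "s3n://REPLACE_WITH_YOUR_ACCESS_KEY:REPLACE_WITH_YOUR_SECRET_KEY@REPLACE_WITH_YOUR_S3_BUCKET"
dataframe = (sqlContext.read
             .format("com.databricks.spark.csv")
             .option("header", "true")
             .load(csv_s3_path))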
I'm working on a POC to extract data from an API and load new/updated records into an AVRO file stored in GCS. I also want to delete records that come with a deleted flag from the AVRO file.
What would be a feasible approach to implement this using Dataflow, and are there any resources I can refer to for it?
You can't update a file in GCS. You can only READ, WRITE, and DELETE. If you have to change one byte in the file, you need to download the file, make the change, and upload it again.
You can keep versions in GCS, but each blob is unique and cannot be changed.
Anyway, you can do that with Dataflow, but keep in mind that you need two inputs:
- the data to update
- the file stored in GCS (which you also have to read and process with Dataflow)
At the end, you need to write a new file to GCS with the data held in Dataflow.
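As a minimal sketch of that read-merge-rewrite pattern with the Beam Python SDK (the GCS paths, the schema, the "id" key, and the "deleted" flag below are all assumptions for illustration):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical paths and schema
EXISTING_AVRO = "gs://my-bucket/current/records-*.avro"
UPDATES_AVRO = "gs://my-bucket/updates/records-*.avro"
OUTPUT_PREFIX = "gs://my-bucket/next/records"
SCHEMA = {
    "type": "record",
    "name": "Record",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "value", "type": "string"},
        {"name": "deleted", "type": "boolean"},
    ],
}

def merge(element):
    key, grouped = element
    updates = list(grouped["updates"])
    existing = list(grouped["existing"])
    # Prefer the update when present; drop records flagged as deleted
    record = updates[-1] if updates else existing[-1]
    if not record.get("deleted", False):
        yield record

with beam.Pipeline(options=PipelineOptions()) as p:
    existing = (p | "ReadExisting" >> beam.io.ReadFromAvro(EXISTING_AVRO)
                  | "KeyExisting" >> beam.Map(lambda r: (r["id"], r)))
    updates = (p | "ReadUpdates" >> beam.io.ReadFromAvro(UPDATES_AVRO)
                 | "KeyUpdates" >> beam.Map(lambda r: (r["id"], r)))

    merged = ({"existing": existing, "updates": updates}
              | "Join" >> beam.CoGroupByKey()
              | "Merge" >> beam.FlatMap(merge))

    # GCS objects are immutable, so a brand-new set of files is always written
    merged | "Write" >> beam.io.WriteToAvro(OUTPUT_PREFIX, schema=SCHEMA,
                                            file_name_suffix=".avro")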
I have ~200,000 S3 files that I need to partition, and I have made an Athena query to produce a target S3 key for each of the original S3 keys. I can clearly create a script out of this, but how do I make the process robust and reliable?
I need to partition the CSV files using info inside each CSV so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file.
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
And I can make a single big script to copy everything via Athena and a bit of SQL, but I am concerned about doing this reliably, so that I can be sure every file is copied across and the script doesn't fail, time out, etc. Should I "just run the script"? From my machine, or is it better to put it on an EC2 instance first? Those are the kinds of questions I have.
This is a one-off, as the application code producing the files in s3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the content of the files does not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
Since it is one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that loops through each object, 'peeks' inside the object to see where it belongs, then issues a copy_object() command to send it to the right destination. (smart-open · PyPI is a great library for reading from an S3 object without having to download it first.)
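As a rough sketch of that approach, assuming the partition values can be read from the first data row of each CSV (the bucket, prefix, and column names below are hypothetical):
import csv

import boto3
from smart_open import open as s3_open  # read S3 objects without downloading them first

s3 = boto3.client("s3")
bucket = "bucket"            # hypothetical bucket name
top_prefix = "top_prefix/"   # hypothetical source prefix

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=top_prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".csv") or "var1=" in key:
            continue  # skip non-CSV objects and files already copied into a partition

        # 'Peek' at the first data row to work out the destination partition
        with s3_open(f"s3://{bucket}/{key}") as f:
            row = next(csv.DictReader(f))
        var1, var2 = row["var1"], row["var2"]  # assumed column names

        filename = key.rsplit("/", 1)[-1]
        dest_key = f"{top_prefix}var1={var1}/var2={var2}/{filename}"
        s3.copy_object(CopySource={"Bucket": bucket, "Key": key},
                       Bucket=bucket, Key=dest_key)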
I have a NiFi flow which fetches data from RDS tables and loads it into S3 as flat files. Now I need to generate another file containing the name of the file that I am loading into the S3 bucket, and this needs to be a separate flow.
Example: if the RDS-extracted flat file name is RDS.txt, then the newly generated file should have rds.txt as its content, and I need to load this file to the same S3 bucket.
The problem I face is that I am using a GenerateFlowFile processor and adding the flat file name as custom text in the flowfile, but I cannot set up any upstream connection for the GenerateFlowFile processor, so it keeps generating more files. If I use a MergeContent processor after the GenerateFlowFile processor, I see duplicate values in the flowfile.
Can anyone help me out with this?
The easiest path to do this is to chain something after PutS3Object that will update the flowfile contents with what you want. It would be really simple to write with ExecuteScript. Something like this:
import org.apache.nifi.processor.io.OutputStreamCallback

def ff = session.get()
if (ff) {
    // Overwrite the flowfile content with its own "filename" attribute
    def updated = session.write(ff, {
        it.write(ff.getAttribute("filename").bytes)
    } as OutputStreamCallback)
    // Mark the flowfile so it can be routed differently downstream
    updated = session.putAttribute(updated, "is_updated", "true")
    session.transfer(updated, REL_SUCCESS)
}
Then you can put a RouteOnAttribute after PutS3Object and have it route either to a null route if it detects the is_updated attribute, or back to PutS3Object if the flowfile has not been updated.
I found a simple solution for this: I added a funnel before PutS3Object. The funnel's upstream receives two files, one with the extract and the other with the file name, and its downstream is connected to PutS3Object, so both files are loaded at the same time.
I've got a bunch of atmospheric data stored in AWS S3 that I want to analyze with Apache Spark, but I am having a lot of trouble getting it loaded into an RDD. I've been able to find examples online to help with discrete aspects of the problem:
- using h5py to read locally stored scientific data files via h5py.File(filename) (https://hdfgroup.org/wp/2015/03/from-hdf5-datasets-to-apache-spark-rdds/)
- using boto/boto3 to get data in text-file format from S3 into Spark via get_contents_as_string()
- mapping a set of text files to an RDD via keys.flatMap(mapFunc)
But I can't seem to get these parts to work together. Specifically, how do you load a netCDF file from S3 (using boto or directly; I'm not attached to using boto) in order to then use h5py? Or can you treat the netCDF file as a binary file, load it with sc.binaryFiles(), and map it to an RDD?
Here are a couple of similar questions that weren't answered fully and are related:
How to read binary file on S3 using boto?
using pyspark, read/write 2D images on hadoop file system
Using the netCDF4 and s3fs modules, you can do:
from netCDF4 import Dataset
import s3fs

s3 = s3fs.S3FileSystem()

# Read the raw bytes of the netCDF file from S3
filename = 's3://bucket/a_file.nc'
with s3.open(filename, 'rb') as f:
    nc_bytes = f.read()

# Parse the bytes entirely in memory; no temporary file is written
root = Dataset('inmemory.nc', memory=nc_bytes)
Make sure you are set up to read from S3; for details, see the s3fs documentation.
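To process many files in parallel, one possible sketch combines sc.binaryFiles with the same in-memory trick; the S3 path and the variable name below are assumptions, and reading s3a:// paths requires the hadoop-aws connector to be configured on the cluster:
# Hypothetical sketch: read many .nc files as (path, bytes) pairs and parse them on the executors
from netCDF4 import Dataset

def parse_nc(path_and_bytes):
    path, nc_bytes = path_and_bytes
    root = Dataset("inmemory.nc", memory=nc_bytes)
    try:
        # "temperature" is an assumed variable name; replace with one from your files
        return (path, root.variables["temperature"][:])
    finally:
        root.close()

rdd = sc.binaryFiles("s3a://bucket/atmospheric-data/*.nc")
parsed = rdd.map(parse_nc)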
I have a file in AWS S3 that is updated every second (it is actually collecting new data). I want to move the collected file to my local server periodically. Here are a few things I am considering.
The transfer needs to be compressed somehow to reduce the network burden, since the S3 cost is based on the network load.
After moving the data out of AWS S3, the data on S3 needs to be deleted. Put another way, the sum of the data on my server and the data on AWS should be the complete dataset, and there should be no intersection between these two datasets. Otherwise, the next time we move data, there will be duplicates in the dataset on my server.
The dataset on S3 is growing all the time, and the new data is appended to the file through output redirection. There is a cron job running to collect the data.
Here is a pseudo code that shows the idea of how the file has been built on S3.
* * * * * nohup python collectData.py >> data.txt
This requires that the data transfer does not break the pipeline; otherwise, the new data will be lost.
One option is to mount the S3 bucket as a local directory (for example, using the RioFS project) and use standard shell tools (like rm, cp, mv, ...) to remove the old file and upload a new file to Amazon S3.