Using AWS S3 and Apache Spark with HDF5/netCDF-4 data

I've got a bunch of atmospheric data stored in AWS S3 that I want to analyze with Apache Spark, but I'm having a lot of trouble getting it loaded into an RDD. I've been able to find examples online that help with discrete aspects of the problem:
- using h5py to read locally stored scientific data files via h5py.File(filename) (https://hdfgroup.org/wp/2015/03/from-hdf5-datasets-to-apache-spark-rdds/)
- using boto/boto3 to get text-format data from S3 into Spark via get_contents_as_string()
- mapping a set of text files to an RDD via keys.flatMap(mapFunc)
But I can't seem to get these parts to work together. Specifically: how do you load a netCDF file from S3 (using boto or otherwise; I'm not attached to boto) so that you can then use h5py? Or can you treat the netCDF file as a binary file, load it with sc.binaryFiles(path), and map it to an RDD?
Here are a couple of similar related questions that weren't fully answered:
How to read binary file on S3 using boto?
using pyspark, read/write 2D images on hadoop file system

Using the netCDF4 and s3fs modules, you can do:
from netCDF4 import Dataset
import s3fs

s3 = s3fs.S3FileSystem()

filename = 's3://bucket/a_file.nc'
with s3.open(filename, 'rb') as f:
    nc_bytes = f.read()  # pull the whole object into memory

root = Dataset('inmemory.nc', memory=nc_bytes)  # parse the bytes without touching disk
Make sure you are set up to read from S3 (credentials and region). For details, see the s3fs documentation.
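To go from a single in-memory read to an RDD, here is a minimal sketch (my own addition, not part of the answer above) combining this with sc.binaryFiles, which the question mentions. It assumes netCDF4 is installed on the workers; the bucket path and the variable name 'temperature' are placeholders.
from netCDF4 import Dataset

def extract_variable(path_and_bytes):
    path, nc_bytes = path_and_bytes
    root = Dataset('inmemory.nc', memory=nc_bytes)  # parse the raw bytes in memory
    try:
        # 'temperature' is a hypothetical variable name; pull out whatever you need
        return (path, root.variables['temperature'][:])
    finally:
        root.close()

# sc.binaryFiles yields (path, contents-as-bytes) pairs for each object under the prefix
rdd = sc.binaryFiles('s3://bucket/atmospheric-data/*.nc').map(extract_variable)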

Related

process non csv, json and parquet files from s3 using glue

A little disclaimer: I have never used Glue.
I have files stored in S3 that I want to process using Glue, but from what I saw when I tried to start a new job from a plain graph, the only options I got were the CSV, JSON, and Parquet file formats from S3, and my files are not of these types. Is there any way to process those files using Glue, or do I need to use another AWS service?
I can run a bash command to turn those files into JSON, but that command is something I need to download to a machine. Is there any way I can do that and then use Glue on the resulting JSON?
Thanks.

Connecting Power BI to S3 Bucket

Need some guidance as I am new to Power BI and Redshift.
My raw JSON data is stored in an Amazon S3 bucket in the form of .gz files (each .gz file has multiple rows of JSON data).
I wanted to connect Power BI to the Amazon S3 bucket. Based on my research so far, I see three ways:
1. Amazon S3 is a web service and supports a REST API, so we can try to use a web data source to get the data.
Question: Is it possible to unzip the .gz files (inside the S3 bucket or inside Power BI), extract the JSON data from S3, and connect it to Power BI?
2. Import the data from Amazon S3 into Amazon Redshift. Do all data manipulation inside Redshift using SQL Workbench, then use the Amazon Redshift connector to get the data into Power BI.
Question 1: Does Redshift allow loading gzipped JSON data from the S3 bucket? If yes, is it possible directly, or do I have to write code for it?
Question 2: I have the S3 account; do I have to separately purchase a Redshift account/space? What is the cost?
3. Move the data from the AWS S3 bucket to Azure Data Lake Store via Azure Data Factory, transform it with Azure Data Lake Analytics (U-SQL), and then output it to Power BI.
U-SQL recognizes GZip-compressed files with the .gz extension and automatically decompresses them as part of the extraction process. Is this process valid if my gzipped files contain JSON data rows?
Please let me know if there is any other method; your suggestions on this post are also welcome.
Thanks in advance.
About your first question: I recently faced a similar issue (extracting a CSV rather than JSON), and I would like to share my solution.
Power BI still doesn't have a direct plugin to read from S3 buckets, but you can do it with a Python script:
Get Data --> Python script
P.S.: Make sure that the boto3 and pandas libraries are installed in the same folder (or subfolders) as the Python home directory you specified in the Power BI options,
or in the Anaconda library folder (c:\users\USERNAME\anaconda3\lib\site-packages).
(Screenshot: Power BI options window for Python scripting.)
import boto3
import pandas as pd

bucket_name = 'your_bucket'
folder_name = 'the folder inside your bucket/'
file_name = 'file_name.csv'  # or .json in your case
key = folder_name + file_name

s3 = boto3.resource(
    service_name='s3',
    region_name='your_bucket_region',  # ex: 'us-east-2'
    aws_access_key_id=AWS_ACCESS_KEY_ID,          # your access key id
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY   # your secret access key
)

obj = s3.Bucket(bucket_name).Object(key).get()
df = pd.read_csv(obj['Body'])  # or pd.read_json(obj['Body']) in your case
The dataframe will be imported as a new query (named "df" in this example).
Apparently the pandas library can also read a zipped file (.gz, for example). See the following topic: How can I read tar.gz file using pandas read_csv with gzip compression option?
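For the gzipped JSON files described in the question, a hedged sketch of that idea (my assumption, not something tested against your data) could look like the following; the bucket, key, and one-JSON-object-per-line layout are placeholders.
import gzip
import io

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='your_bucket', Key='folder/file_name.json.gz')

# decompress the object in memory, then let pandas parse one JSON row per line
with gzip.GzipFile(fileobj=io.BytesIO(obj['Body'].read())) as gz:
    df = pd.read_json(gz, lines=True)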

Decompress a zip file in AWS Glue

I have a compressed gzip file in an S3 bucket. The files will be uploaded to the S3 bucket daily by the client. When uncompressed, the gzip file will contain 10 files in CSV format, all with the same schema. I need to uncompress the gzip file and, using Glue's data crawler, create a schema before running an ETL script on a dev endpoint.
Is Glue capable of decompressing the file and creating a data catalog? Is there any Glue library available that we can use directly in the Python ETL script? Or should I opt for a Lambda or some other utility, so that as soon as the file is uploaded, it is decompressed and provided as input to Glue?
Appreciate any replies.
Glue can do the decompression, but it wouldn't be optimal, because the gzip format is not splittable (which means only one executor will work with it). More info about that here.
You can instead decompress with a Lambda function and then invoke a Glue crawler on the new folder.
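A minimal sketch of that Lambda approach (my own addition, not from the answer): it is triggered by the S3 upload, decompresses the .gz object in memory, and writes the plain CSV back under a prefix the crawler targets. The 'uncompressed/' prefix and the crawler name are placeholders, and large files would need a streaming variant rather than buffering everything in memory.
import gzip
import io
import urllib.parse

import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # read and decompress the uploaded .gz object
        compressed = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
            data = gz.read()

        # write the decompressed file under a prefix that the crawler points at
        target_key = 'uncompressed/' + key.rsplit('.gz', 1)[0]
        s3.put_object(Bucket=bucket, Key=target_key, Body=data)

    glue.start_crawler(Name='my-crawler')  # crawler name is a placeholder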
Use glueContext.create_dynamic_frame.from_options and specify the compression type in the connection options. Output can similarly be compressed while writing to S3. The snippet below worked for bzip; change the compression to gzip for .gz files and try it.
I tried the target location in the Glue console UI and found that bzip and gzip are supported when writing dynamic frames to S3, and I changed the generated code to read a compressed file from S3. This is not directly covered in the docs.
I am not sure about the efficiency. It took around 180 seconds of execution time to read, apply a Map transform, convert to a dataframe, and convert back to a dynamic frame for a 400 MB compressed CSV file in bzip format. Note that execution time is different from the start_time and end_time shown in the console.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        'paths': ['s3://bucketname/folder/filename_20180218_004625.bz2'],
        'compression': 'bzip'
    },
    'csv',
    {
        'separator': ';'
    }
)
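For a gzip-compressed file, the equivalent call (my untested variant of the snippet above; the path is a placeholder) only swaps the compression value:
datasource_gz = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        'paths': ['s3://bucketname/folder/filename.csv.gz'],
        'compression': 'gzip'
    },
    'csv',
    {
        'separator': ';'
    }
)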
I've written a Glue job that can unzip S3 files and put them back in S3.
Take a look at https://stackoverflow.com/a/74657489/17369563

Spark streaming job using custom jar on AWS EMR fails upon write

I am trying to convert a file (in csv.gz format) into Parquet using a streaming data frame. I have to use streaming data frames because the compressed files are ~700 MB in size. The job is run using a custom jar on AWS EMR. The source, destination, and checkpoint locations are all on AWS S3. But as soon as I try to write to the checkpoint location, the job fails with the following error:
java.lang.IllegalArgumentException:
Wrong FS: s3://my-bucket-name/transformData/checkpoints/sourceName/fileType/metadata,
expected: hdfs://ip-<ip_address>.us-west-2.compute.internal:8020
There are other Spark jobs running on the EMR cluster that read from and write to S3 successfully (but they are not using Spark streaming), so I do not think it is an issue with S3 file system access as suggested in this post. I also looked at this question, but the answers do not help in my case. I am using Scala 2.11.8 and Spark 2.1.0.
Following is the code I have so far
...
val spark = conf match {
  case null =>
    SparkSession
      .builder()
      .appName(this.getClass.toString)
      .getOrCreate()
  case _ =>
    SparkSession
      .builder()
      .config(conf)
      .getOrCreate()
}

// Read CSV files into a structured streaming dataframe
val streamingDF = spark.readStream
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("timestampFormat", "dd-MMM-yyyy HH:mm:ss")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", "")
  .schema(schema)
  .load(s"s3://my-bucket-name/rawData/sourceName/fileType/*/*/fileNamePrefix*")
  .withColumn("event_date", $"event_datetime".cast("date"))
  .withColumn("event_year", year($"event_date"))
  .withColumn("event_month", month($"event_date"))

// Write the results to Parquet
streamingDF.writeStream
  .format("parquet")
  .option("path", "s3://my-bucket-name/transformedData/sourceName/fileType/")
  .option("compression", "gzip")
  .option("checkpointLocation", "s3://my-bucket-name/transformedData/checkpoints/sourceName/fileType/")
  .partitionBy("event_year", "event_month")
  .trigger(ProcessingTime("900 seconds"))
  .start()
I have also tried to use s3n:// instead of s3:// in the URI but that does not seem to have any effect.
TL;DR: Upgrade Spark or avoid using S3 as the checkpoint location.
Apache Spark (Structured Streaming) : S3 Checkpoint support
Also, you should probably specify the write path with the s3a:// scheme:
A successor to the S3 Native, s3n:// filesystem, the S3a: system uses Amazon's libraries to interact with S3. This allows S3a to support larger files (no more 5GB limit), higher performance operations and more. The filesystem is intended to be a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL schema.
https://wiki.apache.org/hadoop/AmazonS3
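A hedged sketch of that advice in PySpark (the question's code is Scala; streaming_df and the paths below are placeholders): keep the output on s3a:// and the checkpoint on the cluster's HDFS.
query = streaming_df.writeStream \
    .format("parquet") \
    .option("path", "s3a://my-bucket-name/transformedData/sourceName/fileType/") \
    .option("checkpointLocation", "hdfs:///user/hadoop/checkpoints/sourceName/fileType/") \
    .partitionBy("event_year", "event_month") \
    .start()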

how to write 4K size file on remote server in python

I am trying to write 50 files of 4 KB each on an EC2 instance with S3 mounted on it.
How can I do this in Python?
I am not sure how to proceed with this.
If you have the S3 bucket mounted via FUSE or some other method that exposes the S3 object space as a pseudo file system, then you write files just like anything else in Python:
with open('/path/to/s3/mount/file_name', 'wb') as dafile:
    dafile.write(b'contents')
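A minimal sketch of the fifty 4 KB files from the question, written through such a mount (my own addition; the mount path and file names are placeholders):
import os

mount_dir = '/path/to/s3/mount'  # wherever the bucket is mounted
for i in range(50):
    with open(os.path.join(mount_dir, 'file_{:02d}.bin'.format(i)), 'wb') as f:
        f.write(os.urandom(4096))  # 4 KB of example content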
If you are trying to put objects into S3 from an EC2 instance, then you will want to follow the boto documentation on how to do this.
To start you off:
Create an /etc/boto.cfg or ~/.boto file as the boto howto says, then:
from boto.s3.connection import S3Connection

conn = S3Connection()
# if you want, you can: conn = S3Connection('key_id_here', 'secret_here')
bucket = conn.get_bucket('your_bucket_to_store_files')
for file in fifty_file_names:
    # upload each local file under its own key
    bucket.new_key(file).set_contents_from_filename('/local/path/to/{}'.format(file))
This assumes the files are fairly small, like the 4 KB files you mentioned. Larger files may need to be split or chunked.