How to access an AWS public dataset using Databricks?

For one of my classes, I have to analyze a "big data" dataset. I found the following dataset on the AWS Registry of Open Data that seems interesting:
https://registry.opendata.aws/openaq/
How exactly can I create a connection and load this dataset into Databricks? I've tried the following:
df = spark.read.format("text").load("s3://openaq-fetches/")
However, I receive the following error:
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
Also, it seems that this dataset has multiple folders. How do I access a particular folder in Databricks, and if possible, can I focus on a particular time range? Let's say, from 2016 to 2020?
Ultimately, I would like to perform various SQL queries in order to analyze the dataset and perhaps create some visualizations as well. Thank you in advance.

If you browse the bucket, you'll see that it contains multiple datasets in different formats, which will require different access methods. So you need to point to the specific folder (and maybe a subfolder) to load data. For example, to load the daily dataset you need to use the CSV format:
df = spark.read.format("csv").option("inferSchema", "true")\
.option("header", "false").load("s3://openaq-fetches/daily/")
To load only a subset of the data, you can use path filters (for example, a glob pattern in the load path). See the Spark documentation on loading data.
P.S. inferSchema is not optimal from a performance standpoint, since it requires an extra pass over the data, so it's better to provide the schema explicitly when reading.
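For example, here is a minimal sketch of reading with an explicit schema and restricting the load to 2016-2020 with a glob pattern. The column names/types and the date-based file naming are assumptions, so inspect the bucket and a sample file first:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema -- inspect a sample file and adjust column names/types.
schema = StructType([
    StructField("location", StringType(), True),
    StructField("value", DoubleType(), True),
    StructField("unit", StringType(), True),
    StructField("parameter", StringType(), True),
    StructField("country", StringType(), True),
    StructField("city", StringType(), True),
    StructField("local_time", StringType(), True),  # cast to timestamp later if needed
])

# The glob restricts the read to files whose names start with 2016-2020.
# This assumes the daily files are named by date (e.g. 2016-01-01.csv);
# verify the layout first with dbutils.fs.ls("s3://openaq-fetches/daily/").
df = (spark.read
      .format("csv")
      .option("header", "false")
      .schema(schema)
      .load("s3://openaq-fetches/daily/{2016,2017,2018,2019,2020}*"))

df.createOrReplaceTempView("openaq_daily")  # now you can query it with spark.sql(...)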

Related

AWS Glue: Add An Attribute to CSV Distinguish Between Data Sets

I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data being distinguished with attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL. Would someone mind explaining how to add an attribute to the data, or pointing me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running (which automatically downloads the data from another site and then uploads it to S3) could be an option, deleting NaN columns and adding a column for the site name, but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I generally find this website pretty helpful for figuring out SQL. I've linked to the ALTER TABLE commands that would let you do this through SQL.
If you are already running a Python script to produce the .csv, then personally I would edit the data there. Depending on the size of the datasets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to S3. Then you can run your Glue crawler (or whatever process you're using) to map the columns.
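A minimal sketch of that pandas step; the file paths, site name, and column handling are placeholders for your own:

import pandas as pd

def prepare_csv(input_path, output_path, site_name):
    """Drop all-NaN columns and tag each row with the site it came from."""
    df = pd.read_csv(input_path)
    df = df.dropna(axis="columns", how="all")  # delete columns that are entirely NaN
    df["site_name"] = site_name                # attribute used to distinguish the datasets
    df.to_csv(output_path, index=False)

# Example usage before uploading to S3 (e.g. with boto3's upload_file):
prepare_csv("company_a_site_1.csv", "company_a_site_1_clean.csv", "company_a_site_1")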

Building Google Cloud Platform Data Catalog on unstructured data

I have unstructured data in the form of document images. We are converting these documents to JSON files. I now want to have technical metadata captured for this. Can someone please give me some tips/best practices for building a data catalog on unstructured data in Google Cloud Platform?
This answer assumes that you are not using any tool, such as BigQuery, Hive, or Presto, to create schemas around your unstructured data and query it, and that you simply want to catalog your files.
I had a similar use case; Google Data Catalog has an option to create custom entries.
Some tips on building a Data Catalog on unstructured file data:
1. Use meaningful file names for your JSON files; that way, searching for them becomes easier.
2. Since you are already using GCP, use their managed Data Catalog and leverage its custom entries API to ingest the file metadata into it.
3. In case you also want to look for sensitive data in your JSON files, you could run DLP on them.
4. Use Data Catalog Tags to enrich the file metadata. The linked tutorial shows how to do it on BigQuery tables, but you can do the same on custom entries.
5. I would add information about the ETL jobs that convert these documents into JSON files as Tags: execution time, data quality score, user, business owner, etc.
In case you are wondering how to do step 2, I put together a script that automatically does that (link to the GitHub repo). Another option is to work with Data Catalog Filesets.
So, between custom entries and filesets, I'd ask you this: do you need information about your file names?
If not, then filesets might be easier, since at the time of this writing they do not show any information about your file names, but they are good for managing file patterns in GCS buckets: a fileset is defined by one or more file patterns that specify a set of one or more Cloud Storage files.
The datacatalog-util tool also has an option to enrich your filesets, in case you just want statistics about them, like average file size, file types, etc.
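For reference, here is a rough sketch of step 2 using the google-cloud-datacatalog Python client; the project, location, entry-group id, GCS path, and field values are placeholders you would replace with your own metadata:

from google.cloud import datacatalog_v1

datacatalog = datacatalog_v1.DataCatalogClient()

project_id = "my-project"   # placeholder
location = "us-central1"    # placeholder
parent = datacatalog.common_location_path(project_id, location)

# Entry group that will hold the custom entries for the JSON files.
entry_group = datacatalog.create_entry_group(
    parent=parent,
    entry_group_id="json_documents",
    entry_group=datacatalog_v1.EntryGroup(display_name="Converted JSON documents"),
)

# One custom entry per file (or per logical document set).
entry = datacatalog_v1.Entry()
entry.display_name = "invoice_2021_0001.json"      # meaningful file name
entry.user_specified_system = "gcs_json_pipeline"  # your own system label
entry.user_specified_type = "json_document"        # your own type label
entry.linked_resource = "//storage.googleapis.com/my-bucket/invoices/invoice_2021_0001.json"
entry.description = "JSON produced from a scanned invoice image."

datacatalog.create_entry(
    parent=entry_group.name,
    entry_id="invoice_2021_0001",
    entry=entry,
)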

Use AWS Athena With Dynamic Fields / Schemaless

We want to use AWS Athena for analytics and segmentation. Our problem is that our data is schemaless: rows differ, with only some columns in common.
Is it possible to create table without defining all the columns?
When we query we know the type (string/int) of each column so if there is a way to define on the query it will be great.
We can structure the data in any way needed to support schemaless access, and in any format: CSV or JSON.
Is Athena an option for schemaless uses?
There are many ways to use Athena for schemaless use cases, but you need to give specific examples of the scenarios you want to support. In Athena you pay based on the data that you scan, so optimizing your data to minimize the scan is critical to making it a useful tool at scale.
The simplest way to get started, while you are learning the tool and the types of queries you can run on your data, is to define a table with a single column ("line") and then parse out the data you want using string functions, or JSON functions if the lines are in JSON format.
You will get good query times if you have multiple files, but it will be expensive, as you need to scan all your data for every query. I suggest you start with these queries as a good way to define your requirements. As usage grows, start optimizing the popular (and expensive) use cases with CTAS (Create Table As Select) commands that generate Parquet versions of the original raw data.
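If it helps, here is a rough sketch of the single-column approach submitted through boto3; the region, database, buckets, and JSON field names are placeholders, and it assumes one JSON document per line:

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

def run(sql):
    """Submit a query and return its execution id (poll get_query_execution for status)."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics"},                    # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
    )
    return response["QueryExecutionId"]

# One raw line per row; the default field delimiter (\001) should not appear in JSON,
# so each whole line lands in the single "line" column.
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (line string)
ROW FORMAT DELIMITED LINES TERMINATED BY '\\n'
LOCATION 's3://my-raw-data/events/'
""")

# Parse fields at query time with JSON functions; the JSON paths are hypothetical.
run("""
SELECT json_extract_scalar(line, '$.user_id') AS user_id,
       json_extract_scalar(line, '$.event')   AS event
FROM raw_events
WHERE json_extract_scalar(line, '$.country') = 'US'
""")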
You are welcome to read my blog post describing the strategy and tactics of a cloud environment using Athena and the other AWS tools around it.

What big data tools or approach should be used?

I have a central data store in AWS. I want to access multiple tables in that database and find patterns and make predictions on those collections of data.
My tables contain transactional data such as call details, marketing campaign details, contact information of people, etc.
How do I integrate all this data for a big data analysis, find the relationships, and store the data efficiently?
I am unsure whether to use Hadoop or not, and which architecture would be best.
The easiest way for you to start is to export the tables you wish to analyze into CSV files and process them using Amazon Machine Learning.
The following guide describes the entire process:
http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
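A minimal sketch of the export step, assuming the tables live in a PostgreSQL-compatible store reachable through SQLAlchemy; the connection string and table names are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; point it at your central data store.
engine = create_engine("postgresql://user:password@my-host:5432/mydb")

# Export each table you want to analyze to its own CSV file.
for table in ["call_details", "marketing_campaigns", "contacts"]:  # hypothetical names
    df = pd.read_sql_table(table, engine)
    df.to_csv(f"{table}.csv", index=False)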

Unload status in Redshift

When you load data to your Amazon Redshift tables, you can check the load status using the table STV_LOAD_STATE.
I would like to know if there's a way to achieve the same, but with the unload operation. In other words, I'd like to know if there's a way to find out the current stage of an unload process.
Unlike loading data into Redshift, unloading actually has to run a SELECT statement, so it can't report a status the way it does when loading.
For example, if the SELECT statement has to join and scan many tables to generate the output, it might take a long time even though the actual unload step is not the slow part.
So I usually check the query execution steps in the AWS console to get a rough idea of where the unload is.
I also check the S3 folder that I am unloading to, to see whether files have started coming in yet. They usually arrive in batches, so that can give you an idea as well.
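A small sketch of that S3 check with boto3; the bucket and prefix are placeholders for your UNLOAD target:

import boto3

s3 = boto3.client("s3")

def unloaded_so_far(bucket, prefix):
    """Return (file_count, total_bytes) already written under the UNLOAD prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    count, total = 0, 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            count += 1
            total += obj["Size"]
    return count, total

print(unloaded_so_far("my-unload-bucket", "exports/orders_2021/"))  # placeholder names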
As of 2021, there is a solution: the STL_UNLOAD_LOG system table.
https://docs.aws.amazon.com/redshift/latest/dg/r_STL_UNLOAD_LOG.html
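A small sketch of checking progress from STL_UNLOAD_LOG with psycopg2; the connection details and query id are placeholders, and the column list follows the documented table (path, line_count, transfer_size):

import psycopg2

# Placeholder connection details for your Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)

with conn, conn.cursor() as cur:
    # One row per file already written by the UNLOAD; replace 123456 with your query id.
    cur.execute("""
        SELECT path, line_count, transfer_size, start_time, end_time
        FROM stl_unload_log
        WHERE query = %s
        ORDER BY start_time
    """, (123456,))
    for path, lines, size, start, end in cur.fetchall():
        print(path, lines, size, start, end)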