Save spark dataframe schema to hdfs - hdfs

For a given data frame (df) we get the schema by df.schema, which is a StructType array. Can I save just this schema onto hdfs, while running from spark-shell? Also, what would be the best format in which the schema should be saved?

You can use treeString:
schema = df._jdf.schema().treeString()
and convert it to an RDD and use saveAsTextFile:
sc.parallelize([schema]).saveAsTextFile(...)
Or use saveAsPickleFile:
temp_rdd = sc.parallelize([schema])
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")
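If the goal is a format you can load back into a StructType later, the schema's JSON representation works well; a minimal sketch (the HDFS path is a placeholder), writing df.schema.json() and rebuilding the schema on read:
import json
from pyspark.sql.types import StructType

schema_json = df.schema.json()  # JSON string describing the StructType
sc.parallelize([schema_json], 1).saveAsTextFile("hdfs:///tmp/df_schema")  # hypothetical path

# later: rebuild the StructType from the saved text
restored = StructType.fromJson(json.loads(sc.textFile("hdfs:///tmp/df_schema").first()))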

Yes, you can save the data (and with it the schema) with:
df.write.format("parquet").save("path")  # give an HDFS path
and read it back from HDFS with:
sqlContext.read.parquet("path")  # give an HDFS path
Parquet + compression is the best storage strategy, whether it resides on S3 or not.
Parquet is a columnar format, so queries that only touch a few columns don't have to scan all of them.
Please also refer to this link: https://stackoverflow.com/questions/34361222/dataframe-to-hdfs-in-spark-scala
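Because Parquet embeds the schema in its file footers, a simple round trip recovers it without saving anything extra. A minimal sketch (the HDFS path is a placeholder):
df.write.mode("overwrite").parquet("hdfs:///tmp/df_as_parquet")  # hypothetical path
spark.read.parquet("hdfs:///tmp/df_as_parquet").printSchema()    # schema comes back with the data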

Related

ATHENA CREATE TABLE AS problem with parquet format

I'm creating a table in Athena and specifying the format as PARQUET; however, the file extension is not being recognized in S3. The type is displayed as "-", which means the file extension is not recognized, even though I can read the files (written from Athena) successfully in a Glue job using:
df = spark.read.parquet()
Here is my statement:
CREATE EXTERNAL TABLE IF NOT EXISTS test (
numeric_field INT
,numeric_field2 INT)
STORED AS PARQUET
LOCATION 's3://xxxxxxxxx/TEST TABLE/'
TBLPROPERTIES ('classification'='PARQUET');
INSERT INTO test
VALUES (10,10),(20,20);
I'm specifying the format as PARQUET, but when I check the S3 bucket the file type is displayed as "-". Also, when I check the Glue catalog, the table type is set as 'unknown'.
(S3 console screenshot showing the file type as "-")
I expected that the type is recognized as "parquet" in the S3 bucket
After contacting AWS support, it was confirmed that with CTAS queries Athena does not create file extensions for Parquet files.
"Further to confirm this, I do see the Knowledge Center article [1] where CTAS generates the Parquet files without extension ( Under section 'Convert the data format and set the approximate file size' Point 5)."
However, the files written from Athena are readable even without the extension.
Reference:
[1] https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
Workaround: I created a function to change the file extension, basically iterating over the files in the S3 bucket and then writing the contents back to the same location with a .parquet file extension.
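A minimal sketch of that workaround with boto3 (the bucket, prefix, and skip logic are assumptions, not the poster's actual code): copy each extensionless object to a new key ending in .parquet and delete the original.
import boto3

s3 = boto3.client("s3")
bucket, prefix = "xxxxxxxxx", "TEST TABLE/"  # hypothetical bucket/prefix

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/") or key.endswith(".parquet"):
            continue  # skip folders and already-renamed objects
        # copy_object handles objects up to 5 GB; larger files need multipart copy
        s3.copy_object(Bucket=bucket, Key=key + ".parquet",
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)  # remove the extensionless original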

How do I export a random sample of a csv in GCS to BigQuery

I'm working with a large CSV (400M+ lines) located in a GCS bucket. I need to get a random sample of this CSV and export it to BigQuery for a preliminary exploration. I've looked all over the web and I just can't seem to find anything that addresses this question.
Is this possible and how do I go about doing it?
You can query your CSV file directly from BigQuery using external tables. Try it with the TABLESAMPLE clause:
SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT)
You can create an external table over GCS (to read directly from GCS) and then do something like this:
SELECT * FROM `<project>.<dataset>.<externalTableFromGCS>`
WHERE CAST(10*RAND() AS INT64) = 0
The result of the SELECT can be stored in GCS with an export, or stored in a table with an INSERT ... SELECT.
Keep in mind that you need to fully read the file (and thus pay for the full file size) and then query a subset of it. You can't load only 10% of the volume into BigQuery.
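A minimal sketch of that flow with the BigQuery Python client (the project, dataset, and table names are placeholders): run the sampling query over the external table and write the result into a destination table.
from google.cloud import bigquery

client = bigquery.Client()
dest = "your-project.your_dataset.csv_sample"  # hypothetical destination table
job_config = bigquery.QueryJobConfig(destination=dest,
                                     write_disposition="WRITE_TRUNCATE")

sql = """
SELECT *
FROM `your-project.your_dataset.externalTableFromGCS`  -- hypothetical external table
WHERE CAST(10 * RAND() AS INT64) = 0                   -- keep roughly 10% of the rows
"""
client.query(sql, job_config=job_config).result()  # wait for the query to finish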
There is no direct way to load sample records from GCS to BigQuery, but we can achieve it in a different way. GCS lets us download only a specific byte range of a file, so the following simple Python code can load sample records into BQ from a large GCS file:
from google.cloud import storage
from google.cloud import bigquery

gcs_client = storage.Client()
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format='CSV', autodetect=True, max_bad_records=1)
bucket = gcs_client.get_bucket("your-bucket")
blob = storage.Blob('gcs_path/file.csv', bucket)

with open('local_file.csv', 'wb') as f:  # download only the first bytes of the file
    gcs_client.download_blob_to_file(blob, f, start=0, end=2000)

with open('local_file.csv', "rb") as source_file:  # upload the sample to BQ
    job = bq_client.load_table_from_file(source_file, 'your-proj.dataset.table_id', job_config=job_config)
    job.result()  # wait for the load job to finish
In the above code, 2 KB of data is downloaded from the huge GCS file, but the last row of the downloaded CSV may be incomplete since we cannot align the byte range with row boundaries. The trick here is max_bad_records=1 in the BQ job config, so the incomplete last row is ignored.

Read from glue catalog using spark and not using dynamic frame (glue context)

Since our schema is constant, we are using spark.read(), which is much faster than creating a dynamic frame from options when the data is stored in S3.
So now I want to read data from the Glue catalog.
Using a dynamic frame takes a lot of time,
so I want to read it using the Spark read API:
Dataframe.read.format("").option("url","").option("dtable",schema.table name).load()
What should I enter in the format and url options, and is anything else required?
Short answer:
If you read/load the data directly using a SparkSession/SparkContext you'll get a
pure Spark DataFrame instead of a DynamicFrame.
Different options when reading with Spark:
format: the source format you are reading from, so it can be parquet, csv, json, ...
load: the path to the source file/files you are reading from; it can be a local path, an S3 path, a Hadoop path, ...
options: plenty of different options, like inferSchema if you want Spark to do its best to guess the schema from a sample of the data, or header = true for CSV files.
An example:
df = spark.read.format("csv").option("header", True).option("inferSchema", True).load("s3://path")
No DynamicFrame has been created in the previous example, so df will be a DataFrame unless you convert it into a DynamicFrame using glue API.
Long answer:
The Glue catalog is just AWS's own Hive metastore implementation. You create a Glue catalog defining a schema, a type of reader, and mappings if required, and then this becomes available for different AWS services like Glue, Athena or Redshift Spectrum. The only benefit I see from using Glue catalogs is actually the integration with the different AWS services.
I think you can get the most from data catalogs using crawlers and the integrations with Athena and Redshift Spectrum, as well as loading them into Glue jobs using a unified API.
You can always read directly from different sources and formats using Glue's from_options method; you won't lose any of the great tools Glue has, and it will still read the data as a DynamicFrame.
If you don't want to get that data through Glue for any reason, you can just specify a DataFrame schema and read directly using a SparkSession, but keep in mind that you won't have access to bookmarks and other tools, although you can still transform that DataFrame into a DynamicFrame.
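For instance, a minimal sketch of that last option (the schema fields and S3 path are made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# hypothetical schema matching the table registered in the Glue catalog
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# read straight from S3 with the explicit schema -- no DynamicFrame, no bookmarks
df = spark.read.schema(schema).json("s3://bucket/path/")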
An example of reading from S3 using Spark directly into a DataFrame (e.g. in parquet, json or csv format) would be:
df = spark.read.parquet("s3://path/file.parquet")
df = spark.read.csv("s3a://path/*.csv")
df = spark.read.json("s3a://path/*.json")
That won't create any DynamicFrame unless you explicitly convert it; you'll get a pure Spark DataFrame.
Another way of doing it is using the format() method.
df = spark.read.format("csv").option("header", True).option("inferSchema", True).load("s3://path")
Keep in mind that there are several options like "header" or "inferSchema" for CSV, for example, and you'll need to know whether you want to use them. It is best practice to define the schema in production environments instead of using inferSchema, but there are several use cases for it.
And furthermore you can always convert that pure DataFrame to a DynamicFrame if needed using:
DynamicFrame.fromDF(df, glue_context, ..)
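And if the Glue Data Catalog is enabled as the job's Hive metastore (a job-level setting, not something shown above), the catalog table can be read with the plain Spark API; a sketch with made-up database and table names:
# assumes the Glue job uses the Glue Data Catalog as its Hive metastore
df = spark.sql("SELECT * FROM my_glue_database.my_table")  # hypothetical database/table
# or equivalently
df = spark.table("my_glue_database.my_table")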

AWS Glue ETL and PySpark and partitioned data: how to create a dataframe column from partition

I have data in an S3 bucket containing many json-files that looks somewhat like this:
s3://bucket1/news/year=2018/month=01/day=01/hour=xx/
The day partition contains multiple hour=xx partitions, one for each hour of the day. I run a Glue ETL job on the files in the day partition and create a Glue dynamic_frame_from_options. I then apply some mapping using ApplyMapping.apply which works like a charm.
However, I would then like to create a new column, containing the hour value, based on the partition of each file. I can use Spark to create a new column with a constant, however, how do I make this column to use the partition as a source?
df1 = dynamicFrame.toDF().withColumn("update_date", lit("new column value"))
Edit1
The article from AWS on how to use partitioned data uses a Glue crawler before the creation of a dynamicFrame and then create the dynamicFrame from a Glue catalog. I need to create the dynamicFrame directly from the S3 source.
I am not really following what you need to do. Don't you already have an hour value if you have the files partitioned on it, or do you only get it when you use create_dynamic_frame.from_catalog?
Can you do df1["hour"] or df1.select_fields(["hour"])?
You do not need to import any libs if you have your data partitioned on ts (a timestamp in yyyymmddhh format); you can do this with pure Python in Spark.
Example code: first I create some values that will populate my DataFrame, then I derive the new column as below.
df_values = [('2019010120',1),('2019010121',2),('2019010122',3),('2019010123',4)]
df = spark.createDataFrame(df_values,['yyyymmddhh','some_other_values'])
df_new = df.withColumn("hour", df["yyyymmddhh"][9:10])
df_new.show()
+----------+-----------------+----+
|yyyymmddhh|some_other_values|hour|
+----------+-----------------+----+
|2019010120| 1| 20|
|2019010121| 2| 21|
|2019010122| 3| 22|
|2019010123| 4| 23|
+----------+-----------------+----+
I'm not familiar with AWS Glue, if the given link doesn't work for your case, then you can try and see if the following workaround works for you:
Get the file name using input_file_name, then use regexp_extract to get the partition column you want from the file name:
from pyspark.sql.functions import input_file_name, regexp_extract
df2 = df1.withColumn("hour", regexp_extract(input_file_name(), "hour=(.+?)/", 1))
As I understand your problem, you would like to build a dataframe for a given day with the hours as partitions. Generally, if you use Apache Hive-style partitioned paths and your files have the same schema, you shouldn't have a problem using
ds = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://bucket1/news/year=2018/month=01/day=01/']},
    'json')
or...
df = spark.read.option("mergeSchema", "true").json('s3://bucket1/news/year=2018/month=01/day=01/')
So if it doesn't work you should check whether you use Apache Hive-style partitioned paths and your files have the same schema.
You can also try using the boto3 library in Glue (it may be useful to you):
import boto3
s3 = boto3.resource('s3')
Useful link:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
"...AWS Glue does not include the partition columns in the DynamicFrame—it only includes the data."
We have to load the S3 key into a new column and decode the partitions programmatically to create the columns we want in the DynamicFrame/DataFrame.
Once created, we can use them as we need.
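A minimal sketch of that decoding, building on the question's dynamicFrame and the input_file_name approach above (the column names and regular expression are assumptions based on the path layout in the question):
from pyspark.sql.functions import input_file_name, regexp_extract

df = dynamicFrame.toDF().withColumn("s3_key", input_file_name())  # full S3 object key per row
for part in ["year", "month", "day", "hour"]:
    # pull each Hive-style partition value (e.g. hour=03) out of the key
    df = df.withColumn(part, regexp_extract("s3_key", part + "=([^/]+)/", 1))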
PS: I have tested this against Parquet files; it doesn't work for JSON files.

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case it's ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, a relational DB, or the Glue Catalog.
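A minimal PySpark sketch of that approach using create_dynamic_frame.from_options with "recurse": True (the paths reuse the ones from the question; the mapping is hypothetical):
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

sales = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/"], "recurse": True},  # walk the store-*/date folders
    format="json")

# hypothetical small "normalization" of a couple of columns
sales = ApplyMapping.apply(frame=sales,
                           mappings=[("store_id", "string", "store_id", "string"),
                                     ("total", "string", "total", "double")])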
Yes, Glue is a great tool for this!
Use a crawler to create a table in the glue data catalog (remember to set Create a single schema for each S3 path under Grouping behavior for S3 data when creating the crawler)
Then you can use Relationalize to flatten out your JSON structure.
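A minimal sketch of the crawler-plus-Relationalize route (the database, table, and staging path names are placeholders):
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# table created by the crawler over s3://my-bucket/ -- names are hypothetical
dyf = glueContext.create_dynamic_frame.from_catalog(database="sales_db", table_name="my_bucket")

# flatten nested JSON into one or more flat DynamicFrames
flat = Relationalize.apply(frame=dyf, staging_path="s3://my-bucket/tmp/", name="root")
root = flat.select("root")  # the top-level flattened frame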
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one-row-per-newline" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There does appear to be some tooling now for jsonl, which is the actual format that AWS expects, making this less of an automatic win for csv. I would say if your data is already in json format, it's probably smarter to convert it to jsonl than to convert to csv.