Is it possible to make simplest concurrent SQL queries on S3 file with partitioning?
The problem it looks like you have to choose 2 options from 3.
You can make concurrent SQL queries against S3 with S3 Select. But S3 Select doesn't support partitioning, it also works on single file at a time.
Athena support partitioning and SQL queries, but it has limit of 20 concurrent queries. Limit could be increased, but there is no guarantees and uper line.
You can configure HBase that works on S3 through EMRFS, but that requires to much configurations. And I suppose data should be written through HBase (another format). Maybe more simple solution?
You can also use such managed services like AWS Glue or AWS EMR.
Example code which you can run in Glue:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
def load_dict(_database,_table_name):
ds = glueContext.create_dynamic_frame.from_catalog(database = _database, table_name = _table_name, transformation_ctx = "ds_table")
df = ds.toDF()
df.createOrReplaceTempView(_table_name)
return df
df_tab1=load_dict("exampledb","tab1")
df_sql=spark.sql( "select m.col1, m.col2 from tab1 m")
df_sql.write.mode('overwrite').options(header=True, delimiter = '|').format('csv').save("s3://com.example.data/tab2")
job.commit()
You can also consider to use Amazon Redshift Spectrum.
https://aws.amazon.com/blogs/big-data/amazon-redshift-spectrum-extends-data-warehousing-out-to-exabytes-no-loading-required/
Related
I am using Glue Studio 4.0 to choose data source (delta table 2.1.0 that saved in S3) as image below:
And then, I generate script from the box:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
database="tien_bronze_layer",
table_name="dim_product_dt",
transformation_ctx="S3bucket_node1",
)
job.commit()
Finally, I saved this job and run but I got an error:
Error: An error occurred while calling o96.getDynamicFrame.
s3://tientest/Bronze_layer/dim_product_dt/_symlink_format_manifest/manifest is not a Parquet file. Expected magic number at tail, but found [117, 101, 116, 10]
I know this error but I don't find any docs to read delta tables that contain manifest.
Can you all help me in this case, thanks!
I have used the new AWS Glue Studio visual tool to just try run a very simple SQL query, with Source as a Catalog Table, Transform as a simple SparkSQL, and Target as a CSV file(s) in an s3 bucket.
Each time I run the code, it succeeds but nothing is stored in the bucket, not even an empty CSV file.
Not sure if this is a SparkSQL problem, or an AWS Glue problem.
Here is the automatically generated code :
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
for alias, frame in mapping.items():
frame.toDF().createOrReplaceTempView(alias)
result = spark.sql(query)
return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Data_Catalog_0
Data_Catalog_0_node1 = glueContext.create_dynamic_frame.from_catalog(
database="some_long_name_data_base_catalog",
table_name="catalog_table",
transformation_ctx="Data_Catalog_0_node1",
)
# Script generated for node ApplyMapping
SqlQuery0 = """
SELECT DISTINCT "ID"
FROM myDataSource
"""
ApplyMapping_node2 = sparkSqlQuery(
glueContext,
query=SqlQuery0,
mapping={"myDataSource": Data_Catalog_0_node1},
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node Amazon S3
AmazonS3_node166237 = glueContext.write_dynamic_frame.from_options(
frame=ApplyMapping_node2,
connection_type="s3",
format="csv",
connection_options={
"path": "s3://target_bucket/results/",
"partitionKeys": [],
},
transformation_ctx="AmazonS3_node166237",
)
job.commit()
This is very similar to this question, I am kind of reposting it, because I am unable to comment on it due to the low points, and although 4 Months old, still unanswered.
The problem seems to be the double-quotes of the selected fields in the SQL query. Dropping them solved the issue.
In other words, I "wrongly" used this query syntax:
SELECT DISTINCT "ID"
FROM myDataSource
instead of this "correct" one :
SELECT DISTINCT ID
FROM myDataSource
There is no mention of it in the Spark SQL Syntax documentation
As I understand, Joob Bookmarks prevents the duplicated data. "Enable" updates the data based on the previous data, and "disable" process the entire dataset (does this mean it overrides it? I tried this, but the job took for too long and i'm not sure if it does what i think it does.)
But what if I want to override the Dynamodb Table in the job? I've seen examples where the output data is in S3, but I'm not sure about the DynamoDB.
For example I have a Glue job like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Redshift Cluster
RedshiftCluster_node1 = glueContext.create_dynamic_frame.from_catalog(
database="tr_bbd",
redshift_tmp_dir=args["TempDir"],
table_name="tr_bbd_vendor_info",
transformation_ctx="RedshiftCluster_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
frame=RedshiftCluster_node1,
mappings=[
("vendor_code", "string", "vendor_code", "string"),
("vendor_group_id", "int", "vendor_group_id", "int"),
("vendor_group_status_name", "string", "vendor_group_status_name", "string")
],
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node DynamoDB bucket
Datasink1 = glueContext.write_dynamic_frame_from_options(
frame=ApplyMapping_node2,
connection_type="dynamodb",
connection_options={
"dynamodb.output.tableName": "VENDOR_TABLE",
"dynamodb.throughput.write.percent": "1.0"
}
)
job.commit()
Thank you.
I want to access hive metastore by running a spark job on AWS Glue. Doing so requires me to put the hive's instance's ip and access it. From my local, it works but not from AWS Glue.
I have tried to access Hive using the following piece of code:
spark_session = (
glueContext.spark_session
.builder
.appName('example-pyspark-read-and-write-from-hive')
.config(
"hive.metastore.uris",
"thrift://172.16.12.34:9083",
conf=SparkConf()
)
.enableHiveSupport()
.getOrCreate()
)
I have also looked at various documentations but none could tell my how to connect to an ec2 instance at a specific port.
The code is:
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark import SparkConf, SparkContext
from pyspark.conf import SparkConf
from pyspark.context import SparkConf, SparkContext
from pyspark.sql import (DataFrameReader, DataFrameWriter, HiveContext,
SparkSession)
"""
SparkSession ss = SparkSession
.builder()
.appName(" Hive example")
.config("hive.metastore.uris", "thrift://localhost:9083")
.enableHiveSupport()
.getOrCreate();
"""
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark_session = (
glueContext.spark_session
.builder
.appName('example-pyspark-read-and-write-from-hive')
.config(
"hive.metastore.uris",
"thrift://172.16.12.34:9083",
conf=SparkConf()
)
.enableHiveSupport()
.getOrCreate()
)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = spark_session.createDataFrame(data)
df.write.saveAsTable('example_2')
job.commit()
I expect to get the table written in Hive but instead I get the following error from Glue:
An error occurred while calling o239.saveAsTable. No Route to Host from ip-172-31-14-64/172.31.14.64 to ip-172-31-15-11.ap-south-1.compute.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host;
Consider the following aws glue job code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
database = "my_database",
table_name = "my_table")
medicare_dynamicframe.printSchema()
job.commit()
It prints something like that (note that price_key is not on second position):
root
|-- day_key: string
...
|-- price_key: string
While my_table in datalake is defined with day_key as int (first column) and price_key as decimal(25,0) (second column).
May be I am wrong but I spot from sources that aws glue uses table and database to get just s3 path to data but completelly ignores any type definitions. May be for some data formats like parquet it is normal, but not for csv.
How configure aws glue to set schema from datalake table defintion for dynamic frame with csv?