I used the new AWS Glue Studio visual tool to run a very simple SQL query, with a Catalog Table as the source, a simple SparkSQL transform, and CSV file(s) in an S3 bucket as the target.
Each time I run the job it succeeds, but nothing is stored in the bucket, not even an empty CSV file.
I am not sure whether this is a SparkSQL problem or an AWS Glue problem.
Here is the automatically generated code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Data_Catalog_0
Data_Catalog_0_node1 = glueContext.create_dynamic_frame.from_catalog(
database="some_long_name_data_base_catalog",
table_name="catalog_table",
transformation_ctx="Data_Catalog_0_node1",
)
# Script generated for node ApplyMapping
SqlQuery0 = """
SELECT DISTINCT "ID"
FROM myDataSource
"""
ApplyMapping_node2 = sparkSqlQuery(
glueContext,
query=SqlQuery0,
mapping={"myDataSource": Data_Catalog_0_node1},
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node Amazon S3
AmazonS3_node166237 = glueContext.write_dynamic_frame.from_options(
frame=ApplyMapping_node2,
connection_type="s3",
format="csv",
connection_options={
"path": "s3://target_bucket/results/",
"partitionKeys": [],
},
transformation_ctx="AmazonS3_node166237",
)
job.commit()
This is very similar to this question. I am reposting it because I do not have enough reputation to comment there and, although it is four months old, it is still unanswered.
The problem seems to be the double quotes around the selected field in the SQL query. Dropping them solved the issue.
In other words, I "wrongly" used this query syntax:
SELECT DISTINCT "ID"
FROM myDataSource
instead of this "correct" one:
SELECT DISTINCT ID
FROM myDataSource
I found no mention of this in the Spark SQL Syntax documentation.
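A likely explanation, offered as an assumption rather than something confirmed in this thread: with its default settings Spark SQL parses a double-quoted token as a string literal and uses backticks to quote identifiers, so `"ID"` would be read as the constant string 'ID' and not as the column. The sketch below is a hypothetical helper (the name `backtick_identifiers` is not part of Glue or Spark) that rewrites simple double-quoted names before the query is passed to spark.sql.

```python
import re

# Matches a simple double-quoted name such as "ID" or "vendor_code".
_DOUBLE_QUOTED_NAME = re.compile(r'"([A-Za-z_][A-Za-z0-9_]*)"')


def backtick_identifiers(query: str) -> str:
    """Rewrite "name" as `name` so it is quoted the way Spark SQL quotes identifiers.

    Hypothetical helper: it assumes every double-quoted token in the query is
    meant as an identifier, so it would also rewrite a genuine string literal.
    """
    return _DOUBLE_QUOTED_NAME.sub(r"`\1`", query)


fixed_query = backtick_identifiers('SELECT DISTINCT "ID"\nFROM myDataSource')
print(fixed_query)
```

For a query this small, simply typing ID without quotes, as in the answer above, is the simpler fix; the helper only matters when queries come from a tool that adds the quotes.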
Related
I have a table with 13 records and a simple Glue job that reads from the table and writes to an S3 bucket in Parquet format, as shown in the code below. When the job runs, the number of Parquet files produced in S3 equals the number of records, so each row is written to its own Parquet file. I don't understand why this happens instead of all the records being stored in a single Parquet file. Is there a configuration option or parameter that we have missed?
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1671552410216 = glueContext.create_dynamic_frame.from_catalog(
database="sql_database",
table_name="ob_cpanel_products",
transformation_ctx="AWSGlueDataCatalog_node1671552410216",
)
# Script generated for node Change Schema (Apply Mapping)
ChangeSchemaApplyMapping_node1671554868239 = ApplyMapping.apply(
frame=AWSGlueDataCatalog_node1671552410216,
mappings=[
("productid", "int", "productid", "long"),
("productcode", "string", "productcode", "string"),
("name", "string", "name", "string"),
("quantity", "int", "quantity", "long"),
("price", "decimal", "price", "decimal"),
],
transformation_ctx="ChangeSchemaApplyMapping_node1671554868239",
)
# Script generated for node Amazon S3
AmazonS3_node1671554880876 = glueContext.write_dynamic_frame.from_options(
frame=ChangeSchemaApplyMapping_node1671554868239,
connection_type="s3",
format="glueparquet",
connection_options={
"path": "s3://onebox-glue-etl-pre/etl-tpv/output/with_partition/",
"partitionKeys": [],
},
format_options={"compression": "gzip"},
transformation_ctx="AmazonS3_node1671554880876",
)
job.commit()
You should repartition the frame before writing it, then pass the repartitioned frame (not the original one) as frame= to the S3 writer:
repartitioned_frame = ChangeSchemaApplyMapping_node1671554868239.repartition(<number of files>)
# You can also use coalesce instead of repartition
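Each partition of the frame is written as a separate file, which is why reducing the partition count reduces the file count. A minimal sketch of the suggestion follows, assuming the same writer options as in the question; the function name `write_with_file_count` and its parameters are hypothetical, while `coalesce` and `write_dynamic_frame.from_options` are the DynamicFrame and GlueContext calls already discussed above.

```python
def write_with_file_count(glue_context, frame, path, file_count=1):
    """Coalesce `frame` to `file_count` partitions, then write it to S3.

    Hypothetical wrapper: `glue_context` is expected to be a GlueContext and
    `frame` a DynamicFrame. The format and compression mirror the question.
    """
    # Fewer partitions means fewer output files (one file per partition).
    merged = frame.coalesce(file_count)
    return glue_context.write_dynamic_frame.from_options(
        frame=merged,
        connection_type="s3",
        format="glueparquet",
        connection_options={"path": path, "partitionKeys": []},
        format_options={"compression": "gzip"},
    )
```

In the job above this would be called with glueContext, ChangeSchemaApplyMapping_node1671554868239 and the existing output path, in place of the generated Amazon S3 node. A single partition is fine for 13 rows, but for large tables it funnels the whole write through one task.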
As I understand it, job bookmarks prevent duplicated data. "Enable" updates the data based on the previous run, and "Disable" processes the entire dataset (does this mean it overwrites it? I tried this, but the job took far too long and I'm not sure it does what I think it does).
But what if I want to overwrite the DynamoDB table in the job? I've seen examples where the output data is in S3, but I'm not sure about DynamoDB.
For example, I have a Glue job like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Redshift Cluster
RedshiftCluster_node1 = glueContext.create_dynamic_frame.from_catalog(
database="tr_bbd",
redshift_tmp_dir=args["TempDir"],
table_name="tr_bbd_vendor_info",
transformation_ctx="RedshiftCluster_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
frame=RedshiftCluster_node1,
mappings=[
("vendor_code", "string", "vendor_code", "string"),
("vendor_group_id", "int", "vendor_group_id", "int"),
("vendor_group_status_name", "string", "vendor_group_status_name", "string")
],
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node DynamoDB bucket
Datasink1 = glueContext.write_dynamic_frame_from_options(
frame=ApplyMapping_node2,
connection_type="dynamodb",
connection_options={
"dynamodb.output.tableName": "VENDOR_TABLE",
"dynamodb.throughput.write.percent": "1.0"
}
)
job.commit()
Thank you.
I am trying to read from a table in Snowflake, manipulate the data, and write it back.
I am able to connect to Snowflake and read the data as a DataFrame, but I cannot write back to the table.
Code to connect to Snowflake:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from py4j.java_gateway import java_import
## @params: [JOB_NAME, URL, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD, ROLE]
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
# ROLE must be requested here because sfOptions below reads args['ROLE']
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'WAREHOUSE', 'DB', 'SCHEMA', 'USERNAME', 'PASSWORD', 'ROLE'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
try:
    job.init(args['JOB_NAME'], args)
except Exception as e:
    pass
java_import(spark._jvm, SNOWFLAKE_SOURCE_NAME)
## uj = sc._jvm.net.snowflake.spark.snowflake
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
sfOptions = {
"sfURL" : args['URL'],
"sfUser" : args['USERNAME'],
"sfPassword" : args['PASSWORD'],
"sfDatabase" : args['DB'],
"sfSchema" : args['SCHEMA'],
"sfWarehouse" : args['WAREHOUSE'],
"sfRole" : args['ROLE']
}
df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "<>").load().select('<>')
# printSchema() and show() print their output themselves and return None
df.printSchema()
df.show()
df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "GLUE_DEMO").mode("append").save()
But when I execute it, I get the error below:
File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.save.
: java.sql.SQLException: Status of query associated with resultSet is FAILED_WITH_ERROR. Results not generated.
at net.snowflake.client.jdbc.SFAsyncResultSet.getRealResults(SFAsyncResultSet.java:127)
at net.snowflake.client.jdbc.SFAsyncResultSet.getMetaData(SFAsyncResultSet.java:262)
If I look at the query history in Snowflake, it shows that no warehouse was selected:
No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command
The easiest way is to assign the default warehouse to the user:
ALTER USER <name> SET DEFAULT_WAREHOUSE = <string>
Reference: ALTER USER
The read may have worked because the data was already cached, in which case it does not require an active warehouse.
The real error code can be found in the Snowflake query history.
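Since the error says no warehouse is active, one cheap thing to rule out before the write is that a connector option such as sfWarehouse arrived empty from the job arguments. This is only a sketch of that check, not a confirmed cause of the error above; the helper name `empty_snowflake_options` and the example values are hypothetical.

```python
def empty_snowflake_options(options):
    """Return the connector option names whose value is missing or blank."""
    return sorted(
        name
        for name, value in options.items()
        if value is None or not str(value).strip()
    )


# Placeholder values for illustration only; in the job this would be sfOptions.
example_options = {
    "sfURL": "<account url>",
    "sfWarehouse": "   ",
    "sfRole": "<role>",
}
print(empty_snowflake_options(example_options))
```

If the list comes back empty and the error persists, the default warehouse fix above, or the privileges of the role on that warehouse, are the next things to look at.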
Is it possible to run simple concurrent SQL queries on S3 files with partitioning?
The problem is that it looks like you can only have two of the three.
You can make concurrent SQL queries against S3 with S3 Select. But S3 Select doesn't support partitioning, and it works on a single file at a time.
Athena supports partitioning and SQL queries, but it has a limit of 20 concurrent queries. The limit can be increased, but there are no guarantees and no stated upper bound.
You can configure HBase to work on S3 through EMRFS, but that requires too much configuration, and I suppose the data would have to be written through HBase (in another format). Is there a simpler solution?
You can also use managed services such as AWS Glue or Amazon EMR.
Example code which you can run in Glue:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
def load_dict(_database, _table_name):
    ds = glueContext.create_dynamic_frame.from_catalog(database = _database, table_name = _table_name, transformation_ctx = "ds_table")
    df = ds.toDF()
    df.createOrReplaceTempView(_table_name)
    return df
df_tab1=load_dict("exampledb","tab1")
df_sql=spark.sql( "select m.col1, m.col2 from tab1 m")
df_sql.write.mode('overwrite').options(header=True, delimiter = '|').format('csv').save("s3://com.example.data/tab2")
job.commit()
You can also consider using Amazon Redshift Spectrum.
https://aws.amazon.com/blogs/big-data/amazon-redshift-spectrum-extends-data-warehousing-out-to-exabytes-no-loading-required/
Consider the following AWS Glue job code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
database = "my_database",
table_name = "my_table")
medicare_dynamicframe.printSchema()
job.commit()
It prints something like this (note that price_key is not in the second position):
root
|-- day_key: string
...
|-- price_key: string
However, my_table is defined in the data lake with day_key as int (first column) and price_key as decimal(25,0) (second column).
I may be wrong, but from reading the sources it looks as if AWS Glue uses the database and table names only to get the S3 path to the data and completely ignores the type definitions. That may be normal for some data formats like Parquet, but not for CSV.
How can I configure AWS Glue so that a dynamic frame over CSV data takes its schema from the data lake table definition?