I was trying to read from a table in Snowflake, manipulate the data, and write it back.
I was able to connect to Snowflake and read the data as a DataFrame, but I cannot write it back to the table.
Code to connect to Snowflake:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from py4j.java_gateway import java_import
## @params: [JOB_NAME, URL, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD, ROLE]
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
# 'ROLE' must be resolved here as well, since sfOptions below uses args['ROLE']
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'WAREHOUSE', 'DB', 'SCHEMA', 'USERNAME', 'PASSWORD', 'ROLE'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
try:
    job.init(args['JOB_NAME'], args)
except Exception as e:
    pass
java_import(spark._jvm, SNOWFLAKE_SOURCE_NAME)
## uj = sc._jvm.net.snowflake.spark.snowflake
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
sfOptions = {
    "sfURL": args['URL'],
    "sfUser": args['USERNAME'],
    "sfPassword": args['PASSWORD'],
    "sfDatabase": args['DB'],
    "sfSchema": args['SCHEMA'],
    "sfWarehouse": args['WAREHOUSE'],
    "sfRole": args['ROLE']
}
df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "<>").load().select('<>')
df.printSchema()
df.show()
df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "GLUE_DEMO").mode("append").save()
But when executing, I get the error below:
File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.save.
: java.sql.SQLException: Status of query associated with resultSet is FAILED_WITH_ERROR. Results not generated.
at net.snowflake.client.jdbc.SFAsyncResultSet.getRealResults(SFAsyncResultSet.java:127)
at net.snowflake.client.jdbc.SFAsyncResultSet.getMetaData(SFAsyncResultSet.java:262)
If I check the query history in Snowflake, it shows that no warehouse was selected:
No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command
The easiest way is to assign a default warehouse to the user:
ALTER USER <name> SET DEFAULT_WAREHOUSE = <string>
Reference: ALTER USER
The read likely worked because the result was already cached, and cached results do not require an active warehouse.
The real error is reported in the Snowflake query history.
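If you prefer to handle it from the Glue job itself, here is a rough sketch (not tested; MY_USER and MY_WH are placeholder names, the connecting role must be allowed to run ALTER USER, and Utils.runQuery comes from the Snowflake Spark connector):
# One-time: give the user a default warehouse so every session opened by the
# connector has an active warehouse (MY_USER / MY_WH are placeholders).
spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(
    sfOptions, "ALTER USER MY_USER SET DEFAULT_WAREHOUSE = MY_WH"
)

# The write itself should then pick up the warehouse from sfOptions["sfWarehouse"]:
df.write.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "GLUE_DEMO") \
    .mode("append") \
    .save()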
Related
I am using Glue Studio 4.0 to choose a data source (a Delta table 2.1.0 stored in S3), as in the image below:
I then generate the script from the box:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="tien_bronze_layer",
    table_name="dim_product_dt",
    transformation_ctx="S3bucket_node1",
)
job.commit()
Finally, I saved this job and ran it, but I got an error:
Error: An error occurred while calling o96.getDynamicFrame.
s3://tientest/Bronze_layer/dim_product_dt/_symlink_format_manifest/manifest is not a Parquet file. Expected magic number at tail, but found [117, 101, 116, 10]
I know this error, but I can't find any docs on reading Delta tables that contain a manifest.
Can you all help me with this case? Thanks!
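One thing I would try (a sketch, not tested): read the Delta table path directly with Spark's Delta reader instead of going through the _symlink_format_manifest entry. This assumes the Glue 4.0 job parameter --datalake-formats is set to delta so the Delta Lake libraries are available:
from awsglue import DynamicFrame

# Read the Delta table straight from its S3 prefix (path taken from the error message)
delta_df = spark.read.format("delta").load("s3://tientest/Bronze_layer/dim_product_dt/")

# Convert back to a DynamicFrame if the rest of the generated script expects one
S3bucket_node1 = DynamicFrame.fromDF(delta_df, glueContext, "S3bucket_node1")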
I have used the new AWS Glue Studio visual tool to run a very simple SQL query, with the source as a Catalog Table, the transform as a simple Spark SQL query, and the target as CSV file(s) in an S3 bucket.
Each time I run the job, it succeeds, but nothing is stored in the bucket, not even an empty CSV file.
I'm not sure whether this is a Spark SQL problem or an AWS Glue problem.
Here is the automatically generated code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Data_Catalog_0
Data_Catalog_0_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="some_long_name_data_base_catalog",
    table_name="catalog_table",
    transformation_ctx="Data_Catalog_0_node1",
)
# Script generated for node ApplyMapping
SqlQuery0 = """
SELECT DISTINCT "ID"
FROM myDataSource
"""
ApplyMapping_node2 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"myDataSource": Data_Catalog_0_node1},
    transformation_ctx="ApplyMapping_node2",
)
# Script generated for node Amazon S3
AmazonS3_node166237 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="csv",
    connection_options={
        "path": "s3://target_bucket/results/",
        "partitionKeys": [],
    },
    transformation_ctx="AmazonS3_node166237",
)
job.commit()
This is very similar to this question; I am kind of reposting it because I am unable to comment on it due to low reputation, and although it is 4 months old, it is still unanswered.
The problem seems to be the double quotes around the selected fields in the SQL query. Dropping them solved the issue.
In other words, I "wrongly" used this query syntax:
SELECT DISTINCT "ID"
FROM myDataSource
instead of this "correct" one:
SELECT DISTINCT ID
FROM myDataSource
There is no mention of this in the Spark SQL syntax documentation.
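My assumption about why this happens: Spark SQL treats double-quoted text as a string literal, so SELECT DISTINCT "ID" returns the constant string ID rather than the ID column. If an identifier really needs quoting, backticks work, for example:
# Backticks, not double quotes, delimit identifiers in Spark SQL
result = spark.sql("SELECT DISTINCT `ID` FROM myDataSource")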
I am using Glue to transfer data from a PostgreSQL db to another PostgreSQL db. I always have an issue with the id column because it is declared as a primary key, but when the primary key constraint is removed on the target database there is no error.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node PostgreSQL
PostgreSQL_node1654086903275 = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="test_table_1",
    transformation_ctx="PostgreSQL_node1654086903275",
)
# Script generated for node Rename Field
RenameField_node1654086935942 = RenameField.apply(
    frame=PostgreSQL_node1654086903275,
    old_name="id",
    new_name="Id",
    transformation_ctx="RenameField_node1654086935942",
)
# Script generated for node PostgreSQL
PostgreSQL_node1654086963634 = glueContext.write_dynamic_frame.from_catalog(
    frame=RenameField_node1654086935942,
    database="my_db",
    table_name="test_table_",
    transformation_ctx="PostgreSQL_node1654086963634",
)
job.commit()
Is it possible to make simple concurrent SQL queries on S3 files with partitioning?
The problem is that it looks like you have to pick 2 options out of 3.
You can make concurrent SQL queries against S3 with S3 Select, but S3 Select doesn't support partitioning and it only works on a single object at a time.
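For reference, a single-object S3 Select call from Python looks roughly like this (bucket, key, and serialization settings are placeholders):
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-bucket",                                   # placeholder
    Key="data/part-00000.csv",                            # one object per call
    ExpressionType="SQL",
    Expression="SELECT s.col1, s.col2 FROM s3object s",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))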
Athena supports partitioning and SQL queries, but it has a limit of 20 concurrent queries. The limit can be increased, but there is no guarantee and no defined upper bound.
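Submitting Athena queries concurrently from Python is simple enough; roughly (database name, table, and output location are placeholders):
import boto3

athena = boto3.client("athena")
qe = athena.start_query_execution(
    QueryString="SELECT col1, COUNT(*) FROM my_table GROUP BY col1",   # placeholder query
    QueryExecutionContext={"Database": "my_database"},                 # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"}, # placeholder
)
print(qe["QueryExecutionId"])  # poll athena.get_query_execution(...) until it finishes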
You can configure HBase to work on S3 through EMRFS, but that requires too much configuration, and I suppose the data would have to be written through HBase (another format). Maybe there is a simpler solution?
You can also use managed services such as AWS Glue or AWS EMR.
Example code which you can run in Glue:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
def load_dict(_database, _table_name):
    ds = glueContext.create_dynamic_frame.from_catalog(database=_database, table_name=_table_name, transformation_ctx="ds_table")
    df = ds.toDF()
    df.createOrReplaceTempView(_table_name)
    return df

df_tab1 = load_dict("exampledb", "tab1")
df_sql = spark.sql("select m.col1, m.col2 from tab1 m")
df_sql.write.mode('overwrite').options(header=True, delimiter='|').format('csv').save("s3://com.example.data/tab2")
job.commit()
You can also consider using Amazon Redshift Spectrum.
https://aws.amazon.com/blogs/big-data/amazon-redshift-spectrum-extends-data-warehousing-out-to-exabytes-no-loading-required/
I'm building an ETL pipeline using AWS Glue, and I'm running into this error when running a job:
"TypeError: unsupported operand type(s) for +: 'DynamicFrame' and 'str'"
The job is processing data and then writing it out to a PostgreSQL database.
The job seems to be working, in the sense that the processing succeeds and the PostgreSQL database is updated, but the job still reports this error every time it runs.
I'm a bit stumped because I'm basically using a modified version of the stock AWS job script.
Here is my code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
import pyspark.sql.functions
from pyspark.sql.functions import to_date
from pyspark.sql.functions import input_file_name
from pyspark.sql.functions import current_timestamp
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Create a DynamicFrame using the Service ROs table
ros_DyF = glueContext.create_dynamic_frame.from_catalog(
    database="DB", table_name="TB", transformation_ctx="ros_DyF")
# Do a bunch of processing...code not included...
# Update the tables in postgreSQL
psql_conn_options = {'database' : 'DB', 'dbtable' : 'TB'}
psql_tmp_dir = "TMPDIR"
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped_dyF,
    catalog_connection='wizelyPSQL',
    connection_options=psql_conn_options,
    redshift_tmp_dir=psql_tmp_dir,
    transformation_ctx="datasink4")
job.commit()
Here's the error I get :
Traceback (most recent call last):
File "script_2018-07-09-19-30-30.py", line 168, in <module>
transformation_ctx = "datasink4")
File "/mnt/yarn/usercache/root/appcache/application_1531164400757_0001/container_1531164400757_0001_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 597, in from_jdbc_conf
File "/mnt/yarn/usercache/root/appcache/application_1531164400757_0001/container_1531164400757_0001_01_000001/PyGlue.zip/awsglue/context.py", line 262, in write_dynamic_frame_from_jdbc_conf
File "/mnt/yarn/usercache/root/appcache/application_1531164400757_0001/container_1531164400757_0001_01_000001/PyGlue.zip/awsglue/context.py", line 278, in write_from_jdbc_conf
File "/mnt/yarn/usercache/root/appcache/application_1531164400757_0001/container_1531164400757_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 32, in write
File "/mnt/yarn/usercache/root/appcache/application_1531164400757_0001/container_1531164400757_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame
TypeError: unsupported operand type(s) for +: 'DynamicFrame' and 'str'
Any suggestions?
It seems like a bug in the PyGlue lib. However, I checked the source code and didn't find anything suspicious. Here is the line:
28. return DynamicFrame(self._jsink.pyWriteDynamicFrame(dynamic_frame._jdf, callsite(), info), dynamic_frame.glue_ctx, dynamic_frame.name + "_errors")
The error you are receiving would be produced if the line looked like this (with .name removed from the last parameter):
28. return DynamicFrame(self._jsink.pyWriteDynamicFrame(dynamic_frame._jdf, callsite(), info), dynamic_frame.glue_ctx, dynamic_frame + "_errors")
In that case your job would still work, since the evaluation of self._jsink.pyWriteDynamicFrame(...) happens before the string concatenation that raises the error.
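A tiny stand-alone illustration of why that concatenation fails (the class below is only a stand-in for the real awsglue DynamicFrame):
class DynamicFrame:                # stand-in, not the real awsglue class
    def __init__(self, name):
        self.name = name

dyf = DynamicFrame("mapped_dyF")
print(dyf.name + "_errors")        # fine: str + str
dyf + "_errors"                    # TypeError: unsupported operand type(s) for +: 'DynamicFrame' and 'str'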
If you are using the PySpark lib on a dev endpoint, try downloading the latest version from aws-glue-jes-prod-us-east-1-assets/etl/python/PyGlue.zip. Otherwise, if you are composing the script in the Glue Console UI (the lib is provided by the AWS Glue service), you should contact AWS support.