I have created a Glue ETL job that loads data from a file at an S3 location into a Redshift table. However, I need to load the data into a particular Redshift schema. Below is a sample from my Glue script:
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"withHeader": True, "separator": "~"},
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://<bucketName>/{0}".format(file_name_temp)]},
    transformation_ctx="S3bucket_node1",
)
print('{0} completed'.format(file_name))
print(S3bucket_node1)
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=S3bucket_node1,
    catalog_connection="redshift",
    connection_options={"dbtable": "Persons", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink4",
)
With the above script, the data always loads into the public schema. When I try to define the DB name as "Schema.tableName" it gives an error, and adding a new "Schema" property in connection_options did not work either. Can someone please help: how can I specify the schema name above so the data loads into that particular schema only? I know it can be done by creating a crawler, but in my project we are restricted from creating crawlers.
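For what it's worth, one workaround worth trying (a sketch I can't verify against your cluster; "myschema" is a placeholder) is to schema-qualify the table name inside "dbtable" and leave "database" as the database name only. The helper below just assembles the options dict:

```python
def redshift_connection_options(schema, table, database="dev"):
    # The schema goes into "dbtable" as "schema.table";
    # "database" stays the database name, not the schema.
    return {"dbtable": "{0}.{1}".format(schema, table), "database": database}

# Hypothetical usage inside the Glue job ("myschema" is a placeholder):
# datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=S3bucket_node1,
#     catalog_connection="redshift",
#     connection_options=redshift_connection_options("myschema", "Persons"),
#     redshift_tmp_dir=args["TempDir"],
#     transformation_ctx="datasink4",
# )
```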
TL;DR
I'm trying to consolidate many S3 data files into fewer, larger files using a Glue [Studio] job
Input data is Catalogued in Glue and queryable via Athena
Glue Job runs with "Succeeded" output status, but no output files are created
Details
Input: I have data being created by a scraper on a once-per-minute cycle. It dumps the output in JSON (gzip) format to a bucket. I have this bucket catalogued in Glue and can query it, with no errors, using Athena; this makes me more confident that the Catalogue and data structure are set up correctly. On its own, though, this isn't ideal: it creates ~1.4K files per day, which makes queries against the data (via Athena) quite slow, since they have to scan far too many, far too small files.
Goal: I'd like to periodically (probably once per week or month, I'm not sure yet) consolidate the once-per-minute files into far fewer, so that queries scan bigger, less numerous files (faster queries).
Approach: My plan is to create a Glue ETL job (using Glue Studio) to read from the Catalogue table and write to a new S3 location (maintaining the same JSON-gzip format, so I can just re-point the Glue table to the new S3 location with the consolidated files). I set up the job using Glue Studio, and when I run it, it reports success, but there's no output at the S3 location specified (not empty files; just nothing at all).
Stuck! I'm at a bit of a loss, since (1) it says it's succeeding, and (2) I'm not even modifying the script (see below), so I'd presume (perhaps wrongly) that the script isn't the problem.
Logs: I've tried going through the CloudWatch logs, but I don't get much out of them. I suspect the entry below is relevant, but I can't find a way to either confirm that or change anything to "fix" it. (The path definitely exists: I can see it in S3, the Catalogue can search it, as verified by Athena queries, and it's auto-generated by the Glue Studio script builder.) It sounds as though I've selected, somewhere, an option that makes Glue think I only want some sort of "incremental" scan of the data. But I haven't (knowingly), nor can I find anywhere suggesting I have.
CloudWatch Log Entry
21/03/13 17:59:39 WARN HadoopDataSource: Skipping Partition {} as no new files detected # s3://my_bucket/my_folder/my_source_data/ or path does not exist
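That WARN line is the typical signature of Glue job bookmarks: with a bookmark enabled and a transformation_ctx set on the source, reruns only pick up files added since the last successful run, so a rerun over unchanged data reads nothing and still reports "Succeeded". As a quick check (a sketch; the job name is a placeholder and the boto3 call is commented out), you could rerun the job with bookmarks disabled:

```python
# Sketch: rerun the job once with bookmarks disabled to test whether the
# empty output is bookmark behaviour. "my_consolidation_job" is a placeholder.
run_arguments = {"--job-bookmark-option": "job-bookmark-disable"}

# import boto3
# glue = boto3.client("glue")
# glue.start_job_run(JobName="my_consolidation_job", Arguments=run_arguments)
```

If that run produces output, the "incremental scan" hunch is right and the bookmark setting on the job is the thing to change.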
Glue Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "my_database", table_name = "my_table", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_database", table_name = "my_table", transformation_ctx = "DataSource0")
## @type: DataSink
## @args: [connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0"]
## @return: DataSink0
## @inputs: [frame = DataSource0]
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = DataSource0, connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0")
job.commit()
Other Posts I Researched First
None had the same problem of a "Succeeded" job producing no output. However, one had empty files being created, and another too many files. The most interesting approach was using Athena to create the new output file for you (with an external table); however, when I looked into that, it appeared the output format options did not include JSON-gzip (or JSON without gzip), only CSV and Parquet, which are non-preferred for my use.
How to Convert Many CSV files to Parquet using AWS Glue
AWS Glue: ETL job creates many empty output files
AWS Glue Job - Writing into single Parquet file
AWS Glue, output one file with partitions
A suggested variant repartitions the DynamicFrame into a single partition before writing, so the consolidated output lands in one file:
datasource_df = DataSource0.repartition(1)
DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame=datasource_df,
    connection_type="s3",
    format="json",
    connection_options={"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []},
    transformation_ctx="DataSink0",
)
job.commit()
I'm trying to use an ETL job to directly write my dataframe to a database catalog and update my partitions.
I had code like this:
datasink4 = glueContext.write_dynamic_frame.from_options(
frame = dropnullfields3,
connection_type = "s3",
connection_options = {
"path": TARGET_PATH,
"partitionKeys":["x", "y"]
},
format = "parquet",
transformation_ctx = "datasink4")
additionalOptions = {"enableUpdateCatalog": True}
additionalOptions["partitionKeys"] = ["x", "y"]
sink = glueContext.write_dynamic_frame_from_catalog(frame=dropnullfields3,
database=DATABASE,
table_name=TABLE,
transformation_ctx="write_sink",
additional_options=additionalOptions)
which worked to write the data into the catalog. However, I would like to avoid the double write.
So I followed method 2 from the documentation on updating partitions:
https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html
and came up with this code:
datasink4 = glueContext.write_dynamic_frame.from_options(
frame = dropnullfields3,
connection_type = "s3",
connection_options = {
"path": TARGET_PATH,
"partitionKeys":["x", "y"]
},
format = "parquet",
transformation_ctx = "datasink4")
sink = glueContext.getSink(connection_type="s3", path=TARGET_PATH,
enableUpdateCatalog=True,
partitionKeys=["x", "y"])
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase=DATABASE, catalogTableName=TABLE)
sink.writeFrame(dropnullfields3)
But now the data can't be loaded in Athena, I get strange errors about the data structure like this :
HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error: < expected at the end of 'struct' (Service: null; Status Code: 0; Error Code: null; Request ID: null)
I have tried recreating the table so that it contains only the new glueparquet files.
I have also tried running a crawler on the new glueparquet files; the table generated by the crawler can be queried. However, when I fill the same table from the ETL job above, I always get this error...
You want to change the table's classification to glueparquet:
CREATE EXTERNAL TABLE `table_name`(
...
)
PARTITIONED BY (
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://cortisol-beta-log-bucket/service_log/'
TBLPROPERTIES (
'classification'='glueparquet')
or in CDK you need to set the dataFormat as follows:
dataFormat: new DataFormat({
inputFormat: InputFormat.PARQUET,
// Have to explicitly specify classification string to allow glue jobs to add partitions
classificationString: new ClassificationString("glueparquet"),
outputFormat: OutputFormat.PARQUET,
serializationLibrary: SerializationLibrary.PARQUET
}),
Then you can just use the code below and it will work with Athena:
glueContext.write_dynamic_frame.from_catalog(
frame=last_transform,
database=args["GLUE_DATABASE"],
table_name=args["GLUE_TABLE"],
transformation_ctx="datasink",
additional_options={"partitionKeys": partition_keys, "enableUpdateCatalog": True},
)
I've been creating PySpark jobs and I keep getting a similar, intermittent error (it appears more or less at random):
An error occurred while calling o129.parquet. Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found;
Request ID: D2FA355F92AF8F05; S3 Extended Request ID: 1/fWdf1DurwPDP40HDGARlMRO/7lKzFDJ4g7DbUnM04wUvG89CG9w5T+u4UxapkWp20MfQfdjsE=)
I'm not even reading from s3, what I'm actually doing is:
df.coalesce(100).write.partitionBy("mth").mode("overwrite").parquet("s3://"+bucket+"/"+path+"/out")
So I changed the coalesce partition count, but I'm not sure what else I should do to mitigate this error and make my jobs more stable.
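Intermittent 404s like this are often transient S3 hiccups during a large multi-part write. One blunt mitigation, a sketch of my own rather than a Glue or Spark feature, is to wrap the write in a small retry helper:

```python
import time

def write_with_retries(write_fn, attempts=3, backoff_seconds=5):
    """Call write_fn(); on failure, sleep and retry, up to `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return write_fn()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            time.sleep(backoff_seconds)

# Hypothetical usage with the write from the question:
# write_with_retries(lambda: df.coalesce(100).write.partitionBy("mth")
#                    .mode("overwrite").parquet("s3://" + bucket + "/" + path + "/out"))
```

This doesn't fix the underlying flakiness, but it stops a single transient 404 from failing the whole job run.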
For reading a file from S3 using Glue:
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3/path"]},
    format="json",
    transformation_ctx="datasource0",
)
For writing a file to S3 using Glue:
output = glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": "s3/path"},
    format="parquet",
    transformation_ctx="output",
)
I have successfully connected to BigQuery using:
sparkSession.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "cash_in") \
    .option("driver", "cdata.jdbc.googlebigquery.GoogleBigQueryDriver") \
    .load()
However, I don't want to read data from BigQuery; I want to write to it from my Aurora database.
Something like:
retDatasink4 = glueContext.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="bigquery",
    connection_options={
        "jdbc_url": jdbc_url,
        "dbtable": "cash_in",
        "driver": "cdata.jdbc.googlebigquery.GoogleBigQueryDriver",
    },
)
However, connection_type only accepts mysql, postgresql, redshift, sqlserver, and oracle.
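Since write_dynamic_frame has no bigquery connection type, one workaround to consider (a sketch, not verified) is converting the DynamicFrame to a Spark DataFrame with toDF() and using Spark's generic JDBC writer, which accepts any driver on the classpath, including the CData one. The helper below only assembles the writer options; the actual write is sketched in comments because it needs a live Glue session:

```python
def bigquery_jdbc_options(jdbc_url, table,
                          driver="cdata.jdbc.googlebigquery.GoogleBigQueryDriver"):
    # Options for Spark's generic JDBC DataFrameWriter.
    return {"url": jdbc_url, "dbtable": table, "driver": driver}

# Hypothetical usage inside the Glue job:
# df = orders.toDF()  # DynamicFrame -> Spark DataFrame
# df.write.format("jdbc") \
#     .options(**bigquery_jdbc_options(jdbc_url, "cash_in")) \
#     .mode("append") \
#     .save()
```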
I'm new to using AWS Glue and I don't understand how the ETL job gathers the data. I used a crawler to generate my table schema from some files in an S3 bucket and examined the autogenerated script in the ETL job, which is here (slightly modified):
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("data", "string", "data", "string")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")
When I run this job, it successfully takes my data from the bucket that my crawler used to generate the table schema and it puts the data into my destination s3 bucket as expected.
My question is this: I don't see anywhere in this script where the data is "loaded", so to speak. I know I point it at the table that was generated by the crawler, but from this doc:
Tables and databases in AWS Glue are objects in the AWS Glue Data Catalog. They contain metadata; they don't contain data from a data store.
If the table only contains metadata, how are the files from the data store (in my case, an S3 bucket) retrieved by the ETL job? I'm asking primarily because I'd like to somehow modify the ETL job to transform identically structured files in a different bucket without having to write a new crawler, but also because I'd like to strengthen my general understanding of the Glue service.
The main thing to understand is:
The Glue Data Catalog (databases and tables) is always in sync with Athena, the serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. You can create tables/databases from either the Glue console or the Athena query console.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")
This line of Glue Spark code does the magic of creating the initial dynamic frame from the Glue Data Catalog source table: apart from the metadata, schema, and table properties, it also has the Location pointing to your data store (the S3 location) where your data resides.
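Because the table is only a pointer to the data, you can also skip the crawler entirely for identically structured files in another bucket and read them straight from S3 with from_options instead of from_catalog (the bucket and prefix below are placeholders):

```python
# Sketch: bypass the Data Catalog and point the source at S3 directly.
# "my-other-bucket/prefix/" is a placeholder; the format matches the files.
other_source_options = {"paths": ["s3://my-other-bucket/prefix/"]}

# datasource0 = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options=other_source_options,
#     format="json",
#     transformation_ctx="datasource0",
# )
```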
After the ApplyMapping step, this portion of code (the datasink) does the actual loading of data into your target cluster/database.
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")
If you drill down into the AWS Glue Data Catalog, it has tables residing under databases. Clicking on a table exposes the metadata, which shows the S3 folder the table currently points to as a result of the crawler run.
You can still create tables over structured S3 files manually, by adding tables via the Data Catalog option and pointing them to your S3 location.
Another way is to use the Athena console to create tables pointing at S3 locations: a regular CREATE TABLE script with the LOCATION field holding your S3 location.