pyspark.sql.utils.AnalysisException: Reference 'title' is ambiguous, could be: title, title - amazon-web-services

I am using Glue version 3.0, Python version 3, and Spark version 3.1.
I am extracting data from XML, creating a dataframe, and writing the data to an S3 path in CSV format.
Before writing the dataframe I printed the schema and one record using show(1); up to this point everything was fine.
But while writing it to a CSV file in the S3 location I got a "duplicate column found" error, as my dataframe had two columns named "Title" and "title".
I tried to add a new column title2 that would hold the content of title, planning to drop title later, with the command below:
from pyspark.sql import functions as f
df = df.withColumn('title2', f.expr("title"))
but I got the error:
Reference 'title' is ambiguous, could be: title, title
I also tried
df = df.withColumn('title2', f.col("title"))
and got the same error.
Any help or approach to resolve this, please?

By default Spark is case-insensitive; we can make it case-sensitive by setting spark.sql.caseSensitive to True.
from pyspark.sql import functions as f
df = spark.createDataFrame([("CapitalizedTitleColumn", "title_column"), ], ("Title", "title"))
spark.conf.set('spark.sql.caseSensitive', True)
df.withColumn('title2', f.expr("title")).show()
Output
+--------------------+------------+------------+
| Title| title| title2|
+--------------------+------------+------------+
|CapitalizedTitleC...|title_column|title_column|
+--------------------+------------+------------+
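With case sensitivity enabled, the original goal (keeping one copy of the column and writing to CSV) can be finished by dropping or renaming the duplicate before the write. A minimal sketch, assuming the capitalized Title column is the one to discard and using a placeholder S3 path:
spark.conf.set('spark.sql.caseSensitive', True)
# Drop the capitalized duplicate (or rename it first if both values are needed);
# the CSV writer then no longer sees two columns that collapse to "title".
df_clean = df.drop('Title')
df_clean.write.mode('overwrite').csv('s3://your-bucket/output/path/')  # placeholder path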

Related

Spark SQL Union on Empty dataframe

We have a Glue job running a script in which we register two dataframes as temp views using createOrReplaceTempView. The dataframes are then combined via a union:
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
joined_df = spark.sql(
    """
    SELECT
        id
        , name
        , product
        , year
        , month
        , day
    FROM df1
    UNION ALL
    SELECT
        id
        , name
        , product
        , year
        , month
        , day
    FROM df2
    """
)
joined_df.createOrReplaceTempView("joined_df")
Everything appeared to be working fine until it failed. Upon research, I suspect it is because one of the dataframes is empty. This job runs daily, and occasionally one of the dataframes (df2) has no data for the day.
The error message returned in the CloudWatch log is not entirely clear:
pyspark.sql.utils.AnalysisException: u"cannot resolve '`id`' given input columns:
[]; line 22 pos 7;\n'Union\n:- Project [contract_id#0, agr_type_cd#1, ind_product_cd#2,
report_dt#3, effective_dt#4, agr_trns_cd#5, agreement_cnt#6, dth_bnft#7, admn_system_cd#8,
ind_lob_cd#9, cca_agreement_cd#10, year#11, month#12, day#13]\n: +- SubqueryAlias `df1`\n
'Project ['id, name#20,
'product, 'year, 'month, 'day]\n +- SubqueryAlias `df2`\n +- LogicalRDD false\n"
I'm seeking advice on how to resolve such an issue. Thank you.
UPDATE:
I'm fairly certain I know the issue, but I'm unsure how to solve it. On days where the source file in S3 is "empty" (no records, only the header row), the read from the catalog returns an empty set:
df2 = glueContext.create_dynamic_frame.from_catalog(
    database = glue_database,
    table_name = df2_table,
    push_down_predicate = pred
)
df2 = df2.toDF()
df2.show()
The output:
++
||
++
++
Essentially, the from_catalog method is not reading the schema from Glue. I would expect that even without data the header would be detected and the union would simply return everything from df1. But since I'm receiving an empty set without a header, the union cannot occur because it acts as though the schema has changed or does not exist. The underlying S3 files have not changed schema. This job triggers daily when source files have been loaded to S3. When there is data in the file, the union does not fail, because the schema can be detected from_catalog. Is there a way to read the header even when no data is returned?
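One possible workaround, sketched below on the assumption that df1 and df2 share the same column layout and that df1 always carries the full schema: detect the empty read and substitute an empty DataFrame built from df1's schema, so the UNION ALL can still resolve the column names.
# Sketch of a guard for the empty-read case (assumes df1 is never empty).
if not df2.columns:                                  # from_catalog returned no schema at all
    df2 = spark.createDataFrame([], df1.schema)      # empty frame with df1's columns
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
# ... the UNION ALL query above can now resolve id, name, product, year, month, day.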

writing from a Spark DataFrame to BigQuery table gives BigQueryException: Provided Schema does not match

My PySpark job computes a DataFrame that I want to insert into a BigQuery table (from a Dataproc cluster).
On the BigQuery side, the partition field is REQUIRED.
On the DataFrame side, the inferred partition field is not REQUIRED, which is why I build a schema defining this field as REQUIRED:
StructField("date_part",DateType(),False)
So I create a new DF with the new schema, and when I show this DF I see, as expected:
date_part: date (nullable = false)
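For reference, a minimal sketch of how such a schema override is typically applied; the DataFrame name and the extra column are illustrative, not from the original post:
from pyspark.sql.types import StructType, StructField, StringType, DateType

# Rebuild the schema with date_part marked as non-nullable (REQUIRED).
new_schema = StructType([
    StructField("some_col", StringType(), True),   # hypothetical other column
    StructField("date_part", DateType(), False),   # nullable=False -> REQUIRED
])

# Re-create the DataFrame over the same rows with the stricter schema.
df_with_schema = spark.createDataFrame(df.rdd, new_schema)
df_with_schema.printSchema()   # date_part: date (nullable = false)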
But my PySpark ended like that :
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Provided Schema does not match Table xyz$20211115. Field date_part has changed mode from REQUIRED to NULLABLE
Is there something I missed?
Update:
I am using the Spark 3.0 image and the spark-bigquery-latest_2.12.jar connector.
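The write call itself is not shown in the post; a minimal sketch of a typical spark-bigquery-connector write, with placeholder table, bucket, and partition values, would look roughly like this:
(df_with_schema.write
    .format("bigquery")
    .option("table", "my_dataset.xyz")              # placeholder dataset.table
    .option("temporaryGcsBucket", "my-tmp-bucket")  # placeholder staging bucket
    .option("datePartition", "20211115")            # target the xyz$20211115 partition
    .mode("append")
    .save())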

AWS Glue dynamic frame - no column headers if no data

I read the Glue catalog table, convert it to a dataframe, and print the schema using the code below (Spark with Python):
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.toDF()
df.printSchema()
It works fine when the table has data.
But it doesn't print the schema if the table is empty (it is unable to get the schema of an empty table), and as a result the subsequent joins fail.
Is there a way to make the dynamic frame get the table schema from the catalog even for an empty table, or any other alternative?
I found a solution. It is not ideal, but it works. If you call apply_mapping() on your DynamicFrame, it will preserve the schema in the DataFrame. For example, if your table has a column last_name, you can do:
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name')
df = dyf.apply_mapping([
    ("last_name", "string", "last_name", "string")
]).toDF()
df.printSchema()

How to partition data by datetime in AWS Glue?

The current set-up:
An S3 location with JSON files. All files are stored in the same location (no day/month/year structure).
A Glue crawler reads the data into a catalog table.
A Glue ETL job transforms the data and stores it as parquet tables in S3.
A Glue crawler reads from the S3 parquet tables and stores the result into a new table that gets queried by Athena.
What I want to achieve is (1) the parquet tables partitioned by day and (2) the parquet data for one day written to the same file. Currently there is a separate parquet file for each JSON file.
How would I go about it?
One thing to mention: there is a datetime column in the data, but it's a Unix epoch timestamp. I would probably need to convert that to a year/month/day format, otherwise I'm assuming it will create a partition for each file again.
Thanks a lot for your help!!
Convert Glue's DynamicFrame into a Spark DataFrame to add the year/month/day columns and repartition. Reducing the number of partitions to one ensures that only one file will be written into a folder, but it may slow down job performance.
Here is the Python code:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, year, month, dayofmonth, to_date, from_unixtime
...
df = dynamicFrameSrc.toDF()

repartitioned_with_new_columns_df = (df
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop(col("date_col"))
    .repartition(1))

dyf = DynamicFrame.fromDF(repartitioned_with_new_columns_df, glueContext, "enriched")

datasink = glueContext.write_dynamic_frame.from_options(
    frame = dyf,
    connection_type = "s3",
    connection_options = {
        "path": "s3://yourbucket/data",
        "partitionKeys": ["year", "month", "day"]
    },
    format = "parquet",
    transformation_ctx = "datasink"
)
Note that the from pyspark.sql.functions import col can give a reference error; this shouldn't be a problem, as explained here.
I cannot comment so I am going to write as an answer.
I used Yuriy's code and a couple of things needed adjustment:
missing brackets
df = dynamicFrameSrc.toDF()
after toDF() I had to add select("*"), otherwise the schema was empty
df.select("*")
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
To achieve this in AWS Glue Studio:
You will need to make a custom function to convert the datetime field to date. There is the extra step of converting it back to a DynamicFrameCollection.
In Python:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_with_date = df.withColumn('date_field', df['datetime_field'].cast('date'))
    glue_df = DynamicFrame.fromDF(df_with_date, glueContext, "transform_date")
    return DynamicFrameCollection({"CustomTransform0": glue_df}, glueContext)
You would then have to edit the custom transformer schema to include that new date field you just created.
You can then use the "data target" node to write the data to disk and then select that new date field to use as a partition.
video step by step walkthrough

pyodbc and xmp toolkit adding metadata printing on command line but only adding first letter in metadata

Hi, I'm having a problem adding metadata to some JPG images.
I am querying a FileMaker database via pyodbc:
sql = cur.execute("""SELECT division, attribute_20, brand_name FROM table WHERE item_number=?""", (prod))
row = cur.fetchone()
When I print(row[0]) I get an output of 'BRN' and the type is unicode.
However, when I try to add this as metadata with xmp.set_property(consts.XMP_NS_DC, u'DivisionName', row[0]),
it only inputs the first letter: <dc:DivisionName>B</dc:DivisionName>
I have tried converting this to a string, but it's the same problem.
Any help would be really appreciated!
Richard
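As a purely diagnostic sketch (reusing the cursor and XMP objects from the post), it can help to confirm exactly what value, type, and length are being handed to set_property before suspecting the XMP side:
value = row[0]
print(repr(value), type(value), len(value))   # expect something like u'BRN', unicode, 3

# Pass an explicit unicode copy; if len() above already reports 1, the truncation
# is happening in the pyodbc/ODBC driver decoding rather than in the XMP toolkit.
xmp.set_property(consts.XMP_NS_DC, u'DivisionName', u'%s' % value)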