Pyspark with AWS Glue join on multiple columns creating duplicates

I have two tables in AWS Glue, table_1 and table_2, that have almost identical schemas; however, table_2 has two additional columns. I am trying to join these two tables together on the columns they share, while adding the columns unique to table_2 with null values for the "old" data whose schema does not include them.
Currently, I am able to join the two tables, using something similar to:
joined_table = Join.apply(table_1, table_2, 'id', 'id')
where the first 'id' is the id column in table_1 and the second 'id' is the id column in table_2. This call successfully joins the two tables into one; however, the resulting joined_table has duplicate fields for the matching columns.
My two questions are:
How can I leverage an AWS Glue job with Pyspark to join on all columns that match across the two tables, so that there are no duplicate columns, while also adding the new fields?
This sample call only takes in the 'id' column, as I was trying to get it working first; however, I want to pass in all the columns that match across the two tables. How can I pass a list of columns to this Join.apply call? I am aware of the methods available in Pyspark directly, but I am wondering whether there is a way specific to AWS Glue jobs, or whether there is something I need to do within AWS Glue to leverage Pyspark functionality directly.

I found that I needed to rename the columns in table_1, and that I was missing a call to .drop_fields after my Join.apply call to remove the old columns from the joined table.
Additionally, you can pass in a list of column names rather than the single 'id' column that I was trying to use in the question.

joineddata = Join.apply(frame1 = table1, frame2 = table2, keys1 = ['id'], keys2 = ['id'], transformation_ctx = 'joinedData')
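Putting that together, a minimal sketch of the rename/join/drop sequence (the 'id_1' name is illustrative, not from the original job):

from awsglue.transforms import Join

# Rename table_1's key so the joined frame ends up with only one 'id' field
renamed_table_1 = table_1.rename_field('id', 'id_1')

# Join on the renamed key and the original key
joined_table = Join.apply(frame1=renamed_table_1, frame2=table_2,
                          keys1=['id_1'], keys2=['id'],
                          transformation_ctx='joinedData')

# Drop the renamed duplicate key from the joined result
joined_table = joined_table.drop_fields(['id_1'])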
The Join transform in AWS Glue doesn't handle duplicates; you need to convert to a DataFrame and then drop the duplicates.
If you have duplicates, try this (note that dropDuplicates returns a new DataFrame rather than modifying in place):
selectedFieldsDataFrame = joineddata.toDF()
selectedFieldsDataFrame = selectedFieldsDataFrame.dropDuplicates()
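If the rest of the job expects a DynamicFrame, you can convert the deduplicated DataFrame back; a short sketch, assuming the job's GlueContext is in scope as glueContext:

from awsglue.dynamicframe import DynamicFrame

# Wrap the deduplicated DataFrame back into a DynamicFrame
dedupedData = DynamicFrame.fromDF(selectedFieldsDataFrame, glueContext, 'dedupedData')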

Related

BQ Get labels from information schema

I need to get the labels of all the BQ tables in a project.
Currently the only way I found is to loop over all the tables and retrieve the labels.
tables = client.list_tables(dataset_id)
for table in tables:
    if table.labels:
        for label, value in table.labels.items():
            print(table.table_id, label, value)
This approach works but is time consuming.
Is there any possibility to get the labels using a unique BQ query?
INFORMATION_SCHEMA.TABLES doesn't return the labels.
The labels are exposed as an option in the INFORMATION_SCHEMA options views:
SELECT *
FROM INFORMATION_SCHEMA.SCHEMATA_OPTIONS
WHERE schema_name = 'schema'
  AND option_name = 'labels';
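To pull those labels back into Python with a single query, a hedged sketch using the BigQuery client library (the region qualifier is an assumption; adjust it to where your datasets live):

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT schema_name, option_value
    FROM `region-us`.INFORMATION_SCHEMA.SCHEMATA_OPTIONS
    WHERE option_name = 'labels'
"""
# One query returns the labels for every dataset in the region
for row in client.query(query).result():
    print(row.schema_name, row.option_value)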

Filter Dynamo DB rows

I want to filter all DynamoDB rows where 2 columns have the same value.
table = client.Table('XXX')
response = table.query(
    KeyConditionExpression=Key('column1').eq(KeyConditionExpression=Key('column2'))
)
This is wrong, as we can't pass a KeyConditionExpression inside an eq statement. I don't want to scan through all the rows and then filter them.
I've scanned through multiple resources and answers, but every resource talks about checking multiple columns against some value, not a condition involving two columns.
Is there any way we can achieve this?
Yes, this is possible.
If you want to query over all records you need to use scan; if you want to query only records with one specific partition key you can use query.
For both you can use a FilterExpression, which filters the records after retrieving them from the database but before returning them to the user (so beware: a scan with a filter still reads all your records).
A scan from the CLI could look like this:
aws dynamodb scan \
--table-name your-table \
--filter-expression "#i = #j" \
--expression-attribute-names '{"#i": "column1", "#j": "column2"}'
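The boto3 equivalent of that scan would look roughly like this (the table name is a placeholder):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('your-table')

# The filter is applied server-side after items are read, so the scan
# still consumes read capacity for every item in the table
response = table.scan(
    FilterExpression='#i = #j',
    ExpressionAttributeNames={'#i': 'column1', '#j': 'column2'}
)
items = response['Items']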
Alternatively, create a Global Secondary Index with a partition key of 'Column1Value#Column2Value'.
Then it's simply a matter of querying the GSI.

AWS Glue dynamic frame - no column headers if no data

I read the Glue catalog table, convert it to a dataframe, and print the schema using the below (Spark with Python):
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.toDF()
df.printSchema()
It works fine when the table has data.
But it doesn't print the schema if the table is empty (it is unable to get the schema of an empty table). As a result, downstream joins fail.
Is there a way to make the dynamic frame pick up the table schema from the catalog even for an empty table, or any other alternative?
I found a solution. It is not ideal, but it works. If you call apply_mapping() on your DynamicFrame, it will preserve the schema in the DataFrame. For example, if your table has a column last_name, you can do:
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name')
df = dyf.apply_mapping([
    ("last_name", "string", "last_name", "string")
]).toDF()
df.printSchema()
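With the schema pinned by apply_mapping, downstream joins no longer fail on the empty table; a minimal sketch, where other_df is a hypothetical DataFrame that also has a last_name column:

# Resolves even when df came from an empty table, because apply_mapping
# kept last_name in the schema
joined = df.join(other_df, on='last_name', how='left')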

How to substitute NULL with value in Power BI when joining one to many

In my model I have a table AssignedToUsers that doesn't contain any NULL values.
Table TotalCounts contains the number of tasks for each user.
Those two tables are joined on UserGUID, and table TotalCounts contains NULL values for UserGUID.
When I drop everything into one table, there is a NULL value for AssignedToUser.
How can I substitute the NULL value of AssignedToUser with "POOL"?
Under Edit Query I tried to create an additional column:
if [AssignedToUser] = null then "POOL" else [AssignedToUser]
But that didn't help.
UPDATE:
Thanks Alexis.
I have created the FullAssignedToUsers table, but when I try to make a relationship with TotalCounts on UserGUID, it doesn't like it.
Data in the new table looks like this:
UPDATE:
The .pbix file can be accessed here:
https://www.dropbox.com/s/95frggpaq6tce7q/User%20Open%20Closed%20Tasks%20Experiment.pbix?dl=0
I believe the problem here is that your join has UserGUID values that are not in your AssignedToUsers table.
To correct this, one possibility is to replace your AssignedToUsers table with one that contains all the UserGUID values from the TotalCounts table.
FullAssignedToUsers =
ADDCOLUMNS(VALUES(TotalCounts[UserGUID]),
    "AssignedToUser",
    LOOKUPVALUE(AssignedToUsers[AssignedToUser],
        AssignedToUsers[UserGUID], TotalCounts[UserGUID]))
That should get you the full outer join. You can then create the custom column as you described and use that column in your visual.
You'll probably want to break the relationships with the original AssignedToUsers table and create relationships with the new one instead.
If you don't want to take that extra step, you can do an ISBLANK inside your new table formula.
FullAssignedToUsers =
ADDCOLUMNS(VALUES(TotalCounts[UserGUID]),
    "AssignedToUser",
    IF(
        ISBLANK(
            LOOKUPVALUE(AssignedToUsers[AssignedToUser],
                AssignedToUsers[UserGUID], TotalCounts[UserGUID])),
        "POOL",
        LOOKUPVALUE(AssignedToUsers[AssignedToUser],
            AssignedToUsers[UserGUID], TotalCounts[UserGUID])))
Note: This is equivalent to doing a right outer join merge on the AssignedToUsers table in the query editor and then replacing the nulls with "POOL". I'd actually recommend approaching it that way instead.
Another way to approach it is to pull the AssignedToUser column over to the TotalCounts table in a custom column and use that column in your visual instead.
AssignedToUsers =
IF(ISBLANK(RELATED(AssignedToUsers[AssignedToUser])),
    "POOL",
    RELATED(AssignedToUsers[AssignedToUser]))
This is equivalent to doing a left outer join merge on the TotalCounts table in the query editor, expanding the AssignedToUser column, then replacing nulls with "POOL" in that column.
In DAX, missing values are Blank(), not null. Try this:
= IF(
    ISBLANK(AssignedToUsers[AssignedToUser]),
    "Pool",
    AssignedToUsers[AssignedToUser]
)

AWS Athena flattened data from nested JSON source

I'd like to create a table from a nested JSON in Athena. The solutions described here, using tools like the Hive Openx-JsonSerDe, attempt to mirror the full JSON data in the SQL statement. I just want to pull a few fields out of the JSON file and create the table. I can't seem to find any resources on how to do that.
E.g.
JSON file {"records": [{"a": "data1", "b": "data2", "c": "data3"}]}
The table I'd like to create has only columns a and b.
I think what you are trying to achieve is to unnest the array, transforming each array entry into one row.
This is possible by declaring only the fields you need in the table definition and querying the structure accordingly. Note that the struct below declares only the fields a and b; the JsonSerDe simply ignores c, which is how you pick out just a few fields.
Table definition:
CREATE EXTERNAL TABLE complex (
    records array<struct<a:string,b:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/test1/';
Query:
SELECT record.a, record.b
FROM complex
CROSS JOIN UNNEST(complex.records) AS t1(record);
For the sample file above, this returns a single row with a = data1 and b = data2.