Pattern matching with regular expression in spark dataframes using spark-shell - regex

Suppose we are given a dataset ("DATA") like:
YEAR | FIRST NAME | LAST NAME | VARIABLES
2008 | JOY | ANDERSON | spark|python|scala; 45;w/o sports;w datascience
2008 | STEVEN | JOHNSON | Spark|R; 90|56
2006 | NIHA | DIVA | w/o sports
and we have another dataset ("RESULT") like :
YEAR | FIRST NAME | LAST NAME
1992 | EMMA | CENA
2008 | JOY | ANDERSON
2008 | STEVEN | ANDERSON
2006 | NIHA | DIVA
and so on.
The output should be ("RESULT") :
YEAR | FIRST NAME | LAST NAME | SUBJECT | SCORE | SPORTS | DATASCIENCE
1992 | EMMA | CENA | | | |
2008 | JOY | ANDERSON | SPARK | 45 | FALSE | TRUE
2008 | JOY | ANDERSON | PYTHON | 45 | FALSE | TRUE
2008 | JOY | ANDERSON | SCALA | 45 | FALSE | TRUE
2008 | STEVEN | ANDERSON | | | |
2006 | NIHA | DIVA | | | FALSE |
2008 | STEVEN | JOHNSON | SPARK | 90 | |
2008 | STEVEN | JOHNSON | SPARK | 56 | |
2008 | STEVEN | JOHNSON | R | 90 | |
2008 | STEVEN | JOHNSON | R | 56 | |
and so on.
Please note that there are some rows in DATA which are not present in RESULT and vice versa. For example, "2008,STEVEN,JOHNSON" is not present in RESULT but is present in DATA, and such entries should be added to the RESULT dataset. The columns {SUBJECT, SCORE, SPORTS, DATASCIENCE} come from my intuition that "spark" refers to the SUBJECT and so on.
Hope you understand my query. And I am using spark-shell with spark dataframes.
Note that "Spark" and "spark" should be considered as same.

As explained in the comments, you can implement some of the tricky logic as in the answers to "splitting row in multiple row in spark-shell".
Data:
val df = List(
("2008","JOY ","ANDERSON ","spark|python|scala;45;w/o sports;w datascience"),
("2008","STEVEN ","JOHNSON ","Spark|R;90|56"),
("2006","NIHA ","DIVA ","w/o sports")
).toDF("YEAR","FIRST NAME","LAST NAME","VARIABLE")
I only highlight the relatively tricky parts; you can figure out the details yourself. I suggest handling the "w" and "w/o" tags separately. Furthermore, you have to explode the languages and the scores in separate statements, because Spark allows only one generator such as explode per select clause. This looks like:
val sep = "#" // assumed separator: any string that cannot occur in VARIABLE and has no special meaning in a regex
val step1 = df.withColumn("backrefReplace", split(regexp_replace('VARIABLE, "^([A-Za-z|]+)?;?([\\d\\|]+)?;?(w.*)?$", "$1" + sep + "$2" + sep + "$3"), sep))
  .withColumn("letter", explode(split('backrefReplace(0), "\\|")))
  .select('YEAR, $"FIRST NAME", $"LAST NAME", 'VARIABLE, 'letter,
    explode(split('backrefReplace(1), "\\|")).as("digits"),
    'backrefReplace(2).as("tags")
  )
which gives
scala> step1.show(false)
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|YEAR|FIRST NAME|LAST NAME|VARIABLE                                      |letter|digits|tags                    |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|spark |45    |w/o sports;w datascience|
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|python|45    |w/o sports;w datascience|
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|scala |45    |w/o sports;w datascience|
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |Spark |90    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |Spark |56    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |R     |90    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |R     |56    |                        |
|2006|NIHA      |DIVA     |w/o sports                                    |      |      |w/o sports              |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
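To make the regex step easier to follow outside Spark, here is a minimal plain-Python sketch of the same idea: three optional groups (names, digits, tags), then one row per name/score combination. The function name expand is made up for illustration, and the upper-casing is just one way to treat "Spark" and "spark" as the same.

```python
import re
from itertools import product

# Same three optional groups as the Spark regex: names, digits, tags.
PATTERN = re.compile(r"^([A-Za-z|]+)?;?([\d|]+)?;?(w.*)?$")


def expand(variable):
    """Return one (subject, score, tags) tuple per subject/score combination."""
    match = PATTERN.match(variable)
    subjects, scores, tags = (group or "" for group in match.groups())
    # Splitting an empty string yields [""], so a missing field still
    # produces one row, like explode(split(...)) does in the Spark code.
    return [
        (subject.upper(), score, tags)
        for subject, score in product(subjects.split("|"), scores.split("|"))
    ]


for row in expand("Spark|R;90|56"):
    print(row)
```

The input is assumed to have the shape shown in the sample data; a string the pattern cannot match would make match None and raise an error here.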
Then you have to handle capitalisation and the tags. For the tags, you can write relatively generic code using explode and pivot, but you have to do some cleaning to match your exact result. Here is an example:
List(("a;b;c")).toDF("str")
.withColumn("char",explode(split('str,";")))
.groupBy('str)
.pivot("char")
.count
.show()
+-----+---+---+---+
|  str|  a|  b|  c|
+-----+---+---+---+
|a;b;c|  1|  1|  1|
+-----+---+---+---+
Read more about pivot here
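The "cleaning" for the tags could look roughly like the following plain-Python sketch, which turns a tags string into TRUE/FALSE flags per name. The helper name parse_tags and the rule that only the prefix "w" means present are assumptions based on the sample data, not part of the Spark code above.

```python
def parse_tags(tags):
    """Map 'w/o sports;w datascience' to {'SPORTS': False, 'DATASCIENCE': True}."""
    flags = {}
    # Drop empty pieces so an empty tags string yields no flags at all.
    for tag in filter(None, (part.strip() for part in tags.split(";"))):
        prefix, _, name = tag.partition(" ")
        # "w" marks the item as present; any other prefix (such as "w/o")
        # is treated as absent.
        flags[name.strip().upper()] = (prefix == "w")
    return flags


print(parse_tags("w/o sports;w datascience"))
```

A name that never appears in the tags is simply missing from the dictionary, which corresponds to the empty cells in the expected output.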
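The final step is simply to join with the second dataset (the first "RESULT") on YEAR, FIRST NAME and LAST NAME. Note that a left join keeps only the rows of its left side; since the expected output contains rows that exist in only one of the two datasets, a full outer join is what keeps them all.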

Related

Checking for a Range of Values

To check for a range of values, I can use the BETWEEN operator.
MySQL [distributor]> select prod_name, prod_price from products where prod_price between 3.49 and 11.99;
+---------------------+------------+
| prod_name | prod_price |
+---------------------+------------+
| Fish bean bag toy | 3.49 |
| Bird bean bag toy | 3.49 |
| Rabbit bean bag toy | 3.49 |
| 8 inch teddy bear | 5.99 |
| 12 inch teddy bear | 8.99 |
| 18 inch teddy bear | 11.99 |
| Raggedy Ann | 4.99 |
| King doll | 9.49 |
| Queen doll | 9.49 |
+---------------------+------------+
9 rows in set (0.005 sec)
I referred to the Django docs and found gte, gt, lt, lte but no between.
How could I achieve the between functionality?
Use the range lookup in the Django ORM: products.objects.filter(prod_price__range=(3.49, 11.99)). See the Django QuerySet reference for more info.
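The range lookup is translated to a SQL BETWEEN, which includes both end points, so it is equivalent to combining gte and lte. Below is a small sketch of that equivalence using Python's built-in sqlite3 module instead of Django (so it runs without a project); the table holds a few rows from the question plus one made-up row outside the range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (prod_name TEXT, prod_price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [
        ("Fish bean bag toy", 3.49),
        ("8 inch teddy bear", 5.99),
        ("King doll", 9.49),
        ("18 inch teddy bear", 11.99),
        ("Hypothetical cheap toy", 1.99),  # made-up row, outside the range
    ],
)

# What prod_price__range=(3.49, 11.99) boils down to.
between = conn.execute(
    "SELECT prod_name FROM products "
    "WHERE prod_price BETWEEN ? AND ? ORDER BY prod_name",
    (3.49, 11.99),
).fetchall()

# What prod_price__gte=3.49 combined with prod_price__lte=11.99 boils down to.
bounds = conn.execute(
    "SELECT prod_name FROM products "
    "WHERE prod_price >= ? AND prod_price <= ? ORDER BY prod_name",
    (3.49, 11.99),
).fetchall()

print(between == bounds, len(between))
```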

PowerBI: Use non-shown values for Drillthrough

I am trying to build a Power BI report for data from a SQL database where I have to show detail pages using Drillthrough. The only viable way to connect the datasets is using the database row ids.
From a user's perspective the row ids would not add any value but a lot of noise.
Is there a way to drillthrough using the row ids without showing them in a visual?
Yes, this is possible in the current release of Power BI Desktop using a workaround that involves hiding the row id column on the parent (or summary) page.
Take the following tables as example:
ALBUM
+---------+------------------------+
| AlbumId | AlbumName |
+---------+------------------------+
| 1 | Hoist |
+---------+------------------------+
| 2 | The Story Of the Ghost |
+---------+------------------------+
TRACK
+---------+---------+--------------------------+
| TrackId | AlbumId | TrackName |
+---------+---------+--------------------------+
| 1 | 1 | Julius |
+---------+---------+--------------------------+
| 2 | 1 | Down With Disease |
+---------+---------+--------------------------+
| 3 | 1 | If I Could |
+---------+---------+--------------------------+
| 4 | 1 | Riker's Mailbox |
+---------+---------+--------------------------+
| 5 | 1 | Axilla, Part II |
+---------+---------+--------------------------+
| 6 | 1 | Lifeboy |
+---------+---------+--------------------------+
| 7 | 1 | Sample In a Jar |
+---------+---------+--------------------------+
| 8 | 1 | Wolfmans Brother |
+---------+---------+--------------------------+
| 9 | 1 | Scent of a Mule |
+---------+---------+--------------------------+
| 10 | 1 | Dog Faced Boy |
+---------+---------+--------------------------+
| 11 | 1 | Demand |
+---------+---------+--------------------------+
| 12 | 2 | Ghost |
+---------+---------+--------------------------+
| 13 | 2 | Birds of a Feather |
+---------+---------+--------------------------+
| 14 | 2 | Meat |
+---------+---------+--------------------------+
| 15 | 2 | Guyute |
+---------+---------+--------------------------+
| 16 | 2 | Fikus |
+---------+---------+--------------------------+
| 17 | 2 | Shafty |
+---------+---------+--------------------------+
| 18 | 2 | Limb by Limb |
+---------+---------+--------------------------+
| 19 | 2 | Frankie Says |
+---------+---------+--------------------------+
| 20 | 2 | Brian and Robert |
+---------+---------+--------------------------+
| 21 | 2 | Water in the Sky |
+---------+---------+--------------------------+
| 22 | 2 | Roggae |
+---------+---------+--------------------------+
| 23 | 2 | Wading in the Velvet Sea |
+---------+---------+--------------------------+
| 24 | 2 | The Moma Dance |
+---------+---------+--------------------------+
| 25 | 2 | End of Session |
+---------+---------+--------------------------+
Add them as data sources. The 1:many relationship between AlbumId should be created. Create a parent page with a table containing AlbumId and AlbumName. Then create the details page with a table containing only the TrackName column. In the Drillthrough filter field of the details page, drag the Album Table -> AlbumId to this field.
Now go back to the parent page and notice that when you right click on an album, you get the drillthrough menu to the details page. This works, but now you have a messy AlbumId column on your parent page.
The workaround is to hide the AlbumId on the parent report. First go to the Format(Paint roller) menu of the table on the parent report and in the column header -> word wrap turn this off. Then drag the column separator of the table to hide the AlbumId. See before and after images below.
BEFORE HIDE
AFTER HIDE
I have the powerbi file posted here if you want to see it in action.

unique column value per row

I find it best to use an example, so here we go:
Say I have a table with chores and a table with a weekly schedule like this:
CHORES:
|----+---------------+----------+-------|
| id | name | type | hours |
|----+---------------+----------+-------|
| 1 | clean kitchen | cleaning | 4 |
|----+---------------+----------+-------|
| 2 | clean toilet | cleaning | 3 |
etc
SCHEDULE:
|------+---------------+---------------+-----|
| week | monday | tuesday | etc |
|------+---------------+---------------+-----|
| 1 | clean kitchen | clean toilet | etc |
|------+---------------+---------------+-----|
| 2 | clean toilet | clean kitchen | etc |
etc
I want to make sure that for one week, you can't have duplicate cells, so this wouldn't be allowed:
SCHEDULE:
|------+---------------+--------------+-----|
| week | monday | tuesday | etc |
|------+---------------+--------------+-----|
| 1 | clean toilet | clean toilet | etc |
etc
What would I have to do in my models.py to get this behaviour?
Try Django's unique_together model Meta option.
https://docs.djangoproject.com/en/1.11/ref/models/options/#unique-together
I'd rather use a ManyToMany through another table, like this:
SCHEDULE:
|------+------------------------+
| week | chores |
|------+------------------------+
| 1 | many to many to chores |
|------+------------------------+
| 2 | many to many to chores |
And a through table like this:
THROUGH TABLE:
|---------+---------------+---------------+
| week_id | day of week | chores_id |
|---------+---------------+---------------+
| 1 | Monday | clean toilet |
|---------+---------------+---------------+
| 1 | Tuesday | clean kitchen |
And in that table, make week_id and chores_id unique together.
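At the database level, unique together on the through table ends up as a composite UNIQUE constraint. Here is a rough sketch of the effect using Python's built-in sqlite3 module rather than Django models; the table name schedule_entry and the integer chore ids are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE schedule_entry (
        week_id     INTEGER NOT NULL,
        day_of_week TEXT    NOT NULL,
        chore_id    INTEGER NOT NULL,
        UNIQUE (week_id, chore_id)  -- the "unique together" pair
    )
    """
)

# Week 1, Monday: chore 2 ("clean toilet" in the question's CHORES table).
conn.execute("INSERT INTO schedule_entry VALUES (1, 'Monday', 2)")

# The same chore again in the same week violates the constraint.
try:
    conn.execute("INSERT INTO schedule_entry VALUES (1, 'Tuesday', 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The same chore in a different week is still allowed.
conn.execute("INSERT INTO schedule_entry VALUES (2, 'Monday', 2)")
row_count = conn.execute("SELECT COUNT(*) FROM schedule_entry").fetchone()[0]

print(duplicate_rejected, row_count)
```

If each day should also hold at most one chore per week, a second constraint on (week_id, day_of_week) would be needed; that is an extra assumption, not something stated in the question.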

How to run raw query with a model with dynamic fields in Django 1.9?

I have a complex result that requires writing raw sql queries.
See https://stackoverflow.com/a/38548462/80353
The expected result is a table showing several columns.
The first column header is simply Product and the other column headers are store names.
The values are simply the product names and the aggregated sales values of the product in these stores.
Which stores will be shown is entirely dynamic. Maximum should be 9 stores.
The same in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to display by pagination in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
For more details of the schema, check the question in How to get back aggregate values across 2 dimensions using Python Cubes?
My question
The schema is not super important to my question which is:
Since I am going to write a complex raw query, is there a way to map the query result to a model where the fields are dynamic?
I found documentation about how to execute raw queries in Django and how to execute raw queries to existing models with fixed fields and matching table.
Is it possible to do that for a model that has no matching table and has dynamic fields?
If so, how?
Or if I choose to use materialised view in postgresql, how do I match it with a model class?

Two outer joins in django's queryset (for language fall back case)

I have two models, Version and Description.
from django.db import models

class Version(models.Model):
    version_name = models.CharField(max_length=100)
    version_value = models.IntegerField()
    url = models.CharField(max_length=240)

class Description(models.Model):
    version = models.ForeignKey(Version)
    lang = models.CharField(max_length=8)
    content = models.TextField()
And a DescriptionSerializer.
from rest_framework import serializers

class DescriptionSerializer(serializers.ModelSerializer):
    version_name = serializers.RelatedField(source='version')

    class Meta:
        model = Description
        fields = ('version_name', 'content')
They store the descriptions of different versions in different languages.
E.g.
Version
+----+--------------+---------------+---------------------+
| id | version_name | version_value | url |
+----+--------------+---------------+---------------------+
| 1 | 1.0.0 | 1 | http://abc.net.tw/ |
| 2 | 1.0.1 | 2 | http://abc.net.tw/2 |
| 3 | 1.0.2 | 3 | http://abc.net.tw/3 |
| 4 | 1.0.3 | 4 | http://abc.net.tw/4 |
| 7 | 1.1.0 | 5 | http://abc.net.tw/5 |
| 8 | 1.1.1 | 6 | http://abc.net.tw/6 |
+----+--------------+---------------+---------------------+
Description
+------------+-------+---------+
| version_id | lang | content |
+------------+-------+---------+
| 1 | en_US | English |
| 1 | zh_TW | Chinese |
| 1 | es_ES | Spanish |
| 2 | en_US | English |
| 2 | zh_TW | Chinese |
| 2 | es_ES | Spanish |
| 3 | en_US | English |
| 3 | zh_TW | Chinese |
| 3 | es_ES | Spanish |
| 4 | en_US | English |
| 7 | en_US | English |
| 8 | en_US | English |
| 4 | es_ES | Spanish |
| 7 | es_ES | Spanish |
+------------+-------+---------+
I'm using Django REST framework to implement a web API that returns the description of each version in a certain language. If a description in that language doesn't exist, the English version should be used instead.
I can use following SQL to retrieve the desired result. I've read DRF's docs on relatedField and reverse relation. But I still can't figure out how to use django's ORM to do the same thing and to use it with django rest framework's serializer.
select
    coalesce(d.id, d2.id),
    coalesce(d.version_id, d2.version_id),
    coalesce(d.lang, d2.lang),
    coalesce(d.content, d2.content)
from
    version v
    left outer join description d on v.id = d.version_id and d.lang='zh_TW'
    left outer join description d2 on v.id = d2.version_id and d2.lang='en_US'
Please advise how to do it in django.
You can't use the Django ORM for everything; there are numerous things it can't express. For those cases you either use straight SQL (from django.db import connection) or, if the query results can be mapped onto the models you have described, you can use raw queries (Manager.raw()).
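To show that the fallback query behaves as intended, here is a small sketch that runs it with Python's built-in sqlite3 module on a cut-down copy of the sample data. The helper name descriptions is made up. In Django the same SQL would go through connection.cursor() or Manager.raw(), with %s placeholders instead of ?.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE version (id INTEGER PRIMARY KEY, version_name TEXT);
    CREATE TABLE description (
        id INTEGER PRIMARY KEY, version_id INTEGER, lang TEXT, content TEXT);
    INSERT INTO version VALUES (1, '1.0.0'), (4, '1.0.3');
    INSERT INTO description (version_id, lang, content) VALUES
        (1, 'en_US', 'English'), (1, 'zh_TW', 'Chinese'),
        (4, 'en_US', 'English'), (4, 'es_ES', 'Spanish');
    """
)


def descriptions(lang, fallback="en_US"):
    """Return (version_name, lang, content), falling back per version."""
    sql = """
        SELECT v.version_name,
               COALESCE(d.lang, d2.lang),
               COALESCE(d.content, d2.content)
        FROM version v
        LEFT OUTER JOIN description d
            ON v.id = d.version_id AND d.lang = ?
        LEFT OUTER JOIN description d2
            ON v.id = d2.version_id AND d2.lang = ?
        ORDER BY v.id
    """
    return conn.execute(sql, (lang, fallback)).fetchall()


for row in descriptions("zh_TW"):
    print(row)
```

Version 1.0.3 has no zh_TW description in this data, so its row comes from the en_US join.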