Using REGEXP in a loop reading from another table in MySQL - regex

I have a question about the use of REGEXP in a query. I have a table in my database with nearly 553,000 values (34,517 rows x 16 columns). I also have a list of values that I need to find in that table. Using REGEXP, I have had success finding some values with this statement:
SELECT * FROM `TableA` WHERE ((*desiredvalue* REGEXP 'a|b|c|d...'));
Now the list of desired values has grown from 20 to 1,700, so is there some way to put these new values in a single column of a TableB and search TableA by looping over that new table? My first instinct was to save the query and paste in all 1,700 values, but the idea is to do it automatically whenever TableB is updated.
Here is an example of my initial matrix (all values are 14-character strings):
+-----+---+---+---+---+-----+----+
|Group|SP1|SP2|SP3|SP4|.....|SP15|
+-----+---+---+---+---+-----+----+
|G1   |a  |b  |c  |d  |.....|x   |
|G2   |   |b  |h  |d  |.....|z   |
|G4   |a  |b  |   |m  |.....|r   |
|G5   |o  |p  |q  |r  |.....|h   |
+-----+---+---+---+---+-----+----+
The idea is that if I have a list of values val=(a,c,h,r,p), I obtain this result:
+---+-----+
|val|Group|
+---+-----+
|a  |G1   |
|a  |G4   |
|c  |G1   |
|h  |G2   |
|r  |G4   |
|r  |G5   |
|p  |G5   |
+---+-----+
Thanks!
Christian
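A minimal sketch of one way to do this (assuming TableB has a single column val holding the 1,700 search values; the connection details and the elided SP columns are placeholders): join TableA against TableB, so the result always reflects whatever TableB currently contains, with no manual pasting.
# Untested sketch using mysql-connector-python; table and column names are
# assumptions based on the examples in the question.
import mysql.connector

QUERY = """
    SELECT b.val, a.`Group`
    FROM TableB AS b
    JOIN TableA AS a
      ON b.val IN (a.SP1, a.SP2, a.SP3, a.SP4, /* ... */ a.SP15)
"""

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cur = conn.cursor()
cur.execute(QUERY)  # re-running this always picks up the current TableB
for val, group in cur.fetchall():
    print(val, group)
conn.close()
Because every value is a full 14-character string, an exact comparison with IN sidesteps the partial-match pitfalls of REGEXP; the same JOIN can also be run directly in MySQL without any client code.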

Related

How can I use regex_replace in pyspark to reformat the date from yyyymmdd to yyyy/mm/dd and reformat time from HHmmss to HH:mm:ss

I am trying to use regexp_replace to reformat a date column from yyyymmdd to yyyy/mm/dd and another column from HHmmss to HH:mm:ss. Both the date and time columns are strings.
From:
+----------+--------+
| date     | time   |
+----------+--------+
| 20200326 | 122450 |
+----------+--------+
To:
+------------+----------+
| date       | time     |
+------------+----------+
| 2020/03/26 | 12:24:50 |
+------------+----------+
Here's what I've tried:
datePattern = "([0-9]{4})([0-9]{2})([0-9]{2})"
timePattern = "([0-9]{2})([0-9]{2})([0-9]{2})"
df.withColumn("date", regexp_replace(df.date, datePattern, "$1/$2/$3"))
df.withColumn("time", regexp_replace(df.time, timePattern, "$1:$2:$3"))
Here's what I get:
+----------+--------+
| date     | time   |
+----------+--------+
| 20200326 | 122450 |
+----------+--------+
I'm not sure where I went wrong. Also, are there better practices than using regexp_replace?
Use the from_unixtime and unix_timestamp functions instead of regexp_replace:
df.show()
#+--------+------+
#|    date|  time|
#+--------+------+
#|20200326|122450|
#+--------+------+
df.withColumn("date",from_unixtime(unix_timestamp(col("date"),"yyyyMMdd"),"yyyy/MM/dd")).\
withColumn("time",from_unixtime(unix_timestamp(col("time"),"HHmmss"),"HH:mm:ss")).\
show()
#+----------+--------+
#|      date|    time|
#+----------+--------+
#|2020/03/26|12:24:50|
#+----------+--------+
From Spark 2.2+:
We can use the to_date(), to_timestamp() and date_format() functions for this case too:
from pyspark.sql.functions import *
df.withColumn("date",date_format(to_date(col("date"),"yyyyMMdd"),"yyyy/MM/dd")).\
withColumn("time",date_format(to_timestamp(col("time"),"HHmmss"),"HH:mm:ss")).\
show()
#+----------+--------+
#|      date|    time|
#+----------+--------+
#|2020/03/26|12:24:50|
#+----------+--------+
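As a side note on why the regexp_replace attempt in the question appeared to do nothing: withColumn returns a new DataFrame rather than modifying df in place, so the result has to be assigned back. A minimal sketch of that fix, keeping the question's patterns ($1-style backreferences do work in Spark's regexp_replace):
from pyspark.sql.functions import regexp_replace, col

# Reassign the result; DataFrame transformations never mutate df in place.
df = df.withColumn("date", regexp_replace(col("date"), "([0-9]{4})([0-9]{2})([0-9]{2})", "$1/$2/$3")).\
    withColumn("time", regexp_replace(col("time"), "([0-9]{2})([0-9]{2})([0-9]{2})", "$1:$2:$3"))
df.show()
#+----------+--------+
#|      date|    time|
#+----------+--------+
#|2020/03/26|12:24:50|
#+----------+--------+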

Checking for a Range of Values

To check for a range of values in SQL, I can use the BETWEEN operator.
MySQL [distributor]> select prod_name, prod_price from products where prod_price between 3.49 and 11.99;
+---------------------+------------+
| prod_name           | prod_price |
+---------------------+------------+
| Fish bean bag toy   |       3.49 |
| Bird bean bag toy   |       3.49 |
| Rabbit bean bag toy |       3.49 |
| 8 inch teddy bear   |       5.99 |
| 12 inch teddy bear  |       8.99 |
| 18 inch teddy bear  |      11.99 |
| Raggedy Ann         |       4.99 |
| King doll           |       9.49 |
| Queen doll          |       9.49 |
+---------------------+------------+
9 rows in set (0.005 sec)
I referred to the Django docs and found gte, gt, lt, and lte, but no between.
How can I achieve the BETWEEN functionality?
Use this in the Django ORM: products.objects.filter(prod_price__range=(3.49, 11.99)). See the QuerySet field lookup reference for more info.
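A small sketch for comparison (the model and field names are taken from the question; __range maps to an inclusive SQL BETWEEN, so both forms below return the same rows):
# Assumes a Django model roughly like:
#   class Products(models.Model):
#       prod_price = models.DecimalField(max_digits=6, decimal_places=2)

low, high = 3.49, 11.99

# BETWEEN-style filter, inclusive on both ends
in_range = Products.objects.filter(prod_price__range=(low, high))

# Equivalent written with the gte/lte lookups from the docs
also_in_range = Products.objects.filter(prod_price__gte=low, prod_price__lte=high)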

Pattern matching with regular expression in spark dataframes using spark-shell

Suppose we are given a dataset ("DATA") like:
YEAR | FIRST NAME | LAST NAME | VARIABLES
2008 | JOY | ANDERSON | spark|python|scala; 45;w/o sports;w datascience
2008 | STEVEN | JOHNSON | Spark|R; 90|56
2006 | NIHA | DIVA | w/o sports
and we have another dataset ("RESULT") like:
YEAR | FIRST NAME | LAST NAME
1992 | EMMA | CENA
2008 | JOY | ANDERSON
2008 | STEVEN | ANDERSON
2006 | NIHA | DIVA
and so on.
The output should be ("RESULT") :
YEAR | FIRST NAME | LAST NAME | SUBJECT | SCORE | SPORTS | DATASCIENCE
1992 | EMMA | CENA | | | |
2008 | JOY | ANDERSON | SPARK | 45 | FALSE | TRUE
2008 | JOY | ANDERSON | PYTHON | 45 | FALSE | TRUE
2008 | JOY | ANDERSON | SCALA | 45 | FALSE | TRUE
2008 | STEVEN | ANDERSON | | | |
2006 | NIHA | DIVA | | | FALSE |
2008 | STEVEN | JOHNSON | SPARK | 90 | |
2008 | STEVEN | JOHNSON | SPARK | 56 | |
2008 | STEVEN | JOHNSON | R | 90 | |
2008 | STEVEN | JOHNSON | R | 56 | |
and so on.
Please note that there are some rows in DATA which are not present in RESULT, and vice versa. For example, "2008,STEVEN,JOHNSON" is not present in RESULT but is present in DATA; these entries should also be added to the RESULT dataset. The columns {SUBJECT, SCORE, SPORTS, DATASCIENCE} are made by my intuition that "spark" refers to the SUBJECT and so on.
I hope you understand my query. I am using spark-shell with Spark DataFrames.
Note that "Spark" and "spark" should be considered the same.
As explained in the comments, you can implement some of the tricky logic as in the answers to "splitting row in multiple row in spark-shell".
Data:
val df = List(
("2008","JOY ","ANDERSON ","spark|python|scala;45;w/o sports;w datascience"),
("2008","STEVEN ","JOHNSON ","Spark|R;90|56"),
("2006","NIHA ","DIVA ","w/o sports")
).toDF("YEAR","FIRST NAME","LAST NAME","VARIABLE")
I only highlight the relatively tricky parts; you can figure out the details yourself. I suggest handling the "w" and "w/o" tags separately. Furthermore, you have to do each explode in a separate statement. (In the code below, sep is assumed to be a separator string that does not occur in the data.) This gives:
val step1 = df.withColumn("backrefReplace",split(regexp_replace('VARIABLE,"^([A-z|]+)?;?([\\d\\|]+)?;?(w.*)?$","$1"+sep+"$2"+sep+"$3"),sep))
.withColumn("letter",explode(split('backrefReplace(0),"\\|")))
.select('YEAR,$"FIRST NAME",$"LAST NAME",'VARIABLE,'letter,
explode(split('backrefReplace(1),"\\|")).as("digits"),
'backrefReplace(2).as("tags")
)
which gives
scala> step1.show(false)
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|YEAR|FIRST NAME|LAST NAME|VARIABLE |letter|digits|tags |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|2008|JOY |ANDERSON |spark|python|scala;45;w/o sports;w datascience|spark |45 |w/o sports;w datascience|
|2008|JOY |ANDERSON |spark|python|scala;45;w/o sports;w datascience|python|45 |w/o sports;w datascience|
|2008|JOY |ANDERSON |spark|python|scala;45;w/o sports;w datascience|scala |45 |w/o sports;w datascience|
|2008|STEVEN |JOHNSON |Spark|R;90|56 |Spark |90 | |
|2008|STEVEN |JOHNSON |Spark|R;90|56 |Spark |56 | |
|2008|STEVEN |JOHNSON |Spark|R;90|56 |R |90 | |
|2008|STEVEN |JOHNSON |Spark|R;90|56 |R |56 | |
|2006|NIHA |DIVA |w/o sports | | |w/o sports |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
Then you have to handle capitalisation and the tags. For the tags, you can write relatively generic code using explode and pivot, but you have to do some cleaning to match your exact result. Here is an example:
List(("a;b;c")).toDF("str")
.withColumn("char",explode(split('str,";")))
.groupBy('str)
.pivot("char")
.count
.show()
+-----+---+---+---+
|  str|  a|  b|  c|
+-----+---+---+---+
|a;b;c|  1|  1|  1|
+-----+---+---+---+
Read more about pivot here
The final step is simply to do a left join with the second dataset (the original "RESULT").

Rescale Dataset using Power BI

I'm trying to rescale a dataset using Power BI Desktop. I've imported a dataset full of raw data, but I can't use row context together with an aggregate. I'm trying to accomplish this:
Data:
+---------+-----+
| Name    | Bar |
+---------+-----+
| Alfred  |   0 |
| Alfred  |  -1 |
| Alfred  |   1 |
| Burt    |   1 |
| Burt    |   0 |
| Charlie |   1 |
| Charlie |   1 |
| Charlie |   0 |
+---------+-----+
Calculations:
Foo: = SUM(Bar) / COUNT(Bar) GROUP BY Name
which would generate this dataset:
+---------+-----+
| Name    | Foo |
+---------+-----+
| Alfred  |   0 |
| Burt    |  .5 |
| Charlie | .67 |
+---------+-----+
Final Calculation:
Score: = (#Foo - MIN(Foo)) / (MAX(Foo)-MIN(Foo))
The goal is to grade on a curve with a set of data. I can do it in Excel, but I was hoping that Power BI could handle all the heavy lifting.
At this point it might be easier to do it all in SQL before bringing it into PowerBI, but that would make it significantly less dynamic (with date filters and the like). Thanks for any insight you might have!
I think you're looking for the GROUPBY DAX function. https://support.office.com/en-us/article/GROUPBY-Function-DAX-d6d064b2-fd8b-4c1b-97f8-c6d03cdf8ad0
You would then GROUPBY on the Name field and proceed from there. If you need to use the measure outside of a visual that groups by each Name (like showing the average score after applying the curve), then you'll need to wrap that in a calculated table where you include the names and your measure projected as a column, and then do your aggregates (min/max/average) over that calculated table.

how to classify the whole data set in weka

I've got a supervised data set with 6836 instances, and I need to know the predictions of my model for all the instances, not only for a test set.
I followed a train-test approach (2/3-1/3) to measure my TPR and FPR rates, and I've got the predictions for my test set (1/3), but I need to know the predictions for all 6,836 instances.
How can I do it?
Thanks!
In the Classify tab in Weka Explorer there should be a button that says 'More options...'; if you go in there, you should be able to output predictions as plain text. If you use cross-validation rather than a percentage split, you will get predictions for all instances in a table like this:
+-------+--------+-----------+-------+------------+
| inst# | actual | predicted | error | prediction |
+-------+--------+-----------+-------+------------+
| 1     | 2:no   | 1:yes     | +     | 0.926      |
| 2     | 1:yes  | 1:yes     |       | 0.825      |
| 1     | 2:no   | 1:yes     | +     | 0.636      |
| 2     | 1:yes  | 1:yes     |       | 0.808      |
| ...   | ...    | ...       | ...   | ...        |
+-------+--------+-----------+-------+------------+
If you don't want to do cross-validation, you can also create a data set containing all your data (training + test) and add it as the test set. Then you can go to 'More options...' and output the predictions as Campino already answered.