Parsing a string with regexp_extract in PySpark - regex

I am trying to split a string into different columns using regular expressions.
Below is my data:
from pyspark.sql.functions import col, regexp_extract

decodeData = [('M|C705|Exx05','2'),
              ('M|Exx05','4'),
              ('M|C705 P|Exx05','6'),
              ('M|C705 P|8960 L|Exx05','7'),
              ('M|C705 P|78|8960','9')]
df = sc.parallelize(decodeData).toDF(['Decode',''])
dfNew = df.withColumn('Exx05', regexp_extract(col('Decode'), '(M|P|M)(\\|Exx05)', 1)) \
          .withColumn('C705', regexp_extract(col('Decode'), '(M|P|M)(\\|C705)', 1)) \
          .withColumn('8960', regexp_extract(col('Decode'), '(M|P|M)(\\|8960)', 1))
dfNew.show()
Result
+--------------------+---+-----+----+-----+
| Decode| |Exx05|C705| 8960|
+--------------------+---+-----+----+-----+
| M|C705|Exx05 | 2 | | M| |
| M|Exx05 | 4 | M| | |
| M|C705 P|Exx05 | 6 | P| M| |
|M|C705 P|8960 L|Exx05| 7 | M| M| P|
| M|C705 P|78|8960 | 9 | | M| |
+--------------------+---+-----+----+-----+
Here I am trying to extract the code for the strings Exx05, C705 and 8960; each can fall under the M/P/L codes.
e.g.: while decoding 'M|C705 P|8960 L|Exx05' I expect the results L, M, P in the respective columns. However, I am missing some logic here, which I am finding difficult to crack.
Expected results
+--------------------+---+-----+----+-----+
| Decode| |Exx05|C705| 8960|
+--------------------+---+-----+----+-----+
| M|C705|Exx05 | | M| M| |
| M|Exx05 | | M| | |
| M|C705 P|Exx05 | | P| M| |
|M|C705 P|8960 L|Exx05| | L| M| P|
| M|C705 P|78|8960 | | | M| P|
+--------------------+---+-----+----+-----+
When I try to change the regular expression accordingly, it works for some cases and not for others, and this is just a subset of the actual data I am working on.
e.g.: 1. Exx05 can fall under any code M/L/P, and at any position: beginning, middle, end, etc.
2. One Decode can only belong to one code (M, L or P) per entry/ID, i.e. 'M|Exx05 P|8960 L|Exx05' (where Exx05 falls under both M and L) will not occur.

You can add ([^ ])* to the regex to extend it so that it matches any run of consecutive characters that are not separated by a space:
dfNew = df.withColumn(
    'Exx05',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|Exx05)', 1)
).withColumn(
    'C705',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|C705)', 1)
).withColumn(
    '8960',
    regexp_extract(col('Decode'), '(M|P|L)([^ ])*(\\|8960)', 1)
)
dfNew.show(truncate=False)
+---------------------+---+-----+----+----+
|Decode | |Exx05|C705|8960|
+---------------------+---+-----+----+----+
|M|C705|Exx05 |2 |M |M | |
|M|Exx05 |4 |M | | |
|M|C705 P|Exx05 |6 |P |M | |
|M|C705 P|8960 L|Exx05|7 |L |M |P |
|M|C705 P|78|8960     |9  |     |M   |P   |
+---------------------+---+-----+----+----+
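As a quick sanity check (my own addition, not from the answer), the same pattern can be exercised with Python's re module, whose dialect matches Spark's closely enough for this expression:
import re

# Same idea as the Spark pattern '(M|P|L)([^ ])*(\\|Exx05)': a code letter,
# then any run of non-space characters, then '|Exx05'.
pattern = re.compile(r'(M|P|L)[^ ]*\|Exx05')
for s in ['M|C705|Exx05', 'M|C705 P|8960 L|Exx05', 'M|Exx05']:
    m = pattern.search(s)
    print(s, '->', m.group(1) if m else '')
# M|C705|Exx05 -> M
# M|C705 P|8960 L|Exx05 -> L
# M|Exx05 -> M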

What about using X(?=Y), also known as a lookahead assertion? It ensures we match X only if it is followed by Y.
from pyspark.sql.functions import *

dfNew = df.withColumn('Exx05', regexp_extract(col('Decode'), '([A-Z](?=\\|Exx05))', 1)) \
          .withColumn('C705', regexp_extract(col('Decode'), '([A-Z](?=\\|C705))', 1)) \
          .withColumn('8960', regexp_extract(col('Decode'), '([A-Z]+(?=\\|[0-9]|8960))', 1))
dfNew.show()
+--------------------+---+-----+----+----+
|              Decode|   |Exx05|C705|8960|
+--------------------+---+-----+----+----+
| M|C705|Exx05| 2| | M| |
| M|Exx05| 4| M| | |
| M|C705 P|Exx05| 6| P| M| |
|M|C705 P|8960 L|E...| 7| L| M| P|
| M|C705 P|78|8960| 9| | M| P|
+--------------------+---+-----+----+----+
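Again just to illustrate (not part of the answer), the lookahead behaves the same way in plain Python, which helps when debugging the pattern before handing it to regexp_extract:
import re

s = 'M|C705 P|8960 L|Exx05'
# [A-Z](?=\|Exx05) matches a capital letter only when '|Exx05' immediately follows it.
print(re.search(r'[A-Z](?=\|Exx05)', s).group())  # L
print(re.search(r'[A-Z](?=\|C705)', s).group())   # M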


How to extract words before the first punctuation | in presto SQL?
Table
+----+------------------------------------+
| id | title |
+----+------------------------------------+
| 1 | LLA | Rec | po#069762 | saddasd |
| 2 | Hello amustromg dsfood |
| 3 | Hel | sdfke bones. |
+----+------------------------------------+
Output
+----+------------------------------------+
| id | result |
+----+------------------------------------+
| 1 | LLA |
| 2 | |
| 3 | Hel |
+----+------------------------------------+
Attempt
REGEXP_EXTRACT(title, '(.*)([^|]*)', 1)
Thank you
Using the base string functions we can try:
SELECT id,
       CASE WHEN title LIKE '%|%'
            THEN TRIM(SUBSTR(title, 1, STRPOS(title, '|') - 1))
            ELSE '' END AS result
FROM yourTable
ORDER BY id;
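If you would rather keep a regex, the usual trick is an anchored negated character class; here is a minimal Python sketch of that rule (my own illustration, with the data taken from the question):
import re

def before_first_pipe(title: str) -> str:
    # ([^|]+)\| captures everything before the first '|'.
    # Titles without a pipe produce no match, mirroring the ELSE '' branch.
    m = re.match(r'([^|]+)\|', title)
    return m.group(1).strip() if m else ''

titles = ['LLA | Rec | po#069762 | saddasd', 'Hello amustromg dsfood', 'Hel | sdfke bones.']
print([before_first_pipe(t) for t in titles])  # ['LLA', '', 'Hel']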

Calculate Total by specific column, then apply column total to each row

I would like to apply the total count('case_id') while grouping by Item.
This was my previous question: DAX Measure to calculate aggregate data, but group by Case ID. That gave me the total count('case_id') by Sub_Item.
Measure =
VAR datesSelection =
    DATE(
        YEAR(SELECTEDVALUE('Date Selection'[DateWoTime])),
        MONTH(SELECTEDVALUE('Date Selection'[DateWoTime])),
        DAY(SELECTEDVALUE('Date Selection'[DateWoTime]))
    )
VAR devicesTotal =
    CALCULATETABLE(
        VALUES(Outages[Sub_Item]),
        ALLSELECTED(Outages),
        Outages[DATE] >= datesSelection,
        VALUES(Outages[Sub_Item])
    )
VAR counts =
    CALCULATE(
        COUNT(Outages[CASE_ID]),
        ALLSELECTED(Outages),
        Outages[Sub_Item] IN devicesTotal
    )
RETURN
    counts
I'm getting this.
| Item | Sub_Item | TYPE | Case ID | Date | Measure |
|-------|----------|------|------------|------------------|---------|
| 701ML | abc | TFUS | 1312937981 | 7/16/19 7:18:00 | 1 |
| 702ML | abc | TFUS | 1312958225 | 7/16/19 11:13:00 | 1 |
| 702ML | abc1 | TFUS | 1312957505 | 7/16/19 11:03:00 | 1 |
| 702ML | abc2 | TFUS | 1312954287 | 7/16/19 10:24:00 | 1 |
| 702ML | abc3 | TFUS | 1312938599 | 7/16/19 7:28:00 | 1 |
| 702ML | abc4 | TFUS | 1290599620 | 5/25/18 15:43:00 | 2 |
| 702ML | abc4 | TFUS | 1312950297 | 7/16/19 9:43:00 | 2 |
| 708BI | abc | TFUS | 1312947288 | 7/16/19 9:13:00 | 1 |
| 712BI | abc | TFUS | 1312944078 | 7/16/19 8:30:00 | 1 |
| 785DL | abc | TFUS | 1312937536 | 7/16/19 7:12:00 | 1 |
| 786DL | abc | TFUS | 1312992583 | 7/16/19 14:59:00 | 1 |
| 791DI | abc | LFUS | 1289094627 | 4/28/18 20:07:00 | 2 |
| 791DI | abc | LFUS | 1312958972 | 7/16/19 11:17:00 | 2 |
| 791DI | abc1 | LFUS | 1313005237 | 7/16/19 14:00:00 | 2 |
| 791DI | abc2 | RCLR | 1290324328 | 5/22/18 15:36:00 | 2 |
| 841JU | abc | TFUS | 1312955016 | 7/16/19 10:32:00 | 1 |
| 841JU | abc1 | SBKR | 1288688911 | 4/15/18 10:09:56 | 2 |
| 841JU | abc1 | SBKR | 1312961007 | 7/16/19 11:46:24 | 2 |
| 871NI | abc2 | TFUS | 1304308511 | 3/24/19 19:13:00 | 2 |
| 871NI | abc | TFUS | 1313015455 | 7/16/19 18:39:00 | 2 |
| 917CN | abc | TFUS | 1312945831 | 7/16/19 8:58:00 | 1 |
| 918CN | abc | LFUS | 1292611263 | 6/30/18 9:41:00 | 2 |
| 918CN | abc | LFUS | 1313006283 | 7/16/19 17:03:00 | 2 |
| 922DU | abc | TFUS | 1312987081 | 7/16/19 14:20:00 | 1 |
| 922DU | abc1 | TFUS | 1313005803 | 7/16/19 17:04:00 | 1 |
| 922DU | abc2 | TFUS | 1313003541 | 7/16/19 16:42:00 | 1 |
| 931LF | abc | TFUS | 1312972165 | 7/16/19 12:46:00 | 1 |
But I would like to get this:
| Item | Sub_Item | TYPE | Case ID | Date | Measure |
|-------|----------|------|------------|-----------------|---------|
| 701ML | abc | TFUS | 1312937981 | 7/16/2019 7:18 | 1 |
| 702ML | abc | TFUS | 1312958225 | 7/16/2019 11:13 | 6 |
| 702ML | abc1 | TFUS | 1312957505 | 7/16/2019 11:03 | 6 |
| 702ML | abc2 | TFUS | 1312954287 | 7/16/2019 10:24 | 6 |
| 702ML | abc3 | TFUS | 1312938599 | 7/16/2019 7:28 | 6 |
| 702ML | abc4 | TFUS | 1290599620 | 5/25/2018 15:43 | 6 |
| 702ML | abc4 | TFUS | 1312950297 | 7/16/2019 9:43 | 6 |
| 708BI | abc | TFUS | 1312947288 | 7/16/2019 9:13 | 1 |
| 712BI | abc | TFUS | 1312944078 | 7/16/2019 8:30 | 1 |
| 785DL | abc | TFUS | 1312937536 | 7/16/2019 7:12 | 1 |
| 786DL | abc | TFUS | 1312992583 | 7/16/2019 14:59 | 1 |
| 791DI | abc | LFUS | 1289094627 | 4/28/2018 20:07 | 4 |
| 791DI | abc | LFUS | 1312958972 | 7/16/2019 11:17 | 4 |
| 791DI | abc1 | LFUS | 1313005237 | 7/16/2019 14:00 | 4 |
| 791DI | abc2 | RCLR | 1290324328 | 5/22/2018 15:36 | 4 |
| 841JU | abc | TFUS | 1312955016 | 7/16/2019 10:32 | 3 |
| 841JU | abc1 | SBKR | 1288688911 | 4/15/2018 10:09 | 3 |
| 841JU | abc1 | SBKR | 1312961007 | 7/16/2019 11:46 | 3 |
| 871NI | abc2 | TFUS | 1304308511 | 3/24/2019 19:13 | 2 |
| 871NI | abc | TFUS | 1313015455 | 7/16/2019 18:39 | 2 |
| 917CN | abc | TFUS | 1312945831 | 7/16/2019 8:58 | 1 |
| 918CN | abc | LFUS | 1292611263 | 6/30/2018 9:41 | 2 |
| 918CN | abc | LFUS | 1313006283 | 7/16/2019 17:03 | 2 |
| 922DU | abc | TFUS | 1312987081 | 7/16/2019 14:20 | 3 |
| 922DU | abc1 | TFUS | 1313005803 | 7/16/2019 17:04 | 3 |
| 922DU | abc2 | TFUS | 1313003541 | 7/16/2019 16:42 | 3 |
| 931LF | abc | TFUS | 1312972165 | 7/16/2019 12:46 | 1 |
You need to specify what level you are aggregating at in your measure. Currently, you are aggregating at the Sub_Item level.
To aggregate at the Item level, simply replace Sub_Item with Item in your measure.
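For intuition, the item-level total corresponds to counting Case IDs per Item and broadcasting that count back to every row of the group; a minimal pandas sketch of the same shape (my own illustration, assuming a table like the one above):
import pandas as pd

df = pd.DataFrame({
    'Item':    ['701ML', '702ML', '702ML', '702ML'],
    'Case ID': [1312937981, 1312958225, 1312957505, 1312954287],
})
# Count Case IDs within each Item, then apply the group total to each row.
df['Measure'] = df.groupby('Item')['Case ID'].transform('count')
print(df)  # 701ML -> 1, each 702ML row -> 3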

How to remove quotes from front and end of the string Scala

I have a dataframe where some strings contain " at the front and end of the string.
Eg:
+-------------------------------+
|data |
+-------------------------------+
|"john belushi" |
|"john mnunjnj" |
|"nmnj tyhng" |
|"John b-e_lushi" |
|"john belushi's book" |
Expected output:
+-------------------------------+
|data |
+-------------------------------+
|john belushi |
|john mnunjnj |
|nmnj tyhng |
|John b-e_lushi |
|john belushi's book |
I am trying to remove only the " double quotes from the string. Can someone tell me how I can remove them in Scala?
Python provides lstrip and rstrip. Is there anything equivalent to that in Scala?
Use the expr, substring and length functions to take the substring from position 2 with length length(data) - 2:
val df_d = List("\"john belushi\"", "\"John b-e_lushi\"", "\"john belushi's book\"")
.toDF("data")
Input:
+---------------------+
|data |
+---------------------+
|"john belushi" |
|"John b-e_lushi" |
|"john belushi's book"|
+---------------------+
Using expr, substring and length functions:
import org.apache.spark.sql.functions.expr
df_d.withColumn("data", expr("substring(data, 2, length(data) - 2)"))
.show(false)
Output:
+-------------------+
|data |
+-------------------+
|john belushi |
|John b-e_lushi |
|john belushi's book|
+-------------------+
How to remove quotes from front and end of the string Scala?
myString.substring(1, myString.length()-1) will remove the double quotes.
import spark.implicits._

val list = List("\"hi\"", "\"I am learning scala\"", "\"pls\"", "\"help\"").toDF()
list.show(false)

val finaldf = list.map { row =>
  val stringdoublequotestoberemoved = row.getAs[String]("value")
  stringdoublequotestoberemoved.substring(1, stringdoublequotestoberemoved.length() - 1)
}
finaldf.show(false)
Result:
+--------------------+
| value|
+--------------------+
| "hi"|
|"I am learning sc...|
| "pls"|
| "help"|
+--------------------+
+-------------------+
| value|
+-------------------+
| hi|
|I am learning scala|
| pls|
| help|
+-------------------+
Try this:
scala> val dataFrame = List("\"john belushi\"","\"john mnunjnj\"" , "\"nmnj tyhng\"" ,"\"John b-e_lushi\"", "\"john belushi's book\"").toDF("data")
scala> dataFrame.map { row => row.mkString.stripPrefix("\"").stripSuffix("\"")}.show
+-------------------+
| value|
+-------------------+
| john belushi|
| john mnunjnj|
| nmnj tyhng|
| John b-e_lushi|
|john belushi's book|
+-------------------+
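If you prefer to stay in the DataFrame API instead of mapping over rows, a regexp_replace that targets only a leading or trailing quote also works; a minimal sketch (shown with the Python API for illustration; the same function exists in the Scala API):
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('"john belushi"',), ('"John b-e_lushi"',)], ['data'])
# '^"|"$' matches a double quote only at the start or end of the string.
df.withColumn('data', regexp_replace('data', '^"|"$', '')).show(truncate=False)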

django Queryset exclude() multiple data

I have a database schema like this:
# periode
+------+--------------+--------------+
| id | from | to |
+------+--------------+--------------+
| 1 | 2018-04-12 | 2018-05-11 |
| 2 | 2018-05-12 | 2018-06-11 |
+------+--------------+--------------+
# foo
+------+---------+
| id | name |
+------+---------+
| 1 | John |
| 2 | Doe |
| 3 | Trodi |
| 4 | son |
| 5 | Alex |
+------+---------+
#bar
+------+---------------+--------------+
| id | employee_id | periode_id |
+------+---------------+--------------+
| 1 | 1 |1 |
| 2 | 2 |1 |
| 3 | 1 |2 |
| 4 | 3 |1 |
+------+---------------+--------------+
I need to show the employees that are not in the salary (bar) table.
For now I do it like this:
queryset = Bar.objects.all().filter(periode_id=1)
result = Foo.objects.exclude(id=queryset)
but it fails. How do I filter the employee list that is not in salary?
Well, here you basically want the Foo objects for which there is no Bar record with periode_id=1.
We can make this work with:
ex = Bar.objects.filter(periode_id=1).values_list('employee_id', flat=True)
result = Foo.objects.exclude(id__in=ex)
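As a side note, the inner queryset can also be passed to exclude directly as a subquery, and if the Bar.employee ForeignKey points at Foo you can use the reverse relation instead (the related name 'bar' below assumes Django's default naming, so treat this as a sketch):
# Subquery form: Django renders the inner queryset as a SQL subselect.
result = Foo.objects.exclude(
    id__in=Bar.objects.filter(periode_id=1).values('employee_id')
)

# Reverse-relation form (assumes Bar.employee = ForeignKey(Foo, ...)).
result = Foo.objects.exclude(bar__periode_id=1)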

pseudocode into SAS macro code

I am not familiar with SAS Base and macro language syntax; my code keeps going wrong. Can someone offer a piece of SAS macro code for my pseudocode?
1. Create a macro array to store all the distinct variables in table Map_num:
proc sql noprint;
select distinct variable into :numVarList separated by ' ' from Map_num;
quit;
2. Loop over the macro array numVarList and over each value of each element:
(1) pick the i-th element;
(2) loop over all the values of the i-th element;
(3) if the customer's value (from the customerScore table) is within the "start" and "end" range, then update score = score + woe*beta.
for example:
the customerScore table is:
+--------+--------+---------+---------+----------+---------+---------+---------+---------+---------+---------+---------+-------+
| cst_id | A | B | C | D | E | F | G | H | I | J | K | score |
+--------+--------+---------+---------+----------+---------+---------+---------+---------+---------+---------+---------+-------+
| 1 | 688567 | 873 | 134878 | 546546 | 3123 | 6 | 5345 | 768678 | 348957 | -921839 | -8217 | 0 |
| 2 | 3198 | 54667 | 9789867 | 53456756 | 78978 | 6456 | 645 | 534 | -219 | 13312 | 4543 | 0 |
| 3 | 35324 | 6456568 | 43 | 56756 | -8217 | 688567 | 873 | 134878 | 12 | 89173 | 213142 | 0 |
| 4 | 348957 | -921839 | -8217 | 5345 | 434534 | 3198 | 54667 | 9789867 | -8217 | -8217 | 8908102 | 0 |
| 5 | -219 | 13312 | 4543 | 4234 | 54667 | 35324 | 6456568 | 43 | 213142 | 213142 | 213 | 0 |
| 6 | 12 | 89173 | 213142 | 23234 | 348957 | -921839 | -8217 | 688567 | 873 | 134878 | 23424 | 0 |
| 7 | 688567 | 89173 | 213142 | -8217 | -219 | 13312 | 4543 | 3198 | 54667 | 9789867 | 3434 | 0 |
| 8 | 3198 | -8217 | 21313 | -8217 | 12 | 89173 | 213142 | 35324 | 6456568 | 43 | 3123 | 0 |
| 9 | 35324 | -8217 | 688567 | 688567 | 873 | 134878 | 688567 | 873 | 134878 | -8217 | 11 | 0 |
| 10 | 348957 | 89173 | 213142 | 3198 | 54667 | 9789867 | 3198 | 54667 | 9789867 | -8217 | 3198 | 0 |
| 11 | -219 | -921839 | -8217 | 35324 | 6456568 | 43 | 35324 | 6456568 | 43 | -921839 | -8217 | 0 |
| 12 | 12 | 13312 | 4543 | 89173 | 4234 | 3198 | 688567 | 873 | 134878 | 13312 | 4543 | 0 |
| 13 | 12 | 89173 | 213142 | 348957 | -921839 | -8217 | 3198 | 54667 | 9789867 | 89173 | 213142 | 0 |
| 14 | 2 | 89173 | 213142 | -219 | 13312 | 4543 | 35324 | 6456568 | 43 | 54667 | 4543 | 0 |
| 15 | 348957 | -921839 | -8217 | 12 | 89173 | 213142 | 13312 | 4543 | 89173 | 4234 | 4543 | 0 |
| 16 | -219 | 13312 | 35324 | 6456568 | 43 | 213142 | 89173 | 213142 | 348957 | -921839 | -8217 | 0 |
| 17 | 12 | 89173 | -921839 | -8217 | 688567 | 873 | 89173 | 213142 | -219 | 13312 | 4543 | 0 |
| 18 | 688567 | 873 | 13312 | 4543 | 3198 | 54667 | -921839 | -8217 | 12 | 89173 | 213142 | 0 |
| 19 | 3198 | 54667 | 9789867 | 688567 | 873 | 134878 | 43 | 213142 | 213142 | 213 | 9789867 | 0 |
| 20 | 35324 | 6456568 | 43 | 43 | 213142 | 213142 | 213 | 89173 | 4234 | 3198 | 688567 | 0 |
+--------+--------+---------+---------+----------+---------+---------+---------+---------+---------+---------+---------+-------+
If table Map_num is as below, then the score for cst_id 1 is updated as: score = 0 + (-1.2)*3 + 2*3 + 0.1*3 + 7*3
+----------+------------+------------+------+------+
| variable | start | end | woe | beta |
+----------+------------+------------+------+------+
| A | -999999999 | 57853 | -1 | 3 |
| A | 57853 | 89756 | -1.1 | 3 |
| A | 89756 | 897452 | -1.2 | 3 |
| A | 897452 | 9999999999 | -1.3 | 3 |
| B | -999999999 | 4235 | 2 | 3 |
| B | 4235 | 65785 | 3 | 3 |
| B | 65785 | 9999999999 | 4 | 3 |
| C | -999999999 | 9673 | 3.1 | 3 |
| C | 9673 | 75341 | 2.1 | 3 |
| C | 75341 | 98543 | 1.1 | 3 |
| C | 98543 | 567864 | 0.1 | 3 |
| C | 567864 | 9999999999 | -1 | 3 |
| D | -999999999 | 8376 | 5 | 3 |
| D | 8376 | 93847 | 6 | 3 |
| D | 93847 | 9999999999 | 7 | 3 |
+----------+------------+------------+------+------+
If table Map_num is as below, then the score for cst_id 1 is updated as: score = 0 + 3*2 + 5*2 + 0*2 + 7*2 + 3*2
+----------+------------+------------+-----+------+
| variable | start | end | woe | beta |
+----------+------------+------------+-----+------+
| E | -999999999 | 3 | 1 | 2 |
| E | 3 | 500000 | 3 | 2 |
| E | 500000 | 800000 | 2 | 2 |
| E | 800000 | 9999999999 | 4 | 2 |
| A | -999999999 | 6700 | 6 | 2 |
| A | 590000 | 680000 | 4 | 2 |
| A | 680000 | 9999999999 | 5 | 2 |
| C | -999999999 | 89678 | 9 | 2 |
| C | 89678 | 566757 | 0 | 2 |
| C | 566757 | 986785 | 2.8 | 2 |
| C | 986785 | 9999999999 | 1.1 | 2 |
| K | -999999999 | 7865 | 7 | 2 |
| K | 7865 | 25637 | 9 | 2 |
| K | 25637 | 65742 | 8 | 2 |
| K | 65742 | 9999999999 | 0.2 | 2 |
| B | -999999999 | 56753 | 3 | 2 |
| B | 56753 | 5465624 | 4 | 2 |
| B | 5465624 | 9999999999 | 1 | 2 |
+----------+------------+------------+-----+------+
Thanks in advance!
Tables customerScore and Map_num change every day in their rows, but the column names variable, start, end, woe and beta do not change. I need to update the score column in table customerScore according to table Map_num. If the column A value in customerScore is 688567, then since 89756 < 688567 <= 897452 the score is updated: score = score + (-1.2)*3... Is that clear?
As I comprehend it, this is a nested loop using SAS macro.
Unfortunately customerScore is not in a form that is readily aligned for a really simple SQL computation.
SQL way
One important aspect is to recognize that the selection of woe and beta for each score part from map_num can be done relatively easily in SQL, but processing the individual variables has to be 'coaxed' through macro.
Consider only the variable A from the first map_num as an example case.
select
  ( select map_num.woe * map_num.beta
    from map_num
    where map_num.variable = "A"
      and map_num.start < customerScore.A <= map_num.end
  ) as A_contribution_to_score
from
  customerScore
Now consider the B contribution that is added to the overall expression:
select
  ( select map_num.woe * map_num.beta
    from map_num
    where map_num.variable = "A"
      and map_num.start < customerScore.A <= map_num.end
  )
  +
  ( select map_num.woe * map_num.beta
    from map_num
    where map_num.variable = "B"
      and map_num.start < customerScore.B <= map_num.end
  )
from
  customerScore
You should see that a macro could determine the distinct values of variable in map_num and use them to construct a rather lengthy SQL expression that searches for the appropriate woe and beta product to apply to each row in customerScore.
The macro and SQL update statement could be something like
%macro updateScore (data=, map=);
  %local i n_var;
  proc sql noprint;
    select distinct variable into :variable1- from &map;
    %let N_var = &sqlobs;
    update &data as OUTER
      set score = score
      %do I = 1 %to &N_var;
        %let variable = &&variable&i;
        +
        ( select INNER.woe * INNER.beta
          from &map as INNER
          where INNER.variable = "&variable"
            and INNER.start < OUTER.&variable <= INNER.end
        )
      %end;
    ; /* end of update statement */
  quit;
%mend;
%updateScore(data=customerScore, map=map_num)
Your data structure needs a bit of work if you want the score update made via a map_num to be reversible (i.e. capable of having an undo action applied).
If tracking the map selections is important, you would want an additional similar query in the macro that creates a table recording the important aspects of the map data selection:
create table mapplication as
select cst_id
%do I = 1 %to &N_var;
  %let variable = &&variable&i;
  %let innerness = from &map as INNER where INNER.variable="&variable" and INNER.start < OUTER.&variable <= INNER.end;
  , &variable
  , ( select INNER.woe   &innerness ) as &variable._woe
  , ( select INNER.beta  &innerness ) as &variable._beta
  , ( select INNER.start &innerness ) as &variable._start
  , ( select INNER.end   &innerness ) as &variable._end
%end;
from &data as OUTER;
Examining the 'mapplication' data can possibly help diagnose bad map_num data.
First let's start with a working set of data so we have something that SAS code can work with.
data cust ;
input cst_id A B ;
cards;
1 688567 873
2 3198 54667
;
data map_data ;
input variable :$32. start end woe beta ;
cards;
A -999999999 57853 -1 3
A 57853 89756 -1.1 3
A 89756 897452 -1.2 3
A 897452 9999999999 -1.3 3
B -999999999 4235 2 3
B 4235 65785 3 3
B 65785 9999999999 4 3
;
If you want to combine the first table with the second then you need to transpose it.
proc transpose data=cust out=cust_data(rename=(col1=value)) name=variable ;
by cst_id ;
run;
The result for our small example looks like this.
Obs cst_id variable value
1 1 A 688567
2 1 B 873
3 2 A 3198
4 2 B 54667
Since the transpose has moved the variable names into data values instead of metadata values we can now easily join the customer data with the map data.
I will assume that you only want the cases where the value of the variable falls between the START and END variables.
proc sql ;
create table want as
select *
from cust_data a
inner join map_data b
on a.variable = b.variable
and a.value between b.start and b.end
order by 1,2
;
quit;
For this little sample it would be this data.
Obs cst_id variable value start end woe beta
1 1 A 688567 89756 897452 -1.2 3
2 1 B 873 -999999999 4235 2.0 3
3 2 A 3198 -999999999 57853 -1.0 3
4 2 B 54667 4235 65785 3.0 3
At this point you now have something that might make it possible to calculate a score, if you could explain what the formula is.
So assuming that you want to take the sum of WOE*BETA then your SQL query should probably look like this.
proc sql ;
create table scores as
select a.cst_id,sum(woe*beta) as score
from cust_data a
inner join map_data b
on a.variable = b.variable
and a.value between b.start and b.end
group by 1
order by 1
;
quit;
Which has this result.
Obs cst_id score
1 1 2.4
2 2 6.0
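For comparison (my own illustration, not part of the SAS answer), the same transpose-join-sum pipeline is compact in pandas, which can be handy for cross-checking the scores:
import pandas as pd

cust = pd.DataFrame({'cst_id': [1, 2], 'A': [688567, 3198], 'B': [873, 54667]})
map_data = pd.DataFrame({
    'variable': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'start': [-999999999, 57853, 89756, 897452, -999999999, 4235, 65785],
    'end':   [57853, 89756, 897452, 9999999999, 4235, 65785, 9999999999],
    'woe':   [-1, -1.1, -1.2, -1.3, 2, 3, 4],
    'beta':  [3, 3, 3, 3, 3, 3, 3],
})
# Transpose customer columns to rows, join on the variable name,
# keep rows whose value falls in [start, end], then sum woe*beta per customer.
long = cust.melt(id_vars='cst_id', var_name='variable', value_name='value')
hit = long.merge(map_data, on='variable').query('start <= value <= end')
scores = (hit['woe'] * hit['beta']).groupby(hit['cst_id']).sum().reset_index(name='score')
print(scores)  # cst_id 1 -> 2.4, cst_id 2 -> 6.0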
Not sure where macro code or looping would help with this problem. If the names of the input dataset vary then you could use macro variables to hold the names, but the input dataset names are used only once each in this code.
For example you could make macro variables CUST, MAP and OUT.
%let cust=work.cust;
%let map=work.map_data;
%let out=work.scores;
Then replace the dataset names in the code with the macro variable references.
proc transpose data=&cust. out=cust_data(rename=(col1=value)) name=variable ;
by cst_id ;
run;
proc sql ;
create table &out. as
select a.cst_id,sum(woe*beta) as score
from cust_data a
inner join &map. b
on a.variable = b.variable
and a.value between b.start and b.end
group by 1
order by 1
;
quit;