Replace a number combination with a semicolon in a PySpark DataFrame - regex

I have a column in pyspark as:
column_a
force is 12 N and weight is 5N 4455 6700 and second force is 12N 6700 3460
weight is 14N and force is 5N 7000 10000
acceleration due to gravity is 10 and force is 6N 15000 4500
force is 12 4 N and weight is 7N 9000 17000 and second force is 12N
I want to replace numbers that fall in the range (1000, 20000) and occur one after another with a semicolon (;). For example, in the 4th row, 12 and 4 occur one after another, but they do not fall in the range, so we do not replace them with a semicolon.
So my final output will be
column_a
force is 12 N and weight is 5N ; and second force is 12N ;
weight is 14N and force is 5N ;
acceleration due to gravity is 10 and force is 6N ;
force is 12 4 N and weight is 7N ; and second force is 12N
How do I achieve this in pyspark?

You can use regexp_replace to replace the specified format with ;.
The hardest part is coming up with the regex. We can use the Numeric Range Regex Generator to find a pattern that matches the condition.
from pyspark.sql import functions as F
data = [("force is 12 N and weight is 5N 4455 6700 and second force is 12N 6700.010 3460",),
("weight is 14N and force is 5N 7000 10000",),
("acceleration due to gravity is 10 and force is 6N 15000 4500.1999999901",),
("force is 12 4 N and weight is 7N 9000 17000 and second force is 12N",),
("handle zero padded decimals 20000.000000 20000.00",),
("Wont be replaced as outside range 20001 17000 even for decimal 20000.01 2000",),]
df = spark.createDataFrame(data, ("column_a", ))
# This pattern matches whole and decimal numbers between 1000 and 20000 inclusive
numeric_pattern ="(((100[0-9]|10[1-9][0-9]|1[1-9][0-9]{2}|[2-9][0-9]{3}|1[0-9]{4})(\\.\\d+)?)|(20000)(\\.0*)?)"
# This pattern matches 2 numeric patterns separated by a space
pattern = f".({numeric_pattern}\\s{numeric_pattern})\\b"
df.withColumn("column_a", F.regexp_replace(F.col("column_a"), pattern, " ;")).show(truncate=False)
"""
+----------------------------------------------------------------------------+
|column_a |
+----------------------------------------------------------------------------+
|force is 12 N and weight is 5N ; and second force is 12N ; |
|weight is 14N and force is 5N ; |
|acceleration due to gravity is 10 and force is 6N ; |
|force is 12 4 N and weight is 7N ; and second force is 12N |
|handle zero padded decimals ; |
|Wont be replaced as outside range 20001 17000 even for decimal 20000.01 2000|
+----------------------------------------------------------------------------+
"""
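The pattern can also be sanity-checked without a Spark session. The sketch below is an addition, not part of the original answer. It assumes that Python's re module handles this particular pattern the same way as the Java regex engine behind regexp_replace (the pattern uses only character classes, alternation, \s and \b), and the helper name replace_pairs is hypothetical.

```python
import re

# Same building blocks as the Spark answer, written as raw strings.
numeric_pattern = (
    r"(((100[0-9]|10[1-9][0-9]|1[1-9][0-9]{2}|[2-9][0-9]{3}|1[0-9]{4})(\.\d+)?)"
    r"|(20000)(\.0*)?)"
)
# The leading "." consumes the character before the pair (normally a space),
# which is why the replacement string starts with a space.
pattern = re.compile(rf".({numeric_pattern}\s{numeric_pattern})\b")


def replace_pairs(text: str) -> str:
    """Hypothetical helper: replace each in-range number pair with ' ;'."""
    return pattern.sub(" ;", text)


rows = [
    "weight is 14N and force is 5N 7000 10000",
    "force is 12 4 N and weight is 7N 9000 17000 and second force is 12N",
    "Wont be replaced as outside range 20001 17000 even for decimal 20000.01 2000",
]
for row in rows:
    print(replace_pairs(row))
```

Running a handful of edge cases this way (range boundaries, decimals, numbers embedded in longer digit runs) is a cheap check before applying the expression to a whole column.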

SAS - Selecting optimal quantities

I'm trying to solve a problem in SAS where I have quantities of customers across a range of groups, and the quantities I select need to be as even across the different categories as possible. This will be easier to explain with a small table, which is a simplification of a much larger problem I'm trying to solve.
Here is the table:
Customer Category | Revenue band | Churn Band | # Customers
A                 | 1            | 1          | 4895
A                 | 1            | 2          | 383
A                 | 1            | 3          | 222
A                 | 2            | 1          | 28
A                 | 2            | 2          | 2828
A                 | 2            | 3          | 232
B                 | 1            | 1          | 4454
B                 | 1            | 2          | 545
B                 | 1            | 3          | 454
B                 | 2            | 1          | 4534
B                 | 2            | 2          | 434
B                 | 2            | 3          | 454
Suppose I need to select 3000 customers from category A, and 3000 customers from category B. From the second category, within each A and B, I need to select an equal amount from 1 and 2. If possible, I need to select a proportional amount across each 1, 2, and 3 subcategories. Is there an elegant solution to this problem? I'm relatively new to SAS and so far I've investigated OPTMODEL, but the examples are either too simple or too advanced to be much use to me yet.
Edit: I've thought about using PROC SURVEYSELECT. I can use this to select equal sizes across the Revenue Bands 1, 2, and 3. However, where I'm lacking customers in the individual churn bands, SURVEYSELECT may not select the maximum number of customers available where those numbers are low, and I'm back to manually selecting customers.
There are still some ambiguities in the problem statement, but I hope that the PROC OPTMODEL code below is a good start for you. I tried to add examples of many different features, so that you can toy around with the model and hopefully get closer to what you actually need.
Of the many things you could optimize, I am minimizing the maximum violation from your "If possible" goal, e.g.:
min MaxMismatch = MaxChurnMismatch;
I was able to model your constraints as a Linear Program, which means that it should scale very well. You probably have other constraints you did not mention, but that would probably be beyond the scope of this site.
With the data you posted, you can see from the output of the print statements that the optimal penalty corresponds to choosing 1500 customers from A,1,1, where the ideal would be 1736. This is more expensive than ignoring the customers from several groups:
[1]    ChooseByCat
A      3000
B      3000
[1]  [2]  [3]    Choose    IdealProportion
A    1    1        1500           1736.670
A    1    2           0            135.882
A    1    3           0             78.762
A    2    1          28              9.934
A    2    2        1240           1003.330
A    2    3         232             82.310
B    1    1        1500           1580.210
B    1    2           0            193.358
B    1    3           0            161.072
B    2    1        1500           1608.593
B    2    2           0            153.976
B    2    3           0            161.072
Proportion    MaxChurnMisMatch
0.35478       236.67
That is probably not the ideal solution, but figuring out exactly how to model your requirements would not be as useful for this site. You can contact me offline if that is relevant.
I've added quotes from your problem statement as comments in the code below.
Have fun!
data custCounts;
input cat $ rev churn n;
datalines;
A 1 1 4895
A 1 2 383
A 1 3 222
A 2 1 28
A 2 2 2828
A 2 3 232
B 1 1 4454
B 1 2 545
B 1 3 454
B 2 1 4534
B 2 2 434
B 2 3 454
;
proc optmodel printlevel = 0;
   set CATxREVxCHURN init {} inter {<'A',1,1>};
   set CAT = setof{<c,r,ch> in CATxREVxCHURN} c;
   num n{CATxREVxCHURN};
   read data custCounts into CATxREVxCHURN=[cat rev churn] n;
   put n[*]=;

   var Choose{<c,r,ch> in CATxREVxCHURN} >= 0 <= n[c,r,ch]
     , MaxChurnMisMatch >= 0, Proportion >= 0 <= 1
   ;

   /* From OP:
      Suppose I need to select 3000 customers from category A,
      and 3000 customers from category B. */
   num goal = 3000;
   /* See "implicit slice" for the parenthesis notation, i.e. (c) below. */
   impvar ChooseByCat{c in CAT} =
      sum{<(c),r,ch> in CATxREVxCHURN} Choose[c,r,ch];
   con MatchCatGoal{c in CAT}:
      ChooseByCat[c] = goal;

   /* From OP:
      From the second category, within each A and B,
      I need to select an equal amount from 1 and 2 */
   con MatchRevenueGroupsWithinCat{c in CAT}:
        sum{<(c),(1),ch> in CATxREVxCHURN} Choose[c,1,ch]
      = sum{<(c),(2),ch> in CATxREVxCHURN} Choose[c,2,ch]
   ;

   /* From OP:
      If possible, I need to select a proportional amount
      across each 1, 2, and 3 subcategories. */
   con MatchBandProportion{<c,r,ch> in CATxREVxCHURN, sign in / 1 -1 /}:
      MaxChurnMismatch >= sign * ( Choose[c,r,ch] - Proportion * n[c,r,ch] );

   min MaxMismatch = MaxChurnMismatch;
   solve;

   print ChooseByCat;
   impvar IdealProportion{<c,r,ch> in CATxREVxCHURN} = Proportion * n[c,r,ch];
   print Choose IdealProportion;
   print Proportion MaxChurnMismatch;
quit;
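As a sanity check on the printed solution, the constraints can be re-evaluated outside SAS. The Python sketch below is an addition, not part of the original answer: it does not solve the LP, it only plugs the reported Choose values and the rounded Proportion (0.35478) back into the three groups of constraints. The variable and helper names are hypothetical.

```python
# Customer counts from the question: (category, revenue band, churn band) -> n
counts = {
    ("A", 1, 1): 4895, ("A", 1, 2): 383, ("A", 1, 3): 222,
    ("A", 2, 1): 28, ("A", 2, 2): 2828, ("A", 2, 3): 232,
    ("B", 1, 1): 4454, ("B", 1, 2): 545, ("B", 1, 3): 454,
    ("B", 2, 1): 4534, ("B", 2, 2): 434, ("B", 2, 3): 454,
}
# Choose column as printed by PROC OPTMODEL in the answer.
choose = {
    ("A", 1, 1): 1500, ("A", 1, 2): 0, ("A", 1, 3): 0,
    ("A", 2, 1): 28, ("A", 2, 2): 1240, ("A", 2, 3): 232,
    ("B", 1, 1): 1500, ("B", 1, 2): 0, ("B", 1, 3): 0,
    ("B", 2, 1): 1500, ("B", 2, 2): 0, ("B", 2, 3): 0,
}
proportion = 0.35478  # rounded value from the printed output
goal = 3000
categories = sorted({cat for cat, _, _ in counts})


def chosen(cat, rev=None):
    """Total selected in a category, optionally limited to one revenue band."""
    return sum(
        v for (c, r, _), v in choose.items()
        if c == cat and (rev is None or r == rev)
    )


# Variable bounds: 0 <= Choose <= n
within_bounds = all(0 <= choose[k] <= counts[k] for k in counts)
# MatchCatGoal: each category sums to the goal
meets_goal = all(chosen(c) == goal for c in categories)
# MatchRevenueGroupsWithinCat: revenue bands 1 and 2 are equal within a category
revenue_balanced = all(chosen(c, 1) == chosen(c, 2) for c in categories)
# Objective: largest absolute deviation from Proportion * n
max_mismatch = max(abs(choose[k] - proportion * counts[k]) for k in counts)

print(within_bounds, meets_goal, revenue_balanced, round(max_mismatch, 2))
```

Because Proportion is rounded in the printout, the recomputed maximum mismatch agrees with the reported 236.67 only to within rounding error.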

MATLAB: How to read PRE tag and create cellarray with NaN

I am trying to read data from an HTML file.
The data are delimited by <PRE></PRE> tags,
e.g.:
<pre>
12.0 29132 -60.3 -91.4 1 0.01 260 753.2 753.3 753.2
10.0 30260 -57.9 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 1009.2 1011.8 1009.3
</pre>
t = regexp(html, '<PRE[^>]*>(.*?)</PRE>', 'tokens');
where t is a cell array of char.
Now I would like to replace the blank fields with NaN and obtain:
12.0 29132 -60.3 -91.4 1 0.01 260 NaN 753.2 753.3 753.2
10.0 30260 -57.9 NaN 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 NaN NaN 1009.2 1011.8 1009.3
This data will then be saved to a mydata.dat file.
If you have the HTML file hosted somewhere, then:
url = 'http://www.myDomain.com/myFile.html';
html = urlread(url);
% Use regular expressions to remove undesired HTML markup.
txt = regexprep(html,'<script.*?/script>','');
txt = regexprep(txt,'<style.*?/style>','');
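% Do not strip <pre>...</pre> blocks as a whole: they hold the data.
% The pattern below removes only the tags themselves and keeps their contents.
txt = regexprep(txt,'<.*?>','');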
Now you should have the data in text format in the txt variable. You can use textscan to parse txt, scanning either for the whitespace or for the numbers.
More Info:
- urlread
- regexprep
This isn't the perfect solution but it seems to get you there.
Assuming t is one long string, the delimiter is white space, and you know the number of columns:
numcols = 7;
sample = '1  2  3  4  5    7  1    3    5    7'; % two spaces between values, four where a value is missing
test = textscan(sample,'%f','delimiter',' ','MultipleDelimsAsOne',false);
test = test{:}; % Pull the double out of the cell array
test(2:2:end) = []; % Dump out extra NaNs
test2 = reshape(test,numcols,length(test)/numcols)'; % Have to mess with it a little to reshape rowwise instead of columnwise
Returns:
test2 =
1 2 3 4 5 NaN 7
1 NaN 3 NaN 5 NaN 7
This is assuming the delimiter is white space and constant. Textscan doesn't allow you to stack whitespace as a delimiter, so it throws a NaN after each white space character if there isn't data present. In your example data there are two white space characters between each data point, so every other NaN (or, more generically, n_whitespace - 1) can be thrown out, leaving you with the NaNs you actually want.
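The same split-then-discard idea can be sketched in Python for readers who want to experiment outside MATLAB. This is an addition, not part of the original answer. It assumes the sample has exactly two spaces between neighbouring values and four spaces where a value is missing (the spacing implied by the output shown above), and it builds the sample string programmatically so that the spacing is explicit.

```python
import math

numcols = 7
gap = " " * 2  # assumed spacing: two spaces between neighbouring fields

# An empty token stands for a missing value, so a gap spans four spaces.
tokens = ["1", "2", "3", "4", "5", "", "7", "1", "", "3", "", "5", "", "7"]
sample = gap.join(tokens)

# Splitting on a single space yields one empty field per extra space,
# which mirrors textscan with MultipleDelimsAsOne set to false.
fields = [float(f) if f else math.nan for f in sample.split(" ")]

# Every second field is padding from the doubled delimiter; drop it.
values = fields[::2]

# Reshape row-wise into numcols columns.
rows = [values[i:i + numcols] for i in range(0, len(values), numcols)]
for row in rows:
    print(row)
```

As in the MATLAB version, this only works when the spacing is constant; with variable-width gaps, a fixed-width parse keyed on column positions would be the safer approach.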

Combinational Circuit with LED Lighting

Combinational Circuit design question.
     A
    ____
   |    |
 F |    | B
   |    |
    ____
   | G  |
 E |    | C
   |    |
    ____
     D
Suppose this is an LED display. It takes a 4-bit input
(0000)-(1111) and displays the corresponding hex digit. For example,
if (1100) comes in, it displays C by turning on A, F, E and D and turning off B, C and G.
If (1010) comes in, it displays A by turning on A, B, C, E, F and G
and turning off D.
The display shows capital letters only, so there is no visual
difference between 0 and D, or between 8 and B.
Develop a truth table and an optimized expression using Karnaugh maps.
I'm not exactly sure how to begin. For the truth table, would I use (w,x,y,z) as the input variables, or just the ABCDEFG variables, since they are the ones turning on and off?
input (1010)-->A--> ABCEFG~D (~ stands for NOT)
input (1011)-->B--> ABCDEFG
input (1100)-->C--> ADEF~B~C~G
So would I do this for all hex digits 0-F, which would give me the canonical minterm form, and then use a Karnaugh map to optimize it? Any help would be greatly appreciated!
1) Map your lights to bits, in the order ABCDEFG, so the truth table will be:
ABCDEFG
input (1010)-->A-->1110111
and so on.
You will end up with a big table (16 rows).
2) Then follow the Karnaugh map example on Wikipedia for every output light.
You need to do 7 of these: one for each segment in the 7-segment display.
This figure is for illustration only. It doesn't necessarily map to any segment in your problem.
        cd= 00  01  11  10   <-- where abcd = 0000 for 0 : put '1' if the light is on
ab= 00       1   1   1   1                  = 0001 for 1 : put '0' if it's off for
ab= 01       1   1   1   0                  = 0010 for 2 ...   the given segment
ab= 11       0   1   1   1
ab= 10       1   1   1   0                  = 1111 for f
                ^^^^^^        = d==1 region
                    ^^^^^^    = c==1 region
The two middle rows represent "b==1" region and the two last rows are a==1 region.
From that map, find maximum-size rectangles (of size [1, 2 or 4] x [1, 2 or 4]); they may overlap. The middle 4x2 region (the two d==1 columns) is coded as 'd'. The top row is '~a~b'. The top left 2x2 square is '~a~c'. A left-hand square that wraps from row 4 to row 1 is '~b~c'. Finally, the small 2x1 region that covers position x=4, y=3 is 'abc'.
This function would thus be 'd + ~a~b + ~a~c + ~b~c + abc'. If there are no redundant squares (ones that are completely covered by other squares), then this formula should be a minimal sum-of-products form (not counting XOR operations). Repeat this seven times with the real data!
Any selection/permutation of the variables should give the same logical circuit, whether you use abcd or dcba or acbd etc.
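A quick brute-force check of the illustrative map can catch grouping mistakes before you repeat the exercise for all seven segments. The Python sketch below is an addition to the answer: it encodes the example Karnaugh map shown earlier (not a real segment), evaluates 'd + ~a~b + ~a~c + ~b~c + abc' for all 16 inputs, and also tests whether any product term could be dropped. All names are hypothetical.

```python
from itertools import product

# Rows (ab) and columns (cd) of the example map, in Gray-code order.
gray = [(0, 0), (0, 1), (1, 1), (1, 0)]
kmap = [
    [1, 1, 1, 1],  # ab = 00
    [1, 1, 1, 0],  # ab = 01
    [0, 1, 1, 1],  # ab = 11
    [1, 1, 1, 0],  # ab = 10
]

# Product terms read off the map.
terms = {
    "d": lambda a, b, c, d: d,
    "~a~b": lambda a, b, c, d: (not a) and (not b),
    "~a~c": lambda a, b, c, d: (not a) and (not c),
    "~b~c": lambda a, b, c, d: (not b) and (not c),
    "abc": lambda a, b, c, d: a and b and c,
}


def from_map(a, b, c, d):
    """Look up the map cell for the input abcd."""
    return kmap[gray.index((a, b))][gray.index((c, d))]


def sum_of_products(names, bits):
    """OR together the named product terms for one input."""
    return int(any(terms[name](*bits) for name in names))


all_inputs = list(product((0, 1), repeat=4))

# The full expression must reproduce every cell of the map.
matches = all(sum_of_products(terms, bits) == from_map(*bits) for bits in all_inputs)

# A term is redundant if the expression still matches the map without it.
redundant = [
    name for name in terms
    if all(
        sum_of_products([m for m in terms if m != name], bits) == from_map(*bits)
        for bits in all_inputs
    )
]

print(matches, redundant)
```

For the real problem, replace kmap with the 16 values of one segment and terms with the groups you circled; a non-empty redundant list means the cover can be reduced further.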

Ternary Numbers, regex

I'm looking for some regex/automata help. I'm limited to + or the Kleene Star. Parsing through a string representing a ternary number (like binary, but base 3), I need to be able to know if the result is one less than a multiple of 4.
So, for example 120 = 0*1+2*3+1*9 = 9+6 = 15 = 16-1 = 4(n)-1.
Even a pointer to the pattern would be really helpful!
You can generate a series of values to do some observation with bc in bash:
for n in {1..40}; do v=$((4*n-1)); echo -en $v"\t"; echo "ibase=10;obase=3;$v" | bc ; done
3 10
7 21
11 102
15 120
19 201
23 212
27 1000
31 1011
...
Notice that each digit's value (in decimal) is either 1 more or 1 less than something divisible by 4, alternately. So the 1 (lsb) digit is one more than 0, the 3 (2nd) digit is one less than 4, the 9 (3rd) digit is 1 more than 8, the 27 (4th) digit is one less than 28, etc.
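If you sum up the odd-placed digits (counting from 1 at the least significant digit), add 1, and compare the result with the sum of the even-placed digits, the two should be equal modulo 4. For instance, 102 (decimal 11) gives 2+1+1 = 4 on the odd side and 0 on the even side, which agree modulo 4.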
In your example: odd: (0+1)+1, even: (2). So they are equal, and so the number is of the form 4n-1.
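The digit-sum observation can be checked mechanically, and it also suggests the automaton you would convert to a regular expression. The Python sketch below is an addition to the answer. It compares the digit-sum rule against n % 4 == 3 for a range of numbers, and it includes a four-state automaton that tracks the value modulo 4 while reading digits from the most significant end. The automaton and all function names are assumptions made for illustration, not something stated in the answer.

```python
def to_ternary(n: int) -> str:
    """Base-3 representation of a non-negative integer."""
    digits = ""
    while n:
        n, r = divmod(n, 3)
        digits = str(r) + digits
    return digits or "0"


def by_digit_sums(ternary: str) -> bool:
    """Digit-sum rule: (odd-placed sum + 1) equals even-placed sum modulo 4."""
    digits = [int(ch) for ch in reversed(ternary)]  # digits[0] is least significant
    odd_placed = sum(digits[0::2])   # places 1, 3, 5, ... counting from 1
    even_placed = sum(digits[1::2])  # places 2, 4, 6, ...
    return (odd_placed + 1 - even_placed) % 4 == 0


def by_automaton(ternary: str) -> bool:
    """Four-state automaton: the state is the value read so far, modulo 4."""
    state = 0
    for ch in ternary:  # most significant digit first
        state = (3 * state + int(ch)) % 4
    return state == 3  # accepting state: one less than a multiple of 4


for n in range(1, 200):
    t = to_ternary(n)
    assert by_digit_sums(t) == by_automaton(t) == (n % 4 == 3)

print(to_ternary(15), by_digit_sums("120"), by_automaton("120"))
```

With only four states, the automaton is small enough to turn into a regular expression by hand using state elimination, which fits the restriction to + and the Kleene star.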

Calculating frequency of fractions in SAS

I'm trying to calculate the frequency of fractions in my data set (excluding whole numbers).
For example, my variable P takes values 24+1/2, 97+3/8, 12+1/4, 57+1/2, etc. and I'm looking to find the frequency of 1/2, 3/8, and so on. Can anyone help?!
Thanks in advance!
Clyde013
Clyde013, here is one way, assuming that p is of character type. hth. cheers, chang
> Pulled from SAS-L
/* test data -- if p is a character var */
data one;
   input p $ @@;
cards;
24+1/2
97+3/8
12+1/4
57+1/2
36 3/8
;
run;
/* frequencies of fractions? */
data two;
   set one;
   whole = scan(p, 1, "+");
   frac = scan(p, 2, "+");
run;
proc freq data=two;
   tables frac;
run;
/* on lst
                              Cumulative    Cumulative
frac    Frequency    Percent   Frequency      Percent
---------------------------------------------------------
1/2            2      50.00           2        50.00
1/4            1      25.00           3        75.00
3/8            1      25.00           4       100.00
Frequency Missing = 2 */
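For comparison, the same split-and-count logic can be sketched in Python. This is an addition, not part of the SAS-L reply. It assumes the six values are read as separate tokens (so 36 and 3/8 are two observations, which is what the two missing values in the listing suggest), and the helper name fraction_part is hypothetical.

```python
from collections import Counter

# The six observations, one per token of the test data.
values = ["24+1/2", "97+3/8", "12+1/4", "57+1/2", "36", "3/8"]


def fraction_part(p: str):
    """Return the text after '+', or None when there is no '+' (a missing frac)."""
    parts = p.split("+")
    return parts[1] if len(parts) > 1 else None


fracs = [fraction_part(p) for p in values]
freq = Counter(f for f in fracs if f is not None)
missing = sum(f is None for f in fracs)

for frac, count in sorted(freq.items()):
    print(frac, count)
print("missing:", missing)
```

Note that "3/8" on its own counts as missing here, just as in the SAS run, because the split is on "+"; values written without a whole part would need a separate rule.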