I have a column in pyspark as:
column_a
force is 12 N and weight is 5N 4455 6700 and second force is 12N 6700 3460
weight is 14N and force is 5N 7000 10000
acceleration due to gravity is 10 and force is 6N 15000 4500
force is 12 4 N and weight is 7N 9000 17000 and second force is 12N
I want to replace the numbers which are in the range of (1000, 20000) and which occur one after another by a colon (;). For example in 4th row 12 and 4 are one after another but, they do not fall into the range so we will not replace them with a colon (;).
So my final output will be
column_a
force is 12 N and weight is 5N ; and second force is 12N ;
weight is 14N and force is 5N ;
acceleration due to gravity is 10 and force is 6N ;
force is 12 4 N and weight is 7N ; and second force is 12N
How do I achieve this in pyspark?
You can use regexp_replace to replace the specificed format with ;.
The hardest part is coming up with the regex, we can use Numeric Range Regex Generator to find the regex pattern to match the condition.
from pyspark.sql import functions as F
data = [("force is 12 N and weight is 5N 4455 6700 and second force is 12N 6700.010 3460",),
("weight is 14N and force is 5N 7000 10000",),
("acceleration due to gravity is 10 and force is 6N 15000 4500.1999999901",),
("force is 12 4 N and weight is 7N 9000 17000 and second force is 12N",),
("handle zero padded decimals 20000.000000 20000.00",),
("Wont be replaced as outside range 20001 17000 even for decimal 20000.01 2000",),]
df = spark.createDataFrame(data, ("column_a", ))
df = spark.createDataFrame(data, ("column_a", ))
# This pattern matches whole and decimal numbers between 1000 and 20000 inclusive
numeric_pattern ="(((100[0-9]|10[1-9][0-9]|1[1-9][0-9]{2}|[2-9][0-9]{3}|1[0-9]{4})(\\.\\d+)?)|(20000)(\\.0*)?)"
# This pattern matches 2 numeric patterns separated by a space
pattern = f".({numeric_pattern}\\s{numeric_pattern})\\b"
df.withColumn("column_a", F.regexp_replace(F.col("column_a"), pattern, " ;")).show(truncate=False)
"""
+----------------------------------------------------------------------------+
|column_a |
+----------------------------------------------------------------------------+
|force is 12 N and weight is 5N ; and second force is 12N ; |
|weight is 14N and force is 5N ; |
|acceleration due to gravity is 10 and force is 6N ; |
|force is 12 4 N and weight is 7N ; and second force is 12N |
|handle zero padded decimals ; |
|Wont be replaced as outside range 20001 17000 even for decimal 20000.01 2000|
+----------------------------------------------------------------------------+
"""
I plan to run the following cross-sectional regression for 10 years and plot the coefficient estimate for variable x in one graph.
Thanks to this post, I wrote the following and it works:
forvalues i=1/10 {
reg y x if year==1
estimates store year`i'
local allyears `allyears' year`i' ||
local labels `labels' `i'
}
coefplot `allyears', keep(grade) vertical bycoefs bylabels(`labels')
I want to add the following to the same graph but don't know how:
A horizontal line segment x=5 for year 1 to year 5, and another horizontal line segment x=4 for year 6 to year 10.
A shaded area ranging from x=4 to x=6 for year 1 to year 5, and another shaded area ranging from x=2 to 4 for year 6 to year 10.
(Note that my horizontal axis is year, and my vertical axis is coefficient for x.)
Any help is greatly appreciated!
Here's an example based on the nlswork toy dataset:
clear
use http://www.stata-press.com/data/r12/nlswork.dta
for values i = 70 / 73 {
regress ln_w grade if year==`i'
estimates store year`i'
local allyears `allyears'year`i' ||
local labels `labels' `i'
}
coefplot `allyears', keep(grade) vertical bycoefs bylabels(`labels') ///
addplot(scatteri 0.08 1 0.08 3, recast(connected) || ///
scatteri 0.09 1 0.09 3, recast(connected) || ///
scatteri 0.065 2 0.065 3 0.075 3 0.075 2, recast(area) lwidth(none))
I'm using proc genmod to predict an outcome measured at 4 time points. The outcome is a total score on a mood inventory, which can range from 0 to 82. A lot of participants have a score of 0, so the negative binomial distribution in proc genmod seemed like a good fit for the data.
Now, I'm struggling with how to write/interpret the estimates statements. The primary predictors are TBI status at baseline (0=no/1=yes), and visit (0=baseline, 1=second visit, 2=third visit, 4=fourth visit), and an interaction of TBI status and visit.
How do I write my estimates, such that I'm getting out:
1. the average difference in mood inventory score for person with TBI versus a person without, at baseline.
and
2. the average difference in mood inventory change score for a person with TBI versus a person without, over the 4 study visits?
Below is what I have thus far, but I'm not sure how to interpret the output, also below, if indeed my code is correct.:
proc genmod data = analyze_long_3 ;
class id screen_tbi (param = ref ref = first) ;
model nsi_total = visit_cent screen_tbi screen_tbi*visit_cent /dist=negbin ;
output predicted = predstats;
repeated subject=id /type=cs;
estimate "tbi" intercept 1 visit_cent 0 0 0 0 screen_tbi 1 0 /exp;
estimate "no tbi" intercept 1 visit_cent 0 0 0 0 screen_tbi 0 1 /exp;
estimate 'longitudinal TBI' intercept 1
visit_cent -1 1 1 1
screen_tbi 1 0
screen_tbi*visit_cent 1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0 / exp;
estimate 'longitudinal no TBI ' intercept 1
visit_cent -1 1 1 1
screen_tbi 0 1
screen_tbi*visit_cent 0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1 / exp;
where sample = 1 ;
run;
The first research question is to have the average difference score, at baseline, for person with TBI versus a person without. It can be achieved by the following steps:
1) Get the estimated average log (score) when TBI = yes, and Visit = baseline;
2) Get the estimated average log (score) when TBI = no, and Visit =baseline;
3) 1) – 2) to have the difference in log(score) values
4) Exp[3)] to have the difference as percentage of change in scores
To simplify, let T=TBI levels, and V = Visit Levels. One thing to clarify, in your post, there are 4 visit points, the first as reference; therefore there should be 3 parameters for V, not four.
Taking the example of step 1), let’s try to write the ESTIMATE statement. It is a bit tricky. At first it sounds like this (T=0 and V =0 as reference):
ESTIMATE ‘Overall average’ intercept T 1 V 0 0 0;
But it is wrong. In the above statement, all arguments for V are set to 0. When all arguments are 0, it is the same as taking out V from the statement:
ESTIMATE ‘Overall average’ intercept T 1;
This is not the estimate of average for T=1 at baseline level. Rather, it produces an average for T=1, regardless of visit points, or, an average for all visit levels.
The problem is that the reference is set as V=0. In that case, SAS cannot tell the difference between estimates for the reference level, and the estimates for all levels. Indeed it always estimates the average for all levels. To solve it, the reference has to be set to -1, i.e., T=-1 and V=-1 as reference, such that the statement likes:
ESTIMATE ‘Average of T=1 V=baseline’ intercept T 1 V -1 -1 -1;
Now that SAS understands: fine! the job is to get the average at baseline level, not at all levels.
To make the reference value as -1 instead of 0, in the CLASS statement, the option should be specified as PARAM = EFFECT, not PARAM = REF. That brings another problem: once PARAM is not set as REF, SAS will ignore the user defined references. For example:
CLASS id T (ref=’…’) V (ref=’…’) / PARAM=EFFECT;
The (ref=’…’) is ignored when PARAM=EFFECT. How to let SAS make TBI=No and Visit=baseline as references? Well, SAS automatically takes the last level as the reference. For example, if the variable T is ordered ascendingly, the value -1 comes as the first level, while the value 1 comes as the last level; therefore 1 will be the reference. Conversely, if T is ordered in descending order, the value -1 comes at the end and will be used as the ref. This is achieved by the option ‘DESCENDING’ in the CLASS statement.
CLASS id T V / PARAM=EFFECT DESCENDING;
That way, the parameters are ordered as:
T 1 (TBI =1)
T -1 (ref level of TBI, i.e., TBI=no)
V 1 0 0 (for visit =4)
V 0 1 0 (visit = 3)
V 0 0 1 (visit =2)
V -1 -1 -1 (this is the ref level, visit=baseline)
The above information is reported in the ODS table ‘Class Level Information’. It is always good to check the very table each time after running PROC GENMOD. Note that the level (visit = 4) comes before the level (visit =3), visit =3 coming before visit=2.
Now, let’s talk a bit about the parameters and the model equation. As you might know, in SAS, the V for multi-levels is indeed broken down into dummy Vs. If baseline is set as ref level, the dummies will be like:
V4 = the fourth visit or baseline
V3= the third visit, or baseline
V2 = the second visit or baseline
Accordingly, the equation can be written as:
LOG(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
whereas:
s = the total score on a mood inventory
T = 1 for TBI status of yes, = -1 for TBI status of no
V4 = 1 for the fourth visit, = -1 for baseline
V3 = 1 for the third visit, =-1 for baseline
V2 = 1 for the second visit, = -1 for the baseline
b0 to b4 are beta estimates for the parameters
Of note, the order in the model is the same as the order defined in the statement CLASS, and the same as the order in the ODS table ‘Class Level Information’. The V4, V3, V2 have to appear in the model, all or none, i.e., if the VISIT term is to be included, V4 V3 V2 should be all introduced into the model equation. If the VISIT term is not included, none of V4, V3, and V2 should be in the equation.
With interaction terms, 3 more dummy terms must be created:
T_V4 = T*V4
T_V3 = T*V3
T_V2 = T*V2
Hence the equation with interaction terms:
Log(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2 + b5*T_V4 + b6* T_V3 + b7* T_V2
The SAS statement of ‘ESTIMATE’ is correspondent to the model equation.
For example, to estimate an overall average for all parameters and all levels, the equation is:
[Log(S)] = b0 ;
whereas [LOG(S)] stands for the expected LOG(score). Accordingly, the statement is:
ESTIMATE ‘overall (all levels of T and V)’ INTERCEPT;
In the above statement, ‘INTERCEPT’ in the statement is correspondent to ‘b0’ in the equation
To estimate an average of log (score) for T =1, and for all levels of visit points, the equation is
[LOG(S)] = b0 + b1 * T = b0 + b1 * 1
And the statement is
ESTIMATE ‘T=Yes, V= all levels’ INTERCEPT T 1;
In the above case, ‘T 1’ in the statement is correspondent to the part “*1” in the equation (i.e., let T=1)
To estimate an average of log (score) for T =1, and for visit = baseline, the equation is:
[Log(s)] = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
= b0 + b1*(1) + b2*(-1)+ b3*(-1) + b4*(-1)
The statement is:
ESTIMATE ‘T=Yes, V=Baseline’ INTERCEPT T 1 V -1 -1 -1;
‘V -1 -1 -1’ in the statement is correspondent to the values of V4, V3, and V2 in the equation. We’ve mentioned above that the dummies V4 V3 and V2 must be all introduced into the model. That is why for the V term, there are always three numbers, such as ‘V -1 -1 -1’, or ‘V 1 1 1’, etc. SAS will give warning in log if you make it like ‘V -1 -1 -1 -1’, because there are four '-1's, 1 more than required. In that case, the excessive '-1' will be ignored. On the contrary, ‘V 1 1’ is fine. It is the same as ‘V 1 1 0’. But what does 'V 1 1 0' means? To figure it out, you have to read Allison’s book (see reference).
For now, let’s carry on, and add the interaction terms. The equation:
[Log(s)] = b0 + b1*T + b2*V4 + b3*V3 + b4*V2 + b5*T_V4 + b6*T_V3 + b7*T_V2
As T_V4 = T*V4 = 1 * (-1) = -1, similarly T_V3 = -1, T_V2=-1, substitute into the equation:
[Log(s)] = b0 + b1*1 + b2*(-1)+ b3*(-1)+ b4*(-1)+ b5*(-1) + b6*(-1) + b7*(-1)
The statement is:
ESTIMATE ‘(1) T=Yes, V=Baseline, with interaction’ INTERCEPT T 1 V -1 -1 -1 T*V -1 -1 -1;
The ‘T*V -1 -1 -1’ are correspondent to the values of T_V4, T_V3 and T_V2 in the equation.
And that is the statement for step 1)!
Step 2 follows the same thoughts. To get the estimated average log (score) when TBI = no, and Visit =baseline.
T = -1, V4=-1, V3=-1, V2=-1.
T_V4 = T * V4 = (-1) * (-1) = 1
T_V3 = T * V3 = (-1) * (-1) = 1
T_V2 = T * V2 = (-1) * (-1) = 1
Substituting the values in the equation:
[Log(s)] = b0 + b1*1 + b2*(-1)+ b3*(-1)+ b4*(-1)+ b5*(1) + b6*(1) + b7*(1)
Note that the numbers: For T: 1; for V: -1 -1 -1; for interaction terms: 1 1 1
And the SAS statement:
ESTIMATE ‘(2) T=No, V=Baseline, with interaction’ INTERCEPT T 1 V -1 -1 -1 T*V 1 1 1;
The estimate results can be found in the ODS table ‘Contrast Estimate Results’.
For step 3), subtract the estimate (1) – (2), to have the difference of log(score); and for step(4), have the exponent of the diff in step 3).
For the second research question:
The average difference in mood inventory change score for a person with TBI versus a person without, over the 4 study visits.
Over the 4 study visits means for all visit levels. By now, you might have known that the statement is simpler:
ESTIMATE ‘(1) T=Yes, V=all levels’ INTERCEPT T 1;
ESTIMATE ‘(2) T=Yes, V=all levels’ INTERCEPT T -1;
Why there are no interaction terms? Because all visit levels are considered. And when all levels are considered, you do not have to put any visit-related terms into the statement.
Finally, the above approach requires some manual calculation. Indeed it is possible to make one single line of ESTIMATE statement that is equivalent to the aforementioned approach. However, the method we discussed above is way easier to understand. For more sophisticated methods, please read Allison’s book.
Reference:
1. Allison, Paul D. Logistic Regression Using SAS®: Theory and Application, Second Edition. Copyright © 2012, SAS Institute Inc.,Cary, North Carolina, USA.
I am comparing the values for shadow price (pi) calculated with gurobi and pulp. I get different values for the same input and I am not sure how to do it with pulp. Here is the lp file that I use:
Minimize
x[0] + x[1] + x[2] + x[3]
Subject To
C[0]: 7 x[0] >= 211
C[1]: 3 x[1] >= 395
C[2]: 2 x[2] >= 610
C[3]: 2 x[3] >= 97
Bounds
End
For the above lp file, gurobi gives me shadow prices:
[0.14285714285714285, 0.3333333333333333, 0.5, 0.5]
and with pulp I get:
[0.14285714, 0.33333333, 0.5, 0.5]
But If I execute the following lp model:
Minimize
x[0] + x[1] + x[2] + x[3] + x[4]
Subject To
C[0]: 7 x[0] + 2 x[4] >= 211
C[1]: 3 x[1] >= 395
C[2]: 2 x[2] + 2 x[4] >= 610
C[3]: 2 x[3] >= 97
Bounds
End
With gurobi I get:
[0.0, 0.3333333333333333, 0.5, 0.5]
and with pulp I get:
[0.14285714, 0.33333333, 0.5, 0.5]
The correct value is the one that gurobi returns (I think ?).
Why I get the same shadow prices with pulp for different models ? How I can get the same results as gurobi ?
(I did not supply the source code because the question will be too long, I think the lp models are enough)
In the second example, there are two dual solutions that are optimal: the one PuLP gives you, and the one you get by calling Gurobi directly. The unique optimal primal solution is [0.0, 131.67, 199.5, 48.5, 105.5], which makes the slacks for all the constraints are 0 in the optimal primal solution. For c[0] if you reduce the right hand side, you get no reduction in the objective, but if you increase it, the cheapest way to make the constraint feasible is by increasing x[0]. Gurobi only guarantees that you will produce an optimal primal and dual solution. The specific optimal solution you get is arbitrary.
The first example is just a precision issue.
I am trying to read data from html file
The data are delimmited by <PRE></PRE> tag
e.g.:
<pre>
12.0 29132 -60.3 -91.4 1 0.01 260 753.2 753.3 753.2
10.0 30260 -57.9 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 1009.2 1011.8 1009.3
</pre>
t = regexp(html, '<PRE[^>]*>(.*?)</PRE>', 'tokens');
where t is a cell of char
Well, now I would to replace the blank space with NaN and to obtain:
12.0 29132 -60.3 -91.4 1 0.01 260 Nan 753.2 753.3 753.2
10.0 30260 -57.9 Nan 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 NaN NaN 1009.2 1011.8 1009.3
This data will be saved on mydata.dat file
If you have the HTML file hosted somewhere, then:
url = 'http://www.myDomain.com/myFile.html';
html = urlread(url);
% Use regular expressions to remove undesired HTML markup.
txt = regexprep(html,'<script.*?/script>','');
txt = regexprep(txt,'<style.*?/style>','');
txt = regexprep(txt,'<pre.*?/pre>','');
txt = regexprep(txt,'<.*?>','')
Now you should have the date in text format in txt variable. You can use textscan to parse the txt var and you can scan for the whitespace or for the numbers.
More Info:
- urlread
- regexprep
This isn't the perfect solution but it seems to get you there.
Assuming t is one long string, the delimiter is white space, and you know the number of columns:
numcols = 7;
sample = '1 2 3 4 5 7 1 3 5 7';
test = textscan(sample,'%f','delimiter',' ','MultipleDelimsAsOne',false);
test = test{:}; % Pull the double out of the cell array
test(2:2:end) = []; % Dump out extra NaNs
test2 = reshape(test,numcols,length(test)/numcols)'; % Have to mess with it a little to reshape rowwise instead of columnwise
Returns:
test2 =
1 2 3 4 5 NaN 7
1 NaN 3 NaN 5 NaN 7
This is assuming the delimiter is white space and constant. Textscan doesn't allow you to stack whitespace as a delimiter, so it throws a NaN after each white space character if there isn't data present. In your example data there are two white space characters between each data point, so every other NaN (or, more generically, n_whitespace - 1) can be thrown out, leaving you with the NaNs you actually want.