regex with a limit on new lines - regex

I have the following text:
gvkey |
1017 | .9610464 1.04128 0.92 0.356 -1.079825 3.001917
1018 | -.0599428 1.306879 -0.05 0.963 -2.621379 2.501493
1021 | -.0766854 .9906029 -0.08 0.938 -2.018231 1.86486
1034 | -2.678616 1.308118 -2.05 0.041 -5.24248 -.1147511
1056 | 1.694514 .9563385 1.77 0.076 -.1798751 3.568903
1065 | 1.106467 .9584568 1.15 0.248 -.7720734 2.985008
10001 | .7988226 1.019213 0.78 0.433 -1.198799 2.796444
10010 | .8203764 .9429188 0.87 0.384 -1.02771 2.668463
10022 | 1.590896 .9615904 1.65 0.098 -.2937862 3.475579
10030 | .0067641 .9798901 0.01 0.994 -1.913785 1.927313
10039 | 3.767551 .9168058 4.11 0.000 1.970645 5.564458
10056 | 2.29646 .9789753 2.35 0.019 .3777042 4.215217
10066 | 2.635614 .9398462 2.80 0.005 .7935496 4.477679
10088 | 1.679799 .930843 1.80 0.071 -.1446195 3.504218
10089 | -16.62772 1.017178 -16.35 0.000 -18.62135 -14.63409
10093 | .3149815 .9174881 0.34 0.731 -1.483262 2.113225
10097 | 2.976634 .9224759 3.23 0.001 1.168615 4.784654
10107 | -.1184532 .9405728 -0.13 0.900 -1.961942 1.725036
10115 | 1.899066 .9165281 2.07 0.038 .102704 3.695428
208068 | -1.236473 .9326577 -1.33 0.185 -3.064448 .5915026
209341 | -.804362 .9516883 -0.85 0.398 -2.669637 1.060913
213449 | -1.248011 .9460252 -1.32 0.187 -3.102186 .6061647
220546 | -4.424031 .9431063 -4.69 0.000 -6.272485 -2.575576
221821 | -.9759739 .9240414 -1.06 0.291 -2.787062 .8351139
222111 | -3.733076 .9440901 -3.95 0.000 -5.583458 -1.882693
223098 | -2.892674 1.158793 -2.50 0.013 -5.163865 -.6214818
242977 | -1.324193 .9371738 -1.41 0.158 -3.161019 .5126345
|
_cons | .1156292 .915384 0.13 0.899 -1.678491 1.909749
------------------------------------------------------------------------------------
gvkey |
1017 | .9610464 1.04128 0.92 0.356 -1.079825 3.001917
1018 | -.0599428 1.306879 -0.05 0.963 -2.621379 2.501493
1021 | -.0766854 .9906029 -0.08 0.938 -2.018231 1.86486
1034 | -2.678616 1.308118 -2.05 0.041 -5.24248 -.1147511
1056 | 1.694514 .9563385 1.77 0.076 -.1798751 3.568903
1065 | 1.106467 .9584568 1.15 0.248 -.7720734 2.985008
10001 | .7988226 1.019213 0.78 0.433 -1.198799 2.796444
10010 | .8203764 .9429188 0.87 0.384 -1.02771 2.668463
10022 | 1.590896 .9615904 1.65 0.098 -.2937862 3.475579
10030 | .0067641 .9798901 0.01 0.994 -1.913785 1.927313
10039 | 3.767551 .9168058 4.11 0.000 1.970645 5.564458
10056 | 2.29646 .9789753 2.35 0.019 .3777042 4.215217
10066 | 2.635614 .9398462 2.80 0.005 .7935496 4.477679
10088 | 1.679799 .930843 1.80 0.071 -.1446195 3.504218
10089 | -16.62772 1.017178 -16.35 0.000 -18.62135 -14.63409
10093 | .3149815 .9174881 0.34 0.731 -1.483262 2.113225
10097 | 2.976634 .9224759 3.23 0.001 1.168615 4.784654
10107 | -.1184532 .9405728 -0.13 0.900 -1.961942 1.725036
10115 | 1.899066 .9165281 2.07 0.038 .102704 3.695428
208068 | -1.236473 .9326577 -1.33 0.185 -3.064448 .5915026
209341 | -.804362 .9516883 -0.85 0.398 -2.669637 1.060913
213449 | -1.248011 .9460252 -1.32 0.187 -3.102186 .6061647
220546 | -4.424031 .9431063 -4.69 0.000 -6.272485 -2.575576
221821 | -.9759739 .9240414 -1.06 0.291 -2.787062 .8351139
222111 | -3.733076 .9440901 -3.95 0.000 -5.583458 -1.882693
223098 | -2.892674 1.158793 -2.50 0.013 -5.163865 -.6214818
242977 | -1.324193 .9371738 -1.41 0.158 -3.161019 .5126345
|
_cons | .1156292 .915384 0.13 0.899 -1.678491 1.909749
------------------------------------------------------------------------------------
I am looking for a regular expression that removes all the gvkey rows, so that the output looks something like this:
|
_cons | .1156292 .915384 0.13 0.899 -1.678491 1.909749
------------------------------------------------------------------------------------
|
_cons | .1156292 .915384 0.13 0.899 -1.678491 1.909749
------------------------------------------------------------------------------------
I'm very new to regex, and have tried searching for the following in Notepad++ and replacing it with nothing:
gvkey.*[\n].*_cons
The problem is that it matches everything from the first gvkey block through to the second one and removes all of it.
Is there a way to make the search match each gvkey block separately? (In my example, it would find and replace the gvkey block twice in total.)
Many thanks in advance.

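You may use this regex in Notepad++ with the search mode set to Regular expression. It relies on ^ matching at the start of every line (MULTILINE behaviour), which is how Notepad++ treats ^ by default; the ". matches newline" option is a different setting (DOTALL) and is not needed here, because [\s\S] already matches line breaks: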
^\h*gvkey[\s\S]*?\R(?=\h+\|)
And replace with an empty string.
RegEx Details:
^: Line start
\h*: matches 0 or more horizontal whitespaces
gvkey: Matches gvkey string
[\s\S]*?: Matches 0 or more of any character including newlines (lazy)
\R: Matches any line break sequence (\n, \r\n or \r)
(?=\h+\|): Positive lookahead to assert that the next line starts with 1 or more horizontal whitespaces followed by a pipe character
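The pattern can also be tried outside Notepad++. Below is a minimal Python sketch under two assumptions that are not from the question: the sample text is shortened, and it is indented the way Stata output usually is (the lookahead needs whitespace before the bare pipe). Python's re module has no \h or \R, so [ \t] and (?:\r\n|\n|\r) are swapped in for them.

```python
import re

# Hypothetical, shortened sample; the leading indentation is an assumption
# about the original Stata output.
text = (
    "       gvkey |\n"
    "        1017 |   .9610464    1.04128     0.92   0.356\n"
    "        1018 |  -.0599428   1.306879    -0.05   0.963\n"
    "             |\n"
    "       _cons |   .1156292    .915384     0.13   0.899\n"
    "--------------------\n"
    "       gvkey |\n"
    "        1017 |   .9610464    1.04128     0.92   0.356\n"
    "             |\n"
    "       _cons |   .1156292    .915384     0.13   0.899\n"
    "--------------------\n"
)

# [ \t] stands in for \h and (?:\r\n|\n|\r) for \R, which Python's re lacks.
pattern = re.compile(
    r"^[ \t]*gvkey[\s\S]*?(?:\r\n|\n|\r)(?=[ \t]+\|)",
    re.MULTILINE,
)

# subn returns the new text and the number of replacements made.
cleaned, count = pattern.subn("", text)
print(count)
print(cleaned)
```

The lazy [\s\S]*? is what keeps each match inside one gvkey block: it stops at the first line break that is followed by a bare pipe line instead of running on to the last one.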

Why don't you just try this?
Find what: ^(?!.*_con).*\s
Replace with: (leave empty)
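A quick way to see what this second pattern does is to run it on a small sample. The Python sketch below uses a shortened, made-up sample; it suggests that every line without _con is deleted, which also drops the bare | lines and the dashed separators that the question's expected output keeps.

```python
import re

# Hypothetical, shortened sample based on the question's text.
text = (
    "gvkey |\n"
    "1017 | .9610464 1.04128 0.92 0.356\n"
    "|\n"
    "_cons | .1156292 .915384 0.13 0.899\n"
    "--------------------\n"
)

# re.MULTILINE lets ^ match at the start of each line, as in Notepad++.
# The negative lookahead rejects any line that contains "_con".
cleaned = re.sub(r"^(?!.*_con).*\s", "", text, flags=re.MULTILINE)
print(cleaned)
```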

Related

Regex search with limited new lines

I have the following text:
KN_Divers_Blau | -6.429897 8.010333 -0.80 0.422 -22.14101 9.28122
Ind_ROA_Ave | .3407456 .3389998 1.01 0.315 -.3241539 1.005645
Ind_Tobin_Q1_Ave | -.5065654 .2104229 -2.41 0.016 -.9192797 -.0938511
Ind_Growth_Ave | -1.404911 1.852805 -0.76 0.448 -5.038922 2.229101
Pat_Dum | -18.31015 5.452194 -3.36 0.001 -29.00385 -7.616457
|
year |
1981 | -5.575117 2.805975 -1.99 0.047 -11.07863 -.0715993
1982 | -6.171125 5.447273 -1.13 0.257 -16.85517 4.512919
1983 | -11.8282 8.84588 -1.34 0.181 -29.17812 5.521726
1984 | -20.39602 11.73682 -1.74 0.082 -43.41611 2.624069
1985 | -23.7097 14.29652 -1.66 0.097 -51.75028 4.330874
1986 | -29.43432 16.51849 -1.78 0.075 -61.83297 2.964339
1987 | -35.30922 18.5138 -1.91 0.057 -71.62137 1.002936
1988 | -49.09056 19.95166 -2.46 0.014 -88.22289 -9.958242
1989 | -53.98487 21.88913 -2.47 0.014 -96.91725 -11.05248
1990 | -67.58938 23.41111 -2.89 0.004 -113.5069 -21.67185
1991 | -78.59984 25.52294 -3.08 0.002 -128.6594 -28.54026
1992 | -88.89806 28.22778 -3.15 0.002 -144.2628 -33.53332
1993 | -98.40131 31.35391 -3.14 0.002 -159.8975 -36.90512
1994 | -102.953 33.25041 -3.10 0.002 -168.1689 -37.73712
1995 | -116.2812 37.25681 -3.12 0.002 -189.355 -43.20726
1996 | -118.0298 38.76035 -3.05 0.002 -194.0527 -42.00698
1997 | -118.4325 38.4338 -3.08 0.002 -193.8149 -43.05017
1998 | -123.8912 37.96394 -3.26 0.001 -198.352 -49.43038
1999 | -128.3908 39.44807 -3.25 0.001 -205.7626 -51.01913
2000 | -133.2699 40.31404 -3.31 0.001 -212.3401 -54.19972
2001 | -126.159 37.63045 -3.35 0.001 -199.9658 -52.35232
2002 | -119.8247 36.05833 -3.32 0.001 -190.5479 -49.10146
2003 | -109.2157 34.54755 -3.16 0.002 -176.9758 -41.45563
2004 | -114.1801 33.58204 -3.40 0.001 -180.0465 -48.31378
2005 | 0 (omitted)
|
_cons | -187.8645 62.81122 -2.99 0.003 -311.0597 -64.66936
KN_Divers_Blau | -6.57637 8.068413 -0.82 0.415 -22.4014 9.248663
Ind_ROA_Ave | .3641781 .3411348 1.07 0.286 -.3049088 1.033265
Ind_Tobin_Q1_Ave | -.5070564 .2105863 -2.41 0.016 -.9200911 -.0940217
Ind_Growth_Ave | -1.424116 1.871656 -0.76 0.447 -5.095101 2.246869
Pat_Dum | -18.51642 5.463958 -3.39 0.001 -29.23319 -7.799652
|
year |
1981 | -4.660021 2.721933 -1.71 0.087 -9.998702 .678659
1982 | -5.557028 5.497126 -1.01 0.312 -16.33885 5.224794
1983 | -10.63977 8.795378 -1.21 0.227 -27.89063 6.611104
1984 | -18.76668 11.39263 -1.65 0.100 -41.1117 3.578331
1985 | -23.61831 14.32697 -1.65 0.099 -51.71861 4.481984
1986 | -29.10203 16.61986 -1.75 0.080 -61.6995 3.495445
1987 | -34.29028 18.46377 -1.86 0.063 -70.50431 1.923745
1988 | -48.44084 19.75174 -2.45 0.014 -87.18104 -9.70065
1989 | -54.73721 22.04372 -2.48 0.013 -97.97281 -11.50162
1990 | -67.16001 23.65404 -2.84 0.005 -113.554 -20.76601
1991 | -77.92565 25.97627 -3.00 0.003 -128.8744 -26.97694
1992 | -88.53438 28.49949 -3.11 0.002 -144.432 -32.63673
1993 | -97.72113 31.57967 -3.09 0.002 -159.6601 -35.78213
1994 | -102.3819 33.38187 -3.07 0.002 -167.8557 -36.90815
1995 | -115.8907 37.23702 -3.11 0.002 -188.9258 -42.85566
1996 | -118.6755 39.02702 -3.04 0.002 -195.2214 -42.12961
1997 | -118.675 38.75563 -3.06 0.002 -194.6886 -42.66145
1998 | -124.622 38.53307 -3.23 0.001 -200.1991 -49.04492
1999 | -128.1722 39.91359 -3.21 0.001 -206.4569 -49.88741
2000 | -133.1516 40.6607 -3.27 0.001 -212.9017 -53.40144
2001 | -126.7362 38.51777 -3.29 0.001 -202.2833 -51.18914
2002 | -119.7739 36.83191 -3.25 0.001 -192.0145 -47.53344
2003 | -108.5075 34.97694 -3.10 0.002 -177.1097 -39.90524
2004 | -111.8748 33.35352 -3.35 0.001 -177.2929 -46.45662
2005 | 0 (omitted)
|
_cons | -178.691 61.08993 -2.93 0.003 -298.5101 -58.87189
I am looking for a regular expression that removes all the year rows, so that the output looks something like this:
KN_Divers_Blau | -6.429897 8.010333 -0.80 0.422 -22.14101 9.28122
Ind_ROA_Ave | .3407456 .3389998 1.01 0.315 -.3241539 1.005645
Ind_Tobin_Q1_Ave | -.5065654 .2104229 -2.41 0.016 -.9192797 -.0938511
Ind_Growth_Ave | -1.404911 1.852805 -0.76 0.448 -5.038922 2.229101
Pat_Dum | -18.31015 5.452194 -3.36 0.001 -29.00385 -7.616457
|
|
_cons | -187.8645 62.81122 -2.99 0.003 -311.0597 -64.66936
KN_Divers_Blau | -6.57637 8.068413 -0.82 0.415 -22.4014 9.248663
Ind_ROA_Ave | .3641781 .3411348 1.07 0.286 -.3049088 1.033265
Ind_Tobin_Q1_Ave | -.5070564 .2105863 -2.41 0.016 -.9200911 -.0940217
Ind_Growth_Ave | -1.424116 1.871656 -0.76 0.447 -5.095101 2.246869
Pat_Dum | -18.51642 5.463958 -3.39 0.001 -29.23319 -7.799652
|
|
_cons | -178.691 61.08993 -2.93 0.003 -298.5101 -58.87189
I'm extremely new to regex, and have tried searching for the following in Notepad++ and replacing it with nothing:
year.*[\n].*1981.*[\n].*2005
The problem is that it matches everything from the first year block through to the second one and removes all of it.
Is there a way to make the search match each year block separately? (In my example, it would find and replace the year block twice in total.)
Many thanks in advance.
You may use the following pattern:
^\s*year.*(?:[\r\n]+\s*\d{4}\b.*)*[\r\n]+
and replace it with an empty string.
Breakdown:
^ Beginning of line.
\s* Match zero or more whitespace characters.
year.* Match "year" followed by any number of characters on the same line.
(?: Start of a non-capturing group.
[\r\n]+ Match one or more line-break characters.
\s* Match zero or more whitespace characters.
\d{4}\b.* Match four digits, then a word boundary, followed by any number of characters.
) Close the non-capturing group.
* Match zero or more occurrences of the previous group.
[\r\n]+ Match one or more line-break characters.
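The breakdown above can be checked with a short Python sketch. The sample is a shortened, hand-built version of the question's text, and its indentation is an assumption; the pattern itself is used unchanged, with re.MULTILINE so that ^ matches at each line start.

```python
import re

# Hypothetical, shortened sample with Stata-style indentation (an assumption).
text = (
    "         Pat_Dum |  -18.31015   5.452194\n"
    "                 |\n"
    "            year |\n"
    "           1981  |  -5.575117   2.805975\n"
    "           2005  |          0  (omitted)\n"
    "                 |\n"
    "           _cons |  -187.8645   62.81122\n"
    "         Pat_Dum |  -18.51642   5.463958\n"
    "                 |\n"
    "            year |\n"
    "           1981  |  -4.660021   2.721933\n"
    "           2005  |          0  (omitted)\n"
    "                 |\n"
    "           _cons |   -178.691   61.08993\n"
)

pattern = re.compile(
    r"^\s*year.*(?:[\r\n]+\s*\d{4}\b.*)*[\r\n]+",
    re.MULTILINE,
)

# Each match covers the "year" header line plus the four-digit rows below it.
cleaned, count = pattern.subn("", text)
print(count)
print(cleaned)
```

Because the repeated group only accepts lines that start with four digits, the match stops by itself at the bare | line, so each year block is matched separately.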

Chart Behavior in Oracle APEX

I have time-series data in my table. Sample data is given below:
+------+------------+-----------+-----------+-------------+-------------+
| CODE | YEAR_MONTH | CALC_LVL1 | CALC_LVL2 | MSRMT_PCT_1 | MSRMT_PCT_2 |
+------+------------+-----------+-----------+-------------+-------------+
| A1 | 201912 | 87 | 564 | 0.14 | 0.1 |
| A1 | 201911 | 34 | 455 | 0.15 | 0.08 |
| A1 | 201910 | 20 | 295 | 0.1 | 0.14 |
| A1 | 201909 | 39 | 219 | 0.08 | 0.14 |
| A1 | 201908 | 98 | 438 | 0.14 | 0.11 |
| A1 | 201907 | 7 | 219 | 0.08 | 0.14 |
| A1 | 201812 | 63 | 564 | 0.14 | 0.17 |
| A1 | 201808 | 12 | 455 | 0.15 | 0.13 |
| A1 | 201805 | 48 | 409 | 0.13 | 0.13 |
| A1 | 201802 | 88 | 289 | 0.11 | 0.08 |
| A1 | 201801 | 9 | 492 | 0.14 | 0.13 |
+------+------------+-----------+-----------+-------------+-------------+
Is there any way to have the chart show yearly values by default and, when the user clicks on a year label, show the monthly data for that year?
I am assuming you are looking for a Time Axis chart, where the chart label is mapped to a DATE or TIMESTAMP column. In the chart attributes, set the Time Axis Type to Enabled; labels will then be rendered as readable dates. You can then build another chart or report to drill down to from this chart. To do this, navigate to the chart, select the series, and in the Property Editor go to Column Mapping and select the column names for LABEL and VALUE. For Link > Type, select Redirect to Page in this Application, then click Target, select the page, and set the page item and value.

xtreg omitting year dummy variables when using i.year

I have a panel dataset with the following years:
tab year
year | Freq. Percent Cum.
------------+-----------------------------------
2000 | 31 12.55 12.55
2001 | 31 12.55 25.10
2002 | 30 12.15 37.25
2003 | 31 12.55 49.80
2004 | 31 12.55 62.35
2005 | 31 12.55 74.90
2006 | 31 12.55 87.45
2007 | 31 12.55 100.00
------------+-----------------------------------
Total | 247 100.00
When I do xtreg dv iv i.year, I see that year 2000 is not included, as well as 2007:
xtreg local_gr rtxdum i.year
note: 2007.year omitted because of collinearity
Random-effects GLS regression Number of obs = 247
Group variable: province_n~e Number of groups = 31
R-sq: Obs per group:
within = 0.6194 min = 7
between = 0.0016 avg = 8.0
overall = 0.2356 max = 8
Wald chi2(7) = 341.51
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
------------------------------------------------------------------------------
local_gr | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rtxdum | -753799.7 291543.7 -2.59 0.010 -1325215 -182384.5
|
year |
2001 | 388246 291543.7 1.33 0.183 -183169.2 959661.2
2002 | 745406.4 294294.5 2.53 0.011 168599.8 1322213
2003 | 1175610 291543.7 4.03 0.000 604194.4 1747025
2004 | 1773982 291543.7 6.08 0.000 1202567 2345397
2005 | 2600005 291543.7 8.92 0.000 2028589 3171420
2006 | 4425318 291543.7 15.18 0.000 3853903 4996734
2007 | 0 (omitted)
|
_cons | 1564670 447832.4 3.49 0.000 686934.1 2442405
-------------+----------------------------------------------------------------
sigma_u | 2217878.8
sigma_e | 1150064.9
rho | .78809251 (fraction of variance due to u_i)
------------------------------------------------------------------------------
The message says 2007 was omitted due to collinearity, but I don't understand why year 2000 would not show up in the results?
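Because 2000 is the base level, that is, the reference category for i.year, so it has no coefficient of its own. You can display base levels with the allbaselevels option, shown here on a different dataset: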
webuse nlswork, clear
xtset idcode
xtreg ln_w grade tenure i.race not_smsa south, allbaselevels
Random-effects GLS regression Number of obs = 28,091
Group variable: idcode Number of groups = 4,697
R-sq: Obs per group:
within = 0.1005 min = 1
between = 0.4498 avg = 6.0
overall = 0.3305 max = 15
Wald chi2(6) = 6509.50
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
grade | .07605 .0018128 41.95 0.000 .0724969 .0796031
tenure | .0361319 .0006298 57.37 0.000 .0348975 .0373663
|
race |
white | 0 (base)
black | -.0530121 .0102916 -5.15 0.000 -.0731832 -.0328409
other | .0762678 .0415911 1.83 0.067 -.0052492 .1577849
|
not_smsa | -.1289554 .0074296 -17.36 0.000 -.1435172 -.1143936
south | -.0786512 .0075533 -10.41 0.000 -.0934555 -.063847
_cons | .6759773 .0244723 27.62 0.000 .6280125 .7239421
-------------+----------------------------------------------------------------
sigma_u | .26440074
sigma_e | .30295598
rho | .43235646 (fraction of variance due to u_i)
------------------------------------------------------------------------------
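Why a base level has to be left out at all can be sketched in a few lines of Python (used here only for illustration; it is not Stata code, and the one-row-per-year data is made up). If every year had its own indicator, the indicators would add up to the constant term in every row, so one level serves as the reference and gets no coefficient.

```python
# Hypothetical illustration: one row per year, 2000-2007.
years = list(range(2000, 2008))

# One 0/1 indicator column per year, i.e. what i.year would expand to
# if no level were dropped.
dummies = {y: [1 if row_year == y else 0 for row_year in years] for y in years}

constant = [1] * len(years)

# Summing all indicators row by row reproduces the constant column exactly,
# so the full set of dummies plus the constant is perfectly collinear.
row_sums = [sum(dummies[y][i] for y in years) for i in range(len(years))]
print(row_sums == constant)

# Without the base level (2000) the sum no longer equals the constant:
# the row for 2000 is all zeros and is captured by the constant alone.
without_base = [
    sum(dummies[y][i] for y in years if y != 2000) for i in range(len(years))
]
print(without_base)
```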

error while checking if column value is in other columns in pandas data frame

I have a data frame with several thousand rows, part of which contains data like the below. The data frame also has additional columns ["FP","Y","SLC","C_ID","NR"].
z_to_s | z_to_t | s_to_t | t_p | min | max
0.04 | | 0.06 | 0.29 | 0.04 | 0.29
0.01 | | NS | NS | 0.01 | 0.01
ND | | NS | NS | ND | ND
0.04 | | ND* | NS | ND* | 0.04
| 0.55* | | | 0.55 | 0.55
19.88* | | 0.46 | 0.09 | 0.09 | 19.88
The "min" and "max" columns each denote minimum and maximum values from the "z_to_s", "z_to_t", "s_to_t", and "t_p" columns. ND or ND* is always considered the minimum while NS is ignored. I need to maintain the original form of the input data so my final output should look like this:
z_to_s | z_to_t | s_to_t | t_p | min | max
0.04 | | 0.06 | 0.29 | 0.04 | 0.29
0.01 | | NS | NS | 0.01 | 0.01
ND | | NS | NS | ND | ND
0.04 | | ND* | NS | ND* | 0.04
| 0.55* | | | 0.55* | 0.55
19.88* | | 0.46 | 0.09 | 0.09 | 19.88*
To do this, I've been trying to use the below code to create new columns called "QC_min" and "QC_max":
df["QC_min"] = df.drop(["FP","Y","SLC","C_ID","NR","min","max"], axis = 1).isin(data_concat["min"]).any(axis = 1)
df["QC_max"] = df.drop(["FP","Y","SLC","C_ID","NR","min","max"], axis = 1).isin(data_concat["max"]).any(axis = 1)
so "QC_min" and "QC_max" have TRUE/FALSE values depending on whether "min"/"max" matches any one of the ["z_to_s","z_to_t","s_to_t","t_p"] column values. I want to write another line of code so that if "QC_min" or "QC_max" is FALSE, I add a "*" to the end of the corresponding "min" or "max" value. However, the output from the above code shows up like this:
z_to_s | z_to_t | s_to_t | t_p | min | max | QC_min | QC_max
0.04 | | 0.06 | 0.29 | 0.04 | 0.29 | FALSE | FALSE
0.01 | | NS | NS | 0.01 | 0.01 | FALSE | FALSE
ND | | NS | NS | ND | ND | TRUE | TRUE
0.04 | | ND* | NS | ND* | 0.04 | TRUE | FALSE
| 0.55* | | | 0.55 | 0.55 | FALSE | FALSE
19.88* | | 0.46 | 0.09 | 0.09 | 19.88 | FALSE | FALSE
where all the number objects show up as false regardless of whether they match or not, while the string objects are true. I have checked my data type, wondering if this was a data type int/float/str issue. If I add an astype(str) to my "min" or "max" so my code becomes
df["QC_min"] = df.drop(["FP","Y","SLC","C_ID","NR","min","max"], axis = 1).isin(data_concat["min"]).astype(str).any(axis = 1)
df["QC_max"] = df.drop(["FP","Y","SLC","C_ID","NR","min","max"], axis = 1).isin(data_concat["max"]).astype(str).any(axis = 1)
everything becomes TRUE, regardless of the *, like this:
z_to_s | z_to_t | s_to_t | t_p | min | max | QC_min | QC_max
0.04 | | 0.06 | 0.29 | 0.04 | 0.29 | TRUE | TRUE
0.01 | | NS | NS | 0.01 | 0.01 | TRUE | TRUE
ND | | NS | NS | ND | ND | TRUE | TRUE
0.04 | | ND* | NS | ND* | 0.04 | TRUE | TRUE
| 0.55* | | | 0.55 | 0.55 | TRUE | TRUE
19.88* | | 0.46 | 0.09 | 0.09 | 19.88 | TRUE | TRUE
Where am I going wrong? Suggestions on how to fix this/do what I want to do would be much appreciated. Thanks.
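One observation about the second attempt: .astype(str) is applied to the Boolean result of isin rather than to the values being compared, so the frame holds the strings "True" and "False", and non-empty strings are truthy, which is consistent with every row becoming TRUE. The sketch below is one possible way to do the check row by row instead. It is not a confirmed fix: it assumes every cell can be compared as stripped text, rebuilds the question's sample by hand without the extra columns, and follows the plan stated in the question (append * when min or max has no exact match), so the 0.55 row ends up with 0.55* in both min and max.

```python
import pandas as pd

# Sample rebuilt by hand from the question; all cells are assumed to be text.
df = pd.DataFrame({
    "z_to_s": ["0.04", "0.01", "ND", "0.04", "", "19.88*"],
    "z_to_t": ["", "", "", "", "0.55*", ""],
    "s_to_t": ["0.06", "NS", "NS", "ND*", "", "0.46"],
    "t_p": ["0.29", "NS", "NS", "NS", "", "0.09"],
    "min": ["0.04", "0.01", "ND", "ND*", "0.55", "0.09"],
    "max": ["0.29", "0.01", "ND", "0.04", "0.55", "19.88"],
})

value_cols = ["z_to_s", "z_to_t", "s_to_t", "t_p"]

# Convert the compared values themselves to stripped strings.
values = df[value_cols].astype(str).apply(lambda col: col.str.strip())

for target in ["min", "max"]:
    as_text = df[target].astype(str).str.strip()
    # Row-wise comparison: each value column against min/max of the same row.
    matches = values.eq(as_text, axis=0).any(axis=1)
    df["QC_" + target] = matches
    # Keep the value where a match exists, otherwise append "*".
    df[target] = as_text.where(matches, as_text + "*")

print(df)
```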

Stata : distinct values, count and histogram

I am new to Stata and still learning.
I have a variable shaped like this:
+-------+
| Phase |
+-------+
| I |
+-------+
| I |
+-------+
| II |
+-------+
| III |
+-------+
| II |
+-------+
My goal is to draw a histogram with the possible values (I, II, III) on the x-axis and the count of each (2, 2, 1) on the y-axis.
I thought I could write a loop and store the count of each possible value in an array, but arrays do not seem to be implemented in Stata.
Is there an existing function that does what I want, or do I have to write one that finds the distinct values, counts them, and then draws the histogram?
Thank you.
Edit:
processed.p |
hase | Freq. Percent Cum.
------------+-----------------------------------
I | 266 0.92 0.92
I/II | 1,006 3.50 4.42
II | 10,867 37.76 42.18
II/III | 344 1.20 43.37
III | 9,248 32.13 75.51
IV | 6,984 24.27 99.77
NA | 65 0.23 100.00
------------+-----------------------------------
Total | 28,780 100.00
I found a way of counting the distinct values.
The solution is:
tab processedphase, matcell(x)
which gives:
processed.p |
hase | Freq. Percent Cum.
------------+-----------------------------------
I | 266 0.92 0.92
I/II | 1,006 3.50 4.42
II | 10,867 37.76 42.18
II/III | 344 1.20 43.37
III | 9,248 32.13 75.51
IV | 6,984 24.27 99.77
NA | 65 0.23 100.00
------------+-----------------------------------
Total | 28,780 100.00
then:
matrix list x
svmat x
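For readers who want to see the count-then-plot idea outside Stata, here is a small Python sketch. It is an illustration only, not part of the Stata solution, and it uses the five sample values from the question with a crude text bar chart standing in for a real histogram.

```python
from collections import Counter

# The five sample values from the question.
phases = ["I", "I", "II", "III", "II"]

# Count how often each distinct value occurs.
counts = Counter(phases)

# Text bar chart: one line per distinct value, sorted by label.
for phase in sorted(counts):
    print(f"{phase:<4}{'#' * counts[phase]} ({counts[phase]})")
```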