Deleting tables with regular expressions - regex

Not really a specific question, since I don't know enough yet; it's more a question about how to approach this.
Example file can be seen below:
LOADING CONDITION : 4-Homogenous cargo 98% 1.018t/m3, draught 3.35m
- outgoing
ITEMS OF LOADING
-------------------------------------------------------------------------------
CAPA ITEM REFERENCE X1 X2 WEIGHT KG LCG YG FSM
No (m) (m) (t) (m) (m) (m) (t.m)
-------------------------------------------------------------------------------
13 No2 CARGO TK P 1.650 29.400 609.04 2.745 15.525 -3.384 483.49
14 No2 CARGO TK S 1.650 29.400 603.61 2.745 15.525 3.384 483.49
15 No1 CARGO TK P 29.400 56.400 587.23 2.745 42.900 -3.384 470.42
16 No1 CARGO TK S 29.400 56.400 592.45 2.745 42.900 3.384 470.42
17 MGO tank aft 21.150 23.400 23.42 6.531 22.275 -0.500 15.70
18 TO storage tank 21.150 23.400 2.68 7.225 22.275 2.300 0.00
19 MGO fore tank 33.150 35.400 25.90 6.643 34.275 -0.212 0.00
-------------------------------------------------------------------------------
DEADWEIGHT 2444.34 2.828 29.007 -0.005 1923.52
SUMMARY OF LOADING
WEIGHT KG LCG YG FSM
(t) (m) (m) (m) (t.m)
-------------------------------------------------------------------------------
DEADWEIGHT 2444.34 2.828 29.007 -0.005 1923.52
LIGHT SHIP 634.00 3.030 28.654 0.000 0.00
-------------------------------------------------------------------------------
TOTAL WEIGHT 3078.34 2.869 28.935 -0.004 1923.52
LOADING CONDITION : 4-Homogenous cargo 98% 1.018t/m3, draught 3.35m
- outgoing
Damage Case : 1bott: all cargo & void3
Flooding Percentage : 100 %
Flooded Volumes : No.3 Void space P No.3 Void space S No2 CARGO TK P
No2 CARGO TK S No1 CARGO TK P No1 CARGO TK S
-------------------------------------------------------------------------------
WEIGHT KG LCG YG FSM CORR.KG
(t) (m) (m) (m) (t.m) (m)
-------------------------------------------------------------------------------
TOTAL WEIGHT 3078.34 2.869 28.935 -0.004 1923.52 3.494
RUN-OFF WEIGHTS 0.00 0.000 0.000 0.000 0.00 0.000
-------------------------------------------------------------------------------
DAMAGE CONDITION 3078.34 2.869 28.935 -0.004 1923.52 3.494
EQUILIBRIUM NOT FOUND ON STARBOARD
LOADING CASE :
4-Homogenous cargo 98% 1.018t/m3, draught 3.35m - outgoing
-------------------------------------------------------------------------------
WEIGHT KG LCG YG FSM CORR.KG
(t) (m) (m) (m) (t.m) (m)
-------------------------------------------------------------------------------
TOTAL WEIGHT 3078.34 2.869 28.935 -0.004 1923.52 3.494
SUMMARY OF RESULTS OF DAMAGE STABILITY
-------------------------------------------------------------------------------
DAMAGE CASE % R HEEL GM FBmin GZ>0 GZmax Area
(deg) (m) (m) (deg) (m) (m.rad)
-------------------------------------------------------------------------------
1bott: all cargo & void3 100 0 EQUILIBRIUM NOT FOUND
% : Flooding percentage.
R : R=1 if run-off weights considered, R=0 if no run-off.
HEEL : Heel at equilibrium (negative if equilibrium is on port).
GM : GM at equilibrium.
FBmin : Minimum distance of margin line, weathertight or non-weathertight
points from waterline.
GZ>0 : Range of positive GZ limited to immersion of non-weathertight openings.
GZmax : Maximum GZ value.
This is one of many such files; they can differ a bit, but they all come down to tables in textual form. I need to clean up some items from them before pasting them into a report.
So I was wondering: what would be the best way to delete a certain table? For example, SUMMARY OF LOADING (it starts with the line containing "SUMMARY OF LOADING" and ends at the line containing "TOTAL WEIGHT").
How can I match that table and delete it?

Try the following from within Vim:
:g/SUMMARY OF LOADING/,/TOTAL WEIGHT/d
sed works in the same way:
sed '/SUMMARY OF LOADING/,/TOTAL WEIGHT/d' input_with_tables.txt

Fredrik Pihl's solution with :g works well if you need to delete all such tables. For more specific edits, you could use my CountJump plugin to create custom motions and text objects by defining start and end patterns (SUMMARY OF LOADING and TOTAL WEIGHT in your case), and then jump to the next table and delete it with a quick mapping.
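If you would rather do this outside an editor, here is a minimal Python sketch of the same idea: drop every block of lines from a start marker through the following end marker. The file names are placeholders.
start, end = "SUMMARY OF LOADING", "TOTAL WEIGHT"
skipping = False
with open("input_with_tables.txt") as src, open("cleaned.txt", "w") as dst:
    for line in src:
        if not skipping and start in line:
            skipping = True            # entering a table we want to drop
        if not skipping:
            dst.write(line)
        elif end in line:
            skipping = False           # end marker found; resume copying after this line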

Related

manipulating the format of date on X-axis

I have a weekly dataset. I use this code to plot the causality between variables. Stata shows the number of weeks of each year on the X-axis. Is it possible to show only year or year-month instead of year-week on the X-axis?
generate Date =wofd(D)
format Date %tw
tsset Date
tvgc Momentum supply, p(3) d(3) trend window(25) prefix(_) graph
The fact that you have weekly data is only a distraction here.
You should only use Stata's weekly date functions if your weeks satisfy Stata's rules:
Week 1 starts on 1 January, always.
Later weeks start 7 days later in turn, except that week 52 is always 8 or 9 days long.
Hence there is no week 53.
These are documented rules, and they do not match your data. You are lucky that you have no 53-week years in your data; otherwise you would get some bizarre results.
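Just to make the rule concrete, here is a tiny Python sketch of it (illustrative only, not Stata), with the ISO week number shown for comparison:
from datetime import date

def stata_week(d):
    # week 1 starts on 1 January; each later week starts 7 days on;
    # the leftover days at the end of the year are folded into week 52
    doy = d.timetuple().tm_yday
    return min((doy - 1) // 7 + 1, 52)

for d in (date(2011, 1, 6), date(2011, 12, 29), date(2012, 12, 31)):
    print(d, "Stata week:", stata_week(d), "ISO week:", d.isocalendar()[1])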
See the much more detailed discussion in the references turned up by typing search week, sj in Stata.
The good news is that you just need to build on what you have and put labels and ticks on your x axis. It's a little bit of work, but no more than use of the standard, documented label and tick options. The main ideas are blindingly obvious once spelled out:
Labels Put informative labels in the middle of time intervals. Suppress the associated ticks. You can suppress a tick by setting its length to zero or its colour to invisible.
Ticks Put ticks at the ends (equivalently, the beginnings) of time intervals. Lengthen ticks as needed.
Grid lines Lines demarcating years could be worth adding. None are shown here, but the syntax is just an extension of that given.
Axis titles If the time (usually x) axis is adequately explained, that axis title is redundant and even dopey if it is some arbitrary variable name.
See especially https://www.stata-journal.com/article.html?article=gr0030 and https://www.stata-journal.com/article.html?article=gr0079
With your data, showing years is sensible but showing months too is likely to produce crowded detail that is hard to read and not much use. I compromised on quarters.
* Example generated by -dataex-. For more info, type help dataex
clear
input str10 D float(Momentum Supply)
"12/2/2010" -1.235124 4.760894
"12/9/2010" -1.537671 3.002344
"12/16/2010" -.679893 1.5665628
"12/23/2010" 1.964229 .5875537
"12/30/2010" -1.1872853 -1.1315695
"1/6/2011" .028031677 .065580264
"1/13/2011" .4438451 1.2316793
"1/20/2011" -.3865465 1.7899017
"1/27/2011" -.4547117 1.539866
"2/3/2011" 1.6675532 1.352376
"2/10/2011" -.016190516 3.72986
"2/17/2011" .5471755 2.0804555
"2/24/2011" .2695233 2.1094923
"3/3/2011" .5136591 -1.0686383
"3/10/2011" .606721 3.786967
"3/17/2011" .004175631 .4544936
"3/24/2011" 1.198901 -.3316304
"3/31/2011" .1973385 .5846249
"4/7/2011" 2.2470737 1.0026894
"4/14/2011" .3980386 -2.6676855
"4/21/2011" -1.530687 -7.214682
"4/28/2011" -.9735931 3.246654
"5/5/2011" .13312873 .9581707
"5/12/2011" -.8017629 -.468076
"5/19/2011" -.11491735 -4.354526
"5/26/2011" .3627179 -2.233418
"6/2/2011" .13805833 2.2697728
"6/9/2011" .27832976 .58203816
"6/16/2011" -1.9467738 -.2834298
"6/23/2011" -.9579238 -1.0356172
"6/30/2011" 1.1799787 1.1011268
"7/7/2011" -2.0982232 .5292908
"7/14/2011" -.2992591 -.4004747
"7/21/2011" .5904395 -2.5159726
"7/28/2011" -.21626104 1.936029
"8/4/2011" -.02421602 -.8160484
"8/11/2011" 1.5797064 -.6868965
"8/18/2011" 1.495294 -1.8621664
"8/25/2011" -1.2188485 -.8388996
"9/1/2011" .4991612 -1.6689343
"9/8/2011" 2.1691883 1.3244398
"9/15/2011" -1.2074957 .9707839
"9/22/2011" -.3399567 .6742781
"9/29/2011" 1.9860272 -3.331345
"10/6/2011" 1.935733 -.3882593
"10/13/2011" -1.278119 .6796986
"10/20/2011" -1.3209987 .2258049
"10/27/2011" 4.315368 .7879103
"11/3/2011" .58669937 -.5040554
"11/10/2011" 1.460597 -2.0426705
"11/17/2011" -1.338189 -.24199644
"11/24/2011" -1.6870773 -1.1143018
"12/1/2011" -.19232976 -1.2156726
"12/8/2011" -2.655519 -2.054406
"12/15/2011" 1.7161795 -.15301673
"12/22/2011" -1.43026 -3.138013
"12/29/2011" .03427247 -.28446484
"1/5/2012" -.15930523 -3.362428
"1/12/2012" .4222094 4.0962815
"1/19/2012" -.2413332 3.8277814
"1/26/2012" -2.850591 .067359865
"2/2/2012" -1.1785052 -.3558361
"2/9/2012" -1.0380571 .05134211
"2/16/2012" .8539951 -4.421839
"2/23/2012" .2636529 1.3424703
"3/1/2012" .022639304 2.734022
"3/8/2012" .1370547 .8043283
"3/15/2012" .1787796 -.56465846
"3/22/2012" -2.0645525 -2.9066684
"3/29/2012" 1.562931 -.4505192
"4/5/2012" 1.2587242 -.6908772
"4/12/2012" -1.5202224 .7883849
"4/19/2012" 1.0128288 -1.6764873
"4/26/2012" -.29182148 1.920932
"5/3/2012" -1.228097 -3.7068026
"5/10/2012" -.3124508 -3.034149
"5/17/2012" .7570716 -2.3398724
"5/24/2012" -1.0697783 -2.438565
"5/31/2012" 1.2796624 1.299344
"6/7/2012" -1.5482885 -1.228557
"6/14/2012" 1.396692 3.2158935
"6/21/2012" .3116726 8.035475
"6/28/2012" -.22332123 .7450229
"7/5/2012" .4655248 .04986914
"7/12/2012" .4769497 4.045938
"7/19/2012" .08743203 .25987592
"7/26/2012" -.402533 .3213503
"8/2/2012" -.1564897 1.5290447
"8/9/2012" -.0919008 .13955575
"8/16/2012" -1.3851573 1.0860283
"8/23/2012" .020250637 -.8858514
"8/30/2012" -.29458764 -1.6602173
"9/6/2012" -.39921495 -.8043483
"9/13/2012" 1.76396 4.2867813
"9/20/2012" -1.2335806 2.476225
"9/27/2012" .176066 -.5992883
"10/4/2012" .1075483 1.7167135
"10/11/2012" .06365488 1.1636261
"10/18/2012" -.2305842 -1.506699
"10/25/2012" -.1526354 -2.669866
"11/1/2012" -.06311637 -2.0813057
"11/8/2012" .55959195 .8805096
"11/15/2012" 1.5306772 -2.708766
"11/22/2012" -.5585792 .26319882
"11/29/2012" -.035690214 -1.6176193
"12/6/2012" -.7885767 1.1719254
"12/13/2012" .9131169 -1.1135346
"12/20/2012" -.6910864 -.4893669
"12/27/2012" .9836168 .4052487
"1/3/2013" -.8828759 .7161615
"1/10/2013" 1.505474 -.1768004
"1/17/2013" -1.3013282 -1.333739
"1/24/2013" -1.3670077 1.0568022
"1/31/2013" .05846912 -.7845241
"2/7/2013" .4923012 -1.202816
"2/14/2013" -.06551787 -.9198701
"2/21/2013" -1.8149366 -.1746187
"2/28/2013" .3370621 1.0104061
"3/7/2013" 1.2698976 1.273357
"3/14/2013" -.3884514 .7927139
"3/21/2013" -.1437847 1.7798674
"3/28/2013" -.2325031 .9336611
"4/4/2013" .03971701 .6680117
"4/11/2013" -.25990707 -3.0261614
"4/18/2013" .7046488 -.458615
"4/25/2013" -2.1198323 -.14664523
"5/2/2013" 1.591287 -.3687443
"5/9/2013" -1.1266721 -2.0973356
"5/16/2013" -.7595757 -1.1238302
"5/23/2013" 2.2590933 2.124479
"5/30/2013" -.7447268 .7387985
"6/6/2013" 1.3409324 -1.3744274
"6/13/2013" -.3844476 -.8341842
"6/20/2013" -.8135379 -1.7971268
"6/27/2013" -2.506065 -.4194731
"7/4/2013" -.4755843 -5.216218
"7/11/2013" -1.256806 1.8539237
"7/18/2013" -.13328764 -1.0578626
"7/25/2013" 1.2412375 1.7703875
"8/1/2013" 1.5033063 -2.2505422
"8/8/2013" -1.291876 -1.5896243
"8/15/2013" 1.0093634 -2.8861396
"8/22/2013" -.6952878 -.23103845
"8/29/2013" -.05459245 1.53916
"9/5/2013" 1.2413216 .749662
"9/12/2013" .19232245 2.81967
"9/19/2013" -2.6861706 -4.520664
"9/26/2013" .3105677 -5.274343
"10/3/2013" -.2184027 -3.251637
"10/10/2013" -1.233326 -5.031735
"10/17/2013" 1.9415965 -1.250861
"10/24/2013" -1.2008202 -1.5703772
"10/31/2013" -.6394427 -1.1347327
"11/7/2013" 2.715824 2.0324607
"11/14/2013" -1.5833142 2.5080755
"11/21/2013" .9940037 4.117931
"11/28/2013" -.8226601 3.752914
"12/5/2013" .09966203 1.865995
"12/12/2013" -.18744355 2.5426314
end
gen ddate = daily(D, "MDY")
gen year = year(ddate)
gen dow = dow(ddate)
tab year
tab dow
forval y = 2010/2013 {
    local Y = `y' + 1
    local yend `yend' `=mdy(1,1,`Y')'
    if `y' > 2010 local ymid `ymid' `=mdy(7,1,`y')' "`y'"
    forval q = 1/4 {
        if `q' > 4 | `y' > 2010 {
            local qmid : word `q' of 2 5 8 11
            local qmids `qmids' `=mdy(`qmid', 15, `y')' "Q`q'"
            local qend : word `q' of 4 7 10 4
            local qends `qends' `=mdy(`qend', 1, `y')'
        }
    }
}
line M S ddate, xla(`ymid', tlength(*3) tlc(none)) xtic(`yend', tlength(*5)) xmla(`qmids', tlc(none) labsize(small) tlength(*.5)) xmti(`qends', tlength(*5)) xtitle("") scheme(s1color)

Python: How to calculate tf-idf for a large data set

I have the following data frame df, which I converted from an SFrame:
URI name text
0 <http://dbpedia.org/resource/Digby_M... Digby Morrell digby morrell born 10 october 1979 i...
1 <http://dbpedia.org/resource/Alfred_... Alfred J. Lewy alfred j lewy aka sandy lewy graduat...
2 <http://dbpedia.org/resource/Harpdog... Harpdog Brown harpdog brown is a singer and harmon...
3 <http://dbpedia.org/resource/Franz_R... Franz Rottensteiner franz rottensteiner born in waidmann...
4 <http://dbpedia.org/resource/G-Enka> G-Enka henry krvits born 30 december 1974 i...
I have done the following:
from textblob import TextBlob as tb
import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

bloblist = []
for i in range(0, df.shape[0]):
    bloblist.append(tb(df.iloc[i, 2]))

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
But this is taking a lot of time as there are 59000 documents.
Is there a better way to do it?
I'm not an expert on this subject, but I found a few solutions on the internet that use Spark. You can have a look here:
https://www.linkedin.com/pulse/understanding-tf-idf-first-principle-computation-apache-asimadi
On the other hand, I tried the method below and didn't get bad results. Maybe you want to try it (a small Python sketch follows the numbers):
I have a word list. This list contains each word and its count.
I found the average of these word counts.
I selected a lower limit and an upper limit based on the average value
(e.g. lower bound = average / 2 and upper bound = average * 5).
Then I created a new word list keeping only the words between the lower and upper bounds.
With these bounds I got this result:
Before normalization, word vector length: 11880
Mean: 19, lower bound: 9, upper bound: 95
After normalization, word vector length: 1595
The cosine similarity results were also better.
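Here is a minimal Python sketch of that pruning step, reusing the bloblist from the question; the bounds are the ones suggested above, everything else is illustrative:
from collections import Counter

# count how often each word appears across all documents
word_counts = Counter(word for blob in bloblist for word in blob.words)

mean_count = sum(word_counts.values()) / len(word_counts)
lower, upper = mean_count / 2, mean_count * 5   # bounds suggested above

# keep only words whose corpus counts fall inside the bounds
vocabulary = {w for w, c in word_counts.items() if lower <= c <= upper}
print("before:", len(word_counts), "after:", len(vocabulary))
Restricting the TF-IDF loop to words in vocabulary cuts down the number of scores that have to be computed per document.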

test with missing standard errors

How can I conduct a hypothesis test in Stata when my predictor perfectly predicts my dependent variable?
I would like to run the same regression over many subsets of my data. For each regression, I would then like to test the hypothesis that beta_1 = 1/2. However, for some subsets, I have perfect collinearity, and Stata is not able to calculate standard errors.
For example, in the below case,
sysuse auto, clear
gen value = 2*foreign*(price<6165)
gen value2 = 2*foreign*(price>6165)
gen id = 1 + (price<6165)
I get the output
. reg foreign value value2 weight length, noconstant
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 4, 70) = .
Model | 22 4 5.5 Prob > F = .
Residual | 0 70 0 R-squared = 1.0000
-------------+------------------------------ Adj R-squared = 1.0000
Total | 22 74 .297297297 Root MSE = 0
------------------------------------------------------------------------------
foreign | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
value | .5 . . . . .
value2 | .5 . . . . .
weight | 3.54e-19 . . . . .
length | -6.31e-18 . . . . .
------------------------------------------------------------------------------
and
. test value = .5
( 1) value = .5
F( 1, 70) = .
Prob > F = .
In the actual data, there is usually more variation. So I can identify the cases where the predictor does a very good job of predicting the DV--but I miss those cases where prediction is perfect. Is there a way to conduct a hypothesis test that catches these cases?
EDIT:
The end goal would be to classify observations within subsets based on the hypothesis test. If I cannot reject the hypothesis at the 95% confidence level, I classify the observation as type 1. Below, both groups would be classified as type 1, though I only want the second group.
gen type = .
forvalues i = 1/2 {
    quietly: reg foreign value value2 weight length if id == `i', noconstant
    test value = .5
    replace type = 1 if r(p) > .05
}
There is no way to do this out of the box that I'm aware of. Of course you could program it yourself to get an approximation of the p-value in these cases. The standard error is missing here because the relationship between x and y is perfectly collinear. There is no noise in the model, nothing deviates.
Interestingly enough though, the standard error of the estimate is useless in this case anyway. test performs a Wald test for beta_i = exp against beta_i != exp, not a t-test.
The Wald test uses the variance-covariance matrix from the regression. To see this yourself, refer to the Methods and formulas section here and run the following code:
(also, if you remove the -1 from gen mpg2 = and run, you will see the issue)
sysuse auto, clear
gen mpg2 = mpg * 2.5 - 1
qui reg mpg2 mpg, nocons
* collect matrices to calculate Wald statistic
mat b = e(b) // Vector of Coefficients
mat V = e(V) // Var-Cov matrix
mat R = (1) // for use in Rb-r. This does not == [0,1] because
            // of the noconstant option in regress
mat r = (2.5) // Value you want to test for equality
mat W = (R*b-r)'*inv(R*V*R')*(R*b-r)
// This is where it breaks for you, because with perfect collinearity, V == 0
reg mpg2 mpg, nocons
test mpg = 2.5
sca F = r(F)
sca list F
mat list W
Now, as @Brendan Cox suggested, you might be able to simply use the missing value returned in r(p) to condition your replace command, depending on exactly how you are using it. A word of caution on this, however: when the relationship between some x and y is such that y = 2x, and you compare test x = 5 with test x = 2, you will want to be very careful about the interpretation of missing p-values. In both cases the observations are classified as type == 1, whereas the test x = 5 command should not result in that outcome.
Another work-around would be to simply set p = 0 in these cases, since the variance estimate will asymptotically approach 0 as the linear relationship becomes near perfect, and thus the Wald statistic will approach infinity (driving p down, all else equal).
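To see that limiting behaviour numerically, here is a small Python sketch (the numbers are made up for illustration; Stata's test compares the same kind of Wald statistic against an F(1, df) distribution):
from scipy.stats import f

# hypothetical single-coefficient case: estimate b, null value r, residual d.f. df_r
b, r, df_r = 2.0, 2.5, 70
for V in (1e-1, 1e-4, 1e-8):       # variance of the estimate shrinking towards zero
    W = (b - r) ** 2 / V           # scalar version of (Rb - r)' inv(RVR') (Rb - r)
    p = f.sf(W, 1, df_r)           # upper-tail p-value from F(1, df_r)
    print(V, W, p)
As V goes to zero the Wald statistic explodes and the p-value goes to zero, which is the rationale for treating these cases as rejections (i.e., setting p = 0).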
A final yet more complicated work-around in this case could be to calculate the F-statistic manually using the formula in the manual, and setting V to some arbitrary, yet infinitesimally small number. I've included code to do this below, but it is quite a bit more involved than simply issuing the test command, and in truth only an approximation of the actual p-value from the F distribution.
clear *
sysuse auto
gen i = ceil(_n/5)
qui sum i
gen mpg2 = mpg * 2 if i <= 5 // Get different estimation results
replace mpg2 = mpg * 10 if i > 5 // over different subsets of data
gen type = .
local N = _N // use for d.f. calculation later
local iMax = r(max) // use to iterate loop
forvalues i = 1/`iMax' {
    qui reg mpg2 mpg if i == `i', nocons
    mat b`i' = e(b)                    // collect returned results for Wald stat
    mat V`i' = e(V)
    sca cov`i' = V`i'[1,1]
    mat R`i' = (1)
    mat r`i' = (2)                     // value you wish to test against
    if (cov`i' == 0) {                 // set V to be very small if variance = 0 & calculate Wald
        mat V`i' = 1.0e-14
    }
    mat W`i' = (R`i'*b`i'-r`i')'*inv(R`i'*V`i'*R`i'')*(R`i'*b`i'-r`i')
    sca W`i' = W`i'[1,1]               // collect Wald statistic into scalar
    sca p`i' = Ftail(1,`N'-2, W`i')    // pull p-value from F dist
    if p`i' > .05 {
        replace type = 1 if i == `i'
    }
}
Also note that this workaround will become slightly more involved if you want to test multiple coefficients.
I'm not sure I'd advise these approaches without a word of caution: you are, in a very real sense, "making up" variance estimates. But without a variance estimate you won't be able to test the coefficients at all.

Rolling Standard Deviation

I use Stata to estimate the rolling standard deviation of ROA (using a window of 4 previous years). Now I would like to keep only those rolling standard deviations that were computed from at least 3 (out of 4) non-missing ROA observations. How can I do this in Stata?
ROA roa_sd
. .
. .
. .
.0108869 .
.0033411 .
.0032814 .0053356 (this value should be missing, as it was calculated from only 2 valid values)
.0030827 .0043739
.0029793 .0038275
Your question is answered in the blog post I link to above in the comments. You can use rolling and then add an additional screen to discard sigma when the number of observations doesn't meet your threshold.
But for simple calculations like sigma and beta (i.e., standard deviation and univariate regression coefficient) you can do much better with a more manual approach. Compare the rolling solution with my manual solution below.
/* generate panel by adpating the linked code */
clear
set obs 20000
gen date = _n
gen id = floor((_n - 1) / 20) + 1
gen roa = int((100) * runiform())
replace roa = . in 1/4
replace roa = . in 10/12
replace roa = . in 18/20
/* solution with rolling */
/* http://statadaily.wordpress.com/2014/03/31/rolling-standard-deviations-and-missing-observations/ */
timer on 1
xtset id date
rolling sd2 = r(sd), window(4) keep(date) saving(f2, replace): sum roa
merge 1:1 date using f2, nogenerate keepusing(sd2)
xtset id date
gen tag = missing(l3.roa) + missing(l2.roa) + missing(l1.roa) + missing(roa) > 1
gen sd = sd2 if (tag == 0)
timer off 1
/* my solution */
timer on 2
rolling_sd roa, window(4) minimum(3)
timer off 2
/* compare */
timer list
list in 1/50
The timings show that the manual solution is much faster.
. /* compare */
. timer list
1: 132.38 / 1 = 132.3830
2: 0.10 / 1 = 0.0990
Save the following as rolling_sd.ado in your personal ado-file directory (or in your current working directory). I'm sure that someone could further streamline this code. Note that this code has the additional advantage of meeting the minimum data requirement at the front edge of the window (i.e., it calculates sigma from the first three observations rather than waiting for all four).
*! 0.2 Richard Herron 3/30/14
* added minimum data requirement
*! 0.1 Richard Herron 1/12/12
program rolling_sd
    version 11.2
    syntax varlist(numeric), window(int) minimum(int)
    * get dependent and independent vars from varlist
    tempvar n miss xs x2s nonmiss1 nonmiss2 sigma1 sigma2
    local w = `window'
    local m = `minimum'
    * generate cumulative sums and missing values
    xtset
    bysort `r(panelvar)' (`timevar'): generate `n' = _n
    by `r(panelvar)': generate `miss' = sum(missing(`varlist'))
    by `r(panelvar)': generate `xs' = sum(`varlist')
    by `r(panelvar)': generate `x2s' = sum(`varlist' * `varlist')
    * generate variance 1 (front of window)
    generate `nonmiss1' = `n' - `miss'
    generate `sigma1' = sqrt((`x2s' - `xs'*`xs'/`nonmiss1')/(`nonmiss1' - 1)) if inrange(`nonmiss1', `m', `w') & !missing(`nonmiss1')
    * generate variance 2 (back of window, main part)
    generate `nonmiss2' = `w' - s`w'.`miss'
    generate `sigma2' = sqrt((s`w'.`x2s' - s`w'.`xs'*s`w'.`xs'/`nonmiss2')/(`nonmiss2' - 1)) if inrange(`nonmiss2', `m', `w') & !missing(`nonmiss2')
    * return standard deviation
    egen sigma = rowfirst(`sigma2' `sigma1')
end
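For a quick cross-check outside Stata: pandas' rolling window expresses the same minimum-observation rule with min_periods. This is only a sketch with a made-up stand-in panel, not part of the Stata solution:
import numpy as np
import pandas as pd

# small stand-in panel; in practice this would be the simulated data from above
df = pd.DataFrame({
    "id":   [1] * 8,
    "date": range(1, 9),
    "roa":  [np.nan, np.nan, 0.0109, 0.0033, 0.0033, 0.0031, 0.0030, 0.0029],
})
df["roa_sd"] = (
    df.groupby("id")["roa"]
      .rolling(window=4, min_periods=3)   # require at least 3 of the 4 observations
      .std()
      .reset_index(level=0, drop=True)    # realign with the original row index
)
print(df)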

Regular Expression Split on nᵗʰ occurrence

I have a string stored in Hive, and I want to split the text on the 4th occurrence of , (or any other character).
I would really appreciate it if someone could give me a hint about the regular expression to do this.
The text is below:
The Band,The Band,,Up On Cripple Creek (2000 Digital Remaster),2000,Greatest Hits,The Band,,The Weight (2000 Digital Remaster),2003,Rhythm Of The Rain,The Cascades,,Rhythm Of The Rain (LP Version),2005,Chronicle Volume One,Creedence Clearwater Revival,,Who'll Stop the Rain,1976,The Complete Sun Singles, vol. 1,Johnny Cash,,I Walk the Line,2001,Greatest Hits,Bob Seger,,Against The Wind,1980,Their Greatest Hits,The Eagles,,Lyin' Eyes,1975,Johnny Horton's Greatest Hits,Johnny Horton,,North To Alaska,1987,Super Hits,Marty Robbins,,You Gave Me A Mountain,1969,Greatest Hits,Bob Seger,,Night Moves,1976,Hello Darlin' 15 #1 Hits,Conway Twitty,,It's Only Make Believe,2003,Anthology,Kenny Rogers & The First Edition,,Ruby, Don't Take Your Love To Town,1996,Greatest Hits,Neil Young,,Old Man,2004,Harvest,Neil Young,,Heart Of Gold,2009,The Very Best Of,The Springfields,,Silver Threads And Golden Needles,2011,The Best Of The Statler Brothers,The Statler Brothers,,Susan When She Tried,1987,The Definitive Collection,The Statler Brothers,,The Class Of '57,2005,The Definitive Collection,The Statler Brothers,,I'll Go To My Grave Loving You,2005,Greatest Hits: 1974-1978,Steve Miller Band,,The Joker,2006,Greatest Hits: 1974-1978,Steve Miller Band,,Rock'n Me,2006,Early Girl 7" Hits,Gale Garnett,,We'll Sing In The Sunshine,2010,King of the Road,Various Artists,,I Can't Stop Loving You - Don Gibson,2004,America's Troubador,Willie Nelson,,Angel Flying To Close To The Ground,2005,Their Greatest Hits,The Eagles,,Take It To The Limit,1975,Their Greatest Hits,The Eagles,,Desperado,1973,Highwayman,The Highwaymen,,Desperados Waiting For A Train,1985,Super Hits,Marty Robbins,,My Woman, My Woman, My Wife,1970,Super Hits,Marty Robbins,,Some Memories Just Won't Die,1982,Highwayman,The Highwaymen,,Committed To Parkview,1985,Greatest Hits - Roy Clark,Roy Clark,,Yesterday When I Was Young,1995,Greatest Hits - Roy Clark,Roy Clark,,I Never Picked Cotton,1995,Simon & Garfunkel's Greatest Hits,Simon & Garfunkel,,Bridge Over Troubled Water [Live],1970,Collection,The Oak Ridge Boys,,Y'all Come Back Saloon,1977,Super Hits,Vern Gosdin,,Chiseled In Stone,1987,Super Hits,Vern Gosdin,,Who You Gonna Blame It On This Time,1987,The Very Best Of John Denver [Disc 2],John Denver,,Rocky Mountain High,1972,The Very Best Of John Denver [Disc 2],John Denver,,Take Me Home, Country Roads,1971,Souvenirs,Vince Gill,,Never Knew Lonely,1995,Souvenirs,Vince Gill,,When I Call Your Name,1995,Souvenirs,Vince Gill,,Pocket Full Of Gold,1995,Greatest Hits - Waylon Jennings,Waylon Jennings,,Bob Wills Is Still King,2000,Greatest Hits - Waylon Jennings,Waylon Jennings,,Just To Satisfy You,2000
The Regex...
/ # Start regex
( # Start group
[^,]*, # not `,` - zero or more times (*) followed by `,`
){4} # repeat group four times
/g # match globally (using String.match)
All together...
console.log( str.match(/([^,]*,){4}/g) );
^([^,]*,[^,]*,[^,]*,[^,]*,)(.*)$
should give you two strings, 1) the part up to and including the fourth comma, and 2) the rest of the string.
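As a quick illustration of that two-group pattern outside Hive, here is a minimal Python sketch; the short sample string stands in for the long text above:
import re

s = "a,b,c,d,e,f,g"   # placeholder for the long string in the question
m = re.match(r'^([^,]*,[^,]*,[^,]*,[^,]*,)(.*)$', s, re.S)
head, tail = m.group(1), m.group(2)   # "a,b,c,d," and "e,f,g"
print(head)
print(tail)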
Split by comma, and join again in a simple loop...
var arr, i, out, str;
str = "The Band,The Band,,Up On Cripple Creek (2000 Digital Remaster),2000,Greatest Hits,The Band,,The Weight (2000 Digital Remaster),2003,Rhythm Of The Rain,The Cascades,,Rhythm Of The Rain (LP Version),2005,Chronicle Volume One,Creedence Clearwater Revival,,Who'll Stop the Rain,1976,The Complete Sun Singles, vol. 1,Johnny Cash,,I Walk the Line,2001,Greatest Hits,Bob Seger,,Against The Wind,1980,Their Greatest Hits,The Eagles,,Lyin' Eyes,1975,Johnny Horton's Greatest Hits,Johnny Horton,,North To Alaska,1987,Super Hits,Marty Robbins,,You Gave Me A Mountain,1969,Greatest Hits,Bob Seger,,Night Moves,1976,Hello Darlin' 15 #1 Hits,Conway Twitty,,It's Only Make Believe,2003,Anthology,Kenny Rogers & The First Edition,,Ruby, Don't Take Your Love To Town,1996,Greatest Hits,Neil Young,,Old Man,2004,Harvest,Neil Young,,Heart Of Gold,2009,The Very Best Of,The Springfields,,Silver Threads And Golden Needles,2011,The Best Of The Statler Brothers,The Statler Brothers,,Susan When She Tried,1987,The Definitive Collection,The Statler Brothers,,The Class Of '57,2005,The Definitive Collection,The Statler Brothers,,I'll Go To My Grave Loving You,2005,Greatest Hits: 1974-1978,Steve Miller Band,,The Joker,2006,Greatest Hits: 1974-1978,Steve Miller Band,,Rock'n Me,2006,Early Girl 7\" Hits,Gale Garnett,,We'll Sing In The Sunshine,2010,King of the Road,Various Artists,,I Can't Stop Loving You - Don Gibson,2004,America's Troubador,Willie Nelson,,Angel Flying To Close To The Ground,2005,Their Greatest Hits,The Eagles,,Take It To The Limit,1975,Their Greatest Hits,The Eagles,,Desperado,1973,Highwayman,The Highwaymen,,Desperados Waiting For A Train,1985,Super Hits,Marty Robbins,,My Woman, My Woman, My Wife,1970,Super Hits,Marty Robbins,,Some Memories Just Won't Die,1982,Highwayman,The Highwaymen,,Committed To Parkview,1985,Greatest Hits - Roy Clark,Roy Clark,,Yesterday When I Was Young,1995,Greatest Hits - Roy Clark,Roy Clark,,I Never Picked Cotton,1995,Simon & Garfunkel's Greatest Hits,Simon & Garfunkel,,Bridge Over Troubled Water [Live],1970,Collection,The Oak Ridge Boys,,Y'all Come Back Saloon,1977,Super Hits,Vern Gosdin,,Chiseled In Stone,1987,Super Hits,Vern Gosdin,,Who You Gonna Blame It On This Time,1987,The Very Best Of John Denver [Disc 2],John Denver,,Rocky Mountain High,1972,The Very Best Of John Denver [Disc 2],John Denver,,Take Me Home, Country Roads,1971,Souvenirs,Vince Gill,,Never Knew Lonely,1995,Souvenirs,Vince Gill,,When I Call Your Name,1995,Souvenirs,Vince Gill,,Pocket Full Of Gold,1995,Greatest Hits - Waylon Jennings,Waylon Jennings,,Bob Wills Is Still King,2000,Greatest Hits - Waylon Jennings,Waylon Jennings,,Just To Satisfy You,2000";
arr = str.split(/,/);
out = [];
i = 0;
while (i + 3 < arr.length) {
    out.push(arr[i] + ", " + arr[i + 1] + ", " + arr[i + 2] + ", " + arr[i + 3]);
    i += 4;
}
console.log(out);