AWK for-loop with break statement - if-statement

Today I am working on a problem correcting data errors in files that have a few unknowns. The unknowns are the number of fields in each file, and which fields and records have the string "---".
An example of the data is:
1 2 1 39.6406 1 38.8512 1 38.3479 1 37.9744
2 1 4 39.1527 3 38.7329 2 38.3479 2 37.9744
3 3 3 39.5186 2 38.8512 3 38.2079 3 37.6385
4 4 2 39.6406 4 38.4964 --- 37.7414 --- 36.7149
5 5 --- 40.2504 --- 39.0286 --- 38.4879 --- 38.1004
The desired output is:
1 2 1 39.6406 1 38.8512 1 38.3479 1 37.9744
2 1 4 39.1527 3 38.7329 2 38.3479 2 37.9744
3 3 3 39.5186 2 38.8512 3 38.2079 3 37.6385
4 4 2 39.6406 4 38.4964 --- --- --- ---
5 5 --- --- --- --- --- --- --- ---
I have tried using for-loops, such as:
awk '{for (i = NF; i >= 1; i--){if ($i=="---")$(i-1)="---"}{print $0}}' file
which resulted in:
1 2 1 39.6406 1 38.8512 1 38.3479 1 37.9744
2 1 4 39.1527 3 38.7329 2 38.3479 2 37.9744
3 3 3 39.5186 2 38.8512 3 38.2079 3 37.6385
---
---
and I also tried:
awk '{for (i=1;i<=NF;i++){if ($i=="---")$(i+1)="---"}{print $0}}' file
which resulted in the error:
"awk: program limit exceeded: maximum number of fields size=32767"
FILENAME="file" FNR=4 NR=4
1 2 1 39.6406 1 38.8512 1 38.3479 1 37.9744
2 1 4 39.1527 3 38.7329 2 38.3479 2 37.9744
3 3 3 39.5186 2 38.8512 3 38.2079 3 37.6385
In my first attempt, the for-loop ran all the way down to the first field, and in the second attempt, the records containing the string went into an infinite loop.
My gut feeling is that I need a break statement, yet after many hours of searching I can't find an example that has helped me. I know there is more than one way to skin a cat, so if you know a better way to accomplish my goal (keeping in mind that there are multiple files with different field counts), or if you can provide an example of a break statement with one of my for-loops, I, and others looking for an example, will be extremely grateful.
Thank you

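This should work. It starts at field 3 and steps by two, so a field that has just been overwritten with "---" is never tested again, which is what made your loops cascade: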
$ awk '{for(i=3;i<NF;i+=2) if($i=="---") $(i+1)=$i}1' file |
column -t
1 2 1 39.6406 1 38.8512 1 38.3479 1 37.9744
2 1 4 39.1527 3 38.7329 2 38.3479 2 37.9744
3 3 3 39.5186 2 38.8512 3 38.2079 3 37.6385
4 4 2 39.6406 4 38.4964 --- --- --- ---
5 5 --- --- --- --- --- --- --- ---
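For readers who want to see the same idea outside awk, here is a minimal Python sketch of the loop above. It assumes, as the awk one-liner does, that "---" can only appear in the odd-numbered fields from field 3 onward and that fields are whitespace-separated. The function name fix_record is made up for this illustration.

```python
def fix_record(line: str) -> str:
    """Copy "---" from each odd-numbered field (3, 5, ...) into the field after it."""
    fields = line.split()
    # awk fields are 1-based, so field 3 is index 2 here; step by 2 so that a
    # value we have just overwritten is never tested on a later iteration.
    for i in range(2, len(fields) - 1, 2):
        if fields[i] == "---":
            fields[i + 1] = "---"
    return " ".join(fields)


if __name__ == "__main__":
    print(fix_record("4 4 2 39.6406 4 38.4964 --- 37.7414 --- 36.7149"))
```

Because the loop bound comes from the number of fields in each line, files with different field counts need no changes.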

All you need is a simple substitution, which makes this an ideal job for sed:
$ sed -E 's/(-+ +)[^ ]+/\1\1 /g' file
1 2 1 39.6406 1 38.8512 1 38.3479 1 37.9744
2 1 4 39.1527 3 38.7329 2 38.3479 2 37.9744
3 3 3 39.5186 2 38.8512 3 38.2079 3 37.6385
4 4 2 39.6406 4 38.4964 --- --- --- ---
5 5 --- --- --- --- --- --- --- ---
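The substitution idea can also be sketched in Python with re.sub. This is not a translation of the sed command: it assumes the placeholder is exactly "---" and replaces the token that follows it with a literal "---", leaving the original whitespace untouched. The name blank_after_dashes is hypothetical.

```python
import re

# A whole "---" token, the whitespace after it (captured), then the next token.
_PAIR = re.compile(r"(?<!\S)(---\s+)\S+")


def blank_after_dashes(line: str) -> str:
    """Replace the token following each "---" with "---"."""
    return _PAIR.sub(r"\1---", line)


if __name__ == "__main__":
    print(blank_after_dashes("4 4 2 39.6406 4 38.4964 --- 37.7414 --- 36.7149"))
```

The lookbehind keeps the pattern from firing inside longer tokens, so a negative number such as -1.5 would be left alone.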

Related

SAS code - sum of last N rows for every row

I have a dataset like this for each ID;
Months      ID  Number
2018-07-01  1   0
2018-08-01  1   0
2018-09-01  1   1
2018-10-01  1   3
2018-11-01  1   1
2018-12-01  1   2
2019-01-01  1   0
2019-02-01  1   0
2019-03-01  1   1
2019-04-01  1   0
2019-05-01  1   0
2019-06-01  1   0
2019-07-01  1   1
2019-08-01  1   0
2019-09-01  1   0
2019-10-01  1   2
2019-11-01  1   0
2019-12-01  1   0
2020-01-01  1   0
2020-02-01  1   0
2020-03-01  1   0
2020-04-01  1   0
2020-05-01  1   0
2020-06-01  1   0
2020-07-01  1   0
2020-08-01  1   1
2020-09-01  1   0
2020-10-01  1   0
2020-11-01  1   1
2020-12-01  1   0
2021-01-01  1   0
2021-02-01  1   1
2021-03-01  1   1
2021-04-01  1   0
2018-07-01  2   0
.......  .......  .......
(Similar values for each ID)
I want a dataset like this;
Months      ID  Number  Sum_Next_6Number
2018-07-01  1   0       7
2018-08-01  1   0       7
2018-09-01  1   1       7
2018-10-01  1   3       4
2018-11-01  1   1       3
2018-12-01  1   2       1
2019-01-01  1   0       2
2019-02-01  1   0       2
2019-03-01  1   1       1
2019-04-01  1   0       3
2019-05-01  1   0       3
2019-06-01  1   0       3
2019-07-01  1   1       2
2019-08-01  1   0       2
2019-09-01  1   0       2
2019-10-01  1   2       0
2019-11-01  1   0       0
2019-12-01  1   0       0
2020-01-01  1   0       0
2020-02-01  1   0       1
2020-03-01  1   0       1
2020-04-01  1   0       1
2020-05-01  1   0       2
2020-06-01  1   0       2
2020-07-01  1   0       2
2020-08-01  1   1       2
2020-09-01  1   0       3
2020-10-01  1   0       3
2020-11-01  1   1       Nan
2020-12-01  1   0       Nan
2021-01-01  1   0       Nan
2021-02-01  1   1       Nan
2021-03-01  1   1       Nan
2021-04-01  1   0       Nan
2018-07-01  2   0       0
.......  .......  .......  .......
If fewer than 6 months remain for an ID, the value should be NaN.
Is there a way to do this? Thank you in advance.
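data want(drop = i n);
   /* c = current observation number, nobs = total number of observations */
   set have curobs = c nobs = nobs;
   Sum_Next_6Numbers = 0;
   /* look ahead at the next 6 observations by direct access */
   do p = c + 1 to 6 + c;
      /* fewer than 6 rows left in the data: result is missing */
      if p > nobs then do;
         Sum_Next_6Numbers = .; leave;
      end;
      set have(keep = Number ID rename = (Number = n id = i)) point = p;
      /* the look-ahead row belongs to another ID: result is missing */
      if id ne i then do;
         Sum_Next_6Numbers = .; leave;
      end;
      Sum_Next_6Numbers + n;
   end;
run;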
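To make the look-ahead logic easier to follow, here is a rough pure-Python sketch of the same idea. It assumes the rows are already sorted by ID and month and are given as (id, number) tuples. sum_next_n is a name invented for this example, and None stands in for the missing value.

```python
from typing import List, Optional, Tuple


def sum_next_n(rows: List[Tuple[int, int]], n: int = 6) -> List[Optional[int]]:
    """For each row, sum the numbers of the next n rows with the same id."""
    result: List[Optional[int]] = []
    for c, (cur_id, _) in enumerate(rows):
        total: Optional[int] = 0
        for p in range(c + 1, c + n + 1):
            # Stop as soon as we run off the data or into another id:
            # there are not n rows left, so the result is missing.
            if p >= len(rows) or rows[p][0] != cur_id:
                total = None
                break
            total += rows[p][1]
        result.append(total)
    return result


if __name__ == "__main__":
    demo = [(1, x) for x in [0, 0, 1, 3, 1, 2, 0, 0]] + [(2, 0)]
    print(sum_next_n(demo))
```

Like the data step, this reads up to n rows ahead for every row, which is fine for monthly data of this size.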

Adding observations between rows

I would like to create new observations as follows:
A B C
1 1 1
1 2 2
1 3 4
1 4 5
1 5 2
2 1 1
2 2 5
2 3 3
2 4 3
*3* 1 .
*3* 2 .
*3* 3 .
*3* 4 .
*3* 5 .
4 1 4
4 2 3
4 3 1
The new lines are indicated by asterisks.
How can I create new observations for variable A and B?
This is a simple expand:
clear
input A B C
1 1 1
1 2 2
1 3 4
1 4 5
1 5 2
2 1 1
2 2 5
2 3 3
2 4 3
4 1 4
4 2 3
4 3 1
end
generate id = _n
expand 6 if id == 10
replace id = 11 if _n == _N
replace A = 3 if id == 10
replace C = . if id == 10
bysort id: replace B = cond(_n == 1, 1, B[_n-1]+1) if id == 10
Which will produce the desired output:
list, sepby(A)
+----------------+
| A B C id |
|----------------|
1. | 1 1 1 1 |
2. | 1 2 2 2 |
3. | 1 3 4 3 |
4. | 1 4 5 4 |
5. | 1 5 2 5 |
|----------------|
6. | 2 1 1 6 |
7. | 2 2 5 7 |
8. | 2 3 3 8 |
9. | 2 4 3 9 |
|----------------|
10. | 3 1 . 10 |
11. | 3 2 . 10 |
12. | 3 3 . 10 |
13. | 3 4 . 10 |
14. | 3 5 . 10 |
|----------------|
15. | 4 1 4 11 |
16. | 4 2 3 11 |
17. | 4 3 1 12 |
+----------------+
The code could be shorter.
expand 2 if _n < 6
replace A = 3 if _n > _N - 5
*replace B = _n + 5 - _N if A == 3
replace C = . if A == 3
sort A B
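For comparison, a small Python sketch that produces the same rows, assuming they are (A, B, C) tuples and that the missing group (A = 3) and its B values (1 to 5) are known in advance, as in the question. None plays the role of Stata's missing value, and add_group is a hypothetical helper.

```python
def add_group(rows, a, b_values):
    """Return rows plus one (a, b, None) row per b in b_values, sorted by A then B."""
    new_rows = [(a, b, None) for b in b_values]
    # Sort on (A, B) only: C may be None, which cannot be compared with ints.
    return sorted(list(rows) + new_rows, key=lambda r: (r[0], r[1]))


if __name__ == "__main__":
    data = [(2, 3, 3), (2, 4, 3), (4, 1, 4)]
    for row in add_group(data, 3, range(1, 6)):
        print(row)
```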

Selecting unique cases

I have a dataset work.test1 that consists of 4 variables: hhid (household id), pid (person id), pidlink (combination of hhid and pid) and bin (positive or negative).
Example data looks like this:
obs hhid pid pidlink bin
1 10600 1 1060001 1
2 10600 1 1060001 1
3 10800 1 1080001 1
4 10800 1 1080001 1
5 10800 2 1080002 1
6 10800 2 1080002 2
7 12200 1 1220001 1
8 12200 1 1220001 2
Now I want to create a dataset work.test2 that contains exactly one row per hhid: a row with bin=2 if the household has one, otherwise a row with bin=1. If a household has more than one bin=2 row, I would choose the first one; if it has no bin=2 row but more than one bin=1 row, I would again choose the first one. The resulting dataset should have a single entry per household.
The resulting output should look like this:
obs hhid pid pidlink bin
1 10600 1 1060001 1
2 10800 2 1080002 2
3 12200 1 1220001 2
Thank you
Based on the data and output shown, a GROUP BY with the MAX function should work and gives me the desired result.
data have(drop =obs);
input obs hhid pid pidlink bin;
datalines;
1 10600 1 1060001 1
2 10600 1 1060001 1
3 10800 1 1080001 1
4 10800 1 1080001 1
5 10800 2 1080002 1
6 10800 2 1080002 2
7 12200 1 1220001 1
8 12200 1 1220001 2
;
proc sql;
select hhid, max(pid) as pid, max(pidlink) as pidlink, max(bin) as bin
from have
group by 1;
If you have more columns it gets a little trickier. You can still do it, but you need an additional selection rule; otherwise you will get more than one record for some households. See the query below:
data have(drop =obs);
input obs hhid pid pidlink bin anotherval1 anotherval2 $;
datalines;
1 10600 1 1060001 1 7 A
2 10600 1 1060001 1 8 B
3 10800 1 1080001 1 6 C
4 10800 1 1080001 1 8 D
5 10800 2 1080002 1 8 E
6 10800 2 1080002 2 9 F
7 12200 1 1220001 1 10 G
8 12200 1 1220001 2 7 H
;
proc sql;
select * from have
group by 1
having pid= max(pid)
and pidlink = max(pidlink)
and bin = max(bin) ;
If you want only distinct records when there are additional columns, then:
data have1;
set have;
val =_n_;
run;
proc sql;
create table have2(drop =val) as
select * from
(select * from have1
group by 1
having pid= max(pid)
and pidlink = max(pidlink)
and bin = max(bin))a
group by hhid, pid, pidlink, bin
having val=min(val);
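Note that the question asks for the first matching row per household rather than the maximum of each column. A minimal Python sketch of that rule, assuming the rows arrive in their original order as (hhid, pid, pidlink, bin) tuples and that bin is only ever 1 or 2. pick_per_household is a made-up name.

```python
def pick_per_household(rows):
    """Keep one row per hhid: the first bin=2 row if any, else the first row."""
    chosen = {}  # hhid -> selected row; dicts keep insertion order
    for row in rows:
        hhid, bin_ = row[0], row[3]
        current = chosen.get(hhid)
        # Keep the first row seen, but let the first bin=2 row replace a bin=1 row.
        if current is None or (bin_ == 2 and current[3] != 2):
            chosen[hhid] = row
    return list(chosen.values())


if __name__ == "__main__":
    data = [
        (10600, 1, 1060001, 1), (10600, 1, 1060001, 1),
        (10800, 1, 1080001, 1), (10800, 1, 1080001, 1),
        (10800, 2, 1080002, 1), (10800, 2, 1080002, 2),
        (12200, 1, 1220001, 1), (12200, 1, 1220001, 2),
    ]
    for row in pick_per_household(data):
        print(row)
```

Because whole rows are kept, any extra columns travel along with the selected row and no tie-breaking pass is needed.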

Conditional join Pandas.Dataframe

I am trying to partially join two DataFrames:
import pandas
import numpy
entry1 = pandas.Timestamp(2014, 6, 1)
entry2 = pandas.Timestamp(2014, 6, 2)
df1=pandas.DataFrame(numpy.array([[1,1],[2,2],[3,3],[3,3]]), columns=['zick','zack'], index=[entry1, entry1, entry2, entry2])
df2=pandas.DataFrame(numpy.array([[2,3],[3,3]]), columns=['eins','zwei'], index=[entry1, entry2])
I tried
df1 = df1[(df1['zick']>= 2) & (df1['zick'] < 4)].join(df2['eins'])
but this doesn't work. After joining values of df1['eins'] are expected to be [NaN,2,3,3].
How to do it? I'd like to it inplace without df copies.
I think this is what you actually meant to use:
import numpy as np
df1 = df1.join(df2['eins'])
mask = (df1['zick']>= 2) & (df1['zick'] < 4)
df1.loc[~mask, 'eins'] = np.nan
df1
yielding:
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
The issue was that you were joining the filtered DataFrame rather than the original one, so there was no place for NaN to appear (every remaining row satisfied your filter).
EDIT:
Considering new inputs in the comments below, here is another approach.
Create an empty column that will need to be updated with values from second dataframe:
df1['eins'] = np.nan
print(df1)
print(df2)
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 NaN
2014-06-02 3 3 NaN
2014-06-02 3 3 NaN
eins zwei
2014-06-01 2 3
2014-06-02 3 3
Define the filter and set 'eins' to 0 in the rows that satisfy it and are still NaN:
mask = (df1['zick']>= 2) & (df1['zick'] < 4)
df1.loc[(mask & (df1['eins'].isnull())), 'eins'] = 0
print(df1)
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 0
2014-06-02 3 3 0
2014-06-02 3 3 0
Update df1 in place with the df2 values (only values equal to 0 will be updated):
df1.update(df2, filter_func=lambda x: x == 0)
print(df1)
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
Now if you want to change the filter and do the update again it will not change previously updated values:
mask = (df1['zick']>= 1) & (df1['zick'] == 1)
df1.loc[(mask & (df1['eins'].isnull())), 'eins'] = 0
print(df1)
zick zack eins
2014-06-01 1 1 0
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
df1.update(df2, filter_func=lambda x: x == 0)
print(df1)
zick zack eins
2014-06-01 1 1 2
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3

Create consecutive ID based on non-consecutive ID in Stata

Given the following variables (id and partner) in Stata, I would like to create a new variable (pid) that is simply the consecutive partner counter within id (as you can see, partner is not consecutive). Here is a MWE:
clear
input id partner pid
1 1 1
1 1 1
1 3 2
1 3 2
2 2 1
2 3 2
2 3 2
2 3 2
2 5 3
2 5 3
end
// create example data
clear
input id partner
1 1
1 1
1 3
1 3
2 2
2 3
2 3
2 3
2 5
2 5
end
// create pid
bysort id partner : gen pid = _n == 1
by id : replace pid = sum(pid)
// admire the result
list, sepby(id)
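The two Stata lines flag the first row of each (id, partner) group and then take a running sum of the flags within id. A minimal Python sketch of that logic, assuming the rows are already sorted by id and partner and are given as (id, partner) tuples. consecutive_pid is a hypothetical name.

```python
def consecutive_pid(rows):
    """Number the distinct partners within each id as 1, 2, 3, ..."""
    pids = []
    prev = None      # previous (id, partner) pair
    counter = 0
    for id_, partner in rows:
        if prev is None or id_ != prev[0]:
            counter = 1          # new id: restart the counter
        elif partner != prev[1]:
            counter += 1         # same id, new partner
        pids.append(counter)
        prev = (id_, partner)
    return pids


if __name__ == "__main__":
    data = [(1, 1), (1, 1), (1, 3), (1, 3), (2, 2),
            (2, 3), (2, 3), (2, 3), (2, 5), (2, 5)]
    print(consecutive_pid(data))
```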