Using awk in if statements

I have a data file that looks like this:
1 . 0 10109 AA AA
1 . 0 10123 C CCCT
1 . 0 10133 A AAC
1 . 0 10134 A ACAAC
1 . 0 10140 A ACCCTAAC
1 . 0 10143 C CTACT
1 rs144773400 0 10144 T TA
1 . 0 10146 AC A
1 . 0 10147 G C
In the instance of "." in the second column, I would like to replace it with a merged output of columns 1 and 4, like this:
1 1:10109 0 10109 AA AA
1 1:10123 0 10123 C CCCT
1 1:10133 0 10133 A AAC
1 1:10134 0 10134 A ACAAC
1 1:10140 0 10140 A ACCCTAAC
1 1:10143 0 10143 C CTACT
1 rs144773400 0 10144 T TA
1 1:10146 0 10146 AC A
1 1:10147 0 10147 G C
I've been attempting to do this with an if/then statement... but I know I have the syntax wrong, I'm just not sure how wrong.
if [$2 -eq "." /data/pathtofile]
then
awk '{print $1 ":" $4}'
else
awk '{print $2}' >> "/data/cleanfile"
fi
What am I missing?

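You can do this in awk itself; no shell if statement is needed. The pattern $2=="." selects the lines whose second field is a dot, the action rewrites that field, {$1=$1} forces awk to rebuild the line with the output separator, and the trailing 1 prints every line. If the file is tab-delimited:
awk -v FS="\t" -v OFS="\t" '$2=="."{$2=$1":"$4}{$1=$1}1' file
Or, if the fields are separated by spaces (awk's default field splitting):
$ awk '$2=="."{$2=$1":"$4}{$1=$1}1' file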
1 1:10109 0 10109 AA AA
1 1:10123 0 10123 C CCCT
1 1:10133 0 10133 A AAC
1 1:10134 0 10134 A ACAAC
1 1:10140 0 10140 A ACCCTAAC
1 1:10143 0 10143 C CTACT
1 rs144773400 0 10144 T TA
1 1:10146 0 10146 AC A
1 1:10147 0 10147 G C
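
For readers who want to see the per-line rule spelled out, here is a minimal Python sketch of the same logic. It is only an illustration and assumes whitespace-separated fields; the function name fill_missing_id is hypothetical and not part of the answer above.

```python
def fill_missing_id(line):
    """Replace a '.' in column 2 with '<col1>:<col4>'; leave other lines as is."""
    fields = line.split()  # assumption: fields are separated by whitespace
    if len(fields) >= 4 and fields[1] == ".":
        fields[1] = fields[0] + ":" + fields[3]
    return " ".join(fields)


sample = [
    "1 . 0 10109 AA AA",
    "1 rs144773400 0 10144 T TA",
]
for row in sample:
    print(fill_missing_id(row))
```

Like the awk one-liner, the test and the rewrite happen once per line, which is why a single shell if around the whole file cannot work.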

Related

Aggregate dummy variables to multiple categorical variables

I have 8 dummy variables (0/1) that have to be aggregated into one categorical variable with 8 categories. Normally, each person should have marked just one of the 8 dummy variables, but some marked several.
When a person has marked two items, the first value should go into the first categorical variable and the second value into a second categorical variable. When three items are marked, the third value should go into a third categorical variable (three is the maximum).
I know how to aggregate the dummies into a categorical variable, but I do not know how to distribute the values across different variables based on the number of marked dummies.
If the problem is not clear, please tell me. It was difficult for me to describe it properly.
Edit:
My approach is the following:
local MCM_zahl4 F0801 F0802 F0803 F0804 F0805 F0806 F0807 F0808
gen MCM_zaehl_4 = 0
foreach var of varlist `MCM_zahl4' {
replace MCM_zaehl_4 = MCM_zaehl_4 + 1 if `var' == 1
}
tab MCM_zaehl_4
/*
MCM_zaehl_4 | Freq. Percent Cum.
------------+-----------------------------------
0 | 31 4.74 4.74
1 | 598 91.44 96.18
2 | 22 3.36 99.54
3 | 3 0.46 100.00
------------+-----------------------------------
Total | 654 100.00
*/
gen bildu2 = -999999
gen bildu2_D = -999999
replace bildu2 = 1 if F0801 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 2 if F0802 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 3 if F0803 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 4 if F0804 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 5 if F0805 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 6 if F0806 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 7 if F0807 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 8 if F0808 == 1 & MCM_zaehl_4 == 1
Then I split all cases with MCM_zaehl_4 > 1 manually into three variables.
E.g. for a case with two marked items:
replace bildu2 = 5 if ID == XXX
replace bildu2_D = 2 if ID == XXX
For that approach I'd need automation, because with more observations I won't be able to do it manually.

If I understood you correctly, you could try the following to aggregate your multiple dummy variables into several aggregate columns, based on the number of answers each person marked. It assumes the repeated answers are consecutive. I reduced your problem to 6 dummies (a1-a6), and people can mark up to 3 of them.
clear
input id a1 a2 a3 a4 a5 a6
1 1 0 0 0 0 0
2 1 1 0 0 0 0
3 1 1 1 0 0 0
4 1 1 1 0 0 0
5 0 1 0 0 0 0
6 1 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 1
end
egen n_asnwers = rowtotal(a*)
gen wanted_1 = .
gen wanted_2 = .
gen wanted_3 = .
local i = 1
foreach v of varlist a* {
replace wanted_1 = `v' if `v' == 1 & n_asnwers == 1
replace wanted_2 = `v' if `v' == 1 & n_asnwers == 2
replace wanted_3 = `v' if `v' == 1 & n_asnwers == 3
local ++i
}
list
/*
+------------------------------------------------------------------------------+
| id a1 a2 a3 a4 a5 a6 n_asnw~s wanted_1 wanted_2 wanted_3 |
|------------------------------------------------------------------------------|
1. | 1 1 0 0 0 0 0 1 1 . . |
2. | 2 1 1 0 0 0 0 2 . 1 . |
3. | 3 1 1 1 0 0 0 3 . . 1 |
4. | 4 1 1 1 0 0 0 3 . . 1 |
5. | 5 0 1 0 0 0 0 1 1 . . |
|------------------------------------------------------------------------------|
6. | 6 1 0 0 0 0 0 1 1 . . |
7. | 7 0 0 0 0 1 0 1 1 . . |
8. | 8 0 0 0 0 0 1 1 1 . . |
+------------------------------------------------------------------------------+
*/
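
If the goal is to store the category number of each marked dummy (the first marked item in the first variable, the second in the second, and so on), the idea can be sketched in Python as follows. This is a minimal illustration, not a translation of the Stata code above; the function name split_marked is hypothetical, and the limit of three slots is taken from the question.

```python
def split_marked(dummies, slots=3):
    """Return the 1-based positions of the marked dummies, padded with None.

    Assumption: dummies is a sequence of 0/1 flags in category order; marks
    beyond the first `slots` are dropped.
    """
    marked = [pos for pos, flag in enumerate(dummies, start=1) if flag == 1]
    marked = marked[:slots]
    return marked + [None] * (slots - len(marked))


print(split_marked([1, 0, 0, 0, 0, 0]))  # one item marked
print(split_marked([0, 1, 0, 0, 1, 0]))  # two items marked
```

Each returned position would correspond to one of the three categorical variables, with None standing in for a missing value.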

Convert word Python Pandas Data Frame into Zero One Data Frame

Input
userID col1 col2 col3 col4 col5 col6 col7 col8 col9
1 Java c c++ php python perl html hadoop nodejs
2 nodejs c# c++ oops css html angular java php
3 php python html java angular hadoop c nodejs c#
4 python php css perl hadoop c nodejs c# html
5 perl css python hadoop c nodejs c# java php
6 Java python css perl nodejs c# java php hadoop
7 javascript java perl nodejs angular php mysql hadoop html
8 angular mysql mongodb cs hadoop angular oops html perl
9 nodejs hadoop mysql mongodb angular oops html python java
Desired output
userID Java C C++ php python perl html hadoop nodejs oops mysql mongo
1 1 1 1 1 1 1 1 1 1 0 0 0
2 1 0 1 1 0 0 1 0 1 0 0 0
3 1 1 0 1 1 1 1 1 1 0 0 0
4 0 0 0 0 1 1 1 0 1 1 1 1

Use get_dummies + groupby by column names and aggregate max:
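import pandas as pd
# one 0/1 indicator column per word, without the original column name as prefix
df = pd.get_dummies(df.set_index('userID'), prefix='', prefix_sep='', dtype=int)
# merge indicator columns that share a name; transposing avoids groupby(axis=1), which recent pandas versions deprecate
df = df.T.groupby(level=0).max().T.reset_index()
print(df)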
userID Java angular c c# c++ cs css hadoop html java javascript \
0 1 1 0 1 0 1 0 0 1 1 0 0
1 2 0 1 0 1 1 0 1 0 1 1 0
2 3 0 1 1 1 0 0 0 1 1 1 0
3 4 0 0 1 1 0 0 1 1 1 0 0
4 5 0 0 1 1 0 0 1 1 0 1 0
5 6 1 0 0 1 0 0 1 1 0 1 0
6 7 0 1 0 0 0 0 0 1 1 1 1
7 8 0 1 0 0 0 1 0 1 1 0 0
8 9 0 1 0 0 0 0 0 1 1 1 0
mongodb mysql nodejs oops perl php python
0 0 0 1 0 1 1 1
1 0 0 1 1 0 1 0
2 0 0 1 0 0 1 1
3 0 0 1 0 1 1 1
4 0 0 1 0 1 1 1
5 0 0 1 0 1 1 1
6 0 1 1 0 1 1 0
7 1 1 0 1 1 0 0
8 1 1 1 1 0 0 1
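
For readers without pandas, the same zero-one encoding can be sketched with the standard library alone. This illustrates the idea rather than what pandas does internally; the function name to_zero_one and the dict-of-lists input layout are assumptions made for the example.

```python
def to_zero_one(rows):
    """Build a zero-one table from {user_id: [words...]}.

    Returns (columns, table): columns is the sorted list of distinct words,
    table maps each user_id to a list of 0/1 flags in column order.
    """
    columns = sorted({word for words in rows.values() for word in words})
    table = {}
    for user_id, words in rows.items():
        present = set(words)
        table[user_id] = [1 if col in present else 0 for col in columns]
    return columns, table


rows = {1: ["Java", "c", "php"], 2: ["c", "html"]}
columns, table = to_zero_one(rows)
print(columns)
print(table)
```

As in the pandas output above, matching is case-sensitive, so Java and java would end up in separate columns unless the words are normalised first.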

Stata inverse matrix function

I'm trying to compute an inverse matrix with the inv() function.
The equivalent Excel function works fine, but I can't get the correct result from Stata 11 or Stata 12:
matrix A = (0,0,553959,18071,0,0,86985,0,0,0\0,0,13752,1986661,0,0,14178,0,0,0\245764,55172,0,0,0,0,210238,15835,0,174155\135950,1217897,0,0,211554,0,348453,197592,424893,704246\0,0,40442,171113,0,0,0,0,0,0\277015,720994,0,0,0,0,0,0,0,0\0,0,0,0,0,989861,121720,67779,0,58624\286,20529,34840,90896,0,8147,157021,265924,51955,4187\0,0,0,0,0,0,299389,86656,0,90804\0,0,58171,973844,0,0,0,0,0,0)
matrix list A
matrix D = inv(A)*A
matrix list D
I get:
D[10,10]
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
c1 .99815163 -.0007439 24 256 -4.441e-18 .02180544 .63042827 .71306993 .13905754 .72740125
c2 .00071017 1.0002858 -64 -640 1.978e-17 -.00837793 1.0656752 -.27397047 -.05342766 -.27947675
c3 2.008e-20 8.082e-21 2.143632 20.313752 0 -2.369e-19 .08155506 -7.747e-18 -1.511e-18 -2.800e-19
c4 7.748e-22 3.118e-22 .04412596 1.7837869 0 -9.141e-21 .00314672 -2.989e-19 -5.829e-20 -1.080e-20
c5 -.03648975 -.01468572 512 2048 1 .430473 13.357737 14.077098 2.7452099 14.360021
c6 .000033 .00001328 -1.125 -12 0 .9996107 -.09016952 -.01273068 -.00248264 -.01298654
c7 -1.280e-19 -5.153e-20 -7.292322 -129.5298 0 1.511e-18 .47996753 4.940e-17 9.633e-18 1.785e-18
c8 -.00276051 -.001111 32 512 0 .03256598 3.1088352 2.0649553 .20767957 1.0863588
c9 .01364134 .00549012 0 -1024 0 -.1609282 -9.7734934 -5.2625881 -.02627036 -5.3683558
c10 .00263441 .00106025 0 -128 0 -.03107834 -1.6240499 -1.0163072 -.1981926 -.03673303
But I think it should be:
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
14 -8 1 -9,42932E-17 -64 1,00614E-16 -1,73472E-18 0 1,33227E-15 32
0 -32 -5,55112E-17 1 0 1,11022E-16 0 0 0 -128
0,021597505 -0,210228064 0 0 0,788571331 4,27485E-20 2,13743E-20 0 5,47181E-18 0,788571331
0 0 0 0 0 1 0 0 0 0
0 0 0 0 -16 -8,67362E-19 1 0 0 0
3,5 -1,75 6,93889E-18 -2,7765E-19 -128 1,38778E-17 0 1 2,22045E-16 35
0 0 0 0 0 0 0 0 1 0
-0,007446191 -0,112499947 0 0 1,604924249 8,70031E-20 4,35016E-20 0 1,11364E-17 1,604924249

I believe the problem is your matrix is ill-conditioned, i.e. almost singular.
If you try to compute the inverse within Mata (Stata's matrix programming language), the result is:
: Ainv = luinv(A)
: Ainv
[symmetric]
1 2 3 4 5 6 7 8 9 10
+---------------------------------------------------+
1 | . |
2 | . . |
3 | . . . |
4 | . . . . |
5 | . . . . . |
6 | . . . . . . |
7 | . . . . . . . |
8 | . . . . . . . . |
9 | . . . . . . . . . |
10 | . . . . . . . . . . |
+---------------------------------------------------+
Couple that with this passage from the help file:
If you use these functions with a singular matrix, returned will be a
matrix of missing values. The determination of singularity is made
relative to tol. See Tolerance under Remarks in [M-5]
lusolve() for details.
Source: help mf_luinv.
Checking the condition number, we see it is very high, confirming that the matrix is ill-conditioned:
: C = cond(A)
: C
7.47519e+17
Numerical methods vary, but for a matrix like this, you can expect large inaccuracies. See help mf_lusolve##remarks3 as indicated above.
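
To make the link between near-singularity and a large condition number concrete, here is a small Python sketch on a 2x2 matrix, using exact fractions so that rounding does not blur the picture. It is only an illustration: the function name cond_1norm_2x2 is hypothetical, and the 1-norm is chosen for simplicity, so it may not be the norm Mata's cond() uses.

```python
from fractions import Fraction


def cond_1norm_2x2(a, b, c, d):
    """Exact 1-norm condition number of the 2x2 matrix [[a, b], [c, d]]."""
    a, b, c, d = (Fraction(x) for x in (a, b, c, d))
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    # 1-norm of a matrix = largest absolute column sum
    norm_a = max(abs(a) + abs(c), abs(b) + abs(d))
    # the inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]]
    norm_inv = max(abs(d) + abs(c), abs(b) + abs(a)) / abs(det)
    return norm_a * norm_inv


well = cond_1norm_2x2(2, 0, 0, 1)
nearly_singular = cond_1norm_2x2(1, 1, 1, Fraction(1000001, 1000000))
print(float(well), float(nearly_singular))
```

The second matrix differs from a singular one by one part in a million in a single entry, and its condition number is in the millions. With a condition number near 1e17, as reported for A, double-precision arithmetic has essentially no reliable digits left for the inverse.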

Use Stata to generate new variable, based on combination types of variables

Say I have a dataset with three variables a, b, c, and having 5 observations. Like the following:
a b c
1 0 1
1 1 1
0 1 0
0 1 1
1 0 0
Now I want to generate a new variable called type, which is a possible combination of variable a, b and c. Specifically,
type=1 if a=b=c=0
type=2 if a=c=0 & b=1
type=3 if a=b=0 & c=1
type=4 if a=0 & b=c=1
type=5 if a=1 & b=c=0
type=6 if a=b=1 & c=0
type=7 if a=c=1 & b=0
type=8 if a=b=c=1
The new dataset I want to get is:
a b c type
1 0 1 7
1 1 1 8
0 1 0 2
0 1 1 4
1 0 0 5
Is there a general way to do this in Stata? Ideally it would also scale to cases with many types, say 100. Thanks a lot.

If the specific values of type don't matter, egen's group function works.
E.g.:
clear
input a b c
1 0 1
1 1 1
0 1 0
0 1 1
1 0 0
0 1 0
1 1 1
end
sort a b c // not necessary, but clearer presentation
egen type = group(a b c)
li
with the result
+------------------+
| a b c type |
|------------------|
1. | 0 1 0 1 |
2. | 0 1 0 1 |
3. | 0 1 1 2 |
4. | 1 0 0 3 |
5. | 1 0 1 4 |
|------------------|
6. | 1 1 1 5 |
7. | 1 1 1 5 |
+------------------+
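
The numbering shown in the listing above (distinct combinations numbered in sorted order) can be mimicked in a few lines of Python. This is a sketch of the idea, not of egen's implementation. The function names group_ids and binary_type are hypothetical, and binary_type is simply one formula that happens to reproduce the specific numbering asked for in the question.

```python
def group_ids(rows):
    """Number each distinct row 1, 2, 3, ... in sorted order of its values."""
    lookup = {combo: n for n, combo in enumerate(sorted(set(rows)), start=1)}
    return [lookup[row] for row in rows]


def binary_type(a, b, c):
    """Hypothetical formula reproducing the type numbering in the question."""
    return 1 + 4 * a + b + 2 * c


data = [(1, 0, 1), (1, 1, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0)]
print(group_ids(data))
print([binary_type(*row) for row in data])
```

The first function only numbers combinations that occur in the data, which is why the specific values differ from the question's table. The second assigns a fixed number to every possible combination.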

Printing the number of times an element occurs in a file using regex

I have a long data file similar to the one below:
16:24:59 0 0 0
16:24:59 0 1 0
16:25:00 0 1 0
16:25:00 0 1 0
16:25:00 0 2 0
16:25:00 0 2 0
16:25:00 1 0 1
16:25:01 0 0 0
16:25:01 0 0 0
16:25:01 0 0 0
16:25:01 0 0 0
16:25:01 4 9 4
16:25:02 0 0 0
16:25:02 0 0 0
16:25:02 0 0 0
16:25:02 0 1 0
16:25:02 1 9 1
16:25:02 2 0 2
I would like output that prints each element in column 1 and the number of times it occurs. Below is what I expect. How can I do this?
16:24:59 2
16:25:00 5
16:25:01 5
16:25:02 6
How can I then replace the timestamps above with labels, like this:
t1 2
t2 5
t3 5
t4 6
.
.
tn 9

It's pretty straightforward using awk:
awk '{count[$1]++} END{ for ( i in count) print i, count[i]}'
Test
$ awk '{count[$1]++} END{ for ( i in count) print i, count[i]}' input
16:24:59 2
16:25:00 5
16:25:01 5
16:25:02 6
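What does it do?
count[$1]++ increments the entry for the first field in the associative array count, creating the entry on first use.
END introduces the action performed after the last line of input has been read.
for ( i in count) print i, count[i] iterates through the array count and prints each key with its count. The iteration order is unspecified in awk, so pipe the output through sort if the order matters.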
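
The second part of the question (labels t1, t2, ... instead of timestamps) can be sketched in Python as follows. This is an illustration only: the function name count_first_field is hypothetical, and the labels are assumed to follow the order in which each timestamp first appears.

```python
from collections import Counter


def count_first_field(lines):
    """Count occurrences of the first whitespace-separated field.

    Keys keep the order in which they were first seen.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


lines = ["16:24:59 0 0 0", "16:24:59 0 1 0", "16:25:00 0 1 0"]
counts = count_first_field(lines)
# label the distinct values t1, t2, ... in the order they first appeared
for n, (stamp, total) in enumerate(counts.items(), start=1):
    print(f"t{n} {total}")
```

Because the counter keeps first-seen order, the labels come out in file order without a separate sort step.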

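Just in case you want a grep and uniq solution (uniq -c only merges adjacent duplicates, so the input must already be grouped by timestamp, as it is here):
$ grep -Eo '^[[:space:]]*[0-9]{2}:[0-9]{2}:[0-9]{2}' /tmp/lines.txt | uniq -c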
2 16:24:59
5 16:25:00
5 16:25:01
6 16:25:02
Or, if tab delimited, use cut:
$ cut -f 2 /tmp/lines.txt | uniq -c
2 16:24:59
5 16:25:00
5 16:25:01
6 16:25:02
