Awk: replace column in one file with one column from another file

I have two files. File 1:
AAAAA01 T1 0 0 0 0 C C G G G
AAAAA02 T1 0 0 0 0 C C G G G
AAAAA03 T1 0 0 0 0 C C G G G
AAAAA04 T1 0 0 0 0 C C G G G
AAAAA05 T1 0 0 0 0 C C G G G
AAAAA06 T1 0 0 0 0 C C G G G
AAAAA07 T1 0 0 0 0 C C G G G
AAAAA08 T1 0 0 0 0 C C G G G
AAAAA09 T1 0 0 0 0 C C G G G
AAAAA10 T2 0 0 0 0 C C G G G
AAAAA11 T2 0 0 0 0 C C G G G
File 2:
2 0
2 0
3 0
2 0
2 0
3 0
2 0
2 0
3 0
3 0
3 0
I have tried the following awk commands, but I only got the first row of column 6 replaced.
awk 'BEGIN { OFS = FS } FNR==NR{a[NR]=$1;next}{$6=a[FNR]}1' File2.txt File1.txt > out1.txt
awk 'BEGIN {OFS = FS} NR == FNR {a[FNR] = $B; next} $A = a[FNR]' B=1 A=6 File2.txt File1.txt > out1.txt
How can I replace column 6 in File1 with column 1 in File2?

You can use the following piece of code:
awk 'FNR==NR{a[NR]=$1;next}{$6=a[FNR]}1' File2.txt File1.txt > output.txt
FNR==NR is true only while awk is reading the first file (File2.txt), so a[NR]=$1 stores File2's first column by line number; while reading File1.txt, $6=a[FNR] then replaces column 6 with the stored value for the same line, and the trailing 1 prints each line. When I tried it on an online emulator, it gave exactly the output you want:
AAAAA01 T1 0 0 0 2 C C G G G
AAAAA02 T1 0 0 0 2 C C G G G
AAAAA03 T1 0 0 0 3 C C G G G
AAAAA04 T1 0 0 0 2 C C G G G
AAAAA05 T1 0 0 0 2 C C G G G
AAAAA06 T1 0 0 0 3 C C G G G
AAAAA07 T1 0 0 0 2 C C G G G
AAAAA08 T1 0 0 0 2 C C G G G
AAAAA09 T1 0 0 0 3 C C G G G
AAAAA10 T2 0 0 0 3 C C G G G
AAAAA11 T2 0 0 0 3 C C G G G

An alternative solution:
awk '{$6=$(NF-1); $(NF-1)=$NF=""}1' <(paste file1 file2)
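Here paste glues File2's two columns onto the end of each File1 line, so $(NF-1) is File2's first column; blanking the last two fields then drops them again, at the cost of leaving trailing field separators. In GNU awk (and mawk) you can instead shrink NF to truncate the record cleanly; a sketch, assuming a shell with process substitution such as bash:
awk '{$6 = $(NF-1); NF -= 2} 1' <(paste file1 file2)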

Related

How to grep this line "12/15-12:24:51 <1692> ## 0 0 0 0 0 0 0 0 0 0 691 0"

I have a file called test.txt:
12/15-12:24:51 <1692> ## 0 0 0 0 0 0 0 0 0 0 691 0
12/15-12:24:51 <1692> END SESSION SUMMARY
12/15-12:24:55 <1692> INFO: SESSION SUMMARY
12/15-12:24:55 <1692> + - ch G B C L S T X Y -
12/15-12:24:55 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:24:55 <1692> END SESSION SUMMARY
12/15-12:24:59 <1692> INFO: SESSION SUMMARY
12/15-12:24:59 <1692> + - ch G B C L S T X Y -
12/15-12:24:59 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:24:59 <1692> END SESSION SUMMARY
12/15-12:25:03 <1692> INFO: SESSION SUMMARY
12/15-12:25:03 <1692> + - ch G B C L S T X Y -
12/15-12:25:03 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:25:03 <1692> END SESSION SUMMARY
12/15-12:25:07 <1692> INFO: SESSION SUMMARY
12/15-12:25:07 <1692> + - ch G B C L S T X Y -
12/15-12:25:07 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:25:07 <1692> END SESSION SUMMARY
and I need the output to be:
12/15-12:24:51 <1692> ## 0 0 0 0 0 0 0 0 0 0 691 0
12/15-12:24:55 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:24:59 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:25:03 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:25:07 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
I tried the following, but couldn't get it to work:
cat test.txt | perl -e '$str = do { local $/; <> }; while ($str =~ /(\d\d):(\d\d):(\d\d)?\s.*/) { print "$1:$2:$3:$4\n"}'
Your one-liner has some mistakes. I will go through them, then show you a solution.
cat test.txt |
You don't need to cat into a pipe; just pass the file name as an argument, and the diamond operator <> will read from it.
perl -e '$str = do { local $/; <> };
This slurps the entire file into a single string, which is not useful in your case; it only helps when you expect matches that span newlines.
while ($str =~ /(\d\d):(\d\d):(\d\d)?\s.*/) {
Without the /g modifier the match never advances through the string, so this loop either fails immediately or keeps matching at the same position forever. That is especially bad because you are not running line by line, having slurped the whole file.
The regex tries to match a time stamp, e.g. 12:25:07. That cannot select anything, since every line in your input carries such a time stamp; you want to match something that is unique to the lines you do want.
print "$1:$2:$3:$4\n"}'
This part prints 4 capture groups, and you only have 3 (2 fixed and 1 optional). It will not print the entire line.
What you want is something simple like this:
perl -ne'print if /\#\#/' test.txt
This goes through the file line by line, checks each line for ## and prints the matching lines.
Or, if you are on *nix, simply: grep '##' test.txt
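If matching ## anywhere in the line is too loose (it could in principle appear elsewhere), you can anchor the filter on the third whitespace-separated field instead; a sketch in awk:
awk '$3 == "##"' test.txt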

Aggregate dummy variables to multiple categorical variables

I have 8 dummy variables (0/1). Those 8 variables have to be aggregated to one categorical variable with 8 items (categories). Normally, people should have just marked one out of the 8 dummy variables, but some marked multiple ones.
When a person has marked two items, the first value should go into the first categorical variable and the second value into a second categorical variable. When 3 items are marked, the third value should go into a third categorical variable, and so on (up to 3).
I know how to aggregate the dummies to a categorical variable, but I do not know which approach there is to divide the values to different variables, based on the number of marked dummies.
If the problem is not clear, please tell me. It was difficult for me to describe it properly.
Edit:
My approach is the following:
local MCM_zahl4 F0801 F0802 F0803 F0804 F0805 F0806 F0807 F0808
gen MCM_zaehl_4 = 0
foreach var of varlist `MCM_zahl4' {
replace MCM_zaehl_4 = MCM_zaehl_4 + 1 if `var' == 1
}
tab MCM_zaehl_4
/*
MCM_zaehl_4 | Freq. Percent Cum.
------------+-----------------------------------
0 | 31 4.74 4.74
1 | 598 91.44 96.18
2 | 22 3.36 99.54
3 | 3 0.46 100.00
------------+-----------------------------------
Total | 654 100.00
*/
gen bildu2 = -999999
gen bildu2_D = -999999
replace bildu2 = 1 if F0801 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 2 if F0802 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 3 if F0803 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 4 if F0804 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 5 if F0805 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 6 if F0806 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 7 if F0807 == 1 & MCM_zaehl_4 == 1
replace bildu2 = 8 if F0808 == 1 & MCM_zaehl_4 == 1
Then I split all cases with MCM_zaehl_4 > 1 manually across the three variables.
E.g. for two marked items:
replace bildu2 = 5 if ID == XXX
replace bildu2_D = 2 if ID == XXX
This approach needs automation, because with more observations I won't be able to do it manually.
If I understood you correctly, you could try the following to aggregate your multiple dummy variables into several columns based on the number of answers a person marked. It assumes the marked answers are consecutive. I reduced your problem to 6 dummies (a1-a6), with people answering up to 3 questions.
clear
input id a1 a2 a3 a4 a5 a6
1 1 0 0 0 0 0
2 1 1 0 0 0 0
3 1 1 1 0 0 0
4 1 1 1 0 0 0
5 0 1 0 0 0 0
6 1 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 1
end
egen n_answers = rowtotal(a*)
gen wanted_1 = .
gen wanted_2 = .
gen wanted_3 = .
foreach v of varlist a* {
replace wanted_1 = `v' if `v' == 1 & n_answers == 1
replace wanted_2 = `v' if `v' == 1 & n_answers == 2
replace wanted_3 = `v' if `v' == 1 & n_answers == 3
}
list
/*
+------------------------------------------------------------------------------+
| id a1 a2 a3 a4 a5 a6 n_answ~s wanted_1 wanted_2 wanted_3 |
|------------------------------------------------------------------------------|
1. | 1 1 0 0 0 0 0 1 1 . . |
2. | 2 1 1 0 0 0 0 2 . 1 . |
3. | 3 1 1 1 0 0 0 3 . . 1 |
4. | 4 1 1 1 0 0 0 3 . . 1 |
5. | 5 0 1 0 0 0 0 1 1 . . |
|------------------------------------------------------------------------------|
6. | 6 1 0 0 0 0 0 1 1 . . |
7. | 7 0 0 0 0 1 0 1 1 . . |
8. | 8 0 0 0 0 0 1 1 1 . . |
+------------------------------------------------------------------------------+
*/
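Note that wanted_1-wanted_3 above only ever record a 1. If, as in your bildu2/bildu2_D example, you also need to know which item was marked, the same loop can store the item's position instead. A sketch along those lines (item_1-item_3 and n_marked are illustrative names, not from your data):
gen item_1 = .
gen item_2 = .
gen item_3 = .
gen n_marked = 0
local i = 1
foreach v of varlist a* {
// count how many dummies this observation has marked so far
replace n_marked = n_marked + 1 if `v' == 1
// the position of the k-th marked dummy goes into item_k
forvalues k = 1/3 {
replace item_`k' = `i' if `v' == 1 & n_marked == `k'
}
local ++i
}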

How to drop observations after a certain condition has been met in Python using pandas dataframe with more than 2 columns

I am using a pandas dataframe and I want to delete the observations for a given person that come after the condition has been met (cond=1).
My dataset looks like:
person med cond
A a 0
A b 0
A a 1
A d 0
A e 0
B a 0
B c 1
C e 1
C f 0
D a 0
D f 0
I want to get this:
person med cond
A a 0
A b 0
A a 1
B a 0
B c 1
C e 1
D a 0
D f 0
I want the code to check, within each person's rows, whether the condition is met (cond=1), and if so drop all following lines with the same name.
Can someone help me with this?
You can groupby on the df, slice each group up to the row where cond first becomes 1 inside the lambda, and then call reset_index(drop=True) to remove the redundant index:
In [38]:
df.groupby('person').apply( lambda x: x.loc[:x['cond'].idxmax()] if len(x[x['cond']==0]) != len(x) else x).reset_index(drop=True)
Out[38]:
person med cond
0 A a 0
1 A b 0
2 A a 1
3 B a 0
4 B c 1
5 C e 1
6 D a 0
7 D f 0
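An alternative that avoids apply is to keep a row only while no cond==1 has occurred earlier in its group. A sketch (the intermediate name prior_hits is mine, and it assumes rows are already ordered within each person):
import pandas as pd

df = pd.DataFrame({
    'person': list('AAAAABBCCDD'),
    'med':    list('abadeacefaf'),
    'cond':   [0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
})

# cumulative number of cond==1 rows within each person, shifted down one
# so each row only sees hits that occurred strictly before it
prior_hits = df.groupby('person')['cond'].cumsum().groupby(df['person']).shift().fillna(0)

# keep rows with no earlier hit in their group (this keeps the cond==1 row itself)
out = df[prior_hits.eq(0)].reset_index(drop=True)
print(out)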

Use Stata to generate new variable, based on combination types of variables

Say I have a dataset with three variables a, b, c and 5 observations, like the following:
a b c
1 0 1
1 1 1
0 1 0
0 1 1
1 0 0
Now I want to generate a new variable called type, which encodes the combination of variables a, b and c. Specifically,
type=1 if a=b=c=0
type=2 if a=c=0 & b=1
type=3 if a=b=0 & c=1
type=4 if a=0 & b=c=1
type=5 if a=1 & b=c=0
type=6 if a=b=1 & c=0
type=7 if a=c=1 & b=0
type=8 if a=b=c=1
The new dataset I want to get is:
a b c type
1 0 1 7
1 1 1 8
0 1 0 2
0 1 1 4
1 0 0 5
Is there a general way to do this in Stata? Ideally it would also extend to a large number of types, say 100. Thanks a lot.
If the specific values of type don't matter, egen's group function works.
E.g.:
clear
input a b c
1 0 1
1 1 1
0 1 0
0 1 1
1 0 0
0 1 0
1 1 1
end
sort a b c // not necessary, but clearer presentation
egen type = group(a b c)
li
with the result
+------------------+
| a b c type |
|------------------|
1. | 0 1 0 1 |
2. | 0 1 0 1 |
3. | 0 1 1 2 |
4. | 1 0 0 3 |
5. | 1 0 1 4 |
|------------------|
6. | 1 1 1 5 |
7. | 1 1 1 5 |
+------------------+
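If the exact 1 to 8 coding in the question does matter, note that it is just a binary encoding of the three dummies, with a as the highest bit and, per the question's ordering, c weighted above b, so it can be computed directly (a sketch based on the coding listed in the question):
gen type = 4*a + 2*c + b + 1
For more dummies this generalizes to higher powers of 2, which scales to a large number of types without writing one replace per combination.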

Using awk in if statements

I have a data file that looks like this:
1 . 0 10109 AA AA
1 . 0 10123 C CCCT
1 . 0 10133 A AAC
1 . 0 10134 A ACAAC
1 . 0 10140 A ACCCTAAC
1 . 0 10143 C CTACT
1 rs144773400 0 10144 T TA
1 . 0 10146 AC A
1 . 0 10147 G C
In the instance of "." in the second column, I would like to replace it with a merged output of columns 1 and 4, like this:
1 1:10109 0 10109 AA AA
1 1:10123 0 10123 C CCCT
1 1:10133 0 10133 A AAC
1 1:10134 0 10134 A ACAAC
1 1:10140 0 10140 A ACCCTAAC
1 1:10143 0 10143 C CTACT
1 rs144773400 0 10144 T TA
1 1:10146 0 10146 AC A
1 1:10147 0 10147 G C
I've been attempting to do this with an if/then statement... but I know I have the syntax wrong, I'm just not sure how wrong.
if [$2 -eq "." /data/pathtofile]
then
awk '{print $1 ":" $4}'
else
awk '{print $2}' >> "/data/cleanfile"
fi
What am I missing?
The shell if cannot test awk fields like $2 directly; the comparison has to happen inside awk. You could do this through awk itself:
awk -v FS="\t" -v OFS="\t" '$2=="."{$2=$1":"$4}{$1=$1}1' file
OR
$ awk '$2=="."{$2=$1":"$4}{$1=$1}1' file
1 1:10109 0 10109 AA AA
1 1:10123 0 10123 C CCCT
1 1:10133 0 10133 A AAC
1 1:10134 0 10134 A ACAAC
1 1:10140 0 10140 A ACCCTAAC
1 1:10143 0 10143 C CTACT
1 rs144773400 0 10144 T TA
1 1:10146 0 10146 AC A
1 1:10147 0 10147 G C
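Broken down, the one-liner reads as follows:
awk '
$2 == "." { $2 = $1 ":" $4 }   # when field 2 is ".", rebuild it as field1:field4
{ $1 = $1 }                    # reassign a field so awk rejoins the record with OFS
1                              # always-true pattern: print every (possibly modified) line
' file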