I am trying to join several string variables (c1, c2 etc.):
AKJ OFE ETH AKJ AKJ
345 952 319 123 345
I can join them with the following command:
generate c = c1 + c2 + c3 + c4 + c5
How can I join only their unique entries?
AKJ OFE ETH
345 952 319 123
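One possible approach, sketched below (it assumes no entry is a substring of another, so that strpos() is a safe containment test), is to start from c1 and append each later value only when it does not already appear in the result:
generate c = c1
foreach v of varlist c2-c5 {
    * append the value only if it is not already contained in c
    replace c = c + " " + `v' if strpos(c, `v') == 0
}
With the example data this yields AKJ OFE ETH and 345 952 319 123, matching the desired output.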
An alternative solution is the following:
clear
input str3(c1 c2 c3 c4 c5)
AKJ OFE ETH AKJ AKJ
345 952 319 123 345
end
local vars c2 c3 c4 c5
local dvars c1
generate tempc1 = c1
foreach var of local vars {
    generate temp`var' = `var'
    foreach dvar of local dvars {
        * blank out the value if it already appeared in an earlier variable
        replace temp`var' = "" if `var' == `dvar'
    }
    * add the current variable to the list of variables already seen
    local dvars `dvars' `var'
}
egen c = concat(temp*), punct(" ")
drop temp*
list
+-----------------------------------------------+
| c1 c2 c3 c4 c5 c |
|-----------------------------------------------|
1. | AKJ OFE ETH AKJ AKJ AKJ OFE ETH |
2. | 345 952 319 123 345 345 952 319 123 |
+-----------------------------------------------+
Related
This question is related to Stata: select the minimum of each observation.
I have data as follows:
clear
input str4 id int eventdate byte dia_bp_copy int sys_bp_copy
"pat" 15698 100 140
"pat" 16183 80 120
"pat" 19226 98 155
"pat" 19375 80 130
"sue" 14296 80 120
"sue" 14334 88 127
"sue" 14334 96 158
"sue" 14334 84 136
"sue" 14403 86 124
"sue" 14403 88 134
"sue" 14403 90 156
"sue" 14403 86 134
"sue" 14403 90 124
"sue" 14431 80 120
"sue" 14431 80 140
"sue" 14431 80 130
"sue" 15456 80 130
"sue" 15501 80 120
"sue" 15596 80 120
"mary" 14998 90 154
"mary" 15165 91 179
"mary" 15280 91 156
"mary" 15386 81 154
"mary" 15952 77 133
"mary" 15952 80 144
"mary" 16390 91 159
end
Some people have multiple readings on one day, eg see Sue on 31st March 1999. I want to select the lowest reading per day.
Here is my code which gets me some of the way. It is clunky and clumsy and I am looking for help to do what I want to do in a more straightforward way.
*make flag for repeat observations on same day
sort id eventdate
by id: gen flag =1 if eventdate==eventdate[_n-1]
by id: gen flag2=1 if eventdate==eventdate[_n+1]
by id: gen flag3 =1 if flag==1 | flag2==1
drop flag flag2
* group repeat observations together
egen group = group(id flag3 eventdate)
* find lowest sys_bp_copy value per group
bys group (eventdate flag3): egen low_sys=min(sys_bp_copy)
*remove the observations where the lowest value of sys_bp_copy doesn't exist
bys group: gen remove =1 if low_sys!=sys_bp_copy
drop if remove==1 & group !=.
**Problems with this and where I'd like help**
The problem with the above approach is that for Sue, two of her repeat readings have the same value of sys_bp_copy. So my approach above leaves me with multiple readings for her.
In this instance I would like to refer to dia_bp_copy and select the lowest value there, to help me pick out one row per person when multiple readings are in place. Code for this is below - but there must be a simpler way to do this?
drop flag3 remove group
sort id eventdate
by id: gen flag =1 if eventdate==eventdate[_n-1]
by id: gen flag2=1 if eventdate==eventdate[_n+1]
by id: gen flag3 =1 if flag==1 | flag2==1
egen group = group(id flag3 eventdate)
bys group (eventdate flag3): egen low_dia=min(dia_bp_copy)
bys group: gen remove =1 if low_dia!=dia_bp_copy
drop if remove==1 & group !=.
The lowest systolic pressure for a patient on a particular day is easy to define: you just sort and look for the lowest value in each block of observations.
We can refine the definition by breaking ties on systolic by values of diastolic. That's another sort. In this example, that makes no difference.
clear
input str4 id int eventdate byte dia_bp_copy int sys_bp_copy
"pat" 15698 100 140
"pat" 16183 80 120
"pat" 19226 98 155
"pat" 19375 80 130
"sue" 14296 80 120
"sue" 14334 88 127
"sue" 14334 96 158
"sue" 14334 84 136
"sue" 14403 86 124
"sue" 14403 88 134
"sue" 14403 90 156
"sue" 14403 86 134
"sue" 14403 90 124
"sue" 14431 80 120
"sue" 14431 80 140
"sue" 14431 80 130
"sue" 15456 80 130
"sue" 15501 80 120
"sue" 15596 80 120
"mary" 14998 90 154
"mary" 15165 91 179
"mary" 15280 91 156
"mary" 15386 81 154
"mary" 15952 77 133
"mary" 15952 80 144
"mary" 16390 91 159
end
bysort id eventdate (sys) : gen lowest = sys[1]
bysort id eventdate (sys dia) : gen lowest_2 = sys[1]
egen tag = tag(id eventdate)
count if lowest != lowest_2
list id event dia sys lowest* if tag, sepby(id)
+-----------------------------------------------------------+
| id eventd~e dia_bp~y sys_bp~y lowest lowest_2 |
|-----------------------------------------------------------|
1. | mary 14998 90 154 154 154 |
2. | mary 15165 91 179 179 179 |
3. | mary 15280 91 156 156 156 |
4. | mary 15386 81 154 154 154 |
5. | mary 15952 77 133 133 133 |
7. | mary 16390 91 159 159 159 |
|-----------------------------------------------------------|
8. | pat 15698 100 140 140 140 |
9. | pat 16183 80 120 120 120 |
10. | pat 19226 98 155 155 155 |
11. | pat 19375 80 130 130 130 |
|-----------------------------------------------------------|
12. | sue 14296 80 120 120 120 |
13. | sue 14334 88 127 127 127 |
16. | sue 14403 86 124 124 124 |
21. | sue 14431 80 120 120 120 |
24. | sue 15456 80 130 130 130 |
25. | sue 15501 80 120 120 120 |
26. | sue 15596 80 120 120 120 |
+-----------------------------------------------------------+
egen is very useful (disclosure of various interests there), but the main idea here is just that by: defines groups of observations and you can do that for two or more variables, and not just one -- and control the sort order too. As it were, about half of egen is built on such ideas, but it can be easiest and best to use them directly.
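For instance, the lowest reading of each day can be flagged directly with by: and a controlled sort order, with no egen at all (a small sketch using the variables above):
bysort id eventdate (sys_bp_copy dia_bp_copy) : gen byte is_lowest = (_n == 1)
Within each (id, eventdate) group, sorted by systolic and then diastolic, the first observation is the lowest reading of that day, so is_lowest marks exactly one row per person per day.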
If I understand correctly:
Create an identifier for same id and same date
egen temp_group = group(id eventdate)
Find the first occurrence based on lowest sys_bp_copy and then lowest dia_bp_copy
bys temp_group (sys_bp_copy dia_bp_copy): gen temp_first = _n
keep if temp_first == 1
drop temp*
or in one line, as suggested in a comment:
bys id eventdate (sys_bp_copy dia_bp_copy): keep if _n==1
Is there a way to build a defrecord with lots of fields? It appears there is a limit of around 122 fields, as this gives a "Method code too large!" error:
(defrecord WideCsvFile
[a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19
a20 a21 a22 a23 a24 a25 a26 a27 a28 a29 a30 a31 a32 a33 a34 a35 a36 a37 a38 a39
a40 a41 a42 a43 a44 a45 a46 a47 a48 a49 a50 a51 a52 a53 a54 a55 a56 a57 a58 a59
a60 a61 a62 a63 a64 a65 a66 a67 a68 a69 a70 a71 a72 a73 a74 a75 a76 a77 a78 a79
a80 a81 a82 a83 a84 a85 a86 a87 a88 a89 a90 a91 a92 a93 a94 a95 a96 a97 a98 a99
a100 a101 a102 a103 a104 a105 a106 a107 a108 a109 a110 a111 a112 a113 a114 a115 a116 a117 a118 a119
a120 a121 a122])
while removing any of the fields allows record creation.
Java has a maximum size for its methods (see the answers to this question for specifics). defrecord creates methods whose size depends on the number of values the record will contain.
To deal with this issue, I see two options:
macroexpand-1 your call to defrecord, copy the results, and find a way to re-write the generated methods to be smaller.
Take a different approach to storing your data, such as using Clojure's vector class.
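As a rough illustration of the first option, macroexpand-1 will show the definitions defrecord would emit, which you could then copy and rewrite by hand (the record name and short field list here are just stand-ins):
(require '[clojure.pprint :refer [pprint]])

;; Pretty-print the code defrecord would generate for a small record;
;; the expansion for WideCsvFile is produced the same way, just much larger.
(pprint (macroexpand-1 '(defrecord SmallCsvFile [a0 a1 a2])))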
EDIT:
Now that I know what you want to do, I am more convinced that you should use vectors. Since you want to use indexes like a101, I've written you a macro to generate them:
(defmacro auto-index-vector [v prefix]
(let [indices (range (count (eval v)))
definitions (map (fn [ind]
`(def ~(symbol (str prefix ind)) ~ind)) indices)]
`(do ~@definitions)))
Let's try it out!
stack-prj.bigrecord> (def v1 (into [] (range 122)))
#'stack-prj.bigrecord/v1
stack-prj.bigrecord> (auto-index-vector v1 "a")
#'stack-prj.bigrecord/a121
stack-prj.bigrecord> (v1 a101)
101
stack-prj.bigrecord> (assoc v1 a101 "hi!")
[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
95 96 97 98 99 100 "hi!" 102 103 104 105 106 107 108 109 110 111 112
113 114 115 116 117 118 119 120 121]
To use this: you'll read your CSV data into a vector, call auto-index-vector on it with the prefix of your choosing, and use the resulting indices to perform vector operations on your data.
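For example, a small sketch of that workflow (the CSV line and the prefix "b" are made up):
(require '[clojure.string :as str])

;; Parse one CSV line into a vector, generate named indices, and use them.
(def row (vec (str/split "3.2,7.9,1.4" #",")))
(auto-index-vector row "b")   ; defines b0, b1, b2
(nth row b2)                  ; => "1.4"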
I have a list with hexadecimal lines. For example:
0b 5a 3f 5a 7d d0 5d e6 2b c4 7e 7d c2 c0 e6 9a
84 bd aa 74 f3 85 da 9d ac b6 e0 b6 62 0f b5 d5
c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1
...
I'm trying to use grep to find all the lines where some character occurs only once in the line.
For example, there is only one 'd' in the third line.
I tried this, but it's not working:
egrep '^.*([a-f0-9])[^\1]*$'
This can be done with a regex, but it has to be verbose; it can't really be generalized.
# ^(?:[^a]*a[^a]*|[^b]*b[^b]*|[^c]*c[^c]*|[^d]*d[^d]*|[^e]*e[^e]*|[^f]*f[^f]*|[^0]*0[^0]*|[^1]*1[^1]*|[^2]*2[^2]*|[^3]*3[^3]*|[^4]*4[^4]*|[^5]*5[^5]*|[^6]*6[^6]*|[^7]*7[^7]*|[^8]*8[^8]*|[^9]*9[^9]*)$
^
(?:
[^a]* a [^a]*
| [^b]* b [^b]*
| [^c]* c [^c]*
| [^d]* d [^d]*
| [^e]* e [^e]*
| [^f]* f [^f]*
| [^0]* 0 [^0]*
| [^1]* 1 [^1]*
| [^2]* 2 [^2]*
| [^3]* 3 [^3]*
| [^4]* 4 [^4]*
| [^5]* 5 [^5]*
| [^6]* 6 [^6]*
| [^7]* 7 [^7]*
| [^8]* 8 [^8]*
| [^9]* 9 [^9]*
)
$
For discovery, if you put capture groups around the letters and numbers and use a branch reset:
^
(?|
[^a]* (a) [^a]*
| [^b]* (b) [^b]*
| [^c]* (c) [^c]*
| [^d]* (d) [^d]*
| [^e]* (e) [^e]*
| [^f]* (f) [^f]*
| [^0]* (0) [^0]*
| [^1]* (1) [^1]*
| [^2]* (2) [^2]*
| [^3]* (3) [^3]*
| [^4]* (4) [^4]*
| [^5]* (5) [^5]*
| [^6]* (6) [^6]*
| [^7]* (7) [^7]*
| [^8]* (8) [^8]*
| [^9]* (9) [^9]*
)
$
This is the output:
** Grp 0 - ( pos 0 , len 50 )
0b 5a 3f 5a 7d d0 5d e6 2b c4 7e 7d c2 c0 e6 9a
** Grp 1 - ( pos 7 , len 1 )
f
-----------------------
** Grp 0 - ( pos 50 , len 51 )
84 bd aa 74 f3 85 da 9d ac b6 e0 b6 62 0f b5 d5
** Grp 1 - ( pos 77 , len 1 )
c
-----------------------
** Grp 0 - ( pos 101 , len 51 )
c0 b0 f5 60 02 8b 1c a4 41 7c 53 f2 85 20 a0 d1
** Grp 1 - ( pos 148 , len 1 )
d
I don't know a way to do it with a regex. However, you can use this stupid awk script:
awk -F '' '{delete a; for(i=1;i<=NF;i++){a[$i]++}; for(i in a){if(a[i]==1){print;next}}}' input
The script counts the number of occurrences of every character in the line (delete a at the start resets the counts for each new line). At the end of the line it checks all the totals and prints the line if at least one of them equals 1.
Here is a piece of code that uses a number of shell tools beyond grep. It reads the input line by line, builds a character frequency table for each line, and, upon finding a character with frequency 1, outputs that unique character together with the entire line.
cat input | while read line ; do
    export line ;
    echo $line | grep -o . | sort | uniq -c | \
        awk '/[ ]+1[ ]/ {print $2 ":" ENVIRON["line"] ; exit }' ;
done
Note that if you are interested only in the characters a-f, you could replace grep -o . with grep -o "[a-f]"
I'm trying to do two matches on one block of returned data inside an expect script. The returned data comes from a command that shows what this system is connected to (I changed the descriptions to protect sensitive information). I thought I could use expect_out(buffer), but I can't figure out how to parse the returned data so as to detect two unique instances of the patterns. I can re-run the command if I detect one instance of a pattern, but that won't let me detect the case where there are two instances of a pattern in the returned data, for example 'abcd' and 'abcd', because expect{} would just re-find the first one.
Case one: I will have zero instances of 'abcd', 'efgh', 'ijkl', 'mnop', or 'qurs' in the returned block - in that case nothing will be written to a file and that's fine.
Case two: I will have only one instance of 'abcd', 'efgh', 'ijkl', 'mnop', or 'qurs' in the returned block; the current code detects that case and then writes the existence of one pattern to a file for later processing.
Case three: I have two instances of the patterns 'abcd', 'efgh', 'ijkl', 'mnop', or 'qurs', in any combination of the pairs. I could have 'abcd', 'abcd'; 'abcd', 'efgh'; or 'ijkl', 'mnop'. If case 3 happens I need to write a different message to the file.
Can anyone help?
My data:
A4 | 48 48 changedToProtectPrivacy
A15 | 48 48 changedToProtectPrivacy
A16 | 48 48 changedToProtectPrivacy
A17 | 48 48 changedToProtectPrivacy
A18 | 48 48 changedToProtectPrivacy
A19 | 48 48 changedToProtectPrivacy
A20 | 48 48 changedToProtectPrivacy
A21 | 48 48 changedToProtectPrivacy
A24 | abcd
A24 | abcd
B1 | 48 48 changedToProtectPrivacy
B2 | 48 48 changedToProtectPrivacy
B3 | 48 48 changedToProtectPrivacy
B4 | 48 48 changedToProtectPrivacy
B5 | 48 48 changedToProtectPrivacy
B6 | 48 48 changedToProtectPrivacy
B21 | 48 48 changedToProtectPrivacy
B24 | abcd
B24 | abcd
D2 | 00 ... 1 changedToProtectPrivacy
D10 | 00 ... 1 changedToProtectPrivacy
E6 | 00 ... 1 changedToProtectPrivacy
-=- Current code snippet -=-
expect { "prompt" } send { "superSecretCommand" ; sleep 2 }
expect {
"abcd" { set infofile "info.$server" ;
set ::infofile [open $infofile a] ;
puts $::infofile "Connection detected" ;
close $::infofile ;
}
"efgh" { set infofile "info.$server" ;
set ::infofile [open $infofile a] ;
puts $::infofile "Connection detected" ;
close $::infofile ;
}
}
I guess what you need is something like this:
[STEP 101] $ cat infile
A20 | 48 48 changedToProtectPrivacy
A21 | 48 48 changedToProtectPrivacy
A24 | abcd
A24 | abcd
B1 | 48 48 changedToProtectPrivacy
B6 | 48 48 changedToProtectPrivacy
B7 | ijkl
B21 | 48 48 changedToProtectPrivacy
B24 | efgh
B24 | abcd
D2 | 00 ... 1 changedToProtectPrivacy
D3 | efgh
D3 | abcd
D10 | 00 ... 1 changedToProtectPrivacy
D11 | ijkl
E6 | 00 ... 1 changedToProtectPrivacy
E7 | ijkl
[STEP 102] $ cat foo.exp
#!/usr/bin/expect
log_user 0
spawn -noecho cat infile
set pat1 {[\r\n]+[[:blank:]]*[A-Z][0-9]+[[:blank:]]*\|[[:blank:]]*}
set pat2 {[a-z]{4,4}}
# Try to match two pattern occurrences on consecutive lines first, then fall
# back to a single occurrence: groups 1 and 2 are set by the two-occurrence
# branch, group 3 by the single-occurrence branch.
expect {
    -re "${pat1}($pat2)${pat1}($pat2)|${pat1}($pat2)" {
        if {[info exists expect_out(3,string)]} {
            send_user ">>> $expect_out(3,string)\n"
        } else {
            send_user ">>> $expect_out(1,string) $expect_out(2,string)\n"
        }
        array unset expect_out
        exp_continue
    }
}
[STEP 103] $ expect foo.exp
>>> abcd abcd
>>> ijkl
>>> efgh abcd
>>> efgh abcd
>>> ijkl
>>> ijkl
[STEP 104] $
I would like to search through a rather large file (sorted by the 4th, then the 3rd column), find the first time a new word appears in the 4th column, and print the whole line out to a new file.
For example, my file looks like this:
c1 23 1912 PE_1.7
c1 25 2334 PE_1.7
c1 59 2340 PE_1.7
c1 28 2342 PE_1.7
c1 30 2345 PE_1.7
c1 45 2346 PE_1.7
c1 23 2348 PA_11.4
c1 24 2352 PA_11.4
c1 57 2362 PA_123.2
c1 26 2372 DA_1.5
And I would hope the new file would look like this:
c1 23 1912 PE_1.7
c1 23 2348 PA_11.4
c1 57 2362 PA_123.2
c1 26 2372 DA_1.5
I am rotten with regex but I was thinking something along these lines:
grep \t.[_].[\.]$
Is there a good way to do this type of grep, or am I barking up the wrong tree, so to speak?
This
uniq --skip-fields=3 input.txt
Yields:
c1 23 1912 PE_1.7
c1 23 2348 PA_11.4
c1 57 2362 PA_123.2
c1 26 2372 DA_1.5
Try this awk one-liner, which prints a line whenever the value in the 4th column differs from that of the previous line (it relies on the file being sorted by that column):
awk 'p!=$4{print;p=$4}' file > newFile
Try this:
$ awk '!x[$4]++' file
c1 23 1912 PE_1.7
c1 23 2348 PA_11.4
c1 57 2362 PA_123.2
c1 26 2372 DA_1.5
It is simpler to use awk, which here prints a line only the first time its 4th-field value is seen:
awk '!($4 in a) {a[$4]; print}' file
c1 23 1912 PE_1.7
c1 23 2348 PA_11.4
c1 57 2362 PA_123.2
c1 26 2372 DA_1.5