I have a large data file and I want to extract two types of data based on two conditions. I wrote a Tcl script to extract the data using regex (I'm a newbie to regex).
I have used the following condition, which works fine and produces part of the desired output:
if [regexp {\+ ([0-9.]+) 1 2.*- } $line -> time ] {
I'm using the variable time somewhere in the script. The above condition produces the following output (this is just a sample, as the file is large):
+ 30.808352 1 2 tcp 40 ------- 30 6.7 2.30 81 2073
+ 30.808416 1 2 tcp 40 ------- 128 8.16 2.159 81 2069
+ 30.809513 1 2 tcp 40 ------- 156 12.19 2.187 1 2077
+ 30.809641 1 2 tcp 80 ------- 156 12.19 2.187 1 2078
+ 30.809878 1 2 tcp 40 ------- 151 7.18 2.182 41 2079
+ 30.813096 1 2 tcp 40 ------- 161 9.20 2.192 0 2083
+ 30.813352 1 2 tcp 40 ------- 157 13.19 2.188 1 2085
+ 30.81348 1 2 tcp 80 ------- 157 13.19 2.188 1 2086
+ 30.815362 1 2 tcp 40 ------- 148 12.18 2.179 41 2088
+ 30.815426 1 2 tcp 40 ------- 148 5.9 2.179 41 2089
+ 30.818096 1 2 tcp 40 ------- 162 10.20 2.193 0 2091
+ 30.818544 1 2 tcp 40 ------- 158 3.78 2.189 1 2093
+ 30.818672 1 2 tcp 80 ------- 158 14.19 2.189 1 2094
+ 30.820657 1 2 tcp 40 ------- 153 9.19 2.184 41 2096
+ 30.821579 1 2 tcp 40 ------- 154 10.19 2.185 41 2097
Then, inside the above if condition, I want to check the 9th column:
// condition 1: the 9th column is between 3 and 6 (such as 3.78, 6.7, 5.9)
The second condition is:
// condition 2: the 9th column is between 7 and 14 (such as 14.19, 12.18, 10.19, 9.19, ...)
I'm struggling with the two conditions above. I tried the following; I didn't get an error, but no matching occurred.
Condition 1:
if [regexp {\+ ([0-9.]+) 1 2.*-* ([3-9])\..*/ } $line ] {
I know I'm repeating part of the main if condition, because I don't know how to skip the columns.
Condition 2:
if [regexp {\+ ([0-9.]+) 1 2.*-* ([7-9]|1[0-4])\..*/} $line ] {
Any suggestions?
Why don't you split on whitespace? You can achieve pretty much the same outcome using a few more lines. It will be more readable and people will understand the code better:
if [regexp {\+ ([0-9.]+) 1 2.*- } $line -> time] {
    set elements [split $line " "] ;# You can actually omit the " " in this case
    set 9th [lindex $elements 8]
    # Condition 1
    if {$9th >= 3 && $9th < 7} { do something }
    # Condition 2
    if {$9th >= 7 && $9th < 15} { do something }
}
To match 7-14 in the 9th column, use: \+ ([0-9.]+) 1 2.*- \d+\s(?:[7-9]|1[0-4])
To match 3-6 in the 9th column, use: \+ ([0-9.]+) 1 2.*- \d+\s[3-6]
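If it helps, here is a minimal, untested Tcl sketch of how those two patterns could slot into your existing if. It assumes line and time as in your script, and that the 9th-column values look like the ones in your sample (an integer part of 3-14 followed by a decimal part):
if {[regexp {\+ ([0-9.]+) 1 2.*- \d+\s[3-6]} $line -> time]} {
    # condition 1: 9th column is between 3 and 6
    # ... do something with $time
} elseif {[regexp {\+ ([0-9.]+) 1 2.*- \d+\s(?:[7-9]|1[0-4])} $line -> time]} {
    # condition 2: 9th column is between 7 and 14
    # ... do something with $time
}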
I need to do this:
table 1:
ID Cod.
1 20
2 102
4 30
7 10
9 201
10 305
table 2:
ID Cod.
1 20
2 50
3 15
4 30
5 25
7 10
10 300
Now, I have a table like this from an outer join:
ID Cod. ID1 Cod1.
1 20 1 20
2 50 . .
. . 2 102
3 15 . .
4 30 4 30
5 25 . .
7 10 7 10
. . 9 201
10 300 . .
. . 10 305
Now I want to add a flag that tells me whether the ID values are common to both tables, like this:
ID Cod. ID1 Cod1. Flag_ID Flag_cod
1 20 1 20 0 0
2 50 . . 0 1
. . 2 102 0 1
3 15 . . 1 1
4 30 4 30 0 0
5 25 . . 1 1
7 10 7 10 0 0
. . 9 201 1 1
10 300 . . 0 1
. . 10 305 0 1
I would like to know how I can get Flag_ID, specifically so that it covers the cases of ID = 2 and ID = 10.
Thank you
You can group by the coalesce of the two id columns in order to count and compare the details.
Example
data table1;
input id code ##; datalines;
1 20 2 102 4 30 7 10 9 201 10 305
;
data table2;
input id code ##; datalines;
1 20 2 50 3 15 4 30 5 25 7 10 10 300
;
proc sql;
  create table got as
  select
      table2.id, table2.code
    , table1.id as id1, table1.code as code1
    , case
        when count(table1.id) = 1 and count(table2.id) = 1 then 0 else 1
      end as flag_id
    , case
        when table1.code - table2.code ne 0 then 1 else 0
      end as flag_code
  from
    table1
  full join
    table2
  on
    table2.id = table1.id and table2.code = table1.code
  group by
    coalesce(table2.id, table1.id)
  ;
quit;
You might also want to look into PROC COMPARE with a BY statement.
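For example, a minimal sketch of that approach with the example tables above (assuming both are sorted by id) would be:
proc sort data=table1; by id; run;
proc sort data=table2; by id; run;

proc compare base=table2 compare=table1;
   by id;
run;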
I have a data frame like the following:
pop state value1 value2
0 1.8 Ohio 2000001 2100345
1 1.9 Ohio 2001001 1000524
2 3.9 Nevada 2002100 1000242
3 2.9 Nevada 2001003 1234567
4 2.0 Nevada 2002004 1420000
And I have an ordered dictionary like the following:
OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(1, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
I want to transform the data frame according to the OrderedDict, like this:
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 0 1 2 1003 45
1 1.9 Ohio 20 1 1 1 5 24
2 3.9 Nevada 20 2 100 1 2 42
3 2.9 Nevada 20 1 3 1 2345 67
4 2.0 Nevada 20 2 4 1 4200 0
I think this requires really complex logic in Python pandas. How can I solve it? Thanks.
First, your OrderedDict uses the same key twice, so the second entry overwrites the first; you need to use different keys.
d= OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(2, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
Now, for your actual problem, you can iterate through d to get the items, and use the apply function on the DataFrame to get what you need.
for k, v in d.items():
    for k1, v1 in v.items():
        # v1 holds the 1-based [start, end] positions to cut out of the number
        if k == 1:
            df[k1] = df.value1.apply(lambda x: int(str(x)[v1[0]-1:v1[1]]))
        else:
            df[k1] = df.value2.apply(lambda x: int(str(x)[v1[0]-1:v1[1]]))
Now, df is
pop state value1 value2 value1_1 value1_2 value1_3 value2_1 \
0 1.8 Ohio 2000001 2100345 20 0 1 2
1 1.9 Ohio 2001001 1000524 20 1 1 1
2 3.9 Nevada 2002100 1000242 20 2 100 1
3 2.9 Nevada 2001003 1234567 20 1 3 1
4 2.0 Nevada 2002004 1420000 20 2 4 1
value2_2 value2_3
0 1003 45
1 5 24
2 2 42
3 2345 67
4 4200 0
I think this would point you in the right direction.
Converting the value1 and value2 columns to string type:
df['value1'], df['value2'] = df['value1'].astype(str), df['value2'].astype(str)
dct_1 = OrderedDict([('value1_1', [1, 2]), ('value1_2', [3, 4]), ('value1_3', [5, 7])])
dct_2 = OrderedDict([('value2_1', [1, 1]), ('value2_2', [2, 5]), ('value2_3', [6, 7])])
Converting Ordered Dictionary to a list of tuples:
dct_1_list, dct_2_list = list(dct_1.items()), list(dct_2.items())
Flattening a list of lists to a single list:
L1, L2 = sum(list(x[1] for x in dct_1_list), []), sum(list(x[1] for x in dct_2_list), [])
Subtracting 1 from the even-indexed elements of each list, as string indices start from 0 and not 1:
import numpy as np  # used for the element-wise subtraction below
L1[::2], L2[::2] = np.array(L1[0::2]) - np.array([1]), np.array(L2[0::2]) - np.array([1])
Taking the appropriate slice positions and mapping those values to the newly created columns of the dataframe:
df['value1_1'],df['value1_2'],df['value1_3']= map(df['value1'].str.slice,L1[::2],L1[1::2])
df['value2_1'],df['value2_2'],df['value2_3']= map(df['value2'].str.slice,L2[::2],L2[1::2])
Dropping off unwanted columns:
df.drop(['value1', 'value2'], axis=1, inplace=True)
Final result:
print(df)
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 00 001 2 1003 45
1 1.9 Ohio 20 01 001 1 0005 24
2 3.9 Nevada 20 02 100 1 0002 42
3 2.9 Nevada 20 01 003 1 2345 67
4 2.0 Nevada 20 02 004 1 4200 00
I have this data:
id test test_date value
1 A 02/06/2014 12:26 11
1 B 02/06/2014 12:26 23
1 C 02/06/2014 13:17 43
1 D 02/06/2014 13:17 65
1 E 02/06/2014 13:17 34
1 F 02/06/2014 13:17 64
1 A 05/06/2014 15:14 234
1 B 05/06/2014 15:14 646
1 C 05/06/2014 16:50 44
1 E 05/06/2014 16:50 55
2 E 05/06/2014 16:50 443
2 F 05/06/2014 16:50 22
2 G 05/06/2014 16:59 445
2 B 05/06/2014 20:03 66
2 C 05/06/2014 20:03 77
2 D 05/06/2014 20:03 88
2 E 05/06/2014 20:03 44
2 F 05/06/2014 20:19 33
2 G 05/06/2014 20:19 22
I would like to transform this data into wide format like this:
id date A B C D E F G
1 02/06/2014 12:26 11 23 43 65 34 64 .
1 05/06/2014 15:14 234 646 44 . 55 . .
2 05/06/2014 16:50 . . . . 443 22 445
2 05/06/2014 20:03 . 66 77 88 44 33 22
I am using the reshape command in Stata, but it is not producing the required results:
reshape wide test_date value, i(id) j(test) string
Any idea how to do this?
UPDATE:
You're right that we need this missvar. I tried to create it programmatically, but failed. Let's say that within 2 hours of the test date the batch is considered the same. We have only 7 tests (A, B, C, D, E, F, G). First I tried to find the time difference:
bysort id: gen diff_bd = (test_date[_n] - test_date[_n-1])/(1000*60*60)
bysort id: generate missvar = _n if diff_bd <= 2
@jfeigenbaum has given part of the answer.
The problem I see is that you are missing a variable that identifies relevant sub-groups. These sub-groups seem to be bounded by test taking values A - G. But I may be wrong.
I've included this variable in the example data set, and named it missvar. I forced this variable into the data set believing it identifies groups that, although implicit in your original post, are important for your analysis.
clear
set more off
*----- example data -----
input ///
id str1 test str30 test_date value missvar
1 A "02/06/2014 12:26" 11 1
1 B "02/06/2014 12:26" 23 1
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
2 E "05/06/2014 16:50" 443 1
2 F "05/06/2014 16:50" 22 1
2 G "05/06/2014 16:59" 445 1
2 B "05/06/2014 20:03" 66 2
2 C "05/06/2014 20:03" 77 2
2 D "05/06/2014 20:03" 88 2
2 E "05/06/2014 20:03" 44 2
2 F "05/06/2014 20:19" 33 2
2 G "05/06/2014 20:19" 22 2
end
gen double tdate = clock( test_date, "DM20Yhm")
format %tc tdate
drop test_date
list, sepby(id)
*----- what you want ? -----
reshape wide value, i(id missvar tdate) j(test) string
collapse (min) tdate value?, by(id missvar)
rename value* *
list
There should be some way of identifying the groups programmatically. Relying on the original sort order of the data is one way, but it may not be the safest. It may be the only way, but only you know that.
Edit
Regarding your comment and the "missing" variable, one way to create it is:
// one hour is 3600000 milliseconds, so two hours is 7200000
bysort id (tdate): gen batch = sum(tdate - tdate[_n-1] > 7200000)
For your example data, this creates a batch variable identical to my missvar. You can also use time-series operators.
Let me emphasize the need for you to be careful when presenting your example data. It must be representative of the real data or you might get code that doesn't suit it; that includes the possibility that you don't notice, because Stata gives no error.
For example, if you have the same test, applied to the same id within the two-hour limit, then you'll lose information with this code (in the collapse). (This is not a problem in your example data.)
Edit 2
In response to another question found in the comments:
Suppose a new observation for person 1, such that he receives a repeated test within the two-hour limit, but at a different time:
1 A "02/06/2014 12:26" 11 1 // old observation
1 B "02/06/2014 12:26" 23 1
1 A "02/06/2014 12:35" 99 1 // new observation
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
Test A is applied at 12:26 and at 12:35. Reshape will have no problem with this, but the collapse will discard information because it takes the minimum values within the id missvar groups; notice that for the variable valueA, the new information (the 99) will be lost (the same happens with all the other variables, but you are explicit about wanting to discard that). After the reshape but before the collapse you get:
. list, sepby(id)
+--------------------------------------------------------------------------------------------------+
| id missvar tdate valueA valueB valueC valueD valueE valueF valueG |
|--------------------------------------------------------------------------------------------------|
1. | 1 1 02jun2014 12:26:00 11 23 . . . . . |
2. | 1 1 02jun2014 12:35:00 99 . . . . . . |
3. | 1 1 02jun2014 13:17:00 . . 43 65 34 64 . |
4. | 1 2 05jun2014 15:14:00 234 646 . . . . . |
5. | 1 2 05jun2014 16:50:00 . . 44 . 55 . . |
|--------------------------------------------------------------------------------------------------|
6. | 2 1 05jun2014 16:50:00 . . . . 443 22 . |
7. | 2 1 05jun2014 16:59:00 . . . . . . 445 |
8. | 2 2 05jun2014 20:03:00 . 66 77 88 44 . . |
9. | 2 2 05jun2014 20:19:00 . . . . . 33 22 |
+--------------------------------------------------------------------------------------------------+
Running the complete code confirms what we just said:
. list, sepby(id)
+--------------------------------------------------------------------------+
| id missvar tdate A B C D E F G |
|--------------------------------------------------------------------------|
1. | 1 1 02jun2014 12:26:00 11 23 43 65 34 64 . |
2. | 1 2 05jun2014 15:14:00 234 646 44 . 55 . . |
|--------------------------------------------------------------------------|
3. | 2 1 05jun2014 16:50:00 . . . . 443 22 445 |
4. | 2 2 05jun2014 20:03:00 . 66 77 88 44 33 22 |
+--------------------------------------------------------------------------+
Suppose now a new observation for person 1, such that he receives a repeated test within the two-hour limit, but at the same time:
1 A "02/06/2014 12:26" 11 1 // old observation
1 B "02/06/2014 12:26" 23 1
1 A "02/06/2014 12:26" 99 1 // new observation
1 C "02/06/2014 13:17" 43 1
1 D "02/06/2014 13:17" 65 1
1 E "02/06/2014 13:17" 34 1
1 F "02/06/2014 13:17" 64 1
1 A "05/06/2014 15:14" 234 2
1 B "05/06/2014 15:14" 646 2
1 C "05/06/2014 16:50" 44 2
1 E "05/06/2014 16:50" 55 2
Then the reshape won't work. Stata complains:
values of variable test not unique within id missvar tdate
and with reason. The error is clear in signalling the problem. (If not clear, go back to help reshape and work out some exercises.) The request makes no sense given the functioning of the command.
Finally, note it's relatively easy to check whether something will work or not: just try it! All that was necessary in this case was to modify the example data a bit. Go back to the help files and manuals, if necessary.
The command is slightly misspecified. You want to reshape value. Look at the output you want and notice the observations are uniquely identified by id and test_date. Therefore, they should be in the i option.
reshape wide value, i(id test_date) j(test) string
This yields something close to what you want; you just need to rename a few variables to get exactly that output. Specifically:
rename test_date date
renpfix value
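Note that renpfix is older syntax; on current versions of Stata the equivalent rename value* * (the form used in the other answer's code) should do the same renaming.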
I have something like this:
test[1]
"[0 30.5 4.5 10.5 2 35 22.999999999999996 29 5.500000000000001 23.5 18 23.5 44.5 3 44.5 44.00000000000001 43 27 42 35.5 19.5 44.00000000000001 1 0 31 34 18 1.5 26 6 45.99999999999999 10.5 9.5 24 20 42.5 14.5 45.5 20.499999999999996 150 45.5 0 4.5 22.5 4 9 8 0 0 15.5 30.5 7 5.500000000000001 12.5 33.5 15 500 22.5 18 43 4.5 26 23.5 16 4.5 7.5 32 0 0 18.5 33 31 14.5 21.5 0 40 0 0 43.49999999999999 22.999999999999996]"
And I would like to remove the [ and ] (the first and last characters) from each element (test[1], test[2], ...) but keep the decimal points (e.g. 22.9999).
I have tried some stringr functions, but I'm not so good with regex...
Can you help me?
There's no need for packages for this. Just use something like the following:
gsub("\\[|\\]", "", test)
This basically says: "Look in test for "[" or (|) "]", and if you find it, replace it with nothing ("")."
Since [ and ] are special characters in regular expressions, they would need to be escaped.
If you're just removing the first and last character, you can also probably do something like:
substring(test, 2, nchar(test)-1)
This basically says, "Extract the part of the string starting from the second position and ending in the second-to-last position."
One easy way to remove [ and ] from a string is
x <- "[12345]"
gsub("[][]", "", x)
# [1] "12345"
Here, the outer [] means one of the characters in the brackets. The inner ][ represent the to-be-replaced characters.
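If you would rather avoid regular expressions altogether, a small sketch (my own addition, not from the answers above) using fixed = TRUE, which makes gsub treat the pattern as a literal string, would be:
x <- "[12345]"
gsub("]", "", gsub("[", "", x, fixed = TRUE), fixed = TRUE)
# [1] "12345"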
I have a sequence of numbers from 1 000 000 to 9 999 999 (9,000,000 in total). I've generated them in Excel and I would like to match them against the following formats.
Last 6 digits in:
1. XXX XXX (For example, 000 000 or 111 111 or 222 222)
2. X00 000 (For example, 100 000 or 200 000 or 300 000)
3. XYY YYY (For example, 122 222 or 233 333 or 411 111)
4. XY0 000 (For example, 230 000 or 750 000 or 120 000)
5. XYZ ZZZ (For example, 231 111 or 232 222 or 233 333)
6. X00 Y00 (For example, 200 300 or 100 400 or 500 600)
7. XXX Y00 (For example, 333 300 or 666 600 or 777 700)
8. XXX YYY (For example, 111 333 or 222 555 or 555 666)
9. XX YY ZZ (For example, 11 22 33 or 22 33 44 or 44 55 66)
10. X0 Y0 Z0 (For example, 10 20 30 or 30 40 50 or 60 70 80)
Would it be possible to do this with regex or VBA in Excel 2013?
Since I don't have much knowledge of Excel, should I seek someone's help with a simple program for such matching?
You can use VBA, but I believe you will need to set up each classification separately, and also ensure that they are tested in an order such that they do not overlap.
Here is a partial example, showing a few VBA techniques, which you should be able to extend. I only dealt with the rightmost 6 digits, first constructing a zero-padded string, and also put each digit into an array element to make the test expressions simpler.
Option Explicit

Function Classify(N As Long) As String
    Dim I As Long
    Dim S(1 To 6) As String
    Dim sN As String

    ' work on the rightmost 6 digits as a zero-padded string
    sN = Format(Right(N, 6), "000000")
    ' put each digit into its own array element to keep the tests simple
    For I = 1 To 6
        S(I) = Mid(sN, I, 1)
    Next I

    If sN = String(6, S(1)) Then    ' all six digits identical, e.g. 111 111
        Classify = "XXX XXX"
    ElseIf S(1) <> 0 And Mid(sN, 2) = 0 Then
        Classify = "X00 000"
    ElseIf S(1) <> 0 And Mid(sN, 2) Like WorksheetFunction.Rept(S(2), 5) Then
        Classify = "XYY YYY"
    ElseIf S(1) <> 0 And S(2) <> 0 And S(1) <> S(2) And Mid(sN, 3) = 0 Then
        Classify = "XY0 000"
    ElseIf ...
    End If
End Function
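Assuming the function is pasted into a standard VBA module, you could then classify the numbers on the worksheet itself, for example with =Classify(A1) filled down alongside your generated values; numbers whose pattern is not yet coded will simply return an empty string.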