I have something like this:
test[1]
"[0 30.5 4.5 10.5 2 35 22.999999999999996 29 5.500000000000001 23.5 18 23.5 44.5 3 44.5 44.00000000000001 43 27 42 35.5 19.5 44.00000000000001 1 0 31 34 18 1.5 26 6 45.99999999999999 10.5 9.5 24 20 42.5 14.5 45.5 20.499999999999996 150 45.5 0 4.5 22.5 4 9 8 0 0 15.5 30.5 7 5.500000000000001 12.5 33.5 15 500 22.5 18 43 4.5 26 23.5 16 4.5 7.5 32 0 0 18.5 33 31 14.5 21.5 0 40 0 0 43.49999999999999 22.999999999999996]"
I would like to remove [ and ] (the first and last characters) from each line (test[1], test[2], ...) but keep the decimal points (22.9999).
I have tried some stringr functions, but I'm not so good with regex ...
Can you help me?
There's no need for packages for this. Just use something like the following:
gsub("\\[|\\]", "", test)
This basically says: "Look in test for "[" or (|) "]", and if you find it, replace it with nothing ("")."
Since [ and ] are special characters in regular expressions, they would need to be escaped.
If you're just removing the first and last character, you can also probably do something like:
substring(test, 2, nchar(test)-1)
This basically says, "Extract the part of the string starting from the second position and ending in the second-to-last position."
One easy way to remove [ and ] from a string is
x <- "[12345]"
gsub("[][]", "", x)
# [1] "12345"
Here, the outer [] means one of the characters in the brackets. The inner ][ represent the to-be-replaced characters.
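For comparison, here is a minimal Python sketch of the same two ideas (a character class that removes both brackets, and slicing off the first and last character). The function names are illustrative only and are not part of the R answers above.

```python
import re

def strip_brackets_regex(s):
    # Remove every "[" or "]" anywhere in the string.
    return re.sub(r"[\[\]]", "", s)

def strip_first_last(s):
    # Drop the first and last character, whatever they are.
    return s[1:-1]

line = "[0 30.5 4.5 22.999999999999996]"
print(strip_brackets_regex(line))
print(strip_first_last(line))
```

The regex version removes brackets wherever they occur, while the slicing version assumes the brackets are always the first and last characters.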
I wanted to see if this was doable in SAS. I have a dataset of the members of congress and want to split full name into first and last. However, occasionally they seem to list their middle initial or name. It is from a .txt file.
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
Good day,
SAS is a bit clunky when it comes to strings. However, it can be done. As others have mentioned, defining the logic is the really hard part.
Begin with some raw data...
data begin;
input raw_str $ 1-100;
cards;
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
; run;
First, I select the leading names up to the first bracket and count the number of strings (words):
data names;
set begin;
names_only = scan(raw_str,1,'[');
Nr_of_str = countw(names_only,' ');
run;
Assumption: The first string is the last name.
If there are only 2 strings, the first and last are pretty easy with scan and substring:
data names2;
set names;
if Nr_of_str = 2 then do;
last_name = scan(names_only, 1, ' ');
_FirstBlank = find(names_only, ' ');
first_name = strip(substr(names_only, _FirstBlank));
end;
run;
Assumption: there are only 3 strings.
approach 1. Middle name has dot in it. Filter it out.
approach 2. Middle name is shorter than real name:
data names3;
set names2;
if Nr_of_str > 2 then do;
last_name = scan(names_only, 1, ' '); /*this should still hold*/
_FirstBlank = find(names_only, ' '); /*Substring approach */
first_name = strip(substr(names_only, _FirstBlank));
second_str = scan(names_only, 2, ' ');
third_str = scan(names_only, 3, ' ');
if find(second_str,'.') = 0 then /*1st approach */
first_name = scan(names_only, 2, ' ');
else
first_name = scan(names_only, 3, ' ');
if length(second_str) > length(third_str) then /*2nd approach */
first_name = second_str;
else
first_name = third_str;
end;
run;
For more, see the documentation on substr and scan.
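As a rough cross-check of the logic, here is a Python sketch of approach 1 (treat any token containing a dot as a middle initial). It splits on the comma rather than on the first blank, which is an assumption of this sketch and not something the SAS code does; `split_name` is a hypothetical helper.

```python
def split_name(raw):
    # Keep only the text before the first "[" (the name part).
    names_only = raw.split("[", 1)[0].strip()
    # Assumption: the last name is everything before the first comma.
    last, _, rest = names_only.partition(",")
    # Drop tokens that look like initials (they contain a dot).
    given = [token for token in rest.split() if "." not in token]
    first = given[0] if given else ""
    return first, last.strip()

for row in [
    "Norton, Eleanor Holmes [D-DC] 16 0 440 288 0",
    "McGovern, James P. [D-MA] 8 1 252 139 0",
    "Khanna, Ro [D-CA] 4 0 223 150 0",
]:
    print(split_name(row))
```

Like the SAS version, this takes the first non-initial word as the first name, so "Eleanor Holmes" yields "Eleanor".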
I have a data frame like the following:
pop state value1 value2
0 1.8 Ohio 2000001 2100345
1 1.9 Ohio 2001001 1000524
2 3.9 Nevada 2002100 1000242
3 2.9 Nevada 2001003 1234567
4 2.0 Nevada 2002004 1420000
And I have an ordered dictionary like the following:
OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(1, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
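I want to transform the data frame as the OrderedDict specifies: each list gives the start and end digit positions (1-based, inclusive) of the new column within the original value.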
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 0 1 2 1003 45
1 1.9 Ohio 20 1 1 1 5 24
2 3.9 Nevada 20 2 100 1 2 42
3 2.9 Nevada 20 1 3 1 2345 67
4 2.0 Nevada 20 2 4 1 4200 0
I think this is fairly complex logic for Python pandas. How can I solve it? Thanks.
First, your OrderedDict uses the same key twice, so the second entry overwrites the first; you need to use different keys.
d= OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(2, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
Now, for your actual problem, you can iterate through d to get the items, and use the apply function on the DataFrame to get what you need.
for k, v in d.items():
    for k1, v1 in v.items():
        if k == 1:
            df[k1] = df.value1.apply(lambda x: int(str(x)[v1[0]-1:v1[1]]))
        else:
            df[k1] = df.value2.apply(lambda x: int(str(x)[v1[0]-1:v1[1]]))
Now, df is
pop state value1 value2 value1_1 value1_2 value1_3 value2_1 \
0 1.8 Ohio 2000001 2100345 20 0 1 2
1 1.9 Ohio 2001001 1000524 20 1 1 1
2 3.9 Nevada 2002100 1000242 20 2 100 1
3 2.9 Nevada 2001003 1234567 20 1 3 1
4 2.0 Nevada 2002004 1420000 20 2 4 1
value2_2 value2_3
0 1003 45
1 5 24
2 2 42
3 2345 67
4 4200 0
I think this would point you in the right direction.
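To see the slicing step in isolation, here is a small sketch without pandas. It assumes, as the answer above does, that each `[start, end]` pair is 1-based and inclusive, so it maps to the Python slice `[start-1:end]`. `split_digits` is a hypothetical helper name.

```python
from collections import OrderedDict

spec = OrderedDict([
    ("value1_1", [1, 2]),
    ("value1_2", [3, 4]),
    ("value1_3", [5, 7]),
])

def split_digits(value, spec):
    # Convert to text, then cut each 1-based inclusive range out of it.
    text = str(value)
    return {name: int(text[start - 1:end]) for name, (start, end) in spec.items()}

print(split_digits(2000001, spec))
print(split_digits(2002100, spec))
```

Converting each piece with `int` drops leading zeros ("00" becomes 0), which matches the integer columns shown above.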
Converting the value1 and value2 columns to string type:
df['value1'], df['value2'] = df['value1'].astype(str), df['value2'].astype(str)
dct_1 = OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])
dct_2 = OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])])
Converting Ordered Dictionary to a list of tuples:
dct_1_list, dct_2_list = list(dct_1.items()), list(dct_2.items())
Flattening a list of lists to a single list:
L1, L2 = sum(list(x[1] for x in dct_1_list), []), sum(list(x[1] for x in dct_2_list), [])
Subtracting the even slices of the list by 1 as the string indices start from 0 and not 1:
L1[::2], L2[::2] = np.array(L1[0::2]) - np.array([1]), np.array(L2[0::2]) - np.array([1])
Taking the appropriate slice positions and mapping those values to the newly created columns of the dataframe:
df['value1_1'],df['value1_2'],df['value1_3']= map(df['value1'].str.slice,L1[::2],L1[1::2])
df['value2_1'],df['value2_2'],df['value2_3']= map(df['value2'].str.slice,L2[::2],L2[1::2])
Dropping off unwanted columns:
df.drop(['value1', 'value2'], axis=1, inplace=True)
Final result:
print(df)
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 00 001 2 1003 45
1 1.9 Ohio 20 01 001 1 0005 24
2 3.9 Nevada 20 02 100 1 0002 42
3 2.9 Nevada 20 01 003 1 2345 67
4 2.0 Nevada 20 02 004 1 4200 00
I have a sequence of numbers from 1 000 000 to 9 999 999 (total: 9,000,000). I've generated them in Excel and I would like to match them against the following formats.
Last 6 digits in:
1. XXX XXX (For example, 000 000 or 111 111 or 222 222)
2. X00 000 (For example, 100 000 or 200 000 or 300 000)
3. XYY YYY (For example, 122 222 or 233 333 or 411 111)
4. XY0 000 (For example, 230 000 or 750 000 or 120 000)
5. XYZ ZZZ (For example, 231 111 or 232 222 or 233 333)
6. X00 Y00 (For example, 200 300 or 100 400 or 500 600)
7. XXX Y00 (For example, 333 300 or 666 600 or 777 700)
8. XXX YYY (For example, 111 333 or 222 555 or 555 666)
9. XX YY ZZ (For example, 11 22 33 or 22 33 44 or 44 55 66)
10. X0 Y0 Z0 (For example, 10 20 30 or 30 40 50 or 60 70 80)
Would it be possible to do with regex or vba in excel 2013?
Since I don't have knowledge in Excel, should I seek someone's help for a simple program for such matching?
You can use VBA, but I believe you will need to set up each classification separately, and also ensure that they are in an order so as to not overlap.
Here is a partial example, showing a few VBA techniques, which you should be able to extend. I only dealt with the rightmost 6 digits and initially constructed a string; and also put each digit into an array element to make the testing formulas simpler.
Option Explicit
Function Classify(N As Long) As String
    Dim I As Long
    Dim S(1 To 6) As String
    Dim sN As String

    ' Work on the rightmost six digits, zero-padded
    sN = Format(Right(N, 6), "000000")
    ' Put each digit into its own array element
    For I = 1 To 6
        S(I) = Mid(sN, I, 1)
    Next I

    ' All six digits identical (comparing Left(sN, 3) with Right(sN, 3)
    ' would also accept values such as 123123)
    If sN = String(6, S(1)) Then
        Classify = "XXX XXX"
    ElseIf S(1) <> 0 And Mid(sN, 2) = 0 Then
        Classify = "X00 000"
    ElseIf S(1) <> 0 And Mid(sN, 2) Like WorksheetFunction.Rept(S(2), 5) Then
        Classify = "XYY YYY"
    ElseIf S(1) <> 0 And S(2) <> 0 And S(1) <> S(2) And Mid(sN, 3) = 0 Then
        Classify = "XY0 000"
    elseif ...
    End If
End Function
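Since the question also asks about regex, here is a Python sketch of the first four formats using backreferences. The pattern list, its ordering, and the `classify` name are assumptions made for illustration; the remaining formats would be added to the list in the same way.

```python
import re

# Ordered (label, pattern) pairs: earlier entries win when several could match.
PATTERNS = [
    ("XXX XXX", r"(\d)\1{5}"),               # six identical digits
    ("X00 000", r"[1-9]0{5}"),               # non-zero digit, then five zeros
    ("XYY YYY", r"(\d)(?!\1)(\d)\2{4}"),     # one digit, then five of another
    ("XY0 000", r"([1-9])(?!\1)[1-9]0{4}"),  # two different non-zero digits, four zeros
]

def classify(n):
    # Only the last six digits are tested, zero-padded to width 6.
    last6 = f"{n % 1_000_000:06d}"
    for label, pattern in PATTERNS:
        if re.fullmatch(pattern, last6):
            return label
    return ""

for n in (1111111, 1200000, 1122222, 1230000, 1234567):
    print(n, classify(n) or "(no match)")
```

As with the VBA version, the order of the tests matters, because a value such as 100 000 would otherwise also fit the XYY YYY shape.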
I have a large data file and I want to extract two types of data based on two conditions. I wrote a Tcl script to extract the data using regex (I'm a newbie to regex).
I have used the following condition which works fine and produces part of the desired output:
if [regexp {\+ ([0-9.]+) 1 2.*- } $line -> time ] {
I'm using the variable time somewhere in the script. The above condition produces the following o/p(this is just a sample as the file is large):
+ 30.808352 1 2 tcp 40 ------- 30 6.7 2.30 81 2073
+ 30.808416 1 2 tcp 40 ------- 128 8.16 2.159 81 2069
+ 30.809513 1 2 tcp 40 ------- 156 12.19 2.187 1 2077
+ 30.809641 1 2 tcp 80 ------- 156 12.19 2.187 1 2078
+ 30.809878 1 2 tcp 40 ------- 151 7.18 2.182 41 2079
+ 30.813096 1 2 tcp 40 ------- 161 9.20 2.192 0 2083
+ 30.813352 1 2 tcp 40 ------- 157 13.19 2.188 1 2085
+ 30.81348 1 2 tcp 80 ------- 157 13.19 2.188 1 2086
+ 30.815362 1 2 tcp 40 ------- 148 12.18 2.179 41 2088
+ 30.815426 1 2 tcp 40 ------- 148 5.9 2.179 41 2089
+ 30.818096 1 2 tcp 40 ------- 162 10.20 2.193 0 2091
+ 30.818544 1 2 tcp 40 ------- 158 3.78 2.189 1 2093
+ 30.818672 1 2 tcp 80 ------- 158 14.19 2.189 1 2094
+ 30.820657 1 2 tcp 40 ------- 153 9.19 2.184 41 2096
+ 30.821579 1 2 tcp 40 ------- 154 10.19 2.185 41 2097
Then, inside the above if condition, I want to check the 9th column:
//condition 1
if (9th between [3-6].*) ( such as 3.78,6.7, 5.9)
The second condition is :
//condition 2
if (9th between [7-14].*) ( such as 14.19,12.18,10.19, 9.19,.....)
I'm struggling with the two conditions above. I tried the following; I didn't get an error, but nothing matched.
condition 1:
if [regexp {\+ ([0-9.]+) 1 2.*-* ([3-9])\..*/ } $line ] {
I know I'm repeating part of the main if condition, because I don't know how to skip the columns.
condition 2:
if [regexp {\+ ([0-9.]+) 1 2.*-* ([7-9]|1[0-4])\..*/} $line ] {
Any suggestions?
Why don't you split on space? You can achieve pretty much the same outcome using a few more lines. It will be more readable, and people will understand the code better:
if [regexp {\+ ([0-9.]+) 1 2.*- } $line -> time] {
    set elements [split $line " "] ;# You can actually omit the " " in this case
    set 9th [lindex $elements 8]
    # Condition 1
    if {$9th >= 3 && $9th < 7} { do something }
    # Condition 2
    if {$9th >= 7 && $9th < 15} { do something }
}
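The same split-and-compare idea can be sketched in Python. This assumes the columns are separated by single spaces as in the sample above and that the 9th column can be read as a number, which is also what the Tcl comparison does; `bucket` is a hypothetical helper name.

```python
def bucket(line):
    fields = line.split()
    # Keep only "+" events between nodes 1 and 2, as in the outer regexp.
    if len(fields) < 9 or fields[0] != "+" or fields[2:4] != ["1", "2"]:
        return None
    value = float(fields[8])  # 9th column, index 8
    if 3 <= value < 7:
        return "3-6"
    if 7 <= value < 15:
        return "7-14"
    return None

print(bucket("+ 30.808352 1 2 tcp 40 ------- 30 6.7 2.30 81 2073"))
print(bucket("+ 30.808416 1 2 tcp 40 ------- 128 8.16 2.159 81 2069"))
```

Comparing numbers avoids having to encode ranges such as 7 to 14 as digit patterns.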
To match 7-14: \+ ([0-9.]+) 1 2.*- \d+\s(?:[7-9]|1[0-4])\.
To match 3-6: \+ ([0-9.]+) 1 2.*- \d+\s[3-6]\.
The trailing \. keeps values such as 30.5 or 70.1 from matching on their first digit.
I have the following string:
Giants 2 9 : 10 L.Tynes 22 yd . Field Goal ( 4 - - 3 , 1 : 20 ) 0 3 Cowboys 2 1 : 01 K.Ogletree 10 yd . pass from T.Romo ( D.Bailey kick ) ( 7 - 73 , 2 : 33 ) 7 3 Cowboys 3 10 : 24 K.Ogletree 40 yd . pass from T.Romo ( D.Bailey kick ) ( 9 - 80 , 4 : 36 ) 14 3 Giants 3 5 : 11 A.Bradshaw 10 yd . run ( L.Tynes kick ) ( 9 - 89 , 5 : 13 ) 14 10 Cowboys 3 0 : 40 D.Bailey 33 yd . Field Goal ( 8 - 65 , 4 : 31 ) 17 10 Cowboys 4 5 : 57 M.Austin 34 yd . pass from T.Romo ( D.Bailey kick ) ( 8 - 82 , 7 : 06 ) 24 10 Giants 4 2 : 36 M.Bennett 9 yd . pass from E.Manning ( L.Tynes kick ) ( 12 - 79 , 3 : 21 ) 24 17 Time : 2 : 53
I want to split it into substrings. The prefix of each substring will be either "Cowboys" or "Giants". Each substring always ends with a right parenthesis ) and two numbers.
I can't even imagine what Regex to use. I can use string functions and loop over the string, but a Regex would help me later on. Maybe I could use the split function, but that's over my head.
I suppose I could parse "Cowboys" then "Giants".
I think this RegEx gives what you want:
(Cowboys|Giants).*?\)\s\d+\s\d+
"Cowboys" or "Giants" followed by arbitrary characters until you get a right paren, a space, some digits, a space, and some more digits.
I don't know ColdFusion, but this does the job in python:
match = re.findall(re.compile('((Giants|Cowboys)(.(?!Cowboys|Giants))*.)', re.DOTALL), s)
where s is the provided string. re.DOTALL makes . match newline characters as well. re.findall performs a global search, which reFindAll probably does as well.
The regex does this:
Create a spanning group
Look for "Giants" or "Cowboys" as the starting string
Look for any character (.) that is not followed by the string "Cowboys" or "Giants", and match as many as possible (that is, match all characters up to the one followed by "Cowboys" or "Giants").
Match another character.
Since there are three groups, the group you're interested in might be numbered differently in ColdFusion. In Python, they're embedded in the parent group.
>>> match[0]
('Giants 2 9 : 10 L.Tynes 22 yd . Field Goal ( 4 - - 3 , 1 : 20 ) 0 3', 'Giants', '3')
>>> match[1]
('Cowboys 2 1 : 01 K.Ogletree 10 yd . pass from T.Romo ( D.Bailey kick ) ( 7 - 73 , 2 : 33 ) 7 3', 'Cowboys', '3')
>>> match[2]
('Cowboys 3 10 : 24 K.Ogletree 40 yd . pass from T.Romo ( D.Bailey kick ) ( 9 - 80 , 4 : 36 ) 14 3', 'Cowboys', '3')
I think in most other languages you would address match[1], match[4], match[7], ... instead.
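For completeness, the simpler lazy pattern from the first answer, (Cowboys|Giants).*?\)\s\d+\s\d+, can be tried in Python as well. This sketch uses a shortened sample containing only the first two scoring plays from the question, and reads the whole match from each match object so the capture group does not get in the way.

```python
import re

# Shortened sample: the first two scoring plays from the string in the question.
s = ("Giants 2 9 : 10 L.Tynes 22 yd . Field Goal ( 4 - - 3 , 1 : 20 ) 0 3 "
     "Cowboys 2 1 : 01 K.Ogletree 10 yd . pass from T.Romo ( D.Bailey kick ) "
     "( 7 - 73 , 2 : 33 ) 7 3 Time : 2 : 53")

pattern = re.compile(r"(Cowboys|Giants).*?\)\s\d+\s\d+")

# group(0) is the full substring; group(1) is the team name.
plays = [m.group(0) for m in pattern.finditer(s)]
for play in plays:
    print(play)
```

The lazy .*? stops at the first right parenthesis that is followed by two numbers, so the parenthesis after "D.Bailey kick" is skipped.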