Matching sequence of numbers with pattern in excel 2013 - regex

I have a sequence of numbers from 1 000 000 to 9 999 999 (Total: 9,000,000). I've generated them in Excel and I would like to match them against the following formats
Last 6 digits in:
1. XXX XXX (For example, 000 000 or 111 111 or 222 222)
2. X00 000 (For example, 100 000 or 200 000 or 300 000)
3. XYY YYY (For example, 122 222 or 233 333 or 411 111)
4. XY0 000 (For example, 230 000 or 750 000 or 120 000)
5. XYZ ZZZ (For example, 231 111 or 232 222 or 233 333)
6. X00 Y00 (For example, 200 300 or 100 400 or 500 600)
7. XXX Y00 (For example, 333 300 or 666 600 or 777 700)
8. XXX YYY (For example, 111 333 or 222 555 or 555 666)
9. XX YY ZZ (For example, 11 22 33 or 22 33 44 or 44 55 66)
10. X0 Y0 Z0 (For example, 10 20 30 or 30 40 50 or 60 70 80)
Would it be possible to do this with regex or VBA in Excel 2013?
Since I don't have much knowledge of Excel, should I ask someone to write a simple program for this kind of matching?

You can use VBA, but I believe you will need to set up each classification separately, and test them in an order that keeps them from overlapping.
Here is a partial example showing a few VBA techniques, which you should be able to extend. It deals only with the rightmost 6 digits: it first builds a six-character string, then puts each digit into an array element to make the tests simpler.
Option Explicit

Function Classify(N As Long) As String
    Dim I As Long
    Dim S(1 To 6) As String
    Dim sN As String

    ' Rightmost 6 digits as a zero-padded string
    sN = Format(Right(N, 6), "000000")

    ' One digit per array element
    For I = 1 To 6
        S(I) = Mid(sN, I, 1)
    Next I

    ' All six digits equal (comparing Left 3 with Right 3 would also accept 123 123)
    If sN = String(6, S(1)) Then
        Classify = "XXX XXX"
    ElseIf S(1) <> 0 And Mid(sN, 2) = 0 Then
        Classify = "X00 000"
    ElseIf S(1) <> 0 And Mid(sN, 2) Like WorksheetFunction.Rept(S(2), 5) Then
        Classify = "XYY YYY"
    ElseIf S(1) <> 0 And S(2) <> 0 And S(1) <> S(2) And Mid(sN, 3) = 0 Then
        Classify = "XY0 000"
    ' ElseIf ... (the remaining patterns go here)
    End If
End Function
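For comparison, here is a minimal Python sketch of the regex route the question asks about. It is not the VBA above translated line by line: it builds one regular expression per template using named backreferences and returns the first template that matches, in the order listed in the question. The helper names (`template_to_regex`, `classify`) are made up for illustration. It assumes that different letters may stand for the same digit (the question's own example 333 300 for XXX Y00 suggests this), so the ordering alone decides overlaps, and that a letter may also be 0.

```python
import re

# Templates in the question's order; a letter stands for a digit, "0" is a literal zero.
TEMPLATES = [
    "XXX XXX", "X00 000", "XYY YYY", "XY0 000", "XYZ ZZZ",
    "X00 Y00", "XXX Y00", "XXX YYY", "XX YY ZZ", "X0 Y0 Z0",
]


def template_to_regex(template):
    """Build a regex for one template using named groups and backreferences."""
    seen = set()
    parts = []
    for ch in template.replace(" ", ""):
        if ch == "0":
            parts.append("0")                  # literal zero
        elif ch in seen:
            parts.append(f"(?P={ch})")         # same digit as the earlier letter
        else:
            seen.add(ch)
            parts.append(f"(?P<{ch}>[0-9])")   # first use: capture any digit
    return re.compile("".join(parts))


COMPILED = [(t, template_to_regex(t)) for t in TEMPLATES]


def classify(n):
    """Return the first template matching the last 6 digits of n, or ''."""
    last6 = f"{n % 1_000_000:06d}"
    for template, rx in COMPILED:
        if rx.fullmatch(last6):
            return template
    return ""


print(classify(1333300))  # XXX Y00
print(classify(9230000))  # XY0 000
print(classify(1234567))  # (empty string: no template matches)
```

Because the first match wins, a number such as 1 111 111 is reported as XXX XXX even though later templates would also accept it.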

Match values within 3 different tables in PowerBI

I have 3 tables in PowerBI
Table 1
Num
111
222
333
Table 2
Number Code
111 aa
333 cc
222 bb
444 ff
666 gg
These 2 tables are connected by the Number column,
which means the connected values look like this:
Number Code
111 aa
222 bb
333 cc
Now in my Table 3 I have the following:
Table 3
Number Code
111 aa
222 bc
222 bb
444 ff
666 gg
Now what I would like to do is compare the Code when the Number matches. That means the output should look like this:
Number Code Result
111 aa Y
222 bc N
222 bb N
444 ff N
666 gg N
Does anyone know a solution to this?
I'm not sure how Table 1 is relevant but it seems like you can just do a lookup and check if it matches.
Result =
IF (
    LOOKUPVALUE (
        Table2[Code],
        Table2[Number], Table3[Number]
    ) = Table3[Code],
    "Y",
    "N"
)
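To make the lookup-and-compare logic easy to check by hand, here is a small Python sketch of the same idea using the sample rows from the question. The function name `compare_codes` is made up for illustration, and a plain dictionary stands in for Table 2; it is not how Power BI evaluates the formula internally.

```python
# Sample data copied from the question.
table2 = {111: "aa", 333: "cc", 222: "bb", 444: "ff", 666: "gg"}
table3 = [(111, "aa"), (222, "bc"), (222, "bb"), (444, "ff"), (666, "gg")]


def compare_codes(lookup, rows):
    """Flag each (number, code) row with 'Y' when lookup[number] equals code."""
    return [
        (number, code, "Y" if lookup.get(number) == code else "N")
        for number, code in rows
    ]


for row in compare_codes(table2, table3):
    print(row)
```

Like the formula, this never consults Table 1, so every row whose code agrees with Table 2 is flagged Y. That includes 222 bb, 444 ff and 666 gg, which the desired output in the question marks as N.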

Divide full name into first and last when middle name is present

I wanted to see if this was doable in SAS. I have a dataset of the members of congress and want to split full name into first and last. However, occasionally they seem to list their middle initial or name. It is from a .txt file.
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
Good day,
SAS is a bit clunky when it comes to strings, but it can be done. As others have mentioned, defining the logic is the really hard part.
Begin with some raw data...
data begin;
input raw_str $ 1-100;
cards;
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
; run;
First I select the leading names up to the first bracket,
then count the number of strings:
data names;
    set begin;
    names_only = scan(raw_str,1,'[');
    Nr_of_str = countw(names_only,' ');
run;
Assumption: The first string is the last name.
If there are only 2 strings, the first and last names are pretty easy to get with scan and substring:
data names2;
    set names;
    if Nr_of_str = 2 then do;
        last_name = scan(names_only, 1, ','); /* split on the comma so it is not kept */
        _FirstBlank = find(names_only, ' ');
        first_name = strip(substr(names_only, _FirstBlank));
    end;
run;
Assumption: otherwise there are exactly 3 strings.
Approach 1: the middle name has a dot in it. Filter it out.
Approach 2: the middle name is shorter than the first name:
data names3;
    set names2;
    if Nr_of_str > 2 then do;
        last_name = scan(names_only, 1, ','); /* this should still hold */
        _FirstBlank = find(names_only, ' '); /* substring approach */
        first_name = strip(substr(names_only, _FirstBlank));
        second_str = scan(names_only, 2, ' ');
        third_str = scan(names_only, 3, ' ');
        if find(second_str, '.') = 0 then /* 1st approach */
            first_name = scan(names_only, 2, ' ');
        else
            first_name = scan(names_only, 3, ' ');
        /* 2nd approach: overwrites the result of the 1st, so keep only one of them */
        if length(second_str) > length(third_str) then
            first_name = second_str;
        else
            first_name = third_str;
    end;
run;
For more, see the SAS documentation on substr and scan.
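To make the parsing rule easier to test outside SAS, here is a short Python sketch of a slightly different rule: split on the comma rather than counting words. It assumes every line has the shape "Last, First [Middle] [Party-State] ...", as in the sample data; the function name `split_name` is made up for illustration.

```python
def split_name(line):
    """Return (last, first, middle) from a 'Last, First Middle [D-XX] ...' line."""
    # Keep only the text before the first bracket, as in the SAS step above.
    names_only = line.split("[", 1)[0].strip()
    # Everything before the comma is the last name.
    last, _, given = names_only.partition(",")
    tokens = given.split()
    first = tokens[0] if tokens else ""
    # Whatever follows the first given name is treated as middle name or initial.
    middle = " ".join(tokens[1:])
    return last.strip(), first, middle


print(split_name("Norton, Eleanor Holmes [D-DC] 16 0 440 288 0"))
# ('Norton', 'Eleanor', 'Holmes')
print(split_name("Schakowsky, Janice D. [D-IL] 6 0 289 186 0"))
# ('Schakowsky', 'Janice', 'D.')
```

Taking the first token after the comma as the first name avoids both the dot test and the length test, but it would mislabel a two-word first name as first plus middle.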

Delete or remove unexpected records and strings based on multiple criteria by python or R script

I have a .csv file named fileOne.csv that contains many unnecessary strings and records. I want to delete the unnecessary records / rows and strings based on multiple conditions / criteria using a Python or R script and save the records into a new .csv file named resultFile.csv.
What I want to do is as follows:
Delete the first column.
Split column BB into two columns named a_id and c_id. Separate the value by _ (underscore): the left side goes to a_id, and the right side goes to c_id.
Keep only records that have the .csv file extension in the files column, but do not contain No Bi in the cut column.
Assign a new name to each of the columns.
Delete the records that contain strings like less in the CC column.
Trim all other unnecessary strings from the records.
Delete the remaining fields of each row after I find "Mi" in that row.
My fileOne.csv is as follows:
AA BB CC DD EE FF GG
1 1_1.csv (=0 =10" 27" =57 "Mi"
0.97 0.9 0.8 NaN 0.9 od 0.2
2 1_3.csv (=0 =10" 27" "Mi" 0.5
0.97 0.5 0.8 NaN 0.9 od 0.4
3 1_6.csv (=0 =10" "Mi" =53 cnt
0.97 0.9 0.8 NaN 0.9 od 0.6
4 2_6.csv No Bi 000 000 000 000
5 2_8.csv No Bi 000 000 000 000
6 6_9.csv less 000 000 000 000
7 7_9.csv s(=0 =26" =46" "Mi" 121
My first expected result file would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57 Mi
1 3 0 10 27 Mi 0.5
1 6 0 10 Mi 53 cnt
7 9 0 26 46 Mi 121
My final expected result file would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57
1 3 0 10 27
1 6 0 10
7 9 0 26 46
This can be achieved with the following Python script:
import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']
sanitise_table = string.maketrans("","")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table) # Keep digits

with open('fileOne.csv') as f_input, open('resultFile.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    input_header = next(f_input)
    csv_output.writerow(output_header)
    for row in csv_input:
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])
        if bb and row[2] not in ['No Bi', 'less']:
            # Remove all columns after 'Mi' if present
            try:
                mi = row.index('Mi')
                row[:] = row[:mi] + [''] * (len(row) - mi)
            except ValueError:
                pass
            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)
To simply remove Mi columns from an existing file the following can be used:
import csv

with open('input.csv') as f_input, open('output.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    for row in csv_input:
        try:
            mi = row.index('Mi')
            row[:] = row[:mi] + [''] * (len(row) - mi)
        except ValueError:
            pass
        csv_output.writerow(row)
Tested using Python 2.7.9
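The scripts above rely on Python 2 idioms (`string.maketrans` and opening the CSV output with `'wb'`), which do not work unchanged on Python 3. Below is a minimal Python 3 sketch of the same steps. The names `clean_rows` and `convert` are made up for illustration, and, like the original, it assumes a comma-separated file in which `Mi` appears as its own cell.

```python
import csv
import re

OUTPUT_HEADER = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']


def clean_rows(rows):
    """Yield cleaned rows, following the same steps as the Python 2 script."""
    for row in rows:
        if len(row) < 3:
            continue  # too short to hold the file name and CC columns
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])
        if not bb or row[2] in ('No Bi', 'less'):
            continue
        # Blank out 'Mi' and every column after it, if present
        if 'Mi' in row:
            mi = row.index('Mi')
            row = row[:mi] + [''] * (len(row) - mi)
        # Keep only the digits of each cell
        cleaned = [re.sub(r'[^0-9]', '', cell) for cell in row]
        cleaned[0], cleaned[1] = bb.group(1), bb.group(2)
        yield cleaned


def convert(f_input, f_output):
    """Read CSV text from f_input and write the cleaned CSV to f_output."""
    reader = csv.reader(f_input)
    writer = csv.writer(f_output)
    next(reader)  # skip the input header
    writer.writerow(OUTPUT_HEADER)
    writer.writerows(clean_rows(reader))


# Usage (file names taken from the question):
# with open('fileOne.csv', newline='') as f_in, \
#         open('resultFile.csv', 'w', newline='') as f_out:
#     convert(f_in, f_out)
```

As in the original, stripping everything but digits turns a value such as 0.5 into 05; if decimals should survive, the character class would need to allow the dot as well.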

R remove only "[" "]" from string

I have something like this:
test[1]
"[0 30.5 4.5 10.5 2 35 22.999999999999996 29 5.500000000000001 23.5 18 23.5 44.5 3 44.5 44.00000000000001 43 27 42 35.5 19.5 44.00000000000001 1 0 31 34 18 1.5 26 6 45.99999999999999 10.5 9.5 24 20 42.5 14.5 45.5 20.499999999999996 150 45.5 0 4.5 22.5 4 9 8 0 0 15.5 30.5 7 5.500000000000001 12.5 33.5 15 500 22.5 18 43 4.5 26 23.5 16 4.5 7.5 32 0 0 18.5 33 31 14.5 21.5 0 40 0 0 43.49999999999999 22.999999999999996]"
And I would like to remove [ and ] (first and last characters) of each line (test[1] test[2] ...) but keep points (22.9999).
I have tried some stringr functions, but I'm not so good with regex ...
Can you help me?
There's no need for packages for this. Just use something like the following:
gsub("\\[|\\]", "", test)
This basically says: "Look in test for "[" or (|) "]", and if you find it, replace it with nothing ("")."
Since [ and ] are special characters in regular expressions, they would need to be escaped.
If you're just removing the first and last character, you can also probably do something like:
substring(test, 2, nchar(test)-1)
This basically says, "Extract the part of the string starting from the second position and ending in the second-to-last position."
One easy way to remove [ and ] from a string is
x <- "[12345]"
gsub("[][]", "", x)
# [1] "12345"
Here, the outer [] means one of the characters in the brackets. The inner ][ represent the to-be-replaced characters.
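If the data ever needs the same treatment outside R, the character-class idea carries over directly. Here is a small Python sketch; the function names `strip_brackets` and `parse_numbers` are made up for illustration, and the second one goes one step further than the question by turning the cleaned string into numbers.

```python
import re


def strip_brackets(s):
    """Remove every '[' and ']' from s, leaving dots and digits untouched."""
    return re.sub(r"[\[\]]", "", s)


def parse_numbers(s):
    """Split the bracket-free string on whitespace and convert to floats."""
    return [float(tok) for tok in strip_brackets(s).split()]


print(strip_brackets("[12345]"))        # 12345
print(parse_numbers("[0 30.5 4.5]"))    # [0.0, 30.5, 4.5]
```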

tcl:loop and extract with Regex

I have large data and I want to extract two types of data based on two conditions. I wrote a tcl script to extract the data by using regex (newbie to regex).
I have used the following condition which works fine and produces part of the desired output:
if [regexp {\+ ([0-9.]+) 1 2.*- } $line -> time ] {
I'm using the variable time elsewhere in the script. The above condition produces the following output (this is just a sample, as the file is large):
+ 30.808352 1 2 tcp 40 ------- 30 6.7 2.30 81 2073
+ 30.808416 1 2 tcp 40 ------- 128 8.16 2.159 81 2069
+ 30.809513 1 2 tcp 40 ------- 156 12.19 2.187 1 2077
+ 30.809641 1 2 tcp 80 ------- 156 12.19 2.187 1 2078
+ 30.809878 1 2 tcp 40 ------- 151 7.18 2.182 41 2079
+ 30.813096 1 2 tcp 40 ------- 161 9.20 2.192 0 2083
+ 30.813352 1 2 tcp 40 ------- 157 13.19 2.188 1 2085
+ 30.81348 1 2 tcp 80 ------- 157 13.19 2.188 1 2086
+ 30.815362 1 2 tcp 40 ------- 148 12.18 2.179 41 2088
+ 30.815426 1 2 tcp 40 ------- 148 5.9 2.179 41 2089
+ 30.818096 1 2 tcp 40 ------- 162 10.20 2.193 0 2091
+ 30.818544 1 2 tcp 40 ------- 158 3.78 2.189 1 2093
+ 30.818672 1 2 tcp 80 ------- 158 14.19 2.189 1 2094
+ 30.820657 1 2 tcp 40 ------- 153 9.19 2.184 41 2096
+ 30.821579 1 2 tcp 40 ------- 154 10.19 2.185 41 2097
Then, inside the above if condition, I want to check the 9th column:
//condition 1
if (9th between [3-6].*) ( such as 3.78,6.7, 5.9)
The second condition is :
//condition 2
if (9th between [7-14].*) ( such as 14.19,12.18,10.19, 9.19,.....)
I'm struggling with the two conditions above. I tried the following; I didn't get an error, but nothing matched.
condition 1:
if [regexp {\+ ([0-9.]+) 1 2.*-* ([3-9])\..*/ } $line ] {
I know I'm repeating part of the main if condition, because I don't know how to skip the columns.
condition 2:
if [regexp {\+ ([0-9.]+) 1 2.*-* ([7-9]|1[0-4])\..*/} $line ] {
Any suggestions?
Why don't you split on space? You can achieve pretty much the same outcome with a few more lines. It will be more readable, and people will understand the code better:
if [regexp {\+ ([0-9.]+) 1 2.*- } $line -> time] {
    set elements [split $line " "] ;# You can actually omit the " " in this case
    set 9th [lindex $elements 8]
    # Condition 1
    if {$9th >= 3 && $9th < 7} { do something }
    # Condition 2
    if {$9th >= 7 && $9th < 15} { do something }
}
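The same split-then-compare idea can be sketched in Python to check the column arithmetic against the sample lines. The function name `classify_line` and the bucket labels are made up for illustration; the regular expression is the one from the question, and the 9th column is index 8 after splitting on whitespace.

```python
import re

# Same pattern as the Tcl condition; group 1 is the time.
LINE_RE = re.compile(r"\+ ([0-9.]+) 1 2.*- ")


def classify_line(line):
    """Return (time, bucket) for lines matching the pattern, else None."""
    m = LINE_RE.search(line)
    if not m:
        return None
    fields = line.split()
    value = float(fields[8])      # 9th column, e.g. 6.7 or 12.19
    if 3 <= value < 7:
        bucket = "3-6"            # condition 1
    elif 7 <= value < 15:
        bucket = "7-14"           # condition 2
    else:
        bucket = None
    return m.group(1), bucket


print(classify_line("+ 30.808352 1 2 tcp 40 ------- 30 6.7 2.30 81 2073"))
# ('30.808352', '3-6')
print(classify_line("+ 30.809513 1 2 tcp 40 ------- 156 12.19 2.187 1 2077"))
# ('30.809513', '7-14')
```

Comparing the column as a number sidesteps the difficulty of expressing a range such as 7 to 14 as a character class.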
Match 7-14: \+ ([0-9.]+) 1 2.*- \d+\s(?:[7-9]|1[0-4])
Match 3-6: \+ ([0-9.]+) 1 2.*- \d+\s[3-6]