Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any run of digits that is not attached to a letter or to another non-whitespace character. For example, if the string contained t370 or 6-test, I would want those to remain. Only numbers that stand alone, such as several numbers next to each other, should be removed.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to match numbers between word boundaries, such as regexr(string, "\b[0-9]+\b", ""), as well as manually adding the whitespace, " [0-9]+", which only replaces the pattern in the middle of a string, not at the start. If it's easier to do this without regular expressions, that's fine; I was just trying to become more familiar with them.

Following up on the loop suggestion from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
* add in parts that contain at least one letter
replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
     +----------------------------------------+
     | id                 string      string2 |
     |----------------------------------------|
  1. |  1   9884 7-test 58 - 489       7-test |
  2. |  2         67-tty 783 444       67-tty |
  3. |  3             j3782 3hty   j3782 3hty |
     +----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
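For comparison, the same rule can be sketched outside Stata. This is a minimal Python sketch rather than Stata code; the function name keep_words_with_letters is made up for illustration, and it assumes, like the answer above, that a word is kept whenever it contains at least one ASCII letter.

```python
import re

def keep_words_with_letters(text):
    """Keep only the whitespace-separated words that contain a letter."""
    words = text.split()  # split on any run of whitespace
    kept = [w for w in words if re.search(r"[a-zA-Z]", w)]
    return " ".join(kept)

for s in ["9884 7-test 58 - 489", "67-tty 783 444", "j3782 3hty"]:
    print(repr(keep_words_with_letters(s)))
```

As in the Stata loop, changing the character class in re.search is the place to adjust which words count as worth keeping.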

Related

Regex extract string based on String match

I have this data with some messy addresses, each of which contains a province, district, and ward, not always in the same order:
Name ADDRESS
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY
Store3 98 Phan Xich Long- P. 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5
Store5 22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
//Replace each \ with \\ so that C# doesn't treat \ as escape character
//Pattern: Start of string, any integers, 0 or 1 letter, end of word
string sPattern = "^[0-9]+([A-Za-z]\\b)?";
string sString = Row.ADDRESS ?? ""; //Coalesce to empty string if NULL
//Find any matches of the pattern in the string
Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
//If a match is found
if (match.Success)
//Return the first match into the new
//HouseNumber field
Row.ward= match.Groups[0].Value;
else
//If not found, leave the HouseNumber blank
Row.ward= "";
}
}
I would like to modify my regex to return the data like this in the column ward (you can see the synonyms in my addresses: Phuong, P., Ward, etc.).
Name ADDRESS ward
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Quang Ngai Phuong Nguyen Nghiem
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY Phuong THANH KHE TAY
Store3 98 Phan Xich Long- P. 2 Phuong 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5 Phuong 5
Store5 22, Ngo 421/16,--. To 42, Phuong Trung Hoa, Quan Cau Giay Phuong Trung Hoa
I use that regex to extract the civic number, but is there a way I can modify it to return the data in my column ward like in the example above?
The groups in this regex, as tested in https://regex101.com/, match the data in your column ward, as in your example. It only matches the patterns as they appear in your example data, so you may need to define more precisely where each one can appear. Still, it may be enough for you to extrapolate and get the regex that you really need.
(Phuong.*),|P\.(.*$)|Ward - (.*$)
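The group in option 1 matches from Phuong (inclusive) up to a comma. Because .* is greedy, that is the last comma in the string; in your sample data only one comma follows Phuong, so the result is the same, but Phuong.*? would stop at the first comma.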
The group in option 2 matches anything that comes after P. until the end of the string.
The group in option 3 matches anything that comes after Ward - until the end of the string.
This one is a bit more advanced, but it only matches what you mentioned in your examples, no groups:
Phuong.*(?=,)|(?<=P\.).*$|(?<=Ward - ).*$
Test it in https://regex101.com to see how it works and to see what each part means.
Finally, you may want to exclude Phuong from the match in option 1 so that your program can always print Phuong followed by the match.
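The idea of matching only the ward name and always printing Phuong in front of it can be sketched as follows. This is a Python sketch, not the C# script from the question; the pattern and the name extract_ward are assumptions for illustration, and the pattern only covers the three markers seen in the sample data (Phuong, P. and Ward -).

```python
import re

# Hypothetical pattern: a ward marker, then the ward name up to the next
# comma or the end of the string.
WARD_RE = re.compile(r"\b(?:Phuong|P\.|Ward\s*-)\s*([^,]*)", re.IGNORECASE)

def extract_ward(address):
    """Return 'Phuong <name>' for the first ward marker found, else ''."""
    match = WARD_RE.search(address or "")  # treat None like an empty string
    if not match:
        return ""
    name = match.group(1).strip()
    return f"Phuong {name}" if name else ""

print(extract_ward("98 Phan Xich Long- P. 2"))
```

Real addresses will likely contain markers or abbreviations that this pattern does not know about, so it should be checked against more data before use.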

Extract letters and numbers from string

I have the following strings:
KZ1,345,769.1
PKS948,123.9
XG829,823.5
324JKL,282.7
456MJB87,006.01
How can I separate the letters and numbers?
This is the outcome I expect:
KZ 1345769.1
PKS 948123.9
XG 829823.5
JKL 324282.7
MJB 45687006
I have tried using the split command for this purpose but without success.
@Pearly Spencer's answer is surely preferable, but the following kind of naive looping should occur to any programmer. Look at each character in turn and decide whether it is a letter; or a number or decimal point; or something else (implicitly) and build up answers that way. Note that although we loop over the length of the string, looping over observations too is tacit.
clear
input str42 whatever
"KZ1,345,769.1"
"PKS948,123.9"
"XG829,823.5"
"324JKL,282.7"
"456MJB87,006.01"
end
compress
local length = substr("`: type whatever'", 4, .)
gen letters = ""
gen numbers = ""
quietly forval j = 1/`length' {
local arg substr(whatever,`j', 1)
replace letters = letters + `arg' if inrange(`arg', "A", "Z")
replace numbers = numbers + `arg' if `arg' == "." | inrange(`arg', "0", "9")
}
list
     +-----------------------------------------+
     |        whatever   letters       numbers |
     |-----------------------------------------|
  1. |   KZ1,345,769.1        KZ     1345769.1 |
  2. |    PKS948,123.9       PKS      948123.9 |
  3. |     XG829,823.5        XG      829823.5 |
  4. |    324JKL,282.7       JKL      324282.7 |
  5. | 456MJB87,006.01       MJB   45687006.01 |
     +-----------------------------------------+
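The character-by-character classification above amounts to deleting everything that is not in the wanted class. A minimal Python sketch of that idea, not Stata code, with the made-up name split_letters_numbers, and assuming as the loop does that letters are uppercase A to Z and that only digits and the decimal point belong to the number:

```python
import re

def split_letters_numbers(text):
    """Return (letters, numbers) by stripping the unwanted characters."""
    letters = re.sub(r"[^A-Z]", "", text)    # drop everything but A-Z
    numbers = re.sub(r"[^0-9.]", "", text)   # drop everything but digits and "."
    return letters, numbers

for s in ["KZ1,345,769.1", "324JKL,282.7", "456MJB87,006.01"]:
    print(split_letters_numbers(s))
```

Like the loop, this keeps the decimal part of the last value (45687006.01), which differs from the 45687006 shown in the question's expected outcome.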

Extracting specific words from a single cell containing text string

Basically I have a very long text containing multiple spaces, special characters, etc. in one cell in an excel file and I need to extract only specific words from it, each one to a separate cell in another column.
What I'm looking for:
symbols that are always 9 characters in length, and always contain at least one number (up to 9).
So for an example in A1 I have:
euhe: djj33 dkdakofja. kaowdk ---------- jffjbrjjjj j jrjj 08/01/2222 999ABC123
fjfjfj 321XXX888 .... ........ 123456789AA
And in the end I want to have:
999ABC123 in B1
and
321XXX888 in B2.
Right now I'm doing this by using Text to columns feature and then just looking for specific words manually but sometimes the volume is so big it takes too much time and would be cool to automate this.
Can anyone help with this? Thank you!
EDIT:
More examples:
INPUT: '10/01/2016 1,060X 8.999%!!! 1.33 0.666 928888XE0'
OUTPUT: '928888XE0'
INPUT: 'ABCDEBATX ..... ,,00,001% 20///^^ addcA7 7777a 123456789 djaoij8888888 0.000001 12#'
OUTPUT: '123456789'
INPUT: 'FAR687465 B22222222 __ djj^66 20/20/20/20 1:'
OUTPUT: 'FAR687465' in B1 'B22222222' in B2
INPUT: 'fil476 .00 20/.. BUT AAAAAAAAA k98776 000.0001'
OUTPUT: 'blank'
To clarify: the 9 character string can be anywhere, there is no rule what is before or after them, they can be next to each other, or just at the beginning and end of this wall of text, no rules here, the text is random, taken out of some system, can contain dates, etc anything... The symbols are always 9 characters long and they are not the only 9 character symbols in the text. I call them symbols but they should only consist of numbers and letters. Can be only numbers, but never only letters. A1 cell can contain multiple spaces/tabs between words/symbols.
Also if possible to do this not only for A1, but the whole column A until it finds the first blank cell.
Try this code
Sub Test()
Dim r As Range
Dim i As Long
Dim m As Long
With CreateObject("VBScript.RegExp")
.Global = True
.Pattern = "\b[a-zA-Z\d]{9}\b"
For Each r In Range("A1", Range("A" & Rows.Count).End(xlUp))
If .Test(r.Value) Then
For i = 0 To .Execute(r.Value).Count - 1
If CBool(.Execute(r.Value)(i) Like "*[0-9]*") Then
m = IIf(Cells(1, 2).Value = "", 1, Cells(Rows.Count, 2).End(xlUp).Row + 1)
Cells(m, 2).Value = .Execute(r.Value)(i)
End If
Next i
End If
Next r
End With
End Sub
This bit of code is almost it... just need to check the strings... but excel crashes on the Str line of code
Sub Test()
Dim Outputs, i As Integer, LastRow As Long, Prueba, Prueba2
Outputs = Split(Range("A1"), " ")
For i = 0 To UBound(Outputs)
If Len(Outputs(i)) = 9 Then
Prueba = 0
Prueba2 = 0
On Error Resume Next
Prueba = Val(Outputs(i))
Prueba2 = Str(Outputs(i))
On Error GoTo 0
If Prueba <> 0 And Prueba2 <> 0 Then
LastRow = Range("B10000").End(xlUp).Row + 1
Cells(LastRow, 2) = Outputs(i)
End If
End If
Next i
End Sub
If someone could help to set the string check.. that would do the thing I guess.
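The rule from the question (exactly 9 letters or digits, at least one digit, nothing alphanumeric directly before or after) can also be written as a regex match followed by a digit check. The following is a Python sketch of that rule, not VBA; the names SYMBOL_RE and find_symbols are made up, and it assumes that only ASCII letters and digits count as part of a symbol.

```python
import re

# Exactly 9 alphanumeric characters that are not part of a longer
# alphanumeric run (checked with a lookbehind and a lookahead).
SYMBOL_RE = re.compile(r"(?<![A-Za-z0-9])[A-Za-z0-9]{9}(?![A-Za-z0-9])")

def find_symbols(text):
    """Return all 9-character symbols that contain at least one digit."""
    candidates = SYMBOL_RE.findall(text)
    return [c for c in candidates if re.search(r"[0-9]", c)]

print(find_symbols("FAR687465 B22222222 __ djj^66 20/20/20/20 1:"))
```

Applying find_symbols to each cell of column A and writing the results to column B would give the layout asked for; an empty list corresponds to the 'blank' case.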

Matching five characters in Excel/VBA using RegEx, with first character being dependant on cell value

I need your help! I’d like to use RegEx in an Excel/VBA environment. I do have an approach, but I’m kind of reaching my limits...
I need to match 5 characters within a great many lines of string (the string being in column B of my excel sheet, A comes later). The 5 characters can be 5 digits or a „K“ followed by 4 digits (ex. 12345, 98765, K2345). This would be covered by (\d{5}|K\d{4}).
These five characters can be preceded or followed by letters or special characters, but not by digits. That means no leading zeros are allowed, and the digits shouldn’t be matched within a longer number. That's one point where I'm stuck.
If there’s more than one possible match in a string, I need them all to be matched. If the same number has been matched within a line already, I’d like it not to be matched again. For these two requirements, I do have a sort of solution already, that works as part of the VBA code at the end of this posting: (\d{5}|K\d{4})(?!.*?\1.*$)
In addition, I do have a specific single digit (or a „K“) in column A. I need the five characters to start with this specific character, or otherwise not be matched.
Example of strings (numbered). The two columns A and B are separated by "|" for better readability
(1) | 1 | 2018/ID11298 00000012345 PersoNR: 889899 Bridgestone BNPN
(2) | 3 | Kompo 32280EP ###Baukasten### 3789936690 ID PFK Carbon0
(3) | 2 | 20613, 20614, Mietop Antragsnummer C300Coup IVS 33221 ABF
(4) | 2 | Q21009 China lokal produzierte Derivate f/Radverbund 991222 VV
(5) | 6 | ID:61953 F-Pace Enfantillages (Machine arriere) VvSKPMG Lyon09
(6) | 2 | 2017/22222 22222 21895 Einzelkostenprob. 28932 ZürichMP KOS
(7) | K | ID:K1245 Panamera Nitsche Radlager Derivativ Bayreumion PwC
(8) | 7 | LaunchSupport QBremsen BBG BFG BBD 70142,70119 KK 70142
The results that I'm looking for here are:
(1) | 11298 | ............................. [but don't match 12345, since no preceding numbers allowed]
(2) | 32280 | ............................. [but don't match 37899 within 3789936690]
(3) | 20613 | 20614 | ................ [match both starting with a 2, don't match the one starting with 3]
(4) | 21009 | ............................. [preceded by a letter, which is perfectly fine]
(5) | 61953 | ..............................[random example]
(6) | 22222 | 21895 | 28932 | ... [match them all, but no duplicates]
(7) | K1245 | ............................. [special case with a "K"]
(8) | 70142 | 70119 | ................ [ignore second 70142]
The RegEx/VBA Code that I've put together so far is:
Sub RegEx()
Dim varOut() As Variant
Dim objRegEx As Object
Dim lngColumn As Long
Dim objRegA As Object
Dim varArr As Variant
Dim lngUArr As Long
Dim lngTMP As Long
On Error GoTo Fin
With Worksheets("Sheet1")
varArr = .Range("B2:B50")
Set objRegEx = CreateObject("VBScript.Regexp")
With objRegEx
.Pattern = "(\d{5}|K\d{4})(?!.*?\1.*$)" 'this is where the magic happens
.Global = True
For lngUArr = 1 To UBound(varArr)
Set objRegA = .Execute(varArr(lngUArr, 1))
If objRegA.Count >= lngColumn Then
lngColumn = objRegA.Count
End If
Set objRegA = Nothing
Next lngUArr
If lngColumn = 0 Then Exit Sub
ReDim varOut(1 To UBound(varArr), 1 To lngColumn)
For lngUArr = 1 To UBound(varArr)
Set objRegA = .Execute(varArr(lngUArr, 1))
For lngTMP = 1 To objRegA.Count
varOut(lngUArr, lngTMP) = objRegA(lngTMP - 1)
Next lngTMP
Set objRegA = Nothing
Next lngUArr
End With
.Cells(2, 3).Resize(UBound(varOut), UBound(varOut, 2)) = varOut
End With
Fin:
Set objRegA = Nothing
Set objRegEx = Nothing
If Err.Number <> 0 Then MsgBox "Error: " & Err.Number & " " & Err.Description
End Sub
This code is checking the string from column B and delivering its matches in columns C, D, E etc. It's not matching duplicates. It is however matching numbers within larger numbers, which is a problem. \b for example doesn't work for me, because I still want to match 12345 in EP12345.
Also, I have no idea how to implement the character from column A to be the very first character.
I've uploaded my excel file here: mollmell.de/RegEx.xlsm
Thank you so much for suggestions
Stephan
To exclude matches that are part of a longer number, you can use a negative lookbehind and a negative lookahead that reject a preceding or following digit:
(?x) (?<!\d) (\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/RBnoMo/1
Matching only the numbers that start with the key from the key column is rather hard in a single regex. You could instead match either the key or the numbers and apply the logic afterwards:
(?x)
\|[ ](?<key>.)[ ]\| |
(?<!\d) (?<number>\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/60d0yT/2
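To make the "do the logic afterwards" step concrete, here is a small Python sketch: the lookaround pattern finds every candidate, and plain code then applies the key from column A and drops duplicates. It is written in Python rather than VBA because, as far as I know, VBScript.RegExp supports lookahead but not lookbehind, so the pattern above would not run there unchanged. The names CODE_RE and match_codes are made up for illustration.

```python
import re

# Five digits, or K plus four digits, not directly preceded or followed
# by another digit.
CODE_RE = re.compile(r"(?<!\d)(\d{5}|K\d{4})(?!\d)")

def match_codes(key, text):
    """Return the codes in text that start with key, without duplicates."""
    result = []
    for code in CODE_RE.findall(text):
        if code.startswith(key) and code not in result:
            result.append(code)  # keeps first-seen order
    return result

print(match_codes("2", "2017/22222 22222 21895 Einzelkostenprob. 28932 ZürichMP KOS"))
```

Doing the key filter and the duplicate check in code keeps the regex short and avoids the (?!.*?\1.*$) trick, which keeps the last occurrence of a duplicate rather than the first.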

Add two different numbers in a single text field space separated in Access VBA

I am using Access and VBA to tidy up a database before a migration. One field is going from text to an INT. So I need to convert and possibly add some numbers which exist in a singular field.
Examples:
F/C 3 other 8 should become 11
Calender-7 should become 7
21 F/C and 1 other should become 22
29 (natural ways) should become 29
The second and fourth line are simple enough, just use the following regex in VBA
Dim rgx As New RegExp
Dim inputText As String
Dim outputText As String
rgx.Pattern = "[^0-9]*"
rgx.Global = True
inputText = "29 (natural ways)"
outputText = rgx.Replace(inputText, "")
The downside is if I use it on option 1 or 3:
F/C 3 other 8 will become 38
Calender-7 will become 7
21 F/C and 1 other will become 211
29 (natural ways) will become 29
This is simple enough in bash, I can just keep the spaces by adding one to [^0-9 ]* and then piping it into awk which will add every field using a space as a delimiter like so:
sed 's/[^0-9 ]*//g' | awk -F' ' 's=0; {for (i=1; i<=NF; i++) s=s+$i; print s}'
F/C 3 other 8 will become 11
21 F/C and 1 other will become 22
The problem is I cannot use bash, and there are far too many values to do it by hand. Is there any way to use VBA to accomplish this?
Instead of using the replace method, just capture and then add up all the numbers. For example:
Option Explicit
Function outputText(inputText)
Dim rgx As RegExp
Dim mc As MatchCollection, m As Match
Dim I As Integer
Set rgx = New RegExp
rgx.Pattern = "[0-9]+"
rgx.Global = True
Set mc = rgx.Execute(inputText)
For Each m In mc
I = I + CInt(m) 'may Need to be cast as an int in Access VBA; not required in Excel VBA
Next m
outputText = I
End Function
I'm not sure if there is an easier way to do this. Here is a small function I've written for you.
Requirement: add all numbers in a string, identify "consecutive" digits as one number.
pseudo:
Loop through the given text.
Find the first digit and check whether the following characters are digits too.
If they are, treat them together as one number; otherwise pass on the result.
Continue searching from the last position and add the result to the total.
in code:
Public Function ADD_NUMB(iText As String) As Long
Dim I As Integer, J As Integer 'declare both explicitly; "Dim I, J As Integer" leaves I as a Variant
Dim T As Long
Dim TM As String
For I = 1 To Len(iText)
If (InStr(1, "1234567890", Mid$(iText, I, 1)) >= 1) Then
TM = Mid(iText, I, 1)
For J = I + 1 To Len(iText)
If (InStr(1, "1234567890", Mid$(iText, J, 1)) >= 1) Then
TM = TM & Mid$(iText, J, 1)
Else
Exit For
End If
Next J
T = T + Val(Nz(TM, 0))
I = J
End If
Next I
ADD_NUMB = T
End Function
usage:
dim total as integer
total = ADD_NUMB("21 F/C and 1 other")
not sure about performance but it will get you what you need :)
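Both answers follow the same idea: treat each run of consecutive digits as one number and add the numbers up, instead of deleting the non-digits. A minimal Python sketch of that idea, not Access VBA, with the made-up name sum_numbers:

```python
import re

def sum_numbers(text):
    """Add up every run of consecutive digits found in text."""
    return sum(int(n) for n in re.findall(r"[0-9]+", text))

for s in ["F/C 3 other 8", "Calender-7", "21 F/C and 1 other", "29 (natural ways)"]:
    print(s, "->", sum_numbers(s))
```

Note that, like the VBA versions, this ignores signs and decimal points, so "Calender-7" counts as 7 and "1.5" would count as 1 + 5.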