Extract letters and numbers from string - stata

I have the following strings:
KZ1,345,769.1
PKS948,123.9
XG829,823.5
324JKL,282.7
456MJB87,006.01
How can I separate the letters and numbers?
This is the outcome I expect:
KZ 1345769.1
PKS 948123.9
XG 829823.5
JKL 324282.7
MJB 45687006
I have tried using the split command for this purpose but without success.

@Pearly Spencer's answer is surely preferable, but the following kind of naive looping should occur to any programmer. Look at each character in turn and decide whether it is a letter, a digit or decimal point, or (implicitly) something else, and build up the answers that way. Note that although we loop explicitly only over the length of the string, the loop over observations is tacit in each replace.
clear
input str42 whatever
"KZ1,345,769.1"
"PKS948,123.9"
"XG829,823.5"
"324JKL,282.7"
"456MJB87,006.01"
end
compress
local length = substr("`: type whatever'", 4, .)
gen letters = ""
gen numbers = ""
quietly forval j = 1/`length' {
    local arg substr(whatever,`j', 1)
    replace letters = letters + `arg' if inrange(`arg', "A", "Z")
    replace numbers = numbers + `arg' if `arg' == "." | inrange(`arg', "0", "9")
}
list
     +-----------------------------------------+
     |        whatever   letters       numbers |
     |-----------------------------------------|
  1. |   KZ1,345,769.1        KZ     1345769.1 |
  2. |    PKS948,123.9       PKS      948123.9 |
  3. |     XG829,823.5        XG      829823.5 |
  4. |    324JKL,282.7       JKL      324282.7 |
  5. | 456MJB87,006.01       MJB   45687006.01 |
     +-----------------------------------------+
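For readers outside Stata, the same character-by-character idea can be sketched in Python. This is only an illustration of the loop described above, not part of the original answer; the function name split_letters_numbers is made up for the example, and it assumes uppercase ASCII letters, as in the sample data.

```python
# Hypothetical helper illustrating the per-character classification above.
def split_letters_numbers(s):
    # Keep uppercase ASCII letters in one string ...
    letters = "".join(ch for ch in s if "A" <= ch <= "Z")
    # ... and digits plus the decimal point in another; commas are dropped.
    numbers = "".join(ch for ch in s if ch == "." or "0" <= ch <= "9")
    return letters, numbers

samples = ["KZ1,345,769.1", "PKS948,123.9", "XG829,823.5",
           "324JKL,282.7", "456MJB87,006.01"]
results = [split_letters_numbers(s) for s in samples]

for s, (letters, numbers) in zip(samples, results):
    print(s, letters, numbers)
```

As in the Stata version, the digits before and after the letters are simply concatenated, so "324JKL,282.7" yields 324282.7.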

Related

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any pattern of numbers that are not bounded by a character (other than whitespace), or a letter. For example, if the string contained t370 or 6-test I would want those to remain. It's only when I have numbers next to each other.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to find when numbers are wrapped in a word boundary: regexr(string, "\b[0-9]+\b", ""); in addition to manually adding the white space " [0-9]+" which will only replace if the pattern occurs in the middle, not at the start of a string. If it's easier to do this without regex expressions that's fine, I was just trying to become more familiar.
Following up on the loop suggestion from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
    * add in parts that contain at least one letter
    replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
    replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
     +----------------------------------------+
     | id                 string      string2 |
     |----------------------------------------|
  1. |  1   9884 7-test 58 - 489       7-test |
  2. |  2         67-tty 783 444       67-tty |
  3. |  3             j3782 3hty   j3782 3hty |
     +----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
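The same rule (split on whitespace, keep only the words that contain at least one letter) can be sketched in Python for comparison. The helper name keep_words_with_letters is hypothetical, and the sketch carries over the answer's assumption about which words to keep.

```python
import re

# Hypothetical helper: keep whitespace-separated words that contain a letter.
def keep_words_with_letters(s):
    return " ".join(w for w in s.split() if re.search(r"[a-zA-Z]", w))

rows = ["9884 7-test 58 - 489", "67-tty 783 444", "j3782 3hty"]
cleaned = [keep_words_with_letters(r) for r in rows]

for before, after in zip(rows, cleaned):
    print(repr(before), "->", repr(after))
```

A lone "-" contains no letter, so it is dropped together with the standalone numbers.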

Test if all characters in string are not alphanumeric

The string below is probably the result of a bad API call:
_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’
I am not sure which rows contain non-alphanumeric characters and my task is to identify which rows are problematic.
Another problem is that some non-alphanumeric characters appear with strings that I would like to still keep and search, like:
This sentence is fine and searchable, but a few non-alphanumeric äóî donäó»t popup
Is there a way to test if the entire contents of a string are non-alphanumeric?
You can use a regular expression to find all rows with only standard alphabetic and numeric characters including commas, periods, exclamation and question marks as well as spaces:
clear
input str168 var1
"_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’"
"This sentence is fine and searchable, but a few non unicode äóî donäó»t popup"
" This is a regular sentence of course"
" another sentence, but with comma"
" but what happens with question marks?"
" or perhaps an exclamation mark!"
end
generate tag = ustrregexm(var1, "^[A-Za-z0-9 ,.?!]*$")
. list tag, separator(0)
+-----+
| tag |
|-----|
1. | 0 |
2. | 0 |
3. | 1 |
4. | 1 |
5. | 1 |
6. | 1 |
+-----+
Another possibility is to use a regular expression that flags only the rows containing no ASCII letters or digits at all, a solution which in this case covers both required cases:
clear
input str168 var1
"_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’"
"This sentence is fine and searchable, but a few non unicode äóî donäó»t popup"
" This is a regular sentence of course"
" another sentence, but with comma"
" but what happens with question marks?"
" or perhaps an exclamantion mark!"
"¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ "
"¥Ï“ùü’ÄÛ hihuo"
end
generate tag = ustrregexm(var1, "^[^A-Za-z0-9]*$")
list tag, separator(0)
+-----+
| tag |
|-----|
1. | 1 |
2. | 0 |
3. | 0 |
4. | 0 |
5. | 0 |
6. | 0 |
7. | 1 |
8. | 0 |
+-----+
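The two tests differ only in the character class, which a short Python sketch can make explicit. The function names and the shortened sample strings are made up for the illustration; re.fullmatch plays the role of the ^...$ anchors.

```python
import re

# First test: every character is a plain letter, digit, space or , . ? !
ONLY_PLAIN = re.compile(r"[A-Za-z0-9 ,.?!]*")
# Second test: no character is an ASCII letter or digit.
NO_ALNUM = re.compile(r"[^A-Za-z0-9]*")

def only_plain(s):
    return ONLY_PLAIN.fullmatch(s) is not None

def no_alnum(s):
    return NO_ALNUM.fullmatch(s) is not None

# Shortened, hypothetical stand-ins for the rows in the example data.
samples = [
    "±÷¡ «Ô",
    "fine text, but äóî here",
    " This is a regular sentence of course",
    "¥Ï hihuo",
]
plain_flags = [only_plain(s) for s in samples]
garbage_flags = [no_alnum(s) for s in samples]
print(plain_flags, garbage_flags)
```

A row that mixes readable text with stray characters fails both tests, so the second test is the one that isolates rows that are garbage from start to finish.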

Spark - extracting numeric values from an alphanumeric string using regex

I have an alphanumeric column named "Result" that I'd like to parse into 4 different columns: prefix, suffix, value, and pure_text.
I'd like to solve this using Spark SQL using RLIKE and REGEX, but also open to PySpark/Scala
pure_text: contains only letters, or, if numbers are present, they are either joined by the special character "-", or contain multiple decimal points (e.g. 9.9.0), or form a number followed by a letter and then a number again (e.g. 3x4u)
prefix: anything that can't be categorized into "pure_text" will be taken into consideration. any character(s) before the 1st digit [0-9] needs to be extracted.
suffix: anything that can't be categorized into "pure_text" will be taken into consideration. any character(s) after the last digit [0-9] needs to be extracted.
value: anything that can't be categorized into "pure_text" will be taken into consideration. extract all numerical values including the decimal point.
Result
11 H
111L
.9
<.004
>= 0.78
val<=0.6
xyz 100 abc
1-9
aaa 100.3.4
a1q1
Expected Output:
Result      | Prefix | Suffix | Value | Pure_Text
11 H        |        | H      | 11    |
111L        |        | L      | 111   |
.9          |        |        | 0.9   |
<.004       | <      |        | 0.004 |
>= 0.78     | >=     |        | 0.78  |
val<=0.6    | val<=  |        | 0.6   |
xyz 100 abc | xyz    | abc    | 100   |
1-9         |        |        |       | 1-9
aaa 100.3.4 |        |        |       | aaa 100.3.4
a1q1        |        |        |       | a1q1
Here's one approach using a UDF that applies pattern matching to extract the string content into a case class. The pattern matching centers around the numeric value with Regex pattern [+-]?(?:\d*\.)?\d+ to extract the first occurrence of numbers like "1.23", ".99", "-100", etc. A subsequent check of numbers in the remaining substring captured in suffix determines whether the numeric substring in the original string is legitimate.
import org.apache.spark.sql.functions._
import spark.implicits._
case class RegexRes(prefix: String, suffix: String, value: Option[Double], pure_text: String)
val regexExtract = udf{ (s: String) =>
  val pattern = """(.*?)([+-]?(?:\d*\.)?\d+)(.*)""".r
  s match {
    case pattern(pfx, num, sfx) =>
      if (sfx.exists(_.isDigit))
        RegexRes("", "", None, s)
      else
        RegexRes(pfx, sfx, Some(num.toDouble), "")
    case _ =>
      RegexRes("", "", None, s)
  }
}
val df = Seq(
  "11 H", "111L", ".9", "<.004", ">= 0.78", "val<=0.6", "xyz 100 abc", "1-9", "aaa 100.3.4", "a1q1"
).toDF("result")
df.
  withColumn("regex_res", regexExtract($"result")).
  select($"result", $"regex_res.prefix", $"regex_res.suffix", $"regex_res.value", $"regex_res.pure_text").
  show
// +-----------+------+------+-----+-----------+
// |     result|prefix|suffix|value|  pure_text|
// +-----------+------+------+-----+-----------+
// |       11 H|      |     H| 11.0|           |
// |       111L|      |     L|111.0|           |
// |         .9|      |      |  0.9|           |
// |      <.004|     <|      |0.004|           |
// |    >= 0.78|   >= |      | 0.78|           |
// |   val<=0.6| val<=|      |  0.6|           |
// |xyz 100 abc|  xyz |   abc|100.0|           |
// |        1-9|      |      | null|        1-9|
// |aaa 100.3.4|      |      | null|aaa 100.3.4|
// |       a1q1|      |      | null|       a1q1|
// +-----------+------+------+-----+-----------+
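The matching logic itself does not depend on Spark, so it can be tried out as a plain function first. The following Python sketch mirrors the UDF above under the assumption that the same regex and the same "any digit left in the suffix means pure text" rule are wanted; the names RegexRes and regex_extract are reused only for readability, and this is not the PySpark UDF itself.

```python
import re
from typing import NamedTuple, Optional

class RegexRes(NamedTuple):
    prefix: str
    suffix: str
    value: Optional[float]
    pure_text: str

# Same pattern as in the Scala answer: lazy prefix, first number, remainder.
PATTERN = re.compile(r"(.*?)([+-]?(?:\d*\.)?\d+)(.*)")

def regex_extract(s):
    m = PATTERN.fullmatch(s)
    # No number at all, or a second number hiding in the suffix: pure text.
    if m is None or any(ch.isdigit() for ch in m.group(3)):
        return RegexRes("", "", None, s)
    return RegexRes(m.group(1), m.group(3), float(m.group(2)), "")

for s in ["11 H", "<.004", "val<=0.6", "xyz 100 abc", "1-9", "aaa 100.3.4", "a1q1"]:
    print(repr(s), regex_extract(s))
```

Note that prefix and suffix keep their surrounding spaces (for example " H"), just as in the Spark output above.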

Matching five characters in Excel/VBA using RegEx, with first character being dependant on cell value

I need your help! I'd like to use RegEx in an Excel/VBA environment. I do have an approach, but I'm reaching my limits...
I need to match 5 characters within a great many lines of text (the strings are in column B of my Excel sheet; column A comes later). The 5 characters can be 5 digits or a "K" followed by 4 digits (e.g. 12345, 98765, K2345). This would be covered by (\d{5}|K\d{4}).
These five characters can be preceded or followed by letters or special characters, but not by digits. That means no leading zeros are allowed, and the digits shouldn't be matched inside a longer number. That's one point where I'm stuck.
If there's more than one possible match in a string, I need them all to be matched. If the same number has already been matched within a line, I'd like it not to be matched again. For these two requirements I already have a sort of solution, which works as part of the VBA code at the end of this post: (\d{5}|K\d{4})(?!.*?\1.*$)
In addition, I have a specific single digit (or a "K") in column A. The five characters must start with this character, or otherwise not be matched.
Example of strings (numbered). The two columns A and B are separated by "|" for better readability
(1) | 1 | 2018/ID11298 00000012345 PersoNR: 889899 Bridgestone BNPN
(2) | 3 | Kompo 32280EP ###Baukasten### 3789936690 ID PFK Carbon0
(3) | 2 | 20613, 20614, Mietop Antragsnummer C300Coup IVS 33221 ABF
(4) | 2 | Q21009 China lokal produzierte Derivate f/Radverbund 991222 VV
(5) | 6 | ID:61953 F-Pace Enfantillages (Machine arriere) VvSKPMG Lyon09
(6) | 2 | 2017/22222 22222 21895 Einzelkostenprob. 28932 ZürichMP KOS
(7) | K | ID:K1245 Panamera Nitsche Radlager Derivativ Bayreumion PwC
(8) | 7 | LaunchSupport QBremsen BBG BFG BBD 70142,70119 KK 70142
The results that I'm looking for here are:
(1) | 11298 | ............................. [but don't match 12345, since no preceding digits are allowed]
(2) | 32280 | ............................. [but don't match 37899 within 3789936690]
(3) | 20613 | 20614 | ................ [match both starting with a 2, don't match the one starting with 3]
(4) | 21009 | ............................. [preceded by a letter, which is perfectly fine]
(5) | 61953 | ..............................[random example]
(6) | 22222 | 21895 | 28932 | ... [match them all, but no duplicates]
(7) | K1245 | ............................. [special case with a "K"]
(8) | 70142 | 70119 | ................ [ignore second 70142]
The RegEx/VBA Code that I've put together so far is:
Sub RegEx()
    Dim varOut() As Variant
    Dim objRegEx As Object
    Dim lngColumn As Long
    Dim objRegA As Object
    Dim varArr As Variant
    Dim lngUArr As Long
    Dim lngTMP As Long
    On Error GoTo Fin
    With Worksheets("Sheet1")
        varArr = .Range("B2:B50")
        Set objRegEx = CreateObject("VBScript.Regexp")
        With objRegEx
            .Pattern = "(\d{5}|K\d{4})(?!.*?\1.*$)" 'this is where the magic happens
            .Global = True
            For lngUArr = 1 To UBound(varArr)
                Set objRegA = .Execute(varArr(lngUArr, 1))
                If objRegA.Count >= lngColumn Then
                    lngColumn = objRegA.Count
                End If
                Set objRegA = Nothing
            Next lngUArr
            If lngColumn = 0 Then Exit Sub
            ReDim varOut(1 To UBound(varArr), 1 To lngColumn)
            For lngUArr = 1 To UBound(varArr)
                Set objRegA = .Execute(varArr(lngUArr, 1))
                For lngTMP = 1 To objRegA.Count
                    varOut(lngUArr, lngTMP) = objRegA(lngTMP - 1)
                Next lngTMP
                Set objRegA = Nothing
            Next lngUArr
        End With
        .Cells(2, 3).Resize(UBound(varOut), UBound(varOut, 2)) = varOut
    End With
Fin:
    Set objRegA = Nothing
    Set objRegEx = Nothing
    If Err.Number <> 0 Then MsgBox "Error: " & Err.Number & " " & Err.Description
End Sub
This code is checking the string from column B and delivering its matches in columns C, D, E etc. It's not matching duplicates. It is however matching numbers within larger numbers, which is a problem. \b for example doesn't work for me, because I still want to match 12345 in EP12345.
Also, I have no idea how to implement the character from column A to be the very first character.
I've uploaded my excel file here: mollmell.de/RegEx.xlsm
Thank you so much for suggestions
Stephan
To filter out matches that are part of longer numbers, you can use a negative lookbehind and a negative lookahead that reject a preceding or following digit:
(?x) (?<!\d) (\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/RBnoMo/1
Matching only the numbers that start with the key (the second field in the example lines) is rather hard to do in a single pattern. One option is to match either the key or the numbers and apply the logic afterwards:
(?x)
\|[ ](?<key>.)[ ]\| |
(?<!\d) (?<number>\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/60d0yT/2
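The "match first, apply the logic afterwards" idea can be sketched in Python, whose re module supports the lookbehind used above. To my knowledge the VBScript.RegExp engine from the question supports neither lookbehind nor the (?x) flag, so this is an illustration of the approach rather than a drop-in VBA fix. The names CODE and codes_for_key are hypothetical; the key is passed in separately, as it would be read from column A.

```python
import re

# Five digits, or K plus four digits, not touching any other digit.
CODE = re.compile(r"(?<!\d)(\d{5}|K\d{4})(?!\d)")

def codes_for_key(key, text):
    """Return codes in text that start with key, in order, without duplicates."""
    seen = []
    for code in CODE.findall(text):
        if code.startswith(key) and code not in seen:
            seen.append(code)
    return seen

# Rows taken from the question: (key from column A, string from column B).
rows = [
    ("1", "2018/ID11298 00000012345 PersoNR: 889899 Bridgestone BNPN"),
    ("3", "Kompo 32280EP ###Baukasten### 3789936690 ID PFK Carbon0"),
    ("2", "20613, 20614, Mietop Antragsnummer C300Coup IVS 33221 ABF"),
    ("2", "2017/22222 22222 21895 Einzelkostenprob. 28932 ZürichMP KOS"),
    ("K", "ID:K1245 Panamera Nitsche Radlager Derivativ Bayreumion PwC"),
    ("7", "LaunchSupport QBremsen BBG BFG BBD 70142,70119 KK 70142"),
]
matches = [codes_for_key(key, text) for key, text in rows]
for (key, _), found in zip(rows, matches):
    print(key, found)
```

Doing the key check and the de-duplication in code keeps the pattern short and avoids the backreference lookahead from the question.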

Difference between (^|\\s)([A-Z]{1,3})(\\s|$) and \\b[A-Z]{1,2}\\b regular expressions in R

I'm trying to clean some short strings (1-3 letters) stored in a column of an R data frame. Specifically, consider the following R script:
df = data.frame( "original" = c("ABCDE FG H",
"IJKL MN OPQRS",
"TUV WX YZ AAAA"))
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}($|\\s)", " ", df$original)
df$filter2 = gsub("\\b[A-Z]{1,2}\\b", " ", df$original)
> df
original | filter1 | filter2 |
1 ABCDE FG H | ABCDE H | ABCDE |
2 IJKL MN OPQRS | IJKL OPQRS | IJKL OPQRS|
3 TUV WX YZ AAAA | TUV YZ AAAA| TUV AAAA |
I don't understand why the first filter (^|\\s)[A-Z]{1,2}($|\\s) doesn't replace "H" in the first row or "YZ" in the third one. I would expect the same result as with \\b[A-Z]{1,2}\\b (the filter2 column). Please don't worry about multiple spaces; they aren't important to me (unless they are the problem :)).
I thought the problem might be the "globality" of the operation, that is, once it finds the first match it does not replace the second one, but that isn't the case when I run the following replacement:
> gsub("A", "X", "AAAABBBBCCCDDDDAAAAAAAEEE")
[1] "XXXXBBBBCCCDDDDXXXXXXXEEE"
So why are the results different?
The point is that gsub can only match non-overlapping strings. With FG as the first expected match and H as the second, you can see that these matches would overlap: after "(^|\\s)[A-Z]{1,2}($|\\s)" consumes the trailing space after FG, H no longer matches the pattern.
Look: ABCDE FG H is analyzed from left to right. The expression matches " FG " (including both spaces), and the regex index is then right before H. There is only this letter left to match, but (^|\s) requires a whitespace character or the start of the string, and neither is available at this location.
To "fix" this and use the same logic, you can use a PCRE regex in gsub with lookarounds:
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}(?=$|\\s)", " ", df$original, perl=TRUE)
or
df$filter1 = gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", " ", df$original, perl=TRUE)
and if you need to actually consume (to remove) spaces, just add \\s* before (or/and after).
The second expression "\\b[A-Z]{1,2}\\b" contains word boundaries, and they are zero-width assertions that do not consume text, thus, the regex engine can match both FG and H since the spaces are not consumed.
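The consuming versus non-consuming behaviour is not specific to R. A small Python sketch with re.sub shows the same effect on the first row of the example (the variable names are made up for the illustration):

```python
import re

text = "ABCDE FG H"

# Consuming version: the match " FG " eats the space that "H" would need.
consuming = re.sub(r"(^|\s)[A-Z]{1,2}($|\s)", " ", text)

# Lookarounds and word boundaries assert context without consuming it,
# so both "FG" and "H" are matched and replaced.
lookaround = re.sub(r"(?<!\S)[A-Z]{1,2}(?!\S)", " ", text)
boundary = re.sub(r"\b[A-Z]{1,2}\b", " ", text)

print(repr(consuming))
print(repr(lookaround))
print(repr(boundary))
```

In the consuming result "H" survives, while the other two leave only "ABCDE" followed by spaces.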