The string below is probably the result of a bad API call:
_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’
I am not sure which rows contain non-alphanumeric characters and my task is to identify which rows are problematic.
Another problem is that some non-alphanumeric characters appear within strings that I would still like to keep and search, like:
This sentence is fine and searchable, but a few non-alphanumeric äóî donäó»t popup
Is there a way to test if the entire contents of a string are non-alphanumeric?
You can use a regular expression to find all rows with only standard alphabetic and numeric characters including commas, periods, exclamation and question marks as well as spaces:
clear
input str168 var1
"_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’"
"This sentence is fine and searchable, but a few non unicode äóî donäó»t popup"
" This is a regular sentence of course"
" another sentence, but with comma"
" but what happens with question marks?"
" or perhaps an exclamation mark!"
end
generate tag = ustrregexm(var1, "^[A-Za-z0-9 ,.?!]*$")
list tag, separator(0)
+-----+
| tag |
|-----|
1. | 0 |
2. | 0 |
3. | 1 |
4. | 1 |
5. | 1 |
6. | 1 |
+-----+
Another possibility is to use a regular expression to tag the rows that contain no alphabetic or numeric characters at all, a solution which in this case covers both required cases:
clear
input str168 var1
"_±êµÂ’¥÷“_¡“__‘_Ó ’¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ “Ïã“_÷’¥Ï “µÏ“ÄÅ“ù÷ “Á¡ê±«“ùã ê¡Û“_ã “__’"
"This sentence is fine and searchable, but a few non unicode äóî donäó»t popup"
" This is a regular sentence of course"
" another sentence, but with comma"
" but what happens with question marks?"
" or perhaps an exclamation mark!"
"¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ "
"¥Ï“ùü’ÄÛ hihuo"
end
generate tag = ustrregexm(var1, "^[^A-Za-z0-9]*$")
list tag, separator(0)
+-----+
| tag |
|-----|
1. | 1 |
2. | 0 |
3. | 0 |
4. | 0 |
5. | 0 |
6. | 0 |
7. | 1 |
8. | 0 |
+-----+
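For readers outside Stata, here is a minimal Python sketch of the same two tests. The function names `tag_plain` and `tag_no_alnum` are made up for illustration, and `re.fullmatch` stands in for the `^...$` anchors. As in the Stata patterns, "alphanumeric" here means ASCII letters and digits only, and an empty string passes both tests because of the `*` quantifier.

```python
import re

# Same character classes as the two Stata patterns above.
ONLY_PLAIN = re.compile(r"[A-Za-z0-9 ,.?!]*")
NO_ALNUM = re.compile(r"[^A-Za-z0-9]*")


def tag_plain(text: str) -> int:
    """1 if the whole string consists of plain letters, digits, spaces and , . ? !"""
    return int(ONLY_PLAIN.fullmatch(text) is not None)


def tag_no_alnum(text: str) -> int:
    """1 if the string contains no ASCII letter or digit at all."""
    return int(NO_ALNUM.fullmatch(text) is not None)


if __name__ == "__main__":
    rows = [
        "This sentence is fine and searchable, but a few non unicode äóî donäó»t popup",
        " This is a regular sentence of course",
        "¥Ï“ùü’ÄÛ“_« “_Ô“Ü“ù÷ ",
        "¥Ï“ùü’ÄÛ hihuo",
    ]
    for row in rows:
        print(tag_plain(row), tag_no_alnum(row), row)
```

The second test is the one that answers the original question: it flags only rows that are garbage from start to finish and leaves mixed rows searchable.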
I have the following strings:
KZ1,345,769.1
PKS948,123.9
XG829,823.5
324JKL,282.7
456MJB87,006.01
How can I separate the letters and numbers?
This is the outcome I expect:
KZ 1345769.1
PKS 948123.9
XG 829823.5
JKL 324282.7
MJB 45687006
I have tried using the split command for this purpose but without success.
@Pearly Spencer's answer is surely preferable, but the following kind of naive looping should occur to any programmer. Look at each character in turn and decide whether it is a letter, a digit or decimal point, or (implicitly) something else, and build up the answers that way. Note that although we loop explicitly over the length of the string, the loop over observations is implicit in each replace.
clear
input str42 whatever
"KZ1,345,769.1"
"PKS948,123.9"
"XG829,823.5"
"324JKL,282.7"
"456MJB87,006.01"
end
compress
local length = substr("`: type whatever'", 4, .)
gen letters = ""
gen numbers = ""
quietly forval j = 1/`length' {
    local arg substr(whatever,`j', 1)
    replace letters = letters + `arg' if inrange(`arg', "A", "Z")
    replace numbers = numbers + `arg' if `arg' == "." | inrange(`arg', "0", "9")
}
list
+-----------------------------------------+
| whatever letters numbers |
|-----------------------------------------|
1. | KZ1,345,769.1 KZ 1345769.1 |
2. | PKS948,123.9 PKS 948123.9 |
3. | XG829,823.5 XG 829823.5 |
4. | 324JKL,282.7 JKL 324282.7 |
5. | 456MJB87,006.01 MJB 45687006.01 |
+-----------------------------------------+
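The same character-by-character idea can be sketched in Python. This is only an illustration of the loop above, not a replacement for the Stata code; the function name `split_letters_numbers` is hypothetical. Like the Stata version, it keeps uppercase A-Z as letters, keeps digits and the decimal point as the number, and silently drops everything else (such as commas).

```python
def split_letters_numbers(text: str) -> tuple[str, str]:
    """Split a mixed string into its uppercase letters and its numeric characters."""
    letters = []
    numbers = []
    for ch in text:
        if "A" <= ch <= "Z":
            letters.append(ch)
        elif ch == "." or "0" <= ch <= "9":
            numbers.append(ch)
        # anything else (commas, spaces, ...) is ignored
    return "".join(letters), "".join(numbers)


if __name__ == "__main__":
    for raw in ["KZ1,345,769.1", "PKS948,123.9", "324JKL,282.7", "456MJB87,006.01"]:
        print(raw, split_letters_numbers(raw))
```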
I have an alphanumeric column named "Result" that I'd like to parse into 4 different columns: prefix, suffix, value, and pure_text.
I'd like to solve this in Spark SQL with RLIKE and regular expressions, but I am also open to PySpark/Scala.
pure_text: the string contains only letters, or it contains numbers that are joined by the special character "-" (e.g. 1-9), have multiple decimal points (e.g. 9.9.0), or are interrupted by a letter (e.g. 3x4u).
prefix: for any string that is not categorized as pure_text, the character(s) before the first digit [0-9].
suffix: for any string that is not categorized as pure_text, the character(s) after the last digit [0-9].
value: for any string that is not categorized as pure_text, the numerical value including the decimal point.
Result
11 H
111L
.9
<.004
>= 0.78
val<=0.6
xyz 100 abc
1-9
aaa 100.3.4
a1q1
Expected Output:
Result      | Prefix | Suffix | Value | Pure_Text
11 H        |        | H      | 11    |
111L        |        | L      | 111   |
.9          |        |        | 0.9   |
<.004       | <      |        | 0.004 |
>= 0.78     | >=     |        | 0.78  |
val<=0.6    | val<=  |        | 0.6   |
xyz 100 abc | xyz    | abc    | 100   |
1-9         |        |        |       | 1-9
aaa 100.3.4 |        |        |       | aaa 100.3.4
a1q1        |        |        |       | a1q1
Here's one approach using a UDF that applies pattern matching to extract the string content into a case class. The pattern matching centers around the numeric value with Regex pattern [+-]?(?:\d*\.)?\d+ to extract the first occurrence of numbers like "1.23", ".99", "-100", etc. A subsequent check of numbers in the remaining substring captured in suffix determines whether the numeric substring in the original string is legitimate.
import org.apache.spark.sql.functions._
import spark.implicits._
case class RegexRes(prefix: String, suffix: String, value: Option[Double], pure_text: String)
val regexExtract = udf{ (s: String) =>
  // prefix (lazy), first numeric token, remaining suffix
  val pattern = """(.*?)([+-]?(?:\d*\.)?\d+)(.*)""".r
  s match {
    case pattern(pfx, num, sfx) =>
      // more digits after the first number: treat the whole string as pure text
      if (sfx.exists(_.isDigit))
        RegexRes("", "", None, s)
      else
        RegexRes(pfx, sfx, Some(num.toDouble), "")
    case _ =>
      // no number at all
      RegexRes("", "", None, s)
  }
}
val df = Seq(
  "11 H", "111L", ".9", "<.004", ">= 0.78", "val<=0.6", "xyz 100 abc", "1-9", "aaa 100.3.4", "a1q1"
).toDF("result")
df.
  withColumn("regex_res", regexExtract($"result")).
  select($"result", $"regex_res.prefix", $"regex_res.suffix", $"regex_res.value", $"regex_res.pure_text").
  show
// +-----------+------+------+-----+-----------+
// | result|prefix|suffix|value| pure_text|
// +-----------+------+------+-----+-----------+
// | 11 H| | H| 11.0| |
// | 111L| | L|111.0| |
// | .9| | | 0.9| |
// | <.004| <| |0.004| |
// | >= 0.78| >= | | 0.78| |
// | val<=0.6| val<=| | 0.6| |
// |xyz 100 abc| xyz | abc|100.0| |
// | 1-9| | | null| 1-9|
// |aaa 100.3.4| | | null|aaa 100.3.4|
// | a1q1| | | null| a1q1|
// +-----------+------+------+-----+-----------+
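Since the question also mentions PySpark, the matching logic itself can be sketched in plain Python, without any Spark dependency. This is an assumption-laden illustration: `regex_extract` and the `RegexRes` tuple merely mirror the Scala names, `re.fullmatch` plays the role of Scala's whole-string pattern match, and wrapping the function in a PySpark UDF is left out.

```python
import re
from typing import NamedTuple, Optional


class RegexRes(NamedTuple):
    prefix: str
    suffix: str
    value: Optional[float]
    pure_text: str


# Lazy prefix, first numeric token (optional sign and decimal point), then the rest.
PATTERN = re.compile(r"(.*?)([+-]?(?:\d*\.)?\d+)(.*)")


def regex_extract(s: str) -> RegexRes:
    m = PATTERN.fullmatch(s)
    if m is None:
        # No number at all: the whole string is pure text.
        return RegexRes("", "", None, s)
    pfx, num, sfx = m.groups()
    if any(ch.isdigit() for ch in sfx):
        # A second run of digits after the first number: also pure text.
        return RegexRes("", "", None, s)
    return RegexRes(pfx, sfx, float(num), "")


if __name__ == "__main__":
    for raw in ["11 H", "<.004", "val<=0.6", "xyz 100 abc", "1-9", "aaa 100.3.4", "a1q1"]:
        print(repr(raw), regex_extract(raw))
```

As in the Scala output above, prefix and suffix keep their surrounding spaces; strip them afterwards if that is not wanted.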
I need a regular expression that checks the following :
a number with 7 digits having the following format : xxxyxxx
where y is different from 0 and is the successor or predecessor of x (x+1 or x-1)
example:
4443444 --> match
4445444 --> match
4442444 --> doesn't match
I don't think there's a smart way of doing this with a RegExp.
You could simply brute-force it, though:
1{3}21{3}|2{3}[13]2{3}|3{3}[24]3{3}|4{3}[35]4{3}|5{3}[46]5{3}|6{3}[57]6{3}|7{3}[68]7{3}|8{3}[79]8{3}|9{3}89{3}
See the demo.
1{3}21{3} `1` 3 times + `2` + `1` 3 times
| OR
2{3}[13]2{3} `2` 3 times + (`1` OR `3`) + `2` 3 times
| ...
3{3}[24]3{3}
|
4{3}[35]4{3}
|
5{3}[46]5{3}
|
6{3}[57]6{3}
|
7{3}[68]7{3}
|
8{3}[79]8{3}
|
9{3}89{3}
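Rather than typing the nine alternatives by hand, the pattern can be generated. The Python sketch below is one way to do that; `build_pattern` and `is_match` are hypothetical names. It follows the same assumptions as the pattern above: x runs from 1 to 9, and y must be a single digit other than 0. Note that the pattern as written has no anchors, so the sketch uses `re.fullmatch` to reject longer strings that merely contain a match.

```python
import re


def build_pattern() -> str:
    """Build the brute-force alternation: xxx, then x-1 or x+1, then xxx."""
    alternatives = []
    for x in range(1, 10):
        # Neighbours of x that are single digits and not 0.
        neighbours = [str(y) for y in (x - 1, x + 1) if 1 <= y <= 9]
        middle = neighbours[0] if len(neighbours) == 1 else "[" + "".join(neighbours) + "]"
        alternatives.append(f"{x}{{3}}{middle}{x}{{3}}")
    return "|".join(alternatives)


PATTERN = re.compile(build_pattern())


def is_match(number: str) -> bool:
    # fullmatch anchors the pattern at both ends of the string.
    return PATTERN.fullmatch(number) is not None


if __name__ == "__main__":
    print(build_pattern())
    for candidate in ["4443444", "4445444", "4442444"]:
        print(candidate, is_match(candidate))
```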
I need your help! I’d like to use RegEx in a Excel/VBA environment. I do have an approach, but I’m kind of reaching my limits...
I need to match 5 characters within a great many lines of string (the string being in column B of my excel sheet, A comes later). The 5 characters can be 5 digits or a „K“ followed by 4 digits (ex. 12345, 98765, K2345). This would be covered by (\d{5}|K\d{4}).
These five characters can be preceded or followed by letters or special characters, but not by numbers. This means no leading zeros are allowed, and the digits shouldn't be matched within a longer number either. That's one point where I'm stuck.
If there’s more than one possible match in a string, I need them all to be matched. If the same number has been matched within a line already, I’d like it not to be matched again. For these two requirements, I do have a sort of solution already, that works as part of the VBA code at the end of this posting: (\d{5}|K\d{4})(?!.*?\1.*$)
In addition, I do have a specific single digit (or a „K“) in column A. I need the five characters to start with this specific character, or otherwise not be matched.
Example of strings (numbered). The two columns A and B are separated by "|" for better readability
(1) | 1 | 2018/ID11298 00000012345 PersoNR: 889899 Bridgestone BNPN
(2) | 3 | Kompo 32280EP ###Baukasten### 3789936690 ID PFK Carbon0
(3) | 2 | 20613, 20614, Mietop Antragsnummer C300Coup IVS 33221 ABF
(4) | 2 | Q21009 China lokal produzierte Derivate f/Radverbund 991222 VV
(5) | 6 | ID:61953 F-Pace Enfantillages (Machine arriere) VvSKPMG Lyon09
(6) | 2 | 2017/22222 22222 21895 Einzelkostenprob. 28932 ZürichMP KOS
(7) | K | ID:K1245 Panamera Nitsche Radlager Derivativ Bayreumion PwC
(8) | 7 | LaunchSupport QBremsen BBG BFG BBD 70142,70119 KK 70142
The results that I'm looking for here are:
(1) | 11298 | ............................. [but don't match 12345, since no preceding numbers allowed]
(2) | 32280 | ............................. [but don't match 37899 within 3789936690]
(3) | 20613 | 20614 | ................ [match both starting with a 2, don't match the one starting with 3]
(4) | 21009 | ............................. [preceded by a letter, which is perfectly fine]
(5) | 61953 | ..............................[random example]
(6) | 22222 | 21895 | 28932 | ... [match them all, but no duplicates]
(7) | K1245 | ............................. [special case with a "K"]
(8) | 70142 | 70119 | ................ [ignore second 70142]
The RegEx/VBA Code that I've put together so far is:
Sub RegEx()
    Dim varOut() As Variant
    Dim objRegEx As Object
    Dim lngColumn As Long
    Dim objRegA As Object
    Dim varArr As Variant
    Dim lngUArr As Long
    Dim lngTMP As Long
    On Error GoTo Fin
    With Worksheets("Sheet1")
        varArr = .Range("B2:B50")
        Set objRegEx = CreateObject("VBScript.Regexp")
        With objRegEx
            .Pattern = "(\d{5}|K\d{4})(?!.*?\1.*$)" 'this is where the magic happens
            .Global = True
            For lngUArr = 1 To UBound(varArr)
                Set objRegA = .Execute(varArr(lngUArr, 1))
                If objRegA.Count >= lngColumn Then
                    lngColumn = objRegA.Count
                End If
                Set objRegA = Nothing
            Next lngUArr
            If lngColumn = 0 Then Exit Sub
            ReDim varOut(1 To UBound(varArr), 1 To lngColumn)
            For lngUArr = 1 To UBound(varArr)
                Set objRegA = .Execute(varArr(lngUArr, 1))
                For lngTMP = 1 To objRegA.Count
                    varOut(lngUArr, lngTMP) = objRegA(lngTMP - 1)
                Next lngTMP
                Set objRegA = Nothing
            Next lngUArr
        End With
        .Cells(2, 3).Resize(UBound(varOut), UBound(varOut, 2)) = varOut
    End With
Fin:
    Set objRegA = Nothing
    Set objRegEx = Nothing
    If Err.Number <> 0 Then MsgBox "Error: " & Err.Number & " " & Err.Description
End Sub
This code is checking the string from column B and delivering its matches in columns C, D, E etc. It's not matching duplicates. It is however matching numbers within larger numbers, which is a problem. \b for example doesn't work for me, because I still want to match 12345 in EP12345.
Also, I have no idea how to implement the character from column A to be the very first character.
I've uploaded my excel file here: mollmell.de/RegEx.xlsm
Thank you so much for suggestions
Stephan
To exclude matches inside longer numbers, you can use a negative lookbehind and a negative lookahead, so that the match is neither preceded nor followed by a digit:
(?x) (?<!\d) (\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/RBnoMo/1
Matching only the numbers that start with the key in column 2 is rather hard to do in a single pattern. Instead, you could match either the key or the numbers and apply the logic afterwards:
(?x)
\|[ ](?<key>.)[ ]\| |
(?<!\d) (?<number>\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/60d0yT/2
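The "do the logic afterwards" step could look like the Python sketch below. It is an illustration only: `extract_codes` is a made-up helper, the key is passed in separately instead of being parsed from the line, and duplicates are dropped by keeping the first occurrence. Python's `re` module supports the lookbehind used here; as far as I know, the VBScript.RegExp engine used in the question's VBA code does not, so there the "not preceded by a digit" check would have to be done differently.

```python
import re

# Five digits, or K plus four digits, not embedded in a longer run of digits.
CODE = re.compile(r"(?<!\d)(\d{5}|K\d{4})(?!\d)")


def extract_codes(key: str, text: str) -> list[str]:
    """Return the distinct codes in text that start with key, in order of appearance."""
    found = []
    for code in CODE.findall(text):
        if code.startswith(key) and code not in found:
            found.append(code)
    return found


if __name__ == "__main__":
    rows = [
        ("1", "2018/ID11298 00000012345 PersoNR: 889899 Bridgestone BNPN"),
        ("2", "20613, 20614, Mietop Antragsnummer C300Coup IVS 33221 ABF"),
        ("K", "ID:K1245 Panamera Nitsche Radlager Derivativ Bayreumion PwC"),
        ("7", "LaunchSupport QBremsen BBG BFG BBD 70142,70119 KK 70142"),
    ]
    for key, text in rows:
        print(key, extract_codes(key, text))
```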
I know what puts and gets do, but I don't understand the meaning of this code.
int main(void) {
char s[20];
gets(s); //Helloworld
gets(s+2);//dog
sort(s+1,s+7);
puts(s+4);
}
Could you please help me to understand?
Draw it on paper, along these lines.
At first, twenty uninitialised elements:
| | | | | | | | | | | | | | | | | | | | |
gets(s):
|H|e|l|l|o|w|o|r|l|d|0| | | | | | | | | |
gets(s+2):
|H|e|d|o|g|0|o|r|l|d|0| | | | | | | | | |
     ^
     |
     s+2
sort(s+1, s+7):
|H|0|d|e|g|o|o|r|l|d|0| | | | | | | | | |
   ^           ^
   |           |
   s+1         s+7
puts(s+4):
|H|0|d|e|g|o|o|r|l|d|0| | | | | | | | | |
         ^
         |
         s+4
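The paper exercise can also be checked with a small simulation. The Python sketch below models the 20-character buffer as a list; `simulate`, `fake_gets` and `c_string` are invented helpers that only imitate what `gets`, `sort` and `puts` do to the buffer, assuming the inputs are exactly "Helloworld" and "dog".

```python
def simulate() -> tuple[str, str]:
    """Replay the four statements on a 20-cell buffer; return (s, s+4) as C strings."""
    buf = ["?"] * 20  # uninitialised contents, shown as '?'

    def fake_gets(offset: int, text: str) -> None:
        # gets() copies the line and appends a terminating null character.
        for i, ch in enumerate(text + "\0"):
            buf[offset + i] = ch

    def c_string(offset: int) -> str:
        # A C string runs from offset up to (not including) the next null.
        end = buf.index("\0", offset)
        return "".join(buf[offset:end])

    fake_gets(0, "Helloworld")      # gets(s)
    fake_gets(2, "dog")             # gets(s+2)
    buf[1:7] = sorted(buf[1:7])     # sort(s+1, s+7): half-open range [1, 7)
    return c_string(0), c_string(4)  # what puts(s) and puts(s+4) would print


if __name__ == "__main__":
    whole, from_four = simulate()
    print(repr(whole), repr(from_four))
```

With these inputs the null character sorts to index 1, so `s` itself reads as "H", while `puts(s+4)` starts after that null and prints "goorld".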
The best thing to say about the code is that it is very bad. Luckily, it is short but it is vulnerable, unmaintainable and error prone.
However, since the previous is not really an answer, let's go through the code, assuming the standard include files were used and "using namespace std;":
char s[20];
This declares an array of 20 characters with the intent of filling it with a null-terminated string. If the string somehow becomes larger than that, you're in trouble.
gets(s); //Helloworld
This reads in a string from stdin. No checks can be done on the size. The comment assumes it will read in Helloworld, which should fit in s.
gets(s+2);//dog
This reads in a second string from stdin, but it will overwrite the previous string starting from the third character. So if the comment is right, s will contain the null-terminated string "Hedog".
sort(s+1,s+7);
This will sort the characters in ascending ASCII value, from the second up to and including the seventh character (s+7 itself is excluded). With the given input, we already have a problem: the null character is in the sixth position, so it will be part of the sorted characters and end up second, which makes the null-terminated string "H".
puts(s+4);
Writes out the string from the fifth position on, so up to the null character that was read in for "Helloworld", with the start overwritten and half-sorted. With the given input this prints "goorld". Of course input can be anything, so expect surprises.
gets(s); //Helloworld -- reads a string from keyboard to s
gets(s+2);//dog -- reads a string from keyboard to s starting at char 2
sort(s+1,s+7); -- sorts s in the half-open interval [1, 7)
puts(s+4); -- writes s to the console starting from char 4
gets(s); //Helloworld --> s=Helloworld
gets(s+2);//dog --> s=Hedog (the old "orld" still follows the new null character)
sort(s+1,s+7); --> s=H (the null character moves to index 1; the buffer is H,\0,d,e,g,o,o,r,l,d,\0)
puts(s+4); --> console=goorld