Divide cell in excel by 3 different string patterns using vba - regex

I have a list with a row in excel which has three different types (pattern) of string:
ABCD, CBFG, EXAL
For the first type it's for example: ABCD
So it will always be a combination of letters and no numbers or anything else.
ABCD (HOLA), CBFG (HOLA), EXAN (HOLA)
For the second type it's for example: ABCD (HOLA)
So it will be always a combination of letters and a combination of letters in brackets.
ABCD (HOLA), SOLE, CBFG (HOLA), SUPRA, EXAN (HOLA), ITAN
For the third type it's for example: ABCD (HOLA), SOLE
So it will be always a combination of letters and a combination of letters in brackets and a comma followed by another combination of letters
All these three types are listed with the following pattern.
Let's say we have: ABCD and ABCD (SOLE) and ABCD (HOLA), SOLE
Then my cell will look as follows:
ABCD, ABCD (HOLA), ABCD (HOLA), SOLE
Now I have the type 3 which also includes a comma. You can say there are two different commas here. One is used to divide each entry and the other one
is actually a part of one entry.
What I am trying to do now is to get these three types and paste them into another cell.
I don't know how to go about this. If someone can give me an advice on how to start that would be great.
Here is an example how I want my cell to divide the information:
In B2 is the information I have. The range H1 to J4 is what I want to have

Soshiribo,
I think that will would work for you. You would obviously want to rename and define all of the variables and change the limits on the loop but this should work.
Dim string2() As String
For I = 1 To 3
Erase string2()
string1 = ActiveSheet.Cells(I + 1, 2)
string2() = Split(string1, ",")
x = UBound(string2) - LBound(string2) + 1
Select Case x:
Case Is = 5
String3 = string2(0) + "," + string2(1)
String4 = string2(2)
String5 = string2(3) + "," + string2(4)
ActiveSheet.Cells(I + 1, 4) = String3
ActiveSheet.Cells(I + 1, 5) = String4
ActiveSheet.Cells(I + 1, 6) = String5
Case Is = 6
String3 = string2(0) + "," + string2(1)
String4 = string2(2) + "," + string2(3)
String5 = string2(4) + "," + string2(5)
ActiveSheet.Cells(I + 1, 4) = String3
ActiveSheet.Cells(I + 1, 5) = String4
ActiveSheet.Cells(I + 1, 6) = String5
End Select
Next I

If you're looking for a regex, here's an idea:
(\w+)( \(HOLA\))?,? ?(SOLE|SUPRA|ITAN)?
I don't fully understand all the requirements, so this is really just a starting point.

Related

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any pattern of numbers that are not bounded by a character (other than whitespace), or a letter. For example, if the string contained t370 or 6-test I would want those to remain. It's only when I have numbers next to each other.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j37b2 3hty
I've tried different regex statements to find when numbers are wrapped in a word boundary: regexr(string, "\b[0-9]+\b", ""); in addition to manually adding the white space " [0-9]+" which will only replace if the pattern occurs in the middle, not at the start of a string. If it's easier to do this without regex expressions that's fine, I was just trying to become more familiar.
Following up on the loop suggesting from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
* add in parts that contain at least one letter
replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
+----------------------------------------+
| id string string2 |
|----------------------------------------|
1. | 1 9884 7-test 58 - 489 7-test |
2. | 2 67-tty 783 444 67-tty |
3. | 3 j3782 3hty j3782 3hty |
+----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.

Extracting specific words from a single cell containing text string

Basically I have a very long text containing multiple spaces, special characters, etc. in one cell in an excel file and I need to extract only specific words from it, each one to a seperate cell in another column.
What I'm looing for:
symbols that are always 9 characters in lenght, and always contain at least one number (up to 9).
So for an example in A1 I have:
euhe: djj33 dkdakofja. kaowdk ---------- jffjbrjjjj j jrjj 08/01/2222 999ABC123
fjfjfj 321XXX888 .... ........ 123456789AA
And in the end I want to have:
999ABC123 in B1
and
321XXX888 in B2.
Right now I'm doing this by using Text to columns feature and then just looking for specific words manually but sometimes the volume is so big it takes too much time and would be cool to automate this.
Can anyone help with this? Thank you!
EDIT:
More examples:
INPUT: '10/01/2016 1,060X 8.999%!!! 1.33 0.666 928888XE0'
OUTPUT: '928888XE0'
INPUT: 'ABCDEBATX ..... ,,00,001% 20///^^ addcA7 7777a 123456789 djaoij8888888 0.000001 12#'
OUTPUT: '123456789'
INPUT: 'FAR687465 B22222222 __ djj^66 20/20/20/20 1:'
OUTPUT: 'FAR687465' in B1 'B22222222' in B2
INPUT: 'fil476 .00 20/.. BUT AAAAAAAAA k98776 000.0001'
OUTPUT: 'blank'
To clarify: the 9 character string can be anywhere, there is no rule what is before or after them, they can be next to each other, or just at the beginning and end of this wall of text, no rules here, the text is random, taken out of some system, can contain dates, etc anything... The symbols are always 9 characters long and they are not the only 9 character symbols in the text. I call them symbols but they should only consist of numbers and letters. Can be only numbers, but never only letters. A1 cell can contain multiple spaces/tabs between words/symbols.
Also if possible to do this not only for A1, but the whole column A until it finds the first blank cell.
Try this code
Sub Test()
Dim r As Range
Dim i As Long
Dim m As Long
With CreateObject("VBScript.RegExp")
.Global = True
.Pattern = "\b[a-zA-Z\d]{9}\b"
For Each r In Range("A1", Range("A" & Rows.Count).End(xlUp))
If .Test(r.Value) Then
For i = 0 To .Execute(r.Value).Count - 1
If CBool(.Execute(r.Value)(i) Like "*[0-9]*") Then
m = IIf(Cells(1, 2).Value = "", 1, Cells(Rows.Count, 2).End(xlUp).Row + 1)
Cells(m, 2).Value = .Execute(r.Value)(i)
End If
Next i
End If
Next r
End With
End Sub
This bit of code is almost it... just need to check the strings... but excel crashes on the Str line of code
Sub Test()
Dim Outputs, i As Integer, LastRow As Long, Prueba, Prueba2
Outputs = Split(Range("A1"), " ")
For i = 0 To UBound(Outputs)
If Len(Outputs(i)) = 9 Then
Prueba = 0
Prueba2 = 0
On Error Resume Next
Prueba = Val(Outputs(i))
Prueba2 = Str(Outputs(i))
On Error GoTo 0
If Prueba <> 0 And Prueba2 <> 0 Then
LastRow = Range("B10000").End(xlUp).Row + 1
Cells(LastRow, 2) = Outputs(i)
End If
End If
Next i
End Sub
If someone could help to set the string check.. that would do the thing I guess.

RegExp other patterns not working

I continue trying to perform string format matching using RegExp in VBScript & VB6. I am now trying to match a short, single-line string formatted as:
Seven characters:
a. Six alphanumeric plus one "-" OR
b. Five alphanumeric plus two "-"
Three numbers
Two letters
Literal "65"
A two-digit hex number.
Examples include 123456-789LM65F2, 4EF789-012XY65A5, A2345--789AB65D0 & 23456--890JK65D0.
The RegExp pattern ([A-Z0-9\-]{12})([65][A-F0-9]{2}) lumps (1) - (3) together and finds these OK.
However, if I try to:
c) Break (3) out w/ pattern ([A-Z0-9\-]{10})([A-Z]{2})([65][A-F0-9]{2}),
d) Break out both (2) & (3) w/ pattern ([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}), or
e) Tighten up (1) with alternation pattern ([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})
it refuses to find any of them.
What am I doing wrong? Following is a VBScript that runs and checks these.
' VB Script
Main()
Function Main() ' RegEx_Format_sample.vbs
'Uses two paterns, TestPttn for full format accuracy check & SplitPttn
'to separate the two desired pieces
Dim reSet, EtchTemp, arrSplit, sTemp
Dim sBoule, sSlice, idx, TestPttn, SplitPttn, arrMatch
Dim arrPttn(3), arrItems(3), idxItem, idxPttn, Msgtemp
Set reSet = New RegExp
' reSet.IgnoreCase = True ' Not using
' reSet.Global = True ' Not using
' load test case formats to check & split
arrItems(0) = "0,6 nums + 1 '-',123456-789LM65F2"
arrItems(1) = "1,6 chars + 1 '-',4EF789-012XY65A5"
arrItems(2) = "2,5 chars + 2 '-',A2345--789AB65D0"
arrItems(3) = "3,5 nums + 2 '-',23456--890JK65D0"
SplitPttn = "([A-Z0-9]{5,6})[-]{1,2}([A-Z0-9]{9})" ' split pattern has never failed to work
' load the patterns to try
arrPttn(0) = "([A-Z0-9\-]{12})([65][A-F0-9]{2})"
arrPttn(1) = "([A-Z0-9\-]{10}[A-Z]{2})([65][A-F0-9]{2})"
arrPttn(2) = "([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})"
arrPttn(3) = "([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})"
For idxPttn = 0 To 3 ' select Test pattern
TestPttn = arrPttn(idxPttn)
TestPttn = TestPttn & "[%]" ' append % "ender" char
SplitPttn = SplitPttn & "[%]" ' append % "ender" char
For idxItem = 0 To 3
reSet.Pattern = TestPttn ' set to Test pattern
sTemp = arrItems(idxItem )
arrSplit = Split(sTemp, ",") ' arrSplit is Split array
EtchTemp = arrSplit(2) & "%" ' append % "ender" char to Item sub (2) as the "phrase" under test
If reSet.Test(EtchTemp) = False Then
MsgBox("RegEx " & TestPttn & " false for " & EtchTemp & " as " & arrSplit(1) )
Else ' test OK; now switch to SplitPttn
reSet.Pattern = SplitPttn
Set arrMatch = reSet.Execute(EtchTemp) ' run Pttn as Exec this time
If arrMatch.Count > 0 then ' If test OK then Count s/b > 0
Msgtemp = ""
Msgtemp = "RegEx " & TestPttn & " TRUE for " & EtchTemp & " as " & arrSplit(1)
For idx = 0 To arrMatch.Item(0).Submatches.Count - 1
Msgtemp = Msgtemp & Chr(13) & Chr(10) & "Split segment " & idx & " as " & arrMatch.Item(0).submatches.Item(idx)
Next
MsgBox(Msgtemp)
End If ' Count OK
End If ' test OK
Next ' idxItem
Next ' idxPttn
End Function
Try this Regex:
(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}
Click for Demo
Explanation:
(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--) - matches either 6 Alphanumeric characters followed by a - or 5 Alphanumeric characters followed by a --
[0-9]{3} - matches 3 Digits
[A-Z]{2} - matches 2 Letters
65 - matches 65 literally
[0-9A-F]{2} - matches 2 HEX symbols
You can get some idea from the following code:
VBScript Code:
Option Explicit
Dim objReg, strTest
strTest = "123456-789LM65F2" 'Change the value as per your requirements. You can also store a list of values in an array and run the code in loop
set objReg = new RegExp
objReg.Global = True
objReg.IgnoreCase = True
objReg.Pattern = "(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}"
if objReg.test(strTest) then
msgbox strTest&" matches with the Pattern"
else
msgbox strTest&" does not match with the Pattern"
end if
set objReg = Nothing
Your patterns do not work because:
([A-Z0-9\-]{12})([65][A-F0-9]{2}) - matches 12 occurrences of either an AlphaNumeric character or - followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9\-]{10}[A-Z]{2})([65][A-F0-9]{2}) - matches 10 occurrences of either an AlphaNumeric character or - followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}) - matches 7 occurrences of either an AlphaNumeric character or - followed by 3 digits followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}) - matches either 5 occurrences of an AlphaNumeric character followed by -- or 6 occurrences of an Alphanumeric followed by a -. This is then followed by 3 digits followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
Try this pattern :
(([A-Z0-9]{5}--)|([A-Z0-9]{6}-))[0-9]{3}[A-Z]{2}65[0-9A-F]{2}
Or, if the last part doesn't like the [A-F]
(([A-Z0-9]{5}--)|([A-Z0-9]{6}-))[0-9]{3}[A-Z]{2}65[0-9ABCDEF]{2}
All, tanx again for your help!!
trincot, everything in each arrItems() between the commas, incl the the "plus", is merely part of a shorthand description of each item's characteristics, such as "5 characters plus 2 dashes".
Gurman, your pttn breakdowns were helpful, but, if I read it right, the addition of the ? prefix is a "Match zero or one occurrences" and this must match exactly one occurrence. Also, my 1st pattern (matches 12) actually DID work for all my test cases.
jNevill, & JMichelB your suggestions are very close to what I ended up with.
I was "over-classing". After some tinkering, I was able to get the Test Pttn to successfully recognize these test cases by taking the [65] out of the [] in my original Alternation pattern. That is I went from ([65]) to (65) and Zammo! it worked.
Orig pattern:
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})
Wkg pattern:
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})(65)([A-F0-9]{2})
Oh, and I moved the
SplitPttn = SplitPttn & "[%]" ' append % "ender" char
stmt up out of the For...Next loop. That helped w/ the splitting.
T-Bone

regex with all components optionals, how to avoid empty matches

I have to process a comma separated string which contains triplets of values and translate them to runtime types,the input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case, the expression works almost well but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
And it works better than the first version but it still matches the empty string plus I should first tokenize the input and then pass each token to the expression in order to assure that the test string could match the begin (^) and end ($) operators.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading and (try to) understanding the regex lookahead concept and with the help of Casimir et Hippolyte answer I've tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! it is able to detect complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h" which weren't intended to be matched at all). Unfortunately it captures emtpy matches everytime a unvalid match is found so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688", so I modified the Lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}) so the empty matches just before "adfank" and "12322134445688" are gone but the ones just before "1s1m1h", "1h1h1h" are stil detected.
So the question is: Is there any regular expression which matches three triplet values in a given order where all component is optional but should be composed of at least one component and doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the begining to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (so without to tokenize before), you can remove the anchors and use a more explicit subpattern in a lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positive (since you are looking for very small strings that can be a part of something else), you can add word-boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
I think this might do the trick.
I am keying on either the beginning of the string to match ^ or the comma separator , for fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <iostream>
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");
int main()
{
std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
std::sregex_iterator iter(test.begin(), test.end(), r);
std::sregex_iterator end_iter;
for(; iter != end_iter; ++iter)
std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");

How do I cut out a few characters in the middle of a string? (Python 2.7.8)

I need to take a 6-significant-digit number and make it a 3-significant-digit number, but the string is structures as this:
string1 = 1.00466E+15
and I need
string2 = 1.00E+15
How do I cut out the 5th, 6th, and 7th character from this string?
string2 = string1[:4] + string1[7:]
See: https://docs.python.org/3/tutorial/introduction.html#strings - specifically 'slicing'