Convert letter to numeric number formatting in Excel using RegExp


I have an Excel sheet with the data below in a single cell. I want to convert letter numbering (e.g. a, c) or b)) to numeric numbering (e.g. 1, 3) or 2)).
Input Cell Data
Introduction and Basic wording
More lines here
a,c) A steps 1 and 3
- A sub-step) 1 of 1.
- A sub-step 2 of 1
...
b) A step 2.
- A sub-step 1 of 2
d-g) A step 4 and 6
h) A step 5
Note:
The cell text also contains a few letters followed by ')' that are not list markers, such as the 'p)' in 'sub-step)'.
Output Cell Data
Introduction and Basic wording
More lines here
1,3) A steps 1 and 3
- A sub-step) 1 of 1.
- A sub-step 2 of 1
...
2) A step 2.
- A sub-step 1 of 2
4-7) A step 4 and 6
8) A step 5
I used the RegExp below to match the letter numbering:
(^|\n)((\w(?=\)|\,|\-)(\,|\-)){0,3}\w?\))
To replace the letters with numbers, I am using the code below:
Set colMataches = RE.Execute(strRegExp)
For Each match In colMataches
    WScript.Echo "Found:" & match.Value
    Dim mt : mt = match.Value 'Output result: a,c), b) etc.
    Select Case mt
        ...
        Case "b)"
            sProc = RE.Replace(strRegExp, 2)
        ...
        Case Else
            WScript.Echo "Not Found"
    End Select
Next
As I understand it, the signature is Replace(expression, find, replacewith[, start[, count[, compare]]]),
where find is the substring you want to replace and replacewith is the substring you want to replace it with.
This usually works well when we want to replace every match with one specific value, e.g. replacing a character like 'a' or a non-letter like '-' globally with a single letter ('b'), a digit ('1') or a non-alphanumeric character ('#').
But I am failing to do the per-letter replacement (a->1, b->2, c->3, d->4 etc.) with a Select Case statement. Is there a better way to do the replacement?
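One way to avoid a Select Case branch per letter is to compute the number from the letter's position in the alphabet inside a replacement callback. The sketch below shows the idea in Python rather than VBScript. The names MARKER and convert_markers are made up for illustration, and it assumes the markers use lowercase a-z and sit at the start of a line, as in the sample cell.

```python
import re

# A marker at the start of a line: up to three "letter + separator" pairs,
# one final letter, then ")". Example matches: "a,c)", "b)", "d-g)".
MARKER = re.compile(r"^((?:[a-z][,-]){0,3}[a-z])\)", re.MULTILINE)

def letter_to_number(m):
    # 'a' -> "1", 'b' -> "2", ... computed from the character code.
    return str(ord(m.group(0)) - ord("a") + 1)

def convert_markers(text):
    def fix(m):
        # Replace every letter inside the marker, keep separators and ")".
        return re.sub(r"[a-z]", letter_to_number, m.group(1)) + ")"
    return MARKER.sub(fix, text)

cell = "a,c) A steps 1 and 3\n- A sub-step) 1 of 1.\nb) A step 2.\nd-g) A step 4 and 6\nh) A step 5"
print(convert_markers(cell))
```

Because the pattern is anchored to the start of a line, the 'p)' in 'sub-step)' is left alone. The same approach should carry over to VBScript by looping over the matches and building each replacement with Asc() instead of listing every case.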

Related

Can I use regular expressions, the Like operator, and/or Instr() find the index of a pattern within a larger string?

I have a large list (a table with one field) of non-standardized strings, imported from a poorly managed legacy database. I need to extract the single-digit number (surrounded by spaces) that occurs exactly once in each of those strings (though the strings have other multi-digit numbers sometimes too). For example, from the following string:
"Quality Assurance File System And Records Retention Johnson, R.M. 004 4 2999 ss/ds/free ReviewMo = Aug Effective 1/31/2012 FileOpen-?"
I would want to pull the number 4 (or 4's position in the string, i.e. 71)
I can use
WHERE rsLegacyList.F1 LIKE "* # *"
inside a select statement to find if each string has a lone digit, and thereby filter my list. But it doesn't tell me where the digit is so I can extract the digit itself (with mid() function) and start sorting the list. The goal is to create a second field with that digit by itself as a method of sorting the larger strings in the first field.
Is there a way to use Instr() along with regular expressions to find where a regular expression occurs within a larger string? Something like
intMarkerLocation = instr(rsLegacyList.F1, Like "* # *")
but that actually works?
I appreciate any suggestions, or workarounds that avoid the problem entirely.
@Lee Mac, I made a function RegExFindStringIndex as shown here:
Public Function RegExFindStringIndex(strToSearch As String, strPatternToMatch As String) As Integer
Dim regex As RegExp
Dim Matching As Match
Set regex = New RegExp
With regex
.MultiLine = False
.Global = True
.IgnoreCase = False
.Pattern = strPatternToMatch
Matching = .Execute(strToSearch)
RegExFindStringIndex = Matching.FirstIndex
End With
Set regex = Nothing
Set Matching = Nothing
End Function
But it gives me the error "Invalid use of property" at the line Matching = .Execute(strToSearch)

Using Regular Expressions
If you were to use Regular Expressions, you would need to define a VBA function to instantiate a RegExp object, set the pattern property to something like \s\d\s (whitespace-digit-whitespace) and then invoke the Execute method to obtain a match (or matches), each of which will provide an index of the pattern within the string. If you want to pursue this route, here are some existing examples for Excel, but the RegExp manipulation will be identical in MS Access.
Here is an example function demonstrating how to use the first result returned by the Execute method:
Public Function RegexInStr(strStr As String, strPat As String) As Integer
    With New RegExp
        .Multiline = False
        .Global = True
        .IgnoreCase = False
        .Pattern = strPat
        With .Execute(strStr)
            If .Count > 0 Then RegexInStr = .Item(0).FirstIndex + 1
        End With
    End With
End Function
Note that the above uses early binding and so you will need to add a reference to the Microsoft VBScript Regular Expressions 5.5 library to your project.
Example Immediate Window evaluation:
?InStr("abc 1 123", " 1 ")
4
?RegexInStr("abc 1 123", "\s\d\s")
4
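For comparison, the same "find the index of a pattern" idea can be sketched in Python. This is only an illustration of the technique, not Access code: regex_instr mirrors the 1-based result of the VBA function above, and lone_digit is a hypothetical helper that uses a capture group so the reported position is that of the digit itself rather than the whitespace before it.

```python
import re

def regex_instr(text, pattern):
    """Return the 1-based position of the first match of pattern, or 0 if none."""
    m = re.search(pattern, text)
    return m.start() + 1 if m else 0

def lone_digit(text):
    """Return (digit, 1-based position) of the first single digit
    surrounded by whitespace, or None if there is no such digit."""
    m = re.search(r"\s([0-9])\s", text)
    return (m.group(1), m.start(1) + 1) if m else None

print(regex_instr("abc 1 123", r"\s\d\s"))
print(lone_digit("abc 1 123"))
```

Returning the digit together with its position means no separate Mid()-style extraction step is needed afterwards.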
Using InStr
An alternative using the in-built instr function within a query might be the following inelegant (and probably very slow) query:
select
switch
(
instr(rsLegacyList.F1," 0 ")>0,instr(rsLegacyList.F1," 0 ")+1,
instr(rsLegacyList.F1," 1 ")>0,instr(rsLegacyList.F1," 1 ")+1,
instr(rsLegacyList.F1," 2 ")>0,instr(rsLegacyList.F1," 2 ")+1,
instr(rsLegacyList.F1," 3 ")>0,instr(rsLegacyList.F1," 3 ")+1,
instr(rsLegacyList.F1," 4 ")>0,instr(rsLegacyList.F1," 4 ")+1,
instr(rsLegacyList.F1," 5 ")>0,instr(rsLegacyList.F1," 5 ")+1,
instr(rsLegacyList.F1," 6 ")>0,instr(rsLegacyList.F1," 6 ")+1,
instr(rsLegacyList.F1," 7 ")>0,instr(rsLegacyList.F1," 7 ")+1,
instr(rsLegacyList.F1," 8 ")>0,instr(rsLegacyList.F1," 8 ")+1,
instr(rsLegacyList.F1," 9 ")>0,instr(rsLegacyList.F1," 9 ")+1,
true, null
) as intMarkerLocation
from
rsLegacyList
where
rsLegacyList.F1 like "* # *"

How about:
select
instr(rsLegacyList.F1, " # ") + 1 as position
from rsLegacyList
where rsLegacyList.F1 LIKE "* # *"

Regex to find 9 to 11 digit integer occurring anywhere closest to a keyword

In simple terms, what I am looking for is this: if a string contains the keyword ZTFN00, the regex should return the 9 to 11 digit number closest to the keyword, whether it is to the left or to the right of it.
I want to do this with Oracle's REGEXP_REPLACE function.
Below are some of the sample strings:
The following error occurred in the SAP UPDATE_BP service as part of the combine:
(error:653, R11:186:Number 867278489 Already Exists for ID Type ZTFN00)
Expected result: 867278489
The following error occurred in the SAP UPDATE_BP service as part of the combine
(error:653, R11:186:Number ZTFN00 identification number 123456778 already exist)
Expected result: 123456778

I could not find a way to easily do this with regular expressions, but if you want to do the task without PL/SQL, you can do something like the following.
It's a little bit tricky, combining many calls to regexp functions to evaluate, for each occurrence of digit string, the distance from your keyword and then pick the nearest one.
with test(string, keyWord) as
( select
'(error:653, R11:186: 999999999 Number 0000000000 Already Exists for ID Type ZTFN00 hjhk 11111111111 kjh k222222222)',
'ZTFN00'
from dual)
select numberString
from (
select numberString,
decode (greatest (numberPosition, keyWordPosition),
keyWordPosition,
keyWordPosition - numberPosition - numberLength,
numberPosition,
numberPosition - keyWordPosition - keyWordLength
) as distance
from (
select regexp_instr(string, '[0-9]{9,11}', 1, level) as numberPosition,
instr( string, keyWord) as keyWordPosition,
length(regexp_substr(string, '[0-9]{9,11}', 1, level)) as numberLength,
regexp_substr(string, '[0-9]{9,11}', 1, level) as numberString,
length(keyWord) as keyWordLength
from test
connect by regexp_instr(string, '[0-9]{9,11}', 1, level) != 0
)
order by distance
)where rownum = 1
Looking at the single parts:
SQL> with test(string, keyWord) as
     ( select
         '(error:653, R11:186: 999999999 Number 0000000000 Already Exists for ID Type ZTFN00 hjhk 11111111111 kjh k222222222)',
         'ZTFN00'
       from dual)
     select regexp_instr(string, '[0-9]{9,11}', 1, level) as numberPosition,
            instr( string, keyWord) as keyWordPosition,
            length(regexp_substr(string, '[0-9]{9,11}', 1, level)) as numberLength,
            regexp_substr(string, '[0-9]{9,11}', 1, level) as numberString,
            length(keyWord) as keyWordLength
     from test
     connect by regexp_instr(string, '[0-9]{9,11}', 1, level) != 0;

NUMBERPOSITION KEYWORDPOSITION NUMBERLENGTH NUMBERSTRING     KEYWORDLENGTH
-------------- --------------- ------------ ---------------- -------------
            22              77            9 999999999                    6
            39              77           10 0000000000                   6
            91              77           11 11111111111                  6
           108              77            9 222222222                    6
This scans the whole string and iterates while regexp_instr(...) != 0, that is, while there are occurrences; level is used to look for the first, second, ... occurrence, so that row 1 gives the first occurrence, row 2 the second and so on, for as long as an nth occurrence exists.
That part only computes some useful fields; we then use them to look both to the right and to the left of your keyword, evaluating the exact distance between the number string and the keyword:
select numberString,
decode (greatest (numberPosition, keyWordPosition),
keyWordPosition,
keyWordPosition - numberPosition - numberLength,
numberPosition,
numberPosition - keyWordPosition - keyWordLength
) as distance
The inner query is ordered by distance, so that the first row contains the nearest string; that's why in the outermost query we only extract the row with rownum = 1 to get the nearest row.
It can be re-written in a more compact way, but this is a bit more readable.
This should even work when you have multiple occurrences of the digit string, even on both sides of your keyword.
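The distance logic above can also be sketched outside SQL. The following Python function is an illustration only (nearest_number is a made-up name, not an Oracle feature): it lists every 9 to 11 digit run, measures the gap between each run and the keyword the same way the decode expression does, and returns the run with the smallest gap.

```python
import re

def nearest_number(text, keyword, pattern=r"[0-9]{9,11}"):
    """Return the digit string closest to keyword, or None if either is missing."""
    kw_start = text.find(keyword)
    if kw_start < 0:
        return None
    kw_end = kw_start + len(keyword)

    def distance(m):
        # Number to the left: gap between its end and the keyword start.
        if m.start() < kw_start:
            return kw_start - m.end()
        # Number to the right: gap between the keyword end and its start.
        return m.start() - kw_end

    matches = list(re.finditer(pattern, text))
    return min(matches, key=distance).group(0) if matches else None

s = ("(error:653, R11:186: 999999999 Number 0000000000 Already Exists "
     "for ID Type ZTFN00 hjhk 11111111111 kjh k222222222)")
print(nearest_number(s, "ZTFN00"))
```

For the test string used in the query, the gaps work out the same as the distance column: 28 characters for 0000000000 on the left and 8 for 11111111111 on the right, so the latter is returned.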

This regex works for me in RegexBuddy with Oracle mode selected (10g, 11g and 12c):
SELECT REGEXP_SUBSTR(mycolumn,
'\(error:[0-9]+,[ ]+
(
(
([0-9]{9,11})()
|
ZTFN00()
|
[^ ),]+
)
[ ),]+
)+
\4\5',
1, 1, 'cx', 3) FROM mytable;
The regex treats the main body of the string as a series of tokens matching the general pattern [^ ),]+ (one or more of any characters except space, right parenthesis, or comma). But there are two specific tokens that it tries to match first: the keyword (ZTFN00) and a valid ID number ([0-9]{9,11}).
The empty groups at the end of the first two alternatives serve as check boxes; the corresponding backreferences at the end (\4 and \5) will only succeed if those groups participated in the match, meaning both an ID number and the keyword were seen.
(This is an obscure "feature" that definitely doesn't work in many flavors, so I can't be positive it will work in Oracle. Please let me know if it doesn't.)
The ID number is captured in group #3, and that's what the REGEXP_SUBSTR command returns. (Since you only want to retrieve the number, there is no call for REGEXP_REPLACE.)

Separating column using separate (tidyr) via dplyr on a first encountered digit

I'm trying to separate a rather messy column into two columns containing period and description. My data resembles the extract below:
set.seed(1)
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
Desired results
Desired results should look like that:
          indicator   period    values
1     someindicator     2001 0.2655087
2     someindicator     2011 0.3721239
3         some text 20022008 0.5728534
4 another indicator     2003 0.9082078
Characteristics
Indicator descriptions are in one column
Numeric values (everything from the first digit onwards, including that digit) are in the second column
Code
require(dplyr); require(tidyr); require(magrittr)
dta %<>%
separate(col = indicator, into = c("indicator", "period"),
sep = "^[^\\d]*(2+)", remove = TRUE)
Naturally this does not work:
> head(dta, 2)
indicator period values
1 001 0.2655087
2 011 0.3721239
Other attempts
I have also tried the default separation method sep = "[^[:alnum:]]" but it breaks down the column into too many columns as it appears to be matching all of the available digits.
The sep = "2*" also doesn't work as there are too many 2s at times (example: 20032006).
What I'm trying to do boils down to:
Identifying the first digit in the string
Separating on that character. As a matter of fact, I would be happy to preserve that particular character as well.

I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
" ?" (a space followed by ?) matches the space character literally, between zero and one time, as many times as possible, giving back as needed
(?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
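To see the lookbehind/lookahead split in isolation, here is a small Python sketch using the same pattern. It is not tidyr code: split_indicator is a hypothetical helper, and it relies on re.split accepting matches that can be empty, which requires Python 3.7 or later.

```python
import re

# Same pattern as in the separate() call: a position preceded by a
# lowercase letter, an optional space, then a position followed by a digit.
SPLIT_AT = re.compile(r"(?<=[a-z]) ?(?=[0-9])")

def split_indicator(value):
    """Split value into [description, period] at the first letter-to-digit boundary."""
    return SPLIT_AT.split(value, maxsplit=1)

for v in ["someindicator2001", "some text 20022008", "another indicator 2003"]:
    print(split_indicator(v))
```

The lookarounds consume no characters, so the only text removed by the split is the optional space between the description and the period.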

You could also use unglue::unglue_unnest():
dta <- data.frame(indicator=c("someindicator2001", "someindicator2011",
"some text 20022008", "another indicator 2003"),
values = runif(n = 4))
# remotes::install_github("moodymudskipper/unglue")
library(unglue)
unglue_unnest(dta, indicator, "{indicator}{=\\s*}{period=\\d*}")
#> values indicator period
#> 1 0.43234262 someindicator 2001
#> 2 0.65890900 someindicator 2011
#> 3 0.93576805 some text 20022008
#> 4 0.01934736 another indicator 2003
Created on 2019-09-14 by the reprex package (v0.3.0)

Replace String B with String C if it contains (but not exactly matches) String A

I have a data frame match_df which shows "matching rules": the column old should be replaced with the column new in the data frames it is applied to.
old <- c("10000","20000","300ZZ","40000")
new <- c("Name1","Name2","Name3","Name4")
match_df <- data.frame(old,new)
old new
1 10000 Name1
2 20000 Name2
3 300ZZ Name3 # watch the letters
4 40000 Name4
I want to apply the matching rules above on a data frame working_df
id <- c(1,2,3,4)
value <- c("xyz-10000","20000","300ZZ-230002112","40")
working_df <- data.frame(id,value)
id value
1 1 xyz-10000
2 2 20000
3 3 300ZZ-230002112
4 4 40
My desired result is
# result
id value
1 1 Name1
2 2 Name2
3 3 Name3
4 4 40
This means that I am not looking for an exact match. I'd rather like to replace the whole string working_df$value as soon as it includes any part of the string in match_df$old.
I like the solution posted in R: replace characters using gsub, how to create a function?, but it works only for exact matches. I experimented with gsub, str_replace_all from stringr but I couldn't find a solution that works for me. There are many solutions for exact matches on SOF, but I couldn't find a comprehensible one for this problem.
Any help is highly appreciated.

I'm not sure this is the most elegant/efficient way of doing it but you could try something like this:
working_df$value <- sapply(working_df$value, function(y){
  idx <- which(sapply(match_df$old, function(x){grepl(x, y)}))[1]
  if(is.na(idx)) idx <- 0
  ifelse(idx > 0, as.character(match_df$new[idx]), as.character(y))
})
It uses grepl to find, for each value of working_df, if there is a row of match_df that is partially matching and get the index of that row. If there is more than one, it takes the first one.
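The "first partially matching rule wins" logic described above can be sketched in Python as follows. This is an illustration rather than R code: replace_partial is a made-up name, and it uses a plain substring test where grepl would treat each old value as a regular expression.

```python
def replace_partial(values, old, new):
    """Replace each value by the `new` entry of the first `old` entry it contains.

    Values that contain none of the `old` strings are returned unchanged.
    """
    result = []
    for value in values:
        replacement = value
        for pattern, name in zip(old, new):
            if pattern in value:      # partial (substring) match
                replacement = name
                break                 # first matching rule wins
        result.append(replacement)
    return result

old = ["10000", "20000", "300ZZ", "40000"]
new = ["Name1", "Name2", "Name3", "Name4"]
values = ["xyz-10000", "20000", "300ZZ-230002112", "40"]
print(replace_partial(values, old, new))
```

Note that the match is one-directional: "40" is left alone because it does not contain "40000", even though "40000" contains "40".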

You need the grep function. This will return the indices of a vector that match a pattern (any pattern, not necessarily a full string match). For instance, this will tell you which of your "old" values match the "10000" pattern:
grep(match_df[1,1], working_df$value)
Once you have that information, you can look up the corresponding "new" value for that pattern, and replace it on the matching rows.

Here are 2 approaches using Map + <<- and a for loop:
working_df[["value2"]] <- as.character(working_df[["value"]])
Map(function(x, y){working_df[["value2"]][grepl(x, working_df[["value2"]])] <<- y}, old, new)
working_df
## id value value2
## 1 1 xyz-10000 Name1
## 2 2 20000 Name2
## 3 3 300ZZ-230002112 Name3
## 4 4 40 40
## or...
working_df[["value2"]] <- as.character(working_df[["value"]])
for (i in seq_along(old)) {
  working_df[["value2"]][grepl(old[i], working_df[["value2"]])] <- new[i]
}

Use regexp_instr to get the last number in a string

If I used the following expression, the result should be 1.
regexp_instr('500 Oracle Parkway, Redwood Shores, CA','[[:digit:]]')
Is there a way to make this look for the last number in the string? If I were to look for the last number in the above example, it should return 3.

If you were using 11g, you could use regexp_count to determine the number of times that a pattern exists in the string and feed that into the regexp_instr
regexp_instr( str,
'[[:digit:]]',
1,
regexp_count( str, '[[:digit:]]')
)
Since you're on 10g, however, the simplest option is probably to reverse the string and subtract the position that is found from the length of the string
length(str) - regexp_instr(reverse(str),'[[:digit:]]') + 1
Both approaches should work in 11g
SQL> with x as (
       select '500 Oracle Parkway, Redwood Shores, CA' str
       from dual
     )
     select length(str) - regexp_instr(reverse(str),'[[:digit:]]') + 1,
            regexp_instr( str,
                          '[[:digit:]]',
                          1,
                          regexp_count( str, '[[:digit:]]')
                        )
     from x;

LENGTH(STR)-REGEXP_INSTR(REVERSE(STR),'[[:DIGIT:]]')+1
------------------------------------------------------
REGEXP_INSTR(STR,'[[:DIGIT:]]',1,REGEXP_COUNT(STR,'[[:DIGIT:]]'))
-----------------------------------------------------------------
3
3

Another solution with less effort is
SELECT regexp_instr('500 Oracle Parkway, Redwood Shores, CA','[^[:digit:]]*$')-1
FROM dual;
This can be read as: find the run of non-digits at the end of the string and subtract 1 from its position, which gives the position of the last digit in the string.
REGEXP_INSTR('500ORACLEPARKWAY,REDWOODSHORES,CA','[^[:DIGIT:]]*$')-1
--------------------------------------------------------------------
3
which I think is what you want.
(tested on 11g)
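Both ideas (reversing the string, and anchoring on the trailing non-digits) can be checked quickly outside the database. The Python sketch below is only an illustration with made-up function names; positions are 1-based to match regexp_instr, and 0 means no digit was found.

```python
import re

def last_digit_pos_reverse(s):
    """Reverse the string, find the first digit, and convert the index back."""
    m = re.search(r"[0-9]", s[::-1])
    return len(s) - m.start() if m else 0

def last_digit_pos_anchor(s):
    """Find a digit that is followed only by non-digits up to the end."""
    m = re.search(r"[0-9](?=[^0-9]*$)", s)
    return m.start() + 1 if m else 0

address = "500 Oracle Parkway, Redwood Shores, CA"
print(last_digit_pos_reverse(address), last_digit_pos_anchor(address))
```

Unlike the '[^[:digit:]]*$' query, which returns the string length when there is no digit at all, both functions here return 0 in that case.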
