If string contains REGEX match - regex

I have a text file and have read every line of it into a list of strings. I then loop over that list to manipulate and validate the extracted data. For each line, I want to check whether it contains an item number such as 1). I tried a regex for this, but it never matches.
Code
Dim strRegexPattern As String = "^\d{1,6}[)]\s$"
Dim myRegex As New Regex(strRegexPattern, RegexOptions.None)
Dim _strMatch As Match = myRegex.Match(line) '<-- i use for each line as string in listOfExtractedLines
If _strMatch.Success Then
MsgBox(_strMatch.Value)
End If
Strings extracted from the text file (with formatting and spaces)
Title : 8015B DRO(C10-C28) - ORO (C18-C36)
Column01 Col2 Col3 Column04 Col5 Col06 Col(007)
--------------------------------------------------------------------------
Intxxxxx xxxxxxxxx
1) zzzzzzzzzzzzzzzzzz 4.464 168 212614 25.00 xyz 0.00
33) aaaaaaaaaaaaaaaaaaa 4.818 114 330529 25.00 xyz 0.00
51) bbbbbbbbbbbbbbbb 6.742 117 318044 25.00 xyz 0.00
64) cccccccccccccccccccccc 8.397 152 186712 25.00 xyz 0.00
21) Endosulfan Sulfa 12.51 13 918.2E6 840.8E6 106.315
22) Endrin Ketone 13.11 14 143.4E6 992.2E6 104.978

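Your pattern ^\d{1,6}[)]\s$ is anchored at both ends, so it only matches a line that consists of nothing but the number, the ) and a single whitespace character. Try this to match the whole line:
^.*?\s\d{1,6}[)]\s.*$
Edit: the pattern above requires a whitespace character before the number, so it misses lines that start with the number. This one handles both cases:
(?:^|\s+)\d{1,6}[)]\s.*$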
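For readers who want to try the edited pattern outside VB.NET, here is a minimal Python sketch. The sample lines are taken from the question; the leading spaces on the first numbered line and the name ITEM_LINE are assumptions made for illustration.

```python
import re

# Pattern from the edit above: start of line or whitespace, 1-6 digits, ")", whitespace.
ITEM_LINE = re.compile(r"(?:^|\s+)\d{1,6}[)]\s.*$")

lines = [
    "Title : 8015B DRO(C10-C28) - ORO (C18-C36)",
    "Intxxxxx xxxxxxxxx",
    "  1) zzzzzzzzzzzzzzzzzz 4.464 168 212614 25.00 xyz 0.00",  # leading spaces assumed
    "33) aaaaaaaaaaaaaaaaaaa 4.818 114 330529 25.00 xyz 0.00",
]

# Keep only the lines that contain a numbered item such as "1)" or "33)".
matching = [line for line in lines if ITEM_LINE.search(line)]
```

The title line is not picked up because its ")" characters follow letters and digits glued to a letter, not a standalone number.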

Related

Regex extract string based on String match

I have this data with some messy addresses, which sometimes contain a province, district, and ward, not always in the same order:
Name ADDRESS
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY
Store3 98 Phan Xich Long- P. 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5
Store5 22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    //Replace each \ with \\ so that C# doesn't treat \ as escape character
    //Pattern: Start of string, any integers, 0 or 1 letter, end of word
    string sPattern = "^[0-9]+([A-Za-z]\\b)?";
    string sString = Row.ADDRESS ?? ""; //Coalesce to empty string if NULL
    //Find any matches of the pattern in the string
    Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
    //If a match is found
    if (match.Success)
        //Return the first match into the new
        //HouseNumber field
        Row.ward = match.Groups[0].Value;
    else
        //If not found, leave the HouseNumber blank
        Row.ward = "";
}
}
I would like to modify my regex to return the data like this in the ward column (you can see the synonyms used in my addresses: Phuong, P., Ward, etc.):
Name ADDRESS ward
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Quang Ngai Phuong Nguyen Nghiem
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY Phuong THANH KHE TAY
Store3 98 Phan Xich Long- P. 2 Phuong 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5 Phuong 5
Store5 22, Ngo 421/16,--. To 42, Phuong Trung Hoa, Quan Cau Giay Phuong Trung Hoa
I use that regex to extract the civic number, but is there a way to modify it so that it returns the data in my ward column as in the example above?
The groups in this regex, as tested on https://regex101.com/, match the data in your ward column as shown in your example. It only matches the patterns as they appear in your sample data, so you may need to define more precisely where each one can appear, but it may be enough for you to extrapolate to the regex you really need.
(Phuong.*),|P\.(.*$)|Ward - (.*$)
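The group in option 1 matches from Phuong (inclusive) up to the last comma after it, because .* is greedy. In your sample data only one comma follows Phuong, so that is also the first comma; use .*? instead if the match must always stop at the first comma.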
The group in option 2 matches anything that comes after P. until the end of the string.
The group in option 3 matches anything that comes after Ward - until the end of the string.
This one is a bit more advanced: it uses lookarounds instead of groups, so the match itself contains only the text shown in your examples:
Phuong.*(?=,)|(?<=P\.).*$|(?<=Ward - ).*$
Test it in https://regex101.com to see how it works and to see what each part means.
Finally, you may want to exclude Phuong from the match in option 1 so that your program can always print Phuong followed by the match.
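As a rough illustration of that last suggestion, here is a Python sketch (not the C# script component from the question) that captures only the ward name and always prefixes it with Phuong. The pattern WARD and the function extract_ward are hypothetical names, and the pattern is a variation on the answer's regex that only covers the three spellings in the sample data.

```python
import re

# Hypothetical variant of the answer's pattern: capture the ward name only.
#   Phuong <name>   -> up to the next comma
#   P. <name>       -> to the end of the string
#   Ward - <name>   -> to the end of the string
WARD = re.compile(r"Phuong\s+([^,]+)|P\.\s*(.+)$|Ward\s*-\s*(.+)$")

def extract_ward(address):
    """Return 'Phuong <name>' for the first ward spelling found, else ''."""
    m = WARD.search(address or "")
    if not m:
        return ""
    # Exactly one of the three groups participated in the match.
    name = next(g for g in m.groups() if g is not None)
    return "Phuong " + name.strip()

wards = [
    extract_ward("453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai"),
    extract_ward("13 DUNG SY THANH KHE, P. THANH KHE TAY"),
    extract_ward("98 Phan Xich Long- P. 2"),
    extract_ward("306 B4, NGUYENVAN LINH, Ward - 5"),
]
```

Real address data will likely contain spellings that this sketch does not handle, so treat the pattern as a starting point.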

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any run of digits that is not attached to a letter or to another non-whitespace character. For example, if the string contained t370 or 6-test, I would want those to remain. Only standalone numbers should be removed.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex patterns to find numbers wrapped in word boundaries, such as regexr(string, "\b[0-9]+\b", ""), as well as manually adding the whitespace, " [0-9]+", which only replaces a number in the middle of a string, not at the start. If it's easier to do this without regular expressions that's fine; I was just trying to become more familiar with them.
Following up on the loop suggested in the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
    * add in parts that contain at least one letter
    replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
    replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
     +----------------------------------------+
     | id                 string      string2 |
     |----------------------------------------|
  1. |  1   9884 7-test 58 - 489       7-test |
  2. |  2         67-tty 783 444       67-tty |
  3. |  3             j3782 3hty   j3782 3hty |
     +----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
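The same idea (split on whitespace, keep only the parts that contain at least one letter) can be sketched in a few lines of Python for comparison. This is not Stata code, and the function name keep_words_with_letters is made up for the example.

```python
import re

def keep_words_with_letters(text):
    """Drop whitespace-separated tokens that contain no letter at all."""
    return " ".join(tok for tok in text.split() if re.search(r"[A-Za-z]", tok))

cleaned = [
    keep_words_with_letters("9884 7-test 58 - 489"),
    keep_words_with_letters("67-tty 783 444"),
    keep_words_with_letters("j3782 3hty"),
]
```

As in the Stata loop, a lone "-" is dropped too, because it contains no letter.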

How to extract one number from a string

I am trying to extract the last number (the price) from these strings:
"1 601 15.01.2019 14.01.2022 21.224,00"
"1 601 01.01.2019 31.12.2021 38.354,00"
"1 601 01.01.2019 31.12.2021 1629,32"
My pattern:
.Pattern = "\s\d{1,3}\.\d{3}"
The expected result:
21.224,00
38.354,00
1629,32
You could use this pattern, with the global and multiline flags set (g and m on regex101): \d{1,2}\.?\d{3,4},\d{2}
See this demo: https://regex101.com/r/F45CmK/1
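To check the suggested pattern against the three sample strings, here is a small Python sketch (Python is used only for illustration, it is not the language of the question). The names PRICE and rows are assumptions.

```python
import re

# Pattern from the answer: 1-2 digits, optional thousands dot, 3-4 digits, comma, 2 decimals.
PRICE = re.compile(r"\d{1,2}\.?\d{3,4},\d{2}")

rows = [
    "1 601 15.01.2019 14.01.2022 21.224,00",
    "1 601 01.01.2019 31.12.2021 38.354,00",
    "1 601 01.01.2019 31.12.2021 1629,32",
]

# The dates contain no comma, so only the trailing price can match.
prices = [PRICE.search(row).group() for row in rows]
```

Note that the pattern expects at least four digits before the comma, so a price such as 999,00 would not be matched.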

Extract a number from a string of numbers and text

I have a data.frame in R with a column containing character strings of the form {some letters}-{a number}{a letter}, e.g. x <- 'KFKGDLDSKFDSKJJFDI-4567W'. I want to get a column with just the number, e.g. '4567' for that particular row. There's only one number per string, but it can be of any reasonable length. How can I extract the number from each row of the data.frame?
Use regular expressions to extract substrings. Use as.numeric to convert the resulting character string to a number:
string = 'KFKGDLDSKFDSKJJFDI-4567W'
as.numeric(regmatches(string, regexpr('\\d+', string)))
# 4567
You can easily use this to create a new column in your data frame:
#data = data.frame(x = rep(string, 10))
transform(data, y = as.numeric(regmatches(x, regexpr('\\d+', x))))
# x y
# 1 KFKGDLDSKFDSKJJFDI-4567W 4567
# 2 KFKGDLDSKFDSKJJFDI-4567W 4567
# 3 KFKGDLDSKFDSKJJFDI-4567W 4567
# 4 KFKGDLDSKFDSKJJFDI-4567W 4567
…
Try this one:
gsub("[a-zA-Z]+-([0-9]+)[a-zA-Z]","\\1", "KFKGDLDSKFDSKJJFDI-4567W")
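For comparison, both approaches above (find the first run of digits, or replace the whole string with a captured group) look like this in a Python sketch. The helper name extract_number is hypothetical.

```python
import re

def extract_number(code):
    """Return the first run of digits in code as an int, or None if there is none."""
    m = re.search(r"\d+", code)
    return int(m.group()) if m else None

value = extract_number("KFKGDLDSKFDSKJJFDI-4567W")

# Capture-group variant, mirroring the gsub call: keep only what group 1 matched.
digits = re.sub(r"[a-zA-Z]+-([0-9]+)[a-zA-Z]", r"\1", "KFKGDLDSKFDSKJJFDI-4567W")
```

The first variant returns a number, while the second returns a string and leaves the input unchanged when the pattern does not match.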

Regular Expression: Find repeated patterns

Having this string s=";123;;123;;456;;124;;123;;567;" in R, which shows some Ids separated by ";", I want to find the repeated IDs, so in this case ";123;" is repeated. I used the following command in R:
gregexpr("(;[1-9]+;).*\1", s)
but it doesn't find the repeated patterns. Any idea what is wrong?
One example of a long string:
1760381;;1774536;;1774614;;1774617;;1774705;;1774723;;1775013;;1902321;;1928678;;2105486;;2105514;;2105544;;2105575;;2105585;;2279115;;2379236;;290927;;542280;;555749;;641540;;683822;;694934;;713228;;713248;;713249;;726949;;727204;;731434;;754522;;7693856;;100095;;1003838;;1045582;;1079057;;1108697;;1231229;;124087;;1249672;;1328126;;1412065;;1419930;;1441743;;1470580;;1476585;;1502106;;1556149;;1637775;;1643922;;1655644;;1755547;;1759001;;1760295;;1760296;;1760320;;1760326;;1760338;;1760348;;1760349;;1760350;;1760353;;1760375;;1760376;;1760377;;1760378;;1760388;;1760401;;1760402;;1760403;;1760410;;1760421;;1760425;;1760426;;1760642;;1760654;;1770463;;1774365;;1774366;;1774394;;1774449;;1774453;;1774454;;1774455;;1774456;;1774457;;1774458;;1774461;;1774462;;1774463;;1774464;;1774466;;1774469;;1774504;;1774505;;1774506;;1774519;;1774520;;1774525;;1774527;;1774529;;1774532;;1774533;;1774539;;1774542;;1774593;;1774595;;1774604;;1774610;;1774616;;1774617;;1774641;;1774660;;1774671;;1774674;;1774684;;1774687;;1774694;;1774704;;1774706;;1774713;;1774717;;1774722;;1774723;;1774726;;1774733;;1774745;;1774750;;1774753;;1774754;;1774766;;1774784;;1774786;;1774795;;1774799;;1774800;;1774803;;1774809;;1774813;;1774835;;1774849;;1774852;;1774853;;1774854;;1774857;;1774858;;1774861;;1774862;;1774867;;1774868;;1774869;;1774870;;1774877;;1774878;;1774880;;1774884;;1774885;;1774886;;1774902;;1774905;;1774934;;1774935;;1774937;;1774939;;1774946;;1774949;;1774950;;1774958;;1774959;;1774960;;1774961;;1774962;;1774964;;1774965;;1774966;;1774967;;1774969;;1774971;;1774972;;1774973;;1774975;;1774977;;1774978;;1774999;;1775000;;1775003;;1775005;;1775006;;1775009;;1775013;;1775014;;1775017;;1775024;;1775026;;1775033;;1775038;;1775040;;1775041;;1775044;;1775087;;1785544;;1811645;;1837210;;1864356;;1928674;;1928678;;1932882;;1954203;;2066856;;2076876;;2105349;;2105351;;2105458;;2105464;;2105476;;2105480;;2105482;;2105484;;2105489;;2105496;;2105500;;2105510;;2105514;;2105518;;2105532;;2105545;
;2105550;;2172257;;2172762;;218438;;2228198;;2229827;;2247909;;2262250;;2263135;;2287260;;2335872;;2335873;;2335874;;2335877;;2338682;;2352560;;2420902;;263946;;265370;;303060;;330571;;338764;;387492;;387750;;388362;;431807;;436056;;436442;;444058;;458026;;491696;;504783;;513098;;529228;;539799;;549649;;559957;;562574;;563116;;576418;;582851;;592273;;599952;;614463;;626416;;645122;;652363;;665854;;668048;;682877;;683822;;688317;;709795;;710684;;723114;;724447;;724526;;725177;;731389;;731434;;876958;;879962;;947924;;987322;;987446;;61326;;1025952;;1095970;;1338018;;1349990;;1373122;;1419930;;1760310;;1760320;;1774705;;1774706;;1774708;;1774712;;1774952;;1774954;;1774963;;1774972;;1774977;;1775077;;1901075;;2022080;;2117779;;2143723;;441554;;450517;;549649;;1010402;;113311;;1148258;;1374348;;1419930;;1606449;;1606515;;1606608;;1606610;;1760320;;1760338;;1760618;;1760642;;1774504;;1774520;;1774595;;1774705;;1774909;;1774977;;1775011;;1775043;;179542;;1928678;;2105598;;2105721;;2188303;;2335873;;340762;;387759;;436442;;504783;;588336;;646185;;682877;;715644;;725080;;741661;;760924
m<-gregexpr("[0-9]+",s)
n<-regmatches(s,m)
[[1]]
[1] "123" "123" "456" "124" "123" "567"
data.frame(table(unlist(n)))
Var1 Freq
1 123 3
2 124 1
3 456 1
4 567 1
The code works for your long string too. Here are the head and tail of the output:
head(data.frame(table(unlist(n))),10)
Var1 Freq
1 100095 1
2 1003838 1
3 1010402 1
4 1025952 1
5 1045582 1
6 1079057 1
7 1095970 1
8 1108697 1
9 113311 1
10 1148258 1
tail(data.frame(table(unlist(n))),10)
Var1 Freq
316 731434 2
317 741661 1
318 754522 1
319 760924 1
320 7693856 1
321 876958 1
322 879962 1
323 947924 1
324 987322 1
325 987446 1
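The counting approach above (extract every id, then tabulate) can be sketched in Python with collections.Counter standing in for R's table. The variable names are assumptions, and only the short example string is used.

```python
import re
from collections import Counter

s = ";123;;123;;456;;124;;123;;567;"

# Pull out every run of digits, then count how often each id occurs.
ids = re.findall(r"[0-9]+", s)
counts = Counter(ids)

# Ids that occur more than once are the repeated ones.
duplicated = [id_ for id_, n in counts.items() if n > 1]
```

Unlike table, Counter keeps the ids in order of first appearance instead of sorting them.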
1) We assume the ids are all the same length, as in the short example, so that the backreference \1 in the lookahead can only match a complete id. Try this pattern, where (?=...) is a zero-width lookahead expression (see ?regex):
pat <- ";([1-9]+);(?=.*\\1)"
gregexpr(pat, s, perl = TRUE)
or this:
library(gsubfn)
strapply(s, pat, perl = TRUE)[[1]]
## [1] "123" "123"
This lists each id one fewer time than it occurs in s (zero times for ids that are not duplicated), so to list each duplicated id once, try unique(st), where st is the result of the last line of code above.
Note: In the second example in the question, i.e. the long string, there is no ; at the end of the string so the last id can never be matched by the expression unless we first paste a ; onto the end.
2) Instead of matching the contents we could match the delimiters instead:
strsplit(s, ";+")[[1]][-1]
If st is the result of this line of code, then st is just a vector of all the ids (the [-1] drops the empty string produced by the leading ;), so unique(st[duplicated(st)]) lists each duplicated id once and needs no backreference or lookahead.
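A Python sketch of the lookahead idea from 1) follows. It is a variation, not the answer's exact pattern: it uses [0-9] so that ids containing 0 are matched, and it repeats the ; delimiters around the backreference so that ids of different lengths cannot match as substrings of one another. Both changes are assumptions about what the data needs.

```python
import re

s = ";123;;123;;456;;124;;123;;567;"

# ";id;" followed somewhere later by the same ";id;" (zero-width lookahead).
REPEAT = re.compile(r";([0-9]+);(?=.*;\1;)")

# Each id is reported once per occurrence that has a later duplicate.
repeats = REPEAT.findall(s)

# Collapse to one entry per repeated id.
unique_repeats = sorted(set(repeats))
```

As noted above for the long string, a final id without a trailing ; is never seen by this pattern unless a ; is appended first.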