Regex extract string based on String match - regex

I have this data with some messy addresses, which contain (not always in the same order) a province, a district, and a ward:
Name ADDRESS
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY
Store3 98 Phan Xich Long- P. 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5
Store5 22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        //Replace each \ with \\ so that C# doesn't treat \ as escape character
        //Pattern: Start of string, any integers, 0 or 1 letter, end of word
        string sPattern = "^[0-9]+([A-Za-z]\\b)?";
        string sString = Row.ADDRESS ?? ""; //Coalesce to empty string if NULL
        //Find any matches of the pattern in the string
        Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
        //If a match is found
        if (match.Success)
            //Return the first match into the new
            //HouseNumber field
            Row.ward = match.Groups[0].Value;
        else
            //If not found, leave the HouseNumber blank
            Row.ward = "";
    }
}
I would like to modify my regex to return the data like this in the column ward. You can see the synonyms used in my addresses (Phuong, P., Ward, etc.).
Name ADDRESS ward
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Quang Ngai Phuong Nguyen Nghiem
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY Phuong THANH KHE TAY
Store3 98 Phan Xich Long- P. 2 Phuong 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5 Phuong 5
Store5 22, Ngo 421/16,--. To 42, Phuong Trung Hoa, Quan Cau Giay Phuong Trung Hoa
I use that regex to extract the civic number, but is there a way to modify it so that it returns the data in my ward column as in the example above?

The groups in this regex, as tested on https://regex101.com/, match the data in your ward column for your example. Note that it only matches the patterns as they appear in your sample data, so you may need to define more precisely where each pattern can appear. Still, it may be enough for you to extrapolate to the regex that you really need.
(Phuong.*),|P\.(.*$)|Ward - (.*$)
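The group in option 1 matches from Phuong (inclusive) up to a comma. Because .* is greedy, that is the last comma in the string; use [^,]* instead of .* if it must stop at the first comma.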
The group in option 2 matches anything that comes after P. until the end of the string.
The group in option 3 matches anything that comes after Ward - until the end of the string.
This one is a bit more advanced, but it only matches what you mentioned in your examples, no groups:
Phuong.*(?=,)|(?<=P\.).*$|(?<=Ward - ).*$
Test it in https://regex101.com to see how it works and to see what each part means.
Finally, you may want to exclude Phuong from the match in option 1 so that your program can always print Phuong followed by the match.
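The idea of matching the ward keyword but printing a fixed "Phuong" prefix can be sketched in Python as follows. This is only an illustration built on the five sample addresses: the names WARD_RE and extract_ward are hypothetical, and the keyword list (Phuong, P., Ward) is an assumption that would need extending for real data.

```python
import re

# Assumed keyword list taken from the sample rows: "Phuong", "P." or "Ward",
# followed by optional spaces/dashes, then the ward name up to the next comma.
WARD_RE = re.compile(r"\b(?:Phuong|P\.|Ward)[\s\-]*([^,]+)", re.IGNORECASE)

def extract_ward(address):
    """Return the ward normalized as 'Phuong <name>', or '' if none is found."""
    match = WARD_RE.search(address or "")
    if not match:
        return ""
    return "Phuong " + match.group(1).strip()

if __name__ == "__main__":
    for addr in [
        "453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai",
        "13 DUNG SY THANH KHE, P. THANH KHE TAY",
        "98 Phan Xich Long- P. 2",
        "306 B4, NGUYENVAN LINH, Ward - 5",
    ]:
        print(extract_ward(addr))
```

The same pattern should carry over to the C# Regex class, with match.Groups[1].Value holding the ward name.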

Related

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any pattern of numbers that are not bounded by a character (other than whitespace), or a letter. For example, if the string contained t370 or 6-test I would want those to remain. It's only when I have numbers next to each other.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to find when numbers are wrapped in a word boundary: regexr(string, "\b[0-9]+\b", ""); in addition to manually adding the white space " [0-9]+" which will only replace if the pattern occurs in the middle, not at the start of a string. If it's easier to do this without regex expressions that's fine, I was just trying to become more familiar.
Following up on the loop suggestion from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
    * add in parts that contain at least one letter
    replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
    replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
+----------------------------------------+
| id string string2 |
|----------------------------------------|
1. | 1 9884 7-test 58 - 489 7-test |
2. | 2 67-tty 783 444 67-tty |
3. | 3 j3782 3hty j3782 3hty |
+----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
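For comparison, the same "keep every word that contains at least one letter" rule can be sketched outside Stata. The Python below is only an illustration of the logic, not a replacement for the loop above; the function name keep_words_with_letters is made up for this example.

```python
import re

def keep_words_with_letters(text):
    """Split on whitespace and keep only the words containing a letter."""
    return " ".join(word for word in text.split() if re.search(r"[A-Za-z]", word))

if __name__ == "__main__":
    for s in ["9884 7-test 58 - 489", "67-tty 783 444", "j3782 3hty"]:
        print(keep_words_with_letters(s))
```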

Removing special characters while retaining alpha numeric words

I'm in the middle of cleaning a data set that has this:
[IN]
my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
my_Series.str.replace("[^a-zA-Z]+", " ")
[OUT]
0
1 ASD
2 AUG M G
3 Air G G
4 Karsh
[IDEAL OUT]
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
My goal is to remove special characters and standalone numbers, but if a word is alphanumeric, it should stay. Can anyone help?
Try apply to get your ideal output:
>>> my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
>>> my_Series.apply(lambda x: " ".join(word for word in x.replace('-', ' ').split() if not word.isdigit()))
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
I replace - with a space and split the string on whitespace, then check whether each word consists only of digits.
Words made up only of digits are dropped; every other word is kept.
Finally, the remaining words are joined back together with spaces.
Edit 1:
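Regex solution (recent pandas versions need regex=True for the pattern to be treated as a regular expression):
>>> my_Series.str.replace(r"((\d+)(?=.*\d))|([^a-zA-Z0-9 ])", " ", regex=True)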
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
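Explanation:
The pattern uses a lookahead and has two alternatives:
((\d+)(?=.*\d))|([^a-zA-Z0-9 ])
(a run of digits that is followed, anywhere later in the string, by another digit) OR (any character that is not a letter, a digit, or a space)
Note that this relies on the last digit in each sample string being part of a word you want to keep; digits inside a word are also stripped whenever another digit appears later in the string.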
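If the lookahead pattern turns out to be too fragile for your data, a token-based variant may be easier to reason about. The sketch below is an assumption about what "ideal output" means (split on anything non-alphanumeric, keep tokens that contain a letter); clean_value is a hypothetical helper name and could be passed to my_Series.apply.

```python
import re

def clean_value(text):
    """Split on non-alphanumeric characters and keep tokens containing a letter."""
    tokens = re.split(r"[^A-Za-z0-9]+", text)
    return " ".join(tok for tok in tokens if re.search(r"[A-Za-z]", tok))

if __name__ == "__main__":
    for value in ["-", "ASD", "711-AUG-M4G", "Air G2G", "Karsh"]:
        print(repr(clean_value(value)))
```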

Conditionally Remove Character of a Vector Element in R

I have (sometimes incomplete) data on addresses that looks like this:
data <- c("1600 Pennsylvania Avenue, Washington DC",
",Siem Reap,FC,", "11 Wall Street, New York, NY", ",Addis Ababa,FC,")
I need to remove the first and/or last character if either one of them is a comma.
So far, I have:
for(i in 1:length(data)){
  lastchar <- nchar(data[i])
  sec2last <- nchar(data[i]) - 1
  if(regexpr(",",data[i])[1] == 1){
    data[i] <- substr(data[i],2, lastchar)
  }
  if(regexpr(",",data[i])[1] == nchar(data[i])){
    data[i] <- substr(data[i],1, sec2last)
  }
}
data
which works for the first character, but not the last character. How can I modify the second if statement or otherwise accomplish my goal?
You could try the code below, which removes a comma at the start or at the end of each string:
> data <- c("1600 Pennsylvania Avenue, Washington DC",
+ ",Siem Reap,FC,", "11 Wall Street, New York, NY", ",Addis Ababa,FC,")
> gsub("(?<=^),|,(?=$)", "", data, perl=TRUE)
[1] "1600 Pennsylvania Avenue, Washington DC"
[2] "Siem Reap,FC"
[3] "11 Wall Street, New York, NY"
[4] "Addis Ababa,FC"
Pattern explanation:
(?<=^), (?<=...) is a positive lookbehind. Here it asserts that what precedes the comma is the start of the line ^, so it matches a leading comma.
| The alternation (OR) operator, which combines two patterns.
,(?=$) (?=...) is a positive lookahead. Here it asserts that what follows the comma is the end of the line $, so it matches a trailing comma.
Because ^ and $ are already zero-width anchors, the simpler pattern ^,|,$ matches the same commas.
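The same substitution can be sketched in Python with the anchors used directly instead of lookarounds. This is just an illustration of the pattern, with strip_edge_commas as a made-up function name; like the gsub call above, it removes at most one comma from each end.

```python
import re

def strip_edge_commas(text):
    """Remove a single leading and/or trailing comma, leaving inner commas alone."""
    return re.sub(r"^,|,$", "", text)

if __name__ == "__main__":
    data = [
        "1600 Pennsylvania Avenue, Washington DC",
        ",Siem Reap,FC,",
        "11 Wall Street, New York, NY",
        ",Addis Ababa,FC,",
    ]
    print([strip_edge_commas(item) for item in data])
```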

string padded with optional blank with max length

I have a problem building a regex. This is a sample of the text:
text 123 12345 abc 12 def 67 i 89 o 0 t 2
The numbers are sometimes padded with blanks to the max length (3).
e.g.:
"1" can be "1" or "1 "
"13" can be "13" or "13 "
My regex is at the moment this:
\b([\d](\s*)){1,3}\b
The results of this regex are the following: (. = blank for better visibility)
123.
12....
67.
89.
0....
2
But I need this: (. = blank for better visibility)
123
12.
67.
89.
0..
2
How can I tell the regex engine to count the blanks into the {1,3} option?
Try this:
\b(?:\d[\d\s]{0,2})(?:(?<=\s)|\b)
This will also cover strings like text 123 1 23 12345 123abc 12 def 67 i 89 o 0 t 2 and results in:
123
1.
23.
12.
67.
89.
0..
2
Does this do what you want?
\b(\d){1,3}\s*\b
This will also include whitespace (if available) after the selection.
I think you want this
\b(?:\d[\d\s]{0,2})(?!\d)
See it here on Regexr
The word boundary will not work at the end, because if the match ends in whitespace there may be no word boundary there. Therefore I use a negative lookahead (?!\d) to ensure that no digit follows.
But if you have a string like "1 23", it will match only the "1" and the "23", but not the whitespace after the "1".
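To see how the lookahead version behaves, here is a small Python sketch that applies the same pattern with re.findall. The function name find_padded and the test strings are assumptions for illustration; Python's re syntax accepts this particular pattern unchanged.

```python
import re

# Same pattern as above: a digit, up to two more digits/whitespace characters,
# and no digit immediately after the match.
PADDED_RE = re.compile(r"\b(?:\d[\d\s]{0,2})(?!\d)")

def find_padded(text):
    """Return every padded-number match, trailing blanks included."""
    return PADDED_RE.findall(text)

if __name__ == "__main__":
    print(find_padded("text 123 12345 abc 12  def 0   t 2"))
    print(find_padded("1 23"))
```

The second call illustrates the limitation described above: the blank after the "1" is not part of the match.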
Assuming you want to use the padded numbers somewhere else, break the problem apart into two; (simple) parsing the numbers, and (simple) formatting the numbers (including padding).
while ( $text =~ /\b(\d{1,3})\b/g ) {
printf( "%-3d\n", $1 );
}
Alternatively:
my @padded_numbers = map { sprintf( "%-3d", $_ ) } ( $text =~ /\b(\d{1,3})\b/g );
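The parse-then-format split can also be sketched in Python, for readers who do not use Perl. The name pad_numbers and the default width of 3 are assumptions taken from the question's "max length (3)".

```python
import re

def pad_numbers(text, width=3):
    """Find standalone numbers of 1 to `width` digits and left-justify them."""
    numbers = re.findall(r"\b\d{1,%d}\b" % width, text)
    return [num.ljust(width) for num in numbers]

if __name__ == "__main__":
    print(pad_numbers("text 123 12345 abc 12 def 67 i 89 o 0 t 2"))
```

Longer runs such as 12345 are skipped because there is no word boundary inside them.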

Regex capitalize first letter every word, also after a special character like a dash

I use this to capitalize every first letter every word:
#(\s|^)([a-z0-9-_]+)#i
I want it also to capitalize the letter if it's after a special mark like a dash (-).
Now it shows:
This Is A Test For-stackoverflow
And I want this:
This Is A Test For-Stackoverflow
+1 for word boundaries, and here is a comparable Javascript solution. This accounts for possessives, as well:
var re = /(\b[a-z](?!\s))/g;
var s = "fort collins, croton-on-hudson, harper's ferry, coeur d'alene, o'fallon";
s = s.replace(re, function(x){return x.toUpperCase();});
console.log(s); // "Fort Collins, Croton-On-Hudson, Harper's Ferry, Coeur D'Alene, O'Fallon"
A simple solution is to use word boundaries:
#\b[a-z0-9-_]+#i
Alternatively, you can match for just a few characters:
#([\s\-_]|^)([a-z0-9-_]+)#i
If you want to use pure regular expressions you must use the \u.
To transform this string:
This Is A Test For-stackoverflow
into
This Is A Test For-Stackoverflow
You must put:
(.+)-(.+) to capture the values before and after the "-"
then to replace it you must put:
$1-\u$2
If it is in bash you must put:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
Actually, you don't need to match the full string; just match the first non-uppercase letter, like this:
'~\b([a-z])~'
For JavaScript, here’s a solution that works across different languages and alphabets:
const originalString = "this is a test for-stackoverflow"
const processedString = originalString.replace(/(?:^|\s|[-"'([{])+\S/g, (c) => c.toUpperCase())
It matches any non-whitespace character \S that is preceded by the start of the string ^, whitespace \s, or any of the characters -"'([{, and replaces it with its uppercase variant.
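A rough Python adaptation of the same idea is sketched below. It is not a literal port: it uses a lookbehind so that only the letter itself is matched, and the names OPENERS and capitalize_words are invented for this example.

```python
import re

# Characters after which the next character should be capitalized
# (same set as the JavaScript answer: dash, quotes and opening brackets).
OPENERS = "-\"'([{"

# Match one non-whitespace character at the start of the string, or right
# after whitespace or one of the OPENERS characters.
CAP_RE = re.compile(r"(?:^|(?<=[\s" + re.escape(OPENERS) + r"]))\S")

def capitalize_words(text):
    """Uppercase the first character of every word, including after a dash."""
    return CAP_RE.sub(lambda m: m.group().upper(), text)

if __name__ == "__main__":
    print(capitalize_words("this is a test for-stackoverflow"))
```

Because str.upper() works on Unicode strings, accented letters should be handled without listing them in the pattern.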
My solution using JavaScript:
function capitalize(str) {
    var reg = /\b([a-zÁ-ú]{3,})/g;
    return str.replace(reg, (w) => w.charAt(0).toUpperCase() + w.slice(1));
}
with es6 + javascript
const capitalize = str =>
str.replace(/\b([a-zÁ-ú]{3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
/<expression-here>/g
[a-zÁ-ú] first, I match lowercase letters plus accented letters (the range Á-ú also covers accented capitals).
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
[a-zÁ-ú]{3,} next, I skip runs of letters shorter than three characters.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
\b([a-zÁ-ú]{3,}) lastly, \b anchors the match at a word boundary so that only whole words are selected. The () group isolates the expression.
ex: sábado de Janeiro às 19h. sexta-feira de janeiro às 21 e horas
After this, the change applies only to the words that start in lower case; for each match w:
w.charAt(0).toUpperCase() + w.slice(1); // output -> Output
Joining the two:
str.replace(/\b(([a-zÁ-ú]){3,})/g, (w) => w.charAt(0).toUpperCase() + w.slice(1));
result:
Sábado de Janeiro às 19h. Sexta-Feira de Janeiro às 21 e Horas
Python solution:
>>> import re
>>> the_string = 'this is a test for stack-overflow'
>>> re.sub(r'(((?<=\s)|^|-)[a-z])', lambda x: x.group().upper(), the_string)
'This Is A Test For Stack-Overflow'
read about the "positive lookbehind"
While this answer for a pure Regular Expression solution is accurate:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\2/'
it should be noted that when using any of the case-change operators:
\l Change case of only the first character to the right to lowercase. (Note: lowercase 'L')
\L Change case of all text to the right to lowercase.
\u Change case of only the first character to the right to uppercase.
\U Change case of all text to the right to uppercase.
the end delimiter should be used:
\E
so the end result should be:
echo "This Is A Test For-stackoverflow" | sed 's/\(.\)-\(.\)/\1-\u\E\2/'
This will make
R.E.A.C De Boeremeakers
from
r.e.a.c de boeremeakers
using the pattern
(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])
with the following code:
Dim matches As MatchCollection = Regex.Matches(inputText, "(?<=\A|[ .])(?<up>[a-z])(?=[a-z. ])")
Dim outputText As New StringBuilder
If matches(0).Index > 0 Then outputText.Append(inputText.Substring(0, matches(0).Index))
index = matches(0).Index + matches(0).Length
For Each Match As Match In matches
    Try
        outputText.Append(UCase(Match.Value))
        outputText.Append(inputText.Substring(Match.Index + 1, Match.NextMatch.Index - Match.Index - 1))
    Catch ex As Exception
        outputText.Append(inputText.Substring(Match.Index + 1, inputText.Length - Match.Index - 1))
    End Try
Next
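The manual loop over matches can usually be replaced by a single substitution with a callback. The Python sketch below illustrates that with an equivalent pattern; the lookbehind is rewritten as (?:\A|(?<=[ .])) because Python's re module requires fixed-width lookbehinds, and capitalize_initials is a hypothetical name.

```python
import re

# A lowercase letter at the start of the string or right after a space or dot,
# and followed by a lowercase letter, a dot, or a space.
INITIAL_RE = re.compile(r"(?:\A|(?<=[ .]))[a-z](?=[a-z. ])")

def capitalize_initials(text):
    """Uppercase word-initial letters, including each letter of a dotted abbreviation."""
    return INITIAL_RE.sub(lambda m: m.group().upper(), text)

if __name__ == "__main__":
    print(capitalize_initials("r.e.a.c de boeremeakers"))
```

In .NET, Regex.Replace with a MatchEvaluator should allow the same one-call approach.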