Conditionally Remove Character of a Vector Element in R - regex

I have (sometimes incomplete) data on addresses that looks like this:
data <- c("1600 Pennsylvania Avenue, Washington DC",
",Siem Reap,FC,", "11 Wall Street, New York, NY", ",Addis Ababa,FC,")
I need to remove the first and/or last character if either one of them is a comma.
So far, I have:
for(i in 1:length(data)){
lastchar <- nchar(data[i])
sec2last <- nchar(data[i]) - 1
if(regexpr(",",data[i])[1] == 1){
data[i] <- substr(data[i],2, lastchar)
}
if(regexpr(",",data[i])[1] == nchar(data[i])){
data[i] <- substr(data[i],1, sec2last)
}
}
data
which works for the first character, but not the last character. How can I modify the second if statement or otherwise accomplish my goal?

You could try the code below, which removes a comma at the start or at the end of each string:
> data <- c("1600 Pennsylvania Avenue, Washington DC",
+ ",Siem Reap,FC,", "11 Wall Street, New York, NY", ",Addis Ababa,FC,")
> gsub("(?<=^),|,(?=$)", "", data, perl=TRUE)
[1] "1600 Pennsylvania Avenue, Washington DC"
[2] "Siem Reap,FC"
[3] "11 Wall Street, New York, NY"
[4] "Addis Ababa,FC"
Pattern explanation:
(?<=^), The (?<=...) construct is a positive lookbehind. Here it asserts that what precedes the comma is the start of the line (^), so this alternative matches a leading comma.
| The alternation (logical OR) operator, which combines the two patterns.
,(?=$) The positive lookahead asserts that what follows the comma is the end of the line ($), so this alternative matches a trailing comma.
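Because ^ and $ are already zero-width anchors, the lookarounds are not strictly needed. The following is a minimal sketch, assuming the same data vector as in the question, of a simpler pattern that should give the same result. It also checks why the second if in the question's loop misses the trailing comma: regexpr() reports only the first match.

```r
data <- c("1600 Pennsylvania Avenue, Washington DC",
          ",Siem Reap,FC,", "11 Wall Street, New York, NY", ",Addis Ababa,FC,")

# Simpler pattern: a comma at the very start or at the very end
cleaned <- gsub("^,|,$", "", data)

# The lookaround version from the answer, for comparison
with_lookarounds <- gsub("(?<=^),|,(?=$)", "", data, perl = TRUE)
identical(cleaned, with_lookarounds)

# regexpr() returns the position of the FIRST comma only, so comparing it
# with nchar() succeeds only when the string contains a single, final comma
first_comma <- regexpr(",", "Siem Reap,FC,")[1]
last_pos <- nchar("Siem Reap,FC,")
```

Inside the loop, a test such as substr(data[i], nchar(data[i]), nchar(data[i])) == "," would check the last character directly.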

Related

Regex extract string based on String match

I have data with messy addresses that contain a province, district, and ward, not always in the same order:
Name ADDRESS
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY
Store3 98 Phan Xich Long- P. 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5
Store5 22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
//Replace each \ with \\ so that C# doesn't treat \ as escape character
//Pattern: Start of string, any integers, 0 or 1 letter, end of word
string sPattern = "^[0-9]+([A-Za-z]\\b)?";
string sString = Row.ADDRESS ?? ""; //Coalesce to empty string if NULL
//Find any matches of the pattern in the string
Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
//If a match is found
if (match.Success)
//Return the first match into the new
//HouseNumber field
Row.ward= match.Groups[0].Value;
else
//If not found, leave the HouseNumber blank
Row.ward= "";
}
}
I would like to modify my regex to return the data like this in the ward column (note the synonyms in my addresses: Phuong, P., Ward, etc.):
Name ADDRESS ward
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Quang Ngai Phuong Nguyen Nghiem
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY Phuong THANH KHE TAY
Store3 98 Phan Xich Long- P. 2 Phuong 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5 Phuong 5
Store5 22, Ngo 421/16,--. To 42, Phuong Trung Hoa, Quan Cau Giay Phuong Trung Hoa
I use that regular expression to extract the civic number, but is there a way to modify it so that it returns the data in my ward column as in the example above?
The groups in this regex, as tested at https://regex101.com/, match the data in your ward column as shown in your example. It only matches the patterns as they appear in your sample data, so you may need to define more precisely where each one can occur, but it may be enough for you to extrapolate to the regex that you really need.
(Phuong.*),|P\.(.*$)|Ward - (.*$)
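The group in option 1 matches from Phuong (inclusive) up to the last comma that follows it, because .* is greedy; in the sample data only one comma follows Phuong. Use .*? to stop at the first comma instead.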
The group in option 2 matches anything that comes after P. until the end of the string.
The group in option 3 matches anything that comes after Ward - until the end of the string.
This one is a bit more advanced, but it only matches what you mentioned in your examples, no groups:
Phuong.*(?=,)|(?<=P\.).*$|(?<=Ward - ).*$
Test it in https://regex101.com to see how it works and to see what each part means.
Finally, you may want to exclude Phuong from the match in option 1 so that your program can always print Phuong followed by the match.
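As a rough illustration of that last suggestion, here is a sketch in R rather than C#. The combined pattern and the variable names are assumptions made for this example, not part of the original answer, and it is only checked against the five sample addresses. It captures the text after any of the three prefixes up to the next comma, then always prints Phuong followed by the capture.

```r
addresses <- c(
  "453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai",
  "13 DUNG SY THANH KHE, P. THANH KHE TAY",
  "98 Phan Xich Long- P. 2",
  "306 B4, NGUYENVAN LINH, Ward - 5",
  "22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay"
)

# Hypothetical combined pattern: one of the prefixes, optional spaces,
# then everything up to the next comma (or the end of the string)
pattern <- "(?:Phuong|P\\.|Ward\\s*-)\\s*([^,]+)"

# Pull out the first match in each address
# (regmatches() silently drops addresses with no match; every sample has one)
matched <- regmatches(addresses, regexpr(pattern, addresses, perl = TRUE))

# Normalise every prefix to "Phuong"
ward <- sub(pattern, "Phuong \\1", matched, perl = TRUE)
ward
```

The same pattern string should carry over to .NET's Regex class, since it uses only non-capturing groups and character classes, but that has not been tested here.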

Remove everything except period and numbers from string regex in R

I know there are many questions on Stack Overflow about regex, but I cannot accomplish this one easy task with the help I've found. Here's my data:
a<-c("Los Angeles, CA","New York, NY", "San Jose, CA")
b<-c("c(34.0522, 118.2437)","c(40.7128, 74.0059)","c(37.3382, 121.8863)")
df<-data.frame(a,b)
df
a b
1 Los Angeles, CA c(34.0522, 118.2437)
2 New York, NY c(40.7128, 74.0059)
3 San Jose, CA c(37.3382, 121.8863)
I would like to remove everything but the numbers and the period (i.e. remove "c", ")" and "("). This is what I've tried thus far:
str_replace(df$b,"[^0-9.]","" )
[1] "(34.0522, 118.2437)" "(40.7128, 74.0059)" "(37.3382, 121.8863)"
str_replace(df$b,"[^\\d\\)]+","" )
[1] "34.0522, 118.2437)" "40.7128, 74.0059)" "37.3382, 121.8863)"
Not sure what's left to try. I would like to end up with the following:
[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
Thanks.
If I understand you correctly, this is what you want:
df$b <- gsub("[^[:digit:]., ]", "", df$b)
or:
df$b <- strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")
> df
a b
1 Los Angeles, CA 34.0522, 118.2437
2 New York, NY 40.7128, 74.0059
3 San Jose, CA 37.3382, 121.8863
or if you want all the "numbers" as a numeric vector:
as.numeric(unlist(strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")))
[1] 34.0522 118.2437 40.7128 74.0059 37.3382 121.8863
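The attempts in the question fall short because str_replace(), like base R's sub(), replaces only the first match in each string; str_replace_all() is the stringr counterpart of gsub(). A small base-R sketch of the difference, using the b vector from the question:

```r
b <- c("c(34.0522, 118.2437)", "c(40.7128, 74.0059)", "c(37.3382, 121.8863)")

# sub() stops after the first match, so only the leading "c" is removed
first_only <- sub("[^0-9.]", "", b)

# gsub() replaces every match; the comma and space are kept by adding
# them to the negated character class
all_matches <- gsub("[^0-9., ]", "", b)
```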
Try this
gsub("[\\c|\\(|\\)]", "",df$b)
#[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
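Inside a character class, alternation bars are not needed and parentheses do not need escaping; a | written there is treated as a literal character to remove. A hedged sketch of a tidier class that should produce the same output for this data:

```r
b <- c("c(34.0522, 118.2437)", "c(40.7128, 74.0059)", "c(37.3382, 121.8863)")

# The class lists exactly the three characters to delete
tidy <- gsub("[c()]", "", b)

# A "|" inside a class is a literal pipe, not an OR operator
pipe_removed <- gsub("[c|()]", "", "a|b")
```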
Not a regular expression solution, but a simple one.
The elements of b are R expressions, so loop over each element, parsing it, then creating the string you want.
vapply(
b,
function(bi)
{
toString(eval(parse(text = bi)))
},
character(1)
)
Here is another option with str_extract_all from stringr. Extract the numeric part using str_extract_all into a list, convert to numeric, rbind the list elements and cbind it with the first column of 'df'
library(stringr)
cbind(df[1], do.call(rbind,
lapply(str_extract_all(df$b, "[0-9.]+"), as.numeric)))
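For readers without stringr, the same extraction can be sketched in base R with gregexpr() and regmatches(). The variable names below are chosen for this example only.

```r
b <- c("c(34.0522, 118.2437)", "c(40.7128, 74.0059)", "c(37.3382, 121.8863)")

# One character vector of numeric-looking substrings per element of b
parts <- regmatches(b, gregexpr("[0-9.]+", b))

# Convert each to numeric and stack the pairs into a two-column matrix
coords <- do.call(rbind, lapply(parts, as.numeric))
coords
```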

How to Trim a Leading and Trailing char in regular expressions?

I need to trim a leading or trailing character from a fixed-length column.
For example, the column IdNumber has a fixed length of 11, with values like these:
X3343438594
7743438534X
I want to trim the leading and trailing X, and result should look like this.
3343438594
7743438534
Try this:
Search: ^X(?=\d{10}$)|(?<=^\d{10})X$
Replace: <blank>
Regex breakdown:
^X means "start of input then X"
(?=\d{10}$) means "followed by 10 digits then end"
| means "logical OR"
(?<=^\d{10}) means "preceded by start then 10 digits"
X$ means "X then end of input"
You want to delete all matches, so replace them with nothing.
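A minimal sketch of this search-and-replace in R, assuming, as in the sample values, that each ID consists of ten digits plus a single X (eleven characters in total). The ids vector is made up from the question's two examples.

```r
ids <- c("X3343438594", "7743438534X")

# Leading X followed by exactly ten digits, or trailing X preceded by
# exactly ten digits; perl = TRUE is required for the lookarounds
trimmed <- gsub("^X(?=\\d{10}$)|(?<=^\\d{10})X$", "", ids, perl = TRUE)
trimmed

# A value that does not fit the fixed-length shape is left untouched
untouched <- gsub("^X(?=\\d{10}$)|(?<=^\\d{10})X$", "", "X12345", perl = TRUE)
```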
var re = /(?=^X|X$)(([A-Z])(\d{10})(\s)(\d{10})([A-Z]))/;
var str = 'X3343438594 7743438534X';
var subst = '$3$4$5';
var result = str.replace(re, subst);
alert(result);
The regex first asserts that the string should have an X at the beginning or at the end, regardless of the length of your data (not necessarily 11 characters). If that's the case, it tests for a pattern that starts with one letter, followed by 10 digits (totalling 11 characters), then a space, then ten digits followed by one letter (another 11 characters).

How to find the longest string in a text using regex in R

Given a string x, I can count the number of words in it using gregexpr("[A-Za-z]\\w+", x).
> x<-"\n\n\n\n\n\nMasters Publics\n\n\n\n\n\n\n\n\n\n\n\n\nMasters Universitaires et Prives au Maroc\n\n\n\n\n\n\n\n\\n\n\n\n\nMasters Par Ville\n\n\n\n\n\n\n\n\n\n\n\n\n"
> sapply(gregexpr("[A-Za-z]\\w+", x), function(x) sum(x > 0))
[1] 11
However, how can I retrieve the number of words in the longest run of text (words separated by spaces, not by \n) using regex in R?
In this example it would be "Masters Universitaires et Prives au Maroc", which has 6 words.
Thanks in advance.
I would solve it with
x <- "\n\n\n\n\n\nMasters Publics\n\n\n\n\n\n\n\n\n\n\n\n\nMasters Universitaires et Prives au Maroc\n\n\n\n\n\n\n\n\\n\n\n\n\nMasters Par Ville\n\n\n\n\n\n\n\n\n\n\n\n\n"
max(nchar(gsub("[^ ]+", "", unlist(strsplit(trimws(x), "\n+"))))) + 1
Split a trimmed string into lines, unlist the result, remove all characters other than a space, get the longest item and add one. The [^ ]+ is a regex that matches one or more (due to the + quantifier) characters other than (as [^...] is a negated character class) a space.
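The one-liner can be unpacked step by step. The sketch below uses a shortened version of the input string and intermediate variable names that are assumptions for this example; it also recovers the longest phrase itself.

```r
x <- "\n\nMasters Publics\n\n\nMasters Universitaires et Prives au Maroc\n\nMasters Par Ville\n\n"

# 1. Trim surrounding whitespace and split on runs of newlines
phrases <- unlist(strsplit(trimws(x), "\n+"))

# 2. Delete everything that is not a space, leaving one space per word gap
spaces_only <- gsub("[^ ]+", "", phrases)

# 3. Words per phrase = number of spaces + 1
word_counts <- nchar(spaces_only) + 1

# 4. The largest count and the phrase it belongs to
max(word_counts)
longest <- phrases[which.max(word_counts)]
```

Counting spaces assumes that words are separated by single spaces; repeated spaces would inflate the count.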
See IDEONE demo.
Load the package
library(stringr)
Create a new dataset, extracting and splitting the phrases
data <- unlist(str_split(x, pattern="\n", n = Inf))
index <- lapply(data, nchar)
index <- index !=0
# extract the maximum length of the phrase
max(sapply(gregexpr("\\W+", data[index]), length) + 1)
[1] 6
# just checking
data[index]
[1] "Masters Publics"
[2] "Masters Universitaires et Prives au Maroc"
[3] "\\n"
[4] "Masters Par Ville"

How to erase all non-letter characters before first letter (R vector of character strings)

I have a vector of character strings:
cities <- c("London", "001 London", "Stockholm", "002 Stockholm")
I need to erase anything in each string that precedes first letter so that I would have:
cities <- c("London", "London", "Stockholm", "Stockholm")
I've tried e.g. this
cities <- sub("^.*?[a-zA-Z]", "", cities)
but that erases the first letter too, which I don't want to happen.
Use
cities <- c("London", "001 London", "Stockholm", "002 Stockholm")
gsub("^\\P{L}*", "", cities, perl=T)
See IDEONE demo
The ^\\P{L}* regex means:
^ - Assert the beginning of the string
\\P{L}* - 0 or more characters other than a letter.
This solution is preferable if you have city names starting with Unicode letters.
Use a negated character class to match all the non-alphabetic characters at the start of the string.
cities <- sub("^[^a-zA-Z]*", "", cities)
or
Use capturing group to capture the first letter character.
cities <- sub("^.*?([a-zA-Z])", "\\1", cities)
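To see why the attempt in the question drops the first letter and how the two fixes compare, here is a short sketch. perl = TRUE is added to the patterns with a lazy quantifier as an assumption of this example, so that .*? follows PCRE rules.

```r
cities <- c("London", "001 London", "Stockholm", "002 Stockholm")

# The question's pattern: the first letter is part of the match, so it is
# deleted along with everything before it
attempt <- sub("^.*?[a-zA-Z]", "", cities, perl = TRUE)

# Fix 1: match only the leading non-letters
negated <- sub("^[^a-zA-Z]*", "", cities)

# Fix 2: match up to the first letter, but put the captured letter back
captured <- sub("^.*?([a-zA-Z])", "\\1", cities, perl = TRUE)

identical(negated, captured)
```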
Delete the digits (note that this leaves the leading space in place):
gsub('\\d+','',cities)
[1] "London" " London" "Stockholm" " Stockholm"