How to get a match when only one letter difference is allowed? - regex

I want to check whether words in my dataset appear in a certain text. grepl only gives exact matches. With agrepl it is possible to do partial matching. However, I don't get the desired results with it.
Example data:
dt <- structure(list(id = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L),
words = c("weg", "verte", "spiegelend", "spiegeld", "einde", "spiegel", "spiegelende", "weg", "spiegelend", "asfalt", "fata", "morgana")),
.Names = c("id", "words"), row.names = c(NA, -12L), class = c("data.table", "data.frame"))
With:
dt <- dt[, .(id, words,
match1=mapply(grepl, words,
"hoe komt het dat de weg in de verte soms spiegelend lijkt"),
match2=mapply(agrepl, words,
"hoe komt het dat de weg in de verte soms spiegelend lijkt",
MoreArgs=list(max.distance=1L)))]
I get:
> dt
id words match1 match2
1: 0 weg TRUE TRUE
2: 0 verte TRUE TRUE
3: 0 spiegelend TRUE TRUE
4: 0 spiegeld FALSE TRUE
5: 0 einde FALSE FALSE
6: 0 spiegel TRUE TRUE
7: 0 spiegelende FALSE TRUE
8: 1 weg TRUE TRUE
9: 1 spiegelend TRUE TRUE
10: 1 asfalt FALSE FALSE
11: 1 fata FALSE FALSE
12: 1 morgana FALSE FALSE
As you can see, the results from grepl and agrepl differ on rows 4 and 7. However, I only want a match when there is at maximum one letter difference. The match in row 4 for match2 should therefore be FALSE. Changing parameters like max.distance or costs doesn't lead to the desired result either. Moreover, both matches on row 6 should be FALSE as well.
For example: for the word "spiegelend" from the text, the word "spiegelende" should give a match (only one letter difference), but the word "spiegeld" (two letters difference) and the word "spiegel" (three letters difference) should not give a match.
The following conditions are allowed (but not at the same time):
one letter more (e.g.: "spiegelende" should give a match), or
one letter less (e.g.: "spiegelen" should give a match), or
one spelling error (e.g.: "spiehelend" should give a match)
Any ideas on how to solve this problem?

Two ways to solve it, matching the approaches by nongkrong and RHertel:
dt <- cbind(dt[,c("id", "words")],
match1=mapply(grepl, dt$words,
"hoe komt het dat de weg in de verte soms spiegelend lijkt"),
match2=mapply(agrepl, dt$words,
"hoe komt het dat de weg in de verte soms spiegelend lijkt",
MoreArgs=list(max.distance=1L)),
match3=mapply(agrepl, paste0("\\b",dt$words,"\\b"),
"hoe komt het dat de weg in de verte soms spiegelend lijkt",
MoreArgs=list(max.distance=1L, fixed=FALSE)),
match4=apply(adist( dt$words, unlist(strsplit("hoe komt het dat de weg in de verte soms spiegelend lijkt", split=" "))),
1, function (x) any(x<=1))
)
match3 wraps each word in the word boundary \\b, while match4 requires an edit distance (adist) of at most 1 to one of the individual words of the text.
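As a minimal sketch of the match4 idea, the snippet below checks the example words from the question against the individual words of the text. The names text_words, candidates and close_match are illustrative and not part of the original code.

```r
# Text from the question, split into single words
text <- "hoe komt het dat de weg in de verte soms spiegelend lijkt"
text_words <- unlist(strsplit(text, " ", fixed = TRUE))

# Example words from the question: one letter more, one letter less,
# one spelling error, two letters difference, three letters difference
candidates <- c("spiegelende", "spiegelen", "spiehelend", "spiegeld", "spiegel")

# adist() returns a matrix of edit distances (rows: candidates, columns: text words)
d <- adist(candidates, text_words)

# A candidate matches if it is within edit distance 1 of at least one text word
close_match <- apply(d, 1, function(x) any(x <= 1))
setNames(close_match, candidates)
```

Only the first three candidates are within distance 1 of "spiegelend", which corresponds to the three conditions listed in the question.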

I thought about using adist() in this case with the condition < 2. But I'm not sure if it yields the expected output. Does this help?
idx <- which(adist(dt$words,dt2$words) < 2, arr.ind = TRUE)
dt$match <- (dt$words %in% dt2$words[idx[,2]])
#> dt
# id words match
#1 0 weg TRUE
#2 0 verte TRUE
#3 0 spiegelend TRUE
#4 0 spiegeld FALSE
#5 0 einde FALSE
#6 0 spiegel FALSE
#7 0 spiegelende FALSE
#8 1 weg TRUE
#9 1 spiegelend TRUE
#10 1 asfalt FALSE
#11 1 fata FALSE
#12 1 morgana FALSE
data
dt <- structure(list(id = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L),
words = c("weg", "verte", "spiegelend", "spiegeld", "einde", "spiegel", "spiegelende", "weg", "spiegelend", "asfalt", "fata", "morgana")),
.Names = c("id", "words"), row.names = c(NA, -12L), class = c("data.table", "data.frame"))
dt2 <- structure(list(id = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L),
words = c("hoe", "komt", "het", "dat", "de", "weg", "in", "de", "verte", "soms", "spiegelend", "lijkt")),
.Names = c("id", "words"), row.names = c(NA, -12L), class = c("data.table", "data.frame"))
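Note that %in% in the code above tests whether a word from dt is itself one of the matched text words, which is why "spiegelende" comes out as FALSE there. If every row within distance 1 of some text word should be flagged instead, one possible variant uses the row indices returned by which(). This is only a sketch, with the data reduced to plain vectors (words and text_words are illustrative names):

```r
words <- c("weg", "verte", "spiegelend", "spiegeld", "einde", "spiegel", "spiegelende")
text_words <- c("hoe", "komt", "het", "dat", "de", "weg", "in", "de",
                "verte", "soms", "spiegelend", "lijkt")

# Positions (row = index into words, col = index into text_words) with distance < 2
idx <- which(adist(words, text_words) < 2, arr.ind = TRUE)

# Flag a word if its row index occurs at least once among the close pairs
match <- seq_along(words) %in% idx[, 1]
data.frame(words, match)
```

With this variant "spiegelende" is flagged, while "spiegeld" and "spiegel" are not.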


Select value in column that matches value in list (UPDATED FOR CLARITY)

If I have a column of street addresses and want to select only the address's directional, what syntax would I use to accomplish that in Excel Power Query?
For instance, how do I get "NE" from "357 Pyrite Dr NE" even if the address is incorrectly formatted as "357 NE Pyrite Dr" or "357 Pyrite NE Dr"? Likewise, how would I get "NW" from "506 Mark NW St"?
As far as I can figure out, I would hit add column > custom column and enter a syntax similar to the following...
= if List.ContainsAny([Address], {"NE", "NW", "SE", "SW"}) = TRUE then Text.Select([Address], {"NE", "NW", "SE", "SW"} else null
...except I know that's not the correct syntax since it always produces an error. The same thing happens when I replace "Text.Select" with "List.Select" in the above formula.
For greater clarification, I'm posting the query as it stands now, whittled down to one column from a table with 100 columns and 4000 rows:
let
Source = q_NMAACC,
#"Removed Other Columns" = Table.SelectColumns(Source,{"Address - Street 1", "Address - Street 2"}),
#"Merged Columns" = Table.CombineColumns(#"Removed Other Columns",{"Address - Street 1", "Address - Street 2"},Combiner.CombineTextByDelimiter(" ", QuoteStyle.None),"Street Address"),
#"Trimmed Text" = Table.TransformColumns(#"Merged Columns",{{"Street Address", Text.Trim, type text}}),
#"Filtered Rows" = Table.SelectRows(#"Trimmed Text", each [Street Address] <> null and [Street Address] <> "")
in
#"Filtered Rows"
Here are the first 25 rows to give you some data to work off.
Street Address
PO Box 3416 Nr57 #165a
1016 Copper NE Ave Apt C
217 Garcia St NE
232 17th St SE
560 60th St NW
2935 Madeira Dr NE
9677 Eagle Ranch Rd NW Apt 415
5320 Roanoke Ave NW
17 Hwy 304
HCR 79 Box 46
6524 Camino Rojo
3518 Vail Ave SE
6412 Torreon Dr NE
6136 Flor de Rio Ct NW
1712 36th Street SE
734 Columbia Street
716 Morning Meadows Dr NE
6601 Tennyson St NE Apt 10207
Alamo - Rio Salado PO Box 804
206 Aragon Rd
6901 Verano Ct NW
6709 Siesta Pl NE
10 Meadow Hills Loop
98 Avenida Jardin
6903 Prairie Rd NE Apt 216
Try
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
List={"NE","NW","SW","SE"},
LocateTable = Table.FromList(List, null, {"Locate"}),
Find = Table.AddColumn(Source, "Found", (x) => Text.Combine(Table.SelectRows(LocateTable, each Text.Contains(x[Address],[Locate], Comparer.OrdinalIgnoreCase))[Locate],", "))
in Find
You could also use another table to contain the search criteria
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
Find = Table.AddColumn(Source, "Found", (x) => Text.Combine(Table.SelectRows(LocateTable, each Text.Contains(x[Address],[Locate], Comparer.OrdinalIgnoreCase))[Locate],", "))
in Find
The Comparer.OrdinalIgnoreCase argument makes the comparison case-insensitive; remove it if you want a case-sensitive match.

R - extract all strings matching pattern and create relational table

I am looking for a shorter and more pretty solution (possibly in tidyverse) to the following problem. I have a data.frame "data":
id string
1 A 1.001 xxx 123.123
2 B 23,45 lorem ipsum
3 C donald trump
4 D ssss 134, 1,45
What I wanted to do is to extract all numbers (no matter if the delimiter is "." or "," -> in this case I assume that string "134, 1,45" can be extracted into two numbers: 134 and 1.45) and create a data.frame "output" looking similar to this:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
I managed to do this (code below), but the solution looks pretty ugly to me and is not very efficient (two for-loops). Could someone suggest a better way to do this (preferably using dplyr)?
# data
data <- data.frame(id = c("A", "B", "C", "D"),
string = c("1.001 xxx 123.123",
"23,45 lorem ipsum",
"donald trump",
"ssss 134, 1,45"),
stringsAsFactors = FALSE)
# creating empty data.frame
len <- length(unlist(sapply(data$string, function(x) gregexpr("[0-9]+[,|.]?[0-9]*", x))))
output <- data.frame(id = rep(NA, len), string = rep(NA, len))
# main solution
start = 0
for(i in 1:dim(data)[1]){
tmp_len <- length(unlist(gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i])))
for(j in (start+1):(start+tmp_len)){
output[j,1] <- data$id[i]
output[j,2] <- regmatches(data$string[i], gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i]))[[1]][j-start]
}
start = start + tmp_len
}
# further modifications
output$string <- gsub(",", ".", output$string)
output$string <- as.numeric(ifelse(substring(output$string, nchar(output$string), nchar(output$string)) == ".",
substring(output$string, 1, nchar(output$string) - 1),
output$string))
output
1) Base R This uses relatively simple regular expressions and no packages.
In the first 2 lines of code replace any comma followed by a space with a
space and then replace all remaining commas with a dot. After these two lines s will be: c("1.001 xxx 123.123", "23.45 lorem ipsum", "donald trump", "ssss 134 1.45")
In the next 4 lines of code trim whitespace from beginning and end of each string field and split the string field on whitespace producing a
list. grep out those elements consisting only of digits and dots. (The regular expression ^[0-9.]*$ matches the start of a word followed by zero or more digits or dots followed by the end of the word so only words containing only those characters are matched.) Replace any zero length components with NA. Finally add data$id as the names. After these 4 lines are run the list L will be list(A = c("1.001", "123.123"), B = "23.45", C = NA, D = c("134", "1.45")) .
In the last line of code convert the list L to a data frame with the appropriate names.
s <- gsub(", ", " ", data$string)
s <- gsub(",", ".", s)
L <- strsplit(trimws(s), "\\s+")
L <- lapply(L, grep, pattern = "^[0-9.]*$", value = TRUE)
L <- ifelse(lengths(L), L, NA)
names(L) <- data$id
with(stack(L), data.frame(id = ind, string = values))
giving:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
2) magrittr This variation of (1) writes it as a magrittr pipeline.
library(magrittr)
data %>%
transform(string = gsub(", ", " ", string)) %>%
transform(string = gsub(",", ".", string)) %>%
transform(string = trimws(string)) %>%
with(setNames(strsplit(string, "\\s+"), id)) %>%
lapply(grep, pattern = "^[0-9.]*$", value = TRUE) %>%
replace(lengths(.) == 0, NA) %>%
stack() %>%
with(data.frame(id = ind, string = values))
3) dplyr/tidyr This is an alternate pipeline solution using dplyr and tidyr. unnest converts to long form, id is made factor so that we can later use complete to recover id's that are removed by subsequent filtering, the filter removes junk rows and complete inserts NA rows for each id that would otherwise not appear.
library(dplyr)
library(tidyr)
data %>%
mutate(string = gsub(", ", " ", string)) %>%
mutate(string = gsub(",", ".", string)) %>%
mutate(string = trimws(string)) %>%
mutate(string = strsplit(string, "\\s+")) %>%
unnest(string) %>%
mutate(id = factor(id)) %>%
filter(grepl("^[0-9.]*$", string)) %>%
complete(id)
4) data.table
library(data.table)
DT <- as.data.table(data)
DT[, string := gsub(", ", " ", string)][,
string := gsub(",", ".", string)][,
string := trimws(string)][,
string := setNames(strsplit(string, "\\s+"), id)][,
list(string = list(grep("^[0-9.]*$", unlist(string), value = TRUE))), by = id][,
list(string = if (length(unlist(string))) unlist(string) else NA_character_), by = id]
DT
Update Removed assumption that junk words do not have digit or dot. Also added (2), (3) and (4) and some improvements.
We can replace the , in between the numbers with . (using gsub), extract the numbers with str_extract_all (from stringr into a list), replace the list elements that have length equal to 0 with NA, set the names of the list with 'id' column, stack to convert the list to data.frame and rename the columns.
library(stringr)
setNames(stack(setNames(lapply(str_extract_all(gsub("(?<=[0-9]),(?=[0-9])", ".",
data$string, perl = TRUE), "[0-9.]+"), function(x)
if(length(x)==0) NA else as.numeric(x)), data$id))[2:1], c("id", "string"))
# id string
#1 A 1.001
#2 A 123.123
#3 B 23.45
#4 C NA
#5 D 134
#6 D 1.45
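The same idea can also be sketched without stringr, using regmatches() and gregexpr() from base R. This is an illustrative variant rather than the answer's own code; the names s, nums and output are chosen here, and the pattern assumes numbers have the form digits, optionally followed by a dot and more digits.

```r
data <- data.frame(id = c("A", "B", "C", "D"),
                   string = c("1.001 xxx 123.123",
                              "23,45 lorem ipsum",
                              "donald trump",
                              "ssss 134, 1,45"),
                   stringsAsFactors = FALSE)

# Turn a comma that sits between two digits into a decimal point
s <- gsub("(?<=[0-9]),(?=[0-9])", ".", data$string, perl = TRUE)

# Extract every number from each string (a list with one element per row)
nums <- regmatches(s, gregexpr("[0-9]+(\\.[0-9]+)?", s))

# Rows without any number get a single NA so that their id is kept
nums <- lapply(nums, function(x) if (length(x) == 0) NA_real_ else as.numeric(x))

# Repeat each id once per extracted number and flatten the list
output <- data.frame(id = rep(data$id, lengths(nums)), string = unlist(nums))
output
```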
Same idea as Gabor's. I had hoped to use R's built-in parsing of strings (type.convert, used in read.table) rather than writing custom regex substitutions:
sp = setNames(strsplit(data$string, " "), data$id)
spc = lapply(sp, function(x) {
x = x[grep("[^0-9.,]$", x, invert=TRUE)]
if (!length(x))
NA_real_
else
mapply(type.convert, x, dec=gsub("[^.,]", "", x), USE.NAMES=FALSE)
})
setNames(rev(stack(spc)), names(data))
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
Unfortunately, type.convert is not robust enough to consider both decimal delimiters at once, so we need this mapply malarkey instead of type.convert(x, dec = "[.,]").

Find partial occurrences in data frame based on a vector

I've got a dataframe a and a vector b (derived from another data frame). Now I want to find all occurrences from vector b in a.
However, unfortunately vector b sometimes misses a leading character.
a <- structure(list(GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497", "ZH0080", "TP5146", "TP5146"), PIG_ID = c("129287133", "120561144", "119265685", "121883198", "109371743", "109371743" ), SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B")), .Names = c("GSN_IDENTITY_CODE", "PIG_ID", "SEX_CODE"), row.names = c(NA, 6L), class = "data.frame")
> a
#  GSN_IDENTITY_CODE    PIG_ID SEX_CODE
#1 01234567 129287133 Z
#2 65461341 120561144 Z
#3 NH1497 119265685 Z
#4 ZH0080 121883198 Z
#5 TP5146 109371743 B
#6 TP5146 109371743 B
b <- c("65461341", "1234567", "ZH0080", "TP5146")
My expected output would be this:
a
# GSN_IDENTITY_CODE PIG_ID SEX_CODE
#1 01234567 129287133 Z
#2 65461341 120561144 Z
#4 ZH0080 121883198 Z
#5 TP5146 109371743 B
Removing the duplicates first solves one problem, but I still need a way to select all rows that contain a value from vector b, including the partial matches:
a <- a[!duplicated(a$GSN_IDENTITY_CODE),]
Unfortunately I cannot use %in%, because it brings in duplicates and misses the first row, since it does not accept regular expressions:
> a[a$GSN_IDENTITY_CODE %in% b,]
# GSN_IDENTITY_CODE PIG_ID SEX_CODE
#2 65461341 120561144 Z
#4 ZH0080 121883198 Z
#5 TP5146 109371743 B
#6 TP5146 109371743 B
Using data.table's %like% would work only for the first string in vector b
library(data.table)
> setDT(a)
> a[a$GSN_IDENTITY_CODE %like% b,]
# GSN_IDENTITY_CODE PIG_ID SEX_CODE
#1: 65461341 120561144 Z
Warning message:
In grepl(pattern, vector) :
argument 'pattern' has length > 1 and only the first element will be used
Is there a function in R that supports my needs here?
@Frank's attempt yields the following error:
a <- structure(list(GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497", "ZH0080", "TP5146", "TP5146"), PIG_ID = c("129287133", "120561144", "119265685", "121883198", "109371743", "109371743" ), SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B")), .Names = c("GSN_IDENTITY_CODE", "PIG_ID", "SEX_CODE"), row.names = c(NA, 6L), class = "data.frame")
b <- c("65461341", "1234567", "ZH0080", "TP5146")
> a[.(b), on="GSN_IDENTITY_CODE", nomatch=FALSE, mult="first"]
Error in `[.data.frame`(a, .(b), on = "GSN_IDENTITY_CODE", nomatch = FALSE, :
unused arguments (on = "GSN_IDENTITY_CODE", nomatch = FALSE, mult = "first")
> setDT(a)
> a[.(b), on="GSN_IDENTITY_CODE", nomatch=FALSE, mult="first"]
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, :
x.'GSN_IDENTITY_CODE' is a character column being joined to i.'NA' which is type 'NULL'. Character columns must join to factor or character columns.
You can do something like this for close matches if the extra character might occur anywhere in the string:
library(stringdist)
library(purrr)
a$closest_match <- map(a$GSN_IDENTITY_CODE, ~stringdist(., b, method = "lv")) %>%
map_dbl(min)
a[a$closest_match < 2, ]
If the extra character is always at the beginning, I would do something like this:
library(stringr)
a$stripped_code <- str_replace(a$GSN_IDENTITY_CODE,"^\\d", "")
a$keep <- a$GSN_IDENTITY_CODE %in% b | a$stripped_code %in% b
a[a$keep, ]
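For comparison, the close-match idea can be sketched in base R with adist(), combined with the duplicated() filter from the question. This is a sketch under the assumption that an edit distance of at most 1 is an acceptable definition of a partial match; min_dist, keep and res are illustrative names.

```r
a <- data.frame(GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497",
                                      "ZH0080", "TP5146", "TP5146"),
                PIG_ID = c("129287133", "120561144", "119265685",
                           "121883198", "109371743", "109371743"),
                SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B"),
                stringsAsFactors = FALSE)
b <- c("65461341", "1234567", "ZH0080", "TP5146")

# Smallest edit distance from each code in a to any element of b
min_dist <- apply(adist(a$GSN_IDENTITY_CODE, b), 1, min)

# Keep rows with a close match, dropping repeated codes
keep <- min_dist <= 1 & !duplicated(a$GSN_IDENTITY_CODE)
res <- a[keep, ]
res
```

On the example data this keeps rows 1, 2, 4 and 5, which is the expected output given in the question.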

R regular expression to split string column into multiple columns

I have a column as follows in a dataframe called PeakBoundaries:
chrom
chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509
I would like to separate out the columns so that the columns look like below in a dataframe:
chr chrStart chrEnd
chr11 69464719 69502928
chr7 55075808 55093954
chr8 128739772 128762863
chr3 169389459 169490555
etc.
I have tried a regular expression approach but am not getting anywhere in terms of getting the match to enter into a new column:
PeakBoundaries$chrOnly <- PeakBoundaries[grep("\\w+?=\\:"),PeakBoundaries$chrom]
I am met with the error:
Error in [.data.frame(PeakBoundaries, grep("\w+?=\:"), PeakBoundaries$chrom) :
undefined columns selected
Try this - no regex needed, just the strsplit function:
dat <- read.table(text="chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509", stringsAsFactors=FALSE)
dat[,2:4] <- matrix(unlist(strsplit(dat[,1],split = "\\:|\\-")), ncol=3, byrow=TRUE)
colnames(dat) <- c("chrom", "chr", "chrStart", "chrEnd")
# Convert last two columns from character to numeric:
dat$chrStart <- as.numeric(dat$chrStart)
dat$chrEnd <- as.numeric(dat$chrEnd)
Results
> dat
chrom chr chrStart chrEnd
1 chr11:69464719-69502928 chr11 69464719 69502928
2 chr7:55075808-55093954 chr7 55075808 55093954
3 chr8:128739772-128762863 chr8 128739772 128762863
4 chr3:169389459-169490555 chr3 169389459 169490555
5 chr17:37848534-37877201 chr17 37848534 37877201
6 chr19:30306758-30316875 chr19 30306758 30316875
7 chr1:150496857-150678056 chr1 150496857 150678056
8 chr12:69183279-69260755 chr12 69183279 69260755
9 chr11:77610143-77641464 chr11 77610143 77641464
10 chr8:38191804-38260814 chr8 38191804 38260814
11 chr12:58135797-58156509 chr12 58135797 58156509
Edit
You could do everything using only your existing dataframe. Replace dat[,1] with PeakBoundaries$chrom and dat[,2:4] with PeakBoundaries[,(ncol(PeakBoundaries)+1):(ncol(PeakBoundaries)+3)] and you should have it!
Edit By OP
OK, so I think there's something a bit odd with my dataset, but I've sorted it out with Dominic's help, so that it is now:
PeakBoundaries <- as.data.frame(PeakBoundaries)
PeakBoundaries <- PeakBoundaries[-1,]
PeakBoundaries <- as.data.frame(PeakBoundaries)
PeakBoundaries$PeakBoundaries <-
as.character(PeakBoundaries$PeakBoundaries)
PeakBoundaries[,(ncol(PeakBoundaries)+1):(ncol(PeakBoundaries)+3)] <-
matrix(unlist(strsplit(PeakBoundaries$PeakBoundaries,
split = "\\:|\\-")), ncol=3, byrow=TRUE)
A shorter version of Dominic's answer, making the insertion a one-liner:
dat <- data.frame(chrom = readLines(textConnection("chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509")) )
library(stringr) # provides str_split
dat[, c('chr','chrStart','chrEnd')] <- t( sapply( dat$chrom, function(s) { str_split(s, '[:-]') [[1]] } ) )
dat$chrStart <- as.numeric(dat$chrStart)
dat$chrEnd <- as.numeric(dat$chrEnd)
We could try
library(tidyr)
extract(dat, chrom, into=c('chr', 'chrStart', 'chrEnd'),
'([^:]+):([^-]+)-(.*)', convert=TRUE)
# chr chrStart chrEnd
#1 chr11 69464719 69502928
#2 chr7 55075808 55093954
#3 chr8 128739772 128762863
#4 chr3 169389459 169490555
#5 chr17 37848534 37877201
#6 chr19 30306758 30316875
#7 chr1 150496857 150678056
#8 chr12 69183279 69260755
#9 chr11 77610143 77641464
#10 chr8 38191804 38260814
#11 chr12 58135797 58156509
Or a faster option using tstrsplit from the devel version of data.table (v1.9.5 at the time of writing):
library(data.table) # v1.9.5+
nm1 <- c('chr', 'chrStart', 'chrEnd')
res <- setDT(tstrsplit(dat$chrom, '[:-]', type.convert=TRUE))
setnames(res, nm1)
res
# chr chrStart chrEnd
# 1: chr11 69464719 69502928
# 2: chr7 55075808 55093954
# 3: chr8 128739772 128762863
# 4: chr3 169389459 169490555
# 5: chr17 37848534 37877201
# 6: chr19 30306758 30316875
# 7: chr1 150496857 150678056
# 8: chr12 69183279 69260755
# 9: chr11 77610143 77641464
#10: chr8 38191804 38260814
#11: chr12 58135797 58156509
Or
library(splitstackshape)
setnames(cSplit(dat, 'chrom', ':|-',fixed=FALSE,
type.convert=TRUE), nm1)[]
data
dat <- structure(list(chrom = structure(c(2L, 9L, 10L, 8L, 6L, 7L, 1L,
5L, 3L, 11L, 4L), .Label = c("chr1:150496857-150678056",
"chr11:69464719-69502928",
"chr11:77610143-77641464", "chr12:58135797-58156509",
"chr12:69183279-69260755",
"chr17:37848534-37877201", "chr19:30306758-30316875",
"chr3:169389459-169490555",
"chr7:55075808-55093954", "chr8:128739772-128762863",
"chr8:38191804-38260814"
), class = "factor")), .Names = "chrom", row.names = c(NA, -11L
), class = "data.frame")
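A base R alternative to the extract() call above is strcapture() from the utils package, which takes a regular expression with capture groups and a prototype data frame that gives the column names and types. The sketch below assumes R 3.4.0 or later and uses a shortened chrom vector; chrom, proto and res are illustrative names.

```r
chrom <- c("chr11:69464719-69502928",
           "chr7:55075808-55093954",
           "chr8:128739772-128762863")

# Prototype: one column per capture group, with the desired type
proto <- data.frame(chr = character(), chrStart = integer(), chrEnd = integer(),
                    stringsAsFactors = FALSE)

# Group 1: everything before ":", group 2: start position, group 3: end position
res <- strcapture("^([^:]+):([0-9]+)-([0-9]+)$", chrom, proto)
res
```

The positions are converted to integer here, which is fine for these values but would need numeric columns for coordinates above .Machine$integer.max.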

Adding two decimal places

I have a column in a dataset as shown below
Col1
----------
249
250.8
251.3
250.33
648
1249Y4
X569X3
4459120
2502420
What I am trying to do is add two decimal places only to numbers whose integer part has exactly three digits, in other words, numbers that are in the hundreds. For example, 249 should be converted to 249.00 and 251.3 to 251.30, and so on, but 4459120, 2502420 and X569X3 should stay as they are. The final output should look like this.
Col1
----------
249.00
250.80
251.30
250.33
648.00
1249Y4
X569X3
4459120
2502420
I have looked at many different functions so far none of those work because there are some strings in between the numbers, for example X569X3 and seven digit numbers 2502420
Actual dataset
structure(c(5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L,
55L, 56L, 57L, 58L, 59L, 84L, 86L, 87L, 88L, 99L, 100L, 101L,
102L, 103L, 104L, 105L, 106L, 107L, 108L, 110L, 5L, 12L, 14L,
16L, 20L, 24L, 36L, 40L, 44L, 48L, 52L, 56L, 83L, 85L, 75L, 112L,
66L, 68L, 96L, 93L, 77L, 80L, 81L, 70L, 95L, 78L, 109L, 94L,
63L, 67L, 98L, 73L, 79L, 76L, 90L, 111L, 69L, 97L, 64L, 92L,
89L, 82L, 62L, 74L, 60L, 65L, 71L, 91L, 61L, 72L, 4L, 1L, 2L,
3L, 113L), .Label = c("1234X1", "123871", "1249Y4", "146724",
"249", "249.01", "249.1", "249.11", "249.2", "249.21", "249.3",
"249.4", "249.41", "249.5", "249.51", "249.6", "249.61", "249.7",
"249.71", "249.8", "249.81", "249.9", "249.91", "250", "250.01",
"250.02", "250.03", "250.1", "250.11", "250.12", "250.13", "250.22",
"250.23", "250.32", "250.33", "250.4", "250.41", "250.42", "250.43",
"250.5", "250.51", "250.52", "250.53", "250.6", "250.61", "250.62",
"250.63", "250.7", "250.71", "250.72", "250.73", "250.8", "250.81",
"250.82", "250.83", "250.9", "250.91", "250.92", "250.93", "2502110",
"2502111", "2502112", "2502113", "2502114", "2502115", "2502210",
"2502310", "2502410", "2502420", "2502510", "2502610", "2502611",
"2502612", "2502613", "2502614", "2502615", "2506110", "2506120",
"2506130", "2506140", "2506150", "2506160", "251.3", "251.8",
"253.5", "258.1", "275.01", "277.39", "3640140", "3670110", "3670150",
"3748210", "3774410", "3774420", "4459120", "5379670", "5379671",
"6221340", "648", "648.01", "648.02", "648.03", "648.04", "648.8",
"648.81", "648.82", "648.83", "648.84", "7079180", "775.1", "7821120",
"7862120", "X569X3"), class = "factor")
Let's call your vector x (if it is a factor, as in your actual dataset, convert it first with x <- as.character(x)):
numbers = !is.na(as.numeric(x))
x.num = x[numbers]
x[numbers] = ifelse(as.numeric(x.num) < 1000,
sprintf("%.2f", as.numeric(x.num)),
x.num)
x
# [1] "249.00" "250.80" "251.30" "250.33" "648.00"
# [6] "1249Y4" "X569X3" "4459120" "2502420"
Use formatC with a selection of only the values you wish to replace.
x <- c("249", "250.8", "251.3", "250.33", "648", "1249Y4", "X569X3", "4459120", "2502420")
sel <- which(as.numeric(x) < 1000)
replace(x, sel, formatC(as.numeric(x[sel]), digits=2, format="f"))
#[1] "249.00" "250.80" "251.30" "250.33" "648.00" "1249Y4" "X569X3"
#[8] "4459120" "2502420"
First, change your dataset to character:
x <- as.character(x)
Then perform the following:
ifelse(grepl("[[:alpha:]]", x) == FALSE & as.numeric(x) < 1000,
sprintf("%.2f", as.numeric(x)), x)
Or if your data is in Col1 in a dataframe:
library(dplyr)
df %>%
mutate(Col1 = ifelse(grepl("[[:alpha:]]", Col1) == FALSE & as.numeric(as.character(Col1)) < 1000,
sprintf("%.2f", as.numeric(as.character(Col1))), as.character(Col1)))
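Another way to select the values, sketched here as an alternative to the numeric comparison, is to match the shape of the string directly: exactly three digits, optionally followed by a decimal part. This avoids the coercion warnings that as.numeric() gives for values such as "X569X3". The helper add_decimals is a hypothetical name introduced for illustration.

```r
add_decimals <- function(x) {
  x <- as.character(x)  # also handles factor input
  # Exactly three digits before an optional decimal part
  is_hundreds <- grepl("^[0-9]{3}(\\.[0-9]+)?$", x)
  # Reformat only the selected values with two decimal places
  x[is_hundreds] <- sprintf("%.2f", as.numeric(x[is_hundreds]))
  x
}

x <- c("249", "250.8", "251.3", "250.33", "648",
       "1249Y4", "X569X3", "4459120", "2502420")
result <- add_decimals(x)
result
```

Note that values with more than two decimal places would be rounded by sprintf(), so this sketch assumes at most two decimals as in the sample data.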