remove comma from a digits portion string - regex

How can I remove commas from the digit portions of a string (the fastest way is preferable) without affecting the other commas in the string? In the example below I want to remove the commas from the number portions, but the comma after dogs should remain (yes, I know the comma placement in 102,345,5 is wrong; I'm just throwing a corner case out there).
What I have:
x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
Desired outcome:
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
Stipulation: must be done in base R, with no add-on packages.
Thank you in advance.
EDIT:
Thank you Dason, Greg and Dirk. All of your responses worked very well. I was playing with something close to Dason's response but had the comma inside the parentheses. Looking at it now, that doesn't even make sense. I microbenchmarked the responses, as I need speed here (text data):
Unit: microseconds
         expr     min      lq  median      uq     max
1  Dason_0to9  14.461  15.395  15.861  16.328  25.191
2 Dason_digit  21.926  23.791  24.258  24.725  65.777
3        Dirk 127.354 128.287 128.754 129.686 154.410
4      Greg_1  18.193  19.126  19.127  19.594  27.990
5      Greg_2 125.021 125.954 126.421 127.353 185.666
+1 to all of you.
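For anyone who wants to repeat a comparison like this with base R only, here is a rough sketch. The mapping of labels to patterns is an assumption based on the answers below, and system.time is much coarser than microbenchmark, so treat the numbers as indicative only.

```r
x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"

# Candidate patterns; the labels are a guess at how the table rows map to the answers
candidates <- list(
  Dason_0to9  = function(s) gsub(",([0-9])", "\\1", s),
  Dason_digit = function(s) gsub(",([[:digit:]])", "\\1", s),
  Dirk        = function(s) gsub("(\\d),(\\d)", "\\1\\2", s, perl = TRUE),
  Greg_1      = function(s) gsub("([0-9]),([0-9])", "\\1\\2", s),
  Greg_2      = function(s) gsub("(?<=\\d),(?=\\d)", "", s, perl = TRUE)
)

# Check that every candidate returns the same string before timing anything
results <- vapply(candidates, function(f) f(x), character(1))
all_agree <- length(unique(results)) == 1

# Elapsed seconds for 1000 calls each; values depend on the machine
timings <- vapply(candidates, function(f) {
  system.time(for (i in seq_len(1000)) f(x))[["elapsed"]]
}, numeric(1))
print(sort(timings))
```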

You could replace anything matching the pattern (a comma followed by a digit) with the digit itself.
x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
gsub(",([[:digit:]])", "\\1", x)
#[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
#or
gsub(",([0-9])", "\\1", x)
#[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"

Using Perl regexp, and focusing on "digit comma digit" we then replace with just the digits:
R> x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
R> gsub("(\\d),(\\d)", "\\1\\2", x, perl=TRUE)
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
R>

Here are a couple of options:
> tmp <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
> gsub('([0-9]),([0-9])','\\1\\2', tmp )
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
> gsub('(?<=\\d),(?=\\d)','',tmp, perl=TRUE)
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
>
They both match a digit followed by a comma followed by a digit. The [0-9] and \d both match a single digit (in \\d, the extra \ escapes the backslash so that it makes it through to the regular expression).
The first expression captures the digit before the comma and the digit after the comma and uses them in the replacement string, basically pulling them out and putting them back (but not putting the comma back).
The second version uses zero-length matches. The (?<=\\d) says that there needs to be a single digit before the comma in order for it to match, but the digit itself is not part of the match. The (?=\\d) says that there needs to be a digit after the comma in order for it to match, but it is not included in the match. So basically it matches a comma, but only if preceded and followed by a digit. Since only the comma is matched, the replacement string is empty, meaning delete the comma.
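One practical difference between the two versions shows up when single digits are separated by commas. The sketch below uses a made-up input string (not from the question) to illustrate it: the capture-group pattern consumes the digit after the comma, so that digit cannot also serve as the digit before the next comma.

```r
y <- "1,2,3 apples, 4,5 pears"  # hypothetical input with single-digit groups

# Capture-group version: each match consumes digit, comma and digit
captured <- gsub("([0-9]),([0-9])", "\\1\\2", y)

# Lookaround version: only the comma itself is consumed
lookaround <- gsub("(?<=\\d),(?=\\d)", "", y, perl = TRUE)

print(captured)    # "12,3 apples, 45 pears"
print(lookaround)  # "123 apples, 45 pears"
```

For the string in the question both versions agree, because at least two digits sit between any two commas.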

Related

Extracting a number following specific text in R

I have a data frame which contains a column full of text. I need to capture the number (can potentially be any number of digits from most likely 1 to 4 digits in length) that follows a certain phrase, namely 'Floor Area' or 'floor area'. My data will look something like the following:
"A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift"
"Newbuild flat. Floor Area: 30 sq.m"
"6 bed house with floor area 50 sqm, lot area 25 sqm"
If I try to extract just the number, or if I look back from sqm, I will sometimes get the lot area by mistake. If someone could help me with a lookahead regex or similar in stringr, I'd appreciate it. Regex is a weak point for me. Many thanks in advance.
A common technique to extract a number before or after a word is to match the whole string with sub: match everything up to the word (or up to the number, when the number comes first), capture the number, match the rest of the string, and replace the whole match with the captured substring:
# Extract the first number after a word:
as.integer(sub(".*?<WORD_OR_PATTERN_HERE>.*?(\\d+).*", "\\1", x))
# Extract the number right before a word:
as.integer(sub(".*?(\\d+)\\s*<WORD_OR_PATTERN_HERE>.*", "\\1", x))
NOTE: Replace \\d+ with \\d+(?:\\.\\d+)? to match integer or float numbers (and, to stay consistent with the code above, remember to change as.integer to as.numeric). \\s* matches 0 or more whitespace characters in the second sub.
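As a concrete illustration of the two templates, here is a small sketch with made-up strings and made-up words (price and kg are hypothetical, not from the question). perl = TRUE is passed so that the lazy quantifiers follow PCRE rules; that is a choice of this sketch, not something the templates require.

```r
x <- c("weight 12 kg, price 30 USD", "price: 7 USD for 2 kg")  # hypothetical data

# First number after the word "price"
after_price <- as.integer(sub(".*?price.*?(\\d+).*", "\\1", x, perl = TRUE))

# Number right before the word "kg"
before_kg <- as.integer(sub(".*?(\\d+)\\s*kg.*", "\\1", x, perl = TRUE))

print(after_price)  # 30 7
print(before_kg)    # 12 2
```

Keep in mind that sub returns the input unchanged when the pattern does not match, so a string without the word would come back whole and as.integer would turn it into NA with a warning.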
For the current scenario, a possible solution will look like
v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
as.integer(sub("(?i).*?\\bfloor area:?\\s*(\\d+).*", "\\1", v))
# [1] 50 30 50
You may also leverage a capturing mechanism with str_match from stringr and get the second column value ([,2]):
> library(stringr)
> v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
> as.integer(str_match(v, "(?i)\\bfloor area:?\\s*(\\d+)")[,2])
[1] 50 30 50
The regex matches:
(?i) - in a case-insensitive way
\\bfloor area:? - a whole word (\b is a word boundary) floor area followed by an optional : (one or zero occurrence, ?)
\\s* - zero or more whitespace
(\\d+) - Group 1 (will be in [,2]) capturing one or more digits
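If stringr is not available, roughly the same capture-group extraction can be sketched in base R with regexec and regmatches. This is an alternative offered as an assumption about what a base equivalent might look like, not part of the answer above; the perl argument of regexec needs a reasonably recent R version.

```r
v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift",
       "Newbuild flat. Floor Area: 30 sq.m",
       "6 bed house with floor area 50 sqm, lot area 25 sqm")

# One element per input string: full match first, then capture group 1
m <- regmatches(v, regexec("(?i)\\bfloor area:?\\s*(\\d+)", v, perl = TRUE))

# Pick group 1, or NA when a string has no match at all
floor_area <- as.integer(vapply(
  m,
  function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
  character(1)
))
print(floor_area)  # 50 30 50
```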
The following regex may get you started:
[Ff]loor\s+[Aa]rea:?\s+(\d{1,4})
Use the following regex with case-insensitive matching:
floor\s*area:?\s*(\d{1,4})
You need a lookbehind-style regex (here \K, which discards everything matched before it):
str_extract_all(x, "\\b[Ff]loor [Aa]rea:?\\s*\\K\\d+", perl=T)
or
str_extract_all(x, "(?i)\\bfloor area:?\\s*\\K\\d+", perl=T)
I don't know why the code above won't return anything. You may also try sub:
> sub(".*\\b[Ff]loor\\s+[Aa]rea:?\\s*(\\d+).*", "\\1", x)
[1] "50" "30" "50"
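A likely explanation, stated here as an assumption rather than something verified on the poster's setup: the stringr functions do not take a perl argument, and the ICU engine behind stringr does not support \K. The same \K pattern can be tried with base R's PCRE engine instead, as in this sketch:

```r
v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift",
       "Newbuild flat. Floor Area: 30 sq.m",
       "6 bed house with floor area 50 sqm, lot area 25 sqm")

# \K resets the start of the reported match, so only the digits are returned
hits <- regmatches(v, regexpr("(?i)\\bfloor area:?\\s*\\K\\d+", v, perl = TRUE))
floor_area_k <- as.integer(hits)
print(floor_area_k)  # 50 30 50
```

Note that regmatches with regexpr silently drops elements that have no match, so the result can be shorter than the input.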
text<- "A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift"
unique(na.omit(as.numeric(unlist(strsplit(unlist(text), "[^0-9]+")))))
# [1] 3 50
Hope this helped.

R: gsub inserting whitespaces between capture groups

I'm desperately trying to insert whitespace between capture groups. My naive approach was
c = c("WesternSaharaRegion", "ColumbiaState", "OneTwoThreeFourFiveSix")
gsub("(.+[a-z])([A-Z].+)","\\1 \\2", c, perl=T)
which only inserts a whitespace between the last two capitalized words. Using
gsub("(?=([a-z][A-Z]))"," ", c, perl = T)
does not quite work either, because the result is shifted by one character:
"Wester nSahar aRegion" "Columbi aState" "On eTw oThre eFou rFiv eSix"
How can I elegantly get
"Western Sahara Region" "Columbia State" "One Two Three Four Five Six"
strsplit() unfortunately doesn't keep the capture group :/
We can either use regex lookarounds
gsub('(?<=[a-z])(?=[A-Z])', ' ', c, perl=TRUE)
#[1] "Western Sahara Region" "Columbia State" "One Two Three Four Five Six"
Or use capture groups
gsub('([a-z])([A-Z])', '\\1 \\2', c)
#[1] "Western Sahara Region" "Columbia State" "One Two Three Four Five Six"
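To see why the lookahead-only attempt from the question lands one character early, it may help to run it next to the lookaround fix. In this sketch the vector is named words (a name chosen here, to avoid masking the base function c).

```r
words <- c("WesternSaharaRegion", "ColumbiaState", "OneTwoThreeFourFiveSix")

# Lookahead only: the zero-width match sits in front of the lowercase letter,
# so the space is inserted before that letter
shifted <- gsub("(?=([a-z][A-Z]))", " ", words, perl = TRUE)

# Lookbehind plus lookahead: the zero-width match sits between the two letters
fixed_gap <- gsub("(?<=[a-z])(?=[A-Z])", " ", words, perl = TRUE)

print(shifted[2])    # "Columbi aState"
print(fixed_gap[2])  # "Columbia State"
```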

Get more than 1 quotations in text paragraph in R regex

First: Find the texts that are inside the quotations "I want everything inside here".
Second: To extract 1 sentence before quotation.
I would like to get the desired output below using a lookbehind regex in R, if possible.
Example:
Yoyo. He is sad. Oh no! "Don't sad!" Yeah: "Testing... testings," Boys. Sun. Tree... 0.2% green,"LL" "WADD" HOLA.
Desired Output:
[1] Oh no! "Don't sad!"
[2] Yeah: "Testing... testings"
[3] Tree... 0.2% green, "LL"
[4] Tree... 0.2% green, "LL" "WADD"
dput:
"Yoyo. He is sad. Oh no! \"Don't sad!\" Yeah: \"Testing... testings,\" Boys. Sun. Tree... 0.2% green,\"LL\" \"WAAD\" HOLA."
I tried this, but it doesn't work:
str_extract(t, "(?<=\\.\\s)[^.:]*[.:]\\s*\"[^\"]*\"")
Also tried:
regmatches(t , gregexpr('^[^\\.]+[\\.\\,\\:]\\s+(.*(?:\"[^\"]+\\")).*$', t))
regmatches(t , gregexpr('\"[^\"]*\"(?<=\\s[.?][^\\.\\s])', t))
Tried your method, @naurel:
> regmatches(t, regexpr("(?:\"? *([^\"]*))(\"[^\"]*\")", t, perl=T))
[1] " Yoyo. He is sad. Oh no! \"Don't sad!\""
Since you just want the last sentence, I've cleaned up the regex for you.
Explanation:
First, you're looking for something that is between quotes. If there are several quoted strings in a row, you want them to match as one.
(\"[^\"]*\"(?: *\"[^\"]*\")*)
does the trick. Then you want to match the sentence before this group. A sentence starts with a CAPITAL letter, so we start the match at the closest capital letter before the previously defined group (i.e. one that is not followed by any other CAPITAL letter):
([A-Z](?:[a-z0-9\W\s])*)
Put it together and you obtain:
([A-Z](?:[a-z0-9\W\s])*)(\"[^\"]*\"(?: *\"[^\"]*\")*)
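The answer gives only the pattern, so here is a sketch of how it might be applied in R with gregexpr and perl = TRUE. The variable names are chosen for this sketch, and the text is the dput string from the question. On this text the pattern returns three matches; consecutive quotes are merged, so it does not reproduce the separate fourth entry of the desired output.

```r
txt <- "Yoyo. He is sad. Oh no! \"Don't sad!\" Yeah: \"Testing... testings,\" Boys. Sun. Tree... 0.2% green,\"LL\" \"WAAD\" HOLA."

# Group 1: sentence starting at a capital letter; group 2: one or more quoted strings
pattern <- "([A-Z](?:[a-z0-9\\W\\s])*)(\"[^\"]*\"(?: *\"[^\"]*\")*)"

quoted <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
print(quoted)
# [1] "Oh no! \"Don't sad!\""
# [2] "Yeah: \"Testing... testings,\""
# [3] "Tree... 0.2% green,\"LL\" \"WAAD\""
```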

Match everything but numbers regular expression

I want a regular expression that matches anything that is not a correct mathematical number. The list below is a sample input for the regex:
1
1.7654
-2.5
2-
2.
m
2..3
2....233..6
2.2.8
2--5
6-4-9
So the first three (1, 1.7654 and -2.5) should not get selected and the rest should.
This is close to the topic of another post, but because of its negative nature, it is different.
I'm using R, but any regular expression will do, I guess.
The following is the best shot in the mentioned post:
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
grep(pattern="(-?0[.]\\d+)|(-?[1-9]+\\d*([.]\\d+)?)|0$", x=a)
which outputs:
[1] 1 2 3 4 5 7 8 9 10 11
You can use the following regex:
^(?:((\d+(?=[^.]+|\.{2,})).)+|(\d\.){2,}).*|[^\d]+$
See demo https://regex101.com/r/tZ3uH0/6
Note that your regex engine must support variable-length lookahead, and you need to use the multi-line flag. As mentioned in the comments, you can use perl=TRUE to enable lookahead in R.
This regex consists of two parts joined with an OR. The first part is:
(?:((\d+(?=[^.]+|\.{2,})).)+|(\d\.){2,}).*
It matches a run of digits followed either by anything except a dot or by two or more dots. The whole of this sits in a capture group that can repeat; as an alternative to that group, you can have a digit followed by a dot, two or more times (for matching strings like 2.3.4.).
The second part is [^\d]+, which matches anything except digits.
a[grep("^-?\\d*(\\.?\\d*)$", a, invert=T)]
With a suggested edit from @Frank.
Speed Test
a <- rep(a, 1e4)
all.equal(a[is.na(as.numeric(a))], a[grep("^-?\\d+(\\.?\\d+)?$|^\\d+\\.$", a, invert=T)])
[1] TRUE
library(microbenchmark)
microbenchmark(dosc = a[is.na(as.numeric(a))],
plafort = a[grep("^-?\\d*(\\.?\\d*)$", a, invert=T)])
# Unit: milliseconds
# expr min lq mean median uq max neval
# dosc 27.83477 28.32346 28.69970 28.51254 28.76202 31.24695 100
# plafort 31.92118 32.14915 32.62036 32.33349 32.71107 35.12258 100
I think this should do the job:
re <- "^-?[0-9]+$|^-?[0-9]+\\.[0-9]+$"
R> a[!grepl(re, a)]
#[1] "2-" "2." "m" "2..3" "2....233..6" "2.2.8" "2--5"
#[8] "6-4-9"
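To compare candidate patterns against the selection the question asks for, a small check like the following can help. The names strict, lenient and coerced are labels chosen for this sketch. It suggests that the as.numeric approach and the "^-?\\d*(\\.?\\d*)$" pattern from the speed test both treat "2." as a number, whereas the question wants it selected.

```r
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6",
       "2.2.8", "2--5", "6-4-9")

# Expected selection per the question: everything except the first three
expected <- a[-(1:3)]

strict  <- a[!grepl("^-?[0-9]+$|^-?[0-9]+\\.[0-9]+$", a)]
lenient <- a[grep("^-?\\d*(\\.?\\d*)$", a, invert = TRUE)]
coerced <- a[is.na(suppressWarnings(as.numeric(a)))]

print(identical(strict, expected))  # TRUE
print(setdiff(expected, lenient))   # "2."
print(setdiff(expected, coerced))   # "2."
```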
The solution here is good. You only have to add the negative case [-] and invert the selection!
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
a[grep(pattern="(^[1-9]\\d*(\\.\\d+)?$)|(^[-][1-9]\\d*(\\.\\d+)?$)",invert=TRUE, x=a)]
[1] "2-" "2." "m" "2..3" "2....233..6"
[6] "2.2.8" "2--5" "6-4-9"
Try this:
a[!grepl("^\\-?\\d*\\.?\\d+$", a)]
I like the simplicity of as.numeric(). This would be my suggestion:
require(stringr)
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
a
a1 <- ifelse(str_sub(a, -1) == ".", "string filler", a)
a1
outvect <- is.na(as.numeric(a1))
outvect

Regular expression that both includes and excludes certain strings in R

I am trying to use R to parse through a number of entries. I have two requirements for the entries I want back: they must contain the word apple and must not contain the word orange.
For example:
I like apples
I really like apples
I like apples and oranges
I want to get entries 1 and 2 back.
How could I go about using R to do this?
Thanks.
Could do
temp <- c("I like apples", "I really like apples", "I like apples and oranges")
temp[grepl("apple", temp) & !grepl("orange", temp)]
## [1] "I like apples" "I really like apples"
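If the include and exclude terms are plain substrings rather than patterns, the same idea can be wrapped in a small helper. The function name keep_with_without is hypothetical, and using fixed = TRUE is a choice of this sketch that skips regex interpretation entirely.

```r
# Hypothetical helper: keep elements that contain `include` but not `exclude`,
# treating both as literal substrings
keep_with_without <- function(x, include, exclude) {
  x[grepl(include, x, fixed = TRUE) & !grepl(exclude, x, fixed = TRUE)]
}

temp <- c("I like apples", "I really like apples", "I like apples and oranges")
kept <- keep_with_without(temp, "apple", "orange")
print(kept)  # "I like apples" "I really like apples"
```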
Using a regular expression, you could do the following.
x <- c('I like apples', 'I really like apples',
'I like apples and oranges', 'I like oranges and apples',
'I really like oranges and apples but oranges more')
x[grepl('^((?!.*orange).)*apple.*$', x, perl=TRUE)]
# [1] "I like apples" "I really like apples"
At each position, the negative lookahead (?!.*orange) checks that the substring orange does not occur anywhere further along the line. If that holds, the dot . matches any character except a line break; the lookahead and the dot are wrapped in a group that is repeated (0 or more times). Next we look for apple, followed by any character except a line break (0 or more times). Finally, the start- and end-of-line anchors make sure the whole input is consumed.
UPDATE: You could use the following if performance is an issue.
x[grepl('^(?!.*orange).*apple.*$', x, perl=TRUE)]
This regex is a bit smaller and much faster than the other regex versions (see comparison below). I don't have the tools to compare to David's double grepl so if someone can compare the single grep below vs the double grepl we'll be able to know. The comparison must be done both for a success case and a failure case.
^(?!.*orange).*apple.*$
The negative lookahead ensures we don't have orange
We just match the string, so long as it contains apple. No need for a lookahead there.
Code Sample
grep("^(?!.*orange).*apple.*$", subject, perl=TRUE, value=TRUE);
Speed Comparison
@hwnd has now removed that double lookahead version, but according to RegexBuddy the speed difference remains:
1. Against I like apples and oranges, the engine takes 22 steps to fail, vs. 143 for the double lookahead version ^(?=.*apple)((?!orange).)*$ and 22 steps for ^((?!.*orange).)*apple.*$ (equal there but wait for point 2).
2. Against I really like apples, the engine takes 64 steps to succeed, vs. 104 for the double lookahead version ^(?=.*apple)((?!orange).)*$ and 538 steps for ^((?!.*orange).)*apple.*$.
These numbers were provided by the RegexBuddy debugger.
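For the comparison against the double grepl that the answer asks about, a rough base R sketch could look like this. The function names and the repetition count are choices of this sketch, and the timings will differ from machine to machine, so no numbers are claimed here.

```r
x <- c("I like apples", "I really like apples",
       "I like apples and oranges", "I like oranges and apples",
       "I really like oranges and apples but oranges more")

double_grepl <- function(s) s[grepl("apple", s) & !grepl("orange", s)]
single_regex <- function(s) s[grepl("^(?!.*orange).*apple.*$", s, perl = TRUE)]

# Repeat the sample so the timings are not dominated by call overhead
big <- rep(x, 2000)

t_double <- system.time(r_double <- double_grepl(big))[["elapsed"]]
t_single <- system.time(r_single <- single_regex(big))[["elapsed"]]

# Both approaches should select exactly the same elements
same_result <- identical(r_double, r_single)
print(c(double_grepl = t_double, single_regex = t_single))
```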