I am using regular expressions in R to extract strings from a variable. The variable contains distinct values that look like:
MEDIUM /REGULAR INSEAM
XX LARGE /SHORT INSEAM
SMALL /32" INSM
X LARGE /30" INSM
I have to capture two things: the whole value before the / (SMALL, XX LARGE) and the string (alphabetic or numeric) after it. I don't want the " INSM or the INSEAM part.
The regular expression for first two I am using is ([A-Z]\w+) \/([A-Z]\w+) INSEAM and for the last two I am using ([A-Z]\w+) \/([0-9][0-9])[" INSM].
The part ([A-Z]\w+) only captures one word, so it works fine for MEDIUM and SMALL, but fails for X LARGE, XX LARGE etc. Is there a way I can modify it to capture two occurrences of a word before the / character? Or is there a better way to do it?
Thanks in advance!
From your description, Wiktor's regex will fail on "XX LARGE/SHORT" due to the extra space. It is safer to capture everything before the forward slash as a group:
sub("^(.*/\\w+).*", "\\1", x)
#[1] "MEDIUM /REGULAR" "XX LARGE /SHORT" "SMALL /32" "X LARGE /30"
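If the two pieces are needed as separate values, the same sub idea can be applied twice. This is only a sketch in base R: the variable names size and inseam are made up for illustration, and it assumes every value contains exactly one forward slash.

```r
x <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM",
       "SMALL /32\" INSM", "X LARGE /30\" INSM")

# Everything before the slash, with the trailing space trimmed off
size <- trimws(sub("^([^/]*)/.*$", "\\1", x))

# The first run of word characters after the slash
inseam <- sub("^[^/]*/\\s*(\\w+).*$", "\\1", x)

data.frame(size, inseam)
```

Using [^/]* instead of .* keeps the first group from running past the slash, so multi-word sizes such as XX LARGE are captured whole.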
It seems you can use
(\w+(?: \w+)?) */ *(\w+)
See the regex demo
Pattern details:
(\w+(?: \w+)?) - Group 1 capturing one or more word chars followed by an optional sequence of a space + one or more word chars
*/ * - a / enclosed with 0+ spaces
(\w+) - Group 2 capturing 1 or more word chars
R code with stringr:
> library(stringr)
> v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM", "SMALL /32\" INSM", "X LARGE /30\" INSM")
> str_match(v, "(\\w+(?: \\w+)?) */ *(\\w+)")
[,1] [,2] [,3]
[1,] "MEDIUM /REGULAR" "MEDIUM" "REGULAR"
[2,] "XX LARGE /SHORT" "XX LARGE" "SHORT"
[3,] "SMALL /32" "SMALL" "32"
[4,] "X LARGE /30" "X LARGE" "30"
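If stringr is not installed, roughly the same matrix can be built in base R with regexec and regmatches. This is a sketch, not part of the answer above; the column names are invented for readability, and perl = TRUE is used so that the non-capturing group (?: ) is handled by PCRE.

```r
v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM",
       "SMALL /32\" INSM", "X LARGE /30\" INSM")
pat <- "(\\w+(?: \\w+)?) */ *(\\w+)"

# regexec() records the positions of the full match and of each capture group;
# regmatches() turns them into one character vector per input string
parts <- regmatches(v, regexec(pat, v, perl = TRUE))

# Stack the per-string vectors into a matrix: full match, group 1, group 2
m <- do.call(rbind, parts)
colnames(m) <- c("match", "size", "inseam")
m
```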
Related
I have a data frame which contains a column full of text. I need to capture the number (can potentially be any number of digits from most likely 1 to 4 digits in length) that follows a certain phrase, namely 'Floor Area' or 'floor area'. My data will look something like the following:
"A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift"
"Newbuild flat. Floor Area: 30 sq.m"
"6 bed house with floor area 50 sqm, lot area 25 sqm"
If I try to extract just the number or if I look back from sqm I will sometimes get the lot area by mistake. If someone could help me with a lookahead regex or similar in stringr, I'd appreciate it. Regex is a weak point for me. Many thanks in advance.
A common technique for extracting a number before or after a word is to use sub: match the whole string, capturing only the number next to the word, and replace the entire match with the captured substring:
# Extract the first number after a word:
as.integer(sub(".*?<WORD_OR_PATTERN_HERE>.*?(\\d+).*", "\\1", x))
# Extract the first number before a word:
as.integer(sub(".*?(\\d+)\\s*<WORD_OR_PATTERN_HERE>.*", "\\1", x))
NOTE: Replace \\d+ with \\d+(?:\\.\\d+)? to match int or float numbers (to stay consistent with the code above, remember to change as.integer to as.numeric). \\s* matches 0 or more whitespace characters in the second sub.
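As a small illustration of the float variant, here is a sketch with made-up input; the word "width" and the unit "cm" are placeholders for whatever word or pattern applies in your data. perl = TRUE is passed so that the lazy quantifier and the non-capturing group are handled by PCRE.

```r
x <- c("width 12.5 cm", "width 7 cm")

# First number after the word "width" (int or float)
after_word <- as.numeric(sub(".*?width.*?(\\d+(?:\\.\\d+)?).*", "\\1", x, perl = TRUE))

# First number directly before the word "cm" (int or float)
before_word <- as.numeric(sub(".*?(\\d+(?:\\.\\d+)?)\\s*cm.*", "\\1", x, perl = TRUE))

after_word
before_word
```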
For the current scenario, a possible solution will look like
v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
as.integer(sub("(?i).*?\\bfloor area:?\\s*(\\d+).*", "\\1", v))
# [1] 50 30 50
See the regex demo.
You may also leverage a capturing mechanism with str_match from stringr and get the second column value ([,2]):
> library(stringr)
> v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
> as.integer(str_match(v, "(?i)\\bfloor area:?\\s*(\\d+)")[,2])
[1] 50 30 50
See the regex demo.
The regex matches:
(?i) - in a case-insensitive way
\\bfloor area:? - a whole word (\b is a word boundary) floor area followed by an optional : (one or zero occurrence, ?)
\\s* - zero or more whitespace
(\\d+) - Group 1 (will be in [,2]) capturing one or more digits
See R demo online
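One thing to watch with the sub approach: when the pattern does not match, sub returns the input unchanged, so as.integer yields NA with a warning. A possible base R alternative that returns NA explicitly is sketched below; the second input string is a hypothetical row without a floor area.

```r
v <- c("Newbuild flat. Floor Area: 30 sq.m", "Studio, no size given")
pat <- "(?i)\\bfloor area:?\\s*(\\d+)"

# One character vector per string: full match plus group 1, or empty if no match
m <- regmatches(v, regexec(pat, v, perl = TRUE))

# Take group 1 where it exists, otherwise NA
area <- vapply(m, function(g) if (length(g) >= 2) as.integer(g[2]) else NA_integer_,
               integer(1))
area
```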
The following regex may get you started:
[Ff]loor\s+[Aa]rea:?\s+(\d{1,4})
The DEMO.
use following regex with Case Insensitive matching:
floor\s*area:?\s*(\d{1,4})
You can use \K, which discards everything matched so far and therefore acts like a lookbehind.
str_extract_all(x, "\\b[Ff]loor [Aa]rea:?\\s*\\K\\d+", perl=T)
or
str_extract_all(x, "(?i)\\bfloor area:?\\s*\\K\\d+", perl=T)
DEMO
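The code above may not return anything: current versions of stringr are built on the ICU regex engine, which does not support \K, and str_extract_all has no perl argument. You may try sub instead,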
> sub(".*\\b[Ff]loor\\s+[Aa]rea:?\\s*(\\d+).*", "\\1", x)
[1] "50" "30" "50"
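The \K idea does work with base R's PCRE engine. A minimal sketch, assuming x holds the three sample strings from the question:

```r
x <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift",
       "Newbuild flat. Floor Area: 30 sq.m",
       "6 bed house with floor area 50 sqm, lot area 25 sqm")

# \K drops "floor area: " from the reported match, leaving only the digits
areas <- regmatches(x, regexpr("(?i)\\bfloor area:?\\s*\\K\\d+", x, perl = TRUE))
areas
```

Note that regmatches with regexpr silently drops elements that have no match, so the result can be shorter than x.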
text<- "A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift"
unique(na.omit(as.numeric(unlist(strsplit(unlist(text), "[^0-9]+")))))
# [1] 3 50
Hope this helped.
Here's the thing:
test=" 2 15 3 23 12 0 0.18"
#I want to extract the 1st number separately
pattern="^ *(\\d+) +"
d=regmatches(test,gregexpr(pattern,test))
> d
[[1]]
[1] " 2 "
library(stringr)
f=str_extract(test,pattern)
> f
[1] " 2 "
They both bring whitespaces to the result despite usage of ()-brackets. Why? The brackets are for specifying which part of the matched pattern you want, am I wrong? I know I can trim them with trimws() or coerce them directly to numeric, but I wonder if I misunderstand some mechanics of patterns.
Using str_match (or str_match_all)
Since you want to extract a capture group, you can use str_match (or str_match_all). str_extract only extracts whole matches.
From R stringr help:
str_match Extract matched groups from a string.
and
str_extract to extract the complete match
R code:
library(stringr)
test=" 2 15 3 23 12 0 0.18"
pattern="^ *(\\d+) +"
f=str_match(test,pattern)
f[[2]]
## [1] "2"
f[[2]] returns the second item, which is the value of the first capture group.
Using regmatches
As it is mentioned in the comment above, it is also possible with regmatches and regexec:
test=" 2 15 3 23 12 0 0.18"
pattern="^ *(\\d+) +"
res <- regmatches(test,regexec(pattern,test))
res[[1]][2] # The res list contains all matches and submatches
## [1] "2" # We get item [2] from the first match to get "2"
See regexec help page that says:
regexec returns a list of the same length as text each element of which is either -1 if there is no match, or a sequence of integers with the starting positions of the match and all substrings corresponding to parenthesized subexpressions of pattern, with attribute "match.length" a vector giving the lengths of the matches (or -1 for no match).
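To make the difference concrete, the sketch below compares regexpr (whole match only) with regexec (whole match plus parenthesized groups) on the string from the question. The variable names full, groups and first_number are chosen here for illustration.

```r
test <- " 2 15 3 23 12 0 0.18"
pattern <- "^ *(\\d+) +"

# regexpr/gregexpr only know about the whole match, spaces included
full <- regmatches(test, regexpr(pattern, test))

# regexec also records each capture group: element 1 is the whole match,
# element 2 is the content of the first pair of parentheses
groups <- regmatches(test, regexec(pattern, test))[[1]]

first_number <- as.numeric(groups[2])
```

Here full still carries the surrounding spaces, while groups[2] contains only the digits.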
OP task specific solution
Actually, since you only are interested in 1 integer number in the beginning of a string, you could achieve what you want with a mere gsub:
> gsub("^ *(\\d+) +.*", "\\1", test)
[1] "2"
This regex: (.*?)(?:I[0-9]-)*I3(?:-I[0-9])* matches an expression using multiple groups. The point of the regex is that it captures patterns in pairs of two, where the first part of the regex has to be followed by the second part of the regex.
How can I extract each of these two groups?
library(stringr)
data <- c("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7")
str_extract_all(data, "(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*")
Gives me:
[[1]]
[1] "A-B-C-I1-I2-D-E-F-I1-I3" "-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7"
However, I would want something along the lines of:
[[1]]
[1] "A-B-C-I1-I2-D-E-F" [2] "I1-I3"
[[2]]
[1] "D-D-D-D" [2] "I1-I1-I2-I1-I1-I3-I3-I7"
The key here is that the regex matches twice, each time containing 2 groups. I want every match to have a list of its own, and that list to contain 2 elements, one for each group.
You need to wrap a capturing group around the second part of your expression and if you're using stringr for this task, I would use str_match_all instead to return the captured matches ...
library(stringr)
data <- c('A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7')
mat <- str_match_all(data, '-?(.*?)-((?:I[0-9]-)*I3(?:-I[0-9])*)')[[1]][,2:3]
colnames(mat) <- c('Group 1', 'Group 2')
# Group 1 Group 2
# [1,] "A-B-C-I1-I2-D-E-F" "I1-I3"
# [2,] "D-D-D-D" "I1-I1-I2-I1-I1-I3-I3-I7"
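For comparison, a base R sketch of the same idea is shown below. It first collects every full match with gregexpr and then re-applies the pattern to each match with regexec to pull out the two groups. This is one possible approach under the assumption that re-matching each substring reproduces the original groups, which holds for this input.

```r
data <- "A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7"
pat <- "-?(.*?)-((?:I[0-9]-)*I3(?:-I[0-9])*)"

# Step 1: all full matches in the string
full <- regmatches(data, gregexpr(pat, data, perl = TRUE))[[1]]

# Step 2: capture groups for each full match, one row per match
parts <- regmatches(full, regexec(pat, full, perl = TRUE))
mat <- do.call(rbind, parts)[, 2:3, drop = FALSE]
colnames(mat) <- c("Group 1", "Group 2")
mat
```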
I want a regular expression that matches anything that is not a correct mathematical number. The list below is a sample input for the regex:
1
1.7654
-2.5
2-
2.
m
2..3
2....233..6
2.2.8
2--5
6-4-9
So the first three (1, 1.7654 and -2.5) should not be selected and the rest should.
This is close to the topic of another post, but because of its negative nature, it is different.
I'm using R but any regular expression will do I guess.
The following is the best shot in the mentioned post:
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
grep(pattern="(-?0[.]\\d+)|(-?[1-9]+\\d*([.]\\d+)?)|0$", x=a)
which outputs:
[1] 1 2 3 4 5 7 8 9 10 11
You can use following regex :
^(?:((\d+(?=[^.]+|\.{2,})).)+|(\d\.){2,}).*|[^\d]+$
See demo https://regex101.com/r/tZ3uH0/6
Note that your regex engine must support variable-length lookahead, and you need to use the multi-line flag. As mentioned in the comments, you can use perl=T to activate lookahead in R.
This regex consists of two parts joined by an OR. The first part is:
(?:((\d+(?=[^.]+|\.{2,})).)+|(\d\.){2,}).*
It matches a run of digits followed either by anything except a dot or by two or more dots. All of this sits inside a capture group that can repeat; as an alternative to that group, it accepts a digit followed by a dot, repeated two or more times (to match strings like 2.3.4.).
The second part, [^\d]+, matches anything except digits.
Debuggex Demo
a[grep("^-?\\d*(\\.?\\d*)$", a, invert=T)]
With a suggested edit from @Frank.
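Because every part of that pattern is optional, it is worth checking a few edge cases before relying on it. The sketch below uses hypothetical inputs and compares the pattern with a stricter variant that requires digits on both sides of the dot; the stricter pattern is an illustration, not part of the answer.

```r
edge <- c("", "-", ".", "2.", "-2.5", "abc")

# Pattern from the answer: every component is optional
loose <- grepl("^-?\\d*(\\.?\\d*)$", edge)

# Stricter variant: at least one digit, and digits after the dot if present
strict <- grepl("^-?\\d+(\\.\\d+)?$", edge)

data.frame(edge, loose, strict)
```

The loose pattern treats the empty string, a lone "-", a lone "." and "2." as numbers, so with invert=T those values would not be selected.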
Speed Test
a <- rep(a, 1e4)
all.equal(a[is.na(as.numeric(a))], a[grep("^-?\\d+(\\.?\\d+)?$|^\\d+\\.$", a, invert=T)])
[1] TRUE
library(microbenchmark)
microbenchmark(dosc = a[is.na(as.numeric(a))],
plafort = a[grep("^-?\\d*(\\.?\\d*)$", a, invert=T)])
# Unit: milliseconds
# expr min lq mean median uq max neval
# dosc 27.83477 28.32346 28.69970 28.51254 28.76202 31.24695 100
# plafort 31.92118 32.14915 32.62036 32.33349 32.71107 35.12258 100
I think this should do the job:
re <- "^-?[0-9]+$|^-?[0-9]+\\.[0-9]+$"
R> a[!grepl(re, a)]
#[1] "2-" "2." "m" "2..3" "2....233..6" "2.2.8" "2--5"
#[8] "6-4-9"
The solution here is good. You only have to add the negative case [-] and invert the selection!
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
a[grep(pattern="(^[1-9]\\d*(\\.\\d+)?$)|(^[-][1-9]\\d*(\\.\\d+)?$)",invert=TRUE, x=a)]
[1] "2-" "2." "m" "2..3" "2....233..6"
[6] "2.2.8" "2--5" "6-4-9"
Try this:
a[!grepl("^-?\\d*\\.?\\d+$", a)]
I like the simplicity of as.numeric(). This would be my suggestion:
require(stringr)
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
a
a1 <- ifelse(str_sub(a, -1) == ".", "string filler", a)
a1
outvect <- is.na(as.numeric(a1))
outvect
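The reason for the trailing-dot workaround is that as.numeric parses "2." as 2, so it is not flagged on its own. The sketch below shows the difference against a strict regex; the names not_parsed and strict are illustrative, and suppressWarnings only hides the coercion warning.

```r
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6",
       "2.2.8", "2--5", "6-4-9")

# Values that as.numeric() cannot parse at all
not_parsed <- a[is.na(suppressWarnings(as.numeric(a)))]

# Values rejected by a strict "optional minus, digits, optional .digits" regex
strict <- a[!grepl("^-?[0-9]+(\\.[0-9]+)?$", a)]

# What the regex rejects but as.numeric() accepts
setdiff(strict, not_parsed)
```

as.numeric also accepts forms such as "1e5", so it is the more permissive of the two checks.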
I am trying to write a program using the lynx command on this page "http://www.rottentomatoes.com/movie/box_office.php" and I can't seem to wrap my head around a certain problem.... getting the title by itself. My problem is a title can contain special characters, numbers, and all titles are variable in length. I want to write a regex that could parse the entire page and find lines like this....
(I added spaces between the title and the next number, which is how many weeks it has been out, to distinguish between title and weeks released)
1 -- 30% The Vow 1 $41.2M $41.2M $13.9k 2958
2 -- 53% Safe House 1 $40.2M $40.2M $12.9k 3119
3 -- 42% Journey 2: The Mysterious Island 1 $27.3M $27.3M $7.9k 3470
4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655
5 1 86% Chronicle 2 $12.1M $40.0M $4.2k 2908
the regex I have started out with is:
/(\d+)\s(\d+|\-\-)\s(\d+\%)\s
If someone can help me figure out how to grab the title successfully that would be much appreciated! Thanks in advance.
Capture all the things!!
^(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+(.*)\s+(\d+)\s+(\$\d+(?:\.\d+)?[Mk])\s+(\$\d+(?:\.\d+)?[Mk])\s+(\$\d+(?:\.\d+)?[Mk])\s+(\d+)$
Explained:
^ <- Start of the line
(\d+)\s+ <- Numbers (captured) followed by as many spaces as you want
(\d+|\-\-)\s+ <- Numbers [or "--"] (captured) followed by as many spaces as you want
(\d+\%)\s+ <- Numbers [with '%'] (captured) followed by as many spaces as you want
(.*)\s+ <- Anything you can match [don't be greedy] (captured) followed by as many spaces as you want
(\d+)\s+ <- Numbers (captured) followed by as many spaces as you want
(\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and Numbers [with floating point] and "M or k" (captured) followed by as many spaces as you want
(\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and Numbers [with floating point] and "M or k" (captured) followed by as many spaces as you want
(\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and Numbers [with floating point] and "M or k" (captured) followed by as many spaces as you want
(\d+) <- Numbers (captured)
$ <- End of the line
To be serious, this is what I've done: I cheated a bit and captured everything (as I think you'll do in the end), so that the rest of the pattern acts like a lookahead for the title capture.
(.*) is greedy by default [use (.*?) if you want to force the "ungreedyness"], but the end of the regex has to match everything else on the line, so backtracking shrinks that group until the rest fits.
The group ends up capturing only the title (the only thing left).
Alternatively, you can use an actual lookahead and make assertions.
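To see the capture groups in action, here is a sketch that applies the same pattern in R to three of the sample lines (with single spaces, as shown in the question). The variable names are illustrative, and the dots in the dollar amounts are escaped so they match a literal decimal point.

```r
lines <- c("1 -- 30% The Vow 1 $41.2M $41.2M $13.9k 2958",
           "3 -- 42% Journey 2: The Mysterious Island 1 $27.3M $27.3M $7.9k 3470",
           "5 1 86% Chronicle 2 $12.1M $40.0M $4.2k 2908")

money <- "(\\$\\d+(?:\\.\\d+)?[Mk])"
pat <- paste0("^(\\d+)\\s+(\\d+|--)\\s+(\\d+%)\\s+(.*)\\s+(\\d+)\\s+",
              money, "\\s+", money, "\\s+", money, "\\s+(\\d+)$")

# One row per line: column 1 is the full match, columns 2-10 are the groups
m <- do.call(rbind, regmatches(lines, regexec(pat, lines, perl = TRUE)))

# The title is the 4th capture group, i.e. column 5
titles <- m[, 5]
titles
```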
Resources:
regular-expressions.info - Lookaround
regexr.com - This regex tested