Extracting clock time from string - regex

I have a dataframe that consists of web-scraped data. One of the fields scraped was a time in clock time, but the scraping process wasn't perfect. Most of the 'good' data look something like '4:33, or '103:20 (so a leading single quote, and two fields, minutes and seconds). Also, there is some bad data, the most common one being '],, but also some containing text. I'd like a new string that is something like 4:33, and for bad data, just blank.
So my plan of attack is to match my good data form, and then replace everything else with a blank space. Sometime like time <- gsub('[0-9]+:[0-9]+', '', time). I know this would replace my pattern with a blank, and I want the opposite, but I'm unsure as to how to negate this whole pattern. A simple carat doesn't seem to work, nor applying it to a group. I tried something like gsub("(.)+([0-9]+)(:)([0-9]+)", "\\2\\3\\4", time) but that isn't working either.
Sample:
dput(sample)
c("'], ", "' Ling (2-0)vsThe Dragon(2-0)", "'8:18", "'13:33",
"'43:33")
Expected output:
c("", "", "8:18", "13:33", "43:33")

We can use grep to replace the elements that do not follow the pattern to '' and then replace the quotes (') with ''. Here, the pattern is the strings that start (^) with ' followed by numbers, :, numbers in that order to the end ($) of the string. So, all other string elements (by negating i.e. !) are assigned to '' using the logical index from grepl and we use sub to replace the '.
sample[!grepl("^'\\d+:\\d+$", sample)] <- ''
sub("'", '', sample)
#[1] "" "" "8:18" "13:33" "43:33"
Or we can also do this in one step using gsub by replacing all those characters (.) that do not follow the pattern \\d+:\\d+ with ''.
gsub("(\\d+:\\d+)(*SKIP)(*F)|.", '', sample, perl=TRUE)
#[1] "" "" "8:18" "13:33" "43:33"
Or another option is str_extract from library(stringr). It is not clear whether there are other patterns such as "some text '08:20 value" in the OP's original dataset or not. The str_extract will also extract those time values, if present.
library(stringr)
str_extract(sample, '\\d+:\\d+')
#[1] NA NA "8:18" "13:33" "43:33"
It will give NA instead of '' for those that doesn't follow the pattern.

You can use sub:
sub('.+?(?=[0-9]+:[0-9]+)|.+', '', sample, perl = TRUE)
[1] "" "" "8:18" "13:33" "43:33"
The regex consists of two parts that are combined with a logical or (|).
.+?(?=[0-9]+:[0-9]+)
This regex matches a positive number of characters followed by the target pattern.
.+ This regex matches a positive number of characters.
The logic: Replace everything preceding thte target pattern with an empty string (''). If there is no target pattern, replace everything with the empty string.

Related

How to exclude a regex match when I have comma without space after

I have a regex working with 99% of my situations. But one is not working
My input is like that
MyString=[XXXXXX:XX XX XX XX, XXXXX:332.83, XXXXX:XXX-XX-XX XX:XX:XX, XXXX:0.0, XXXX:2, XXXX:0, XXXX:-256, XXXXX:5, XXXX:136935, XXXX:0, XXXX:XX XXX XXX, XXXX:0.5, XXXXX:true, XXXX:0.509375, XXX:0.0, [XXXX:2022-06-14 06:45:00], 2022-09-17,XXXXX:1]
This regex allows to match all key:value in between the first [ and last ]
(?<=(?<!: *)\[).*?(?=,)|(?<=, *(?=[^ \r\n]))(?:.*?(?=,)|[^,\r\n\[\]]+?(?=\])|[^,\r\n]+\](?= *\]))
It split all key:value when a comma is detected, but I have specific data where I have a comma without space after. For example, the last date of my example is split because of , I search to exclude this split match to match all 2022-09-17,XXXXX:1
I search for a regex that match only data that a separate by , and not ,
Here is the example with the split of the last data I search to prevent
https://regex101.com/r/8fd7Xv/1
You can add a space, or 1 or more spaces at the positions that you assert a comma.
(?<=(?<!: *)\[).*?(?=, )|(?<=, +(?=[^ \r\n]))(?:.*?(?=, )|[^,\r\n\[\]]+?(?=\])|[^,\r\n]+\](?= *\]))
See the updated pattern https://regex101.com/r/Dlx8Xi/1

Extracting plus (+), minus (-), and period (.) characters and all numbers [0-9] from a character string in R

I have many character strings that look like this, with a large variance of numbers:
tester1 <- "{\"fullgame\":\"-303\"}"
tester2 <- "{\"fullgame\":\"+7.5\"}"
I would like to extract plus(+), minus(-), and period(.) characters, along with all numbers[0-9] from my strings. I would also like to preserve the current order of each of these elements as they appear in the string.
I want the resulting strings to be:
formatted1 = "-303"
formatted2 = "+7.5"
I know that functions like gsub, strsplit, and regex would be ideal for this application, but for the life of me, I can't figure out the Perl syntax =(.
Any help would be greatly appreciated! Thank you guys!
Your strings look like json a lot. See the wiki page for more info. You shouldn't try to parse them with regex; rely instead on a specific json parser. There are many in R. Extracting the desired quantity is as easy as:
require(jsonlite) #other libraries: rjson, RJSONIO
fromJSON(tester1)
#$fullgame
#[1] "+7.5"
I don't know about "Perl" but this regex pattern pulls your numbers out:
> gsub( "([^-+]+)([+-]{0,1}[0-9.]+)(.+)", "\\2", c(tester1,tester2) )
[1] "-303" "+7.5"
Breaking apart the pattern which has three capture sections:
([^-+]+) : uses the negation operator in a character class to match any sequence that is
not a plus or minus sign
([+-]{0,1}[0-9.]+) : the second capture class allows (but does not require) a single
+/- sign, followed by any number of digits or decimal point/period
(.+) : is the third capture class ... anything else that is trailing
This one is a bit more specific about what form the numbers and decimal points can assume by adding optional single decimal point and subsequent digits:
gsub( "([^-+]+)([+-]{0,1}[0-9]+[.]{0,1}[0-9]*)(.+)", "\\2", c(tester1,tester2) )
I'm pretty sure that there are earlier postings that covered extraction of signed decimal numbers.
You can gsub out anything that doesn't match the type of characters you're looking for:
gsub('[^+-.0-9]', '', tester1)
# [1] "-303"
gsub('[^+-.0-9]', '', tester2)
# [1] "+7.5"
[^ ... ] defines a set of characters ... where the ^ tells it to match anything except for them. gsub then replaces all those characters with nothing, leaving you with what you want.

Removing character from regexp class in R

Edit: Changing the whole question to make it clearer.
Can I remove a single character from one of the regular expression classes in R (such as [:alnum:])?
For example, match all punctuation ([:punct:]) except the _ character.
I am trying the replace underscores used in markdown for italicizing but the italicized substring may contain a single underscore which I would want to keep.
Edit: As another example, I want to capture everything between pairs of underscores (note one pair contains a single underscore that I want to keep between 1 and 10)
This is _a random_ string with _underscores: rate 1_10 please_
You won't believe it, but lazy matching achieved with a mere ? works as expected here:
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
str <- 'This is a _random string with_ a scale of 1_10.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
Result:
[1] "This is a string with some random underscores in it."
[1] "This is a random string with a scale of 1_10."
Here is the demo program
However, if you want to modify the [[:print:]] class, mind it is basically a [\x20-\x7E] range. The underscore being \x5F, you can easily exclude it from the range, and use [\x20-\x5E\x60-\x7E].
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([\x20-\x5E\x60-\x7E]+)_+", "\\1", str)
Returns
[1] "This is a string with some random underscores in it."
Similar to #stribizhev:
x <- "This is _a random_ string with _underscores: rate 1_10 please_"
gsub("\\b_(.*?)_\\b", "\\1", x, perl=T)
produces:
[1] "This is a random string with underscores: rate 1_10 please"
Here we use word boundaries and lazy matching. Note that the default regexp engine has issues with lazy repetition and capture groups, so you may want to use perl=T
gsub('(?<=\\D)\\_(?=\\D|$)','',str,perl=T)

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)

R regex: specifying output selections from wider string matches

One for the regex enthusiasts. I have a vector of strings in the format:
<TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" STYLE="font-size: 10px" size="10" COLOR="#FF0000" LETTERSPACING="0" KERNING="0">Desired output string containing any symbols</FONT></P></TEXTFORMAT>
I'm aware of the perils of parsing this sort of stuff with regex. It would however be useful to know how to efficiently extract an output sub-string of a larger string match - i.e. the contents of angle quotes >...< of the font tag. The best I can do is:
require(stringr)
strng = str_extract(strng, "<FONT.*FONT>") # select font statement
strng = str_extract(strng, ">.*<") # select inside tags
strng = str_extract(strng, "[^/</>]+") # remove angle quote symbols
What would be the simplest formula to achieve this in R?
Use str_match, not str_extract (or maybe str_match_all). Wrap the part that you want to extract match in parentheses.
str_match(strng, "<FONT[^<>]*>([^<>]*)</FONT>")
Or parse the document and extract the contents that way.
library(XML)
doc <- htmlParse(strng)
fonts <- xpathSApply(doc, "//font")
sapply(fonts, function(x) as(xmlChildren(x)$text, "character"))
As agstudy mentioned, xpathSApply takes a function argument that makes things easier.
xpathSApply(doc, "//font", xmlValue)
You can also do it with gsub but I think there are too many permutations to your input vector that may cause this to break...
gsub( "^.*(?<=>)(.*)(?=</FONT>).*$" , "\\1" , x , perl = TRUE )
#[1] "Desired output string containing any symbols"
Explanation
^.* - match any characters from the start of the string
(?<=>) - positive lookbehind zero-width assertion where the subsequent match will only work if it is preceeded by this, i.e. a >
(.*) - then match any characters (this is now a numbered capture group)...
(?=</FONT>) - ...until you match "</FONT>"
.*$ - then match any characters to the end of the string
In the replacement we replace all matched stuff by numbered capture group \\1, and there is only one capture group which is everything between > and </FONT>.
Use at your peril.