how to find if pattern is in string - regex

I have some strings and I'd like to convert each string into a number, so I'd like to use a regular expression. Each string is one of:
["star"]
["near-star"]
["shared"]
["near-shared"]
["complete"]
["near-complete"]
["null"]
["near-null"]
My problem is that both of these statements are true:
> grepl("star", "[\"near-star\"]")
[1] TRUE
> grepl("near-star", "[\"near-star\"]")
[1] TRUE
The same applies to the other labels. Any advice on how to write the right code to match each label is much appreciated.
best regards,
Simone

Trying to answer what I think might be your real problem (convert each string "to" a number)...
Given data:
> strings = c('["star"]', '["near-star"]', '["shared"]', '["near-shared"]')
> data = sample(strings,20,TRUE)
such that:
> head(data)
[1] "[\"near-star\"]" "[\"star\"]" "[\"near-shared\"]"
[4] "[\"near-shared\"]" "[\"shared\"]" "[\"star\"]"
Simply do:
> dataf=factor(data)
> as.numeric(dataf)
[1] 2 4 1 1 3 4 1 2 2 1 2 3 4 4 3 4 4 1 1 4
the mapping being given by:
> levels(dataf)
[1] "[\"near-shared\"]" "[\"near-star\"]" "[\"shared\"]"
[4] "[\"star\"]"
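If the numbers should follow a fixed order rather than the alphabetical order of the levels, one option is to pass the levels explicitly. This is a minimal sketch: the `labels` ordering and the `data` vector are made up for illustration and are not from the question.

```r
# All eight labels from the question, in the order we want them numbered.
labels <- c('["star"]', '["near-star"]', '["shared"]', '["near-shared"]',
            '["complete"]', '["near-complete"]', '["null"]', '["near-null"]')

# Hypothetical input vector, for illustration only.
data <- c('["near-star"]', '["star"]', '["null"]', '["near-complete"]')

# With explicit levels, the code of each label is its position in `labels`.
codes <- as.integer(factor(data, levels = labels))
print(codes)
```

Any string that is not in `labels` becomes NA, which makes unexpected input easy to spot.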

Others have mentioned just using factors or the fixed argument (either of which will work fine for your stated question). More generally, if you want to match a string or pattern only when it is not preceded by a given string, you can use a negative lookbehind, an extension available in Perl-style regular expressions:
> test <- c('star','near-star')
> grepl('(?<!near-)star', test, perl=TRUE )
[1] TRUE FALSE
The regular expression here says to match the string "star", but only if it is not preceded by the string "near-". The help page ?regexp has details (you need to scroll almost all the way to the bottom).
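The same idea can be tried on the bracketed strings from the question. This is only a sketch; the vector `x` is an assumed sample of the labels.

```r
# Assumed sample of the labels from the question.
x <- c('["star"]', '["near-star"]', '["shared"]', '["near-shared"]')

# Match "star" / "shared" only when not directly preceded by "near-".
# Lookbehind requires the PCRE engine, hence perl = TRUE.
is_star   <- grepl('(?<!near-)star',   x, perl = TRUE)
is_shared <- grepl('(?<!near-)shared', x, perl = TRUE)

print(is_star)
print(is_shared)
```

Each logical vector flags only the plain label, not its "near-" variant.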

You can include the square brackets and quotes in your pattern. Furthermore, you can use fixed = TRUE for matching the string as is.
> grepl("[\"star\"]", "[\"near-star\"]", fixed = TRUE)
[1] FALSE
> grepl("[\"star\"]", "[\"star\"]", fixed = TRUE)
[1] TRUE
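Since each string is exactly one label, two other ways to get a whole-string match are an anchored pattern or plain equality. A small sketch, with `x` as an assumed sample vector:

```r
# Assumed sample of the labels from the question.
x <- c('["star"]', '["near-star"]', '["shared"]')

# Anchored regex: ^ and $ force the pattern to cover the whole string;
# the square brackets are escaped so they are taken literally.
exact_regex <- grepl('^\\["star"\\]$', x)

# Plain equality needs no regex at all.
exact_equal <- x == '["star"]'

print(exact_regex)
print(exact_equal)
```

Both give the same result here; equality is probably the simpler choice when the full set of labels is known in advance.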

Related

Convert a regex expression to erlang's re syntax?

I am having hard time trying to convert the following regular expression into an erlang syntax.
What I have is a test string like this:
1,2 ==> 3 #SUP: 1 #CONF: 1.0
And the regex that I created with regex101 is this (see below):
([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)
But I am getting weird match results if I convert it to erlang - here is my attempt:
{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
Also, I get more than four matches. What am I doing wrong?
Here is the regex101 version:
https://regex101.com/r/xJ9fP2/1
I don't know much about erlang, but I will try to explain. With your regex
>{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
>re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
{match,[{0, 28},{0,3},{8,1},{16,1},{25,3}]}
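In the first tuple, {0, 28}, the first number (0) is the starting index of the match and the second (28) is the total number of matched characters from that index. The remaining tuples follow the same {Start, Length} layout.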
Reason for more than four groups
The first match always indicates the entire string matched by the complete regex; the rest are the four captured groups you want. So there are five groups in total.
([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)
<------->         <---->             <--->              <--------->
First group       Second group       Third group        Fourth group
<----------------------------------------------------------------->
This regex matches the entire string and is the first match you are getting
(zeroth group)
How to find desired answer
Here we want anything except the first group (which is entire match by regex). So we can use all_but_first to avoid the first group
> re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M, [{capture, all_but_first, list}]).
{match,["1,2","3","1","1.0"]}
More info can be found here
If you are in doubt what is content of the string, you can print it and check out:
1> RE = "([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)".
"([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)"
2> io:format("RE: /~s/~n", [RE]).
RE: /([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)/
For the rest of the issue, there is a great answer by rock321987.

R regmatches() and stringr str_extract() dragging whitespaces along

Here's the thing:
test=" 2 15 3 23 12 0 0.18"
#I want to extract the 1st number separately
pattern="^ *(\\d+) +"
d=regmatches(test,gregexpr(pattern,test))
> d
[[1]]
[1] " 2 "
library(stringr)
f=str_extract(test,pattern)
> f
[1] " 2 "
They both bring whitespaces to the result despite usage of ()-brackets. Why? The brackets are for specifying which part of the matched pattern you want, am I wrong? I know I can trim them with trimws() or coerce them directly to numeric, but I wonder if I misunderstand some mechanics of patterns.
Using str_match (or str_match_all)
Since you want to extract a capture group, you can use str_match (or str_match_all). str_extract only extracts whole matches.
From R stringr help:
str_match Extract matched groups from a string.
and
str_extract to extract the complete match
R code:
library(stringr)
test=" 2 15 3 23 12 0 0.18"
pattern="^ *(\\d+) +"
f=str_match(test,pattern)
f[[2]]
## [1] "2"
f[[2]] returns the second item, which is the value of the first capture group.
Using regmatches
As it is mentioned in the comment above, it is also possible with regmatches and regexec:
test=" 2 15 3 23 12 0 0.18"
pattern="^ *(\\d+) +"
res <- regmatches(test,regexec(pattern,test))
res[[1]][2] # The res list contains all matches and submatches
## [1] "2" # We get item [2] from the first match to get "2"
See regexec help page that says:
regexec returns a list of the same length as text each element of which is either -1 if there is no match, or a sequence of integers with the starting positions of the match and all substrings corresponding to parenthesized subexpressions of pattern, with attribute "match.length" a vector giving the lengths of the matches (or -1 for no match).
OP task specific solution
Actually, since you only are interested in 1 integer number in the beginning of a string, you could achieve what you want with a mere gsub:
> gsub("^ *(\\d+) +.*", "\\1", test)
[1] "2"
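For several lines at once, the regexec approach can be wrapped so that lines without a leading number give NA. This is a base R sketch; the `lines` vector is invented for illustration.

```r
# Hypothetical input: two lines with a leading number, one without.
lines <- c(" 2 15 3 23 12 0 0.18", "  17 4 0.5", "no leading number")

# regexec keeps capture groups; regmatches returns, per line, either
# c(full_match, group1) or character(0) when there is no match.
m <- regmatches(lines, regexec("^ *(\\d+) +", lines))

# Take the first capture group, or NA when the line did not match.
first <- vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_,
                character(1))
first_num <- as.integer(first)
print(first_num)
```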

Split sentence by words with regex in R

I'm using (or I'd like to use) R to extract some information. I have the following sentence and I'd like to split. In the end, I'd like to extract only the number 24.
Here's what I have:
doc <- "Hits 1 - 10 from 24"
And I want to extract the number "24". I know how to extract the number once I can reduce the sentence in "Hits 1 - 10 from" and "24". I tried using this:
n_docs <- unlist(str_split(key_n_docs, ".\\from"))[1]
But this leaves me with: "Hits 1 - 10"
Obviously the split works somehow, but I'm interested in the part after "from" not the one before. All the help is appreciated!
If you want to extract from a single character string:
strsplit(key_n_docs, "from")[[1]][2]
or the equivalent expression used by @BastiM (sorry I saw your answer after I submitted mine)
unlist(strsplit(key_n_docs, "from"))[2]
If you want to extract from a vector of character strings:
sapply(strsplit(key_n_docs, "from"),`[`, 2)
The result of str_split contains one piece for each side of the separator. R indexes from 1, so index 1 is the text before "from" and the number you're searching for is at index 2. Using
unlist(strsplit("Hits 1 - 10 from 24", "from"))[2]
works like a charm for me.
You can use str_extract from stringr:
library(stringr)
numbers <- str_extract(doc, "[0-9]+$")
This will give only the number at the end of the sentence.
numbers
"24"
You can use sub to extract the number:
sub(".*from *(\\d+).*", "\\1", doc)
# [1] "24"
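Note that splitting on "from" leaves a leading space in the second piece, and every approach above returns a string. A base R sketch of the final clean-up, using the `doc` string from the question:

```r
doc <- "Hits 1 - 10 from 24"

# Split on the literal word, take the part after it, trim, and convert.
after_from <- trimws(strsplit(doc, "from", fixed = TRUE)[[1]][2])
n_docs <- as.integer(after_from)

# Alternative: pull out every run of digits and keep whichever is needed.
nums <- as.integer(regmatches(doc, gregexpr("[0-9]+", doc))[[1]])

print(n_docs)
print(nums)
```

Here `n_docs` is the total and `nums` holds all three numbers in order, so `nums[3]` is the same value.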

regexpr and only matching prices and not other digits

I'm trying to come up with code that will extract only the price from a line of text.
Motivated by RegEx for Prices?, I came up with the following command:
gregexpr('\\d+(\\.\\d{1,2})', '23434 34.232 asdf 3.12 ')
[[1]]
[1] 7 19
attr(,"match.length")
[1] 5 4
attr(,"useBytes")
[1] TRUE
However, in my case, I would only like 3.12 to match and not 34.232. Any suggestions?
I think this should work:
'\\d+\\.\\d{1,2}(?!\\d)'
Negative lookahead needs perl = TRUE in R; if you would rather avoid lookahead, here is an alternative:
\\d+\\.\\d{1,2}(?:[^\\d]|$)
Or: one or more digits followed by a point, followed by 1 or 2 digits, followed by white space or end of string:
\\d+\\.\\d{1,2}(\\s|$)
Edit: as per comments, R uses double-escape
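A quick way to check the lookahead version in R is to extract the matches with regmatches. This sketch reuses the test string from the question; the variable names are only illustrative.

```r
txt <- '23434 34.232 asdf 3.12 '

# Digits, a dot, then one or two digits that are NOT followed by another digit.
# Lookahead is a PCRE feature, so perl = TRUE is required.
pattern <- '\\d+\\.\\d{1,2}(?!\\d)'
prices <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
print(prices)
```

Because the lookahead consumes no characters, the extracted text is just the price, with no trailing space.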

price regex help

How can I make the regex below also detect single-digit prices like £7, not only values greater than 9?
/\d[\d\,\.]+/is
thanks
To match a single digit as well, you can change it to
/\d[\d,.]*/
The + means one or more, which is why the whole thing won't match just a 7. The * means zero or more, so an extra digit, comma, or period becomes optional.
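The difference between + and * can be checked quickly. The question does not say which language it uses, so this sketch uses R with perl = TRUE as a stand-in, and the sample prices are made up.

```r
# Hypothetical sample prices; "\u00a3" is the pound sign.
prices <- paste0("\u00a3", c("7", "12", "1,299.50"))

# + requires at least one more character after the first digit.
with_plus <- grepl("\\d[\\d,.]+", prices, perl = TRUE)

# * makes the extra characters optional, so a lone digit matches too.
with_star <- grepl("\\d[\\d,.]*", prices, perl = TRUE)

print(with_plus)
print(with_star)
```

Only the single-digit price changes between the two results.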
The longer answer might be more complicated. For example, in the book Regular Expression Cookbook, there is an excerpt: (remove the ^ and $ if you want it to match the 2 in apple $2 each) but note that when the number is 1000 or more, the , is needed. For example, the first regex won't match 1000.33
(unsourced image from a book removed)
Your expression would also allow input such as 123...3456... I think you might want something like (£|\$|€)\d+([,.]\d{2})?
This requires the source to have a currency symbol and, when a separator is present, exactly two digits for cents.
You might look at a regex more like the following.
/(?:\d+[,.]?\d*)|(?:[,.]\d+)/
Test Set:
5.00
$7.00
6123.58
$1
.75
Result Set:
[0] => 5.00
[1] => 7.00
[2] => 6123.58
[3] => 1
[4] => .75
EDIT: Additional Case added
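The result set above can be reproduced with a short script. The answer does not name a language, so this is a sketch in R (with perl = TRUE for the non-capturing groups) rather than the answerer's own code.

```r
# Test set from the answer above.
tests <- c("5.00", "$7.00", "6123.58", "$1", ".75")

# Either digits with an optional separator and more digits,
# or a separator followed by digits (for values like .75).
pattern <- "(?:\\d+[,.]?\\d*)|(?:[,.]\\d+)"

# regexpr finds the first match in each string; regmatches extracts it.
found <- regmatches(tests, regexpr(pattern, tests, perl = TRUE))
print(found)
```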