Split sentence by words with regex in R - regex

I'd like to use R to extract some information. I have the following sentence and I'd like to split it so that, in the end, I extract only the number 24.
Here's what I have:
doc <- "Hits 1 - 10 from 24"
And I want to extract the number "24". I know how to extract the number once I can split the sentence into "Hits 1 - 10 from" and "24". I tried using this:
n_docs <- unlist(str_split(key_n_docs, ".\\from"))[1]
But this leaves me with: "Hits 1 - 10"
The split evidently works, but I'm interested in the part after "from", not the part before it. Any help is appreciated!

If you want to extract from a single character string:
strsplit(key_n_docs, "from")[[1]][2]
or the equivalent expression used by @BastiM (sorry, I saw your answer after I submitted mine)
unlist(strsplit(key_n_docs, "from"))[2]
If you want to extract from a vector of character strings:
sapply(strsplit(key_n_docs, "from"),`[`, 2)
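For readers who want to try the same split-and-index idea outside R, here is a minimal Python sketch. The helper name part_after and the second sample string are hypothetical, and the strip call is an addition: splitting on "from" leaves the space before the number in place.

```python
def part_after(text, separator="from"):
    """Return the text after the first occurrence of separator, or None."""
    pieces = text.split(separator, 1)  # at most one split, so at most two pieces
    if len(pieces) < 2:
        return None  # separator not found
    return pieces[1].strip()  # drop the whitespace left around the number


# The second entry is a made-up sample to show the vectorised case.
docs = ["Hits 1 - 10 from 24", "Hits 11 - 20 from 24"]
print([part_after(d) for d in docs])
```

The list comprehension plays the role that sapply plays in the R version.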

str_split returns a list, and unlist flattens it into a character vector whose first element is the text before "from" and whose second element is the text after it, so you need index 2 rather than 1. Using
unlist(strsplit("Hits 1 - 10 from 24", "from"))[2]
works like a charm for me.

You can use str_extract from stringr:
library(stringr)
numbers <- str_extract(doc, "[0-9]+$")
This returns only the digits at the end of the string:
numbers
# [1] "24"

You can use sub to extract the number:
sub(".*from *(\\d+).*", "\\1", doc)
# [1] "24"
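The capture-group approach used with sub above carries over to other regex engines. As a rough illustration, assuming Python's re module and a hypothetical helper named count_after_from, the number after "from" can be captured and converted like this:

```python
import re


def count_after_from(text):
    """Return the integer that follows the word 'from', or None if absent."""
    match = re.search(r"from\s*(\d+)", text)  # group 1 holds the digits only
    return int(match.group(1)) if match else None


doc = "Hits 1 - 10 from 24"
print(count_after_from(doc))
```

Returning None when nothing matches avoids the silent fallback of sub, which hands back the whole input string if the pattern does not match.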

Related

Julia - Extract number from string using regex

I have a list of strings each telling me after how many iterations an algorithm converged.
string_list = [
"Converged after 1 iteration",
"Converged after 20 iterations",
"Converged after 7 iterations"
]
How can I extract the number of iterations? The result would be [1, 20, 7]. I tried with a regex. Apparently (?<=after )(.*)(?= iteration*) matches anything between "after" and "iteration", but this doesn't work:
occursin(string_list[1], r"(?<=after )(.*)(?= iteration*)")
There's a great little Julia package that makes creating regexes easier called ReadableRegex, and as luck would have it the first example in the readme is an example of finding every integer in a string:
julia> using ReadableRegex
julia> reg = @compile look_for(
           maybe(char_in("+-")) * one_or_more(DIGIT),
           not_after = ".",
           not_before = NON_SEPARATOR)
r"(?:(?<!\.)(?:(?:[+\-])?(?:\d)+))(?!\P{Z})"
That regex can now be broadcast over your list of strings:
julia> collect.(eachmatch.(reg, string_list))
3-element Vector{Vector{RegexMatch}}:
[RegexMatch("1")]
[RegexMatch("20")]
[RegexMatch("7")]
To extract information out of a regex, you want to use match and captures:
julia> convergeregex = r"Converged after (\d+) iteration"
r"Converged after (\d+) iteration"
julia> match(convergeregex, string_list[2]).captures[1]
"20"
julia> parse.(Int, [match(convergeregex, s).captures[1] for s in string_list])
3-element Vector{Int64}:
1
20
7
\d+ matches a series of digits (here, the number of iterations), and the parentheses around it indicate that you want the part of the string matched by it to be placed in the result's captures array.
You don't need the lookbehind and lookahead operators (?<=, ?=) here.
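The match-and-capture approach is not specific to Julia. A small sketch of the same steps in Python follows, assuming the re module; the names CONVERGED and iteration_counts are made up for illustration.

```python
import re

# One capture group around the digits, as in the Julia pattern above.
CONVERGED = re.compile(r"Converged after (\d+) iteration")


def iteration_counts(messages):
    """Collect the captured iteration count from every matching message."""
    counts = []
    for message in messages:
        match = CONVERGED.search(message)
        if match:  # skip lines that do not follow the expected wording
            counts.append(int(match.group(1)))
    return counts


string_list = [
    "Converged after 1 iteration",
    "Converged after 20 iterations",
    "Converged after 7 iterations",
]
print(iteration_counts(string_list))
```

Because the pattern stops at "iteration", it covers both the singular and the plural wording without any lookaround.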

Go through set of numbers and getting all possible matches

I'm trying to go through a string of digits like "123456789123456" and find every substring that is 8 digits long, with the starting point moving forward by 1 for every match.
I'll use [] to mark the position where each match starts.
Example:
First match: [1]23456789123456 would find: 12345678
Second match: 1[2]3456789123456 would find: 23456789
Third match: 12[3]456789123456 would find: 34567891
and so on...
I'm fairly new to Regex so I don't have a ton of experience in it.
You don't really need regex for this. Just a simple loop should do:
Dim input As String = "123456789123456"
' Slide an 8-character window across the input, one position at a time.
For i As Integer = 0 To input.Length - 8
    Console.WriteLine(input.Substring(i, 8))
Next
12345678
23456789
34567891
45678912
56789123
67891234
78912345
89123456
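If a regex is still wanted, overlapping matches are usually obtained by wrapping the capture in a lookahead, so that each match consumes no characters. The sketch below shows both the plain loop and the lookahead variant in Python; the function names are hypothetical, and the lookahead form assumes an engine that supports capture groups inside lookaheads (Python's re does).

```python
import re


def windows_by_slicing(digits, width=8):
    """Every substring of the given width, moving the start by one each time."""
    return [digits[i:i + width] for i in range(len(digits) - width + 1)]


def windows_by_lookahead(digits, width=8):
    """Same result via regex: the lookahead is zero-width, so matches overlap."""
    return re.findall(r"(?=(\d{%d}))" % width, digits)


data = "123456789123456"
for window in windows_by_lookahead(data):
    print(window)
```

Both functions return the same eight windows for the sample input; the slicing version is the simpler choice when the input is known to contain only digits.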

I want to replace the second occurrence of the number in the string

I have a string say a url like below
"www.regexperl.com/1234/34/firstpage/home.php"
Now I need to replace 34, the second number in the string, with 2.
The resulting string should be:
"www.regexperl.com/1234/2/firstpage/home.php"
The challenge I'm facing is that when I try to store the value 34 and replace it, the 34 inside 1234 is replaced instead, giving the result below:
"www.regexperl.com/122/34/firstpage/home.php"
Kindly let me know a proper regex to solve the problem.
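Use \K, which drops everything matched before it from the final match, so only the second number is replaced: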
^.*?\d+\b.*?\K\d+
Replace with your string. See the demo:
https://regex101.com/r/lW2kK1/1
If the positions are fixed, you can find and replace as follows.
Regex: (\.com\/\d+)(\/\d+)
Input string: www.regexperl.com/1234/34/firstpage/home.php
Replacement: \1/ followed by the number of your choice, for example \1/2.
Output string: www.regexperl.com/1234/2/firstpage/home.php
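Not every engine supports \K (Python's built-in re module does not), so the same effect is often achieved by capturing the prefix and putting it back. A minimal Python sketch under that assumption follows; SECOND_NUMBER and replace_second_number are illustrative names, not part of any library.

```python
import re

# Group 1 captures everything up to and including the separator that
# precedes the second number; the trailing \d+ is the number to replace.
SECOND_NUMBER = re.compile(r"^(\D*\d+\D+)\d+")


def replace_second_number(text, replacement):
    """Replace the second run of digits in text, leaving the first intact."""
    return SECOND_NUMBER.sub(lambda m: m.group(1) + replacement, text, count=1)


url = "www.regexperl.com/1234/34/firstpage/home.php"
print(replace_second_number(url, "2"))
```

Using a function as the replacement avoids any ambiguity between a backreference such as \1 and replacement text that starts with a digit.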

R - split string before two last digits in each column cell

I have a csv with usernames in a column, followed by each user's feedback rating, out of 100.
E.g. James89
I hope to find a way to split the name and the rating, e.g. by inserting a comma before the two last digits using regex. Is this possible? And/or is there a better way to do this?
df1 = data.frame(Product = c(rep("ARCH78"), rep("AUSFUNGUY91"), rep("AddiesAndXans96"), rep("AfroBro79")))
The code above is a tiny excerpt of the data I'm dealing with. I hope to get this output:
ARCH 78
AUSFUNGUY 91
AddiesAndXans 96
AfroBro 79
I've tried this code (inspired by this answer):
df1$P2 <- gsub("(.*?)(..)", "\\1", df1$Product)
It seems to be working, but there's something wrong with the output:
ARCH78 AR
AUSFUNGUY91 AUUNY
AddiesAndXans96 AdesdXs
AfroBro79 AfBr9
As for the following:
I hope to find a way to split the name and the rating, e.g. by inserting a comma before the two last digits using regex.
You can achieve this with a single gsub call:
df1 = data.frame(Product = c(rep("ARCH78"), rep("AUSFUNGUY91"), rep("AddiesAndXans96"), rep("AfroBro79")))
gsub("(\\d{2})$", ",\\1", df1$Product)
## => [1] "ARCH,78" "AUSFUNGUY,91" "AddiesAndXans,96" "AfroBro,79"
You can further adjust the replacement ",\\1", in which the backreference \1 refers to the last 2 digits.
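The anchored pattern (\d{2})$ works the same way in other engines. As an illustration only, the Python sketch below inserts the comma and also splits the value into a name and a numeric rating; insert_comma and split_rating are hypothetical helpers.

```python
import re


def insert_comma(name):
    """Insert a comma before the last two digits, e.g. 'ARCH78' -> 'ARCH,78'."""
    return re.sub(r"(\d{2})$", r",\1", name)


def split_rating(name):
    """Split 'ARCH78' into ('ARCH', 78); return None if there is no rating."""
    match = re.fullmatch(r"(.*?)(\d{2})", name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


products = ["ARCH78", "AUSFUNGUY91", "AddiesAndXans96", "AfroBro79"]
print([insert_comma(p) for p in products])
print([split_rating(p) for p in products])
```

Note that a two-digit pattern would mis-split a rating of exactly 100; handling that case needs a rule the original post does not specify.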

Using gregexpr to get position in a string

What I want to do is to extract the position of a certain expression in a character string (length is 22588). I tried it in the following way:
This is the pattern I'm looking for:
\n,null,[null,null,12.27,800.54]\n,
\n,null,[null,null,12.58,670.84]\n,
\n,null,[null,null,13.45,750.25]\n,
And so on.
I try to give an example:
test = "some other stuff \n,null,[null,null,12.27,800.54]\n, other stuff a lot of characters \n,null,[null,null,12.58,670.84]\n, and again \n,null,[null,null,13.45,750.25]\n,"
Now I want to get the positions of the expressions that have this pattern:
\n,null,[null,null,"decimal numbers""comma between decimal numbers""decimal numbers"]\n,
This is what I tried:
mypattern = "\\\\n,null,\\[\null,null,[:alnum:]\\]\\\\\n,"
gg = gregexpr(mypattern,datalines)
Unfortunately this does not work. The coordinates in the middle change every time, so I need a wildcard for them, and I also guess R has a problem reading the metacharacters.
Thanks in advance!
You can try with this pattern:
"\\\n,null,\\[null,null,\\d+\\.\\d+\\,\\d+\\.\\d+\\]\\\n"
or this pattern, if the number of digits before and after each "." is always the same:
"\\\n,null,\\[null,null,\\d{2}\\.\\d{2}\\,\\d{3}\\.\\d{2}\\]\\\n"
With your example:
gregexpr("\\\n,null,\\[null,null,\\d+\\.\\d+\\,\\d+\\.\\d+\\]\\\n",test)
gregexpr("\\\n,null,\\[null,null,\\d{2}\\.\\d{2}\\,\\d{3}\\.\\d{2}\\]\\\n",test)
#[[1]]
#[1] 18 84 129
#attr(,"match.length")
#[1] 32 32 32
#attr(,"useBytes")
#[1] TRUE
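To cross-check a pattern like this outside R, the same search can be sketched in Python, where re.finditer reports 0-based offsets. The sketch below adds 1 to each start so the positions are comparable with gregexpr; PATTERN and match_positions are illustrative names, and the sample string is retyped from the question, so its exact spacing is an assumption.

```python
import re

# Sample text retyped from the question; "\n" here is a real newline character.
test = ("some other stuff \n,null,[null,null,12.27,800.54]\n, other stuff a lot of "
        "characters \n,null,[null,null,12.58,670.84]\n, and again "
        "\n,null,[null,null,13.45,750.25]\n,")

# In a raw string, \n is the regex escape for a newline; brackets and dots are escaped.
PATTERN = re.compile(r"\n,null,\[null,null,\d+\.\d+,\d+\.\d+\]\n")


def match_positions(text):
    """Return (1-based start, length) for every match, mirroring gregexpr."""
    return [(m.start() + 1, m.end() - m.start()) for m in PATTERN.finditer(text)]


for start, length in match_positions(test):
    print(start, length)
```

If the positions differ from the R output, compare the whitespace in the two sample strings first, since a single extra space shifts every later match.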