Find repeated pattern in a string of characters using R

Find repeated pattern in a string of characters using R - regex

I have a large text that contains expressions such as: "aaaahahahahaha that was a good joke". after processing, I want the "aaaaahahahaha" to disappear, or at least, change it to simply "ha".
At the moment, I am using this:
gsub('(.+?)\\1', '', str)
This works when the string with the pattern is at the beginning of the sentence, but not where is located anywhere else. So:
str <- "aaaahahahahaha that was a good joke"
gsub('(.+?)\\1', '', str)
#[1] "ha that was a good joke"`
But
str <- "that was aaaahahahahaha a good joke"
gsub('(.+?)\\1', '', str)
#[1] "that was aaaahahahahaha a good joke"
This question might relate to this: find repeated pattern in python, but I can't find the equivalence in R.
I am assuming is very simple and perhaps I am missing something trivial, but since regular expressions are not my strength and I have already tried a bunch of things that have not worked, I was wondering if someone could help me. The question is: How to find, and substitute, repeated patterns in a string of characters in R?
Thanks in advance for your time.

\b(\S+?)\1\S*\b
Use this.See demo.
https://regex101.com/r/sJ9gM7/46
For r use \\b(\\S+?)\\1\\S*\\b with perl=TRUE option.

Related

Scala. Regexp can't remove symbol ^

I need split sentence to words removing redundant characters.
I prepared regexp for that:
val wordCharacters = """[^A-z'\d]""".r
right now I have rule which can be used to handle task in next way:
wordCharacters.split(words)
.filterNot(_.isEmpty)
where words any sentence I need to parse.
But issue is that in case I try to handle "car: carpet, as,,, java: javascript!!&#$%^&" I get one more word ^. Trying to change my regex and without ^ I'm getting much more issues for different cases...
Is any ideas how to solve it?
P.S.
If somebody want to play with it try link or code below please:
val wordCharacters = """[^A-z'\d]""".r
val stringToInt =
wordCharacters.split("car: carpet, as,,, java: javascript!!&#$%^&")
.filterNot(_.isEmpty)
.toList
println(stringToInt)
Expected result is:
List(car, carpet, as, java, javascript)

The part A-z is not exactly what you want. Probably you assume that lower a comes immediately after upper Z, but there are some other characters in between, and one of them is ^.
So, correcting the regex as
"""[^A-Za-z'\d]""".r
would fix the issue.
Have a look at the order of characters:
https://en.wikipedia.org/wiki/List_of_Unicode_characters

I'd be tempted to start with \W and expand from there.
"\\W+".r.split("car: carpet, as,,, java: javascript!!&#$%^&")
//res0: Array[String] = Array(car, carpet, as, java, javascript)

How can I identify sentences within a text?

I have text that looks like this:-
"I am an engineer. I am skilled in ASP.NET. I also know Node.js.But I don't have much experience. "
Here, "ASP.NET" and "Node.js" are to be treated as words. Also, there is no space before "But I...", but it should be treated as a separate sentence.
The expected output is:
["I am an engineer"," I am skilled in ASP.NET","I also know Node.js","But I don't have much experience"]
Is there a way of doing this?

For your current input you may use the following approach with re.split() function and specific regex pattern:
import re
s = "I am an engineer. I am skilled in ASP.NET. I also know Node.js.But I don't have much experience. "
result = re.split(r'\.(?=\s?[A-Z][^.]*? )', s)
print(result)
The output:
['I am an engineer', ' I am skilled in ASP.NET', ' I also know Node.js', "But I don't have much experience. "]
(?=\s?[A-Z][^.]*? ) - lookahead positive assertion, ensures that sentence delimiter . is followed by word from next sentence

How to split array of strings from two sides?

I have an array of strings (n=1000) in this format:
strings<-c("GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz",
"GSM1264937_2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL.gz",
"GSM1264938_2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL.gz")
I'm wondering what may be a easy way to get this:
strings2<-c(2201_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL,
2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL,
2203_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL)
which means to trim off "GSM1234567" from the front and ".gz" from the end.

Just a gsub solution that matches strings that starts ^ with digits and alphabetical symbols, zero or more times *, until a _ is encountered and (more precisely "or") pieces or strings that have .gz at the end $.
gsub("^([[:alnum:]]*_)|(\\.gz)$", "", strings)
[1] "2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL"
[2] "2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL"
[3] "2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL"
Edit
I forget to escape the second point.

strings <- c("GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz", "GSM1264937_2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL.gz", "GSM1264938_2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL.gz")
strings2 <- lapply(strings, function (x) substr(x, 12, 58))

You can do this using sub:
sub('[^_]+_(.*)\\.gz', '\\1', strings)
# [1] "2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL"
# [2] "2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL"
# [3] "2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL"

Try:
gsub('^[^_]+_|\\.[^.]*$','',strings)

I strongly suggest doing this in two steps. The other solutions work but are completely unreadable: they don’t express the intent of your code. Here it is, clearly expressed:
trimmed_prefix = sub('^GSM\\d+_', '', strings)
strings2 = sub('\\.gz$', '', trimmed_prefix)
But admittedly this can be expressed in one step, and wouldn’t look too badly, as follows:
strings2 = sub('^GSM\\d+_(.*)\\.gz$', '\\1', strings)
In general, think carefully about the patterns you actually want to match: your question says to match the prefix “GSM1234567” but your example contradicts that. I’d generally choose a pattern that’s as specific as possible to avoid accidentally matching faulty input.

Need help removing HTML tags, certain punctuation, and ending periods

Suppose I have this test string:
test.string <- c("This is just a <test> string. I'm trying to see, if a FN will remove certain things like </HTML tags>, periods; but not the one in ASP.net, for example.")
I want to:
Remove anything contained within an html tag
Remove certain punctuation (,:;)
Period that end a sentence.
So the above should be:
c("This is just a string I'm trying to see if a FN will remove certain things like periods but not the one in ASP.net for example")
For #1, I've tried the following:
gsub("<.*?>", "", x, perl = FALSE)
And that seems to work OK.
For #2, I think it's simply:
gsub("[:#$%&*:,;^():]", "", x, perl = FALSE)
Which works.
For #3, I tried:
gsub("+[:alpha:]?[.]+[:space:]", "", test.string, perl = FALSE)
But that didn't work...
Any ideas on where I went wrong? I totally suck at RegExp, so any help would be much appreciated!!

Based on your provided input and rules for what you want removed, the following should work.
gsub('\\s*<.*?>|[:;,]|(?<=[a-zA-Z])\\.(?=\\s|$)', '', test.string, perl=T)
See Working Demo

Try this:
test.string <- "There is a natural aristocracy among men. The grounds of this are virtue and talents. "
gsub("\\.\\s*", "", gsub("([a-zA-Z0-9]). ([A-Z])", "\\1 \\2", test.string))
# "There is a natural aristocracy among men The grounds of this are virtue and talents

Regular expression any character with dynamic size

I want to use a regular expression that would do the following thing ( i extracted the part where i'm in trouble in order to simplify ):
any character for 1 to 5 first characters, then an "underscore", then some digits, then an "underscore", then some digits or dot.
With a restriction on "underscore" it should give something like that:
^([^_]{1,5})_([\\d]{2,3})_([\\d\\.]*)$
But i want to allow the "_" in the 1-5 first characters in case it still match the end of the regular expression, for example if i had somethink like:
to_to_123_12.56
I think this is linked to an eager problem in the regex engine, nevertheless, i tried to do some lazy stuff like explained here but without sucess.
Any idea ?

I used the following regex and it appeared to work fine for your task. I've simply replaced your initial [^_] with ..
^.{1,5}_\d{2,3}_[\d\.]*$
It's probably best to replace your final * with + too, unless you allow nothing after the final '_'. And note your final part allows multiple '.' (I don't know if that's what you want or not).
For the record, here's a quick Python script I used to verify the regex:
import re
strs = [ "a_12_1",
"abc_12_134",
"abcd_123_1.",
"abcde_12_1",
"a_123_123.456.7890.",
"a_12_1",
"ab_de_12_1",
]
myre = r"^.{1,5}_\d{2,3}_[\d\.]+$"
for str in strs:
m = re.match(myre, str)
if m:
print "Yes:",
if m.group(0) == str:
print "ALL",
else:
print "No:",
print str
Output is:
Yes: ALL a_12_1
Yes: ALL abc_12_134
Yes: ALL abcd_134_1.
Yes: ALL abcde_12_1
Yes: ALL a_123_123.456.7890.
Yes: ALL a_12_1
Yes: ALL ab_de_12_1

^(.{1,5})_(\d{2,3})_([\d.]*)$
works for your example. The result doesn't change whether you use a lazy quantifier or not.

While answering the comment ( writing the lazy expression ), i saw that i did a mistake... if i simply use the folowing classical regex, it works:
^(.{1,5})_([\\d]{2,3})_([\\d\\.]*)$
Thank you.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Find repeated pattern in a string of characters using R - regex

\b(\S+?)\1\S\b Use this.See demo. https://regex101.com/r/sJ9gM7/46 For r use \\b(\\S+?)\\1\\S\\b with perl=TRUE option.

Related

Scala. Regexp can't remove symbol ^

How can I identify sentences within a text?

How to split array of strings from two sides?

Need help removing HTML tags, certain punctuation, and ending periods

Regular expression any character with dynamic size

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Find repeated pattern in a string of characters using R - regex

\b(\S+?)\1\S*\b Use this.See demo. https://regex101.com/r/sJ9gM7/46 For r use \\b(\\S+?)\\1\\S*\\b with perl=TRUE option.

Related

Scala. Regexp can't remove symbol ^

How can I identify sentences within a text?

How to split array of strings from two sides?

Need help removing HTML tags, certain punctuation, and ending periods

Regular expression any character with dynamic size

Categories

Resources

\b(\S+?)\1\S\b Use this.See demo. https://regex101.com/r/sJ9gM7/46 For r use \\b(\\S+?)\\1\\S\\b with perl=TRUE option.