Add space between two letters in a string in R [duplicate] - regex

This question already has answers here:
Use regex to insert space between collapsed words
(2 answers)
Closed 5 years ago.
Suppose I have a string like
s = "PleaseAddSpacesBetweenTheseWords"
How do I use gsub in R add a space between the words so that I get
"Please Add Spaces Between These Words"
I should do something like
gsub("[a-z][A-Z]", ???, s)
What do I put for ???. Also, I find the regular expression documentation for R confusing so a reference or writeup on regular expressions in R would be much appreciated.

You just need to capture the matches then use the \1 syntax to refer to the captured matches. For example
s = "PleaseAddSpacesBetweenTheseWords"
gsub("([a-z])([A-Z])", "\\1 \\2", s)
# [1] "Please Add Spaces Between These Words"
Of course, this just puts a space between each lower-case/upper-case letter pairings. It doesn't know what a real "word" is.

Related

RegEx for Dutch ING bankstatement [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
Is there anyone who can help me to get the marked pieces out of this file (see image below) with a regular expression? As you can see, it's difficult because the length is not always the same and the part before my goal is sometimes broken down and sometimes not.
Thank you in advance.
Text:
:61:200106D48,66NDDTEREF//00060100142533
/TRCD/01028/
:86:/EREF/SLDD-0705870-5658387529//MARF/11514814-001//CSID/NL59ZZZ390
373820000//CNTP/NL96ABNA0123456789/ABCANL2A/XXXXXXX123///REMI/UST
D//N00814760/
:61:200106D1840,55NDDTEREF//00060100142534
/TRCD/01028/
:86:/EREF/SLDD-0705869-5658387528//MARF/11514814-001//CSID/NL59ZZZ390
373820000//CNTP/NL96ABNA0123456789/ABCANL2A/XXX123XXXX///REMI/UST
D//N00814759/
:61:200106C236,31NTRFEREF//00060100142535
/TRCD/00100/
:86:/EREF/05881000010520//CNTP/NL19INGB0123456789/ABCBNL2A/XX123XXXX//
/REMI/USTD//KLM REF 1000000022/
The length is not always the same but it does not really matter in your case. You can check for a particular pattern at the end of a string.
(?<=\/\/)([\u2022a-zA-Z0-9]+)(?=\/$)
this regex will look for a string of caracter containing bullet (•), numbers, letters (uppercase and lowercase), that followes two front slash (//) and is followed by a slash (/) and the end of the string ( $ ).
You can test more cases here

Python regex to parse '#####' text in description field [duplicate]

This question already has answers here:
regex to extract mentions in Twitter
(2 answers)
Extracting #mentions from tweets using findall python (Giving incorrect results)
(3 answers)
Closed 3 years ago.
Here's the line I'm trying to parse:
#abc def#gmail.com #ghi j#klm #nop.qrs #tuv
And here's the regex I've gotten so far:
#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
My goal is to get ['#abc', '#ghi', '#tuv'], but no matter what I do, I can't get 'j#klm' to not match. Any help is much appreciated.
Try using re.findall with the following regex pattern:
(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)
inp = "#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"
matches = re.findall(r'(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)', inp)
print(matches)
This prints:
['#abc', '#ghi', '#tuv']
The regex calls for an explanation. The leading lookbehind (?:(?<=^)|(?<=\s)) asserts that what precedes the # symbol is either a space or the start of the string. We can't use a word boundary here because # is not a word character. We use a similar lookahead (?=\s|$) at the end of the pattern to rule out matching things like #nop.qrs. Again, a word boundary alone would not be sufficient.
just add the line initiation match at the beginning:
^#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
it shoud work!

Regex function to find all and only 6 digit numeric string ignoring spaces if any any between [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I have HTML source page as text file.
I need to read file and find out only those numeric strings which have 6 continous digits and can have a space in between those 6 digits
Eg
209 016 - should be come up in search result and as 400013(space removed)
209016 - should also come up in search and unaltered as 209016
any numeric string more then 6 digits long should not come up in search eg 20901677,209016#223, 29016,
I think this can be achieved by regex but I was not able to
A soln in regex is more desirable but anything else is also welcome
To match 6 digits with any number of spaces in between, you may use the following pattern:
\b(?:\d[ ]*?){6}\b
Or if you want to reject it when it's followed by an #, you may use:
\b(?:\d[ ]*?){6}\b(?!#)
Regex demo.
Then, you can use the replace method to remove the space characters.
Python example:
import re
regex = r"\b(?:\d[ ]*?){6}\b(?!#)"
test_str = ("209016 \n"
"209 016\n"
"20901677','209016#223', '29016")
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
print (match.group().replace(" ", ""))
Output:
209016
209016
Try it online.
You can try the following regex:
\b(?<!#)\d(?:\s*\d){5}\b(?!#)
demo: https://regex101.com/r/ZCcDmF/2/
But note that you might have to modify your boundaries if you need to exclude more than the #. it will become something like:
\b(?<!#|other char I need to exclude|another one|...)\d(?:\s*\d){5}\b(?!#|other char I need to exclude|another one|...)
where you have to replace other char I need to exclude, another one,... by the characters.

Regex find sting in the middle of two strings [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 5 years ago.
I want to get the time in the following line. I want to get the string
2017-07-07 08:30:00.065156
in
[ID] = 0,[Time] = 2017-07-07 08:30:00.065156,[access]
I tried this
(?<=[Time] = )(.*?)(?=,)
Where i want to get the string in-between the time tag and the first comma but this doesn't work.
[Time] inside a regex means a T, an i, an m, or an e, unless you escape your square brackets.
You can drop the reluctant quantifier if you use [^,]* in place of .*:
(?<=\[Time\] = )([^,]*)(?=,)

Replacing surroundings of text in brackets when it occurs multiple times in a string [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
I have a string containing LaTeX code, for example \emph{some words here} and I want to get Markdown syntax, for example,*some words here*. I tried:
s <- "some text in \\emph{italics} and some more ..."
pattern <- "\\\\emph\\{(.*)\\}"
gsub(pattern,"*\\1*", s)
> "some text in *italics* and some more ..."
However, I do not succeed at handling multiple occurences in one string.
s <- "some text in \\emph{italics} and some \\emph{more italics} and ..."
gsub(pattern,"*\\1*", s)
> "some text in *italics} and some \\emph{more italics* and ..."
I guess I need a non-greedy version which handles multiple occurrences, but I am not sure how to do it. Any ideas?
Use lazy ? quantifier like this.
Regex: \\\\emph{(.*?)}
Regex101 Demo