Use regex to insert space between collapsed words - regex

I'm working on a choropleth in R and need to be able to match state names with match.map(). The dataset I'm using sticks multi-word names together, like NorthDakota and DistrictOfColumbia.
How can I use regular expressions to insert a space between lower-upper letter sequences? I've successfully added a space but haven't been able to preserve the letters that indicate where the space goes.
places = c("NorthDakota", "DistrictOfColumbia")
gsub("[[:lower:]][[:upper:]]", " ", places)
[1] "Nort akota" "Distric olumbia"

Use parentheses to capture the matched expressions, then \N, where N is the group number (\\N in an R string), to retrieve them:
places = c("NorthDakota", "DistrictOfColumbia")
gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", places)
## [1] "North Dakota" "District Of Columbia"

You want to use capturing groups to capture the matched context so you can refer back to each matched group in your replacement call. To access a group, use two backslashes \\ followed by the group number.
> places = c('NorthDakota', 'DistrictOfColumbia')
> gsub('([[:lower:]])([[:upper:]])', '\\1 \\2', places)
# [1] "North Dakota" "District Of Columbia"
Another way is to switch on PCRE by using perl=T and use lookaround assertions.
> places = c('NorthDakota', 'DistrictOfColumbia')
> gsub('[a-z]\\K(?=[A-Z])', ' ', places, perl=T)
# [1] "North Dakota" "District Of Columbia"
Explanation:
The \K escape sequence resets the starting point of the reported match, so any previously consumed characters are no longer included. Basically, it throws away everything matched up to that point.
[a-z] # any character of: 'a' to 'z'
\K # '\K' (resets the starting point of the reported match)
(?= # look ahead to see if there is:
[A-Z] # any character of: 'A' to 'Z'
) # end of look-ahead
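For the match.map() use case, the capture-group version can be wrapped in a small helper. This is only a sketch: the name split_camel is made up for illustration, and it assumes plain ASCII names where every word boundary is a lowercase letter followed by an uppercase one.

```r
# Hypothetical helper (the name is not from any package): insert a space
# wherever a lowercase letter is immediately followed by an uppercase one.
split_camel <- function(x) {
  # \\1 and \\2 put the two captured letters back around the new space
  gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", x)
}

places <- c("NorthDakota", "DistrictOfColumbia", "Ohio")
split_camel(places)

# Limitation: a run of capitals has no lower-upper boundary inside it,
# so only the boundary between "n" and "I" gets a space here.
split_camel("USVirginIslands")
```

Single-word names such as "Ohio" pass through unchanged, so the helper can be applied to the whole column before matching.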

Conditional replace depending on which character is found

This is NOT a duplicate of How to use conditionals when replacing in Notepad++ via regex as I am asking something very specific here which I cannot implement following the info in that question. So kindly allow this question.
I want to replace a range of characters with a corresponding range of characters. So far, I can only do it with multiple operations.
For example, match any word that starts with a capital Latin character in the range [ABEZHIKMNOPTYXZ] and is followed by a Greek lowercase letter [α-ωά-ώ] and replace the character in the first matched group with a similar-looking character but in the Greek range [ΑΒΕΖΗΙΚΜΝΟΡΤΥΧΖ] (note, they look the same but are different characters).
What I have come up with so far is multiple replacements, i.e.
(A)([α-ωά-ώ])
Α\2
(B)([α-ωά-ώ])
Β\2
....
So that for example:
Aνθρώπινος would become Ανθρώπινος
Bάτος would become Βάτος
Preferably this should work in EmEditor, Notepad++ being the 2nd option.
Notepad++ supports conditional replacement; you can use it like this:
Find what: (?:(A)|(B)|(E)|(Z)|(H)|(I)|(K)|(M)|(N)|(O)|(P)|(T)|(Y)|(X)|(Z))(?=[α-ωά-ώ])
Replace with: (?{1}Α:(?{2}Β:(?{3}Ε:(?{4}Ζ:)))) add the other Greek letters similarly
Explanation of the find pattern:
(?: # start non capture group
(A)|(B)|... # one capture group per Latin letter
) # end non capture group
(?= # positive lookahead, make sure we have after:
[α-ωά-ώ] # a small greek letter
) # end lookahead
Replacement:
(?{1} # if group 1 exists "A"
Α # replace with greek letter
: # else
(?{2} # if group 2 exists "B"
Β # replace with greek letter
: # else
(?{3} # and so on ...
Ε
:
(?{4}
Ζ
:
)
)
)
)
I've made a test for only 2 letters, "A" and "B", and replaced them with the more visually distinct letters "X" and "Y", just to show how it works.
Screen capture (before):
Screen capture (after):

Filter out an expression from Regex match

I have a regex which works fine for most input patterns, but not for a few.
The regex I have is
("(?!([1-9]{1}[0-9]*)-(([1-9]{1}[0-9]*))-)^(([1-9]{1}[0-9]*)|(([1-9]{1}[0-9]*)( |-|( ?([1-9]{1}[0-9]*))|(-?([1-9]{1}[0-9]*)){1})*))$")
I want to filter out a certain type of expression from the input strings: except when it is the last character of the input string, every dash (-) must be surrounded by two separate integers, i.e. (integer)(dash)(integer).
Two dashes sharing 3 integers are not allowed, even if they have integers on either side, like (integer)(dash)(integer)(dash)(integer).
If the dash is the last character of the input and is preceded by an integer, that is acceptable input, like (integer)(dash)(end of the string).
Also, two consecutive dashes are not allowed. Any of the above-mentioned formats can have space(s) between them.
To give the gist, these dashes are used in my input string to provide a range.
Some examples of expressions that I want to filter out are
1-5-10, 1 - 5 - 10, 1 - - 5, -5
Update - There are some rules which will drive the input string. My job is to make sure I allow only those input strings which follow the format. Rules for the format are -
1. Numbers are delimited by a space (' '), but a dash does not need surrounding spaces. For example, "10 20 - 30" and "10 20-30" are both valid values.
2. A dash ('-') is used to set a range (from, to). It can also be used to set a range to the end of the job queue list. For example, "100-150 200-250 300-" is a valid value.
3. A dash without a start job number is not allowed. For example, "-10" is not allowed.
Thanks
You might use:
^(?:(?:[1-9][0-9]*[ ]?-[ ]?[1-9][0-9]*|[1-9][0-9]*)(?: (?:[1-9][0-9]*[ ]?-[ ]?[1-9][0-9]*|[1-9][0-9]*))*(?: [1-9][0-9]*-)?|[1-9][0-9]*-?)[ ]*$
Explanation
^ Assert start of the string
(?: Non capturing group
(?: Non capturing group
[1-9][0-9]*[ ]?-[ ]?[1-9][0-9]* Match number > 0, an optional space, a dash, an optional space and number > 0. The space is in a character class [ ] for clarity.
| Or
[1-9][0-9]* Match number > 0
) Close non capturing group
(?:[ ] Non capturing group followed by a space
(?: Non capturing group
[1-9][0-9]*[ ]?-[ ]?[1-9][0-9]* Match number > 0, an optional space, a dash, an optional space and number > 0.
| Or
[1-9][0-9]* Match number > 0
) close non capturing group
)* close non capturing group and repeat zero or more times
(?: [1-9][0-9]*-)? Optional part that matches a space followed by a number > 0 and a dash
| Or
[1-9][0-9]*-? Match a number > 0 followed by an optional dash
) close non capturing group
[ ]*$ Match zero or more times a space and assert the end of the string
Note: If you want to match zero or more spaces instead of an optional space, you could change [ ]? to [ ]*. You can also write [1-9]{1} as [1-9].
After the update the question gained quite a lot of complexity. Since some parts of the regex are reused multiple times, I took the liberty of working this out in Ruby and cleaning it up afterwards. I'll show you the build process so the regex can be understood. Ruby uses #{variable} for regex and string interpolation.
integer = /[1-9][0-9]*/
integer_or_range = /#{integer}(?: *- *#{integer})?/
integers_or_ranges = /#{integer_or_range}(?: +#{integer_or_range})*/
ending = /#{integer} *-/
regex = /^(?:#{integers_or_ranges}(?: +#{ending})?|#{ending})$/
#=> /^(?:(?-mix:(?-mix:(?-mix:[1-9][0-9]*)(?: *- *(?-mix:[1-9][0-9]*))?)(?: +(?-mix:(?-mix:[1-9][0-9]*)(?: *- *(?-mix:[1-9][0-9]*))?))*)(?: +(?-mix:(?-mix:[1-9][0-9]*) *-))?|(?-mix:(?-mix:[1-9][0-9]*) *-))$/
Cleaning up the above regex leaves:
^(?:[1-9][0-9]*(?: *- *[1-9][0-9]*)?(?: +[1-9][0-9]*(?: *- *[1-9][0-9]*)?)*(?: +[1-9][0-9]* *-)?|[1-9][0-9]* *-)$
You can replace [0-9] with \d if you like, but since you used the [0-9] syntax in your question I used it for the answer as well. Keep in mind that if you do replace [0-9] with \d, you'll have to escape the backslash in a string context, e.g. "[0-9]" becomes "\\d".
You mention in your question that
Any of the above-mentioned formats can have space(s) between them.
I assumed that this means space(s) are not allowed before or after the actual content, only between the integers and -.
Valid:
15 - 10
1234 -
Invalid:
" 15 - 10" (leading space)
"123 " (trailing space)
If this is not the case, simply add " *" (a space followed by *) to the start and end.
^ *... *$
Where ... is the rest of the regex.
You can test the regex in my demo, but it should be clear from the build process what the regex does.
const inputs = [
  '1-5-10',
  '1 - 5 - 10',
  '1 - - 5',
  '-5',
  '15-10',
  '15 - 10',
  '15 - 10',
  '1510',
  '1510-',
  '1510 -',
  '1510 ',
  ' 1510',
  ' 15 - 10',
  '10 20 - 30',
  '10 20-30',
  '100-150 200-250 300-',
  '100-150 200-250 300- ',
  '1-2526-27-28-',
  '1-25 26-2728-',
  '1-25 26-27 28-',
];
const regex = /^(?:[1-9][0-9]*(?: *- *[1-9][0-9]*)?(?: +[1-9][0-9]*(?: *- *[1-9][0-9]*)?)*(?: +[1-9][0-9]* *-)?|[1-9][0-9]* *-)$/;
// Log each input followed by its match result (null when the input is rejected).
const logInputAndMatch = input => {
  console.log(`input: "${input}"`);
  console.log(input.match(regex));
};
inputs.forEach(logInputAndMatch);
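The same build process can be sketched in R, the language used in most of the other threads here. This is an illustration rather than part of the original answer: paste0() stands in for Ruby's interpolation, the wrapper name is_valid_job_list is made up, and perl = TRUE is used so the pattern is handled by PCRE.

```r
# Build the pattern from named pieces, mirroring the Ruby build process.
integer            <- "[1-9][0-9]*"
integer_or_range   <- paste0(integer, "(?: *- *", integer, ")?")
integers_or_ranges <- paste0(integer_or_range, "(?: +", integer_or_range, ")*")
ending             <- paste0(integer, " *-")
regex <- paste0("^(?:", integers_or_ranges, "(?: +", ending, ")?|", ending, ")$")

# Hypothetical wrapper: TRUE for inputs that follow the format, FALSE otherwise.
is_valid_job_list <- function(x) grepl(regex, x, perl = TRUE)

inputs <- c("10 20 - 30", "10 20-30", "100-150 200-250 300-",
            "1-5-10", "1 - 5 - 10", "1 - - 5", "-5")
data.frame(input = inputs, valid = is_valid_job_list(inputs))
```

The first three inputs come from the rules in the question and are accepted; the last four are the examples the question wants filtered out and are rejected.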

capture repetition of letters in a word with regex

I'm trying to detect words in which letters are repeated, and I would like to replace each such run with a single instance of the repeated letter. The text is in Hebrew. For instance, שללללוווווםםםם should just become שלום.
Basically, when a letter repeats itself 3 times or more, it should be detected and replaced.
I want to use the regular expression with R's gsub.
df$text <- gsub("?", "?", df$text)
You can use
> x = "שללללוווווםםםם"
> gsub("(.)\\1{2,}", "\\1", x)
#[1] "שלום"
NOTE: It will replace any character (not just Hebrew) that is repeated three or more times in a row.
Or use the following to restrict it to word characters (letters, digits, underscore) from any language:
> gsub("(\\w)\\1{2,}", "\\1", x)
If you plan to only remove repeating characters from the Hebrew script (keeping others), I'd suggest:
s <- "שללללוווווםםםם ......... שללללוווווםםםם"
gsub("(\\p{Hebrew})\\1{2,}", "\\1", s, perl=TRUE)
Details:
(\\p{Hebrew}) - Group 1 capturing a character from Hebrew script (as \p{Hebrew} is a Unicode property/category class)
\\1{2,} - 2 or more (due to {2,} limiting quantifier) same characters stored in Group 1 buffer (as \\1 is a backreference to Group 1 contents).
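The effect of restricting the captured character can be seen in a small sketch. To keep it independent of encoding settings, it uses made-up ASCII sample strings instead of Hebrew, and [[:alpha:]] stands in for \p{Hebrew}; both substitutions are assumptions for illustration only.

```r
x <- c("heeellooo woooorld", "sooo ......... good")

# Any character repeated 3+ times collapses to one occurrence,
# so the run of dots shrinks to a single dot as well.
any_char <- gsub("(.)\\1{2,}", "\\1", x)
any_char

# Only letters collapse; the run of dots is left alone.
letters_only <- gsub("([[:alpha:]])\\1{2,}", "\\1", x)
letters_only
```

Note that doubled letters ("ll" in "hello", "oo" in "good") survive in both cases, because {2,} requires at least two further copies after the captured character.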

Cleaning strings in R: add punctuation w/o overwriting last character

I'm new to R and unable to find other threads with a similar issue.
I'm cleaning data that requires punctuation at the end of each line. I am unable to add, say, a period without overwriting the final character of the line preceding the carriage return + line feed.
Sample code:
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Data2 <- gsub("[^[:punct:]]\r\n\\*", ".\r\n\\*", Data1)
The contents of Data2:
[1] "%trn: dads shee.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Notice the "p" of sheep was overwritten with the period. Any thoughts on how I could avoid this?
Capturing group:
Use a capturing group around your character class and reference the group inside of your replacement.
gsub('([^[:punct:]])\\r\\n\\*', '\\1.\r\n*', Data1)
#     ^            ^             ^^^
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Lookarounds:
You can switch on PCRE by using perl=T and use lookarounds to achieve this.
gsub('[^\\pP]\\K(?=\\r\\n\\*)', '.', Data1, perl=T)
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
The negated class [^\pP] matches any character except punctuation (\pP is the Unicode property for any kind of punctuation character).
Instead of using a capturing group, I used \K here. This escape sequence resets the starting point of the reported match, so any previously matched characters are not included in the final matched sequence. I also used a positive lookahead to assert that a carriage return, a newline and a literal asterisk follow.
There are several ways to do it:
Capture group:
gsub("([^[:punct:]])\\r\\n\\*", "\\1.\r\n*", Data1)
Positive lookbehind (a zero-width assertion, so the preceding character is not consumed):
gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl=T)
EDIT: fixed the backslashes and removed the uncertainty about R support for these.
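As a quick check, the three variants from the answers above can be run side by side on the sample data. This sketch only reuses the patterns already shown; the variable names are made up for illustration.

```r
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

# Capture group: put the matched character back with \\1
by_group <- gsub("([^[:punct:]])\\r\\n\\*", "\\1.\r\n*", Data1)

# \K: drop the character from the reported match, then insert the period
by_reset <- gsub("[^\\pP]\\K(?=\\r\\n\\*)", ".", Data1, perl = TRUE)

# Lookbehind: assert the character without consuming it
by_lookbehind <- gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl = TRUE)

# All three keep the "p" of "sheep" and agree on this input
identical(by_group, by_reset) && identical(by_group, by_lookbehind)
```

On other data the variants may differ slightly, since [[:punct:]] also covers symbols such as $ or +, which \pP does not treat as punctuation.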

Remove all text between two brackets

Suppose I have some text like this,
text<-c("[McCain]: We need tax policies that respect the wage earners and job creators. [Obama]: It's harder to save. It's harder to retire. [McCain]: The biggest problem with American healthcare system is that it costs too much. [Obama]: We will have a healthcare system, not a disease-care system. We have the chance to solve problems that we've been talking about... [Text on screen]: Senators McCain and Obama are talking about your healthcare and financial security. We need more than talk. [Obama]: ...year after year after year after year. [Announcer]: Call and make sure their talk turns into real solutions. AARP is responsible for the content of this advertising.")
and I would like to remove (edit: get rid of) all of the text between the [ and ] (and the brackets themselves). What's the best way to do this? Here is my feeble attempt using regex and the stringr package:
str_extract(text, "\\[[a-z]*\\]")
Thanks for any help!
With this:
gsub("\\[[^\\]]*\\]", "", subject, perl=TRUE);
What the regex means:
\[ # '['
[^\]]* # any character except: '\]' (0 or more
# times (matching the most amount possible))
\] # ']'
The following should do the trick. The ? forces a lazy match, so .*? matches as few characters as possible before the subsequent ].
gsub('\\[.*?\\]', '', text)
Here's another approach:
library(qdap)
bracketX(text, "square")
I think this technically answers what you've asked, but you probably want to add a \\: to the end of the regex for prettier text (removing the colon and space).
library(stringr)
str_replace_all(text, "\\[.+?\\]", "")
#> [1] ": We need tax policies that respect the wage earners..."
vs...
str_replace_all(text, "\\[.+?\\]\\: ", "")
#> [1] "We need tax policies that respect the wage earners..."
Created on 2018-08-16 by the reprex package (v0.2.0).
No need to use a PCRE regex with a negated character class / bracket expression, a "classic" TRE regex will work, too:
subject <- "Some [string] here and [there]"
gsub("\\[[^][]*]", "", subject)
## => [1] "Some  here and "
Details:
\\[ - a literal [ (must be escaped or used inside a bracket expression like [[] to be parsed as a literal [)
[^][]* - a negated bracket expression that matches 0+ chars other than [ and ] (note that the ] at the start of the bracket expression is treated as a literal ])
] - a literal ] (outside a bracket expression this character is not special in either PCRE or TRE regexps and does not have to be escaped).
If you want to only replace the square brackets with some other delimiters, use a capturing group with a backreference in the replacement pattern:
gsub("\\[([^][]*)\\]", "{\\1}", subject)
## => [1] "Some {string} here and {there}"
The (...) parenthetical construct forms a capturing group, and its contents can be accessed with a backreference \1 (as the group is the first one in the pattern, its ID is set to 1).
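For speaker labels like the ones in the question, the bracket pattern can be extended to swallow the colon and spaces that follow each label. A minimal sketch, using a shortened, made-up version of the question's text:

```r
subject <- "[McCain]: We need tax policies. [Obama]: It's harder to save."

# Brackets only: the colon and space after each label are left behind
gsub("\\[[^][]*]", "", subject)

# Also consume an optional colon and any spaces after the closing bracket
cleaned <- gsub("\\[[^][]*]:? *", "", subject)
cleaned
```

Making the colon optional with :? keeps the pattern usable for bracketed text that is not followed by a colon.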