To better clean my forum message corpus, I would like to remove the leading spaces before punctuation and add one after if needed, using two regular expressions. The latter was no problem ((?<=[.,!?()])(?! )) but I've some problem with the first at least.
I used this expression: \s([?.!,;:"](?:\s|$))
But it's by far not flexible enough:
It matches even if there's already a space(or more) before the punctuation character
It doesn't match if there's not a space after the punctuation character
It doesn't match any unlisted punctuation character (but I guess I can use [:punct:] for that, at the end of the day)
Finally, both matches the decimal points (while they should not)
How can I eventually rewrite the expression to meet my needs?
Example Strings and expected output
This is the end .Hello world! # This is the end. Hello world! (remove the leading, add the trailing)
This is the end, Hello world! # This is the end, Hello world! (ok!)
This is the end . Hello world! # This is the end. Hello world! (remove the leading, ok the trailing)
This is a .15mm tube # This is a .15 mm tube (ok since it's a decimal point)
Use \p{P} to match all the punctuations. Use \h* instead of \s* because \s would match newline characters also.
(?<!\d)\h*(\p{P}+)\h*(?!\d)
Replace the matched strings by \1<space>
DEMO
> x <- c('This is the end .Stuff', 'This is the end, Stuff', 'This is the end . Stuff', 'This is a .15mm tube')
> gsub("(?<!\\d)\\h*(\\p{P}+)\\h*(?!\\d)", "\\1 ", x, perl=T)
[1] "This is the end. Stuff" "This is the end, Stuff" "This is the end. Stuff"
[4] "This is a .15mm tube"
Here's an expression that detects the substrings that need to be replaced:
\s*\.\s*(?!\d)
You need to replace these by: . (a dot and a space)
Here's a demo link of how this works: http://regex101.com/r/zB2bY3/1
Explanation of the regex:
\s* - matches whitespace, any number of chars (0 - unbounded)
\. - matches a dot
\s* - same as above
(?!\d) - negative lookahead. It means that the string, in order to be matched, must not be followed by a digit (this handles your last test case).
Related
Example: "this is the #example te#s5t #abc17$ ++ 123"
Will match: "this is the te ++ 123"
You could use this regex to match the things you don't want, and then replace any matches with an empty string:
\s*#\S*
The regex matches an optional number of spaces, followed by an #, then some number of non-space characters. For your sample data, this will give this is the te ++ 123 as the output.
Demo on regex101
Note the reason to remove spaces at the beginning (if the # starts a word) is so that when a whole word is removed (e.g. #example in your sample data) you don't get left with two spaces next to each other in the output string.
So I need to match the following:
1.2.
3.4.5.
5.6.7.10
((\d+)\.(\d+)\.((\d+)\.)*) will do fine for the very first line, but the problem is: there could be many lines: could be one or more than one.
\n will only appear if there are more than one lines.
In string version, I get it like this: "1.2.\n3.4.5.\n1.2."
So my issue is: if there is only one line, \n needs not to be at the end, but if there are more than one lines, \n needs be there at the end for each line except the very last.
Here is the pattern I suggest:
^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$
Demo
Here is a brief explanation of the pattern:
^ from the start of the string
\d+ match a number
(?:\.\d+)* followed by dot, and another number, zero or more times
\.? followed by an optional trailing dot
(?:\n followed by a newline
\d+(?:\.\d+)*\.?)* and another path sequence, zero or more times
$ end of the string
You might check if there is a newline at the end using a positive lookahead (?=.*\n):
(?=.*\n)(\d+)\.(\d+)\.((\d+)\.)*
See a regex demo
Edit
You could use an alternation to either match when on the next line there is the same pattern following, or match the pattern when not followed by a newline.
^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))
Regex demo
^ Start of string
(?: Non capturing group
\d+\.\d+\. Match 2 times a digit and a dot
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?=.*\n\d+\.\d+\.) Positive lookahead, assert what follows a a newline starting with the pattern
| Or
\d+\.\d+\. Match 2 times a digit and a dot
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
*(?!.*\n) Negative lookahead, assert what follows is not a newline
) Close non capturing group
(\d+\.*)+\n* will match the text you provided. If you need to make sure the final line also ends with a . then (\d+\.)+\n* will work.
Most programming languages offer the m flag. Which is the multiline modifier. Enabling this would let $ match at the end of lines and end of string.
The solution below only appends the $ to your current regex and sets the m flag. This may vary depending on your programming language.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
regex = /((\d+)\.(\d+)\.((\d+)\.)*)$/gm,
match;
while (match = regex.exec(text)) {
console.log(match);
}
You could simplify the regex to /(\d+\.){2,}$/gm, then split the full match based on the dot character to get all the different numbers. I've given a JavaScript example below, but getting a substring and splitting a string are pretty basic operations in most languages.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
regex = /(\d+\.){2,}$/gm;
/* Slice is used to drop the dot at the end, otherwise resulting in
* an empty string on split.
*
* "1.2.3.".split(".") //=> ["1", "2", "3", ""]
* "1.2.3.".slice(0, -1) //=> "1.2.3"
* "1.2.3".split(".") //=> ["1", "2", "3"]
*/
console.log(
text.match(regex)
.map(match => match.slice(0, -1).split("."))
);
For more info about regex flags/modifiers have a look at: Regular Expression Reference: Mode Modifiers
I'm looking for a regular expression to catch all digits in the first 7 characters in a string.
This string has 12 characters:
A12B345CD678
I would like to remove A and B only since they are within the first 7 chars (A12B345) and get
12345CD678
So, the CD678 should not be touched. My current solution in R:
paste(paste(str_extract_all(substr("A12B345CD678",1,7), "[0-9]+")[[1]],collapse=""),substr("A12B345CD678",8,nchar("A12B345CD678")),sep="")
It seems too complicated. I split the string at 7 as described, match any digits in the first 7 characters and bind it with the rest of the string.
Looking for a general answer, my current solution is to split the first 7 characters and just match all digits in this sub string.
Any help appreciated.
You can use the known SKIP-FAIL regex trick to match all the rest of the string beginning with the 8th character, and only match non-digit characters within the first 7 with a lookbehind:
s <- "A12B345CD678"
gsub("(?<=.{7}).*$(*SKIP)(*F)|\\D", "", s, perl=T)
## => [1] "12345CD678"
See IDEONE demo
The perl=T is required for this regex to work. The regex breakdown:
(?<=.{7}).*$(*SKIP)(*F) - matches any character but a newline (add (?s) at the beginning if you have newline symbols in the input), as many as possible (.*) up to the end ($, also \\z might be required to remove final newlines), but only if preceded with 7 characters (this is set by the lookbehind (?<=.{7})). The (*SKIP)(*F) verbs make the engine omit the whole matched text and advance the regex index to the position at the end of that text.
| - or...
\\D - a non-digit character.
See the regex demo.
The regex solution is cool, but I'd use something easier to read for maintainability. E.g.
library(stringr)
str_sub(s, 1, 7) = gsub('[A-Z]', '', str_sub(s, 1, 7))
You can also use a simple negative lookbehind:
s <- "A12B345CD678"
gsub("(?<!.{7})\\D", "", s, perl=T)
A question came across talkstats.com today in which the poster wanted to remove the last period of a string using regex (not strsplit). I made an attempt to do this but was unsuccessful.
N <- c("59.22.07", "58.01.32", "57.26.49")
#my attempts:
gsub("(!?\\.)", "", N)
gsub("([\\.]?!)", "", N)
How could we remove the last period in the string to get:
[1] "59.2207" "58.0132" "57.2649"
Maybe this reads a little better:
gsub("(.*)\\.(.*)", "\\1\\2", N)
[1] "59.2207" "58.0132" "57.2649"
Because it is greedy, the first (.*) will match everything up to the last . and store it in \\1. The second (.*) will match everything after the last . and store it in \\2.
It is a general answer in the sense you can replace the \\. with any character of your choice to remove the last occurence of that character. It is only one replacement to do!
You can even do:
gsub("(.*)\\.", "\\1", N)
You need this regex: -
[.](?=[^.]*$)
And replace it with empty string.
So, it should be like: -
gsub("[.](?=[^.]*$)","",N,perl = TRUE)
Explanation: -
[.] // Match a dot
(?= // Followed by
[^.] // Any character that is not a dot.
* // with 0 or more repetition
$ // Till the end. So, there should not be any dot after the dot we match.
)
So, as soon as a dot(.) is matched in the look-ahead, the match is failed, because, there is a dot somewhere after the current dot, the pattern is matching.
I'm sure you know this by now since you use stringi in your packages, but you can simply do
N <- c("59.22.07", "58.01.32", "57.26.49")
stringi::stri_replace_last_fixed(N, ".", "")
# [1] "59.2207" "58.0132" "57.2649"
I'm pretty lazy with my regex, but this works:
gsub("(*)(.)([0-9]+$)","\\1\\3",N)
I tend to take the opposite approach from the standard. Instead of replacing the '.' with a zero-length string, I just parse the two pieces that are on either side.
Can you please provide me with a regular expression that would
Allow only alphanumeric
Have definitely only one hyphen in the entire string
Hyphen or spaces not allowed at the front and back of the string
no consecutive space or hyphens allowed.
hypen and one space can be present near each other
Valid - "123-Abc test1","test- m e","abc slkh-hsds"
Invalid - " abc ", " -hsdj sdsd hjds- "
Thanks for helping me out on the same. Your help is much appreciated
/^([a-zA-Z0-9] ?)+-( ?[a-zA-Z0-9])+$/
See demo here.
EDIT:
If there can't be a space on both sides of the hyphen, then there needs to be a little more:
/^([a-zA-Z0-9] ?)+-(((?<! -) )?[a-zA-Z0-9])+$/
^^^^^^^^ ^
Alternatively, if negative lookbehind assertions aren't supported (e.g. in JavaScript), then an equivalent regex:
/^([a-zA-Z0-9]( (?!- ))?)+-( ?[a-zA-Z0-9])+$/
^ ^^^^^^^ ^
Only alphanumeric (hyphen and space included, otherwise it'd make no sense):
^[\da-zA-Z -]+$
This is the main part that will match the string and makes sure that every character is in the given set. I.e. digits and ASCII letters as well as space and hyphen (the use of which will be restricted in the following parts).
Only one hyphen and none at the start or end of the string:
(?=^[^-]+-[^-]+$)
This is a lookahead assertion making sure that the string starts and ends with at least one non-hyphen character. A single hyphen is required in the middle.
No space at the start or end or the string:
(?=^[^ ].*[^ ]$)
Again a lookahead, similar to the one above. They could be combined into one, but it looks much messier and is harder to explain.
No consecutive spaces (consecutive hyphens are ruled out already by 2. above):
(?!.* )
Putting it all together:
(?!.* )(?=^[^ ].*[^ ]$)(?=^[^-]+-[^-]+$)^[\da-zA-Z -]+$
Quick PowerShell test:
PS> $re='(?!.* )(?=^[^ ].*[^ ]$)(?=^[^-]+-[^-]+$)^[\da-zA-Z -]+$'
PS> "123-Abc test1","test- m e","abc slkh-hsds"," abc ", " -hsdj sdsd hjds- " -match $re
123-Abc test1
test- m e
abc slkh-hsds
Use this regex:
^(.+-.+)[\da-zA-Z]+[\da-zA-Z ]*[\da-zA-Z]+$