How to find and replace space between digits in a string column? - regex

I need to find and remove any space between digits in a long string using a regular expression.
I tried a pattern such as [0-9][\s][0-9] with regexp_replace, e.g. .withColumn('free_text', regexp_replace('free_text', '[0-9][\s][0-9]', '')).
However, the regex matches the whole of 1(space)4, whereas I would like to match only the (space).
Here is an example:
What I have:
"Hello. I am Marie. My number is 768 990"
What I would like to have:
"Hello. I am Marie. My number is 768990"
Thanks,

Here is one way to do this, using capture groups:
.withColumn('free_text', regexp_replace('free_text', '([0-9])\s([0-9])', '$1$2'))
The idea here is to match and capture the two digits on either side of a whitespace character. We then replace the whole match with just the two captured digits ($1 and $2), which leaves them adjacent.
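To try the capture-group idea outside Spark, here is a minimal sketch using Python's re module. Python writes the backreferences as \1\2 where Spark's regexp_replace uses the Java-style $1$2, and the function name join_digit_pairs is only illustrative.

```python
import re

def join_digit_pairs(text: str) -> str:
    # Capture the digit on each side of a single whitespace character,
    # then write the two captured digits back without the whitespace.
    return re.sub(r"([0-9])\s([0-9])", r"\1\2", text)

print(join_digit_pairs("Hello. I am Marie. My number is 768 990"))
```

One caveat of this form: each match consumes both digits, so in a run of single digits such as "1 2 3" only the first gap is removed ("12 3").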

Your pattern matches a digit, a whitespace character and another digit, so all three are replaced. Note that \s also matches a newline.
If supported, you could use lookarounds instead of matching the digits:
(?<=[0-9])\s(?=[0-9])
.withColumn('free_text', regexp_replace('free_text', '(?<=[0-9])\s(?=[0-9])', ''))
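The lookaround version can be checked the same way in plain Python, as a rough sketch. The names DIGIT_GAP and remove_digit_gaps are made up for this example.

```python
import re

# The lookbehind and lookahead assert that a digit sits on each side,
# but they do not consume it, so only the whitespace itself is matched.
DIGIT_GAP = re.compile(r"(?<=[0-9])\s(?=[0-9])")

def remove_digit_gaps(text: str) -> str:
    return DIGIT_GAP.sub("", text)

print(remove_digit_gaps("Hello. I am Marie. My number is 768 990"))
print(remove_digit_gaps("1 2 3"))
```

Because the digits are not part of the match, the replacement string can simply be empty, and every gap in a run like "1 2 3" is removed.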

Related

Regex for replacing anything other than characters, more than one spaces and number only in end with empty char

I want to replace everything other than letters, spaces, and digits at the end of the string with an empty string. In other words, any digits or extra spaces that come at the start or in the middle of the string should be removed.
Example
**Input** | **Output**
Ndd12 | Ndd12
12Ndd12 | Ndd12
Ndd 12 | Ndd 12
Nav G45up | Nav Gup
Attempted Code
regexp_replace(df1[col_name], "(^[A-Za-z]+[0-9 ])", "")
You may use:
\d+(?!\d*$)|[^\w\n]+(?!([A-Z]|$))
Explanation:
\d+(?!\d*$): Match 1+ digits that are not followed by 0+ digits and end of line
|: OR
[^\w\n]+(?!([A-Z]|$)): Match 1+ characters that are neither word characters nor newlines, and that are not followed by an uppercase letter or the end of the line
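As a quick check of the pattern explained above, here is a small Python sketch that applies it to three of the question's samples. The names PATTERN and strip_inner_digits are illustrative only, and Python's re is assumed to behave like the engine used in the answer for this pattern.

```python
import re

# First alternative: digits that are not the trailing digits of the string.
# Second alternative: non-word characters not followed by an uppercase letter or the end.
PATTERN = re.compile(r"\d+(?!\d*$)|[^\w\n]+(?!([A-Z]|$))")

def strip_inner_digits(text: str) -> str:
    return PATTERN.sub("", text)

for sample in ("Ndd12", "12Ndd12", "Nav G45up"):
    print(repr(sample), "->", repr(strip_inner_digits(sample)))
```

Trailing digits survive because the negative lookahead rejects any digit run that reaches the end of the string.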
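If you use Python, you can do this with regular expressions through the re module.
import re
s = "Nav G45up"  # example input taken from the question
new_string = re.sub(r"[^a-zA-Z0-9]", "", s)
Here the leading ^ inside the square brackets negates the character class, so the pattern matches every character that is not a letter or a digit.
Regular expressions are available in most other languages as well, so the same pattern can be reused there.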
I came up with this regex to capture all characters that you want to remove from the string.
^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+
Use it like this:
regexp_replace(df1[col_name], "^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+", "")
Explanation:
^\d+ - matches a run of digits at the start of the string.
(?<=\w)\d+(?![\d\s]) - matches a run of digits in the middle that is preceded by a word character (positive lookbehind) and not followed by a digit or a whitespace character (negative lookahead). (Matches the digits in G45up.)
(?<=\s)\s+ - matches one or more whitespace characters that are preceded by a whitespace character (positive lookbehind), i.e. all the additional spaces.
Note: This regex could be inefficient when matching large strings, as it uses expensive look-arounds.
^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+|(?<=\w)\W|\W(?=\w)|(?<!\w)\W|\W(?!\w)

How to match digits and dots. It has to start with digits first

For the example hello.1.2.3.4.world I want a match that gives me 1.2.3.4. The number of digits between the dots doesn't matter, as long as it follows the digit.digit pattern.
My partial solution was the following regular expression: [\d.]+.[^.a-z], which gives me .1.2.3.4 as the result. I then strip the first dot using trim or a similar method.
Can any regexp master tell me how to get rid of the first dot with a single regular expression?
How about this: \.(\d(?:\.\d)*)\.\D
EDIT:
(\d+(?:\.\d+)*)
If you want to keep your current regex, you can put a lookahead at the start and escape the literal dot where it is not inside a character class: (?=\d)[\d.]+\.[^.a-z]
The lookahead (?=\d) will make sure the first character matched is a digit.
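Both suggestions can be compared on the question's sample string with a short Python sketch. The variable names are illustrative, and Python's re is assumed here, since the question does not name an engine.

```python
import re

text = "hello.1.2.3.4.world"

# One or more digits, followed by any number of ".digits" groups.
dotted = re.search(r"\d+(?:\.\d+)*", text)

# The question's own pattern with the dot escaped and a lookahead
# that forces the match to begin on a digit.
with_lookahead = re.search(r"(?=\d)[\d.]+\.[^.a-z]", text)

print(dotted.group())          # 1.2.3.4
print(with_lookahead.group())  # 1.2.3.4
```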

Regex to capture Letters & Spaces

I have a spec that says a particular field will be alpha-text, right-padded with spaces to be 10 characters long, and I want to capture the alpha-part of the match.
This expression captures the entire section:
"([[:alpha:][:s:]]{10})"
However, I only want to capture the alpha-part, and still match (but not capture) the remaining white-space. So if the alpha-part is 3 characters long, the rest of the match needs to be 7 white-space characters.
How can I do this?
I would say your best bet is to use 2 regular expressions. Regex doesn't really have support for what you're trying to do.
The first regular expression would get all strings of length 10 that are right-padded with spaces
([a-zA-Z\s]{10})
After that, just capture the word part. We know each string is only 10 characters at this point.
(\w+)\s*
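The two-step idea above could look roughly like this in Python. The names FIELD, WORD and alpha_part are made up for this sketch, and using fullmatch to anchor both patterns to the whole field is an assumption that the answer does not spell out.

```python
import re

FIELD = re.compile(r"[a-zA-Z\s]{10}")  # step 1: exactly ten letters or whitespace characters
WORD = re.compile(r"(\w+)\s*")         # step 2: the leading word, then only padding

def alpha_part(field: str):
    # Reject anything that is not a 10-character letters-and-whitespace field.
    if FIELD.fullmatch(field) is None:
        return None
    # Capture the word; the trailing whitespace is matched but not captured.
    m = WORD.fullmatch(field)
    return m.group(1) if m else None

print(repr(alpha_part("abc".ljust(10))))
```

A field in which letters reappear after the padding fails the second step and yields None.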
This regex pattern will match a string that starts with (optional) [A-Za-z] characters and ends with up to 10 spaces; the intended total string length is 10.
"^([A-Za-z]+)?\\ {0,10}"
Then, I added a positive lookahead to ensure the pattern only matches when the string length is 10.
"^(?=.{10}$)([A-Za-z]+)?\\ {0,10}$"
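Edit: Try this using [:alpha:] and [:space:]. Note that POSIX classes are only recognised inside a bracket expression, hence the double brackets:
"^(?=.{10}$)([[:alpha:]]+)?[[:space:]]{0,10}$"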

Regex for this particular pattern

I have three different things
xxx
xxx>xxx
xxx>xxx>xxx
Where xxx can be any combination of letters and numbers
I need a regex that can match the first two but NOT the third.
To match ASCII letters and digits try the following:
^[a-zA-Z0-9]{3}(>[a-zA-Z0-9]{3})?$
If letters and digits outside of the ASCII character set are required then the following should suffice:
^[^\W_]{3}(>[^\W_]{3})?$
^\w+(?:>\w+)?$
matches an entire string.
\w+(?:>\w+)?\b(?!>)
matches strings like this in a larger substring.
If you want to exclude the underscore from matching, you can use [\p{L}\p{N}] instead (if your regex engine knows Unicode), or [^\W_] if it doesn't, as a substitute for \w.
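To see the anchored pattern on the three cases from the question, here is a small Python sketch using the [^\W_] substitute for \w. The names ONE_OR_TWO and is_one_or_two_parts are illustrative.

```python
import re

# [^\W_] is the \w set with the underscore removed: letters and digits only.
# The optional group allows at most one ">part" after the first part.
ONE_OR_TWO = re.compile(r"^[^\W_]+(?:>[^\W_]+)?$")

def is_one_or_two_parts(value: str) -> bool:
    return ONE_OR_TWO.match(value) is not None

for value in ("abc", "abc>def", "abc>def>ghi"):
    print(value, is_one_or_two_parts(value))
```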

regex string end with .log and contains chars numbers and -

Can someone tell me the regex pattern to match everything that ends with .log and contains chars, numbers and -
for example:
"syslog-12-10-2011.log"
You can try:
^[a-z0-9-]+\.log$
The regexp you're looking for is
^[A-Za-z0-9-]*\.log$
Note that the dot requires escaping, and the dash must be the first or last character inside the square brackets (or be escaped, where the engine allows it); otherwise it denotes a character range.
Note that this matches filename '.log'. Replace the star with a plus to have it match filenames with at least one character before the dot in '.log'.
This is a regex that you can use:
^[a-zA-Z0-9\-]+\.log$
With a case insensitive regular expression:
^[A-Z]+-([0-9]{2}-){2}[0-9]{4}\.log$
It's a bit more precise than what you asked (it matches text-nn-nn-nnnn.log, where n is a digit). If you are using POSIX basic regular expressions (as in grep without -E, for instance), you will have to escape the parentheses and braces. The + is not a repetition operator there, so write it as \{1,\}:
[A-Z]\{1,\}-\([0-9]\{2\}-\)\{2\}[0-9]\{4\}\.log$
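For a side-by-side look at the loose and the stricter patterns from the answers above, here is a minimal Python sketch. LOOSE and STRICT are names chosen for this example, and re.IGNORECASE stands in for the "case insensitive" option mentioned above.

```python
import re

# Any non-empty mix of letters, digits and dashes, then a literal ".log".
LOOSE = re.compile(r"^[A-Za-z0-9-]+\.log$")

# Letters, then nn-nn-nnnn, then ".log"; matched case-insensitively.
STRICT = re.compile(r"^[A-Z]+-([0-9]{2}-){2}[0-9]{4}\.log$", re.IGNORECASE)

for name in ("syslog-12-10-2011.log", "syslog-2011.log", ".log"):
    print(name, bool(LOOSE.match(name)), bool(STRICT.match(name)))
```

With + instead of *, the loose pattern rejects the bare filename ".log", as noted above.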