How to clean data in python for Special Characters through Regex?

Input:
   Name  Id     Salary  Desgn
0  Mike  B1230  3000    Engg
1  John  !##2   3000    !##&
2  Lucy  B1230  3000    %##B
3  ###&  ###&   ###&    ###&
4  snow  B1230  3000    Engg
5  Lily  #&-#   3000    Engg
Output:
   Name  Id     Salary  Desgn
0  Mike  B1230  3000    Engg
1  John  !##2   3000
2  Lucy  B1230  3000    %##B
3
4  snow  B1230  3000    Engg
5  Lily         3000    Engg
I am trying to clean data so that any cell consisting purely of special characters (no digits or letters) is replaced with a null value, using a regular expression.

You can use the pattern r'\B([!##%&-]+)\s', which should work for the examples you provided (except for %##B, which contains a letter, contrary to the description you gave).
In Python this would be:
import re
patt = r'\B([!##%&-]+)\s'
re.sub(patt, '', your_string)
If you are using pandas, you can use apply: df['new_column'] = df['string_column'].apply(lambda x: re.sub(patt, '', x))

Here is a solution:
(?<!\S)[^\w\s]+(?!\S)
Using lookarounds, this regex matches exactly the unwanted strings and nothing around them. This can help preserve the formatting of the text (e.g. by replacing each match with " ").
The problem with \W* or \W+ in this case is that they match all non-word characters, even those adjacent to word characters, so we need to be a little more specific.
EDIT: The negative lookarounds with \S above cannot simply be replaced with positive lookarounds using \s, because \s does not match at the start or end of the string; handling those positions would require a more complex regex.
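As a rough sketch of how the lookaround pattern behaves, here it is applied to a few rows from the question in Python. The `rows` list and the `PURE_SPECIAL` name are made up for illustration, and each row is treated as one space-separated string rather than a real DataFrame.

```python
import re

# A run of non-word, non-space characters that is bounded on both sides
# by whitespace or by the start/end of the string.
PURE_SPECIAL = re.compile(r'(?<!\S)[^\w\s]+(?!\S)')

# Hypothetical sample rows, copied from the question as plain strings.
rows = [
    "John !##2 3000 !##&",
    "Lucy B1230 3000 %##B",
    "###& ###& ###& ###&",
    "Lily #&-# 3000 Engg",
]

# Remove only the tokens made entirely of special characters.
cleaned = [PURE_SPECIAL.sub('', row) for row in rows]

for before, after in zip(rows, cleaned):
    print(repr(before), '->', repr(after))
```

Because only the token itself is matched, the surrounding spaces stay in place, so the column positions are preserved. Note that the underscore counts as a word character for \w, so a token such as ___ would not be removed by this pattern.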

You can use this regex to match any token that does not contain at least one letter or digit, and replace it with an empty string:
(?!\S*[a-zA-Z0-9]\S*)(?<!\S)\S+\s*
Here the negative lookahead (?!\S*[a-zA-Z0-9]\S*) rejects a token if it contains at least one letter or digit; (?<!\S) ensures the match does not start partway through a token that has a non-whitespace character just before it; \S+ matches the token itself; and \s* consumes the trailing space(s) after the removed token, as in your expected output.
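A minimal Python sketch of this pattern, assuming the rows are handled as plain strings (the function name `drop_special_tokens` is hypothetical):

```python
import re

# Matches a whole whitespace-delimited token that contains no ASCII letter
# or digit, plus any whitespace that follows it.
TOKEN_WITHOUT_ALNUM = re.compile(r'(?!\S*[a-zA-Z0-9]\S*)(?<!\S)\S+\s*')


def drop_special_tokens(text):
    """Remove tokens made only of special characters, with trailing spaces."""
    return TOKEN_WITHOUT_ALNUM.sub('', text)


for row in ("Lily #&-# 3000 Engg", "John !##2 3000 !##&", "###& ###& ###& ###&"):
    print(repr(row), '->', repr(drop_special_tokens(row)))
```

Unlike a pattern that leaves the spaces in place, this one collapses the gap left by a removed token, so column alignment is not preserved.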

Related

Ms Exchange Online | Regex in Transport Rule

I want to create a regex transport rule for emails with a subject like 1365 1049126 9003175245 19382_ST03.
I want to check whether the third group of characters in the subject starts with 900. The digits after 900 vary, and the group is a single word.
For example:
Subject 1 - 1365 1049126 9003175245 19382_ST03
Subject 2 - 2455 3245626 9003175000 19382_ST03
Subject 3 - 4567 4449126 9003005030 19382_ST03
How do I set this in regular expression? Any help will be appreciated.

You could use something like this: ^[0-9]{4}\s[0-9]{7}\s900[0-9]{7}\s\w{10}$
^ matches the start of the string
[0-9]{4}\s matches 4 digits and the whitespace after them
[0-9]{7}\s does the same for 7 digits
900[0-9]{7}\s matches 900, 7 more digits, and the whitespace after them
\w{10} matches 10 word characters
$ matches the end of the string
If these groups are not made up ONLY of digits, you can replace [0-9] with \w.
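One way to sanity-check the pattern's logic before putting it in a rule is to try it against the sample subjects in Python. This is only a sketch using Python's re module; the Exchange transport rule engine has its own regex dialect, so the supported syntax should be confirmed there. The last subject in the list is a made-up negative case.

```python
import re

SUBJECT = re.compile(r'^[0-9]{4}\s[0-9]{7}\s900[0-9]{7}\s\w{10}$')

subjects = [
    "1365 1049126 9003175245 19382_ST03",
    "2455 3245626 9003175000 19382_ST03",
    "4567 4449126 9003005030 19382_ST03",
    "1365 1049126 8003175245 19382_ST03",  # hypothetical: third group starts with 800
]

# True where the whole subject fits the expected layout.
results = [bool(SUBJECT.match(subject)) for subject in subjects]

for subject, ok in zip(subjects, results):
    print(subject, '->', ok)
```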

Match all type of numbers

I need regular expression which extracts all numbers with different delimiters (single whitespace, comma, dot). Each number can use none or all of them.
Example:
text: 'numbers: 3.14 2 544 345,345.55 506 test 120 100 100'
output: '3.14', '2 544', '345,345.55', '506', '120 100 100'
I created the regex \d+[(.|,|\s)\d+]+, but it does not work properly.

I assume the numbers you need to extract are separated with 2 or more whitespaces, else it would be impossible to differentiate between the end of the previous number and the start of a new one.
If you need to extract the numbers in the formats as shown above, XXX XXX.XXX or XXX,XXX,XXX.XX or XXX or XXX XXX XXX, you may use
\b\d{1,3}(?:[, ]\d{3})*(?:\.\d+)?\b
Details:
\b - leading word boundary
\d{1,3} - 1 to 3 digits
(?:[, ]\d{3})* - 0+ sequences of a comma or space ([, ]) and 3 digits (\d{3})
(?:\.\d+)? - an optional sequence of a dot followed with 1+ digits
\b - trailing word boundary
A less restrictive pattern would be the same as above, but with limiting quantifiers replaced with a +:
\b\d+(?:[, ]\d+)*(?:\.\d+)?\b
It will also match numbers like 1234566 and 124354354.343344.
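A short Python sketch of both patterns follows. It assumes, as stated above, that separate numbers are divided by two or more spaces; the sample text is the one from the question with the double spaces written out explicitly, and the constant names are made up for illustration.

```python
import re

# Strict form: groups of exactly three digits after the first 1-3 digits.
STRICT = re.compile(r'\b\d{1,3}(?:[, ]\d{3})*(?:\.\d+)?\b')
# Relaxed form: any number of digits per group.
RELAXED = re.compile(r'\b\d+(?:[, ]\d+)*(?:\.\d+)?\b')

# Assumption: numbers are separated by two spaces, digit groups by one.
text = 'numbers:  3.14  2 544  345,345.55  506  test  120 100 100'

strict_matches = STRICT.findall(text)
print(strict_matches)

# The relaxed pattern also accepts numbers without 3-digit grouping.
relaxed_matches = RELAXED.findall('1234566 and 124354354.343344')
print(relaxed_matches)
```

Since both patterns use only non-capturing groups, findall returns the full matched strings.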

Regex preg_match to neutralize a pricelist, keeping only digits, dots and commas*

I am using preg_match (PHP version 5.5.*) and want to ignore all alphabetic letters [a-zA-Z] and special symbols such as $ and -, only to match numbers, commas, dots. Whitespaces between numbers such as 6 000 should be matched. Commas after a number that is not followed by another number should be ignored, such as 6, would only match 6
Note that this is used in a single string and never in a list, like the sample below. I use the list to show what input and desired output is, "per line".
Sample input:
1
1,99
1.99
10
100
5999 dollars
2 USD
$2,99
Our price 2.99
Price: $ 20
200 $
20,-
6 999 USD
Desired output:
1
1,99
1.99
10
100
5999
2
2,99
2.99
20
200
20
6 999
I have tried /([0-9.,\s]+)/ but the output of 6 999 USD becomes 6.
Edit
The code we are using looks like this:
preg_match($regex, $value, $extractions);
array_shift($extractions);
$this->persist($extractions);

Update:
If you have &nbsp; instead of spaces, you can do one of two things. My recommendation is to just do a str_replace() first:
str_replace('&nbsp;', ' ', $number);
The other option is to also check for &nbsp; alongside the [\s,] group:
[\d.](?:[\d.]|(?:[\s,]|&nbsp;)(?=\d))*
Example:
preg_match('/[\d.](?:[\d.]|[\s,](?=\d))*/', $number, $matches);
$number = reset($matches);
Explanation:
I classified the valid characters (digits, spaces, commas, and periods) into two groups: [\d.] and [\s,]. A number must start with a digit or a period ($.99 == .99 != 99). Then we use a repeated non-capturing group (?:...)* to handle the alternation and the lookahead assertion. Any time there is a [\d.], we match it with no questions asked. Otherwise (|), if it is a [\s,], we assert that it is followed by a digit using a lookahead ((?=...)).
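For readers who want to try the extraction pattern outside PHP, here is a rough Python port run against the sample input from the question. The helper name `extract_number` is hypothetical, and re.search stands in for preg_match.

```python
import re

# Start with a digit or period, then keep taking digits/periods, or a
# space/comma only when a digit follows it.
NUMBER = re.compile(r'[\d.](?:[\d.]|[\s,](?=\d))*')


def extract_number(value):
    """Return the first number-like run in value, or None if there is none."""
    match = NUMBER.search(value)
    return match.group(0) if match else None


samples = ["1,99", "5999 dollars", "$2,99", "Our price 2.99",
           "Price: $ 20", "200 $", "20,-", "6 999 USD"]
extracted = [extract_number(s) for s in samples]

for raw, number in zip(samples, extracted):
    print(repr(raw), '->', repr(number))
```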
Example:
preg_replace('/\s*[^\d\s,.]+\s*|,(?!\d)/', '', $number);
Explanation:
[^\d\s,.]+ will match 1+ characters that are not either a digit, whitespace, a comma, or a period. We put \s* on either side to grab any extra whitespace around these unwanted characters (like in "Our price "). The only unwanted character this doesn't match is a trailing comma. We use an alternation (|), then look for a comma, and then make sure that it is not followed by a digit using a negative lookahead ((?!...)).
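The replacement approach can be sketched the same way in Python, with re.sub standing in for preg_replace (the helper name `strip_noise` is made up for illustration):

```python
import re

# Either a run of unwanted characters with its surrounding whitespace,
# or a comma that is not followed by a digit.
NOISE = re.compile(r'\s*[^\d\s,.]+\s*|,(?!\d)')


def strip_noise(value):
    """Delete everything that is not part of the number."""
    return NOISE.sub('', value)


samples = ["1,99", "5999 dollars", "$2,99", "Our price 2.99",
           "Price: $ 20", "200 $", "20,-", "6 999 USD"]
stripped = [strip_noise(s) for s in samples]

for raw, number in zip(samples, stripped):
    print(repr(raw), '->', repr(number))
```

The difference from the extraction approach is that this one keeps whatever is left over, so an input containing two separate numbers would have both of them in the result.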

Regular expression to extract the year from a string

This should be simple, but I can't seem to get it going. The purpose is to extract v3 tags from mp3 file names in Mp3tag.
I have these strings, from which I want to extract the year:
Test String 1 (1994) -> extract 1994
34 Test String 2 (1995) -> extract 1995
Test (String) 3 (1996) -> extract 1996
I had ^(.+)\s\(([0-9]*)\)$ but obviously it's not giving me the results I was expecting. You can tell that I'm not very good with regular expressions.
Thanks in advance

A suggestion for a more generic solution, not sure if that is what you need. Valid years will always have the form 19xx or 20xx, and the years will be separated with a word-break character (something other than a number or a letter):
\b(19|20)\d{2}\b
This doesn't really care where in the tag the year appears. A simpler version that doesn't assume anything more than 4 digits in the year would be this expression:
\b\d{4}\b
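The key here is the \b escape sequence, which matches the zero-width boundary between a word character and a non-word character (word characters are letters, digits and underscores). A parenthesis is a non-word character, so there is a boundary between it and an adjacent digit.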
Would also like to recommend this site:
http://www.regular-expressions.info/
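A small Python sketch of the word-boundary approach, using the file names from the question. One deliberate change, which is my own and not part of the answer: the group is written as non-capturing, (?:19|20), so that group(0) and findall return the whole year rather than just the century prefix.

```python
import re

# Four-digit year starting with 19 or 20, delimited by word boundaries.
YEAR = re.compile(r'\b(?:19|20)\d{2}\b')

names = [
    "Test String 1 (1994)",
    "34 Test String 2 (1995)",
    "Test (String) 3 (1996)",
]

years = [YEAR.search(name).group(0) for name in names]
print(years)

# \b is zero-width: a five-digit number has no boundary after its fourth
# digit, so the simpler four-digit pattern does not match inside it.
inside_longer_number = re.search(r'\b\d{4}\b', 'id 12345')
print(inside_longer_number)
```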

You can use something like this \((\d{4})\)$. The first group will have your match.
Explanation
\( # Match the character “(” literally
( # Match the regular expression below and capture its match into backreference number 1
\d # Match a single digit 0..9
{4} # Exactly 4 times
)
\) # Match the character “)” literally
$ # Assert position at the end of a line (at the end of the string or before a line break character)

You need to escape the parentheses. You can also require that the year has exactly 4 digits:
^(.+)\s\(([0-9]{4})\)$
The year is in matchgroup 2.

I'd go with
^(.*)\s\(([0-9]{4})\)$
(assuming all years have 4 digits, use [0-9]+ if you have an unknown number of digits, but at least one, or [0-9]* if there could be no digits)
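Since this pattern captures the text before the year as well, a minimal Python sketch can return both parts. The function name `split_title_year` is hypothetical, and returning None for names without a trailing year is just one possible choice.

```python
import re

# Group 1: everything before the final " (YYYY)"; group 2: the year.
TITLE_YEAR = re.compile(r'^(.*)\s\(([0-9]{4})\)$')


def split_title_year(name):
    """Return (title, year) or None when the name has no trailing year."""
    match = TITLE_YEAR.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_title_year("Test (String) 3 (1996)"))
print(split_title_year("34 Test String 2 (1995)"))
print(split_title_year("No year here"))
```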

You're almost there with your regular expression.
What you really need is:
\s\((\d{4})\)$
Where:
\s is some whitespace
\( is a literal '('
( is the start of the match group
\d is a digit
{4} means four of the previous atom (i.e. four digits)
) is the end of the match group
\) is a literal ')'
$ is the end of the string
For best results, put into a function:
>>> import re
>>> def get_year(name):
...     # Raw string avoids invalid-escape warnings; group(1) is the year.
...     return re.search(r'\s\((\d{4})\)$', name).group(1)
...
>>> for name in "Test String 1 (1994)", "34 Test String 2 (1995)", "Test (String) 3 (1996)":
...     print(get_year(name))
...
1994
1995
1996

Regex matching multiple inputs

I am trying to do a smart input field for UK style weight input, e.g. "6 stone and 3 lb" or "6 st 11 pound", capturing the 2 numbers in groups.
For now I got: ([0-9]{1,2}).*?([0-9]{1,2}).*
Problem is it matches "12 stone" in 2 groups, 1 and 2 instead of just 12. Is it possible to make a regex which captures correctly in both cases?

You need to make the first part possessive so it never gets backtracked into.
([0-9]{1,2}+).*?([0-9]{1,2})

Because . matches everything, including digits. Try this:
/(\d{1,2})\D+(\d{1,2})?/

Something like this?
\b(\d+)\b.*?\b(\d+)\b
Groups 1 and 2 will have your numbers in either case.
Explanation:
\b # Assert position at a word boundary
( # Match the regular expression below and capture its match into backreference number 1
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b # Assert position at a word boundary
. # Match any single character that is not a line break character
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\b # Assert position at a word boundary
( # Match the regular expression below and capture its match into backreference number 2
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b # Assert position at a word boundary

This works, then look at capture groups 1 and 3:
([0-9]{1,2})[^0-9]+(([0-9]{1,2})?.+)?
The idea is to make the first number and text mandatory, but make the second number and text optional.

Here is my suggestion for a regex to match both variants you showed:
(?<stone>\d+\s(?:stone|st))(?:\s(and)?\s?)(?<pound>\d+\s(?:pound|lb))
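To try this in Python, the named groups have to be spelled (?P<name>...), which is how Python's re module writes them. The sketch below is otherwise the same pattern; note that each named group captures the number together with its unit, and that an input with only a stone value does not match.

```python
import re

# Python port of the pattern above: (?<name>...) becomes (?P<name>...).
WEIGHT = re.compile(
    r'(?P<stone>\d+\s(?:stone|st))(?:\s(and)?\s?)(?P<pound>\d+\s(?:pound|lb))'
)

first = WEIGHT.search("6 stone and 3 lb")
second = WEIGHT.search("6 st 11 pound")
stone_only = WEIGHT.search("12 stone")

print(first.group('stone'), '|', first.group('pound'))
print(second.group('stone'), '|', second.group('pound'))
print(stone_only)
```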

It's a bit vague at the moment, this works:
/([0-9]{1,2})(?:[^0-9]+([0-9]{1,2}).*)?/
for this data:
6 stone and 3 lb
6 st 11 pound
12 stone
12 st and 11lbs
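A quick Python check of this pattern against the four inputs listed above (the variable names are made up for illustration; the pattern is used without the surrounding slashes, which are delimiters in other languages):

```python
import re

# First number is required; the second number and its surrounding text
# sit inside an optional non-capturing group.
WEIGHT = re.compile(r'([0-9]{1,2})(?:[^0-9]+([0-9]{1,2}).*)?')

inputs = [
    "6 stone and 3 lb",
    "6 st 11 pound",
    "12 stone",
    "12 st and 11lbs",
]

# groups() yields (stone, pounds); pounds is None when it is absent.
parsed = [WEIGHT.match(text).groups() for text in inputs]

for text, groups in zip(inputs, parsed):
    print(repr(text), '->', groups)
```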

Seeing as everyone is having a go, here's mine:
(\d+)(?:\D+(\d+))?
It's definitely the most concise so far. This will match one or two groups of digits anywhere:
"12": ("12", null)
"12st": ("12", null)
"12 st": ("12", null)
"12st 34 lb": ("12", "34")
"cabbage 12st 34 lb": ("12", "34")
"12 potato 34 moo": ("12", "34")
The next step would be making it catch the name of the units that were used.
Edit: as pointed out above, we don't know what language you're using, and not all regex functionality is available in all implementations. However, as far as I know, \d for digits and \D for non-digits are fairly universal.
