Detecting Special Characters with Regular Expression in python? - regex

df
Name
0 ##
1 R##
2 ghj##
3 Ray
4 *#+
5 Jack
6 Sara123#
7 ( 1234. )
8 Benjamin k 123
9 _
10 _!##_
11 _#_&#+-
12 56##!
Output:
Bad_Name
0 ##
1 *#+
2 _
3 _!##_
4 _#_&#+-
I need to detect strings made up only of special characters using a regular expression. If a string contains any letter or digit it is valid; otherwise it should be treated as a bad string.
I was using the regex '^\W*$', and everything worked except when the string contains '_' (underscore): such strings are not treated as bad strings.

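Use pandas.Series.str.contains to drop every row that contains a letter or digit (case=False makes the match case-insensitive):
df[~df['Name'].str.contains('[a-z0-9]', case=False)]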
Output:
Name
0 ##
4 *#+
9 _
10 _!##_
11 _#_&#+-
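The underscore problem in the question comes from \w counting '_' as a word character, so \W never matches it. As a minimal sketch of a regex-only alternative (plain Python rather than pandas, with the hypothetical helper name is_bad), the underscore can be added to the class explicitly:

```python
import re

names = ["##", "R##", "ghj##", "Ray", "*#+", "Jack", "Sara123#",
         "( 1234. )", "Benjamin k 123", "_", "_!##_", "_#_&#+-", "56##!"]

# \W does not match the underscore, so it is listed explicitly.
only_special = re.compile(r"[\W_]*")

def is_bad(name):
    """Return True when the whole string has no letter or digit."""
    return only_special.fullmatch(name) is not None

bad_names = [n for n in names if is_bad(n)]
print(bad_names)
```

Note that an empty string also counts as bad under this pattern; use + instead of * if that is not wanted.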

Related

Remove spaces and punctuations from Chinese string column in Python [duplicate]

In order to drop duplicates from the following dataframe by the news column, I am trying to remove all spaces and punctuation from this column.
date news
0 2017-08 北京写字楼租金哪家高? 金融街、CBD、亚奥居TOP3
1 2017-08 租金一直涨,到底是谁租走了北京最贵的写字楼(附名单)
2 2017-09 北京三季度写字楼租金继续保持平稳
3 2017-09 戴德梁行:第三季度北京写字楼市场租金保持平稳
4 2018-01 北京豪华公寓销量大涨76.5% 金融街写字楼租金创35季度新高
5 2010-11 楼市下行,高租金的商住和写字楼能不能投?
I have tried the following solutions:
df.news = df.news.apply(lambda x: re.sub(r'[^\w\s]', '', x)).replace(' ', '')
df.news = df.news.str.replace('[^\w\s]', '').str.strip()
Both generate an output with space inside the strings:
0 北京写字楼租金哪家高 金融街CBD亚奥居TOP3 ---> space in the phrase
1 租金一直涨到底是谁租走了北京最贵的写字楼附名单
2 北京三季度写字楼租金继续保持平稳
3 戴德梁行第三季度北京写字楼市场租金保持平稳
4 北京豪华公寓销量大涨765 金融街写字楼租金创35季度新高 ---> space in the phrase
5 楼市下行高租金的商住和写字楼能不能投
The following code removes the second part of the news phrases:
df.news = df.news.str.extract('(\w+)', expand = False)
0 北京写字楼租金哪家高
1 租金一直涨
2 北京三季度写字楼租金继续保持平稳
3 戴德梁行
4 北京豪华公寓销量大涨76
5 楼市下行
How can I get the expected result as follows for news column? Thank you.
0 北京写字楼租金哪家高金融街CBD亚奥居TOP3
1 租金一直涨到底是谁租走了北京最贵的写字楼附名单
2 北京三季度写字楼租金继续保持平稳
3 戴德梁行第三季度北京写字楼市场租金保持平稳
4 北京豪华公寓销量大涨765金融街写字楼租金创35季度新高
5 楼市下行高租金的商住和写字楼能不能投
This seems to work:
df.news.apply(lambda x: re.sub(r'[^\w\s]', '', x)).str.replace(' ', '')
Output:
0 北京写字楼租金哪家高金融街CBD亚奥居TOP3
1 租金一直涨到底是谁租走了北京最贵的写字楼附名单
2 北京三季度写字楼租金继续保持平稳
3 戴德梁行第三季度北京写字楼市场租金保持平稳
4 北京豪华公寓销量大涨765金融街写字楼租金创35季度新高
5 楼市下行高租金的商住和写字楼能不能投
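The two steps above (strip punctuation, then strip spaces) can also be folded into one substitution, since \W already covers whitespace as well as ASCII and full-width punctuation. A minimal sketch in plain Python, with strip_non_word as a hypothetical helper name:

```python
import re

def strip_non_word(text):
    # \W matches anything that is not a letter, digit or underscore,
    # so spaces and punctuation are removed in a single pass.
    return re.sub(r"\W+", "", text)

headlines = [
    "北京写字楼租金哪家高? 金融街、CBD、亚奥居TOP3",
    "北京豪华公寓销量大涨76.5% 金融街写字楼租金创35季度新高",
]
cleaned = [strip_non_word(h) for h in headlines]
print(cleaned)
```

On a pandas column the same pattern should work through df.news.str.replace(r'\W+', '', regex=True). Underscores are kept, because \w includes them.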

Removing special characters while retaining alpha numeric words

I'm in the middle of cleaning a data set that has this:
[IN]
my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
my_Series.str.replace("[^a-zA-Z]+", " ")
[OUT]
0
1 ASD
2 AUG M G
3 Air G G
4 Karsh
[IDEAL OUT]
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
My goal is to remove special characters and numbers, but if a word contains both letters and digits, it should stay. Can anyone help?
Try apply to achieve your ideal output:
>>> my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
>>> my_Series.apply(lambda x: " ".join(['' if word.isdigit() else word for word in x.replace('-', ' ').split()]))
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
I replace each - with a space and split the string on whitespace, then check whether each word consists only of digits.
If it does, the word is replaced with an empty string; otherwise the word is kept.
Finally, the list is joined back into a single string.
Edit 1:
Regex solution:
>>> my_Series.str.replace(r"((\d+)(?=.*\d))|([^a-zA-Z0-9 ])", " ", regex=True)
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
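Explanation:
The pattern uses a lookahead.
((\d+)(?=.*\d))|([^a-zA-Z0-9 ])
(a run of digits that is followed, anywhere later in the string, by another digit) OR (any character that is not a letter, a digit or a space)
Each match is replaced with a space. Because the final digit in a string is never followed by another digit, it is always kept. This fits the sample data, but it depends on where digits fall in the string: in "711 ABC", for instance, the number is not fully removed.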
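A token-based variant avoids depending on where the digits sit in the string. The sketch below is plain Python (not the answerer's code, and clean is a hypothetical name): it splits on every non-alphanumeric run, then drops empty and purely numeric tokens.

```python
import re

def clean(text):
    # Split on any run of characters that are not ASCII letters or digits.
    tokens = re.split(r"[^a-zA-Z0-9]+", text)
    # Keep tokens that are non-empty and not made up only of digits.
    kept = [t for t in tokens if t and not t.isdigit()]
    return " ".join(kept)

samples = ["-", "ASD", "711-AUG-M4G", "Air G2G", "Karsh"]
results = [clean(s) for s in samples]
print(results)
```

With pandas this could be applied as my_Series.apply(clean).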

regex - capture space delimited word

I have a string:
2001 970451 4 l 97 0451 iver b y c 3 0 1 8 4 1 4 hundred 2001 970451 nama 4 l 97 0451 iver hundred blah
I need an appropriate regular expression to capture series of characters and spaces such as b y c 3 0 1 8 4 1 4?
I have tried:
(\b[a-z0-9]{1}\s{1})+ - I get l
EDIT:
To further explain what I need, I need to be able to capture similar series of text where a single alphanum character is continuously/repeatedly followed by single space character to a point where this is no longer true.
Is regexp a hard requirement?
It would be far simpler for you, in the long term, to just use something like strings.Fields and filter the resulting array by length (you can apply any other requirements too).
Example:
(Give it a try on the playground! https://play.golang.org/p/Ue2wO5d-Te)
package main

import (
	"fmt"
	"strings"
)

// CaptureGroups returns each run of consecutive single-character fields.
func CaptureGroups(input string) (output [][]string) {
	fields := strings.Fields(input)
	var group []string
	for _, field := range fields {
		if len(field) == 1 {
			group = append(group, field)
		} else if len(group) > 0 {
			// A longer field ends the current run.
			output = append(output, group)
			group = nil
		}
	}
	// Flush a run that reaches the end of the input.
	if len(group) > 0 {
		output = append(output, group)
	}
	return
}

func main() {
	input := "2001 970451 4 l 97 0451 iver b y c 3 0 1 8 4 1 4 hundred 2001 970451 nama 4 l 97 0451 iver hundred blah"
	output := CaptureGroups(input)
	fmt.Printf("Groups: %q", output)
}
I think this will work: (( [^ ])+ )
Your string will be in capture group 1.
\040 matches exactly the space character.
So to match something like b y c 3 0 1 8 (three letters followed by four digits, each separated by a single space) you need
[a-z]\040[a-z]\040[a-z]\040[0-9]\040[0-9]\040[0-9]\040[0-9]
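For comparison, here is a minimal sketch of a single pattern that handles runs of any length. It is written in Python rather than Go, and the pattern is an assumption of mine, not taken from the answers: it matches one alphanumeric character followed by one or more "space plus single character" pairs, with lookarounds so that the run starts and ends on whitespace or at the edge of the string.

```python
import re

# Single characters separated by single spaces, bounded by whitespace
# (or the start/end of the string) on both sides.
RUN = re.compile(r"(?<!\S)[a-z0-9](?: [a-z0-9])+(?!\S)")

text = ("2001 970451 4 l 97 0451 iver b y c 3 0 1 8 4 1 4 hundred "
        "2001 970451 nama 4 l 97 0451 iver hundred blah")
runs = RUN.findall(text)
print(runs)
```

An isolated single character is not reported, because the pattern requires at least two characters in a run. Go's regexp package (RE2) has no lookarounds, so this exact pattern would need adapting there.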

Regular Expression: Find repeated patterns

Having this string s=";123;;123;;456;;124;;123;;567;" in R, which shows some Ids separated by ";", I want to find the repeated IDs, so in this case ";123;" is repeated. I used the following command in R:
gregexpr("(;[1-9]+;).*\1", s)
but it doesn't find the repeated patterns. Any idea what is wrong?
One example of a long string:
1760381;;1774536;;1774614;;1774617;;1774705;;1774723;;1775013;;1902321;;1928678;;2105486;;2105514;;2105544;;2105575;;2105585;;2279115;;2379236;;290927;;542280;;555749;;641540;;683822;;694934;;713228;;713248;;713249;;726949;;727204;;731434;;754522;;7693856;;100095;;1003838;;1045582;;1079057;;1108697;;1231229;;124087;;1249672;;1328126;;1412065;;1419930;;1441743;;1470580;;1476585;;1502106;;1556149;;1637775;;1643922;;1655644;;1755547;;1759001;;1760295;;1760296;;1760320;;1760326;;1760338;;1760348;;1760349;;1760350;;1760353;;1760375;;1760376;;1760377;;1760378;;1760388;;1760401;;1760402;;1760403;;1760410;;1760421;;1760425;;1760426;;1760642;;1760654;;1770463;;1774365;;1774366;;1774394;;1774449;;1774453;;1774454;;1774455;;1774456;;1774457;;1774458;;1774461;;1774462;;1774463;;1774464;;1774466;;1774469;;1774504;;1774505;;1774506;;1774519;;1774520;;1774525;;1774527;;1774529;;1774532;;1774533;;1774539;;1774542;;1774593;;1774595;;1774604;;1774610;;1774616;;1774617;;1774641;;1774660;;1774671;;1774674;;1774684;;1774687;;1774694;;1774704;;1774706;;1774713;;1774717;;1774722;;1774723;;1774726;;1774733;;1774745;;1774750;;1774753;;1774754;;1774766;;1774784;;1774786;;1774795;;1774799;;1774800;;1774803;;1774809;;1774813;;1774835;;1774849;;1774852;;1774853;;1774854;;1774857;;1774858;;1774861;;1774862;;1774867;;1774868;;1774869;;1774870;;1774877;;1774878;;1774880;;1774884;;1774885;;1774886;;1774902;;1774905;;1774934;;1774935;;1774937;;1774939;;1774946;;1774949;;1774950;;1774958;;1774959;;1774960;;1774961;;1774962;;1774964;;1774965;;1774966;;1774967;;1774969;;1774971;;1774972;;1774973;;1774975;;1774977;;1774978;;1774999;;1775000;;1775003;;1775005;;1775006;;1775009;;1775013;;1775014;;1775017;;1775024;;1775026;;1775033;;1775038;;1775040;;1775041;;1775044;;1775087;;1785544;;1811645;;1837210;;1864356;;1928674;;1928678;;1932882;;1954203;;2066856;;2076876;;2105349;;2105351;;2105458;;2105464;;2105476;;2105480;;2105482;;2105484;;2105489;;2105496;;2105500;;2105510;;2105514;;2105518;;2105532;;2105545;
;2105550;;2172257;;2172762;;218438;;2228198;;2229827;;2247909;;2262250;;2263135;;2287260;;2335872;;2335873;;2335874;;2335877;;2338682;;2352560;;2420902;;263946;;265370;;303060;;330571;;338764;;387492;;387750;;388362;;431807;;436056;;436442;;444058;;458026;;491696;;504783;;513098;;529228;;539799;;549649;;559957;;562574;;563116;;576418;;582851;;592273;;599952;;614463;;626416;;645122;;652363;;665854;;668048;;682877;;683822;;688317;;709795;;710684;;723114;;724447;;724526;;725177;;731389;;731434;;876958;;879962;;947924;;987322;;987446;;61326;;1025952;;1095970;;1338018;;1349990;;1373122;;1419930;;1760310;;1760320;;1774705;;1774706;;1774708;;1774712;;1774952;;1774954;;1774963;;1774972;;1774977;;1775077;;1901075;;2022080;;2117779;;2143723;;441554;;450517;;549649;;1010402;;113311;;1148258;;1374348;;1419930;;1606449;;1606515;;1606608;;1606610;;1760320;;1760338;;1760618;;1760642;;1774504;;1774520;;1774595;;1774705;;1774909;;1774977;;1775011;;1775043;;179542;;1928678;;2105598;;2105721;;2188303;;2335873;;340762;;387759;;436442;;504783;;588336;;646185;;682877;;715644;;725080;;741661;;760924
m<-gregexpr("[0-9]+",s)
n<-regmatches(s,m)
[[1]]
[1] "123" "123" "456" "124" "123" "567"
data.frame(table(unlist(n)))
Var1 Freq
1 123 3
2 124 1
3 456 1
4 567 1
The code works for your long string too. Here are the head and tail of the output:
head(data.frame(table(unlist(n))),10)
Var1 Freq
1 100095 1
2 1003838 1
3 1010402 1
4 1025952 1
5 1045582 1
6 1079057 1
7 1095970 1
8 1108697 1
9 113311 1
10 1148258 1
tail(data.frame(table(unlist(n))),10)
Var1 Freq
316 731434 2
317 741661 1
318 754522 1
319 760924 1
320 7693856 1
321 876958 1
322 879962 1
323 947924 1
324 987322 1
325 987446 1
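1) In the first example the ids are all the same length, so we assume that is a general feature (the lookahead below matches the bare id, so a shorter id could otherwise match inside a longer one). Try this pattern, where (?=...) is a zero-width lookahead expression (see ?regex). The backslash is doubled because inside an R string literal a single \1 is read as an octal escape rather than a backreference, and [0-9] is used instead of [1-9] so that ids containing a 0 are matched too:
pat <- ";([0-9]+);(?=.*\\1)"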
gregexpr(pat, s, perl = TRUE)
or this:
library(gsubfn)
strapply(s, pat, perl = TRUE)[[1]]
## [1] "123" "123"
This lists each id one time fewer than it occurs in s (zero times for ids that are not duplicated), so to list each duplicated id once use unique(st), where st is the result of the last line of code above.
Note: In the second example in the question, i.e. the long string, there is no ; at the end of the string so the last id can never be matched by the expression unless we first paste a ; onto the end.
2) Instead of matching the contents we could match the delimiters:
st <- Filter(nzchar, strsplit(s, ";", fixed = TRUE)[[1]])
Here Filter(nzchar, ...) drops the empty strings left between adjacent delimiters, so st is just a vector of all the ids. unique(st[duplicated(st)]) then lists each duplicated id once and involves no regular expressions.
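The split-and-count idea does not depend on R. A minimal Python sketch of the same approach (repeated_ids is a hypothetical name) splits on the delimiter, drops the empty pieces, and keeps the ids seen more than once:

```python
from collections import Counter

def repeated_ids(s):
    # Adjacent ";;" delimiters leave empty pieces, which are skipped.
    ids = [part for part in s.split(";") if part]
    counts = Counter(ids)
    # Counter keeps first-seen order, so the result is ordered by first occurrence.
    return [i for i, n in counts.items() if n > 1]

s = ";123;;123;;456;;124;;123;;567;"
dupes = repeated_ids(s)
print(dupes)
```

Because the whole id is compared, ids of different lengths are handled without any extra assumption.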

R + reshape: using colsplit w/regex

I am trying to use colsplit to break up a vector in a dataframe. The fact that we have regular expression as an arg to colsplit makes me think it can be flexible, but I am having trouble (it might just be that I'm not understanding regex in R).
Here's the problem:
let's create a vector...
> library(reshape)
> my_var_1 <- factor(c("x00_aaa_123","x00_bbb_123","x00_ccc_123","x01_aaa_123","x01_bbb_123","x01_ccc_123","x02_aaa_123","x02_bbb_123","x02_ccc_123"))
I would like to split it into two columns upon the first underscore.
In other words, I want my end result to be this...
x whatever
1 x00 aaa_123
2 x00 bbb_123
3 x00 ccc_123
4 x01 aaa_123
5 x01 bbb_123
6 x01 ccc_123
7 x02 aaa_123
8 x02 bbb_123
9 x02 ccc_123
I am trying to find the right regex inside of colsplit that will do it, but no luck. Here's the closest I can get...
> colsplit(my_var_1, split="_", c("x","whatever"))
x whatever NA.
1 x00 aaa 123
2 x00 bbb 123
3 x00 ccc 123
4 x01 aaa 123
5 x01 bbb 123
6 x01 ccc 123
7 x02 aaa 123
8 x02 bbb 123
9 x02 ccc 123
That uses the split regex as a simple delimiter and it gives me three columns. I would like to not split the second underscore (to make it worse, in my real data I have an arbitrary number of underscores not just two).
Is there an expression I can use for "split" that will give what I want?
I had hoped that the regex in colsplit would allow me to match on groups and the group matches would be the content of splits but that does not appear to be the case.
* Edit (thanks to @Joshuaulrich): colsplit works "as intended" when using the newer reshape2!
Your code throws an error for me:
> colsplit(my_var_1, split="_", c("x","whatever"))
Error in colsplit(my_var_1, split = "_", c("x", "whatever")) :
unused argument(s) (split = "_")
split isn't an argument to colsplit. The argument you want is pattern, or you can just rely on positional matching:
> colsplit(my_var_1, "_", c("x","whatever"))
x whatever
1 x00 aaa_123
2 x00 bbb_123
3 x00 ccc_123
4 x01 aaa_123
5 x01 bbb_123
6 x01 ccc_123
7 x02 aaa_123
8 x02 bbb_123
9 x02 ccc_123
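The behaviour wanted here, splitting only at the first delimiter however many follow, can be illustrated outside R as well. A minimal Python sketch with made-up sample values (the third one is hypothetical, to show extra underscores):

```python
values = ["x00_aaa_123", "x01_bbb_123", "x02_ccc_1_2_3"]

# maxsplit=1 splits at the first underscore only, so everything
# after it stays together in the second column.
rows = [v.split("_", 1) for v in values]
for x, whatever in rows:
    print(x, whatever)
```

The second argument caps the number of splits, which is the same idea as asking for a fixed number of output columns.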