Remove spaces and punctuations from Chinese string column in Python [duplicate]

This question already has answers here:
Stripping everything but alphanumeric chars from a string in Python
(16 answers)
Closed 3 years ago.
To drop duplicates from the following dataframe by the news column, I am trying to remove all spaces and punctuation from that column.
date news
0 2017-08 北京写字楼租金哪家高? 金融街、CBD、亚奥居TOP3
1 2017-08 租金一直涨,到底是谁租走了北京最贵的写字楼(附名单)
2 2017-09 北京三季度写字楼租金继续保持平稳
3 2017-09 戴德梁行:第三季度北京写字楼市场租金保持平稳
4 2018-01 北京豪华公寓销量大涨76.5% 金融街写字楼租金创35季度新高
5 2010-11 楼市下行,高租金的商住和写字楼能不能投?
I have tried the following solutions:
df.news = df.news.apply(lambda x: re.sub(r'[^\w\s]', '', x)).replace(' ', '')
df.news = df.news.str.replace('[^\w\s]', '').str.strip()
Both produce output that still has spaces inside the strings:
0 北京写字楼租金哪家高 金融街CBD亚奥居TOP3 ---> space in the phrase
1 租金一直涨到底是谁租走了北京最贵的写字楼附名单
2 北京三季度写字楼租金继续保持平稳
3 戴德梁行第三季度北京写字楼市场租金保持平稳
4 北京豪华公寓销量大涨765 金融街写字楼租金创35季度新高 ---> space in the phrase
5 楼市下行高租金的商住和写字楼能不能投
The following code drops the second part of each news phrase:
df.news = df.news.str.extract('(\w+)', expand = False)
0 北京写字楼租金哪家高
1 租金一直涨
2 北京三季度写字楼租金继续保持平稳
3 戴德梁行
4 北京豪华公寓销量大涨76
5 楼市下行
How can I get the following expected result for the news column? Thank you.
0 北京写字楼租金哪家高金融街CBD亚奥居TOP3
1 租金一直涨到底是谁租走了北京最贵的写字楼附名单
2 北京三季度写字楼租金继续保持平稳
3 戴德梁行第三季度北京写字楼市场租金保持平稳
4 北京豪华公寓销量大涨765金融街写字楼租金创35季度新高
5 楼市下行高租金的商住和写字楼能不能投

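This seems to work. Series.replace(' ', '') only replaces cells whose entire value is a single space, so it leaves inner spaces alone; Series.str.replace replaces the substring inside each string: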
df.news.apply(lambda x: re.sub(r'[^\w\s]', '', x)).str.replace(' ', '')
Output:
0 北京写字楼租金哪家高金融街CBD亚奥居TOP3
1 租金一直涨到底是谁租走了北京最贵的写字楼附名单
2 北京三季度写字楼租金继续保持平稳
3 戴德梁行第三季度北京写字楼市场租金保持平稳
4 北京豪华公寓销量大涨765金融街写字楼租金创35季度新高
5 楼市下行高租金的商住和写字楼能不能投
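As a rough sketch of the same idea in one pass: in Python 3, \W matches anything that is not a letter, digit, or underscore, so a single substitution removes spaces and punctuation together, including full-width punctuation. The helper name strip_non_word is made up for this illustration, and the example uses plain re on a list rather than a dataframe. In pandas, the same pattern could presumably be passed to Series.str.replace with regex=True.

```python
import re

def strip_non_word(text):
    # \W covers whitespace and punctuation (ASCII and full-width alike);
    # Chinese characters, Latin letters and digits count as word characters.
    # Note: underscores are word characters too, so they are kept.
    return re.sub(r"\W+", "", text)

# Two of the sample rows from the question.
news = [
    "北京写字楼租金哪家高? 金融街、CBD、亚奥居TOP3",
    "北京豪华公寓销量大涨76.5% 金融街写字楼租金创35季度新高",
]
cleaned = [strip_non_word(s) for s in news]
print(cleaned)
```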

Related

Detecting Special Characters with Regular Expression in python?

df
Name
0 ##
1 R##
2 ghj##
3 Ray
4 *#+
5 Jack
6 Sara123#
7 ( 1234. )
8 Benjamin k 123
9 _
10 _!##_
11 _#_&#+-
12 56##!
Output:
Bad_Name
0 ##
1 *#+
2 _
3 _!##_
4 _#_&#+-
I need to detect strings made up only of special characters using a regular expression. If a string contains any letter or digit it is valid; otherwise it should be considered a bad string.
I was using the regex '^\W*$', and everything worked except when the string contains '_' (underscore): such strings are not treated as bad.

Use pandas.Series.str.contains, passing case=False so that the match is case-insensitive:
df[~df['Name'].str.contains('[a-z0-9]', case=False)]
Output:
Name
0 ##
4 *#+
9 _
10 _!##_
11 _#_&#+-
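The original pattern misses underscores because \w includes '_', so \W does not. One possible fix, sketched here with plain re rather than pandas, is to add the underscore to the class explicitly. The function name is_bad_name is hypothetical, and note that this version also treats an empty string as bad.

```python
import re

def is_bad_name(name):
    # [\W_] = any non-word character, plus the underscore that \W leaves out.
    return re.fullmatch(r"[\W_]*", name) is not None

names = ["##", "R##", "ghj##", "Ray", "*#+", "Jack", "Sara123#",
         "( 1234. )", "Benjamin k 123", "_", "_!##_", "_#_&#+-", "56##!"]
bad = [n for n in names if is_bad_name(n)]
print(bad)
```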

Removing special characters while retaining alpha numeric words

I'm in the middle of cleaning a data set that has this:
[IN]
my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
my_Series.str.replace("[^a-zA-Z]+", " ")
[OUT]
0
1 ASD
2 AUG M G
3 Air G G
4 Karsh
[IDEAL OUT]
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
My goal is to remove special characters and numbers, but if a word mixes letters and digits, it should stay. Can anyone help?

Try apply to get your ideal output.
>>> my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
>>> my_Series.apply(lambda x: " ".join(['' if word.isdigit() else word for word in x.replace('-', ' ').split()]))
Output:
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
I replace - with a space and split the string on whitespace, then check whether each word consists only of digits.
If it does, it is replaced with an empty string; otherwise the word is kept.
Finally, the words are joined back together with spaces.
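Edit 1:
Regex solution (regex=True is needed on pandas 2.0 and later, where str.replace defaults to literal matching):
>>> my_Series.str.replace(r"((\d+)(?=.*\d))|([^a-zA-Z0-9 ])", " ", regex=True)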
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
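Explanation:
This uses a lookahead.
((\d+)(?=.*\d))|([^a-zA-Z0-9 ])
The first alternative matches a run of digits that is followed, somewhere later in the string, by another digit. The second matches any single character that is not a letter, a digit, or a space. Both are replaced with a space. Because the first alternative depends on a later digit being present, this pattern relies on the layout of the sample data.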
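The token-based idea from the first solution can also be written without special-casing the hyphen. This is only a sketch under the assumption that a "word" is any run of ASCII letters and digits; the helper name drop_numeric_tokens is invented for the example.

```python
import re

def drop_numeric_tokens(text):
    # Split into alphanumeric runs, which discards every special character.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # Keep tokens that contain at least one letter; drop purely numeric ones.
    return " ".join(t for t in tokens if not t.isdigit())

samples = ["-", "ASD", "711-AUG-M4G", "Air G2G", "Karsh"]
result = [drop_numeric_tokens(s) for s in samples]
print(result)
```

With pandas this function could be passed to my_Series.apply in place of the lambda.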

Regular Expression: Find repeated patterns

Having this string s=";123;;123;;456;;124;;123;;567;" in R, which shows some Ids separated by ";", I want to find the repeated IDs, so in this case ";123;" is repeated. I used the following command in R:
gregexpr("(;[1-9]+;).*\1", s)
but it doesn't find the repeated patterns. Any idea what is wrong?
One example of a long string:
1760381;;1774536;;1774614;;1774617;;1774705;;1774723;;1775013;;1902321;;1928678;;2105486;;2105514;;2105544;;2105575;;2105585;;2279115;;2379236;;290927;;542280;;555749;;641540;;683822;;694934;;713228;;713248;;713249;;726949;;727204;;731434;;754522;;7693856;;100095;;1003838;;1045582;;1079057;;1108697;;1231229;;124087;;1249672;;1328126;;1412065;;1419930;;1441743;;1470580;;1476585;;1502106;;1556149;;1637775;;1643922;;1655644;;1755547;;1759001;;1760295;;1760296;;1760320;;1760326;;1760338;;1760348;;1760349;;1760350;;1760353;;1760375;;1760376;;1760377;;1760378;;1760388;;1760401;;1760402;;1760403;;1760410;;1760421;;1760425;;1760426;;1760642;;1760654;;1770463;;1774365;;1774366;;1774394;;1774449;;1774453;;1774454;;1774455;;1774456;;1774457;;1774458;;1774461;;1774462;;1774463;;1774464;;1774466;;1774469;;1774504;;1774505;;1774506;;1774519;;1774520;;1774525;;1774527;;1774529;;1774532;;1774533;;1774539;;1774542;;1774593;;1774595;;1774604;;1774610;;1774616;;1774617;;1774641;;1774660;;1774671;;1774674;;1774684;;1774687;;1774694;;1774704;;1774706;;1774713;;1774717;;1774722;;1774723;;1774726;;1774733;;1774745;;1774750;;1774753;;1774754;;1774766;;1774784;;1774786;;1774795;;1774799;;1774800;;1774803;;1774809;;1774813;;1774835;;1774849;;1774852;;1774853;;1774854;;1774857;;1774858;;1774861;;1774862;;1774867;;1774868;;1774869;;1774870;;1774877;;1774878;;1774880;;1774884;;1774885;;1774886;;1774902;;1774905;;1774934;;1774935;;1774937;;1774939;;1774946;;1774949;;1774950;;1774958;;1774959;;1774960;;1774961;;1774962;;1774964;;1774965;;1774966;;1774967;;1774969;;1774971;;1774972;;1774973;;1774975;;1774977;;1774978;;1774999;;1775000;;1775003;;1775005;;1775006;;1775009;;1775013;;1775014;;1775017;;1775024;;1775026;;1775033;;1775038;;1775040;;1775041;;1775044;;1775087;;1785544;;1811645;;1837210;;1864356;;1928674;;1928678;;1932882;;1954203;;2066856;;2076876;;2105349;;2105351;;2105458;;2105464;;2105476;;2105480;;2105482;;2105484;;2105489;;2105496;;2105500;;2105510;;2105514;;2105518;;2105532;;2105545;
;2105550;;2172257;;2172762;;218438;;2228198;;2229827;;2247909;;2262250;;2263135;;2287260;;2335872;;2335873;;2335874;;2335877;;2338682;;2352560;;2420902;;263946;;265370;;303060;;330571;;338764;;387492;;387750;;388362;;431807;;436056;;436442;;444058;;458026;;491696;;504783;;513098;;529228;;539799;;549649;;559957;;562574;;563116;;576418;;582851;;592273;;599952;;614463;;626416;;645122;;652363;;665854;;668048;;682877;;683822;;688317;;709795;;710684;;723114;;724447;;724526;;725177;;731389;;731434;;876958;;879962;;947924;;987322;;987446;;61326;;1025952;;1095970;;1338018;;1349990;;1373122;;1419930;;1760310;;1760320;;1774705;;1774706;;1774708;;1774712;;1774952;;1774954;;1774963;;1774972;;1774977;;1775077;;1901075;;2022080;;2117779;;2143723;;441554;;450517;;549649;;1010402;;113311;;1148258;;1374348;;1419930;;1606449;;1606515;;1606608;;1606610;;1760320;;1760338;;1760618;;1760642;;1774504;;1774520;;1774595;;1774705;;1774909;;1774977;;1775011;;1775043;;179542;;1928678;;2105598;;2105721;;2188303;;2335873;;340762;;387759;;436442;;504783;;588336;;646185;;682877;;715644;;725080;;741661;;760924

m<-gregexpr("[0-9]+",s)
n<-regmatches(s,m)
[[1]]
[1] "123" "123" "456" "124" "123" "567"
data.frame(table(unlist(n)))
Var1 Freq
1 123 3
2 124 1
3 456 1
4 567 1
The code works for your long string too. Here are the head and tail of the output:
head(data.frame(table(unlist(n))),10)
Var1 Freq
1 100095 1
2 1003838 1
3 1010402 1
4 1025952 1
5 1045582 1
6 1079057 1
7 1095970 1
8 1108697 1
9 113311 1
10 1148258 1
tail(data.frame(table(unlist(n))),10)
Var1 Freq
316 731434 2
317 741661 1
318 754522 1
319 760924 1
320 7693856 1
321 876958 1
322 879962 1
323 947924 1
324 987322 1
325 987446 1
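For readers working in Python rather than R, the same extract-and-count approach might look like the sketch below. It assumes the ids are runs of digits and uses collections.Counter in place of table().

```python
import re
from collections import Counter

s = ";123;;123;;456;;124;;123;;567;"

# Pull out every run of digits, then count how often each id occurs.
counts = Counter(re.findall(r"[0-9]+", s))

# Ids that appear more than once are the repeated ones.
repeated = sorted(id_ for id_, n in counts.items() if n > 1)
print(counts)
print(repeated)
```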

1) In the examples the ids are all the same length so we assume that is a general feature. Try this pattern where (?=...) is a zero width lookahead expression (see ?regex)
pat <- ";([1-9]+);(?=.*\\1)"
gregexpr(pat, s, perl = TRUE)
or this:
library(gsubfn)
strapply(s, pat, perl = TRUE)[[1]]
## [1] "123" "123"
This lists each id one fewer times than it occurs in s (zero times for ids that are not duplicated). To list each duplicated id exactly once, use unique(st), where st is the result of the last line of code above.
Note: In the second example in the question, i.e. the long string, there is no ; at the end of the string so the last id can never be matched by the expression unless we first paste a ; onto the end.
2) Instead of matching the contents, we could match the delimiters:
st <- strsplit(s, ";+")[[1]][-1]
Here st is just a vector of all the ids, so unique(st[duplicated(st)]) lists each duplicated id once, with no backreferences or lookaheads needed.
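Approach 1 carries over to Python's re module, which also supports a backreference inside a lookahead. The sketch below is a deliberate variation rather than a translation: it uses [0-9] so that ids containing zeros are matched, and it wraps the backreference in delimiters so that an id cannot match as part of a longer one. As with the R version, an id at the very end is only found if the string ends with ";".

```python
import re

s = ";123;;123;;456;;124;;123;;567;"

# Match ";id;" only when the same delimited id occurs again later on.
hits = re.findall(r";([0-9]+);(?=.*;\1;)", s)

# Each id is reported once per occurrence that has a later repeat,
# so collapse the hits to get each duplicated id a single time.
duplicated_ids = sorted(set(hits))
print(hits)
print(duplicated_ids)
```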

regex for position matching with OR condition

I'm new to regex and looking for help writing regular expressions for the following.
The data items are six-character strings, as in the examples below:
1) "100100"
2) "110011"
3) "010000"
4) "110011"
5) "111111"
6) "000111"
I need regexes that find items with, for example:
1 in the 1st position OR 1 in the 4th position: Items 1, 2, 4, 5 and 6 should be matched
1 in 2nd position: Items 2, 4 and 5 should be matched
1 in 5th and 6th position: Items 2, 4, 5 and 6 should be matched

Given your samples, these will work:
1 in the 1st position OR 1 in the 4th position: Items 1, 2, 4, 5 and 6 should be matched
1.....|...1..
1 in 2nd position: Items 2, 4 and 5 should be matched
.1....
1 in 5th and 6th position: Items 2, 4, 5 and 6 should be matched
....11
Or if you want to match any of these rules, combine them with the | (or) operator.
Example:
http://regexpal.com/?flags=g&regex=(1.....%7C...1...%7C.1....%7C....11)&input=100100%0A%0A110011%0A%0A010000%0A%0A110011%0A%0A111111%0A%0A000111
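A quick way to check six-character positional patterns against the sample data is sketched below in Python. The helper matching_items is made up for this example; it uses re.fullmatch so that each pattern has to cover the whole string.

```python
import re

items = ["100100", "110011", "010000", "110011", "111111", "000111"]

def matching_items(pattern):
    # Return the 1-based numbers of the items the pattern matches in full.
    return [i for i, s in enumerate(items, start=1) if re.fullmatch(pattern, s)]

first_or_fourth = matching_items(r"1.....|...1..")
fifth_and_sixth = matching_items(r"....11")
print(first_or_fourth)
print(fifth_and_sixth)
```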

If it is always strings with only 1s and 0s, you should treat them as binary numbers and use logical operators to find the matches.
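A minimal sketch of that binary approach in Python, assuming every item really is a string of 0s and 1s: each rule becomes a bit mask, "OR" rules test whether any masked bit is set, and "AND" rules test whether all of them are. The mask and function names are invented for the illustration.

```python
items = ["100100", "110011", "010000", "110011", "111111", "000111"]

FIRST_OR_FOURTH = 0b100100   # bits for positions 1 and 4
FIFTH_AND_SIXTH = 0b000011   # bits for positions 5 and 6

def any_bit_set(s, mask):
    # True if at least one of the masked positions holds a 1.
    return (int(s, 2) & mask) != 0

def all_bits_set(s, mask):
    # True only if every masked position holds a 1.
    return (int(s, 2) & mask) == mask

rule_or = [i for i, s in enumerate(items, 1) if any_bit_set(s, FIRST_OR_FOURTH)]
rule_and = [i for i, s in enumerate(items, 1) if all_bits_set(s, FIFTH_AND_SIXTH)]
print(rule_or)
print(rule_and)
```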

Try this regex
([1][0-1]{2}[1][0-1]{2})|([0-1][1][0-1]{4})|([0-1]{4}[1]{2})
Find the explanation and demo here http://www.regex101.com/r/vD9jE7

Here's an example. Replace the dots with zeros if necessary.
/^(11..|.1.1)11$/
^ # beginning of string
( # either
11.. # 11 and any 2 char
| # or
.1.1 # any char, 1, any char, 1
)
11
$ # end of string

R + reshape: using colsplit w/regex

I am trying to use colsplit to break up a vector in a dataframe. The fact that we have regular expression as an arg to colsplit makes me think it can be flexible, but I am having trouble (it might just be that I'm not understanding regex in R).
Here's the problem:
let's create a vector...
> library(reshape)
> my_var_1 <- factor(c("x00_aaa_123","x00_bbb_123","x00_ccc_123","x01_aaa_123","x01_bbb_123","x01_ccc_123","x02_aaa_123","x02_bbb_123","x02_ccc_123"))
I would like to split it into two columns upon the first underscore.
In other words, I want my end result to be this...
x whatever
1 x00 aaa_123
2 x00 bbb_123
3 x00 ccc_123
4 x01 aaa_123
5 x01 bbb_123
6 x01 ccc_123
7 x02 aaa_123
8 x02 bbb_123
9 x02 ccc_123
I am trying to find the right regex for colsplit, but no luck. Here's the closest I can get...
> colsplit(my_var_1, split="_", c("x","whatever"))
x whatever NA.
1 x00 aaa 123
2 x00 bbb 123
3 x00 ccc 123
4 x01 aaa 123
5 x01 bbb 123
6 x01 ccc 123
7 x02 aaa 123
8 x02 bbb 123
9 x02 ccc 123
That uses the split regex as a simple delimiter and it gives me three columns. I would like to not split the second underscore (to make it worse, in my real data I have an arbitrary number of underscores not just two).
Is there an expression I can use for "split" that will give what I want?
I had hoped that the regex in colsplit would allow me to match on groups and the group matches would be the content of splits but that does not appear to be the case.
Edit (thanks to @Joshuaulrich): colsplit works "as intended" when using the newer reshape2.

Your code throws an error for me:
> colsplit(my_var_1, split="_", c("x","whatever"))
Error in colsplit(my_var_1, split = "_", c("x", "whatever")) :
unused argument(s) (split = "_")
split isn't an argument to colsplit. The argument you want is pattern, or you can just rely on positional matching:
> colsplit(my_var_1, "_", c("x","whatever"))
x whatever
1 x00 aaa_123
2 x00 bbb_123
3 x00 ccc_123
4 x01 aaa_123
5 x01 bbb_123
6 x01 ccc_123
7 x02 aaa_123
8 x02 bbb_123
9 x02 ccc_123
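For comparison, splitting on only the first underscore is straightforward in Python as well. This is an illustrative sketch, not part of reshape or reshape2; the helper name split_first is hypothetical.

```python
def split_first(value, sep="_"):
    # partition splits at the first separator only, leaving any later
    # underscores untouched in the second piece.
    head, _, tail = value.partition(sep)
    return head, tail

values = ["x00_aaa_123", "x01_bbb_123", "x02_ccc_123"]
pairs = [split_first(v) for v in values]
print(pairs)
```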
