python: Removing all kinds of quotation marks - regex

I have the following string:
txt="Daniel's car é à muito esperto"
I am trying to remove all kinds of quotation marks.
I tried:
txt=re.sub(r"\u0022\u201C\u201D\u0027\u2019\u2018\u2019\u0060\u00B4\'\"", ' ', txt)
I expected:
"Daniel s car é à muito esperto"
but actually nothing is happening.

The regex does not work because, without a character set, it matches only the entire sequence of characters in that exact order:
r"\u0022\u201C\u201D\u0027\u2019\u2018\u2019\u0060\u00B4\'\""
To fix that, use either alternation between the characters or a character set:
txt=re.sub(r"[\u0022\u201C\u201D\u0027\u2019\u2018\u2019\u0060\u00B4\'\"]", ' ', txt)
In Python 3, str patterns are matched as Unicode by default, so the re.UNICODE flag should not be needed. Untested.
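A minimal sketch of the character-set approach, assuming Python 3. The names QUOTE_CHARS, QUOTES_RE and replace_quotes are made up for illustration; each quote character from the question is listed once.

```python
import re

# Quote-like characters from the question, each listed once
# (hypothetical constant name, not part of the original answer).
QUOTE_CHARS = "\u0022\u0027\u0060\u00B4\u2018\u2019\u201C\u201D"

# A character set matches any ONE of the listed characters,
# instead of the whole sequence in order.
QUOTES_RE = re.compile("[" + re.escape(QUOTE_CHARS) + "]")


def replace_quotes(text, replacement=" "):
    """Replace every quote-like character in text with replacement."""
    return QUOTES_RE.sub(replacement, text)


print(replace_quotes("Daniel's car é à muito esperto"))
# Daniel s car é à muito esperto
```

Passing an empty string as replacement removes the quotes instead of turning them into spaces.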

Related

Vi: Substitution pattern SQL file. Issue with the Regex

I have to modify a SQL file with vi to delete columns that we do not use. As we have a lot of data, I use the search and replace option with a Regex Pattern.
For instance we have :
(1,2956,2026442,4,NULL,NULL,'ZAC DU BOIS DES COMMUNES','',NULL,NULL,'Rue DU LUXEMBOURG',NULL,
'9999','EVREUX',NULL,1,'27229',NULL,NULL,NULL,NULL,NULL,' Rue DU LUXEMBOURG, 9999 EVREUX',NULL,NULL,NULL,NULL,
NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'2020-07-08 16:34:40',NULL,NULL)
So we have 40 columns and I keep 13 of them. The columns I keep are shown in parentheses below, followed by my regex:
(1),2,(3),4-5,(6-14),15-22,(23),24-39,(40)
:%s/(\(.\{-}\),.\{-},\(.\{-}\),.\{-},.\{-},\(.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-}\),.\{-},
.\{-}, .\{-},.\{-},.\{-},.\{-},.\{-},.\{-},\(.\{-}\),.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},
.\{-},.\{-},.\{-},.\{-},.\{-},\(.\{-}\))/(\1,\2,\3,\4,\5)/g
I put the parts that interest me in capture groups (only the columns shown in parentheses on the line above my regex are kept). Then I recover these groups in the replacement.
So normally my result is supposed to be:
(1,2026442,NULL,'ZAC DU BOIS DES COMMUNES','',NULL,NULL,'Rue DU LUXEMBOURG',NULL,
'9999','EVREUX',' Rue DU LUXEMBOURG, 9999 EVREUX',NULL)
But because there is a comma (,) in ' Rue DU LUXEMBOURG, 9999 EVREUX', my result becomes:
(1,2026442,NULL,'ZAC DU BOIS DES COMMUNES','',NULL,NULL,'Rue DU LUXEMBOURG',NULL,'9999','EVREUX',' Rue DU LUXEMBOURG',NULL,NULL)
Can someone who is good with regex help me? Thanks in advance. If I wasn't clear, tell me and I will try to explain better.
I suggest matching fields that can be strings with a %('[^']*'|\w*) pattern, that is, a non-capturing group that finds either ' + zero or more non-'s and then a ' char, or any zero or more alphanumeric characters.
Also, the use of non-capturing groups (in Vim, it is %(...) in very magic mode, or \%(...\) in a regular mode) and very magic mode can help shorten the pattern.
The whole pattern will look like
:%s/\v\(([^,]*),[^,]*,([^,]*),[^,]*,[^,]*,(%('[^']*'|\w*)%(,%('[^']*'|\w*)){8})%(,%('[^']*'|\w*)){8},('[^']*'|\w*)%(,%('[^']*'|\w*)){16},([^,]*)\)/(\1,\2,\3,\4,\5)/g
See the regex demo converted to a PCRE regex.
Note that some fields that are not strings are matched with [^,]*, which matches zero or more chars other than a comma. Patterns like %(,%('[^']*'|\w*)){8} match (here) 8 occurrences of a , char followed by either a '...' substring or zero or more word chars.
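The same idea (a field is either a '...' string or a run of non-comma characters) can be sketched outside Vim. This is a Python illustration, not the Vim command itself; FIELD_RE and keep_columns are hypothetical names, and the sketch assumes well-formed rows with no escaped quotes inside the strings.

```python
import re

# One field: either a '...' string (which may contain commas) or a run of
# non-comma characters. Each field follows the row start or a comma.
FIELD_RE = re.compile(r"(?:^|,)('[^']*'|[^,]*)")


def keep_columns(row, keep):
    """Return the row with only the 1-based column numbers listed in keep."""
    inner = row.strip()[1:-1]          # drop the surrounding parentheses
    fields = FIELD_RE.findall(inner)   # quoted commas stay inside their field
    return "(" + ",".join(fields[i - 1] for i in keep) + ")"


row = "(1,'Rue DU LUXEMBOURG, 9999 EVREUX',NULL,'',5)"
print(keep_columns(row, [1, 2, 4]))
# (1,'Rue DU LUXEMBOURG, 9999 EVREUX','')
```

For the 40-column rows in the question, keep would be 1, 3, 6 to 14, 23 and 40.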

Remove spaces (apostrophes) around quotes with regex in ruby

I'm trying to remove all spaces around quotes with one Ruby regex. (not the same question as this)
Input: l' avant ou l 'après ou encore ' maintenant'
Output: l'avant ou l'après ou encore 'maintenant'
What I tried:
(/'\s|\s'/, '')
It's matching a few cases, but not all.
How to perform this ? Thanks.
TLDR:
I assume the spaces were inserted by some automation software and there can only be single spaces around the words.
s = "l' avant ou l 'apres ou encore ' maintenant' ou bien 'ceci ' et ' encore de l ' huile ' d 'accord d' accord d ' accord Je n' en ai pas .... s ' entendre Je m'appelle Victor"
first_rx = /(?<=\b[b-df-hj-np-tv-z]) ' ?(?=\p{L})|(?<=\b[b-df-hj-np-tv-z]) ?' (?=\p{L})/i
# If you find it overmatches, replace [b-df-hj-np-tv-z] with [dlnsmtc],
# i.e. first letters of word that are usually contracted
second_rx = /\b'\b\K|' *((?:\b'\b|[^'])+)(?<=\S) *'/
puts s.gsub(first_rx, "'")
.gsub(second_rx) { $~[1] ? "'#{$~[1]}'" : "" }
Output:
l'avant ou l'apres ou encore 'maintenant' ou bien 'ceci' et 'encore de l'huile' d'accord d'accord d'accord Je n'en ai pas .... s'entendre Je m'appelle Victor
Explanation
The problem is really complex. Several French words can be abbreviated and written with an apostrophe (de, le/la, ne, se, me, te, ce, to name a few), but in every case a single consonant is left before the apostrophe. You may remove all spaces between a single, standalone consonant, the apostrophe and the next word using
s.gsub(/(?<=\b[b-df-hj-np-tv-z]) ' ?(?=\p{L})|(?<=\b[b-df-hj-np-tv-z]) ?' (?=\p{L})/i, "'")
If you find it overmatches, replace [b-df-hj-np-tv-z] with [dlnsmtc], i.e. first letters of word that are usually contracted. See the regex demo.
Next step is to remove spaces after initial and before trailing apostrophes. This is tricky:
s.gsub(/\b'\b\K|' *((?:\b'\b|[^'])+)(?<=\S) *'/) { $~[1] ? "'#{$~[1]}'" : "" }
where \b'\b is meant to match all apostrophes in between word chars, those that we fixed at the previous step. See this regex demo. As there is no (*SKIP)(*F) support in Onigmo regex, the regex is a bit simplified but the replacement is a conditional one: if Group 1 matched, replace with ' + Group 1 value ($1) + ', else, replace with an empty string (since \K reset the match, dropped all text from the match memory buffer).
NOTE: this approach can be extended to handle some specific cases like aujourd'hui, too.
To remove all whitespace around the ', use gsub!, applied in several steps for proper whitespace removal:
str = "l' avant ou l 'apres ou encore ' maintenant'"
str.gsub!(/\b'\s+\b/, "'").gsub!(/\b\s+'\b/, "'").gsub!(/\b(\s+')\s+\b/, '\1')
puts str
# l'avant ou l'apres ou encore 'maintenant'
Here,
\b : word boundary,
\s+ : 1 or more whitespace,
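string.gsub!(regex, replacement_string) : replaces every match of regex in string with replacement_string, modifying the original string in place (note that gsub! returns nil when nothing was replaced, so chaining it as above is only safe when every step finds a match),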
\1 : in the replacement string, this refers to the first group captured in parenthesis in the regex: (...).
So if you have a lot of data like this, all the answers I have seen are wrong and will not work. No regex can guess whether the preceding word should have a space or not, unless you come up with a list of words (or patterns) that either do or don't.
The problem is, sometimes a space should be left, sometimes not. The only way to script that is to find a pattern which describes when the space should be there, or when not. You must teach your regex French grammar. It may be possible lol. But probably not, or difficult.
If this is a one off, my advice is to create regexes for 2 or 3 different situations, and use something like vim, to go through the data, and select manually yes or no to substitute each occurrence.
There may be some cases you can run - eg remove all spaces to the right of quotes? - but unfortunately I don't think you can automate this process.
I believe the following should work for you
s.gsub(/'.*?'/){ |e| "'#{e[1...-1].strip}'" }
The regex portion lazy matches all text within single quotes (including quotes). Then, for each match you substitute for the quoted text with leading and trailing whitespace removed, and return this text in quotes.
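For comparison, the same lazy-match-and-strip idea can be sketched in Python (the question is about Ruby; strip_inside_quotes is a hypothetical name). Note that it pairs apostrophes from left to right, so it gives the expected result on the sample input only because that input contains an even number of apostrophes.

```python
import re


def strip_inside_quotes(text):
    """Trim whitespace just inside each pair of single quotes."""
    # Lazily match from one apostrophe to the next and rebuild the match
    # with the inner text stripped of leading and trailing whitespace.
    return re.sub(r"'(.*?)'", lambda m: "'" + m.group(1).strip() + "'", text)


print(strip_inside_quotes("l' avant ou l 'apres ou encore ' maintenant'"))
# l'avant ou l'apres ou encore 'maintenant'
```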

Hive regexp_replace

My use case is the following:
String text_string: "text1:message1,text3:message3,text2:message2,..."
select regexp_replace(text_string, '[^:]*:([^,]*(,|$))', '$1')
Correct output: message1,message3,message2,...
The pattern works, but the problem is that if there is a ":" or "," character in the message, the replace doesn't work.
So I tried to use "::" and ",," characters as a separators in the string
String text_string: "text1::message1,,text3::message3,,text2::message2,..."
select regexp_replace(text_string, '[^::]*::([^,,]*(,,|$))', '$1')
Correct output: message1,,message3,,message2,,...
but also in this case, if there is one ":" or "," character in the string (in the text or in the message) the replace command doesn't work.
How should the regular expression be modified to work?
Delimiters cannot be characters that are likely to be in the data. Since you have control over it, use pipes '|' or tildes '~' maybe. Only you can come up with the right characters by analyzing the data.
If you can't do that, then you'll need to put quotes around the data that contains the delimiter character and come up with a way to deal with that.
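One reason the "::" attempt fails is that a character class such as [^::] is a set of single characters, so it is the same as [^:] and still stops at any lone ":". A rough Python sketch (not Hive; extract_messages is a hypothetical name) of treating "::" and ",," as whole delimiters instead:

```python
def extract_messages(text, pair_sep=",,", key_sep="::"):
    """Keep only the message part of each key::message pair."""
    messages = []
    for pair in text.split(pair_sep):
        # partition splits on the FIRST key_sep only, so a lone ':' or ','
        # inside the message is left untouched.
        _key, _sep, message = pair.partition(key_sep)
        messages.append(message)
    return pair_sep.join(messages)


text = "text1::mess:age1,,text3::message,3,,text2::message2"
print(extract_messages(text))
# mess:age1,,message,3,,message2
```

This still breaks if the data itself can contain "::" or ",,", which is why the choice of delimiter depends on the data.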

Remove all the non words except quote in regex

I have the following:
string = re.sub("[^A-Za-z]", ' ', string)
This replaces every non-letter character with a space. Now I would like to do almost the same but keep the single quotes in my string this time. How do I need to change my regex?
Example: Queen's son is sleeping, but he will wake up.
Result: queen's son is sleeping but he will wake up
You can just include the single quote escaped in your group:
([^A-Za-z\'])
Including it in your example:
string = re.sub("[^A-Za-z\']", ' ', string)
Edit: You don't need to escape single quote so:
string = re.sub("[^A-Za-z']", ' ', string)
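A small sketch that reproduces the result from the question. The lowercasing, the trimming and the + quantifier (which collapses a run of removed characters into one space) are assumptions added to match the expected output; keep_letters_and_apostrophes is a hypothetical name.

```python
import re


def keep_letters_and_apostrophes(text):
    """Replace runs of anything but ASCII letters and ' with one space."""
    cleaned = re.sub(r"[^A-Za-z']+", " ", text)
    # Trimming and lowercasing are extra steps assumed from the example.
    return cleaned.strip().lower()


print(keep_letters_and_apostrophes("Queen's son is sleeping, but he will wake up."))
# queen's son is sleeping but he will wake up
```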

Remove all special characters from a string in R?

How to remove all special characters from string in R and replace them with spaces ?
Some special characters to remove are : ~!##$%^&*(){}_+:"<>?,./;'[]-=
I've tried regex with [:punct:] pattern but it removes only punctuation marks.
Question 2 : And how to remove characters from foreign languages like : â í ü Â á ą ę ś ć ?
Answer: Use [^[:alnum:]] to remove ~!##$%^&*(){}_+:"<>?,./;'[]-= and use [^a-zA-Z0-9] to also remove â í ü Â á ą ę ś ć in regex or regexpr functions.
Solution in base R :
x <- "a1~!##$%^&*(){}_+:\"<>?,./;'[]-="
gsub("[[:punct:]]", "", x) # no libraries needed
You need to use regular expressions to identify the unwanted characters. For the most easily readable code, you want the str_replace_all from the stringr package, though gsub from base R works just as well.
The exact regular expression depends upon what you are trying to do. You could just remove those specific characters that you gave in the question, but it's much easier to remove all punctuation characters.
library(stringr)
x <- "a1~!##$%^&*(){}_+:\"<>?,./;'[]-=" #or whatever
str_replace_all(x, "[[:punct:]]", " ")
(The base R equivalent is gsub("[[:punct:]]", " ", x).)
An alternative is to swap out all non-alphanumeric characters.
str_replace_all(x, "[^[:alnum:]]", " ")
Note that the definition of what constitutes a letter, a number or a punctuation mark varies slightly depending on your locale, so you may need to experiment a little to get exactly what you want.
Instead of using regex to remove those "crazy" characters, just convert them to ASCII, which will remove accents, but will keep the letters.
astr <- "Ábcdêãçoàúü"
iconv(astr, from = 'UTF-8', to = 'ASCII//TRANSLIT')
which results in
[1] "Abcdeacoauu"
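A comparable effect can be sketched in Python with the standard unicodedata module. This is a different technique from iconv transliteration: it decomposes each character and drops the combining marks, so it only handles letters that decompose into a base letter plus accents. strip_accents is a hypothetical name.

```python
import unicodedata


def strip_accents(text):
    """Drop accents by decomposing characters and removing combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Ábcdêãçoàúü"))
# Abcdeacoauu
```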
Convert the special characters to apostrophes:
Data <- gsub("[^0-9A-Za-z///' ]","'" , Data ,ignore.case = TRUE)
The code below removes the doubled apostrophes ('') that this leaves behind:
Data <- gsub("''","" , Data ,ignore.case = TRUE)
In both steps, the gsub(..) function does the replacing: first each special character with an apostrophe, then each doubled apostrophe with nothing.