Replace inside matched string with Notepad++ and regex - regex

I have some lines in a text file :
Joëlle;Dupont;123456
Alex;Léger;134234
And I want to replace them by :
Joëlle;Dupont;123456;joelle.dupont#mail.com
Alex;Léger;134234;alex.leger#mail.com
I want to replace all characters with accents (é, ë…) by characters without accents (e, e…), but only in the mail address, that is, only in one part of the line.
I know I can use \L\E to change uppercase letter into lowercase letter but it's not the only thing I have to do.
I used :
(.*?);(.*?);(\d*?)\n
To replace it by :
$1;$2;$3;\L$1.$2#mail.com\E\n
But it wouldn't replace characters with accents :
Joëlle;Dupont;123456;joëlle.dupont#mail.com
Alex;Léger;134234;alex.léger#mail.com
If you have any idea how I could do this with Notepad++, even with more than one replacement, maybe you can help me.

I don't know the full set of characters in your data, but you could use the pattern below to replace the variations of e with an e:
[\xE8-\xEB](?!.*;)
And replace with e.
[I got the range above from this webpage, taking the column names]
This regex matches any è, é, ê or ë and replaces them with an e, if there is no ; on the same line after it.
For variations of o:
[\xF2-\xF6](?!.*;)
For c (there's only one, so you can also put in ç directly):
\xE7(?!.*;)
For a:
[\xE0-\xE5](?!.*;)
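
To check these patterns outside Notepad++, the same lookahead replacements can be sketched in Python. This is only an illustration: it assumes the accented letters are stored as single precomposed characters (é as U+00E9, not e plus a combining accent), and the names ACCENT_CLASSES and deaccent_last_field are made up for the example.

```python
import re

# Character classes taken from the answer above, keyed by the plain letter.
ACCENT_CLASSES = {
    'a': r'[\xE0-\xE5]',
    'c': r'\xE7',
    'e': r'[\xE8-\xEB]',
    'o': r'[\xF2-\xF6]',
}

def deaccent_last_field(text):
    """Replace accented letters only when no ';' follows on the same line."""
    for plain, char_class in ACCENT_CLASSES.items():
        # (?!.*;) fails if a ';' appears later on the line ('.' stops at '\n').
        text = re.sub(char_class + r'(?!.*;)', plain, text)
    return text

sample = ("Joëlle;Dupont;123456;joëlle.dupont#mail.com\n"
          "Alex;Léger;134234;alex.léger#mail.com")
print(deaccent_last_field(sample))
```

Only the part after the last ; on each line is touched, so the name columns keep their accents.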

How can I search and replace guids in Sublime 3

I have a textfile where I would like to replace all GUIDs with space.
I want:
92094, "970d6c9e-c199-40e3-80ea-14daf1141904"
91995, "970d6c9e-c199-40e3-80ea-14daf1141904"
87445, "f17e66ef-b1df-4270-8285-b3c15da366f7"
87298, "f17e66ef-b1df-4270-8285-b3c15da366f7"
96713, "3c28e493-015b-4b48-957f-fe3e7acc8412"
96759, "3c28e493-015b-4b48-957f-fe3e7acc8412"
94665, "87ac12a3-62ed-4e1d-a1a6-51ae05e01b1a"
94405, "87ac12a3-62ed-4e1d-a1a6-51ae05e01b1a"
To become:
92094,
91995,
87445,
87298,
96713,
96759,
94665,
94405,
How can I accomplish this in Sublime 3?
Ctrl+H
Find: "[\da-f-]{36}"
Replace: LEAVE EMPTY
Enable regex mode
Replace all
Explanation:
" : double quote
[ : start of character class
\d : any digit
a-f : or letter from a to f
- : or a dash
]{36} : end class, 36 characters must be present
" : double quote
Result for given example:
92094,
91995,
87445,
87298,
96713,
96759,
94665,
94405,
Try doing a search for this pattern in regex search mode:
"[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}"
And then just replace with empty string. This should strip off the GUID, leaving you with the output you want.
Another regex solution uses a slightly different search-and-replace strategy: we don't care about the GUID format and simply keep the first column.
Search for ([^,]*,).* (again, don't forget to activate regex mode, the .* button).
Replace with $1.
Details about the regular expression
The idea here is to capture all first columns. A column here is defined by a sequence of
"some non-comma character": [^,]*
followed by a comma: [^,]*,
The first column can then be followed by anything, .* (the GUID format doesn't matter): [^,]*,.*
Finally we need to capture the first column with a capturing group: ([^,]*,).*
In the replace field we use a backreference $x, which refers to the x-th capturing group.
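
Both strategies from the answers above can be tried outside the editor. The following is a rough Python sketch, not something Sublime runs; the function names strip_guid and first_column are invented for the example, and Python writes the backreference as \1 instead of $1.

```python
import re

lines = [
    '92094, "970d6c9e-c199-40e3-80ea-14daf1141904"',
    '87445, "f17e66ef-b1df-4270-8285-b3c15da366f7"',
]

def strip_guid(line):
    # Strategy 1: delete the quoted 36-character GUID.
    return re.sub(r'"[\da-f-]{36}"', '', line)

def first_column(line):
    # Strategy 2: keep only the first column, up to and including the comma.
    return re.sub(r'([^,]*,).*', r'\1', line)

for line in lines:
    print(repr(strip_guid(line)), repr(first_column(line)))
```

The first strategy leaves the space after the comma in place, while the second one drops everything after the comma.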

Escaping invalid markdown using python regex

I've been trying to write some python to escape 'invalid' markdown strings.
This is for use with a python library (python-telegram-bot) which requires unused markdown characters to be escaped with a \.
My aim is to match lone *,_,` characters, as well as invalid hyperlinks - eg, if no link is provided, and escape them.
An example of what I'm looking for is:
*hello* is fine and should not be changed, whereas hello* would become hello\*. On top of that, if values are nested, they should not be escaped - eg _hello*_ should remain unchanged.
My thought was to match all the doubles first, and then replace any leftover lonely characters. I managed a rough version of this using re.finditer():
import re

def parser(txt):
    match_md = r'(\*)(.+?)(\*)|(\_)(.+?)(\_)|(`)(.+?)(`)|(\[.+?\])(\(.+?\))|(?P<astx>\*)|(?P<bctck>`)|(?P<undes>_)|(?P<sqbrkt>\[)'
    for e in re.finditer(match_md, txt):
        # Only the named groups (lone characters) should be escaped.
        if e.group('astx') or e.group('bctck') or e.group('undes') or e.group('sqbrkt'):
            # Inserting here shifts the offsets of all later matches.
            txt = txt[:e.start()] + '\\' + txt[e.start():]
    return txt
note: regex was written to match *text*, _text_, `text`, [text](url), and then single *, _, `, [, knowing the last groups
But the issue here, is of course that the offset changes as you insert more characters, so everything shifts away. Surely there's a better way to do this than adding an offset counter?
I tried to use re.sub(), but I haven't been able to find how to replace a specific group, or had any luck with (?:) to 'not match' the valid markdown.
This was my re.sub attempt:
def test(txt):
    match_md = r'(?:(\*)(.+?)(\*))|' \
               r'(?:(\_)(.+?)(\_))|' \
               r'(?:(`)(.+?)(`))|' \
               r'(?:(\[.+?\])(\(.+?\)))|' \
               r'(\*)|' \
               r'(`)|' \
               r'(_)|' \
               r'(\[)'
    # Prefix the whole match (group 0) with a literal backslash.
    return re.sub(match_md, r'\\\g<0>', txt)
This just prefixed every match with a backslash (which was expected, but I'd hoped the ?: would stop them being matched.)
A bonus would be if \'s already in the string were escaped too, so that they wouldn't interfere with the markdown present. This could be a source of errors, as the library would see the following character as escaped, causing it to see the rest as invalid.
Thanks in advance!
You are probably looking for a regular expression like this:
def test(txt):
    match_md = r'((([_*]).+?\3[^_*]*)*)([_*])'
    # Keep group 1 as is and put a literal backslash in front of group 4.
    return re.sub(match_md, r'\g<1>\\\g<4>', txt)
Note that for clarity I just made up a sample for * and _. You can expand the list in the [] brackets easily. Now let's take a look at this thing.
The idea is to crunch through strings that look like *foo_* or _bar*_ followed by text that doesn't contain any specials. The regex that matches such a string is ([_*]).+?\1[^_*]*: We match an opening delimiter, save it in \1, and go further along the line until we see the same delimiter (now closing). Then we eat anything behind that that doesn't contain any delimiters.
Now we want to do that as long as no more delimited strings remain, that's done with (([_*]).+?\2[^_*]*)*. What's left on the right side now, if anything, is an isolated special, and that's what we need to mask. After the match we have the following sub matches:
g<0> : the whole match
g<1> : submatch of ((([_*]).+?\3[^_*]*)*)
g<2> : submatch of (([_*]).+?\3[^_*]*)
g<3> : submatch of ([_*]) (hence the \3 above)
g<4> : submatch of ([_*]) (the one to mask)
What's left to you now is to find a way how to treat the invalid hyperlinks, that's another topic.
Update:
Unfortunately this solution masks out valid markdown such as *hello* (=> \*hello\*). The work around to fix this would be to add a special char to the end of line and remove the masked special char once the substitution is done. OP might be looking for a better solution.
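
One possible way around the offset problem from the question is to pass a function to re.sub, so that each match decides its own replacement and no positions need to be tracked. The sketch below is only an illustration under some assumptions: it covers the four constructs named in the question (*text*, _text_, `text`, [text](url)), the name escape_lone_markers is made up, and it does not handle backslashes that are already in the text.

```python
import re

# Group 1: a complete, valid construct. Group 2: a lone special character.
MD_PATTERN = re.compile(
    r'(\*.+?\*|_.+?_|`.+?`|\[.+?\]\(.+?\))'
    r'|([*_`\[])'
)

def escape_lone_markers(txt):
    def repl(match):
        if match.group(1) is not None:
            # Valid markdown: return it unchanged.
            return match.group(1)
        # Lone marker: prefix it with a backslash.
        return '\\' + match.group(2)
    return MD_PATTERN.sub(repl, txt)

print(escape_lone_markers('*hello* is fine, hello* is not, _hello*_ is nested'))
```

Because the valid constructs are tried first at every position, a nested marker such as the * inside _hello*_ is consumed as part of the outer pair and never reaches the lone-character branch.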

How can I use vim to substitute all whole lines that match a Regex

I have a TeX file which contains several paragraphs like:
\paragraph{name1}
...
\paragraph{name2}
...
Now I want to replace every "paragraph" line with \item, like this:
\item
...
\item
...
To achieve that I tried many commands and finally I used this:
(note that I used "a:" to "z:" as paragraph names)
:% s/\\paragraph[{][a-z]:[}]/\\item/g
and I think that is neither pretty nor efficient. I tried to match the whole line containing "paragraph", but somehow only that word was replaced. Given that I can delete all such lines with
:% g/_*paragraph_*/d
is there a better way to perform a substitution in the same way (that is, to substitute every whole line that contains a specific word)?
Your first attempt was almost correct. Rather than this
:% s/\paragraph[{][a-z]:[}]/\item/g
Use this
:% s/^\\paragraph{[a-z0-9:]\+}$/\\item/g
Let's break it down piece by piece:
The ^ character matches the start of the line, so that you don't match something like this:
Some text \paragraph{abc}
The reason why we use \\ instead of \ is because \ is an escape character, so to match it, we escape the escape character.
Doing [a-z0-9:]\+ will match one or more characters that are a-z, 0-9 or a colon, which is what I assume your paragraph names are composed of (you mention "a:" to "z:"). Note that inside brackets a | is a literal pipe rather than an alternation, so it is not needed there. If you need capital letters, you could do something like [a-zA-Z0-9:]\+.
Finally, we anchor the expression to the end of the line with $, so that it does not match lines that don't fit this pattern exactly.
An easy way to do it with a macro!
First, search for the pattern using /, like /\\paragraph
Let's start the macro. Clear register a by pressing qaq.
Press qa to start recording in register a.
Press n to jump to the next occurrence. Then press c$ to delete to the end of the line and enter insert mode. Type the replacement text and press the Escape key.
Press @a so that the macro calls itself. End the macro by pressing q.
Now the macro is recorded and you can press @a once to make the change in all such lines.
You can do this:
:%s/\\paragraph{[^{}]*}/\\item/g
This finds all occurrences of \paragraph{, followed by 0 or more non-{} characters, followed by } (i.e. something like \paragraph{stuff here}), and replaces them by \item.
Or if you want to replace all lines containing paragraph:
:%s/^.*paragraph.*$/\\item/
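
For comparison, the whole-line substitution can also be sketched outside Vim. The Python snippet below is only an illustration of the same idea, with a made-up function name; it anchors on a full \paragraph{...} command rather than on the bare word, and re.MULTILINE makes ^ and $ work per line as they do in Vim.

```python
import re

def paragraphs_to_items(tex):
    # Replace every line containing \paragraph{...} with \item.
    # In the replacement, '\\\\' in source form (r'\\') yields one backslash.
    return re.sub(r'^.*\\paragraph\{[^{}]*\}.*$', r'\\item', tex,
                  flags=re.MULTILINE)

source = "\\paragraph{a:}\n...\n\\paragraph{b:}\n..."
print(paragraphs_to_items(source))
```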

Vim regex match a space, a number and everything else until ;

I'm trying to make this regex but it's driving me insane. I've strings like this:
foobar 34;lorem ipsum;
foo 34/ABC;dolerm sit;
bar 3445b;amet;
I need to transform them like this:
foobar;34;lorem ipsum;
foo;34/ABC;dolerm sit;
bar;3445b;amet;
The regex I came up with is this one, but it matches only numbers: \s\d*; and this one matches the whole line: \s\d*\p*;
I need something that matches only a white space, a number and then everything up to the first ";".
does this work for you?
%s/ \ze\d/;/g
if you want to change
foo bar 3 r e p l a c e;bar;
to
foo bar;3;r;e;p;l;a;c;e;bar;
%s/ \d[^;]*/\=substitute(submatch(0)," ",";","g")/
You probably could get your original patterns working, if you used "non-greedy" matches, for example \p\{-} for "any number of printable characters, but as few as possible", or by explicitly excluding the ';' character with [^;]* (any number of any character that is not a ';').
:help non-greedy
:help /[ (then scroll down below the E769 topic)
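
The two substitutions above can be approximated in Python for quick testing. This is a sketch with invented function names: a lookahead (?=\d) stands in for Vim's \ze, and a replacement function stands in for the \= expression.

```python
import re

def split_before_number(line):
    # Like :s/ \ze\d/;/g : replace a space only when a digit follows it.
    return re.sub(r' (?=\d)', ';', line)

def split_number_field(line):
    # Like :s/ \d[^;]*/\=substitute(submatch(0)," ",";","g")/ :
    # take the first " <digit>..." run up to the next ';' and turn
    # every space inside it into ';'.
    return re.sub(r' \d[^;]*',
                  lambda m: m.group(0).replace(' ', ';'),
                  line, count=1)

for text in ['foobar 34;lorem ipsum;', 'foo 34/ABC;dolerm sit;', 'bar 3445b;amet;']:
    print(split_before_number(text))
print(split_number_field('foo bar 3 r e p l a c e;bar;'))
```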

Remove all special characters from a string in R?

How to remove all special characters from string in R and replace them with spaces ?
Some special characters to remove are : ~!##$%^&*(){}_+:"<>?,./;'[]-=
I've tried regex with [:punct:] pattern but it removes only punctuation marks.
Question 2 : And how to remove characters from foreign languages like : â í ü Â á ą ę ś ć ?
Answer: use [^[:alnum:]] to remove ~!##$%^&*(){}_+:"<>?,./;'[]-= and use [^a-zA-Z0-9] to also remove â í ü Â á ą ę ś ć, in regex functions such as gsub.
Solution in base R :
x <- "a1~!##$%^&*(){}_+:\"<>?,./;'[]-="
gsub("[[:punct:]]", "", x) # no libraries needed
You need to use regular expressions to identify the unwanted characters. For the most easily readable code, you want the str_replace_all from the stringr package, though gsub from base R works just as well.
The exact regular expression depends upon what you are trying to do. You could just remove those specific characters that you gave in the question, but it's much easier to remove all punctuation characters.
x <- "a1~!##$%^&*(){}_+:\"<>?,./;'[]-=" #or whatever
str_replace_all(x, "[[:punct:]]", " ")
(The base R equivalent is gsub("[[:punct:]]", " ", x).)
An alternative is to swap out all non-alphanumeric characters.
str_replace_all(x, "[^[:alnum:]]", " ")
Note that the definition of what constitutes a letter, a number or a punctuation mark varies slightly depending upon your locale, so you may need to experiment a little to get exactly what you want.
Instead of using regex to remove those "crazy" characters, just convert them to ASCII, which will remove accents, but will keep the letters.
astr <- "Ábcdêãçoàúü"
iconv(astr, from = 'UTF-8', to = 'ASCII//TRANSLIT')
which results in
[1] "Abcdeacoauu"
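
The same accent-stripping idea can be sketched in Python with Unicode decomposition instead of iconv. This is an illustration, not the R method: strip_accents is a made-up name, and the approach only works for letters that decompose into a base letter plus combining marks (a letter such as ł would pass through unchanged).

```python
import unicodedata

def strip_accents(text):
    # NFD splits each accented letter into base letter + combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Drop the combining marks and keep everything else.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('Ábcdêãçoàúü'))
```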
Convert the special characters to apostrophes:
Data <- gsub("[^0-9A-Za-z///' ]","'" , Data ,ignore.case = TRUE)
The code below removes the extra '' apostrophes:
Data <- gsub("''","" , Data ,ignore.case = TRUE)
Use the gsub(..) function to replace the special characters with an apostrophe.