Removing escaped unicode sequence in a text file [duplicate] - regex

This question already has an answer here:
Regex for matching Unicode pattern
(1 answer)
Closed 2 years ago.
I have a text file with lots of Unicode escape sequences (emojis, by the way), for instance
blablabla \uD83D\uDC4D\uD83C blablabla \uDFFC\uD83D\uDC4F\uD83C\uDFFD
I'd like to remove it all, and get
blablabla blablabla
Is there a regex that would clean these up, considering that I use Notepad++?
Thanks.

I would suggest: \\u[0-9A-F]{4}\s?
\\u escapes the backslash so that it is matched literally, followed by the literal u. [0-9A-F]{4} matches exactly 4 uppercase hex digits. Perhaps you should update it to also match 2-digit sequences, depending on the actual text: \\u([0-9A-F]{4}|[0-9A-F]{2})\s?
The \s? matches zero or one whitespace character, so you don't end up with multiple consecutive whitespace characters.
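Outside Notepad++, the same pattern can be tried in a script. The following is only a minimal sketch in Python, assuming the file content is already loaded into a string; the helper name strip_unicode_escapes is made up for illustration.

```python
import re

# Assumed pattern: a literal backslash, "u", four uppercase hex digits,
# then at most one trailing whitespace character.
ESCAPE_RE = re.compile(r"\\u[0-9A-F]{4}\s?")


def strip_unicode_escapes(text: str) -> str:
    """Remove literal backslash-u escape sequences (hypothetical helper)."""
    return ESCAPE_RE.sub("", text)


# Raw string, so the backslashes stay in the text as typed.
sample = r"blablabla \uD83D\uDC4D\uD83C blablabla \uDFFC\uD83D\uDC4F\uD83C\uDFFD"
cleaned = strip_unicode_escapes(sample)
print(repr(cleaned))
```

Note that the space before the last run of escapes is not consumed, so the result keeps one trailing space; strip it separately if that matters.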

Related

How to exclude a substring in a regular expression? [duplicate]

This question already has answers here:
What is the difference between .*? and .* regular expressions?
(3 answers)
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 5 months ago.
There is a line of text:
Lorem ~Ipsum~ is simply ~dummy~ text ~of~ the printing...
To find all the words enclosed in ~ I use
re.search(r'~([^~]*)~', text)
Let's say it became necessary to use ~~ instead of ~
([^\~]+) indicates to exclude the ~ character from the text within those characters
How do I make a regular expression to exclude a string of characters instead of just one?
That is, ~~Lor~em~~ should return Lor~em
Newline characters must not be excluded, and the length of the found string cannot be 0
Use a non-greedy quantifier instead of a negated character set.
re.search(r'~~(.*?)~~', text, flags=re.DOTALL)
re.DOTALL makes . match newline characters.
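As a rough sketch of how this could look for all occurrences: the variant below swaps .*? for .+? because the question asks for non-empty matches, and uses findall instead of search. The helper name find_tilde_spans and the sample text are assumptions for illustration.

```python
import re

# ".+?" (one or more, lazy) instead of ".*?" so that an empty "~~~~"
# does not count as a match; DOTALL lets "." cross newlines.
TILDE_RE = re.compile(r"~~(.+?)~~", flags=re.DOTALL)


def find_tilde_spans(text: str) -> list[str]:
    """Return every non-empty string enclosed in ~~ ... ~~ (hypothetical helper)."""
    return TILDE_RE.findall(text)


text = "~~Lor~em~~ is simply ~~dummy\ntext~~ of the printing"
print(find_tilde_spans(text))
```

A single ~ inside the delimiters is kept, since only the two-character ~~ ends a match.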

Parsing regex with escaped pipe delimiter [duplicate]

This question already has answers here:
regular expression to match pipe separated strings with pipe escaping
(4 answers)
Closed 3 years ago.
I'm trying to parse
|123|create|item|1497359166334|Sport|Some League|\|Team\| vs \|Team\||1497359216693|
With regex (https://regex101.com/r/KLzIOa/1/)
I currently have
[^|]++
Which is parsing everything correctly except \|Team\| vs \|Team\|
I would expect this to be parsed as |Team| vs |Team|
If I change the regex to
[^\\|]++
It parses the Teams separately instead of together with the escaped pipe
Basically I want to parse the fields between the pipes; however, if there are any escaped pipes, I would like to capture them as part of the field. So with my example I would expect
["123", "create", "item", "1497359166334", "Sport", "Some League", "|Team| vs |Team|", "1497359216693"]
You can alternate between:
\\. - A literal backslash followed by anything, or
[^|\\]+ - Anything but a pipe or backslash
(?:\\.|[^|\\]+)+
https://regex101.com/r/KLzIOa/2
Note that there's no need for the possessive quantifier, because no backtracking will occur.
If you also want to replace \|s with |s, then do that afterwards: match \\\| and replace with |.
To handle escaping, you should match a backslash and the character after it as a single "item".
(?:\\.|[^|])++
This conveniently also works for escaping the backslashes themselves!
To then remove the backslashes from the results, use a simple replacement:
Replace: \\(.)
With: $1
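The two steps (match each field, then drop the escaping backslashes) can be combined in a short script. This is only a sketch in Python rather than the regex101 flavor used in the question; parse_fields is a hypothetical helper name, and the non-possessive form of the pattern is used.

```python
import re

# A field is a run of "backslash + any character" pairs or characters
# that are neither a pipe nor a backslash.
FIELD_RE = re.compile(r"(?:\\.|[^|\\])+")
# Second pass: replace "backslash + X" with just X.
UNESCAPE_RE = re.compile(r"\\(.)")


def parse_fields(line: str) -> list[str]:
    """Split a pipe-delimited line, honoring backslash escapes (hypothetical helper)."""
    return [UNESCAPE_RE.sub(r"\1", raw) for raw in FIELD_RE.findall(line)]


line = r"|123|create|item|1497359166334|Sport|Some League|\|Team\| vs \|Team\||1497359216693|"
print(parse_fields(line))
```

One limitation of this approach: empty fields (two adjacent unescaped pipes) produce no match and are therefore skipped.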
Use:
(?:\\\||[^|])+

\w doesn't work in vim search replace but a-zA-Z does? [duplicate]

This question already has answers here:
Vim regex with metacharacters inside bracket
(3 answers)
Closed 4 years ago.
tldr
[a-zA-Z\.-] works in Vim regex search replace, but [\w\.-] does not.
The text I'm searching:
1 string.here blah blah
24 another-string.here blah.
1523 another-string.goes.here. blah123
Desired output
string.here
another-string.here
another-string.goes.here
My Question
Why does this work:
:%s/\v^\d+\s+([a-zA-Z\.-]+)\s+.*/\1/g
But this does not:
:%s/\v^\d+\s+([\w\.-]+)\s+.*/\1/g
E486: Pattern not found :%s/\v^\d+\s+([\w\.-]+)\s+.*/\1/g
The only difference between the two is a-zA-Z vs \w inside square brackets. But doesn't \w equal a-zA-Z (plus some other non-whitespace characters not in this example text)?
I'm using default vim. Unmodified. Whatever comes with Ubuntu.
Non-vim platforms
When I try with the atom text editor instead of vim, both expressions work.
Search: ^\d+\s+([a-zA-Z\.-]+)\s+.*
Replace: $1
When I try with RegExr both expressions work. (Although I have to add the multiline tag)
Other things I've tried
My understanding is that \v is necessary for avoiding escaping hell. I've tried without it:
:%s/^\d\+\s\+\([a-zA-Z\.-]\+\)\s\+.*/\1/g
works
:%s/^\d\+\s\+\([\w\.-]\+\)\s\+.*/\1/g
does not work. ("Pattern not found")
I've also tried adding the m flag (so the end is /gm) but that didn't work
E488: Trailing characters
I've also tried without the ^.
:%s/\d\+\s\+\([\w\.-]\+\)\s\+.*/\1/g
E486: Pattern not found: \d\+\s\+\([\w\.-]\+\)\s\+.*
I've also tried using \\w instead of \w.
:%s/\d\+\s\+\([\\w\.-]\+\)\s\+.*/\1/g
E486: Pattern not found: \d\+\s\+\([\\w\.-]\+\)\s\+.*
I've also tried using \[ \] instead of [ ].
:%s/\d\+\s\+\(\[\\w\.-\]\+\)\s\+.*/\1/g
E486: Pattern not found: \d\+\s\+\(\[\\w\.-\]\+\)\s\+.*
[a-zA-Z\.-] works in Vim regex search replace, but [\w\.-] does not.
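[a-zA-Z\.-] is a collection of characters containing:
every character from a to z,
every character from A to Z,
the character .,
and the character -.
(Strictly speaking, the \ before the . is also taken literally and becomes part of the collection, which is harmless here.)
:help /collection is easy to misread on this point, but shorthand classes like \w are not expanded inside []. Per :help /[], a backslash followed by a character outside the small set of recognized escapes (such as \e, \t, \r, \b, \n) is taken literally, so [\w\.-] is really just a collection containing:
the character \,
the character w,
the character .,
and the character -.
That is not what you want. Inside a collection, use a bracket expression instead, e.g. [[:alnum:]_.-], which covers the same characters as \w plus . and -.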

Meaning of regular expressions like - \\d , \\D, ^ , $ etc [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 years ago.
What do these expressions mean? Where can I learn about their usage?
\\d
\\D
\\s
\\S
\\w
\\W
\\t
\\n
^
$
\
| etc..
I need to use the stringr package and I have absolutely no idea how to use these.
From ?regexp, in the Extended Regular Expressions section:
The caret ‘^’ and the dollar sign ‘$’ are metacharacters that
respectively match the empty string at the beginning and end of a
line. The symbols ‘\<’ and ‘\>’ match the empty string at the
beginning and end of a word. The symbol ‘\b’ matches the empty
string at either edge of a word, and ‘\B’ matches the empty string
provided it is not at an edge of a word. (The interpretation of
‘word’ depends on the locale and implementation: these are all
extensions.)
From Perl-like Regular Expressions:
The escape sequences ‘\d’, ‘\s’ and ‘\w’ represent any decimal
digit, space character and ‘word’ character (letter, digit or
underscore in the current locale: in UTF-8 mode only ASCII letters
and digits are considered) respectively, and their upper-case
versions represent their negation. Vertical tab was not regarded
as a space character in a ‘C’ locale before PCRE 8.34 (included in
R 3.0.3). Sequences ‘\h’, ‘\v’, ‘\H’ and ‘\V’ match horizontal
and vertical space or the negation. (In UTF-8 mode, these do
match non-ASCII Unicode code points.)
Note that backslashes usually need to be doubled/protected in R input, e.g. you would use "\\h" to match horizontal space.
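To see what these shorthand classes and anchors actually match, here is a small sketch. It uses Python's re module purely as an illustration (the question is about R and stringr, where the backslashes would be doubled inside ordinary strings); the sample text is made up.

```python
import re

# A doubled backslash in a normal string and a single one in a raw
# string spell the same two-character pattern.
assert "\\d" == r"\d"

sample = "Item 42 costs\t7 USD\nnext_line here"

digits = re.findall(r"\d+", sample)      # runs of decimal digits
words = re.findall(r"\w+", sample)       # runs of letters, digits, underscore
only_digits = re.sub(r"\D", "", sample)  # \D: drop everything that is not a digit
collapsed = re.sub(r"\s+", " ", sample)  # \s: collapse each whitespace run
# With MULTILINE, ^ and $ match at the start and end of every line.
line_starts = re.findall(r"^\w+", sample, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", sample, flags=re.MULTILINE)

print(digits, words, only_digits, collapsed, line_starts, line_ends, sep="\n")
```

The upper-case versions (\D, \S, \W) match exactly what the lower-case ones do not, as the quoted documentation says.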
From ?Quotes:
Backslash is used to start an escape sequence inside character
constants. Escaping a character not in the following table is an
error.
\n newline
\r carriage return
\t tab
As others comment above, you may need a little more help if you're getting started with regular expressions for the first time. This is a little bit off-topic for StackOverflow (links to off-site resources), but there are some links to regular expression resources at the bottom of the gsubfn package overview. Or Google "regular expression tutorial" ...

What does the regex "/\\*{2,}/" mean? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 7 years ago.
I'm kinda new to regex, and specifically, I don't understand why there are 2 backslashes. I mean, I know the second one is to escape the character "*", but what does the first backslash do?
Well, I'm passing this regex to the PHP function preg_match(), and I'm trying to find strings that include 2 or more consecutive "*".
That regex is invalid syntax.
You have this piece:
*{2,}
Which basically would read: match n-times, 2 or more times.
The following regex:
/\\*.{2,}/
Is the simplest and closest regex to the one you have, which would read as:
match 0 or more '\' and 2 or more characters that aren't newlines
If you are talking about the string itself, it may be interpreted as 2 things:
/\\*{2,}/
Read as: match a literal \ zero or more times (\\*), with a second quantifier {2,} stacked on top
This is invalid syntax
/\*{2,}/
Read as: match 2 or more *
This is valid syntax
It all varies, depending on the escape character.
Edit:
Since the question was updated to show which language and engine it is being used, I've updated to add the following information:
You have to pass the regex as '/\*{2,}/' OR as "/\\*{2,}/" (watch the quotes).
Both produce the same pattern, because the two quote styles handle escapes differently. Single quotes ('') only support the following escape sequences:
\' - Produces '
\\ - Produces \
Any other backslash is kept as-is, so '\*' stays \*.
Double-quoted strings are treated differently in PHP and support many more escape sequences, like:
\" - Produces "
\\ - Produces \
\$ - Produces $
\x<1-2 hex digits> - Same as chr(0x<hex number>)
\0 - Produces a null char
\1 - Produces a control char (same as chr(1))
\u{<hex code point>} - Produces the UTF-8 encoding of that code point (PHP 7.0+)
\r - Produces a carriage return (the line ending on classic Mac OS)
\n - Produces a line feed (the line ending on Linux and macOS)
\t - Produces a tab
\<1-3 octal digits> - Same as \x, but the number is in octal (e.g.: "\75" and "\075" produce =)
... (some more that I probably forgot) ...
\<anything else> - Left untouched, backslash included (so "\'" is \' and "\*" is \*)
Read more about this on https://php.net/manual/en/language.types.string.php
Depending on the platform you're using, "/\\*{2,}/" may actually be a representation of the /\*{2,}/ string. This is because languages like Java treat \ as an escape character in string literals, so to actually put that character into the regex, you need to escape it in the string.
So, we have the /\*{2,}/ regex. \* matches the star character, and {2,} means at least two times. Your regex will match any two or more consecutive star characters.
Is it a string literal written in a program and if so which one? The double backslash may be to escape the escape char so that this regex matches at least 2 * star characters.
In JavaScript for example you need to escape the \ so that your string literal can express it as data before you transform it into a regular expression when using the RegExp constructor. Why do regex constructors need to be double escaped?
For PHP, that regex matches a literal * repeated 2 or more times.
But when you code it in PHP, you have to escape the backslash (with a backslash) to use it in a double-quoted string. For instance:
$re = "/\\*{2,}/";
$str = "...";
preg_match($re, $str, $matches);
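The same string-level versus regex-level escaping can be checked in other languages too. Below is a minimal sketch in Python (not PHP, so there are no / delimiters); has_star_run is a hypothetical helper name.

```python
import re

# In an ordinary string literal the backslash must be doubled;
# a raw string lets you write the regex as the engine sees it.
escaped = "\\*{2,}"
raw = r"\*{2,}"
assert escaped == raw  # both are the 6-character pattern \*{2,}

STARS_RE = re.compile(raw)


def has_star_run(text: str) -> bool:
    """True if text contains 2 or more consecutive '*' (hypothetical helper)."""
    return STARS_RE.search(text) is not None


print(has_star_run("a**b"), has_star_run("a*b*c"))
```

Either spelling reaches the regex engine as \*{2,}, so the first backslash only exists to get the second one through the string literal.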