Regex: scrub punctuation except if inside a word?

I'm not great at regex but I have this for removing punctuation from a string.
let text = 'a user provided string';
let pattern = /(-?\d+(?:[.,]\d+)*)|[-.,()&$#![\]{}"']+/g;
text = text.replace(pattern, "$1"); // replace() returns a new string, so keep the result
I am looking for a way to modify this so that it keeps punctuation if inside a word e.g.
some-hyphenated-words
a_snake_case
or.even.a.dot.word
should all keep the punctuation. How would I modify it for that?

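One option is to change \d to \w so that the capturing group matches word characters, and to add a hyphen to the character class inside that group. Word characters joined by a single dot, comma or hyphen are then captured in group 1, while any other punctuation falls through to the second alternative.
In the replacement, use group 1.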
(\w+(?:[.,-]\w+)*)|[-.,()&$#![\]{}"']+
Regex demo
If you also want to keep runs of several hyphens, commas or dots inside a word, repeat the character class: [.,-]+
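As a rough sketch of how the suggested pattern behaves in JavaScript, the snippet below applies it with "$1" as the replacement. The helper name scrubPunctuation and the sample input are made up for illustration.

```javascript
// Pattern from the answer above. Group 1 keeps runs of word characters
// joined by a single '.', ',' or '-'; the second alternative matches the
// remaining punctuation from the original character class.
const pattern = /(\w+(?:[.,-]\w+)*)|[-.,()&$#![\]{}"']+/g;

// Hypothetical helper name, not part of the original question.
function scrubPunctuation(text) {
  // When the second alternative matches, group 1 did not participate,
  // so "$1" expands to an empty string and the punctuation is dropped.
  return text.replace(pattern, "$1");
}

const input = 'some-hyphenated-words, a_snake_case! (or.even.a.dot.word)';
console.log(scrubPunctuation(input));
// → some-hyphenated-words a_snake_case or.even.a.dot.word
```

Punctuation at the edge of a word (a trailing hyphen, surrounding quotes) is still removed, because the inner group only accepts a separator that is followed by more word characters.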

Related

Regex should not be recognized for special characters

I want the regex not to match when there is a special character before, between or after the letters of the word.
My Regex:
\b([t][\W_]*?)+([e][\W_]*?)+([s][\W_]*?)+([t][\W_]*?)*?\b
https://regex101.com/r/zKg2eR/1
Example:
#test, te+st, t'est or =test etc.
I hope I have explained this reasonably clearly.
If you want to match a word character excluding an underscore, you can write it as [^\W_] using a negated character class.
You don't need a character class for a single char such as [t], and you don't have to repeat the groups when you want to match a form of test.
If the word is the only thing on the line, you can add the anchors ^ and $
^(t[^\W_]*)(e[^\W_]*)(s[^\W_]*)(t[^\W_]*)$
Regex demo
As you selected golang in the regex tester, you cannot use lookarounds. Instead, you can use an alternation to match either a whitespace char or the start/end of the string.
Then capture the whole word in another capture group.
(?:^|\s)((t[^\W_]*)(e[^\W_]*)(s[^\W_]*)(t[^\W_]*))(?:$|\s)
Regex demo
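A minimal sketch of the lookaround-free pattern, written in JavaScript only to stay consistent with the other snippets on this page. The helper name findTestWords and the sample text are assumptions for illustration; the pattern itself is the one from the answer.

```javascript
// (?:^|\s) and (?:$|\s) stand in for the boundaries that lookarounds
// would otherwise assert. Group 1 holds the whole matched word.
const pattern = /(?:^|\s)((t[^\W_]*)(e[^\W_]*)(s[^\W_]*)(t[^\W_]*))(?:$|\s)/g;

// Hypothetical helper: collect group 1 of every match.
function findTestWords(text) {
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

console.log(findTestWords("a test here, #test and te+st plus tesst"));
// → ["test", "tesst"]
```

Because the surrounding whitespace is consumed by each match, two matching words separated by a single space would only report the first one. If that matters, one option is to split the input into words and test each one with the anchored ^...$ form.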

Regex not extracting all matching words

I am trying to extract words that have at least one character from a special character set. It picks up some words and not others. Here is a link to regex101 to test it. This is the regex: \b(\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*)\b, and this is the sample sentence I am using:
His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn
Al-Daḥāk Al-Sulamī Al-Tirmidhī.
It should match the following words:
ʿĪsa Muḥammad ʿĪsa Mūsa Al-Daḥāk Al-Sulamī Al-Tirmidhī
I am not too experienced with regex, so I have no idea what I am doing wrong. If someone knows any tool to find out why a specific word doesn't match a regex pattern, please let me know as well.
You can use
[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ][\wāīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]*
After matching the one required special character, use another character set to match more occurrences of those characters or normal word characters.
https://regex101.com/r/ovJoLt/2
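You can make this work by enabling the Unicode flag /u (so that the word boundary \b assertions support Unicode characters) and adding hyphens to the surrounding character groups. Note that this relies on a flavor in which u makes \w and \b Unicode-aware, as the PCRE flavor on regex101 does; in JavaScript the u flag leaves \w and \b ASCII-only.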
/\b[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+[\w-]*\b/gu
Plus, you don't need the capturing group, since the only characters being matched form the desired output anyway (\b is a zero-width assertion).
Demo
You are not doing anything wrong, except that to match Unicode word boundaries you have to enable the u modifier, or use whitespace lookarounds instead: (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*(?!\S)
If you want to match a hyphen, add it to your character class: (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]+\w*(?!\S)
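For JavaScript specifically, where \w and \b stay ASCII-only even with the u flag, the explicit character class approach from the first answer is the one that carries over directly. The sketch below assumes that approach; the helper name extractSpecialWords and the NFC normalization step are additions for illustration and not part of the answers above.

```javascript
// Characters a word must contain at least once (taken from the question).
// normalize("NFC") guards against decomposed input (base letter followed by
// a combining mark); this step is an assumption added for the sketch.
const special = "āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ".normalize("NFC");

// [\w-]* prefix, one required special character, then any mix of
// word characters, special characters and hyphens.
const pattern = new RegExp(`[\\w-]*[${special}][\\w${special}-]*`, "g");

// Hypothetical helper name.
function extractSpecialWords(text) {
  return text.normalize("NFC").match(pattern) || [];
}

const sentence =
  "His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn " +
  "Al-Daḥāk Al-Sulamī Al-Tirmidhī.";
console.log(extractSpecialWords(sentence));
```

With the sample sentence this should yield the seven words listed in the question, with the trailing period left out because it is in neither character class.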

REGEX - Get all groups of characters with their delimiter

I'm not very good with regex, so this is my problem.
I have a string that contains c#m#fc#fm# and I want to get all groups of characters with their # at the end.
Like this:
c#
m#
fc#
fm#
I have tried some regexes but I never get what I want.
Thanks a lot for your help.
You can use [^#]+# and find all matches. Each match starts with one or more characters from the negated character class [^#]+ (any character except #) and ends with a single #.
Regex Demo
Also, if your string contains whitespace that you don't want to include in the matched text, you can add \s to the negated character class and use this regex:
[^#\s]+#
Regex Demo excluding space from matched tokens
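As a small illustration in JavaScript (the variable names and the spaced sample string are made up), String.prototype.match with the g flag returns all tokens at once:

```javascript
// Each match is one or more non-# characters followed by a single '#'.
const input = "c#m#fc#fm#";
const tokens = input.match(/[^#]+#/g) || [];
console.log(tokens); // → ["c#", "m#", "fc#", "fm#"]

// With whitespace between the tokens, the first pattern keeps the
// leading space; adding \s to the negated class leaves it out.
const spaced = "c# m# fc# fm#";
const withSpaces = spaced.match(/[^#]+#/g) || [];
const withoutSpaces = spaced.match(/[^#\s]+#/g) || [];
console.log(withSpaces);    // → ["c#", " m#", " fc#", " fm#"]
console.log(withoutSpaces); // → ["c#", "m#", "fc#", "fm#"]
```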

Extract a substring from value of key-value pair using regex

I have a string in log and I want to mask values based on regex.
For example:
"email":"testEmail#test.com", "phone":"1111111111", "text":"sample text may contain email testEmail#test.com as well"
The regex should mask
email value - both inside the string after "email" and "text"
phone number
Desired output:
"email":"*****", "phone":"*****", "text":"sample text may contain email ***** as well"
What I have been able to do is to mask email and phone individually but not the email id present inside the string after "text".
Regex developed so far:
(?<=\"(?:email|phone)\"[:])(\")([^\"]*)(\")
https://regex101.com/r/UvDIjI/2/
In the first part you are not really matching an email address; you are matching anything that is not a double quote. You could match the email address in the text the same way, with a negated character class that excludes the double quote.
One way to do this could be to get the matches using lookarounds and an alternation. Then replace the matches with *****
Note that you don't have to escape the double quote, and the colon could be written without using the character class.
(?<="(?:phone|email)":")[^"]+(?=")|[^#"\s]+#[^#"\s]+
Explanation
(?<="(?:phone|email)":") Assert that what is on the left is either "phone":" or "email":"
[^"]+(?=") Match one or more characters other than a double quote and make sure that a double quote follows
| Or
[^#"\s]+#[^#"\s]+ Match an email-like pattern using a negated character class that matches any character except #, a double quote or whitespace
See the regex demo
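A minimal sketch of this replacement in JavaScript, assuming an engine with lookbehind support (ES2018 or later). The helper name maskLog is hypothetical; the pattern and the log line come from the question and the answer above.

```javascript
// Alternative 1: the value right after "phone":" or "email":" up to the closing quote.
// Alternative 2: an email-like token anywhere else (no '#', quotes or whitespace
// on either side of the single '#').
const pattern = /(?<="(?:phone|email)":")[^"]+(?=")|[^#"\s]+#[^#"\s]+/g;

// Hypothetical helper name.
function maskLog(line) {
  return line.replace(pattern, "*****");
}

const line = `"email":"testEmail#test.com", "phone":"1111111111", "text":"sample text may contain email testEmail#test.com as well"`;
console.log(maskLog(line));
// → "email":"*****", "phone":"*****", "text":"sample text may contain email ***** as well"
```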
Your current RegEx is trying to accomplish too much in a single take. You'd be better off splitting the conditions and dealing with them separately. I'll assume that the input will always follow the structure of your example, no edge cases:
Emails:
\w+#.+?(?="|\s) - In the example emails, every character before the # is a word character, so \w+# is enough to capture the first half of the email. For the second half, I used a wildcard (.) with a lazy quantifier (+?) to stop the capture as soon as possible, combined with a positive lookahead that checks for a double quote or whitespace ((?="|\s)), so that it captures the emails inside both the "email" and "text" properties. Lookarounds are zero-length assertions, so what they check is not included in the match.
Phone number:
(?<="phone":")\d+ - Here I just use the prefix "phone":" in a lookbehind and then capture only digits \d+.
Combine both conditions and you have your RegEx: \w+#.+?(?="|\s)|(?<="phone":")\d+.
Regex101: https://regex101.com/r/UvDIjI/3
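To illustrate the "split, then combine" idea, the sketch below runs the two patterns separately and then as one alternation; under the stated assumption that the input follows the example's structure, both routes should give the same result. The names MASK, maskSeparately and maskCombined are made up, and lookbehind support (ES2018 or later) is assumed.

```javascript
const MASK = "*****";

// The two conditions from the answer, kept apart.
const emailPattern = /\w+#.+?(?="|\s)/g;
const phonePattern = /(?<="phone":")\d+/g;

// The same two conditions joined with an alternation.
const combinedPattern = /\w+#.+?(?="|\s)|(?<="phone":")\d+/g;

// Hypothetical helpers: two passes versus a single pass.
function maskSeparately(line) {
  return line.replace(emailPattern, MASK).replace(phonePattern, MASK);
}

function maskCombined(line) {
  return line.replace(combinedPattern, MASK);
}

const line = `"email":"testEmail#test.com", "phone":"1111111111", "text":"sample text may contain email testEmail#test.com as well"`;
console.log(maskSeparately(line));
console.log(maskCombined(line));
```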
Meta Sequence Word Boundary \b & Alternation |
The targets in the input string are wrapped in either quotes or spaces, both of which are non-word characters. So both "\bemailPattern\b" and space\bemailPattern\bspace are matches. The alternation lets a single pattern do the work of two: search for emailPattern OR phonePattern.
/(\b\w+?#\w+?\.\w+?\b|[0-9]{10})/g;
(Word boundary (a non-word character on the left) \b
One or more word characters, matched lazily \w+?
Literal #
One or more word characters, matched lazily \w+?
Escaped literal .
One or more word characters, matched lazily \w+?
Word boundary (a non-word character on the right) \b
OR |
10 consecutive digits [0-9]{10} )
The global flag g continues the search after the first match.
Demo
let str = `"email":"testEmail#test.com", "phone":"1111111111", "text":"sample text may contain email testEmail#test.com as well"`;
const rgx = /(\b\w+?#\w+?\.\w+?\b|[0-9]{10})/g;
let res = str.replace(rgx, '*****');
console.log(res);

Regex to match individual non-whitespace characters not contained in a word

I am trying to write a regex to match individual non-whitespace characters not contained in a specific word. The closest I've got is the following.
(?!word_to_discard)\b\S+\b
The problem is that the above expression matches the words that are not word_to_discard, but not the individual non-whitespace characters. Any ideas how to do that?
Let's split the problem:
1) You need to match characters not contained in a specific word. The easiest way to do that is to use a negated character class [^ ]. Note that this excludes each individual character of the word (w, o, r, d, _ and so on), not the word as a sequence. Let's also exclude any whitespace character by adding the \s token to the character class.
[^word_to_discard\s]
2) Now, you're saying only individual characters need to be matched, so you can use the boundary token \b to ensure there are no preceding or following alphanumeric characters.
\b[^word_to_discard\s]\b
3) In order to match all individual characters, you'll need to iterate through all matches. How to do that is language/engine specific. For example, in JavaScript you'll need to add the g flag at the end of the regex pattern, so that each subsequent rgx.exec(text) invocation gets the next match in the text:
const text = "w y o r d z";
const rgx = /\b[^word_to_discard\s]\b/g;
rgx.exec(text); // Matches "y"
rgx.exec(text); // Matches "z"
rgx.exec(text); // returns null (no more matches)
The regex \b\S+\b matches one or more non-whitespace characters between two word boundaries, so it would not give you the individual non-whitespace characters.
You might use an alternation: match what you don't want, like word_to_discard, on the left, and capture what you do want in a group on the right. For example, you could use a character class such as [a-c] to match only the lowercase characters a, b or c that are outside of word_to_discard, or use \S to match any non-whitespace character.
word_to_discard|(\S)
Regex demo
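One way to use this "match what you don't want, capture what you do" pattern in JavaScript is to iterate over all matches and keep only those where group 1 participated. The helper name charsOutsideWord and the sample text are assumptions for illustration.

```javascript
// Left alternative swallows the unwanted word without capturing anything;
// right alternative captures a single non-whitespace character in group 1.
const pattern = /word_to_discard|(\S)/g;

// Hypothetical helper: collect the characters that were captured.
function charsOutsideWord(text) {
  const result = [];
  for (const m of text.matchAll(pattern)) {
    // m[1] is undefined when the left alternative (the unwanted word) matched.
    if (m[1] !== undefined) {
      result.push(m[1]);
    }
  }
  return result;
}

console.log(charsOutsideWord("a word_to_discard b c"));
// → ["a", "b", "c"]
```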