Extract a substring from value of key-value pair using regex

I have a string in log and I want to mask values based on regex.
For example:
"email":"testEmail#test.com", "phone":"1111111111", "text":"sample text may contain email testEmail#test.com as well"
The regex should mask
email value - both inside the string after "email" and "text"
phone number
Desired output:
"email":"*****", "phone":"*****", "text":"sample text may contain email ***** as well"
What I have been able to do is to mask email and phone individually but not the email id present inside the string after "text".
Regex developed so far:
(?<=\"(?:email|phone)\"[:])(\")([^\"]*)(\")
https://regex101.com/r/UvDIjI/2/

In the first part you are not matching an email address as such, only a run of characters that are not a double quote. You could match the email address inside the text in a similar way, by also excluding the double quote.
One way to do this could be to get the matches using lookarounds and an alternation. Then replace the matches with *****
Note that you don't have to escape the double quote and the colon could be written without using the character class.
(?<="(?:phone|email)":")[^"]+(?=")|[^#"\s]+#[^#"\s]+
Explanation
(?<="(?:phone|email)":") Assert what is on the left is either "phone":" or "email":"
[^"]+(?=") Match not a double quote and make sure that there is one at the end
| Or
[^#"\s]+#[^#"\s]+ Match an email-like pattern using negated character classes: one or more characters that are not #, a double quote, or whitespace, then a literal #, then one or more such characters again
See the regex demo
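As a quick check, the pattern above can be applied with a global replace in JavaScript. This is only a sketch: it assumes an engine with lookbehind support (ES2018 or later), and the variable names are illustrative.

```javascript
// Sample log line from the question (the address uses # where @ would normally be).
const logLine = '"email":"testEmail#test.com", "phone":"1111111111", "text":"sample text may contain email testEmail#test.com as well"';

// First alternative: the whole value of "phone" or "email".
// Second alternative: an email-like token anywhere else in the line.
const maskPattern = /(?<="(?:phone|email)":")[^"]+(?=")|[^#"\s]+#[^#"\s]+/g;

const maskedLine = logLine.replace(maskPattern, '*****');
console.log(maskedLine);
```

The surrounding quotes are never consumed because they only appear inside the lookarounds, so the replacement leaves the key-value structure intact.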

Your current RegEx is trying to accomplish too much in a single take. You'd be better off splitting the conditions and dealing with them separately. I'll assume that the input will always follow the structure of your example, no edge cases:
Emails:
\w+#.+?(?="|\s) - In the sample emails, every character before the # is a word character, so \w+# is enough to capture the first half of the email. For the second half, I used a wildcard (.) with a lazy quantifier (+?) to stop the capture as soon as possible, combined with a positive lookahead that checks for a double quote or whitespace ((?="|\s)), so that both the email inside the "email" property and the one inside "text" are captured. Lookarounds are zero-length assertions, so what they check is not part of the match.
Phone number:
(?<="phone":")\d+ - Here I just use the prefix "phone":" in a lookbehind and then capture only digits \d+.
Combine both conditions and you have your RegEx: \w+#.+?(?="|\s)|(?<="phone":")\d+.
Regex101: https://regex101.com/r/UvDIjI/3
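The idea of handling the two conditions separately can also be sketched as two replace passes in JavaScript. This assumes the input always follows the structure of the example and that the engine supports lookbehind (ES2018 or later); the variable names are illustrative.

```javascript
// Sample log line from the question (the address uses # where @ would normally be).
const rawLine = '"email":"testEmail#test.com", "phone":"1111111111", "text":"sample text may contain email testEmail#test.com as well"';

// Pass 1: mask anything email-like (word characters, #, then everything
// up to the next double quote or whitespace).
const emailsMasked = rawLine.replace(/\w+#.+?(?="|\s)/g, '*****');

// Pass 2: mask the digits that directly follow "phone":".
const fullyMasked = emailsMasked.replace(/(?<="phone":")\d+/g, '*****');

console.log(fullyMasked);
```

Note that the email pattern needs a double quote or whitespace after the address, so an address at the very end of the input would not be masked by this sketch.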

Meta Sequence Word Boundary \b & Alternation |
In the input string, each target is wrapped in either quotes or spaces, both of which are non-word characters. So both "\bemailPattern\b" and space\bemailPattern\bspace match. The alternation lets one pattern do the work of two: search for emailPattern OR phonePattern.
/(\b\w+?#\w+?\.\w+?\b|[0-9]{10})/g;
Open capturing group (
Word boundary (a non-word on the left) \b
One or more word characters (lazy) \w+?
Literal #
One or more word characters (lazy) \w+?
Escaped literal . \.
One or more word characters (lazy) \w+?
Word boundary (a non-word on the right) \b
OR |
10 consecutive digits [0-9]{10}
Close capturing group )
The global flag g continues the search after the first match.
Demo
let str = `"email":"testEmail#test.com", "phone":"1111111111", "text":"sample text may contain email testEmail#test.com as well"`;
const rgx = /(\b\w+?#\w+?\.\w+?\b|[0-9]{10})/g;
let res = str.replace(rgx, '*****');
console.log(res);

Related

Regex: scrub punctuation except if inside a word?

I'm not great at regex but I have this for removing punctuation from a string.
let text = 'a user provided string'
let pattern = /(-?\d+(?:[.,]\d+)*)|[-.,()&$#![\]{}"']+/g;
text.replace(pattern, "$1");
I am looking for a way to modify this so that it keeps punctuation if inside a word e.g.
some-hypenated-words
a_snake_case
or.even.a.dot.word
should all keep the punctuation. How would I modify it for that?

One option could be changing the \d to \w to extend the match to word characters and add a hyphen to the character class in the capturing group.
In the replacement use group 1.
(\w+(?:[.,-]\w+)*)|[-.,()&$#![\]{}"']+
Regex demo
If you want to match multiple hyphens, commas or dots you could repeat the character class [.,-]+
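A minimal sketch of how this behaves, assuming the pattern is used with a global replace and the sample sentence below (which is made up for illustration):

```javascript
// Group 1 keeps words, including words joined by a single ".", "," or "-".
// The second alternative matches loose punctuation, which is dropped
// because "$1" is empty whenever group 1 did not take part in the match.
const scrubPattern = /(\w+(?:[.,-]\w+)*)|[-.,()&$#![\]{}"']+/g;

const userText = 'Hello, world! some-hyphenated-words (a_snake_case) or.even.a.dot.word.';
const scrubbed = userText.replace(scrubPattern, '$1');

console.log(scrubbed);
```

The underscore in a_snake_case survives because \w already includes it, while the trailing dot is removed because no word character follows it.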

REGEX - how replace word before colon with word itself wrapped into double quotes

I have a "json" like this:
{
example:"hi there",
dateTime: "01/01/1970 bla:bla"
}
I need to wrap every key before a colon in double quotes.
Following the answer to "Regex in ruby to enclose word before : with double quotes", I tried the code below, but it is not fully correct, because it also changes the word before the colon inside the dateTime value.
So I need to add another check to this code
(\w+)(?=:)
one that verifies that the word comes right after a comma.
I would like to turn the "json" into real JSON like this:
{
"example":"hi there",
"dateTime": "01/01/1970 bla:bla"
}
and not like this, which is what I get now:
{
"example":"hi there",
"dateTime": "01/01/1970 "bla":bla"
}

Here is a solution that selects all attribute names so they can be replaced (check the demo):
For letters in either case:
/(?:[a-z]+(?=:[" ]))/ig
For alphanumeric characters and underscore:
/(?:[\w]+(?=:[" ]))/g
Demo:
https://www.regextester.com/?fam=107535
(?:...) creates a non-capturing group, so the engine does not have to store the group's text for back-references, which can save a little work. In these patterns the group wraps the whole expression, so it could also be omitted.

If you can rely on the keys to be replaced sitting at the start of a line, as in your example, you can use a regex like ^([ \t]*)([a-zA-Z0-9_]+):, which matches a sequence of alphanumeric characters and underscores before a colon, preceded by zero or more spaces or tabs. It captures the indentation as the first group and the key as the second group.
You can use an online regex validator to see what is matched/captured and what is not for the input sample you showed.
Then you can use the captured groups to wrap the key in double quotes.
Simple runnable example in JS:
var input = `{
example:"hi there",
dateTime: "01/01/1970 bla:bla"
}`
// Group 1: optional indentation, group 2: the key before the colon
var regexp = /^([ \t]*)([a-zA-Z0-9_]+):/mg
var replaced = input.replace(regexp, '$1"$2":')
console.log(replaced)
EXPLANATION: the m flag makes ^ match at the start of every line, the g flag replaces all matches
I can't show you a Ruby example, but the provided regexp should help you!

Your pattern (\w+)(?=:) matches 1+ word characters that have a colon on the right side. That will also match bla:
What you might do is match 1+ word characters with \w+ and extend the positive lookahead to assert that what is on the right is a colon, optional whitespace, an opening double quote, the text in between, and a closing double quote.
Note that the matching is independent of any data structure and relies only on what is on the right side of the match.
Then in the replacement you can wrap the match in quotes.
\w+(?=:\s*"[^"]+")
That matches:
\w+ Match 1+ word characters
(?= Positive lookahead to assert what is on the right matches
: Match a colon
\s* Match 0+ times a whitespace character
"[^"]+" Match from the opening to the closing double quote, using a negated character class for what is in between
) Close lookahead
Regex demo | Example using Ruby
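For readers without Ruby at hand, here is a hedged JavaScript sketch of the same replacement. It assumes the input from the question; the variable names are illustrative, and $& in the replacement string stands for the whole match.

```javascript
// The "json" from the question, with unquoted keys.
const looseJson = `{
example:"hi there",
dateTime: "01/01/1970 bla:bla"
}`;

// A word counts as a key only if a colon, optional whitespace and a
// complete double-quoted value follow it.
const keyPattern = /\w+(?=:\s*"[^"]+")/g;

// Wrap every matched key in double quotes.
const quotedJson = looseJson.replace(keyPattern, '"$&"');
console.log(quotedJson);

// The result should now be parseable as JSON.
const parsedJson = JSON.parse(quotedJson);
console.log(parsedJson.dateTime);
```

The bla inside the dateTime value is left alone because the character after its colon is not a double quote.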

Regex to match individual non-whitespace characters not contained in a word

I am trying to write a regex to match individual non-whitespace characters not contained in a specific word. The closest I've got is the following.
(?!word_to_discard)\b\S+\b
The problem is that the above expression matches the words that are not word_to_discard, but not the individual non-whitespace characters. Any ideas how to do that?

Let's split the problem:
1) You need to match characters not contained in a specific word. The easiest way to do that is to use a character class [ ] with negation ^. Let's also exclude whitespace by adding the \s token to the character class.
[^word_to_discard\s]
2) Now, you're saying only individual characters need to be matched, so you can use the word boundary token \b to ensure there are no adjacent word characters.
\b[^word_to_discard\s]\b
3) In order to get all individual characters, you'll need to iterate through all matches. How to do that is language/engine specific. For example, in JavaScript you need to add the g flag at the end of the regex literal, so that each subsequent rgx.exec(text) invocation returns the next match in the text:
const text = "w y o r d z";
const rgx = /\b[^word_to_discard\s]\b/g;
rgx.exec(text); // Matches "y"
rgx.exec(text); // Matches "z"
rgx.exec(text); // returns null (no more matches)

The regex \b\S+\b matches one or more non-whitespace characters between two word boundaries, so it will not give you the individual non-whitespace characters.
You might use an alternation to match what you don't want, like word_to_discard, and then capture in a group what you do want to match. You could for example use a character class to match the lowercase characters a, b or c [a-c] not contained in word_to_discard, or use \S to match any non-whitespace character.
word_to_discard|(\S)
Regex demo
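A small sketch of how the captured characters could be collected in JavaScript, assuming String.prototype.matchAll is available (ES2020) and using a made-up sample string:

```javascript
const sampleText = 'a word_to_discard b c';

// Either consume the unwanted word as a whole, or capture one
// non-whitespace character in group 1.
const pickPattern = /word_to_discard|(\S)/g;

const keptChars = [];
for (const match of sampleText.matchAll(pickPattern)) {
  // Group 1 is undefined when the first alternative matched.
  if (match[1] !== undefined) {
    keptChars.push(match[1]);
  }
}

console.log(keptChars);
```

Matches of the unwanted word are simply skipped, so only the characters outside it end up in the result.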

Regex: Non-Capturing Group in Non-Numeric?

I am trying to test that a timestamp (let's use HH:MM:ss as an example) does not have any numeric characters surrounding it, or to say that I would like to check for the presence of a non-numeric character before and after my timestamp. The non-numeric character does not need to exist, but no numeric character should exist directly before nor directly after. I do not want to capture this non-numeric character. How can I do this? Should I use "look-arounds" or non-capturing groups?
Fill in the blank + (2[0-3]:[0-5][0-9]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9]:[0-5][0-9]) + Fill in the blank
Thanks!

I would like to check for the presence of a non-numeric character before and after my timestamp. The non-numeric character does not need to exist, but no numeric character should exist directly before nor directly after. I do not want to capture this non-numeric character.
The best way to match such a timestamp is using lookarounds:
(?<!\d)(2[0-3]:[0-5][0-9]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9]:[0-5][0-9])(?!\d)
The (?<!\d) fails a match if there is a digit before the timestamp and (?!\d) fails a match if there is a digit after the timestamp.
If you use
\D*(2[0-3]:[0-5][0-9]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9]:[0-5][0-9])\D*
(note that (?:...) non-capturing groups would not help here: the patterns inside still match and consume characters), you won't get overlapping matches (for example, when one timestamp comes right after another). However, I believe this is a rare scenario, so you can still use your regex and grab the value from capture group 1.
Also, see my answer on How Negative Lookahead Works. A negative lookbehind works similarly, but with the text before the matching (consuming) pattern.
A JS solution is to use capturing groups:
var re = /(?:^|\D)(2[0-3]:[0-5][0-9]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9]:[0-5][0-9])(?=\D|$)/g;
var text = "Some23:56:43text here Some13:26:45text there and here is a date 10/30/89T11:19:00am";
var m; // declare the match variable instead of creating an implicit global
while ((m = re.exec(text)) !== null) {
document.body.innerHTML += m[1] + "<br/>";
}
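In JavaScript engines that support lookbehind (ES2018 or later), the lookaround pattern from the start of this answer can be used directly. The sketch below assumes such an engine; the sample text is made up to show one digit directly before and one directly after a timestamp.

```javascript
// No digit may appear directly before (?<!\d) or directly after (?!\d) the timestamp.
const stampPattern = /(?<!\d)(2[0-3]:[0-5][0-9]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9]:[0-5][0-9])(?!\d)/g;

const stampText = 'at 23:56:43 ok, 123:56:43 bad, 23:56:431 bad, x09:05:00y ok';

// With the g flag, match() returns the full matches (or null if there are none).
const stamps = stampText.match(stampPattern) || [];

console.log(stamps);
```

Because lookarounds consume nothing, the full match is already the bare timestamp and no capture group has to be read.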

The regex class for "anything that is not numeric" is:
\D
This is equivalent to:
[^\d]
So you would use:
\D*(2[0-3]:[0-5][0-9]:[0-5][0-9]|[0-1][0-9]:[0-5][0-9]:[0-5][0-9])\D*
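You don't need to surround it with a non-capturing group (?:). Note, though, that \D* can also match zero characters, so on its own this pattern does not reject a digit directly before or after the timestamp.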

Finding a match one after another

How do I find multiple matches that are (and can only be) separated from each other by whitespaces?
I have this regular expression:
/([0-9]+)\s*([A-Za-z]+)/
And I want each of the matches (not groups) to be surrounded by whitespace or another match. If the condition is not fulfilled, the match should not be returned.
This is valid: 1min 2hours 3days
This is not: 1min, 2hours 3days (1min and 2hours should not be returned)
Is there a simpler way of finding a continuous sequence of matches (in Java preferably) than repeating the whole regex before and after the main one, checking if there is a whitespace, start/end of the string or another match?

I believe this pattern will meet your requirements (provided that only a single space character separates your alphanumeric tokens):
(?<=^|[\w\d]\s)([\w\d]+)(?=\s|$)
(1) ([\w\d]+) A capture group that contains an alphanumeric string.
(2) (?<=^|[\w\d]\s) A look-behind assertion: to the left of the capture group must be a) the beginning of the line or b) an alphanumeric character followed by a single whitespace character.
(3) (?=\s|$) A look-ahead assertion: to the right of the capture group must be a) a whitespace character or b) the end of the line.
See regex101.com demo.
Here is some sample data that I included in the demo. Every token is captured except 1min and 2hours on the second line:
1min 2hours 3days
1min, 2hours 3days
42min 4hours 2days
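The pattern can be tried against the three sample lines with a short JavaScript sketch. It assumes an engine with lookbehind support (ES2018 or later) and checks each line on its own; the variable names are illustrative.

```javascript
// Look-behind: start of the line, or a word character followed by one whitespace.
// Look-ahead: whitespace or the end of the line.
const tokenPattern = /(?<=^|[\w\d]\s)([\w\d]+)(?=\s|$)/g;

const sampleLines = [
  '1min 2hours 3days',
  '1min, 2hours 3days',
  '42min 4hours 2days',
];

// For each line, list the tokens that satisfy both assertions.
const capturedPerLine = sampleLines.map((line) => line.match(tokenPattern) || []);

console.log(capturedPerLine);
```

On the second line only 3days is returned: 1min is followed by a comma, and 2hours is preceded by a comma and a space rather than by a word character and a space.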

String text = "1min 2hours 3days";
boolean match = text.matches("(?:\\s*[0-9]+\\s*[A-Za-z]+\\s*)*");
This basically looks for the pattern from your example. The * after the group then looks for zero or more occurrences of that pattern in the text, and ?: means the group is not captured.
This will also return true for an empty string. If you don't want the empty string to match, change the final * to +.
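The same whole-string check can be sketched in JavaScript. Java's String.matches always tests the entire string, so this version adds explicit ^ and $ anchors; it uses the + variant so that an empty string is rejected. The function name is illustrative.

```javascript
// The whole string must consist of one or more "number + unit" items,
// separated only by optional whitespace.
const sequencePattern = /^(?:\s*[0-9]+\s*[A-Za-z]+\s*)+$/;

// Hypothetical helper: true only if the whole text is a valid sequence.
const isValidSequence = (text) => sequencePattern.test(text);

console.log(isValidSequence('1min 2hours 3days'));
console.log(isValidSequence('1min, 2hours 3days'));
console.log(isValidSequence(''));
```

This only answers whether the whole input is a valid sequence; the individual items would still have to be extracted in a second step.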

I've managed to solve my problem by splitting the string using string.split("\\s+") and then matching the results to the pattern /([0-9]+)\s*([A-Za-z]+)/.

There is a problem here: the '*' makes the whitespace optional, so \s* also matches when there is no whitespace at all
/([0-9]+)\s*([A-Za-z]+)/
Change to
/(\d+)\s+(\w+)/g
With the g flag this returns an array of all matches, each one digits followed by whitespace and word characters. There is no need to always write '[0-9]': '\d' matches any digit from 0 to 9. Note that '\w' is broader than '[A-Za-z]', since it also matches digits and the underscore. More can be found in a regular expression cheat sheet.
