Regex to extract 2 lists of connected words

I want to extract 2 lists of words that are connected by the sign =. The regex code works for separate lists but not in combination.
Example string: bla word1="word2" blabla abc="xyz" bla bla
One output shall contain the words directly left of =, i.e. word1, abc and the other output shall contain the words directly right of =, i.e. word2, xyz without quotes.
\w+(?==\"(?:(?!\").)*\")
extracts the words left of =, i.e. word1,abc
=\"(?:(?!\").)*\" extracts the words right of = including quotes and =, i.e. ="word2",="xyz"
How can I combine these 2 queries to a single regex-expression that outputs 2 groups? Quotes and equal signs shall not be outputted.

You can use
([^\s=]+)="([^"]*)"
See the regex demo. Details:
([^\s=]+) - Group 1: one or more occurrences of a char other than whitespace and = char
=" - a =" substring
([^"]*) - Group 2: zero or more chars other than " char
" - a " char.
Note: \w+ only matches one or more letters, digits and underscores, and won't match keys that contain, say, hyphens. The (?:(?!\").)* tempered greedy token is inefficient, and it does not match line break chars. As the negative lookahead contains only a single-char pattern (\"), it is more efficient to write it as a negated character class, [^"]*. Unlike the tempered token, the negated character class also matches line break chars. If you do not want that behavior, add \r\n to the negated character class: [^"\r\n]*.
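As a minimal Python sketch of the pattern above (the variable names are illustrative and not part of the original answer), re.findall returns one (key, value) tuple per match, and the tuples can then be split into the two requested lists:

```python
import re

text = 'bla word1="word2" blabla abc="xyz" bla bla'

# Group 1 is the key left of '='; group 2 is the value between the quotes.
pairs = re.findall(r'([^\s=]+)="([^"]*)"', text)

keys = [key for key, _ in pairs]
values = [value for _, value in pairs]

print(keys)    # ['word1', 'abc']
print(values)  # ['word2', 'xyz']
```

Because the quotes and the equal sign sit outside the capturing groups, they never appear in either list.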

If you are looking for lhs and rhs from lhs="rhs", this should work (sorry, this is what I understood from your question). Note that re.search only returns the first match:
import re
test_str='abc="def" ghi'
ans=re.search(r'(\w+)="(\w+)"',test_str)
print(ans.group(1))
print(ans.group(2))
my_list=list(ans.groups())
print(my_list)

This should do what you want:
(?: (\w*)=)(?:\"(\w*)\")
This is for a python regex.
You can see it working here.

Related

How can i add conditional statements in Regex

I have 2 strings
1) abc-def
2) abc-
and I have written the regex group (?<Myid>[a-zA-Z0-9-]+), which works fine for the first string.
However, in the 2nd string I don't need the "-"; only abc should be selected. How can I add a condition here?

I would phrase your regex as:
(?<Myid>[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*)
This pattern says to match:
[a-zA-Z0-9]+ match one or more alphanumeric characters
(?:-[a-zA-Z0-9]+)* followed by dash and more alphanumeric characters,
zero or more times
Demo
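A quick way to check this pattern is the Python sketch below. It assumes Python's re module, which spells named groups as (?P<Myid>...) rather than the (?<Myid>...) form used in the question; the variable names are illustrative only.

```python
import re

# Same pattern as above, with the named group written in Python's (?P<name>...) syntax.
pattern = re.compile(r'(?P<Myid>[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*)')

results = {}
for s in ('abc-def', 'abc-'):
    m = pattern.search(s)
    results[s] = m.group('Myid') if m else None

print(results)  # {'abc-def': 'abc-def', 'abc-': 'abc'}
```

The trailing dash in abc- is left out because a dash is only matched when at least one alphanumeric character follows it.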

Just appending the negation rule at the end will suffice here I guess.
i.e. (?<Myid>[a-zA-Z0-9-]+[^-])
Demo: https://regex101.com/r/PetK6Q/1

Regex for masking data

I am trying to implement regex for a JSON Response on sensitive data.
JSON response comes with AccountNumber and AccountName.
Masking details are as below.
accountNumber Before: 7835673653678365
accountNumber Masked: 783567365367****
accountName Before : chris hemsworth
accountName Masked : chri* *********
I am able to match above if I just do [0-9]{12} and (?![0-9]{12}), when I replace this, it is replacing only with *, but my regex is not producing correct output.
How can I produce output as above from regex?

If all you want is to mask every character except the first N, I don't think you really need a complicated regex. To keep the first N characters and replace every character after them with *, you can write a generic regex like this,
(?<=.{N}).
where N can be any number like 1,2,3 etc. and replace the match with *
The way this regex works is, it selects every character which has at least N characters before it and hence once it selects a character, all following characters also get selected.
For e.g in your AccountNumber case, N = 12, hence your regex becomes,
(?<=.{12}).
Regex Demo for AccountNumber masking
Java code,
String s = "7835673653678365";
System.out.println(s.replaceAll("(?<=.{12}).", "*"));
Prints,
783567365367****
And for AccountName case, N = 4, hence your regex becomes,
(?<=.{4}).
Regex Demo for AccountName masking
Java code,
String s = "chris hemsworth";
System.out.println(s.replaceAll("(?<=.{4}).", "*"));
Prints,
chri***********
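The same idea can be sketched in Python, whose re module accepts this lookbehind because .{N} has a fixed width. The helper name mask_after and its optional pattern argument are assumptions made for this illustration, not part of the original answer. Passing \S instead of . leaves the space in the name untouched, which reproduces the output asked for in the question.

```python
import re

def mask_after(s, n, pattern='.'):
    """Replace every char matched by `pattern` that has at least n chars before it."""
    return re.sub(r'(?<=.{%d})%s' % (n, pattern), '*', s)

print(mask_after('7835673653678365', 12))       # 783567365367****
print(mask_after('chris hemsworth', 4))         # chri***********
print(mask_after('chris hemsworth', 4, r'\S'))  # chri* *********
```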

If you match [0-9]{12} and replace that directly with a single asterisk you are left with accountNumber Before: *8365
There is no programming language listed, but one option to replace the digits at the end is to use a positive lookbehind to assert what is on the left are 12 digits followed by a positive lookahead to assert what is on the right are 0+ digits followed by the end of the string.
Then in the replacement use *
If the JSON value is exactly chris hemsworth or 7835673653678365, you can omit the positive lookaheads (?=\d*$) and (?=[\w ]*$), which assert the end of the string in the following 2 expressions.
Use the versions with the positive lookahead if the data to match is at the end of the string and the string contains more data so you don't replace more matches than you would expect.
(?<=[0-9]{12})(?=\d*$)\d
In Java:
(?<=[0-9]{12})(?=\\d*$)\\d
(?<=[0-9]{12}) Positive lookbehind, assert what is on the left are 12 digits
(?=\d*$) Positive lookahead, assert what is on the right are 0+ digits and assert the end of the string
\d Match a single digit
Regex demo
Result:
783567365367****
For the account name you might do that with 4 word characters \w, but this will also replace the whitespace with an asterisk because I believe you cannot skip matching that space in one regex.
(?<=[\w ]{4})(?=[\w ]*$)[\w ]
In Java
(?<=[\\w ]{4})(?=[\\w ]*$)[\\w ]
Regex demo
Result
chri***********
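To see what the end-of-string lookahead adds, here is a small Python sketch. The sample lines are made up for illustration; only the regex comes from the answer above.

```python
import re

pattern = re.compile(r'(?<=[0-9]{12})(?=\d*$)\d')

# The number sits at the end of the string, so its digits after the first 12 are masked.
at_end = pattern.sub('*', 'accountNumber: 7835673653678365')

# Here other text follows the number, so the lookahead fails and nothing is replaced.
not_at_end = pattern.sub('*', '7835673653678365 (primary)')

print(at_end)      # accountNumber: 783567365367****
print(not_at_end)  # 7835673653678365 (primary)
```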

Remove characters from regex query

I have trouble understanding why my regex query takes one extra character besides the symbols I have told regex to include into the query, so this is my regex:
([\-:, ]{1,})[^0-9]
This is my test text:
Test- Product-: 1 --- 3 hour ,--kayak:--rental
It always includes the first character of each starting word, like P on Product or h on hour, how can I prevent regex from including those first characters?
I am trying to get all dashes, double points, comma and spaces excluding numbers or any characters.

The [^0-9] part of your regex matches any char but a digit, so you should remove it from your pattern.
There is no need to wrap the character class with a capturing group, and {1,} is equal to +, so the whole regex can be shortened to
[-:, ]+
Note that - in the initial and end positions inside a character class does not have to be escaped.
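A short Python check of the shortened pattern against the test text from the question (the variable names are illustrative):

```python
import re

text = 'Test- Product-: 1 --- 3 hour ,--kayak:--rental'

# Each match is a run of dashes, colons, commas and spaces, with no trailing letter.
separators = re.findall(r'[-:, ]+', text)

print(separators)  # ['- ', '-: ', ' --- ', ' ', ' ,--', ':--']
```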

Regex matching a text after a specific string until another specific string

If I have the following example:
X-FileName: pallen (Non-Privileged).pst
Here is our forecast
Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>
How can I select the text
Here is our forecast
after "X-FileName .... \n" until "Message-ID", excluded?
I read about lookahead and behind and tried this but didn't work:
(?<=X-FileName:(\n)+$).+(?=Message-ID:)

This should do it:
(?:X-FileName:[^\n]+)\n+([^\n]+)\n+(?:Message-ID:) (group #1 is the match)
Demo
Explanation:
(?:X-FileName:[^\n]+) matches X-FileName: followed by one or more characters that aren't newlines, without capturing it (?:).
\n+ matches one or more consecutive newlines.
([^\n]+) matches and captures one or more consecutive characters that aren't newlines.
\n+, again, matches one or more consecutive newlines.
(?:Message-ID:) matches Message-ID: without capturing it (?:).
Edit: as @WiktorStribiżew mentioned though, splitting your text into lines may be an easier/cleaner way to retrieve what you want.
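For reference, a minimal Python sketch of this pattern. The blank lines in the sample string are an assumption about the real input; since the pattern uses \n+, it works with single newlines as well.

```python
import re

# Sample input; the blank lines between the parts are assumed.
s = (
    'X-FileName: pallen (Non-Privileged).pst\n'
    '\n'
    'Here is our forecast\n'
    '\n'
    'Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>'
)

m = re.search(r'(?:X-FileName:[^\n]+)\n+([^\n]+)\n+(?:Message-ID:)', s)
body = m.group(1) if m else None

print(body)  # Here is our forecast
```

Note that ([^\n]+) captures a single line only, so a multi-line body between the two headers would not match with this pattern.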

There are two approaches here, and they depend on the broader context. If your expected substring is the second paragraph, just split with \n\n (or \r\n\r\n) and get the second item from the resulting list.
If it is a text inside some larger text, use a regex.
See a Python demo:
import re

# Sample text; the blank lines between paragraphs are assumed, and split('\n\n') relies on them
s='''X-FileName: pallen (Non-Privileged).pst

Here is our forecast

Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>'''
# Non-regex way for the string in the exact same format
print(s.split('\n\n')[1])
# Regex way to get some substring in a known context
m = re.search(r'X-FileName:.*[\r\n]+(.+)', s)
if m:
    print(m.group(1))
The regex means:
X-FileName: - a literal substring
.* - any 0+ chars other than line break chars
[\r\n]+ - 1 or more CR or LF chars
(.+) - Group 1: one or more chars other than line break chars, as many as possible.
See the regex demo.

regex to match entire words containing only certain characters

I want to match entire words (or strings really) that containing only defined characters.
For example if the letters are d, o, g:
dog = match
god = match
ogd = match
dogs = no match (because the string also has an "s" which is not defined)
gods = no match
doog = match
gd = match
In this sentence:
dog god ogd, dogs o
...I would expect to match on dog, god, and o (not ogd, because of the comma or dogs due to the s)

This should work for you
\b[dog]+\b(?![,])
Explanation
r"""
\b # Assert position at a word boundary
[dog] # Match a single character present in the list “dog”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
[,] # Match the character “,”
)
"""
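The commented layout above maps directly onto Python's re.VERBOSE flag. The sketch below assumes Python and uses the sentence from the question as test input:

```python
import re

pattern = re.compile(r"""
    \b        # word boundary
    [dog]+    # one or more of the characters d, o, g
    \b        # word boundary
    (?![,])   # not immediately followed by a comma
""", re.VERBOSE)

matches = pattern.findall('dog god ogd, dogs o')

print(matches)  # ['dog', 'god', 'o']
```

Only the comma is excluded by the lookahead, so a word followed by other punctuation, such as dog. with a trailing period, would still match.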

The following regex represents one or more occurrences of the three characters you're looking for:
[dog]+
Explanation:
The square brackets mean: "any of the enclosed characters".
The plus sign means: "one or more occurrences of the previous expression"
This would be the exact same thing:
[ogd]+

Which regex flavor/tool are you using? (e.g. JavaScript, .NET, Notepad++, etc.) If it's one that supports lookahead and lookbehind, you can do this:
(?<!\S)[dog]+(?!\S)
This way, you'll only get matches that are either at the beginning of the string or preceded by whitespace, and either at the end of the string or followed by whitespace. If you can't use lookbehind (for example, in JavaScript engines that predate the lookbehind support added in ES2018) you can spell out the leading condition:
(?:^|\s)([dog]+)(?!\S)
In this case you would retrieve the matched word from group #1. But don't take the next step and try to replace the lookahead with (?:$|\s). If you did that, the first hit ("dog") would consume the trailing space, and the regex wouldn't be able to use it to match the next word ("god").
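The difference between the three variants can be checked with a short Python sketch (Python supports both lookarounds, so it is used here only for illustration):

```python
import re

text = 'dog god ogd, dogs o'

# Lookbehind + lookahead: the surrounding whitespace is never consumed.
lookaround = re.findall(r'(?<!\S)[dog]+(?!\S)', text)

# No lookbehind: the leading whitespace is consumed, the word is in group 1.
leading_group = re.findall(r'(?:^|\s)([dog]+)(?!\S)', text)

# Replacing the lookahead too: the space after "dog" is consumed,
# so it is no longer available to start the match for "god".
both_consumed = re.findall(r'(?:^|\s)([dog]+)(?:$|\s)', text)

print(lookaround)     # ['dog', 'god', 'o']
print(leading_group)  # ['dog', 'god', 'o']
print(both_consumed)  # ['dog', 'o']
```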

Depending on the language, this should do what you need it to do. It will only match what you said above;
this regex:
[dog]+(?![\w,])
in a string of ..
dog god ogd, dogs o
will only match..
dog, god, and o
Example in javascript
Example in php
Anything between two square brackets ([]) is a character class: it matches any one of the characters between the brackets. You can also use ranges such as [0-9] or [a-z], but the class on its own still matches only 1 character. The + and * are quantifiers: + matches 1 or more repetitions, while * matches zero or more. You can specify an explicit repetition count with curly brackets ({}), putting one or two numbers in between: {2} matches exactly 2 repetitions, while {1,3} matches between 1 and 3.
Anything between parentheses () is captured as a group, which you can use later, for example to return the matched value or to reuse it in a replacement string. The (?!...) construct is a negative lookahead: the match succeeds only if the pattern inside it does not match at that position. Here it ensures that a run of the allowed characters is not matched when it is followed by a word character or a comma.
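The quantifier forms and the lookahead described above can be tried out with this small Python sketch (Python is an assumption here; the answer itself linked JavaScript and PHP examples):

```python
import re

# {2} means exactly two repetitions; {1,3} means one, two or three.
exactly_two = [bool(re.fullmatch(r'[dog]{2}', w)) for w in ('do', 'dog')]
one_to_three = [bool(re.fullmatch(r'[dog]{1,3}', w)) for w in ('d', 'dog', 'doog')]

# The pattern from this answer: a run of d/o/g not followed by a word char or a comma.
matches = re.findall(r'[dog]+(?![\w,])', 'dog god ogd, dogs o')

print(exactly_two)   # [True, False]
print(one_to_three)  # [True, True, False]
print(matches)       # ['dog', 'god', 'o']
```

Since this pattern has no leading boundary, it would also match the tail of a longer word, for example the dog in hotdog.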

We Keep Coding
