what is the regexp to accept - regex

I have a last name in a JSON request and I need to build a schema for the JSON.
I have the following schema:
"lastName": {
"type": "string",
"required": true,
"pattern":"^[a-zA-Z0-9'. ]{1,40}$"
}
But we got a defect report saying that last names can contain the following:
Last names: apostrophe, hyphen, period (O’Rourke; Smith-Jones; St. Pierre).
I have handled the apostrophe, period and space, but I don't know how to add the hyphen.
Please let me know how to fix this.

The hyphen can be put at the end of the list, which makes it clear that it's not a character range:
[.....-]
Note: I wouldn't accept special characters at the beginning of the name.

Escape it with a backslash (it can then be placed anywhere in the character class):
^[\-a-zA-Z0-9'. ]
or place it at the end (where it cannot be mistakenly parsed as a range separator):
^[a-zA-Z0-9'. -]
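A quick way to check the hyphen-at-the-end variant is a small Python sketch. It assumes the original {1,40} length limit is kept and uses a straight apostrophe in the sample names; JSON Schema validators use their own regex engine, so this only illustrates how the character class behaves.

```python
import re

# The hyphen is placed last in the class, so it is a literal character
# and cannot be read as a range separator.
LAST_NAME = re.compile(r"^[a-zA-Z0-9'. -]{1,40}$")

for name in ["O'Rourke", "Smith-Jones", "St. Pierre", "Smith_Jones"]:
    print(name, bool(LAST_NAME.match(name)))
```

Only the last sample is rejected, because the underscore is not in the class.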

Related

Regex multiple exclusion and match for different patterns

I want to exclude some specific words and, if those words don't match, match an MD5 hash, for example.
Here is a small log as an example:
"value": "ef51be4506d7d287abc8c26ea6c495f6", "u_jira_status": "", "u_quarter_closed": "", "file_hash": "ef51be4506d7d287abc8c26ea6c495f6", "escalation": "0", "upon_approval": "proceed", "correlation_id": "", "cyber_kill_change": "ef51be4506d7d287abc8c26ea6c495f6", "sys_id": "ef51be4506d7d287abc8c26ea6c495f6", "u_business_service": "", "destination_ip": "ef51be4506d7d287abc8c26ea6c495f6", u'test': u'9db92f08db4f951423c87d84f39619ef'
As you can see, there are multiple values that should match, excluding only "value" and "id".
Here is the regex I am using so far:
([^value|^id](\":\s\"|':\su')\b)[a-fA-F\d]{32}\b
There are two cases that can follow the exclusion:
"something": "hash"
'something': u'hash'
With the previous regex, "value" and "id" are excluded as expected, but "cyber_kill_change" does not match for some reason, while "file_hash", "destination_ip" and 'test' match as expected.
The matches are
h": "ef51be4506d7d287abc8c26ea6c495f6
p": "ef51be4506d7d287abc8c26ea6c495f6
t': u'9db92f08db4f951423c87d84f39619ef
instead of just the MD5, for example:
9db92f08db4f951423c87d84f39619ef
Can someone explain to me how to match correctly, please?
Note
For the exclusions I cannot use something similar to this
(?<!value|id)
The < and ! are not accepted by the software where I want to add the regex.
If it helps, I am trying to use this regex in XSOAR; here is some documentation of the permitted syntax.
"cyber_kill_change" ends with the character 'e', which also appears in "value", which is why it was excluded too. The problem comes from the brackets [], which form a character class: [^value|^id] matches any single character that is not one of the listed characters (the | and the second ^ are literal characters inside a class). It does not exclude the words "value" or "id". Inside a class, the listed characters are alternatives for a single position:
[value|id]=(v|a|l|u|e|i|d)
To match the exact words, use (value|id). You may try this expression:
((?<!(value|id))(\":\s\"|':\su')\b)[a-fA-F\d]{32}\b
I used the CyrilEx Regex Tester to check the expression.
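The difference between the character class and a word-level exclusion can be sketched in Python. This is only an illustration with Python's re module, not XSOAR: Python accepts only fixed-width lookbehind, so (?<!value|id) is written here as two separate lookbehinds, and the shortened log and the capture group around the hash are assumptions made for the example.

```python
import re

# [^value|^id] is ONE negated class: it rejects any single character from
# the set {v, a, l, u, e, |, ^, i, d}, whatever word surrounds it.
char_class = re.compile(r"[^value|^id]")
print(char_class.fullmatch("e"))  # None, because 'e' is in the set
print(char_class.fullmatch("h"))  # a match object, 'h' is not in the set

# Shortened sample of the log from the question (assumed for illustration).
log = (
    '"value": "ef51be4506d7d287abc8c26ea6c495f6", '
    '"file_hash": "ef51be4506d7d287abc8c26ea6c495f6", '
    '"cyber_kill_change": "ef51be4506d7d287abc8c26ea6c495f6", '
    '"sys_id": "ef51be4506d7d287abc8c26ea6c495f6", '
    "u'test': u'9db92f08db4f951423c87d84f39619ef'"
)

# Two fixed-width lookbehinds reject keys ending in "value" or "id";
# the capture group keeps only the 32 hex digits.
hashes = re.compile(r"""(?<!value)(?<!id)(?:": "|': u')\b([a-fA-F\d]{32})\b""")
found = hashes.findall(log)
print(found)
```

With this sample, the hashes under "file_hash", "cyber_kill_change" and 'test' are returned, and those under "value" and "sys_id" are skipped.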

Notepad++: how to remove all text before and after a string

I want to keep just the country code from each line of this text. What is the regular expression for this?
{"name": "Canada", "countryCd": "CA", "code": 393},
{"name": "Syria", "countryCd": "SR", "code": 3535},
{"name": "Germany", "countryCd": "GR", "code": 3213}
The expected result would be
CA
SR
GR
Kind of a hack (see @Toto's comment), but it works for your requirements:
.*"([A-Z]{2})".*
This needs to be replaced by $1; see a demo on regex101.com (side note: isn't Germany usually GER?)
In Notepad++ I would do a find and replace like:
.*?"countryCd": "([^"]+)".*
And replace that with:
\1
That way, if for some reason your country code is not just 2 letters, it will still be captured correctly. The [^"] is a negated character class, meaning any character that isn't ", and the + requires at least 1 character. I find that negated character classes express the actual intent more precisely.
In this case you want to capture whatever is inside the quotes after countryCd, and this will do the trick.
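The same find-and-replace can be tried outside Notepad++. The sketch below uses Python's re.sub as a stand-in for the editor's Replace dialog, which is an assumption for illustration; the pattern and the \1 replacement are the ones given above.

```python
import re

lines = [
    '{"name": "Canada", "countryCd": "CA", "code": 393},',
    '{"name": "Syria", "countryCd": "SR", "code": 3535},',
    '{"name": "Germany", "countryCd": "GR", "code": 3213}',
]

# The whole line is matched and replaced by the captured country code.
pattern = re.compile(r'.*?"countryCd": "([^"]+)".*')
codes = [pattern.sub(r"\1", line) for line in lines]
print(codes)
```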

Match words which have given regex around

I need to get all the text segments that have two or more spaces (\s{2,}) around them.
Given the following text:
IP Address  Name  Location  Type
10.1.10.5  USLAXBOWC01RB  Santa Monica, CA  local
I need to extract:
Line1: "IP Address", "Name", "Location", "Type"
Line2: "10.1.10.5", "USLAXBOWC01RB", "Santa Monica, CA", "local"
EDIT:
Text eligible for extraction:
"IP Address" & "Name" are two or more spaces apart so they are eligible to be extracted. Similarly, "Santa Monica, CA" & "local".
You are trying to split your text on the pattern "\s{2,}".
In Python, the re library gives you all the tools you need:
import re
# Columns are separated by two or more spaces
line = "IP Address  Name  Location  Type"
result = re.split(r'\s{2,}', line)
Which gives:
['IP Address', 'Name', 'Location', 'Type']
EDIT
I think I now understand your question a little better: you care more about isolating a sequence between \s{2,} than about splitting. In your example, however, the solution above still seems the most suitable.
You asked for a regex, so here it is:
reg1 = r"[^\s](?!\s{2,})(?:.(?!\s{2,}))*[^\s]"
It first selects a non-space character that is not followed by two or more spaces, [^\s](?!\s{2,}), using the negative lookahead assertion (?!...);
Then it matches a group (?:...) composed of any character . that is not followed by \s{2,};
This group is repeated with *;
The final character of the sequence is not selected at this point, because it is followed by \s{2,}, so we add one more [^\s].
Call re.findall(reg1, line) and you should be done. One drawback: it only detects sequences that are at least two characters long.
For that case, another, simpler regex can complete the job: reg2 = r"\s{2,}([^\s])\s{2,}". It selects single non-space characters surrounded by two or more spaces. The parentheses (...) make re.findall return only the captured character.
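Both patterns can be checked with re.findall. The sample strings below are assumptions: the header line is written with two spaces between columns, and "ab  x  cd" is a made-up string containing a single-character token.

```python
import re

reg1 = r"[^\s](?!\s{2,})(?:.(?!\s{2,}))*[^\s]"
reg2 = r"\s{2,}([^\s])\s{2,}"

# Assumed sample: columns separated by exactly two spaces.
line = "IP Address  Name  Location  Type"
tokens = re.findall(reg1, line)
print(tokens)

# reg2 picks up a single character that reg1 would skip.
singles = re.findall(reg2, "ab  x  cd")
print(singles)
```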
By the way, I strongly recommend a look at the documentation: https://docs.python.org/2/library/re.html
Hope you found what you are looking for :-)

how to use a regular expression to extract json fields?

Beginner RegExp question. I have lines of JSON in a text file, each with slightly different fields, but there are 3 fields I want to extract from each line if it has them, ignoring everything else. How would I use a regex (in EditPad or anywhere else) to do this?
Example:
"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24
I want to extract URL, TITLE and TAGS.
/"(url|title|tags)":"((\\"|[^"])*)"/i
I think this is what you're asking for. I'll provide an explanation momentarily. This regular expression (delimited by / - you probably won't have to put those in editpad) matches:
"
A literal ".
(url|title|tags)
Any of the three literal strings "url", "title" or "tags". In regular expressions, parentheses are used by default to create groups, and the pipe character is used to alternate, like a logical 'or'. To match these characters literally, you'd have to escape them.
":"
Another literal string.
(
The beginning of another group. (Group 2)
(
Another group (3)
\\"
The literal string \" - you have to escape the backslash because otherwise it will be interpreted as escaping the next character, and you never know what that'll do.
|
or...
[^"]
Any single character except a double quote. The brackets denote a character class (or set), a list of characters to match. Any given class matches exactly one character in the string. Using a caret (^) at the beginning of a class negates it, causing the matcher to match anything that's not contained in the class.
)
End of group 3...
*
The asterisk causes the previous regular expression (in this case, group 3) to be repeated zero or more times, in this case causing the matcher to match anything that could be inside the double quotes of a JSON string.
)"
The end of group 2, and a literal ".
I've done a few non-obvious things here, that may come in handy:
Group 2 - when dereferenced using Backreferences - will be the actual string assigned to the field. This is useful when getting the actual value.
The i at the end of the expression makes it case insensitive.
Group 1 contains the name of the captured field.
EDIT: So I see that the tags are an array. I'll update the regular expression here in a second when I've had a chance to think about it.
Your new Regex is:
/"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])/i
All I've done here is alternate the string regular expression I had been using ("((\\"|[^"])*)") with a regular expression for finding arrays (\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\]). Not so easy to read, is it? Well, substituting the letter S for our string regex, we can rewrite it as:
\[(S(,S)*)?\]
Which matches a literal opening bracket (hence the backslashes), optionally followed by a comma separated list of strings, and a closing bracket. The only new concept I've introduced here is the question mark (?), which is itself a type of repetition. Commonly referred to as 'making the previous expression optional', it can also be thought of as exactly 0 or 1 matches.
With our same S Notation, here's the whole dirty Regular Expression:
/"(url|title|tags)":(S|\[(S(,S)*)?\])/i
If it helps to see it in action, here's a view of it in action.
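For readers without a regex tester at hand, the combined expression can be tried in Python. This is a sketch: the input is a shortened, single-line version of the example record (the title and tag list are abbreviated for illustration), and re.IGNORECASE stands in for the /i flag.

```python
import re

# The combined string-or-array expression from above.
FIELD = re.compile(
    r'"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])',
    re.IGNORECASE,
)

# Abbreviated sample record (assumed for illustration).
text = (
    '"url":"http://www.netcharles.com/orwell/essays.htm",'
    '"domain":"netcharles.com",'
    '"title":"Orwell Essays",'
    '"tags":["orwell","writing"],'
    '"index":2931'
)

# Group 1 is the field name, group 2 the raw value (quotes or brackets included).
fields = {m.group(1): m.group(2) for m in FIELD.finditer(text)}
print(fields)
```

Note that group 2 still carries the surrounding quotes or brackets, so the values need a further clean-up step if only the content is wanted.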
This question is a bit old, but I browsed through my PC and found the following expression. I posted it as a gist; it could be useful to others.
EDIT:
# Expression was tested with PHP and Ruby
# This regular expression finds a key-value pair in JSON formatted strings
# Match 1: Key
# Match 2: Value
# https://regex101.com/r/zR2vU9/4
# http://rubular.com/r/KpF3suIL10
(?:\"|\')(?<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?<value>true|false|[0-9a-zA-Z\+\-\,\.\$]*)
# test document
[
{
"_id": "56af331efbeca6240c61b2ca",
"index": 120000,
"guid": "bedb2018-c017-429E-b520-696ea3666692",
"isActive": false,
"balance": "$2,202,350",
"object": {
"name": "am",
"lastname": "lang"
}
}
]
Given the JSON string you'd like to extract a field value from:
{"fid":"321","otherAttribute":"value"}
the following regular expression extracts exactly the value of the "fid" field, "321":
(?<=\"fid\":\")[^\"]*
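A minimal Python check of this lookbehind, assuming an engine that supports (?<=...); the backslashes before the quotes are dropped because they are not needed in a Python raw string.

```python
import re

doc = '{"fid":"321","otherAttribute":"value"}'

# The lookbehind anchors the match right after "fid":" without consuming it.
match = re.search(r'(?<="fid":")[^"]*', doc)
fid = match.group(0) if match else None
print(fid)
```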
Please try below expression:
/"(url|title|tags)":("([^""]+)"|\[[^[]+])/gm
Explanation:
1st Capturing Group (url|title|tags): captures one of the literal strings 'url', 'title' or 'tags' (case sensitive).
2nd Capturing Group ("([^""]+)"|\[[^[]+]):
1st Alternative "([^""]+)" matches everything between " and ", including the quotes themselves
2nd Alternative \[[^[]+] matches everything between [ and ], including the brackets themselves
I have tested here
I adapted regex to work with JSON in my own library. I've detailed algorithm behavior below.
First, stringify the JSON object. Then, you need to store the starts and lengths of the matched substrings. For example:
"matched".search("ch") // yields 3
For a JSON string, this works exactly the same, unless you are searching explicitly for commas and curly brackets, in which case I'd recommend some prior transform of your JSON object before performing the regex search (i.e. think of :, { and }).
Next, you need to reconstruct the JSON object. The algorithm I authored does this by detecting JSON syntax by recursively going backwards from the match index. For instance, the pseudo code might look as follows:
find the next key preceding the match index, call this theKey
then find the number of all occurrences of this key preceding theKey, call this theNumber
using the number of occurrences of all keys with same name as theKey up to position of theKey, traverse the object until keys named theKey has been discovered theNumber times
return this object called parentChain
With this information, it is possible to use regex to filter a JSON object to return the key, the value, and the parent object chain.
You can see the library and code I authored at http://json.spiritway.co/
If your JSON is
{"key1":"abc","key2":"xyz"}
then the regex below will extract key1 or key2, depending on the key you put in the regex:
"key2(.*?)(?=,|}|$)
you can verify it here - regex101.com
Why does it have to be a Regular Expression object?
Here we can just use a Hash object first and then go search it.
mh = {"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays & Journalism Section - Charles' George Orwell Links","tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],"index":2931,"time_created":1345419323,"num_saves":24}
The output of which would be
=> {:url=>"http://www.netcharles.com/orwell/essays.htm", :domain=>"netcharles.com", :title=>"Orwell Essays & Journalism Section - Charles' George Orwell Links", :tags=>["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"], :index=>2931, :time_created=>1345419323, :num_saves=>24}
Not that I want to avoid using Regexp, but don't you think it would be easier to take it a step at a time until you're getting the data you want to search through further? Just MHO.
mh.values_at(:url, :title, :tags)
The output:
["http://www.netcharles.com/orwell/essays.htm", "Orwell Essays & Journalism Section - Charles' George Orwell Links", ["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"]]
Taking the pattern that FrankieTheKneeman gave you:
pattern = /"(url|title|tags)":"((\\"|[^"])*)"/i
we can search the mh hash by converting it to a JSON string.
require 'json'
/#{pattern}/.match(mh.to_json)
The output:
=> #<MatchData "\"url\":\"http://www.netcharles.com/orwell/essays.htm\"" 1:"url" 2:"http://www.netcharles.com/orwell/essays.htm" 3:"m">
Of course, this is all done in Ruby, which is not one of your tags, but I hope it is still relevant.
But oops! It looks like we can't get all three at once with that pattern, so I will do them one at a time for the sake of illustration.
pattern = /"(title)":"((\\"|[^"])*)"/i
/#{pattern}/.match(mh.to_json)
#<MatchData "\"title\":\"Orwell Essays & Journalism Section - Charles' George Orwell Links\"" 1:"title" 2:"Orwell Essays & Journalism Section - Charles' George Orwell Links" 3:"s">
pattern = /"(tags)":"((\\"|[^"])*)"/i
/#{pattern}/.match(mh.to_json)
=> nil
Sorry about that last one. It will have to be handled differently.
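The parse-first idea from this answer carries over to other languages. As a sketch, here is the same selection in Python with the standard json module; the record is abbreviated and wrapped in braces so that it is a complete JSON object, both of which are assumptions made for the example.

```python
import json

# Abbreviated record, wrapped in braces to form valid JSON (assumed).
raw = (
    '{"url": "http://www.netcharles.com/orwell/essays.htm", '
    '"domain": "netcharles.com", '
    '"title": "Orwell Essays", '
    '"tags": ["orwell", "writing"], '
    '"index": 2931}'
)

record = json.loads(raw)

# Keep only the wanted fields, skipping any that a line does not have.
wanted = ("url", "title", "tags")
selected = {key: record[key] for key in wanted if key in record}
print(selected)
```

Unlike the string-only pattern, this handles the tags array without any special casing, since the parser already understands arrays.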

How does Sencha Touch matcher work?

I am trying to create a simple matcher that matches any string consisting of alphanumeric characters. I tried the following:
Ext.regModel('RegistrationData', {
fields: [
{name: 'nickname',type: 'string'},
],
validations: [
{type: 'format', name: 'nickname', matcher: /[a-zA-Z0-9]*/}
]
});
However, this does not work as expected. I did not find any documentation on what a regular expression in a matcher should look like.
Thank you for your help.
I found a blog on sencha.com, where they explain the validation.
I have no idea what sencha-touch is, but maybe it helps, when you tell us what you are giving to your regex, what you expect it to do, and what it actually does (does not work as expected is a bit vague). According to the blog it accepts "regular expression format", so for your simple check, it should be pretty standard.
EDIT:
As a wild guess, maybe you want to use anchors to ensure that the name has really only letters and numbers:
/^[a-zA-Z0-9]*$/
^ is matching the start of the string
$ is matching the end of the string
Your current regex /[a-zA-Z0-9]*/ matches zero or more consecutive letters (a-z, A-Z) or digits anywhere in the string. That's why Joe#2, J/o/e and *89Joe match, as well as Joe, Joe24 and jOe28: they all contain zero or more consecutive occurrences of the respective characters.
If you want your string to contain only the respective characters you have to change the regex according to stema's answer:
/^[a-zA-Z0-9]*$/
But this still has one problem. Because * means zero or more occurrences, it also matches an empty string, so the correct regex should be:
/^[a-zA-Z0-9]+$/
with + meaning one or more occurrences. This will allow nicknames containing only one lowercase or uppercase character or number, such as a, F or 6.
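The effect of the anchors and of + can be seen in a short sketch. It uses Python's re module rather than Sencha Touch's JavaScript matcher, which is an assumption for illustration; the two patterns behave the same way in both engines for these inputs.

```python
import re

unanchored = re.compile(r"[a-zA-Z0-9]*")
anchored = re.compile(r"^[a-zA-Z0-9]+$")

for nickname in ["Joe", "Joe#2", "*89Joe", ""]:
    # search() succeeds for the unanchored pattern on every input,
    # because zero occurrences is always a valid match.
    print(
        repr(nickname),
        bool(unanchored.search(nickname)),
        bool(anchored.search(nickname)),
    )
```

Only "Joe" passes the anchored pattern; the unanchored one accepts all four inputs, including the empty string.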