Regular expression string followed by numbers - regex

I am writing a regular expression to extract phrases like #Question1# or #Question125# from an HTML string like
Patitent name #Question1#, Patient was suffering from #Question2#, Patient's gender is #Question3#, patient has #Question4# drinking for the last month. His DOB is #Question5#
The first half of the expression is simple, just #Question, but I also need to match a series of digits of unspecified length, and the whole phrase ends with #.
Once I find a matching phrase, how do I extract only the digits from it? For example, from #Question312# I just want to get 312.
Any suggestions?

The regexp you are looking for is
/#Question[0-9]+#/
If you need to extract the number you can just wrap the [0-9]+ part in parentheses
/#Question([0-9]+)#/
making it a group. How you use a captured group depends on the specific regexp implementation (e.g. python, perl, javascript ...). For example in python you can replace all those questions with corresponding answers from a list with
import re
answers = ["Andrea", "Griffini"]
text = "My first name is #Question1# and my last name is #Question2#"
# group(1) holds the digits; question numbers are 1-based, list indexes 0-based
print(re.sub(r"#Question([0-9]+)#",
             lambda m: answers[int(m.group(1)) - 1],
             text))
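For the second part of the question (getting only the digits out), here is a minimal Python sketch. The sample sentence is shortened from the question and the variable names are illustrative, not from the original answer.

```python
import re

# Shortened sample text; #Question312# is added to show multi-digit numbers.
html = ("Patient name #Question1#, Patient was suffering from #Question2#, "
        "His DOB is #Question5#, see also #Question312#")

# With exactly one capture group, findall returns only the group's text,
# i.e. the digits without the surrounding "#Question" and "#".
numbers = [int(n) for n in re.findall(r"#Question([0-9]+)#", html)]
print(numbers)
```

Converting with int() is optional; without it the digits stay as strings.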

I think what you are looking for is:
#Question[0-9]+#
#Question
Any character in this class: [0-9], one or more repetitions
#

Related

How to capture text between a specific word and the semicolon immediately preceding it with regex?

I have many rows of people and titles in Excel, and am looking to filter out certain people by title. For example, cells may contain the following:
John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder
These cells are varying lengths and have varying numbers of people and titles. My plan is to add semicolons at the beginning and end to standardize it. This would give me:
;John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder;
Currently, I have code that iterates through the cells and uses the regex Founder.*?; which returns each instance of founder (i.e. Founder;Founder;), but I can't figure out how to also capture the names of the people. I think I need to anchor on the semicolon immediately preceding "Founder", but so far I have not managed it. My ultimate goal is to return something like the following; I have the code for this, except for the correct regular expression.
;John Smith, Co-Founder;James Jackson, Co-Founder;
Depending on your version of Excel, you could also do this with a formula:
=FILTERXML("<t><s>" & SUBSTITUTE(A1,";","</s><s>")&"</s></t>","//s[contains(.,'Co-Founder')]")
However, for a regex, you could use
(?:^|;)([^;]*?Co-Founder)
which will return the Co-Founders in capturing group 1.
There is no need for leading/trailing semicolons.
Even though VBA regex does not support look-behind, you can work with that limitation.
The Co-Founders regex
(?:^|;)([^;]*?Co-Founder)
Options: Case sensitive (or not, as you prefer); ^$ match at line breaks
Match the regular expression below (?:^|;)
    Match this alternative ^
        Assert position at the beginning of the string ^
    Or match this alternative ;
        Match the character “;” literally ;
Match the regex below and capture its match into backreference number 1 ([^;]*?Co-Founder)
    Match any character that is NOT a “;” [^;]*?
        Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
    Match the character string “Co-Founder” literally Co-Founder
Created with RegexBuddy
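To see how capturing group 1 behaves, here is a small sketch in Python rather than VBA (an assumption made purely for illustration; the pattern itself is unchanged). The names cell, founders and result are hypothetical.

```python
import re

# Cell content without the added leading/trailing semicolons.
cell = "John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder"

# (?:^|;) anchors each entry at the start of the string or after a semicolon;
# group 1 then captures the whole "name, title" entry ending in Co-Founder.
pattern = re.compile(r"(?:^|;)([^;]*?Co-Founder)")
founders = pattern.findall(cell)

# Rebuild the semicolon-delimited form the question asked for.
result = ";" + ";".join(founders) + ";"
print(result)
```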
Alternatively, skip the regex: split the whole string on semicolons and apply a positive filter. The getCoFounders() function returns an array of the findings:
Sub ExampleCall()
    Dim s As String
    s = ";John Smith, Co-Founder;Jane Doe, CEO;James Jackson, Co-Founder;"
    Debug.Print Join(getCoFounders(s), "|")
End Sub
Function getCoFounders(s As String)
    getCoFounders = Filter(Split(s, ";"), "Co-Founder", True, vbTextCompare)
End Function
Results in VB Editor's immediate window
John Smith, Co-Founder|James Jackson, Co-Founder

RegEx Parse Tool to extract digits from string

Using Alteryx, I have a field called Address which consists of values like A32C, GH2X, ABC19E, that is, digits sandwiched between sets of letters. I am trying to use the RegEx tool to extract the digits out into a new column called ADDRESS1.
I have Address set to Field to Parse. Output method Parse.
My regular expression is typed in as:
(?:[[alpha]]+)(/d+)(?:[[alpha]]+)
And then I have (/d+) outputting to ADDRESS1. However, when I run this it parses 0 records. What am I doing wrong?
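In your pattern, /d is a slash followed by the letter d, not the digit class, and [[alpha]] is missing the colons of the POSIX class. To match a digit, use [0-9] or \d. To match a letter, use [[:alpha:]].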
Use
[[:alpha:]]+(\d+)[[:alpha:]]+
See the regex demo.
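The same idea can be checked outside Alteryx. This is a rough Python sketch, not the Alteryx tool itself. Python's re module has no POSIX classes such as [[:alpha:]], so [A-Za-z] is swapped in as an assumption that the letters are plain ASCII.

```python
import re

addresses = ["A32C", "GH2X", "ABC19E"]

# Letters, then the captured run of digits, then letters again.
pattern = re.compile(r"[A-Za-z]+(\d+)[A-Za-z]+")

address1 = []
for value in addresses:
    m = pattern.search(value)
    # group(1) is the digit run; None marks a value that does not fit the shape
    address1.append(m.group(1) if m else None)

print(address1)
```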
You can try this:
let regex = /(?<=[A-Z])\d+(?=[A-Z])/g;
let values = 'A32CZ, GH2X, ABC19E';
let result = values.match(regex);
console.log(result);

Value matching in regex and Openrefine

I am trying to use the value.match command in OpenRefine 2.6 for splitting two columns based on a 4 number date.
A sample of the text is:
"first sentence, second sentence, third sentences, 2009"
What I do is go to "Add column based on this column" and insert
value.match(\d{4})
but I get the error
Parsing error at offset 12: Missing number, string, identifier, regex,
or parenthesized expression
Any idea of a possible solution?
You need to fix 3 things to get this working:
1) As Wiktor says, you need to start and end the regular expression with a forward slash /
2) The 'match' function requires you to match the whole string in the cell, not just the fragment you need, so your regular expression needs to match the whole string
3) To extract part of a string with 'match' you need to have capture groups in your regular expression, that is, use ( ) around the bit of the regular expression you want to extract. The captured values will be put in an array and you will need to get the string out of the array to store it in a cell
So you'll need something like:
value.match(/.*(\d{4})/)[0]
To get the four digit year from the end of the string
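Points 2 and 3 can be illustrated with a Python analogy. This is not GREL, only a sketch: re.fullmatch is used here because, like the 'match' function described above, it succeeds only when the pattern covers the entire string.

```python
import re

cell = "first sentence, second sentence, third sentences, 2009"

# A bare \d{4} does not cover the whole cell, so the whole-string match fails.
partial = re.fullmatch(r"\d{4}", cell)

# .* absorbs everything before the year; the parentheses capture the year.
m = re.fullmatch(r".*(\d{4})", cell)
year = m.group(1) if m else None

print(partial, year)
```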

Regular expression which will match if there is no repetition

I would like to construct regular expression which will match password if there is no character repeating 4 or more times.
I have come up with regex which will match if there is character or group of characters repeating 4 times:
(?:([a-zA-Z\d]{1,})\1\1\1)
Is there any way to match only if the string doesn't contain the repetitions? I tried the approach suggested in Regular expression to match a line that doesn't contain a word? as I thought some combination of positive/negative lookaheads would do it. But I haven't found a working example yet.
By repetition I mean any number of characters anywhere in the string
Example - should not match
aaaaxbc
abababab
x14aaaabc
Example - should match
abcaxaxaz
(a is here 4 times but it is not problem, I want to filter out repeating patterns)
That link was very helpful, and I was able to use it to create the regular expression from your original expression.
^(?:(?!(?<char>[a-zA-Z\d]+)\k<char>{3,}).)+$
or
^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$
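As a quick check, the numbered-group variant can be run against the examples from the question. The sketch below uses Python's re module (an assumption; the answer does not name a flavor), and the names no_repeat, samples and results are illustrative.

```python
import re

# At every position, the negative lookahead rejects a letter/digit group
# that is immediately followed by three or more copies of itself.
no_repeat = re.compile(r"^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$")

samples = ["aaaaxbc", "abababab", "x14aaaabc", "abcaxaxaz"]
results = {s: bool(no_repeat.match(s)) for s in samples}

for s, ok in results.items():
    print(s, "match" if ok else "no match")
```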
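Nota bene: this solution doesn't answer the question exactly; it does more than the expressed need requires, because it rejects any character that occurs four or more times anywhere in the string, not only repeated patterns.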
-----
In Python:
import re

pat = r'(?:(.)(?!.*?\1.*?\1.*?\1.*\Z))+\Z'
regx = re.compile(pat)
for s in (':1*2-3=4#',
          ':1*1-3=4#5',
          ':1*1-1=4#5!6',
          ':1*1-1=1#',
          ':1*2-a=14#a~7&1{g}1'):
    m = regx.match(s)
    # a match means no character occurs four or more times
    print(m.group() if m else '--No match--')
result
:1*2-3=4#
:1*1-3=4#5
:1*1-1=4#5!6
--No match--
--No match--
This gives the regex engine a lot of work, because for each character of the string it runs through, the pattern must verify that the current character isn't found three more times in the remaining sequence of characters that follows it.
But it works, apparently.

how to use a regular expression to extract json fields?

Beginner RegExp question. I have lines of JSON in a textfile, each with slightly different Fields, but there are 3 fields I want to extract for each line if it has it, ignoring everything else. How would I use a regex (in editpad or anywhere else) to do this?
Example:
"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24
I want to extract URL,TITLE,TAGS,
/"(url|title|tags)":"((\\"|[^"])*)"/i
I think this is what you're asking for. I'll provide an explanation momentarily. This regular expression (delimited by / - you probably won't have to put those in editpad) matches:
"
A literal ".
(url|title|tags)
Any of the three literal strings "url", "title" or "tags" - in Regular Expressions, by default Parentheses are used to create groups, and the pipe character is used to alternate - like a logical 'or'. To match these literal characters, you'd have to escape them.
":"
Another literal string.
(
The beginning of another group. (Group 2)
(
Another group (3)
\\"
The literal string \" - you have to escape the backslash because otherwise it will be interpreted as escaping the next character, and you never know what that'll do.
|
or...
[^"]
Any single character except a double quote. The brackets denote a character class (or set), a list of characters to match. Any given class matches exactly one character in the string. Using a caret (^) at the beginning of a class negates it, causing the matcher to match anything that's not contained in the class.
)
End of group 3...
*
The asterisk causes the previous regular expression (in this case, group 3) to be repeated zero or more times, in this case causing the matcher to match anything that could be inside the double quotes of a JSON string.
)"
The end of group 2, and a literal ".
I've done a few non-obvious things here, that may come in handy:
Group 2 - when dereferenced using Backreferences - will be the actual string assigned to the field. This is useful when getting the actual value.
The i at the end of the expression makes it case insensitive.
Group 1 contains the name of the captured field.
EDIT: So I see that the tags are an array. I'll update the regular expression here in a second when I've had a chance to think about it.
Your new Regex is:
/"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])/i
All I've done here is alternate the string regular expression I had been using ("((\\"|[^"])*)"), with a regular expression for finding arrays (\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\]). Not so easy to read, is it? Well, substituting the letter S for our string regex, we can rewrite it as:
\[(S(,S)*)?\]
Which matches a literal opening bracket (hence the backslashes), optionally followed by a comma separated list of strings, and a closing bracket. The only new concept I've introduced here is the question mark (?), which is itself a type of repetition. Commonly referred to as 'making the previous expression optional', it can also be thought of as exactly 0 or 1 matches.
With our same S Notation, here's the whole dirty Regular Expression:
/"(url|title|tags)":(S|\[(S(,S)*)?\])/i
If it helps to see it in action, here's a view of it in action.
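For readers who want to try the array-aware expression, here is a minimal Python sketch. The pattern is copied from the answer; the sample line is shortened from the question, and decoding each captured value with json.loads is an extra step assumed here for convenience.

```python
import json
import re

# Shortened version of the sample line from the question.
line = ('"url":"http://www.netcharles.com/orwell/essays.htm",'
        '"domain":"netcharles.com",'
        '"title":"Orwell Essays & Journalism Section",'
        '"tags":["orwell","writing","essays"],'
        '"index":2931')

# Group 1 is the field name, group 2 the raw value (a string or an array).
pattern = re.compile(
    r'"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])',
    re.IGNORECASE,
)

# The raw value is itself valid JSON, so json.loads can decode it.
fields = {m.group(1): json.loads(m.group(2)) for m in pattern.finditer(line)}
print(fields)
```

The "domain" and "index" fields are skipped because their names are not in the alternation.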
This question is a bit older, but I browsed around on my PC and found this expression. I posted it as a Gist; it could be useful to others.
EDIT:
# Expression was tested with PHP and Ruby
# This regular expression finds a key-value pair in JSON formatted strings
# Match 1: Key
# Match 2: Value
# https://regex101.com/r/zR2vU9/4
# http://rubular.com/r/KpF3suIL10
(?:\"|\')(?<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?<value>true|false|[0-9a-zA-Z\+\-\,\.\$]*)
# test document
[
{
"_id": "56af331efbeca6240c61b2ca",
"index": 120000,
"guid": "bedb2018-c017-429E-b520-696ea3666692",
"isActive": false,
"balance": "$2,202,350",
"object": {
"name": "am",
"lastname": "lang"
}
}
]
Given the JSON string you'd like to extract a field value from:
{"fid":"321","otherAttribute":"value"}
the following regular expression extracts exactly the "fid" field value, 321:
(?<=\"fid\":\")[^\"]*
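A short sketch of the lookbehind in Python (chosen for illustration; the answer does not say which engine it targets). The backslashes before the quotes are dropped because they are only needed when the pattern sits inside a double-quoted string literal.

```python
import re

doc = '{"fid":"321","otherAttribute":"value"}'

# The lookbehind requires "fid":" just before the match but does not consume
# it, so the match is only the value text up to the next double quote.
m = re.search(r'(?<="fid":")[^"]*', doc)
fid = m.group(0) if m else None
print(fid)
```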
Please try below expression:
/"(url|title|tags)":("([^""]+)"|\[[^[]+])/gm
Explanation:
1st Capturing Group (url|title|tags): This is alternatively capturing the characters 'url','title' and 'tags' literally (case sensitive).
2nd Capturing Group ("([^""]+)"|\[[^[]+]):
1st Alternative "([^""]+)" matches everything between " and ", including the quotes themselves
2nd Alternative \[[^[]+] matches everything between [ and ], including the brackets themselves
I have tested here
I adapted regex to work with JSON in my own library. I've detailed algorithm behavior below.
First, stringify the JSON object. Then, you need to store the starts and lengths of the matched substrings. For example:
"matched".search("ch") // yields 3
For a JSON string, this works exactly the same, unless you are searching explicitly for commas and curly brackets, in which case I'd recommend some prior transform of your JSON object before performing regex (i.e. think :, {, }).
Next, you need to reconstruct the JSON object. The algorithm I authored does this by detecting JSON syntax by recursively going backwards from the match index. For instance, the pseudo code might look as follows:
find the next key preceding the match index, call this theKey
then find the number of all occurrences of this key preceding theKey, call this theNumber
using the number of occurrences of all keys with same name as theKey up to position of theKey, traverse the object until keys named theKey has been discovered theNumber times
return this object called parentChain
With this information, it is possible to use regex to filter a JSON object to return the key, the value, and the parent object chain.
You can see the library and code I authored at http://json.spiritway.co/
if your json is
{"key1":"abc","key2":"xyz"}
then below regex will extract key1 or key2 based on a key that you pass in regex
"key2(.*?)(?=,|}|$)
you can verify it here - regex101.com
Why does it have to be a Regular Expression object?
Here we can just use a Hash object first and then go search it.
mh = {"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays & Journalism Section - Charles' George Orwell Links","tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],"index":2931,"time_created":1345419323,"num_saves":24}
The output of which would be
=> {:url=>"http://www.netcharles.com/orwell/essays.htm", :domain=>"netcharles.com", :title=>"Orwell Essays & Journalism Section - Charles' George Orwell Links", :tags=>["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"], :index=>2931, :time_created=>1345419323, :num_saves=>24}
Not that I want to avoid using Regexp, but don't you think it would be easier to take it a step at a time until you're getting the data you want to search through further? Just MHO.
mh.values_at(:url, :title, :tags)
The output:
["http://www.netcharles.com/orwell/essays.htm", "Orwell Essays & Journalism Section - Charles' George Orwell Links", ["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"]]
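The same parse-first approach can be sketched in Python with the standard json module. This assumes each line of the text file is a complete JSON object (the excerpt in the question omits the surrounding braces); the sample line is shortened and the names record and wanted are illustrative.

```python
import json

# One line of the file, assumed to be a complete JSON object.
line = ('{"url":"http://www.netcharles.com/orwell/essays.htm",'
        '"domain":"netcharles.com",'
        '"title":"Orwell Essays & Journalism Section",'
        '"tags":["orwell","writing","essays"],'
        '"index":2931}')

record = json.loads(line)

# dict.get returns None for a field that a given line does not have.
wanted = [record.get(key) for key in ("url", "title", "tags")]
print(wanted)
```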
Taking the pattern that FrankieTheKneeman gave you:
pattern = /"(url|title|tags)":"((\\"|[^"])*)"/i
we can search the mh hash by converting it to a JSON string (to_json needs require 'json').
/#{pattern}/.match(mh.to_json)
The output:
=> #<MatchData "\"url\":\"http://www.netcharles.com/orwell/essays.htm\"" 1:"url" 2:"http://www.netcharles.com/orwell/essays.htm" 3:"m">
Of course this is all done in Ruby which is not a tag that you have but relates I hope.
But oops! match only returns the first hit, so we can't get all three at once this way; I will do them one at a time.
pattern = /"(title)":"((\\"|[^"])*)"/i
/#{pattern}/.match(mh.to_json)
#<MatchData "\"title\":\"Orwell Essays & Journalism Section - Charles' George Orwell Links\"" 1:"title" 2:"Orwell Essays & Journalism Section - Charles' George Orwell Links" 3:"s">
pattern = /"(tags)":"((\\"|[^"])*)"/i
/#{pattern}/.match(mh.to_json)
=> nil
Sorry about that last one. The tags value is an array, not a quoted string, so this pattern cannot match it; it will have to be handled differently.