How can I create a regex to parse this string

How can I make a regex to parse this string?
"desc: random text string, sender: James, se-status: red, problem-field: I'm a problem field, I'm a problem field, action: runs, target: John, ta-status: blue, status-apply: red, lore: lore ipsum dolor sit amet"
I want groups that capture keys and values. Please note that the "problem-field" value has quotes and commas in it. Each group should capture the key and then the value up to the last comma before the next key name.
This is an example string. Other strings can have different field names, so the regex shouldn't match specific field names like sender or action.
expected result groups:
1. "desc"
2. "random text string"
3. "sender"
4. "James"
5. "se-status"
6. "red"
7. "problem-field"
8. "I'm a problem field, I'm a problem field"
9. "action"
10. "runs"
11. "target"
12. "John"
13. "ta-status"
14. "blue"
15. "status-apply"
16. "red"
17. "lore"
18. "lore ipsum dolor sit amet"
Please note that the problem field should yield one result only.
This question started when I tried to improve my answer to this SO question: JS: deserializing a string with a JSON-like structure
I had written a classic for loop, but then user Redu posted a regex-based answer. I didn't like it because the field names had to be fixed, so I tried to write a regex with capturing groups that backtrack to the last comma. I quickly discovered that my regex skills don't go that far (yet), so I created this question so we can learn from the regex masters out there.

This regex could help you:
([\w-]+): ([\w,\s']+)(?:,|$)
Demo
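It captures every key/value pair separated by a colon. Because the value group is greedy, it backtracks to the last comma before the next key, so commas inside a value are kept. Note that the value character class does not allow hyphens or colons inside a value.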
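As a rough illustration, the same pattern can be tried in Python's `re` module. This is only a sketch: the helper name `parse_pairs` is made up for the example, and it assumes values never contain hyphens or colons, which the character class does not allow.

```python
import re

# Same pattern as in the answer: key, colon, space, then a greedy value
# that must be followed by a comma or the end of the string.
PAIR = re.compile(r"([\w-]+): ([\w,\s']+)(?:,|$)")

def parse_pairs(text):
    """Return a list of (key, value) tuples found in text (hypothetical helper)."""
    return PAIR.findall(text)

sample = (
    "desc: random text string, sender: James, se-status: red, "
    "problem-field: I'm a problem field, I'm a problem field, "
    "action: runs, target: John, ta-status: blue, status-apply: red, "
    "lore: lore ipsum dolor sit amet"
)

pairs = parse_pairs(sample)
for key, value in pairs:
    print(f"{key!r} -> {value!r}")
```

The greedy value first runs into the next key, fails at its hyphen or colon, and then backtracks to the nearest preceding comma, which is why the problem field comes out as a single value.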

Related

Regex to get text between 2 large spaces

I want to try and regex this text to only get "Second Baptist School" as the output by using Customer: as the set beginning for it to recognize. How would I get it so that it recognizes the beginning and gets all of the text in between the large sections of blanks?
Customer:     Second Baptist School          Date of Sale: 9/26/2022
Right now I'm using Customer:\s*([^ -.]+) but it only gets "Second" as the output.
You can look for 2 or more whitespace characters with:
Customer:\s*(.*?)\s{2,}
This should work with your example above. The {2,} means "2 or more", and the lazy .*? stops at the first such run of whitespace.
https://regex101.com/r/1HapOO/1
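A minimal Python sketch of the same idea follows. The sample line is an assumption: the exact number of spaces in the real data is not known, so runs of several spaces are used here for illustration.

```python
import re

# Assumed sample: the real input has "large" runs of spaces around the value.
line = "Customer:     Second Baptist School          Date of Sale: 9/26/2022"

# \s* skips the padding after the label, (.*?) lazily grabs the value,
# and \s{2,} stops the capture at the first run of two or more spaces.
match = re.search(r"Customer:\s*(.*?)\s{2,}", line)
customer = match.group(1) if match else None
print(customer)
```

Single spaces inside the name do not end the capture, because `\s{2,}` needs at least two whitespace characters in a row.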

Regex optional capture groups in any order

I would like to capture groups based on a consecutive occurrence of matched groups in any order. And when one set type is repeated without the alternative set type, the alternative set is returned as nil.
I am trying to extract names and emails based on the following regex:
For names, two consecutive capitalized words:
[A-Z][\w]+\s+[A-Z][\w]+
For emails:
\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b
Example text:
John Doe john@doe.com random text
Jane Doe random text jane@doe.com
jim@doe.com more random text tim@doe.com Tim Doe
So far I have used non-capture groups and positive look aheads to tackle the "in-no-particular-order-or-even-present" problem but only managed to do so by segmenting by newlines. So my regex looks like this:
^(?=(?:.*([A-Z][\w]+\s+[A-Z][\w]+))?)(?=(?:.*(\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b))?).*
And the results miss items where there are multiple contacts on the same line:
[
["John Doe", "john@doe.com"],
["Jane Doe", "jane@doe.com"],
["Tim Doe", "tim@doe.com"],
]
When what I'm looking for is:
[
["John Doe", "john@doe.com"],
["Jane Doe", "jane@doe.com"],
[nil, "jim@doe.com"],
["Tim Doe", "tim@doe.com"],
]
My skills in regex are limited and I started using regex because it seemed like the best tool for matching names and emails.
Is regex the best tool to use for this kind of problem or are there more efficient alternatives using loops if we're extracting hundreds of contacts in this manner?
Your text is almost too random to make this work. Names and emails can be very difficult to capture, and a more advanced email pattern would only help a little. There are not only unusual email addresses but also all sorts of wild name patterns.
What about D'arcy Bly, Markus-Anthony Reid, or Lee Z? And those are probably the simplest examples.
So you have to make a lot of assumptions, and you won't be fully satisfied unless you use more advanced techniques such as natural language processing.
If you insist on your approach, I came up with this (toothless) monstrosity:
([A-Z]\w+ [A-Z]\w+)(?:\w* )*([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})|
([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})(?:\w* )*([A-Z]\w+ [A-Z]\w+)|
([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})
The order of the alternation groups is important to be able to capture the stray email.
Demo
PS: The demo uses a branch reset to capture only in groups 1 and 2. However, it looks like Ruby 2.x does not support branch reset groups, so you need to check all 5 groups for values.
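Python's `re` module has no branch reset groups either, so "check all 5 groups" might look roughly like the sketch below. The names `NAME`, `EMAIL`, `GAP` and `extract_contacts` are invented for this example, the sample addresses are written with an ordinary `@`, and the name pattern carries the same naive two-capitalized-words assumption discussed above.

```python
import re

NAME = r"[A-Z]\w+ [A-Z]\w+"                                 # two capitalized words
EMAIL = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"
GAP = r"(?:\w* )*"                                          # filler words between the two

# Three alternatives, five groups: name+email, email+name, stray email.
CONTACT = re.compile(
    rf"({NAME}){GAP}({EMAIL})"
    rf"|({EMAIL}){GAP}({NAME})"
    rf"|({EMAIL})"
)

def extract_contacts(text):
    """Collapse the five groups into (name, email) pairs; name may be None."""
    contacts = []
    for m in CONTACT.finditer(text):
        name1, email1, email2, name2, email3 = m.groups()
        contacts.append((name1 or name2, email1 or email2 or email3))
    return contacts

text = (
    "John Doe john@doe.com random text\n"
    "Jane Doe random text jane@doe.com\n"
    "jim@doe.com more random text tim@doe.com Tim Doe\n"
)
contacts = extract_contacts(text)
print(contacts)
```

As in the original pattern, the order of the alternatives matters: the stray-email branch comes last so that it is only tried when no name can be paired with the address.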
Here's a rewrite of @wp78de's idea into Ruby regexp syntax:
regexp = /
  (?<name>
    [A-Z][\w]+\s+[A-Z][\w]+
  ){0}
  (?<email>
    \b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b
  ){0}
  (?:
    \g<name> (?:\w*\s)* \g<email>
  | \g<email> (?:\w*\s)* \g<name>
  | \g<email>
  )
/x
text = <<-TEXT
John Doe john@doe.com random text
Jane Doe random text jane@doe.com
jim@doe.com more random text tim@doe.com Tim Doe
TEXT
p text.scan(regexp)
# => [["John Doe", "john@doe.com"],
# => ["Jane Doe", "jane@doe.com"],
# => [nil, "jim@doe.com"],
# => ["Tim Doe", "tim@doe.com"]]

Converting extracted text string to date where string varies in length in Postgres

I have a materialized view of a text column that extracts a string of numbers representing a date.
The materialized view is created using the following function:
(regexp_replace(left(substring(lower(replace(content,' ','_')) from 're-inspection_date:_(.*)_'),10),'\D','','g'))
And outputs a text string in the format of MMDDYYYY except it does not account for leading zeroes for single-digit months and days.
When I try to use the "to_date" function specifying the format MMDDYYYY using the following:
(to_date(regexp_replace(left(substring(lower(replace(content,' ','_')) from 're-inspection_date:_(.*)_'),10),'\D','','g'),'MMDDYYYY'))
I get the error "date/time field value out of range: '12122018'".
I believe the issue is due to one or both of the following:
1. The strings produced by my current regexp in the materialized view vary in length (e.g. 12212018, 8222018, 962018) because the regexp removes all non-digit characters. The dates are 6, 7 or 8 digits long.
2. As a result, I haven't yet been able to come up with a way of inserting a delimiter between the month/day/year values.
Is there a way to convert these output strings to dates without changing my regexp?
If not, how could I change my regexp to extract these values?
Bear in mind that the date I'm after in the source text is formatted as 12/1/2018 and also doesn't account for leading 0's in days or months. Also, there is another date preceding the target date in the text formatted the same way.
Here is a sample of the source text:
PLACEHOLDER TEXT FOR REDACTED STUFF BLAH BLAH BLAH
**** Loremipsum
11/28/2018 4: 21: 37 PM ****1 of 2 Facility Information Permit
Number: 12-34-56789 Name of Facility: Dolor sit amet-consectetur
Address: 123 Fake Street City, Zip: adipiscing elit12345 RESULT: sed
Do Eiusmod tempor: by 8: 00 AM Re-Inspection Date: 12/4/2018 Type: Blah-Type Stuff Etc: Dolor sit amet-consectetur...
Where the "Re-Inspection Date: 12/4/2018" is what I'm after.
I'm on Postgres 11.
Kaushik Nayak is correct, I think. I get the same result with this regex, which uses a positive lookbehind (?<=Re-Inspection Date: ) and allows any number of digits [0-9]* separated by a single slash /{1}:
SELECT to_date(substring('string'
from '(?<=Re-Inspection Date: )[0-9]*/{1}[0-9]*/{1}[0-9]*'), 'mm/dd/yyyy');
You can allow a varying number of digits by using the {m,n} repetition quantifier, and keep the slashes so that to_date can tell month, day and year apart:
select to_date(substring(lower(content)
from 're-inspection date:\s*(\d{1,2}/\d{1,2}/\d{4})' ),'mm/dd/yyyy') from t
Demo
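For comparison outside the database, here is a small Python sketch of the same approach: capture the date with its slashes intact, then parse it. The function name `reinspection_date` and the shortened sample string are assumptions for illustration only. Keeping the slashes matters because a digits-only string such as 1112018 could mean 1/11/2018 or 11/1/2018.

```python
import re
from datetime import date, datetime

# Capture the date right after the label, keeping the slashes so that
# single-digit months and days stay unambiguous.
DATE_AFTER_LABEL = re.compile(
    r"re-inspection date:\s*(\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE
)

def reinspection_date(content):
    """Return the re-inspection date found in content, or None (hypothetical helper)."""
    m = DATE_AFTER_LABEL.search(content)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%m/%d/%Y").date()

# Shortened sample modeled on the question; note the earlier, unrelated date.
sample = (
    "11/28/2018 4: 21: 37 PM ****1 of 2 Facility Information Permit "
    "Do Eiusmod tempor: by 8: 00 AM Re-Inspection Date: 12/4/2018 "
    "Type: Blah-Type Stuff"
)
print(reinspection_date(sample))
```

Anchoring on the label also skips the earlier date in the text, which a bare date pattern would match first.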

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
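To see how the lookahead behaves on the sample rows, here is a short Python sketch. The helper name `suburb` is made up, and the "WATSONS BAY NSW 2030" row is a hypothetical input added to exercise a suburb that starts with a state abbreviation.

```python
import re

# Repeated "optional space + word" units; the negative lookahead refuses any
# word that is exactly a state abbreviation (the \b makes it a whole-word test).
SUBURB = re.compile(r"( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+")

def suburb(value):
    """Return the leading suburb words, or the value unchanged if nothing matches."""
    m = SUBURB.match(value)
    return m.group(0) if m else value

rows = [
    "CABRAMATTA",
    "CANLEY HEIGHTS",
    "ST JOHNS PARK",
    "Parramatta NSW 2150",
    "Claymore 2559",
    "CASULA",
]
suburbs = [suburb(row) for row in rows]
print(suburbs)

# Hypothetical row: "WATSONS" starts with "WA" but is not the state itself.
print(suburb("WATSONS BAY NSW 2030"))
```

Postcodes are dropped as a side effect, because the word part only accepts letters.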
Expanding on Bohemian's answer, you can use groupings to do a REGEX_Replace in Alteryx. So:
REGEX_Replace([Field1], "(.*)(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This keeps whatever matches the first group (so just the suburb). The second and third groups match the state and the postcode. It is not a perfect regex (rows without a state are left unchanged, for example), but it should get you most of the way there.

Regular Expression help needed to convert lst file to csv

I have a file (ratings.lst) downloaded from IMDB Interfaces. The content appears to be in the following format:
Distribution Votes Rating Title
0000001222 297339 8.4 Reservoir Dogs (1992)
0000001223 64504 8.4 The Third Man (1949)
0000000115 48173 8.4 Jodaeiye Nader az Simin (2011)
0000001232 324564 8.4 The Prestige (2006)
0000001222 301527 8.4 The Green Mile (1999)
My aim is to convert this file into a CSV file (comma separated) with the following desired result (example for 1 line) :
Distribution Votes Rating Title
0000001222, 301527, 8.4, The Green Mile (1999)
I am using textpad and it supports regex based search and replace. I'm not sure what type of regex is needed to achieve the above desired results. Can somebody please help me on this. Thanks in advance.
The other regular expressions are somewhat overcomplicated. Because whitespace is guaranteed not to appear in the first three columns, you don't have to do a fancy match: "three columns of anything separated by whitespace, then the rest of the line" will do.
Try replacing ^(.+?)\s+(.+?)\s+(.+?)\s+(.+?)$ with \1,\2,\3,"\4" giving the following output (using Notepad++)
Distribution,Votes,Rating,"Title"
0000001222,297339,8.4,"Reservoir Dogs (1992)"
0000001223,64504,8.4,"The Third Man (1949)"
0000000115,48173,8.4,"Jodaeiye Nader az Simin (2011)"
0000001232,324564,8.4,"The Prestige (2006)"
0000001222,301527,8.4,"The Green Mile (1999)"
Note the use of a non-greedy quantifier, .+?, to prevent accidentally matching more than we should. Also note that I've enclosed the fourth column with quote marks "" in case a comma appears in the movie title - otherwise the software you use to read the file would interpret Avatar, the Last Airbender as two columns.
The nice tabular alignment is gone - but if you open the file in Excel it will look fine.
Alternately, just do the entire thing in Excel.
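If a script is an option, the same search-and-replace can be sketched in Python. The helper name `to_csv_line` is invented for this example. The sketch strips each line first, on the assumption that the real file may have leading spaces, and doubles any quote marks in the title so the quoted field stays valid.

```python
import re

# Three whitespace-free columns, then the rest of the line as the title.
ROW = re.compile(r"^(.+?)\s+(.+?)\s+(.+?)\s+(.+?)$")

def to_csv_line(line):
    """Convert one whitespace-separated ratings line into a CSV line."""
    m = ROW.match(line.strip())
    if m is None:
        return line  # leave lines that do not fit the pattern untouched
    dist, votes, rating, title = m.groups()
    title = title.replace('"', '""')  # escape quotes inside the quoted field
    return f'{dist},{votes},{rating},"{title}"'

lines = [
    "Distribution  Votes  Rating  Title",
    "      0000001222  301527   8.4  The Green Mile (1999)",
]
converted = [to_csv_line(line) for line in lines]
for row in converted:
    print(row)
```

For anything beyond a one-off conversion, Python's standard `csv` module handles the quoting rules itself and may be the safer choice.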
First replace all " with "" then do this:
Find: ^\([0-9]+\)[ \t]+\([0-9]+\)[ \t]+\([^ \t]+\)[ \t]+\(.*\)
Replace with: \1,\2,\3,"\4"
Press F8 to open Replace dialog
Make sure Regular Expression is selected
In Find what: put: ^([[:digit:]]{10})[[:space:]]+([[:digit:]]+)[[:space:]]+([[:digit:]]{1,2}\.[[:digit:]])[[:space:]]+(.*)$
In Replace with: put \1,\2,\3,"\4"
Click Replace All
Note: This matches 1 or more whitespace characters between the fields from ratings.lst. You might be better off specifying the exact number of spaces if you know it.
Also note: I didn't put spaces between the comma-separated items, as generally you don't, but feel free to add them.
Final note: I put the movie title in quotes so that a comma in the title doesn't break the CSV format. You may want to handle this differently.
MY BAD: this is a C# program. I will leave it up as an alternative solution.
The IgnorePatternWhitespace option is there so that the pattern can be commented.
This will create data which can be placed into a CSV file. Note that CSV files do not have optional whitespace in them, unlike your example.
using System;
using System.Linq;
using System.Text.RegularExpressions;
string data = @"Distribution Votes Rating Title
0000001222 297339 8.4 Reservoir Dogs (1992)
0000001223 64504 8.4 The Third Man (1949)
0000000115 48173 8.4 Jodaeiye Nader az Simin (2011)
0000001232 324564 8.4 The Prestige (2006)
0000001222 301527 8.4 The Green Mile (1999)
";
string pattern = @"
^                    # Always start at the beginning of a line
(                    # Grouping
  (?<Value>[^\s]+)   # Capture one whitespace-free column into Value
  (?:\s+)            # Match but don't capture one or more whitespace characters
){3}                 # Three such columns
(?<Value>[^\n\r]+)   # Append the rest of the line (the title) to the Value captures
";
var result = Regex.Matches(data, pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace)
    .OfType<Match>()
    .Select(mt => string.Join(",", mt.Groups["Value"].Captures
        .OfType<Capture>()
        .Select(c => c.Value))
    );
// result is a sequence of CSV lines, so join it before printing
Console.WriteLine(string.Join(Environment.NewLine, result));
/* output
Distribution,Votes,Rating,Title
0000001222,297339,8.4,Reservoir Dogs (1992)
0000001223,64504,8.4,The Third Man (1949)
0000000115,48173,8.4,Jodaeiye Nader az Simin (2011)
0000001232,324564,8.4,The Prestige (2006)
0000001222,301527,8.4,The Green Mile (1999)
*/