'Looping' Through a Repeating Substring with Regex - regex

The general problem:
I've got a lot of data I'm trying to clean up and then parse. Each line is really long, but they all have the same structure: one unique substring, followed by a second unique substring, followed by a substring that repeats about 20 times.
So it's: String A, String B, String C, String C, String C, etc. Every line is in that format.
At the start of String A is an ID, just a unique six digit number. I'm trying to insert that ID at the beginning of String B and all of the String C's.
String C is the problem. I can write regexes for the ID, B, and C, but trying to insert the captured ID into all the Cs fails: it only works on the last one. That's actually the correct behavior here, but I'm pretty sure there is a way to treat String C so that each instance of the substring is handled separately, with the regex running over it again and again.
I tried using '\G' syntax but I can't seem to make it work.
So here's a specific example using some massively abridged sample data:
['sample_id':121084,[122,'southwest',7.23,[[['station_01',[1]],['station_02',[1]], ['station_03',[22]],['station_04',[49]],['station_05',[1]],['station_06',[4]],['station_07',[101]],['station_08',[22]]]],[[['run':133225,'marker':'SAM',[[['substation_01',[1]],['substation_02',[3]],['substation_03',[16]],['substation_04',[15]],['substation_05',[14]],['substation_06',[6]],['substation_07',[41]],['substation_08',[19]],['substation_09',[13]],['substation_10',[1]],['substation_11',[13]],['substation_12',[1]]]],'TK',22,34,127],['run':608049,'marker':'TIM',[[['substation_01',[12]],['substation_02',[6]],['substation_03',[17]],['substation_04',[11]],['substation_05',[1]],['substation_06',[6]],['substation_07',[5]],['substation_08',[19]]]],'TM',21,21,966],['run':445801,'marker':'RON',[[['substation_01',[5]],['substation_02',[5]],['substation_03',[6]],['substation_04',[11]],['substation_05',[1]],['substation_06',[15]],['substation_07',[11]],['substation_08',[16]],['substation_09',[1]],['substation_10',[13]],['substation_11',[3]]]],'TR',12,33,521],['run':142278, etc...
Just a note: The only difference between String B and all the String Cs is the number of brackets, but that's actually useful once I start parsing this out (ultimately it'll all be JSON).
What I'm trying to get is:
['sample_id':121084,[122,'southwest',7.23,[[['station_01',[1]],['station_02',[1]],['station_03',[22]],['station_04',[49]],['station_05',[1]],['station_06',[4]],['station_07',[101]],['station_08',[22]]]],[[['sample_id':121084,'run':133225,'marker':'SAM',[[['substation_01',[1]],['substation_02',[3]],['substation_03',[16]],['substation_04',[15]],['substation_05',[14]],['substation_06',[6]],['substation_07',[41]],['substation_08',[19]],['substation_09',[13]],['substation_10',[1]],['substation_11',[13]],['substation_12',[1]]]],'TK',22,34,127],['sample_id':121084,'run':608049,'marker':'TIM',[[['substation_01',[12]],['substation_02',[6]],['substation_03',[17]],['substation_04',[11]],['substation_05',[1]],['substation_06',[6]],['substation_07',[5]],['substation_08',[19]]],'TM',21,21,966],['sample_id':121084,'run':445801,'marker':'RON',[[['substation_01',[5]],['substation_02',[5]],['substation_03',[6]],['substation_04',[11]],['substation_05',[1]],['substation_06',[15]],['substation_07',[11]],['substation_08',[16]],['substation_09',[1]],['substation_10',[13]],['substation_11',[3]]],'TR',12,33,521],['sample_id':121084, etc...
In the latter text block each substring now begins with the ID 'sample_id':121084 (I bolded it to make it slightly easier to see what's going on).
Here's the Regex that gets me up through String C.
\[('sample_id':\d{6},)(?:.+\]\]\],\[\[)\[(.+?\d\],)\[(.+?\d\],)
So I'm trying to insert that first capture group ($1) in front of the second group, then the third group over and over and over (about 20x). If I repeat the last bit, I end up killing all but one of the C Strings, which again, I believe to be the 'proper' behavior. I'm trying to figure out how to get around that.
It's a mess I know. But each of those is just one line, and I've got doc after doc that'll have 100 or so lines like that. So a regex that doesn't break up the lines seems best.
I went over this page a few times trying to engineer a solution, but again, I couldn't make the \G syntax work here.
Collapse and Capture a Repeating Pattern in a Single Regex Expression
Should mention I'm trying to do this in Sublime Text 2. Thanks for any help.
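If a scripting pass is an option, one way around the repeated-group problem is to do it in two steps: capture the ID once, then run a second substitution over every ['run': opening on the line. This is only a minimal sketch in Python, not a single Sublime Text 2 find-and-replace. The name insert_sample_id is made up for illustration, and it assumes String B and every String C open with ['run': as in the sample above.

```python
import re

# Assumption: each line starts with ['sample_id':<six digits>, as in the sample.
ID_RE = re.compile(r"^\['sample_id':(\d{6}),")
# Assumption: String B and every String C open with ['run':
RUN_RE = re.compile(r"\['run':")


def insert_sample_id(line: str) -> str:
    """Copy the leading sample_id in front of every 'run' key on the line."""
    found = ID_RE.match(line)
    if found is None:
        return line  # no ID at the start: leave the line untouched
    replacement = f"['sample_id':{found.group(1)},'run':"
    # A function replacement avoids backslash/group-reference processing.
    return RUN_RE.sub(lambda _match: replacement, line)


if __name__ == "__main__":
    sample = ("['sample_id':121084,[122,'southwest',7.23,[[['station_01',[1]]]],"
              "[[['run':133225,'marker':'SAM','TK',22,34,127],"
              "['run':608049,'marker':'TIM','TM',21,21,966]")
    print(insert_sample_id(sample))
```

Because the ID is looked up per line, the same function can be mapped over every line of a document without breaking the lines up.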

Related

Regex return single match starting from beginning of line

I have text that looks something like this (but much uglier of course):
On 9/02/2019, 9/03/2019 and 9/05/2019 this thing happened
and I would like to regex capture anything related to the date
On 9/02/2019, 9/03/2019 and 9/05/2019
Using regex
([a-zA-Z ]{0,5}(?:\d+\/)+\d+\s*(?:,|and|\s)+)
this seems to work fine (allowing for the first 5 characters of string to be leading words).
But when I tell it I only want to start at the beginning of the string with
^([a-zA-Z ]{0,5}(?:\d+\/)+\d+\s*(?:,|and|\s)+)
It only captures On 9/02/2019,.
Strings that begin with the event followed by the date I will need to treat differently
He was walking when he encountered a stray dog on 9/2/19
https://regex101.com/r/p4H10Z/1
Thanks for the rapid responses. I was hoping to figure out how to make it a single match. Now I see. Grouping all of the date matches in another group that can be repeated does the job.
^([a-zA-Z ]{0,5}((?:\d+\/)+\d+\s*(?:,|and|\s)+)+)
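As a quick check of the grouped version, here is a minimal sketch using Python's re module (the post does not say which regex flavor is in use, so Python is an assumption, and leading_dates is a made-up name). The match keeps the whitespace that follows the last date, so the sketch strips it.

```python
import re

# The anchored pattern from the post, with the date part in a repeatable group.
LEADING_DATES = re.compile(r"^([a-zA-Z ]{0,5}((?:\d+\/)+\d+\s*(?:,|and|\s)+)+)")


def leading_dates(text: str):
    """Return the run of dates at the start of text, or None if there is none."""
    found = LEADING_DATES.match(text)
    return found.group(1).rstrip() if found else None


print(leading_dates("On 9/02/2019, 9/03/2019 and 9/05/2019 this thing happened"))
print(leading_dates("He was walking when he encountered a stray dog on 9/2/19"))
```

The second string returns None, because more than five leading characters come before the date, so it can be routed to separate handling.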

extract the last 2 fields regardless of size

I have been trying to get the last "two fields" of the following strings:
cc-api-data.bar.bar
external-atl3-1.xx.fbcdn.net
fbcdn.net
For the first two strings, I would like to get only "bar.bar" and "fbcdn.net". For the last string, however, I want to match the whole thing, since it is already all I want.
I am pretty confident I could do this in a simple script, but I am trying to use regex in this case. On the last string I can only get the last part, not the whole thing, and I cannot tell the regex which field to take.
I literally just want the last two fields, no matter how many delimiters there are.
Any suggestions, or is it even possible?
My guess is that we might want an expression that would have a $ anchor similar to:
([^.]*)\.([^.]*)$
towards the right side of our strings.
Please see the demo for additional explanation
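For reference, a minimal sketch of that expression applied to the three strings from the question, using Python's re module (the question does not name a language, so Python and the helper name last_two_fields are assumptions):

```python
import re

# The pattern from the answer: two dot-free fields, anchored at the end.
LAST_TWO = re.compile(r"([^.]*)\.([^.]*)$")


def last_two_fields(host: str):
    """Return the last two dot-separated fields, or None if there is no dot."""
    found = LAST_TWO.search(host)
    return found.group(0) if found else None


for host in ("cc-api-data.bar.bar", "external-atl3-1.xx.fbcdn.net", "fbcdn.net"):
    print(host, "->", last_two_fields(host))
```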
I was wondering how the regex knows to get only the part before the last period. Is it because it grabs any character that's not a "." and because it is at the end of the line? Why couldn't it grab the first octet?
Good question, and a bit difficult to explain. By playing this demo, we can watch the many steps taken before getting to our matches:
Steps of Regular Expression
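The engine starts at the left and tests the string character by character against our expression. The early characters (the first field or octet) do pass the character-class rules, but once the engine reaches the $ end anchor, that attempt fails, because the end-of-string rule is broken: there is still more text after the second field. The engine then retries from later positions until the whole expression, including $, matches.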
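The retrying can be imitated by hand. The sketch below (Python, with a made-up helper name first_matching_start) tries the expression at each start position from left to right, which is conceptually what a search does; real engines may optimize this internally.

```python
import re

LAST_TWO = re.compile(r"([^.]*)\.([^.]*)$")


def first_matching_start(text: str):
    """Try the pattern at each start position, left to right."""
    for start in range(len(text) + 1):
        # match() anchors the attempt at `start`; $ still means end of text.
        if LAST_TWO.match(text, start):
            return start
    return None


text = "cc-api-data.bar.bar"
start = first_matching_start(text)
print(start, text[start:])
```

Every attempt that begins inside "cc-api-data" gets as far as the first "bar" and then fails at $, so the first position that succeeds is the start of "bar.bar".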

Validating a list using Regex

I have been asked to help sort out a system that validates some returned MS Excel forms.
A Python program already exists that takes a returned Excel form and another MS Excel workbook, which looks much like the form but has its fields filled with regexes.
The Python code then validates the form by testing the returned form, against the Regexes in turn from the Regex form.
This all works as expected, and pumps out a workbook with a list of problems where a match doesn't occur.
However the validation wasn't always returning results as expected and I was asked to try sorting it out.
I have been reviewing the regexes against a document that describes what the valid responses in the form should be. I can cope with most of them, but a couple of them got me thinking. These are the ones where the valid entry in the form is a list of items, e.g. a list of words separated by commas or newlines.
Every time I come across one of these I have been using the following approach:
^[A-Z]{0,10}((, *|\n)[A-Z]{0,10})*$
So this will match the first uppercase word of up to 10 letters, then the rest of the list with each entry preceded by a , or <CR>. It works, but I wonder if there is a better way. The reason I ask is that the matching pattern for each list entry has to appear in the regex twice, so if a problem is spotted, it has to be corrected in two places.
Is there a better way?
Use
^\b((^|, *|\n)[A-Z]{1,10})+$
It makes sure the text starts with a word character (i.e. a-zA-Z0-9_) by checking for a word boundary, which prevents the text from starting with a comma or an empty line.
Then, repeating at least once, there should be an entry beginning with one of:
the start of the text
a comma followed by any number of spaces
a newline
followed by at least one, and at most ten, uppercase characters.
The first alternative is only possible at the start of the text, the second is for later words on the same line, and the third is for the first word on a new line.
See it here at regex101.
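Since the validator is a Python program, here is a minimal sketch of checking a cell against this pattern with the re module. The function name is_valid_list is hypothetical and not part of the existing program. Note that in Python $ also matches just before a single trailing newline, so \Z may be preferable if cells can end with one.

```python
import re

# The pattern from the answer; the entry pattern [A-Z]{1,10} appears only once.
LIST_RE = re.compile(r"^\b((^|, *|\n)[A-Z]{1,10})+$")


def is_valid_list(cell: str) -> bool:
    """True if cell is a comma- or newline-separated list of uppercase words."""
    return LIST_RE.search(cell) is not None


for cell in ("ABC, DEF\nGHI", ", ABC", "ABC,,DEF", ""):
    print(repr(cell), is_valid_list(cell))
```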

REGEX to find first instance after set length

I'm probably going to get pilloried for asking this question, but after searching and trying to figure out this regex on my own, I'm just tired of wasting time on it. Here's the problem I'm trying to solve. I frequently use EditPad Pro to convert character strings so they will fit into a mainframe.
For instance, I want to convert a column of words from excel into an IN clause for sql. The column is 5000 words or so.
I can easily copy and paste that into the text editor and then using find and replace convert that from a column of words to a single row with ',' separating each word.
Once that's done, though, I want to use a regex to split this row just before or after a comma once 70 characters have gone by.
(?P<start>^.{0,70})
This will give me the first 70 characters, but then I get stuck as I can't figure out how to create the next group to find all the characters up to the next comma so I can refer to it like this
(?P<start>^.{0,70})(?P<next>????,)
If I could get that, then I could do a find and replace that would break the row after the first comma that appears after the 70th character.
I know that given the rest of the day I could figure it out, but I need to move on. I've tried this before. I would even be willing to find only the first 70 characters and the next few characters up to the comma, and then repeat the find and replace multiple times if necessary, but I cannot get the regex to work.
Any assistance with this would be greatly appreciated.
Here is some sample data that I have added line breaks into as an example of what I want it to look like after the regex runs.
'Ability','Absence','Absolute','Absorb','Accident','Acclaim','Accompany',
'Accomplish','Achievement','Acquaintance','Acquire','Across','Acting','Address',
'Admire','Adorable','Advance','Advertisement','Afraid','Agriculture','Align',
'All','Allow','Allowance','Allowed','Alone','Aluminium','Always','America',
'Analyze','Android','Angle','Announce','Annual','Ant','Antarctica','Antler',
I think you should consider restricting your initial concatenation, but here's a solution to your specific implementation:
^.{0,70}[^,]*
This will select the first 70 characters (if available), then every character up to the one before the next comma.
I don't think you need groups here, but you can obviously add them to the regex:
(?P<start>^.{0,70})(?P<next>[^,]*)
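To break the whole row rather than only its first piece, the same idea can be applied repeatedly. The sketch below is in Python, not an EditPad Pro find-and-replace, and the name split_after_comma is made up for illustration. It assumes the row is a single line with no embedded newlines, and it consumes the comma so that each piece ends with it.

```python
import re


def split_after_comma(row: str, width: int = 70) -> list:
    """Break row into pieces, each ending at the first comma past `width` chars."""
    # Up to `width` characters, then everything up to and including the next
    # comma (or up to the end of the row for the final piece).
    chunk = re.compile(r".{1,%d}[^,]*(?:,|$)" % width)
    return [found.group(0) for found in chunk.finditer(row)]


row = ",".join("'w%03d'" % i for i in range(40))
print("\n".join(split_after_comma(row)))
```

Joining the pieces back together with an empty string reproduces the original row, so nothing is lost in the split.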

Regex: How to dynamically get words after the first word and not the last word in a '_' separated string?

Working on a migrations class in php.
If I have a string like this:
create_users_roles_table
and I want to get the words between the first and the last word correctly, and also get the right word when there's only one word in between, like:
create_users_table
How do I go about that?
I've done:
(\B)_([a-zA-Z]+)_?([a-zA-Z]+)_table
and that works fine when I do create_users_roles_table
and produces users and roles.
But when only doing create_users_table it produces user and s.
Obviously I need it to produce only users.
Anyone?
I think it should read
(\B)_([a-zA-Z]+)_?([a-zA-Z]+)?_table
But this won't work if there are three words in between. I'd suggest stripping off the first and last words and then splitting what remains, since I don't think regular expressions can handle a variable number of capture groups.
If you can be sure of how many words there can be, you can always hard-code this. For three or fewer words you can use
(\B)_([a-zA-Z]+)(?:_([a-zA-Z]+))?(?:_([a-zA-Z]+))?_table
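The strip-then-split suggestion could look roughly like the sketch below. It is written in Python for illustration, with a made-up function name middle_words; the same two steps should carry over to PHP with preg_match and explode.

```python
import re

# Everything between the first word and the trailing "_table", as one capture.
MIDDLE_RE = re.compile(r"[a-zA-Z]+_([a-zA-Z]+(?:_[a-zA-Z]+)*)_table")


def middle_words(name: str) -> list:
    """Return the words between the first word and 'table', however many."""
    found = MIDDLE_RE.fullmatch(name)
    return found.group(1).split("_") if found else []


print(middle_words("create_users_roles_table"))
print(middle_words("create_users_table"))
```

Capturing the middle part as a single group and splitting it afterwards avoids having to hard-code one optional group per word.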