extract the last 2 fields regardless of size - regex

I have been trying to get the last "two fields" of the following strings:
cc-api-data.bar.bar
external-atl3-1.xx.fbcdn.net
fbcdn.net
For the first two strings, I would like to get only "bar.bar" and "fbcdn.net". For the last string, however, I want to match the whole thing, since it is already everything I want.
I am pretty confident I could do this in a simple script, but I am trying to use a regex in this case. For the last string I can only get the last part, not the whole thing, and I cannot tell the regex which field to take.
I literally just want the last two fields, no matter how many delimiters there are.
Any suggestions, or is this even possible?

My guess is that we want an expression anchored with $ at the right end of our strings, similar to:
([^.]*)\.([^.]*)$
Please see the demo for additional explanation
I was wondering how the regex knows to get
only the part before the last period. Is it because it grabs any
character that's not a "." and because it is at the end of the line?
Why couldn't it grab the first octet?
Good question, and a bit difficult to explain. By playing this demo, we can watch the many steps the engine takes before reaching our matches:
Steps of Regular Expression
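The engine starts at the first character and tests the expression from there. The early characters (the first octet) satisfy the individual rules [^.]*, \. and [^.]*, but when the engine then reaches the $ anchor it is not yet at the end of the string, so that attempt fails. The engine moves the starting position forward and tries again, and only the attempt that begins at the second-to-last field satisfies every rule, including $.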
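As a minimal sketch of the behaviour described above, here is the same expression run with Python's re module. The choice of Python and the helper name last_two_fields are assumptions made for illustration; the original thread does not name a language.

```python
import re
from typing import Optional

# The expression from the answer: two dot-free runs separated by one dot,
# with the second run required to end at the end of the string.
LAST_TWO = re.compile(r"([^.]*)\.([^.]*)$")


def last_two_fields(name: str) -> Optional[str]:
    """Return the last two dot-separated fields, or None if there is no dot."""
    match = LAST_TWO.search(name)
    return match.group(0) if match else None


for host in ["cc-api-data.bar.bar", "external-atl3-1.xx.fbcdn.net", "fbcdn.net"]:
    print(host, "->", last_two_fields(host))
```

Because search() tries each starting position from left to right, the attempts that begin in the earlier fields fail at $ and the first successful attempt is the one covering the last two fields.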

Related

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The italic parts are variable; they might contain spaces, or a new line might start at that point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, [\s\S]+? is supposed to stand in for any letter, number, space, or line break, as I cannot predict what will appear between the fixed words that I am sure will always be there.
Now comes the hard part: when I add the word "On" somewhere in the text above the sentence that I want to match, the regex matches a much bigger chunk of text than I want. This is due to the use of [\s\S]+.
How can I make my regex match as few characters as possible? Adding "?" after the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more real-life example, with HTML tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex to the text above. Although this is barely possible in HTML, as the user would never write the matching tag himself, I do want to make sure my regex is correct, just in case.
These things are certain: I know what the part of the text starts with, no matter where it sits in the entire text; I know what it ends with; and there are specific fixed words in between that might make the regex more reliable, but they can be omitted. Any text below the searched part may also be matched, but no text above it may be matched at all.
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go on in this text, so the regex has fewer tokens to work with. Adding just the first < already makes the regex take too much.
In my own experience, most problems are avoided by making sure I do not use any [\s\S]+? before a (\r)?(\n)? has come first.
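[\s\S] matches any character, because it is the union of two complementary sets; it behaves like . with the /s option (dot matches newlines). Quantifiers are greedy by default, so the longest match is returned. Making a quantifier lazy does not change where the match starts: the engine still takes the leftmost starting point that can lead to a match, so an earlier "On" or "From" is kept and the match simply grows from there.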
In the "correct" link, the token right after the shortest match must be geschreven, so another, more flexible way to write this without relying on lazy quantifiers is to put a negative lookahead in front of the repeated character set, inside the loop.
So
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is a non-capturing group, which just wraps the negative lookahead and the . (which can be replaced by [\s\S]).
(?! ) inside it is the negative lookahead, which ensures that the current position, before the next character is consumed, is not the start of the end token.
Following the comments, you can state explicitly what must not appear in each repeated sequence:
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
To understand what the technique (?:(?!tokens)[\s\S])+ does:
in the first, this can't appear between From and this
in the second, neither From nor this can appear between From and this
in the third, additionally, neither this nor point can appear between this and point
etc.
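The difference between the lazy version and the lookahead version can be sketched in Python as follows. The sample text and the variable names are made up for illustration, and the question targets VB.NET, so treat this only as a demonstration of the pattern logic; case is ignored, as in the question's example.

```python
import re

# Hypothetical sample: a stray "From" appears above the sentence we want.
TEXT = "From the top\nFrom this point everything: rest"

# Lazy version from the question's simplified example.
lazy = re.compile(
    r"From[\s\S]+?this[\s\S]+?point[\s\S]+?everything:",
    re.IGNORECASE,
)

# Lookahead version from the answer: each loop refuses to step over
# the listed tokens, so a match cannot swallow a second "From".
tempered = re.compile(
    r"From(?:(?!From|this)[\s\S])+this"
    r"(?:(?!this|point)[\s\S])+point"
    r"(?:(?!everything)[\s\S])+everything:",
    re.IGNORECASE,
)

lazy_match = lazy.search(TEXT).group(0)
tempered_match = tempered.search(TEXT).group(0)

print(repr(lazy_match))      # starts at the stray "From" on the first line
print(repr(tempered_match))  # starts at the "From" of the wanted sentence
```

The lazy pattern still begins at the first "From" it finds, while the lookahead pattern fails there (it would have to cross the second "From") and succeeds only from the second one.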

Matching all strings without 3 occurrences of/or final single character in RegEx

Trying to figure out the regex for the title,
i.e.,
foo
foo/bar/foo
foo/bar/foo/bar
foo/bar/d
I don't want it to match the 3rd or the 4th one but match the first two. In the 2nd option, the final foo can be anything but a single d.
You could use a regex but it will be more complicated than just counting the number of slashes and also checking the last character isn't a d. If you want to use a regex to check for the last part not being "/d" you could do something like check that it doesn't match ^.*/d$ but it may be clearer to just use code. (If counting slashes and checking string doesn't end in "/d" isn't exactly what you mean then it will help to have more examples)
Figured it out. See below if anyone is interested.
(^foo/?$)|(^foo/[^/]+/(([^d][^/]*)|(d[^/]+))/?$)
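A quick way to check the expression against the four sample strings is sketched below in Python. The function names are hypothetical, and the second function is only one reading of the "count the slashes" suggestion from the earlier reply, so the two checks are only known to agree on these four samples.

```python
import re

# The expression posted above, unchanged.
PATTERN = re.compile(r"(^foo/?$)|(^foo/[^/]+/(([^d][^/]*)|(d[^/]+))/?$)")


def matches_by_regex(path: str) -> bool:
    """True if the posted regex accepts the path."""
    return PATTERN.search(path) is not None


def matches_by_counting(path: str) -> bool:
    """Assumed non-regex alternative: fewer than 3 slashes, not ending in /d."""
    return path.count("/") < 3 and not path.endswith("/d")


for sample in ["foo", "foo/bar/foo", "foo/bar/foo/bar", "foo/bar/d"]:
    print(sample, matches_by_regex(sample), matches_by_counting(sample))
```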

'Looping' Through a Repeating Substring with Regex

The general problem:
I've got a lot of data I'm trying to clean up and then parse. Each line is really long, but they all have the same structure. It starts with one unique substring, followed by a second unique substring, followed by a substring that repeats about 20 times.
So it's: String A, String B, String C, String C, String C, etc. Every line is in that format.
At the start of String A is an ID, just a unique six digit number. I'm trying to insert that ID at the beginning of String B and all of the String C's.
String C is the problem. I can write regexes for each of the ID, B, and C, but trying to insert the captured ID into all the C's fails: it only works on the last one. That's actually the correct behavior here, but I'm pretty sure there is a way to treat String C so that each instance of the substring is handled separately, with the regex running over it again and again.
I tried using '\G' syntax but I can't seem to make it work.
So here's a specific example using some massively abridged sample data:
['sample_id':121084,[122,'southwest',7.23,[[['station_01',[1]],['station_02',[1]], ['station_03',[22]],['station_04',[49]],['station_05',[1]],['station_06',[4]],['station_07',[101]],['station_08',[22]]]],[[['run':133225,'marker':'SAM',[[['substation_01',[1]],['substation_02',[3]],['substation_03',[16]],['substation_04',[15]],['substation_05',[14]],['substation_06',[6]],['substation_07',[41]],['substation_08',[19]],['substation_09',[13]],['substation_10',[1]],['substation_11',[13]],['substation_12',[1]]]],'TK',22,34,127],['run':608049,'marker':'TIM',[[['substation_01',[12]],['substation_02',[6]],['substation_03',[17]],['substation_04',[11]],['substation_05',[1]],['substation_06',[6]],['substation_07',[5]],['substation_08',[19]]]],'TM',21,21,966],['run':445801,'marker':'RON',[[['substation_01',[5]],['substation_02',[5]],['substation_03',[6]],['substation_04',[11]],['substation_05',[1]],['substation_06',[15]],['substation_07',[11]],['substation_08',[16]],['substation_09',[1]],['substation_10',[13]],['substation_11',[3]]]],'TR',12,33,521],['run':142278, etc...
Just a note: The only difference between String B and all the String Cs is the number of brackets, but that's actually useful once I start parsing this out (ultimately it'll all be JSON).
What I'm trying to get is:
['sample_id':121084,[122,'southwest',7.23,[[['station_01',[1]],['station_02',[1]],['station_03',[22]],['station_04',[49]],['station_05',[1]],['station_06',[4]],['station_07',[101]],['station_08',[22]]]],[[['sample_id':121084,'run':133225,'marker':'SAM',[[['substation_01',[1]],['substation_02',[3]],['substation_03',[16]],['substation_04',[15]],['substation_05',[14]],['substation_06',[6]],['substation_07',[41]],['substation_08',[19]],['substation_09',[13]],['substation_10',[1]],['substation_11',[13]],['substation_12',[1]]]],'TK',22,34,127],['sample_id':121084,'run':608049,'marker':'TIM',[[['substation_01',[12]],['substation_02',[6]],['substation_03',[17]],['substation_04',[11]],['substation_05',[1]],['substation_06',[6]],['substation_07',[5]],['substation_08',[19]]],'TM',21,21,966],['sample_id':121084,'run':445801,'marker':'RON',[[['substation_01',[5]],['substation_02',[5]],['substation_03',[6]],['substation_04',[11]],['substation_05',[1]],['substation_06',[15]],['substation_07',[11]],['substation_08',[16]],['substation_09',[1]],['substation_10',[13]],['substation_11',[3]]],'TR',12,33,521],['sample_id':121084, etc...
In the latter text block each substring now begins with the ID 'sample_id':121084 (I bolded it to make it slightly easier to see what's going on).
Here's the Regex that gets me up through String C.
\[('sample_id':\d{6},)(?:.+\]\]\],\[\[)\[(.+?\d\],)\[(.+?\d\],)
So I'm trying to insert that first capture group ($1) in front of the second group, then the third group over and over and over (about 20x). If I repeat the last bit, I end up killing all but one of the C Strings, which again, I believe to be the 'proper' behavior. I'm trying to figure out how to get around that.
It's a mess I know. But each of those is just one line, and I've got doc after doc that'll have 100 or so lines like that. So a regex that doesn't break up the lines seems best.
I went over this page (Collapse and Capture a Repeating Pattern in a Single Regex Expression) a few times trying to engineer a solution, but again, I couldn't make the \G syntax work here.
Should mention I'm trying to do this in Sublime Text 2. Thanks for any help.
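For comparison, the transformation described above can be sketched as two steps per line outside the editor: capture the ID once, then insert it in front of every 'run': key. This is a Python sketch, not a Sublime Text solution; the function name is hypothetical, and it assumes every String B and String C begins with 'run': as in the sample data.

```python
import re

# Matches the ID field exactly as it appears at the start of String A.
ID_RE = re.compile(r"'sample_id':\d{6},")


def insert_id_before_runs(line: str) -> str:
    """Copy the line's 'sample_id' field in front of every 'run': key."""
    found = ID_RE.search(line)
    if found is None:
        return line  # no ID on this line, leave it untouched
    sample_id = found.group(0)
    # A function replacement avoids any escaping issues in the inserted text.
    return re.sub(r"'run':", lambda _m: sample_id + "'run':", line)


# Heavily shortened stand-in for one line of the real data.
demo = "['sample_id':121084,[122,[[['run':133225,'marker':'SAM'],['run':608049,'marker':'TIM']"
print(insert_id_before_runs(demo))
```

Splitting the job this way sidesteps the repeated-group limitation, because the substitution visits each 'run': occurrence separately instead of relying on one capture group to remember all of them.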

What is wrong with my simple regex that accepts empty strings and apartment numbers?

So I wanted to limit a textbox which contains an apartment number which is optional.
Here is the regex in question:
([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)
Simple enough eh?
I'm using these tools to test my regex:
Regex Analyzer
Regex Validator
Here are the expected results:
Valid
"1234A"
"Z"
"(Empty string)"
Invalid
"A1234"
"fhfdsahds527523832dvhsfdg"
Obviously, if I'm here, the invalid ones are accepted by the regex. The goal of this regex is to accept either 1 to 4 digits with an optional letter, a single letter, or an empty string.
I just can't seem to figure out what's not working, I mean it is a simple enough regex we have here. I'm probably missing something as I'm not very good with regexes, but this syntax seems ok to my eyes. Hopefully someone here can point to my error.
Thanks for all help, it is greatly appreciated.
You need to use the ^ and $ anchors for your first two options as well. Also you can include the second option into the first one (which immediately matches the third variant as well):
^[0-9]{0,4}[A-Z]?$
Without the anchors your regular expression matches because it will just pick a single letter from anywhere within your string.
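The effect of the anchors can be checked against the question's own test strings. This is a small Python sketch; the language and the variable names are assumptions, since the question does not say which engine the textbox uses.

```python
import re

# Original expression from the question: nothing ties it to the whole string.
UNANCHORED = re.compile(r"([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)")

# Anchored expression from the answer.
ANCHORED = re.compile(r"^[0-9]{0,4}[A-Z]?$")

samples = ["1234A", "Z", "", "A1234", "fhfdsahds527523832dvhsfdg"]

unanchored_results = [bool(UNANCHORED.search(s)) for s in samples]
anchored_results = [bool(ANCHORED.search(s)) for s in samples]

for sample, loose, strict in zip(samples, unanchored_results, anchored_results):
    print(repr(sample), "unanchored:", loose, "anchored:", strict)
```

The unanchored expression reports a match for all five inputs because it only needs to find a qualifying piece somewhere inside the string; the anchored one accepts only the three valid inputs.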
Depending on the language, you can also use a negative look ahead.
^[0-9]{0,4}[A-Za-z](?!.*[0-9])
Breakdown:
^[0-9]{0,4} = This looks for a digit, 0 to 4 times, at the beginning of the string
[A-Za-z] = This looks for a single letter (either case)
(?!.*[0-9]) = This only allows the letter if there are no digits anywhere after it.
I haven't quite figured out how to validate against an empty string, but that might be easier to do using tools from whatever language you are using. Something along this logic:
if String doesn't equal $null then check the Regex
Adjust this for however you would do it in your language.
I used RegEx Skinner to validate the answers.
Edit: Fixed error from comments

Regular Expression - exclude part of string

This is a follow-up to a previous question. I have a string "Test 999-99-9", how would I match on everything except the last -9 part? Keep in mind, the last -9 may or may not be there, but if it is, I want to ignore it and match on the rest of the string. Any suggestions?
Alternatively, if it ignored the entire 999-99-9 or 999-99 part, and just returned the "Test" part, that would be fine, too. It seems like that may be easier to do. I basically want to take the following expression and invert it to return the other half of the string: (\d{3}-\d{2}|\d{3}-\d{2}-\d{1})$
RegEx to ignore 999-99-9 and just return "Test" part:
^([\w ]+) [\d]{3}-[\d]{2}-?[\d]?$
OCR Software supports groups:
http://www.laserfiche.com/NewsPortal/Article/2012/05/21/tech-tip-pattern-matching-with-regular-expressions
Note: The parentheses determine which information is extracted from
the text. The other characters determine the pattern that will be
looked for. For example, \d\d\d-\d\d-(\d\d\d\d) will find the social
security number and return the last four digits of it.
^(Test \d{3}-\d{2})(-\d{1})?$ will return everything except the last "-9" from your example, whether the "-9" is present or not.
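Both expressions from the answers can be tried in Python as a rough check. The original context is an OCR product's pattern matching, so the language, the helper first_group, and the constant names are assumptions made only for illustration.

```python
import re
from typing import Optional

# Keeps "Test 999-99" in group 1 and puts the optional "-9" in group 2.
KEEP_PREFIX = re.compile(r"^(Test \d{3}-\d{2})(-\d{1})?$")

# Keeps only the leading label in group 1 and ignores the number part.
LABEL_ONLY = re.compile(r"^([\w ]+) [\d]{3}-[\d]{2}-?[\d]?$")


def first_group(pattern: "re.Pattern[str]", text: str) -> Optional[str]:
    """Return capture group 1 of the match, or None if the text does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


for text in ["Test 999-99-9", "Test 999-99"]:
    print(text, "->", first_group(KEEP_PREFIX, text), "|", first_group(LABEL_ONLY, text))
```

In both cases the part to keep is placed inside the first pair of parentheses, which matches the quoted note that the parentheses determine what is extracted.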