How can I extract RFC1123 hostnames from a string using regular expressions? - regex

I'm looking for a regular expression that would match anything that could be a valid RFC1123 hostname in a string that can contain anything. The idea is to extract everything that could possibly be a hostname (by checking that the substring follows all requirements to be one) - except for the maximum length of 255 characters, which is easy to check on the results afterwards.
I initially came up with:
/(^|[^a-z0-9-])([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?)*)([^a-z0-9-]|$)/i
While this matches some hostnames in parenthesized expression 2 (as intended), it seems to skip others. Looking the problem up on Stack Overflow, I found this related question:
Regular expression to match DNS hostname or IP Address?
Judging by the positive feedback the answer should be correct (although it doesn't verify label size), so I thought I'd give it a try. I converted their expression to an extractable format similar to my previous one:
/(^|[^a-z0-9-])((([a-z0-9]|[a-z0-9][a-z0-9-]*[a-z0-9])\.)*([a-z0-9]|[a-z0-9][a-z0-9-]*[a-z0-9]))([^a-z0-9-]|$)/i
Again, it should return the desired results in parenthesized expression 2, but it appears to skip some valid substrings. I believe there may be a problem with the way I'm checking for delimiters that are not part of the hostname.
Any ideas?

Figured it out. When scanning a string for sequential matches, using delimiters both before and after the desired expression means two characters must be consumed between each pair of hostnames: the trailing delimiter of one match cannot also serve as the leading delimiter of the next. So when hostnames are only one character apart, the second one is skipped!
To obtain correct results one must simply remove the leading delimiter:
/([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?)*)([^a-z0-9-]|$)/i
It is only necessary for validation, not scanning.
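A minimal Python sketch of the effect described above. The sample text and the helper name `scan` are made up for illustration; the first two patterns are the ones from the question and the answer, with the groups not needed for extraction made non-capturing. The third pattern is an alternative that is not from the post: lookarounds check the neighbouring characters without consuming them.

```python
import re

# One RFC 1123 label: 1 to 63 characters, alphanumeric at both ends.
LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
HOST = rf"{LABEL}(?:\.{LABEL})*"
DELIM = r"[^a-z0-9-]"

# Delimiter consumed on both sides (the original attempt).
both_sides = re.compile(rf"(?:^|{DELIM})({HOST})(?:{DELIM}|$)", re.I)
# Delimiter consumed only after the hostname (the fix above).
trailing_only = re.compile(rf"({HOST})(?:{DELIM}|$)", re.I)
# Assumption, not from the post: zero-width checks on both sides.
lookaround = re.compile(rf"(?<![a-z0-9-])({HOST})(?![a-z0-9-])", re.I)


def scan(pattern, text):
    """Return every hostname candidate captured in group 1."""
    return [m.group(1) for m in pattern.finditer(text)]


text = "alpha.example,beta gamma"
print(scan(both_sides, text))     # ['alpha.example', 'gamma']
print(scan(trailing_only, text))  # ['alpha.example', 'beta', 'gamma']
print(scan(lookaround, text))     # ['alpha.example', 'beta', 'gamma']
```

With delimiters on both sides, the comma is consumed as the trailing delimiter of the first match, so "beta" has no leading delimiter left and is skipped.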

Related

Regular Expression misses matches in string

I'm trying to write a regular expression that captures desired strings between strings
("f38 ","f38 ","f1 ", "..") and ("\par","\hich","{","}","","..") from a decompiled DOC file and append each match to an array to eventually be printed out into a new file.
I'm having an issue with catching certain strings between "f38 " and "\hich" (usually when the string spans multiple lines but there is at least 1 exception to this I've found in the example string snippet of the DOC file I'm using on regex101.com)
Here is the regular expression as I have it now
(?<=f38 |f38 | |f1 |\.\.)\w.+(?=\\par|\\cell |\\hich|{|}|\\|\.\.)
The troublesome matches come out including "\hich". Like "e\hich" and "d\hich" and I want to match "e" and "d" respectively in these examples not the \hich portion. I'm thinking the problem is with handling the newline/line-breaks somehow.
Here is a smaller snippet of the input string, I have bolded what is matched and bolded + capitalized the problematic match. From this I want the "e" not the \hich. Note that above there are 2 examples of things going right and "\hich" is not included in the match.
l\hich\af38\dbch\af31505\loch\f38 ..ikely to involve asbestos exposure: removal, encapsulation, alteration, repair, maintenance, insulation, spill/emergency clean-up, transportation, disposal and storage of ACM. The general industry standards cover all other operations where exposure to asb..\hich\af38\dbch\af31505\loch\f38 E\HICH\af38\dbch\af31505\loch\f38 stos is possible
Here is an example with a longer portion of the input string at regex101.com
Any help would be appreciated. Thanks!
The problem is in the part of the pattern meant to match those single-character samples: \w.+ requires at least two characters. So for "e\hich", the \w matches "e", the dot is forced to consume the first backslash, and the match then runs on to the next backslash (which is one of the "terminators" listed in the positive lookahead portion of the regex).
You might want to use * instead of +. If the match still runs past the first terminator, the lazy form .*? may also be needed, since .* is greedy.
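A small Python sketch of the quantifier behaviour, under simplifying assumptions: the input string is invented, and the lookbehind and lookahead are each reduced to a single alternative ("f38 " and a backslash), because Python's `re` module only accepts fixed-width lookbehinds. On this input the greedy `.*` still overshoots, and only the lazy `.*?` returns the single character.

```python
import re

# Simplified, hypothetical input: one lead-in ("f38 ") and "\" as the terminator.
text = r"\loch\f38 e\hich\af38"

plus = re.compile(r"(?<=f38 )\w.+(?=\\)")        # \w plus at least one more character
star = re.compile(r"(?<=f38 )\w.*(?=\\)")        # greedy: still runs to the last "\"
lazy_star = re.compile(r"(?<=f38 )\w.*?(?=\\)")  # stops at the first "\"

print(plus.search(text).group())       # e\hich
print(star.search(text).group())       # e\hich
print(lazy_star.search(text).group())  # e
```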

List of allowed characters from regular expression

Does anyone know a way to extract the allowed characters from a regular expression and construct a user-friendly message?
For example, by providing regular expression
^[a-zA-Z0-9&\-\+_\.\s]{1,10}$
to get something like
a-z A-Z 0-9 & - + _ . with spaces
I am using Java. I can imagine that it could be too complicated or even impossible to cover all types of regular expressions, but maybe you know about some library, tool or algorithm that could help.
Thanks
Yes. It can be done.
What you need is:
Turn your regexp body into a string.
Parse that string (with a regex for instance) that will output the desired list.
Apply possible regexp options (such as ignore case) to the result.
This is tedious work if you're not VERY familiar with Regexp. I actually have code in production doing just that, but it's proprietary so I can't post it here and it's not in Java.
I guess you should first ask yourself whether there is no simpler solution for your problem. If for instance your regexp is a constant, you could associate it with a by-hand list of accepted characters.
If your input is a character-class like the one you provided, you could match it with the expression
([^\\]-[^\\]|\\.|[^^$[\]])
that will give you a list of elements like "a-z", "\+", "_" that you could then tidy up a little further, e.g., removing the "\", and then print it nicely formatted.
And you could extract the length information using
{([0-9]+)(,([0-9]+))?}
that accepts {1,10} as well as {10} with the "from" and "to" values being captured each in their own group.
That should get you started.
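The two expressions above can be tried out with a short sketch. It is written in Python rather than Java, and the helper names (`allowed_elements`, `length_range`, `describe`) and the `CLASS_BODY` pattern that isolates the first character class are assumptions added for illustration. It only handles a simple class like the one in the question.

```python
import re

# Assumed helper pattern: body of the first character class in the regex.
CLASS_BODY = re.compile(r"\[((?:\\.|[^\]\\])*)\]")
# Element pattern from the answer above ("[" escaped inside the class).
ELEMENT = re.compile(r"([^\\]-[^\\]|\\.|[^^$\[\]])")
# Length pattern from the answer above (braces escaped).
LENGTH = re.compile(r"\{([0-9]+)(,([0-9]+))?\}")


def allowed_elements(regex):
    """List the elements of the first character class in `regex`."""
    body = CLASS_BODY.search(regex)
    return ELEMENT.findall(body.group(1)) if body else []


def length_range(regex):
    """Return (min, max) from the first {n} or {n,m} quantifier, or None."""
    m = LENGTH.search(regex)
    if not m:
        return None
    low = int(m.group(1))
    high = int(m.group(3)) if m.group(3) else low
    return low, high


def describe(regex):
    """Build a user-facing summary; the whitespace escape becomes 'with spaces'."""
    parts, spaces = [], False
    for element in allowed_elements(regex):
        if element == r"\s":
            spaces = True
        elif element.startswith("\\"):
            parts.append(element[1:])  # drop the escaping backslash
        else:
            parts.append(element)
    text = " ".join(parts)
    return text + " with spaces" if spaces else text


pattern = r"^[a-zA-Z0-9&\-\+_\.\s]{1,10}$"
print(describe(pattern))      # a-z A-Z 0-9 & - + _ . with spaces
print(length_range(pattern))  # (1, 10)
```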

Wildcard in Word 2013 to match zero or more whitespaces

What is the analog of regular expression's * modifier in Word 2013 wildcards?
In Word 2013 Find tool with wildcards enabled, apparently 0 is not a valid number as the number of matches. For example, if you type in the search box
fe{1,2}d
it will match fed and feed. However,
fe{0,2}d
will just produce an error message. What is the correct expression to match fd, fed, feed, feeed, etc.?
My motivation is to match a specific text when it is alone in a paragraph (i.e., surrounded by paragraph marks ^13), possibly with whitespace after it:
^13hello world {0,}^13
which just produces an error message. I did not find any solution without enabling wildcards, but even with wildcards enabled I can't get it working.
Similarly,
^13hello world #^13
matches one or more spaces, but I need zero or more.
I don't believe Word has ever had an equivalent for the zero-or-more operator, so while I haven't checked in Word 2013, I wouldn't expect to see it there either. (This page is old, but as far as I know it's still pretty authoritative on wildcard searching in Word: http://word.mvps.org/faqs/general/usingwildcards.htm)
In general, I would suggest doing two searches, one without the character and one using the 1-or-more operator.
ETA: Removed bad wildcard search.

C# regular expression for IP list 65.232.211.[001-175]

I want to match an IP against my IP list, which is stored in an ArrayList, but the entries are in this format:
65.232.211.[001-175]
e.g. 68.232.211.133 must match
68.232.211.199 must not match
I want a regular expression for this scenario, but I don't know what it would look like.
I tried, but I'm not getting the correct answer.
Please help.
You could use something like so: 68\\.232\\.211\\.0*([1-9][0-9]?|1[0-6][0-9]|17[0-5]). The last part should match the numerical range you are after (courtesy of Regex_For_Range).
Since the period character in regex is a special character (denoting any character), it needs to be escaped. This is done by putting a backslash in front of it, like so: \.. Since you are using C# (it seems), you need to escape the backslash as well, since it is a special character in C# string literals.
You could, alternatively (and even better than the above), use the following regex to split the IP in two and do whatever validation you need: ^([\d.]+?)\.(\d+)$. This regex would yield 2 groups, so taking 68.232.211.133 as an example, it would yield 68.232.211 and 133.
This lets you compare the initial part of the IP as a string, then convert the last section to a number and perform the range check with ordinary comparison operators.
In my opinion, the second approach should be favoured since it is easier to maintain.
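Both approaches can be compared in a short sketch. It is written in Python rather than C#, so the backslashes are not doubled (a raw string is used instead). The function names and the `ENTRY_RE` parser for list entries are hypothetical, and the prefix 68.232.211 from the examples is assumed throughout, since the question's list entry shows 65 but its test addresses show 68.

```python
import re

# Approach 1: the single pattern from the answer, anchored with fullmatch().
RANGE_RE = re.compile(r"68\.232\.211\.0*([1-9][0-9]?|1[0-6][0-9]|17[0-5])")

# Approach 2: split off the last octet, then compare numerically.
SPLIT_RE = re.compile(r"^([\d.]+?)\.(\d+)$")
# Hypothetical parser for list entries such as "68.232.211.[001-175]".
ENTRY_RE = re.compile(r"^([\d.]+)\.\[(\d+)-(\d+)\]$")


def matches_pattern(ip):
    """Approach 1: the range is baked into the regular expression."""
    return RANGE_RE.fullmatch(ip) is not None


def matches_entry(ip, entry):
    """Approach 2: string compare on the prefix, numeric compare on the last octet."""
    parsed_entry = ENTRY_RE.match(entry)
    parsed_ip = SPLIT_RE.match(ip)
    if not parsed_entry or not parsed_ip:
        return False
    prefix, low, high = parsed_entry.groups()
    last_octet = int(parsed_ip.group(2))
    return parsed_ip.group(1) == prefix and int(low) <= last_octet <= int(high)


entry = "68.232.211.[001-175]"
print(matches_pattern("68.232.211.133"), matches_entry("68.232.211.133", entry))  # True True
print(matches_pattern("68.232.211.199"), matches_entry("68.232.211.199", entry))  # False False
```

With the second approach, a new range only needs a new list entry, not a hand-built numeric pattern.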

Regular Expression for irregularly occurring repeating string

I searched but have not found an answer to the question - maybe it is so obvious that no one else had to ask...
I am using UltraEdit 16.00 to run my Regular Expressions in PERL mode...
Situation:
I have a delimited string that can contain a variable number of repeating segments that must adhere to a very specific format. These segments occur randomly throughout the delimited string.
Example:
CLP*data*data*data~REF*data*data~N1*data*data*data~**CAS*OA*29*99.99**~AMT*I*99.99~SVC*data*data*data*data~**CAS*PR*99.99**~**CAS*CO**99.99**~DTM*150*date~AMT*B6*99.99~SVC*data*data*data*data~CAS*PR*N16*99.99~**CAS*CO* *99.99**...line continues from here.
Correct format - CAS*OA*29*99.99~
Incorrect format 1 - CAS*OA* *99.99~
Incorrect format 2 - CAS*OA**99.99~
Goal:
Identify only those strings where ALL of the CAS segments adhere to the format.
Things I've Tried:
(BTW: I know my Regular Expressions are not optimized, so please give me a break)
CAS Segment Missing value or containing one or more spaces
CAS\*(OA|PR|CR|CO)\*\*[-]?[\d]+\.?[\d]{0,2}~ matches the first instance it finds
CAS\*(OA|PR|CR|CO)\*[\s]+?\*[-]?[\d]+\.?[\d]{0,2}~ matches the first instance it finds
CAS segment NOT Missing value or containing space(s)
CAS\*(OA|PR|CR|CO)\*[^0-9A-Z]+?\*[-]?[\d]+\.?[\d]{0,2}~ Again, matches first instance
Negative Lookahead using combinations of the above (I am new to trying this approach)
^(?:(?!ab).)+$ - ab => one of the above regular expressions - never got it to work
Question:
How do I write the regular expression to enforce/validate the format of EVERY CAS instance no matter how often it occurs (there is a potential for 0 instances)?
To say that every CAS instance in your string is valid is to say that there does not exist at least one invalid CAS sequence. The approach you were getting at with a negative lookahead is the simplest way to represent this - here's an example:
/^(?!.*CAS(?!<whatever matches a valid CAS instance>))/
Basically: "Make sure there does not exist in the string an instance of CAS that is not followed by whatever matches a valid CAS instance". Replace the contents of the second negative lookahead, and include whatever it is before 'CAS' that indicates the start of a CAS instance.
As you can see, you don't need to match the string from start to finish to do what you want.
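A hedged Python sketch of this double negative lookahead. The definition of a valid CAS segment in `VALID_TAIL` is an assumption pieced together from the question's "correct format" line and its attempted patterns, and the sample lines are invented; substitute whatever actually defines a valid segment.

```python
import re

# Assumed shape of a valid CAS segment (everything after the literal "CAS"):
# *<group code>*<reason code>*<amount>, ended by "~" or the end of the string.
VALID_TAIL = r"\*(?:OA|PR|CR|CO)\*[0-9A-Z]+\*-?\d+\.?\d{0,2}(?:~|$)"

# Succeeds only if no "CAS" in the line lacks a valid tail.
ALL_CAS_VALID = re.compile(rf"^(?!.*CAS(?!{VALID_TAIL}))")


def all_cas_valid(line):
    """True when every CAS segment in the line is valid (or there are none)."""
    return ALL_CAS_VALID.match(line) is not None


good = "CLP*a*b~CAS*OA*29*99.99~AMT*I*99.99~CAS*PR*N16*99.99~"
blank_code = "CLP*a*b~CAS*OA* *99.99~AMT*I*99.99~"
missing_code = "CLP*a*b~CAS*OA*29*99.99~CAS*CO**99.99~"
no_cas = "CLP*a*b~AMT*I*99.99~"

print(all_cas_valid(good))          # True
print(all_cas_valid(blank_code))    # False
print(all_cas_valid(missing_code))  # False
print(all_cas_valid(no_cas))        # True
```

A line with no CAS segment at all passes, which fits the question's note that zero instances are possible.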
This idea will make sure the whole line is correct. That is, it will not match the line unless it is correct.
^(regexThatOnlyMatchesASingleCorrectInstance)*$
This starts at the beginning of the line ^, matches as many repetitions as it can (*) of regexThatOnlyMatchesASingleCorrectInstance, and ensures that the end of the string $ is found right after the last one.
Of course this will only work when there is a ~ at the end of the string. For the ~ part, use this: (?:~|$) so that it doesn't require the delimiter at the end of the string.