List of allowed characters from regular expression - regex

Does someone know about some way how to extract allowed characters from regular expression and construct user friendly message?
For example, by providing regular expression
^[a-zA-Z0-9&\-\+_\.\s]{1,10}$
to get something like
a-z A-Z 0-9 & - + _ . with spaces
I am using java. I can imagine that it could be too complicated or even impossible to cover all types of regular expressions, but maybe you know about some library, tool or algorithm that could help.
Thanks

Yes. It can be done.
What you need is:
Turn your regexp body into a string.
Parse that string (with a regex for instance) that will output the desired list.
Apply possible regexp options (such as ignore case to the result).
This is tedious work if you're not VERY familiar with Regexp. I actually have code in production doing just that, but it's proprietary so I can't post it here and it's not in Java.
I guess you should first ask yourself whether there is no simpler solution for your problem. If for instance your regexp is a constant, you could associate it with a by-hand list of accepted characters.

If your input is a character-class like the one you provided, you could match it with the expression
([^\\]-[^\\]|\\.|[^^$[\]])
that will give you a list of elements like "a-z", "\+", "_" that you could then tidy up a little further, e.g., removing the "\", and then print it nicely formatted.
And you could extract the length information using
{([0-9]+)(,([0-9]+))?}
that accepts {1,10} as well as {10} with the "from" and "to" values being captured each in their own group.
That should get you started.

Related

how to make this regular expression [^:]+(?=,|$) so it is not found (" ")

I'm having a problem. I can not understand how to make this regular expression [^:]+(?=,|$) so it is not found (" ")
I need regex to pass access token without quotation marks this expression:
{"access_token":"UEaYoz4xgKQUyjHv9dg6nzaWN52jHbeGRymGVqdo6wd‌​WwXLjoxPydlNkXEOJYki‌​QpEXOHTo99Tn7i9Q-MHP‌​MFmnqmfLjel-0qVVpF1r‌​FxEiB_RtX3kMYm5-ihH7‌​OYB3aEzFvnQ_HsNevGlV‌​72AFKKJrhSP9V637SSYC‌​5MDzU4Wri0_uPW1VMuLu‌​q-IhtOPrSe0lqu86clal‌​ySuevFf5w_jcHPEm5xIx‌​R4pTzELfYluQiFS9JrAC‌​s5tF2d-WwkTZaYhjCf9M‌​Wx5JVqtMJC0x8shPvHZA‌​rH5Um1jpO12UHtRSU6P5‌​rP5VHuEk8AAQmDEv5EYh‌​59RI6jAWKtYRZMEBoJZO‌​UEbF9ZelPB4jYqpx4gsV‌​kP0GVJ57o_d3OiAllvOo‌​kY14u1GXZ3XN1fesOi89‌​srmatVf_J6ka50m9ilrW‌​tzMYWNq6vf2j-JgQA87R‌​80DTaRtCFfg"}
This part of access token need to pass without the quotes:
UEaYoz4xgKQUyjHv9dg6nzaWN52jHbeGRymGVqdo6wd‌​WwXLjoxPydlNkXE‌​OJYki‌​QpEXOHTo99Tn7‌​i9Q-MHP‌​MFmnqmfLjel‌​-0qVVpF1r‌​FxEiB_RtX‌​3kMYm5-ihH7‌​OYB3aEz‌​FvnQ_HsNevGlV‌​72AFK‌​KJrhSP9V637SSYC‌​5MD‌​zU4Wri0_uPW1VMuLu‌​q‌​-IhtOPrSe0lqu86clal‌‌​​ySuevFf5w_jcHPEm5xI‌​x‌​R4pTzELfYluQiFS9J‌​rAC‌​s5tF2d-WwkTZaYh‌​jCf9M‌​Wx5JVqtMJC0x8‌​shPvHZA‌​rH5Um1jpO12‌​UHtRSU6P5‌​rP5VHuEk8‌​AAQmDEv5EYh‌​59RI6jA‌​WKtYRZMEBoJZO‌​UEbF9‌​ZelPB4jYqpx4gsV‌​kP0‌​GVJ57o_d3OiAllvOo‌​k‌​Y14u1GXZ3XN1fesOi89‌‌​​srmatVf_J6ka50m9ilr‌​W‌​tzMYWNq6vf2j-JgQA‌​87R‌​80DTaRtCFfg
You are making things over complicated, JMeter supports Perl5-style regular expressions, it means you can make the quotation marks a part of the search pattern so they will be considered the left and right boundaries like:
"access_token":"(.+?)"
Going forward when it comes to JSON it makes more sense to use JSON Path PostProcessor available since JMeter 3.0. The relevant JSONPath query will be as simple as:
$.access_token
To learn how to develop more complex JSONPath queries see Advanced Usage of the JSON Path Extractor in JMeter guide
If you mean [^: "{}]+ then you have to say so. This matches any string which does not contain any of the four enumerated characters. It will still match only the leftmost longest match, so you will need to anchor the match somehow. Based on your example, I guess you are really looking for
[^:"[{}]+(?="?([},]))
which matches the longest-leftmost string of characters not in the character class which occur just before an optional double quote followed by either a closing brace or a comma.
As others already noted in comments, using regex to parse JSON is generally a very bad idea. There are many corner cases where this will fail; for a start, the JSON could be split on multiple lines, so that the brace or the comma is not on the same line as the access token, and then extracting it by this regex will fail.

What is wrong with my simple regex that accepts empty strings and apartment numbers?

So I wanted to limit a textbox which contains an apartment number which is optional.
Here is the regex in question:
([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)
Simple enough eh?
I'm using these tools to test my regex:
Regex Analyzer
Regex Validator
Here are the expected results:
Valid
"1234A"
"Z"
"(Empty string)"
Invalid
"A1234"
"fhfdsahds527523832dvhsfdg"
Obviously if I'm here, the invalid ones are accepted by the regex. The goal of this regex is accept either 1 to 4 numbers with an optional letter, or a single letter or an empty string.
I just can't seem to figure out what's not working, I mean it is a simple enough regex we have here. I'm probably missing something as I'm not very good with regexes, but this syntax seems ok to my eyes. Hopefully someone here can point to my error.
Thanks for all help, it is greatly appreciated.
You need to use the ^ and $ anchors for your first two options as well. Also you can include the second option into the first one (which immediately matches the third variant as well):
^[0-9]{0,4}[A-Z]?$
Without the anchors your regular expression matches because it will just pick a single letter from anywhere within your string.
Depending on the language, you can also use a negative look ahead.
^[0-9]{0,4}[A-Za-z](?!.*[0-9])
Breakdown:
^[0-9]{0,4} = This look for any number 0 through 4 times at the beginning of the string
[A-Za-z] = This look for any characters (Both cases)
(?!.*[0-9]) = This will only allow the letters if there are no numbers anywhere after the letter.
I haven't quite figured out how to validate against a null character, but that might be easier done using tools from whatever language you are using. Something along this logic:
if String Doesn't equal $null Then check the Rexex
Something along those lines, just adjusted for however you would do it in your language.
I used RegEx Skinner to validate the answers.
Edit: Fixed error from comments

C# regular express for list ips 65.232.211.[001-175]

I want to match IP against my IP list which stored in arraylist but it is in this format
65.232.211.[001-175]
eg. 68.232.211.133 must be match
68.232.211.199 not match
I want regualr express for this scenario but I dont know how it would be..
I tried but not getting correct ans..
Please help me..
You could use something like so: 68\\.232\\.211\\.0*([1-9][0-9]?|1[0-6][0-9]|17[0-5]). The last part should match the numerical range you are after (courtesy of Regex_For_Range).
Since the period character in regex is a special character (denoting any character), it needs to be escaped. This is done by adding an extra slash, like so: \.. Since you are using C# (it seems) you need to escape the slash as well since that is a special character in the C# language.
You could, alternatively (and even better than the above) use the following regex to split the IP in 2 and do what ever validation you need: ^([\d.]+?)\.(\d+)$. This regex would yield 2 groups, so taking 68.232.211.133 as an example, it would yield 68.232.211 and 133.
The above will allow you to match the initial part of the IP as a string and it will then allow you to take the last section of the IP, change it to a numerical value and perform range checks using mathematical operator.
In my opinion, the second approach should be favoured since it is (in my opinion) easier to maintain.

do we ever use regex to find regex expressions?

let's say i have a very long string. the string has regular expressions at random locations. can i use regex to find the regex's?
(Assuming that you are looking for a JavaScript regexp literal, delimited by /.)
It would be simple enough to just look for everything in between /, but that might not always be a regexp. For example, such a search would return /2 + 3/ of the string var myNumber = 1/2 + 3/4. This means that you will have to know what occurs before the regular expression. The regexp should be preceded by something other than a variable or number. These are the cases that I can think of:
/regex/;
var myVar = /regex/;
myFunction(/regex/,/regex/);
return /regex/;
typeof /regex/;
case /regex/;
throw /regex/;
void /regex/;
"global" in /regex/;
In some languages you can use lookbehind, which might look like this (untested!):
(?=<^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/
However, JavaScript does not support that. I would recommend imitating lookbehind by putting the portion of the regexp designed to match the literal itself in a capturing group and accessing that. All cases of which I am aware can be matched by this regexp:
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)
NOTE: This regex sometimes results in false positives in comments.
If you want to also grab modifiers (e.g. /regex/gim), use
(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow|\bvoid|\bin)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/\w*)
If there are any reserved words I am missing that may be followed by a regexp literal, simply add this to the end of the first group: |\bkeyword
All that remains then is to access the capturing group, using a code similar to the following:
var codeString = "function(){typeof /regex/;}";
var searchValue = /(?:^|\n|[^\s\w\/]|\breturn|\btypeof|\bcase|\bthrow)\s*(\/(?:\\\/|[^\/\*\n])(?:\\\/|[^\/\n])*\/)/g;
// the global modifier is necessary!
var match = searchValue.exec(codeString); // "['typeof /regex/','/regex/']"
match = match[1]; // "/regex/"
UPDATE
I just fixed an error with the regexp concerning escaped slashes that would have caused it to get only /\/ of a regexp like /\/hello/
UPDATE 4/6
Added support for void and in. You can't blame me too much for not including this at first, as even Stack Overflow doesn't, if you look at the syntax coloring in the first code block.
What do you mean by "regular expression"? aaaa is a valid regular expression. This is also a regular expression. If you mean a regular expression literal you might need something like this: /\/(?:[^\\\/]|\\.)*\// (adapted from here).
UPDATE
slebetman makes a good point; regular-expression literals don't need to start with /. In Perl or sed, they can start with whatever you want. Essentially, what you're trying to do is risky and probably won't work for all cases.
Its not the best way to go about this.
You can attempt to do so with some degree of confidence (using EOL to break up into substrings and finding ones that look like regular expressions - perhaps delimited by quotation marks) however dont forget that a very long string CAN be a regex, so you will never have complete confidence using this approach.
Yes, if you know whether (and how!) your regex is delimited. Say, for example, that your string is something like
aaaaa...aaa/b/aaaaa
where 'b' is the 'regular expression' delimited by the character / (this is a near-basic scenario); what you have to do is scan the string for the expected delimiter, extract whatever it's inbetween delimiters (paying attention to escape chars) and you should be set.
This, if your delimiter is a known character and if you are sure that it appears an even number of times or you want to discard the rest (for example, which set of delimiters are you considering in the following string: aaa/b/aaa/c/aaa/d)
If this is the case then you need to follow the same reasoning you'd do to find any substring in a given string. Once you've found the first regexp, keep parsing until you hit the end of the string or you find another regexp, and so on.
I suspect, however, that you are looking for a 'general rule' to find any string that, once parsed, would result in a valid regular expression (say we're talking about POSIX regexp-- try man re_format if you're under *BSD). If that is the case you could try every possible substring of every length of the given string and feed it to a regexp parser for syntax correctness. Still, you have proven nothing of the validity of the regexp, i.e. on what they actually match.
If that is what you're trying to do I strongly recommend finding another way or explaining better what you are trying to accomplish here.

Using an asterisk in a RegExp to extract data that is enclosed by a certain pattern

I have an text that consists of information enclosed by a certain pattern.
The only thing I know is the pattern: "${template.start}" and ${template.end}
To keep it simple I will substitute ${template.start} and ${template.end} with "a" in the example.
So one entry in the text would be:
aINFORMATIONHEREa
I do not know how many of these entries are concatenated in the text. So the following is correct too:
aFOOOOOOaaASDADaaASDSDADa
I want to write a regular expression to extract the information enclosed by the "a"s.
My first attempt was to do:
a(.*)a
which works as long as there is only one entry in the text. As soon as there are more than one entries it failes, because of the .* matching everything. So using a(.*)a on aFOOOOOOaaASDADaaASDSDADa results in only one capturing group containing everything between the first and the last character of the text which are "a":
FOOOOOOaaASDADaaASDSDAD
What I want to get is something like
captureGroup(0): aFOOOOOOaaASDADaaASDSDADa
captureGroup(1): FOOOOOO
captureGroup(2): ASDAD
captureGroup(3): ASDSDAD
It would be great to being able to extract each entry out of the text and from each entry the information that is enclosed between the "a"s. By the way I am using the QRegExp class of Qt4.
Any hints? Thanks!
Markus
Multiple variation of this question have been seen before. Various related discussions:
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Using regular expressions how do I find a pattern surrounded by two other patterns without including the surrounding strings?
Use RegExp to match a parenthetical number then increment it
Regex for splitting a string using space when not surrounded by single or double quotes
What regex will match text excluding what lies within HTML tags?
and probably others...
Simply use non-greedy expressions, namely:
a(.*?)a
You need to match something like:
a[^a]*a
You have a couple of working answers already, but I'll add a little gratuitous advice:
Using regular expressions for parsing is a road fraught with danger
Edit: To be less cryptic: for all there power, flexibility and elegance, regular expression are not sufficiently expressive to describe any but the simplest grammars. Ther are adequate for the problem asked here, but are not a suitable replacement for state machine or recursive decent parsers if the input language become more complicated.
SO, choosing to use RE for parsing input streams is a decision that should be made with care and with an eye towards the future.