Given a regular expression, is it possible to programmatically find a string that matches that expression? If so, please mention an algorithm for that, assuming that such a string exists.
Bonus question: Give the performance/complexity of that algorithm, if able.
PS: Note that I am not asking how to programmatically derive a regular expression from a string. Rather, I am asking the reverse problem.
Generex is a Java library for generating strings from a regular expression.
Check it out: https://github.com/mifmif/Generex
Here is the sample Java code demonstrating library usage:
Generex generex = new Generex("[0-3]([a-c]|[e-g]{1,2})");
// Generate a random string that matches the regex.
String randomStr = generex.random();
System.out.println(randomStr); // one of the matching strings listed below
// Get the second string, in lexicographical order, that matches the regex.
String secondString = generex.getMatchedString(2);
System.out.println(secondString); // prints "0b"
// Generate all strings that match the regex.
List<String> matchedStrs = generex.getAllMatchedStrings();
// Use the Generex iterator.
Iterator iterator = generex.iterator();
while (iterator.hasNext()) {
    System.out.print(iterator.next() + " ");
}
// it prints:
// 0a 0b 0c 0e 0ee 0ef 0eg 0f 0fe 0ff 0fg 0g 0ge 0gf 0gg
// 1a 1b 1c 1e 1ee 1ef 1eg 1f 1fe 1ff 1fg 1g 1ge 1gf 1gg
// 2a 2b 2c 2e 2ee 2ef 2eg 2f 2fe 2ff 2fg 2g 2ge 2gf 2gg
// 3a 3b 3c 3e 3ee 3ef 3eg 3f 3fe 3ff 3fg 3g 3ge 3gf 3gg
Another one: https://code.google.com/archive/p/xeger/
Here is the sample Java code demonstrating library usage:
String regex = "[ab]{4,6}c";
Xeger generator = new Xeger(regex);
String result = generator.generate();
assert result.matches(regex);
Assume you define regular expressions like this:
R :=
<literal string>
(RR) -- concatenation
(R*) -- kleene star
(R|R) -- choice
Then you can define a recursive function S(r) which finds a matching string:
S(<literal string>) = <literal string>
S(rs) = S(r) + S(s)
S(r*) = ""
S(r|s) = S(r)
For example: S(a*(b|c)) = S(a*) + S(b|c) = "" + S(b) = "" + "b" = "b".
If you have a more complex notion of regular expression, you can rewrite it in terms of the basic primitives and then apply the above. For example, R+ = RR* and [abc] = (a|b|c).
Note that if you've got a parsed regular expression (so you know its syntax tree), then the above algorithm takes at most time linear in the size of the regular expression (assuming you're careful to perform the string concatenations efficiently).
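As a rough illustration, the recursive definition above could be written in Java like this. It is only a sketch: the class name RegexWitness, the Node interface and the factory methods are made up for this example, and the syntax tree is built by hand instead of being parsed from a regex string.

```java
final class RegexWitness {
    /** A node of an already-parsed regular expression (hypothetical type). */
    public interface Node {
        /** Appends one string matched by this node to out, mirroring S(r). */
        void sample(StringBuilder out);
    }

    public static Node literal(String text) {
        return out -> out.append(text);                           // S(<literal>) = <literal>
    }

    public static Node concat(Node left, Node right) {
        return out -> { left.sample(out); right.sample(out); };   // S(rs) = S(r) + S(s)
    }

    public static Node star(Node inner) {
        return out -> { };                                        // S(r*) = ""
    }

    public static Node choice(Node left, Node right) {
        return out -> left.sample(out);                           // S(r|s) = S(r)
    }

    public static String sample(Node root) {
        StringBuilder out = new StringBuilder();
        root.sample(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // a*(b|c)
        Node r = concat(star(literal("a")), choice(literal("b"), literal("c")));
        System.out.println(sample(r)); // prints "b"
    }
}
```

Appending to a single StringBuilder is what keeps the running time linear in the size of the syntax tree.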
To find the strings in a given source that match an expression, I tried the algorithm below.
i) Create an array of all the strings available in the given source.
ii) Create a function that takes the array, the expression and an initial index.
iii) Call the function recursively, increasing the index on every call, for as long as no matching string has been found.
iv) Return from the function once a string matching the desired expression is found.
Below is the Java code:
public class ExpressionAlgo {
    public static void main(String[] args) {
        String data = "A quantifier defines how often an element can occur. The symbols ?, *, + and {} define the quantity of the regular expressions";
        // Split the text into words and search them starting at index 0.
        regCheck(data.split(" "), "sym", 0);
    }

    // Prints the first word at or after index i that contains the given substring.
    public static void regCheck(String[] ar, String expresion, int i) {
        if (ar[i].contains(expresion)) {
            System.out.println(ar[i]);
            return;
        }
        if (i < ar.length - 1) {
            i = i + 1;
            regCheck(ar, expresion, i);
        }
    }
}
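I estimated the complexity of this code as N^3 because it uses split and contains and calls regCheck recursively. These costs add up rather than multiply, though: split is a single pass over the text, and the recursion calls contains once per word, so for a short fixed expression the total work is roughly linear in the length of the text.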
I want to validate and get the data for the following tags (9F03, 9F02, 9C) using regex:
9F02060000000060009F03070000000010009C0101
The string above is in tag-length-value format,
where 9F02, 9F03 and 9C are tags of fixed length, but their position and value in the string can vary.
Just after the tag comes the length, in bytes, of the value that the tag stores.
For example:
9F02 = tag
06 = length in bytes
000000006000 = value
Thanks,
Ashutosh
Standard regex doesn't know how to count very well; in that respect it behaves like a state machine.
What you can do, though, if the number of possibilities is small, is represent each possibility as a separate alternative in the regex, and use one regex query per tag ...
/9F02(01..|02....|03......)/
/9C(01..|02....)/
... And so on.
Example here.
http://rubular.com/r/euHRxeTLqH
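If typing out one alternative per length gets tedious, the alternation can be generated instead. Here is a minimal Java sketch of that idea. It assumes the length field is two hex digits and that each value byte is written as two hex characters; the names TlvRegexSketch, forTag and extractValue are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TlvRegexSketch {
    /** Builds TAG(01<2 hex chars>|02<4 hex chars>|...) for lengths 1..maxBytes. */
    public static Pattern forTag(String tag, int maxBytes) {
        StringBuilder sb = new StringBuilder(Pattern.quote(tag)).append('(');
        for (int n = 1; n <= maxBytes; n++) {
            if (n > 1) {
                sb.append('|');
            }
            sb.append(String.format("%02X", n));                    // length field
            sb.append("[0-9A-Fa-f]{").append(2 * n).append('}');    // n bytes of value
        }
        return Pattern.compile(sb.append(')').toString());
    }

    /** Returns the value of the first occurrence of tag, or null if none matches. */
    public static String extractValue(String tag, String data, int maxBytes) {
        Matcher m = forTag(tag, maxBytes).matcher(data);
        // group(1) holds length + value, so drop the two length characters.
        return m.find() ? m.group(1).substring(2) : null;
    }

    public static void main(String[] args) {
        String data = "9F02060000000060009F03070000000010009C0101";
        System.out.println(extractValue("9F02", data, 8)); // prints 000000006000
        System.out.println(extractValue("9C", data, 8));   // prints 01
    }
}
```

Like the hand-written patterns, this searches anywhere in the string, so a tag-like byte sequence inside another tag's value could produce a false match.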
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx {
    public static void main(String[] args) {
        String s = "9F02060000000060009F03070000000010009C0101";
        String regEx = "(9F02|9F03|9C)";
        Pattern p = Pattern.compile(regEx);
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.println("Tag : " + m.group());
            // The two characters after the tag hold the length in bytes.
            String length = s.substring(m.end(), m.end() + 2);
            System.out.println("Length : " + length);
            // The value starts right after the length; each byte is two hex characters.
            int valueStart = m.end() + 2;
            int valueEnd = Math.min(s.length(), valueStart + 2 * Integer.parseInt(length));
            String value = s.substring(valueStart, valueEnd);
            System.out.println("Value : " + value);
        }
    }
}
This code gives the following output:
Tag : 9F02
Length : 06
Value : 000000006000
Tag : 9F03
Length : 07
Value : 0000000010009C
Tag : 9C
Length : 01
Value : 01
I am assuming that each byte is written as two hex characters, as in your 9F02 example. Note that the sample declares a length of 07 for 9F03 although only six value bytes precede the 9C tag, so the extracted value runs into that tag. I hope this code helps you get started!
I'd like to know if it's possible to use a value inside the expression as a variable for a second part of the expression.
The goal is to extract some specific strings from a memory dump. One part of the string is based on a (more or less) fixed structure that can be described well using regular expressions. The problem is the second part of the string, which has a variable length and no "footer" or anything else that can be matched as an "END".
Instead, there is a length indicator at position 2 of the first part.
Here is a simplified example of a string that I'd like to find (along with all the others) inside a large file:
00 24 AA BB AA DD EE FF GG HH II JJ ########### (# being unwanted data)
Let's assume that the main structure is always 00 XX AA BB AA, but the last part (starting from DD) varies in length for each string based on the value of XX.
I know that this can be done in code outside regex, but I am curious whether it's possible :)
Short answer: NO
Long answer:
You can achieve what you want in two steps:
Extract the length value from the string
Dynamically build a regexp for matching
PSEUDO CODE
s := '00 24 AA BB AA DD EE FF GG HH II JJ ###########'
re := /00 (\d{2}) AA BB AA/
if s::matches(re) then
    match := re::match(s)
    len := match(1)
    dynamicRE := new Regexp(re::toString() + ' (?:[A-Z]{2} ){' + len + '}')
    // dynamicRE == /00 (\d{2}) AA BB AA (?:[A-Z]{2} ){24}/
    if s::matches(dynamicRE) then
        // MATCH !!
    else
        // NO MATCH !!
    end if
end if
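For reference, the same two-step idea could look like this in Java. This is only a sketch under a few assumptions: XX is read as a decimal count of two-character tokens that follow the header, tokens are separated by single spaces, and the names LengthPrefixedMatch and extract are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LengthPrefixedMatch {
    // Step 1 pattern: the fixed header, capturing the length field XX.
    private static final Pattern HEADER = Pattern.compile("00 (\\d{2}) AA BB AA");

    /** Returns the header plus exactly XX payload tokens, or null if there is no full match. */
    public static String extract(String s) {
        Matcher head = HEADER.matcher(s);
        if (!head.find()) {
            return null;
        }
        int len = Integer.parseInt(head.group(1));
        // Step 2: build a second pattern whose repetition count comes from the data.
        Pattern full = Pattern.compile(
                Pattern.quote(head.group()) + "(?: [0-9A-Z]{2}){" + len + "}");
        Matcher m = full.matcher(s);
        return m.find(head.start()) ? m.group() : null;
    }

    public static void main(String[] args) {
        String dump = "## 00 03 AA BB AA DD EE FF ## ## ##";
        System.out.println(extract(dump)); // prints 00 03 AA BB AA DD EE FF
    }
}
```

The sample input here uses a length of 03 so that the payload actually contains that many tokens.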
For example if the string is:
XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]
The output should be:
20 BB EC 45 40 C8 97 20 84 8B 10
int main()
{
    char input[] = "XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]";
    char output[500];
    // what to write here so that I can get the desired output as:
    // output = "20 BB EC 45 40 C8 97 20 84 8B 10"
    return 0;
}
In C, you could do this with a scanset conversion (though it's a bit RE-like, so the syntax gets a bit strange):
sscanf(input, "[%*[^]]][%[^]]]", second_string);
In case you're wondering how that works, the first [ matches an open bracket literally. Then you have a scanset, which looks like %[allowed_chars] or %[^not_allowed_chars]. In this case, you're scanning up to the first ], so it's %[^]]. In the first one, we have a * between the % and the rest of the conversion specification, which means sscanf will try to match that pattern, but ignore it -- not assign the result to anything. That's followed by a ] that gets matched literally.
Then we repeat essentially the same thing over again, but without the *, so the second data that's matched by this conversion gets assigned to second_string.
With the typo fixed and a bit of extra code added to skip over the initial XYZ ::, working (tested) code looks like this:
#include <stdio.h>

int main() {
    char *input = "XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]";
    char second_string[64];
    sscanf(input, "%*[^[][%*[^]]][%[^]]]", second_string);
    printf("content: %s\n", second_string);
    return 0;
}
Just find the second [ and start extracting (or just printing) until next ]....
You can use string::substr if you are willing to convert to std::string
If you don't know the location of brackets, you can use string::find_last_of for the last bracket and again string::find_last_of to find the open bracket.
Well, say, your file looks like this:
XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]
XYZ ::[1][Maybe some other text]
XYZ ::[1][Some numbers maybe: 123 98345 123 9-834 ]
XYZ ::[1][blah-blah-blah]
The code that will extract the data will look something like this:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    // Open the file to read from.
    std::ifstream file( "in.txt" );
    if( !file.is_open() )
    {
        cout << "Cannot open the file";
        return -1;
    }
    std::string in, out;
    int blockNumber = 1; // Zero-based index of the [] block we want; 1 means the second one.
    while( getline( file, in ) )
    {
        std::string::size_type n = 0; // Index in the string where our target text starts.
        int i = 0; // Counter for the [] blocks we have encountered.
        while( i <= blockNumber )
        {
            // Search for the position of the next '[' symbol, starting
            // from the (n + 1)-th symbol of the string.
            n = in.find_first_of('[', n + 1);
            i++;
        }
        // Get our data (the text between '[' and the next ']') and print it.
        out = in.substr( n + 1, ( in.find_first_of(']', n) - n - 1) );
        std::cout << out << std::endl;
    }
    return 0;
}
The output after executing this will be:
20 BB EC 45 40 C8 97 20 84 8B 10
Maybe some other text
Some numbers maybe: 123 98345 123 9-834
blah-blah-blah
The simplest solution is something along the lines of:
std::string
match( std::string const& input )
{
static boost::regex const matcher( ".*\\[[^]]*\\]\\[(.*)\\]" );
boost::smatch matched;
return regex_match( input, matched, matcher )
? matched[1]
: std::string();
}
The regular expression looks a bit complicated because you need to match
meta-characters, and because the compiler I use doesn't support raw
strings yet. (With raw strings, I think the expression would be
R"^(.*\[[^]]*\]\[(.*)\])^". But I can't verify that.)
This returns an empty string in case there is no match; if you're sure
about the format, you might prefer to throw an exception. You can also
extend it to do as much error checking as necessary: in general, the
more you validate a text input, the better it is, but you didn't give
precise enough information about what was legal for me to fill it out
completely. (For your example string, for example, you might replace
the ".*" at the beginning of the regular expression with
"\\u{3}\\s*::": three upper case characters followed by zero or more
whitespace, then two ':'. Or the first [] group might be
"\\[\\d\\]", if you're certain it's always a single digit.)
This could work for you in a very specific sense:
std::string str(input);
std::string::size_type open = str.find_last_of('[');
std::string::size_type close = str.find_last_of(']');
std::string output = str.substr(open + 1, close - open - 1);
This assumes both brackets are present, and it will only work if you want the bracketed string at the end of the line. You probably need to define your question a little better as well.
Using the C string library (the snippet itself is C++, since it uses new and cout). I'll give a code snippet that processes a single line, which can be used in a loop that reads the file line by line. NOTE: string.h should be included.
char* output = 0;
// Search for the first '['
char* firstBr = strchr( input, '[' );
if( 0 != firstBr ) // check for null pointer
{
    // Search for the second '[', starting just past the first one
    char* secondBr = strchr( firstBr + 1, '[' );
    if( 0 != secondBr )
    {
        ++secondBr; // we don't need '['
        // Drop the trailing ']' (assumes the line ends with it)
        int nOutLen = strlen( secondBr ) - 1;
        if( 0 < nOutLen )
        {
            output = new char[nOutLen+1];
            strncpy( output, secondBr, nOutLen );
            output[ nOutLen ] = '\0';
        }
    }
}
if( 0 != output )
{
    cout << output;
    delete[] output;
    output = 0;
}
else
{
    cout << "Error!";
}
You could use this sscanf scanset (regex-like, though not a true regex) to get what is inside "<" and ">":
// Format: "<%999[^>]>" (Max of 999 Bytes)
int n1 = sscanf(source, "<%999[^>]>", dest);
I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.
          11111
012345678901234
ab ab bc  a cba
In this example, given window size=5, I would like to match cba at positions 12-14, and abc in positions 3-7.
Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?
I am hoping to implement this in Java.
Here's a regex that matches 5-letter sequences that include all of 'a', 'b' and 'c':
(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}
So, while basically matching any 5 characters (with .{5}), there are three preconditions the matches have to observe. Each of them requires one of the tokens/letters to be present (up to 4 characters followed by 'a', etc.). (?=X) matches "X, with a zero-width positive look-ahead", where zero-width means that the character position is not moved while matching.
Doing this with regexes is slow, though. Here's a more direct version (it seems to be about 15x faster than using regular expressions):
public static void find(String haystack, String tokens, int windowLen) {
    char[] tokenChars = tokens.toCharArray();
    int hayLen = haystack.length();
    int pos = 0;
    nextPos:
    while (pos + windowLen <= hayLen) {
        for (char c : tokenChars) {
            int i = haystack.indexOf(c, pos);
            if (i < 0) return;
            if (i - pos >= windowLen) {
                pos = i - windowLen + 1;
                continue nextPos;
            }
        }
        // match found at pos
        System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
        pos++;
    }
}
This tested Java program has a commented regex which does the trick:
import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        String s = "ab ab bc a cba";
        Pattern p = Pattern.compile(
            "# Match 5 char sequences containing: a and b and c\n" +
            "(?=[abc]) # Assert first char is a, b or c.\n" +
            "(?=.{0,4}a) # Assert an 'a' within 5 chars.\n" +
            "(?=.{0,4}b) # Assert a 'b' within 5 chars.\n" +
            "(?=.{0,4}c) # Assert a 'c' within 5 chars.\n" +
            ".{5} # If so, match the 5 chars.",
            Pattern.COMMENTS);
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.print("Match = \"" + m.group() + "\"\n");
        }
    }
}
Note that there is another valid sequence, S9:13 " a cb", in your test data (before the S12:14 "cba"). Assuming you did not want to match this one, I added an additional constraint to filter it out, which requires that the 5 char window must begin with an a, b or c.
Here is the output from the script:
Match = "ab bc"
Match = "a cba"
Well, one possibility (albeit a completely impractical one) is simply to match against all permutations:
abc..|ab.c.|ab..c| .... etc.
This can be factorised somewhat:
ab(c..|.c.|..c)|a.(bc.|b.c .... etc.
I'm not sure if you can do better with regex.
Pattern p = Pattern.compile("(?:a()|b()|c()|.){5}\\1\\2\\3");
String s = "ab ab bc a cba";
Matcher m = p.matcher(s);
while (m.find())
{
    System.out.println(m.group());
}
output:
ab bc
a cb
This is inspired by Recipe #5.7 in Regular Expressions Cookbook. Each back-reference (\1, \2, \3) acts like a zero-width assertion, indicating that the corresponding capturing group participated in the match, even though the group itself didn't consume any characters.
The authors warn that this trick relies on behavior that's undocumented in most flavors. It works in Java, .NET, Perl, PHP, Python and Ruby (original and Oniguruma), but not in JavaScript or ActionScript.
Is it possible (or why is it not possible) to convert an input string to a string that matches a regex at the least Levenshtein distance?
For example, if 1234 is the string and ^([0-9]{6})$ is the regex, I need output something like 123412 (the output string matches the regex and is at distance 2 from the original string; there may be other such strings, but the first result will do).
How can I do this? (No brute force.)
Edit:
Alternatively, can I get the Levenshtein distance only (without the matching string)?
Or what other information, apart from a boolean (match or no match), can a regex give?
If you know about finite automata, you could construct one that represents your regex (there are libraries to do so). Then you run it on your string (1234) and end up in some state. From this state you do a breadth-first search until you reach an accept state. While searching, you keep track of which transitions (characters) you pass over. Those characters give you the shortest string (or one of the shortest) that completes your input into a match for the regex.
Added link: you may have a look at http://www.brics.dk/automaton/, which is an automaton library implemented at Aarhus University (BSD license).
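To make the search concrete, here is a small self-contained Java sketch of it. It does not use the library linked above: the DFA for [0-9]{6} is built by hand, and the class and method names (ShortestCompletionSketch, shortestCompletion, sixDigits) are made up for this example. Note that it only appends characters, so it finds a shortest completion rather than a general minimum-edit-distance string.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShortestCompletionSketch {
    /** transitions.get(state) maps an input character to the next state. */
    private final List<Map<Character, Integer>> transitions = new ArrayList<>();
    private final boolean[] accepting;

    public ShortestCompletionSketch(int states) {
        accepting = new boolean[states];
        for (int i = 0; i < states; i++) {
            transitions.add(new TreeMap<>());
        }
    }

    public void addTransition(int from, char c, int to) { transitions.get(from).put(c, to); }

    public void setAccepting(int state) { accepting[state] = true; }

    /** Shortest suffix that turns prefix into an accepted string, or null if none exists. */
    public String shortestCompletion(String prefix) {
        int state = 0; // state 0 is the start state
        for (char c : prefix.toCharArray()) {
            Integer next = transitions.get(state).get(c);
            if (next == null) {
                return null; // the prefix cannot be extended to a match
            }
            state = next;
        }
        // Breadth-first search: the first accepting state dequeued has a shortest path.
        Map<Integer, String> pathTo = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        pathTo.put(state, "");
        queue.add(state);
        while (!queue.isEmpty()) {
            int current = queue.remove();
            if (accepting[current]) {
                return pathTo.get(current);
            }
            for (Map.Entry<Character, Integer> e : transitions.get(current).entrySet()) {
                if (!pathTo.containsKey(e.getValue())) {
                    pathTo.put(e.getValue(), pathTo.get(current) + e.getKey());
                    queue.add(e.getValue());
                }
            }
        }
        return null; // no accepting state is reachable
    }

    /** Hand-built DFA for [0-9]{6}: state i means "i digits read so far". */
    public static ShortestCompletionSketch sixDigits() {
        ShortestCompletionSketch dfa = new ShortestCompletionSketch(7);
        for (int s = 0; s < 6; s++) {
            for (char c = '0'; c <= '9'; c++) {
                dfa.addTransition(s, c, s + 1);
            }
        }
        dfa.setAccepting(6);
        return dfa;
    }

    public static void main(String[] args) {
        System.out.println("1234" + sixDigits().shortestCompletion("1234")); // prints 123400
    }
}
```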
Update: I have built what you seek using the automaton implementation from above. First, the ExtendedOperations class, which is in the same package as the other automaton classes because I needed access to some of their methods.
package dk.brics.automaton;

public class ExtendedOperations {
    // Taken from Automaton.run and modified to return the end state instead of accept (bool)
    static State endState(String s, Automaton a)
    {
        if (!a.deterministic) a.determinize();
        State p = a.initial;
        for (int i = 0; i < s.length(); i++) {
            p = p.step(s.charAt(i));
            // No transition for this character: the input cannot be completed.
            if (p == null) return null;
        }
        return p;
    }

    public static String getShortestCompletion(Automaton a, String partlyInput)
    {
        State e = endState(partlyInput, a);
        if (e == null) return null;
        return BasicOperations.getShortestExample(e, true);
    }
}
Second, a small test sample:
package subsetautomaton;

import dk.brics.automaton.*;

public class Main {
    public static void main(String[] args) {
        RegExp re = new RegExp("[a-zA-Z0-9._%+-]+\\@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");
        Automaton a = re.toAutomaton();
        System.out.println(ExtendedOperations.getShortestCompletion(a, "a"));
    }
}
The example is a naive email address regular expression. Notice that ^ and $ are implicit in this implementation. Also, @ is escaped with \ because it means 'any string' in this implementation.
The result of the example above is: @-.AA