change string to match regex - regex

Is it possible (or why is it not possible) to convert an input string into a string that matches a regex with the least Levenshtein distance?
For example, if the string is 1234 and the regex is ^([0-9]{6})$, I need output something like 123412 (the output string matches the regex and is at distance 2 from the original string; there may be other such strings, but the first result will do).
How can I do this? (No brute force.)
Edit:
Alternatively, can I get only the Levenshtein distance (without the matching string)?
Or what other information, apart from a boolean (match or no match), can a regex give?

If you know about finite automata, you could construct one that represents your regex (there are libraries for this). Then you run it on your string (1234) and end up in some state. From that state you do a breadth-first search until you reach an accept state. While searching, you keep track of the transitions (characters) you pass over. Those characters give you the shortest string (or one of the shortest) that completes your input into a match for the regex.
Added link: you may have a look at http://www.brics.dk/automaton/, an automaton library implemented at Aarhus University (BSD license).
Update: I have built what you are looking for with the automaton implementation above. First, the ExtendedOperations class, which is in the same package as the other automaton classes because I needed access to some non-public members.
package dk.brics.automaton;

public class ExtendedOperations {
    // Adapted from Automaton.run: returns the state reached instead of a boolean accept flag.
    static State endState(String s, Automaton a)
    {
        if (!a.deterministic) a.determinize();
        State p = a.initial;
        for (int i = 0; i < s.length(); i++) {
            p = p.step(s.charAt(i));
            if (p == null) return null; // no transition: s is not a prefix of any accepted string
        }
        return p;
    }

    public static String getShortestCompletion(Automaton a, String partlyInput)
    {
        State e = endState(partlyInput, a);
        if (e == null) return null;
        return BasicOperations.getShortestExample(e, true);
    }
}
Second, a small test sample:
package subsetautomaton;

import dk.brics.automaton.*;

public class Main {
    public static void main(String[] args) {
        RegExp re = new RegExp("[a-zA-Z0-9._%+-]+\\@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");
        Automaton a = re.toAutomaton();
        System.out.println(ExtendedOperations.getShortestCompletion(a, "a"));
    }
}
The example is a naive email address regular expression. Notice that ^ and $ are implicit in this regex implementation. Also, @ is escaped with \ because an unescaped @ means 'any string' in this implementation.
The result of the example above is: @-.AA
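To make the search step concrete without depending on the library, here is a minimal sketch that hard-codes a toy automaton for the question's regex ^([0-9]{6})$ and applies the same idea: walk the input, then breadth-first search to the nearest accept state. The class name, the step function and the state numbering are made up for illustration and are not part of dk.brics.automaton.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy DFA for ^[0-9]{6}$ (hypothetical): state i means "i digits read"; state 6 accepts. */
final class ShortestCompletionSketch {
    static final int ACCEPT = 6;

    /** Transition function; returns -1 when there is no transition. */
    static int step(int state, char c) {
        return (state < ACCEPT && c >= '0' && c <= '9') ? state + 1 : -1;
    }

    /** Runs the DFA over the input, then searches breadth-first for the nearest accept state. */
    static String shortestCompletion(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = step(state, input.charAt(i));
            if (state < 0) return null; // input is not a prefix of any matching string
        }
        // Maps each visited state to the characters walked over to reach it.
        Map<Integer, String> suffix = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        suffix.put(state, "");
        queue.add(state);
        while (!queue.isEmpty()) {
            int s = queue.remove();
            if (s == ACCEPT) return suffix.get(s);
            for (char c = '0'; c <= '9'; c++) {
                int next = step(s, c);
                if (next >= 0 && !suffix.containsKey(next)) {
                    suffix.put(next, suffix.get(s) + c);
                    queue.add(next);
                }
            }
        }
        return null; // no accept state is reachable
    }

    public static void main(String[] args) {
        System.out.println("1234" + shortestCompletion("1234")); // prints 123400
    }
}
```

Note that this only appends characters, so it finds a shortest completion rather than a minimum Levenshtein distance in general: an input that is too long or contains a non-digit simply yields null here.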

Related

Regular expression to match all digits of unknown length except the last 4 digits

There is a number with unknown length and the idea is to build a regular expression which matches all digits except last 4 digits.
I have tried a lot to achieve this but no luck yet.
Currently I have this regex: "^(\d*)\d{0}\d{0}\d{0}\d{0}.*$"
Input: 123456789089775
Expected output: XXXXXXXXXXX9775
which I am using as follows(and this doesn't work):
String accountNumber ="123456789089775";
String pattern = "^(\\d*)\\d{1}\\d{1}\\d{1}\\d{1}.*$";
String result = accountNumber.replaceAll(pattern, "X");
Please suggest how I should approach this problem or give me the solution.
In this case my whole point is to negate the regex : "\d{4}$"
You may use
\G\d(?=\d{4,}$)
See the regex demo.
Details
\G - start of string or end of the previous match
\d - a digit
(?=\d{4,}$) - a positive lookahead that requires 4 or more digits up to the end of the string immediately to the right of the current location.
Java demo:
String accountNumber ="123456789089775";
String pattern = "\\G\\d(?=\\d{4,}$)"; // Or \\G.(?=.{4,}$)
String result = accountNumber.replaceAll(pattern, "X");
System.out.println(result); // => XXXXXXXXXXX9775
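As a small check of what \G contributes, the sketch below compares the pattern with and without it. The class and method names are only illustrative, and the extra inputs are assumptions, not taken from the question.

```java
/** Illustrative comparison of the masking pattern with and without \G. */
final class MaskDigitsSketch {
    /** The pattern from the answer: every match must start where the previous one ended. */
    static String mask(String s) {
        return s.replaceAll("\\G\\d(?=\\d{4,}$)", "X");
    }

    /** The same pattern without \G: a match may start anywhere in the string. */
    static String maskUnanchored(String s) {
        return s.replaceAll("\\d(?=\\d{4,}$)", "X");
    }

    public static void main(String[] args) {
        System.out.println(mask("123456789089775"));      // XXXXXXXXXXX9775
        System.out.println(mask("abc1234567"));           // abc1234567
        System.out.println(maskUnanchored("abc1234567")); // abcXXX4567
    }
}
```

With \G the masking has to begin at the very start of the string, so an input with a non-digit prefix is left untouched; without it, the trailing run of digits is masked regardless of what precedes it.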
I'm still not allowed to comment as I don't have the required 50 rep yet, but DDeMartini's answer would also swallow account numbers with a non-numeric prefix, since "^(.*)" matches things like abcdef1234 as well. Stick to your \d syntax:
"^(\\d+)(\\d{4}$)"
This seems to work fine and demands digits only (minimum length 5 characters). I tested it like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccountNumberPadder {
    private static final Pattern LAST_FOUR_DIGITS = Pattern.compile("^(\\d+)(\\d{4}$)");

    public static void main(String[] args) {
        String[] accountNumbers = new String[] { "123456789089775", "999775", "1234567890897" };
        for (String accountNumber : accountNumbers) {
            Matcher m = LAST_FOUR_DIGITS.matcher(accountNumber);
            if (m.find()) {
                System.out.println(paddIt(accountNumber, m));
            } else {
                throw new RuntimeException(String.format("Whooaaa - don't work for %s", accountNumber));
            }
        }
    }

    public static String paddIt(String input, Matcher m) {
        StringBuilder b = new StringBuilder();
        for (int i = 0; i < m.group(1).length(); i++) {
            b.append("X");
        }
        // Append the last four digits by position. input.replace(m.group(1), ...) would
        // replace every occurrence of the prefix, e.g. "11234" would become "XX234".
        return b.append(input.substring(m.start(2))).toString();
    }
}
Try:
String pattern = "^(.*)[0-9]{4}$";
Addendum after comment: A refactor to only match full numerics could look like this:
String pattern = "^([0-9]+)[0-9]{4}$";

Regex for custom parsing

Regex isn't my strongest point. Let's say I need a custom parser that strips a string of any letters, any extra minus signs, and every decimal point after the first.
For example, input string is "--1-2.3-gf5.47", the parser would return
"-12.3547".
I could only come up with variations of this :
string.replaceAll("[^(\\-?)(\\.?)(\\d+)]", "")
which removes the alphabets but retains everything else. Any pointers?
More examples:
Input: -34.le.78-90
Output: -34.7890
Input: df56hfp.78
Output: 56.78
Some rules:
Consider only the first negative sign before the first number, everything else can be ignored.
I'm trying to do this using Java.
Assume the -ve sign, if there is one, will always occur before the
decimal point.
Just tested this on ideone and it seemed to work. The comments should explain the code well enough. You can copy/paste this into Ideone.com and test it if you'd like.
It might be possible to write a single regex pattern for it, but you're probably better off implementing something simpler/more readable like below.
The three examples you gave print out:
--1-2.3-gf5.47 -> -12.3547
-34.le.78-90 -> -34.7890
df56hfp.78 -> 56.78
import java.util.*;
import java.lang.*;
import java.io.*;
/* Name of the class has to be "Main" only if the class is public. */
class Ideone
{
public static void main (String[] args) throws java.lang.Exception
{
System.out.println(strip_and_parse("--1-2.3-gf5.47"));
System.out.println(strip_and_parse("-34.le.78-90"));
System.out.println(strip_and_parse("df56hfp.78"));
}
public static String strip_and_parse(String input)
{
//remove anything not a period or digit (including hyphens) for output string
String output = input.replaceAll("[^\\.\\d]", "");
//add a hyphen to the beginning of 'out' if the original string started with one
if (input.startsWith("-"))
{
output = "-" + output;
}
//if the string contains a decimal point, remove all but the first one by splitting
//the output string into two strings and removing all the decimal points from the
//second half
if (output.indexOf(".") != -1)
{
output = output.substring(0, output.indexOf(".") + 1)
+ output.substring(output.indexOf(".") + 1, output.length()).replaceAll("[^\\d]", "");
}
return output;
}
}
In terms of regex, the secondary, tertiary, etc., decimals seem tough to remove. However, this one should remove the additional dashes and alphas: (?<=.)-|[a-zA-Z]. (Hopefully the syntax is the same in Java; this is a Python regex but my understanding is that the language is relatively uniform).
That being said, it seems like you could just run a pretty short "finite state machine"-type piece of code to scan the string and rebuild the reduced string yourself like this:
a = "--1-2.3-gf5.47"
new_a = ""
dash = False
dot = False
nums = '0123456789'
for char in a:
    if char in nums:
        new_a = new_a + char  # record a match to nums
        dash = True  # since we saw a number first, turn on the dash flag, we won't use any dashes from now on
    elif char == '-' and not dash:
        new_a = new_a + char  # if we see a dash and haven't seen anything else yet, we append it
        dash = True  # activate the flag
    elif char == '.' and not dot:
        new_a = new_a + char  # take the first dot
        dot = True  # put up the dot flag
(Again, sorry for the syntax; I think you need some curly brackets around the statements in Java, versus Python's indentation-only style.)
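Since the question asks for Java, the same flag-based scan could be written roughly as follows. This is only a sketch of the approach described above: the class and method names are invented for illustration, and it shares the Python version's assumption that a minus sign, if present, comes before the decimal point.

```java
/** Illustrative Java port of the flag-based scan (names are hypothetical). */
final class SignedDecimalFilter {
    static String stripAndParse(String input) {
        StringBuilder out = new StringBuilder();
        boolean signClosed = false; // true once a digit or the first '-' has been seen
        boolean dotSeen = false;    // true once the first '.' has been kept
        for (char c : input.toCharArray()) {
            if (c >= '0' && c <= '9') {
                out.append(c);
                signClosed = true;  // a later '-' can no longer be a leading sign
            } else if (c == '-' && !signClosed) {
                out.append(c);      // keep only the first '-' before the first digit
                signClosed = true;
            } else if (c == '.' && !dotSeen) {
                out.append(c);      // keep only the first decimal point
                dotSeen = true;
            }
            // every other character is dropped
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripAndParse("--1-2.3-gf5.47")); // -12.3547
        System.out.println(stripAndParse("-34.le.78-90"));   // -34.7890
        System.out.println(stripAndParse("df56hfp.78"));     // 56.78
    }
}
```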

Replace C++ function with Regular Expression

I would like to convert the following C++ method to a regular expression match/replace string pair. Is it possible to do this in a single pass, i.e. with a single call to a regex replace method? (such as this one)
std::string f(std::string value)
{
if (value.length() < 3)
{
value = std::string("0") + value;
}
value = value.substr(0, value.length() - 2) + std::string(".") + value.substr(value.length() - 2, 2);
return value;
}
The input is a string of one or more digits.
Some examples:
f("1234") = "12.34"
f("123") = "1.23"
f("12") = "0.12"
f("1") = ".01"
The only way I've been able to achieve this so far is by using 2 steps:
1. Apply a prefix of "00" to the input string.
2. Use the following regex match/replace pair:
Match: (0*)(\d+)(\d{2})
Replace: $2.$3
My question is, can this be done in a single "pass" by only calling the Regex replace method once and without prepending anything to the string beforehand.
I believe this isn't possible with a single expression/replacement, but I'd just like someone to confirm that (or otherwise provide a solution :) ).
I hope this will help. (Change a bit again) x3.
string a_="123456";
a_="14";
a_="9";
string a = regex_replace(a_,regex("(.*)(.{2})|()"),string("$1.$2."));
//a = regex_replace(regex_replace(a,regex("^"),string("00$1$2")),regex("(.+)(.{2})"),string("$1.$2"));
//a = regex_replace("00"+a,regex("(.+)(.{2})"),string("$1.$2"));
float i=atof(a.c_str());
if(!(i))//just go here for 0-9
{
i=atof((string("0.0")+a_).c_str());
}
cout<<i<<endl;
return 0;

Matching token sequences

I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.
          11111
012345678901234
ab ab bc a  cba
In this example, given window size=5, I would like to match cba at positions 12-14, and abc in positions 3-7.
Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?
I am hoping to implement this in Java.
Here's a regex that matches 5-letter sequences that include all of 'a', 'b' and 'c':
(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}
So, while basically matching any 5 characters (with .{5}), there are three preconditions the matches have to observe. Each of them requires one of the tokens/letters to be present (up to 4 characters followed by 'a', etc.). (?=X) matches "X, with a zero-width positive look-ahead", where zero-width means that the character position is not moved while matching.
Doing this with regexes is slow, though. Here's a more direct version (in my test it was about 15x faster than the regular expression):
public static void find(String haystack, String tokens, int windowLen) {
char[] tokenChars = tokens.toCharArray();
int hayLen = haystack.length();
int pos = 0;
nextPos:
while (pos + windowLen <= hayLen) {
for (char c : tokenChars) {
int i = haystack.indexOf(c, pos);
if (i < 0) return;
if (i - pos >= windowLen) {
pos = i - windowLen + 1;
continue nextPos;
}
}
// match found at pos
System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
pos++;
}
}
This tested Java program has a commented regex which does the trick:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "ab ab bc a cba";
Pattern p = Pattern.compile(
"# Match 5 char sequences containing: a and b and c\n" +
"(?=[abc]) # Assert first char is a, b or c.\n" +
"(?=.{0,4}a) # Assert an 'a' within 5 chars.\n" +
"(?=.{0,4}b) # Assert a 'b' within 5 chars.\n" +
"(?=.{0,4}c) # Assert a 'c' within 5 chars.\n" +
".{5} # If so, match the 5 chars.",
Pattern.COMMENTS);
Matcher m = p.matcher(s);
while (m.find()) {
System.out.print("Match = \""+ m.group() +"\"\n");
}
}
}
Note that there is another valid sequence, S9:13 " a cb", in your test data (before the S12:14 "cba"). Assuming you did not want to match this one, I added an additional constraint to filter it out, which requires that the 5-char window begin with an a, b or c.
Here is the output from the script:
Match = "ab bc"
Match = "a cba"
Well, one possibility (albeit a completely impractical one) is simply to match against all permutations:
abc..|ab.c.|ab..c| .... etc.
This can be factorised somewhat:
ab(c..|.c.|..c)|a.(bc.|b.c .... etc.
I'm not sure if you can do better with regex.
Pattern p = Pattern.compile("(?:a()|b()|c()|.){5}\\1\\2\\3");
String s = "ab ab bc a cba";
Matcher m = p.matcher(s);
while (m.find())
{
System.out.println(m.group());
}
output:
ab bc
a cb
This is inspired by Recipe #5.7 in Regular Expressions Cookbook. Each back-reference (\1, \2, \3) acts like a zero-width assertion, indicating that the corresponding capturing group participated in the match, even though the group itself didn't consume any characters.
The authors warn that this trick relies on behavior that's undocumented in most flavors. It works in Java, .NET, Perl, PHP, Python and Ruby (original and Oniguruma), but not in JavaScript or ActionScript.

Regular Expression to find numbers with same digits in different order

I have been looking for a regular expression with Google for an hour or so now and can't seem to work this one out :(
If I have a number, say:
2345
and I want to find any other number with the same digits but in a different order, like this:
2345
For example, I match
3245 or 5432 (same digits but different order)
How would I write a regular expression for this?
There is an "elegant" way to do it with a single regex:
^(?:2()|3()|4()|5()){4}\1\2\3\4$
will match the digits 2, 3, 4 and 5 in any order. All four are required.
Explanation:
(?:2()|3()|4()|5()) matches one of the numbers 2, 3, 4, or 5. The trick is now that the capturing parentheses match an empty string after matching a number (which always succeeds).
{4} requires that this happens four times.
\1\2\3\4 then requires that all four backreferences have participated in the match - which they do if and only if each number has occurred once. Since \1\2\3\4 matches an empty string, it will always match as long as the previous condition is true.
For five digits, you'd need
^(?:2()|3()|4()|5()|6()){5}\1\2\3\4\5$
etc...
This will work in nearly any regex flavor except JavaScript.
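For anyone who wants to try the empty-group trick from Java, a minimal check might look like this. The class and method names are made up for the example, and the test strings beyond 3245 and 5432 are assumptions, not from the question.

```java
import java.util.regex.Pattern;

/** Illustrative wrapper around the empty-capturing-group pattern for the digits 2, 3, 4, 5. */
final class SameDigitsRegexSketch {
    private static final Pattern PERMUTATION_OF_2345 =
            Pattern.compile("^(?:2()|3()|4()|5()){4}\\1\\2\\3\\4$");

    /** True only if the candidate consists of 2, 3, 4 and 5, each used exactly once. */
    static boolean isPermutationOf2345(String candidate) {
        return PERMUTATION_OF_2345.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "3245", "5432", "2245", "23456" }) {
            System.out.println(s + " -> " + isPermutationOf2345(s));
        }
    }
}
```

A string such as 2245 fails because the group after 3 never participates, so the backreference \2 cannot match.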
I don't think a regex is appropriate. So here is an idea that is faster than a regex for this situation:
1. Check the string lengths; if they are different, return false.
2. Make a hash from the characters (digits in your case) to integers for counting.
3. Loop through the characters of your first string:
   increment the counter for that character: hash[character]++
4. Loop through the characters of the second string:
   decrement the counter for that character: hash[character]--
   break if any count is negative (or nonexistent)
5. Loop through the entries, making sure each is 0:
   if all are 0, return true
   else return false
EDIT: Java Code (I'm using Character for this example, not exactly Unicode friendly, but it's the idea that matters now):
import java.util.*;
public class Test
{
public boolean isSimilar(String first, String second)
{
if(first.length() != second.length())
return false;
HashMap<Character, Integer> hash = new HashMap<Character, Integer>();
for(char c : first.toCharArray())
{
if(hash.get(c) != null)
{
int count = hash.get(c);
count++;
hash.put(c, count);
}
else
{
hash.put(c, 1);
}
}
for(char c : second.toCharArray())
{
if(hash.get(c) != null)
{
int count = hash.get(c);
count--;
if(count < 0)
return false;
hash.put(c, count);
}
else
{
return false;
}
}
for(Integer i : hash.values())
{
if(i.intValue()!=0)
return false;
}
return true;
}
public static void main(String ... args)
{
//tested to print false
System.out.println(new Test().isSimilar("23445", "5432"));
//tested to print true
System.out.println(new Test().isSimilar("2345", "5432"));
}
}
This will also work for comparing letters or other character sequences, like "god" and "dog".
Put the digits of each number in two arrays, sort the arrays, find out if they hold the same digits at the same indices.
RegExes are not the right tool for this task.
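The sorting idea can be sketched in a few lines of Java. The class and method names are hypothetical, and the inputs are treated as plain strings of characters.

```java
import java.util.Arrays;

/** Illustrative sort-and-compare check for "same digits, any order". */
final class SameDigitsSortSketch {
    static boolean sameDigits(String first, String second) {
        char[] a = first.toCharArray();
        char[] b = second.toCharArray();
        Arrays.sort(a); // after sorting, equal multisets of digits become equal arrays
        Arrays.sort(b);
        return Arrays.equals(a, b); // also false when the lengths differ
    }

    public static void main(String[] args) {
        System.out.println(sameDigits("2345", "5432")); // true
        System.out.println(sameDigits("2345", "5542")); // false
    }
}
```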
You could do something like this to ensure the right characters and length
[2345]{4}
Ensuring they only exist once is trickier and why this is not suited to regexes
(?=.*2.*)(?=.*3.*)(?=.*4.*)(?=.*5.*)[2345]{4}
The simplest regular expression is just all 24 permutations added up via the or operator:
/2345|3245|5432|.../;
That said, you don't want to solve this with a regex if you can get away with it. A single pass through the two numbers as strings is probably better:
1. Check the string length of both strings - if they're different you're done.
2. Build a hash of all the digits from the number you're matching against.
3. Run through the digits in the number you're checking. If you hit a match in the hash, mark it as used. Keep going until you don't get an unused match in the hash or run out of items.
I think it's very simple to achieve if you're OK with matching a number that doesn't use all of the digits. E.g. if you have a number 1234 and you accept a match with the number of 1111 to return TRUE;
Let me use PHP for an example as you haven't specified what language you use.
$my_num = 1245;
$my_pattern = '/[' . $my_num . ']{4}/'; // this resolves to pattern: /[1245]{4}/
$my_pattern2 = '/[' . $my_num . ']+/'; // as above but numbers can be of any length
$number1 = 4521;
$match = preg_match($my_pattern, $number1); // returns 1 (match)
$number2 = 2222444111;
$match2 = preg_match($my_pattern2, $number2); // returns 1 (match)
$number3 = 888;
$match3 = preg_match($my_pattern, $number3); // returns 0 (no match)
$match4 = preg_match($my_pattern2, $number3); // returns 0 (no match)
Something similar will work in Perl as well.
Regular expressions are not appropriate for this purpose. Here is a Perl script:
#!/usr/bin/perl
use strict;
use warnings;
my $src = '2345';
my @test = qw( 3245 5432 5542 1234 12345 );
my $canonical = canonicalize( $src );
for my $candidate ( @test ) {
    next unless $canonical eq canonicalize( $candidate );
    print "$src and $candidate consist of the same digits\n";
}
sub canonicalize { join '', sort split //, $_[0] }
Output:
C:\Temp> ks
2345 and 3245 consist of the same digits
2345 and 5432 consist of the same digits