How is a regular expression matched? - regex

Recently, in an interview, I was asked the following question: I have a string containing a couple of billion characters, both ASCII and non-ASCII. The task was to remove all the non-ASCII characters so that the output contains only ASCII characters. The solution had to be a time-efficient algorithm.
I suggested two approaches:
Make an array of ASCII characters. Loop over the string and check whether the current character is in that array. If it is, skip it; otherwise replace it with null.
Obviously, it's not a time-efficient solution.
Secondly, I suggested partitioning the string in half, then in half again, and so on. But I would still be checking every character against the ASCII set, as in the approach above.
This conversation led to a discussion in which the interviewer was looking for a solution that does not go character by character, and he suggested using regular expressions.
My question is: when we match a pattern using regular expressions, does the engine check the string character by character, or does it use some other approach? I was sure that regular expressions find matches character by character.
Can anyone please clear my doubt?
Thanks

You could use a range like this:
[\x20-\x7E]
This range matches every character from [space] to ~, i.e. the printable ASCII range.
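As a minimal sketch of how this range could be applied to the original task, here it is in Python (my choice of language, not the asker's). The negated class matches everything outside the printable range, so note that it also removes ASCII control characters such as tab and newline, which may or may not be what is wanted. The names NON_PRINTABLE_ASCII and keep_printable_ascii are hypothetical.

```python
import re

# Matches any character that is NOT in the printable ASCII range (space to ~).
NON_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7E]")


def keep_printable_ascii(text: str) -> str:
    """Return text with every character outside the printable ASCII range removed."""
    return NON_PRINTABLE_ASCII.sub("", text)
```

The engine still has to inspect every character of the input to do this; the regex only hides the loop.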

Regular expressions do indeed use optimisations for cases where a sequence of characters is matched: simply put, if you're looking for "XXXXXXX", you know you can test every 7th character and only look closer once you find an X there. However, you need to filter every single character, which means a regular expression would not be more efficient (and indeed it would be less efficient, because you would need to go in and out of the regexp engine to process your discoveries).
Instead, the efficient method (assuming C-like architecture) would be to start with two indices (source and result) at zero, and process the string: if the character has the high-bit clear, it's ASCII: copy from source to result, increment both indices. If the high-bit is set, it's non-ASCII: just increment source index.
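#include <stddef.h>

void removeNonAscii(char *str) {
    /* size_t rather than int: the string may hold billions of characters */
    size_t s, r;
    for (s = 0, r = 0; str[s]; s++) {
        /* high bit clear means ASCII: keep the character */
        if (!(str[s] & 128)) {
            str[r++] = str[s];
        }
    }
    /* terminate the shortened string */
    str[r] = 0;
}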
(or you can make a non-destructive one, by copying into a new string instead of overwriting the current one; the algorithm is the same.)
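A minimal sketch of the non-destructive variant, written in Python rather than C and assuming the input is available as a bytes object. The function name remove_non_ascii is hypothetical; the test on the high bit is the same as in the C code above.

```python
def remove_non_ascii(data: bytes) -> bytes:
    """Copy only the bytes whose high bit is clear (ASCII) into a new buffer."""
    result = bytearray()
    for byte in data:
        if not (byte & 0x80):  # high bit clear: ASCII
            result.append(byte)
    return bytes(result)
```

The input is left untouched, at the cost of allocating a second buffer.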

Related

^ and $ expressed in fundamental operations in regular expressions

I've read a book which states that the fundamental operations in regular expressions are concatenation, or (|), closure (*), and parentheses to override default precedence. Every other operation is just a shortcut for one or more fundamental operations.
For example, the (AB)+ shortcut expands to (AB)(AB)* and (AB)? to (ε | AB), where ε is the empty string. First of all, I looked at the ASCII table and I am not sure which character code is assigned to the empty string. Is it ASCII 0?
I'd like to figure out how to express the shortcuts ^ and $, as in ^AB or AB$, in terms of the fundamental operations, but I am not sure how to do this. Can you help me out with how this is expressed in fundamentals?
Regular expressions, the way they are defined in mathematics, are actually string generators, not search patterns. They are used as a convenient notation for a certain class of sets of strings. (Those sets can contain an infinite number of strings, so enumerating all elements is not practical.)
In a programming context, regexes are usually used as flexible search patterns. In mathematical terms we're saying, "find a substring of the target string S that is an element of the set generated by regex R". This substring search is not part of the regex proper; it's like there's a loop around the actual regex engine that tries to match every possible substring against the regex (and stops when it finds a match).
In fundamental regex terms, it's like there's an implicit .* added before and after your pattern. When you look at it this way, ^ and $ simply prevent .* from being added at the beginning/end of the regex.
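The implicit .* can be made visible with a small Python sketch. It assumes Python's re module, where re.fullmatch stands in for the formal "the whole string is in the set" test; the three helper names are hypothetical. re.DOTALL is used so that . really matches any character. Python's own $ also matches just before a trailing newline, which this sketch does not try to imitate.

```python
import re


def search_like(pattern: str, text: str) -> bool:
    """Unanchored search: implicit .* on both sides of the pattern."""
    return re.fullmatch(".*(?:" + pattern + ").*", text, flags=re.DOTALL) is not None


def anchored_at_start(pattern: str, text: str) -> bool:
    """Behaves like ^pattern: the leading .* is left out."""
    return re.fullmatch("(?:" + pattern + ").*", text, flags=re.DOTALL) is not None


def anchored_at_end(pattern: str, text: str) -> bool:
    """Behaves like pattern$: the trailing .* is left out."""
    return re.fullmatch(".*(?:" + pattern + ")", text, flags=re.DOTALL) is not None
```

For example, search_like("AB", "xxABxx") is true, while anchored_at_start("AB", "xxABxx") and anchored_at_end("AB", "xxABxx") are both false.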
As an aside, regexes (as commonly used in programming) are not actually "regular" in the mathematical sense; i.e. there are many constructs that cannot be translated to the fundamental operations listed above. These include backreferences (\1, \2, ...), word boundaries (\b, \<, \>), look-ahead/look-behind assertions ((?= ), (?! ), (?<= ), (?<! )), and others.
As for ε: It has no character code because the empty string is a string, not a character. Specifically, a string is a sequence of characters, and the empty string contains no characters.
^AB can be expressed as (εAB), i.e. an empty string followed by AB, and AB$ can be expressed as (ABε), that is, AB followed by an empty string.
The empty string is actually defined as '', a string of length 0, so it has no value in the ASCII table. However, the C programming language terminates all strings with the ASCII NUL character; although this is not counted in the length of the string, it must still be accounted for when allocating memory.
EDIT
As @melpomene pointed out in their comment, εAB is equivalent to AB, which makes the above invalid. Having talked to a work colleague, I'm no longer sure how to do this or even whether it's possible. Hopefully someone can come up with an answer.

Replace odd length substrings of character

I am struggling with a little problem concerning regular expressions.
I want to replace all odd length substrings of a specific character with another substring of the same length but with a different character.
All even sequences of the specified character should remain the same.
Simplified example: A string contains the letters a,b and y and all the odd length sequences of y's should be replaced by z's:
abyyyab -> abzzzab
Another possible example might be:
ycyayybybcyyyyycyybyyyyyyy
becomes
zczayybzbczzzzzcyybzzzzzzz
I have no problem matching all the sequences of odd length using a regular expression.
Unfortunately I have no idea how to incorporate the length information from these matches into the replacement string.
I know I have to use backreferences/capture groups somehow, but even after reading lots of documentation and Stack Overflow articles I still don't know how to pursue the issue correctly.
Concerning possible regex engines, I am working mainly with Emacs or Vim.
In case I have overlooked an easier general solution without a complicated regular expression (e.g. a small and fixed series of simple search and replace commands), this would help too.
Here's how I'd do it in vim:
:s/\vy@<!y(yy)*y@!/\=repeat('z', len(submatch(0)))/g
Explanation:
The regex we're using is \vy@<!y(yy)*y@!. The \v at the beginning turns on the "very magic" option, so we don't have to escape as much. Without it, we would have y\@<!y\(yy\)*y\@!.
The basic idea for this search is that we're looking for a single 'y', y, followed by a run of pairs of 'y's, (yy)*. Then we add y@<! to guarantee there isn't a 'y' before our match, and add y@! to guarantee there isn't a 'y' after our match.
Then we replace this using an evaluated expression, i.e. \=. From :h sub-replace-\=:
*sub-replace-\=* *s/\=*
When the substitute string starts with "\=" the remainder is interpreted as an
expression.
The special meaning for characters as mentioned at |sub-replace-special| does
not apply except for "<CR>". A <NL> character is used as a line break, you
can get one with a double-quote string: "\n". Prepend a backslash to get a
real <NL> character (which will be a NUL in the file).
The "\=" notation can also be used inside the third argument {sub} of
|substitute()| function. In this case, the special meaning for characters as
mentioned at |sub-replace-special| does not apply at all. Especially, <CR> and
<NL> are interpreted not as a line break but as a carriage-return and a
new-line respectively.
When the result is a |List| then the items are joined with separating line
breaks. Thus each item becomes a line, except that they can contain line
breaks themselves.
The whole matched text can be accessed with "submatch(0)". The text matched
with the first pair of () with "submatch(1)". Likewise for further
sub-matches in ().
TL;DR, :s/foo/\=blah replaces foo with the result of evaluating blah as Vimscript code. So the code we're evaluating is repeat('z', len(submatch(0))), which simply makes one 'z' for each 'y' we've matched.
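For comparison, the same idea can be sketched outside Vim. The following uses Python's re module, which is an assumption on my part since the question is about Emacs and Vim. The look-behind and look-ahead play the role of y@<! and y@!, and a replacement function plays the role of \=. The names ODD_RUN_OF_Y and replace_odd_runs are hypothetical.

```python
import re

# One 'y' plus any number of 'yy' pairs, with no 'y' directly before or after.
ODD_RUN_OF_Y = re.compile(r"(?<!y)y(?:yy)*(?!y)")


def replace_odd_runs(text: str) -> str:
    """Replace every odd-length run of 'y' with a run of 'z' of the same length."""
    return ODD_RUN_OF_Y.sub(lambda m: "z" * len(m.group(0)), text)
```

As in the Vim version, the length information comes from the matched text itself, not from a capture group.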

quickly find a short string that isn't a substring in a given string

I've been trying to serialize some data with a delimiter and ran into issues.
I'd like to be able to quickly find a string that isn't a substring of a given string (in case the given string already contains the delimiter), so that I can use it as the delimiter.
If I didn't care about size, the quickest way to find one would be to check one character in the given string, pick a different character, and make a string of the given string's length consisting only of that character.
There may be a way to do some sort of check, testing first the middle characters, then the middle of the first and last segment... but I didn't see a clear algorithm there.
My current idea, which is fairly quick but not optimal, is:
Initialize a hash with all characters as keys and 0 as the count.
Read the string's characters as bytes, using the hash to count them.
Walk the keys to find the character with the smallest count, stopping immediately if I find one with a count of zero.
Use a run of that character, one longer than its count, as the delimiter.
I believe that is O(n), though obviously not the shortest. But the delimiter will always be no more than n/256 + 1 characters.
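The counting idea above can be sketched as follows. This is a minimal Python version (the question appears to be about Perl), assuming the data is a bytes object; find_delimiter is a hypothetical name. A byte that occurs k times in total cannot appear k + 1 times in a row, so the returned run is guaranteed not to be a substring.

```python
from collections import Counter


def find_delimiter(data: bytes) -> bytes:
    """Return a run of the rarest byte, one longer than its total count in data."""
    counts = Counter(data)  # byte value -> number of occurrences
    best_byte, best_count = 0, len(data) + 1
    for byte in range(256):
        count = counts[byte]  # Counter yields 0 for bytes that never occur
        if count < best_count:
            best_byte, best_count = byte, count
            if count == 0:
                break  # an unused byte is the shortest possible delimiter
    return bytes([best_byte]) * (best_count + 1)
```

As the question notes, this is linear in the input but does not necessarily give the shortest delimiter.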
I could also try some sort of trie-based construction, but I'm not quite sure how to implement that, and that's O(n^2), right?
https://cs.stackexchange.com/questions/21896/algorithm-request-shortest-non-existing-substring-over-given-alphabet
may be helpful.
Your counting of characters method isn't sufficient because you're only talking about the current string. The whole point of a delimiter is that in theory you're separating multiple strings, and therefore you'd need to count all of them.
I see two potential alternative solutions
Pick a delimiter and escape that delimiter in the strings.
You can use URI::Escape to escape a specific character, say &, and use that as the delimiter.
Specify the size of your string before you send it. That way you know exactly how many characters to pull. Essentially pack and unpack.
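The length-prefix alternative can be sketched like this. It is a Python illustration using the standard struct module, not Perl's pack and unpack, and the 4-byte big-endian length header is an assumption chosen for the example. pack_strings and unpack_strings are hypothetical names.

```python
import struct


def pack_strings(items):
    """Serialize byte strings, each preceded by a 4-byte big-endian length."""
    out = bytearray()
    for item in items:
        out += struct.pack(">I", len(item))
        out += item
    return bytes(out)


def unpack_strings(blob):
    """Inverse of pack_strings: read a length, then exactly that many bytes."""
    items = []
    pos = 0
    while pos < len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        items.append(blob[pos:pos + length])
        pos += length
    return items
```

Because the reader always knows how many bytes to pull, the payload may contain any byte value and no delimiter search is needed.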
And because I'm already on the train of alternative solutions, might as well propose all of the other serialization modules out there: Comparison of Perl serialization modules
I like the theory behind a task like this, but it rings too much like an XY Problem.
I agree with @Miller that your best bet is to pick a character and escape that in the text.
However, this is not what you asked, so I'll attempt to answer the question.
I take it these strings are long, so finding the delimiter is time-sensitive.
In straight Perl, the hash idea may well be as fast as you can get. As a native C extension, you can do better. I say this because my experience is that Perl array access is pretty slow for some reason, and this algorithm uses arrays to good effect:
int n_used_chars = 0;
int chars[256], loc_of_char[256];
/* chars[] is a permutation of all byte values; loc_of_char[] is its inverse. */
for (int i = 0; i < 256; i++) chars[i] = loc_of_char[i] = i;
for (int i = 0; i < string_length; i++) {
    /* unsigned char, so that bytes >= 128 are valid array indices */
    unsigned char c = string[i];
    int loc = loc_of_char[c];
    if (loc >= n_used_chars) {
        // Character c has not been used before. Swap it down to the used set.
        chars[loc] = chars[n_used_chars];
        loc_of_char[chars[loc]] = loc;
        chars[n_used_chars] = c;
        loc_of_char[c] = n_used_chars++;
    }
}
// At this point chars[0..n_used_chars - 1] contains all the used chars.
// and chars[n_used_chars..255] contains the unused ones!
This will be O(n) and very fast in practice.
What if all the characters are used? Then things get interesting... There are 64K two-byte combinations. We could use the trick above, and both arrays would be 64K. Initialization and memory would be expensive. Would it be worthwhile? Perhaps not.
If all characters are used, I would use a randomized approach: guess a delimiter and then scan the string to verify it's not contained.
How to make the guess in a prudent way?
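A minimal sketch of the guess-and-verify idea, assuming the simplest possible guess: uniformly random bytes of a fixed length. It does not attempt a smarter guessing strategy. The function name guess_delimiter and its parameters length, max_tries and rng are all hypothetical and chosen for illustration.

```python
import random


def guess_delimiter(data: bytes, length: int = 2, max_tries: int = 1000, rng=None):
    """Guess random byte strings until one is not a substring of data."""
    if rng is None:
        rng = random.Random()
    for _ in range(max_tries):
        candidate = bytes(rng.randrange(256) for _ in range(length))
        if candidate not in data:  # verification is a full scan of data
            return candidate
    return None  # every guess occurred in data; the caller may retry with a longer length
```

Each verification is a full scan of the data, so the approach only pays off when most guesses succeed.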

Java Regex to find if a given String contains a set of characters in the same order of their occurrence.

We need a Java regex to find whether a given String contains a set of characters in the same order as their occurrence.
E.g. if the given String is "TYPEWRITER",
the following strings should return a match:
"YERT", "TWRR" & "PEWRR" (character by character match in the order of occurrence),
but not
"YERW" or "YERX" (these contain characters that are either not present in the given string or don't match the order of occurrence).
This can be done by character-by-character matching in a for loop, but it will be more time-consuming. A regex for this or any pointers will be highly appreciated.
First of all, a regex is not the right tool for this. Regexes are powerful, but they are a poor fit for this task.
What you are asking for is a variant of the Longest Common Subsequence (LCS) algorithm. For your case you need to change the algorithm a bit: instead of finding the longest part common to both strings, you need to match your one string as a whole subsequence of the larger one.
The LCS is a dynamic programming algorithm, and as far as I know it is the fastest way to achieve this. If you take a look at an LCS example, you'll see what I am talking about.
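For the yes/no check asked about here, a single left-to-right scan over the larger string is enough. The following is a minimal sketch of that scan in Python. It is not the LCS algorithm the answer refers to, and it is not Java as in the question. The function name is_subsequence is hypothetical.

```python
def is_subsequence(needle: str, haystack: str) -> bool:
    """True if the characters of needle occur in haystack in the same order."""
    pos = 0  # index of the next needle character still to be found
    for ch in haystack:
        if pos < len(needle) and ch == needle[pos]:
            pos += 1
    return pos == len(needle)
```

With the examples from the question, "YERT", "TWRR" and "PEWRR" are accepted against "TYPEWRITER", while "YERW" and "YERX" are rejected.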

Regex to check if a string contains at least A-Za-z0-9 but not an &

I am trying to check if a string contains at least A-Za-z0-9 but not an &.
My experience with regexes is limited, so I started with the easy part and got:
.*[a-zA-Z0-9].*
However, I am having trouble combining this with the "does not contain an &" portion.
I was thinking along the lines of ^(?=.*[a-zA-Z0-9].*)(?![&()]).* but that does not seem to do the trick.
Any help would be appreciated.
I'm not sure if this is what you meant, but here is a regular expression that will match any string that:
contains at least one alpha-numeric character
does not contain a &
This expression ensures that the entire string is always matched (the ^ and $ at beginning and end), and that none of the characters matched are a "&" sign (the [^&]* sections):
^[^&]*[a-zA-Z0-9][^&]*$
However, it might be clearer in code to simply perform two checks, if you are not limited to a single expression.
Also, check out the \w class in regular expressions (it might be the better solution for catching alphanumeric chars if you want to allow non-ASCII characters; note that \w also matches the underscore).
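Both variants, the single expression and the two separate checks, can be sketched side by side. This is a Python illustration under the assumption that "alphanumeric" means ASCII letters and digits only; re.fullmatch takes the place of the explicit ^ and $ anchors. The names PATTERN, ALNUM, is_valid_regex and is_valid_two_checks are hypothetical.

```python
import re
import string

# Same expression as above; fullmatch() supplies the ^ and $ anchoring.
PATTERN = re.compile(r"[^&]*[a-zA-Z0-9][^&]*")
ALNUM = set(string.ascii_letters + string.digits)


def is_valid_regex(text: str) -> bool:
    """Single-expression check."""
    return PATTERN.fullmatch(text) is not None


def is_valid_two_checks(text: str) -> bool:
    """Two plain checks: no '&' anywhere, and at least one ASCII alphanumeric."""
    return "&" not in text and any(ch in ALNUM for ch in text)
```

The two functions should agree on every input; the second one states the two requirements directly.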