This must have a canonical answer, but I cannot find it... The question "Using a regular expression to validate an email address" has answers showing that regex is really not the best way to validate emails, yet searching online keeps turning up lots and lots of regex-based answers.
That question is about PHP and an answer references a handy class MailAddress. C# has something very similar but what about plain old C++? Is there a boost/C++11 utility to take all the pain away? Or something in WinAPI/MFC, even?
I had to write my own solution because the g++ version I have installed doesn't support std::regex (the application crashes), and I don't want to upgrade the toolchain for a single e-mail validation. Since this application will probably never need any other regex, I wrote a function to do the job. You can easily adjust the allowed characters for each part of the e-mail address (before the @, after the @, and after the '.') depending on your needs. It took 20 minutes to write and was way easier than messing with compiler and environment stuff just for one function call.
Here you go, have fun:
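#include <algorithm> // std::transform
#include <ctype.h>   // ::tolower
#include <stdint.h>  // uint16_t
#include <string>
bool emailAddressIsValid(std::string _email)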
{
bool retVal = false;
//Tolower cast
std::transform(_email.begin(), _email.end(), _email.begin(), ::tolower);
//Edit these to change valid characters you want to be supported to be valid. You can edit it for each section. Remember to edit the array size in the for-loops below.
const char* validCharsName = "abcdefghijklmnopqrstuvwxyz0123456789.%+_-"; //length = 41, change in loop
const char* validCharsDomain = "abcdefghijklmnopqrstuvwxyz0123456789.-"; //length = 38, change in loop
const char* validCharsTld = "abcdefghijklmnopqrstuvwxyz"; //length = 26, change in loop
bool invalidCharacterFound = false;
bool atFound = false;
bool dotAfterAtFound = false;
uint16_t letterCountBeforeAt = 0;
uint16_t letterCountAfterAt = 0;
uint16_t letterCountAfterDot = 0;
for (uint16_t i = 0; i < _email.length(); i++) {
char currentLetter = _email[i];
//Found first @? Let's mark that and continue
if (atFound == false && dotAfterAtFound == false && currentLetter == '@') {
atFound = true;
continue;
}
//Found '.' after @? Let's mark that and continue
if (atFound == true && dotAfterAtFound == false && currentLetter == '.') {
dotAfterAtFound = true;
continue;
}
//Count characters before @ (must be > 0)
if (atFound == false && dotAfterAtFound == false) {
letterCountBeforeAt++;
}
//Count characters after @ (must be > 0)
if (atFound == true && dotAfterAtFound == false) {
letterCountAfterAt++;
}
//Count characters after the '.' (dot) that follows @ (the .tld must be between 2 and 6 characters)
if (atFound == true && dotAfterAtFound == true) {
letterCountAfterDot++;
}
//Validate characters, before '@'
if (atFound == false && dotAfterAtFound == false) {
bool isValidCharacter = false;
for (uint16_t j = 0; j < 41; j++) {
if (validCharsName[j] == currentLetter) {
isValidCharacter = true;
break;
}
}
if (isValidCharacter == false) {
invalidCharacterFound = true;
break;
}
}
//Validate characters, after '@', before '.' (dot)
if (atFound == true && dotAfterAtFound == false) {
bool isValidCharacter = false;
for (uint16_t k = 0; k < 38; k++) {
if (validCharsDomain[k] == currentLetter) {
isValidCharacter = true;
break;
}
}
if (isValidCharacter == false) {
invalidCharacterFound = true;
break;
}
}
//After '.' (dot), and after '@' (.tld)
if (atFound == true && dotAfterAtFound == true) {
bool isValidCharacter = false;
for (uint16_t m = 0; m < 26; m++) {
if (validCharsTld[m] == currentLetter) {
isValidCharacter = true;
break;
}
}
if (isValidCharacter == false) {
invalidCharacterFound = true;
break;
}
}
//Break the loop to speed things up if one character was invalid
if (invalidCharacterFound == true) {
break;
}
}
//Compare collected information and finalize validation. If all matches: retVal -> true!
if (atFound == true && dotAfterAtFound == true && invalidCharacterFound == false && letterCountBeforeAt >= 1 && letterCountAfterAt >= 1 && letterCountAfterDot >= 2 && letterCountAfterDot <= 6) {
retVal = true;
}
return retVal;
}
I know there are other errors present, but the main one is the bracket that is supposed to close my main method. It asks me to enter another bracket to close the class body. I have gone through it many times, correctly indenting and adding brackets to close loops and methods, but it just doesn't want to work. Any ideas?
import java.util.Stack;
import java.util.Scanner;
public class RPNApp{
public static void main (String [] args)
{
/* Scanner object which takes user input and splits each element into an array of type String*/
Scanner scan = new Scanner(System.in);
System.out.println("Please enter numbers and operators for the Reverse Polish Notation calculator.");
String scanner = scan.nextLine();
String [ ] userInput = scanner.split(" ");
Stack<Long> stack = new Stack<Long>();
for (int i = 0; i <= userInput.length; i++) {
if (isNumber()) {
Long.parseLong(userInput[i]);
stack.push(Long.parseLong(userInput[i]));
}
}
}
public static boolean isOperator (String userInput[i]) //userInput is the array.
{
for (int i = 0; i<userInput.length; i++) {
if (!(x.equals("*") || x.equals("+") || x.equals("-") || x.equals("/") || x.equals("%"))) {
return false;
}else {
return true;
}
}
}
public static boolean isNumber (String userInput[i])
{
for (int i = 0; i<x.length(); i++) {
char c = x.charAt(i);
if (!(Character.isDigit(c))) {
return false;
}
} return true;
}
}
I have made quite a few changes, I knew there were other errors present. But the error I encountered from not having a correct parameter in my method was the worry. You mentioned there was still something wrong, have I tended to the syntax error you noticed?
Updated code
import java.util.Stack;
import java.util.Scanner;
public class RPNApp{
public static void main (String [] args){
/* Scanner object which takes user input and splits each element into an array of type String*/
Scanner scan = new Scanner(System.in);
System.out.println("Please enter numbers and operators for the Reverse Polish Notation calculator.");
String scanner = scan.nextLine();
String [ ] userInput = scanner.split(" ");
Stack<Long> stack = new Stack<Long>();
for (int i = 0; i < userInput.length; i++) {
String current = userInput[i];
if (isNumber(current)) {
Long.parseLong(userInput[i]);
stack.push(Long.parseLong(userInput[i]));
System.out.println(stack.toString());
}
}
}
public static boolean isOperator (String x) { //userInput is the array.
if (!(x.equals("*") || x.equals("+") || x.equals("-") || x.equals("/") || x.equals("%"))) {
return false;
}else {
return true;
}
}
public static boolean isNumber (String x) {
for (int i = 0; i<x.length(); i++) {
char c = x.charAt(i);
if (!(Character.isDigit(c))) {
return false;
}
} return true;
}
}
This piece of code certainly has more than just a few issues. But if you have written it entirely in your head without ever compiling it, it's actually pretty good! It shows that you think about the problem in a surprisingly correct way. I don't understand how one can get so many details wrong, but the overall structure right. And some of the syntax errors aren't really your fault: it's absolutely not obvious why it should be array.length but string.length() and, at the same time, arrayList.size(). It's a completely inconsistent mess.
Here, I cleaned it up a bit:
import java.util.Stack;
import java.util.Scanner;
public class RPNApp {
public static void main(String[] args) {
/* Scanner object which takes user input and splits each element into an array of type String*/
Scanner scan = new Scanner(System.in);
System.out.println("Please enter numbers and operators for the Reverse Polish Notation calculator.");
String scanner = scan.nextLine();
String[] userInput = scanner.split(" ");
Stack<Long> stack = new Stack<Long>();
for (int i = 0; i < userInput.length; i++) { // "<", not "<=": the last valid index is length - 1
if (isNumber(userInput[i])) {
stack.push(Long.parseLong(userInput[i]));
}
}
}
public static boolean isOperator(String userInput) {
for (int i = 0; i < userInput.length(); i++) {
char x = userInput.charAt(i);
if (!(x == '*' || x == '+' || x == '-' || x == '/' || x == '%')) {
return false;
}
}
return true;
}
public static boolean isNumber(String s) {
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (!(Character.isDigit(c))) {
return false;
}
}
return true;
}
}
A few other points to note:
Exists-loops: Check if true, return true in loop, return false in the end.
Forall-loops: Check if false, return false in loop, return true in the end.
Chars and Strings are not the same. Chars are enclosed in single quotes and compared by ==.
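To make the two loop shapes concrete, here is a minimal sketch. The class and method names (LoopPatterns, containsOperator, isAllDigits) are made up for illustration and are not part of the code above.

```java
import java.util.Arrays;
import java.util.List;

public class LoopPatterns {
    // Exists-loop: return true as soon as one element passes the check,
    // and return false only after the whole loop has finished.
    static boolean containsOperator(List<String> tokens) {
        for (String t : tokens) {
            // Strings are compared with equals(), not ==.
            if (t.equals("+") || t.equals("-") || t.equals("*")
                    || t.equals("/") || t.equals("%")) {
                return true;
            }
        }
        return false;
    }

    // Forall-loop: return false as soon as one element fails the check,
    // and return true only after the whole loop has finished.
    static boolean isAllDigits(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i); // a char, so == comparisons would be fine here
            if (!Character.isDigit(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(containsOperator(Arrays.asList("3", "4", "+")));
        System.out.println(isAllDigits("1234"));
        System.out.println(isAllDigits("12a4"));
    }
}
```

Note that a forall-loop over an empty string returns true, so isAllDigits("") is true; a caller that goes on to parse the string would need a separate emptiness check.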
It's still wrong. Think harder why. And try not to post non-compilable stuff any more.
In your function parameters you can't have userInput[i] like that. Get rid of the [i] part and then fix the rest of the other errors.
Please teach me how to get rid of those if statements by using functors (or any other better method) inside the following loop:
//Loop over each atom
std::string temp_name ;
float dst;
for (pdb::pdb_vector::size_type i = 0; i < data.size(); ++i)
{
if (type == 0)
{
//choose by residue name
temp_name = data[i].residue_name;
} else {
//choose by atom name
temp_name = data[i].atom_name;
}
//compare the name and extract position if matched
if (temp_name.compare(name) == 0)
{
if (direction.compare("x") == 0)
{
dst = ::atof(data[i].x_coord.c_str());
} else if ((direction.compare("y") == 0)) {
dst = ::atof(data[i].y_coord.c_str());
} else {
dst = ::atof(data[i].z_coord.c_str());
}
}
}
You can replace if(type == 0) with the ternary operator:
// if(type == 0) ...
temp_name = (type == 0 ? data[i].residue_name : data[i].atom_name);
but the rest of your checks seem like they would only be less readable if you tried something similar.
The following is the interview question:
Machine coding round: (Time 1hr)
An expression and a string testCase are given; evaluate whether the testCase is valid for the expression.
Expression may contain:
letters [a-z]
'.' ('.' represents any char in [a-z])
'*' ('*' has same property as in normal RegExp)
'^' ('^' represents start of the String)
'$' ('$' represents end of String)
Sample cases:
Expression    Test Case     Valid
ab            ab            true
a*b           aaaaaab       true
a*b*c*        abc           true
a*b*c         aaabccc       false
^abc*b        abccccb       true
^abc*b        abbccccb      false
^abcd$        abcd          true
^abc*abc$     abcabc        true
^abc.abc$     abczabc       true
^ab..*abc$    abyxxxxabc    true
My approach:
1. Convert the given regular expression into concatenation (ab), alternation (a|b), and Kleene star (a*), adding + for concatenation. For example:
abc$ => .*+a+b+c
^ab..*abc$ => a+b+.+.*+a+b+c
2. Convert into postfix notation based on precedence (parentheses > Kleene star > concatenation > ...):
(a|b)*+c => ab|*c+
3. Build an NFA using Thompson's construction.
4. Backtrack / traverse through the NFA while maintaining a set of states.
When I started implementing it, it took me a lot more than 1 hour. I felt that the step 3 was very time consuming. I built the NFA by using postfix notation +stack and by adding new states and transitions as needed.
So, I was wondering if there is a faster alternative solution to this question, or maybe a faster way to implement step 3. I found this CareerCup link, where someone mentioned in a comment that it was from some programming contest. So if someone has solved this previously or has a better solution to this question, I'd be happy to know where I went wrong.
Some derivation of Levenshtein distance comes to mind - possibly not the fastest algorithm, but it should be quick to implement.
We can ignore ^ at the start and $ at the end - anywhere else is invalid.
Then we construct a 2D grid where each row represents a unit [1] in the expression and each column represents a character in the test string.
[1]: A "unit" here refers to a single character, with the exception that * shall be attached to the previous character
So for a*b*c and aaabccc, we get something like:
     a  a  a  b  c  c  c
a*
b*
c
Each cell can have a boolean value indicating validity.
Now, for each cell, set it to valid if either of these hold:
The value in the left neighbour is valid and the row is x* or .* and the column is x (x being any character a-z)
This corresponds to a * matching one additional character.
The value in the upper-left neighbour is valid and the row is x or . and the column is x (x being any character a-z)
This corresponds to a single-character match.
The value in the top neighbour is valid and the row is x* or .*.
This corresponds to the * matching nothing.
Then check if the bottom-right-most cell is valid.
So, for the above example, we get: (V indicating valid)
     a  a  a  b  c  c  c
a*   V  V  V  -  -  -  -
b*   V  V  V  V  -  -  -
c    -  -  -  -  V  -  -
Since the bottom-right cell isn't valid, we return invalid.
Running time: O(stringLength*expressionLength).
You should notice that we're mostly exploring a fairly small part of the grid.
This solution can be improved by making it a recursive solution making use of memoization (and just calling the recursive solution for the bottom-right cell).
This will give us a best-case performance of O(1), but still a worst-case performance of O(stringLength*expressionLength).
My solution assumes the expression must match the entire string, as inferred from the result of the above example being invalid (as per the question).
If it can instead match a substring, we can modify this slightly so, if the cell is in the top row it's valid if:
The row is x* or .*.
The row is x or . and the column is x.
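The grid approach above can be sketched roughly as follows for the whole-string variant. This is only a sketch under a few assumptions of mine: the class and method names (GridMatcher, matches) are made up, an extra empty row and column are added so the first unit and first character have neighbours to look at, a leading ^ and trailing $ are simply stripped, and the test string is assumed to contain only [a-z].

```java
import java.util.ArrayList;
import java.util.List;

public class GridMatcher {
    // Returns true if the expression matches the entire test string.
    static boolean matches(String expression, String test) {
        // Ignore ^ at the start and $ at the end (assumed to appear nowhere else).
        String expr = expression;
        if (expr.startsWith("^")) expr = expr.substring(1);
        if (expr.endsWith("$")) expr = expr.substring(0, expr.length() - 1);

        // Split into units: one character, with a following * attached to it.
        List<Character> unitChar = new ArrayList<>();
        List<Boolean> unitStar = new ArrayList<>();
        for (int i = 0; i < expr.length(); i++) {
            boolean star = i + 1 < expr.length() && expr.charAt(i + 1) == '*';
            unitChar.add(expr.charAt(i));
            unitStar.add(star);
            if (star) i++; // skip the * we just attached
        }

        int m = unitChar.size();
        int n = test.length();
        // valid[r][c]: the first r units match the first c characters.
        // Row 0 / column 0 stand for "no units" / "no characters".
        boolean[][] valid = new boolean[m + 1][n + 1];
        valid[0][0] = true;
        for (int r = 1; r <= m; r++) {
            char u = unitChar.get(r - 1);
            boolean star = unitStar.get(r - 1);
            for (int c = 0; c <= n; c++) {
                boolean charOk = c > 0 && (u == '.' || u == test.charAt(c - 1));
                if (star) {
                    // Top neighbour: the * matches nothing.
                    // Left neighbour: the * matches one more character.
                    valid[r][c] = valid[r - 1][c] || (charOk && valid[r][c - 1]);
                } else {
                    // Upper-left neighbour: single-character match.
                    valid[r][c] = charOk && valid[r - 1][c - 1];
                }
            }
        }
        return valid[m][n]; // bottom-right cell
    }

    public static void main(String[] args) {
        System.out.println(matches("a*b*c", "aaabccc"));
        System.out.println(matches("^ab..*abc$", "abyxxxxabc"));
    }
}
```

This fills the whole grid iteratively rather than using the memoized recursion mentioned above, which keeps the code short at the cost of always doing the full O(stringLength*expressionLength) work.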
Given only 1 hour, we can use a simple approach.
Split pattern into tokens: a*b.c => { a* b . c }.
If pattern doesn't start with ^ then add .* in the beginning, else remove ^.
If pattern doesn't end with $ then add .* in the end, else remove $.
Then we use recursion. For a repeating token (x*), we branch three ways: advance the pattern index by 1, advance the word index by 1, or advance both indices by 1. For a non-repeating token there is only one way: advance both indices by 1.
Sample code in C#
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
namespace ReTest
{
class Program
{
static void Main(string[] args)
{
Debug.Assert(IsMatch("ab", "ab") == true);
Debug.Assert(IsMatch("aaaaaab", "a*b") == true);
Debug.Assert(IsMatch("abc", "a*b*c*") == true);
Debug.Assert(IsMatch("aaabccc", "a*b*c") == true); /* original false, but it should be true */
Debug.Assert(IsMatch("abccccb", "^abc*b") == true);
Debug.Assert(IsMatch("abbccccb", "^abc*b") == false);
Debug.Assert(IsMatch("abcd", "^abcd$") == true);
Debug.Assert(IsMatch("abcabc", "^abc*abc$") == true);
Debug.Assert(IsMatch("abczabc", "^abc.abc$") == true);
Debug.Assert(IsMatch("abyxxxxabc", "^ab..*abc$") == true);
}
static bool IsMatch(string input, string pattern)
{
List<PatternToken> patternTokens = new List<PatternToken>();
for (int i = 0; i < pattern.Length; i++)
{
char token = pattern[i];
if (token == '^')
{
if (i == 0)
patternTokens.Add(new PatternToken { Token = token, Occurence = Occurence.Single });
else
throw new ArgumentException("input");
}
else if (char.IsLower(token) || token == '.')
{
if (i < pattern.Length - 1 && pattern[i + 1] == '*')
{
patternTokens.Add(new PatternToken { Token = token, Occurence = Occurence.Multiple });
i++;
}
else
patternTokens.Add(new PatternToken { Token = token, Occurence = Occurence.Single });
}
else if (token == '$')
{
if (i == pattern.Length - 1)
patternTokens.Add(new PatternToken { Token = token, Occurence = Occurence.Single });
else
throw new ArgumentException("input");
}
else
throw new ArgumentException("input");
}
PatternToken firstPatternToken = patternTokens.First();
if (firstPatternToken.Token == '^')
patternTokens.RemoveAt(0);
else
patternTokens.Insert(0, new PatternToken { Token = '.', Occurence = Occurence.Multiple });
PatternToken lastPatternToken = patternTokens.Last();
if (lastPatternToken.Token == '$')
patternTokens.RemoveAt(patternTokens.Count - 1);
else
patternTokens.Add(new PatternToken { Token = '.', Occurence = Occurence.Multiple });
return IsMatch(input, 0, patternTokens, 0);
}
static bool IsMatch(string input, int inputIndex, IList<PatternToken> pattern, int patternIndex)
{
if (inputIndex == input.Length)
{
// Input exhausted: every remaining token must be able to match nothing.
for (int i = patternIndex; i < pattern.Count; i++)
{
if (pattern[i].Occurence == Occurence.Single)
return false;
}
return true;
}
else if (inputIndex < input.Length && patternIndex < pattern.Count)
{
char c = input[inputIndex];
PatternToken patternToken = pattern[patternIndex];
if (patternToken.Token == '.' || patternToken.Token == c)
{
if (patternToken.Occurence == Occurence.Single)
return IsMatch(input, inputIndex + 1, pattern, patternIndex + 1);
else
return IsMatch(input, inputIndex, pattern, patternIndex + 1) ||
IsMatch(input, inputIndex + 1, pattern, patternIndex) ||
IsMatch(input, inputIndex + 1, pattern, patternIndex + 1);
}
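else
// A repeated token that does not match here can still match zero characters, so skip it.
return patternToken.Occurence == Occurence.Multiple &&
IsMatch(input, inputIndex, pattern, patternIndex + 1);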
}
else
return false;
}
class PatternToken
{
public char Token { get; set; }
public Occurence Occurence { get; set; }
public override string ToString()
{
if (Occurence == Occurence.Single)
return Token.ToString();
else
return Token.ToString() + "*";
}
}
enum Occurence
{
Single,
Multiple
}
}
}
Here is a solution in Java. Space and time are O(n). Inline comments are provided for more clarity:
/**
* @author Santhosh Kumar
*
*/
public class ExpressionProblemSolution {
public static void main(String[] args) {
System.out.println("---------- ExpressionProblemSolution - start ---------- \n");
ExpressionProblemSolution evs = new ExpressionProblemSolution();
evs.runMatchTests();
System.out.println("\n---------- ExpressionProblemSolution - end ---------- ");
}
// simple node structure to keep expression terms
class Node {
Character ch; // char [a-z]
Character sch; // special char (^, *, $, .)
Node next;
Node(Character ch1, Character sch1) {
ch = ch1;
sch = sch1;
}
Node add(Character ch1, Character sch1) {
this.next = new Node(ch1, sch1);
return this.next;
}
Node next() {
return this.next;
}
public String toString() {
return "[ch=" + ch + ", sch=" + sch + "]";
}
}
private boolean letters(char ch) {
return (ch >= 'a' && ch <= 'z');
}
private boolean specialChars(char ch) {
return (ch == '.' || ch == '^' || ch == '*' || ch == '$');
}
private void validate(String expression) {
// if expression has invalid chars throw runtime exception
if (expression == null) {
throw new RuntimeException(
"Expression can't be null, but it can be empty");
}
char[] expr = expression.toCharArray();
for (int i = 0; i < expr.length; i++) {
if (!letters(expr[i]) && !specialChars(expr[i])) {
throw new RuntimeException(
"Expression contains invalid char at position=" + i
+ ", invalid_char=" + expr[i]
+ " (allowed chars are 'a-z', *, . ^, * and $)");
}
}
}
// Parse the expression and split them into terms and add to list
// the list is FSM (Finite State Machine). The list is used during
// the process step to iterate through the machine states based
// on the input string
//
// expression = a*b*c has 3 terms -> [a*] [b*] [c]
// expression = ^ab.*c$ has 4 terms -> [^a] [b] [.*] [c$]
//
// Timing : O(n) n -> expression length
// Space : O(n) n -> expression length decides the no.of terms stored in the list
private Node preprocess(String expression) {
debug("preprocess - start [" + expression + "]");
validate(expression);
Node root = new Node(' ', ' '); // root node with empty values
Node current = root;
char[] expr = expression.toCharArray();
int i = 0, n = expr.length;
while (i < n) {
debug("i=" + i);
if (expr[i] == '^') { // it is prefix operator, so it always linked
// to the char after that
if (i + 1 < n) {
if (i == 0) { // ^ indicates start of the expression, so it
// must be first in the expr string
current = current.add(expr[i + 1], expr[i]);
i += 2;
continue;
} else {
throw new RuntimeException(
"Special char ^ should be present only at the first position of the expression (position="
+ i + ", char=" + expr[i] + ")");
}
} else {
throw new RuntimeException(
"Expression missing after ^ (position=" + i
+ ", char=" + expr[i] + ")");
}
} else if (letters(expr[i]) || expr[i] == '.') { // [a-z] or .
if (i + 1 < n) {
char nextCh = expr[i + 1];
if (nextCh == '$' && i + 1 != n - 1) { // if $, then it must
// be at the last
// position of the
// expression
throw new RuntimeException(
"Special char $ should be present only at the last position of the expression (position="
+ (i + 1)
+ ", char="
+ expr[i + 1]
+ ")");
}
if (nextCh == '$' || nextCh == '*') { // a* or b$
current = current.add(expr[i], nextCh);
i += 2;
continue;
} else {
current = current.add(expr[i], expr[i] == '.' ? expr[i]
: null);
i++;
continue;
}
} else { // a or b
current = current.add(expr[i], null);
i++;
continue;
}
} else {
throw new RuntimeException("Invalid char - (position=" + (i)
+ ", char=" + expr[i] + ")");
}
}
debug("preprocess - end");
return root;
}
// Traverse over the terms in the list and iterate and match the input string
// The terms list is the FSM (Finite State Machine); the end of list indicates
// end state. That is, input is valid and matching the expression
//
// Timing : O(n) for pre-processing + O(n) for processing = 2O(n) = ~O(n) where n -> expression length
// Timing : O(2n) ~ O(n)
// Space : O(n) where n -> expression length decides the no.of terms stored in the list
public boolean process(String expression, String testString) {
Node root = preprocess(expression);
print(root);
Node current = root.next();
if (root == null || current == null)
return false;
int i = 0;
int n = testString.length();
debug("input-string-length=" + n);
char[] test = testString.toCharArray();
// while (i < n && current != null) {
while (current != null) {
debug("process: i=" + i);
debug("process: ch=" + current.ch + ", sch=" + current.sch);
if (current.sch == null) { // no special char just [a-z] case
if (test[i] != current.ch) { // test char and current state char
// should match
return false;
} else {
i++;
current = current.next();
continue;
}
} else if (current.sch == '^') { // process start char
if (i == 0 && test[i] == current.ch) {
i++;
current = current.next();
continue;
} else {
return false;
}
} else if (current.sch == '$') { // process end char
if (i == n - 1 && test[i] == current.ch) {
i++;
current = current.next();
continue;
} else {
return false;
}
} else if (current.sch == '*') { // process repeat char
if (letters(current.ch)) { // like a* or b*
while (i < n && test[i] == current.ch)
i++; // move i till end of repeat char
current = current.next();
continue;
} else if (current.ch == '.') { // like .*
Node nextNode = current.next();
print(nextNode);
if (nextNode != null) {
Character nextChar = nextNode.ch;
Character nextSChar = nextNode.sch;
// a.*z = az or (you need to check the next state in the
// list)
if (test[i] == nextChar) { // test [i] == 'z'
i++;
current = current.next();
continue;
} else {
// a.*z = abz or
// a.*z = abbz
char tch = test[i]; // get 'b'
while (i + 1 < n && test[++i] == tch)
; // move i till end of repeat char
current = current.next();
continue;
}
}
} else { // like $* or ^*
debug("process: return false-1");
return false;
}
} else if (current.sch == '.') { // process any char
if (!letters(test[i])) {
return false;
}
i++;
current = current.next();
continue;
}
}
if (i == n && current == null) {
// string position is out of bound
// list is at end ie. exhausted both expression and input
// FSM reached the end state, hence the input is valid and matches the given expression
return true;
} else {
return false;
}
}
public void debug(Object str) {
boolean debug = false;
if (debug) {
System.out.println("[debug] " + str);
}
}
private void print(Node node) {
StringBuilder sb = new StringBuilder();
while (node != null) {
sb.append(node + " ");
node = node.next();
}
sb.append("\n");
debug(sb.toString());
}
public boolean match(String expr, String input) {
boolean result = process(expr, input);
System.out.printf("\n%-20s %-20s %-20s\n", expr, input, result);
return result;
}
public void runMatchTests() {
match("ab", "ab");
match("a*b", "aaaaaab");
match("a*b*c*", "abc");
match("a*b*c", "aaabccc");
match("^abc*b", "abccccb");
match("^abc*b", "abccccbb");
match("^abcd$", "abcd");
match("^abc*abc$", "abcabc");
match("^abc.abc$", "abczabc");
match("^ab..*abc$", "abyxxxxabc");
match("a*b*", ""); // handles empty input string
match("xyza*b*", "xyz");
}}
#include <stdio.h>
int regex_validate(char *reg, char *test) {
char *ptr = reg;
while (*test) {
switch(*ptr) {
case '.':
{
test++; ptr++; continue;
break;
}
case '*':
{
if (*(ptr-1) == *test) {
test++; continue;
}
else if (*(ptr-1) == '.' && (*test == *(test-1))) {
test++; continue;
}
else {
ptr++; continue;
}
break;
}
case '^':
{
ptr++;
while (*ptr && *ptr == *test) { /* compare characters, not pointers; stop at the terminator */
ptr++; test++;
}
if (!*ptr && !*test)
return 1;
if (*ptr == '$' || *ptr == '*' || *ptr == '.') {
continue;
}
else {
return 0;
}
break;
}
case '$':
{
if (*test)
return 0;
break;
}
default:
{
printf("default case.\n");
if (*ptr != *test) {
return 0;
}
test++; ptr++; continue;
}
break;
}
}
return 1;
}
int main () {
printf("regex=%d\n", regex_validate("ab", "ab"));
printf("regex=%d\n", regex_validate("a*b", "aaaaaab"));
printf("regex=%d\n", regex_validate("^abc.abc$", "abcdabc"));
printf("regex=%d\n", regex_validate("^abc*abc$", "abcabc"));
printf("regex=%d\n", regex_validate("^abc*b", "abccccb"));
printf("regex=%d\n", regex_validate("^abc*b", "abbccccb"));
return 0;
}
In testOne() I use a regular expression to check whether a string contains any of some specific strings.
In testTwo() I use if/else statements to do the same thing.
I wonder why testTwo() is always faster than testOne() in my test cases.
Is a regular expression not suitable for this problem, or is my regular expression not well written?
My test code is as follows. Thanks very much!
import java.util.regex.Pattern;
import org.junit.Test;
public class TestReg {
static final Pattern PATT = Pattern
.compile("(tudou|video.sina|v.youku|v.ku6|tv.sohu|v.163|tv.letv|v.ifeng|v.qq|iqiyi|(5)?6)\\.(com|cn)");
@Test
public void testOne() {
int count = 0;
for (int i = 0; i < 10000; i++) {
for (String vurl : TESTCASES) {
if (PATT.matcher(vurl).find())
count++;
}
}
System.out.println("testOne:" + count);
}
@Test
public void testTwo() {
int count = 0;
for (int i = 0; i < 10000; i++) {
for (String vurl : TESTCASES) {
if (vurl.indexOf("tudou.com") != -1
|| vurl.indexOf("video.sina.com") != -1
|| vurl.indexOf("v.youku.com") != -1
|| vurl.indexOf("v.ku6.com") != -1
|| vurl.indexOf("56.com") != -1
|| vurl.indexOf("tv.sohu.com") != -1
|| vurl.indexOf("v.163.com") != -1
|| vurl.indexOf("tv.letv.com") != -1
|| vurl.indexOf("v.ifeng.com") != -1
|| vurl.indexOf("v.qq.com") != -1
|| vurl.indexOf("iqiyi.com") != -1
|| vurl.indexOf("6.cn") != -1) {
count++;
}
}
}
System.out.println("testTwo:" + count);
}
static final String[] TESTCASES = {
"http://blog.csdn.net/v_july_v/article/details/7624837",
"http://jobs.douban.com/intern/apply/?type=dev&position=intern_sf",
"https://class.coursera.org/ml/lecture/index",
"http://blog.csdn.net/v_july_v/article/details/7624837",
"http://jobs.douban.com/intern/apply/?type=dev&position=intern_sf",
"https://class.coursera.org/ml/lecture/index",
"http://blog.csdn.net/v_july_v/article/details/7624837",
"http://jobs.douban.com/intern/apply/?type=dev&position=intern_sf",
"https://class.coursera.org/ml/lecture/index",
"http://blog.csdn.net/v_july_v/article/details/7624837",
"http://jobs.douban.com/intern/apply/?type=dev&position=intern_sf",
"https://class.coursera.org/ml/lecture/index",
"http://www.56.com/u38/v_NjYyNTUyMjc.html",
"http://video.sina.com.cn/v/b/69614895-2128825751.html",
"http://www.tudou.com/programs/view/xcPewAoJ26M",
"http://v.youku.com/v_show/id_XMzQ0OTI0MTgw.html",
"http://www.56.com/u87/v_NjMzMjEzNTY.html",
"http://tv.sohu/u87/v_NjMzMjEzNTY.html",
"http://tv.letv/u38/v_NjYyNTUyMjc.html",
"http://v.ifeng/v/b/69614895-2128825751.html",
"http://v.qq/programs/view/xcPewAoJ26M",
"http://v.163/v_show/id_XMzQ0OTI0MTgw.html",
"http://iqiyi/u87/v_NjMzMjEzNTY.html",
"http://v.6.cn/u87/v_NjMzMjEzNTY.html" };
}
I wouldn't use either:
Regular expressions are designed to match patterns; they're overkill for exact matches
The || statement is a bit painful.
I'd just use a HashSet<String>. For each URL, you first use something like the URL class to extract the host name, and then see if it's in the set of hosts you're interested in.
Aside from anything else, that will prevent false positives - your current approach would match
http://www.someotherhost.com/something/tudou.com
... which you don't actually want to.
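A minimal sketch of that idea, using java.net.URI to extract the host. The class and method names (VideoHostFilter, isVideoUrl) are hypothetical, the host list is copied from the question's indexOf checks, and two choices are my own assumptions: "video.sina.com.cn" is added because the question's test data uses it, and a URL also counts as a match when one of its parent domains is in the set (so www.56.com matches 56.com).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class VideoHostFilter {
    // Hosts of interest; adjust to whatever list you actually need.
    private static final Set<String> VIDEO_HOSTS = new HashSet<>(Arrays.asList(
            "tudou.com", "video.sina.com", "video.sina.com.cn", "v.youku.com",
            "v.ku6.com", "56.com", "tv.sohu.com", "v.163.com", "tv.letv.com",
            "v.ifeng.com", "v.qq.com", "iqiyi.com", "6.cn"));

    static boolean isVideoUrl(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false; // not a parseable URL, so not a match
        }
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        // Check the host itself, then each parent domain
        // (www.56.com -> 56.com -> com).
        while (true) {
            if (VIDEO_HOSTS.contains(host)) {
                return true;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return false;
            }
            host = host.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(isVideoUrl("http://www.56.com/u38/v_NjYyNTUyMjc.html"));
        System.out.println(isVideoUrl("http://www.someotherhost.com/something/tudou.com"));
    }
}
```

Because only the host is compared, a host name that merely appears in the path (as in the someotherhost.com example) no longer produces a match.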