What is the most efficient way to check if a string is part of a bigger string? - regex

I have a string which is formed by concatenation of IP addresses, for example:
"127.272.1.43;27.27.1.43;127.127.27.67;128.27.1.43;127.20.1.43;111.27.1.43;127.27.1.43;"
When a new IP address is given, I need to check whether the first half of that IP matches the first half of any address in the string. For example, if "127.27.123.23" is given, I need to find out whether any of the IP addresses in the string starts with "127.27".
I have the following code, where userIP = "127.27."
int i = StringUtils.indexOf(dbIPString, userIP);
do {
    if (i > 0) {
        char ch = dbIPString.charAt(i - 1);
        if (ch == ';') {
            System.out.println("IP is present in db");
            break;
        } else {
            // continue from the next position; searching from i again would find the same match forever
            i = StringUtils.indexOf(dbIPString, userIP, i + 1);
        }
    } else if (i == 0) {
        System.out.println("IP is present in db");
        break;
    } else {
        System.out.println("IP is not present in db");
    }
} while (i >= 0);
Can it be more efficient? Or can I use regular expression? Which one is more efficient?

Plain string matches are usually faster than regex matches. I'd keep it simple and do something like this:
if (StringUtils.startsWith(dbIPString, userIP)) {
    ... // prefix is present (first entry)
} else if (StringUtils.indexOf(dbIPString, ";" + userIP) >= 0) {
    ... // prefix is present (any later entry)
} else {
    ... // prefix is not present
}
If you can arrange to have the list always begin with a ';' then searching the first entry would no longer be a special case and the logic can be simplified.
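The leading-delimiter idea can be sketched as follows. This is only an illustration, written in C++ rather than the Java of the question; the function name hasPrefix is made up for the example, and it assumes the entries are separated by ';' and that the prefix is non-empty.

```cpp
#include <string>

// Returns true if any ';'-separated entry in `list` starts with `prefix`.
// Prepending ';' makes the first entry look like every other entry,
// so a single substring search covers all cases.
bool hasPrefix(const std::string& list, const std::string& prefix) {
    const std::string delimited = ";" + list;
    return delimited.find(";" + prefix) != std::string::npos;
}
```

In real code the list would be stored with its leading ';' once, instead of being copied on every call as this sketch does.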
If the list will be large and you're going to be doing a lot of these searches and speed really matters then perhaps you could add each prefix to some sort of hash or tree as you build the list of addresses. Lookups in those data structures should be faster than string matches.
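A minimal sketch of the hash-based variant, assuming the "first half" of an address means its first two octets plus the trailing dot. The helper names firstHalf and buildPrefixSet are hypothetical, and the code is C++ rather than Java, so treat it as an outline of the idea rather than a drop-in solution.

```cpp
#include <string>
#include <unordered_set>

// Extracts the first two octets plus trailing dot, e.g. "127.27.1.43" -> "127.27.".
// Returns an empty string if the address has fewer than two dots.
std::string firstHalf(const std::string& ip) {
    const std::size_t firstDot = ip.find('.');
    if (firstDot == std::string::npos) return "";
    const std::size_t secondDot = ip.find('.', firstDot + 1);
    if (secondDot == std::string::npos) return "";
    return ip.substr(0, secondDot + 1);
}

// Builds the set of prefixes from a ';'-separated list of addresses.
std::unordered_set<std::string> buildPrefixSet(const std::string& list) {
    std::unordered_set<std::string> prefixes;
    std::size_t start = 0;
    while (start < list.size()) {
        std::size_t end = list.find(';', start);
        if (end == std::string::npos) end = list.size();
        const std::string prefix = firstHalf(list.substr(start, end - start));
        if (!prefix.empty()) prefixes.insert(prefix);
        start = end + 1;
    }
    return prefixes;
}
```

Once the set is built, checking a new address is a single lookup such as prefixes.count(firstHalf(newIp)) > 0, which takes constant time on average regardless of how many addresses are stored.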

Assuming that you only care for entire IP address matches, and assuming you don't want 127.255.1.43 to match when you're looking for 127.25, then
(?<=^|;)127\.25\.\d+\.\d+
would be a fitting regex.
In Java, with userIP holding the prefix without a trailing dot (for example "127.25"):
Pattern regex = Pattern.compile(
"(?<=^|;) # Assert position at the start of the string or after ;\n" +
Pattern.quote(userIP) +
"\\.\\d+\\.\\d+ # Match .nnn.nnn",
Pattern.COMMENTS);
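For comparison, the same check can be sketched in C++. std::regex with its default ECMAScript grammar has no lookbehind, so this sketch swaps the (?<=^|;) assertion for an ordinary group (^|;), which is enough when only a yes/no answer is needed. The function names are hypothetical, and the prefix is assumed to be dotted decimal without a trailing dot (for example "127.25").

```cpp
#include <regex>
#include <string>

// Escapes the dots in a dotted-decimal prefix such as "127.25"
// so they match literally instead of matching any character.
std::string escapeDots(const std::string& prefix) {
    std::string out;
    for (char c : prefix) {
        if (c == '.') out += "\\.";
        else out += c;
    }
    return out;
}

// Returns true if some address in the ';'-separated list starts with `prefix`
// followed by two more numeric octets.
bool containsAddressWithPrefix(const std::string& list, const std::string& prefix) {
    const std::regex re("(^|;)" + escapeDots(prefix) + "\\.\\d+\\.\\d+");
    return std::regex_search(list, re);
}
```

Because the literal dot after the prefix is part of the expression, "127.25" does not match "127.255.1.43".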

Related

Generate string lexicographically larger than input

Given an input string A, is there a concise way to generate a string B that is lexicographically larger than A, i.e. A < B == true?
My raw solution would be to say:
B = A;
++B.back();
but in general this won't work because:
A might be empty
The last character of A may be close to wraparound, in which case the resulting character will have a smaller value i.e. B < A.
Adding an extra character every time is wasteful and will quickly result in unreasonably large strings.
So I was wondering whether there's a standard library function that can help me here, or if there's a strategy that scales nicely when I want to start from an arbitrary string.
You can duplicate A into B then look at the final character. If the final character isn't the final character in your range, then you can simply increment it by one.
Otherwise you can look at last-1, last-2, last-3. If you get to the front of the list of chars, then append to the length.
Here is my dummy solution:
std::string make_greater_string(std::string const &input)
{
std::string ret{std::numeric_limits<
std::string::value_type>::min()};
if (!input.empty())
{
if (std::numeric_limits<std::string::value_type>::max()
== input.back())
{
ret = input + ret;
}
else
{
ret = input;
++ret.back();
}
}
return ret;
}
Ideally I'd hope to avoid the explicit handling of all special cases, and use some facility that can handle them more naturally. Already, looking at the answer by @JosephLarson, I see that I could increment more than just the last character, which would improve the range achievable without adding more characters.
And here's the refinement after the suggestions in this post:
std::string make_greater_string(std::string const &input)
{
constexpr char minC = ' ', maxC = '~';
// Working with limits was a pain,
// using ASCII typical limit values instead.
std::string ret{minC};
auto rit = input.rbegin();
while (rit != input.rend())
{
if (maxC == *rit)
{
++rit;
if (rit == input.rend())
{
ret = input + ret;
break;
}
}
else
{
ret = input;
++(*(ret.rbegin() + std::distance(input.rbegin(), rit)));
break;
}
}
return ret;
}
You can copy the string and append some letters - this will produce a lexicographically larger result.
B = A + "a"

How to build JAPE rules in GATE

I need to build a rule whose LHS checks whether the first character of a word is "b", and then checks whether the rest of the word (without that first character) is found in a Lookup.
This is sample code for something similar to what you want (copied from https://gate.ac.uk/wiki/jape-repository/strings.html#section-1). You can read a little more and adapt it to get the exact solution:
Rule:GetMobile
(
{Phone}
):tag
-->
:tag{
// get the offsets
Long phoneStart = tagAnnots.firstNode().getOffset();
Long phoneEnd = tagAnnots.lastNode().getOffset();
// check the number is longer than or equal to 2 characters (just in case)
if(phoneEnd - phoneStart >= 2) {
try {
String firstTwoChars = doc.getContent()
.getContent(tagAnnots.firstNode().getOffset(),
tagAnnots.firstNode().getOffset() + 2).toString();
// check it matches 07
if("07".equals(firstTwoChars)) {
// create the new annotation
gate.FeatureMap features = Factory.newFeatureMap();
features.put("kind", "mobile");
outputAS.add(tagAnnots.firstNode(),
tagAnnots.lastNode(), "Phone", features);
}
}
catch(InvalidOffsetException e) {
// not possible
throw new LuckyException("Invalid offset from annotation");
}
}
}
Here are some places where you can read up:
https://gate.ac.uk/wiki/jape-repository/
https://gate.ac.uk/sale/talks/gate-course-jun14/module-1-jape/module-1-jape.pdf

Checking if a word is contained within an array

I want to check for a word contained within a bigger string, but not necessarily in the same order. Example: The program will check if the word "car" exists in "crqijfnsa". In this case, it does, because the second string contains c, a, and r.
You could build a map containing the letters of "car" with the values set to 0. Cycle through the array of letters and, whenever a letter is one of the letters in "car", change its value to 1. If all the keys in the map end up with a value greater than 0, then the word can be constructed. Try implementing this.
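A minimal sketch of that map idea, using a hypothetical function name allLettersPresent. Because each letter is only flagged as 0 or 1, repeated letters are not counted, so a word such as "ccar" is treated exactly like "car"; that limitation is inherent to the approach as described.

```cpp
#include <map>
#include <string>

// Returns true if every distinct letter of `word` occurs somewhere in `text`.
// Repeated letters in `word` are not counted separately.
bool allLettersPresent(const std::string& word, const std::string& text) {
    std::map<char, int> seen;
    for (char c : word) seen[c] = 0;           // every letter of the word starts at 0
    for (char c : text) {
        auto it = seen.find(c);
        if (it != seen.end()) it->second = 1;  // mark letters that appear in the text
    }
    for (const auto& entry : seen) {
        if (entry.second == 0) return false;   // this letter was never found
    }
    return true;
}
```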
An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once;
So, actually, what you are looking for is an algorithm to check whether two words are anagrams or not.
The following thread provides pseudocode that might be helpful:
Finding anagrams for a given word
A very primitive code would be something like this:
for (std::string::iterator it = str.begin(); it != str.end(); ++it)
    for (std::string::iterator it2 = str2.begin(); it2 != str2.end(); ++it2) {
        if (*it == *it2) {
            // erase the matched letter from the word; it2 is the iterator into str2
            str2.erase(it2);
            break;
        }
    }
if (str2.empty())
    found = true;
You could build up a table of count of characters of each letter in the word you are searching for, then decrement those counts as you work through the search string.
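#include <array>
#include <numeric>
bool IsWordInString(const char* word, const char* str)
{
    // build up table of characters in word to match
    std::array<int, 256> cword = {0};
    for (; *word; ++word) {
        // index as unsigned char so characters above 127 cannot give a negative index
        cword[static_cast<unsigned char>(*word)]++;
    }
    // work through str matching characters in word
    for (; *str; ++str) {
        const unsigned char c = static_cast<unsigned char>(*str);
        if (cword[c] > 0) {
            cword[c]--;
        }
    }
    // every count is back to zero only if all letters were matched
    return std::accumulate(cword.begin(), cword.end(), 0) == 0;
}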
It's also possible to return as soon as you find a match, but the code isn't as simple.
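bool IsWordInString(const char* word, const char* str)
{
    // empty string
    if (*word == 0)
        return true;
    // build up table of characters in word to match
    int unmatched = 0;
    int cword[256] = {0};  // int rather than char so long words cannot overflow a count
    for (; *word; ++word) {
        // index as unsigned char so characters above 127 cannot give a negative index
        cword[static_cast<unsigned char>(*word)]++;
        unmatched++;
    }
    // work through str matching characters in word
    for (; *str; ++str) {
        const unsigned char c = static_cast<unsigned char>(*str);
        if (cword[c] > 0) {
            cword[c]--;
            unmatched--;
            if (unmatched == 0)
                return true;
        }
    }
    return false;
}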
Some test cases
"" in "crqijfnsa" => 1
"car" in "crqijfnsa" => 1
"ccar" in "crqijfnsa" => 0
"ccar" in "crqijfnsac" => 1
I think the easiest (and probably fastest, test that yourself :) ) implementation would be done with std::includes:
std::string testword {"car"};
std::string testarray {"crqijfnsa"};
std::sort(testword.begin(),testword.end());
std::sort(testarray.begin(),testarray.end());
bool is_in_array = std::includes(testarray.begin(),testarray.end(),
testword.begin(),testword.end());
This also handles all cases of duplicate letters correctly.
The complexity of this approach should be O(n log n), where n is the length of testarray (the sort is O(n log n) and std::includes has linear complexity).

Match a structure against set of patterns

I need to match a structure against a set of patterns and take some action for each match.
The patterns should support wildcards, and I need to determine which patterns match an incoming structure. An example set:
action=new_user email=*
action=del_user email=*
action=* email=*#gmail.com
action=new_user email=*#hotmail.com
These patterns can be added and removed at runtime. There can be thousands of connections, each with its own pattern, and I need to notify each connection when I receive a structure that matches its pattern. The patterns are not full regular expressions; I just need to match strings with the wildcard * (which simply matches any number of characters).
When the server receives a message (let's call it message A) with the structure action=new_user email=testuser#gmail.com, I need to find out that patterns 1 and 3 match it, and then perform an action for each matching pattern (send structure A to the corresponding connection).
What is the most efficient way to do this? I can iterate over the patterns and check them one by one, but I'm looking for a more efficient and thread-safe way. It is probably possible to group the patterns to reduce the number of checks. Any suggestions on how this can be done?
UPD: Please note that I want to match multiple patterns (thousands) against a fixed "string" (actually a struct), not vice versa. In other words, I want to find which patterns fit a given structure A.
Convert the patterns to regular expressions, and match them using RE2, which is written in C++ and is one of the fastest.
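The conversion step might look like the sketch below. It uses std::regex from the standard library as a stand-in so the example is self-contained; it does not use the RE2 API, and the function names wildcardToRegex and wildcardMatch are invented for the illustration. It assumes '*' is the only wildcard and that every other character must match literally.

```cpp
#include <regex>
#include <string>

// Converts a pattern in which '*' means "any run of characters" into an
// ECMAScript regular expression; every other character is matched literally.
std::string wildcardToRegex(const std::string& pattern) {
    static const std::string special = R"(\^$.|?+()[]{})";
    std::string out;
    for (char c : pattern) {
        if (c == '*') {
            out += ".*";
        } else {
            if (special.find(c) != std::string::npos) out += '\\';
            out += c;
        }
    }
    return out;
}

// Returns true if the whole of `text` matches the wildcard `pattern`.
bool wildcardMatch(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(wildcardToRegex(pattern)));
}
```

With thousands of patterns, each one would be compiled once when it is added and the compiled object reused for every incoming message, rather than rebuilt per call as in this sketch.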
Actually, if I understood correctly, the fourth pattern is redundant, since the first pattern is more general and includes every string that is matched by the fourth. That leaves only 3 patterns, which can be easily checked by this function:
bool matches(const char* name, const char* email)
{
return strstr(name, "new_user") || strstr(name, "del_user") || strstr(email, "#gmail.com");
}
And if you prefer to parse whole string, not just match the values of action and email, then the following function should do the trick:
bool matches2(const char* str)
{
bool match = strstr(str, "action=new_user ") || strstr(str, "action=del_user ");
if (!match)
{
const char* emailPtr = strstr(str, "email=");
if (emailPtr)
{
match = strstr(emailPtr, "#gmail.com");
}
}
return match;
}
Note that the strings you pass as arguments must be null-terminated (end with \0). See the documentation of strstr for details.
This strglobmatch supports * and ? only.
#include <string.h> /* memcmp, strchr, strlen */
char* strfixstr(char *s1, char *needle, int needle_len) {
int l1;
if (!needle_len) return (char *) s1;
if (needle_len==1) return strchr(s1, needle[0]); /* strchr is standard C; index is a legacy alias from <strings.h> */
l1 = strlen(s1);
while (l1 >= needle_len) {
l1--;
if (0==memcmp(s1,needle,needle_len)) return (char *) s1;
s1++;
}
return 0;
}
int strglobmatch(char *str, char *glob) {
/* Test: strglobmatch("almamxyz","?lmam*??") */
int min;
while (glob[0]!='\0') {
if (glob[0]!='*') {
if ((glob[0]=='?') ? (str[0]=='\0') : (str[0]!=glob[0])) return 0;
glob++; str++;
} else { /* a greedy search is adequate here */
min=0;
while (glob[0]=='*' || glob[0]=='?') min+= *glob++=='?';
while (min--!=0) if (*str++=='\0') return 0;
min=0; while (glob[0]!='*' && glob[0]!='?' && glob[0]!='\0') { glob++; min++; }
if (min==0) return 1; /* glob ends with star */
if (!(str=strfixstr(str, glob-min, min))) return 0;
str+=min;
}
}
return str[0]=='\0';
}
If all you want is wildcard matching, then you might try this algorithm. The idea is to check that all the literal (non-wildcard) substrings of the pattern occur in the string in order.
patterns = ["*#gmail.com", "akalenuk#*", "a*a#*", "ak*#gmail.*", "ak*#hotmail.*", "*#*.ua"]
string = "akalenuk#gmail.com"
preprocessed_patterns = [p.split('*') for p in patterns]
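def match(s, pp):
    # pp is the pattern split on '*'; with no '*' the pattern must equal s
    if len(pp) == 1:
        return s == pp[0]
    # the first piece is anchored at the start, the last piece at the end
    if not s.startswith(pp[0]) or not s.endswith(pp[-1]):
        return False
    i = len(pp[0])
    end = len(s) - len(pp[-1])
    # the middle pieces must appear in order between the two anchors
    for w in pp[1:-1]:
        wi = s.find(w, i, end)
        if wi == -1:
            return False
        i = wi + len(w)
    return i <= end
print([match(string, pp) for pp in preprocessed_patterns])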
But it might still be best to use regular expressions, in case you need something more than a wildcard in the future.

skipping a character in an array if previous character is the same

I'm iterating through an array of chars to do some manipulation. I want to "skip" an iteration if there are two adjacent characters that are the same.
e.g. x112abbca
skip----------^
I have some code, but it's not elegant, and I was wondering if anyone can think of a better way. I have a few cases in the switch statement and would be happy if I didn't have to use an if statement inside the switch.
switch(ent->d_name[i])
{
if(i > 0 && ent->d_name[i] == ent->d_name[i-1])
continue;
case ' ' :
...//code omited
case '-' :
...
}
By the way, an instructor once told me "avoid continues unless much code is required to replace them". Does anyone second that? (Actually he said the same about breaks)
Put the if outside the switch.
While I don't have anything against using continue and break, you can certainly bypass them this time without much code at all: simply invert the condition and put the whole switch statement within the if-block.
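A small sketch of that inverted-condition layout. The function name processName and the rule of replacing ' ' and '-' with '_' are placeholders standing in for whatever the real switch cases do; only the structure of the if around the switch is the point.

```cpp
#include <string>

// Copies `name`, replacing ' ' and '-' with '_', but skips any character
// that repeats the one immediately before it.
std::string processName(const std::string& name) {
    std::string out;
    for (std::size_t i = 0; i < name.size(); ++i) {
        // Inverted condition: enter the block only when the character is
        // NOT a repeat of its predecessor, so no continue is needed.
        if (i == 0 || name[i] != name[i - 1]) {
            switch (name[i]) {
                case ' ':
                case '-':
                    out += '_';
                    break;
                default:
                    out += name[i];
                    break;
            }
        }
    }
    return out;
}
```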
Answering the rectified question: what's clean depends on many factors. How long is this list of characters to consider: should you iterate over them yourself, or perhaps use a utility function from <algorithm>? In any case, if you are referring to the same character multiple times, perhaps you ought to give it an alias:
std::string interesting_chars("-_;,.abc");
// ...
char prev = '\0';
for (i...) {
    char cur = abc->def[i];
    if (cur != prev || interesting_chars.find(cur) == std::string::npos)
        switch (cur) // ...
    prev = cur; // remember the character for the next iteration
}
char chr = '\0';
char *cur = &ent->d_name[0];
while (*cur != '\0') {
if (chr != *cur) {
switch(...) {
}
}
chr = *cur++;
}
If you can clobber the content of the array you are analyzing, you can preprocess it with std::unique():
ent->d_name.erase(std::unique(ent->d_name.begin(), ent->d_name.end()), ent->d_name.end());
This should replace each sequence of identical characters with a single copy and shorten the string accordingly. If you can't clobber the string itself, you can create a copy in which each run of identical characters is reduced to a single character:
std::string tmp;
std::unique_copy(ent->d_name.begin(), ent->d_name.end(), std::back_inserter(tmp));
In case you are using C-strings: use std::string instead. If you insist on using C-strings and don't want to play with std::unique(), a nicer approach than yours is to keep track of the previous character, initialized to 0 (which can't be part of a C-string, after all):
char previous(0);
for (size_t i(0); ent->d_name[i]; ++i) {
if (ent->d_name[i] != previous) {
switch (previous = ent->d_name[i]) {
...
}
}
}
I hope I understand what you are trying to do, anyway this will find matching pairs and skip over a match.
char c_anotherValue[] = "Hello World!";
int i_len = strlen(c_anotherValue);
for(int i = 0; i < i_len-1;i++)
{
if(c_anotherValue[i] == c_anotherValue[i+1])
{
printf("%c%c",c_anotherValue[i],c_anotherValue[i+1]);
i++;//this will force the loop to skip
}
}