C++ search text in boolean mode - c++

I basically have two questions.
1. Is there a C++ library that does full-text boolean search, just like in MySQL? E.g.,
let's say I have:
string text = "this is my phrase keywords test with boolean query.";
string booleanQuery = "\"my phrase\" boolean -test -\"keywords test\" OR ";
booleanQuery += "\"boolean search\" -mysql -sql -java -php";
// where quotes ("") contain phrases, (-) is the NOT operator and OR is logical OR.
If the answer to the first question is no, then:
2. Is it possible to search for a phrase in text? E.g.,
string text = "this is my phrase keywords test with boolean query."; // same as previous
string keyword = "\"my phrase\"";
// here, what's the best way to search for "my phrase" in the text?

TR1 has a regex class (derived from Boost::regex). It's not quite like you've used above, but reasonably close. Boost::phoenix and Boost::Spirit also provide similar capabilities, but for a first attempt the Boost/TR1 regex class is probably a better choice.

As to the 2nd point: the string class has a find method, see http://www.cppreference.com/wiki/string/find

Sure there is, try Spirit:
http://boost-spirit.com/home/
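To make the query semantics from the question concrete, here is a minimal sketch of a boolean matcher. It is written in Python for brevity rather than C++, and it is not taken from any of the libraries mentioned above. The names parse_query, contains and matches are made up for illustration. It assumes that OR binds more loosely than the implicit AND between terms, that matching is case-insensitive, and that terms must match on word boundaries; the question does not specify any of these.

```python
import re

# A token is either a quoted phrase or a bare word, optionally prefixed by "-".
TOKEN_RX = re.compile(r'(-?)"([^"]+)"|(-?)(\S+)')

def parse_query(query):
    """Split a query into OR-groups of (required, excluded) term lists."""
    groups = [([], [])]
    for m in TOKEN_RX.finditer(query):
        negated = (m.group(1) or m.group(3)) == "-"
        term = m.group(2) or m.group(4)
        # An unquoted, un-negated OR starts a new group (assumed precedence).
        if m.group(2) is None and not negated and term == "OR":
            groups.append(([], []))
            continue
        groups[-1][1 if negated else 0].append(term.lower())
    # Drop empty groups, e.g. the one created by a trailing OR.
    return [g for g in groups if g[0] or g[1]]

def contains(text, term):
    """True if term (a word or a phrase) occurs in text on word boundaries."""
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

def matches(text, query):
    """True if at least one OR-group has all required and no excluded terms."""
    text = text.lower()
    return any(
        all(contains(text, t) for t in required)
        and not any(contains(text, t) for t in excluded)
        for required, excluded in parse_query(query)
    )
```

With the text and query from the question, matches returns False under these assumptions: the first group is rejected because the text contains "test", and the second because "boolean search" does not occur. The same structure should port to C++ with std::string::find or a regex class for the contains step.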

Related

Reserved Keyword search, but in reverse. Regex

I'm writing code that looks through a string and picks out the words that are not considered "reserved keywords". I am new to regex, but have spent quite some time learning what kind of structure I need to look for reserved words. So far, I've written something along the lines of this:
\b(import|false|int|etc)\b
I am going to use an array list to feed all of the reserved words into the string above, but I need it to work the opposite of how it works now. I've figured out how to get it to search for the specific words with the code above, but how can I get it to look for the words that are NOT listed above? I've tried incorporating the ^ symbol, but I'm not having any luck there. Any regex veterans out there who see what I'm doing wrong?
There are two obvious possibilities, depending on what (else) you are doing.
Possibility 1: Use a dict or set:
You could just match words and then test for membership in a set or dictionary:
Reserved_words = set('import false true int ...'.split())
word_rx = r'\b\w+\b' # Or whatever rule you like for "words"
for m in re.finditer(...):
    word = m.group(0)
    if word in Reserved_words:
        print("Found reserved word:", word)
    else:
        print("Found unreserved word:", word)
This approach is frequently used in lexers, where it is easier to just write a catch-all "match a word" rule, and then check a matched word against a list of keywords, than it is to write a fairly complex rule for each keyword and a catch-all to deal with the rest.
You can use a dict if you want to associate some kind of payload with the keyword (such as a class handle for instantiating a particular AST node type, etc.).
Possibility 2: Use named groups:
Another possibility is that you could use named groups in your regex to capture keyword/nonkeyword values:
word_rx = r'\b(?:(?P<keyword>import|int|true|false|\.\.\.)|(?P<nonkeyword>\w+))\b'
for m in re.finditer(...):
    word = m.group('keyword')
    if word:
        print("Found keyword:", word)
    else:
        word = m.group('nonkeyword')
        print("Found nonkeyword:", word)
This is going to be slower than the previous approach, because of prefixes: "int" matches a keyword, but "integral" starts to match an int, then fails, then backtracks to the other branch, then matches a nonkeyword. :-(
However, if you are strongly tied to a mostly-regex implementation, for example, if you have many other regex-based rules, and you are processing them in a loop, then go for it!
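A third option, not mentioned in the answer above, is a negative lookahead, which inverts the original pattern directly. The sketch below is only an illustration: the RESERVED list and the find_unreserved name are made up, and "word" is assumed to mean a run of \w characters.

```python
import re

# Hypothetical keyword list; in practice it would come from your own array list.
RESERVED = ["import", "false", "int"]

# (?!...) rejects a word position where a whole reserved word starts;
# re.escape guards against keywords that contain regex metacharacters.
UNRESERVED_RX = re.compile(
    r"\b(?!(?:" + "|".join(map(re.escape, RESERVED)) + r")\b)\w+\b"
)

def find_unreserved(text):
    """Return every word in text that is not a reserved keyword."""
    return UNRESERVED_RX.findall(text)
```

The \b inside the lookahead matters: without it, "integral" would be rejected just because it starts with "int". A ^ inside the group does not work for this purpose because it only negates inside a character class such as [^abc].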

Address extraction in Turkish with REGEX

I'm new to regex.
I want to extract the address line from Turkish text,
but in Turkish there is no standard way of writing an address.
For instance, district = mahalle.
District can be written in any of the forms below:
"Mah." "Mh." "MAH." "MH" "mh." "mah." or "mahalle"
regex = ((.*)((\b[Mm][Aa]?[Hh].?)(.*)))
This regex finds all forms of district except the last one.
There are two possible kinds of district marker:
1. "mah. mh. "
2. "mahalle"
How can I find both with the same regex?
Note: I don't want to use an | (or) alternation such as .... .... | (.*)mahalle(.*)
Since there aren't many options to begin with, you can use the OR operator to avoid complexity. Take a look at how Stanford NLP does it with US states:
ABSTATE = Ala|Ariz|[A]z|[A]rk|Calif|Colo|Conn|Ct|Dak|[D]el|Fla|Ga|[I]ll|Ind|Kans?|Ky|[L]a|[M]ass|Md|Mich|Minn|[M]iss|Mo|Mont|Neb|Nev|Okla|[O]re|[P]a|Penn|Tenn|[T]ex|Va|Vt|[W]ash|Wisc?|Wyo
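So, taking our example, the pattern would be Mah\.|Mh\.|MAH\.|MH|mh\.|mah\.|mahalle (with each dot escaped so that it matches a literal period rather than any character). You can of course simplify this by using the case-insensitive flag to cover Mah./MAH./mah. with a single alternative.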
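As a rough illustration of the case-insensitive simplification, here is a Python sketch. DISTRICT_RX and find_district_markers are invented names, and the pattern covers only the spellings listed in the question; other forms of the word are not handled.

```python
import re

# "mahalle" spelled out, or the abbreviation m(a)h with an optional trailing dot.
# The \b after the letters keeps names such as "Mahmut" from matching.
DISTRICT_RX = re.compile(r"\b(?:mahalle|ma?h)\b\.?", re.IGNORECASE)

def find_district_markers(text):
    """Return the district markers found in text, exactly as they were written."""
    return [m.group(0) for m in DISTRICT_RX.finditer(text)]
```

This still uses one alternation, but only between the full word and the abbreviation, which is hard to avoid because the two forms share nothing beyond the first letters.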

How to include 2 words within Regex and result must be based on only those 2 words VB.NET

I would like to know how to require 2 or more keywords within a Regex, so that results are shown only when all of the defined words are present, not just one of them.
What I currently have works with multiple keywords, but I want it to require BOTH words, not either one of them.
For example:
Dim pattern As String = "(?i)[\t ](?<w>((arma)|(crapo))[a-z0-9]*)[\t ]"
Now the code works fine by including 'arma' or 'crapo'. I only want it to include BOTH 'arma' AND 'crapo'; otherwise it should not show any results.
I am looking for certain keywords within a PDF document, and I only want to be shown results if the PDF document includes BOTH 'arma' and 'crapo'. (It works fine showing results for 'arma' OR 'crapo'; I want to see results based on 'arma' AND 'crapo'.)
Sorry for sounding so repetitive.
Edit: Here is my code. Please read comment.
Dim filesz() As String = GetPatternedFiles("c:\temp\", New String() {"tes*.pdf", "fes*.pdf", "Bas*.pdf"})
' GetPatternedFiles is a function; GetTextFromPDF is another function.
For Each s As String In filesz
    Dim thetext As String = Nothing
    Dim pattern As String = "(?i)[\t ](?<w>(crapo)|(arma)[a-z0-9]*)[\t ]"
    thetext = GetTextFromPDF(s)
    For Each m As Match In Regex.Matches(thetext, pattern)
        ListBox1.Items.Add(s)
    Next
Next
You can use this regex:
\barma\b.*?\bcrapo\b|\bcrapo\b.*?\barma\b
The idea is to match arma, followed by anything, followed by crapo (or the same two words in the reverse order), and to use word boundaries to avoid matching words like karma.
However, if you want to match karma or crapotos as you asked in your comment, you can use:
arma.*?crapo|crapo.*?arma
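An alternative to one combined regex is to test each keyword separately and require every test to succeed. This is a different technique from the answer above, sketched in Python rather than VB.NET; contains_all and its whole_words parameter are made-up names.

```python
import re

def contains_all(text, words, whole_words=True):
    """Return True only if every word in words occurs somewhere in text."""
    for word in words:
        pattern = re.escape(word)
        if whole_words:
            # Word boundaries keep "arma" from matching inside "karma".
            pattern = r"\b" + pattern + r"\b"
        if re.search(pattern, text, re.IGNORECASE) is None:
            return False
    return True
```

Because each word is searched independently, it does not matter whether the two words are on different lines of the extracted PDF text. With the single-regex version, the dot does not match a newline by default, so the words would have to be on the same line unless single-line mode is enabled (RegexOptions.Singleline in .NET).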

Best way to test for FOO or BAR or Foo or Bar in a regex?

I am checking for keywords which are headers, and the input is totally out of my control.
So far I've figured out that they will have the first letter capitalized, but they might also be in all caps.
I can do a Java Pattern that is:
Pattern test = Pattern.compile("\\b(FOO|BAR|Foo|Bar)\\b");
And doing a Pattern matcher with that works fine. As in:
boolean ans = test.matcher(sometext).find();
However when I have 6 or 8 keywords to check for it starts to get kind of ugly to have all the keywords there twice.
Can anyone come up with a more elegant regex that might do this?
Thanks
ADDED 3/26/15
Let me re-emphasize: it's not as simple as just ignoring case completely, which is what was initially suggested. The first letter does need to be capitalized; it's the rest of the string that can be upper or lower.
Use the "ignore case" flag (?i):
Pattern test = Pattern.compile("(?i)\\b(FOO|BAR)\\b");
You don't need \\b...\\b, since anything written literally is treated as a word rather than as a character class.
Also use the i (ignore case) modifier.
Your regex should be:
(foo|bar)
Add the i modifier in whatever form your language uses.
Also, you are saying "to test". Using regex for that is overkill.
Do this:
String str = "Welcome to Foo bar ";
str = str.toLowerCase();
return str.contains("foo") || str.contains("bar"); // returns true or false
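For the stricter requirement stated in the question's addendum, one way to avoid typing every keyword twice is to generate the alternatives from a single list. The sketch below is in Python, not Java, and build_header_pattern is a hypothetical helper. It assumes the remainder of each keyword is either all lower case or all upper case, as in the original FOO|BAR|Foo|Bar pattern.

```python
import re

def build_header_pattern(keywords):
    """Build a regex that matches each keyword as 'Foo' or 'FOO', but not 'foo'."""
    alternatives = []
    for word in keywords:
        alternatives.append(re.escape(word.capitalize()))  # e.g. Foo
        alternatives.append(re.escape(word.upper()))       # e.g. FOO
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")
```

The same idea carries over to Java by building the pattern string in a loop before calling Pattern.compile. If mixed-case remainders such as "FoO" should also be accepted, the generated alternatives would need to change.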

Regex to replace string with another string in MS Word?

Can anyone help me with a regex to turn:
filename_author
to
author_filename
I am using MS Word 2003 and am trying to do this with Word's Find-and-Replace. I've tried the use wildcards feature but haven't had any luck.
Am I only going to be able to do it programmatically?
Here is the regex:
([^_]*)_(.*)
And here is a C# example:
using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        String test = "filename_author";
        // @ marks a verbatim string literal, so no backslash escaping is needed
        String result = Regex.Replace(test, @"([^_]*)_(.*)", "$2_$1");
    }
}
Here is a Python example:
from re import sub
test = "filename_author"
result = sub(r'([^_]*)_(.*)', r'\2_\1', test)
Edit: In order to do this in Microsoft Word using wildcards use this as a search string:
(<*>)_(<*>)
and replace with this:
\2_\1
Also, please see "Add power to Word searches with regular expressions" for an explanation of the syntax I have used above:
The asterisk (*) returns all the text in the word.
The less than and greater than symbols (< >) mark the start and end of each word, respectively. They ensure that the search returns a single word.
The parentheses and the space between them divide the words into distinct groups: (first word) (second word). The parentheses also indicate the order in which you want search to evaluate each expression.
Here you go:
s/^([a-zA-Z]+)_([a-zA-Z]+)$/\2_\1/
Depending on the context, that might be a little greedy.
Search pattern:
([^_]+)_(.+)
Replacement pattern:
$2_$1
In .NET you could use ([^_]+)_([^_]+) as the regex and then $2_$1 as the substitution pattern, for this very specific type of case. If you need more than 2 parts it gets a lot more complicated.
Since you're in MS Word, you might try a non-programming approach. Highlight all of the text, select Table -> Convert -> Text to Table. Set the number of columns at 2. Choose Separate Text At, select the Other radio, and enter an _. That will give you a table. Switch the two columns. Then convert the table back to text using the _ again.
Or you could copy the whole thing to Excel, construct a formula to split and rejoin the text and then copy and paste that back to Word. Either would work.
In C# you could also do something like this.
string[] parts = "filename_author".Split('_');
return parts[1] + "_" + parts[0];
You asked about regex of course, but this might be a good alternative.
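The same non-regex idea can be sketched in Python with str.partition. The function name swap_on_underscore is made up, and splitting on the first underscore only is an assumption about how names with several underscores should be treated.

```python
def swap_on_underscore(name):
    """Swap the parts before and after the first underscore, if there is one."""
    head, sep, tail = name.partition("_")
    if not sep:
        # No underscore present: leave the text unchanged.
        return name
    return tail + sep + head
```

Unlike indexing into the result of a split, this does not fail when the input has no underscore.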