Using regexp with sphinx - regex

I need to make an algorithm that lets me use fuzzy (regexp-like) search in Sphinx.
For example, I need to find a phrase that contains uncertain symbols: "2x4" may look like "2x4", "2*4" or "2-4".
I want to do something like this: "2(x|*|-)4". But if I try to use this construction in a query, Sphinx splits it into three words: "2", "(x|*|-)" and "4":
$ search -p "2x4"
...
index 'xxx': query '2x4 ': returned 25 matches of 25 total in 0.000 sec
...
words:
1. '2x4': 25 documents, 25 hits
$ search -p "2(x|y)4"
...
index 'xxx': query '2(x|y)4 ': returned 0 matches of 0 total in 0.000 sec
words:
1. '2': 816 documents, 842 hits
2. 'x': 21 documents, 21 hits
3. 'y': 0 documents, 0 hits
4. '4': 2953 documents, 3014 hits
As an ugly hack I can do something like (2x4)|(2*4)|(2-4), but this is not a good solution if I get a longer phrase like "2x4x2.2" and need "2(x|*|-)4(x|*|-)2(.|,)2".
I can use the "charset_table" option to define "*>x", "->x", ",>." and so on, but this is not a flexible solution.
Can you suggest a better solution?
ps: sorry for my english =)

From what I've read, Sphinx doesn't support regex searches. Moreover, while the extended syntax (enabled with the -e option) has operators that support alternatives (the "OR" operator: |) and sequencing (the strict order operator: <<), they only work on words, not atoms, so that 2 << (x|*|-) << 4 would match strings where each element is a separate word, such as '2 x 4', '2 * 4'.
One option is to write a utility that converts a pattern of the form 2(x|*|-)4(x|*|-)2(.|,)2 (or, to follow the regex idiom, 2[-*x]4[-*x]2[.,]2) into a Sphinx extended query.
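A minimal sketch of such a converter, in Python. It only understands literal text and simple [...] character classes, and it emits the same OR-of-alternatives form the question already uses as a workaround. The function names are made up for illustration, and how Sphinx tokenizes characters such as * and - still depends on your charset_table, so treat the output as a starting point rather than a guaranteed-working query.

```python
import itertools
import re


def expand_classes(pattern):
    """Expand simple [...] character classes into every literal variant."""
    # Split into literal runs and bracketed classes, keeping the classes.
    parts = re.split(r"(\[[^\]]+\])", pattern)
    choices = []
    for part in parts:
        if part.startswith("[") and part.endswith("]"):
            choices.append(list(part[1:-1]))  # one choice per class member
        elif part:
            choices.append([part])            # a literal run has one choice
    return ["".join(combo) for combo in itertools.product(*choices)]


def to_or_query(pattern):
    """Join the variants with the OR operator: (2x4) | (2*4) | (2-4)."""
    return " | ".join("(%s)" % variant for variant in expand_classes(pattern))
```

For example, to_or_query("2[x*-]4") returns "(2x4) | (2*4) | (2-4)". The number of variants grows multiplicatively with each class, so this is only practical for short patterns.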

You can indeed use regular expressions with Sphinx.
While they cannot be used at search time, they can be used while building the index to identify a group of words/symbols that should be considered to be the same token.
http://sphinxsearch.com/docs/current.html#conf-regexp-filter
# index '13-inch' as '13inch'
regexp_filter = \b(\d+)\" => \1inch
# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color

Sphinx indexes whole words: it 'tokenizes' each word into an integer, and that integer is what gets stored in the index. As such, regular expressions can't work, because Sphinx doesn't have the original words.
However, there is dict=keywords, which does store the words in the index. But right now it can only be used for * and ? wildcards; it doesn't support regular expressions.
Also, perhaps you could use the techniques discussed here:
http://swtch.com/~rsc/regexp/regexp4.html
This shows how generic regex searching can be implemented with a trigram index. Sphinx itself would work as the trigram index: you store the trigrams as keywords, which Sphinx then indexes, and Sphinx can run the boolean queries that that system outputs.
(Normal Sphinx works pretty much as the 'Indexed Word Search' section describes, so the trick would be using Sphinx as the backend for the indexed regex search.)
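To make the trigram idea concrete, here is a small self-contained Python sketch. It is not Sphinx code: a plain dict stands in for the keyword index, and it only handles the simplest case, where the caller supplies a literal substring that every match must contain. Turning an arbitrary regex into a boolean trigram query is the harder part covered in the linked article. All names here are hypothetical.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, docs, literal, pattern):
    """AND the trigrams of a required literal, then verify with the regex."""
    grams = trigrams(literal)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(docs)  # literal too short to filter on
    regex = re.compile(pattern)
    return sorted(d for d in candidates if regex.search(docs[d]))
```

The index only narrows the candidate set; the real regex still runs on each surviving document, which is what keeps the results exact.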

Related

Regex - finding optional arguments

I would like to write a regex to extract arguments and operation sign if given or just extract a value in a given formula.
"=400/500" will find 400, /, and 500
"=400" will find 400
So far I tried with group matching approach and a regex like following:
=(.*)(/|\*|\+|-)(.*)
However, that does not work in all cases. For example, I get following:
"=400/500" will find 400, /, and 500 which is exactly what I need
"=400" does not find any matches and I expect to get 400
I tried some modifications to my script but so far without any success.
Thanks for your help in advance!
Try
=(.*)(\/)(.*)|=(.*)
if you are going with any-character wildcards and a "/" delimiter.
Try this simple one; hope it helps. You can add () at different places to capture the matches in different groups.
Regex: \d+[+*\/-]?\d+
1. [+*\/-]? matches any one of the operators +, -, / and *; the ? makes it optional.
2. \d+ matches one or more digits.
Do you mean something like this? I assumed that it will be digits only.
=\d+[+\-*\/]?\d+
If you would like to use grouping:
=(\d+)([+\-*\/])?(\d+)
Remove the = if you don't want to match it.
Thanks to (.*) who responded :-)
Your answers were very quick and insightful. I was not 100% clear in my original question, hence people provided different solutions. I wanted to utilize regex group searching. Expressions are mainly formulae or assignment operation. For example:
=500 // want to get 500 and understand there is no operator
=500+34 // want to get 500 and 34 and understand the operation is addition
=500/100 // want to get 500 and 100 and understand the operation is division
=500-100 //...
=500*100
That way I can easily extract the arguments and the operator, or just a value when there is no operator. I settled on a modified version of Sahil's answer. For example, now I am using the following:
=(\d+)([+*\/-])?(\d+)?
These are results for the following inputs:
Input string: "=400/500" Result: 0: [0,8] =400/500
1: [1,4] 400
2: [4,5] /
3: [5,8] 500
Input string: "=500" Result: 0: [0,4] =500
1: [1,4] 500
2: [-1,-1] null
3: [-1,-1] null
Input string: "=400/" Result: 0: [0,5] =400/
1: [1,4] 400
2: [4,5] /
3: [-1,-1] null
Used this way, the number of groups that matched tells me which type of formula I have, and I can then extract all the values from the groups.
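The same pattern can be tried in Python's re module, as a rough sketch (the original poster's language is not stated, and parse_formula is a made-up name). Groups that did not take part in the match come back as None, which plays the role of the null entries in the results above.

```python
import re

# The pattern from the post: a value, then an optional operator and operand.
FORMULA = re.compile(r"=(\d+)([+*\/-])?(\d+)?")


def parse_formula(text):
    """Return (left, operator, right); unmatched groups are None."""
    match = FORMULA.match(text)
    if match is None:
        return None
    return match.groups()
```

Here parse_formula("=400/500") returns ("400", "/", "500"), and parse_formula("=500") returns ("500", None, None).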

how to do a fast regex search on a hdf5 database

I have an HDF5 database with 100 million+ rows of text each storing a simple three column set of values:
ID WORD HEADWORD
1 the the
2 cats cat
3 sat sit
4 on on
5 the the
6 mats mat
...
I want to do a search on the "WORD" column to find all hits for at (i.e., 'cats', 'sat', 'mats').
In some other database (e.g. PostgreSQL) I might do this with a simple wildcard search like '?at?'. If I could search the HDF5 index using a regex, that would be fine, but I don't think this is possible. Any suggestions for how to do this kind of 'wildcard' (regex) search quickly?
Try following regex
[^\s]+[\s]+([a-zA-Z]*at[a-zA-Z]*)[\s]+[^\s]+
Group 1 in the above regex will give you the desired result.
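Applied row by row in Python, the pattern from this answer behaves as follows. This is only a sketch on a few in-memory lines (the sample rows come from the question, and find_words is a hypothetical name). It scans every row, so it shows what the regex matches, not how to get an indexed lookup out of HDF5.

```python
import re

# ID, whitespace, WORD containing "at" (captured), whitespace, HEADWORD.
ROW = re.compile(r"[^\s]+[\s]+([a-zA-Z]*at[a-zA-Z]*)[\s]+[^\s]+")

SAMPLE_ROWS = [
    "1 the the",
    "2 cats cat",
    "3 sat sit",
    "4 on on",
    "5 the the",
    "6 mats mat",
]


def find_words(rows):
    """Return the WORD column of every row whose word contains 'at'."""
    words = []
    for row in rows:
        match = ROW.fullmatch(row)
        if match:
            words.append(match.group(1))
    return words
```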

MongoDB count and regex search count not matching

I have a huge MongoDB database containing documents that are indexed on a name field.
So basically, I had a text file containing 48 000 016 entries. (I use wc -l to obtain that count)
To give more context, the database contains a lot of names that were extracted from OCR (so a lot of junk) and also names in other languages (Japanese, Russian, etc...).
My MongoDB collection statistics tell me I have 48 000 016 documents, which is fine.
The problem happens because I want to query the items on their names (which is a standard string) using this regex :
/^([A-Z]|\W|\s|\d|_)/i
So my checklist :
any letter - check
case insensitive - check
any number - check
underscore - check
\W for anything that is not a number, letter or underscore.
So from what I understand, this regex should get me everything, since I am querying the database on string values. But the problem is that I am missing 5 items.
When I run the count on the result of the query, I get 48 000 011 items.
Any idea where these 5 could be? I know I could simply go through all my items with a cursor, but because of the nature of my problem I need a regex that can retrieve all my values.
I ran this query on the Database as indicated by the comments.
db.name.aggregate({$group:{_id:"uniqueDocs", count:{$sum:1}}})
Result is :
{ "result" : [ ], "ok" : 1 }
Thanks a lot !
I see you are using the anchor ^ to match the beginning of a line. It is possible that a line starts with a newline \n or a carriage return character \r.
Try adding \n and \r to your regex:
/^([A-Z]|\W|\s|\d|\r|\n|_)/i
Also try removing the anchor:
/([A-Z]|\W|\s|\d|\r|\n|_)/i
As a last option, invert your regex to see which records are not included. These regex expressions should also match empty strings.
/^(?![.*])/i
I want to thank @Paul Wasilewski for giving me some great solutions. I found my problem, and it was not related to the regex.
My 5 entries were simply not indexed: they were more than 1024 bytes long, so MongoDB could not index them.
So that's the reason why they could not be queried by regex.
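To spot such entries before loading them, one could measure the encoded length of each name. A small sketch, assuming the names are stored as UTF-8 and using the 1024-byte limit mentioned above; oversized_names is a hypothetical helper, not part of any MongoDB driver.

```python
def oversized_names(names, limit=1024):
    """Return the names whose UTF-8 encoding is longer than limit bytes."""
    return [name for name in names if len(name.encode("utf-8")) > limit]
```

Note that the check is on bytes, not characters: a name of 600 two-byte characters is over the limit even though it has fewer than 1024 characters.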

Trying to check input textbox for time

I have to make this overview of questions and the user has to be able to insert a time.
To do this I made 2 textboxes, 1 is for the hour input and 1 is for the minute input.
What I want to do now is check that the values aren't too high to be correct.
Example:
The hour value can't be higher than 23 and the minute can't be higher than 59.
What is the best method for checking this?
I've been thinking about if statements, but maybe there is a much more efficient way to get this done?
Maybe regular expressions, although I wouldn't know the correct syntax for this.
Thanks in advance.
If it has to be a regex:
^(?:2[0-3]|[01]?[0-9])$
will validate the hour and
^[0-5]?[0-9]$
will validate the minute.
Explanation for the "Hours" regex: (you can figure out the minutes yourself easily):
^ # Match start of string
(?: # Match either...
2[0-3] # 2, followed by 0, 1, 2 or 3,
| # or...
[01]? # 0 or 1 (optional; the empty string is OK, too), followed by
[0-9] # any digit
) # End of group
$ # Match end of string
If statements are definitely the way to go. There's no reason to use a regular expression for something so simple... it's like using a sledgehammer to place a small nail into a wall. If statements are also very efficient and easy to read... there's no reason to use regex for what you're doing.
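For comparison, here are both suggestions side by side as a Python sketch (the question does not say which language the textboxes live in, and the function names are made up). The first uses the two regexes from the answer above; the second parses the text as integers and does plain range checks.

```python
import re

HOUR_RE = re.compile(r"^(?:2[0-3]|[01]?[0-9])$")
MINUTE_RE = re.compile(r"^[0-5]?[0-9]$")


def valid_time_regex(hour_text, minute_text):
    """Validate the two textbox values with the regexes from the answer."""
    return bool(HOUR_RE.match(hour_text)) and bool(MINUTE_RE.match(minute_text))


def valid_time_ints(hour_text, minute_text):
    """Validate by converting to integers and comparing against the limits."""
    try:
        hour, minute = int(hour_text), int(minute_text)
    except ValueError:
        return False  # not a number at all
    return 0 <= hour <= 23 and 0 <= minute <= 59
```

Both accept "23" and "59" and reject "24" or "60". The integer version is a bit more lenient about input such as surrounding spaces, because int() strips whitespace before parsing.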

How can I parse this without regex?

A friend of mine said if the regex I'm using is too long, it's probably the wrong tool for the job. Any thoughts here on a better way to parse this text? I have a regex that returns everything to an array I can easily just chunk out, but if there's another simpler way I'd really like to see it.
Here's what it looks like:
2 AB 123A 01JAN M ABCDEF AA1 100A 200A 02JAN T /ABCD /E
Here's a break down of that:
2 is the line number, these range from 1 all the way to 99. If you can't see it because of formatting, there is a space character prepending numbers less than 10.
The space may or may not be replaced by an *
AB is an important unit of data (UOD).
AB may be prepended by /CD which is another important UOD.
123 is an important UOD. It can range from 1 (prepended by 4 spaces) to 99999.
A is an important UOD.
01JAN is a day/month combination, I need to extract both UODs.
M is a day name short form. This may be a number between 1 and 7.
ABC is an important UOD.
DEF is an important UOD.
The space after DEF may be an *
AA1 may be zero characters, or it may be 5. It is unimportant.
100A is a timestamp, but may be in the format 1300. The A may be N when the time is 1200 or P for times in the PM.
We then see another timestamp.
The next date part may not be there, for example, this is valid:
93*DE/QQ51234 30APR J QWERTY*QQ0 1250 0520 /ABCD*ASDFAS /E
The data where /ABCD*ASDFAS /E appears is irrelevant to the application, but, this is where the second date stamp may appear. The front-slash may be something else (such as a letter).
Note:
It is not space delimited, some parts of the body run into others. Character position is only accurate for the first two or three items on the list
I don't think I left anything out, but, if there's an easier way to parse out a string like this than writing a regex, please let me know.
This is a perfect task for regular expressions. The text does not contain nesting and the items you're matching are fairly simple taken individually.
Most regular expression syntaxes have an extended (x) flag or mode that allows whitespace and comments to improve readability. For example:
$regex = '~
# The delimiter is ~ because # starts a comment in extended mode.
# 2 is the line number, these range from 1 all the way to 99.
# There is a space character prepending numbers less than 10.
# The space may or may not be replaced by an *.
(?:[ *]\d|\d\d)
\s
# AB is an important unit of data (UOD).
# AB may be prepended by /CD which is another important UOD.
(/CD)?AB
\s
# 123 is an important UOD. It can range from 1 (prepended by 4 spaces)
# to 99999. The group keeps the alternation local to this field.
(?:\s{4}\d{1}|\s{3}\d{2}|\s{2}\d{3}|\s{1}\d{4}|\d{5})
~x';
And so on.
A regex seems fine for this application, but for simplicity and readability, you might want to split this into several regexes (one for each field) so people can more easily follow which part of the regex corresponds to which variable.
You can always code your own parser by hand, but that would be more lines of code than a regex. The lines of code, however, will probably be simpler to follow for the reader.
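As a sketch of the one-regex-per-field idea, in Python: the two patterns below cover only the day/month and timestamp fields, and their exact shapes are assumptions read off the description in the question (two digits plus a three-letter month; three or four digits with an optional A, N or P suffix). They also assume each field has already been isolated from the line, which is the harder part here. All names are hypothetical.

```python
import re

# Assumed shapes: 01JAN / 30APR for dates, 100A / 1300 / 0520 for times.
DAY_MONTH = re.compile(r"(?P<day>\d{2})(?P<month>[A-Z]{3})")
TIME = re.compile(r"(?P<hour>\d{1,2})(?P<minute>\d{2})(?P<suffix>[ANP])?")


def parse_day_month(field):
    """Split a field like 01JAN into (day, month), or None if it differs."""
    match = DAY_MONTH.fullmatch(field)
    if match is None:
        return None
    return int(match.group("day")), match.group("month")


def parse_time(field):
    """Split a field like 100A or 1300 into (hour, minute, suffix)."""
    match = TIME.fullmatch(field)
    if match is None:
        return None
    return int(match.group("hour")), int(match.group("minute")), match.group("suffix")
```

Named groups make it obvious which part of each pattern feeds which variable, which is the readability gain this answer is after.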
Simply write a custom parser that handles it line by line. It seems like everything is at a fixed position rather than space/comma-delimited, so simply use those as indices into what you need:
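line_number = int(line_text[0:2].lstrip(" *"))  # first two characters; the pad may be a space or *
ab_unit = line_text[3:5]  # the two-character unit after the separator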
...
If it is indeed space-delimited, simply split() each line and then parse through each, splitting each chunk into component parts where appropriate.