I currently have a PyMongo collection with around 100,000 documents. I need to perform a regex search on each of these documents, checking a particular field (which is an array) against around 1,800 values to see if it contains one of those 1,800 strings. After testing a variety of approaches, such as pre-compiling the regular expressions, multiprocessing, and multi-threading, performance is still abysmal and the search takes around 30-45 minutes.
The current regex I'm using to find the value at the end of the string is:
rgx = re.compile(string_To_Be_Compared + '$')
This is then run using a standard PyMongo find query:
coll.find( { 'field' : rgx } )
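Put together, the per-value loop I'm running looks roughly like this (a rough sketch; the database, collection, field name, and list of values are placeholders):

import re
from pymongo import MongoClient

coll = MongoClient()['mydb']['mycoll']   # placeholder database/collection
values = ['value1', 'value2']            # ~1,800 strings in practice

matching_docs = []
for value in values:
    rgx = re.compile(value + '$')        # anchored at the end of the string
    matching_docs.extend(coll.find({'field': rgx}))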
I was wondering if anyone had any suggestions for querying these values in a more optimal way? Ideally, returning all the matching values should take less than 5 minutes. Would the best course of action be to use something like Elasticsearch, or am I missing something basic?
Thanks for your time.
I am trying to find some logs in Kibana by using regular expressions. I am aware that Kibana doesn't support "classical" regexes, but rather the Lucene query syntax. I have read through its documentation (https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-regexp-query.html#regexp-syntax) and, in my opinion, my queries should work, but they don't.
Here is an example log entry that I want to target with my query:
Timings are: sync started at 2019-02-12 19:15:09.402; accounts
downloaded:+760ms/760ms; accounts data downloaded:+1221ms/1981ms;
categorization pushed:+0ms/1981ms; categorization
started:+131ms/2112ms; categorization completed:+123ms/2235ms; in
total:2235ms.
What I want to find in the end is all such log entries where the time of "categorization started" exceeds a certain threshold. However, my queries already fail while I'm just trying to approach that final query.
I get results when I query:
message:"/categorization started/"
But when I modify it to:
message:/categorization started/
I get nothing. The following attempts also give nothing:
message:/categorization\sstarted/
message:/.*categorization\sstarted.*/
message:/.*categorization.*started.*/
At this point I'm already lost - why do all these queries not match anything?
In my mind, the final query that should get what I want should be as follows (finding all entries where categorization started time was 10,000ms or more):
message:/.*categorization started:\+<10000-99999>ms.*/
This of course also returns nothing, which doesn't surprise me given that the simpler queries above already fail.
Can anyone explain to me what I am doing wrong?
Thank you
I suggest you use:
message:*categorization started*
The regex and wildcard syntax in the query bar is applied to the individual terms produced by the analyzer, not to the whole message, which is most likely why /categorization started/ matches nothing: no single indexed token contains both words. The quoted form works because it is treated as a phrase search for the analyzed text rather than as a regex.
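For the numeric-threshold part of your question, one option is to do that filtering client-side after pulling back the matching messages. A small Python sketch of that step (the regex and threshold here are just examples based on your log format):

import re

STARTED_RE = re.compile(r'categorization\s+started:\+(\d+)ms')

def categorization_started_ms(message):
    # return the "categorization started" increment in ms, or None if absent
    m = STARTED_RE.search(message)
    return int(m.group(1)) if m else None

msg = ('Timings are: sync started at 2019-02-12 19:15:09.402; '
       'categorization started:+131ms/2112ms; in total:2235ms.')
print(categorization_started_ms(msg))             # 131
print(categorization_started_ms(msg) >= 10000)    # False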
My data looks like this:
[123:1000,156,132,123,156,123]
[123:1009,392,132,123,156,123]
[234:987,789,132,123,156,123]
[234:8765,789,132,123,156,123]
I need to count the number of times "123" occurs in each line, using Expression Language in NiFi.
I need to do it in Expression Language only. How can I count it?
Any help is appreciated.
You should use the SplitContent processor to split the flowfile content into individual flowfiles per line, then use ExtractText with a regex like pattern=(123)? which will result in an attribute being added to the flowfile for each matching group:
[123:1009,392,132,123,156,123] -> pattern.1, pattern.2, pattern.3
[234:987,789,132,123,156,123] -> pattern.1, pattern.2
Finally, you can use a ScanAttribute processor to detect the attribute with the highest group count in each of the flowfiles and route it to an UpdateAttribute to put that value into a common flowfile attribute (i.e. count). You could also replace some steps with an ExecuteStreamCommand and use a variety of OS-level tools (grep/awk/sed/cut/etc.) to perform the count, return that value, and update the content of the flowfile.
It would probably be simpler for you to perform this count within an ExecuteScript processor, as it could be done in 1-2 lines of Groovy, Ruby, Python, or JavaScript and would not require multiple processors. Apache NiFi is designed for data routing and simple transformation, not complex event processing, so there are no standard processors for these tasks. There is an open Jira ticket, "Add processor to perform simple aggregations", which has a patch available and may be useful for you.
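For example, a Jython body for ExecuteScript along these lines reads the flowfile content, counts the occurrences, and writes the result to an attribute (a sketch only; the attribute name and error handling are kept minimal):

from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

class CountOccurrences(InputStreamCallback):
    def __init__(self):
        self.count = 0
    def process(self, inputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        self.count = text.count('123')   # non-overlapping substring count

flowFile = session.get()
if flowFile is not None:
    callback = CountOccurrences()
    session.read(flowFile, callback)
    flowFile = session.putAttribute(flowFile, 'count', str(callback.count))
    session.transfer(flowFile, REL_SUCCESS)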
According to the documentation, a count is done like this:
${allMatchingAttributes(".*"):contains("123"):count()}
I have set up an Amazon CloudSearch domain with records that hold addresses. I want to do a fuzzy text search on an address field.
Say I have a record with the following address:
1600 Amphitheatre Parkway, Mountain View, CA 94043.
If I search for 'Amphitheatre Parkway, Muntain View'~5 (I deleted the 'o' in "Mountain"), I get no results.
If I search for Muntain~5 it finds the record, but if I search for Miunntain~5 it doesn't find anything.
I should add that I created a free-text analysis scheme with no stemming, stopwords, or synonyms; this is what is used for the address field, which is of type text.
How should I set up CloudSearch to be able to do these sort of queries?
Querying 'Amphitheatre Parkway, Muntain View'~5 is actually performing a fuzzy/sloppy phrase search, where it's searching for those words within 5 words of one another. I don't think that's what you intended.
The Miunntain~5 query is really interesting: it does indeed return no results, but miunntain~5 (lowercase m) does.
I did notice that switching between lower and upper case in my queries slightly affects the match scores, so perhaps the capital M just makes it too weak a match. I don't have a good explanation for that; it's certainly counterintuitive, so maybe it is a bug.
Finally, your actual question about setting up CloudSearch to handle these queries: unfortunately CloudSearch doesn't expose the "Did you mean..." spellcheck feature from Solr, so there isn't really a good way to do this; slapping some tildes on things is about the best you can do.
See http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
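If you do go the tilde route, the closest approximation is per-term fuzziness rather than phrase slop. A small sketch of building such a query string (the edit distance and punctuation handling are just examples):

def fuzzify(query, max_edits=1):
    # append term~N per-term fuzziness for the simple query parser,
    # instead of "..."~N phrase slop, which only relaxes word positions
    terms = query.replace(',', ' ').split()
    return ' '.join('%s~%d' % (term, max_edits) for term in terms)

print(fuzzify('Amphitheatre Parkway, Muntain View'))
# Amphitheatre~1 Parkway~1 Muntain~1 View~1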
I am using a database to return a couple of values I place there. Let's just say the data is google, yahoo, bing.
The Code
dbCursor.execute('''SELECT ticker FROM SearchEngines''')
allEngines = dbCursor.fetchall()

for engine in allEngines:
    print engine
Yields the following result:
(u'google',)
(u'yahoo',)
(u'bing',)
This is troublesome because I need to append the result to a URL as a plain string. Does anybody know a way around this?
Thanks
fetchall() returns each row as a tuple, even if you're only selecting one field. So...
for engine in allEngines:
    print engine[0]
Or:
for (engine,) in allEngines:
    print engine
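Either way you end up with a plain string, which you can then append to your URL, for example (the base URL here is just a placeholder):

base_url = 'http://example.com/quote?ticker='

for (engine,) in allEngines:
    print(base_url + engine)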
Hope this helps.
I want to be able to return useful records if a user searches for a keyword that is very, very common in a Solr index, for example "education".
In this case, close to 99% of the records contain that word, so searches for this word and similar ones take a long time.
This is for Solr on ColdFusion, but I'm open to solutions that are isolated to just Solr.
Right now I'm thinking of coming up with a list of stopwords and preventing those searches from taking place altogether.
If searches are taking a long time, it could be because you are not limiting the number of results that are returned. The <cfsearch> tag has a maxrows attribute, as well as a startrow attribute, that you can use to limit or paginate the data. Alternatively, you could call Solr's web service directly through a <cfhttp> call:
<cfhttp url="http://localhost:8983/solr/<collection_name>/select/?q=<searchterm>&fl=*,score&rows=100&wt=json" />
Solr will return 10 rows by default; you can change this with the rows parameter. You can use the start parameter as well (note that Solr starts counting with 0 instead of 1). I believe this solution is more flexible, especially if you're using CF 9, as it allows you to paginate while sorting on a field other than score.
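The same request is easy to make outside CFML too; here is a rough Python sketch of paging through results with rows and start (the collection name is a placeholder, as above):

import requests

def fetch_page(term, page, page_size=100):
    params = {
        'q': term,
        'fl': '*,score',
        'rows': page_size,
        'start': page * page_size,   # Solr offsets start at 0
        'wt': 'json',
    }
    resp = requests.get('http://localhost:8983/solr/<collection_name>/select/', params=params)
    return resp.json()['response']['docs']

first_page = fetch_page('education', 0)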
You can find more detail here:
http://www.thefaberfamily.org/search-smith/coldfusion-solr-tutorial/
If the user searches on just one term that is exceedingly common then you need to limit your results and advise the user that there were too many matches.
In the more general case, you want to perform a two-pass (at least) approach. Take your search terms and perform a lookup to determine their 'common-ness'. You want to filter based on least common terms first, and more common terms last.
For example, user searches serendipitous education. You identify that you have 11 matches for serendipitous, and 900000 matches for education. Thus you apply the serendipitous filter first, resulting in 11 matches. Then apply the education filter, resulting in 7 matches.
The key to fast searching is indexing and precomputed statistics. If you have statistics like this on hand, you can dynamically construct an optimised approach.
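Here is a sketch of that idea with plain Python sets (in a real Solr setup the per-term counts would come from index statistics such as document frequency rather than an in-memory dict):

def search(terms, postings):
    # postings maps term -> set of matching document ids
    ordered = sorted(terms, key=lambda t: len(postings.get(t, set())))
    results = None
    for term in ordered:              # least common term first
        matches = postings.get(term, set())
        results = matches if results is None else results & matches
        if not results:               # stop early once nothing can match
            break
    return results or set()

postings = {
    'serendipitous': {3, 17, 42},
    'education': set(range(100000)),
}
# prints only the few documents matching both terms
print(search(['education', 'serendipitous'], postings))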