Marklogic Diacritic Sensitive search not working for unfiltered searches - diacritics

If I do a diacritic sensitive cts:query for cts:search in unfiltered mode then I get correct result but doing the same in filtered mode gives me incorrect results.
For example :
cts:search($data,($cts:query('unfiltered','diacritic-sensitive')))
returns incorrect results.
but :
cts:search($data,($cts:query('filtered','diacritic-sensitive')))
returns correct results.
So, is there any way to get correct results for unfiltered searches too?
Please find below my details of the code.
for $result in cts:search (fn:collection ($searchable-collection), $cts-query, ('unfiltered', $relevance-score-algo), 0.0)
order by xs:dateTime ($result//c:created-on) descending
return $result/element()
Where the $cts-query is like this.
cts:element-query($element-to-query,
cts:word-query($search-text,
$search-options,
$weight)
In options I can pass "diacritic-sensitive" or not.

For accurate unfiltered diacritic-sensitive search, you must enable the fast diacritic sensitive searches index. Allow reindexing and monitor the database status until it is complete.
The docs have more on text indexing options and unfiltered search.

Related

Parse Days in Status field from Jira Cloud for Google Sheets

I am using Jira Cloud for Sheets Adds on in order to get Days in Status field from Jira, it seems to have the following syntax, from this post
<STATUS_ID>_*:*_<NUMBER_OF_TIMES_ISSUE_WAS_IN_THIS_STATUS>_*:*_<SECONDS>_*|
Here is an example:
10060_*:*_1_*:*_1121033406_*|*_3_*:*_1_*:*_7409_*|*_10000_*:*_1_*:*_270003163_*|*_10088_*:*_1_*:*_2595005_*|*_10087_*:*_1_*:*_1126144_*|*_10001_*:*_1_*:*_0
I am trying to extract for example how many times the issue was In QA status and the duration on a given status. I am dealing with parsing this pattern for obtaining this information and return it using an ARRAYFORMULA. Days in Status field information will be provided only when the issue was completed (is in Done status), otherwise, no information will be provided. if the issue is in Done status, but it didn't transition for a given status, this information will not be provided in the Days in Status string.
I am trying to use REGEXEXTRACT function to match a pattern for example:
=REGEXEXTRACT(C2, "(10060)_\*:\*_\d+_\*:\*_\d+_\*|")
and it returns an empty value, where I expect 10068. I brought my attention that when I use REGEXMATCH function it returns TRUE:
=REGEXMATCH(C2, "(10060)_\*:\*_\d+_\*:\*_\d+_\*|")
so the syntax is not clear. Google refers as a reference for Regular Expression to the following documentation. It seems to be an issue with the vertical bar |, per this documentation it is a special character that should be represented like this \v, but this doesn't work. The REGEXMATCH returns FALSE. I am trying to use some online RegEx tester, that implements Google Sheets syntax (RE2), I found ReGo, that I don't know if it is a valid one.
I was trying to use SPLITfunction like this:
=query(SPLIT(C2, "_*:*_"), "SELECT Col1")
but it seems to be a more complicated approach for getting all the values I need from Days in Status field string, but it separates well all the values from the previous pattern. In this case, I am getting the first Status ID. The number of columns returned by SPLITwill varies because it depends on the number of statuses the issues transitioned in order to get to DONE status.
It seems to be a complex task given all the issues I have encounter, but maybe some of you were dealing with this before and may advise about some ideas. It requires properly parsing the information and then extracting the information on specific columns using ARRAYFORMULA function when it applies for a given status from Status column.
Here is a google spreadsheet sample with the input information. I would like to populate the information for the following columns for Times In QA (C column) and Duration in QA (D column, the information is provided in seconds I would need in days but this is a minor task) for In QA status, then the same would apply for the rest of the other statuses. I added the tab Settings for mapping the Status ID to my Status, I would need to use a lookup function for matching the Status column in the Jira Issues tab. I would like to have a solution, without adding helper columns maybe it will require some script.
https://docs.google.com/spreadsheets/d/1ys6oiel1aJkQR9nfxWJsmEyd7XiNkVB-omcNL0ohckY/edit?usp=sharing
try:
=INDEX(IFERROR(1/(1/QUERY(1*IFNA(REGEXEXTRACT(C2:C, "10087.{5}(\d+).{5}(\d+)")),
"select Col1,Col2/86400 label Col2/86400''"))))
...so after we do REGEXEXTRACT some rows (which cannot be extracted from) will output as #N/A error so we wrap it into IFNA to remove those errors. then we multiply it by *1 to convert everything into numeric numbers (regex works & outputs always only plain text format). then we use QUERY to convert 2nd column into proper seconds in one go. at this point every row has some value so to get rid of zeros for rows we don't need (like row 2,3,5,8,9,etc) and keep the output numeric, we use IFERROR(1/(1/ wrapping. and finally, we use INDEX or ARRAYFORMULA to process our array.

Kibana search with regular expression not working

I am trying to find some logs in Kibana by using Regular Expressions. I am aware that Kibana doesn't support the "classical" RegEx, but rather Lucene Query Syntax. I have read through the documentation of it (https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-regexp-query.html#regexp-syntax) and imo my queries should work, but they don't.
Here is an example log entry that I want to target with my query:
Timings are: sync started at 2019-02-12 19:15:09.402; accounts
downloaded:+760ms/760ms; accounts data downloaded:+1221ms/1981ms;
categorization pushed:+0ms/1981ms; categorization
started:+131ms/2112ms; categorization completed:+123ms/2235ms; in
total:2235ms.
What I want to find in the end is all such log entries where the time of "categorization started" exceeds a certain threshold. However my queries fail already while just trying to approach the final query.
I get results when I query:
message:"/categorization started/"
But already when i modify it to:
message:/categorization started/
i get nothing. Any of the following attemps also give nothing:
message:/categorization\sstarted/
message:/.*categorization\sstarted.*/
message:/.*categorization.*started.*/
At this point I'm already lost - why do all these queries not match anything?
In my mind, the final query that should get what I want should be as follows (finding all entries where categorization started time was 10,000ms or more):
message:/.*categorization started:\+<10000-99999>ms.*/
It goes without saying that this of course also returns nothing, which doesn't surprise me when the above queries already fail.
Can anyone explain to me what I am doing wrong?
Thank you
I suggest you to use
message:*categorization started*

How do I filter procmon results on time-of-day?

Filtering procmon results on time-of-day does not work as one would expect. Suppose the results show a line with time-of-day "7:44:26.4065994 AM".
If you filter on 'Time of Day' begins with '7:44:26', all results
are filtered out.
If you filter on 'Time of Day' contains '7:44:26', you get the expected results.
If you try to filter more precisely, specifying contains '7:44:26.40', all the results are filtered out.
Unlike the other procmon fields, which are uniformly treated as strings, the time-of-day field is not, apparently. There may be some way to filter precisely, but it's not obvious.
You can select a range of lines from the procmon listing and then, by right clicking, you can exclude lines before or after that range. This exclusion is based on date and time.
This has the desired effect. In addition you can look at the filter and see how this effect is achieved.

CloudSearch fuzzy matching of whole string doesn't work

I have set up an Amazon CloudSearch domain with records that hold addresses. I want to do a fuzzy text search on an address field.
Say I have a record with the following address:
1600 Amphitheatre Parkway, Mountain View, CA 94043.
If I search for 'Amphitheatre Parkway, Muntain View'~5 I get no results. I basically deleted the 'o' in "Mountain" and it doesn't find any results.
If I search for Muntain~5 it finds it, but again if I search for Miunntain~5 it doesn't find anything.
I should add I created a free text Analysis Scheme, with no stemming, stopwords or synonyms. This is what is used for the address field which is of type text.
How should I set up CloudSearch to be able to do these sort of queries?
Querying 'Amphitheatre Parkway, Muntain View'~5 is actually performing a fuzzy/sloppy phrase search, where it's searching for those words within 5 words of one another. I don't think that's what you intended.
The Miunntain~5 query is really interesting: it does indeed return no results, but miunntain~5 (lowercase m) does:
I did notice that switching between lower and uppercase in my queries does slightly affect the match scores, so perhaps the capital M just makes it too weak a match. I don't have a good explaination for that; it's certainly counterintuitive so maybe it is a bug.
Finally your actual question about setting up CloudSearch to handle those queries: unfortunately CloudSearch doesn't expose the "Did you mean..." spellcheck feature from Solr so there isn't really a good way to do this; slapping some tildas on things is about the best you can do.
See http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html

How to handle searches for very common keywords

I want to be able to return useful records if a user searches for a keyword that is very, very common in a solr index. For example education.
In this case, close to 99% of the records would have that word in it. So searches for this word or similar take a long time.
This is for solr on ColdFusion but I'm open to solutions which are isolated to just solr.
Right now I'm thinking of coming up with a list of stopwords and preventing those searches from taking place altogether.
If searches are taking a long time, it could be because you are not limiting the number of results that are returned. The <cfsearch> tag has a maxrows attribute, as well as a startrow attribute, that you could use to limit or paginate the data. Alternately, you could call Solr's web service directly through a <cfhttp> call:
<cfhttp url="http://localhost:8983/solr/<collection_name>/select/?q=<searchterm>&fl=*,score&rows=100&wt=json" />
Solr will return 10 rows by default; you can change this with the rows parameter. You can use the start parameter as well (note that Solr starts counting with 0 instead of 1). I believe this solution is more flexible, especially if you're using CF 9, as it allows you to paginate while sorting on a field other than score.
You can find more detail here:
http://www.thefaberfamily.org/search-smith/coldfusion-solr-tutorial/
If the user searches on just one term that is exceedingly common then you need to limit your results and advise the user that there were too many matches.
In the more general case, you want to perform a two-pass (at least) approach. Take your search terms and perform a lookup to determine their 'common-ness'. You want to filter based on least common terms first, and more common terms last.
For example, user searches serendipitous education. You identify that you have 11 matches for serendipitous, and 900000 matches for education. Thus you apply the serendipitous filter first, resulting in 11 matches. Then apply the education filter, resulting in 7 matches.
The key to fast searching is indexing and precomputed statistics. If you have statistics like this on hand you can dynamic create an optimised approach.