Kibana search with regular expression not working - regex

I am trying to find some logs in Kibana by using Regular Expressions. I am aware that Kibana doesn't support the "classical" RegEx, but rather Lucene Query Syntax. I have read through the documentation of it (https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-regexp-query.html#regexp-syntax) and imo my queries should work, but they don't.
Here is an example log entry that I want to target with my query:
Timings are: sync started at 2019-02-12 19:15:09.402; accounts
downloaded:+760ms/760ms; accounts data downloaded:+1221ms/1981ms;
categorization pushed:+0ms/1981ms; categorization
started:+131ms/2112ms; categorization completed:+123ms/2235ms; in
total:2235ms.
What I want to find in the end is all such log entries where the time of "categorization started" exceeds a certain threshold. However my queries fail already while just trying to approach the final query.
I get results when I query:
message:"/categorization started/"
But already when i modify it to:
message:/categorization started/
i get nothing. Any of the following attemps also give nothing:
message:/categorization\sstarted/
message:/.*categorization\sstarted.*/
message:/.*categorization.*started.*/
At this point I'm already lost - why do all these queries not match anything?
In my mind, the final query that should get what I want should be as follows (finding all entries where categorization started time was 10,000ms or more):
message:/.*categorization started:\+<10000-99999>ms.*/
It goes without saying that this of course also returns nothing, which doesn't surprise me when the above queries already fail.
Can anyone explain to me what I am doing wrong?
Thank you

I suggest you to use
message:*categorization started*

Related

Parse Days in Status field from Jira Cloud for Google Sheets

I am using Jira Cloud for Sheets Adds on in order to get Days in Status field from Jira, it seems to have the following syntax, from this post
<STATUS_ID>_*:*_<NUMBER_OF_TIMES_ISSUE_WAS_IN_THIS_STATUS>_*:*_<SECONDS>_*|
Here is an example:
10060_*:*_1_*:*_1121033406_*|*_3_*:*_1_*:*_7409_*|*_10000_*:*_1_*:*_270003163_*|*_10088_*:*_1_*:*_2595005_*|*_10087_*:*_1_*:*_1126144_*|*_10001_*:*_1_*:*_0
I am trying to extract for example how many times the issue was In QA status and the duration on a given status. I am dealing with parsing this pattern for obtaining this information and return it using an ARRAYFORMULA. Days in Status field information will be provided only when the issue was completed (is in Done status), otherwise, no information will be provided. if the issue is in Done status, but it didn't transition for a given status, this information will not be provided in the Days in Status string.
I am trying to use REGEXEXTRACT function to match a pattern for example:
=REGEXEXTRACT(C2, "(10060)_\*:\*_\d+_\*:\*_\d+_\*|")
and it returns an empty value, where I expect 10068. I brought my attention that when I use REGEXMATCH function it returns TRUE:
=REGEXMATCH(C2, "(10060)_\*:\*_\d+_\*:\*_\d+_\*|")
so the syntax is not clear. Google refers as a reference for Regular Expression to the following documentation. It seems to be an issue with the vertical bar |, per this documentation it is a special character that should be represented like this \v, but this doesn't work. The REGEXMATCH returns FALSE. I am trying to use some online RegEx tester, that implements Google Sheets syntax (RE2), I found ReGo, that I don't know if it is a valid one.
I was trying to use SPLITfunction like this:
=query(SPLIT(C2, "_*:*_"), "SELECT Col1")
but it seems to be a more complicated approach for getting all the values I need from Days in Status field string, but it separates well all the values from the previous pattern. In this case, I am getting the first Status ID. The number of columns returned by SPLITwill varies because it depends on the number of statuses the issues transitioned in order to get to DONE status.
It seems to be a complex task given all the issues I have encounter, but maybe some of you were dealing with this before and may advise about some ideas. It requires properly parsing the information and then extracting the information on specific columns using ARRAYFORMULA function when it applies for a given status from Status column.
Here is a google spreadsheet sample with the input information. I would like to populate the information for the following columns for Times In QA (C column) and Duration in QA (D column, the information is provided in seconds I would need in days but this is a minor task) for In QA status, then the same would apply for the rest of the other statuses. I added the tab Settings for mapping the Status ID to my Status, I would need to use a lookup function for matching the Status column in the Jira Issues tab. I would like to have a solution, without adding helper columns maybe it will require some script.
https://docs.google.com/spreadsheets/d/1ys6oiel1aJkQR9nfxWJsmEyd7XiNkVB-omcNL0ohckY/edit?usp=sharing
try:
=INDEX(IFERROR(1/(1/QUERY(1*IFNA(REGEXEXTRACT(C2:C, "10087.{5}(\d+).{5}(\d+)")),
"select Col1,Col2/86400 label Col2/86400''"))))
...so after we do REGEXEXTRACT some rows (which cannot be extracted from) will output as #N/A error so we wrap it into IFNA to remove those errors. then we multiply it by *1 to convert everything into numeric numbers (regex works & outputs always only plain text format). then we use QUERY to convert 2nd column into proper seconds in one go. at this point every row has some value so to get rid of zeros for rows we don't need (like row 2,3,5,8,9,etc) and keep the output numeric, we use IFERROR(1/(1/ wrapping. and finally, we use INDEX or ARRAYFORMULA to process our array.

Searching a Mongo database using PyMongo, while using regex

I currently have a PyMongo collection with around 100,000 documents. I need to perform a regex search on each of these documents, checking each document against around 1,800 values to see if a particular field (which is an array) contains one of the 1,800 strings. After testing a variety of ways of using regex, such as compiling into a regular expression, multiprocessing and multi-threading, the performance is still abysmal, and takes around 30-45 minutes.
The current regex I'm using to find the value at the end of the string is:
rgx = re.compile(string_To_Be_Compared + '$')
And then this is ran using a standard pymongo find query:
coll.find( { 'field' : rgx } )
I was wondering if anyone had any suggestions for querying these values in a more optimal way? Ideally the search to return all the values should take less than 5 minutes. Would the best course of action to be use something like ElasticSearch or am I missing something basic?
Thanks for you time

CloudSearch fuzzy matching of whole string doesn't work

I have set up an Amazon CloudSearch domain with records that hold addresses. I want to do a fuzzy text search on an address field.
Say I have a record with the following address:
1600 Amphitheatre Parkway, Mountain View, CA 94043.
If I search for 'Amphitheatre Parkway, Muntain View'~5 I get no results. I basically deleted the 'o' in "Mountain" and it doesn't find any results.
If I search for Muntain~5 it finds it, but again if I search for Miunntain~5 it doesn't find anything.
I should add I created a free text Analysis Scheme, with no stemming, stopwords or synonyms. This is what is used for the address field which is of type text.
How should I set up CloudSearch to be able to do these sort of queries?
Querying 'Amphitheatre Parkway, Muntain View'~5 is actually performing a fuzzy/sloppy phrase search, where it's searching for those words within 5 words of one another. I don't think that's what you intended.
The Miunntain~5 query is really interesting: it does indeed return no results, but miunntain~5 (lowercase m) does:
I did notice that switching between lower and uppercase in my queries does slightly affect the match scores, so perhaps the capital M just makes it too weak a match. I don't have a good explaination for that; it's certainly counterintuitive so maybe it is a bug.
Finally your actual question about setting up CloudSearch to handle those queries: unfortunately CloudSearch doesn't expose the "Did you mean..." spellcheck feature from Solr so there isn't really a good way to do this; slapping some tildas on things is about the best you can do.
See http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html

Get goal conversion rate by Time on Page by Page group?

I want to answer this question:
Does the average time on page A (or more accurately page group A) affect the conversion rate of goal B?
So far in the GUI I have:
A) Created an advanced segment of Time on Page >= 120 ("per hit" option):
http://grab.by/tKOA
B) Modified the segment to also add a filter for Page = regex matching my group:
http://grab.by/tKOU
...But I don't know if this gives me the results I'm after; that is, if they are accurate
I have some other ideas, including assigning the page group as a funnel step and then segmenting by the Time on Page; still waiting on data to come in for that one
Wanting to know if there was a better solution or if I'm on the right track
Drewdavid,
Your approach is quite smart and correct, I would say, however keep in mind that in this context, you are mixing different scopes:
Time on Page is page-level metric
Page seen is visit-level dimension
What you would get in your report is the average time on page calculated from all the pages there were seen during visits which met the regex condition set in the filter (that's what segment does, it included all the pages, not just those that you want to filter). I know this can be confusing, but see this article that gives more examples and goes into greater detail.
To achieve what you are after, remove the segment filter and simply use the advanced filtering above the report table (and choose exactly the same regex you mentioned in your question).
Hope this helps!

SimpleDB Incremental Index

I understand SimpleDB doesn't have an auto increment but I am working on a script where I need to query the database by sending the id of the last record I've already pulled and pull all subsequent records. In a normal SQL fashion if there were 6200 records I already have 6100 of them when I run the script I query records with an ID greater than > 6100. Looking at the response object, I don't see anything I can use. It just seems like there should be a sequential index there. The other option I was thinking would be a real time stamp. Any ideas are much appreciated.
Using a timestamp was perfect for what I needed to do. I followed this article to help me on my way:http://aws.amazon.com/articles/1232 I would still welcome if anyone knows if there is a way to get an incremental index number.