Google Data Studio RegExp_Match not matching textPayload dimension in log payload - regex

I'm new to Google Data Studio, and currently stuck on the following problem.
I have some log data in BigQuery and I am trying to visualize some info out of my logs using Google Data Studio.
The problem is when I use REGEXP_MATCH on one specific dimension of my data, it cannot match the RegEx. When I use REGEXP_MATCH against any other dimension of my log data, it works without a problem.
I am wondering, whether that could be due to the long strings that I have in that specific dimension, or are there any other thoughts?
By the way, I am able to make changes to that dimension using REGEXP_REPLACE, but cannot even do REGEXP_MATCH against the text that REGEXP_REPLACE replaced in there.
I have been working around this for days now and any advise will be really appreciated.

I finally found out that REGEXP_EXTRACT works against that dimension. Then it's also possible to do REGEXP_MATCH on the output of REGEXP_EXTRACT with no problem.
I still don't know why REGEXP_MATCH does not work directly against that specific dimension, maybe due to long strings or some specific letters that I have in my data in that dimension or some other reason.

Related

How I extract number from text in Google Data Studio?

I tried this but it's still showing null as you can see in the screen. Please help to solve this.
REGEXP_EXTRACT(X,"[0-9]*[0-9]")
It can be achieved using the Calculated Field below which uses the REGEXP_EXTRACT function to extract the numbers and the CAST function to ensure that it's a Number field:
CAST(REGEXP_EXTRACT(Attività, "(\\d+)") AS NUMBER )
Google Data Studio Report and GIF to elaborate:

Google Sheets Data Validation not rejecting invalid input

I have a sheet where I control provided services with columns filled with execution and conclusion dates.
These columns have data validation for invalid dates and also, for the user not to input weekend days or holidays (which is listed on another page of the same spreadsheet). So it has to be custom formula validation.
Validation formula:
=AND(ISDATE(K2)=TRUE;K2>=J2;WEEKDAY(K2)<>1;WEEKDAY(K2)<>7;COUNTIF(Holidays!$A:$A;"="&K2)=0)
also tried
=AND(ISDATE(K2)=TRUE;K2>=J2;WEEKDAY(K2)<>1;WEEKDAY(K2)<>7;ISNA(MATCH(K2;Holidays!$A:$A;0))=TRUE)
and also tried using INDIRECT("Holidays!$A:$A") on both options
***Column K has the data validation and Conclusion date is the input. Column J has execution dates. And row 1 has titles.
The problem:
data validation input rejection seems to work fine for the first couple of hours, sometimes a full day, but after this random period of time, it stops working. Actually it does work, but with the red flag, even though "Reject input" option is still checked.
My guess is that the problem resides on the reference being in another sheet, but I don't see any other way to do this, as including the holiday list to the main sheet would pollute it and hiding columns wouldn't be as practical since users update the list constantly.
Is there a way to make it work?
P.S. Conditional Formatting used to return error even when using INDIRECT for external reference but now Google seems to have fixed it.
Hope someone can help me.
custom formula for data validation:
=(ISDATE(A1))*
(WEEKDAY(A1, 2)<>6)*
(WEEKDAY(A1, 2)<>7)*
(NOT(REGEXMATCH(TO_TEXT(A1), TEXTJOIN("|", 1, INDIRECT("Sheet2!H:H")))))
custom formula for conditional formatting (valid green):
=(ISDATE(A1))*
(WEEKDAY(A1, 2)<>6)*
(WEEKDAY(A1, 2)<>7)*
(NOT(REGEXMATCH(TO_TEXT(A1), TEXTJOIN("|", 1, INDIRECT("Sheet2!H:H")))))
custom formula for conditional formatting (invalid red):
=((ISDATE(A1))*
(WEEKDAY(A1, 2)<>6)*
(WEEKDAY(A1, 2)<>7)*
(NOT(REGEXMATCH(TO_TEXT(A1), TEXTJOIN("|", 1, INDIRECT("Sheet2!H:H")))))=0)*
(A1<>"")
spreadsheet demo
I find the issue happens when the user is copying and pasting into the cell, instead of typing in, as part of a larger section of information. This breaks up the data validation into "pieces" because the copy and paste doesn't come with the data validation. I'm not sure if this is the case with you, but it may mean training users to copy and paste values only or only hard keying the information.

ClientError: Unable to parse csv: rows 1-1000, file

I've looked at the other answers to this issue and none of them are helping me. I am trying to run a simple random cut forest algorithm. I have a small data set of IPs which have been stripped down to only have numbers. I still get this error. It only has one column of these numbers. The CSV looks like this:
176162144
176862141
176762141
176761141
176562141
Have you looked at this sample notebook, and tried using it with your own data?
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb
In a nutshell, it reads the CSV file with Pandas and trains the model like this:
rcf = RandomCutForest(role=execution_role,
train_instance_count=1,
train_instance_type='ml.m4.xlarge',
data_location='s3://{}/{}/'.format(bucket, prefix),
output_path='s3://{}/{}/output'.format(bucket, prefix),
num_samples_per_tree=512,
num_trees=50)
# automatically upload the training data to S3 and run the training job
rcf.fit(rcf.record_set(taxi_data.value.as_matrix().reshape(-1,1)))
You didn't say what your use case was, but as you're working with IP addresses, you may find the IP Insights built-in algorithm useful too: https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html
I was utilizing the sample notebook Julien Simon mentioned earlier, but at some point the data was ending up as strings! The funny thing about RCF algorithms is they have to run on integer data.
What I did is I made sure to cast the array as an int array as a double check and vallah! It worked. I am at loss over how the data ended up in a string format but alas, that was the issue. Simple solution.

Kibana search with regular expression not working

I am trying to find some logs in Kibana by using Regular Expressions. I am aware that Kibana doesn't support the "classical" RegEx, but rather Lucene Query Syntax. I have read through the documentation of it (https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-regexp-query.html#regexp-syntax) and imo my queries should work, but they don't.
Here is an example log entry that I want to target with my query:
Timings are: sync started at 2019-02-12 19:15:09.402; accounts
downloaded:+760ms/760ms; accounts data downloaded:+1221ms/1981ms;
categorization pushed:+0ms/1981ms; categorization
started:+131ms/2112ms; categorization completed:+123ms/2235ms; in
total:2235ms.
What I want to find in the end is all such log entries where the time of "categorization started" exceeds a certain threshold. However my queries fail already while just trying to approach the final query.
I get results when I query:
message:"/categorization started/"
But already when i modify it to:
message:/categorization started/
i get nothing. Any of the following attemps also give nothing:
message:/categorization\sstarted/
message:/.*categorization\sstarted.*/
message:/.*categorization.*started.*/
At this point I'm already lost - why do all these queries not match anything?
In my mind, the final query that should get what I want should be as follows (finding all entries where categorization started time was 10,000ms or more):
message:/.*categorization started:\+<10000-99999>ms.*/
It goes without saying that this of course also returns nothing, which doesn't surprise me when the above queries already fail.
Can anyone explain to me what I am doing wrong?
Thank you
I suggest you to use
message:*categorization started*

Comparing two documents

I have two very large lists. They both were originally in excel, but the larger one is a list of emails (about 160,000) of them with other information like their name and address etc. And the smaller one is a list of just 18,000 emails.
My question is what would be the easiest way to get rid of all 18,000 rows from the first document that contain the email addresses from the second?
I was thinking regex or maybe there is another application I can use? I have tried searching online but it seems like there isn't much specific to this. I also tried notepad++ but it freezes when I try to compare these large files.
-Thank You in Advance!!
Good question. One way I would tackle this is making a C++ program [you could extrapolate the idea to the language of your choice; You never mentioned which languages you were proficient in] that read each item of the smaller file into a vector of strings. First, of course, use Excel to save the files as CSV instead of XLS or XLSX, which will comma-separate the values so you can work with them easier. For the larger list, "Save As" a copy of just email addresses, deleting the other rows for now.
Then, you could open the larger list and use a nested loop to check if you should output to an output file. Something like:
bool foundMatch=false;
for(int y=0;y<LargeListVector.size();y++) {
for(int x=0;x<SmallListVector.size();x++) {
if(SmallListVector[x]==LargeListVector[y]) foundMatch=true;
}
if(!foundMatch) OutputVector.append(LargeListVector[y]);
foundMatch=false;
}
That might be partially pseudo-code, but do you get the idea?
So I read a forum post at : Here
=MATCH(B1,$A$1:$A$3,0)>0
Column B would be the large list, with the 160,000 inputs and column A was my list of things I needed to delete of 18,000.
I used this to match everything, and in a separate column pasted this formula. It would print out either an error or TRUE. If the data was in both columns it printed out true.
Then because I suck with excel, I threw this text into Notepad++ and searched for all lines that contained TRUE (match case, because in my case some of the data had the word true in it without caps.) I marked those lines, then under search, bookmarks, I removed all lines with bookmarks. Pasted that back into excel and voila.
I would like to thank you guys for helping and pointing me in the right direction :)