search for specific characters within column and then create different columns from it

I have a param_Value column that holds different kinds of values. I need to extract these values and create a separate column for each kind.
| PARAM_NAME | param_Value |
|------------|-------------|
| Step 4     | SP:0.09     |
| Procedure  | MAX:125     |
| Step 4     | SP:Ambient  |
| (null)     | +/-:N/A     |
| Steam      | SP:2        |
| Step 3     | MIN:0       |
| Step 4     | RDPHN427B   |
| Testing De | N/A         |
I only want columns for values with the following prefixes, and I want to give them these names:
SP: SET_POINT_VALUE,
MAX: MAX_LIMIT,
MIN: MIN_LIMIT,
+/-: UPPER_LOWER_LIMIT
So what I have so far is:
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME,
REGEXP_LIKE("param_Value", 'SP:') SET_POINT_VALUE,
REGEXP_LIKE("param_Value", '+/-:') UPPER_LOWER_LIMIT,
REGEXP_LIKE("param_Value", 'MAX:') MAX_VALUE,
REGEXP_LIKE("param_Value", 'MIN:') MIN_VALUE
FROM PROCESS_STEPS
;

I'm more familiar with TSQL and MySQL, but this ought to do what I think you're looking for. If it doesn't exactly, it should at least point you in the right direction.
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
    ("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME
     , CASE WHEN "param_Value" LIKE 'SP:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END SET_POINT_VALUE
     , CASE WHEN "param_Value" LIKE '+/-:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END UPPER_LOWER_LIMIT
     , CASE WHEN "param_Value" LIKE 'MAX:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END MAX_VALUE
     , CASE WHEN "param_Value" LIKE 'MIN:%'
            THEN SUBSTR("param_Value", INSTR("param_Value", ':') + 1)
            ELSE NULL
       END MIN_VALUE
FROM PROCESS_STEPS
;
The basic concept here is to identify the information you want via LIKE, then use SUBSTR and INSTR to extract it. While LIKE is normally something to stay away from, there's no leading % in your patterns, so it's sargable and thus probably not a total efficiency sink.
Really, though, I have to ask you to question why you're laying out your data like this - substring operations are slow in any language, and a DB is no exception. Why not use another column for your limit type? Why not lay it out in the view you're currently looking at?
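For readers who want to check the prefix logic outside the database, here is a minimal Python sketch of the same idea: test the prefix (the LIKE 'SP:%' step), then keep everything after it (the SUBSTR/INSTR step). The function name split_param_value and the whitespace stripping are assumptions for illustration; the SQL above does not trim the value.

```python
from typing import Dict, Optional

# Prefix -> output column, following the column list used in the view above.
PREFIX_TO_COLUMN = {
    "SP:": "SET_POINT_VALUE",
    "+/-:": "UPPER_LOWER_LIMIT",
    "MAX:": "MAX_VALUE",
    "MIN:": "MIN_VALUE",
}


def split_param_value(param_value: Optional[str]) -> Dict[str, Optional[str]]:
    """Return one output row: at most one column is filled, the rest stay None."""
    row: Dict[str, Optional[str]] = {col: None for col in PREFIX_TO_COLUMN.values()}
    if param_value is None:
        return row
    value = param_value.strip()  # assumption: ignore padding around the value
    for prefix, column in PREFIX_TO_COLUMN.items():
        if value.startswith(prefix):           # same test as LIKE 'SP:%'
            row[column] = value[len(prefix):]  # text after the prefix's colon
            break
    return row


if __name__ == "__main__":
    for sample in ["SP:0.09", "MAX:125", "+/-:N/A", "RDPHN427B", "N/A"]:
        print(sample, split_param_value(sample))
```

Values without a recognised prefix, such as RDPHN427B or a bare N/A, leave all four columns empty, which matches the ELSE NULL branches.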

nested if loop in splunk

I would like to write in splunk a nested if loop:
What I want to achieve
if buyer_from_France:
    do eval percentage_fruits
    if percentage_fruits > 10:
        do summation
        if summation > 20:
            total_price
            if total_price > $50:
                do (trigger bonus coupon)
My current code (that works):
> | eventstats sum(buyers_fruits) AS total_buyers_fruits by location
> | stats sum(fruits) as buyers_fruits by location buyers
> | eval percentage_fruits=fruits_bought/fruits_sold
> | table fruits_bought fruits_sold buyers
> | where percentage_fruits > 10
> | sort - percentage_fruits
How do I complete the syntax/expression for the 2nd (summation) and consequently, 3rd (total price), 4th if-loop (trigger)?

SPL doesn't do "loops". A close [enough] analog is that each line in SPL is similar to a single command in bash (hence the pipe separator between commands). In other words, SPL is purely linear in processing. Use a multi-condition eval..if like this:
index=ndx sourcetype=srctp
| eval myfield=if(match(fieldA,"someval") AND NOT match(fieldC,"notthis"),"all true","else val")
Or like this:
| eval myfield=if(match(fieldA,"someval"),if(match(fieldB,"otherval"),"matched A&B","matched A only"),if(NOT match(fieldC,"notthis"),"not A & not C","else val"))
If you can explain your use case/end goal better, we can probably provide better direction.
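As a rough analog outside Splunk, a nested if() is just a conditional expression whose branches are themselves conditional expressions. The Python sketch below mirrors that shape. The function name classify, the label "matched A only", and the use of re.search to stand in for match() are assumptions made for illustration.

```python
import re


def classify(field_a: str, field_b: str, field_c: str) -> str:
    """Nested conditional expression: every branch supplies a true and a false value."""
    return (
        # Branch taken when fieldA matches: decide on fieldB next.
        ("matched A&B" if re.search("otherval", field_b) else "matched A only")
        if re.search("someval", field_a)
        # Branch taken when fieldA does not match: decide on fieldC instead.
        else ("not A & not C" if not re.search("notthis", field_c) else "else val")
    )


if __name__ == "__main__":
    print(classify("someval", "otherval", "x"))
    print(classify("zzz", "x", "notthis"))
```

Each row is evaluated once, top to bottom, which is the "purely linear" processing described above: there is no looping, only one expression that picks a value per event.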

Insert cell's logic into another cell's logic in Google Sheets

I have a column in Google Sheets where each cell contains pre-defined logic. For example, something like the second column in this table:
| 1 | =A1*-1 |
| 2 | =B2*-1 |
| -3 | =C2*-1 |
Let's say later I want to add the same logic to each cell in column B. For example, make it such that it looks like:
| 1 | =MAX(A1*-1,0) |
| 2 | =MAX(B2*-1,0) |
| -3 | =MAX(C2*-1,0) |
What is the fastest way to do this, besides manually typing MAX(...,0) in each cell? Normal Sheets functions act on the value of the cell, not the logic, so I'm a bit lost.
To my knowledge there isn't a function that pipes in the logic from one cell to another ...

try:
=ARRAYFORMULA(IF(A1:A="",,IF(SIGN(A1:A)<0, A1:A*-1, 0)))
=ARRAYFORMULA(IF(A1:A="",,IF(SIGN(A1:A)>0, A1:A, 0)))

How do I find a change point in a time series in Power BI

I have a group of people who started receiving a specific type of social benefit called BenefitA. I am interested in knowing which social benefits (if any) the people in the group received immediately before they started receiving BenefitA.
My optimal result would be a table with the number of people who were receiving BenefitB, BenefitC, or no benefit at all ("BenefitNon") immediately before they started receiving BenefitA.
My data is organized as a relational database with a fact table containing an ID for each person in my data and several dimension tables connected to the fact table. The important ones here are DimDreamYdelse (showing the type of benefit received) and DimDreamTid (showing week and year). Here is an example of the raw data.
Data Example
I'm not sure how to approach this in Power BI, as I am fairly new to the program. Any advice is most welcome.
I have tried to solve the problem in SQL, but as I need this as part of a running report I need to do it in Power BI. This bit of code might, however, give some context for what I want to do.
USE FLISDATA_Beskaeftigelse;
SELECT dbo.FactDream.DimDreamTid, dbo.FactDream.DimDreamBenefit, dbo.DimDreamTid.Aar, dbo.DimDreamTid.UgeIAar, dbo.DimDreamYdelse.Ydelse
FROM dbo.FactDream
INNER JOIN dbo.DimDreamTid ON dbo.FactDream.DimDreamTid = dbo.DimDreamTid.DimDreamTidID
INNER JOIN dbo.DimDreamYdelse ON dbo.FactDream.DimDreamBenefit = dbo.DimDreamYdelse.DimDreamBenefitID
WHERE (dbo.DimDreamYdelse.Ydelse LIKE 'Benefit%') AND (dbo.DimDreamTid.Aar = '2019')
ORDER BY dbo.DimDreamTid.Aar, dbo.DimDreamTid.UgeIAar

I suggest using Power Query to transform your table into a form more suitable for your analysis. Things would be much easier if each row of the table represented a change of benefit, like this:
| Person ID | Benefit From | Benefit To | Date |
|-----------|--------------|------------|------------|
| 15 | BenefitNon | BenefitA | 2019-07-01 |
| 15 | BenefitA | BenefitNon | 2019-12-01 |
| 17 | BenefitC | BenefitA | 2019-06-01 |
| 17 | BenefitA | BenefitB | 2019-08-01 |
| 17 | BenefitB | BenefitA | 2019-09-01 |
| ...
Then you can simply count the numbers by COUNTROWS(BenefitChanges) filtering/slicing with both Benefit From and Benefit To.
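To make the reshaping step concrete, here is a small Python sketch of the transformation itself (not Power Query code): it turns one record per person per week into one row per change, then counts what preceded BenefitA. The record layout (person_id, week, benefit) and the assumption that weeks without a benefit are stored explicitly as "BenefitNon" are illustrative, not taken from the actual data model.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Record = Tuple[int, int, str]       # (person_id, week, benefit), hypothetical layout
Change = Tuple[int, str, str, int]  # (person_id, benefit_from, benefit_to, week)


def benefit_changes(records: Iterable[Record]) -> List[Change]:
    """Emit one row each time a person's benefit differs from their previous record."""
    changes: List[Change] = []
    previous = {}  # person_id -> benefit in that person's previous record
    for person_id, week, benefit in sorted(records, key=lambda r: (r[0], r[1])):
        before = previous.get(person_id)
        if before is not None and before != benefit:
            changes.append((person_id, before, benefit, week))
        previous[person_id] = benefit
    return changes


def count_benefit_before(changes: Iterable[Change], target: str = "BenefitA") -> Counter:
    """Count the 'Benefit From' values over all changes that end in the target benefit."""
    return Counter(before for _, before, after, _ in changes if after == target)


if __name__ == "__main__":
    sample = [
        (15, 1, "BenefitNon"), (15, 2, "BenefitA"), (15, 3, "BenefitA"),
        (17, 1, "BenefitC"), (17, 2, "BenefitA"), (17, 3, "BenefitB"), (17, 4, "BenefitA"),
    ]
    print(count_benefit_before(benefit_changes(sample)))
```

The final count plays the role of COUNTROWS(BenefitChanges) filtered on Benefit To = BenefitA and sliced by Benefit From.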

Equivalent of SQL LIKE operator in R

In an R script, I have a function that creates a data frame of files in a directory that have a specific extension.
The data frame always has two columns and as many rows as there are files with that extension.
The data frame ends up looking something like this:
| Path | Filename |
|:------------------------:|:-----------:|
| C:/Path/to/the/file1.ext | file1.ext |
| C:/Path/to/the/file2.ext | file2.ext |
| C:/Path/to/the/file3.ext | file3.ext |
| C:/Path/to/the/file4.ext | file4.ext |
Forgive the archaic way I express this question. I know that in SQL you can use like instead of = in a where clause, so I could say where Filename like '%1%' and it would pull out all files with a 1 in the name. Is there a way to use something like this to set a variable in R?
I have a couple of different scripts that need to use the Filename pulled from this dataframe. The only reliable way I can think to tell the script which one to pull from is to set a variable like this.
Ultimately I would like these two (pseudo)expressions to yield the same thing.
x <- file1.ext
and
x like '%1%'
should both give x = file1.ext

You can use grepl(), as in this answer:
subset(a, grepl("1", Filename))
Or, if you're coming from an SQL background, you might want to look into sqldf.

You can use %like% from data.table to get SQL LIKE-style behaviour here. Note that %like% takes a regular expression rather than SQL wildcards.
From the documentation, see this example:
library(data.table)
DT = data.table(Name=c("Mary","George","Martha"), Salary=c(2,3,4))
DT[Name %like% "^Mar"]
for your problem suppose you have a data.frame df like this
path filename
1: C:/Path/to/the/file1.ext file1.ext
2: C:/Path/to/the/file2.ext file2.ext
3: C:/Path/to/the/file3.ext file3.ext
4: C:/Path/to/the/file4.ext file4.ext
do
library(data.table)
DT<-as.data.table(df)
DT[filename %like% "1"]
should give
path filename
1: C:/Path/to/the/file1.ext file1.ext

The best way to generate path pattern for materialized path tree structures

Browsing through examples all over the web, I can see that people generate the path using something like "parent_id.node_id". Example:
uid | name | tree_id
--------------------
1 | Ali | 1.
2 | Abu | 2.
3 | Ita | 1.3.
4 | Ira | 1.3.
5 | Yui | 1.3.4
But as explained in this question (Sorting tree with a materialized path?), zero-padding the tree_id makes it easy to sort by creation order.
uid | name | tree_id
--------------------
1 | Ali | 0001.
2 | Abu | 0002.
3 | Ita | 0001.0003.
4 | Ira | 0001.0003.
5 | Yui | 0001.0003.0004
Using a fixed-length string like this also makes it easy for me to calculate the level: length(tree_id)/5. What worries me is that it would limit me to a maximum of 9999 users overall rather than 9999 per branch. Am I right here?
9999 | Tar | 0001.9999
10000 | Tor | 0001.??

You are correct -- zero-padding each node ID would allow you to sort the entire tree quite simply. However, you have to make the padding width match the upper limit of digits of the ID field, as you have pointed out in your last example. E.g., if you're using an int unsigned field for your ID, the highest value would be 4,294,967,295. This is ten digits, meaning that the record set from your last example might look like:
uid | name | tree_id
9999 | Tar | 0000000001.0000009999
10000 | Tor | 0000000001.0000010000
As long as you know you're not going to need to change your ID field to bigint unsigned in the future, this will continue to work, though it might be a bit data-hungry depending on how huge your tables get. You could shave off two bytes per node ID by storing the values in hexadecimal, which would still be sorted correctly in a string sort:
uid | name | tree_id
9999 | Tar | 00000001.0000270F
10000 | Tor | 00000001.00002710
I can imagine this would make things a real headache when trying to update the paths (pruning nodes, etc) though.
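The padding arithmetic is easy to check with a short Python sketch. It builds paths with a fixed width per node ID (10 decimal digits or 8 hexadecimal digits, as in the examples above) and shows that a plain string sort then agrees with numeric order. The helper name padded_path is hypothetical.

```python
from typing import Sequence


def padded_path(node_ids: Sequence[int], width: int = 10, hexadecimal: bool = False) -> str:
    """Join node IDs into a materialized path, zero-padding each ID to a fixed width."""
    spec = f"0{width}X" if hexadecimal else f"0{width}d"
    return ".".join(format(node_id, spec) for node_id in node_ids)


if __name__ == "__main__":
    # Unpadded paths sort incorrectly as strings: "1.10000" comes before "1.9999".
    print(sorted(["1.9999", "1.10000"]))

    # Decimal, 10 digits: wide enough for the largest 32-bit unsigned value.
    print(sorted([padded_path([1, 10000]), padded_path([1, 9999])]))

    # Hexadecimal, 8 digits: two characters shorter per node ID, same ordering.
    print(sorted([padded_path([1, 10000], 8, True), padded_path([1, 9999], 8, True)]))
```

The width has to cover the largest ID the column can ever hold; an ID that needs more digits than the chosen width would no longer be padded and would break the ordering.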
You can also create extra fields for sorting, e.g.:
uid | name | tree_id | name_sort
9999 | Tar | 00000001.0000270F | Ali.Tar
10000 | Tor | 00000001.00002710 | Ali.Tor
There are limitations, however, as laid out by this guy's answer to a similar materialized path sorting question. The name field would have to be padded to a set length (fortunately, in your example, each name seems to be three characters long), and it would take up a lot of space.
In conclusion, given the above issues, I've found that the most versatile way to do sorting like this is to simply do it in your application logic -- say, using a recursive function that builds a nested array, sorting the children of each node as it goes.
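A minimal sketch of that application-side approach in Python, assuming each row carries a parent_id (a hypothetical field here; it could equally be derived from the second-to-last segment of the path). The function name build_sorted_tree and the row layout are illustrative only.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def build_sorted_tree(
    rows: List[Row],
    parent_id: Optional[int] = None,
    key: Callable[[Row], object] = lambda row: row["name"],
) -> List[Dict[str, object]]:
    """Recursively build a nested list, sorting the children of each node by `key`."""
    children = sorted((r for r in rows if r["parent_id"] == parent_id), key=key)
    return [
        {
            "uid": child["uid"],
            "name": child["name"],
            "children": build_sorted_tree(rows, child["uid"], key),
        }
        for child in children
    ]


if __name__ == "__main__":
    rows = [
        {"uid": 1, "name": "Ali", "parent_id": None},
        {"uid": 2, "name": "Abu", "parent_id": None},
        {"uid": 3, "name": "Ita", "parent_id": 1},
        {"uid": 4, "name": "Ira", "parent_id": 1},
        {"uid": 5, "name": "Yui", "parent_id": 3},
    ]
    for node in build_sorted_tree(rows):
        print(node)
```

Because the sort key is a parameter, the same function can order siblings by name, by uid (creation order), or by anything else, without padding or extra sort columns in the table. This version rescans the row list for every node, so for large trees it would be worth grouping rows by parent first.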
