How to detect specific subwords in text

I have a column as a string with no spaces:
clear
input str100 var
"ihaveanewspaper"
"watchingthenewsonthetv"
"watchthenewsandreadthenewspaper"
end
I am using the following command:
gen news = regexm(var, "(news)")
This outputs 1 1 1 because all three rows in var contain the word news.
I'm trying to alter the regular expression "(news)" to create two columns, one for news and one for newspaper. regexm(var, "(newspaper)") confirms that a row contains newspaper, but I also need a way to check that the characters following news are not "paper", because I want to count the two separately.
EDIT:
Is there a way to count the third entry as 1, since it contains an occurrence of news that is not part of newspaper?

You can quantify as follows without a regular expression:
clear
input str100 var
"ihaveanewspaper"
"watchingthenewsonthetv"
"watchthenewsandreadthenewspaper"
"fdgdnews"
"fgogodigjhoigjnewspaper"
"fgeogeionnewsfgdgfpaper"
"45pap9358newsfjfgni"
end
generate news = strmatch(var, "*news*") & !strmatch(var, "*newspaper*")
list, separator(0)
     +----------------------------------------+
     |                             var   news |
     |----------------------------------------|
  1. |                 ihaveanewspaper      0 |
  2. |          watchingthenewsonthetv      1 |
  3. | watchthenewsandreadthenewspaper      0 |
  4. |                        fdgdnews      1 |
  5. |         fgogodigjhoigjnewspaper      0 |
  6. |         fgeogeionnewsfgdgfpaper      1 |
  7. |             45pap9358newsfjfgni      1 |
     +----------------------------------------+
count if news
4
count if !news
3
EDIT:
One way to do this is to eliminate all instances of the word newspaper and repeat the process:
generate var2 = subinstr(var, "newspaper", "", .)
replace news = 1 if strmatch(var2, "*news*")
list, separator(0)
     +------------------------------------------------------------------+
     |                             var   news                      var2 |
     |------------------------------------------------------------------|
  1. |                 ihaveanewspaper      0                    ihavea |
  2. |          watchingthenewsonthetv      1    watchingthenewsonthetv |
  3. | watchthenewsandreadthenewspaper      1    watchthenewsandreadthe |
  4. |                        fdgdnews      1                  fdgdnews |
  5. |         fgogodigjhoigjnewspaper      0            fgogodigjhoigj |
  6. |         fgeogeionnewsfgdgfpaper      1   fgeogeionnewsfgdgfpaper |
  7. |             45pap9358newsfjfgni      1       45pap9358newsfjfgni |
     +------------------------------------------------------------------+
count if news
5
count if !news
2
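For readers who want to check the logic outside Stata, the same two-step idea (delete every "newspaper", then look for a remaining "news") can be sketched in Python. This is only an illustration: the function names are hypothetical, and the sample strings are the ones from the listing above. A second variant uses a negative lookahead, which is roughly what the question was asking a regular expression to do.

```python
import re

def has_standalone_news(text: str) -> bool:
    """True if text contains "news" that is not part of "newspaper"."""
    # Step 1: drop every occurrence of "newspaper".
    stripped = text.replace("newspaper", "")
    # Step 2: any "news" still present must be a standalone occurrence.
    return "news" in stripped

def has_standalone_news_regex(text: str) -> bool:
    """Same check with a negative lookahead: "news" not followed by "paper"."""
    return re.search(r"news(?!paper)", text) is not None

samples = [
    "ihaveanewspaper",
    "watchingthenewsonthetv",
    "watchthenewsandreadthenewspaper",
    "fdgdnews",
    "fgogodigjhoigjnewspaper",
    "fgeogeionnewsfgdgfpaper",
    "45pap9358newsfjfgni",
]

flags = [int(has_standalone_news(s)) for s in samples]
regex_flags = [int(has_standalone_news_regex(s)) for s in samples]
print(flags)        # [0, 1, 1, 1, 0, 1, 1]
print(sum(flags))   # 5
```

Both variants agree on these samples. They can differ on contrived input where deleting "newspaper" joins two fragments into a new "news", so the lookahead version is the stricter reading of the question.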

Related

Pandas group_by string column which values contained in a separate list

I have a hierarchy-based event stream in which each parent node (level0, level1) has multiple children (level00/01/02) and sub-children (level000/001/002). "level" is just a placeholder; each hierarchy level has its own unique name. The only rule is that a parent node's hierarchy string is always contained in its children's hierarchy strings. Assume that the event stream has 300k or more entries.
| index | hierarchystr |
| ----- | --------------------- |
| 0 | level0level00level000|
| 1 | level0level01 |
| 2 | level0level02level021|
| 3 | level0level02level021|
| 4 | level0level02level020|
| 5 | level0level02level021|
| 6 | level1level02level021|
| 7 | level1level02level021|
| 8 | level1level02level021|
| 9 | level2level02level021|
Now I want to do an inclusive group_by against a separate list: a row should be counted for a list entry if that entry is contained in the row's hierarchystr value. Expected output (beware: hstrs is in a different order on every run!):
#hstrs = ["level0", "level1", "level0level01", "level0level02", "level0level02level021"]
|index| 0 | Count |
|-----|---------------------|-------|
|0 |level0 | 6 |
|1 |level1 | 3 |
|2 |level0level01 | 1 |
|3 |level0level02 | 4 |
|4 |level0level02level021| 3 |
I tried the following solutions, but all are slow as hell:
import pandas as pd
from collections import Counter

#V1
for hstr in hstrs:
    s = df[df.hierarchystr.str.contains(hstr)]
    s2 = s.count()
    s3 = s2.values[0]
    if s3 > 200:
        beforeset.append(hstr)
#V2
for hstr in hstrs:
    s = df.hierarchystr.str.extract('(' + hstr + ')', expand=True)
    s2 = s.count()
    s3 = s2.values[0]
    if s3 > 200:
        beforeset.append(hstr)
#V3 - fastest, but also slow and not satisfying
containing = [item for hierarchystr in df.hierarchystr for item in hstrs if item in hierarchystr]
containing = Counter(containing)
df1 = pd.DataFrame([containing]).T
nodeNamesWithOver200 = df1[df1 > 200].dropna().index.values
I also tried matching all strings at once with str.extract and str.extractall on a combined pattern, but the size of each group changes between runs because hstrs is in a different order on every run.
df.hierarchystr.str.extractall('(' + '|'.join(hstrs) + ')')
Is there a regex or method that does this in one step, so that it runs in reasonable time on huge data frames and does not depend on the order of the hstrs array?

You can try:
count = [df['hierarchystr'].str.startswith(hstr).sum() for hstr in hstrs]
out = pd.DataFrame({'hstr': hstrs, 'count': count})
print(out)
# Output
                    hstr  count
0                 level0      6
1                 level1      3
2          level0level01      1
3          level0level02      4
4  level0level02level021      3
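The list comprehension above scans the whole column once per entry in hstrs. Because a parent string is always a prefix of its children, a single pass over the rows is also possible: look up each row's prefixes in a set. The sketch below uses only the standard library; the function name count_prefix_matches is hypothetical, and the data is the sample from the question. The result does not depend on the order of hstrs.

```python
from collections import Counter

def count_prefix_matches(rows, hstrs):
    """Count, for each string in hstrs, how many rows start with it."""
    wanted = set(hstrs)
    # Only prefix lengths that actually occur in hstrs need to be tested.
    lengths = sorted({len(h) for h in wanted})
    counts = Counter()
    for row in rows:
        for n in lengths:
            if n > len(row):
                break  # remaining lengths are longer than this row
            prefix = row[:n]
            if prefix in wanted:
                counts[prefix] += 1
    # Report in the caller's order; the counts themselves are order-independent.
    return {h: counts[h] for h in hstrs}

rows = [
    "level0level00level000",
    "level0level01",
    "level0level02level021",
    "level0level02level021",
    "level0level02level020",
    "level0level02level021",
    "level1level02level021",
    "level1level02level021",
    "level1level02level021",
    "level2level02level021",
]
hstrs = ["level0", "level1", "level0level01", "level0level02", "level0level02level021"]

result = count_prefix_matches(rows, hstrs)
print(result)
```

With a DataFrame, df['hierarchystr'].tolist() could be passed as rows. Whether this beats the vectorized startswith loop depends on the number of entries in hstrs, so it is worth timing on real data.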

Conditional formatting on particular columns in Tableau

I have a sample dataset on which I want to perform conditional formatting. In the given sample of data, if values in column Item3>=Item1 then the corresponding records in Item3 should be highlighted in green else in red. Similarly, if values in column Item4>=Item2 then the corresponding records in Item4 should be highlighted in green else in red.
| Group | Item1 | Item2 | Item3 | Item4 |
|-------|-------|-------|-------|-------|
| A | 3 | 1 | 1 | 1 |
| B | 4 | 3 | 4 | 3 |
| C | 5 | 6 | 2 | 8 |
| D | 9 | 4 | 10 | 6 |
| E | 6 | 9 | 7 | 7 |
| F | 4 | 5 | 5 | 7 |
| G | 7 | 5 | 9 | 6 |
In the above example, rows 1 and 3 under Item3 column should be highlighted in red and rest of them in green while row 5 under Item4 column should be highlighted in red and rest in green.
I have tried creating a calculated field using if-else statement, but it highlights all the values. How can I achieve it for highlighting the cells in columns 'Item3' and 'Item4'?

One way to achieve this viz is to use three sheets: the first with Group, Item1, and Item2; the second with Group and Item3; the third with Group and Item4.
Create two calculated fields, "3>1" and "4>2", and assign them as the color on the second and third sheets respectively. Then make a dashboard with all three sheets floating and overlapping, adjusting which one is in front. I punted on titles.
My result is published here: https://public.tableau.com/app/profile/wade.schuette/viz/color-columns/Dashboard1?publish=yes
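The comparison behind the two calculated fields can be checked against the sample data with a short Python sketch. This is not Tableau syntax; the function name color_for and the row tuples are illustrative assumptions, and the comparison uses >= as stated in the question.

```python
# Sample rows from the question: (Group, Item1, Item2, Item3, Item4)
rows = [
    ("A", 3, 1, 1, 1),
    ("B", 4, 3, 4, 3),
    ("C", 5, 6, 2, 8),
    ("D", 9, 4, 10, 6),
    ("E", 6, 9, 7, 7),
    ("F", 4, 5, 5, 7),
    ("G", 7, 5, 9, 6),
]

def color_for(value, baseline):
    """Green when the value reaches the baseline, red otherwise."""
    return "green" if value >= baseline else "red"

# Item3 is compared with Item1, Item4 with Item2.
item3_colors = {g: color_for(i3, i1) for g, i1, i2, i3, i4 in rows}
item4_colors = {g: color_for(i4, i2) for g, i1, i2, i3, i4 in rows}

red_item3 = [g for g, c in item3_colors.items() if c == "red"]
red_item4 = [g for g, c in item4_colors.items() if c == "red"]
print(red_item3, red_item4)
```

The red cells come out as groups A and C for Item3 and group E for Item4, which matches rows 1, 3, and 5 described in the question.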

How to index match a condition set in a cell

I am trying to avoid a formula with multiple nested IFs by using INDEX/MATCH against a table instead. However, what I need to match is an actual condition plus a string.
Lookup table:
+---+------------------+-------------------+-------+
|   | A                | B                 | C     |
+---+------------------+-------------------+-------+
| 1 | Current to Prior | Portfolio Comment | Error |
| 2 | =0               | ""                | 1     |
| 3 | <>0              | ""                | -1    |
| 4 | >0               | OK – Losses       | 0     |
| 5 | <0               | OK – Losses       | 1     |
| 6 | <0               | OK – New Sales    | 0     |
| 7 | >0               | OK – New Sales    | 1     |
+---+------------------+-------------------+-------+
Column A: Lookup Condition
Column B: Lookup string
Column C: Return value
Data example with correct hard coded output (column C):
+---+------------------+-------------------+-------+
|   | A                | B                 | C     |
+---+------------------+-------------------+-------+
| 1 | Current to Prior | Portfolio comment | Error |
| 2 | 0                |                   | 1     |
| 3 | -100             | OK – Losses       | 1     |
| 4 | 50               |                   | -1    |
| 5 | 200              | OK – Losses       | 0     |
| 6 | 0                |                   | 1     |
| 7 | -400             | OK – New Sales    | 0     |
| 8 | 0                |                   | 1     |
+---+------------------+-------------------+-------+
Column A: Data value
Column B: Data string
Column C: Output formula
I need a formula that matches the data value with the lookup condition, the data string with the lookup string and outputs the return value.

I know you weren't necessarily asking for a VBA solution, but I (and many others) prefer using UDFs because, in my opinion, they make formulas cleaner and easier to read, and you can do without helper cells.
We start the UDF with a Select Case statement. We could use either the numerical value or the string for the cases; I decided to go with the string.
Within each case, the numerical value passed in the lngCondition parameter is compared against the conditions, and the result determines the value the function returns.
Since you didn't have any cases for when a non-empty comment has lngCondition = 0, I made it return the worksheet error #VALUE!, just as you'd expect from any other Excel formula. This is the reason the UDF has a Variant return type. Note that the Case strings must match the comment text in your cells exactly, including the dash character.
Public Function ReturnErrorCode(lngCondition As Long, strComment As String) As Variant
    Select Case strComment
        Case ""
            If lngCondition = 0 Then
                ReturnErrorCode = 1
            Else
                ReturnErrorCode = -1
            End If
        Case "OK - Losses"
            If lngCondition > 0 Then
                ReturnErrorCode = 0
            ElseIf lngCondition < 0 Then
                ReturnErrorCode = 1
            Else
                ' Your conditions don't specify that 'OK - Losses'
                ' can have a 0 value
                ReturnErrorCode = CVErr(xlErrValue)
            End If
        Case "OK - New Sales"
            If lngCondition < 0 Then
                ReturnErrorCode = 0
            ElseIf lngCondition > 0 Then
                ReturnErrorCode = 1
            Else
                ' Your conditions don't specify that 'OK - New Sales'
                ' can have a 0 value
                ReturnErrorCode = CVErr(xlErrValue)
            End If
        Case Else
            ReturnErrorCode = CVErr(xlErrValue)
    End Select
End Function
You would then use this formula in the worksheet like so (shown for the first data row, since row 1 holds the headers):
=ReturnErrorCode(A2, B2)
Great! But I have no knowledge of VBA and don't know how to add a UDF.
First, you need to open the VBA Editor. You can do this by simultaneously pressing Alt + F11.
Next, you need to create a standard code module. In the VBE, click Insert then select Module (NOT Class module!).
Then copy the code above, and paste it into the new code module you just created.
Since you have now added VBA code to your workbook, you need to save it as a macro-enabled workbook (.xlsm) the next time you save.
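The question asked for a table-driven lookup rather than hard-coded branches. The idea of matching a condition string plus a comment against a rule table can be sketched in Python as follows. This is an illustration, not a worksheet formula: the names OPS, RULES, and return_error_code are hypothetical, and the comments are written with a plain hyphen, which is an assumption about the cell text.

```python
import operator

# Condition strings as written in the lookup table, each compared against 0.
OPS = {
    "=0": operator.eq,
    "<>0": operator.ne,
    ">0": operator.gt,
    "<0": operator.lt,
}

# (condition, comment, return value), in the same order as the lookup table.
RULES = [
    ("=0", "", 1),
    ("<>0", "", -1),
    (">0", "OK - Losses", 0),
    ("<0", "OK - Losses", 1),
    ("<0", "OK - New Sales", 0),
    (">0", "OK - New Sales", 1),
]

def return_error_code(value, comment):
    """Return the value of the first rule whose comment and condition both match."""
    for condition, rule_comment, result in RULES:
        if rule_comment == comment and OPS[condition](value, 0):
            return result
    return None  # no rule matched (the UDF above returns #VALUE! in this case)

data = [
    (0, ""),
    (-100, "OK - Losses"),
    (50, ""),
    (200, "OK - Losses"),
    (0, ""),
    (-400, "OK - New Sales"),
    (0, ""),
]
codes = [return_error_code(value, comment) for value, comment in data]
print(codes)  # [1, 1, -1, 0, 1, 0, 1]
```

The printed codes match the hard-coded Error column in the data example. Adding a rule means adding a row to RULES rather than another branch.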

Generate variable with odd and even labels

I have a variable num with values 1-10.
I would like to create a new variable type with values odd or even:
generate type = odd if inlist(num, 1,3,5,7,9)
Questions:
What is the cleanest way to also label even numbers?
Could I use a negation somewhere and keep the command all in one line?

The code you provide is not valid syntax:
clear
set obs 10
generate num = _n
generate type = odd if inlist(num, 1,3,5,7,9)
odd not found
r(111);
You could get what you want with:
generate type = "odd" if inlist(num, 1,3,5,7,9)
And you can do both at the same time using the cond() function:
generate type = cond(inlist(num, 1,3,5,7,9), "odd", "even")
However, having this variable as a string will be of limited value for later use.
You could subsequently use the encode command to create a new variable of numeric type:
encode type, generate(type2)
list
     +--------------------+
     | num   type   type2 |
     |--------------------|
  1. |   1    odd     odd |
  2. |   2   even    even |
  3. |   3    odd     odd |
  4. |   4   even    even |
  5. |   5    odd     odd |
     |--------------------|
  6. |   6   even    even |
  7. |   7    odd     odd |
  8. |   8   even    even |
  9. |   9    odd     odd |
 10. |  10   even    even |
     +--------------------+
Although seemingly identical, type and type2 variables are indeed of a different type:
list, nolabel
     +--------------------+
     | num   type   type2 |
     |--------------------|
  1. |   1    odd       2 |
  2. |   2   even       1 |
  3. |   3    odd       2 |
  4. |   4   even       1 |
  5. |   5    odd       2 |
     |--------------------|
  6. |   6   even       1 |
  7. |   7    odd       2 |
  8. |   8   even       1 |
  9. |   9    odd       2 |
 10. |  10   even       1 |
     +--------------------+
This is how you can do it with type being a numeric variable:
generate type = mod(num, 2)
list
     +------------+
     | num   type |
     |------------|
  1. |   1      1 |
  2. |   2      0 |
  3. |   3      1 |
  4. |   4      0 |
  5. |   5      1 |
     |------------|
  6. |   6      0 |
  7. |   7      1 |
  8. |   8      0 |
  9. |   9      1 |
 10. |  10      0 |
     +------------+
You then create the value label and attach it to the variable type:
label define numlab 0 "even" 1 "odd"
label values type numlab
list
     +------------+
     | num   type |
     |------------|
  1. |   1    odd |
  2. |   2   even |
  3. |   3    odd |
  4. |   4   even |
  5. |   5    odd |
     |------------|
  6. |   6   even |
  7. |   7    odd |
  8. |   8   even |
  9. |   9    odd |
 10. |  10   even |
     +------------+
If you only want the odd numbers labeled you can simply do:
label define numlab 1 "odd"
If you later change your mind and want to add a label for even numbers:
label define numlab 0 "even", add

When your command has been run, the value of type for the odd numbers is "odd" and the value for the even numbers is "", that is a missing string.
You could tag the even numbers using
replace type = "even" if type==""
I cannot think of a way to keep it all in one line, since you have to both generate the variable and fill in two different string values.
If you could use a numeric variable (I name it flag) as your type variable, you could try this:
gen flag = mod(num,2)
This will flag the odd numbers as 1 and the even numbers as 0. You could then create a label for the flag variable, if you need to display its values as "odd" and "even".
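The flag-then-label approach is not specific to Stata. As a rough Python analogue (the names LABELS, flags, and types are illustrative), the numeric flag is stored and the text is looked up only for display:

```python
# Value labels, analogous to: label define numlab 0 "even" 1 "odd"
LABELS = {0: "even", 1: "odd"}

nums = list(range(1, 11))
flags = [n % 2 for n in nums]        # 1 for odd numbers, 0 for even numbers
types = [LABELS[f] for f in flags]   # display text looked up from the flag

for n, t in zip(nums, types):
    print(n, t)
```

Keeping the numeric flag as the stored value and the label as a lookup mirrors the advantage of a labeled numeric variable over a plain string.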

How to create a table (or list) with the order codes of orders with both products

I have a Transactions table with the following structure:
ID | Product | OrderCode | Value
1 | 8 | ABC | 100
2 | 5 | ABC | 150
3 | 4 | ABC | 80
4 | 5 | XPT | 100
5 | 6 | XPT | 100
6 | 8 | XPT | 100
7 | 5 | XYZ | 100
8 | 8 | UYI | 90
How do I create a table (or list) with the order codes of orders with both products 5 and 8?
In the example above it should be the orders ABC and XPT.

There are probably many ways to do this, but here's a fairly general solution that I came up with:
FilteredList =
VAR ProductList = {5, 8}
VAR SummaryTable =
    SUMMARIZE(
        Transactions,
        Transactions[OrderCode],
        "Test",
            COUNTROWS(INTERSECT(ProductList, VALUES(Transactions[Product])))
                = COUNTROWS(ProductList)
    )
RETURN
    SELECTCOLUMNS(FILTER(SummaryTable, [Test]), "OrderCode", Transactions[OrderCode])
The key here is that if the set of products for a particular order code contains both 5 and 8, then the intersection of VALUES(Transactions[Product]) with the set {5, 8} is exactly that set and has a count of 2. If the order doesn't have both, the count is 1 or 0 and the test fails.
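The same set-based test can be sketched outside DAX. The Python below groups products by order code and keeps the orders whose product set contains every required product. The function name orders_with_all is hypothetical, and the tuples are the sample rows from the question.

```python
from collections import defaultdict

# Sample rows from the question: (ID, Product, OrderCode, Value)
transactions = [
    (1, 8, "ABC", 100),
    (2, 5, "ABC", 150),
    (3, 4, "ABC", 80),
    (4, 5, "XPT", 100),
    (5, 6, "XPT", 100),
    (6, 8, "XPT", 100),
    (7, 5, "XYZ", 100),
    (8, 8, "UYI", 90),
]

def orders_with_all(rows, required):
    """Return the order codes whose products include every required product."""
    products_by_order = defaultdict(set)
    for _id, product, order_code, _value in rows:
        products_by_order[order_code].add(product)
    required = set(required)
    # required <= products is the subset test, equivalent to the
    # "intersection has the same size as the required set" check in the DAX.
    return sorted(code for code, products in products_by_order.items()
                  if required <= products)

result = orders_with_all(transactions, {5, 8})
print(result)  # ['ABC', 'XPT']
```

Changing the required set to any other list of products works without further changes, which is the same generality the ProductList variable gives the DAX version.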

Please elaborate on your question. From your post, I understand that you want to filter the list. For that, you can use the code below:
List<Transactions> filtered = listTransactions.FindAll(x => x.Product == 5 || x.Product == 8);
