How do I combine COUNTIF with OR - regex

In Google Spreadsheets, I need to use the COUNTIF function on a range with multiple criteria. So in the table below, I would need to have something like =COUNTIF(B:B,"Mammal"or"Bird") and return a value of 4.
A |B
-------------------
Animal | Type
-------------------
Dog | Mammal
Cat | Mammal
Lizard | Reptile
Snake | Reptile
Alligator | Reptile
Dove | Bird
Chicken | Bird
I've tried a lot of different approaches with no luck.

One option:
=COUNTIF(B:B; "Mammal") + COUNTIF(B:B; "Bird")
According to the documentation:
Notes
COUNTIF can only perform conditional counts with a single criterion.
To use multiple criteria, use COUNTIFS or the database functions
DCOUNT or DCOUNTA.
COUNTIFS: This function is only available in the new Google Sheets.
Example:
=DCOUNTA(A:B; 2; {"Type"; "Mammal"; "Bird"})

You can also use ArrayFormula around a SUM(COUNTIF()) construct:
=ArrayFormula(SUM(COUNTIF(B:B,{"Mammal", "Bird"})))
Source: Google Docs Product Forum

You can use a regex like this:
=ARRAYFORMULA(SUM(N(REGEXMATCH(B:B, "Mammal|Bird"))))

You can use QUERY, which can be very powerful (and is generally easier to remember, particularly if you come from the database world):
=QUERY(A:B;"SELECT COUNT(B) WHERE B='Mammal' OR B='Bird'";0)
(The 0 at the end is used to avoid including header rows if you have any in your range.)
Documentation for the QUERY function: https://support.google.com/docs/answer/3093343?hl=en

Related

How to make full text search in PostgreSQL to search any order of words in the input search_text and also if there are words between them

I'm trying to use Django full-text search but I'm having a problem:
I've followed this documentation, and it works quite well. But my problem is that I don't want PostgreSQL to take the order of the words I search for into account.
I mean I want searching "good boy" and "boy good" to return the same results.
Also, when I search "good boy" I want to see "good bad boy" in the results.
But neither of these happens: I can't find "good bad boy" by typing "boy good", or even "good boy" (because of the "bad" in between).
I've tried splitting search_text on spaces and then combining the search queries with & to remove the dependence on word order, but it didn't work.
I changed this code:
search_query = SearchQuery(
    search_text
)
search_rank = SearchRank(search_vectors, search_query)
to this:
s = SearchQuery(search_text.split(' ')[0])
for x in search_text.split(' ')[1:]:
    s = s & SearchQuery(x)
search_query = s
search_rank = SearchRank(search_vectors, search_query)
You can achieve this quite easily in PostgreSQL.
I suggest reading carefully the documentation on Basic Text Matching.
What you need is the <-> (FOLLOWED BY) tsquery operator.
If you put an integer in place of the -, you can express the desired proximity:
<N>, where N is an integer standing for the difference between the positions of the matching lexemes. <1> is the same as <->, while <2> allows exactly one other lexeme between the matches, which is what "good bad boy" needs.
So you should find a way to make Python execute this PostgreSQL function:
to_tsquery('good <1> boy | boy <1> good');
Maybe just calling SearchQuery with the string 'good <1> boy | boy <1> good' works, but you should refer to the documentation to find out how to use <-> with SearchQuery.
Edited after comment.
Looking at the Django source code, you can see that the constructor of the SearchQuery class takes a search_type parameter that defaults to 'plain'.
You can pass a different value to get a different kind of search.
In the Django project documentation, you can find one example for each accepted string value of the search_type parameter: 'plain', 'phrase', 'raw', 'websearch'.
If you pass search_type='raw' you can provide a formatted search query with terms and operators, but I'm not sure whether Django supports the <N> operator; as I said before, you should try it.
Have you tried a search with search_type='raw' and a value of "'good <1> boy | boy <1> good'"?
If the FOLLOWED BY operator (<->) is not supported, the closest you can get to what you want is calling SearchQuery with search_type='phrase'.
This will find both good boy and boy good, but the exact behaviour depends on the stop words defined by the dictionary configured in PostgreSQL.
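As a rough sketch of the raw-query idea above, the helper below builds a tsquery string that accepts the words in any order with up to a chosen distance between neighbouring words. The name build_proximity_tsquery and the max_distance parameter are made up for illustration. It assumes the search words are plain alphanumeric tokens (anything else is dropped, no further sanitising is done) and that the result would be passed to SearchQuery(..., search_type='raw'), which has not been verified here.

```python
from itertools import permutations


def build_proximity_tsquery(search_text, max_distance=2):
    """Build a tsquery string matching the words in any order.

    Hypothetical helper: every ordering of the words is combined with
    every distance from 1 to max_distance using the <N> operator.
    """
    # Keep only plain alphanumeric words so no tsquery operators slip in.
    words = [w for w in search_text.split() if w.isalnum()]
    if len(words) < 2:
        return " ".join(words)

    alternatives = []
    for ordering in permutations(words):
        for distance in range(1, max_distance + 1):
            # <-> binds tighter than |, so no parentheses are needed.
            alternatives.append(f" <{distance}> ".join(ordering))
    return " | ".join(alternatives)


query = build_proximity_tsquery("good boy")
print(query)
```

The number of alternatives grows factorially with the number of words, so this only makes sense for short queries of two or three words.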

Why is this ATL helper wrong?

I'm new to ATL and OCL and I'm trying to transform this metamodel:
[image: source metamodel]
into this one:
[image: target metamodel]
The helper is meant to take all the tests created by the user admin and then sum the ids of the Actions of those tests.
I've written this helper:
helper def: actionsId: Integer = Test!Test.allInstances()->select(i | i.md.user='admin')->collect(n | n.act.id.toInteger())->sum();
But when I run the transformation I get this error:
org.eclipse.m2m.atl.engine.emfvm.VMException: Collections do not have properties, use ->collect()
This error is in the collect(n | n.act.id.toInteger()) part of the helper.
The rest of my code is this:
rule Testset2Testcase {
    from s: Test!Test
    to r: Testcase!Testcase(
        ident <- thisModule.actionId.toString(),
        date <- s.md.date,
        act <- thisModule.resolveTemp(s.act,'a')
    )
    do {
        'Bukatuta'.println();
    }
}
rule Action2Activity {
    from s: Test!Action
    to a: Testcase!Activity(
        ident <- s.id
    )
}
Sorry for my bad english.
My teacher helped me with this.
The problem was in the helper. Doing this:
helper def: actionsId: Integer = Test!Test.allInstances()->select(i | i.md.user='admin')->collect(n | n.act.id.toInteger())->sum();
I was trying to take the id of a collection of collections of Actions instead of the id of each object.
collect(n | n.act) returns a collection of collections, so applying flatten() turns it into a single collection of Actions.
The corrected helper looks like this:
helper def: actionsId: Integer = Test!Test.allInstances()->select(i | i.md.user='admin')->collect(n | n.act)->flatten()->collect(x | x.id.toInteger())->sum();
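The difference between the two helpers can be illustrated outside ATL. The Python sketch below uses made-up data (the tests list and its user, act, and id keys are simplified stand-ins for the metamodel, not taken from it) to show why collecting act gives a collection of collections and why it has to be flattened before the ids can be read.

```python
from itertools import chain

# Hypothetical data: each test has a user and a list of actions.
tests = [
    {"user": "admin", "act": [{"id": "1"}, {"id": "2"}]},
    {"user": "admin", "act": [{"id": "3"}]},
    {"user": "guest", "act": [{"id": "10"}]},
]

# select(i | i.md.user = 'admin')
admin_tests = [t for t in tests if t["user"] == "admin"]

# collect(n | n.act) yields one list per test: a list of lists.
nested = [t["act"] for t in admin_tests]

# flatten() turns the list of lists into a single list of actions.
flat = list(chain.from_iterable(nested))

# collect(x | x.id.toInteger())->sum()
actions_id = sum(int(action["id"]) for action in flat)
print(actions_id)
```

Asking for "id" on an element of nested fails for the same reason the ATL engine complains: each element is itself a collection, not an Action.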
Your expression looks plausible, but without your metamodel it is difficult to see where ATL is unhappy about use of a Collection property. If Test::md is a collection, the expression would just be stupid, though not for the reason given.
If ATL's hovertext doesn't help you understand your types, you might enter the same expression into the OCL Xtext Console and carefully hover over "." and "md" to get its accurate type analysis.
But beware, ATL has an independently developed embedded OCL that is not as rich as Eclipse OCL. Perhaps your expression is too complex for ATL; try breaking it up with let expressions.

How to optimize regexp_replace performance for multiple substrings in PostgreSQL

I need to remove multiple substrings from a column of strings. I have ~200k rows, as well as a large bag of custom "stopwords."
Specifically, I have a table (NameTable) with a Name column full of multiple-word strings. Only a small subset of the words in each string is relevant for what I'm doing next. The Name column has a trigram GIN index. Is there any way to speed up a query like the following, but with a much larger list of target words? I'm using PG10.
...
Create Index Names_trgm on NameTable using gin(Name gin_trgm_ops);
Update NameTable
set CleanedName = regexp_replace(Name, '( fuzzy| wuzzy| was| a | bear | the |
first| time | yossarian| saw | the | chaplain | he | fell| madly| in | love|
with| him)',' ','g');

pyparsing using both OneOrMore and ZeroOrMore regardless of order

I am trying to get into pyparsing and want to create what I think is a simple grammar for, say, a shopping basket. The following illustrates it:
basket=['metal basket','wicker basket','plastic basket']
fish=['haddock','plaice','dover sole']
meat=['beef','lamb','pork']
vegetable=['tomatoe','onion','cabbage','carrot']
fruit=['apple','mango','orange','strawberry']
So the rule for shopping is you must have
1 Shopping basket
1 or more vegetables
zero or more fruits
fish are optional
The resulting parser must enforce the list of requirements above. It should not matter what order the items are placed in the basket, i.e. a list of
metal basket
haddock
tomatoe
apple
cabbage
orange
is just as valid as
wicker basket
tomatoe
apple
orange
apple
The one that should fail however is
- lamb
- tomatoe
- apple
- wicker basket
- apple
Because the basket must always be first in the list. I am at a loss as to how to do this.
I have tried:
basket + OneOrMore(vegetable) + ZeroOrMore(fruit) + StringEnd()
But it doesn't seem to work. I'm using pyparsing on Python 2.7 on Windows 7. Thanks
Each is the pyparsing class for specifying "all of these things, but in any order". Think of it as a special form of And. And in fact, the operator for Each is &.
You want to define various valid combinations of basket contents, after the basket is given first.
basket + (OneOrMore(vegetable) & ZeroOrMore(fruit) & ZeroOrMore(fish))
You can leave off the StringEnd() at the end - just specify parseAll=True in your call to parseString.
Alternatively, you could just put all the ingredients into a single clump like:
basket + ZeroOrMore(vegetable | fruit | fish)
and then put the validation into a parse action. I'm actually more in favor of this second approach than of using Each in the parser itself. For one thing, a parse action, implemented in Python code, can contain much more complex rules ("more vegetables than fruit", "at least as many vegetables as fish", "oysters only in months containing an 'R'", etc.). Also, I think these rules are more likely to change over time, and all such changes will be localized to the parse action, instead of forcing changes in the parser itself.
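A minimal sketch of the validation that such a parse action could carry out is shown below. It is written as plain Python so it runs on its own: validate_basket and is_valid are hypothetical names, the function works on an already tokenised list of item names, and it raises ValueError where a real parse action would more likely raise pyparsing's ParseException. Attaching it to the grammar (for example with setParseAction) is left out and would need adapting to the tokens your parser actually produces.

```python
BASKETS = {"metal basket", "wicker basket", "plastic basket"}
FISH = {"haddock", "plaice", "dover sole"}
VEGETABLES = {"tomatoe", "onion", "cabbage", "carrot"}  # spelling as in the question
FRUIT = {"apple", "mango", "orange", "strawberry"}


def validate_basket(items):
    """Check the shopping rules on a list of item names, in any order
    after the basket. Raises ValueError on the first broken rule."""
    # Rule: exactly one basket, and it must come first.
    if not items or items[0] not in BASKETS:
        raise ValueError("the basket must come first")

    contents = items[1:]
    allowed = VEGETABLES | FRUIT | FISH
    for item in contents:
        # A second basket, meat, or anything unknown is rejected here.
        if item not in allowed:
            raise ValueError("unexpected item: %s" % item)

    # Rule: one or more vegetables; fruit and fish are optional.
    if not any(item in VEGETABLES for item in contents):
        raise ValueError("at least one vegetable is required")
    return items


def is_valid(items):
    """Convenience wrapper returning True or False instead of raising."""
    try:
        validate_basket(items)
        return True
    except ValueError:
        return False


print(is_valid(["wicker basket", "tomatoe", "apple", "orange", "apple"]))
print(is_valid(["lamb", "tomatoe", "apple", "wicker basket", "apple"]))
```

Rules such as "more vegetables than fruit" would then be extra checks inside validate_basket, without touching the grammar.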

Using Regex in Pig in hadoop

I have a CSV file containing user (tweetid, tweets, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression split the line into the desired fields here?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL, as I find it much simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
The above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet id, I do not get the desired output:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.
Can't comment, but from looking at this and testing it out, it looks like the quotes in your regex are different from those in the CSV.
" in the CSV
” in the regex code.
To get the tweetid, capture the leading digits rather than the delimiter, for example:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'^([0-9]+),.*',1)) AS (tweetid:long);
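On the question of how the REGEX_EXTRACT_ALL pattern splits a line: each (.*) is greedy, so the first group grabs as much as it can while still leaving two delimiter characters for the rest of the pattern. The sketch below replays that with Python's re module on two made-up lines. Pig uses Java regular expressions, and the assumption here is that greedy matching and these character classes behave the same way in both engines. split_line is a hypothetical helper name.

```python
import re

# Same pattern as in the question: three greedy groups separated by
# one character from each delimiter class.
PATTERN = re.compile(r'(.*)[,”:-](.*)[“,:-](.*)')


def split_line(line):
    """Return the three captured groups, or None if the line does not match."""
    match = PATTERN.fullmatch(line)
    return match.groups() if match else None


# Made-up lines in the same shape as the CSV rows.
simple = '2,"no commas here",user_2'
with_comma = '1,"a gift, but a curse",user_1'

print(split_line(simple))
# ('2', '"no commas here"', 'user_2')

print(split_line(with_comma))
# ('1,"a gift', ' but a curse"', 'user_1')
```

With only two delimiters in the line the groups land on tweetid, message, and userid. As soon as the tweet itself contains a comma, colon, or hyphen, the first group swallows part of the message, which is why the pattern is fragile for real tweets.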