Google Sheets array formula for counting the number of values in each column

I'm trying to create an array formula to auto-populate the total count of values for each column as columns are added.
I've tried doing this using a combination of COUNT and INDIRECT, and I've also tried my hand at QUERY, but I can't seem to get it to show unique value counts for each column.
This is my first time attempting to use QUERY; at first it seemed possible from reading through the documentation on the query language, but I haven't been able to figure it out.
Here's the shared document: https://docs.google.com/spreadsheets/d/15VwsL7uTsORLqBDrnT3VdwAWlXLh-JgoJVbz7wkoMAo/edit?usp=sharing
I know I can do this by writing a custom function in Apps Script, but I'd like to use built-in functions if I can, for performance reasons (there is going to be a lot of data), and I want quick refresh rates.

Try:
=ARRAYFORMULA(IF(B5:5="",,TRANSPOSE(MMULT(TRANSPOSE(N(B6:99<>"")), SIGN(ROW(B6:99))))))
MMULT multiplies the transposed matrix of non-blank flags by a column of ones (the SIGN of the row numbers), which sums the non-blank cells in each column; the outer IF blanks the output for columns with no header in row 5.

In B3 try:
=ArrayFormula(IF(LEN(B5:5), COUNTIF(IF(B6:21<>"", COLUMN(B6:21)), COLUMN(B6:21)),))
The inner IF maps every non-blank cell to its column number, and COUNTIF then tallies how often each column number occurs, giving the per-column count of non-blank cells; LEN(B5:5) restricts the output to columns that have a header.


MarkLogic Optic JavaScript Geospatial Difference

I want to reduce the selected items by their distance from a point using MarkLogic Optic.
I have a table with data and a lat/long pair.
const geoData = op.fromView("namespace", "coordinate");
geoData.where(op.le(distance(-28.13, 153.4, geoData.col("lat"), geoData.col("long")), 100))
I have already written the distance function; it utilises geo.distance(cts.point(lat,long), cts.point(lat,long)), but geoData.col("lat") passes an object that describes the full namespace of the column, not the value:
op.schemaCol('namespace', 'coordinate', 'long')
I suspect I need to write a map/reduce function, but the MarkLogic documentation gives the usual simplistic examples that are next to useless.
I would appreciate some help.
FURTHER INFORMATION
I have mostly solved this problem, except that some columns have null values. The data is sparse, and not all rows have a long/lat.
So when cts.point runs in the where statement and two null values are passed, it raises an exception.
How do I coalesce, or prevent execution of cts.point, when the data columns are null? I don't want to reduce the data set, as the null-value records still need to be returned; they will just have a null distance.
Where possible, it's best to do filtering by passing a constraining cts.query() to where().
A constraining query matches the indexed documents and filters the set of rows to the rows that TDE projected from those documents before retrieving the filtered rows from the indexes.
If the lat and long columns are each distinct JSON properties or XML elements in the indexed documents, it may be possible to express the distance constraint using techniques similar to those summarized here:
http://docs.marklogic.com/guide/search-dev/geospatial#id_42017
In general, it's best to use map/reduce SJS functions for postprocessing on the filtered result set, because the rows have to be retrieved to the e-node to be processed in SJS.
Hoping that helps,

How can I use ArrayFormula within a formula containing Vlookup, Filter and RegexMatch

I'm making a Google Spreadsheet which checks whether a value in column A contains keywords from a list in column F. The problem is that I want to check if the value in A is exactly the same OR only partly the same.
With a lot of help I've found on here, I created this working formula:
=VLOOKUP(FILTER(ArrayFormula((LOWER(F:F)));REGEXMATCH(LOWER(A2);ArrayFormula((LOWER(F:F)))));ArrayFormula((LOWER(F:G)));1;FALSE)
Because I automatically import new lines of data, I want to use ARRAYFORMULA. Unfortunately, I can't get it done.
These are my working formulas:
=VLOOKUP(FILTER(ArrayFormula((LOWER(F:F)));REGEXMATCH(LOWER(A2);ArrayFormula((LOWER(F:F)))));ArrayFormula((LOWER(F:F)));1;FALSE)
=VLOOKUP(FILTER(ArrayFormula((LOWER(F:F)));REGEXMATCH(LOWER(A3);ArrayFormula((LOWER(F:F)))));ArrayFormula((LOWER(F:F)));1;FALSE)
You can find my spreadsheet over here:
https://docs.google.com/spreadsheets/d/1aIdQ65SdeXW-4cTr8azQIiLNGcRCvTexGS_lFu8mECs/edit#gid=1308644379
TEXTJOIN builds a single regex alternation out of every keyword in column F, REGEXEXTRACT pulls the first matching keyword out of each value in column A, and IFERROR blanks the rows with no match:
=ARRAYFORMULA(PROPER(IFERROR(REGEXEXTRACT(LOWER(A2:A); LOWER(TEXTJOIN("|"; 1; F2:F))))))
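The core trick is building one alternation pattern from all the keywords and searching each value with it; a rough Python illustration of the same idea (the keyword and value literals here are made up):

import re

keywords = ['apple', 'green tea']             # stand-in for column F
value = 'Bought some Green Tea yesterday'     # stand-in for a column A cell

# One pattern that matches any keyword, case-insensitively. Unlike the sheet
# formula, re.escape() keeps regex metacharacters in keywords from breaking it.
pattern = '|'.join(re.escape(k.lower()) for k in keywords)
match = re.search(pattern, value.lower())
print(match.group(0) if match else '')        # -> 'green tea'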

Efficiently processing all data in a Cassandra Column Family with a MapReduce job

I want to process all of the data in a column family in a MapReduce job. Ordering is not important.
One approach is to iterate over all the row keys of the column family to use as the input. This could potentially be a bottleneck, and it could be replaced with a parallel method.
I'm open to other suggestions, or for someone to tell me I'm wasting my time with this idea. I'm currently investigating the following:
A potentially more efficient way is to assign ranges to the input instead of iterating over all row keys (before the mapper starts). Since I am using RandomPartitioner, is there a way to specify a range to query based on the MD5?
For example, I want to split the task into 16 jobs. Since the RandomPartitioner is MD5-based (from what I have read), I'd like to query everything starting with a for the first range. In other words, how would I do a get_range on the MD5 that starts at a and ends before b, e.g. a0000000000000000000000000000000 - afffffffffffffffffffffffffffffff?
I'm using the pycassa API (Python) but I'm happy to see Java examples.
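For what it's worth, RandomPartitioner tokens are decimal integers in 0..2**127 rather than hex strings, so a hex-prefix range like a000... - afff... corresponds to a slice of that integer space rather than a lexical key range. A minimal pycassa sketch of the splitting idea (the keyspace, column family, and process() handler are placeholders, and start_token/finish_token support should be checked against your pycassa version):

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace')           # hypothetical keyspace
cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')     # hypothetical CF

TOKEN_SPACE = 2 ** 127   # RandomPartitioner token range
NUM_SPLITS = 16

for i in range(NUM_SPLITS):
    start = TOKEN_SPACE * i // NUM_SPLITS
    finish = TOKEN_SPACE * (i + 1) // NUM_SPLITS
    # Each of these scans could run as its own mapper input.
    for key, columns in cf.get_range(start_token=str(start),
                                     finish_token=str(finish)):
        process(key, columns)                         # placeholder handler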
I'd cheat a little:
Create new rows job_(n) with each column representing each row key in the range you want
Pull all columns from that specific row to indicate which rows you should pull from the CF
I do this with users: users from a particular country get a column in the country-specific row, and users of a particular age are also added to a specific row.
This allows me to quickly pull the rows I need based on the criteria I want, and it's a little more efficient than pulling everything.
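A sketch of that index-row cheat with pycassa (the JobIndex column family and the job_3 row key are hypothetical names):

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace')           # hypothetical keyspace
data_cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')
index_cf = pycassa.ColumnFamily(pool, 'JobIndex')     # hypothetical index CF

# While writing data, record each row key under its job's index row;
# the column name is the data row key, the value is unused.
index_cf.insert('job_3', {'some-row-key': ''})

# A mapper for job 3 then pulls only its keys and fetches those rows:
job_keys = [name for name, _ in index_cf.xget('job_3')]
rows = data_cf.multiget(job_keys)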
This is how the Mahout CassandraDataModel example functions:
https://github.com/apache/mahout/blob/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
Once you have the data and can pull the rows you are interested in, you can hand it off to your MR job(s).
Alternatively, if speed isn't an issue, look into using Pig: How to use Cassandra's Map Reduce with or w/o Pig?

Custom Date Aggregate Function

I want to sort my Store models by their opening times. The Store model contains an is_open function, which checks the Store's opening time ranges and returns a boolean indicating whether it's open or not. The problem is that I don't want to sort my queryset manually, for efficiency reasons. I thought that if I wrote a custom annotate function, I could filter the query more efficiently.
So I googled and found that I can extend Django's aggregate class. From what I understood, I have to use predefined SQL functions like MAX, AVG, etc. The thing is, I want to check whether today's date falls in a given list of time intervals. So can anyone tell me which SQL function I should use?
Edit
I'd like to put the code here, but it's really spaghetti. A page of code does nothing but generate time intervals and check for the suitable one.
I want to avoid:
alg = lambda s: not (s.is_open() and s.reachable)
sorted(stores, key=alg)
and replace it with:
Store.objects.annotate(is_open = CheckOpen(datetime.today())).order_by('is_open')
But I'm totally lost on how to write CheckOpen...
Have a look at the docs for extra.
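For instance, a minimal sketch with extra(), assuming the opening hours live in open_time/close_time columns on the store table (hypothetical names; the SQL snippet would need adapting to the real interval schema):

from datetime import datetime

now = datetime.today().time()
stores = Store.objects.extra(
    # Raw SQL boolean computed per row; column names are hypothetical.
    select={'is_open': 'open_time <= %s AND close_time > %s'},
    select_params=[now, now],
    order_by=['-is_open'],   # open stores first
)

On Django 1.8+, a conditional annotate() with Case/When would be the more modern route to the Store.objects.annotate(is_open=...) form the question asks for.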

MongoDB: what is the most efficient way to query a single random document?

I need to pick a document from a collection at random (alternatively - a small number of successive documents from a randomly-positioned "window").
I've found two solutions: 1 and 2. The first is unacceptable, since I anticipate a large collection size and wish to minimize the document size. The second seems ineffective (I'm not sure about the complexity of the skip operation). And here one can find a mention of querying a document with a specified index, but I don't know how to do it (I'm using the C++ driver).
Are there other solutions to the problem? Which is the most efficient?
I had a similar issue once. In my case, I had a date property on my documents. I knew the earliest possible date in the dataset, so in my application code I would generate a random date between EARLIEST_DATE_IN_SET and NOW, then query MongoDB using a GTE query on the date property and simply limit it to 1 result.
There was a small chance that the random date would be greater than the highest date in the data set, so I accounted for that in the application code.
With an index on the date property, this was a super fast query.
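The asker is on the C++ driver, but the idea is driver-agnostic; here is a rough sketch of it in Python with pymongo (the 'date' field name and the earliest-date bound are whatever your data defines):

import random
from datetime import datetime, timedelta

def random_doc(coll, earliest):
    # Pick a uniform random instant between the earliest known date and now,
    # then take the first document at or after it (fast with an index on 'date').
    span = (datetime.utcnow() - earliest).total_seconds()
    pivot = earliest + timedelta(seconds=random.uniform(0, span))
    doc = coll.find_one({'date': {'$gte': pivot}}, sort=[('date', 1)])
    if doc is None:
        # The pivot landed past the newest document; one way to account for
        # that is to fall back to the oldest document.
        doc = coll.find_one(sort=[('date', 1)])
    return doc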
It seems like you could mold solution 1 there (assuming your _id key is an auto-inc value): just do a count on your records, use that as the upper limit for a random int in C++, then grab that row.
Likewise, if you don't have an auto-inc _id key, just create one with your results; having an additional INT field shouldn't add that much to your document size.
If you don't have an auto-inc field, Mongo talks about how to quickly add one here:
Auto Inc Field.
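A sketch of that count-plus-random-int approach in Python with pymongo (the 'seq' field is a hypothetical name for the auto-inc value, assumed contiguous from 0 and uniquely indexed):

import random

def random_doc_by_seq(coll):
    n = coll.count_documents({})   # total number of documents
    return coll.find_one({'seq': random.randint(0, n - 1)})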