cloudant index: count number of unique users per time period - mapreduce

A very similar post was made about this issue here. In Cloudant, I have a document structure storing when users access an application, which looks like the following:
{"username":"one","timestamp":"2015-10-07T15:04:46Z"}---| same day
{"username":"one","timestamp":"2015-10-07T19:22:00Z"}---^
{"username":"one","timestamp":"2015-10-25T04:22:00Z"}
{"username":"two","timestamp":"2015-10-07T19:22:00Z"}
What I want is to count the number of unique users for a given time period. For example:
2015-10-07 = {"count": 2} two different users accessed on 2015-10-07
2015-10-25 = {"count": 1} one different user accessed on 2015-10-25
2015 = {"count": 2} two different users accessed in 2015
This becomes tricky because, for example, on 2015-10-07 username "one" has two access records, but they should only add 1 to the total of unique users.
I've tried:
function(doc) {
  var time = new Date(Date.parse(doc['timestamp']));
  // Note: getUTCDay() returns the day of the week (0-6), not the day of the
  // month, and getUTCMonth() is zero-based -- two of the issues with this map.
  emit([time.getUTCFullYear(), time.getUTCMonth(), time.getUTCDay(), doc.username], 1);
}
This suffers from several issues, which are highlighted by Jesus Alva who commented in the post I linked to above.
Thanks!

There's probably a better way of doing this, but off the top of my head ...
You could try emitting an index for each level of granularity:
function(doc) {
  var time = new Date(Date.parse(doc['timestamp']));
  var year = time.getUTCFullYear();
  var month = time.getUTCMonth() + 1;
  var day = time.getUTCDate();

  // day granularity
  emit([year, month, day, doc.username], null);

  // year granularity
  emit([year, doc.username], null);
}
// reduce function - `_count`
Day query (2015-10-07):
inclusive_end=true&
start_key=[2015, 10, 7, "\u0000"]&
end_key=[2015, 10, 7, "\uefff"]&
reduce=true&
group=true
Day query result - your application code would count the number of rows:
{"rows":[
{"key":[2015,10,7,"one"],"value":2},
{"key":[2015,10,7,"two"],"value":1}
]}
Year query:
inclusive_end=true&
start_key=[2015, "\u0000"]&
end_key=[2015, "\uefff"]&
reduce=true&
group=true
Query result - your application code would count the number of rows:
{"rows":[
{"key":[2015,"one"],"value":3},
{"key":[2015,"two"],"value":1}
]}
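Since the grouped rows come back one per (period, username) pair, the application-side "count the rows" step is trivial. A minimal sketch in Python, using the example responses above as literal data (the actual view query is assumed to have happened already):

```python
# Rows returned by the grouped day query: one row per (day, username),
# so the number of unique users for the period is simply the row count.
day_rows = [
    {"key": [2015, 10, 7, "one"], "value": 2},
    {"key": [2015, 10, 7, "two"], "value": 1},
]

year_rows = [
    {"key": [2015, "one"], "value": 3},
    {"key": [2015, "two"], "value": 1},
]

def unique_users(rows):
    # Each grouped row represents exactly one distinct username.
    return len(rows)

print(unique_users(day_rows))   # 2 unique users on 2015-10-07
print(unique_users(year_rows))  # 2 unique users in 2015
```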

Related

DynamoDB query performance is low

First of all some specs:
number of entries: 110k
total size of table: 700MB
number of columns per data set: can be up to 450
data set size: can be up to 25kB
read/write capacity: "on demand"
Problem: when trying to query for some rows by a column it takes easily up to 10 seconds or more.
The column we query by is a UUID column (an index exists on it); it is not unique and is used like an external ID. So we say "give me all records with that UUID" and expect up to roughly 1,000 rows.
Even if I take our application completely out of the equation (testing directly in the AWS management console), it makes no difference: still very poor performance (also about 10 seconds or more).
So my question: do you have any ideas or concrete tips that I should check/test/adjust to improve the performance?
As requested, here's the example code in PHP (reduced to the relevant parts):
We use the official aws/aws-sdk-php package.
$queryResult = null;
do {
    // Marshal a native PHP array of data to a DynamoDB item.
    $transformedValue = $this->getDynamoArrayFromNativeArray(
        [':uuid' => $uuid]
    );
    $params = [
        'TableName' => 'our-table-name',
        'IndexName' => 'our-uuid-index',
        'KeyConditionExpression' => 'our-uuid-column = :uuid',
        'ExpressionAttributeValues' => $transformedValue,
    ];
    if ($queryResult !== null && isset($queryResult['LastEvaluatedKey'])) {
        $params['ExclusiveStartKey'] = $queryResult['LastEvaluatedKey'];
    }
    $queryResult = $this->client->query($params);
    // (push results to some array)
} while (isset($queryResult['LastEvaluatedKey']));
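For what it's worth, the pagination loop can also be sketched independently of the SDK. The following Python version uses a stubbed client (`FakeClient`, its page contents, and `fetch_all` are all invented for illustration) to show the `ExclusiveStartKey` / `LastEvaluatedKey` handshake:

```python
class FakeClient:
    """Stub returning two pages, mimicking DynamoDB's LastEvaluatedKey paging."""
    def __init__(self):
        self._pages = [
            {"Items": [{"id": 1}, {"id": 2}], "LastEvaluatedKey": {"id": 2}},
            {"Items": [{"id": 3}]},  # no LastEvaluatedKey: final page
        ]
        self._calls = 0

    def query(self, params):
        page = self._pages[self._calls]
        self._calls += 1
        return page

def fetch_all(client, base_params):
    """Keep querying, feeding each LastEvaluatedKey back as ExclusiveStartKey."""
    items = []
    result = None
    while True:
        params = dict(base_params)
        if result is not None and "LastEvaluatedKey" in result:
            params["ExclusiveStartKey"] = result["LastEvaluatedKey"]
        result = client.query(params)
        items.extend(result["Items"])
        if "LastEvaluatedKey" not in result:
            return items

items = fetch_all(FakeClient(), {"TableName": "our-table-name"})
print(len(items))  # 3
```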
Example data set:
{
  "_id": "82ee23ce-d7ff-11eb-bf92-0aa84964df0a",
  "_meta": {
    "creation_date": 1624877797,
    "uuid": "820025c0-d7ff-11eb-a5f4-0aa84964df0a"
  },
  "some_key.data.id": 63680,
  (couple of hundred more simple key => value pairs, nothing special, no huge values or anything like that)
Read capacity chart for the index in question:
Query latency chart:

Calculate the difference between 2 rows in PowerBI using DAX

I'm trying to do something that should be quite simple, but for the life of me I can't work it out.
I'm trying to calculate the difference between 2 rows that share the same 'Scan type'.
I have attached a photo showing sample data from production. We run a scan and depending on the results of the scan, it's assigned a color.
I want to find the difference in Scan IDs between each Red scan.
Using the attached photo of sample data, I would expect a difference of 0 for ID 3, a difference of 1 for ID 4, and a difference of 10 for ID 14.
I have (poorly) written something that works based on the maximum value of the scan ID.
I have also tried following a few posts to see if I can get it to work.
VAR _curid = MAX(table1[scanid])
VAR _curclueid = MAX(table1[scanid])
VAR _calc = CALCULATE(SUM(table1[scanid]), FILTER(ALLSELECTED(table1[scanid]), table1[scanid]))
RETURN IF(_curid - _calc = _curid, 0, _curid - _calc)
Edit;
Forgot to mention I have checked threads;
57699052
61464745
56703516
57710425
Try the following DAX and if it helps then accept it as the answer.
Create a calculated column that returns the ID where the colour is Red as follows:
Column = IF('Table'[Colour] = "Red", 'Table'[ID])
Create another column as following:
Column 2 =
VAR Colr = 'Table'[Colour]
VAR SCAN = 'Table'[Scan ID]
VAR Prev_ID =
    CALCULATE(MAX('Table'[Column]),
        FILTER('Table', 'Table'[Colour] = Colr && 'Table'[Scan ID] < SCAN))
RETURN
'Table'[Column] - Prev_ID
Output:
EDIT:
If you want your first value (ID 3) to be 0, then replace the RETURN line with the following:
IF(ISBLANK(Prev_ID) && 'Table'[Colour] = "Red", 0, 'Table'[Column] - Prev_ID)
This will give you the following result:
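As a cross-check of the logic outside DAX: assuming the sample photo has Red scans at IDs 3, 4 and 14, the expected differences can be reproduced with a short Python sketch (the scan data here is invented to match the description):

```python
# Hypothetical sample: (scan_id, colour) pairs matching the description,
# with Red scans at IDs 3, 4 and 14.
scans = [(1, "Green"), (2, "Green"), (3, "Red"), (4, "Red"),
         (5, "Green"), (14, "Red")]

def red_differences(scans):
    """For each Red scan, the gap to the previous Red scan (0 for the first)."""
    diffs = {}
    prev = None
    for scan_id, colour in scans:
        if colour == "Red":
            diffs[scan_id] = 0 if prev is None else scan_id - prev
            prev = scan_id
    return diffs

print(red_differences(scans))  # {3: 0, 4: 1, 14: 10}
```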

AmCharts4: Datagrouping, Tooltiptext and changing ranges

From a server, I get an entry for every hour in the range I request. If I request a range of 1 year, I get something like 8000 data points.
When rendering the graph, I want to group my data into hours (which is the raw data without grouping), days and months. However, the chart looks like this:
The tooltip only displays on the very first column; all other columns are above 1.5, but my ValueAxis does not scale automatically. I already checked whether I set a fixed min and max for the valueAxis; that is not the case.
Interestingly, if I use the scrollbar to zoom in until grouping kicks in, everything seems to work:
After zooming out again, it also works, but I cannot see the tooltip on the "June" column:
And finally, if I trigger "invalidateData", the graph goes back to the state it was in before.
My grouping looks as follows:
series = entry.chart.series.push(new am4charts.ColumnSeries());

dateAxis.groupData = true;
dateAxis.groupCount = 40;
dateAxis.groupIntervals.setAll([
  { timeUnit: "hour", count: 1 },
  { timeUnit: "day", count: 1 },
  { timeUnit: "month", count: 1 },
  { timeUnit: "year", count: 1 },
  { timeUnit: "year", count: 10 }
]);

series.groupFields.valueY = "sum";
I am also not very sure what I should set those values to. I want to see:
months when there is a period of 3 months or more
days when there is a period of 3 days until 3 months
hours when there is a period below 3 days
It is very difficult to do a fiddle for this, as there already is so much code and it's hard to extract only the essential parts.
Maybe I am missing something obvious, please help!
Edit:
I forgot another question, which is part of data grouping:
How can I make the tooltip show the date in a formatted manner, so that:
hour columns show "dd hh:mm" (where mm obviously is 00 all the time)
day columns show "dd.mm"
month columns show "MM. YYYY"
Never mind, I solved this issue. It was actually a bug fixed in this release:
https://github.com/amcharts/amcharts4/releases/tag/4.7.19
so updating my local amCharts4 files did the trick.
I still do not know how to change the tooltip text and the grouping as described in my question.

How to write CouchDB view to get currently active servers given start timestamp and end timestamp of each server?

I have a set of documents which have the server name, with the start timestamp and end timestamp of that server, e.g.:
[
  {
    serverName: "Houston",
    startTimestamp: "2018/03/07 17:52:13 +000",
    endTimestamp: "2018/03/07 18:50:10 +000"
  },
  {
    serverName: "Canberra",
    startTimestamp: "2018/03/07 18:48:09 +000",
    endTimestamp: "2018/03/07 20:10:00 +000"
  },
  {
    serverName: "Melbourne",
    startTimestamp: "2018/03/08 01:43:13 +000",
    endTimestamp: "2018/03/08 12:09:10 +000"
  }
]
With this data, given a timestamp, I need to get the list of servers active at that point in time.
For example, for TS="2018/03/07 18:50:00 +000", the list of active servers from the above data is ["Houston", "Canberra"].
Is it possible to achieve this using only CouchDB views? If so, how to go about it?
Note: Initially I tried the following approach. In the map function I emit two rows per document:
one with key=doc.startTimestamp and value={"station_add": doc.station}
one with key=doc.endTimestamp and value={"station_rem": doc.station}
My intention was to iterate through these in the reduce function, adding stations present in "station_add" and removing stations in "station_rem". But I found that CouchDB does not guarantee anything about the ordering of values passed to the reduce function.
If you can live with fixed periods and don't mind the extra disk space that might be needed for the view results, you can create a view of active servers per hour, for example.
Iterate over the periods between start and end and emit the time that each server was online during this period:
function(doc) {
  var start = new Date(doc.startTimestamp).getTime()
  var end = new Date(doc.endTimestamp).getTime()
  var msPerPeriod = 60*60*1000

  var msOfflineInFirstPeriod = start % msPerPeriod
  var firstPeriod = start - msOfflineInFirstPeriod
  var msOnlineInLastPeriod = end % msPerPeriod
  var lastPeriod = end - msOnlineInLastPeriod

  if (firstPeriod === lastPeriod) {
    // The server was only online within one period.
    emit([new Date(firstPeriod), doc.serverName], [1, msOnlineInLastPeriod - msOfflineInFirstPeriod])
  } else {
    // The server was online over multiple periods.
    emit([new Date(firstPeriod), doc.serverName], [1, msPerPeriod - msOfflineInFirstPeriod])
    for (var period = firstPeriod + msPerPeriod; period < lastPeriod; period += msPerPeriod) {
      emit([new Date(period), doc.serverName], [1, msPerPeriod])
    }
    emit([new Date(lastPeriod), doc.serverName], [1, msOnlineInLastPeriod])
  }
}
If you want the total without the server names, just add a reduce function with the built-in shortcut _sum. You'll get the number of servers online during the period as the first number and the milliseconds that the servers were online in that period as the second number.
You can play with the view if you emit the year, month and day as the first keys. Then you can use the group_level at query time to get a finer or more coarse overview.
Bear in mind that this view might get large on disk, as each row has to be stored, and also the intermediate results for each group level are stored. So you shouldn't set the period duration too small – emitting a row for each second would take a lot of disk space, for example.
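To sanity-check what the map function emits, the period-splitting logic can be transcribed to plain Python (timestamps as epoch milliseconds rather than Date objects; `emits_for` is just a stand-in that collects the rows instead of emitting them):

```python
def emits_for(start_ms, end_ms, ms_per_period=60 * 60 * 1000):
    """Replicates the map function: one (period, [1, online_ms]) row per bucket."""
    rows = []
    ms_offline_first = start_ms % ms_per_period
    first_period = start_ms - ms_offline_first
    ms_online_last = end_ms % ms_per_period
    last_period = end_ms - ms_online_last
    if first_period == last_period:
        # Online within a single period only.
        rows.append((first_period, [1, ms_online_last - ms_offline_first]))
    else:
        # Online over multiple periods.
        rows.append((first_period, [1, ms_per_period - ms_offline_first]))
        period = first_period + ms_per_period
        while period < last_period:
            rows.append((period, [1, ms_per_period]))
            period += ms_per_period
        rows.append((last_period, [1, ms_online_last]))
    return rows

# A server online from 17:52:13 to 18:50:10 (as in the Houston example)
# spans two hourly periods.
hour = 60 * 60 * 1000
start = 17 * hour + 52 * 60 * 1000 + 13 * 1000
end = 18 * hour + 50 * 60 * 1000 + 10 * 1000
rows = emits_for(start, end)
print(len(rows))  # 2 buckets: the 17:00 period and the 18:00 period
```

The first row carries the 7 minutes 47 seconds of online time left in the 17:00 bucket; the second carries the 50 minutes 10 seconds in the 18:00 bucket.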

Get objects created in last 30 days, for each past day

I am looking for a fast method to count a model's objects created within the past 30 days, for each day separately. For example:
27.07.2013 (today) - 3 objects created
26.07.2013 - 0 objects created
25.07.2013 - 2 objects created
...
27.06.2013 - 1 objects created
I am going to use this data in google charts API. Have you any idea how to get this data efficiently?
items = Foo.objects.filter(createdate__lte=datetime.datetime.today(), createdate__gt=datetime.datetime.today()-datetime.timedelta(days=30)).\
values('createdate').annotate(count=Count('id'))
This will (1) filter results to the last 30 days, (2) select just the createdate field, and (3) count the ids grouped by all selected fields (i.e. createdate). It will return a list of dictionaries in the format:
[
{'createdate': <datetime.date object>, 'count': <int>},
{'createdate': <datetime.date object>, 'count': <int>},
...
]
EDIT:
I don't believe there's a way to get all dates, even those with count == 0, with just SQL. You'll have to insert each missing date through python code, e.g.:
import datetime

# needed to use .append() later on
items = list(items)

dates = [x.get('createdate') for x in items]
# use date objects (not datetimes) so they compare equal to createdate
for d in (datetime.date.today() - datetime.timedelta(days=x) for x in range(0, 30)):
    if d not in dates:
        items.append({'createdate': d, 'count': 0})
I think this is a somewhat more optimized version of @knbk's solution, using Python sets. It has fewer iterations, and set operations are highly optimized in Python (both in processing and in CPU cycles).
from_date = datetime.date.today() - datetime.timedelta(days=7)
orders = Order.objects.filter(created_at__gte=from_date, dealer__executive__branch__user=user)
orders = orders.annotate(count=Count('id')).values('created_at').order_by('created_at')
if len(orders) < 7:
    orders_list = list(orders)
    # the 7 target dates; the zero-count days are those missing from the queryset
    dates = set(datetime.date.today() - datetime.timedelta(days=i) for i in range(7))
    order_set = set(o['created_at'] for o in orders)
    for dt in (dates - order_set):
        orders_list.append({'created_at': dt, 'count': 0})
    orders_list = sorted(orders_list, key=lambda item: item['created_at'])
else:
    orders_list = orders
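The set arithmetic this relies on can be checked in isolation, with plain dicts standing in for queryset rows (the dates and counts below are invented for the example):

```python
import datetime

today = datetime.date(2013, 7, 27)
# Rows as the queryset would return them: only days with at least one order.
orders = [
    {"created_at": today, "count": 3},
    {"created_at": today - datetime.timedelta(days=2), "count": 2},
]

# All seven target dates, minus those already present, are the zero-count days.
dates = {today - datetime.timedelta(days=i) for i in range(7)}
order_set = {o["created_at"] for o in orders}

filled = list(orders)
for dt in dates - order_set:
    filled.append({"created_at": dt, "count": 0})
filled.sort(key=lambda item: item["created_at"])

print(len(filled))           # 7 rows, one per day
print(filled[-1]["count"])   # 3 (today's orders come last after sorting)
```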