CouchDB: How to use array keys in Map functions when using Reduce? - mapreduce

I would like to write a MapReduce view in CouchDB where the map function emits keys as arrays, but the reduce function uses only one of the values in the map key. For example:
The Map function:
function (doc) {
  if (doc.type_ === 'survey') {
    emit([doc.timeRecorded_, doc.imei_], 1);
  }
}
The Reduce function:
function (k, v) {
  // How can I reduce on only the doc.imei_ part of the key?
  // Or, alternatively, how can I filter on timeRecorded_ somewhere other than the map function?
  return sum(v);
}
timeRecorded_ is an epoch timestamp, so there will be almost no duplicates (only by chance). If I were to aggregate on it, it would need to be rounded to a day value. Alternatively, the data could be prepared so that timeRecorded_ is already rounded in the source data (perhaps renamed to dateRecorded_).

A well-known pattern for this problem is to split the date into an array (e.g. [year, month, day, hour, minute]; the granularity can differ, but the order must be preserved) and use that array as the key in the map function.
You can then reduce rows according to the group_level you need (e.g. by year, by month, by day, by hour, by minute, etc.).
Source: http://blog.couchbase.com/understanding-grouplevel-view-queries-compound-keys
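For example, here is a minimal sketch of that pattern against the document above (assuming timeRecorded_ is an epoch value in seconds, that the goal is a count per imei_, and made-up design document/view names):
// Map: put imei_ first, then the date parts you may want to group on.
function (doc) {
  if (doc.type_ === 'survey') {
    var d = new Date(doc.timeRecorded_ * 1000); // assumes epoch seconds
    emit([doc.imei_, d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate()], 1);
  }
}
// Reduce: the built-in _sum
Querying with ?group_level=1 (e.g. GET /db/_design/stats/_view/by_imei_day?group_level=1) then returns one total per imei_, while ?group_level=4 returns one total per imei_ per day. If you instead need to restrict to a date range with startkey/endkey, put the date parts before imei_ in the key.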

Related

Need to change a column (multiple data types) to be able to subtract it from a column (type number)

My data has multiple data types in it (text, true/false, and number), which I need to subtract from a column that is just numbers. I cannot just sort out the other data types because it keeps taking the entire column.
I keep getting an error that says "DAX comparison operations do not support comparing values of type Text with values of type True/False. Consider using the VALUE or FORMAT function to convert on one of the values."
I've tried this "NewValue = ([Value] = IsNumber(True), [Value], 0.0) as well as trying to use Format and Value inside an If statement. Nothing seems to work.
Any recommendations?

How can I get the row view of data read from parquet file?

Example: let's say a table named user has id, name, email, phone, and is_active as attributes, and there are thousands of users in this table. I would like to read the details per user.
void ParquetReaderPlus::read_next_row(long row_group_index, long local_row_num)
{
  std::vector<int> columns_to_tabulate(this->total_row);
  for (int idx = 0; idx < this->total_row; idx++)
    columns_to_tabulate[idx] = idx;
  this->file_reader->set_num_threads(4);
  int rg = this->total_row_group;
  // Read into a table as a row group rather than the whole Parquet file.
  std::shared_ptr<arrow::Table> table;
  this->file_reader->ReadRowGroup(row_group_index, columns_to_tabulate, &table);
  auto rows = table->num_rows();
  // TODO
  // Now I am confused how to proceed from here
}
Any suggestions?
I am not sure whether converting with ColumnarTableToVector would work.
It's difficult to answer this question without knowing what you plan on doing with those details. A Table has a list of columns and each column (in Arrow-C++) has a type-agnostic array of data. Since the columns are type-agnostic there is not much you can do with them other than get the count and access the underlying bytes.
If you want to interact with the values then you will either need to know the type of a column ahead of time (and cast), have a series of different actions for each different type of data you might encounter (switch case plus cast), or interact with the values as buffers of bytes. One could probably write a complete answer for all three of those options.
You might want to read up a bit on the Arrow compute API (https://arrow.apache.org/docs/cpp/compute.html although the documentation is a bit sparse for C++). This API allows you to perform some common operations on your data (somewhat) regardless of type. For example, I see the word "tabulate" in your code snippet. If you wanted to sum up the values in a column then you could use the "sum" function in the compute API. This function follows the "have a series of different actions for each different type of data you might encounter" advice above and will allow you to sum up any numeric column.
As far as I know what you are trying to do isn't easy. You'd have to:
iterate through each row
iterate through each column
figure out the type of the column
cast the arrow::Array of the column to the underlying type (e.g. arrow::StringArray)
get the value for that column, convert it to string and append it to your output
This is further complicated by:
the fact that the rows are grouped into chunks (so iterating through rows isn't as simple as it sounds)
the fact that you also need to deal with list and struct types.
It's not impossible, but it's a lot of code (you'd only have to write it once, though).
Another option is to write that table to CSV in memory and print it:
#include <iostream>

#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>

arrow::Status dumpTable(const std::shared_ptr<arrow::Table>& table) {
  auto outputResult = arrow::io::BufferOutputStream::Create();
  ARROW_RETURN_NOT_OK(outputResult.status());
  std::shared_ptr<arrow::io::BufferOutputStream> output = outputResult.ValueOrDie();
  ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(*table, arrow::csv::WriteOptions::Defaults(), output.get()));
  auto finishResult = output->Finish();
  ARROW_RETURN_NOT_OK(finishResult.status());
  std::cout << finishResult.ValueOrDie()->ToString();
  return arrow::Status::OK();
}

spotfire plot list of elements

I have a data table that has this format:
and I want to plot temperature against time. Any idea how to do that?
This can be done in a TERR data function. I don't know how comfortable you are with integrating Spotfire and TERR; there is an intro video here, for instance (the demo starts at about minute 7):
https://www.youtube.com/watch?v=ZtVltmmKWQs
With that in mind, I wrote the script without loading any library, so it is quite verbose and explicit, but hopefully simpler to follow step by step. I am sure there is a more elegant way, and there are better ways of making it flexible with column names, but this is a start.
Your input will be a data table (dt, the original data) and the output a new data table (dt.out, the transformed data). All column names (and some values) are addressed explicitly in the script (so if you change them it won't work).
#remove the []
dt$Values=gsub('\\[|\\]','',dt$Values)
#separate into two different data frames, one for time and one for temperature
dt.time=dt[dt$Description=='time',]
dt.temperature=dt[dt$Description=='temperature',]
#split the columns we want to separate into a list of vectors
dt2.time=strsplit(as.character(dt.time$Values),',')
dt2.temperature=strsplit(as.character(dt.temperature$Values),',')
#rearrange times
names(dt2.time)=dt.time$object
dt2.time=stack(dt2.time) #stack vectors
dt2.time$id=c(1:nrow(dt2.time)) #assign running id for merging later
colnames(dt2.time)[colnames(dt2.time)=='values']='time'
#rearrange temperatures
names(dt2.temperature)=dt.temperature$object
dt2.temperature=stack(dt2.temperature) #stack vectors
dt2.temperature$id=c(1:nrow(dt2.temperature)) #assign running id for merging later
colnames(dt2.temperature)[colnames(dt2.temperature)=='values']='temperature'
#merge time and temperature
dt.out=merge(dt2.time,dt2.temperature,by=c('id','ind'))
colnames(dt.out)[colnames(dt.out)=='ind']='object'
dt.out$time=as.numeric(dt.out$time)
dt.out$temperature=as.numeric(dt.out$temperature)
Because all of the example rows you've shown here contain exactly four list items and you haven't specified otherwise, I'll assume that all of the data fits this format.
With this assumption, it becomes pretty trivial, albeit a little messy, to split the values out into columns using the RXReplace() expression function.
You can create four calculated columns, each with an expression like:
Int(RXReplace([values],"\\[([\\d\\-]+),([\\d\\-]+),([\\d\\-]+),([\\d\\-]+)]","\\1",""))
The third argument, "\\1", determines which number in the list to extract. Backslashes are doubled ("escaped") per the requirements of the RXReplace() function.
Note that this example assumes the numbers are all whole numbers. If you have decimals, you'd need to adjust each "phrase" of the regular expression to ([\\d\\-\\.]+), and you'd need to wrap the expression in Real() rather than Int() (if you leave this part out, the result will be a String type, which could cause confusion later on when working with the data).
Once you have the four columns, you'll be able to unpivot to get the data easily.
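If it helps to see the same idea outside of Spotfire's expression language, here is a rough TypeScript rendering of what those capture groups do (the sample Values string is made up; RXReplace just needs the backslashes doubled):
const values = "[12,5,-3,40]";                         // a made-up sample Values cell
const pattern = /\[(-?\d+),(-?\d+),(-?\d+),(-?\d+)\]/; // four numbered capture groups
const first = Number(values.replace(pattern, "$1"));   // 12  ("$1" keeps group 1)
const third = Number(values.replace(pattern, "$3"));   // -3  ("$3" keeps group 3)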

How to conditionally execute a SET operation in DynamoDB

I have an aggregations table in DynamoDb with the following columns: id, sum, count, max, min, and hash. I will ALWAYS want to update sum and count but will want to update min and max only when I have values greater than/lesser than the values already in the database. Also, I only want this operation to succeed when the stored hash is different from what I am sending, to prevent reprocessing the same data.
I currently have these:
UpdateExpression: ADD sum :sum ADD count :count SET hash :hash
UpdateCondition: attribute_not_exists(hash) OR hash <> :hash
The thing is that I need something like this for min and max:
SET min :min IF :min < min, and something similar for max. Of course, this doesn't currently work. I could not find a suitable update function that would perform this comparison in DynamoDB. What is the proper way to achieve this?
PS.: It was already suggested that I make multiple requests to DynamoDB and put the max/min in update conditions, but I want to avoid the multiple-requests approach for data consistency reasons.
PS2.: Another way to express what I want, in a JavaScript-ish way, would be something like SET :min < min ? :min : min
I arrived at a solution to this problem by realizing that what I wanted was just not possible. There can be only one condition for the entire update, and since there is no such thing as SET min = minimum(:min, min), I had to accept my fate and make more than one UpdateItem request to DynamoDB.
The nice thing is that the order of execution of these updates doesn't matter. The hard thing is to make sure that each update is executed exactly once. Because we are firing a lot of requests (with occasional peaks), there is a real chance of some updates failing due to ProvisionedThroughputExceededException or just some rate limiting from AWS.
So here is my final solution:
Lambda function receives payload with hundreds of data points.
Lambda function aggregates these data points in memory and produces an intermediary aggregation object of the form {id, sum, count, min, max}.
Lambda function generates 3 update objects per aggregation object, of the forms (these updates are referring to the same record):
{UpdateExpression: 'ADD #SUM :sum, #COUNT :count'}
{ConditionExpression: '#MAX < :max OR attribute_not_exists(#MAX)', UpdateExpression: 'SET #MAX = :max'}
{ConditionExpression: '#MIN > :min OR attribute_not_exists(#MIN)', UpdateExpression: 'SET #MIN = :min'}
Because we need to be 100% sure that these updates will always be processed successfully, the lambda function sends them to a FIFO SQS queue (as 3 separate messages). I am using a FIFO queue not because I want the order to be preserved, but because I want the guarantee of exactly-once delivery.
A consumer keeps polling the queue, and whenever there are messages it just sends them to DynamoDB as the parameters of updateItem.
At the end of this process, I was able to do real-time aggregations for thousands of records :)
PS.: I got rid of the hash column.
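For illustration, here is a rough sketch of what those three update objects could look like when sent with the AWS SDK for JavaScript v3 (the table name, key shape, and aggregate values are placeholders, not the poster's actual code; a rejected ConditionExpression simply means there is no new min/max and can be ignored):
import { DynamoDBClient, UpdateItemCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

async function applyAggregate(id: string, agg: { sum: number; count: number; min: number; max: number }) {
  const key = { id: { S: id } };
  const updates = [
    { // always add to the running sum and count
      TableName: "Aggregations", Key: key,
      UpdateExpression: "ADD #SUM :sum, #COUNT :count",
      ExpressionAttributeNames: { "#SUM": "sum", "#COUNT": "count" },
      ExpressionAttributeValues: { ":sum": { N: String(agg.sum) }, ":count": { N: String(agg.count) } },
    },
    { // only set max when the new value is larger (or max doesn't exist yet)
      TableName: "Aggregations", Key: key,
      UpdateExpression: "SET #MAX = :max",
      ConditionExpression: "#MAX < :max OR attribute_not_exists(#MAX)",
      ExpressionAttributeNames: { "#MAX": "max" },
      ExpressionAttributeValues: { ":max": { N: String(agg.max) } },
    },
    { // only set min when the new value is smaller (or min doesn't exist yet)
      TableName: "Aggregations", Key: key,
      UpdateExpression: "SET #MIN = :min",
      ConditionExpression: "#MIN > :min OR attribute_not_exists(#MIN)",
      ExpressionAttributeNames: { "#MIN": "min" },
      ExpressionAttributeValues: { ":min": { N: String(agg.min) } },
    },
  ];
  for (const params of updates) {
    try {
      await client.send(new UpdateItemCommand(params));
    } catch (err: any) {
      if (err.name !== "ConditionalCheckFailedException") throw err; // a failed condition just means no new min/max
    }
  }
}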
It is not possible to do this in a single update since UpdateExpression doesn't support functions like max() and min(). The documentation for supported operations and functions can be found here
The best way to achieve the same effect is to add a field called latest or something similar which stores the latest value. You will need to change your update expression to be something like the following.
UpdateExpression: SET hash = :hash, latest = :latest, sum = sum + :latest, count = count + :num
Where :hash is of course your update hash to guard against replays, :latest is the latest value, and :num is 1 or whatever your increment is.
Then you can use DynamoDB Streams with a Lambda that looks at each update and checks if latest is less than min or greater than max. If not, ignore the update, otherwise perform a second update to set min or max to the latest value accordingly.
The main drawback to this approach is that there will be a small window where latest might be outside the range of min or max; however, this can be normalized easily in your application code when you read the records.
You should also consider the additional cost that will result from the DynamoDB Stream and the Lambda invocations.
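A very rough sketch of such a stream handler in TypeScript (assumptions of mine: the stream is configured with NEW_IMAGE, the key attribute is a string id, and the table/attribute names are placeholders):
import type { DynamoDBStreamEvent } from "aws-lambda";
import { DynamoDBClient, UpdateItemCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

export const handler = async (event: DynamoDBStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    const img = record.dynamodb?.NewImage;
    const id = img?.id?.S;
    const latestStr = img?.latest?.N;
    if (!img || !id || !latestStr) continue;
    const latest = Number(latestStr);
    const min = Number(img.min?.N ?? NaN); // NaN when min has never been set
    const max = Number(img.max?.N ?? NaN);
    const sets: string[] = [];
    const names: Record<string, string> = {};
    if (!(latest >= min)) { sets.push("#MIN = :latest"); names["#MIN"] = "min"; } // also true when min is NaN
    if (!(latest <= max)) { sets.push("#MAX = :latest"); names["#MAX"] = "max"; }
    if (sets.length === 0) continue; // latest is already inside [min, max]
    await client.send(new UpdateItemCommand({
      TableName: "Aggregations",
      Key: { id: { S: id } },
      UpdateExpression: "SET " + sets.join(", "),
      ExpressionAttributeNames: names,
      ExpressionAttributeValues: { ":latest": { N: latestStr } },
    }));
  }
};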
I had a similar situation where I needed to atomically update a min value, and ended up doing this:
Let each item have an attribute of type Set (NS) that keeps the candidate values for the min value, and when you want to set a new value that might be the new min, just add it to the set. Then at read time, find the lowest number in the set on the client side.
This is atomic and requires no condition expression, but it has the downside that the set grows over time, so I added a cleanup request to run as needed, for example when the set has more than N values, or simply on every get. The cleanup might need to use a condition expression to be concurrency-safe, though, depending on whether you also remove values through other use cases.
This does not solve all scenarios, but worked for me. In my case the value was a timestamp of an event in the future, and I wanted to store when the next event occurs. I could then easily also clean up by removing all values in the past.
Summary:
Set a new potential minimum value: ADD #values :value.
Read the minimum value: GetItem followed by finding the lowest value in values client-side. This could, if needed, be combined with a cleanup that finds all obsolete values and then calls UpdateItem DELETE #values [x, y, z...]
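A rough TypeScript sketch of those two operations (the table and attribute names are made up; values is aliased via ExpressionAttributeNames since it can collide with a reserved word):
import { DynamoDBClient, GetItemCommand, UpdateItemCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// Add a candidate value to the number set (ADD creates the set if it doesn't exist yet).
async function addCandidate(id: string, value: number): Promise<void> {
  await client.send(new UpdateItemCommand({
    TableName: "Events",
    Key: { id: { S: id } },
    UpdateExpression: "ADD #values :value",
    ExpressionAttributeNames: { "#values": "values" },
    ExpressionAttributeValues: { ":value": { NS: [String(value)] } },
  }));
}

// Read the item and compute the minimum client-side.
async function readMin(id: string): Promise<number | undefined> {
  const res = await client.send(new GetItemCommand({ TableName: "Events", Key: { id: { S: id } } }));
  const ns = res.Item?.values?.NS;
  return ns && ns.length ? Math.min(...ns.map(Number)) : undefined;
}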

How to sort sizes like 5/16, 1-1/4, 1-1/2

I'm trying to think of a way to sort a list of sizes, for example 5/16, 1/4, 7/8, 1, 1-1/8, 1-1/2, 10mm, 12mm, etc.
The list is a varchar column in SQL Server 2008.
I'm thinking regular expressions might be a viable option, just wondering if a good way to do this already exists.
Thanks
You could have another column with the numeric measure in the same units, and then sort it. But if you sort it as varchar, you'd be sorting it alphanumerically.
Is 5/16 in inches? It doesn't say.
I would store another column with normalized data. Convert all values to mm and store it as well. So your db has a "display column" which is 13mm or 1/2", but both records have an "in mm" column with a value of 13 (or 1/2" would have a value of 12.7 if you aren't rounding).
Then when you sort, you are sorting everything by the same unit of measure. It will be faster (since you're sorting numbers) and you don't need to do conversions on the fly.
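As a small illustration of that normalization done outside SQL Server, here is a TypeScript sketch (the format assumptions are mine: inch fractions written like 1-1/2 or 5/16, and metric sizes suffixed with mm):
function sizeToMm(size: string): number {
  const s = size.trim().toLowerCase();
  if (s.endsWith("mm")) {
    return parseFloat(s);                          // "12mm" -> 12
  }
  // Everything else is treated as inches: "1-1/2" = 1 + 1/2 in, "5/16" = 5/16 in, "1" = 1 in
  const [whole, frac] = s.includes("-") ? s.split("-")
                      : s.includes("/") ? ["0", s]
                      : [s, "0/1"];
  const [num, den] = frac.split("/").map(Number);
  return (Number(whole) + num / den) * 25.4;
}

// Example: ["5/16", "1/4", "7/8", "1", "1-1/8", "1-1/2", "10mm"].sort((a, b) => sizeToMm(a) - sizeToMm(b))
The resulting number is what you'd store in the extra column and sort on.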
Or store the units and value separately, like in this question, and sort on the result of the case statement. But I wouldn't recommend this; it's overly complex, and slower.
How to conditionally convert inches to cm in MySQL (or similar conversions during SELECT)?