How to map a function to a triple nested list and keep the triple nested list intact?

I have been building an analysis workflow for my PhD and have been using a triple nested list to represent my data structure because I want it to be able to expand to an arbitrary amount of data in its second and third levels. The first level is the whole dataset, the second level is each subject in the dataset and the third level is a row for each measure that each subject has.
[dataset]
|
[subject]
|
[measure1, measure2, measure3]
I am trying to map a function to each measure - for instance convert all the points into floats or replace anomalous values with None - and wish to return the whole dataset according to its nesting but my current code:
for subject in dataset:
    for measure in subject:
        map(float, measure)
...the result is correct and exactly what I want but the problem is that I can't think how to assign the result back to the dataset efficiently or without losing a level of the nest. Ideally, I would like it to change the measure in place but I can't think how to do it.
Could you suggest an efficient and pythonic way of doing that? Is a triple nested list a silly way to organize my data in the program?

Rather than doing it in place, make a new list
dataset = [[[float(value) for value in measure]
            for measure in subject]
           for subject in dataset]

return [[list(map(float, measure)) for measure in subject] for subject in dataset]
You can return a list instead of altering it in place -- this is still remarkably efficient and preserves all the information you want. (aside: In fact, it's often faster than assigning to list indexes [citation needed], which is what others have suggested here!)
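As a minimal sketch of the rebuild-the-list approach, the example below applies a per-value function at the innermost level while keeping all three levels of nesting. The helper names `map_nested` and `clean`, the sample data, and the anomaly limit of 1000 are hypothetical, chosen only for illustration.

```python
def map_nested(func, dataset):
    """Return a new dataset -> subject -> measure structure with func applied to every value."""
    return [[[func(value) for value in measure]
             for measure in subject]
            for subject in dataset]


def clean(value, limit=1000.0):
    """Convert to float; replace values whose magnitude exceeds the (assumed) limit with None."""
    number = float(value)
    return number if abs(number) <= limit else None


# Two subjects, each with its own number of measures (made-up sample data).
dataset = [
    [["1", "2.5"], ["3"]],
    [["4", "99999"]],
]

converted = map_nested(float, dataset)  # every value becomes a float
cleaned = map_nested(clean, dataset)    # anomalous values become None
```

Because a new structure is returned, the original `dataset` is left untouched until you choose to rebind the name.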

A straightforward way to do that in place would be:
for subject in dataset:
    for measure in subject:
        for i, elem in enumerate(measure):
            measure[i] = float(elem)
Alternatively, use the slice operator to update the list in place with the results of map:
for subject in dataset:
    for measure in subject:
        measure[:] = map(float, measure)
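The difference between the in-place version and rebuilding the list shows up when something else holds a reference to one of the inner lists. A small sketch, with made-up data and illustrative variable names:

```python
dataset = [[["1", "2"], ["3"]], [["4", "5", "6"]]]

# Keep a second reference to one of the inner lists.
first_measure = dataset[0][0]

for subject in dataset:
    for measure in subject:
        # Slice assignment replaces the contents of the existing list object.
        measure[:] = map(float, measure)

# The outside reference still points at the same list object,
# so it sees the converted values too.
same_object = first_measure is dataset[0][0]
```

With the list-comprehension approach, `first_measure` would keep pointing at the old list of strings instead.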

This should do the job
for subject in dataset:
    for measure in subject:
        for i, m in enumerate(measure):
            measure[i] = float(m)


Checking if context is total

What might be an alternative, possibly more effective, way of checking whether the context is total? I use this measure as a benchmark:
IsTotal1 = CALCULATE(COUNT(Tab[Store]), ALLSELECTED(Tab)) = COUNT(Tab[Store])
The idea is that it calculates COUNT on a table with filters removed (left side, so we get counts for all dimensions in context) and checks it against the COUNT with current context. If both are the same, we have total.
I know that using the function HASONEVALUE might be tempting:
IsTotal2 = NOT(HASONEVALUE(Tab[Store]))
However, using this approach has a serious drawback. If we make a table displaying sales by store and by product then the first measure will work and the second will fail. Moreover, if we display sales by product only the first measure will still work, and the second should be retyped to HASONEVALUE(Tab[Product]).
So I want the measure to be resistant to any change of granularity caused by adding a new dimension to the table visual.
Based on the information you provided in the comments, it sounds like you have a page- or report level filter. In that case, you can't rely on functions such as ISFILTERED(...) or ISCROSSFILTERED(...), as these external filters or slicers could impact the result returned from these two functions.
So you have to either stick with your approach (perhaps changing COUNT(...) to COUNTROWS(Tab) could improve the performance slightly), or write something like
ISINSCOPE('Tab'[Store]) || ISINSCOPE('Tab'[Product]) || etc...
where you repeat ISINSCOPE for every column that could potentially be used to slice the data, as ISINSCOPE is the only function that distinguishes using a column on a filter/slicer vs. using it as a row/column grouping on a table/matrix visual.

spotfire plot list of elements

I have a data table that has this format :
and I want to plot temperature to time, any idea how to do that ?
This can be done in a TERR data function. I don't know how comfortable you are integrating Spotfire with TERR, there is an intro video here for instance (demo starts from about minute 7):
https://www.youtube.com/watch?v=ZtVltmmKWQs
With that in mind, I wrote the script without loading any library, so it is quite verbose and explicit, but hopefully simpler to follow step by step. I am sure there is a more elegant way, and there are better ways of making it flexible with column names, but this is a start.
Your input will be a data table (dt, the original data) and the output a new data table (dt.out, the transformed data). All column names (and some values) are addressed explicitly in the script (so if you change them it won't work).
#remove the []
dt$Values=gsub('\\[|\\]','',dt$Values)
#separate into two different data frames, one for time and one for temperature
dt.time=dt[dt$Description=='time',]
dt.temperature=dt[dt$Description=='temperature',]
#split the columns we want to separate into a list of vectors
dt2.time=strsplit(as.character(dt.time$Values),',')
dt2.temperature=strsplit(as.character(dt.temperature$Values),',')
#rearrange times
names(dt2.time)=dt.time$object
dt2.time=stack(dt2.time) #stack vectors
dt2.time$id=c(1:nrow(dt2.time)) #assign running id for merging later
colnames(dt2.time)[colnames(dt2.time)=='values']='time'
#rearrange temperatures
names(dt2.temperature)=dt.temperature$object
dt2.temperature=stack(dt2.temperature) #stack vectors
dt2.temperature$id=c(1:nrow(dt2.temperature)) #assign running id for merging later
colnames(dt2.temperature)[colnames(dt2.temperature)=='values']='temperature'
#merge time and temperature
dt.out=merge(dt2.time,dt2.temperature,by=c('id','ind'))
colnames(dt.out)[colnames(dt.out)=='ind']='object'
dt.out$time=as.numeric(dt.out$time)
dt.out$temperature=as.numeric(dt.out$temperature)
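For readers who want to follow the reshaping logic outside TERR, here is a rough plain-Python sketch of the same steps. The input layout (columns `object`, `Description` and `Values`, with `Values` holding a bracketed, comma-separated string) is inferred from the script above, and the sample rows are invented.

```python
rows = [
    {"object": "A", "Description": "time", "Values": "[0,1,2]"},
    {"object": "A", "Description": "temperature", "Values": "[20.5,21,22]"},
    {"object": "B", "Description": "time", "Values": "[0,5]"},
    {"object": "B", "Description": "temperature", "Values": "[18,19.5]"},
]


def parse_values(text):
    """Strip the surrounding brackets and split the string into floats."""
    return [float(item) for item in text.strip("[]").split(",")]


# Collect the time and temperature lists for each object.
series = {}
for row in rows:
    series.setdefault(row["object"], {})[row["Description"]] = parse_values(row["Values"])

# Pair the i-th time with the i-th temperature: one output row per pair.
long_rows = [
    {"object": obj, "time": t, "temperature": temp}
    for obj, parts in series.items()
    for t, temp in zip(parts["time"], parts["temperature"])
]
```

The result has one row per (object, time, temperature) reading, which is the shape a scatter or line plot needs.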
Gaia
because all of the example rows you've shown here contain exactly four list items and you haven't specified otherwise, I'll assume that all of the data fits this format.
with this assumption, it becomes pretty trivial, albeit a little messy, to split the values out into columns using the RXReplace() expression function.
you can create four calculated columns, each with an expression like:
Int(RXReplace([values],"\\[([\\d\\-]+),([\\d\\-]+),([\\d\\-]+),([\\d\\-]+)]","\\1",""))
the third argument "\\1" determines which number in the list to extract. backslashes are doubled ("escaped") per the requirements of the RXReplace() function.
note that this example assumes the numbers are all whole numbers. if you have decimals, you'd need to adjust each "phrase" of the regular expression to ([\\d\\-\\.]+), and you'd need to wrap the expression in Real() rather than Int() (if you leave this part out, the result will be a String type which could cause confusion later on when working with the data).
once you have the four columns, you'll be able to unpivot to get the data easily.
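To see what the regular expression does, the same pattern can be tried in Python's `re` module. This only illustrates the pattern and is not Spotfire expression syntax; the backslashes are not doubled here because a raw string is used, and the sample value is made up.

```python
import re

# Same pattern as in the RXReplace() call: four capture groups inside square brackets.
PATTERN = r"\[([\d\-]+),([\d\-]+),([\d\-]+),([\d\-]+)]"

sample = "[10,-3,25,7]"

# Replacing the whole match with group k leaves only the k-th number,
# which is what each of the four calculated columns does.
columns = [int(re.sub(PATTERN, "\\" + str(k), sample)) for k in range(1, 5)]
```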

Proper Python data structure for real-time analysis?

Community,
Objective: I'm running a Pi project (i.e. Python) that communicates with an Arduino to get data from a load cell once a second. What data structure should I use to log (and do real-time analysis) on this data in Python?
I want to be able to do things like:
Slice the data to get the value of the last logged datapoint.
Slice the data to get the mean of the datapoints for the last n seconds.
Perform a regression on the last n data points to get g/s.
Remove from the log data points older than n seconds.
Current Attempts:
Dictionaries: I have appended a new key with a rounded time to a dictionary (see below), but this makes slicing and analysis hard.
import time

log = {}
def log_data():
    log[round(time.time(), 4)] = read_data()
Pandas DataFrame: this was the one I was hoping for, because it makes time-series slicing and analysis easy, but this (How to handle incoming real time data with python pandas) seems to say it's a bad idea. I can't follow their solution (i.e. storing in a dictionary, and df.append()-ing in bulk every few seconds) because I want my rate calculations (regressions) to be in real time.
This question (ECG Data Analysis on a real-time signal in Python) seems to have the same problem as I did, but with no real solutions.
Goal:
So what is the proper way to handle and analyze real-time time-series data in Python? It seems like something everyone would need to do, so I imagine there has to be pre-built functionality for this?
Thanks,
Michael
To start, I would question two assumptions:
You mention in your post that the data comes in once per second. If you can rely on that, you don't need the timestamps at all -- finding the last N data points is exactly the same as finding the data points from the last N seconds.
You have a constraint that your summary data needs to be absolutely 100% real time. That may make life more complicated -- is it possible to relax that at all?
Anyway, here's a very naive approach using a list. It satisfies your needs. Performance may become a problem depending on how many of the previous data points you need to store.
Also, you may not have thought of this, but do you need the full record of past data? Or can you just drop stuff?
from statistics import mean

data = []
# timestamp, value, current_time, n and regression_function are placeholders
new_observation = (timestamp, value)
# new data comes in
data.append(new_observation)
# Slice the data to get the value of the last logged datapoint.
data[-1]
# Slice the data to get the mean of the datapoints for the last n seconds.
mean(v for t, v in data if current_time - t < n)
# Perform a regression on the last n data points to get g/s.
regression_function(data[-n:])
# Remove from the log data points older than n seconds.
# (A list comprehension keeps data a list, so append() still works afterwards.)
data = [o for o in data if current_time - o[0] < n]
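If the list version gets slow, one possible variation is a `collections.deque` that drops stale readings as new ones arrive. The sketch below is only an illustration under stated assumptions: the `RollingLog` class and its method names are made up, the readings are simulated rather than read from the Arduino, and the regression is a plain least-squares slope.

```python
from collections import deque
from statistics import mean


class RollingLog:
    """Keep (timestamp, value) readings no older than max_age seconds."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.data = deque()

    def add(self, timestamp, value):
        self.data.append((timestamp, value))
        # Readings arrive in time order, so stale ones are always on the left.
        while self.data and timestamp - self.data[0][0] > self.max_age:
            self.data.popleft()

    def last(self):
        """Value of the most recent reading."""
        return self.data[-1][1]

    def mean_since(self, now, seconds):
        """Mean of the values logged less than `seconds` before `now`."""
        return mean(v for t, v in self.data if now - t < seconds)

    def slope(self, n):
        """Least-squares slope (value units per second) over the last n readings."""
        points = list(self.data)[-n:]
        t_bar = mean(t for t, _ in points)
        v_bar = mean(v for _, v in points)
        num = sum((t - t_bar) * (v - v_bar) for t, v in points)
        den = sum((t - t_bar) ** 2 for t, _ in points)
        return num / den


log = RollingLog(max_age=60)
for second in range(100):
    # Simulated load-cell reading that grows by 2 g every second.
    log.add(timestamp=float(second), value=2.0 * second)
```

Appending and dropping from the left end of a deque are both cheap, so the cost per new reading stays small as long as the retained window is bounded.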

Is it possible to detect and handle string collisions among grouped values when grouping in Hadoop Pig?

Assuming I have lines of data like the following that show user names and their favorite fruits:
Alice\tApple
Bob\tApple
Charlie\tGuava
Alice\tOrange
I'd like to create a pig query that shows the favorite fruit of each user. If a user appears multiple times, then I'd like to show "Multiple". For example, the result with the data above should be:
Alice\tMultiple
Bob\tApple
Charlie\tGuava
In SQL, this could be done something like this (although it wouldn't necessarily perform very well):
select user, case when count(fruit) > 1 then 'Multiple' else max(fruit) end
from FruitPreferences
group by user
But I can't figure out the equivalent PigLatin. Any ideas?
Write an "Aggregate Function" Pig UDF (scroll down to "Aggregate Functions"). This is a user-defined function that takes a bag and outputs a scalar. So basically, your UDF would take in the bag, determine if there is more than one item in it, and transform it accordingly with an if statement.
I can think of a way of doing this without a UDF, but it is definitely awkward. After your GROUP, use SPLIT to split your data set into two: one in which the count is 1 and one in which the count is more than one:
SPLIT grouped INTO one IF COUNT(fruit) == 1, more IF COUNT(fruit) > 1;
Then, separately use FOREACH ... GENERATE on each to transform it:
one = FOREACH one GENERATE name, MAX(fruit); -- hack using MAX to get the item
more = FOREACH more GENERATE name, 'Multiple';
Finally, union them back:
out = UNION one, more;
I haven't really found a better way of handling the same data set in two different ways based on some conditional, like you want. I typically do some sort of split/recombine like I did here. I believe Pig will be smart and make a plan that doesn't use more than 1 M/R job.
Disclaimer: I can't actually test this code at the moment, so it may have some mistakes.
Update:
In looking harder, I was reminded of the bicond operator and I think that will work here.
b = FOREACH a GENERATE name, (COUNT(fruit)==1 ? MAX(fruit) : 'Multiple');
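For checking a Pig script's output on a small sample, the intended result can be expressed in a few lines of Python. This is a reference sketch only, not Pig code. Like the SQL version in the question, it counts rows per user, so a user who lists the same fruit twice would also come out as "Multiple".

```python
from collections import defaultdict

# Sample input lines from the question (tab-separated).
lines = ["Alice\tApple", "Bob\tApple", "Charlie\tGuava", "Alice\tOrange"]

# Group the fruits by user, much like a GROUP ... BY on the user name.
fruits_by_user = defaultdict(list)
for line in lines:
    user, fruit = line.split("\t")
    fruits_by_user[user].append(fruit)

# One fruit: keep it. More than one: report "Multiple".
result = {
    user: fruits[0] if len(fruits) == 1 else "Multiple"
    for user, fruits in fruits_by_user.items()
}
```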

Define a calculated member in MDX by filtering a measure's value

I need to define a calculated member in MDX (this is SAS OLAP, but I'd appreciate answers from people who work with different OLAP implementations anyway).
The new measure's value should be calculated from an existing measure by applying an additional filter condition. I suppose it will be clearer with an example:
Existing measure: "Total traffic"
Existing dimension: "Direction" ("In" or "Out")
I need to create a calculated member "Incoming traffic", which equals "Total traffic" with an additional filter (Direction = "In")
The problem is that I don't know MDX and I'm on a very tight schedule (so sorry for a newbie question). The best I could come up with is:
([Measures].[Total traffic], [Direction].[(All)].[In])
Which almost works, except for cells with specific direction:
So it looks like the "intrinsic" filter on Direction is overridden by my own filter. I need an intersection of the "intrinsic" filter and my own. My gut feeling was that it has to do with intersecting [Direction].[(All)].[In] with the intrinsic coords of the cell being evaluated, but it's hard to know what I need without first reading up on the subject :)
[update] I ended up with
IIF([Direction].currentMember = [Direction].[(All)].[Out],
0,
([Measures].[Total traffic], [Direction].[(All)].[In])
)
..but at least in SAS OLAP this causes extra queries to be performed (to calculate the value for [in]) to the underlying data set, so I didn't use it in the end.
To begin with, you can define a new calculated measure in your MDX, and tell it to use the value of another measure, but with a filter applied:
WITH MEMBER [Measures].[Incoming Traffic] AS
'([Measures].[Total traffic], [Direction].[(All)].[In])'
Whenever you show the new measure on a report, it will behave as if it has a filter of 'Direction > In' on it, regardless of whether the Direction dimension is used at all.
But in your case, you WANT the Direction dimension to take precedence when used, so things get a little messy. You will have to detect if this dimension is in use, and act accordingly:
WITH MEMBER [Measures].[Incoming Traffic] AS
'IIF([Direction].currentMember = [Direction].[(All)].[Out],
([Measures].[Total traffic]),
([Measures].[Total traffic], [Direction].[(All)].[In])
)'
To see if the Dimension is in use, we check if the current cell is using OUT. If so we can return Total Traffic as it is. If not, we can tell it to use IN in our tuple.
I think you should put a column in your Total Traffic fact table for IN/OUT indication & create a Dim table for the IN & Out values. You can then analyse your data based on IN & Out.