Maths Operations on Columns from Different Data Frames

Maths Operations on Columns from Different Data Frames - python-2.7

I have two data frames, imported through Pandas from Fama French and Yahoo. I am trying to compare column values from the two data frames (more specifically, subtract one from the other), but a value error occurs whenever I try doing so. The data frames have different indexing and I don't know how to take this factor into account (I'm quite new to python & pandas).
Here is the code in question:
start, end = dt.datetime.now()-dt.timedelta(days=60*30), dt.datetime.now()
f = data.DataReader('F-F_Research_Data_Factors', 'famafrench', start, end)[0]
s = data.get_data_yahoo('aapl', start, end)
s = s.resample('M', how='last')
s['returns'] = s['Adj Close'].pct_change()
Ideally, I would like to create a series with row values = f['RF'] - s['returns']
Any help would be much appreciated.

Convert f.index
f.index = f.index.to_datetime() + pd.offsets.MonthEnd()
f['RF'] - s['returns']

Ask yourself, how you could possibly define a difference between two matrices when they have a different size?
First thing to do, is to match the two dataframes on a commmon value (say the date). Then you will be able to do any operation you want

Related

How can I get the row view of data read from parquet file?

Example: Let's say a table name user has id, name, email, phone, and is_active as attributes. And there are 1000s of users part of this table. I would like to read the details per user.
void ParquetReaderPlus::read_next_row(long row_group_index, long local_row_num)
{
std::vector<int> columns_to_tabulate(this->total_row);
for (int idx = 0; idx < this->total_row; idx++)
columns_to_tabulate[idx] = idx;
this->file_reader->set_num_threads(4);
int rg = this->total_row_group;
// Read into table as row group rather than the whole Parquet file.
std::shared_ptr<arrow::Table> table;
this->file_reader->ReadRowGroup(row_group_index, columns_to_tabulate, &table);
auto rows = table->num_rows();
//TODO
// Now I am confused how to proceed from here
}
Any suggestions?
I am confused if converting the ColumnarTableToVector will work?

It's difficult to answer this question without knowing what you plan on doing with those details. A Table has a list of columns and each column (in Arrow-C++) has a type-agnostic array of data. Since the columns are type-agnostic there is not much you can do with them other than get the count and access the underlying bytes.
If you want to interact with the values then you will either need to know the type of a column ahead of time (and cast), have a series of different actions for each different type of data you might encounter (switch case plus cast), or interact with the values as buffers of bytes. One could probably write a complete answer for all three of those options.
You might want to read up a bit on the Arrow compute API (https://arrow.apache.org/docs/cpp/compute.html although the documentation is a bit sparse for C++). This API allows you to perform some common operations on your data (somewhat) regardless of type. For example, I see the word "tabulate" in your code snippet. If you wanted to sum up the values in a column then you could use the "sum" function in the compute API. This function follows the "have a series of different actions for each different type of data you might encounter" advice above and will allow you to sum up any numeric column.

As far as I know what you are trying to do isn't easy. You'd have to:
iterate through each row
iterate through each column
figure out the type of the column
cast the arrow::Array of the column to the underlying type (eg: arrow::StringArray)
get the value for that column, convert it to string and append it to your output
This is further complciated by:
the fact that the rows are grouped in chunked (so iterating through rows isn't as simple)
you also need to deal with list and struct types.
It's not impossible, it's a lot of code (but you'd only have to write it once).
Another option is to write that table to CSV in memory and print it:
arrow::Status dumpTable(const std::shared_ptr<arrow::Table>& table) {
auto outputResult = arrow::io::BufferOutputStream::Create();
ARROW_RETURN_NOT_OK(outputResult.status());
std::shared_ptr<arrow::io::BufferOutputStream> output = outputResult.ValueOrDie();
ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(*table, arrow::csv::WriteOptions::Defaults(), output.get()));
auto finishResult = output->Finish();
ARROW_RETURN_NOT_OK(finishResult.status());
std::cout << finishResult.ValueOrDie()->ToString();
return arrow::Status::OK();
}

Line chart with a lot of data: skip labels

I'd like to draw a line chart with data for a year. Total number of points will be about 264 pieces.
I'd like to organize labels on x-axis not for every point. Say, in a month I have about 22 points, and I'd like to have 3 labels per month.
Could you help me: what is the most elegant way to do that?

Use an array that stores all values you have (if you want to update your chart, display other values or to reduce AJAX calls). Then use another array with the data you want to display. You need an appropriate function to copy and filter your old array.
Here I use every fifth element:
let delta = 5
let displayedData = []
for (let i = 0; i < allData.length; i=i+delta) {
displayedData.push(allData[i]);
}
You can calculate the delta in a different way or use a completely different approach to get your data.
Side note: don't use .filter(), you don't want to lop through all your data. Use a direct access like I did above.

How to prepare the multilevel multivalued training dataset in python

I am a beginner in machine learning. My academic project involves detecting human posture from acceleration and gyro data. I am stuck at the beginning itself. My accelerometer data has x,y,z values and gyro also has x,y,z values stored in file acc.csv and gyro.csv. I want to classify the 'standing', 'sitting', 'walking' and 'lying' position. The idea is to train the machine using some ML algorithm (supervised) and then throw a new acc + gyro data set to identify what this new dataset predict (what the subject is doing at present). I am facing the following problems--
Constructing a training dataset -- I think my activities will be dependent variable, and acc & gyro axis readings will be independent. So if I like to combine it in single matrix with each element of the matrix again has it's own set of acc and gyro value [Something like main and sub matrix], how can I do that? or is there any alternative idea to do the same?
How can I take the data of multiple activities with multiple readings in a single training matrix,
I mean 10 walking data each with it's own acc(xyz) and gyro (xyz) + 10 standing data each with it's own acc(xyz) and gyro (xyz) + 10 sitting data each with it's own acc(xyz) and gyro (xyz) and so on.
Each data file has different number of records and time stamp, how to bring them into a common platform.
I know I am asking very basic things but these are the confusion part nobody has clearly explained to me. I am feeling like standing in front of a big closed door, inside very interesting things are happening where I cannot participate at this moment with my limited knowledge. My mathematical background is high school level only. Please help.
I have gone through some projects on activity recognition in Github. But they are way too complicated for a beginner like me.
import pandas as pd
import os
import warnings
from sklearn.utils import shuffle
warnings.filterwarnings('ignore')
os.listdir('../input/testtraindata/')
base_train_dir = '../input/testtraindata/Train_Set/'
#Train Data
train_data = pd.DataFrame(columns = ['activity','ax','ay','az','gx','gy','gz'])
train_folders = os.listdir(base_train_dir)
for tf in train_folders:
files = os.listdir(base_train_dir+tf)
for f in files:
df = pd.read_csv(base_train_dir+tf+'/'+f)
train_data = pd.concat([train_data,df],axis = 0)
train_data = shuffle(train_data)
train_data.reset_index(drop = True,inplace = True)
train_data.head()
The Data Set
Problem in Train_set
Surprisingly if I remove the last 'gz' from
train_data = pd.DataFrame(columns =['activity','ax','ay','az','gx','gy','gz'])
Everything is working fine.

You have the data labeled? --> position of x,y,z... = positure?
I have no clue about the values (as I have not seen the dataset, and have no clue about positions, acc or gyro), but Im guessing you should have a dataset within a matrise with x, y, z as categories and a target category ;"positions".
If you need all 6 (3 from one csv and 3 from the other) to define the positions you can make 6 categories + positions.
Something like : x_1, y_1 z_1 , x_2, y_2, and z_2 + position label ("position" category).
You can also make each position an own category with 0/1 as true/false.
"sitting" , "walking" etc... and have 0 and 1 as the values in the columns.
Is the timestamp of any importance towards the position? If it is not a feature of importance I would just drop it. If it is important in some way, you might want to bin them.
Here is a beginners guide from Medium in which you can see a bit how to preprocess your data. It also shows one hot encoding :)
https://medium.com/hugo-ferreiras-blog/dealing-with-categorical-features-in-machine-learning-1bb70f07262d
Also try googling Preprocessing your data, then you will probably find the right recipe

spotfire plot list of elements

I have a data table that has this format :
and I want to plot temperature to time, any idea how to do that ?

This can be done in a TERR data function. I don't know how comfortable you are integrating Spotfire with TERR, there is an intro video here for instance (demo starts from about minute 7):
https://www.youtube.com/watch?v=ZtVltmmKWQs
With that in mind, I wrote the script without loading any library, so it is quite verbose and explicit, but hopefully simpler to follow step by step. I am sure there is a more elegant way, and there are better ways of making it flexible with column names, but this is a start.
Your input will be a data table (dt, the original data) and the output a new data table (dt.out, the transformed data). All column names (and some values) are addressed explicitly in the script (so if you change them it won't work).
#remove the []
dt$Values=gsub('\\[|\\]','',dt$Values)
#separate into two different data frames, one for time and one for temperature
dt.time=dt[dt$Description=='time',]
dt.temperature=dt[dt$Description=='temperature',]
#split the columns we want to separate into a list of vectors
dt2.time=strsplit(as.character(dt.time$Values),',')
dt2.temperature=strsplit(as.character(dt.temperature$Values),',')
#rearrange times
names(dt2.time)=dt.time$object
dt2.time=stack(dt2.time) #stack vectors
dt2.time$id=c(1:nrow(dt2.time)) #assign running id for merging later
colnames(dt2.time)[colnames(dt2.time)=='values']='time'
#rearrange temperatures
names(dt2.temperature)=dt.temperature$object
dt2.temperature=stack(dt2.temperature) #stack vectors
dt2.temperature$id=c(1:nrow(dt2.temperature)) #assign running id for merging later
colnames(dt2.temperature)[colnames(dt2.temperature)=='values']='temperature'
#merge time and temperature
dt.out=merge(dt2.time,dt2.temperature,by=c('id','ind'))
colnames(dt.out)[colnames(dt.out)=='ind']='object'
dt.out$time=as.numeric(dt.out$time)
dt.out$temperature=as.numeric(dt.out$temperature)
Gaia

because all of the example rows you've shown here contain exactly four list items and you haven't specified otherwise, I'll assume that all of the data fits this format.
with this assumption, it becomes pretty trivial, albeit a little messy, to split the values out into columns using the RXReplace() expression function.
you can create four calculated columns, each with an expression like:
Int(RXReplace([values],"\\[([\\d\\-]+),([\\d\\-]+),([\\d\\-]+),([\\d\\-]+)]","\\1",""))
the third argument "\\1" determines which number in the list to extract. backslashes are doubled ("escaped") per the requirements of the RXReplace() function.
note that this example assumes the numbers are all whole numbers. if you have decimals, you'd need to adjust each "phrase" of the regular expression to ([\\d\\-\\.]+), and you'd need to wrap the expression in Real() rather than Int() (if you leave this part out, the result will be a String type which could cause confusion later on when working with the data).
once you have the four columns, you'll be able to unpivot to get the data easily.

What it the best way to compute Difference data and send it over network?

Well straight to the point, I have two arrays say oldArray[SIZE] and newArray[SIZE]. I want to find the difference between the each element of both arrays eg:
oldArray[0]-newArray[0] =
oldArray[1]-newArray[1] =
oldArray[2]-newArray[2] =
:
:
oldArray[SIZE]-newArray[SIZE] =
If the difference is zero no worries but if the diff is >0 store the data along with index. What the best way to store. I want to send this difference data to the client over network. Only ways that I am aware of is using a vector or a dynamic array. I'd really appreciate help with this.
Update: oldArray[] and newArra[] are two image frames of a video sequence which have depth values for each pixel, I want to compute the difference between the two frames and send only the difference over the network and on the other end I will again reconstruct the image frame, data is integer range from 0 to 1024. Hope this helps

I'd go for a std::map<int,std::pair<T,T>> where key is the index in question, and the std::pair contains the old value in first and the new value in second. No entries for equal first and second.
As for your edit a std::map<int,int> where key is the index, and value is the difference might be sufficient to keep your bitmaps synchronized.
How to serialize that properly over the network is a different kettle of fish.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Maths Operations on Columns from Different Data Frames - python-2.7

Convert f.index f.index = f.index.to_datetime() + pd.offsets.MonthEnd() f['RF'] - s['returns']

Ask yourself, how you could possibly define a difference between two matrices when they have a different size? First thing to do, is to match the two dataframes on a commmon value (say the date). Then you will be able to do any operation you want

Related

How can I get the row view of data read from parquet file?

Line chart with a lot of data: skip labels

How to prepare the multilevel multivalued training dataset in python

spotfire plot list of elements

What it the best way to compute Difference data and send it over network?

Categories

Resources