I believe I understand when Concatenate needs to be called, on what data, and why. What I'm trying to understand is what physically happens to the input column data when Concatenate is called.
Is this some kind of hash function that hashes all the input data from the columns and generates a result?
In other words, is it technically possible to restore the original values from the value generated by Concatenate?
Does the order of the data columns passed into Concatenate affect the resulting model, and in what way?
Why am I asking all this? I'm trying to understand which input parameters affect the quality of the produced model, and in what way. I have many input columns of data. They are all rather important, and the relations between their values are important too. If Concatenate does something simple and loses the relations between values, I would try one approach to improve the quality of the model; if it is rather complex and keeps the details of the values, I would use other approaches.
In ML.NET, Concatenate takes individual features (of the same type) and creates a feature vector.
In pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. When representing images, the feature values might correspond to the pixels of an image, while when representing texts the features might be the frequencies of occurrence of textual terms. Feature vectors are equivalent to the vectors of explanatory variables used in statistical procedures such as linear regression.
To my understanding there's no hashing involved. Conceptually you can think of it like the String.Join method, where you take individual elements and join them into one. In this case, that single component is a feature vector that as a whole represents the underlying data as an array of type T, where T is the data type of the individual columns.
As a result, you can always access the individual components, and the order of the columns only determines the position of each value in the vector; no information is changed or lost.
Here's an example using F# that takes data, creates a feature vector using the concatenate transform, and accesses the individual components:
#r "nuget:Microsoft.ML"

open Microsoft.ML
open Microsoft.ML.Data

// Raw data
let housingData =
    seq {
        {| NumRooms = 3f; NumBaths = 2f; SqFt = 1200f |}
        {| NumRooms = 2f; NumBaths = 1f; SqFt = 800f |}
        {| NumRooms = 6f; NumBaths = 7f; SqFt = 5000f |}
    }

// Initialize MLContext
let ctx = new MLContext()

// Load data into IDataView
let dataView = ctx.Data.LoadFromEnumerable(housingData)

// Get the individual column names. Anonymous-record fields are ordered
// alphabetically, so the schema order is NumBaths, NumRooms, SqFt.
let columnNames =
    dataView.Schema
    |> Seq.map (fun col -> col.Name)
    |> Array.ofSeq

// Create a pipeline with the concatenate transform
let pipeline = ctx.Transforms.Concatenate("Features", columnNames)

// Fit data to pipeline and apply the transform
let transformedData = pipeline.Fit(dataView).Transform(dataView)

// Get the "Features" column containing the result of the Concatenate transform
let features = transformedData.GetColumn<float32 array>("Features")

// Deconstruct each feature vector (in schema order: baths, rooms, sqft)
// and print out the individual features
printfn "Rooms | Baths | Sqft"
for [| baths; rooms; sqft |] in features do
    printfn $"{rooms} | {baths} | {sqft}"
The result output to the console is:
Rooms | Baths | Sqft
3 | 2 | 1200
2 | 1 | 800
6 | 7 | 5000
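To see why nothing is lost, here's a tiny conceptual sketch in plain Python (not ML.NET; the column names just mirror the housing example): concatenation is only column stacking, so the original columns can always be reconstructed by "unzipping" the feature vectors.

```python
# "Concatenate" as plain column stacking: no hashing, fully reversible.
num_rooms = [3.0, 2.0, 6.0]
num_baths = [2.0, 1.0, 7.0]
sq_ft = [1200.0, 800.0, 5000.0]

# Each row becomes one feature vector: [rooms, baths, sqft]
features = [list(row) for row in zip(num_rooms, num_baths, sq_ft)]

# Restoring the original columns is a simple "unzip"
rooms_back, baths_back, sqft_back = (list(col) for col in zip(*features))
```

Changing the concatenation order only permutes the positions inside each vector; as long as you know that order, every original value is still there.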
If you're looking to understand the impact individual features have on your model, I'd suggest looking at Permutation Feature Importance (PFI) and Feature Contribution Calculation.
Related
Example: Let's say a table named user has id, name, email, phone, and is_active as attributes, and there are thousands of users in this table. I would like to read the details per user.
void ParquetReaderPlus::read_next_row(long row_group_index, long local_row_num)
{
    std::vector<int> columns_to_tabulate(this->total_row);
    for (int idx = 0; idx < this->total_row; idx++)
        columns_to_tabulate[idx] = idx;

    this->file_reader->set_num_threads(4);
    int rg = this->total_row_group;

    // Read into table as a row group rather than the whole Parquet file.
    std::shared_ptr<arrow::Table> table;
    this->file_reader->ReadRowGroup(row_group_index, columns_to_tabulate, &table);
    auto rows = table->num_rows();

    // TODO
    // Now I am confused how to proceed from here
}
Any suggestions?
I am also unsure whether converting via ColumnarTableToVector will work.
It's difficult to answer this question without knowing what you plan on doing with those details. A Table has a list of columns and each column (in Arrow-C++) has a type-agnostic array of data. Since the columns are type-agnostic there is not much you can do with them other than get the count and access the underlying bytes.
If you want to interact with the values then you will either need to know the type of a column ahead of time (and cast), have a series of different actions for each different type of data you might encounter (switch case plus cast), or interact with the values as buffers of bytes. One could probably write a complete answer for all three of those options.
You might want to read up a bit on the Arrow compute API (https://arrow.apache.org/docs/cpp/compute.html although the documentation is a bit sparse for C++). This API allows you to perform some common operations on your data (somewhat) regardless of type. For example, I see the word "tabulate" in your code snippet. If you wanted to sum up the values in a column then you could use the "sum" function in the compute API. This function follows the "have a series of different actions for each different type of data you might encounter" advice above and will allow you to sum up any numeric column.
As far as I know what you are trying to do isn't easy. You'd have to:
iterate through each row
iterate through each column
figure out the type of the column
cast the arrow::Array of the column to the underlying type (eg: arrow::StringArray)
get the value for that column, convert it to string and append it to your output
This is further complicated by:
the fact that the rows are grouped into chunks (so iterating through rows isn't as simple)
the need to also deal with list and struct types.
It's not impossible, it's a lot of code (but you'd only have to write it once).
Another option is to write that table to CSV in memory and print it:
#include <iostream>

#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>

arrow::Status dumpTable(const std::shared_ptr<arrow::Table>& table) {
    auto outputResult = arrow::io::BufferOutputStream::Create();
    ARROW_RETURN_NOT_OK(outputResult.status());
    std::shared_ptr<arrow::io::BufferOutputStream> output = outputResult.ValueOrDie();

    ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(*table, arrow::csv::WriteOptions::Defaults(), output.get()));

    auto finishResult = output->Finish();
    ARROW_RETURN_NOT_OK(finishResult.status());
    std::cout << finishResult.ValueOrDie()->ToString();
    return arrow::Status::OK();
}
I'm trying to create a descriptive table by treatment group. For my analysis, I have 3 different partitions of the data (because I'm running 3 separate analyses) from a complete data set, but I only have one statistic from each subset that I am trying to describe, so I think it'd look better in one complete table. In the end, I'd like an output that can be converted to LaTeX (as I'm using bookdown).
I've been using the compareGroups package to easily create each table individually. I know that there is an rbind function that allows you to create a stacked table, but it won't let me combine them because the n of each separate data frame is different (due to missingness). For instance, I'm studying marriage in one of my analyses and divorce in another (a separate analysis), so the n's of these two data frames differ, but the definition of the treatment group is the same.
Ideally, I'd have two columns, one for the treatment group and one for the control group. There would be two rows, one with age at first marriage and a second with the length of that first marriage, plus the respective n's of the cells.
library(compareGroups)

d1 <- compareGroups(treat ~ time1mar,
                    data = nlsy.mar,
                    simplify = TRUE,
                    na.action = na.omit) %>%
  createTable(., type = 1, show.p.overall = FALSE)

d2 <- compareGroups(treat ~ time1div,
                    data = nlsy.div,
                    simplify = TRUE,
                    na.action = na.omit) %>%
  createTable(., type = 1, show.p.overall = FALSE)

d.tot <- rbind(`First Age at Marriage` = d1, `Length of First Marriage` = d2)
This is the error that I get:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 6626, 5057
Any suggestions?
The problem might be that you're using na.omit, which deletes the cases/rows with NAs from both of your data sets, and probably a different number of cases gets removed from each one. Actually, differing numbers of rows should only be a problem with cbind, but you might still try changing the na.action option.
I'm just guessing; as joshpk said, without sample data it is difficult to reproduce your problem.
I am a beginner in machine learning. My academic project involves detecting human posture from acceleration and gyro data. I am stuck at the very beginning. My accelerometer data has x, y, z values and my gyro also has x, y, z values, stored in the files acc.csv and gyro.csv. I want to classify the 'standing', 'sitting', 'walking' and 'lying' positions. The idea is to train the machine using some (supervised) ML algorithm and then feed it a new acc + gyro data set to see what it predicts (what the subject is doing at present). I am facing the following problems:
Constructing a training dataset -- I think my activities will be the dependent variable, and the acc & gyro axis readings will be independent. If I'd like to combine them in a single matrix, where each element of the matrix again has its own set of acc and gyro values [something like a main matrix with sub-matrices], how can I do that? Or is there an alternative way to achieve the same?
How can I put the data of multiple activities, each with multiple readings, into a single training matrix?
I mean 10 walking samples, each with its own acc (xyz) and gyro (xyz), + 10 standing samples, each with its own acc (xyz) and gyro (xyz), + 10 sitting samples, each with its own acc (xyz) and gyro (xyz), and so on.
Each data file has a different number of records and time stamps; how do I bring them onto a common footing?
I know I am asking very basic things, but these are the confusing parts that nobody has clearly explained to me. I feel like I'm standing in front of a big closed door; inside, very interesting things are happening in which I cannot participate at this moment with my limited knowledge. My mathematical background is high-school level only. Please help.
I have gone through some projects on activity recognition in Github. But they are way too complicated for a beginner like me.
import pandas as pd
import os
import warnings
from sklearn.utils import shuffle
warnings.filterwarnings('ignore')

os.listdir('../input/testtraindata/')
base_train_dir = '../input/testtraindata/Train_Set/'

# Train Data
train_data = pd.DataFrame(columns=['activity', 'ax', 'ay', 'az', 'gx', 'gy', 'gz'])
train_folders = os.listdir(base_train_dir)
for tf in train_folders:
    files = os.listdir(base_train_dir + tf)
    for f in files:
        df = pd.read_csv(base_train_dir + tf + '/' + f)
        train_data = pd.concat([train_data, df], axis=0)
train_data = shuffle(train_data)
train_data.reset_index(drop=True, inplace=True)
train_data.head()
(Screenshots in the original post: the data set, and the problem in Train_Set.)
Surprisingly, if I remove the last 'gz' from
train_data = pd.DataFrame(columns=['activity', 'ax', 'ay', 'az', 'gx', 'gy', 'gz'])
everything works fine.
Do you have the data labeled? That is, positions of x, y, z, ... = posture?
I have no clue about the values (as I have not seen the dataset, and know little about positions, acc or gyro), but I'm guessing you should have a dataset in a matrix with x, y, z as categories and a target category: "position".
If you need all 6 (3 from one csv and 3 from the other) to define the positions, you can make 6 feature categories + the position.
Something like: x_1, y_1, z_1, x_2, y_2, and z_2 + a position label (the "position" category).
You can also make each position its own category with 0/1 as true/false:
"sitting", "walking", etc., with 0 and 1 as the values in the columns.
Is the timestamp of any importance to the position? If it is not a feature of importance, I would just drop it. If it is important in some way, you might want to bin the timestamps.
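The layout described above can be sketched in pandas (all column names and values here are illustrative, not from your files): one row per time step, six sensor features, and one label column that can also be one-hot encoded.

```python
import pandas as pd

# Illustrative layout: one row per time step, six sensor readings, one label
df = pd.DataFrame({
    "x_1": [0.1, 0.2, 0.9], "y_1": [0.0, 0.1, 0.8], "z_1": [9.8, 9.7, 3.1],
    "x_2": [0.01, 0.02, 0.50], "y_2": [0.0, 0.0, 0.4], "z_2": [0.0, 0.01, 0.3],
    "position": ["standing", "standing", "lying"],
})

# One-hot encode the label into 0/1 columns, one per position
one_hot = pd.get_dummies(df["position"])

X = df.drop(columns=["position"])  # the feature matrix
y = df["position"]                 # the target
```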
Here is a beginner's guide from Medium in which you can see a bit of how to preprocess your data. It also shows one-hot encoding :)
https://medium.com/hugo-ferreiras-blog/dealing-with-categorical-features-in-machine-learning-1bb70f07262d
Also try googling "preprocessing your data"; you will probably find the right recipe.
While creating train, test & cross-validation samples in Python, I see the default method as:
1. Reading the dataset , after skipping headers
2. Creating the train, test and Cross validation sample
import csv
with open('C:/Users/Train/Trainl.csv', 'r') as f1:
    next(f1)
    reader = csv.reader(f1, delimiter=',')
    input_set = []
    for row in reader:
        input_set.append(row)

import numpy as np
from numpy import genfromtxt
from sklearn import cross_validation
train, intermediate_set = cross_validation.train_test_split(input_set, train_size=0.6, test_size=0.4)
cv, test = cross_validation.train_test_split(intermediate_set, train_size=0.5, test_size=0.5)
My problem though is that I have a field say "A" in the csv file that I read into the numpy array, and all sampling should respect this field. That is, all entries with similar values for "A" should go in one sample .
Line #|A | B | C | D
1 |1 |
2 |1 |
3 |1 |
4 |1 |
5 |2 |
6 |2 |
7 |2 |
Required : line 1,2,3,4 should go in "one" sample and 5,6,7 should go in the "one" sample.
The value of column A is a unique id corresponding to one single entity (it could be seen as cross-sectional data points on one SINGLE user, so it MUST go into one unique sample of train, test, or cv), and there are many such entities, so grouping by entity id is required.
The B, C, D columns may have any values, but grouping preservation is not required on them. (Bonus: can I group the sampling on multiple fields?)
What I tried:
A. Finding all unique values of A and denoting these as my sample, I distribute the sample amongst train, intermediate & cv & test, and then put all remaining rows for each value of "A" into the same file.
That is, if train got the entry "3", test got "2" and cv got "1", then all rows with a value of A of 3 go into train, all with 2 go into test and all with 1 go into cv.
Of course this approach is not scalable.
And I suspect it may have introduced bias into the datasets, since the number of 1's in column A, the number of 2's, etc. is not equal, meaning this approach will not work!
B. I also tried numpy.random.shuffle, or numpy.random.permutation as per the thread here - Numpy: How to split/partition a dataset (array) into training and test datasets for, e.g., cross validation? , but it did not meet my requirement.
C. A third option, of course, is writing a custom function that does this grouping and then balances the training, test and cv data sets based on the number of data points in each group. But I'm just wondering whether there's already an efficient way to implement this?
Note my data set is huge, so ideally I would like to have a deterministic way to partition my datasets, without having multiple eye-ball-scans to be sure that the partition is correct.
EDIT Part 2:
Since I did not find anything that fit my sampling criteria, I actually wrote a module to sample with grouping constraints. This is the GitHub code for it. The code was not written with very large data in mind, so it's not very efficient. Should you fork this code, please point out how I can improve the run-time.
https://github.com/ekta1007/Sampling-techniques/blob/master/sample_expedia.py
By forcing such constraints you will introduce bias to your procedure either way, so the approach based on partitioning the "users" and then collecting their respective "measurements" does not seem bad. And it will scale just fine; it is an O(n) method, and the only reason for it not scaling up would be a bad implementation, not a bad method.
The reason there is no such functionality in existing libraries (like sklearn) is that it looks highly artificial and runs counter to the idea behind machine-learning models. If these rows are somehow one entity, then they should not be treated as separate data points. If you need this separate representation, then requiring a division in which a particular entity cannot be partially in the test set and partially in the training set will surely bias the whole model.
To sum up: you should really analyze deeply whether your approach is reasonable from the machine-learning point of view. If you are sure about it, I think the only possibility is to write the segmentation yourself, because even though I have used many ML libraries in the past, I've never seen such functionality.
In fact I am not sure whether the problem of segmenting a set of N numbers (the sizes of the entities) into K (=3) subsets of given sum proportions, with a uniform distribution when treated as a random process, is not an NP problem in itself. If you cannot guarantee a uniform distribution, then your datasets cannot be used as a statistically correct way of training/testing/validating your model. Even if it has a reasonable polynomial solution, it could still scale badly (much worse than linear methods). This doubt applies if your constraints are "strict"; if they are "weak" you can always use a "generate and reject" approach, which should have amortized linear complexity.
I was also facing a similar kind of issue. Though my coding is not too good, I came up with the solution given below:
Created a new data frame that only contains the unique ids of df, with duplicates removed.
new = df[["Unique_Id"]].copy()
New_DF = new.drop_duplicates()
Created training and test set on the basis of New_DF
train, test = train_test_split(New_DF, test_size=0.2)
And then merged those training and test sets back with the original df.
df_Test = pd.merge(df, test, how='inner', on='Unique_Id')
df_Train = pd.merge(df, train, how='inner', on='Unique_Id')
Similarly, we can create sample for the validation part too.
Cheers.
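For what it's worth, modern scikit-learn has this built in: GroupShuffleSplit (and GroupKFold) guarantee that all rows sharing a group id, like column "A" above, land in the same split. A minimal sketch (the data here is a toy stand-in):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Column "A" as the group id: rows sharing an id must stay together
A = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4])
X = np.arange(len(A)).reshape(-1, 1)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=A))

train_groups = set(A[train_idx])
test_groups = set(A[test_idx])
```

Splitting the held-out part once more with the same splitter gives the cv/test halves, still respecting the groups.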
I have been building an analysis workflow for my PhD and have been using a triple-nested list to represent my data structure, because I want it to be able to expand to an arbitrary amount of data at its second and third levels. The first level is the whole dataset, the second level is each subject in the dataset, and the third level is a row for each measure that each subject has.
[dataset]
|
[subject]
|
[measure1, measure2, measure3]
I am trying to map a function over each measure - for instance, converting all the points into floats or replacing anomalous values with None - and wish to return the whole dataset with its nesting intact, but with my current code:
for subject in dataset:
    for measure in subject:
        map(float, measure)
...the result is correct and exactly what I want, but the problem is that I can't think how to assign the result back to the dataset efficiently or without losing a level of the nest. Ideally, I would like it to change each measure in place, but I can't think how to do it.
Could you suggest an efficient and pythonic way of doing that? Is a triple-nested list a silly way to organize my data in the program?
Rather than doing it in place, make a new list:
dataset = [[[float(value) for value in measure]
            for measure in subject]
           for subject in dataset]
return [[list(map(float, measure)) for measure in subject] for subject in dataset]
You can return a list instead of altering it in place -- this is still remarkably efficient and preserves all the information you want. (Note that in Python 3, map returns a lazy iterator, hence the list(...) wrapper.) (aside: In fact, it's often faster than assigning to list indexes [citation needed], which is what others have suggested here!)
A straightforward way to do that in place would be:
for subject in dataset:
    for measure in subject:
        for i, elem in enumerate(measure):
            measure[i] = float(elem)
Alternatively, use the slice operator to update the list in place with the results of map:
for subject in dataset:
    for measure in subject:
        measure[:] = map(float, measure)
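The difference between rebinding and slice assignment is easy to verify: measure[:] = ... updates the same list object, so every other reference to it (including the one inside the nested dataset) sees the new values.

```python
measure = ["1", "2", "3"]
alias = measure            # a second reference to the same list object

measure[:] = map(float, measure)  # in-place update via slice assignment
```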
This should do the job:
for subject in dataset:
    for measure in subject:
        for i, m in enumerate(measure):
            measure[i] = float(m)