Fast data processing on large python dataframe - python-2.7

I have a huge data frame that contains 4 columns and 9 million rows. For example, my MainDataframe has:
NY_resitor1 NY_resitor2 SF_type SF_resitor2
45 36 Resis 40
47 36 curr 34
. . . .
49 39 curr 39
45 11 curr 12
12 20 Resis 45
I would like to split this into two dataframes based on SF_type, namely Resis and curr, and save each one as a CSV file.
This is what I wrote:
import pandas as pd

FullDataframe = pd.read_csv("hdhhdhd.csv")
cols = ["NY_resitor1", "NY_resitor2", "SF_type", "SF_resitor2"]
resis = pd.DataFrame(columns=cols)
curr = pd.DataFrame(columns=cols)
# Copy rows one at a time into the matching dataframe
for i in range(len(FullDataframe["SF_type"].values)):
    if "Resis" in FullDataframe["SF_type"].values[i]:
        resis.loc[i] = FullDataframe[cols].values[i]
    elif "curr" in FullDataframe["SF_type"].values[i]:
        curr.loc[i] = FullDataframe[cols].values[i]
resis.to_csv("jjsjjjsjs.csv")
curr.to_csv("jjsj554js.csv")
I have been running it for the past week and it still has not completed. Is there a better and faster way to do this?

You will have better luck with a pandas filter (boolean indexing) than with a for loop. To stick with convention, I'm calling your FullDataframe df instead:
resis = df[df.SF_type == 'Resis']
curr = df[df.SF_type == 'curr']
Then write them out as before:
resis.to_csv("jjsjjjsjs.csv")
curr.to_csv("jjsj554js.csv")
I'm not sure what your index is, but if you are not using the default pandas index (i.e. 0, 1, 2, 3, etc.), you may see a performance boost from sorting your index (the .sort_index() method).
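As a minimal sketch of the filter approach, the following uses a five-row stand-in for the real file (the sample values are taken from the question; the `parts` dictionary is a hypothetical helper, not something from the original post). Each boolean mask is evaluated over the whole column at once, so no Python-level loop over rows is needed.

```python
import pandas as pd

# Tiny stand-in for the 9-million-row CSV; column names follow the question.
df = pd.DataFrame({
    "NY_resitor1": [45, 47, 49, 45, 12],
    "NY_resitor2": [36, 36, 39, 11, 20],
    "SF_type": ["Resis", "curr", "curr", "curr", "Resis"],
    "SF_resitor2": [40, 34, 39, 12, 45],
})

# Vectorised boolean masks select all matching rows in one pass.
resis = df[df["SF_type"] == "Resis"]
curr = df[df["SF_type"] == "curr"]

# Alternative: let groupby split on every distinct SF_type value.
parts = {sf_type: group for sf_type, group in df.groupby("SF_type")}
```

Each resulting frame can then be written with to_csv exactly as in the answer above.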

Related

NetLogo is reading empty cells from CSV file

I have 3 csv files, and each file contains numbers in 4 rows. I created a list of lists from those files (code below). The problem is that NetLogo reads empty cells from the csv files and puts them in the list (picture below). I don't know why; I have used this code and method many times, and this never happened before, there weren't any empty spaces. Can someone help me? Thanks in advance!
foreach [ 1 2 3 ]
[ i ->
set filename (word "../data/dataset_" i ".csv")
set dataset-list lput (csv:from-file filename) dataset-list
show word "dataset-list " dataset-list
]
EDIT: I realized this happens because the lines in the csv file (Excel) are not the same length. E.g. I have 5 lines with numbers:
1 2 3 0
18 45 56 0 89 34 45 56
5 10 56 0 89 34 45 56 56 0 89 34 45 56 56 0 89 34 45
0
However, I had this situation before, and there weren't any problems.
If I understand your problem correctly, it might depend on the csv file. If a cell held something before, even if you deleted it, it might be seen as an empty space instead of just nothing. This happened to me with csv:from-file after I used an Excel macro to transpose rows and columns in a csv file.
Try this: open the csv file in Excel, delete the columns where the empty spaces are, and then (if you need the empty columns) insert new columns there. At this point it should not read them as empties, but it should skip them (if I understand csv:from-file correctly).

Data format and pandas

I am using pandas to format things nicely in a tabular format:
data = []
for i in range(start, end_value):
    data.append([i, value])
    # modify value in some way
print(pd.DataFrame(data))
gives me
0 1
0 38 2.500000e+05
1 39 2.700000e+05
2 40 2.916000e+05
3 41 3.149280e+05
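How can I modify this to remove the scientific notation and, for extra points, add a thousands separator?
Once the data is in a DataFrame (here data refers to the DataFrame, not the list), format the column as strings:
data['column_name'] = data['column_name'].apply('{0:,.2f}'.format)
Thanks to John Galt's previous SO answer.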
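A small self-contained sketch of that formatting step, using the values printed in the question. The column names `step` and `value` and the new column `value_fmt` are assumptions made for illustration. Note that the formatted column holds strings, so it is suited to display rather than further arithmetic.

```python
import pandas as pd

# Values copied from the question's printed output.
data = [[38, 250000.0], [39, 270000.0], [40, 291600.0], [41, 314928.0]]
df = pd.DataFrame(data, columns=["step", "value"])

# '{0:,.2f}'.format renders each float with a thousands separator
# and two decimals, which also removes the scientific notation.
df["value_fmt"] = df["value"].apply("{0:,.2f}".format)
```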

Eliminate duplicate pairs of numbers and total their quantities in openoffice Calc

I have a Calc sheet listing a cut-list for plywood in two columns with a quantity in a third column. I would like to remove duplicate matching pairs of dimensions and total the quantity. Starting with:
A B C
25 35 2
25 40 1
25 45 3
25 45 2
35 45 1
35 50 3
40 25 1
40 25 1
Ending with:
A B C
25 35 2
25 40 1
25 45 5
35 45 1
35 50 3
40 25 2
I'm trying to automate this. Currently I have multiple lists which occupy the same page which need to be totaled independently of each other.
Give each list a unique identifier (say ListId, ListCode or ListNumber), and give every row that belongs to the same list the same value in this field.
Concatenate A and B to form a new column, say PairAB.
If the list is small and manageable, filter on PairAB and collect the totals.
Otherwise, use grouping and subtotals to get totals for each list and each pair, grouping on ListId and PairAB.
If the list is very large, you are better off exporting it to CSV and loading it into a database; this kind of aggregation is simple in SQL.
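To illustrate the database route, here is a rough sketch using Python's built-in sqlite3 module with the sample cut-list from the question. The table name `cuts`, the column names, and the single `list_id` value of 1 are all assumptions for the example; a real sheet with several lists would carry a different `list_id` per list.

```python
import sqlite3

# (list_id, a, b, qty): the question's sample data, all in one hypothetical list.
rows = [
    (1, 25, 35, 2), (1, 25, 40, 1), (1, 25, 45, 3), (1, 25, 45, 2),
    (1, 35, 45, 1), (1, 35, 50, 3), (1, 40, 25, 1), (1, 40, 25, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cuts (list_id INTEGER, a INTEGER, b INTEGER, qty INTEGER)")
conn.executemany("INSERT INTO cuts VALUES (?, ?, ?, ?)", rows)

# GROUP BY collapses duplicate (list_id, a, b) rows and sums their quantities.
totals = conn.execute(
    "SELECT list_id, a, b, SUM(qty) FROM cuts "
    "GROUP BY list_id, a, b ORDER BY list_id, a, b"
).fetchall()
conn.close()
```

Grouping on `list_id` as well as the pair keeps the totals of separate lists independent, which matches the ListId and PairAB grouping described above.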

How to append a new column to my Pandas DataFrame based on a row-based calculation?

Let's say I have a Pandas DataFrame with two columns: 1) user_id, 2) steps (which contains the number of steps on the given date). Now I want to calculate the difference between the number of steps and the number of steps in the preceding measurement (measurements are guaranteed to be in order within my DataFrame).
So basically this comes down to appending an extra column to my DataFrame where the row values of this data frame match the value of the column 'steps' within this same row, minus the value of the 'steps' column in the row above (or 0 if this is the first row). To complicate things further, I want to calculate these differences per user_id, so I want to make sure that I do not subtract the steps values of two rows with different user_id's.
Does anyone have an idea how to get this done with Python 2.7 and pandas?
So an example to illustrate this.
Example input:
user_id steps
1015 48
1015 23
1015 79
1016 10
1016 20
Desired output:
user_id steps d_steps
1015 48 0
1015 23 -25
1015 79 56
2023 10 0
2023 20 10
Your output shows user ids that are not in your original data, but the following does what you want. You will have to replace/fill the NaN values with 0:
In [16]:
df['d_steps'] = df.groupby('user_id').transform('diff')
df.fillna(0, inplace=True)
df
Out[16]:
user_id steps d_steps
0 1015 48 0
1 1015 23 -25
2 1015 79 56
3 1016 10 0
4 1016 20 10
Here we generate the desired column by calling transform on the groupby object and passing a string that maps to the diff method, which subtracts the previous row's value from each row. transform applies a function and returns a result whose index is aligned to the original df.
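For reference, a self-contained sketch of the same idea with the question's input data. It selects the steps column explicitly and calls diff directly on the grouped column, which is one possible variant of the answer's approach rather than the answer's exact code; the cast back to int is an added step to match the integer output shown.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1015, 1015, 1015, 1016, 1016],
    "steps": [48, 23, 79, 10, 20],
})

# diff() is computed within each user_id group, so the first row of
# every user has no predecessor and comes back as NaN; fill it with 0.
df["d_steps"] = df.groupby("user_id")["steps"].diff().fillna(0).astype(int)
```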

Python Pandas read_csv issue

I have a simple CSV file that looks like this:
inches,12,3,56,80,45
tempF,60,45,32,80,52
I read in the CSV using this command:
import pandas as pd
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0)
Which results in this structure:
1 2 3 4 5
0
inches 12 3 56 80 45
tempF 60 45 32 80 52
But I want this (unnamed index column):
0 1 2 3 4
inches 12 3 56 80 45
tempF 60 45 32 80 52
EDIT: As @joris pointed out, additional methods can be run on the resulting DataFrame to achieve the desired structure. My question is specifically about whether this structure can be achieved through read_csv arguments alone.
From the documentation of the function:
names : array-like
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
So, apparently:
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0, names=range(5))
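For comparison, the post-processing route mentioned in the question's EDIT could look roughly like the sketch below. It reads the same two lines from an in-memory string instead of test_csv.csv (an assumption made so the example runs on its own), then drops the index name and renumbers the columns after loading.

```python
import io

import pandas as pd

# In-memory stand-in for test_csv.csv with the question's two rows.
csv_text = "inches,12,3,56,80,45\ntempF,60,45,32,80,52\n"

pd_obj = pd.read_csv(io.StringIO(csv_text), header=None, index_col=0)

# After loading: remove the index name (0) and relabel columns 1..5 as 0..4.
pd_obj.index.name = None
pd_obj.columns = range(pd_obj.shape[1])
```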