create dataframe by randomly sampling from multiple files - python-2.7

I have a folder with several 20 million record tab delimited files in it. I would like to create a pandas dataframe where I randomly sample say 20 thousand records from each file, and then append them together in the dataframe. Does anyone know how to do that?

You could read in all the text files in a particular folder and then make use of pandas DataFrame.sample (link to docs).
I've provided a fully reproducible example with two example .txt files of 200 rows each. I then take a random sample of ten rows from each file and append the sample to a final dataframe.
import pandas as pd
import numpy as np
import glob
# Change the path for the directory
directory = r'C:\some\example\folder'
# I create two test .txt files for demonstration purposes with 200 rows each
df_test = pd.DataFrame(np.random.randn(200, 2), columns=list('AB'))
df_test.to_csv(directory + r'\test_1.txt', sep='\t', index=False)
df_test.to_csv(directory + r'\test_2.txt', sep='\t', index=False)
df = pd.DataFrame()
for filename in glob.glob(directory + r'\*.txt'):
    df_full = pd.read_csv(filename, sep='\t')
    df_sample = df_full.sample(n=10)
    df = df.append(df_sample)
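If reading each 20-million-row file in full just to keep 20 thousand rows turns out to be too slow or memory hungry, a hedged alternative is to let read_csv skip the unwanted rows itself. This is only a sketch: N_ROWS is an assumed per-file row count (you would need the real one), and passing a callable to skiprows requires pandas 0.19 or newer.
import glob
import random
import pandas as pd
N_ROWS = 20000000   # assumed number of data rows per file (excluding the header)
N_SAMPLE = 20000    # rows to keep from each file
directory = r'C:\some\example\folder'
samples = []
for filename in glob.glob(directory + r'\*.txt'):
    # Pick 20k data-row indices to keep; row 0 is the header.
    keep = set(random.sample(xrange(1, N_ROWS + 1), N_SAMPLE))
    # Skip every row that is neither the header nor in the sample,
    # so pandas never materialises the full 20M rows.
    sample = pd.read_csv(filename, sep='\t',
                         skiprows=lambda i: i > 0 and i not in keep)
    samples.append(sample)
df = pd.concat(samples, ignore_index=True)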

Related

merge 1000s of csv with same name in different subdirectories

I have 1000 subdirectories (error1 - error1000), each with three different csv files (rand.csv, run_error.csv, swe_error.csv). Each csv has an index row. I need to merge the csv files that have the same filename, so I end up with e.g. rand_merge.csv with an index row and 1000 rows of data.
I followed Merge multiple csv files with same name in 10 different subdirectory, which gets me
KeyError: 'filename'
I can't figure out how to fix it, so any help is appreciated.
Thx
Update: Here's the exact code, which came from linked post above:
import pandas as pd
import glob
CONCAT_DIR = "./error/files_concat/"
# Use glob module to return all csv files under root directory. Create DF from this.
files = pd.DataFrame([file for file in glob.glob("error/*/*")], columns=["fullpath"])
# Split the full path into directory and filename
files_split = files['fullpath'].str.rsplit("\\", 1, expand=True).rename(columns={0: 'path', 1:'filename'})
# Join these into one DataFrame
files = files.join(files_split)
# Iterate over unique filenames; read CSVs, concat DFs, save file
for f in files['filename'].unique():
    paths = files[files['filename'] == f]['fullpath']  # Get list of fullpaths from unique filenames
    dfs = [pd.read_csv(path, header=None) for path in paths]  # Get list of dataframes from CSV file paths
    concat_df = pd.concat(dfs)  # Concat dataframes into one
    concat_df.to_csv(CONCAT_DIR + f)  # Save dataframe
I found my mistake. I needed "/" in rsplit, not "\":
files_split = files['fullpath'].str.rsplit("/", 1, expand=True).rename(columns={0: 'path', 1:'filename'})
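If the same script has to run on both Windows and Linux, hard-coding either separator stays fragile. A small sketch of a platform-independent alternative, using os.path on the same files DataFrame:
import os
# os.path handles whichever separator glob produced on the current platform,
# so the split works with "/" on Linux and "\" on Windows.
files['filename'] = files['fullpath'].apply(os.path.basename)
files['path'] = files['fullpath'].apply(os.path.dirname)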

Python - Copy specific columns from other excel files to a new one based on file name

I have a script which generates CSV files and names them as per time stamp
-rw-rw-r-- 1 9949 Oct 13 11:57 2018-10-13-11:57:10.796516.csv
-rw-rw-r-- 1 9649 Oct 13 12:58 2018-10-13-12:58:12.907835.csv
-rw-rw-r-- 1 9649 Oct 13 13:58 2018-10-13-13:58:10.502635.csv
I need to pick column C from these files and write it to a new CSV file. However, the order of the columns in the new file should follow the timestamped names of the existing files.
For example, column C from the file generated at 11:57 should become column A, the one from 12:58 column B, and the one from 13:58 column C of the new file.
EDIT -- Code tried based on Bilal's input. It does move the C column from all existing sheets to a new sheet, but not in the proper order. It just picks them up in arbitrary order and keeps adding columns to the new file.
import os
import re
import pandas as pd
newCSV = pd.DataFrame.from_dict({})
# get a list of your csv files and put them in `files`
files = [f for f in os.listdir('.') if os.path.isfile(f)]
results = []
for f in files:
    if re.search('.csv', f):
        results += [f]
for file in results:
    df = pd.read_csv(file, usecols=[2])
    newCSV = pd.concat((newCSV, df), axis=1)
newCSV.to_csv("new.csv")
EDIT -- Final Code that worked, Thanks Bilal
import os
import re
import pandas as pd
newCSV = pd.DataFrame.from_dict({})
files = [f for f in os.listdir('.') if os.path.isfile(f)]
# get a list of your csv files and put them in `results`
results = []
for f in files:
    if re.search('.csv', f):
        results += [f]
result1 = sorted(results)
for file in result1:
    df = pd.read_csv(file, usecols=[2])
    newCSV = pd.concat((newCSV, df), axis=1)
newCSV.to_csv("new.csv")
import pandas as pd
newCSV = pd.DataFrame.from_dict({})
# get a list of your csv files and put them in `files`
for f in files:
    df = pd.read_csv(f)
    # replace `column_name` with the name of the column you want to copy
    newCSV = pd.concat((newCSV, df.column_name), axis=1)
newCSV.to_csv("new.csv")
See if that works for you.
If you don't know how to find all the files with a specific extension, look here.
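For reference, a minimal sketch using glob instead of the regex filter (note that re.search('.csv', f) treats the dot as "any character", so it can also match names that merely contain "csv" somewhere):
import glob
# Sort so the timestamped filenames are processed in chronological order.
results = sorted(glob.glob('*.csv'))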

How to create a bag of words from csv file in python?

I am new to python. I have a csv file which has cleaned tweets. I want to create a bag of words of these tweets.
I have the following code but it's not working correctly.
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
data = pd.read_csv(open("Twidb11.csv"), sep=' ')
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data.Text)
count_vect.vocabulary_
Error:
.ParserError: Error tokenizing data. C error: Expected 19 fields in
line 5, saw 22
I think this is a duplicate. You can see the answer here; there are a lot of answers and comments there.
So the solution can be:
data = pd.read_csv('Twidb11.csv', error_bad_lines=False)
Or:
df = pandas.read_csv(fileName, sep='delimiter', header=None)
"In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indeces for each field {0,1,2,...}."

Faster approach to transform a bunch of .csv file into HDF dataframe

1. Background
HDF is a superb file format for data storage and management.
I have source data (365 .csv files) containing the air quality data (1 h time resolution) for all monitoring sites (more than 1500) of China. Each file consists of many features (particulate matter, SO2, etc.) and their corresponding times.
I have uploaded some template files here for someone interested.
My goal ==> Merge all the files into one dataframe for efficient management
2. My code
# -*- coding: utf-8 -*-
import pandas as pd
from pandas import HDFStore, DataFrame
from pandas import read_hdf
import os, sys, string
import numpy as np

### CREATE AN EMPTY HDF5 FILE
hdf = HDFStore("site_2016_whole_year.h5")

### READ THE CSV FILES AND SAVE THEM INTO HDF5 FORMAT
os.chdir("./site_2016/")
files = os.listdir("./")
files.sort()

### Read a template file to get the column names
test_file = "china_sites_20160101.csv"
test_f = pd.read_csv(test_file, encoding='utf_8')
site_columns = list(test_f.columns[3:])
print site_columns[1]

feature = ['pm25', 'pm10', 'O3', 'O3_8h', 'CO', "NO2", 'SO2', "aqi"]
fe_dict = {"pm25": 1, "aqi": 0, 'pm10': 3, 'SO2': 5, 'NO2': 7, 'O3': 9, "O3_8h": 11, "CO": 13}

for k in range(0, len(feature), 1):
    data_2016 = {"date": [], 'hour': []}
    for i in range(0, len(site_columns), 1):
        data_2016[site_columns[i]] = []
    for file in files[0:]:
        filename, extname = os.path.splitext(file)
        if extname == ".csv":
            datafile = file
            f_day = pd.read_csv(datafile, encoding='utf_8')
            site_columns = list(f_day.columns[3:])
            for i in range(0, len(f_day), 15):
                datetime = str(f_day["date"].iloc[i])
                hour = "%02d" % (f_day["hour"].iloc[i])
                data_2016["date"].append(datetime)
                data_2016["hour"].append(hour)
                for t in range(0, len(site_columns), 1):
                    data_2016[site_columns[t]].append(
                        f_day[site_columns[t]].iloc[i + fe_dict[feature[k]]])
    data_2016 = pd.DataFrame(data_2016)
    hdf.put(feature[k], data_2016, format='table', encoding="utf-8")
3. My problem
Using my code above, the hdf5 file can be created, but the process is slow.
My lab has a Linux cluster with a 32-core CPU. Is there any way to turn my program into a multi-processing one?
Maybe I do not understand your problem properly, but I would use something like this:
import os
import pandas as pd
indir = <'folder with downloaded 12 csv files'>
indata = []
for i in os.listdir(indir):
    indata.append(pd.read_csv(indir + i))
out_table = pd.concat(indata)
hdf = pd.HDFStore("site_2016_whole_year.h5", complevel=9, complib='blosc')
hdf.put('table1',out_table)
hdf.close()
For 12 input files it takes 2.5 s on my laptop, so even 365 files should be done in a minute or so. I do not see a need for parallelisation in this case.
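If you still want to put the cluster to work, the file reads themselves are easy to parallelise because each CSV is independent. A rough sketch with multiprocessing.Pool, assuming the concatenated result fits in memory; the HDF write stays single-process, and each worker has to pickle its DataFrame back to the parent, which eats into the speed-up:
import os
import multiprocessing
import pandas as pd
indir = './site_2016/'
def read_one(fname):
    # Each worker parses one CSV and returns it as a DataFrame.
    return pd.read_csv(os.path.join(indir, fname), encoding='utf_8')
if __name__ == '__main__':
    csv_files = sorted(f for f in os.listdir(indir) if f.endswith('.csv'))
    pool = multiprocessing.Pool(processes=8)  # tune to your core count
    frames = pool.map(read_one, csv_files)
    pool.close()
    pool.join()
    out_table = pd.concat(frames)
    hdf = pd.HDFStore("site_2016_whole_year.h5", complevel=9, complib='blosc')
    hdf.put('table1', out_table)
    hdf.close()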

How to remove columns from a file using pandas DataFrame.drop with a list of column headers read in from a second file

Python newbie and stackoverflow posting newbie.
My goal is to create a python script which will take two files entered at the command line and drop columns from the first file if column headers are present within the second file, and write the output to a new file.
I've tried several approaches to this, and currently I am attempting to use Pandas DataFrame.drop
On a very small test set, I can achieve the removal of columns by manually specifying headers in a string (thanks to Delete column from pandas DataFrame), but can't figure out how to import a list of column headers from a file and format them correctly for DataFrame.drop.
I have two files.
The large one has a quarter of a million rows and up to 21,000 columns. The columns relate to samples, and the rows relate to genetic markers.
I also have a smaller file containing up to 1000 sample IDs, which correspond to column headers in the large file. These are the columns I wish to drop from the large file.
I have attempted many things (creating lists, creating labels), one example below, but failed.
I would be grateful if anyone could point me in the right direction.
large file
Name Chr Position 8077686010_R04C02.GType 8077686010_R04C02.X 8077686010_R04C02.Y 8131566005_R01C02.GType 8131566005_R01C02.X 8131566005_R01C02.Y
exm-rs1000026 21 38934599 NC 0.0144234 1.112413 NC 0.01250324 1.084685
exm-rs1000053 2 12790328 NC 0.04906762 1.495594 NC 0.07344548 1.552252
exm-rs1000110 9 117908721 NC 0.02433169 1.314785 NC 0.05954991 1.356415
exm-rs1000113 5 150240076 NC 0.015468 0.793373 NC 0.02498361 0.8621324
exm-rs1000158 20 36599904 NC 0.01016421 0.7593179 NC 0.4537758 0.5095596
exm-rs1000192 16 6747139 NC 0.01774782 0.8661015 NC 0.01103768 0.9004255
exm-rs1000203 14 40896108 NC 0.7707067 0.006222768 NC 0.7400684 0.003768863
smallerfile
8077686010_R04C02.GType
8077686010_R04C02.X
8077686010_R04C02.Y
outfile
Name Chr Position 8131566005_R01C02.GType 8131566005_R01C02.X 8131566005_R01C02.Y
exm-rs1000026 21 38934599 NC 0.01250324 1.084685
exm-rs1000053 2 12790328 NC 0.07344548 1.552252
exm-rs1000110 9 117908721 NC 0.05954991 1.356415
exm-rs1000113 5 150240076 NC 0.02498361 0.8621324
exm-rs1000158 20 36599904 NC 0.4537758 0.5095596
exm-rs1000192 16 6747139 NC 0.01103768 0.9004255
exm-rs1000203 14 40896108 NC 0.7400684 0.003768863
Working code
import pandas as pd
import numpy as np
outfile = open("myout.txt", "w")
largefile = pd.read_csv('large',sep='\t',header=0,index_col=0)
largefile = largefile.astype(object)
new_data = largefile.drop(['8077686010_R04C02.GType','8077686010_R04C02.X','8077686010_R04C02.Y',], axis=1)
new_data.to_csv(outfile,sep="\t")
Failing code - one of many
import pandas as pd
import numpy as np
outfile = open("myout.txt", "w")
largefile = pd.read_csv('large',sep='\t',header=0,index_col=0)
largefile = largefile.astype(object)
dropcols = open("smallerfile",'r').read().split('\n')
new_data = largefile.drop(dropcols, axis=1)
new_data.to_csv(outfile,sep="\t")
List generated
['8131566005_R01C02.GType', '8131566005_R01C02.X', '8131566005_R01C02.Y', '8131566013_R02C01.GType', '8131566013_R02C01.X', '8131566013_R02C01.Y', '']
Output
Traceback (most recent call last):
File "my.py", line 59, in <module>
new_data = largefile.drop(dropcolslst, axis=1)
File "/usr/lib/pymodules/python2.7/pandas/core/generic.py", line 174, in drop
new_axis = axis.drop(labels)
File "/usr/lib/pymodules/python2.7/pandas/core/index.py", line 881, in drop
raise ValueError('labels %s not contained in axis' % labels[mask])
ValueError: labels ["] not contained in axis
To get your code to work all you need to do is drop the empty string from your dropcols list. Something like this:
dropcols = [x for x in dropcols if x != '']
Or, if you want the drop to work even when dropcols lists a column that is not in the larger dataframe, you could take the intersection of dropcols and the dataframe's columns:
dropcols = set(dropcols) & set(largefile.columns)
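Putting both fixes together with your failing code, a hedged end-to-end version (keeping your file names and the tab separator) might look like this:
import pandas as pd
largefile = pd.read_csv('large', sep='\t', header=0, index_col=0)
largefile = largefile.astype(object)
# Read the sample IDs, drop the trailing empty string, and keep only
# the IDs that actually appear as columns in the large file.
dropcols = open("smallerfile", 'r').read().split('\n')
dropcols = [x for x in dropcols if x != '']
dropcols = list(set(dropcols) & set(largefile.columns))
new_data = largefile.drop(dropcols, axis=1)
new_data.to_csv("myout.txt", sep="\t")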
A more memory-efficient way to do this: the key is to use the usecols argument of pd.read_csv.
import pandas as pd
import numpy as np
dropcols = open("smallerfile", 'r').read().split('\n')
cols = open("large", 'r').readline().rstrip().split('\t')  # read only the header row
usecols = [i for i in range(len(cols)) if cols[i] not in dropcols]
Tell pd.read_csv to load only usecols and specify the data type as object, then save the loaded file:
largefile = pd.read_csv('large',sep='\t',header=0,index_col=0, usecols=usecols, dtype='object')
with open("myout.txt", "w") as outfile:
largefile.to_csv(outfile,sep="\t")