Read multiple *.txt files into Pandas Dataframe with filename as column header - python-2.7

I am trying to import a set of *.txt files. I need to import the files into successive columns of a Pandas DataFrame in Python.
Requirements and Background information:
Each file has one column of numbers
No headers are present in the files
Positive and negative integers are possible
The size of all the *.txt files is the same
The columns of the DataFrame must have the name of the file (without extension) as the header
The number of files is not known ahead of time
Here is one sample *.txt file. All the others have the same format.
16
54
-314
1
15
4
153
86
4
64
373
3
434
31
93
53
873
43
11
533
46
Here is my attempt:
import pandas as pd
import os
import glob

# Step 1: get a list of all *.txt files in the target directory
my_dir = "C:\\Python27\\Files\\"
filelist = []
filesList = []
os.chdir(my_dir)

# Step 2: Build up list of files:
for files in glob.glob("*.txt"):
    fileName, fileExtension = os.path.splitext(files)
    filelist.append(fileName)  # file name without extension
    filesList.append(files)    # file name with extension

# Step 3: Build up DataFrame:
df = pd.DataFrame()
for ijk in filelist:
    frame = pd.read_csv(filesList[ijk])
    df = df.append(frame)
print df
Steps 1 and 2 work. I am having problems with step 3. I get the following error message:
Traceback (most recent call last):
File "C:\Python27\TextFile.py", line 26, in <module>
frame = pd.read_csv(filesList[ijk])
TypeError: list indices must be integers, not str
Question:
Is there a better way to load these *.txt files into a Pandas dataframe? Why does read_csv not accept strings for file names?

You can read them into multiple dataframes and concat them together afterwards. Suppose you have two of those files, containing the data shown.
In [6]:
filelist = ['val1.txt', 'val2.txt']
print pd.concat([pd.read_csv(item, names=[item[:-4]]) for item in filelist], axis=1)
    val1  val2
0     16    16
1     54    54
2   -314  -314
3      1     1
4     15    15
5      4     4
6    153   153
7     86    86
8      4     4
9     64    64
10   373   373
11     3     3
12   434   434
13    31    31
14    93    93
15    53    53
16   873   873
17    43    43
18    11    11
19   533   533
20    46    46
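Since the number of files is not known ahead of time, the file list itself can be discovered at run time. A minimal sketch of that variation (assuming the *.txt files sit in the current working directory):
import glob
import os
import pandas as pd

# Discover every .txt file at run time instead of hard-coding the list.
txt_files = sorted(glob.glob("*.txt"))

# Read each one-column file; the bare file name (no extension) becomes the column header.
frames = [pd.read_csv(f, header=None, names=[os.path.splitext(f)[0]]) for f in txt_files]

# Concatenate side by side so each file ends up in its own column.
df = pd.concat(frames, axis=1)
print df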

You're very close. ijk is the filename already, you don't need to access the list:
# Step 3: Build up DataFrame:
df = pd.DataFrame()
for ijk in filelist:
    frame = pd.read_csv(ijk)
    df = df.append(frame)
print df
In the future, please provide working code exactly as it is. The original code imported with from pandas import * yet then referred to pandas as pd, implying the intended import was import pandas as pd.
You also want to be careful with variable names: files is actually a single file path, and filelist and filesList are indistinguishable from their names alone. It also seems like a bad idea to keep personal documents in your Python directory.

Related

elasticsearch bulk insert from s3 using lambda and python

I have a task to delete all docs from an Elasticsearch index and repopulate it from all the files in an S3 bucket. I can do this one document at a time, but it is extremely slow, and my attempt at a bulk insert is just as slow. What am I missing? I feel that I am negating the speed benefit of the paginator, and I also get a MemoryError at the end when I try to gather 1000 documents at a time and bulk insert them into Elasticsearch. If a bulk insert is possible, how do I build the request from scratch and keep it in memory? I want it to be as fast as the paginator, but I am failing miserably here. What am I missing, please?
I have put the data examples in the code as comments (lines 15, 38, 47-48)
1 s3_bucket = 's3_bucket_name'
2 s3 = boto3.resource('s3')
3 paginator = s3_client.get_paginator("list_objects_v2")
4 es_url = 'https://my_es_url'
5 aws_auth = my_auth_info
6 es_client = Elasticsearch([es_url], .....)
7
8 # get keys from s3 bucket
9 for page in paginator.paginate(Bucket=s3_bucket, Prefix=''):
10     if page['KeyCount'] <= 0 or "Contents" not in page:
11         continue
12     keys = [object["Key"] for object in page["Contents"]]
13     if not keys:
14         continue  # nothing to see here
15     objects = [{"Key": key} for key in keys]  # [{'Key': 'path/filename1_02032022T000007Z.json'}, etc...]
16     # paginator maxes out at 1000, so the number of keys in objects is 1000 at a time (total count > 100,000)
17     count = len(objects)
18     total += count
19     if dryrun:
20         print("dryrun: {0} objects".format(count))  # 1000
21         print("{0} total objects".format(total))  # 102701 total objects
22     else:
23         es_index = 'es_index_name'
24         # delete docs from ES index
25         es_client.delete_by_query(index=es_index, body={"query": {"match_all": {}}})
26
27         # copy from s3 to ES  ## really slow after this because getting data one at a time
28         actions = []  # list
29         index = {}  # dict
30         index['_index'] = es_index
31         doc = {}
32         for obj in objects:
33             key_id = obj['Key']
34             id = key_id.partition('.json')[0]
35             index['_id'] = id  # {'_index': 'es_index_name', '_id': 'filename'}
36             # get object body
37             data_obj = s3.Object(s3_bucket, key_id)
38             data_json = data_obj.get()  # {'data1':'value1','data2':'value2','data3':'value3'}
39             data_json = data_json.get('Body').read().decode('utf-8')
40             data_json = data_json.replace("'", '"')
41             doc = data_json
42             doc['id'] = id
43             actions.append(index)
44             actions += str(actions) + '\n'
45             actions += str(doc) + '\n'  # GETTING MEMORYERROR HERE
46             # actions:
47             # [{'_index':'es_index_name','_id':'filename1_02032022T000007Z'},
48             #  {'data1':'value1','data2':'value2','data3':'value3'}]
49             print('--------------------------------')
50         response = helpers.bulk(es_client, actions)
51         print('Data successfully inserted into ES')
52 print("PROCESS_ALL: {0} objects".format(count))  # 1000
53 print("{0} total objects".format(total))  # 102701 total objects
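For reference, elasticsearch-py's helpers.bulk accepts an iterable (including a generator) of action dicts, so the actions do not need to be joined into one huge string first. The following is only a rough sketch under that assumption, reusing the names from the code above and assuming each S3 object body is valid JSON:
import json
from elasticsearch import helpers

def generate_actions(objects, s3, s3_bucket, es_index):
    # Yield one action dict per S3 object so nothing large accumulates in memory.
    for obj in objects:
        key_id = obj['Key']
        doc_id = key_id.partition('.json')[0]
        body = s3.Object(s3_bucket, key_id).get()['Body'].read().decode('utf-8')
        yield {
            '_index': es_index,
            '_id': doc_id,
            '_source': json.loads(body),  # assumes the object holds valid JSON
        }

# helpers.bulk consumes the generator lazily and batches the requests itself.
response = helpers.bulk(es_client, generate_actions(objects, s3, s3_bucket, es_index))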

Fast data processing on large python dataframe

I have a huge data frame that contains 4 columns and 9 million rows. For example, my MainDataframe has:
NY_resitor1  NY_resitor2  SF_type  SF_resitor2
45           36           Resis    40
47           36           curr     34
...          ...          ...      ...
49           39           curr     39
45           11           curr     12
12           20           Resis    45
I would like to split it into two dataframes and save them as CSV files based on the SF_type value, namely Resis and curr.
This is what I wrote:
FullDataframe = pd.read_csv("hdhhdhd.csv")

resis = pd.DataFrame()
curr = pd.DataFrame()

for i in range(len(FullDataframe["SF_type"].values)):
    if "Resis" in FullDataframe["SF_type"].values[i]:
        resis.loc[i] = FullDataframe[["NY_resitor1", "NY_resitor2", "SF_type", "SF_resitor2"]].values[i]
    elif "curr" in FullDataframe["SF_type"].values[i]:
        curr.loc[i] = FullDataframe[["NY_resitor1", "NY_resitor2", "SF_type", "SF_resitor2"]].values[i]

resis.to_csv("jjsjjjsjs.csv")
curr.to_csv("jjsj554js.csv")
I have been running this for the past week, but it is still not complete. Is there a better and faster way to do this?
You will have better luck with a pandas filter than with a for loop. Just to stick with convention, I'm calling your FullDataframe df instead:
resis = df[df.SF_type == 'Resis']
curr = df[df.SF_type == 'curr']
Then run your:
resis.to_csv("jjsjjjsjs.csv")
curr.to_csv("jjsj554js.csv")
I'm not sure what your index is, but if you are not using just the default pandas index (i.e. 0, 1, 2, 3 etc.), then you will see a performance boost by sorting your index (.sort_index() method).
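If the split ever needs to cover more than these two SF_type values, a hedged alternative (not part of the answer above) is to let groupby produce one frame per category and write each out in a single pass; the output file names here are just placeholders:
import pandas as pd

df = pd.read_csv("hdhhdhd.csv")

# One pass over the frame; each SF_type category is written to its own CSV.
for sf_type, group in df.groupby("SF_type"):
    group.to_csv("output_{0}.csv".format(sf_type))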

csv parsing and manipulation using python

I have a CSV file which I need to parse using Python.
triggerid,timestamp,hw0,hw1,hw2,hw3
1,234,343,434,78,56
2,454,22,90,44,76
I need to read the file line by line and slice out the triggerid, timestamp and hw3 columns. But the column sequence may change from run to run, so I need to match the field name, count the column, and then print out the output file as:
triggerid,timestamp,hw3
1,234,56
2,454,76
Also, is there a way to generate a hash table (like we have in Perl) so that I can store the entire column for hw0 (hw0 as the key and the values in the column as the values) for other modifications?
I'm unsure what you mean by "count the column".
An easy way to read the data in would use pandas, which was designed for just this sort of manipulation. This creates a pandas DataFrame from your data using the first row as titles.
In [374]: import pandas as pd
In [375]: d = pd.read_csv("30735293.csv")
In [376]: d
Out[376]:
   triggerid  timestamp  hw0  hw1  hw2  hw3
0          1        234  343  434   78   56
1          2        454   22   90   44   76
You can select one of the columns using a single column name, and multiple columns using a list of names:
In [377]: d[["triggerid", "timestamp", "hw3"]]
Out[377]:
   triggerid  timestamp  hw3
0          1        234   56
1          2        454   76
You can also adjust the indexing so that one or more of the data columns are used as index values:
In [378]: d1 = d.set_index("hw0"); d1
Out[378]:
     triggerid  timestamp  hw1  hw2  hw3
hw0
343          1        234  434   78   56
22           2        454   90   44   76
Using the .loc attribute you can retrieve a series for any indexed row:
In [390]: d1.loc[343]
Out[390]:
triggerid 1
timestamp 234
hw1 434
hw2 78
hw3 56
Name: 343, dtype: int64
You can use the column names to retrieve the individual column values from that one-row series:
In [393]: d1.loc[343]["triggerid"]
Out[393]: 1
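Since the stated goal is to print the selected columns to an output file, the same column selection can be written straight back out with to_csv; the output file name below is just a placeholder:
# Keep only the wanted columns and write them without the integer index.
d[["triggerid", "timestamp", "hw3"]].to_csv("output.csv", index=False)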
Since you already have a solution for the slices, here's something for the hash table part of the question:
import csv

with open('/path/to/file.csv', 'rb') as fin:
    ht = {}
    cr = csv.reader(fin)
    k = cr.next()[2]
    ht[k] = list()
    for line in cr:
        ht[k].append(line[2])
I used a different approach (using the .index function):
bpt_mode = ["bpt_mode_64", "bpt_mode_128"]

with open('StripValues.csv') as file:
    stats = next(file).split(",")  # header row, used to look up column positions
    for line in file:
        stat_values = line.split(",")
        draw_id = stats.index('trigger_id')
        print stat_values[stats.index('trigger_id')], ',',
        for j in range(len(bpt_mode)):
            print stat_values[stats.index('hw.gpu.s0.ss0.dg.' + bpt_mode[j])], ',',
@holdenweb Though I am unable to figure out how to print the output to a file; currently I am redirecting the output while running the script.
Can you provide a solution for writing to a file? There will be multiple writes to a single file.
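One possible sketch for the file-writing part, using the csv module so repeated writes all go to the same open handle; the output file name and the rows variable are placeholders for whatever the loop above produces:
import csv

# Open the output file once and keep writing rows to it ('wb' for the csv module on Python 2).
with open('output.csv', 'wb') as fout:
    writer = csv.writer(fout)
    writer.writerow(['triggerid', 'timestamp', 'hw3'])  # header
    for row in rows:  # rows: an iterable of already-sliced value lists
        writer.writerow(row)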

Python Pandas read_csv issue

I have a simple CSV file that looks like this:
inches,12,3,56,80,45
tempF,60,45,32,80,52
I read in the CSV using this command:
import pandas as pd
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0)
Which results in this structure:
         1   2   3   4   5
0
inches  12   3  56  80  45
tempF   60  45  32  80  52
But I want this (unnamed index column):
         0   1   2   3   4
inches  12   3  56  80  45
tempF   60  45  32  80  52
EDIT: As @joris pointed out, additional methods can be run on the resulting DataFrame to achieve the desired structure. My question is specifically about whether or not this structure can be achieved through read_csv arguments.
from the documentation of the function:
names : array-like
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
so, apparently:
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0, names=range(5))
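For completeness, if the post-processing route that the question's edit already mentions is acceptable, a short sketch of that alternative is to rename the columns and clear the index name after reading:
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0)
pd_obj.columns = range(len(pd_obj.columns))  # 0, 1, 2, 3, 4
pd_obj.index.name = None                     # removes the leftover '0' above the index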

Pandas dataframe applying NA to part of the data

Let me preface this with: I am new to using pandas, so I'm sorry if this question is basic or has been answered before; I looked online and couldn't find what I needed.
I have a dataframe that consists of a baseball team's schedule. Some of the games have already been played, and as a result the results of those games appear in the dataframe. However, for games that are yet to happen, there is only the time they are to be played (e.g. 1:35 pm).
So, I would like to convert all of the values for the games yet to happen into NAs.
Thank you
As requested here is what the results dataframe for the Arizona Diamondbacks contains
print MLB['ARI']
0 0
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 1
9 0
10 1
...
151 3:40 pm
152 8:40 pm
153 8:10 pm
154 4:10 pm
155 4:10 pm
156 8:10 pm
157 8:10 pm
158 1:10 pm
159 9:40 pm
160 8:10 pm
161 4:10 pm
Name: ARI, Length: 162, dtype: object
Couldn't figure out any direct solution, only an iterative one:
for i in xrange(len(MLB)):
    if 'pm' in MLB['ARI'].iat[i] or 'am' in MLB['ARI'].iat[i]:
        MLB['ARI'].iat[i] = np.nan
This should work if your actual values (1s and 0s) are also strings. If they are numbers, try:
for i in xrange(len(MLB)):
    if type(MLB['ARI'].iat[i]) != type(1):
        MLB['ARI'].iat[i] = np.nan
The more idiomatic way to do this would be with the vectorised string methods.
http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods
mask = MLB['ARI'].str.contains('pm') #create boolean array
MLB['ARI'][mask] = np.nan #the column name goes first
Create the boolean array and then use it to select the data you want.
Make sure that the column name goes before the masking array, otherwise you'll be acting on a copy of the data and your original dataframe won't get updated.
MLB['ARI'][mask] #returns a view on the MLB dataframe, will be updated
MLB[mask]['ARI'] #returns a copy of MLB, won't be updated
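A hedged variant of the same masking idea that also catches 'am' times and assigns through .loc, so the update definitely lands on the original frame rather than a copy:
import numpy as np

# True for any entry that still holds a game time rather than a result;
# na=False keeps the integer entries (which str.contains cannot match) as False.
mask = MLB['ARI'].str.contains('am|pm', na=False)

# .loc assigns back into MLB itself, avoiding the view-versus-copy ambiguity.
MLB.loc[mask, 'ARI'] = np.nan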