How do I import a CSV file? - python-2.7

I am new to using Python and pandas, and I am trying to import a CSV or text file into a list of quoted ticker symbols, like
sp500 = ['appl', 'ibm', 'csco']
df = pd.read_csv('C:\\data\\stock.txt', index_col=[0])
df
which gets me:
Out[20]:
Empty DataFrame
Columns: []
Index: [AAPL, IBM, CSCO]
Any help would be great.
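No accepted answer is shown here, but a minimal sketch would be to read the file with header=None so the first ticker isn't consumed as a column name, then pull the column out as a list. (The file contents below are an assumption standing in for C:\data\stock.txt, based on the index shown in the question's output.)

```python
import pandas as pd

# Demo file standing in for C:\data\stock.txt: one ticker per line, no header
with open('stock.txt', 'w') as f:
    f.write('AAPL\nIBM\nCSCO\n')

# header=None stops read_csv from treating the first ticker as a header row
df = pd.read_csv('stock.txt', header=None, names=['ticker'])
sp500 = df['ticker'].tolist()
print(sp500)  # ['AAPL', 'IBM', 'CSCO']
```

The empty DataFrame in the question comes from index_col=[0]: with only one column in the file, that column becomes the index and nothing is left over as data.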

Related

Adding constant values at the beginning of a dataframe in pyspark

I am trying to read a CSV file from an HDFS location, and three columns (batchid, load timestamp and a delete indicator) need to be added at the beginning. I am using Spark 2.3.2 and Python 2.7.5. Sample values for the three columns to be added are given below.
batchid- YYYYMMdd (int)
Load timestamp - current timestamp (timestamp)
delete indicator - blank (string)
Your question is a little bit obscure, but you can do something along these lines. First, create your timestamp using Python's standard library:
import time
import datetime
timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
Then, assuming you use the DataFrame API, plug that into a column:
import pyspark.sql.functions as psf
df = (df
      .withColumn('time',
                  psf.unix_timestamp(
                      psf.lit(timestamp), 'yyyy-MM-dd HH:mm:ss'
                  ).cast("timestamp"))
      # date_format returns a string; cast to int to match the batchid spec
      .withColumn('batchid',
                  psf.date_format('time', 'yyyyMMdd').cast('int'))
      .withColumn('delete', psf.lit('')))
To reorder your columns:
df = df.select(["time", "batchid", "delete"] + [k for k in df.columns if k not in ["time", "batchid", "delete"]])
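The integer batchid and the timestamp can also be built in plain Python before they ever touch Spark; a small sketch (assuming the batch date is simply the current date) is:

```python
import datetime

now = datetime.datetime.now()
# YYYYMMdd as an int, e.g. 20181123
batchid = int(now.strftime('%Y%m%d'))
# Matches the 'yyyy-MM-dd HH:mm:ss' pattern used above
timestamp = now.strftime('%Y-%m-%d %H:%M:%S')
print(batchid, timestamp)
```

Both values can then be injected as literal columns with psf.lit, exactly as in the answer above.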

create dataframe by randomly sampling from multiple files

I have a folder with several 20 million record tab delimited files in it. I would like to create a pandas dataframe where I randomly sample say 20 thousand records from each file, and then append them together in the dataframe. Does anyone know how to do that?
You could read in all the text files in a particular folder and then make use of pandas DataFrame.sample.
I've provided a fully reproducible example: two example .txt files are created with 200 rows each, then a random sample of ten rows is taken from each and appended to a final dataframe.
import pandas as pd
import numpy as np
import glob

# Change the path for the directory
directory = r'C:\some\example\folder'

# Create two test .txt files with 200 rows each for demonstration purposes
df_test = pd.DataFrame(np.random.randn(200, 2), columns=list('AB'))
df_test.to_csv(directory + r'\test_1.txt', sep='\t', index=False)
df_test.to_csv(directory + r'\test_2.txt', sep='\t', index=False)

df = pd.DataFrame()
for filename in glob.glob(directory + r'\*.txt'):
    df_full = pd.read_csv(filename, sep='\t')
    df_sample = df_full.sample(n=10)
    df = df.append(df_sample)
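With 20-million-row files, reading each file in full just to keep 20 thousand rows can be slow. One alternative sketch (assuming each file has a single header row) uses skiprows so pandas only parses the sampled rows; the helper below is hypothetical, not from the original answer:

```python
import random
import pandas as pd

def sample_rows(path, n, sep='\t'):
    # Count data rows without loading the whole file into memory
    with open(path) as f:
        total = sum(1 for _ in f) - 1  # subtract the header line
    # Skip every data row except a random sample of n of them;
    # skiprows takes 0-indexed line numbers, line 0 is the header
    skip = sorted(random.sample(range(1, total + 1), total - n))
    return pd.read_csv(path, sep=sep, skiprows=skip)
```

You would then glob over the folder as above and append (or pd.concat) the per-file samples.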

How to create a bag of words from csv file in python?

I am new to python. I have a csv file which has cleaned tweets. I want to create a bag of words of these tweets.
I have the following code but it's not working correctly.
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
data = pd.read_csv(open("Twidb11.csv"), sep=' ')
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data.Text)
count_vect.vocabulary_
Error:
.ParserError: Error tokenizing data. C error: Expected 19 fields in
line 5, saw 22
I think this is a duplicate of an existing question, which has a lot of answers and comments.
So, the solution can be:
data = pd.read_csv('Twidb11.csv', error_bad_lines=False)
Or:
df = pandas.read_csv(fileName, sep='delimiter', header=None)
"In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indeces for each field {0,1,2,...}."

NLTK applied to dataframes , how to iterate through list

Apologies in advance as this is my first question. I am using nltk to tokenize a series of tweets from a csv that I have loaded into a df. The tokenization works fine and outputs something like this [[My, uncle, ...]] into a cell in a df. I want to then apply a POS tagger to the tokenized text for the whole column of the df. I use the code below to do it. The line I am having difficulty with is df['tagged'] = df['tokenized'].apply(lambda row: [nltk.pos_tag(row) for item in row]). I know that I am iterating on the wrong element (row versus item) but can't figure out the correct way to do it. The code is below:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize,wordpunct_tokenize
from nltk.tag import pos_tag
read_test = pd.read_csv("simontwittertest.csv")
df = read_test
df['tokenized'] = df['content'].apply(lambda row: [nltk.wordpunct_tokenize(row) for item in row])
df['tagged'] = df['tokenized'].apply(lambda row: [nltk.pos_tag(row) for item in row])
print(df['tagged'])
Out of interest, I found a small bug where pos_tag only works with NLTK 3.1, not NLTK 3.2 (at least with Python 2.7).
Many thanks!
axis=1 only applies when you call apply on a DataFrame and want to work row by row; df['content'] is a Series, so there is no axis to choose and no list comprehension is needed. Pass the function directly so it runs once per cell:
df['tokenized'] = df['content'].apply(nltk.wordpunct_tokenize)
df['tagged'] = df['tokenized'].apply(nltk.pos_tag)
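As a sanity check of the per-cell apply pattern, here is a minimal example that doesn't need NLTK or its data downloads, with plain str.split standing in for the tokenizer:

```python
import pandas as pd

df = pd.DataFrame({'content': ['My uncle has a boat', 'nltk tags tokens']})
# apply runs the function once per cell of the Series; no axis argument exists here
df['tokenized'] = df['content'].apply(str.split)
print(df['tokenized'][0])  # ['My', 'uncle', 'has', 'a', 'boat']
```

Swapping str.split for nltk.wordpunct_tokenize (and then applying nltk.pos_tag to the tokenized column) follows the same shape.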

Python - Create An Empty Pandas DataFrame and Populate From Another DataFrame Using a For Loop

Using: Python 2.7 and Pandas 0.11.0 on Mac OSX Lion
I'm trying to create an empty DataFrame and then populate it from another dataframe, based on a for loop.
I have found that when I construct the DataFrame and then use the for loop as follows:
data = pd.DataFrame()
for item in cols_to_keep:
    if item not in dummies:
        data = data.join(df[item])
This results in an empty DataFrame that only has the headers of the columns I wanted from the other DataFrame.
That's because you are using join incorrectly.
You can use a list comprehension to restrict the DataFrame to the columns you want:
df[[col for col in cols_to_keep if col not in dummies]]
What about just creating a new frame based off of the columns you know you want to keep, instead of creating an empty one first?
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randn(5),
                   'b': np.random.randn(5),
                   'c': np.random.randn(5),
                   'd': np.random.randn(5)})
cols_to_keep = ['a', 'c', 'd']
dummies = ['d']
not_dummies = [x for x in cols_to_keep if x not in dummies]
data = df[not_dummies]
data
          a         c
0  2.288460  0.698057
1  0.097110 -0.110896
2  1.075598 -0.632659
3 -0.120013 -2.185709
4 -0.099343  1.627839
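An equivalent filtering can also be sketched with pandas Index set operations instead of a list comprehension (note that intersection/difference return the result in sorted order, which happens not to matter here):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 4), columns=list('abcd'))
cols_to_keep = ['a', 'c', 'd']
dummies = ['d']

# Same filtering as the comprehension: keep cols_to_keep, drop dummies
keep = df.columns.intersection(cols_to_keep).difference(dummies)
data = df[keep]
print(list(data.columns))  # ['a', 'c']
```

Either way, the key idea is the same: select the wanted columns from the source frame directly rather than joining them into an empty DataFrame, whose empty index makes every join produce zero rows.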