I am new to Python. I have a CSV file which contains cleaned tweets. I want to create a bag of words from these tweets.
I have the following code, but it's not working correctly.
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
data = pd.read_csv(open("Twidb11.csv"), sep=' ')
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data.Text)
count_vect.vocabulary_
Error:
.ParserError: Error tokenizing data. C error: Expected 19 fields in line 5, saw 22
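I think this is a duplicate. You can see the answer here; there are a lot of answers and comments.
So, a solution can be (on pandas 1.3 or later, pass on_bad_lines='skip' instead, since error_bad_lines was removed in pandas 2.0):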
data = pd.read_csv('Twidb11.csv', error_bad_lines=False)
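As a minimal sketch of this option, assuming a recent pandas (1.3 or later), the snippet below reads a small made-up CSV in which one row has an extra field and skips that row. The inline data and its column names are hypothetical stand-ins for Twidb11.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for Twidb11.csv: the third line has one field too many
# because of the unquoted comma inside the tweet text.
raw = "id,Text,lang\n1,good morning,en\n2,hello, world,en\n3,see you soon,en\n"

# on_bad_lines="skip" drops rows with more fields than the header
# instead of raising ParserError.
data = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

print(data.shape)
print(data["Text"].tolist())
```

Skipping rows silently drops those tweets, so if the extra fields come from unquoted commas in the text, fixing the quoting when the file is written is probably the safer route.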
Or:
df = pd.read_csv(fileName, sep='delimiter', header=None)
"In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}."
I have the following code that generates a histogram. How can I save the histogram automatically from the code? I tried what we do for other plot types, but that did not work for the histogram. a is a 'numpy.ndarray'.
a = [-0.86906864 -0.72122614 -0.18074998 -0.57190212 -0.25689268 -1.
0.68713553 0.29597819 0.45022949 0.37550592 0.86906864 0.17437203
0.48704826 0.2235648 0.72122614 0.14387731 0.94194514 ]
fig = pl.hist(a,normed=0)
pl.title('Mean')
pl.xlabel("value")
pl.ylabel("Frequency")
pl.savefig("abc.png")
This works for me:
import matplotlib.pyplot as pl
import numpy as np
a = np.array([-0.86906864, -0.72122614, -0.18074998, -0.57190212, -0.25689268, -1., 0.68713553, 0.29597819, 0.45022949, 0.37550592, 0.86906864, 0.17437203, 0.48704826, 0.2235648, 0.72122614, 0.14387731, 0.94194514])
# 'normed' was removed in Matplotlib 3.1; density=False is the equivalent and plots raw counts
fig = pl.hist(a, density=False)
pl.title('Mean')
pl.xlabel("value")
pl.ylabel("Frequency")
pl.savefig("abc.png")
a in the OP is not a numpy array, and its format also needs to be modified (it needs commas, not spaces, as delimiters). This program successfully saves the histogram in the working directory. If it still does not work, give it the full path to the location where you want to save the file, like this:
pl.savefig("/Users/atru/abc.png")
The pl.show() statement should not be placed before savefig(): it creates a new figure, which makes savefig() save a blank figure instead of the desired one, as explained in this post.
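The ordering can be sketched as follows, assuming a current Matplotlib. The example holds an explicit Figure object so that savefig is tied to that figure whatever is shown later. The shortened values and the temporary output path are illustrative only.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Shortened, illustrative sample values.
a = np.array([-0.869, -0.721, -0.181, -0.572, -0.257, -1.0,
              0.687, 0.296, 0.450, 0.376, 0.869, 0.174])

fig, ax = plt.subplots()
ax.hist(a)
ax.set_title("Mean")
ax.set_xlabel("value")
ax.set_ylabel("Frequency")

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "abc.png")
fig.savefig(out_path)  # save first; call plt.show() only after this line if needed
plt.close(fig)
```

Calling savefig on the figure object rather than on the pyplot module avoids depending on which figure happens to be current.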
My code below is supposed to print a graph/network using Networkx, Pandas and data from a CSV file. The code is (networkx3.py) -
import csv
import matplotlib.pyplot as plt
import pandas as pd
import networkx as nx
g = nx.Graph()
csv_dict = pd.read_csv('Book1.csv', index_col=[0])
csv_1 = csv_dict.values.tolist()
ini = 0
for row in csv_1:
    for i in row:
        if type(row[i]) is str:
            g.add_edge(ini, int(i), conn_prob=(float(row[i])))
    max_wg_ngs = sorted(g[ini].items(), key=lambda e: e[1]["conn_prob"], reverse=True)[:2]
    sarr = [str(a) for a in max_wg_ngs]
    print "Neighbours of Node %d are:" % ini
    #print(max_wg_ngs)
    for item in sarr:
        print ''.join(str(item))[1:-1]
    ini += 1
pos = nx.spring_layout(g, scale=100.)
nx.draw_networkx_nodes(g, pos)
nx.draw_networkx_edges(g, pos)
nx.draw_networkx_labels(g, pos)
#plt.axis('off')
plt.show()
The data in the CSV file is (Book1.csv) -
,1,2,3,4,5,6,7,8,9,10
1,0,0.257905291,0.775104118,0.239086843,0.002313744,0.416936603,0.194817214,0.163350301,0.252043807,0.251272559
2,0.346100279,0,0.438892758,0.598885794,0.002263231,0.406685237,0.523850975,0.257660167,0.206302228,0.161385794
3,0.753358102,0.222349243,0,0.407830809,0.001714776,0.507573592,0.169905687,0.139611318,0.187910832,0.326950557
4,0.185342928,0.571302688,0.51784403,0,0.003231018,0.295197533,0.216184462,0.153032751,0.216331326,0.317961522
5,0,0,0,0,0,0,0,0,0,0
6,0.478164621,0.418192795,0.646810223,0.410746629,0.002414973,0,0.609176897,0.203461461,0.157576977,0.636747837
7,0.24894327,0.522914349,0.33948832,0.316240267,0.002335929,0.639377086,0,0.410011123,0.540266963,0.587764182
8,0.234017887,0.320967208,0.285193773,0.258198079,0.003146737,0.224412057,0.411725737,0,0.487081815,0.469526333
9,0.302955306,0.080506624,0.261610132,0.22856311,0.001746979,0.014994905,0.63386228,0.486096957,0,0.664434415
10,0.232675407,0.121596312,0.457715027,0.310618067,0.001872929,0.57556548,0.473562887,0.32185564,0.482351246,0
The code however doesn't work. I don't understand where I'm going wrong. The error is -
Traceback (most recent call last):
File "networkx3.py", line 13, in <module>
if type(row[i]) is str:
TypeError: list indices must be integers, not float
I do not want to modify the CSV file or its data. The index column and header are supposed to be ignored.
I have previously asked this question but I did not get satisfactory answers. Can anybody help?
Thanks a lot in advance :) (Using Ubuntu 14.04 32-bit VM. Credits to #Adonis for helping in creating the original code)
A little late in answering my own question, but with some valuable help from #Joel and #Adonis, I finally figured out where I was going wrong.
The problem was in the 2nd for loop: it iterated over the values in each row (floats) and then used each value as a list index (row[i]), which raised the TypeError. Other minor changes produced output, but with nodes only and no edges.
Finally, using enumerate to get each value's position in the row, and using that position as the connecting node, gave the required output. The only changes are in the 2nd for loop and the if condition:
for row in csv_1:
    for idx, i in enumerate(row):
        if type(row[idx]) is float:
            g.add_edge(ini, idx, conn_prob=(float(row[idx])))
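For reference, here is a self-contained sketch of the same enumerate idea on a made-up 3x3 excerpt laid out like Book1.csv. Skipping self-loops and zero weights is an assumption added for this example and is not part of the fix above. Nodes are numbered from 0, as with ini and idx.

```python
import io

import networkx as nx
import pandas as pd

# Hypothetical 3x3 excerpt in the same layout as Book1.csv.
raw = ",1,2,3\n1,0,0.25,0.75\n2,0.35,0,0.45\n3,0.75,0.22,0\n"
rows = pd.read_csv(io.StringIO(raw), index_col=[0]).values.tolist()

g = nx.Graph()
for ini, row in enumerate(rows):
    for idx, weight in enumerate(row):
        # Assumption for this sketch: ignore self-loops and zero weights.
        if ini != idx and weight > 0:
            g.add_edge(ini, idx, conn_prob=float(weight))

# Print the two strongest neighbours of every node.
for node in sorted(g.nodes):
    top2 = sorted(g[node].items(), key=lambda e: e[1]["conn_prob"], reverse=True)[:2]
    print(node, [(nbr, attrs["conn_prob"]) for nbr, attrs in top2])
```

Because nx.Graph is undirected, the weight written for (j, i) replaces the one written earlier for (i, j). If both directions of the matrix matter, nx.DiGraph keeps them separate.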
Thanks to all the selfless guys at SOF for the help, couldn't have done it without you :)
I have a file like so that I am reading from excel:
Year Month Day
1 2 1
2 1 2
I want to specify the column width that Excel recognizes. I would like to do it in pandas, but I don't see an option. I have tried to do it with the StyleFrame module.
This is my code:
from StyleFrame import StyleFrame
import pandas as pd
df=pd.read_excel(r'P:\File.xlsx')
excel_writer = StyleFrame.ExcelWriter(r'P:\File.xlsx')
sf=StyleFrame(df)
sf=sf.set_column_width(columns=['Year', 'Month'], width=4.0)
sf=sf.set_column_width(columns=['Day'], width=6.00)
sf=sf.to_excel(excel_writer=excel_writer)
excel_writer.save()
But the formatting isn't saved when I open the new file.
Is there a way to do it in pandas? I would even take a pure python solution to this, pretty much anything that works.
As for your question on how to remove the headers, you can simply pass header=False to to_excel:
sf.to_excel(excel_writer=excel_writer, header=False)
Note that this will still result in the first line of the table being bold.
If you don't want that behavior, you can update to version 0.1.6, which I just released.
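On the original question of setting column widths without StyleFrame, one possible approach is to let pandas write through openpyxl and set the widths on the underlying worksheet. This is a sketch that assumes openpyxl is installed. The output path, the sheet name, and the mapping of A/B/C to Year/Month/Day are chosen for illustration.

```python
import os
import tempfile

import pandas as pd
from openpyxl import load_workbook

df = pd.DataFrame({"Year": [1, 2], "Month": [2, 1], "Day": [1, 2]})

# Hypothetical output path; replace with your own file.
out_path = os.path.join(tempfile.gettempdir(), "File_widths.xlsx")

with pd.ExcelWriter(out_path, engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
    ws = writer.sheets["Sheet1"]          # the underlying openpyxl worksheet
    ws.column_dimensions["A"].width = 4   # Year
    ws.column_dimensions["B"].width = 4   # Month
    ws.column_dimensions["C"].width = 6   # Day

# Read the widths back to confirm they were stored in the file.
check = load_workbook(out_path)["Sheet1"]
width_a = check.column_dimensions["A"].width
width_c = check.column_dimensions["C"].width
print(width_a, width_c)
```

The widths have to be set inside the with block, before the writer saves the workbook on exit.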
I have a script that processes an Excel file. The department that sends it has a new system that generates it, and my script stopped working.
I suddenly got the error Can only use .str accessor with string values, which use np.object_ dtype in pandas for the following line of code:
df['DATE'] = df['Date'].str.replace(r'[^a-zA-Z0-9\._/-]', '')
I checked the type of the date columns in the file from the old system (dtype: object) vs the file from the new system (dtype: datetime64[ns]).
How do I change the date format to something my script will understand?
I saw this answer but my knowledge about date formats isn't this granular.
You can use the apply function on the dataframe column to convert it to strings. For example:
df['DATE'] = df['Date'].apply(lambda x: x.strftime('%Y-%m-%d'))
No extra import is needed here: each cell in a datetime64[ns] column is a pandas Timestamp, which already provides strftime.
apply() evaluates one cell at a time and applies the formatting specified in the lambda function.
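If the script has to handle files from both the old and the new system, a dtype check can pick the right branch. The sketch below uses made-up data and the vectorised .dt.strftime in place of apply. The branch for string columns reuses the question's pattern with regex=True, which pandas 2.0 and later need in order to treat the pattern as a regular expression.

```python
import pandas as pd

# Hypothetical data mimicking the new system's output: 'Date' is datetime64[ns].
df = pd.DataFrame({"Date": pd.to_datetime(["2020-01-05", "2020-12-31"])})

if pd.api.types.is_datetime64_any_dtype(df["Date"]):
    # Vectorised equivalent of apply(lambda x: x.strftime('%Y-%m-%d')).
    df["DATE"] = df["Date"].dt.strftime("%Y-%m-%d")
else:
    # Old system: the column already holds strings, so the original cleanup applies.
    df["DATE"] = df["Date"].str.replace(r"[^a-zA-Z0-9\._/-]", "", regex=True)

print(df["DATE"].tolist())
```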
pd.to_datetime returns a Series of datetime64 dtype, as described here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
To get plain date objects from such a column, use the .dt accessor:
df['DATE'] = df['Date'].dt.date
or this (requires import datetime):
df['Date'].map(datetime.datetime.date)
You can use pd.to_datetime:
df['DATE'] = pd.to_datetime(df['DATE'])