I am trying to join some tuples vertically, or rather, to insert these tuples as rows into a DataFrame, but I have been unable to do so. The problem is that they end up being added horizontally rather than vertically.
import pandas

data_frame = pandas.DataFrame(columns=("A","B","C","D"))
str1 = "Doodles are the logo-incorporating works of art that Google regularly features on its homepage. They began in 1998 with a stick figure by Google co-founders Larry Page and Sergey Brin -- to indicate they were attending the Burning Man festival. Since then the doodles have become works of art -- some of them high-tech and complex -- created by a team of doodlers. Stay tuned here for more of this year's doodles"
aa = str1.split()
bb = zip(aa[0:4])
data_frame.append(bb,ignore_index=True,verify_integrity=False)
Is it possible, or do I have to iterate over each word in the tuple and use insert?
You could do this:
In [8]: index=list('ABCD')
In [9]: df = pd.DataFrame(columns=index)
In [11]: df.append(pd.Series(aa[0:4], index=index), ignore_index=True)
Out[11]:
A B C D
0 Doodles are the logo-incorporating
Alternatively, if you have many of these rows that you are going to append, just
collect them in a list, then call DataFrame(list_of_rows) once at the end:
In [13]: DataFrame([ aa[0:4], aa[5:8] ],columns=list('ABCD'))
Out[13]:
A B C D
0 Doodles are the logo-incorporating
1 of art that None
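If you are building up many rows this way, a minimal sketch of that pattern (the four-word slices here are only an illustration) is to collect the rows in a plain Python list and construct the DataFrame once at the end:
import pandas as pd

str1 = "Doodles are the logo-incorporating works of art that Google regularly features on"
aa = str1.split()

# collect each row (a list of four words) in a plain Python list ...
rows = [aa[i:i + 4] for i in range(0, len(aa), 4)]

# ... and build the DataFrame in a single call, which is much cheaper than
# appending row by row
df = pd.DataFrame(rows, columns=list('ABCD'))
print(df)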
Related
I would like to do an aspect analysis of user reviews. The reviews contain various aspects, so they need to be split into sentences. I save the data in a pandas DataFrame and split the sentences with the nltk library.
I put the resulting sentences into a list that I want to turn into a DataFrame and join to the original DataFrame. However, I get an error. Instead of one extra column, I get 19 new columns (the individual sentences are not stored in one cell; I think every single sentence gets its own column). I also tested itertools, but that also gives a wrong result.
Can someone help me get the right format?
I would like to have a new DataFrame that looks like this:
U_REVIEW                                                                                        | SENTENCES
------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------
Im a Sentence. Iam another Sentence in a Row.                                                   | [u'Im a Sentence', u'Iam another Sentence in a Row.']
Here we go, next Sentence. Blub, more blubs.                                                    | [u'Here we go, next Sentence.', u'Blub, more blubs.']
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome. | [u'Once again, more Sentence.', u'And some other information.', u'The Restaurant was ok, but not awesome.']
Here is what my code looks like:
ta = ta[['U_REVIEW']]
Output:
U_REVIEW
Im a Sentence. Iam another Sentence in a Row.
Here we go, next Sentence. Blub, more blubs.
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.
# the empty lists
sentences = []
ss = []

for sentence in ta['U_REVIEW']:
    # separates the review into sentences
    sentence = sent_tokenize(sentence)
    sentences.append(sentence)

test = itertools.chain(sentences)

# new dataframe to add the sentences
df2 = pd.DataFrame(sentences)

# create column
cols2 = ['REVIEW_SENTENCES']

# bring the two dataframes together
df2 = pd.DataFrame(sentences, columns=cols2)
Output of sentences:
[[u'Im a Sentence', u'Iam another Sentence in a Row.'], [u'Here we go, next Sentence.', u'Blub, more blubs.'], [u'Once again, more Sentence.', u'And some other information.', u'The Restaurant was ok, but not awesome.']]
Output of test:
<itertools.chain object at 0x000000001316DC18>
Output and Information of the new Dataframe df2:
AssertionError: 1 columns passed, passed data had 19 columns
U_REVIEW | 0 | 1 | 2 ...
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Im a Sentence. Iam another Sentence in a Row. |Im a Sentence |Iam another Sentence in a Row. |
Here we go, next Sentence. Blub, more blubs. |Here we go, next Sentence.|Blub, more blubs. |
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.|Once again, more Sentence.|And some other information. |The Restaurant was ok, but not awesome.
Here is a test DataFrame:
import pandas as pd
ta = pd.DataFrame(['Im a Sentence. Iam another Sentence in a Row',
                   'Here we go, next Sentence. Blub, more blubs.',
                   'Once again, more Sentence. And some other information. The Restaurant was ok, but not awsome.'])
ta.columns =['U_REVIEW']
Try this. I have done it in Python 3.5; I think it should work for 2.5 also:
In [45]: df = pd.DataFrame(ta.U_REVIEW.str.split('.',expand=True).replace('',np.nan).fillna(np.nan).values.flatten()).dropna()
In [46]: df
Out[46]:
0
0 Im a Sentence
1 Iam another Sentence in a Row
4 Here we go, next Sentence
5 Blub, more blubs
8 Once again, more Sentence
9 And some other information
10 The Restaurant was ok, but not awsome
Is this what you want:
ta.U_REVIEW.str.split('.',expand=True)
Out[50]:
0 1 \
0 Im a Sentence Iam another Sentence in a Row
1 Here we go, next Sentence Blub, more blubs
2 Once again, more Sentence And some other information
2 3
0 None None
1 None
2 The Restaurant was ok, but not awsome
or
In [52]: ta.U_REVIEW.str.split('.').apply(list)
Out[52]:
0 [Im a Sentence, Iam another Sentence in a Row]
1 [Here we go, next Sentence, Blub, more blubs, ]
2 [Once again, more Sentence, And some other in...
Name: U_REVIEW, dtype: object
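If the goal is the single SENTENCES column from the question, a minimal sketch (assuming nltk's sent_tokenize, as in the original code) is to apply the tokenizer directly, so that each cell holds one list of sentences:
import pandas as pd
from nltk.tokenize import sent_tokenize   # requires the nltk 'punkt' data

ta = pd.DataFrame(['Im a Sentence. Iam another Sentence in a Row.',
                   'Here we go, next Sentence. Blub, more blubs.'],
                  columns=['U_REVIEW'])

# apply() stores one list of sentences per row, so the result is a single
# new column instead of one column per sentence
ta['SENTENCES'] = ta['U_REVIEW'].apply(sent_tokenize)
print(ta)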
I am struggling to find an efficient way of retrieving the solution to an optimization problem. The solution consists of around 200K variables that I would like in a pandas DataFrame. After searching online, the only approach I found for accessing the variables was through a for loop, which looks something like this:
instance = M.create_instance('input.dat') # reading in a datafile
results = opt.solve(instance, tee=True)
results.write()
instance.solutions.load_from(results)
for v in instance.component_objects(Var, active=True):
    print("Variable", v)
    varobject = getattr(instance, str(v))
    for index in varobject:
        print("  ", index, varobject[index].value)
I know I can use this for loop to store them in a dataframe but this is pretty inefficient.
I found out how to access the indexes by using
import pandas as pd
index = pd.DataFrame(instance.component_objects(Var, active=True))
But I don't know how to get the solution values.
There is actually a very simple and elegant solution, using the method pandas.DataFrame.from_dict combined with the Var.extract_values() method.
from pyomo.environ import *
import pandas as pd
m = ConcreteModel()
m.N = RangeSet(5)
m.x = Var(m.N, rule=lambda _, el: el**2) # x = [1,4,9,16,25]
df = pd.DataFrame.from_dict(m.x.extract_values(), orient='index', columns=[str(m.x)])
print(df)
yields
x
1 1
2 4
3 9
4 16
5 25
Note that for Var we can use both get_values() and extract_values(); they seem to do the same thing. For Param there is only extract_values().
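As a quick sanity check on the toy model above (a small sketch, not an exhaustive comparison), both accessors return the same {index: value} dict:
# on the toy model above, both accessors return the same mapping
assert m.x.get_values() == m.x.extract_values()
print(m.x.get_values())   # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}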
Of course you can use instance.some_var.pprint() to print it on the screen.
But if you have a variable indexed by a large set, you can also write it to a
separate file. The following code writes the result to a .txt file:
f = open('Result.txt', 'a')
instance.some_var.pprint(f)
f.close()
I had the same issue as Jasper and tried the suggested solutions. By doing so I noticed that the part writing the results takes most of the time. Maybe this is also true in Jasper's case.
results.write()
instance.solutions.load_from(results)
So I suggest suppressing these two lines if you can do so. Maybe someone has a suggestion for how to speed this up? Or an alternative method.
Also I saw that in this post (Pyomo: Save results to CSV files) the "for loop" method is recommended. A Pyomo developer states: "I think it's possible in option 2 for the indices and the variable slice to be iterated over in a different order which would invalidate your resulting array."
For simplicity of code and to largely avoid for-loops, I found the pyomoio module in the urbs project, which has taken over the slightly deprecated code of pandaspyomo.py. It relies on each pyomo object's iteritems() method and handles multiple dimensions elegantly. It can extract sets, parameters, and variables as pandas objects.
If I set up a small pyomo model
from pyomo.environ import *
import pyomoio as po
import pandas as pd
# Define a model with 200k values
m = ConcreteModel()
m.ix = RangeSet(200000)
def idem(model, i):
    return i

m.a = Param(m.ix, rule=idem)
I can read in the parameter with just one line of code
%%timeit
a_po = po.get_entity(m, 'a')
# 110 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, if I compare it to the approach in the original question, it is not faster; if anything, it is a little slower:
%%timeit
val = []
ix = []
varobject = getattr(m, 'a')
for index in varobject:
    ix.append(index)
    val.append(varobject[index])
a = pd.Series(index=ix, data=val)
# 92.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This post shows how to get the dependencies of a block of text in CoNLL format with spaCy's taggers. This is the solution posted:
import spacy
nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,          # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,      # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_       # Relation
        ))
It outputs this block:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
I would like to get the same output WITHOUT using doc.sents.
Indeed, I have my own sentence splitter. I would like to use it, and then give spaCy one sentence at a time to get POS, NER, and dependencies.
How can I get POS, NER, and dependencies of one sentence in CoNLL format with spaCy without having to use spaCy's sentence splitter?
A Document in spaCy is iterable, and the documentation states that it iterates over Tokens:
| __iter__(...)
| Iterate over `Token` objects, from which the annotations can be
| easily accessed. This is the main way of accessing `Token` objects,
| which are the main way annotations are accessed from Python. If faster-
| than-Python speeds are required, you can instead access the annotations
| as a numpy array, or access the underlying C data directly from Cython.
|
| EXAMPLE:
| >>> for token in doc
Therefore I believe you would just have to make a Document for each of your sentences that are split, then do something like the following:
def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # the doc holds a single sentence here, so offsets are
            # relative to the start of the doc
            head_idx = word.head.i - doc[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,          # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,      # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_       # Relation
        ))
Of course, following the CoNLL format you would have to print a newline after each sentence.
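For instance, with a hypothetical my_sentence_splitter that returns a list of sentence strings, the usage could look roughly like this:
# my_sentence_splitter is your own splitter (hypothetical here) and is
# assumed to return a list of sentence strings
for sentence in my_sentence_splitter(text):
    printConll(sentence)
    print()  # blank line between sentences, as the CoNLL format expects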
This post is about a user facing unexpected sentence breaks when using spaCy's sentence boundary detection. One of the solutions proposed by the developers at spaCy (as in the post) is to add the flexibility to define one's own sentence boundary detection rules. This problem is solved in conjunction with dependency parsing by spaCy, not before it. Therefore, I don't think what you're looking for is supported by spaCy at the moment, though it might be in the near future.
@ashu's answer is partly right: dependency parsing and sentence boundary detection are tightly coupled by design in spaCy. However, there is a simple sentencizer.
https://spacy.io/api/sentencizer
It seems the sentencizer just uses punctuation (not a perfect approach), but if such a sentencizer exists then you can create a custom one using your own rules, and it will definitely affect the sentence boundaries.
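As a rough illustration, here is a minimal sketch of such a custom component, assuming the spaCy 2.x pipeline API; the semicolon rule is only a placeholder for your own splitting logic:
import spacy

nlp = spacy.load('en')

def custom_boundaries(doc):
    # placeholder rule: start a new sentence after every semicolon
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

# the component must run before the parser so that the parser
# respects the boundaries it sets
nlp.add_pipe(custom_boundaries, before='parser')

doc = nlp(u'Bob bought the pizza; Alice ate it.')
print([sent.text for sent in doc.sents])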
I am very new to programming and am working with Python. For a work project I am trying to read several .csv files, convert them to data frames, concatenate some of the fields into one for a column header, and then append all of the dataframes into one big DataFrame. I have searched extensively in StackOverflow as well as in other resources but I have not been able to find an answer. Here is the code I have thus far along with some abbreviated output:
import pandas as pd
import glob
# Read a directory of files to a list
csvlist = []
for f in glob.glob("AssayCerts/*"):
    csvlist.append(f)
csvlist
['AssayCerts/CH09051590.csv', 'AssayCerts/CH09051591.csv', 'AssayCerts/CH14158806.csv', 'AssayCerts/CH14162453.csv', 'AssayCerts/CH14186004.csv']
# Read .csv files and convert to DataFrames
dflist = []
for csv in csvlist:
    df = pd.read_csv(filename, header = None, skiprows = 7)
    dflist.append(df)
dflist
[ 0 1 2 3 4 5 \
0 NaN Au-AA23 ME-ICP41 ME-ICP41 ME-ICP41 ME-ICP41
1 SAMPLE Au Ag Al As B
2 DESCRIPTION ppm ppm % ppm ppm
#concatenates the cells in the first three rows of the last dataframe; need to apply this to all of the dataframes.
for df in dflist:
    column_names = df.apply(lambda x: str(x[1]) + '-' + str(x[2]) + ' - ' + str(x[0]), axis=0)
column_names
0 SAMPLE-DESCRIPTION - nan
1 Au-ppm - Au-AA23
2 Ag-ppm - ME-ICP41
3 Al-% - ME-ICP41
I am unable to apply the last operation across all of the DataFrames. It seems I can only get it to apply to the last DataFrame in my list. Once I get past this point I will have to append all of the DataFrames to form one large DataFrame.
As Andy Hayden mentions in his comment, the reason your loop only appears to work on the last DataFrame is that you just keep assigning the result of df.apply( ... ) to column_names, which gets written over each time. So at the end of the loop, column_names always contains the results from the last DataFrame in the list.
But you also have some other problems in your code. In the loop that begins for csv in csvlist:, you never actually reference csv - you just reference filename, which doesn't appear to be defined. And dflist just appears to have one DataFrame in it anyway.
As written in your problem, the code doesn't appear to work. I'd advise posting the real code that you're using, and only what's relevant to your problem (i.e. if building csvlist is working for you, then you don't need to show it to us).
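For what it's worth, here is a minimal sketch of how the loop might look once the loop variable is actually used and the generated names are applied to each DataFrame; the three header rows are assumed from the abbreviated output above:
import glob
import pandas as pd

dflist = []
for csv_path in glob.glob("AssayCerts/*"):
    # use the loop variable instead of the undefined `filename`
    df = pd.read_csv(csv_path, header=None, skiprows=7)

    # build column names from the first three rows, then drop those rows
    df.columns = df.apply(lambda x: str(x[1]) + '-' + str(x[2]) + ' - ' + str(x[0]), axis=0)
    df = df.iloc[3:]

    dflist.append(df)

# one big DataFrame at the end
big_df = pd.concat(dflist, ignore_index=True)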
Taking another crack at an older question of mine since I still do not understand how to properly do what I want.
I have data stored in a dataframe and need to extract averaged chunks of it to use later. My index is datetime values, but this is not terribly important. Unfortunately, I cannot do a simple df.resample() operation, since the data I need to extract is not regularly spaced. Example:
import pandas as pd
from numpy import *
# Build example dataframe
df = pd.DataFrame(data=random.rand(10,3),index=None,columns=list('ABC'))
# Build dummy dataframe to store averaged data from "df"
dummy = pd.DataFrame(columns=df.columns)
# Perform averaging of "df"
for r in xrange(1,10,2):
    ave = df.ix[r-1:r+1].mean()
    # Store averaged data in dummy dataframe
    # Here is where I hit my problem, since ave is a Series
    dummy = dummy.append(ave)
I cannot append a Series to a DataFrame.
I can work around this by converting ave to a dictionary, then appending it to dummy:
for r in xrange(1,10,2):
    ave = df.ix[r-1:r+1].mean().to_dict()
    ave = pd.DataFrame(ave, index=[r])
    dummy = dummy.append(ave)
First: does my overall goal make sense?
Second: Is there a better way to achieve this? Converting to dictionary, then dataframe, then appending seems kludgey, but it is the best I have.
Begin Edit
unutbu raised a good point. As written, rolling_mean() will work. But I am interested in only a very few rows of data; everything else is considered garbage.
# Now creating larger dataframe for illustration
df = pd.DataFrame(data=random.rand(10000,3),index=None,columns=list('ABC'))
# Now, most of the data are not averaged
for r in xrange(1,10000,50):
    ave = df.ix[r-1:r+1].mean().to_dict()
    ave = pd.DataFrame(ave, index=[r])
The main problem I have with my examples is showing the irregularity with which the averaging is done. The averaging is event driven (i.e. if something happened at 2013-01-01 14:23, then average the data about 2013-01-01 14:23 +/- 2.5 min).
Unfortunately, the data timestamps are also highly irregular, which makes rolling_mean() ineffective in this case. So I have irregular events determining when I should average my irregularly recorded data, making a nice problem.
I can achieve what I want, but only by converting ave from a Series to a dictionary, then to a DataFrame. Perhaps in this case "good enough" should be left alone.
End Edit
It sounds like what you are looking for is pd.rolling_mean:
import pandas as pd
import numpy as np
np.random.seed(1)
# Build example dataframe
df = pd.DataFrame(data=np.random.rand(10,3), index=None, columns=list('ABC'))
print(df)
# A B C
# 0 0.417022 0.720324 0.000114
# 1 0.302333 0.146756 0.092339
# 2 0.186260 0.345561 0.396767
# 3 0.538817 0.419195 0.685220
# 4 0.204452 0.878117 0.027388
# 5 0.670468 0.417305 0.558690
# 6 0.140387 0.198101 0.800745
# 7 0.968262 0.313424 0.692323
# 8 0.876389 0.894607 0.085044
# 9 0.039055 0.169830 0.878143
dummy = pd.rolling_mean(df, window=3).dropna()
print(dummy)
yields
A B C
2 0.301872 0.404214 0.163073
3 0.342470 0.303837 0.391442
4 0.309843 0.547624 0.369792
5 0.471245 0.571539 0.423766
6 0.338436 0.497841 0.462274
7 0.593039 0.309610 0.683919
8 0.661679 0.468711 0.526037
9 0.627902 0.459287 0.551836
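As a side note, pd.rolling_mean was later removed from pandas; in current versions the same result comes from the DataFrame's own rolling method:
# equivalent in modern pandas (the top-level rolling_mean function is gone)
dummy = df.rolling(window=3).mean().dropna()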
Here's another way with a datelike index.
In [67]: df = pd.DataFrame(data=np.random.rand(10,3), index=None, columns=list('ABC'))
In [68]: df
Out[68]:
A B C
0 0.417022 0.720324 0.000114
1 0.302333 0.146756 0.092339
2 0.186260 0.345561 0.396767
3 0.538817 0.419195 0.685220
4 0.204452 0.878117 0.027388
5 0.670468 0.417305 0.558690
6 0.140387 0.198101 0.800745
7 0.968262 0.313424 0.692323
8 0.876389 0.894607 0.085044
9 0.039055 0.169830 0.878143
This is a regular index, but pretend that it is irregular in time:
In [69]: df.index = pd.date_range('20130101 09:00:58', periods=10, freq='s')
In [70]: df
Out[70]:
A B C
2013-01-01 09:00:58 0.417022 0.720324 0.000114
2013-01-01 09:00:59 0.302333 0.146756 0.092339
2013-01-01 09:01:00 0.186260 0.345561 0.396767
2013-01-01 09:01:01 0.538817 0.419195 0.685220
2013-01-01 09:01:02 0.204452 0.878117 0.027388
2013-01-01 09:01:03 0.670468 0.417305 0.558690
2013-01-01 09:01:04 0.140387 0.198101 0.800745
2013-01-01 09:01:05 0.968262 0.313424 0.692323
2013-01-01 09:01:06 0.876389 0.894607 0.085044
2013-01-01 09:01:07 0.039055 0.169830 0.878143
Take every 3s of data (whether it's there or not) and mean it (or you could do something fancier if you want). There are a bunch more options (e.g. which side to include, where to put the labels, etc.; see here).
In [71]: df.resample('3s',how=lambda x: x.mean())
Out[71]:
A B C
2013-01-01 09:00:57 0.359677 0.433540 0.046226
2013-01-01 09:01:00 0.309843 0.547624 0.369792
2013-01-01 09:01:03 0.593039 0.309610 0.683919
2013-01-01 09:01:06 0.457722 0.532219 0.481593
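Since the edit in the question describes event-driven windows (average the data within +/- 2.5 min of each event time) rather than a fixed grid, here is a minimal sketch of that idea; the event timestamps are hypothetical and the DataFrame is assumed to have a sorted DatetimeIndex:
import pandas as pd

# hypothetical event times; in practice these come from the event log
events = pd.to_datetime(['2013-01-01 09:01:00', '2013-01-01 09:01:05'])
window = pd.Timedelta(minutes=2.5)

# slice the sorted DatetimeIndex around each event and average the slice
rows = {t: df.loc[t - window:t + window].mean() for t in events}
averaged = pd.DataFrame(rows).T   # one row per event
print(averaged)
(In modern pandas the how= argument shown above has also been removed; df.resample('3s').mean() is the equivalent.)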