Display/Print one column from a DataFrame of Series in Pandas - python-2.7

I created the following Series and DataFrame:
import pandas as pd
Series_1 = pd.Series({'Name': 'Adam','Item': 'Sweet','Cost': 1})
Series_2 = pd.Series({'Name': 'Bob','Item': 'Candy','Cost': 2})
Series_3 = pd.Series({'Name': 'Cathy','Item': 'Chocolate','Cost': 3})`
df = pd.DataFrame([Series_1,Series_2,Series_3], index=['Store 1', 'Store 2', 'Store 3'])
I want to display/print out just one column from the DataFrame (with or without the header row):
Either
Adam
Bob
Cathy
Or:
Sweet
Candy
Chocolate
I have tried the following code which did not work:
print(df['Item'])
print(df.loc['Store 1'])
print(df.loc['Store 1','Item'])
print(df.loc['Store 1','Name'])
print(df.loc[:,'Item'])
print(df.iloc[0])
Can I do it in one simple line of code?

By using to_string
print(df.Name.to_string(index=False))
Adam
Bob
Cathy

For printing the Name column
df['Name']

Not sure what you are really after but if you want to print exactly what you have you can do:
Option 1
print(df['Item'].to_csv(index=False))
Sweet
Candy
Chocolate
Option 2
for v in df['Item']:
print(v)
Sweet
Candy
Chocolate

Related

Query dataframe for key words and return matching rows [duplicate]

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 2 years ago.
I have the following dataframe:
pd.DataFrame({
'Code': ['XAW', 'PAK', 'I', 'QP', 'TOPZ', 'XAW', 'APOL'],
'Name': ['George Truck', 'Fred Williams', 'Jessica Weir', 'Tony P.', 'John Truck', 'Liz Moama', 'Emily Truck'],
'Color': ['Blue', 'Green', 'Green', 'Red', 'Pink', 'Blue', 'Pink']
})
Code Name Color
0 XAW George Truck Blue
1 PAK Fred Williams Green
2 I Jessica Weir Green
3 QP Tony P. Red
4 TOPZ John Truck Pink
5 XAW Liz Moama Blue
6 APOL Emily Truck Pink
Given a keyword, such as 'blue', I would like to retrieve the following rows:
0 XAW George Paul Blue
5 XAW Liz Moama Blue
The search can contain multiple keywords, for example, 'truck pink' would return:
4 TOPZ John Truck Pink
6 APOL Emily Truck Pink
Imagine that this dataframe has half a million rows and a few extra columns. Is there a fast way I can query the entire dataframe for specific keywords?
With search string s = 'truck pink', set up a search column:
t = (df['Name'] + ' ' + df['Color']).str.lower()
I force everything to lower case, because your search example doesn't seem to be case-sensitive. If you have dynamic search inputs, also force the search field to lower case. Then do searches for contains like so:
d = {}
for i in s.split(' '):
d[i] = t.str.contains(i, na=False)
I pass na=False because otherwise, Pandas will fill NA in cases where the string column is itself NA. We don't want that behaviour. The complexity of the operation increases rapidly with the number of search terms. Also consider changing this function if you want to match whole words, because contains matches sub-strings.
Regardless, take results and reduce them with bit-wise 'and'. You need two imports:
from functools import reduce
from operator import and_
df[reduce(and_, d.values())]
And thus:
Code Name Color
4 TOPZ John Truck Pink
6 APOL Emily Truck Pink

How to add multiple Sentences (which are stored in a list) into a pandas dataframe

I would like to create an aspect analysis from user reviews. The reviews contain various aspects and therefore the reviews need to be separated into sentences. I save the data in a pandas dataframe and separate the sentences with the nltk library.
I put the separate sentences in a list that I want to format into a dataframe and connect to the original dataframe. However, I get an error. Instead of an extra column, I get 19 new columns. (the individual sentences are not stored in a cell, I think every single sentence gets their own column) I also tested itertools but I also get a wrong record.
Can someone help me to get the right format?
I would like to have a new dataframe which looks like that:
U_REVIEW | SENTENCES
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Im a Sentence. Iam another Sentence in a Row. |[u'Im a Sentence', u'Iam another Sentence in a Row.']
Here we go, next Sentence. Blub, more blubs. |[u"Here weg o, next Sentence.", u'Blub, more blubs.']
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.|[u"Once again, more Sentence.", u'And some other information.',u’The Restaurant was ok, but not awesome.’]
That’s how my code looks like:
ta = ta[['U_REVIEW']]
Output:
U_REVIEW
Im a Sentence. Iam another Sentence in a Row.
Here we go, next Sentence. Blub, more blubs.
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.
# the empty lists
sentences = []
ss = []
for sentence in ta['U_REVIEW']:
# seperates the review into sentence
sentence = sent_tokenize(sentence)
sentences.append(sentence)
test = itertools.chain(sentences)
#new dataframe to add the Sentences
df2 = pd.DataFrame(sentences)
#create Column
cols2 = ['REVIEW_SENTENCES']
# bring the two dataframes together
df2 = pd.DataFrame(sentences, columns=cols2)
Output of senteces:
[[u'Im a Sentence', u'Iam another Sentence in a Row.'],[u"Here weg o, next Sentence.", u'Blub, more blubs.'],[u"Once again, more Sentence.", u'And some other information.',u’The Restaurant was ok, but not awesome.’]]
Output of test:
<itertools.chain object at 0x000000001316DC18>
Output and Information of the new Dataframe df2:
AssertionError: 1 columns passed, passed data had 19 columns
U_REVIEW | 0 | 1 | 2 ...
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Im a Sentence. Iam another Sentence in a Row. |Im a Sentence |Iam another Sentence in a Row. |
Here we go, next Sentence. Blub, more blubs. |Here we go, next Sentence.|Blub, more blubs. |
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.|Once again, more Sentence.|And some other information. |The Restaurant was ok, but not awesome.
Here is a Testset of a Dataframe:
import pandas as pd
ta = pd.DataFrame( ['Im a Sentence. Iam another Sentence in a Row','Here we go, next Sentence. Blub, more blubs.','Once again, more Sentence. And some other information. The Restaurant was ok, but not awsome.'])
ta.columns =['U_REVIEW']
try this I have done it in python 3.5 I think it should work for 2.5 also:
In [45]: df = pd.DataFrame(ta.U_REVIEW.str.split('.',expand=True).replace('',np.nan).fillna(np.nan).values.flatten()).dropna()
In [46]: df
Out[46]:
0
0 Im a Sentence
1 Iam another Sentence in a Row
4 Here we go, next Sentence
5 Blub, more blubs
8 Once again, more Sentence
9 And some other information
10 The Restaurant was ok, but not awsome
is this what you want:
ta.U_REVIEW.str.split('.',expand=True)
Out[50]:
0 1 \
0 Im a Sentence Iam another Sentence in a Row
1 Here we go, next Sentence Blub, more blubs
2 Once again, more Sentence And some other information
2 3
0 None None
1 None
2 The Restaurant was ok, but not awsome
or
In [52]: ta.U_REVIEW.str.split('.').apply(list)
Out[52]:
0 [Im a Sentence, Iam another Sentence in a Row]
1 [Here we go, next Sentence, Blub, more blubs, ]
2 [Once again, more Sentence, And some other in...
Name: U_REVIEW, dtype: object

Collating data based on a parameter in python

I have the following type of data:
Col1 Col2 Col3
heyA 123 ABC
heyB 456 VCV
heyA 123 SDF
heyA 123 ABC
I want to collate them such that
The output should be:
Col1 Col2 Col3
heyA 123 ABC,SDF
heyB 456 VCV
Please suggest me ways of doing this. Thanks a lot in advance!
I have tried:
for i in Col1:
for k in Col1:
if i==k:
//dosome
else:
//dosomethingelse
but this isnt giving me the desired result. It is matching the same entry with itself and hence, the incorrect result.
According to your question, I assume that you're a beginner in python. So, my answer here will be boring to the pros.
First, you need to install pandas which is a cool python module that deals with datasets.
Second, you need to custom your data points and make them look like the following dictionary (I will leave this part to you):
d = {"Col1": ["heyA", "heyB", "heyA", "heyA"],
"Col2": [123, 456, 123, 123],
"Col3": ["ABC", "VCV", "SDF", "ABC"]}
Now, the fun part begins!
We will use a function from pandas module, this function is called group_by(). This function groups the data based on a specified column values. So, let's try it out and group our data accoring to the first two columns which are Col1 and Col2:
>>> import pandas as pd
>>>
>>> df = pd.DataFrame(d)
>>> grouped = df.groupby( ["Col1", "Col2"] )
>>> grouped
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000AB9A6A0>
>>> grouped.groups
{('heyA', 123L): Int64Index([0, 2, 3], dtype='int64'),
('heyB', 456L): Int64Index([1], dtype='int64')}
As you can see, now we have two groups ('heyA', 123L) and ('heyB', 456L).
Now, let's use the groupby object and apply a function to it, we will apply the set casting to remove duplicated values. Then, we will use the function reset_index() to reset the indices.
>>> grouped['Col3'].apply(set).reset_index()
Col1 Col2 Col3
0 heyA 123 {SDF, ABC}
1 heyB 456 {VCV}
If you're concerned about the {} brackets, you can run the following line instead:
>>> grouped['Col3'].apply(set).apply(list).reset_index()
Col1 Col2 Col3
0 heyA 123 [SDF, ABC]
1 heyB 456 [VCV]

Pandas: Iterate on a column one row at a time to automate a google search?

I am trying to automate 100 google searches (one per individual String in a row and return urls per each query) on a specific column in a csv (via python 2.7); however, I am unable to get Pandas to read the row contents to the Google Search automater.
*GoogleSearch source = https://breakingcode.wordpress.com/2010/06/29/google-search-python/
Overall, I can print Urls successfully for a query when I utilize the following code:
from google import search
query = "apples"
for url in search(query, stop=5, pause=2.0):
print(url)
However, when I add Pandas ( to read each "query") the rows are not read -> queried as intended. I.E. "data.irow(n)" is being queired instead of the row contents, one at a time.
from google import search
import pandas as pd
from pandas import DataFrame
query_performed = 0
querying = True
query = 'data.irow(n)'
#read the excel file at column 2 (i.e. "Fruit")
df = pd.read_csv('C:\Users\Desktop\query_results.csv', header=0, sep=',', index_col= 'Fruit')
# need to specify "Column2" and one "data.irow(n)" queried at a time
while querying:
if query_performed <= 100:
print("query")
query_performed +=1
else:
querying = False
print("Asked all 100 query's")
#prints initial urls for each "query" in a google search
for url in search(query, stop=5, pause=2.0):
print(url)
Incorrect output I receive at the command line:
query
Asked all 100 query's
query
Asked all 100 query's
Asked all 100 query's
http://www.irondata.com/
http://www.irondata.com/careers
http://transportation.irondata.com/
http://www.irondata.com/about
http://www.irondata.com/public-sector/regulatory/products/versa
http://www.irondata.com/contact-us
http://www.irondata.com/public-sector/regulatory/products/cavu
https://www.linkedin.com/company/iron-data-solutions
http://www.glassdoor.com/Reviews/Iron-Data-Reviews-E332311.htm
https://www.facebook.com/IronData
http://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=35267805
http://www.indeed.com/cmp/Iron-Data
http://www.ironmountain.com/Services/Data-Centers.aspx
FYI: My Excel .CSV format is the following:
B
1 **Fruit**
2 apples
2 oranges
4 mangos
5 mangos
6 mangos
...
101 mangos
Any advice on next steps is greatly appreciated! Thanks in advance!
Here's what I got. Like I mentioned in my comment, I couldn't get the stop parameter to work like i thought it should. Maybe i'm misunderstanding how its used. I'm assuming you only want the first 5 urls per search.
a sample df
d = {"B" : ["mangos", "oranges", "apples"]}
df = pd.DataFrame(d)
Then
stop = 5
urlcols = ["C","D","E","F","G"]
# Here i'm using an apply() to call the google search for each 'row'
# and a list is built for the urls return by search()
df[urlcols] = df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])) #get 5 by slicing
which gives you. Formatting is a bit rough on this
B C D E F G
0 mangos http://en.wikipedia.org/wiki/Mango http://en.wikipedia.org/wiki/Mango_(disambigua... http://en.wikipedia.org/wiki/Mangifera http://en.wikipedia.org/wiki/Mangifera_indica http://en.wikipedia.org/wiki/Purple_mangosteen
1 oranges http://en.wikipedia.org/wiki/Orange_(fruit) http://en.wikipedia.org/wiki/Bitter_orange http://en.wikipedia.org/wiki/Valencia_orange http://en.wikipedia.org/wiki/Rutaceae http://en.wikipedia.org/wiki/Cherry_Orange
2 apples https://www.apple.com/ http://desmoines.citysearch.com/review/692986920 http://local.yahoo.com/info-28919583-apple-sto... http://www.judysbook.com/Apple-Store-BtoB~Cell... https://tr.foursquare.com/v/apple-store/4b466b...
if you'd rather not specify the columns (i.e. ["C",D"..]) you could do the following.
df.join(df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])))

Appending Tuples to Pandas DataFrame

I am trying to join (vertically) some of the tuples, better to say I am inserting these tuples in the dataframe. But unable to do so till now. Problem arises since I am trying to add them horizentally and not vertically.
data_frame = pandas.DataFrame(columns=("A","B","C","D"))
str1 = "Doodles are the logo-incorporating works of art that Google regularly features on its homepage. They began in 1998 with a stick figure by Google co-founders Larry Page and Sergey Brin -- to indicate they were attending the Burning Man festival. Since then the doodles have become works of art -- some of them high-tech and complex -- created by a team of doodlers. Stay tuned here for more of this year's doodles"
aa = str1.split()
bb = zip(aa[0:4])
data_frame.append(bb,ignore_index=True,verify_integrity=False)
Is it possible or do I have to iterate over each word in tuple to use insert
You could do this
In [8]: index=list('ABCD')
In [9]: df = pd.DataFrame(columns=index)
In [11]: df.append(Series(aa[0:4],index=index),ignore_index=True)
Out[11]:
A B C D
0 Doodles are the logo-incorporating
alternatively if you have many of these rows that you are going to append, just
create a list, then DataFame(list_of_series) at the end
In [13]: DataFrame([ aa[0:4], aa[5:8] ],columns=list('ABCD'))
Out[13]:
A B C D
0 Doodles are the logo-incorporating
1 of art that None