Collating data based on a parameter in python - python-2.7

I have the following type of data:
Col1 Col2 Col3
heyA 123 ABC
heyB 456 VCV
heyA 123 SDF
heyA 123 ABC
I want to collate them so that rows sharing the same Col1 and Col2 are merged and their Col3 values are combined. The output should be:
Col1 Col2 Col3
heyA 123 ABC,SDF
heyB 456 VCV
Please suggest ways of doing this. Thanks a lot in advance!
I have tried:
for i in Col1:
    for k in Col1:
        if i == k:
            pass  # do something
        else:
            pass  # do something else
but this isn't giving me the desired result: it matches each entry against itself as well, and hence produces an incorrect result.

According to your question, I assume that you're a beginner in Python, so my answer here may be boring to the pros.
First, you need to install pandas, a Python module for working with tabular datasets.
Second, you need to reshape your data points so they look like the following dictionary (I will leave this part to you):
d = {"Col1": ["heyA", "heyB", "heyA", "heyA"],
"Col2": [123, 456, 123, 123],
"Col3": ["ABC", "VCV", "SDF", "ABC"]}
Now, the fun part begins!
We will use a pandas function called groupby(), which groups the data based on the values of the specified columns. So, let's try it out and group our data according to the first two columns, Col1 and Col2:
>>> import pandas as pd
>>>
>>> df = pd.DataFrame(d)
>>> grouped = df.groupby( ["Col1", "Col2"] )
>>> grouped
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000AB9A6A0>
>>> grouped.groups
{('heyA', 123L): Int64Index([0, 2, 3], dtype='int64'),
('heyB', 456L): Int64Index([1], dtype='int64')}
As you can see, now we have two groups ('heyA', 123L) and ('heyB', 456L).
Now, let's take the groupby object and apply a function to it: we will apply set to remove duplicate values, then use reset_index() to turn the group keys back into columns.
>>> grouped['Col3'].apply(set).reset_index()
Col1 Col2 Col3
0 heyA 123 {SDF, ABC}
1 heyB 456 {VCV}
If you're concerned about the {} brackets, you can run the following line instead:
>>> grouped['Col3'].apply(set).apply(list).reset_index()
Col1 Col2 Col3
0 heyA 123 [SDF, ABC]
1 heyB 456 [VCV]
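If you want the exact comma-separated strings from the question's expected output (ABC,SDF rather than a set or list), a minimal sketch along the same lines joins the de-duplicated values:
>>> grouped['Col3'].apply(lambda s: ','.join(sorted(set(s)))).reset_index()
   Col1  Col2     Col3
0  heyA   123  ABC,SDF
1  heyB   456      VCV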

Related

PySpark - How to pass a list to a User Defined Function?

I have a DataFrame with 2 columns. Column 1 is "code", which can repeat more than once, and column 2 is "Values". For example, column 1 is 1,1,1,5,5 and column 2 is 15,18,24,38,41. What I want to do is first sort by the two columns ( df.sort("code","Values") ), then do a groupBy on "code" and aggregate "Values", but I want to apply a UDF on the values, so I need to pass the "Values" of each code as a list to the UDF. I am not sure how many "Values" each code will have; as you can see in this example, "code" 1 has 3 values and "code" 5 has 2 values. So for each "code" I need to pass all of its "Values" as a list to the UDF.
You can do a groupBy and then use the collect_set or collect_list function in PySpark. Below is an example dataframe for your use case (I hope this is what you are referring to):
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
    ("code1", "val1"),
    ("code1", "val2"),
    ("code1", "val3"),
    ("code2", "val1"),
    ("code2", "val2"),
], ["code", "val"])
df.show()
+-----+-----+
| code| val |
+-----+-----+
|code1|val1 |
|code1|val2 |
|code1|val3 |
|code2|val1 |
|code2|val2 |
+-----+-----+
Now the groupBy and collect_list command:
(df
    .groupby("code")
    .agg(F.collect_list("val"))
    .show())
Output:
+------+------------------+
|code |collect_list(val) |
+------+------------------+
|code1 |[val1, val2, val3]|
|code2 |[val1, val2] |
+------+------------------+
As shown above, you get the list of aggregated values in the second column.
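If you then need to run your own logic over each collected list, here is a minimal sketch; the counting UDF is a hypothetical stand-in for whatever your real UDF does:
from pyspark.sql.types import IntegerType

count_vals = F.udf(lambda vals: len(vals), IntegerType())  # hypothetical UDF: counts the values

(df
    .groupby("code")
    .agg(F.collect_list("val").alias("vals"))
    .withColumn("n_vals", count_vals("vals"))
    .show())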

Remove square brackets from cells using pandas

I have a Pandas Dataframe with data as below
id, name, date
[101],[test_name],[2019-06-13T13:45:00.000Z]
[103],[test_name3],[2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z]
[104],[],[]
I am trying to convert it to the format below, with no square brackets
Expected output:
id, name, date
101,test_name,2019-06-13T13:45:00.000Z
103,test_name3,2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z
104,,
I tried using regex as below, but it gave me an error: TypeError: expected string or bytes-like object
re.search(r"\[([A-Za-z0-9_]+)\]", df['id'])
Figured I am able to extract the data using the below:
df['id'].str.get(0)
Loop through the dataframe to access each string, then use:
newstring = oldstring[1:-1]
to replace the cell in the dataframe.
Try looping through the columns:
for col in df.columns:
    df[col] = df[col].str[1:-1]
Or use apply if duplicating your data is not a problem:
df = df.apply(lambda x: x.str[1:-1])
Output:
id name date
0 101 test_name 2019-06-13T13:45:00.000Z
1 103 test_name3 2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00....
2 104
Or, if you want to use regex, you need the str accessor and extract:
df.apply(lambda x: x.str.extract(r'\[([A-Za-z0-9_]+)\]'))
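Note that these approaches assume the cells are strings like "[101]". If they are actual Python lists (this depends on how your data was loaded), a sketch that joins their elements instead:
df = df.apply(lambda col: col.apply(lambda v: ','.join(map(str, v))))
An empty list then becomes an empty string, matching the 104,, row in the expected output.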

Display/Print one column from a DataFrame of Series in Pandas

I created the following Series and DataFrame:
import pandas as pd
Series_1 = pd.Series({'Name': 'Adam','Item': 'Sweet','Cost': 1})
Series_2 = pd.Series({'Name': 'Bob','Item': 'Candy','Cost': 2})
Series_3 = pd.Series({'Name': 'Cathy','Item': 'Chocolate','Cost': 3})
df = pd.DataFrame([Series_1,Series_2,Series_3], index=['Store 1', 'Store 2', 'Store 3'])
I want to display/print out just one column from the DataFrame (with or without the header row):
Either
Adam
Bob
Cathy
Or:
Sweet
Candy
Chocolate
I have tried the following code, which did not work:
print(df['Item'])
print(df.loc['Store 1'])
print(df.loc['Store 1','Item'])
print(df.loc['Store 1','Name'])
print(df.loc[:,'Item'])
print(df.iloc[0])
Can I do it in one simple line of code?
By using to_string:
print(df.Name.to_string(index=False))
Adam
Bob
Cathy
For printing the Name column (in an interactive session this displays it with its index):
df['Name']
Not sure what you are really after, but if you want to print exactly what you have, you can do:
Option 1
print(df['Item'].to_csv(index=False))
Sweet
Candy
Chocolate
Option 2
for v in df['Item']:
    print(v)
Sweet
Candy
Chocolate
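Option 3 (a sketch of one more approach, assuming the column holds plain strings):
print('\n'.join(df['Item']))
Sweet
Candy
Chocolate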

python/pandas: need help adding double quotes to columns

I need to add double quotes to specific columns in a csv file that my script generates.
Below is the goofy way I thought of doing this. For these two fixed-width fields, it works:
df['DATE'] = df['DATE'].str.ljust(9,'"')
df['DATE'] = df['DATE'].str.rjust(10,'"')
df['DEPT CODE'] = df['DEPT CODE'].str.ljust(15,'"')
df['DEPT CODE'] = df['DEPT CODE'].str.rjust(16,'"')
For the following field, it doesn't. It has a variable length, so if the value is shorter than the standard 6 digits, I get extra double quotes: "5673"""
df['ID'] = df['ID'].str.ljust(7,'"')
df['ID'] = df['ID'].str.rjust(8,'"')
I have tried zfill, but the data in the column is a Series; I get "pandas.core.series.Series" when I run
print type(df['ID'])
and I have not been able to convert it to string using astype. I'm not sure why; I have not imported numpy.
I tried using len() to get the length of the ID number and pass it to str.ljust and str.rjust as its first argument, but I think it got hung up on the data not being a string.
Is there a simpler way to apply double-quotes as I need, or is the zfill going to be the way to go?
You can add a quotation mark before / after:
In [11]: df = pd.DataFrame([["a"]], columns=["A"])
In [12]: df
Out[12]:
A
0 a
In [13]: '"' + df['A'] + '"'
Out[13]:
0 "a"
Name: A, dtype: object
Assigning this back:
In [14]: df['A'] = '"' + df.A + '"'
In [15]: df
Out[15]:
A
0 "a"
If it's for exporting to csv, you can use the quoting kwarg:
In [21]: df = pd.DataFrame([["a"]], columns=["A"])
In [22]: df.to_csv()
Out[22]: ',A\n0,a\n'
In [23]: df.to_csv(quoting=1)
Out[23]: '"","A"\n"0","a"\n'
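The same call can also be written with the csv module's named constants, which reads more clearly (quoting=1 is csv.QUOTE_ALL):
In [24]: import csv
In [25]: df.to_csv(quoting=csv.QUOTE_ALL)
Out[25]: '"","A"\n"0","a"\n'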
With numpy, rather than pandas, you can specify the formatting method when saving to a csv file. As a very simple example:
In [209]: np.savetxt('test.txt',['string'],fmt='%r')
In [210]: cat test.txt
'string'
In [211]: np.savetxt('test.txt',['string'],fmt='"%s"')
In [212]: cat test.txt
"string"
I would expect the pandas csv writer to have a similar degree of control, if not more.

AttributeError: 'DataFrame' object has no attribute 'Height'

I am able to convert a csv file to a pandas DataFrame and print out the table, as seen below. However, when I try to print out the Height column I get an error. How can I fix this?
import pandas as pd
df = pd.read_csv('/path../NavieBayes.csv')
print df #this prints out as seen below
print df.Height #this gives me "AttributeError: 'DataFrame' object has no attribute 'Height'"
Height Weight Classifer
0 70.0 180 Adult
1 58.0 109 Adult
2 59.0 111 Adult
3 60.0 113 Adult
4 61.0 115 Adult
I have run into a similar issue before when reading from csv. Assuming it is the same:
col_name = df.columns[0]
df = df.rename(columns={col_name: 'new_name'})
The error in my case was caused (I think) by a byte order mark in the csv, or some other non-printing character being added to the first column label. df.columns returns an array of the column names, and df.columns[0] gets the first one. Try printing it and see if something looks odd in the result.
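If a byte order mark is indeed the culprit, a sketch of an alternative fix (assuming the same file as in the question) is to strip it at read time with the utf-8-sig codec:
df = pd.read_csv('/path../NavieBayes.csv', encoding='utf-8-sig')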
PS: On the above answer by JAB - if there are clearly spaces in your column names, use skipinitialspace=True in read_csv, e.g.
df = pd.read_csv('/path../NavieBayes.csv',skipinitialspace=True)
Use a raw string (or forward slashes) so the backslashes in the path are not treated as escape characters:
df = pd.read_csv(r'path_of_file\csv_file_name.csv')
OR
df = pd.read_csv('path_of_file/csv_file_name.csv')
Example:
data = pd.read_csv(r'F:\Desktop\datasets\hackathon+data+set.csv')
Try it, it will work.