Converting a specific column's data in a .csv to text using Python pandas - list

I have a .csv file like the one below, where all the contents are text:
col1           Col2
My name        Arghya
The Big Apple  Fruit
I am able to read this csv using pd.read_csv(index_col=False, header=None).
How do I combine all the rows in col1 into a single string separated by full stops?

If you need to convert the column values to a list:
print (df.Col1.tolist())
#alternative solution
#print (list(df.Col1))
['This is Apple', 'I am in Mumbai', 'I like rainy day']
Then join the values in the list; the output is a string:
a = '.'.join(df.Col1.tolist())
print (a)
This is Apple.I am in Mumbai.I like rainy day
Because the file was read with header=None, the header text becomes the first data row:
print (df)
                  0      1
0              Col1   Col2
1     This is Apple  Fruit
2    I am in Mumbai  Great
3  I like rainy day  Flood
print (list(df.loc[:, 0]))
#alternative
#print (list(df[0]))
['Col1', 'This is Apple', 'I am in Mumbai', 'I like rainy day']
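Since header=None keeps the 'Col1' label as data, one way to exclude it before joining is to slice it off; a minimal sketch, assuming the header text sits in row 0:
a = '.'.join(df.loc[1:, 0])   # skip row 0, which holds the original header
print (a)
This is Apple.I am in Mumbai.I like rainy day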

Related

Regular expression in Hive for a specific string

I have a column in a Hive table which is an address column, and I want to split it into two columns.
There are 2 scenarios to take care of.
Example:
Scenario 1:
Input column value:
ABC DEF123 AD
Output column values:
Column 1 should have ABC DEF
Column 2 should have 123 AD
Another example can be like below.
MICHAEL POSTON875 HYDERABAD
In this case the separation should be based on a number that is part of the string value; if a string has a number in it, the two parts should be split there.
Scenario 2:
Input value: ABC DEFPO BOX 5232
Output:
Column 1:- ABC DEF
Column 2:- PO BOX 5232
Another example can be like below.
Hyderabad jhillsPO BOX 522002
In this case the separation should be based on 'PO BOX'.
Both kinds of data are in the same column, and I would like to update the data into the target based on the string format, like a CASE statement; I am not sure about the approach.
NOTE: The string length can vary, as this is an address column.
Can someone please help me with a Hive query and PySpark code for the same?
Using a CASE expression you can check which pattern the string matches, insert a delimiter with regexp_replace, and then split by that same delimiter.
Demo (Hive):
with mytable as (
  select stack(4,
    'ABC DEF123 AD',
    'MICHAEL POSTON875 HYDERABAD',
    'ABC DEFPO BOX 5232',
    'Hyderabad jhillsPO BOX 522002'
  ) as str
) --Use your table instead of this
select columns[0] as col1, columns[1] as col2
from
(
  select split(case when (str rlike 'PO BOX') then regexp_replace(str, 'PO BOX', '|||PO BOX')
                    when (str rlike '[a-zA-Z ]+\\d+') then regexp_replace(str, '([a-zA-Z ]+)(\\d+.*)', '$1|||$2')
                    --add more cases and ELSE part
               end, '\\|{3}') columns
  from mytable
) s
Result:
col1              col2
ABC DEF           123 AD
MICHAEL POSTON    875 HYDERABAD
ABC DEF           PO BOX 5232
Hyderabad jhills  PO BOX 522002
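The asker also requested PySpark; a minimal sketch of the same logic, assuming a DataFrame df with the address in a column named addr (the column name is an assumption):
from pyspark.sql import functions as F

# Insert a ||| delimiter at the split point, then break the string on it.
marked = (F.when(F.col('addr').rlike('PO BOX'),
                 F.regexp_replace('addr', 'PO BOX', '|||PO BOX'))
           .when(F.col('addr').rlike('[a-zA-Z ]+\\d+'),
                 F.regexp_replace('addr', '([a-zA-Z ]+)(\\d+.*)', '$1|||$2')))
           # add more conditions / .otherwise(...) as needed
df = (df.withColumn('parts', F.split(marked, '\\|{3}'))
        .withColumn('col1', F.col('parts').getItem(0))
        .withColumn('col2', F.col('parts').getItem(1))
        .drop('parts'))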

How to get the first row where value is found using itertuples()

I am using the code below to get the first row where "value" is found, but I am getting the last row of the file. What am I doing wrong? Is there a way to get the first row?
Suppose my dataframe look like this:
Summary no
This is an analysis
of some data
Phone: 452-354-4456
col1   Value  col2    col3
bac15  job    $16.00  $0.00
khs    bank   $19.25  $0.00
jsg    foot   $0.00   $70,000.00
eyhf   water  $15.00  $0.00
edf    drink  $15.00  $0.00
import os
import pandas as pd

for fname in os.listdir(root_dir):  # root_dir is defined elsewhere
    file_path = os.path.join(root_dir, fname)
    if fname.endswith('.csv'):
        df = pd.read_csv(file_path)
        for row in df.itertuples():
            if row == "value":
                print(row)
You are comparing the entire row to the string "value", which will never be true, since row is a tuple of the values in each row of the dataframe.
Rows containing "value" can be found using if "value" in row:
rather than if row == "value":
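A minimal sketch of the corrected loop, assuming you want to stop at the first match:
for row in df.itertuples():
    if "value" in row:   # membership test over the tuple of row values
        print(row)
        break            # stop after the first matching row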

Pandas drop duplicates; values in reverse order

I'm trying to find a way to utilize pandas drop_duplicates() to recognize that rows are duplicates when the values are in reverse order.
An example is if I am trying to find transactions where customers purchase both apples and bananas, but the data collection order may have reversed the items. In other words, when combined as a full order, the transaction is a duplicate because it is made up of the same items.
I want the following to be recognized as duplicates:
Item1   Item2
Apple   Banana
Banana  Apple
First sort each row with apply and sorted, then call drop_duplicates:
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
   Item1   Item2
0  Apple  Banana
#if you need to specify columns
cols = ['Item1','Item2']
df[cols] = df[cols].apply(sorted, axis=1)
df = df.drop_duplicates(subset=cols)
print (df)
   Item1   Item2
0  Apple  Banana
Another solution with numpy.sort and the DataFrame constructor:
import numpy as np

df = pd.DataFrame(np.sort(df.values, axis=1),
                  index=df.index, columns=df.columns).drop_duplicates()
print (df)
   Item1   Item2
0  Apple  Banana
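For reference, a minimal end-to-end reproduction of the numpy approach, using the two-row sample from the question:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Item1': ['Apple', 'Banana'],
                   'Item2': ['Banana', 'Apple']})

# Sort the values within each row so reversed orders compare equal,
# then drop the rows that became identical.
deduped = pd.DataFrame(np.sort(df.values, axis=1),
                       index=df.index, columns=df.columns).drop_duplicates()
print (deduped)
#    Item1   Item2
# 0  Apple  Banana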

Compare columns in CSV and override newlist

I'm new to Python. I have two .csv files with identical columns, but my oldlist.csv has been edited in row[9] with employee names, while the generated newlist.csv defaults to certain domains for names. I want to take oldlist.csv, compare it to newlist.csv, and override the corresponding column in newlist.csv with the data from row[9] of oldlist.csv. Thanks for your help.
Example:
oldlist      newlist
col1, col2   col1, col2
1234, Bob    1234, Jane
I want to read oldlist; if col1 == col1 in newlist, override col2, and continue to write.write(row) for everything matching in col1 from oldlist.
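A minimal sketch with the csv module, assuming the ID is in the first column and the name in the second as in the example above (swap the indices for the real row[9] layout), writing to a hypothetical merged.csv:
import csv

# Build a lookup of id -> edited name from the old list.
with open('oldlist.csv', newline='') as f:
    names = {row[0]: row[1] for row in csv.reader(f)}

# Rewrite the new list, overriding the name column when the id matches.
with open('newlist.csv', newline='') as f_in, \
     open('merged.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    for row in csv.reader(f_in):
        if row[0] in names:
            row[1] = names[row[0]]
        writer.writerow(row)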

Creating a pandas.DataFrame from a dict

I'm new to using pandas and I'm trying to make a dataframe with historical weather data.
The keys are the days of the year (e.g. Jan 1) and the values are lists of temperatures from those days over several years.
I want to make a dataframe that is formatted like this:
...  Jan1  Jan2  Jan3  etc
1    temp  temp  temp  etc
2    temp  temp  temp  etc
etc  etc   etc   etc
I've managed to make a dataframe with my dictionary with
df = pandas.DataFrame(weather)
but I end up with 1 row and a ton of columns.
I've checked the documentation for DataFrame and DataFrame.from_dict, but neither was very extensive nor provided many examples.
Given that "the keys are the day of the year... and the values are lists of temperatures", your method of construction should work. For example,
In [12]: weather = {'Jan 1':[1,2], 'Jan 2':[3,4]}

In [13]: df = pd.DataFrame(weather)

In [14]: df
Out[14]:
   Jan 1  Jan 2
0      1      3
1      2      4
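If the lists have unequal lengths (some days may have fewer recorded years), the constructor above raises an error; one workaround, assuming the same weather dict shape, is to build each column as a Series so missing entries become NaN:
import pandas as pd

weather = {'Jan 1': [1, 2, 3], 'Jan 2': [3, 4]}  # unequal lengths

# Each Series keeps its own length; shorter columns are padded with NaN.
df = pd.DataFrame({day: pd.Series(temps) for day, temps in weather.items()})
print (df)
#    Jan 1  Jan 2
# 0      1    3.0
# 1      2    4.0
# 2      3    NaN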