What is my problem with my join mapper code?

I'm trying to practice joining data using MapReduce, but when I run this line:
cat join1_File*.txt | ./join1_mapper.py | sort | ./join1_reducer.py
it displays this error:
Traceback (most recent call last):
File "./join1_mapper.py", line 24, in
value_in = key_value[1] #value is 2nd item
IndexError: list index out of range
Apr-04 able 13 n-01 5
Dec-15 able 100 n-01 5
Feb-02 about 3 11
Mar-03 about 8 11
Feb-22 actor 3 22
Feb-23 burger 5 15
Mar-08 burger 2 15
I expect the output to look like this:
Apr-04 able 13 n-01 5
Dec-15 able 100 n-01 5
Feb-02 about 3 11
Mar-03 about 8 11
Feb-22 actor 3 22
Feb-23 burger 5 15
Mar-08 burger 2 15
This is my join1_mapper.py code:
import sys

for line in sys.stdin:
    line = line.strip()                    # strip out carriage return
    key_value = line.split(",")            # split line into key and value; returns a list
    key_in = key_value[0].split(" ")       # key is the first item in the list
    value_in = key_value[1]                # value is the 2nd item
    #print key_in
    if len(key_in) >= 2:                   # if this entry has <date word> in the key
        date = key_in[0]                   # get the date from the key field
        word = key_in[1]
        value_out = date + " " + value_in  # concatenate date, blank, and value_in
        print('%s\t%s' % (word, value_out))       # print a string, tab, and string
    else:                                  # key is only <word>, so just pass it through
        print('%s\t%s' % (key_in[0], value_in))   # print a string, tab, and string

# Note that Hadoop expects a tab to separate key and value,
# but this program assumes the input file has a ',' separating key and value.
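A likely cause (an assumption here, since the input files themselves aren't shown): at least one input line contains no comma, so split(",") returns a one-element list and key_value[1] raises IndexError: list index out of range. A minimal defensive sketch that skips such lines instead of crashing:

import sys

for line in sys.stdin:
    key_value = line.strip().split(",")
    if len(key_value) < 2:   # no ',' on this line, so there is no value field
        continue             # skip the malformed line instead of raising IndexError
    key_in = key_value[0].split(" ")
    value_in = key_value[1]
    # ... rest of the mapper logic unchanged ...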

Related

Remove string before first underscore in pandas column names

I want to remove the leading string and the first underscore from the column names.
My attempt:
import re
import pandas as p
pathways[pathways.columns.str.replace(r"^[^KEGG_]", "", "regex=True")]
pathways.columns
Traceback:
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: FutureWarning:
The default value of regex will change from True to False in a future version.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-190-953d2f19fd5e> in <module>()
1 # Remove the "KEGG_" string from pathways.index
----> 2 pathways[pathways.columns.str.replace(r"^[^KEGG_]", "", "regex=True")]
3 pathways.columns
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/strings/object_array.py in _str_replace(self, pat, repl, n, case, flags, regex)
156 pat = re.compile(pat, flags=flags)
157
--> 158 n = n if n >= 0 else 0
159 f = lambda x: pat.sub(repl=repl, string=x, count=n)
160 else:
TypeError: '>=' not supported between instances of 'str' and 'int'
The dataframe looks like this:

        KEGG_1  KEGG_1_2  KEGG_1_2_3
First   row     row       row
Second  row     row       row

output:

        1       1_2       _2_3
First   row     row       row
Second  row     row       row
You can use
pathways.columns = pathways.columns.str.replace(r"^KEGG_", "", regex=True)
You get "FutureWarning: The default value of regex will change from True to False in a future version." because you quoted the regex=True argument: passed as the string "regex=True", it lands in the positional n parameter, which is also what triggers the TypeError ('>=' not supported between instances of 'str' and 'int').
You need to match the KEGG_ prefix at the start of the string with the ^KEGG_ pattern; note that [^KEGG_] is a negated character class (any single character that is not K, E, G, or _), not a literal match.
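A quick demo of the fix (a sketch; the row values are made up to mirror the frame above):

import pandas as pd

pathways = pd.DataFrame([["row"] * 3, ["row"] * 3],
                        index=["First", "Second"],
                        columns=["KEGG_1", "KEGG_1_2", "KEGG_1_2_3"])

pathways.columns = pathways.columns.str.replace(r"^KEGG_", "", regex=True)
print(pathways.columns.tolist())  # ['1', '1_2', '1_2_3']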

Parsing periods in a column dataframe

I have a CSV where one of the columns contains periods (durations):
timespan (string): PnYnMnD, where P is a literal value that starts the expression, nY is the number of years followed by a literal Y, nM is the number of months followed by a literal M, nD is the number of days followed by a literal D, where any of these numbers and corresponding designators may be absent if they are equal to 0, and a minus sign may appear before the P to specify a negative duration.
I want to return a dataframe that contains all the data in the CSV, with the timespan column parsed.
So far I have code that parses periods:
import re

timespan_regex = re.compile(r'P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?')

def parse_timespan(timespan):
    # check if the input is a valid timespan
    if not timespan or 'P' not in timespan:
        return None
    # check if the timespan is negative and skip the initial 'P' literal
    curr_idx = 0
    is_negative = timespan.startswith('-')
    if is_negative:
        curr_idx = 1
    # extract years, months and days with the regex
    match = timespan_regex.match(timespan[curr_idx:])
    years = int(match.group(1) or 0)
    months = int(match.group(2) or 0)
    days = int(match.group(3) or 0)
    timespan_days = years * 365 + months * 30 + days
    return timespan_days if not is_negative else -timespan_days

print(parse_timespan(''))
print(parse_timespan('P2Y11M20D'))
print(parse_timespan('-P2Y11M20D'))
print(parse_timespan('P2Y'))
print(parse_timespan('P0Y'))
print(parse_timespan('P2Y4M'))
print(parse_timespan('P16D'))
Output:
None
1080
-1080
730
0
850
16
How do I apply this function to the whole timespan column while processing the CSV?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv(f_path,
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    my_ocan['timespan'] = parse_timespan(my_ocan['timespan'])  # I tried it like this, but it is not working :)
    return my_ocan
Thank you and have a lovely day :)
Like Python's built-in map, pandas has a map method on Series; see the Series.map documentation. Since you already have a function that takes a single value and returns a value, you just need this:

# map passes each value in the 'timespan' column to parse_timespan
# and stores the returned value back in that row
my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan)
And here is a generic demo:
import pandas as pd

def demo_func(x):
    # Takes an int or string, prefixes it with 'A' and returns a string.
    return "A" + str(x)

df = pd.DataFrame({"Column_1": [1, 2, 3, 4], "Column_2": [10, 9, 8, 7]})
print(df)

df['Column_1'] = df['Column_1'].map(demo_func)
print("After mapping:\n{}".format(df))
Output:
  Column_1  Column_2
0        1        10
1        2         9
2        3         8
3        4         7
After mapping:
  Column_1  Column_2
0       A1        10
1       A2         9
2       A3         8
3       A4         7
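Applied to the original function, the whole thing might look like this (a sketch: 'timespan' is dropped from parse_dates since it holds duration strings rather than dates, and the global is unnecessary because the frame is returned):

import pandas as pd

def do_process_citation_data(f_path):
    my_ocan = pd.read_csv(f_path,
                          names=['oci', 'citing', 'cited', 'creation',
                                 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation'])   # 'timespan' is not a date
    my_ocan = my_ocan.iloc[1:]                        # to remove the first row
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d")
    my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan)
    return my_ocan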

How do I compare two lists to a python tuple, identify items, and append value based on conditionals?

How do I:
Identify which item from the dataframe df falls within each list (list1 or list2)
Create a new column ('new_item')
Determine which variable should be appended to the 'item' value and add it to the new column
Two lists of unique items:
list1 = ['one','two','shoes']
list2 = ['door','four','tires']
If item is in list1, append the following variable value to the end of the item and append it to the 'new_item' column:
twentysix_above = '_26+' (value is equal or greater than 26)
six_to_twentyfive = '_25' (value is between 6 and 25)
one_to_five = '_5' (value is between 1 and 5)
If item is in list2, append the following variable value to the end of each item and append it to the 'new_item' column:
twentyone_above = '_21+' (value is equal or greater than 21)
one_to_twenty = '_20' (value is between 1 and 20)
If the item isn't in either list, carry over the item name to the 'new_item' column.
The dataframe's 'item' column may contain one, some, or none of the items from each list, with an associated number in the 'number' column. I've gotten partway there, but I'm not sure how to also compare against the other list and put it all into the 'new_item' column. Any help is appreciated, thanks!
>> print df
    item  number
0    one       4
1   door      55
2    sun       2
3  tires      62
4  tires       7
5  water      94
>> list1 = ['one','two','shoes']
>> list2 = ['door','four','tires']
>> df['match'] = df.item.isin(list1)
>> bucket = []
>> for row in df.itertuples():
       if row.match == True and row.number > 25:
           bucket.append(row.item + '_26+')
       elif row.match == True and row.number > 5:
           bucket.append(row.item + '_25')
       elif row.match == True and row.number > 0:
           bucket.append(row.item + '_5')
       else:
           bucket.append(row.item)
>> df['new_item'] = bucket
>> print df
    item  number  match new_item
0    one       4   True    one_5
1   door      55  False     door
2    sun       2  False      sun
3  tires      62  False    tires
4  tires       7  False    tires
5  water      94  False    water
Desired Result: (comparing both lists and potentially not needing the boolean check column)
    item  number   new_item
0    one       4     one_20
1   door      55  door__21+
2    sun       2        sun
3  tires      62   tires_21
4  tires       7   tires_20
5  water      94      water
It looks like your desired result is a bit off: the first row's item is in list1 and has a value of 4, so it should be 'one_5', right?
Anyway, this can be accomplished with boolean masking. DataFrames have a useful isin() method that makes it easy to check whether a value is in your lists. Then you combine it with two more conditions when the value must fall between two numbers, or just one more condition when the range is unbounded.
import pandas as pd
import numpy as np

df = pd.DataFrame({'item': ['one', 'door', 'sun', 'tires', 'tires', 'water'],
                   'number': [4, 55, 2, 62, 7, 94]})
list1 = ['one','two','shoes']
list2 = ['door','four','tires']

df['new_item'] = df['item']  # default: carry the item name over unchanged

logic1 = np.logical_and(df.item.isin(list1), df.number >= 26)
logic2 = np.logical_and.reduce([df.item.isin(list1), df.number >= 6, df.number <= 25])
logic3 = np.logical_and.reduce([df.item.isin(list1), df.number >= 1, df.number <= 5])
logic4 = np.logical_and(df.item.isin(list2), df.number >= 21)
logic5 = np.logical_and.reduce([df.item.isin(list2), df.number >= 1, df.number <= 20])

df.loc[logic1, 'new_item'] = df.loc[logic1, 'item'] + '_26+'
df.loc[logic2, 'new_item'] = df.loc[logic2, 'item'] + '_25'
df.loc[logic3, 'new_item'] = df.loc[logic3, 'item'] + '_5'
df.loc[logic4, 'new_item'] = df.loc[logic4, 'item'] + '_21+'
df.loc[logic5, 'new_item'] = df.loc[logic5, 'item'] + '_20'
And running the snippet above produces:
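    item  number   new_item
0    one       4      one_5
1   door      55   door_21+
2    sun       2        sun
3  tires      62  tires_21+
4  tires       7   tires_20
5  water      94      water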

How to pass variable as a column name pandas

I'm using Python 2.7.
I'm trying to create a new column based on each variable from a list:
tickers = ['BAC','JPM','WFC','C','MS']
returns = pd.DataFrame
for tick in tickers:
    returns[tick] = bank_stocks[tick]['Close'].pct_change()
But I get this error:
TypeError Traceback (most recent call last)
in ()
2 returns=pd.DataFrame
3 for tick in tickers:
----> 4 returns[tick]=bank_stocks[tick]['Close'].pct_change()
5
TypeError: 'type' object does not support item assignment
IIUC you need:
import numpy as np
import pandas as pd

np.random.seed(100)
mux = pd.MultiIndex.from_product([['BAC','JPM','WFC','C','MS', 'Other'], ['Close', 'Open']])
df = pd.DataFrame(np.random.rand(10,12), columns=mux)
print (df)
BAC JPM WFC C \
Close Open Close Open Close Open Close
0 0.543405 0.278369 0.424518 0.844776 0.004719 0.121569 0.670749
1 0.185328 0.108377 0.219697 0.978624 0.811683 0.171941 0.816225
2 0.175410 0.372832 0.005689 0.252426 0.795663 0.015255 0.598843
3 0.980921 0.059942 0.890546 0.576901 0.742480 0.630184 0.581842
4 0.285896 0.852395 0.975006 0.884853 0.359508 0.598859 0.354796
5 0.376252 0.592805 0.629942 0.142600 0.933841 0.946380 0.602297
6 0.173608 0.966610 0.957013 0.597974 0.731301 0.340385 0.092056
7 0.395036 0.335596 0.805451 0.754349 0.313066 0.634037 0.540405
8 0.254258 0.641101 0.200124 0.657625 0.778289 0.779598 0.610328
9 0.976500 0.166694 0.023178 0.160745 0.923497 0.953550 0.210978
MS Other
Open Close Open Close Open
0 0.825853 0.136707 0.575093 0.891322 0.209202
1 0.274074 0.431704 0.940030 0.817649 0.336112
2 0.603805 0.105148 0.381943 0.036476 0.890412
3 0.020439 0.210027 0.544685 0.769115 0.250695
4 0.340190 0.178081 0.237694 0.044862 0.505431
5 0.387766 0.363188 0.204345 0.276765 0.246536
6 0.463498 0.508699 0.088460 0.528035 0.992158
7 0.296794 0.110788 0.312640 0.456979 0.658940
8 0.309000 0.697735 0.859618 0.625324 0.982408
9 0.360525 0.549375 0.271831 0.460602 0.696162
First select the columns with an IndexSlice, then call pct_change, and finally remove the second level of the column MultiIndex with droplevel:
tickers=['BAC','JPM','WFC','C','MS']
idx = pd.IndexSlice
df = df.sort_index(axis=1)
returns = df.loc[:, idx[tickers,'Close']].pct_change()
returns.columns = returns.columns.droplevel(-1)
print (returns)
BAC C JPM MS WFC
0 NaN NaN NaN NaN NaN
1 -0.658950 0.216885 -0.482477 2.157889 171.008452
2 -0.053515 -0.266325 -0.974108 -0.756436 -0.019738
3 4.592146 -0.028390 155.551779 0.997444 -0.066841
4 -0.708544 -0.390220 0.094841 -0.152103 -0.515801
5 0.316048 0.697588 -0.353910 1.039454 1.597555
6 -0.538586 -0.847159 0.519208 0.400649 -0.216890
7 1.275448 4.870415 -0.158370 -0.782213 -0.571905
8 -0.356369 0.129391 -0.751538 5.297934 1.486019
9 2.840595 -0.654320 -0.884181 -0.212630 0.186573
Your code is correct except for the line returns = pd.DataFrame: you must call the DataFrame constructor, i.e. pd.DataFrame(). Without the '()' you have assigned the class itself rather than created an instance, which is why the error says "'type' object does not support item assignment".
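A minimal sketch of the corrected loop (bank_stocks here is a hypothetical stand-in; in the question it is presumably a dict-like collection of per-ticker DataFrames with a 'Close' column):

import pandas as pd

# hypothetical stand-in for the asker's bank_stocks data
bank_stocks = {t: pd.DataFrame({'Close': [10.0, 10.5, 10.2, 11.0]})
               for t in ['BAC', 'JPM', 'WFC', 'C', 'MS']}

tickers = ['BAC', 'JPM', 'WFC', 'C', 'MS']
returns = pd.DataFrame()  # note the (): instantiate, don't reference the class
for tick in tickers:
    returns[tick] = bank_stocks[tick]['Close'].pct_change()
print(returns)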

calling number from a list to check if they are in a text file or not

I have a text file with 3 numbers on each line.
I also have a list of numbers, like: lists = [1,2,3,4,5,6]
I'd like to find the lines in the text file where all 3 numbers are from the list. For example, given this text file:
11 20 6
3 5 1
30 20 12
I want to find the line: 3 5 1
What is the fastest way to do so?
Using split() and set():
l = [1,2,3,4,5,6]

with open('data.txt') as file:
    for i, line in enumerate(file):
        if set(map(int, line.split())).issubset(l):
            print("Line %d has all numbers from the list" % i)
With an example file: data.txt like so:
11 20 6
3 5 1
30 20 12
Output:
Line 1 has all numbers from the list
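Since the question asks for the fastest way, one small tweak worth noting: issubset builds a set from its argument on every call, so converting the list to a set once up front avoids that repeated work on large files. A sketch of the same loop with that change:

allowed = {1, 2, 3, 4, 5, 6}  # build the lookup set once, outside the loop

with open('data.txt') as file:
    for i, line in enumerate(file):
        if set(map(int, line.split())) <= allowed:  # <= means issubset for sets
            print("Line %d has all numbers from the list" % i)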