Map values in a list to a new value with PySpark - list

I'm trying to recode a list of values using PySpark to create a new column. I've set my mapping up with nested dictionaries, but I can't get the mapping syntax figured out. The original data has several string values that need to be recoded to a new value, and then I want to give the column a new name. The original column values will be grouped several different ways to create different new columns.
The df will have several thousand columns, so I need the code to be as efficient as possible.
I have a different scenario with a 1-1 mapping where I was able to create my expression with:
expr = [create_map([lit(x) for x in chain(*values.items())])[orig_df[key]]
            .cast(IntegerType())
            .alias('new_name')
        for key, values in my_dict.items() if key in orig_df.columns]
I just can't figure out the syntax for mapping many values to one.
Here's what I've tried:
grouping_dict = {'orig_col_n': {'new_col_n_a': {'20': ['011', '012', '013'], '30': ['014', '015', '016']},
                                'new_col_n_b': {'25': ['011', '013', '015'], '35': ['012', '014', '016']}}}
expr = [f.when(f.col(key) == f.lit(old_val), f.lit(new_value))
            .cast(IntegerType())
            .alias(new_var_name)
        for key, new_var_names_dict in grouping_dict.items()
        for new_var_name, mapping_dict in new_var_names_dict.items()
        for new_value, old_value_list in mapping_dict.items()
        for old_val in old_value_list
        if key in original_df.columns]
new_df = original_df.select(*expr)
This expression isn't quite right: it creates multiple columns with the same name as it loops through the values that need to be mapped.
Any suggestions for restructuring my dictionary or how to fix my syntax would be greatly appreciated!
Desired output:
orig_col_n new_col_n_a new_col_n_b
011 20 25
012 20 35
013 20 25
014 30 35
015 30 25
016 30 35
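One possible restructuring (a sketch only, reusing the grouping_dict and original_df names from the example above and the same create_map trick as the 1-to-1 expression, so it may need adapting): flatten each inner mapping so it goes old value -> new value, then build one map lookup per new column instead of one when clause per old value.

from itertools import chain
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

expr = []
for key, new_var_names_dict in grouping_dict.items():
    if key not in original_df.columns:
        continue
    for new_var_name, mapping_dict in new_var_names_dict.items():
        # invert {'20': ['011', '012', '013'], ...} into {'011': '20', '012': '20', ...}
        flat = {old_val: new_value
                for new_value, old_value_list in mapping_dict.items()
                for old_val in old_value_list}
        mapping_expr = f.create_map([f.lit(x) for x in chain(*flat.items())])
        expr.append(mapping_expr[f.col(key)].cast(IntegerType()).alias(new_var_name))

new_df = original_df.select(*expr)

This builds exactly one expression per new column, so the duplicate-column problem from the loop above shouldn't occur.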

Rocket Universe & Unidata File

This is just for clarification: I know exactly what a Q-pointer is, but today in a meeting the concept of a D-pointer was raised. Does anyone know what a "D" pointer refers to? I've never heard this term before.
This is a nice question because it helped me put together a couple of pieces I had rolling around in my head, so thanks for that!
D's are dictionary items that refer to a logical location in the data array, and you have probably seen them a million times in the DICT of any given file.
A D item in the VOC serves the same purpose and is valid with any query. Lots of shops have some generics (F1, F2, F3, F4, F5, F6, etc.) set up so you don't have to remember the dictionary name if you know which field you want. I think the precedence for dictionary items is DICT File -> VOC, but I could be wrong on that.
As an example to illustrate this, I went into HS.SALES, took one of the DICT items in the CUSTOMER table, and wrote it to the VOC after removing the conversion in field 3. I chose BUY_DATE because it had a conversion.
SORT CUSTOMER BUY_DATE 06:51:04am 10 Oct 2017 PAGE 1
CUSTOMER.. Date Purchased
1 01/07/91
10 01/28/91
01/29/91
01/30/91
Remove the conversion and save into the VOC.
>ED DICT CUSTOMER BUY_DATE
10 lines long.
0001: D Date of purchase
0002: 14
0003: D2/
0004: Date Purchased
0005: 8R
0006: M
0007: ORDERS
0008: INTEGER
0009:
0010:
----: 3
0003: D2/
----: R
0003:
----: SAVE VOC F14NOCON
"F14NOCON" filed in file "VOC".
----: Q
Now sort with the new D type. The values are from before the Y-1995 era, when Pick dates were still 4 digits!
SORT CUSTOMER F14NOCON 06:45:25am 10 Oct 2017 PAGE 1
CUSTOMER.. Date Purchased
1 8408
10 8429
8430
8431
Good Luck!

Dtype changing when setting a column as an index

I am reindexing files from multiple folders. A file initially looks like this:
Combined Percent
0101 50
0102 25
0104 25
I then use this code to create a new index which is the union of the indexes of all my files in a folder:
import os
import pandas as pd
from glob import glob

folders = r'C:\pathway_to_folders'
for folder in os.listdir(folders):
    path = os.path.join(folders, folder)
    filenames = glob(os.path.join(path, '*.csv'))
    def rfile(fn):
        return pd.read_csv(fn, dtype='str', index_col=0)
    dfs = [rfile(fn) for fn in filenames]
    # union of the indexes of every file in this folder
    idx = dfs[0].index
    for i in range(1, len(dfs)):
        idx = idx.union(dfs[i].index)
    print idx
When I set the column Combined as the index column, dfs now looks like this:
Combined Percent
101 50
102 25
104 25
Is there a way to keep the index formatted the same as the original column, or to restructure my code so I don't have to set an index at all?
I believe this is still a long-standing bug: you can't set the dtype and specify the same column as the index column, so you have to do it as a secondary step:
def rfile(fn):
    return pd.read_csv(fn, dtype=str).set_index('Combined')
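For illustration, a minimal sketch of that two-step approach (the file name here is hypothetical, but the column layout matches the example above):

import pandas as pd

def rfile(fn):
    # read everything as strings first, then promote 'Combined' to the index;
    # the leading zeros survive because dtype=str is applied before set_index
    return pd.read_csv(fn, dtype=str).set_index('Combined')

df = rfile('example.csv')      # hypothetical file with the Combined/Percent layout
print(df.index.tolist())       # e.g. ['0101', '0102', '0104'], zeros preserved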

CodeEval Challenge 230: Football, Answer Only Partially Correct

I am working on a relatively new challenge in CodeEval called 'Football.' The description is listed in the following link:
https://www.codeeval.com/open_challenges/230/
Inputs are lines of a file read by Python, and within each line there are lists separated by '|', with each list representing a country: the first being country "1", second being country "2", and so on.
1 2 3 4 | 3 1 | 4 1
19 11 | 19 21 23 | 31 39 29
Outputs are also lines in response to each line read from the file.
1:1,2,3; 2:1; 3:1,2; 4:1,3;
11:1; 19:1,2; 21:2; 23:2; 29:3; 31:3; 39:3;
so team 1 is supported by countries 1, 2, and 3, as shown in the first entry of the output line: 1:1,2,3.
Below is my solution. Since I have no clue why it only works for the two sample cases listed in the description link, I'd like to ask for comments and hints on how to correct my code. Thank you very much for your time and assistance.
import sys

def football(string):
    countries = map(str.split, string.split('|'))
    teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
    results = []
    for i in range(len(teams)):
        results.append([teams[i]+':'])
        for j in range(len(countries)):
            if teams[i] in countries[j]:
                results[i].append(str(j+1))
    for i in range(len(results)):
        results[i] = results[i][0]+','.join(results[i][1:])
    return '; '.join(results) + '; '

if __name__ == '__main__':
    lines = [line.rstrip() for line in open(sys.argv[1])]
    for line in lines:
        print football(line)
After deliberately failing an attempt so I could check out the complete test input and my output, I found the problem. The line:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
will make the output problematic in terms of sorting. For example, here's a sample input:
10 20 | 43 23 | 27 | 25 | 11 1 12 43 | 33 18 3 43 41 | 31 3 45 4 36 | 25 29 | 1 19 39 | 39 12 16 28 30 37 | 32 | 11 10 7
and it produces the output:
1:5,9; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 3:6,7; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 4:7; 41:6; 43:2,5,6; 45:7; 7:12;
But the challenge expects the output teams to be sorted in ascending numerical order, which the code above doesn't achieve because the team numbers are strings, not integers. The solution is simply to add a key so the teams list is sorted in ascending integer order:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])), key=lambda x:int(x))
With this small change to the line, the code passes the tests. A sample output looks like:
1:5,9; 3:6,7; 4:7; 7:12; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 41:6; 43:2,5,6; 45:7;
Please let me know if you have a better and more efficient solution to the challenge. I'd love to read better code or get suggestions on improving my programming skills.
Here's how I solved it:
import sys

with open(sys.argv[1]) as test_cases:
    for test in test_cases:
        if test:
            team_supporters = {}
            for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
                for team in map(int, nation_teams.split()):
                    team_supporters.setdefault(team, []).append(nation)
            print(*("{}:{};".format(team, ",".join(map(str, sorted(nations))))
                    for team, nations in sorted(team_supporters.items())))
The problem is not very complicated. We're given a mapping from nation (implicitly numbered by their order in the input) to a list of teams. We need to reverse that to create an output that maps from a team to a list of nations.
It seems natural to use a dictionary that maps in the same way as the desired output. We can use enumerate to give numbers to the nations as we iterate over them. The setdefault method of the dict adds empty lists to the dictionary as they are needed (using a collections.defaultdict instead of a regular dictionary would be another way to deal with this). We don't need to care about the order of the input, nor the order things are stored in the dictionary's inner lists.
We build the output using str.format calls and the default space separator of the print function. If the final semicolon weren't desired, I'd have used print("; ".join("{}:{}".format(...))) instead. Since the output needs to be sorted by team at the top level, and by nation in the inner lists, we make sorted calls where necessary.
Sorting the inner lists is probably not even necessary, since the nations were processed in order, with their numbers derived from their order in the input line. Fortunately, Python's Timsort algorithm is very fast on already-sorted input, so even with a bit of unnecessary sorting, our code is still fast enough.
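For what it's worth, here is a minimal, self-contained sketch of the collections.defaultdict alternative mentioned above; it only changes how the per-team lists are created, and the rest of the loop stays the same.

from collections import defaultdict

test = "1 2 3 4 | 3 1 | 4 1"          # first sample input line from the question
team_supporters = defaultdict(list)   # missing teams get an empty list automatically
for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
    for team in map(int, nation_teams.split()):
        team_supporters[team].append(nation)

print(sorted(team_supporters.items()))
# [(1, [1, 2, 3]), (2, [1]), (3, [1, 2]), (4, [1, 3])]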

Grouping Similar words/phrases

I have a frequency table of words which looks like the one below:
> head(freqWords)
employees work bose people company
1879 1804 1405 971 959
employee
100
> tail(freqWords)
youll younggood yoyo ytd yuorself zeal
1 1 1 1 1 1
I want to create another frequency table which will combine similar words and add their frequencies.
In the above example, my new table should contain both employee and employees as one element with a frequency of 1979. For example:
> head(newTable)
employee,employees work bose people
1979 1804 1405 971
company
959
I know how to find similar words (using adist, stringdist), but I am unable to create the frequency table. For instance, I can use the following to get a list of similar words:
words <- names(freqWords)
lapply(words, function(x) words[stringdist(x, words) < 3])
and the following to get a list of similar two-word phrases:
lapply(words, function(x) words[stringdist2(x, words) < 3])
where stringdist2 is the following:
stringdist2 <- function(word1, word2){
  min(stringdist(word1, word2),
      stringdist(word1, gsub(word2,
                             pattern = "(.*) (.*)",
                             repl = "\\2,\\1")))
}
I do not have any punctuation/special symbols in my words/phrases. (I do not know a lot of R; I created stringdist2 by tweaking an implementation of adist2 I found here, but I do not understand everything about how pattern and repl work.)
So I need help creating the new frequency table.

Is it possible that an image of a given size of 26 * 7 can contain 78 separate colour values in each row of the Mat, and another can contain 77?

Is it possible that an image of a given size of 26 * 7 can contain 78, 77 or 79 values in each row of the Mat? Why is this so? I have 95 images, all of which are 26 by 7. I discovered that some images have 77 separate colour values in each row, and others have 79. This is problematic, since I need a standard value.
This is how the images are cropped to a size of 26 by 7.
Mat standard_size = largestObject1(Rect(0, 0, 26, 7)); // a template to get the standard size of a binary image
cv::resize(thresholdi, thresholdi, standard_size.size());
If I could convert it to only 26 pixel values, would I lose information?
I created the following code:
ofstream in("Eye_Gestures.txt");
//Eye Graze Class
IplImage *img2 = cvLoadImage("eye2.bmp");
Mat imgg2=cvarrToMat(img2);
Formatted line0i2 = format(imgg2.row(0),"CSV" );
Formatted line1i2 = format(imgg2.row(1),"CSV" );
Formatted line2i2 = format(imgg2.row(2),"CSV" );
Formatted line3i2 = format(imgg2.row(3),"CSV" );
Formatted line4i2 = format(imgg2.row(4),"CSV" );
Formatted line5i2 = format(imgg2.row(5),"CSV" );
Formatted line6i2 = format(imgg2.row(6),"CSV" );
in<<line0i2<<", "<<line1i2<<", "<<line2i2<<", "<<line3i2<<", "<<line4i2<<", "<<line5i2<<", "<<line6i2<<", "<<EyeGraze<<endl;
I need to ensure that each row stores the same number of separate colour values for all 95 images. If it is going to be 77 values, it needs to be 77 for all rows in each image. How can I ensure that I pass 77, not 78 or 79, values to the text file? How can I disregard the excess values in each row? How can I keep track of the separate colour values in a row without having to count them manually?
An image of 26 by 7 pixels contains 7 rows of 26 pixels each. There won't be 27, let alone 77 pixels on one row. You're entirely confused.
Go back and rethink what you're actually trying to achieve. Note that text files are not a natural format for image data.
A bitmap (bmp) image contains exactly 1 entry for each pixel. Depending on the type of bitmap you are using, these may be of different (but always constant) resolution.
The method you are using for converting the bmp to a CSV text representation seems to serialize each pixel as a set of 3 values (possibly RGB), so a row of 26 pixels becomes 26 * 3 = 78 entries. If in an image with a width of 26px you are getting a number of entries per line other than 78, you most likely made some mistake when reading the file.
Common mistakes would be 0 values expressed as empty cells ",," or - depending on the locale you are using - confusion between the decimal comma character and the csv separator.
Are you sure the separator is indeed the ',' character you are using for separating the lines when writing to the output stream, which is confusingly named in?
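As a quick sanity check (a sketch in Python/OpenCV rather than the C++ above, reusing the eye2.bmp name from the question), you can read the number of values per row straight from the image shape instead of counting CSV fields:

import cv2

img = cv2.imread("eye2.bmp")               # loads as a 3-channel BGR array by default
rows, cols = img.shape[:2]
channels = 1 if img.ndim == 2 else img.shape[2]
print(rows, cols, channels)                # expect 7, 26, 3 for a 26 x 7 colour bitmap
print(cols * channels)                     # 26 * 3 = 78 values serialized per row

gray = cv2.imread("eye2.bmp", cv2.IMREAD_GRAYSCALE)
print(gray.shape)                          # (7, 26): 26 values per row, but colour information is lost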