I am working on a relatively new challenge in CodeEval called 'Football.' The description is listed in the following link:
https://www.codeeval.com/open_challenges/230/
Inputs are lines of a file read by Python, and within each line there are lists separated by '|', with each list representing a country: the first being country "1", second being country "2", and so on.
1 2 3 4 | 3 1 | 4 1
19 11 | 19 21 23 | 31 39 29
Outputs are also lines in response to each line read from the file.
1:1,2,3; 2:1; 3:1,2; 4:1,3;
11:1; 19:1,2; 21:2; 23:2; 29:3; 31:3; 39:3;
so country 1 supports team 1, 2, and 3 as shown in the first line of output: 1:1,2,3.
Below is my solution, and since I have no clue why the solution only works for the two sample cases lited in the description link, I'd like to ask anyone for comments and hints on how to correct my code. Thank you very much for your time and assistance ahead of time.
import sys
def football(string):
countries = map(str.split, string.split('|'))
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
results = []
for i in range(len(teams)):
results.append([teams[i]+':'])
for j in range(len(countries)):
if teams[i] in countries[j]:
results[i].append(str(j+1))
for i in range(len(results)):
results[i] = results[i][0]+','.join(results[i][1:])
return '; '.join(results) + '; '
if __name__ == '__main__':
lines = [line.rstrip() for line in open(sys.argv[1])]
for line in lines:
print football(line)
After deliberately failing an attempt to checkout the complete test input and my output, I found the problem. The line:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
will make the output problematic in terms of sorting. For example here's a sample input:
10 20 | 43 23 | 27 | 25 | 11 1 12 43 | 33 18 3 43 41 | 31 3 45 4 36 | 25 29 | 1 19 39 | 39 12 16 28 30 37 | 32 | 11 10 7
and it produces the output:
1:5,9; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 3:6,7; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 4:7; 41:6; 43:2,5,6; 45:7; 7:12;
But the challenge expects the output teams to be sorted by numbers in ascending order, which is not achieved by the above-mentioned code as the numbers are in string format, not integer format. Therefore the solution is simply adding a key to sort the teams list by ascending order for integer:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])), key=lambda x:int(x))
With a small change in this line, the code passes through the tests. A sample output looks like:
1:5,9; 3:6,7; 4:7; 7:12; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 41:6; 43:2,5,6; 45:7;
Please let me know if you have a better and more efficient solution to the challenge. I'd love to read better codes or great suggestions on improving my programming skills.
Here's how I solved it:
import sys
with open(sys.argv[1]) as test_cases:
for test in test_cases:
if test:
team_supporters = {}
for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
for team in map(int, nation_teams.split()):
team_supporters.setdefault(team, []).append(nation)
print(*("{}:{};".format(team, ",".join(map(str, sorted(nations))))
for team, nations in sorted(team_supporters.items())))
The problem is not very complicated. We're given a mapping from nation (implicitly numbered by their order in the input) to a list of teams. We need to reverse that to create an output that maps from a team to a list of nations.
It seems natural to use a dictionary that maps in the same way as the desired output. We can use enumerate to give numbers to the nations as we iterate over them. The setdefault method of the dict adds empty lists to the dictionary as they are needed (using a collections.defaultdict instead of a regular dictionary would be another way to deal with this). We don't need to care about the order of the input, nor the order things are stored in the dictionary's inner lists.
The output we build using str.format calls and the default space separator of the print function. If the final semicolon wasn't desired, I'd have used print("; ".join("{}:{}.format(...))) instead. Since the output needs to be sorted by team at the top level, and by nation in the inner lists, we make some sorted calls where necessary.
Sorting the inner lists is probably not even be necessary, since the nations were processed in order, with their numbers derived from the order they had in the input line. Fortunately, Python's Timsort algorithm is very fast on already-sorted input, so even with a bit of unnecessary sorting, our code is still fast enough.
Related
I'm trying to recode a list of values using Pyspark to create a new column. I've set my mapping up with nested dictionaries, but can't get the mapping syntax figured out. The original data has several string values that need to get recoded to a new value, then I want to give the column a new name. The original column values will get grouped several different ways to create different new columns.
The df will have several thousand columns, so I need the code to be as efficient as possible.
I have a different scenario with a 1-1 mapping where I was able to create my expression with:
#expr = [ create_map([lit(x) for x in chain(*values.items())])[orig_df[key]].cast(IntegerType()).alias('new_name') for key, values in my_dict.items() if key in orig_df.columns]
I just can't figure out the syntax for mapping the many to one.
Here's what I've tried:
grouping_dict = {'orig_col_n':{'new_col_n_a': {'20':['011','012'.'013'],'30':['014','015','016']},
'new_col_n_b': {'25':['011','013','015'],'35':['012','014','016']}}}
expr = [ f.when(f.col(key) == f.lit(old_val),f.lit(new_value))
.cast(IntegerType())
.alias(new_var_name)
for key, new_var_names_dict in grouping_dict.items()
for new_var_name,mapping_dict in new_var_names_dict.items()
for new_value,old_value_list in mapping_dict.items()
for old_val in old_value_list
if key in original_df.columns]
new_df = original_df.select(*expr)
This expression isn't quite right, it creates multiple columns with the same name as it loops through the values that need to be mapped.
Any suggestions for restructuring my dictionary or how to fix my syntax would be greatly appreciated!
enter image description here
enter image description here
orig_col_n new_col_n_a new_col_n_b
011 20 25
012 20 35
013 20 25
014 30 35
015 30 25
016 30 35
I have three csv files containing different data for a common object. These represent data about distinct collections of items at work. These objects have unique codes. The number of files is not important so I will set this problem up with two. I have a handy recipe for joining these files using join -- but the cleaning part is killing me.
File A snippet - contains unique data. Also the cataloging error E B.
B 547
J 65
EB 289
E B 1
CO 8900
ZX 7
File B snippet - unique data about a different dimension of the objects.
B 5
ZX 67
SD 4
CO 76
J 54
EB 10
Note that file B contains a code not in common with file A.
Now I submit to you the "official" canon of codes designated for this set of objects:
B
CO
ZX
J
EB
Note that File B contains a non-canonical code with data. It needs to be captured and documented. Same with bad code in file A.
End goal: run trend and stats on the collections using the various fields from the multiple reports. They mostly match the canon but there are oddballs due to cataloging errors and codes that are no longer in use.
End goal result after merge/join:
B 547 5
J 65 54
EB 289 10
CO 8900 76
ZX 7 67
So my first idea was to use grep -F -f for this, using the canonical codes as a search list then merge with join. Problem is, with one letter codes it's too inclusive. It would seem like a job for awk where it can work with tab delimiters and REGEX the oddball codes. I'm not sure though, how to get awk to use a list to sift other files. Will join alone handle all this? Maybe I merge with join or paste, then sift out the weirdos? Which method is the least brittle and more likely to handle edge cases like the drunk cataloger?
If you're thinking, "Dude, this is better done with Perl or Python ...etc.". I'm all ears. No rules, I just need to deliver!
Your question says the data is csv, but based on your samples I'm assuming it's tsv. I'm also assuming E B should end up in the outlier output and that NA values should be filled with 0.
Given those assumptions, the following may be sufficient:
sort -t $'\t' -k 1b,1 fileA > fileA.sorted && sort -t $'\t' -k 1b,1 fileB > fileB.sorted
join -t $'\t' -a1 -a2 -e0 -o auto fileA.sorted fileB.sorted > out
grep -f codes out > out-canon
grep -vf codes out > out-oddball
The content of file codes:
^B\s
^CO\s
^ZX\s
^J\s
^EB\s
Result:
$ cat out-canon
B 547 5
CO 8900 76
EB 289 10
J 65 54
ZX 7 67
$ cat out-oddball
E B 1 0
SD 0 4
Try this(GNU awk):
awk 'BEGIN{FS=OFS="\t";}ARGIND==1{c[$1]++;}ARGIND==2{b[$1]=$2}ARGIND==3{if (c[$1]) {print $1,$2,b[$1]+0; delete b[$1];} else {if(tolower($1)~"[a-z]+ +[a-z]+")print>"error.fileA"; else print>"oddball.fileA";}}END{for (i in b) {print i,0,b[i] " (? maybe?)";print i,b[i] > "oddball.fileB";}}' codes fileB fileA
It will create error.fileA, oddball.fileA if such lines exists, oddball.fileB.
Normal output didn't write to file, you can write with > yourself when results are ok:
B 547 5
J 65 54
EB 289 10
CO 8900 76
ZX 7 67
SD 0 4 (? maybe?)
Had a hard time reading your description, not sure if this is what you want.
Anyway it's easy to improve this awk code.
You can change to FILENAME=="file1", or FILENAME==ARGV[1] if ARGIND is not working.
I have a dataframe: Outlet_results
it goes something like this
index Calendar year/Week Material Sellthru Qty
0 37.2013 ABC 2
1 38.2913 ABC 7
2 37.2913 BCG 22
3 39.2013 XYZ 5
Now, I wanted a separate list for the Materials and week for further coding.
I used this code for the material list
mat_outlet = list(set(outlet_result['Material']))
It works perfectly and gives me 3 values (ABC, BCG, XYZ)
However, the week list shows a faulty output even though the code is same.
week_outlet_list = list(set(outlet_result['Calendar Year/Week']))
I am getting a list with 4 values
['38.2013', '37.2013', 'Calendar Year/Week', '39.2013']
Why is the string (header) included in the list? Please help me understand this concept.
I am using Python 2.7.... has it got something to do with it?
I was wondering if anyone knew an easier way of doing the following:
I have a dataset of health facility caseload by year, where each observation is one health facility. Facilities were 'brought online' in different years, so some have zeros before they have values for caseload. Also, some 'discontinue', as in they did provide services, but don't any more. I would like to replace the zeros with missing values for the years in which a facility discontinued. In the following example, the 3rd and 4th facilities discontinued, so I'd like missing for y2014 for the 3rd and y2013 & y2014 for the 4th.
y2011 y2012 y2013 y2014
0 0 76 82
0 0 29 13
0 0 25 0
5 10 0 0
0 0 17 24
I tried the following, which worked, but I'm going to have many years worth of data to work on (2000-2014), so was wondering if there was a more efficient way.
replace y2014=. if y2014==0 & (y2013>0 | y2012>0 | y2011>0)
replace y2013=. if y2013==0 & ( y2012>0 | y2011>0)
replace y2012=. if y2012==0 & ( y2011>0)
I messed around with egen rowlast to identify the facilities with a zero in the last year (meaning they discontinued), but then wasn't sure where to go with it.
Your problem would benefit from a loop over the variables.
We'll initialise started to 0, change our mind about started when we see a positive value, and change any subsequent 0s to missings if started is 1.
gen started = 0
forval y = 2000/2014 {
replace started = 1 if y`y' > 0
replace y`y' = . if started == 1 & y`y' == 0
}
Note that this scheme allows re-starts.
A more general comment is that this is not the better data structure for such panel or longitudinal data. This particular problem is not too challenging, but most problems with such data will be easier after reshape long.
See here for a survey of "rowwise" technique in Stata.
I'm taking my first steps writing code to do linguistic analysis of texts. I use Python and the NLTK library. The problem is that the actual counting of words takes up close to 100 % of my CPU (iCore5, 8GB RAM, macbook air 2014) and ran for 14 hours before I shut the process down. How can I speed the looping and counting up?
I have created a corpus in NLTK out of three Swedish UTF-8 formatted, tab-separated files Swe_Newspapers.txt, Swe_Blogs.txt, Swe_Twitter.txt. It works fine:
import nltk
my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")
Then I've loaded a text-file with one word per line into NLTK. That also works fine.
my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")
The text-file I want to analyse (Swe_Blogs.txt) has this structure, and works fine to parse:
Wordpress.com 2010/12/08 3 1,4,11 osv osv osv …
bloggagratis.se 2010/02/02 3 0 Jag är utled på plogade vägar, matte är lika utled hon.
wordpress.com 2010/03/10 3 0 1 kruka Sallad, riven
EDIT: The suggestion to produce the counter as below, does not work, but can be fixed:
counter = collections.Counter(word for word in my_corpus.words(categories=["Blogs"]) if word in my_wordlist)
This produces the error:
IOError Traceback (most recent call last)
<ipython-input-41-1868952ba9b1> in <module>()
----> 1 counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, categories)
182 def words(self, fileids=None, categories=None):
183 return PlaintextCorpusReader.words(
--> 184 self, self._resolve(fileids, categories))
185 def sents(self, fileids=None, categories=None):
186 return PlaintextCorpusReader.sents(
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site- packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, sourced)
89 encoding=enc)
90 for (path, enc, fileid)
---> 91 in self.abspaths(fileids, True, True)])
92
93
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/api.pyc in abspaths(self, fileids, include_encoding, include_fileid)
165 fileids = [fileids]
166
--> 167 paths = [self._root.join(f) for f in fileids]
168
169 if include_encoding and include_fileid:
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/ lib/python2.7/site-packages/nltk/data.pyc in join(self, fileid)
174 def join(self, fileid):
175 path = os.path.join(self._path, *fileid.split('/'))
--> 176 return FileSystemPathPointer(path)
177
178 def __repr__(self):
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/ lib/python2.7/site-packages/nltk/data.pyc in __init__(self, path)
152 path = os.path.abspath(path)
153 if not os.path.exists(path):
--> 154 raise IOError('No such file or directory: %r' % path)
155 self._path = path
IOError: No such file or directory: '/Users/mos/Documents/Blogs'
A fix is to assign my_corpus(categories=["Blogs"] to a variable:
blogs_text = my_corpus.words(categories=["Blogs"])
It's when I try to count all occurrences of each word (about 20K words) in the wordlist within the blogs in the corpus (115,7 MB) that my computer get's a little tired. How can I speed up the following code? It seems to work, no error messages, but it takes >14h to execute.
import collections
counter = collections.Counter()
for word in my_corpus.words(categories="Blogs"):
for token in my_wordlist.words():
if token == word:
counter[token]+=1
else:
continue
Any help to improve my coding skills is much appreciated!
It seems like your double loop could be improved:
for word in mycorp.words(categories="Blogs"):
for token in my_wordlist.words():
if token == word:
counter[token]+=1
This would be much faster as:
words = set(my_wordlist.words()) # call once, make set for fast check
for word in mycorp.words(categories="Blogs"):
if word in words:
counter[word] += 1
This takes you from doing len(my_wordlist.words()) * len(mycorp.words(...)) operations to closer to len(my_wordlist.words()) + len(mycorp.words(...)) operations, as building the set is O(n) and checking whether a word is in the set is O(1) on average.
You can also build the Counter direct from an iterable, as Two-Bit Alchemist points out:
counter = Counter(word for word in mycorp.words(categories="Blogs")
if word in words)
You already got good answers on how to count words properly with Python. The problem is that it will still be quite slow. If you are just exploring the corpora, using a chain of UNIX tools gives you a much quicker result. Assuming that your text is tokenized, something like this gives you the first 100 tokens in descending order:
cat Swe_Blogs.txt | cut --delimiter='\t' --fields=5 | tr ' ' '\n' | sort | uniq -c | sort -nr | head -n 100