Pandas create histogram with custom sort order - python-2.7

I am trying to create a histogram of education. However, my data is such that the edu variable is categorical: '10', '11', '12', 'GED', etc. I want a histogram whose bars follow that order: '10', '11', '12', 'GED', '13', etc.
My two approaches are:
1: Use pd.DataFrame.hist with a new variable that maps edu to a numeric edunum variable. However, I am then having trouble getting the original edu labels onto the histogram.
2: Use pd.Series(list(profile['edu'])).value_counts().plot(kind="bar"). However, I am having trouble sorting the bars into the right order.
Any suggestions? Thanks!
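One way to get the second approach working (a minimal sketch; the edu values here are made up to stand in for profile['edu'], and the order list is assumed to be known in advance) is to count the categories and then reindex the counts into the desired order before plotting:

```python
import pandas as pd

# hypothetical data standing in for profile['edu']
edu = pd.Series(['10', '12', 'GED', '11', '10', '13', 'GED', '12'])

# the desired categorical order
order = ['10', '11', '12', 'GED', '13']

# count each category, then reindex to force the custom order
counts = edu.value_counts().reindex(order)
# counts.plot(kind='bar')  # draws the bars in this order (needs matplotlib)
```

reindex keeps the original string labels, so nothing has to be mapped to numbers and back.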

Related

How do I split a tuple in python

How do I split a tuple?
I want to split the tuple, but an error occurs.
error : File "", line 1, in
How do I turn this...
a = ( '1', '2abc3', '4', '5')
into this?
a = ( '1', '2a', 'bc3', '4', '5')
Tuples are immutable.
If you want to do what you described, you can copy all the elements of the tuple into a list, perform your operation on the list, and then convert the list back into a tuple.
It is not possible to change a tuple with methods like tuple.pop() or tuple.remove() that you can use on lists, so I recommend rebuilding the tuple you want instead.
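The list round-trip described above can be sketched like this (the split position, after the first two characters of '2abc3', is assumed from the example in the question):

```python
a = ('1', '2abc3', '4', '5')

# tuples are immutable, so copy the elements into a list first
items = list(a)
# replace the second element with its two halves (split point assumed)
items[1:2] = [items[1][:2], items[1][2:]]
# convert the list back into a tuple
a = tuple(items)
# a is now ('1', '2a', 'bc3', '4', '5')
```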

Python run set.intersection with set of sets as input

I am working with biological datasets, going straight from transcriptome (RNA) data to finding certain protein sequences. I have a set of protein names for each dataset, and want to find which names are common to all datasets. Due to how the data is processed, I end up with one variable that contains all the subsets.
Due to how set.intersection() works, it requires at least 2 sets as input:
IDs = set.intersection(transc1 & trans2)
However I only have one input, colA which contains 30 sets of 80 to 100 entries. Here is what I have so far:
import pandas
from glob import glob

for file in glob('*_query.tsv'):  # input all 30 datasets, first column with protein IDs
    sources = file
    colnames = ['a', 'b', 'c', 'd', 'e', 'f']
    df = pandas.read_csv(sources, sep='\t', names=colnames)  # colnames headers for df construction
    colA = df.a.tolist()  # turn column a, protein IDs, into list
    IDs = set(colA)  # turn lists into sets
If I print(IDs) on each iteration, the output is something like this, one set per dataset:
set(['ID2', 'ID8', 'ID35', 'ID77', 'ID78', 'ID199', 'ID211'])
set(['ID1', 'ID5', 'ID8', 'ID88', 'ID105', 'ID205'])
At this point I get stuck. I can't get set.intersection() working with the IDs set of sets. Also tried pandas.merge(*IDs) for which the syntax seemed to work, but the number of entries for comparison exceeded the maximum (12).
I wanted to use sets because unlike lists, it should be quick to find common IDs between all the datasets. If there is a better way, I am all for it.
Help is much appreciated.
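One way to make this work (a sketch with made-up IDs): collect one set per file into a list instead of overwriting IDs on each loop iteration, then unpack the list into set.intersection with the * operator, which passes each set as a separate argument:

```python
# one set per dataset, e.g. built with all_sets.append(set(df.a)) inside the file loop
all_sets = [
    {'ID2', 'ID8', 'ID35', 'ID77', 'ID78', 'ID199', 'ID211'},
    {'ID1', 'ID5', 'ID8', 'ID77', 'ID105', 'ID205'},
    {'ID8', 'ID77', 'ID90'},
]

# unpack the list so each set becomes a separate argument
common = set.intersection(*all_sets)
# common holds only the IDs present in every set
```

This scales to the 30 sets in the question without any pairwise merging.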

Generate different permutations/combinations in a string

I am given a string (e.g. "12345678").
I want to generate different combinations using +,-,*,/.
Like :
'1+2+3+4+5+6+7+8'
'1+2*3-4+5+6-7+8'
'1-2+3+4*5+6-7*8'
'1-2-3-4+5*6+7+8'
'1+2+3+4+5+6*7*8'
'1-2+3-4+5-6+7-8'
Any idea how I can generate all the different combinations like those above?
This is one way to achieve it:
from itertools import product

numbers = "123456"
for operators in product('+-*/', repeat=len(numbers)-1):
    ret = numbers[0]
    for op, n in zip(operators, numbers[1:]):
        ret += op + n
    print(ret)
zip creates pairs of elements from two iterators; the rest is just string manipulation (and not in a very good way).
This is a little more compact (and more pythonic?) with some more itertools magic:
from itertools import product, zip_longest, chain

numbers = "123456"
operators = '+-*/'
for ops in product(operators, repeat=len(numbers)-1):
    print(''.join(chain(*zip_longest(numbers, ops, fillvalue=''))))
product is well documented. With zip_longest I create an iterator that will yield the pairs ('1', '+'), ('2', '*'), ..., ('6', '') (the last item is filled with the fillvalue; ops is one element shorter than numbers). The chain(*...) idiom is a simple way to flatten the tuples to get an iterator over the strings '1', '+', '2', '*', ..., '6', ''. Then I simply join these strings.
If you don't like the chain(*...) part, you can replace it with chain.from_iterable(...) (this time without the *, which may be a bit cleaner).
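A tiny check of that equivalence, on a shortened pair list:

```python
from itertools import chain

pairs = [('1', '+'), ('2', '*'), ('6', '')]

# both calls flatten the pairs into one iterator of strings
flat = list(chain(*pairs))
flat2 = list(chain.from_iterable(pairs))
# both produce ['1', '+', '2', '*', '6', '']
```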

django-haystack order_by not working

I have a query like
SearchQuerySet().all().models(Show).order_by('title')
This returns a list of objects. But the titles may contain special characters, like ./hack:twilight, and numbers, like 009:absolute.
According to the order_by documentation, priority goes to special characters. But in the output I see, it starts with the numbers.
Basically I need this output using that query
>>> list = ['apple', 'zebra', '.hack', 'orange', 'car', 'funk', 'python']
>>> list.sort()
>>> list
['.hack', 'apple', 'car', 'funk', 'orange', 'python', 'zebra']
Any idea?
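If the search backend's collation can't be changed, one workaround (a sketch only; it assumes the result set is small enough to sort client-side and that each result exposes a title attribute, as implied by order_by('title')) is to sort the returned results in Python, where plain string comparison gives punctuation priority over digits and letters:

```python
# e.g. results = SearchQuerySet().all().models(Show)
#      shows = sorted(results, key=lambda r: r.title)

# Python's default string sort orders punctuation < digits < letters:
titles = ['apple', 'zebra', '.hack', '009:absolute', 'orange']
titles.sort()
# ['.hack', '009:absolute', 'apple', 'orange', 'zebra']
```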

How to get a unique list of objects from AWS S3 bucket

The following code connects to an AWS S3 bucket and returns the list of objects in it. I'm trying to create a unique list out of the original list by selecting a partial value of each object name (i.e. batchID = str((s3_file.name).split("/"))[32:-13]). I have declared "batchID" as an array. When I use set() to return unique values, it returns the unique characters within each value, for example: ['1', '0', '3', '2', '5', '4', '9', '8'], ['1', '0', '3', '2', '5', '4', '7', '9', '8'], etc. So it is de-duping horizontally rather than vertically in the list. I expect the values themselves to be unique; see the expected output below. I also tried nested for loops with "not in" to return the unique values, but that didn't work either: it still removed duplicates horizontally, not vertically. Can anyone please help? Thank you in advance.
def __init__(self, aws_access_key_id, aws_secret_access_key, aws_bucket_to_download, use_ssl):
    self.run_id = []
    self.batchID = []
    self._aws_connection = S3Connection(aws_access_key_id, aws_secret_access_key, is_secure=use_ssl)
    self._runId(aws_bucket_to_download)

def _runId(self, aws_bucket_to_download):
    if not self._bucketExists(aws_bucket_to_download):
        self._printBucketNotFoundMessage(aws_bucket_to_download)
    else:
        bucket = self._aws_connection.get_bucket(aws_bucket_to_download)
        for s3_file in bucket.list(prefix='Download/test_queue1/'):
            batchID = str((s3_file.name).split("/"))[32:-13]
            #a = set(batchID)
            #batchID = list(a)
            print batchID
            #newList = list(set(batchID))
            #print newList
Output:
144019080231459
144019080231459
144019800231759
144019800231759
Expected output:
144019080231459
144019800231759
I think you're asking how to remove duplicate batch IDs. Why not add each batch ID to a list as you retrieve it, ignoring it if it's already in the list? For example:
batchIDlist = []
for s3_file in bucket.list(prefix='Download/test_queue1/'):
    batchID = str((s3_file.name).split("/"))[32:-13]
    if batchID not in batchIDlist:
        batchIDlist.append(batchID)
This will also keep items in the same order they were first found.
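For reference, the "horizontal" de-duplication in the question happens because a batch ID is a string, and calling set() on a string iterates over its characters. The fix is to de-duplicate the collection of IDs, not each ID (a sketch with the IDs from the question's output):

```python
# set() on a string de-duplicates its characters, not whole IDs
chars = set('144019080231459')
# chars is a set of single digits, e.g. {'0', '1', '2', ...}

# de-duplicate the list of IDs instead
ids = ['144019080231459', '144019080231459', '144019800231759']
unique_ids = sorted(set(ids))
# unique_ids == ['144019080231459', '144019800231759']
```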