Reading a CSV file and taking the majority vote of a certain column - python-2.7

I need to calculate the majority vote for the TARGET_LABEL column of my CSV file in Python.
I have a data frame with a Row ID and an assigned TARGET_LABEL. What I need is the count of the majority TARGET_LABEL. How do I do this?
For Example Data is in this form:
**Row ID TARGET_LABEL**
Row2 0
Row6 0
Row7 0
Row10 0
Row12 0
Row15 1
. .
. .
Row99999 1
I have a Python script which only reads data from the CSV. Here it is:
import csv

ifile = open('file1.csv', "rb")
reader = csv.reader(ifile)

rownum = 0
for row in reader:
    # Save header row.
    if rownum == 0:
        header = row
    else:
        colnum = 0
        for col in row:
            print '%-8s: %s' % (header[colnum], col)
            colnum += 1
    rownum += 1

ifile.close()
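The answers below work on a pandas DataFrame rather than the raw csv reader. A minimal sketch of loading the same file into a DataFrame, assuming the file and column names shown above:
import pandas as pd

# assumes file1.csv has a header row containing Row ID and TARGET_LABEL
df = pd.read_csv('file1.csv')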

In case TARGET_LABEL does not have any NaN values, you could use:
counts = df['TARGET_LABEL'].value_counts()
max_counts = counts.max()
Otherwise, if it could contain NaN values, use
df = df.dropna(subset=['TARGET_LABEL'])
which removes all the NaN values. Then
df['TARGET_LABEL'].value_counts().max()
should give you the max counts,
df['TARGET_LABEL'].value_counts().idxmax()
should give you the most frequent value.
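Putting this together, a small sketch assuming df is the DataFrame described above:
counts = df['TARGET_LABEL'].value_counts()   # NaN values are excluded by default

majority_label = counts.idxmax()             # the most frequent label
majority_count = counts.max()                # how many rows carry it

print 'majority label: %s (%d rows)' % (majority_label, majority_count)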

The package collections contains the class Counter, which works similarly to a dict (or more precisely a defaultdict(lambda: 0)) and which can be used to find the most frequent item.
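For example, a plain-Python sketch with Counter and no pandas, assuming TARGET_LABEL is the second column as in the sample data:
import csv
from collections import Counter

counts = Counter()
with open('file1.csv', 'rb') as ifile:
    reader = csv.reader(ifile)
    next(reader)                      # skip the header row
    for row in reader:
        counts[row[1]] += 1           # column index 1 assumed to be TARGET_LABEL

label, count = counts.most_common(1)[0]
print 'majority label: %s (%d rows)' % (label, count)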

Related

find unique values row wise on comma separated values

For a dataframe like below:
df = pd.DataFrame({'col':['abc,def,ghi,jkl,abc','abc,def,ghi,def,ghi']})
How do I get the unique values of the column col row wise in a new column, as follows:
col unique_col
0 abc,def,ghi,jkl,abc abc,def,ghi,jkl
1 abc,def,ghi,def,ghi abc,def,ghi
I tried using iteritems but got an AttributeError:
for i, item in df.col.iteritems():
    print item.unique()
import pandas as pd

df = pd.DataFrame({'col':['abc,def,ghi,jkl,abc','abc,def,ghi,def,ghi']})

def unique_col(col):
    return ','.join(set(col.split(',')))

df['unique_col'] = df.col.apply(unique_col)
result:
col unique_col
0 abc,def,ghi,jkl,abc ghi,jkl,abc,def
1 abc,def,ghi,def,ghi ghi,abc,def
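Note that set() does not preserve the order of first appearance, which is why the result differs from the expected output. If the original order matters, a sketch using OrderedDict as an ordered set could look like this:
from collections import OrderedDict

import pandas as pd

df = pd.DataFrame({'col':['abc,def,ghi,jkl,abc','abc,def,ghi,def,ghi']})

def unique_col_ordered(col):
    # OrderedDict.fromkeys keeps the first occurrence of each value, in order
    return ','.join(OrderedDict.fromkeys(col.split(',')))

df['unique_col'] = df.col.apply(unique_col_ordered)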

Python 2.7: How to write the result to an xlsx file [duplicate]

This question already has answers here:
How to write to an Excel spreadsheet using Python?
(12 answers)
Closed 4 years ago.
I have some reads:
ATCAAAGTCCCGTAGGTACGGGAAATGCAAAAAAA
GGGCTAGGTAGGGATTGCCTAGTCAACTGGGGGGG
TAGCTAGGTAGGGATTGCCTAGTCAACTGGCCCGG
...
...
Now I want to cut the 12 bases at the left of each read and write them to a file:
f2 = open("./read1.csv","w")
with open('../001.fastq') as reader:
for index, line in enumerate(reader):
if index % 4 == 1:
f2.write(line[:12]+'\n')
f2.close()
I want to know how to write an xlsx file instead.
Every read has 4 rows:
#
ATCAAAGTCCCGTAGGTACGGGAAATGCAAAAAAA
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
#
GGGCTAGGTAGGGATTGCCTAGTCAACTGGGGGGG
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
#
TAGCTAGGTAGGGATTGCCTAGTCAACTGGCCCGG
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
For this example, the output is:
ATCAAAGTCCCG
GGGCTAGGTAGG
TAGCTAGGTAGG
Assuming you are on Windows and have Excel installed.
There are multiple libraries that make it possible to use Excel from Python. Of these I have only used win32com.client, but I was pretty content with it. I think it comes with Anaconda by default, but if not you can download it from here: https://github.com/mhammond/pywin32/releases (don't forget to select the appropriate version/architecture).
Here is a mini reference to the most important functionalities:
from win32com.client import Dispatch # import the necessary library
xlApp = Dispatch("Excel.Application") # starts excel
xlApp.Visible = False # hides window (this makes things a bit faster)
xlBook = xlApp.Workbooks.Open( fullPathToXlsx ) # opens an existing workbook
xlSheet = xlBook.Sheets(3) # which sheet you want to use, indexing starts from 1!
print [sheet.Name for sheet in xlBook.Sheets] # if it is not the one you want you can see their order this way
xlSheet = xlBook.Sheets(sheetName) # or you can use its name
xlSheet.Cells(row, col).Value = newValue # newValue is what you want to assign to the cell, row and col indexing starts from 1
xlApp.DisplayAlerts = False # turns off any affirmation popup
xlBook.Save() # saving workbook
xlBook.SaveAs(newFullPath) # saving as...
xlBook.Close() # close workbook
xlApp.Quit() # exit application
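Tying the reference above back to the question, a rough sketch that writes each trimmed read into column A of a new workbook might look like this (the output path is hypothetical, and SaveAs behaviour can depend on the installed Excel version):
from win32com.client import Dispatch

xlApp = Dispatch("Excel.Application")
xlApp.Visible = False
xlBook = xlApp.Workbooks.Add()              # new, empty workbook
xlSheet = xlBook.Sheets(1)

with open('../001.fastq') as reader:
    outRow = 1
    for index, line in enumerate(reader):
        if index % 4 == 1:                  # sequence lines only
            xlSheet.Cells(outRow, 1).Value = line[:12]
            outRow += 1

xlApp.DisplayAlerts = False
xlBook.SaveAs(r'C:\temp\read1.xlsx')        # hypothetical output path
xlBook.Close()
xlApp.Quit()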
You might want to convert between a column/row index and the letter representation (I mean 'A1' for the top-left cell and so on). Here is what I used for it:
def cellStringToNum(aStr):
    """
    returns an excel sheet cell reference as numbers in a list as [col, row]
    """
    import string
    colS = ""
    for i in xrange(len(aStr)):
        if aStr[i] in string.ascii_uppercase:
            colS += aStr[i]
        else:
            row = int(aStr[i:])
            break
    col = 0
    for i in xrange(len(colS)):
        col += (string.ascii_uppercase.find(colS[len(colS)-1-i])+1)*(26**i)
    return [col, row]
def cellNumToString(col, row):
    import string
    colS = string.ascii_uppercase[col % 26 - 1]
    if col % 26 == 0:
        col -= 26
    else:
        col -= col % 26
    while col != 0:
        col /= 26
        colS = string.ascii_uppercase[col % 26 - 1] + colS
        if col % 26 == 0:
            col -= 26
        else:
            col -= col % 26
    return colS + str(row)
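A quick usage example of the two helpers above:
print cellStringToNum('AB12')   # [28, 12]  -> column 28, row 12
print cellNumToString(28, 12)   # 'AB12'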
Edit
But this question is already answered here: Python - Write to Excel Spreadsheet
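If Excel itself is not installed (the win32com route requires it), a cross-platform alternative is the openpyxl library; this is a different technique than the answer above, shown only as a minimal sketch:
from openpyxl import Workbook

wb = Workbook()
ws = wb.active

with open('../001.fastq') as reader:
    for index, line in enumerate(reader):
        if index % 4 == 1:
            ws.append([line[:12]])      # one trimmed read per row

wb.save('read1.xlsx')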

Read specific column of csv

The code is working fine, but it creates a list of values in parentheses (tuples). I want to modify the code so that it prints the csv in a proper column and row format.
Expected output :
Ver Total
4 5
4 5
4 5
4 5
Actual Output:
(ver,total) (4,5) (4,5) (4,5)
Here is the code:
import csv

f = open("a.csv", 'r')
reader = csv.reader(f)
data = []
for line in f:
    cells = line.split(",")
    data.append((cells[0], cells[3]))
print data
Try this code:
import csv

with open('a.csv') as csvfile:
    reader = csv.reader(csvfile)
    rowcnt = 0
    for row in reader:
        if rowcnt == 0:
            print row[0], row[1]
        else:
            print row[0], ' ', row[1]
        rowcnt = rowcnt + 1
Provides the following output:
Ver Stat
4 5
4 5
4 5
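If the goal is just to pull specific columns by name, a sketch using csv.DictReader would also work, assuming the header row of a.csv actually contains Ver and Total:
import csv

with open('a.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    print 'Ver', 'Total'
    for row in reader:
        # look columns up by header name instead of position
        print row['Ver'], row['Total']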

Python2.7: Too many Values to Unpack - Number of Columns unknown

I have a file that I want to unpack so I can use its columns in different files. The issue is that the number of columns varies from row to row (for example, row 1 could have 7 columns and row 2 could have 15).
How do I unpack the file without receiving the error "Too many values to unpack"?
filehandle3 = open('output_steps.txt', 'r')
filehandle4 = open('head_cluster.txt', 'w')
for line in iter(filehandle3):
    id, category = line.strip('\n').split('\t')
    filehandle4.write(id + "\t" + category + "\n")
filehandle3.close()
filehandle4.close()
Any help would be great. Thanks!
You should extract the values separately, if present, e.g. like this:
for line in iter(filehandle3):
    values = line.strip('\n').split('\t')
    id = values[0] if len(values) > 0 else None
    category = values[1] if len(values) > 1 else None
    ...
You could also create a helper function for this:
def safe_get(values, index, default=None):
    return values[index] if len(values) > index else default
or using try/except:
def safe_get(values, index, default=None):
    try:
        return values[index]
    except IndexError:
        return default
and use it like this:
category = safe_get(values, 1)
With Python 3, and if the rows always have at least as many elements as you need, you can use
for line in iter(filehandle3):
    id, category, *junk = line.strip('\n').split('\t')
This will bind the first element to id, the second to category, and the rest to junk.
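Since the question is tagged python-2.7, where starred unpacking in assignments is not available, a rough Python 2 equivalent of the same idea is to slice the list explicitly:
for line in iter(filehandle3):
    values = line.strip('\n').split('\t')
    # still assumes at least two columns per row; combine with safe_get above otherwise
    id, category, junk = values[0], values[1], values[2:]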

IndexError, but more likely I/O error

Unsure of why I am getting this error. I'm reading from a file called columns_unsorted.txt, then trying to write to columns_sorted.txt. The error is on fan_on = string_j[1], saying list index out of range. Here's my code:
#!/usr/bin/python
import fileinput
import collections

# open document to read from
j = open('./columns_unsorted.txt', 'r')
# note this is a file of rows of space-delimited data in the format <1384055277275353 0 0 0 1 0 0 0 0 22:47:57> on each row, the first term being unix time, the last human time, the middle binary indicating which machine event happened
# open document to record results into
l = open('./columns_sorted.txt', 'w')

# CREATE ARRAY CALLED EVENTS
events = collections.deque()
i = 1

# FILL ARRAY WITH "FACTS" ROWS; SPLIT INTO FIELDS, CHANGE TYPES AS APPROPRIATE
for line in j:  # columns_unsorted
    line = line.rstrip('\n')
    string_j = line.split(' ')
    time = str(string_j[0])
    fan_on = int(string_j[1])
    fan_off = int(string_j[2])
    heater_on = int(string_j[3])
    heater_off = int(string_j[4])
    space_on = int(string_j[5])
    space_off = int(string_j[6])
    pump_on = int(string_j[7])
    pump_off = int(string_j[8])
    event_time = str(string_j[9])
    row = time, fan_on, fan_off, heater_on, heater_off, space_on, space_off, pump_on, pump_off, event_time
    events.append(row)
You are missing the readlines function, no?
You have to do:
j = open('./columns_unsorted.txt', 'r')
l = j.readlines()
for line in l:
    # what you want to do with each line
    pass
In the future, you should print some of your variables, just to be sure the code is working as you want it to, and to help you identify problems.
(For example, if in your code you printed string_j, you would see what kind of problem you have.)
The problem was an inconsistent line in the data file. Forgive my haste in posting.
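Following up on that, a defensive sketch that skips lines without the expected 10 space-separated fields (the count of 10 is an assumption based on the format described in the code comments):
for line in j:
    string_j = line.rstrip('\n').split(' ')
    if len(string_j) != 10:
        # skip malformed or inconsistent rows instead of raising IndexError
        continue
    time = str(string_j[0])
    fan_on = int(string_j[1])
    # ... remaining fields as in the original loop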