Deleting rows in a file using Python - python-2.7

I have an input file "input.dat" that contains values like this:
41611 2014 12 18 0 0
41615 2014 12 18 0 0
41625 2014 12 18 0 0
41640 2014 6 14 3 3
42248 2014 12 18 0 0
42323 2014 12 18 0 0
42330 2014 8 13 7 7
42334 2014 12 18 0 0
42335 2014 12 18 0 0
...
I have many dataset files like this, and they contain a lot of unwanted data.
How can I delete the rows for IDs like 41640 and 42330, each entire row at once? For now I used this script:
with open(path+fname, "r") as input:
    with open("00-new.dat", "wb") as output:
        for line in input:
            if line != "41640"+"\n":
                output.write(line)
The result: the 41640 row still exists in the output. Any ideas?

You need to change your condition: as written, it checks whether the whole line equals "41640\n", but each line you read is the entire row of data followed by a \n, so the test never matches. A fixed version of your program looks like this:
with open("00-old.dat","r") as input:
with open("00-new.dat","wb") as output:
for line in input:
if "41640" not in line:
output.write(line)
To delete multiple lines you can use all() combined with a generator expression, as described for instance in this post:
if all(nb not in line for nb in del_list):
    output.write(line)
where del_list is a list of values you want deleted,
del_list = ["41615", "41640", "42334"]
One more note: operator precedence is not actually a problem in your original condition. + binds tighter than !=, so line != "41640"+"\n" compares line against the string "41640\n". The comparison is always true simply because each line holds the whole row, not just the ID.
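A quick check of that parse, with an illustrative row:

line = "41640 2014 6 14 3 3\n"
print(line != "41640" + "\n")         # True: line is the whole row, not "41640\n"
print(("41640" + "\n") == "41640\n")  # True: + is applied before the comparison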

Related

Power BI: line grouping

I am just starting out with Power BI, and I don't know how to group rows.
I have this kind of data:
api user 01/07/21 02/07/21 03/07/21 ...
a 25 null 3 4
b 25 1 null 2
c 25 1 4 5
a 30 4 3 5
b 30 3 2 2
c 30 1 1 3
I would like the sum of the values per user, rather than per api and user:
user 01/07/21 02/07/21 03/07/21 ...
25 2 7 11
30 8 6 10
Do you know how to do this?
I created a table with your sample data (make sure your values are treated as numbers). Then create a Matrix visual with "user" in the Rows well and your date columns in the Values well; the default Sum aggregation produces the per-user totals shown above.
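For comparison only (the Power BI steps above need no code), here is the same per-user aggregation sketched in pandas with the sample data from the question:

import pandas as pd

# Sample data from the question; None stands in for Power BI's null.
df = pd.DataFrame({
    'api':      ['a', 'b', 'c', 'a', 'b', 'c'],
    'user':     [25, 25, 25, 30, 30, 30],
    '01/07/21': [None, 1, 1, 4, 3, 1],
    '02/07/21': [3, None, 4, 3, 2, 1],
    '03/07/21': [4, 2, 5, 5, 2, 3],
})

# Summing per user ignores api and skips nulls, which is exactly what
# the Matrix visual's Sum aggregation does.
print(df.groupby('user')[['01/07/21', '02/07/21', '03/07/21']].sum())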

How to create the combinatorial combination of two files

I did some research but I am having difficulty finding an answer.
I am using Python 2.7 and pandas so far, but I am still learning.
I have two CSVs: say the alphabet A-Z in one and the digits 0-100 in the other.
I want to merge the two files to get every combination, A0 to A100 up through Z.
For information, the two files contain DNA sequences, so I believe they are strings.
I tried to create arrays with numpy and build a matrix, but to no avail.
Here is a preview of the files:
barcode
0 GGAAGAA
1 CCAAGAA
2 GAGAGAA
3 AGGAGAA
4 TCGAGAA
5 CTGAGAA
6 CACAGAA
7 TGCAGAA
8 ACCAGAA
9 GTCAGAA
10 CGTAGAA
11 GCTAGAA
12 GAAGGAA
13 AGAGGAA
14 TCAGGAA
659
barcode
0 CGGAAGAA
1 GCGAAGAA
2 GGCAAGAA
3 GGAGAGAA
4 CCAGAGAA
5 GAGGAGAA
6 ACGGAGAA
7 CTGGAGAA
8 CACGAGAA
9 AGCGAGAA
10 TCCGAGAA
11 GTCGAGAA
12 CGTGAGAA
13 GCTGAGAA
14 CGACAGAA
1995
I am putting here the way I found to do it; there might be a sexier way:
index = pd.MultiIndex.from_product([df8.barcode, df7.barcode], names=["df8", "df7"])
df = pd.DataFrame(index=index).reset_index()

def concat_BC(x):
    # Concatenate the two sequences into one new column
    return str(x["df8"]) + str(x["df7"])

df["BC"] = df.apply(concat_BC, axis=1)
– Stephane Chiron

replace multiple column values at the same time

I would like to replace multiple column values at the same time in a dataframe: change 2 to 1 and 1 to 2.
data = data.frame(store   = c(122, 323, 254, 435, 654, 342, 234, 344),
                  cluster = c(2, 2, 2, 1, 1, 3, 3, 3))
The problem with my code is that after it changes 2 to 1, it then changes those new 1's back to 2 as well.
Can I do this in dplyr or something similar? Thank you.
Desired data set below
store cluster
122 1
323 1
254 1
435 2
654 2
342 3
234 3
344 3
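The underlying fix is to apply the whole mapping in one pass rather than as two sequential replacements; in R, dplyr::recode can do that. For illustration, here is the same one-pass swap sketched in pandas:

import pandas as pd

data = pd.DataFrame({'store':   [122, 323, 254, 435, 654, 342, 234, 344],
                     'cluster': [2, 2, 2, 1, 1, 3, 3, 3]})

# A dict-based replace applies the mapping simultaneously, so the
# 2 -> 1 and 1 -> 2 swaps cannot clobber each other; 3 is left alone.
data['cluster'] = data['cluster'].replace({2: 1, 1: 2})
print(data)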

Issue Calculating Mean of Grouped Data for entire range of dataset using Pandas

I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans longer than 20 years and the method I used has worked for the first 20 year interval but gives me a (mostly) empty text file for any other set of years entered.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
Now to the problem you presented:
The strange behavior comes from the fact that pandas aligns on the index during assignment, and slicing preserves the original index. That means that when you set
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
the right-hand side, interval_mean.groupby(2, as_index=False).mean(), has indices 0, ..., ~364 (as_index=False makes the groupby create a fresh index; otherwise the index would be the day numbers). interval_mean, on the other hand, keeps the original indices from hist_mean: for 1974-1993 they start at 0 and overlap with the right-hand side, but for 1975-1994 they start around 365 (one year in) and count up, so the two sides share no indices at all.
This is a bit confusing at first, but pandas offers good documentation on it, and people quickly discover why alignment is so useful.
I'll try to explain what happens with an example.
Assume we have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1, 3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we perform the groupby we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index happens to equal the grouped-by values only because the draws fell between 0 and 2.) Now, when we assign to df.loc, pandas replaces every cell with the corresponding cell of the assignee, if such a cell exists; otherwise it leaves NaN.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when you write NaN to csv, it leaves the cell blank.
The last piece of the puzzle is how interval_mean kept the original indices: slicing preserves the index of the frame being sliced:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0
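One way out (a sketch under the file layout described above; file names are illustrative) is to build the per-day means as their own frame instead of assigning them back into a slice, so index alignment never enters the picture:

import pandas as pd

start, end = 1975, 1994
hist = pd.read_csv('tmean_daily_1974_2005.txt', sep='\s+', header=None)
sub = hist[(hist[0] >= start) & (hist[0] <= end)]

# Group by month (col 1) and day of year (col 2). With as_index=False the
# keys stay as ordinary columns and the result gets a fresh 0..N-1 index,
# so no surprising alignment can happen downstream.
means = sub.groupby([1, 2], as_index=False).mean()

means[0] = '%s-%s' % (start, end)                   # period label instead of year
means = means[[0, 1, 2] + list(means.columns[3:])]  # restore column order
means.to_csv('20_yr_mean_%s_%s.txt' % (start, end),
             sep='\t', header=False, index=False)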

Pandas dataframe applying NA to part of the data

Let me preface this with: I am new to pandas, so I'm sorry if this question is basic or has been answered before; I looked online and couldn't find what I needed.
I have a dataframe that contains a baseball team's schedule. Some of the games have already been played, and the results of those games are entered in the dataframe. However, for games that are yet to happen, there is only the time at which they will be played (e.g. 1:35 pm).
I would like to convert all of the values for games yet to happen into NaNs.
Thank you
As requested here is what the results dataframe for the Arizona Diamondbacks contains
print MLB['ARI']
0 0
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 1
9 0
10 1
...
151 3:40 pm
152 8:40 pm
153 8:10 pm
154 4:10 pm
155 4:10 pm
156 8:10 pm
157 8:10 pm
158 1:10 pm
159 9:40 pm
160 8:10 pm
161 4:10 pm
Name: ARI, Length: 162, dtype: object
Couldn't figure out a direct solution, only an iterative one:
for i in xrange(len(MLB)):
    if 'pm' in MLB['ARI'].iat[i] or 'am' in MLB['ARI'].iat[i]:
        MLB['ARI'].iat[i] = np.nan
This should work if your actual values (1s and 0s) are also strings. If they are numbers, try:
for i in xrange(len(MLB)):
    if not isinstance(MLB['ARI'].iat[i], int):
        MLB['ARI'].iat[i] = np.nan
The more idiomatic way to do this would be with the vectorised string methods.
http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods
mask = MLB['ARI'].str.contains('am|pm', na=False)  # boolean mask; na=False keeps non-string cells out
MLB['ARI'][mask] = np.nan  # the column name goes first
Create the boolean mask and then use it to select the data you want.
Make sure that the column name goes before the masking array; otherwise you'll be acting on a copy of the data and your original dataframe won't get updated.
MLB['ARI'][mask]  # returns a view on the MLB dataframe, will be updated
MLB[mask]['ARI']  # returns a copy of MLB, won't be updated
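Another option (a sketch, assuming every already-played game is stored as a number): coerce the whole column with pd.to_numeric, which turns anything unparsable, i.e. the time strings, into NaN in one step:

import pandas as pd

# errors='coerce' keeps the 0/1 results and converts every value that is
# not parseable as a number (the '8:10 pm' style times) into NaN.
MLB['ARI'] = pd.to_numeric(MLB['ARI'], errors='coerce')

With the mask approach above, writing MLB.loc[mask, 'ARI'] = np.nan sidesteps the view-versus-copy question entirely.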