I have a file that contains text like the following. How can I remove "junk" characters like ^[[H using Perl?
^[[H^[[2J^[(B^[[mtop - 19:25:22 up 69 days, 23:25, 2 users, load average: 2.55, 2.15, 1.83^[(B^[[m^[[39;49m^[[K
Tasks:^[(B^[[m^[[39;49m^[(B^[[m 114 ^[(B^[[m^[[39;49mtotal,^[(B^[[m^[[39;49m^[(B^[[m 1 ^[(B^[[m^[[39;49mrunning,^[(B^[[m^[[39;49m^[(B^[[m 113 ^[(B^[[m^[[39;49msleeping,^[(B^[[m^[[39;49m^[(B^[[m 0 ^[(B^[[m^[[39;49mstopped,^[(B^[[m^[[39;49m^[(B^[[m 0 ^[(B^[[m^[[39;49mzombie^[(B^[[m^[[39;49m^[[K
Cpu(s):^[(B^[[m^[[39;49m^[(B^[[m 18.1%^[(B^[[m^[[39;49mus,^[(B^[[m^[[39;49m^[(B^[[m 0.5%^[(B^[[m^[[39;49msy,^[(B^[[m^[[39;49m^[(B^[[m 0.0%^[(B^[[m^[[39;49mni,^[(B^[[m^[[39;49m^[(B^[[m 81.2%^[(B^[[m^[[39;49mid,^[(B^[[m^[[39;49m^[(B^[[m 0.0%^[(B^[[m^[[39;49mwa,^[(B^[[m^[[39;49m^[(B^[[m 0.0%^[(B^[[m^[[39;49mhi,^[(B^[[m^[[39;49m^[(B^[[m 0.2%^[(B^[[m^[[39;49msi,^[(B^[[m^[[39;49m^[(B^[[m 0.0%^[(B^[[m^[[39;49mst^[(B^[[m^[[39;49m^[[K
em: ^[(B^[[m^[[39;49m^[(B^[[m 16435100k ^[(B^[[m^[[39;49mtotal,^[(B^[[m^[[39;49m^[(B^[[m 3081324k ^[(B^[[m^[[39;49mused,^[(B^[[m^[[39;49m^[(B^[[m 13353776k ^[(B^[[m^[[39;49mfree,^[(B^[[m^[[39;49m^[(B^[[m 196396k ^[(B^[[m^[[39;49mbuffers^[(B^[[m^[[39;49m^[[K
Swap:^[(B^[[m^[[39;49m^[(B^[[m 4194296k ^[(B^[[m^[[39;49mtotal,^[(B^[[m^[[39;49m^[(B^[[m 0k ^[(B^[[m^[[39;49mused,^[(B^[[m^[[39;49m^[(B^[[m 4194296k ^[(B^[[m^[[39;49mfree,^[(B^[[m^[[39;49m^[(B^[[m 1531300k ^[(B^[[m^[[39;49mcached^[(B^[[m^[[39;49m^[[K
^[[6;1H
^[[7m PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND ^[(B^[[m^[[39;49m^[[K
^[(B^[[m22285 root 25 0 4931m 398m 11m S 47.6 2.5 545:13.49 java ^[(B^[[m^[[39;49m
^[(B^[[m19235 root 17 0 1406m 624m 10m S 2.0 3.9 6203:15 java ^[(B^[[m^[[39;49m
^[(B^[[m 1 root 15 0 10368 684 572 S 0.0 0.0 0:09.51 init ^[(B^[[m^[[39;49m
^[(B^[[m 2 root RT -5 0 0 0 S 0.0 0.0 2:02.87 migration/0 ^[(B^[[m^[[39;49m
^[(B^[[m 3 root 34 19 0 0 0 S 0.0 0.0 0:00.27 ksoftirqd/0 ^[(B^[[m^[[39;49m
^[(B^[[m 4 root RT -5 0 0 0 S 0.0 0.0 2:00.50 migration/1 ^[(B^[[m^[[39;49m
^[(B^[[m 5 root 34 19 0 0 0 S 0.0 0.0 0:00.26 ksoftirqd/1 ^[(B^[[m^[[39;49m
^[(B^[[m 6 root RT -5 0 0 0 S 0.0 0.0 2:04.21 migration/2 ^[(B^[[m^[[39;49m
^[(B^[[m 7 root 34 19 0 0 0 S 0.0 0.0 0:00.26 ksoftirqd/2 ^[(B^[[m^[[39;49m
^[(B^[[m 8 root RT -5 0 0 0 S 0.0 0.0 1:52.52 migration/3 ^[(B^[[m^[[39;49m
You can use the Term::ANSIColor module:
perl -MTerm::ANSIColor=colorstrip -ne 'print colorstrip $_' file
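As far as I can tell, colorstrip only removes the color (SGR) sequences; your sample also contains cursor-movement and erase sequences (^[[H, ^[[2J, ^[[K) and the ^[(B charset selection, which would be left behind. A rough regex sketch that also strips those (it covers the escapes seen in this sample, not every possible sequence):
perl -pe 's/\e\[[0-9;?]*[A-Za-z]//g; s/\e\(B//g' file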
I have been trying to combine the information from 2 dataframes into a single new dataframe without luck. I have searched extensively, but still can't find any relevant answer, so apologies if I have missed it in my search.
When creating an investing strategy, among a large set of currencies (more than 50) I have picked the top 5 currencies to invest in for every date (in top_n.csv) and their respective % weight to invest for each currency on each date (in weights.csv).
top_n.csv looks like:
Date 0 1 2 3 4
Aug 12, 2016 bitcoin ethereum 0 0 0
Aug 11, 2016 bitcoin ethereum ripple steem litecoin
Aug 10, 2016 bitcoin ethereum ripple 0 0
Aug 09, 2016 bitcoin ethereum steem ripple ethereum-classic
weights.csv looks like:
Date 0 1 2 3 4
Aug 12, 2016 0.859 0.089 nan nan nan
Aug 11, 2016 0.856 0.092 0.020 0.016 0.016
Aug 10, 2016 0.853 0.093 0.020 nan nan
Aug 09, 2016 0.858 0.086 0.020 0.020 0.017
The DataFrame I am trying to populate contains the same dates (in the index), but has columns corresponding to a larger set of coins (more than 50), as in W.csv.
Is there an efficient way that (for each date) populates the right weight for every currency that has one, and leaves the others at 0? The tricky part is dealing with dates when there are not enough currencies (so top_n.csv has fewer than n currencies, and weights.csv has NaNs in the respective positions).
W.csv looks like:
Date bitcoin ethereum bitcoin-cash ripple litecoin dash neo nem monero ethereum-classic iota qtum omisego lisk cardano zcash bitconnect tether stellar ....
Aug 12, 2016 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ....
Aug 11, 2016 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ....
Aug 10, 2016 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ....
Aug 09, 2016 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ....
My goal is a DataFrame that looks like W_all_target, which I attach as a file since it would not appear correctly here (I have edited it by hand for this question).
I have saved three indicative CSVs, as examining them will help:
https://drive.google.com/open?id=1olx9ARI0XP5mqbqF1pfRfJyl9wIEWyZj
I am still learning, so I understand this may be a simple question. Sincere thanks!!
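For reference, the options below assume the three CSVs have been loaded with Date as the index; a minimal, hypothetical setup (the date parsing is an assumption about the files):
import pandas as pd

# Hypothetical setup: file names come from the question; parsing the
# 'Aug 12, 2016' dates into a DatetimeIndex is an assumption.
top_n   = pd.read_csv('top_n.csv',   index_col='Date', parse_dates=True)
weights = pd.read_csv('weights.csv', index_col='Date', parse_dates=True)
W       = pd.read_csv('W.csv',       index_col='Date', parse_dates=True)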
Option 0
This accommodates the '0' placeholders and the NaNs:
# repeat each date once per top-n column; flatten currencies and weights in step
dates = top_n.index.repeat(top_n.shape[1])
currs = top_n.values.ravel()
wghts = weights.values.ravel()
# drop the '0' placeholder entries before reshaping
mask = currs != '0'
reshaped = pd.Series(wghts[mask], [dates[mask], currs[mask]]).unstack(fill_value=0)
W.update(reshaped)
Option 1
reshaped = pd.concat([d.stack() for d in [top_n, weights]], axis=1) \
    .reset_index(1, drop=True).set_index(0, append=True)[1].unstack(fill_value=0)
reshaped
0 bitcoin ethereum ethereum-classic litecoin ripple steem
Date
2016-08-09 0.858 0.086 0.017 0.000 0.02 0.020
2016-08-10 0.853 0.093 0.000 0.016 0.02 0.018
2016-08-11 0.856 0.092 0.000 0.016 0.02 0.016
2016-08-12 0.859 0.089 0.000 0.016 0.02 0.015
Option 2
reshaped = pd.Series(
    weights.values.ravel(),
    [top_n.index.repeat(top_n.shape[1]), top_n.values.ravel()]
).unstack(fill_value=0)
reshaped
bitcoin ethereum ethereum-classic litecoin ripple steem
Date
2016-08-09 0.858 0.086 0.017 0.000 0.02 0.020
2016-08-10 0.853 0.093 0.000 0.016 0.02 0.018
2016-08-11 0.856 0.092 0.000 0.016 0.02 0.016
2016-08-12 0.859 0.089 0.000 0.016 0.02 0.015
Then you should be able to update W with
W.update(reshaped)
W
bitcoin ethereum bitcoin-cash ripple litecoin dash neo nem monero ethereum-classic iota qtum omisego lisk cardano zcash bitconnect tether stellar
Date
2016-08-12 0.859 0.089 0 0.02 0.016 0 0 0 0 0.000 0 0 0 0 0 0 0 0 0
2016-08-11 0.856 0.092 0 0.02 0.016 0 0 0 0 0.000 0 0 0 0 0 0 0 0 0
2016-08-10 0.853 0.093 0 0.02 0.016 0 0 0 0 0.000 0 0 0 0 0 0 0 0 0
2016-08-09 0.858 0.086 0 0.02 0.000 0 0 0 0 0.017 0 0 0 0 0 0 0 0 0
I am trying to get a count between two timestamped values:
for example:
time letter
1 A
4 B
5 C
9 C
18 B
30 A
30 B
I am dividing the time span into windows of size (1 + 30) / 30 ≈ 1. Then I want to know how many A, B and C fall in each time window of size 1:
timeseries A B C
1 1 0 0
2 0 0 0
...
30 1 1 0
This should give me a table of 30 rows and 3 columns with the counts of A, B and C occurrences.
The problem is that the breakdown takes too long: the code iterates through the whole master table for every window to slice the data, even though the data is already sorted:
master = mytable
minimum = master.timestamp.min()
maximum = master.timestamp.max()

window = (minimum + maximum) / maximum
wstart = minimum
wend = minimum + window

concurrent_tasks = []
while (wstart <= maximum):
    As = 0
    Bs = 0
    Cs = 0
    for d, row in master.iterrows():
        ttime = row.timestamp
        if ((ttime >= wstart) & (ttime < wend)):
            #print (row.channel)
            if (row.channel == 'A'):
                As = As + 1
            elif (row.channel == 'B'):
                Bs = Bs + 1
            elif (row.channel == 'C'):
                Cs = Cs + 1
    concurrent_tasks.append([m_id, As, Bs, Cs])
    wstart = wstart + window
    wend = wend + window
Could you help me make this perform better? I want to use a map function and prevent Python from looping through the whole table every time. This is part of a big-data job and it is taking days to finish. Thank you.
There is a faster approach - pd.get_dummies():
In [116]: pd.get_dummies(df.set_index('time')['letter'])
Out[116]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 0 0
30 0 1 0
If you want to "compress" (group) it by time:
In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
Out[146]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
or using sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)
r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),
                       index=df['time'].unique(),
                       columns=df['letter'].unique(),
                       default_fill_value=0)
Result:
In [143]: r
Out[143]:
A B C
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
If we want to list all times from 1 to 30:
In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
Out[153]:
A B C
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 1 0
19 0 0 0
20 0 0 0
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
or using a Pandas approach:
In [159]: pd.get_dummies(df.set_index('time')['letter']) \
...: .groupby(level=0) \
...: .sum() \
...: .reindex(np.arange(r.index.min(), r.index.max()+1), fill_value=0)
...:
Out[159]:
A B C
time
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
... .. .. ..
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
[30 rows x 3 columns]
UPDATE:
Timing:
In [163]: df = pd.concat([df] * 10**4, ignore_index=True)
In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop
In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop
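The approaches above count occurrences at the exact time values; for the windowing part of the question (fixed-width windows over the full span), here is a vectorized sketch, assuming master has the timestamp and channel columns from the question's code:
import numpy as np
import pandas as pd

width = 1  # window size; adjust as needed
# assign each row to an integer window bin, measured from the first timestamp
bins = ((master['timestamp'] - master['timestamp'].min()) // width).astype(int)
# count each channel per window in one pass
counts = pd.crosstab(bins, master['channel'])
# make empty windows appear as all-zero rows
counts = counts.reindex(np.arange(bins.max() + 1), fill_value=0)
This replaces the nested loops with a single pass over the data, so it should scale far better on large tables.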
I have the following dataframe:
Daily_KWH_System year month day hour minute second
0 4136.900384 2016 9 7 0 0 0
1 3061.657187 2016 9 8 0 0 0
2 4099.614033 2016 9 9 0 0 0
3 3922.490275 2016 9 10 0 0 0
4 3957.128982 2016 9 11 0 0 0
5 4177.014316 2016 9 12 0 0 0
6 3077.103445 2016 9 13 0 0 0
7 4123.103795 2016 9 14 0 0 0
.. ... ... ... ... ... ... ...
551 NaN 2016 11 23 0 0 0
552 NaN 2016 11 24 0 0 0
553 NaN 2016 11 25 0 0 0
.. ... ... ... ... ... ... ...
579 NaN 2016 11 27 0 0 0
580 NaN 2016 11 28 0 0 0
The variable types are as follows:
print(df.dtypes)
Daily_KWH_System object
year int32
month int32
day int32
hour int32
minute int32
second int32
I need to convert "Daily_KWH_System" to float so that I can use it in a linear regression model.
I tried the code below, which worked fine:
df['Daily_KWH_System'] = pd.to_numeric(df['Daily_KWH_System'], errors='coerce')
Then I replaced the NaNs with a blank space, to use in my model, using the following code:
df = df.replace(np.nan,' ', regex=True)
But the variable "Daily_KWH_System" gets converted back to object as soon as I replace the NaNs. Please let me know how to go about this.
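A minimal sketch of the dtype behaviour described here, using a toy Series standing in for the column: replacing NaN with a string (even a single space) coerces the column to object, while a numeric fill keeps it float:
import numpy as np
import pandas as pd

s = pd.Series([4136.9, np.nan, 3061.7])
print(s.dtype)                        # float64
print(s.replace(np.nan, ' ').dtype)   # object: the blank space is a string
print(s.fillna(0).dtype)              # float64: a numeric fill keeps the dtype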
*New to Python.
I'm trying to merge multiple text files into 1 csv; example below -
filename.csv
Alpha
0
0.1
0.15
0.2
0.25
0.3
text1.txt
Alpha,Beta
0,10
0.2,20
0.3,30
text2.txt
Alpha,Charlie
0.1,5
0.15,15
text3.txt
Alpha,Delta
0.1,10
0.15,20
0.2,50
0.3,10
Desired output in the csv file: -
filename.csv
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 10
0.15 0 15 20
0.2 20 0 50
0.25 0 0 0
0.3 30 0 10
The code I've been working with, and others that were provided, give me an answer similar to what is at the bottom of this question:
def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='inner', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    outputDf = pandas.merge(leftDf, outputDf, how='inner', on='Alpha', sort=True, copy=False).fillna(0)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
However, instead of the desired result, the answer I get is:
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 0
0.1 0 0 10
0.15 0 15 0
0.15 0 0 20
0.2 20 0 0
0.2 0 0 50
0.25 0 0 0
0.3 30 0 0
0.3 0 0 10
IIUC you can create a list of all DataFrames (dfs), append mergedDf inside the loop, and finally concat all the DataFrames into one:
import pandas
import glob
import os
def mergeData(indir="dir/path", outdir="dir/path"):
dfs = []
os.chdir(indir)
fileList=glob.glob("*.txt")
for filename in fileList:
left= "/path/filename.csv"
right = filename
output = "/path/filename.csv"
leftDf = pandas.read_csv(left)
rightDf = pandas.read_csv(right)
mergedDf = pandas.merge(leftDf,rightDf,how='right',on="Alpha", sort=True)
dfs.append(mergedDf)
outputDf = pandas.concat(dfs, ignore_index=True)
#add missing rows from leftDf (in sample Alpha - 0.25)
#fill NaN values by 0
outputDf = pandas.merge(leftDf,outputDf,how='left',on="Alpha", sort=True).fillna(0)
#columns are converted to int
outputDf[['Beta', 'Charlie']] = outputDf[['Beta', 'Charlie']].astype(int)
print (outputDf)
outputDf.to_csv(output, index=0)
mergeData()
Alpha Beta Charlie
0 0.00 10 0
1 0.10 0 5
2 0.15 0 15
3 0.20 20 0
4 0.25 0 0
5 0.30 30 0
EDIT:
The problem is that you changed the parameter how='left' in the second merge to how='inner':
def mergeData(indir="Dir Path", outdir="Dir Path"):
dfs = []
os.chdir(indir)
fileList=glob.glob("*.txt")
for filename in fileList:
left= "/Path/Final.csv"
right = filename
output = "/Path/finalMerged.csv"
leftDf = pandas.read_csv(left)
rightDf = pandas.read_csv(right)
mergedDf = pandas.merge(leftDf,rightDf,how='inner',on="Alpha", sort=True)
dfs.append(mergedDf)
outputDf = pandas.concat(dfs, ignore_index=True)
#need left join, not inner
outputDf = pandas.merge(leftDf, outputDf, how='left', on='Alpha', sort=True, copy=False)
.fillna(0)
print (outputDf)
outputDf.to_csv(output, index=0)
mergeData()
Alpha Beta Charlie Delta
0 0.00 10.0 0.0 0.0
1 0.10 0.0 5.0 0.0
2 0.10 0.0 0.0 10.0
3 0.15 0.0 15.0 0.0
4 0.15 0.0 0.0 20.0
5 0.20 20.0 0.0 0.0
6 0.20 0.0 0.0 50.0
7 0.25 0.0 0.0 0.0
8 0.30 30.0 0.0 0.0
9 0.30 0.0 0.0 10.0
import pandas as pd
data1 = pd.read_csv('samp1.csv',sep=',')
data2 = pd.read_csv('samp2.csv',sep=',')
data3 = pd.read_csv('samp3.csv',sep=',')
df1 = pd.DataFrame({'Alpha':data1.Alpha})
df2 = pd.DataFrame({'Alpha':data2.Alpha,'Beta':data2.Beta})
df3 = pd.DataFrame({'Alpha':data3.Alpha,'Charlie':data3.Charlie})
mergedDf = pd.merge(df1, df2, how='outer', on ='Alpha',sort=False)
mergedDf1 = pd.merge(mergedDf, df3, how='outer', on ='Alpha',sort=False)
a = pd.DataFrame(mergedDf1)
print(a.drop_duplicates())
output:
Alpha Beta Charlie
0 0.00 10.0 NaN
1 0.10 NaN 5.0
2 0.15 NaN 15.0
3 0.20 20.0 NaN
4 0.25 NaN NaN
5 0.30 30.0 NaN
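As a side note, a compact alternative sketch under the question's file layout (file names assumed; Alpha values assumed unique within each file): index every file on Alpha and let pd.concat align the columns, which avoids the duplicated rows entirely:
import glob
import pandas as pd

# filename.csv supplies the full Alpha grid; each .txt contributes its column.
# pd.concat(axis=1) outer-joins on the shared Alpha index.
base = pd.read_csv('filename.csv').set_index('Alpha')
parts = [pd.read_csv(f).set_index('Alpha') for f in sorted(glob.glob('*.txt'))]
out = pd.concat([base] + parts, axis=1).fillna(0).sort_index()
out.to_csv('merged.csv')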
Porting a bit of Fortran 77 code, it appears that REAL variables are being assigned to INTEGER variables. I do not have a way to run this code and wonder what the behavior is in the following case:
REAL*4 A
A = 123.25
B = INT(A)
B = 123 or B = 124?
How about at the 0.5 mark?
REAL*4 C
C = 123.5
D = INT(C)
D = 123 or D = 124?
INT always truncates toward zero:
From the GCC documentation:
These functions return an INTEGER variable or array under the following rules:
(A) If A is of type INTEGER, INT(A) = A.
(B) If A is of type REAL and |A| < 1, INT(A) equals 0. If |A| ≥ 1, then INT(A) is the integer whose magnitude is the largest integer that does not exceed the magnitude of A and whose sign is the same as the sign of A.
(C) If A is of type COMPLEX, rule (B) is applied to the real part of A.
If you want to round to the nearest integer, use NINT.
So, in your case B and D are always 123 (if they are declared as integer).
Here is an example program and its output; it is an extension of the previous answer:
PROGRAM test
  implicit none
  integer :: i = 0
  real    :: dummy = 0.

  do i = 0, 30
    dummy = -1.0 + (i * 0.1)
    write(*,*) i, dummy, int(dummy), nint(dummy), floor(dummy)
  enddo

  stop
end PROGRAM test
This is the output:
$ ./test
0 -1.000000 -1 -1 -1
1 -0.9000000 0 -1 -1
2 -0.8000000 0 -1 -1
3 -0.7000000 0 -1 -1
4 -0.6000000 0 -1 -1
5 -0.5000000 0 -1 -1
6 -0.4000000 0 0 -1
7 -0.3000000 0 0 -1
8 -0.2000000 0 0 -1
9 -9.9999964E-02 0 0 -1
10 0.0000000E+00 0 0 0
11 0.1000000 0 0 0
12 0.2000000 0 0 0
13 0.3000001 0 0 0
14 0.4000000 0 0 0
15 0.5000000 0 1 0
16 0.6000000 0 1 0
17 0.7000000 0 1 0
18 0.8000001 0 1 0
19 0.9000000 0 1 0
20 1.000000 1 1 1
21 1.100000 1 1 1
22 1.200000 1 1 1
23 1.300000 1 1 1
24 1.400000 1 1 1
25 1.500000 1 2 1
26 1.600000 1 2 1
27 1.700000 1 2 1
28 1.800000 1 2 1
29 1.900000 1 2 1
30 2.000000 2 2 2
I hope this helps to further clarify the question.
EDIT: Compiled with ifort 2013 on a Xeon.
Just to complement the existing answers, I want to add an example of how commercial rounding (round half away from zero) can be realized without using NINT, by
L = INT(F + 0.5)
where L is an INTEGER and F is a positive REAL number. I've found this in FORTRAN 77 code samples from the last century.
Extending this to negative REAL numbers gives
L = SIGN(1.0,F)*INT(ABS(F) + 0.5)
and, going back to the 1980s, the minimal code example looks like this:
      PROGRAM ROUNDTEST
      DO 12345 I=0,30
        F = -1.0 + I * 0.1
        J = INT(F)
        K = NINT(F)
        L = SIGN(1.0,F)*INT(ABS(F) + 0.5)
        PRINT *, I, F, J, K, L
12345 CONTINUE
      END
which creates the output
$ ./ROUNDTEST
0 -1.00000000 -1 -1 -1
1 -0.899999976 0 -1 -1
2 -0.800000012 0 -1 -1
3 -0.699999988 0 -1 -1
4 -0.600000024 0 -1 -1
5 -0.500000000 0 -1 -1
6 -0.399999976 0 0 0
7 -0.300000012 0 0 0
8 -0.199999988 0 0 0
9 -9.99999642E-02 0 0 0
10 0.00000000 0 0 0
11 0.100000024 0 0 0
12 0.200000048 0 0 0
13 0.300000072 0 0 0
14 0.399999976 0 0 0
15 0.500000000 0 1 1
16 0.600000024 0 1 1
17 0.700000048 0 1 1
18 0.800000072 0 1 1
19 0.899999976 0 1 1
20 1.00000000 1 1 1
21 1.10000014 1 1 1
22 1.20000005 1 1 1
23 1.29999995 1 1 1
24 1.40000010 1 1 1
25 1.50000000 1 2 2
26 1.60000014 1 2 2
27 1.70000005 1 2 2
28 1.79999995 1 2 2
29 1.90000010 1 2 2
30 2.00000000 2 2 2
ROUNDTEST is compiled and linked with gfortran version 7.4.0 by
$ gfortran.exe ROUNDTEST.FOR -o ROUNDTEST
Hope this helps you if you have to deal with old FORTRAN code.