awk remove characters after number - regex

I'm writing an AWK script to clean up a data stream so it's usable for analysis. Right now, I have the following issue.
I have a data stream that looks like this:
56, 2
64, 3
72, 0
80, -1-
88, -3--
96, 1
04, -2-
12, -7----
20, -1-
28, 7
36, 1
44, -3--
52, 3
60, 0
68, 0
76, -3--
84, -5---
92, 1
00, 4
08, 3
16, -2-
24, -3--
32, 1
40, 3
I want to remove any dash that occurs after a digit, but keep the minus sign in front of the numbers, so it would look like this:
56, 2
64, 3
72, 0
80, -1
88, -3
96, 1
04, -2
12, -7
20, -1
28, 7
36, 1
44, -3
52, 3
60, 0
68, 0
76, -3
84, -5
92, 1
00, 4
08, 3
16, -2
24, -3
32, 1
40, 3
I know how to do this with sed (sed 's/-*$//'), but how could this be done with only awk so I can use it in my script?
Cheers

One way, simply using sub():
awk '{ sub(/-+$/, "", $NF); print }' infile
It yields:
56, 2
64, 3
72, 0
80, -1
88, -3
96, 1
04, -2
12, -7
20, -1
28, 7
36, 1
44, -3
52, 3
60, 0
68, 0
76, -3
84, -5
92, 1
00, 4
08, 3
16, -2
24, -3
32, 1
40, 3
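The same substitution can also be applied to the whole record ($0) instead of the last field, which avoids assuming the dashes trail the final field specifically. A quick sketch with a few of the sample lines inlined:

```shell
printf '80, -1-\n88, -3--\n96, 1\n' | awk '{ sub(/-+$/, ""); print }'
# 80, -1
# 88, -3
# 96, 1
```

Since the regex is anchored with $, a lone minus in front of a number is never touched.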

Using awk:
awk -F '-+$' '{$1=$1} 1' file
Using sed:
sed -i.bak 's/-*$//' file

Another possible solution:
awk -F "-+$" '{str=""; for(i=1; i<=NF; i++){str=str""$i} print str}' file
But I think sed is a better solution in this case.
Regards,
Idriss

Related

Django ORM fill 0 for missing date

I'm using Django 2.2.
I want to generate analytics of the number of records for each day between the start and end date.
The query used is
start_date = '2021-9-1'
end_date = '2021-9-30'
query = Tracking.objects.filter(
    scan_time__date__gte=start_date,
    scan_time__date__lte=end_date
)
query.annotate(
    scanned_date=TruncDate('scan_time')
).order_by(
    'scanned_date'
).values('scanned_date').annotate(
    **{'total': Count('created')}
)
Which produces output as
[{'scanned_date': datetime.date(2021, 9, 24), 'total': 5}, {'scanned_date': datetime.date(2021, 9, 26), 'total': 3}]
I want to fill the missing dates with 0, so that the output should be
2021-9-1: 0
2021-9-2: 0
...
2021-9-24: 5
2021-9-25: 0
2021-9-26: 3
...
2021-9-30: 0
How can I achieve this using either the ORM or Python (e.g., pandas)?
Use DataFrame.reindex with a date range created by date_range, after moving scanned_date into the index with DataFrame.set_index:
data = [{'scanned_date': datetime.date(2021, 9, 24), 'total': 5},
{'scanned_date': datetime.date(2021, 9, 26), 'total': 3}]
start_date = '2021-9-1'
end_date = '2021-9-30'
r = pd.date_range(start_date, end_date, name='scanned_date')
#if necessary convert to dates from datetimes
#r = pd.date_range(start_date, end_date, name='scanned_date').date
df = pd.DataFrame(data).set_index('scanned_date').reindex(r, fill_value=0).reset_index()
print (df)
scanned_date total
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 0
5 2021-09-06 0
6 2021-09-07 0
7 2021-09-08 0
8 2021-09-09 0
9 2021-09-10 0
10 2021-09-11 0
11 2021-09-12 0
12 2021-09-13 0
13 2021-09-14 0
14 2021-09-15 0
15 2021-09-16 0
16 2021-09-17 0
17 2021-09-18 0
18 2021-09-19 0
19 2021-09-20 0
20 2021-09-21 0
21 2021-09-22 0
22 2021-09-23 0
23 2021-09-24 5
24 2021-09-25 0
25 2021-09-26 3
26 2021-09-27 0
27 2021-09-28 0
28 2021-09-29 0
29 2021-09-30 0
Or use a left join with another DataFrame created from the range, replacing missing values with 0:
r = pd.date_range(start_date, end_date, name='scanned_date').date
df = pd.DataFrame({'scanned_date':r}).merge(pd.DataFrame(data), how='left', on='scanned_date').fillna(0)
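If you'd rather avoid pandas, a plain-Python pass over the date range works too. This is a sketch that assumes the queryset has already been materialized into the list of dicts shown above:

```python
import datetime

data = [{'scanned_date': datetime.date(2021, 9, 24), 'total': 5},
        {'scanned_date': datetime.date(2021, 9, 26), 'total': 3}]

start = datetime.date(2021, 9, 1)
end = datetime.date(2021, 9, 30)

# Build a lookup from the query results, then walk the full range,
# defaulting to 0 for any date that has no record.
totals = {row['scanned_date']: row['total'] for row in data}
result = {}
for i in range((end - start).days + 1):
    day = start + datetime.timedelta(days=i)
    result[day] = totals.get(day, 0)

print(result[datetime.date(2021, 9, 24)])  # 5
print(result[datetime.date(2021, 9, 25)])  # 0
```

This keeps everything in date keys, so it plugs straight into whatever serialization you need.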

Replace spaces with first character of the line on each line [duplicate]

This question already has answers here:
Can Vim's substitute command handle recursive pattern as sed's "t label"?
(2 answers)
Closed 2 years ago.
I have data that looks like this:
1, 100 200 3030 400 50023
2, 30 444 44334 441 123332
3, 100 200 3030 400 50023
I need to turn it into this:
1, 100
1, 200
1, 3030
1, 400
1, 50023
2, 30
2, 444
2, 44334
2, 441
2, 123332
etc.
I was able to do it with a vim macro, but the data is far too large. I was hoping something like awk could do it, but I am not really familiar with it.
Any help would be appreciated.
$ cat input
1, 100 200 3030 400 50023
2, 30 444 44334 441 123332
3, 100 200 3030 400 50023
$ awk '{for(i=2;i<=NF;i++) printf "%s %s\n", $1, $i}' input
1, 100
1, 200
1, 3030
1, 400
1, 50023
2, 30
2, 444
2, 44334
2, 441
2, 123332
3, 100
3, 200
3, 3030
3, 400
3, 50023
awk -F',' '{split($2,a," "); for (i in a) print $1, "," , a[i]}'
Explanation:
-F',' -- set the field separator to ,
split($2,a," ") -- split column 2 on spaces and populate array a
for (i in a) print $1, "," , a[i] -- loop over the array elements and print each one
(The demo output below shows "1 , 100" rather than "1, 100" because print with commas puts the output field separator around the literal ",".)
Demo:
renegade@Renegade:~$ cat test.txt
1, 100 200 3030 400 50023
2, 30 444 44334 441 123332
3, 100 200 3030 400 50023
renegade@Renegade:~$ awk -F',' '{split($2,a," "); for (i in a) print $1, "," , a[i]}' test.txt
1 , 100
1 , 200
1 , 3030
1 , 400
1 , 50023
2 , 30
2 , 444
2 , 44334
2 , 441
2 , 123332
3 , 100
3 , 200
3 , 3030
3 , 400
3 , 50023
renegade@Renegade:~$
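One caveat: for (i in a) iterates in an unspecified order in awk, so the numbers within a line are not guaranteed to come out in their original order. A numeric loop over split()'s return value preserves it; a small sketch with a sample line inlined:

```shell
printf '1, 100 200 3030\n' | awk -F', ' '{n = split($2, a, " "); for (i = 1; i <= n; i++) print $1 ", " a[i]}'
# 1, 100
# 1, 200
# 1, 3030
```

Concatenating $1 ", " a[i] directly (instead of using print's comma) also reproduces the exact "1, 100" spacing from the question.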

How can I set data with a specific system

I have a small problem. Given the example tables HAVE1 and HAVE2, I want to create a table like WANT: below each matching row, insert the value from HAVE2 repeated across all the columns (COL1 through COL19, without COL20). How can I do this?
data HAVE1;
infile DATALINES dsd missover;
input ID NAME $ COL1-COL20;
CARDS;
1, A1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ,20
2, A2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
3, B1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, 15, 16, 16, 20, 21 , 21, 22
4, B2, 1, 20, 3, 20, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 23, 22, 23
5, C1, 20, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 12, 13, 14, 15, 16, 17, 17, 17, 17
6, C2, 1, 2, 3, 20, 5, 6, 7, 8, 02, 10, 11, 12, 30, 14, 15, 16, 17, 18, 19, 20
;run;
Data HAVE2;
infile DATALINES dsd missover;
input ID NAME $ WARTOSC;
CARDS;
1, SUM, 50000
2, SUM, 55000
3, SUM, 60000
;run;
DATA WANT;
infile DATALINES dsd missover;
input ID NAME $ COL1-COL20;
CARDS;
1, A1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ,20
1, SUM_1 ,50000,50000,50000,50000,50000,50000,50000,50000,50000,50000,50000,50000,50000,50000,50000,50000,50000,50000,50000
2, A2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
2, SUM_2, 55000,55000,55000,55000,55000,55000,55000,55000,55000,55000,55000,55000,55000,55000,55000,55000,55000,55000,55000
3, B1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, 15, 16, 16, 20, 21 , 21, 22
3, SUM_3,60000,60000,60000,60000,60000,60000,60000,60000,60000,60000,60000,60000,60000,60000,60000,60000,60000,60000,60000
4, B2, 1, 20, 3, 20, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 23, 22, 23
5, C1, 20, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 12, 13, 14, 15, 16, 17, 17, 17, 17
6, C2, 1, 2, 3, 20, 5, 6, 7, 8, 02, 10, 11, 12, 30, 14, 15, 16, 17, 18, 19, 20
;run;
So it sounds like you just need to reformat the second dataset to match what you want and then combine them. Just copy the value of WARTOSC to all of the columns and drop the original WARTOSC variable.
data HAVE1;
infile CARDS dsd truncover;
input ID NAME $ COL1-COL5;
CARDS;
1, A1, 1, 2, 3, 4, 5
2, A2, 1, 2, 3, 4, 5
3, B1, 3, 4, 5, 6, 7
4, B2, 1, 20, 3, 20, 5
5, C1, 20, 2, 3, 4, 5
6, C2, 1, 2, 3, 20, 5
;
data HAVE2;
infile CARDS dsd truncover;
input ID NAME $ WARTOSC;
CARDS;
1, SUM, 50000
2, SUM, 55000
3, SUM, 60000
;
data have2_fixed;
  set have2;
  name=catx('_',name,id);
  array col col1-col5;
  do over col; col=wartosc; end;
  drop wartosc;
run;
data want;
  set have1 have2_fixed;
  by id;
run;
You could actually make the changes during the merge if the datasets are large.
data want;
  set have1 have2 (in=in2);
  by id;
  array col col1-col5;
  if in2 then do;
    name=catx('_',name,id);
    do over col; col=wartosc; end;
  end;
  drop wartosc;
run;
Results:
Obs ID NAME COL1 COL2 COL3 COL4 COL5
1 1 A1 1 2 3 4 5
2 1 SUM_1 50000 50000 50000 50000 50000
3 2 A2 1 2 3 4 5
4 2 SUM_2 55000 55000 55000 55000 55000
5 3 B1 3 4 5 6 7
6 3 SUM_3 60000 60000 60000 60000 60000
7 4 B2 1 20 3 20 5
8 5 C1 20 2 3 4 5
9 6 C2 1 2 3 20 5
Your wanted table is quite peculiar; you might be better off producing a report rather than a data set that you would simply PROC PRINT.
Regardless, the step will, for have2, require transformation of name and replication of wartosc.
For example:
data want (drop=wartosc);
  set have1 end=end1;
  output;
  if not end2 then
    set have2(rename=id=id2) end=end2;
  if id = id2 then do;
    array col col1-col20;
    do over col; col=wartosc; end;
    name = catx('_', name, id);
    output;
  end;
run;
You might need some more logic if the case of have2 having more rows than have1 can occur.

I would like some help on string functions

I can only use string objects and string functions for this exercise. When I tried to put " over " << loser, it didn't output the loser team's name in the right place. Also, I want the scores arranged as winnerscore << " to " << loserscore;
Basically,
cout << winner << " over " << loser << " " << winnerscore << " " << loserscore << endl;
Code:
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cctype>
using namespace std;
int main(){
    ifstream fin;
    string winner, winnerscore, loser, loserscore, hey, file;
    size_t pos, blank, blank2;
    fin.open("C:\\Users\\leewi\\Desktop\\Computer Programs & Projects\\C++\\BentleyCIS22B\\Ex5\\Ex5.txt");
    if (!fin)
    {
        cout << "Can't Open File." << endl;
        exit(0);
    }
    while(!fin.eof()){
        getline(fin, hey);
        pos = hey.find(' ');
        winner = hey.substr(0, pos);
        if(isalpha(hey[pos+1])){
            blank = hey.find(' ');
            winner += hey.substr(pos, blank);
        }
        else if(isdigit(hey[pos+1]))
        {
            blank = hey.find(',');
            winnerscore = hey.substr(pos, blank);
        }
        if(isalpha(hey[pos+1])){
            blank = hey.find(' ');
            loser += hey.substr(pos, hey.length());
        }
        else if(isdigit(hey[pos+1]))
        {
            loserscore = hey.substr(pos, hey.length());
        }
        cout << winner << " over " << loser << " " << winnerscore << " to " << loserscore << endl;
    }
    fin.close();
    return 0;
}
Output I Got:
Cincinnati over 27, Buffalo to 27, Buffalo 24
Detroit over 31, Cleve to 31, Cleveland 17
Kansas City over City 24, Oakland 7 31, Cleve to 31, Cleveland 17
Carolina over City 24, Oakland 7 35, Minnes to 35, Minnesota 10
Pittsburgh over City 24, Oakland 7 19, NY Jets to 19, NY Jets 6
Philadelphia over City 24, Oakland 7 31, Tampa Bay to 31, Tampa Bay 20
Green Bay over City 24, Oakland 7 Bay 19, Baltimore 17 31, Tampa Bay to 31, Tampa Bay 20
St. Lo over City 24, Oakland 7 Bay 19, Baltimore 17 Louis 38, Houston 13 31, Tampa Bay to 31, Tampa Bay 20
Denver over City 24, Oakland 7 Bay 19, Baltimore 17 Louis 38, Houston 13 35, Jack to 35, Jacksonville 19
Seattle over City 24, Oakland 7 Bay 19, Baltimore 17 Louis 38, Houston 13 20, Tenne to 20, Tennessee 13
New En over City 24, Oakland 7 Bay 19, Baltimore 17 Louis 38, Houston 13 England 30, New Orleans 27 20, Tenne to 20, Tennessee 13
San Fr over City 24, Oakland 7 Bay 19, Baltimore 17 Louis 38, Houston 13 England 30, New Orleans 27 Francisco 32, Arizona 20 20, Tenne to 20, Tennessee 13
Dallas over City 24, Oakland 7 Bay 19, Baltimore 17 Louis 38, Houston 13 England 30, New Orleans 27 Francisco 32, Arizona 20 31, Wash to 31, Washington 16
over City 24, Oakland 7 Bay 19, Baltimore 17 Louis 38, Houston 13 England 30, New Orleans 27 Francisco 32, Arizona 20 31, Wash to 31, Washington 16
Output I Want:
Cincinnati over Buffalo 27 to 24
Detroit over Cleveland 31 to 17
Kansas City over Oakland 24 to 7
Carolina over Minnesota 35 to 10
Pittsburgh over NY Jets 19 to 6
Philadelphia over Tampa Bay 31 to 20
Green Bay over Baltimore 19 to 17
St. Louis over Houston 38 to 13
Denver over Jacksonville 35 to 19
Seattle over Tennessee 20 to 13
New England over New Orleans 30 to 27
San Francisco over Arizona 32 to 20
Dallas over Washington 31 to 16
I see that the second if in the loop has the same condition as the first if. Did you forget something before the second if, maybe hey = hey.substr(pos); pos = hey.find(' ');?
In the body of that same second if, the calculated blank is never used.
Also, it should probably be loser =, not loser +=.
@Barmar gave you some useful advice.

Grouping data by value ranges

I have a csv file that shows parts on order. The columns include days late, qty and commodity.
I need to group the data by days late and commodity with a sum of the qty. However the days late needs to be grouped into ranges.
>56
>35 and <= 56
>14 and <= 35
>0 and <=14
I was hoping I could use a dict somehow, something like this:
{'Red': '>56', 'Amber': '>35 and <=56', 'Yellow': '>14 and <=35', 'White': '>0 and <=14'}
I am looking for a result like this
Red Amber Yellow White
STRSUB 56 60 74 40
BOTDWG 20 67 87 34
I am new to pandas, so I don't know if this is possible at all. Could anyone provide some advice?
Thanks
Suppose you start with this data:
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
# Days Late ID quantity
# 0 60 STRSUB 56
# 1 60 BOTDWG 20
# 2 50 STRSUB 60
# 3 50 BOTDWG 67
# 4 20 STRSUB 74
# 5 20 BOTDWG 87
# 6 10 STRSUB 40
# 7 10 BOTDWG 34
Then you can find the status category using pd.cut. Note that by default, pd.cut splits the Series df['Days Late'] into categories which are half-open intervals, (-1, 14], (14, 35], (35, 56], (56, 365]:
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
print(df)
# ID quantity status
# 0 STRSUB 56 Red
# 1 BOTDWG 20 Red
# 2 STRSUB 60 Amber
# 3 BOTDWG 67 Amber
# 4 STRSUB 74 Yellow
# 5 BOTDWG 87 Yellow
# 6 STRSUB 40 White
# 7 BOTDWG 34 White
Now use pivot to get the DataFrame in the desired form:
df = df.pivot(index='ID', columns='status', values='quantity')
and use reindex to obtain the desired order for the rows and columns:
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
Thus,
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
df = df.pivot(index='ID', columns='status', values='quantity')
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
print(df)
yields
Red Amber Yellow White
ID
STRSUB 56 60 74 40
BOTDWG 20 67 87 34
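As an aside, pd.cut can assign the string labels directly via its labels argument, which skips the NumPy indexing step. A minimal sketch with the same bin edges as above (the sample Days Late values are mine):

```python
import pandas as pd

df = pd.DataFrame({'Days Late': [60, 50, 20, 10]})
# labels= makes pd.cut return a Categorical with the names already applied;
# the order of labels must match the order of the (a, b] intervals.
df['status'] = pd.cut(df['Days Late'], bins=[0, 14, 35, 56, 365],
                      labels=['White', 'Yellow', 'Amber', 'Red'])
print(df['status'].tolist())  # ['Red', 'Amber', 'Yellow', 'White']
```

From there the pivot/reindex steps are identical.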
You can create a column in your DataFrame based on your Days Late column by using the map or apply functions as follows. Let's first create some sample data.
df = pandas.DataFrame({ 'ID': 'foo,bar,foo,bar,foo,bar,foo,foo'.split(','),
'Days Late': numpy.random.randn(8)*20+30})
Days Late ID
0 30.746244 foo
1 16.234267 bar
2 14.771567 foo
3 33.211626 bar
4 3.497118 foo
5 52.482879 bar
6 11.695231 foo
7 47.350269 foo
Create a helper function to transform the data of the Days Late column and add a column called Code.
def days_late_xform(dl):
    if dl > 56: return 'Red'
    elif 35 < dl <= 56: return 'Amber'
    elif 14 < dl <= 35: return 'Yellow'
    elif 0 < dl <= 14: return 'White'
    else: return 'None'
df["Code"] = df['Days Late'].map(days_late_xform)
Days Late ID Code
0 30.746244 foo Yellow
1 16.234267 bar Yellow
2 14.771567 foo Yellow
3 33.211626 bar Yellow
4 3.497118 foo White
5 52.482879 bar Amber
6 11.695231 foo White
7 47.350269 foo Amber
Lastly, you can use groupby to aggregate by the ID and Code columns, and get the counts of the groups as follows:
g = df.groupby(["ID","Code"]).size()
print g
ID Code
bar Amber 1
Yellow 2
foo Amber 1
White 2
Yellow 2
df2 = g.unstack()
print df2
Code Amber White Yellow
ID
bar 1 NaN 2
foo 1 2 2
I know this is coming a bit late, but I had the same problem as you and wanted to share the function np.digitize. It sounds like exactly what you want.
a = np.random.randint(0, 100, 50)
grps = np.arange(0, 100, 10)
grps2 = [1, 20, 25, 40]
print a
[35 76 83 62 57 50 24 0 14 40 21 3 45 30 79 32 29 80 90 38 2 77 50 73 51
71 29 53 76 16 93 46 14 32 44 77 24 95 48 23 26 49 32 15 2 33 17 88 26 17]
print np.digitize(a, grps)
[ 4 8 9 7 6 6 3 1 2 5 3 1 5 4 8 4 3 9 10 4 1 8 6 8 6
8 3 6 8 2 10 5 2 4 5 8 3 10 5 3 3 5 4 2 1 4 2 9 3 2]
print np.digitize(a, grps2)
[3 4 4 4 4 4 2 0 1 4 2 1 4 3 4 3 3 4 4 3 1 4 4 4 4 4 3 4 4 1 4 4 1 3 4 4 2
4 4 2 3 4 3 1 1 3 1 4 3 1]
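To turn the digitize indices into the color names from the question, you can index an array of labels. Passing right=True makes the bins half-open on the left, matching pd.cut's (a, b] intervals; the bin edges are the ones from the question, and the sample Days Late values are mine:

```python
import numpy as np

days_late = np.array([60, 50, 20, 10])
labels = np.array(['None', 'White', 'Yellow', 'Amber', 'Red'])
# right=True gives (a, b] intervals; index 0 means "at or below the first edge"
idx = np.digitize(days_late, bins=[0, 14, 35, 56], right=True)
print(labels[idx].tolist())  # ['Red', 'Amber', 'Yellow', 'White']
```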