I have data that looks like this:
1, 100 200 3030 400 50023
2, 30 444 44334 441 123332
3, 100 200 3030 400 50023
I need to turn it into this:
1, 100
1, 200
1, 3030
1, 400
1, 50023
2, 30
2, 444
2, 44334
2, 441
2, 123332
etc.
I was able to do it with a vim macro, but the data is far too large for that. I was hoping something like awk could do it, but I am not really familiar with it.
Any help would be appreciated.
$ cat input
1, 100 200 3030 400 50023
2, 30 444 44334 441 123332
3, 100 200 3030 400 50023
$ awk '{for(i=2;i<=NF;i++) printf "%s %s\n", $1, $i}' input
1, 100
1, 200
1, 3030
1, 400
1, 50023
2, 30
2, 444
2, 44334
2, 441
2, 123332
3, 100
3, 200
3, 3030
3, 400
3, 50023
awk -F',' '{split($2,a," "); for (i in a) print $1, "," , a[i]}'
Explanation:
awk -F',' -- Set the field separator to ,
'{split($2,a," "); -- Split column 2 using " " (space) as the delimiter and populate array a
for (i in a) print $1, "," , a[i]}' -- Loop over all elements of the array (note: awk does not guarantee the iteration order of for (i in a))
Demo :
renegade#Renegade:~$ cat test.txt
1, 100 200 3030 400 50023
2, 30 444 44334 441 123332
3, 100 200 3030 400 50023
renegade#Renegade:~$ awk -F',' '{split($2,a," "); for (i in a) print $1, "," , a[i]}' test.txt
1 , 100
1 , 200
1 , 3030
1 , 400
1 , 50023
2 , 30
2 , 444
2 , 44334
2 , 441
2 , 123332
3 , 100
3 , 200
3 , 3030
3 , 400
3 , 50023
renegade#Renegade:~$
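For comparison, the same split-and-repeat logic can be sketched in Python. This is only an illustrative equivalent of the awk one-liners above, not part of either answer. The function name expand_rows is made up for the example, and it assumes every non-blank line has the form "key, v1 v2 ...".

```python
def expand_rows(lines):
    """Yield one 'key, value' string per value found after the comma."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Everything before the first comma is the key; the rest holds the values.
        key, _, rest = line.partition(",")
        for value in rest.split():
            yield f"{key.strip()}, {value}"


if __name__ == "__main__":
    sample = [
        "1, 100 200 3030 400 50023",
        "2, 30 444 44334 441 123332",
    ]
    for row in expand_rows(sample):
        print(row)
```

Because it is a generator that handles one line at a time, it can be fed an open file object directly, so a large input does not have to fit in memory.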
I'm using Django 2.2.
I want to generate analytics of the number of records for each day between the start and end date.
The query used is
start_date = '2021-9-1'
end_date = '2021-9-30'
query = Tracking.objects.filter(
scan_time__date__gte=start_date,
scan_time__date__lte=end_date
)
query.annotate(
scanned_date=TruncDate('scan_time')
).order_by(
'scanned_date'
).values('scanned_date').annotate(
**{'total': Count('created')}
)
Which produces output as
[{'scanned_date': datetime.date(2021, 9, 24), 'total': 5}, {'scanned_date': datetime.date(2021, 9, 26), 'total': 3}]
I want to fill the missing dates with 0, so that the output should be
2021-9-1: 0
2021-9-2: 0
...
2021-9-24: 5
2021-9-25: 0
2021-9-26: 3
...
2021-9-30: 0
How can I achieve this using either the ORM or Python (e.g., pandas)?
Use DataFrame.set_index to make scanned_date the index, then DataFrame.reindex with the full date range created by date_range:
import datetime
import pandas as pd
data = [{'scanned_date': datetime.date(2021, 9, 24), 'total': 5},
{'scanned_date': datetime.date(2021, 9, 26), 'total': 3}]
start_date = '2021-9-1'
end_date = '2021-9-30'
r = pd.date_range(start_date, end_date, name='scanned_date')
#if necessary convert to dates from datetimes
#r = pd.date_range(start_date, end_date, name='scanned_date').date
df = pd.DataFrame(data).set_index('scanned_date').reindex(r, fill_value=0).reset_index()
print (df)
scanned_date total
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 0
5 2021-09-06 0
6 2021-09-07 0
7 2021-09-08 0
8 2021-09-09 0
9 2021-09-10 0
10 2021-09-11 0
11 2021-09-12 0
12 2021-09-13 0
13 2021-09-14 0
14 2021-09-15 0
15 2021-09-16 0
16 2021-09-17 0
17 2021-09-18 0
18 2021-09-19 0
19 2021-09-20 0
20 2021-09-21 0
21 2021-09-22 0
22 2021-09-23 0
23 2021-09-24 5
24 2021-09-25 0
25 2021-09-26 3
26 2021-09-27 0
27 2021-09-28 0
28 2021-09-29 0
29 2021-09-30 0
Or use a left join with another DataFrame created from the date range, replacing missing values with 0:
r = pd.date_range(start_date, end_date, name='scanned_date').date
df = pd.DataFrame({'scanned_date':r}).merge(pd.DataFrame(data), how='left', on='scanned_date').fillna(0)
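If pandas is not otherwise needed, the same zero-filling can be sketched with the standard library alone. This is an assumption-laden illustration rather than part of the answer above: the helper name fill_missing_dates is hypothetical, and it expects rows shaped like the queryset output shown in the question (dicts with 'scanned_date' and 'total' keys) plus start and end as datetime.date objects.

```python
import datetime


def fill_missing_dates(rows, start, end):
    """Return {date: total} for every day from start to end inclusive, defaulting to 0."""
    totals = {row["scanned_date"]: row["total"] for row in rows}
    filled = {}
    day = start
    while day <= end:
        filled[day] = totals.get(day, 0)  # 0 for days with no records
        day += datetime.timedelta(days=1)
    return filled


if __name__ == "__main__":
    data = [
        {"scanned_date": datetime.date(2021, 9, 24), "total": 5},
        {"scanned_date": datetime.date(2021, 9, 26), "total": 3},
    ]
    result = fill_missing_dates(data, datetime.date(2021, 9, 1), datetime.date(2021, 9, 30))
    for day, total in result.items():
        print(f"{day}: {total}")
```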
I have a data frame like the following:
pop state value1 value2
0 1.8 Ohio 2000001 2100345
1 1.9 Ohio 2001001 1000524
2 3.9 Nevada 2002100 1000242
3 2.9 Nevada 2001003 1234567
4 2.0 Nevada 2002004 1420000
And I have an ordered dictionary like the following:
OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(1, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
I want to change the data frame as the OrderedDict specifies, where each pair gives the 1-based start and end positions of the digits to extract:
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 0 1 2 1003 45
1 1.9 Ohio 20 1 1 1 5 24
2 3.9 Nevada 20 2 100 1 2 42
3 2.9 Nevada 20 1 3 1 2345 67
4 2.0 Nevada 20 2 4 1 4200 0
I think this is really complex logic for pandas. How can I solve it? Thanks.
First, your OrderedDict uses the same key (1) twice, so the second entry overwrites the first; you need to use different keys.
d= OrderedDict([(1, OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])])),(2, OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))])
Now, for your actual problem, you can iterate through d to get the items, and use the apply function on the DataFrame to get what you need.
for k,v in d.items():
    for k1,v1 in v.items():
        if k == 1:
            df[k1] = df.value1.apply(lambda x : int(str(x)[v1[0]-1:v1[1]]))
        else:
            df[k1] = df.value2.apply(lambda x : int(str(x)[v1[0]-1:v1[1]]))
Now, df is
pop state value1 value2 value1_1 value1_2 value1_3 value2_1 \
0 1.8 Ohio 2000001 2100345 20 0 1 2
1 1.9 Ohio 2001001 1000524 20 1 1 1
2 3.9 Nevada 2002100 1000242 20 2 100 1
3 2.9 Nevada 2001003 1234567 20 1 3 1
4 2.0 Nevada 2002004 1420000 20 2 4 1
value2_2 value2_3
0 1003 45
1 5 24
2 2 42
3 2345 67
4 4200 0
I think this would point you in the right direction.
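The slicing expression inside the lambda can be easier to follow in isolation. The sketch below pulls it out into two small helpers; the names slice_digits and expand are invented for this illustration, and it assumes every [start, end] pair is 1-based, inclusive, and falls within the digits of the value.

```python
def slice_digits(value, start, end):
    """Return the digits of value from 1-based position start to end (inclusive) as an int."""
    # str slicing is 0-based and end-exclusive, hence start - 1 and end unchanged.
    return int(str(value)[start - 1:end])


def expand(value, spec):
    """Map each new column name in spec to its slice of value."""
    return {name: slice_digits(value, lo, hi) for name, (lo, hi) in spec.items()}


if __name__ == "__main__":
    spec = {"value1_1": [1, 2], "value1_2": [3, 4], "value1_3": [5, 7]}
    print(expand(2000001, spec))  # {'value1_1': 20, 'value1_2': 0, 'value1_3': 1}
```

Converting each slice with int drops leading zeros, which matches the output of the loop above ("00" becomes 0 and "001" becomes 1).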
Converting the value1 and value2 columns to string type:
df['value1'], df['value2'] = df['value1'].astype(str), df['value2'].astype(str)
dct_1, dct_2 = (OrderedDict([('value1_1', [1, 2]),('value1_2', [3, 4]),('value1_3',[5,7])]),
                OrderedDict([('value2_1', [1, 1]),('value2_2', [2, 5]),('value2_3',[6,7])]))
Converting Ordered Dictionary to a list of tuples:
dct_1_list, dct_2_list = list(dct_1.items()), list(dct_2.items())
Flattening a list of lists to a single list:
L1, L2 = sum(list(x[1] for x in dct_1_list), []), sum(list(x[1] for x in dct_2_list), [])
Subtracting 1 from the even-indexed elements of each list (the start positions), as string indices start from 0 and not 1:
L1[::2], L2[::2] = np.array(L1[0::2]) - np.array([1]), np.array(L2[0::2]) - np.array([1])
Taking the appropriate slice positions and mapping those values to the newly created columns of the dataframe:
df['value1_1'],df['value1_2'],df['value1_3']= map(df['value1'].str.slice,L1[::2],L1[1::2])
df['value2_1'],df['value2_2'],df['value2_3']= map(df['value2'].str.slice,L2[::2],L2[1::2])
Dropping off unwanted columns:
df.drop(['value1', 'value2'], axis=1, inplace=True)
Final result:
print(df)
pop state value1_1 value1_2 value1_3 value2_1 value2_2 value2_3
0 1.8 Ohio 20 00 001 2 1003 45
1 1.9 Ohio 20 01 001 1 0005 24
2 3.9 Nevada 20 02 100 1 0002 42
3 2.9 Nevada 20 01 003 1 2345 67
4 2.0 Nevada 20 02 004 1 4200 00
I have a sequence of numbers from 1 000 000 to 9 999 999 (total: 9,000,000). I've generated them in Excel and I would like to match them against the following formats.
Last 6 digits in:
1. XXX XXX (For example, 000 000 or 111 111 or 222 222)
2. X00 000 (For example, 100 000 or 200 000 or 300 000)
3. XYY YYY (For example, 122 222 or 233 333 or 411 111)
4. XY0 000 (For example, 230 000 or 750 000 or 120 000)
5. XYZ ZZZ (For example, 231 111 or 232 222 or 233 333)
6. X00 Y00 (For example, 200 300 or 100 400 or 500 600)
7. XXX Y00 (For example, 333 300 or 666 600 or 777 700)
8. XXX YYY (For example, 111 333 or 222 555 or 555 666)
9. XX YY ZZ (For example, 11 22 33 or 22 33 44 or 44 55 66)
10. X0 Y0 Z0 (For example, 10 20 30 or 30 40 50 or 60 70 80)
Would it be possible to do this with regex or VBA in Excel 2013?
Since I don't have much knowledge of Excel, should I seek someone's help to write a simple program for this kind of matching?
You can use VBA, but I believe you will need to set up each classification separately, and also ensure that they are in an order so as to not overlap.
Here is a partial example, showing a few VBA techniques, which you should be able to extend. I only dealt with the rightmost 6 digits and initially constructed a string; and also put each digit into an array element to make the testing formulas simpler.
Option Explicit
Function Classify(N As Long) As String
Dim I As Long
Dim S(1 To 6) As String
Dim sN As String
sN = Format(Right(N, 6), "000000")
For I = 1 To 6
S(I) = Mid(sN, I, 1)
Next I
If sN = String(6, S(1)) Then
Classify = "XXX XXX"
ElseIf S(1) <> 0 And Mid(sN, 2) = 0 Then
Classify = "X00 000"
ElseIf S(1) <> 0 And Mid(sN, 2) Like WorksheetFunction.Rept(S(2), 5) Then
Classify = "XYY YYY"
ElseIf S(1) <> 0 And S(2) <> 0 And S(1) <> S(2) And Mid(sN, 3) = 0 Then
Classify = "XY0 000"
elseif ...
End If
End Function
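The "test in a fixed order, first match wins" idea can also be sketched with regular expressions and backreferences. The Python version below is an illustration only, not a translation of the full VBA function: it covers just the first four formats from the question, the names PATTERNS and classify are made up for the example, and it assumes that X and Y stand for different non-zero digits wherever a format spells out 0 separately.

```python
import re

# Ordered (label, regex) pairs; the first full match wins, so the most
# specific shapes come first. Only four of the ten formats are covered.
PATTERNS = [
    ("XXX XXX", re.compile(r"(\d)\1{5}")),               # six identical digits
    ("X00 000", re.compile(r"[1-9]0{5}")),               # non-zero digit, then five zeros
    ("XYY YYY", re.compile(r"[1-9](\d)\1{4}")),          # non-zero digit, then five identical digits
    ("XY0 000", re.compile(r"([1-9])(?!\1)[1-9]0{4}")),  # two different non-zero digits, then four zeros
]


def classify(n):
    """Return the label of the first format matching the last six digits of n, or ''."""
    last6 = f"{n % 1_000_000:06d}"
    for label, pattern in PATTERNS:
        if pattern.fullmatch(last6):
            return label
    return ""


if __name__ == "__main__":
    for number in (1111111, 1100000, 5122222, 9230000, 1234567):
        print(number, repr(classify(number)))
```

The remaining formats can be added as further entries in the list. Their position matters, because a number such as 100 000 satisfies both "X00 000" and "XYY YYY" and is labelled by whichever comes first.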
I have large data and I want to extract two types of data based on two conditions. I wrote a tcl script to extract the data by using regex (newbie to regex).
I have used the following condition which works fine and produces part of the desired output:
if [regexp {\+ ([0-9.]+) 1 2.*- } $line -> time ] {
I'm using the variable time elsewhere in the script. The above condition produces the following output (this is just a sample, as the file is large):
+ 30.808352 1 2 tcp 40 ------- 30 6.7 2.30 81 2073
+ 30.808416 1 2 tcp 40 ------- 128 8.16 2.159 81 2069
+ 30.809513 1 2 tcp 40 ------- 156 12.19 2.187 1 2077
+ 30.809641 1 2 tcp 80 ------- 156 12.19 2.187 1 2078
+ 30.809878 1 2 tcp 40 ------- 151 7.18 2.182 41 2079
+ 30.813096 1 2 tcp 40 ------- 161 9.20 2.192 0 2083
+ 30.813352 1 2 tcp 40 ------- 157 13.19 2.188 1 2085
+ 30.81348 1 2 tcp 80 ------- 157 13.19 2.188 1 2086
+ 30.815362 1 2 tcp 40 ------- 148 12.18 2.179 41 2088
+ 30.815426 1 2 tcp 40 ------- 148 5.9 2.179 41 2089
+ 30.818096 1 2 tcp 40 ------- 162 10.20 2.193 0 2091
+ 30.818544 1 2 tcp 40 ------- 158 3.78 2.189 1 2093
+ 30.818672 1 2 tcp 80 ------- 158 14.19 2.189 1 2094
+ 30.820657 1 2 tcp 40 ------- 153 9.19 2.184 41 2096
+ 30.821579 1 2 tcp 40 ------- 154 10.19 2.185 41 2097
Then, inside the above if condition, I want to check the 9th column:
//condition 1
if (9th between [3-6].*) ( such as 3.78,6.7, 5.9)
The second condition is :
//condition 2
if (9th between [7-14].*) ( such as 14.19,12.18,10.19, 9.19,.....)
I'm struggling with the two conditions above. I tried the following; I didn't get an error, but nothing matched.
condition 1:
if [regexp {\+ ([0-9.]+) 1 2.*-* ([3-9])\..*/ } $line ] {
I know I'm repeating part of the main if condition, because I don't know how to skip the columns.
condition 2:
if [regexp {\+ ([0-9.]+) 1 2.*-* ([7-9]|1[0-4])\..*/} $line ] {
Any suggestions !!!
Why don't you split on space? You can achieve pretty much the same outcome using a few more lines. It will be more readable, and people will understand the code better:
if [regexp {\+ ([0-9.]+) 1 2.*- } $line -> time] {
set elements [split $line " "] ;# You can actually omit the " " in this case
set 9th [lindex $elements 8]
# Condition 1
if {$9th >= 3 && $9th < 7} { do something }
# Condition 2
if {$9th >= 7 && $9th < 15} { do something }
}
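The same split-then-compare approach, sketched in Python for readers more familiar with it than with Tcl. The function name classify_line and the band labels are invented for this example. It assumes fields are whitespace-separated and that the 9th field looks like "number.number", and it compares only the part before the dot, as the question's [3-6] and [7-14] ranges suggest.

```python
import re

# Same leading pattern as the Tcl regexp: "+", a time, "1 2", then the flags run.
HEADER = re.compile(r"\+ ([0-9.]+) 1 2.*- ")


def classify_line(line):
    """Return (time, band) for a matching line, or None if the line does not match."""
    match = HEADER.search(line)
    if match is None:
        return None
    fields = line.split()
    if len(fields) < 9:
        return None
    # Integer part of the 9th column, e.g. "6.7" -> 6 and "14.19" -> 14.
    node = int(fields[8].split(".")[0])
    if 3 <= node <= 6:
        band = "3-6"
    elif 7 <= node <= 14:
        band = "7-14"
    else:
        band = None
    return match.group(1), band


if __name__ == "__main__":
    print(classify_line("+ 30.808352 1 2 tcp 40 ------- 30 6.7 2.30 81 2073"))
    print(classify_line("+ 30.818672 1 2 tcp 80 ------- 158 14.19 2.189 1 2094"))
```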
To match 7-14: \+ ([0-9.]+) 1 2.*- \d+\s(?:[7-9]|1[0-4])
To match 3-6: \+ ([0-9.]+) 1 2.*- \d+\s[3-6]
I have a csv file that shows parts on order. The columns include days late, qty and commodity.
I need to group the data by days late and commodity with a sum of the qty. However the days late needs to be grouped into ranges.
>56
>35 and <= 56
>14 and <= 35
>0 and <=14
I was hoping I could use a dict somehow. Something like this:
{'Red':'>56','Amber':'>35 and <= 56','Yellow':'>14 and <= 35','White':'>0 and <=14'}
I am looking for a result like this
Red Amber Yellow White
STRSUB 56 60 74 40
BOTDWG 20 67 87 34
I am new to pandas, so I don't know if this is possible at all. Could anyone provide some advice?
Thanks
Suppose you start with this data:
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
# Days Late ID quantity
# 0 60 STRSUB 56
# 1 60 BOTDWG 20
# 2 50 STRSUB 60
# 3 50 BOTDWG 67
# 4 20 STRSUB 74
# 5 20 BOTDWG 87
# 6 10 STRSUB 40
# 7 10 BOTDWG 34
Then you can find the status category using pd.cut. Note that by default, pd.cut splits the Series df['Days Late'] into categories which are half-open intervals, (-1, 14], (14, 35], (35, 56], (56, 365]:
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
print(df)
# ID quantity status
# 0 STRSUB 56 Red
# 1 BOTDWG 20 Red
# 2 STRSUB 60 Amber
# 3 BOTDWG 67 Amber
# 4 STRSUB 74 Yellow
# 5 BOTDWG 87 Yellow
# 6 STRSUB 40 White
# 7 BOTDWG 34 White
Now use pivot to get the DataFrame in the desired form:
df = df.pivot(index='ID', columns='status', values='quantity')
and use reindex to obtain the desired order for the rows and columns:
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
Thus,
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
df = df.pivot(index='ID', columns='status', values='quantity')
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
print(df)
yields
Red Amber Yellow White
ID
STRSUB 56 60 74 40
BOTDWG 20 67 87 34
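To make the binning rule concrete without pandas, here is a plain-Python sketch that assigns each row to a right-closed interval with bisect and sums the quantities per ID and status. It is not the pd.cut implementation; the names EDGES, LABELS, status and summarize are hypothetical, and it assumes Days Late is always greater than 0, so anything up to 14 counts as White.

```python
from bisect import bisect_left
from collections import defaultdict

EDGES = [14, 35, 56]                          # upper bounds of the first three bands
LABELS = ["White", "Yellow", "Amber", "Red"]  # one more label than edges


def status(days_late):
    """Return the band for days_late, using right-closed intervals such as (14, 35]."""
    # bisect_left counts the edges strictly below days_late, so 14 -> 0 and 15 -> 1.
    return LABELS[bisect_left(EDGES, days_late)]


def summarize(rows):
    """Sum quantity per ID and status; rows are (id, days_late, quantity) tuples."""
    totals = defaultdict(lambda: defaultdict(int))
    for part_id, days_late, quantity in rows:
        totals[part_id][status(days_late)] += quantity
    return {part_id: dict(bands) for part_id, bands in totals.items()}


if __name__ == "__main__":
    rows = [
        ("STRSUB", 60, 56), ("BOTDWG", 60, 20),
        ("STRSUB", 50, 60), ("BOTDWG", 50, 67),
        ("STRSUB", 20, 74), ("BOTDWG", 20, 87),
        ("STRSUB", 10, 40), ("BOTDWG", 10, 34),
    ]
    for part_id, bands in summarize(rows).items():
        print(part_id, bands)
```

Unlike pivot, this sums quantities when several rows fall into the same ID and status, which is what the question asks for.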
You can create a column in your DataFrame based on your Days Late column by using the map or apply functions as follows. Let's first create some sample data.
import numpy
import pandas
df = pandas.DataFrame({ 'ID': 'foo,bar,foo,bar,foo,bar,foo,foo'.split(','),
'Days Late': numpy.random.randn(8)*20+30})
Days Late ID
0 30.746244 foo
1 16.234267 bar
2 14.771567 foo
3 33.211626 bar
4 3.497118 foo
5 52.482879 bar
6 11.695231 foo
7 47.350269 foo
Create a helper function to transform the data of the Days Late column and add a column called Code.
def days_late_xform(dl):
    if dl > 56: return 'Red'
    elif 35 < dl <= 56: return 'Amber'
    elif 14 < dl <= 35: return 'Yellow'
    elif 0 < dl <= 14: return 'White'
    else: return 'None'
df["Code"] = df['Days Late'].map(days_late_xform)
Days Late ID Code
0 30.746244 foo Yellow
1 16.234267 bar Yellow
2 14.771567 foo Yellow
3 33.211626 bar Yellow
4 3.497118 foo White
5 52.482879 bar Amber
6 11.695231 foo White
7 47.350269 foo Amber
Lastly, you can use groupby to aggregate by the ID and Code columns, and get the counts of the groups as follows:
g = df.groupby(["ID","Code"]).size()
print(g)
ID Code
bar Amber 1
Yellow 2
foo Amber 1
White 2
Yellow 2
df2 = g.unstack()
print(df2)
Code Amber White Yellow
ID
bar 1 NaN 2
foo 1 2 2
I know this is coming a bit late, but I had the same problem as you and wanted to share the function np.digitize. It sounds like exactly what you want.
import numpy as np
a = np.random.randint(0, 100, 50)
grps = np.arange(0, 100, 10)
grps2 = [1, 20, 25, 40]
print(a)
[35 76 83 62 57 50 24 0 14 40 21 3 45 30 79 32 29 80 90 38 2 77 50 73 51
71 29 53 76 16 93 46 14 32 44 77 24 95 48 23 26 49 32 15 2 33 17 88 26 17]
print(np.digitize(a, grps))
[ 4 8 9 7 6 6 3 1 2 5 3 1 5 4 8 4 3 9 10 4 1 8 6 8 6
8 3 6 8 2 10 5 2 4 5 8 3 10 5 3 3 5 4 2 1 4 2 9 3 2]
print(np.digitize(a, grps2))
[3 4 4 4 4 4 2 0 1 4 2 1 4 3 4 3 3 4 4 3 1 4 4 4 4 4 3 4 4 1 4 4 1 3 4 4 2
4 4 2 3 4 3 1 1 3 1 4 3 1]
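To see what the returned indices mean, the default behaviour for increasing bins can be mimicked with the standard library: each value x gets the index i such that bins[i-1] <= x < bins[i]. The helper name digitize below is a stand-in written for this illustration, not NumPy's implementation.

```python
from bisect import bisect_right


def digitize(values, bins):
    """Mimic np.digitize(values, bins) for increasing bins: bins[i-1] <= x < bins[i]."""
    return [bisect_right(bins, x) for x in values]


if __name__ == "__main__":
    grps2 = [1, 20, 25, 40]
    print(digitize([35, 76, 24, 0, 14, 40], grps2))  # [3, 4, 2, 0, 1, 4]
```

Note that these intervals are closed on the left, so a value equal to a bin edge lands in the higher bin. For ranges such as ">14 and <= 35", np.digitize accepts right=True, which closes the intervals on the right instead.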