PowerShell: complicated regex with multiple groups

I need some help with my regex.
My code looks like this (I haven't gotten very far):
$source_file = "\\server\minified.txt"
$sf_content = gc $source_file -raw
$sections = $sf_content | select-string -AllMatches '(?smi)(^\s+\d+:\d+\s+AM\s+\w+\s+ACCOUNT ACTIVITY\s-\s)(\w+\s+\w+$)(.+?(Start Account\s\d+)(.+?Elapsed))'
$sections
The file looks like this:
I was able to get the first and last name using my regex from the "ACCOUNT ACTIVITY - PERSON'S NAMEHERE" string circled in red at the top of the image shown above.
My end goal is to regex each blue box as a match, capturing everything from the date at the top left down to the "1 Accounts worked per hour" line. Then I want to get the info from the 2nd red circle: the start time at the beginning of that line, and then the last instance of the same line "Start account 54321234", so that I can subtract the first time from the last time.
So for each blue box, get the info from the red circles. For each red circle containing "Start account" take the blue circle minus the green circle.
I would like to try this using regex groups. If I can't figure that out I'd like to put each of my blue box regex into an array and for each item in the array I can further do regex to get what I want.
My code is not complete. But I'm not sure how to do the regex so I'll keep updating this as I update the script and do my own research.
If anybody has pointers I'd appreciate it.
Here is the source content in text form:
05/07/20 Acme, Inc. PAGE 1
9:48 AM ABC ACCOUNT ACTIVITY - Bart Simpson
The time ELAPSED since the previous line is printed as HOURS:MINUTES:SECONDS.
DATE TIME ELAPSED ACTION
04/16/20 8:06:50 0:00 Enter Account Screen
-------------------------------------------------------------------------------
8:06:53 0:03 Start account 12345678 ROSS, BOB N
8:07:24 0:31 Finished account in 31 seconds
-------------------------------------------------------------------------------
8:07:26 0:02 Start account 54321234 DOE, JOHN
8:07:27 0:01 Finished account in 1 seconds
-------------------------------------------------------------------------------
8:07:28 0:02 Start account 54321234 DOE, JOHN
8:10:26 0:01 Finished account in 1 seconds
-------------------------------------------------------------------------------
05/06/20 4:55:49 5:08 Leave Account Screen 9:33 Elapsed
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
05/06/20 4:55:55 0:06 Leave Account Screen
-------------------------------------------------------------------------------
DAILY TOTALS
5:33:46 - Time on Account screen for the day.
3 Calls 1 Calls per hour
3 Contacts 1 Contacts per hour
3 Accounts worked 1 Accounts worked per hour
05/07/20 Acme, Inc. PAGE 1
9:48 AM ABC ACCOUNT ACTIVITY - Lisa Simpson
The time ELAPSED since the previous line is printed as HOURS:MINUTES:SECONDS.
DATE TIME ELAPSED ACTION
04/16/20 8:06:50 0:00 Enter Account Screen
-------------------------------------------------------------------------------
8:06:53 0:03 Start account 6543212 DOE, JANE
8:07:24 0:31 Finished account in 31 seconds
-------------------------------------------------------------------------------
8:07:26 0:02 Start account 88888888 DEER, JOHN
8:07:27 1:01 Finished account in 1 seconds
-------------------------------------------------------------------------------
05/06/20 4:55:49 5:08 Leave Account Screen 10:33 Elapsed
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
05/06/20 4:55:55 0:06 Leave Account Screen
-------------------------------------------------------------------------------
DAILY TOTALS
5:33:46 - Time on Account screen for the day.
3 Calls 1 Calls per hour
3 Contacts 1 Contacts per hour
3 Accounts worked 1 Accounts worked per hour

You're going to struggle with a single regex. It seems to balk at repeating the second capturing group. I tried for a while, adding labels for your pertinent matches, and I was only picking up the first set of matches using this regex. Anyone who is a "regex king", please look away.
(?smi)(^\s+\d+:\d+\s+(AM|PM)\s+\w+\s+ACCOUNT ACTIVITY\s-\s)(?<name>\w+\s+\w+$)(.+?(?<begin>\d+:\d+:\d+)(\s+\d:\d+\s+)(?<acctnumber>Start account\s\d+)(\s+)(?<account>\w+,\s\w+(\s[A-Za-z]|))\s+(?<end>.+?\d:\d+))
You can provide a template to pick up all fields of potential interest and use ConvertFrom-String. The key is to label all the items you want uniquely in braces. You then have to mark the first item in the template with a star, so using your example from above, you'd have something like this.
$template = @"
05/07/20 Acme, Inc. PAGE 1
9:48 AM ABC ACCOUNT ACTIVITY - {customer*:Bart Simpson}
The time ELAPSED since the previous line is printed as HOURS:MINUTES:SECONDS.
DATE TIME ELAPSED ACTION
04/16/20 8:06:50 0:00 Enter Account Screen
-------------------------------------------------------------------------------
{begin1:8:06:53} 0:03 {accNum1:Start account 12345678} {name1:ROSS, BOB N}
{end1:8:07:24} 0:31 Finished account in 31 seconds
-------------------------------------------------------------------------------
{begin2:8:07:26} 0:02 {accNum2:Start account 54321234} {name2:DOE, JOHN}
{end2:8:07:27} 0:01 Finished account in 1 seconds
-------------------------------------------------------------------------------
{begin3:8:07:28} 0:02 {accNum3:Start account 54321234} {name3:DOE, JOHN}
{end3:8:10:26} 0:01 Finished account in 1 seconds
-------------------------------------------------------------------------------
05/06/20 4:55:49 5:08 Leave Account Screen 9:33 Elapsed
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
05/06/20 4:55:55 0:06 Leave Account Screen
-------------------------------------------------------------------------------
DAILY TOTALS
5:33:46 - Time on Account screen for the day.
3 Calls 1 Calls per hour
3 Contacts 1 Contacts per hour
3 Accounts worked 1 Accounts worked per hour
05/07/20 Acme, Inc. PAGE 1
9:48 AM ABC ACCOUNT ACTIVITY - {customer*:Lisa Simpson}
The time ELAPSED since the previous line is printed as HOURS:MINUTES:SECONDS.
DATE TIME ELAPSED ACTION
04/16/20 8:06:50 0:00 Enter Account Screen
-------------------------------------------------------------------------------
{begin1:8:06:53} 0:03 {accNum1:Start account 6543212} {name1:DOE, JANE}
{end1:8:07:24} 0:31 Finished account in 31 seconds
-------------------------------------------------------------------------------
{begin2:8:07:26} 0:02 {accNum2:Start account 88888888} {name2:DEER, JOHN}
{end2:8:07:27} 1:01 Finished account in 1 seconds
-------------------------------------------------------------------------------
{begin3:\s} 0:02 {accNum3:\s} {name3:\s}
{end3:\s} 1:01 Finished account in 1 seconds
-------------------------------------------------------------------------------
05/06/20 4:55:49 5:08 Leave Account Screen 10:33 Elapsed
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
05/06/20 4:55:55 0:06 Leave Account Screen
-------------------------------------------------------------------------------
DAILY TOTALS
5:33:46 - Time on Account screen for the day.
3 Calls 1 Calls per hour
3 Contacts 1 Contacts per hour
3 Accounts worked 1 Accounts worked per hour
"@
In the second block of the template I have added a third set with a regex space (\s) in each field, so that ConvertFrom-String doesn't duplicate the second set of data into set three.
You can then pipe your full input through the ConvertFrom-String cmdlet, using the -TemplateContent parameter to apply your template, and you should get the data out the other side.
$data = Get-Content '\\server\minified.txt' -Raw
$data | ConvertFrom-String -TemplateContent $template
customer : Bart Simpson
begin1 : 8:06:53
accNum1 : Start account 12345678
name1 : ROSS, BOB N
end1 : 8:07:24
begin2 : 8:07:26
accNum2 : Start account 54321234
name2 : DOE, JOHN
end2 : 8:07:27
begin3 : 8:07:28
accNum3 : Start account 54321234
name3 : DOE, JOHN
end3 : 8:10:26
customer : Lisa Simpson
begin1 : 8:06:53
accNum1 : Start account 6543212
name1 : DOE, JANE
end1 : 8:07:24
begin2 : 8:07:26
accNum2 : Start account 88888888
name2 : DEER, JOHN
end2 : 8:07:27
You can then compare your data, looping through the output objects.
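For the specific "last time minus first time" calculation, one rough sketch (the pattern and property names here are my own invention, assuming the minified.txt layout shown above) is to skip the template, match the "Start account" lines directly, and group by account number:

```powershell
# Sketch: find every "Start account" line, group by account number,
# and report last start time minus first start time when an account repeats.
$raw = Get-Content '\\server\minified.txt' -Raw
$starts = [regex]::Matches($raw, '(?m)^\s*(?<time>\d+:\d+:\d+)\s+\d+:\d+\s+Start account\s+(?<acct>\d+)')
$starts | Group-Object { $_.Groups['acct'].Value } |
    Where-Object Count -gt 1 |
    ForEach-Object {
        $first = [datetime]::ParseExact($_.Group[0].Groups['time'].Value, 'H:mm:ss', $null)
        $last  = [datetime]::ParseExact($_.Group[-1].Groups['time'].Value, 'H:mm:ss', $null)
        [pscustomobject]@{ Account = $_.Name; Elapsed = $last - $first }
    }
```

For the sample above this would report account 54321234 with an elapsed time of two seconds (8:07:28 minus 8:07:26).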

Related

Need a regex that captures date & amount fields & ignore blank lines & other miscellaneous data

Goal:
Capture only the purchase date, amount, and purchased item name(s).
Ignore all blank lines
Ignore Reference # & SHIPPING AND TAX string
Then, repeat this on the next grouping of purchases.
I am using Google Sheets for this project.
Sample data showing 3 purchases (ie blocks of data)
Note: spacing & SHIPPING AND TAX string varies inbetween
01/12 P934200QXEHMBPNAD Acme MARKETPLACE SEATTLE WA $34.96 Date & Amount
Ignore (blank line)
435852496957 Ignore
BOSCH CO2135 1/8 In. x 2-3/4 I Purchased item name
BOSCH CO2131 1/16 In. x 1-7/8 Purchased item name
IZOD Men's Memory Foam Slipper Purchased item name
SHIPPING AND TAX Ignore
Ignore (blank line)
01/12 P934200QXEHMB6MQ0 Acme MARKETPLACE SEATTLE WA $48.91
492577232349
LxTek Remanufactured Ink Cartr
SHEENGO Remanufactured Ink Car
02/02 P934200AEHMB7E12 Acme MARKETPLACE SEATTLE WA $21.60
659473773469
LHXEQR Tubing Adapter for ResM
SHIPPING AND TAX
My updated attempt
=index(if(len(C26:C33),REGEXREPLACE(C26:C33,"(?Ums)(\d{2}/\d{2}) .* (\$\d{1,}\.\d{1,2}).(?:^\s+\d+$)(.*)(?:\s+SHIPPING AND TAX)","$1,$2,$3"),))
Unsuccessful results
01/12 P934200QXEHMBPNAD Acme MARKETPLACE SEATTLE WA $34.96
#VALUE!
BOSCH CO2135 1/8 In. x 2-3/4 I
BOSCH CO2131 1/16 In. x 1-7/8
IZOD Men's Memory Foam Slipper
SHIPPING AND TAX
01/12 P934200QXEHMB6MQ0 Acme MARKETPLACE SEATTLE WA $48.91
#VALUE!
LxTek Remanufactured Ink Cartr
SHEENGO Remanufactured Ink Car
02/02 P934200DJEHMB7E12 AMAZON MARKETPLACE SEATTLE WA $21.60
#VALUE!
LHXEQR Tubing Adapter for ResM
SHIPPING AND TAX
Unsuccessful
- it did not ignore:
data btwn date & amount
blank lines
SHIPPING AND TAX
- Value issue - not handling Reference # well
Function REGEXREPLACE parameter 1 expects text values. But '435848996957' is a number and cannot be coerced to a text.
This matches one "block" (note: turn DOTALL flag on):
(\d\d\/\d\d).*(\$[\d.]+).(?:^\s+\d+$)(.*)SHIPPING AND TAX\n\n
Capturing:
the date as group 1
the amount as group 2
the product lines as group 3
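As for the #VALUE! error: REGEXREPLACE expects text, and the reference-number cells are numeric, so Sheets refuses to coerce them. A hedged sketch, reusing the attempt above but wrapping the range in TO_TEXT first so every cell arrives as text:

```
=INDEX(IF(LEN(C26:C33),REGEXREPLACE(TO_TEXT(C26:C33),"(?Ums)(\d{2}/\d{2}) .* (\$\d{1,}\.\d{1,2}).(?:^\s+\d+$)(.*)(?:\s+SHIPPING AND TAX)","$1,$2,$3"),))
```

TO_TEXT is a standard Sheets function; the rest of the formula is unchanged from the original attempt, so any remaining mismatch would be in the regex itself.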

Calculate the total of one measure (with a upper limit) in another measure

I need to calculate the total value of a column per employee per month, then impose a limit of 177 per employee per month. This goes into a matrix with employees as rows and months as columns. Lastly, I want to add up all the amounts per month to show the total in a line chart.
I made a measure to calculate the 1% with a maximum of 177: IF(0.01*SUM([Amount]) > 177, 177, 0.01*SUM([Amount])). Then I used this measure in my matrix as explained above. This worked fine, but when I make the line chart the limit of 177 is still imposed, because I use the same measure.
I tested it with some dummy data. Here is one way to do it:
Employee Month Amount
Jack January 1500
Joe February 20000
Joe March 1600
Jack April 1800
Brad June 10000
Jack July 9500
Joe February 9500
Brad April 6500
Jack December 12000
Joe June 8000
Brad April 9500
Jack January 1000
Jack April 1100
Jack April 8000
Joe February 12000
Joe February 12500
Joe February 13000
Brad June 15000
Brad June 16000
Here is the measure (DAX code) you need to use:
your_measure =
IF ( 0.01 * SUM ( your_table[Amount] ) > 177, 177, 0.01 * SUM ( your_table[Amount] ) )
Then let's put it on a matrix and a line chart.
If you want your 177 restriction not to be applied in the line chart, why not create another simple total measure:
total_measure = 0.01 * SUM ( your_table[Amount] )
Update requested by Peter:
Now you need to look at the whole picture. Employee is not part of the filter context; the model is filtered only by month. I added both measures as legends to the line chart.
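If, instead, the line chart should show the sum of the capped per-employee amounts (cap applied per employee, then totaled per month), a sketch using SUMX to iterate employees might look like this; the table and column names are assumed from the example above:

```dax
capped_total =
SUMX (
    VALUES ( your_table[Employee] ),
    -- Context transition via CALCULATE restricts the SUM to the current employee
    VAR pct = 0.01 * CALCULATE ( SUM ( your_table[Amount] ) )
    RETURN IF ( pct > 177, 177, pct )
)
```

At a month total, this evaluates the cap once per employee and adds the results, rather than applying the 177 cap to the grand total.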

Power BI Matrix Visual Showing Row of Blank Values Even Though Source Data Does Not Have Blanks

I have two tables one with data about franchise locations (Franchise Profile Info) and one with Award data. Each franchise location is given a certain number of awards they are allowed to give out per year. Each franchise location rolls up to a larger group depending on where in the country they are located. These tables are in a 1 to 1 relationship using Franchise ID. I am trying to create a matrix with the number of awards, total utilized, and percentage utilized rolled up to group with the ability to expand the groups and see individual locations. For some reason when I add the value fields a blank row is created. There are not any blank rows in either of the original tables so I'm not sure where this is coming from.
Franchise Profile Info table
ID    Franchise Name   Group   Street Address    City           State
164   Park's           West    12 Park Dr.       Los Angeles    CA
365   A & J            East    243 Whiteoak Rd   Stafford       VA
271   Otto's           South   89 Main St.       St. Augustine  FL
Award table
ID    Year   TotalAwards   Utilized
164   2022   16            12
365   2022   5             5
271   2022   22            17
These tables are in a relationship with a 1-to-1 match on ID.
What I want the matrix to look like:
Group   Total Awards   Utilized   %Awards Utilized
East    5              5          100%
West    16             12         75%
South   22             17         77%
Instead, what I'm getting is this:
Group   Total Awards   Utilized   %Awards Utilized
East    5              5          100%
West    16             12         75%
South   22             17         77%
        0              0          0%
I can't for the life of me figure out where this row is coming from. I can add in the Group and Franchise name as rows but as soon as I add any of the value columns this blank row shows up.
You have a value on the many side of the relationship that does not exist on the one side. You can read a full explanation here: https://www.sqlbi.com/articles/blank-row-in-dax/
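A quick way to locate the offending rows on the many side is a calculated table using EXCEPT; the table and column names below are taken from the question:

```dax
Orphan IDs =
EXCEPT (
    DISTINCT ( 'Award'[ID] ),                  -- IDs present in the fact table
    DISTINCT ( 'Franchise Profile Info'[ID] )  -- IDs present in the lookup table
)
```

Any ID this returns exists in the Award table but not in Franchise Profile Info, and is what the blank row in the matrix is aggregating.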

How to get a percentage of grouped values over a period of time (at the hour scale) in DAX?

I have a dataset containing the duration (in minutes) of occupancy events over a period of 1 hour in my rooms:
# room date duration
--- ---- ------------------- --------
0 A1 2022-01-01 08:00:00 30
1 A1 2022-01-01 10:00:00 5
2 A1 2022-01-01 16:00:00 30
3 A1 2022-01-02 10:00:00 60
4 A1 2022-01-02 16:00:00 60
...
My date column is linked to a date table in which I have:
# datetime year month monthName day dayOfWeek dayName hour
--- ------------------- ---- ----- --------- --- --------- -------- ----
...
k 2022-01-01 08:00:00 2022 1 January 1 5 Saturday 8
k+1 2022-01-01 09:00:00 2022 1 January 1 5 Saturday 9
...
n 2022-03-01 22:00:00 2022 3 March 1 1 Tuesday 22
I am trying to retrieve the following percentage, duration/timeperiod, through a measure. The idea behind using a measure is:
Being able to use a time slicer and see my percentage being updated
Using, for example, a bar chart with my date hierarchy, and being able to see a percentage in my different level of hierarchy (datetime -> year -> month -> dayOfWeek -> hour)
Attempt
My idea was to create a first measure that would return the number of minutes between the first and the last date currently chosen. Here is what I came up with:
Diff minutes = DATEDIFF(
FIRSTDATE( 'date'[date] ),
LASTDATE( 'date'[date] ),
MINUTE
)
The idea was then to create a second measure that would divide the SUM of the durations by the Diff minutes' measure:
My rate = DIVIDE(
SUM( 'table'[duration] ),
[Diff minutes]
)
I currently face a few issues:
The slicer is set to (2022-01-02 --> 2022-01-03) and if I check in a matrix, I have datetime between 2022-01-02 0:00:00 and 2022-01-03 23:00:00, but my measure returns 1440 which is the number of minutes in a day but not in my selected time period
The percentage is also wrong unfortunately. Let's take the example that I highlighted in the capture. There are 2 values for the 10h slot, 5min and 60min. But the percentage shows 4.51% instead of 54.2%. It actually is the result of 65/1440, 1440 being the total of minutes for my whole time period, not my 10h slot.
Examples
1- Let's say I have a slicer over a period of 2 days (2022-01-01 --> 2022-01-02) and my dataset is the one provided before:
I would have a total duration of 185 minutes (30+5+30+60+60)
My time period would be 2 days = 48h = 2880 minutes
The displayed ratio would be: 6.4% (185/2880)
2- With the same slicer, a matrix with hours and percentage would give me:
hour rate
---- -----
0 0.0%
1 0.0%
...
8 25.0% <--- 30 minutes on the 1st of January and 0 minutes on the 2nd
9 0.0% <--- (0+0)/120
10 54.2% <--- (5+60)/120
...
16 75.0% <--- (30+60)/120
Constraints
The example I provided only has 1 room. In practice, there are n rooms and I would like my measure to return the percentage as the mean of all my rooms.
Would it be possible? Have I chosen the right method?
The DateDiff function you have created should work, I have tested it on a report and when I select some dates, it gives me the difference between the first and last selected dates.
Make sure your slicer is interacting with the measure.
In the meantime, I think I found a simpler and easier way to do it.
First, I added a new column to my date table, that seems dubious but is actually helpful:
minutes = 60
This allows me to get rid of the DATEDIFF function. My rate measure now looks like this:
My rate = DIVIDE(
SUM( table[duration] ),
[Number of minutes],
0
)
Here, I use the measure Number of minutes which is simply a SUM of the values in the minutes column. In order to provide accurate results when I have multiple rooms selected, I multiplied the number of minutes by the number of rooms:
Number of minutes = COUNTROWS( rooms ) * SUM( 'date'[minutes] )
This now works perfectly with my date hierarchy!

Return Calculations Incorrect in Panel Data

I'm currently working with panel data in Stata, and run the following commands to define the panel:
encode ticker, generate(ticker_n)
xtset ticker_n time
Here ticker is a string (the ticker of a company listed on a stock exchange), and time is an integer going from 930 (market open) to 1559 (market close). Thus, time indicates the minute of the trading day. For each minute the stock market is open we have the close prices of all tickers listed on the exchange. A sample of the data looks like this:
date time open high low close volume ticker ticker_n
09/15/2008 930 33.31 33.31 33.31 33.31 2135 zeus zeus
09/15/2008 931 32.94 32.94 32.94 32.94 100 zeus zeus
09/15/2008 930 10.21 10.21 10.21 10.21 4270 bx bx
09/15/2008 931 10.46 10.5 10.42 10.44 5700 bx bx
Then, in an attempt to calculate returns (using the close price) I run the following command:
gen return = (close - l.close) / l.close
However, this leads to a weird issue where at every whole hour (time = 1100, 1200, 1300, etc.) the returns are not calculated at all and Stata just reports a missing value for the return.
Now I assume something went wrong in defining the panel data, such that Stata does not recognize that the observation before 1500 should be 1459 (I assume it looks for 1499?).
Hence, my question is, how do I correctly define my panel data such that Stata recognizes that my time axis is in minutes? I did not find anything in the official Stata documentation that helped me out here.
Indeed: your time variable is messing you up mightily. If time is going from 1059 to 1100, or from 1159 to 1200, each of those is a jump of 41 to Stata. The value for the time previous to 1100 would have been at time 1099, which won't be in your data; hence previous values for 1100, etc., will all be missing. There is no sense whatsoever in which Stata will look at 1100 and say "Oh! that's a time and so the previous time would have been 1059 and I should use the value for 1059". Applying a time display format wouldn't change that failure to see the times as you understand them.
You don't explain how daily dates are supposed to enter your analysis. Here is some technique for times in hours and minutes alone.
clear
input time
930
931
959
1000
1001
1059
1100
end
gen double mytime = dhms(0, floor(time/100), mod(time, 100), 0)
format mytime %tcHH:MM
gen id = 1
xtset id mytime, delta(60000)
list mytime L.mytime, sep(0)
+-----------------+
| L.|
| mytime mytime |
|-----------------|
1. | 09:30 . |
2. | 09:31 09:30 |
3. | 09:59 . |
4. | 10:00 09:59 |
5. | 10:01 10:00 |
6. | 10:59 . |
7. | 11:00 10:59 |
+-----------------+
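Applied to the original panel, a sketch along the same lines (variable names taken from the question; untested against the real data) would fold the daily date into the clock variable and re-declare the panel with a one-minute delta:

```stata
* Build a true datetime in milliseconds: calendar date plus hours and
* minutes extracted from the integer time variable.
encode ticker, generate(ticker_n)
gen double mytime = dhms(date(date, "MDY"), floor(time/100), mod(time, 100), 0)
format mytime %tcNN/DD/CCYY_HH:MM

* delta(60000) tells Stata that consecutive observations are 60,000 ms
* (one minute) apart, so 10:59 is the lag of 11:00.
xtset ticker_n mytime, delta(60000)
gen return = (close - L.close) / L.close
```

Note that the lag across the overnight gap is still missing (the gap is far more than one minute), which is usually what you want for intraday returns.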