Conditional Regex for multiple matches in a line - regex

I've got a regex that is responsible for matching the pattern A:B in lines where there might be multiple matches (e.g. "A:B A: B A : B A:B", etc.). The problem lies in the complexity of what A represents.
I'm using the regex:
\b[\w|\(|\)+]+\s*:(?:(?![\w+]+\s*:).)*
to match items in:
Data_1: Tutor Elementary: 10 a F Test: 7.87 sips
Turning 1 Data (A Run), Data: 0.0 10.0 10.0 17.3 0.0
Turning 2 Data (A Run), Data2: 0.0 6.8 0.0 6.8 6.8
Data_1: Tutor Pool: Data2: A B C
Turning 2 (A Run), ABSOLUTE: 368 337 428 0 2 147
Data_4 : 4AZE Localization : 33.14 lat -86 long
Time: 0.75 Data Scenario: 3121.2
The question is this: if you examine this setup (I use https://regex101.com/), lines 2, 3, and 5 don't return exactly what I'm looking for. Where the match is the first one in the line, I want it to grab everything from the beginning of the line up to the first ':'. Is this type of conditional regex possible? I've tried every approach I could imagine, but I haven't been successful yet.
Thanks in advance!

A little complex, but try this here:
^(.*?:.*?)(\b\w+\b\s*:.*?)\b\w+\b:.*$|^(.*?:.*?)\b\w+\b\s*:(.*?)$|^(.*)$
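If you want to experiment with this outside regex101, here is a minimal Python sketch (the sample line is taken from the question; only the groups of whichever alternative matched will be populated):

import re

# The suggested pattern, applied one line at a time.
pattern = re.compile(r'^(.*?:.*?)(\b\w+\b\s*:.*?)\b\w+\b:.*$|^(.*?:.*?)\b\w+\b\s*:(.*?)$|^(.*)$')

line = "Data_1: Tutor Elementary: 10 a F Test: 7.87 sips"
m = pattern.match(line)
if m:
    # Print only the capture groups that this alternative actually filled.
    print([g for g in m.groups() if g is not None])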

Related

Not able to match the regex

I need to write a regex to fetch the details from the following data:
Type Time(s) Ops TPS(ops/s) Net(M/s) Get_miss Min(us) Max(us) Avg(us) Std_dev Geo_dist
Period 5 145443 29088 22.4 37006 352 116302 6600 7692.04 4003.72
Global 10 281537 28153 23.2 41800 281 120023 6797 7564.64 4212.93
The above is the log which I get from a log file.
I have tried writing the regex to get the details in table format, but could not get it to work.
Below is the regex which I tried.
Type[\s+\S+].+\n(?<time>[\d+\S+\s+]+)[\s+\S+].*Period
When it comes to the Period keyword, the regex fails.
If for some reason RichG's suggestion of using multikv doesn't work, the following should:
| rex field=_raw "(?<type>\w+)\s+(?<time>[\d\.]+)\s+(?<ops>[\d\.]+)\s+(?<tps>[\d\.]+)\s+(?<net>[\d\.]+)\s+(?<get_miss>[\d\.]+)\s+(?<min>[\d\.]+)\s+(?<max>[\d\.]+)\s+(?<avg>[\d\.]+)\s+(?<std_dev>[\d\.]+)\s+(?<geo_dist>[\d\.]+)"
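If you want to sanity-check the pattern outside Splunk, a rough Python equivalent might look like this (note that Python spells named groups (?P<name>...) rather than (?<name>...)):

import re

# Same field layout as the rex above, with Python-style named groups.
pattern = re.compile(
    r"(?P<type>\w+)\s+(?P<time>[\d\.]+)\s+(?P<ops>[\d\.]+)\s+(?P<tps>[\d\.]+)\s+"
    r"(?P<net>[\d\.]+)\s+(?P<get_miss>[\d\.]+)\s+(?P<min>[\d\.]+)\s+(?P<max>[\d\.]+)\s+"
    r"(?P<avg>[\d\.]+)\s+(?P<std_dev>[\d\.]+)\s+(?P<geo_dist>[\d\.]+)"
)

line = "Period 5 145443 29088 22.4 37006 352 116302 6600 7692.04 4003.72"
m = pattern.search(line)
if m:
    print(m.groupdict())  # {'type': 'Period', 'time': '5', 'ops': '145443', ...}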
Where is your data coming from?

Pandas exact str matching function?

Does pandas have a built-in string matching function for exact matches, not regex? The code below for tropical_two has a slightly higher count. The documentation tells me it does a regex search.
tropical = reviews['description'].map(lambda x: "tropical" in x).sum()
print(tropical)
tropical_two = reviews['description'].str.count("tropical").sum()
print(tropical_two)
The first way is the answer key from Kaggle, but something about it seems less readable and intuitive to me compared to a .str function. When I run the following, it returns True instead of 2, so I am a little confused about whether the answer key method is actually counting all occurrences of "tropical" and not just the first.
def in_str(text):
    return "tropical" in text

in_str("tropical is tropical")
First 2 lines of dataframe:
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe #kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss #vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
Notebook here, tropical code in cell #2
https://www.kaggle.com/mikexie0/exercise-summary-functions-and-maps
You may use str.count with word boundary markers to match the exact search term:
tropical_two = reviews['description'].str.count(r'\btropical\b').sum()
print(tropical_two)
There may be no need for a separate exact-match API, as str.count can be used for exact matches as well.
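To see why the two counts differ, here is a minimal sketch with a toy DataFrame (hypothetical data, just for illustration):

import pandas as pd

reviews = pd.DataFrame({"description": ["tropical is tropical", "no fruit here", "tropical storm"]})

# Boolean per row, so .sum() counts rows containing the word at least once -> 2
contains = reviews["description"].map(lambda x: "tropical" in x).sum()

# Occurrences per row, so .sum() counts every occurrence -> 3
occurrences = reviews["description"].str.count(r"\btropical\b").sum()

print(contains, occurrences)  # 2 3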

Extracting Multiple Blocks of Similar Text

I am trying to parse a report. The following is a sample of the text that I need to parse:
7605625112 DELIVERED N 1 GORDON CONTRACTORS I SIPLAST INC Freight Priority 2000037933 $216.67 1,131 ROOFING MATERIALS
04/23/2021 02:57 PM K WRISHT N 4 CAPITOL HEIGHTS, MD ARKADELPHIA, AR Prepaid 2000037933 -$124.23 170160-00
04/27/2021 12:41 PM 2 40 20743-3706 71923 $.00 055 $.00
2 WBA HOT $62.00 0
$12.92 $92.44
$167.36
7605625123 DELIVERED N 1 SECHRIST HALL CO SIPLAST INC Freight Priority 2000037919 $476.75 871 PAIL,UN1263,PAINT,3,
04/23/2021 02:57 PM S CHAVEZ N 39 HARLINGEN, TX ARKADELPHIA, AR Prepaid 2000037919 -$378.54
04/27/2021 01:09 PM 2 479 78550 71923 $.00 085 $95.35
2 HRL HOT $62.00 21
$13.55 $98.21
$173.76
This is comprised of two or more blocks that start with "[0-9]{10}\sDELIVERED" and end with the last currency string prior to the next block.
If I test with "(?s)([0-9]{10}\sDELIVERED)(.*)(?<=\$167.36\n)" I successfully get the first block, but if I use "(?s)([0-9]{10}\sDELIVERED)(.*)(?<=\$\d\d\d.\d\d\n)" it grabs everything.
If someone can show me the changes that I need to make to return two or more blocks I would greatly appreciate it.
* is a greedy operator, so it will try to match as many characters as possible. See also Repetition with Star and Plus.
For fixing it, you can use this regex:
(?s)(\d{10}\sDELIVERED)((.(?!\d{10}\sDELIVERED))*)(?<=\$\d\d\d.\d\d)
in which I basically replaced .* with (.(?!\d{10}\sDELIVERED))* so that, for every character, it checks whether or not it is followed by \d{10}\sDELIVERED.
See a demo here
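For instance, applied with Python's re.finditer (the report text is abbreviated here; each match is one block):

import re

# Abbreviated version of the report from the question; middle lines are omitted.
report = (
    "7605625112 DELIVERED N 1 GORDON CONTRACTORS I SIPLAST INC Freight Priority 2000037933 $216.67\n"
    "$12.92 $92.44\n"
    "$167.36\n"
    "7605625123 DELIVERED N 1 SECHRIST HALL CO SIPLAST INC Freight Priority 2000037919 $476.75\n"
    "$13.55 $98.21\n"
    "$173.76\n"
)

block_re = re.compile(r"(?s)(\d{10}\sDELIVERED)((.(?!\d{10}\sDELIVERED))*)(?<=\$\d\d\d.\d\d)")

for m in block_re.finditer(report):
    print(m.group(0))  # each block, from its DELIVERED line down to its last $ amount
    print("---")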

Forvalues dropping leading 0's, how to fix?

I am attempting to create a loop to save me having to type out the code many times. Essentially, I have 60 csv files that I need to alter and save. My code looks as follows:
forvalues i = 0203 0206 : 1112 {
    cd "C:\Users\User\Desktop\Data\"
    import delimited `i'.csv, varnames(1)
    gen time = `i'
    keep rssd9017 rssd9010 bhck4074 bhck4079 bhck4093 bhck2170 time
    save `i'.dta, replace
}
However, I am getting the error "203.csv" does not exist. It seems to be dropping the leading 0; is there any way to fix this?
You are asking for a numlist, but in this context 0203, with nothing else said, just looks to Stata like a quirky but acceptable way to write 203: hence your problem.
But do you really have a numlist that is 0203 0206 : 1112?
Try it:
numlist "0203 0206 : 1112"
ret li
The list starts 203 206 209 212 215 218 221 224 227 230 233 236 ...
My wild guess is that you have files, one for each quarter over a period, labelled 0203 for March 2002 through to 1112 for December 2011. In fact you do say that you have times, even though my guess implies 40 files, not 60. If so, that means you won't have a file that is labelled 0215, so this is the wrong way to think in any case.
Here is a better approach. First take the cd out of the loop: you need only do that once!
cd "C:\Users\User\Desktop\Data"
Now find the files that are ????.csv. You need only install fs once.
ssc inst fs
fs ????.csv
foreach f in `r(files)' {
    import delimited `f', varnames(1)
    gen time = substr("`f'", 1, 4)
    keep rssd9017 rssd9010 bhck4074 bhck4079 bhck4093 bhck2170 time
    // `f' is e.g. 0203.csv, so take the first four characters as the file stem
    local stem = substr("`f'", 1, 4)
    save `stem'.dta, replace
}
On my guess, you still need to fix the time to something civilised and you would be better off appending the files, but one problem at a time.
Note that insisting on leading zeros, which you think is the problem here but which is probably a red herring, is written up here.

RegEx for value 0.1 to 100.00

Looking at the XML file created by HitmanPro, I can see numerous entries like this one:
[Item type="Malware" malwareName="Trojan" score="0.0" status="None"]
These are the false positives.
I would like to replace the existing regex query that I use in a script (LabTech) with one that would look for anything like:
score="5.1" up to score="999.0"
I am new to regex queries, and I am having trouble building the search for the digits inside the string score=" ".
Any help would be much appreciated. Below is a sample XML from HitmanPro.
regards,
Oscar Romero
<br>
HitmanPro Scan Completed Successfully.
Threats Found!
<hr>
Scan Date: 2015-10-17T15:16:31<BR>
<p>"
[Log computer="computer name" windows="6.1.1.7601.X64/12" scan="Normal" version="3.7.9.246" date="2015-10-17T15:16:31" timeSpentInSecs="125" filesProcessed="15922"]
[Item type="Malware" malwareName="Malware" score="90.0" status="None"]
[Scanners]
[Scanner id="Bitdefender" name="Gen:Variant.Kazy.751212" /]
[/Scanners]
[File path="C:\Program Files (x86)\ESET\ESET Remote Administrator\Server\era.exe" hash="F7BB46D48B994539AFD400641CE8E4F85114FC7BA05A1BAA0D092F3A92817F13" /]
[Startup]
[Key path="HKLM\SYSTEM\CurrentControlSet\Services\ERA_SERVER\" /]
[/Startup]
[/Item]
[/Log]
"</p>
There must be a shorter version than this, but this should work.
score="(0\.[1-9]|[1-9]\.[0-9]|[1-9][0-9]\.[0-9]|[1-9][0-9][0-9]\.[0-9])"
Matches:
0.1
1.0
10.4
100.9
100.0
999.9
99.9
9.9
(etc.)
Does Not Match
0.0
0
(etc.)
Is regex the way to go?
As for whether regex is the right tool for the job, I probably agree with @Makoto that it isn't - unless you're doing a quick scan of the results as an FYI, rather than filtering results as part of a larger tool or application. In other words, except for the simplest cases, I agree with @Makoto that you want some XML parsing tool.
I have no idea about LabTech.
Anyway, here is the regex query that you can use:
\sscore="((?:5\.[1-9])|(?:[6-9]\.[0-9])|(?:[1-9]{1}[0-9]{1,2}\.[0-9]))"\s
or
\sscore="(5\.[1-9]|[6-9]\.[0-9]|[1-9]{1}[0-9]{1,2}\.[0-9])"\s
if you prefer without the (?: ... )
UPDATE:
Okay, I made further changes to support the 5.1 minimum and the 999.9 maximum.
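For example, a quick Python check against the sample items above (the no-match message is just illustrative):

import re

score_re = re.compile(r'\sscore="(5\.[1-9]|[6-9]\.[0-9]|[1-9]{1}[0-9]{1,2}\.[0-9])"\s')

lines = [
    '[Item type="Malware" malwareName="Trojan" score="0.0" status="None"]',
    '[Item type="Malware" malwareName="Malware" score="90.0" status="None"]',
]

for line in lines:
    m = score_re.search(line)
    print(m.group(1) if m else "no match (false positive)")
# -> no match (false positive)
# -> 90.0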
PS: This is my first answer on StackOverflow