Extract numbers from line not starting with comment symbol using regex - regex

I'm trying to replace all numbers not in the comment section. Here is a sample of the file to fix:
/* 2018-01-01 06:00:55 : realtime(0.002) --status(10)-- ++numretLines(0)++ --IP(192.168.1.5) PORT(22)-- queryNo(2) comment[TO: Too much time] TYPE[QUERY 4.2] */
select count(*) from table where id1 = 41111 and id2 = 221144
GO
Basically, I would like to replace numbers in strings not beginning with "/*".
I came up with the following regex: /^(?!\/\*)(?:.+\K(\d+?))/gmU
But I only manage to extract the first number of each line not starting with "/*". How could I extend this to get all the numbers of those rows?
Thanks!

Assuming your regex engine (which you haven't specified) supports lookahead and variable-length lookbehind, you can use this regex:
(?<!^\/\*.*)(?:(?<=\s)\d+(?=\s))+
The regex starts with a negative lookbehind that rejects any position preceded by a start of line, a slash and a star (that is, any position on a comment line).
Then it uses a positive lookbehind for a whitespace character, matches one or more digits, and ends with a positive lookahead for a whitespace character. This group is repeated one or more times.
You need to set the global and 'multiline' flags.
The regex skips numbers not surrounded by whitespace (for instance the 1 in 'id1').
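For engines without variable-length lookbehind (Python's built-in re module is one of them), a line-by-line filter is a possible workaround. This is only a minimal sketch, assuming the comment marker always sits at the very start of the line; the function name replace_numbers and the placeholder "N" are made up for illustration:

```python
import re

# Digits that are delimited by whitespace or by the line edges.
STANDALONE_NUMBER = re.compile(r"(?<!\S)\d+(?!\S)")

def replace_numbers(text, replacement="N"):
    """Replace standalone numbers on every line that does not start with '/*'."""
    out = []
    for line in text.splitlines():
        if not line.startswith("/*"):
            line = STANDALONE_NUMBER.sub(replacement, line)
        out.append(line)
    return "\n".join(out)

sample = "/* status(10) */\nselect id1 = 41111 and id2 = 221144\nGO"
print(replace_numbers(sample))
```

The comment line is left untouched, and 'id1' keeps its digit because it is glued to letters.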

Based on Wiktor Stribiżew's comment, I used \/\*.*?\*\/(*SKIP)(*F)|-?\b\d+(\.\d+)? to extract the numbers, including decimals and negative values.
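The (*SKIP)(*F) verbs are PCRE-specific. In engines without them, the same idea can be approximated by letting the first alternative consume the comment and capturing only the second alternative. A sketch in Python, assuming comments never span several lines; numbers_outside_comments is a hypothetical helper name:

```python
import re

# Alternative 1 swallows a whole /* ... */ comment (nothing captured).
# Alternative 2 captures a number, optionally negative or decimal.
TOKEN = re.compile(r"/\*.*?\*/|(-?\b\d+(?:\.\d+)?)")

def numbers_outside_comments(text):
    """Return the numbers found outside /* ... */ comments, as strings."""
    return [m.group(1) for m in TOKEN.finditer(text) if m.group(1) is not None]

sample = (
    "/* 2018-01-01 06:00:55 : realtime(0.002) --status(10)-- */\n"
    "select count(*) from table where id1 = 41111 and id2 = 221144\n"
    "GO\n"
)
print(numbers_outside_comments(sample))  # ['41111', '221144']
```

For a replacement instead of an extraction, the same pattern could be passed to re.sub with a callback that returns the comment unchanged when group 1 is empty.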

Related

Regex - lazy match first pattern occurrence, but no subsequent matching patterns

I need to return the first percentage, and only the first percentage, from each row in a file.
Each row may have one or two, but not more than two, percentages.
There may or may not be other numbers in the line, such as a dollar amount.
The percentage may appear anywhere in the line.
Ex:
Profits in California were down 10.00% to $100.00, a decrease from 22.6% the prior year.
Profits in New York increased by 0.9%.
Profits in Texas were up 1.58% an increase from last year's 0.58%.
I can write a regex to capture all occurrences:
[0-9]+\.[0-9]+[%]+?
https://regex101.com/r/owZaGE/1
The other SO questions I've perused only address this issue when the pattern is at the front of the line or always preceded by a particular set of characters
What am I missing?
/^.*?((?:\d+\.)?\d+%)/gm
works with a multiline flag, no negative lookbehind (some engines don't support non-fixed width lookbehinds). Your match will be in the capture group.
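To illustrate how the capture group is read back, here is a small Python sketch using the three sample rows from the question (the language is my choice, since the question does not name one):

```python
import re

# ^ anchors at each line start (MULTILINE); .*? lazily skips to the first percentage.
FIRST_PCT = re.compile(r"^.*?((?:\d+\.)?\d+%)", re.MULTILINE)

text = (
    "Profits in California were down 10.00% to $100.00, a decrease from 22.6% the prior year.\n"
    "Profits in New York increased by 0.9%.\n"
    "Profits in Texas were up 1.58% an increase from last year's 0.58%."
)

# With a single capture group, findall returns the group for each match.
first_percentages = FIRST_PCT.findall(text)
print(first_percentages)  # ['10.00%', '0.9%', '1.58%']
```

Because the pattern is anchored with ^, the second percentage on a line can never start a new match.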
Mine is similar to yours, except that I allowed numbers like 30% (without a decimal point)
\d+(\.\d+)?%
I don't know what language you are using, but in Python you can use re.search() to get the first occurrence.
Here is an example:
import re
pattern = r'\d+(\.\d+)?%'
string = 'Profits in California were down 10.00% to $100.00, a decrease from 22.6% the prior year.'
print(re.search(pattern, string).group())
I was able to solve using a negative lookbehind:
(?<!%.*?)([0-9]+\.[0-9]+[%]+?)

extract text data using regexp in MATLAB

I'm dealing with extracting visibility data from METAR (airport weather observation data).
Visibility is a 4-digit (0-9) value, and can also be expressed as 'CAVOK' when visibility is good.
But it's quite tricky to do with regexp (METAR data have many variations).
Data sample(MET_VIS) below:
201903072300 METAR RKPC 072300Z 17003KT 110V210 CAVOK 05/02 Q1026 NOSIG=
201903062000 METAR RKPC 062000Z 33018G29KT 4000 BR FEW012 SCT025 08/04 Q1018 WS R13 R31 NOSIG=
201903062200 METAR RKPC 062200Z 33015KT 290V350 9999 SCT030 07/03 Q1019 NOSIG=
201903080000 METAR RKPC 080000Z 29002KT CAVOK 08/02 Q1027 NOSIG=
I want to extract CAVOK, 4000, 9999, CAVOK on each line.
I tried but this code doesn't work with line 3 :( It returns blank.
regexp(MET_VIS(i),'((?<=KT\s)\d{4})|CAVOK','match')
In the third line, the value does not directly follow KT. What you might do is add another positive lookbehind that checks for KT, a space, exactly 7 characters from A-Z0-9 and a whitespace char right before the value.
Then you either match 4 digits or CAVOK using an alternation (?:\d{4}|CAVOK) or else you could match CAVOK anywhere in the string.
Add a word boundary after it to prevent the match being part of a larger word.
(?:(?<=KT\s)|(?<=KT [A-Z0-9]{7}\s))(?:\d{4}|CAVOK)\b
Regex demo
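As a quick check outside MATLAB, the pattern can be run against the four sample lines in Python. This is only a sketch for verifying the regex itself; the names VIS, metar_lines and visibility are mine. Python accepts the pattern because each lookbehind on its own has a fixed width:

```python
import re

VIS = re.compile(r"(?:(?<=KT\s)|(?<=KT [A-Z0-9]{7}\s))(?:\d{4}|CAVOK)\b")

metar_lines = [
    "201903072300 METAR RKPC 072300Z 17003KT 110V210 CAVOK 05/02 Q1026 NOSIG=",
    "201903062000 METAR RKPC 062000Z 33018G29KT 4000 BR FEW012 SCT025 08/04 Q1018 WS R13 R31 NOSIG=",
    "201903062200 METAR RKPC 062200Z 33015KT 290V350 9999 SCT030 07/03 Q1019 NOSIG=",
    "201903080000 METAR RKPC 080000Z 29002KT CAVOK 08/02 Q1027 NOSIG=",
]

def visibility(line):
    """Return the visibility token of one METAR line, or None if absent."""
    m = VIS.search(line)
    return m.group(0) if m else None

print([visibility(line) for line in metar_lines])  # ['CAVOK', '4000', '9999', 'CAVOK']
```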
You could also make an assumption about the range of "words" from the end your target should be allowed to occur in. For example:
/\b(?:\d{4}|CAVOK)\b(?=(?: \S+){3,9}$)/gm
See regex demo.
Here we're looking for a four-digit number or the word CAVOK, but only if it is followed by 3 to 9 non-space substrings of variable length until the end of the line.
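A sketch of this count-from-the-end variant in Python, run over the sample data joined into one multi-line string (the names VIS_FROM_END and report are assumptions of mine):

```python
import re

# MULTILINE makes $ match at the end of every line, not only at the end of the text.
VIS_FROM_END = re.compile(r"\b(?:\d{4}|CAVOK)\b(?=(?: \S+){3,9}$)", re.MULTILINE)

report = "\n".join([
    "201903072300 METAR RKPC 072300Z 17003KT 110V210 CAVOK 05/02 Q1026 NOSIG=",
    "201903062000 METAR RKPC 062000Z 33018G29KT 4000 BR FEW012 SCT025 08/04 Q1018 WS R13 R31 NOSIG=",
    "201903062200 METAR RKPC 062200Z 33015KT 290V350 9999 SCT030 07/03 Q1019 NOSIG=",
    "201903080000 METAR RKPC 080000Z 29002KT CAVOK 08/02 Q1027 NOSIG=",
])

visibilities = VIS_FROM_END.findall(report)
print(visibilities)  # ['CAVOK', '4000', '9999', 'CAVOK']
```

The 3 to 9 range is tuned to these samples; reports with more trailing groups would need a wider upper bound.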

Regex - select all text that does not start with a specific number

I want to get all text that does not start with 1,2,12,34.
I wrote
^((?!1|2|12|34).)*$
(^ asserts position at start of a line)
as in:
https://regex101.com/r/gI6sN8/14
Problems
It also doesn't select text that has 1 or 2 in the middle ("AB 1 CD").
It also doesn't select 13 (because it starts with 1)
How can I restrict it?
Looks like you want this:
^(?!(1|2|12|34)\s).*
https://regex101.com/r/gI6sN8/16
As mentioned in the comments, you need a word boundary and the parentheses in the right place
^(?!(?:1|2|12|34)\b)(.*)$
Regex Demo
You can also use \D
^(?!(?:1|2|12|34)\D)(.*)$
In your regex
^((?!1|2|12|34).)*$
you are checking at every position whether any of the alternatives 1|2|12|34 matches. That's why it's not matching AB 1 CD
This works
^(?!(?:12?|2|34)(?!\d)).+$
https://regex101.com/r/gI6sN8/19
A valid boundary between the numbers you don't want it to start with and the character after them appears to be any non-digit.
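To see the difference the boundary makes, here is a small Python sketch using the word-boundary variant from above. The sample strings are made up for illustration, apart from "AB 1 CD", which comes from the question:

```python
import re

# Reject a line only when 1, 2, 12 or 34 is the complete leading number.
NOT_LEADING = re.compile(r"^(?!(?:1|2|12|34)\b).*$")

samples = ["1 apple", "2", "12 x", "34", "13 pears", "AB 1 CD"]
kept = [s for s in samples if NOT_LEADING.match(s)]
print(kept)  # ['13 pears', 'AB 1 CD']
```

"13 pears" survives because there is no word boundary between 1 and 3, and "AB 1 CD" survives because the lookahead is only tested at the start of the line.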

Using Regex to clean a csv file in R

This is my first post so I hope it is clear enough.
I am having a problem regarding cleaning my CSV files before I can read them into R and have spent the entire day trying to find a solution.
My data is supposed to be in the form of two columns. The first column is a timestamp consisting of 10 digits and the second an ID consisting of 11 or 12 Letters and numbers (the first 6 are always numbers).
For example:
logger10 |
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F0831102744
010700EDDA0F|
would become:
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F
0831102744 | 010700EDDA0F
(please excuse the lines in the middle, that was my attempt at separating the columns...).
The csv file seems to occasionally be missing a comma which means that sometimes one row will end up like this:
0923120531,010300033ADD0925075301,010700EDD00A
My hardware also adds the word logger10 (or whichever number logger this is) whenever it restarts which gives a similar problem e.g. logger10logger100831102744.
I think I have managed to solve the logger text problem (see code) but I am sure this could be improved. Also, I really don't want to delete any of the data.
My real trouble is making sure there is a line break in the right place after the ID and, if not, I would like to add one. I thought I could use regex for this but I'm having difficulty understanding it.
Any help would be greatly appreciated!
Here is my attempt:
temp <- list.files(pattern="*.CSV") #list of each csv/logger file
for(i in temp){
#clean each csv
tmp<-readLines(i) #check each line in file
tmp<-gsub("logger([0-9]{2})","",tmp) #remove logger text
pattern <- ("[0-9]{10}\\,[0-9]{6}[A-Z,0-9]{5,6}") #regex pattern ??
if (tmp!= pattern){
#I have no idea where to start here...
}
}
here is some raw data:
logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
If you want to do this in a single pass:
(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?
can be replaced with
\1\t\2\n
What this does is look for any of those rogue logger01 entries (including the space after it) optionally: That trailing ? after the group means that it can match 0 or 1 time: if it does match, it will. If it's not there, the match just keeps going anyway.
Following that, you look for (and capture) exactly 10 hex values (either digits or A-F). The ,? means that if a comma exists, it will match, but it can match 0 or 1 time as well (making it optional).
Following that, look for (and capture) exactly 12 hex values. Finally, to get rid of any stray trailing spaces, the " ?" at the end (a space character followed by ?) will optionally match a trailing space.
Your replacement will replace the first captured group (the 10 hex digits), add in a tab, replace the second captured group (the 12 hex digits), and then a newline.
You can see this in use on regex101 to see the results. You can use code generator on the left side of that page to get some preformatted PHP/Javascript/Python that you can just drop into a script.
If you're doing this from the command line, perl could be used:
perl -pe 's/(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?/\1\t\2\n/g'
If another language, you may need to adapt it slightly to fit your needs.
EDIT
Re-reading the OP and comments, a slightly more rigid regex could be
(?:logger\d\d\ )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})\ ?
I updated the regex101 link with the changes.
This still looks for the first 10 hex values, but now looks for exactly 6 digits, followed by 5-6 hex values, so the total number of characters matched is 11 or 12.
The replacement would be the same.
Paste your regex into https://regex101.com/ to see whether it catches all cases. The 5 or 6 letters or digits could pose an issue, as the pattern may catch the first digit of the next timestamp when the logger misses out a comma. Appending a '\n' to the end of each match in tmp should work, provided the regex catches all cases.
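For anyone who wants to try the idea outside regex101, here is a rough Python sketch (the question is about R, so treat this as an illustration of the regex, not a drop-in fix). It strips the logger text first and then pulls out timestamp/ID pairs wherever they occur, so a missing line break no longer matters. The names PAIR and clean_log are hypothetical, and the caveat about 11-character IDs followed directly by a timestamp still applies:

```python
import re

# 10-character timestamp, optional comma, then 6 digits plus 5-6 hex characters.
PAIR = re.compile(r"([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})")

def clean_log(raw):
    """Return one 'timestamp,id' string per record found in the raw text."""
    text = re.sub(r"logger\d\d", "", raw)  # drop restart markers first
    return [f"{ts},{tag}" for ts, tag in PAIR.findall(text)]

raw = (
    "logger01\n"
    "0831103159,0203000316DB\n"
    "0831103207,0203000316DB0831103253,010700EDE28C\n"
    "0831103301,010700EDE28C\n"
)
for row in clean_log(raw):
    print(row)
```

The run-together third line of the raw data comes out as two separate records.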

How Can I Create a RegEx Pattern that will Get N Words Using Custom Word Boundary?

I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_
EDIT #1: Thanks for all your comments.
To be clear:
I'd like to set the characters that would be the word delimiters
Let's call this the "Delimiter Set", or strDelimiters
strDelimiters = ".,;:!?-*_"
nNumWordsToFind = 5
A word is defined as any contiguous text that does NOT contain any character in strDelimiters
The RegEx word boundary is any contiguous text that contains one or more of the characters in strDelimiters
I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.
EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT
@maraca definitely answered my question as originally stated.
But what I actually need is to return the number of words ≤ nNumWordsToFind.
So if the source text has only 3 words, but my RegEx asks for 4 words, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind > number of actual words in the source text.
For example:
one,two;three-four_five.six:seven eight nine! ten
It would see this as 10 words.
If I want the first 5 words, it would return:
one,two;three-four_five.
I have this pattern using the normal \s whitespace, which works, but NOT exactly what I need:
([\w]+\s+){<NumWordsOut>}
where <NumWordsOut> is the number of words to return.
I have also found this word boundary pattern, but I don't know how to use it:
a "real word boundary" that detects the edge between an ASCII letter
and a non-letter.
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
However, I would want my words to allow numbers as well.
In any case, I have not been able to work out how to use the above custom word boundary pattern to return the first N words of my text.
BTW, I will be using this in a Keyboard Maestro macro.
Can anyone help?
TIA.
All you have to do is adapt your pattern ([\w]+\s+){<NumWordsOut>} to the following, which also covers some special cases:
^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
1. Match any amount of delimiters before the first word
2. Match a word (= at least one non-delimiter)
3. The word has to be followed by at least one delimiter
4. Or it can be at the end of the string (in case no delimiter follows at the end)
5. Repeat 2. to 4. <NumWordsOut> times
Note how I changed the order of the -, it has to be at the start or end, otherwise it needs to be escaped: \-.
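The pattern can also be built from the delimiter string at run time. Below is a minimal Python sketch under a few assumptions of mine: the helper name first_n_words is hypothetical, whitespace is always added to the delimiter set, and the quantifier is {1,n} rather than {n} so that texts with fewer than n words still return what they have (the requirement from EDIT #2):

```python
import re

def first_n_words(text, n, delimiters=".,;:!?-*_"):
    """Return the leading part of text holding at most n words."""
    # re.escape makes every delimiter safe inside the character class,
    # including '-' wherever it appears in the string.
    cls = r"\s" + re.escape(delimiters)
    pattern = rf"^[{cls}]*(?:[^{cls}]+(?:[{cls}]+|$)){{1,{n}}}"
    m = re.match(pattern, text)
    return m.group(0) if m else ""

source = "one,two;three-four_five.six:seven eight nine! ten"
print(first_n_words(source, 5))    # one,two;three-four_five.
print(first_n_words("a b c", 20))  # a b c
```

As in the original pattern, the trailing delimiters after the last word are part of the match and may need trimming.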
Thanks to @maraca for providing the complete answer to my question.
I just wanted to post the Keyboard Maestro macro that I have built using @maraca's RegEx pattern for anyone interested in the complete solution.
See KM Forum Macro: Get a Max of N Words in String Using RegEx