RegEx Lookaround issue - regex

I am using PowerShell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The date/time part will always be _, 8 characters, _, 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.

If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't simply pull two bits (name and extension) of the string out as one match: regex patterns match contiguous chunks only.
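As a rough illustration of both ideas, here is a sketch in Python rather than PowerShell. The helper names are made up for this example; the patterns are the ones from the answer above.

```python
import re

# Pattern from the answer above: name in group 1, extension (with dot) in group 2.
NAME_AND_EXT = re.compile(r'^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$')

# The date/time part on its own, for the substitution approach.
STAMP = re.compile(r'_[0-9]{8}_[0-9]{6}')


def join_groups(filename):
    """Rebuild the name from the two capture groups (hypothetical helper)."""
    match = NAME_AND_EXT.match(filename)
    if match is None:
        return filename  # no date/time part found, leave the name unchanged
    return match.group(1) + match.group(2)


def strip_stamp(filename):
    """Delete the first _8digits_6digits part (hypothetical helper)."""
    return STAMP.sub('', filename, count=1)


print(join_groups('my_file_name_01012013_111546.xls'))
print(strip_stamp('my_file_name_01012013_111546.xls'))
```

Both calls should give the same result here; the substitution version is shorter because it never has to stitch two pieces back together.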

Try this (not tested); it should work for any 'my_file_name' length, any number of digits and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
Non-regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")

The original regex you specified returns the longest match that is followed by 16 characters of the form _.{8}_.{6}; the lookahead checks those characters without making them part of the match.
Once you append .{3}, it returns that same longest match plus the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better if you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages:
get the initial part
get the extension

The reason you get my_file_name_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the added .{3} then consumes the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.
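The zero-width behaviour can be checked with a small Python sketch (Python's re module is used here only for illustration; the question itself is about PowerShell). The end offset of the first match shows where the engine resumes.

```python
import re

name = 'my_file_name_01012013_111546.xls'

# The lookahead only checks what follows; the match itself stops before it.
without_tail = re.match(r'.*(?=_.{8}_.{6})', name)

# .{3} resumes right where .* stopped, so it consumes "_01", not the extension.
with_tail = re.match(r'.*(?=_.{8}_.{6}).{3}', name)

print(without_tail.group(0), without_tail.end())
print(with_tail.group(0), with_tail.end())
```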

Related

How to retrieve the targeted substring, if the number of characters can vary?

I want to retrieve from input similar to the following: code="XY85XXXX", the substring between "".
In case of a fixed number of 8 characters I can retrieve the value with (?<=code=").{8}.
But the targeted substring length varies, 7 or 9, or somewhere in the range between 3 and 11 (as in the examples below) and that is what I need to also handle.
Input can for example be code="XY85XXXX765" or code="123".
How must I adjust the regex to achieve that flexibility?
You can use a positive lookbehind to 'anchor' your matches to the fixed part, (?<=code="), followed by a negated character class that allows any character except " one or more times:
(?<=code=")[^"]+
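A minimal Python sketch of this pattern, assuming the input always has the form code="..." as in the question; extract_code is a name invented for the example.

```python
import re

# The lookbehind anchors the match right after code=" and
# [^"]+ then takes everything up to (not including) the closing quote.
CODE_VALUE = re.compile(r'(?<=code=")[^"]+')


def extract_code(text):
    """Return the quoted value after code=, or None if absent (hypothetical helper)."""
    match = CODE_VALUE.search(text)
    return match.group(0) if match else None


for sample in ('code="XY85XXXX"', 'code="XY85XXXX765"', 'code="123"'):
    print(extract_code(sample))
```

Because the character class stops at the first closing quote, the length of the value does not need to be known in advance.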
You can use a lookahead and lookbehind both searching for quotes:
(?<=").*(?=")
let rx = /(?<=").*(?=")/;
let extract = (txt) => console.log(txt.match(rx)[0]);
extract('code="XY85XXXX"');
extract('code="Y85XXXX"');
extract('code="ZXY85XXXXZ"');
I pasted the solution, (?<=code=")[^"]+, into https://regex101.com/ with the PHP flavor selected.
Ok, I get my result but when I select in the tool .NET I have no result.
What should/must be changed?

Regex Erasing all except numbers with limited digits

What I want to do is erase everything except \d{4,7} only by replacing.
Any ideas to get this?
Examples:
G-A15239L → 15239
(G-A and L should be selected and replaced by empty strings)
now200316stillcovid19asdf → 200316
(now and stillcovid19asdf should be selected and replaced by empty strings)
Also, the replacement text is not limited to an empty string.
Substitutions such as $1 are possible too.
I am using regex in 'Kustom' apps (including KLCK, KLWP, KWGT).
I don't know which engine they use because there is no information about it.
You may use
(\d{4,7})?.?
Or
(\d{4,7})|.
and replace with $1. See the regex demo.
Details
(\d{4,7})? - an optional (due to ? at the end - if it is missing, then the group is obligatory) capturing group matching 1 or 0 occurrences of 4 to 7 digits
| - or
.? - any one char other than line break chars, 1 or 0 times when ? is right after it.
So, any match of 4 to 7 digits is kept (since $1 refers to the Group 1 value) and if there is a char after it, it is removed.
It looks as if the regex is Java based since all non-matching groups are replaced with null.
So, the only possible solution is to use a second pass to post-process the results, just replace null with some kind of a delimiter, a newline for example.
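The alternation idea can be tried out in Python. This is only a sketch of the pattern's logic: Python is not the engine Kustom uses, and keep_digit_runs is a hypothetical helper name. The replacement function falls back to an empty string whenever group 1 did not take part in the match, which sidesteps the "null" problem described above.

```python
import re

# Either capture a run of 4 to 7 digits, or match any single other character.
KEEP_DIGITS = re.compile(r'(\d{4,7})|.')


def keep_digit_runs(text):
    """Keep what group 1 captured and drop every other character (hypothetical helper)."""
    # group(1) is None whenever the "." branch matched, so fall back to ''.
    return KEEP_DIGITS.sub(lambda m: m.group(1) or '', text)


print(keep_digit_runs('G-A15239L'))
print(keep_digit_runs('now200316stillcovid19asdf'))
```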
Search: .*?(\d{4,7})[^\d]+|.*
Replace: $1
in for instance Notepad++ 6.0 or better (which comes with built-in PCRE support) works with your examples:
jalsdkfilwsehf
now200316stillcovid19asdf
G-A15239L
becomes:
200316
15239

Find all groups of 9 digits (\d{9}) up to a certain word

I have the following string extracted from a PDF file and I would like to obtain the nine digits "control class" number from it:
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
I want all the matches that occur before the word “Sector”, otherwise I will have undesired matches.
I’m using the “re” module, in Python 3.8.
I tried to use the negative lookbehind as follows:
(?<!Sector:)\d{9}
However, it didn’t work. I still had matches like ‘541778469’ and ‘201874249’, which come after the word ‘Sector’.
I also tried to “isolate” the search area between the words “Process ID” and “Sector”:
(Process ID:.*?)(\d{9})(.*Sector)
I also tried to search for the expression \d{9} only up to the word “Sector”, but it returned no results.
I had to work around it in two steps: (1) I created a regex that would find everything up to the word “Sector” (desperate_regex = '(.*)Sector') and assigned the result to a new variable, partial_text; (2) I then searched for the desired regex ('\d{9}') within the new variable.
My code is working, but it does not satisfy me. How would I find my matches with a single regex search?
Please note that the first "control class" number runs together with the text that comes before it ("CONTROL CLASS706345519").
(PS: I'm a total newbie, and this is my first post. I hope I explained myself well. Thank you!)
The easiest way is to get the string before Sector and just search that:
import re
# Keep only the text before the first "Sector", even if the word appears again later.
split_string = string.split("Sector", 1)[0]
nums = re.findall(r'\d{9}', split_string)
# ['706345519', '708393673', '706855190']
Another would be to use the third-party regex module, which allows overlapping matches:
import regex as re
nums = re.findall(r'(\d{9}).*?Sector', string, overlapped=True)
# ['706345519', '708393673', '706855190']
The regex described below may be more than the actual case requires, but better safe than sorry.
If you want to match a string of exactly 9 digits, no more and no fewer, then you should use negative lookbehind and lookahead assertions to ensure that the 9 digits are neither preceded nor followed by another digit (again, in this case perhaps the OP knows that only 9-digit numbers will ever appear and this is overkill). You can also use a negative lookbehind assertion to ensure that Sector does not appear before the 9 digits. This latter assertion is a variable-length assertion requiring the regex package from PyPI:
r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)'
(?<!Sector.*?) Assert that we haven't scanned past Sector. This handles the situation where Sector might appear multiple times in the input by ensuring that we never scan past the first occurrence.
(?<!\d) Assert that the previous character is not a digit.
\d{9} Match 9 digits.
(?!\d) Assert that the next character is not a digit.
The simplified version:
r'(?<!Sector.*?)\d{9}'
The code:
import regex as re
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
#print(re.findall(r'(?<!Sector.*?)\d{9}', string))
print(re.findall(r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)', string))
Prints:
['706345519', '708393673', '706855190']
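The digit-boundary assertions do not need the third-party package; only the variable-length lookbehind does. A sketch using just the standard re module, under the assumption that cutting the text at the first "Sector" beforehand is acceptable (the function name and the shortened sample string are made up for this example):

```python
import re

# Exactly 9 digits: not preceded and not followed by another digit.
EXACTLY_NINE = re.compile(r'(?<!\d)\d{9}(?!\d)')


def nine_digit_runs_before(text, stop_word='Sector'):
    """Find exact 9-digit runs in the text before the first stop_word (hypothetical helper)."""
    # partition() returns the whole text as the head when stop_word is absent.
    head = text.partition(stop_word)[0]
    return EXACTLY_NINE.findall(head)


sample = ('Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 '
          'CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115')
print(nine_digit_runs_before(sample))
```

This is two steps rather than a single regex search, so it trades the "one pattern" goal for staying inside the standard library.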
You could use an alternation and break if you find "Sector":
import re
text = """(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)"""
rx = re.compile(r'\d{9}|(Sector)')
results = []
for match in rx.finditer(text):
    if match.group(1):
        break
    results.append(match.group(0))
print(results)
Which yields
['706345519', '708393673', '706855190']
If either of these works, I'll add an explanation to it:
[\s\S]+(?:Process ID:\s+)(.*)(?:\s+Sector)[\s\S]+
\g<1>
Or this?
(?i)[\s\S]+(?:control\s+class\s*)(\d{9})[\s\S]+
\g<1>

Hive REGEXP_EXTRACT returning null results

I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.
I'm querying HDP Hive with this in the select statement:
REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0)
I've tried addressing positions 0-2 and none return an error or any data. I tested the code on regextester.com and it highlighted the data I want to extract. When I then run it in Zeppelin, it returns NULLs.
My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.
Sample data:
Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC
The Hive manual says this:
Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Also, your expression includes unnecessary characters in the character class.
Try this:
REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0)
Since you want only the part without underscore, use this:
REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
It matches the entire pattern, but extracts only capture group 1 (the part in parentheses) instead of the entire match.
Or alternatively:
REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)
This uses a regexp technique called "positive lookbehind". It translates to: "find me an upper-case letter followed by 7 digits, but only if they are preceded by an _". It uses the _ for matching but doesn't consider it part of the extracted match.
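The two patterns can be sanity-checked outside Hive. The sketch below uses Python purely as a test bench: Hive uses Java regular expressions and needs the doubled backslashes shown above, whereas a Python raw string takes a single one. The helper names are invented for the example.

```python
import re

rows = [
    'Sept Wk 5 Sunny Sailing_R7080075_12345',
    'Holiday_Wk2_Smiles_X1234567_ABC',
]

# Capture-group variant: the underscore is matched but sits outside group 1.
CAPTURE = re.compile(r'_([A-Z]\d{7})')

# Lookbehind variant: the underscore is required but never part of the match.
LOOKBEHIND = re.compile(r'(?<=_)[A-Z]\d{7}')


def id_by_group(row):
    """Return group 1 of the capture variant, or None (hypothetical helper)."""
    match = CAPTURE.search(row)
    return match.group(1) if match else None


def id_by_lookbehind(row):
    """Return the whole match of the lookbehind variant, or None (hypothetical helper)."""
    match = LOOKBEHIND.search(row)
    return match.group(0) if match else None


for row in rows:
    print(id_by_group(row), id_by_lookbehind(row))
```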

Using RegEx how do I remove the trailing zeros from a decimal number

I need to write a regex that takes a number and removes any trailing zeros after the decimal point. The language is ActionScript 3. So I would like to write:
var result:String = theStringOfTheNumber.replace( [ the regex ], "" );
So for example:
3.04000 would be 3.04
0.456000 would be 0.456 etc
I've spent some time looking at various regex websites and I'm finding this harder to resolve than I initially thought.
Regex:
^(\d+\.\d*?[1-9])0+$
OR
(\.\d*?[1-9])0+$
Replacement string:
$1
Code:
var result:String = theStringOfTheNumber.replace(/(\.\d*?[1-9])0+$/g, "$1" );
What worked best for me was
^([\d,]+)$|^([\d,]+)\.0*$|^([\d,]+\.[0-9]*?)0*$
For example,
s.replace(/^([\d,]+)$|^([\d,]+)\.0*$|^([\d,]+\.[0-9]*?)0*$/, "$1$2$3");
This changes
1.10000 => 1.1
1.100100 => 1.1001
1.000 => 1
1 => 1
What about stripping the trailing zeros before a \b boundary if there's at least one digit after the .
(\.\d+?)0+\b
And replace with what was captured in the first capture group.
$1
See test at regexr.com
(?=.*?\.)(.*?[1-9])(?!.*?\.)(?=0*$)|^.*$
Try this. Grab the capture. See demo.
http://regex101.com/r/xE6aD0/11
Other answers didn't consider numbers whose fraction is all zeros (like 1.000000) or used a lookbehind (sadly, not supported by the implementation I'm using). So I modified existing answers.
Match using ^-?\d+(\.\d*[1-9])? - Demo (see matches). This will not work with numbers in text (like sentences).
Replace(with \1 or $1) using (^-?\d+\.\d*[1-9])(0+$)|(\.0+$) - Demo (see substitution). This one will work with numbers in text (like sentences) if you remove the ^ and $.
Both demos with examples.
Side note: Replace the \. with the decimal separator you use (for a comma, a plain , with no backslash) if you have to, but I would advise against supporting multiple separator formats within such a regex (like (\.|,)). Internal formats normally use one specific separator like . in 1.135644131 (no need to check for other potential separators), while external ones tend to use both (one for decimals and one for thousands, like 1.123,541,921), which would make your regex unreliable.
Update: I added -? to both regexes to add support for negative numbers, which is not in demo.
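A minimal Python sketch of the replace variant above, applied to a string that holds just the number. The question targets ActionScript 3, so this only checks the pattern's logic; strip_trailing_zeros is a hypothetical helper name. A replacement function is used instead of a \1 template so that the branch where group 1 did not participate yields an empty string.

```python
import re

# Branch 1: keep the digits up to the last non-zero decimal (group 1), drop the zeros after it.
# Branch 2: drop a fraction that consists only of zeros, including the dot.
TRAILING_ZEROS = re.compile(r'(^-?\d+\.\d*[1-9])(0+$)|(\.0+$)')


def strip_trailing_zeros(number_text):
    """Remove trailing zeros after the decimal point (hypothetical helper)."""
    return TRAILING_ZEROS.sub(lambda m: m.group(1) or '', number_text)


for value in ('3.04000', '0.456000', '1.000', '10', '-2.50'):
    print(value, '=>', strip_trailing_zeros(value))
```

Integers without a decimal point match neither branch, so they pass through unchanged.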
If your regular expressions engine doesn't support "lookaround" feature then you can use this simple approach:
fn:replace("12300400", "([^0])0*$", "$1")
Result will be: 123004
I know I am kind of late, but either I am missing something or the other repliers overcomplicate it: I think there is a far more straightforward yet resilient RE:
([0-9]*[.]?([0-9]*[1-9]|[0]?))[0]*
By backreferencing the first group (\1) you can get the number without trailing zeros.
It also works with .XXXXX... and ...XXXXX. type number strings. For example, it will convert .45600 to .456 and 123. to 123. as well.
More importantly, it leaves integer number strings intact (numbers without decimal point). For example, it will convert 12300 to 12300.
Note that if there is a decimal point and there are only zeros after it, one trailing zero is kept. For example, for 42.0000 you get 42.0.
If you want to eliminate the leading zeros too, then use this RE (just put a [0]* at the start of the former):
[0]*([0-9]*[.]?([0-9]*[1-9]|[0]?))[0]*
I tested a few answers from the top:
^(\d+\.\d*?[1-9])0+$
(\.\d*?[1-9])0+$
(\.\d+?)0+\b
None of them works for the case where there are only zeros after the ".", like 45.000 or 450.000.
A modified version that matches that case: (\.\d*?[1-9]|)\.?0+$
You also need to replace with '$1', like:
preg_replace('/(\.\d*?[1-9]|)\.?0+$/', '$1', $value);
Try this:
^(?!0*(\.0+)?$)(\d+|\d*\.\d+)$
And read this:
http://www.regular-expressions.info/numericranges.html it might be helpful.
I know it's not what the original question is looking for, but this is for anyone who wants to format money and would only like to remove two consecutive trailing zeros, like so:
£30.00 => £30
£30.10 => £30.10 (and not £30.1)
30.00€ => 30€
30.10€ => 30.10€
Then you should be able to use the following regular expression, which identifies two trailing zeros (together with the separator before them) that are either followed by a non-digit or sit at the end of the string.
([^\d]00)(?=[^\d]|$)
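A small Python sketch of how this pattern might be applied, assuming the matched text is replaced with an empty string (the answer does not spell out the replacement, and drop_zero_fraction is a hypothetical helper name):

```python
import re

# A non-digit separator followed by exactly "00", then a non-digit or the end of the string.
TWO_ZERO_FRACTION = re.compile(r'([^\d]00)(?=[^\d]|$)')


def drop_zero_fraction(amount):
    """Remove a .00-style fraction but leave values like .10 alone (hypothetical helper)."""
    return TWO_ZERO_FRACTION.sub('', amount)


for amount in ('£30.00', '£30.10', '30.00€', '30.10€'):
    print(amount, '=>', drop_zero_fraction(amount))
```

Because the separator is matched by [^\d], it is removed together with the two zeros.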
I'm a bit late to the party, but here's my solution:
(((?<=(\.|,)\d*?[1-9])0+$)|(\.|,)0+$)
My regular expression will only match the trailing 0s, making it easy to do a .replaceAll(..) type function.
Breaking it down, part one: ((?<=(\.|,)\d*?[1-9])0+$)
(?<=(\.|,): A positive lookbehind. The decimal must contain a . or a , (commas are used as a decimal point in some countries). As it's a lookbehind, it is not included in the matched text, but it still must be present.
\d*?: Matches any number of digits lazily
[1-9]: Matches a single non-zero character (this will be the last digit before trailing 0s)
0+$: Matches 1 or more 0s that occur between the last non-zero digit and the line end.
This works great for everything except the case where trailing 0s begin immediately, like in 1.0 or 5.000. The second part fixes this (\.|,)0+$:
(\.|,): Matches a . or a , that will be included in matched text.
0+$ matches 1 or more 0s between the decimal point and the line end.
Examples:
1.0 becomes 1
5.0000 becomes 5
5.02394900022000 becomes 5.02394900022
Is it really necessary to use regex? Why not just check the last digits in your numbers? I am not familiar with ActionScript 3, but in Python I would do something like this:
decinums = ['1.100', '0.0', '1.1', '10']
for d in decinums:
    # Only strip zeros when there is a decimal point; '10' must stay '10'.
    if '.' in d:
        while d.endswith('0'):
            d = d[:-1]
        if d.endswith('.'):
            d = d[:-1]
    print(d)
The result will be:
1.1
0
1.1
10