Remove set of string from a string, multiple occurrences - regex

I want to completely remove any part of my string that looks like
\"AddedDate\":\"\\/Date(1480542000000-0600)\\/\"
The 1480542000000-0600 part is not hardcoded; it could be any set of numbers (JSON dates).

Try the regex \"AddedDate\":\"\\\/Date\(\d+(?:-\d+)?\)\\\/\" and replace each match with an empty string. If the regex engine doesn't support \d, use [0-9] instead. This matches dates of the form x or x-x, where x is any number of digits.
If you want to match exactly 13 digits in the first part of the date and 4 in the second, use \"AddedDate\":\"\\\/Date\(\d{13}(?:-\d{4})?\)\\\/\"
EDIT: For the new format, \\\"AddedDate\\\":\\\"\\\\\/Date\(\d+(?:-\d+)?\)\\\\\/\\\" should work.
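For illustration only, here is a minimal Python sketch of the same idea. It assumes the text contains the fragment in the form "AddedDate":"\/Date(...)\/", with a single backslash before each slash; the helper name strip_added_date and the sample text are made up for this example, and other engines or escaping levels will need the pattern adjusted.

```python
import re

# Assumed input form: "AddedDate":"\/Date(<digits>[-<digits>])\/"
# In the pattern, \\ matches one literal backslash and \( \) match parentheses.
ADDED_DATE = re.compile(r'"AddedDate":"\\/Date\(\d+(?:-\d+)?\)\\/"')


def strip_added_date(text: str) -> str:
    """Remove every AddedDate fragment from text."""
    return ADDED_DATE.sub("", text)


sample = (
    r'a "AddedDate":"\/Date(1480542000000-0600)\/" b '
    r'"AddedDate":"\/Date(1480526460000)\/" c'
)
cleaned = strip_added_date(sample)
print(cleaned)
```

Both occurrences are removed in one call because sub replaces all non-overlapping matches by default.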

Hive REGEXP_EXTRACT returning null results

I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.
I'm querying HDP Hive with this in the select statement:
REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0)
I've tried addressing positions 0-2 and none return an error or any data. I tested the code on regextester.com and it highlighted the data I want to extract. When I then run it in Zeppelin, it returns NULLs.
My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.
Sample data:
Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC
The Hive manual says this:
Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Also, your expression includes unnecessary characters in the character class.
Try this:
REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0)
Since you want only the part without underscore, use this:
REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
It matches the entire pattern, but extracts only the first capture group (index 1) instead of the entire match (index 0).
Or alternatively:
REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)
This uses a regexp technique called "positive lookbehind". It translates to: "find me an uppercase letter followed by 7 digits, but only if they are preceded by an _". It uses the _ for matching but doesn't consider it part of the extracted match.
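To check the two patterns outside Hive, a rough Python sketch can run them against the sample data. This uses Python's re module rather than Hive's Java regex engine, so the backslashes are not doubled here; the function names are hypothetical.

```python
import re

samples = [
    "Sept Wk 5 Sunny Sailing_R7080075_12345",
    "Holiday_Wk2_Smiles_X1234567_ABC",
]

# Capture-group variant: the underscore is matched but left out of group 1.
GROUP_PATTERN = re.compile(r"_([A-Z]\d{7})")
# Lookbehind variant: the underscore is required but never part of the match.
LOOKBEHIND_PATTERN = re.compile(r"(?<=_)[A-Z]\d{7}")


def extract_id(text):
    """Return the ID via capture group 1, or None if there is no match."""
    m = GROUP_PATTERN.search(text)
    return m.group(1) if m else None


def extract_id_lookbehind(text):
    """Return the ID via the whole match (group 0), or None."""
    m = LOOKBEHIND_PATTERN.search(text)
    return m.group(0) if m else None


for s in samples:
    print(extract_id(s), extract_id_lookbehind(s))
```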

RegEx Parse Tool to extract digits from string

Using Alteryx, I have a field called Address which consists of fields like A32C, GH2X, ABC19E. So basically where digits are pinned between sets of letters. I am trying to use the RegEx tool to extract the digits out into a new column called ADDRESS1.
I have Address set to Field to Parse. Output method Parse.
My regular expression is typed in as:
(?:[[alpha]]+)(/d+)(?:[[alpha]]+)
And then I have (/d+) outputting to ADDRESS1. However, when I run this it parses 0 records. What am I doing wrong?
To match a digit, use [0-9] or \d. To match a letter, use [[:alpha:]].
Use
[[:alpha:]]+(\d+)[[:alpha:]]+
See the regex demo.
You can try this:
// Match digits that have a letter immediately before and after them.
let regex = /(?<=[A-Z])\d+(?=[A-Z])/g;
let values = 'A32CZ, GH2X, ABC19E';
let result = values.match(regex);
console.log(result); // ["32", "2", "19"]
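The same extraction can be sketched in Python as a quick check. Python's re module does not support POSIX classes such as [[:alpha:]], so this sketch assumes ASCII letters and uses [A-Za-z] instead; extract_digits is a hypothetical helper, not part of Alteryx.

```python
import re

# Letters, then captured digits, then letters; must cover the whole value.
ADDRESS = re.compile(r"[A-Za-z]+(\d+)[A-Za-z]+")


def extract_digits(address):
    """Return the digits between the letter runs, or None if the shape differs."""
    m = ADDRESS.fullmatch(address)
    return m.group(1) if m else None


for value in ["A32C", "GH2X", "ABC19E"]:
    print(value, extract_digits(value))
```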

Regular expressions string replacement of individual match within file

I have written a small program to whir through a text file and find and replace matches of the regex \d{9} (9 digits). It works fine, except that what I need is a little more complicated.
I am finding the right data correctly. theFile is just a string with the text file streamread into it. I do this and then create and write it to another file.
But I need to find each string match individually, and replace that match with only the last 5 digits of that individual number (currently this is just replacing with FOUND). Keeping the file otherwise identical.
I am not sure what the best way of doing this is. Would I have to split the text into an array of strings rather than one mass string? (It's quite a big file.)
Any questions let me know, thanks in advance.
Dim regexString As String = "(\d{9})"
Dim replacement1 As String = "FOUND"
Dim rgx As New Regex(regexString)
Try
theFile = rgx.Replace(theFile, replacement1)
Catch
End try
Instead of a single \d{9} pattern, split it into two groups: the first 4 digits long, the second 5 digits long. Then, in the replacement, use only the second group, which holds the last 5 digits.
Dim k = "abcd 123456789 abcf"
Dim ptn = "(\d{4})(\d{5})"
Dim result = Regex.Replace(k, ptn, "$2")
This approach leaves sequences of fewer than 9 consecutive digits unchanged. If you also have sequences of more than 9 digits and don't want to change them, you need a pattern like
Dim ptn = "(\b\d{4})(\d{5}\b)"
which anchors the two groups to a run of exactly nine digits delimited by word boundaries.
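As a rough Python counterpart to the VB.NET snippets above (keep_last_five is a made-up name), the sketch below uses digit lookarounds instead of \b. That is an assumption about the data: \b does not match between a letter or underscore and a digit, so a nine-digit run glued to a letter would be skipped by the \b version.

```python
import re

# Four digits to drop, five to keep; the lookarounds reject longer digit runs.
NINE_DIGITS = re.compile(r"(?<!\d)\d{4}(\d{5})(?!\d)")


def keep_last_five(text: str) -> str:
    """Replace each run of exactly nine digits with its last five digits."""
    return NINE_DIGITS.sub(r"\1", text)


print(keep_last_five("abcd 123456789 abcf"))
print(keep_last_five("id 1234567890 stays"))  # ten digits: left unchanged
```

Because sub works on the whole string, there is no need to split the file into an array of lines first.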
The question appears to ask for matches on exactly nine digits and wants the first four removed, i.e. to replace the nine digits with the last five.
Splitting the regular expression in the question into two parts, for the unwanted and the wanted parts gives
regexString = "\d{4}(\d{5})"
which captures the wanted five digits, so then the replacement is
replacement1 ="$1"
Or, in some other regular expression implementations, it would be replacement1 ="\1". Additionally, the replace method in some regular expression systems may take extra options (parameters) for replacing the first, the n-th, or all occurrences.
Suppose there are more than nine digits and only the final five are wanted. In this case the regular expression can be written as one of the following (as different regular expression languages support different features). The replacement expression is the same as above.
regexString = "\d{4,}(\d{5})"
regexString = "\d\d\d\d+(\d{5})"
regexString = "\d\d\d\d\d*(\d{5})"
Because regular expressions are normally "greedy" the \d{5} should always match the final 5 digits but it may be worth finishing the regular expression with ...(\d{5})([^\d]|$) and changing the replace to be $1$2. That way it looks for a trailing non-digit or end-of-string.
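The greedy behaviour described above can be checked with a small Python sketch (keep_final_five is a hypothetical name; the VB.NET Regex class should behave the same way for this pattern, but that is not tested here).

```python
import re

# \d{4,} is greedy: it takes every digit it can, then backtracks just far
# enough to leave five digits for the capture group.
LONG_RUN = re.compile(r"\d{4,}(\d{5})")


def keep_final_five(text: str) -> str:
    """Replace each run of nine or more digits with its final five digits."""
    return LONG_RUN.sub(r"\1", text)


print(keep_final_five("ref 1234567890123 end"))
print(keep_final_five("ref 12345678 end"))  # eight digits: left unchanged
```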

Comma Separated Numbers Regex

I am trying to validate a comma separated list for numbers 1-8.
i.e. 2,4,6,8,1 is valid input.
I tried [0-8,]* but it seems to accept 1234 as valid. It is not requiring a comma and it is letting me type in a number larger than 8. I am not sure why.
[0-8,]* will match zero or more consecutive instances of 0 through 8 or ,, anywhere in your string. You want something more like this:
^[1-8](,[1-8])*$
^ matches the start of the string, and $ matches the end, ensuring that you're examining the entire string. It will match a single digit, plus zero or more instances of a comma followed by a digit after it.
/^\d+(,\d+)*$/
for at least one digit, otherwise you will accept 1,,,,,4
[0-9]+(,[0-9]+)+
This works better for me for comma separated numbers in general, like: 1,234,933
You can try with this Regex:
^[1-8](,[1-8])+$
If you are using Python and want to find all comma-grouped numbers in a string, such as
XX,XX,XXX or X,XX,XXX
(for example 12,000 or 1,20,000), you can use re.findall:
import re
string = "I spent 1,20,000 on new project "
re.findall(r'(\b[1-8]*(,[0-9]*[0-9])+\b)', string, re.IGNORECASE)
The result will be [('1,20,000', ',000')]. Each tuple holds the full match followed by the last repetition of the inner group.
You need a number + comma combination that can repeat:
^[1-8](,[1-8])*$
If you don't want remembering parentheses add ?: to the parens, like so:
^[1-8](?:,[1-8])*$
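A minimal sketch of the validation in Python, assuming the whole input must be the list; is_valid_list is a made-up helper name. re.fullmatch plays the role of the ^ and $ anchors.

```python
import re

# One digit 1-8, then zero or more ",digit" pairs (non-capturing group).
VALID_LIST = re.compile(r"[1-8](?:,[1-8])*")


def is_valid_list(value: str) -> bool:
    """True if value is a comma-separated list of single digits 1-8."""
    return VALID_LIST.fullmatch(value) is not None


for candidate in ["2,4,6,8,1", "1234", "9", "1,,4", ""]:
    print(repr(candidate), is_valid_list(candidate))
```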

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't pull the two bits (name and extension) out of the string as one match, because a regex match is always one contiguous chunk.
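Both options can be sketched in Python for comparison. This uses Python's re module as a stand-in for PowerShell's -replace and -match operators, so treat it as an illustration of the patterns rather than of PowerShell itself.

```python
import re

name = "my_file_name_01012013_111546.xls"

# Option 1: delete the _date_time stamp and keep everything else.
stripped = re.sub(r"_[0-9]{8}_[0-9]{6}", "", name)

# Option 2: capture the name and the extension, then join the two groups.
match = re.fullmatch(r"(.*)_[0-9]{8}_[0-9]{6}(\..{3})", name)
rebuilt = match.group(1) + match.group(2) if match else name

print(stripped)
print(rebuilt)
```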
Try this (not tested); it should work for any 'my_file_name' length, any number of digits, and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex returns the longest match that is followed by text matching _.{8}_.{6}, i.e. 16 characters with underscores in the first and tenth positions.
Once you append .{3}, it returns that same match plus the next 3 characters, which is why you get this result.
The approach described by Inductiveload is probably better if you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages:
get the initial part
get the extension
The reason you get my_file_name_01 when you add that is that lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.
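The zero-width behaviour is easy to reproduce. The sketch below uses Python's re module instead of PowerShell, on the assumption that both engines treat this lookahead the same way.

```python
import re

name = "my_file_name_01012013_111546.xls"

# The lookahead only checks what follows; it consumes nothing.
without_suffix = re.match(r".*(?=_.{8}_.{6})", name).group(0)

# .{3} therefore starts right after my_file_name and consumes "_01".
with_suffix = re.match(r".*(?=_.{8}_.{6}).{3}", name).group(0)

print(without_suffix)
print(with_suffix)
```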