Delete numbers not dates in R (regex) - regex

I want to remove numbers (integers and floats) from a character vector, preserving dates:
"I'd like to delete numbers like 84 and 0.5 but not dates like 2015"
I would like to get:
"I'd like to delete numbers like and but not dates like 2015"
In English a quick and dirty rule could be: if the number starts with 18, 19, or 20 and has length 4, don't delete.
I asked the same question in Python and the answer was very satisfying (\b(?!(?:18|19|20)\d{2}\b(?!\.\d))\d*\.?).
However, when I pass the same regex to gsub in R:
gsub("[\b(?!(?:18|19|20)\d{2}\b(?!\.\d))\d*\.?]"," ", "I'd like to delete numbers like 84 and 0.5 but not dates like 2015")
I get:
Error: '\d' is an unrecognized escape in character string starting ""\b(?!(?:18|19|20)\d"

As I mentioned in my comments, the main points here are:
The regex pattern must be placed outside the character class (the square brackets), so that it is treated as a sequence of subpatterns and not as a set of separate symbols.
The backslashes must be doubled in R regex patterns, because R string literals use \ to escape entities like \n and \r.
You also need perl=TRUE for patterns featuring lookarounds (yours uses lookaheads).
Use
gsub("\\b(?!(?:18|19|20)\\d{2}\\b(?!\\.\\d))\\d*\\.?\\d+\\b"," ", "I'd like to delete numbers like 84 and 0.5 but not dates like 2015", perl=T)
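For comparison, the same pattern can be sketched in Python, where the regex in the question came from. A raw string avoids the doubled backslashes that R needs, and the re module supports lookaheads without an extra flag. The name strip_numbers is an illustrative assumption, not part of the original answer.

```python
import re

# Same pattern as the R call above; the raw string keeps single backslashes.
NUMBER_NOT_YEAR = re.compile(r"\b(?!(?:18|19|20)\d{2}\b(?!\.\d))\d*\.?\d+\b")


def strip_numbers(text: str) -> str:
    """Replace integers and floats with a space, keeping 18xx/19xx/20xx years."""
    return NUMBER_NOT_YEAR.sub(" ", text)


if __name__ == "__main__":
    sample = "I'd like to delete numbers like 84 and 0.5 but not dates like 2015"
    print(strip_numbers(sample))
```

Each removed number leaves a single space behind, so the result contains runs of spaces that may need collapsing afterwards.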

To search and replace in R you can use:
gsub("\\b(?!(?:18|19|20)\\p{Nd}{2}\\b(?!\\.\\p{Nd}))\\p{Nd}*\\.?", "replacement_text_here", subject, perl=TRUE);

Related

Is it possible to negate a group in a regular expression?

Let's say that we have this text:
2020-09-29
2020-09-30
2020-10-01
2020-10-02
2020-10-12
2020-10-16
2020-11-12
2020-11-23
2020-11-15
2020-12-01
2020-12-11
2020-12-30
I want to do something like this:
\d\d\d\d-(NOT10)-(30)
So I want to get all dates of any year, except those in the 10th month, and the day must be 30.
I tried hard to do this using negative lookahead assertions, but I did not come up with a working regex.
You can use negative lookaheads:
\d\d\d\d-(?!10)\d\d-30
The part (?!10) ensures that no 10 follows at the point where it is inserted into the regex. Notice that you still need to match the following digits afterwards, hence the \d\d part.
Generally speaking, you cannot (to my knowledge) negate a part that also consumes part of the string. But with negative lookaheads you can simulate this, as I did above. The generalized idea looks something like:
(?!<special-exclusion-pattern>)<general-inclusion-pattern>
Where the special-exclusion-pattern matches a subset of the general-inclusion-pattern. In the above case the general inclusion pattern is \d\d and the special exclusion pattern is 10.
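A minimal Python sketch of the exclusion-then-inclusion idea, assuming one date per line. The sample uses dates from the question plus 2020-10-30, which is added here only as a counterexample the lookahead should reject.

```python
import re

# Four digits, a month that is not 10, then day 30; one date per line.
DAY_30_NOT_OCTOBER = re.compile(r"^\d{4}-(?!10)\d{2}-30$", re.MULTILINE)

# Dates from the question, plus 2020-10-30 (an assumed counterexample).
dates = """2020-09-30
2020-10-01
2020-10-30
2020-11-12
2020-12-30"""

matches = DAY_30_NOT_OCTOBER.findall(dates)
print(matches)
```

The lookahead consumes nothing, so \d{2} still has to match the month that follows it.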
Try:
/20\d{2}-(?:0[1-9]|1[12])-30/
Explanation:
20\d{2} matches 20XX
(?:0[1-9]|1[12]) matches 01 to 09, 11, or 12 (every month except 10)
30 matches 30
Demo: https://regex101.com/r/O2F1eV/1
It's easiest to simply convert the substring (if present) that matches /^\d{4}-10-30$/ to an empty string, then split the resulting string on one or more newlines.
If your string were
2020-10-16
2020-10-30
2020-11-12
2020-11-23
and was held by the variable str, then in Ruby, for example,
str.sub(/^\d{4}-10-30$/,'')
#=> "2020-10-16\n\n2020-11-12\n2020-11-23\n"
so
str.sub(/^\d{4}-10-30$/,'').split
#=> ["2020-10-16", "2020-11-12", "2020-11-23"]
Whatever language you are using undoubtedly has similar methods.
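As one instance of "similar methods", here is a rough Python equivalent of the Ruby snippet, assuming the same four-line string. count=1 mirrors Ruby's sub, which replaces only the first match.

```python
import re

text = "2020-10-16\n2020-10-30\n2020-11-12\n2020-11-23\n"

# Blank out the unwanted line, then split on whitespace, which also
# discards the empty line left behind.
remaining = re.sub(r"^\d{4}-10-30$", "", text, count=1, flags=re.MULTILINE).split()
print(remaining)
```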

Remove numbers that contains 8-10 digits at different places in a line

I have a list of different items. Some of them have 8-10 digits in front of the name, some others have these 8-10 digits behind the name and some others again don't have these numbers in the name.
I have two expressions that I use to remove these digits, but I cannot manage to combine them with | (or). Each works on its own, but if I apply the first expression and then the second, I don't get the result I want.
I use these two expressions for now:
(?<=[\d]{8,10}) (.*)
.*?(?=[\d]{8,10})
But if I use them both (one after the other), some of the lines become totally empty.
How can I combine these two to do what I want, or, if it's better, write a new expression that does what I want? :)
List is like this:
12345678 Book
12345678 Book
Book 12345678
Book 12345678
Cabinet 120x30x145
Want this result:
Book
Book
Book
Book
Cabinet 120x30x145
Why not just use the following?
Check whether there are 8 to 10 digits at the beginning or at the end of the string, and remove them.
(^\d{8,10}\s*|\s*\d{8,10}$)
It gives the wanted behaviour.
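A short Python sketch of this approach, applied to each item separately so that ^ and $ refer to the ends of a single item. The helper name strip_item_number and the list layout are assumptions for illustration.

```python
import re

# 8-10 digits at the start (plus trailing whitespace) or at the end
# (plus leading whitespace) of a single item.
ID_AT_EDGE = re.compile(r"^\d{8,10}\s*|\s*\d{8,10}$")


def strip_item_number(item: str) -> str:
    """Remove an 8-10 digit number from the start or end of one item."""
    return ID_AT_EDGE.sub("", item)


items = ["12345678 Book", "Book 12345678", "Cabinet 120x30x145"]
cleaned = [strip_item_number(item) for item in items]
print(cleaned)
```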
Instead of only matching everything but a number containing
8-10 digits + adjacent spaces, use a regex to substitute
such a number (also + adjacent spaces) with an empty string.
To match, use the following regex:
 *\d{8,10} *
That is (note that the regex starts with a space):
" *" - a space followed by an asterisk: a sequence of spaces (may be empty),
\d{8,10} - a sequence of 8 to 10 digits,
" *" - another sequence of spaces (may be empty).
The replacement string is (as I said) empty. Of course, you should use
g (global) option.
Note that you can not use \s instead of the space, as \s matches also
CR and LF and we don't want this.
For a working example see https://regex101.com/r/1hsGzT/1
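The difference between a literal space and \s can be checked with a small Python sketch. The input is the list from the question joined with newlines; re.sub replaces all matches, which corresponds to the g option.

```python
import re

text = "12345678 Book\nBook 12345678\nCabinet 120x30x145"

# Spaces only: line breaks survive the substitution.
spaces_only = re.sub(r" *\d{8,10} *", "", text)

# \s also matches the newline, so neighbouring lines get glued together.
any_whitespace = re.sub(r"\s*\d{8,10}\s*", "", text)

print(repr(spaces_only))
print(repr(any_whitespace))
```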
You need to use the \b word-boundary metasequence:
/\b[0-9\s]{8,10}\b/g;
var str = `12345678 Book
12345678 Book
Book 12345678
Book 12345678
Cabinet 120x30x145`;
var rgx = /\b[0-9\s]{8,10}\b/g;
var res = str.replace(rgx, `\n`)
console.log(res);

Regex - how to make sure a string contain a word and numbers

I need a little help with Regex.
I want the regex to validate the following sentences:
fdsufgdsugfugh PCL 6
dfdagf PCL 11
fdsfds PCL6
fsfs PCL13
kl;klkPCL6
fdsgfdsPCL13
Some characters, then PCL, and then 6 or a greater number.
How can this be done?
I'd go with something like this:
^(.*)(PCL *)([6-9][0-9]*|[1-5][0-9]+)$
Meaning:
(.*) = some chars
(PCL *) = then PCL with optional spaces afterwards
([6-9][0-9]*|[1-5][0-9]+) = then 6 or a greater number
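To check the pattern against the sentences from the question, a small Python sketch follows. The function name is_valid and the rejected examples are assumptions added for illustration.

```python
import re

# Some chars, PCL with optional spaces, then a number >= 6:
# either 6-9 followed by any digits, or 1-5 followed by at least one digit.
PCL_6_OR_MORE = re.compile(r"^(.*)(PCL *)([6-9][0-9]*|[1-5][0-9]+)$")


def is_valid(line: str) -> bool:
    """Return True if the line ends with PCL and a number of 6 or more."""
    return PCL_6_OR_MORE.match(line) is not None


valid = [
    "fdsufgdsugfugh PCL 6",
    "dfdagf PCL 11",
    "fdsfds PCL6",
    "fsfs PCL13",
    "kl;klkPCL6",
    "fdsgfdsPCL13",
]
invalid = ["abc PCL 5", "abc PCL", "abc PCL 3"]  # assumed counterexamples

print([is_valid(line) for line in valid])
print([is_valid(line) for line in invalid])
```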
This one should suit your needs:
^.*PCL\s*(?:[6-9]|\d{2,})$
In bash:
EXPR='^[a-zA-Z]\+ *PCL *\([6-9]\|[0-9]\{2,\}\)'
Translated:
The line begins with at least 1 occurrence of a letter (either case)
Any number of spaces, PCL, any number of spaces
Either a number between 6 and 9, or a number with at least 2 digits
This expression, used with something like grep "$EXPR" file.txt, will print the valid lines to stdout.
This worked well for me. It also reads logically, following the way you described the matching:
/[^PCL]+PCL\s*[6-9]\d*/

Regular expression= tabspace+STRING+tabspace

How can I write this as a regular expression?
tabspaceSTRINGtabspace
My data looks like this:
12345 adsadasdasdasd 30
34562 adsadasdasdasd asdadaads<adasdad 30
12313 adsadasdasdasd asdadas dsaads 313123<font="TNR">adsada 30
1232131 adsadasdasdasd asdadaads<adasdad"asdja <div>asdjaıda 30
I want to get
12345 30
34562 30
12313 30
1232131 30
\t*\t doesn't work.
Try the following regular expression:
\t.+\t
The problem there is your definition of String...
If you use something like the pattern suggested above, it'll also match
tabspaceSTRINGtabspacetabspace
You get the picture. This might be acceptable, if not, you need to limit your "STRING" definition, like:
\t\w+\t
or:
\t[a-zA-Z]+\t
What characters are allowed in your string?
\t\w+\t
\w would allow letters, digits and the underscore (depending on your regex engine ASCII or Unicode)
See it here on Regexr, a good platform to test regular expressions.
Your "regex" \t*\t would match 0 or more tabs and then one tab. The * is a quantifier meaning 0 or more and is referring to the character or group before (here to your \t)
If your whitespace characters are not tabs, try this:
\s+.+\s+30
\s is a whitespace character (space, tab, newline (not important for Notepad++)).
If you are not sure about the strings you are looking for, except that they are separated by tabs, a good approach is to describe such a string as everything but a tab: [^\t]*
[^\t]*\t([^\t]*)\t[^\t]*
You can test it on regexpad.com.
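The difference between the greedy \t.+\t and a tab-free middle can be seen in a short Python sketch. The rows are hypothetical: the exact tab positions in the original data are not known, so they are assumed here.

```python
import re

# Hypothetical rows; the tab positions in the original data are assumed.
rows = [
    "12345\tadsadasdasdasd\t30",
    "34562\tadsadasdasdasd\tasdadaads<adasdad\t30",
]

# Greedy: matches from the first tab to the last tab on the line.
greedy = [re.sub(r"\t.+\t", " ", row) for row in rows]

# Tab-free middle: only the first tab-delimited field is removed.
single_field = [re.sub(r"\t[^\t]*\t", " ", row) for row in rows]

print(greedy)
print(single_field)
```

With several tab-separated fields in the middle, only the greedy form produces the "id, then 30" output asked for in the question.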

How to find numbers and exclude any in parentheses using regex

I'm trying to write a regex pattern that will find numbers with two leading zeros in a string and replace those zeros with a single 0. The problem is that I want to ignore numbers in parentheses and I can't figure out how to do this.
For example, with the string:
Somewhere 001 (2009)
I want to return:
Somewhere 01 (2009)
I can search by using [00] to find the first 00, and replace with 0 but the problem is that (2009) becomes (209) which I don't want. I thought of just doing a replace on (209) with (2009) but the strings I'm trying to fix could have a valid (209) in it already.
Any help would be appreciated!
Search for one non-digit (or the start of the line) followed by two zeros followed by one or more digits.
([^0-9]|^)00[0-9]+
What if the number has three leading zeros? How many zeros do you want it to have after the replacement? If you want to catch all leading zeros and replace them with just one:
([^0-9]|^)00+[0-9]+
Ideally, you'd use negative look behind, but your regex engine may not support it. Here is what I would do in JavaScript:
string.replace(/(^|[^(\d])00+/g,"$10");
That will replace any string of zeros that is not preceded by an opening parenthesis or another digit. Change the character class to [^(\d.] if you're also working with decimal numbers.
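Where negative lookbehind is available, the same rule can be written without the capture group. The Python sketch below shows both forms; the function names are assumptions for illustration. Note that Python needs \g<1>0 in the replacement, since \10 would be read as group 10.

```python
import re

# Lookbehind version: zeros not preceded by "(" or another digit.
LEADING_ZEROS = re.compile(r"(?<![(\d])00+")


def collapse_leading_zeros(text: str) -> str:
    """Collapse a run of two or more leading zeros into a single zero."""
    return LEADING_ZEROS.sub("0", text)


def collapse_with_group(text: str) -> str:
    """Capture-group version, mirroring the JavaScript replace above."""
    return re.sub(r"(^|[^(\d])00+", r"\g<1>0", text)


print(collapse_leading_zeros("Somewhere 001 (2009)"))
print(collapse_with_group("Somewhere 001 (2009)"))
```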
?Regex.Replace("Somewhere 001 (2009)", " 00([0-9]+) ", " 0$1 ")
"Somewhere 01 (2009)"