substitution with eval and repeat the character by grouping string length?

My input is as follows:
my $s = '<B>Estimated:</B>
The N-terminal of the sequence considered is M (Met).
The estimated half-life is: 30 hours (mammalian reticulocytes, in vitro).
>20 hours (yeast, in vivo).
>10 hours (Escherichia coli, in vivo).
<B>Instability index:</B>
The instability index (II) is computed to be 31.98
This classifies the protein as stable.';
I want to remove the <B></B> tags from the string and underline the text that was inside them.
My expected output is:
Estimated:
----------
The N-terminal of the sequence considered is M (Met).
The estimated half-life is: 30 hours (mammalian reticulocytes, in vitro).
>20 hours (yeast, in vivo).
>10 hours (Escherichia coli, in vivo).
Instability index:
------------------
The instability index (II) is computed to be 31.98
This classifies the protein as stable.
For this I tried the following regex, but I don't know what the problem is.
$s=~s/<B>(.+?)<\/B>/"$1\n";"-" x length($1)/seg; # $1\n is not working
In the above regex, how do I emit the "$1\n" part? And how can I use several statements in the replacement, separated by ; or anything else?
How can I fix it?

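With the e modifier, the replacement is evaluated as Perl code and only the value of the last statement is used, so
$s=~s/<B>(.+?)<\/B>/"$1\\n";"-" x length($1)/seg;
throws away the "$1\\n" (which should really be "$1\n").
Concatenating the two pieces into a single expression works: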
$s=~s/<B>(.+?)<\/B>/"$1\n" . "-" x length($1)/seg;
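The reason I asked about your Perl version was to see whether you could use \K, which effectively acts as a variable-length lookbehind:
$s=~s/<B>(.+?)<\/B>\K/ "\n" . "-" x length($1)/seg;
\K is available in Perl 5.10 and later. Note that \K keeps everything matched before it, so this form only appends the underline; the <B></B> tags stay in the string and would need a separate substitution to remove.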
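For comparison, the same idea can be sketched in Python, where a replacement function plays the role of Perl's e modifier. The function name underline_bold is made up for this illustration and is not part of the original question.

```python
import re

# Matches <B>...</B>; DOTALL mirrors Perl's /s so the text may span lines.
BOLD_RE = re.compile(r"<B>(.+?)</B>", re.DOTALL)

def underline_bold(text: str) -> str:
    """Drop <B></B> tags and underline the enclosed text with dashes."""
    def repl(match: re.Match) -> str:
        title = match.group(1)
        # One expression builds both lines, like "$1\n" . "-" x length($1).
        return title + "\n" + "-" * len(title)
    return BOLD_RE.sub(repl, text)

if __name__ == "__main__":
    print(underline_bold("<B>Instability index:</B>\nThe instability index (II) is 31.98"))
```

The replacement function returns a single string, which sidesteps the "only the last statement counts" problem from the Perl version.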

Related

Removing unmatched text and building a table with the remaining matches

I have 30000 lines that look like the one below.
342800005013000 CON N GORE PT LOT 31 RP 11R2284 PT PART 1 RP 11R4541 PT PART 2
I would like to capture the 15 digit number at the beginning and any "11R***" numbers.
In Notepad++ I've used \d{15}|(11R\d*)* to match everything that I want. Ultimately I would like to get all the matched results into Excel. What would be the best way to do so?
Thanks for your help.

You could try this one:
(^[0-9]*)|(11R[0-9A-Za-z]*)
Edit: check it now; the code formatting displays the regex correctly.
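To get the matches into Excel, one option is to run the patterns over each line in a script and write a CSV file, which Excel opens directly. The following is a minimal Python sketch, assuming one record per line; the names extract_ids and lines_to_csv and the file paths passed to them are hypothetical.

```python
import csv
import re
from typing import List

ID_RE = re.compile(r"^[0-9]{15}")          # the 15-digit number at the start
PLAN_RE = re.compile(r"11R[0-9A-Za-z]*")   # every 11R... code on the line

def extract_ids(line: str) -> List[str]:
    """Return the leading 15-digit number (if present) followed by all 11R codes."""
    row: List[str] = []
    lead = ID_RE.match(line)
    if lead:
        row.append(lead.group(0))
    row.extend(PLAN_RE.findall(line))
    return row

def lines_to_csv(in_path: str, out_path: str) -> None:
    """Write one CSV row per input line; unmatched text is simply dropped."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for line in src:
            writer.writerow(extract_ids(line))
```

Each row has a variable number of columns, one per 11R code found, which Excel handles without complaint.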

Replace characters after character grouping with nothing

I have a large CSV with a text column that has a maximum width of 200. In nearly all cases the data is fine. In some cases the data is too long or has not been filled in properly. I would like to use a regex to find the last instance of a specific numeric/character pairing and then remove everything after it.
eg data:
df <- data.frame(ID = c("1","2","3"),
text = c("A|explain what a is|12.2|Y|explain Y|2.36|",
"A|explain what a is|15.2|E|explain E|10.2|E|explain E but run out hal",
"D|explain what d is|0.48|Z|explain z but number 5 is present|"))
My specific character pair is any number followed by a |
This would mean Row 1 is fine, row 2 would have everything after '10.2' removed and row 3 would have everything after 0.48 removed
I tried this regex:
df[,2] <- sub("([^0-9]+[^|]*$)", "", df[,2])
It very nearly worked, but the few rows in my data that have a number in the explanation do not play along. Any clues? I'm not a great regexer yet, still learning the ropes.
I saw this question about grouping, but couldn't quite apply it to my problem.

Using sub, we capture as a group any characters (.*) followed by one or more digits, an optional dot (\\.?), and one or more further digits; the group must be followed by | and then the rest of the characters until the end of the string. In the replacement, only the capture group is kept (\\1), so the final | and everything after it are dropped.
sub('^(.*[0-9]+\\.?[0-9]+)\\|.*$', '\\1', df$text)
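The same pattern can be checked outside R. Here is a minimal Python sketch of the substitution, run against the three sample rows from the question; the function name trim_after_last_number is made up for this example. Because .* is greedy, the match backtracks to the last number that is directly followed by |.

```python
import re

# Same pattern as the R call; a raw string needs single backslashes only.
TRAILING_RE = re.compile(r"^(.*[0-9]+\.?[0-9]+)\|.*$")

def trim_after_last_number(text: str) -> str:
    """Keep everything up to the last number that is directly followed by '|'."""
    return TRAILING_RE.sub(r"\1", text)

rows = [
    "A|explain what a is|12.2|Y|explain Y|2.36|",
    "A|explain what a is|15.2|E|explain E|10.2|E|explain E but run out hal",
    "D|explain what d is|0.48|Z|explain z but number 5 is present|",
]

if __name__ == "__main__":
    for row in rows:
        print(trim_after_last_number(row))
```

Note that the pattern as written needs at least two digits in the number (one on each side of the optional dot), so a single-digit value such as 5| would not be treated as a cut point.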

Regex that will get dimensions from string

I'm having a heck of a time finding a regex that will pull dimensions from a string.
Here is what I have so far; it's not really doing what I want:
((\d+\s*(\d+\d+|\d+)*)\s*[xX]\s*(\d+\s*(\d+\d+|\d+)*)\s*[xX]\s*(\d+\s*(\d+\d+|\d+)*)|(\d+\s*(\d+\d+|\d+)*)\s*[xX]\s*(\d+\s*(\d+\d+|\d+)*)| (\d+\s*(\d+\d+|\d+)*))
Here are some examples of what it pulls (bolded):
-16" x 1476' 80 GA. EQ Ultra-Premium Hand Wrap - Use as replacement for 18" x 1500' 80 Gauge (4/Case)
-48 x 60" Corrugated Sheet (250/Bale)
-MP Die Cut Divider 25 7/ 8 x 19 3/4 1000/bale
-Part "B" (3"x3x 1/2" charcoal foam) only - extra pieces
I'm looking for it to grab the following:
-16" x 1476' 80 GA. EQ Ultra-Premium Hand Wrap - Use as replacement for 18" x 1500' 80 Gauge (4/Case)
-48 x 60" Corrugated Sheet (250/Bale)
-MP Die Cut Divider 25 7/8 x 19 3/4 1000/bale
-Part "B" (3"x3 x 1/2" charcoal foam) only - extra pieces
Notice the regex is not catching the lower part of the fraction because of the "/", also if an inch symbol (i.e. ") is between that dimension and another number it won't grab the first number, you can see that in the first example.
Once I have this regex working, I can strip out the inch and foot symbols (i.e. " and '), and break each number down into each dimension. Just trying to pull the initial dimension numbers first.
Thanks so much if you have any input.

(\d+\s*\d+\s*\/\d+|\d+\s*\/\d+|\d+)["']?(\s*[xX]\s*(\d+\s*\d+\s*\/\d+|\d+\s*\/\d+|\d+)["']?)+
Demo
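To try the pattern above against the sample strings, a small Python harness along these lines may help. It is only a sketch: find_dimensions is a hypothetical helper name, and it returns the whole match (group 0) because the pattern's capture groups only hold the individual numbers.

```python
import re
from typing import List

# The pattern from the answer above, unchanged.
DIM_RE = re.compile(
    r"""(\d+\s*\d+\s*\/\d+|\d+\s*\/\d+|\d+)["']?"""
    r"""(\s*[xX]\s*(\d+\s*\d+\s*\/\d+|\d+\s*\/\d+|\d+)["']?)+"""
)

def find_dimensions(text: str) -> List[str]:
    """Return every full 'A x B [x C]' dimension string found in text."""
    return [m.group(0) for m in DIM_RE.finditer(text)]

if __name__ == "__main__":
    print(find_dimensions('48 x 60" Corrugated Sheet (250/Bale)'))
    print(find_dimensions("MP Die Cut Divider 25 7/8 x 19 3/4 1000/bale"))
```

Numbers that are not joined to another number by x, such as 250/Bale or 1000/bale, are skipped because the trailing group is required at least once.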

I don't understand the x-in-the-middle-of-the-number thing, and I didn't attempt to get that part, but otherwise, this works:
([0-9]+["']?(?: [0-9]+/[0-9]+)? x [0-9]+["']?(?: [0-9]+/[0-9]+)?)
Debuggex Demo
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.

Simplify regular expression for time literals (like "10h50m")

I am writing lexer rules for a custom description language using pyLR1, which should include time literals such as:
10h30m # meaning 10 hours + 30 minutes
5m30s # meaning 5 minutes + 30 seconds
10h20m15s # meaning 10 hours + 20 minutes + 15 seconds
15.6s # meaning 15.6 seconds
The order of specification for hour, minute and second parts shall be fixed to h, m, s. To specify this in detail, I want the following valid combinations hms, hm, h, ms, m and s (with numbers between the different segments of course).
As a bonus the regex should check for decimal (i.e. non-natural) numbers in the segments and only allow these in the segment with least significance.
So I have for all but the last group a number match like:
([0-9]+)
And for the last group even:
([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?) # to allow for .5 and 0.5 and 5.0 and 5
Going through all the combinations of h, m and s a cute little python script gives me the following regex:
(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)h|([0-9]+)h([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)h([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s)
Obviously, this is a bit of a horror of an expression. Is there any way to simplify it? The answer must work with Python's re module, and I will also accept answers which do not work with pyLR1 if that is due to its restricted subset of regular expressions.

You can factorise your regular expression. Using the notation h, m, s to denote each of the subregexes, the most basic version is:
h|hm|hms|ms|m|s
which is what you have currently. You can break this into:
(h|hm|hms)|(ms|m)|s
and then pulling out h from the first expression and m from the second we get (using (x|) == x?):
h(m|ms)?|ms?|s
Continuing on we get to
h(ms?)?|ms?|s
which is probably simpler (and probably the simplest).
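The factored form h(ms?)?|ms?|s can be assembled from the sub-patterns mechanically. The following Python sketch does that for integer-only segments (decimals are left out on purpose); the names INT, H, M, S and is_hms are chosen here for illustration.

```python
import re

INT = r"[0-9]+"
H, M, S = INT + "h", INT + "m", INT + "s"

# h(ms?)?|ms?|s with each letter replaced by its sub-pattern.
FACTORED_RE = re.compile(rf"{H}(?:{M}(?:{S})?)?|{M}(?:{S})?|{S}")

def is_hms(text: str) -> bool:
    """True if text is one of the combinations hms, hm, h, ms, m or s."""
    return FACTORED_RE.fullmatch(text) is not None

if __name__ == "__main__":
    for sample in ["10h20m15s", "10h30m", "5m30s", "10h30s", ""]:
        print(repr(sample), is_hms(sample))
```

Using fullmatch keeps the check strict, so the hs combination and the empty string are both rejected.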
Adding in the regex d to denote decimals (as in \.[0-9]+), this could be written as
h(d|m(d|sd?)?)?|m(d|sd?)?|sd?
(i.e. at each stage optionally have either decimals, or a continuation to the next of h m or s.)
This would result in something like (for just hours and minutes):
[0-9]+((\.[0-9]+)?h|h[0-9]+(\.[0-9]+)?m)|[0-9]+(\.[0-9]+)?m
Looking at this, it might not be possible to get it into a form amenable to pyLR1, so parsing with decimals allowed in every position and then doing a secondary check might be the best way to do this.
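A two-pass approach of that kind could look roughly like the Python sketch below: a loose regex accepts a decimal in any segment, and plain code then rejects the combinations the grammar forbids. The function name parse_time_literal and the choice to return total seconds are assumptions made for this example.

```python
import re
from typing import Optional

NUM = r"(?:[0-9]*\.[0-9]+|[0-9]+(?:\.[0-9]*)?)"
# Pass 1: every segment optional, decimals allowed everywhere.
LOOSE_RE = re.compile(rf"(?:({NUM})h)?(?:({NUM})m)?(?:({NUM})s)?")

def parse_time_literal(text: str) -> Optional[float]:
    """Return the literal's value in seconds, or None if it is invalid."""
    match = LOOSE_RE.fullmatch(text)
    if match is None:
        return None
    hours, minutes, seconds = match.groups()
    present = [v for v in (hours, minutes, seconds) if v is not None]
    # Pass 2: the checks the loose regex cannot express.
    if not present:
        return None                      # empty string
    if hours is not None and seconds is not None and minutes is None:
        return None                      # "hs" without minutes
    if any("." in v for v in present[:-1]):
        return None                      # decimal outside the last segment
    return float(hours or 0) * 3600 + float(minutes or 0) * 60 + float(seconds or 0)
```

The regex stays short enough for a restricted lexer, and the rules that are awkward to express as a pattern live in ordinary code.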

the representation below should be understandable; I don't know the exact regex syntax you're using, so you have to "translate" it to valid syntax yourself.
your hours
[0-9]{1,2}h
your minutes
[0-9]{1,2}m
your seconds
[0-9]{1,2}(\.[0-9]{1,3})?s
you want all those in order, and able to omit any of those (wrap with ?)
([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?
this however also matches things like 10h30s
that is, the valid combinations become hms, hm, hs, h, ms, m and s
or in other words, minutes can be omitted while still having hours and seconds.
the other problem is that the empty string also matches, since all three groups are optional. so you have to work around this somehow.
looking at @dbaupp's h(ms?)?|ms?|s you can take the above and match:
h: [0-9]{1,2}h
m: [0-9]{1,2}m
s: [0-9]{1,2}(\.[0-9]{1,3})?s
so you get to:
h(ms?)?: [0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?
ms? : [0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?
s : [0-9]{1,2}(\.[0-9]{1,3})?s
all those OR'd together give you a big but easy to break down regex:
[0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?|[0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?|[0-9]{1,2}(\.[0-9]{1,3})?s
which gets you away from both the empty string problem and the match of hs.
looking at @Donal Fellows' comment on @dbaupp's answer, I'll also do (h?m)?s|h?m|h
(h?m)?s: (([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s
h?m : ([0-9]{1,2}h)?[0-9]{1,2}m
h : [0-9]{1,2}h
and merged together, you end up with something smaller than the above:
(([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s|([0-9]{1,2}h)?[0-9]{1,2}m|[0-9]{1,2}h
now we have to find a way to match the .xx decimal representation

Here is a short Python expression that works:
(\d+h)?(\d+m)?(\d*\.\d+|\d+(\.\d*)?)(?(2)s|(?(1)m|[hms]))
Inspired by Cameron Martin's answer based on conditionals.
Explained:
(\d+h)?                  # optional int "h" (capture 1)
(\d+m)?                  # optional int "m" (capture 2)
(\d*\.\d+|\d+(\.\d*)?)   # int or decimal
(?(2)                    # if "m" (capture 2) was matched:
  s                      #   "s"
| (?(1)                  # else if "h" (capture 1) was matched:
    m                    #   "m"
  |                      # else (nothing matched):
    [hms]))              #   any of the "h", "m" or "s"
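To see how the conditional pattern behaves, it can be wrapped in a small checker. This is only a sketch: is_time_literal is a name invented here, and fullmatch is used so that partial matches such as the 10h in 10h30s do not count.

```python
import re

# The one-line pattern from the answer, using Python's (?(id)yes|no) syntax.
TIME_RE = re.compile(r"(\d+h)?(\d+m)?(\d*\.\d+|\d+(\.\d*)?)(?(2)s|(?(1)m|[hms]))")

def is_time_literal(text: str) -> bool:
    """True if the whole string is a valid time literal under the pattern."""
    return TIME_RE.fullmatch(text) is not None

if __name__ == "__main__":
    for sample in ["10h30m", "5m30s", "10h20m15s", "15.6s", "10h30s", "1.5h30m"]:
        print(sample, is_time_literal(sample))
```

The last number is the only place a decimal can appear, which is how the pattern enforces the "least significant segment only" rule without listing every combination.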

You may have hours, minutes, and seconds, each at most once, so make each group optional:
/(\d{1,2}h)?(\d{1,2}m)?(\d{1,2}(\.\d+)?s)?/
should do the work. Depending on the regex library, you will get your items in order, or you will have to parse them further to check for h, m or s.
In this latter case, see also what is returned by
/(\d{1,2}(h))?(\d{1,2}(m))?(\d{1,2}(\.\d+)?(s))?/

The last group should be:
([0-9]*\.[0-9]+|[0-9]+(\.[0-9]+)?)
unless you want to match 5.
You could use regex ifs, like so:
(([0-9]+h)?([0-9]+m)?([0-9]+s)?)(?(?<=h)(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m)?|(?(?<=m)(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s)?|\b(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)[hms])?))
Here - http://regexr.com?31dmj
I haven't checked that this works, but it tries to match just integers for hours, minutes, then seconds first. If the last thing matched is hours, it then allows fractional minutes; otherwise, if the last thing matched is minutes, it allows fractional seconds.

RegEx for Prices?

I am searching for a RegEx for prices.
It should have any number of digits in front, then a ",", and at most 2 digits at the end.
Can someone help me and post one, please?

In what language are you going to use it?
It should be something like:
^\d+(,\d{1,2})?$
Explanation:
The digits in front are matched by ^\d+, where ^ means the start of the string, \d means a digit and + means one or more.
We then use a group () followed by a question mark; the ? means: match what is inside the group once or not at all.
Inside the group there is ,\d{1,2}: the , is the comma you wrote, \d is again a digit, and {1,2} means match the preceding digit one or two times.
The final $ matches the end of the string.
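As a quick check of the pattern explained above, here is a minimal Python sketch; the question did not say which language is in use, so Python and the function name is_price are assumptions.

```python
import re

PRICE_RE = re.compile(r"^\d+(,\d{1,2})?$")

def is_price(text: str) -> bool:
    """True for digits optionally followed by a comma and one or two digits."""
    return PRICE_RE.match(text) is not None

if __name__ == "__main__":
    for sample in ["12,50", "12", "12,5", "12,505", ",50"]:
        print(sample, is_price(sample))
```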

I was not satisfied with the previous answers. Here is my take on it:
\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})
|^^^^^^|^^^^^^^^^^^^^|^^^^^^^^^^^|
| 1-3 | 3 digits | 2 digits |
|digits| repeat any | |
| | no. of | |
| | times | |
(get a detailed explanation here: https://regex101.com/r/cG6iO8/1)
Covers all cases below
5.00
1,000
1,000,000.99
5,99 (european price)
5.999,99 (european price)
0.11
0.00
But also weird stuff like
5.000,000.00
In case you want to include 5 and 1000 (I personally would not like to match ALL numbers), then just add a "?" like so:
\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?
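How this pattern behaves depends on whether it is anchored. The Python sketch below applies it with fullmatch, which is an assumption about usage rather than something the answer states; the helper name is_grouped_price is invented for the example. Under fullmatch an ungrouped number such as 1000 is rejected, whereas an unanchored search would accept its first three digits.

```python
import re

GROUPED_RE = re.compile(r"\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?")

def is_grouped_price(text: str) -> bool:
    """True if the whole string is 1-3 digits, 3-digit groups, optional 2 decimals."""
    return GROUPED_RE.fullmatch(text) is not None

if __name__ == "__main__":
    for sample in ["5", "5.00", "1,000", "1,000,000.99", "5,99", "5.999,99", "1000", "12.3"]:
        print(sample, is_grouped_price(sample))
```

If plain ungrouped integers must be accepted as whole strings, the leading \d{1,3} would need to be relaxed, at the cost of matching more non-price numbers.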

I am working on a similar problem. However, I only want a match if a currency symbol or code such as EUR, €, USD or $ is also included in the string. The symbol may be leading or trailing, and I don't care if there is a space between the number and the currency substring. I based the number matching on the previous discussion and used this price pattern: \d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?
Here is final result:
(USD|EUR|€|\$)\s?(\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2}))|(\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?)\s?(USD|EUR|€|\$)
I use (\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?)\s?(USD|EUR|€|\$) as a pattern to match against a currency symbol (here with tolerance for a leading space). I think you can easily tweak it for any other currencies
A Gist with the latest Version can be found at https://gist.github.com/wischweh/b6c0ac878913cca8b1ba

So I ran into a similar problem, needing to validate if an arbitrary string is a price, but needed a lot more resilience than the regexes provided in this thread and many other threads.
I needed a regex that would match all of the following:
5
5.00
1,000
1,000,000.99
5,99 (european price)
5.999,99 (european price)
0.11
0.00
And not to match stuff like IP addresses. I couldn't figure out a single regex to deal with the european and non-european stuff in one fell swoop so I wrote a little bit of Ruby code to normalise prices:
if value =~ /^([1-9][0-9]{,2}(,[0-9]{3})*|[0-9]+)(\.[0-9]{1,9})?$/
  Float(value.delete(","))
elsif value =~ /^([1-9][0-9]{,2}(\.[0-9]{3})*|[0-9]+)(,[0-9]{1,9})?$/
  Float(value.delete(".").gsub(",", "."))
else
  false
end
The only difference between the two regexes is the swapped decimal place and comma. I'll try and break down what this is doing:
/^([1-9][0-9]{,2}(,[0-9]{3})*|[0-9]+)(\.[0-9]{1,9})?$/
The first part:
([1-9][0-9]{,2}(,[0-9]{3})*
This is a statement of numbers that follow this form: 1,000 1,000,000 100 12. But it does not allow leading zeroes. It's for the properly formatted numbers that have groups of 3 numerics separated by the thousands separator.
Second part:
[0-9]+
Just match any number 1 or more times. You could make this 0 or more times if you want to match: .11 .34 .00 etc.
The last part:
(\.[0-9]{1,9})?
This is the decimal place bit. Why up to 9 numerics, you ask? I've seen it happen. This regex is supposed to be able to handle any weird and wonderful price it sees and I've seen some retailers use up to 9 decimal places in prices. Usually all 0s, but we wouldn't want to miss out on the data ^_^
Hopefully this helps the next person who comes along needing to process arbitrarily badly formatted price strings of either European or non-European format :)
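For readers not using Ruby, the normalisation above can be approximated in Python. This is a rough port rather than the author's code: normalise_price is a hypothetical name, fullmatch stands in for the ^...$ anchors, {0,2} replaces Ruby's {,2}, and None is returned where the Ruby version returns false.

```python
import re
from typing import Optional

# Comma as thousands separator, dot as decimal point.
US_RE = re.compile(r"([1-9][0-9]{0,2}(,[0-9]{3})*|[0-9]+)(\.[0-9]{1,9})?")
# Dot as thousands separator, comma as decimal point.
EU_RE = re.compile(r"([1-9][0-9]{0,2}(\.[0-9]{3})*|[0-9]+)(,[0-9]{1,9})?")

def normalise_price(value: str) -> Optional[float]:
    """Return the price as a float, or None if neither format matches."""
    if US_RE.fullmatch(value):
        return float(value.replace(",", ""))
    if EU_RE.fullmatch(value):
        return float(value.replace(".", "").replace(",", "."))
    return None

if __name__ == "__main__":
    for sample in ["1,000,000.99", "5.999,99", "5,99", "192.168.0.1"]:
        print(sample, normalise_price(sample))
```

One ambiguity carries over from the original: a string like 5.999 matches the first pattern and is read as 5.999 rather than as a European 5999, because that branch is tried first.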

^\d+,\d{1,2}$

I am currently working on a small function that uses a regex to pull the price amount out of a String:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static String getPrice(String input)
{
    String output = "";
    // Up to three digits, an optional separator, then up to two decimals.
    Pattern pattern = Pattern.compile("\\d{1,3}[,\\.]?(\\d{1,2})?");
    Matcher matcher = pattern.matcher(input);
    if (matcher.find())
    {
        output = matcher.group(0);
    }
    return output;
}
This seems to work with small prices (0,00 to 999,99) and various currencies:
$12.34 -> 12.34
$12,34 -> 12,34
$12.00 -> 12.00
$12 -> 12
12€ -> 12
12,11€ -> 12,11
12.999€ -> 12.99
12.9€ -> 12.9
£999.99€ -> 999.99
...

Pretty simple for ","-separated numbers (or no separation) with 2 decimal places; it supports delimiters but does not force them. Needs some improvement but should work.
^((\d{1,3}|\s*){1})((\,\d{3}|\d)*)(\s*|\.(\d{2}))$
matches:
1,123,456,789,134.45
1123456134.45
1234568979
12,345.45
123.45
123
no match:
1,2,3
12.4
1234,456.45
This may need some editing to make it function correctly
Quick explanation: matches 1-3 digits (or nothing), then a comma followed by 3 digits as many times as needed (or just digits), then a decimal point followed by exactly 2 digits (or nothing).

This code worked for me !! (PHP)
preg_match_all('/\d+((,\d+)+)?(.\d+)?(.\d+)?(,\d+)?/',$price[1]->plaintext,$lPrices);

So far I tried, this is the best
\d{1,3}[,\.]?(\d{1,2})?
https://regex101.com/r/xT8aQ7/1

r'(^\-?\d*\d+.?(\d{1,2})?$)'
This will allow digits with only one decimal and two digits after decimal

This one reasonably works when you may or may not have decimal part but an amount shows up like this 100,000 - or 100,000.00. Tested using Clojure only
\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2,3})

\d+((,\d+)+)?(.\d+)?(.\d+)?(,\d+)?
to cover all
5
5.00
1,000
1,000,000.99
5,99 (european price)
5.999,99 (european price)
0.11
0.00

^((\d+)((,\d+|\d+)*)(\s*|\.(\d{2}))$)
Matches:
1
11
111
1111111
11,2122
1222,21222
122.23
1223,3232.23
Does not match:
11e
x111
111,111.090
1.000

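Be careful with patterns like \d+,\d{2}: \d never matches a dot, but in some regex flavours it matches any Unicode digit rather than just 0-9, and without anchors the pattern also matches inside a longer string such as 12.34,10.
To restrict the match to ASCII digits, use [0-9]+,[0-9]{2} (or [0-9]+,[0-9]{1,2} to also allow a single decimal place).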
