Preg_match for items in a list - regex

EDIT: The answer and comment below make me think that I didn't explain this clearly. I am looking for a regular expression that matches multiple occurrences of a list item. For example, given ABCBCBCBCBCD, I want to get the array [BC, BC, BC, BC, BC] from it. I don't know how many items will be in the list. If it is ABCD, I want the list [BC]. If it is ABCBCD, I want [BC, BC]. I thought I could use /A(BC)+D/ to match all occurrences of BC, but that is not working.
The original question...
I have a set of very large data files. Per file, I only want a list of items out of it. The information I'm looking for has the format:
...<RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN></ERS>...
The ... means that there is tons of text before and after this set. I can easily get the first item listed using the regex
preg_match('~<RXCUI>[^<]*(<LN[^>]*>[^<]*</LN>[^<]*)~', $data, $matches);
Then, $matches[1] has "Amoxicillin, ". I tried to get all matches in the list using:
preg_match('~<RXCUI>[^<]*(<LN[^>]*>[^<]*</LN>[^<]*)+~', $data, $matches);
That doesn't work. I get no matches. What is the syntax for "Multiple matches for the preceding sequence between ( and )"?
Of note, this is what is in $matches:
Array (
[0] => <RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN>
[1] => <LN ID=531123>Amoxicillin</LN>
)
So, it looked at both items in the list, but only returned the first one. What I want is:
Array (
[0] => <RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN>
[1] => <LN ID=531123>Amoxicillin</LN>
[2] => <LN ID=441656>Amikacin</LN>
)

Is this what you are looking for?
preg_match_all("/(\<RXCUI\>.*\<\/LN\>)/", $input_lines, $output_array);
http://www.phpliveregex.com/p/fpc

After a lot of research, it appears that this cannot be done with a single preg_match function. It requires two passes. The first will pull the entire match from the beginning to the end of the list. The second will break the list into the matches that are desired.
The first pass (assume $s = ...<RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN></ERS>...)
preg_match('~<RXCUI>[^<]*(<LN[^>]*>[^<]*</LN>[^<]*)+</ERS>~', $s, $match1);
Now, $match1[0] = <RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN></ERS>
I can use preg_match_all to get just what I want between the RXCUI and ERS elements
preg_match_all('~<LN[^>]*>[^<]*</LN>~', $match1[0], $match2);
Now, $match2[0] will contain an array:
[0] => <LN ID=531123>Amoxicillin</LN>
[1] => <LN ID=441656>Amikacin</LN>
It doesn't matter how many LN lines there are, the second preg_match_all will return them all.
This could be simplified a great deal if you could ensure that there are no LN elements anywhere else in the original document. I know that there are LN elements that are not part of the RXCUI section, so I can't just look for those.
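For reference, here is a minimal sketch of the two passes put together, assuming the sample string from the question. The variable names ($abc, $block, $items, $names) are only illustrative. The first lines show why the single-pattern attempt falls short: a repeated capture group such as (BC)+ only keeps its last iteration.

```php
<?php
// A repeated capture group keeps only the text of its LAST iteration.
preg_match('/A(BC)+D/', 'ABCBCD', $abc);
// $abc now holds the full match and a single 'BC', not one entry per repetition.

$s = '...<RXCUI> <LN ID=531123>Amoxicillin</LN>, <LN ID=441656>Amikacin</LN></ERS>...';

$names = [];
// Pass 1: isolate the run of LN elements between <RXCUI> and </ERS>.
if (preg_match('~<RXCUI>[^<]*((?:<LN[^>]*>[^<]*</LN>[^<]*)+)</ERS>~', $s, $block)) {
    // Pass 2: pull every LN element out of that isolated block.
    preg_match_all('~<LN[^>]*>([^<]*)</LN>~', $block[1], $items);
    $names = $items[1]; // text content of each LN element
}

print_r($names);
```

Because pass 2 only ever sees the text captured in pass 1, LN elements elsewhere in the document are not picked up.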

Cypress: find part of a string with cy.contains

Let's say I have two elements with these texts: "Find a hero" and "Find the hero".
I want to use cy.contains() to find one of these, and want to write something like
cy.contains("Find" + * + "hero")
I don't understand how I can write this command to find anything that contains the two words "Find" and "hero" in a sentence, no matter the order or where they come in.
I'm only using the native cypress (no imported testing libraries, was hoping it won't be necessary).
Hope someone can help.
cy.contains() also accepts a regular expression, written between "/" delimiters:
cy.contains(/Find .* hero/)
The ".*" in the middle means any characters, and any number of characters.
Check out the example on https://regex101.com.
It's possible to use startsWith and endsWith
cy.get(selector)
.should('satisfy', ($el) => {
const text = $el.text()
return text.startsWith('Find') && text.endsWith('hero')
})
This is my helper function
const contain = ($el, first, last) => {
const text = $el.text()
return text.startsWith(first) && text.endsWith(last)
}
cy.get(selector)
.should($el => contain($el, 'Find', 'hero'))
Double contains
I don't think anyone mentioned yet, you can use :contains() inside .contains()
cy.contains(':contains(Find)', 'hero') // both strings contained
Matches the 2nd one:
<div>Find the villain</div>
<div>Find the hero</div>
You could use a regular expression to assert that the text contains both 'Find' and 'hero' (in that order) by doing the following:
cy.get('[data-cy=login-button]').invoke('text').should('match', new RegExp('.*Find.*hero', 'gi'));
Or you could even do
cy.get('YOUR_ELEMENT').should('contain', 'Find').should('contain', 'hero')

php regexp to search replace string functions to mb string functions

Solution was to look into lookaheads and lookbehinds. The concept of lookarounds in regex helped me solve my issue, since the replacements were eating characters from each other when I did a replacement.
We've been working for a while on transitioning some of our older projects (with perhaps bad or old coding habits) and making them PHP 7-ready.
In this process I have made some adjustments in the .php files of the project so that for example
The problem at hand is that I'm facing some issues with Danish characters in PHP string functions (strlen, substr etc.) and would like them to use the mb_string functions instead. From what I can read on the internet, using the "overload" function is not the way to go, so I've decided to do a file-based search and replace.
My search-and-replace function looks like this right now (updated thanks to @SeanBright):
$testfile = file_get_contents($file);
$array = array ( 'strlen'=>'mb_strlen',
'strpos'=>'mb_strpos',
'substr'=>'mb_substr',
'strtolower'=>'mb_strtolower',
'strtoupper'=>'mb_strtoupper',
'substr_count'=>'mb_substr_count',
'split'=>'mb_split',
'mail'=>'mb_send_mail',
'ereg'=>'mb_ereg',
'eregi'=>'mb_eregi',
'strrchr' => 'mb_strrchr',
'strichr' => 'mb_strichr',
'strchr' => 'mb_strchr',
'strrpos' => 'mb_strrpos',
'strripos' => 'mb_strripos',
'stripos' => 'mb_stripos',
'stristr' => 'mb_stristr'
);
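foreach($array as $function_name => $mb_function_name){
// Group 1 is the delimiter in front of the name; the "(" after it is only looked at, not consumed.
$search_string = '/(^|[\s\[{;(:!\=\><?.,\*\/\-\+])(?<!->)(?<!new )' . $function_name . '(?=\s?\()/i';
// Read from and write back to the same variable so every replacement accumulates.
$testfile = preg_replace($search_string, "$1".$mb_function_name, $testfile,-1,$count);
}
print "<pre>";
print $testfile;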
The $file has this content:
<?php
print strtoupper('test');
print strtolower'test');
print substr('tester',0,1);
print astrtoupper('test');
print bstrtolower('test');
print csubstr(('tester',0,1);
print [substr('tester',0,1)];
print {substr('tester',0,1)};
substr('test',0,1);
substr('test',0,1);
(substr('test',0,1));
!substr();
if(substr()==substr()=>substr()<substr()){
?substr('test');
}
"test".substr('test');
'asd'.substr('asd');
'asd'.substr('asd');
substr( substr('asdsadsadasd',0,-1),strlen("1"),strlen("100"));
substr (substr ('Asdsadsadasd',0,-1), strlen("1"), strlen("100"));
substr(substr(substr('Asdsadsadasd',0,-1),0,-1), strlen("1"), strlen("100"));
mailafsendelse(substr('asdsadsadasd',0,-1), strlen("1"), strlen("100"));
mail(test);
substr ( tester );
substr ( tester );
mail mail mail mail ( tester );
$mail->mail ();
$mail -> mail ();
new Mail();
new mail ();
strlen ( tester )*strlen ( tester )+strlen ( tester )/strlen ( tester )-strlen ( tester )
;
The point here is that the actual PHP code does not have to be valid syntax. I just wanted to make it work in different scenarios.
My regex problem is that I cannot find out why this line:
substr(substr(substr('Asdsadsadasd',0,-1),0,-1), strlen("1"), strlen("100"));
is not working. The 1st and 3rd substr are replaced correctly, but the 2nd is left unchanged, so the result looks like this:
mb_substr(substr(mb_substr('Asdsadsadasd',0,-1),0,-1), mb_strlen("1"), mb_strlen("100"));
As a note, my search string is made to work with all sorts of characters in front of the function name and requires that the character AFTER the function name is a "(".
In a perfect world I would also like to exclude string functions that are methods in classes, for example $order->mail(), which would send an email. I would NOT like this to be converted to $order->mb_send_mail().
From my understanding all parameters are the same, so it should not be a problem.
Complete script can be found here
https://github.com/welrachid/phpStringToMBString
The problem is that some of the characters you are using to delimit your function call checks are being consumed by matching. If you switch the last group to be a positive lookahead, this will fix the problem:
$search_string = '/([ \[{\n\t\r;(:!=><?\.,])'.($function_name).'([\ |\t]{0,1})(?=[(]{1})/i';
The change is the ?= added at the start of the last group, (?=[(]{1}).
Your current expression also won't match function calls at the beginning of the line. The following handles that and also simplifies things a bit:
$search_string = '/(^|[\s\[{;(:!=><?.,])' . $function_name . '(?=\s?\()/i';
I've set up an example on regex101.com.
You might even be able to get away with:
$search_string = '/(^|\W)' . $function_name . '(?=\s?\()/i';
Where \W will match a non-word character.
Update
To prevent matching method calls, you can add a negative lookbehind to your pattern:
$search_string = '/(^|[\s\[{;(:!=><?.,])(?<!->)' . $function_name . '(?=\s?\()/i';
The added part is (?<!->), placed right before the function name.
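To make the difference concrete, here is a small sketch comparing a consuming pattern with the lookahead version on a shortened form of the nested line from the question. The variable names ($consuming, $lookahead, $mailPattern) and the shortened input are assumptions made for illustration, not part of the original script.

```php
<?php
// Shortened, illustrative version of the failing line from the question.
$line = "substr(substr(substr('Asdsadsadasd',0,-1),0,-1),0,1);";

// Consuming variant: the "(" after the name is part of the match, so it
// cannot also act as the leading delimiter of the next nested call.
$consuming = '/(^|[\s\[{;(:!=><?.,])substr(\s?\()/i';
$a = preg_replace($consuming, '${1}mb_substr${2}', $line);

// Lookahead variant: the "(" is only checked, never consumed,
// so it is still available as the delimiter for the next match.
$lookahead = '/(^|[\s\[{;(:!=><?.,])(?<!->)substr(?=\s?\()/i';
$b = preg_replace($lookahead, '${1}mb_substr', $line);

// The negative lookbehind (?<!->) leaves method calls untouched.
$mailPattern = '/(^|[\s\[{;(:!=><?.,])(?<!->)mail(?=\s?\()/i';
$c = preg_replace($mailPattern, '${1}mb_send_mail', '$mail->mail (); mail(test);');

echo $a, "\n", $b, "\n", $c, "\n";
```

With the consuming pattern the middle substr is skipped, which is the symptom described in the question. With the lookahead all three calls are rewritten.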

Using Regular expression to search for a value but exclude that string from the results?

This is probably really simple. I'm new to regular expressions, but say I wanted to find the 2 numbers preceding some characters, e.g. "12 qty".
So I'm using \d\d.qty to bring back the match "12 qty", but I want to exclude the word qty.
I have tried using \d\d.qty*([^qty]*) but it doesn't work.
You need to use a positive lookahead (whether it is supported depends on the language, of course):
(\d\d)(?=\sqty)
You could use (\d\d)(.qty) so you get back
Array
(
[0] => Array
(
[0] => 12 qty
)
[1] => Array
(
[0] => 12
)
[2] => Array
(
[0] => qty
)
)
Now use the second item of the array (index 1) and you have what you want.
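Both suggestions can be compared side by side. The sketch below assumes PHP and a made-up input string. $m and $g are just illustrative names.

```php
<?php
$text = 'Order: 12 qty, 7 pcs'; // made-up sample input

// Lookahead: " qty" must follow the digits but is not part of the match.
preg_match('/\d\d(?=\sqty)/', $text, $m);

// Capture group: the full match still includes " qty",
// while group 1 holds only the two digits.
preg_match('/(\d\d)\sqty/', $text, $g);

print_r([$m[0], $g[0], $g[1]]);
```

The lookahead keeps the unwanted text out of the match itself. The capture group leaves it in the match and relies on you reading index 1 instead of index 0.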

Parse labeled param strings with Regex

Can anyone help me with this one?
My objective here is to grab some info from a text file, present the user with it and ask for values to replace that info so to generate a new output. So I thought of using regular expressions.
My variables would be of the format: {#<num>[|<value>]}.
Here are some examples:
{#1}
{#2|label}
{#3|label|help}
{#4|label|help|something else}
So after some research and experimenting, I came up with this expression: \{\#(\d{1,})(?:\|{1}(.+))*\}
which works pretty well on most occasions, except on something like this:
{#1} some text {#2|label} some more text {#3|label|help}
In this case variables 2 & 3 are matched on a single occurrence rather than on 2 separate matches...
I've already tried to use lookahead commands for the trailing } of the expression, but I didn't manage to get it.
I'm targeting C# with this expression, in case that helps anyone.
I like the results from this one:
\{\#(\d+)(?:|\|(.+?))\}
This returns 3 groups. The second group is the number (1, 2, 3) and the third group is the arguments ('label', 'label|help').
I prefer to remove the * in favor of | in order to capture all the arguments after the first pipe in the last grouping.
A regular expression which can be used would be something like
\{\#(\d+)(?:\|([^|}]+))*\}
This will prevent reading over any closing }.
Another possible solution (with slightly different behaviour) would be to use a non-greedy matcher (.+?) instead of the greedy version (.+).
Note: I also removed the {1} and replaced {1,} with + which are equivalent in your case.
Try this:
\{\#(\d+)(?:\|[^|}]+)*\}
In C#:
MatchCollection matches = Regex.Matches(mystring,
@"\{\#(\d+)(?:\|[^|}]+)*\}");
It prevents the label and help from eating the | or }.
match[0].Value => {#1}
match[0].Groups[0].Value => {#1}
match[0].Groups[1].Value => 1
match[1].Value => {#2|label}
match[1].Groups[0].Value => {#2|label}
match[1].Groups[1].Value => 2
match[2].Value => {#3|label|help}
match[2].Groups[0].Value => {#3|label|help}
match[2].Groups[1].Value => 3
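If the individual arguments are needed as well, one option is to capture the whole argument tail in a single group and split it afterwards. The sketch below is written in PHP rather than C# purely for illustration. The pattern itself should carry over, and $params is a hypothetical name.

```php
<?php
$text = '{#1} some text {#2|label} some more text {#3|label|help}';

// Group 1: the number. Group 2: the whole "|arg|arg..." tail (possibly empty).
// [^|}]+ stops each argument at the next pipe or closing brace.
preg_match_all('/\{#(\d+)((?:\|[^|}]+)*)\}/', $text, $matches, PREG_SET_ORDER);

$params = [];
foreach ($matches as $m) {
    $tail = $m[2] ?? '';
    // Drop the leading pipe, then split the rest into separate arguments.
    $params[(int) $m[1]] = $tail === '' ? [] : explode('|', ltrim($tail, '|'));
}

print_r($params);
```

Each variable is now a separate match, and its arguments end up in a plain array keyed by the variable number.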

Regex Help, How do I make order of expressions not matter?

I can't figure out how to make the order of the incoming string parameters (price, merchant, category) not matter to the regex. My regex matches the parts of the string but not the string as a whole. I need to be able to add \A and \Z to it.
Pattern:
(,?price:(;?(((\d+(\.\d+)?)|min)-((\d+(\.\d+)?)|max))|\d+)+){0,1}(,?merchant:\d+){0,1}(,?category:\d+){0,1}
Sample Strings:
price:1.00-max;3-12;23.34-12.19,category:3
merchant:25,price:1.00-max;3-12;23.34-12.19,category:3
price:1.00-max;3-12;23.34-12.19,category:3,merchant:25
category:3,price:1.00-max;3-12;23.34-12.19,merchant:25
Note: I'm going to add ?: to all my groups after I get it working.
You should probably just parse this string through normal parsing. Split it at the commas, then split each of those pieces into two by the colons. You can store validation regexes if you'd like to check each of those inputs individually.
If you do it through regex, you'll probably have to end up saying "this combination OR this combination OR this combination", which will hurt real bad.
You have three options:
(1) You can enumerate all the possible orders. For 3 variables there are 6 possibilities. Obviously this doesn't scale;
(2) You can accept possible duplicates; or
(3) You can break the string up and then parse it.
(2) means something like:
/(\b(price|category|merchant)=(...).*?)*/
The real problem you're facing here is that you're trying to parse what is essentially a non-regular language with a regular expression. A regular expression describes a DFSM (deterministic finite state machine), also called a DFA (deterministic finite automaton). Such a machine has no memory beyond its current state, so the expression can't "remember" what else there has been.
To get to that you have to add a "memory" usually in the form of a stack, which yields a PDA (pushdown automaton).
It's exactly the same problem people face when they try and parse HTML with regexes and get stuck on tag nesting issues and similar.
Basically you accept some edge conditions (like repeated values), split the string by comma and then parse or you're just using the wrong tool for the job.
How about don't try and do it all with one Cthulhugex?
/price:([^,]*)/
/merchant:([^,]*)/
/category:([^,]*)/
$string=<<<EOF
price:1.00-max;3-12;23.34-12.19,category:3
merchant:25,price:1.00-max;3-12;23.34-12.19,category:3
price:1.00-max;3-12;23.34-12.19,category:3,merchant:25
category:3,price:1.00-max;3-12;23.34-12.19,merchant:25
EOF;
$s = preg_replace("/\n+/",",",$string);
$s = explode(",",$s);
print_r($s);
output
$ php test.php
Array
(
[0] => price:1.00-max;3-12;23.34-12.19
[1] => category:3
[2] => merchant:25
[3] => price:1.00-max;3-12;23.34-12.19
[4] => category:3
[5] => price:1.00-max;3-12;23.34-12.19
[6] => category:3
[7] => merchant:25
[8] => category:3
[9] => price:1.00-max;3-12;23.34-12.19
[10] => merchant:25
)
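Building on the split approach, here is a rough sketch of how one line could be split and then validated per key, so that the order of the parameters no longer matters. parseFilter() is a hypothetical helper, and the per-key rules are my reading of the pattern in the question, so treat them as assumptions.

```php
<?php
// Hypothetical helper: returns key => value pairs, or null if the input is invalid.
function parseFilter(string $input): ?array
{
    // One price range: "1.00-max", "min-5", "3-12", or a bare integer.
    $range = '(?:(?:\d+(?:\.\d+)?|min)-(?:\d+(?:\.\d+)?|max)|\d+)';
    $rules = [
        'price'    => '/\A' . $range . '(?:;' . $range . ')*\z/',
        'merchant' => '/\A\d+\z/',
        'category' => '/\A\d+\z/',
    ];

    $result = [];
    foreach (explode(',', $input) as $pair) {
        $parts = explode(':', $pair, 2);
        if (count($parts) !== 2) {
            return null; // not a key:value pair
        }
        [$key, $value] = $parts;
        // Reject unknown keys, repeated keys, and values that fail their rule.
        if (!isset($rules[$key]) || isset($result[$key]) || !preg_match($rules[$key], $value)) {
            return null;
        }
        $result[$key] = $value;
    }
    return $result;
}

print_r(parseFilter('category:3,price:1.00-max;3-12;23.34-12.19,merchant:25'));
```

Each validation regex stays small and anchored, and duplicates are rejected in ordinary code instead of inside one large pattern.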