I have this simple code that needs to match some words with accents, but it's not working the way I need.
This is the code:
<?php
$ab=("BÉLICA HOL");
preg_match_all("/[A-ZÑÁÉÍÓÚ\.]+\b/", $ab,$match_mayusculas);
print_r($match_mayusculas);
?>
The result is this: Array ( [0] => Array ( [0] => BÉLICA [1] => HOL ) )
Why?
If I do this
$ab=utf8_decode("BÉLICA HOL");
The result is Array ( [0] => Array ( [0] => B [1] => LICA [2] => HOL ) )
Where is my mistake?
Really thanks
This works
utf8_decode($match_mayusculas[0][0])
Thanks sp00m
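For anyone comparing the two results above, here is a minimal JavaScript sketch of the encoding detail involved. It is only an illustration, not the PHP fix: it shows that É is a single character but two bytes in UTF-8 (so UTF-8 text and Latin-1 text produced by utf8_decode disagree about it), and what a Unicode-aware match looks like. The variable names are made up for the example. In PHP, the comparable switch is the u pattern modifier, which makes preg_* treat pattern and subject as UTF-8.

```javascript
// "\u00C9" is É. It is one code point but two bytes in UTF-8, which is why
// byte-oriented matching and single-byte encodings (Latin-1) disagree about it.
const utf8Bytes = Array.from(new TextEncoder().encode("\u00C9"));
console.log(utf8Bytes.map(b => b.toString(16))); // the two UTF-8 bytes, in hex

// Unicode-aware match: \p{Lu} means "any uppercase letter" and needs the u flag.
const text = "B\u00C9LICA HOL";
const words = text.match(/[\p{Lu}.]+/gu);
console.log(words);
```

With the u flag the accented letter is treated as one character, so there is no need to list Ñ, Á, É and so on by hand.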
Is it possible to have n capture groups?
For example,
http://www.example.com/first-path
http://www.example.com/first-path/second-path
http://www.example.com/first-path/second-path/third-path
http://www.example.com/something.html
http://www.example.com/first-path?id=5
I am trying to capture first-path as group 1, second-path as group 2, and third-path as group 3 using http:\/\/(.*)\/(?!.*\/$)(.*), but it does not split the segments.
No specific programming language being used.
If you were using PHP, you could do something like this. The first split removes the leading http://www.example.com/ part, and the second then splits those values around the /:
$urls = array('http://www.example.com/first-path',
'http://www.example.com/first-path/second-path',
'http://www.example.com/first-path/second-path/third-path',
'http://www.example.com/something.html',
'http://www.example.com/first-path?id=5');
foreach ($urls as $url) {
$tail = preg_split('#https?://[^/]+/#', $url, -1, PREG_SPLIT_NO_EMPTY)[0];
$paths = preg_split('#/#', $tail);
print_r($paths);
}
Output:
Array
(
[0] => first-path
)
Array
(
[0] => first-path
[1] => second-path
)
Array
(
[0] => first-path
[1] => second-path
[2] => third-path
)
Array
(
[0] => something.html
)
Array
(
[0] => first-path?id=5
)
A similar thing could be done in Javascript:
let urls = ['http://www.example.com/first-path',
'http://www.example.com/first-path/second-path',
'http://www.example.com/first-path/second-path/third-path',
'http://www.example.com/something.html',
'http://www.example.com/first-path?id=5'];
console.log(urls.map(s => s.split(/https?:\/\/[^\/]+\//)[1].split(/\//)))
Output:
Array(5) […]
0: Array [ "first-path" ]
1: Array [ "first-path", "second-path" ]
2: Array(3) [ "first-path", "second-path", "third-path" ]
3: Array [ "something.html" ]
4: Array [ "first-path?id=5" ]
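If a regex is not a hard requirement, the same split can be sketched with the built-in URL class instead. This is an alternative to the approach above, not a drop-in replacement: pathSegments is a hypothetical helper name, and unlike the regex version it drops the query string, so the last URL yields first-path rather than first-path?id=5.

```javascript
// Hypothetical helper: split a URL's path into its non-empty segments.
function pathSegments(url) {
  // pathname excludes the scheme, host, query string and fragment.
  return new URL(url).pathname.split("/").filter(Boolean);
}

const sampleUrls = [
  "http://www.example.com/first-path",
  "http://www.example.com/first-path/second-path",
  "http://www.example.com/first-path/second-path/third-path",
  "http://www.example.com/something.html",
  "http://www.example.com/first-path?id=5",
];

const segments = sampleUrls.map(pathSegments);
console.log(segments);
```

Either way, the number of segments is handled by splitting rather than by a variable number of capture groups.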
I am extracting data from a text stream that is structured like this:
/1-<id>/<recType>-<data>..repeat n times../1-<id>/#-<data>..repeat n times..
In the above, the "/1" field precedes the record data, which can then have any number of following fields, each with a recType from 2 to 9 (each field also starts with a "/").
For example:
/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE
So, there are three groups of data above
1=XXXX 2=YYYY 9=ZZZZ
1=AAAA 3=BBBB 5=CCCC 8=NNNN 9=DDDD
1=QQQQ 2=WWWW 3=PPPP 7=EEEE
The data is simplified here. I know for certain that it only contains [A-Z0-9. ], but it can be of variable length (not just 4 characters as in the example).
Now, the following expression sort of works, but it's only capturing the first 2 fields of each group and none of the remaining fields...
/1-(?'fld1'[A-Z]+)/((?'fldNo'[2-9])-(?'fldData'[A-Z0-9\. ]+))
I know I need some sort of quantifier in there somewhere, but I do not know what or where to place it.
You can match these blocks with a single regex by using two .NET regex features: 1) capture collections and 2) multiple capturing groups with the same name in the pattern. Some LINQ is then needed to combine the captured data into a list of lists:
(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*
Details:
(?<fldNo>1) - Group fldNo matching 1
- - a hyphen
(?'fldData'[^/]+) - Group "fldData" capturing 1+ chars other than /
(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))* - zero or more sequences of:
/ - a literal /
(?<fldNo>[2-9]) - a digit from 2 to 9 (Group "fldNo")
[-=] - a - or =
(?'fldData'[^/]+)- 1+ chars other than / (Group "fldData")
See C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var str = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
var res = Regex.Matches(str, @"(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*")
.Cast<Match>()
.Select(p => p.Groups["fldNo"].Captures.Cast<Capture>().Select(m => m.Value)
.Zip(p.Groups["fldData"].Captures.Cast<Capture>().Select(m => m.Value),
(first, second) => first + "=" + second))
.ToList();
foreach (var t in res)
Console.WriteLine(string.Join(" ", t));
}
}
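Capture collections are a .NET feature. In an engine without them, a similar result can be sketched with two passes: one regex to find each record, and a second to list the fields inside it. The JavaScript below is a rough equivalent under that assumption. The names recordRe, fieldRe and records are made up for the example, and the patterns are adapted from the one above rather than copied.

```javascript
const stream = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";

// Pass 1: one match per record, i.e. a /1 field followed by any /2../9 fields.
const recordRe = /\/1-[^\/]+(?:\/[2-9][-=][^\/]+)*/g;
// Pass 2: one match per field inside a record (number, separator, data).
const fieldRe = /\/([1-9])[-=]([^\/]+)/g;

const records = (stream.match(recordRe) || []).map(record =>
  Array.from(record.matchAll(fieldRe), m => `${m[1]}=${m[2]}`)
);

for (const fields of records) {
  console.log(fields.join(" "));
}
```

Each element of records is the list of number=data pairs for one record, which mirrors the list of lists built by the LINQ code.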
I would suggest first splitting the string by /1, then using a pattern along these lines:
\/([1-9])[=-]([A-Z]+)
https://regex101.com/r/0nyzzZ/1
A single regex isn't the optimal tool for doing this (at least used in this way). The main reason is because your stream has a variable number of entries in it, and using a variable number of capture groups is not supported. I also noticed some of the values had "=" between them as well as the dash, which your current regex doesn't address.
The problem comes when you try and add a quantifier to a capture group - the group will only remember the last thing it captured, so if you add a quantifier, it will end up catching the first and last fields, leaving out all the rest of them. So something like this won't work:
\/1-(?'fld1'[A-Z]+)(?:\/(?'fldNo'[2-9])[-=](?'fldData'[A-Z]+))+
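To see the "only the last capture is kept" behaviour concretely, here is a small JavaScript sketch. It uses numbered groups instead of the named ones above (JavaScript does not accept the (?'name'...) syntax), and the variable names are only for illustration.

```javascript
const stream = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";

// Same shape as the pattern above: a /1 field, then a repeated field group.
const repeated = /\/1-([A-Z]+)(?:\/([2-9])[-=]([A-Z]+))+/;
const m = stream.match(repeated);

console.log(m[0]);             // the whole first record is matched...
console.log(m[1], m[2], m[3]); // ...but groups 2 and 3 hold only the final repetition
```

The overall match covers /1-XXXX/2-YYYY/9-ZZZZ, yet the field in the middle (2-YYYY) is not available from any group afterwards.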
If your streams all had the same number of fields, a single regex could be used. Since they don't, you can instead use a foreach loop with a much simpler regex that works on each part of your stream (and validates the stream as it goes along).
Now I'm not sure what language you're working with when using this, but here is a solution in PHP that I think delivers what you need.
function extractFromStream($str)
{
/*
* Get an array of [num]-[letters] with explode. This will make an array that
* contains [0] => 1-AAAA, [1] => 2-BBBB ... etc
*/
$arr = explode("/", substr($str, 1));
$sorted = array();
$key = 0;
/*
* Sort this data into key->values based on numeric ordering.
* If the next one has a lower or equal starting number than the one before it,
* a new entry will be created. i.e. 2-aaaa => 1-cccc will cause a new
* entry to be made, just in case the stream doesn't always start with 1.
*/
foreach ($arr as $value)
{
// This will get the number at the start, and has the added bonus of making sure
// each bit is in the right format.
if (preg_match("/^([0-9]+)[=-]([A-Z0-9. ]+)$/", $value, $matches)) {
$newKey = (int)$matches[1];
$match = $matches[2];
} else
throw new Exception("This is not a valid data stream!");
// This bit checks if we've got a lower starting number than last time.
if (isset($lastKey) && is_int($lastKey) && $newKey <= $lastKey)
$key += 1;
// Now sort them..
$sorted[$key][$newKey] = $match;
// This will be compared in the next iteration of the loop.
$lastKey = $newKey;
}
return $sorted;
}
Here's how you can use it...
$full = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
try {
$extracted = extractFromStream($full);
$stream1 = $extracted[0];
$stream2 = $extracted[1];
$stream3 = $extracted[2];
print "<pre>";
echo "Full extraction: \n";
print_r($extracted);
echo "\nFirst Stream:\n";
print_r($stream1);
echo "\nSecond Stream:\n";
print_r($stream2);
echo "\nThird Stream:\n";
print_r($stream3);
print "</pre>";
} catch (Exception $e) {
echo $e->getMessage();
}
This will print
Full extraction:
Array
(
[0] => Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
[1] => Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
[2] => Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
)
First Stream:
Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
Second Stream:
Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
Third Stream:
Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
So you can see you have the numbers as the array keys, and the values they correspond to, which are now readily accessible for further processing. I hope this helps you :)
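Since the question did not name a language, here is a rough JavaScript sketch of the same loop-based idea, for comparison only. groupFields is a hypothetical name, and the data class [A-Z0-9. ] is taken from the question's description of the data. A new group starts whenever the field number does not increase.

```javascript
// Hypothetical port of the loop above: returns an array of {fieldNo: data} objects.
function groupFields(stream) {
  const groups = [];
  let lastKey = null;
  // Drop the leading "/" and look at one field at a time.
  for (const field of stream.slice(1).split("/")) {
    const m = /^([0-9]+)[=-]([A-Z0-9. ]+)$/.exec(field);
    if (!m) throw new Error("This is not a valid data stream!");
    const key = Number(m[1]);
    // A lower or equal field number means a new record has started.
    if (lastKey === null || key <= lastKey) groups.push({});
    groups[groups.length - 1][key] = m[2];
    lastKey = key;
  }
  return groups;
}

const full = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
const extracted = groupFields(full);
console.log(extracted);
```

As in the PHP version, the field numbers end up as keys, so extracted[1][8] gives the data of field 8 in the second record.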
Here is my regex: {{[^\{\s}]+\}}
And my input is {{test1}}{{test2}}{{test3}}.
How can I get these 3 tests as an array using a regular expression?
I would use: ~\{\{([^}]+?)\}\}~
How you access the resulting array depends on your language!
[EDIT] add explanations
~: delimiter
\{\{, \}\}: match the characters literally. They should be escaped.
[^}]: match anything inside {{}} until a }
+: repeat the pattern multiple times (for multiple characters)
?: makes the match 'lazy', to match as few times as possible
(): capture group
:)
[EDIT] add PHP code sample for matching illustration:
<?php
$string= "{{test1}}{{test2}}{{test3}}";
if (preg_match_all("~\{\{([^}]+?)\}\}~s", $string, $matches))
{
print_r(array($matches));
// Do what you want
}
?>
will output this:
Array
(
[0] => Array
(
[0] => Array
(
[0] => {{test1}}
[1] => {{test2}}
[2] => {{test3}}
)
[1] => Array
(
[0] => test1
[1] => test2
[2] => test3
)
)
)
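For readers not on PHP, the same pattern can be tried in JavaScript. This is just a sketch with made-up variable names. matchAll needs the g flag and yields one match per placeholder, from which the first capture group is taken.

```javascript
const input = "{{test1}}{{test2}}{{test3}}";

// Same pattern as above, without the ~ delimiters; group 1 is the name inside the braces.
const names = Array.from(input.matchAll(/\{\{([^}]+?)\}\}/g), m => m[1]);

console.log(names);
```

names is a plain array, so names[0] is the first placeholder and names.length is the number of placeholders found.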
test[0-9]+
This matches all occurrences of testX where X is an integer of any size.
If you're trying to identify the braces instead, use this:
[{\}]
In C#, the Regex.Matches method returns a MatchCollection object.
Here is some code:
Regex r = new Regex(@"{{[^{\s}]+}}");
MatchCollection col = r.Matches("{{test1}}{{test2}}{{test3}}");
string[] arr = null;
if (col != null)
{
arr = new string[col.Count];
for (int i = 0; i < col.Count; i++)
{
arr[i] = col[i].Value;
}
}
Here's my string:
address='St Marks Church',notes='The North East\'s premier...'
The regex I'm using to grab the various parts using match_all is
'/(address|notes)='(.+?)'/i'
The results are:
address => St Marks Church notes => The North East\
How can I get it to ignore the \' character for the notes?
Not sure if you're wrapping your string with heredoc or double quotes, but a less greedy approach:
$str4 = 'address="St Marks Church",notes="The North East\'s premier..."';
preg_match_all('~(address|notes)="([^"]*)"~i',$str4,$matches);
print_r($matches);
Output
Array
(
[0] => Array
(
[0] => address="St Marks Church"
[1] => notes="The North East's premier..."
)
[1] => Array
(
[0] => address
[1] => notes
)
[2] => Array
(
[0] => St Marks Church
[1] => The North East's premier...
)
)
Another method with preg_split:
//split the string at the comma
//assumes no commas in text
$parts = preg_split('!,!', $string);
foreach($parts as $key=>$value){
//split the values at the = sign
$parts[$key]=preg_split('!=!',$value);
foreach($parts[$key] as $k2=>$v2){
//trim the quotes out and remove the slashes
$parts[$key][$k2]=stripslashes(trim($v2,"'"));
}
}
Output looks like:
Array
(
[0] => Array
(
[0] => address
[1] => St Marks Church
)
[1] => Array
(
[0] => notes
[1] => The North East's premier...
)
)
Super slow old-skool method:
$len = strlen($string);
$key = "";
$value = "";
$store = array();
$pos = 0;
$mode = 'key';
while($pos < $len){
switch($string[$pos]){
case '=':
$mode = 'value';
break;
case ',':
$store[$key]=trim($value,"'");
$key=$value='';
$mode = 'key';
break;
default:
$$mode .= $string[$pos];
}
$pos++;
}
$store[$key]=trim($value,"'");
Because you have posted that you are using match_all and the top tags in your profile are php and wordpress, I think it is fair to assume you are using preg_match_all() with php.
The following patterns will match the substrings required to build your desired associative array:
Patterns that generate a fullstring match and 1 capture group:
/(address|notes)='\K(?:\\\'|[^'])*/ (166 steps, demo link)
/(address|notes)='\K.*?(?=(?<!\\)')/ (218 steps, demo link)
Patterns that generate 2 capture groups:
/(address|notes)='((?:\\\'|[^'])*)/ (168 steps, demo link)
/(address|notes)='(.*?(?<!\\))'/ (209 steps, demo link)
Code: (Demo)
$string = "address='St Marks Church',notes='The North East\'s premier...'";
preg_match_all(
"/(address|notes)='\K(?:\\\'|[^'])*/",
$string,
$out
);
var_export(array_combine($out[1], $out[0]));
echo "\n---\n";
preg_match_all(
"/(address|notes)='((?:\\\'|[^'])*)/",
$string,
$out,
PREG_SET_ORDER
);
var_export(array_column($out, 2, 1));
Output:
array (
'address' => 'St Marks Church',
'notes' => 'The North East\\\'s premier...',
)
---
array (
'address' => 'St Marks Church',
'notes' => 'The North East\\\'s premier...',
)
Patterns #1 and #3 use alternatives to allow non-apostrophe characters or apostrophes not preceded by a backslash.
Patterns #2 and #4 (will require an additional backslash when implemented with php demo) use lookarounds to ensure that apostrophes preceded by a backslash don't end the match.
Some notes:
Using capture groups, alternatives, and lookarounds often costs pattern efficiency. Limiting the use of these components often improves performance. Using negated character classes with greedy quantifiers often improves performance.
Using \K (which restarts the fullstring match) is useful when trying to reduce capture groups and it reduces the size of the output array.
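You should match up to an end quote that isn't preceded by a backslash, thus:
(address|notes)='(.*?[^\\])'
The [^\\] forces the character immediately preceding the closing ' to be anything but a backslash. It sits inside the capture group so that the last character of the value is not dropped from the capture.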
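The "escaped character or non-quote character" idea from the answers above can also be sketched in JavaScript. This is only an illustration: pairRe and result are made-up names, and String.raw is used so that the backslash in the sample survives as typed. Each value is unescaped afterwards by dropping the backslash in front of any escaped character.

```javascript
// String.raw keeps the backslash before the apostrophe exactly as written.
const input = String.raw`address='St Marks Church',notes='The North East\'s premier...'`;

// Inside the quotes, accept either a backslash followed by any character,
// or any character that is neither a quote nor a backslash.
const pairRe = /(address|notes)='((?:\\.|[^'\\])*)'/gi;

const result = {};
for (const [, key, value] of input.matchAll(pairRe)) {
  result[key] = value.replace(/\\(.)/g, "$1"); // turn \' back into '
}

console.log(result);
```

Compared with a lookbehind such as (?<!\\), this form also copes with a value that ends in an escaped backslash, because the backslash and the character after it are always consumed as a pair.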
I have the following string:
cn=abcd,cn=groups,dc=domain,dc=com
Can a regular expression be used here to extract the string after the first cn= and before the first ,? In the example above the answer should be abcd.
/cn=([^,]+),/
Most languages will expose the captured group as $1 or matches[1].
If you can't for some reason wield subscripts,
$x =~ s/^cn=//;
$x =~ s/,.*$//;
That's a way to do it in 2 steps.
If you were parsing it out of a log with sed
sed -n -r '/cn=/s/^cn=([^,]+),.*$/\1/p' < logfile > dumpfile
will get you what you want. ( Extra commands added to only print matching lines )
/^cn=([^,]+),/
Also, look for a pre-built LDAP parser.
Yeah, using perl/java syntax cn=([^,]*),. You'd then get the 1st group.
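As a concrete instance of "get the 1st group", here is a short JavaScript sketch using the anchored pattern from above. The variable names are only for the example, and the null check covers input that does not start with cn=.

```javascript
const dn = "cn=abcd,cn=groups,dc=domain,dc=com";

// ^ anchors at the start, so only the first cn= is considered;
// [^,]+ stops at the first comma.
const match = dn.match(/^cn=([^,]+),/);
const firstCn = match ? match[1] : null;

console.log(firstCn);
```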
I had to work that out in PHP.
Since an LDAP string can sometimes be lengthy and have many attributes, I thought I would contribute how I am using it in a project.
I wanted to use:
CN=username,OU=UNITNAME,OU=Region,OU=Country,DC=subdomain,DC=domain,DC=com
And turn it into:
array (
[CN] => array( username )
[OU] => array( UNITNAME, Region, Country )
[DC] => array ( subdomain, domain, com )
)
Here is how I built my method.
/**
* Read a LDAP DN, and return what is needed
*
* Takes care of the character escape and unescape
*
* Using:
* CN=username,OU=UNITNAME,OU=Region,OU=Country,DC=subdomain,DC=domain,DC=com
*
* Would normally return:
* Array (
* [count] => 7
* [0] => CN=username
* [1] => OU=UNITNAME
* [2] => OU=Region
* [3] => OU=Country
* [4] => DC=subdomain
* [5] => DC=domain
* [6] => DC=com
* )
*
* Returns instead a manageable array:
* array (
* [CN] => array( username )
* [OU] => array( UNITNAME, Region, Country )
* [DC] => array ( subdomain, domain, com )
* )
*
*
* @author gabriel at hrz dot uni-marburg dot de 05-Aug-2003 02:27 (part of the character replacement)
* @author Renoir Boulanger
*
* @param string $dn The DN
* @return array
*/
function parseLdapDn($dn)
{
$parsr=ldap_explode_dn($dn, 0);
//$parsr[] = 'EE=Sôme Krazï string';
//$parsr[] = 'AndBogusOne';
$out = array();
foreach($parsr as $key=>$value){
if(FALSE !== strstr($value, '=')){
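// Limit the split to 2 parts so a value that itself contains "=" stays intact.
list($prefix,$data) = explode("=",$value,2);
// The /e modifier was removed in PHP 7, so decode the \XX hex escapes with a callback.
$data=preg_replace_callback("/\\\\([0-9A-Fa-f]{2})/", function ($m) { return chr(hexdec($m[1])); }, $data);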
if(isset($current_prefix) && $prefix == $current_prefix){
$out[$prefix][] = $data;
} else {
$current_prefix = $prefix;
$out[$prefix][] = $data;
}
}
}
return $out;
}
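For comparison, the grouping step can be sketched without the LDAP extension. The JavaScript below is a simplified illustration, not a full DN parser: parseDn is a hypothetical name, it splits on commas that are not preceded by a backslash, and it decodes each \XX escape as a single character code, so multi-byte UTF-8 sequences would not be reassembled. A pre-built LDAP parser is still the safer choice for real data.

```javascript
// Hypothetical helper: group the parts of a DN by attribute name.
function parseDn(dn) {
  const out = {};
  // Split on commas that are not escaped with a backslash.
  for (const part of dn.split(/(?<!\\),/)) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip bogus parts that have no "="
    const attr = part.slice(0, eq).trim();
    // Decode \XX hex escapes, treating each one as a single character code.
    const value = part.slice(eq + 1).replace(/\\([0-9A-Fa-f]{2})/g,
      (_, hex) => String.fromCharCode(parseInt(hex, 16)));
    (out[attr] = out[attr] || []).push(value);
  }
  return out;
}

const parsed = parseDn("CN=username,OU=UNITNAME,OU=Region,OU=Country,DC=subdomain,DC=domain,DC=com");
console.log(parsed);
```

The result has one key per attribute type, each holding its values in the order they appear in the DN.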