Difficult Regexp - regex

I need a regexp which does the following:
Here's the name of an HTML input field:
lm[0][ti]
I need to find the base name ("lm"). If, and only if, the name contains brackets, I also need the string in the second pair of brackets ("ti").
To get it in portions is easy with the following regexp:
([a-zA-Z\d_]+)\[?([0-9]*)\]?\[?([a-zA-Z\d_]+)\]?
It matches all the portions I need.
Array
(
[0] => lm[0][ti]
[1] => lm
[2] => 0
[3] => ti
)
But if the HTML input name is just "lm", this regexp gives me no way to tell whether the last item in the array ([3]) is a valid name. The array looks like this:
Array
(
[0] => lm
[1] => l
[2] =>
[3] => m
)
"m" is not valid for me, I'd like to get this array:
Array
(
[0] => lm
[1] =>
[2] =>
[3] =>
)
or this
Array
(
[0] => lm
)
You can test the regexp here:
http://regexp-tester.mediacix.de/exp/regex/
Thanks for support in finding the right regexp...

Try this:
(\w+)(?:\[(\d+)\])?(?:\[(\w+)\])?
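For the input lm[0][ti] the groups capture lm, 0 and ti. For the input lm only the first group captures (lm); both optional groups stay unmatched.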
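A minimal sketch of the same pattern in Python, to make the behaviour of the optional groups concrete. The helper name parse_field_name is hypothetical, and Python's re module is used only for illustration; for ASCII input this pattern behaves the same way as in PCRE.

```python
import re

# Base name, then an optional numeric index in brackets,
# then an optional key in brackets.
FIELD_RE = re.compile(r"(\w+)(?:\[(\d+)\])?(?:\[(\w+)\])?")


def parse_field_name(name):
    """Return (base, index, key); index and key are None when absent."""
    match = FIELD_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid field name: {name!r}")
    return match.groups()


print(parse_field_name("lm[0][ti]"))  # ('lm', '0', 'ti')
print(parse_field_name("lm"))         # ('lm', None, None)
```

The difference from the pattern in the question is that each bracketed part is optional as a whole, so the last group can no longer grab the trailing "m" of a plain name.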

Related

How to use regex to include linebreaks in extracted results

I am processing a text file of messages that resembles this (though a lot longer):
13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too? ‚ thank you
Hello
13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message
where someone added a line break
13/09/18, 4:10 pm - Fred Dag: Here is another message
The following regex works to extract the data into Date, Time, Name and Message except where the Message includes a line break:
(?<date>(?:[0-9]{1,2}\/){2}[0-9]{1,2}),\s(?<time>(?:[0-9]{1,2}:)[0-9]{2}\s[a|p]m)\s-\s(?<name>(?:.*)):\s(?<message>(?:.+))
Using preg_match_all, and the regex above, in php7.4 I have generated the following array:
Array
(
[0] => Array
(
[date] => 13/09/18
[time] => 4:14 pm
[name] => Fred Dag
[message] => Jackie, please could you send to me too? ‚ thank you
)
[1] => Array
(
[date] => 13/09/18
[time] => 4:45 pm
[name] => Jackie Johnson
[message] => Here is yet another message
)
[2] => Array
(
[date] => 13/09/18
[time] => 4:10 pm
[name] => Fred Dag
[message] => Here is another message
)
)
But the array is missing the lines caused by the line breaks which should be appended to the previous Message. I get the same result when playing in regex101.com.
I tried including the single-line modifier for the message, like this: (?<message>(?s:.+)), but that then selected everything from the start of the first message to the end of the file.
I tried playing with greedy vs non-greedy but I couldn't get that to work.
I tried using a reverse lookup, but I don't seem to have enough understanding to get that to work and ended up just randomly pasting code off the internet, which did nothing but get me frustrated.
I think I have exhausted my knowledge of regex and reached the end of Google with the terms I know to search with :) Could anyone point me in the right direction?
Your immediate problem seems to be that the dot you are using to match the message content does not match across newlines. That can easily be fixed by using the /s dot all flag in your PHP regex. But that aside, I think your regex would also need to change. I suggest the following pattern:
\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}.*?(?=\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}|$)
This pattern matches one message, from its starting date and time, across newlines, up to either the start of the next message or the end of the input.
Sample script:
$input = "13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too? ‚ thank you\nHello\n13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message\nwhere someone added a line break\n13/09/18, 4:10 pm - Fred Dag: Here is another message";
preg_match_all("/\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}.*?(?=\d{2}\/\d{2}\/\d{2}, \d{1,2}:\d{1,2}|$)/s", $input, $matches);
print_r($matches[0]);
This prints:
Array
(
[0] => 13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too? ‚ thank you
Hello
[1] => 13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message
where someone added a line break
[2] => 13/09/18, 4:10 pm - Fred Dag: Here is another message
)
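The same lookahead idea can be combined with the named groups from the question, so that each message is still split into date, time, name and message. The following is a sketch in Python rather than PHP; the names MESSAGE_RE and log are made up for the example, the sample text is shortened, and the header format is assumed to be exactly the one shown in the question.

```python
import re

# A message runs until the next line that starts with a
# "date, time - " header, or until the end of the input.
MESSAGE_RE = re.compile(
    r"(?P<date>\d{1,2}/\d{1,2}/\d{2}), "
    r"(?P<time>\d{1,2}:\d{2} [ap]m) - "
    r"(?P<name>[^:\n]+): "
    r"(?P<message>.*?)"
    r"(?=\n\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2} [ap]m - |\Z)",
    re.DOTALL,  # let "." cross line breaks inside a message
)

log = (
    "13/09/18, 4:14 pm - Fred Dag: Jackie, please could you send to me too?\n"
    "Hello\n"
    "13/09/18, 4:45 pm - Jackie Johnson: Here is yet another message\n"
    "where someone added a line break\n"
    "13/09/18, 4:10 pm - Fred Dag: Here is another message"
)

messages = [m.groupdict() for m in MESSAGE_RE.finditer(log)]
for entry in messages:
    print(entry)
```

The lazy .*? is what keeps a message from swallowing the following ones, and the lookahead stops it without consuming the next header.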

How to use separate() properly?

I am having difficulty extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a digit, so I capture everything from the first digit up to the next _ as the ID, and everything after that as the name:
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
Try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
If you want to get the name as well, a simple solution (which assumes that the name always comes after the second underscore) would be:
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this (note that paste joins the pieces with spaces instead of underscores):
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"
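For comparison, the "split on the first two underscores only" idea can be sketched in Python, where str.split takes a maximum number of splits. The function name split_record is hypothetical; the sample values come from the question.

```python
def split_record(value):
    """Split 'PREFIX_ID_NAME' on the first two underscores only."""
    prefix, record_id, name = value.split("_", 2)
    return record_id, name


print(split_record("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR"))
# ('27da12ce-85fe-3f28-92f9-e5235a5cf6ac', 'THOMAS_MYR')
```

Limiting the number of splits leaves any further underscores inside the name untouched.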

Regex pattern matching for inconsistent address patterns in large dat file

I know it can't be perfect but I am not very good with regex and I'm having difficulties getting a better matching percentage.
I have a file that has over 9 million rows and the addresses are very inconsistent. I was wondering if I could get some help from the people here that are better than me. Any help would be greatly appreciated.
This is what I have so far. I thought the best way to attack this would be to try to match the pattern from the end of the string since apt,bx, po box, etc could be at the start of the string.
/(\d+\-\d+\s+|\d+-\D+|APT\s\D|APT\s\d+|APT\s\D\d+|APT\s\D\s\d+|SPACE\s\d+|POBOX\s\d+|BX|UNIT\s\d+|\d+-\d+|\d+)\s(.+)\s{2,}(\D+)\s(\D{2})$/
Several patterns that I can see. The large number of spaces is as in the file. I tried splitting on 2 spaces or more as well as in the regex I have thus far.
F_NAME L_NAMEFOR F_NAME L_NAME ADDRESS ZIP CITY STATE
ADDRESS CITY STATE
ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY STATE
APT # ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY STATE
P O BOX # ADDRESS CITY STATE
APT DIGIT# ADDRESS CITY STATE
SPACE DIGIT ADDRESS CITY STATE
UNIT # ADDRESS CITY STATE
SP DIGIT ADDRESS CITY STATE
DIGITS-DIGITS ADDRESS CITY STATE
BX DIGIT ADDRESS CITY STATE
ADDRESS APT # CITY STATE
ADDRESS UNIT # CITY STATE
ADDRESS P O BOX DIGIT CITY STATE
P O B O X DIGIT CITY STATE
P O BOX DIGIT CITY STATE
ADDRESS SPACE/SP/SPC/UNIT DIGIT CITY STATE
This is a rather complex problem which sadly won't have a simple solution.
You could try the following regex, which is admittedly far from perfect:
^.*?(?<address>(?:\b(?:[a-zA-Z0-9.,:;\\\/#-]|\s(?=\S))*?(?<zip>\d{5}(?:-\d{4}|-\d{6})?)?\b)?)\s{2,}(?<city>\b(?:\w|\s(?=\S))+\b)\s{1,}(?<state>\b\w{2,3}\b)(?:$|\r|\n)
Groups: 1 = address; 2 = zip; 3 = city; 4 = state
Input (note that I changed STATE to st, zip to 12345, and the PO box digits to actual digits):
F_NAME L_NAMEFOR F_NAME L_NAME ADDRESS 12345 CITY st
ADDRESS CITY st
ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
APT # ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
P O BOX # 1234 ADDRESS CITY st
APT DIGIT# ADDRESS CITY st
SPACE DIGIT ADDRESS CITY st
UNIT # ADDRESS CITY st
SP DIGIT ADDRESS CITY st
DIGITS-DIGITS ADDRESS CITY st
BX DIGIT ADDRESS CITY st
ADDRESS APT # CITY st
ADDRESS UNIT # CITY st
ADDRESS P O BOX 3245 CITY st
P O B O X 123 CITY st
P O BOX 345 CITY st
ADDRESS SPACE/SP/SPC/UNIT DIGIT CITY st
Matches
[0] => Array
(
[0] => F_NAME L_NAMEFOR F_NAME L_NAME ADDRESS 12345 CITY st
[1] => ADDRESS CITY st
[2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
[3] => APT # ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
[4] => P O BOX # 1234 ADDRESS CITY st
[5] => APT DIGIT# ADDRESS CITY st
[6] => SPACE DIGIT ADDRESS CITY st
[7] => UNIT # ADDRESS CITY st
[8] => SP DIGIT ADDRESS CITY st
[9] => DIGITS-DIGITS ADDRESS CITY st
[10] => BX DIGIT ADDRESS CITY st
[11] => ADDRESS APT # CITY st
[12] => ADDRESS UNIT # CITY st
[13] => ADDRESS P O BOX DIGIT CITY st
[14] => P O B O X 123 CITY st
[15] => P O BOX 345 CITY st
[16] => ADDRESS SPACE/SP/SPC/UNIT DIGIT CITY st
)
[address] => Array
(
[0] => ADDRESS 12345
[1] => ADDRESS
[2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
[3] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
[4] => ADDRESS
[5] => APT DIGIT#
[6] => ADDRESS
[7] => ADDRESS
[8] => ADDRESS
[9] => DIGITS-DIGITS ADDRESS
[10] => ADDRESS
[11] => APT #
[12] => UNIT #
[13] => DIGIT
[14] => 123
[15] => P O BOX 345
[16] => SPACE/SP/SPC/UNIT DIGIT
)
[zip] => Array
(
[0] => 12345
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
[16] =>
)
[city] => Array
(
[0] => CITY
[1] => CITY
[2] => CITY
[3] => CITY
[4] => CITY
[5] => ADDRESS CITY
[6] => CITY
[7] => CITY
[8] => CITY
[9] => CITY
[10] => CITY
[11] => CITY
[12] => CITY
[13] => CITY
[14] => CITY
[15] => CITY
[16] => CITY
)
[state] => Array
(
[0] => st
[1] => st
[2] => st
[3] => st
[4] => st
[5] => st
[6] => st
[7] => st
[8] => st
[9] => st
[10] => st
[11] => st
[12] => st
[13] => st
[14] => st
[15] => st
[16] => st
)
Recommend having a look at question 11160192
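The core idea shared by the question and the regex above (anchor at the end of the line, take the state as the last two-letter token, and treat a run of two or more spaces as the boundary before the city) can be sketched in a much simpler form. This is only an illustration in Python: the sample lines are invented, ADDRESS_RE and parse_line are hypothetical names, and it assumes the city never contains a run of two or more spaces.

```python
import re

ADDRESS_RE = re.compile(
    r"^(?P<address>.*?)"            # everything before the last wide gap
    r"\s{2,}"                       # the wide gap itself
    r"(?P<city>\S+(?: \S+)*)"       # words separated by single spaces
    r" (?P<state>[A-Za-z]{2})\s*$"  # two-letter state at the end of the line
)


def parse_line(line):
    """Return a dict with address, city and state, or None if no match."""
    match = ADDRESS_RE.match(line)
    return match.groupdict() if match else None


print(parse_line("123 MAIN ST W    SPRINGFIELD IL"))
print(parse_line("JOHN DOE    12 OAK AVE    SAN JOSE CA"))
print(parse_line("NO WIDE GAP HERE"))
```

Lines that return None can be collected and inspected separately, which helps when tuning the pattern against a 9-million-row file.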
Denomales' answer is quite sufficient for your needs I think, but I'm going to expand my comment above into an answer since I think there are some relevant pieces specific to your question.
Are they US addresses? You could try an API or tool to extract the addresses en masse. Here's an example of such a tool from another recent Stack Overflow answer, which had a small list of addresses to match:
For disclosure, I work at SmartyStreets and helped to develop this. While it's not designed specifically with spreadsheet or tabular address data in mind, it was designed for non-uniform input like freeform text. You can even splice millions of rows into the service in pieces.
Perhaps this will be helpful as it validates the addresses too, after it finds them in text. Addresses are real gnarly, as you're discovering, and a dedicated tool can sometimes be the best way to handle them. Not saying this is the correct answer for your case, but hopefully still informative.

Amazon SES getSendStatistics

I am facing a problem with getSendStatistics in the Amazon SES API: the data returned by the first call to getSendStatistics is not the same as the data returned by a second call (after the page is refreshed).
Example
First time return data
[GetSendStatisticsResult] => CFSimpleXML Object
(
[SendDataPoints] => CFSimpleXML Object
(
[member] => Array
(
[0] => CFSimpleXML Object
(
[DeliveryAttempts] => 3
[Timestamp] => 2013-04-23T04:47:00Z
[Rejects] => 0
[Bounces] => 0
[Complaints] => 0
)
[1] => CFSimpleXML Object
(
[DeliveryAttempts] => 1
[Timestamp] => 2013-04-23T10:17:00Z
[Rejects] => 0
[Bounces] => 0
[Complaints] => 0
)
)
)
)
Second time return data
[GetSendStatisticsResult] => CFSimpleXML Object
(
[SendDataPoints] => CFSimpleXML Object
(
[member] => Array
(
[0] => CFSimpleXML Object
(
[DeliveryAttempts] => 1
[Timestamp] => 2013-04-23T10:17:00Z
[Rejects] => 0
[Bounces] => 0
[Complaints] => 0
)
[1] => CFSimpleXML Object
(
[DeliveryAttempts] => 3
[Timestamp] => 2013-04-23T04:47:00Z
[Rejects] => 0
[Bounces] => 0
[Complaints] => 0
)
)
)
)
Somehow the entries change position and I don't know what is happening. Can anyone guide me on this problem? I am new to Amazon SES.
Thank You
According to AWS, there is no guarantee, at any particular moment, that statistics are 100% up to date - there may be a reporting lag, even if you haven't sent something in between the calls.
We may delay the data returned in GetSendStatistics in order to better aggregate the data, and it is not guaranteed to be accurate to the minute.
https://forums.aws.amazon.com/thread.jspa?messageID=278174
Could this possibly explain what you are seeing?
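Note that the two dumps in the question contain the same two data points, only in a different order. Whatever the cause, the order can be made stable on the client side by sorting on Timestamp. A sketch in Python, with the data points from the question written as plain dictionaries (the function name sort_data_points is made up, and no SDK call is shown):

```python
from datetime import datetime


def sort_data_points(points):
    """Return the data points ordered by their ISO 8601 UTC timestamp."""
    return sorted(
        points,
        key=lambda p: datetime.strptime(p["Timestamp"], "%Y-%m-%dT%H:%M:%SZ"),
    )


first_call = [
    {"DeliveryAttempts": 3, "Timestamp": "2013-04-23T04:47:00Z"},
    {"DeliveryAttempts": 1, "Timestamp": "2013-04-23T10:17:00Z"},
]
second_call = list(reversed(first_call))

# Both responses give the same list once sorted.
print(sort_data_points(first_call) == sort_data_points(second_call))
```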

Regex to separate and group phone numbers of the same network from a list of phone numbers

I have been having a tough time trying to figure this out.
Let me explain my needs this way:
I am making a PHP SMS sender script.
The recipients' phone numbers will be typed into a textarea, separated by commas,
e.g. 2348064356853,2347065478934,2348167456845,2347123454680.
These numbers will be a mixture of phone numbers from different mobile networks.
Different APIs will handle the SMS delivery to different phone groups (mobile networks), e.g. API1 handles phones from NetworkA and NetworkB, while API2 handles NetworkC, etc.
I already know the formats of the numbers assigned to the various Phone Networks.
Here are the formats for all the available networks:
All the phone numbers are 13 digits long and start with 234.
The first 6 digits of each phone number identify a network.
Each network has more than one identifying prefix.
Below are the prefixes assigned to the different networks:
Network A - 234706, 234803, 234806, 234810, 234813, 234816 4235940,
Network B - 234705, 234805, 234807, 234815,
Network C - 234809, 234817, 234818, 234708
Network D - 234802, 234808, 234812
Network E - 234702, 234819, 234709, 234704, 234707
My challenge is how to separate the phone numbers of these different networks and group them into separate variables.
Please, I need someone to help me with this.
Network A
/234(?:706|803|806|810|813|816)\d{7}/
Network B
/234(?:705|805|807|815)\d{7}/
Network C
/234(?:708|809|817|818)\d{7}/
Network D
/234(?:802|808|812)\d{7}/
Network E
/234(?:702|704|707|709|819)\d{7}/
You can run one preg_match_all per expression above. With the default PREG_PATTERN_ORDER flag, all the numbers matched by that expression end up neatly in one array, $matches[0].
The expressions could be anchored (^ or $) or boundaries or lookaround assertions could have been used, but I wasn't sure exactly what your inputs were.
1- Separate Network prefixes, something like:
networks={
"505 404 506 606",
"236 848 488 993"
etc
};
2- Check whether a number is valid, and split it:
/234(\d{3})(\d{7})/gm will make 2 groups, one with the prefix and the other with the rest of the number, so you just have to look up the prefix in the networks array and send the number to the network at the index where you found it.
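The two steps above can be sketched as a prefix lookup. This is an illustration in Python rather than PHP: the prefix table is taken from the question, the names NETWORK_PREFIXES, PREFIX_TO_NETWORK and group_numbers are hypothetical, and numbers that are malformed or have an unlisted prefix are collected under "unknown".

```python
import re

NETWORK_PREFIXES = {
    "A": ["706", "803", "806", "810", "813", "816"],
    "B": ["705", "805", "807", "815"],
    "C": ["809", "817", "818", "708"],
    "D": ["802", "808", "812"],
    "E": ["702", "819", "709", "704", "707"],
}
# Invert the table once: three-digit prefix -> network letter.
PREFIX_TO_NETWORK = {
    prefix: network
    for network, prefixes in NETWORK_PREFIXES.items()
    for prefix in prefixes
}
NUMBER_RE = re.compile(r"234(\d{3})\d{7}")


def group_numbers(raw):
    """Group a comma-separated list of numbers by network."""
    groups = {}
    for number in (part.strip() for part in raw.split(",")):
        if not number:
            continue  # skip empty entries, e.g. after a trailing comma
        match = NUMBER_RE.fullmatch(number)
        network = PREFIX_TO_NETWORK.get(match.group(1)) if match else None
        groups.setdefault(network or "unknown", []).append(number)
    return groups


print(group_numbers("2348064356853,2347065478934,2348167456845,2347123454680"))
```

Keeping the prefixes in a table means a new prefix only needs a data change, not a new regex.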
Here's your regex (?P<NETA>234(?:706|803|806|810|813|816)\d{7})|(?P<NETB>234(?:705|805|807|815)\d{7})|(?P<NETC>234(?:809|817|818|708)\d{7})|(?P<NETD>234(?:802|808|812)\d{7})|(?P<NETE>234(?:702|819|709|704|707)\d{7})
and here's some helpful code to create your own next time
<?php
$networx = array(
    'A' => array( 706, 803, 806, 810, 813, 816 ),
    'B' => array( 705, 805, 807, 815 ),
    'C' => array( 809, 817, 818, 708 ),
    'D' => array( 802, 808, 812 ),
    'E' => array( 702, 819, 709, 704, 707 )
);
// Build one alternation with a named group (NETA, NETB, ...) per network.
$regex = '';
foreach ( $networx as $k => $nums ) {
    $nums = implode( '|', $nums );
    $regex .= ( $regex ? '|' : '' ) . "(?P<NET{$k}>234(?:{$nums})\d{7})";
}
$text = "Network A - 2347061234567, 234803, 234806, 234810, 234813, 23481 6 4235940,
Network B - 234705, 2348051234567, 234807, 2348151234567,
Network C - 234809, 234817, 2348181234567, 234708
Network D- 234802, 2348081234567, 234812
Network E - 234702, 234819, 234709, 234704, 234707";
preg_match_all( "/{$regex}/", $text, $matches );
?>
The final var $matches will contain keys like NETA, NETB, ... NETE containing arrays with the corresponding numbers.
[NETB] => Array
(
[0] =>
[1] => 2348051234567
[2] => 2348151234567
[3] =>
[4] =>
)
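The empty entries in the output above come from preg_match_all filling every named group for every match. A variant that avoids them is to ask, per match, which named group actually matched. The sketch below does this in Python with Match.lastgroup; it is not a translation of the PHP code, the names build_pattern and group_by_network are hypothetical, and the \b anchors are an added assumption so that longer digit runs are not matched in the middle.

```python
import re
from collections import defaultdict

NETWORKS = {
    "A": ["706", "803", "806", "810", "813", "816"],
    "B": ["705", "805", "807", "815"],
    "C": ["809", "817", "818", "708"],
    "D": ["802", "808", "812"],
    "E": ["702", "819", "709", "704", "707"],
}


def build_pattern(prefixes_by_network):
    """Build one alternation with a named group per network."""
    parts = []
    for name, prefixes in prefixes_by_network.items():
        alternation = "|".join(prefixes)
        parts.append(rf"(?P<NET{name}>234(?:{alternation})\d{{7}})")
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b")


def group_by_network(text, pattern):
    """Collect every matched number under the name of the group that matched."""
    grouped = defaultdict(list)
    for match in pattern.finditer(text):
        grouped[match.lastgroup].append(match.group())
    return dict(grouped)


pattern = build_pattern(NETWORKS)
print(group_by_network("2348064356853,2347065478934,2348051234567", pattern))
```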