I am converting a PDF to text with xpdf (pdftotext) and then finding some words
with the help of a regex and preg_match_all.
The words are separated by a colon in the pdftotext output.
Below is my pdftotext output:
In respect of Shareholders
Name: xyx
Residential address: dublin
No of Shares: 2
Name: abc
Residential address: canada
No of Shares: 2
So I wrote a regex that returns the words after the colon in the text.
$regex = '/(?<=: ).+/';
preg_match_all($regex, $string, $matches);
But now I want a regex that returns all the data after In respect of Shareholders.
So I wrote $regex = '/(?<=In respect of Shareholders).*?(?=\s)/';
But it shows me only:
Name: xyx
I want to first find all the data after In respect of Shareholders, and then use another regex to find the words after the colon.
You may use
if (preg_match_all('~(?:\G(?!\A)|In respect of Shareholders)\s*[^:\r\n]+:\h*\K.*~', $string, $matches)) {
print_r($matches[0]);
}
See the regex demo
Details
(?:\G(?!\A)|In respect of Shareholders) - either the end of the previous successful match or In respect of Shareholders text
\s* - 0+ whitespaces
[^:\n\r]+ - 1 or more chars other than :, CR and LF
: - a colon
\h* - 0+ horizontal whitespaces
\K - match reset operator that discards all text matched so far
.* - the rest of the line (0 or more chars other than line break chars).
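The pattern above relies on PCRE features (\G, \K, \h). As a rough sketch of the same two-step idea for an engine that lacks them, written here in Python rather than PHP, one could first cut the text at the header and then collect the value after each colon. The name values_after_header and the sample text are made up for illustration, and unlike the \G version this variant does not stop at the first line that has no colon.

```python
import re


def values_after_header(text, header="In respect of Shareholders"):
    """Return the value after the colon on every 'key: value' line after header."""
    # Step 1: keep only the text that follows the header (hypothetical helper).
    _, found, tail = text.partition(header)
    if not found:
        return []
    # Step 2: on each line, skip the key and the colon, capture the rest of the line.
    return re.findall(r"^[^:\r\n]+:[ \t]*(.*)$", tail, flags=re.MULTILINE)


sample = (
    "Some preamble: ignored\n"
    "In respect of Shareholders\n"
    "Name: xyx\n"
    "Residential address: dublin\n"
    "No of Shares: 2\n"
    "Name: abc\n"
    "Residential address: canada\n"
    "No of Shares: 2"
)

print(values_after_header(sample))
```

Lines before the header are ignored because the search only runs on the tail of the text.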
Your regex (?<=: ).+ matches one or more characters after a colon and a space. To capture everything that follows the colon and any spaces or tabs in a group, you could use :[\t ]+(.+)
Another way to match the texts using a capturing group could be:
^.*?:[ \t]+(\w+)
Explanation
^ Assert start of the string
.*?: Match any character non greedy followed by a :
[ \t]+ Match 1+ times a space or a tab
(\w+) Capture in a group 1+ word characters
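A minimal sketch of the capturing-group variant, in Python for illustration. It assumes the input holds several lines, so re.MULTILINE is set to let ^ match at the start of every line; the helper name first_words is hypothetical.

```python
import re

# ^.*?:   lazily match up to the first colon on the line
# [ \t]+  one or more spaces or tabs
# (\w+)   capture the first run of word characters after them
FIELD = re.compile(r"^.*?:[ \t]+(\w+)", re.MULTILINE)


def first_words(text):
    """Return the first word after the colon on each 'key: value' line."""
    return FIELD.findall(text)


lines = "Name: xyx\nResidential address: dublin\nNo of Shares: 2"
print(first_words(lines))
```

Because the group is (\w+), only the first word of each value is returned; a value such as "new york" would yield "new".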
Or use \K to forget what was matched if that is supported:
^.*?:\h*\K\w+
Related
For example, I want to match three values: a required text, an optional times, and an optional id in the format [id=100000]. How can I match the data correctly when the text contains spaces?
My regex: (?<text>[\s\S]+) (?<times>\d+)? (\[id=(?<id>\d+)])?
Example source text: hello world 1 [id=10000]
In this example, the whole source text is matched by the text group.
The problem with your pattern is that [\s\S]+ matches any character, whitespace or not, one or more times, so the greedy text group consumes everything and leaves nothing for the other capture groups. With the help of a positive lookahead and alternation (|), we can make the last two capture groups optional.
The final pattern: (?<text>[a-zA-Z ]+)(?=$|(?<times>\d+)? \[id=(?<id>\d+)])
Group text will match any letters and spaces.
The lookahead avoids consuming characters: it requires either the end of the string, or an optional number followed by a space and [id=number].
That said, see regex101 for further explanation and some examples.
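As a hedged illustration, the lookahead pattern can be tried in Python, where named groups are spelled (?P<name>...) instead of (?<name>...). The helper parse_entry is hypothetical, and stripping the trailing space from text is an added convenience, not part of the pattern.

```python
import re

# Same idea as the pattern above, in Python's named-group syntax.
ENTRY = re.compile(
    r"(?P<text>[a-zA-Z ]+)(?=$|(?P<times>\d+)? \[id=(?P<id>\d+)\])"
)


def parse_entry(source):
    """Return (text, times, id); times and id are None when absent."""
    m = ENTRY.match(source)
    if m is None:
        return None
    # The text group may keep a trailing space, so strip it.
    return m.group("text").strip(), m.group("times"), m.group("id")


print(parse_entry("hello world 1 [id=10000]"))
print(parse_entry("hello world [id=5]"))
print(parse_entry("hello world"))
```

The groups inside the lookahead are still captured even though the lookahead itself consumes no characters.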
You could use:
:\s*(?<text>[^][:]+?)\s*(?<times>\d+)? \[id=(?<id>\d+)]
Explanation
: Match literally
\s* Match optional whitespace chars
(?<text> Group text
[^][:]+? match 1+ occurrences of any char except [ ] :
) Close group text
\s* Match optional whitespace chars
(?<times>\d+)? Optional group times, match 1+ digits
\[id= Match [id=
(?<id>\d+) Group id, match 1+ digits
] Match literally
I have more than a million lines of text in this format:
AAAA BBBBBBBBBBBBBBB CCCC
Separated by \t
I want to have it in a format
AAAA_CCCC BBBBBBBBBBBBBBB
But I cannot seem to figure out how to do it using regular expressions in Notepad++
You may try the following find and replace, in regex mode:
Find: ^(\S+)\t(\S+)\t(\S+)$
Replace: $1_$3 $2
Here is a demo.
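For a file that large, the same find and replace could also be scripted. The following is only a sketch in Python, assuming the three fields never contain whitespace and that the output separator should be a tab; swap_columns and its sep parameter are hypothetical names.

```python
import re

# Three tab-separated, whitespace-free fields on one line.
ROW = re.compile(r"^(\S+)\t(\S+)\t(\S+)$", re.MULTILINE)


def swap_columns(text, sep="\t"):
    """Rewrite 'A<TAB>B<TAB>C' as 'A_C<sep>B' on every matching line."""
    return ROW.sub(lambda m: f"{m[1]}_{m[3]}{sep}{m[2]}", text)


print(swap_columns("AAAA\tBBBBBBBBBBBBBBB\tCCCC"))
```

Lines that do not have exactly three such fields are left unchanged.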
If the separator is a tab, you can use
^[^\r\n\t]+\K\t([^\r\n\t]+)\t([^\r\n\t]+)$
The pattern matches:
^ Start of string
[^\r\n\t]+ Match 1+ chars other than a tab or newline
\K\t Forget what is matched so far using \K and match a tab
([^\r\n\t]+) Capture group 1, match any 1+ chars other than a newline or tab
\t Match a tab
([^\r\n\t]+) Capture group 2, match any 1+ chars other than a newline or tab
$ end of string
In the replacement use the 2 capture groups with an underscore in between.
_$2 $1
See a regex demo.
The result of the replacement:
AAAA_CCCC BBBBBBBBBBBBBBB
I have the following regular expression that extracts everything after the first two letters:
^([A-Za-z]{2})(\w+)($) $2
Now I want to extract nothing if the data doesn't start with letters.
Example:
AA123 -> 123
123 -> ""
Can this be accomplished by regex?
Introduce an alternative to match any one or more chars from start to end of string if your regex does not match:
^(?:([A-Za-z]{2})(\w+)|.+)$
See the regex demo. Details:
^ - start of string
(?: - start of a container non-capturing group:
([A-Za-z]{2})(\w+) - Group 1: two ASCII letters, Group 2: one or more word chars
| - or
.+ - one or more chars other than line break chars, as many as possible (use [\w\W]+ to match any chars including line break chars)
) - end of a container non-capturing group
$ - end of string.
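A small sketch of how the alternation behaves, in Python for illustration. The helper extract_tail is hypothetical; it returns group 2 when the first alternative matched and an empty string otherwise.

```python
import re

# Either two ASCII letters followed by word chars, or any non-empty line.
PATTERN = re.compile(r"^(?:([A-Za-z]{2})(\w+)|.+)$")


def extract_tail(value):
    """Return what follows the two leading letters, or '' if there are none."""
    m = PATTERN.match(value)
    if m is None:
        return ""
    # Group 2 is None when the fallback alternative (.+) matched.
    return m.group(2) or ""


print(repr(extract_tail("AA123")))
print(repr(extract_tail("123")))
```

The fallback .+ makes the whole line match in both cases, so a replacement with only group 2 leaves nothing behind for input that does not start with two letters.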
Your pattern already captures 1 or more word characters after matching 2 ASCII letters. The $ does not have to be in a group, and the $2 should not be in the pattern.
^([A-Za-z]{2})(\w+)$
See a regex demo.
Another option could be a pattern with a conditional, capturing data in group 2 only if group 1 exists.
^([A-Z]{2})?(?(1)(\w+)|.+)$
^ Start of string
([A-Z]{2})? Capture 2 uppercase chars in optional group 1
(? Conditional
(1)(\w+) If we have group 1, capture 1+ word chars in group 2
| Or
.+ Match the whole line with at least 1 char to not match an empty string
) Close conditional
$ End of string
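Python's re module also supports the (?(1)...|...) conditional, so the pattern can be tried there as a quick check. This is only a sketch; extract_if_prefixed is a hypothetical helper, and because the pattern uses [A-Z] a lowercase prefix is treated as no prefix.

```python
import re

# Group 1 is optional; the conditional picks a branch based on whether it matched.
CONDITIONAL = re.compile(r"^([A-Z]{2})?(?(1)(\w+)|.+)$")


def extract_if_prefixed(value):
    """Return group 2 when two uppercase letters lead the value, else ''."""
    m = CONDITIONAL.match(value)
    if m is None:
        return ""
    return m.group(2) or ""


for item in ("AA123", "123", "aa123"):
    print(item, "->", repr(extract_if_prefixed(item)))
```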
For a match only, you could use other variations: with \K, like ^[A-Za-z]{2}\K\w+$, or with a lookbehind assertion, like (?<=^[A-Za-z]{2})\w+$
Extract all the strings between 2 patterns:
Input:
test.output0 testx.output1 output3 testds.output2(\t)
Output:
output0 output1 output3 output2
Note: (\t) is the tab character.
You may try:
\.\w+$
Explanation of the above regex:
\. - Matches . literally. If you do not want the . to be included in the match, use (?<=\.) instead or simply remove the \. from the pattern.
\w+ - Matches a word character [A-Za-z0-9_] 1 or more times.
$ - Represents end of the line.
You can find the demo of the regex in here.
EDIT 2 by OP:
According to your latest edit, this might be helpful.
.*?\.?(\w+)(?=\t)
Explanation:
.*? - Match everything other than new line lazily.
\.? - Matches . literally zero or one time.
(\w+) - Represents a capturing group matching the word-characters one or more times.
(?=\t) - Represents a positive look-ahead matching tab.
$1 - In the replacement, $1 represents the captured group; follow it with a space to separate the output as you want, or use the replacement $1\t if you want to keep the tab.
Please find the demo of the above regex in here.
Try matching on the following pattern:
Find: (?<![^.\s])\w+(?!\S)
Here is an explanation of the above pattern:
(?<![^.\s]) assert that what precedes is either dot, whitespace, or the start of the input
\w+ match a word
(?!\S) assert that what follows is either whitespace or the end of the input
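To see the two assertions at work, here is a small Python sketch. The sample string follows the input from the question, with real tab characters as separators, and trailing_names is a hypothetical helper.

```python
import re

# A word that is preceded by a dot, whitespace or the start of the input,
# and followed by whitespace or the end of the input.
TOKEN = re.compile(r"(?<![^.\s])\w+(?!\S)")


def trailing_names(text):
    """Return the last dot-separated part of every whitespace-separated item."""
    return TOKEN.findall(text)


sample = "test.output0\ttestx.output1\toutput3\ttestds.output2\t"
print(" ".join(trailing_names(sample)))
```

A prefix such as test is rejected because it is followed by a dot, which is a non-whitespace character.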
I have this text:
text1 without brackets
text2 (with brackets)
and I need two groups in every line:
group#1: text1 without brackets
group#2:
group#1: text2
group#2: with brackets
Here is a link for this example: regexr.com
Thanks for help!
You may use
^(.*?)(?:\s*\(([^()]*)\))?$
See the regex demo.
Details
^ - start of string
(.*?) - Group 1: any 0+ chars, as few as possible
(?:\s*\(([^()]*)\))? - an optional sequence of patterns that is tried at least once:
\s* - 0+ whitespaces
\( - a ( char
([^()]*) - Group 2: 0+ chars other than ( and )
\) - a ) char
$ - end of the string.
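A minimal sketch of the pattern applied line by line, in Python for illustration. The helper split_line is hypothetical and assumes each input is a single line; group 2 is reported as an empty string when there are no brackets.

```python
import re

# Group 1: the text before an optional trailing "(...)"; group 2: its content.
LINE = re.compile(r"^(.*?)(?:\s*\(([^()]*)\))?$")


def split_line(line):
    """Return (text, bracket_content) for one line of input."""
    m = LINE.match(line)
    if m is None:
        # Only possible for input with embedded newlines.
        return line, ""
    return m.group(1), m.group(2) or ""


for line in ("text1 without brackets", "text2 (with brackets)"):
    print(split_line(line))
```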
Try pattern: ([^(\n]+)(?:\n|\(([^)]+))
Explanation:
([^(\n]+) - first capturing group: match one or more characters other than ( or \n so it will match everything until opening bracket or newline character
(?:...) - used in order to make use of alternation and not create second capturing group
\n|\(([^)]+) - match a newline, or an opening bracket ( followed by one or more characters other than the closing bracket ), storing them in the second capturing group.