I am trying to split a string, str1, shown below:
Str1
([Title]
[Lorem Ipsum is simply dummy text of the printing and typesetting industry.]
[Lorem Ipsum has been the industry's standard dummy text ever since the 1500s]
[when an unknown printer took a galley of type and scrambled it to make a type specimen book.]
[It has survived not only five centuries, but also the leap into electronic typesetting,])
([Title2]
[Lorem Ipsum is simply dummy text of the printing and typesetting industry.]
[Lorem Ipsum has been the industry's standard dummy text ever since the 1500s]
[when an unknown printer took a galley of type and scrambled it to make a type specimen book.]
[It has survived not only five centuries, but also the leap into electronic typesetting,])
into two groups, delimited by the parentheses, like this:
group1
([Title]
[Lorem Ipsum is simply dummy text of the printing and typesetting industry.]
[Lorem Ipsum has been the industry's standard dummy text ever since the 1500s]
[when an unknown printer took a galley of type and scrambled it to make a type specimen book.]
[It has survived not only five centuries, but also the leap into electronic typesetting,])
group2
([Title2]
[Lorem Ipsum is simply dummy text of the printing and typesetting industry.]
[Lorem Ipsum has been the industry's standard dummy text ever since the 1500s]
[when an unknown printer took a galley of type and scrambled it to make a type specimen book.]
[It has survived not only five centuries, but also the leap into electronic typesetting,])
Eventually I will split each group at the square brackets into smaller strings.
I already managed to split a group into smaller strings with this regex:
final regex2 = RegExp(r'\[(.*)\]');
but I cannot manage to split the big string (str1) into groups.
I would be very grateful for any help with this problem.
By the way, I tried
final regex1 = RegExp(r'\((.*)\)');
and it did not work.
Edit: OK, I found the answer, which is
final regex1 = RegExp(r'\((.*?)\)',dotAll: true);
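The fix above works for two reasons: by default `.` does not match line breaks, so the first attempt could never span a multi-line group, and the lazy `.*?` stops at the first closing parenthesis instead of the last one. A minimal sketch of the same idea in Python rather than Dart, with `re.DOTALL` playing the role of `dotAll: true`; the shortened sample text and the variable names are made up for illustration, and the inner pattern is written lazily here, which is a small departure from the original `\[(.*)\]`:

```python
import re

# Shortened stand-in for str1: two parenthesised groups spanning several lines.
text = (
    "([Title]\n[line one]\n[line two])\n"
    "([Title2]\n[line three]\n[line four])"
)

# Without DOTALL, '.' cannot cross a newline, so no group is found.
single_line_attempt = re.findall(r"\((.*)\)", text)

# Lazy quantifier + DOTALL: one match per parenthesised group.
groups = re.findall(r"\((.*?)\)", text, flags=re.DOTALL)

# Second step: split each group at the square brackets.
items = [re.findall(r"\[(.*?)\]", group) for group in groups]

print(single_line_attempt)
print(items)
```

This assumes the text inside a group never contains a literal `)`; if it can, the lazy match would end too early.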
I can't find a regex to match a number in parentheses, such as '(number)', in a string.
For example, I would like to get (100), (2000), ... from the following string:
Lorem Ipsum is simply dummy text (100) of the printing and typesetting
industry. Lorem Ipsum has been the (2000) industry's standard dummy
text ever since the 1500s, when an unknown printer took a galley of
type and scrambled it to (40) make a type specimen book. It has
survived not only five centuries, but also the leap into electronic
typesetting, remaining essentially unchanged. It was popularized in
the 1960s with the release of Letraset sheets containing Lorem Ipsum
passages, and more recently with desktop publishing software like
Aldus PageMaker including versions of Lorem Ipsum.
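One possible approach, sketched in Python because the question does not say which language is in use: escape the parentheses and require one or more digits between them. The sample text is abbreviated and the variable names are illustrative only.

```python
import re

text = (
    "Lorem Ipsum is simply dummy text (100) of the printing and typesetting "
    "industry. Lorem Ipsum has been the (2000) industry's standard dummy "
    "text ever since the 1500s, when an unknown printer took a galley of "
    "type and scrambled it to (40) make a type specimen book."
)

# \( and \) match literal parentheses; \d+ matches one or more digits.
with_parens = re.findall(r"\(\d+\)", text)

# A capture group returns only the digits, without the parentheses.
digits_only = re.findall(r"\((\d+)\)", text)

print(with_parens)
print(digits_only)
```

Bare numbers such as 1500 are skipped because they are not wrapped in parentheses.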
I am having trouble removing the number following an empty line using Regex. Here's the sample paragraph that I have:
1
- Lorem Ipsum is simply dummy text of
2
the printing and typesetting industry.
49
and more recently with desktop publishing software like Aldus PageMaker.
I need to remove all the numbers from the beginning of the sentence as well as the empty lines:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. and more recently with desktop publishing software like Aldus PageMaker.
The regex I came up with is [\n](.), but it removes only a single digit.
The difficult part is that the numbers do not necessarily have just 1 or 2 digits. How do I tackle this problem?
Replace every match of the following regex with an empty string (multiline mode must be on so that ^ matches at the start of each line):
^\d*\n
See live demo.
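A minimal sketch of that replacement in Python, assuming the paragraph is held in one string with `\n` line endings. Because `\d*` also matches zero digits, the same pattern removes empty lines as well as number-only lines. The final join into a single line is an extra step added here to mirror the desired output.

```python
import re

text = (
    "1\n"
    "- Lorem Ipsum is simply dummy text of\n"
    "2\n"
    "the printing and typesetting industry.\n"
    "\n"
    "49\n"
    "and more recently with desktop publishing software like Aldus PageMaker.\n"
)

# MULTILINE makes ^ match at the start of every line, not only at the start of the string.
cleaned = re.sub(r"^\d*\n", "", text, flags=re.MULTILINE)

# Optional: join the remaining lines into one paragraph.
one_line = " ".join(cleaned.splitlines())

print(cleaned)
print(one_line)
```

Note that the leading "- " on the first line is left alone; removing it would need a separate step.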
Hello, I am trying to use BigQuery to extract the 7-digit numbers 2670782 and 2670788
from this data.
The desc field data is below:
is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type 8888888 specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing 8888888 software like Aldus PageMaker including versions of Lorem Ipsum.
>> https://hello.com/pudding/answer/2670782?hl=en&ref_topic=7072943
>> https://hello.com/pudding/answer/2670788?hl=en&ref_topic=7072943
I have a query, but the data also contains 7-digit numbers other than 2670782 and 2670788. So I first want to check that the line starts with ">>" and includes "hello.com", and only then extract the number.
Here is the query that I have, but it also grabs 8888888, which it is not supposed to.
SELECT
  desc,
  REGEXP_EXTRACT_ALL(desc, r"\/(\d{7})") AS num
FROM
  `table`
WHERE
  REGEXP_CONTAINS(DESCRIPTION, r"(>> )")
  AND REGEXP_CONTAINS(desc, r"(hello.com)")
I believe I need to check, in a single regex, that the line starts with >> and contains hello.com, and then extract the 7-digit number after the /. I am stuck, so
any help would be much appreciated!
You can use this regex if each of your inputs is a single line:
^>>.+hello.com.+\/(\d{7})
I tested this regex on regex101.com with your input, under that single-line assumption.
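For illustration, here is how that pattern behaves on multi-line input, sketched in Python rather than BigQuery. The `re.MULTILINE` flag lets `^` anchor at the start of each line; the shortened sample text is made up, and the dot in `hello\.com` is escaped here so that it matches only a literal dot, which is a small tightening of the pattern above.

```python
import re

desc = (
    "dummy text 8888888 specimen book, see /8888888 for more\n"
    ">> https://hello.com/pudding/answer/2670782?hl=en&ref_topic=7072943\n"
    ">> https://hello.com/pudding/answer/2670788?hl=en&ref_topic=7072943\n"
)

# The line must start with ">>", contain "hello.com", and the captured
# 7 digits must follow a "/" later on the same line.
pattern = re.compile(r"^>>.+hello\.com.+/(\d{7})", re.MULTILINE)
nums = pattern.findall(desc)

print(nums)
```

The first line is ignored even though it contains 7-digit runs, because it does not start with ">>".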
UPDATE:
You can replace each ">>" with a newline character, then use the regex below to extract the number:
hello.com.+\/(\d{7})
Here is the example:
WITH
sample AS (
SELECT
'''start here not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing 8888888 software like Aldus PageMaker including versions of Lorem Ipsum. >> hello.com/pudding/answer/2670782?hl=en&ref_topic=7072943 >> hello.com/pudding/answer/2670788?hl=en&ref_topic=7072943
''' AS txt
UNION ALL
SELECT
'''
is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type 8888888 specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing 8888888 software like Aldus PageMaker including versions of Lorem Ipsum.
>> https://hello.com/pudding/answer/2670786?hl=en&ref_topic=7072943
>> https://hello.com/pudding/answer/2670785?hl=en&ref_topic=7072943
'''),
sample_new_line AS (
SELECT
REGEXP_REPLACE(txt, '>>', '\n') AS txt
FROM
sample)
SELECT
REGEXP_EXTRACT_ALL(txt, r"hello.com.+\/(\d{7})") AS num
FROM
sample_new_line;
The file I am loading is an XML file. After I get to the specific path, I want to extract specific strings line by line and add them to a list.
These are examples of the strings that I want to extract from the file (highlighted in bold text):
<XMLTAG>
<p><b>[1]</b> Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum <LINK HREF="test"><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123; [1234] 1 ABC 123:</p>
<p><b>[2]</b> <LINK HREF="test"><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123 lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.</p>
</XMLTAG>
I want to extract [1234] 1 ABC 123; [1234] 1 ABC 123 and insert it on the same line as the first entry in the list.
This is the code I use for the extraction:
Private Function Slist(ByVal list As List(Of String)) As List(Of String)
    Dim rlist As List(Of String) = New List(Of String)
    Dim temp As String = ""
    For i As Integer = 0 To list.Count - 1 Step 1
        If i = 0 Then ' first item: start a new entry
            temp = list(i).ToString
        ElseIf i = list.Count - 1 Then ' last item: flush whatever is pending
            If list(i).ToString.Contains("<i>") Then
                rlist.Add(temp)
                temp = list(i).ToString
                rlist.Add(temp)
            Else
                temp = temp & "; " & list(i).ToString
                rlist.Add(temp)
            End If
        Else ' items in between
            If list(i).ToString.Contains("<i>") Then
                rlist.Add(temp)
                temp = list(i).ToString
            Else
                temp = temp & "; " & list(i).ToString
            End If
        End If
    Next
    Return rlist.Distinct.ToList
End Function
I don't know of any other way to extract these strings, so I looked into using regex. Below is the sample pattern I came up with:
\[\d{4}\]\s\d{1,3}\s\w{3}\s\d{1,3}
Still, I'm stuck on how to implement it in my code. Can anyone help me with this? Thank you.
Please consider the following snippet, which applies your regex and might help you solve your problem.
Private Function Slist(ByVal list As List(Of String)) As List(Of String)
    Dim rlist As New List(Of String)
    Dim rx As New System.Text.RegularExpressions.Regex("\[\d{4}\]\s\d{1,3}\s\w{3}\s\d{1,3}")
    For Each item As String In list
        If rx.IsMatch(item) Then
            rlist.Add(rx.Match(item).Value)
        End If
    Next
    Return rlist.Distinct.ToList
End Function
Good luck.
Edit 1
Let's make things clear. If you prefer the regex approach, consider the following:
Suppose that you have the following lines:
Dim s1 As String = "<p><b>[1]</b> Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum <LINK HREF=""test""><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123; [4567] 2 DEF 456:</p>"
Dim s2 As String = "<p><b>[2]</b> <LINK HREF=""test""><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123 lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.</p>"
If you match them against the pattern:
Dim pat1 As String = "\[\d+\]\s\d\s\D{3}\s\d+"
For Each m As Match In Regex.Matches(s1, pat1)
    Console.WriteLine(m.Value)
Next
For Each m As Match In Regex.Matches(s2, pat1)
    Console.WriteLine(m.Value)
Next
The output will be 3 matches from both strings (s1 and s2):
[1234] 1 ABC 123
[4567] 2 DEF 456
[1234] 1 ABC 123
Whereas the output of the pattern:
Dim pat2 As String = "(\[\d+\]\s\d\s\D{3}\s\d+.\s\[\d+\]\s\d\s\D{3}\s\d+)"
For Each m As Match In Regex.Matches(s1, pat2)
    Console.WriteLine(m.Value)
Next
For Each m As Match In Regex.Matches(s2, pat2)
    Console.WriteLine(m.Value)
Next
will be just one match from the first string (s1):
[1234] 1 ABC 123; [4567] 2 DEF 456
Hope it's clear now. Good luck.
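The difference between the two patterns can also be checked outside VB.NET. The following is a rough Python sketch of the same comparison, using shortened versions of s1 and s2 (the trimmed strings are an assumption made to keep the example small). It shows why "[2017] 3 ABCD 247" is never matched: `\D{3}` consumes "ABC" and the following `\s` then fails on "D".

```python
import re

# Shortened stand-ins for s1 and s2; only the relevant tail of each line is kept.
s1 = '<LINK HREF="test"><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123; [4567] 2 DEF 456:</p>'
s2 = '<LINK HREF="test"><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123 lorem Ipsum is simply dummy text.</p>'

# Single citation: [digits] digit three-non-digits digits
pat1 = r"\[\d+\]\s\d\s\D{3}\s\d+"
# Two citations joined by one separator character and a space
pat2 = pat1 + r".\s" + pat1

single_s1 = re.findall(pat1, s1)
single_s2 = re.findall(pat1, s2)
pair_s1 = re.findall(pat2, s1)
pair_s2 = re.findall(pat2, s2)

print(single_s1, single_s2)
print(pair_s1, pair_s2)
```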
Your XML is malformed, but a regular expression is not required here. You are essentially asking for all the text after the last occurrence of the keyword </LINK>. With that in mind, here is a function that does just that:
Private Function Slist(ByVal lines() As String) As IEnumerable(Of String)
    Const closeTag As String = "</LINK>"
    ' Keep only lines containing the closing tag, then take the text after its last occurrence.
    Return From line As String In lines
           Where line.IndexOf(closeTag) > -1
           Select line.Substring(line.LastIndexOf(closeTag) + closeTag.Length).Trim()
End Function
You would pass in the lines from your XML file, read with File.ReadAllLines, something like the following:
Dim _slist As IEnumerable(Of String) = Slist(IO.File.ReadAllLines("your-path-here.xml"))
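The same "everything after the last </LINK>" idea can be sketched without any regex in Python as well, shown here only to illustrate the technique; the function name and the sample lines are hypothetical. `str.rpartition` splits at the last occurrence of the tag, so no index arithmetic is needed.

```python
def text_after_last_link(lines):
    """Return the text following the last </LINK> on each line that has one."""
    tag = "</LINK>"
    # rpartition returns (before, tag, after); index 2 is the part after the last tag.
    return [line.rpartition(tag)[2].strip() for line in lines if tag in line]


sample = [
    '<p><LINK HREF="test"><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123; [4567] 2 DEF 456:</p>',
    "<p>no link on this line</p>",
]
tails = text_after_last_link(sample)
print(tails)
```

As with the function above, the result still contains whatever trails the citation (here ":</p>"), so further trimming may be needed.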
I have the text below.
^0001 HeadOne
##
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been theindustry's standard dummy text ever since the 1500s, when an unknown printer took a galley of typeand scrambled it to make a type specimen book.
^0002 HeadTwo
##
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been theindustry's standard dummy text ever since the 1500s, when an unknown printer took a galley of typeand scrambled it to make a type specimen book.
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been theindustry's standard dummy text ever since the 1500s, when an unknown printer took a galley of typeand scrambled it to make a type specimen book.
^004 HeadFour
##
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
^0004 HeadFour
##
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been theindustry's standard dummy text ever since the 1500s, when an unknown printer took a galley of typeand scrambled it to make a type specimen book.
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been theindustry's standard dummy text ever since the 1500s, when an unknown printer took a galley of typeand scrambled it to make a type specimen book.
Below is the regex I'm using in Find:
##([\n\r\s]*)(.*)([\n\r\s]+)\^
But this catches only ^0001 and ^0003, as these have only one paragraph, while my text also has sections with multiple paragraphs.
I'm using VS Code. Can someone please let me know how I can capture such multi-paragraph strings using regex in VS Code or Notepad++?
Thanks
One weird thing about VSCode regex is that \s does not match all line break chars. One needs to use [\s\r] to match all of them.
Keeping that in mind, you want to match all substrings that start with ## and then stretch up to a ^ at the start of a line or end of string.
I suggest:
##.*(?:[\n\r]+(?!\s*\^).*)*
See the regex demo
NOTE: To only match ## at the start of a line, add ^ at the start of the pattern, ^##.*(?:[\s\r]+(?!\s*\^).*)*.
NOTE 2: Starting with VSCode 1.29, you need to enable search.usePCRE2 option to enable lookaheads in your regex patterns.
Details
^ - start of a line
## - a literal ##
.* - the rest of the line (0+ chars other than line break chars)
(?:[\n\r]+(?!\s*\^).*)* - 0 or more consecutive occurrences of:
  [\n\r]+(?!\s*\^) - one or more line breaks not followed by 0+ whitespace chars and then a ^ char
  .* - the rest of the line
In Notepad++, use ^##.*(?:\R(?!\h*\^).*)* where \R matches a line break, and \h* matches 0 or more horizontal whitespaces (remove if ^ is always the first char on a delimiting line).
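To see the lookahead-based pattern at work outside an editor, here is a small Python sketch. It assumes `\n` line endings and uses a made-up, shortened version of the sample text; `re.MULTILINE` stands in for the editor's line-anchored `^`.

```python
import re

text = "\n".join([
    "^0001 HeadOne",
    "##",
    "First paragraph.",
    "^0002 HeadTwo",
    "##",
    "Second paragraph, part one.",
    "",
    "Second paragraph, part two.",
    "^0004 HeadFour",
    "##",
    "Last paragraph.",
])

# "##" at the start of a line, then any following lines as long as the next
# line does not begin (after optional whitespace) with "^".
block = re.compile(r"^##.*(?:[\r\n]+(?!\s*\^).*)*", re.MULTILINE)
bodies = [m.group(0) for m in block.finditer(text)]

for body in bodies:
    print(repr(body))
```

Each match starts at a ## line and runs up to, but not including, the line break before the next ^ heading, so multi-paragraph sections come out as a single match.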
I plugged your input data into /tmp/test and got the following to work using Perl syntax:
grep -Pzo "##(?:\s*\n)+((?:.*\s*\n)+)(?:\^.*)*\n+" /tmp/test
This should place the paragraphs not starting with ^ into $1. You may need to add \r back into this to make it match perfectly.