How to extract strings and add them to a list? - regex

The file I am going to load is an XML file. After I get to the specific path, I want to extract specific strings line by line and then add them to a list.
Here is an example of the strings that I want to extract from the file (highlighted in bold).
<XMLTAG>
<p><b>[1]</b> Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum <LINK HREF="test"><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123; [1234] 1 ABC 123:</p>
<p><b>[2]</b> <LINK HREF="test"><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123 lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.</p>
</XMLTAG>
I want to extract [1234] 1 ABC 123; [1234] 1 ABC 123 and add it, on a single line, as the first entry in the list.
This is the code I use for the extraction:
Private Function Slist(ByVal list As List(Of String)) As List(Of String)
    Dim rlist As List(Of String) = New List(Of String)
    Dim temp As String = ""
    For i As Integer = 0 To list.Count - 1 Step 1
        If i = 0 Then ' first item
            temp = list(i).ToString
        ElseIf i = list.Count - 1 Then ' last item
            If list(i).ToString.Contains("<i>") Then
                rlist.Add(temp)
                temp = list(i).ToString
                rlist.Add(temp)
            Else
                temp = temp & "; " & list(i).ToString
                rlist.Add(temp)
            End If
        Else ' second item up to the one before last
            If list(i).ToString.Contains("<i>") Then
                rlist.Add(temp)
                temp = list(i).ToString
            Else
                temp = temp & "; " & list(i).ToString
            End If
        End If
    Next
    Return rlist.Distinct.ToList
End Function
I don't know of any other way to extract the strings. I came across regex, and below is a sample pattern I could come up with:
\[\d{4}\]\s\d{1,3}\s\w{3}\s\d{1,3}
Still, I'm stuck on how to implement it in my code. Can anyone help me with this? Thank you.

Please consider the following code snippet, which uses Regex and might help you solve your problem.
Private Function Slist(ByVal list As List(Of String)) As List(Of String)
    Dim rlist As New List(Of String)
    Dim rx As New System.Text.RegularExpressions.Regex("\[\d{4}\]\s\d{1,3}\s\w{3}\s\d{1,3}")
    For Each item As String In list
        ' Note: Match returns only the first match in each item.
        If rx.IsMatch(item) Then
            rlist.Add(rx.Match(item).Value)
        End If
    Next
    Return rlist.Distinct.ToList
End Function
Good luck.
Edit 1
Let's make things clear. If you prefer Regex, consider the following.
Suppose that you have the following lines:
Dim s1 As String = "<p><b>[1]</b> Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum <LINK HREF=""test""><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123; [4567] 2 DEF 456:</p>"
Dim s2 As String = "<p><b>[2]</b> <LINK HREF=""test""><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123 lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.</p>"
If you apply the pattern:
' Requires: Imports System.Text.RegularExpressions
Dim pat1 As String = "\[\d+\]\s\d\s\D{3}\s\d+"
For Each m As Match In Regex.Matches(s1, pat1)
    Console.WriteLine(m.Value)
Next
For Each m As Match In Regex.Matches(s2, pat1)
    Console.WriteLine(m.Value)
Next
The output will be 3 matches from both strings (s1 and s2):
[1234] 1 ABC 123
[4567] 2 DEF 456
[1234] 1 ABC 123
Whereas the output of the pattern:
Dim pat2 As String = "(\[\d+\]\s\d\s\D{3}\s\d+.\s\[\d+\]\s\d\s\D{3}\s\d+)"
For Each m As Match In Regex.Matches(s1, pat2)
    Console.WriteLine(m.Value)
Next
For Each m As Match In Regex.Matches(s2, pat2)
    Console.WriteLine(m.Value)
Next
will be just one match from the first string (s1):
[1234] 1 ABC 123; [4567] 2 DEF 456
Hope it's clear now. Good luck.
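For comparison, here is a minimal sketch of the same idea in Python rather than VB.NET: collect every match of the question's pattern on a line and join them with "; ", so that each line contributes a single list entry. The function name extract_citations and the shortened sample lines are made up for illustration.

```python
import re

# Pattern from the question: [dddd], 1-3 digits, exactly 3 word characters, 1-3 digits.
CITATION = re.compile(r"\[\d{4}\]\s\d{1,3}\s\w{3}\s\d{1,3}")


def extract_citations(lines):
    """Return one "; "-joined entry per line that contains at least one match."""
    result = []
    for line in lines:
        matches = CITATION.findall(line)
        if matches:
            result.append("; ".join(matches))
    return result


# Shortened versions of the two sample lines (hypothetical test data).
lines = [
    '<p><b>[1]</b> text <LINK HREF="test"><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123; [4567] 2 DEF 456:</p>',
    '<p><b>[2]</b> <LINK HREF="test"><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123 lorem ipsum.</p>',
]
print(extract_citations(lines))
```

Note that [2017] 3 ABCD 247 is skipped, because \w{3} must be followed by whitespace and ABCD has four letters.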

Your XML is malformed, but regular expressions are not required here. You are essentially asking for all the text after the last occurrence of </LINK>. With that being said, here is a function that does just that:
Private Function Slist(ByVal lines() As String) As IEnumerable(Of String)
    ' Keep only the lines that contain </LINK>, then take everything after its last occurrence.
    Return From line As String In lines
           Where line.IndexOf("</LINK>") > -1
           Select line.Substring(line.LastIndexOf("</LINK>") + "</LINK>".Length).Trim()
End Function
You would pass in the lines from your XML file via ReadAllLines, something like the following:
Dim _slist As IEnumerable(Of String) = Slist(IO.File.ReadAllLines("your-path-here.xml"))
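The same "everything after the last </LINK>" idea can be sketched in Python with str.rpartition. This is only an illustration; the function name text_after_last_link and the sample list are assumptions, not part of the original code.

```python
def text_after_last_link(lines, marker="</LINK>"):
    """Yield the text that follows the last marker on each line containing it."""
    for line in lines:
        head, sep, tail = line.rpartition(marker)
        if sep:  # sep is empty when the marker does not occur in the line
            yield tail.strip()


# Hypothetical sample: one line without the marker, one with it.
sample = [
    "<XMLTAG>",
    '<p><b>[2]</b> <LINK HREF="test"><i>NAME</i> [2017] 3 ABCD 247</LINK> [1234] 1 ABC 123 lorem</p>',
]
print(list(text_after_last_link(sample)))
```

The result still contains whatever trails the citation (the rest of the sentence and the closing </p>), so this approach only helps if that remainder is acceptable or is trimmed afterwards.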

Related

Regex to Remove Empty Line and Number

I am having trouble removing the number following an empty line using Regex. Here's the sample paragraph that I have:
1
- Lorem Ipsum is simply dummy text of
2
the printing and typesetting industry.
49
and more recently with desktop publishing software like Aldus PageMaker.
I need to remove all the numbers from the beginning of the sentence as well as the empty lines:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. and more recently with desktop publishing software like Aldus PageMaker.
This is the regex that I came up with: [\n](.), but it can only remove a single digit.
The difficult part is removing the number, because it is not necessarily 1 or 2 digits long. How do I tackle this problem?
Replace every match of the following regex with an empty string (with the multiline flag enabled, so that ^ matches at the start of each line):
^\d*\n
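A minimal Python sketch of that replacement, assuming each number sits on its own line next to an empty line (the exact position of the blank lines in the sample is an assumption):

```python
import re

# Hypothetical input: blank lines and number-only lines around the real text.
text = (
    "\n1\n- Lorem Ipsum is simply dummy text of\n"
    "\n2\nthe printing and typesetting industry.\n"
    "\n49\nand more recently with desktop publishing software like Aldus PageMaker.\n"
)

# re.MULTILINE makes ^ match at the start of every line, so the pattern removes
# lines that are empty or consist only of digits, whatever the number of digits.
cleaned = re.sub(r"^\d*\n", "", text, flags=re.MULTILINE)
print(cleaned)
```

This only deletes the empty and number-only lines. It does not strip the leading "- " or join the remaining lines into one paragraph.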

Flutter dart split a string into smaller groups with regex

I am trying to split the string str1 shown below:
Str1
([Title]
[Lorem Ipsum is simply dummy text of the printing and typesetting industry.]
[Lorem Ipsum has been the industry's standard dummy text ever since the 1500s]
[when an unknown printer took a galley of type and scrambled it to make a type specimen book.]
[It has survived not only five centuries, but also the leap into electronic typesetting,])
([Title2]
[Lorem Ipsum is simply dummy text of the printing and typesetting industry.]
[Lorem Ipsum has been the industry's standard dummy text ever since the 1500s]
[when an unknown printer took a galley of type and scrambled it to make a type specimen book.]
[It has survived not only five centuries, but also the leap into electronic typesetting,])
into two groups, delimited by the parentheses, like this:
group1
([Title]
[Lorem Ipsum is simply dummy text of the printing and typesetting industry.]
[Lorem Ipsum has been the industry's standard dummy text ever since the 1500s]
[when an unknown printer took a galley of type and scrambled it to make a type specimen book.]
[It has survived not only five centuries, but also the leap into electronic typesetting,])
group2
([Title2]
[Lorem Ipsum is simply dummy text of the printing and typesetting industry.]
[Lorem Ipsum has been the industry's standard dummy text ever since the 1500s]
[when an unknown printer took a galley of type and scrambled it to make a type specimen book.]
[It has survived not only five centuries, but also the leap into electronic typesetting,])
Eventually I will split each group into smaller strings, delimited by the square brackets.
I managed to split the groups into smaller strings with this regular expression:
final regex2 = RegExp(r'\[(.*)\]');
but I cannot manage to split the big string (str1) into groups.
I would be very grateful for any help with this problem.
By the way, I tried
final regex1 = RegExp(r'\((.*)\)');
and it did not work.
Edit: OK, I found the answer, which is
final regex1 = RegExp(r'\((.*?)\)',dotAll: true);
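To illustrate why the lazy quantifier together with dotAll fixes this, here is a rough Python equivalent (re.DOTALL plays the role of dotAll: true). The shortened sample text is made up for the example.

```python
import re

# Shortened stand-in for str1 (hypothetical content).
str1 = """([Title]
[first line of text]
[second line of text])
([Title2]
[third line of text]
[fourth line of text])"""

# The lazy .*? stops at the first ")", and re.DOTALL lets "." cross newlines,
# so each parenthesised block becomes its own match.
groups = re.findall(r"\((.*?)\)", str1, flags=re.DOTALL)

# Inside a group every bracketed item sits on its own line, so the greedy
# pattern works here because "." does not cross newlines by default.
items = [re.findall(r"\[(.*)\]", group) for group in groups]
print(items)
```

Without the dot-all flag "." cannot cross the line breaks inside a group, and with a greedy .* plus dot-all the match would run from the first "(" to the last ")".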

scala regex to limit with double space

I have data like the following:
135 stjosephhrsecschool  london  DunAve
175865 stbele_higher_secondary sch  New York
11 st marys high school for women  Paris  Louis Avenue
I want to extract id, school name, city and area.
The pattern is: an id (digits), followed by a single space, then the school name. The name can have multiple words separated by single spaces, and it may contain special characters. Then come two or more spaces, followed by the city. The city may also have multiple words separated by single spaces, and may contain special characters. Then come two or more spaces, followed by the area, which follows the same rules as the school name and city. The area may or may not be present in the line. If it is not, I want a null value for the area.
Here is regex I have tried.
([\d]+) ([\w\s\S]+)\s\s+([\w\s\S]+)\s\s+([\w\s\S]*)
But this regex does not stop when it sees two or more spaces. I'm not sure how to modify it to fit my data.
Any help is appreciated.
Thanks
If I understand your issue correctly, the problem is that the resulting groups contain trailing spaces (e.g. "Louis Avenue "). If so, you can fix this by using non-greedy quantifiers like +? and *?:
([\d]+) ([\w\s\S]+?)\s\s+([\w\s\S]+?)\s\s+([\w\s\S]*?)?\s*
Which results in what seems to be the desired output:
val s1 = "135 stjosephhrsecschool  london  DunAve"
val s2 = "175865 stbele_higher_secondary sch  New York  "
val s3 = "11 st marys high school for women  Paris  Louis Avenue "
val r = """([\d]+) ([\w\s\S]+?)\s\s+([\w\s\S]+?)\s\s+([\w\s\S]*?)?\s*""".r
def matching(s: String) = s match {
  case r(a,b,c,d) => println((a,b,c,d))
  case _ => println("no match")
}
matching(s1) // (135,stjosephhrsecschool,london,DunAve)
matching(s2) // (175865,stbele_higher_secondary sch,New York,)
matching(s3) // (11,st marys high school for women,Paris,Louis Avenue)
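As an alternative to one large pattern, the line can also be split on runs of two or more spaces. The following is a minimal Python sketch of that different technique, not a translation of the Scala answer. The function name parse_record is hypothetical, and it assumes the fields are separated exactly as the question describes.

```python
import re


def parse_record(line):
    """Split 'id name  city  [area]' into a 4-tuple; area is None when absent."""
    m = re.match(r"(\d+) (.+)", line.strip())
    if m is None:
        return None
    # Two or more whitespace characters separate name, city and area.
    fields = re.split(r"\s{2,}", m.group(2))
    if len(fields) < 2:
        return None
    area = fields[2] if len(fields) > 2 else None
    return (m.group(1), fields[0], fields[1], area)


print(parse_record("135 stjosephhrsecschool  london  DunAve"))
print(parse_record("175865 stbele_higher_secondary sch  New York"))
print(parse_record("11 st marys high school for women  Paris  Louis Avenue"))
```

Unlike the regex above, this version does not need trailing spaces after the city when the area is missing, and it returns None for the area, which is what the question asks for.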

regex how to find two or more identical lines with one line between in a file

I have some files and must find identical lines that start with "abc" and have exactly one line between them.
lorem
abcdefg
lorem
abcdefg
lorem
lorem
abcdefg
abcdefg
lorem
lorem
In this sample, lines 2 and 4 should match, but not lines 4 and 7, and not lines 7 and 8. Is it possible?
Since you don't say which language you are using, I would do something like:
abc([^\n]+)\n[^\n]*\nabc(\1)
which checks for:
The letters abc.
A captured group of one or more characters that are not newlines.
The newline character.
One complete line (the line in between).
The newline character.
The letters abc again, followed by the content of the first captured group.
Check whether this is available in your language:
http://www.regular-expressions.info/refext.html (for instance in .NET this is not valid).
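As a quick check of the pattern, here is a small Python sketch that runs it over the sample from the question and reports the line numbers of each matching pair. The helper name matching_line_pairs is made up for illustration.

```python
import re

# The sample from the question, one entry per line.
text = "lorem\nabcdefg\nlorem\nabcdefg\nlorem\nlorem\nabcdefg\nabcdefg\nlorem\nlorem\n"

# Same pattern as above; \1 refers back to the text captured after the first "abc".
pattern = re.compile(r"abc([^\n]+)\n[^\n]*\nabc(\1)")


def matching_line_pairs(text):
    """Return (first_line, second_line) numbers, 1-based, for every match."""
    pairs = []
    for m in pattern.finditer(text):
        first = text.count("\n", 0, m.start()) + 1
        pairs.append((first, first + 2))
    return pairs


print(matching_line_pairs(text))
```

The pattern is not anchored, so it could also match in the middle of a line. Adding ^ and $ with the multiline flag would tighten it if that matters for your files.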

Regex- First instance of Time in a string

I am trying to build a regular expression that would pull the first time out of a string.
The issue is the time format is not standardized.
Here are the possible variations.
':' with 1 hour digit before the ':' (ex. 9:00 pm)
':' with 2 hour digits before the ':' (ex. 10:00pm)
no minutes with 1 hour digit (ex. 9pm)
no minutes with 2 hour digits (ex. 10pm)
Additionally there may or may not be a space before "am" or "pm"
Here is an example string.
7:30 pm -9 pm Lorem Ipsum is simply dummy text. 9pm-10pm Lorem Ipsum is simply dummy text
I would like this string to return "7:30 pm"
You did not specify the tool you want to use, so here is a simple implementation using sed (GNU sed, since \? and the i flag are GNU extensions):
echo '7:30 pm -9 pm Lorem Ipsum is simply dummy text. 9pm-10pm Lorem Ipsum is simply dummy text' | sed 's/\([0-2]\?[0-9]\(:[0-5][0-9]\)\? *[ap]m\).*/\1/i'
Legend:
'[0-2]\?[0-9]' match the hour (with 1 or 2 digits)
'\(:[0-5][0-9]\)\?' match the minutes (optional)
' *' optional spaces
'[ap]m' match am,pm,AM,PM (also Am,aM,pM,Pm)*
'.*' match all the rest of the string
In addition: the outer \(...\) creates a group containing all the elements above (a backreference), which is used later as \1 in the substitution part of the regex.
*: The final /i modifier makes the regex case insensitive.
You can rewrite it all as a standard Perl regex:
/(?i)[0-2]?\d(?::[0-5]\d)?\s*[ap]m/
A little Ruby code:
#!/usr/bin/env ruby
input = "7:30 pm -9 pm Lorem Ipsum is simply dummy text. 9pm-10pm Lorem Ipsum is simply dummy text"
puts input[/(?i)[0-2]?\d(?::[0-5]\d)?\s*[ap]m/]
Try this regex:
(?i)\d{1,2}(?::\d{2})?\s*[ap]m
Explaining:
(?i) # insensitive case
\d{1,2} # one or two digits
(?: # optional group
:\d{2} # the minutes
)? # end optional group
\s* # any spaces
[ap]m # "am" or "pm"
Hope it helps.
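For illustration, here is a minimal Python sketch that applies this pattern to the example string. re.search returns the first match scanning from the left, which is what the question asks for. The function name first_time is hypothetical.

```python
import re

# One or two hour digits, optional ":mm", optional spaces, then am/pm in any case.
TIME = re.compile(r"\d{1,2}(?::\d{2})?\s*[ap]m", re.IGNORECASE)


def first_time(text):
    """Return the first time-like substring, or None if there is none."""
    m = TIME.search(text)
    return m.group(0) if m else None


s = "7:30 pm -9 pm Lorem Ipsum is simply dummy text. 9pm-10pm Lorem Ipsum is simply dummy text"
print(first_time(s))    # first match only
print(TIME.findall(s))  # every match, in order
```

Because the group is non-capturing, findall returns the full matched strings rather than just the minutes.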
You can use the following regex:
\d{1,2}\:?(?:\d{1,2}|)\s*[ap]m
An almost generic solution may be achieved using the following expression:
([012]?\d(:[0-5]\d)?\s*(pm|am|PM|AM))
It uses capturing groups and can find every time string present in the input.
In JavaScript, it can be tested as follows:
var testTime = "7:30 pm -9 pm Lorem Ipsum is simply dummy text. 9pm-10pm Lorem Ipsum is simply dummy text";
var timeRex = /([012]?\d(:[0-5]\d)?\s*(pm|am|PM|AM))/g;
var firstTime = timeRex.exec(testTime)[0];
console.log(firstTime);
I really believe that there is a better general solution. I will try to find something more robust and then publish it here.