Generalize regex XML validation

I want to make my code more general: instead of writing out a list of strings by hand, such as {"0", "1", "2", "3", "4"}, I would like something like Dim IDNumbers() As String = 'SOMETHING' that is filled from the file. I want to read all the ID, student name and birthday nodes from my XML file without listing them individually. The reason is that, as it stands, whenever I modify the file I have to change my VB code too. I am unsure how to do this; if anybody could help with sample code, that would be great.

Try the following:
Imports System.Xml
Imports System.Xml.Linq
Imports System.Globalization
Module Module1
    Const FILENAME As String = "c:\temp\test.xml"
    Sub Main()
        Dim doc As XDocument = XDocument.Load(FILENAME)
        ' Project every <student> element into an anonymous object
        Dim results = doc.Descendants("student").Select(Function(x) New With { _
            .id = CType(x.Element("ID"), String), _
            .name = CType(x.Element("student_name"), String), _
            .birthday = CType(x.Element("birthday"), String) _
        }).ToList()
    End Sub
End Module
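For comparison, here is a minimal Python sketch of the same idea: collect every student node from the document instead of hard-coding the values. The element names (student, ID, student_name, birthday) come from the answer above; the sample XML layout, the sample names and the helper name read_students are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real file layout is not shown in the question.
SAMPLE_XML = """<students>
  <student><ID>0</ID><student_name>Ann</student_name><birthday>2001-02-03</birthday></student>
  <student><ID>1</ID><student_name>Bob</student_name><birthday>2000-11-30</birthday></student>
</students>"""


def read_students(xml_text):
    """Return one dict per <student> element found anywhere in the document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": student.findtext("ID"),
            "name": student.findtext("student_name"),
            "birthday": student.findtext("birthday"),
        }
        for student in root.iter("student")
    ]


for student in read_students(SAMPLE_XML):
    print(student)
```

Adding or removing student elements in the XML then needs no change in the code.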

Related

Extract JSON from String using flutter dart

Hello, I want to extract JSON from the input string below.
I have tried the regex below in Java and it works fine:
private static final Pattern shortcode_media = Pattern.compile("\"shortcode_media\":(\\{.+\\})");
I want the equivalent regex for Dart.
Input String
<script type="text/javascript">window.__initialDataLoaded(window._sharedData);</script><script type="text/javascript">window.__additionalDataLoaded('/p/B9fphP5gBeG/',{"graphql":{"shortcode_media":{"__typename":"GraphSidecar","id":"2260708142683789190","shortcode":"B9fphP5gBeG","dimensions":{"height":1326,"width":1080}}}});</script><script type="text/javascript">
<script type="text/javascript">window.__initialDataLoaded(window._newData);</script><script type="text/javascript">window._newData('/p/B9fphP5gBeG/',{"graphql":{"post":{"__typename":"id","id":"2260708142683789190","new_code":"B9fphP5gBeG"}}});</script><script type="text/javascript">
(function(){
function normalizeError(err) {
var errorInfo = err.error || {};
var getConfigProp = function(propName, defaultValueIfNotTruthy) {
var propValue = window._sharedData && window._sharedData[propName];
return propValue ? propValue : defaultValueIfNotTruthy;
};
return {}
}
)
Expected json
{"graphql":{"shortcode_media":{"__typename":"GraphSidecar","id":"2260708142683789190","shortcode":"B9fphP5gBeG","dimensions":{"height":1326,"width":1080}}}}
Note: There are multiple JSON strings in the input string; I need the JSON that contains the shortcode_media tag.

Please use:
void main() {

String json = '''
{"graphql":
{"shortcode_media":{"__typename":"GraphSidecar","id":"2260708142683789190","shortcode":"B9fphP5gBeG","dimensions":{"height":1326,"width":1080}}},
"abc":{"def":"test"}
}
''';
RegExp regExp = new RegExp(
"\"shortcode_media\":(\\{.+\\})",
caseSensitive: false,
multiLine: false,
);
print(regExp.stringMatch(json).toString());
}
output
"shortcode_media":{"__typename":"GraphSidecar","id":"2260708142683789190","shortcode":"B9fphP5gBeG","dimensions":{"height":1326,"width":1080}}}

The corresponding Dart RegExp would be:
static final RegExp shortcodeMedia = RegExp(r'"shortcode_media":(\{.+\})');
It does not work, though. JSON is not a regular language, so you can't parse it using regular expressions.
The value of "shortcode_media" in your example JSON ends with several } characters. The RegExp will stop the match at the third of those, even though the second } is the one matching the leading {. If your JSON text contains any further values after the shortcode_media entry, those might be included as well.
Stopping at the first } would also be too short.
If someone reorders the JSON source code to the equivalent
"shortcode_media":{"dimensions":{"height":1326,"width":1080},"__typename":"GraphSidecar","id":"2260708142683789190","shortcode":"B9fphP5gBeG"}
(that is, putting the "dimensions" entry first), then you would only capture until the end of the dimensions block.
I would recommend either using a proper JSON parser, or at least improving the RegExp to be able to handle a single nested JSON object - since you seem to already know that it will happen.
Such a RegExp could be:
RegExp(r'"shortcode_media":(\{(?:[^{}]*(?:\{.*?\})?)*?\})')
This RegExp will capture the correct number of braces for the example code, but still won't work if there are more nested JSON objects. Only a real parser can handle the general case correctly.
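To illustrate the "proper JSON parser" route, here is a minimal Python sketch (not Dart). It walks through the text, tries to decode a JSON value at every {, and returns the first decoded object that contains the wanted key. The function names and the shortened sample page are assumptions for illustration; json.JSONDecoder.raw_decode is a standard-library call that decodes one value starting at a given index and reports where it ended.

```python
import json


def find_key(node, key):
    """Return the first value stored under `key` anywhere inside `node`, else None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def extract_json_with_key(text, key):
    """Scan `text` for embedded JSON objects; return the first one containing `key`."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not JSON at this brace (for example JavaScript code); try the next one.
            pos = text.find("{", pos + 1)
            continue
        if find_key(obj, key) is not None:
            return obj
        pos = text.find("{", end)
    return None


# Shortened, hypothetical stand-in for the page source in the question.
page = (
    '<script>window._newData(\'/p/x/\',{"graphql":{"post":{"id":"1"}}});</script>'
    '<script>window.__additionalDataLoaded(\'/p/x/\','
    '{"graphql":{"shortcode_media":{"id":"2","dimensions":{"height":1326,"width":1080}}}});'
    '</script>(function(){ var errorInfo = {}; })'
)
result = extract_json_with_key(page, "shortcode_media")
print(json.dumps(result, separators=(",", ":")))
```

Because the parser tracks nesting itself, reordering the entries or adding deeper objects does not change the result.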

Regex from Python to Kotlin

I have a question about regular expressions (regex), and I am really new to this. I found a tutorial with a regex written in Python that deletes the labels by replacing them with an empty string.
This is the Python code:
import re
def extract_identity(data, context):
    """Background Cloud Function to be triggered by Pub/Sub.
    Args:
        data (dict): The dictionary with data specific to this type of event.
        context (google.cloud.functions.Context): The Cloud Functions event
            metadata.
    """
    import base64
    import json
    import urllib.parse
    import urllib.request
    if 'data' in data:
        strjson = base64.b64decode(data['data']).decode('utf-8')
        text = json.loads(strjson)
        text = text['data']['results'][0]['description']
        lines = text.split("\n")
        res = []
        for line in lines:
            line = re.sub('gol. darah|nik|kewarganegaraan|nama|status perkawinan|berlaku hingga|alamat|agama|tempat/tgl lahir|jenis kelamin|gol darah|rt/rw|kel|desa|kecamatan', '', line, flags=re.IGNORECASE)
            line = line.replace(":","").strip()
            if line != "":
                res.append(line)
        p = {
            "province": res[0],
            "city": res[1],
            "id": res[2],
            "name": res[3],
            "birthdate": res[4],
        }
        print('Information extracted:{}'.format(p))
In the above function, information extraction is done by removing all e-KTP labels with regular expressions.
This is the sample of e-KTP:
And this is the result after scanning that e-KTP using the python code:
Information extracted:{'province': 'PROVINSI JAWA TIMUR', 'city': 'KABUPATEN BANYUWANGI', 'id': '351024300b730004', 'name': 'TUHAN', 'birthdate': 'BANYUWANGI, 30-06-1973'}
This is the full tutorial for the above code.
My question is: can we use a regex in Kotlin to remove the labels from the e-KTP result, as in the Python code? The logic I tried does not remove the e-KTP labels. My Kotlin code looks like this:
....
val lines = result.text.split("\n")
val res = mutableListOf<String>()
Log.e("TAG LIST STRING", lines.toString())
for (line in lines) {
Log.e("TAG STRING", line)
line.matches(Regex("gol. darah|nik|kewarganegaraan|nama|status perkawinan|berlaku hingga|alamat|agama|tempat/tgl lahir|jenis kelamin|gol darah|rt/rw|kel|desa|kecamatan"))
line.replace(":","")
if (line != "") {
res.add(line)
}
Log.e("TAG RES", res.toString())
}
Log.e("TAG INSERT", res.toString())
tvProvinsi.text = res[0]
tvKota.text = res[1]
tvNIK.text = res[2]
tvNama.text = res[3]
tvTgl.text = res[4]
....
And this is the result of my code:
TAG LIST STRING: [PROVINSI JAWA BARAP, KABUPATEN TASIKMALAYA, NIK 320625XXXXXXXXXX, BRiEAFAUZEROMARA, Nama, TempatTgiLahir, Jenis keiamir, etc]
TAG INSERT: [PROVINSI JAWA BARAP, KABUPATEN TASIKMALAYA, NIK 320625XXXXXXXXXX, BRiEAFAUZEROMARA, Nama, TempatTgiLahir, Jenis keiamir, etc]
The labels are still there. Is it possible to remove a label using a regex or something similar in Kotlin, as in Python?

The point is to use kotlin.text.replace with a Regex as the search argument. For example:
text = text.replace(Regex("""<REGEX_PATTERN_HERE>"""), "<REPLACEMENT_STRING_HERE>")
You may use the following. The result is stored in a new variable because the loop variable line is read-only in Kotlin and replace returns a new string rather than modifying the original:
val cleaned = line.replace(Regex("""(?i)gol\. darah|nik|kewarganegaraan|nama|status perkawinan|berlaku hingga|alamat|agama|tempat/tgl lahir|jenis kelamin|gol darah|rt/rw|kel|desa|kecamatan"""), "")
Note that (?i) at the start of the pattern is a quick way to make the whole pattern case insensitive.
Also, to match a literal . with a regex you need to escape it. Because a backslash can be written in several ways in ordinary string literals and is easy to get wrong, it is best to define regex patterns in raw string literals. In Kotlin these are the triple-quoted literals, """...""", in which each \ is a literal backslash that forms part of a regex escape.
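The effect of escaping the dot can be checked with a small Python sketch. It uses a shortened label list (only two alternatives) and a hypothetical helper strip_labels; both are assumptions for illustration. The same (?i) inline flag works at the start of a Python pattern.

```python
import re

# Shortened label lists: only two of the alternatives from the question.
LABELS_UNESCAPED = re.compile(r"(?i)gol. darah|nik")   # '.' matches any character
LABELS_ESCAPED = re.compile(r"(?i)gol\. darah|nik")    # '\.' matches a literal dot


def strip_labels(line, pattern=LABELS_ESCAPED):
    """Remove the labels, then the colon and the surrounding whitespace."""
    return pattern.sub("", line).replace(":", "").strip()


print(strip_labels("NIK : 3510"))
print(strip_labels("Gol. Darah : O"))
# The unescaped dot also accepts text that is not the label at all:
print(bool(LABELS_UNESCAPED.search("golf darah")))
print(bool(LABELS_ESCAPED.search("golf darah")))
```

Only the escaped pattern rejects "golf darah", which is why the dot in the Kotlin pattern above is written as \. rather than a bare dot.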

How can I use code to export a SharePoint list to Excel

I found a previous question that looks to be what I need. However, when I run the code I get a debug error that highlights the last statement (from "Set objMyList ..." to "Range("A1"))"). Below is the code I'm using, with my specific path and GUIDs. I tried adjusting the SharePoint address, but the one listed is the one that points to the library. I also tried just the home address (stopping at "TEP") and the full address including "All Items.aspx". I'm sure I am missing something simple, but I thought I'd ask here.
Dim objMyList As ListObject
Dim objWksheet As Worksheet
Dim strSPServer As String
Const SERVER As String = "https://twdc.sharepoint.com/sites/WDPR-dclrecruiting/Test/TEP/Trip%20Event%20Planning%20Library"
Const LISTNAME As String = "{6B39FDF1-29AE-418C-9D99-92293FED5C81}"
Const VIEWNAME As String = "{CCFD1C7F-74CA-4921-A599-628C800C818A}"
strSPServer = "http://" & SERVER & "/_vti_bin"
Set objWksheet = Worksheets.Add
Set objMyList = objWksheet.ListObjects.Add(xlSrcExternal, _
Array(strSPServer, LISTNAME, VIEWNAME), False, xlYes, Range("A1"))

The code below works in my local environment. Note that SERVER holds only the host and site path, with no scheme, because "http://" is prepended when strSPServer is built:
Sub ExportList()
    Dim objMyList As ListObject
    Dim objWksheet As Worksheet
    Dim strSPServer As String
    Const SERVER As String = "sp/sites/team"
    Const LISTNAME As String = "{3e47ff9c-9aab-4a40-9d6a-c47e9b793484}" 'From source code
    Const VIEWNAME As String = "{67709eda-c975-4669-85e5-d95e263dadc6}" 'From source code
    ' The SharePoint server URL pointing to the SharePoint list to import into Excel.
    strSPServer = "http://" & SERVER & "/_vti_bin"
    Set objWksheet = Sheets("Sheet1")
    ' Add a list range to the worksheet
    ' and populate it with the data from the SharePoint list.
    Set objMyList = objWksheet.ListObjects.Add(xlSrcExternal, Array(strSPServer, LISTNAME, VIEWNAME), True, , Range("A1"))
    Set objMyList = Nothing
    Set objWksheet = Nothing
End Sub

vb.net Regex remove a tags with mailto

I have a text, for example:
" Visit www.flexstaff.com for details
Email <a href="mailto:rachel@flexstaff.com">rachel@flexstaff.com</a> apply online."
I would like to delete only the a tags that contain "mailto", so
<a href="mailto:rachel@flexstaff.com">rachel@flexstaff.com</a> will become
rachel@flexstaff.com
I have this regex:
Dim rgxMailTo = New Regex("<a\b\s[^<>]*(?<=@.*)>|(?<=@.*)</a>",RegexOptions.IgnoreCase)
Dim ret As String = rgxMailTo.Replace(text, Environment.NewLine)
But it selects other a tags as well.

Use the below regex and then replace the match with $1.
<a\b\s*[^<>]*\bmailto\b[^<>]*>([^<>]*)<\/a>
To select only the tags.
<a\b\s*[^<>]*\bmailto\b[^<>]*>|(?<=<a\b\s*[^<>]*\bmailto\b[^<>]*>[^<>]*)<\/a>
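As a quick check of the first pattern, here is a minimal Python sketch in which the $1 replacement becomes \1. The sample HTML is an assumption (the href values are made up for illustration), and unwrap_mailto_links is a hypothetical helper name.

```python
import re

# Same pattern as the first regex above; group 1 is the link text.
MAILTO_LINK = re.compile(
    r'<a\b\s*[^<>]*\bmailto\b[^<>]*>([^<>]*)</a>', re.IGNORECASE
)


def unwrap_mailto_links(html):
    """Replace every mailto anchor with its inner text; leave other anchors alone."""
    return MAILTO_LINK.sub(r"\1", html)


# Assumed sample input with one ordinary link and one mailto link.
sample = (
    'Visit <a href="http://www.flexstaff.com">www.flexstaff.com</a> for details '
    'Email <a href="mailto:rachel@flexstaff.com">rachel@flexstaff.com</a> apply online.'
)
print(unwrap_mailto_links(sample))
```

The ordinary link survives because [^<>]* cannot cross the closing > of its opening tag, so mailto has to appear inside that tag for the pattern to match.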

If your text is of uncertain source (so it was not all generated in 100% predictable way), using regex is a very bad idea - trust me, I've been there.
One option is to use Html Agility Pack, and load the HTML as an XElement (C#, as I have sample on hand):
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(HTML);
htmlDoc.OptionOutputAsXml = true;
using (var stream = new MemoryStream())
{
htmlDoc.Save(stream);
stream.Position = 0;
var xelement = XElement.Load(stream);
DoStuffToXElement(xelement);
}
Note, that in case you have just a fragment without a root element:
Link
<img src="#"/>
Remember to wrap it in something neutral, like htmlDoc.LoadHtml("<div>"+HTML+"</div>");
Now you can use LinqToXml to find whatever you need, traverse the tree or do anything quite safely:
xHtml
.Descendants()
.Where(e=>e.Name.LocalName.Equals("a", StringComparison.OrdinalIgnoreCase)
&& e.Attribute("href") != null
&& e.Attribute("href").Value.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase))
.Remove();
Final note: this is nearly always much slower than regex. If time is important (for example, you do it at every page load), it might be too slow, but I guess this kind of processing can be done beforehand?

You can use the power of LINQ to XML like this:
Imports System.Text.RegularExpressions
Imports System.Xml.Linq
Imports System.Xml
Imports System.Xml.XPath
Public Class Form1
Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
Dim str As String = "Visit www.flexstaff.com for details\nEmail <a href=""mailto:rachel@flexstaff.com"">rachel@flexstaff.com</a> apply online."
Dim xDoc As XDocument = XDocument.Parse("<?xml version= '1.0'?><root>" + str + "</root>")
Dim query = xDoc.XPathSelectElements("//a[contains(@href,'mailto')]")
For Each element In query
element.Remove()
Next element
Dim Res As String = xDoc.ToString().Replace("<root>", String.Empty).Replace("</root>", String.Empty)
End Sub
End Class
Output (Res):
Visit www.flexstaff.com for details\nEmail apply online.

Regex to parse querystring values to named groups

I have a HTML with the following content:
... some text ...
<a href="file.aspx?userId=123&section=2">link</a> ... some text ...
... some text ...
<a href="file.aspx?section=5&userId=678">link</a> ... some text ...
... some text ...
I would like to parse that and get a match with named groups:
match 1
group["user"]=123
group["section"]=2
match 2
group["user"]=678
group["section"]=5
I can do it if parameters always go in order, first User and then Section, but I don't know how to do it if the order is different.
Thank you!

In my case I had to parse a URL because HttpUtility.ParseQueryString is not available in WP7, so I created an extension method like this:
public static class UriExtensions
{
private static readonly Regex queryStringRegex;
static UriExtensions()
{
queryStringRegex = new Regex(@"[\?&](?<name>[^&=]+)=(?<value>[^&=]+)");
}
public static IEnumerable<KeyValuePair<string, string>> ParseQueryString(this Uri uri)
{
if (uri == null)
throw new ArgumentException("uri");
var matches = queryStringRegex.Matches(uri.OriginalString);
for (int i = 0; i < matches.Count; i++)
{
var match = matches[i];
yield return new KeyValuePair<string, string>(match.Groups["name"].Value, match.Groups["value"].Value);
}
}
}
Then it's a matter of using it, for example:
var uri = new Uri(HttpUtility.UrlDecode(@"file.aspx?userId=123&section=2"),UriKind.RelativeOrAbsolute);
var parameters = uri.ParseQueryString().ToDictionary( kvp => kvp.Key, kvp => kvp.Value);
var userId = parameters["userId"];
var section = parameters["section"];
NOTE: I'm returning an IEnumerable instead of a dictionary because there might be duplicate parameter names. If there are, the ToDictionary call above will throw an exception.

Why use regex to split it out?
You could first extract the query string, split it on &, and then build a map by splitting each of the resulting pieces on =.
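The split-based approach can be sketched in a few lines of Python. The function names are hypothetical; urlsplit and parse_qsl are the standard-library equivalents and are shown only for comparison. The parameter names userId and section are taken from the other answers.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_query(url):
    """Split on '?', then '&', then '=' (no URL decoding, last duplicate wins)."""
    _, _, query = url.partition("?")
    params = {}
    for pair in query.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        params[name] = value
    return params


def parse_query_stdlib(url):
    """Same result using the standard library, which also handles percent-decoding."""
    return dict(parse_qsl(urlsplit(url).query))


print(parse_query("file.aspx?section=2&userId=123"))
print(parse_query_stdlib("file.aspx?userId=123&section=2"))
```

Since the result is a map keyed by parameter name, the order of the parameters in the URL no longer matters.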

You didn't specify what language you are working in, but this should do the trick in C#:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace RegexTest
{
class Program
{
static void Main(string[] args)
{
string subjectString = @"... some text ...
<a href=""file.aspx?userId=123&section=2"">link</a> ... some text ...
... some text ...
<a href=""file.aspx?section=5&userId=678"">link</a> ... some text ...
... some text ...";
Regex regexObj =
new Regex(@"<a href=""file.aspx\?(?:(?:userId=(?<user>.+?)&section=(?<section>.+?)"")|(?:section=(?<section>.+?)&userId=(?<user>.+?)""))");
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success)
{
string user = matchResults.Groups["user"].Value;
string section = matchResults.Groups["section"].Value;
Console.WriteLine(string.Format("User = {0}, Section = {1}", user, section));
matchResults = matchResults.NextMatch();
}
Console.ReadKey();
}
}
}

Using regex to first find the key value pairs and then doing splits... doesn't seem right.
I'm interested in a complete regex solution.
Anyone?

Check this out
\<a\s+href\s*=\s*["'](?<baseUri>.+?)\?(?:(?<key>.+?)=(?<value>.+?)[&"'])*\s*\>
You can get the pairs with something like Groups["key"].Captures[i] and Groups["value"].Captures[i].

Perhaps something like this (I am rusty on regex, and wasn't good at them in the first place anyway. Untested):
/href="[^?]*([?&](userId=(?<user>\d+))|section=(?<section>\d+))*"/
(By the way, the XHTML is malformed; & should be &amp; in the attributes.)

Another approach is to put the capturing groups inside lookaheads:
Regex r = new Regex(@"<a href=""file\.aspx\?" +
@"(?=[^""<>]*?user=(?<user>\w+))" +
@"(?=[^""<>]*?section=(?<section>\w+))");
If there are only two parameters, there's no reason to prefer this way over the alternation-based approaches suggested by Mike and strager. But if you needed to match three parameters, the other regexes would grow to several times their current length, while this one would only need another lookahead just like the two existing ones.
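The lookahead technique carries over to other flavors. Here is a minimal Python sketch, assuming the parameter is spelled userId as in some of the other answers (the regex above uses user) and using made-up sample HTML; find_links is a hypothetical helper. Python's re module does not allow the same group name to be defined twice, so the lookahead form is the one that ports directly.

```python
import re

# Each lookahead starts right after the '?' and scans the attribute value
# independently, so the parameters may appear in either order.
LINK = re.compile(
    r'<a href="file\.aspx\?'
    r'(?=[^"<>]*?userId=(?P<user>\w+))'
    r'(?=[^"<>]*?section=(?P<section>\w+))'
)


def find_links(html):
    """Return a (user, section) tuple for every matching link."""
    return [(m.group("user"), m.group("section")) for m in LINK.finditer(html)]


# Assumed sample input with the parameters in both orders.
html = (
    '... <a href="file.aspx?userId=123&section=2">link</a> ... '
    '<a href="file.aspx?section=5&userId=678">link</a> ...'
)
print(find_links(html))
```

Supporting a third parameter would only mean adding one more lookahead of the same shape.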
By the way, contrary to your response to Claus, it matters quite a bit which language you're working in. There's a huge variation in capabilities, syntax, and API from one language to the next.

You did not say which regex flavor you are using. Since your sample URL links to an .aspx file, I'll assume .NET. In .NET, a single regex can have multiple named capturing groups with the same name, and .NET will treat them as if they were one group. Thus you can use the regex
userID=(?<user>\d+)&section=(?<section>\d+)|section=(?<section>\d+)&userID=(?<user>\d+)
This simple regex with alternation will be far more efficient than any tricks with lookaround. You can easily expand it if your requirements include matching the parameters only if they're in a link.

A simple Python implementation that overcomes the ordering problem:
In [2]: x = re.compile(r'(?:(userId|section)=(\d+))+')
In [3]: t = 'href="file.aspx?section=2&userId=123"'
In [4]: x.findall(t)
Out[4]: [('section', '2'), ('userId', '123')]
In [5]: t = 'href="file.aspx?userId=123&section=2"'
In [6]: x.findall(t)
Out[6]: [('userId', '123'), ('section', '2')]
