Issue with regex expression used to extract timestamps from XML file

I am trying to use regex in my C++ program to extract timestamps, among other things, from an XML file. Right now I am focusing on a regular expression that extracts six timestamps in particular. Unfortunately, my expression does not seem to locate the six timestamps I want. The expression I have created is: \2\0\1\4\\-\0\7\-\0\8\T\1\8\:\1\4\:\.\.\\.\7\1\6\Z. If you look at the XML file linked below, I am trying to extract the timestamps from six lines in particular (lines 72, 75, 78, 81, 84, and 87). Could someone help me point out what I am doing wrong? Sorry, I am just familiarizing myself with regex for the first time. I am using http://regexr.com/ to test my expressions.
Link to XML file: http://pastebin.com/5hMy9RzK
Six timestamps which I want my regex expression to locate:
timestamp="2014-07-08T18:14:17.716Z"
timestamp="2014-07-08T18:14:18.716Z"
timestamp="2014-07-08T18:14:19.716Z"
timestamp="2014-07-08T18:14:20.716Z"
timestamp="2014-07-08T18:14:21.716Z"
timestamp="2014-07-08T18:14:22.716Z"

Your expression looks strange: you are escaping every literal character with a \, which is usually needed only for special characters.
Is this what you're looking for?
\d\d\d\d-\d\d-\d\d\w\d\d:\d\d:\d\d\.716Z
Example:
http://regexr.com/3cbs2
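To make the suggestion above concrete, here is a minimal sketch in Python (chosen only for brevity; the question itself is about C++). It tightens the suggested pattern slightly by using a literal T instead of \w and counted repetition instead of repeated \d, and it captures the value inside the timestamp attribute. The <sample> element name and the fragment are made up for illustration and are not taken from the linked XML file. The pattern uses only \d, literals, and counted repetition, all of which the default ECMAScript grammar of C++ std::regex also supports.

```python
import re

# Literal characters need no backslash; only the dot is escaped.
# The ".716Z" suffix restricts matches to the timestamps the question asks for.
TIMESTAMP = re.compile(
    r'timestamp="(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.716Z)"'
)

# Hypothetical fragment; the real file's element names may differ.
xml_fragment = """
<sample timestamp="2014-07-08T18:14:17.716Z">1</sample>
<sample timestamp="2014-07-08T18:14:18.716Z">2</sample>
<sample timestamp="2014-07-08T18:14:18.000Z">3</sample>
"""

# findall returns only the captured group, i.e. the bare timestamp strings.
found = TIMESTAMP.findall(xml_fragment)
print(found)
```

The third line of the fragment is skipped because its fractional part is not .716.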

Related

Regular expression to find last match in XML output

I have been working for days to learn regex so that I can extract the last match out of an xml output of a test from a scientific instrument. The instrument buffer can hold multiple tests and I am only interested in the last (most recent) test. I can't figure it out!
<Ticket class="SAMPLE" serialno="6000SP210134" versions="FP6000;Main:V1.25;COM:V1.7;D:V1.11;TEC:V1.6">
<Measurement>
<SampleId>6</SampleId>
<DateTime>2022-10-28T15:16:22</DateTime>
<Value>300</Value>
<Unit>mOsmol/kg</Unit>
<DeviceCode>6000SP210134</DeviceCode>
<CheckSum>50c5656fd477cbcd3b7a5036ba98a542</CheckSum>
</Measurement>
</Ticket>
<Ticket class="SAMPLE" serialno="6000SP210134" versions="FP6000;Main:V1.25;COM:V1.7;D:V1.11;TEC:V1.6">
<Measurement>
<SampleId>7</SampleId>
<DateTime>2022-10-28T15:18:55</DateTime>
<Value>425</Value>
<Unit>mOsmol/kg</Unit>
<DeviceCode>6000SP210134</DeviceCode>
<CheckSum>50c5656fd477cbcd3b7a5036ba98a542</CheckSum>
</Measurement>
</Ticket>
I need to match and return the last value from the last test <Ticket></Ticket> (the number of Tickets is variable). In this example it would be 425.
I thought this might work, but it doesn't...
\<Value>\d{2,4}<\/Value>.*\n$\
This regular expression is executed and interpreted in a lab information management system called LabVantage, not in any language like perl, php, C, etc. A regular expression is the only option I have.

LabVantage does not seem to publicly reveal their regex engine but if you have access to lookarounds then this should work:
<Value>\d{2,4}<\/Value>(?![\s\S]*<\/Value>)
<Value>\d{2,4}<\/Value> - you know what this does, you wrote it =)
(?![\s\S]*<\/Value>) - ahead of me, </Value> does not exist
https://regex101.com/r/XpbOdR/1
If lookbehinds are supported then you can get fancy like this to extract only the digits:
(?<=<Value>)\d{2,4}(?=<\/Value>(?![\s\S]*<\/Value>))
https://regex101.com/r/VCDURX/1
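As a quick check of the lookaround version outside LabVantage, here is a small Python sketch. It assumes an engine with lookahead and lookbehind support (Python's re module has both); whether LabVantage behaves the same way is not known. The buffer contents are a shortened, hypothetical version of the output in the question.

```python
import re

# Digits preceded by <Value> and followed by a </Value> after which
# no further </Value> appears anywhere in the remaining text.
LAST_VALUE = re.compile(
    r"(?<=<Value>)\d{2,4}(?=</Value>(?![\s\S]*</Value>))"
)

# Shortened stand-in for the instrument buffer (two tickets).
buffer_output = """<Ticket><Measurement><Value>300</Value></Measurement></Ticket>
<Ticket><Measurement><Value>425</Value></Measurement></Ticket>
"""

match = LAST_VALUE.search(buffer_output)
last_value = match.group(0) if match else None
print(last_value)
```

The first <Value> is rejected because another </Value> still follows it, so only the value of the final ticket is returned.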

I was not able to coax LabVantage into working with a regular expression in the ways recommended above. However, if any LabVantage user is looking to solve a similar issue, the way it was resolved was to use a Value Extraction Rule like this:
extract /regex/ extract /regex/
or
extract /regex/ extract last number
This type of expression is not explicitly made visible to the user, but it still works. So the final code that did work is this:
extract /(?s).*Value>/ extract last number
Thanks all who contributed.

How to handle a tilde / swung dash (~) in a regular expression in order to exclude temporary MS Office files?

I have a batch job in xml that gets scheduled by a job scheduling engine. This engine provides the possibility of observing directories for changes of their content. My task is to monitor directories on a file exchange server running Windows, where customers and clients upload files we need to process.
We need to know about the arrival of new files as soon as possible.
I have to put a regular expression into that xml-job in order to not match subdirectories and temporary files.
In most cases, customers and clients upload files formatted as text/csv/pdf, which don't cause any problems. Some upload MS Office files, which, on the other hand, become a problem if someone opens them in the directory. Then an invisible temporary file is created beginning with ~$.
According to the documentation of the scheduling engine, the regex follows the POSIX 1003.2 standard. However, I am not able to prevent notifications being sent when someone opens an MS Office file in a monitored directory.
The regular expressions I have tried so far are:
First try before even noticing temporary office files:
^[a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Second try, intention was excluding a leading ~:
^[^~][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Third try, intention was excluding a leading ~ by its character code:
^[^\x7e][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Fourth try, intention was excluding a leading ~ by its character code with a capital E:
^[^\x7E][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
None of those stops notifications from being sent when a file is opened…
Does anyone have any idea what to do?
All suggestions and alternatives are welcome.
I even checked them at regex101, regexplanet.com, regexr.com and regextester.com where the second try was matching exactly as desired. I did not even forget to configure POSIX compilation if it was possible on those sites (not all).
How can I exclude the ~ character from matching the regular expression (at the beginning of a file name)?
Short version:
How can I create a regular expression that matches any file with any extension apart from .part, but matches neither the file thumbs.db nor any file whose name begins with a ~?
Requirements:
What should not be matched:
Subfolders (my approach was files without a .),
Thumbs.db (Windows thumbnails db),
*.part (filezilla partial uploads),
~$. (temporary files starting with ~ or ~$, MS Office tmp files)
The following list provides some files and folders that must be matched or not matched by the regex:
Ablage (subfolder, should not be matched)
Abrechnungen (subfolder, should not be matched)
eine_testdatei.csv
TEST-WORKBOOK.xlsx
TEST-WORKBOOK_äöüß.xlsx
Test-2018-08-08.txt
~$TEST-WORKBOOK.xlsx (temporary file, should not be matched)
TEST-WORKBOOK.xlsx.part (partial upload, should not be matched)
TEST-WORKBOOK.part (partial upload, should not be matched)
New Problems occurred while trying to find the regex
A few problems came up after the creation of this question when I tried to apply the actually correct regex stated in the answer given by @Bohemian. I wasn't aware of those problems, so I am adding them here for completeness.
The first one occurred when certain characters in the regex were not allowed in xml. The xml file is parsed by a java class that throws an exception trying to parse < and >, they are forbidden in xml documents if not related to xml nodes directly (valid: <xml-node>...</xml-node>, invalid: attribute="<ome_on, why isn't this VALI|>").
This can be avoided by using the HTML entity names: &lt; instead of < and &gt; instead of >.
The second (and currently unresolved) issue is an operand criticized for the actually correct regular expression ^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$. The engine says:
Error: 2018-08-17T06:05:46Z REGEX-13
[repetition-operator operand invalid, ^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$]
The corresponding line in the xml file looks like this:
<start_when_directory_changed directory="F:\someDirectory" regex="^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$" />
Now I am stuck again, because my knowledge of regular expressions is pretty low. It is so low, that I don't even have any idea what character could be that criticized operand in the regex.
Research has brought me to this question whose accepted answer states "POSIX regexes don't support using the question mark ? as a non-greedy (lazy) modifier to the star and plus quantifiers (…)", which gives me an idea about what is wrong with the great regex. Still, I am not able to provide a working regex, more research will have to follow…

POSIX ERE doesn't allow for a simple way to exclude a particular string from matching. You can disallow a particular character -- like in [^.part] you are matching a single character which is not (newline or) dot or p or a or r or t -- and you can specify alternations, but those are very cumbersome to combine into an expression which excludes some particular patterns.
Here's how to do it, but as you can see, it's not very readable.
^([^~t.]|t($|[^h])|th($|[^u])|thu($|[^m])|thum($|[^b])|thumb($|[^s])|thumbs($|[^.])|thumbs\.($|[^d])|thumbs\.d($|[^b])|\.($|[^p])|\.p($|[^a])|\.pa($|[^r])|\.par($|[^t]))+$
... and it still probably doesn't do exactly what you want.
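The remark above about [^.part] being a single-character class, not an excluded string, can be checked with a short sketch. Python is used here only as a convenient test bed; bracket expressions behave the same way in POSIX ERE for this case. The sample extensions are chosen for illustration.

```python
import re

# [^.part] means "one character that is not '.', 'p', 'a', 'r' or 't'".
# It does not mean "anything except the string .part".
TAIL = re.compile(r"[^.part]*")

results = {
    ext: bool(TAIL.fullmatch(ext))
    for ext in ("csv", "xlsx", "txt", "part")
}
print(results)
```

Note that "txt" is rejected as well, simply because it contains a t, which is one reason the attempts in the question cannot express "any extension except .part".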

Try this:
^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$
See live demo.
There is nothing special about the tilde character in regex.
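Here is a small sketch that runs this expression against the file list from the question. It uses Python's re module, which supports lookahead and lookbehind; a strict POSIX ERE engine does not, which fits the error the scheduling engine reports. Two details are assumptions of this sketch and not part of the answer: the dot in thumbs.db is escaped, and matching is case-insensitive so that Thumbs.db is rejected too.

```python
import re

# (?=.*\.)         the name must contain a dot (skips subfolders)
# (?!thumbs\.db$)  the name must not be thumbs.db
# [^~]             the first character must not be a tilde
# (?<!\.part)$     the name must not end in .part
PATTERN = re.compile(
    r"^(?=.*\.)(?!thumbs\.db$)[^~].*(?<!\.part)$",
    re.IGNORECASE,
)

names = [
    "Ablage",
    "Abrechnungen",
    "Thumbs.db",
    "eine_testdatei.csv",
    "TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK_äöüß.xlsx",
    "Test-2018-08-08.txt",
    "~$TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK.xlsx.part",
    "TEST-WORKBOOK.part",
]

accepted = [name for name in names if PATTERN.search(name)]
print(accepted)
```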

I am very late on this but above comments were helpful for me. It may not work for you but my solution is:
file_list <- file_list[!grepl("~", file_list)]

Finding text between two tags with variable namespace

I have to parse a lot of text files where each text file contains one or more XML documents. I do know every XML document is wrapped in an Envelope tag as its root tag, but the namespaces vary.
I tried to create a regular expression to grab these XML documents from a text file, and it does work for most of them, but for some I get a catastrophic backtracking error. I think it's because the text is too large and my expression is not very efficient. I'm not really great at regex, so I'm struggling to fix this.
The pattern I'm looking for is:
<namespace:envelope attributes>XML</namespace:envelope>
What I've come up with so far is:
(?i)<[^:]*?:envelope[^>]*?>.*?<\/[^:]*?:envelope>
Any help would be greatly appreciated.

Try to use this regular expression:
#<([^/].*?):envelope\s.+?</\1:envelope>#s
RegEx101 Demo 1
Or shorter one, if you don't need to have namespace separately:
#<([^/].*?:envelope)\s.+?</\1>#s
RegEx101 Demo 2
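The backreference idea in these expressions (capture the namespace prefix, then require the same prefix in the closing tag) can be sketched in Python as follows. This is a variant and not the answer's exact pattern: it restricts the prefix to name-like characters, which is an assumption intended to reduce backtracking, and it keeps the case-insensitive matching from the question. The sample text and its namespaces are invented.

```python
import re

# \1 refers back to the captured prefix, so <soap:Envelope> can only be
# closed by </soap:Envelope>. re.S lets "." span line breaks.
ENVELOPE = re.compile(
    r"<([A-Za-z_][\w.-]*):envelope\b[^>]*>.*?</\1:envelope>",
    re.IGNORECASE | re.DOTALL,
)


def extract_envelopes(text):
    """Return every complete envelope document found in text."""
    return [m.group(0) for m in ENVELOPE.finditer(text)]


# Hypothetical file contents with two documents and different prefixes.
sample = (
    "log line\n"
    '<soap:Envelope xmlns:soap="urn:a"><soap:Body>1</soap:Body></soap:Envelope>\n'
    "noise\n"
    '<env:Envelope xmlns:env="urn:b">\n<env:Body>2</env:Body>\n</env:Envelope>\n'
)

docs = extract_envelopes(sample)
print(len(docs))
```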

Problems with finding and replacing

Hey Stack Overflow community. I need help with a huge information file. Is it possible to use a regular expression to find this tag:
<category_name><![CDATA[Prekiniai ženklai&gt;Adler|Kita buitinė technika&gt;Buičiai naudingi prietaisai|Kita buitinė technika&gt;Lygintuvai]]></category_name>
and somehow replace all the other data, leaving only 'Adler' or 'Lygintuvai'? I'm using Altova to edit XML files, so I can't find any way other than find-and-replace. I'm also new to regex, so I thought maybe you could help me.

#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
PHP
regex101.com
Fields may contain alphanumeric characters without spaces.
If you want to change the set of acceptable characters, replace [\w] with something else:
[a-z] - only letters
[0-9] - only digits
etc.
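For readers who want to try the expression outside PHP, here is a hedged Python sketch of the same pattern without the # delimiters and without the unnecessary escapes. It assumes the input really contains the literal text &gt; inside the CDATA section, as the gt; in the pattern implies. The replacement that keeps only the last category is one possible reading of the question.

```python
import re

# Group 1 is the word after the first "gt;", group 2 the word after the
# last "gt;" before the closing "]]>".
CATEGORY = re.compile(
    r"<category_name>.+?gt;(\w+?)\|.+?gt;(\w+?)\]\]></category_name>",
    re.IGNORECASE,
)

line = (
    "<category_name><![CDATA[Prekiniai ženklai&gt;Adler|"
    "Kita buitinė technika&gt;Buičiai naudingi prietaisai|"
    "Kita buitinė technika&gt;Lygintuvai]]></category_name>"
)

match = CATEGORY.search(line)
brand, category = match.groups()

# Rewrite the element so that only the second captured word remains.
shortened = CATEGORY.sub(
    r"<category_name><![CDATA[\2]]></category_name>", line
)
print(brand, category, shortened)
```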

It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like &gt; within a CDATA section; but XML escape sequences are not recognized in a CDATA section.

Regular expression to amend Sysprep.inf file

I currently have a requirement to parse a sysprep.inf file and insert a value input by the end user.
I'm coding this utility using AutoIT and my regular expression is slightly out.
The line I need amending is as follows:
ComputerName=%DeviceName%
DeviceName is a variable injected by LANDesk. If the device has previously been in the LANDesk database, the name is injected into the file. If not, the variable name remains. The device name must go after the =
Here is a snippet of my current code:
$FileContents = StringRegExpReplace($FileContents,'ComputerName=[a-z]','ComputerName='& $deviceNameInput)
Thanks for any guidance anyone can offer.

I'm not familiar with AutoIT or BASIC... but it looks like you need to be using something like this:
$FileContents = StringRegExpReplace($FileContents,'.*ComputerName=(\%[a-zA-Z]*\%).*', $deviceNameInput)
OR
$FileContents = StringRegExpReplace($FileContents,'ComputerName=\%[a-zA-Z]*\%', 'ComputerName='&$deviceNameInput)
This will only replace a device name that consists of a-z or A-Z, not one that contains digits or spaces.

Writing regular expressions can be tough because there are so many dialects of regular expressions. Assuming you are using a regex library that supports a Perl-like dialect you might want to try this for your regex:
^\s*ComputerName\s*=\s*(?:%DeviceName%|[a-zA-Z0-9_-]+)
Basically, this regex will match a line containing either the literal string ComputerName=%DeviceName% or ComputerName=<some actual device name that only contains the characters a-z, A-Z, 0-9, _, and ->. This regex is also a bit lenient in that it will match a line that contains whitespace at the beginning of the line as well as before and/or after the equals sign. The image below explains the components of this regex in greater detail.
p.s. that image was generated by RegexBuddy, an excellent regular expression IDE.
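To show how such a pattern could drive the actual replacement, here is a minimal sketch in Python (not AutoIt). It differs from the regex above in two ways that are assumptions of this sketch: it uses [ \t]* instead of \s* so that a multiline match cannot swallow preceding blank lines, and it captures the "ComputerName=" prefix so the prefix survives the substitution. The file contents and the device name are hypothetical.

```python
import re

# Group 1 keeps everything up to and including the equals sign.
COMPUTER_NAME = re.compile(
    r"^([ \t]*ComputerName[ \t]*=[ \t]*)(?:%DeviceName%|[A-Za-z0-9_-]+)[ \t]*$",
    re.MULTILINE,
)


def set_computer_name(contents, device_name):
    """Return contents with the ComputerName value replaced."""
    # A function replacement keeps any backslashes in the name literal.
    return COMPUTER_NAME.sub(lambda m: m.group(1) + device_name, contents)


# Hypothetical excerpt of a sysprep.inf file.
sysprep = "[UserData]\nComputerName=%DeviceName%\nOrgName=Example\n"
updated = set_computer_name(sysprep, "PC-0042")
print(updated)
```

The same function also handles a file where a real name was already injected, for example a line such as ComputerName = OLDNAME.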

AutoIt has a great way of dealing with ini files: IniWrite
IniWrite("SysPrep.ini", "write_section_here", "ComputerName", $deviceNameInput)
creates or updates SysPrep.ini with:
[write_section_here]
ComputerName=localhost
