Use regex in UFT PDF comparison - regex

In one of my UFT test cases, I need to verify an amount in a PDF file.
Sometimes the amount is "3000" and sometimes it is "3.000". And sometimes even "3 000"!
I would like to accept those 3 possibilities, knowing that this amount is stored in a datatable.
I tried something like "3.?000" (with the regex check enabled in the file checkpoint), but it doesn't match any of the 3 variants.
How would you do this?

One of UFT's idiosyncrasies when dealing with regexes is that it adds implicit anchors at the beginning and end of each line.
Try adding .* before and after the text you want to match:
.*\s3[., ]?000\s.*
Also verify that you've activated the regular expression flag for your line. I find the UI for File Content Checkpoints to be a bit unintuitive so you may have missed that.
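The effect of implicit anchors can be reproduced outside UFT. The following is a minimal Python sketch, not UFT code: it assumes that re.fullmatch is a fair stand-in for UFT's whole-line matching, and the sample lines are made up for illustration.

```python
import re

# Hypothetical lines of text, as they might be extracted from the PDF.
lines = [
    "Total amount: 3000 EUR",
    "Total amount: 3.000 EUR",
    "Total amount: 3 000 EUR",
]

bare = re.compile(r"3.?000")                 # the pattern from the question
padded = re.compile(r".*\s3[., ]?000\s.*")   # the pattern from the answer

# fullmatch() requires the whole line to match, imitating implicit anchors.
bare_hits = [bool(bare.fullmatch(line)) for line in lines]
padded_hits = [bool(padded.fullmatch(line)) for line in lines]

print(bare_hits)    # [False, False, False]
print(padded_hits)  # [True, True, True]
```

Note that the \s on either side of the amount requires whitespace around it, so an amount at the very start or end of a line would need a looser pattern.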

Related

REGEX if then |else

I am having a difficult time understanding the syntax for if then|else.
I'm looking at a large multiline chunk of text and attempting to retrieve a file number that could match any of several expressions. I want to start with the most restrictive match and, if it's not found, move on to another possible match. I don't want to use a simple "or" (|) because I want to make sure that one match does not exist before evaluating the other possibility.
For example, I could have a file number in the format ABC-12345 or in the format ABC-1234567. (There are several other possibilities that make just matching the number of digits impractical, as there may be other masks altogether. I think that if I can understand how to structure the if statement for this simple example, I can work out the rest.)
I thought I might be able to accomplish this by using something like
?(?=[a-zA-Z]{3}-[A-Za-z0-9]{5}$)[a-zA-Z]{3}-[A-Za-z0-9]{5}$|[a-zA-Z]{3}-[A-Za-z0-9]{7}$)
What I am trying to get at is: if I get a match on ABC-12345, return that; otherwise return anything that matches ABC-1234567.
The candidates are on different lines at the moment, hence the $. However, I suspect I may need to change this to \b if I am searching within a single line.
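The "most restrictive first, then fall back" behaviour described above can also be obtained without conditional syntax, by trying each mask in order. This is only a sketch in Python: the function name find_file_number and the sample text are hypothetical, and the two masks are copied from the attempt above.

```python
import re

# Candidate masks, ordered from most to least preferred (taken from the question).
PATTERNS = [
    re.compile(r"[a-zA-Z]{3}-[A-Za-z0-9]{5}$", re.MULTILINE),
    re.compile(r"[a-zA-Z]{3}-[A-Za-z0-9]{7}$", re.MULTILINE),
]

def find_file_number(text):
    """Return the first match of the highest-priority mask, or None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None

# The 5-character mask wins even though the 7-character number appears first.
sample = "Ref: ABC-1234567\nOther: XYZ-12345"
print(find_file_number(sample))  # XYZ-12345
```

A later mask is only consulted when every earlier one has failed on the whole text, which is the guarantee a plain | alternation does not give.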

find string that is missing substring in xml files regular expression

This is the regular expression I currently use to find it:
(<instance_material symbol="material_)([0-9]+)(part)(.*?)(")(/)(>)
I need to find a string that does not contain the word "part".
The XML lines are:
<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
<instance_material symbol="material_677" target="#material_677"/>
You can use a negative lookahead:
^(?!.*part).*?$
^ - start of string.
(?!.*part) - condition to avoid part.
.*? - Match anything except new line.
$ - End of string
Many regex beginners run into the problem of finding a string that does not contain a certain word. You can find more useful tips on Regular-Expression.info.
^((?!part).)*$
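Both lookahead patterns above can be checked against the two lines from the question. A minimal Python sketch (the variable names are illustrative only):

```python
import re

lines = [
    '<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>',
    '<instance_material symbol="material_677" target="#material_677"/>',
]

# One lookahead at the start of the line that scans ahead for "part".
lookahead_once = re.compile(r"^(?!.*part).*?$")
# A lookahead repeated before every character.
lookahead_each = re.compile(r"^((?!part).)*$")

kept_once = [line for line in lines if lookahead_once.match(line)]
kept_each = [line for line in lines if lookahead_each.match(line)]

print(kept_once == kept_each == [lines[1]])  # True
```

Both patterns keep only the line without "part"; the first performs a single lookahead per line, while the second performs one per character.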
You need to be aware that all attempts to process XML using regular expressions are wrong, in the sense that (a) there will be some legitimate ways of writing the XML document that the regex doesn't match, and (b) there will be some ways of getting false matches, e.g. by putting nasty stuff in XML comments. Sometimes being right 99% of the time is OK of course, but don't do this in production because soon we'll have people writing on SO "I need to generate XML with the attributes in a particular order because that's what the receiving application requires."
Your regex, for example, requires the attribute to be in double rather than single quotes, and it doesn't allow whitespace around the "=" sign, or in several other places where XML allows whitespace. If there's any risk of people deliberately trying to defeat your regex, you need to consider tricks like people writing &#x70; in place of p.
Even if this is a one-off with no risk of malicious subversion, you're much better off doing this with XPath. It then becomes a simple query like //instance_material[@symbol[not(contains(., 'part'))]]
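As a rough illustration of the parser-based route, here is a Python sketch using the standard library's xml.etree.ElementTree. Its XPath subset does not include contains(), so the filter is written in Python instead; a full XPath 1.0 engine such as lxml could evaluate the query directly. The wrapping root element and the quoting variations in the sample are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second element uses single quotes and
# whitespace around "=", which the original regex would reject.
doc = """<root>
  <instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
  <instance_material symbol='material_677' target = "#material_677"/>
</root>"""

root = ET.fromstring(doc)

# Keep the symbol of every instance_material whose symbol lacks "part".
without_part = [
    el.get("symbol")
    for el in root.iter("instance_material")
    if "part" not in el.get("symbol", "")
]

print(without_part)  # ['material_677']
```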

Regex exclude value from repetition

I'm working with load files and trying to write a regex that will check for any rows in the file that do not have the correct number of delimiters. Let's pretend the delimiter is % (I'm not sure this text field supports the delimiters that are in the load file). The regex I wrote that finds all rows that are correct is:
^%([^%]*%){20}$
Sometimes it's more beneficial to find any rows that do not have the correct number of delimiters, so to accomplish that, I wrote this:
(^%([^%]*%){0,19}$)|(^%([^%]*%){21,}$)
I'm concerned about the efficiency of this (and any regex I write in general), so I'm wondering if there's a better way to write it, or if the way I wrote it is fine. I thought maybe there would be some way to use alternation with the repetition tokens, such as:
{0,19}|{21,}
but that doesn't seem to work.
If it's helpful to know, I'm just searching through the files in Sublime Text, which I believe uses PCRE. I'm also open to suggestions for making the first regex better in general, although I've found it to work pretty well even in exceptionally large load files.
If your regex engine supports negative lookaheads, you can slightly modify your original regex.
^(?!%([^%]*%){20}$)
The regex above is only useful for testing whether a row is malformed, because it matches an empty string. If you want to capture the row itself, you need to add a .* part:
^(?!%([^%]*%){20}$).*$
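To see the negative lookahead at work, here is a small Python sketch. The helper make_row and the generated rows are hypothetical test data; Python's re module is used here rather than Sublime Text's engine, but the pattern uses no syntax specific to either.

```python
import re

def make_row(fields):
    """Build a row that starts with % and contains `fields` %-terminated fields."""
    return "%" + "".join(f"f{i}%" for i in range(fields))

# Two well-formed rows (20 fields) and two malformed ones (19 and 21 fields).
rows = [make_row(20), make_row(19), make_row(21), make_row(20)]
text = "\n".join(rows)

# MULTILINE makes ^ and $ match at the start and end of every row.
bad_row = re.compile(r"^(?!%([^%]*%){20}$).*$", re.MULTILINE)
bad_rows = [m.group(0) for m in bad_row.finditer(text)]

print(len(bad_rows))  # 2
```

Be aware that an empty line also passes the negative lookahead, so blank rows in a file would be reported as malformed too.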

How to find arbitrary URLs in plain text?

There are tons of solutions to find and/or parse normal URLs, but none of them deals with arbitrary text, i.e. URLs that are split over several lines. How would you find a URL that can have line breaks after any character?
Note: I'm not interested in the individual parts of the URL. I just want to find all URLs in a given text to convert them to links (e.g. like in plain e-mail text).
Example:
Text text text text text. Look at this:
http://stackoverfl
ow.com/
questions/15252042/
find-urls-in-text
Question question question.
Several approaches are possible:
1) Write a regex with whitespace rules after each regular char. This will certainly blow up the regex pattern but is the most flexible one. For catching line breaks use DOT_ALL mode. DOT_ALL will however produce the same problems as the next approach.
2) (Temporarily) remove line breaks and use normal regex pattern matching. This approach has problems though as it can happen that you include more text than necessary (at the end of the URL) or don't find a URL (if the linebreak is at the start, messing up the protocol string).
2a) A modification of 2) could be to do several match attempts removing only certain line breaks, e.g. after looking for an initial URL part (e.g. www, http etc.). Only possible if recognition time is secondary.
3) Ease your task with domain-specific knowledge. For instance, if you know where line breaks can occur (or if they occur only at specific positions), then look for these specific cases and solve them first. Then return to the usual regex search.
3a) A variation of 3) could be to look specifically for the protocol and the page extension, using a regex with full whitespace rules to find the start and end of a URL. This obviously works only if there is always a protocol and a filename with an extension. Transform the found tokens into regular ones without whitespace (but include a space before the protocol and after the extension), and then remove all line breaks in the text. Now you can match the URL with a regular regex.
There are certainly more variations possible, but the general idea is the same.
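As one possible reading of approaches 2a) and 3), the sketch below looks for the protocol first and then absorbs following lines only while they contain no whitespace at all. That continuation rule is an assumption made for this example, not something the approaches above prescribe, and the function name find_wrapped_urls is hypothetical.

```python
import re

text = """Text text text text text. Look at this:
http://stackoverfl
ow.com/
questions/15252042/
find-urls-in-text
Question question question."""

# Protocol and first fragment, then any number of following lines that
# consist solely of non-whitespace characters (assumed to be continuations).
wrapped_url = re.compile(r"https?://\S+(?:\n\S+$)*", re.MULTILINE)

def find_wrapped_urls(text):
    """Return the URLs found in text with their internal line breaks removed."""
    return [m.group(0).replace("\n", "") for m in wrapped_url.finditer(text)]

urls = find_wrapped_urls(text)
print(urls)  # ['http://stackoverflow.com/questions/15252042/find-urls-in-text']
```

This inherits the weakness mentioned in 2): a single-word line directly after a URL would be swallowed as part of it, and a line break inside the protocol string itself is not handled.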

Having trouble creating a regex for a list of zip codes

I need to test whether a list of zip codes in a textarea has only 5-digit zip codes. Under normal circumstances the list would look like this:
56228, 56243, 55324, 55325, 55329, 55355, 55389
I need to find out if there is anything but the above pattern in the textarea. There can be any number of individual zip codes, but I need to make sure there isn't anything else. (I think I'm going to need to be able to highlight illegal matches in the textarea also, but I'll cross that bridge when I get to it).
I started with this regex:
^\d{5},?\s?$+
I'm very new to building regular expressions, but as I understand it, the above should match any set of 5 digits, and commas and whitespace after the five digits may or may not be there.
Online regex testers (I've tried several) aren't finding any matches, whether I have a legitimate list of zip codes or a list with "illegal" characters.
What am I missing here?
This one should suit your needs:
^([, ]*\d{5})+[, ]*$
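A quick way to try the pattern is the Python sketch below. The helper invalid_entries is a hypothetical addition aimed at the "highlight illegal matches" follow-up from the question; it is not part of the pattern above.

```python
import re

zip_list = re.compile(r"^([, ]*\d{5})+[, ]*$")

def is_valid_zip_list(value):
    """True if value consists only of 5-digit groups, commas and spaces."""
    return zip_list.match(value) is not None

def invalid_entries(value):
    """Return the comma/space separated tokens that are not exactly 5 digits."""
    tokens = re.split(r"[, ]+", value)
    return [t for t in tokens if t and not re.fullmatch(r"\d{5}", t)]

good = "56228, 56243, 55324, 55325, 55329, 55355, 55389"
bad = "56228, 5624a, 55324"

print(is_valid_zip_list(good))  # True
print(is_valid_zip_list(bad))   # False
print(invalid_entries(bad))     # ['5624a']
```

Because the separators are optional in the pattern, ten digits in a row such as 5622856243 are accepted as two zip codes; require at least one separator between groups if that should be rejected.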