Finding all XML Files containing specific strings using REGEX - regex

I use VS Code for Salesforce and have hundreds of field sets in the sandbox. I would like to use a regex to find all XML files that contain these 2 words in any order:
LLC_BI__Stage__c
LLC_BI__Status__c
I have tried these regexes but they did not work, I assume because the strings are on different lines:
(?=LLC_BI__Stage__c)(?=LLC_BI__Status__c)
^(?=.*\bLLC_BI__Stage__c\b)(?=.*\bLLC_BI__Status__c\b).*$
(.* LLC_BI__Stage__c.* LLC_BI__Status__c.* )|(.* LLC_BI__Status__c.* LLC_BI__Stage__c.*)
E.g., this XML file contains the 2 strings and should be returned:
<displayedFields>
<field>LLC_BI__Amount__c</field>
<isFieldManaged>false</isFieldManaged>
<isRequired>false</isRequired>
</displayedFields>
<displayedFields>
**<field>LLC_BI__Stage__c</field>**
<isFieldManaged>false</isFieldManaged>
<isRequired>false</isRequired>
</displayedFields>
<displayedFields>
<field>LLC_BI__lookupKey__c</field>
<isFieldManaged>false</isFieldManaged>
<isRequired>false</isRequired>
</displayedFields>
<displayedFields>
**<field>LLC_BI__Status__c</field>**
<isFieldManaged>false</isFieldManaged>
<isRequired>false</isRequired>
</displayedFields>

You could use an alternation to match the two strings in either order and, according to this post, use [\s\S\r] to match any character including newlines.
If there is an issue using [\s\S\r], you might use [\S\r\n\t\f\v ]* instead.
(?:LLC_BI__Stage__c[\S\s\r]*LLC_BI__Status__c|LLC_BI__Status__c[\S\s\r]*LLC_BI__Stage__c)
Explanation
(?: Non capturing group
LLC_BI__Stage__c[\S\s\r]*LLC_BI__Status__c Match first part till second part
| Or
LLC_BI__Status__c[\S\s\r]*LLC_BI__Stage__c Match second part till first part
) Close group
Regex demo 1 and Regex demo 2
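If a script is an option instead of the VS Code search box, the same check can be sketched in Python without a multiline regex: read each file and test that both strings occur somewhere in it. The function name files_containing_all and the *.xml glob are assumptions made for this illustration, not part of any Salesforce tooling.

```python
from pathlib import Path


def files_containing_all(root, needles, pattern="*.xml"):
    """Return paths under root whose text contains every string in needles."""
    matches = []
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Order does not matter: each needle only has to occur somewhere.
        if all(needle in text for needle in needles):
            matches.append(path)
    return matches
```

Calling files_containing_all(".", ["LLC_BI__Stage__c", "LLC_BI__Status__c"]) from the project folder would list the matching files. This is a plain substring test, so it does not enforce word boundaries the way \b would.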

Related

separate similar type of substrings from a string

I have a string that resembles XML.
The string below contains several substrings starting with <SingleProvisioningRequest> and ending with </SingleProvisioningRequest>.
Is there a way in Python to put these substrings into a list?
"<SingleProvisioningRequest><msisdn>919949566686</msisdn>
<serviceId>104900900</serviceId>
<renewalCount>-1</renewalCount>
<isAdvanceRenewal>1</isAdvanceRenewal>
<userName>ad</userName>
<password>ad</password>
<vendorId>1</vendorId>
<circleId>AP</circleId>
<productId>4305</productId>
</SingleProvisioningRequest><SingleProvisioningRequest>
<msisdn>918698291214</msisdn>
<serviceId>20900302900</serviceId>
<renewalCount>-1</renewalCount>
<isAdvanceRenewal>1</isAdvanceRenewal>
<userName>ad</userName>
<vendorId>1</vendorId>
<circleId>MAH</circleId>
<productId>7956</productId>
</SingleProvisioningRequest>"
Use this regex (?<=<SingleProvisioningRequest>)[\s\S]+?(?=<\/SingleProvisioningRequest>)
See it working on regex101; it matches whatever is between the tags you described.
Note that it uses a lookbehind for the opening tag and a lookahead for the closing one. The [\s\S]+? in the middle is a non-greedy way of matching everything in between, newlines included.
As for Python, this should place the matches in a list:
import re
p = re.compile(r'(?<=<SingleProvisioningRequest>)[\s\S]+?(?=<\/SingleProvisioningRequest>)')
p.findall('YOUR TEXT HERE')
Update for your update (matching the tags too): regex101
<SingleProvisioningRequest>[\s\S]+?<\/SingleProvisioningRequest>
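A minimal sketch comparing the two patterns from this answer. The sample text is a shortened version of the string in the question, and the variable names inner and outer are only illustrative.

```python
import re

text = (
    "<SingleProvisioningRequest><msisdn>919949566686</msisdn>\n"
    "<serviceId>104900900</serviceId>\n"
    "</SingleProvisioningRequest><SingleProvisioningRequest>\n"
    "<msisdn>918698291214</msisdn>\n"
    "<serviceId>20900302900</serviceId>\n"
    "</SingleProvisioningRequest>"
)

# Lookbehind/lookahead version: returns only what sits between the tags.
inner = re.findall(
    r"(?<=<SingleProvisioningRequest>)[\s\S]+?(?=</SingleProvisioningRequest>)",
    text,
)

# Plain version: the opening and closing tags are part of each match.
outer = re.findall(
    r"<SingleProvisioningRequest>[\s\S]+?</SingleProvisioningRequest>",
    text,
)

print(len(inner), len(outer))
```

Both calls return one list entry per request; the only difference is whether the surrounding tags are included.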

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work from is the linked image. The creators can't even confirm which flavour of regex it is; they believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately none of the patterns below have worked so far. I tested each one live (rather than only in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature and, as mentioned before, I'm not sure whether it is Python. I appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
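Assuming a replace step is available (the question suggests it may not be in the asker's tool), the two operations could be sketched in Python as follows. The names POSITIONER, IDENTIFIER and find_identifier are made up for this example.

```python
import re

POSITIONER = re.compile(r"\?[0-9]{2}")  # e.g. ?21
IDENTIFIER = re.compile(r"T[0-9]{7}")   # e.g. T1234567


def find_identifier(line):
    """Strip positioners first, then look for the identifier."""
    cleaned = POSITIONER.sub("", line)
    match = IDENTIFIER.search(cleaned)
    return match.group(0) if match else None


for line in ["T123?214567", "T?211234567", "T1234567"]:
    print(line, "->", find_identifier(line))
```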
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Demo
and you may try this regex to find what you want
(T(?:\d{7}|[\#\d]{8}))\s
Demo, in which the target string is captured in group 1 (or \1).
Finally, replace # with ?21 or whatever you like.
A Python script might look like this:
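import re

ss="""T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# Print the matches with each ?21 replaced by the placeholder '#'.
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Print the same matches with '#' turned back into ?21.
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))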
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group 1 followed by group 2: \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
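As a hedged illustration of this pattern with a replace step, here is a short Python sketch using re.sub. The helper name drop_positioner is an assumption for the example, and it presumes the engine accepts \1\2 style backreferences in the replacement, as Python does.

```python
import re

# Pattern from the answer above; groups 1 and 2 surround the optional positioner.
PATTERN = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")


def drop_positioner(text):
    """Rejoin group 1 and group 2, leaving out the ?NN part between them."""
    return PATTERN.sub(r"\1\2", text)


print(drop_positioner("T123?214567"))
print(drop_positioner("T?211234567"))
print(drop_positioner("T0000001"))
```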
Example Python
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2
Usage Example:

Extracting address with Regex

I'm trying to look for Street|St|Drive|Dr and then get the whole line in order to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do I match only whole words and not substrings, so that it ignores words that contain those words (the first match, for example)?
One option is to use the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
Edit:
From the link you posted in a comment, it seems that the \b solution solves your question:
How do i match only whole words and not substrings so it ignores words that contain those words (the first match for example).
However, it seems like there are additional issues with your regex.
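A small sketch of what the \b anchors change, using a simplified pattern that keeps only the alternation from the question. The sample strings other than "Tax Invoice/Statement" are invented for the example.

```python
import re

without_boundary = re.compile(r"(Street|St|Drive|Dr)")
with_boundary = re.compile(r"\b(Street|St|Drive|Dr)\b")

samples = ["Tax Invoice/Statement", "12 Example St", "5 Sample Drive"]

for sample in samples:
    loose = without_boundary.search(sample)
    strict = with_boundary.search(sample)
    # "Statement" contains "St", so only the unanchored pattern matches it.
    print(sample, "| loose:", bool(loose), "| strict:", bool(strict))
```

Note that \b only rejects substrings inside longer words; a standalone token such as DR in "163.66 DR" is still a whole word and would match if the search is case-insensitive.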

RegExp , Notepad++ Replace / remove several values

I have this dataset: (about 10k times)
<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
<Title><![CDATA[Superior Singing Method - Online Singing Course]]></Title>
<Description><![CDATA[High Quality Vocal Improvement Product With High Conversions. Online Singing Lessons Course Converts Like Crazy Using Content Packed Sales Video. You Make 75% On Every Sale Including Front End, Recurring, And 1-click Upsells!]]></Description>
<HasRecurringProducts>true</HasRecurringProducts>
<Gravity>45.9395</Gravity>
<PercentPerSale>74.0</PercentPerSale>
<PercentPerRebill>20.0</PercentPerRebill>
<AverageEarningsPerSale>74.9006</AverageEarningsPerSale>
<InitialEarningsPerSale>70.1943</InitialEarningsPerSale>
<TotalRebillAmt>16.1971</TotalRebillAmt>
<Referred>75.0</Referred>
<Commission>75</Commission>
<ActivateDate>2011-06-23</ActivateDate>
</Site>
I am trying to do the following:
Get the data from within the <Id> tags and use it to create a URL, so in this example it should make
http://www.reviews.how2sing.domain.com
Also, all other data has to go; I want to run a regex that will just give me a list of URLs.
I would prefer to do it in Notepad++, but I'm not good at regex. Any help would be welcome.
To keep the regex relatively simple you can just use:
.*?<id>(.+?)</id>
Replace with:
http://www.reviews.\1.domain.com\n
That will search for and replace every Id tag together with the text preceding it. You can then just remove the leftover text after the last Id manually.
Make sure matches newline is selected.
Regex is straightforward, only slightly tricky part is that it uses +? and *? which are non-greedy. This prevents the whole file from being matched. The () indicate a capture group that is used in the replacement, i.e. \1.
If you want a regex that also replaces the last part, then use:
.*?(?:(<id>)?(.+?)</id>).+?(?:<id>|\Z)
This is a bit more tricky, it uses:
(?:...) A non-capturing group.
| OR
\Z end of file
Basically, the first time it will match everything up to the end of the first </id> and replace up to and including the next <id>. After that it will have replaced the starting <id> so everything before </id> goes in the group. On the last match it will match the end of file \Z.
If you only want the Id values, you can do:
'<Id>([^<]*)<\/Id>'
Then you can get the first captured group \1 which is the Id text value and then create a link from it.
Here is a demo:
http://regex101.com/r/jE9qN8
[UPDATE]
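To get rid of all other lines, match this regex: '.*<Id>([^<]*)<\/Id>.*' and replace with the first captured group \1. Since the data spans multiple lines, you will need the DOTALL or /s flag activated so that . also matches newlines. Be aware that with DOTALL the greedy .* on both sides swallows the whole text in one match, so only the last Id is captured; when there are several Ids, make the leading part lazy (.*?) and drop the trailing .*, as in the first answer.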
Hope that helps.
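Outside Notepad++, the same idea (capture the Id text, then build the link) could be sketched in Python like this. The second record, EXAMPLE2, is a made-up placeholder, and lowercasing the Id is an assumption based on the lowercase how2sing in the expected URL.

```python
import re

data = """<Site>
<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
</Site>
<Site>
<Id>EXAMPLE2</Id>
<PopularityRank>2</PopularityRank>
</Site>"""

# [^<]* stops at the next '<', so each match stays inside one Id element.
ids = re.findall(r"<Id>([^<]*)</Id>", data)

# Assumption: the URL uses the Id in lowercase.
urls = [f"http://www.reviews.{site_id.lower()}.domain.com" for site_id in ids]

print("\n".join(urls))
```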

RegEx: capture entire group content

I am writing a parser for some Oracle commands, like
LOAD DATA
INFILE /DD/DATEN
TRUNCATE
PRESERVE BLANKS
INTO TABLE aaa.bbb
( some parameters... )
I already created a regex to match the entire command. I am now looking for a way to capture the name of the input file ("/DD/DATEN" for instance here).
My problem is that using the following regex will only return the last character of the first group ("N").
^\s*LOAD DATA\s*INFILE\s*(\w|\\|/)+\s*$
Debuggex Demo
Any ideas?
Many thanks in advance
EDIT: following @HamZa's question, here is the entire regex to parse the Oracle LOAD DATA INFILE command (simplified though):
^\s*LOAD DATA\s*INFILE\s*((?:\w|\\|/)+)\s*((?:TRUNCATE|PRESERVE BLANKS)\s*){0,2}\s*INTO TABLE\s*((?:\w|\.)+)\s*\(\s*((\w+)\s*POSITION\s*\(\s*\d+\s*\:\s*\d+\s*\)\s*((DATE\s*\(\s*(\d+)\s*\)\s*\"YYYY-MM-DD\")|(INTEGER EXTERNAL)|(CHAR\s*\(\s*(\d+)\s*\)))\s*\,{0,1}\s*)+\)\s*$
Debuggex Demo
Let's point out the wrongdoer in your regex: (\w|\\|/)+. What happens here?
You're matching either a word character or a back/forward slash and putting it in group 1, (\w|\\|/); after that you're telling the regex engine to do this one or more times with +. A repeated capturing group only keeps its last iteration, which is why you get "N". What you actually want is to match those characters several times before grouping them. So you might use a non-capturing group (?:) inside the capturing one: ((?:\w|\\|/)+).
You might notice that you could just use a character class after all ([\w\\/]+). Hence, your regex could look like
^\s*LOAD DATA\s*INFILE\s*([\w\\/]+)\s*$
On a side note: that end anchor $ will cause your regex to fail if you're not using multiline mode. Or is it that you intentionally didn't post the full regex :) ?
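To make the difference concrete, here is a short Python sketch that runs the original pattern and the character-class version on the first two lines of the command. Python is only assumed here as a convenient engine; the question does not say which one the parser uses.

```python
import re

command = "LOAD DATA\nINFILE /DD/DATEN"

# The quantifier is outside the group, so the group is overwritten on each repeat.
repeated_group = re.match(r"^\s*LOAD DATA\s*INFILE\s*(\w|\\|/)+\s*$", command)

# The quantifier is inside the group, so the group spans the whole file name.
grouped_repeat = re.match(r"^\s*LOAD DATA\s*INFILE\s*([\w\\/]+)\s*$", command)

print(repeated_group.group(1))
print(grouped_repeat.group(1))
```

The first print shows only the final character that the group matched, the second the full path.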
Not tested but...
^\s*LOAD DATA\s*INFILE\s*(\S+)\s*$