Matching a word if preceding text is not "class=' " - regex

I'm trying to create a regex for a search that will look at the following code and return only the ids and not the classes:
1 id="contact"
2 class="contact"
3 #contact
4 .contact
I want to return contact from the 1st and 3rd lines and NOT 2nd and 4th lines.
This is for a search across multiple files to avoid going through each one individually and checking whether it needs changing or not.
Is this possible?

Here you go:
/(?:#|id=")(\w+)"?/g
This matches strings beginning with either # or id=" followed by word characters. \w already covers underscores, but you'll probably want to extend it to handle dashes, I'd bet.
In this case, the first group is non-capturing, and the ID text will be your first capture group $1.
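As a rough illustration of the capture-group approach, here is a minimal Python sketch. The names ID_PATTERN and find_ids are made up for this example, and the [\w-] character class is an assumed extension so that dashed IDs also match; it is not part of the pattern above.

```python
import re

# Non-capturing prefix (# or id="), then the ID itself in capture group 1.
# [\w-] is an assumed extension so that dashed IDs such as "main-nav" match.
ID_PATTERN = re.compile(r'(?:#|id=")([\w-]+)"?')


def find_ids(text):
    """Return every ID name that follows '#' or 'id="' in the text."""
    return ID_PATTERN.findall(text)


sample = 'id="contact"\nclass="contact"\n#contact\n.contact'
ids = find_ids(sample)  # only the id="..." and #... lines contribute
```

Because the prefix group is non-capturing, findall returns just the ID text rather than tuples.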
UPDATE
this one:
(?:(?<=id=")|(?<=#))(contact)
uses a positive lookbehind to find your prefixes and matches just the string "contact". Lookbehind is not supported by older JavaScript engines (it was only added in ES2018), so it may not work in browser-based testers, but it will work in a text editor or a CLI tool like ack.
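For anyone who wants to try the lookbehind version outside an editor, the following is a small Python sketch. Python's re module only accepts fixed-width lookbehinds, which is fine here because each alternative is its own lookbehind. The names ID_REF, lines and matching are illustrative only.

```python
import re

# Match "contact" only when it is directly preceded by id=" or by #.
ID_REF = re.compile(r'(?:(?<=id=")|(?<=#))contact')

lines = ['id="contact"', 'class="contact"', '#contact', '.contact']

# Keep the lines where the lookbehind condition holds.
matching = [line for line in lines if ID_REF.search(line)]
```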

Related

Extracting address with Regex

I'm trying to look for Street|St|Drive|Dr and then get all the contents of the line to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do i match only whole words and not substrings so it ignores words that contain those words (the first match for example).
One option is to use the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
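To see what \b changes, here is a short Python sketch that isolates just the keyword part of the pattern. LOOSE and STRICT are hypothetical names, and the sample strings are shortened versions of the matches quoted in the question.

```python
import re

# Without word boundaries, "St" is found inside "Statement".
LOOSE = re.compile(r"(Street|St|Drive|Dr)")

# With \b on both sides, the keyword has to be a whole word.
STRICT = re.compile(r"\b(Street|St|Drive|Dr)\b")

false_positive = "Tax Invoice/Statement"
real_address = "12 Main St"

loose_hit = LOOSE.search(false_positive)    # matches the "St" in "Statement"
strict_hit = STRICT.search(false_positive)  # no whole-word keyword here
address_hit = STRICT.search(real_address)   # "St" stands alone, so it matches
```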
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
Edit:
From the link you posted in a comment, it seems that the \b solution solves your question:
How do i match only whole words and not substrings so it ignores words that contain those words (the first match for example).
However, it seems like there are additional issues with your regex.

regex: substitute character in captured group

EDIT
In a regex, can a matching capturing group be replaced with the same match altered substituting a character with another?
ORIGINAL QUESTION
I'm converting a list of products into a CSV text file. Every line in the list has: number name[ description] price in this format:
1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18
The resulting format must also include a slug (with - instead of spaces) as the second field:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product-name-two-2:second description, maybe:80
3 THIRD PROD:third-prod-3::18
The regex i'm using is this:
(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)
and substitution string is:
\1 \2:\L$2-\1:\3:\4
This way my result is:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product name two-2:second description, maybe:80
3 THIRD PROD:third prod-3::18
What I'm missing is the hyphen separator I need in the second field, that is, group \2 with '-' instead of ' '.
Is it possible with a single regex, or should I go for a second pass?
(for now i'm using Sublime text editor)
Thanx.
I don't think doing this in a single pass is reasonable, and it may not even be possible. To replace the spaces with hyphens, you need either multiple passes or continuous matching, and both lose the context of the capturing groups you need to rearrange your structure. So after your first replace, I would search for (?m)(?:^[^:\n]*:|\G(?!^))[^: \n]*\K followed by a single literal space, and replace with -. I'm not sure whether Sublime enables the multiline modifier by default; if it does, you can drop the (?m).
The answer might be different if you were using a programming language that supports callback functions for regex replace operations, where you could do the space-to-hyphen replacement inside the callback.
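As a sketch of that callback idea, here is how it could look in Python, where re.sub accepts a function as the replacement. It reuses the regex from the question; the names LINE, to_csv and convert are made up for this example.

```python
import re

# Same groups as in the question: number, NAME, optional description, price.
LINE = re.compile(r"(\d+) ([A-Z ]+?) ?([a-z ,]*):(\d+)")


def to_csv(match):
    """Build the output line, creating the slug inside the callback."""
    number, name, description, price = match.groups()
    # The space-to-hyphen replacement happens here, on the captured name only.
    slug = name.lower().replace(" ", "-") + "-" + number
    return f"{number} {name}:{slug}:{description}:{price}"


def convert(text):
    """Rewrite every product line found in the text in a single pass."""
    return LINE.sub(to_csv, text)
```

Since the callback sees all four groups at once, the slug can be built without a second pass over the file.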

Regex parsing using Kimono Labs

I am attempting to use software supplied by Kimonolabs to get a list of articles and their links from a web site. The problem I am having is that a string I have scraped from the web site has a date along with some text that I am unable to separate from the date.
Kimono uses this syntax for a regex:
/^()(.*?)()$/
first bracket => to the left of the required content
second bracket => this is what should get extracted
third bracket => to the right of the required content
Specifically the website I am trying to scrape is:
http://www.yashinquesada.com/futbol-nacional
Here is an example of the line I am trying to parse (I only want the date):
<p class="nspInfo nspInfo1 tleft fnone">Enero 08, 2016 <a href="/futbol-nacional/28-la-primera" >La Primera</a></p>
My attempts to parse this line returned no results. I have tried reading through regex reference materials, but they are pretty complicated for me.
Any suggestions are appreciated!
The regular expressions Kimono expects need to have three groups (a group is a pair of parentheses). That means you always need to keep this structure:
/^()(.*?)()$/
This is Kimono's default, where the first group is empty, the second contains all the text (. matches any character, *? means "any number of times, but as few as possible"), and the third is empty again.
You can adapt that arrangement to cut off unwanted text at the beginning and at the end - the value that ends up in your data will always be whatever the middle group matches.
I suspect the values you currently get look like this:
Enero 07, 2016 La Primera
so what you actually want to do is cut off text at the end.
Let's make the second and third groups more specific. We know the date always contains the year, which is four digits (\d\d\d\d or \d{4}) - and actually the match should end there. That's fairly easy:
/^()(.*?\d{4})(.*)$/
So, in English:
first group stays empty, no cut-off at the beginning
second group matches any character, but stops after matching four digits
third group matches the remainder of the value; Kimono will throw away that substring
Play around with the expression over at regex101: https://regex101.com/r/rM3tX0/1
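The same three-group expression can also be checked locally. This is only a Python sketch of what the middle group keeps; how Kimono applies the pattern internally is not shown here, and the names KIMONO_STYLE and extract are hypothetical.

```python
import re

# Group 1: cut-off on the left (empty), group 2: the value to keep,
# group 3: the remainder that gets thrown away.
KIMONO_STYLE = re.compile(r"^()(.*?\d{4})(.*)$")


def extract(value):
    """Return what the middle group matches, or None if there is no match."""
    match = KIMONO_STYLE.match(value)
    return match.group(2) if match else None


date = extract("Enero 08, 2016 La Primera")
```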

preg_match_all with multiple OR conditions

I'm trying to create a single regex pattern to match a string where 2 fields (separated by a comma) could either be
a) empty,
b) a single word, or
c) 2 words separated by a backslash (\).
This is a log file where position 1 is a source username field and position 2 is a destination user field, but both could be separated with a backslash if domain name is present (domain\username)
I've tried everything I can think of and can get 2 out of 3 to match, but not all conditions. Below are the possible variants that this string could be in. (something1 and something2 are known patterns that occur before and after this condition)
something1,,,something2
something1,,dstuser,something2
something1,,dstdomain\dstuser,something2
something1,srcdomain\srcuser,,something2
something1,srcdomain\srcuser,dstdomain\dstuser,something2
something1,srcuser,dstdomain\dstuser,something2
something1,srcuser,dstuser,something2
something1,srcuser,,something2
something1,srcdomain\srcuser,dstuser,something2
something1,srcdomain\srcuser,dstdomain\dstuser,something2
For example, I've tried this:
^.*something1,(,|(?J)(?<src_username>[^\\]*),|(?<src_domain>.*?)\\(?<src_username>[^\\]*),).*?,something2*
this matches some of the time, but I'm curious if this is possible with a single line of regex.
Thanks in advance....
I think you are looking for this regex:
(?J)^.*something1,(?:,|(?<src_username>[^,\\]+),|(?<src_domain>[^,\\]+)\\(?<src_username>[^,\\]+),)(?:,|(?<dst_username>[^\\,]+),|(?<dst_domain>[^,\\]+)\\(?<dst_username>[^,\\]*),)something2.*
I am using the negated character class [^,\\] extensively so as not to overmatch and to stay within the boundaries of a "cell". I also use (?:...) non-capturing groups to avoid cluttering the captured groups, which helps keep the output clean.
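For comparison, here is a Python sketch of the same idea. Python's re module does not support (?J) or duplicate group names, so this version swaps in a different technique: each field is an optional "domain\" prefix followed by a possibly empty username. The names FIELDS and parse are made up for this example.

```python
import re

# Each field: optional "domain\" prefix, then a username that may be empty.
# [^,\\] keeps every group inside its own comma-separated cell.
FIELDS = re.compile(
    r"something1,"
    r"(?:(?P<src_domain>[^,\\]+)\\)?(?P<src_username>[^,\\]*),"
    r"(?:(?P<dst_domain>[^,\\]+)\\)?(?P<dst_username>[^,\\]*),"
    r"something2"
)


def parse(line):
    """Return the four named fields as a dict, or None if the line does not match."""
    match = FIELDS.search(line)
    return match.groupdict() if match else None
```

A missing domain comes back as None and an empty user field as an empty string, so all listed variants go through one pattern.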

RegExp , Notepad++ Replace / remove several values

I have this dataset: (about 10k times)
<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
<Title><![CDATA[Superior Singing Method - Online Singing Course]]></Title>
<Description><![CDATA[High Quality Vocal Improvement Product With High Conversions. Online Singing Lessons Course Converts Like Crazy Using Content Packed Sales Video. You Make 75% On Every Sale Including Front End, Recurring, And 1-click Upsells!]]></Description>
<HasRecurringProducts>true</HasRecurringProducts>
<Gravity>45.9395</Gravity>
<PercentPerSale>74.0</PercentPerSale>
<PercentPerRebill>20.0</PercentPerRebill>
<AverageEarningsPerSale>74.9006</AverageEarningsPerSale>
<InitialEarningsPerSale>70.1943</InitialEarningsPerSale>
<TotalRebillAmt>16.1971</TotalRebillAmt>
<Referred>75.0</Referred>
<Commission>75</Commission>
<ActivateDate>2011-06-23</ActivateDate>
</Site>
I am trying to do the following:
Get the data from within the <Id> tags, and use it to create a URL, so in this example it should make
http://www.reviews.how2sing.domain.com
Also, all other data has to go; I want to perform a regex operation that will just give me a list of URLs.
I prefer to do it using Notepad++, but I suck at regex. Any help would be welcome.
To keep the regex relatively simple you can just use:
.*?<id>(.+?)</id>
Replace with:
http://www.reviews.\1.domain.com\n
That will search and replace all instances of the Id tag and the preceding text. You can then just remove the leftover text after the last Id manually.
Make sure ". matches newline" is selected. Since the pattern is written with lowercase <id> while the data uses <Id>, leave "Match case" unchecked.
The regex is straightforward; the only slightly tricky part is that it uses +? and *?, which are non-greedy. This prevents the whole file from being matched. The () indicate a capture group that is used in the replacement, i.e. \1.
If you want a regex that also replaces the last part, then use:
.*?(?:(<id>)?(.+?)</id>).+?(?:<id>|\Z)
This is a bit more tricky. It uses:
(?:...) a non-capturing group
| OR
\Z end of file
Basically, the first time it will match everything up to the end of the first </id> and replace up to and including the next <id>. After that it will have replaced the starting <id>, so everything before </id> goes in the group. On the last match it will match the end of file \Z. Because of the extra optional group, the Id value is now in the second capture group, so the replacement has to reference \2 instead of \1.
If you only want the Id values, you can do:
'<Id>([^<]*)<\/Id>'
Then you can get the first captured group \1 which is the Id text value and then create a link from it.
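If a script is an option, the capture-and-build step can be sketched in Python like this. The names ID_TAG and build_urls are hypothetical, and lowercasing the Id is an assumption based on the example URL in the question (HOW2SING becoming how2sing).

```python
import re

# Capture the text between <Id> and </Id>; [^<]* cannot run past the closing tag.
ID_TAG = re.compile(r"<Id>([^<]*)</Id>")


def build_urls(xml_text):
    """Return one review URL per <Id> value found in the text."""
    return [
        # Lowercasing is an assumption taken from the question's example URL.
        f"http://www.reviews.{site_id.lower()}.domain.com"
        for site_id in ID_TAG.findall(xml_text)
    ]
```

Using findall sidesteps the need to delete the surrounding lines, since only the captured values are kept.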
Here is a demo:
http://regex101.com/r/jE9qN8
[UPDATE]
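To get rid of all other lines, match this regex: '.*<Id>([^<]*)<\/Id>.*' and replace with the first captured group \1. Since there are multiple lines, you will need the DOTALL or /s flag activated so that . also matches newlines. Be aware that with DOTALL both .* parts are greedy across lines, so on a file with many records the pattern matches once and keeps only the last Id; apply it to one record at a time.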
Hope that helps.