Regex to remove text between two strings in TextWrangler

I've searched quite hard for an answer to this.
Basically, what I'm trying to do is to remove certain fields in some of my exported vCards which I exported using the Mac's Contacts application via Automator.
I managed to remove those single-line fields such as Birthday and Social Network. However, there is one particular field which is taking up multiple lines which I assume is a base64-encoded version of the original image - the PHOTO field.
This is an example of the start of the field:
PHOTO;ENCODING=b;TYPE=JPEG:/9j/4AAQSkZJRgABAQAAAQABAAD/4gxYSUNDX1BST0ZJTEUA
The end varies, so I used the start of the next line as the end:
CATEGORIES
The closest I've got was PHOTO;ENCODING.*CATEGORIES
Unfortunately, it seems to select only the first line of the entire chunk.
Is there any way around this? I'm trying to do this in TextWrangler on my Mac.

Instead of .* you need:
(.+[\r\n]+)*.*
because . doesn't match line-break characters.
The group in parentheses matches one line of characters together with the line break that ends it, and the * after the group repeats it across as many lines as needed.
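As a rough illustration outside TextWrangler, the same idea can be tried with Python's re module. This is a minimal sketch, not the answerer's exact pattern: the sample vCard text is invented, and the lazy line-by-line group plus the lookahead for CATEGORIES are assumptions chosen so that the match stops at the first CATEGORIES line.

```python
import re

# Hypothetical vCard excerpt; the name and the continuation lines are made up.
vcard = (
    "BEGIN:VCARD\n"
    "FN:Jane Doe\n"
    "PHOTO;ENCODING=b;TYPE=JPEG:/9j/4AAQSkZJRgABAQAAAQABAAD/4gxYSUNDX1BST0ZJTEUA\n"
    " AAAAAAAAAAAAAAAA\n"
    " BBBBBBBB\n"
    "CATEGORIES:Friends\n"
    "END:VCARD\n"
)

# "." does not cross a line break here, so whole lines are consumed explicitly:
# (?:.*\n)*? takes lines lazily until the next line starts with CATEGORIES.
photo_block = re.compile(r"^PHOTO;ENCODING(?:.*\n)*?(?=CATEGORIES)", re.MULTILINE)

cleaned = photo_block.sub("", vcard)
print(cleaned)
```

The lookahead keeps the CATEGORIES line itself in the output, so only the PHOTO field is removed.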

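With the help of a friend, I tried this in TextWrangler:
ATTACH;ENCODING=BASE64([^\n]*\n )*[^\n]*\n
and it matches each attachment. The group consumes every line that is followed by a continuation line (one starting with a space), and the final [^\n]*\n consumes the last line of the field.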
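The continuation-line approach can be sketched in Python as well. The example below adapts the pattern to the PHOTO field from the question; the sample data and the NOTE field that follows the photo are assumptions made up for illustration.

```python
import re

# Hypothetical vCard excerpt: continuation lines of the PHOTO field start with a space.
vcard = (
    "BEGIN:VCARD\n"
    "FN:Jane Doe\n"
    "PHOTO;ENCODING=b;TYPE=JPEG:/9j/4AAQ\n"
    " SkZJRgABAQAAAQAB\n"
    " AAD/4gxYSUNDX1BS\n"
    "NOTE:hello\n"
    "END:VCARD\n"
)

# (?:[^\n]*\n )* consumes each line that is followed by a line starting with a space;
# the trailing [^\n]*\n consumes the last line of the field.
folded_photo = re.compile(r"PHOTO;ENCODING=b(?:[^\n]*\n )*[^\n]*\n")

cleaned = folded_photo.sub("", vcard)
print(cleaned)
```

Because the pattern relies only on the leading space of continuation lines, it does not need to know which field comes next.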

Related

How to extract specific strings from a Page url on Google Data Studio?

I am new to Google Data Studio and I would like to extract the first parameter and last parameter of the following URL: /red-car/2020.75/it-it/window. That way I could have one category for the car colour (red-car) and one for the region (it-it).
The position of these parameters will be always the same. Is there any way I could extract these?
I have tried to use regular expressions, but so far I have not been able to figure out the right way.
Any suggestions?
Get the first field (red-car):
REGEXP_EXTRACT(test,r"^[^/]*/([^/]+)")
or, if the field contains http://:
REGEXP_EXTRACT(REPLACE(test,"http://",""),r"^[^/]*/([^/]+)")
Get the 3rd field (region: it-it):
REGEXP_EXTRACT(test,r"^[^/]*/[^/]*/[^/]*/([^/]+)")
The r prefix marks a raw string literal, which is the usual way to write a regular expression. The round brackets enclose the text to be extracted. [^/]* matches any run of text without a slash.
See documentation: https://support.google.com/datastudio/answer/7050487?hl=en&ref_topic=7041728
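To check the patterns outside Data Studio, they can be run through Python's re module. This is only a sketch: the helper regexp_extract is a hypothetical stand-in for the Data Studio function, and example.com is an invented host.

```python
import re
from typing import Optional

def regexp_extract(text: str, pattern: str) -> Optional[str]:
    """Hypothetical stand-in: return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

path = "/red-car/2020.75/it-it/window"
colour = regexp_extract(path, r"^[^/]*/([^/]+)")
region = regexp_extract(path, r"^[^/]*/[^/]*/[^/]*/([^/]+)")

# With a scheme in front, the "//" breaks the first pattern unless it is removed first.
full_url = "http://example.com/red-car/2020.75/it-it/window"
without_replace = regexp_extract(full_url, r"^[^/]*/([^/]+)")
with_replace = regexp_extract(full_url.replace("http://", ""), r"^[^/]*/([^/]+)")

print(colour, region, without_replace, with_replace)
```

The third call shows why the REPLACE step is needed: the empty segment between the two slashes of http:// leaves nothing for ([^/]+) to capture.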

Search in VSCode for the multiline contents of a set of XML tags, using a regular expression

I am using VSCode to do a global search of XML files. Within those files there are multiple instances of these XML tags: <translated></translated>. I need to find all occurrences of any hyphens - that exist anywhere between those tags, where the contents of those tags can be on multiple lines.
<translated>
Content is here
Could be on multiple lines
The meeting could take 3-4 hours
</translated>
In the above example, the phrase "3-4 hours" has a hyphen in it. I need a regex that works in VSCode and finds every hyphen that occurs within a set of these XML tags.
Option 1 (using VS Code)
This matches only one dash per pass, not all dashes, because limiting the search to the inside of one set of tags means each pass can find only one match per set. I considered deleting this answer, but if it is the only answer given it may be better than nothing. The workaround is to refresh the search (the button above the search box) and click Replace All over and over. If there are lots of dashes this is tedious, but it is better than no answer.
I have been experimenting with Visual Studio Code and the following seems to work.
(<translated>(.|\n)*?)(-)((.|\n)*?<\/translated>)
Assuming you want to, for example, replace the dash, you can do so by putting groups 1 and 4 back around the new text:
$1 <yourTextHere> $4
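The "replace all, then repeat" workaround can be sketched in Python, where the loop is explicit. The function name replace_dashes, the sample text, and the replacement " to " are assumptions for illustration; Python writes the group references as m.group(1) and m.group(4) rather than $1 and $4.

```python
import re

# Same pattern as in the answer; groups 1 and 4 hold the text before and after the dash.
dash_in_tags = re.compile(r"(<translated>(.|\n)*?)(-)((.|\n)*?</translated>)")

def replace_dashes(text: str, replacement: str = " to ") -> str:
    """Repeat the one-dash-per-block replacement until nothing matches.

    Assumes `replacement` contains no "-"; otherwise the loop would never end.
    """
    while True:
        text, count = dash_in_tags.subn(
            lambda m: m.group(1) + replacement + m.group(4), text
        )
        if count == 0:
            return text

sample = (
    "<translated>\n"
    "The meeting could take 3-4 hours\n"
    "or 5-6 hours\n"
    "</translated>\n"
    "Outside the tags: 7-8\n"
)

single_pass_count = dash_in_tags.subn(
    lambda m: m.group(1) + " to " + m.group(4), sample
)[1]
result = replace_dashes(sample)
print(result)
```

One caveat: because the lazy groups can run past a closing tag, this pattern can also pick up a hyphen that sits between two separate <translated> blocks.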
Example:
Before replace:
After replace (note only the 3-4 in the first section of the file(s) is affected and the 3 to 4 is not changed):
Option 2 / Update (using Brackets.io)
While I'm unsure why VSCode fails to match across files, the following regex works with Brackets (search for Brackets.io) across multiple files:
-(?=[^<]*?<\/translated>)
You have to have all your files in a folder and open the folder. Then search in the project (Find > Find in Files). The screenshot shows the matches found across all files. In the lower panel, for the selected file t2 copy.txt, it matches first on line 6 and then on line 16, and (correctly) does not match on line 10 because that hyphen is not contained in a translated tag set.
The reason -(?=[^<]*?<\/translated>) doesn't work in VSCode is that it does not explicitly contain a newline \n. Even though [^<] includes newlines, the \n needs to be actually written into the regex in order to trigger the multiline option. Why is this?
Primarily for performance reasons. See https://github.com/microsoft/vscode/issues/75265, which uses a similar regex and makes for interesting reading.
So simply using this
-(?=[^<]*?\n*<\/translated>)
works in vscode!
-(?=[^<]*?\n<\/translated>) would work for you too unless you have single line blocks like:
<translated>Con-tent is he-re</translated>
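For comparison, the lookahead version finds every hyphen in a single pass. The sketch below uses Python's re module, where the explicit \n trick is not needed; the sample text and the replacement " to " are invented, and the pattern assumes that no other tags (no "<") appear between a hyphen and its closing </translated>.

```python
import re

# A hyphen counts only if nothing but non-"<" characters separate it
# from the next closing </translated> tag.
hyphen_in_tags = re.compile(r"-(?=[^<]*?</translated>)")

sample = (
    "<translated>\n"
    "The meeting could take 3-4 hours\n"
    "or 5-6 hours\n"
    "</translated>\n"
    "Outside the tags: 7-8\n"
    "<translated>1-2 days</translated>\n"
)

match_count = len(hyphen_in_tags.findall(sample))
result = hyphen_in_tags.sub(" to ", sample)
print(match_count)
print(result)
```

The hyphen in 7-8 is skipped because the next "<" after it opens a new tag rather than closing one.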

Remove a character from the middle of a string with regex

I have no programming experience and thought this would be simple, but I have searched for days without luck. I am using a program to strip content from a web page. The program uses regex filters to display what you want from the stripped content. The stripped content can be any letters and is in the form USD/SEK. I want to display USDSEK, without the "/".
Thanks
To elaborate further - I am using a program called Data toolbar for chrome, which makes it easy to strip content from web pages. After it strips the content, it provides a regex filter to display what part of the content is displayed. But I have to know the regex command to remove the / from USD/SEK, to display just USDSEK. I've tried [A-Z.,]+ but that only displays USD. I need the regex command to grab the first 3 and last 3 characters only, or to omit the / from the string.
Try adding parentheses around the groups which you wish to capture:
([a-zA-Z]{3})\/([a-zA-Z]{3})
or, in flavors that support subpattern calls (here (?1) re-uses the pattern of group 1):
([a-zA-Z]{3})\/((?1))
Depending on the functionality of the program you are using, you can then reference these captured groups as $1 and $2 (or \1 and \2, depending on flavor).
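How the two captured groups are put back together can be seen in a short Python sketch. It only illustrates the capture-group idea; whether Data Toolbar accepts a replacement string like this is not something the thread confirms. Python's built-in re module does not support (?1), so the first form of the pattern is used.

```python
import re

# Two groups of three letters separated by a slash.
pair = re.compile(r"([a-zA-Z]{3})/([a-zA-Z]{3})")

# \1 and \2 refer to the first and second captured group.
joined = pair.sub(r"\1\2", "USD/SEK")

# The asker's attempt stops at the slash, which is why it showed only USD.
first_attempt = re.search(r"[A-Z.,]+", "USD/SEK").group(0)

print(joined, first_attempt)
```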

Regex select XML Element (containing hyphen) and inside content

I'm working with an enterprise CMS, and in order to properly create our weekly-updated dropdown menu without republishing our entire site, I have an XML document being created which has a number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the link's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands I think it's over 600KB). The element is <page-content>filler here</page-content>.
What I'm trying to do is use TextWrangler to find and replace all <page-content> tags as well as their containing content.
I've tried a few different regexes, but I can't seem to match the closing tag, so the match just trails on.
Here's what I've tried:
(<page-content>)(.*?)
The above will match up until the next starting <page-content> tag, which is not what I want.
(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)
The above finds no matches, even though the below will find the 7 matches it should.
(<content>)(.*?)(<\/content>)
I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.
Thanks!
EDIT: Before you tell me that regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to manually delete them and save the file again every week.
It seems the problem is that your . is not matching newlines that exist between your open and close tags.
An easy solution for this would be to add the s flag in order for your . to match over newlines. TextWrangler appears to support inline modifiers (?s). You could do it like this:
(<page-content>)(?s)(.*?)(<\/page-content>)
More information on modifiers here.
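The effect of the s modifier can be reproduced in Python, as a sketch under a few assumptions: the XML sample is invented, and the flag is passed as re.DOTALL instead of an inline (?s) in the middle of the pattern, since recent Python versions only accept global inline flags at the very start of an expression.

```python
import re

# Hypothetical CMS output: the page-content element spans several lines.
xml = (
    "<item>\n"
    "<title>Home</title>\n"
    "<page-content>\n"
    "<html>\n"
    "<body>lots of markup</body>\n"
    "</html>\n"
    "</page-content>\n"
    "</item>\n"
)

pattern_text = r"<page-content>.*?</page-content>"

# Without DOTALL, "." stops at every line break, so nothing matches.
match_without_flag = re.search(pattern_text, xml)

# With DOTALL (the "s" modifier), "." also matches line breaks.
stripped = re.sub(pattern_text, "", xml, flags=re.DOTALL)

print(match_without_flag)
print(stripped)
```

The lazy .*? matters as much as the flag: it stops at the first closing tag instead of running on to the last one in the file.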

How to Easily Remove Unwanted Parts in HTML Table Cells Using Notepad++

I have series of different occurrences of table cells in some html files as shown in this image:
http://screencast.com/t/MqGHN2iwfd
Apart from the beginning and end of each cell, they have the following parts in common:
.net/?mobile=true
/spotlightProfile.htm?f=mkt&v=
/#stats
I want either to remove all the parts that look like that at once,
or to remove them one by one in Notepad++:
the URL part that precedes .net/?mobile=true
the URL parts before and after /spotlightProfile.htm?f=mkt&v= and
the URL part before /#stats
I would also like to remove the duplicate occurrences, again in Notepad++.
Thanks a lot in advance for your help.
The regex would look something like this.
Search for: (.*)(\/\?mobile=true|\/spotlightProfile\.htm\?f=mkt&v=|\/#stats)(.*)
Replace with: \1\3
Basically we create 3 groups:
the text before the expression you match,
the expression you are trying to remove,
the rest of the line.
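The three-group replacement can be tried out in Python before running it in Notepad++. This is a sketch under stated assumptions: the table cells and URLs are invented, and the middle group is written as mandatory, because an optional middle group would let the first (.*) swallow the whole line and leave it unchanged.

```python
import re

# Group 1: text before the marker, group 2: the marker itself, group 3: the rest of the line.
marker = re.compile(
    r"(.*)(/\?mobile=true|/spotlightProfile\.htm\?f=mkt&v=|/#stats)(.*)"
)

# Hypothetical table cells; the URLs are made up for illustration.
cells = [
    "<td>http://example.net/?mobile=true</td>",
    "<td>http://example.com/player/#stats</td>",
    "<td>no marker here</td>",
]

# Keep groups 1 and 3, drop group 2.
cleaned = [marker.sub(r"\1\3", cell) for cell in cells]
for line in cleaned:
    print(line)
```

Lines without any of the three markers do not match at all and are left untouched.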