Remove everything from a cell that's within <> in Google Sheets - regex

I know there are a lot of topics about removing characters from Google Sheets cells. I've tried to solve my issue with the information on the web / Stack Overflow, but I can't find a solution.
I need to create a column with text in multiple rows. The original file still has styling codes (<p>, <strong>, <i>, etc.) in it. I need to remove these styling codes, so every <> code should be removed from the cell. I tried to do this with SUBSTITUTE, but then I can only remove one specific code, e.g. <p>, and the other styling codes are still in the sheet.
I think this could be done with REGEXREPLACE, but I can't get it working. I hope that someone can help me understand how to get this working. Thank you!

use:
=ARRAYFORMULA(REGEXREPLACE(D2:D, "<.*?>", ))
in some cases that won't be enough, so:
=ARRAYFORMULA(REGEXREPLACE(D2:D, "<\/\w+>|<\w+.*?>", ))
and in some cases even that won't be enough, so:
=ARRAYFORMULA(REGEXREPLACE(D2:D, "</?\S+[^<>]*>", ))
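To compare the three patterns outside of Sheets, here is a minimal Python sketch. It assumes a made-up sample cell, and it uses Python's `re` module rather than the RE2 engine behind REGEXREPLACE, so treat the results as indicative only. One thing it suggests checking on real data: the greedy `\S+` in the third pattern can also swallow text that sits between two tags when no whitespace separates them.

```python
import re

# Hypothetical sample cell; the patterns are copied from the formulas above.
cell = '<p>Hello <strong>world</strong>, <i class="x">again</i></p>'

patterns = {
    "lazy": r"<.*?>",
    "open_or_close": r"<\/\w+>|<\w+.*?>",
    "greedy_name": r"</?\S+[^<>]*>",
}

def strip_tags(text, pattern):
    """Replace every match of pattern with an empty string."""
    return re.sub(pattern, "", text)

results = {name: strip_tags(cell, p) for name, p in patterns.items()}

for name, cleaned in results.items():
    print(f"{name}: {cleaned!r}")
# lazy: 'Hello world, again'
# open_or_close: 'Hello world, again'
# greedy_name: 'Hello , again'
```

In this sample the first two patterns keep "world", while the third removes `<strong>world</strong>` as a single match.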

Related

Google Data Studio - Custom Field REGEXP_EXTRACT

I am trying to use the REGEXP_EXTRACT custom field to pull a portion of my URL using the page dimension in Google Data Studio and cannot figure it out. The page url structure is similar to this -
website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage
I'd like to only extract the middle section "great_practiceinfo_part2". I've tried many different formulas, but nothing seems to work. Does the page dimension work in this scenario? Any help would be much appreciated.
Thanks
It seemed to work fine in Google Sheets when I used =REGEXEXTRACT(A3,B3) with your string, website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage, in A3 and the regex \/([^\/]*?)\.aspx\? in B3. I'm guessing you just need to learn more about how to build your regex pattern string.
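The same pattern can be checked quickly in Python. This is only a sketch with Python's `re` module; it does not exercise Data Studio's REGEXP_EXTRACT itself, and the variable names are made up for illustration.

```python
import re

# URL and pattern taken from the question and the answer above.
url = "website.forum.com/webforms/great_practiceinfo_part2.aspx?function=greatcoverage"
pattern = r"\/([^\/]*?)\.aspx\?"

# The capture group holds the text between the last "/" and ".aspx?".
match = re.search(pattern, url)
page_name = match.group(1) if match else None

print(page_name)  # great_practiceinfo_part2
```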

Regex to replace spam links in Wordpress

I am dealing with old hacked sites in Wordpress where there are injection spam links on images.
I have access to the database and would like to remove links that look like this:
<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>
Now the text inside the href attribute varies; it might be for Cialis or any other product, but the rest doesn't vary. I want to remove the entire link, so the result is a single space.
I don't know regex, so I would appreciate the help. I've tried online generators but they don't seem to be working.
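One possible approach, sketched in Python under the assumption stated above that only the href value varies and everything else in the link is byte-for-byte identical. The function name and sample string are hypothetical, and this works on text you have already exported, not directly on the database.

```python
import re

# Assumption: fixed style attribute, any href value, and a single "." as link text.
SPAM_LINK = re.compile(r'<a style="text-decoration:none" href="[^"]*">\.</a>')

def remove_spam_links(html):
    """Replace each injected spam link with a single space."""
    return SPAM_LINK.sub(" ", html)

sample = (
    'before<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>after '
    '<a href="/about">About</a>'
)
cleaned = remove_spam_links(sample)

print(cleaned)  # before after <a href="/about">About</a>
```

Ordinary links are left alone because the pattern requires the exact style attribute and the lone "." as the link text.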

How can I replace all elements with a string+variable in vscode?

So we have these data-ui-test tags, which are useful for me when writing Protractor tests; however, all the values of said tags are the same at the moment. I want to replace all '<input' tags with '<input data-ui-test = "${someVariable}"', but I am unable to do this in the search & replace bar (Ctrl+Shift+F).
I am utterly clueless on how to approach the problem.
If you have any idea on how to do this, please let me know.

content empty when using scrapy

Thanks to everyone in advance.
I encountered a problem when using Scrapy on Python 2.7.
The webpage I tried to crawl is a discussion board for Chinese stock market.
When I tried to get the first number "42177" just under the banner of this page (the number you see on that webpage may not be the number you see in the picture shown here, because it represents the number of times this article has been read and is updated realtime...), I always get an empty content. I am aware that this might be the dynamic content issue, but yet don't have a clue how to crawl it properly.
The code I used is:
item["read"] = info.xpath("div[@id='zwmbti']/div[@id='zwmbtilr']/span[@class='tc1']/text()").extract()
I think the XPath is set correctly, and I have checked the return value of this response; it indeed told me that there is nothing under this directory. Results shown here: 'read': [u'<div id="zwmbtilr"></div>']
If it has something, there should be something between <div id="zwmbtilr"> and </div>.
I'd really appreciate it if you could share any thoughts on this!
I just opened your link in Firefox with NoScript enabled. There is nothing inside the <div id='zwmbtilr'></div>. If I enable JavaScript, I can see the content you want. So, as you already knew, it is a dynamic content issue.
Your first option is to try to identify the request generated by JavaScript. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use some package with JavaScript/browser emulation, something like ScrapyJS or Scrapy + Selenium.

Regex with iframe in Yahoo! Pipes

I'm building a Yahoo! Pipe to pull an RSS feed from Reddit which links to some content in the description. I'm using a regex to match the href attribute of the anchor link in an item.description field. The regex I'm using is:
^.+?href="([^"]+)">\[link\].+?$
As a test, I set the replace to simply:
$1
and I see that the entire description field has been replaced with the URL. So far, so good.
I then put the following in the replace field. The idea being to iframe the content that's linked to:
Content: <iframe src="$1">no iframe support</iframe> End
What I get out however is:
Content: no iframe support End
I've confirmed that this is also coming through in the pipe's output and not just in the Yahoo! Pipes debug console.
I've so far tried replacing my angle brackets with &lt; and &gt; entities. I've tried wrapping the entire thing in a <![CDATA[ ... ]]> block and still, I get nothing. If I break my iframe tag by removing an angle bracket, the broken content comes through fine, but if I have a well-formed iframe element, it vanishes, leaving the "no iframe support" text. Am I doing something wrong here, or is Yahoo! actively preventing me from using iframe tags in my generated pipe? A cursory search on Google isn't turning up anything related to this.
The pipe in question is here:
http://pipes.yahoo.com/pipes/pipe.info?_id=2ba41448cadd2347d86f377efd3d199f
This Pipes FAQ Question "Why does Pipes Strip <object> and <embed> tags... ?" shows that a certain amount of sanitization is performed, by placing content (at least certain content) into an iframe for the safety of RSS consumers - though it does not state it specifically, this probably also removes other iframes in order to avoid nesting and other work-arounds.
Yahoo is big enough that I doubt they have a weak sanitizer, but an extremely long shot is that you might be able to fool it by nesting the iframe in a bunch of other tags (again, I doubt this will work). Also, depending upon which step does the sanitization, perhaps adding part of the tag in one step, then adding another part somewhere else might work (yet again, doubt overwhelms me).
Not sure what else to suggest, other than getting something else to consume and transform your RSS a little bit more (by fixing otherwise broken tags??) - but that's what you're using pipes for to begin with, isn't it? Idunno...
Good luck!
Pipes has a fanatical devotion to the RSS spec, and the spec says the description field is plain text only. HTML etc. is supposed to go in the content:encoded field, not that I've had much luck getting Pipes to do that.