Regex to grab data from massive HTML string

I am grabbing an HTML source dump that includes some sort of JSON props created by React.
I'm trying to grab data in syntax like this: "siteName":"Example Site". I want to grab the "Example Site" text without the quotation marks.
I know I could be using an HTML parser, but this is actually within some JS code in the source.
Any thoughts on how I could do this? Thanks

This regex will get it, but I would use something else, like a JSON parser:
var regex = /"siteName":"(.+?)"/g;
var str = `{"siteName":"ABC Example Business","contactName":"Jeff","siteKey":"abcexample","tabKey":"service","entityKey":"1192289","siteId":152285976,"entityId":13123055221,"phone":"","mobile":"0100 000 000",}`;
var result = regex.exec(str);
console.log(result[1]);
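If the props blob can be isolated, parsing it as JSON is more robust than regexing each field. A minimal sketch of that approach, assuming the embedded object is valid JSON once a trailing comma like the one in the sample above is stripped (the window.__PROPS__ wrapper is purely illustrative):
// Sketch: isolate the props object from the dump, then parse it as JSON
// instead of regexing each field separately.
var source = 'window.__PROPS__ = {"siteName":"ABC Example Business","contactName":"Jeff","siteId":152285976,};';
var blob = /\{[^{}]*"siteName"[^{}]*\}/.exec(source)[0];   // grab the enclosing {...}
var props = JSON.parse(blob.replace(/,\s*\}$/, '}'));      // drop a trailing comma before parsing
console.log(props.siteName); // "ABC Example Business"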

How about this (non-greedy, so it stops at the first closing quote):
\"siteName\":\"(.+?)\"

Related

Extract Hyperlinks Google Apps Script

For a long time the following code worked perfectly to extract hyperlinks from a text using a regular expression:
var text = "this is a http://google.de link!";
var link = text.match("(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+üäö&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+üäö&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+üäö&@#/%=~_|$?!:,.]*\)|[A-Z0-9+üäö&@#/%=~_|$])", "gi");
The result should be "http://google.de" for the variable link, but it doesn't work anymore. I assume Google has changed something in GAS!?
Can you please tell me which expression I can use to extract hyperlinks from a string?
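One likely cause (an assumption, not confirmed in the question) is that passing flags as a second argument to String.prototype.match() is non-standard; the older Rhino-based Apps Script runtime tolerated it, while the current V8 runtime simply ignores it. A sketch of the same extraction using a RegExp literal, with the pattern itself unchanged:
var text = "this is a http://google.de link!";
// Build the flags into the RegExp instead of passing them as a second argument to match().
var linkRegex = /(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+üäö&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+üäö&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+üäö&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+üäö&@#\/%=~_|$])/gi;
var links = text.match(linkRegex);
Logger.log(links); // [http://google.de]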

Replacement Character inserted between each letter of CSV Dataset. How to Replace?

I'm working on importing a CSV dataset into a Google Sheet from my Drive. I have the script working; however, whenever the data imports, it looks like this:
After Import
var file = DriveApp.getFileById(url);
var csvString = file.getBlob().getDataAsString('UTF-8').replace(/\uFFFD/g, '');
var csvData = Utilities.parseCsv(csvString);
var sheet = SpreadsheetApp.openById(sheetid);
var s = sheet.getSheetByName('Data');
s.getRange(1, 1, csvData.length, csvData[0].length).setValues(csvData);
I've tried a number of different regex expressions to replace the unknown characters, but after a few days trying to figure it out, I figured I'd post it on here and get a bit of help. (I couldn't get the .replace() shown above to work; the rest of the code works to paste the data to my sheet.)
Edit: Here is the Expected Output. I've whited out the email addresses and usernames to keep the information private.
Expected Output
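One hedged guess: a replacement character between every letter is the classic symptom of a UTF-16 file being decoded as UTF-8 (the interleaved null bytes become U+FFFD). If that is the case here, decoding with the right charset avoids the replace() entirely; a sketch, with fileId and sheetId standing in for the original url and sheetid variables:
var file = DriveApp.getFileById(fileId);                   // a file ID, not a URL
var csvString = file.getBlob().getDataAsString('UTF-16');  // decode as UTF-16 (assumption about the export)
var csvData = Utilities.parseCsv(csvString);
var sheet = SpreadsheetApp.openById(sheetId).getSheetByName('Data');
sheet.getRange(1, 1, csvData.length, csvData[0].length).setValues(csvData);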

How to search and replace from a SafeHtml variable in Angular?

I have a very simple question.
I have a sanitized string, and its type in Angular is SafeHtml.
What would be the best approach to search and replace some HTML inside this SafeHtml variable?
...
const sanitzedHtml: SafeHtml = this.sanitizer.bypassSecurityTrustHtml(changes.pureHtml.currentValue);
...
My goal is to replace some string with some extra HTML code, so the best would be to be able to search only within the HTML nodes, not everywhere in the markup.
Is there a faster way than converting the SafeHtml variable back into a string and applying a basic replace with a RegExp?
Thanks
Change the HTML code before sanitizing it.
1 - Using regex
You can modify your HTML string with a regex first, then sanitize it:
let html = "<div>myHtml</div>";
const regex = /myRegexPattern/i;
html = html.replace(regex, 'Replacement html part'); // replace() returns a new string
2 - Using DocumentFragment
You can also create a fragment from your HTML, modify what you want in it, and serialize it back to a string before running your sanitize function:
let str: string = "<div id='test'>myHtml</div>";
const sanitizedHtml: SafeHtml = this.sanitizer.bypassSecurityTrustHtml(changeMyHtml(str));
function changeMyHtml(htmlString: string): string {
  // build a DocumentFragment from the HTML string
  const fragment = document.createRange().createContextualFragment(htmlString);
  // do what you need to do here, for example:
  fragment.getElementById('test').innerHTML = "myHtmlTest";
  // then return a string of the modified HTML
  const serializer = new XMLSerializer();
  return serializer.serializeToString(fragment);
}
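If the goal really is to touch only the text inside the nodes and leave tags and attributes alone, one way (a sketch, not an Angular-specific API) is to walk the fragment's text nodes with a TreeWalker before re-sanitizing:
function replaceInTextNodes(htmlString, searchRegex, replacement) {
  const fragment = document.createRange().createContextualFragment(htmlString);
  const walker = document.createTreeWalker(fragment, NodeFilter.SHOW_TEXT);
  let node;
  while ((node = walker.nextNode())) {
    // note: textContent treats the replacement as plain text; to inject real
    // markup you would have to split the text node and insert elements instead
    node.textContent = node.textContent.replace(searchRegex, replacement);
  }
  return new XMLSerializer().serializeToString(fragment);
}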

How to use regex in selenium locators

I'm using Selenium RC and I would like, for example, to get all the link elements whose href attribute matches:
http://[^/]*\d+com
I would like to use:
sel.get_attribute( '//a[regx:match(@href, "http://[^/]*\d+.com")]/@name' )
which would return a list of the name attribute of all the links that match the regex.
(or something like it)
thanks
The answer above is probably the right way to find ALL of the links that match a regex, but I thought it'd also be helpful to answer the other part of the question, how to use regex in Xpath locators. You need to use the regex matches() function, like this:
xpath=//div[matches(@id,'che.*boxes')]
(this, of course, would click the div with 'id=checkboxes', or 'id=cheANYTHINGHEREboxes')
Be aware, though, that the matches function is not supported by all native browser implementations of Xpath (most conspicuously, using this in FF3 will throw an error: invalid xpath[2]).
If you have trouble with your particular browser (as I did with FF3), try using Selenium's allowNativeXpath("false") to switch over to the JavaScript Xpath interpreter. It'll be slower, but it does seem to work with more Xpath functions, including 'matches' and 'ends-with'. :)
You can use the Selenium command getAllLinks to get an array of the IDs of links on the page, which you could then loop through, checking the href using getAttribute, which takes the locator followed by an @ and the attribute name. For example, in Java this might be:
String[] allLinks = session().getAllLinks();
List<String> matchingLinks = new ArrayList<String>();
for (String linkId : allLinks) {
String linkHref = selenium.getAttribute("id=" + linkId + "@href");
if (linkHref.matches("http://[^/]*\\d+.com")) {
matchingLinks.add(linkId);
}
}
A possible solution is to use sel.get_eval() and write a JS script that returns a list of the links, something like the following answer:
selenium: Is it possible to use the regexp in selenium locators
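A hedged sketch of the kind of snippet you could pass to get_eval() (it runs inside Selenium RC's JavaScript evaluator, where this.browserbot is available, as in the Java example further down; the comma-joined return value is an arbitrary choice since get_eval returns a single string):
// Collect the hrefs of all links matching the pattern and return them as one string.
var doc = this.browserbot.getCurrentWindow().document;
var anchors = doc.getElementsByTagName('a');
var matching = [];
for (var i = 0; i < anchors.length; i++) {
  if (/http:\/\/[^\/]*\d+\.com/.test(anchors[i].href)) {
    matching.push(anchors[i].href);
  }
}
matching.join(',');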
Here are some alternate methods as well for Selenium RC. These aren't pure Selenium solutions; they combine your programming language's data structures with Selenium.
You can also get the HTML page source, then run a regular expression over the source to return a match set of links. Use regex grouping to separate out URLs, link text/IDs, etc., and you can then pass them back to Selenium to click on or navigate to.
Another method is to get the HTML page source or the innerHTML (via DOM locators) of a parent/root element, then convert the HTML to XML as a DOM object in your programming language. You can then traverse the DOM with the desired XPath (with regular expressions or not) and obtain a nodeset of only the links of interest. From there, parse out the link text/ID or URL and pass it back to Selenium to click on or navigate to.
Upon request, I'm providing examples below. It's mixed languages since the post didn't appear to be language-specific anyway; I'm just using what I had available to hack together examples. They aren't fully tested, but I've worked with bits of the code before in other projects, so these are proof-of-concept examples of how you'd implement the solutions I just mentioned.
//Example of element attribute processing by page source and regex (in PHP)
$pgSrc = $sel->getPageSource();
//simple hyperlink extraction via regex below, replace with better regex pattern as desired
preg_match_all("/<a.+href=\"(.+)\"/",$pgSrc,$matches,PREG_PATTERN_ORDER);
//$matches is a 2D array, $matches[0] is array of whole string matched, $matches[1] is array of what's in parenthesis
//you either get an array of all matched link URL values in parenthesis capture group or an empty array
$links = count($matches) >= 2 ? $matches[1] : array();
//now do as you wish, iterating over all link URLs
//NOTE: these are URLs only, not actual hyperlink elements
//Example of XML DOM parsing with Selenium RC (in Java)
String locator = "id=someElement";
String htmlSrcSubset = sel.getEval("this.browserbot.findElement(\""+locator+"\").innerHTML");
//using JSoup XML parser library for Java, see jsoup.org
Document doc = Jsoup.parse(htmlSrcSubset);
/* once you have this document object, can then manipulate & traverse
it as an XML/HTML node tree. I'm not going to go into details on this
as you'd need to know XML DOM traversal and XPath (not just for finding locators).
But this tutorial URL will give you some ideas:
http://jsoup.org/cookbook/extracting-data/dom-navigation
the example there seems to indicate first getting the element/node defined
by content tag within the "document" or source, then from there get all
hyperlink elements/nodes and then traverse that as a list/array, doing
whatever you want with an object oriented approach for each element in
the array. Each element is an XML node with properties. If you study it,
you'd find this approach gives you the power/access that WebDriver/Selenium 2
now gives you with WebElements but the example here is what you can do in
Selenium RC to get similar WebElement kind of capability
*/
Selenium's By.Id and By.CssSelector methods do not support regex, and By.XPath only does so where XPath 2.0 is enabled. If you want to use regex, you can do something like this:
void MyCallingMethod(IWebDriver driver)
{
//Search by ID:
string attrName = "id";
//Regex = 'a number that is 1-10 digits long'
string attrRegex= "[0-9]{1,10}";
SearchByAttribute(driver, attrName, attrRegex);
}
IEnumerable<IWebElement> SearchByAttribute(IWebDriver driver, string attrName, string attrRegex)
{
List<IWebElement> elements = new List<IWebElement>();
//Allows spaces around equal sign. Ex: id = 55
string searchString = attrName +"\\s*=\\s*\"" + attrRegex +"\"";
//Search page source
MatchCollection matches = Regex.Matches(driver.PageSource, searchString, RegexOptions.IgnoreCase);
//iterate over matches
foreach (Match match in matches)
{
//Get exact attribute value
Match innerMatch = Regex.Match(match.Value, attrRegex);
string cssSelector = "[" + attrName + "='" + innerMatch.Value + "']";
//Find element by exact attribute value
elements.Add(driver.FindElement(By.CssSelector(cssSelector)));
}
return elements;
}
Note: this code is untested. Also, you can optimize this method by figuring out a way to eliminate the second search.

How to use regular expression in WatiN

I'm working with the WatiN automation tool and I'm having a problem with a regular expression. I have a situation where I have to enter some text and click a button in a popup window. I'm using the AttachToIE method and the URL attribute ("http://192.168.25.10:215/admin/SelectUsers.aspx?Type=FeedbackID=ef5ad7ef5490-4656-9669-32464aeba7cd") of the popup to attach to it.
The problem is that each time the popup appears, the ID value in the URL changes, so I'm not able to access the popup. Can anyone please help me by giving me a regular expression for the changing ID value in the URL below?
("http://192.168.25.10:215/admin/SelectUsers.aspx?Type=FeedbackID=ef5ad7ef5490-4656-9669-32464aeba7cd")
Thank you
It appears that you have a URL with two query string parameters, Type and ID, and your pattern is:
"http://192.168.25.10:215/admin/SelectUsers.aspx?Type=Feedback&ID={some id}"
You can use the Find.ByUrl() attribute constraint method and pass it to AttachToIE() as shown below with the regex for matching that pattern.
string url = "http://192.168.25.10:215/admin/SelectUsers.aspx?Type=Feedback&ID=";
Regex regex = new Regex(url + "[a-z0-9-]+", RegexOptions.IgnoreCase);
IE ie = IE.AttachToIE(Find.ByUrl(regex));
// or, matching the URL exactly as it appears in the question:
string baseUrl = "http://192.168.25.10:215/admin/SelectUsers.aspx?Type=FeedbackID=";
Regex urlIE = new Regex(baseUrl + "[\\w\\d-]+", RegexOptions.IgnoreCase);
IE ie2 = IE.AttachToIE(Find.ByUrl(urlIE));
I'm not familiar with WatiN, but it looks like it runs on .NET, so perhaps this might help:
var desiredId = "000000000000-0000-0000-000000000000";
var url = "http://192.168.25.10:215/admin/SelectUsers.aspx?Type=FeedbackID=ef5ad7ef5490-4656-9669-32464aeba7cd&someMoreStuff";
var pattern = @"(?i)(?<=FeedBackId=)[-a-z0-9]+";
var result = Regex.Replace(url, pattern, desiredId);
Console.WriteLine(result);
//Output: http://192.168.25.10:215/admin/SelectUsers.aspx?Type=FeedbackID=000000000000-0000-0000-000000000000&someMoreStuff
The following pattern should have the same effect but is more defensive. It should only match content in the query string, it requires the ID to be 35 characters, and it won't match similar parameter names like "PreviousFeedBackId".
var pattern = @"(?i)(?<=\?.*\bFeedBackId=)[-a-z0-9]{35,35}\b";
If you just want to extract the id:
var id = Regex.Match(url, pattern).Value;
Console.WriteLine(id);
//output: ef5ad7ef5490-4656-9669-32464aeba7cd
WatiN has a feature where we can match on the URL while ignoring the query string. Below is the code, which is working fine for me:
string baseUrl = "http://192.168.25.10:215/admin/SelectUsers.aspx";
IE ie = IE.AttachToIE(Find.ByUrl(baseUrl,true));