I am attempting to use software supplied by Kimonolabs to get a list of articles and their links from a web site. The problem I am having is that a string I have scraped from the web site has a date along with some text that I am unable to separate from the date.
Kimono uses this syntax for a regex:
/^()(.*?)()$/
first bracket => to the left of the required content
second bracket => this is what should get extracted
third bracket => to the right of the required content
Specifically the website I am trying to scrape is:
http://www.yashinquesada.com/futbol-nacional
Here is an example of the line I am trying to parse (I only want the date):
<p class="nspInfo nspInfo1 tleft fnone">Enero 08, 2016 <a href="/futbol-nacional/28-la-primera" >La Primera</a></p>
My attempts to parse this line returned no results. I have tried reading through regex reference materials, but they are pretty complicated for me.
Any suggestions are appreciated!
The regular expressions Kimono expects need to have three groups (a group is a pair of parentheses). That means you always need to keep this structure:
/^()(.*?)()$/
This is Kimono's default, where the first group is empty, the second contains all the text (. matches any character, *? means "any number of times, but as few as possible"), and the third is empty again.
You can adapt that arrangement to cut off unwanted text at the beginning and at the end - the value that ends up in your data will always be whatever the middle group matches.
I suspect the values you currently get look like this:
Enero 07, 2016 La Primera
so what you actually want to do is cut off text at the end.
Let's make the second and third groups more specific. We know the date always contains the year, which is four digits (\d\d\d\d or \d{4}) - and actually the match should end there. That's fairly easy:
/^()(.*?\d{4})(.*)$/
So, in English:
first group stays empty, no cut-off at the beginning
second group matches any character, but stops after matching four digits
third group matches the remainder of the value; Kimono will throw away that substring
Play around with the expression over at regex101: https://regex101.com/r/rM3tX0/1
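As a rough way to check the expression outside Kimono, here is a minimal Python sketch. It assumes Python's re module treats this particular pattern the same way Kimono's JavaScript engine does; the helper name kimono_extract is made up for illustration and is not part of Kimono.

```python
import re

# Same three-group layout Kimono expects: (cut before)(keep)(cut after).
KIMONO_PATTERN = re.compile(r'^()(.*?\d{4})(.*)$')

def kimono_extract(value):
    """Return what the middle group matches, or None if there is no match."""
    m = KIMONO_PATTERN.match(value)
    return m.group(2) if m else None

scraped = "Enero 08, 2016 La Primera"
print(kimono_extract(scraped))  # Enero 08, 2016
```

If a value contains no four-digit run at all, the pattern does not match and the helper returns None.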
I am writing C# code that reads a web page and extracts content from it.
I spent a lot of time working out how to get at the content, and now I am stuck on this:
<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged
I only want to get "Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged".
I have been using "(?<=\">)(.*)" to extract some content successfully, but it does not fit every case.
So how can I use a regular expression to say that I want the text that comes after the last '>'?
Thank you.
If the substring that you want to match appears after the last > then the main thing you know about it is that it does not contain a >. This is matched with [^>]. If the string must contain at least one character then you'll want to use + as the quantifier; if it's allowed to be empty then you'll want to use * to allow for zero matches. Finally, you need to match the full remainder of the text, up to the end of the line, which you do with a $.
So the full expression is [^>]*$ (or [^>]+$ if it can't be zero length).
If you also want to require that the preceding text does contain a >, you can make it a bit more complicated by using a positive look-behind, (?<=>). This says to find a > but not include it in the match. The > character has no special meaning in a .NET regex, so it does not need to be escaped, which also avoids having to double a backslash inside a C# string literal. The final expression would then be (?<=>)[^>]*$, which you pass to the Regex constructor as new Regex("(?<=>)[^>]*$").
The simpler version, [^>]*$, is probably sufficient for your needs.
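Both variants can be tried quickly in Python, whose re module handles these two patterns the same way for this input. This is only a sketch; the function names are invented for the example.

```python
import re

text = ('<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/'
        'irma-storm-updates.html">Latest Updates: 90 Percent of Houses in '
        'Florida Keys Are Damaged')

def after_last_gt(s):
    """Everything after the last '>' (the whole string if there is no '>')."""
    return re.search(r'[^>]*$', s).group(0)

def after_last_gt_required(s):
    """Same, but returns None unless a '>' really precedes the match."""
    m = re.search(r'(?<=>)[^>]*$', s)
    return m.group(0) if m else None

print(after_last_gt(text))
print(after_last_gt_required("no bracket here"))  # None
```

The difference only shows up on input without any >: the simpler version returns the whole string, the look-behind version returns nothing.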
Finally, I would add that parsing XML or HTML with regular expressions is usually the wrong thing to do because there are lots of edge cases, and you will have to make assumptions about the formatting. For example, based on your example text, I assumed you are searching up to the end of the input text. It's usually better to parse XML with an XML parser, which won't have these problems.
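To illustrate the parser route, here is a small sketch using Python's standard html.parser instead of a strict XML parser, since the sample fragment has an unclosed a tag. The class and function names are made up for the example, and it assumes the text you want always sits inside an a element.

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collects the text found inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

def link_text(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return "".join(parser.chunks).strip()

html = ('<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/'
        'irma-storm-updates.html">Latest Updates: 90 Percent of Houses in '
        'Florida Keys Are Damaged')
print(link_text(html))
```

Unlike the regex, this does not depend on the link text being the last thing in the input.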
This is the regex you need; there is a working example on RegexStorm.net:
>([^<>]+)
This says: find a closing angle bracket followed by text that doesn't include angle brackets. The [^<>] matches any character (letters, numbers, whitespace and so on) that is NOT an opening or closing angle bracket. The + means one or more of those characters. The parentheses around [^<>]+ capture that text as a separate group.
Here is a C# example that uses it. The text you want is in Groups[1], the first capture group (Groups[0] is the whole match, including the >).
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string text = "<i class=\"icon\"></i><a href=\"https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html\">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged";
        // A '>' followed by one or more characters that are not angle brackets.
        Regex regex = new Regex(">([^<>]+)");
        foreach (Match m in regex.Matches(text))
        {
            // Groups[0] is the whole match; Groups[1] is the captured text.
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
RegexStorm.net is a good .NET test site. Regex101.com is a good site for learning and trying out other regex flavors.
I am trying to extract cities and countries from several articles. The regular expression that I am using:
(at [A-Z](?:\w+)?|in [A-Z](?:\w+)?|of [A-Z](?:\w+)?)
It allows me to extract this kind of location:
of Mogadishu
in Istanbul
of Beletwein
However, it doesn't allow me to extract the location when it is phrased as follows:
in downtown Tunis
in central Mogadishu
in a town near Mogadishu
What I would like to extract is any word starting with an uppercase letter that follows a preposition such as in, of, through or at within a range of 3 words.
A sample of the text corpus and the regular expression is available at https://regex101.com/r/0DRayP/6
\b(at|in|of) (?:\w+\s){0,3}([A-Z]\w+)
I believe that hits everything in your example text.
\b makes sure the preposition is by itself and not part of another word.
The first group hits a preposition; it is easy to modify if you want to add more.
The second group is non-capturing, and you can change the number of additional words allowed between the preposition and the location in the {0,3} quantifier.
The last group (capture group 2) gets your location.
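A quick way to see which group holds what is to run the pattern over the phrases from the question. The sketch below uses Python; the helper name find_location is just for illustration, and the longer test sentence is made up.

```python
import re

# Preposition, up to three filler words, then a capitalised word.
LOCATION = re.compile(r'\b(at|in|of) (?:\w+\s){0,3}([A-Z]\w+)')

def find_location(sentence):
    """Return the capitalised word captured by group 2, or None."""
    m = LOCATION.search(sentence)
    return m.group(2) if m else None

samples = [
    "of Mogadishu",
    "in downtown Tunis",
    "in central Mogadishu",
    "in a town near Mogadishu",
]
for s in samples:
    print(s, "->", find_location(s))
```

One thing to keep in mind: \w also matches capital letters, so when several capitalised words follow each other the greedy {0,3} part can swallow the first ones and the capture ends up on a later word.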
Have a try with:
\b(?:at|in|of)\b[^A-Z]+([A-Z]\w*)
Maybe something like this (you can change the number of words):
(at|in|of)( \w+){0,2} [A-Z](?:\w+)?
I'm working in Google Analytics and trying to use the RegEx advanced filter option to display page names that contain two /, but not three /. The text string within the first section will always be products; however, after the second / it is random.
For example,
I want to include these page name strings:
/products/skis
/products/snowboards
/products/skates
I want to exclude these page name strings:
/products/skis/mens
/products/snowboards/womens
/products/skates/red
Again, the products part is consistent...but the second text section is random.
Appreciate any help -- thanks!
One possibility would be this:
^\/products\/[a-zA-Z]+$
This would match the first slash, followed by 'products', followed by a second slash, and then any string of letters (no digits, hyphens or other special characters). Nothing else may come after it.
To match page names that start with /products/ and do not contain a third slash, you can use this regex:
^\/products\/[^\/]+$
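The two patterns can be compared on the page names from the question with a short Python sketch. This assumes Google Analytics treats these simple patterns the same way Python's re module does; the /products/ski-boots path is a made-up example to show where they differ.

```python
import re

# Anything except a slash after /products/.
TWO_LEVELS = re.compile(r'^/products/[^/]+$')
# Letters only after /products/.
LETTERS_ONLY = re.compile(r'^/products/[a-zA-Z]+$')

pages = [
    "/products/skis",
    "/products/snowboards",
    "/products/skates",
    "/products/skis/mens",
    "/products/snowboards/womens",
    "/products/skates/red",
]

included = [p for p in pages if TWO_LEVELS.search(p)]
print(included)

# Hypothetical page name containing a hyphen: only the [^/]+ version accepts it.
print(bool(TWO_LEVELS.search("/products/ski-boots")),
      bool(LETTERS_ONLY.search("/products/ski-boots")))
```

Since the second section is described as random, the [^/]+ version is probably the safer choice.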
I have this dataset (repeated about 10k times):
<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
<Title><![CDATA[Superior Singing Method - Online Singing Course]]></Title>
<Description><![CDATA[High Quality Vocal Improvement Product With High Conversions. Online Singing Lessons Course Converts Like Crazy Using Content Packed Sales Video. You Make 75% On Every Sale Including Front End, Recurring, And 1-click Upsells!]]></Description>
<HasRecurringProducts>true</HasRecurringProducts>
<Gravity>45.9395</Gravity>
<PercentPerSale>74.0</PercentPerSale>
<PercentPerRebill>20.0</PercentPerRebill>
<AverageEarningsPerSale>74.9006</AverageEarningsPerSale>
<InitialEarningsPerSale>70.1943</InitialEarningsPerSale>
<TotalRebillAmt>16.1971</TotalRebillAmt>
<Referred>75.0</Referred>
<Commission>75</Commission>
<ActivateDate>2011-06-23</ActivateDate>
</Site>
I am trying to do the following:
Get the data from within the <Id> tags and use it to create a URL, so in this example it should produce
http://www.reviews.how2sing.domain.com
Also, all the other data has to go; I want to run a regex that just gives me a list of URLs.
I would prefer to do it in Notepad++, but I am not good at regex. Any help would be welcome.
To keep the regex relatively simple you can just use:
.*?<id>(.+?)</id>
Replace with:
http://www.reviews.\1.domain.com\n
That will search for and replace every Id tag together with the text preceding it. You can then remove whatever remains after the last </Id> manually.
Make sure ". matches newline" is selected, and leave "Match case" unchecked, because the pattern uses <id> while the data uses <Id>.
The regex is straightforward; the only slightly tricky part is that it uses +? and *?, which are non-greedy. This prevents the whole file from being matched at once. The () indicate a capture group that is used in the replacement, i.e. \1.
If you want a regex that also replaces the last part, use:
.*?(?:(<id>)?(.+?)</id>).+?(?:<id>|\Z)
This is a bit more tricky. It uses:
(?: ) a non-capturing group
| OR
\Z end of file
Basically, the first match consumes everything up to the end of the first </id>, plus everything up to and including the next <id>. Because that next <id> has already been consumed, in later matches everything before </id> goes straight into the group. The last match ends at the end of the file, \Z. Note that the Id text is now in the second capture group, so the replacement should use \2 instead of \1.
If you only want the Id values, you can do:
'<Id>([^<]*)<\/Id>'
Then you can get the first captured group \1 which is the Id text value and then create a link from it.
Here is a demo:
http://regex101.com/r/jE9qN8
[UPDATE]
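To get rid of all other lines, match the regex '.*?<Id>([^<]*)<\/Id>' and replace it with the first captured group \1 followed by a line break. Since the match spans multiple lines, you need the DOTALL or /s flag so that . also matches newlines. The leading .*? has to be lazy: with DOTALL, a greedy .* on both sides of the tag would swallow the whole file in a single match and keep only the last Id. Any text left after the last </Id> can be removed by hand.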
Hope that helps.
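If a script is an option instead of a search and replace in the editor, the same capture group can be used to build the list directly. This is only a sketch in Python: the second record (EXAMPLE2) is invented sample data, and lowercasing the Id is an assumption based on the URL shown in the question.

```python
import re

# Shortened sample; the second record is made up for illustration.
data = """<Site>
<Id>HOW2SING</Id>
<PopularityRank>1</PopularityRank>
</Site>
<Site>
<Id>EXAMPLE2</Id>
<PopularityRank>2</PopularityRank>
</Site>"""

def review_urls(xml_text):
    """Build one URL per <Id> value, lowercased as in the question's example."""
    ids = re.findall(r'<Id>([^<]*)</Id>', xml_text)
    return [f"http://www.reviews.{i.lower()}.domain.com" for i in ids]

for url in review_urls(data):
    print(url)
```

Because findall only returns the captured Id values, there is nothing left over to clean up afterwards.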
I'm trying to create a regex for a search that will look at the following code and return only the ids and not the classes:
1 id="contact"
2 class="contact"
3 #contact
4 .contact
I want to return contact from the 1st and 3rd lines and NOT 2nd and 4th lines.
This is for a search across multiple files to avoid going through each one individually and checking whether it needs changing or not.
Is this possible?
Here you go:
/(?:#|id=")(\w+)"?/g
This matches strings beginning with either # or id=" followed by word characters. You'll probably want to extend it to handle dashes; \w already covers underscores.
In this case, the first group is non-capturing, and the ID text will be your first capture group $1.
UPDATE
this one:
(?:(?<=id=")|(?<=#))(contact)
uses positive lookbehinds to find your prefixes and matches just the string "contact". This will not work in JavaScript engines that lack lookbehind support (it was only added in ES2018), but it will work in a text editor or a CLI tool like ack.
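Both patterns can be checked against the four lines from the question with a few lines of Python, whose re module supports these fixed-width lookbehinds. This is a sketch, and the variable names are made up.

```python
import re

# Capture-group version: the name ends up in group 1.
ID_PATTERN = re.compile(r'(?:#|id=")(\w+)"?')
# Lookbehind version: the whole match is just the name "contact".
LOOKBEHIND = re.compile(r'(?:(?<=id=")|(?<=#))(contact)')

lines = ['id="contact"', 'class="contact"', '#contact', '.contact']

# Map 1-based line numbers to the id names found on that line.
hits = {}
for number, line in enumerate(lines, start=1):
    names = ID_PATTERN.findall(line)
    if names:
        hits[number] = names

print(hits)  # {1: ['contact'], 3: ['contact']}
print([n for n, line in enumerate(lines, start=1) if LOOKBEHIND.search(line)])  # [1, 3]
```

Only the 1st and 3rd lines are reported, which is what the question asks for.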