Grok Pattern for Domain name in a list - regex

I tried to write an own grok pattern for domain name. But couldn't create it. Someone help me on this.
data format :
["test1.example.com", "test2.example.com", "test3.example.com", "new1.example.com", "new2.example.com", "new3.example.com", "dev.example.com"]
Sometimes, the above data count may vary means value is dynamic. Like as
["test1.example.com", "test2.example.com"]
Is there any possibility to create pattern for this?

Related

Possible combination (variations) of words in a string variable in stata

I have a string variable containing school names and I need to find all the possible combination of each word in this string variable in stata:
For example variation of a word "Academy" would be:
Academy,
Academy,
acdamey,
aacdemy,
dmcaamy,
aacedmy,
and so on.
I need this to standardize the raw data of school names, which has many typos of each word due to data entry issues, like the ones given above for "academy".
Depending whether your data is already in the Excel sheets or a file, you can either use regex trying to match all possible combinations (and probably fix them when found) or parse the strings first before bringing them into Excel. In either case you could make a file (or Excel list/table/area/etc.) that includes all the common typos and pick each typo as regex match to use when comparing to your actual input.
Making regexp that would actually find all possible cases is next to impossible, especially if there are cases where very similar (but correct) names for schools exist. In any case direct regexps would be very messy and complex, so I would advice you to parse the data by finding first the correct form, excluding it and then using (greedy) search/regex to find the typoed versions. You can then save the typos to use them as a filter/match/pattern.
To get some sort of starting ideas, check this links:
Regex: Search for verb roots
Read text file and extract string into Excel sheet using regex
P.s You should keep the count of all strings/school names and finally get a list of all names that did not match correct form or any of your regexp filters, so you can manually insert/correct them.

Parse a string for open and close tags

Let's say I have the following strings:
"This [color=RGB]is[\color] a string."
"This [color=RGB][bold]is[\bold][\color] another string."
What I'm looking for is a good way to parse the string in order to extract the tag information and then reconstruct the original string without tags.
The tag informations will be used during text rendering.
Obviously I can achieve the goal by working directly with strings (find/substr/replace and so on), but I'm asking if there is another way cleaner, for example using regular expression.
Note:
There are very few tags I need, but there is the possibility to nest them (only of different type).
Can't use Boost.
There's a very simple answer that might work, depending on the complexity of your strings. (And me understanding you correctly, i.e. you just want to get the cleaned up strings, not actually extract the tags.) Just remove all tags. Replace
\[.*?]
with nothing. Example here
Now, if your string should be able to contain tag-like objects this might not work.
Regards

Yahoo Pipes Using Regex to change link

Hi I am pretty new to regex I can do some basic functions but having trouble with this. I need to change the link in the rss feed.
I have a url like this:
http://mysite.test/Search/PropDetail.aspx?id=38464&id=38464&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult
and want to change it to updated site:
http://mysite.test/PropertyDetail/?id=38464&id=38464&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult
Where only thing changed is from /Search/PropDetail.aspx
to /PropertyDetail/
I don't have access to the orginal rss feed or I would change it there so I have to use pipes. Please help, Thanks!
Use the regex control.
In it, specify the DOM address of the node containing your link (prefixed by "item.") within the "In" field. For the "replace" field type
(.*)//Search//PropDetail/.aspx
and in the "with" field type use:
$1//PropertyDetail//.*
I've 'escaped' the '/' character in the with field. However, I'm not sure you need to do this except before the '.*' Some trial and error may be needed.
Hopefully this will achieve the result you want.

Find Regex Pattern From Inputs

I have a user access log like this:
pagename url
broker_pv /broker/934832
broker_pv /broker/983432
broker_pv /broker/n/342349
listing_pv /listing/a1-b2/
listing_pv /listing/c3/
I want to find out whether a future url "/broker/245729" belongs to "broker_pv" or "listing_pv", or doesn't match at all.
It's like a machine learning process: I feed the computer some raw data, it learns, and then help me filtering things.
One way to do it, I can think of, is a "pattern finder" process. i.e., from the raw input, we human can deduce that "broker_pv" urls will match a pattern "/broker/(n/)?[0-9]+". So when a url like "/broker/245729" comes, I can test all the patterns against it, and judge which "pagename" it belongs to.
Then the question is, how to find out these patterns and thus build up a "pagename-pattern pair collection" for further use.
Or there's a better way, hopefully?

Search and Replace in Solr?

Im looking for something like a search and replace functionality in Solr.
I have dumped a document into solr, and doing some text analysis over it. At times i may need to group couple of words together and want solr to treat it as one single token.
For ex: "South Africa" will be treated as one single token for further processing. And also notice that these can be dynamic and im going to let the end user to decide which words he/she has to group. So NO Semantics required.
My current plan is to add a special character between these two words so Solr will treat it as one single token (StandardTokenizerFactory) for further processing.
So im looking for something like:
replace("South Africa",South_Africa")
Can anyone has any solution?
Use a Synonym filter and define these replacements in a synonyms.txt file. Once you have all of your definitions, rebuild the index.
You would probably have an entry like this to handle both the case where a field has a LowerCase filter before Synonym and where Synonym comes before LowerCase.
South Africa,south africa => southafrica
More info here http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
You could perhaps use a PatternReplaceFilter and a clever regexp.