How to write a generic regex to extract the data in ExtractText? - regex

My present data looks like the following; it contains 100 rows:
1,Ads,,12,CDMA,,12
2,,12,14,CDMA,,12
..
...
100,DVS,13,,CDMA,12,22
I am using GetFile-->SplitText-->ExtractText to split each row into fields, and I have given the ExtractText processor 10 regex attributes so that every row in my present data is matched.
For example, one of my input regexes is (.+),(.+),,(.+),(.+),(.+) and it fills the attributes regex.1, regex.2, up to regex.5.
In the future another 100 rows will be added to the present data, so I would have to write regex attributes for those rows as well.
I also need expression language support for all columns of the extracted data in the processor.
Is it possible to give one common regex for all the data in the ExtractText processor?
Is there any other way to extract the data by a delimiter such as a comma or pipe symbol in NiFi?
Any help is appreciated.

I just found a common regex to extract my data from the CSV file:
([^,]*?),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)
It might be expensive to evaluate, but it should work better than this one: (.+),(.+),,(.+),(.+),(.+)
It may be helpful for someone.
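To see why the generic pattern copes with empty columns, here is a minimal Python sketch (not NiFi configuration) that applies it to the three sample rows from the question. The helper name split_row is made up for illustration. It also shows that, for plain comma-delimited rows like these, splitting on the delimiter gives the same fields.

```python
import re

# Generic pattern from the answer above: seven comma-separated fields,
# each of which may be empty because [^,]* also matches zero characters.
GENERIC = re.compile(r"([^,]*?),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)")


def split_row(row):
    """Return the seven captured fields, or None if the row has a different shape."""
    m = GENERIC.fullmatch(row)
    return list(m.groups()) if m else None


# Sample rows copied from the question.
rows = ["1,Ads,,12,CDMA,,12", "2,,12,14,CDMA,,12", "100,DVS,13,,CDMA,12,22"]
parsed = [split_row(r) for r in rows]

# Splitting on the delimiter gives the same fields for these rows.
by_delimiter = [r.split(",") for r in rows]

for fields in parsed:
    print(fields)
```

The earlier pattern uses .+, which needs at least one character per group and can also swallow commas, so it has to be tailored to where the empty columns fall in each row.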

Related

Google sheet Regex

I'm trying to fetch the meaning of an entered text from Urban Dictionary. The problem is that Urban Dictionary shows several definitions posted by different users. I've used 'importxml' to fetch the first page that shows up when someone searches for a particular word.
Now I want this data to be split into different columns so that I can get each definition in a separate column.
If we look at the fetched data, at the end of every definition there is "by username month dd,yyyy" string.
How can I use this string to split that raw data into definitions in separate columns?
I tried regex but could not figure it out, because this is the first time I'm using it.
Replace the string with a unique symbol and then split by it.
To capture the string, use the pattern:
"by username .+ \d+,\d{4}"
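The replace-then-split idea can be sketched in Python as below. Everything concrete here is an assumption: the sample text is invented (the real importxml output is not shown in the question), the trailer is assumed to look like "by <username> <Month> <dd>,<yyyy>" with a username that contains no spaces, and the separator symbol is assumed never to occur in a definition.

```python
import re

# Hypothetical raw text standing in for the fetched page content.
raw = (
    "Short for no. by alice March 03,2015 "
    "A casual refusal. by bob_99 July 21,2018 "
)

# Assumed trailer shape: "by <username> <Month> <dd>,<yyyy>".
TRAILER = re.compile(r"by \S+ [A-Za-z]+ \d{1,2},\s?\d{4}")

# A symbol assumed not to appear anywhere in the definitions.
SEP = "\u2666"


def split_definitions(text):
    """Replace each trailer with SEP, then split on SEP and drop empty pieces."""
    marked = TRAILER.sub(SEP, text)
    return [part.strip() for part in marked.split(SEP) if part.strip()]


definitions = split_definitions(raw)
print(definitions)
```

In a sheet the same two steps would be a REGEXREPLACE followed by a SPLIT on the chosen symbol.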
As you can read here, regex is not the correct tool for parsing HTML.
In your situation I would use Google Apps Script in combination with a DOM parsing library such as Cheerio.
Example:
const content = getContent_('https://www.urbandictionary.com/define.php?term=nah');
const $ = Cheerio.load(content);
Logger.log($('.contributor').text());

How can I extract specific patterns from a string?

I currently have a dataset filled with the following pattern:
My goal is to get each value into a different cell.
I have tried the following formula, but it has not yielded the results I am looking for.
=SPLIT(D8,"[Stock]",FALSE,FALSE)
I would appreciate any guidance on how I can get to the ideal output, using Google Sheets.
Thank you in advance!
I will assume here from your post that your original data runs D8:D.
If you want to retain [Stock] in each entry, try the following in the Row-8 cell of a column that is otherwise empty from Row 8 downward:
=ArrayFormula(IF(D8:D="",,TRIM(SPLIT(REGEXREPLACE(D8:D&"~","(\[Stock\]).","$1~"),"~",1,1))))
If you don't want to retain [Stock] in each entry, use this version:
=ArrayFormula(IF(D8:D="",,TRIM(SPLIT(REGEXREPLACE(D8:D&"~","\[Stock\].","~"),"~",1,1))))
These formulas do not rely on any punctuation as markers. They also ensure that you don't wind up with blank (and therefore unusable) cells interspersed from trailing SPLITs.
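For readers who want to trace what the two REGEXREPLACE/SPLIT formulas do, here is a rough Python equivalent. The sample cell value is hypothetical, since the original data pattern is not shown in the question, and the function names are made up for illustration.

```python
import re

# Hypothetical cell content; the question's real data is not shown.
cell = "Apple [Stock], Banana [Stock], Cherry [Stock]"


def split_keep_marker(value):
    """Mirror of the first formula: keep [Stock] in every entry."""
    # Append "~" so the last [Stock] is also followed by one character,
    # then turn "[Stock]<any char>" into "[Stock]~" and split on "~".
    marked = re.sub(r"(\[Stock\]).", r"\1~", value + "~")
    return [p.strip() for p in marked.split("~") if p.strip()]


def split_drop_marker(value):
    """Mirror of the second formula: drop [Stock] from every entry."""
    marked = re.sub(r"\[Stock\].", "~", value + "~")
    return [p.strip() for p in marked.split("~") if p.strip()]


kept = split_keep_marker(cell)
dropped = split_drop_marker(cell)
print(kept)
print(dropped)
```

Filtering out empty pieces plays the role of SPLIT's remove-empty-text argument, and strip() plays the role of TRIM.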
If , is only used as the separator:
=ARRAYFORMULA(SPLIT(D8:D,", ",FALSE))
If , is also used inside each string ([Stock] will be removed):
=ARRAYFORMULA(SPLIT(D8:D," [Stock], ",FALSE))
If , is also used inside each string ([Stock] will be kept):
=ArrayFormula(SPLIT(REGEXREPLACE(M9:M11,"(\[Stock\]), ","$1♦"),"♦"))
use:
=INDEX(TRIM(IFNA(SPLIT(D8:D; ","))))

Possible combination (variations) of words in a string variable in stata

I have a string variable containing school names, and I need to find all the possible combinations of each word in this string variable in Stata:
For example variation of a word "Academy" would be:
Academy,
Academy,
acdamey,
aacdemy,
dmcaamy,
aacedmy,
and so on.
I need this to standardize the raw data of school names, which has many typos of each word due to data entry issues, like the ones given above for "academy".
Depending on whether your data is already in Excel sheets or in a file, you can either use regex to try to match all possible combinations (and probably fix them when found), or parse the strings first before bringing them into Excel. In either case you could make a file (or an Excel list/table/area/etc.) that includes all the common typos and use each typo as a regex match when comparing against your actual input.
Writing a regexp that would actually find all possible cases is next to impossible, especially if there are very similar (but correct) school names. In any case, direct regexps would be very messy and complex, so I would advise you to parse the data by first finding the correct form, excluding it, and then using a (greedy) search/regex to find the misspelled versions. You can then save the typos and use them as a filter/match/pattern.
To get some starting ideas, check these links:
Regex: Search for verb roots
Read text file and extract string into Excel sheet using regex
P.S. You should keep a count of all strings/school names and finally get a list of all names that did not match the correct form or any of your regexp filters, so you can insert/correct them manually.
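The typo-list idea can be sketched in Python as follows (not Stata, just an illustration of the approach). The typo table reuses the misspellings from the question; the school names and the list of known-good words are hypothetical.

```python
import re
from collections import Counter

# Typo table: correct form -> known misspellings (taken from the question).
KNOWN_TYPOS = {"Academy": ["acdamey", "aacdemy", "dmcaamy", "aacedmy"]}

# One case-insensitive, whole-word pattern per correct form.
PATTERNS = {
    correct: re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in typos) + r")\b",
        re.IGNORECASE,
    )
    for correct, typos in KNOWN_TYPOS.items()
}


def standardize(name):
    """Replace every known typo in a school name with its correct form."""
    for correct, pattern in PATTERNS.items():
        name = pattern.sub(correct, name)
    return name


def unmatched_words(names, known_words):
    """Count words that are still unknown after standardizing, for manual review."""
    counts = Counter()
    for name in names:
        for word in standardize(name).split():
            if word not in known_words:
                counts[word] += 1
    return counts


# Hypothetical input data and vocabulary.
names = ["Springfield acdamey", "Springfield Academy", "Riverdale Acadmy"]
known = {"Springfield", "Riverdale", "Academy"}
leftovers = unmatched_words(names, known)
print(standardize(names[0]))
print(leftovers)
```

Anything that ends up in leftovers is a candidate for adding to the typo table or correcting by hand.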

Finding matches after a specific line in Perl/Notepad++

My problem is that I have a document that is split into sections. Each section is marked by a single-line header ([Header1], [Header2], etc.) and contains various types of data sets on individual lines, where each line begins with a label indicating what type of data follows, like this:
[Header1]
data_label_type1 = 1,2,3
data_label_type2 = 1,2,3,4
data_label_type1 = 1,2,3,4,5
data_label_type3 = 1,2
Note the headers/sections are out of order, so Header1 doesn't always start a document and Header2 won't always follow.
A bit off topic, but the data sets are results from an experiment I'm maintaining for a thesis.
I want to be able to capture type 1 data found only in the first section (under Header1) using a single regex function. After capturing it I was going to use replace and another function to convert the captured data to a different form.
Initially I was using the regex type1\h*=\h*([[:graph:]]*) but this only goes line by line, and I've got hundreds of documents, with potentially tens of thousands of individual lines to catch.
I can use regex to convert my data well enough, but my problem is that I have no idea how to capture type 1 data from Header1 exclusively. Any help, tips or pointers to start some experimenting would be really appreciated!
A single regex is apparently not capable of providing a solution, so I will use an alternative such as a parser instead.
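As a rough idea of what such a parser could look like, here is a minimal Python sketch that walks the lines, remembers the current section header, and collects type 1 values only while inside Header1. The sample document is modelled on the layout in the question, with an extra [Header2] section added as an assumption; the function name is made up.

```python
import re

HEADER = re.compile(r"^\[(.+)\]\s*$")
TYPE1 = re.compile(r"^data_label_type1\s*=\s*(\S+)")


def type1_under(lines, wanted="Header1"):
    """Collect type 1 values that appear inside the wanted section only."""
    current = None
    found = []
    for line in lines:
        line = line.strip()
        header = HEADER.match(line)
        if header:
            # A new section starts; remember its name and move on.
            current = header.group(1)
            continue
        if current == wanted:
            m = TYPE1.match(line)
            if m:
                found.append(m.group(1))
    return found


# Sample document; the [Header2] section is an assumption to show out-of-order headers.
doc = """[Header2]
data_label_type1 = 9,9
[Header1]
data_label_type1 = 1,2,3
data_label_type2 = 1,2,3,4
data_label_type1 = 1,2,3,4,5
data_label_type3 = 1,2
"""

values = type1_under(doc.splitlines())
print(values)
```

Because the section state is tracked outside the regex, the order of the headers in the file does not matter.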

Function regex_extract in hive

I'm extracting information from logs in Hive with these expressions:
regexp_extract(values, "^(\\w{3} \\s?\\d+ \\d\\d:\\d\\d:\\d\\d \\w+-\\w+ \\w+:) (\\[)(\\d{2})(\\/)(\\w{3})(\\/)(\\d{4})(.*\\])",3)day,
regexp_extract(values, "^(\\w{3} \\s?\\d+ \\d\\d:\\d\\d:\\d\\d \\w+-\\w+ \\w+:) (\\[)(\\d{2})(\\/)(\\w{3})(\\/)(\\d{4})(.*\\])",5)month
I use the same regular expression to extract two fields in two different regexp_extract calls. Is it possible to extract more than one field while executing regexp_extract only once?
Maybe not exactly what you are looking for, but if you really want one extraction that gives you multiple fields instead of one, this is what I found:
http://dev.bizo.com/2012/01/using-genericudfs-to-return-multiple.html
Note that for this solution you need to write a UDF with object inspectors, but see for yourself.
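Outside Hive, the underlying idea (match once, read several groups) looks like this in Python. This is only an illustration of what a multi-field UDF would do internally, not Hive code; the log line is hypothetical and was shaped to fit the pattern from the question, since no real log lines are shown.

```python
import re

# Pattern from the question; Hive's doubled backslashes become single ones
# in a Python raw string.
LOG_PATTERN = re.compile(
    r"^(\w{3} \s?\d+ \d\d:\d\d:\d\d \w+-\w+ \w+:) "
    r"(\[)(\d{2})(\/)(\w{3})(\/)(\d{4})(.*\])"
)

# Hypothetical log line.
line = "Jan  5 10:15:30 web-01 httpd: [05/Jan/2012:10:15:30 +0000] GET /index.html"

# One match gives access to every capture group at once.
match = LOG_PATTERN.match(line)
day, month, year = match.group(3, 5, 7)
print(day, month, year)
```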