Hive regex for extracting keywords from url - regex

The file names are as follows:
file:///storage/emulated/0/SHAREit/videos/Dangerous_Hero_(2017)____Latest_South_Indian_Full_Hindi_Dubbed_Movie___2017_.mp4
file:///storage/emulated/0/VidMate/download/%E0%A0_-_Promo_Songs_-_Khiladi_-_Khesari_Lal_-_Bho.mp4
file:///storage/emulated/0/WhatsApp/Media/WhatsApp%20Video/VID-20171222-WA0015.mp4
file:///storage/emulated/0/bluetooth/%5DChitaChola%7B%7D%D8%B9%D8%A7%D9%85%D8%B1%24%20.3gp
I want to write a Hive regex that extracts the words from each string.
For example, for the first string the output should be: storage,emulated,....
UPDATE
The code below gives me the result I want, but I would like a regex instead of this code.
package uri_keyword_extractor;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import java.util.ArrayList;
public class UDFUrlKeywordExtractor extends UDF {
private Text result = new Text();
public Text evaluate(Text url) {
if (url == null) {
return null;
}
String keywords = url_keyword_maker(url.toString());
result.set(keywords);
return result;
}
private static String url_keyword_maker(String url) {
// TODO Auto-generated method stub
ArrayList<String> keywordAr = new ArrayList<String>();
char[] charAr = url.toCharArray();
for (int i = 0; i < charAr.length; i++) {
int current_index = i;
// check if character is a-z or A-Z
char ch = charAr[i];
StringBuilder sb = new StringBuilder();
while (current_index < charAr.length-1 && isChar(ch)) {
sb.append(ch);
current_index = current_index+1;
ch = charAr[current_index];
}
String word = sb.toString();
if (word.length() >= 2) {
keywordAr.add(word);
}
i = current_index;
}
//
StringBuilder sb = new StringBuilder();
for(int i =0; i < keywordAr.size();i++) {
String current = keywordAr.get(i);
sb.append(current);
if(i < keywordAr.size() -1) {
sb.append(",");
}
}
return sb.toString();
}
private static boolean isChar(char ch) {
// TODO Auto-generated method stub
int ascii_value = (int) ch;
// A-Z => (65,90) a-z => (97,122)
// condition 1 : A-Z , condition 2 : a-z character check
if ( (ascii_value >= 65 && ascii_value <= 90) || (ascii_value >= 97 && ascii_value <= 122) ) {
return true;
} else {
return false;
}
}
public static void main(String[] args) {
// TODO Auto-generated method stub
String test1 = "file:///storage/emulated/0/SHAREit/videos/Dangerous_Hero_(2017)____Latest_South_Indian_Full_Hindi_Dubbed_Movie___2017_.mp4";
String test2 = "file:///storage/emulated/0/VidMate/download/%E0%A0_-_Promo_Songs_-_Khiladi_-_Khesari_Lal_-_Bho.mp4";
String test3 = "file:///storage/emulated/0/bluetooth/%5DChitaChola%7B%7D%D8%B9%D8%A7%D9%85%D8%B1%24%20.3gp";
System.out.println(url_keyword_maker(test1).toString());
System.out.println(url_keyword_maker(test2).toString());
System.out.println(url_keyword_maker(test3).toString());
}
}

Use the split(str, regex_pattern) function: it splits str using the regex as a delimiter pattern and returns an array. Then use lateral view + explode to explode the array, and filter the keywords by length as in your Java code. Finally, apply collect_set to re-assemble the array of keywords, plus concat_ws(delimiter, array) to convert the array to a delimited string if necessary. Note that collect_set also removes duplicate keywords.
The regex I passed to the split function is '[^a-zA-Z]'.
Demo:
select url_nbr, concat_ws(',',collect_set(key_word)) keywords from
(--your URLs example, url_nbr here is just for reference
select 'file:///storage/emulated/0/SHAREit/videos/Dangerous_Hero_(2017)____Latest_South_Indian_Full_Hindi_Dubbed_Movie___2017_.mp4' as url, 1 as url_nbr union all
select 'file:///storage/emulated/0/VidMate/download/%E0%A0_-_Promo_Songs_-_Khiladi_-_Khesari_Lal_-_Bho.mp4' as url, 2 as url_nbr union all
select 'file:///storage/emulated/0/WhatsApp/Media/WhatsApp%20Video/VID-20171222-WA0015.mp4' as url, 3 as url_nbr union all
select 'file:///storage/emulated/0/bluetooth/%5DChitaChola%7B%7D%D8%B9%D8%A7%D9%85%D8%B1%24%20.3gp' as url, 4 as url_nbr)s
lateral view explode(split(url, '[^a-zA-Z]')) v as key_word
where length(key_word)>=2 --filter here
group by url_nbr
;
Output:
OK
1 file,storage,emulated,SHAREit,videos,Dangerous,Hero,Latest,South,Indian,Full,Hindi,Dubbed,Movie,mp
2 file,storage,emulated,VidMate,download,Promo,Songs,Khiladi,Khesari,Lal,Bho,mp
3 file,storage,emulated,WhatsApp,Media,Video,VID,WA,mp
4 file,storage,emulated,bluetooth,DChitaChola,gp
Time taken: 37.767 seconds, Fetched: 4 row(s)
Maybe I have missed something from your Java code, but I hope you get the idea, so you can easily modify my query and add further processing if necessary.
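To sanity-check the expected keyword lists outside Hive, the same split, filter, and join idea can be sketched in Python. This is only an illustration: the function name extract_keywords and the min_len parameter are made up here, and unlike collect_set (which makes no promise about ordering) the sketch keeps keywords in order of first appearance.

```python
import re


def extract_keywords(url: str, min_len: int = 2) -> str:
    """Split on every non-letter, keep distinct words of at least min_len letters."""
    parts = re.split(r"[^a-zA-Z]", url)  # same delimiter pattern as the Hive split()
    seen = set()
    keywords = []
    for part in parts:
        # Drop empty strings and single letters, and skip duplicates
        # (collect_set in the Hive query deduplicates as well).
        if len(part) >= min_len and part not in seen:
            seen.add(part)
            keywords.append(part)
    return ",".join(keywords)


url = "file:///storage/emulated/0/WhatsApp/Media/WhatsApp%20Video/VID-20171222-WA0015.mp4"
print(extract_keywords(url))
```

For the third sample URL this produces the same keyword list as row 3 of the Hive output.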


How do you do this list operations in an efficient way in Dart (Bloc/Flutter)?

I have a BLoC with the following state, which contains two words:
// The 2 words are fixed and have the same length.
// in word_state.dart
abstract class WordState extends Equatable {
const WordState(this.wordA, this.wordB);
final List<String> wordA; // wordA = ['B','A','L','L']
final List<String> wordB; // wordB = ['','','','']
@override
List<Object> get props => [wordA, wordB];
}
I want to ADD and REMOVE letters.
// in word_event.dart
class AddLetter extends WordEvent {
final int index;
const AddLetter(this.index);
}
class RemoveLetter extends WordEvent {
final int index;
const RemoveLetter(this.index);
}
1.ADD:
If I select the index of 'L' in wordA, then I add the letter 'L' in the first occurrence of '' (empty) in wordB.
// in word_bloc.dart
void _onLetterAdded(AddLetter event, Emitter<WordState> emit) {
final b = [...state.wordB];
b[b.indexOf('')] = state.wordA[event.index];
emit(WordLoaded(state.wordA, b));
}
//wordB, ['','','',''] into ['L','','','']
2.REMOVE:
If I deselect the index of 'L' in wordA, then I remove the last occurrence of the letter 'L' in wordB and shift the letters to its right one position to the left.
void _onLetterRemoved(RemoveLetter event, Emitter<WordState> emit) {
final b = [...state.wordB];
final index = b.lastIndexOf(state.wordA[event.index]);
for (int i = index; i < 4 - 1; i++) {
b[i] = b[i + 1];
}
b[3] = '';
emit(WordLoaded(state.wordA, b));
}
}
// What i am trying to
// ['B','L','A','L']
// if index is 1 then ['B','A','L','']
This code works fine, but I want to do the list operations in a more efficient way.
Can you please check this code, understand it, and run it on DartPad?
List<String> wordA=['B','A','L','L'];
List<String> wordB=['','','',''];
List<String> wordBAll=['B','L','A','L'];
void main() {
wordB.insert(0,wordA[2]);
wordBAll.removeAt(wordBAll.length-1);
print(wordB);
print(wordBAll);
}
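The two operations described in the question (fill the first empty slot; delete the last occurrence and pad on the right) can also be written without an explicit shifting loop. Below is a minimal sketch in Python rather than Dart, with hypothetical helper names add_letter and remove_letter; it assumes empty slots are represented by '' and returns new lists instead of mutating the state.

```python
def add_letter(word_b, letter):
    """Return a copy of word_b with letter placed in the first empty slot."""
    b = list(word_b)
    if "" in b:  # leave the list unchanged when there is no free slot
        b[b.index("")] = letter
    return b


def remove_letter(word_b, letter):
    """Return a copy of word_b with the last occurrence of letter removed.

    Deleting the element shifts everything to its right one step left;
    appending '' keeps the list at its fixed length.
    """
    b = list(word_b)
    if letter in b:
        last = len(b) - 1 - b[::-1].index(letter)  # index of the last occurrence
        del b[last]
        b.append("")
    return b


print(add_letter(["", "", "", ""], "L"))
print(remove_letter(["B", "A", "L", ""], "A"))
```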

Capitalize First Letter Of Each Name after Hyphen "-" and Space " "

I'm currently using this String extension to Capitalize the letter of each word in a textField :
"happy sunshine" .toTitleCase() gives "Happy Sunshine"
extension StringExtension on String {
String toTitleCase() => replaceAll(RegExp(' +'), ' ')
.split(' ')
.map((str) => str.toCapitalized())
.join(' ');
String toCapitalized() =>
length > 0 ? '${this[0].toUpperCase()}${substring(1).toLowerCase()}' : '';
}
but I'd also like to Capitalize letters that come after a hyphen - with the same toTitleCase method
ex : "very-happy sunshine" .toTitleCase() would give "Very-Happy Sunshine"
Currently .toTitleCase() gives "Very-happy Sunshine" : (
I am sure a wizard with expert knowledge in regular expression can do this better but I think this solution solves your problem:
void main() {
print('happy sunshine'.toTitleCase()); // Happy Sunshine
print('very-happy sunshine'.toTitleCase()); // Very-Happy Sunshine
}
extension StringExtension on String {
String toTitleCase() => replaceAllMapped(
RegExp(r'(?<= |-|^).'), (match) => match[0]!.toUpperCase());
}
If you call the method a lot of times, you might consider having the RegExp as a cached value like:
extension StringExtension on String {
static final RegExp _toTitleCaseRegExp = RegExp(r'(?<= |-|^).');
String toTitleCase() =>
replaceAllMapped(_toTitleCaseRegExp, (match) => match[0]!.toUpperCase());
}
You can tweak your code as well. But I've used the same thing somewhere in my project so you can do something like this as well.
How it works: I start with an empty string and loop through each character of the input. If the character at current_position - 1 is a space (" ") or a hyphen ("-"), or if it is the first character, I convert the character at current_position to uppercase.
String capitalize(String s) {
String result = "";
for (int i = 0; i < s.length; i++) {
if (i == 0) {
result += s[i].toUpperCase();
} else if (s[i - 1] == " ") {
result += s[i].toUpperCase();
} else if (s[i - 1] == "-") {
result += s[i].toUpperCase();
} else {
result += s[i];
}
}
return result;
}
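For comparison, here is a sketch of the same "uppercase the first character and any character after a space or hyphen" rule in Python. The function name to_title_case is hypothetical. Python's re module only accepts fixed-width lookbehinds, so the start-of-string case is written as a separate alternative instead of being placed inside the lookbehind as in the Dart pattern.

```python
import re

# Either the start of the string, or a position preceded by a space or hyphen.
_WORD_START = re.compile(r"(?:^|(?<=[ -])).")


def to_title_case(text: str) -> str:
    """Uppercase the first character and every character after ' ' or '-'."""
    return _WORD_START.sub(lambda m: m.group(0).upper(), text)


print(to_title_case("happy sunshine"))
print(to_title_case("very-happy sunshine"))
```

Like the loop above, this leaves the remaining characters untouched rather than lowercasing them.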

Replacing dynamic variable in string UNITY

I am making a simple dialogue system, and would like to "dynamise" some of the sentences.
For example, I have a sentence:
Hey Adventurer {{PlayerName}} !
Welcome in the world !
Now, in code, I am trying to replace that with the real value of the string in my game. I am doing something like this, but it doesn't work. I do have a string field PlayerName in the component where the function lives.
Regex regex = new Regex("(?<={{)(.*?)(?=}})");
MatchCollection matches = regex.Matches(sentence);
for(int i = 0; i < matches.Count; i++)
{
Debug.Log(matches[i]);
sentence.Replace("{{"+matches[i]+"}}", this.GetType().GetField(matches[i].ToString()).GetValue(this) as string);
}
return sentence;
But this returns an error, even though the match is correct.
Any idea how to fix this, or how to do it better?
Here's how I would solve this.
Create a dictionary with keys as the values you wish to replace and values as what you will be replacing them to.
Dictionary<string, string> valuesToReplace;
valuesToReplace = new Dictionary<string, string>();
valuesToReplace.Add("[playerName]", "Max");
valuesToReplace.Add("[day]", "Thursday");
Then check the text for the values in your dictionary.
If you make sure all of your keys start with "[" and end with "]" this will be quick and easy.
List<string> replacements = new List<string>();
//We will save all of the replacements we are about to perform here.
//This is done so we won't be modifying the original string while working on it, which will create problems.
//We will save them in the following format: originalText}newText
for(int i = 0; i < text.Length; i++) //Let's loop through the entire text
{
int startOfVar = 9999;
if(text[i] == '[') //We have found the beginning of a variable
{
startOfVar = i;
}
if(text[i] == ']') //We have found the ending of a variable
{
string replacement = text.Substring(startOfVar, i - startOfVar); //We have found the section we wish to replace
if (valuesToReplace.ContainsKey(replacement))
replacements.Add(replacement + "}" + valuesToReplace[replacement]); //Add the replacement we are about to perform to our dictionary
}
}
//Now let's perform the replacements:
foreach(string replacement in replacements)
{
text = text.Replace(replacement.Split('}')[0], replacement.Split('}')[1]); //We split our line. Remember the old value was on the left of the } and the new value was on the right
}
This will also work much faster, since it allows you to add as many variables as you wish without making the code slower.
Using the Regex.Replace method and a MatchEvaluator delegate (untested):
Dictionary<string, string> Replacements = new Dictionary<string, string>();
// Match the braces too, so the whole {{Key}} token is replaced; group 1 captures the key.
Regex DialogVariableRegex = new Regex(@"\{\{(.*?)\}\}");
string Replace(string sentence) {
// Regex.Replace returns a new string (strings are immutable), so return its result.
return DialogVariableRegex.Replace(sentence, EvaluateMatch);
}
string EvaluateMatch(Match match) {
var matchedKey = match.Groups[1].Value;
if (Replacements.ContainsKey(matchedKey))
return Replacements[matchedKey];
else
return ">>MISSING KEY<<";
}
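The callback-based replacement is easy to try out in isolation. The following is a rough Python equivalent of the Regex.Replace idea, not Unity code: the names render and replacements are hypothetical, and the pattern matches the surrounding braces so that the whole {{Key}} token is substituted.

```python
import re

# Group 1 captures the key between the double braces.
_VARIABLE = re.compile(r"\{\{(.*?)\}\}")


def render(sentence: str, replacements: dict) -> str:
    """Replace every {{Key}} token with replacements[Key]."""

    def evaluate(match):
        key = match.group(1)
        # Fall back to a visible marker so missing keys are easy to spot.
        return replacements.get(key, ">>MISSING KEY<<")

    # re.sub returns a new string; the input sentence is not modified.
    return _VARIABLE.sub(evaluate, sentence)


print(render("Hey Adventurer {{PlayerName}} !", {"PlayerName": "Max"}))
```

Because the replacement runs in a single pass over the original string, a value that itself contains braces or separator characters is not re-scanned.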
This is kind of old now, but I figured I'd update the accepted code above. It won't work since the start index is reset every time the loop iterates, so setting startOfVar = i gets completely reset by the time it hits the closing character. Plus there are problems if there's an open bracket '[' and no closing one. You can also no longer use those brackets in your text.
There's also setting the splitter to a single character. It tests fine, but if I set my player name to "Rob}ert", that will cause problems when it performs the replacements.
Here is my updated take on the code which I've tested works in Unity:
public string EvaluateVariables(string str)
{
Dictionary<string, string> varDict = GetVariableDictionary();
List<string> varReplacements = new List<string>();
string matchGuid = Guid.NewGuid().ToString();
bool matched = false;
int start = int.MaxValue;
for (int i = 0; i < str.Length; i++)
{
if (str[i] == '{')
{
if (i + 1 < str.Length && str[i + 1] == '$') // bounds check: '{' may be the last character
{
start = i;
matched = true;
}
}
else if (str[i] == '}' && matched)
{
string replacement = str.Substring(start, (i - start) + 1);
if (varDict.ContainsKey(replacement))
{
varReplacements.Add(replacement + matchGuid + varDict[replacement]);
}
start = int.MaxValue;
matched = false;
}
}
foreach (string replacement in varReplacements)
{
str = str.Replace(replacement.Split(new string[] { matchGuid }, StringSplitOptions.None)[0], replacement.Split(new string[] { matchGuid }, StringSplitOptions.None)[1]);
}
return str;
}
private Dictionary<string, string> GetVariableDictionary()
{
Dictionary<string, string> varDict = new Dictionary<string, string>();
varDict.Add("{$playerName}", playerName);
varDict.Add("{$npcName}", npcName);
return varDict;
}

RE2 regular expressions on streams?

Is it possible to use Google RE2 with streams? Some input literals that we are supposed to process with regular expressions can potentially be too large to hold in memory.
If there is a maximum match length, you could read the data in blocks of at least twice that length. If the match fails, or starts less than that many characters from the end, cut the current string, and append another block.
The length of the match string would never be more than the block length + max match length.
Example in C#:
public static IEnumerable<StreamMatch> MatchesInStream(
this Regex pattern, TextReader reader,
int maxMatchLength, int blockLength)
{
if (maxMatchLength <= 0)
{
throw new ArgumentException("Must be positive", "maxMatchLength");
}
if (blockLength < maxMatchLength)
{
throw new ArgumentException("Must be at least as long as maxMatchLength", "blockLength");
}
char[] buffer = new char[blockLength];
string chunk = "";
int matchOffset = 0;
// Read one block, and append to the string
int charsRead = reader.ReadBlock(buffer, 0, blockLength);
chunk += new string(buffer, 0, charsRead);
while (charsRead > 0 && chunk.Length > maxMatchLength)
{
int cutPosition = 0;
foreach (Match match in pattern.Matches(chunk))
{
if (match.Index > chunk.Length - maxMatchLength)
{
// The match could possibly have matched more characters.
// Read another block before trying again.
break;
}
yield return new StreamMatch(matchOffset, match);
cutPosition = match.Index + match.Length;
}
cutPosition = Math.Max(cutPosition, chunk.Length - maxMatchLength);
matchOffset += cutPosition;
chunk = chunk.Substring(cutPosition);
charsRead = reader.ReadBlock(buffer, 0, blockLength);
chunk += new string(buffer, 0, charsRead);
}
// Stream has ended. Try to match the last remaining characters.
foreach (Match match in pattern.Matches(chunk))
{
yield return new StreamMatch(matchOffset, match);
}
}
public class StreamMatch
{
public int MatchOffset { get; private set; }
public Match Match { get; private set; }
public StreamMatch(int matchOffset, Match match)
{
MatchOffset = matchOffset;
Match = match;
}
}
// One horrible XML parser
var reader = new StreamReader(stream);
var pattern = new Regex(@"<(/?)([\w:-]{1,15})([^<>]{0,50}(?<!/))(/?)>");
foreach (StreamMatch match in pattern.MatchesInStream(reader, 69, 128))
{
Console.WriteLine(match.Match.Value);
}
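The block-wise idea does not depend on C#. Below is a minimal Python sketch of the same technique using the standard re module (not RE2); the function name matches_in_stream is hypothetical. It assumes that no match is ever longer than max_match_len and that the pattern does not rely on anchors or lookbehind reaching across a cut point.

```python
import io
import re


def matches_in_stream(pattern, reader, max_match_len, block_len):
    """Yield (absolute_offset, matched_text) while holding only a small window."""
    if max_match_len <= 0 or block_len < max_match_len:
        raise ValueError("need 0 < max_match_len <= block_len")
    chunk = ""
    offset = 0  # absolute stream position of chunk[0]
    while True:
        block = reader.read(block_len)
        eof = block == ""  # text streams return '' only at end of input
        chunk += block
        cut = 0
        for m in pattern.finditer(chunk):
            # A match starting in the last max_match_len characters might
            # still grow once more data arrives, so defer it until then.
            if not eof and m.start() > len(chunk) - max_match_len:
                break
            yield offset + m.start(), m.group(0)
            cut = m.end()
        if eof:
            return
        # Keep at most the unconsumed tail that could still start a match.
        cut = max(cut, len(chunk) - max_match_len)
        offset += cut
        chunk = chunk[cut:]


stream = io.StringIO("ab123cd4567ef89")
print(list(matches_in_stream(re.compile(r"\d+"), stream, 4, 4)))
```

With these settings the retained window never exceeds block_len + max_match_len characters, which is the bound stated above.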

Fuzzy Matches on dijit.form.ComboBox / dijit.form.FilteringSelect Subclass

I am trying to extend dijit.form.FilteringSelect with the requirement that all instances of it should match input regardless of where the characters are in the inputted text, and should also ignore whitespace and punctuation (mainly periods and dashes).
For example if an option is "J.P. Morgan" I would want to be able to select that option after typing "JP" or "P Morgan".
Now I know that the part about matching anywhere in the string can be accomplished by passing in queryExpr: "*${0}*" when creating the instance.
What I haven't figured out is how to make it ignore whitespace, periods, and dashes. I have an example of where I'm at here - http://jsfiddle.net/mNYw2/2/. Any help would be appreciated.
The thing to master in this case is the store's fetch query strings. The widget calls a function in the attached store to pull out any matching items, so if you have a value entered in the autocompleting input field, it eventually ends up as something like this in the code:
var query = { this.searchAttr: this.get("value") }; // this is not entirely accurate
this._fetchHandle = this.store.query(query, options);
this._fetchHandle.then( showResultsFunction );
So, when you define the select, override _setStoreAttr to make changes to the store's query API:
dojo.declare('CustomFilteringSelect', [FilteringSelect], {
constructor: function() {
//???
},
_setStoreAttr: function(store) {
this.inherited(arguments); // allow for comboboxmixin to modify it
// above line eventually calls this._set("store", store);
// so now, 'this' has 'store' set already
// override here
this.store.query = function(query, options) {
// note that some (Memory) stores have no 'fetch' wrapper
};
}
});
EDIT: override queryEngine function as opposed to query function
Take a look at the file SimpleQueryEngine.js under dojo/store/util. This is essentially what filters the received Array items on the given String query from the FilteringSelect. Ok, it goes like this:
var MyEngine = function(query, options) {
// create our matching query function
switch(typeof query){
default:
throw new Error("Can not query with a " + typeof query);
case "object": case "undefined":
var queryObject = query;
query = function(object){
for(var key in queryObject){
var required = queryObject[key];
if(required && required.test){
if(!required.test(object[key])){
return false;
}
}else if(required != object[key]){
return false;
}
}
return true;
};
break;
case "string":
/// HERE is most likely where you can play with the regexp matcher.
// named query
if(!this[query]){
throw new Error("No filter function " + query + " was found in store");
}
query = this[query];
// fall through
case "function":
// fall through
}
function execute(array){
// execute the whole query, first we filter
var results = arrayUtil.filter(array, query);
// next we sort
if(options && options.sort){
results.sort(function(a, b){
for(var sort, i=0; sort = options.sort[i]; i++){
var aValue = a[sort.attribute];
var bValue = b[sort.attribute];
if (aValue != bValue) {
return !!sort.descending == aValue > bValue ? -1 : 1;
}
}
return 0;
});
}
// now we paginate
if(options && (options.start || options.count)){
var total = results.length;
results = results.slice(options.start || 0, (options.start || 0) + (options.count || Infinity));
results.total = total;
}
return results;
}
execute.matches = query;
return execute;
};
new Store( { queryEngine: MyEngine });
When execute.matches is set at the bottom of this function, the query gets applied to each item. Each item has a property, named by Select.searchAttr, which is tested by a RegExp like so: new RegExp(query).test(item[searchAttr]); or, perhaps a bit simpler to understand: item[searchAttr].matches(query);
I have no testing environment, but locate the inline comment above and start using console.debug.
Example:
Store.data = [
{ id:'WS', name: 'Will F. Smith' },
{ id:'RD', name:'Robert O. Dinero' },
{ id:'CP', name:'Cle O. Patra' }
];
Select.searchAttr = "name";
Select.value = "Robert Din"; // keyup->autocomplete->query
Select.query will become Select.queryExpr.replace("${0}", Select.value), which in your simple queryExpr case is 'Robert Din'. This will get fuzzy, and it is up to you to fill in the regular expression; here's something to start with:
query = query.substr(1,query.length-2); // '*' be gone
var words = query.split(" ");
var exp = "";
dojo.forEach(words, function(word, idx) {
// check if last word
var nextWord = words[idx+1] ? words[idx+1] : null;
// postfix 'match-all-but-first-letter-of-nextWord'
exp += word + (nextWord ? "[^" + nextWord[0] + "]*" : "");
});
// exp should now be "Robert[^D]*Din";
// put back '*'
query = '*' + exp + '*';
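A different way to get the "ignore whitespace, periods, and dashes" behaviour from the question is to normalize both the option label and the typed text before comparing, instead of building a regular expression. The sketch below shows that alternative in Python, outside of Dojo; normalize and fuzzy_filter are hypothetical names, and the case-insensitive comparison is an assumption, not something the question states.

```python
import re

# Characters to ignore on both sides: whitespace, periods, and dashes.
_IGNORED = re.compile(r"[\s.\-]+")


def normalize(text: str) -> str:
    """Strip ignored characters and lowercase (case-insensitivity is assumed)."""
    return _IGNORED.sub("", text).lower()


def fuzzy_filter(options, typed):
    """Return the options whose normalized label contains the normalized input."""
    needle = normalize(typed)
    return [option for option in options if needle in normalize(option)]


names = ["J.P. Morgan", "Will F. Smith", "Robert O. Dinero"]
print(fuzzy_filter(names, "JP"))
print(fuzzy_filter(names, "P Morgan"))
```

In a custom queryEngine, the same normalization could be applied inside the function that tests each item's searchAttr value.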