Some background
I'm asking how a programmer would approach this task because I'm not really a programmer. I'm a grad student studying quantitative social science, and while I've been programming consistently for a year now I have no formal training.
My question isn't really about which language to use. I'm happy to work in Bash, AWK, R, or Python. I've also written small snippets of code (beyond "hello world," but not much further) in Java, C, JavaScript, and Matlab. That said, if there's some language or some feature of a language that would make this task easier or more natural, I'd love to know about it.
Instead, I'm interested in algorithms and data structures. What do I grab, when do I grab it, where do I save it, etc.? I imagine I could probably do all this with a few cleverly-constructed regular expressions, and I am pretty comfortable with intermediate-level regex features like lookarounds, but anything I make up on my own will undoubtedly be hacky and ad-hoc.
The task
What I have is code (it happens to be in R) that looks something like this, where #'s indicate comments:
items = list(
day1 = list(
# a apples
# b oranges
# c pears
# d red grapes
# m.
# 1 peanuts
# 2 cashews
type1 = c("a", "b", "d", "m.2") # this returns a vector of strings
type2 = c("c", "m.1")
), # this returns a list of vectors
day2 = list(
# a apples
# b oranges
# c pears
# d red grapes
# e plums
# m.
# 1 peanuts
# 2 cashews
# 3 pistachios
type1 = c("a", "b", "d", "e", "m.2")
type2 = c("c", "m.1", "m.3")
)
) # this returns a list of lists of vectors
and what I would like instead is code that looks like this:
items = list(
day1 = list(
type1 = c(
"apples" = "a",
"oranges" = "b",
"red grapes" = "d",
"cashews" = "m.2"
),
type2 = c(
"pears" = "c",
"peanuts" = "m.1"
)
),
day2 = list(
type1 = c(
"apples" = "a",
"oranges" = "b",
"red grapes" = "d",
"plums" = "e",
"cashews" = "m.2"
),
type2 = c(
"pears" = "c",
"peanuts" = "m.1",
"pistachios" = "m.3"
)
)
)
Some things to note:
I can rely on the commented text following that format.
I cannot rely on the naming for day1 being "nested" inside the naming for day2. Some of the letters might swap around.
I can rely on there being the same number and name of types within days.
The vertical spacing isn't important; I mostly just want to get the comments into the code as shown, although having the script do all the spacing for me would be a nice touch.
So, how would a programmer approach the task of programmatically turning the first code snippet into the second? I can have it done in about 15 minutes of copying and pasting, but I'd like to learn something here. And again, I'm not asking for pre-written code, I'm just looking for some direction since right now I'm just groping in the dark.
Given your code sample, this should be doable with a transformation made up of a couple of steps. At a high level you'd read the comments into a data collection you can query, then parse the code and do a find/replace against that collection.
Without getting in too deep, this might look like:
Generate a text file of only the comments. A regex with the intent of "find all lines that start with optional whitespace, then a #" (something like ^\s*#.*$) would give you a result like:
# a apples
# b oranges
# c pears
# d red grapes
# m.
# 1 peanuts
# 2 cashews
# a apples
# b oranges
# c pears
# d red grapes
# e plums
# m.
# 1 peanuts
# 2 cashews
# 3 pistachios
Using the above results you can apply some basic text parsing to break down each line. Handling the m. cases requires some assumptions. Based on your sample I'd start with pseudocode like:
For each line
    Take the first token after the # and call it "key"
    Take the rest of the line and call it "value"
    If the key starts with a letter
        Add "key" => "value" to the dictionary
        If "value" is empty, remember "key" as "parentkey"
        Next line
    If the key is a number
        Add "parentkey" + "key" => "value" to the dictionary
        Next line
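As a sketch (not pre-written production code, just the pseudocode transcribed), this is how it might look in Python, assuming the comment format from your sample; bare prefix lines like # m. get an empty entry that is cleaned out later:

```python
import re

# "# a apples" -> key "a", value "apples"; "# m." -> key "m.", empty value
COMMENT = re.compile(r'^\s*#\s*(\S+)\s*(.*)$')

def parse_comments(comment_lines):
    """Build a {code: name} dictionary from comment lines."""
    mapping = {}
    parent = ''
    for line in comment_lines:
        match = COMMENT.match(line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        if not value:                 # bare prefix such as "m."
            parent = key
            mapping[key] = ''         # empty entry, cleaned out later
        elif key.isdigit():           # numbered sub-item under the prefix
            mapping[parent + key] = value
        else:                         # ordinary lettered item
            mapping[key] = value
    return mapping
```

For example, parse_comments(['# a apples', '# m.', '# 1 peanuts']) gives {'a': 'apples', 'm.': '', 'm.1': 'peanuts'}.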
This would give you a structure like the following (shown flat here; since your question notes the letters can mean different things on different days, in practice you'd keep one dictionary per day rather than reusing keys):
{
"a": "apples",
"b": "oranges",
"c": "pears",
"d": "red grapes",
"m.": "",
"m.1": "peanuts",
"m.2": "cashews",
"a": "apples",
"b": "oranges",
"c": "pears",
"d": "red grapes",
"e": "plums",
"m.": "",
"m.1": "peanuts",
"m.2": "cashews",
"m.3": "pistachios"
}
You can clean out the empty "m." entries by iterating over the dictionary and removing items with an empty value.
At this point you can iterate over your dictionary and perform a find/replace in your code file:
For each dictionary entry (key, value)
Find strings like "key" and replace with strings like "value" = "key"
All in all it's not terribly efficient or elegant, but it's not difficult to code and should work. Granted, there are probably additional details to consider (there always are), but this is a fairly simple approach given your samples.
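The find/replace pass can be sketched the same way. Assuming a {code: name} dictionary for one day is already in hand (the literal below is just the day1 sample), a regex callback keeps the replacement confined to quoted keys:

```python
import re

day1 = {"a": "apples", "b": "oranges", "c": "pears",
        "d": "red grapes", "m.1": "peanuts", "m.2": "cashews"}

def rewrite_vector(line, mapping):
    """Replace each "key" in a c(...) call with "name" = "key"."""
    return re.sub(
        r'"([\w.]+)"',
        lambda m: '"{}" = "{}"'.format(mapping[m.group(1)], m.group(1)),
        line,
    )

print(rewrite_vector('type1 = c("a", "b", "d", "m.2")', day1))
# type1 = c("apples" = "a", "oranges" = "b", "red grapes" = "d", "cashews" = "m.2")
```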
I would use a quick regular expression substitution to reduce the work to do and then fix it up by hand. For example you get over halfway there with:
s/# (\w+) ([\w ]+)/"\2" = "\1"/
The exact regular expression to write, and how you apply it, depends on your tool; editors and programming languages differ quite a bit, so search for the specifics of whatever you're using. (You may have multiple easy options: a Python one-liner uses one syntax, and the vi editor a different one.)
If you have to do this task regularly or for more code, then you would need to learn about parsing. Which is a lot more work (too much to be worthwhile for something like this if you don't have code sitting around to do it) but also a lot more powerful in the long run.
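For example, in Python the same substitution looks like this (re.sub, with backreferences in the replacement); the numbered m. lines would still need fixing up by hand, which is the manual part of this approach:

```python
import re

# turn a comment line into the named-vector form
line = "# d red grapes"
print(re.sub(r'# (\w+) ([\w ]+)', r'"\2" = "\1"', line))
# "red grapes" = "d"
```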
Related
I have been given a large text file and want to find the number of different words that start with each letter. I am trying to understand input and output values for map and reduce functions.
I understand a simpler problem which does not need to deal with duplicate words: determine the frequency with which each letter of the alphabet starts a word in the text using map reduce.
Map input: <0, “everyday i am city in tomorrow easy over school i iterate tomorrow city community”>
Map output: [<e,1>,<i,1>,<a,1>,<c,1>,<i,1>,<t,1>,<e,1>,<o,1>,<s,1>,<i,1>,<i,1>,<t,1>,<c,1>,<c,1>]
Reduce input: <a,[1]>,<c,[1,1,1]>,<e,[1,1]>,<i,[1,1,1,1]>,<o,[1]>,<s,[1]>,<t,[1,1]>
Reduce output: [<a,1>,<c,3>,<e,2>,<i,4>,<o,1>,<s,1>,<t,2>]
For the above problem the words 'i' 'city' and 'tomorrow' appear more than once so my final output should be:
Reduce output: [<a,1>,<c,2>,<e,2>,<i,3>,<o,1>,<s,1>,<t,1>]
I am unsure how to ensure duplicate words are removed in the above problem (would it be done in a pre-processing phase, or could it be implemented in either the map or the reduce function)? If I could get help understanding the map and reduce outputs of the new problem, I would appreciate it.
You can do it in two map-reduce passes:
find all the distinct words, by emitting each word as the map output and outputting each word once in reduce
the part you already solved: find the frequency of each initial letter over the unique words.
Alternatively, since there are not many unique words, you might cache them in the mapper and output each one (or its first letter) only once, making the reduce identical to your simpler problem. That alone won't work, though, because the same word can appear in different mappers. You can still cache the words per mapper in the first solution and output each word only once per mapper, which means a little less traffic between map and reduce.
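To see the two passes in miniature, here is plain Python standing in for the two jobs (pass 1 deduplicates, pass 2 counts first letters), using the sample sentence from the question:

```python
from collections import Counter

text = ("everyday i am city in tomorrow easy over school "
        "i iterate tomorrow city community")

# pass 1: map emits each word as a key; reduce keeps one copy per word
unique_words = set(text.split())

# pass 2: map each unique word to its first letter; reduce sums the 1s
counts = Counter(word[0] for word in unique_words)
print(sorted(counts.items()))
# [('a', 1), ('c', 2), ('e', 2), ('i', 3), ('o', 1), ('s', 1), ('t', 1)]
```

This matches the expected reduce output in the question.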
Maybe something like this would help,
let str = "everyday i am city in tomorrow easy over school i iterate tomorrow city community"
let duplicatesRemoved = Set(str.split(separator: " "))
Output (a Set does not guarantee element order):
["city", "community", "tomorrow", "easy", "everyday", "over", "in", "iterate", "i", "am", "school"]
And maybe you don't need those map statements and can achieve something like this,
Code
var varCount = [Character: Int]()
for subStr in duplicatesRemoved {
if let firstChar = subStr.first {
varCount[firstChar] = (varCount[firstChar] ?? 0) + 1
}
}
Output (dictionary order is likewise not guaranteed):
["i": 3, "t": 1, "e": 2, "c": 2, "s": 1, "a": 1, "o": 1]
In an effort to make our budgeting life a bit easier and to help myself learn, I am creating a small program in Python that takes data from our exported bank CSV.
I will give you an example of what I want to do with this data. Say I want to group all of my fast food expenses together. There are many different names with different totals in the description column, but I want to see them all tabulated as one "Fast Food" expense.
For instance, the CSV is set up like this:
Date Description Debit Credit
1/20/20 POS PIN BLAH BLAH ### 1.75 NaN
I figured out how to group them with an or statement:
contains = df.loc[df['Description'].str.contains('food court|whataburger', flags = re.I, regex = True)]
Ultimately I would like to have it read off of a list. I would like to group all my expenses into categories and check those category variables so that it would only output matches from that list.
I tried something like:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
That obviously didn't work.
If there is a better way of doing this I am wide open to suggestions.
Also I have looked through quite a few posts here on stack and have yet to find the answer (although I am sure I overlooked it)
Any help would be greatly appreciated. I am still learning.
Thanks
You can assign a new column using str.extract and then groupby:
import re
import pandas as pd

df = pd.DataFrame({"description":['Macdonald something', 'Whataburger something', 'pizza hut something',
                                  'Whataburger something','Macdonald something','Macdonald otherthing',],
                   "debit":[1.75,2.0,3.5,4.5,1.5,2.0]})
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
df["found"] = df["description"].str.extract(f'({"|".join(fast_food)})', flags=re.I)
print(df.groupby("found")[["debit"]].sum())
#
debit
found
Macdonald 5.25
Whataburger 6.50
pizza hut 3.50
Use dynamic pattern building:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
contains = df.loc[df['Description'].str.contains(pattern, flags = re.I, regex = True)]
The \b word boundaries find whole words, not partial words.
The re.escape protects special characters so they are parsed as literal characters.
If \b does not work for you, check other approaches at Match a whole word in a string using dynamic regex
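A quick check of what the word boundaries buy you (the description strings here are made up for illustration):

```python
import re

fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))

# a whole-word hit matches, regardless of case
print(bool(re.search(pattern, "POS PIN WHATABURGER 0123", flags=re.I)))  # True
# a partial-word hit does not
print(bool(re.search(pattern, "POS PIN WHATABURGERS", flags=re.I)))      # False
```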
I am researching and learning the ML language. I have run into a problem and am having difficulty solving it. I'm sure I will use the Traverse, Size and Substring functions, but I cannot quite put it together; I'm a bit confused. Could you help?
Question:
val x = [ ["National", "Aeronautics", "and", "Space", "Administration"]
, ["The", "North", "Atlantic", "Treaty", "Organization"]
]
Sample run:
val it = [ {acronym="NASA", name="National Aeronautics and Space Administration"},
           {acronym="NATO", name="The North Atlantic Treaty Organization"}
         ]
: nm list
Looking at the information in your question, I'm guessing that the goal is to write a function acronyms which meets the following specification. I've taken the liberty of introducing type names to make things clearer:
type words = string list
type summary = {acronym : string, name : string}
val acronyms : words list -> summary list
This function takes a list of organization names (which have been split into words) and produces a list of summaries. Each summary in the output describes the corresponding organization from the input.
The tricky part is writing a function acronym : words -> summary which computes a single summary. For example,
- acronym ["National", "Aeronautics", "and", "Space", "Administration"];
val it = {acronym="NASA",name="National Aeronautics and Space Administration"}
: summary
Once you have this function, you can apply it to each organization name of the input with List.map:
fun acronyms orgs = List.map acronym orgs
I'll leave the acronym function to you. As a hint to get started, consider filtering the list of words to remove words such as "and" and "the".
I have a CSV file containing tweet data (tweetid, tweet, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression work here in splitting the output in the desired way?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL, as I find it much simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
The above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired output: B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.
Can't comment, but from looking at this and testing it out, it looks like your quotes in the regex are different from those in the csv.
" in the csv
” in the regex code.
To get the tweetid, the capture group needs to be on the id itself rather than on the delimiter; try this:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*?),"',1)) AS (tweetid:long);
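Since Pig hands these patterns to Java's regex engine, it is quick to sanity-check a candidate outside Pig first. Python's re behaves the same way for a simple pattern like this one, where a lazy group up to the first ," captures the id:

```python
import re

# one record from the sample CSV (shortened)
line = ('396124436476092416,"Think about the life you livin but don\'t '
        'think so hard it hurts",Obey_Jony09')

# lazy group: everything up to the FIRST occurrence of ,"
m = re.match(r'(.*?),"', line)
print(m.group(1))   # 396124436476092416
```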
I'm learning Pig Latin and am using regular expressions. Not sure if the regex is language agnostic or not but here is what I'm trying to do.
If I have a table with two fields: tweet id and tweet, I'd like to go through each tweet and pull out all mentions up to 3.
So if a tweet goes something like "#tim bla #sam #joe something bla bla" then the line item for that tweet will have tweet id, tim, sam, joe.
The raw data has twitter ids, not the actual handles, so this regex seems to return a mention: (.*)#user_(\\S{8})([:| ])(.*)
Here is what I have tried:
a = load 'data.txt' AS (id:chararray, tweet:chararray);
b = foreach a generate id, LOWER(tweet) as tweet;
-- filter data so only tweets with mentions remain
c = FILTER b BY tweet MATCHES '(.*)#user_(\\S{8})([:| ])(.*)';
-- try to pull out the mentions
d = foreach c generate id,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1}',3) as mention1,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1,2}',3) as mention2,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){2,3}',3) as mention3;
e = limit d 20;
dump e;
So in that try I was playing with quantifiers, trying to return the first, second, and third instance of a match in a tweet: {1}, {1,2}, {2,3}.
That did not work, mention 1-3 are just empty.
So I tried changing d:
d = foreach c generate id,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)',2) as mention1,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',5) as mention2,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',8) as mention3,
But, instead of returning each user mentioned, this returned the same mention three times. I had expected that by cutting and pasting the expression again I'd get the second match, and pasting it a third time would get the third match.
I'm not sure how well I've managed to word this question but to put it another way, imagine that the function regex_extract() returned an array of matched terms. I would like to get mention[0], mention[1], mention[2] on a single line item.
Whenever you use the REGEX_EXTRACT or REGEX_EXTRACT_ALL udf, keep in mind that it is just pure regex handled by Java.
It is easier to test the regex through a local Java test. Here is the regex I found to be acceptable:
// keep each ".*?" inside its optional group; otherwise the lazy dot-star
// is never forced to expand, and groups 2 and 3 come back empty
Pattern p = Pattern.compile("#(\\S+)(?:.*?#(\\S+)(?:.*?#(\\S+))?)?");
String input = "So if a tweet goes something like #tim bla #sam #joe #bill something bla bla";
Matcher m = p.matcher(input);
if (m.find()) {
    for (int i = 0; i <= m.groupCount(); i++) {
        System.out.println(i + " -> " + m.group(i));
    }
}
With this regex, if there is at least one mention, it returns three groups, the second and/or third being null if a second/third mention is not found.
Therefore, you may use the following PIG code (REGEX_EXTRACT_ALL needs the pattern to match the entire tweet, hence the leading .*? and trailing .*):
d = foreach c generate id, REGEX_EXTRACT_ALL(
    tweet, '.*?#(\\S+)(?:.*?#(\\S+)(?:.*?#(\\S+))?)?.*');
Tweets without any mention will not match at all and come back null, so the earlier FILTER step is still worth keeping.
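The backtracking subtlety here is easy to check outside Java too. Python's re treats lazy quantifiers and optional groups the same way, so a pattern that keeps each .*? inside its optional group does pick up the later mentions:

```python
import re

# each ".*?" sits inside the optional group, so a failed "#" forces the
# lazy dot-star to expand and scan ahead for the next mention
p = re.compile(r'#(\S+)(?:.*?#(\S+)(?:.*?#(\S+))?)?')

m = p.search("so if a tweet goes like #tim bla #sam #joe #bill more")
print(m.groups())   # ('tim', 'sam', 'joe')
```

With only one mention in the input, groups 2 and 3 come back as None instead.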