Is there a fairly simple way for a script to tell (from context) whether "her" is a possessive pronoun?

Is there a fairly simple way for a script to tell (from context) whether "her" is a possessive pronoun? - regex

I am writing a script to reverse all genders in a piece of text, so all gendered words are swapped - "man" is swapped with "woman", "she" is swapped with "he", etc. But there is an ambiguity as to whether "her" should be replaced with "him" or "his".

Okay. Lets look at this like a linguist might. I am thinking aloud here.
"Her" is a pronoun. It can either be a:
1. possessive pronoun
This is her book.
2. personal pronoun
Give it to her. (after preposition)
He wrote her a letter. (indirect object)
He treated her for a cold. (direct object)
So lets look at case (1), possessive pronoun. That is it is a pronoun which is in the "genitive" case (meaning, it is a noun which is being "possessive." Okay, that detail isn't quite as important as the next one.)
In this case, "her" is acting as a "determiner". Determiners may occur in two places in a sentence (this is a simplification):
Det + Noun ("her book")
Det + Adj + Noun ("her nice book")
So to figure out if her is a determiner, you could have this logic:
a. If the word following "her" is a noun, then "her" is a determiner.
b. If the 2 words following "her" is an adjective, then a noun, then "her" is a determiner"
And if you establish that "her" is a determiner, then you know that you must replace it with "his", which is also a determiner (aka genitive noun, aka possessive pronoun).
If it doesn't match criteria (a) and (b) above, then you could possibly conclude that it is not a determiner, which means it must be a personal pronoun. In that case, you would replace "her" with "him".
You wouldn't even have to do the tests below, but I'll try to describe them anyway.
Looking at (2) from above: personal pronoun, rather than possessive. This gets trickier.
The examples above show "her" occurring in 3 ways:
(1) Give it to her. (after preposition. we call this the "object of a preposition".)
So you could maybe devise a rule: "If 'her' occurs immediately after a preposition, then it should be treated as a noun, so we would replace it with 'him'".
The next two are tricky. "her" can either be a direct object or an indirect object.
(2) He wrote her a letter. (indirect object)
(3) He treated her for a cold. (direct object)
Syntactically, how can we tell the difference?
A direct object occurs immediately after a verb.
If you have a verb, followed by a noun, then that noun is a direct object. eg:
He treated her.*
If you have a verb, followed by a noun, followed by a prepositional phrase, then the noun is a direct object.
He treated her for a cold. ("her" is a noun, and it comes immediately after the verb "treated". "for a cold" is a prepositional phrase.)
Which means that you could say "If you have Verb + Noun + Prep" then the noun is a direct object. Since the noun is a direct object, then it is a personal pronoun, so use "him". (note, you only have to check for a preposition, not the entire prep phrase, since the phrase will always begin with a preposition.)
If it is an indirect object, then you'll have the form "verb + noun + noun".
He wrote her a letter. ("her" is a noun, "letter" is a noun. well, "a letter" is a "noun phrase", so you'd have to account for determiners as well.)
So... if "her" is a direct object, indirect object, or obj of prep, you could change it to "him", otherwise, change it to "his".
This method seems a lot more complicated - so I'd just start by checking to see if "her" is a determiner (see above), and if it is a determiner, use "his" otherwise, just use "him".
So, the above has a lot of simplifications. It doesn't cover "interrupting phrases", or clause structures, or constituency tests, or embedded clauses, or punctuation, or anything like that.
Also, this solution requires a dictionary - a list of "nouns" and "verbs" and "prepositions" so that you can determine the lexical category of each word in the sentence.
And even there, man, natural language processing is hard. You'd want to do some sort of "training" for your model to have a good solution. BUT for very simple things, try some of the stuff described above.
Sorry for being so verbose! (None of the existing answers gave any hard data, or precise linguistic definitions, so here goes.)

Given the scope of your project: reversing all gender-related words, it appears that :
The "investment" in a more fundamental approach would be justified
No heuristic based on simple lookup/substitution will adequately serve all or even most cases.
Furthermore, Regex too seems a poor choice of tool; natural language is just not a regular langugage ;-).
Instead, you should consider introducing Part-of-Speech (POS) tagging, possibly with a hint of Named Entity Recognition, and then apply substitution rules based on the extra info the tagging supplied.
This may seem like a lot of work, but if for example your scripting language happens to be Python, you can leverage NTLK to implement all this with a relatively small effort.

G'day,
This is one of those cases where you could invest an inordinate amount of time tracking down the automatic solution and finish up with a result that you're going to have to check through anyway.
I'd suggest making your script insert a piece of text that will really stand out at every instance of "her" and would be easily searchable. Maybe even make the script insert both "him" and "his" strings so that you only need to delete one of them after you've seen the context?
You're going to save a lot of time and effort this way. Not to mention blood, sweat and tears even! (-:
Coming up with a fully automatic solution is no mean feat as it will involve scanning a massive corpus of words to determine if the following word is an object.
Sometimes gaining that extra 5 or 10 percent improvement is just not worth the extra effort involved. Except of course as an "it is left as an interesting exercise for the reader..." type problem that some text books seem to love.
Edit: I forgot to mention that finding this "tipping point" is a true art. Definitely one skill that only comes with experience. (-:
Edit: Part II - The Revenge I also forgot to mention that you can eliminate one edge case though. If the word "him" is followed by punctuation, e.g. "... to her.", "... for her," etc. then you can eliminate the uncertainty for those cases and just replace them with "him". Similarly if the word is followed by a class of words, e.g. "... for her to" can have the "her" easily be replaced with "him". Edit 3: This is not a full list of exceptions but is merely intended as a suggestion for a starting point of the list of items you'll need to look for.
HTH

Trying to determine whether her is a possessive or personal pronoun is harder than trying to determine the class of him or his. However, you would expect both to be used in the same contexts given a large enough corpus. So why not reverse the problem? Take a large corpus and find all occurrences of him and his. Then look at the words surrounding them (just how many words you need to look at is left up to you). With enough training examples, you can estimate the probability that a given set of words in the vicinity of the word indicates him or his. Then you can use those probability estimates on an occurrence of her to determine whether you should be using him or his. As other responses have indicated, you're not going to be perfect. Also, figuring out how big of a neighborhood to use and how to calculate the probabilities is a fair bit of work. You could probably do fairly well using a simple classifier like Naive Bayes.
I suspect, though, you can get a decent bit of accuracy just by looking at patterns in parts of speech and writing some rules. Naturally, you'll miss some, but probably a dozen rules or so will account for the majority of occurrences. I just glanced through about fifty occurrences of her in "The Phantom Rickshaw" by Rudyard Kipling and you can easily get 90% accuracy just by the rule:
her_followed_by_noun ? possessive : personal
You can use an off-the-shelf part-of-speech (POS) tagger like the Stanford POS Tagger to automatically determine whether a word is a noun or something else in context. Again, it's not perfect, but it does pretty well.
Edge cases with odd clause structures are hard to get right, but they also occur fairly rarely in most text. It just depends on your data.

I don’t think so. You could check if the possessive pronoun is followed by a noun or an adjective and thereby conclude that is indeed a possessive pronoun. But of course you would have to write a script that is able to do this and even if you had a method it would still be wrong in some other cases. A simple pattern matching algorithm won’t help you here.
Good luck with analysing this: http://en.wikipedia.org/wiki/X-bar_theory

Definitely no. You would have to do syntactic analysis on your input text (parsing the English language, really, that's where the word “to parse” comes from). That's the only way you can determine with certainty what the “her” in your text stand for, you can't rely on search-and-replace. There are many ways to do that, but none would qualify as “fairly simple”, I think.

I will address regex, since that is one of the tags. Regular expressions are insufficiently powerful for parsing human language, because regex does not do recursion, and all human lnguages are recursive.
When this fact is combined with the other ambiguities in English, such as the way many words can serve multiple functions in a sentense, I think that a reliable automated solution will be a very difficult and costly project.

About the only one I can think of (and I'm sure someone in the comments will prove me wrong!) is any instance of her followed by punctuation can most probably be replace with him. But I still agree with the previous answers that you're probably best off doing a manual replace.

OK, based on some of the answers people gave I've got a better idea of how to approach this. Instead of trying to write a script that gets this right 100% of the time I'll just aim to get it right as often as possible. A quick search through some English-language texts shows that "his" appears (very roughly) twice as often as "him", so the default behaviour should be to convert "her" to "his". If I did this and nothing else it should be right about two thirds of the time.
Now I'm not interested in finding patterns that would show "her" should be converted to "his", since this is what I would do anyway, I'm only interested in finding patterns that would show "her" should be converted to "him", since these would allow me to lower the error rate. There's two rules I can implement fairly painlessly:
If "her" is followed immediately by a comma or period, it should be converted to "him", as Michael Itzoe said.
If 'her' occurs immediately after a preposition, then it should be treated as a noun, we would replace it with 'him', as Rasher said.
And I'll be able to do more than that if I use Part of Speech tagging software. I think I'll get on with doing the easy stuff first :-)

Related

How to find changing date format from a string [duplicate]

I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit Shorthand
(something){0,1} (something)?
(something){1} (something)
(something){0,} (something)*
(something){1,} (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?

To my knowledge they are identical. I think there maybe a few engines out there that don't support the numbered syntax but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criteria isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern and that's when I would suggest changing those occurrences to shorthand notation and save space and, IMO, improve readability.

I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.

They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.

They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question
arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"

Is it possible to check if a word is really an english word via regex?

When I say english I really mean vs gobbledy gook. I'm not trying to filter out maitre'd or espanol or whatever.
So basically I'm trying to test whether a word is made of entirely of pronounceable syllables.
So here would be a regex:
if re.match(r'^([^aeiouy]{1,3}[aeiouy]{1,3}[^aeiouy]{1,3}|[aeiouy]{1,3}[^aeiouy]{1,3})+
print "gobbledy gook!!!"
What its checking for: C=consonant V=vowel
CVC or VC groups of characters. groups are 1-3 characters in length
Does this make sense?,the_word) is None:
xCodexBlockxPlacexHolderx
What its checking for: C=consonant V=vowel
CVC or VC groups of characters. groups are 1-3 characters in length
Does this make sense?

Yes and no. It is, in a certain sense, possible; the comments give the trivial (and horribly verbose and sluggish) way to do so. But as to whether it is in any sense useful to abuse regexen for this task? No. There's simply far too much variation between valid words, and even the weakened verification you're doing that makes no attempt to handle plausible-but-wrong words like 'rong' will require absolutely impractical customization to do the job.
This sort of problem is why JWZ (Jamie Zawinski) said:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

How to detect deviations in a sequence from a pattern?

i want to detect the passages where a sequence of action deviates from a given pattern and can't figure out a clever solution to do this, although the problem sound really simple.
The objective of the pattern is to describe somehow a normal sequence. More specific: "What actions should or shouldn't be contained in an action sequence and in which order?" Then i want to match the action sequence against the pattern and detect deviations and their locations.
My first approach was to do this with regular expressions. Here is an example:
Example 1:
Pattern: A.*BC
Sequence: AGDBC (matches)
Sequence: AGEDC (does not match)
Example 2:
Pattern: ABCD
Sequence: ABD (does not match)
Sequence: ABED (does not match)
Sequence: ABCED (does not match)
Example 3:
Pattern: ABCDEF
Sequence: ABXDXF (does not match)
With regular expressions it is simple to detect a error but not where it occurs. My approach was to successively remove the last regex block until I can find the pattern in the sequence. Then i will know the last correct actions and have at least found the first deviation. But this doesn't seem the best solution to me. Further i can't all deviations.
Other soultions in my mind are working with state machines, order tools like ANTLR. But I don't know if they can solve my problem.
I want to detect errors of omission and commission and give an user the possibility to create his own pattern. Do you know a good way to do this?

While matching your input, a regular expression engine has information on the location of a mismatch -- it may however not provide it in a way where you can easily access it.
Consider e.g. a DFA implementing the expression. It fetches characters sequentially, matching them with expectation, and you are interested at the point in the sequence where there is no valid match.
Other implementations may go back and forth, and you would be interested in the maximum address of any character that was fetched.
In Java, this could possibly be done by supplying a CharSequence implementation to
java.util.regex.Pattern.matches(String regex, CharSequence input)
where the accessor methods keep track of the maximum index.
I have not tried that, though. And it does not help you categorizing an error, either.

have you looked at markov chains? http://en.wikipedia.org/wiki/Markov_chain - it sounds like you want unexpected transitions. maybe also hidden markov models http://en.wikipedia.org/wiki/Hidden_Markov_Models

Find an open-source implementation of regexs, and add in a hook to return/set/print/save the index at which it fails, if a given comparison does not match. Alternately, write your own RE engine (not for the faint of heart) so it'll do exactly what you want!

In "aa67bc54c9", is there any way to print "aa" 67 times, "bc" 54 times and so on, using regular expressions?

I was asked this question in an interview for an internship, and the first solution I suggested was to try and use a regular expression (I usually am a little stumped in interviews). Something like this
(?P<str>[a-zA-Z]+)(?P<n>[0-9]+)
I thought it would match the strings and store them in the variable "str" and the numbers in the variable "n". How, I was not sure of.
So it matches strings of type "a1b2c3", but a problem here is that it also matches strings of type "a1b". Could anyone suggest a solution to deal with this problem?
Also, is there any other regular expression that could solve this problem?

Do you know why "regular expressions" are called "regular"? :-)
That would be too long to explain, I'll just outline the way. To match a pattern (i.e. decide whether a given string is "valid" or "invalid"), a theoretical informatician would use a finite state automaton. That's an abstract machine that has a finite number of states; each tick it reads a char from the input and jumps to another state. The pattern of where to jump from particular state when a particular character is read is fixed. Some states are marked as "OK", some--as "FAIL", so that by examining state of a machine you can check whether your text is "valid" (i.e. a valid e-mail).
For example, this machine only accepts "nice" as its "valid" word (a pic from Wikipedia):
A set of "valid" words such a machine theoretically can distinguish from invalid is called "regular language". Not every set is a regular language: for example, finite state automata are incapable of checking whether parentheses in string are balanced.
But constructing state machines was a complex task, compared to the complexity of defining what "valid" is. So the mathematicians (mainly S. Kleene) noted that every regular language could be described with a "regular expression". They had *s and |s and were the prototypes of what we know as regexps now.
What does it have to do with the problem? The problem in subject is essentially non-regular. It can't be expressed with anything that works like a finite automaton.
The essence is that it should contain a memory cell that is capable to hold an arbitrary number (repetition count in your case). Finite automata and classical regular expressions can not do this.
However, modern regexps are more expressive and are said to be able to check balanced parentheses! But this may serve as a good example that you shouldn't use regexps for tasks they don't suit. Let alone that it contains code snippets; this makes the expression far from being "regular".
Answering the initial question, you can't solve your problem with using anything "regular" only. However, regexps could be aid you in solving this problem, as in tster's answer
Perhaps, I should look closer to tster's answer (do a "+1" there, please!) and show why it's not the "regular expression" solution. One may think that it is, it just contains print statement (not essential) and a loop--and loop concept is compatible with finite state automaton expressive power. But there is one more elusive thing:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1
x # <--- this one
$2;
}
The task of reading a string and a number and printing repeatedly that string given number of times, where the number is an arbitrary integer, is undoable on a finite state machine without additional memory. You use a memory cell to keep that number and decrease it, and check for it to be greater than zero. But this number may be arbitrarily big, and it contradicts with a finite memory available to the finite state machine.
However, there's nothing wrong with classical pattern /([abc]*){5}/ that matches something "regular" repeated fixed number of times. We essentially have states that correspond to "matched pattern once", "matched pattern twice" ... "matched pattern 5 times". There's finite number of them, and that's the gist of the difference.

how about:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1 x $2;
}

Answering your question directly:
No, regular expressions match text and don't print anything, so there is no way to do it solely using regular expressions.
The regular expression you gave will match one string/number pair; you can then print that repeatedly using an appropriate mechanism. The Perl solution from #tster is about as compact as it gets. (It doesn't use the names that you applied in your regex; I'm pretty sure that doesn't matter.)
The remaining details depend on your implementation language.

Nope, this is your basic 'trick question' - no matter how you answer it that answer is wrong unless you have exactly the answer the interviewer was trained to parrot. See the workup of the issue given by Pavel Shved - note that all invocations have 'not' as a common condition, the tool just keeps sliding: Even when it changes state there is no counter in that state
I have a rather advanced book by Kenneth C Louden who is a college prof on the matter, in which it is stated that the issue at hand is codified as "Regex's can't count." The obvious answer to the question seems to me at the moment to be using the lookahead feature of Regex's ...
Probably depends on what build of what brand of regex the interviewer is using, which probably depends of flight-dynamics of Golf Balls.

Nice answers so far. Regular expressions alone are generally thought of as a way to match patterns, not generate output in the manner you mentioned.
Having said that, there is a way to use regex as part of the solution. #Jonathan Leffler made a good point in his comment to tster's reply: "... maybe you need a better regex library in your language."
Depending on your language of choice and the library available, it is possible to pull this off. Using C# and .NET, for example, this could be achieved via the Regex.Replace method. However, the solution is not 100% regex since it still relies on other classes and methods (StringBuilder, String.Join, and Enumerable.Repeat) as shown below:
string input = "aa67bc54c9";
string pattern = #"([a-z]+)(\d+)";
string result = Regex.Replace(input, pattern, m =>
// can be achieved using StringBuilder or String.Join/Enumerable.Repeat
// don't use both
//new StringBuilder().Insert(0, m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToString()
String.Join("", Enumerable.Repeat(m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToArray())
+ Environment.NewLine // comment out to prevent line breaks
);
Console.WriteLine(result);
A clearer solution would be to identify the matches, loop over them and insert them using the StringBuilder rather than rely on Regex.Replace. Other languages may have compact idioms to handle the string multiplication that doesn't rely on other library classes.
To answer the interview question, I would reply with, "it's possible, however the solution would not be a stand-alone 100% regex approach and would rely on other language features and/or libraries to handle the generation aspect of the question since the regex alone is helpful in matching patterns, not generating them."
And based on the other responses here you could beef up that answer further if needed.

Is it feasible to ascribe pronunciations to distinct source code concepts?

I frequently tutor fellow students in programming, most often in C++ or Java.
It is uniquely aggravating to try to verbally convey the essential syntax of a C++ expression. The speaker must give either an idiomatic translation into English, or a full specification of the code in verbal longhand, using explicit yet slow terms such as "opening parenthesis", "bitwise and", et cetera. Neither of these solutions is optimal.
In C++, there is a finite set of keywords—63—and operators—54, discounting named operators and treating compound assignment operators and prefix versus postfix auto-increment and decrement as distinct. There are just a few types of literal, a similar number of grouping symbols, and the semicolon. Unless I'm utterly mistaken, that's about it.
Would it not then be feasible to ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised.

Instead of creating new "words" to describe them, for things such as "include" you could simply prefix it with "keyword" when saying it aloud. You could use words/phrases commonly known to say other parts as well. As with any new programmer, you have to literally describe everything anyway, so I don't think that requires special attention. I think creating new words is the harder method...
So, for example:
#include <iostream>;
int main()
{
if (1 < 2)
return 1;
else
return 0;
}
Could be read out as:
(keyword) include iostream new-line
(keyword) int main no params start
block if number 1 (operator) less than
number 2 new-line (keyword) return
number 1 new-line (keyword) else
new-line (keyword) return number 0 end
block
Treat words in () as optional descriptive words, most likely to be used in more complex code. You could use the word 'literal' if you want them to actually write the descriptive word. For example
(keyword) if literal number (operator)
less than literal keyword
becomes
if (number < keyword)
Other words could be given defined meanings as well, such as 'split-line' when you want them to continue on the next line, without closing any currently open parenthesis, etc.
I personally find this method quite simple to use and easy to teach. YMMV, as always.
Of course, this doesn't solve the internationalisation issue, but at worst, would result in 'new words' being used in the non-English languages, which is no worse than the proposed solution you offered.

As a blind developer, programming since I was 13, I found this question really interesting. First of all, as mentioned by other peple, learning a new language to be able to understand code is not a practical solution, as it would probably take longer to learn the spoken utterances as it would to learn the actual programming language.
Reading the question/answers two further points occured to me:
Firstly, you'd be surprised how important "thinking time" is. I have previously programmed in C/C++/Java and now use C# as my primary language, and consider myself very competant. But when I did a couple of projects in Python, I found the reduced punctuation robbed me of my "thinking time" - subconsciously, I was using the punctuation to digest what I'd just heard - fascinating... However, the situation is a bit different when it comes to identifiers, as these aren't well known by the listener - I personally find it hard to listen to code with acronym variables (RGXRatio, RGVRatio) as I don't have time to figure out what it means. On the flip side, hungarian notation and initial underscores makes code hard to listen to as the length of the variables (in terms of time taken to speak) is much longer than the more important operations being performed on those variables.
Another thing to consider is that the length of the audio stream is an end result, but not the root cause. The reason the audio is so long is because audio is a one-dimensional medium, whereas reading text is a 2d medium with the ability to jump around and skip past irelevant/familiar text. It wouldn't work for a face-to-face lecture, but what if there were keyboard commands for controlling the speech. In text documents my screen reader lets me jump to the next line, but what if this were adapted to the semantics of a programming language. some research, such as by T V Raman at Google, includes using different voices for syntax highlighting, and audio cues to mark metadata like capitals.
I know the original question specifically related to a lecture given to a class, but if like myself you have to listen to entire files of source code , I also find the structure of the code makes a huge difference. I personally read code like a story - left to right, top to bottom. so it's very hard to trace through unfamiliar code when it's written bottom-up.

So would it not then be feasible to simply ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised
Perhaps, but you've lost sight of your goal. The premise was that the person listening did not already know the language. If he does, we can simply say "include iostream" when we mean #include <iostream>, or "vector of int" when we mean std::vector<int>.
Your premise was that the person listening is not familiar enough with the language to understand what you read out loud unless you read out exactly what it says.
Now, inventing a whole new language just to describe the primitives that occur in your source code doesn't solve the problem. Instead, you still have to read out every syntactic token (with simpler, more "standardized" pronunciations, yes, but they still have to be read out loud), and the person listening still won't understand you, because if they don't know C++ well enough to understand "include iostream", they won't understand your standardized pronunciation either. And if you're going to teach them your pronunciation, why bother, when you could've just taught them to understand C++ syntax directly instead?
There's also the root problem that C++ code tends to consist of a lot of syntactic tokens. Take a line as simple as this:
std::vector<int> v;
I count 9 tokens. Not one of them can be omitted. If the person listening does not understand the code and syntax well enough to understand a high-level description such as "declare a vector of int, named v", then you'll have to read out all 9 tokens in some form. Even if you come up with simpler names than "namespace resolution operator" and "less than sign", you still have to list 9 token names. Which is a lot of work.
In short, no, I don't think it'd work. First, it's still too cumbersome, and second, it's presuming prior knowledge on the part of the person listening, when the motivation for this was that the person listening was a student without the prior knowledge that made it possible to understand a high-level description of the code.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js