how to delete all English words, except special punctuation, in R

I have a string in R:
data <- "conflict need resolved :< turned conversation exchange ideas richer environment one tricky concepts :D conflict always top business agendas :> maybe different ideas opinions different :)"
From this I want to remove all the words so that only the smileys remain. The output I am expecting is:
":< :D :> :)"
Is there any library or method in R for doing this easily?

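You can use [[:alnum:]] as a regexp pattern for the alphanumeric characters of a string. Deleting every alphanumeric character would also strip the D from :D, so match only whole words delimited by whitespace (the lookarounds need perl = TRUE):
s <- gsub("(?<!\\S)[[:alnum:]]+(?!\\S)", "", "conflict need resolved :< turned conversation exchange ideas richer environment one tricky concepts :D conflict always top business agendas :> maybe different ideas opinions different :) ", perl = TRUE)
trimws(gsub(" +", " ", s))
[1] ":< :D :> :)"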

Related

What is the best way to create control flow in templates?

I am working on making my own "language/template" that "compiles" to another "language" called mcfunction. The main reasoning for this being that mcfunction does not contain loops or lambdas, so I essentially just want to add those two features to it. I am doing this by creating my own file extension and having a program I run convert my own custom syntax into syntax that makes sense to the mcfunction "language".
I've been mostly successful but adding the looping system I want has been difficult, and I wanted to know if there was a better way to do it than I currently am. My dad suggested I use a template, but I have no idea how that works and looking up how to do it I couldn't really find anything that helps.
Basically, the syntax I want to implement is something like
[('foo','bar'),('baz','qux'),('quux','quz')](
say {1}
tellraw @a "{2}, {1}"
)
which should expand into
say foo
tellraw @a "bar, foo"
say baz
tellraw @a "qux, baz"
say quux
tellraw @a "quz, quux"
I need to replace every instance of this syntax in a giant string with the output shown, and ideally be able to escape single quotes and spread the input over multiple lines.
So
[
('foo','bar'),
('baz','qux'),
('quux','quz')
](
say {1}
tellraw @a "{2}, {1}"
)
should output the same thing.
I started working on a mess of a regex to handle this for me or break it down to help me, but my dad told me a templating engine might help and I couldn't figure out how to make that work so I came here for help. Thanks for reading this.
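For a syntax this small, a templating engine may be more than is needed; a couple of regexes plus a replace callback can do the expansion. The following is a minimal sketch, not a definitive implementation. The names expandLoops and parseTuples are hypothetical, the escape rule (a backslash before a single quote inside a value) is an assumption, and the block is assumed to end at the first line that starts with a closing parenthesis.

```javascript
// Hypothetical helpers; the \' escape rule is an assumption, not part of mcfunction.
const STRING = /'((?:\\.|[^'\\])*)'/g;                 // one single-quoted value
const TUPLE = /\(((?:'(?:\\.|[^'\\])*'|[^)'])*)\)/g;   // one (...) tuple; quoted values may contain ')'
const BLOCK = /\[((?:'(?:\\.|[^'\\])*'|[^\]'])*)\]\(\r?\n([\s\S]*?)\r?\n\)/g; // [tuples]( body )

// Turn "('foo','bar'),('baz','qux')" into [["foo","bar"],["baz","qux"]].
function parseTuples(listText) {
  return Array.from(listText.matchAll(TUPLE), (tuple) =>
    Array.from(tuple[1].matchAll(STRING), (str) => str[1].replace(/\\(.)/g, "$1"))
  );
}

// Replace every [tuples]( body ) block with one copy of the body per tuple.
function expandLoops(source) {
  return source.replace(BLOCK, (whole, listText, body) =>
    parseTuples(listText)
      .map((values) =>
        body.replace(/\{(\d+)\}/g, (placeholder, n) => {
          const value = values[Number(n) - 1];
          // Leave placeholders without a matching value untouched.
          return value === undefined ? placeholder : value;
        })
      )
      .join("\n")
  );
}

const input = "[\n('foo','bar'),\n('it\\'s','qux')\n](\nsay {1}\nsay {2}, {1}\n)";
console.log(expandLoops(input));
```

Because the tuple list is matched as a whole before it is split, it can sit on one line or be spread over several. Text outside a block is passed through unchanged.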

Extracting sentences using scan() in R

I've been told that I shouldn't use R to scan text (but I have been doing so, anyway, pending the acquisition of other skills) and encountered a problem that has confused me sufficiently to retreat to these fora. Thanks for any help, in advance.
I'm trying to store a large amount of text (e.g., a short story) as a vector of strings, each of which is a separate sentence. I've been doing this using the scan() function, but I am encountering two basic problems: (1) scan() only seems to allow a single separating character, whereas sentences can obviously end in multiple ways. I know how to mark the end of a sentence using regex (e.g. [!?\.]), but I don't know of a function in R that uses regular expressions to split text. (2) scan() seems to automatically regard a new line as a new field, whereas I want it to ignore new lines unless they coincide with the end of a sentence.
download.file("http://www.textfiles.com/stories/3lpigs.txt","threelittlepigs.txt")
threelittlepigs_s<-scan("threelittlepigs.txt",character(0),
sep=".",quote=NULL)
If I don't include the 'quote=NULL' option, scan() throws the warning that an EOF (end of field, I'm guessing) falls within a quoted string. This produces a handful of multi-line elements/fields, but pretty erratically. I can't seem to discern a pattern.
Sorry if this has been asked before. I'm sure there's an easy solution. I would prefer one that helps me make sense of why scan() isn't working the way I would expect, but if there are better tools to read text in R, please do let me know.

R has strong text-mining capability, with many useful packages, for example tm, rvest, stringi and others.
But here is a simple example of doing this almost completely in base R. I only use the %>% pipe from magrittr because I think it makes the code a bit more readable.
The specific answer to your question is that strsplit() accepts a regular expression, so you can split on several punctuation marks at once. In the example below I use "[\\.?!] ", meaning a period, question mark or exclamation mark, followed by a space. You may have to experiment.
Try this:
library("magrittr")
url <- "http://www.textfiles.com/stories/3lpigs.txt"
corpus <- url %>%
paste(readLines(url), collapse=" ") %>%
gsub("http://www.textfiles.com/stories/3lpigs.txt", "", .)
head(corpus)
z <- corpus %>%
gsub(" +", " ", .) %>%
strsplit(split = "[\\.?!] ")
z[[1]]
The results:
z[[1]]
[1] " THE THREE LITTLE PIGS Once upon a time "
[2] ""
[3] ""
[4] "there were three little pigs, who left their mummy and daddy to see the world"
[5] "All summer long, they roamed through the woods and over the plains,playing games and having fun"
[6] "None were happier than the three little pigs, and they easily made friends with everyone"
[7] "Wherever they went, they were given a warm welcome, but as summer drew to a close, they realized that folk were drifting back to their usual jobs, and preparing for winter"
[8] "Autumn came and it began to rain"
[9] "The three little pigs started to feel they needed a real home"
[10] "Sadly they knew that the fun was over now and they must set to work like the others, or they'd be left in the cold and rain, with no roof over their heads"
...etc

c++ char value + "-some words here-" but error C2110

I want a variable to contain "Hey you!".
In JavaScript we can write
var str = 'Hey' + 'you!';
In PHP we can write
$str = 'Hey'.'you!';
but in C++ neither + nor . combines them.
Any ideas? I believe it's a simple thing, but I really have no idea how to combine these in C++. Please help.

If I understood correctly, you just need
"Hey" "you"
(no punctuation in between); the compiler concatenates adjacent string literals. This works only for literals; to join variables, use std::string and +.
A note about the space: in all the samples the OP provided, the result is "Heyyou" with no space in between.
I just reproduced the OP's request, so adding a space in this answer would not match the requirement.
If that is not the real intention (that is, "Hey you" was wanted), then a space is required after Hey or before you.

replacing specific characters in between specific elements

I'd like to use a regular expression to replace a space in a string. The space in question is the only space between two elements in the string. The string itself, however, contains many more elements and spaces. So far I've tried
(<-)[\s]*?(->)
But that doesn't work. It is supposed to take
<-word anotherword->
and allow me to replace the space in it.
Since \s selects all spaces, and
(<-)[\s\S]*?(->)
selects all characters in between the <- and ->, I tried to reuse the expression, but for the spaces only.
I'm not so good at these expressions, and I can't for the life of me find an answer anywhere.
If anyone could just point me to the answer, that would be great. Thanks.

It's difficult to be sure what you want; please post some before-and-after examples and specify what language you are using.
But, it looks like (<-\S+)\s*(\S+->) should probably do it (deletes spaces).
If the <- and -> are NOT to be preserved, move them out of the parentheses, like so:
<-(\S+)\s*(\S+)->
Here's what it would look like in JavaScript:
var before = "Ten years ago a crack <-flegan esque-> unit was sent to prison by a military "
+ "court for a crime they didn't commit.\n"
+ "These men promptly escaped from a maximum security stockade to the "
+ "<-flargon 666-> underground.\n"
+ "Today, still wanted by the government, they survive as soldiers of fortune.\n"
+ "If you have a problem and no one else can help, and if you can find them, "
+ "maybe you can hire the <-flugen 9->.\n"
;
var after = before.replace (/(<-\S+)\s*(\S+->)/g, "$1$2");
alert (after);
Which yields:
Ten years ago a crack <-fleganesque-> unit was sent to prison by a military court for a crime they didn't commit.
These men promptly escaped from a maximum security stockade to the <-flargon666-> underground.
Today, still wanted by the government, they survive as soldiers of fortune.
If you have a problem and no one else can help, and if you can find them, maybe you can hire the <-flugen9->.
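If a span can hold more than one space, or the spaces should be replaced rather than deleted, one option is to match each whole <-...-> span and rewrite the whitespace inside it with a callback. This is only a sketch under those assumptions. The function name replaceSpacesInSpans and the sample text are made up for illustration.

```javascript
// Hypothetical helper: rewrites runs of whitespace, but only inside <-...-> spans.
function replaceSpacesInSpans(text, replacement) {
  // The outer pattern lazily matches one span at a time; the inner replace
  // touches only the text of that span. A function is used as the replacement
  // so that "$" in the replacement string is not treated specially.
  return text.replace(/<-[\s\S]*?->/g, (span) =>
    span.replace(/\s+/g, () => replacement)
  );
}

const sample = "a crack <-flegan esque-> unit and the <-flargon 666 b-> underground";
console.log(replaceSpacesInSpans(sample, "_"));
console.log(replaceSpacesInSpans(sample, ""));
```

Note that a stray <- with no closing -> would make the lazy match run on to the next ->, so this assumes the delimiters are always balanced.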

Regular expression for validating names and surnames?

Although this seems like a trivial question, I am quite sure it is not :)
I need to validate names and surnames of people from all over the world. Imagine a huge list of millions of names and surnames where I need to remove, as well as possible, any cruft I identify. How can I do that with a regular expression? If it were only English ones I think that this would cut it:
^[a-z -']+$
However, I need to support also these cases:
other punctuation symbols as they might be used in different countries (no idea which, but maybe you do!)
different Unicode letter sets (accented letter, greek, japanese, chinese, and so on)
no numbers or symbols or unnecessary punctuation or runes, etc..
titles, middle initials, suffixes are not part of this data
names are already separated by surnames.
we are prepared to force ultra rare names to be simplified (there's a person named '#' in existence, but it doesn't make sense to allow that character everywhere. Use pragmatism and good sense.)
note that many countries have laws about names so there are standards to follow
Is there a standard way of validating these fields I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?
I would be looking for something similar to the many "email address" regexes that you can find on google.

I sympathize with the need to constrain input in this situation, but I don't believe it is possible - Unicode is vast, expanding, and so is the subset used in names throughout the world.
Unlike email, there's no universally agreed-upon standard for the names people may use, or even which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.
Of course, you do need to sanitize or escape input, to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you determine first the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.

I'll try to give a proper answer myself:
The only punctuation marks that should be allowed in a name are full stop, apostrophe and hyphen. I haven't seen any other case in the list of corner cases.
Regarding numbers, there's only one case with an 8. I think I can safely disallow that.
Regarding letters, any letter is valid.
I also want to include space.
This would sum up to this regex:
^[\p{L} \.'\-]+$
This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.
So the validation code should be something like this (untested):
var name = nameParam.Trim();
if (!Regex.IsMatch(name, @"^[\p{L} \.'\-]+$"))
    throw new ArgumentException("nameParam");
name = name.Replace("'", "&#39;"); //&apos; does not work in IE
Can anyone think of a reason why a name should not pass this test or a XSS or SQL Injection that could pass?
Complete tested solution:
using System;
using System.Text.RegularExpressions;

namespace test
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            var names = new string[]{"Hello World",
                "John",
                "João",
                "タロウ",
                "やまだ",
                "山田",
                "先生",
                "мыхаыл",
                "Θεοκλεια",
                "आकाङ्क्षा",
                "علاء الدين",
                "אַבְרָהָם",
                "മലയാളം",
                "상",
                "D'Addario",
                "John-Doe",
                "P.A.M.",
                "' --",
                "<xss>",
                "\""
            };
            foreach (var nameParam in names)
            {
                Console.Write(nameParam + " ");
                var name = nameParam.Trim();
                // Letters, combining marks, apostrophe, space, full stop and hyphen only.
                if (!Regex.IsMatch(name, @"^[\p{L}\p{M}' \.\-]+$"))
                {
                    Console.WriteLine("fail");
                    continue;
                }
                // Encode the apostrophe so it cannot be used as an attack vector.
                name = name.Replace("'", "&#39;");
                Console.WriteLine(name);
            }
        }
    }
}

I would just allow everything (except an empty string) and assume the user knows what his name is.
There are 2 common cases:
You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.
In case (1), you can allow all characters because you're checking against a paper document.
In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".

I would think you would be better off excluding the characters you don't want with a regex. Trying to get every umlaut, accented e, hyphen, etc. will be pretty insane. Just exclude digits (but then what about a guy named "George Forman the 4th"?) and symbols you know you don't want, like @#$%^ or what have you. But even then, a regex only guarantees that the input matches the regex; it will not tell you that it is a valid name.
EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route:
s/[\<\>\"\'\%\;\(\)\&\+]//g;
"Secure Programming for Linux and Unix HOWTO" by David A. Wheeler, v3.010 Edition (2003)
v3.72, 2015-09-19 is a more recent version.

BTW, do you plan to only permit the Latin alphabet, or do you also plan to try to validate Chinese, Arabic, Hindi, etc.?
As others have said, don't even try to do this. Step back and ask yourself what you are actually trying to accomplish. Then try to accomplish it without making any assumptions about what people's names are, or what they mean.

I don’t think that’s a good idea. Even if you find an appropriate regular expression (maybe using Unicode character properties), this wouldn’t prevent users from entering pseudo-names like John Doe, Max Mustermann (there even is a person with that name), Abcde Fghijk or Ababa Bebebe.

You could use the following regex to validate two names separated by a space:
^[A-Za-zÀ-ú]+ [A-Za-zÀ-ú]+$
or just use:
[[:lower:]] = [a-zà-ú]
[[:upper:]] =[A-ZÀ-Ú]
[[:alpha:]] = [A-Za-zÀ-ú]
[[:alnum:]] = [A-Za-zÀ-ú0-9]

It's a very difficult problem to validate something like a name due to all the corner cases possible.
Corner Cases
Anything anything here
Sanitize the inputs and let them enter whatever they want for a name, because deciding what is a valid name and what is not is probably way outside the scope of whatever you're doing, given that the range of potential strange, yet legal, names is nearly infinite.
If they want to call themselves Tricyclopltz^2-Glockenschpiel, that's their problem, not yours.

A very contentious subject that I seem to have stumbled upon here. However, sometimes it's nice to head dear Little Bobby Tables off at the pass and send little Robert to the headmaster's office along with his semicolons and SQL comment lines --.
This regex in VB.NET includes regular alphabetic characters and various circumflexed European characters. However, poor old James Mc'Tristan-Smythe the 3rd will have to input his pedigree as Jim the Third.
<asp:RegularExpressionValidator ID="RegExValid1" Runat="server"
ErrorMessage="ERROR: Please enter a valid surname<br/>" SetFocusOnError="true" Display="Dynamic"
ControlToValidate="txtSurname" ValidationGroup="MandatoryContent"
ValidationExpression="^[A-Za-z'\-\p{L}\p{Zs}\p{Lu}\p{Ll}\']+$">

This one worked for me in JavaScript (up to three runs of ASCII letters, separated by a single space or hyphen):
^[a-zA-Z]+[\s-]?[a-zA-Z]+[\s-]?[a-zA-Z]+$
Here is the method:
function isValidName(name) {
  var found = name.search(/^[a-zA-Z]+[\s-]?[a-zA-Z]+[\s-]?[a-zA-Z]+$/);
  return found > -1;
}
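A pattern built on a-zA-Z rejects names such as João or 山田. If the JavaScript engine supports Unicode property escapes (the u flag), a Unicode-aware check might look like the sketch below. The function name isValidNameUnicode is hypothetical, and the allowed punctuation (space, full stop, apostrophe, hyphen) is an assumption borrowed from the C# answer above, not a standard.

```javascript
// Sketch only: letters and combining marks from any script, plus space,
// full stop, apostrophe and hyphen. The punctuation set is an assumption.
const NAME_PATTERN = /^[\p{L}\p{M}' .-]+$/u;

// Hypothetical helper: trims the input, then tests it against the pattern.
function isValidNameUnicode(name) {
  return NAME_PATTERN.test(name.trim());
}

for (const candidate of ["Jo\u00e3o", "\u5c71\u7530", "D'Addario", "John-Doe", "<xss>", "John8"]) {
  console.log(candidate, isValidNameUnicode(candidate));
}
```

As the other answers point out, passing this check only means the characters are plausible. It does not make the value safe to place in HTML or SQL without escaping.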

Steps:
first remove all accents
apply the regular expression
To strip the accents:
private static string RemoveAccents(string s)
{
s = s.Normalize(NormalizationForm.FormD);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.Length; i++)
{
if (CharUnicodeInfo.GetUnicodeCategory(s[i]) != UnicodeCategory.NonSpacingMark) sb.Append(s[i]);
}
return sb.ToString();
}
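For comparison, roughly the same accent stripping can be sketched in JavaScript, assuming an engine that supports String.prototype.normalize and Unicode property escapes. The function name removeAccents simply mirrors the C# method above.

```javascript
// Decompose to NFD so each accent becomes a separate combining character,
// then drop the nonspacing marks (Unicode category Mn).
function removeAccents(s) {
  return s.normalize("NFD").replace(/\p{Mn}/gu, "");
}

console.log(removeAccents("Jo\u00e3o"));
console.log(removeAccents("\u00dcnal"));
```

Letters that have no canonical decomposition, such as Ł, pass through unchanged, so this does not reduce every name to plain ASCII.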

This somewhat helps:
^[a-zA-Z]'?([a-zA-Z]|\.| |-)+$

This one should work
^([A-Z]{1}+[a-z\-\.\']*+[\s]?)*
Add some special characters if you need them.

We Keep Coding
