Replace special characters to make JSON API output valid - regex

I'm working in an R script with the Twitter REST API 1.1 (user_timeline.json) and collect a large number of tweets.
Unfortunately, the texts contain many special characters such as \n, ^ or a lone \. So far, I have been able to replace them with str_replace_all or gsub before importing the data via fromJSON (jsonlite package):
correctJSON <- function(string) {
  # drop any backslash that doesn't begin a valid \t, \n or \" escape
  string <- str_replace_all(string, pattern = perl('\\\\(?![tn"])'), replacement = " ")
  # blank out raw control characters and the caret
  string <- str_replace_all(string, pattern = "\n", replacement = " ")
  string <- str_replace_all(string, pattern = "\r", replacement = " ")
  string <- str_replace_all(string, pattern = "\\^", replacement = " ")
  return(string)
}
Now I have a string containing special characters like \xed or \xa0. When I try to import it via fromJSON(correctJSON(string)), I get the following error:
Error in parseJSON(txt) : lexical error: invalid bytes in UTF8 string.
uch sind.Mutig von bd. Seiten�������������������������������
(right here) ------^
As far as I can see, the tweet containing the problematic characters is this one:
[{\"created_at\":\"Fri Feb 07 18:35:02 +0000 2014\",\"id\":431858659656990721,\"id_str\":\"431858659656990721\",\"text\":\"RT #FHubersr: #peteraltmaier //die Schwarz-Grünen werden zeigen, daß sich Ökologie und Ökonomie vertragen und kein Widerspruch sind.Mutig v…\",\"source\":\"Twitter for iPhone\",\"truncated\":false,\"in_reply_to_status_id\":null,\"in_reply_to_status_id_str\":null,\"in_reply_to_user_id\":null,\"in_reply_to_user_id_str\":null,\"in_reply_to_screen_name\":null,\"user\":{\"id\":378693834,\"id_str\":\"378693834\"},\"geo\":null,\"coordinates\":null,\"place\":null,\"contributors\":null,\"retweeted_status\":{\"created_at\":\"Fri Feb 07 18:32:30 +0000 2014\",\"id\":431858022366064640,\"id_str\":\"431858022366064640\",\"text\":\"#peteraltmaier //die Schwarz-Grünen werden zeigen, daß sich Ökologie und Ökonomie vertragen und kein Widerspruch sind.Mutig von bd. Seiten\xed\xa0\xbd\xed\xb1\x8d\xed\xa0\xbd\xed\xb8\x8e\",\"source\":\"Twitter for iPhone\",\"truncated\":false,\"in_reply_to_status_id\":431845492579123201,\"in_reply_to_status_id_str\":\"431845492579123201\",\"in_reply_to_user_id\":378693834,\"in_reply_to_user_id_str\":\"378693834\",\"in_reply_to_screen_name\":\"peteraltmaier\",\"user\":{\"id\":2172292811,\"id_str\":\"2172292811\"},\"geo\":null,\"coordinates\":null,\"place\":null,\"contributors\":null,\"retweet_count\":3,\"favorite_count\":4,\"favorited\":false,\"retweeted\":false,\"lang\":\"de\"},\"retweet_count\":3,\"favorite_count\":0,\"favorited\":false,\"retweeted\":false,\"lang\":\"de\"}]
I have already tried a lot of things, but even after reading several threads here I cannot come up with a solution that replaces all problematic special characters.
Note: Oddly, when I import the single tweet via fromJSON directly, I do not get an error; but as soon as I import the correctJSON-processed string, it throws the error above. I still need correctJSON, though, because of the many \n occurrences...
PS: I only pasted the problematic tweet. Here you can see the whole output of my API call also containing this one: https://p.mehl.mx/?53c04753c247a48a#5w+HtSCYpcjRwSk0PdsP3P1w3u+Z22/f6GKMJRoW//8=
Thanks for your help!

Okay, I found a possible answer myself, which works for the first 5,000 tweets I have gathered so far:
correctJSON <- function(string) {
  # blank out anything that isn't a printable character (\xed, \n, ...)
  string <- str_replace_all(string, pattern = "[^[:print:]]", replacement = " ")
  # then drop any lone backslash that doesn't begin \t, \n or \"
  string <- str_replace_all(string, pattern = perl('\\\\(?![tn"])'), replacement = " ")
  return(string)
}
The regex [^[:print:]] takes care of non-printable characters like \xed and \n, and presumably also \U..... sequences. Only for a lone \ do you still need the second (Perl) regex.
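For example, a quick sanity check on a made-up string containing a raw newline and a lone backslash (note that with current stringr versions you would write regex(...) instead of the old perl(...) modifier):
library(stringr)
library(jsonlite)

bad <- "[{\"text\":\"line one\nwith a lone \\ backslash\"}]"
try(fromJSON(bad))           # fails: the raw newline and stray backslash are invalid JSON
fromJSON(correctJSON(bad))   # parses once both have been blanked out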
So it works for now, hopefully also for the many upcoming tweets I'll import. I'll edit if something unexpected happens.


Regex to find 4th value inside bracket

How can I read the 4th value (the one inside quotes, i.e. "vV0....") using regex in the condition below?
Update: Is it possible to first find the word "LaunchFileUploader" and then select the 4th value? If there are multiple instances of LaunchFileUploader in the file, just select the 4th value of the first occurrence. I'm attaching a screenshot of the file that needs to be searched (in the file the word is "LaunchFileUploader").
I tried the following, but it gives me the third value in group 1, while I need the 4th:
\bLaunchFileUploader\b(\:?.*?,){3}.*?\)
Match 1
Full match 11030-11428 LaunchFileUploader("ERM-1BLX3D04R10-0001", 1662, "2ecbb644-34fa-4919-9809-a5ff47594c2d", "8dZOPyHKBK...
Group 1. n/a "2ecbb644-34fa-4919-9809-a5ff47594c2d",
I am still looking for a solution to this. Any help is appreciated.
Depending on what's available to you, there are a couple of ways to do it.
Either way, this would work better if there were no newlines in the string, just a plain ("value1","value2","value3","value4") etc. It will still work, but you may need to clean up some newlines from the resulting string.
The easy way - use code for the hard part. Grab the inner string with:
(?<=\().*?(?=\))
This will get everything that's between the 2 parentheses (using positive lookarounds). In code, you could then split/explode this string on , and take the 4th item.
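For instance, a quick sketch in R with a made-up one-line input (any language with lookaround support and a split function works the same way):
x <- 'LaunchFileUploader("a", 1662, "b", "value4")'
inner <- regmatches(x, regexpr("(?<=\\().*?(?=\\))", x, perl = TRUE))  # text between the parens; regexpr takes the first ( ... ) pair, i.e. the first occurrence
trimws(strsplit(inner, ",")[[1]][4])                                  # the 4th item, still wrapped in quotes
# [1] "\"value4\""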
If you want to do it all in regex, you could use something along the lines of:
(?<=\()(?:.*?,){3}(.*?)(?=\))
This would a) match the entire contents of the parentheses and b) capture the 4th option in a capture group. To go even deeper:
(?<=\()(?:.*?,){3}\"(.*?)\"(?=\))
would capture the contents of the "" quotation marks only.
Some tools don't allow you to use lookarounds, if this is the case let me know and I'll see what other ways there are around it.
EDIT: I ran this in the browser's JS console; it definitely works.
EDIT 2: I see you've updated your question with the text you're actually searching in. This pattern accounts for the space and the newline character present in the copy/paste of the above text:
(?<=\(\")(?:.*?,\s?\n?){3}\"(.*?)\"(?=\))
See my second image for the test in console
This works for Python and PHP:
(?<=\")(.*)(?:\"\);)\Z
Demo for Python and PHP
For JavaScript, replace \Z with $ as follows (note the leading lookbehind also becomes a plain non-capturing group, since older JavaScript engines lack lookbehind):
(?:")(.*)(?:\"\);)$
Demo for JavaScript
NOTE: Be sure to look at the captured group, not the full match.
UPDATE:
Try this for your updated request:
"(.*)"(?:[\\);\] \/>}]*)$
Demo for updated input string
Note: all of the above regex patterns assume there is a line break after each comma.
Auto-generated Java Code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "\"(.*)\"(?:[\\\\);\\] \\/>\\}]*)$";
final String string = "\n"
    + "}$(document).ready( function(){ PathUploader\n"
    + "    (\"ERM-1BLX3D04R10-0001\", \n"
    + "    1662, \n"
    + "    \"1bff5c85-7a52-4cc5-86ef-a4ccbf14c5d5\", \n"
    + "\"vV0mX3VadCSPnN8FsAO7%2fysNbP5b3SnaWWHQETFy7ORSoz9QUQUwK7jqvCEr%2f8UnHkNNVLkJedu5l%2bA%2bne%2fD%2b2F5EWVlGox95BYDhl6EEkVAVFmMlRThh1sPzPU5LLylSsR9T7TAODjtaJ2wslruS5nW1A7%2fnLB%2bljZaQhaT9vZLcFkDqLjouf9vu08K9Gmiu6neRVSaISP3cEVAmSz5kxxhV2oiEF9Y0i6Y5%2f5ASaRiW21w3054SmRF0rq3IwZzBvLx0%2fAk1m6B0gs3841b%2fw%3d%3d\"); } );//]]>";

final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);

while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}

Python regex to remove extra characters from papers' DOIs

I am new to regex and I have a list of papers' DOIs. Some of the DOIs include extra characters or strings, and I want to remove all of those extras. Here is the sample data:
10.1038/ncomms3230
10.1111/hojo.12033
blog/uninews #invalid
article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+plosone%2FPLoSONE+%28PLOS+ONE+Alerts%3A+New+Articles%29
#want to extract 10.1371/journal.pone.0076852
utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+plosone%2 #invalid
10.1002/dta.1578
enhanced/doi #invalid
doi/pgen.1005204
doi:10.2135/cropsci2014.11.0791 # want to remove "doi:"
10.1126/science.aab1052
gp/about-springer
10.1038/srep14556
10.1002/rcm.7274
10.1177/0959353515592899
Now, some of the entries don't have DOIs at all. I want to replace those with "".
Here is the regex I came up with:
for doi in doi_lst:
    doi = re.sub(r"^[^10\.][^a-z0-9//\.]+", "", doi)
But it does nothing. I searched many other Stack Overflow questions but couldn't find one that fits my case. Kindly help me out here.
P.S. I am working with Python 3.
Assuming a DOI here is a substring starting with 10. plus digits, then /, then one or more word or . characters, you can first convert the strings using urllib.parse.unquote (to decode the percent-encoded sequences into literal characters) and then use re.search with the \b10\.\d+/[\w.]+\b pattern to extract the DOI from each list item:
import re, urllib.parse

doi_list = ["10.1038/ncomms3230", "10.1111/hojo.12033", "blog/uninews",
            "article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852? ",
            "utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+plosone%2",
            "10.1002/dta.1578", "enhanced/doi", "doi/pgen.1005204",
            "doi:10.2135/cropsci2014.11.0791", "10.1126/science.aab1052",
            "gp/about-springer", "10.1038/srep14556", "10.1002/rcm.7274",
            "10.1177/0959353515592899"]
new_doi_list = []
for doi in doi_list:
    doi = urllib.parse.unquote(doi)            # decode %2F etc. to literal characters
    m = re.search(r'\b10\.\d+/[\w.]+\b', doi)  # find the DOI-shaped substring
    if m:
        new_doi_list.append(m.group())
        print(m.group())  # DEMO
Output:
10.1038/ncomms3230
10.1111/hojo.12033
10.1371/journal.pone.0076852
10.1002/dta.1578
10.2135/cropsci2014.11.0791
10.1126/science.aab1052
10.1038/srep14556
10.1002/rcm.7274
10.1177/0959353515592899
To include empty items when there is no match, add an else: new_doi_list.append("") branch to the if in the code above.

Split string and also get delimiter in array

I want to split a string like this:
"Street:§§§§__inboundRow['Adress']['Street']__§§§§ und Postal: §§§§__inboundRow['Adress']['Postal']__§§§§ super"
My code in Groovy:
def parts = ret.split(/§§§§__inboundRow.+?__§§§§/)
So the array what I get is
["Street:", " und Postal: ", " super"]
But what I want is:
["Street:", "§§§§__inboundRow['Adress']['Street']__§§§§", " und Postal: ", "§§§§__inboundRow['Adress']['Postal']__§§§§", " super"]
How do I achieve this?
Try splitting on a positive lookahead. Since you want to retain the delimiters you use to split, a lookaround is probably the way to go here.
var str = "String1:§§§§__inboundRow[Test]__§§§§andString2:§§§§__inboundRow[Test1]__§§§§";
console.log(str.split(/(?=§§§§__inboundRow\[Test\d*\]__§§§§|and)/));
I don't know exactly which language you are using, but this should work anywhere you can split using a regex with lookahead (JavaScript certainly supports it). The pattern used to split was:
(?=§§§§__inboundRow\[Test\d*\]__§§§§|and)
This says to split when we can assert that what follows is either the text §§§§__inboundRow[Test\d*]__§§§§ or and.
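Since the question's sample has no literal and between the placeholders, a variant that splits both where a placeholder starts and where one ends keeps each delimiter as its own element. Here is a sketch in R (perl = TRUE enables the lookarounds); the same pattern should carry over to Groovy's split, since Java regexes support lookbehind as well:
ret <- "Street:§§§§__inboundRow['Adress']['Street']__§§§§ und Postal: §§§§__inboundRow['Adress']['Postal']__§§§§ super"
# split before each placeholder start and after each placeholder end
strsplit(ret, "(?=§§§§__inboundRow)|(?<=__§§§§)", perl = TRUE)[[1]]
# [1] "Street:"  "§§§§__inboundRow['Adress']['Street']__§§§§"  " und Postal: "
# [4] "§§§§__inboundRow['Adress']['Postal']__§§§§"  " super"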

Dealing with Spaces and NAs when Uniting Multiple Columns with tidyr

So, using the simple data frame below, I want to create a new column that holds all the days for each person, separated by a semicolon.
For example, for Doug it should look like: Monday; Wednesday; Friday
I would like to use tidyr's unite() for this, but when I use it I get Monday;;Wednesday;;Friday because of the NAs, which may also be blank spaces. Sometimes there are semicolons at the beginning and end as well. So I'm hoping there's a way to keep using unite, combined with a regular expression, so that I end up with the days separated by a single semicolon and no semicolons at the beginning or end.
I would also like to stick with tidyr, dplyr, stringr, etc.
Names<-c("Doug","Ken","Erin","Yuki","John")
Monday<-c("Monday"," "," ","Monday","Monday")
Tuesday<-c(" ","Tuesday","Tuesday"," ","Tuesday")
Wednesday<-c(" ","Wednesday","Wednesday","Wednesday"," ")
Thursday<-c(" "," "," "," ","Thursday")
Friday<-c(" "," "," "," ","Friday")
Days<-data.frame(Monday,Tuesday,Wednesday,Thursday,Friday)
Days<-Days%>%unite(BestDays,Monday,Tuesday,Wednesday,Thursday,Friday,sep="; ",remove=FALSE)
You can try:
# use NA (rather than " ") for the missing days so that na.omit can drop them
Names <- c("Doug", "Ken", "Erin", "Yuki", "John")
Monday <- c("Monday", NA, NA, "Monday", "Monday")
Tuesday <- c(NA, "Tuesday", "Tuesday", NA, "Tuesday")
Wednesday <- c(NA, "Wednesday", "Wednesday", "Wednesday", NA)
Thursday <- c(NA, NA, NA, NA, "Thursday")
Friday <- c(NA, NA, NA, NA, "Friday")
Days <- data.frame(Monday, Tuesday, Wednesday, Thursday, Friday)

# drop the NAs in each row, then paste the rest together with "; "
concat_str <- function(str) str %>% na.omit %>% paste(collapse = "; ")
Days$BestDaysConcat <- apply(Days[, c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")], 1, concat_str)
From getAnywhere("unite_.data.frame"), unite calls do.call("paste", c(data[from], list(sep = sep))) under the hood, and paste, as far as I know, provides no way to omit NAs unless you implement it manually.
Nevertheless, you can use a regular-expression approach with gsub from base R to clean up the resulting column:
gsub("^\\s;\\s|;\\s{2}", "", Days$BestDays)
# [1] "Monday" "Tuesday; Wednesday"
# [3] "Tuesday; Wednesday" "Monday; Wednesday"
# [5] "Monday; Tuesday; Thursday; Friday"
This removes either the ^\\s;\\s pattern or the ;\\s{2} pattern: the former handles a string that starts with a blank field (we remove the space together with the ;\\s that follows it); the latter handles blank fields in the middle and at the end of the string.
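As an aside: with tidyr 1.0 or newer, unite() itself has an na.rm argument, so the regex cleanup becomes unnecessary as long as the empty cells really are NA. A sketch built on the NA-based data frame above:
library(tidyr)

Days %>%
  unite(BestDays, Monday:Friday, sep = "; ", remove = FALSE, na.rm = TRUE)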

Remove from a string all except selected characters

I want to remove from a string all characters that are not digits, minus signs, or decimal points.
I imported data from Excel using read.xls, and it includes some strange characters. I need to convert these strings to numeric. I am not too familiar with regular expressions, so I'm looking for a simpler way to do the following:
excel_coords <- c(" 19.53380Ý°", " 20.02591°", "-155.91059°", "-155.8154°")
unwanted <- unique(unlist(strsplit(gsub("[0-9]|\\.|-", "", excel_coords), "")))
clean_coords <- gsub(do.call("paste", args = c(as.list(unwanted), sep = "|")),
                     replacement = "", x = excel_coords)
> clean_coords
[1] "19.53380" "20.02591" "-155.91059" "-155.8154"
Bonus if somebody can tell me why these characters have appeared in some of my data (the degree signs are part of the original Excel worksheet, but the others are not).
Short and sweet, thanks to a comment by G. Grothendieck:
gsub("[^-.0-9]", "", excel_coords)
From http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html: "A character class is a list of characters enclosed between [ and ] which matches any single character in that list; unless the first character of the list is the caret ^, when it matches any character not in the list."
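And since the end goal is numeric values, the cleaned strings can be converted in the same call:
as.numeric(gsub("[^-.0-9]", "", excel_coords))
# [1]   19.53380   20.02591 -155.91059 -155.81540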
It can also be done using strsplit, sapply and paste, indexing the characters to keep rather than the ones to drop:
excel_coords <- c(" 19.53380Ý°", " 20.02591°", "-155.91059°", "-155.8154°")
correct_chars <- c(0:9, "-", ".")
sapply(strsplit(excel_coords, ""),
       function(x) paste(x[x %in% correct_chars], collapse = ""))
[1] "19.53380" "20.02591" "-155.91059" "-155.8154"
gsub("(.+)([[:digit:]]+\\.[[:digit:]]+)(.+)", "\\2", excel_coords)
[1] "9.53380" "0.02591" "5.91059" "5.8154"