Drop characters like line breaks (\r, \n) from text read from a file

I'd like to know if you can help me with this question. I've been working with a text file, and this is the code I used to import it from my directory:
f = open("C:/Users/alexis/Documents/constitucion/textodefinitivopropuestanuevaconstitucion.txt", "rb")
const_ch = f.read()
And this is the output when I inspect the variable const_ch:
b" \r\n\r\n\x0c\r\n \r\n\r\n\x0c\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\nCONSTITUCI\xc3\x93N POL\xc3\x8dTICA DE LA \r\nREP\xc3\x9aBLICA DE CHILE \r\n\r\n \r\n\r\n\x0c\r\n \r\n\r\n\x0c\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\nNosotras y nosotros, el pueblo de Chile, conformado \r\npor diversas naciones, nos otorgamos libremente esta \r\nConstituci\xc3\xb3n, acordada en un proceso participativo, \r\nparitario y democr\xc3\xa1tico.
I want to drop characters like \r and \n so that I can work with the text alone, but I tried the strip and rstrip methods and they didn't work. Could you explain what I'm doing wrong, or what I could do to drop those characters so I can work with the text?
I appreciate your help!
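For what it's worth, strip and rstrip only remove characters from the two ends of a string, so the \r\n pairs in the middle are left alone. Here is a minimal sketch of one way around that, assuming the file is UTF-8 (the \xc3\x93 bytes in the output suggest so). The sample bytes are a shortened stand-in for the real file, and the helper names collapse_whitespace and non_empty_lines are made up for illustration:

```python
# Shortened, hypothetical stand-in for the bytes returned by f.read().
raw = (
    b" \r\n\r\n\x0c\r\n \r\n"
    b"CONSTITUCI\xc3\x93N POL\xc3\x8dTICA DE LA \r\nREP\xc3\x9aBLICA DE CHILE \r\n"
)


def collapse_whitespace(data: bytes) -> str:
    """Decode UTF-8 bytes and squeeze every run of whitespace into one space."""
    text = data.decode("utf-8")
    # split() with no argument splits on any whitespace: \r, \n, \x0c, spaces.
    return " ".join(text.split())


def non_empty_lines(data: bytes):
    """Decode UTF-8 bytes and return the stripped, non-empty lines."""
    text = data.decode("utf-8")
    # splitlines() treats \r\n and the form feed \x0c as line boundaries.
    return [line.strip() for line in text.splitlines() if line.strip()]


print(collapse_whitespace(raw))  # CONSTITUCIÓN POLÍTICA DE LA REPÚBLICA DE CHILE
print(non_empty_lines(raw))
```

The first helper is handy if the line structure does not matter; the second keeps one entry per original line.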

Related

Delete lines that do not include only chars

I am trying to delete lines that contain anything other than letters of the alphabet and white-space. Numbers, commas, quotes, math symbols: any line with them has to be removed.
Input:
FISIOLOGIA UMANA
FISIOLOGIA UMANA
http://id.loc.gov/vocabulary/subjectSchemes/FI
Sepúlveda, Luis
La sirenetta Walt Disney
La sirenetta
CFIV007842
CFIV006619
Lubac, Henri : de
Roma
Expected output:
FISIOLOGIA UMANA
FISIOLOGIA UMANA
La sirenetta Walt Disney
La sirenetta
Roma
So far, I used :%g!:[A-Za-z]:d with Vim, that was supposed to do the trick. Curiously, it states that it matches every line (as expected) but it does not delete lines where non alphabetic chars are found. What is the reason behind such behavior? How could the issue be smartly approached?
:%g!:[A-Za-z]:d checks which lines contain at least one letter and then deletes the lines that do not match. Since every line contains a letter, no lines are deleted.
I think it would be easier to search for characters that you want to delete:
:g/[^a-zA-Z ]/d_
Your regex matches all lines. You need one that checks that there are only characters on the entire line:
:%g!:^[A-Za-z ]\+$:d
Note also that I included the space character since you seem to want to allow that too.
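Outside Vim, the same whole-line check can be sketched in Python. This is only an illustration of the idea behind ^[A-Za-z ]\+$, not a replacement for the Vim commands; the function name keep_alphabetic_lines is made up, and the sample data is the input from the question:

```python
import re

# A line qualifies only if it consists entirely of ASCII letters and spaces.
ONLY_LETTERS_AND_SPACES = re.compile(r"[A-Za-z ]+")


def keep_alphabetic_lines(lines):
    """Return the lines made up solely of A-Z, a-z and spaces."""
    return [line for line in lines if ONLY_LETTERS_AND_SPACES.fullmatch(line)]


sample = [
    "FISIOLOGIA UMANA",
    "FISIOLOGIA UMANA",
    "http://id.loc.gov/vocabulary/subjectSchemes/FI",
    "Sepúlveda, Luis",
    "La sirenetta Walt Disney",
    "La sirenetta",
    "CFIV007842",
    "CFIV006619",
    "Lubac, Henri : de",
    "Roma",
]

for line in keep_alphabetic_lines(sample):
    print(line)
```

Like the Vim pattern, this also drops empty lines and any line containing accented letters, because only the ASCII range is listed in the character class.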

Replace special characters to make JSON API output valid

I'm working in my R script with the Twitter REST API 1.1 (user_timeline.json). I collect a large number of tweets.
Unfortunately, the texts contain a lot of special characters like \n, ^ or single \. So far, I was able to replace them with str_replace_all or gsub before importing them via fromJSON (jsonlite package):
correctJSON <- function(string) {
  string <- str_replace_all(string, pattern = perl('\\\\(?![tn"])'), replacement = " ")
  string <- str_replace_all(string, pattern = "\n", replacement = " ")
  string <- str_replace_all(string, pattern = "\r", replacement = " ")
  string <- str_replace_all(string, pattern = "\\^", replacement = " ")
  return(string)
}
Now I have a string with special characters like \xed or \xa0. When I try to import it (via fromJSON(correctJSON(string))), I get the following error once the string has gone through correctJSON:
Error in parseJSON(txt) : lexical error: invalid bytes in UTF8 string.
uch sind.Mutig von bd. Seiten�������������������������������
(right here) ------^
The tweet containing the problematic characters is AFAICS:
[{\"created_at\":\"Fri Feb 07 18:35:02 +0000 2014\",\"id\":431858659656990721,\"id_str\":\"431858659656990721\",\"text\":\"RT #FHubersr: #peteraltmaier //die Schwarz-Grünen werden zeigen, daß sich Ökologie und Ökonomie vertragen und kein Widerspruch sind.Mutig v…\",\"source\":\"Twitter for iPhone\",\"truncated\":false,\"in_reply_to_status_id\":null,\"in_reply_to_status_id_str\":null,\"in_reply_to_user_id\":null,\"in_reply_to_user_id_str\":null,\"in_reply_to_screen_name\":null,\"user\":{\"id\":378693834,\"id_str\":\"378693834\"},\"geo\":null,\"coordinates\":null,\"place\":null,\"contributors\":null,\"retweeted_status\":{\"created_at\":\"Fri Feb 07 18:32:30 +0000 2014\",\"id\":431858022366064640,\"id_str\":\"431858022366064640\",\"text\":\"#peteraltmaier //die Schwarz-Grünen werden zeigen, daß sich Ökologie und Ökonomie vertragen und kein Widerspruch sind.Mutig von bd. Seiten\xed\xa0\xbd\xed\xb1\x8d\xed\xa0\xbd\xed\xb8\x8e\",\"source\":\"Twitter for iPhone\",\"truncated\":false,\"in_reply_to_status_id\":431845492579123201,\"in_reply_to_status_id_str\":\"431845492579123201\",\"in_reply_to_user_id\":378693834,\"in_reply_to_user_id_str\":\"378693834\",\"in_reply_to_screen_name\":\"peteraltmaier\",\"user\":{\"id\":2172292811,\"id_str\":\"2172292811\"},\"geo\":null,\"coordinates\":null,\"place\":null,\"contributors\":null,\"retweet_count\":3,\"favorite_count\":4,\"favorited\":false,\"retweeted\":false,\"lang\":\"de\"},\"retweet_count\":3,\"favorite_count\":0,\"favorited\":false,\"retweeted\":false,\"lang\":\"de\"}]
I already tried a lot of things but even after reading some threads here I cannot come up with a solution which can replace all problematic special characters.
Note: It's quite funny that when I want to import the single tweet via fromJSON, I do not get an error. But as soon as I import the correctJSON-string, it throws the error. But I need correctJSON because of the many \n appearances...
PS: I only pasted the problematic tweet. Here you can see the whole output of my API call also containing this one: https://p.mehl.mx/?53c04753c247a48a#5w+HtSCYpcjRwSk0PdsP3P1w3u+Z22/f6GKMJRoW//8=
Thanks for help!
Okay, I found a possible answer myself which works for the first 5000 tweets I gathered so far:
correctJSON <- function(string) {
  string <- str_replace_all(string, pattern = "[^[:print:]]", replacement = " ")
  string <- str_replace_all(string, pattern = perl('\\\\(?![tn"])'), replacement = " ")
  return(string)
}
The regex [^[:print:]] is suitable for special characters like \xed, \n and maybe also \U..... Only for single \ you'll need the second (perl) regex.
So it works for now, hopefully also for the many upcoming tweets I'll import. I'll edit if something unexpected happens.
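For readers not using R, a rough Python analogue of the [^[:print:]] idea might look like the sketch below. It is not the stringr code above: str.isprintable() is only approximately the same class as [:print:], and the payload and the name clean_json_bytes are made up for illustration. It assumes the raw response is available as bytes:

```python
import json


def clean_json_bytes(raw: bytes) -> str:
    """Decode raw bytes and blank out anything that is not printable."""
    # Invalid UTF-8 sequences become U+FFFD instead of raising an error.
    text = raw.decode("utf-8", errors="replace")
    return "".join(
        ch if ch.isprintable() and ch != "\ufffd" else " "
        for ch in text
    )


# Hypothetical, shortened payload: invalid bytes plus a raw newline in a string.
raw = b'[{"text": "Seiten\xed\xa0\xbd\xed\xb1\x8d line1\nline2"}]'

tweets = json.loads(clean_json_bytes(raw))
print(tweets[0]["text"])
```

Valid JSON escapes such as the two-character sequence \n are left untouched, because a backslash and the letter n are both printable; only raw control characters and undecodable bytes are replaced.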

Regex, can't get it

Seems easy but it doesn't work. I have something like this :
5224Reportage chez Ben ferme Ayrshire 2000 inc.2009-08-26T00:00:00-04:00En 2001, plutôt que de prendre le chemin de l’expansion, Ben ferme Ayrshire 2000 d’Hébertville au Lac-Saint-Jean a décidé d’ajouter le volet fromagerie à l’entreprise en misant sur la qualité.Revue/PLQ-2009-09/reportage.pdf5144Un deuxième Revue/PLQ-2014-07/production.pdf
From this I need an array containing :
Revue/PLQ-2009-09/reportage.pdf
Revue/PLQ-2014-07/production.pdf
I used :
$pdfResult = array();
preg_match_all('/^Revue.*pdf$/',$string, $pdfResult);
It returns nothing...
.* is greedy by default, so you need to make it non-greedy by adding the ? quantifier right after the *. You also need to drop the anchors ^ and $: the strings you want are not at the start or end of the input, which is why your pattern matches nothing.
preg_match_all('~Revue.*?\.pdf~',$string, $pdfResult);
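The difference between the greedy and the non-greedy version can be seen in a small Python sketch (Python's re module rather than PHP's preg_match_all; the input is a shortened copy of the string from the question):

```python
import re

# Shortened version of the input string from the question.
text = (
    "5224Reportage chez Ben ferme Ayrshire 2000 inc."
    "Revue/PLQ-2009-09/reportage.pdf5144Un deuxième "
    "Revue/PLQ-2014-07/production.pdf"
)

# Greedy: .* runs to the LAST ".pdf", so both paths end up in one match.
greedy = re.findall(r"Revue.*\.pdf", text)

# Non-greedy: .*? stops at the FIRST ".pdf" after each "Revue".
lazy = re.findall(r"Revue.*?\.pdf", text)

print(len(greedy))  # 1
print(lazy)  # ['Revue/PLQ-2009-09/reportage.pdf', 'Revue/PLQ-2014-07/production.pdf']
```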

How to remove CRLF conditionally from a text file preferably in Notepad ++

I've been looking for this one all day now, this is the closest useful ref I found.
My problem: huge files are imported from a closed system (can't be altered at the source) and need to be imported. These files are | separated and have a CRLF at the end of each line (until the last one).
Now they found it funny to include a new type that can contain text with CR and CRLF in the text (instead of <br>).
So what I need to do before I can process this file in our system, is to replace all CRLF and CR occurrences that are not preceded by a | to <br>, so that every line starts with a code like 000| ... 600|
Closest I've got in Notepad ++:
Find: (?<![\|])[\r\n]+$
Replace: <br>
The problem is that it will not give a <br> for every CRLF; it misses a CRLF after a CR... Other attempts to select the |CRLF too forget the CR altogether.
Any thoughts greatly appreciated. Do keep in mind that the file can be over 500MB (complicating things a bit)
Extract of the file:
000|709076|153943|11||1|CRLF
300|709076|153943|11|4|20000729||Majo509|CRLF
500|709076|153943|11|6|3-3BNME|20000729|||21.13|4||20120509|CRLF
600|709076|153943|11||SBV|7103||||20120509|CRLF
600|709076|153943|11||SBV|7105||||20120509|CRLF
600|709076|153943|11||SBV|7607||||20120509|CRLF
600|709076|153943|11||MC||EVALUATIEROOSTER NIET INGEVULD :CR
CRLF
------------------------------CR
CRLF
CRLF
Gezien U het evaluatierooster niet heeft ingevuld, blijft CR
CRLF
CRLF
|||20120509|CRLF
600|709076|153943|11||SBV|7517||||20120509|CRLF
000|709209|154072|9||1|Dne|LA1349|3100||L|20120509|CRLF
300|709209|154072|9|3|20HEM-AT20120509|CRLF
500|709209|154072|9|6|3-3BNME|20000908|||15.4|3||20120509|CRLF
600|709209|154072|9||SBV|7103||||20120509|CRLF
600|709209|154072|9||MC||AFSCHAFFING VAN DE EVOOR HET CR
CRLF
(DE) GEBOUW(EN) CR
CRLF
CR
CRLF
indien U huurder of gebruiker bent.|||20120509|CRLF
600|709209|154072|9||MC||DIEFSTAL CRLF
...
Required result: (rough copy paste job ;))
000|709076|153943|11||1|CRLF
300|709076|153943|11|4|20000729||Majo509|CRLF
500|709076|153943|11|6|3-3BNME|20000729|||21.13|4||20120509|CRLF
600|709076|153943|11||SBV|7103||||20120509|CRLF
600|709076|153943|11||SBV|7105||||20120509|CRLF
600|709076|153943|11||SBV|7607||||20120509|CRLF
600|709076|153943|11||MC||EVALUATIEROOSTER NIET INGEVULD :<BR><BR>---------------------<BR><BR><BR>Gezien U het evaluatierooster niet heeft ingevuld, blijft <BR><BR>||20120509|CRLF
600|709076|153943|11||SBV|7517||||20120509|CRLF
000|709209|154072|9||1|Dne|LA1349|3100||L|20120509|CRLF
300|709209|154072|9|3|20HEM-AT20120509|CRLF
500|709209|154072|9|6|3-3BNME|20000908|||15.4|3||20120509|CRLF
600|709209|154072|9||SBV|7103||||20120509|CRLF
600|709209|154072|9||MC||AFSCHAFFING VAN DE EVOOR HET <BR><BR>(DE) GEBOUW(EN) <BR><BR><BR><BR>indien U huurder of gebruiker bent.|||20120509|CRLF
600|709209|154072|9||MC||DIEFSTAL CRLF
Wow, this one fazed me for a little while...
It's tricky to do it in one pass.
The N++ constraint probably makes it tougher than it needs to be, but short of writing some code to do what you want it's a good way to go I guess.
While I'm not sure it's optimal, I had success with this combo.
Find:
([^|])\r([\r\n])*
Replace:
$1<br>
You need the $1 in the replace or you lose a character from your replaced lines - probably not what you want!
Ideally, you should look into some Perl (I'm no Perl advocate, other scripting languages handling regex are available...) or something similar to do this.
Edit:
Just a thought. This makes the assumption that there won't be sections of your file that contain |CRLF or |CR or |CRCR that are not 'real' line endings.
Edit: Scrapped my last suggestions - didn't work
As suggested by BunjiquoBianco, I think that this is not possible in one pass.
Would be much better if you could use awk. If you are using Windows, try http://gnuwin32.sourceforge.net/packages/gawk.htm
If awk is a viable option, re-ask the question and the awk nuts will probably suggest a one-liner from command prompt to parse the whole file.
awk is fast too - would give you a much faster transformation and can be included in other scripts more easily thereby cutting out any manual N++ process.
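If a scripting language is an option, the rule "replace every CR or CRLF that is not preceded by |" can be sketched in a few lines of Python (chosen here only for illustration; awk or Perl would do just as well). The sketch works on bytes and assumes the data fits in memory, which may not hold for a 500MB file without chunking. It also relies on the same assumption as above, that |CR and |CRLF only occur at real line endings. The names EMBEDDED_BREAK and embed_breaks are made up:

```python
import re

# A CR, optionally followed by LF, that is NOT directly preceded by "|".
EMBEDDED_BREAK = re.compile(rb"(?<!\|)\r\n?")


def embed_breaks(data: bytes) -> bytes:
    """Turn CR/CRLF inside a field into <br>; keep |CRLF record endings."""
    return EMBEDDED_BREAK.sub(b"<br>", data)


# Tiny hypothetical sample in the shape of the extract from the question.
sample = b"000|1|\r\n600|MC||TEXT :\r\r\nmore\r\n|||2012|\r\n"

print(embed_breaks(sample))
# b'000|1|\r\n600|MC||TEXT :<br><br>more<br>|||2012|\r\n'
```

Working on bytes avoids any guess about the file's text encoding, since only the CR, LF and | bytes are inspected.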

NSRegularExpression match is not working

I'm trying to replace some escaped unicode in an NSString. I haven't had any luck with the CFString functions, so I thought I would try regular expressions.
Here is the regex
NSRegularExpression *regexUnicode2 = [NSRegularExpression regularExpressionWithPattern:@"(\\u([0-9A-Fa-f]){4}){2}" options:0 error:&error];
Then I try to get matches using this
NSArray *twoEscapeArray = [regexUnicode2 matchesInString:selfCopy options:0 range:NSMakeRange(0, self.length)];
selfCopy is a mutable copy of the input string. Here is a piece of that string:
muestran al p\u00c3\u00bablico las encuadernaciones de las colecciones
reales adem\u00c3\u00a1s de otros objetos hist\u00c3\u00b3ricos en
relaci\u00c3\u00b3n con \u00c3\u00a9stas.La muestra,
considerada a nivel mundial como uno de los conjuntos ligatorios
hist\u00c3\u00b3ricos m\u00c3\u00a1s importantes, se completa con
obras de arte como armas, alfombras y relojes. Estos son objetos que
ayudan a entender la encuadernaci\u00c3\u00b3n como elemento
fundamental de la cultura de corte.Los fondos de la Real
Biblioteca, del Real Monasterio de San Lorenzo de El Escorial, del
Monasterio de Santa Mar\u00c3\u00ada la Real de las Huelgas de Burgos,
del Monasterio de las
Without proper conversion, these escaped unicode pairs are being treated as individual characters (each pair produces two characters) when I put them into a UIWebView.
This is how the raw JSON data is coded, and I haven't had any luck getting it to convert to Latin characters properly.
Anyway, the problem is that the array twoEscapeArray is nil after the match attempt. I'm not sure why.
You mean \u00c3\u00ba is getting converted to ú? That looks like the correct behavior to me. The real question is how those Unicode escapes got in there. It looks like the text was decoded incorrectly at some point (possibly when the NSString was created?), and what should have been the two-byte UTF-8 encoding of the letter ú (U+00FA, Latin Small Letter U With Acute) was decoded as two characters.
Try going back to where you created the NSString, this time specifying UTF-8 as the encoding.
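To illustrate the diagnosis, here is a small Python sketch (not Objective-C, and not a fix for the NSString code itself) showing how UTF-8 bytes that were misread as Latin-1 produce exactly these escape pairs, and how reversing the wrong step recovers the text. The function name repair_mojibake is made up for illustration:

```python
import json


def repair_mojibake(text: str) -> str:
    """Undo a UTF-8 -> Latin-1 mix-up by reversing the wrong decode step."""
    # Each character is turned back into the single byte it came from,
    # and the resulting byte string is then decoded as UTF-8.
    return text.encode("latin-1").decode("utf-8")


# The JSON escapes from the question decode to two characters per letter.
decoded = json.loads(r'"p\u00c3\u00bablico"')
print(len(decoded))              # 8
print(repair_mojibake(decoded))  # público
```

This only works on text that really went through that mix-up; a string that is already correct, or that contains characters outside Latin-1, will raise an error instead of being repaired. Fixing the encoding where the string is first created remains the cleaner route.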