I have been struggling with trying to remove all quotation marks in an XML-file within specific tags in my Ruby on Rails project. The simple question is this: How do I remove all existing " if, and only if, they are within the description tag in the XML-file (using gsub)?
Example
<xml attribute="stuff"><name>Two inch thing (2")</name><description>This thing is really "awesome"></description></xml>
so that it becomes
<xml attribute="stuff"><name>Two inch thing (2")</name><description>This thing is really awesome></description></xml>
I have been struggling with regex for a few hours without getting anywhere.
I.e.
myxml_file.gsub(<regex matching quotation marks>, "")
This is part of a bigger problem. I use the gem "Ox" to parse XML files, loading them with Ox.load(myxml_file, mode: :hash), but the description elements hold CDATA, which Ox seems to ignore (it just sets them all to nil). So I do a gsub to remove the CDATA tags, but then some descriptions turn out to include quotation marks, which crashes the Ox load. So this problem could (preferably) be solved already in the Ox.load part, for example by telling it to ignore CDATA tags...
Edit Upon request:
I fetch the XML file (which is a product feed) from a URL. In this case it is gzipped, which I am quite sure does not affect the issue in question:
tmp_data = Net::HTTP.get(URI.parse(url))
gz = Zlib::GzipReader.new(StringIO.new(tmp_data))
data = gz.read
#feed = Ox.load(data, mode: :hash)
The product descriptions in this case look like this example (where I have added a " just for the sake of the issue):
<products><product><merchant_deep_link>https://www.sportlala.se/lopning-40y-edition-2-pack-thundercrus/22361/express</merchant_deep_link><display_price>SEK319</display_price><merchant_product_id>05353-392410-XS</merchant_product_id><merchant_image_url>https://www.sportlala.se/images/products/22361/1905353_392410_40y_Edition_2-Pack_Set_F.png</merchant_image_url><merchant_category></merchant_category><search_price>319</search_price><merchant_name>Sportlala SE</merchant_name><category_id>0</category_id><aw_deep_link>...</aw_deep_link><category_name></category_name><last_updated></last_updated><product_name>40y Edition 2-Pack Thunder/Crus</product_name><aw_product_id>24553291137</aw_product_id><aw_image_url>https://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=ssl%3Awww.sportlala.se%2Fimages%2Fproducts%2F22361%2F1905353_392410_40y_Edition_2-Pack_Set_F.png&feedId=35735&k=477d0110b807fbbbcddc9fb74c52fc30c401ca4a</aw_image_url><delivery_cost></delivery_cost><data_feed_id>35735</data_feed_id><description><![CDATA[I detta paket får du två av Craft's absolut bästa baslager jerseys. Dessa "jerseys" har samlat det bästa från Craft's kollektioner och har den absolut högsta kvalitén! Material: 100% Polyester]]></description><merchant_id>17150</merchant_id><currency>SEK</currency><store_price></store_price><language></language></product></products>
This will make the description=nil in the resulting hash from Ox, which I am quite certain is due to the CDATA wrapping in the tag (as it is always nil, whether or not there are quotation marks (") in it).
I removed the CDATA with a gsub (I have since deleted it, but it was something like .gsub("<description><![CDATA[", "<description>").gsub("]]</description>", "</description>")), which effectively removed the CDATA but then brought out the quotation marks issue.
So this problem can be solved either (preferably) at the "Ox load" level, through some configuration that I have not yet seen, or by a regexp on the " marks that extends over the entire text.
Code:
s = '<xml attribute="stuff"><name>Two inch thing (2")</name><description>This thing is really "awesome"></description></xml>'
t = s.gsub(/(<description>)(.*?)(<\/description>)/) do
open_tag, content, end_tag = $1, $2, $3
content = content.gsub(/"/, '')
[open_tag, content, end_tag].join
end
p s
p t
Output:
"<xml attribute=\"stuff\"><name>Two inch thing (2\")</name><description>This thing is really \"awesome\"></description></xml>"
"<xml attribute=\"stuff\"><name>Two inch thing (2\")</name><description>This thing is really awesome></description></xml>"
Limitations: This is very specific to the exact format of the XML. Many valid changes to the XML that do not change its meaning will break this code. For external use only; use only as directed. Stop taking this regular expression if serious side effects occur.
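For readers outside Ruby, the same idea (match the whole description element, then strip quotes only from its captured content) can be sketched in Python. This is a minimal sketch, not a replacement for a real XML parser: the function name strip_quotes_in_description is hypothetical, and re.DOTALL is an added assumption so that descriptions spanning several lines are also matched.

```python
import re

# Capture the opening tag, the content, and the closing tag separately.
# re.DOTALL is an assumption: it lets "." match newlines inside a description.
DESCRIPTION = re.compile(r"(<description>)(.*?)(</description>)", re.DOTALL)


def strip_quotes_in_description(xml_text):
    """Remove double quotes only between <description> and </description>."""
    def _clean(match):
        open_tag, content, close_tag = match.groups()
        return open_tag + content.replace('"', "") + close_tag

    return DESCRIPTION.sub(_clean, xml_text)


sample = ('<xml attribute="stuff"><name>Two inch thing (2")</name>'
          '<description>This thing is really "awesome"></description></xml>')
result = strip_quotes_in_description(sample)
print(result)
```

The same limitations apply: a description tag with attributes, or nested markup that contains the closing tag text, would not be handled.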
I am battling with regex and I can't figure it out.
I have a big database extracted from last.fm (www.lastfm.com).
The file is a .txt file of over 1.7 GB in which the columns of each line are delimited by "," (comma), and there are some characters messing up the reading into R. So far I have managed to understand where everything goes wrong: the main problem comes from " (quotation marks) inside other quotation marks.
To elucidate, here is an example of the .txt file when readLines is applied.
[1] "user,\"Method Man & Redman\",\"Da Rockwilder\",0,2012,2,10,8,0,41"
[2] "user,\"Method Man & Redman\",\"Y.O.U.\",0,2012,2,10,7,56,25"
[3] "user,\"Method Man & Redman\",\"Blackout\",0,2012,2,10,7,51,53"
[4] "user,\"Chuckie\",\"Who Is Ready To Jump (Club Mix)\",0,2012,2,10,7,40,12"
[5] "user,\"Opgezwolle\",\"Volle Kracht\",0,2012,2,10,7,36,31"
[6] "user,\"Opgezwolle\",\"Ut Is Wat Het Is\",0,2012,2,10,7,33,25"
Basically this becomes a data frame with 10 columns: username, "Artist", "Track", loved (0/1), year, month, day, hour, minute, second
The above example can easily be read without any problems but I get problems when something like this happens:
[1] "user,\"Fall Out Boy\",\"\"The Take Over, The Breaks Over\"\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12\" Remix\",0,2011,6,12,19,32,33"
In the first case, due to the doubled quotation marks, the comma in the track name splits it into two different columns, so instead of 10 columns I get 11.
In the second case, the 12" leaves the string "open", and it only closes when a similar case is found. When this happens, I lose several lines of the data frame.
What do I want as a solution? I want to remove all the " (quotation marks) except the ones that surround the name of the Artist and the name of the Track.
Output:
The output would have a total of four (4) " (quotation marks) in each line: "Artist" and "Track Name". So the output for the 2 lines that give me problems would be:
[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33"
I tried to use Regex with gsub and gstring but I can't get it to extract only the " marks that are in excess.
If this is too complicated, something that would extract all the " except the first 3 (quotation marks around Artist name and first quotation mark around Track name) and the last one (quotation mark at the end of Track name), might work for most of the cases (and I would do the rest manually).
I am assuming here that no Artist name contains quotation marks.
Any help would be appreciated and if you need any further explanation or data please let me know.
Use negative lookarounds to remove all the \" which are neither preceded nor followed by commas.
(?<!,)\\"(?!,)
DEMO
> x <- c('user,\"Fall Out Boy\",\"\"The Take Over, The Breaks Over\"\",0,2010,4,17,7,11,37', 'user,\"Gare du Nord\",\"I Want Love 12\" Remix\",0,2011,6,12,19,32,33')
> gsub("(?<!,)\\\"(?!,)", "", x, perl=T)
[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33"
Notice that there needs to be an extra backslash in the pattern argument, because backslashes are escape operators in both R and the regex-engine.
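As a cross-check outside R, the same lookaround pattern can be tried in Python, whose re module also supports negative lookbehind and lookahead. This is only a sketch on the two problem lines from the question; the names STRAY_QUOTE, cleaned and rows are made up for illustration, and parsing with the csv module is an added step to confirm that each line ends up with 10 fields.

```python
import csv
import re

# A quote is "stray" when it is neither preceded nor followed by a comma.
STRAY_QUOTE = re.compile(r'(?<!,)"(?!,)')

lines = [
    'user,"Fall Out Boy",""The Take Over, The Breaks Over"",0,2010,4,17,7,11,37',
    'user,"Gare du Nord","I Want Love 12" Remix",0,2011,6,12,19,32,33',
]

cleaned = [STRAY_QUOTE.sub("", line) for line in lines]

# Parse the cleaned lines as CSV to check the column count.
rows = list(csv.reader(cleaned))
for row in rows:
    print(len(row), row)
```

One caveat of this rule: a field-closing quote at the very end of a line is not followed by a comma, so it would be removed too. That does not occur here because every line ends with numeric columns.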
Character classes with alphanumeric and double quote and backreferences can do it:
gsub("([ 0-9a-zA-Z\"])(\\\")([ 0-9a-zA-Z\"])", "\\1\\3",test)
[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33"
Could also consider:
gsub("([ [:alpha:][:digit:]\"])(\\\")([ [:alpha:][:digit:]\"\"])",
"\\1\\3", test)
This basically removes any double-quote mark that is flanked on both sides by a character from a class that does not contain a comma. It would break down if there were spaces between your quote marks and the correct separators. The ?regex page describes your options for using character classes. The parentheses are the delimiters for backreferences: the first backref is '\\1' and refers to the character matched by the character class inside the first pair of parentheses: ([ [:alpha:][:digit:]\"]). By omitting the middle backreference from the replacement argument, the matched double quotes get eliminated.
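The backreference mechanics can be illustrated with a small Python sketch of the same character-class idea. It is an approximation rather than a line-for-line port: the quote itself is left uncaptured, so only two groups are needed, and the names FLANKED and remove_flanked_quotes are hypothetical.

```python
import re

# Group 1 and group 2 are the flanking characters; the quote between them
# is not captured, so the replacement r"\1\2" simply drops it.
FLANKED = re.compile(r'([ 0-9a-zA-Z"])"([ 0-9a-zA-Z"])')


def remove_flanked_quotes(line):
    return FLANKED.sub(r"\1\2", line)


lines = [
    'user,"Fall Out Boy",""The Take Over, The Breaks Over"",0,2010,4,17,7,11,37',
    'user,"Gare du Nord","I Want Love 12" Remix",0,2011,6,12,19,32,33',
]
cleaned = [remove_flanked_quotes(line) for line in lines]
for line in cleaned:
    print(line)
```

Because each match consumes both flanking characters, two stray quotes separated by a single character are not both removed in one pass: remove_flanked_quotes('a"b"c') returns 'ab"c'.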
I have to parse a string in VB.NET, which has the following structure:
1. records separated by new line
2. fixed number of fields per record, separated by comma
3. fields can be quoted (strings) or not quoted (other types of data: date, int, etc.)
4. comment fields (strings) can contain both new lines and commas
So, due to point 4, commas and new lines must be ignored as field / record separators if they fall between an odd and an even quote (e.g. between quotes 1 and 2 they are inside a comment field and must be ignored, but between quotes 2 and 3 they are field / record delimiters).
I can write manual parsing code for this, but I think a regex would be more reliable. However, I have very limited experience with regex.
Example string
(record 1)
10,"Test",10.1,,,"123"
(record 2)
20,"Test, has comma
and new line",,2.1,,"aaa"
So actual string is
10,"Test",10.1,,,"123"
20,"Test, has comma
and new line",,2.1,,"aaa"
EDIT:
I need to add more clarifications:
1. records can have more or fewer than 4 fields
2. fields can be empty
So an actual test input string can be
10,"Test",10.1,,,"123"
20,"Test, has comma
and new line",,2.1,,"aaa"
So apparently the problem should be split in two:
Extract records (where new line is not between quotes)
for each record, extract fields (where delimited by comma not between quotes)
How should I split the regex, (or have two regexes) to match this?
Thanks
I don't know how to eliminate the redundancy in the expression for each field, but the following appears to work for your example:
("[^"]*"|[^",\n]+),("[^"]*"|[^",\n]+),("[^"]*"|[^",\n]+),("[^"]*"|[^",\n]+)
If you use a repeating group, the match will only be retained for the last instance. If anyone knows how to get around this duplication, I'd be interested.
Update: If you know something about the type of each positional field (e.g. whether it's a quoted string, integer, float, etc.) you can of course adjust the regex accordingly.
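For comparison, the odd/even quote rule from the question can also be written as a small scanner instead of a regex, which avoids repeating the per-field expression and copes with empty fields and any number of fields. The following Python sketch is an illustration only, not VB.NET code: the function names split_outside_quotes and parse are hypothetical, and it assumes that quotes are never escaped inside a field.

```python
def split_outside_quotes(text, sep):
    """Split text on sep, ignoring any sep found between an odd and an even quote."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes  # toggle on every quote
            current.append(ch)
        elif ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse(text):
    """Step 1: split into records on newlines; step 2: split each record on commas."""
    records = [r for r in split_outside_quotes(text, "\n") if r]
    return [split_outside_quotes(r, ",") for r in records]


text = '10,"Test",10.1,,,"123"\n20,"Test, has comma\nand new line",,2.1,,"aaa"\n'
records = parse(text)
for record in records:
    print(record)
```

The surrounding quotes are kept in each field, so a later step can still tell quoted strings from unquoted values.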
Ok - this is well beyond my limited knowledge of regular expressions. We receive a report from a banking entity in a fixed-width text file format. Unfortunately their system exports page headers with the data file, and these must be removed before processing on our end. The page headers start and end with the same text but the content changes (dates and page numbers). A typical one looks like:
00007xxxxx LAST1,FIRST1 111111 20120930
ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16
MRK014 Report Date: 10/04/12
Acct# Name SH. Balance QTR (YYYYMMDD)
----------------------------------------------------------------------------------------------------
00007xxxxx LAST2,FIRST2 222222 20120930
So each header starts with "ABCD" (actually the name of the bank, just removed here for privacy) and ends with the row of -------------------.
What I need to get it down to is the customer data on two rows (00007xxxxx - those account numbers change per person).
So I need to select from the " ABCD" to the end of the "---" to remove that block of text.
Try this regex. This is Java code, but you can use the same pattern in your language:
str = str.replaceAll("ABCD((.*?)[\n\r])+?\\-+[\n\r]*", "");
Here str contains the data above. I assume lines are separated by \n.
To ensure you are removing the correct part of the report, I would go with a more specific regex pattern.
Use the regex pattern
(?<=[\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with empty string.
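As a quick check, the lookbehind pattern can be applied to the sample from the question in Python. This is a sketch under assumptions: the sample is retyped with \n line endings, header lines are assumed to start in the first column, the dash row is shortened, and the names HEADER, report and cleaned are made up for illustration.

```python
import re

# Same pattern as above: "ABCD EXPORT RPT" at the start of a line, then
# everything up to and including the row of dashes and the line break after it.
HEADER = re.compile(r"(?<=[\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]-+[\n\r]+")

report = (
    "00007xxxxx LAST1,FIRST1 111111 20120930\n"
    "ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16\n"
    "MRK014 Report Date: 10/04/12\n"
    "Acct# Name SH. Balance QTR (YYYYMMDD)\n"
    "--------------------------------------------------\n"
    "00007xxxxx LAST2,FIRST2 222222 20120930\n"
)

cleaned = HEADER.sub("", report)
print(cleaned)
```

Note that [^-]+ stops at the first hyphen, so the pattern relies on the header text itself containing no hyphen before the dash row.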
However, if your environment does not support regex lookbehind, then you have to use the pattern:
([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with the first group.
For example in JavaScript it would be:
str.replace(/([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+/g, "$1")