How to parse XML string (not file) in PIG (without load)

I haven't found a way to parse a string that holds a whole XML document into separate tuples. How can I do this?
Suppose we have an Avro file with the following fields:
{fieldname: id, fieldname: xml}
Xml structure:
<?xml version='1.0' encoding='UTF-8'?>
<response>
<name>Ghty</name>
<main>
<data>
<id>1</id>
<text>ABC mask</text>
<title>Some text</title>
</data>
<data>
<id>2</id>
<text>Second value</text>
<title>To</title>
</data>
<data>
<id>3</id>
<text>Evolving to</text>
<title>Hint 567</title>
</data>
</main>
</response>
When we load from an XML file, the input XML is clearly split into
parts according to the tag passed to the loader:
DEFINE XMLLoader org.apache.pig.piggybank.storage.XMLLoader('data');
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
xml = LOAD '$XPATH' using XMLLoader as (x:chararray);
DUMP xml;
(<data><id>1</id><text>ABC mask</text><title>Some text</title></data>)
(<data><id>2</id><text>Second value</text><title>To</title></data>)
(<data><id>3</id><text>Evolving to</text><title>Hint 567</title></data>)
xml_parse = FOREACH xml GENERATE
XPath(x, 'data/id') as (id:chararray),
XPath(x, 'data/text') as (text:chararray),
XPath(x, 'data/title') as (title:chararray);
DUMP xml_parse;
(1,ABC mask,Some text)
(2,Second value,To)
(3,Evolving to,Hint 567)
I want to do the same with XML held in a string, without the LOAD
operation. How can this be done when the XML documents sit in a
string field and have not been split for the later XPath step?
(<?xml version='1.0' encoding='UTF-8'?><response><name>Ghty</name><main><data><id>1</id><text>ABC mask</text><title>Some text</title></data><data><id>2</id><text>Second value</text><title>To</title></data><data><id>3</id><text>Evolving to</text><title>Hint 567</title></data></main></response>)
1. I tried this approach without success; I get only the first element from the XML string:
xml = LOAD 'xml_set.avro' using
org.apache.pig.piggybank.storage.avro.AvroStorage();
xml_parse = foreach xml generate
XPath($0, 'data');
DUMP xml_parse ;
(1,ABC mask,Some text)
2. I tried XPathAll, also without success; all values were put into one tuple:
xml = LOAD 'xml_set.avro' using
org.apache.pig.piggybank.storage.avro.AvroStorage();
xml_parse = foreach xml generate
XPathAll($0, 'data'),
XPathAll($0, 'data'),
XPathAll($0, 'data');
DUMP xml_parse ;
((1,ABC mask,Some text,2,Second value,To,3,Evolving to,Hint 567))
3. Then I tried XPathAll with full tag paths, but the result was a tuple of tuples. I need to split them into rows in the right order,
but don't know how.
xml = LOAD 'xml_set.avro' using
org.apache.pig.piggybank.storage.avro.AvroStorage();
xml_parse = foreach xml generate
XPathAll($0, 'data/id'),
XPathAll($0, 'data/text'),
XPathAll($0, 'data/title');
DUMP xml_parse ;
((1,2,3),(ABC mask,Second value, Evolving to),(Some text,To,Hint 567))
It seems some kind of pivot is needed here. The goal is to get:
(1,ABC mask,Some text)
(2,Second value,To)
(3,Evolving to,Hint 567)
Of course, I could store all the XML documents from the Avro file in one big XML file and then load it with XMLLoader, but I assume that is a redundant step.
I'd appreciate any help or suggestions; I've been stuck on this for a long time.
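The pivot described above can be sketched outside Pig. The following is plain Python, not Pig Latin, and only illustrates the logic a custom UDF would need; the function names `data_rows` and `pivot` are hypothetical, and whether this can be wired into a particular Pig setup is an assumption that would need checking.

```python
import xml.etree.ElementTree as ET

def data_rows(xml_string):
    """Return one (id, text, title) tuple per <data> element."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_string.encode("utf-8"))
    return [
        (d.findtext("id"), d.findtext("text"), d.findtext("title"))
        for d in root.iter("data")
    ]

def pivot(ids, texts, titles):
    """Turn three parallel column tuples into a list of row tuples."""
    return list(zip(ids, texts, titles))

sample = (
    "<?xml version='1.0' encoding='UTF-8'?><response><name>Ghty</name><main>"
    "<data><id>1</id><text>ABC mask</text><title>Some text</title></data>"
    "<data><id>2</id><text>Second value</text><title>To</title></data>"
    "<data><id>3</id><text>Evolving to</text><title>Hint 567</title></data>"
    "</main></response>"
)
rows = data_rows(sample)
```

`data_rows` works directly on the XML string, while `pivot` starts from the three column tuples that the XPathAll attempt produced; both arrive at the same rows.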

Related

How to convert text to multiple xml in wso2

<payloadFactory media-type="xml" description="Select Sheets">
<format>
<Response>$1</Response>
</format>
<args>
<arg evaluator="xml" expression="get-property('name')"/>
</args>
</payloadFactory>
<script language="js"><![CDATA[var csv = mc.getPayloadXML();
var lines = (csv + "").split("\n");
// index 1 skips the first line; stop before lines.length to stay in bounds
for (var l = 1; l < lines.length; l++) {
var cells = (lines[l] + "").split(";");
}
]]></script>
I am trying to read data from Excel through the ESB. I do get output, but not in the form I need. Retrieving data from multiple Excel sheets or multiple Excel files works, and I build CSV from the retrieved data. From that CSV I need to build multiple XML payloads and then insert them into a database. How can I convert the CSV into multiple XML documents?
Sheets may contain 3, 4, or 5 columns, and the child nodes of the XML need to be formed accordingly.
Please let me know.

You can use the Data Mapper mediator to construct an XML payload from the CSV file. Please refer to the examples for CSV and XML transformations.
For example, you can create an XML payload that depends on the number of columns in the CSV sheet:
<rows>
<row>
<col1>value1</col1>
<col2>value2</col2>
<col3>value3</col3>
</row>
<row>
<col1>value4</col1>
<col2>value5</col2>
<col3>value6</col3>
</row>
</rows>
Once you have created the XML payload, you can use the Iterate mediator or the ForEach mediator to iterate through each sub-element (i.e. each 'row' element) for a given XPath (e.g. xpath="//row") and execute DB queries for each one.
If you are using a data service to execute the insert queries, you can use the Iterate mediator, which calls the data service for each XML sub-element. Otherwise, you can use the ForEach mediator together with the DBReport mediator to execute an insert query for each XML sub-element.
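To make the target shape concrete, here is a minimal sketch in plain Python of turning CSV lines with a varying number of columns into the rows/row/colN layout shown above. It is not WSO2 Data Mapper configuration; the function name `csv_to_rows_xml` is hypothetical, and the semicolon delimiter is assumed from the `split(";")` call in the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_rows_xml(csv_text, delimiter=";"):
    """Build <rows><row><col1>..</col1>..</row>..</rows> from CSV text."""
    rows = ET.Element("rows")
    for record in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        if not record:
            continue  # skip blank lines
        row = ET.SubElement(rows, "row")
        # One <colN> child per cell, so 3-, 4- or 5-column lines all work.
        for index, value in enumerate(record, start=1):
            ET.SubElement(row, "col%d" % index).text = value
    return ET.tostring(rows, encoding="unicode")

payload = csv_to_rows_xml("value1;value2;value3\nvalue4;value5\n")
```

Each `row` element here corresponds to one unit that the Iterate or ForEach mediator would process.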

Retrieve event logs that contain a specific string in the Data tag

I have developed an MFC application that reads Windows event logs from an event log (EVTX) file and parses them for display in the application.
To read the log file, I use an XPath query to retrieve specific event records from a file that contains 40000 records.
A sample log record looks like this:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Error_Log"/>
<EventID Qualifiers="20225">6002</EventID>
<Level>4</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2018-05-31T10:37:16.000000000Z"/>
<EventRecordID>11679958</EventRecordID>
<Channel>Application</Channel>
<Security/>
</System>
<EventData>
<Data>16:07:16.339:(A)[app.exe] [scan] m_id = [1254]</Data>
<Binary>31363A30373A31362E3333393A2841295B7275706170702E6578655D205B5363616E5D206D5F6964203D205B313235345D</Binary>
</EventData>
</Event>
Here I want to retrieve only those log records whose <Data> tag contains the substring m_id. To achieve this I tried the query below:
LPWSTR Query = _T("Event/EventData[Data(Data='m_id')]");
EVT_HANDLE Results = EvtQuery(NULL, Path, Query, EvtQueryFilePath | EvtQueryForwardDirection);
But I am not able to retrieve any records, even though the string m_id is present in the input log file, as shown in the sample above.

You should be able to do this in XPath by using contains(), which has been part of the language since XPath 1.0.
Full Events:
/Event[EventData/Data[contains(text(),'m_id')]]
/Event[EventData/Data[contains(string(),'m_id')]]
Data Only:
/Event/EventData/Data[contains(string(),'m_id')]
/Event/EventData/Data[contains(text(),'m_id')]
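The same filter can be sketched in Python to show what the expressions select. This is an illustration only, under several assumptions: the events have already been exported as XML under a hypothetical wrapper element, the function name `events_with_data` is made up, and the substring test is done in Python because ElementTree's limited XPath support has no contains(). Whether the XPath subset accepted by EvtQuery supports contains() is not confirmed by the source.

```python
import xml.etree.ElementTree as ET

EVENT_NS = "http://schemas.microsoft.com/win/2004/08/events/event"
NS = {"e": EVENT_NS}

def events_with_data(root, needle):
    """Return the Event elements that have a Data child containing needle."""
    matches = []
    for event in root.iter("{%s}Event" % EVENT_NS):
        for data in event.findall("e:EventData/e:Data", NS):
            # itertext() plays the role of string(): all descendant text joined.
            if needle in "".join(data.itertext()):
                matches.append(event)
                break
    return matches

exported = (
    '<Events>'
    '<Event xmlns="%(ns)s"><EventData><Data>[scan] m_id = [1254]</Data></EventData></Event>'
    '<Event xmlns="%(ns)s"><EventData><Data>nothing here</Data></EventData></Event>'
    '</Events>'
) % {"ns": EVENT_NS}
found = events_with_data(ET.fromstring(exported), "m_id")
```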

Python Writing a XML Child Element Without Namespace

I read several threads about XML namespaces before posting, but I still have trouble writing a child XML element to a file without its namespace.
Even though I registered the namespace with an empty prefix before parsing the file, "findall" does not return any elements. I verified that the namespace in the code matches the one in the XML file, and also printed root.tag.
If I remove the xmlns attribute from the tag completely, the code works, but I want to read the XML file as it is and write the elements to a file without the namespace. Could you please tell me what mistake I am making here?
This is the code I tried:
import xml.etree.ElementTree as ET
ET.register_namespace("","urn:iso:2012.tech.xsd.001.04") ##Making sure parse a xml file without namespace
tree = ET.parse("sample.xml")
root = tree.getroot()
print("%s : %s"%(root.tag, root.attrib))
out_handle = open("customer_header.xml","ab")
for elt in root.iter():
    all_ntry = elt.findall('Customer') ## Not returning all Customer elements, even though ET.register_namespace('',uri) mentioned before parsing
    for ele in all_ntry:
        print("Customer Block Found:%s"%ele)
        ele_tree = ET.ElementTree(ele)
        ele_tree.write(out_handle)
XML File(sample.xml):
<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns:xsi="http://www.company.org/2000/instance"
xmlns="urn:iso:2012.tech.xsd.001.04">
<BackToCustomer>
<CustGrup>
<Mid>000002</Mid>
<Date>2017-09-24T00:54:26</Date>
<Info>TEST</Info>
</CustGrup>
<Batch>
<Id>12345678</Id>
<Date>2017-09-22T13:54:26</Date>
<ListInfo>
<Id>
<Othr>
<Id>TEST_ListInfo</Id>
</Othr>
</Id>
</ListInfo>
<Details>
<Total>
<Count>5</Count>
<Amt>25.80</Amt>
</Total>
</Details>
<Customer>
<CustomerRef>ABC123</CustomerRef>
</Customer>
<Customer>
<CustomerRef>XYZ123</CustomerRef>
</Customer>
</Batch>
</BackToCustomer>
</Document>
I need to write a file that contains only the Customer elements, without the namespace.
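One possible approach, offered as a sketch rather than a definitive answer: `ET.register_namespace` only influences the prefix used when serializing, so parsed tags still carry the `{urn:...}` qualifier and a bare `findall('Customer')` finds nothing. Stripping the qualifier from every tag after parsing makes both the search and the output namespace-free. The helper names `strip_namespaces` and `customer_fragments` are hypothetical.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Remove the '{uri}' qualifier from every element tag, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root

def customer_fragments(xml_text):
    """Return each Customer element serialized without a namespace."""
    root = strip_namespaces(ET.fromstring(xml_text.encode("utf-8")))
    return [
        ET.tostring(customer, encoding="unicode").strip()
        for customer in root.iter("Customer")
    ]

doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Document xmlns="urn:iso:2012.tech.xsd.001.04"><BackToCustomer><Batch>'
    '<Customer><CustomerRef>ABC123</CustomerRef></Customer>'
    '<Customer><CustomerRef>XYZ123</CustomerRef></Customer>'
    '</Batch></BackToCustomer></Document>'
)
fragments = customer_fragments(doc)
```

Each string in `fragments` can then be written to the output file. Note that attribute names in a namespace would need the same treatment; the sample file has none.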

Not able to remove tags that have "xsi:nil" in them via XSLT

I have the following XML, which contains several tags with xsi:nil="true". These tags are effectively null. I have not been able to find or write an XSLT transformation that removes these tags and keeps the rest of the XML.
<?xml version="1.0" encoding="utf-8"?>
<p849:retrieveAllValues xmlns:p849="http://package.de.bc.a">
<retrieveAllValues>
<messages xsi:nil="true" />
<existingValues>
<Values>
<value1> 10.00</value1>
<value2>123456</value2>
<value3>1234</value3>
<value4 xsi:nil="true" />
<value5 />
</Values>
</existingValues>
<otherValues xsi:nil="true" />
<recValues xsi:nil="true" />
</retrieveAllValues>
</p849:retrieveAllValues>

The reason for the error you get
[Fatal Error] file2.xml:5:30: The prefix "xsi" for attribute "xsi:nil" associated with an element type "messages" is not bound.
is that no prefix named "xsi" is declared. You should declare it on the root element, for example:
<p849:retrieveAllValues xmlns:p849="http://package.de.bc.a"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<retrieveAllValues>
<messages xsi:nil="true" />
// other code...
Update
If you cannot change the XML document you receive from the web service, you could try the following approach (if it is acceptable for you):
1. Change your XSLT document to process XML documents without specifying element prefixes
2. Set the namespaceAware property of DocumentBuilderFactory to false
After this, your transformer shouldn't complain.
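As a small illustration of why the declaration matters, the Python sketch below shows a namespace-aware parser rejecting the undeclared prefix and, once xmlns:xsi is declared as suggested above, removing the nil elements. It uses Python's ElementTree rather than the Java or XSLT tooling from the question, and the function name `strip_nils` is hypothetical.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
NIL = "{%s}nil" % XSI

def strip_nils(xml_text):
    """Parse xml_text and drop every element marked xsi:nil="true"."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        for child in list(parent):  # copy: we remove while looping
            if child.get(NIL) == "true":
                parent.remove(child)
    return root

unbound = '<r><messages xsi:nil="true" /><value5 /></r>'
bound = '<r xmlns:xsi="%s"><messages xsi:nil="true" /><value5 /></r>' % XSI

try:
    strip_nils(unbound)
    unbound_failed = False
except ET.ParseError:
    unbound_failed = True  # the prefix "xsi" is not bound

cleaned = strip_nils(bound)
```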

It doesn't look like this is going to be possible in XSLT. Because of the missing namespace declarations, you have to parse the XML file with a non-namespace-aware parser, but none of the XSLT processors I've tried gets on well with such documents. They must rely on some information that is only present when parsing with namespace awareness enabled, even if the document in question doesn't actually contain any namespaced nodes.
So you'll have to approach it a different way, for example by traversing the DOM tree yourself. Since you say you're working in Java, here's an example using the Java DOM APIs (the example runs as-is in the Groovy console; to run it as Java, wrap it in a proper class definition and add whatever exception handling is required):
import java.io.File;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.w3c.dom.ls.*;
public void stripNils(Node n) {
    if(n instanceof Element &&
            "true".equals(((Element)n).getAttribute("xsi:nil"))) {
        // element is xsi:nil - strip it out
        n.getParentNode().removeChild(n);
    } else {
        // we're keeping this node, process its children (if any) recursively;
        // walk backwards because the NodeList is live and a removal shifts later indexes
        NodeList children = n.getChildNodes();
        for(int i = children.getLength() - 1; i >= 0; i--) {
            stripNils(children.item(i));
        }
    }
}
// load the document (NB DBF is non-namespace-aware by default)
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document xmlDoc = db.parse(new File("input.xml"));
stripNils(xmlDoc);
// write out the modified document, in this example to stdout
LSSerializer ser =
    ((DOMImplementationLS)xmlDoc.getImplementation()).createLSSerializer();
LSOutput out =
    ((DOMImplementationLS)xmlDoc.getImplementation()).createLSOutput();
out.setByteStream(System.out);
ser.write(xmlDoc, out);
On your original example XML this produces the correct result:
<?xml version="1.0" encoding="UTF-8"?>
<p849:retrieveAllValues xmlns:p849="http://package.de.bc.a">
<retrieveAllValues>
<existingValues>
<Values>
<value1> 10.00</value1>
<value2>123456</value2>
<value3>1234</value3>
<value5/>
</Values>
</existingValues>
</retrieveAllValues>
</p849:retrieveAllValues>
The empty lines are not actually empty, they contain the whitespace text nodes either side of the removed elements, as only the elements themselves are being removed here.

how to add root node tag in a XML document with XSLT

I am parsing an XML document in SSIS through the XML Source. The document does not have a root tag, so I am trying to add one through XSLT, but I get the error
[XML Task] Error: An error occurred with the following error message: "There are multiple root elements. Line 11, position 2.".
What XSL should be used to add the root element? Please help; this is urgent.
The XML source is below:
<organizational_unit>
<box_id>898</box_id>
<hierarchy_id>22</hierarchy_id>
<parent_box_id>0</parent_box_id>
<code>Team</code>
<description />
<name>CAPS Teams</name>
<manager_title />
<level>0</level>
</organizational_unit>
<organizational_unit>
<box_id>967</box_id>
<hierarchy_id>31</hierarchy_id>
<parent_box_id>0</parent_box_id>
<code>main</code>
<description />
<name>Protegent</name>
<manager_title />
<level>0</level>
<organizational_unit>
<box_id>968</box_id>
<hierarchy_id>31</hierarchy_id>
<parent_box_id>967</parent_box_id>
<code>19L</code>
<description>19L</description>
<name>19L</name>
<level>1</level>
<managers>
<manager>
<hierarchy_mgr_id>243</hierarchy_mgr_id>
<hierarchy_id>31</hierarchy_id>
<box_id>968</box_id>
<rep_id>19499</rep_id>
<unique_rep_id>100613948</unique_rep_id>
<first_name>Ed</first_name>
<last_name>Kill</last_name>
</manager>
</managers>
</organizational_unit>
<organizational_unit>
<box_id>1152</box_id>
<hierarchy_id>31</hierarchy_id>
<parent_box_id>967</parent_box_id>
<code>UNKNOWN_m</code>
<description>Unknown Reps</description>
<name>Unknown Reps</name>
<level>1</level>
</organizational_unit>
</organizational_unit>

Which XSLT processor do you use, and how do you use it? I usually don't suggest using string processing to construct XML, but if you have a fragment without a root element, then string concatenation of the form "<root>" + fragment + "</root>" is perhaps the easiest way to get a well-formed document.
XSLT can work with fragments, but how you do that depends on the XSLT processor or XML parser you use. For instance, .NET can use an XmlReader with XmlReaderSettings whose ConformanceLevel is set to Fragment; the result can then be loaded into an XPathDocument (for processing with XSLT 1.0 and XslCompiledTransform) and probably also into Saxon's XdmNode (although I am not sure I remember that correctly).
The stylesheet would then simply do
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<root>
<xsl:copy-of select="node()"/>
</root>
</xsl:template>
</xsl:stylesheet>
to wrap all top level nodes into a root element.
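The string-concatenation route mentioned above can be sketched as follows. This is Python rather than SSIS or .NET code, the function name `wrap_fragment` and the default tag name "root" are hypothetical, and the sketch assumes the fragment does not start with an XML declaration (which would no longer be at the start of the document once wrapped).

```python
import xml.etree.ElementTree as ET

def wrap_fragment(fragment, root_tag="root"):
    """Wrap a rootless XML fragment in a single element and parse it."""
    wrapped = "<%s>%s</%s>" % (root_tag, fragment, root_tag)
    # Parsing confirms the wrapped text is now a well-formed document.
    return ET.fromstring(wrapped)

fragment = (
    "<organizational_unit><box_id>898</box_id></organizational_unit>"
    "<organizational_unit><box_id>967</box_id></organizational_unit>"
)
root = wrap_fragment(fragment)
box_ids = [unit.findtext("box_id") for unit in root]
```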
