I want to implement the mssql outer apply functionality for parsing xml in an imperative language (c/c++, vba, java etc.), in principle any, the algorithm itself is interesting.
For example, I have a such xml:
<?xml version="1.0" encoding="utf-8"?>
<main>
<fio n="attr">
<name>name1</name>
<pasp>
<pi>pi1</pi>
<pi>pi2</pi>
</pasp>
</fio>
<fio n="attr">
<name>name2</name>
</fio>
<sub_node>
<cr>
<val>val1</val>
<val>val2</val>
<extra>
<ex>ex1</ex>
<ex>ex2</ex>
</extra>
</cr>
<cr>
<val>val3</val>
<val>val4</val>
</cr>
</sub_node>
</main>
On mssql I can parse it like this:
declare #sCmd nvarchar(max), #sFullPath varchar(max), #x xml
set #sFullPath = 'c:\in\test2.xml'
set #sCmd = N'select #xml = cast(t.data as xml) from OPENROWSET (BULK '+quotename(#sFullPath, N'''') + N', SINGLE_BLOB) t(data)'
exec sp_executesql #sCmd, N'#xml xml output', #x output;
select distinct
fio.n.value('name[1]', 'varchar(32)') 'fio.name'
, pi_.n.value('.', 'varchar(32)') 'pi_'
, val.n.value('.', 'varchar(5)') 'val'
, ex.n.value('.', 'varchar(10)') 'ex'
from #x.nodes('main') as main(n)
outer apply main.n.nodes('fio[#n="attr"]') as fio(n)
outer apply fio.n.nodes('pasp/pi') as pi_(n)
outer apply main.n.nodes('sub_node/cr') as cr(n)
outer apply cr.n.nodes('val') as val(n)
outer apply cr.n.nodes('extra/ex') as ex(n)
And I get the following result:
name1 pi1 val1 ex1
name1 pi1 val1 ex2
name1 pi1 val2 ex1
name1 pi1 val2 ex2
name1 pi1 val3 NULL
name1 pi1 val4 NULL
name1 pi2 val1 ex1
name1 pi2 val1 ex2
name1 pi2 val2 ex1
name1 pi2 val2 ex2
name1 pi2 val3 NULL
name1 pi2 val4 NULL
name2 NULL val1 ex1
name2 NULL val1 ex2
name2 NULL val2 ex1
name2 NULL val2 ex2
name2 NULL val3 NULL
name2 NULL val4 NULL
Tell me please how this can be implemented in an any imperative language using, for example, only loops, recursion, SelectNodes(), SelectSingleNode(), etc. Despite the fact that this is only an exemplary xml, for example, tags may be missing in some other file, and nested loops just don’t work correctly... And the xpath strings can be different, how to make a universal algorithm?
Related
I am quiet new to pig environment. I have tried to implement my pig script file in two ways.
I.
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,date,time,UPPER(keyword),display_site,placement,was_clicked,cpc;
val1 = foreach val generate campaign_id,date,time,TRIM(keyword),display_site,placement,was_clicked,cpc;
val2 = foreach val1 generate campaign_id,REPLACE(date, '-', '/'),time,keyword,display_site,placement,was_clicked,cpc;
dump val2;
i get error:
2016-09-29 02:45:40,826 INFO org.apache.pig.Main: Apache Pig version
0.10.0-cdh4.2.1 (rexported) compiled Apr 22 2013, 12:04:54 2016-09-29 02:45:40,827 INFO org.apache.pig.Main: Logging error messages to:
/home/training/training_materials/analyst/exercises/pig_etl/pig_1475131540824.log
2016-09-29 02:45:42,371 ERROR org.apache.pig.tools.grunt.Grunt: ERROR
1025: Invalid field
projection. Projected field [keyword] does not exist in schema:
campaign_id:chararray,date:chararray,time:chararray,org.apache.pig.builtin.upper_keyword_12:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int.
Details at logfile: /home/hduser/pig_etl/pig_1475131540824.log
But When i integrate the UPPER,TRIM and REPLACE in one statement then it works:
II.
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,REPLACE(date, '-', '/'),time,TRIM(UPPER(keyword)),display_site,placement,was_clicked,cpc;
dump val;
So, I just want someone to explain me that why I. method didn't work and what is the error message.
While you are applying TRIM in val1 there is nothing called "keyword" in val.
Note when you are applying any Function use alias so that error u can avoid..
or before creating a new relation it is always good to use describe so that the schema is clear to u..
Solution will be:
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,date,time,UPPER(keyword) as keyword,display_site,placement,was_clicked,cpc;
val1 = foreach val generate campaign_id,date,time,TRIM(keyword) as keyword,display_site,placement,was_clicked,cpc;
val2 = foreach val1 generate campaign_id,REPLACE(date, '-', '/') as date,time,keyword,display_site,placement,was_clicked,cpc;
dump val2;
I am using Fortran 90 to read a file that contains data in the following format
number# 125 var1= 2 var2= 1 var3: 4
.
.
.
.
number# 234 var1= 3 var2= 5 var3: 1
I tried the following command and works fine
read (2,*) tempstr , my_param(1), tempstr , my_param(2), tempstr , my_param(3)
Problem is when the numbers become larger and there is no space between string and number, i.e. the data looks as following:
number# 125 var1= 2 var2=124 var3: 4
I tried
read (2,512) my_param(1), my_param(2), my_param(3)
512 format('number#', i, 'var1=', i, 'var2=', i, 'var3:', i)
It reads all number as zero
I can't switch to some other language. The data set is huge, so I can't pre-process it. Also, the delimiters are not the same every time.
Can someone please help with the problem?
Thanks in advance
First up, 720 thousand lines is not too much for pre-processing. Tools like sed and awk work mostly on a line-by-line basis, so they scale really well.
What I have actually done was to convert the data in such a way that I could use namelists:
$ cat preprocess.sed
# Add commas between values
# Space followed by letter -> insert comma
s/ \([[:alpha:]]\)/ , \1/g
# "number" is a key word in Fortran, so replace it with num
s/number/num/g
# Replace all possible data delimitors with the equals character
s/[#:]/=/g
# add the '&mydata' namelist descriptor to the beginning
s/^/\&mydata /1
# add the namelist closing "/" character to the end of the line:
s,$,/,1
$ sed -f preprocess.sed < data.dat > data.nml
Check that the data was correctly preprocessed:
$ tail -3 data.dat
number#1997 var1=114 var2=130 var3:127
number#1998 var1=164 var2=192 var3: 86
number#1999 var1=101 var2= 48 var3:120
$ tail -3 data.nml
&mydata num=1997 , var1=114 , var2=130 , var3=127/
&mydata num=1998 , var1=164 , var2=192 , var3= 86/
&mydata num=1999 , var1=101 , var2= 48 , var3=120/
Then you can read it with this fortran program:
program read_mixed
implicit none
integer :: num, var1, var2, var3
integer :: io_stat
namelist /mydata/ num, var1, var2, var3
open(unit=100, file='data.nml', status='old', action='read')
do
read(100, nml=mydata, iostat=io_stat)
if (io_stat /= 0) exit
print *, num, var1, var2, var3
end do
close(100)
end program read_mixed
While I still stand with my original answer, particularly because the input data is already so close to what a namelist file would look like, let's assume that you really can't make any preprocessing of the data beforehand.
The next best thing is to read in the whole line into a character(len=<enough>) variable, then extract the values out of that with String Manipulation. Something like this:
program mixed2
implicit none
integer :: num, val1, val2, val3
character(len=50) :: line
integer :: io_stat
open(unit=100, file='data.dat', action='READ', status='OLD')
do
read(100, '(A)', iostat=io_stat) line
if (io_stat /= 0) exit
call get_values(line, num, val1, val2, val3)
print *, num, val1, val2, val3
end do
close(100)
contains
subroutine get_values(line, n, v1, v2, v3)
implicit none
character(len=*), intent(in) :: line
integer, intent(out) :: n, v1, v2, v3
integer :: idx
! Search for "number#"
idx = index(line, 'number#') + len('number#')
! Get the integer after that word
read(line(idx:idx+3), '(I4)') n
idx = index(line, 'var1') + len('var1=')
read(line(idx:idx+3), '(I4)') v1
idx = index(line, 'var2') + len('var3=')
read(line(idx:idx+3), '(I4)') v2
idx = index(line, 'var3') + len('var3:')
read(line(idx:idx+3), '(I4)') v3
end subroutine get_values
end program mixed2
Please note that I have not included any error/sanity checking. I'll leave that up to you.
I have a huge dictionary (70 million keys) that is structured as such:
mydict[key] = [val1, val2, val3]
I need to put val1 in a separate variable if I find a positive match for val2, and I am trying to do this without a for loop to save time. Basically, I want to do this:
for key, val in mydict.items():
val1, val2, val3 = val
if val2 == 'awesome match':
new_var = val1
but I want to avoid the for loop.
I want to export data from Stata into a csv-file:
outsheet "$dirLink/analysis.csv", replace comma
but I got the following error message:
factor variables and time-series operators not allowed
r(101);
I couldn't find the solution in the net. Thanks for help.
Here are the variable definitions:
tab0 str1 %9s
linkid float %10.0g
recid2 float %9.0g
recid1 float %10.0g
patient1 str28 %28s
patient2 str9 %9s
totwght int %10.0g
status str1 %9s
fname str13 %13s
sname str13 %13s
doby int %10.0g
cohort str2 %9s
res1 str27 %27s
res2 str29 %29s
residence str50 %50s
facility str22 %22s
maxwght int %10.0g
The syntax should be
outsheet using "$dirLink/analysis.csv", replace comma
That is, the word using is required. This is documented in the help: no need for an internet search. It is true that the error message is not transparent, because Stata is guessing that you are trying something other than using or a variable list, either of which could follow outsheet.
P.S. in my answer to your previous question I pointed out that the correct spelling is Stata: see http://www.stata.com/support/faqs/resources/statalist-faq/#spell That's still true.
I am getting queryset object which I want to convert it into only list.
[<yyyy : val1 val2 val3 val4 xxxx#xxx.com True True>, <yyyy : 2 None False True>]
I tried by
res.values_list()
But I am getting
[(val1,val2,val3,val4),(val11,val22,val33,val44)]
tuple(list(obj) for obj in res.values_list())