Velocity Reviews


huamin_chen@ymail.com 11-01-2012 04:41 AM

Xml search
 
Hi,
Can you please show me a way to quickly search a big XML file like this one, in a Visual C++ project?
http://dl.dropbox.com/u/40211031/List.zip

Many Thanks & Best Regards,
HuaMin

Jongware 11-01-2012 11:15 AM

Re: Xml search
 
On 01-Nov-12 5:41 AM, huamin_chen@ymail.com wrote:
> Hi,
> Can you please show the way to quickly search such big Xml file, in a Visual C++ project?
> http://dl.dropbox.com/u/40211031/List.zip


Did you generate these 1,000,002 lines of XML data, or is this from the
real world?

In case someone does not like downloading 57 megs of zipped file, or
expanding it into 722 megs of rather pointless example lines: here is an
abbreviated version:

<?xml version="1.0" encoding="UTF-16"?>
<Appdata>
<Data Attr0="00" Attr1="01" Attr2="02" Attr3="03" Attr4="04" Attr5="05"
Attr6="06" Attr7="07" Attr8="08" Attr9="09" Attr10="010" Attr11="011"
Attr12="012" Attr13="013" Attr14="014" Attr15="015" Attr16="016"
Attr17="017" Attr18="018" Attr19="019">Node_Number0</Data>
<Data Attr0="10" Attr1="11" Attr2="12" Attr3="13" Attr4="14" Attr5="15"
Attr6="16" Attr7="17" Attr8="18" Attr9="19" Attr10="110" Attr11="111"
Attr12="112" Attr13="113" Attr14="114" Attr15="115" Attr16="116"
Attr17="117" Attr18="118" Attr19="119">Node_Number1</Data>
.... (999,998 similar lines omitted) ...
<Data Attr0="9999980" Attr1="9999981" Attr2="9999982" Attr3="9999983"
Attr4="9999984" Attr5="9999985" Attr6="9999986" Attr7="9999987"
Attr8="9999988" Attr9="9999989" Attr10="99999810" Attr11="99999811"
Attr12="99999812" Attr13="99999813" Attr14="99999814" Attr15="99999815"
Attr16="99999816" Attr17="99999817" Attr18="99999818"
Attr19="99999819">Node_Number999998</Data>
<Data Attr0="9999990" Attr1="9999991" Attr2="9999992" Attr3="9999993"
Attr4="9999994" Attr5="9999995" Attr6="9999996" Attr7="9999997"
Attr8="9999998" Attr9="9999999" Attr10="99999910" Attr11="99999911"
Attr12="99999912" Attr13="99999913" Attr14="99999914" Attr15="99999915"
Attr16="99999916" Attr17="99999917" Attr18="99999918"
Attr19="99999919">Node_Number999999</Data>
</Appdata>

I'm assuming you *generated* this file by way of example. If not, well,
it's so extremely structured that you could throw it away and use a
simple algorithm to generate the "data" for any line immediately. (And
then it would not be "data", it would be a calculation.)
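
To illustrate the "calculation" remark: judging only from the abbreviated sample above, attribute j on line i looks like the decimal digits of i followed by the decimal digits of j, and the content looks like "Node_Number" followed by i. The following is a minimal C++ sketch under that assumption; the function names are made up for illustration and the pattern has not been checked against the full file.

```cpp
#include <string>

// Assumed pattern, inferred from the abbreviated sample only:
// AttrJ on line i is the digits of i followed by the digits of j.
std::string attr_value(unsigned line, unsigned attr)
{
    return std::to_string(line) + std::to_string(attr);
}

// Assumed pattern: the element content is "Node_Number" followed by i.
std::string node_content(unsigned line)
{
    return "Node_Number" + std::to_string(line);
}

// Rebuild one complete <Data> element for a given line number.
std::string make_line(unsigned line)
{
    std::string out = "<Data";
    for (unsigned j = 0; j < 20; ++j)
        out += " Attr" + std::to_string(j) + "=\"" + attr_value(line, j) + "\"";
    out += ">" + node_content(line) + "</Data>";
    return out;
}
```

If the real data follows this pattern, any lookup by line number needs no file at all.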

Anyway, XML is a poor choice for this particular set of data. Write a
program to convert it into a binary format, where each "line" uses 20
integers (one per attribute) and one string of a fixed length of 20 bytes.
With 4-byte integers that takes up no more than
1,000,000 x (20 * sizeof(int) + 20) ~ 100 MB of memory. Small
enough to be loaded into the RAM of today's computers.

What counts as "quickly" depends on what you want to search for. If, for
example, you may need to find a single digit anywhere in an attribute or
in the content (say, a '9' that can occur in the middle of 'Attr2="4593252"'),
you are better off storing everything as strings. You could also sort the
list on one or more of the Attr fields, and, if you prefer lookup speed
over memory usage, you could even keep a separately sorted list of pointers
to the 'actual' data for *each* of the attribute fields plus the data field.
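
For the "digit in the middle of a value" case, a rough C++ sketch of the string-based variant follows. TextRecord and find_in_attr are hypothetical names; the scan is a plain linear pass over all records, so it trades speed for the ability to match anywhere inside a value.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Text form of one <Data> element: 20 attribute values plus the content.
struct TextRecord {
    std::array<std::string, 20> attr;
    std::string content;
};

// Return the positions of all records in which `needle` occurs anywhere
// inside the given attribute. This is a linear scan over every record.
std::vector<std::size_t> find_in_attr(const std::vector<TextRecord>& recs,
                                      std::size_t attr_index,
                                      const std::string& needle)
{
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < recs.size(); ++i)
        if (recs[i].attr[attr_index].find(needle) != std::string::npos)
            hits.push_back(i);
    return hits;
}
```

A sorted list does not help with this kind of query, because a substring match does not follow the sort order of the whole value.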

[Jw]

wmedwardchan@gmail.com 11-02-2012 02:20 AM

Re: Xml search
 
Many thanks, Jong. Could you give the details as Visual C++ code, to search the binary format in the way you suggested?

Many Thanks & Best Regards,
HuaMin

Jongware 11-02-2012 09:22 AM

Re: Xml search
 
On 02-Nov-12 3:20 AM, wmedwardchan@gmail.com wrote:
> Many thanks Jong. Can I have the details in Visual C++ codes? To search the binary format in the way you suggested.


That would be

qsort (...);
result = bsearch (..);

-- you can look up the correct syntax for both qsort and bsearch
elsewhere. (It's beyond the scope of c.t.xml anyway.)
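
For reference, a minimal sketch of how the two standard library calls fit together, assuming a hypothetical fixed-size Record and a lookup on Attr0 only. The struct and the helper names are illustrative; qsort and bsearch themselves come from <cstdlib>.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical fixed-size record: 20 integer attributes plus the content.
struct Record {
    int  attr[20];
    char content[20];
};

// Comparison callback shared by qsort and bsearch: orders by Attr0.
// Written without subtraction so it cannot overflow.
int compare_attr0(const void* a, const void* b)
{
    const Record* ra = static_cast<const Record*>(a);
    const Record* rb = static_cast<const Record*>(b);
    return (ra->attr[0] > rb->attr[0]) - (ra->attr[0] < rb->attr[0]);
}

// Sort once, after loading.
void sort_by_attr0(Record* recs, std::size_t count)
{
    std::qsort(recs, count, sizeof(Record), compare_attr0);
}

// Binary search; returns a pointer to a matching record or nullptr.
const Record* find_attr0(const Record* recs, std::size_t count, int value)
{
    Record key = {};
    key.attr[0] = value;
    return static_cast<const Record*>(
        std::bsearch(&key, recs, count, sizeof(Record), compare_attr0));
}
```

The same comparison function must be used for both calls; bsearch only works on an array that is already sorted in that order.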

[Jw]

wmedwardchan@gmail.com 11-03-2012 08:10 AM

Re: Xml search
 
Thanks. But did you see my XML file above? qsort sorts a list of items. How does it apply to my XML file?

Many Thanks & Best Regards,
Edward Chan

huamin_chen@ymail.com 11-03-2012 03:27 PM

Re: Xml search
 
JW,
Furthermore, do you think it is feasible to load the very long list (shown above) into an array, as you suggested?

Many Thanks & Best Regards,
HuaMin

Jongware 11-05-2012 09:40 AM

Re: Xml search
 
> [..] did you see my Xml file above? Qsort is to sort a list of items.
> How is it applicable to my Xml file?

bsearch is a function for very quickly looking up any item, but the
items have to be sorted first.
That's also the reason you have to pick a single key to sort on -- the
key you want to look up 'quickly'. If you want to be able to look up
*any* value of the 20 attributes, plus the content string, make 21
sorted lists.
To be able to give a less generic answer, we'd need to know much more of
the data set and what data item(s) need to be looked up.
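
One way to read "make 21 sorted lists" without copying the data 21 times is to keep the records in place and build one sorted index of positions per key. The sketch below shows this for a single integer attribute, using std::sort and std::lower_bound in place of qsort and bsearch. Record, Index, build_index and find_by_attr are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record {
    std::int32_t attr[20];
    char         content[20];
};

// An index is a list of record positions, ordered by one attribute.
typedef std::vector<std::uint32_t> Index;

// Build the sorted index for attribute `k`; the records are not moved.
Index build_index(const std::vector<Record>& recs, int k)
{
    Index idx(recs.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        idx[i] = static_cast<std::uint32_t>(i);
    std::sort(idx.begin(), idx.end(),
              [&](std::uint32_t a, std::uint32_t b) {
                  return recs[a].attr[k] < recs[b].attr[k];
              });
    return idx;
}

// Binary search in the index for attribute `k`.
// Returns the position of a matching record, or -1 if there is none.
long find_by_attr(const std::vector<Record>& recs, const Index& idx,
                  int k, std::int32_t value)
{
    auto it = std::lower_bound(idx.begin(), idx.end(), value,
                               [&](std::uint32_t pos, std::int32_t v) {
                                   return recs[pos].attr[k] < v;
                               });
    if (it != idx.end() && recs[*it].attr[k] == value)
        return static_cast<long>(*it);
    return -1;
}
```

Each index costs 4 bytes per record, so 20 attribute indexes add roughly 80 MB for a million records. An index on the content string would work the same way with a string comparison.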


> Furthermore, do you think it is feasible to load the very long list
> (shown above) into an array, like what you said


Why would it not be feasible? It seems a very simple data array, with 20
integers and a string content (possibly of a limited length).
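
As a rough illustration of filling such an array, here is a sketch that turns the text of one <Data> element into a fixed-size record. It assumes the UTF-16 file has already been converted to a narrow string, that every element is well formed, and that all values fit in 32 bits; parse_data_element is a made-up helper, not part of any XML library.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>

struct Record {
    std::int32_t attr[20];
    char         content[20];
};

// Parse one <Data ...>text</Data> element into a Record.
// Returns false if an attribute or the closing tag is missing.
bool parse_data_element(const std::string& xml, Record& out)
{
    std::memset(&out, 0, sizeof out);
    for (int k = 0; k < 20; ++k) {
        // The trailing =" keeps Attr1 from matching inside Attr10.
        const std::string key = "Attr" + std::to_string(k) + "=\"";
        std::size_t start = xml.find(key);
        if (start == std::string::npos) return false;
        start += key.size();
        // Note: leading zeros are lost here ("010" becomes 10).
        out.attr[k] = static_cast<std::int32_t>(
            std::strtol(xml.c_str() + start, nullptr, 10));
    }
    std::size_t open = xml.find('>');        // end of the start tag
    std::size_t close = xml.find("</Data>");
    if (open == std::string::npos || close == std::string::npos || close < open)
        return false;
    std::string text = xml.substr(open + 1, close - open - 1);
    // Copy at most 19 characters so the field stays zero-terminated.
    std::strncpy(out.content, text.c_str(), sizeof out.content - 1);
    return true;
}
```

Calling this once per element and appending the result to a std::vector<Record> gives the in-memory array. If the leading zeros matter for your lookups, keep the values as strings instead.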

I advise you to ask on one of the comp.programming groups; preferably
NOT on one dealing with 'Windows', because the requirement for Visual C
is virtually unimportant here, but on one of the generic C/C++ groups.

[Jw]

wmedwardchan@gmail.com 11-06-2012 01:47 AM

Re: Xml search
 
Thanks a lot. What algorithm should I use to sort my sample XML file above? And which other group would be better for related questions about this issue?

huamin_chen@ymail.com 11-07-2012 03:10 AM

Re: Xml search
 
JW,
Any advice on this?

Manuel Collado 11-07-2012 08:43 AM

Re: Xml search
 
On 07/11/2012 4:10, huamin_chen@ymail.com wrote:
> On Thursday, November 1, 2012 12:41:11 PM UTC+8, huami...@ymail.com wrote:
>>
>> Can you please show the way to quickly search such big Xml file, in a Visual C++ project?
>>
>> http://dl.dropbox.com/u/40211031/List.zip
>>
>> Many Thanks & Best Regards,
>>
>> HuaMin

>
> JW,
> Any advice to this?


You could have a look at:

http://vtd-xml.sourceforge.net/

--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado




