The Anything-to-XML Converter

Da_Man

What do you guys say to this:

#!/bin/bash
# Wrap an arbitrary file in a single CDATA section.
if [ $# -lt 2 ]; then
    echo "Usage: bin2xml <infile> <outfile>"
    exit 1
fi
echo "<?xml version=\"1.0\"?><root><![CDATA[" > "$2"
cat "$1" >> "$2"
echo "]]></root>" >> "$2"

Random832

Fails on any input file containing the byte string "]]>".
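For what it's worth, the usual workaround for the "]]>" problem is to split the sequence across two CDATA sections. A minimal sketch in Python, assuming the input is already valid text (the function name `wrap_cdata` is made up for illustration):

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any "]]>" so it cannot end the section early."""
    # "]]>" becomes "]]" + close section + open new section + ">"
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


# The parser stitches the two sections back together into the original text.
doc = "<root>" + wrap_cdata("a ]]> b") + "</root>"
recovered = ET.fromstring(doc).text
print(recovered)
```

This only deals with the terminator. It does nothing for bytes that are not legal XML characters in the first place.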

vt_mruhlin

base 64 encode it!
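Roughly what that would look like, as a sketch only: the `<root>` element and the `encoding` attribute are made-up names, not part of any standard, and the helper names are hypothetical too.

```python
import base64
import xml.etree.ElementTree as ET


def bin_to_xml(data: bytes) -> bytes:
    """Wrap arbitrary bytes in a one-element XML document as base64 text."""
    root = ET.Element("root", encoding="base64")  # attribute name is an assumption
    root.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root)


def xml_to_bin(doc: bytes) -> bytes:
    """Recover the original bytes from a document made by bin_to_xml."""
    text = ET.fromstring(doc).text or ""  # empty payload parses as None
    return base64.b64decode(text)
```

The base64 alphabet contains no "<", "&" or "]", so the payload cannot collide with XML syntax. The price is about a third more bytes.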

Cthulhu_reencoded

@Random832 said:

Fails on any input file containing the byte string "]]>".

Yes, but it's way more broken than that. It uses the default UTF-8 encoding, so feeding random binary in is almost certain to generate invalid code sequences. And C0 controls will likely break it too.
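To make those two failure modes concrete, here is a small check, sketched in Python, for whether a byte string could be pasted into an XML 1.0 document as-is. The character ranges are the Char production from the XML 1.0 spec; the function name `is_xml_safe` is mine.

```python
def is_xml_safe(data: bytes) -> bool:
    """True if data is valid UTF-8 and every character is a legal XML 1.0 Char."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # invalid UTF-8 sequence
    return all(
        ch in "\t\n\r"  # the only C0 controls XML 1.0 allows
        or 0x20 <= ord(ch) <= 0xD7FF
        or 0xE000 <= ord(ch) <= 0xFFFD
        or 0x10000 <= ord(ch) <= 0x10FFFF
        for ch in text
    )
```

Note this says nothing about "<", "&" or "]]>", which are legal characters but still need escaping.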

So we have a tool used to wrap binary data, that really only works properly with ASCII printable characters. Most of the time. And what it does is pointless anyway: if you just want to dump random binary garbage into an XML format, why not use the existing standard, OOXML?

yafake

Are you just confused or really that stupid? Office "Open" XML from Microsoft, not being any kind of serious standard, does exactly that: dumping OLE objects as base64-encoded binary files into an XML structure for the pure sake of buzzword compatibility. Did you rather mean OpenDocument? If so, then I still fail to see how the WTF above should be related to document formats exclusively or especially.

No, I'm pretty sure that's actually what s/he meant. OOXML (the MS Office "standard") is a bunch of "random binary garbage" dumped into pseudoXML.

You worse-than-fail.

MarcB

Well, OOXML, while being a steaming pile of BillG toilet flushings, isn't totally the equivalent of "ren whatever.doc whatever.xml". There is a fair amount of XML in there. You could, with a bit of work, extract most of the content of a document, if you poked at the bytes long enough. The biggest problem with it is that the spec refers to other Microsoft "standards" on how to implement various formatting and layout options.

Here's a small sample of the differences between the OpenOffice and OOXML specs (paraphrased):

OpenOffice definition: a <bullet> tag shall be formatted with a leading character <whatever glyph/character entity they use, like •>
OOXML: a <bullet> tag shall be formatted by bullet style #1 as used in Word 97

OpenOffice: a <date> tag represents a date with epoch of Jan 1/1970, formatted & presented as per ISO <whatever>
OOXML: a <date> tag represents a date with epoch of Jan 1/1900, with leap years calculated as in Excel

Microsoft has basically created an "open" specification which is impossible for anyone but Microsoft to implement properly, because no one but Microsoft has access to all the required "external" elements, like the Word 97 bullet specifications, or the Excel date algorithms (which have a LONGstanding bug defining 1900 as a leap year).
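The 1900 leap year bug is at least easy to show. A sketch of converting an Excel-style day serial (1900 date system, serial 1 = Jan 1, 1900) to a real calendar date, assuming whole-day serials only; the function name is hypothetical and this is not taken from the OOXML text itself:

```python
from datetime import date, timedelta
from typing import Optional


def excel_serial_to_date(serial: int) -> Optional[date]:
    """Map a 1900-system day serial to a date; serial 60 is the phantom Feb 29, 1900."""
    if serial < 1:
        raise ValueError("serials start at 1")
    if serial == 60:
        return None  # Feb 29, 1900 never existed
    if serial < 60:
        return date(1899, 12, 31) + timedelta(days=serial)
    # Past the phantom day every serial is one too high, so shift the base back a day.
    return date(1899, 12, 30) + timedelta(days=serial)
```

Anyone implementing the format has to reproduce that off-by-one on purpose to stay compatible with existing spreadsheets.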

As well, certain definitions are inconsistent throughout the spec, depending on which Office app it's coming from. Dates in Access are mutually inconsistent with dates in Word, Excel, and PowerPoint, so there's 4 different specifications for date tags in OOXML, and none of those specifications really say HOW to specify a date, they're all "see Word/Excel/Access/PP docs at pages X,Y,Z,A"

OO.org, by comparison, CAN be implemented completely by anyone who'd care to sit down and do so. They do refer to external specifications, but without fail they're other ISO standards which anyone can use. There's only ONE date definition, which is used across all OO.org application types, etc...

And to top it off, Microsoft had to use something around 900+ pages for their specs, most of which redefines the wheel MAX_INT times. OO.org's spec is what... 50 pages? 100?

Atrophy

The weird part is that ODF is *also* just the internal representation of the document dumped into an arbitrary XML candy coating. I think it's high time someone made a document format that was just XML and nothing else ... like this:

<ActuallyOpenDocument>

<title>This is a title</title>

<paragraph>This text is <u>underlined.</u></paragraph>

<paragraph>This is an image <image>(embedded image)</image>.</paragraph>

</ActuallyOpenDocument>

That way you could be sure that years down the road even if nobody could get your specification to work, they could still pull the information they need out of the file for as long as there are text editors.

Iago

@Atrophy said:

The weird part is that ODF is also just the internal representation of the document dumped into an arbitrary XML candy coating.

That would be weird if it was true, but it isn't.

@Atrophy said:

I think it's high time someone made a document format that was just XML and nothing else ... like this

There are plenty. HTML springs to mind, or DocBook. They all have the problem that what *users* want is a document where they can control the actual layout easily, and the only way to get that is to have a rich markup language that allows you to specify a lot of complicated detail that isn't directly connected to the semantics of the document.

dmitriy

I cannot find it right now, but there was a post on the front page a while ago where an XML file contained CSV data within a CDATA element. This example reminds me of that post.

pauluskc

@Iago said:

@Atrophy said:
The weird part is that ODF is *also* just the internal representation of the document dumped into an arbitrary XML candy coating.
That would be weird if it was true, but it isn't.
@Atrophy said:
I think it's high time someone made a document format that was just XML and nothing else ... like this
There are plenty. HTML springs to mind, or DocBook. They all have the problem that what *users* want is a document where they can control the actual layout easily, and the only way to get that is to have a rich markup language that allows you to specify a lot of complicated detail that isn't directly connected to the semantics of the document.

What does HTML+CSS do for ya? XML has attributes. Wowsa! All these JS editors that give you an RTF-like editing environment and produce XHTML-compliant code are just about the answer. Why MS hasn't just souped up one of these open-source editors and replaced Word with it is beyond me. Oh yeah. MS <> Sun. Silly proprietors...

<doc>

<text hierarchy="heading" hierarchy_level="1" style="font-family:creative_font;font-size:500px;">This is stupid.<image encoding="base64">89734ofhao489fhao84faw889yf9yfofa8yfa8f74kd87o8rytca498oyfz8prg</image></text>

</doc>

Atrophy

@Iago said:

@Atrophy said:
The weird part is that ODF is *also* just the internal representation of the document dumped into an arbitrary XML candy coating.
That would be weird if it was true, but it isn't.

Here's where I got that from: http://news.com.com/Microsofts+standards+choice/2010-1013_3-6161285.html

"Further, the letter claims that "ODF is closely tied to OpenOffice and related products" (bad!) while OOXML "reflects the rich set of capabilities in Office 2007" (good!). A more even-handed sentence might read: ODF is an XML-based dump of the internal data structures of OpenOffice, while OOXML is an XML-based dump of the internal data structures of Microsoft Office."

pauluskc

@Atrophy said:

@Iago said:
@Atrophy said:
The weird part is that ODF is also just the internal representation of the document dumped into an arbitrary XML candy coating.
That would be weird if it was true, but it isn't.

Here's where I got that from: http://news.com.com/Microsofts+standards+choice/2010-1013_3-6161285.html

"Further, the letter claims that "ODF is closely tied to OpenOffice and related products" (bad!) while OOXML "reflects the rich set of capabilities in Office 2007" (good!). A more even-handed sentence might read: ODF is an XML-based dump of the internal data structures of OpenOffice, while OOXML is an XML-based dump of the internal data structures of Microsoft Office."

ding ding ding! this is the most accurate answer.

PSWorx

Bah, you all fail to see the future: Office 2010 will of course feature XML that contains base64'ed XML that contains base64'ed binary data...

PSWorx

Seriously though, I read there is lately a new, saner approach to putting binary data into XML: "hybrid files", basically zip archives that contain one (or more) XML "index" files with all of the XML data and various other files containing the binary data. (The internal directory structure and/or file names are of course part of the spec.)
Since the binary files wouldn't need to be encoded in any other way but the compression, you could "extract" binary data by unzipping the appropriate file.

The only drawback I see in this approach might be that the result is not a text file anymore. But since all you need to "hack into" it is a simple zip utility, I don't think that's much of a problem.
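A rough sketch of the idea using Python's standard zipfile module. The layout here ("index.xml" plus a "media/" directory) and the helper names are made up for illustration and are not taken from any particular spec:

```python
import io
import zipfile
from typing import Dict


def build_package(xml_index: str, binaries: Dict[str, bytes]) -> bytes:
    """Pack an XML index plus raw binary members into one zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("index.xml", xml_index)  # hypothetical file name
        for name, payload in binaries.items():
            zf.writestr("media/" + name, payload)  # stored as-is, no base64
    return buf.getvalue()


def read_binary(package: bytes, name: str) -> bytes:
    """Pull one binary member back out, byte for byte."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read("media/" + name)
```

The XML index only needs to reference the member by name, e.g. `<image href="media/logo.png"/>`, so nothing binary ever passes through the XML parser.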

Kemp

@PSWorx said:

Seriously though, I read there is lately a new, saner approach to putting binary data into XML: "hybrid files", basically zip archives that contain one (or more) XML "index" files with all of the XML data and various other files containing the binary data. (The internal directory structure and/or file names are of course part of the spec.)
Since the binary files wouldn't need to be encoded in any other way but the compression, you could "extract" binary data by unzipping the appropriate file.
The only drawback I see in this approach might be that the result is not a text file anymore. But since all you need to "hack into" it is a simple zip utility, I don't think that's much of a problem.

Congratulations on being the first person to actually describe the format open office uses. Go ahead, rename one of your files as a zip and open it up. I have no idea where people get their information from, but it's not reality.

XIU

@Kemp said:

@PSWorx said:
Seriously though, I read there is lately a new, saner approach to putting binary data into XML: "hybrid files", basically zip archives that contain one (or more) XML "index" files with all of the XML data and various other files containing the binary data. (The internal directory structure and/or file names are of course part of the spec.)
Since the binary files wouldn't need to be encoded in any other way but the compression, you could "extract" binary data by unzipping the appropriate file.
The only drawback I see in this approach might be that the result is not a text file anymore. But since all you need to "hack into" it is a simple zip utility, I don't think that's much of a problem.
Congratulations on being the first person to actually describe the format open office uses. Go ahead, rename one of your files as a zip and open it up. I have no idea where people get their information from, but it's not reality.

And so is OOXML :)

Arancaytar

@PSWorx said:

Bah, you all fail to see the future: Office 2010 will of course feature XML that contains base64'ed XML that contains base64'ed binary data...

... base64'ed binary data of an image, which is a photograph taken of a printed XML document on a wooden table. ;-)

Random832

@PSWorx said:

Seriously though, I read there is lately a new, saner approach to putting binary data into XML: "hybrid files", basically zip archives that contain one (or more) XML "index" files with all of the XML data and various other files containing the binary data. (The internal directory structure and/or file names are of course part of the spec.)
Since the binary files wouldn't need to be encoded in any other way but the compression, you could "extract" binary data by unzipping the appropriate file.
The only drawback I see in this approach might be that the result is not a text file anymore. But since all you need to "hack into" it is a simple zip utility, I don't think that's much of a problem.

Kind of a WTF because there's an existing standard whose result is a text file - MHTML (the approach can easily be generalized to any other problem involving XML+other/binary resources)

PSWorx

@Random832 said:

@PSWorx said:
Seriously though, I read there is lately a new, saner approach to putting binary data into XML: "hybrid files", basically zip archives that contain one (or more) XML "index" files with all of the XML data and various other files containing the binary data. (The internal directory structure and/or file names are of course part of the spec.)
Since the binary files wouldn't need to be encoded in any other way but the compression, you could "extract" binary data by unzipping the appropriate file.
The only drawback I see in this approach might be that the result is not a text file anymore. But since all you need to "hack into" it is a simple zip utility, I don't think that's much of a problem.
Kind of a WTF because there's an existing standard whose result is a text file - MHTML (the approach can easily be generalized to any other problem involving XML+other/binary resources)

I wouldn't necessarily call it a WTF, because I think the zip approach actually IS superior to MHTML. As far as I know, the two possibilities in MHTML are a) going for space and putting the binary contents into the text file unencoded - after which you end up with some strange mix of binary and text sections that I personally would be scared to open with a text editor - or b) going for security and encoding the data as Quoted Printable, Base64 or Hex (!) - in which case you do have a text file but ... see start of the topic :)
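For comparison, option b) looks roughly like this with Python's standard email package. It is only a sketch of an MHTML-style multipart/related container, not a complete MHTML writer, and the "blob.bin" name and helper functions are invented for the example:

```python
import email
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_multipart(xml_text: str, payload: bytes) -> bytes:
    """Bundle an ASCII XML part and a base64-encoded binary part in one text file."""
    msg = MIMEMultipart("related")
    msg.attach(MIMEText(xml_text, "xml"))  # ASCII text stays readable as-is
    blob = MIMEApplication(payload)  # base64 transfer encoding by default
    blob.add_header("Content-Location", "blob.bin")  # hypothetical resource name
    msg.attach(blob)
    return msg.as_bytes()


def extract_payload(raw: bytes) -> bytes:
    """Find the binary part by its Content-Location and decode it."""
    msg = email.message_from_bytes(raw)
    for part in msg.walk():
        if part.get("Content-Location") == "blob.bin":
            return part.get_payload(decode=True)
    raise KeyError("blob.bin")
```

The result opens fine in a text editor, but the binary part is back to being a wall of base64, which is the overhead the zip container avoids.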

VaclavK

I have seen MS and a few others doing exactly that... later claiming that the product now supports XML :-)

@Cthulhu_reencoded said:

if you just want to dump random binary garbage into an XML format, why not use the existing standard, OOXML?

You sir, have made my day.

@Atrophy said:

ODF is an XML-based dump of the internal data structures of OpenOffice

Not exactly. ODF reuses open standards, like SVG, MathML, Dublin Core, XForms, ... OOXML reuses Microsoft's "standards" VML, DrawingML, their own math markup language, et cetera.