XML to Avro Conversion
We all know what XML is, right? Just in case not, no problem: here is what it is all about.
<root>
  <node>5</node>
</root>
Now, what the computer really needs is the number five and some context around it. In XML, you (human and computer) can see how the markup gives context to the five. Now let's say that instead you have a business XML document like FpML:
<FpML xmlns="http://www.fpml.org/2007/FpML-4-4"
      xmlns:fpml="http://www.fpml.org/2007/FpML-4-4"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      version="4-4"
      xsi:schemaLocation="http://www.fpml.org/2007/FpML-4-4 ../fpml-main-4-4.xsd http://www.w3.org/2000/09/xmldsig# ../xmldsig-core-schema.xsd"
      xsi:type="RequestTradeConfirmation">
  <!-- start of distinct -->
      <strike>
        <strikePrice>32.00</strikePrice>
      </strike>
      <numberOfOptions>150000</numberOfOptions>
      <optionEntitlement>1.00</optionEntitlement>
      <equityPremium>
        <payerPartyReference href="party2"/>
        <receiverPartyReference href="party1"/>
        <paymentAmount>
          <currency>EUR</currency>
          <amount>405000</amount>
        </paymentAmount>
        <paymentDate>
          <unadjustedDate>2001-07-17Z</unadjustedDate>
          <dateAdjustments>
            <businessDayConvention>NONE</businessDayConvention>
          </dateAdjustments>
        </paymentDate>
        <pricePerOption>
          <currency>EUR</currency>
          <amount>2.70</amount>
        </pricePerOption>
      </equityPremium>
    </equityOption>
    <calculationAgent>
      <calculationAgentPartyReference href="party1"/>
    </calculationAgent>
    <documentation>
      <masterAgreement>
        <masterAgreementType>ISDA2002</masterAgreementType>
      </masterAgreement>
      <contractualDefinitions>ISDA2002Equity</contractualDefinitions>
      <!-- populate credit support document with correct value -->
      <creditSupportDocument>TODO</creditSupportDocument>
    </documentation>
    <governingLaw>GBEN</governingLaw>
  </trade>
  <party id="party1">
    <partyId>Party A</partyId>
  </party>
  <party id="party2">
    <partyId>Party B</partyId>
  </party>
</FpML>
That is a lot of extra, unnecessary data points. Now let's look at this using Apache Avro.
With Avro, the context and the values are separated. This means the schema (the structure that describes the information) does not get stored or streamed over and over and over and over (and over) again.
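To make the separation concrete, here is a minimal sketch in Python that compares the tiny XML document from the top of this post with the bytes Avro would need for the same value. It assumes a schema (not shown in the original post) that describes `root` as a record with a single `int` field called `node`. Avro writes an `int` as a zig-zag, variable-length integer and writes a record as just its fields back to back, so the helper below, whose name is made up for this example, is all that is needed.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer the way Avro encodes int/long values."""
    # Zig-zag step: small negative numbers become small positive numbers.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    # Variable-length step: 7 bits per byte, high bit set while more bytes follow.
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


# The XML version carries its context (the tags) with every single value.
xml_doc = b"<root><node>5</node></root>"

# The Avro version carries only the value; the context lives in the schema.
avro_value = zigzag_varint(5)

print(len(xml_doc), "bytes of XML vs", len(avro_value), "byte of Avro data")
```

The schema still has to live somewhere, but it is stored once rather than repeated inside every document.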
The Avro schema is hashed. The data itself only holds the values plus the fingerprint (the hash) of the schema, and the computer uses that fingerprint to retrieve the schema.
0xd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592
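The fingerprint lookup can be sketched roughly like this in Python. This is an illustration under assumptions, not how any particular Avro library does it: the `registry` dict is a hypothetical in-memory schema store, the schema is one invented for the small `<root>` example, and the canonicalization is a simplification (Avro defines its own Parsing Canonical Form with stricter rules).

```python
import hashlib
import json


def fingerprint(schema: dict) -> str:
    """Return a SHA-256 hex fingerprint of a schema."""
    # Simplification: sorted keys and no whitespace, so the same schema always
    # hashes the same way. This is NOT Avro's Parsing Canonical Form.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Assumed schema for the <root><node>5</node></root> example.
schema = {
    "type": "record",
    "name": "root",
    "fields": [{"name": "node", "type": "int"}],
}

# Hypothetical schema store: fingerprint -> schema.
registry = {}
fp = fingerprint(schema)
registry[fp] = schema

# What gets stored or streamed: the fingerprint plus the encoded values.
message = (fp, b"\x0a")  # 0x0a is the Avro encoding of the int 5

# The reader looks the schema up by fingerprint before decoding the values.
reader_schema = registry[message[0]]
print("0x" + fp)
print(reader_schema["fields"][0]["name"])
```

In a real system the registry would be a shared service or table, so writers and readers only need to agree on the fingerprint.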
This type of implementation is pretty typical in the data space.
When you do this you can reduce the size of your data by 20% to 80%. When I tell folks this they immediately ask, “why such a wide range?” The answer is that not every XML document is created the same: how much you save depends on how much of the document is markup rather than values. But that is the problem: with XML you are duplicating the information the computer needs to understand the data. XML is nice for humans to read, sure … but it is not optimized for the computer.
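One rough way to see why the range is so wide is to measure how much of an XML document is markup rather than values. The sketch below is only a proxy, not a measurement of what Avro would actually save; the function name and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def markup_share(xml_text: str) -> float:
    """Fraction of the document's bytes that is not element text or attribute values."""
    root = ET.fromstring(xml_text)
    payload = 0
    for el in root.iter():
        # Count the text inside and after each element, ignoring layout whitespace.
        payload += len((el.text or "").strip().encode("utf-8"))
        payload += len((el.tail or "").strip().encode("utf-8"))
        # Attribute values are data too.
        payload += sum(len(v.encode("utf-8")) for v in el.attrib.values())
    total = len(xml_text.encode("utf-8"))
    return 1 - payload / total


# Tag-heavy: short values wrapped in long element names.
tag_heavy = "<paymentAmount><currency>EUR</currency><amount>405000</amount></paymentAmount>"

# Text-heavy: one long value wrapped in very little markup.
text_heavy = "<note><body>" + "Settlement instructions follow. " * 10 + "</body></note>"

print(f"tag-heavy markup share:  {markup_share(tag_heavy):.0%}")
print(f"text-heavy markup share: {markup_share(text_heavy):.0%}")
```

A document that looks like the first sample has far more repeated context to strip out than one that looks like the second.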
Here is a converter we are working on, https://github.com/stealthly/xml-avro, to help get folks off of XML and onto lower-cost, open source systems. It allows you to keep parts of your systems (specifically the domain business code) using XML without having to change them (risk mitigation), while storing and streaming the data with less overhead (optimize budget).