Re: [ire] DNRD CSV Draft

Francisco,

Thanks for the reply.  My feedback is embedded below.

-- 

JG

James Gould
Principal Software Engineer
jgould@verisign.com

703-948-3271 (Office)
12061 Bluemont Way
Reston, VA 20190
VerisignInc.com

On 12/5/12 8:59 PM, "Francisco Obispo" <fobispo@isc.org> wrote:

>So I'm now trying to go in more detail on the CSV implementation, to try
>to understand its benefits.
>
>So far I've found the following cons of this approach:
>
>1) It does not eliminate the need of having to code, whoever implements
>   this, is going to have to write some scripts to generate the enclosing
>   XML file, and generate all the CSVs in the right format. In fact, I
>   believe there is going to be more work associated with this approach,
>   see #2.

I disagree with this.  First, the XML definition that follows
draft-arias-noguchi-registry-data-escrow-04 would be primarily static
content, with the dynamic content being the file names and optionally the
file checksums.  You could use a templating language (e.g. Apache
Velocity) or simple string replacement to fill the file names and
checksums into the static definition XML file.  This is very
straightforward and might take less time than I'm taking to write this
e-mail.  You define the fields that are applicable to your registry in the
XML definition and then you're free to use any tool (off the shelf or
custom) to generate the CSV files.  The CSV files map one-to-one with a
relational schema, so there is very little transformation required.  The
all-XML approach of draft-arias-noguchi-dnrd-objects-mapping-01 requires a
relational-to-object conversion, which is far more complex and far more
work than dumping fields from a relational schema to a set of CSV files.
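
As a rough sketch of the string-replacement idea, the following fills file
names and checksums into a static definition.  The element and attribute
names in the template are hypothetical placeholders, not the ones defined in
either draft, and CRC32 is assumed here only as an example checksum.

```python
import zlib
from string import Template

# Hypothetical static definition; the element and attribute names are
# illustrative placeholders, not taken from the drafts.
TEMPLATE = Template(
    '<definition>\n'
    '  <file object="domain" name="$domain_name" cksum="$domain_cksum"/>\n'
    '  <file object="host" name="$host_name" cksum="$host_cksum"/>\n'
    '</definition>\n'
)


def crc32_hex(data: bytes) -> str:
    """Return the CRC32 of data as 8 uppercase hex digits (assumed format)."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08X")


def render_definition(csv_files):
    """csv_files maps an object type to a (file name, file bytes) pair."""
    values = {}
    for obj, (name, data) in csv_files.items():
        values[f"{obj}_name"] = name
        values[f"{obj}_cksum"] = crc32_hex(data)
    # substitute() raises KeyError if a placeholder is left unfilled.
    return TEMPLATE.substitute(values)
```

Only the file names and checksums change between deposits; the rest of the
definition stays fixed.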

>
>2) There's the issue of Validation of the data. The fields are defined
>   using an XML Schema (i.e.: <rdeCsv:fName>  Name field with
>   type="eppcom:labelType") which will create the need of writing a
>   program to implement all of the types used in the schema for
>   validation.

I do agree that there is a need to create a custom validation program
that validates the CSV data fields against the XSD field format
definitions.  I believe that the community can create this tool, and we
would certainly like to participate in that effort.  The fact that you can
validate XML using an XML parser doesn't mean that we should make the
generation and consumption of the data escrows more complex just to avoid
building a new validation program.  The field types defined in
draft-gould-thippeswamy-dnrd-csv-mapping-00 utilize XSD types, so it's a
matter of reusing parts of the XML parsers to validate individual field
elements.
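
A minimal sketch of what such a validator could look like, assuming each CSV
column is bound to a check derived from its XSD type.  The fName field comes
from the example above; fClID is a hypothetical second column, and the length
limits are stand-ins rather than values copied from the schema.

```python
import csv
import io


def make_token_validator(min_len, max_len):
    """Build a check for an xs:token-style value with length limits."""
    def validate(value):
        # xs:token allows no leading, trailing, or repeated whitespace.
        normalized = " ".join(value.split())
        return value == normalized and min_len <= len(value) <= max_len
    return validate


# Column order and limits are assumptions for illustration only.
FIELD_VALIDATORS = [
    ("fName", make_token_validator(1, 255)),
    ("fClID", make_token_validator(3, 16)),  # hypothetical field
]


def validate_csv(text, validators=FIELD_VALIDATORS):
    """Return a list of (row number, field name, offending value)."""
    errors = []
    for row_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != len(validators):
            errors.append((row_no, None, "wrong field count"))
            continue
        for (field, check), value in zip(validators, row):
            if not check(value):
                errors.append((row_no, field, value))
    return errors
```

A real tool would read the field list and types from the XML definition
instead of hard-coding them.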

>
>3) We now have more files to process/store/sign, instead of just one.

The files could be placed and compressed into a ZIP file per the AGB, so
the number of contained files is not relevant for transfer, signing, and
storage.  
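
For illustration, bundling the files into a single archive takes only a few
lines.  This is a sketch using Python's standard zipfile module; the file
names in the usage note are hypothetical.

```python
import io
import zipfile


def build_deposit_zip(files):
    """Pack a mapping of file name -> bytes into one compressed ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()
```

Calling build_deposit_zip({"definition.xml": b"...", "domain.csv": b"..."})
yields one blob to transfer, sign, and store, however many CSV files it
contains.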

>
>
>I'm not done with my tests yet, but I've been able to compress an XML file
>with BZIP2 to 1/15th of its original size. I'm not envisioning that this
>is going to be a problem for most registries.

You can also compress the CSV files down to 1/15th of their original size.
You can't ignore the uncompressed size when it comes to processing, though.
Compression reduces the transfer time and the storage costs.  You should
compare the uncompressed and compressed deposits using CSV and XML with
incrementally larger randomized data sets.  I believe you will find a large
advantage with the use of CSV from a size perspective.  I fundamentally
believe that the draft should address all registries independent of their
size.  Saying that some of the registries will have a problem with the use
of XML is, to me, a showstopper for XML deposits.
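
One way to run that comparison is sketched below.  The three-field record
and the XML element names are a toy layout invented for the test harness,
not the formats defined in either draft, so the numbers it produces only
indicate the trend.

```python
import bz2
import csv
import io
import random
import string
from xml.sax.saxutils import escape


def random_records(n, seed=0):
    """Yield n toy (name, client id, creation date) records."""
    rng = random.Random(seed)
    for _ in range(n):
        label = "".join(rng.choices(string.ascii_lowercase, k=12))
        month = rng.randrange(1, 13)
        yield (label + ".example",
               f"registrar{rng.randrange(100)}",
               f"2012-{month:02d}-01T00:00:00Z")


def to_csv(records):
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(records)
    return buf.getvalue().encode()


def to_xml(records):
    # Hypothetical element names, used only to model per-record markup.
    parts = ["<domains>"]
    for name, cl_id, cr_date in records:
        parts.append(
            f"<domain><name>{escape(name)}</name>"
            f"<clID>{escape(cl_id)}</clID>"
            f"<crDate>{escape(cr_date)}</crDate></domain>"
        )
    parts.append("</domains>")
    return "\n".join(parts).encode()


def compare(n):
    """Return {format: (uncompressed bytes, bzip2-compressed bytes)}."""
    records = list(random_records(n))
    result = {}
    for label, data in (("csv", to_csv(records)), ("xml", to_xml(records))):
        result[label] = (len(data), len(bz2.compress(data)))
    return result
```

Running compare() for n = 1000, 10000, 100000 and so on shows how both the
uncompressed and compressed sizes grow for each format.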

>
>I'll provide more feedback later,
>
>
>Francisco Obispo 
>Director of Applications and Services - ISC
>email: fobispo@isc.org
>Phone: +1 650 423 1374 || INOC-DBA *3557* NOC
>PGP KeyID = B38DB1BE
>
>On Oct 25, 2012, at 1:43 PM, "Gould, James" <JGould@verisign.com> wrote:
>
>> All,
>> 
>> We have created a draft of the Domain Name Registration Data (DNRD)
>>Comma-Separated Values (CSV) Objects Mapping that is attached for review
>>and feedback.  We intend to post it to the IETF as an I-D once the
>>submission page is available on November 5th.  This draft fully supports
>>the Registry Data Escrow Specification
>>(draft-arias-noguchi-registry-data-escrow-04).  It defines the CSV files
>>and the order and format of the CSV fields for the data escrow of domain
>>name, host, contact, registrar, and IDN language objects.  If there is
>>interest in this model we can consider merging it with the Domain Name
>>Registration Data (DNRD) Objects Mapping
>>(draft-arias-noguchi-dnrd-objects-mapping-01).  The basis of using CSV
>>files for DNRD objects includes:
>> 
>> 	* CSV is a natural format for exporting and importing data from and to
>>a database.  This could greatly simplify the generation of the data
>>escrow files as well as the consumption of the files by an EBERO
>>provider.  
>> 	* XML is a highly verbose format that will adversely affect the
>>processing of large data sets.  With the draft, XML can be used for
>>definition and CSV can be used for data, so the duplication of the
>>descriptive information does not have to be used for every record.
>> 		* If you take the domain object (<rdeDomain:domain>) example from
>>draft-arias-noguchi-dnrd-objects-mapping-01 and convert it to CSV files
>>(domain, dnssec, and domainTransfer) you save around 75% uncompressed
>>and 82% compressed using CSV.
>> 		* Extrapolating out the uncompressed size of a <rdeDomain:domain>
>>records (1718 bytes versus 416 bytes uncompressed per record) for XML
>>and CSV,  you get to 1.7 GB with XML and 443 MB for CSV with 1 million
>>records and 170 GB with XML and 44.3 GB for CSV with 100 million records.
>> 		* The deposits are generated uncompressed, validated, compressed and
>>transferred by registry, and uncompressed, validated and stored by the
>>data escrow provider.  Both the size difference and the processing
>>resources required for both the registry and the data escrow provider
>>should be considered when comparing the two models.
>> 		* EBERO providers must transfer, uncompress, validate, and import the
>>data into their database from the data escrow deposits, where the larger
>>the files and the processing resources required, the longer it will take
>>to recover the TLD.
>> 		* The full deposit is done weekly, so it is a weekly hit for all
>>registries, where the larger the registry the bigger the hit.
>> Please review the attached draft and provide any feedback for
>>consideration.  
>> 
>> Thanks,
>> 
>> --
>>   
>> JG
>>  
>> James Gould
>> Principal Software Engineer
>> jgould@verisign.com
>>  
>> 703-948-3271 (Office)
>> 12061 Bluemont Way
>> Reston, VA 20190
>> VerisignInc.com
>> 
>> 
>> 
>><draft-gould-thippeswamy-dnrd-csv-mapping.txt>
>> _______________________________________________
>> ire mailing list
>> ire@ietf.org
>> https://www.ietf.org/mailman/listinfo/ire
>