Re: Syntax

Frank Ellermann <nobody@xyzzy.claranet.de> Wed, 10 January 2007 21:38 UTC

To: cosmogol@ietf.org
From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Wed, 10 Jan 2007 22:36:48 +0100
Organization: <URL:http://purl.net/xyzzy>
Lines: 68
Message-ID: <45A55C70.283F@xyzzy.claranet.de>
References: <45A129E9.50905@gmx.de> <20070107205255.GA14621@sources.org> <45A20F62.9060306@gmx.de> <20070108204618.GA29407@sources.org> <20070109000704.GB17340@finch-staff-1.thus.net> <20070109081753.GA1875@nic.fr> <20070110055950.GA5608@finch-staff-1.thus.net> <20070110083434.GB24390@nic.fr> <20070110104810.GB32555@finch-staff-1.thus.net>

Clive D.W. Feather wrote:

> The "what's valid UTF-8" syntax isn't complicated.

While I agree that the code shown below isn't complicated, it's not
something I'd like to copy into any REXX scripts about state machines.

UTF32O:  procedure               /* UTF-8 to UTF-32BE decoder     */
   U.2 = xrange( x2c( '80' ), x2c( 'BF' ))
   SUB = x2c( '0000FFFD' )       ;  DST = ''
   parse arg SRC                 ;  LOS = length( SRC )

   do while LOS > 0
      parse var SRC LB 2 SRC     ;  LOS = LOS - 1
      LB = c2d( LB )             ;  TOP = 0

      if LB < 128 then  do
         DST = DST || x2c( d2x( LB, 8 ))  ;  iterate
      end

      if LOS > 0  then  TOP = c2d( left( SRC, 1 )) % 16
      select                     /* for CESU remove both LB = 237 */
         when  LB < 192             then  LEN = -0 /* trail bytes */
         when  LB < 194             then  LEN = -1 /* bad C0 + C1 */
         when  LB < 224             then  LEN = +1
         when  LB = 224 & TOP =  8  then  LEN = -2 /* E08x is bad */
         when  LB = 224 & TOP =  9  then  LEN = -2 /* E09x is bad */
         when  LB = 237 & TOP = 10  then  LEN = -2 /* EDAx is bad */
         when  LB = 237 & TOP = 11  then  LEN = -2 /* EDBx is bad */
         when  LB < 240             then  LEN = +2
         when  LB = 240 & TOP =  8  then  LEN = -3 /* F08x is bad */
         when  LB < 244             then  LEN = +3
         when  LB = 244 & TOP =  8  then  LEN = +3 /* F48x is ok. */
         when  LB < 248             then  LEN = -3 /* bad F4 - F7 */
         when  LB < 252             then  LEN = -4 /* bad F8 - FB */
         when  LB < 254             then  LEN = -5 /* bad FC + FD */
         otherwise                        LEN = -0 /* bad FE + FF */
      end

      BAD = ( LEN <= 0 )         ;  LEN = abs( LEN )
      if LOS < LEN   then  do
         BAD = 1                 ;  LEN = LOS
      end

      TOP = left( SRC, LEN )     ;  SRC = substr( SRC, LEN + 1 )
      TMP = verify( TOP, U.2 )   ;  LOS = LOS - LEN
      if TMP > 0  then  do       /* eat plausible trailing bytes: */
         BAD = 1                 ;  SRC = substr( TOP, TMP ) || SRC
         LOS = length( SRC )     /* but keep possible valid input */
      end                        /* bytes for the next iteration  */

      if BAD = 0  then  do       /* at this point input is valid: */
         LB  = x2b( d2x( LB ))   ;  LEN = verify( LB, 1 ) - 2
         LB  = copies( 0, LEN ) || right( LB, 6 - LEN )

         do until TOP == ''
            TMP = x2b( c2x( left( TOP, 1 )))
            LB  = LB || right( TMP, 6 )
            TOP = substr( TOP, 2 )
         end

         TOP = b2x( strip( LB, 'L', 0 ))
         DST = DST || x2c( right( TOP, 8, 0 ))
      end
      else  DST = DST || SUB
   end
   return DST
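
To look at the final bit-assembly step in isolation, here is a minimal
sketch on one hard-coded, known-good sequence (E2 82 AC, the euro sign
U+20AC).  The names SEQ, BITS, and CP are illustrative and not part of
the decoder above, and the sketch assumes valid input: it does no
checking of lead or trail bytes at all.

```rexx
/* Sketch only: SEQ, BITS, CP are illustrative names, no validation */
SEQ  = x2c( 'E282AC' )           /* one well-formed 3-byte sequence */
BITS = x2b( c2x( left( SEQ, 1 )))         /* lead byte is 11100010  */
LEN  = verify( BITS, 1 ) - 2     /* position of first 0, less 2 = 2 */
BITS = right( BITS, 6 - LEN )    /* payload bits of the lead:  0010 */

do I = 2 to LEN + 1              /* low 6 bits of each trail byte   */
   BITS = BITS || right( x2b( c2x( substr( SEQ, I, 1 ))), 6 )
end

CP = b2x( BITS )                 /* 0010 000010 101100 gives 20AC   */
say 'U+' || CP                   /* prints U+20AC                   */
```

The lead byte contributes 6 - LEN payload bits and every trail byte
contributes 6, which is the same arithmetic the decoder does after its
BAD = 0 test, minus the padding to eight hex digits.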


