Outage analysis and report

Glen <glen@amsl.com> Tue, 28 January 2020 23:56 UTC

MIME-Version: 1.0
From: Glen <glen@amsl.com>
Date: Tue, 28 Jan 2020 15:56:40 -0800
X-Gmail-Original-Message-ID: <CABL0ig5G0K+ULxAHjLcXw6LicBHutdeOckJ==QMy6=kLZOOhoA@mail.gmail.com>
Message-ID: <CABL0ig5G0K+ULxAHjLcXw6LicBHutdeOckJ==QMy6=kLZOOhoA@mail.gmail.com>
Subject: Outage analysis and report
To: ietf-announce@ietf.org
Content-Type: text/plain; charset="UTF-8"
Archived-At: <https://mailarchive.ietf.org/arch/msg/ietf-announce/Zo3YbZJVpM74fioJ1P6fVKs2NTc>

Dear IETF Community -

As you know, about 30 hours ago, we moved the IETF to a new server
running an upgraded OS, updated software, and a new Python version.  As
part of that process, the Tools Team moved the Datatracker and their
other software to Python 3.

14 hours ago, that new server suffered a significant data loss. Henrik
was online shortly after the data loss occurred, and called me
immediately. Investigation determined that the loss was caused by a
command in the daily Datatracker cron script.  One rsync command in
that script is designed to make the IANA yang-parameters available to the
Datatracker.  After the upgrade to Python 3, that script generated a
bad command-line argument, resulting in the rsync command running with
an incorrect target and incorrectly deleting server data.  The missing
data then caused the Datatracker, the Mail Archive, and other tools to
malfunction or fail.  We of course operate a number of hot backup
servers, all of which dutifully picked up the data changes
immediately, so the deletions were replicated there as well.
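
As an illustration of how a destructive rsync invocation can be guarded
against a malformed target, here is a minimal sketch in Python.  It is
not the Datatracker's actual cron script: the function name, the
allow-listed root /srv/example/yang, and the rsync source shown in the
usage note are all hypothetical.  The function only builds and checks
the argument list; it does not run rsync.

```python
from pathlib import PurePosixPath

# Hypothetical allow-list: the only directory trees this job may sync into.
ALLOWED_ROOTS = (PurePosixPath("/srv/example/yang"),)


def build_rsync_argv(source, dest, allowed_roots=ALLOWED_ROOTS):
    """Return an rsync argument list, or raise ValueError if dest looks unsafe."""
    for name, value in (("source", source), ("dest", dest)):
        # A bytes value, None, or an empty string here usually points to a
        # string-handling bug upstream, so refuse rather than guess.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} must be a non-empty str, got {value!r}")

    dest_path = PurePosixPath(dest)
    if not dest_path.is_absolute() or ".." in dest_path.parts:
        raise ValueError(f"dest must be an absolute path without '..': {dest!r}")

    # Only allow targets at or below an explicitly listed root.
    if not any(dest_path == root or root in dest_path.parents
               for root in allowed_roots):
        raise ValueError(f"dest {dest!r} is outside the allowed roots")

    # Returning a list (not a shell string) avoids quoting and
    # word-splitting surprises when it is later passed to subprocess.
    return ["rsync", "-a", "--delete", source, str(dest_path) + "/"]
```

With these assumptions, build_rsync_argv("rsync.example.org::yang/",
"/srv/example/yang/iana") returns a usable argument list, while a
target of "/" or a bytes value raises ValueError before anything is
deleted.  Running the resulting command once with rsync's --dry-run
option is a further cheap check after an upgrade.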

Fortunately, just prior to that script's execution, one of AMS'
offsite backup systems had grabbed a complete copy of the data on the
new server.  So, with the exception of approximately 2 hours of
traffic during which the operating servers were impaired (roughly
0815-1015 GMT Tuesday morning), no data was lost. However,
given the estimated time to restore that data (3-4 hours over the
Internet), and given that there could be other unknowns in the
software we hadn't yet identified, the optimal course was clearly to
bring the old server group back online, which I did, restoring service
using the old server approximately 3 hours after the problem started.
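
The difference between the hot backups and the offsite copy described
above can be shown with a toy model.  This is only a sketch: the
dictionaries stand in for server data, and the names mirror and
take_snapshot are made up for the example.  A mirror always follows the
primary's current state, deletions included, while a point-in-time
snapshot keeps whatever existed when it was taken.

```python
import copy


def mirror(primary):
    """Hot backup: always reflects the primary's current state."""
    return copy.deepcopy(primary)


def take_snapshot(primary, snapshots, label):
    """Point-in-time backup: store an independent copy under a label."""
    snapshots[label] = copy.deepcopy(primary)


# Toy stand-in for the data on the new server.
primary = {"drafts/a.txt": "v1", "archive/msg1": "hello"}
snapshots = {}

take_snapshot(primary, snapshots, "before-cron")  # offsite copy taken first
primary.clear()                                   # stand-in for the faulty deletion
hot_backup = mirror(primary)                      # mirror picks up the change

print(len(hot_backup))                  # the mirror is now empty too
print(len(snapshots["before-cron"]))    # the snapshot still holds both items
```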

At this time, AMS programmers are working on merging yesterday's Mail
Archive data into the live archive, while the Tools Team members are
working on merging drafts and Datatracker data and other information
they manage into the live system. We will send an update when this
process is completed, and we will then consult with IETF leadership
and schedule a new cutover event in the near future.

Thank you for your patience.

Glen
--
Glen Barney
IT Director
AMS (IETF Secretariat)