Re: [dtn-users] DTN2 - Frequent Spinlock errors

Nik Ansell <nikansell00@gmail.com> Wed, 27 January 2016 15:34 UTC

Return-Path: <nikansell00@gmail.com>
X-Original-To: dtn-users@ietfa.amsl.com
Delivered-To: dtn-users@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id D97751B386B for <dtn-users@ietfa.amsl.com>; Wed, 27 Jan 2016 07:34:43 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: 2.151
X-Spam-Level: **
X-Spam-Status: No, score=2.151 tagged_above=-999 required=5 tests=[BAYES_50=0.8, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, J_CHICKENPOX_39=0.6, J_CHICKENPOX_82=0.6, SPF_PASS=-0.001] autolearn=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 8QKILGiHWbst for <dtn-users@ietfa.amsl.com>; Wed, 27 Jan 2016 07:34:41 -0800 (PST)
Received: from mail-io0-x230.google.com (mail-io0-x230.google.com [IPv6:2607:f8b0:4001:c06::230]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 2C6361B388F for <dtn-users@irtf.org>; Wed, 27 Jan 2016 07:34:41 -0800 (PST)
Received: by mail-io0-x230.google.com with SMTP id f81so22655571iof.0 for <dtn-users@irtf.org>; Wed, 27 Jan 2016 07:34:41 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:in-reply-to:references:date:message-id :subject:from:to:content-type; bh=hlO2PYht/whCVczJ/zrTAYthq0efk0Dapd38dImID4o=; b=wpVSlrNNvVPlb6gPGl2GRE1xFK041j/C92fXFFbdpmAEkxkHY/VaKZ/gwtiIAhbp7e SGxxCtisDrJPj4/S4YjKe7f/SjllJVDtysMqvtHYKQD/HWjaAqUrtSPpo8D3jgE2E6yx 543qAdA3eGoG4oqklrpugceUvUKm57b9U0ZbL/UmkmXuML0ZgrR1RhEqKfD5HFREROTB LLWHUX7k66gqyRMxmtf2U+i+0B7j+uYCVqDhulPFzqkWMbpj2nRKPra7q1VBHRyaa2y0 WNDB+hnpljiNtyqX8w0X9lOJpNLwxu56WtkWLYN4L0u5yr48yKRtC7fnNwMaZpRtyZLC l6kA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:reply-to:in-reply-to:references :date:message-id:subject:from:to:content-type; bh=hlO2PYht/whCVczJ/zrTAYthq0efk0Dapd38dImID4o=; b=GdIT9BDmkhwoXmoBLOHXQtCRmyvG+QnC9aDlqnZUk7LK+H17R6SVvEqVDSIakB2+B6 nyAcrnDUQm1YXhAjV6DHf6tIeVs9ZN70EVC7kzUzR8kG+Mb+951qE1BlKiJLk/QXG4dk aT2VdXQ6VTAKFBtcR2rksh2/vCY3jtFIk4znE7MvN5GkNaPj8bngr9zT9F/BnIs7Fwmq WJjNHglTmuD68frKSyvjNYmQkxCI3j/MPX4Q5SDvgVS5bVirVkSo/0okjjBbTSfA08W3 W7jjYJFAWkCfDatTPvYWakrwbziCD3ioPJfRwmK0N43NuacwwF/ZwMcaoYTmwzYRLnok cgLA==
X-Gm-Message-State: AG10YOThW/n+V/zlUe1J+NXuLnqMBN0I5PVk+UUGnOVHvsIR7oEqGedTp7gLqmJqOZt0dlLB+UnChkW/rs9Wdw==
MIME-Version: 1.0
X-Received: by 10.107.169.101 with SMTP id s98mr29216718ioe.126.1453908880336; Wed, 27 Jan 2016 07:34:40 -0800 (PST)
Received: by 10.79.24.2 with HTTP; Wed, 27 Jan 2016 07:34:40 -0800 (PST)
In-Reply-To: <CAKLzrV_WC5i-rGjv0MDFDG4YhBKu3Si4qjXa=UztjDkQ3pLtYQ@mail.gmail.com>
References: <CAKLzrV_0Khan5YbJqyv+--ZHy78DnjZnrscXVMpFjZafq772tA@mail.gmail.com> <CAKLzrV_WC5i-rGjv0MDFDG4YhBKu3Si4qjXa=UztjDkQ3pLtYQ@mail.gmail.com>
Date: Wed, 27 Jan 2016 19:34:40 +0400
Message-ID: <CAKLzrV-JJ=gb28w745OwvZ5iuZZA6oTLvFfmDhrTvas6Bp4U_Q@mail.gmail.com>
From: Nik Ansell <nikansell00@gmail.com>
To: dtn-users@irtf.org
Content-Type: multipart/alternative; boundary="001a1142e192592f91052a528adc"
Archived-At: <http://mailarchive.ietf.org/arch/msg/dtn-users/rg7ItNKjPtAGNQlHoSs6_AA5HT8>
Subject: Re: [dtn-users] DTN2 - Frequent Spinlock errors
X-BeenThere: dtn-users@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
Reply-To: nikansell00@gmail.com
List-Id: "The Delay-Tolerant Networking Research Group \(DTNRG\) - Users." <dtn-users.irtf.org>
List-Unsubscribe: <https://www.irtf.org/mailman/options/dtn-users>, <mailto:dtn-users-request@irtf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/dtn-users/>
List-Post: <mailto:dtn-users@irtf.org>
List-Help: <mailto:dtn-users-request@irtf.org?subject=help>
List-Subscribe: <https://www.irtf.org/mailman/listinfo/dtn-users>, <mailto:dtn-users-request@irtf.org?subject=subscribe>
X-List-Received-Date: Wed, 27 Jan 2016 15:34:44 -0000

Hello All,

I think I have found a fix for this, it is rather crude but seems to work
consistently well after putting my test-bed through its paces using TCPCL,
UDPCL and LTPCL.

In my testing I added some debug code to SpinLock.cc which printed the Lock
class properties to the dtnd.log file. I noticed that adding the debug code
actually caused the spinlock errors to appear less frequently. I also
noticed that when the spinlock error loop occurs, the lock_holder_name_
property of the Lock class (the thread that created the lock) was NULL. In
some cases I also received an error (ASSERTION FAILED (is_locked_by_me())
at thread/SpinLock.cc:91) - this is the check at the start of the unlock
function to ensure that a thread is not trying to unlock memory that was
locked by another thread.

The most common spin lock errors I experienced were:

sending node: OpenFdCache::unpin waiting for spinlock held by null
receiving node: BundleList::del_bundle waiting for lock held by null

I developed two changes (below) to the oasys-1.6.0 build, which resulted in
the ASSERTION FAILED error message and spin lock error message going away
and much better overall performance on my test-bed. Then of course I reran
make and make install to rebuild and install the required files.

I have never written a line of c++ code in my life, nor did I know anything
about linux spinlocks before my digging began.
Therefore I am open to any suggestions to improve the code, or to learn of
any issues I may have caused by implementing these changes. I appreciate
this will potentially impact the performance of some systems, but as the
pause only occurs while waiting to acquire a lock I suspect this will not
be too noticeable, as it yields the thread (i.e passes the context back to
the CPU) before pausing.

1. Add the following code to the start of the function, to allow locking or
unlocking when lock_holder_name_ is NULL.

Lock.h -> is_locked_by_me():

    bool is_locked_by_me()
    {

        if (lock_holder_name_ == NULL) {
        return is_locked() && true;
        }

2. Add the following 11,000 microsecond pause, if the thread is waiting to
acquire a lock:

SpinLock.cc -> SpinLock::lock():

    while (atomic_cmpxchg32(&lock_count_, 0, 1) != 0)
    {
        Thread::spin_yield();
        // 0.011 sec sleep
        usleep(11000);

Kind Regards,
Nik

On Mon, Jan 25, 2016 at 11:19 AM, Nik Ansell <nikansell00@gmail.com> wrote:

> I've done some more digging and have a few observations (below), I will
> continue down this path but think I may be out of my depth already! :)
>
> 1. SpinLock.h: SpinLock::lock():
> a) The error message appearing in the log (below) has a null
> lock_holder_name_ - Does this mean the lock is invalid (i.e perhaps the
> calling thread terminated due to an exception) and can be safely released?
> fprintf(stderr, "warning: %s is waiting for spin lock held by %s, which
> has reached spin limit\n", lock_user, lock_holder_name_)
>
> b) The call to (Thread.h) Thread::spin_yield()  (which calls either
> thread_id_yield() or sched_yield() ). Is possibly failing in some cases -
> not sure how to debug this one....Has anyone any idea if I can somehow
> force the release of the lock if (as in above) lock_holder_name_ == NULL?
>
> c) The function atomic_cmpxchg32(&lock_count_,0,1) !=0 called is contained
> within several Atomic-<CPUARCH>.h files e.g. Atomic-arm.h or Atomic-x86.h -
> I am not sure how to confirm the correct header file was used during the
> build, but will try to build oasys and dtn2 again using the
> --host=arm-linux argument to configure to see if this makes a difference.
>
> 2.  I have noticed there are some locking debugging functions available
> (Thread::lock_debugger->add_lock / remove_lock) if
> the --enable_debug_locking argument is passed during the build of oasys and
> dtn2. I am considering enabling this to get some more info on the problem.
>
> Kind Regards,
> Nik
>
> On Sun, Jan 24, 2016 at 12:05 PM, Nik Ansell <nikansell00@gmail.com>
> wrote:
>
>> Hello All,
>>
>> I am running some experiments to identify bundle delivery characteristics
>> of DTN2 (+LTPlib), ION and IBR-DTN on the Raspberry Pi B2. Currently I am
>> trying to identify reliable bundle delivery scenarios for all 3 DTN
>> implementations, so I can then apply certain network simulation scenarios
>> to observe how each DTN implementation and convergence layer behaves.
>>
>> The scope of my experimentation is below for info:
>>
>> ION: UDPCL, TCPCL, LTPCL
>> DTN2: UDPCL, TCLCP, LTPCL - using LTPlib
>> IBR-DTN: UDPCL, TCPCL
>>
>> I have determined reliable, repeatable scenarios for both ION and
>> IBR-DTN, however DTN2 is producing produce a number of errors when using
>> UDP, TCP or LTP. The errors also seem to appear randomly after any number
>> of for loop iterations, e.g. (5,13,50,48,95,413, 567, 988, etc etc).
>>
>> Errors that cause dtnd to use 200% CPU and stop delivering bundles:
>> warning: deliver_front is waiting for spin lock held by (null), which has
>> reached spin limit
>> warning: Bundle::del_ref is waiting for spin lock held by (null), which
>> has reached spin limit
>> warning: Bundle::is_queued_on is waiting for spin lock held by (null),
>> which has reached spin limit
>> warning: LinkBlockSet::find_blocks is waiting for spin lock held by
>> (null), which has reached spin limit
>> warning: BundleList::erase is waiting for spin lock held by (null), which
>> has reached spin limit
>> warning: Bundle::add_ref is waiting for spin lock held by (null), which
>> has reached spin limit
>> warning: ForwardingLog::get_count is waiting for spin lock held by
>> (null), which has reached spin limit
>>
>> Errors: that cause dtnd to quit:
>> LTP:ASSERTION FAILED (is_locked_by_me()) at thread/SpinLock.cc:91
>>
>> Errors that seem to loop forever but not cause dtnd to quit or use 200%
>> CPU (This error seems to happen consistently after sending 99 LTP bundles):
>> /dtn/cl/ltp/sender error] Unable to create Sender LTP Socket in 5 seconds
>> - retrying
>>
>> To send files I am calling dtnsource in a for loop as below, which allows
>> me to adjust the time between each bundle transmission, or each block of 10
>> bundle transmissions. For UDP and LTP I am using bundles of 63K, for TCP I
>> am using 1M bundles. I have tried several combinations for $WAIT (0-5
>> seconds) and $BULKWAIT (0-20 seconds), but cannot seem to find a working
>> combination.
>>
>> for ((i=1;i<=END;i++)); do
>>     echo "($COUNT) dtnsource -s dtn://tx.dtn/a -d dtn://rx.dtn/g -b $SIZE"
>>     dtnsource -s dtn://tx.dtn/a -d dtn://rx.dtn/g -b $SIZE
>>     sleep $WAIT
>>     let "COUNT++"
>>
>>     # Sleep for x secs after each 10
>>     if (( $COUNT % 10 == 0 ))
>>     then
>>     echo "Sleeping for $BULKWAIT seconds"
>>     sleep $BULKWAIT
>>     fi
>>
>> done
>>
>> The daemon is stopped and restarted after each test, I have also tried
>> deleting and recreating the berkley DB each time, but this does not seem to
>> have much effect. I have also managed to recreate some of the errors on a
>> virtual Ubuntu test-bed, specifically Bundle::del_ref.
>>
>> After reading through the oasys SpinLock::lock code, it looks like this
>> is possibly an infinite loop that tries up to 1,000,000 times to get a
>> lock, then resets the counter to zero and puts out a warning message.
>>
>> Has anyone had a similar experience or can suggest any troubleshooting
>> tips?
>> Any suggestions to workaround or fix the problem will be very gratefully
>> received!
>>
>> Kind Regards,
>> Nik
>>
>
>