Re: [dtn-users] DTN2 - Frequent Spinlock errors

Nik Ansell <nikansell00@gmail.com> Mon, 25 January 2016 07:19 UTC

Return-Path: <nikansell00@gmail.com>
X-Original-To: dtn-users@ietfa.amsl.com
Delivered-To: dtn-users@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A6D1C1ACEC1 for <dtn-users@ietfa.amsl.com>; Sun, 24 Jan 2016 23:19:23 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: 1.551
X-Spam-Level: *
X-Spam-Status: No, score=1.551 tagged_above=-999 required=5 tests=[BAYES_50=0.8, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, J_CHICKENPOX_39=0.6, SPF_PASS=-0.001] autolearn=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id RJ0UByHFqAA3 for <dtn-users@ietfa.amsl.com>; Sun, 24 Jan 2016 23:19:21 -0800 (PST)
Received: from mail-ig0-x235.google.com (mail-ig0-x235.google.com [IPv6:2607:f8b0:4001:c05::235]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 7BB601ACEBD for <dtn-users@irtf.org>; Sun, 24 Jan 2016 23:19:21 -0800 (PST)
Received: by mail-ig0-x235.google.com with SMTP id h5so25172629igh.0 for <dtn-users@irtf.org>; Sun, 24 Jan 2016 23:19:21 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:in-reply-to:references:date:message-id :subject:from:to:content-type; bh=Z5LGOlv+TsJRmZyWjyZZXiPYY6V+5k2C8cJmGI53EEk=; b=LLpCXNRg6iNQYoxzcMIgCzADeKnjPR2KDdb5iaYeDyff1/7MS2zKvFQicqgkHiXa77 NExdQkeYgvu7KeUU+w6xyLc9xkkblyXcWh6zhMuTKzUGzkFB3PuuoH6ZlSYCUDNnoNIp lS2Jy6ndDONbOIBtd+ToLq8r+vI+7OKaPqYZuPowuRgg1fx2kJsxjhIVNQBDU1YHzbpJ wy4P9ogjgFJQuaZoLIuY2KGE40hb1MJ/61IcFXhpVspjGtpmxBV32WNy86lCffIYHihc r+hAS/H/P/+xWczvPEk0L5fEOGBQ9YUssb7F2GLNWfkDIL9EJDUoGSR58xzi5ZlucxZl 32AA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:reply-to:in-reply-to:references :date:message-id:subject:from:to:content-type; bh=Z5LGOlv+TsJRmZyWjyZZXiPYY6V+5k2C8cJmGI53EEk=; b=J/S4FyNmPXCgrOrwWMtYNrZwftRe3ZntFLQOwW5YKLlSxSHAo2U3+GIovPgcuBynZw FwW5/q3TKOf7ZaY3BcZ/ZNxooX3t3+KVdVf8zD2CjuT/aPiCXX/NKIlI2OXBTFls2fna 3c7H6zvbevGzJiGEYhDD+gQmjPrmRHuw1jWJRgtv7jdKBqT4MuyQDHDvEX7+JnF6BoiN VV7E7JUdJCgRFJCpd3aZBcE+EzCM3AXYx6rsWrFsm6OhTxFTlnC7dEEGHtp5dwohEQPI MLJ/1kLaRyGjomWGBa0nN/n5ipZBqDodgbAeDY6SZ8miCHycOzqDBPvtpsaHAAX5EAu7 uREA==
X-Gm-Message-State: AG10YOSFt/BemKfe9M4sBYA0a/qInSgiw7sw2Bg7ZrC1F03pwiB4fng2WoJ6Ejvf6UAqHXNRC+xLqjNO3YUAow==
MIME-Version: 1.0
X-Received: by 10.50.147.34 with SMTP id th2mr11304203igb.27.1453706360605; Sun, 24 Jan 2016 23:19:20 -0800 (PST)
Received: by 10.79.24.2 with HTTP; Sun, 24 Jan 2016 23:19:20 -0800 (PST)
In-Reply-To: <CAKLzrV_0Khan5YbJqyv+--ZHy78DnjZnrscXVMpFjZafq772tA@mail.gmail.com>
References: <CAKLzrV_0Khan5YbJqyv+--ZHy78DnjZnrscXVMpFjZafq772tA@mail.gmail.com>
Date: Mon, 25 Jan 2016 11:19:20 +0400
Message-ID: <CAKLzrV_WC5i-rGjv0MDFDG4YhBKu3Si4qjXa=UztjDkQ3pLtYQ@mail.gmail.com>
From: Nik Ansell <nikansell00@gmail.com>
To: dtn-users@irtf.org
Content-Type: multipart/alternative; boundary="089e014952383b59c5052a2363f1"
Archived-At: <http://mailarchive.ietf.org/arch/msg/dtn-users/fldvyNq6aqF6kp173fZ_yV085k0>
Subject: Re: [dtn-users] DTN2 - Frequent Spinlock errors
X-BeenThere: dtn-users@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
Reply-To: nikansell00@gmail.com
List-Id: "The Delay-Tolerant Networking Research Group \(DTNRG\) - Users." <dtn-users.irtf.org>
List-Unsubscribe: <https://www.irtf.org/mailman/options/dtn-users>, <mailto:dtn-users-request@irtf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/dtn-users/>
List-Post: <mailto:dtn-users@irtf.org>
List-Help: <mailto:dtn-users-request@irtf.org?subject=help>
List-Subscribe: <https://www.irtf.org/mailman/listinfo/dtn-users>, <mailto:dtn-users-request@irtf.org?subject=subscribe>
X-List-Received-Date: Mon, 25 Jan 2016 07:19:23 -0000

I've done some more digging and have a few observations (below), I will
continue down this path but think I may be out of my depth already! :)

1. SpinLock.h: SpinLock::lock():
a) The error message appearing in the log (below) has a null
lock_holder_name_ - Does this mean the lock is invalid (i.e perhaps the
calling thread terminated due to an exception) and can be safely released?
fprintf(stderr, "warning: %s is waiting for spin lock held by %s, which has
reached spin limit\n", lock_user, lock_holder_name_)

b) The call to (Thread.h) Thread::spin_yield()  (which calls either
thread_id_yield() or sched_yield() ). Is possibly failing in some cases -
not sure how to debug this one....Has anyone any idea if I can somehow
force the release of the lock if (as in above) lock_holder_name_ == NULL?

c) The function atomic_cmpxchg32(&lock_count_,0,1) !=0 called is contained
within several Atomic-<CPUARCH>.h files e.g. Atomic-arm.h or Atomic-x86.h -
I am not sure how to confirm the correct header file was used during the
build, but will try to build oasys and dtn2 again using the
--host=arm-linux argument to configure to see if this makes a difference.

2.  I have noticed there are some locking debugging functions available
(Thread::lock_debugger->add_lock / remove_lock) if
the --enable_debug_locking argument is passed during the build of oasys and
dtn2. I am considering enabling this to get some more info on the problem.

Kind Regards,
Nik

On Sun, Jan 24, 2016 at 12:05 PM, Nik Ansell <nikansell00@gmail.com> wrote:

> Hello All,
>
> I am running some experiments to identify bundle delivery characteristics
> of DTN2 (+LTPlib), ION and IBR-DTN on the Raspberry Pi B2. Currently I am
> trying to identify reliable bundle delivery scenarios for all 3 DTN
> implementations, so I can then apply certain network simulation scenarios
> to observe how each DTN implementation and convergence layer behaves.
>
> The scope of my experimentation is below for info:
>
> ION: UDPCL, TCPCL, LTPCL
> DTN2: UDPCL, TCLCP, LTPCL - using LTPlib
> IBR-DTN: UDPCL, TCPCL
>
> I have determined reliable, repeatable scenarios for both ION and IBR-DTN,
> however DTN2 is producing produce a number of errors when using UDP, TCP or
> LTP. The errors also seem to appear randomly after any number of for loop
> iterations, e.g. (5,13,50,48,95,413, 567, 988, etc etc).
>
> Errors that cause dtnd to use 200% CPU and stop delivering bundles:
> warning: deliver_front is waiting for spin lock held by (null), which has
> reached spin limit
> warning: Bundle::del_ref is waiting for spin lock held by (null), which
> has reached spin limit
> warning: Bundle::is_queued_on is waiting for spin lock held by (null),
> which has reached spin limit
> warning: LinkBlockSet::find_blocks is waiting for spin lock held by
> (null), which has reached spin limit
> warning: BundleList::erase is waiting for spin lock held by (null), which
> has reached spin limit
> warning: Bundle::add_ref is waiting for spin lock held by (null), which
> has reached spin limit
> warning: ForwardingLog::get_count is waiting for spin lock held by (null),
> which has reached spin limit
>
> Errors: that cause dtnd to quit:
> LTP:ASSERTION FAILED (is_locked_by_me()) at thread/SpinLock.cc:91
>
> Errors that seem to loop forever but not cause dtnd to quit or use 200%
> CPU (This error seems to happen consistently after sending 99 LTP bundles):
> /dtn/cl/ltp/sender error] Unable to create Sender LTP Socket in 5 seconds
> - retrying
>
> To send files I am calling dtnsource in a for loop as below, which allows
> me to adjust the time between each bundle transmission, or each block of 10
> bundle transmissions. For UDP and LTP I am using bundles of 63K, for TCP I
> am using 1M bundles. I have tried several combinations for $WAIT (0-5
> seconds) and $BULKWAIT (0-20 seconds), but cannot seem to find a working
> combination.
>
> for ((i=1;i<=END;i++)); do
>     echo "($COUNT) dtnsource -s dtn://tx.dtn/a -d dtn://rx.dtn/g -b $SIZE"
>     dtnsource -s dtn://tx.dtn/a -d dtn://rx.dtn/g -b $SIZE
>     sleep $WAIT
>     let "COUNT++"
>
>     # Sleep for x secs after each 10
>     if (( $COUNT % 10 == 0 ))
>     then
>     echo "Sleeping for $BULKWAIT seconds"
>     sleep $BULKWAIT
>     fi
>
> done
>
> The daemon is stopped and restarted after each test, I have also tried
> deleting and recreating the berkley DB each time, but this does not seem to
> have much effect. I have also managed to recreate some of the errors on a
> virtual Ubuntu test-bed, specifically Bundle::del_ref.
>
> After reading through the oasys SpinLock::lock code, it looks like this is
> possibly an infinite loop that tries up to 1,000,000 times to get a lock,
> then resets the counter to zero and puts out a warning message.
>
> Has anyone had a similar experience or can suggest any troubleshooting
> tips?
> Any suggestions to workaround or fix the problem will be very gratefully
> received!
>
> Kind Regards,
> Nik
>