[Din] Rorur : decentralized search engine

Stan Srednyak <stan.sredn@gmail.com> Mon, 07 November 2022 22:51 UTC

From: Stan Srednyak <stan.sredn@gmail.com>
Date: Mon, 07 Nov 2022 17:50:41 -0500
Message-ID: <CAE-786gYOpoE0rgLZX_ZkRGi15sXyLV65fDBL2L+Tq4kYC_rQw@mail.gmail.com>
To: Din@irtf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/din/VfaplduooVYxrzkxqk10eJLhPZI>
Subject: [Din] Rorur : decentralized search engine

Dear colleagues,

This is an update from rorur.com, the distributed search engine project.

We have successfully implemented the basic functionality of the search
engine (data crawl, indexing, ranking, and query service) and tested it on
moderate-size clusters on AWS.

We would now like to move to the next stage of decentralizing web search:
a full web crawl and data analysis. This is a computation for which we do
not have the resources. We estimate (see below) that a meaningful operation
would need ~100 8-CPU machines, each with a 2 TB SSD and a Gb connection.
If you are interested, join us in this effort to build a decentralized,
open-source search engine.

We have hardware partners who can provide the necessary servers, distributed
worldwide. At the moment they charge $300/month per server, bandwidth
included. Of course, you can also run your own node, as the network is
decentralized and free to join. Smaller machines work too: there is in fact
no minimum hardware requirement, and you can run a node on your laptop in
the background.

There are also multiple algorithmic and programming issues that our team is
working on. They mostly center on distributed verifiable computing over web
data, with the aim of constructing distributed versions of knowledge graphs,
which are known to be of central importance for high-quality search. If you
have the qualifications to carry out such research, you are welcome to
contact us.


Feasibility analysis:

Our code runs successfully on an 8-CPU machine with an SSD. With a Gb
connection we saw such a machine sustain 100 pages/second for crawl and
analysis. Given ~5*10^10 pages on the web, 100 such machines would process
~10^4 pages/second, so we estimate that the crawl can be completed in
~2 months. According to estimates from Common Crawl, the total indexable
data is on the order of a few PB. We perform data cleaning, which includes
removal of JS and non-English characters. This results in at least a
10-fold reduction in size (we typically see substantially more), which
brings the cleaned data down to roughly a few hundred TB. The index size is
comparable to the data. Rank data is usually much smaller, for the ranks
that we tried. We estimate that 100 to 200 TB of total disk space should be
enough for a proof-of-principle demo.
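
For anyone who wants to check the arithmetic, here is a small
back-of-envelope sketch of the estimate above. The inputs are the figures
quoted in this message; the value of 3 PB for "a few PB" is purely an
assumption made for illustration, and the function and constant names are
hypothetical, not taken from our code.

```python
# Back-of-envelope check of the feasibility figures quoted in this message.
# This is an illustrative sketch, not part of the search engine code.

SECONDS_PER_DAY = 86_400

PAGES_ON_WEB = 5e10            # ~5*10^10 pages (estimate from the message)
PAGES_PER_SEC_PER_NODE = 100   # observed crawl + analysis rate per machine
NODES = 100                    # proposed cluster size
SSD_TB_PER_NODE = 2            # 2 TB SSD per machine

RAW_DATA_PB_ASSUMED = 3.0      # ASSUMPTION: stand-in value for "a few PB"
CLEANING_REDUCTION = 10        # "at least 10-fold" reduction from cleaning


def crawl_days(pages, rate_per_node, nodes):
    """Days needed to crawl `pages` at `rate_per_node` pages/s on each node."""
    return pages / (rate_per_node * nodes) / SECONDS_PER_DAY


def cleaned_size_tb(raw_pb, reduction_factor):
    """Size of the cleaned data in TB (1 PB taken as 1000 TB)."""
    return raw_pb * 1000 / reduction_factor


def cluster_capacity_tb(nodes, tb_per_node):
    """Total SSD capacity of the cluster in TB."""
    return nodes * tb_per_node


if __name__ == "__main__":
    days = crawl_days(PAGES_ON_WEB, PAGES_PER_SEC_PER_NODE, NODES)
    cleaned = cleaned_size_tb(RAW_DATA_PB_ASSUMED, CLEANING_REDUCTION)
    capacity = cluster_capacity_tb(NODES, SSD_TB_PER_NODE)
    print(f"crawl time:       {days:.0f} days")
    print(f"cleaned data:     {cleaned:.0f} TB (at 10-fold reduction)")
    print(f"cluster capacity: {capacity} TB")
```

With these inputs the crawl takes about 58 days, which matches the ~2 month
estimate, and the 100 machines together provide 200 TB of SSD. Under the
assumed 3 PB of raw data, a 10-fold reduction alone leaves more cleaned data
than the cluster holds, so the demo relies on the larger reduction we
typically see in practice.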



Best regards,
Stan Srednyak