[Din] Rorur : decentralized search engine
Stan Srednyak <stan.sredn@gmail.com> Mon, 07 November 2022 22:51 UTC
From: Stan Srednyak <stan.sredn@gmail.com>
Date: Mon, 07 Nov 2022 17:50:41 -0500
Message-ID: <CAE-786gYOpoE0rgLZX_ZkRGi15sXyLV65fDBL2L+Tq4kYC_rQw@mail.gmail.com>
To: Din@irtf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/din/VfaplduooVYxrzkxqk10eJLhPZI>
Subject: [Din] Rorur : decentralized search engine
List-Id: "Discussion of distributed Internet Infrastructure approaches, aspects such as Service Federation, and underlying technologies" <din.irtf.org>
Dear colleagues,

This is an update from rorur.com, the distributed search engine project. We have implemented the basic functionality of the search engine (crawling, indexing, ranking, and query service) and tested it on moderate-size clusters on AWS.

We would now like to move to the next stage of decentralizing web search: a full web crawl and data analysis. This is a computation for which we do not have the resources. We estimate (see the feasibility analysis at the end of this message) that meaningful operation would need ~100 8-CPU machines, each with a 2 TB SSD and a Gb connection.

If you are interested, join us in this effort to build a decentralized, open-source search engine. We have hardware partners who can provide the necessary servers, distributed worldwide. At the moment they charge $300/month per server, bandwidth included. Of course, you can also run your own node, since the network is decentralized and free to join. Smaller machines work too: there is in fact no minimum hardware requirement, and you can run a node on your laptop in the background.

There are also several algorithmic and programming problems on which our team is working. They mostly center on distributed verifiable computing over web data, with the aim of constructing distributed versions of knowledge graphs, which are known to be of central importance for high-quality search. If you have the qualifications to carry out such research, you are welcome to contact us.

Feasibility analysis:

Our code runs successfully on an 8-CPU machine with an SSD. With a Gb connection we saw 100 pages/second per machine for crawl and analysis. Given ~5*10^10 pages on the web, 100 such machines process ~10^4 pages/second in aggregate, so we estimate that the crawl can be completed in ~2 months.

According to estimates from Common Crawl, the total indexable data is on the order of a few PB. We perform data cleaning, which includes removal of JS and non-English characters. This results in at least a 10-fold reduction in size (we typically see substantially more), which brings the cleaned data down to a few hundred TB. The index size is comparable to the data. Rank data is usually much smaller, for the ranks that we tried. We estimate that 100-200 TB of total disk space should be enough for a proof-of-principle demo.

Best regards,
Stan Srednyak
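For readers who want to check the arithmetic in the feasibility analysis, here is a minimal sketch in Python. The page count, per-machine rate, machine count, SSD size, and 10-fold reduction are the figures quoted in the message. The function names are made up for illustration, and the value of 3 PB is only an assumed stand-in for "a few PB", not a number from Common Crawl or from the project.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(total_pages: float, machines: int, pages_per_sec: float) -> float:
    """Days needed to crawl total_pages when every machine sustains pages_per_sec."""
    aggregate_rate = machines * pages_per_sec  # pages/second across the cluster
    return total_pages / aggregate_rate / SECONDS_PER_DAY


def cleaned_size_tb(raw_pb: float, reduction_factor: float) -> float:
    """Size in TB after cleaning shrinks raw_pb petabytes by reduction_factor (1 PB = 1000 TB)."""
    return raw_pb * 1000 / reduction_factor


def cluster_capacity_tb(machines: int, ssd_tb: float) -> float:
    """Total SSD capacity of the cluster in TB."""
    return machines * ssd_tb


if __name__ == "__main__":
    # Figures from the message: ~5*10^10 pages, 100 machines, 100 pages/s each.
    days = crawl_days(5e10, machines=100, pages_per_sec=100)
    print(f"crawl time: {days:.1f} days")  # prints "crawl time: 57.9 days"

    # ASSUMPTION: 3 PB stands in for "a few PB"; 10x is the stated minimum reduction.
    print(f"cleaned data: {cleaned_size_tb(3, 10):.0f} TB")
    print(f"cluster SSD capacity: {cluster_capacity_tb(100, 2):.0f} TB")
```

Under these assumptions, the cleaned data at the minimum 10-fold reduction (300 TB) would exceed the 200 TB of SSD on 100 machines, so fitting the full cleaned corpus depends on the larger reduction factors that the message says are typical.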