Speeding up apt-file

Almost every GNU/Linux distribution provide means to tell what package does given file in your file system belong to. Debian based distribution allow you to tell what package you have to install in order to get given file in your file system.

You may want it, if

  • you are compiling from source some software, and build fails due missing headers or libraries. For example, what package provides X11/keysym.h?
  • you saw some article on web, that mentions some command, and you have no idea where does it comes from. Did you knew, for example, that program xxd, which converts file into declaration of C byte array of same content, is part of vim?
  • you was told to use HTML::LinkExtractor perl module and you are interested where to get it. (No, ad-hoc installs via cpan are not good enough)

To do all above and more, you can use apt-file program. With apt-file you do not need search web or ask people on IRC:

$ apt-file find stupid-header.h
$ apt-file find /bin/xxd

and you are done. Actually, you can use power of perl regular expressions to search for files. But this power comes at price.

And what the problem?

apt-file is very slow:

$ apt-file find /bin/xxd
vim-common: /usr/bin/xxd
vim-dbg: /usr/lib/debug/usr/bin/xxd
xxd: /usr/bin/xxd
xxdiff: /usr/bin/xxdiff
12.34user 0.55system 0:10.19elapsed 126%CPU (0avgtext+0avgdata 15388maxresident)k
0inputs+0outputs (0major+9449minor)pagefaults 0swaps

What worse, speed does not improve on subsequent invocation.

Why the problem?

Let’s recall how it all works. If you have following line in your /etc/apt/sources.list:

deb http://ftp.ru.debian.org/debian/ jessie main

it means, that you can download contents file for your architecture like that:

$ wget http://ftp.ru.debian.org/debian/dists/jessie/main/Contents-i386.gz

where jessie is release name and main is it’s component. This file (uncompressed) have some header and sorted lines like this:

etc/sudoers          admin/sudo
etc/init.d/sudo      admin/sudo
etc/sudoers.d/README admin/sudo
...
usr/bin/fontforge    fonts/fontforge,fonts/fontforge-nox

where first part is name of file (yes, leading slash is missing) and second one is section and name of binary package. There can be more then one of binary packages, owning same file, in which case names of binary packages are separated by comma.

For all repositories I included (jessie, jessie-backports, testing, sid, experimental), it nets over 16 millions lines and around 2Gb uncompressed size.

There are multitude of ways to efficiently store data for exact search – from GNU dbm database to full text search engine Xapian. But none for efficient regular expression search.

So what apt-file does is a little more advanced then zgrep <pattern> Content.gz. It means over 16 millions of regular expression match attempts. No wonders that it takes time.

Is there better way?

If we want to retain full power of perl regular expression – no. But do we really need it? Personally, I usually use it when I am looking for some C header of executable file. I can also imagine the need to locate what package provides some Perl, Python or Haskell module.

I never experienced the need to use regular expression search. Actually, often I forget that search is regex-based and get surprised, when I receive too much results.

Alternative solution would be program of following interface:

$ apt-seek --bin 2to3
$ apt-seek --man posix_spawn
$ apt-seek --header X11/keysum.h
$ apt-seek --header keysum.h
$ apt-seek --lib gdbm
$ apt-seek --haskell Data.Profunctor

Such program would maintain separate databases for every category of search queries (headers database, libraries database, ...). Such solution would not be able to obsolete apt-file, since it is less powerful, and I expect them to co-exist for rather long time.

What is done?

Crude implementation in C and shell is available from my git repository

git clone https://anonscm.debian.org/cgit/users/kaction-guest/apt-seek.git

As thoroughly discussed at #854615 this implementation has a lot of drawbacks and should actually be incorporated into apt-file. But for now apt-seek works fine for me, so incorporating into apt-file is not top of my priorities.

On suckless software

Warning: this article is not purely technical. Opinions, reasoning and contradictions are present.

Conflict of interests is terrible thing. It is very unpleasant, when there are two principles that you follow and some day you encounter subject on which they contradict. It happened with me and software.

First, and primary requirement on software must be free. No exceptions. But there is another consideration. With everything else equal, GPL is better then BSD/MIT/other-non-copyleft-junk. GPLv2+ is better then GPLv2 and so on. But what about cases, where not everything else is equal?

Lets compare GNU Screen (v.4.2.1) and tmux(v2.1) – last released version at time of writing. Screen is GPLv3+, tmux is BSD-3-clause. On my personal experience, tmux is slightly simpler to use. Also, it have feature to number panes starting with 1. Implementing it in Screen is rather complicated and fragile. With all this, Screen contains 40015 lines of Ansi C code, tmux – 34859. And in my opinion, tmux codebase is simpler to understand. Not a dealmaker? Right, read on.

Some time ago I learned about [suckless.org]. It is better to just quote their philosophy:

Designing simple and elegant software is far more difficult than
letting ad-hoc or over-ambitious features obscure the code over
time.  However one has to pay this price to achieve reliability
and maintainability.  Furthermore, minimalism results in
reasonable and attainable goals. We strive to maintain minimalism
and clarity to drive development to completion.

And I was convinced, that most software I use is really bloated. Lets continue the terminal multiplexer theme and consider dvtm-0.12, present in Debian jessie. With configuration at build time, it manages to deliver same functionality as tmux or GNU Screen, within 3925 lines of code. Latest version, v0.15 have even less lines of code – 3848. One can blame me, that I am too concerned with lines counting. In other cases I would agree, that ~100 lines difference is nothing. But let me tell you what was under hood. Version 0.12 implemented what tmux calls copy-mode itself – movement, selection and so on. Version 0.15 dropped it in favor of pipe: pipe all terminal history to external process, read it’s output and here is paste buffer. Brilliant solution in unix way.

But real revelation was text editor sandy. It is about ~2500 lines of code, which I can understand within several minutes. I always belived, that text editor is a beast of extereme complexity, like emacs with over a million lines of lisp code or vim with 300k lines of C code with load of conditional compilation. Why the hell support some shitty proprietary operating system that is not POSIX-complaint? No doubt, suckless software is not for everyone. By design, it is for advanced users.

Here is contradiction. From my point of view, suckless is really superior from techincal point of view and inferior from political. What matters more?

Here is my answer for myself. There is nothing wrong with using and improving free, but not copyleft software, although copyleft one would be better. Many blame GNU Software for beeing bloated, but let me advocate us [1]. GNU is how it all started. Free software started with GNU Manifesto, Emacs, gcc and gdb. Many GNU projects are first of their kind in free world. Bash, GCC, Screen, GDB, Coreutils – they all was first, and it is hard to be first. Every subsequent project was able to use wisdom of it’s ancestors mistakes. GNU mistakes. History is never kind to us.

And there is another consideration. GNU’s priority is freedom of users. And most of them will never write a line of C. If you ever set up a GNU/Linux system for a person, who can’t even invoke apt-get install in command line, you know that is is much, much harder, than setup for yourself. GNU, and many other “bloated” programs attempts to make happy as much non-programmers, as possible. It has it’s price, huge price, that suckless project refuses to pay. I no longer beleive that one day everyone will be GNU/Linux hacker, so there always will have to be someone to pay this price.

I take my hat off to developers, who is proficent with command line and scripting, but still volonter developing projects like Evolution, Dolphin or Gmpc. Sorry, but I do not have willpower to develop something I will not use. All I can do is diplomacy and release everything I create under GPLv3+.

[1]After all, I consider myself member of GNU, although I was not too active recently.

$EDITOR as unix filter

It is really, really strange, but none of major text editors – neither vim, nor emacs work as filter – I mean, you can’t pipe text into stdin, edit and have edited version on stdout. That is because stdout is expected to be tty, so editor writes control sequences on stdout, and we have expected behaviour. But we can do better – use stderr descriptor as tty, and save stdout for piping.

Here is code of little C99, POSIX-2008 program, that invokes your `$EDITOR’ as filter.

/* edit-stream.c --- invoke $EDITOR as filter */
/* Copyright (C) 2016 Dmitry Bogatov <KAction@gnu.org> */

/* This program is free software; you can redistribute it and/or */
/* modify it under the terms of the GNU General Public License */
/* as published by the Free Software Foundation; either version 3 */
/* of the License, or (at your option) any later version. */

/* This program is distributed in the hope that it will be useful, */
/* but WITHOUT ANY WARRANTY; without even the implied warranty of */
/* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the */
/* GNU General Public License for more details. */

/* You should have received a copy of the GNU General Public License */
/* along with this program. If not, see <http://www.gnu.org/licenses/>. */

#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

/* stdio.h would be overkill */
#define write2(str) write(2, str, sizeof(str))

int
main(int argc, char **argv)
{
    int return_value = 1;
    const char *tty = ttyname(2); /* POSIX.1-2001, did you knew? */
    if (!tty) {
        write2("stderr is not tty");
        return 1;
    }
    int in = open(tty, O_RDONLY);
    if (in == -1) {
        write2("failed to open tty for reading");
        return 1;
    }
    char tempname[] = "/tmp/edit-stream.XXXXXX";
    int tempfd = mkstemp(tempname);
    if (tempfd == -1) {
        write2("failed to open temporary file");
        goto err_close_tty;
    }
    /* POSIX does not mandates modes of temporary file. */
    if (fchmod(tempfd, 0600) == -1) {
        write2("failed to change mode of temporary file");
        goto err_remove_tempfile;
    }
    char buffer[2048];
    ssize_t bytes;
    while ((bytes = read(0, buffer, sizeof(buffer))) > 0) {
        /* Here actually must be loop, since write(2) does not guarates
         * that it will be able to write everything. But I am reckless.
                 */
        if (write(tempfd, buffer, bytes) != bytes) {
            write2("failed to write data to temporary file");
            goto err_remove_tempfile;
        }
    }
    if (fsync(tempfd) == -1) {
        write2("failed to fsync temporary file");
        goto err_remove_tempfile;
    }
    if (close(tempfd) == -1) {
        write2("failed to close temporary file");
        goto err_remove_tempfile;
    }
    if (dup2(in, 0) == -1) {
        write2("failed to set tty as stdin");
        goto err_remove_tempfile;
    }
    int stdout = dup(1);
    if (stdout == -1) {
        write2("failed to create copy of stdout");
        goto err_remove_tempfile;
    }
    /* Descriptor 2 (stderr) still points to tty */
    if (dup2(2, 1) == -1) {
        write2("failed to set tty as stdout");
        goto err_close_stdout;
    }
    const char *editor = getenv("EDITOR");
    if (!editor) {
        write2("EDITOR is not set");
        goto err_close_stdout;
    }
    pid_t pid = fork();
    if (pid == 0) {
        close(stdout);
        close(tempfd);
        close(in);
        execlp(editor, editor, tempname, NULL);
        _exit(129);
    }
    int status;
    if (wait(&status) == -1) {
        write2("wait failed");
        goto err_close_stdout;
    }
    if (!(WIFEXITED(status) && WEXITSTATUS(status) == 0)) {
        write2("editor invocation failed");
        goto err_close_stdout;
    }
    int tempfd_r = open(tempname, O_RDONLY);
    if (tempfd_r == -1) {
        write2("failed to open for reading edited temporary file");
        goto err_close_stdout;
    }
    while ((bytes = read(tempfd_r, buffer, sizeof(buffer))) > 0) {
        if (write(stdout, buffer, bytes) != bytes) {
            write2("failed to write data to stdout");
            goto err_close_tempfile_read;
        }
    }

    return_value = 0;

/* Clean up on error is best efford. Descriptors are closed, files
   are unlinked, but nothing is checked. */
err_close_tempfile_read:
    close(tempfd_r);
err_close_stdout:
    close(stdout);
err_remove_tempfile:
    close(tempfd);
    unlink(tempname);
err_close_tty:
    close(in);

    close(0);
    close(1);
    close(2);

    return return_value;
}