Web and CLI front-end to Paperwork (https://github.com/openpaperwork/paperwork)

This program comes in two parts:
* the shell search, that can be used on its own,
* the web interface, that needs the former.

= WHAT IS PAPERWEB? =

My goal is very specific: I want to have a web frontend to a collection of papers that are managed by Paperwork (from jflesch on GitHub), with the storage for these papers being on my server. Thus:
* I don't have Paperwork available on the server, only the data.
* It must be very lightweight because my server is lightweight too.
* It must be secure, as the data managed using Paperwork is important and private.

= SECURITY =

There are several aspects to consider:
* Access to the web frontend must be limited. This is handled by the web server.
* Data from the web frontend must not be trusted. This is handled by "the program" (more on this later).
* The web server as a whole must not be granted access to Paperwork's documents.

Web server security is outside the scope of this document. Read the Apache or Nginx documentation to set up some form of authentication (password, client certificate…).

Now the web frontend. It needs access to the documents. A naive solution would be to use POSIX ACLs on the documents, so that the web server user is allowed to read them. However, this would allow *any* program running on the web server to read the documents, not just my program. And other programs on the web server might not be constrained by the same authentication rules. For the same reason, I ruled out allowing the web server user to execute arbitrary find/grep/awk/etc. commands, let alone a shell such as bash.

The solution is to allow the web server user to execute a single command as another user, using sudo. This command then takes care of whatever is needed by the web frontend. This is much like the security model of forced commands in ssh, used by gitolite for example. This is why there are two parts to this program:
* the command-line interface, which is run on the server by a user with read access to the documents;
* the web interface, which is run on the web server, with no access to the documents except through the CLI.

Finally, all the input (search terms, etc.) is verified by the CLI script, so that any number of frontends can be written and benefit from the same strict controls.
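
As an illustration of what such a check can look like, here is a minimal sketch. It is not taken from paperfind.sh: the function names are hypothetical, and the folder-name layout is only assumed from the examples shown later in this README (such as 20151217_0000_01).

```shell
# Hypothetical helpers, not part of paperfind.sh.

# Accept only folder names shaped like 20151217_0000_01 (assumed layout).
# The pattern must match the whole string, so slashes, dots, spaces,
# quotes and newlines are all rejected.
is_valid_folder() {
    case $1 in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9]_[0-9][0-9])
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Accept only a page number between 1 and 9999 (the upper bound is arbitrary).
is_valid_page() {
    case $1 in
        [1-9] | [1-9][0-9] | [1-9][0-9][0-9] | [1-9][0-9][0-9][0-9])
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

A script would then reject a bad request early, for example with ```is_valid_folder "$folder" || exit 1```, before the value reaches find or grep.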

= INSTALLATION AND DEPENDENCIES =

The shell script may be copied anywhere. Its dependencies are standard, although I have only tested it on Debian and Arch Linux: bash, find, gawk, sed, grep, file, base64, ls. In addition, if pdfinfo and pdftoppm are installed, the script detects them and serves the pages of PDF documents individually (as JPEG images), instead of returning the whole document when a single page is requested.
Once the script is installed, the path to the Paperwork data must be set by editing the script. Optionally, the resolution in DPI (dots per inch) used when extracting pages from PDF documents may be adjusted as well.
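
The optional PDF handling could be sketched as follows. This is an assumption about the approach, not the script's actual code: the function names are hypothetical, and only common options of pdfinfo and pdftoppm (both from Poppler) are used.

```shell
# Hypothetical helpers, not part of paperfind.sh.

# Succeed only if both optional Poppler tools are found in PATH.
have_pdf_tools() {
    command -v pdfinfo >/dev/null 2>&1 && command -v pdftoppm >/dev/null 2>&1
}

# Read the output of pdfinfo on stdin and print the page count.
pdf_page_count() {
    awk '/^Pages:/ { print $2 }'
}

# Render one page as JPEG.
#   $1 = PDF file, $2 = page number, $3 = resolution in DPI, $4 = output prefix
# With -singlefile, pdftoppm writes "$4.jpg" without a page-number suffix.
render_pdf_page() {
    pdftoppm -f "$2" -l "$2" -r "$3" -jpeg -singlefile "$1" "$4"
}
```

For example, ```pdfinfo doc.pdf | pdf_page_count``` prints the number of pages of a (hypothetical) doc.pdf, and a caller would fall back to returning the whole document when ```have_pdf_tools``` fails.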

On the web side, the only mandatory files are ```index.php```, ```query.inc.php```, ```thumbs.inc.php```, and ```paperweb.ini.php```, the last of which should be created by copying and adapting the provided example file: ```example-paperweb.ini.php```. The code depends on PHP 5.2 or later (so that JSON support is included).
Other, optional, files are provided in order to offer better cache handling, by using HTTP GET requests through Ajax calls, instead of relying on the POST requests that are necessary for JavaScript-less plain HTML. These files are:
* ```ajax.js```: a very minimal Ajax helper;
* ```query.ajax.php``` and ```thumbs.ajax.php```: the receiving components of Ajax calls.
There is also the CSS file, which is optional too, although using it is much better, even though it could be improved.

Finally, ```visudo``` must be run so that the web server user is allowed to execute the CLI script, for example:
```
nginx ALL=(pwdataread) NOPASSWD:/usr/local/bin/paperfind.sh
```
Of course, with the above example, user "pwdataread" must have read access to Paperwork's data.
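
With such a rule in place, a front-end running as the web server user reaches the data only through sudo. A minimal sketch, reusing the user and path from the example above; the wrapper name and the PW_* variables are hypothetical:

```shell
# Names taken from the sudoers example; adapt them to the local setup.
PW_USER=pwdataread
PW_SCRIPT=/usr/local/bin/paperfind.sh

# Hypothetical wrapper: run the CLI as the data-reading user.
#   -n  never prompt for a password (fail instead, since NOPASSWD is expected)
#   -u  run as $PW_USER rather than root
# PW_SUDO may be set to another command to inspect the arguments
# without actually running sudo.
paperfind_as_reader() {
    ${PW_SUDO:-sudo} -n -u "$PW_USER" "$PW_SCRIPT" "$@"
}
```

A call such as ```paperfind_as_reader -T 20151217_0000_01``` then runs the script as "pwdataread" and fails without prompting if the sudoers rule is missing.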

= THE COMMAND-LINE INTERFACE =

Since there was a need to have a command-line interface anyway (see above), I decided it might as well be useful on its own, for searches through an SSH connection for example. For this reason, I chose JSON as the main output format of the CLI: it is easily read by humans as well as by computers. Besides, JSON is standard, which may ease the creation of alternative web interfaces (more complex, with AJAX and such), or even a desktop GUI. Anyway, the output is easily understood, and it can easily be made pretty, with a command like:
```
paperfind.sh [parameters] | json_pp
```

The program comes with its own documentation (use ```-h```), but here are some examples:
```
$ paperfind.sh -Q -d '201512|2016' -i -l 'sosh|yves' | json_pp
[
   {
      "etag" : "1453224546.0000000000",
      "folder" : "20160119_1828_14",
      "labels" : [
         "facture / taxe",
         "Sosh",
         "téléphonie",
         "Yves"
      ],
      "type" : "pdf",
      "count" : 11
   },
   {
      "count" : 3,
      "type" : "pdf",
      "etag" : "1452541168.0000000000",
      "labels" : [
         "facture / taxe",
         "Sosh",
         "téléphonie",
         "Yves"
      ],
      "folder" : "20151217_0000_01"
   }
]

$ paperfind.sh -T 20151217_0000_01 | json_pp
[
   {
      "etag" : "1452541166.0000000000",
      "mime" : "image/jpeg",
      "height" : 212,
      "data" : "…base64-encoded data…",
      "width" : 149
   },
   {
      "etag" : "1452541167.0000000000",
      "mime" : "image/jpeg",
      "width" : 149,
      "data" : "…base64-encoded data…",
      "height" : 212
   },
   {
      "etag" : "1452541167.0000000000",
      "mime" : "image/jpeg",
      "data" : "…base64-encoded data…",
      "width" : 149,
      "height" : 212
   }
]

$ paperfind.sh -D 20151217_0000_01 -p 3 | json_pp
{
   "etag" : "1452541108.0000000000",
   "width" : 745,
   "mime" : "image/jpeg",
   "data" : "…base64-encoded data…",
   "height" : 1053
}
```
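
Since the page image travels as base64 inside the JSON, a caller has to decode it before use. The following is a minimal sketch using only sed, tr and base64; the function name is hypothetical, and it assumes a single object as printed by ```-D```, with the whole "data" string on one line.

```shell
# Hypothetical helper, not part of paperfind.sh.
# Read one JSON object on stdin and write the decoded "data" field to stdout.
# Assumes the "data" value sits on a single line, as in the -D example above.
# The tr step drops backslashes in case "/" was escaped as "\/" in the JSON;
# base64 text itself never contains a backslash.
extract_data() {
    sed -n 's/.*"data"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
        tr -d '\\' |
        base64 -d
}
```

For example, ```paperfind.sh -D 20151217_0000_01 -p 3 | extract_data > page3.jpg``` would save the third page as an image (the output file name is arbitrary).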

= THE WEB INTERFACE =

There is not much to say, apart from the fact that the web server should probably be configured to authenticate users before granting access to the interface. How strict this needs to be depends on the value of the data managed by Paperwork.
The current web UI is minimal and barely more than a prototype. I may improve it, depending on my own needs, or if someone volunteers to contribute ;-)