paperless-ngx Deployment

Introduction

paperless-ngx is an open-source document management system designed to help users digitize and efficiently manage paper documents. It allows users to automatically classify and index files by scanning or uploading PDFs or images, and supports full-text search and tag management. The core goal of the project is to make “paperless office” simple and automated.

For me, the main advantage is that files are automatically OCR'd after uploading, so when I need a document I can just search for it, which is very convenient.

Deployment

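I'll just provide my deployment.yaml for reference. It defines a single-replica StatefulSet and assumes that Redis (host redis) and PostgreSQL (host postgresql) are already reachable in the cluster.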

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: paperless-ngx
  namespace: app
spec:
  selector:
    matchLabels:
      app: paperless-ngx
  serviceName: paperless-ngx
  replicas: 1
  template:
    metadata:
      labels:
        app: paperless-ngx
    spec:
      initContainers:
        - name: install-tessdata
          image: ghcr.io/astral-sh/uv:0.7.9-python3.12-bookworm-slim
          command:
            - sh
            - -c
            - |
              apt-get update && \
              apt-get install -y tesseract-ocr-chi-sim && \
              cp -r /usr/share/tesseract-ocr/5/tessdata/* /tessdata/
          volumeMounts:
            - name: tesseract-lang
              mountPath: /tessdata
      containers:
      - name: paperless-ngx
        image: ghcr.io/paperless-ngx/paperless-ngx:2.17.1
        ports:
        - containerPort: 8000
          name: web
        env:
        - name: TZ
          value: Asia/Shanghai
        - name: PAPERLESS_REDIS
          value: redis://:passwd@redis:6379
        - name: PAPERLESS_DBHOST
          value: postgresql
        - name: PAPERLESS_URL
          value: https://xxxx.xxxx.com
        - name: PAPERLESS_SECRET_KEY
          value: xxxxxxxxx
        - name: PAPERLESS_TIME_ZONE
          value: Asia/Shanghai
        - name: PAPERLESS_OCR_LANGUAGE
          value: chi_sim
        - name: PAPERLESS_OCR_LANGUAGES
          value: eng chi-tra
        - name: PAPERLESS_DBUSER
          value: postgres
        - name: PAPERLESS_DBPASS
          value: xxxxxxxxxxx
        volumeMounts:
        - name: tesseract-lang
          mountPath: /usr/share/tesseract-ocr/5/tessdata
        - name: data
          mountPath: /usr/src/paperless/data
          subPath: data
        - name: data
          mountPath: /usr/src/paperless/media
          subPath: media
        - name: data
          mountPath: /usr/src/paperless/export
          subPath: export
        - name: data
          mountPath: /usr/src/paperless/consume
          subPath: consume
        - name: timezone
          mountPath: /etc/localtime  # host zoneinfo file mounted over the container's /etc/localtime
      volumes:
        - name: timezone
          hostPath: 
            path: /usr/share/zoneinfo/Asia/Shanghai 
        - name: tesseract-lang
          emptyDir: {} 
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 100Gi
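
The StatefulSet above sets serviceName: paperless-ngx, but the manifest does not include that Service. A minimal sketch of what it could look like is below. It assumes a headless Service is acceptable in your cluster; a regular ClusterIP Service with the same selector and port should also work for in-cluster access. How you expose it externally (Ingress, LoadBalancer, and so on) is not covered here.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: paperless-ngx        # must match spec.serviceName of the StatefulSet
  namespace: app
spec:
  clusterIP: None            # headless Service (assumption; drop this line for a normal ClusterIP)
  selector:
    app: paperless-ngx       # matches the pod label in the StatefulSet template
  ports:
    - name: web
      port: 8000
      targetPort: web        # refers to the named containerPort 8000
```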

Note the initContainers section:

        - name: install-tessdata
          image: ghcr.io/astral-sh/uv:0.7.9-python3.12-bookworm-slim
          command:
            - sh
            - -c
            - |
              apt-get update && \
              apt-get install -y tesseract-ocr-chi-sim && \
              cp -r /usr/share/tesseract-ocr/5/tessdata/* /tessdata/
          volumeMounts:
            - name: tesseract-lang
              mountPath: /tessdata

Introduction to Tesseract OCR

tesseract-ocr-chi-sim is the Simplified Chinese language pack for Tesseract OCR. Tesseract is an open-source optical character recognition (OCR) engine that supports text recognition in many languages. By default, Tesseract only includes the English language pack, so to recognize Simplified Chinese you also need to install tesseract-ocr-chi-sim.

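The paperless-ngx image does not include tesseract-ocr-chi-sim by default, so it has to be installed separately. I don't want to build a custom image, so I use an initContainer instead: it installs the language pack, copies the tessdata files into the shared tesseract-lang volume, and the main container mounts that volume at /usr/share/tesseract-ocr/5/tessdata.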
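
If you need more than one language, the same initContainer can install several packs in one go. The sketch below is a variant of the one above that adds Traditional Chinese; the extra package (tesseract-ocr-chi-tra) is my assumption about what you might want, not part of the original setup. Keep in mind that the emptyDir is mounted over the image's own tessdata directory, so only the files copied here are visible to the main container.

```yaml
initContainers:
  - name: install-tessdata
    image: ghcr.io/astral-sh/uv:0.7.9-python3.12-bookworm-slim
    command:
      - sh
      - -c
      - |
        # Stop at the first failing command
        set -e
        apt-get update
        # Install every language pack you need (chi-tra is an illustrative addition)
        apt-get install -y \
          tesseract-ocr-chi-sim \
          tesseract-ocr-chi-tra
        # Copy all installed traineddata files into the shared volume
        cp -r /usr/share/tesseract-ocr/5/tessdata/* /tessdata/
    volumeMounts:
      - name: tesseract-lang
        mountPath: /tessdata
```

With both packs in place, PAPERLESS_OCR_LANGUAGE can combine languages with a plus sign, for example chi_sim+chi_tra+eng.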

Next are the various environment variables:

TZ: Sets the container time zone. Here, it’s set to Asia/Shanghai to ensure logs and system time are consistent.
PAPERLESS_REDIS: Redis connection address, format: redis://:password@host:port, used for caching and task queues.
PAPERLESS_DBHOST: PostgreSQL database hostname.
PAPERLESS_DBUSER / PAPERLESS_DBPASS: Database username and password.
PAPERLESS_URL: External access address for the service. It is recommended to set this to your domain or public address.
PAPERLESS_SECRET_KEY: Django project secret key. It is recommended to use a random string for security.
PAPERLESS_TIME_ZONE: Time zone setting for Paperless-ngx. It is recommended to keep it consistent with TZ.
PAPERLESS_OCR_LANGUAGE: Default OCR language, e.g., chi_sim for Simplified Chinese.
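PAPERLESS_OCR_LANGUAGES: Additional OCR language packs for the container to install at startup, written as a space-separated list of Debian package suffixes with a hyphen rather than an underscore, e.g., eng chi-tra for English and Traditional Chinese.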
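
The manifest above keeps PAPERLESS_SECRET_KEY and PAPERLESS_DBPASS as plain values. One option is to move them into a Kubernetes Secret and reference it from the container. This is only a sketch: the Secret name paperless-ngx-secrets and the placeholder values are made up for illustration.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: paperless-ngx-secrets   # hypothetical name
  namespace: app
type: Opaque
stringData:
  PAPERLESS_SECRET_KEY: replace-with-a-long-random-string
  PAPERLESS_DBPASS: replace-with-your-database-password
```

The matching entries in the container's env list would then look like this:

```yaml
env:
  - name: PAPERLESS_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: paperless-ngx-secrets   # hypothetical Secret from the sketch above
        key: PAPERLESS_SECRET_KEY
  - name: PAPERLESS_DBPASS
    valueFrom:
      secretKeyRef:
        name: paperless-ngx-secrets
        key: PAPERLESS_DBPASS
```

The same approach should work for PAPERLESS_REDIS, since that URL also contains a password.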

Others

I think this project is already very good. If you want to look at an alternative, check out:

https://github.com/papra-hq/papra

Feel free to follow my blog at www.bboy.app

Have Fun