Introduction
paperless-ngx is an open-source document management system designed to help users digitize and efficiently manage paper documents. It allows users to automatically classify and index files by scanning or uploading PDFs or images, and supports full-text search and tag management. The core goal of the project is to make “paperless office” simple and automated.
For me, the main advantage is that files are automatically OCR’d after uploading, so when I need a document I can just search for it, which is very convenient.
Deployment
I’ll just provide a deployment.yaml for reference.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: paperless-ngx
  namespace: app
spec:
  selector:
    matchLabels:
      app: paperless-ngx
  serviceName: paperless-ngx
  replicas: 1
  template:
    metadata:
      labels:
        app: paperless-ngx
    spec:
      initContainers:
        - name: install-tessdata
          image: ghcr.io/astral-sh/uv:0.7.9-python3.12-bookworm-slim
          command:
            - sh
            - -c
            - |
              apt-get update && \
              apt-get install -y tesseract-ocr-chi-sim && \
              cp -r /usr/share/tesseract-ocr/5/tessdata/* /tessdata/
          volumeMounts:
            - name: tesseract-lang
              mountPath: /tessdata
      containers:
        - name: paperless-ngx
          image: ghcr.io/paperless-ngx/paperless-ngx:2.17.1
          ports:
            - containerPort: 8000
              name: web
          env:
            - name: TZ
              value: Asia/Shanghai
            - name: PAPERLESS_REDIS
              value: redis://:passwd@redis:6379
            - name: PAPERLESS_DBHOST
              value: postgresql
            - name: PAPERLESS_URL
              value: https://xxxx.xxxx.com
            - name: PAPERLESS_SECRET_KEY
              value: xxxxxxxxx
            - name: PAPERLESS_TIME_ZONE
              value: Asia/Shanghai
            - name: PAPERLESS_OCR_LANGUAGE
              value: chi_sim
            - name: PAPERLESS_OCR_LANGUAGES
              value: eng,chi_tra
            - name: PAPERLESS_DBUSER
              value: postgres
            - name: PAPERLESS_DBPASS
              value: xxxxxxxxxxx
          volumeMounts:
            - name: tesseract-lang
              mountPath: /usr/share/tesseract-ocr/5/tessdata
            # subPath (capital P) keeps each directory in its own folder on the same PVC
            - name: data
              mountPath: /usr/src/paperless/data
              subPath: data
            - name: data
              mountPath: /usr/src/paperless/media
              subPath: media
            - name: data
              mountPath: /usr/src/paperless/export
              subPath: export
            - name: data
              mountPath: /usr/src/paperless/consume
              subPath: consume
            - name: timezone
              mountPath: /etc/localtime # Mount to the container directory
      volumes:
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/Asia/Shanghai
        - name: tesseract-lang
          emptyDir: {}
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 100Gi
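The StatefulSet sets serviceName: paperless-ngx, but the manifest above doesn't include the Service itself. A minimal sketch of a matching Service might look like the following. Making it headless (clusterIP: None) is an assumption on my part, and you would still need an Ingress or something similar in front of it to serve PAPERLESS_URL.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: paperless-ngx      # must match serviceName in the StatefulSet
  namespace: app
spec:
  clusterIP: None          # headless Service (assumption, not from the original manifest)
  selector:
    app: paperless-ngx     # same label as the pod template
  ports:
    - name: web
      port: 8000
      targetPort: web      # refers to the named containerPort
```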
Note:
- name: install-tessdata
  image: ghcr.io/astral-sh/uv:0.7.9-python3.12-bookworm-slim
  command:
    - sh
    - -c
    - |
      apt-get update && \
      apt-get install -y tesseract-ocr-chi-sim && \
      cp -r /usr/share/tesseract-ocr/5/tessdata/* /tessdata/
  volumeMounts:
    - name: tesseract-lang
      mountPath: /tessdata
Introduction to Tesseract OCR
tesseract-ocr-chi-sim is the Simplified Chinese language pack for Tesseract OCR. Tesseract is an open-source optical character recognition (OCR) engine that supports text recognition in multiple languages. By default, Tesseract only includes the English language pack. If you need to recognize Simplified Chinese, you need to install tesseract-ocr-chi-sim additionally.
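So, the image does not include tesseract-ocr-chi-sim by default, and it needs to be installed separately. However, I don’t want to modify the image, so I use initContainers to install it: the init container installs the language pack and copies the tessdata files into a shared emptyDir volume, which the main container then mounts at /usr/share/tesseract-ocr/5/tessdata.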
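One thing to keep in mind: because the emptyDir volume is mounted over the tessdata directory in the main container, whatever the image had in that directory is hidden, and only the files copied by the init container are visible. If you need more than one language, a sketch like the following installs them all in the init container. The package list is only an assumption about what you might want; tesseract-ocr-eng, tesseract-ocr-osd, and tesseract-ocr-chi-tra are Debian package names.

```yaml
initContainers:
  - name: install-tessdata
    image: ghcr.io/astral-sh/uv:0.7.9-python3.12-bookworm-slim
    command:
      - sh
      - -c
      - |
        set -e
        apt-get update
        # eng and osd are listed explicitly so they end up in the shared volume too
        apt-get install -y --no-install-recommends \
          tesseract-ocr-eng tesseract-ocr-osd \
          tesseract-ocr-chi-sim tesseract-ocr-chi-tra
        # copy everything into the emptyDir that the main container mounts
        cp -r /usr/share/tesseract-ocr/5/tessdata/* /tessdata/
    volumeMounts:
      - name: tesseract-lang
        mountPath: /tessdata
```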
Next are the various environment variables:
TZ: Sets the container time zone. Here, it’s set to Asia/Shanghai so that logs and system time are consistent.
PAPERLESS_REDIS: Redis connection address, in the format redis://:password@host:port. Redis is used for caching and task queues.
PAPERLESS_DBHOST: PostgreSQL database hostname.
PAPERLESS_DBUSER / PAPERLESS_DBPASS: Database username and password.
PAPERLESS_URL: External access address for the service. It is recommended to set this to your domain or public address.
PAPERLESS_SECRET_KEY: Django project secret key. It is recommended to use a long random string for security.
PAPERLESS_TIME_ZONE: Time zone setting for Paperless-ngx. It is recommended to keep it consistent with TZ.
PAPERLESS_OCR_LANGUAGE: Default OCR language, e.g., chi_sim for Simplified Chinese.
PAPERLESS_OCR_LANGUAGES: Additional OCR language packs for the container to install. As far as I can tell from the paperless-ngx documentation, this is a space-separated list that uses package-style names (chi-tra rather than chi_tra), so the eng,chi_tra value in the manifest above may need adjusting.
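The manifest above puts the secret key and database password directly into env values. A common alternative is to keep them in a Kubernetes Secret and reference them with secretKeyRef. This is only a sketch: the Secret name paperless-secrets and its key names are hypothetical, and the placeholder values need to be replaced with your own.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: paperless-secrets    # hypothetical name
  namespace: app
type: Opaque
stringData:
  secret-key: xxxxxxxxx      # placeholder, use a long random string
  db-password: xxxxxxxxxxx   # placeholder
---
# Fragment of the paperless-ngx container's env list in the StatefulSet
env:
  - name: PAPERLESS_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: paperless-secrets
        key: secret-key
  - name: PAPERLESS_DBPASS
    valueFrom:
      secretKeyRef:
        name: paperless-secrets
        key: db-password
```

The same pattern would work for the Redis password in PAPERLESS_REDIS, although that one is embedded in a URL, so the whole URL would have to live in the Secret.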
Others
I think this project is already very good. If you want to look at an alternative, you can check out https://github.com/papra-hq/papra
Feel free to follow my blog at www.bboy.app
Have Fun