Deep Unrestricted Document Image Rectification

University of Science and Technology of China

Abstract

In recent years, tremendous efforts have been devoted to document image rectification. However, existing advanced algorithms are limited to processing restricted document images, i.e., the input image must contain a complete document. When the captured image covers only a local text region, rectification quality degrades noticeably.

Our previously proposed DocTr, a transformer-assisted network for document image rectification, also has this limitation. In response, we introduce DocTr++, a new unified framework for document image rectification that imposes no restrictions on the input distorted images.

Our major technical improvements are threefold:

  • We upgrade the original architecture by adopting a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing.
  • We redefine the pixel-wise mapping relationship between the unrestricted distorted document images and their distortion-free counterparts. This data is used to train our DocTr++ for unrestricted document image rectification.
  • We provide a real-world test set and metrics for evaluating rectification quality.
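To make the second point above concrete, the sketch below shows one plausible way a pixel-wise mapping can be re-expressed for a cropped (local) input. This is an illustrative, hypothetical helper, not code from the paper: it assumes the backward flow stores, for each output pixel, absolute source coordinates in the full distorted image, and shifts them into the crop's own coordinate frame so that training pairs for local text regions stay self-consistent.

```python
def crop_flow(flow, x0, y0, crop_w, crop_h):
    """Restrict a backward flow to a crop window (illustrative only).

    `flow[y][x]` holds absolute source coordinates (sx, sy) in the full
    distorted image. After cropping the distorted image at (x0, y0),
    those coordinates must be shifted into the crop's coordinate frame
    so the pixel-wise mapping remains consistent. For simplicity, the
    same axis-aligned window is applied on both grids.
    """
    cropped = []
    for y in range(y0, y0 + crop_h):
        row = []
        for x in range(x0, x0 + crop_w):
            sx, sy = flow[y][x]
            row.append((sx - x0, sy - y0))
        cropped.append(row)
    return cropped
```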

To our knowledge, this is the first learning-based approach for the rectification of unrestricted document images. Through extensive experiments, we demonstrate that our method is highly effective and outperforms existing methods. We believe that DocTr++ will set a new standard for generic document image rectification and encourage further development of learning-based algorithms.

Method

An overview of our DocTr++ for unrestricted document image rectification. Given an arbitrary distorted document image \( I_d \), we extract its features through a CNN backbone followed by a distortion encoder.

Then, the rectification decoder takes as input a fixed number of learned queries, which attend to the encoder's output. The resulting embeddings are transformed in parallel into per-patch warping flows \( f_b \) that point into \( I_d \).

Finally, we use the predicted \( f_b \) to warp \( I_d \) and obtain the rectified image \( I_r \) through the bilinear sampling-based warping operation "W".
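As a concrete sketch of the warping step "W", the snippet below implements bilinear sampling for a single-channel image. It assumes, for simplicity, that \( f_b \) stores absolute in-range source coordinates in \( I_d \) for every rectified pixel (a simplification of the per-patch formulation described above):

```python
def bilinear_warp(image, flow):
    """Warp `image` (H x W nested lists of floats) with `flow`, where
    flow[y][x] = (sx, sy) gives, for each rectified pixel (x, y), the
    absolute source coordinates in the distorted image. Coordinates are
    assumed to lie inside the image bounds."""
    H, W = len(image), len(image[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            sx, sy = flow[y][x]
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
            wx, wy = sx - x0, sy - y0
            # Bilinear interpolation over the four neighbouring pixels
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bot = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            out[y][x] = top * (1 - wy) + bot * wy
    return out
```

In practice, such warping is performed as a batched GPU operation (e.g., with PyTorch's `grid_sample`, which expects normalized coordinates); the loop above only makes the sampling arithmetic explicit.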

Demo

Unrestricted Rectification

Top row: three types of commonly distorted document images based on the presence of document boundaries:
    (a) w/ complete boundaries,
    (b) w/ partial boundaries,
    (c) w/o any boundaries.
Middle row: the rectified results of our method.
Bottom row: the texts detected by DBNet on the distorted image and on the rectified one (highlighted).



Showcases

In our paper, we present document image rectification cases, which encompass real-world distorted document images such as test papers, book pages, and text paragraphs. We also provide qualitative results of DocTr++ and other methods based on the UDIR test set.


Related Works & Acknowledgement

Our current work builds directly on our previous project, DocTr. Many excellent works were also introduced around the same time as ours.

Furthermore, special acknowledgment goes to the excellent works on which our code is largely based: DocUNet, DewarpNet, and DocProj. Their contributions to the field have been invaluable.

BibTeX

@inproceedings{feng2021doctr,
  title={DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction},
  author={Feng, Hao and Wang, Yuechen and Zhou, Wengang and Deng, Jiajun and Li, Houqiang},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  pages={273--281},
  year={2021}
}

@article{feng2023doctrp,
  title={Deep Unrestricted Document Image Rectification},
  author={Feng, Hao and Liu, Shaokai and Deng, Jiajun and Zhou, Wengang and Li, Houqiang},
  journal={IEEE Transactions on Multimedia},
  year={2023}
}