In a recent tweet, Vik Paruchuri, the founder of Dataquest.io, announced the launch of Surya, a multilingual document OCR toolkit. The framework can efficiently detect line-level bounding boxes (bboxes) and column breaks in documents, scanned images, and presentations. Existing text detection models such as Tesseract work at the word or character level, whereas this open-source model works at the line level. The biggest challenge in building a text-line detection model is the lack of fully accurate datasets with line-level annotations.
Surya is an encoder-decoder model that takes an image of a document as input and produces an image with boxes drawn around the text lines of the original input. The initial layers of the decoder come from SegFormer, a transformer for semantic segmentation, while a 2D convolutional layer with batch-normalization layers forms the end of the decoder network. Before an image or PDF is processed, its pages are split into segments no larger than the maximum image dimension and undergo several pre-processing steps.
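The splitting step can be pictured with a minimal sketch. It assumes pages are cut into horizontal strips of at most a fixed pixel height; the function name `split_ranges` and the `max_height` parameter are hypothetical and are not taken from Surya's code, which may split pages differently.

```python
from typing import List, Tuple


def split_ranges(page_height: int, max_height: int) -> List[Tuple[int, int]]:
    """Return (top, bottom) pixel ranges that tile a page from top to bottom.

    Hypothetical helper: every range is at most `max_height` pixels tall,
    and the last range is shorter when the page height is not an exact multiple.
    """
    if page_height <= 0 or max_height <= 0:
        raise ValueError("page_height and max_height must be positive")
    return [
        (top, min(top + max_height, page_height))
        for top in range(0, page_height, max_height)
    ]


# Example: a 2500-pixel-tall page with an assumed 1000-pixel limit.
segments = split_ranges(2500, 1000)
print(segments)
```

Each range could then be cropped from the page image and passed to the model separately, with the predicted boxes shifted back by the `top` offset of their segment.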
To evaluate the accuracy of the bboxes, the researchers used precision and recall computed on coverage area instead of the traditional Intersection over Union (IoU) metric. Precision measures how well the predicted bboxes cover the ground truth bboxes, and recall measures how well the ground truth bboxes cover the predicted bboxes. In a comparison with Tesseract, experiments suggested that Surya's precision is much higher than Tesseract's, while Tesseract's recall is slightly higher than Surya's; overall, Surya outperforms Tesseract. Another advantage of Surya is that it can run on both CPU and GPU and is much faster than Tesseract.
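A coverage-based metric of this kind can be sketched as follows. This is an illustrative reading of the description above, not Surya's benchmark code: it assumes integer pixel boxes in `(x1, y1, x2, y2)` form with exclusive right and bottom edges, averages coverage over boxes without any threshold, and assigns the names precision and recall in the direction the article states. All function names are hypothetical.

```python
from typing import List, Tuple

# (x1, y1, x2, y2) in integer pixels; x2 and y2 are exclusive (assumption).
Box = Tuple[int, int, int, int]


def coverage(box: Box, others: List[Box]) -> float:
    """Fraction of `box`'s area covered by the union of `others`."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    if area <= 0:
        return 0.0
    covered = set()
    for ox1, oy1, ox2, oy2 in others:
        # Clip the other box to `box` and record each overlapping pixel once,
        # so overlapping boxes in `others` are not double counted.
        for x in range(max(x1, ox1), min(x2, ox2)):
            for y in range(max(y1, oy1), min(y2, oy2)):
                covered.add((x, y))
    return len(covered) / area


def mean_coverage(targets: List[Box], covering: List[Box]) -> float:
    """Average coverage of each target box by the covering boxes."""
    if not targets:
        return 0.0
    return sum(coverage(t, covering) for t in targets) / len(targets)


def coverage_precision_recall(
    predicted: List[Box], ground_truth: List[Box]
) -> Tuple[float, float]:
    # Direction follows the article's wording (an assumption of this sketch):
    # precision = how well predictions cover the ground truth boxes,
    # recall    = how well the ground truth covers the predicted boxes.
    precision = mean_coverage(ground_truth, predicted)
    recall = mean_coverage(predicted, ground_truth)
    return precision, recall


# One ground truth line that the detector returned as two half-width boxes.
ground_truth = [(0, 0, 10, 2)]
predicted = [(0, 0, 5, 2), (5, 0, 10, 2)]
print(coverage_precision_recall(predicted, ground_truth))  # (1.0, 1.0)
```

In this example each predicted box has an IoU of only 0.5 with the ground truth line, yet both coverage scores are 1.0 because the two predictions together cover the line exactly. This is one case where a coverage metric and a per-box IoU score disagree.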
Surya, named after the Hindu god of the Sun, has worked successfully on multiple languages and is expected to work on almost all languages. Its main limitation is that it is specialized for documents, so it is unlikely to work well on photos or other images. Experiments also show that it does not work well on images that look like ads. Despite this limitation, the model is still of great use and can be extended further to text recognition and to table and chart detection.
Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast with a keen interest in software and data science applications, and she follows developments across different fields of AI and ML.