
Extract Text Based on Columns & Multi-layer PDF File Creation inside Java Apps

 
What’s new in this release?

We are pleased to announce the release of Aspose.Pdf for Java 10.0.0. A PDF file may contain text, images, attachments, graphs, annotations and other elements, and Aspose.Pdf for Java lets you add images to an existing PDF file as well as manipulate them. Different types of compression can be applied to images to reduce their size. The appropriate compression depends on the ColorSpace of the source image: if the image is color (RGB), apply JPEG2000 compression; if it is black & white, apply JBIG2/JBIG2000 compression. Identifying each image's type and using a matching compression therefore produces the most optimized output, which helps when you need to determine an image's color space and compress it appropriately to reduce PDF file size.

If you have a multi-column PDF document and need to extract the page contents while preserving the layout, Aspose.Pdf for Java can accomplish this. One approach is to reduce the font size of the contents inside the PDF document and then perform text extraction. This release also introduces several improvements in TextAbsorber and in the internal text formatting mechanism, so during text extraction in ‘Pure’ mode you can now call the setScaleFactor(..) method as another way to extract text from a multi-column PDF document. The scale factor adjusts the grid used by the internal text formatting mechanism during text extraction:

- ScaleFactor values between 1 and 0.1 (including 0.1) have the same effect as reducing the font size.
- ScaleFactor values between 0.1 and -0.1 are treated as zero, which makes the algorithm calculate the required scale factor automatically. The calculation is based on the average glyph width of the most popular font on the page, but we cannot guarantee that no string in one column of the extracted text reaches the start of the next column.
- If no ScaleFactor value is specified, the default value of 1.0 is used, which means no scaling is carried out.
- If the specified ScaleFactor value is more than 10 or less than -0.1, the default value of 1.0 is used.

We recommend auto-scaling (ScaleFactor = 0) when processing a large number of PDF files for text extraction, or manually setting a redundant reduction of the grid width (about ScaleFactor = 0.5). Either way, you do not need to determine whether scaling is necessary for a particular document: if you set a redundant reduction of the grid width for a document that doesn't need it, the extracted text content remains fully adequate.

Layers can be used in PDF documents in many ways. You might have a multi-lingual file that you want to distribute, with the text in each language on a different layer and the background design on a separate layer. You might also create documents with animation that appears on a separate layer. Another example is adding a license agreement to your file when you don't want users to view the content until they agree to the terms of the agreement.

In addition to the enhancements and features discussed above, this release includes numerous fixes related to the recently introduced PDF to DOC conversion, PDF to Excel conversion, PDF to HTML conversion, PDF to PDF/A conversion, XPS to PDF conversion, PDF to TIFF conversion, text replacement, text extraction, rendering PDF files to XPS and creating TOCs in PDF files. Some important new and improved features included in this release are listed below:

- Extract text based on columns
- PKCS7 does not support a stream based Constructor
- Font Folder issue on Non Windows operating systems
- TextFragment Underline formatting is not working
- Exception when loading PDF processed with Apache PDFBox library
- Extremely slow initial usage of Aspose.PDF/Aspose.Words in Tomcat 8
- When replacing text, contents overlap in resultant file
- Setting margin based on level in TOC
- Digital signature is not properly being added to PDF file
- TextBoxField.setValue() throws exception
- verifySignature() method returning false
- Formatting issues in TOC
- Signature.verifySignature method is not recognizing signature
- Exception when trying to extract/get font information for TextFragment
- Decrypt/Encrypt results in corrupted document or error
- PDF to HTML: space between text is lost
- PDF to DOC: Space between text is increased
- PDF to DOC - Formatting issues in resultant file
- PCL to PDF conversion throws NoClassDefFoundError exception
- PDF to JPEG: White rectangle instead of image's part
- HtmlFragments Issue: A string longer than a page throws exception.
- PDF to XML: Resources are saved in incorrect directory in Linux
- PDF to DOC - Exception during conversion
- PCL to PDF - Exception during conversion
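
The ScaleFactor rules described in the release notes above can be summarized in a small stand-alone sketch. This is not Aspose code and does not call the Aspose API; the class and method names (ScaleFactorRules, effective) are hypothetical and only illustrate how a requested value would be interpreted. The announcement does not say what happens for values greater than 1 and up to 10, so the sketch assumes they are passed through unchanged.

```java
/**
 * Illustrative sketch of the ScaleFactor rules from the release notes.
 * Hypothetical helper, not part of Aspose.Pdf for Java.
 */
final class ScaleFactorRules {
    /** Default: no scaling is carried out. */
    static final double DEFAULT = 1.0;
    /** Zero asks the library to calculate the scale factor automatically. */
    static final double AUTO = 0.0;

    /**
     * Returns the value that would take effect for a requested ScaleFactor.
     * A null argument stands for "not specified".
     */
    static double effective(Double requested) {
        if (requested == null) {
            return DEFAULT;                 // not specified -> 1.0
        }
        double v = requested;
        if (Double.isNaN(v) || v > 10.0 || v < -0.1) {
            return DEFAULT;                 // out of range -> 1.0
        }
        if (v < 0.1) {
            return AUTO;                    // -0.1 <= v < 0.1 -> treated as zero
        }
        // 0.1 <= v <= 1 behaves like font reduction.
        // Assumption: values in (1, 10] are passed through unchanged,
        // because the announcement does not describe them.
        return v;
    }

    public static void main(String[] args) {
        Double[] samples = { null, 0.5, 0.1, 0.05, 0.0, -0.1, -0.2, 11.0 };
        for (Double s : samples) {
            System.out.println(s + " -> " + effective(s));
        }
    }
}
```

With the real library, the chosen value would be passed to the setScaleFactor(..) method mentioned above while extracting text in ‘Pure’ mode.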

Newly added documentation pages and articles

New tips and articles have been added to the Aspose.Pdf for Java documentation to briefly guide you through tasks such as the following.

- Identify if image inside PDF is Colored or Black & White
- Extract text based on columns
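
As a rough illustration of the first topic, the sketch below classifies an already decoded image by scanning its pixels with the standard java.awt.image.BufferedImage class and then suggests a compression following the rule in the announcement (JPEG2000 for color, JBIG2 for black & white). This is a swapped-in technique for illustration only, not the approach from the Aspose documentation article, and it does not use the Aspose API. The names ImageCompressionChooser, classify and suggestCompression are hypothetical. The announcement does not cover grayscale images, so the sketch returns no suggestion for them.

```java
import java.awt.image.BufferedImage;
import java.util.Optional;

/** Hypothetical helper: pixel-scan classification of a decoded image. */
final class ImageCompressionChooser {
    enum Kind { COLOR, GRAYSCALE, BLACK_AND_WHITE }

    /** Classifies an image by inspecting the RGB components of every pixel. */
    static Kind classify(BufferedImage img) {
        boolean onlyBlackOrWhite = true;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r != g || g != b) {
                    return Kind.COLOR;          // any tinted pixel means color
                }
                if (r != 0 && r != 255) {
                    onlyBlackOrWhite = false;   // a gray level was found
                }
            }
        }
        return onlyBlackOrWhite ? Kind.BLACK_AND_WHITE : Kind.GRAYSCALE;
    }

    /** Maps the image kind to the compression named in the announcement. */
    static Optional<String> suggestCompression(Kind kind) {
        switch (kind) {
            case COLOR:
                return Optional.of("JPEG2000");
            case BLACK_AND_WHITE:
                return Optional.of("JBIG2");
            default:
                return Optional.empty();        // grayscale: not covered by the source
        }
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xFF0000);             // one red pixel
        Kind kind = classify(img);
        System.out.println(kind + " -> " + suggestCompression(kind).orElse("no suggestion"));
    }
}
```

A full pixel scan is slow for large images; inspecting the image's declared color space first would be a cheaper check, at the cost of misclassifying RGB images that happen to contain only black and white pixels.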

Overview: Aspose.Pdf for Java

Aspose.Pdf is a Java PDF component for creating PDF documents without using Adobe Acrobat. It supports floating boxes, PDF form fields, PDF attachments, security, footnotes and endnotes, multi-column documents, tables of contents, lists of tables, nested tables, rich text format, images, hyperlinks, JavaScript, annotations, bookmarks, headers, footers and much more. You can create PDF files through the API or from XML and XSL-FO files. It also enables you to convert HTML, XSL-FO and Excel files into PDF.

More about Aspose.Pdf for Java

- Homepage of Aspose.Pdf for Java
- Download Aspose.Pdf for Java
- Read online documentation of Aspose.Pdf for Java

Contact Information
Aspose Pty Ltd
Suite 163, 79 Longueville Road
Lane Cove, NSW, 2066
Australia
Aspose – Your File Format APIs
sales@aspose.com
Phone: 888.277.6734
Fax: 866.810.9465
 