This first post in the series takes a look at one of the main output file types that our clients use to analyse their PDF table data and power their applications: Microsoft Excel.
How Parsel helps clients analyse table data in PDFs
Parsel's core proposition is to turn table data found in PDF documents (and image files) into structured data that can be worked with in a variety of formats. Microsoft Excel is the most common format our clients use for manual data analysis. However, an increasing number of our clients are using Parsel's JSON (JavaScript Object Notation) output type, available on our Pro and Enterprise plans, to power automated data analysis and routing in their own applications. In this post we'll look at the Microsoft Excel (.xlsx) output type and discuss the typical use cases that leverage this file type. In a forthcoming post, we'll dive into JSON and how it can be used to automate table analysis and power downstream applications.
Exporting into Microsoft Excel (.xlsx)
Parsel clients across industries, from financial services to public sector contractors to small retail businesses, choose Parsel's Microsoft Excel (.xlsx) output file type to analyse and work with table data originating in their PDF documents. The reason is clear: Parsel's .xlsx output shows an image of the table, as seen by Parsel's parsing algorithm, alongside the resulting structured data presented in the rows and columns of the Excel spreadsheet. When multiple tables exist on the same page, the boundary around each detected table is colour coded so they can be told apart.
Example output from a PDF table to a Microsoft Excel table
As you can see, this allows for convenient side-by-side analysis of Parsel's output. And now that the data is in Excel, it can easily be worked with (summed, averaged, combined with other data points in other Excel workbooks, etc.).
Parsel detects row categories to support data analysis
In addition to showing the individual row captions for each line in the table, Parsel also tries to infer any categories applied to the row captions. In this example table (the balance sheet of a company), ASSETS, EQUITY, and LIABILITIES are all categories of data found in the table. Within each of those categories, there are some observable sub-categories (Non-current assets, Current liabilities, etc.). Parsel detects these categories and sub-categories and exports them to Excel. This allows for convenient lookups in Excel (e.g. =VLOOKUP or =INDEX(MATCH())) so clients can evaluate or aggregate data within entire categories of rows.
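For readers who prefer to see the idea outside of Excel, here is a minimal Python sketch of the same kind of category lookup and aggregation. The column layout (category, sub-category, row caption, value), the helper names totals_by and lookup, and all figures are hypothetical and chosen purely for illustration; they are not taken from Parsel's actual export.

```python
from collections import defaultdict

# Hypothetical rows in the shape (category, sub_category, caption, value).
# Both the layout and the figures are illustrative assumptions,
# not Parsel's real output.
rows = [
    ("ASSETS", "Non-current assets", "Property, plant and equipment", 1200.0),
    ("ASSETS", "Non-current assets", "Intangible assets", 300.0),
    ("ASSETS", "Current assets", "Inventories", 450.0),
    ("ASSETS", "Current assets", "Cash and cash equivalents", 250.0),
    ("EQUITY", "Equity", "Share capital", 1000.0),
    ("LIABILITIES", "Non-current liabilities", "Borrowings", 700.0),
    ("LIABILITIES", "Current liabilities", "Trade payables", 500.0),
]


def totals_by(rows, key_index):
    """Sum the value column grouped by the column at key_index
    (0 = category, 1 = sub-category)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)


def lookup(rows, caption):
    """Return the value of the first row whose caption matches exactly,
    or None if there is no such row (similar to an exact-match lookup)."""
    for _category, _sub_category, row_caption, value in rows:
        if row_caption == caption:
            return value
    return None


category_totals = totals_by(rows, 0)
sub_category_totals = totals_by(rows, 1)

print(category_totals)
print(sub_category_totals["Current liabilities"])
print(lookup(rows, "Inventories"))
```

Because each row carries its category and sub-category, grouping is a single pass over the data. In Excel itself, a comparable aggregation could be done with a function such as SUMIF over the category column.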
Client types predominantly using the XLSX data type
Excel is the tool of choice for clients who want to manually review, edit and aggregate PDF tables. The fact that images can be stored within this file type alongside the parsed table data supports side-by-side analysis of the table found in the PDF document and the resulting structured data. Parsel has clients in the following industries, among others, who are using this output type for table analysis:
- Financial services
- Retail
- E-learning
- Academia
- Public sector
Parsel output types explored, part II: JSON
In the next instalment of this series, we will explore how clients use JSON to automate their table analysis and we'll dive into the client types that tend to use this output format, typically with direct access to Parsel's API.
Try Parsel for free today!
To get started, sign up today and convert your PDF into structured data with a few clicks.