In this post, we hear from the head of Parsel's data science team, Daniel Vliegenthart, about the technical architecture under the hood of Parsel and what that means for handling large document processing workloads without any performance degradation.
From the moment we first started thinking about Parsel as a product, we had three adjectives in mind for its technical architecture: secure, scalable and supple. We knew that Parsel was going to be processing potentially sensitive documents (financial documents, initially, then expanding from there). We also knew that processing workloads would be spiky, making it hard to predict the level of computational resources needed at any given time. Finally, we wanted to be able to improve the product organically, iterating fast on new features without worrying about breaking anything in the document processing pipeline.
With all of this in mind, we set out to define an architecture that adhered to the above three principles. In this post we hope to convince you that these principles matter, and show why they're the architectural foundation that Parsel was built upon.
All documents processed by Parsel are transmitted and stored securely
We understand that documents and images contain sensitive information that must be kept private at all costs. To ensure this, we encrypt your document from the moment you upload it, all the way to the point where you download your desired output. Our entire software architecture is hosted on AWS in our own Virtual Private Cloud (VPC), a private, secure and closed-off virtual environment. Any communication with the outside world is fully encrypted in transit, and thus not at risk of being stolen or spied on, as encrypted data is meaningless without the private encryption key. Furthermore, any data we store in our private cloud environment is fully encrypted at rest.
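In practice, Parsel leans on AWS-managed encryption (TLS in transit, managed keys at rest), but the core idea is easy to illustrate. The toy sketch below, using the widely available `cryptography` library (not Parsel's actual code), shows why ciphertext at rest is useless to anyone without the key:

```python
# Illustrative sketch only: this uses the `cryptography` library's
# Fernet recipe to show the core idea behind encryption at rest --
# stored bytes are meaningless without the private key.
from cryptography.fernet import Fernet

def encrypt_document(plaintext: bytes, key: bytes) -> bytes:
    """Encrypt document bytes before they are written to storage."""
    return Fernet(key).encrypt(plaintext)

def decrypt_document(ciphertext: bytes, key: bytes) -> bytes:
    """Decrypt stored bytes when the user downloads their output."""
    return Fernet(key).decrypt(ciphertext)

key = Fernet.generate_key()          # in production, keys live in a managed KMS
stored = encrypt_document(b"account: 12345", key)
assert stored != b"account: 12345"   # the stored form is unreadable
assert decrypt_document(stored, key) == b"account: 12345"
```

In a managed-cloud setup like ours, key generation, rotation and storage are handled by the provider's key management service rather than by application code.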
Parsel is built to scale
Our architecture ensures that we can analyse your document as quickly as possible, while remaining completely isolated from any other processes.
Parsel is built using a serverless architecture. Traditionally, when building web-based applications, a dedicated server (or servers) that is always online is used to process and store data and serve up content whenever a user visits the website or application. A serverless architecture inverts this paradigm completely: rather than relying on a dedicated server that is always on, even when nobody is visiting the site, serverless applications rent just the compute power needed to serve the application, on demand. That is, Parsel rents the spare computational resources of our cloud provider, Amazon Web Services, only when we need them.
A good way of thinking about this paradigm inversion is with the analogy of car sharing clubs like Zipcar. Shockingly, the average car is only in use for 4% of its lifetime! The other 96% of the time, it's parked somewhere just taking up space. That's an incredible inefficiency that car sharing clubs attempt to address by allowing people to simply rent a car for as long as they need it, at any time, and return it without worrying about things like maintenance. In similar fashion, a typical serverless workflow consists of three steps for every task that is executed:
Request compute power, which is available in an instant
Execute the task's logic in a dedicated software container
Release the compute power as soon as the process finishes
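The three steps above map naturally onto a function-as-a-service handler. Here is a minimal sketch of what one such task handler might look like (function and field names are hypothetical, not Parsel's actual code):

```python
# Sketch of a serverless task handler. The cloud provider allocates a
# container, invokes the handler with an event describing the work, and
# reclaims the compute the moment the handler returns.
import json

def handle_document_uploaded(event: dict) -> dict:
    """Entry point invoked on demand for a single task."""
    document_id = event["document_id"]
    # ... the task's logic runs here, isolated in its own container ...
    result = {"document_id": document_id, "status": "processed"}
    # Returning ends the invocation; the compute is released immediately.
    return result

print(json.dumps(handle_document_uploaded({"document_id": "doc-42"})))
```

There is no long-lived process here at all: the handler exists only for the lifetime of a single invocation.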
This has some distinct benefits over the traditional self-managed server setup. First off, serverless allows for practically instant and massive scalability, since we can request as much compute power as we need to carry out as many tasks as are needed at each instant. For example, if 10,000 documents are uploaded at the exact same time, this would pose a serious challenge to a server that typically only expects a few dozen documents to be uploaded at any given moment. It might even lead to downtime for the entire application as the server buckles under the strain being placed on it by the massive workload.
With a serverless setup, these documents can be processed in parallel without any slowdown. Also, every process runs in a completely isolated software container, meaning that it's secure and fully shielded from interference by external factors. This approach is one of the big reasons why Parsel is flexible at its core; instead of running one large script on a dedicated server, we've designed an event-driven pipeline of serverless modules that gives us the ability to iterate and experiment continuously, at high velocity. This is where the third adjective comes into play: supple.
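The fan-out property is easy to demonstrate in miniature. In production the cloud provider spawns one isolated container per event; the sketch below (hypothetical code, using a thread pool on one machine purely as a stand-in) shows the shape of the idea, where many documents are handled concurrently by independent workers:

```python
# Toy illustration of fan-out: each document is handled by its own
# worker, so many documents take roughly as long as one. In production
# the fan-out is done by the cloud provider spawning one container per
# event, not by a thread pool on a single machine.
from concurrent.futures import ThreadPoolExecutor

def process_document(doc_id: int) -> str:
    # Stand-in for the real, isolated processing logic.
    return f"doc-{doc_id}: done"

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(process_document, range(10)))

assert len(results) == 10
```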
Parsel's architecture is supple and easily adaptable
Parsel's architecture consists of a pipeline of serverless modules that each depend on the output of the previous module. Each module is triggered by an input file, performs a task and produces an output file, which in turn triggers one or more subsequent modules, until the desired end state is reached. This is a graph-based infrastructure—more specifically, it's a Directed Acyclic Graph of compute tasks: each vertex is a compute task, and each directed edge is an output file that serves as the event triggering the next vertex.
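A DAG of compute tasks can be expressed very compactly. The sketch below (module names are illustrative, not Parsel's actual pipeline definition) uses Python's standard-library `graphlib` to show why acyclicity matters: it guarantees a valid execution order always exists.

```python
# Minimal sketch of the DAG idea: each vertex is a task that consumes
# the file produced by its upstream task(s).
from graphlib import TopologicalSorter  # Python 3.9+

# task -> set of tasks it depends on
pipeline = {
    "ocr": set(),
    "table_detection": {"ocr"},
    "export_csv": {"table_detection"},
    "export_json": {"table_detection"},
}

# Because the graph is acyclic, a topological order exists: every task
# runs only after all of its upstream tasks have produced their output.
order = list(TopologicalSorter(pipeline).static_order())
assert order.index("ocr") < order.index("table_detection")
assert order.index("table_detection") < order.index("export_csv")
```

In our event-driven setup there is no central scheduler computing this order; the ordering emerges naturally because each module is only ever triggered by the file its predecessor emits.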
The first module is started when you upload your document; it applies an Optical Character Recognition (OCR) algorithm to convert your unstructured data (a document or image) into semi-structured data, which is the output of the first module. This output triggers the next module, which searches for patterns in the semi-structured data that indicate the presence of a table and stores that table in our proprietary format (fully structured data). This in turn triggers the next modules, which transform the output into human-readable formats like Microsoft Excel, CSV, JSON, raw text and annotated images.
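To make the chain concrete, here is a toy version of it (the column detection below is a naive whitespace split; Parsel's real pattern matching and table format are proprietary, and all names here are hypothetical). Semi-structured OCR text goes in, structured rows come out, and each export step consumes only the structured form:

```python
# Toy version of the semi-structured -> structured -> export chain.
import csv, io, json

ocr_text = "name  amount\nrent  1200\nfood  300"

def build_table(text: str) -> list:
    """Table-construction module: semi-structured text -> structured rows."""
    return [line.split() for line in text.splitlines()]

def to_csv(rows) -> str:
    """Export module: structured rows -> CSV."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def to_json(rows) -> str:
    """Export module: structured rows -> JSON records keyed by header."""
    header, *body = rows
    return json.dumps([dict(zip(header, r)) for r in body])

rows = build_table(ocr_text)
assert rows[0] == ["name", "amount"]
```

Note that the export modules never see the raw OCR text; each stage depends only on its predecessor's output, exactly as in the DAG described above.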
Because each module performs a singular task, and is isolated from the other modules, we can rapidly experiment and iterate on new functionality for each module with confidence. We can, for example, add a new output format or introduce a new piece of logic into the table construction algorithm with ease and without impacting any of the other modules. This same principle also allows us to bolt on entirely new features without compromising the integrity of existing modules, like sending output to new data warehouses or visualisation tools, analysing document sentiment, generating images for each chart that appears in a document, and many other practical use cases. Parsel, therefore, is built for the future.
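One common way to get this "pure addition" property (a hypothetical sketch, not Parsel's actual code) is a small registry: each exporter registers itself under a format name, and the pipeline simply fans out over whatever is registered. A new format is then a new function, with zero edits to existing modules:

```python
# Plugin-style registry of output formats. Adding a format is a pure
# addition: no existing exporter or pipeline code changes.
EXPORTERS = {}

def exporter(fmt):
    """Decorator that registers an export function under a format name."""
    def register(fn):
        EXPORTERS[fmt] = fn
        return fn
    return register

@exporter("json")
def export_json(rows):
    import json
    return json.dumps(rows)

# Bolting on a new format later touches nothing above this line.
@exporter("txt")
def export_txt(rows):
    return "\n".join(" ".join(r) for r in rows)

assert set(EXPORTERS) == {"json", "txt"}
```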
Try Parsel for free today!
To get started, sign up today and convert your PDF into structured data with a few clicks.