Intelligent Document Processing Compliance, from Stone Tablets to Digital Docs

In this episode, we dive into the world of intelligent document processing (IDP) with Jonathan Grandperrin, CEO and co-founder of Mindee. As businesses increasingly rely on digital documentation, understanding and complying with data protection regulations like GDPR becomes paramount.

Jonathan shares his insights on the evolution of document processing – from ancient inscriptions on stones to today’s sophisticated AI and computer vision technologies that interpret and manage digital data. He explains how IDP technology not only helps organizations streamline operations but also ensures compliance with data security laws.

Key takeaways:

  1. Automate Document Processing: Implement intelligent document processing (IDP) technologies to automate the extraction and management of sensitive information, ensuring both efficiency and compliance with regulations like GDPR.
  2. Strengthen Security Protocols: Enhance data protection by integrating strong security measures such as encryption and compliance checks. Make sure your document processing aligns with global standards such as SOC 2 in the US and GDPR in Europe.
  3. Embrace IDP for Regulatory Compliance: Use IDP technologies to maintain compliance by accurately managing and deleting personally identifiable information (PII) as required by law, which reduces the risk of data breaches and non-compliance penalties.
  4. Understand the Content of Documents: Focus on technologies that not only extract data but also help understand the information within documents to ensure proper data handling and compliance with regulations.
  5. Prepare for Evolving Regulations: Stay informed and adaptable to changes in data protection laws, such as the upcoming AI acts in Europe, by choosing document processing solutions that can quickly adapt to new regulatory environments.

Read the transcript

Bart Farrell 0:00
From writing on rock tablets thousands of years ago to modern databases powered by great technologies like AI, humans have liked to keep stuff written down in documents. What happens when those documents need to be read with technologies like intelligent document processing, which reads through documents to extract high-quality information? How can we ensure these technologies comply with data protection regulations like GDPR, with the upcoming AI acts soon in Europe? These issues make us wonder if the regulatory aspects of the industry can keep up with the fast pace of technologies available to end-users for achieving their business goals. That’s why Silvana and I, in the Data Defenders Forum, got a chance to speak to Jonathan Grandperrin. Jonathan is the CEO and co-founder of Mindee, a company originally formed in France, now in different countries, highly specialized in areas of computer vision and deep learning, all of which revolve around the nitty-gritty details found in documents. Let’s hear what Jonathan had to say. We’re sure you’ll gain a lot from this. Hi, everybody, welcome to the Data Defenders Forum, this is episode number seven. We are very excited to be looking at an area we haven’t explored much before—computer vision on the AI side. We’re joined today by Jonathan Grandperrin, CEO and co-founder of Mindee. And with that said, Jonathan, we just want to start out by getting to know you a little better. Can you tell us about who you are, your background, and how you found yourself working in this area?

Jonathan Grandperrin 1:37
Yeah, sure. Hi folks, I’m Jonathan Grandperrin. I have a French engineering background, mainly in computer vision and data science. I was CTO in France before launching Mindee. I started working on Mindee in 2018. I wanted to apply computer vision and deep learning with the latest breakthroughs at that time to the business world and document processing. Document understanding was definitely the best way to go for us.

Bart Farrell 2:14
Now, if you had to choose one historical figure that you would trust with your personal data, or in your specific case, your documents, who would it be and why?

Jonathan Grandperrin 2:26
Good question. I think it would be either maybe John Nash or Alan Turing. I would go for Alan Turing, actually; my documents would be pretty safe.

Bart Farrell 2:36
Any particular reason why him over John Nash?

Jonathan Grandperrin 2:40
I know of them, obviously, but not very well. I mean, I feel like he made more breakthroughs in cryptography in general. So, I don’t know, just a gut feeling.

Bart Farrell 2:53
It’s good to trust your instincts. Before we jump into security and compliance, for people out there who may not be familiar with document processing as it is today, but also the evolution of how we got to the point we’re at now, can you walk us through that historically, so we can get a bit of context?

Jonathan Grandperrin 3:15
Yeah, that’s a big topic. So first, before talking about processing documents, maybe we should define and talk about documents themselves. Today, there are many types of documents. But at the very beginning, I think the first one that was found is a couple of thousand years old; it was just a piece of rock where people were sharing information. That’s why documents are basically a human API, a way to have a record that holds information that is understandable by someone else. That’s what a document is: information in something. So as paper came in, at first everything was written by hand; then, with printing, we had printed text on paper. Then we had the digital era, where we were able to scan documents or to create digital-native documents. And now most of this information has to be processed or stored in databases and in the cloud. So it went, I think, from rock to databases. And when it comes to processing those documents, I think there were two, maybe three different phases. In the first one, processing would just mean being able to store them: not to analyze them, but just to scan them to have the record in a digital format. I think that was between, I don’t know, maybe the ’60s and the ’80s, and it’s still used today. Processing documents also means being able to edit or extract information from those documents. So with the cloud, with Dropbox or Google Drive, or the signing companies like DocuSign, etc., there are a lot of possibilities in the document processing space. And more recently, thanks to GPT, thanks to computer vision, and thanks to OCR technologies that were created, we are also able to extract text from non-digital documents like papers or scans. So it evolved from paper, to the digitization of documents, to being able to edit and create documents digitally, then to OCR, to be able to transform text from photos or images into machine-readable text. And now it’s more about completely understanding documents as a human can. And I think that’s where we are today: trying to transform this human layout, this human API, the documents, into structured data so that it can be used in software and automatically in workflows.

Sylvain Kalache 6:13
Great, thanks, Jonathan. And so yeah, the amount of information that’s being processed is huge. And today, we have more and more regulation coming up. So can you give us a few examples of how IDP technology can help companies, not just tech companies, but any company in any industry, stay in compliance with the local regulations that they have to comply with?

Jonathan Grandperrin 6:48
Yep. I would say that the first one is that having access to the text and to the information inside documents is very important for compliance. If you are handling documents that contain PII, for example, you need to be aware of that, and you need to make sure that you are able to retrieve it, for example if you later need to delete it. So it’s about making sure you can stay compliant with the policies you have with your customers and with the third parties you are using. Being able to extract this information, to make sure that you know where it is, is very important. Without any document processing or document data extraction solution, it’s just a PDF that is sitting there, but you don’t know what’s inside, and that can be complicated in some cases. I would say that’s the most important part of how document processing and document understanding can help in compliance.
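
To make the point about knowing where PII sits a little more concrete, here is a minimal Python sketch. It was not shown in the episode and is not Mindee’s method. It assumes the text of each document has already been extracted (by OCR or an IDP service) into a dictionary keyed by a hypothetical document id, and it uses two deliberately simple regular expressions (emails and IBAN-like strings) as stand-ins for real PII detection. The names `find_pii` and `build_pii_index` are illustrative.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only; real PII detection needs far broader coverage
# (names, addresses, national IDs, ...) and usually a dedicated tool.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


@dataclass
class PiiHit:
    document_id: str  # hypothetical identifier of the source document
    kind: str         # "email" or "iban" in this sketch
    value: str        # the matched text


def find_pii(document_id, text):
    """Return every pattern match found in one document's extracted text."""
    hits = []
    for kind, pattern in (("email", EMAIL_RE), ("iban", IBAN_RE)):
        for match in pattern.finditer(text):
            hits.append(PiiHit(document_id, kind, match.group()))
    return hits


def build_pii_index(extracted_texts):
    """Map each PII value to the set of document ids that contain it."""
    index = {}
    for doc_id, text in extracted_texts.items():
        for hit in find_pii(doc_id, text):
            index.setdefault(hit.value, set()).add(doc_id)
    return index
```

With an index like this, an erasure request for one person’s email address can be answered by looking up which documents mention it, which is the “you need to know where it is before you can delete it” idea described above.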

Sylvain Kalache 7:54
So it’s really about knowing: you have data, but you need to know what this data is about. So it’s not so much the extraction of the data, but really the understanding of the data. Okay. Is there any other example that comes to your mind where IDP can help companies?

Jonathan Grandperrin 8:20
Okay. So I think it’s around this idea of being able to know what’s inside this data, but it may be used for different use cases in compliance: data availability, the rights of access and erasure in the GDPR, making sure that you know where this data is, because otherwise you’re not able to erase it. So I think there are many different use cases, depending on the industries and the geographies, but basically, everything is around making this data accessible to people so that they can apply compliance fully, even on those documents.

Sylvain Kalache
You know, the GDPR really, I would say, kick-started this in a big way. I mean, data compliance law has been brewing for a while now, but the GDPR was really the inflection point. And it’s also coming to the US. Last week, a bipartisan bill was proposed for APRA, which would be the equivalent of the GDPR. So such drastic laws are also coming to the US. It’s not just that US companies operating in Europe have to comply today; this is coming to the US, and it’s going to become an increasingly important topic. As I was preparing this episode, I also found quite an interesting point. I was reading the Verizon Data Breach Investigations Report, which they’ve been publishing, I think, every year since 2008, so it’s quite interesting to look at trends. And it turns out that misdelivery errors, where information is sent to the wrong recipient, are a huge issue for some industries and are actually among the top sources of data leaks. They mention the arts and entertainment industry, the financial and insurance industry, and the public administration industry. So this has huge implications. And obviously, having an OCR, and not relying on humans who make typos, can also avoid a lot of headaches. So we spoke about compliance. Compliance and security are tightly linked, but is there any specificity around security that you can share with us?

Jonathan Grandperrin 10:57
Basically, when you choose a document processing technology, either it’s going to be something that runs locally, and in that case you just need to make sure that nothing goes out, and you’re good if all the security layers that you have in your software or in your environment are already there. Or you are using a third party that you send your documents to, and they are processed using an API, for example, in the cloud. Then it’s the same type of security checks that you need to do as for any other service where you send personal or crucial information. As I was saying before, the documents are just the medium; it’s the information inside that matters. It’s the same thing as having a web form on your website that collects information from your users and transfers it to your server. If you send that information to another service, it’s the same as sending a document with personal information. That’s what I mean. So when choosing a solution, I would make sure that, of course, it’s encrypted and TLS is used, so there are security layers in the protocol for transferring the documents and the files. That’s the minimum level that you can ask for from this kind of service. Beyond that, depending on the geography: in France, I would make sure that GDPR compliance is top of mind for the provider, that’s very important. In the US, I would go with SOC 2 Type II compliance, to make sure that they have all the processes to handle the data well. But otherwise, I don’t think there is much more to think about than those two kinds of checks, the same ones you would do when choosing a payment solution or anything that processes personal information.
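
As an illustration of the “minimum level” mentioned here, the following Python sketch shows one way a client could refuse to send a document unless the transfer is protected by TLS. It is a sketch under stated assumptions, not any provider’s actual client: the endpoint URL in the usage note is a placeholder, the `application/pdf` content type is an assumption, and the helper names (`make_tls_context`, `build_upload_request`, `upload_document`) are hypothetical. Only the Python standard library is used.

```python
import ssl
import urllib.request
from urllib.parse import urlparse


def make_tls_context():
    """Create a client TLS context that verifies certificates and hostnames."""
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def build_upload_request(url, document_bytes):
    """Build a POST request, rejecting any URL that is not https."""
    if urlparse(url).scheme != "https":
        raise ValueError("refusing to send a document over a non-TLS connection")
    return urllib.request.Request(
        url,
        data=document_bytes,
        method="POST",
        headers={"Content-Type": "application/pdf"},  # assumed payload type
    )


def upload_document(url, document_bytes):
    """Send the document over TLS and return the HTTP status code."""
    request = build_upload_request(url, document_bytes)
    with urllib.request.urlopen(
        request, context=make_tls_context(), timeout=30
    ) as response:
        return response.status
```

A call such as `upload_document("https://idp.example.com/v1/parse", pdf_bytes)` would use the hardened context, while an `http://` URL is rejected before any bytes leave the machine. Encryption in transit is only the first of the two checks described above; certifications and GDPR processes on the provider side still have to be verified separately.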

Bart Farrell 12:58
With that in mind, you know, a lot of times, you know, the solution is developed by a company, but then based on the feedback that they’re getting from end users, they might pivot or be thinking about new features. In your case, could you give us an example of a customer that had compliance as a key topic on their list of criteria? And what was it that they were looking for?

Jonathan Grandperrin 13:17
Yeah, obviously, we have a lot. PayFit is a good example. I think they started in France; they are operating in France and in Europe. So they had these GDPR compliance requirements when choosing a solution. So they audited us, trying to make sure that we had the processes and that we were able to process the data the way we said we would. They asked us a lot of questions about how we can delete the data once it’s processed, to make sure that we don’t store anything. They were asking all the questions needed to make sure that we have all of that in mind, that we don’t store data if we don’t have to, and that we describe very well the way we would use the data if we had to store some for training purposes or this kind of usage. So yeah, it was one of the main criteria for them in deciding which provider to choose: the performance of the solution, of course, but, just as importantly, data compliance and security compliance.

Bart Farrell 14:27
You know, one of the things we were talking about previously with GDPR is that someone explained to us recently how, nowadays, every company wants to be a data-driven company. It’s really hot and exciting to say that, but what does that really mean? And what are the implications from a security and compliance perspective? What this person was bringing up is how a lot of companies may not even be aware of the kind of data that they’re collecting and, based on that, the responsibilities that come with it. In the document space, anything can be put in a document: we can be talking about healthcare records, we can be talking about financial records, things that are very, very sensitive and can be used against somebody in terms of invading their privacy. What is your advice for people out there when it comes to this? Famously, someone said 10 years ago that Coca-Cola is nowadays a tech company, but it seems like now, in 2024, every company has some kind of data that they have access to. How is that something that you’re addressing at Mindee? Do you get questions about this kind of stuff from your customers?

Jonathan Grandperrin 15:35
It depends a lot on the industry, the use cases, and the type of customers you’re talking about, first because there are different levels of certification and compliance that you need to fit in. If you’re talking about healthcare, it’s very different than if you’re talking about Coca-Cola and, I don’t know, all the data they have for their providers or their logistics. I would say that today, the document processing space, and IDP, has been very trendy for a couple of years. And it’s very easy to find solutions for end users, companies that provide you with a service tailored to exactly your use cases or your industry. So benchmarking solutions and making sure that you have a clear use case in mind is very important. Otherwise, you can get completely overwhelmed by everything you could or could not do on the documents. There is not a one-size-fits-all document solution that will make sure you stay compliant and know all the data you have in all your documentation and all your infrastructure. I think that’s not possible today. But what’s possible, though, is to have very specific use cases. I don’t know, it can be operational efficiency, making sure that you can process the documents well, that nothing gets out of your environment, your request system, and that you know exactly what you are processing. It can be pure understanding: if you have batches of data with a lot of data from your customers, a document management system can help you retrieve and classify documents and make sure that they’re stored in a more intelligent way than, as in the past, a drive with 1 million documents inside. So it’s a wide topic; without any use case clearly in mind, I’m not able to tell you what will work. But what I can say is that today, there are a lot of use-case-focused solutions for a lot of different industries and use cases that can help you achieve exactly what you want. So: benchmark, make sure you define your problem well, and you will find a solution in the IDP space.

Bart Farrell 17:58
That’s helpful. And I think it’s interesting, because GDPR came out, and we’re in 2024, so we’re talking about over half a decade now of GDPR being on our minds and on the radar. The next thing that we have coming up from a European perspective is the AI Act. But whether it’s GDPR, data protection, or AI, the constant challenge is: can governments and regulatory agencies keep up with the speed at which technology is moving? Being a company that’s originally from France, but also working in the United States and having a global focus, I imagine that you’re getting some questions, or at least we’re seeing in the news: is it hype? Is it fear? What are the concerns people should be having? In the conversations that you’re having, what kind of questions are coming up frequently about AI, and where can we expect this to be going from a regulatory perspective, in terms of concrete instances in which technology such as deep learning may be impacted by regulations? Maybe it’s a little bit early, but I just want to know what conversations you’re having around that topic at this point.

Jonathan Grandperrin 19:03
No, no, good question. I think the general fear around AI comes from the misconception that an AI is a system that learns from every input it has to process when you’re running in production, when you’re doing inference. So I think the general fear is: when I am using GPT, I don’t want you to be able to retrieve data that I have personally inputted somewhere, in GPT, in Facebook, or whatever. This kind of data understanding and sharing between peers, I think that’s the real fear. If we stay in the B2B world, this has been around for a decade already, and in the US they also have loads of different processes and certifications. So if you make sure that your data is isolated, there are no concerns about AI. It’s just a service that you use, as you could use a web service for doing something else; you’re not sharing the data with anyone else, whether because you are SOC 2 compliant, GDPR compliant, or whatever. That’s okay. But I would say that the main fear that I’ve had some signals on, in the US or in France, is: is my data going to be used for training a solution that’s going to benefit someone else? Or, worst-case scenario, are you going to share my data with someone else without even knowing it, because your AI, which was trained on my document, is going to infer something, extract it, and output it somewhere else? That would be the main concern. So as soon as you have a solution, or you build your solution, making sure that it’s, by default, completely private and isolated from the rest of your customer base, of your other users, I think you’re good. That’s what I think. Right now, we don’t have that many questions in this space, because what we sell is a web service solution that is really isolated in our infrastructure. So we make sure that we let our customers and prospects know that everything is safe when it comes to data protection with us, and that nothing will be shared. So we don’t have this kind of conversation most of the time.

Sylvain Kalache 21:32
Good. Yeah, thanks for sharing that with the audience. It’s not just about your customers being in compliance; it’s also about you, Mindee, as an entity, having the audit processes and technology in mind to make sure that, by extension, your customers remain compliant. So yes, communication is key in this case. I want to jump a little bit out of the compliance topic. Our audience includes product and software engineers, and obviously, we are all here to do business. So I just wanted to see if you had seen a customer gaining financially by using an IDP solution in a way that might not have been expected. Do you have a story around this?

Jonathan Grandperrin 22:30
Hmm. Define “not expected”? They didn’t know that they were going to save money using it?

Sylvain Kalache 22:36
Exactly. Yeah, exactly, right. I think automation and reducing manual labour is the obvious one. So I just wanted to know if there was an edge case that they were not expecting, or that you had never seen before, that came up and you were like, “Hmm, that’s interesting.”

Jonathan Grandperrin 23:00
Oh, now that you say that, I think document processing and document automation are related to operational efficiency, and that’s the ROI you are expecting. So most of the time, I think you get it. You just save human time in processing the workflows in which documents are involved. I don’t have anything striking in mind.

Sylvain Kalache 23:25
Yeah, maybe I can share what I had in mind. I think at the beginning of the episode, you said something very interesting: it’s not so much about the acquisition of the data, it’s about the processing, right? And as you said, the technology’s ability to process more and more efficiently and accurately is giving you access to data that you didn’t have before. And as Bart said, our economy is data-driven. So the more data you have, the more potential gain and insight into your market and customer base you could get. So maybe you don’t know, because that’s a very niche question that your customers may or may not share with you. But my gut feeling is that some of the customers who started to use Mindee were like, “Oh, we have this data we never knew about, and now we can extract value out of it.”

Jonathan Grandperrin 24:18
Yeah, most of the time, they have it in mind. But obviously, with documents, you can think about fraud detection, for example, like in the expense management space, making sure you don’t reimburse fraudulent receipts or fraudulent expense reports. That’s a pure loss of cash for the end customers. So you have many different use cases like that. What I meant was, all of them are about saving money in the end: operational efficiency, fraud detection, or even just being able to access this data so that you don’t have any regulatory or compliance issue with your customers that can go further than that, like a lawsuit or something.

Sylvain Kalache 25:10
Okay. Before we started the episode, Mindee just announced a new product; docTI, it’s called. Can you tell us a little bit about it?

Jonathan Grandperrin 25:24
Yeah, sure. So docTI stands for document tailored intelligence. Basically, it’s a solution that helps people design, very efficiently and very easily, a document processing solution for their own documents, whatever they are. In just a few minutes, you can define what you want to retrieve, like what data points you want to extract from your documents, and deploy a web service that does that very accurately. So think about any type of document, I don’t know, birth certificates, ID documents, whatever. In our platform, we have many pre-built models that we have worked on for a couple of years. And now we have opened this kind of studio to anyone, because we had a lot of demand in many different use cases that we were not able to put in our roadmap, because we had maybe thousands of different types of documents that people were asking us for. So we thought it would be easier to provide them with the studio and with the capability of deploying that themselves, instead of building everything ourselves and putting that in our catalogue.

Sylvain Kalache 26:30
And do you believe this is really starting a new kind of category in the industry, training-less IDP? I think the training-less part is really the key here, right?

Jonathan Grandperrin 26:45
Yeah, yeah, we are using a bunch of different technologies. LLMs are involved in that, obviously, and that makes sure that you don’t have to prepare and annotate the data and train everything yourself, which was the way to go a couple of years ago. So we expect to see, and we are already seeing since last month, a lot of usage. I think the number is almost 1k APIs created for different types of documents in one month. So that’s pretty good for us. And yeah, we are definitely looking forward to seeing what people are building with it. Sometimes people put in cards like Pokemon cards, for example, so we had some crazy stuff. Very funny.

Sylvain Kalache 27:31
You mentioned the history of documents, and you mentioned that there were stone documents. Has docTI analysed stone yet?

Jonathan Grandperrin 27:41
I should try that out too. I will do that right after the episode.

Sylvain Kalache 27:47
All right, amazing. Jonathan, thanks a lot for being on the Data Defenders Forum. We will make sure to stay up to date with your blog to see if you find any stone documents. Thank you for listening in, and see you soon. Bye bye. Cheers.