Bring Your Own Data into OpenAI GPT Apps


What is Bring Your Own Data into OpenAI?

Imagine you’re running a chatbot like ChatGPT. It can answer general questions, but you want to ask for details about something specific to you. Maybe you’re selling a product, if you can put the manual right at the top of ChatGPT’s reading list, that would be incredibly helpful in getting better responses. Bring your own data into OpenAI is exactly that, you can upload a series of files (including txt, html, docx, and pdf) into Azure OpenAI to customize and improve your chatbot.

A robot reading a document

This feature is brand new to Azure OpenAI and was announced on June 19, 2023. Microsoft offers OpenAI tools like ChatGPT and GPT4 through their Microsoft Azure platform. We recommended Microsoft Azure in the past as a great alternative to OpenAI that offers the exact same service but is more enterprise friendly in terms of security, privacy, and compliance.

How do I set up Azure OpenAI?

The big caveat is that you’ll need to sign up for Microsoft Azure. Unlike regular OpenAI, the Azure OpenAI Service is limited access only for approved enterprises and partners only. Generally speaking, you need to be a business that is also a Microsoft customer to even apply to use Azure OpenAI. We wrote about the application process here.

When you get access, you can create a “deployment” of OpenAI in Microsoft Azure. It seems a bit odd that you don’t just get an API key, but the deployment means you get your OWN OpenAI instance. It is unique to you. This gets rid of a lot of security and privacy concerns. More importantly, when you finally tie files into the deployment, it is available broadly through any method you access your deployment. You can use the playground (a simple ready-to-use chatbot interface prepared by Microsoft), or you can use the API keys to create your own application. In all cases, you’ll have access to your files.

How does Bring Your Own Data into OpenAI work?

When it comes to uploading your files, you’re going to need an Azure Blob Storage and an Azure Cognitive Search instance. The Blob Storage is basically a Dropbox or OneDrive solution where you can upload files right into the storage. You can upload anything, but OpenAI can only access txt, md, html, doc, docx, ppt, pptx and pdf files. The Cognitive Search is actually a rather costly module that will run about $100/month and helps index and organize your files. Neither are particularly difficult to set up, just find the icon in the Azure Portal and setup, most of it is just confirming the pricing for you.

One of the most useful options is that you can limit the response strictly to your data content. Your chatbot will still be able to banter and speak like a regular person, but the factual statements will all come out of your uploaded files.

An option to limit responses to your data content.

Testing

In my own tests, I uploaded north of 2,000 Word documents. It took maybe all of 10 minutes for the upload to complete and for Azure Cognitive Search to process the documents. The chatbot was immediately able to pick up new files. And it was lighting fast at accessing documents throughout the entire library. Please also note that I was only able to use GPT3.5 rather than the new GPT4 for these tests.

For most of my tests, I kept the toggle to “limit responses to my data” on. Without the toggle on, the chatbot might reach into its own knowledge base to answer questions. It worked fine, but from a legal tech perspective, I was fascinated by the idea that I can control and get some certainty over the responses. It worked a charm in only providing answers that can be found in the documents I uploaded. With this toggle on, it refuses to answer questions it does not know about.

The chatbot refuses to answer a question about the 2022 Olympics.

When it comes to responding to questions, I both liked and loathed that the response usually referenced or cited one of my documents. On the positive side, I can go back and look at the source and feel some comfort in the answer. On the negative side, the built-in playground chatbot was a bit lazy at times, basically saying your answer is in this document, see the citation. If you are building your own web app, you can put in a system message that can mitigate this effect.

My Impressions

I asked a series of questions about facts buried deep in my documents and it seemed to pick them out very well 9 times out of 10. For the last 1 out of 10 times, it was wildly off and I saw zero connection between the response and my question. Other users were facing this issue and Microsoft has responded that it is a bug. Since this is the first few days of a preview, I’m going to take them at their word and say factually, the AI was very good at picking out answers.

The chatbot was also fairly intuitive in synthesizing answers from multiple documents. I can ask it for an opinion or general theme and it seemed rather accurate. You should not leave legal analysis to it, but the general impression or summary was accurate.

The chatbot was also really good at fetching documents. I can use natural language and casually ask for documents relating to a few subjects and get great responses. I found that the chatbot was hesitant to draft documents based on the precedents that I uploaded.

Some Ideas for Legal Tech

We are in the early days of Bring Your Own Data. Nevertheless, this will be really useful for law firms and legal departments even today.

For example, I connected a precedent library to my storage and was able to get some helpful drafting tips and templates. It was like having an assistant on hand.

A chatbot transcript that shows a chatbot responding to factual questions and fetching files using bring your own data into OpenAI.

This idea can easily be extended. For in-house teams, you can upload all your internal documents into the system and have the chatbot locate documents for you and draft based on historical patterns. In short, it can be a document management tool, a knowledge base, and an assistant.

I think there will be opportunities in eDiscovery as well. Azure didn’t even blink with 2,000 files, I suspect it can scale to a much larger dataset. I have also seen others try and derive data analytics from files uploaded with mixed success.