
This article is the third in a three-part series exploring questions about AI. Read Part 1 here and Part 2 here.
Think of interacting with a natural language processing (NLP) tool as working with a translator: it can understand and reproduce language, but nuance and deeper meaning can get lost in translation.
There’s no magic wand-waving here. So, be clear, be patient, and consider a few rules when interacting with these tools.
In November 2022, OpenAI released ChatGPT, its first mass-market NLP tool, for free. ChatGPT is an interface people can use to interact with a large language model (LLM), originally GPT-3.5.
Today, we’ve got our pick of NLP tools: Anthropic’s Claude, Google’s Gemini, Meta’s Llama, and xAI’s Grok, to name a few. Each tool has different advantages, and not all tools are equally suited to every task or organization.
Regardless of your choice, here are some general guidelines:
Rule 1: Context matters
The first thing to understand is how these tools work under the hood. In their most basic form, natural language processing (NLP) tools present a chat interface: you enter a prompt and receive a response.
When you send a prompt, additional context may be included automatically. All of that context is then sent to an LLM behind the scenes, where each token (a chunk of characters represented by a unique ID) and its position are statistically analyzed to generate a response.
The important takeaway here is that these tools generate responses entirely based on the context you provide. If you provide poor context, you can expect poor responses. And these tools can be seeded with a ton of context before any of the processing is done by the model.
GPT-4, for example, can be seeded with a total of 8,192 tokens — about 25 pages of text. With all of that potential context, the variations and opportunities are extensive.
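If you want a feel for how much of that context window your own text consumes, you can count tokens before sending anything. Here is a minimal sketch in Python, assuming the open-source tiktoken library is installed; the 8,192 figure is the example limit mentioned above, and real limits vary by model.

```python
# Rough check of how much of a model's context window a piece of text would use.
# Assumes the open-source tiktoken library is installed (pip install tiktoken).
import tiktoken

CONTEXT_LIMIT = 8_192  # example limit from the discussion above; varies by model

def token_count(text: str, model: str = "gpt-4") -> int:
    """Return the number of tokens the text occupies for the given model's encoding."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

prompt = "Summarize the key themes in the attached customer feedback."
used = token_count(prompt)
print(f"{used} tokens used, {CONTEXT_LIMIT - used} tokens left in an 8,192-token window")
```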
What if you’re a regional director for an urgent care network and want to evaluate 1,000 recent patient feedback forms? If you had all of the responses organized and categorized in a spreadsheet, you could easily save it as a CSV and send it to one of these NLP tools to instantly understand what the 1,000 patients had to say.
You could ask questions about the patients as a whole or narrow in on specific patient groups by providing more context. NLP tools make this type of sentiment review incredibly efficient compared to the time it would take a human to do the same work.
But should you do this? The answer is almost certainly no.
The CSV contains personal health information (PHI) and personally identifiable information (PII). And in the United States, urgent care facilities and their employees are required to follow HIPAA laws and other privacy, security, compliance, and regulatory standards. Sending this type of data to an unauthorized third party would result in a massive breach.
Rule 2: Understand what to share (or don’t put it online!)
There are other types of information you may want to keep out of the context you provide to natural language processing (NLP) tools, such as:
- PHI and PII
- Proprietary intellectual property. This is your competitive edge or that of your company.
- Your own personal information. Any personal details you provide can be stored on servers and potentially accessed by unauthorized individuals.
- Information or property protected by a nondisclosure agreement (NDA).
Rule 3: Control your content
When you create prompts, consider how to anonymize your data. Develop a process to clean up or remove sensitive information before sending it to a tool, and only provide enough context for the specific problem or topic you’re working on.
There’s no need to send your entire secret algorithm; just ask for help on a specific concept within it.
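To make that concrete, here is a minimal, illustrative sketch in Python of what a clean-up step might look like: it strips a few obvious identifiers (emails, phone numbers, dates) from feedback text before anything leaves your system. The patterns are simplistic examples of the idea, not a HIPAA-grade de-identification process.

```python
# Minimal, illustrative redaction pass to run before any text is sent to an NLP tool.
# The regexes below are simplistic examples, not a production de-identification pipeline.
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
]

def scrub(text: str) -> str:
    """Replace obvious identifiers with placeholder tokens."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text

feedback = "Jane Doe (jane.doe@example.com, 555-867-5309) visited on 3/14/2024 and waited 2 hours."
print(scrub(feedback))
# -> "Jane Doe ([EMAIL], [PHONE]) visited on [DATE] and waited 2 hours."
# Note that the name survives: real de-identification of regulated data needs far more
# rigorous handling (e.g., named-entity detection and human review), not just regexes.
```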
Rule 4: Understand your security options
There are natural language processing (NLP) tools that offer a degree of protection and control over what happens with your prompt once you send it to the model for processing.
Free versions of these tools give you very little control over how prompts will be used in future training of the model — essentially making your prompts that company’s property. In contrast, paid business and enterprise features often adhere to limited or no access policies for the data you send into the system. They promise to not use any of your data unless you permit them to.
At this point, you have to trust those companies to guard your data as carefully as you would any other vendor. The terms and conditions of these paid tiers change frequently and need to be reviewed regularly to make sure they still meet your use case’s compliance and regulatory needs. And as of now, to maintain compliance in our urgent care scenario, you would have to go one step further and unplug the model from the public network altogether.
Rule 5: Natural language processing tool implementation: build vs. buy
Gartner has a useful way of explaining the difference in natural language processing (NLP) tool implementation as a build vs. buy spectrum. When you buy or subscribe to NLP tools, you get the benefits of a highly tuned system that’s ready to use out of the box. All of the hard work has been done for a monthly or yearly subscription fee, but you’re limited to what those tools offer.
When you build these tools yourself, you keep all the data to yourself and control where and how it gets presented. That’s entirely possible to do, but you have to deploy and manage the entire system on your own, which means large hardware purchases, new security risks to manage, and substantial hiring and maintenance needs.
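To make the build side concrete: many self-hosted inference servers expose an OpenAI-compatible API, so application code looks much the same while the data never leaves your network. A minimal sketch, assuming a hypothetical internal server at http://localhost:8000/v1, a hypothetical model name, and the openai Python package installed:

```python
# Calling a self-hosted, OpenAI-compatible inference server on a private network.
# The base_url, api_key, and model name here are hypothetical examples.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your internal inference server
    api_key="not-needed-for-local",       # many local servers ignore the key
)

response = client.chat.completions.create(
    model="local-llm",  # whatever model your server is configured to serve
    messages=[
        {"role": "user", "content": "Summarize the main themes in this (already de-identified) feedback: ..."},
    ],
)
print(response.choices[0].message.content)
```

Because the request never crosses the public internet, this kind of setup is what "unplugging the model from the public network" looks like in practice, at the cost of running the infrastructure yourself.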
Takeaway
There are ways to work with NLP tools on the buy side of the spectrum while still handling sensitive information responsibly. It’s always good to be cautious, if not skeptical, in any situation.
Read the terms of service carefully and make sure the terms of protection offered by any tool make sense and fit your organization and use case.
Determining how (and how not) to use NLP tools requires a thoughtful approach. Be clear about your goals, patient in your experimentation, and strategic in your choices. Choosing the most appropriate solution means carefully evaluating data privacy and compliance considerations, ethical implications, and context before sending your data into the unknown.