Rethinking legal service delivery

Our Innovation Lab

We’re always exploring new ways to deliver cost-effective, high-quality legal solutions. The Innovation Lab works by developing our own innovations and by supporting external businesses, investing in them and providing them with strategic advice and business support.


Capitalization Crunch: How Capitalization of Words Can Impact Tokenization Across Large Language Models

Introduction

Sometimes the smallest things can have a massive impact on the ability of a Large Language Model (“LLM”) to process and follow instructions. One factor that can have a surprisingly large impact in a wide variety of situations is excessive capitalization. While frequently not an issue with shorter documents, the longer a document becomes, the more impact capitalization can have on an LLM’s ability to understand and process information. This is due to a key feature of LLMs called tokenization.

Tokenization Overview

Tokenization is the process by which an LLM (or a similar system) breaks an input, usually one or more long strings of text, into smaller components. These components could be a whole word, part of a word, a symbol, or some other small piece of the original string. Each component is assigned a numeric value, and these values are assembled into vectors that are fed into the LLM. The LLM then outputs vectors, which are converted back to text and presented to the end user. [https://learn.microsoft.com/en-us/semantic-kernel/prompt-engineering/tokens]

We can see a visual example of this breakdown through publicly accessible tokenizers, like OpenAI’s tokenizer. [https://platform.openai.com/tokenizer]

[Produced by OpenAI tokenization tool – GPT 3.5 & 4 (https://platform.openai.com/tokenizer)]

As seen above, each word is broken down into a specific token, with spaces being included as part of some of the tokens. To go even further, we can pull the specific token IDs to see how each piece is represented:

[Produced by OpenAI tokenization tool – GPT 3.5 & 4 (https://platform.openai.com/tokenizer)]

We can see that “The” is represented by 791, “ quick” by 4062, and so on. Each of these unique tokens represents not only the text itself but potentially also information like whitespace or punctuation. In this example, every word token after the first includes the preceding space as part of the token itself, and the period is represented by its own token.
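
We can reproduce this breakdown programmatically. Below is a minimal sketch using the open-source tiktoken library, which exposes the cl100k_base encoding used by GPT-3.5 and GPT-4; the pangram sentence is our stand-in for the text shown in the screenshots:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-3.5 and GPT-4
enc = tiktoken.get_encoding("cl100k_base")

text = "The quick brown fox jumps over the lazy dog."
token_ids = enc.encode(text)

# Decode each ID on its own to see exactly how the string was split;
# note the leading space carried inside most word tokens.
for tid in token_ids:
    print(tid, repr(enc.decode([tid])))
# 791  'The'
# 4062 ' quick'
# ...
```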

Impact of Capitalization

When we send a message to an LLM, such as ChatGPT, Claude, Mistral, or a llama-based model, each individual component of the message is represented by a specific numeric value that distinguishes it from the other components. This means that if a value changes, such as through the addition or removal of a word, the total input to the LLM changes, and the result will vary. This doesn’t necessarily mean that there will be dramatic differences, but answers may vary slightly. Granted, most models are non-deterministic, so asking the same question can result in slightly different answers each time. This is generally accepted as a feature of the models, but it is also a potential source of additional headaches when developing a solution using LLMs.

This impact can also be felt in the total cost and capacity of some use cases. While the cost of one additional token can generally be negligible, over a few thousand pages of information, the computational and financial costs can start to add up. This is likely to be less of an issue as publicly available models get better and cheaper, but not paying attention to how your input message to an LLM is being tokenized can result in suboptimal outcomes.
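
To make the cost point concrete, here is a rough sketch of pre-flight token counting with tiktoken; the per-token price and the stand-in document are placeholders, not real rates or data:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimate_input_cost(text: str, usd_per_1k_tokens: float = 0.01) -> float:
    """Rough input-cost estimate; the default price is a placeholder."""
    return len(enc.encode(text)) / 1000 * usd_per_1k_tokens

page = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 10
document = page * 2000  # stand-in for a few thousand pages of text
print(f"{len(enc.encode(document)):,} tokens, ~${estimate_input_cost(document):.2f}")
```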

Capitalization has a potential impact because an all-caps sentence can be tokenized in completely different ways than the same sentence in regular case. This is due to how certain models (like OpenAI’s ChatGPT) perform the tokenization process itself: unlike for a human reader, a word in all caps is broken up differently than a word in regular capitalization. While the meaning to a human might be identical, the way the LLM interprets the information may be wildly different. The impact of this variance is highest in sentences where every letter has been capitalized.

Generally speaking, the places we typically find all-caps sentences or phrases are section headers, title pages for slides, and, in some cases, attempts to add emphasis to a critical point in a document. While not an exhaustive list, this shows that information tends to be in all caps for a specific reason, and we would typically want to preserve that reason in any further analysis. This can become complicated when working with LLMs, due to how the models interpret sections in all caps compared with regular caps.

Normally this would not be a big deal, since the context surrounding the information would also be fed into the model and provide information to the LLM regarding what a section may be referring to. However, in certain use cases (such as summarization of documents, searching within a document, or performing analysis against a large document) the variance in the tokens can add up and can cause issues with the end results provided to the end user. These can range from sections being ignored during a search/retrieval task, to the system producing an erroneous output based on a misunderstanding of what the task is looking for. Notably, this tends to be dependent on the specific information being searched for, as well as what parameters the LLM is calibrated for.

Examples of Capitalization

Consider the following two sentences:

A responsible investment policy should consider environmental, social, and governmental factors (ESG).

A RESPONSIBLE INVESTMENT POLICY SHOULD CONSIDER ENVIRONMENTAL, SOCIAL, AND GOVERNMENTAL FACTORS (ESG).

By just glancing at the sentences, you can tell that they consist of the same words, although the second one may be interpreted as louder or more emphasized due to the capitalization of every letter. However, you do not necessarily interpret the meanings of the words differently, or phonetically read words differently based on the presence of all capital letters. The word responsible means the same thing as RESPONSIBLE, governmental means the same thing as GOVERNMENTAL, and so on.

In contrast, an LLM interprets the sentences as follows:

[Produced by OpenAI tokenization tool – GPT 3.5 & 4 (https://platform.openai.com/tokenizer)]

We can see straight away that there is a difference between the two sentences, just by looking at the colors and how things are divided. Each color chunk represents a separate token identified in the string, and the total number of tokens is greater in the all-caps string. For example, we see that responsible now consists of two tokens instead of one, and governmental is now four tokens long. Apart from being more computationally heavy in terms of tokens sent to the model, the LLM’s interpretation of these tokens may differ and consist of significantly varying parts. This quite easily runs the risk of the tokens not being correlated back to the original meaning of the words in the lowercase sentence, pushing the LLM’s interpretation away from how a human would read the sentences.
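
The same comparison can be verified directly in code; a minimal sketch, again assuming the tiktoken library:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/4 encoding

sentence = ("A responsible investment policy should consider "
            "environmental, social, and governmental factors (ESG).")

# Compare the token breakdown of the regular and all-caps versions.
for text in (sentence, sentence.upper()):
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{len(ids)} tokens -> {pieces}")
```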

This isn’t just a result of one tool being different either. By looking at the tokenization used by a llama-2-based model, we see similar results.

[Sourced from https://github.com/belladoreai/llama-tokenizer-js]

We can see that responsible is one token in lower case and seven tokens in upper case, and governmental goes from two tokens to five. While there are differences in how OpenAI and llama-2-based models complete the tokenization process (both use byte-pair encoding vocabularies derived from their underlying training data), the impact of all caps on tokenization can be seen in both methods.
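
For the llama-2 side, the check can be repeated with the Hugging Face transformers library; note that the official meta-llama checkpoints are gated, so the tokenizer name below assumes you have been granted access:

```python
from transformers import AutoTokenizer

# The official Llama-2 repositories require accepting Meta's license
# on the Hugging Face Hub before this download will succeed.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

for word in ("responsible", "RESPONSIBLE", "governmental", "GOVERNMENTAL"):
    print(word, "->", tok.tokenize(word))
```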

This is something inherent in the tokenization process, not a one-off instance. Unless a model is explicitly trained to treat words in all caps the same way as it treats lowercase words, there will always be a variation in tokenization between the two forms, potentially leading to differing results for what a human would read as the same sentence. At the same time, we wouldn’t necessarily want a model to ignore capitalization, since it’s an effective way to contextualize information for both an end reader and an LLM.

What this practically means is that the results derived from an all-caps sentence will continue to vary, and likely not be easily fixable by the model. As models get stronger and stronger, it is possible that the degree of variation will be reduced even further, but there will still always be a risk of aberration due to the differences in how the sentences are tokenized.

Potential Mitigation Strategies

There are a few ways to approach this issue, but each solution may have drawbacks as well.

For example, adding instructions to your prompt can help alleviate the issue. For a prompt designed to search a larger document, informing the model that “answers may appear in all caps, regular capitalization, or a combination of the two” can potentially fix parts of the issue. Alternatively, explicitly using the all-caps version of a sentence from the document as part of the prompt could align the specific task quite well. However, it’s important to note that rephrasing is not an exact science and can result in other issues in responses down the line by emphasizing different parts of the text. Testing your prompts in multiple scenarios can help minimize this risk, although it may never be fully eliminated.
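
As a rough illustration, such an instruction could be baked into a search prompt; the wrapper function and wording here are hypothetical, not a prescribed template:

```python
def build_search_prompt(document: str, question: str) -> str:
    """Hypothetical prompt wrapper for a document-search task."""
    return (
        "You are searching the document below for an answer.\n"
        "Note: relevant passages may appear in all caps, regular "
        "capitalization, or a combination of the two; treat them as "
        "equivalent in meaning.\n\n"
        f"Document:\n{document}\n\n"
        f"Question: {question}"
    )
```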

Another solution that may work is improving the quality of the source material in which the all-caps or oddly capitalized strings are found. Depending on how sophisticated your process is, something as simple as passing the relevant text through a PROPER() formula in Excel or a .lower() method in Python may suffice. However, these can have unforeseen impacts, such as PROPER() capitalizing letters after apostrophes, or .lower() changing context by lowercasing proper nouns. In general, cleaning and standardizing your data input is always a positive step for any document processing or analysis, but there will be trade-offs depending on the method used.
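
A small sketch of both approaches and their side effects, using Python's str.title() as a stand-in for Excel's PROPER() (the two share the apostrophe quirk); the sample sentence is illustrative only:

```python
text = "THE FUND'S ESG POLICY APPLIES IN LUXEMBOURG."

# Lowercasing normalizes tokenization but flattens proper nouns and
# acronyms: "ESG" becomes "esg", "Luxembourg" becomes "luxembourg".
print(text.lower())  # the fund's esg policy applies in luxembourg.

# str.title(), like PROPER(), treats the letter after an apostrophe
# as the start of a new word and capitalizes it.
print(text.title())  # The Fund'S Esg Policy Applies In Luxembourg.
```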

Even if neither of these solutions is applicable in a specific use case, simply understanding that this quirk of tokenization exists can be very useful. While you may be unable to adjust an existing process or fundamentally change an input, being able to diagnose this as an issue is a solid first step.

Long-Term Impact?

A key point to consider is that as models get better and faster, the impact of specific tokens may be reduced. There has been consistent discussion (https://hbr.org/2023/06/ai-prompt-engineering-isnt-the-future) of “prompt engineering” eventually going away, and some might argue that models will simply get better at inferring general meaning. While this discussion typically does not go into the depths of how words are analyzed at the token level, there is some truth to the idea that as models continue to develop, they will be better equipped to handle imperfect input and still provide high-quality output.

We may not entirely agree with this viewpoint, but the trend is significant and worth taking into consideration. It is true that as models get better and better, the level of skill needed to generate high-quality output will continue to go down. The reality, however, is that tokenization is a core part of how LLMs function and produce high-level results. Absent a complete reengineering of the LLM process, tokenization and its many model-dependent quirks are not going away any time soon. While you could theoretically have a model that ingests capitals and non-capitals as the same information, this would make the model weaker because of the contextual differences between the two. As such, being aware that certain sentences or strings of information can result in variation in output is the best path forward for now.

Conclusion

Tokenization is one of the most fundamental components of any LLM-based tool or program, shaping the way information is fed into and understood by the LLM. While it is practically invisible to the end user, understanding its quirks and limitations allows us to better understand what might influence the end output. Capitalization is one change that generally has little impact on how a person understands the meaning of a sentence, but an outsized impact on how an LLM understands it. Being aware of the issues and limitations of excessive capitalization when dealing with any LLM-based tool is valuable for ensuring we maximize the benefits of the tool while minimizing erroneous results.

Zeidler Swift

The Zeidler Swift platform gives customers a new way to receive legal services from a law firm. It provides a feeling of control and oversight, which gives legal and compliance teams peace of mind. Furthermore, legal services are delivered as clearly described products with a defined outcome.

Zeidler Swift consists of various modules. The Fund Governance module provides an overview of the registration and compliance status of each client’s investment funds around the globe. It serves as a database for the relevant fund documents. Finally, the module provides an overview of upcoming filing requirements. The Fund Registration module allows the submission of instructions to register a fund in a particular country directly in the country overview – this replaces instructions by e-mail.

The Fund Formation Questionnaire module replaces unstructured phone calls to gather information from the client about their investment fund project. It collects the data and information required to set up an investment fund. The Fund Management Portal module is an adoption of the so-called Kanban methodology, which visualizes the workflow related to the lifecycle of an investment fund.

Global Knowledge Hub

The Global Knowledge Hub is one of our Lab projects. Many of our clients have asked us if we could develop a web-based system to access our accumulated knowledge from registration and legal work in 45 jurisdictions.

Once completed, the Global Knowledge Hub will contain practical information about the registration and ongoing compliance requirements relevant for the distribution of UCITS and AIFs in 45 jurisdictions. Furthermore, it will include the marketing / pre-marketing rules, including private placements applicable to UCITS and AIFs in the same 45 countries.

At this point in time, it is planned to cover the following countries: 31 EEA countries (EU, Iceland, Liechtenstein and Norway), Switzerland, Andorra, Australia, Chile, Guernsey, Gibraltar, Hong Kong, Japan, Jersey, Peru, Singapore, South Africa, Taiwan and the UAE. Further countries can be added upon request by our clients. In addition to the web-based content, the Global Knowledge Hub will be complemented by an SLA for queries by phone or e-mail about the rules and procedures in force in the 45 countries.

Legatics

Legatics River is an end-to-end online deal platform focused on client experience and efficiency.

Lawyers add functionality to their matters by adding a ‘module’ to each matter. Each module captures a legal process (such as CP checklisting or bibling) that has been redesigned in an efficient and client-focused manner.

The modules incorporate high levels of automation to cut the amount of time junior lawyers spend on routine administrative tasks.

They also act as central points of information and coordination, updating all parties as to the status of the deal and what they have to do next. Parties perform their role in relation to a process online, whether that be adding a new draft of a document or approving a KYC item.

The result is a large saving of junior lawyers’ time, and clients who can see their deal progress in real time.
