LIFTI v7: Eliminating Allocations with CharacterBuffer


LIFTI is an open source full text index library for .NET - you can check it out on GitHub

Performance has always been a core focus for LIFTI. For v7, I looked at the tokenization pipeline and found an opportunity to significantly reduce memory allocations. The solution? Replacing StringBuilder with a custom CharacterBuffer backed by ArrayPool<char>.

The problem

During tokenization, LIFTI builds up character sequences from the input text. Previously, this used StringBuilder - a perfectly reasonable choice, but one that allocates a new string every time you call ToString(). When indexing 100,000 tokens with 50% duplicates, that’s roughly 50,000 unnecessary string allocations for tokens we’ve already seen.

The tokenization flow looked like this:

// OLD approach (simplified)
var tokenBuffer = new StringBuilder(256);

foreach (var character in input)
{
    if (IsSplitCharacter(character))
    {
        if (tokenBuffer.Length > 0)
        {
            stemmer?.Stem(tokenBuffer);  // Modifies the StringBuilder
            var tokenText = tokenBuffer.ToString();  // ALLOCATION
            processedTokens.MergeOrAdd(tokenText, location);
            tokenBuffer.Clear();
        }
    }
    else
    {
        // preprocessedCharacter is the current character after input
        // preprocessing (e.g. case folding) - that step is elided here
        tokenBuffer.Append(preprocessedCharacter);
    }
}

Every call to tokenBuffer.ToString() creates a new string object, even if we’ve already indexed that exact token.
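A quick way to see this for yourself (a minimal standalone sketch, not LIFTI code): each ToString() call on the same builder hands back a distinct string instance, even when the content is identical:

using System.Text;

var sb = new StringBuilder();
sb.Append("cat");

var first = sb.ToString();
var second = sb.ToString();

// Two separate string objects were allocated for the same three characters
Console.WriteLine(ReferenceEquals(first, second));  // False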

Enter CharacterBuffer

CharacterBuffer is a ref struct that wraps an array rented from ArrayPool<char>.Shared. It provides a similar API to StringBuilder but works with ReadOnlyMemory<char> instead of allocating strings:

using System.Buffers;

public ref struct CharacterBuffer
{
    private char[] buffer;
    private int length;

    public CharacterBuffer(int initialCapacity)
    {
        this.buffer = ArrayPool<char>.Shared.Rent(initialCapacity);
        this.length = 0;
    }

    // Settable so callers (e.g. stemmers) can truncate the buffer
    public int Length
    {
        readonly get => this.length;
        set => this.length = value;
    }

    public void Append(char c)
    {
        if (this.length >= this.buffer.Length)
        {
            this.Grow();
        }

        this.buffer[this.length++] = c;
    }

    public void Clear()
    {
        // The rented array is kept; only the logical length resets
        this.length = 0;
    }

    public readonly ReadOnlyMemory<char> AsMemory()
    {
        return this.buffer.AsMemory(0, this.length);
    }

    public void Dispose()
    {
        if (this.buffer is not null)
        {
            ArrayPool<char>.Shared.Return(this.buffer);
            this.buffer = null!;
        }
    }

    private void Grow()
    {
        // Rent a larger array, copy the contents across, and return the old one
        var newBuffer = ArrayPool<char>.Shared.Rent(this.buffer.Length * 2);
        this.buffer.AsSpan(0, this.length).CopyTo(newBuffer);
        ArrayPool<char>.Shared.Return(this.buffer);
        this.buffer = newBuffer;
    }
}

The key method is AsMemory() - it returns a ReadOnlyMemory<char> view over the pooled buffer without allocating a string. This memory can be used as a dictionary key via a custom comparer.
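One caveat worth spelling out (a minimal sketch, not LIFTI code): the returned memory is a view over the pooled array, so it's only valid while the buffer still owns that array. Anything that must outlive the buffer needs to be materialized with ToString():

var buffer = new CharacterBuffer(16);
buffer.Append('c');
buffer.Append('a');
buffer.Append('t');

var view = buffer.AsMemory();  // No string allocation - just a view
Console.WriteLine(view.Span.SequenceEqual("cat".AsSpan()));  // True

// Once the buffer is disposed (or cleared and reused for the next token),
// holding on to 'view' is no longer safe - the array is back in the pool
buffer.Dispose();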

Zero-allocation deduplication

The TokenStore class is responsible for merging duplicate tokens. It now uses ReadOnlyMemory<char> as the dictionary key:

internal class TokenStore
{
    private readonly Dictionary<ReadOnlyMemory<char>, Token> materializedTokens = 
        new(ReadOnlyMemoryCharComparer.Instance);

    public Token MergeOrAdd(ReadOnlyMemory<char> tokenText, TokenLocation location)
    {
        if (this.materializedTokens.TryGetValue(tokenText, out var token))
        {
            // Duplicate - just add the location
            token.AddLocation(location);
        }
        else
        {
            // New token - materialize the string once
            var text = tokenText.ToString();
            token = new Token(text, location);

            // The key is memory over the materialized string, not the pooled
            // buffer - the buffer's contents will be overwritten by later tokens
            this.materializedTokens.Add(text.AsMemory(), token);
        }

        return token;
    }
}

The clever bit: when we encounter a duplicate token, TryGetValue succeeds using the memory-based comparison, and we never call ToString(). We only materialize the string when creating a new Token object.

The custom comparer ensures memory-based equality works correctly:

internal class ReadOnlyMemoryCharComparer : IEqualityComparer<ReadOnlyMemory<char>>
{
    public static ReadOnlyMemoryCharComparer Instance { get; } = new();

    public bool Equals(ReadOnlyMemory<char> x, ReadOnlyMemory<char> y)
    {
        return x.Span.SequenceEqual(y.Span);
    }

    public int GetHashCode(ReadOnlyMemory<char> obj)
    {
        // string.GetHashCode(ReadOnlySpan<char>) hashes the characters
        // directly, without allocating an intermediate string
        return string.GetHashCode(obj.Span);
    }
}
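To illustrate the effect (a hedged sketch, not library code): two ReadOnlyMemory<char> values with completely different backing stores compare equal as long as their characters match, which is exactly what lets a pooled buffer find a token keyed on a string:

var lookup = new Dictionary<ReadOnlyMemory<char>, int>(ReadOnlyMemoryCharComparer.Instance);
lookup.Add("cat".AsMemory(), 1);

// Memory over a char array - a different backing store than the string above
char[] pooled = { 'c', 'a', 't' };
Console.WriteLine(lookup.TryGetValue(pooled.AsMemory(), out var value));  // True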

The new tokenization flow

With CharacterBuffer, the tokenization code becomes:

var processedTokens = new TokenStore();
var tokenIndex = 0;
var tokenBuffer = new CharacterBuffer(256);  // Rent from pool

try
{
    foreach (var character in input)
    {
        if (IsSplitCharacter(character))
        {
            if (tokenBuffer.Length > 0)
            {
                stemmer?.Stem(ref tokenBuffer);  // Modifies buffer in-place

                // start and length track the token's position in the source
                // text; that bookkeeping is elided here
                processedTokens.MergeOrAdd(
                    tokenBuffer.AsMemory(),      // NO allocation
                    new TokenLocation(tokenIndex, start, length));
                tokenIndex++;
                tokenBuffer.Clear();
            }
        }
        else
        {
            tokenBuffer.Append(preprocessedCharacter);
        }
    }

    return processedTokens.ToList();
}
finally
{
    tokenBuffer.Dispose();  // Return to pool
}

The try/finally ensures the buffer is always returned to the pool, even if an exception occurs.
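Incidentally, a using declaration wouldn't work here even though CharacterBuffer has a Dispose method: using-declared locals are read-only, so tokenBuffer couldn't be passed to the stemmer by ref. A sketch of what would fail to compile:

using var tokenBuffer = new CharacterBuffer(256);
// ...
stemmer?.Stem(ref tokenBuffer);  // Compile error: can't pass a 'using' variable by ref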

Impact on stemmers

The change to CharacterBuffer required updating the IStemmer interface:

// OLD
void Stem(StringBuilder builder);

// NEW
void Stem(ref CharacterBuffer buffer);

The PorterStemmer implementation was updated to work with CharacterBuffer using helper extension methods:

internal static class CharacterBufferExtensions
{
    // These rely on CharacterBuffer's span-based instance methods:
    // EndsWith(ReadOnlySpan<char>) and Append(ReadOnlySpan<char>)
    public static bool EndsWith(this ref CharacterBuffer buffer, string suffix)
    {
        return buffer.EndsWith(suffix.AsSpan());
    }

    public static void ReplaceEnd(
        this ref CharacterBuffer buffer, 
        string oldEnding, 
        string newEnding)
    {
        if (!buffer.EndsWith(oldEnding.AsSpan()))
        {
            return;
        }

        // Truncate the old ending, then append the new one
        buffer.Length -= oldEnding.Length;
        buffer.Append(newEnding.AsSpan());
    }
}

These extensions keep the stemmer code readable while working efficiently with the character buffer.
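For example, a Porter-style suffix rule ends up reading almost like the rule itself. This is an illustrative sketch rather than LIFTI's actual implementation:

public void Stem(ref CharacterBuffer buffer)
{
    // Porter step 1a: "sses" -> "ss" (e.g. "classes" -> "class"),
    // otherwise "ies" -> "i" (e.g. "ponies" -> "poni")
    if (buffer.EndsWith("sses"))
    {
        buffer.ReplaceEnd("sses", "ss");
    }
    else if (buffer.EndsWith("ies"))
    {
        buffer.ReplaceEnd("ies", "i");
    }
}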

The results

For a benchmark indexing 100,000 tokens with 50% duplicates:

  • ~50,000 fewer string allocations (one saved per duplicate token)
  • Reduced GC pressure
  • Improved throughput

The pooled arrays are reused across tokenization operations, so the memory overhead is minimal - just the size of the largest token buffer needed.

Making CharacterBuffer a ref struct ensures it lives only on the stack and can’t escape to the heap. The tradeoff is that ref structs can’t be used in async methods or stored in fields of a class (or any non-ref struct), but tokenization is a purely synchronous operation, so this is a perfect fit.
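For instance, something like this won't compile, because the buffer can't survive an await boundary (a sketch):

public async Task IndexAsync(string input)
{
    var tokenBuffer = new CharacterBuffer(256);
    await Task.Yield();
    tokenBuffer.Append('a');  // Compile error: ref struct can't cross the await above
}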