Enumerating words from long streams of data

I started this repo on GitHub where I’ll be keeping small programming projects.

These projects are mostly about solving interesting problems rather than trying to make something useful. It helps me to hone my programming chops while I don’t have any big pet project running.

So, this time I wanted to be able to enumerate over individual words or sentences in texts of arbitrary length.

With C# you can use regular expressions to enumerate over parts of a text, but this requires loading the thing to memory in its entirety.

To overcome this limitation, I thought it would be nice to load data from stream chunk-by-chunk, process chunks asynchronously and gradually return items through IAsyncEnumerable. This interface came in C# 8 and it allows us to start enumerating before all the source data is yet available with an ‘async foreach‘ statement.

So far I’ve only implemented WordEnumerator, and already I encountered a couple of interesting problems.

Finding it hard to find the right words

Firstly, how do I recognize legitimate words in a potentially messy text input?

Take for instance: “His name was ‘d$re@mss0zz’“. In this example I want to detect only ‘His’, ‘name’ and ‘was’ and ignore ‘d$re@mss0zz’ completely. But how about: “He said – ‘Well, I don’t know(he really didn’t know)’“? It is messy writing, but it is legit text nonetheless and I would like to capture all the words present.

I figured the most useful results come from looking for groups of letters where they are surrounded by either control characters, white spaces or punctuation marks. The only problem is that in Unicode punctuation category is very broad and contains a lot of very weird characters. Therefore I limit this set to only common punctuations in the English language, which would include:

  • the full stop (.)
  • the question mark(?)
  • double quotes (” “)
  • the apostrophe (‘)
  • the comma (,)
  • the hyphen (-)
  • underscore (_)
  • exclamation (!)
  • colon (:)
  • semicolon (;)
  • parentheses ()
  • brackets []
  • the slash (/)

Long word

The longest published word, according to Wikipedia, apparently has almost 2000 characters. I’m therefore not sure if it is worth to try and filter out letter strings that are too long to make sense.

One problem with ignoring overly lengthy letter strings is that they can potentially overflow the memory. I leave to sort it out in later iteration.

Chop, chop, chop

Another problem is caused by reading fixed-size chunks from stream – those chunks will split words in most cases. I just keep track of last and first character from previous and current chunk. If both are ‘legal’ characters, then I glue the words.

There will be gibberish

And then there is issue of what string of letters represents some actual word or not. This problem I do not even mean to solve. I’d have to connect it to some lexical service that’s constantly updated with valid words for all the languages in the world.

And even then, how would I handle multilingual input?

The first cut

After all in the first iteration, I ended up with something like the code below. Check out the repo for full version.

public class WordEnumerator : ITextEnumerator
{
    private const string WORDS_EXPRESSION = @"[a-zA-Z]+"; 
    private const int BUFFER_SIZE = 1024;

    public IEnumerable<string> Enumerate(string text)
    {
        Regex regex = new Regex(WORDS_EXPRESSION);
        var matches = regex.Matches(text);
        foreach (Match match in matches)
        {
            char? before = (match.Index > 0) ? text[match.Index-1] : (char?)null;
            char? after = (match.Index + match.Value.Length < text.Length) ? text[match.Index + match.Value.Length] : (char?) null;
            if (IsWord(before, match.Value, after)) yield return match.Value;
        }
    }

    public async IAsyncEnumerable<string> EnumerateAsync(Stream stream, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(stream))
        {
            // some words may be cut by fixed buffer size reads
            // we detect those cases and glue them
            char lastChar = ' ';
            string lastWord = null;
            bool glueWords = false;
            while (!reader.EndOfStream)
            {
                char[] buffer = new char[BUFFER_SIZE];
                var readCount = await reader.ReadBlockAsync(buffer, 0, BUFFER_SIZE);
                if (readCount == 0) continue;
                char firstChar = buffer[0];
                // if last word of previous stream was not yet returned, we can now check if it should be glued or can it be returned now
                if (!char.IsLetter(firstChar) && lastWord != null) 
                {
                    yield return lastWord;
                }
                glueWords = char.IsLetter(firstChar) && char.IsLetter(lastChar);
                lastChar = buffer[readCount-1];
                var wordsFromChunk = Enumerate(new string(buffer));
                // cannot run foreach and yield return. This is because the last word might need to be glued with first characters from next block.
                var wordsList = wordsFromChunk.ToList();
                for (var i = 0; i < wordsList.Count; i++)
                {
                    var word = wordsList[i];
                    string gluedWord = null;
                    if (i == 0 && glueWords)
                    {
                        gluedWord = lastWord + word;
                    }
                    lastWord = word;
                    if (i < wordsList.Count - 1 || reader.EndOfStream || !char.IsLetter(lastChar)) yield return gluedWord ?? word;
                }
            }
        }
    }

    public bool IsWord(char? before, string input, char? after)
    {
        if (before != null && !IsWordSeparator(before.Value)) return false;
        if (after != null && !IsWordSeparator(after.Value)) return false;
        foreach (char c in input) {
            if (!char.IsLetter(c)) return false;
        }
        return true;
    }

    public bool IsWordSeparator(char c) 
    {
        return char.IsSeparator(c) || IsCommonPunctuation(c) || char.IsControl(c);
    }

    public bool IsCommonPunctuation(char c) 
    {
        var allowedPunctuations = ".!?,;:(){}[]-_'\"/";
        return allowedPunctuations.Contains(c);
    }
}

Leave a Reply

Your email address will not be published. Required fields are marked *