As I shared at the beginning of this journey, I'm working on my own word tracking app for macOS - WriteRise. I am planning to share its development process in the open through a series of articles here and on Twitter.
WriteRise can count words across different applications, but I wasn't sure about its precision. Since I don't have a text to count the words of and need to calculate everything on the fly, and since I haven't optimized the counting at all, I assumed the error would be pretty high. Let's set the worst-case estimate to 35% and the best to 15%. To check, I wrote in a single application without stopping while tracking with WriteRise. I then compared the word count in WriteRise with the word count of the application to see how far off it was.
But there is one main caveat I needed to keep track of:
- WriteRise doesn't support deleting words yet
The text that follows is an edited version of the original text I wrote to track the word count. In the original text, I never used backspace or started a new line. You can find the original version of the text at the bottom of the page.
Here's the text:
This writing session is going to be 200 words long. Let's start and see how close I get without looking at the word count.
This is also going to be on a single line to check whether WriteRise's word miscalculation is caused only by line wraps or whether there is actually something wrong.
I noticed that at times WriteRise would have a significantly wrong word count relative to the text application I was using.
I also can't delete anything I'm writing since I'm counting words and I decided not to include the ability to delete words in the MVP (a more in-depth explanation of the topic will come in a different article later on).
I decided at the very beginning to track characters instead of words because it was easier and I needed to make an MVP to see if tracking words across all applications and categorizing them was viable. Later on, I changed from counting characters to counting words, which is the only metric we really care about.
The problem comes with how it works. To count words, we can't simply parse a file and get the word count, which is what most writing applications do (and which is fairly easy). Instead, we need to detect keypresses, anonymize them, strip them of all information, and only count them as valid whenever we detect a new word.
But how do we detect a word, you might ask? Simple! We just count every space! _Naïve me answered promptly_
Oh, so if we type space space space, we typed 3 words! Hmm, not so easy.
Got it! We count a space followed by a character that's not a space character. Great! So space followed by CMD+A, which is select all, counts as a word! Again, not really...
The current solution ignores modifier keys and special characters, passes each keypress through an algorithm that detects valid characters, and counts everything that passes as an allowed character.
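To make the idea concrete, here is a minimal sketch of this kind of on-the-fly counting, written in Python rather than taken from WriteRise itself. The `KeyEvent` type, its fields, and the rule for what counts as an allowed character (`str.isalnum`) are all assumptions made for illustration. The point is that the only thing kept between keypresses is a single flag saying whether we are inside a word.

```python
from dataclasses import dataclass


@dataclass
class KeyEvent:
    """Hypothetical keypress: the typed character plus a modifier flag."""
    char: str
    has_modifier: bool = False  # e.g. CMD or CTRL held down


class StreamingWordCounter:
    """Counts words from keypresses without storing any typed text."""

    def __init__(self) -> None:
        self.words = 0
        self._in_word = False  # the only state kept between keypresses

    def feed(self, event: KeyEvent) -> None:
        # Shortcuts such as CMD+A are not typing, so skip them entirely.
        if event.has_modifier:
            return
        if event.char.isspace():
            # Whitespace ends the current word (if any).
            self._in_word = False
        elif event.char.isalnum():
            # The first allowed character after a gap starts a new word.
            if not self._in_word:
                self.words += 1
                self._in_word = True
        # Anything else (punctuation, control keys) is ignored.


def count(events) -> int:
    """Feed a sequence of KeyEvent objects and return the word total."""
    counter = StreamingWordCounter()
    for event in events:
        counter.feed(event)
    return counter.words


# Three spaces in a row do not create extra words.
print(count([KeyEvent(c) for c in "one   two"]))  # 2
# A space followed by CMD+A is not a word.
print(count([KeyEvent(" "), KeyEvent("a", has_modifier=True)]))  # 0
```

Because nothing typed is ever stored, a sketch like this cannot undo a word when the user deletes it, which matches the caveat mentioned earlier.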
The problem is that the precise way would be to keep a buffer of everything we have written in the current application and delete it when we switch applications. Doing it like that raises huge privacy concerns. I don't want to store your data, I just want to know if you wrote something that is a word or not.
That said, I will keep the improved, but still naïve, solution to counting until something better and just as privacy-conscious comes to mind.
The objective of the next few weeks is to improve accuracy: reduce the error rate from (what I expect to be) a two-digit number to a one-digit number. I don't think I can get it below 10%, but we are going to aim for 9% anyway.
My estimate is that at the moment the error is around 30%, which is terrible, but expected. The algorithm is not terrible, but it's not really great either. There is a lot of room for improvement.
WriteRise: 394 words.
According to VSCode: 450 words.
According to Grammarly: 448 words.
According to Hemingway Editor: 444 words.
Note: The word count is based on the original version of the text found at the end of the page, and not the edited version present in this article.
The percentage difference between VSCode and WriteRise is 13.27%, which is better than expected. It is NOT good enough, though.
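For anyone who wants to check the arithmetic: the 13.27% figure matches the symmetric percentage difference, where the gap between the two counts is divided by their average. If you instead treat VSCode as the reference and divide by its count alone, you get a slightly lower relative error. The function names below are my own and purely illustrative.

```python
def percentage_difference(a: float, b: float) -> float:
    """Symmetric percentage difference: |a - b| over the mean of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100


def relative_error(measured: float, reference: float) -> float:
    """Error of `measured` relative to a trusted `reference` value."""
    return abs(measured - reference) / reference * 100


writerise, vscode = 394, 450

print(f"{percentage_difference(writerise, vscode):.2f}%")  # 13.27%
print(f"{relative_error(writerise, vscode):.2f}%")         # 12.44%
```

Either way, the result lands just under the 15% best-case estimate.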
It's also interesting to see that the editors don't agree with each other either: each one displays a different number of words for the same text. That is weird, since they actually have the text to parse while WriteRise is doing it on the fly.
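One plausible reason for the disagreement is that "word" has no single definition. I don't know which rules VSCode, Grammarly, or Hemingway Editor actually use, so the three rules below are assumptions chosen only to show how the same sentence can produce three different counts.

```python
import re


def count_whitespace(text: str) -> int:
    """Every whitespace-separated chunk is a word, even a lone '--'."""
    return len(text.split())


def count_word_chars(text: str) -> int:
    """Every run of letters, digits, or underscores is a word."""
    return len(re.findall(r"\w+", text))


def count_lenient(text: str) -> int:
    """Like count_word_chars, but apostrophes and hyphens may join parts."""
    return len(re.findall(r"\w+(?:['-]\w+)*", text))


sample = "Don't re-read the e-mail -- it's two pages long."

print(count_whitespace(sample))  # 9
print(count_word_chars(sample))  # 12
print(count_lenient(sample))     # 8
```

With contractions, hyphenated words, and stray punctuation scattered through a 450-word text, small rule differences like these could easily add up to a gap of a few words.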