That time I learned about the dilemma of Abstraction in Software Design

I want to tell you about the time I was working with Venezuelan taxpayer identification numbers, how I dealt with them, and how they made me think about software design and the balance between the SOLID principles and the YAGNI/KISS mindset.

Let’s start by observing a simple problem through the lens of my younger self as he was just starting to design a system.

The context⌗

Not so long ago, in a galaxy not so far away, I was working on a system for a company here in Venezuela. The system needed an ID for customers, since it is required for billing in Venezuela. Nationals and resident foreigners use their personal identity card, or Cédula de Identidad (CI), and any other entity is identified by its local taxpayer ID, known as the RIF number or simply RIF. Individuals also have RIFs that can be used as identification on invoices, but they tend to be used only for expenses related to personal businesses, such as freelance work.

The identification system is simple: CIs and RIFs are formed by a tuple of a letter and a number. The letter indicates the type of CI or RIF. CIs use the letters V, meaning venezolano (Venezuelan), and E, meaning extranjero (foreigner). RIFs extend this set with J (jurídico) and G (governmental) to denote the RIFs of private and public companies respectively. As for the numbers, CIs use 8 digits or fewer, representing an increasing count of identified individuals, and RIFs always have 9 digits: 8 digits plus 1 verification digit. A person’s RIF is computed from their CI by padding the number with zeros up to 8 digits and calculating the verification digit.

Here are some examples of both types of national IDs:

V-1: the famous first CI, assigned to President Isaías Medina Angarita.
J-12345678-9: a random juridical RIF
V-00000001-0: a personal RIF.

First implementation⌗

In summary, the requirements were:

Validating CIs and RIFs in all the common formats
Storing them in a uniform format
Being able to distinguish between CIs, personal RIFs, and other RIFs

Note that I only needed to check that a string looked like a valid CI or RIF. Validating its existence was done manually, since there was no official API for querying it.

Consider the following common formats for both types of national IDs:

V1234567, a CI
E1234567, another CI
V-1234567, another CI
V-12345678-9, a RIF because it has 9 digits
J-01234567-8, another RIF, so the digits are padded with 0s
G123456789, another RIF

There are only letters, numbers, and one symbol (-) in every case. Both types of IDs were similar enough that I tried to devise a single algorithm for both CI and RIF - or simply national IDs. I produced the following rules:

It must have the same length after stripping every character that is neither alphanumeric nor -
Ignoring hyphens, it must be 10 characters or fewer
Its first character must be in the array ['V', 'E', 'J', 'G'] (case insensitive)
From the second character onwards (inclusive), the string must consist of digits, again ignoring hyphens
If the string is a RIF, the number part must be 9 digits long

The last step requires knowing the type of national ID of the input, but CIs have 8 digits or fewer, so this step seemed as trivial as counting the digits in the number. Also, to store a national ID in the database, the simplest format is [letter][number] (V1234 or J000012345) without any padding, because it requires the fewest characters and these parts are already available after validating the national ID. No additional computation was needed.
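The rules and the storage format above can be sketched in a few lines. This is a minimal Python illustration, not the original helper: the names `parse_national_id` and `IdKind` are hypothetical, and it assumes that hyphens are dropped before the length and digit checks and that a J or G ID with fewer than 9 digits is rejected.

```python
from __future__ import annotations

import re
from enum import Enum


class IdKind(Enum):
    CI = "ci"
    PERSONAL_RIF = "personal_rif"
    COMPANY_RIF = "company_rif"


PERSON_LETTERS = {"V", "E"}
COMPANY_LETTERS = {"J", "G"}


def parse_national_id(raw: str) -> tuple[str, IdKind]:
    """Validate `raw` and return (stored form, kind), or raise ValueError."""
    # Rule 1: only letters, digits and '-' may appear.
    if not re.fullmatch(r"[A-Za-z0-9-]+", raw):
        raise ValueError(f"unexpected characters in {raw!r}")
    compact = raw.replace("-", "").upper()  # assumption: hyphens are ignored
    # Rule 2: at most 10 characters (and at least a letter plus one digit).
    if not 2 <= len(compact) <= 10:
        raise ValueError(f"wrong length: {raw!r}")
    letter, number = compact[0], compact[1:]
    # Rule 3: the first character is one of the known letters.
    if letter not in PERSON_LETTERS | COMPANY_LETTERS:
        raise ValueError(f"unknown letter in {raw!r}")
    # Rule 4: everything after the letter is a digit.
    if not number.isdigit():
        raise ValueError(f"non-digit characters in {raw!r}")
    # Rule 5: 9 digits means RIF; fewer digits are only valid for a CI.
    if len(number) == 9:
        kind = IdKind.PERSONAL_RIF if letter in PERSON_LETTERS else IdKind.COMPANY_RIF
    elif letter in COMPANY_LETTERS:
        raise ValueError(f"a RIF needs 9 digits: {raw!r}")
    else:
        kind = IdKind.CI
    # Stored form: [letter][number], exactly as validated.
    return letter + number, kind
```

For instance, `parse_national_id("V-1234567")` yields the stored form `V1234567` classified as a CI, while `parse_national_id("J-01234567-8")` yields `J012345678` classified as a company RIF.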

Under the premise of simplicity, I wrote all the logic to validate and normalize national IDs in a single NationalIdHelper class, spread across multiple static methods. The logic ingested strings and produced either strings or an exception, identified strings as CIs or RIFs, and denoted the type of CI or RIF, like personal, foreign, etc. So, I wrote all the test cases I thought I needed, tested the helper class, and shipped it.

Commit. ~Force~ Push. Go home.

An unforeseen problem⌗

The code I wrote Just Worked™ for a few months in a dozen different places without bugs. Then a new requirement came in: the system had to parse and digest bank reports that contained wire transfers from customers, in order to automate certain processes. These files contained a field with the national ID of the person making the transfer. The field looked like this:

AAAA V0012345678
BBBB J0001234567

It had a prefix with information relating to the transfer, the letter of the ID, some padding (the extra 0s), and finally the number of the ID. Also, I knew that in this file personal RIFs were not possible due to the way accounts are registered at the bank.

At first glance, the string V0012345678 contains a valid national ID, but it is longer than 10 characters, breaking the rules I wrote. After some study, a deeper realization hit me: the format is ambiguous, even after removing the extra padding. Consider the string V0000123456. Is it the personal RIF V-00012345-6 (corresponding to CI V12345) or the CI V-123456? Remember, the algorithm parsed both CIs and RIFs, only discovering which one it was parsing during the process. Had I not known that the former was impossible in the bank reports, there would have been no way of telling them apart from the string alone. Under the current model, I needed to give precedence to one type, making it impossible to parse the other. This problem also raised another question: What if this is only the first of many to come?
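The ambiguity is easy to reproduce. The two functions below are a hypothetical Python sketch (neither name comes from the original code) that read the same bank field under each interpretation, assuming the field is a letter followed by a zero-padded number:

```python
def read_as_ci(field: str) -> str:
    """Read the field as a CI: the letter plus the number without zero padding."""
    letter, digits = field[0], field[1:]
    return f"{letter}-{int(digits)}"


def read_as_rif(field: str) -> str:
    """Read the field as a RIF: the last 9 digits are 8 digits plus the verification digit."""
    letter, digits = field[0], field[-9:]
    return f"{letter}-{digits[:8]}-{digits[8]}"


field = "V0000123456"
print(read_as_ci(field))   # V-123456
print(read_as_rif(field))  # V-00012345-6
```

Both readings are well-formed, so nothing in the string itself says which one is correct; only knowledge about where the string came from can settle it.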

“Let me prevent this type of problem from happening again,” I thought. I started by writing a NationalId value object to hold the IDs, their type, whether they belong to a person or a company, etc. Then, I rewrote the validation in a more abstract way: a NationalIdParserInterface and implementations for different formats and types (a StandardNationalIdParser, an AcmeBankNationalIdParser, etc.), making the assumption that a string was a valid ID if the parser understood it. Lastly, I deprecated the helper class and began to slowly replace it with the new parser and value object.
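To make the shape of that refactor concrete, here is a minimal Python sketch of a value object, a parser interface, and one bank-specific implementation. It is an illustration under assumptions rather than the original code: the field names, the `parse` signature, and the exact layout of the bank field (a prefix, a space, a letter, and a zero-padded number) are all hypothetical, and the interface is written as a `Protocol` named `NationalIdParser`.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class NationalId:
    """Immutable value object; the field names are assumptions."""
    letter: str   # 'V', 'E', 'J' or 'G'
    number: str   # unpadded digits for a CI, exactly 9 digits for a RIF
    is_rif: bool

    @property
    def is_person(self) -> bool:
        return self.letter in ("V", "E")

    def __str__(self) -> str:
        return f"{self.letter}{self.number}"


class NationalIdParser(Protocol):
    """Any object with this method counts as a parser."""
    def parse(self, raw: str) -> NationalId:
        ...


class AcmeBankNationalIdParser:
    """Parses fields such as 'AAAA V0012345678' (assumed layout)."""
    _FIELD = re.compile(r"([VEJG])([0-9]+)")

    def parse(self, raw: str) -> NationalId:
        parts = raw.split()
        match = self._FIELD.fullmatch(parts[-1].upper()) if parts else None
        if match is None:
            raise ValueError(f"not a bank ID field: {raw!r}")
        letter, digits = match.groups()
        if letter in ("V", "E"):
            # Personal RIFs cannot appear in this report, so V/E is always a CI.
            number = digits.lstrip("0")
            if not 1 <= len(number) <= 8:
                raise ValueError(f"not a CI: {raw!r}")
            return NationalId(letter, number, is_rif=False)
        # J/G: the last 9 digits are the RIF, anything before must be zero padding.
        padding, number = digits[:-9], digits[-9:]
        if len(number) != 9 or padding.strip("0"):
            raise ValueError(f"not a RIF: {raw!r}")
        return NationalId(letter, number, is_rif=True)


parser: NationalIdParser = AcmeBankNationalIdParser()
print(parser.parse("AAAA V0012345678"))  # V12345678
print(parser.parse("BBBB J0001234567"))  # J001234567
```

The knowledge that resolves the ambiguity (no personal RIFs in this report) lives inside the bank-specific parser, so callers only depend on `parse` and the value object.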

This problem is solved and will stay that way in the future… Right?

The Art of Abstracting the Right Bits⌗

In hindsight, my first solution was not solid. It coupled validation, normalization, and formatting in a single class. I had to write the interface and a new value object, introduce dependencies into already working code, wire those services to resolve specific implementations through the DI container, and redesign services that now needed to be resolved through DI to get the new dependency. The first version was just too simple, and the cost of change was too high.

On the other hand, for that first version I kept it simple, stupid (KISS) and made sure not to include anything just because I might need it (YAGNI). Even if I think the refactor is cleaner and less coupled, I never had to implement another NationalIdParser. You could argue that applying the interface segregation principle was too much abstraction and that splitting the responsibilities of NationalIdHelper into a parser and the value object was a better alternative. Then I would only have had to transform the national IDs in the bank report parser into an unambiguous format before validating them. The second version might be unnecessarily complex, and the time invested in its refactor did not pay off.

I remember this refactoring fondly because it does not have a right answer. It was the first time I pondered when, what, and how to abstract functionality. I asked myself, “Was the first solution too simple?”, “Will the abstraction pay off?”, and “Is the second solution too much extra code and complexity?” These questions are still on my mind as I develop solutions today. It was the first step in realizing that there are rarely universal answers in software design. Most of the time, there are only advantages and trade-offs.

Keep coding!