legacy code: how to approach
[Read this post in Polish here]
I always start a project with a review of user needs and the solutions (features) already present in the system. This tells me what users currently need and how the application meets those needs. It also helps divide the application into parts and determine which parts are used, and to what extent. I am particularly interested in elements of the system that are not used: functionalities that have been forgotten or abandoned and can be safely removed. This way, I avoid investing time and energy in analyzing, understanding, modernizing, and writing tests for unnecessary code.
I base the review on three sources of information: documentation, people, and data. However, in practice, it often boils down to two sources, since documentation is merely a written record of information and conclusions drawn from the other two sources. Nevertheless, it is usually fragmented and outdated.
I try to obtain answers to questions gathered in the table below:
| questions | people | data |
|---|---|---|
| Application | | |
| What is its purpose? | They will say why it exists | We won't learn it from data |
| What problems does it have? | They will say why they have problems with the application | It will point out technical issues of the app |
| Features | | |
| How can they be divided? | They will say how and why | It will show who uses what, and how often |
| Which ones are the most important? | They will say why they are important | It will show which ones are used the most |
| Which ones are used? | They will say why they use them | It will show which ones are used |
| Which ones are not used? | They will say why they don't use them | It will show which ones are not used |
| Who needs this feature? | They will say who the feature is for | It will show who uses the feature |
| Users | | |
| Who are they and how can they be grouped? | They will say who they are, how to group them, and why | It will show who uses what |
| What are their needs? | They will say what their needs are | We won't learn it from data |
| What problems do they have? | They will describe qualitative problems | It will show quantitative problems |
| How do they use the features? | They will say how they use features | It will show who uses which feature |
| How many 'inactive users' are there? | They will say why they are inactive | It will show who is inactive |
When performing such a review, I create documentation in the org format, which is similar to Markdown. A plain-text format is the only truly universal format, ensuring that everyone will be able to edit the documentation easily in the future.
1. sources of information
1.1. documented information - documentation
Information collected and available from sources such as:
- Office documents, e.g., Microsoft 365, Word, and SharePoint
- Enterprise wikis
- Project management systems, e.g., Jira
- Communication platforms, e.g., Teams, Slack
The variety of available options, often used simultaneously, disperses information further. This makes information difficult to find and costly to maintain, so it quickly becomes outdated.
Typically, I find a single overall project specification document from the beginning of the project, saved in a format that prevents further modification, e.g., PDF.
These sources have historical value and are interesting to examine. They allow for understanding the origin of the project.
1.2. information not written down - people
The most important source of information is conversations with people who hold so-called 'tribal knowledge' passed down orally. The outcome of this gathering is selecting the relevant information and writing it down.
1.2.1. user classification
Depending on the application, users can be divided and grouped based on shared characteristics and properties. Start with the division into internal (i.e. corporate) users and external users (i.e. the company's clients). Then, apply further segmentation based on groups and roles. Check if the segmentation is reflected in the application.
1.2.2. information gathering methods
Obtaining information from a given representative of a group or role can be divided into formal and informal categories:
- informal
Conversations with people in person or remotely.
- formal
Long, planned interviews in person or remotely, with the session recorded. Here I help prepare a longer written statement from one of the application's users, for example a user of the company's CRM. I ask the user to demonstrate how they use the application and its functionalities, to identify any issues, and to suggest solutions.
Recording the conversation as audio-video is invaluable for later reviewing and processing it in full. After processing, I present the interviewee with the compiled ideas, opinions, and conclusions, requesting a review, corrections, the addition of any overlooked details, and consent for publication. After obtaining consent, I delete the video material and include the interview in the documentation as an official user statement about the application. This serves as the basis for further application design decisions.
1.3. encoded information - data
Data generated during code usage (e.g., user activity data) arises either from regular application operation (e.g., a user submits an order) or from purposefully enforced data creation (e.g., logging) added to verify whether specific code is being invoked.
This is the only objective source of truth that provides answers to questions like: what is being used, how often, and by whom.
1.3.1. code whose execution generates data
Data is typically stored in a database, so we can write database queries guided by the following questions:
- how many rows are there in a given table?
Depending on the age of the application: if it's new, there might be very few rows in each table. Therefore, the more telling question becomes:
- when were the last rows added to a given table?
There is usually a column recording the creation date, such as created_at (timestamp); if not, you can add one. If the most recently created rows in a table are very old, it is worth examining the corresponding functionality, asking why it is not being used, and considering whether it can be removed.
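As a sketch, the two questions above translate into queries like these (MySQL syntax; the tournament table name and its created_at column are assumptions for illustration):

```sql
-- How many rows are there in a given table?
SELECT COUNT(*) AS row_count FROM tournament;

-- When were the last rows added to a given table?
-- Assumes a created_at timestamp column exists.
SELECT MAX(created_at) AS last_row_created FROM tournament;
```

If last_row_created is years in the past, the feature behind that table becomes a candidate for closer inspection and possibly removal.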
- query example: SQL documentation
Using an org text file, you can write an SQL query in a text editor, execute it on the production database, and place the result directly in the documentation. Later, this can be exported to any format, such as HTML; that is exactly how the HTML version of this post was created.
When performing database analysis, I record my thought process: I save the queries, the data obtained, and my conclusions. This then serves as documentation for others and for myself.
For a simple example, I include below the exact query I executed on the database and its results.

```sql
SELECT table_name, table_rows
FROM information_schema.tables
WHERE table_schema = 'fighterchamp'
  AND table_type = 'BASE TABLE'
ORDER BY table_name;
```

| table_name | table_rows |
|---|---|
| user | 1000 |
| tournament | 10 |
| info | 0 |

In summary, I would write something like this: based on the query above, executed on the production database, I conclude that the 'info' table does not contain any data. This serves as a basis for reviewing the code in this area and evaluating its usefulness for the user: dead code, probably nobody uses it.
1.3.2. code whose execution does not generate data
In this case, we have to use a logger or a PHP extension to generate data when the given code is invoked. Generating, and especially persisting, this data under a high number of requests can slow down the application noticeably for users. Consequently, this type of 'observation' is conducted for a set period, aiming to collect a representative sample of data. Two data collection scopes can be defined:
- overall
- based on HTTP requests - logger
Logging most of the HTTP requests. Depending on the required information, this typically includes the path, controller and action names, user, and role. You can store the results in a file, in a database, or push them onto a queue, depending on the load. I prefer storing them in a database because it is easy to implement and the results can be queried with SQL.
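As a minimal sketch of such a logger, assuming a Symfony application with Doctrine DBAL (the request_log table, its columns, and the way the user is resolved are all hypothetical):

```php
<?php
// Sketch only: logs every main HTTP request to a request_log table.
// Assumes Symfony's HttpKernel events and a Doctrine DBAL connection.

use Doctrine\DBAL\Connection;
use Symfony\Component\EventDispatcher\EventSubscriberInterface;
use Symfony\Component\HttpKernel\Event\RequestEvent;
use Symfony\Component\HttpKernel\KernelEvents;

final class RequestLogSubscriber implements EventSubscriberInterface
{
    public function __construct(private Connection $connection)
    {
    }

    public static function getSubscribedEvents(): array
    {
        return [KernelEvents::REQUEST => 'onKernelRequest'];
    }

    public function onKernelRequest(RequestEvent $event): void
    {
        if (!$event->isMainRequest()) {
            return; // skip sub-requests
        }

        $request = $event->getRequest();
        $this->connection->insert('request_log', [
            'path'       => $request->getPathInfo(),
            'controller' => $request->attributes->get('_controller'),
            'user_id'    => null, // resolve from your security layer
            'created_at' => date('Y-m-d H:i:s'),
        ]);
    }
}
```

Once registered as an event subscriber, this writes one row per main request, and usage can later be aggregated with plain SQL, e.g. counting requests per controller over the observation period.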
- based on PHP
There is a PHP extension, tombs, that provides a way to track whether a piece of code is ever invoked: https://github.com/krakjoe/tombs.
- selective
When, due to performance considerations, it is inadvisable to enable full logging, or when only a few suspicious areas exist, logging can be added in specific locations. This can verify whether users visit a specific page or invoke a given action, such as a product purchase.
This approach has earned its own term: 'tombstone'. A logger call records the location from which it was invoked and writes this information to a dedicated file. For convenience, the date the tombstone was planted is passed to each call, e.g. Log::tombstone('2024-05-26'). The resulting log entry would appear as '2024-05-26 UserController::showAction()'. A short (5-minute) presentation on this topic is available in David Schnepper's Ignite talk, "Isn't That Code Dead?".
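A minimal sketch of such a tombstone helper (a generic illustration, not any specific library's API; the Log class name and the log file path are assumptions):

```php
<?php
// Sketch only: records where and when a suspected-dead code path ran.

final class Log
{
    public static function tombstone(string $dateAdded): void
    {
        // debug_backtrace() tells us which function called the tombstone;
        // frame 0 is tombstone() itself, frame 1 is its caller.
        $caller = debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS, 2)[1] ?? [];
        $location = ($caller['class'] ?? '')
            . ($caller['type'] ?? '')
            . ($caller['function'] ?? 'unknown');

        // Appends one line per invocation, e.g.:
        // "2024-05-26 UserController::showAction()"
        file_put_contents(
            __DIR__ . '/tombstones.log',
            $dateAdded . ' ' . $location . "()\n",
            FILE_APPEND | LOCK_EX
        );
    }
}

// Usage inside a suspected-dead action:
// Log::tombstone('2024-05-26');
```

If the file stays empty for the whole observation period, the code path is a strong candidate for archiving; if entries appear, you know exactly who still triggers it and when.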
2. next steps
After gathering the necessary information, I attempt to remove as much dead code as possible. In discussions with the business, I use more precise language, favoring the term 'archiving', because that is essentially what we do with version control systems: nothing is ever deleted permanently; it is only removed from the current version of the software. The point is to not waste time or resources on 'maintenance', such as reviewing, reading, and modernizing unused code.
Following this cleanup, I proceed to establish the first tests. As a quick way to create a minimal safety net, I recommend wiring up smoke tests. There is a package I've used and recommended for years: https://github.com/shopsys/http-smoke-testing.
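For illustration, a generic smoke test can be sketched with PHPUnit and Symfony's WebTestCase (this is not the shopsys package's API, and the URL list is hypothetical):

```php
<?php
// Sketch only: asserts that key pages respond without a server error.
// Assumes a Symfony application with the test framework installed.

use Symfony\Bundle\FrameworkBundle\Test\WebTestCase;

final class SmokeTest extends WebTestCase
{
    /** Pages that should at least render without a 5xx error. */
    private const URLS = ['/', '/login', '/tournament'];

    public function testPagesRespond(): void
    {
        $client = self::createClient();

        foreach (self::URLS as $url) {
            $client->request('GET', $url);
            $status = $client->getResponse()->getStatusCode();
            self::assertLessThan(
                500,
                $status,
                sprintf('%s returned HTTP %d', $url, $status)
            );
        }
    }
}
```

Such a test says nothing about correctness, only that each page does not crash; that is usually enough of a safety net to start refactoring with some confidence.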
Gathering user activity data also helps determine which system elements are most critical, thereby deciding the sequence for further work.