Since I was little, my mother always told me "you should learn, learn and learn" and "knowledge is power".
Not so long ago, I truly believed that knowing a specific technology, or every detail of a system, was the core of professionalism. Over the years I discovered that there are skills more important than pure knowledge, most notably the skill of solving problems.
This skill brings one to the next level and differentiates senior developers from junior ones.
The concept is so powerful because many of its methods do not require any special proficiency.
Mastering the skill of problem solving requires experience, but the good news is that you will encounter plenty of things that don't work, and need solving, along the way.
Such problems can appear during development and, unfortunately, even in production.
So what is the problem? Sometimes we don't strive for a solution:
- We give up too quickly
- We get stuck and try the same thing over and over because it’s supposed to work (to quote a line often attributed to Einstein: "Insanity is doing the same thing over and over again and expecting different results")
- We tend to blame others or pass the responsibility on to them, preferably those who can't defend themselves (like a CI agent or an off-the-shelf product) or the usual suspects (like other teams).
- We believe that someone else will help us solve it.
- We look under the streetlight (where searching is easy, not where the problem is likely to be)
- We find a reasonable enough explanation and do not bother to prove it.
- Voodoo is not a good enough explanation...
- A bug in an external product? I accepted that explanation only if we found a matching open issue in the product's docs.
- We don't put enough effort into reproducing the issue
- We don't know how to approach it...
So I propose the following problem solving methodology:
- Define the problem
The most important task when approaching a problem is to define what the problem is. It's often said that "a problem well defined is half solved", and moreover, different problems have different solutions.
This seems trivial, but it is tricky in practice.
For instance, I get up in the morning, I need to get to work, and my car won't start. What's my problem? Surely it's the car that won't start. No! My problem is that I need to get to work. If I try to solve the first, I can take the car to the garage, call someone to help me with it, or read the manual and try to figure it out myself. But if my problem is getting to work, then I can take a bus or ride with a colleague.
Focus on what you need (getting to work), and not what you want (the car to start).
(Though, don't forget to fix your car later on, or you might add other problems… 😊)
- Gather information
Gather as much information as possible. First collect the information and then determine what you can do with it. It's important to do this as soon as the problem is discovered as some of the data may not be available anymore at a later stage.
Such information can be:
- Concrete examples that don't work
- Backups of logs, stack traces, and core dumps
- Settings: environment variables, command line invocation, configuration files
- When did it last work? (If it ever did)
- What is the system's flow? The problem probably hides somewhere in it
- What does it look like when it's working? Compare it to the not-working state
- What changed since it last worked?
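To make the "settings" items concrete, here is a minimal Python sketch of capturing a snapshot the moment a problem shows up, and of comparing the environment of a working run against a broken one. The function names (`capture_snapshot`, `save_snapshot`, `diff_env`) and the snapshot layout are my own illustration, not part of any tool, and your system will likely need other fields.

```python
import json
import os
import platform
import sys
from datetime import datetime, timezone


def capture_snapshot(config_paths=()):
    """Collect settings that may be gone or changed later: env, argv, configs."""
    snapshot = {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "platform": platform.platform(),
        "python": sys.version,
        "argv": list(sys.argv),
        "cwd": os.getcwd(),
        # Caution: the environment may hold secrets; filter before sharing.
        "env": dict(os.environ),
        "config_files": {},
    }
    for path in config_paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as handle:
                snapshot["config_files"][path] = handle.read()
        except OSError as exc:
            # A missing or unreadable file is itself a useful piece of information.
            snapshot["config_files"][path] = f"<unreadable: {exc}>"
    return snapshot


def save_snapshot(snapshot, path):
    """Write the snapshot to disk as JSON so it survives restarts and cleanups."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(snapshot, handle, indent=2, sort_keys=True)


def diff_env(working, broken):
    """Return {name: (working_value, broken_value)} for variables that differ."""
    names = set(working) | set(broken)
    return {
        name: (working.get(name), broken.get(name))
        for name in sorted(names)
        if working.get(name) != broken.get(name)
    }
```

Taking one snapshot on a machine where things work and one where they don't, then running `diff_env` on the two `"env"` entries, is a cheap way to answer "what changed?".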
- Find the root cause
- Reproduce and debug
Reproducing the error lets us gather more information, either by debugging or by adding logging. This helps us understand why something happened the way it did and what went wrong.
If we are lucky, the things that don't work fail consistently. It's much harder to debug something that doesn't always happen.
- Elimination
Another method to pinpoint the problem is to compare a working case to a non-working case. If a piece of functionality used to work at a certain point in time and now it doesn't, then there is a point where it went from a "working" to a "not-working" state, and we only need to locate that point. This can be done theoretically or practically, by removing parts of the code, by copying parts of the code to a new environment, or by bisecting commits. (Is it only me, or does this remind you a bit of the intermediate value theorem? 😇) Note that this is basically a brute-force method, and it does not require any deep knowledge of how the code works. That makes it a very powerful methodology: it enables you to solve problems with minimal knowledge, just trial and error.
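Bisecting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `find_first_bad` and the `is_good` check are hypothetical names, the versions are ordered from oldest to newest, the first one works, the last one doesn't, and once something breaks it stays broken. On a real repository, `git bisect` automates the same search.

```python
def find_first_bad(versions, is_good):
    """Return the index of the first version for which is_good() is False.

    Assumes versions[0] is good, versions[-1] is bad, and that every
    version after the first bad one is bad too.
    """
    if len(versions) < 2:
        raise ValueError("need at least one good and one bad version")
    low, high = 0, len(versions) - 1  # low is known good, high is known bad
    if not is_good(versions[low]) or is_good(versions[high]):
        raise ValueError("need a good first version and a bad last version")
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(versions[mid]):
            low = mid   # the break happened after mid
        else:
            high = mid  # the break happened at or before mid
    return high


# Hypothetical example: eight commits, and the check starts failing at "f".
commits = ["a", "b", "c", "d", "e", "f", "g", "h"]
first_bad = find_first_bad(commits, lambda commit: commit < "f")
print(commits[first_bad])
```

Each step halves the range, so even a thousand candidate commits need only about ten checks, and none of them requires understanding the code.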
- Play the detective
This method is useful when solving problems that are hard to reproduce in development and all you have is the information you gathered.
Make hypotheses and prove or disprove them using the symptoms, just like Sherlock Holmes. Note that even a lack of symptoms is a symptom, as in the Sherlock Holmes story "Silver Blaze", where a dog not barking revealed that the criminal was someone familiar to the dog. Similarly, in our case, a missing log line can tell us something about a log that initially looks free of errors.
In addition, be suspicious. If something doesn't seem right, even if it's not a clear error, it might suggest a problem worth investigating.
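The "dog that didn't bark" check can also be automated. Below is a small Python sketch, assuming you know which messages a healthy run always writes. The function name and the log lines are made up for illustration.

```python
def missing_milestones(log_lines, expected):
    """Return the expected markers, in order, that never appear in the log."""
    return [
        marker
        for marker in expected
        if not any(marker in line for line in log_lines)
    ]


# Hypothetical log of a run that "looks clean": no errors, but one step is silent.
log = [
    "12:00:01 INFO service starting",
    "12:00:02 INFO config loaded",
    "12:00:09 INFO listening on port 8080",
]
expected_markers = ["service starting", "config loaded", "cache warmed", "listening on port"]
print(missing_milestones(log, expected_markers))
```

Here the log has no error at all, yet the absent "cache warmed" line points at the step worth investigating.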
Finding the root cause can be time-consuming, so it's important to recognize when you've reached a dead end.
If you feel like you're stuck, try something different, switch methods or stop and take a step back to think if the problem has been defined properly.
- Find the solution
- Google
Simplify and generalize your question before you search for it on Google. Eliminate the specifics as much as possible.
- Handle the cause
- Focus on what you need, not what you want
For example, you are conducting integration tests with another team and their environment becomes unavailable. You might want them to fix that environment, but what you actually need is an environment, not that one specifically.
- Handle the symptoms
It's important to note that sometimes handling the cause is not the best course of action in the short term. That may be the case when the problem is critical and you can't develop or deploy the solution quickly enough, or when you haven't yet found the actual cause. Meanwhile, there's a problem waiting to be solved, no matter what the cause is.
In that case you need to find a solution that is good enough for your problem, even though it might not be what you intended at the beginning. Beware of the difference between a working solution and a "seems to work" solution. Also, don't forget to revisit the problem later on and handle the cause.
- Communicate
Maybe you are not the only one in your group experiencing these specific issues, and others can help.
I believe everything is solvable and the world does eventually make sense, which implies there is a logical explanation behind anything that appears not to work properly, no matter how unreasonable things might seem to start with. Once we realize that it is up to us to solve it, and not someone or something else, there is nothing that can stop us from finding a solution (except maybe unknown bugs in external products...).
It doesn't really matter whose fault it is (counterintuitive, perhaps, to human nature: we want someone else to be blamed instead of working together for the cause). The only thing that matters is that you have a problem and you want it solved, so it is in your own interest to get it solved.
This is what I call professional determination.
Have you encountered any interesting problems recently? How did you solve them?