Back to basics
This blog is about learning data science, and today I will share my progress after a long absence.
The story of my long absence began when I received the results of an A/B experiment at work and wanted to determine if the difference was statistically significant.
During my studies, I took courses that covered statistical tests, so I started to refresh my knowledge. I realized I could perform the test, calculate the p-value, and reject the null hypothesis, but I couldn't shake the feeling that I had no idea what I was doing. The number of “why” questions was overwhelming.
Why this test and not another?
Why am I modeling data with probability theory at all when reality is more complex? At what point am I “paying” for simplifying real data into probability theory?
Why is the significance threshold for the p-value usually set at 0.05? Why sometimes 0.005, and so on?
What does the p-value tell me at all? And what does it not tell me? Why are interpretations on the internet contradictory?
Why does the p-value take into account events more extreme than those observed, if I am not interested in them? (The formal definition is sketched right after this list.)
and more kept popping up along the way.
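For reference, here is the formal definition these questions circle around, written for a one-sided test; the test statistic T and its observed value t_obs are generic placeholders, not anything specific to my experiment:

```latex
p = P\left(T \ge t_{\mathrm{obs}} \mid H_0\right)
```

The “more extreme than observed” part is baked into the definition itself: the p-value is, by construction, the probability under the null hypothesis of a result at least as extreme as the one actually seen.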
How was I supposed to make any business decision with such a limited understanding of the decision-making tool?
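To make the setup concrete, here is a minimal sketch of the kind of test I was running, assuming a two-proportion z-test on conversion counts; all the numbers are made up for illustration and are not results from the actual experiment:

```python
# A minimal sketch of a two-proportion z-test for an A/B experiment.
# All counts below are made up purely for illustration.
from scipy.stats import norm

conv_a, n_a = 412, 5000   # hypothetical conversions / visitors in variant A
conv_b, n_b = 480, 5000   # hypothetical conversions / visitors in variant B

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                    # pooled proportion under H0
se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5   # standard error under H0

z = (p_b - p_a) / se             # test statistic
p_value = 2 * norm.sf(abs(z))    # two-sided p-value: P(|Z| >= |z|) under H0

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
alpha = 0.05                     # the conventional (and arguably arbitrary) threshold
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Mechanically it is trivial, which is exactly what bothered me: the hard part is everything those few lines quietly assume.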
The sheer number of these questions made me suspect that statistical tests are worthless. When I read the various materials available on the internet on this subject, I had the impression that everyone was parroting what I had learned in college rather than asking WHY questions.
I had an ambitious dream of finding answers to all my questions and describing them in a comprehensive blog article.
I was tempted by the bold thesis that tests are worthless. I started reading and searching. I learned more about different types of tests, their assumptions, the CLT, and the history of statistical testing. Unfortunately, I was constantly struck by the impression that the sources I was reading were often contradictory. Some repeated the very interpretative errors that others criticized. I still couldn't find the depth and certainty I was looking for. Something didn't add up; it didn't fit together coherently, and it kept eluding me. Nevertheless, I gathered a lot of fragments related to the components of statistical tests, from which I tried to assemble a coherent whole that would be easy to understand despite its length.
The main thesis was that at no point do we “pay” for simplifying real data into probability theory: when I use probability in machine learning, I can check how well such modeling works on a test set, but in the case of statistical tests, I have to accept the result uncritically.
Additionally, I identified several issues that further undermined the credibility of statistical tests: the potential for arbitrary selection of the significance level, and the sheer amount of overinterpretation of p-values and the resulting confusion on the internet. I also found “The ASA’s Statement on p-Values” by the American Statistical Association, from which I concluded that it is practically impossible to move from the formal definition of a p-value to any useful conclusion about the phenomenon under study.
Privately, I was awaiting the birth of my second child while caring for my first. I read about statistical tests on the sidewalk, phone in one hand, pushing the stroller with the other. Time was at a premium, so progress was slow. After the birth, I went into survival mode, in which only work and family mattered. The space for reflection disappeared. The article dragged on for several months.
The children became a little more independent, and I got back into the game. In search of answers to my questions, I went back to basics. While reading about the CLT, I turned to more serious sources and realized I didn't even know what a random variable was. During my studies, and when using ML after graduation, it was enough for me to equate a random variable with its probability distribution. It turns out that this is not the case… My confidence in my own knowledge collapsed for a moment. ChatGPT cheered me up, saying that it's not that I don't know anything, but that I'm just going deeper.
In the meantime, I asked numerous questions on Cross Validated. Surrounded by educated, serious people, I felt very incompetent. But I believed that reading on my own was not enough; I had to bounce questions and thoughts off people with vastly more experience.
I found it challenging to understand some of the answers because they employed concepts I had never encountered before. When I read that to understand the definition of a random variable I needed to know the concepts of sample space, measurable space, and σ-algebra, I was taken aback. I don't remember ever seeing these concepts on any slides during my master's studies. They focused more on covering descriptive statistics for EDA, then quickly jumped into classifiers, regression, and neural networks. The concept of a random variable did come up in the context of Bayesian Neural Networks (BNNs) and Variational Autoencoders (VAEs), but treating it as a probability distribution was sufficient for my purposes.
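For anyone filling the same gap, here is a sketch of the standard measure-theoretic picture as I now understand it: a random variable is a function, and its distribution is a separate object that this function induces.

```latex
% A probability space and a random variable defined on it (standard definitions)
(\Omega, \mathcal{F}, P): \quad \Omega \text{ is the sample space}, \;
  \mathcal{F} \text{ is a } \sigma\text{-algebra of events}, \;
  P \text{ is a probability measure.}

X \colon \Omega \to \mathbb{R} \quad \text{is measurable: } X^{-1}(B) \in \mathcal{F}
  \text{ for every Borel set } B \subseteq \mathbb{R}.

P_X(B) = P\bigl(X^{-1}(B)\bigr) \quad \text{is the distribution of } X
  \text{ (the pushforward of } P \text{), a different object from } X \text{ itself.}
```

Equating the random variable with its distribution skips the entire first half of this construction, and that first half was exactly the part I had been missing.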
Concerned, I asked ChatGPT what I needed to read to fill this gap. I selected three books whose tables of contents looked encouraging.
The Art of Statistics (Spiegelhalter)
Mathematical Statistics and Data Analysis (Rice)
Statistical Inference (Casella & Berger)
I will certainly not get through all of them (especially since my third child is on the way and another period of “time out” is coming), but I want to fill in the fundamental gaps.
And well, we'll see what happens next :)