Thanks! I just stumbled onto that book the day after I put this post up... going onto my reading list as well.
Further, numbers are opinionated! Someone gathered the numbers for some particular purpose, and the numbers reflect the purpose for which they were gathered. This is something that the Machine Learning people are continuously rediscovering.
Oh, isn't that the truth. Pass the numbers through an ML algorithm, and what pops out at the other end is the original methodology used to gather them. Nothing sinister here -- it's just the way the numbers were collected and organized. Always check where your data came from and why it's organized that way (the "metadata," or data about the data). In fact, the exact data often doesn't matter -- what really matters is the metadata. I spend at least as much time thinking about where the data came from, how it was gathered, how it could have been corrupted or misinterpreted, and how it's organized and labeled as I do about the numbers themselves.
And as they say in physics and chemistry, check your units; that applies in the social sciences as well. I've seen reputable economists compare money supply ($) with GDP ($/year) without explaining the crucial missing variable, money *velocity*: GDP ($/year) = money supply ($) times money velocity (1/year). The velocity is roughly the reciprocal of how long it takes one dollar to circulate around the dollar universe.
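For anyone who wants to see the units point mechanically, here is a minimal Python sketch. The `Quantity` class and the round numbers are hypothetical, made up purely for illustration; they are not real economic figures. The idea is just to carry the exponents of dollars and years along with each value, so that comparing money supply with GDP directly is flagged as a units mismatch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A value plus the exponents of its units (hypothetical helper)."""
    value: float
    dollars: int  # exponent of dollars, e.g. 1 for $
    years: int    # exponent of years, e.g. -1 for "per year"

    def __mul__(self, other: "Quantity") -> "Quantity":
        # Multiplying quantities multiplies values and adds unit exponents.
        return Quantity(
            self.value * other.value,
            self.dollars + other.dollars,
            self.years + other.years,
        )

    def same_units(self, other: "Quantity") -> bool:
        return (self.dollars, self.years) == (other.dollars, other.years)


# Illustrative round numbers only (assumptions, not real data).
money_supply = Quantity(20e12, dollars=1, years=0)  # $
velocity = Quantity(1.25, dollars=0, years=-1)      # 1/year
gdp = money_supply * velocity                       # $/year

print("money supply comparable to GDP?", money_supply.same_units(gdp))
print("GDP units: $^%d * year^%d" % (gdp.dollars, gdp.years))
```

The stock (money supply) and the flow (GDP) only become comparable once the velocity, with its 1/year unit, is in the picture.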
I’m familiar with mistakes in math causing problems in projects. I wrote about one unfortunate experience I had that caused a lot of headaches for me and others on my team. https://davidgelphman.wordpress.com/2013/05/02/awk-that-cant-be-right/
A perfect example. Everyone forgets that for a homegrown script or spreadsheet like this, the *default* is that it's giving bad output – what are the odds that someone wrote it perfectly on the first try? – and the only hope of getting past that is to carefully hand-check some results.
Great post! I come across errors like this all the time (the wheat story, for example) and usually end up sounding like a cranky old man when I point them out.
Another example: mixing up metric and English measurements almost caused a commercial airliner to crash...
https://en.wikipedia.org/wiki/Gimli_Glider
`Keep a sharp eye out for weasel words like “known cases” or “the death toll may increase as bodies are discovered”; these can conceal huge understatements.`
This reminds me of the fight I had with my friend during the initial stages of Covid-19 (yes, we spent our time fighting about things we have no control over). In the initial stages of Covid, as we all know, the number of resolved cases was a tiny fraction of the total number of cases (active + resolved). One fine day, the guy mentioned that the mortality rate for our country was very low (India got too many cases in a few weeks). I couldn't help but notice that the mortality rate was calculated as total deaths / total cases, and I kept telling him that the number is meaningless unless the total number of cases is comparable to the number of resolved cases. People who caught Covid didn't die immediately. Those who died, died some days later, and when the numbers are increasing exponentially, this difference is crucial. Later, after the numbers settled, I saw people getting kind of alarmed that the rate was climbing again from its low.
I still can't understand how the WHO was using that metric when the pandemic was exploding day by day.
Yup, great example.
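The lag effect in the Covid comment above can be sketched with a toy simulation. Everything here is an assumption chosen for illustration: the function name `simulate`, the 15% daily growth, the 14-day lag before a case resolves, and the 2% fatality among resolved cases. None of it is real epidemiological data. It only shows why deaths / total cases understates the rate while cases are growing exponentially.

```python
def simulate(days=60, growth=1.15, lag=14, true_fatality=0.02,
             initial_cases=100.0):
    """Toy model: every case resolves exactly `lag` days after it is
    reported, and a fixed fraction of resolved cases are deaths.
    All parameters are hypothetical."""
    new_cases = [initial_cases * growth ** d for d in range(days)]
    total_cases = sum(new_cases)

    # Only cases at least `lag` days old have a known outcome.
    resolved = sum(new_cases[:days - lag])
    deaths = true_fatality * resolved

    naive_rate = deaths / total_cases    # deaths / all cases so far
    resolved_rate = deaths / resolved    # deaths / resolved cases
    return naive_rate, resolved_rate


naive, by_resolved = simulate()
print(f"deaths / total cases:    {naive:.4f}")
print(f"deaths / resolved cases: {by_resolved:.4f}")
```

With these made-up parameters the naive ratio comes out several times lower than the rate among resolved cases, purely because most cases are too recent to have an outcome yet. Once growth slows and the backlog resolves, the naive ratio drifts up toward the resolved-case rate, which is the "increasing again from the low" effect.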
Explain to me how my stats show, for example, that my post was liked 2 times, when there are 6 likes?