One way to reduce my reading pile. A belt sander.
By:
Eric De Grasse
Chief Technology Officer
4 July 2020 (Paris, France) – In a backhanded way I suppose we are celebrating an “Independence Day”. Our fearless founder and chief faded into the sunset this weekend, somewhat retired but actually devoting time to his video and film work, plus his book. This weekend we finished consolidating all of our operations into our Paris HQ, so all of our companies are now under one roof. And I am finally attacking my “SHOULD BE READ” stack of trees. The piece below was tagged for last year’s reading but … well, you know how it is. It’s by a long-time business contact, a freelance Python developer / data scientist who prefers to remain anonymous to protect his business interests.
Wonder why smart software is often and quite spectacularly stupid? You can get a partial answer in “On Moving from Statistics to Machine Learning, the Final Stage of Grief.”
It’s a long piece and there’s some “mathiness” in the write up. But as the author notes:
this post is geared toward people who are excellent at statistics but don’t really “get” machine learning and want to understand the gist of it in about 15 minutes of reading. If you have a traditional academic stats background (be it econometrics, biostatistics, psychometrics, etc.), there are two good reasons to learn more about data science.
The author, who tries to stand up to heteroskedastic errors, offers some useful explanations and good descriptions of the shortcuts some of the zippy machine learning systems take.
A few passages I found interesting:
As you can imagine, machine learning doesn’t let you side-step the dirty work of specifying your data and models (a.k.a. “feature engineering,” according to data scientists), but it makes it a lot easier to just run things without thinking too hard about how to set it up. In statistics, bad results can be wrong, and being right for bad reasons isn’t acceptable. In machine learning, bad results are wrong if they catastrophically fail to predict the future, and nobody cares much how your crystal ball works, they only care that it works.
And:
I like showing ridge regression as an example of machine learning because it’s very similar to OLS, but is totally and unabashedly modified for predictive purposes, instead of inferential purposes.
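The OLS-versus-ridge contrast the author mentions can be sketched in a few lines of NumPy. This is my own illustration, not code from the article: both estimators solve the same least-squares problem, but ridge adds a penalty term (alpha below) that deliberately biases coefficients toward zero to improve prediction, which is exactly the "modified for predictive purposes" trade-off.

```python
import numpy as np

# Synthetic data: 100 observations, 3 predictors, known true coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# OLS: minimize ||y - Xb||^2, giving b = (X'X)^-1 X'y.
# Unbiased under the classical assumptions -- the inferential choice.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: minimize ||y - Xb||^2 + alpha * ||b||^2,
# giving b = (X'X + alpha*I)^-1 X'y.
# The penalty shrinks coefficients toward zero: biased, but often
# lower-variance and better at out-of-sample prediction.
alpha = 10.0
b_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

print("OLS coefficients:  ", b_ols)
print("Ridge coefficients:", b_ridge)
```

With any alpha > 0 the ridge coefficient vector is strictly smaller in norm than the OLS one; a statistician reads that as bias, a machine learning practitioner reads it as regularization.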
One problem is that those individuals who most need to understand why smart software is stupid are likely to struggle to understand this quite helpful explanation.
Math understanding is the problem. The lack of “mathiness” ingrained in most people is why “smart software” is likely to remain a very large swamp for so many who need to use it.
BONUS BY THE SAME AUTHOR
“Coding is Not Computer Science”
Coding is computer science in the same way that buying something at the store is economics, or talking to your neighbor is sociology.
Buying a widget at the store is governed by dynamics described by economics. We can use economics to answer questions like “why was the widget priced the way it is?” or “why does this store stock widgets in the first place?” But it’s a stretch to say that participation in the economy is doing economics.
Similarly, when you input code into your computer, the way that your compiler or interpreter takes the code you wrote and does stuff with it is computer science. Sometimes your code’s numeric variables are stored as floating points, and sometimes those floating points that ostensibly should be equivalent aren’t actually equivalent; the reason why is explained by computer science. But that doesn’t mean coding is computer science.
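The floating-point surprise the author alludes to has a standard one-line demonstration in Python (my example, not his): decimal fractions like 0.1 and 0.2 have no exact binary representation, so arithmetic on them accumulates tiny representation errors.

```python
# 0.1, 0.2, and 0.3 cannot be stored exactly in binary floating point,
# so the sum of the first two is not bit-for-bit equal to the third.
a = 0.1 + 0.2
b = 0.3
print(a == b)              # False
print(a)                   # 0.30000000000000004

# Comparing with a tolerance is the usual remedy.
print(abs(a - b) < 1e-9)   # True
```

Explaining *why* this happens (IEEE 754 binary representation) is computer science; knowing to compare floats with a tolerance is just coding.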
More and more, the computer sciencey parts of coding are things we can ignore because they’ve been abstracted away from us in modern programming languages. I can write a whole Python program that never directly:
- allocates memory to variables
- assigns data types
- worries about integer overflow or underflow
- implements whatever a “quick sort” is
- implements a backwards for-loop using i--;
- defines any network stuff when doing simple HTTP requests
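The backwards for-loop item on that list is a good illustration of the abstraction the author means. A small sketch of my own: where C manages an index explicitly with i--, Python hides the bookkeeping entirely.

```python
items = ["a", "b", "c"]

# C-style equivalent: for (int i = n - 1; i >= 0; i--) { ... }
# Python abstracts the index away:
for item in reversed(items):
    print(item)            # prints c, b, a

# When the index itself is needed, range() still does the counting for you:
for i in range(len(items) - 1, -1, -1):
    print(i, items[i])
```

Neither version requires knowing how the loop counter is laid out in memory, which is precisely the point.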
There was definitely a time when all this wasn’t true: in ye olde days, computers were slower, optimizing for performance mattered, programming languages were less developed and more statically typed, StackOverflow wasn’t around to solve all our problems. And there are definitely still times you want someone with real computer science knowledge to write your organization’s code, i.e. when the computer sciencey stuff really matters.
One reason we’ve abstracted a lot of this stuff away is because it’s tedious and obfuscates the “business logic” of code; not even computer science majors enjoy diagnosing segfaults when they could be writing business logic. But once our code is all business logic, where does the computer science come in, exactly? If the code you’ve ended up with reads less like a set of instructions for how to allocate memory and move around a bunch of 0’s and 1’s, and more like an instruction manual for a financial model or a biological model, it’s hard to say in earnest that the person maintaining it should have a computer science degree instead of a finance or biology degree.
There are two important observations that arise due to the abstraction of code away from computer science.
The first observation is that it’s odd how tech firms still prioritize computer science degrees or quantitative graduate degrees for any job that involves writing code. Presumably computer science degrees are a heuristic for whether you’ve written code before, but this hiring criterion still holds even for candidates who can demonstrate their ability to write a modicum of code. I’ve written before about how computer science degrees, by themselves, don’t prepare you to write code in a real world setting. Despite all this, I’ve gotten rejected from jobs that would involve effectively doing econometrics (I have an economics degree) apparently because I didn’t have what they considered to be proper experience (i.e. working at a tech firm or having a computer science degree).
I’m sure tech employers don’t think of it this way, but this is an egregious double standard. The implication here is either that only computer scientists can be trusted to write code, or that every field other than computer science can be learned on the job, whether it’s biology, sociology, behavioral science, economics, finance, and so on. But this attitude is weird if coding isn’t computer science.
This is not a one-way street, and the blame doesn’t lie solely on tech firms and VC’s. This leads to the second observation, which is: if you don’t need to understand how computers work to write code, it’s unclear why so many college departments are still reluctant to implement coding more deeply into their curricula, starting in the first undergraduate year. In most undergraduate social science programs, coding is confined to a single applied statistics course in R, a class you don’t take until more than halfway through your degree.
I’m not saying that the point of a liberal arts education should be to prepare you for the corporate world. Coding has a lot of value as a purely academic pursuit: it facilitates the discovery and accumulation of knowledge for any field with hypotheses and theories that can be tested with data. More academic departments should make coding a larger part of their curricula, not because social science students should be learning a lot of computer science, but precisely because students don’t need to learn much computer science to code. Because coding is not computer science.
Happy Fourth of July to our American readers