The BAD of CSCI E-29: Advanced Python for Data Science, Harvard Extension School Fall 2020. A Review.
TL;DR: I want to get off Scott’s wild ride!
Summary of the GOOD.
Software carpentry good
Get taught a mature approach to design
Disciplined environment/package management
Focus on continuous integration
Use of git, conventional commits, and facilitating collaboration
Amazing peer review from other students
Generous marking of assignments
Modern, popular libraries used: pandas, numpy, Django, Luigi, pytorch
Summary of the BAD
Lectures that are too long and do not implement active learning
Code on lecture slides
Deliberate avoidance of Jupyter notebooks
Vague instructions in the beginning regarding environment and setup (I used Windows)
Unclear instructions within lectures and project descriptions
Projects that rely on previous projects to complete.
Very little option for collaboration despite stated goals of course
Pipenv!! Locking failed. Windows errors
Stacking deadlines inconsistent with professional life
All in.
In November of 2020, I was lying in bed at 2300 trying to get off to sleep. I heard a banging sound outside and asked my girlfriend what it was.
“I don’t know,” she said.
Fucking kids! I whispered, doing my ‘old-man face’. The banging continued.
Right that’s it!
The unlikely arena.
I jumped up out of bed and went to my balcony, ready to do my angry Clint Eastwood impression. When I looked over the balcony, I saw a man being beaten with a metal pipe in my driveway.
Oh dear.
I looked up at the windows around me, and could see several neighbours just like me with mobiles to their ears.
Oh good, someone’s calling the police.
Suddenly a cry came up from the man “Help me! Help me!” then a gurgling sound. I saw that the assailant had pinned him by the throat and was pushing down on the pipe, cutting off his airway.
Shit. Shit. Shit. Airway. A. Manage the airway. Manage the airway.
I guessed the time it would take me to get down the stairs. 3 stories. 30 seconds. You can live 30 seconds without air. I ran inside and punched both legs into a pair of chinos I had on the ground.
Someone’s being attacked and... they're asking for help so...
Her eyes went wide. “Should I come?” she asked.
Yes...come. And bring a phone
I blew out the door and down the stairs, hoping that I wouldn’t have to fight someone.
When I burst out of the lobby, I saw a car speeding away down the street, and the pipe-wielding assailant running towards the river. No victim anywhere.
Oh well, in for a penny, in for a pound. I took off after them.
Fortunately they didn’t realize they were being followed, and they slowed to a walk a block away. I followed silently about 30 metres behind them and hid in the shadows whilst they threw the pipe into the bushes on the riverbank.
The rest of the night involved half a dozen police officers, dozens of witnesses, lots of statements, and myself and a few officers scouring the riverbank for the weapon until we found it planted in a particularly dense bit of bush. I didn’t get to sleep until 0200 and the next day I asked the pharmacist on our ward to please check all my prescriptions because I was so unbelievably tired that I couldn’t be sure I was adding the doses up correctly.
“Should I wade no more, Returning were as tedious as go o'er” (Shakespeare, Macbeth, 3.4.142-144)
That incident was less stressful than completing this course.
Abstract
Advanced Python is an incredible subject unlike anything I've ever studied before. It's one of the only subjects I've been involved in where there's clearly an underlying philosophy that stands in opposition to the mainstream University culture (the only other subject that comes close is Melbourne University’s Evolution and the Human Condition, whose lecturer seemed perpetually on the brink of being expelled).
It's also one of the most difficult subjects to make headway in. In this subject, I sometimes felt like I worked an entire Saturday to add 1% to my grade.
Taking Advanced Python for Data Science is an exercise in wading through.
Who needs a class ring - it’s easy to spot an Alumnus.
Scott, this course gave me my first grey hairs.
1. Lectures that drag on and on & Code on lecture slides
First, there are the long lectures. Lectures were 2 hours. It took me 3 hours to watch them. This doesn’t really work.
First of all, this course is supposed to be taken alongside employment. Therefore, I suppose that most people come home from their day jobs and then attempt to watch the lecture. Do you know how difficult it is to find 3 hours on any given night to watch something?
I don’t even have children and it’s difficult.
Yes, the lecture has a ‘breakpoint’ in the middle, but it’s not even clear at that point which readings to go do, or what the nature of the next project will be. So if you stop there, you probably have to come back to finish the lecture to get any direction.
So after watching the lecture for 3 hours, the best possible thing is to immediately start exercising to consolidate the learning.
But there are no exercises. Only projects.
To get started, the projects usually take about an hour of fiddling with git, pipenv, and docker, plus idiosyncratic terminal commands. It could be two hours if things aren’t going well! So in order to get to the point where I can actually apply the learning, I’m looking at 4 hours. Do that in one night? In your dreams. Oh, and if I take two nights, I become even more inefficient, because it’s harder to remember things, so I need more hours.
Then there’s the code written into the lecture slides. This usually depicts a particular implementation of the topic. Unfortunately, the amount of code is quite small. The good news is that, for the open-book exam, I’m really just looking at the same two slides of code and deciding which one will give me the answer. The code on the slides is also often too small to read comfortably, not to mention hard to copy into any IDE.
I’m strongly against code in lecture slides now. Hmmmm...if only there were a better way to showcase code.
Perhaps in a way that I could run it?
How it feels switching from jupyter notebooks to a new IDE.
Another thing: there is absolutely no reason to put the readings in a random slide at the end of each lecture. The readings should all be listed in Canvas, with an option to comment on them. I was so sick of going back into these massive PDF files to look for the slide with the 10 reading links I could click on. Yes, I bookmarked them. No, I shouldn’t have to do that. The readings themselves were great.
2. Deliberate avoidance of Jupyter notebooks
There is a massive blind spot in this course, and it is the avoidance of Jupyter notebooks. Jupyter notebooks are an incredible tool for exploring and understanding code, and this course needs them badly.
Scott’s objections to Jupyter notebooks are worth listening to. I learned a lot!
I even became somewhat anti-Jupyter at the start of the course.
However, the rebuttal is even better. Jupyter notebooks running within virtual environments for working on projects are excellent, and I believe that teaching us to use them correctly is much better than discarding them.
Many of the exercises within the projects were prime material for Jupyter cell execution, as they involved class inheritance and composition. In fact, I can’t really understand how anyone could become familiar with the concepts taught unless they were using something very ‘Jupyter-like’. I think another vote of confidence in Jupyter is the fact that notebooks are used in CSCI E-109B Advanced Topics in Data Science.
In summary, this course has to make up its mind.
Either it’s a course about exploring advanced and idiosyncratic methods in python, which would work well with Jupyter notebooks
OR it’s a course about deploying applications and doing replicable data science, which wouldn’t require notebooks.
It cannot and should not be both.
“Either teach start fire or teach put out fire. Not both. Me hands get burnt.” - A depiction of me giving strong opinions on how to teach code.
3. Unclear instructions in the beginning regarding environment and setup (I used Windows)
I asked a very important question at the start of the course:
In the opinion of those reading this, how should I set things up to complete all the assignments in this course with the least possible hassle?
This answer came, I believe, from the Professor, Scott:
I do not recommend using jupyter at all for this course - any amount of 'try it in notebook, paste into script and hope it runs' is a massive waste of time IMHO and worse, risks introducing bugs since you're no longer debugging the actual code you write and submit. Tools like pycharm have all the same things you like in Jupyter for quick interactive debugging with none of the drawbacks - time to learn how to use them!
You will actually be just fine working entirely on your windows instance for this course, at least in the beginning - just be prepared to use the docker containers we provide for later psets. I do recommend focusing on local development rather than remote, even if the remote system itself is better. We will not be doing anything that requires the scale of a remote computer.
END
Unfortunately, in starting with the local environment, on Windows, it becomes the only environment I, as a beginner, know how to use. So when I have to start using Docker late in the course for things that don’t run well on Windows, there’s a huge learning curve at the worst possible time. Remember, this is data science, and data written in Docker is hard to get back.
Additional note: Instructors, please don’t combine pipenv, CUDA, and Docker for an assignment. Please, for the love of God, just don’t. Not all at once. That’s like three religions sharing one holy city.
Having a few merge conflicts here.
I believe that this student had it right with their reply.
In my opinion, there is no such thing as the best environment setup in terms of working through these courses, it entirely depends on what you are comfortable with and the os you are using. There are quite a few options:
1. You can still use a jupyter environment as long as you write everything/turn-in in a .py and work with a command line (CL) to execute the file. One thing I like about jupyter is that I am able to switch between environment set-ups. In addition, we may just be git clone assignments and it is nice to open up files through the jupyter environment.
Note: You can still use a notebook to test certain aspects of the assignments.
To create a .py, first, create a text file then rename it to the "whatever name suits you".py Also, add this to the .py file (shebang line)
#!/usr/bin/env python3
2. If you are going 100% Ubuntu and accessing everything on EC2. You can still have a jupyter environment setup interacting with EC2. Again it depends if you are comfortable with the CL for environment setups. Though it seems to me it is too much of a hassle and it can be done it entirely depends on you.
3. IDEs are a great resource for fill-in code or adhering to the tab vs space in files.
4. In terms of programming, windows environment is not developer-friendly. I prefer to stick to a linux/unix environment.
In my opinion for the least possible hassle just stick to option 1. As it seems you prefer a jupyter environment. Although it seems for this class it is trying to move away from jupyter environment. It is a lot to think about. Though I am pretty sure the instructors are able to help with this.
_______________________________________________
Just to add another perspective to Julia's great answer.
If you can spare some 10Gb or more on your local desktop, running Xubuntu in a VM (VirtualBox) would provide you with an appropriate environment for this course and can be set up in less than 30 minutes.
Pycharm is definitely an asset, especially considering that this course makes extensive use of Git and such, which the IDE integrates in its interface.
END
They were absolutely right. Windows is not developer-friendly. Windows is more like developer-landlordly.
No, he won’t fix it - you broke it! By the way, you need to pay him.
If I could go back in time, I would 100% do this on a local virtual machine. At least! I might even go for a full virtual machine running on EC2 with a CUDA-friendly GPU which I can ssh into.
4. Unclear instructions within lectures and project descriptions
I know I can’t be the only one who found it difficult to work through the projects simply due to not understanding what to do next. Sometimes the instructions were incredibly specific, e.g. ‘run command X in the terminal’.
Sometimes they were painfully vague. In pset-6 I read:
Admin Pages
You should create admin models so you can view this in the admin!
Tips regarding some funny Django Admin behaviors:
The display of the model in the list page is defined by __str__(self)
What!? What is 'the admin'? Why do I want it? What does it do? All I have is a link to the django admin documentation at the level of the admin model. It’s just as hard to understand.
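One piece of that tip can at least be shown without Django at all. By default, the admin's list page labels each row with whatever `str()` returns for the model instance, which is why `__str__` matters. The sketch below uses a plain Python class as a stand-in for a model; the `Stock` class and its fields are hypothetical, and registering a real model with the admin is not shown.

```python
class Stock:
    """Hypothetical stand-in for a Django model (plain Python, no Django needed)."""

    def __init__(self, ticker, price):
        self.ticker = ticker
        self.price = price

    def __str__(self):
        # The label a list page would show for this row
        return f"{self.ticker} @ {self.price:.2f}"


class Unlabelled:
    """Same idea without __str__: falls back to the default object repr."""


rows = [str(Stock("ABC", 12.5)), str(Unlabelled())]
print(rows[0])  # ABC @ 12.50
print(rows[1])  # a generic label such as <__main__.Unlabelled object at 0x...>
```

Django's base model class supplies its own generic default label rather than the raw repr shown here, but the point is the same: without a `__str__`, every row in the list looks alike.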
I was genuinely frustrated, because I would have loved to learn more about Django. After the course, I sat down and watched a 3-hour course on YouTube to learn it specifically. I mentioned to my girlfriend that I should have probably taken 3 hours out of the project time to watch it the first time around.
But that’s just another 3 hours of time I need to find!
It was projects like that where I genuinely felt stupid. Not because I couldn’t do the work, but because I didn’t know why I was doing the work. I was following a recipe to make a Django website. It sucked.
I got great marks though.
5. Projects that rely on previous projects to complete.
Even worse than unclear instructions were unclear indications of success and failure. It was actually really hard to tell whether I’d written a function or a class correctly in an assignment.
Since… you know… I am also the one testing it.
My beautiful beautiful code.
This isn’t actually so bad if you just want to submit a project and walk away. But when all future problem sets require you to use massive amounts of code from previous assignments - all the shit hits the fan. In fact, picture a dozen fans equidistant in a bounded cubic space, and one central overflowing toilet. Any incorrect implementation will come back to haunt you. You’ll just keep carrying those bugs around like a broke light-worker backpacking through Thailand.
There are some workarounds. You could possibly copy and paste code from others’ assignments once you do their peer review to try to fix your mistakes….
Oh...but you don’t actually see their CSCI utility functions until the course ends?
And those are the functions which will make or break your assignments?
Never mind. Just get everything right the first time - you’ll be fine mate!
6. Very little option for collaboration despite stated goals of course
Even though we used git and peer review relentlessly, I actually didn’t develop with anyone on this course.
Not for lack of trying though. I reached out to another student who did the MITx MicroMasters in Statistics and Data Science. The attempt to collaborate can be summed up with these messages:
[1:33 am, 01/11/2020] J.I.: Hi Michael,
How are you? Nice to connect
[7:23 am, 01/11/2020] J.I.: I think your idea sounds good, would be cool to work on that project together
…
Michael James Woodburn: Sounds good. I've just added you as collaborator on a cookiecutter repository
…
[5:02 pm, 03/11/2020] Michael James Woodburn: I'll add you - what's your AWS username?
[5:02 pm, 03/11/2020] Michael James Woodburn: That way we both have access
[10:57 am, 04/11/2020] Michael James Woodburn: You'll love the way the S3 bucket is set up. It means that when we need an image, we can just fetch the one image we need, not the dataset (11GB).
[11:13 am, 04/11/2020] J.I.: Awesome, I will check the user name, hard day at work
[11:13 am, 04/11/2020] J.I.: I will come back in the night
…
[5:07 pm, 05/12/2020] Michael James Woodburn: how are you going with Pset_6?
[10:45 am, 06/12/2020] J.I.: Haven't started yet :/
Doing the presentation
[10:45 am, 06/12/2020] J.I.: How is it going for you? You started?
[10:46 am, 06/12/2020] Michael James Woodburn: How are you doing the presentation? I thought we were working together
[10:49 am, 06/12/2020] J.I.: Yeah, I read that every team member should do a separate presentation, I am doing the biomedical search stuff. Honestly, it seems to be straight forward, would like to share it with you. I think you are better in implementing python than I am. We could get both projects done I think
[10:50 am, 06/12/2020] J.I.: Would like to get your feedback on this as well. I think this could be also very helpful for your work in general
[10:51 am, 06/12/2020] J.I.: I will try to get the presentation done by tomorrow, and then send it over to you
[10:51 am, 06/12/2020] J.I.: I am also open to get this biomedical search thing done in a collabo, and then focus on the image analysis part?
[10:52 am, 06/12/2020] J.I.: My boss said that I could add some of my work time for that project as well
[10:53 am, 06/12/2020] J.I.: So I hope to get it done, however I don't know how much python it will be 😂
[10:53 am, 06/12/2020] J.I.: You think to do some machine learning approaches?
[10:53 am, 06/12/2020] J.I.: Do you have experience with NLP?
By this point I was wondering whether I should get stuck into this guy. But I guess I took the high road.
[11:49 am, 06/12/2020] Michael James Woodburn: Yep share it as repo
[11:49 am, 06/12/2020] Michael James Woodburn: Heaps in NLP
I never received anything!
Don’t worry about it though. The Romans have some great projects.
This was the absolute low point of the course for me. Not only was the final project due in 10 days, but my partner had decided to do entirely his own thing, to the point of presenting on it, without even telling me.
When I showed this to my uncle, he made a funny comment:
“How do these people even function?”
My thoughts exactly.
The good thing is that I already anticipated this might happen: Expect the best, prepare for the worst. I quickly estimated how much time I’d need to cover his part.
Oh. All of my time. Ok. Could be worse.
What kept me going was actually the principles that Scott had drilled into us as we went.
Don’t Repeat Yourself!
I quickly refactored all the Jupyter notebooks I’d crafted for J.I. into ReadMes and also into my presentation, which I delivered live to my Zoom tutorial for bonus marks. My code was already commented for him, which meant it was also commented for the assessor, who probably wondered why I was calling him the wrong name all the time.
So, not a happy experience for me. But would the Harvard experience be complete if not for a little conflict between collaborators?
If you guys were the inventors of Radio-Star, you'd have invented Radio-Star
7. Pipenv!! Locking failed. Windows errors
Pipenv is a tool that extends the functionality of pip so that dependencies are only installed when every package’s own dependency requirements can be met without conflict. If the conflicts can’t be resolved, nothing is installed. This is known as “locking failed”.
Locking failed.
Locking failed.
Locking failed.
Sorry, had a flashback there for a second.
You see, the problem with that is that pipenv only has to think that there are conflicts to fail locking. There might not be conflicts. There may just be some weird issues with the way the authors wrote the version, that pipenv hates.
By the way, every time locking fails, you have to attempt to resolve dependencies and install again.
Resolving dependencies must have high complexity, because it took ages each time, and it’s really hard to say whether it will succeed or not before running it, due to the myriad ways it can fail.
You can keep skipping lock, but that only kicks the can down the road. If you can’t lock at the end, you can’t deploy with pipenv!
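To make the failure mode concrete, here is a toy version of what a lock step has to do: pick one version of every package so that all the declared constraints hold at once. This is a sketch under heavy assumptions, not how pipenv is implemented: the package names, the versions, and the `INDEX` structure are all invented, and the search is plain brute force.

```python
from itertools import product

# Hypothetical package index: name -> version -> {dependency: acceptable versions}
INDEX = {
    "app-a": {"1.0": {"shared": {"1.0", "1.1"}}},
    "app-b": {"0.9": {"shared": {"1.1"}}, "1.0": {"shared": {"2.0"}}},
    "shared": {"1.0": {}, "1.1": {}, "2.0": {}},
}


def lock(wanted):
    """Return one pinned version per package, or None if no combination works.

    `wanted` maps a package name to the set of versions the user accepts.
    """
    names = sorted(INDEX)
    # Try newest versions first (string sort is good enough for this toy data)
    choices = [sorted(INDEX[name], reverse=True) for name in names]
    for combo in product(*choices):
        pins = dict(zip(names, combo))
        # Skip combinations the user did not ask for
        if any(pins[name] not in ok for name, ok in wanted.items()):
            continue
        # Accept only if every package's own requirements are satisfied
        if all(
            pins[dep] in ok
            for name, version in pins.items()
            for dep, ok in INDEX[name][version].items()
        ):
            return pins
    return None  # the toy equivalent of "Locking failed"


strict = lock({"app-a": {"1.0"}, "app-b": {"1.0"}})
relaxed = lock({"app-a": {"1.0"}, "app-b": {"0.9", "1.0"}})
print(strict)   # None
print(relaxed)  # {'app-a': '1.0', 'app-b': '0.9', 'shared': '1.1'}
```

In the strict case the two apps demand incompatible versions of `shared`, so nothing can be pinned. Loosening one requirement lets the search succeed. The number of combinations multiplies with every extra package and version, which hints at why real resolution can be slow, even though real resolvers are much smarter than this loop.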
Pipenv also cracks the shits when you try to install PyTorch with no CUDA (just CPU) on Windows.
Pipenv cracks the shits when you try to install from a private repository without git credentials (it hides the username and password prompts behind some bullshit loading symbol, so you just think it’s still resolving).
Pipenv genuinely took up most of my time on this course.
The only good thing I can say is that I like the idea and I got to make a lot of coffee when it was running. I hope Conda implements the same functionality too one day.
The worst thing I can say about this package is that it should be banned from any university course. In fact, any package that has unresolved issues and whose community is almost in open revolt should not be integrated into an admission course for HES.
Bugfixing pipenv took so much valuable time that could have been spent actually mastering Python. It created such massive uncertainty regarding when the project would be finished that I completely lost faith in this course as an adjunct to a day job.
Pipenv escalated this course from frustrating to infuriating.
8. Stacking deadlines inconsistent with professional life
Finally, to follow on from the pipenv debacle, I have to say that this subject had some borderline insane deadlines.
The deadlines were supposed to be:
December 2 Pset 6
December 9 Final project presentation
December 9 Final project
December 16 Final Exam
In reality, those final deadlines were:
December 8 Final project presentation
December 16 Pset 6
December 16 Final project
December 17 Final Exam
So the final project and the final exam, which I was supposed to dedicate additional time to, actually had to be done in conjunction with the last problem set. Come on.
Yes, I know that we had more time to do them because the deadlines were pushed back.
No, that doesn’t make it work!
The worst night of the course was having to stay up all night to get pset-5 working.
The second worst night was having to leave my sister’s graduation dinner early to get home by 2200 so I could complete the final exam and sleep before getting up for work at 0630 the next day.
Remember, this is a course which is supposed to be taken in addition to having a job. Keeping an exam open for 24 hours and having it open immediately after handing in two massive projects is a huge squeeze.
For almost any other subject, you might have the option to just compromise.
Hell’s bells, in University, some of my friends wouldn’t go to the final exam because they knew that they had already passed.
But for this course, knowing that you absolutely had to get a B minimum, knowing that you could go $4000 down if you didn’t, then yeah, I would do whatever it takes.
Fortunately, MIT had brainwashed me with their MIT Mindset™ for this very reason.
Thanks MIT!
But there are consequences for that. My girlfriend had to do all the cooking. My research collaborators at the hospital were messaging me asking why I hadn’t run certain tests we’d talked about.
I really didn’t want to tell them that trying to get into Harvard was more important to me than detecting kidney stones. But I also couldn’t pretend that this course was anything less than number one priority in that time. Shockingly, I even had to leave work on time, which enraged my manager and led to a massive HR headache within a week.
Fortunately for me, my manager was very explicit when he said that I “should be disciplined” for leaving work on time.
That’s actually something that my union has a very strong opinion about, and I feel very sorry for what’s about to happen to that man.
Final comments:
My boss (not my manager, different person), who is a consultant psychiatrist, was at a work lunch and described taking morphine for a broken leg.
“It was awful, just awful. I felt a massive sense of dread the whole time I was on it”.
Some people have that reaction to opiates, and her trainees suggested that her opiate receptors were of that nature.
Maybe it was your defense mechanisms. I chimed in. Defense mechanisms are thought patterns we have which protect us from experiencing pain.
The whole party stared at me with nervous disbelief. Careful now.
“Oh?...” She raised an eyebrow. Although she had the voice of Moaning Myrtle from Harry Potter, I was getting a strong Margaret Thatcher vibe.
Yes. I continued (unwisely),
The morphine knocked out your psychological defense mechanisms for just a second, which allowed whatever was underneath to come out.
Dead silence.
“But, Michael, whatever could there be underneath? Something I’m keeping deep down and repressed?”
I knew the best thing to do was to say nothing. My head was not quite in the noose yet.
“Apart from how I’m old, barren, and alone?” she added, and she burst into intense cackling. “We’ll make a psychiatrist out of you yet, Michael!” The whole table breathed a sigh of relief and chuckled nervously.
She clearly had forgotten about my hilarious hypothesis when she told me in December 2020 that “I was clearly not interested in psychiatry”, strongly implied that I had a personality disorder, and suggested that I “Not trample too many people on my way to the top”.
Don’t you hate it when Santa puts coal in your stocking?
I’ll do what I can, I replied, But personality is chronic. You taught me that.
Throughout this whole course, I realised that my defense mechanisms are practically designed for this work. That’s why I passed.
Maybe I have “reaction formation”.
People who use this defense mechanism recognize how they feel, but they choose to behave in the opposite manner of their instincts.
When I felt like giving up, I cleared more of my schedule. When I got tired, I decided to code until I couldn’t read the screen. I didn’t mind the time. I didn’t watch the clock. So I didn’t notice the dozens and dozens of hours that were being sucked into it.
Or, if I did, I reminded myself that Bill Gates slept in the computer lab too.
I’m also lucky that I have a fairly reliable instinct for these python problems, and I would occasionally have some brainwave that would give me the breakthrough I needed. The more impossible the problem seemed, the more trust I gave my brain to figure it out.
Scientists have only just begun to unlock its mysteries.
I also had another defense mechanism, which is less savoury, but I will try to be honest with you all.
I had pride.
So, as much as I feel sorry for those who dropped out, I will also admit that it only made me feel stronger about still being in the game.
Now that the class is over, I feel a lot of understanding for those who didn’t pass, and also for those who pulled out for fear of not passing.
Because that fear is absolutely appropriate.
This subject should scare you. You really don’t know what can happen, and lots that can go wrong will, in fact, go wrong. Although the TAs are good, the superstructure of the course is compromised, and the lack of intermediate exercises means that you will often fail to reach the next rest stop, leaving you...somewhere in between.
If you can just get every part of every project partially done, you will probably pass. But it seemed like almost every part was built on the last.
So I can understand completely if anyone realistically could not do that.
If I hadn’t been able to devote almost every weekend and night to this course, I honestly don’t know whether I could have passed.
I reflected that, just by the virtue of putting hours in, I must be growing in my understanding of data science. My career trajectory started to change, and people kept asking me why.
"So you don't want to be a doctor anymore?" - I'm sick of hearing that! You don't hand your MD in when your contract expires.
How can you put 2 hours a day into something for 16 weeks and not have it change things?
In fact, it feels good to have the wind blowing in another direction for once.
So, Advanced Python course, push us all you want.
But please, no more squeezing.