Data Science Study Diary

9 March 2020

Today is the first official day of the first course in the Post Graduate Diploma of Data Science with the University of Adelaide (online).

They enabled access to the materials two weeks ago, and I have already spent over 38 hours since then reading the materials for the first two weeks and working on the first assignment.

In any given hour I could go back and forth several times between feeling like I've got this, and crippling self-doubt and anxiety. I doubt whether I have enough math experience to understand what is going on, even though the facilitator said not to worry about the math in the introductory webinar. I doubt whether I will be able to keep up. I doubt whether I will have any idea what I am doing when writing papers. I am afraid that what I have done on the assignment so far will be completely wrong and that my approach to the topic is naive and unsophisticated. I have already had my first sleepless night, worrying about all of the above.

It is ridiculous. I have spent the last few months telling myself that the worst that could happen is that I can't do it and I move on with my life in another direction. I underestimated how intensely my perfectionist tendencies would kick in and how much I don't want to admit that this is too much.

27 April 2020

It has been a wild ride. My first entry above clearly demonstrates my state of mind at the beginning of the course. It was rough. I felt completely inadequate, and was very worried about the fact that it was only a six-week course, with the first assignment due halfway through, so the course would be almost over before I actually knew how I was doing. Much of the initial reading involved detailed statistical calculations or terminology that went well over my head. On top of this, the COVID-19 situation turned into a full-blown pandemic, the whole family was now studying and working from home, and work suddenly got very busy as clients tried to adapt to the new reality.

One challenge was that the assessment descriptions clearly indicated that we were to develop a PLAN for data collection, cleaning and analysis, and a PLAN for a visualisation and validation report. However, much of the course content that referred to the assessments talked about how we should be keeping a diary of the work that we did on our data so that we could describe our work in our report. In addition, it appeared to be nearly impossible to complete the assignment effectively without actually doing the work to collect, clean and analyse the data.

When putting my first assessment together, I couldn't find any guides as to the expected format, and I found that uncertainty a little frustrating. However, I was impressed with the options that the University of Adelaide provided for assignment help. They provide access to a tutoring service where you can upload your assignment/essay and receive detailed feedback in less than 24 hours. The feedback I received was very helpful. The tutor for the course was also happy to review drafts.

In addition, I made contact with the free psychology services offered by the University to talk through how overwhelmed and inadequate I was feeling, as well as to get some suggestions for the resulting insomnia issues.

Summary - Course: 4062_DATA_7202OL Applied Data Science

Content:
Data sources, noisy data, data cleaning, analysis techniques, verification, visualisation, validation and reporting.

Assessments:
There were two major assessments that were two parts of a data analysis plan. In the first assessment we had to identify a target for data analysis, select at least four data sources, and show how they could be cleaned and analysed. The second assessment included visualisation, validation and creation of an update plan. As noted, the assessment outline made it sound like a theoretical exercise; however, I don't think I could have addressed the criteria adequately without actually doing the tasks that I was writing a "plan" to do. Each of the assessments had a 1500-word limit, which I found very difficult to stick within while covering all criteria thoroughly.

There were also 3 discussion board exercises, where we had to write a post that answered questions about our work on the major assessment, then respond to at least 2 other class members' posts.

Time spent:
134 hours over six weeks

Verdict:
I have filled in many gaps in my knowledge and was introduced to topics like regression analysis that I had no prior experience with. I learned that I enjoy the process of turning data into visualisations. One takeaway is the importance of domain knowledge when it comes to data analysis. I used data from a client for my project and was able to use my knowledge of their company to extract some interesting insights. If I were using a data source where I had no knowledge of data provenance or meaning, this would have been much more difficult. I also came to understand how much detail aggregation can conceal.
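As a minimal illustration of how aggregation can conceal detail, here is a Python sketch using invented numbers (not the client data from the project). The product names and sales figures are hypothetical; the point is only that a flat total can hide two opposite trends.

```python
# Hypothetical monthly sales for two product lines (made-up numbers).
monthly_sales = {
    "product_a": [100, 120, 140, 160, 180, 200],
    "product_b": [200, 180, 160, 140, 120, 100],
}

# Aggregated view: total sales per month across both products.
totals = [sum(month) for month in zip(*monthly_sales.values())]

# Detailed view: change from the first to the last month for each product.
changes = {name: sales[-1] - sales[0] for name, sales in monthly_sales.items()}

print("Monthly totals:", totals)
print("Change per product:", changes)
```

The monthly totals never move, so a chart of the aggregate would suggest nothing is happening, while the per-product view shows one line growing and the other shrinking by the same amount.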

18 June 2020

Summary - Course: 4063_COMP_SCI_7212OL Human and Ethical Factors in Computer Science

Content:
Human factors, affordances, friction, schools of ethics, basic ethical principles, bias in data, algorithms and practice, resolving ethical conflicts, ethical issues in data science, the ethics of algorithms, ethical failures, cross-cultural ethical issues, and ethics in reality.

Assessments:
There were three assessments for this course, worth 20%, 40% and 40% of the grade. The first assignment had a limit of 1000 words. It provided a scenario and asked us to identify human factor and ethical issues. The last two assignments (1500 words each) also provided a data analysis scenario and each asked four questions about how to apply ethical schools of thought and principles of privacy and informed consent to the given scenario. I again struggled with the word limit on every assignment, finding myself constantly cutting out detail that I felt was important.

In this class there were discussion boards every week, but they were ungraded.

Time spent:
92.62 hours over six weeks

Verdict:
I found the content of this course very interesting, but the assignments extremely painful. Unlike the last subject, where the referencing direction was "it doesn't matter which style, just be consistent", this time it had to be Harvard Referencing in the University of Adelaide style, and it needed to be perfect. Once again I found myself very uncertain about whether I had adequately met the rubric for each assignment, and right up until the grades were released I didn't know what to expect.

I have spoken to others who have found ethics subjects quite dry, but I felt that the course designers did a good job of selecting interesting cases to review related to data science, and I have used it as a springboard for thinking about how data is handled in companies that I work with. I don't think I appreciated quite how easily bias and exclusion can occur in data analytics. I also appreciated the variety added to the course content with the inclusion of both videos and podcasts.

26 August 2020

Summary - Course: 4064_DATA_7201OL Data taming, Modelling and Visualisation

Content:
An introduction to the R programming language and RStudio, tidying data, summarising data and using histograms for univariate analysis, bivariate analysis (comparing variables of different types), text manipulation, grouping and transformation, PCA, confidence and prediction intervals, linear models, categorical predictors.

Assessments:
There were once again 3 assessments, worth 25%, 25% and 50%. The first assignment tested the ability to read cricket score data into R, manipulate strings to extract data, and analyse the potential relationships between variables using ggplot. The second assignment was an analysis of Pulitzer prize data that involved a simple linear regression model. The last assignment used weather data to explore the relationship between evaporation and other factors such as temperature and humidity.

Time spent:
105.87 hours over six weeks

Verdict:
I felt a bit more in my element with the programming side of this course. I was able to use RStudio fairly confidently and perform the required programming tasks quite easily. The course content had a chatty conversational tone that made it pleasant to work through.

I also found that there were parts of this course that required more interpretation and were very subjective, and for many of these sections there was not enough content in the course materials to allow me to be confident that I understood the concepts. I spent many hours searching for other examples of "good" and "bad" assumption graphs for linear models, trying to understand what was acceptable and what was not.

I found myself wishing for the opportunity for more collaboration with other students in terms of how they might have approached the assessments. I'm not talking about sharing answers. In a physical classroom, conversations naturally happen between students about assignments. In the online environment, anything that is too specific is deleted.

I hit a blocker in the last assignment where I knew that something was not quite right, but the tutors couldn't help much because that would not be "fair". Although I wanted a good grade, what I wanted more was to know that I had understood the concepts correctly, so it felt frustrating to reach a point where I wanted to do something the right way, but no one was allowed to point me in the right direction. I'm here to learn, so it feels like it is in everyone's best interest that help can be provided one way or another.

My verdict is that I learned a great deal and I enjoyed playing with R to create fun visualisations, but I'd like to learn in a more collaborative environment where we can all help each other to succeed.

5 October 2020

Summary - Course: 4065_COMP_SCI_7210OL Foundations of Computer Science - Python A

Content:
An introduction to Python, Anaconda and Jupyter notebooks, using mathematical and string functions, expressions and conditional statements, for and while loops, lists, dictionaries, reading and writing files, objects, functions, NumPy arrays and functions, Pandas for dataframe manipulation and visualisation, and Seaborn for visualisation.

Assessments:
I thought that the assessment in previous classes was full on. Then I saw the schedule for this one. Every week for five weeks there was an assignment and a quiz. The assignment was to demonstrate understanding of the topic for the week, and typically contained between 3 and 4 sections, each with a number of sub-sections. Some of them were relatively quick, but others took a fair bit of thought.

The quizzes often drew from content from the extra reading, and I found a number of the questions were badly worded or contained errors.

Finally, after five assessments (30%) and five quizzes (20%), there was a final project worth 50% of the grade. In this project we selected a data source, read the data into a notebook, cleaned the data, then created a number of visualisations. The data needed to contain at least one date field, as well as numbers and strings, and have at least 20,000 rows and 10 columns.
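The dataset requirements above can be checked before committing to a source. The following is a rough sketch using only the Python standard library; the helper names, the assumed `YYYY-MM-DD` date format, and the choice to inspect only the first data row are all assumptions made for illustration, not part of the course material.

```python
import csv
import io
from datetime import datetime

MIN_ROWS = 20_000   # minimum rows stated in the project brief
MIN_COLUMNS = 10    # minimum columns stated in the project brief


def looks_like_date(value, fmt="%Y-%m-%d"):
    """Return True if value parses as a date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def looks_like_number(value):
    """Return True if value can be converted to a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def check_requirements(csv_file, min_rows=MIN_ROWS, min_columns=MIN_COLUMNS):
    """Check size and (roughly) the field types, using the first data row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    rows = list(reader)
    first = rows[0] if rows else []
    return {
        "enough_rows": len(rows) >= min_rows,
        "enough_columns": len(header) >= min_columns,
        "has_date": any(looks_like_date(v) for v in first),
        "has_number": any(looks_like_number(v) for v in first),
        "has_string": any(
            v and not looks_like_date(v) and not looks_like_number(v)
            for v in first
        ),
    }


# Tiny hypothetical sample, with the thresholds lowered so it passes.
sample = io.StringIO("title,released,budget\nExample Film,2001-05-04,1000000\n")
result = check_requirements(sample, min_rows=1, min_columns=3)
print(result)
```

For a real file, an open file handle (for example `open("data.csv", newline="")`, with a hypothetical file name) would be passed in place of the `StringIO` sample, and the default thresholds would apply.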

I find it strange that you don't get full marks by meeting the requirements outlined on the assessment sheets. Typically it would note that to get full marks, you had to do more than the requirements. For example, to get full marks on the final assignment, you had to choose a data source that required additional processing beyond date and number parsing, or combine multiple sources together.

Time spent:
118 hours over six weeks

Verdict:
I thought the time schedule was ambitious. I felt for those members of the class with no programming experience at all, going from zero to data cleaning and visualisation in six weeks. I am a programmer and I found I needed a lot of time to work through all of the examples and take it all in. I can't imagine doing this without any prior knowledge of programming concepts. I found the Week 5 content in particular disproportionately large, and thought it could have spanned 2-3 weeks. The length of the content in Canvas was deceiving at times because I would finish a section and then it would recommend going and reading a chapter from each of the textbooks, where some chapters were over 100 pages long. Trying to absorb so much information so quickly is challenging.

The other challenge with the schedule is that it is not until you have completed the Week 5 content that you truly have the tools you need to start on the final project. While you technically could read in your file without Pandas and look at the data, you are literally wasting your time because in the end you will want it read in as a Pandas dataframe.

I had started working through the content 2 weeks before the class officially started, because that helps me to feel more able to deal with anything that comes up with family during the teaching period. I had completed 3 of the assignments by the time it started. It is a good thing that I did, because my grandmother passed away during the second week, and I was able to travel to her funeral without needing to rush back to get the assignment for the week submitted.

I ended up spending about 40 hours on the final project, partly because I selected a data source that contained some tricky formatting, partly because I was doing battle with Pandas and Seaborn to get the data into the right format for each type of plot, and partly because I became very intrigued by the data and just wanted to know more about it.

I chose the Movies dataset from Kaggle, and decided to take the angle of exploring the role of gender in movie production. There were a number of issues with the dataset in the end, but I loved being able to look at the data from various angles to answer different questions.

The tutor, Kane, was excellent and I liked how he provided custom code samples to work through for each tutorial that were different to what was in the Canvas content.

Overall, I'm getting the hang of Python but sometimes it feels like it is a bunch of libraries written by different people stuck together with duct tape. Things appear to get deprecated without any backwards compatibility, meaning many of the examples I would find online would not work. Because there are so many custom libraries, there are frequently several different ways to do something, with no clear guidelines on what is best practice. Jupyter itself also seems to rely on plug-ins for anything more than base functionality. I missed the visibility of variables that I had in RStudio. Python and Jupyter are heavily used in the industry though, so once I am more used to them, I will likely appreciate them more.

In some ways I wished that I had taken this class first, but at the same time, I enjoyed being able to incorporate some of the concepts that I've learned in the previous three classes into my final project, such as considering whether missing data leads to bias, or incorporating the ideas of exploratory data analysis.
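One simple way to ask whether missing data might lead to bias is to compare how often a field is missing across groups. The sketch below uses invented records (not the Kaggle data) and hypothetical field names; it assumes a missing value is represented as `None`.

```python
from collections import defaultdict

# Hypothetical records: (release_decade, budget); None marks a missing budget.
records = [
    (1950, None), (1950, None), (1950, 2.0), (1950, None),
    (2000, 50.0), (2000, 80.0), (2000, None), (2000, 120.0),
]

# Count total and missing budgets for each decade.
counts = defaultdict(lambda: {"total": 0, "missing": 0})
for decade, budget in records:
    counts[decade]["total"] += 1
    if budget is None:
        counts[decade]["missing"] += 1

# Share of records with a missing budget, per decade.
missing_rate = {d: c["missing"] / c["total"] for d, c in counts.items()}
print(missing_rate)
```

If the missing rate differs a lot between groups, as it does in this made-up sample, simply dropping incomplete rows would over-represent the better-documented group in any later analysis.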

At this point I have now completed 4 classes, enough to earn the Graduate Certificate. Four more are required for the Graduate Diploma.

Woolley Family
