Written on February 14, 2020 by Gale Proulx
Category: Capstone
The Data Analytics capstone is supposed to showcase the talents a student has developed over the past four years. While most of these talents are supposed to be data science related, my skillset has always been broad rather than narrow. The past month has shown where my skillset starts to deviate from Champlain College’s education. The biggest feat of the last month’s work has been the new website I helped engineer.
Website design was never included in my curriculum besides one introductory class. Through personal projects I have learned how to make a dynamic blog that is hosted for free, giving me complete control over the look of the website. I achieved this by using Jekyll, a static site generator that builds web pages automatically from Markdown files. All the blog posts for my portfolio website (the one you are on right now) also use this same system. To get free hosting, I used GitHub Pages, which integrates well with Jekyll. With a couple of CSS tricks and a couple of lines of JavaScript, I was able to build the first iteration of the website in about four hours (including the design process, with a little help from my friend/roommate/classmate Ian Dupont).
After the first iteration, I spent a lot of time with my capstone partner Rose Marshall tweaking the web pages’ grammar, including more information, adding tabs, adding a landing page, and cleaning up CSS styling. This collaborative process worked well once I had a foundation for my partner to critique. In addition, Rose worked with Ian to design our logo, which now sits on our landing page.
Aside from the website, I also completed other important parts of this capstone. Chapter 3 of my capstone summary now has a complete first draft, and I have started Chapter 4, which explains how the main program I have written combines files from the Clery Act data. That same main program now has unit tests so I can confirm that all parts of the program are working correctly. Both my validation program and the unit tests have confirmed that all data is being combined correctly. Along with unit testing came some code cleanup to make the program easier to read.
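As a rough illustration of what combining files and then checking the result can look like, here is a minimal sketch. It is not the capstone's actual program: the function names, the column names, and the idea of tagging each row with a year are all assumptions made for the example, and it uses only Python's standard library.

```python
import csv
import io
import unittest


def combine_csv_texts(sources):
    """Combine several CSV texts into one list of row dicts.

    `sources` maps a label (here, a hypothetical survey year) to CSV text.
    Each output row gets a "year" key so its origin stays traceable.
    """
    combined = []
    for year, text in sources.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["year"] = year
            combined.append(row)
    return combined


def validate_row_count(sources, combined):
    """Check that no rows were lost or duplicated while combining."""
    expected = sum(
        len(list(csv.DictReader(io.StringIO(text))))
        for text in sources.values()
    )
    return expected == len(combined)


class CombineTests(unittest.TestCase):
    """Hypothetical unit tests for the sketch above."""

    # Made-up rows for illustration only, not real Clery Act figures.
    SOURCES = {
        "2016": "institution,burglary\nA,3\nB,1\n",
        "2017": "institution,burglary\nA,2\n",
    }

    def test_every_row_is_kept(self):
        combined = combine_csv_texts(self.SOURCES)
        self.assertTrue(validate_row_count(self.SOURCES, combined))

    def test_rows_are_tagged_with_year(self):
        combined = combine_csv_texts(self.SOURCES)
        self.assertEqual(combined[-1]["year"], "2017")
```

Saved in a file, tests like these can be run with `python -m unittest`. Keeping the row-count check separate from the unit tests mirrors the idea of having both a validation step and tests, each catching a different kind of mistake.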
In my last blog post I also mentioned the desire to make a data duplication program, where the duplicated data would reflect research-backed predictions. The logic behind the program was to show the discrepancy between the reported data and the figures researchers believe should have been reported. After attempting to make this program, it became apparent that the difficulty would not be the coding portion, but instead finding the statistics that apply to each specific crime. This is why I have a program that is ready to create a duplicate dataset, but the statistics to feed into it will have to be found later this semester.
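One way such a duplication step might work is sketched below, assuming each crime category has an estimated reporting rate (the fraction of incidents that actually get reported). The function name, the category names, and especially the rate are placeholders invented for this example, not research findings and not the capstone's actual code.

```python
def estimate_adjusted_counts(reported, reporting_rates):
    """Return a copy of `reported` scaled up to estimated actual counts.

    reported: dict mapping crime category -> reported count.
    reporting_rates: dict mapping crime category -> assumed fraction of
        incidents that get reported (0 < rate <= 1). Categories without
        a rate are copied over unchanged.
    """
    adjusted = {}
    for crime, count in reported.items():
        rate = reporting_rates.get(crime)
        if rate is None:
            # No statistic found yet for this crime, so keep the raw count.
            adjusted[crime] = count
            continue
        if not 0 < rate <= 1:
            raise ValueError(f"reporting rate for {crime!r} must be in (0, 1]")
        adjusted[crime] = round(count / rate)
    return adjusted


# Placeholder numbers for illustration only, not real statistics.
reported = {"burglary": 10, "robbery": 4}
placeholder_rates = {"burglary": 0.5}
adjusted = estimate_adjusted_counts(reported, placeholder_rates)
```

Leaving categories without a rate untouched keeps the program usable while the statistics for each crime are still being collected.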
Keeping on schedule, as of this week I have produced several compelling visualizations from some brief data exploration. Using Jupyter Notebook and Matplotlib (and some Altair), I was able to visualize multiple years’ worth of statistics based on the sector of each institution. Each visualization is located on our website’s data visualization page. The only visualization hurdle I still have to work out in the next month is how to include multiple Altair visualizations on the same page with custom styling.