Written on December 6, 2019 by Gale Proulx
Category: Capstone
Before Thanksgiving break, I had the opportunity to present my initial prototype. Since this capstone project is different from a stereotypical Computer Science project, there was no program that showcased my process. My initial prototype reflected the Data Science process far more than a typical programming project. Unfortunately, I did not have any work that easily showcased all my efforts, so I had to make a presentation of all the different tasks I have set out to do.
The first step was to gather the data reported under the Clery Act. Downloading all the data from the Campus Safety and Security website was not hard, but merging the approximately 200 files into one has been a challenge. Additionally, formatting this data is proving to be hard. My presentation covered the challenges I was facing and the next steps I will be taking to format this data. (Turns out Excel still cannot handle 500MB of this particular dataset.)
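A minimal sketch of how a merge like this could be done without opening everything in Excel. It assumes the downloaded files have already been exported to CSV and that their columns may differ from file to file; the function name `merge_csv_files` and the `source_file` column are hypothetical, chosen only for illustration. Rows are streamed one at a time, so the merged file never has to fit in memory.

```python
import csv
from pathlib import Path


def merge_csv_files(paths, out_path):
    """Stream many CSV files into one, taking the union of their columns."""
    paths = [Path(p) for p in paths]

    # First pass: collect every column name, preserving first-seen order.
    # "source_file" is a hypothetical extra column recording each row's origin.
    columns = ["source_file"]
    for path in paths:
        with path.open(newline="", encoding="utf-8") as f:
            header = next(csv.reader(f), [])
        for name in header:
            if name not in columns:
                columns.append(name)

    # Second pass: copy rows one at a time so memory use stays small.
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(
            out, fieldnames=columns, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        for path in paths:
            with path.open(newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    row["source_file"] = path.name
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Columns that a given file lacks are written as empty cells, which keeps "this file never had the column" visible for the later cleaning step. The function returns the number of data rows written, which is a quick sanity check against the row counts of the inputs.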
The second step is to validate and clean the data. My current progress in this regard has been slow as I haven’t had the chance to view the dataset in one place. Additionally, I will have to make some decisions as to how I interpret missing data. By law, postsecondary institutions receiving federal aid must report these numbers, so naturally the large amount of missing data is alarming.
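Before deciding how to interpret missing data, it can help to measure how much of it there is. The sketch below is one possible way to do that, assuming a merged CSV file; the function name and the set of strings treated as "missing" are assumptions, and the real dataset may mark gaps differently.

```python
import csv
from collections import Counter

# Assumption: these strings mark a missing value; the real data may use others.
MISSING_MARKERS = {"", "na", "n/a", "null", "."}


def missing_share_by_column(csv_path, markers=MISSING_MARKERS):
    """Return {column: fraction of rows whose value counts as missing}."""
    missing = Counter()
    total = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        for row in reader:
            total += 1
            for col in columns:
                value = row.get(col)
                # A short row yields None for its trailing columns.
                if value is None or value.strip().lower() in markers:
                    missing[col] += 1
    if total == 0:
        return {col: 0.0 for col in columns}
    return {col: missing[col] / total for col in columns}
```

Note that a reported "0" is deliberately not treated as missing here: a zero and a blank cell are different statements, and keeping them apart is part of the interpretation decision described above.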
The final step requires some type of analysis. Unsure of how I want to proceed, I have been trying to brainstorm analysis methods from multiple perspectives. From a statistics background, I could fit a multiple linear regression model over a time series. By making a statistical model I could explore what variables correlate with sexual violence in this dataset. From a data science background, it would be possible to use different machine learning algorithms. The New York Times released an article called "How Y’all, Youse and You Guys Talk" that uses a quiz to categorize a person via the classification algorithm K-Nearest Neighbors. Alternatively, I could try to tune the hyperparameters of an appropriate algorithm (such as a random forest) to try to achieve the highest accuracy possible.
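To make one of these options concrete: K-Nearest Neighbors is a supervised method that labels a new point by majority vote among the k closest points whose labels are already known. The following is a minimal pure-Python sketch of that idea, not the quiz's actual implementation; the function name, the toy points, and the "low"/"high" labels are all made up for illustration. It relies on `math.dist`, which requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort the labeled examples by Euclidean distance to the query point.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Count the labels of the k closest examples and return the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up toy data: two numeric features per point, two hypothetical labels.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(points, labels, (1.1, 1.0)))  # prints "low"
print(knn_predict(points, labels, (5.1, 5.0)))  # prints "high"
```

In practice the features would need to be on comparable scales before distances mean much, and k itself is a hyperparameter that would be chosen by validation rather than fixed at 3.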
At the end of my prototype I showed some visualizations that I could potentially make when I have the full dataset. As mentioned before, my prototype was more an outline of the process I plan to go through than an example of what I want the final product to be.
One common misconception about Data Science concerns the amount of time spent actually making models and predictions. As a rule of thumb, data science projects are often said to consist of 80% data cleaning and 20% analysis. Since the first semester is almost over, the data cleaning part of this project should be around 50% complete. Most of the files that I wanted to gather are already merged, so the winter break between semesters should be an adequate amount of time to reformat and clean the data. Following cleaning I will validate the results, so that I am not analyzing the data without knowing the validity of the dataset. Next semester will be reserved for analysis and integration into our final product. Overall the implementation of the timeline is running smoothly.
There is one issue regarding the Institutional Review Board (IRB). Since I am working with multiple capstone professors in different divisions, there has been a small disagreement between professors. I believe part of this is due to my initial vision of what I thought this capstone would be. In the beginning I believed that we might interview participants regarding experiences with Title IX and sexual violence. Now that we are looking at a Call for Submissions framework, IRB interventions may not be necessary. I find it hard to balance ethics against productivity when the two conflict. From my research it has been clear that many people do not report crimes in large part because of a lack of faith in justice systems. Working with the IRB places a (reasonable) restraint on this project, but it also has the counterintuitive effect that some people may be deterred from sharing their experiences. Regardless, I do not have enough background to make an educated decision about the matter, and I am leaving this discussion to be mainly handled by my professors until we have a clearer consensus on what the restrictions might be regarding this dual-division capstone.
When I went into my senior year at Champlain College, I did not know that my capstone would revolve around sexual violence. I also did not know the ethical and logistical quandaries I would be dancing through to finish the project. While I am not entirely unsatisfied with my project, I do wish that I had more time to research and really build an impactful project. The topic is very important to me, and it is becoming clearer as the year moves on that there are real restraints stopping me from doing everything I want to do. Additionally, it is hard to get through senior year doing a capstone while taking many classes later in the curriculum than intended. I am hoping that next semester I will learn enough in my classes to be able to do a more in-depth analysis of the dataset that I have prepared.