Written on January 16, 2020 by Gale Proulx
Category: Capstone
Over the past month I have had the opportunity to sit down and work on my capstone project uninterrupted. The goal as of last semester was to finish the data gathering and cleaning portion of this project and dive deep into validating my results. Two weeks into the break, I finally finished gathering all the data from approximately 200 files and organizing it into one file through Python code alone. My previous attempt with Microsoft Excel failed because I was trying to run too many formulas on one spreadsheet. This second attempt instead modified the code I had taken and concatenated the files in such a way that I could add the year dynamically. As a result, I now have one CSV file that contains all crimes from 2009 to 2017. On top of organizing, I was also able to simplify the 600 columns of information into approximately 80 columns. Many crimes, such as liquor violations, were classified not only by crime but also by location and type (on campus, off campus, public property, or hate crime). I made the executive decision to combine all these different classifications to make the data more human readable. Combining them did require one assumption: if a college did not report any statistics, it had no crimes to report. There is logic behind this decision. From each report, we pulled data only for the report's own year. (A 2009 report includes 2007 and 2008 statistics, but we only took 2009.) I assumed that a college receiving federal funding would have to report statistics for that year. That was not always the case; many colleges simply had nothing for the year of the report. I do not know why this data is missing, but I might be able to find out more by contacting the Clery Center, which could provide an explanation. Until I receive a response, I am stuck making this assumption.
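To make the combining step concrete, here is a minimal sketch of the idea in pandas, not the actual capstone code. The column names (INSTITUTION, LIQUOR_ON_CAMPUS, LIQUOR_PUBLIC_PROPERTY), the tiny example tables, and both helper functions are hypothetical stand-ins. The sketch tags each yearly table with its year, stacks the tables, and then sums the location-specific columns into one column, treating missing values as zero (the same assumption described above).

```python
import pandas as pd


def combine_reports(frames_by_year):
    """Stack per-year report tables into one table with a YEAR column."""
    tagged = []
    for year, frame in sorted(frames_by_year.items()):
        # assign() returns a copy, so the caller's frames are left untouched.
        tagged.append(frame.assign(YEAR=year))
    return pd.concat(tagged, ignore_index=True, sort=False)


def collapse_columns(df, groups):
    """Sum each group of location/type columns into one readable column."""
    to_drop = [col for cols in groups.values() for col in cols]
    out = df.drop(columns=to_drop)
    for new_name, cols in groups.items():
        # Assumption from the post: a missing value means no crimes reported.
        out[new_name] = df[cols].fillna(0).sum(axis=1).astype(int)
    return out


# Hypothetical data: college B reported nothing in its 2009 report.
reports = {
    2009: pd.DataFrame({
        "INSTITUTION": ["A", "B"],
        "LIQUOR_ON_CAMPUS": [3, None],
        "LIQUOR_PUBLIC_PROPERTY": [1, None],
    }),
    2010: pd.DataFrame({
        "INSTITUTION": ["A", "B"],
        "LIQUOR_ON_CAMPUS": [2, 5],
        "LIQUOR_PUBLIC_PROPERTY": [0, 1],
    }),
}

master = collapse_columns(
    combine_reports(reports),
    {"LIQUOR": ["LIQUOR_ON_CAMPUS", "LIQUOR_PUBLIC_PROPERTY"]},
)
```

One consequence of filling with zero is that a college that skipped a year becomes indistinguishable from a college that truly had no incidents, which is why the assumption is worth revisiting once the missing data is explained.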
When I finally combined all the columns, I was able to look at some preliminary results, which were disappointing to say the least. It appeared that the amount of information provided by the Clery Act was much smaller than it initially looked. Some columns were added within the past couple of years and did not have a single reported incident. Other columns had only a few incidents reported over this nine-year time span. Many schools with over ten thousand students reported far too few crimes to be plausible. Because of this disappointing finding, I decided to add a new step to this project. Since I already have research on sexual violence and the probability that it occurs on campus, I could generate data based on those probabilities in tandem with the student population. By creating this “more realistic” dataset, I could make some compelling visuals comparing what colleges reported to what researchers think is going on. To add this step to the capstone project, I needed to do more work over the month-long break.
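One possible shape for that generator, sketched under loud assumptions: every student is treated as an independent trial with the same yearly probability, so each campus count is a binomial draw. The function name, the campus names, the enrollments, and especially the rate are placeholders invented for illustration. The real probabilities would have to come from the research, and the real program may be designed differently.

```python
import random


def simulate_incidents(enrollment, annual_probability, rng):
    """Draw one simulated yearly incident count for a campus.

    Each student is one independent trial, so the count is binomial.
    """
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    return sum(rng.random() < annual_probability for _ in range(enrollment))


rng = random.Random(2020)  # fixed seed so the generated data is reproducible
PLACEHOLDER_RATE = 0.01    # made-up value for illustration, NOT a research estimate
campuses = {"Campus A": 2000, "Campus B": 12000}  # hypothetical enrollments

simulated = {
    name: simulate_incidents(size, PLACEHOLDER_RATE, rng)
    for name, size in campuses.items()
}
```

Seeding the generator matters here because the simulated dataset will be compared against the reported one, and the comparison should not change every time the program runs.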
Over the second half of break I built and finished the validation program. The validation program takes a lot of code from the previous program, but rather than concatenating dataframes, it compares every cell of the original reports to the corresponding cell of my newly generated master report. Any inconsistencies were recorded in a CSV file for further review. After running my program, I found only a few hundred errors, all of which turned out to be improper string comparisons between null values rather than real differences in the data. It seems that my original programs were successful in concatenating dataframes.
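The null-value false positives have a simple cause in pandas: a missing value never compares equal to anything, including another missing value. Below is a small sketch of a cell-by-cell check that treats two missing cells as a match. It is not the actual validation program; the function name, column names, and example tables are hypothetical, and it assumes both tables share the same key and value columns.

```python
import pandas as pd


def find_mismatches(original, master, key_cols, value_cols):
    """Return one row per cell where the master disagrees with the original."""
    merged = original.merge(
        master, on=key_cols, how="left", suffixes=("_orig", "_master")
    )
    records = []
    for col in value_cols:
        a = merged[f"{col}_orig"]
        b = merged[f"{col}_master"]
        # NaN != NaN evaluates to True, so two missing cells are excluded explicitly.
        differs = (a != b) & ~(a.isna() & b.isna())
        for idx in merged.index[differs]:
            row = {key: merged.at[idx, key] for key in key_cols}
            row.update({"column": col, "original": a.loc[idx], "master": b.loc[idx]})
            records.append(row)
    return pd.DataFrame(records, columns=[*key_cols, "column", "original", "master"])


# Hypothetical tables: college B is missing in both, college C truly differs.
original = pd.DataFrame({
    "INSTITUTION": ["A", "B", "C"],
    "YEAR": [2009, 2009, 2009],
    "LIQUOR": [4.0, None, 2.0],
})
master = pd.DataFrame({
    "INSTITUTION": ["A", "B", "C"],
    "YEAR": [2009, 2009, 2009],
    "LIQUOR": [4.0, None, 3.0],
})

mismatches = find_mismatches(original, master, ["INSTITUTION", "YEAR"], ["LIQUOR"])
# mismatches.to_csv("validation_errors.csv", index=False)  # hypothetical file name
```

Without the `isna()` guard, college B would be flagged even though both tables agree that the value is missing, which is the same kind of spurious error described above.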
While I was building my validation program, I also received copy edits from my capstone partner. Since I had time over the month-long break, I went through my research paper and made changes based on the final edits from last semester as well as my partner’s edits. I then edited Chapter 2 to change some outdated assumptions. (Mainly I wanted to change my context research explanation to not include interviews, since we are not conducting any interviews.) Since the Professional Writing capstone advisor has given the go-ahead to run calls for submissions without IRB approval, I will be using the calls for submissions simply as a source for my research, not conducting any research directly with participants. This will make any research regarding Champlain College retroactive, not proactive, so no human subjects will be involved. Students will simply be sharing their own stories just as they would with any student publication. The professional writing side of this capstone will only affect the final presentation of the information, which will be in the form of a website where I present information alongside Champlain College students’ stories.
Now that I am confident that my data is accurate, I can start the current semester (this week) by building the data probability generator program. Within the next week I will have another dataset to compare results against, and within the next month I can start on visualizations, linear regression, time series analysis, and much more. The Professional Writing Capstone has also begun, so I can attend those classes as well for help with the call for submissions and website building. So far, everything is going as planned.