Written on February 28, 2020 by Gale Proulx
Category: Capstone
Over the past two weeks, this capstone has developed new ways of expressing information about the Clery Act data that were not originally planned. These take two main forms: changes to the initial data exploration visualizations so they convey more meaning, and a new statistic that expresses how confident the researchers are that a college is reporting the correct numbers. While these ideas sprang to life in a capstone check-in meeting, other parts of the project have been temporarily delayed. Chapters 3 and 4 will make more progress in the coming weeks now that a large part of the project has changed.
After I presented some mock exploration visualizations in the first demo, it became apparent that the dataset was too complex for the initial graphs. Part of this complexity comes from the nature of the data. For example, the analysis so far suggests that one third of institutions report zero crimes for at least one year. The total number of reporting institutions also fluctuates, and it is unclear whether institutions are closing or simply not reporting their numbers. A graph presented in the first demo showed the number of liquor crimes dropping across institutions. This dip could indicate that fewer students are drinking. It could also mean that more institutions are closing, so fewer reports are being filed, or that four-year postsecondary institutions are being more lax about the crimes they report. Looking at the visualization, a viewer could reasonably reach any of these conclusions, yet none of them is likely the sole reason for the drop in reported liquor crimes.
A large part of data visualization is dealing with the interpretations of the audience. While it can be very easy to make a scatter plot, it can be immeasurably hard to make a visualization that inherently conveys the message you intend. My first attempt to rein in this wild array of interpretations was to add the number of reporting institutions to the same scatterplot. Unfortunately, this does not explain why reporting on liquor crimes might be dropping. It is possible that some institutions decided not to receive federal funding in a given year and therefore did not have to report. In other words, this addition only increased the number of possible interpretations.
Rather than use a statistic that does not accurately reflect reporting, I will use the number of crimes reported per student. This new statistic is resistant to changes in the number of institutions because the calculation involves both liquor crime reports and students: if an institution is included in the report, so are its students. When the sample size falls, the numerator and denominator shrink together, so the statistic is not distorted in a disproportionate way.
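To make the normalization concrete, here is a minimal sketch in pandas. The column names (year, liquor_crimes, enrollment) and the file name are assumptions for illustration, not the actual schema of the Clery Act dataset:

```python
import pandas as pd

# Hypothetical Clery Act data: one row per institution per year.
# Column and file names here are assumptions, not the real schema.
df = pd.read_csv("clery_act_reports.csv")

# Crimes per student is resistant to institutions dropping out of the
# dataset: when an institution disappears, its crime reports leave the
# numerator at the same time its students leave the denominator.
yearly = df.groupby("year").agg(
    liquor_crimes=("liquor_crimes", "sum"),
    enrollment=("enrollment", "sum"),
)
yearly["crimes_per_student"] = yearly["liquor_crimes"] / yearly["enrollment"]
print(yearly["crimes_per_student"])
```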
Extending this logic, I was able to create a new statistic that gauges the confidence that an institution is reporting its numbers correctly. I call it a confidence score (or, more accurately, a lack-of-confidence score). By taking the difference between an institution's crimes per student and the national average, divided by the total number of reports that institution has filed over time, I produced a number that shows how far an institution's crime reporting is from the national average. As a quick prototype, I put a link up on our website with a searchable table. The higher the score, the further a college is from the national average for reporting crimes. While this logic isn't perfect, it does a good job of surfacing colleges with questionable practices. For example, Harrison College has the worst score; it made the news for closing without notifying its students, and it had an unusually small enrollment for a four-year institution. This statistic may not label every college correctly, but it points toward institutions that should be investigated further.
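A sketch of one plausible reading of that calculation follows, reusing the hypothetical columns from the earlier snippet and adding an assumed institution column. Here "number of reports over time" is taken per institution, so schools with few filings and a large deviation from the national average score highest, which matches how a small, short-lived school like Harrison College would rise to the top:

```python
import pandas as pd

# Hypothetical column names again; the real schema may differ.
df = pd.read_csv("clery_act_reports.csv")
df["crimes_per_student"] = df["liquor_crimes"] / df["enrollment"]

# National average crimes per student across all institution-year reports.
national_avg = df["crimes_per_student"].mean()

# Lack-of-confidence score: distance from the national average, divided
# by the number of reports the institution has filed over time.
scores = df.groupby("institution").agg(
    mean_cps=("crimes_per_student", "mean"),
    n_reports=("crimes_per_student", "count"),
)
scores["confidence_score"] = (scores["mean_cps"] - national_avg).abs() / scores["n_reports"]

# The highest scores flag institutions whose reporting deserves a closer look.
print(scores.sort_values("confidence_score", ascending=False).head(10))
```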