Tuesday, April 28, 2015

Regression Analysis

Part 1

     The regression equation shows that the news station was correct in saying that there is a positive correlation between crime rates and free lunches. Although the positive correlation between the two variables is significant, there are probably more important variables that influence crime rate than the free lunches given out in schools. The regression equation is Y = 21.819 + 1.685(X). Taking Y as the crime rate and X as the percentage of persons receiving a free lunch, a crime rate of 79.7 corresponds to X = (79.7 - 21.819) / 1.685, or about 34.35% of persons receiving a free lunch.
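
The arithmetic can be checked with a short sketch. It assumes that Y is the crime rate and X is the percentage receiving a free lunch, which the equation itself does not spell out; the function names are made up for illustration.

```python
# Coefficients taken from the regression equation Y = 21.819 + 1.685(X).
INTERCEPT = 21.819
SLOPE = 1.685


def predict_crime_rate(pct_free_lunch):
    """Forward use of the equation: X (percent free lunch) -> Y (crime rate)."""
    return INTERCEPT + SLOPE * pct_free_lunch


def pct_free_lunch_for(crime_rate):
    """Inverse use of the equation: solve Y = a + bX for X."""
    return (crime_rate - INTERCEPT) / SLOPE


pct = pct_free_lunch_for(79.7)
print(f"Percent free lunch at a crime rate of 79.7: {pct:.2f}")
```

Running the result back through `predict_crime_rate` returns the original crime rate, which is a quick way to confirm the equation was inverted correctly.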


Part 2

Introduction:

     This assignment gave the student experience running regression analysis on a real-life situation. This particular example relates to the University of Wisconsin school system. The system wants to know what factors influence a student's decision to attend a particular university. Since the possible reasons for attending a particular school are nearly endless, three factors were chosen for the student to use. The student runs regression analysis for the University of Wisconsin Eau Claire and one other school of the student's choosing, in this case the University of Wisconsin La Crosse. 

Methods:

     The data used in this lab was provided by the professor. The three variables used in this analysis were median household income, percent of people with a bachelor's degree and population distance. The third variable, population distance, was created by normalizing two fields within the data set: the total population of each county was divided by its distance from the university. The main reason this was done was to keep large population centers from skewing the regression analysis. 
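
The normalization can be sketched as follows. The county names, populations and distances are made-up placeholders, not values from the lab data, and the unit of distance is an assumption.

```python
# Hypothetical county records; the real lab data is not reproduced here.
counties = [
    {"name": "County A", "population": 100000, "distance": 50.0},
    {"name": "County B", "population": 20000, "distance": 10.0},
    {"name": "County C", "population": 500000, "distance": 250.0},
]


def population_distance(population, distance):
    """Total county population divided by distance from the university."""
    if distance <= 0:
        # A zero distance (for example the home county) needs separate
        # handling; how the lab handled that case is not stated.
        raise ValueError("distance must be positive")
    return population / distance


for county in counties:
    county["pop_dist"] = population_distance(county["population"], county["distance"])
    print(county["name"], county["pop_dist"])
```

In this made-up example County C has five times the population of County A but the same population distance value, which is the dampening effect the normalization is meant to provide.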

     The first data sets were in Excel files. The first step was to take the data needed from the original file and copy it into a new Excel spreadsheet. Once this sheet was created, it was brought into SPSS to run regression analysis. After the regression analysis was completed on all variables for both schools, the significant variables needed to be identified. There were two significant variables for Eau Claire and three significant variables for La Crosse. Once these variables were identified, regression was run on them again. This second time, the residuals were saved in the same SPSS worksheet. The saved residuals allow a spatial component to be connected to the raw numbers. The numbers allow the viewer to see what is significant, but the spatial connection allows the viewer to see which areas are most significant. 
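
The idea of saving residuals can be sketched outside SPSS. The example below is a plain Python least-squares fit with one predictor, not the SPSS procedure used in the lab, and the numbers are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed value minus predicted value for every observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data: a predictor value and an enrollment count per county.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
enrollment = [12.0, 19.0, 33.0, 38.0, 53.0]

a, b = fit_line(predictor, enrollment)
res = residuals(predictor, enrollment, a, b)
print(a, b, res)
```

A positive residual means a county sends more students than the regression predicts and a negative residual means it sends fewer, which is what the residual maps display.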

     Once all the residuals were collected for each variable, the table was exported as a dBase table and brought into ArcMap. There it was joined with a shapefile of all the counties in the state of Wisconsin. This join allows for a spatial representation of all the residuals.
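
Conceptually, the join matches each residual row to a county feature through a shared key. The sketch below imitates that with plain Python dictionaries; it is not the ArcMap tool, and the field names and values are hypothetical.

```python
# Hypothetical residual table, as it might look after export.
residual_rows = [
    {"county": "Eau Claire", "residual": 412.5},
    {"county": "Dane", "residual": 130.2},
]

# Hypothetical county features; geometry is a placeholder string.
county_features = [
    {"county": "Dane", "geometry": "<polygon>"},
    {"county": "Eau Claire", "geometry": "<polygon>"},
    {"county": "Vilas", "geometry": "<polygon>"},
]


def join_by_key(features, rows, key):
    """Attach the fields of each matching row to a copy of each feature."""
    lookup = {row[key]: row for row in rows}
    joined = []
    for feature in features:
        merged = dict(feature)
        match = lookup.get(feature[key], {})
        merged.update({k: v for k, v in match.items() if k != key})
        joined.append(merged)
    return joined


joined = join_by_key(county_features, residual_rows, "county")
for record in joined:
    print(record["county"], record.get("residual"))
```

Counties without a matching row simply keep no residual value, so a mismatch in the key field shows up as a gap on the map.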

Results:

     The table below (Table 1) shows the results of the regression analysis with the variable population distance. The table shows that this variable is significant. Although the table shows that the variable is significant, it does not show where most of the students are coming from. 

Table 1 shows the population distance regression analysis for students attending the University of Wisconsin Eau Claire.


The figure below (Figure 1) shows where most students are coming from to attend Eau Claire. 



Figure 1 shows the residual of population distance for students attending the University of Wisconsin Eau Claire.






As the map shows, many of the students attending the University of Wisconsin Eau Claire come from counties that typically have higher populations. Areas that are red or yellow send more students than expected, while areas in light blue and dark blue send fewer students to the university than expected. 











The second significant variable for Eau Claire was the percentage of people with a bachelor's degree. 

Table 2 shows the percent of people with a bachelor's degree regression analysis for students attending the University of Wisconsin Eau Claire.

Figure 2 shows the residual of percent of people with a bachelor's degree for students attending the University of Wisconsin Eau Claire.



The table shown above (Table 2) shows the significance of the relationship between the percentage of people with a bachelor's degree and the enrollment of students at Eau Claire. The numbers show that there is a very steep slope and that this is a significant factor, but it is hard to determine much more than that. The figure to the left (Figure 2) shows where most students are coming from. As the map shows, areas in red or gold have a high percentage of people with a bachelor's degree, and more students are attending the university from those areas. This pattern does not appear to be tied to population. 







The next three tables shown below (Table 3, Table 4 and Table 5) show the results of the significant variables for the University of Wisconsin La Crosse. 

Table 3 shows the population distance regression analysis for students attending the University of Wisconsin La Crosse.

Table 4 shows the percent of people with a bachelor's degree regression analysis for students attending the University of Wisconsin La Crosse.

Table 5 shows the median household income regression analysis for students attending the University of Wisconsin La Crosse.

As the three tables above show, all the variables were highly significant in the regression analysis. The figures shown below (Figure 3, Figure 4 and Figure 5) show the spatial context of these different variables. 

Figure 3 shows the residual of median household income for students attending the University of Wisconsin La Crosse.

Figure 4 shows the residual of percent of people with a bachelor's degree for students attending the University of Wisconsin La Crosse.























The first map above (Figure 3) shows the median household income. As this map shows, median household income by county only seems to have a great effect on La Crosse County, with slight effects in Dane, Milwaukee and Waukesha counties. The second map above (Figure 4) shows the percent of people with bachelor's degrees and the effect it has on enrollment. Once again it seems to only have a great effect on La Crosse County, and a slight positive effect in counties that have higher populations. The third map shown below (Figure 5) shows the results of the variable population distance. As the map shows, more students come from Dane and Waukesha counties than would be expected. Most of the northern part of the state is well below the expected enrollment. 

Figure 5 shows the residual of population distance for students attending the University of Wisconsin La Crosse.

Conclusions:

This lab teaches the student how useful regression analysis can be. The numbers themselves are useful, but incorporating the spatial component gives the viewer a better understanding of the data. Mapping the data allows trends to be noticed easily that might be passed over when looking only at the raw numbers of the regression. 

Thursday, April 9, 2015

Spatial Autocorrelation

Part 1

Figure 1 shows a Pearson correlation matrix for the
selected variables of sound level and distance. 

The figure above (Figure 1) shows the results of Pearson correlation between sound levels and distance. There is a negative correlation between the two variables, meaning that as one increases the other tends to decrease. Although the correlation is negative, it is still considered to be high, meaning there is a strong association between the variables. 
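
A Pearson coefficient can be computed by hand to see how sign and strength are separate things. This is a minimal sketch with invented sound and distance readings, not the values behind Figure 1.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical readings: sound level falls as distance grows.
distance = [5, 10, 20, 40, 80]
sound_level = [95, 88, 80, 71, 60]

r = pearson_r(distance, sound_level)
print(f"r = {r:.3f}")
```

The sign of `r` gives the direction of the relationship and its absolute value gives the strength, so a coefficient close to -1 is both negative and strong.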

Part 2

Figure 2 shows a correlation matrix of the data provided by the instructor. A variety of variables related to lifestyle, educational attainment and race or ethnicity were tested. 

Many of the results in the correlation matrix shown above (Figure 2) had a negative correlation. A negative correlation means that as one variable increases the other tends to decrease, although the shape of the relationship cannot be seen without a scatterplot of the data. Living below the poverty line can be compared with both having a bachelor’s degree and not having a high school diploma. Having no high school diploma and living below the poverty line have a positive correlation, while living below the poverty line and having a bachelor’s degree have a negative correlation. After looking at this data, it appears that educational attainment has a strong correlation with lifestyle. The correlation is positive in one case and negative in the other, but it is strong in both cases.

Part 3

Introduction:

This assignment gives the student a background in spatial autocorrelation. The analysis involves using spatial autocorrelation on presidential elections for the state of Texas. The data specifically is the percent Democratic vote for the 1980 and 2008 presidential elections as well as the voter turnout for each election. In addition to the provided data, the percent Hispanic population data also needed to be downloaded. The assignment asks the student to analyze the results to see if there is clustering or similar voting patterns for the particular variables across the state of Texas.

Methods:

In order to start analysis using spatial autocorrelation, the 2010 Hispanic population percentage shapefile needed to be downloaded from the U.S. Census website. Once the data was downloaded, the student needed to navigate the metadata to find which field within the shapefile had the necessary information. After the data was identified, it was copied into an Excel file with the four provided datasets. Once all the data was combined, it was joined with a shapefile for the state of Texas. 

Once all the tables were joined, the data was ready for analysis. The software used for this analysis is GeoDa. GeoDa only works with shapefiles, which is the reason all the data had to be converted to that particular format. When opening the new file in GeoDa, a spatial weights file needed to be created. When selecting how to weight the file, rook contiguity was the specified choice. These weights allow the student to both calculate Moran's I and create LISA cluster maps.
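
Rook contiguity treats two areas as neighbors only when they share an edge, not just a corner. The sketch below illustrates the idea on a regular grid of cells, which is an assumption made for simplicity; real counties are irregular polygons and GeoDa builds the weights from their shared borders.

```python
def rook_neighbors(n_rows, n_cols):
    """Map each grid cell id to the ids of cells sharing an edge with it."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            cell = r * n_cols + c
            adjacent = []
            if r > 0:
                adjacent.append(cell - n_cols)  # cell above
            if r < n_rows - 1:
                adjacent.append(cell + n_cols)  # cell below
            if c > 0:
                adjacent.append(cell - 1)       # cell to the left
            if c < n_cols - 1:
                adjacent.append(cell + 1)       # cell to the right
            neighbors[cell] = adjacent
    return neighbors


grid = rook_neighbors(3, 3)
print(grid[4])  # the center cell of a 3 x 3 grid
print(grid[0])  # a corner cell
```

The center cell has four rook neighbors while a corner cell has two; diagonal cells are never included, which is what separates rook from queen contiguity.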

Moran's I and LISA maps were used for different purposes. The Moran's I scatterplot gives a visual representation of where all the data fall on a graph. It shows where there may be clusters of votes and how strong the correlation may be. The LISA cluster maps add the geographical aspect to the data. They show where in the state of Texas these particular voting patterns were taking place. Either method on its own provides valuable information, but used together they allow the user to see exactly what is happening in the state. 
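
The global Moran's I statistic can be sketched in a few lines. This version uses row-standardized weights built from a neighbor list and four made-up areas arranged in a line; it is an illustration of the formula, not GeoDa's implementation, and it leaves out the significance testing GeoDa performs.

```python
def morans_i(values, neighbors):
    """Global Moran's I with row-standardized weights.

    values: list of observations, indexed by area id.
    neighbors: dict mapping each area id to a list of neighboring ids.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]

    cross_product = 0.0  # sum of w_ij * z_i * z_j
    weight_sum = 0.0     # sum of all weights (S0)
    for i, adjacent in neighbors.items():
        if not adjacent:
            continue  # an area with no neighbors contributes nothing
        w = 1.0 / len(adjacent)
        for j in adjacent:
            cross_product += w * z[i] * z[j]
            weight_sum += w

    variance_sum = sum(d * d for d in z)
    return (n / weight_sum) * (cross_product / variance_sum)


# Four hypothetical areas in a row: 0 - 1 - 2 - 3.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = morans_i([1, 1, 10, 10], line)    # similar values side by side
alternating = morans_i([1, 10, 1, 10], line)  # unlike values side by side
print(clustered, alternating)
```

Positive values indicate that similar values sit next to each other, negative values indicate that neighbors tend to be unlike, and values near zero suggest no spatial pattern.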

Results:

The results of this testing provided some insight into Democratic voting patterns in the state of Texas. The two figures below (Figure 3 and Figure 4) show the percent Democratic vote for the 2008 and 1980 presidential elections. 


Figure 3 shows the Moran's I value for the 2008 percent Democratic vote for president.

Figure 4 shows the Moran's I value for the 1980 percent Democratic vote for president.

The graph on the left shows much more sign of clustering within the data. A high percentage of the points are right around the center of the graph with only a few outliers. 
The graph on the right is much more spread out than the one on the left. The points are not concentrated around the center of the graph. There are not any obvious outliers due to the spread-out nature of the entire dataset. The Moran's I value at the top of each graph also shows that the left graph should have more clustering, due to the stronger correlation. 


The next two graphs shown below (Figure 5 and Figure 6) show the voter turnout for the 2008 and 1980 presidential elections. 

Figure 5 shows the Moran's I value for the percent voter turnout in the 2008 presidential election.

Figure 6 shows the Moran's I value for the percent voter turnout in the 1980 presidential election.

The graph on the left appears to have more clustering than the graph on the right, but it has a weaker correlation. Although more points appear to cluster around the center of the graph, they do not cluster along the line. The tighter clustering around the line in the graph on the right shows the stronger correlation. Its Moran's I value is also higher than that of the graph on the left, which further supports this finding.

Figure 7 shows the Moran's I value for 2010 Hispanic population percentage.




The graph to the left (Figure 7) shows the Moran's I chart of Hispanic population percentage in the year 2010. Although the data does not appear to cluster around the center of the graph as much as the previous four figures, it has the highest Moran's I value. This value also tells us that there is a high degree of clustering around the fitted line. 











The next maps are LISA cluster maps created in GeoDa. They show the same information as Figures 3, 4, 5, 6 and 7, but represented geographically. The two figures below (Figure 8 and Figure 9) show the percent Democratic vote for the 2008 and 1980 presidential elections. 

Figure 8 shows the percent of Democratic vote for
the 2008 presidential election.



The figure to the right shows the 2008 percent Democratic vote for president. There appears to be a high percentage of Democratic voting in the southern portion of the state. The low percentages of Democratic voting are found in the central and northern sections of the state. 



Figure 9 shows the percent of Democratic vote for
the 1980 presidential election.


The figure to the left shows the 1980 percent Democratic vote for president. In comparison to Figure 8, there is still a high percentage of Democratic vote in the southern portion of the state, as well as a large percentage in the eastern part of the state. The lower percentages of Democratic vote stayed in relatively the same places within the central and northern sections of the state. 






Figure 10 shows the percent of voter turnout for the 2008 presidential election.

Figure 11 shows the percent of voter turnout for the 1980 presidential election.

The figure to the left (Figure 10) shows the percent of voter turnout in the 2008 presidential election. Other than in the southern portion of the state, there does not appear to be a strong connection between voter turnout and percent Democratic vote. The figure to the right (Figure 11) shows the percent of voter turnout in the 1980 presidential election. This map also shows that in the southern portion of the state there is a connection between voter turnout and a greater percentage of Democratic vote. 

Figure 12 shows the 2010 Hispanic population percentage.


The figure to the right (Figure 12) shows the percentage of Hispanic population in the year 2010. As the map shows, there is a high Hispanic population percentage in the south and west parts of the state, while there are lower percentages in the east-central parts of the state. When comparing this map to Figure 8, it appears that there is a correlation between Hispanic population and percent Democratic vote. 









Conclusions:

Although the Hispanic population data was not part of the provided election data, it does appear that there is a connection between Hispanic percent population and the Democratic presidential vote. Voting patterns have stayed fairly consistent between the 1980 and 2008 elections. Democratic voting has started to shift westward, but has also decreased in other areas of the state. There does not appear to be a great change in voter turnout across the state. Overall there does not appear to be a great change in the different variables used in this assignment. 


Sources:

U.S. Census
Professor Weichelt