Tuesday, April 28, 2015

Regression Analysis

Part 1

     With this regression equation, the news station was correct in saying that there is a positive correlation between crime rates and free lunches. Although the positive correlation between the two variables is significant, there are probably more important variables that influence crime rate than free lunches given out in schools. The equation used to find the percentage of free lunches given out is Y = 21.819 + 1.685(X). If there is a crime rate of 79.7, the percentage of persons that will receive a free lunch will be 2.16%.
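As a quick check on the arithmetic, the equation can be evaluated in both directions. This is only a sketch: the write-up does not state explicitly which variable is X and which is Y, so both readings are shown as assumptions, and the function names are illustrative.

```python
# Coefficients taken from the regression equation above: Y = 21.819 + 1.685(X)
INTERCEPT = 21.819
SLOPE = 1.685


def predict_y(x):
    """Evaluate Y = a + bX for a given X."""
    return INTERCEPT + SLOPE * x


def solve_for_x(y):
    """Invert the equation to get X = (Y - a) / b."""
    return (y - INTERCEPT) / SLOPE


# Reading 1 (assumption): crime rate is Y and percent free lunch is X.
print(round(solve_for_x(79.7), 2))
# Reading 2 (assumption): percent free lunch is Y and crime rate is X.
print(round(predict_y(79.7), 2))
```

Under either reading the result differs from the 2.16% quoted above, so that figure may be worth rechecking against the original SPSS output.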


Part 2

Introduction:

     This assignment gave the student experience running regression analysis using a real-life situation. This particular example relates to the University of Wisconsin school system. The system wants to know what factors influence a student's decision to attend a particular university. Since the number of factors that could influence that decision is nearly endless, three factors have been chosen for the student to use. The student will run regression analysis on the University of Wisconsin Eau Claire and one other school of their choosing, in this case, the University of Wisconsin La Crosse.

Methods:

     The data used in this lab was provided by the professor. The three variables used in this analysis were median household income, percent of people with a bachelor's degree and population distance. The third variable, population distance, was created by normalizing two fields within the data set: the total population of each county was divided by its distance from the university. The main reason this was done was to keep large population centers from skewing the regression analysis.

     The first data sets were in Excel files. The first step was to take the data needed from the original file and copy it into a new Excel spreadsheet. Once this sheet was created, it was brought into SPSS to run regression analysis. Once the regression analysis was completed on all variables for both schools, significant variables needed to be identified. There were two significant variables for Eau Claire and three significant variables for La Crosse. Once these variables were identified, regression was run on them again. This second time, the residuals were saved in the same SPSS worksheet. The saved residuals allow a spatial component to be connected to the raw numbers. The numbers allow the viewer to see what is significant, but the spatial connection allows the viewer to see which areas are most significant.
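The regression and residual steps described above were done in SPSS. The sketch below shows the same idea in plain Python for a single predictor. The county names and numbers are hypothetical, and `fit_line` and `residuals` are illustrative helpers rather than anything taken from the lab.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed value minus predicted value for each observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical counties: (total population, distance to campus in miles, students enrolled).
counties = {
    "County A": (100_000, 10.0, 900),
    "County B": (50_000, 50.0, 150),
    "County C": (500_000, 200.0, 400),
    "County D": (20_000, 80.0, 20),
}

# Population distance as described in the methods: population divided by distance.
pop_dist = [pop / dist for pop, dist, _ in counties.values()]
enrolled = [students for _, _, students in counties.values()]

intercept, slope = fit_line(pop_dist, enrolled)
for name, res in zip(counties, residuals(pop_dist, enrolled, intercept, slope)):
    # A positive residual means more students than the model expects.
    print(f"{name}: residual {res:+.1f}")
```

Joining a residual column like this to a county shapefile is what lets the maps show where enrollment is above or below what the model predicts.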

     Once all the residuals were collected for each variable, the table was then exported as a dBase table. Once the table was exported, it was then brought into ArcMap. After being brought into ArcMap, it was then joined with a shapefile of all the counties in the state of Wisconsin. This join allows for a spatial representation of all the residuals.

Results:

     The results of this lab were very interesting. The table below (Table 1) shows the results of the regression analysis with the variable population distance. The table shows that this variable is significant. Although you can see it is significant, it is hard to see where the most students are coming from. 

Table 1 shows the population distance regression analysis for students attending the University of Wisconsin Eau Claire.


The figure below (Figure 1) shows where most students are coming from to attend Eau Claire. 



Figure 1 shows the residual of population distance for students attending the University of Wisconsin Eau Claire.






As the map shows, many of the students attending the University of Wisconsin Eau Claire come from counties that typically have higher populations. Areas that are red or yellow show a higher number of students than expected, while light blue and dark blue show which areas are sending fewer students to the university than expected.











The second variable that was significant for Eau Claire was the percentage of people with a bachelor's degree.

Table 2 shows the percent of people with a bachelor's degree regression analysis for students attending the University of Wisconsin Eau Claire.

Figure 2 shows the residual of percent of people with a bachelor's degree for students attending the University of Wisconsin Eau Claire.



The table shown above (Table 2) shows the significance of the relationship between the percentage of people with a bachelor's degree and the enrollment of students at Eau Claire. The numbers show that there is a very steep slope and that this is a significant factor, but it is hard to determine much more than that. The figure to the left (Figure 2) shows where most students are coming from. As the map shows, areas in red or gold have a high percentage of people with a bachelor's degree, and higher numbers of students are attending the university from those areas. It does not appear to have a tie with population.







The next three tables shown below (Table 3, Table 4 and Table 5) show the results of the significant variables for the University of Wisconsin La Crosse. 

Table 3 shows the population distance regression analysis for students attending the University of Wisconsin La Crosse.

Table 4 shows the percent of people with a bachelor's degree regression analysis for students attending the University of Wisconsin La Crosse.

Table 5 shows the median household income regression analysis for students attending the University of Wisconsin La Crosse.

As the three tables above show, all the variables were extremely significant in the regression analysis. The figures shown below (Figure 3, Figure 4 and Figure 5) show the spatial context of these different variables.

Figure 3 shows the residual of median household income for students attending the University of Wisconsin La Crosse.

Figure 4 shows the residual of percent of people with a bachelor's degree for students attending the University of Wisconsin La Crosse.























The map above (Figure 3) shows the median household income. As this map shows, median household income by county only seems to have a great effect on La Crosse County, with slight effects in Dane, Milwaukee and Waukesha counties. The map above (Figure 4) shows the percent of people with bachelor's degrees and the effect it has on enrollment. Once again it seems to only have a great effect on La Crosse County, and a slight positive effect in counties that have higher populations. The third map shown below (Figure 5) shows the results of the variable population distance. As the map shows, more students come from Dane and Waukesha counties than would be expected. Most of the northern part of the state is well below the expected enrollment.
Figure 5 shows the residual of population distance for students attending the University of Wisconsin La Crosse.

Conclusions:

This lab teaches the student how useful regression analysis can be. The numbers themselves are useful, but incorporating the spatial component gives the viewer a better understanding of the data. Mapping the data makes it easy to notice trends that could be passed over when looking only at the raw numbers of the regression.

Thursday, April 9, 2015

Spatial Autocorrelation

Part 1

Figure 1 shows a Pearson correlation matrix for the selected variables of sound level and distance.

The figure above (Figure 1) shows the results of Pearson correlation between sound levels and distance. There is a negative correlation between the two variables: as distance increases, sound level decreases. Although the correlation is negative, its magnitude is still considered to be high, meaning there is a strong association between the variables.
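To make the idea of a strong negative correlation concrete, here is a minimal sketch of the Pearson coefficient in plain Python. The distance and sound readings are hypothetical stand-ins, not the values behind Figure 1, and `pearson_r` is an illustrative helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical readings: distance from the source (ft) and sound level (dB).
distance = [10, 20, 30, 40, 50]
sound = [95, 88, 80, 76, 70]

r = pearson_r(distance, sound)
# The sign gives the direction, the magnitude gives the strength.
print(f"r = {r:.3f}")
```

A value close to -1 indicates that sound level falls almost linearly as distance grows, which is the pattern described above.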

Part 2

Figure 2 shows a correlation matrix of the data provided by the instructor. A variety of variables related to lifestyle, educational attainment and race or ethnicity were tested.

The results of the correlation matrix shown above (Figure 2) were very interesting. Many of the variable pairs had a negative correlation, meaning that one variable tends to decrease as the other increases. The coefficient alone does not show which variable drives the other; a scatterplot of the data would be needed to examine the relationship more closely. When comparing living below the poverty line to having a bachelor's degree and to not having a high school diploma, the results are similar in strength but opposite in direction. Having no high school diploma and living below the poverty line have a positive correlation, while living below the poverty line and having a bachelor's degree have a negative correlation. After looking at this data, it appears that educational attainment has a strong correlation with lifestyle. The correlation is positive in one case and negative in the other, but it is strong in both.

Part 3
Introduction:

This assignment gives the student a background in spatial autocorrelation. The analysis involves using spatial autocorrelation on presidential elections for the state of Texas. The data specifically is the percent Democratic vote for the 1980 and 2008 presidential elections as well as the voter turnout for each election. In addition to the provided data, the percent Hispanic population data also needs to be downloaded. The assignment asks the student to analyze the results to see if there is clustering or similar voting patterns for the particular variables across the state of Texas.

Methods:

In order to start analysis using spatial autocorrelation, the 2010 Hispanic population percentage shapefile needed to be downloaded from the U.S. Census website. Once the data was downloaded, the student needed to navigate the metadata to find which field within the shapefile had the necessary information. After the data was identified, it was copied into an Excel file with the four provided datasets. Once all the data was combined, it was joined with a shapefile for the state of Texas.

Once all the tables were joined, the data was ready for analysis. The software used for this analysis is GeoDa. GeoDa only works with shapefiles, which is the reason all the data had to be converted to that particular format. When opening the new file in GeoDa, a spatial weights file needed to be created. When selecting how to weight the file, rook contiguity was the specified choice. This weight allows the student to both calculate Moran's I and create LISA cluster maps.

Moran's I and LISA maps were used for different purposes. The Moran's I scatterplot gives a visual representation of where all the data fall on a graph. It shows where there may be clusters of votes and how strong the spatial correlation may be. The LISA cluster maps add the geographical aspect to the data. They show where in the state of Texas these particular voting patterns were taking place. Either method on its own provides valuable information, but when used together they allow the user to see exactly what is happening in the state.
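GeoDa handled the weights and the Moran's I calculation in this lab. As a rough illustration of what is being computed, the sketch below builds rook contiguity neighbors on a small square grid and calculates global Moran's I with row-standardized weights. The grid and its values are hypothetical stand-ins for counties, row standardization is an assumption made here, and both function names are illustrative.

```python
def rook_neighbors(rows, cols):
    """Map each grid cell index to the cells that share an edge with it."""
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    adjacent.append(rr * cols + cc)
            neighbors[r * cols + c] = adjacent
    return neighbors


def morans_i(values, neighbors):
    """Global Moran's I using row-standardized weights.

    Assumes every cell has at least one neighbor.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0
    for i, adjacent in neighbors.items():
        weight = 1.0 / len(adjacent)  # each row of weights sums to 1
        cross += sum(weight * dev[i] * dev[j] for j in adjacent)
    # With row-standardized weights the sum of all weights equals n,
    # so the usual n / S0 scaling factor is 1.
    return cross / sum(d * d for d in dev)


grid = rook_neighbors(4, 4)
# Hypothetical values: high on the left half, low on the right half (clustered).
clustered = [1, 1, 0, 0] * 4
# Hypothetical values: alternating like a checkerboard (dispersed).
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]

print(morans_i(clustered, grid))     # positive: similar values sit next to each other
print(morans_i(checkerboard, grid))  # negative: neighbors are always different
```

Values near +1 indicate clustering of similar values, values near -1 indicate a dispersed pattern, and values near 0 suggest no spatial pattern.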

Results:

The results of all this testing provided some insight into the voting patterns for the state of Texas related to Democratic voting patterns. The two figures below (Figure 3 and Figure 4) show the percent Democratic vote for the 1980 and 2008 presidential elections. 


Figure 3 shows the Moran's I value for the 2008 percent Democratic vote for the president.

Figure 4 shows the Moran's I value for the 1980 percent Democratic vote for the president.

The graph on the left shows much more sign of clustering within the data. A high percentage of the points are right around the center of the graph, with only a few outliers.
The graph on the right is much more spread out than the one on the left. The points are not concentrated around the center of the graph. There are not any obvious outliers in the data due to the spread-out nature of the entire dataset. The Moran's I number at the top of each graph also shows that the left graph should have more clustering, due to the stronger correlation.


The next two graphs shown below (Figure 5 and Figure 6) show the voter turnout for the 1980 and 2008 presidential elections.

Figure 5 shows the Moran's I value for the percent voter turnout in the 2008 presidential election.

Figure 6 shows the Moran's I value for the percent voter turnout in the 1980 presidential election.

The graph on the left appears to have more clustering than the graph on the right, but it has a weaker correlation. Although more points appear to cluster around the center of the graph, they do not cluster along the line. The greater clustering around the line in the graph on the right shows the stronger correlation. Its Moran's I value is also higher than that of the graph on the left, which further supports the findings.

Figure 7 shows the Moran's I value for 2010 Hispanic population percentage.




The graph to the left (Figure 7) shows the Moran's I chart of Hispanic population percentage in the year 2010. Although the data does not appear to cluster around the center of the graph as much as the previous four figures, it has the highest Moran's I value. This value also tells us that there is a high degree of clustering around the fitted line. 











The next maps are LISA cluster maps created in GeoDa. They show the same information as Figures 3, 4, 5, 6 and 7, but represented geographically. The two figures below (Figure 8 and Figure 9) show the percent Democratic vote of the 1980 and 2008 presidential elections.

Figure 8 shows the percent of Democratic vote for the 2008 presidential election.



The figure to the right shows the 2008 Democratic percentage vote for the president. There appears to be a high percentage of Democratic voting in the southern portion of the state. The low percentage of Democratic voting happened in the central and northern sections of the state. 



Figure 9 shows the percent of Democratic vote for the 1980 presidential election.


The figure to the left shows the 1980 percent Democratic vote for the president. As in Figure 8, there is a high percentage of Democratic vote in the southern portion of the state, but in 1980 there is also a large percentage in the eastern part of the state. The lower percentages of Democratic vote are in relatively the same places in both maps, within the central and northern sections of the state.






Figure 10 shows the percent of voter turnout for the 2008 presidential election.

Figure 11 shows the percent of voter turnout for the 1980 presidential election.

The figure to the left (Figure 10) shows the percent of voter turnout in the 2008 presidential election. Other than in the southern portion of the state, there does not appear to be a large connection between voter turnout and percent Democratic vote. The figure to the right (Figure 11) shows the percent of voter turnout in the 1980 presidential election. This also shows that in the southern portion of the state there is a connection between voter turnout and a greater percentage of Democratic vote.

Figure 12 shows the 2010 Hispanic population percentage.


The figure to the right (Figure 12) shows the percentage of Hispanic population in the year 2010. As the map shows, there is a high Hispanic population percentage in the southern and western parts of the state, while there are lower percentages in the east-central parts of the state. When comparing this map to Figure 8, it appears that there is a correlation between Hispanic population and percent Democratic vote.









Conclusion:

The results of this exercise are very interesting. Although Hispanic population data was not part of the data provided for this assignment, it does appear that there is a connection between Hispanic percent population and the Democratic presidential vote. Voting patterns have stayed fairly consistent between the 1980 and 2008 elections. Democratic voting has started to shift westward, but has also decreased in other areas of the state. There does not appear to be a great change in voter turnout across the state. Overall, there does not appear to be a great change in the different variables used in this assignment.


Sources:

U.S. Census
Professor Weichelt

Monday, March 16, 2015

"Up-North" Wisconsin

Part 1


2.
     Asian Longhorned Beetle - Null Hypothesis: There is no difference between the average number of Asian Longhorned Beetles in Bucks County and the average for the whole state of Pennsylvania. Alternative Hypothesis: There is a difference between the average number of Asian Longhorned Beetles in Bucks County and the average for the whole state of Pennsylvania. Conclusion: Reject the null hypothesis. There was a difference between the average number of Asian Longhorned Beetles in Bucks County and the average for the whole state of Pennsylvania. Fewer of these beetles were found in Bucks County than the state average. This part of the state has less of a concern with this particular invasive species than the state as a whole.

     Emerald Ash Borer Beetle - Null Hypothesis: There is no difference between the average number of Emerald Ash Borer Beetles in Bucks County and the average for the whole state of Pennsylvania. Alternative Hypothesis: There is a difference between the average number of Emerald Ash Borer Beetles in Bucks County and the average for the whole state of Pennsylvania. Conclusion: Reject the null hypothesis. There was a difference between the average number of Emerald Ash Borer Beetles in Bucks County and the average for the whole state of Pennsylvania. More of these beetles were found in Bucks County than the state average. This part of the state has a greater concern with this invasive species than the state as a whole.

     Golden Nematode - Null Hypothesis: There is no difference between the average number of Golden Nematodes in Bucks County and the average for the whole state of Pennsylvania. Alternative Hypothesis: There is a difference between the average number of Golden Nematodes in Bucks County and the average for the whole state of Pennsylvania. Conclusion: Reject the null hypothesis. There was a difference between the average number of Golden Nematodes in Bucks County and the average for the whole state of Pennsylvania. More Golden Nematodes were found in Bucks County than the state average. This part of the state has a greater concern with this invasive species than the state as a whole.

3. 
      Null Hypothesis: There is no difference in persons per party visiting a particular wilderness park between the 1960 study and the sample taken in 1985. Alternative Hypothesis: There is a difference in persons per party visiting a particular wilderness park between the 1960 study and the sample taken in 1985. The calculated test statistic was 7.22, which corresponds to an almost 100% chance that there is a difference between the studies, so the null hypothesis is rejected.
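The 7.22 reported above reads more like a test statistic than a probability, since a probability cannot exceed 1. Assuming it is a z statistic (the write-up does not say whether a z or t test was used), a rough sketch of how it converts to a two-tailed p-value looks like this. `two_tailed_p` is an illustrative helper.

```python
from statistics import NormalDist


def two_tailed_p(z):
    """Two-tailed p-value for a z statistic under the standard normal curve."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # 95% confidence level

p = two_tailed_p(7.22)  # 7.22 is the value quoted in the text
print("reject the null" if p < ALPHA else "fail to reject the null")
```

A p-value this small means the chance of seeing such a difference when the null hypothesis is true is essentially zero, which matches the "almost 100%" wording.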


Part 2


Introduction:


     In the state of Wisconsin, the tourism board has inquired about the concept of "Up-North". The board wants to see if there is a statistical difference between the northern and southern zones of Wisconsin. The state of Wisconsin has provided a large set of data with different variables for each county across the entire state. From the numerous variables, four will be examined to see if there is a difference between the two parts of the state. The four variables chosen are resident and non-resident deer gun licenses sold, as well as resident and non-resident deer bow licenses sold. For the purpose of this study, Highway 29 running across the state will be used as the dividing line between the northern and southern zones. The map below shows the boundary of what is deemed "Up-North" for the purpose of this study (Figure 1).


Figure 1. This map shows the boundary between the northern and southern portions of the state.


Methodology: 

     In order to start the analysis, the data needed to be manipulated to fit the objectives of the assignment. A shapefile for all the counties of the state of Wisconsin (provided in a previous lab) needed to be joined with the master data set provided by the state. Once the tables were joined, the next step was to add fields to the combined data set. Four new fields were created, one for each of the variables that were to be used for analysis.  Once the four fields were created, they needed to be filled with information that could be used for statistical testing.

     The objectives of this assignment called for Chi-Squared testing to be conducted on each of the four variables. Chi-Squared is a test that compares the observed frequency distribution to the expected frequency distribution. In order to complete this test, absolute counts have to be used. Rates, percentages or proportions are not acceptable in this testing.

     The variables needed to be broken up into different classes in order for Chi-Squared testing to be effective. The assignment called for the variables to be broken into four classes based on an equal interval.  After the new classes had been made, the data was then exported to a dBase table. This file type is compatible with IBM SPSS software, which is what will be used for Chi-Squared testing.
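The classification and the test itself were run in ArcMap and SPSS. The sketch below shows the same two steps in plain Python: assigning counts to four equal-interval classes and computing the Chi-Squared statistic for a zone-by-class table. The county values are hypothetical and far fewer than a real county data set, so the output only illustrates the mechanics; the helper names are illustrative as well.

```python
def equal_interval_class(value, lo, hi, k=4):
    """Return the class number (1..k) of value using k equal-width bins on [lo, hi]."""
    if hi == lo:
        return 1
    width = (hi - lo) / k
    index = int((value - lo) / width) + 1
    return min(index, k)  # the maximum value belongs to the top class


def chi_square(table):
    """Pearson Chi-Squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counties: (zone, licenses sold).
counties = [
    ("north", 120), ("north", 480), ("north", 900), ("north", 310),
    ("south", 60), ("south", 300), ("south", 710), ("south", 860),
]
values = [count for _, count in counties]
lo, hi = min(values), max(values)

# Rows are zones, columns are the four license classes.
zones = ["north", "south"]
table = [[0] * 4 for _ in zones]
for zone, count in counties:
    table[zones.index(zone)][equal_interval_class(count, lo, hi) - 1] += 1

stat, df = chi_square(table)
CRITICAL_95_DF3 = 7.815  # Chi-Squared critical value for df = 3 at the 95% level
print(stat, df, "reject" if stat > CRITICAL_95_DF3 else "fail to reject")
```

With two zones and four classes the table has (2 - 1) x (4 - 1) = 3 degrees of freedom, which is why the 7.815 critical value is used for the comparison.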

Results:

The results from this analysis were very interesting. One would expect there to be a major difference with the concept of "Up-North" and the southern part of the state. The first variable of non-resident bow deer licenses sold is shown below (Figure 2).


Figure 2 shows the number of non-resident bow deer licenses sold in 2005 for the state of Wisconsin.

Table 1. The result of running Chi-Squared testing on the first variable.

When initially looking at the map, it appears that there is an obvious difference between the two parts of the state. In the northwestern portion of the map there is an obvious cluster of higher numbers of licenses sold. This would make sense because the locations with the highest numbers of licenses sold are also closest to bordering states. The table shown above shows the results of the Chi-Squared testing (Table 1). The test is based on a 95% confidence level. The results of the test show that we fail to reject the null hypothesis. There is not a statistical difference between the northern and southern parts of the state for non-resident bow deer licenses sold. Although initial predictions based on the map seem to show there is a difference, the test found no statistically significant difference.

Table 2. The result of running Chi-Squared testing on the second variable.

Figure 3 shows the number of non-resident gun deer licenses sold in 2005 for the state of Wisconsin.












The next variable that was used was the non-resident gun deer licenses sold. The map to the right shows the distribution of where licenses were sold (Figure 3). When comparing this to Figure 2, they are very similar. Many of the areas that were high in bow licenses sold are also high in gun licenses sold. The numbers of licenses sold are higher farther inward from the borders. This map also appears to give the impression that there would be a great difference between the northern and southern parts of the state. The table shown above shows us a different perspective (Table 2). The results of the Chi-Squared test are based on a 95% confidence level. We fail to reject the null hypothesis. There is no statistical difference between the expected results and the observed results of licenses sold.

Table 3. The result of running Chi-Squared testing on the third variable.

Figure 4 shows the number of resident gun deer licenses sold in 2005 for the state of Wisconsin.









The third variable was the resident gun deer licenses sold. The map to the left shows the distribution of resident gun deer licenses sold (Figure 4). This pattern is much different than that of Figure 3. The number of licenses sold seems to correlate with areas of higher population. The counties with the highest numbers of licenses sold also have higher populations. The map does not have an apparent split like the first two maps. The table above is the result of the Chi-Squared test (Table 3). The results show that we fail to reject the null hypothesis. There is no statistical difference between the expected number of licenses sold and the observed number of licenses sold.


Table 4. The result of running Chi-Squared testing on the fourth variable.

Figure 5 shows the number of resident bow deer licenses sold in 2005 for the state of Wisconsin.










The fourth and final variable used was resident bow deer licenses sold. The map to the right shows the distribution of resident bow deer licenses sold (Figure 5). This map is very similar to that of Figure 4. Many areas with higher numbers of sales also correspond to the counties with higher populations. There does not seem to be an apparent split of the state. The table above shows the results of the Chi-Squared test (Table 4). The test was based on a 95% confidence level. With this variable, we fail to reject the null hypothesis. There is no statistical difference between the expected number of licenses sold and the observed number of licenses sold.


Conclusion:

The results of this lab were surprising. Initially, I had thought there would be a difference between the northern part of the state and the southern portion. Although the maps do appear to show a difference, there was no statistical difference between the two. It was apparent that the most popular spots for non-residents to hunt deer in the state were the counties that border other states. The maps do not show whether that is because of geographical distance or because there is more opportunity for deer hunting in the border counties.



Sources:

State of Wisconsin

Wednesday, February 25, 2015

Kansas and Oklahoma Tornado Shelters

Introduction:

This particular investigation is related to the frequency and sizes of tornadoes in Kansas and Oklahoma. Data has been provided for the location and size of each tornado for the years 1995-2012. The data is broken up into two different block groups: the first covers 1995-2006 and the second covers 2007-2012. The second block group also has the number of tornadoes for each county along with the location and size. There is a debate over whether or not to build tornado shelters in particular locations. Some of the public believe that there is a pattern as to where tornadoes occur with a higher frequency, while another group believes just the opposite. That group's opinion is that not all places see tornadoes and therefore it is an unnecessary waste of money to build these shelters. The state believes it is better to build the shelters in order to be on the safe side in case disaster strikes.

Methodology:

There were multiple tools that needed to be used in order to accurately assess whether or not shelters should be built. When modeling the data, it was broken up into the two different block groups. The first statistic to be mapped was the mean center. Each tornado location has an X coordinate and a Y coordinate attached to it. In order to find the mean center, the X coordinates and the Y coordinates of all the points are each averaged. The average of the X values and the average of the Y values make up the coordinate pair that represents the mean center. This shows where the exact middle of all the data points is.


Figure 1 shows the locations of tornadoes in Kansas and Oklahoma for the years 1995-2006. The locations are shown on the map by the size of the tornado's width in feet. The mean center and weighted mean center are also shown.


The next tool used, which is very similar to mean center, is weighted mean center. Instead of only taking the average of the points, the weighted mean center also takes into consideration different frequencies of the grouped data. In other words, the points are weighted by frequencies which will most likely cause a different result than the mean center. 
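The two statistics described above were produced with GIS tools, but the arithmetic behind them is short enough to sketch in Python. The points and weights below are hypothetical, and the weight is assumed here to be tornado width in feet (as in the map symbology), since the field actually used is not stated.

```python
def mean_center(points):
    """Average of the X coordinates and average of the Y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def weighted_mean_center(points, weights):
    """Mean center where each point counts in proportion to its weight."""
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return (x, y)


# Hypothetical tornado touchdown points (projected X, Y) and widths in feet.
tornadoes = [(100.0, 400.0), (150.0, 350.0), (200.0, 200.0), (250.0, 150.0)]
widths = [50, 100, 400, 900]

print(mean_center(tornadoes))
# The wider tornadoes in this made-up data are farther south (lower Y),
# so the weighted center falls south of the unweighted one.
print(weighted_mean_center(tornadoes, widths))
```

Comparing the two results shows how a handful of heavily weighted points can pull the weighted mean center away from the simple mean center.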

Figure 2 shows the locations of tornadoes in Kansas and Oklahoma for the years 2007-2012. The locations are shown on the map by the size of the tornado's width in feet. The mean center and weighted mean center are also shown.


The two maps above (Figures 1 and 2) show the locations of tornadoes for the different years as well as the mean centers and weighted mean centers. When looking at Figure 1, you notice that the weighted mean center is shifted to the south of the mean center. This shows that the weight is concentrated to the south of the mean center, rather than to the north of it. Figure 2 shows a similar phenomenon to Figure 1. The one difference is that the shift is in more of a southeastern direction, rather than straight south.

Figure 3 shows the locations of tornadoes in Kansas and Oklahoma for the years 1995-2012. The locations are shown on the map by the size of the tornado's width in feet. The mean center and weighted mean center are also shown.




The map above (Figure 3) is the compilation of both Figure 1 and 2. When comparing the two results, the mean center has shifted north from the first block year to the second, but the weighted mean center has continued to move to the south. 


The second set of tools that were used involved standard distance. The standard distance is the spatial equivalent of the standard deviation. The standard distance shows the area around a particular point within which a particular percentage of tornadoes occur. For this example, one standard distance was used. The weighted mean center was used as the center marker for the standard distance. Since the weighted mean center was used, the map created was actually the weighted standard distance. A weighted standard distance can only be created around a weighted mean center.
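As a rough illustration of the calculation behind the weighted standard distance circle, the sketch below computes the weighted mean center and then the weighted spread of the points around it. The points and weights are hypothetical, and `weighted_standard_distance` is an illustrative helper rather than the GIS tool itself.

```python
import math


def weighted_standard_distance(points, weights):
    """Return the weighted mean center and the weighted standard distance around it."""
    total = sum(weights)
    cx = sum(w * x for (x, _), w in zip(points, weights)) / total
    cy = sum(w * y for (_, y), w in zip(points, weights)) / total
    # Weighted variance of the X and Y coordinates about the weighted center.
    var_x = sum(w * (x - cx) ** 2 for (x, _), w in zip(points, weights)) / total
    var_y = sum(w * (y - cy) ** 2 for (_, y), w in zip(points, weights)) / total
    return (cx, cy), math.sqrt(var_x + var_y)


# Hypothetical tornado points (projected X, Y) and weights.
tornadoes = [(100.0, 400.0), (150.0, 350.0), (200.0, 200.0), (250.0, 150.0)]
weights = [50, 100, 400, 900]

center, radius = weighted_standard_distance(tornadoes, weights)
# The circle drawn on the map is centered on `center` with radius `radius`.
print(center, radius)
```

Passing equal weights reduces this to the ordinary (unweighted) standard distance around the mean center.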


Figure 4 shows the tornado locations from 1995-2006 as well as where the weighted standard distance is located.

Figure 5 shows the tornado locations from 2007-2012 as well as where the weighted standard distance is located.

The map to the left (Figure 4) shows the result of creating a weighted standard distance around the weighted mean center. The map to the right (Figure 5) also shows the weighted standard distance.



When comparing Figures 4 and 5, it is interesting to see the results. The map below (Figure 6) shows both maps combined. Although Figure 3 had previously shown a shift to the south and east from the mean center to the weighted mean centers, the shift of the weighted standard distance is to the northeast. Although this is the opposite of Figure 3, it is a reasonable result. It only compares the weighted mean center of the first block group with that of the second. Since it only uses these two points, the shift is understandable.

Figure 6 is a compilation map of the weighted standard distance maps with the tornado locations overlaid to show where all the tornadoes have occurred from 1995-2012.

The last set of tools was used to find the standard deviation of the number of tornadoes that occurred. The data provided only had occurrences for the years 2007-2012, so the results do not reflect the two block groups that have been used for the rest of this project. The standard deviation allows you to see which areas are above or below the average number of tornado occurrences. The map below (Figure 7) shows how the standard deviation varies across the two states.

Figure 7 shows the standard deviation for the number of tornadoes that occurred from 2007-2012. The mean of this data set was four tornadoes. The map shows that a large portion of the counties with above-average numbers of tornadoes were in central Kansas.

Results:

In addition to the maps, the assignment also called for finding Z scores for three different counties: Russell Co., KS; Caddo Co., OK; and Alfalfa Co., OK. The Z score results for the counties were the following:

Russell: 4.80
Caddo: 2.09
Alfalfa: 0.23

After looking at the Z scores for those three counties, the assignment also wanted to know how many tornadoes will occur 70% and 20% of the time over the next five years. The results are as follows:
There is a 70% chance that at least one tornado will occur over the next five years in the study area. There is a 20% chance that at least seven tornadoes will occur over the next five years in the study area.
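The Z scores and the 70% and 20% figures above come from the mean and standard deviation of the county tornado counts. The sketch below shows the general calculation under a normal-curve assumption. The mean of four comes from the Figure 7 caption, but the standard deviation and the example county count are hypothetical placeholders, since neither is reported here, so the printed numbers are not expected to reproduce the results above.

```python
from statistics import NormalDist


def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


def count_exceeded_with_probability(prob, mean, std_dev):
    """Count that is met or exceeded with probability `prob` under a normal model."""
    return NormalDist(mean, std_dev).inv_cdf(1 - prob)


MEAN = 4.0     # mean tornadoes per county, from the Figure 7 caption
STD_DEV = 4.3  # hypothetical: the write-up does not report the standard deviation

# Hypothetical county with 13 tornadoes in the period.
print(z_score(13, MEAN, STD_DEV))
print(count_exceeded_with_probability(0.70, MEAN, STD_DEV))
print(count_exceeded_with_probability(0.20, MEAN, STD_DEV))
```

Tornado counts are not truly normally distributed, so results like these are best read as rough guides rather than precise probabilities.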

Conclusion:

When looking at all of the maps and the numbers associated with them, the findings were interesting. When looking at the probability of tornadoes occurring, according to the Z scores, the number seems very low. This would mean that it would not be a necessity to build shelters. On the other hand, when analyzing the different maps, it seems as if some areas are more prone to tornadoes, and it may be a good investment to build storm shelters there.

Overall, it is hard to estimate where shelters should be built due to the large size of the study area. In order to get a more accurate representation of where shelters should be located, multiple maps may need to be made in specific locations within Kansas or Oklahoma.