Summarizing Data
Why do we need exploratory analysis and data summaries?
Suppose that you have the weights of 1000 students of a school. One way to understand anything from this data is to go through it row by row. For example, if you want to know the lowest of these 1000 weights, you could start comparing every row of data. But is that prudent? Of course not. And is it even possible to get any insight out of raw data without exploring it? Again, no. That is why we need to summarize and explore data: to find information that can be easily interpreted.
Summarizing Categorical Variables
Read about different types of data in this article.
When we have a categorical variable, we count. Then we show the result in absolute numbers or as percentages, for example 5 red balls, 2 blue balls and 3 green balls, or 50% red balls, 20% blue balls and 30% green balls. We can show the result as a table or as a graph.
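As a minimal sketch of this kind of counting, base R's `table()` and `prop.table()` give the absolute counts and the percentages. The `balls` vector is made up for illustration so that it matches the 5 red, 2 blue and 3 green balls mentioned above; it is not part of the dataset used in this article.

```r
# Hypothetical data matching the example in the text: 5 red, 2 blue, 3 green balls
balls <- c(rep("red", 5), rep("blue", 2), rep("green", 3))

# Absolute number of balls in each category
counts <- table(balls)

# The same result expressed as percentages of the total
percentages <- prop.table(counts) * 100

print(counts)
print(percentages)
```

Printing both objects gives the two tabular views described above: one in absolute numbers and one in percent.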
library(dplyr)    # %>%, mutate(), case_when(), select(), group_by(), summarise()
library(ggplot2)  # plotting
library(ggthemes) # theme_clean()

# Teaching assistant evaluation data
x <- read.csv("https://query.data.world/s/ycimehoogc3wiwgkd65z7d24v6mqik", header = TRUE, stringsAsFactors = FALSE)

# Give the columns we use readable names
names(x)[1] <- "EnglishSpeaker"
names(x)[6] <- "ClassAttribute"
names(x)[4] <- "Semester"

x$EnglishSpeaker <- as.factor(x$EnglishSpeaker)
x$Semester <- as.factor(x$Semester)
x$ClassAttribute <- as.factor(x$ClassAttribute)

# Recode the numeric codes into readable labels and keep columns 1, 4 and 6
x <- x %>%
  mutate(EnglishSpeaker = ifelse(EnglishSpeaker == 1, "yes", "no")) %>%
  mutate(ClassAttribute = case_when(ClassAttribute == 1 ~ "low",
                                    ClassAttribute == 2 ~ "medium",
                                    ClassAttribute == 3 ~ "high")) %>%
  mutate(Semester = ifelse(Semester == 1, "Summer", "Regular")) %>%
  select(1, 4, 6)

# Count the rows in each class attribute and show the counts as a bar chart
x %>%
  group_by(ClassAttribute) %>%
  summarise(Count = n()) %>%
  mutate(CountPercent = Count / sum(Count)) %>%
  ggplot(aes(y = Count, x = reorder(ClassAttribute, Count))) +
  geom_col(width = 0.5, fill = "turquoise") +
  geom_text(aes(y = Count - 6, label = Count), color = "white", size = 10) +
  labs(title = "Class Attribute",
       subtitle = "Teaching assistant evaluation",
       x = "Attribute",
       y = "Count",
       caption = "Source: UCI machine learning repository") +
  theme_clean() +
  # `l` is not defined in this section; annotation_custom() expects a grob here
  annotation_custom(l, xmin = 2.7, xmax = 4, ymin = 50, ymax = 63) +
  coord_cartesian(clip = "off")
That is the case when we summarize a single variable. What if there is more than one categorical variable? We can still show the result as a table, or as a grouped bar chart as shown below.
# Count each class attribute separately for English and non-English speakers
x %>%
  group_by(EnglishSpeaker) %>%
  summarise(low = sum(ClassAttribute == "low"),
            medium = sum(ClassAttribute == "medium"),
            high = sum(ClassAttribute == "high")) %>%
  # Reshape to long format: one row per EnglishSpeaker/ClassAttribute pair
  tidyr::gather("ClassAttribute", "Count", -1) %>%
  ggplot(aes(x = EnglishSpeaker, y = Count, fill = ClassAttribute)) +
  geom_col(position = "dodge2") +
  geom_text(aes(y = Count - 2, label = Count),
            position = position_dodge(width = 1),
            color = "white",
            size = 5) +
  labs(title = "Class Attribute by native language",
       subtitle = "Teaching assistant evaluation",
       x = "English Speaker",
       y = "Count",
       caption = "Source: UCI machine learning repository") +
  theme_clean() +
  # `l` is not defined in this section; annotation_custom() expects a grob here
  annotation_custom(l, xmin = 2, xmax = 3.4, ymin = 36, ymax = 52) +
  coord_cartesian(clip = "off")
Sometimes it is more useful to show the same data as proportions, as below. Notice that the y axis now runs from 0 to 1 and that the bars for both the yes and no categories add up to one. The colours therefore show the proportion of high, medium and low within each of the yes and no categories.
# Same counts as before, but each bar is stretched to the full height
x %>%
  group_by(EnglishSpeaker) %>%
  summarise(low = sum(ClassAttribute == "low"),
            medium = sum(ClassAttribute == "medium"),
            high = sum(ClassAttribute == "high")) %>%
  tidyr::gather("ClassAttribute", "Count", -1) %>%
  ggplot(aes(x = reorder(EnglishSpeaker, Count),
             y = Count, fill = ClassAttribute)) +
  # position = "fill" rescales every bar so that its segments add up to 1
  geom_col(position = "fill", width = 0.5) +
  # The labels still show the raw counts, centred inside each segment
  geom_text(aes(label = Count),
            position = position_fill(vjust = 0.5),
            color = "white",
            size = 5) +
  labs(title = "Class Attribute by native language",
       subtitle = "Teaching assistant evaluation",
       x = "English Speaker",
       y = "Proportion",
       caption = "Source: UCI machine learning repository") +
  theme_clean() +
  # `l` is not defined in this section; annotation_custom() expects a grob here
  annotation_custom(l, xmin = 2, xmax = 3.4, ymin = 0.8, ymax = 1.2) +
  coord_cartesian(clip = "off")
For categorical variables, we count the number of occurrences. If more than one variable is involved, we count the occurrences of each combination of values.
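Counting combinations can also be done without any plotting. The sketch below uses base R's `table()` with two vectors to build a contingency table; the two small vectors are hypothetical values made up for illustration, not rows from the teaching assistant data.

```r
# Hypothetical data for illustration: two categorical variables for six people
english_speaker <- c("yes", "no", "no", "yes", "no", "no")
class_attribute <- c("high", "low", "medium", "high", "low", "high")

# Count every combination of the two variables (a contingency table)
combo_counts <- table(english_speaker, class_attribute)
print(combo_counts)

# Proportions within each row, so every row adds up to 1
row_props <- prop.table(combo_counts, margin = 1)
print(row_props)
```

The first table corresponds to the grouped bar chart (raw counts per combination) and the second to the filled bar chart (proportions within each group).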
Summarizing Numeric Variables
Single variable
As with categorical variables, it is also possible to count numeric variables, but the counting is a bit different: we create bins and count the data points in each bin. For example, if we have the weights of 200 individuals that range from 50 kg to 100 kg, we may create bins of 50 kg to 59 kg, 60 kg to 69 kg and so on. We then show the number of data points that fall in each bin, either as a table or as a graph.
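To make the binning step concrete, here is a small sketch using base R's `cut()` and `table()`. The `weights` vector and the bin labels are assumptions chosen for illustration; they are not the simulated data used in the rest of this section.

```r
# Hypothetical weights in kg, made up for illustration
weights <- c(52, 58, 61, 67, 69, 73, 75, 78, 84, 91, 99)

# Bin edges every 10 kg; right = FALSE makes the bins [50,60), [60,70), ...
breaks <- seq(50, 100, by = 10)
bins <- cut(weights, breaks = breaks, right = FALSE,
            labels = c("50-59", "60-69", "70-79", "80-89", "90-99"))

# Number of data points that fall in each bin
bin_counts <- table(bins)
print(bin_counts)
```

A histogram draws exactly these counts as bars, one bar per bin.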
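# Simulated weights for 200 female and 200 male students
# (no seed is set, so the exact values differ from run to run)
df <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = round(c(rnorm(200, mean = 40, sd = 4), rnorm(200, mean = 50, sd = 5)))
)

# Histogram with one bin per kg
ggplot(df, aes(x = weight)) +
  geom_histogram(binwidth = 1, color = "black", fill = "blue", alpha = 0.3) +
  labs(title = "Histogram of weight",
       y = "frequency") +
  theme_clean() +
  # `l` is not defined in this section; annotation_custom() expects a grob here
  annotation_custom(l, xmin = 60, xmax = 65, ymin = 22, ymax = 30) +
  coord_cartesian(clip = "off")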
From the histogram, it becomes very easy to see how much most of the students weigh. Another way to visualize the same information is a density plot.
ggplot(df, aes(x = weight)) +
  geom_density(alpha = .2, fill = "#FF6666") +
  labs(title = "Density of weight") +
  theme_clean() +
  annotation_custom(l, xmin = 60, xmax = 65, ymin = 0.045, ymax = 0.06) +
  coord_cartesian(clip = "off")
Another way to summarize a single numeric variable is the cumulative frequency. We create bins of the variable, place them in order, and then count the cumulative number of data points up to and including each bin. An example is shown below.
df %>%
  # Assign every weight to a 10 kg range
  mutate(weight = case_when(
    weight < 40 ~ "30-39",
    weight >= 40 & weight < 50 ~ "40-49",
    weight >= 50 & weight < 60 ~ "50-59",
    weight >= 60 ~ "60-69"
  )) %>%
  group_by(weight) %>%
  summarise(Count = n()) %>%
  # Running total of the counts, and the same as a percentage of all rows
  mutate(cumulative_count = cumsum(Count)) %>%
  mutate(cumulative_percent = cumulative_count * 100 / sum(Count)) %>%
  knitr::kable()
weight | Count | cumulative_count | cumulative_percent |
---|---|---|---|
30-39 | 95 | 95 | 23.75 |
40-49 | 206 | 301 | 75.25 |
50-59 | 91 | 392 | 98.00 |
60-69 | 8 | 400 | 100.00 |
This is particularly useful for answering questions such as "How many values are below, or above, a certain value?".
df %>%
  mutate(weight = case_when(
    weight < 40 ~ "30-39",
    weight >= 40 & weight < 50 ~ "40-49",
    weight >= 50 & weight < 60 ~ "50-59",
    weight >= 60 ~ "60-69"
  )) %>%
  group_by(weight) %>%
  summarise(Count = n()) %>%
  mutate(cumcount = cumsum(Count)) %>%
  mutate(cumper = cumcount * 100 / sum(Count)) %>%
  # group = 1 connects the points of the discrete x axis with a single line
  ggplot(aes(x = weight, y = cumper, group = 1)) +
  geom_line(color = "black") +
  geom_point() +
  labs(title = "Cumulative Percentage of Weight (Range)",
       x = "Weight Ranges",
       y = "Cumulative Percentage (%)") +
  theme_clean() +
  # `l` is not defined in this section; annotation_custom() expects a grob here
  annotation_custom(l, xmin = 3, xmax = 4, ymin = 40, ymax = 60) +
  coord_cartesian(clip = "off")
For example, from the above graph, it is evident that almost all of the weights (98% in the table above) are less than or equal to 59 kg.
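The same "how many are below or above a value" questions can also be answered directly from the raw values, without binning first. The sketch below uses base R's `ecdf()` together with simple logical comparisons; the `weights` vector is hypothetical and made up for illustration.

```r
# Hypothetical weights in kg, made up for illustration
weights <- c(35, 38, 42, 44, 47, 49, 51, 55, 58, 62)

# Share of values that are less than or equal to 59 kg
share_le_59 <- mean(weights <= 59)

# Number of values that are more than 49 kg
n_above_49 <- sum(weights > 49)

# ecdf() returns a function that gives the cumulative share at any value
cum_share <- ecdf(weights)
share_le_49 <- cum_share(49)

print(share_le_59)
print(n_above_49)
print(share_le_49)
```

Here `cum_share` plays the role of the cumulative percentage curve: evaluating it at a value returns the share of data points at or below that value.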
Combination of numeric and categorical variables
Sometimes we may have a combination of numeric and categorical variables. Continuing the previous example, if we want to compare the weights of the students by sex, we can plot overlaid histograms.
ggplot(df, aes(x = weight, color = sex, fill = sex)) +
  geom_histogram(alpha = 0.2, position = "identity") +
  labs(title = "Histogram of weight by sex") +
  theme_clean() +
  annotation_custom(l, xmin = 60, xmax = 65, ymin = 35, ymax = 45) +
  coord_cartesian(clip = "off")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
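Besides overlaying histograms, a numeric variable can be summarized per category with a few numbers. The following is a minimal sketch with base R's `tapply()`; the `students` data frame is a hypothetical six-row example, not the simulated `df` used above.

```r
# Hypothetical weights (kg) for six students, made up for illustration
students <- data.frame(
  sex    = factor(c("F", "F", "F", "M", "M", "M")),
  weight = c(38, 41, 44, 47, 50, 56)
)

# Mean weight within each sex
mean_by_sex <- tapply(students$weight, students$sex, mean)
print(mean_by_sex)

# Minimum, quartiles, mean and maximum within each sex
print(tapply(students$weight, students$sex, summary))
```

Each group gets its own summary, which is the numeric counterpart of comparing the two histograms.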
Two variables
Whenever two numeric variables are involved, we try to understand the relationship between them.
mtcars %>%
  ggplot(aes(x = mpg, y = hp)) +
  geom_point(aes(colour = hp)) +
  labs(x = "Fuel consumption (Miles per gallon)",
       y = "Engine power (HP)",
       title = "Relationship between Engine power and fuel consumption") +
  theme_clean() +
  annotation_custom(l, xmin = 30, xmax = 35, ymin = 275, ymax = 350) +
  coord_cartesian(clip = "off")
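The relationship seen in a scatter plot can also be condensed into a single number. As a sketch, base R's `cor()` computes the Pearson correlation coefficient for the same two `mtcars` columns; using the correlation here is an addition for illustration, not a step from the original analysis.

```r
# mtcars ships with R, so no extra data is needed
r <- cor(mtcars$mpg, mtcars$hp)

# A value close to -1 means that hp tends to fall as mpg rises
print(round(r, 2))
# prints -0.78
```

The negative sign matches the downward slope of the point cloud: cars with more engine power tend to travel fewer miles per gallon.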
And if one of the variables happens to be a date, year, month, time or similar, we try to understand the trend.
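library(WDI) # World Bank World Development Indicators

# Unemployment rate (% of total labour force) for India
x <- WDI(indicator = "SL.UEM.TOTL.ZS", country = c("IN"))

# Show the years for which no value is available
x %>%
  filter(is.na(SL.UEM.TOTL.ZS))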
##    iso2c country SL.UEM.TOTL.ZS year
## 1     IN   India             NA 1990
## 2     IN   India             NA 1989
## 3     IN   India             NA 1988
## 4     IN   India             NA 1987
## 5     IN   India             NA 1986
## 6     IN   India             NA 1985
## 7     IN   India             NA 1984
## 8     IN   India             NA 1983
## 9     IN   India             NA 1982
## 10    IN   India             NA 1981
## 11    IN   India             NA 1980
## 12    IN   India             NA 1979
## 13    IN   India             NA 1978
## 14    IN   India             NA 1977
## 15    IN   India             NA 1976
## 16    IN   India             NA 1975
## 17    IN   India             NA 1974
## 18    IN   India             NA 1973
## 19    IN   India             NA 1972
## 20    IN   India             NA 1971
## 21    IN   India             NA 1970
## 22    IN   India             NA 1969
## 23    IN   India             NA 1968
## 24    IN   India             NA 1967
## 25    IN   India             NA 1966
## 26    IN   India             NA 1965
## 27    IN   India             NA 1964
## 28    IN   India             NA 1963
## 29    IN   India             NA 1962
## 30    IN   India             NA 1961
## 31    IN   India             NA 1960
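# Drop the years without data and turn the year into a Date (1 January of that year)
x <- na.omit(x)
x$year <- lubridate::ymd(x$year, truncated = 2L)

x %>%
  ggplot(aes(x = year, y = SL.UEM.TOTL.ZS)) +
  geom_line() +
  geom_point() +
  labs(x = "Year",
       y = "Unemployment Rate (% of labour force)",
       title = "Unemployment Rate by year in India",
       caption = "Source: World Bank") +
  theme_clean() +
  # `l` is not defined in this section; annotation_custom() expects a grob here.
  # The x axis is a Date, so the limits are given as dates converted to numbers.
  annotation_custom(l,
                    xmin = as.numeric(as.Date("1995-01-01")),
                    xmax = as.numeric(as.Date("2000-01-01")),
                    ymin = 6.5, ymax = 7.0) +
  coord_cartesian(clip = "off")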
You may want to have a look at the video which explains the above.