Data Visualization

Dr. Mine Dogucu

Data Visualization

Examples

We will only scratch the surface of data visualization in this class. Visualization is a rich field in its own right, and some of you may want to consider it as a career option.

Data Visualizations

are graphical representations of data
use different colors, shapes, and the coordinate system to summarize data
can tell a story or can be useful for exploring data

Packages for Today

library(hellodatascience)
library(tidyverse)

Data Context

data(atus_college)
glimpse(atus_college)

Rows: 312
Columns: 13
$ employment        <fct> Part Time, Full Time, Part Time, Part Time, NA, Part…
$ age               <dbl> 19, 23, 22, 21, 26, 25, 27, 36, 30, 20, 18, 20, 25, …
$ enrollment        <fct> Part Time, Full Time, Full Time, Full Time, Full Tim…
$ weekly_earnings   <dbl> 400.00, 1476.92, 561.25, 100.00, NA, 300.00, 1076.92…
$ household_size    <dbl> 6, 2, 2, 4, 2, 3, 3, 1, 3, 2, 4, 4, 2, 2, 4, 5, 4, 2…
$ time_alone        <dbl> 326, 150, 357, 22, 0, 455, 90, 340, 326, 120, 285, 5…
$ sleep_time        <dbl> 680, 180, 470, 660, 875, 765, 630, 300, 445, 630, 66…
$ work_time         <dbl> 315, 0, 0, 0, 0, 0, 0, 645, 555, 0, 0, 0, 520, 615, …
$ degree_class_time <dbl> 0, 0, 238, 0, 0, 0, 0, 0, 0, 0, 0, 228, 0, 0, 0, 0, …
$ shopping_time     <dbl> 14, 0, 0, 0, 0, 0, 0, 5, 20, 345, 0, 0, 0, 0, 0, 0, …
$ lunch_break_time  <dbl> 66, 60, 20, 115, 35, 50, 25, 30, 75, 60, 15, 15, 60,…
$ sports_time       <dbl> 0, 60, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0…
$ religious_time    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0…

Documentation for Data

?atus_college

Visualizing a Single Variable

Visualizing a ___ Variable

A bar plot of employment status. The x-axis is labeled 'employment status' with categories Full Time, Part Time, and NA. The y-axis is labeled 'count' and ranges from 0 to 140. Full Time has the tallest bar at about 130, Part Time is around 80, and NA is close to 100.

Figure 1: A bar plot of employment status

What do you notice about this graph?
If you could talk in English to R, how would you tell R to make this plot?

Pick your data

ggplot(data = atus_college)

Figure 2: Blank coordinate system

Map your variable

ggplot(data = atus_college, aes(x = employment))

A blank rectangular space with only the x-axis labeled employment, with three categories: Full Time, Part Time, and NA.

Figure 3: Mapping employment variable to the x-axis

Pick your plot type

ggplot(data = atus_college, aes(x = employment)) +
  geom_bar()

Figure 4: Creating the bars by adding the geometric layer of a barplot

Summary

The three main steps to make a plot are:

Call the ggplot() function on the data we want to plot.
Map variables to plot aesthetics using the aes() function.
Add a layer to specify the plot type.
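As a rough illustration of these three steps, the sketch below builds a bar plot one stage at a time and stores each stage in its own object. It maps enrollment instead of employment purely for variety, and the object names p_data, p_mapped, and p_bar are made up for this example. It assumes the hellodatascience package is installed, as in the rest of these slides.

```r
library(ggplot2)
library(hellodatascience)  # assumed installed; provides atus_college

data(atus_college)

# Step 1: call ggplot() on the data (this alone draws a blank canvas)
p_data <- ggplot(data = atus_college)

# Step 2: map a variable to a plot aesthetic with aes()
p_mapped <- ggplot(data = atus_college, aes(x = enrollment))

# Step 3: add a layer that specifies the plot type
p_bar <- p_mapped +
  geom_bar()

p_bar
```

Printing p_data or p_mapped on their own shows what each earlier step contributes before any bars are drawn.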

code style

The tidyverse style guide has the following convention for writing ggplot2 code.

The plus sign (+) for adding layers always has a space before it and is followed by a new line.
The new line is indented by two spaces. RStudio does this automatically for you.

code style

ggplot(data = atus_college, aes(x = employment)) +
  geom_bar()

Both the code above and the code below are correct ways to style ggplot code.

ggplot(
  data = atus_college, 
  aes(x = employment)
) +
  geom_bar()

Visualizing a Numeric Variable - Histogram

ggplot(data = atus_college, aes(x = weekly_earnings)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

A histogram of weekly earnings. The x-axis is labeled 'Weekly Earnings' and ranges from 0 to 3000 in increments of 1000. The y-axis is labeled 'Count' and ranges from 0 to 20 in increments of 5. There are about 30 bars, each with a width of about 100. The height of each bar represents the frequency of weekly earnings in that range, with most counts concentrated at lower earnings and gradually decreasing toward higher earnings.

Figure 5: A histogram of weekly earnings

ggplot(data = atus_college, aes(x = weekly_earnings)) +
  geom_histogram(binwidth = 20)

the same weekly earnings histogram with thinner bin widths set at 20

ggplot(data = atus_college, aes(x = weekly_earnings)) +
  geom_histogram(binwidth = 100)

the same weekly earnings histogram with medium bin widths set at 100

ggplot(data = atus_college, aes(x = weekly_earnings)) +
  geom_histogram(binwidth = 500)

the same weekly earnings histogram with thick bin widths set at 500
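To get a feel for how binwidth changes the picture, it can help to estimate how many bins each choice produces. The back-of-the-envelope sketch below uses the minimum (12) and maximum (3040) weekly earnings reported on the annotated boxplot slide in this deck. The variable names are illustrative, and the exact number of bars ggplot2 draws may differ slightly because of how it aligns bin boundaries.

```r
# Minimum and maximum weekly earnings, as reported on the annotated boxplot slide
earnings_min <- 12
earnings_max <- 3040

# The three binwidths compared on these slides
binwidths <- c(20, 100, 500)

# Rough number of bins needed to cover the range at each binwidth
approx_bins <- ceiling((earnings_max - earnings_min) / binwidths)
names(approx_bins) <- binwidths

approx_bins
```

A binwidth of 20 gives roughly 150 very thin bars, 100 gives about 30, and 500 gives only a handful, which is why the shape of the histogram looks so different across the three plots.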

Describing histograms

A figure showing four histograms illustrating different distribution shapes. Each histogram has the x-axis labeled x and the y-axis labeled count. The first histogram, labeled Left-skewed, has a longer tail on the left side: most high-count bars are concentrated on the right, with low counts toward the left. The second histogram, labeled Bell-shaped, shows a symmetric distribution with most bars centered in the middle and tails of equal length on both sides. The third histogram, labeled Uniform, has bars of roughly equal height across the range. The fourth histogram, labeled Right-skewed, has a longer tail on the right side: most high-count bars are concentrated on the left, with low counts toward the right.

Figure 6: Understanding skewness of a histogram

Visualizing a Numeric Variable - Boxplot

ggplot(data = atus_college, aes(y = weekly_earnings)) +
  geom_boxplot()

A vertical boxplot showing weekly earnings distribution. The x-axis has no label and its range from -0.4 to 0.4 has no real meaning. The y-axis is labeled weekly_earnings, ranging from 0 to 3000. The box spans approximately from 500 to 1250, with a median line near 750. The lower whisker extends from the lower end of the box down to 0, and the upper whisker extends from the upper end of the box up to about 2500. Several outliers appear above the upper whisker, ranging from about 2500 to 3000.

Figure 7: Boxplot of weekly earnings

Understanding Boxplots

A detailed boxplot illustrating weekly earnings distribution with annotations. The y-axis is labeled 'Weekly Earnings ($)' and ranges from -1000 to 3000. The box spans from Q1 = 394 to Q3 = 1250, with a median at 706. The minimum value is 12, and the maximum is 3040. The upper whisker limit is 2534, and the lower whisker limit is -890. Several green points above the upper whisker represent potential outliers between 2534 and 3040. The interquartile range (IQR) is highlighted between Q1 and Q3. Text annotations indicate key values: Min = 12, Median = 706, Q1 = 394, Q3 = 1250, Upper whisker limit = 2534, Lower whisker limit = -890, and Max = 3040.

Figure 8: Annotated boxplot
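The whisker limits annotated on this figure are consistent with the usual 1.5 × IQR rule, which is also the default in geom_boxplot(). The short sketch below reproduces them from the quartiles shown on the figure. The variable names are illustrative.

```r
# Quartiles as annotated on the figure
q1 <- 394
q3 <- 1250

# Interquartile range: the height of the box
iqr <- q3 - q1

# Whisker limits under the 1.5 * IQR rule
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

c(iqr = iqr, lower_limit = lower_limit, upper_limit = upper_limit)
```

The limits are cutoffs, not necessarily where the whiskers end. Each whisker is drawn only as far as the most extreme observation inside its limit, which is why the lower whisker stops near the minimum of 12 rather than at -890, and why observations above 2534 are plotted as potential outliers.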

Understanding Boxplots

A vertical boxplot combined with jittered points showing weekly earnings distribution. The x-axis is labeled 'x' and the y-axis is labeled weekly_earnings, ranging from 0 to 3000. The box spans approximately from 500 to 1250, with a median line near 750. The lower whisker extends from the lower end of the box down to 0, and the upper whisker extends from the upper end of the box up to about 2500. Several outliers appear above the upper whisker, ranging from about 2500 to 3000. The boxplot is also overlaid with pink jittered points representing individual data values, scattered around the boxplot, with most points concentrated between 0 and 1500 and fewer points at higher earnings.

Figure 9: Boxplot overlaid with individual observation points
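The slides do not show the code for this figure. One possible way to draw something similar is to add a geom_jitter() layer on top of the boxplot, as sketched below. The jitter width, the point color, and the object name p_box_jitter are guesses for illustration, not values taken from the original figure.

```r
library(ggplot2)
library(hellodatascience)  # assumed installed; provides atus_college

data(atus_college)

# A constant x value places the box and the points in a single column
p_box_jitter <- ggplot(data = atus_college, aes(x = "", y = weekly_earnings)) +
  geom_boxplot() +
  # height = 0 spreads points sideways only, so earnings values are not altered
  geom_jitter(width = 0.2, height = 0, color = "pink")

p_box_jitter
```

Rows with missing weekly_earnings are dropped with a warning when the plot is printed. Note that geom_boxplot() still draws its own outlier points, so the most extreme observations appear twice in this sketch.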

Visualizing Two Variables

Stacked Bar Plot

ggplot(
  data = atus_college, 
  aes(
    x = employment,
    fill = enrollment 
  )
) +
  geom_bar()

A stacked bar chart showing counts of enrollment status within employment categories. The x-axis is labeled employment with three categories: Full Time, Part Time, and NA. The y-axis is labeled 'count' and ranges from 0 to 150. Each bar is divided into two segments: red for Full Time enrollment and teal for Part Time enrollment. For Full Time employment, the total height is about 140, with roughly 60 teal and 80 red. For Part Time employment, the total is about 85, mostly red with a small teal segment near 10. For NA, the total is about 100, with a larger red segment around 80 and teal around 20. A legend on the right identifies colors for enrollment status: red for Full Time enrollment and teal for Part Time enrollment.

Figure 10: Stacked barplot of employment and enrollment status

Standardized stacked barplot

ggplot(
  data = atus_college, 
  aes(
    x = employment,
    fill = enrollment
  )
) +
  geom_bar(position = "fill")

A stacked bar chart showing the proportion of enrollment status within employment categories. The x-axis is labeled 'employment' with three categories: Full Time, Part Time, and NA. The y-axis is labeled 'count' and ranges from 0 to 1, representing proportions. Each bar is divided into two segments: red for Full Time enrollment and teal for Part Time enrollment. For Full Time employment, the bar is about 45% teal and 55% red. For Part Time employment, the bar is mostly red (around 90%) with a small teal segment (about 10%). For NA, the bar is about 80% red and 20% teal. A legend on the right identifies colors for enrollment status: red for Full Time enrollment and teal for Part Time enrollment.

Figure 11: Standardized stacked barplot of employment and enrollment
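With position = "fill", each bar is rescaled so that its segments add up to 1. The sketch below computes the same within-bar proportions as a table using dplyr, which can be handy for reading off exact values. The object name enrollment_props is made up for this example, and the hellodatascience package is assumed to be installed.

```r
library(dplyr)
library(hellodatascience)  # assumed installed; provides atus_college

data(atus_college)

enrollment_props <- atus_college %>%
  count(employment, enrollment) %>%   # the counts behind the stacked bars
  group_by(employment) %>%
  mutate(prop = n / sum(n))           # share of each segment within its bar

enrollment_props
```

Within each employment category the prop values sum to 1, matching the full height of each bar in the standardized plot.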

Dodged barplot

ggplot(
  data = atus_college, 
  aes(
    x = employment,
    fill = enrollment
  )
) +
  geom_bar(position = "dodge")

A grouped side-by-side bar plot showing counts of employment categories broken into enrollment status. The x-axis is labeled 'employment' with three categories: Full Time, Part Time, and NA. The y-axis is labeled 'count' and ranges from 0 to 90. Each employment category has two bars: red for Full Time enrollment and teal for Part Time enrollment. For Full Time employment, the red bar is about 70 and the teal bar about 60. For Part Time employment, the red bar is about 75 and the teal bar about 10. For NA, the red bar is about 85 and the teal bar about 15. A legend on the right identifies colors for enrollment status: red for Full Time enrollment and teal for Part Time enrollment.

Figure 12: Side-by-side (dodge) barplot of employment and enrollment status

Scatterplot

ggplot(
  data = atus_college,
  aes(
    x = time_alone, 
    y = weekly_earnings
  )
) +
  geom_point()

Side-by-side Boxplot

ggplot(
  data = atus_college, 
  aes(
    x = employment,
    y = weekly_earnings
  )
) +
  geom_boxplot()

Visualizing More Than Two Variables

Use of color

ggplot(
  data = atus_college,
  aes(
    x = time_alone,
    y = weekly_earnings,
    color = employment 
  )
) +
  geom_point()

A scatterplot showing the relationship between time alone and weekly earnings, with points colored by employment category. The x-axis is labeled time_alone and ranges from 0 to 1000. The y-axis is labeled weekly_earnings and ranges from 0 to 3000. Each point represents an observation, with red for Full Time, teal for Part Time, and gray for NA. Most points are concentrated in the lower left quadrant, where time alone is below 500 and weekly earnings are below 1500. Full Time points (red) dominate across the range, including higher earnings up to 3000. Part Time points (teal) cluster mostly at lower earnings below 1000. NA points (gray) are sparse. A legend on the right identifies colors for employment categories: red for Full Time, teal for Part Time, and gray for NA.

Figure 15: Grouping points by color based on employment status

If you are color-blind, you may not be able to distinguish these colors, depending on the type of color blindness, so the next slide will make much more sense.

Use of shape

ggplot(
  data = atus_college,
  aes(
    x = time_alone,
    y = weekly_earnings,
    shape = employment
  ) 
) +
  geom_point()

A scatterplot showing the relationship between time alone and weekly earnings, with point shapes indicating employment category. The x-axis is labeled time_alone and ranges from 0 to 1000. The y-axis is labeled weekly_earnings and ranges from 0 to 3000. Each point represents an observation: circles for Full Time, triangles for Part Time, and no shape for NA because there are no points with income and NA status for employment. Most points cluster in the lower left quadrant, where time alone is below 500 and weekly earnings are below 1500. Full Time points (circles) are spread across the range, including higher earnings up to 3000. Part Time points (triangles) are concentrated at lower earnings below 1000. There are no NA points. A legend on the right identifies shapes for employment categories: circles for Full Time, triangles for Part Time, and no shape for NA.

Figure 16: Grouping points by shape based on employment status

Use of color and shape

ggplot(
  data = atus_college,
  aes(
    x = time_alone,
    y = weekly_earnings,
    color = employment,
    shape = employment
  )
) +
  geom_point()

Use of size

ggplot(
  data = atus_college,
  aes(
    x = time_alone,
    y = weekly_earnings,
    color = employment,
    size = work_time
  ) 
) +
  geom_point()

A scatterplot showing the relationship between time alone and weekly earnings, with points varying in color and size to represent employment category and work time. The x-axis is labeled time_alone and ranges from 0 to 1000. The y-axis is labeled weekly_earnings and ranges from 0 to 3000. Each point represents an observation: red circles for Full Time, teal circles for Part Time, and gray circles for NA. Point size indicates work time, with larger circles representing more work time (up to 800) and smaller circles representing less. Most points cluster in the lower left quadrant, where time alone is below 500 and weekly earnings are below 1500. Full Time points (red) dominate across the range, including higher earnings up to 3000, while Part Time points (teal) are concentrated at lower earnings below 1000. A legend on the right shows color for employment and size scale for work time.

Figure 18: Adding a numerical variable to differentiate points based on size

Practice

Using the penguins data frame, ask a question that you are interested in answering. Visualize the data to get a visual answer to the question. What is the visual telling you? Note all of this down in your lecture notes.

Learning Tip of the Day

Why does peer instruction benefit student learning?
