Labelr - Special Topics

Larger Data Frames

labelr is not intended for “large” data.frames, though “large” is a fuzzy concept. To give a sense of what labelr can handle, let’s see it in action with the NYC Flights 2013 data set: a moderate-not-big data.frame of ~340K rows.

Let’s load labelr and the nycflights13 package.

opening_ding <- Sys.time() # to time labelr

library(labelr)
library(nycflights13)
### > Warning: package 'nycflights13' was built under R version 4.3.2

We’ll assign the data.frame to one we call df.

df <- flights

nrow(df)
### > [1] 336776

We’ll add a “frame label,” which describes the data.frame overall.

df <- add_frame_lab(df, frame.lab = "On-time data for all flights that
                    departed NYC (i.e. JFK, LGA or EWR) in 2013.")
### > Warning in as_base_data_frame(data): 
### > data argument object coerced from augmented to conventional (Base R) data.frame.

Note that the source data.frame (nycflights13::flights) is a tibble. The labelr package coerces augmented data.frames, such as tibbles and data.tables, into “pure” Base R data.frames – and alerts you that it has done so. The intent is to avoid the dependencies, errors, or inconsistent and unpredictable behaviors that might result from labelr trying to integrate with or make sense of these or other competing, alternative data.frame constructs, which (a) by design behave differently from standard R data.frames in various subtle or not-so-subtle ways and which (b) may continue to evolve in the future.
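
To make “augmented” concrete: a tibble is a data.frame that carries extra S3 classes in front of "data.frame". The base R sketch below uses a hand-built stand-in (so that no extra packages are needed) to show the general idea of stripping those classes. The object fake_tbl and the helper is_augmented() are made up for illustration; this is not a description of labelr's own coercion code.

```r
# A hand-made stand-in for an "augmented" data.frame: extra S3 classes are
# layered in front of "data.frame" (this mimics a tibble's class vector; the
# tibble package itself is neither needed nor used here).
fake_tbl <- structure(
  data.frame(x = 1:3, y = c("a", "b", "c")),
  class = c("tbl_df", "tbl", "data.frame")
)

class(fake_tbl)

# as.data.frame() drops the classes that sit in front of "data.frame"
plain_df <- as.data.frame(fake_tbl)
class(plain_df)

# hypothetical helper: is this a data.frame with extra classes attached?
is_augmented <- function(d) {
  is.data.frame(d) && !identical(class(d), "data.frame")
}

is_augmented(fake_tbl)
is_augmented(plain_df)
```

The point is only that the columns survive the coercion while the extra class-specific behaviors do not.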

Let’s see what this did.

attr(df, "frame.lab") # check for attribute
### > [1] "On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013."

get_frame_lab(df) # return frame.lab alongside data.frame name as a data.frame
### >   data.frame
### > 1         df
### >                                                                        frame.lab
### > 1 On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

get_frame_lab(df)$frame.lab
### > [1] "On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013."
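
Since the first check above reads the label straight from attr(), it may help to see the underlying base R mechanism in isolation. The sketch below attaches a free-form attribute to a toy data.frame by hand; the object toy and its label text are invented, and nothing here is meant to reproduce what add_frame_lab() does beyond storing a string as an attribute.

```r
toy <- data.frame(id = 1:3, score = c(90, 85, 77))

# attach free-form meta-data as an attribute (the attribute name mirrors the
# "frame.lab" attribute checked above; the label text is made up)
attr(toy, "frame.lab") <- "Toy scores for three test takers."

attr(toy, "frame.lab")  # retrieve it
names(attributes(toy))  # it sits alongside names, class, and row.names

dim(toy)                # the data themselves are untouched
```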

Now, let’s assign variable NAME labels.

names_labs_vec <- c(
  "year" = "Year of departure",
  "month" = "Month of departure",
  "day" = "Day of departure",
  "dep_time" = "Actual departure time (format HHMM or HMM), local tz",
  "arr_time" = "Actual arrival time (format HHMM or HMM), local tz",
  "sched_dep_time" = "Scheduled departure times (format HHMM or HMM)",
  "sched_arr_time" = "Scheduled arrival time (format HHMM or HMM)",
  "dep_delay" = "Departure delays, in minutes",
  "arr_delay" = "Arrival delays, in minutes",
  "carrier" = "Two letter airline carrier abbreviation",
  "flight" = "Flight number",
  "tailnum" = "Plane tail number",
  "origin" = "Flight origin airport code",
  "dest" = "Flight destination airport code",
  "air_time" = "Minutes spent in the air",
  "distance" = "Miles between airports",
  "hour" = "Hour of scheduled departure time",
  "minute" = "Minutes component of scheduled departure time",
  "time_hour" = "Scheduled date and hour of the flight as a POSIXct date"
)

df <- add_name_labs(df, name.labs = names_labs_vec)

get_name_labs(df) # show that they've been added
### >               var                                                     lab
### > 1            year                                       Year of departure
### > 2           month                                      Month of departure
### > 3             day                                        Day of departure
### > 4        dep_time    Actual departure time (format HHMM or HMM), local tz
### > 5  sched_dep_time          Scheduled departure times (format HHMM or HMM)
### > 6       dep_delay                            Departure delays, in minutes
### > 7        arr_time      Actual arrival time (format HHMM or HMM), local tz
### > 8  sched_arr_time             Scheduled arrival time (format HHMM or HMM)
### > 9       arr_delay                              Arrival delays, in minutes
### > 10        carrier                 Two letter airline carrier abbreviation
### > 11         flight                                           Flight number
### > 12        tailnum                                       Plane tail number
### > 13         origin                              Flight origin airport code
### > 14           dest                         Flight destination airport code
### > 15       air_time                                Minutes spent in the air
### > 16       distance                                  Miles between airports
### > 17           hour                        Hour of scheduled departure time
### > 18         minute           Minutes component of scheduled departure time
### > 19      time_hour Scheduled date and hour of the flight as a POSIXct date

Let’s add variable VALUE labels for variable “carrier.” Helpfully, a mapping of airlines’ carrier codes to their full names ships with the nycflights13 package itself.

airlines <- nycflights13::airlines

head(airlines)
### > # A tibble: 6 × 2
### >   carrier name                    
### >   <chr>   <chr>                   
### > 1 9E      Endeavor Air Inc.       
### > 2 AA      American Airlines Inc.  
### > 3 AS      Alaska Airlines Inc.    
### > 4 B6      JetBlue Airways         
### > 5 DL      Delta Air Lines Inc.    
### > 6 EV      ExpressJet Airlines Inc.

The carrier field of airlines matches the carrier column of df (formerly, flights).

ny_val <- airlines$carrier

The name field of airlines gives us the full airline names.

ny_lab <- airlines$name

Let’s use these vectors to add value labels to df. We’ll demo add_val1(), which accepts only one variable but allows you to pass its name unquoted.

df <- add_val1(df,
  var = carrier, vals = ny_val,
  labs = ny_lab,
  max.unique.vals = 20
)
### > Warning in add_val1(df, var = carrier, vals = ny_val, labs = ny_lab, max.unique.vals = 20): 
### > 
### > Note: labelr is not optimized for data.frames this large.

(Side note on warnings: The package issues the first in what will become a series of potentially annoying warnings that you are applying value labels to a larger data.frame than labelr was built to handle. There is a reason that this is a warning, not an error: labelr will work on larger data.frames until it doesn’t, which is to say that, as the data grow, the computational burden becomes an increasing drag on speed and on R’s in-session memory. In the present case, labelr handles the data.frame just fine, but things take a little longer, and labelr seizes most opportunities to remind you that you’re making it work overtime.)

Okay, back to the value-labeling. Our data.frame also has a month variable, expressed in integer terms (e.g., 1 indicates January, 9 indicates September). We will “hand-jam” month value labels, using add_val_labs(). This command is equivalent to add_val1(), except that it requires variable names to be quoted but allows you to supply more than one of them at a time (i.e., you can supply a character vector of variable names). In this case, we’ll use it on just one variable.

First, we’ll create our vectors of unique values and labels.

ny_month_vals <- c(1:12) # values
ny_month_labs <- c(
  "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
) # labels

Note that order is important here: We need to supply exactly as many values as value labels, with each value label being uniquely associated with the value that shares its index. For example, in the above case, ny_month_vals[3] (here, 3) is associated with ny_month_labs[3] (here, "MAR").
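
The index-sharing rule can be sketched in base R with match(), which finds each value's position in the values vector so that the label at the same position can be picked up. This is only an illustration of the pairing logic, not labelr's implementation; some_months is a made-up input.

```r
ny_month_vals <- 1:12
ny_month_labs <- c(
  "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
)

# the two vectors must be the same length for the pairing to make sense
stopifnot(length(ny_month_vals) == length(ny_month_labs))

# made-up month values to look up
some_months <- c(3, 12, 1, 3)

# position of each value in ny_month_vals, then the label at that position
looked_up <- ny_month_labs[match(some_months, ny_month_vals)]
looked_up
```

If the labels were supplied in a different order than the values, the same lookup would silently return the wrong labels, which is why the ordering matters.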

Now, let’s use these two vectors to add value labels for the variable “month”.

df <- add_val_labs(df,
  vars = "month",
  vals = ny_month_vals,
  labs = ny_month_labs,
  max.unique.vals = 20
)
### > Warning in add_val_labs(df, vars = "month", vals = ny_month_vals, labs = ny_month_labs, : 
### > 
### > Note: labelr is not optimized for data.frames this large.

Finally, we’ll use add_quant_labs() to provide numerical range value labels for five quintiles of the variable “dep_time.”

df <- add_quant_labs(df, "dep_time", qtiles = 5)
### > Warning in add_quant_labs(df, "dep_time", qtiles = 5): 
### > 
### > Note: labelr is not optimized for data.frames this large.
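
As a rough intuition for quantile-based labels, the base R sketch below computes quintile cutpoints for a toy vector and assigns each value the label of the first cutpoint it does not exceed. The label format (q020, q040, ...) follows the labels that add_quant_labs() produces for dep_time, but the code is an assumption-laden illustration, not labelr's actual algorithm; x, cutpoints, and q_labs are made-up names.

```r
x <- 1:10 # toy data

# quintile probabilities and the corresponding cutpoints
probs <- seq(0.2, 1, by = 0.2)
cutpoints <- quantile(x, probs = probs, names = FALSE)

# labels of the form q020, q040, ..., q100
q_labs <- sprintf("q%03d", as.integer(round(probs * 100)))

# each value gets the label of the first cutpoint that it is <= to
x_lab <- cut(x, breaks = c(-Inf, cutpoints), labels = q_labs, right = TRUE)

data.frame(x = x, x_lab = as.character(x_lab))
```

With ten evenly spaced values, each quintile label ends up covering two values.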

Let’s see where these value-labeling operations have left us.

get_val_labs(df)
### >         var vals                        labs
### > 1     month    1                         JAN
### > 2     month    2                         FEB
### > 3     month    3                         MAR
### > 4     month    4                         APR
### > 5     month    5                         MAY
### > 6     month    6                         JUN
### > 7     month    7                         JUL
### > 8     month    8                         AUG
### > 9     month    9                         SEP
### > 10    month   10                         OCT
### > 11    month   11                         NOV
### > 12    month   12                         DEC
### > 13    month   NA                          NA
### > 14 dep_time  827                        q020
### > 15 dep_time 1200                        q040
### > 16 dep_time 1536                        q060
### > 17 dep_time 1830                        q080
### > 18 dep_time 2400                        q100
### > 19 dep_time   NA                          NA
### > 20  carrier   9E           Endeavor Air Inc.
### > 21  carrier   AA      American Airlines Inc.
### > 22  carrier   AS        Alaska Airlines Inc.
### > 23  carrier   B6             JetBlue Airways
### > 24  carrier   DL        Delta Air Lines Inc.
### > 25  carrier   EV    ExpressJet Airlines Inc.
### > 26  carrier   F9      Frontier Airlines Inc.
### > 27  carrier   FL AirTran Airways Corporation
### > 28  carrier   HA      Hawaiian Airlines Inc.
### > 29  carrier   MQ                   Envoy Air
### > 30  carrier   OO       SkyWest Airlines Inc.
### > 31  carrier   UA       United Air Lines Inc.
### > 32  carrier   US             US Airways Inc.
### > 33  carrier   VX              Virgin America
### > 34  carrier   WN      Southwest Airlines Co.
### > 35  carrier   YV          Mesa Airlines Inc.
### > 36  carrier   NA                          NA

We can use head() to get a baseline look at select rows and variables.

head(df[c("origin", "dep_time", "dest", "year", "month", "carrier")])
### >   origin dep_time dest year month carrier
### > 1    EWR      517  IAH 2013     1      UA
### > 2    LGA      533  IAH 2013     1      UA
### > 3    JFK      542  MIA 2013     1      AA
### > 4    JFK      544  BQN 2013     1      B6
### > 5    LGA      554  ATL 2013     1      DL
### > 6    EWR      554  ORD 2013     1      UA

Now, let’s do the same for a version of df that we’ve modified with use_val_labs(), which converts all values of value-labeled variables to their corresponding labels.

df_swapd <- use_val_labs(df)
### > Warning in use_val_labs(df): 
### > Note: labelr is not optimized for data.frames this large.

head(df_swapd[c("origin", "dep_time", "dest", "year", "month", "carrier")])
### >   origin dep_time dest year month                carrier
### > 1    EWR     q020  IAH 2013   JAN  United Air Lines Inc.
### > 2    LGA     q020  IAH 2013   JAN  United Air Lines Inc.
### > 3    JFK     q020  MIA 2013   JAN American Airlines Inc.
### > 4    JFK     q020  BQN 2013   JAN        JetBlue Airways
### > 5    LGA     q020  ATL 2013   JAN   Delta Air Lines Inc.
### > 6    EWR     q020  ORD 2013   JAN  United Air Lines Inc.

Instead of replacing values using use_val_labs() – something we can’t directly undo – it might be safer to simply add “value-labels-on” character variables to the data.frame, while preserving the parent variables. This adds roughly 1M new cells to our df (three new columns × ~337K rows!), but let’s throw caution to the wind with add_lab_cols().

df_plus <- add_lab_cols(df, vars = c("carrier", "month", "dep_time"))
### > Warning in add_lab_cols(df, vars = c("carrier", "month", "dep_time")): 
### > 
### > Note: labelr is not optimized for data.frames this large.

head(df_plus[c(
  "origin", "dest", "year",
  "month", "month_lab",
  "dep_time", "dep_time_lab",
  "carrier", "carrier_lab"
)])
### >   origin dest year month month_lab dep_time dep_time_lab carrier
### > 1    EWR  IAH 2013     1       JAN      517         q020      UA
### > 2    LGA  IAH 2013     1       JAN      533         q020      UA
### > 3    JFK  MIA 2013     1       JAN      542         q020      AA
### > 4    JFK  BQN 2013     1       JAN      544         q020      B6
### > 5    LGA  ATL 2013     1       JAN      554         q020      DL
### > 6    EWR  ORD 2013     1       JAN      554         q020      UA
### >              carrier_lab
### > 1  United Air Lines Inc.
### > 2  United Air Lines Inc.
### > 3 American Airlines Inc.
### > 4        JetBlue Airways
### > 5   Delta Air Lines Inc.
### > 6  United Air Lines Inc.
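
The “keep the parent, add a labeled sibling” idea can be sketched in base R with a named lookup vector. The toy data and the names flights_toy and carrier_key are invented for illustration, and the "_lab" suffix simply mirrors the column names shown above; this is not how add_lab_cols() is implemented.

```r
# toy stand-in for a few rows of flights
flights_toy <- data.frame(
  carrier = c("UA", "AA", "B6", "UA"),
  arr_delay = c(11, 33, -18, 12),
  stringsAsFactors = FALSE
)

# hypothetical lookup: names are the values, elements are the labels
carrier_key <- c(
  UA = "United Air Lines Inc.",
  AA = "American Airlines Inc.",
  B6 = "JetBlue Airways"
)

# add a labels-on column next to the untouched parent column
flights_toy$carrier_lab <- unname(carrier_key[flights_toy$carrier])

flights_toy
```

Because the original carrier column is left alone, nothing needs to be undone if the labels later turn out to be wrong.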

We can use flab() to filter df on a value label (here, carrier’s), combined with an ordinary condition on arr_delay, even when value labels are “invisible” (i.e., existing only as attributes() meta-data).

# labels are not visible (they exist only as attributes() meta-data)
head(df[c("carrier", "arr_delay")])
### >   carrier arr_delay
### > 1      UA        11
### > 2      UA        20
### > 3      AA        33
### > 4      B6       -18
### > 5      DL       -25
### > 6      UA        12

# we still can use them to filter (note: we're filtering on "JetBlue Airways",
# ...NOT its obscure code "B6")
df_fl <- flab(df, carrier == "JetBlue Airways" & arr_delay > 20)
### > Warning in use_val_labs(data): 
### > Note: labelr is not optimized for data.frames this large.

# here's what's returned when we filtered on "JetBlue Airways" using flab()
head(df_fl[c("carrier", "arr_delay")])
### >     carrier arr_delay
### > 70       B6        44
### > 129      B6        24
### > 174      B6        40
### > 203      B6        42
### > 292      B6        29
### > 314      B6        38

# double-check that this is JetBlue
head(use_val_labs(df_fl)[c("carrier", "arr_delay")])
### >             carrier arr_delay
### > 70  JetBlue Airways        44
### > 129 JetBlue Airways        24
### > 174 JetBlue Airways        40
### > 203 JetBlue Airways        42
### > 292 JetBlue Airways        29
### > 314 JetBlue Airways        38
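
Conceptually, filtering on labels while returning the original values amounts to building the row index from a labels-on copy of the column and then applying that index to the untouched data. The base R sketch below shows that idea on invented toy data (flights_toy, carrier_key, and jetblue_late are made-up names); it is an illustration of the logic, not flab()'s source code.

```r
flights_toy <- data.frame(
  carrier = c("UA", "B6", "B6", "DL"),
  arr_delay = c(11, 44, -18, 25),
  stringsAsFactors = FALSE
)

# hypothetical value-to-label lookup
carrier_key <- c(
  UA = "United Air Lines Inc.",
  B6 = "JetBlue Airways",
  DL = "Delta Air Lines Inc."
)

# temporary labels-on version of the column, used only to build the filter
carrier_as_lab <- unname(carrier_key[flights_toy$carrier])
keep <- carrier_as_lab == "JetBlue Airways" & flights_toy$arr_delay > 20

# the filter is applied to the original data, so the codes are what come back
jetblue_late <- flights_toy[keep, ]
jetblue_late
```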

How long did this entire NYC Flights session take (results will vary)?

the_buzzer <- Sys.time()
the_buzzer - opening_ding
### > Time difference of 1.274677 mins

NA and “Irregular” Values

labelr is not a fan of NA values or other “irregular” values, which are defined as infinite values, not-a-number values, and character values that look like them (e.g., “NAN”, “INF”, “inf”, “Na”).

When value-labeling a column / variable, such values are automatically given the catch-all label “NA” (which will be converted to an actual NA in any columns created by add_lab_cols() or use_val_labs()). You do not need (and should not try) to specify this yourself, and you should not try to over-ride labelr on this. If you want to use labelr and your data include these sorts of values, your options are to accept the default “NA” label or convert these sorts of values to something else before labeling.
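
If you would rather find such values before labeling, a base R check along the following lines may help. The helper is_irregular() is hypothetical, and the set of character strings it flags is an assumption based on the examples given above; labelr's own rule may differ in its details.

```r
# hypothetical helper: flag NA, NaN, Inf/-Inf, and character look-alikes
is_irregular <- function(x) {
  if (is.numeric(x)) {
    return(!is.finite(x)) # catches NA, NaN, Inf, and -Inf in one step
  }
  x_chr <- toupper(trimws(as.character(x)))
  # assumed set of "look-alike" strings, compared case-insensitively
  is.na(x_chr) | x_chr %in% c("NA", "NAN", "INF", "-INF")
}

is_irregular(c(1, NA, Inf, -Inf, NaN))
is_irregular(c("A", "NA", "Inf", "nan", NA))
```

A call such as sapply(your_df, function(col) sum(is_irregular(col))) would then give a per-column count to review before deciding whether to convert anything.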

With that said, let’s see how labelr handles this, with an assist from our old friend mtcars (packaged with R’s base distribution).

First, let’s assign mtcars to a new data.frame object that we will besmirch.

mtbad <- mtcars

Let’s get on with the besmirching.

mtbad[1, 1:11] <- NA
rownames(mtbad)[1] <- "Missing Car"
mtbad[2, "am"] <- Inf
mtbad[3, "gear"] <- -Inf
mtbad[5, "carb"] <- NaN
mtbad[2, "mpg"] <- Inf
mtbad[3, "mpg"] <- NaN

# add a character variable, for demonstration purposes
# if it makes you feel better, you can pretend these are Consumer Reports or
# ...JD Power ratings or something
set.seed(9202) # for reproducibility
mtbad$grade <- sample(c("A", "B", "C"), nrow(mtbad), replace = TRUE)
mtbad[4, "grade"] <- NA
mtbad[5, "grade"] <- "NA"
mtbad[6, "grade"] <- "Inf"

# see where this leaves us
head(mtbad)
### >                    mpg cyl disp  hp drat    wt  qsec vs  am gear carb grade
### > Missing Car         NA  NA   NA  NA   NA    NA    NA NA  NA   NA   NA     B
### > Mazda RX4 Wag      Inf   6  160 110 3.90 2.875 17.02  0 Inf    4    4     C
### > Datsun 710         NaN   4  108  93 3.85 2.320 18.61  1   1 -Inf    1     C
### > Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1   0    3    1  <NA>
### > Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0   0    3  NaN    NA
### > Valiant           18.1   6  225 105 2.76 3.460 20.22  1   0    3    1   Inf

sapply(mtbad, class)
### >         mpg         cyl        disp          hp        drat          wt 
### >   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
### >        qsec          vs          am        gear        carb       grade 
### >   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" "character"

Now, let’s add value labels to this unruly data.frame.

mtlabs <- mtbad |>
  add_val1(grade,
    vals = c("A", "B", "C"),
    labs = c("Gold", "Silver", "Bronze")
  ) |>
  add_val1(am,
    vals = c(0, 1),
    labs = c("auto", "stick")
  ) |>
  add_val1(carb,
    vals = c(1, 2, 3, 4, 6, 8), # not the most inspired use of labels
    labs = c(
      "1c", "2c", "3c",
      "4c", "6c", "8c"
    )
  ) |>
  add_val1(gear,
    vals = 3:5, # again, not the most compelling use case
    labs = c(
      "3-speed",
      "4-speed",
      "5-speed"
    )
  ) |>
  add_quant1(mpg, qtiles = 4) # add quartile-based value labels

get_val_labs(mtlabs, "am") # NA values were detected and dealt with
### >   var vals  labs
### > 6  am    0  auto
### > 7  am    1 stick
### > 8  am   NA    NA

Let’s streamline the data.frame with sselect() to make it more manageable.

mtless <- sselect(mtlabs, mpg, cyl, am, gear, carb, grade) # safely select

head(mtless, 5) # note that the irregular values are still here
### >                    mpg cyl  am gear carb grade
### > Missing Car         NA  NA  NA   NA   NA     B
### > Mazda RX4 Wag      Inf   6 Inf    4    4     C
### > Datsun 710         NaN   4   1 -Inf    1     C
### > Hornet 4 Drive    21.4   6   0    3    1  <NA>
### > Hornet Sportabout 18.7   8   0    3  NaN    NA

Notice how all irregular values are coerced to NA when we substitute labels for values with use_val_labs().

head(use_val_labs(mtless), 5) # but they all go to NA if we `use_val_labs`
### >                    mpg cyl    am    gear carb  grade
### > Missing Car       <NA>  NA  <NA>    <NA> <NA> Silver
### > Mazda RX4 Wag     <NA>   6  <NA> 4-speed   4c Bronze
### > Datsun 710        <NA>   4 stick    <NA>   1c Bronze
### > Hornet 4 Drive    q075   6  auto 3-speed   1c   <NA>
### > Hornet Sportabout q050   8  auto 3-speed <NA>   <NA>

Now, let’s try an add_lab_cols() view.

mtlabs_plus <- add_lab_cols(mtlabs, c("mpg", "am")) # adds "mpg_lab", "am_lab" cols
mtlabs_plus <- sselect(mtlabs_plus, mpg, mpg_lab, am, am_lab) # select cols

head(mtlabs_plus) # where we landed
### >                    mpg mpg_lab  am am_lab
### > Missing Car         NA    <NA>  NA   <NA>
### > Mazda RX4 Wag      Inf    <NA> Inf   <NA>
### > Datsun 710         NaN    <NA>   1  stick
### > Hornet 4 Drive    21.4    q075   0   auto
### > Hornet Sportabout 18.7    q050   0   auto
### > Valiant           18.1    q050   0   auto

What if we had tried to explicitly label the NA values and/or irregular values themselves? We would have failed.

# Trying to Label an Irregular Value (-Inf)
mtbad <- add_val1(
  data = mtcars,
  var = gear,
  vals = -Inf,
  labs = c("neg.inf")
)
### > Error in add_val1(data = mtcars, var = gear, vals = -Inf, labs = c("neg.inf")): 
### > Cannot supply NA, NaN, Inf, or character variants as a val or lab arg.
### > These are handled automatically.

# Trying to Label an Irregular Value (NA)
mtbad <- add_val_labs(
  data = mtbad,
  vars = "grade",
  vals = NA,
  labs = c("miss")
)
### > Error in add_val_labs(data = mtbad, vars = "grade", vals = NA, labs = c("miss")): 
### > Cannot supply NA, NaN, Inf, or character variants as a val or lab arg.
### > These are handled automatically.

# Trying to Label an Irregular Value (NaN)
mtbad <- add_val_labs(
  data = mtbad,
  vars = "carb",
  vals = NaN,
  labs = c("nan-v")
)
### > Error in add_val_labs(data = mtbad, vars = "carb", vals = NaN, labs = c("nan-v")): 
### > Cannot supply NA, NaN, Inf, or character variants as a val or lab arg.
### > These are handled automatically.

# labelr also treats "character variants" of irregular values as irregular values.
mtbad <- add_val1(
  data = mtbad,
  var = carb,
  vals = "NAN",
  labs = c("nan-v")
)
### > Error in add_val1(data = mtbad, var = carb, vals = "NAN", labs = c("nan-v")): 
### > Cannot supply NA, NaN, Inf, or character variants as a val or lab arg.
### > These are handled automatically.

Again, labelr handles NA and irregular values and resists our efforts to take such matters into our own hands.

Factors and Value Labels

R’s concept of a factor variable shares some affinities with the concept of a value-labeled variable and can be viewed as one approach to value labeling. However, factors can manifest idiosyncratic and surprising behaviors depending on the function to which you’re trying to apply them. They are character-like, but they are not character values. They are built on top of integers, but they won’t submit to all of the operations that integers do. They do some very handy things in certain model-fitting applications, but their behavior “under the hood” can be counter-intuitive or opaque. Simply put, they are their own thing.
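
A few lines of base R illustrate the “their own thing” point: a factor stores integer codes plus a set of levels, so converting it to a number returns the codes rather than the values you see printed. The vector f is a made-up example.

```r
f <- factor(c("10", "20", "10", "30"))

typeof(f)       # "integer": the storage is integer codes
is.character(f) # FALSE, despite how it prints
levels(f)       # the distinct values, held as character strings

as.integer(f)               # the codes (1 2 1 3), not the printed values
as.numeric(as.character(f)) # the usual way to recover 10 20 10 30
```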

So, while factors have their purposes, it would be nice to associate value labels with the distinct values of data.frame variables in a manner that preserves the integrity and transparency of the underlying values (factors tend to be a bit opaque about this) and that allows you to view or use the labels in flexible ways.

And if you wanted to work with a factor, it would be nice if you could add value labels to it without it ceasing to exist and behave as a factor.

Adding Labels to a Factor

With that said, let’s see if we can have our label-factor cake and eat it, too, using the iris data.frame that comes pre-packaged with R.

unique(iris$Species)
### > [1] setosa     versicolor virginica 
### > Levels: setosa versicolor virginica

sapply(iris, class) # nothing up our sleeve -- "Species" is a factor
### > Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
### >    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

Let’s add value labels to “Species” and assign the result to a new data.frame that we’ll call irlab. For our value labels, we’ll use “se”, “ve”, and “vi”, which are not adding much new information, but they will help to illustrate what we can do with labelr and a factor variable.

irlab <- add_val_labs(iris,
  vars = "Species",
  vals = c("setosa", "versicolor", "virginica"),
  labs = c("se", "ve", "vi")
)

# this also would've worked
# irlab_dos <- add_val1(iris, Species,
#   vals = c("setosa", "versicolor", "virginica"),
#   labs = c("se", "ve", "vi")
# )

Note that we could have just as (or even more) easily used add_val1(), which works for a single variable at a time and allows us to avoid quoting our column name, if that matters to us. In contrast, add_val_labs() requires us to put our variable name(s) in quotes, but it also gives us the option to apply a common value-label scheme to several variables at once (e.g., Likert-style survey responses). We’ll see an example of this type of use case in action in a little bit.

For now, though, let’s prove that the iris and irlab data.frames are functionally identical.

First, note that irlab looks and acts just like iris in the usual ways that matter.

summary(iris)
### >   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
### >  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
### >  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
### >  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
### >  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
### >  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
### >  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
### >        Species  
### >  setosa    :50  
### >  versicolor:50  
### >  virginica :50  
### >                 
### >                 
### > 

summary(irlab)
### >   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
### >  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
### >  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
### >  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
### >  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
### >  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
### >  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
### >        Species  
### >  setosa    :50  
### >  versicolor:50  
### >  virginica :50  
### >                 
### >                 
### > 

head(iris, 4)
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
### > 1          5.1         3.5          1.4         0.2  setosa
### > 2          4.9         3.0          1.4         0.2  setosa
### > 3          4.7         3.2          1.3         0.2  setosa
### > 4          4.6         3.1          1.5         0.2  setosa

head(irlab, 4)
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
### > 1          5.1         3.5          1.4         0.2  setosa
### > 2          4.9         3.0          1.4         0.2  setosa
### > 3          4.7         3.2          1.3         0.2  setosa
### > 4          4.6         3.1          1.5         0.2  setosa

lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
### > 
### > Call:
### > lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
### > 
### > Coefficients:
### >       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica  
### >            2.2514             0.8036             1.4587             1.9468

lm(Sepal.Length ~ Sepal.Width + Species, data = irlab) # values are same
### > 
### > Call:
### > lm(formula = Sepal.Length ~ Sepal.Width + Species, data = irlab)
### > 
### > Coefficients:
### >       (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica  
### >            2.2514             0.8036             1.4587             1.9468

Note also that irlab’s “Species” is still a factor, just like its iris counterpart/parent.

sapply(irlab, class)
### > Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
### >    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

levels(irlab$Species)
### > [1] "setosa"     "versicolor" "virginica"

But irlab’s “Species” has value labels!

get_val_labs(irlab, "Species")
### >       var       vals labs
### > 1 Species     setosa   se
### > 2 Species versicolor   ve
### > 3 Species  virginica   vi
### > 4 Species         NA   NA

And they work.

head(use_val_labs(irlab))
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
### > 1          5.1         3.5          1.4         0.2      se
### > 2          4.9         3.0          1.4         0.2      se
### > 3          4.7         3.2          1.3         0.2      se
### > 4          4.6         3.1          1.5         0.2      se
### > 5          5.0         3.6          1.4         0.2      se
### > 6          5.4         3.9          1.7         0.4      se

ir_v <- flab(irlab, Species == "vi")
head(ir_v, 5)
### >     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
### > 101          6.3         3.3          6.0         2.5 virginica
### > 102          5.8         2.7          5.1         1.9 virginica
### > 103          7.1         3.0          5.9         2.1 virginica
### > 104          6.3         2.9          5.6         1.8 virginica
### > 105          6.5         3.0          5.8         2.2 virginica

Our take-aways so far? Factors can be value-labeled while staying factors, and we can use the labels to do labelr-y things with those factors. We can have both.

We may want to go further and add the labeled variable alongside the factor version.

irlab_aug <- add_lab_cols(irlab, vars = "Species")

This gives us a new variable called “Species_lab”. Let’s look at a random sample of rows from the resulting data.frame, since we want to see all the different species.

set.seed(231)
sample_rows <- sample(seq_len(nrow(irlab)), 10, replace = FALSE)

irlab_aug[sample_rows, ]
### >     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Species_lab
### > 7            4.6         3.4          1.4         0.3     setosa          se
### > 91           5.5         2.6          4.4         1.2 versicolor          ve
### > 41           5.0         3.5          1.3         0.3     setosa          se
### > 133          6.4         2.8          5.6         2.2  virginica          vi
### > 130          7.2         3.0          5.8         1.6  virginica          vi
### > 19           5.7         3.8          1.7         0.3     setosa          se
### > 104          6.3         2.9          5.6         1.8  virginica          vi
### > 43           4.4         3.2          1.3         0.2     setosa          se
### > 8            5.0         3.4          1.5         0.2     setosa          se
### > 68           5.8         2.7          4.1         1.0 versicolor          ve

sapply(irlab_aug, class)
### > Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species  Species_lab 
### >    "numeric"    "numeric"    "numeric"    "numeric"     "factor"  "character"

with(irlab_aug, table(Species, Species_lab))
### >             Species_lab
### > Species      se ve vi
### >   setosa     50  0  0
### >   versicolor  0 50  0
### >   virginica   0  0 50

Caution: Replacing the entire data.frame using use_val_labs() WILL coerce factors to character, since the value labels are character values, not recognized factor levels.

ir_char <- use_val_labs(irlab) # we assign this to a new data.frame
sapply(ir_char, class)
### > Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
### >    "numeric"    "numeric"    "numeric"    "numeric"  "character"

head(ir_char, 3)
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
### > 1          5.1         3.5          1.4         0.2      se
### > 2          4.9         3.0          1.4         0.2      se
### > 3          4.7         3.2          1.3         0.2      se

class(ir_char$Species) # it's character
### > [1] "character"

Of course, even then, we could explicitly coerce the labels to be factors if we wanted.

ir_fact <- use_val_labs(irlab)

ir_fact$Species <- factor(ir_fact$Species,
  levels = c("se", "ve", "vi"),
  labels = c("se", "ve", "vi")
)
head(ir_fact, 3)
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
### > 1          5.1         3.5          1.4         0.2      se
### > 2          4.9         3.0          1.4         0.2      se
### > 3          4.7         3.2          1.3         0.2      se

class(ir_fact$Species) # it's a factor
### > [1] "factor"

levels(ir_fact$Species) # its levels are now the labels
### > [1] "se" "ve" "vi"

We’ve recovered.

with(ir_fact, tapply(Sepal.Width, Species, mean))
### >    se    ve    vi 
### > 3.428 2.770 2.974

with(irlab, tapply(Sepal.Width, Species, mean))
### >     setosa versicolor  virginica 
### >      3.428      2.770      2.974

with(iris, tapply(Sepal.Width, Species, mean))
### >     setosa versicolor  virginica 
### >      3.428      2.770      2.974

Ordered factors

Value labels work with ordered factors, too. Let’s make a fictional ordered factor that we add to ir_ord. We can pretend that this is some sort of judge’s overall quality rating, if that helps.

ir_ord <- iris

set.seed(293)
qrating <- c("AAA", "AA", "A", "BBB", "AA", "BBB", "A")

ir_ord$qrat <- sample(qrating, 150, replace = TRUE)

ir_ord$qrat <- factor(ir_ord$qrat,
  ordered = TRUE,
  levels = c("AAA", "AA", "A", "BBB")
)

Where do we stand with this factor?

levels(ir_ord$qrat)
### > [1] "AAA" "AA"  "A"   "BBB"

class(ir_ord$qrat)
### > [1] "ordered" "factor"

Now, let’s add value labels to it.

ir_ord <- add_val_labs(ir_ord,
  vars = "qrat",
  vals = c("AAA", "AA", "A", "BBB"),
  labs = c(
    "unimpeachable",
    "excellent",
    "very good",
    "meh"
  )
)

Let’s add a separate column with those labels as a distinct (character) variable unto itself, existing in addition to (not replacing) “qrat”.

ir_ord <- add_lab_cols(ir_ord, vars = "qrat")

head(ir_ord, 10)
### >    Sepal.Length Sepal.Width Petal.Length Petal.Width Species qrat      qrat_lab
### > 1           5.1         3.5          1.4         0.2  setosa   AA     excellent
### > 2           4.9         3.0          1.4         0.2  setosa   AA     excellent
### > 3           4.7         3.2          1.3         0.2  setosa   AA     excellent
### > 4           4.6         3.1          1.5         0.2  setosa  AAA unimpeachable
### > 5           5.0         3.6          1.4         0.2  setosa   AA     excellent
### > 6           5.4         3.9          1.7         0.4  setosa  BBB           meh
### > 7           4.6         3.4          1.4         0.3  setosa  AAA unimpeachable
### > 8           5.0         3.4          1.5         0.2  setosa   AA     excellent
### > 9           4.4         2.9          1.4         0.2  setosa    A     very good
### > 10          4.9         3.1          1.5         0.1  setosa    A     very good

with(ir_ord, table(qrat_lab, qrat))
### >                qrat
### > qrat_lab        AAA AA  A BBB
### >   excellent       0 49  0   0
### >   meh             0  0  0  43
### >   unimpeachable  11  0  0   0
### >   very good       0  0 47   0

class(ir_ord$qrat)
### > [1] "ordered" "factor"

levels(ir_ord$qrat)
### > [1] "AAA" "AA"  "A"   "BBB"

class(ir_ord$qrat_lab)
### > [1] "character"

get_val_labs(ir_ord, "qrat") # labs are still there for qrat
### >    var vals          labs
### > 1 qrat    A     very good
### > 2 qrat   AA     excellent
### > 3 qrat  AAA unimpeachable
### > 4 qrat  BBB           meh
### > 5 qrat   NA            NA

get_val_labs(ir_ord, "qrat_lab") # no labs here; this is just a character var
### > Warning in get_val_labs(ir_ord, "qrat_lab"): 
### >  
### >   No val.labs found.
### > [1] var  vals labs
### > <0 rows> (or 0-length row.names)
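
The contrast between the ordered factor and its character label column can be sketched in base R alone. The small ratings vector and the labs lookup below are hypothetical stand-ins for the labelr objects above, not labelr code.

```r
# Hypothetical ratings, ordered from best ("AAA") to worst ("BBB")
qrat <- factor(c("AA", "AAA", "BBB", "A", "AA"),
  ordered = TRUE,
  levels = c("AAA", "AA", "A", "BBB")
)

# Named lookup vector standing in for the value labels
labs <- c(AAA = "unimpeachable", AA = "excellent", A = "very good", BBB = "meh")
qrat_lab <- unname(labs[as.character(qrat)])

is.ordered(qrat) # TRUE: the original column keeps its ordering
qrat < "A" # order-aware comparison: TRUE for "AAA" and "AA" only
class(qrat_lab) # "character": the label column has no ordering
sort(unique(qrat_lab)) # sorts alphabetically, not by rating
```

This is also why the table() output above lists the qrat_lab rows alphabetically while the qrat columns follow the factor's level order.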

Other Factor and Categorical Variable Possibilities

labelr offers some additional facilities for working with factors and categorical variables. For example, functions add_lab_dummies() (alias ald()) and add_lab_dumm1() (alias ald1()) will generate and assign a dummy (aka binary aka indicator) variable for each unique value label of a value-labeled variable – factor or otherwise.

Alternatively, lab_int_to_factor() (alias int2f()) allows you to convert a value-labeled integer variable (or other non-decimal-having numeric column) to a factor, while factor_to_lab_int() (alias f2int()) allows you to convert a factor to a value-labeled integer variable. Note that the latter is NOT a straightforward “undo” for the former: the resulting unique integer values and their ordering may differ, as we demonstrate.

First, let’s convert a factor to a value-labeled integer.

class(iris[["Species"]])
### > [1] "factor"

iris_df <- factor_to_lab_int(iris, Species)

class(iris_df[["Species"]])
### > [1] "integer"

head(iris_df$Species)
### > [1] 1 1 1 1 1 1

get_val_labs(iris_df, "Species")
### >       var vals       labs
### > 1 Species    1     setosa
### > 2 Species    2 versicolor
### > 3 Species    3  virginica
### > 4 Species   NA         NA

Now, let’s value-label an integer and convert it to a factor. Note that our variable is not a strict as.integer() integer, but it’s a numeric variable with no decimal values, and that’s good enough for lab_int_to_factor().

carb_orig <- mtcars

carb_orig <- add_val_labs(
  data = mtcars,
  vars = "carb",
  vals = c(1, 2, 3, 4, 6, 8),
  labs = c(
    "1c", "2c", # a tad silly, but these value labels will demo the principle
    "3c", "4c",
    "6c", "8c"
  )
)

# carb as labeled numeric
is.integer(carb_orig$carb) # note: carb not technically an "as.integer()" integer
### > [1] FALSE

class(carb_orig$carb) # but it IS numeric
### > [1] "numeric"

has_decv(carb_orig$carb) # and does NOT have decimals; so, lab_int_to_factor() works
### > [1] FALSE

levels(carb_orig$carb) # none, not a factor
### > NULL

head(carb_orig$carb, 3) # remember to compare to carb_to_int (below)
### > [1] 4 4 1

mean(carb_orig$carb) # remember to compare to carb_to_int (below)
### > [1] 2.8125

lm(mpg ~ carb, data = carb_orig) # remember to compare to carb_to_int (below)
### > 
### > Call:
### > lm(formula = mpg ~ carb, data = carb_orig)
### > 
### > Coefficients:
### > (Intercept)         carb  
### >      25.872       -2.056

# note this for comparison to below
(adj_r2_orig <- summary(lm(mpg ~ carb, data = carb_orig))$adj.r.squared)
### > [1] 0.2803024

# compare to counterparts below
AIC(lm(mpg ~ carb, data = carb_orig))
### > [1] 199.1807
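
As an aside, the "numeric but no decimals" condition checked above can be approximated in base R. The helper below, is_whole_number(), is a hypothetical illustration: labelr's has_decv() may be implemented differently, and note that it returns the opposite logical (TRUE when decimals ARE present).

```r
# Hypothetical helper: TRUE if x is numeric and has no fractional parts
is_whole_number <- function(x) {
  x <- x[!is.na(x)] # ignore missing values
  is.numeric(x) && all(x == trunc(x))
}

is.integer(mtcars$carb) # FALSE: carb is stored as double
is_whole_number(mtcars$carb) # TRUE: every value is a whole number
is_whole_number(mtcars$wt) # FALSE: weights have decimals
```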

# Make carb a factor
carb_fac <- lab_int_to_factor(carb_orig, carb) # alias int2f() also works

class(carb_fac$carb) # now it's a factor
### > [1] "factor"

levels(carb_fac$carb) # like any good factor, it has levels
### > [1] "1c" "2c" "3c" "4c" "6c" "8c"

head(carb_fac$carb, 3)
### > [1] 4c 4c 1c
### > Levels: 1c 2c 3c 4c 6c 8c

lm(mpg ~ carb, data = carb_fac) # again: carb is a factor
### > 
### > Call:
### > lm(formula = mpg ~ carb, data = carb_fac)
### > 
### > Coefficients:
### > (Intercept)       carb2c       carb3c       carb4c       carb6c       carb8c  
### >      25.343       -2.943       -9.043       -9.553       -5.643      -10.343

# compare these model fit stats to counterparts above and below
(adj_r2_fac <- summary(lm(mpg ~ carb, data = carb_fac))$adj.r.squared)
### > [1] 0.3377081

# compare to counterparts above and below
AIC(lm(mpg ~ carb, data = carb_fac))
### > [1] 199.9415

Note that we can use factor_to_lab_int() to convert “carb” from a factor to a labeled integer variable. However, this is not a straightforward “undo” of what we just did. The resulting labeled integer won’t be identical to the “carb” column of mtcars that we started with, because factor_to_lab_int() recodes the factor’s values as sequential integers from 1 to k (where k is the number of unique factor levels), following the order of the factor’s levels.

# "Back" to integer? Not quite. Compare below to carb_orig above
carb_to_int <- factor_to_lab_int(carb_fac, carb) # alias f2int() also works

class(carb_to_int$carb) # Is an integer
### > [1] "integer"

levels(carb_to_int$carb) # NOT a factor
### > NULL

mean(carb_to_int$carb) # NOT the same as carb_orig
### > [1] 2.71875

identical(carb_to_int$carb, carb_orig$carb) # really!
### > [1] FALSE

lm(mpg ~ carb, data = carb_to_int) # NOT the same as carb_orig
### > 
### > Call:
### > lm(formula = mpg ~ carb, data = carb_to_int)
### > 
### > Coefficients:
### > (Intercept)         carb  
### >      27.330       -2.663

# Compare to counterpart calls from earlier iterations of carb (above)
(adj_r2_int <- summary(lm(mpg ~ carb, data = carb_to_int))$adj.r.squared)
### > [1] 0.3470751
AIC(lm(mpg ~ carb, data = carb_to_int))
### > [1] 196.0649
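
The source of the mismatch can be reproduced with base R alone. This sketch assumes only that the conversion assigns the codes 1 to k in level order, as described above. It uses a plain factor(), not labelr's functions, so it illustrates the principle rather than the package's implementation.

```r
carb <- mtcars$carb # numeric values: 1, 2, 3, 4, 6, 8
carb_codes <- as.integer(factor(carb)) # level positions: 1 through 6

sort(unique(carb)) # 1 2 3 4 6 8
sort(unique(carb_codes)) # 1 2 3 4 5 6 (6 became 5, 8 became 6)

mean(carb) # 2.8125
mean(carb_codes) # 2.71875, matching carb_to_int above

# The original values survive only in the level text, so a true "undo"
# has to go through the levels rather than the integer codes.
carb_back <- as.numeric(levels(factor(carb)))[carb_codes]
identical(carb_back, carb) # TRUE
```

With labels such as "1c" or "8c", the level text would need to be parsed before it could be turned back into numbers.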

Now, let’s quickly demo add_lab_dummies(). To do so, we’ll revisit the “Species” column of irlab, our factor variable from iris that we value-labeled a few moments ago. It’s still here and still has value labels.

get_val_labs(irlab, "Species")
### >       var       vals labs
### > 1 Species     setosa   se
### > 2 Species versicolor   ve
### > 3 Species  virginica   vi
### > 4 Species         NA   NA

Let’s use add_lab_dummies() to create a dummy variable for each of its labels.

irl_dumm <- add_lab_dummies(irlab, "Species")
head(irl_dumm) # they're there!
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species species_1 species_2
### > 1          5.1         3.5          1.4         0.2  setosa         1         0
### > 2          4.9         3.0          1.4         0.2  setosa         1         0
### > 3          4.7         3.2          1.3         0.2  setosa         1         0
### > 4          4.6         3.1          1.5         0.2  setosa         1         0
### > 5          5.0         3.6          1.4         0.2  setosa         1         0
### > 6          5.4         3.9          1.7         0.4  setosa         1         0
### >   species_3
### > 1         0
### > 2         0
### > 3         0
### > 4         0
### > 5         0
### > 6         0
tail(irl_dumm) # again, they're there!
### >     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species species_1
### > 145          6.7         3.3          5.7         2.5 virginica         0
### > 146          6.7         3.0          5.2         2.3 virginica         0
### > 147          6.3         2.5          5.0         1.9 virginica         0
### > 148          6.5         3.0          5.2         2.0 virginica         0
### > 149          6.2         3.4          5.4         2.3 virginica         0
### > 150          5.9         3.0          5.1         1.8 virginica         0
### >     species_2 species_3
### > 145         0         1
### > 146         0         1
### > 147         0         1
### > 148         0         1
### > 149         0         1
### > 150         0         1

We can use add_lab_dumm1() to achieve the same result without quoting the column name. The countervailing advantage of add_lab_dummies() is that it lets you create dummy variables for more than one value-labeled variable at a time (add_lab_dumm1() does not).

irl_dumm2 <- add_lab_dumm1(irlab, Species)
head(irl_dumm2) # again, they're there!
### >   Sepal.Length Sepal.Width Petal.Length Petal.Width Species species_1 species_2
### > 1          5.1         3.5          1.4         0.2  setosa         1         0
### > 2          4.9         3.0          1.4         0.2  setosa         1         0
### > 3          4.7         3.2          1.3         0.2  setosa         1         0
### > 4          4.6         3.1          1.5         0.2  setosa         1         0
### > 5          5.0         3.6          1.4         0.2  setosa         1         0
### > 6          5.4         3.9          1.7         0.4  setosa         1         0
### >   species_3
### > 1         0
### > 2         0
### > 3         0
### > 4         0
### > 5         0
### > 6         0
tail(irl_dumm2) # again, they're there!
### >     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species species_1
### > 145          6.7         3.3          5.7         2.5 virginica         0
### > 146          6.7         3.0          5.2         2.3 virginica         0
### > 147          6.3         2.5          5.0         1.9 virginica         0
### > 148          6.5         3.0          5.2         2.0 virginica         0
### > 149          6.2         3.4          5.4         2.3 virginica         0
### > 150          5.9         3.0          5.1         1.8 virginica         0
### >     species_2 species_3
### > 145         0         1
### > 146         0         1
### > 147         0         1
### > 148         0         1
### > 149         0         1
### > 150         0         1
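
For intuition, the same kind of indicator columns can be built by hand in base R. This is a sketch, not what add_lab_dummies() does internally. The species_1, species_2, species_3 names are chosen here only to mimic the output above.

```r
species <- iris$Species

# One 0/1 integer column per factor level
dummies <- sapply(levels(species), function(lev) as.integer(species == lev))
colnames(dummies) <- paste0("species_", seq_along(levels(species)))

iris_dumm <- cbind(iris, dummies)
head(iris_dumm, 3)

colSums(dummies) # 50 per column: each species appears 50 times
all(rowSums(dummies) == 1) # TRUE: each row belongs to exactly one species
```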

Value-Labeling Many Variables at Once

Functions for adding value labels (e.g., add_val_labs, add_quant_labs and add_m1_lab) will do partial matching if the partial argument is set to TRUE. Let’s use labelr’s make_likert_data() function to generate some fake Likert scale-style survey data to demonstrate this more fully.

set.seed(272) # for reproducibility
dflik <- make_likert_data(scale = 1:7) # another labelr function
head(dflik)
### >     id x1 x2 x3 x4 x5 y1 y2 y3 y4 y5
### > U-1  1  5  7  2  2  2  7  1  1  4  2
### > O-2  2  6  2  7  6  2  3  5  4  1  4
### > H-3  3  7  7  5  5  6  6  4  1  5  7
### > Z-4  4  4  5  5  4  5  6  3  7  3  4
### > C-5  5  3  3  3  1  6  2  7  6  3  5
### > P-6  6  7  3  5  3  7  5  7  1  6  2

We’ll put the values we wish to label and the labels we wish to use in stand-alone vectors, which we will supply to add_val_labs in a moment.

vals2label <- 1:7
labs2use <- c(
  "VSD",
  "SD",
  "D",
  "N",
  "A",
  "SA",
  "VSA"
)

Now, let’s associate/apply the value labels to ALL vars with “x” in their name and also to var “y3.” Note: partial = TRUE.

dflik <- add_val_labs(
  data = dflik, vars = c("x", "y3"), ###  note the vars args
  vals = vals2label,
  labs = labs2use,
  partial = TRUE # applying to all cols with "x" or "y3" substring in names
)

Let’s compare dflik with value labels present but “off” to labels “on.”

First, present but “off.”

head(dflik)
### >     id x1 x2 x3 x4 x5 y1 y2 y3 y4 y5
### > U-1  1  5  7  2  2  2  7  1  1  4  2
### > O-2  2  6  2  7  6  2  3  5  4  1  4
### > H-3  3  7  7  5  5  6  6  4  1  5  7
### > Z-4  4  4  5  5  4  5  6  3  7  3  4
### > C-5  5  3  3  3  1  6  2  7  6  3  5
### > P-6  6  7  3  5  3  7  5  7  1  6  2

Now, let’s “turn on” (use) these value labels.

lik1 <- uvl(dflik) # assign to new object, since we can't "undo"
head(lik1) # we could have skipped previous call by using labelr::headl(dflik)
### >     id  x1  x2  x3  x4  x5 y1 y2  y3 y4 y5
### > U-1  1   A VSA  SD  SD  SD  7  1 VSD  4  2
### > O-2  2  SA  SD VSA  SA  SD  3  5   N  1  4
### > H-3  3 VSA VSA   A   A  SA  6  4 VSD  5  7
### > Z-4  4   N   A   A   N   A  6  3 VSA  3  4
### > C-5  5   D   D   D VSD  SA  2  7  SA  3  5
### > P-6  6 VSA   D   A   D VSA  5  7 VSD  6  2

Yea, verily: All variables with “x” in their name (and “y3”) got the labels!
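
The selection rule behind partial = TRUE can be sketched in base R. The sketch assumes that matching means "the column name contains the supplied string as a literal substring," which is consistent with the result above. labelr's own matching code may differ in its details.

```r
col_names <- c("id", paste0("x", 1:5), paste0("y", 1:5))
patterns <- c("x", "y3")

# A column is selected if its name contains ANY pattern as a literal substring
hits <- vapply(col_names, function(nm) {
  any(vapply(patterns, function(p) grepl(p, nm, fixed = TRUE), logical(1)))
}, logical(1))

col_names[hits] # "x1" "x2" "x3" "x4" "x5" "y3"
```

A short pattern like "x" is convenient here, but it would also catch any other column whose name happens to contain an "x", so it pays to check which columns got labels.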

Suppose we want to drop these value labels for a select few, but not all, of these variables. drop_val_labs can get the job done.

dfdrop <- drop_val_labs(dflik,
  c("x2", "y3"),
  partial = FALSE
)

Most of our previously labeled columns remain so; but not “x2” and “y3.”

get_val_labs(dfdrop, c("x2", "y3"))
### > Warning in get_val_labs(dfdrop, c("x2", "y3")): 
### >  
### >   No val.labs found.
### > [1] var  vals labs
### > <0 rows> (or 0-length row.names)

Compare to values for variable “x1” (we did not drop value labels from this one).

get_val_labs(dfdrop, "x1")
### >   var vals labs
### > 1  x1    1  VSD
### > 2  x1    2   SD
### > 3  x1    3    D
### > 4  x1    4    N
### > 5  x1    5    A
### > 6  x1    6   SA
### > 7  x1    7  VSA
### > 8  x1   NA   NA

Just like we did with add_val_labs(), we also can use a single command to drop value labels from all variables with “x” in their variable names.

dfxgone <- drop_val_labs(dflik,
  c("x"),
  partial = TRUE # note
)

“y3” still has value labels, but now all “x” var value labels are gone.

get_val_labs(dfxgone)
### >   var vals labs
### > 1  y3    1  VSD
### > 2  y3    2   SD
### > 3  y3    3    D
### > 4  y3    4    N
### > 5  y3    5    A
### > 6  y3    6   SA
### > 7  y3    7  VSA
### > 8  y3   NA   NA

Tabulating Frequencies with `tabl()`

Finally, let’s get to know labelr’s tabl() function, which tabulates counts or proportions with labels turned “on” or “off,” supports wide (cross-tab style) layouts, and can bin many-valued numeric variables into quantile categories.

set.seed(4847) # for reproducibility
df <- make_demo_data(n = 1000) # make a fictional n = 1000 data set

df <- add_val1(df, # data.frame
  var = raceth, # var to label, unquoted since this is add_val1()
  vals = c(1:7), # label values 1 through 7, inclusive
  labs = c(
    "White", "Black", "Hispanic", # ordered labels for sequential vals 1-7
    "Asian", "AIAN", "Multi", "Other"
  )
)

df <- add_val1(
  data = df,
  var = gender,
  vals = c(0, 1, 2, 3, 4), # the values to be labeled
  labs = c("M", "F", "TR", "NB", "Diff-Term"), # labs order should reflect vals order
  max.unique.vals = 10
)

# label values of var "x1" according to quantile ranges
df <- add_quant1(
  data = df,
  var = x1, # apply quantile range value labels to this var
  qtiles = 3 # first, second, and third tertiles
)

# apply many-vals-get-one-label labels to "edu" (note vals 3-5 all get same lab)
df <- add_m1_lab(df, "edu", vals = c(3:5), lab = "Some College+")
df <- add_m1_lab(df, "edu", vals = 1, lab = "Not HS Grad")
df <- add_m1_lab(df, "edu", vals = 2, lab = "HSG, No College")

# show value labels
get_val_labs(df)
### >       var   vals            labs
### > 1  gender      0               M
### > 2  gender      1               F
### > 3  gender      2              TR
### > 4  gender      3              NB
### > 5  gender      4       Diff-Term
### > 6  gender     NA              NA
### > 7  raceth      1           White
### > 8  raceth      2           Black
### > 9  raceth      3        Hispanic
### > 10 raceth      4           Asian
### > 11 raceth      5            AIAN
### > 12 raceth      6           Multi
### > 13 raceth      7           Other
### > 14 raceth     NA              NA
### > 15    edu      1     Not HS Grad
### > 16    edu      2 HSG, No College
### > 17    edu      3   Some College+
### > 18    edu      4   Some College+
### > 19    edu      5   Some College+
### > 20    edu     NA              NA
### > 21     x1  92.63            q033
### > 22     x1 108.83            q067
### > 23     x1 157.06            q100
### > 24     x1     NA              NA
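
The x1 rows above pair each quantile label with an upper bound. A rough base R analogue of tertile labeling is sketched below on a toy vector. The use of quantile() and cut(), and the way boundary values are handled, are assumptions for illustration rather than a description of add_quant1() internals.

```r
x <- seq(10, 90, by = 10) # toy data: 10, 20, ..., 90

# Tertile cut points: minimum, 1/3 quantile, 2/3 quantile, maximum
breaks <- quantile(x, probs = c(0, 1 / 3, 2 / 3, 1))

# Assign each value the label of the tertile it falls in
x_q <- cut(x,
  breaks = breaks,
  labels = c("q033", "q067", "q100"),
  include.lowest = TRUE
)

table(x_q) # three values per tertile
```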

With tabl(), tables can be generated…

…in terms of values

tabl(df, vars = "gender", labs.on = FALSE)
### >   gender   n
### > 1      1 460
### > 2      0 431
### > 3      3  56
### > 4      2  40
### > 5      4  13

…or in terms of labels

tabl(df, vars = "gender", labs.on = TRUE) # labs.on = TRUE is the default
### >      gender   n
### > 1         F 460
### > 2         M 431
### > 3        NB  56
### > 4        TR  40
### > 5 Diff-Term  13

…in proportions

tabl(df, vars = c("gender", "edu"), prop.digits = 3)
### >       gender             edu     n
### > 1          F   Some College+ 0.307
### > 2          M   Some College+ 0.296
### > 3          F HSG, No College 0.139
### > 4          M HSG, No College 0.124
### > 5         NB   Some College+ 0.032
### > 6         TR   Some College+ 0.031
### > 7         NB HSG, No College 0.024
### > 8          F     Not HS Grad 0.014
### > 9  Diff-Term   Some College+ 0.011
### > 10         M     Not HS Grad 0.011
### > 11        TR HSG, No College 0.007
### > 12 Diff-Term HSG, No College 0.002
### > 13        TR     Not HS Grad 0.002
### > 14 Diff-Term     Not HS Grad 0.000
### > 15        NB     Not HS Grad 0.000
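
For comparison, a similar frequency table, sorted from most to least common as in the tabl() output above, can be assembled in base R. This is a sketch of the general idea using mtcars, not of how tabl() is implemented.

```r
# Counts of each cyl value, as a data.frame with a column named "n"
counts <- as.data.frame(table(cyl = mtcars$cyl), responseName = "n")
counts <- counts[order(-counts$n), ] # most frequent first
counts

# Same table, expressed as proportions rounded to three digits
props <- counts
props$n <- round(props$n / sum(props$n), 3)
props
```

Unlike tabl(), this approach knows nothing about value labels, so any labels would have to be swapped in before or after tabulating.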

…cross-tab style

head(tabl(df, vars = c("raceth", "edu"), wide.col = "gender"), 20)
### >      raceth             edu  F  M NB TR Diff-Term
### > 1     Multi   Some College+ 55 49  1  5         1
### > 2      AIAN   Some College+ 53 44  3  2         2
### > 3     White   Some College+ 40 50  3  8         1
### > 4  Hispanic   Some College+ 49 35  2  2         2
### > 5     Black   Some College+ 35 44  9  7         1
### > 6     Asian   Some College+ 41 42  7  2         1
### > 7     Other   Some College+ 34 32  7  5         3
### > 8  Hispanic HSG, No College 19 26  3  0         1
### > 9     Multi HSG, No College 26 17  2  0         0
### > 10    White HSG, No College 23 22  3  2         0
### > 11     AIAN HSG, No College 20 18  3  0         0
### > 12    Other HSG, No College 20 15  4  3         0
### > 13    Asian HSG, No College 17 11  4  0         1
### > 14    Black HSG, No College 14 15  5  2         0
### > 15    Asian     Not HS Grad  5  1  0  0         0
### > 16    Other     Not HS Grad  2  4  0  1         0
### > 17    Black     Not HS Grad  3  0  0  0         0
### > 18    Multi     Not HS Grad  1  3  0  0         0
### > 19     AIAN     Not HS Grad  1  2  0  1         0
### > 20 Hispanic     Not HS Grad  1  1  0  0         0
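
The wide layout has a familiar base R cousin: a two-way table whose second variable is spread across columns. The sketch below uses mtcars and is offered only as a point of comparison. It does not reproduce tabl()'s sorting or label handling.

```r
# Rows are cyl values; columns are am values (0 = automatic, 1 = manual)
wide <- as.data.frame.matrix(with(mtcars, table(cyl, am)))
wide

rowSums(wide) # cars per cyl value: 11, 7, 14
colSums(wide) # cars per am value: 19, 13
```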

…with non-value-labeled data.frames

tabl(iris, "Species") # explicit vars arg with one-var ("Species")
### >      Species  n
### > 1     setosa 50
### > 2 versicolor 50
### > 3  virginica 50

# many-valued numeric vars automatically converted to quantile categories
tabl(mtcars, c("am", "gear", "cyl", "disp", "mpg"),
  qtiles = 4, zero.rm = TRUE
)
### >    am gear cyl disp  mpg n
### > 1   0    3   8 q100 q025 5
### > 2   1    4   4 q025 q100 4
### > 3   0    3   8 q075 q050 3
### > 4   0    3   8 q075 q025 2
### > 5   0    3   8 q100 q050 2
### > 6   0    4   6 q050 q050 2
### > 7   1    4   6 q050 q075 2
### > 8   1    5   4 q025 q100 2
### > 9   0    3   4 q025 q075 1
### > 10  0    3   6 q075 q050 1
### > 11  0    3   6 q075 q075 1
### > 12  0    4   4 q050 q075 1
### > 13  0    4   4 q050 q100 1
### > 14  1    4   4 q025 q075 1
### > 15  1    4   4 q050 q075 1
### > 16  1    5   6 q050 q075 1
### > 17  1    5   8 q075 q025 1
### > 18  1    5   8 q100 q050 1
