Data Types in R

February 26, 2014 | Alan

This blog details the different data types in R and the different data structures in which these data types are stored. An open data set will be used to provide examples.

Data Source

The dataset used in this blog is msleep, which is included in the R package ggplot2. It provides a framework for understanding mammalian sleep patterns.

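Index

To help you understand the data structures and data types in R, we cover the following topics in this blog:

  • Step 1 – Reading in the data
  • Step 2 – Finding the classification of data types within a data structure
  • Step 3 – Data types
  • Step 4 – Data structures
  • Step 5 – Identifying elements within a data structure
  • Step 6 – Converting between data types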

Step 1 – Reading in the data

The data used for this blog is from the R library ggplot2. The code chunk below shows how to read this data into R. This automatically assigns the data to the variable msleep:

library(ggplot2)
data(msleep)

Note that msleep is loaded into R as a data frame. For the purposes of this blog we do not use the full data frame; instead we use a new data frame containing only its first 10 rows. To create this subset and assign it to the variable msleep_subset, use the following syntax:

msleep_subset <- msleep[1:10, ]

Step 2 – Finding the classification of data types within a data structure

There are a number of R commands that can be used to classify either a whole data set/subset or the individual fields within the data set. Before doing this we may need to identify the specific fields in the data set and understand how to call a specific field.

To find the names of the fields within the data frame msleep_subset, use the function names(x):

names(msleep_subset)
##  [1] "name"         "genus"        "vore"         "order"       
##  [5] "conservation" "sleep_total"  "sleep_rem"    "sleep_cycle" 
##  [9] "awake"        "brainwt"      "bodywt"

To call a specific field from this subset, you must use the syntax data_subset$column, where:

  • data_subset is the subset of your data
  • column is the name of the field.

So to call the field bodywt from msleep_subset and assign it to a variable msleep_subset1, use the following syntax:

   msleep_subset1 <- msleep_subset$bodywt
   msleep_subset1
   ##  [1]  50.000   0.480   1.350   0.019 600.000   3.850  20.490   0.045
   ##  [9]  14.000  14.800
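
As a minimal sketch of the same idea, the snippet below builds a small stand-in data frame (sleep_demo is a hypothetical name, with a few values copied from the outputs in this post) so that it runs without ggplot2. It shows that double square brackets with a quoted field name return the same vector as the $ syntax:

```r
# Stand-in for msleep_subset: three rows, two fields (values taken from this post)
sleep_demo <- data.frame(
  name   = c("Cheetah", "Owl monkey", "Mountain beaver"),
  bodywt = c(50, 0.48, 1.35),
  stringsAsFactors = FALSE
)

by_dollar  <- sleep_demo$bodywt        # call the field with $
by_bracket <- sleep_demo[["bodywt"]]   # same field, name given as a string

identical(by_dollar, by_bracket)       # TRUE
```

The bracket form is handy when the field name is itself held in a variable.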

There are a number of useful functions that we can use to find the classification of our data. Examples of these are given in the code chunks below:

  1. class()– Returns the class of the input. This can be applied either to a whole data set or to an individual field. Applying the function to the data frame msleep_subset tells us that the input is a data frame structure:
    class(msleep_subset)
    
    ## [1] "data.frame"
    

    Applying the function to the field bodywt tells us that the input is of type numeric:

    class(msleep_subset$bodywt)
    
    ## [1] "numeric"
    
  2. typeof()– Returns the type of the input. Again this can be applied either to a whole dataset or to an individual field. Applying the function to the data frame msleep_subset tells us that the input is a list structure:
    typeof(msleep_subset)
    
    ## [1] "list"
    

    Applying the function to the field conservation tells us that the input is of type integer, because R stores a factor internally as integer codes:

    typeof(msleep_subset$conservation)
    
    ## [1] "integer"
    

    Note that in the above examples the functions class() and typeof() return different values for the same input. These differences are due to the fact that every R object has:

  • a mode, reported in more detail by typeof(): the way in which that object is stored
  • a class: the kind of object it represents, which determines how functions treat it.
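
To see the distinction side by side, here is a small sketch. The helper describe() and the two sample vectors are assumptions made for illustration and are not part of the msleep data:

```r
# Hypothetical helper: report class, type and mode of an object together
describe <- function(x) {
  c(class = class(x)[1], typeof = typeof(x), mode = mode(x))
}

vore <- factor(c("carni", "omni", "herbi"))   # a small factor, like the field vore
wt   <- c(50, 0.48, 1.35)                     # a plain numeric vector, like bodywt

describe(vore)   # class "factor",  typeof "integer", mode "numeric"
describe(wt)     # class "numeric", typeof "double",  mode "numeric"
```

Both vectors share the same mode, but only class() reveals that the first one is a factor.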

3. attributes()– Returns the attributes of the input. This function can be applied to any R object: a plain numeric field such as bodywt has no attributes and returns NULL, whereas a factor returns its levels and class. For a data frame it returns the following:

  • column names
  • row names
  • class.

attributes(msleep_subset)
## $names
##  [1] "name"         "genus"        "vore"         "order"       
##  [5] "conservation" "sleep_total"  "sleep_rem"    "sleep_cycle" 
##  [9] "awake"        "brainwt"      "bodywt"      
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $class
## [1] "data.frame"
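
attributes() also accepts individual vectors. The sketch below uses two small sample vectors (assumed values, modelled on the fields vore and bodywt) to show what it returns in each case:

```r
# A factor with the same four levels as the field vore
vore <- factor(c("carni", "omni", "herbi"),
               levels = c("carni", "herbi", "insecti", "omni"))

factor_attrs <- attributes(vore)           # a list holding the levels and the class
vector_attrs <- attributes(c(50, 0.48))    # NULL: a plain numeric vector has no attributes

names(factor_attrs)
is.null(vector_attrs)                      # TRUE
```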

4. sapply()– This function provides a way to apply a pre-specified function over every field in a dataset. In this example we are using built-in R functions, but you could use a function that you had created. As input, sapply() accepts data fields in the form of either a list or a vector, together with the function that you want to run over those fields. Where possible, it simplifies the output to a vector. Applying the function to the data frame msleep_subset using the class() function tells us the class of each field in the input, displayed in vector format:

sapply(msleep_subset, class)
##         name        genus         vore        order conservation 
##  "character"  "character"     "factor"  "character"     "factor" 
##  sleep_total    sleep_rem  sleep_cycle        awake      brainwt 
##    "numeric"    "numeric"    "numeric"    "numeric"    "numeric" 
##       bodywt 
##    "numeric"

Applying the function to the data subset using the typeof() function tells us the type of each field in the input, displayed in vector format:

sapply(msleep_subset, typeof)
##         name        genus         vore        order conservation 
##  "character"  "character"    "integer"  "character"    "integer" 
##  sleep_total    sleep_rem  sleep_cycle        awake      brainwt 
##     "double"     "double"     "double"     "double"     "double" 
##       bodywt 
##     "double"
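
As mentioned above, sapply() also works with a function you have written yourself. The following sketch assumes a hypothetical helper count_missing() and a small stand-in data frame (values copied from the first five rows shown in this post) to count the missing values in each field:

```r
# Stand-in data: first five values of two numeric fields from this post
sleep_demo <- data.frame(
  sleep_rem = c(NA, 1.8, 2.4, 2.3, 0.7),
  brainwt   = c(NA, 0.0155, NA, 0.00029, 0.423)
)

# Hypothetical user-defined function: number of NA entries in a vector
count_missing <- function(x) sum(is.na(x))

na_counts <- sapply(sleep_demo, count_missing)
na_counts   # named vector: sleep_rem = 1, brainwt = 2
```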

5. lapply()– This function is used in the same way as sapply(). The difference here is that the output is always returned in the form of a list. Applying the function to the data frame msleep_subset using the class() function tells us the class of each field in the input, displayed in list format:

lapply(msleep_subset, class)
## $name
## [1] "character"
## 
## $genus
## [1] "character"
## 
## $vore
## [1] "factor"
## 
## $order
## [1] "character"
## 
## $conservation
## [1] "factor"
## 
## $sleep_total
## [1] "numeric"
## 
## $sleep_rem
## [1] "numeric"
## 
## $sleep_cycle
## [1] "numeric"
## 
## $awake
## [1] "numeric"
## 
## $brainwt
## [1] "numeric"
## 
## $bodywt
## [1] "numeric"

Applying the function to the data subset using the typeof() function tells us the type of each field in the input, displayed in list format:

lapply(msleep_subset, typeof)
## $name
## [1] "character"
## 
## $genus
## [1] "character"
## 
## $vore
## [1] "integer"
## 
## $order
## [1] "character"
## 
## $conservation
## [1] "integer"
## 
## $sleep_total
## [1] "double"
## 
## $sleep_rem
## [1] "double"
## 
## $sleep_cycle
## [1] "double"
## 
## $awake
## [1] "double"
## 
## $brainwt
## [1] "double"
## 
## $bodywt
## [1] "double"
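
The relationship between the two functions can be checked directly. This is a minimal sketch on a stand-in data frame (sleep_demo is a hypothetical name, values taken from this post):

```r
sleep_demo <- data.frame(
  name        = c("Cheetah", "Owl monkey", "Mountain beaver"),
  sleep_total = c(12.1, 17.0, 14.4),
  stringsAsFactors = FALSE
)

as_list   <- lapply(sleep_demo, class)   # always a list, one component per field
as_vector <- sapply(sleep_demo, class)   # simplified to a named character vector

is.list(as_list)                         # TRUE
is.character(as_vector)                  # TRUE
identical(unlist(as_list), as_vector)    # TRUE: same content, different container
```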

6. str()– Compactly displays the structure of the input. This function can be applied to any R object, including an individual column or vector. Applying the function to the data subset tells us:

  • the data structure
  • dimensions of the data structure
  • column names
  • the data type of each column
  • sample values.

str(msleep_subset)
## 'data.frame': 10 obs. of  11 variables:
##  $ name        : chr  "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
##  $ genus       : chr  "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
##  $ vore        : Factor w/ 4 levels "carni","herbi",..: 1 4 2 4 2 2 1 NA 1 2
##  $ order       : chr  "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
##  $ conservation: Factor w/ 7 levels "","cd","domesticated",..: 5 NA 6 5 3 NA 7 NA 3 5
##  $ sleep_total : num  12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3
##  $ sleep_rem   : num  NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA
##  $ sleep_cycle : num  NA NA NA 0.133 0.667 ...
##  $ awake       : num  11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21
##  $ brainwt     : num  NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982
##  $ bodywt      : num  50 0.48 1.35 0.019 600 ...
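
str() is not limited to data frames. As a short sketch, the two sample vectors below (assumed values based on the fields bodywt and vore) can each be passed to str() on their own:

```r
bodywt <- c(50, 0.48, 1.35, 0.019, 600)           # numeric vector
vore   <- factor(c("carni", "omni", "herbi"))     # factor

str(bodywt)   # one line: type, length and the first values
str(vore)     # one line: number of levels and the integer codes
```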

Step 3 – Data types

There are a number of data types which are common in R. To test for a particular data type, use the corresponding is.type() function, where type is the data type of interest, for example is.numeric(). For the fields used in this blog, the result agrees with the output of the class() function rather than the typeof() function.

  1. Numeric data can be stored in decimal or integer format and is probably the most common data type used within R. Numeric format covers both positive and negative integers and decimals. To test if a field is numeric, use the is.numeric() function. Applying this function to the field bodywt from the data frame msleep_subset returns the value TRUE, indicating that the field is numeric:
    is.numeric(msleep_subset$bodywt)
    
    ## [1] TRUE
    

    Integer format, as the name suggests, covers only whole numbers (integers). To test if a field is in integer format, use the is.integer() function. The following example illustrates that the is.type() function does not correspond to the output of the typeof() function. Applying the typeof() function to the field conservation from msleep_subset returned the type integer, because a factor is stored as integer codes. However, applying the is.integer() function to this field returns the value FALSE, because the field is a factor rather than an integer vector:

    is.integer(msleep_subset$conservation)
    
    ## [1] FALSE
    

    The is.integer() function returns TRUE if the field has been explicitly converted to type integer (See Step 6):

    is.integer(as.integer(msleep_subset$conservation))
    
    ## [1] TRUE
    
  2. Character data can be handled in either character or factor format. In character format each data entry is a string, which R displays within quotation marks, for example “data”. For example, looking at the field name from msleep_subset:
    msleep_subset$name
    
    ##  [1] "Cheetah"                    "Owl monkey"                
    ##  [3] "Mountain beaver"            "Greater short-tailed shrew"
    ##  [5] "Cow"                        "Three-toed sloth"          
    ##  [7] "Northern fur seal"          "Vesper mouse"              
    ##  [9] "Dog"                        "Roe deer"
    

    To test if a field is of character type, use the is.character() function. Applying this function to the field name from msleep_subset, returns the value TRUE, indicating that the field is of type character:

    is.character(msleep_subset$name)
    
    ## [1] TRUE
    

    The factor format is slightly different – data entries are no longer displayed within quotation marks, and a second output line is returned, giving information about the different factor levels. For example, looking at the field vore from msleep_subset:

    msleep_subset$vore
    
    ##  [1] carni omni  herbi omni  herbi herbi carni <NA>  carni herbi
    ## Levels: carni herbi insecti omni
    

    To test if a field is of factor type, use the is.factor() function. Applying this function to the field vore from msleep_subset returns the value TRUE, indicating that the field is of type factor:

    is.factor(msleep_subset$vore)
    
    ## [1] TRUE
    
  3. Logical format represents data fields with only TRUE and FALSE as values. The dataset does not contain any fields in logical format. However, to test if a field is of logical type, use the is.logical() function.
  4. Missing data elements can be included in a vector, but null data elements cannot.
  • A null value (NULL) represents the absence of a value altogether, so it takes up no position in a vector.
  • A missing value is a data element which, for some reason, is missing, but whose position is held by the code NA.

To test for the presence of missing data elements in a field, use the is.na() function. This function returns TRUE for each data element that is missing and FALSE otherwise. Applying this function to the field brainwt from msleep_subset shows that there are some missing values for that field:

is.na(msleep_subset$brainwt)
##  [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
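
Because the dataset has no logical field, the sketch below derives one from the first five sleep_total values shown earlier (the threshold of 12 hours and the name long_sleeper are assumptions for illustration). It also contrasts how NA and NULL behave inside a vector:

```r
# A logical vector created by a comparison
long_sleeper <- c(12.1, 17.0, 14.4, 14.9, 4.0) > 12
is.logical(long_sleeper)       # TRUE

# NA keeps its position in a vector
brainwt <- c(NA, 0.0155, NA, 0.00029, 0.423)
length(brainwt)                # 5
sum(is.na(brainwt))            # 2 missing values

# NULL is dropped when a vector is built
no_null <- c(1, NULL, 3)
length(no_null)                # 2
```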

5. Date formats are not applicable for this particular example, because none of the data fields in our sample dataset is of type Date. Unlike the other data types, Date has no is.type() function in base R, so you cannot test for it in the same way; you can use inherits(x, "Date") instead. It is possible, however, to convert a field to type Date (See Step 6). Note that in the AnalytiXagility platform, all data fields of type Date are represented by class POSIX. For more information, see Working with Dates in R.
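
As a rough sketch of working with dates in base R, the example below converts a sample string (the publication date of this post, used purely as an illustrative value) and tests its class with inherits(), which serves in place of an is.type() function:

```r
d <- as.Date("2014-02-26")                     # convert a string to class Date
class(d)                                       # "Date"
inherits(d, "Date")                            # TRUE
typeof(d)                                      # "double": stored as days since 1970-01-01

p <- as.POSIXct("2014-02-26", tz = "UTC")      # a date-time of class POSIXct
inherits(p, "POSIXct")                         # TRUE
```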

Step 4 – Data structures

There are a number of data structures that are commonly used in R. We look at each one of these, using the msleep data for illustration purposes.

  1. Data frame– This is composed of individual vectors (columns), all of the same length. Each column holds a single data type, but different columns can hold different data types. To illustrate this, we can take a sample of fields from the data frame msleep_subset to create a new data frame msleep_subset2. The new data frame contains the fields sleep_cycle, brainwt and bodywt, which are all of type numeric. To take a sample of fields from your data frame, do the following:
    msleep_subset2 <- subset(msleep_subset, select = c(sleep_cycle, brainwt, bodywt))
    msleep_subset2
    
    ##    sleep_cycle brainwt  bodywt
    ## 1           NA      NA  50.000
    ## 2           NA 0.01550   0.480
    ## 3           NA      NA   1.350
    ## 4       0.1333 0.00029   0.019
    ## 5       0.6667 0.42300 600.000
    ## 6       0.7667      NA   3.850
    ## 7       0.3833      NA  20.490
    ## 8           NA      NA   0.045
    ## 9       0.3333 0.07000  14.000
    ## 10          NA 0.09820  14.800
    
  2. Matrix– This is composed of vectors of the same length, and these vectors must all be of the same data type. Because the vectors in our new data frame msleep_subset2 are all of type numeric, we can represent this data frame as a matrix. To do this, first we need to know the number of rows in the data frame, so that we know the dimensions of our matrix. We find this using the nrow() function:
    nrow(msleep_subset2)
    
    ## [1] 10
    

    To construct matrix1 using msleep_subset2, do the following:

    matrix1 <- matrix(c(msleep_subset2$sleep_cycle, msleep_subset2$brainwt, msleep_subset2$bodywt), 
       nrow = 10, ncol = 3, byrow = FALSE)
    matrix1
    
    ##         [,1]    [,2]    [,3]
    ##  [1,]     NA      NA  50.000
    ##  [2,]     NA 0.01550   0.480
    ##  [3,]     NA      NA   1.350
    ##  [4,] 0.1333 0.00029   0.019
    ##  [5,] 0.6667 0.42300 600.000
    ##  [6,] 0.7667      NA   3.850
    ##  [7,] 0.3833      NA  20.490
    ##  [8,]     NA      NA   0.045
    ##  [9,] 0.3333 0.07000  14.000
    ## [10,]     NA 0.09820  14.800
    

    Note that:

  • byrow=FALSE indicates that the matrix should be filled one column at a time
  • nrow indicates the desired number of rows
  • ncol indicates the desired number of columns.

To name the columns of matrix1, use the colnames() function:

colnames(matrix1) <- c("sleep_cycle", "brainwt", "bodywt")
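
An alternative route, sketched here on small stand-in data frames (hypothetical names, values taken from this post), is the base R function as.matrix(), which converts a data frame in one step and keeps the column names. It also shows what happens when the fields are not all of the same type:

```r
# All-numeric data frame: the result is a numeric matrix
num_df <- data.frame(
  brainwt = c(0.0155, 0.00029, 0.423),
  bodywt  = c(0.48, 0.019, 600)
)
m <- as.matrix(num_df)
dim(m)         # 3 rows, 2 columns
colnames(m)    # "brainwt" "bodywt"

# Mixed data frame: every value is coerced to character
mixed_df <- data.frame(
  name   = c("Cow", "Dog"),
  bodywt = c(600, 14),
  stringsAsFactors = FALSE
)
mixed <- as.matrix(mixed_df)
typeof(mixed)  # "character"
```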

3. List– This can be made up of components of any data type. The following example creates a list using the fields vore and sleep_total from msleep_subset, which are of types factor and numeric respectively:

list1 <- list(msleep_subset$vore, msleep_subset$sleep_total)
list1
## [[1]]
##  [1] carni omni  herbi omni  herbi herbi carni <NA>  carni herbi
## Levels: carni herbi insecti omni
## 
## [[2]]
##  [1] 12.1 17.0 14.4 14.9  4.0 14.4  8.7  7.0 10.1  3.0

You can flatten a list back into a single vector using the unlist() function. Note that components of different types are coerced to a common type.
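
A brief sketch of unlist() follows, using a hypothetical list list2 built from the first two sleep_total and awake values in this post:

```r
list2 <- list(sleep_total = c(12.1, 17.0), awake = c(11.9, 7.0))

flat <- unlist(list2)      # one numeric vector of length 4
names(flat)                # "sleep_total1" "sleep_total2" "awake1" "awake2"

# Components of different types are coerced to a common type
mixed_flat <- unlist(list("carni", 12.1))
typeof(mixed_flat)         # "character"
```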

4. Table– This is best used for a vector of factors. The following example creates a one-way table for the field vore from msleep_subset, which is of type factor. The output shows the frequency of each factor level within the field vore:

table1 <- table(msleep_subset$vore)
table1
## 
##   carni   herbi insecti    omni 
##       3       4       0       2
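
By default table() leaves missing values out of the counts. As a sketch, the snippet below rebuilds the ten vore values shown above as a standalone factor and uses the useNA argument of table() to include them:

```r
vore <- factor(
  c("carni", "omni", "herbi", "omni", "herbi",
    "herbi", "carni", NA, "carni", "herbi"),
  levels = c("carni", "herbi", "insecti", "omni")
)

table_with_na <- table(vore, useNA = "ifany")   # adds a count for NA when present
table_with_na   # carni 3, herbi 4, insecti 0, omni 2, NA 1
```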

Step 5 – Identifying elements within a data structure

  1. Vector– This example creates a vector vector1 using the field vore from msleep_subset. To find element i of a vector, use the syntax vector[i]. For example, to find the ninth element of vector1:
    vector1 <- msleep_subset$vore
    vector1[9]
    
    ## [1] carni
    ## Levels: carni herbi insecti omni
    
  2. List– This example uses list1 (See Step 4). To find element i of a list, use the syntax list[[i]]– note the use of double square brackets here. For example, to find the second element of list1:
    list1[[2]]
    
    ##  [1] 12.1 17.0 14.4 14.9  4.0 14.4  8.7  7.0 10.1  3.0
    
  3. Matrix– This example uses matrix1 (See Step 4). To find column i of a matrix, use the syntax matrix[,i]. For example, to find the third column of matrix1:
    matrix1[, 3]
    
    ##  [1]  50.000   0.480   1.350   0.019 600.000   3.850  20.490   0.045
    ##  [9]  14.000  14.800
    

    To find row j of a matrix, use the syntax matrix[j,]. For example, to find the first row of matrix1:

    matrix1[1, ]
    
    ## [1] NA NA 50
    

    To find rows a to c of columns x to z of a matrix, use the syntax matrix[a:c, x:z]. For example, to find rows 1, 2, 3 of columns 1, 2, 3 of matrix1:

    matrix1[1:3, 1:3]
    
    ##      [,1]   [,2]  [,3]
    ## [1,]   NA     NA 50.00
    ## [2,]   NA 0.0155  0.48
    ## [3,]   NA     NA  1.35
    
  4. Data frame– This example uses msleep_subset (See Step 1). To find column i of a data frame, use the syntax data frame[i]. For example, to find the third column of msleep_subset:
    msleep_subset[3]
    
    ##     vore
    ## 1  carni
    ## 2   omni
    ## 3  herbi
    ## 4   omni
    ## 5  herbi
    ## 6  herbi
    ## 7  carni
    ## 8   <NA>
    ## 9  carni
    ## 10 herbi
    

    To find row j of a data frame, use the syntax data frame[j,]. For example, to find the third row of msleep_subset:

    msleep_subset[3, ]
    
    ##              name      genus  vore    order conservation sleep_total
    ## 3 Mountain beaver Aplodontia herbi Rodentia           nt        14.4
    ##   sleep_rem sleep_cycle awake brainwt bodywt
    ## 3       2.4          NA   9.6      NA   1.35
    

    To find the element in column i and row j of a data frame, use the syntax data frame[j,i]. For example, to find the element in row three and column three of msleep_subset:

    msleep_subset[3, 3]
    
    ## [1] herbi
    ## Levels: carni herbi insecti omni
    
  5. Table– This example uses table1 (See Step 4). To find element i in a table, use the syntax table[i]. For example, to find element three in table1:
    table1[3]
    
    ## insecti 
    ##       0
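
The difference between single and double square brackets is easy to miss, so here is a minimal sketch on stand-in objects (sleep_demo and list_demo are hypothetical names, values taken from this post). Single brackets keep the container, while double brackets return the contents:

```r
sleep_demo <- data.frame(
  name        = c("Cheetah", "Owl monkey", "Mountain beaver"),
  sleep_total = c(12.1, 17.0, 14.4),
  stringsAsFactors = FALSE
)

one_col_df  <- sleep_demo[2]      # still a data frame, with one column
one_col_vec <- sleep_demo[[2]]    # the column itself, as a numeric vector
class(one_col_df)                 # "data.frame"
class(one_col_vec)                # "numeric"

list_demo <- list(c("carni", "omni"), c(12.1, 17.0))
class(list_demo[2])               # "list": a list of length one
class(list_demo[[2]])             # "numeric": the component itself
```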
    

Step 6 – Converting between data types

To convert between data types, use the as.type() function, where ‘type’ is the data type to convert to, for example as.character() or as.numeric(). A function of this form is available for each of the common data types.

  • When converting a variable of type numeric to character, each numeric data entry becomes a character string. To see this, we can create a data frame of the numeric field sleep_total and convert this field to type character:
    data.frame(msleep_subset$sleep_total, as.character(msleep_subset$sleep_total))
    
    ##    msleep_subset.sleep_total as.character.msleep_subset.sleep_total.
    ## 1                       12.1                                    12.1
    ## 2                       17.0                                      17
    ## 3                       14.4                                    14.4
    ## 4                       14.9                                    14.9
    ## 5                        4.0                                       4
    ## 6                       14.4                                    14.4
    ## 7                        8.7                                     8.7
    ## 8                        7.0                                       7
    ## 9                       10.1                                    10.1
    ## 10                       3.0                                       3
    
  • When converting a field of type factor to numeric, each entry is replaced by the integer code of its level (levels are numbered in the order in which they are listed), so that it is still possible to identify the individual levels. To see this, we can create a data frame of the factor field vore and convert this field to type numeric:
    data.frame(msleep_subset$vore, as.numeric(msleep_subset$vore))
    
    ##    msleep_subset.vore as.numeric.msleep_subset.vore.
    ## 1               carni                              1
    ## 2                omni                              4
    ## 3               herbi                              2
    ## 4                omni                              4
    ## 5               herbi                              2
    ## 6               herbi                              2
    ## 7               carni                              1
    ## 8                <NA>                             NA
    ## 9               carni                              1
    ## 10              herbi                              2
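
One consequence of this behaviour is worth a short sketch. When the factor levels themselves look like numbers, as.numeric() still returns the level codes rather than the values. The factor below is a made-up example, not a field from msleep:

```r
# Hypothetical factor whose labels are numbers stored as text
f <- factor(c("10", "2.5", "10"))

codes  <- as.numeric(f)                 # 1 2 1: the level codes
values <- as.numeric(as.character(f))   # 10 2.5 10: the values themselves
```

Converting to character first recovers the original numbers.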
    

What’s next?

This post has covered the basics of data types and data structures in R. Other posts look in more detail at working with dates in R, viewing data in R and useful functions in R.


 
