### February 26, 2014 | Alan

This blog details the different data types in R and the different data structures in which these data types are stored. An open data set will be used to provide examples.

### Related Blog Posts

- Getting Started with the R Console in the AnalytiXagility Platform
- Day 1 in the AnalytiXagility Platform
- Working with Dates in R

## Data Source

The dataset used in this blog is the `msleep` data set from the R library `ggplot2`. It provides a framework for understanding mammalian sleep patterns.

## Workflow

### Index

To help you understand the data structures and data types in R, we cover the following topics in this blog:

- Reading in the data
- Finding the classification of data types
- Data Types
- Data Structures
- Identifying elements within a data structure
- Conversion between Data Types

### Step 1 – Reading in the data

The data used for this blog is from the R library `ggplot2`. The code chunk below shows how to read this data into R. This automatically assigns the data to the variable `msleep`:

```
library(ggplot2)
data(msleep)
```

Note that the data is read into R in the form of a data frame. For the purposes of this blog we are not using the full data frame, but a new data frame containing only its first 10 rows. To create this subset and assign it to a variable `msleep_subset`, use the following syntax:

```
msleep_subset <- msleep[1:10, ]
```
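As a minimal sketch of the same row-subsetting syntax that runs without `ggplot2`, the example below builds a small hypothetical data frame and keeps its first three rows. The names `toy` and `toy_subset` and all of the values are assumptions made for illustration; they are not part of `msleep`.

```r
# Hypothetical stand-in for msleep; names and values are invented
toy <- data.frame(
  name = c("a", "b", "c", "d", "e"),
  hours = c(10.5, 8, 12, 6.5, 9),
  stringsAsFactors = FALSE
)

# Rows 1 to 3, all columns: the empty slot after the comma means "every column"
toy_subset <- toy[1:3, ]

nrow(toy_subset)  # number of rows kept
ncol(toy_subset)  # number of columns is unchanged
```

Leaving the column slot empty is what keeps the result a data frame with every field intact.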

### Step 2 – Finding the classification of data types within a data structure

There are a number of R commands that can be used to classify either a whole data set/subset or the individual fields within the data set. Before doing this we may need to identify the specific fields in the data set and understand how to call a specific field.

To find the names of the fields within the data frame `msleep_subset`, use the function `names(x)`:

```
names(msleep_subset)
```

```
## [1] "name" "genus" "vore" "order"
## [5] "conservation" "sleep_total" "sleep_rem" "sleep_cycle"
## [9] "awake" "brainwt" "bodywt"
```

To call a specific field from this subset, use the syntax **data_subset$column**, where:

- **data_subset** is the subset of your data
- **column** is the name of the field.

So to call the field `bodywt` from `msleep_subset` and assign it to a variable `msleep_subset1`, use the following syntax:

```
msleep_subset1 <- msleep_subset$bodywt
msleep_subset1
```

```
## [1] 50.000 0.480 1.350 0.019 600.000 3.850 20.490 0.045
## [9] 14.000 14.800
```
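The `$` operator is one of two common ways to pull out a single field by name. The sketch below, which uses a hypothetical data frame `toy` with invented values, shows that double square brackets with the field name as a string give the same result. This form can be handy when the field name is held in a variable.

```r
# Hypothetical data frame; values are for illustration only
toy <- data.frame(bodywt = c(50, 0.48, 1.35), awake = c(11.9, 7, 9.6))

by_dollar <- toy$bodywt         # extract the field by name with $
by_bracket <- toy[["bodywt"]]   # same field, name supplied as a string

field <- "awake"                # a field name stored in a variable
by_variable <- toy[[field]]

identical(by_dollar, by_bracket)
```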

There are a number of useful functions that we can use to find the classification of our data. Examples of these are given in the code chunks below:

1. `class()` – Returns the **class** of the input. This can be applied either to a whole data set or to an individual field. Applying the function to the data frame `msleep_subset` tells us that the input is a **data frame** structure:

```
class(msleep_subset)
```

```
## [1] "data.frame"
```

Applying the function to the field `bodywt` tells us that the input is of type **numeric**:

```
class(msleep_subset$bodywt)
```

```
## [1] "numeric"
```

2. `typeof()` – Returns the **type** of the input. Again this can be applied either to a whole data set or to an individual field. Applying the function to the data frame `msleep_subset` tells us that the input is a **list** structure:

```
typeof(msleep_subset)
```

```
## [1] "list"
```

Applying the function to the field `conservation` tells us that the input is of type **integer**:

```
typeof(msleep_subset$conservation)
```

```
## [1] "integer"
```

Note that in the above examples there are differences between what the functions `class()` and `typeof()` return. These differences are due to the fact that every R object has:

- a type (or storage mode) - the way in which that object is stored
- a class - the object type.
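A minimal sketch of how the two functions can disagree, using hypothetical objects `f`, `x` and `df` created only for this illustration. A factor is stored as integer codes, a numeric vector as doubles, and a data frame as a list of columns.

```r
# Hypothetical objects for illustration
f <- factor(c("carni", "herbi", "carni"))
x <- c(12.1, 17, 14.4)
df <- data.frame(f = f, x = x)

class(f);  typeof(f)    # "factor"     and "integer"
class(x);  typeof(x)    # "numeric"    and "double"
class(df); typeof(df)   # "data.frame" and "list"
```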

3. `attributes()` – Returns the specific **attributes** of the input. It can be called on any R object, but a plain vector with no attributes set returns `NULL`, so it is most informative when applied to a whole data set. For a data frame it returns the following:

- column names
- row names
- class of the data structure.

```
attributes(msleep_subset)
```

```
## $names
## [1] "name" "genus" "vore" "order"
## [5] "conservation" "sleep_total" "sleep_rem" "sleep_cycle"
## [9] "awake" "brainwt" "bodywt"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $class
## [1] "data.frame"
```
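For comparison, `attributes()` can also be called on individual vectors. The sketch below uses two hypothetical vectors, `plain` and `vore`, invented for this illustration: the plain numeric vector has no attributes at all, while the factor carries its levels and its class.

```r
# Hypothetical vectors for illustration
plain <- c(50, 0.48, 1.35)
vore <- factor(c("carni", "omni", "herbi"))

attributes(plain)   # NULL: a plain numeric vector carries no attributes
attributes(vore)    # a list holding the factor's levels and class

vore_attr_names <- names(attributes(vore))
```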

4. `sapply()` – This function provides a way to apply a pre-specified function over every field in a data set. In this example we are using built-in R functions, but you could use a function that you had created. As input, `sapply()` accepts data fields in the form of either a **list** or a **vector**, together with the function that you want to run over those fields. Where possible, it simplifies the output to a **vector**. Applying the function to the data frame `msleep_subset` using the `class()` function tells us the **class** of each field in the input, displayed in **vector** format:

```
sapply(msleep_subset, class)
```

```
## name genus vore order conservation
## "character" "character" "factor" "character" "factor"
## sleep_total sleep_rem sleep_cycle awake brainwt
## "numeric" "numeric" "numeric" "numeric" "numeric"
## bodywt
## "numeric"
```

Applying the function to the data subset using the `typeof()` function tells us the **type** of each field in the input, displayed in **vector** format:

```
sapply(msleep_subset, typeof)
```

```
## name genus vore order conservation
## "character" "character" "integer" "character" "integer"
## sleep_total sleep_rem sleep_cycle awake brainwt
## "double" "double" "double" "double" "double"
## bodywt
## "double"
```

5. `lapply()` – This function is used in the same way as `sapply()`. The difference here is that the output is returned in the form of a **list**. Applying the function to the data frame `msleep_subset` using the `class()` function tells us the **class** of each field in the input, displayed in **list** format:

```
lapply(msleep_subset, class)
```

```
## $name
## [1] "character"
##
## $genus
## [1] "character"
##
## $vore
## [1] "factor"
##
## $order
## [1] "character"
##
## $conservation
## [1] "factor"
##
## $sleep_total
## [1] "numeric"
##
## $sleep_rem
## [1] "numeric"
##
## $sleep_cycle
## [1] "numeric"
##
## $awake
## [1] "numeric"
##
## $brainwt
## [1] "numeric"
##
## $bodywt
## [1] "numeric"
```

Applying the function to the data subset using the `typeof()` function tells us the **type** of each field in the input, displayed in **list** format:

```
lapply(msleep_subset, typeof)
```

```
## $name
## [1] "character"
##
## $genus
## [1] "character"
##
## $vore
## [1] "integer"
##
## $order
## [1] "character"
##
## $conservation
## [1] "integer"
##
## $sleep_total
## [1] "double"
##
## $sleep_rem
## [1] "double"
##
## $sleep_cycle
## [1] "double"
##
## $awake
## [1] "double"
##
## $brainwt
## [1] "double"
##
## $bodywt
## [1] "double"
```
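The sketch below puts the two functions side by side on a small hypothetical data frame (`toy`, invented for this illustration) and also passes a user-written function, which here counts missing values per field. It is only meant to show the difference in return type, not a recommended workflow.

```r
# Hypothetical data frame for illustration
toy <- data.frame(
  name = c("a", "b", "c"),
  bodywt = c(50, NA, 1.35),
  stringsAsFactors = FALSE
)

s <- sapply(toy, class)   # simplified to a named character vector
l <- lapply(toy, class)   # always a named list

# A user-defined function: number of missing values in each field
n_missing <- sapply(toy, function(field) sum(is.na(field)))

is.character(s)
is.list(l)
n_missing
```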

6. `str()` – Returns the **structure** of the input. It can be applied to any R object, including an individual field, but it is most useful on a whole data set. Applying the function to the data subset tells us:

- the data structure
- the dimensions of the data structure
- the column names
- the data type of each column
- sample values.

```
str(msleep_subset)
```

```
## 'data.frame': 10 obs. of 11 variables:
## $ name : chr "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
## $ genus : chr "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
## $ vore : Factor w/ 4 levels "carni","herbi",..: 1 4 2 4 2 2 1 NA 1 2
## $ order : chr "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
## $ conservation: Factor w/ 7 levels "","cd","domesticated",..: 5 NA 6 5 3 NA 7 NA 3 5
## $ sleep_total : num 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3
## $ sleep_rem : num NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA
## $ sleep_cycle : num NA NA NA 0.133 0.667 ...
## $ awake : num 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21
## $ brainwt : num NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982
## $ bodywt : num 50 0.48 1.35 0.019 600 ...
```
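As a small aside, `str()` also gives a compact summary for a single vector. The sketch below uses hypothetical vectors `bodywt` and `vore` created for this illustration, and stores the printed text with `capture.output()` so it can be inspected.

```r
# Hypothetical vectors for illustration
bodywt <- c(50, 0.48, 1.35)
vore <- factor(c("carni", "omni", "herbi"))

str(bodywt)   # one-line summary of a numeric vector
str(vore)     # factors report their levels and integer codes

# capture.output() stores what str() prints, one string per printed line
bodywt_summary <- capture.output(str(bodywt))
```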

### Step 3 – Data types

There are a number of data types which are common in R. To test for a particular data type, use the `is.type()` function, where *type* is a particular data type, for example, **numeric**. This corresponds to the output from the `class()` function, not the `typeof()` function.

1. **Numeric** data can be stored in decimal or integer format and is probably the most common data type used within R. **Numeric** format covers both positive and negative integers and decimals. To test if a field is **numeric**, use the `is.numeric()` function. Applying this function to the field `bodywt` from the data frame `msleep_subset` returns the value TRUE, indicating that the field is **numeric**:

```
is.numeric(msleep_subset$bodywt)
```

```
## [1] TRUE
```

**Integer** format, as the name suggests, covers only whole numbers (integers). To test if a field is in **integer** format, use the `is.integer()` function. The following example illustrates the fact that the `is.type()` function does not correspond to the output of the `typeof()` function. Applying the `typeof()` function to the field `conservation` from `msleep_subset` returned that the variable was of type **integer**. However, applying the `is.integer()` function to this field returns the value FALSE, indicating that the field is not **integer**:

```
is.integer(msleep_subset$conservation)
```

```
## [1] FALSE
```

The `is.integer()` function returns TRUE if the field has been explicitly converted to type **integer** (see Step 6):

```
is.integer(as.integer(msleep_subset$conservation))
```

```
## [1] TRUE
```

2. **Character** data can be handled in either **character** or **factor** format. The **character** format means that data entries are enclosed within a string, that is, "data". For example, looking at the field `name` from `msleep_subset`:

```
msleep_subset$name
```

```
## [1] "Cheetah" "Owl monkey"
## [3] "Mountain beaver" "Greater short-tailed shrew"
## [5] "Cow" "Three-toed sloth"
## [7] "Northern fur seal" "Vesper mouse"
## [9] "Dog" "Roe deer"
```

To test if a field is of **character** type, use the `is.character()` function. Applying this function to the field `name` from `msleep_subset` returns the value TRUE, indicating that the field is of type **character**:

```
is.character(msleep_subset$name)
```

```
## [1] TRUE
```

The **factor** format is slightly different: data entries are no longer enclosed within a string, and a second output line is returned, giving information about the different factor levels. For example, looking at the field `vore` from `msleep_subset`:

```
msleep_subset$vore
```

```
## [1] carni omni herbi omni herbi herbi carni <NA> carni herbi
## Levels: carni herbi insecti omni
```

To test if a field is of **factor** type, use the `is.factor()` function. Applying this function to the field `vore` from `msleep_subset` returns the value TRUE, indicating that the field is of type **factor**:

```
is.factor(msleep_subset$vore)
```

```
## [1] TRUE
```

3. **Logical** format represents data fields with only 'TRUE' and 'FALSE' as values. The data set does not contain any fields in **logical** format. However, to test if a field contains **logical** values, use the `is.logical()` function.

4. **Missing** data elements can be included in a vector, but null data elements cannot.

- A null value is a data element that has been left blank.
- A missing value is a data element which, for some reason, is missing but has been replaced by a code such as 'NA'.

To test for the presence of **missing** data elements in a field, use the `is.na()` function. This function returns TRUE for each data element that is missing and FALSE for each one that is not. Applying this function to the field `brainwt` from `msleep_subset` shows that there are some missing values for that field:

```
is.na(msleep_subset$brainwt)
```

```
## [1] TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
```
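Because `is.na()` returns one logical value per element, its result can be summarised with other base R functions. The sketch below uses a hypothetical vector `brainwt_toy` (values chosen for illustration) to count and locate the missing elements, and also shows that `NULL` is dropped when a vector is built whereas `NA` is kept.

```r
# Hypothetical vector with two missing values
brainwt_toy <- c(NA, 0.0155, NA, 0.00029, 0.423)

n_missing <- sum(is.na(brainwt_toy))        # TRUE counts as 1, so this counts the NAs
where_missing <- which(is.na(brainwt_toy))  # positions of the missing elements
mean_present <- mean(brainwt_toy, na.rm = TRUE)  # ignore NAs when averaging

# NA occupies a position in the vector; NULL does not
length(c(1, NA, NULL))
```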

5. **Date** formats are not applicable for this particular example, because none of the data fields in our sample data set is of type **Date**. Unlike other data types, you cannot test if a field is of type **Date** using an `is.type()` function. It is possible, however, to convert a field to type **Date** (see Step 6). Note that in the AnalytiXagility platform, all data fields of type **Date** are represented by class **POSIX**. For more information, see Working with Dates in R.
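Since base R has no `is.Date()` function, one common workaround is to check the class with `inherits()`. The sketch below is an illustration under that assumption; the variables `d` and `p` and the example date are hypothetical and are not taken from the data set.

```r
# Convert a character string to Date; ISO "YYYY-MM-DD" needs no format string
d <- as.Date("2014-02-26")
class(d)              # "Date"
inherits(d, "Date")   # TRUE

# A date-time stored as POSIXct, with the time zone fixed for reproducibility
p <- as.POSIXct("2014-02-26 10:00:00", tz = "UTC")
class(p)                 # "POSIXct" "POSIXt"
inherits(p, "POSIXct")   # TRUE
inherits(p, "Date")      # FALSE: a POSIXct value is not of class Date
```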

### Step 4 – Data structures

There are a number of data structures that are commonly used in R. We look at each one of these, using the `msleep` data for illustration purposes.

1. **Data frame** – This is composed of individual **vectors**, all of the same length. These **vectors** can all be of the *same data type*, or of *different data types*. To illustrate this, we can take a sample of fields from the data frame `msleep_subset` to create a new data frame `msleep_subset2`. The new data frame contains the fields `sleep_cycle`, `brainwt` and `bodywt`, which are all of type **numeric**. To take a sample of fields from your data frame, do the following:

```
msleep_subset2 <- subset(msleep_subset, select = c(sleep_cycle, brainwt, bodywt))
msleep_subset2
```

```
## sleep_cycle brainwt bodywt
## 1 NA NA 50.000
## 2 NA 0.01550 0.480
## 3 NA NA 1.350
## 4 0.1333 0.00029 0.019
## 5 0.6667 0.42300 600.000
## 6 0.7667 NA 3.850
## 7 0.3833 NA 20.490
## 8 NA NA 0.045
## 9 0.3333 0.07000 14.000
## 10 NA 0.09820 14.800
```

2. **Matrix** – This is composed of **vectors** of the same length, but these **vectors** must all be of the *same data type*. Because the **vectors** in our new **data frame** `msleep_subset2` are all of type **numeric**, we can hold this **data frame** in the form of a **matrix**. To do this, we first need to know the **number of rows** in the **data frame**, so that we know the dimensions of our **matrix**. We find this using the `nrow()` function:

```
nrow(msleep_subset2)
```

```
## [1] 10
```

To construct `matrix1` using `msleep_subset2`, do the following:

```
matrix1 <- matrix(c(msleep_subset2$sleep_cycle, msleep_subset2$brainwt, msleep_subset2$bodywt), nrow = 10, ncol = 3, byrow = FALSE)
matrix1
```

```
## [,1] [,2] [,3]
## [1,] NA NA 50.000
## [2,] NA 0.01550 0.480
## [3,] NA NA 1.350
## [4,] 0.1333 0.00029 0.019
## [5,] 0.6667 0.42300 600.000
## [6,] 0.7667 NA 3.850
## [7,] 0.3833 NA 20.490
## [8,] NA NA 0.045
## [9,] 0.3333 0.07000 14.000
## [10,] NA 0.09820 14.800
```

Note that:

- *byrow = FALSE* indicates that the matrix should be filled one column at a time
- *nrow* indicates the desired number of rows
- *ncol* indicates the desired number of columns.

To name the columns of `matrix1`, use the `colnames()` function:

```
colnames(matrix1) <- c("sleep_cycle", "brainwt", "bodywt")
```
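As an alternative to building the matrix by hand, base R's `as.matrix()` converts an all-numeric data frame directly and keeps the column names. This is a different route from the `matrix()` call used in this post, shown here only as a sketch on a hypothetical data frame `num_df`. The second part illustrates why all columns must share a data type: mixing numbers and strings coerces everything to character.

```r
# Hypothetical all-numeric data frame
num_df <- data.frame(sleep_cycle = c(NA, 0.1333), bodywt = c(50, 0.019))

m <- as.matrix(num_df)   # numeric matrix; column names carried over
dim(m)
colnames(m)

# Binding a numeric and a character vector forces a single common type
mixed <- cbind(c(1, 2), c("a", "b"))
typeof(mixed)            # "character"
```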

3. **List** – This can be made up of components of *any data type*. The following example creates a list using the fields `vore` and `sleep_total` from `msleep_subset`, which are of types **factor** and **numeric** respectively:

```
list1 <- list(msleep_subset$vore, msleep_subset$sleep_total)
list1
```

```
## [[1]]
## [1] carni omni herbi omni herbi herbi carni <NA> carni herbi
## Levels: carni herbi insecti omni
##
## [[2]]
## [1] 12.1 17.0 14.4 14.9 4.0 14.4 8.7 7.0 10.1 3.0
```

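You can flatten a **list** back into a single **vector** using the `unlist()` function. Note that the result has a single data type, so components of different types are coerced to a common one.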
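A brief sketch of `unlist()` on two hypothetical lists, `list2` and `list3`, made up for this illustration. When every component is numeric the result stays numeric; when types are mixed, everything is coerced to a common type, here character.

```r
# Hypothetical lists for illustration
list2 <- list(a = c(1, 2), b = c(3, 4, 5))
flat <- unlist(list2)      # one numeric vector of length 5
names(flat)                # names are built from the component names

list3 <- list(1, "x")
flat_mixed <- unlist(list3)   # the number 1 becomes the string "1"
```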

4. **Table** – This is best used for a **vector of factors**. The following example creates a one-way **table** for the field `vore` from `msleep_subset`, which is of type **factor**. The output clearly shows the frequency of each factor level within the field `vore`:

```
table1 <- table(msleep_subset$vore)
table1
```

```
##
## carni herbi insecti omni
## 3 4 0 2
```
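By default `table()` leaves missing values out of the counts, which is why the frequencies above sum to 9 rather than 10. The sketch below uses a hypothetical factor `vore_toy`, invented for this illustration, to show the `useNA` argument, which adds a count for `NA`.

```r
# Hypothetical factor with one missing value and one unused level
vore_toy <- factor(c("carni", "omni", "herbi", NA, "carni"),
                   levels = c("carni", "herbi", "insecti", "omni"))

t_default <- table(vore_toy)                   # NA is not counted
t_with_na <- table(vore_toy, useNA = "ifany")  # adds a count for NA

t_default
t_with_na
```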

### Step 5 – Identifying elements within a data structure

1. **Vector** – This example creates a vector `vector1` using the field `vore` from `msleep_subset`. To find element i of a vector, use the syntax `vector[i]`. For example, to find the ninth element of `vector1`:

```
vector1 <- msleep_subset$vore
vector1[9]
```

```
## [1] carni
## Levels: carni herbi insecti omni
```

2. **List** – This example uses `list1` (see Step 4). To find element i of a list, use the syntax `list[[i]]`. Note the use of double square brackets here. For example, to find the second element of `list1`:

```
list1[[2]]
```

```
## [1] 12.1 17.0 14.4 14.9 4.0 14.4 8.7 7.0 10.1 3.0
```

3. **Matrix** – This example uses `matrix1` (see Step 4). To find column i of a matrix, use the syntax `matrix[,i]`. For example, to find the third column of `matrix1`:

```
matrix1[, 3]
```

```
## [1] 50.000 0.480 1.350 0.019 600.000 3.850 20.490 0.045
## [9] 14.000 14.800
```

To find row j of a matrix, use the syntax `matrix[j,]`. For example, to find the first row of `matrix1`:

```
matrix1[1, ]
```

```
## [1] NA NA 50
```

To find rows a to c of columns x to z of a matrix, use the syntax `matrix[a:c, x:z]`. For example, to find rows 1, 2, 3 of columns 1, 2, 3 of `matrix1`:

```
matrix1[1:3, 1:3]
```

```
## [,1] [,2] [,3]
## [1,] NA NA 50.00
## [2,] NA 0.0155 0.48
## [3,] NA NA 1.35
```

4. **Data frame** – This example uses `msleep_subset` (see Step 1). To find column i of a data frame, use the syntax `data frame[i]`. For example, to find the third column of `msleep_subset`:

```
msleep_subset[3]
```

```
## vore
## 1 carni
## 2 omni
## 3 herbi
## 4 omni
## 5 herbi
## 6 herbi
## 7 carni
## 8 <NA>
## 9 carni
## 10 herbi
```

To find row j of a data frame, use the syntax `data frame[j,]`. For example, to find the third row of `msleep_subset`:

```
msleep_subset[3, ]
```

```
## name genus vore order conservation sleep_total
## 3 Mountain beaver Aplodontia herbi Rodentia nt 14.4
## sleep_rem sleep_cycle awake brainwt bodywt
## 3 2.4 NA 9.6 NA 1.35
```

To find the element in row j and column i of a data frame, use the syntax `data frame[j,i]`. For example, to find the element in row three and column three of `msleep_subset`:

```
msleep_subset[3, 3]
```

```
## [1] herbi
## Levels: carni herbi insecti omni
```

5. **Table** – This example uses `table1` (see Step 4). To find element i in a table, use the syntax `table[i]`. For example, to find element three in `table1`:

```
table1[3]
```

```
## insecti
## 0
```
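The difference between single and double square brackets is easy to miss, so here is a minimal sketch using a hypothetical list `list_toy` and data frame `df_toy` (both invented for this illustration). Single brackets keep the container, while double brackets return the element inside it.

```r
# Hypothetical list and data frame for illustration
list_toy <- list(c("carni", "omni"), c(12.1, 17))
df_toy <- data.frame(id = c(1, 2), sleep_total = c(12.1, 17))

class(list_toy[2])     # "list": a shorter list holding one component
class(list_toy[[2]])   # "numeric": the component itself

class(df_toy[2])       # "data.frame": a one-column data frame
class(df_toy[[2]])     # "numeric": the column as a plain vector

df_toy[2, "sleep_total"]   # columns can also be indexed by name
```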

### Step 6 – Converting between data types

To convert between data types, use the `as.type()` function, where 'type' is a particular data type. This function can be used for all data types.

- When converting a variable of type **numeric** to **character**, the numeric data entries are enclosed in a character string. To see this, we can create a data frame of the **numeric** field `sleep_total` and convert this field to type **character**:

```
data.frame(msleep_subset$sleep_total, as.character(msleep_subset$sleep_total))
```

```
## msleep_subset.sleep_total as.character.msleep_subset.sleep_total.
## 1 12.1 12.1
## 2 17.0 17
## 3 14.4 14.4
## 4 14.9 14.9
## 5 4.0 4
## 6 14.4 14.4
## 7 8.7 8.7
## 8 7.0 7
## 9 10.1 10.1
## 10 3.0 3
```

- When converting a field of type **factor** to **numeric**, each level is assigned a unique number, so that it is still possible to identify the individual levels. To see this, we can create a data frame of the **factor** field `vore` and convert this field to type **numeric**:

```
data.frame(msleep_subset$vore, as.numeric(msleep_subset$vore))
```

```
## msleep_subset.vore as.numeric.msleep_subset.vore.
## 1 carni 1
## 2 omni 4
## 3 herbi 2
## 4 omni 4
## 5 herbi 2
## 6 herbi 2
## 7 carni 1
## 8 <NA> NA
## 9 carni 1
## 10 herbi 2
```
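One consequence of the level numbering is worth a short sketch: if a factor's labels are themselves numbers, `as.numeric()` still returns the level codes, not the labels. The hypothetical factor `f` below is invented for this illustration; converting through `as.character()` first is one common way to recover the original values.

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "2", "10", "5"))
levels(f)   # "10" "2" "5": levels are sorted as text

codes <- as.numeric(f)                 # level codes: 1 2 1 3
values <- as.numeric(as.character(f))  # labels read as numbers: 10 2 10 5
```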

## What’s next?

This post has covered the basics of data types and data structures in R. Other posts look in more detail at working with dates in R, viewing data in R and useful functions in R.