Viewing and Subsetting Data Frames

February 26, 2014 | Alan

Data frames provide an intuitive and flexible way to store data within R. This post will make a start on detailing the extensive functionality of data frames by showing how to represent your data as a data frame in the AnalytiXAgility platform, how to get some basic information about a data frame and how to extract and view subsets of your data.

Related Blog Posts

More Advanced data management

Learning Outcomes

  • How to find some basic properties of a data frame.
  • Different ways of subsetting data frames.
  • Ordering data frames by rows or columns.

Workflow

Step 1 – Obtaining a Data Frame

Data frames are used for storing data tables in R and are made up of a list of column vectors, belonging to potentially different classes, each with the same length. Datasets that have been loaded into your workspace can be read into an R session in the AnalytiXAgility platform, by using rshReadTable(), where the data will be represented as a data frame (see Day 1: Getting started with the R Console in the AnalytiXAgility platform) so you can use any dataset in your workspace to try some of these examples. Alternatively, typing data() into the console and hitting enter will give a list of the example datasets within R and calling the data() function again with the name of one of these datasets as an argument will ensure it is able to be used. If you’re using an example dataset from R, check that it is indeed a data frame, by using class(), before trying the rest of the workflow.

Throughout this post I’ll be using a data frame created from scratch using the data.frame() function as follows.

sex <- c("M", "F", "M", "F", "M", "F", "M", "F")
postcode <- c("AB123CD", "EF11GH", "ZY12WX", "AB123CD", "RQ676QR", "LM54NP", 
    "TT94HS", "BB19QC")
height <- c(184.3, 179.3, 179, 174.5, 174.3, 169.5, 185, 170.3)
weight <- c(90, 80, 80, 72, 80, 70, 98, 90)
smoker <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
patients <- data.frame(sex, postcode, height, weight, smoker)

Step 2 – Viewing your Data Frame

Once you have a data frame, just enter the name it is assigned to and it will be displayed as a table in the console with the column names, or header, displayed along the top and the row names displayed down the left.

patients
##   sex postcode height weight smoker
## 1   M  AB123CD  184.3     90   TRUE
## 2   F   EF11GH  179.3     80  FALSE
## 3   M   ZY12WX  179.0     80  FALSE
## 4   F  AB123CD  174.5     72  FALSE
## 5   M  RQ676QR  174.3     80   TRUE
## 6   F   LM54NP  169.5     70  FALSE
## 7   M   TT94HS  185.0     98  FALSE
## 8   F   BB19QC  170.3     90   TRUE

For larger tables, you may not want to have the entire table displayed in your console and here the str() function is useful for giving an overview of the structure of a data frame.

str(patients)
## 'data.frame':    8 obs. of  5 variables:
##  $ sex     : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1
##  $ postcode: Factor w/ 7 levels "AB123CD","BB19QC",..: 1 3 7 1 5 4 6 2
##  $ height  : num  184 179 179 174 174 ...
##  $ weight  : num  90 80 80 72 80 70 98 90
##  $ smoker  : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...

This tells us that this object is a data frame with 8 observations and 5 variables, corresponding to 8 rows and 5 columns, and then lists the columns vertically, showing the class and first few values of each. Printing out all 8 rows isn’t really a problem here but if we were working with a larger data frame then, in the interest of keeping the console tidy, using the head() and tail() functions will display only the first few rows of the data frame.

# The first 6 rows of 'patients'
head(patients)
##   sex postcode height weight smoker
## 1   M  AB123CD  184.3     90   TRUE
## 2   F   EF11GH  179.3     80  FALSE
## 3   M   ZY12WX  179.0     80  FALSE
## 4   F  AB123CD  174.5     72  FALSE
## 5   M  RQ676QR  174.3     80   TRUE
## 6   F   LM54NP  169.5     70  FALSE

By default 6 rows will be returned by head() and tail() but this can be changed by passing a second argument to the function.

# The last 3 rows of 'patients'
tail(patients, 3)
##   sex postcode height weight smoker
## 6   F   LM54NP  169.5     70  FALSE
## 7   M   TT94HS  185.0     98  FALSE
## 8   F   BB19QC  170.3     90   TRUE

There are also functions which will return only a few properties of a data frame rather than a full overview. For example, to find the number of rows or columns of a data frame use nrow() or ncol() respectively. Using dim() will return both of these values.

nrow(patients)
## [1] 8
ncol(patients)
## [1] 5
dim(patients)
## [1] 8 5

Similarly, colnames() (or just names()) can be used to return the names of all the columns of a data frame and rownames() will return the names of all the rows. Again, dimnames() will return column names and row names but this time as a two item list.

names(patients)
## [1] "sex"      "postcode" "height"   "weight"   "smoker"
rownames(patients)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
dimnames(patients)
## [[1]]
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
## 
## [[2]]
## [1] "sex"      "postcode" "height"   "weight"   "smoker"

As well as simply extracting the row and column names from a data frame, the above functions can also be used to rename the rows and columns. For example, when we created the patients data frame earlier the row names defaulted to the integers 1 to 8 but we can change this as follows.

rownames(patients) <- c("patient 1", "patient 2", "patient 3", "patient 4", 
    "patient 5", "patient 6", "patient 7", "patient 8")
patients
##           sex postcode height weight smoker
## patient 1   M  AB123CD  184.3     90   TRUE
## patient 2   F   EF11GH  179.3     80  FALSE
## patient 3   M   ZY12WX  179.0     80  FALSE
## patient 4   F  AB123CD  174.5     72  FALSE
## patient 5   M  RQ676QR  174.3     80   TRUE
## patient 6   F   LM54NP  169.5     70  FALSE
## patient 7   M   TT94HS  185.0     98  FALSE
## patient 8   F   BB19QC  170.3     90   TRUE
# or, more succinctly, using the paste() function
rownames(patients) <- paste("patient", 1:8)
patients
##           sex postcode height weight smoker
## patient 1   M  AB123CD  184.3     90   TRUE
## patient 2   F   EF11GH  179.3     80  FALSE
## patient 3   M   ZY12WX  179.0     80  FALSE
## patient 4   F  AB123CD  174.5     72  FALSE
## patient 5   M  RQ676QR  174.3     80   TRUE
## patient 6   F   LM54NP  169.5     70  FALSE
## patient 7   M   TT94HS  185.0     98  FALSE
## patient 8   F   BB19QC  170.3     90   TRUE

Step 3 – Subsetting

The str(), head() and tail() functions are great for quickly getting an idea of what your data looks like, (or checking that some data manipulation you’ve just performed has had the desired result) but to view a more specific snippet, or to extract just the data we are interested in from a data frame, we need to use some of R’s built-in subsetting functions.

One option for subsetting is to use the subset() function which takes a data frame and filters by a condition made from logical operators involving the columns of the data frame. For example:

patients_sub <- subset(patients, sex == "M")
patients_sub
##           sex postcode height weight smoker
## patient 1   M  AB123CD  184.3     90   TRUE
## patient 3   M   ZY12WX  179.0     80  FALSE
## patient 5   M  RQ676QR  174.3     80   TRUE
## patient 7   M   TT94HS  185.0     98  FALSE

subset() will return a data frame and so you can use this newly created object as an argument in subset().

patients_subsub <- subset(patients_sub, height > 180)
patients_subsub
##           sex postcode height weight smoker
## patient 1   M  AB123CD  184.3     90   TRUE
## patient 7   M   TT94HS  185.0     98  FALSE

If we were only interested in the final product we could combine the above subset calls into one call.

patients_sub <- subset(patients, sex == "M" & height > 180)
patients_sub
##           sex postcode height weight smoker
## patient 1   M  AB123CD  184.3     90   TRUE
## patient 7   M   TT94HS  185.0     98  FALSE

We can also specify which columns we would like subset() to return by adding a select argument to the function call.

patients_sub <- subset(patients, sex == "M" & height > 180, select = c("height", 
    "weight", "smoker"))
patients_sub
##           height weight smoker
## patient 1  184.3     90   TRUE
## patient 7  185.0     98  FALSE

Note that, in the case above, we don’t actually have to tell subset() that the third argument is select since, by default, this is already the expected third argument. However, being verbose allows us to forget about the order in which a function expects its arguments as long as we remember the names of the arguments. The two examples below will produce the same subset, the first by entering the arguments in the expected order and the second by naming each argument put into the function.

patients_sub1 <- subset(patients, sex == "M" & height > 180, c("height", "weight", 
    "smoker"))
patients_sub2 <- subset(select = c("height", "weight", "smoker"), x = patients, 
    subset = sex == "M" & height > 180)

The next example will throw an error since we have deviated from the expected argument order but haven’t specified the arguments’ names.

patients_sub3 <- subset(patients, c("height", "weight", "smoker"), sex == "M" & 
    height > 180)  # this will produce an error
## Error: 'subset' must be logical

As mentioned in Data Types in R, the syntax df$x returns column x of data frame df and df[i,j] returns the value in the ith row and jth column of data frame df. In fact we can obtain all the functionality of subset() using the [, function instead with some slightly different syntax. The first example using subset() could be replicated as follows.

patients_sub <- patients[patients$sex == "M", ]
patients_sub
##           sex postcode height weight smoker
## patient 1   M  AB123CD  184.3     90   TRUE
## patient 3   M   ZY12WX  179.0     80  FALSE
## patient 5   M  RQ676QR  174.3     80   TRUE
## patient 7   M   TT94HS  185.0     98  FALSE
# we can get R to check whether the results are the same.
identical(subset(patients, sex == "M"), patients[patients$sex == "M", ])
## [1] TRUE

Note that there is just a space left after the comma in patients[patients$sex == "M", ]. This is because the first argument of [, describes how to filter the rows and the second describes how to filter the columns, so an argument left blank just means there is no filter imposed. An advantage of [, over subset() is that you can subset using indices as well as logical operators which allows for greater flexibility. For example,

patients_ind1 <- patients[1:5, ]
patients_ind1
##           sex postcode height weight smoker
## patient 1   M  AB123CD  184.3     90   TRUE
## patient 2   F   EF11GH  179.3     80  FALSE
## patient 3   M   ZY12WX  179.0     80  FALSE
## patient 4   F  AB123CD  174.5     72  FALSE
## patient 5   M  RQ676QR  174.3     80   TRUE
patients_ind2 <- patients[4:7, 2:4]
patients_ind2
##           postcode height weight
## patient 4  AB123CD  174.5     72
## patient 5  RQ676QR  174.3     80
## patient 6   LM54NP  169.5     70
## patient 7   TT94HS  185.0     98
patients_sample <- patients[sample(nrow(patients), 5), sample(ncol(patients), 
    3)]  # random sample of 5 rows of 3 random columns
patients_sample
##           weight height sex
## patient 5     80  174.3   M
## patient 8     90  170.3   F
## patient 7     98  185.0   M
## patient 2     80  179.3   F
## patient 1     90  184.3   M

We can also subset the columns by name as in select for subset().

patients_sub <- patients[1:5, c("height", "weight", "smoker")]
patients_sub
##           height weight smoker
## patient 1  184.3     90   TRUE
## patient 2  179.3     80  FALSE
## patient 3  179.0     80  FALSE
## patient 4  174.5     72  FALSE
## patient 5  174.3     80   TRUE

It’s even possible to filter over the columns with conditions on the rows as you would for the rows with conditions on the columns.

patients_sub <- patients[1:5, as.numeric(patients[1, ])%%1 == 0]
patients_sub
##           sex postcode weight smoker
## patient 1   M  AB123CD     90   TRUE
## patient 2   F   EF11GH     80  FALSE
## patient 3   M   ZY12WX     80  FALSE
## patient 4   F  AB123CD     72  FALSE
## patient 5   M  RQ676QR     80   TRUE

An important difference to note between subset() and [, is that whenever you reference a column in subset() it can just be called by name whereas, when referencing a column in [, you have to use the syntax df$x to make explicit which data frame you are referring to since, in general, R will not search for variables inside a data frame. We can avoid having to do this by using attach() with the desired data frame as an argument to cause R to search this data frame when looking for variables. However, this can cause problems if there are already variables sharing names with the data frame’s columns in the workspace so it often may be best just to resign yourself to a few extra keystrokes. If you are attaching data frames then remember to use detach() once you no longer need to reference them or look into with() as an alternative.

Much of the same functionality can be obtained from both subset() and [, but if you check the documentation for subset(), you’ll find that it comes with a warning and a recommendation for the use of [, instead. Hadley Wickham gives an in-depth description of why you should heed this warning here.

Step 4 – Ordering

Finally, once we have obtained a set of data in which we are interested, we may want to order our data frame in a logical manner. This can be done using the order() function and similar syntax to when we were subsetting. For example, to sort patients by height from shortest to tallest we could do the following.

patients <- patients[order(patients$height), ]
head(patients)  # use head to save space but check that the output is what was expected
##           sex postcode height weight smoker
## patient 6   F   LM54NP  169.5     70  FALSE
## patient 8   F   BB19QC  170.3     90   TRUE
## patient 5   M  RQ676QR  174.3     80   TRUE
## patient 4   F  AB123CD  174.5     72  FALSE
## patient 3   M   ZY12WX  179.0     80  FALSE
## patient 2   F   EF11GH  179.3     80  FALSE

We can do the same but from tallest to shortest by adding - to the front of the argument of order().

patients <- patients[order(-patients$height), ]
head(patients)
##           sex postcode height weight smoker
## patient 7   M   TT94HS  185.0     98  FALSE
## patient 1   M  AB123CD  184.3     90   TRUE
## patient 2   F   EF11GH  179.3     80  FALSE
## patient 3   M   ZY12WX  179.0     80  FALSE
## patient 4   F  AB123CD  174.5     72  FALSE
## patient 5   M  RQ676QR  174.3     80   TRUE

Ordering will be done alphanumerically if the class of the column is “character”.

patients <- patients[order(as.character(patients$postcode)), ]
head(patients)
##           sex postcode height weight smoker
## patient 1   M  AB123CD  184.3     90   TRUE
## patient 4   F  AB123CD  174.5     72  FALSE
## patient 8   F   BB19QC  170.3     90   TRUE
## patient 2   F   EF11GH  179.3     80  FALSE
## patient 6   F   LM54NP  169.5     70  FALSE
## patient 5   M  RQ676QR  174.3     80   TRUE

Make sure therefore that when you want to order numerically then the column you choose to order by is indeed numeric, otherwise you’ll end up with the following.

alphanum <- as.character(1:12)
alphanum
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
order(alphanum)
##  [1]  1 10 11 12  2  3  4  5  6  7  8  9

Additional arguments can be passed to order() and each one will be used in the ordering, with precedence being given to earlier arguments.

patients <- patients[order(patients$sex, -patients$height), ]  # orders by sex then height
patients
##           sex postcode height weight smoker
## patient 2   F   EF11GH  179.3     80  FALSE
## patient 4   F  AB123CD  174.5     72  FALSE
## patient 8   F   BB19QC  170.3     90   TRUE
## patient 6   F   LM54NP  169.5     70  FALSE
## patient 7   M   TT94HS  185.0     98  FALSE
## patient 1   M  AB123CD  184.3     90   TRUE
## patient 3   M   ZY12WX  179.0     80  FALSE
## patient 5   M  RQ676QR  174.3     80   TRUE

As with subsetting, the second argument of [, is for operations on the columns so by using order() here we can also sort the columns of a data frame.

patients <- patients[, order(names(patients))]
patients
##           height postcode sex smoker weight
## patient 2  179.3   EF11GH   F  FALSE     80
## patient 4  174.5  AB123CD   F  FALSE     72
## patient 8  170.3   BB19QC   F   TRUE     90
## patient 6  169.5   LM54NP   F  FALSE     70
## patient 7  185.0   TT94HS   M  FALSE     98
## patient 1  184.3  AB123CD   M   TRUE     90
## patient 3  179.0   ZY12WX   M  FALSE     80
## patient 5  174.3  RQ676QR   M   TRUE     80

What’s Next

More advanced data management

Subsetting with regex

Aggregating data

Creating visualizations

Further Reading

www.statmethods.net/management/subset.html

www.statmethods.net/management/sorting.html

www.ats.ucla.edu/stat/r/faq/subset_R.htm

stackoverflow.com/questions/17340541/most-efficient-way-of-subsetting-dataframes


 

Leave a Reply

Your email address will not be published. Required fields are marked *