Formatting Date-times R

February 26, 2014 | Alan

The purpose of this blog is to introduce users to formatting date-times in R. All variables of class Date which are imported into the AnalytiXAgility platform are represented by a date-time. Examples will be provided using an arbitrary vector of date-times and an open dataset.

Related blog posts

Data Source

The open dataset used in this blog is the lakers data from R, within the package lubridate. It contains play by play statistics of each LA Lakers basketball game in the 2008-2009 season, including the date and time on the clock at which play was made.

Learning Outcomes

In this blog we will be formatting date-times in R, both with built-in functions, plus functionality provided via the lubridate package. It will introduce:

  • defining an object where date-time formats are all the same
  • defining an object where date-time formats are not the same
  • arithmetic with date-times
  • summary statistics with date-times

Workflow

Step 1- Reading in the data

The open data used for this blog is from the R package lubridate. The code chunk below shows how to read this data into R. This automatically assigns the data to a variable lakers.

library(lubridate)
data(lakers)
head(lakers)
##       date opponent game_type  time period     etype team
## 1 20081028      POR      home 12:00      1 jump ball  OFF
## 2 20081028      POR      home 11:39      1      shot  LAL
## 3 20081028      POR      home 11:37      1   rebound  LAL
## 4 20081028      POR      home 11:25      1      shot  LAL
## 5 20081028      POR      home 11:23      1   rebound  LAL
## 6 20081028      POR      home 11:22      1      shot  LAL
##                player result points  type  x  y
## 1                                 0       NA NA
## 2           Pau Gasol missed      0  hook 23 13
## 3 Vladimir Radmanovic             0   off NA NA
## 4        Derek Fisher missed      0 layup 25  6
## 5           Pau Gasol             0   off NA NA
## 6           Pau Gasol   made      2  hook 25 10

Step 2 – Data Manipulation

In this step I defined a sample of lakers by extracting five rows of the date and time columns, removing any duplicate date values. The date and time elements were then combined to form an element of a single character string, before conversion to class POSIXct. A set of step by step instructions on how this was done can be found here: Working with Date-times and Time Zones in R. This manipulation results in the object type_posixct, which will be used in Step 5 – Step 7.

lakers_unique_dates <- lakers[!duplicated(lakers$date), ]
lakers_unique_dates_times <- subset(lakers_unique_dates, select = c(date, time))
lakers_unique_dates_times_subset <- lakers_unique_dates_times[1:5, ]
rownames(lakers_unique_dates_times_subset) <- NULL
lakers_unique_dates_times_subset$date <- as.character(lakers_unique_dates_times_subset$date)
lakers_unique_dates_times_subset$date <- as.Date(lakers_unique_dates_times_subset$date, 
    "%Y%m%d")
dates_times <- paste(lakers_unique_dates_times_subset$date, lakers_unique_dates_times_subset$time)
type_posixct <- as.POSIXct(dates_times)
type_posixct
## [1] "2008-10-28 12:00:00 GMT" "2008-10-29 12:00:00 GMT"
## [3] "2008-11-01 12:00:00 GMT" "2008-11-05 12:00:00 GMT"
## [5] "2008-11-09 12:00:00 GMT"

Step 3- Defining a vector where date-time formats are the same

Consider the following arbitrary vector of date-times, in which all the date-times are in the format %d %B %Y %H:%M:%S. This has been assigned to the variable dates_same_format. It should be noted that the letters “d, B, Y, H, M, S” represent days, months (in unabbreviated character format), years,hours, minutes and seconds respectively. For a table of the different date component forms that can be used, see Working with Dates in R.

dates_same_format <- c("18 November 1990 13:05:00", "24 July 1954 14:30:00", 
    "26 June 1987 08:25:00", "25 May 1957 18:50:00", "7 March 2008 23:15:00")
dates_same_format
## [1] "18 November 1990 13:05:00" "24 July 1954 14:30:00"    
## [3] "26 June 1987 08:25:00"     "25 May 1957 18:50:00"     
## [5] "7 March 2008 23:15:00"

To define this vector as a date-time object of class POSIXct, apply the as.POSIXct() function with the correct format of the input specified:

dates_same_format <- as.POSIXct(dates_same_format, format = "%d %B %Y %H:%M:%S")

Step 4- Defining a vector in which date formats are not the same

A problem occurs in defining a vector where the date-time formats are not all the same. This is due to the fact that the as.POSIXct() function only allows you to specify one particular date format at a time.

Consider the following vector which contains elements of two different date-time formats: %d %B %Y %H:%M:%S and %d %B %y %H:%M:%S. This has been assigned to the variable dates_different_format:

dates_different_format <- c("18 November 1990 13:05:00", "24 July 1954 14:30:00", 
    "26 June 87 08:25:00", "25 May 57 18:50:00", "7 March 2008 23:15:00")
dates_different_format
## [1] "18 November 1990 13:05:00" "24 July 1954 14:30:00"    
## [3] "26 June 87 08:25:00"       "25 May 57 18:50:00"       
## [5] "7 March 2008 23:15:00"

Applying the as.POSIXct() function to this vector results in the following output:

dates_different_format <- as.POSIXct(dates_different_format, format = "%d %B %Y %H:%M:%S")
dates_different_format
## [1] "1990-11-18 13:05:00 GMT" "1954-07-24 14:30:00 BST"
## [3] "0087-06-26 08:25:00 GMT" "0057-05-25 18:50:00 GMT"
## [5] "2008-03-07 23:15:00 GMT"

It can be seen that the third and fourth components of the vector have been defined incorrectly and are thus invalid. This is because we had specified when defining the format that the year component would have four digits (“Y”), not two (“y”). Basically the format that we choose should be chosen to suit as many date-times in the vector as possible. In our case the majority of the objects were in the format where the year component was represented by four digits, therefore it made sense to use the format in the code chunk above. This will ensure that the majority of date-times in the vector are converted correctly.

Step 5- Arithmetic with date-times

For the purposes of the examples in this step, we will be using the two objects dates_same_format and type_posixct which are both of class POSIXct. It should be noted that to carry out an arithmetic operation, the two objects must be of the same class. In the following code chunk we find the time difference in days between type_posixct and dates_same_format:

type_posixct - dates_same_format
## Time differences in days
## [1]  6554.0 19820.9  7799.2 18791.8   246.5
## attr(,"tzone")
## [1] ""

If an alternative unit of time rather than days is required, the difftime() function can be used, which has an optional argument to assign the units that you want to use. If units=“auto”, a suitable set of units is chosen, the largest possible (with the exclusion of “weeks”), in which all the absolute differences are greater than one. To find the time difference in units=“auto”:

difftime(type_posixct, as.POSIXct(dates_same_format), units = "auto")
## Time differences in days
## [1]  6554.0 19820.9  7799.2 18791.8   246.5
## attr(,"tzone")
## [1] ""

To find the time difference in weeks:

difftime(type_posixct, as.POSIXct(dates_same_format), units = "weeks")
## Time differences in weeks
## [1]  936.28 2831.56 1114.17 2684.54   35.22
## attr(,"tzone")
## [1] ""

To find the time difference in hours:

difftime(type_posixct, as.POSIXct(dates_same_format), units = "hours")
## Time differences in hours
## [1] 157295 475703 187181 451002   5917
## attr(,"tzone")
## [1] ""

To find the time difference in minutes:

difftime(type_posixct, as.POSIXct(dates_same_format), units = "mins")
## Time differences in mins
## [1]  9437695 28542150 11230835 27060130   355005
## attr(,"tzone")
## [1] ""

To find the time difference in seconds:

difftime(type_posixct, as.POSIXct(dates_same_format), units = "secs")
## Time differences in secs
## [1] 5.663e+08 1.713e+09 6.739e+08 1.624e+09 2.130e+07
## attr(,"tzone")
## [1] ""

Step 6- Making Arithmetic with Date-times more accurate

Care needs to be taken when carrying out arithmetic with date-times- consideration needs to be made for instances such as leap years and the putting back/forward of clock times in autumn and spring respectively. In the following example, using package lubridate, we can check if the year 2008 was a leap year, using the leap_year() function. We apply this function in the code chunk below and find that it was indeed a leap year:

leap_year(2008)
## [1] TRUE

Obviously this will lead to some accuracy issues due to the fact that in a leap year there are 366, rather than 365 days. Package lubridate helps to improve accuracy here as it supplies two time span classes. The first is durations which are ‘mathematically correct’- adding on a year will add on 365 days. The second is periods, which will return the accurate answer. The following code chunk will add on a year using the durations class.

as.Date("2008-01-01") + dyears(1)
## [1] "2008-12-31"

The next code chunk will add on a year using the periods class:

as.Date("2008-01-01") + years(1)
## [1] "2009-01-01"

Step 7- Summary functions with Date-times

It is possible to perform a number of summary calculations using date-times, to include the mean, variance, median, maximum, minimum and range. In the code chunk below we find the mean of type_posixct. The additional argument ‘na.rm=TRUE’ takes account of any missing values that may be in the data.

mean_type_posixct <- mean(type_posixct, na.rm = TRUE)

This returns both the date and time components for the POSIXct object. If we only required the date component, we could use the function round_date() from the package lubridate, to round to the nearest day. The syntax required for this is shown in the code chunk below:

round_date(mean_type_posixct, "day")
## [1] "2008-11-03 GMT"

To remove the timestamp, we must convert mean_type_posixct to class character:

as.character(mean_type_posixct)
## [1] "2008-11-02 12:00:00"

Further Reading


 

Leave a Reply

Your email address will not be published. Required fields are marked *