Using XML within AnalytiXagility

February 26, 2015 | Alan

This post provides an introduction to the XML library contained within R. It takes us through the process of parsing XML documents, transforming data into data frames for analysis, generating XML documents and validating them against an XML schema definition in AnalytiXagility.

Data Source

The data set contains information about food hygiene ratings in Edinburgh from the Food Standards Agency and can be downloaded directly from here.

Learning Outcomes

The user will learn some of the fundamentals of the XML library which is contained within R.

Workflow

Step 1 – Loading the data

Firstly we need to upload the data to the workspace. You can download the data directly from the link above and make sure it is named food_hygiene.xml to ensure all steps in the code below work correctly.

The following code should be saved as food_hygiene_summary_stats_schema.xsd which we will use later to validate an XML file:

<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="FHRSEstablishmentSummary">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="SummaryDetail" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="RatingValue"/>
              <xs:element type="xs:string" name="BusinessType"/>
              <xs:element type="xs:short" name="count"/>
              <xs:element type="xs:float" name="perc"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Once you have created these two files above, they need to be uploaded to your workspace as workfiles.

Step 2 – Parsing an XML Document in R

Firstly, we need to load in a few libraries and read in the file to the workspace:

library(XML)
library(dplyr)
library(ggplot2)
xap.workfile("food_hygiene.xml")
doc <- xmlParse("food_hygiene.xml", useInternalNodes = TRUE)
class(doc)
[1] "XMLInternalDocument" "XMLAbstractDocument"

Now that we have the document in memory we want to transform it into a data frame. Lets have a look at the contents of the XML to see what we’€™re dealing with. We can access the top node using xmlRoot() and then have a look at some of the sub-nodes below this:

root <- xmlRoot(doc)
root[1]
$Header
<Header>
  <ExtractDate>2015-02-20</ExtractDate>
  <ItemCount>3822</ItemCount>
  <ReturnCode>Success</ReturnCode>
</Header> 

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"    

This shows us the header is the first element in the XML. Then by further subsetting we can get an idea of the structure per business:

root[2][[1]][[1]]
<EstablishmentDetail>
  <FHRSID>522452</FHRSID>
  <LocalAuthorityBusinessID>147905</LocalAuthorityBusinessID>
  <BusinessName>10 HILL PLACE HOTEL</BusinessName>
  <BusinessType>Hotel/bed &amp; breakfast/guest house</BusinessType>
  <BusinessTypeID>7842</BusinessTypeID>
  <AddressLine1>10 Hill Place</AddressLine1>
  <PostCode>EH8 9DS</PostCode>
  <RatingValue>Pass</RatingValue>
  <RatingKey>fhis_pass_en-GB</RatingKey>
  <RatingDate>2012-01-17</RatingDate>
  <LocalAuthorityCode>773</LocalAuthorityCode>
  <LocalAuthorityName>Edinburgh (City of)</LocalAuthorityName>
  <LocalAuthorityWebSite>http://www.edinburgh.gov.uk/foodsafety</LocalAuthorityWebSite>
  <LocalAuthorityEmailAddress>environmentalhealth@edinburgh.gov.uk</LocalAuthorityEmailAddress>
  <Scores/>
  <SchemeType>FHIS</SchemeType>
  <Geocode>
    <Longitude>-3.18408700000000</Longitude>
    <Latitude>55.94623000000000</Latitude>
  </Geocode>
</EstablishmentDetail> 

Step 3 – Transforming XML Data into a Data Frame

Now that we have an idea about the structure of the data we can decide what information we want to extract. For this example we want to create a summary of the data showing the distribution of different ratings for each business type. We also want to extract the longitude and latitude to demonstrate how to handle a slightly nested structure. Therefore, the following fields are relevant:

  • BusinessName
  • BusinessType
  • BusinessTypeID
  • RatingValue
  • RatingDate
  • Longitude
  • Latitude

Given that Longitude and Latitude are nested within the Geocode tag we cannot use the more straightforward xmlToDataFrame() function. However, if you do have an XML with a simple flat structure, there is some information on how it could be used as an alternative here. We also need to check whether all these elements exist for each EstablishmentDetail, which we can do using the summary() function:

summary(doc)
$nameCounts

              AddressLine1               BusinessName               BusinessType 
                      3822                       3822                       3822 
            BusinessTypeID        EstablishmentDetail                     FHRSID 
                      3822                       3822                       3822 
                   Geocode   LocalAuthorityBusinessID         LocalAuthorityCode 
                      3822                       3822                       3822 
LocalAuthorityEmailAddress         LocalAuthorityName      LocalAuthorityWebSite 
                      3822                       3822                       3822 
                RatingDate                  RatingKey                RatingValue 
                      3822                       3822                       3822 
                SchemeType                     Scores                   PostCode 
                      3822                       3822                       3772 
                  Latitude                  Longitude               AddressLine2 
                      3745                       3745                       1320 
   EstablishmentCollection                ExtractDate          FHRSEstablishment 
                         1                          1                          1 
                    Header                  ItemCount                 ReturnCode 
                         1                          1                          1 

$numNodes
[1] 77562

We can see that there are fewer elements for Latitude and Longitude so we need to make sure we correctly assign nulls within the data frame:

col1 <- do.call(rbind, xpathApply(doc, "/FHRSEstablishment/EstablishmentCollection/EstablishmentDetail", function(node) {
  BusinessName <- xmlValue(node[["BusinessName"]])
  if (is.null(BusinessName)) BusinessName <- NA
  food_hygiene_df <- data.frame("BusinessName" = BusinessName, stringsAsFactors = FALSE)
}))

This produces a data frame with one column. The code uses xpathApply() to select the nodes of interest, then we extract the value from each node given that its not null. Now we can do the same for the remaining fields:

col2 <- do.call(rbind, xpathApply(doc, "/FHRSEstablishment/EstablishmentCollection/EstablishmentDetail", function(node) {
  BusinessType <- xmlValue(node[["BusinessType"]])
  if (is.null(BusinessType)) BusinessType <- NA
  food_hygiene_df <- data.frame("BusinessType" = BusinessType, stringsAsFactors = FALSE)
}))

col3 <- do.call(rbind, xpathApply(doc, "/FHRSEstablishment/EstablishmentCollection/EstablishmentDetail", function(node) {
  BusinessTypeID <- xmlValue(node[["BusinessTypeID"]])
  if (is.null(BusinessTypeID)) BusinessTypeID <- NA
  food_hygiene_df <- data.frame("BusinessTypeID" = BusinessTypeID, stringsAsFactors = FALSE)
}))

col4 <- do.call(rbind, xpathApply(doc, "/FHRSEstablishment/EstablishmentCollection/EstablishmentDetail", function(node) {
  RatingValue <- xmlValue(node[["RatingValue"]])
  if (is.null(RatingValue)) RatingValue <- NA
  food_hygiene_df <- data.frame("RatingValue" = RatingValue, stringsAsFactors = FALSE)
}))

col5 <- do.call(rbind, xpathApply(doc, "/FHRSEstablishment/EstablishmentCollection/EstablishmentDetail", function(node) {
  RatingDate <- xmlValue(node[["RatingDate"]])
  if (is.null(RatingDate)) RatingDate <- NA
  food_hygiene_df <- data.frame("RatingDate" = RatingDate, stringsAsFactors = FALSE)
}))

col6 <- do.call(rbind, xpathApply(doc, "/FHRSEstablishment/EstablishmentCollection/EstablishmentDetail/Geocode", function(node) {
  Longitude <- xmlValue(node[["Longitude"]])
  if (is.null(Longitude)) Longitude <- NA
  food_hygiene_df <- data.frame("Longitude" = Longitude, stringsAsFactors = FALSE)
}))

col7 <- do.call(rbind, xpathApply(doc, "/FHRSEstablishment/EstablishmentCollection/EstablishmentDetail/Geocode", function(node) {
  Latitude <- xmlValue(node[["Latitude"]])
  if (is.null(Latitude)) Latitude <- NA
  food_hygiene_df <- data.frame("Latitude" = Latitude, stringsAsFactors = FALSE)
}))

Notice that the last two paths for longitude and latitude contain an additional node. We then combine these into a single data frame and take a look at the first couple of lines:

food_hygiene <- cbind(col1, col2, col3, col4, col5, col6, col7)
head(food_hygiene)
           BusinessName                      BusinessType BusinessTypeID          RatingValue RatingDate
1  10 HILL PLACE HOTEL Hotel/bed & breakfast/guest house           7842                 Pass 2012-01-17
2    10 to 10 in Delhi            Takeaway/sandwich shop           7844                 Pass 2012-05-29
3        100 WATT CAFE           Restaurant/Cafe/Canteen              1                 Pass 2011-12-19
4 108 CHINESE TAKEAWAY            Takeaway/sandwich shop           7844                 Pass 2014-06-12
5                 1780           Restaurant/Cafe/Canteen              1                 Pass 2013-01-15
6     181 Delicatessen           Restaurant/Cafe/Canteen              1 Improvement Required 2014-09-24
          Longitude          Latitude
1 -3.18408700000000 55.94623000000000
2 -3.18489800000000 55.94554000000000
3 -3.18502100000000 55.97210800000000
4 -3.11574400000000 55.95396700000000
5 -3.20510000000000 55.95153500000000
6 -3.20958700000000 55.93573000000000

Step 4 – Analysing the Data

Lets summarise the data, grouping by the RatingValue and BusinessTypeto get counts for each group. We can then calculate the percentage of each business type with each rating value:

calc_counts <- food_hygiene %>% group_by(RatingValue, BusinessType) %>% summarise(count = n())

summary_stats <- data.frame(calc_counts %>% group_by(BusinessType) %>% mutate(perc = count/sum(count)*100) %>% arrange(BusinessType) %>% ungroup)

head(summary_stats)
           RatingValue                        BusinessType count      perc
1 Improvement Required Hospitals/Childcare/Caring Premises    18  5.454545
2                 Pass Hospitals/Childcare/Caring Premises   308 93.333333
3    Pass and Eat Safe Hospitals/Childcare/Caring Premises     4  1.212121
4 Improvement Required   Hotel/bed & breakfast/guest house    28 12.389381
5                 Pass   Hotel/bed & breakfast/guest house   195 86.283186
6    Pass and Eat Safe   Hotel/bed & breakfast/guest house     3  1.327434

We can also visualise these results by business type:

ggplot(data = summary_stats, aes(x = BusinessType, y = perc, fill = factor(RatingValue))) + 
  geom_bar(position="dodge", stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

food_hygiene_rating_values

Step 5 – Generating an XML document

Now lets say we want to create an XML document with these summary statistics and you want it to conform to a specfied XML schema definition, which you can download directly from this post and upload to your workspace (refer Step 1 above). To do this we can write a function that converts a data frame to an XML document. So we want the first row in the data frame above to look like the following:

<SummaryDetail>
    <RatingValue>Improvement Required</RatingValue>
    <BusinessType>Hospitals/Childcare/Caring Premises</BusinessType>
    <count>18</count>
    <perc>5.45454545454545</perc>
</SummaryDetail>

We can split this code up into two functions. One that converts a single row of a data frame into XML, and a second that runs this function for every row as well as adding an appropriate header for the document:

row_to_xml <- function(row, root, data){
  r <- newXMLNode("SummaryDetail", parent = root)
  for (column in colnames(data)) {
    newXMLNode(column, data[row,column], parent = r)
  }
}  

df_to_xml <- function(data, file_name = ""){
  doc <- newXMLDoc()
  root <- newXMLNode("FHRSEstablishmentSummary", doc = doc)
  lapply(1:nrow(data), row_to_xml, root = root, data = data)
  if (file_name != ""){
    saveXML(doc, file = file_name)
  } else {
    print(doc)
  }
}

Now we just need to pass the name of our data frame to the function and the name of the file we wish to save it as:

df_to_xml(summary_stats, file = "food_hygiene_summary_stats.xml")

Alternatively, we can just print the results to screen by not supplying a file name:

df_to_xml(summary_stats)

The user can save this file as a workfile in AnalytiXagility if they have this functionality switched on. You can do so with the following function:

xap.save_workfile("food_hygiene_summary_stats.xml")

Step 6 – Validating an XML document

To ensure our XML document has the correct structure we can validate it relative to an XML schema using the xmlSchemaValidate() function. We just need to read in the schema to the workspace and then run the function:

xap.workfile("food_hygiene_summary_stats_schema.xsd")
schema <- xmlTreeParse("food_hygiene_summary_stats_schema.xsd", isSchema =TRUE, useInternal = TRUE)
xmlSchemaValidate(schema, "food_hygiene_summary_stats.xml")
$status
[1] 0

$errors
list()
attr(,"class")
[1] "XMLStructuredErrorList"

attr(,"class")
[1] "XMLSchemaValidationResults"

The status is 0 for a validated XML document, and non-zero for an invalid document.


 

Leave a Reply

Your email address will not be published. Required fields are marked *