BHC Data Analysis Details

Boston History Company Data Analysis

Detail Page

This page is intended for those who have read the Project Overview and are looking for more details. I will try not to repeat anything I’ve already covered. If you wish to review the background or context for a particular section, please click the back button to return to the corresponding section of the Overview page.

All graphs and visualizations were created in Power BI.

Ranking Details

Back

TripAdvisor’s ranking strategy is based on three key pillars: relevancy, product competitiveness, and popularity among site users. BHC has invested considerable effort in ensuring their products were both relevant and competitive. However, I determined that more could be done to address the third pillar.

Source: https://operatorresources.viator.com/viators-ranking-and-sort-overview/

Tour popularity is determined by various factors, including click-through rate, page views, traveler searches, and bookings. Some of these factors are measurable: for example, TripAdvisor makes page view statistics available to operators. Bookings are also measurable, and there is a strong correlation between the two. Therefore, more page views should lead to more bookings (and more reviews, all of which translate to higher ranking).

Suggested changes included updating photographs and adopting catchier tour titles. More importantly, my suggestions sparked a discussion among company executives about what changes would be most effective. The results will be verifiable, since the metrics I highlighted establish a baseline against which any improvements could be measured.

Reviews play a prominent role in TripAdvisor’s ranking algorithm. BHC consistently gets excellent reviews, but the Company’s five-star rating did not seem to be reflected in their ranking. In fact, one particular competitor placed much higher in the rankings despite receiving less-than-stellar star ratings. I decided to investigate – and this led to an interesting discovery.

A closer look at the data revealed that this competitor had far more helpful reviews than any other company – or even the Freedom Trail itself. “Helpful” reviews are written reviews which a site visitor has marked as helpful by clicking the thumbs-up icon next to the review. Nearly all of the favorable reviews for this competitor had been marked as helpful. By contrast, a full year had passed since BHC’s last helpful review.

Source: TripAdvisor tour reviews (downloaded via ExportComments.com)

Visitor interaction is another key factor in TripAdvisor’s ranking algorithm. While the actual weighting is unknown, it certainly seems as if all those thumbs-ups served to boost the competitor’s ranking. The Boston History Company will look at ways to leverage this knowledge in the future.

Weather Details

Back

Source: company spreadsheets; historic weather data from visualcrossing.com

This visualization shows that as temperatures climb, more people tend to sign up for the 10:00 a.m. tour – the earliest tour of the day, finishing around 12:30 p.m. For a more detailed look at the underlying data, I created a drill-through page showing a matrix of the dates and times affected, the weather conditions for that day, and the number of guests per tour.

Source: company spreadsheets; historic weather data from visualcrossing.com

Note that start times may vary according to what is being offered at the time. For example, BHC didn’t offer a 10:00 a.m. tour in 2024 until Thursday, June 20th.

It might be worth noting that only one day in 2023 registered maximum temperatures in excess of 90°F (and only seven days in 2024). However, these days occurred in June, July, and August – the heart of the tourist season, when larger groups would be expected.

Source: company spreadsheets; historic weather data from visualcrossing.com

For some charts, I found it useful to group certain weather conditions together. For example, I grouped the categories “Snow, Rain, Partially Cloudy,” “Snow, Rain, Overcast,” and “Snow, Partially Cloudy” together as simply “Snow.” That made the charts easier to read, but I added the original categories to the legend. I did this in order to show, among other things, that “Freezing Rain” is not necessarily comparable to other forms of rain.

This chart is also connected to a drill-through page – and I created a slicer to filter the chart by day of week, since it often doesn’t make sense to combine weekdays and weekends when discussing average group size.

The bottom line on weather: while heat and precipitation do have some affect on the number of individuals who sign up for a tour, it’s important to remember that a significant percentage of guests will be determined to take the tour regardless of conditions.

The Red Sox Effect

Back

This portion of the analysis served as my proof-of-concept. Based on my eight years of experience working at a prominent historic site (i.e., the Old North Church), I suspected that visitors who come to Boston to attend a baseball game might look book a tour of the Freedom Trail, as well. So I downloaded CSV versions of Red Sox schedules for the last two years (I only had access to the last two years’ worth of BHC tour data). Then I added calculated columns to the data model indicating whether the tour date coincided with a Red Sox home game and whether the game was played during the day (before 4:00 p.m.) or at night. Finally, I created measures to calculate the average number of guests under each condition.

Source: company spreadsheets; Red Sox schedules, 2023-2024

I used card visualizations to show the results, and I created a tool-tip page for each card to show the underlying data. That strategy probably took a little more work than creating a single drill-through page, but it saves all the toggling back and forth between pages.

Screen capture showing the underlying data for a card visual as a tool-tip

I also downloaded Boston Celtics schedules, but the baseball season aligns so perfectly with the summer tourist walking-the-Freedom-Trail season that it made sense to start with the Red Sox and save a basketball analysis for another day.

Cancellation Details

Back

The longer gaps between booking and tour date for cancelled bookings are interesting. However, in the absence of additional details such as the date or reason for cancellation, it is difficult to classify this information as actionable. Disallowing bookings more than 30 days in advance simply because the booking might be cancelled could be considered an overreach, and it may violate Viator’s rules for product listings.

Still, a followup investigation might be warranted. While Viator does not provide the date and reason for cancellation, that data may be available from other booking sources. BHC might also wish to collect it themselves by means of a survey.

Source: Viator booking data, 2021-2024

With regard to the age factor, my initial assumption was that groups with children would account for a greater proportion of cancelled/amended bookings. Clearly I was wrong. However, the finding about seniors led to an even more shocking discovery: in 2024, only four guests who booked their tours through Viator identified themselves as senior citizens.

Now, this chart must be taken with a grain of salt (and a five-pound bag of caveats). First, without incentive to identify themselves as seniors, it’s probable that many people simply check the box labeled “Adults.” Second, Viator/TripAdvisor bookings only account for 20-30% of all BHC guests. I did not have access to data from other booking sources such as Get Your Guide. Third, the Boston History Company itself keeps no record of the age of tour guests.

Still, the steep drop-off in self-identified seniors – combined with the higher rate of cancellations – constitutes a red flag. It is conceivable that something about BHC’s Freedom Trail tour is turning off seniors. Then again, it could just be the aforementioned lack of incentive. When informed of this statistic, the company CEO immediately suggested implementing a senior citizen discount.

As an aside, the number of guests identified as children seems low, as well. I did not investigate this, apart from double-checking the figures against the source data. For now, it’s just another data point that’s worth flagging for further study.

Word Cloud Details

Back

I scraped TripAdvisor review data for BHC and five of its closest competitors using an online tool provided by ExportComments.com. I also scraped reviews for the Freedom Trail itself. Word clouds were created using Word Cloud, a Power BI visual downloaded from AppSource.

I had originally intended to do a deeper dive into the review text using Microsoft Azure. Sentiment analysis and key phrase extraction were on the menu – but after wasting several days trying to get that functionality up and running, I determined that it would not work well (if at all) without a paid subscription. Every call to Azure counted against a daily limit. The limit could not easily be determined, and more often than not, one or both columns threw errors.

When the sentiment scores did work, I did not find the results particularly impressive. Too often, the scores did not seem to match the mood of the review. The AI behind the model probably needed more training. But for this project, I had neither the time nor the budget – so I focused on word clouds, which did not require an Azure connection.

In order to create individualized word clouds for each guide, I first had to create an alias table. Reviewers frequently get creative when spelling names – one guide’s name was invoked using no fewer than eight different spellings.

I used the alias table to create a calculated column in the model and populate it with the guide’s definitive, correctly spelled name. Originally, I attempted to do this in DAX. But DAX does not support regex, and it took a lot of iterations to catch all the edge cases. For example, one guide’s name is Sam. If Sam’s name were separated by spaces, that would have been fine – but “Sam” could be surrounded by quotes or followed by the whole gamut of punctuation marks, or at the beginning or end of the review with no space before or after. In fact, any review that used the word “same” could be assigned to Sam.

In the end, I solved the vast majority of edge cases. The DAX code was ugly, but it worked. Still, I thought I could do better, so I tried again later using Power Query’s M language. That solution is described below – but just for fun, you can see the original ugly DAX code at the end of this section.

finding the guide’s name in a tour review

using power query:

A guide’s name could be bounded by just about any punctuation mark. For example, an apostrophe could be made using straight quotes or smart quotes. In the original DAX solution, every possible Unicode character had to be handled separately.

But in Power Query, I could use the the Text.Select function to create a new “cleaned” column consisting of only review words bounded by spaces. All punctuation and numbers were removed. This made searching the text for the guide’s name much simpler. But names found at the beginning and end of reviews were still getting missed, so I added a leading and trailing space for each set of review words. I also appended an ” x” at the end because Power BI automatically removes trailing spaces.

    // Remove all punctuation & numbers 
    #"Add Column" = Table.TransformColumnTypes(
    Table.AddColumn(#"Inserted Literal", "Review words", 
        each " " &  // Add whitespace at start of review
        Text.Select([Review],
            // list of characters to keep
            List.Combine({{"A".."Z"},{"a".."z"},{" "}}) 
            ) 
        & " x"), // append an " x" after each review
        {{"Review words", type text}})

Then I created a calculated “Guide” column in Power BI desktop. There were still a couple of problems: first, if a name was followed by an ‘s and the apostrophe was removed, that still left a trailing s. The solution required an additional loop through the tour guide alias table, this time looking for names followed by an s. Second, a couple of tour guides share the same first name – which meant the name would be duplicated in the Guide column. To prevent this, I added a comma at the end of each concatenated name and used the comma position to return only the first name.

Here is the DAX code for the calculated column:

Guide =
var nameString = 
CONCATENATEX('BHC Tour Guides', 
if(
    // First we search for a name bounded by white space
    SEARCH(" " &
        FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & " ",
        'Reviews-FT-BHC'[Review words],,0)
    // We removed the apostrophes, but not the possessive "s"
    // So go back and look for names followed by an "s"
    ||
    SEARCH(" " &
        FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & "s ",
        'Reviews-FT-BHC'[Review words],,0)

     <> 0,'BHC Tour Guides'[First name],""),
    // Append a comma at the end of the iteration. This comma will be used later to find just the first concatenated name (if any)
    ","
)

// Did we find any names?
var notBlank = nameString <> BLANK()
// If so, did we find more than one? 
// i.e., is there a comma separating two names?
var commaPosition = IF(notBlank, FIND(",", nameString, 1, -1), -1
)
// Return only the first name in the string
return
IF(notBlank && commaPosition > -1, LEFT(nameString, commaPosition - 1), BLANK()
)

This solution is much cleaner, and opens the door for additional text analysis – for example, I have used the “cleaned” column to create a list of the most frequently found words in BHC’s tour reviews. But as promised (and for comparison’s sake), I will paste the original solution below.

using dax:

As previously stated, DAX is not the best tool for regex-style text searching. After finishing my analysis for BHC, I was only too happy to go back and look for a better solution. Nevertheless, here it is, for your amusement: quite possibly the ugliest code I have ever written that actually worked:

Guide =
var nameString =
// iterate through the BHC Tour Guides table using a Text iterator
CONCATENATEX('BHC Tour Guides',
if(
// First, find Guide aliases mentioned at the start of a review ("aliases" account for variations in spelling)
// These names would not be preceded by a space, but will be followed by a space or a comma
find(
FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & " ",
'Reviews-FT-BHC'[Review],1,0)
||
find(
FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & ",",
'Reviews-FT-BHC'[Review],1,0)
// Find Guide names preceded & followed by a space. This prevents "remark" being read as "Mark," "Hanna" as "Anna," etc
||
SEARCH(
" " & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & " ",
'Reviews-FT-BHC'[Review],,0)
// This would be easier with Regex. But we need to account for names followed by a comma, period, etc instead of a space
||
SEARCH(
" " & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & ",",
'Reviews-FT-BHC'[Review],,0)
||
SEARCH(
" " & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & ".",
'Reviews-FT-BHC'[Review],,0)
||
SEARCH(
" " & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & "!",
'Reviews-FT-BHC'[Review],,0)
||
SEARCH(
" " & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & "~?",
'Reviews-FT-BHC'[Review],,0)
||
SEARCH(
" " & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & "-",
'Reviews-FT-BHC'[Review],,0)
||
SEARCH(
" " & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & "’",
'Reviews-FT-BHC'[Review],,0)
||
// Who woulda thunk it? DAX treats smart quotes (or apostrophes) & straight quotes differently ...
SEARCH(
" " & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & "'",
'Reviews-FT-BHC'[Review],,0)
||
SEARCH(
"(" & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & ")",
'Reviews-FT-BHC'[Review],,0)
||
SEARCH(
"(" & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & "~?",
'Reviews-FT-BHC'[Review],,0)
||
SEARCH(
" " & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & "…",
'Reviews-FT-BHC'[Review],,0)
||
SEARCH(
"""" & FIRSTNONBLANK('BHC Tour Guides'[Alias],1) & """",
'Reviews-FT-BHC'[Review],,0)
// If anything is found, concatenate the Guide's standard first name
//But the iteration continues, so if another Guide alias is found, it will be added to
the cell value
<> 0,'BHC Tour Guides'[First name],""),
// Append a comma at the end of the iteration. This comma will be used later to find
just the first concatenated name (if any)
","
)
var notBlank = nameString <> BLANK()
var commaPosition = IF(notBlank, FIND(",", nameString, 1, -1), -1
)
// Return only the first Guide name in the string
return
IF(notBlank && commaPosition > -1, LEFT(nameString, commaPosition - 1), BLANK()
)