SCENARIO¶
You are a data scientist supporting a South East Queensland tourism business. The business provides bespoke tours to international tourists, particularly beach-goers. To give clients the best-quality experience, management wants to stay across information relevant to them. Management would like to see if it's possible to produce a weekly report on relevant information from diverse sources such as beach conditions and relevant news. Management doesn't require a polished report at this point, but they do need a convincing story that shows the concept could be helpful for the business.
Scenario details¶
- Business name: QTraveler
- Major marketing channel: Facebook, Instagram
- QTraveler has been using Facebook and Instagram for marketing for many years. Their Facebook and Instagram channels are well developed and have a decent amount of followers.
- Major customer origins: USA and UK
- QTraveler has branches in the USA and UK. Therefore, QTraveler mainly focuses on the US and UK markets.
- Major tour destination: Gold Coast
- QTraveler specializes in offering beach-focused tours. Along with the beaches of Gold Coast, they curate bespoke experiences that include attractions in Queensland.
A. Question¶
Marketing is one of the effective ways to grow a tourism business suggested by the Queensland Government (2024). As QTraveler is a small business, digital marketing is the most cost-effective way for it to reach international customers. Because QTraveler already has established Facebook and Instagram channels, this data analysis targets digital marketing on these social media platforms.
There are five steps in converting potential customers into actual customers, which are awareness, interest, conversion, action and retention (Queensland Government, 2022).
This data analysis provides a sample weekly report. It focuses on generating insights that help QTraveler improve its marketing strategy in the first two stages (awareness and interest):
- Awareness: Create marketing content related to the company to raise customers' awareness of it.
- Interest: Provide engaging bespoke travel packages to spark customers' interest in the products.
There are several stakeholders involved in this analysis. They are:
- Management of QTraveler: Management can use the insights for strategic planning and business decisions.
- Marketing team of QTraveler: The marketing team can use the insights to create attractive marketing content for Facebook and Instagram.
- Tour guides of QTraveler: Tour guides can use the insights to understand customer concerns and interests. By addressing relevant topics, they can build relationships with customers, enhance the tour experience, and contribute to a positive perception of the company.
- Customers of QTraveler: The customers' overall experience is affected by how well the tours align with their expectations and interests.
In order to meet the above goals, this data analysis focuses on the following questions.
1. What are the recent discussions on the destination that might draw the attention of customers or potential customers?
When someone decides to travel, the first thing they usually consider is the destination, because it affects the duration, budget, travel type and many other aspects of the trip. A key consideration is the customer's perception of the destination. Knowing the current discussions about, and perceptions of, Australia, Queensland, Brisbane or the Gold Coast therefore helps QTraveler adjust its marketing strategy. It also gives tour guides ideas about customers' interests: they can research popular topics and offer extra information to future customers, which makes the experience feel tailor-made. For example, Brisbane is currently recognized as the world's most romantic city, so tour guides can highlight romantic spots and bespoke experiences during the tour. Customers may then leave positive reviews or recommend QTraveler to others, which builds the company's reputation and boosts the business in the long run.
2. What are the current travel preferences among beach-goers that can help design travel packages or adjust the tours?
Another consideration when choosing a destination is what travellers want to do there. As QTraveler specializes in tours for beach-goers, understanding the current discussions about beaches helps it create marketing content and design tour packages.
B. Data¶
Using news articles to get the trends¶
The objective of this analysis is to understand current discussions and trends about different topics. News media play a key role in shaping and reflecting public discourse. They often highlight trends and report on popular topics. Therefore, analyzing news articles is a useful method for gaining insights into recent discussions and trends. By examining news articles, we can understand what is currently capturing public attention.
Ethical considerations in data source¶
Social media could also be a good data source for understanding trends, as its content is diverse and updated quickly. However, new social media content is massive in volume and widely spread, so cleaning the data and extracting relevant topics takes a lot of work. To analyse social media data, the company would need either to outsource the work to a big data company or to employ an in-house data specialist. As a small business, QTraveler cannot realistically allocate that level of resources to a weekly analysis, so we chose the more cost-effective option of analysing news articles. In addition, social media is an open platform where anyone can create content, and some of that content may be fake; using it risks drawing unsuitable insights. In contrast, news articles are written by professionally trained journalists and go through editorial processes. As a tourism company serving diverse customers, QTraveler is better served by a conservative approach to creating marketing content, which ensures that the content caters to most potential customers. Therefore, we chose to use news sources instead of social media sources.
Use The Guardian as the source of data¶
The Guardian is a good data source for this analysis due to the following reasons:
International presence: The Guardian is an international news organization with production offices in the UK, US, and Australia. This aligns with the markets of QTraveler, whose major customers are from the UK and US.
High coverage: According to data from SimilarWeb summarized by Press Gazette, The Guardian ranked as the 6th most visited English-language news website in March 2025, with 334.8 million monthly visits. This demonstrates the broad reach of its content, making the analysis more relevant.
Free data access: While many web scraping services are available online, most of them are costly. As a small business, QTraveler prefers an affordable solution. The Guardian provides a free API for retrieving news articles, which makes it a budget-friendly choice for QTraveler.
Ethical considerations in news source¶
Despite the above advantages of the Guardian, relying on a single source can introduce bias. News media are not always impartial because journalists may hold specific views or ideologies, so the articles may skew the outcomes of the analysis. Furthermore, the Guardian publishes not only news reports but also opinion editorials, which are subjective and may compromise the objectivity of the analysis. Therefore, it is recommended to include multiple news sources in the actual analysis.
How to use the Guardian API¶
The Guardian API provides an interface for fetching news content with keywords and filters. Three filters are useful for this data analysis process:
- Keywords: Specify the keywords to search for in the articles.
- Published date: Specify the date range of the contents that we want.
- Production offices: Specify the production offices that the contents are published by.
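As a minimal sketch of how these three filters could map onto a request, the snippet below assembles a search URL with the standard library only and sends nothing over the network. The parameter names `q`, `from-date`, `to-date` and `production-office` are taken from the Guardian Content API documentation; the helper name `build_search_url`, the placeholder key, and the use of `|` to combine several offices are assumptions made for illustration and should be checked against the current API documentation.

```python
from datetime import date, timedelta
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://content.guardianapis.com/search"

def build_search_url(keywords, offices, days_back=7, today=None, api_key="YOUR_API_KEY"):
    """Build a search URL covering the last `days_back` days (hypothetical helper)."""
    today = today or date.today()
    params = {
        "q": keywords,                                                 # keyword filter
        "from-date": (today - timedelta(days=days_back)).isoformat(),  # start of date range
        "to-date": today.isoformat(),                                  # end of date range
        "production-office": "|".join(offices),                        # assumed OR syntax
        "api-key": api_key,
    }
    # urlencode escapes spaces, quotes and "|" so the query string stays valid
    return f"{BASE_URL}?{urlencode(params)}"

url = build_search_url("Australia", ["uk", "us"], today=date(2025, 5, 21))
print(url)
```

Passing the filters as a dictionary and letting `urlencode` do the escaping avoids the easy mistake of placing a raw `&` or space inside a filter value.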
Consideration when applying "Production Office" filter¶
The Guardian has three production offices, which are US, UK and AUS. All content is published in English and accessible to all readers regardless of the location of the readers and production office. However, the production office filter is still an important tool for this analysis.
This is because each office publishes content according to the interests and perspectives of its regional audience. For example, in covering an Australian federal election, each office may present a different angle. Content from the Australia office may focus on how the prime minister-to-be will address domestic issues such as the cost of living, whereas content from the UK and US offices may focus on how the prime minister-to-be will affect relations with other countries. Furthermore, the production office of a piece of content can indicate whether a particular event is considered relevant to audiences in that region.
As a result, the choice of production office filter depends on the goal of the analysis. If the aim is to understand general global trends or views on a topic, content from all offices should be included. If the goal is to explore how a topic is perceived or emphasized within a specific region, filtering by a particular production office will yield more focused results.
Searching data to answer the questions¶
To answer the questions, the Guardian API is called three times with different filters to obtain three datasets, each used to answer a different question. Below are the filters we apply and the reasons for them. We also predict the results of each search: because we plan to generate weekly reports in the future, the expected results help us decide which data analysis technique suits each dataset in the long run.
What are the recent discussions on the destination that might draw customers or potential customers attention?
Dataset A) Views about Australia.
- Explanation: This search looks for discussions about Australia from the US and UK. The Australia production office is excluded because almost all of its articles relate to Australia; including them would return a large number of articles, many likely to cover trivial or region-specific events that do not capture the attention of audiences in the US and UK. Therefore, this search filters out articles from the Australia production office.
- Keywords: "Australia"
- Published date: last 7 days
- Production offices: "US" and "UK"
- Expected Results:
  - Specificity: Results are expected to be broad, covering various topics in Australia.
  - Volume: High. Australia is a popular topic in global discussions.
Dataset B) Views about Queensland and Gold Coast.
- Explanation: This search looks for discussions about Queensland and the Gold Coast from a US and UK point of view. The Australia production office is excluded for the same reason as above. Queensland and the Gold Coast are separated from the Australia dataset to highlight articles that mention these specific regions. Overseas journalists usually pay less attention to specific regions than to Australia as a whole, and tend to name the country rather than individual states or cities. If an article names a particular area, it suggests the topic is unique enough to that region to be worth mentioning. Separating these articles lets us give them more attention and gain insights that help boost QTraveler's business.
- Keywords: "Queensland", "Gold Coast", "Brisbane", "Surfers Paradise"
- Published date: last 7 days
- Production offices: "US" and "UK"
- Expected Results:
  - Specificity: Results are expected to be broad, covering various topics.
  - Volume: Low to medium.
What are the current travel preferences among beach-goers that can help design travel packages or adjust the tours?
Dataset C) Discussion about beach tour.
- Explanation: This search looks for discussions about beach tours. As the topic is not location-specific, articles from the Australia office may also attract the attention of US and UK audiences. Therefore, the Australia office is included in this search.
- Keywords: ("beach" OR "coast") AND ("travel" OR "tour" OR "tourist" OR "tourism" OR "trip" OR "holiday" OR "vacation")
- Published date: last 7 days
- Production offices: "Aus", "US" and "UK"
- Expected Results:
  - Specificity: Results are expected to be more focused and narrow due to the specificity of the keywords. Discussions are expected to focus on particular beaches or coasts.
  - Volume: Medium.
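The three search specifications above can be captured as plain data, which makes the weekly run easier to repeat and review. The sketch below is illustrative only: `build_query` and `DATASETS` are hypothetical names, and it assumes the API's `q` parameter accepts `AND`/`OR` operators and double-quoted phrases, as described in the Guardian API documentation.

```python
def build_query(*groups):
    """AND together groups of terms; terms inside one group are ORed.

    Multi-word terms are wrapped in double quotes so they are searched as phrases.
    """
    def fmt(term):
        return f'"{term}"' if " " in term else term

    parts = []
    for group in groups:
        joined = " OR ".join(fmt(term) for term in group)
        # Parentheses are only needed when several OR-groups are combined with AND
        needs_parens = len(groups) > 1 and len(group) > 1
        parts.append(f"({joined})" if needs_parens else joined)
    return " AND ".join(parts)

# One entry per dataset: the keyword query and the production offices to include
DATASETS = {
    "A": {"query": build_query(["Australia"]),
          "offices": ["uk", "us"]},
    "B": {"query": build_query(["Queensland", "Gold Coast", "Brisbane", "Surfers Paradise"]),
          "offices": ["uk", "us"]},
    "C": {"query": build_query(["beach", "coast"],
                               ["travel", "tour", "tourist", "tourism", "trip", "holiday", "vacation"]),
          "offices": ["aus", "uk", "us"]},
}

for name, spec in DATASETS.items():
    print(name, spec["offices"], spec["query"])
```

Keeping the keyword groups as lists means a term can be added or removed in one place without hand-editing a long query string.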
Import all required libraries¶
#import required libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
import pandas as pd
import random
from datetime import datetime,timedelta
import requests
import json
import re
import time
from urllib.parse import quote
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from plotly.subplots import make_subplots
import plotly.graph_objects as go
Define constants¶
#load your personal API key
with open('../../API key/GuardianAPIkey.txt', 'r') as file:
    key = file.read().strip()
print(f'Length of the API key: {len(key)} characters')
#build a search URL
base_url = 'https://content.guardianapis.com/'
file_path_store_data = "data/"
Length of the API key: 36 characters
Fetch articles and related information from The Guardian API¶
According to the Guardian API documentation, an API request returns not only the title and body text of each article but also other information that may be useful for the analysis. When fetching from the Guardian API, we store the title, body text, type, section and keyword tags of each article.
As three different datasets need to be fetched, the logic is bundled into functions so that the fetching process can be repeated without duplicating code. The function fetch_data accepts four parameters: the search string, the production office, the start date and the end date. It returns the fetched articles as a dictionary keyed by article title.
# Fetch search result pages from url
def fetch_data(search_string,office,from_date,to_date):
    # Offices are passed in as e.g. "uk & us". A raw "&" would split the query
    # string, so the names are joined with "|" instead (assumed here to be the
    # API's OR operator for filter values; check the API documentation).
    office_filter = "|".join(part.strip() for part in office.split("&"))
    full_url = base_url+f"search?q={quote(search_string)}&production-office={quote(office_filter)}&from-date={from_date}&to-date={to_date}&show-fields=body&api-key={key}&show-tags=keyword"
    print(f"Fetching contents related to '{search_string}' from production office {office} from {from_date} to {to_date}")
    server_response = requests.get(full_url)
    server_data = server_response.json()
    resp_data = server_data.get('response','')
    if resp_data == '':
        print("ERROR obtaining results:",server_data)
        return False
    else:
        print("SUCCESS!")
        print(f"{resp_data['total']} results found available")
        articles = get_all_articles_for_response(resp_data,full_url)
        # Save the articles so the analysis can be re-run without calling the API again
        file_name = f"{search_string}_articles.json"
        with open(f"{file_path_store_data}{file_name}",'w', encoding='utf-8') as fp:
            fp.write(json.dumps(articles))
        print(f"Data saved to {file_path_store_data}{file_name}")
        return articles

# Extract title, type, section, keyword tags and plain text from one page of results
def articles_from_page_results(page_results):
    articles = {}
    for result in page_results:
        article_date = result['webPublicationDate']
        # The publication date is appended so that repeated titles stay unique
        article_title = result['webTitle']+f" [{article_date}]"
        article_html = result['fields']['body']
        article_type = result['type']
        article_sectionName= result['sectionName']
        # Tag ids look like "section/topic"; keep each part as a keyword
        article_keywords=set()
        for tag in result['tags']:
            article_keywords.update(tag["id"].split('/'))
        article_keywords=list(article_keywords)
        # Strip HTML tags and newlines from the body
        article_text = re.sub(r'<.*?>','',article_html)
        article_text = re.sub(r'\n','',article_text)
        articles[article_title] = {"type":article_type,"section":article_sectionName,"keywords":article_keywords,"text":article_text}
    return articles

# Walk through every result page and merge the articles into one dictionary
def get_all_articles_for_response(response_json,full_url):
    total_pages = response_json['pages']
    total_articles = response_json['total']
    print(f"Fetching {total_articles} articles from {total_pages} pages...")
    all_articles = {}
    page1_articles = articles_from_page_results(response_json['results'])
    all_articles.update(page1_articles)
    for page in range(2,total_pages+1):
        page_response = requests.get(full_url+f"&page={page}")
        page_data = page_response.json()['response']
        page_articles = articles_from_page_results(page_data['results'])
        all_articles.update(page_articles)
        print(f"Status: {len(all_articles)} articles fetched.")
        time.sleep(1) # make sure we're not hitting the API too hard
    print(f"FINISHED: Fetched {len(all_articles)} articles.")
    return all_articles

# Print a random sample of article titles for a quick sanity check
def print_random_number_titles(articles, number):
    indices = random.sample(range(len(articles)), number)
    keys_list=list(articles.keys())
    for idx in indices:
        print(keys_list[idx])
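The paging logic can be checked without touching the network by swapping the HTTP call for a stand-in function. The following is a minimal sketch under that assumption: `collect_all_pages`, `fake_fetch` and the three fake pages are invented for illustration, and only the `pages` and `results` fields mirror what the fetching code above reads from the API response.

```python
def collect_all_pages(fetch_page):
    """Collect results from every page.

    `fetch_page(page)` is a hypothetical callable that returns the API's
    `response` object for a 1-based page number.
    """
    first = fetch_page(1)
    results = list(first["results"])
    # The first response reports how many pages exist in total
    for page in range(2, first["pages"] + 1):
        results.extend(fetch_page(page)["results"])
    return results

# Made-up three-page response standing in for the real API
FAKE_PAGES = {
    1: {"pages": 3, "results": ["a", "b"]},
    2: {"pages": 3, "results": ["c", "d"]},
    3: {"pages": 3, "results": ["e"]},
}
calls = []

def fake_fetch(page):
    calls.append(page)  # record which pages were requested
    return FAKE_PAGES[page]

all_results = collect_all_pages(fake_fetch)
print(all_results)
print("pages requested:", calls)
```

Separating "how to get one page" from "how to loop over pages" lets the loop be tested once and reused for each of the three datasets.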
Fetch Dataset A Views about Australia¶
today = datetime.today().date()
seven_days_ago = today - timedelta(days=7)
search_string = "Australia"
production_office = "uk & us"
from_date = str(seven_days_ago)
to_date=str(today)
############## Uncomment on weekly report ##########################
# fetch_data(search_string,production_office,from_date,to_date)
############## Uncomment on weekly report ##########################
file_name = f"{search_string}_articles.json"
with open(f"{file_path_store_data}{file_name}",'r', encoding='utf-8') as fp:
    australia_articles = json.load(fp)
print(f"=====Random article titles from {len(australia_articles)} articles=====")
print_random_number_titles(australia_articles,min(5, len(australia_articles)))
=====Random article titles from 57 articles=====
MPs are voting on the next stage of the assisted dying bill. This is their chance to create a legacy | Polly Toynbee [2025-05-15T17:00:31Z]
‘Cinema doesn’t ship that way’: Wes Anderson mocks Donald Trump’s film tariff plans in Cannes [2025-05-19T14:48:58Z]
‘My sadness is not a burden’: author Yiyun Li on the suicide of both her sons [2025-05-17T08:01:00Z]
Twenty years later: how 2005 Ashes marked end of cricket as we knew it [2025-05-17T07:00:59Z]
‘Proving people wrong’: how Central Coast Mariners reached A-League Women grand final [2025-05-15T10:29:32Z]
Fetch Dataset B Views about Queensland and Gold Coast¶
search_string = "Queensland OR Brisbane OR \"Gold Coast\" OR \"Surfers Paradise\""
production_office = "uk & us"
from_date = str(seven_days_ago)
to_date=str(today)
############## Uncomment on weekly report ##########################
# QorGC_articles=fetch_data(search_string,production_office,from_date,to_date)
############## Uncomment on weekly report ##########################
file_name = f"{search_string}_articles.json"
with open(f"{file_path_store_data}{file_name}",'r', encoding='utf-8') as fp:
    QorGC_articles = json.load(fp)
print(f"=====Random article titles from {len(QorGC_articles)} articles=====")
print_random_number_titles(QorGC_articles,min(5, len(QorGC_articles)))
=====Random article titles from 4 articles=====
Online dating advice: five ways to stay safe, according to the experts [2025-05-14T14:04:00Z]
St Helens find hope and a new hero in seven-try rout of Catalans Dragons [2025-05-15T21:02:03Z]
Postecoglou adamant work at Spurs is not done but sounds resigned to his fate [2025-05-20T18:52:53Z]
Richard Goodman obituary [2025-05-18T15:32:27Z]
Fetch Dataset C Discussion about beaches¶
search_string = "(beach OR beaches OR coast OR coasts) AND (travels OR travel OR tour OR tourist OR tourism OR trip OR holiday OR vacation)"
production_office = "uk & us & aus"
from_date = str(seven_days_ago)
to_date=str(today)
############## Uncomment on weekly report ##########################
# fetch_data(search_string,production_office,from_date,to_date)
############## Uncomment on weekly report ##########################
file_name = f"{search_string}_articles.json"
with open(f"{file_path_store_data}{file_name}",'r', encoding='utf-8') as fp:
    beach_articles = json.load(fp)
print(f"=====Random article titles from {len(beach_articles)} articles=====")
print_random_number_titles(beach_articles,min(5, len(beach_articles)))
=====Random article titles from 8 articles=====
What Donald Trump did this week should terrify Benjamin Netanyahu. This is why | Jonathan Freedland [2025-05-16T15:44:57Z]
The Guardian’s happiest places to live in Britain revealed [2025-05-17T11:00:05Z]
Share a tip on a great dog-friendly holiday [2025-05-19T14:51:03Z]
‘Time slows down in Lastovo’: I may just have found Croatia’s most unspoilt archipelago [2025-05-14T06:00:01Z]
Thinking of a trip to Barcelona this summer? Beware – here’s what you'll find | Stephen Burgen [2025-05-20T04:00:47Z]
C. Analysis and Visualization¶
Data cleaning¶
Three datasets are fetched from the Guardian API. Before performing topic modelling to find the topics in each article, some content is removed:
- Cleaning out the liveblogs: A liveblog contains only a title; its content is a video player without any text description. As we do not watch the videos one by one, and the title alone does not reveal the exact topic covered, we exclude liveblogs so that they do not distort the results.
- Cleaning out aggregators: Some items are news updates that cover several topics in one article, which may distort the results of topic modelling.
- Cleaning out articles from the "Media" section: The "Media" section contains news about the Guardian itself, which is not relevant to this study (e.g. "The Guardian relaunches app and updates homepage design").
Ethical considerations¶
The data cleaning rules need to be reviewed frequently to prevent irrelevant data from affecting the results of the topic modelling. It is worth sampling some articles after each fetch from the API to identify and remove articles that are irrelevant to the study.
# Each function removes matching articles from the dataset in place and prints their titles
def remove_liveblog(dataset):
    remove_count=0
    # Iterate over a copy of the keys because the dictionary is modified in the loop
    for article in list(dataset.keys()):
        if dataset[article]["type"]=="liveblog":
            dataset.pop(article)
            remove_count+=1
            print(article)
    print(f"{remove_count} liveblog(s) removed")

def remove_aggregator(dataset):
    remove_count=0
    for article in list(dataset.keys()):
        # Aggregators are recognised by their recurring title prefixes
        if "Morning Mail:" in article or "Afternoon Update:" in article or "briefing:" in article:
            dataset.pop(article)
            remove_count+=1
            print(article)
    print(f"{remove_count} aggregator(s) removed")

def remove_media(dataset):
    remove_count=0
    key_to_remove=[]
    for key,article in dataset.items():
        if article["section"]=="Media" or article["section"]=="GNM press office":
            key_to_remove.append(key)
    for key in key_to_remove:
        dataset.pop(key)
        remove_count+=1
        print(key)
    print(f"{remove_count} media article(s) removed")

list_of_datasets=[australia_articles,QorGC_articles,beach_articles]
for dataset in list_of_datasets:
    remove_liveblog(dataset)
    remove_aggregator(dataset)
    remove_media(dataset)
    print()
Eurovision song contest 2025 – as it happened [2025-05-17T23:38:40Z]
Poland’s presidential candidates seek to broaden appeal on campaign trail after nail-biting first round vote – as it happened [2025-05-19T12:22:10Z]
UK, France and Canada threaten action if Israel’s offensive continues as first aid crosses into Gaza in weeks – as it happened [2025-05-19T19:08:31Z]
3 liveblog(s) removed
Friday briefing: The deepening turmoil over the assisted dying bill [2025-05-16T05:38:13Z]
1 aggregator(s) removed
0 media article(s) removed

0 liveblog(s) removed
0 aggregator(s) removed
0 media article(s) removed

0 liveblog(s) removed
0 aggregator(s) removed
0 media article(s) removed
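Because the cleaning rules need regular review, it can help to record why each article was dropped instead of only counting removals. The sketch below is one possible variant, not the notebook's method: it leaves the input untouched and returns the kept articles together with a reason per removed title. The names `exclusion_reason` and `clean`, and the three sample articles, are made up for illustration; the rules themselves mirror the three functions above.

```python
AGGREGATOR_MARKERS = ("Morning Mail:", "Afternoon Update:", "briefing:")
EXCLUDED_SECTIONS = {"Media", "GNM press office"}

def exclusion_reason(title, article):
    """Return why an article should be dropped, or None to keep it."""
    if article["type"] == "liveblog":
        return "liveblog"
    if any(marker in title for marker in AGGREGATOR_MARKERS):
        return "aggregator"
    if article["section"] in EXCLUDED_SECTIONS:
        return "media"
    return None

def clean(dataset):
    """Split a dataset into kept articles and a {title: reason} log of removals."""
    kept, removed = {}, {}
    for title, article in dataset.items():
        reason = exclusion_reason(title, article)
        if reason is None:
            kept[title] = article
        else:
            removed[title] = reason
    return kept, removed

# Invented sample articles, for illustration only
sample = {
    "Election night (as it happened)": {"type": "liveblog", "section": "World news"},
    "Friday briefing: week in review": {"type": "article", "section": "World news"},
    "Ten quiet beaches to visit": {"type": "article", "section": "Travel"},
}
kept, removed = clean(sample)
print("kept:", list(kept))
print("removed:", removed)
```

The `removed` log can be saved alongside each weekly report so that a reviewer can spot rules that are removing relevant articles.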
Create terms dataframe for each dataset¶
# Build a dataframe of keyword tags and section, indexed by article title
def create_dataframe(dataset):
    keywords_list=[]
    section=[]
    for article in list(dataset.values()):
        keywords_list.append(article["keywords"])
        section.append(article["section"])
    terms_df = pd.DataFrame({
        "keywords": keywords_list,
        "section": section
    },index=dataset.keys())
    return terms_df

australia_terms_df=create_dataframe(australia_articles)
QorGC_terms_df=create_dataframe(QorGC_articles)
beach_terms_df=create_dataframe(beach_articles)
Perform topic modelling¶
Each search returns many articles, and reading them all to find the week's trends or discussions is time-consuming. Therefore, topic modelling is performed to summarise each article as a set of terms.
Apart from the title and body text, the Guardian also assigns a section and keyword tags to each article. The section and keyword tags give a brief idea of what an article is about, which is useful for identifying its topic. However, we do not rely on them too heavily for several reasons:
- Limited insight from tags and sections: The tags and sections only show a general theme without revealing the actual discussion in the article. Also, some articles have no tags assigned. Therefore, tags and sections can only serve as a reference.
- Potential bias: The process of assigning tags and sections may not always be objective. Editorial or business considerations, such as enhancing search visibility or boosting readership, can influence tag selection. This could misrepresent the true content of some articles and introduce bias. Therefore, while tags are valuable metadata, they should be used with caution and in combination with other analytical methods.
To overcome the shortcomings of tags and sections, we analyze the body text of the articles. This allows a more accurate understanding of what each article discusses.
As the three datasets have different characteristics and serve different goals, they are analyzed independently with suitable topic modelling techniques.
Dataset A: Views About Australia¶
The main goal of dataset A is to identify discussions related to Australia that can help with QTraveler's marketing. As expected, the number of articles for this search is relatively high: a total of 53 articles remain after data cleaning. To find topics of interest for marketing, we use topic modelling to group and summarise these articles. Before performing the topic modelling, we draw a bar chart of the distribution of articles by section to get a general picture of the topics covered. The section assigned by the editor is used because it is the quickest and most direct starting point.
fig = px.bar(
x=australia_terms_df["section"].value_counts().index,
y=australia_terms_df["section"].value_counts().values,
labels={"x": "Editor Section", "y": "Number of articles"},
title="Number of Articles by Editor Section related to Australia"
)
fig.update_layout(xaxis_tickangle=-45)
fig.show()
The bar chart shows that nearly one-third of the articles are concentrated in the "sport" section, while the rest are distributed across other topics. This suggests that something significant in Australian sport caught international attention in the past week, so the sport section is worth examining for marketing ideas. We therefore tried extracting the "sport" articles and performing topic modelling on them separately. However, the results did not give more insight than the analysis of all the Australia articles together. As modelling the "sport" articles separately added no value in answering the question, we adopted the simpler approach of performing topic modelling on all the articles together. We keep in mind that sport may be worth more attention in a later study.
Non-negative matrix factorization topic modelling¶
We experimented with different topic modelling techniques and parameters, and found that the non-negative matrix factorization (NMF) approach produced the most useful topics. Therefore, we applied NMF to the articles.
Customize stop words¶
While experimenting with different topic modelling approaches, we noticed that some meaningless words were identified as strong topic terms. This is understandable, as these words are far more common in news writing than elsewhere. The tools we use for topic modelling are not designed specifically for news text, so they do not treat these words as stop words. To address this, we add them to the stop word list to exclude them from the analysis.
# Define own stop words
custom_stop_words = {'guardian', 'amp', 'nbsp','said','says','say','news','ve','think','like','did','didn','don','does','doesn','do'}
extended_stop_words = list(ENGLISH_STOP_WORDS.union(custom_stop_words))
# Build the TF-IDF document-term matrix from the article body text
tfidf_vectorizer = TfidfVectorizer(max_df=0.7, min_df=3, max_features=100000, stop_words=extended_stop_words)
num_topics = 12
australia_articles_text=[article["text"] for article in australia_articles.values() ]
tfidf_dt_matrix=tfidf_vectorizer.fit_transform(australia_articles_text)
feature_names = tfidf_vectorizer.get_feature_names_out()
# random_state fixes the random initialisation so that topic numbers do not change
# between runs (the seed value is arbitrary and was not part of the original run)
nmf_model = NMF(n_components=num_topics,init='random',beta_loss='frobenius',random_state=42)
doc_topic_nmf = nmf_model.fit_transform(tfidf_dt_matrix)
topic_term_nmf = nmf_model.components_
# Keep the 10 highest-weighted terms of each topic
nmf_topic_dict = {}
for index, topic in enumerate(topic_term_nmf):
    zipped = zip(feature_names, topic)
    top_terms=dict(sorted(zipped, key = lambda t: t[1], reverse=True)[:10])
    #print(top_terms)
    top_terms_list= {key : round(top_terms[key], 4) for key in top_terms.keys()}
    nmf_topic_dict[f"topic_{index}"] = top_terms_list
# Assign each article to its highest-weighted topic
topic_num_list = []
topic_term = []
for idx, topic in enumerate(doc_topic_nmf):
    topic_num = topic.argmax()
    topic_num_list.append(topic_num)
    top_topic = nmf_topic_dict[f"topic_{topic_num}"]
    topic_term.append(top_topic)
australia_terms_df['nmfTopic'] = topic_num_list
australia_terms_df['nmf'] = topic_term
Using NMF, we generated 12 topics. The number of topics was chosen manually based on the composition of the articles, and it should vary between datasets. As this process will be repeated to generate weekly reports, the number of topics should be adjusted each week to obtain valuable insights for marketing. The optimal number is decided by reviewing the results. If the NMF results fail to separate topics and produce topics containing a wide range of unrelated terms, the number of topics is too low and several topics have been merged into one, so it may need to be increased. At the same time, more topics do not necessarily lead to better analysis. The aim of topic modelling is to summarize the data efficiently. For example, if we have 50 articles and generate 30 topics, understanding the content may take as much effort as reading the 50 titles directly. Hence, it is better to keep the number of topics as low as possible while still maintaining distinction among them.
Visualizing the NMF Topic modelling Results¶
To present the results, we created two visualizations:
Number of articles in each topic: This graph shows the distribution of articles across the topics. A topic with more articles may indicate that it was discussed more during the past week. We should pay more attention to those topics and consider how they can help the business.
Top 10 terms in each topic: This graph shows the most significant terms for each topic and helps us understand what each topic is primarily about. The score of each term indicates how strongly it is weighted within the topic.
terms_group = australia_terms_df.groupby('nmfTopic').size().reset_index(name='Article Count')
terms_group.rename(columns={'nmfTopic': 'Topic Number'}, inplace=True)
fig1 = px.bar(terms_group,
x='Topic Number',
y='Article Count',
title='Number of Articles per NMF Topic related to Australia',
labels={'Topic Number': 'Topic Number', 'Article Count': 'Article Count'},
text='Article Count')
fig1.update_layout(xaxis=dict(type='category')) # ensures discrete topic numbers
fig1.show()
From the bar chart above, we can draw the following insights:
Topic 9 has the highest number of articles: This suggests that Topic 9 is the most frequently occurring theme in the dataset related to Australia.
Other topics have even coverage: The topics other than Topic 9 contain similar numbers of articles, which indicates a moderate level of coverage.
However, the number of articles alone does not indicate how useful a topic is for QTravel. To better assess the value of each topic, we examined its top terms. These terms help interpret the meaning of each topic and determine whether it can help answer the questions.
topics = list(nmf_topic_dict.items())
n_topics = len(topics)
n_cols = 3
n_rows = (n_topics + n_cols - 1) // n_cols # Ceiling division
# Find the global max value for consistent x-axis range
max_val = max(max(terms.values()) for _, terms in topics)
fig2 = make_subplots(
rows=n_rows,
cols=n_cols,
subplot_titles=[f"{topic.replace('_', ' ').title()}" for topic, _ in topics]
)
for idx, (topic, terms) in enumerate(topics):
    row = idx // n_cols + 1
    col = idx % n_cols + 1
    fig2.add_trace(
        go.Bar(
            x=list(terms.values()),
            y=list(terms.keys()),
            orientation='h',
            text=list(terms.values()),
            name=topic,
            showlegend=False
        ),
        row=row, col=col
    )
    fig2.update_yaxes(autorange='reversed', row=row, col=col)
    fig2.update_xaxes(range=[0, max_val], row=row, col=col) # Fix x-axis range
fig2.update_layout(
height=350 * n_rows,
width=1200,
title_text="Top 10 Terms per Topic Related to Australia",
showlegend=False
)
fig2.show()
Cross reference the NMF results with section assigned by editor¶
The above visualizations summarize the results of NMF topic modelling. From the graphs, we know what is discussed in each topic. To validate the results, we use the section labels assigned by the Guardian. The section is a tool for cross-checking the topic modelling results and interpreting the generated topics. The graph below illustrates the section distribution of the Guardian articles within each topic.
fig3 = px.histogram(australia_terms_df,
x="nmfTopic",
y="section",
facet_col="nmfTopic",
histfunc="count",
labels={
"nmfTopic": "Topic",
"section": "Section",
"count": "Topic Count"
},
category_orders={"nmfTopic": range(num_topics)})
fig3.update_layout(
title="Sections Distribution Across Topics related to Australia",
xaxis_title="count",
yaxis_title="Section",
showlegend=False
)
fig3.show()
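The cross-check against editor-assigned sections can also be summarised numerically. The sketch below reports, for each topic, the most common section and the share of the topic's articles that fall in it. The function name `dominant_section_per_topic` and the toy data are assumptions made for illustration; they are not results from this report.

```python
from collections import Counter, defaultdict

def dominant_section_per_topic(topics, sections):
    """Return {topic: (dominant_section, share)} from two parallel sequences."""
    by_topic = defaultdict(Counter)
    for topic, section in zip(topics, sections):
        by_topic[topic][section] += 1
    summary = {}
    for topic, counter in sorted(by_topic.items()):
        # most_common(1) gives the section with the highest article count
        section, count = counter.most_common(1)[0]
        summary[topic] = (section, count / sum(counter.values()))
    return summary

# Hypothetical toy data standing in for the 'nmfTopic' and 'section' columns
topics = [0, 0, 0, 4, 4, 6, 6, 6]
sections = ["Sport", "Sport", "Sport", "Politics", "Politics",
            "World news", "Society", "Film"]
summary = dominant_section_per_topic(topics, sections)
for topic, (section, share) in summary.items():
    print(f"Topic {topic}: {section} ({share:.0%})")
```

A share close to 100% suggests the topic lines up with a single editorial section, while a low share suggests a mixed topic. With the notebook's data the two arguments would be `australia_terms_df['nmfTopic']` and `australia_terms_df['section']`.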
Interpretation of Topic Modeling Results Related to Australia¶
The "Top 10 Terms per Topic Related to Australia" visualization displays the 10 most representative terms for each topic with their NMF scores. The score indicates how strongly the term represents the topic. There are two ways to interpret these scores:
Within-topic comparison: Comparing the scores of terms within a single topic shows which terms most strongly define the topic. If a few terms have significantly higher scores than the rest, the topic is likely to be well defined by those terms. In contrast, if all term scores are similar, the topic may be broad.
Across-topic comparison: Comparing the highest-scoring term across topics helps assess which topics are more focused. A higher maximum score indicates a more distinct topic.
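Both comparisons can be expressed as simple numbers computed from a dictionary shaped like `nmf_topic_dict` (topic name mapped to a term-to-score dictionary). The sketch below is an assumption-laden illustration: `topic_focus` is a hypothetical helper, the ratio of the top score to the mean score is just one possible measure of within-topic focus, and the toy scores are invented.

```python
def topic_focus(nmf_topic_dict):
    """Return {topic: (max_score, top_to_mean_ratio)} for each topic.

    max_score supports the across-topic comparison; the ratio of the top score
    to the mean score supports the within-topic comparison.
    """
    focus = {}
    for topic, terms in nmf_topic_dict.items():
        scores = list(terms.values())
        max_score = max(scores)
        mean_score = sum(scores) / len(scores)
        ratio = max_score / mean_score if mean_score > 0 else float("nan")
        focus[topic] = (max_score, ratio)
    return focus

# Hypothetical scores, not taken from the report's results
toy_topics = {
    "topic_a": {"eu": 1.2, "deal": 0.4, "uk": 0.2, "starmer": 0.2},
    "topic_b": {"phone": 0.3, "children": 0.3, "writing": 0.3, "life": 0.3},
}
focus = topic_focus(toy_topics)
for topic, (max_score, ratio) in focus.items():
    print(topic, max_score, round(ratio, 2))
```

In this toy example `topic_a` has one term that dominates the rest, while `topic_b` has flat scores, which is the pattern described above for a broad topic.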
Together with the "Section Distribution Across Topics Related to Australia" visualization, we have the following insights for QTravel:
- Sports as a strong theme for marketing (Topics 0, 5 and 8)
Topics 0, 5 and 8 are all strongly associated with the sports section, specifically focusing on cricket, women's rugby league, and the British Lions rugby team. These topics contain a high number of articles and show high maximum term scores. This suggests that the topics are well defined by their terms and that sport may be one focus of UK and US coverage of Australia. QTravel can consider incorporating sports-related events or destinations into its offerings, or creating sports marketing content related to Queensland.
- Eurovision and cultural events (Topic 11)
Topic 11 includes terms such as eurovision, song and contest, indicating a strong focus on the Eurovision Song Contest. However, we cannot tell how it is related to Australia. Therefore, we need to look further at the titles or body text of the articles in this topic.
- Health and wellness (Topics 9 and 10)
Topics 9 and 10 show discussion of mental health and physical health. We need to read the titles or articles to learn more about the discussion. QTravel may need to adjust its tours according to the discussion.
- Politics and subjectivity (Topic 4)
Topic 4 is a strong topic, as its NMF score is the highest among all topics. Together with the section distribution, the NMF results suggest a focus on EU and UK politics. As the content is mostly from the politics section, it is related to political discussion. Due to the subjective nature of politics, QTravel should avoid or ignore this topic to maintain broad customer appeal. QTravel's tour guides can be informed about these topics so that they can avoid sensitive discussions or arguments with customers.
- Vague or mixed themes (Topic 6)
Topic 6 includes a high volume of articles, but its terms are too general to define a clear theme and have low term scores. This suggests a blended or incoherent topic. Despite its frequency, this topic does not provide actionable marketing insights and is best excluded from campaign planning.
Limitations¶
Although topic modeling helps group articles and highlight their contents, it has several limitations:
- Sentiment is not captured: While we can identify a topic, we cannot determine whether the discussion is positive or negative. This information is crucial, as QTravel should only use positive content for marketing and avoid negative content.
- Relevance to Australia is not clear: All the articles must contain the word "Australia" because it is part of our initial search, but we cannot tell from the topic how it is related to Australia. For example, an article may mention Australia only in passing.
To address these limitations, it is essential to review titles or article texts. Topic modeling is the first step in identifying themes. QTravel can use well-defined topics such as sports, Eurovision, and wellness as candidates for marketing content. However, further qualitative analysis of the articles is necessary before finalizing marketing strategies.
Qualitative analysis to the topic¶
We sample one article from each topic and compare it with the topic's terms below.
grouped_by_topic = australia_terms_df.groupby('nmfTopic')
samples1 = [group.sample(n=1) for _, group in grouped_by_topic]
samples1 = [sample.iloc[0] for sample in samples1]
for doc in samples1:
    print(f"NMF Topic {doc['nmfTopic']}: ")
    print(f"\t>> Title:\t {doc.name}")
    print("\t>> Section:\t", doc['section'])
    print("\t>> Keywords:\t", doc['keywords'])
    print("\t>> NMF terms:\t", f"Topic {doc['nmfTopic']}: ", list(doc['nmf'].keys())[:5])
    print()
NMF Topic 0:
    >> Title: The Spin | Gunnersbury women’s cricket club celebrate hitting historic century [2025-05-14T09:20:59Z]
    >> Section: Sport
    >> Keywords: ['sport', 'cricket']
    >> NMF terms: Topic 0: ['women', 'league', 'victory', 'rugby', 'season']
NMF Topic 1:
    >> Title: UK urged not to exploit poor countries in rush for critical minerals [2025-05-14T23:01:09Z]
    >> Section: Business
    >> Keywords: ['green-politics', 'mining', 'environment', 'business', 'uk', 'commodities']
    >> NMF terms: Topic 1: ['russia', 'ukraine', 'australian', 'rights', 'justice']
NMF Topic 2:
    >> Title: ‘Extreme anxiety and extreme depression’: Jennifer Lawrence says she felt ‘like an alien’ as a new mother [2025-05-18T11:39:54Z]
    >> Section: Film
    >> Keywords: ['lifeandstyle', 'postnatal-depression', 'film', 'mental-health', 'lynne-ramsay', 'festivals', 'parents-and-parenting', 'culture', 'robert-pattinson', 'society', 'cannesfilmfestival']
    >> NMF terms: Topic 2: ['film', 'trump', 'cannes', 'movie', 'anderson']
NMF Topic 3:
    >> Title: Here comes summer: reasons to love riesling [2025-05-15T12:00:11Z]
    >> Section: Food
    >> Keywords: ['food', 'australian-food-and-drink', 'wine', 'german-food-and-drink']
    >> NMF terms: Topic 3: ['wine', 'richard', 'zealand', 'dry', 'sure']
NMF Topic 4:
    >> Title: EU reset deal puts Britain back on the world stage, says Keir Starmer [2025-05-19T18:31:04Z]
    >> Section: Politics
    >> Keywords: ['keir-starmer', 'europe-news', 'ursula-von-der-leyen', 'foreignpolicy', 'politics', 'uk', 'eu', 'eu-referendum', 'world']
    >> NMF terms: Topic 4: ['eu', 'deal', 'uk', 'starmer', 'british']
NMF Topic 5:
    >> Title: Needless controversy over foreign-born Lions players ramps up pressure [2025-05-19T17:00:33Z]
    >> Section: Sport
    >> Keywords: ['british-irish-lions', 'sport', 'rugby-union']
    >> NMF terms: Topic 5: ['lions', 'players', 'jones', 'ireland', 'farrell']
NMF Topic 6:
    >> Title: Young British woman held on drug charges in Sri Lanka could be linked to Culley case [2025-05-19T17:46:09Z]
    >> Section: World news
    >> Keywords: ['srilanka', 'thailand', 'drugs', 'south-and-central-asia', 'uk', 'society', 'cannabis', 'asia-pacific', 'world']
    >> NMF terms: Topic 6: ['phone', 'children', 'writing', 'life', 'people']
NMF Topic 7:
    >> Title: Gods arrive from India, myths grow Tinguely and meat gets sensual – the week in art [2025-05-16T11:00:53Z]
    >> Section: Art and design
    >> Keywords: ['artanddesign', 'painting', 'exhibition', 'photography', 'culture', 'art']
    >> NMF terms: Topic 7: ['art', 'single', 'stakes', 'era', 'pop']
NMF Topic 8:
    >> Title: England expect most players will choose country over IPL for West Indies ODIs [2025-05-13T17:46:13Z]
    >> Section: Sport
    >> Keywords: ['england-cricket-team', 'sport', 'cricket']
    >> NMF terms: Topic 8: ['cricket', 'kohli', 'test', 'england', 'zimbabwe']
NMF Topic 9:
    >> Title: UK ‘the sick person of the wealthy world’ amid increase in deaths from drugs and violence [2025-05-19T23:01:41Z]
    >> Section: Society
    >> Keywords: ['heart-disease', 'drugs', 'uk', 'society', 'cancer', 'health']
    >> NMF terms: Topic 9: ['health', 'mental', 'uk', 'people', 'deaths']
NMF Topic 10:
    >> Title: Royal College of Psychiatrists says it cannot yet support assisted dying bill [2025-05-14T13:08:45Z]
    >> Section: Society
    >> Keywords: ['psychiatry', 'assisted-suicide', 'wales', 'politics', 'uk', 'law', 'uk-news', 'society', 'england', 'houseofcommons', 'health']
    >> NMF terms: Topic 10: ['assisted', 'dying', 'mps', 'psychiatrists', 'college']
NMF Topic 11:
    >> Title: Austrians celebrate JJ bringing home first Eurovision win in 11 years [2025-05-18T16:35:46Z]
    >> Section: Television & radio
    >> Keywords: ['austria', 'europe-news', 'eurovision', 'tv-and-radio', 'culture', 'eurovision-2025', 'music', 'world']
    >> NMF terms: Topic 11: ['eurovision', 'song', 'contest', 'jj', 'entry']
The sample titles give us a deeper understanding of each topic. For instance:
Sports as a strong theme for marketing (Topics 0, 5 and 8): The titles tell us more about each topic. Topic 0 highlights women's cricket and rugby. Topic 5 discusses the controversy over foreign-born players for the British and Irish Lions. Topic 8 addresses international cricket scheduling and player availability in relation to the Indian Premier League.
Eurovision and cultural events (Topic 11): The sample title shows that the Eurovision articles are about the win of an Austrian contestant, not an Australian one. This topic does not appear to be related to Australia, so we can ignore it for this analysis.
Ethical considerations¶
It is important to note that a title does not always reflect what is discussed in the body text. News outlets commonly write attractive but only loosely relevant headlines to draw readers. Therefore, if we decide to use a topic as a source of marketing content, it is better to read the body text of several articles from that topic first. The following code samples the body text of one article from a topic and highlights the keywords we are looking for.
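import re
from IPython.display import HTML

text = australia_articles[samples1[9].name]["text"]
words_to_highlight = ['Australia']

def highlight_words_in_html(text, words):
    """Wrap each case-insensitive match of the given words in a highlighted <span>."""
    for word in words:
        # \1 re-inserts the matched text, so its original casing is preserved
        highlighted_word = r'<span style="background-color: yellow; font-weight: bold;">\1</span>'
        # re.escape keeps words containing regex metacharacters from breaking the pattern
        text = re.sub(rf'({re.escape(word)})', highlighted_word, text, flags=re.IGNORECASE)
    html_content = f"""
    <p>{text}</p>
    """
    return html_content

html_content = highlight_words_in_html(text, words_to_highlight)
HTML(html_content)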
The UK is becoming “the sick person of the wealthy world” because of the growing number of people dying from drugs, suicide and violence, research has found. Death rates among under-50s in the UK have got worse in recent years compared with many other rich countries, an international study shows. While mortality from cancer and heart disease has decreased, the number of deaths from injuries, accidents and poisonings has gone up, and got much worse for use of illicit drugs. Related: ‘Brain pacemakers’: implants to be tested to help alcohol and opioid addicts The trends mean Britain is increasingly out of step with other well-off nations, most of which have had improvements in the numbers of people dying from such causes. The increase in drug-related deaths has been so dramatic that the rate of them occuring in the UK was three times higher in 2019 – among both sexes – than the median of 21 other countries studied. The findings are contained in a report by the Health Foundation thinktank, based on an in-depth study of health and death patterns in the 22 nations by academics at the London School of Hygiene and Tropical Medicine (LSHTM). “The UK’s health is fraying,” they concluded. The UK’s rising mortality is especially evident among people of working age, aged 25 to 49. Deaths among women that age rose by 46% and among men by 31%, between 1990 and 2023. In contrast, mortality has fallen in 19 of the 21 other countries studied, with only the US and Canada showing the same rise as the UK. Britain now has the fourth highest overall female mortality and sixth highest overall male mortality rate among the 22 nations. The US topped both league tables. Jennifer Dixon, the Health Foundation’s chief executive, said: “This report is a health check we can’t afford to ignore – and the diagnosis is grim. “The UK is becoming the sick person of the wealthy world, especially for people of working age. 
While other nations moved forward, we stalled – and in some areas slipped badly behind.” Dixon pointed out the improvement in UK death rates since 1990 slowed significantly during the 2010s, with the austerity policies pursued by the coalition government after 2010 a significant factor. Smoking, alcohol misuse and bad diet also help explain Britain’s increasingly sick population. By 2023, women in the UK had a 14% higher death rate than the median in the other countries, while among men of all ages it was 9%. Prof David Leon, who led the research at LSHTM, said: “What is particularly disturbing about our findings is that the risk of dying among adults in the prime of life – those who have not yet got to the age of 50 – has been increasing in the UK for over a decade, while in most other countries it has declined. “This is shocking as most mortality between the ages of 25 and 49 years is in principle avoidable.” Office for National Statistics figures show that 5,448 people died as a result of drug poisoning in England and Wales in 2023 – 11% up on the year before and the highest figure since records began in 1993. The rate of such deaths in 2023 – 93 per million population – was double the 43.5 per 100,000 that occurred as recently as 2012, which underlines the sharp increase in drug mortality. Mortality due to suicide has also risen but alcohol-related deaths plateaued for women and fell for men between 2009 and 2019, the thinktank found. The Local Government Association and WithYou, a drugs charity, called for the government to make it easier for drug users, people close to them and health professionals to access and use naloxone, an emergency antidote to overdoses involving heroin, methadone and other drugs. 
Robin Pollard, WithYou’s head of policy and influencing, said: “We also know getting people into structured treatment is critical to reduce the numbers of drug deaths, and so we continue to call for easier access to higher-quality opiate substitution treatment.” A Department of Health and Social Care spokesperson said: “Every death from the misuse of drugs is a tragedy. This government is committed to reducing drug-related deaths and supporting more people into recovery to live healthier, longer lives. We remain on high alert to emerging drug threats, including from synthetic opioids.” • In the UK and Ireland, Samaritans can be contacted on freephone 116 123, or email jo@samaritans.org or jo@samaritans.ie. In the US, you can call or text the National Suicide Prevention Lifeline on 988, chat on 988lifeline.org, or text HOME to 741741 to connect with a crisis counselor. In Australia, the crisis support service Lifeline is 13 11 14. Other international helplines can be found at befrienders.org
This article is sampled from Topic 9, which is related to mental health. We observed that Australia is mentioned only in the crisis support hotline note at the end of the article. The main content discusses health issues in the UK. This indicates that the article is not relevant to Australia and should be excluded from our analysis. To confirm this, we recommend sampling more articles from Topic 9. If the majority of the samples are not related to Australia, we can reasonably conclude that Topic 9 does not reflect UK and US discussion of Australia.
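The manual check above can be supported by a simple screen that counts keyword mentions and notes whether they all fall near the end of the text, where footers such as helpline notices usually appear. This is a minimal sketch under stated assumptions: `keyword_mentions` is a hypothetical helper, the 10% tail fraction is an arbitrary choice, and the sample text is invented. It does not replace reading the articles.

```python
import re

def keyword_mentions(text, keyword, tail_fraction=0.1):
    """Count keyword mentions and report whether all of them sit in the tail of the text.

    tail_fraction (an assumed default of 10%) defines the final part of the text.
    Matching is case-insensitive and also matches longer words that contain the
    keyword, e.g. "Australian".
    """
    positions = [m.start() for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE)]
    tail_start = len(text) * (1 - tail_fraction)
    only_in_tail = bool(positions) and all(p >= tail_start for p in positions)
    return {"count": len(positions), "only_in_tail": only_in_tail}

# Hypothetical article text: the keyword appears only in a closing helpline note
body = "The UK is facing rising mortality among people of working age. " * 20
footer = "In Australia, the crisis support service Lifeline is 13 11 14."
result = keyword_mentions(body + footer, "Australia")
print(result)
```

Applied to every article in a topic (for example via `australia_articles[title]["text"]`), the share of articles flagged `only_in_tail` gives a quick indication of whether the topic is worth reviewing for relevance.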
Summary: Discussions on Australia in International Context¶
To sum up, the goal of dataset A is to explore how Australia is discussed in the UK and US media. Based on the NMF topic modelling and the corresponding visualizations, the results suggest that the discussion is mostly related to sports events. These topics could be useful themes for marketing strategies. The marketing team should investigate them further to identify marketing opportunities.
While NMF topic modelling provides valuable initial insights, it is important to emphasize that these results are not definitive. Manual review remains crucial to ensure that selected topics align with QTravel's policy and to avoid potential public relations issues.
Dataset B Views about Queensland and Gold Coast¶
In addition to the dataset focused on Australia that we analyzed earlier, we retrieved another dataset targeting articles related to Queensland and the Gold Coast. As expected, this search returned only four articles.
Given the small dataset size, we did not apply more complex models such as NMF or LDA. We chose Term Frequency–Inverse Document Frequency (TF-IDF) to keep the analysis simple and the results interpretable.
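To make the idea behind TF-IDF concrete, the sketch below computes a textbook `tf * log(N / df)` weight in plain Python and returns the highest-weighted terms per document. It is a simplified variant for illustration only: library implementations typically add smoothing and normalisation, so their exact scores differ. The function name `tfidf_top_terms` and the mini-corpus are assumptions.

```python
import math
from collections import Counter

def tfidf_top_terms(docs, top_n=3):
    """Return the top_n terms per document using a textbook tf * log(N / df) weight."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenised for term in set(tokens))
    results = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        # Highest weight first; ties are broken alphabetically for a stable order
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([term for term, _ in ranked[:top_n]])
    return results

# Hypothetical mini-corpus
docs = [
    "wine wine riesling review",
    "rugby league match review",
    "rugby union lions tour",
]
top_terms = tfidf_top_terms(docs, top_n=2)
print(top_terms)
```

Terms that occur in several documents ("review", "rugby") receive a lower weight than terms unique to one document, which is why TF-IDF surfaces the words that distinguish each article.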
from sklearn.feature_extraction.text import TfidfVectorizer

def topics_by_tfidf(dataset, term_dataframe, max_df, min_df, max_features):
    """Store the five highest-weighted TF-IDF terms of each article in term_dataframe['tfidf']."""
    # TfidfVectorizer (rather than CountVectorizer) is needed to obtain TF-IDF weights
    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, stop_words=extended_stop_words)
    text = [article["text"] for article in dataset.values()]
    tfidf_dt_matrix = tfidf_vectorizer.fit_transform(text)
    feature_names = tfidf_vectorizer.get_feature_names_out()
    # One row per article (indexed by title), one column per term
    tfidf_df = pd.DataFrame(tfidf_dt_matrix.toarray(), index=list(dataset.keys()), columns=feature_names)
    term_dataframe['tfidf'] = None
    for idx in term_dataframe.index:
        # Top five terms for this article, highest weight first
        top_terms = tfidf_df.loc[idx].sort_values(ascending=False).head(5)
        term_dataframe.at[idx, 'tfidf'] = list(top_terms.index)

topics_by_tfidf(QorGC_articles, QorGC_terms_df, 0.5, 1, 200)
QorGC_terms_df
| | keywords | section | tfidf |
|---|---|---|---|
| Richard Goodman obituary [2025-05-18T15:32:27Z] | [london, uk, newzealand, wine, world, food, au... | Food | [wine, richard, new, zealand, joan] |
| St Helens find hope and a new hero in seven-try rout of Catalans Dragons [2025-05-15T21:02:03Z] | [rugbyleague, catalans, superleague, sport, st... | Sport | [whitby, wellens, saints, year, pressure] |
| Postecoglou adamant work at Spurs is not done but sounds resigned to his fate [2025-05-20T18:52:53Z] | [uefa-europa-league, ange-postecoglou, europea... | Football | [postecoglou, future, spurs, final, ll] |
| Online dating advice: five ways to stay safe, according to the experts [2025-05-14T14:04:00Z] | [dating, technology, internet-safety, lifeands... | The Filter | [dating, app, apps, online, red] |
samples = random.sample(range(0, len(QorGC_terms_df)), min(5, len(QorGC_terms_df)))
for sample in samples:
    doc = QorGC_terms_df.iloc[sample]
    print(f"[{sample}] {doc.name}")
    print("\t>> Section:\t",doc['section'])
    print("\t>> Keywords:\t",doc['keywords'])
    # print("\t>> NMF terms:\t",f"Topic {doc['nmfTopic']}: ", list(doc['nmf'].keys())[:5])
    print("\t>> TFIDF:\t",doc['tfidf'])
    print()
[2] Postecoglou adamant work at Spurs is not done but sounds resigned to his fate [2025-05-20T18:52:53Z]
    >> Section: Football
    >> Keywords: ['uefa-europa-league', 'ange-postecoglou', 'europeanfootball', 'australia-sport', 'football', 'sport', 'tottenham-hotspur']
    >> TFIDF: ['postecoglou', 'future', 'spurs', 'final', 'll']
[0] Richard Goodman obituary [2025-05-18T15:32:27Z]
    >> Section: Food
    >> Keywords: ['london', 'uk', 'newzealand', 'wine', 'world', 'food', 'australia-news']
    >> TFIDF: ['wine', 'richard', 'new', 'zealand', 'joan']
[3] Online dating advice: five ways to stay safe, according to the experts [2025-05-14T14:04:00Z]
    >> Section: The Filter
    >> Keywords: ['dating', 'technology', 'internet-safety', 'lifeandstyle', 'apps', 'tinder', 'relationships']
    >> TFIDF: ['dating', 'app', 'apps', 'online', 'red']
[1] St Helens find hope and a new hero in seven-try rout of Catalans Dragons [2025-05-15T21:02:03Z]
    >> Section: Sport
    >> Keywords: ['rugbyleague', 'catalans', 'superleague', 'sport', 'sthelens']
    >> TFIDF: ['whitby', 'wellens', 'saints', 'year', 'pressure']
Based on the TF-IDF results and the sections and tags assigned by editors, sport (rugby league and football) is the main subject, consistent with the discussion about Australia: 2 of the 4 articles are related to sport. The relevance of the remaining two articles to the region is less clear. To better understand their connection to Queensland, we advise reading the articles manually.
While the number of articles returned for Gold Coast and Queensland is limited, this search is still valuable. The scarcity of articles is itself an insight: it suggests that media focus on the region has been low recently.
Moreover, this analysis is not a one-time task. It is intended as a long-term weekly report. If a significant event occurs, such as updates related to the 2032 Olympics in Queensland, we can expect an increase in related articles, which will enhance the value of this search.
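Because the report is meant to run weekly, the number of returned articles can itself be tracked over time. The sketch below flags weeks in which the article count is well above the recent average. The function name `flag_coverage_spikes`, the four-week window, the factor of 2 and the weekly counts are all assumptions for illustration, not measured values.

```python
def flag_coverage_spikes(weekly_counts, window=4, factor=2.0):
    """Return indices of weeks whose article count exceeds factor x the trailing mean.

    window and factor are illustrative assumptions. Weeks without a full trailing
    window are skipped.
    """
    spikes = []
    for i in range(window, len(weekly_counts)):
        trailing = weekly_counts[i - window:i]
        baseline = sum(trailing) / window
        # max(baseline, 1) avoids flagging tiny changes when the baseline is near zero
        if weekly_counts[i] > factor * max(baseline, 1):
            spikes.append(i)
    return spikes

# Hypothetical weekly article counts for the Queensland / Gold Coast search
counts = [4, 3, 5, 4, 4, 15, 6]
spikes = flag_coverage_spikes(counts)
print(spikes)
```

A flagged week would be a prompt to read that week's articles closely, since a jump in coverage may point to an event relevant to QTravel's clients.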
Dataset C Discussion about beaches¶
The above two datasets present the UK and US discussion of the region. This dataset shows the general discussion about travelling in beach areas. We used the same approach as for dataset A: we generated topics with NMF and visualized the results in the same way. Before running NMF, we got a brief idea of the articles from their sections.
fig4 = px.bar(
x=beach_terms_df["section"].value_counts().index,
y=beach_terms_df["section"].value_counts().values,
labels={"x": "Section", "y": "Number"},
title="Dataset C Beach: Number of Articles by Section"
)
fig4.update_layout(xaxis_tickangle=-45)
fig4.show()
This dataset contains 8 articles, which is a small volume. The articles are distributed across 4 sections. For NMF to provide meaningful insights, the number of topics should exceed the number of sections, allowing the model to uncover topics within sections. After several trials, we decided to model 5 topics. The visualizations below are similar to those for dataset A.
tfidf_vectorizer = TfidfVectorizer(max_df=0.7, min_df=1, max_features=100000, stop_words=extended_stop_words)
num_topics = 5
beach_articles_text=[article["text"] for article in beach_articles.values() ]
tfidf_dt_matrix=tfidf_vectorizer.fit_transform(beach_articles_text)
feature_names = tfidf_vectorizer.get_feature_names_out()
doc001_term_counts = list(zip(feature_names,tfidf_dt_matrix))
nmf_model = NMF(n_components=num_topics,init='random',beta_loss='frobenius')
doc_topic_nmf = nmf_model.fit_transform(tfidf_dt_matrix)
topic_term_nmf = nmf_model.components_
nmf_topic_dict = {}
for index, topic in enumerate(topic_term_nmf):
    zipped = zip(feature_names, topic)
    top_terms=dict(sorted(zipped, key = lambda t: t[1], reverse=True)[:10])
    top_terms_list= {key : round(top_terms[key], 4) for key in top_terms.keys()}
    nmf_topic_dict[f"topic_{index}"] = top_terms_list
topic_num_list = []
topic_term = []
for idx, topic in enumerate(doc_topic_nmf):
    topic_num = topic.argmax()
    topic_num_list.append(topic_num)
    top_topic = nmf_topic_dict[f"topic_{topic_num}"]
    topic_term.append(top_topic)
beach_terms_df['nmfTopic'] = topic_num_list
beach_terms_df['nmf'] = topic_term
terms_group = beach_terms_df.groupby('nmfTopic').size().reset_index(name='Article Count')
terms_group.rename(columns={'nmfTopic': 'Topic Number'}, inplace=True)
fig5 = px.bar(terms_group,
x='Topic Number',
y='Article Count',
title='Number of Articles per NMF Topic',
labels={'Topic Number': 'Topic Number', 'Article Count': 'Article Count'},
text='Article Count')
fig5.update_layout(xaxis=dict(type='category')) # ensures discrete topic numbers
fig5.show()
topics = list(nmf_topic_dict.items())
n_topics = len(topics)
n_cols = 3
n_rows = (n_topics + n_cols - 1) // n_cols # Ceiling division
max_val = max(max(terms.values()) for _, terms in topics)
fig6 = make_subplots(
rows=n_rows,
cols=n_cols,
subplot_titles=[f"Top 10 Terms for {topic.replace('_', ' ').title()}" for topic, _ in topics]
)
for idx, (topic, terms) in enumerate(topics):
    row = idx // n_cols + 1
    col = idx % n_cols + 1
    fig6.add_trace(
        go.Bar(
            x=list(terms.values()),
            y=list(terms.keys()),
            orientation='h',
            text=list(terms.values()),
            name=topic,
            showlegend=False
        ),
        row=row, col=col
    )
    fig6.update_yaxes(autorange='reversed', row=row, col=col)
    fig6.update_xaxes(range=[0, max_val], row=row, col=col)
fig6.update_layout(
height=350 * n_rows,
width=1200,
title_text="Top 10 Terms per topic related to beaches",
showlegend=False
)
fig6.show()
fig7 = px.histogram(beach_terms_df,
x="nmfTopic",
y="section",
facet_col="nmfTopic",
histfunc="count",
labels={
"nmfTopic": "Topic",
"section": "Section",
"count": "Topic Count"
},
category_orders={"nmfTopic": range(num_topics)})
fig7.update_layout(
title="Section Distribution across Topics related to beaches",
xaxis_title="count",
yaxis_title="Section",
showlegend=False
)
fig7.show()
grouped_by_topic = beach_terms_df.groupby('nmfTopic')
samples = [group.sample(n=1) for _, group in grouped_by_topic]
samples = [sample.iloc[0] for sample in samples]
for doc in samples:
    print(f"NMF Topic {doc['nmfTopic']}: ")
    print(f"\t>> Title:\t {doc.name}")
    print("\t>> Section:\t", doc['section'])
    print("\t>> Keywords:\t", doc['keywords'])
    print("\t>> NMF terms:\t", f"Topic {doc['nmfTopic']}: ", list(doc['nmf'].keys())[:5])
    print()
NMF Topic 0:
    >> Title: The Guardian’s happiest places to live in Britain revealed [2025-05-17T11:00:05Z]
    >> Section: Life and style
    >> Keywords: ['wales', 'lifeandstyle', 'uk', 'happiness', 'property', 'britishidentity', 'scotland', 'money', 'communities', 'cities', 'housing', 'society']
    >> NMF terms: Topic 0: ['town', 'muxima', 'kiel', 'boys', '000']
NMF Topic 1:
    >> Title: First Thing: Trump agrees deal for UAE to build largest AI campus outside US [2025-05-16T12:30:06Z]
    >> Section: US news
    >> Keywords: ['us-news']
    >> NMF terms: Topic 1: ['trump', 'israel', 'israeli', 'gaza', 'netanyahu']
NMF Topic 2:
    >> Title: Share a tip on a great dog-friendly holiday [2025-05-19T14:51:03Z]
    >> Section: Travel
    >> Keywords: ['travel']
    >> NMF terms: Topic 2: ['competition', 'tips', 'terms', 'words', 'uk']
NMF Topic 3:
    >> Title: Thinking of a trip to Barcelona this summer? Beware – here’s what you'll find | Stephen Burgen [2025-05-20T04:00:47Z]
    >> Section: Opinion
    >> Keywords: ['travel', 'barcelona', 'overtourism', 'spain', 'news', 'world', 'europe-news']
    >> NMF terms: Topic 3: ['barcelona', 'tourism', 'tourists', 'city', 'spain']
NMF Topic 4:
    >> Title: ‘Time slows down in Lastovo’: I may just have found Croatia’s most unspoilt archipelago [2025-05-14T06:00:01Z]
    >> Section: Travel
    >> Keywords: ['travel', 'europe', 'environment', 'birds', 'croatia', 'wildlife']
    >> NMF terms: Topic 4: ['lastovo', 'zaklopatica', 'bay', 'night', 'coast']
The topic modeling results reveal five distinct topics from these eight articles.
Topic 0 contains the highest number of articles. However, its NMF term scores are low. This suggests that the topic likely combines several unrelated discussions. Despite this, it is worth exploring the articles within Topic 0, since 2 of its 3 articles are related to travel and may be valuable for QTravel.
Topic 1, on the other hand, is about the Gaza war. As this topic is unrelated to tourism, QTravel can exclude it from further consideration.
Topics 2, 3, and 4 each contain only one article, which leads to high NMF term scores. From the terms and titles, we can tell that Topic 2 discusses dog-friendly holidays and Topic 4 highlights slow travel in Croatia. Both articles discuss travel preferences that are alternatives to traditional travel. They may inspire new travel packages for QTravel's offerings.
Our approach to this dataset differs from that for dataset A. Here we are interested in individual articles even if the NMF scores are low. This is because articles related to tourism are often not tied to widely discussed events. They may reflect personal experiences or smaller-scale stories that are covered by a single article. These articles still hold value for QTravel, as they may reveal trends in travel. As a result, more individual reading and qualitative analysis is necessary to uncover insights from this dataset.
D. Insight¶
This analysis provides a sample weekly report together with its justification. The methods matter more than the specific topics or insights drawn from this report. We used news articles from the Guardian and performed topic modelling with NMF and TF-IDF. This helps QTravel get a brief idea of the discussions that might be useful for the business. Based on the results in this report, QTravel can discuss its marketing and business planning further.
What are the recent discussions on the destination that might draw customers or potential customers attention?¶
To answer this question, we accessed two datasets. Dataset A studies the recent discussions related to Australia. Dataset B studies the recent discussions specifically related to Queensland, Brisbane and Gold Coast.
From both datasets, we found that sport is a major topic of the articles mentioning Australia or Queensland. They focus on cricket, rugby and football. This suggests strong international awareness of Australian sport. QTravel can make use of this trend to promote its business with sports-related topics. It can share sports-related content on its social media to attract the attention of US and UK customers. QTravel can also explore upcoming sports events and launch tour packages that combine a sports event with a beach holiday. In addition, QTravel can equip its tour guides with information about Australian sport, allowing them to provide information that customers may be interested in.
Although the search results for Gold Coast and Queensland are few, the search is not meaningless, because this is not a one-off study. It is a long-term analysis that will be done weekly. If something big happens in Queensland in the future, such as updates about the 2032 Olympics, the number of search results will rise and lead to a more meaningful analysis. We can also compare the topics generated across weeks to look for consistent topics. For example, the Austrian winner of Eurovision does not seem related to Australia. However, if the topic appears in many weeks' reports, it is worth investigating to see whether we have missed something.
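One simple way to compare topics across weeks is to measure the overlap between their top-term sets. The sketch below uses Jaccard similarity for this purpose. It is an illustration under assumptions: the helper names, the 0.3 threshold and the example term lists are hypothetical, and other similarity measures could be used instead.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def recurring_topics(this_week, last_week, threshold=0.3):
    """Pair each topic of this week with its most similar topic of last week.

    Both arguments map a topic name to its list of top terms. Pairs whose
    similarity reaches the (assumed) threshold are returned.
    """
    matches = []
    for name, terms in this_week.items():
        best = max(last_week.items(), key=lambda kv: jaccard(terms, kv[1]), default=None)
        if best is not None and jaccard(terms, best[1]) >= threshold:
            matches.append((name, best[0], round(jaccard(terms, best[1]), 2)))
    return matches

# Hypothetical top terms from two consecutive weekly reports
last_week = {"topic_3": ["eurovision", "song", "contest", "final"],
             "topic_5": ["lions", "players", "ireland", "farrell"]}
this_week = {"topic_1": ["lions", "players", "tour", "farrell"],
             "topic_2": ["wine", "riesling", "dry", "zealand"]}
matches = recurring_topics(this_week, last_week)
print(matches)
```

Topic numbers are not stable between NMF runs, so matching by terms rather than by topic number is what allows a recurring theme to be tracked from one report to the next.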
This weekly report provides brief insights into topics related to Australia. While not all topics may be suitable for QTravel's marketing materials, identifying even a single topic can be valuable. A broad focus without clear direction often leads to ineffective marketing. Therefore, this study should be viewed as a starting point for both marketing and business management. To gain a more comprehensive understanding of each topic, further qualitative research is necessary. This will allow deeper insights and a more thorough exploration of the topics, and help ensure that the insights align with QTravel’s vision and mission.
What are the current travel preferences among beach-goers that can help design travel packages or adjust the tours?
Dataset C reveals a shift in travel preferences toward alternative experiences, for example dog-friendly travel and slow travel. This suggests there is more discussion of travel preferences beyond the traditional travel options. Some tourists are looking for relaxation and meaningful local experiences. QTravel should monitor these trends in future weekly reports. These insights can help QTravel design travel packages that cater to customers' needs and boost the business.
Limitations of insights
Insight limitations: The topic modelling approach only offers an initial snapshot of the discussions. The results do not provide a comprehensive interpretation of the public discussion. Therefore, the study results should not be treated as a conclusive assessment. Additional qualitative research must be done to explore these topics in more detail and confirm their relevance to the business.
Implications for stakeholders: The insights generated should only be treated as a possible direction for QTravel's marketing and management teams, not as an absolute solution for the company. This is because the results only reflect public discussions, which may not align with the company’s mission, vision or strategies. For example, a hot trend on animal hunting may not help an eco-friendly tourism company. Therefore, it is essential that QTravel study each topic before applying these insights in decision-making processes.
Ethical considerations
Sensitive and polarizing topics: Although the data may highlight certain “hot” topics, not all of them are appropriate for marketing. Topics that are politically or socially sensitive could alienate particular customers and generate negative feedback. For example, using topics related to elections or polarized issues could harm the company's reputation. Marketing materials that touch on controversial subjects must be used with caution. This study does not filter out such topics, so QTravel should watch for them.
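One simple way to surface such topics for human review is to check each topic's top terms against a watch list. The sketch below is only an illustration: the watch list, the function name and the example topics are hypothetical, and a real list would need to be agreed with QTravel's marketing team. A term match only flags a topic for a closer look; it does not decide whether the topic is appropriate.

```python
# Hypothetical watch list; the real list should be agreed with the marketing team.
SENSITIVE_TERMS = {"election", "vote", "protest", "war", "referendum"}


def flag_sensitive(topics, watch_list=SENSITIVE_TERMS):
    """Map topic index -> sorted sensitive terms found among its top terms."""
    flagged = {}
    for idx, terms in enumerate(topics):
        hits = sorted({t.lower() for t in terms} & watch_list)
        if hits:
            flagged[idx] = hits
    return flagged


# Made-up top terms per topic, for illustration only.
topics = [
    ["cricket", "test", "ashes", "wicket"],
    ["party", "Election", "vote", "campaign"],
]

print(flag_sensitive(topics))
```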
Bias in data sources: The news source may be biased. Because we use data from only one news organization, the results are likely to carry that organization's biases, which could lead to misleading conclusions. To reduce the effect of source bias, we should incorporate multiple data sources in the analysis. This would make the findings more balanced and reliable.
Article mismatch: This study uses keywords to search for data. A keyword may appear in an article that is not actually about the destination, so irrelevant articles can end up in the search results. Irrelevant articles could distort the whole topic modelling result. A sample of articles should be checked manually to spot irrelevant articles and filter them out.
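The sampling check could look something like the sketch below. It is a rough illustration under stated assumptions: articles are assumed to be dictionaries with hypothetical `headline` and `body` fields, and an article with fewer than `min_hits` keyword mentions is treated as a possible passing mention. Both the field names and the threshold are assumptions, not part of the original analysis.

```python
import random
import re


def keyword_hits(text, keywords):
    """Count whole-word, case-insensitive keyword occurrences in the text."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def split_by_relevance(articles, keywords, min_hits=2):
    """Separate articles with enough keyword mentions from passing mentions."""
    likely, doubtful = [], []
    for article in articles:
        text = article["headline"] + " " + article["body"]
        if keyword_hits(text, keywords) >= min_hits:
            likely.append(article)
        else:
            doubtful.append(article)
    return likely, doubtful


def sample_for_review(articles, n=5, seed=0):
    """Draw a reproducible random sample for manual checking."""
    rng = random.Random(seed)
    return rng.sample(articles, min(n, len(articles)))


# Made-up articles, for illustration only.
articles = [
    {"headline": "Gold Coast beaches reopen after storm",
     "body": "Lifeguards on the Gold Coast said conditions had improved."},
    {"headline": "Film festival line-up announced",
     "body": "The director previously shot a film in Queensland."},
]

likely, doubtful = split_by_relevance(articles, ["Gold Coast", "Queensland"])
for article in sample_for_review(doubtful):
    print("Check manually:", article["headline"])
```

The automatic split only prioritises what to read first; the manual check of the sample is still what decides whether an article is removed.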
Human bias in interpreting the terms: The process of interpreting data is subject to human biases. Different readers may interpret the same terms differently.
References
Queensland Government. (2022, May 11). Digital marketing strategy. Business Queensland. https://www.business.qld.gov.au/running-business/marketing-sales/marketing/websites-social-media/marketing-strategy
Queensland Government. (2024). Promoting your tourism business. Business Queensland. https://www.business.qld.gov.au/industries/hospitality-tourism-sport/tourism/business-development-operations/promoting