Data for both Part A and Part B¶
This project uses data from the Queensland Government Open Data Portal. Both parts use data on Queensland Wave Monitoring; Part B will also use data on Storm Tide Monitoring.
For this project, I will use the Coastal Data System - Near real time wave data, and for Part B I will add the Coastal Data System – Near real time storm tide data. Note that this data will change over the time period of the assignment.
[Q1] Read the data¶
- Open the CSV version of the file directly from the URL into a pandas dataframe.
- Identify an appropriate index, and make a note of the columns.
# import library
import pandas as pd
import datetime
import pytz
# url for the latest data set
url = "https://www.data.qld.gov.au/datastore/dump/2bbef99e-9974-49b9-a316-57402b00609c?bom=True"
# read the data set to the notebook with index "_id"
wave_df = pd.read_csv(url,index_col="_id")
# display the date of access
# get the current time in the local timezone
localTimezone = pytz.timezone('Australia/Brisbane')
waveRecentAccess = datetime.datetime.now(localTimezone)
print(f"The current data was accessed on {waveRecentAccess:%d-%m-%Y %H:%M}")
# get column names
wave_headings = list(wave_df.columns)
# display column names
[noRow,noCol]=wave_df.shape # spread the shape of the data frame into two variables
print(f"There are {noRow} rows in the dataframe.")
print(f"There are {noCol} columns in the dataframe, which are:")
for col in wave_headings:
    print(">", col)
The current data was accessed on 13-04-2025 22:23
There are 7504 rows in the dataframe.
There are 14 columns in the dataframe, which are:
> Site
> SiteNumber
> Seconds
> DateTime
> Latitude
> Longitude
> Hsig
> Hmax
> Tp
> Tz
> SST
> Direction
> Current Speed
> Current Direction
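One detail worth noting in the data: the Current Speed and Current Direction columns show -99.9, which looks like a missing-data placeholder rather than a real reading (an assumption on my part, not stated in the dataset description). A minimal sketch, on a hypothetical miniature frame, of converting such sentinels to NaN so that later aggregates ignore them:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mirroring the sentinel convention seen above,
# where -99.9 is assumed to mark a missing reading
df = pd.DataFrame({
    "Hsig": [0.714, 0.716],
    "Current Speed": [-99.9, 0.3],
})
# Replace the sentinel with NaN so aggregates like max() skip it
df = df.replace(-99.9, np.nan)
print(df["Current Speed"].isna().sum())  # → 1
```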
[Q2] Save the data¶
- Transform the grouped data into a dataframe
- Save the dataframe as a CSV file with the date that reflects the URL access in Q1
# save the data retrieved from the internet
import os
path = "data/"
os.makedirs(path, exist_ok=True)  # create the folder if it does not already exist
# format the timestamp so the filename contains no colons or spaces
file_name_recent = f'wave_data({waveRecentAccess:%d-%m-%Y_%H-%M}).csv'
wave_df.to_csv(f'{path}{file_name_recent}')
Read the data from a file¶
- To read the same data back in (rather than up-to-date data), write code here to read in the file from Q2
# Read the data locally with index "_id"
wave_file_df = pd.read_csv(f"{path}{file_name_recent}",index_col="_id")
wave_file_df
| | Site | SiteNumber | Seconds | DateTime | Latitude | Longitude | Hsig | Hmax | Tp | Tz | SST | Direction | Current Speed | Current Direction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| _id | ||||||||||||||
| 1 | Caloundra | 54 | 1743861600 | 2025-04-06T00:00:00 | -26.84675 | 153.15581 | 0.714 | 1.14 | 6.67 | 5.479 | 25.25 | 85.80 | -99.9 | -99.9 |
| 2 | Caloundra | 54 | 1743863400 | 2025-04-06T00:30:00 | -26.84688 | 153.15564 | 0.716 | 1.23 | 6.67 | 5.479 | 25.25 | 81.60 | -99.9 | -99.9 |
| 3 | Caloundra | 54 | 1743865200 | 2025-04-06T01:00:00 | -26.84700 | 153.15555 | 0.677 | 1.20 | 6.67 | 5.634 | 25.15 | 83.00 | -99.9 | -99.9 |
| 4 | Caloundra | 54 | 1743867000 | 2025-04-06T01:30:00 | -26.84697 | 153.15549 | 0.717 | 1.29 | 6.67 | 5.797 | 25.15 | 87.20 | -99.9 | -99.9 |
| 5 | Caloundra | 54 | 1743868800 | 2025-04-06T02:00:00 | -26.84699 | 153.15553 | 0.708 | 1.10 | 6.67 | 5.714 | 25.20 | 87.20 | -99.9 | -99.9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7500 | Hay Point TriAxys | 4740tx | 1744537200 | 2025-04-13 19:40:00 | -21.27830 | 149.32270 | 1.860 | 2.91 | 5.60 | 4.800 | 26.50 | 119.26 | -99.9 | -99.9 |
| 7501 | Hay Point TriAxys | 4740tx | 1744538400 | 2025-04-13 20:00:00 | -21.27830 | 149.32270 | 1.940 | 3.15 | 5.90 | 5.000 | 26.48 | 122.26 | -99.9 | -99.9 |
| 7502 | Hay Point TriAxys | 4740tx | 1744539600 | 2025-04-13 20:20:00 | -21.27830 | 149.32260 | 1.900 | 3.04 | 5.90 | 5.000 | 26.47 | 114.26 | -99.9 | -99.9 |
| 7503 | Hay Point TriAxys | 4740tx | 1744540800 | 2025-04-13 20:40:00 | -21.27830 | 149.32260 | 1.890 | 3.21 | 6.20 | 5.200 | 26.45 | 115.26 | -99.9 | -99.9 |
| 7504 | Hay Point TriAxys | 4740tx | 1744542000 | 2025-04-13 21:00:00 | -21.27820 | 149.32250 | 1.930 | 3.51 | 5.90 | 4.900 | 26.44 | 114.26 | -99.9 | -99.9 |
7504 rows × 14 columns
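As an optional sanity check (not required by the brief), pandas' testing helper can confirm that a CSV round trip preserves the data. A minimal sketch using an in-memory buffer in place of the file from Q2; `original` is a toy stand-in for `wave_df`:

```python
import io
import pandas as pd

# Toy frame standing in for wave_df; the real check would compare
# wave_df against wave_file_df as read back above
original = pd.DataFrame({"_id": [1, 2], "Hsig": [0.714, 0.716]}).set_index("_id")
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)
roundtrip = pd.read_csv(buffer, index_col="_id")
# Raises AssertionError if values, dtypes, or the index differ
pd.testing.assert_frame_equal(original, roundtrip)
print("round trip OK")
```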
[Q3] Analyse the data¶
- Filter the data to include only sites for the South East coast (from Gold Coast to Sunshine Coast)
- Group the filtered data by site
- Obtain an appropriate aggregate for the groups (e.g. Sum, Mean, etc)
- Save the grouped data as a new dataframe
Correct the data format¶
The "DateTime" column is converted from string to a "datetime object" for easy manipulation.
wave_file_df['DateTime'] = pd.to_datetime(wave_file_df['DateTime'], errors='coerce')
print(wave_file_df.dtypes)
Site                         object
SiteNumber                   object
Seconds                       int64
DateTime             datetime64[ns]
Latitude                    float64
Longitude                   float64
Hsig                        float64
Hmax                        float64
Tp                          float64
Tz                          float64
SST                         float64
Direction                   float64
Current Speed               float64
Current Direction           float64
dtype: object
Correct the name of sites¶
Some site names include the buoy unit name, e.g. "Mk4". The unit name is technical information that is not important for the analysis, so it is removed to avoid misunderstanding.
wave_file_df['Site'] = wave_file_df['Site'].str.replace('Mk4','').str.rstrip()
print("> new set of site names:",set(wave_file_df["Site"]))
> new set of site names: {'Wide Bay', 'Gladstone', 'Cairns', 'Caloundra', 'Brisbane', 'Gold Coast', 'Emu Park', 'Tweed Heads', 'Townsville', 'Palm Beach', 'Mackay', 'Poruma West', 'Mooloolaba', 'Albatross Bay', 'Tweed Offshore', 'North Moreton Bay', 'Hay Point TriAxys', 'Bundaberg', 'Skardon River Outer', 'Bilinga'}
Filter data¶
Retrieve Significant Wave Height and Average Zero Upcrossing Wave Period – The dataset contains 14 columns, but only Significant Wave Height (Hsig) and Average Zero Upcrossing Wave Period (Tz) are needed to answer the questions, as together they indicate the wave energy.
Significant Wave Height is used instead of Maximum Wave Height (Hmax) because it is more representative. Hsig records the mean height of the highest one-third of waves, so it represents waves higher than the most frequent waves (Bureau of Meteorology, 2015). Hmax, in contrast, records only the single highest wave, which may introduce extreme values that rarely occur; relying on it would make the analysis overly sensitive and reduce the system's credibility. Hsig therefore gives a more general measure for the analysis. For the same reason, the average zero upcrossing wave period (Tz) is used instead of the peak wave period (Tp).
Filter to include only South East coast data – To keep the study focused, only monitoring sites along the South East coast, from Mooloolaba in the north to Tweed Offshore in the south, are included. The data are filtered by latitude: only rows with a latitude smaller than -26.56° (just north of Mooloolaba) are retained.
# Filtering data
wave_SEQ_df=wave_file_df[wave_file_df["Latitude"]<-26.56]
wave_SEQ_df = wave_SEQ_df[['Site','DateTime','Hsig', 'Tz','Longitude', 'Latitude']]
wave_SEQ_df
| | Site | DateTime | Hsig | Tz | Longitude | Latitude |
|---|---|---|---|---|---|---|
| _id | ||||||
| 1 | Caloundra | 2025-04-06 00:00:00 | 0.714 | 5.479 | 153.15581 | -26.84675 |
| 2 | Caloundra | 2025-04-06 00:30:00 | 0.716 | 5.479 | 153.15564 | -26.84688 |
| 3 | Caloundra | 2025-04-06 01:00:00 | 0.677 | 5.634 | 153.15555 | -26.84700 |
| 4 | Caloundra | 2025-04-06 01:30:00 | 0.717 | 5.797 | 153.15549 | -26.84697 |
| 5 | Caloundra | 2025-04-06 02:00:00 | 0.708 | 5.714 | 153.15553 | -26.84699 |
| ... | ... | ... | ... | ... | ... | ... |
| 6558 | Bilinga | 2025-04-13 18:30:00 | 1.590 | 5.770 | 153.51279 | -28.14245 |
| 6559 | Bilinga | 2025-04-13 19:00:00 | 1.730 | 6.190 | 153.51281 | -28.14244 |
| 6560 | Bilinga | 2025-04-13 19:30:00 | 1.810 | 6.540 | 153.51277 | -28.14196 |
| 6561 | Bilinga | 2025-04-13 20:00:00 | 1.720 | 6.160 | 153.51281 | -28.14193 |
| 6562 | Bilinga | 2025-04-13 20:30:00 | 1.950 | 6.170 | 153.51279 | -28.14193 |
3383 rows × 6 columns
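A small defensive check (my own addition, not part of the brief) can guard the latitude cut-off: any site well north of -26.56°, such as Cairns, must not survive the filter. A sketch on a hypothetical miniature frame mimicking the Site and Latitude columns:

```python
import pandas as pd

# Hypothetical mini-frame with one far-northern site included
df = pd.DataFrame({
    "Site": ["Mooloolaba", "Caloundra", "Cairns"],
    "Latitude": [-26.56684, -26.84675, -16.73],
})
seq = df[df["Latitude"] < -26.56]
# Cairns lies far north of the cut-off, so it must be excluded
assert "Cairns" not in set(seq["Site"])
print(sorted(set(seq["Site"])))  # → ['Caloundra', 'Mooloolaba']
```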
Sort data¶
Sort data by latitude – The data are originally sorted alphabetically by monitoring-site name. To make the ordering more meaningful, the data are sorted by timestamp and then by latitude in descending order, which arranges the sites from north to south within each observation time.
# Sorting data
# copy the filtered slice to avoid pandas' SettingWithCopyWarning
wave_SEQ_df = wave_SEQ_df.copy()
# sort by timestamp first, then from north to south (descending latitude)
wave_SEQ_df = wave_SEQ_df.sort_values(by=["DateTime", "Latitude"], ascending=[True, False])
wave_SEQ_df
| | Site | DateTime | Hsig | Tz | Longitude | Latitude |
|---|---|---|---|---|---|---|
| _id | ||||||
| 746 | Mooloolaba | 2025-04-06 00:00:00 | 1.044 | 5.797 | 153.18452 | -26.56684 |
| 1 | Caloundra | 2025-04-06 00:00:00 | 0.714 | 5.479 | 153.15581 | -26.84675 |
| 375 | North Moreton Bay | 2025-04-06 00:00:00 | 0.754 | 5.333 | 153.28159 | -26.90002 |
| 4818 | Brisbane | 2025-04-06 00:00:00 | 1.070 | 5.270 | 153.63188 | -27.48649 |
| 5194 | Gold Coast | 2025-04-06 00:00:00 | 0.980 | 6.090 | 153.43906 | -27.96435 |
| ... | ... | ... | ... | ... | ... | ... |
| 4817 | Palm Beach | 2025-04-13 20:00:00 | 2.030 | 5.770 | 153.48567 | -28.09926 |
| 6561 | Bilinga | 2025-04-13 20:00:00 | 1.720 | 6.160 | 153.51281 | -28.14193 |
| 4103 | Tweed Heads | 2025-04-13 20:00:00 | 2.410 | 6.130 | 153.57637 | -28.17828 |
| 5571 | Gold Coast | 2025-04-13 20:30:00 | 2.330 | 5.840 | 153.43921 | -27.96417 |
| 6562 | Bilinga | 2025-04-13 20:30:00 | 1.950 | 6.170 | 153.51279 | -28.14193 |
3383 rows × 6 columns
Group the filtered data¶
- By monitoring site – Wave characteristics may differ by location, so the data for each site should be analysed separately. To facilitate comparisons within sites, the data are grouped by monitoring site.
- By date – The data are recorded every 20–30 minutes. To keep the data manageable, they are also grouped by date.
Obtain appropriate aggregates for the groups¶
- Daily maximum Hsig and Tz – The daily maximum Hsig and Tz are calculated to summarise the wave conditions at each monitoring site. These maxima represent the highest values observed each day, which is useful for safety considerations and for understanding peak wave activity. Note that they are derived from Hsig and Tz rather than Hmax and Tp, because Hmax and Tp capture extreme cases that rarely occur, whereas Hsig, as the mean of the highest one-third of waves, gives a general measure of conditions. Taking the daily maximum of Hsig and Tz therefore captures the strongest commonly occurring waves while avoiding the influence of extremes.
# Grouping the filtered data and obtaining daily maximum Hsig and Tz
wave_SEQ_group_max_df = wave_SEQ_df.groupby(
    ["Site", wave_SEQ_df['DateTime'].dt.date], sort=False
).agg({'Hsig': 'max', 'Tz': 'max', 'Longitude': 'first', 'Latitude': 'first'})
wave_SEQ_group_max_df = wave_SEQ_group_max_df.reset_index(names=['Site','DateTime'])
wave_SEQ_group_max_df
| | Site | DateTime | Hsig | Tz | Longitude | Latitude |
|---|---|---|---|---|---|---|
| 0 | Mooloolaba | 2025-04-06 | 1.329 | 6.250 | 153.18452 | -26.56684 |
| 1 | Caloundra | 2025-04-06 | 0.891 | 5.970 | 153.15581 | -26.84675 |
| 2 | North Moreton Bay | 2025-04-06 | 1.053 | 5.797 | 153.28159 | -26.90002 |
| 3 | Brisbane | 2025-04-06 | 1.740 | 6.950 | 153.63188 | -27.48649 |
| 4 | Gold Coast | 2025-04-06 | 1.190 | 7.210 | 153.43906 | -27.96435 |
| ... | ... | ... | ... | ... | ... | ... |
| 67 | Gold Coast | 2025-04-13 | 2.550 | 5.960 | 153.43906 | -27.96420 |
| 68 | Palm Beach | 2025-04-13 | 2.430 | 6.040 | 153.48579 | -28.09923 |
| 69 | Bilinga | 2025-04-13 | 2.190 | 6.540 | 153.51283 | -28.14198 |
| 70 | Tweed Heads | 2025-04-13 | 2.560 | 6.260 | 153.57632 | -28.17834 |
| 71 | Tweed Offshore | 2025-04-13 | 3.040 | 6.570 | 153.68205 | -28.21257 |
72 rows × 6 columns
[Q4] Visualise the data¶
- Visualise the grouped data with an appropriate chart
- Ensure X and Y axes are labelled appropriately
- Add an appropriate title for the chart
Daily maximum Hsig and Tz trends across sites¶
To analyse the temporal trends of wave conditions, two line charts display the daily maximum Hsig and Tz across the monitoring sites. Since the data form a time series, a line chart captures changes over time well.
From these two charts, Hsig and Tz can be compared among sites. Each monitoring site is drawn in a different colour to make the wave trends easier to compare across locations.
import plotly.express as px
# the range of data
startDate=min(wave_SEQ_group_max_df["DateTime"])
endDate=max(wave_SEQ_group_max_df["DateTime"])
dateRange=f"{startDate} to {endDate}"
# significant wave height across time
lineChartHsig = px.line(
wave_SEQ_group_max_df,
x="DateTime",
y="Hsig",
color="Site",
title=f"Figure 1: Daily maximum significant wave height <br> of South East Queensland: {dateRange}",
    labels={"DateTime": "Date", "Hsig": "Daily maximum significant wave height (m)"},
width=750,
height=500
)
lineChartHsig.show()
# zero upcrossing wave period across time
lineChartTz = px.line(
wave_SEQ_group_max_df,
x="DateTime",
y="Tz",
color="Site",
title=f"Figure 2: Daily maximum average zero upcrossing wave period <br> of South East Queensland: {dateRange}",
    labels={"DateTime": "Date", "Tz": "Daily maximum zero upcrossing wave period (s)"},
width=750,
height=500
)
lineChartTz.show()