Database Assembly using the LMRv2.1 pickle file#
This notebook walks through the process of building a cfr
ProxyDatabase object containing only the PAGES 2k (version 2.0.0) proxy records. For reproducibility, our starting point is the set of proxy pickle files from the LMR Project Data Page provided by Tardif et al. (2019), from which we clean and reformat the records to be compatible with the cfr
data assimilation workflow.
Key steps include:

- Filtering out everything except PAGES2kv2 records
- Standardizing proxy record identifiers (short pid)
- Merging proxy time and values with metadata
- Assigning the standardized ptype labels used by cfr
- Filtering for annual and subannual records with sufficient temporal coverage

The final result is saved as a NetCDF file that can be used in subsequent data assimilation notebooks.
import cfr
import pandas as pd
import numpy as np
import pyleoclim as pyleo
import matplotlib.pyplot as plt
Combining metadata and data pickle files#
Loading the metadata and data CSV files and filtering only PAGES2kv2 data#
When the raw pickle files (one for metadata, one for data) are downloaded from the LMR Project Data Page, they need to be pre-processed. The pickle files were created with an older version of Python and pandas, so you will need to open them as sparse DataFrame objects and convert them to .csv
files before running the rest of this notebook.
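A minimal sketch of that one-time conversion step is below. The helper name and the densification check are assumptions, and the original pickles may require an older pandas environment to load at all:

```python
import pandas as pd

# Hypothetical helper for the one-time pickle -> CSV conversion.
# If the loaded object still carries sparse columns, densify before writing;
# hasattr(df, 'sparse') is True only when all columns are sparse.
def pickle_to_csv(pkl_path, csv_path):
    df = pd.read_pickle(pkl_path)
    if hasattr(df, 'sparse'):
        df = df.sparse.to_dense()
    df.to_csv(csv_path, index=False)
```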
First, load the metadata CSV and isolate only the columns we need to make a cfr.ProxyDatabase
object.
# loading metadata
df_meta = pd.read_csv('./proxydb_meta.csv')
df_p2k_meta = df_meta[df_meta['Proxy ID'].str.contains('PAGES2kv2')].copy()
df_p2k_meta.set_index('Proxy ID', inplace=True)
archive_data = df_p2k_meta[['Archive type', 'Proxy measurement', 'Lat (N)', 'Lon (E)', 'Elev']].copy()
archive_data['Proxy ID'] = archive_data.index  # recover it as a column
Load in the data CSV, which contains arrays for time and proxy value. This step selects all the PAGES2k columns and resets the index.
# loading data (time and value)
df = pd.read_csv('./proxydb_data.csv')
col = df.columns
clist = []
for c in col:
    if ('PAGES2kv2' in c) or ('Year C.E.' in c):
        clist.append(c)
p2k = df[clist]
p2k = p2k.reset_index(drop=True)
Standardizing Proxy IDs (pid)#
The cfr.ProxyDatabase
class expects short proxy IDs (e.g., Ocn_148
) that follow a specific location-based naming convention. We write a function that converts the long-format PAGES2kv2 IDs into these standardized short IDs.
time_col = p2k.columns[0]
pids = []
times = []
values = []
for pid in p2k.columns[1:]:  # skip the time column
    # get the time series for this proxy
    proxy_data = p2k[[time_col, pid]].dropna()
    if len(proxy_data) > 0:  # only process if we have data
        pids.append(pid)
        times.append(proxy_data[time_col].tolist())
        values.append(proxy_data[pid].tolist())
def shorten_pid(pid):
    # Drop the measurement suffix (everything after the colon), then split
    parts = pid.split(':')[0].split('_')

    # The record number is the first purely numeric part
    number = None
    for part in parts:
        if part.isdigit():
            number = part
            break

    # The region is the prefix of the second part (e.g. 'Asia-...' -> 'Asia')
    region = parts[1].split('-')[0]

    # Map long region names onto cfr's three-letter codes; codes that are
    # already short (NAm, Arc, Ant, Ocn, Aus, SAm, Afr) pass through unchanged
    region_mapping = {
        'Asia': 'Asi',
        'Ocean2kHR': 'Ocn',
        'Africa': 'Afr',
    }
    region = region_mapping.get(region, region)

    if number is None:
        # Fall back to digits embedded in a mixed alphanumeric part
        for part in parts:
            if any(char.isdigit() for char in part):
                number = ''.join(char for char in part if char.isdigit())
                break

    return f"{region}_{number}"
We apply the naming function to our metadata dataframe so that the new index is the shortened pid
used by cfr.
short_pids = [shorten_pid(p) for p in pids]
df_p2k_meta_short = df_p2k_meta.rename(index=shorten_pid)
archive_data_new_idx = archive_data.rename(index=shorten_pid)
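Note that pandas' rename accepts a callable and applies it to every index label, which is how shorten_pid is mapped over the metadata index above. A toy illustration with made-up labels:

```python
import pandas as pd

# rename(index=...) maps the callable over each index label (toy data)
df = pd.DataFrame({'x': [1, 2]}, index=['PAGES2kv2_a', 'PAGES2kv2_b'])
df2 = df.rename(index=lambda s: s.replace('PAGES2kv2_', ''))
print(df2.index.tolist())  # ['a', 'b']
```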
data_df = pd.DataFrame({
    'pid': short_pids,
    'time': times,
    'value': values
})
Now we take the data dataframe with shortened pids and merge it with the metadata dataframe to combine all the columns necessary to make a ProxyDatabase object.
archive_data_new_idx = archive_data_new_idx.reset_index(names=['pid'])
new_pdb = pd.merge(
    archive_data_new_idx,
    data_df,
    on='pid',
    how='inner'
)
Assigning Proxy Type Labels (ptype)#
Each record in the cfr.ProxyDatabase
is tagged with a ptype, which is a combination of the archive and the measured variable (e.g., tree.d18O, marine.MgCa). We generate these using a mapping based on the archive and measurement metadata.

Using this, we add a new column called 'ptype', generated by the function create_ptype
. The mapping for this function was assembled by hand, by inspecting records from cfr's built-in ProxyDatabase to find exact matches to the metadata in Tardif's pickle.
def create_ptype(row):
    archive_mapping = {
        'Tree Rings': 'tree',
        'Corals and Sclerosponges': 'coral',
        'Lake Cores': 'lake',
        'Ice Cores': 'ice',
        'Bivalve': 'bivalve',
        'Speleothems': 'speleothem',
        'Marine Cores': 'marine'
    }
    measure_mapping = {
        'Sr_Ca': 'SrCa',
        'trsgi': 'TRW',
        'calcification': 'calc',
        'd18O': 'd18O',
        'dD': 'dD',
        'MXD': 'MXD',
        'density': 'MXD',
        'composite': 'calc',
        'massacum': 'accumulation',
        'thickness': 'varve_thickness',
        'melt': 'melt',
        'RABD660_670': 'reflectance',
        'X_radiograph_dark_layer': 'varve_property'
    }
    # Get the shortened archive type
    archive = archive_mapping.get(row['Archive type'], row['Archive type'].lower())
    proxy = measure_mapping.get(row['Proxy measurement'])
    # Combine archive and proxy measurement
    return f"{archive}.{proxy}"
We now apply the function to our full dataframe.
new_pdb['ptype'] = new_pdb.apply(create_ptype, axis=1)
new_pdb = new_pdb.drop(['Archive type', 'Proxy measurement', 'Proxy ID'], axis=1)
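DataFrame.apply with axis=1 calls the function once per row, as used above. A minimal stand-in for create_ptype on toy columns:

```python
import pandas as pd

# Toy version of the archive + measurement join performed by create_ptype
def make_ptype(row):
    return f"{row['archive']}.{row['measure']}"

df = pd.DataFrame({'archive': ['tree', 'coral'], 'measure': ['TRW', 'd18O']})
print(df.apply(make_ptype, axis=1).tolist())  # ['tree.TRW', 'coral.d18O']
```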
Filtering by resolution (as done in LMR)#
We know that LMRv2.1 used only annual and subannually resolved records. In order to do this, we will take our new dataframe that has all the right columns, and turn it into a cfr.ProxyDatabase
object, so we can filter by resolution.
#new_pdb = new_pdb[new_pdb['pid'] != 'Afr_012']
new_pdb = new_pdb.rename(columns={
    'Lat (N)': 'lat',
    'Lon (E)': 'lon',
    'Elev': 'elev'
})
# examine the dataframe
new_pdb
  | pid | lat | lon | elev | time | value | ptype |
---|---|---|---|---|---|---|---|
0 | Ocn_148 | 43.6561 | 290.1983 | -30.0 | [1033.0, 1034.0, 1035.0, 1036.0, 1037.0, 1038.... | [1.14, 1.06, 0.8, 0.69, 1.21, 1.4, 1.26, 0.85,... | bivalve.d18O |
1 | Ocn_170 | -21.8333 | 114.1833 | -6.0 | [1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.... | [-29.80451, 10.97039, 28.05434, 21.96584, -2.7... | coral.calc |
2 | Ocn_174 | -21.9000 | 113.9667 | -6.0 | [1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.... | [10.976, -4.39066, -3.39162, 16.9882, -9.2862,... | coral.calc |
3 | Ocn_073 | 20.8300 | 273.2600 | -3.1 | [1773.0, 1774.0, 1775.0, 1776.0, 1777.0, 1778.... | [-0.605009101918667, 0.11922153808142, -0.9671... | coral.calc |
4 | Ocn_173 | -17.5167 | 118.9667 | -6.0 | [1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.... | [-3.65572, 15.4608, 6.62304, -20.1302, 14.6722... | coral.calc |
... | ... | ... | ... | ... | ... | ... | ... |
567 | NAm_139 | 49.7000 | 241.1000 | 2000.0 | [1512.0, 1513.0, 1514.0, 1515.0, 1516.0, 1517.... | [0.87, 0.885, 0.893, 0.893, 0.87, 0.81, 0.826,... | tree.MXD |
568 | NAm_200 | 44.8000 | 252.1000 | 2820.0 | [1508.0, 1509.0, 1510.0, 1511.0, 1512.0, 1513.... | [1.03, 0.928, 0.929, 1.052, 1.001, 0.991, 0.94... | tree.MXD |
569 | NAm_130 | 55.3000 | 282.2000 | 50.0 | [1352.0, 1353.0, 1354.0, 1355.0, 1356.0, 1357.... | [0.98, 0.988, 0.956, 0.845, 1.053, 1.129, 1.07... | tree.MXD |
570 | NAm_199 | 41.3000 | 252.3000 | 3150.0 | [1401.0, 1402.0, 1403.0, 1404.0, 1405.0, 1406.... | [1.042, 0.998, 0.953, 1.079, 1.057, 0.91, 1.03... | tree.MXD |
571 | NAm_185 | 45.3000 | 238.3000 | 1300.0 | [1504.0, 1505.0, 1506.0, 1507.0, 1508.0, 1509.... | [1.016, 0.962, 1.009, 0.94, 0.932, 0.847, 0.97... | tree.MXD |
572 rows × 7 columns
# create the ProxyDatabase object using the correctly labeled column names
blank = cfr.ProxyDatabase()
lmr_pdb = blank.from_df(new_pdb,
                        pid_column='pid',
                        lat_column='lat',
                        lon_column='lon',
                        elev_column='elev',
                        time_column='time',
                        value_column='value',
                        ptype_column='ptype')
To filter by resolution, we use cfr's built-in filter method on the ProxyDatabase. We filter by 'dt', the median timestep between consecutive time values. The following cell ensures that there are no None
values for dt before filtering.
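Assuming cfr computes dt as the median of the differences between consecutive time values, consistent with the description above, the quantity can be checked by hand:

```python
import numpy as np

# Median timestep of a toy annual series with one missing year
time = np.array([1900., 1901., 1902., 1904., 1905.])
dt = np.median(np.diff(time))  # diffs: [1, 1, 2, 1]
print(dt)  # 1.0
```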
missing_dt = [pid for pid, r in lmr_pdb.records.items() if r.dt is None]
print(f"{len(missing_dt)} records missing dt")
1 records missing dt
missing = lmr_pdb.filter(by='pid', keys=missing_dt, mode='exact')
lmr_pdb = lmr_pdb - missing
We want dt <= 1.0, but we widen the upper bound slightly to catch any records with resolution marginally coarser than annual.
# everything should be annual
filt = lmr_pdb.filter(by='dt', keys=(0.0, 1.2))
for record in filt.records:
    pobj = filt[record]
    print(pobj.pid, pobj.dt)
Ocn_148 1.0
Ocn_170 1.0
Ocn_174 1.0
Ocn_073 1.0
Ocn_173 1.0
Ocn_171 1.0
Ocn_065 1.0
Ocn_158 1.0
Ocn_172 1.0
Ocn_167 1.0
Ocn_121 1.0
Ocn_131 1.0
Ocn_163 1.0
Ocn_069 1.0
Ocn_112 1.0
Ocn_061 1.0
Ocn_182 1.0
Ocn_129 1.0
Ocn_067 1.0
Ocn_160 1.0
Ocn_070 1.0
Ocn_155 1.0
Ocn_153 1.0
Ocn_084 1.0
Ocn_093 1.0
Ocn_141 1.0
Ocn_159 1.0
Ocn_109 1.0
Ocn_161 1.0
Ocn_101 1.0
Ocn_143 1.0
Ocn_096 1.0
Ocn_176 1.0
Ocn_072 1.0
Ocn_106 1.0
Ocn_125 1.0
Ocn_060 1.0
Ocn_157 1.0
Ocn_110 1.0
Ocn_077 1.0
Ocn_166 1.0
Ocn_120 1.0
Ocn_099 1.0
Ocn_083 1.0
Ocn_179 1.0
Ocn_123 1.0
Ocn_130 1.0
Ocn_079 1.0
Ocn_087 1.0
Ocn_115 1.0
Ocn_162 1.0
Ocn_068 1.0
Ocn_075 1.0
Ocn_178 1.0
Ocn_088 1.0
Ocn_074 1.0
Ocn_097 1.0
Ocn_111 1.0
Ocn_139 1.0
Ocn_062 1.0
Ocn_138 1.0
Ocn_119 1.0
Ocn_076 1.0
Ocn_086 1.0
Ocn_181 1.0
Ocn_066 1.0
Ocn_128 1.0
Ocn_080 1.0
Ocn_098 1.0
Ocn_095 1.0
Ocn_116 1.0
Ocn_154 1.0
Ocn_169 1.0
Ocn_107 1.0
Ocn_114 1.0
Ocn_104 1.0
Ocn_127 1.0
Ocn_091 1.0
Ocn_140 1.0
Ocn_108 1.0
Ocn_078 1.0
Ocn_082 1.0
Ocn_081 1.0
Ocn_147 1.0
Ocn_142 1.0
Ocn_168 1.0
Ocn_175 1.0
Ocn_071 1.0
Ocn_122 1.0
Ocn_146 1.0
Ocn_180 1.0
Ocn_118 1.0
Ocn_090 1.0
Ocn_156 1.0
Arc_080 1.0
Ant_020 1.0
Ant_003 1.0
Arc_032 1.0
Arc_011 1.0
Ant_011 1.0
Ant_024 1.0
Arc_078 1.0
SAm_026 1.0
Arc_072 1.0
Ant_005 1.0
Arc_035 1.0
Arc_036 1.0
Ant_008 1.0
Arc_029 1.0
Arc_027 1.0
Ant_006 1.0
Arc_033 1.0
Arc_028 1.0
Ant_002 1.0
Ant_004 1.0
Arc_034 1.0
Ant_021 1.0
Ant_012 1.0
Arc_075 1.0
Arc_018 1.0
Ant_007 1.0
Asi_232 1.0
Arc_005 1.0
Arc_064 1.0
Ant_025 1.0
Ant_010 1.0
Ant_026 1.0
Ant_019 1.0
Ant_017 1.0
Ant_028 1.0
Ant_001 1.0
Arc_004 1.0
Arc_026 1.0
NAm_072 1.0
Arc_020 1.0
Arc_014 1.0
Arc_001 1.0
Arc_025 1.0
Arc_022 1.0
Afr_012 1.0
Arc_22 1.0
NAm_049 1.0
Asi_221 1.0
Asi_153 1.0
Asi_044 1.0
Asi_130 1.0
NAm_030 1.0
NAm_046 1.0
NAm_178 1.0
Asi_142 1.0
Asi_146 1.0
SAm_006 1.0
Asi_143 1.0
Aus_030 1.0
NAm_156 1.0
Asi_137 1.0
NAm_011 1.0
NAm_045 1.0
Asi_119 1.0
Asi_216 1.0
NAm_149 1.0
NAm_032 1.0
NAm_109 1.0
Asi_079 1.0
Asi_021 1.0
Asi_100 1.0
NAm_147 1.0
Asi_010 1.0
Asi_047 1.0
Asi_093 1.0
NAm_019 1.0
Asi_054 1.0
Asi_206 1.0
Asi_065 1.0
NAm_096 1.0
Asi_007 1.0
Asi_134 1.0
Asi_172 1.0
NAm_126 1.0
Asi_184 1.0
NAm_180 1.0
NAm_166 1.0
Eur_005 1.0
NAm_160 1.0
NAm_111 1.0
NAm_018 1.0
NAm_110 1.0
Asi_063 1.0
Asi_131 1.0
Asi_048 1.0
Asi_052 1.0
Asi_068 1.0
Eur_009 1.0
Asi_023 1.0
Aus_005 1.0
Asi_087 1.0
Asi_176 1.0
NAm_099 1.0
Asi_162 1.0
Asi_014 1.0
NAm_182 1.0
Asi_071 1.0
Asi_053 1.0
Asi_219 1.0
Asi_189 1.0
NAm_176 1.0
Asi_029 1.0
Asi_036 1.0
Asi_192 1.0
NAm_105 1.0
Asi_086 1.0
Asi_008 1.0
Asi_180 1.0
NAm_188 1.0
NAm_170 1.0
Aus_007 1.0
Asi_011 1.0
NAm_151 1.0
Asi_220 1.0
Arc_002 1.0
Asi_013 1.0
NAm_173 1.0
Asi_156 1.0
Asi_168 1.0
NAm_002 1.0
Asi_015 1.0
Asi_028 1.0
Asi_195 1.0
Asi_201 1.0
Asi_136 1.0
Asi_096 1.0
Eur_008 1.0
NAm_142 1.0
NAm_177 1.0
Asi_118 1.0
Asi_225 1.0
NAm_009 1.0
SAm_024 1.0
Asi_165 1.0
Asi_148 1.0
Asi_190 1.0
Asi_104 1.0
NAm_148 1.0
Asi_140 1.0
Asi_193 1.0
Asi_103 1.0
Asi_215 1.0
Asi_179 1.0
Asi_135 1.0
Asi_127 1.0
Asi_005 1.0
NAm_136 1.0
Asi_202 1.0
NAm_081 1.0
Asi_102 1.0
NAm_158 1.0
Asi_024 1.0
Asi_126 1.0
NAm_125 1.0
Asi_226 1.0
SAm_029 1.0
Asi_183 1.0
Asi_199 1.0
Asi_059 1.0
Asi_074 1.0
Asi_056 1.0
NAm_196 1.0
Asi_207 1.0
Asi_173 1.0
Asi_090 1.0
Asi_121 1.0
Asi_098 1.0
NAm_128 1.0
Asi_163 1.0
NAm_100 1.0
Asi_040 1.0
Asi_203 1.0
Asi_169 1.0
Asi_078 1.0
NAm_097 1.0
Asi_166 1.0
NAm_161 1.0
NAm_087 1.0
Asi_155 1.0
Arc_024 1.0
Asi_159 1.0
Asi_062 1.0
NAm_135 1.0
NAm_189 1.0
Asi_141 1.0
Asi_161 1.0
Asi_045 1.0
Asi_123 1.0
Asi_129 1.0
NAm_013 1.0
NAm_133 1.0
Asi_209 1.0
Asi_150 1.0
Asi_112 1.0
Asi_196 1.0
Asi_060 1.0
NAm_083 1.0
Asi_083 1.0
Asi_124 1.0
Asi_091 1.0
NAm_060 1.0
Asi_223 1.0
Asi_051 1.0
Asi_122 1.0
Asi_018 1.0
Asi_200 1.0
Asi_089 1.0
Asi_138 1.0
Asi_111 1.0
Asi_105 1.0
NAm_070 1.0
Asi_082 1.0
Asi_228 1.0
NAm_190 1.0
Asi_170 1.0
Asi_076 1.0
NAm_085 1.0
Asi_026 1.0
NAm_091 1.0
Asi_077 1.0
NAm_001 1.0
Asi_016 1.0
NAm_071 1.0
Asi_110 1.0
NAm_007 1.0
Arc_008 1.0
NAm_154 1.0
Asi_160 1.0
NAm_193 1.0
Asi_114 1.0
NAm_203 1.0
NAm_146 1.0
Asi_080 1.0
Asi_147 1.0
Asi_145 1.0
Asi_151 1.0
Asi_058 1.0
NAm_127 1.0
Arc_016 1.0
NAm_114 1.0
Asi_218 1.0
Asi_217 1.0
Asi_164 1.0
NAm_008 1.0
Asi_027 1.0
NAm_132 1.0
Asi_197 1.0
Asi_154 1.0
Asi_064 1.0
Asi_020 1.0
NAm_088 1.0
Asi_003 1.0
Asi_085 1.0
Asi_229 1.0
Asi_106 1.0
NAm_145 1.0
Asi_108 1.0
Asi_004 1.0
Asi_158 1.0
Eur_006 1.0
Asi_210 1.0
Asi_009 1.0
Asi_177 1.0
Asi_174 1.0
Asi_025 1.0
Asi_061 1.0
Asi_191 1.0
Asi_139 1.0
NAm_155 1.0
Asi_069 1.0
Asi_186 1.0
NAm_153 1.0
NAm_050 1.0
Asi_057 1.0
NAm_186 1.0
Asi_032 1.0
Asi_037 1.0
Asi_072 1.0
NAm_183 1.0
Asi_101 1.0
NAm_059 1.0
NAm_029 1.0
Aus_031 1.0
Aus_009 1.0
Asi_043 1.0
Asi_113 1.0
NAm_098 1.0
NAm_120 1.0
Asi_149 1.0
Asi_204 1.0
Asi_128 1.0
Asi_213 1.0
NAm_092 1.0
Asi_088 1.0
Asi_214 1.0
Asi_211 1.0
Asi_050 1.0
Asi_175 1.0
Asi_187 1.0
Asi_049 1.0
NAm_090 1.0
NAm_201 1.0
Asi_095 1.0
Asi_035 1.0
Asi_041 1.0
Asi_030 1.0
Asi_198 1.0
NAm_195 1.0
Asi_185 1.0
NAm_140 1.0
Asi_034 1.0
Aus_004 1.0
SAm_025 1.0
Asi_066 1.0
Asi_002 1.0
Asi_132 1.0
Asi_075 1.0
Asi_073 1.0
NAm_163 1.0
Asi_006 1.0
Asi_067 1.0
Asi_081 1.0
NAm_089 1.0
Asi_046 1.0
NAm_162 1.0
Asi_092 1.0
Arc_073 1.0
Asi_017 1.0
Asi_022 1.0
Asi_182 1.0
Asi_097 1.0
Asi_117 1.0
Asi_031 1.0
NAm_117 1.0
Asi_167 1.0
Asi_115 1.0
Asi_019 1.0
Asi_094 1.0
Asi_133 1.0
Eur_004 1.0
Asi_001 1.0
Asi_178 1.0
NAm_168 1.0
Asi_099 1.0
NAm_003 1.0
Asi_157 1.0
Asi_070 1.0
NAm_066 1.0
NAm_094 1.0
NAm_093 1.0
Asi_116 1.0
NAm_191 1.0
NAm_138 1.0
NAm_144 1.0
Asi_038 1.0
Asi_171 1.0
Asi_055 1.0
Asi_152 1.0
Asi_042 1.0
NAm_129 1.0
NAm_198 1.0
Asi_224 1.0
Asi_222 1.0
Asi_205 1.0
Asi_181 1.0
Asi_039 1.0
Asi_033 1.0
Asi_125 1.0
Asi_109 1.0
NAm_171 1.0
Asi_144 1.0
Asi_120 1.0
NAm_159 1.0
Asi_107 1.0
Asi_208 1.0
Asi_084 1.0
NAm_112 1.0
Asi_194 1.0
Asi_212 1.0
Asi_227 1.0
Asi_188 1.0
NAm_157 1.0
NAm_044 1.0
NAm_179 1.0
NAm_192 1.0
NAm_124 1.0
NAm_113 1.0
NAm_108 1.0
NAm_084 1.0
NAm_181 1.0
NAm_167 1.0
Arc_068 1.0
NAm_102 1.0
NAm_119 1.0
NAm_116 1.0
NAm_041 1.0
NAm_152 1.0
NAm_174 1.0
Arc_074 1.0
Arc_066 1.0
Arc_065 1.0
NAm_104 1.0
NAm_137 1.0
NAm_026 1.0
NAm_101 1.0
NAm_143 1.0
NAm_134 1.0
Eur_007 1.0
Eur_003 1.0
NAm_122 1.0
NAm_086 1.0
NAm_194 1.0
NAm_204 1.0
NAm_115 1.0
NAm_107 1.0
NAm_175 1.0
NAm_064 1.0
NAm_103 1.0
NAm_123 1.0
NAm_121 1.0
Arc_071 1.0
Arc_063 1.0
Arc_061 1.0
NAm_150 1.0
NAm_202 1.0
NAm_184 1.0
NAm_141 1.0
NAm_172 1.0
NAm_197 1.0
NAm_187 1.0
Arc_077 1.0
NAm_106 1.0
NAm_165 1.0
NAm_164 1.0
NAm_118 1.0
NAm_169 1.0
NAm_095 1.0
NAm_139 1.0
NAm_200 1.0
NAm_130 1.0
NAm_199 1.0
NAm_185 1.0
(Optional) Check for sparse records that don’t meet the 0.5 threshold.#
LMRv2.1 uses the parameter valid_frac
to remove records whose observations cover less than half of their own time span. If the database has been cleaned up, chances are this will remove nothing, but it is good to run just in case to see which records would be filtered out. Records that do not meet the overlap requirement are also dropped during the calibration step of the data assimilation workflow.
valid_frac = 0.5
filtered_records = []

for pid, record in filt.records.items():
    time = np.array(record.time)
    value = np.array(record.value)

    # drop NaNs from both arrays
    mask = ~np.isnan(time) & ~np.isnan(value)
    time = time[mask]
    value = value[mask]

    if len(time) < 2:
        continue

    years = time.astype(int)
    year_range = years.max() - years.min() + 1
    coverage = len(years) / year_range if year_range > 0 else 0

    # keep if coverage >= valid_frac, or if the pid is manually whitelisted
    if coverage >= valid_frac or pid == 'Ocn_148':
        filtered_records.append(record)
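The coverage criterion can be checked on a toy series: 5 observed years over a 1900-1907 span gives coverage 5/8 = 0.625, which passes the 0.5 threshold.

```python
import numpy as np

# Coverage = observed years / calendar years spanned (toy example)
years = np.array([1900, 1901, 1905, 1906, 1907])
year_range = years.max() - years.min() + 1  # 8 calendar years in the span
coverage = len(years) / year_range
print(coverage >= 0.5)  # True
```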
Final ProxyDatabase Object#
The final ProxyDatabase
object contains only PAGES2kv2 records that:

- Are annual or subannual in resolution (dt ≤ 1.2 years)
- Have at least 50% coverage over their date range
- Include complete time and value arrays
- Are labeled with a standardized ptype format (e.g., tree.d18O, marine.MgCa)
This cleaned and filtered database is now ready for use in the cfr framework for climate field reconstruction.
# remove dropped records from final database
final_pdb = cfr.ProxyDatabase()
final_pdb += filtered_records
Save ProxyDatabase as a NetCDF file#
final_pdb.to_nc('./prev_data/lmr_pickle_pages2k.nc')
>>> Warning: this is an experimental feature.
>>> Warning: this is an experimental feature.
100%|██████████| 544/544 [00:15<00:00, 35.39it/s]
>>> Data before 1 CE is dropped for records: ['Arc_032', 'Arc_033', 'Ant_007', 'Ant_010', 'Ant_028', 'Arc_004', 'Arc_026', 'Arc_020', 'Arc_022', 'NAm_011', 'NAm_019', 'Arc_002', 'Eur_006', 'Eur_003'].
ProxyDatabase saved to: ./prev_data/lmr_pickle_pages2k.nc
Summary and Output#
We have constructed a cleaned, standardized proxy database containing PAGES2kv2 records. This database has been filtered by resolution and (optionally) coverage, and formatted for direct use in cfr paleoclimate data assimilation workflows.
The result has been saved as a NetCDF file:
./prev_data/lmr_pickle_pages2k.nc