Database Assembly using the LMRv2.1 pickle file#

This notebook walks through the process of building a cfr ProxyDatabase object containing only the PAGES 2k (version 2.0.0) proxy records. For reproducibility, our starting point is the pair of proxy pickle files from the LMR Project Data Page of Tardif et al. (2019); we clean and reformat these records to be compatible with the cfr data assimilation workflow.

Key steps include:

  • Filtering only PAGES2kv2 records

  • Standardizing proxy record identifiers (short pid)

  • Merging proxy time and values with metadata

  • Assigning standardized ptype labels used by cfr

  • Filtering for annual and subannual records with sufficient temporal coverage

The final result is saved as a NetCDF file that can be used in subsequent data assimilation notebooks.

import cfr
import pandas as pd
import numpy as np
import pyleoclim as pyleo
import matplotlib.pyplot as plt

Combining metadata and data pickle files#

Loading the metadata and data CSV files and filtering only PAGES2kv2 data#

When the raw pickle files (one for metadata, one for data) for the proxies are downloaded from the LMR Project Data Page, they will need to be pre-processed. The pickle files were created using an older version of Python, hence you will need to open the files as pd.sparse_dataframe objects before converting them to .csv files to be used for the rest of this notebook.
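
A minimal sketch of this pre-processing step is below; the pickle filenames are placeholders, so substitute the names from your download:

import pandas as pd

# placeholder filenames for the two pickles from the LMR Project Data Page
df_meta_raw = pd.read_pickle('./proxydb_meta.pckl')
df_data_raw = pd.read_pickle('./proxydb_data.pckl')

# pickles written by older pandas may deserialize as sparse objects; densify if so
# df_data_raw = df_data_raw.sparse.to_dense()

df_meta_raw.to_csv('./proxydb_meta.csv')
df_data_raw.to_csv('./proxydb_data.csv')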

First, load the metadata CSV and isolate only the columns we need to build a cfr.ProxyDatabase object.

# loading metadata

df_meta = pd.read_csv('./proxydb_meta.csv')

# keep only PAGES2kv2 records; .copy() avoids SettingWithCopyWarning later
df_p2k_meta = df_meta[df_meta['Proxy ID'].str.contains('PAGES2kv2')].copy()
df_p2k_meta.set_index('Proxy ID', inplace=True)

# isolate the metadata columns needed for the ProxyDatabase
archive_data = df_p2k_meta[['Archive type', 'Proxy measurement', 'Lat (N)', 'Lon (E)', 'Elev']].copy()
archive_data['Proxy ID'] = archive_data.index  # recover it as a column

Next, load the data CSV, which contains the time and proxy value arrays. This step selects all the PAGES2kv2 columns and resets the index.

# loading data (time and value)

df = pd.read_csv('./proxydb_data.csv')

# keep the shared time axis ('Year C.E.') plus every PAGES2kv2 proxy column
clist = [c for c in df.columns if ('PAGES2kv2' in c) or ('Year C.E.' in c)]

p2k = df[clist].reset_index(drop=True)

Standardizing Proxy IDs (pid)#

The cfr.ProxyDatabase class expects short proxy IDs (e.g., Ocn_148) that follow a specific location-based naming convention. We create a function that converts the long-format PAGES2kv2 IDs into these standardized shortened IDs.

time_col = p2k.columns[0]

pids = []
times = []
values = []

for pid in p2k.columns[1:]:  # skip the shared time column
    # get the time series for this proxy, dropping missing entries
    proxy_data = p2k[[time_col, pid]].dropna()

    if len(proxy_data) > 0:  # only keep proxies that actually have data
        pids.append(pid)
        times.append(proxy_data[time_col].tolist())
        values.append(proxy_data[pid].tolist())

def shorten_pid(pid):
    # Split at the colon first, to drop the measurement suffix
    # (some proxy measurements contain underscores)
    stem = pid.split(':')[0]
    parts = stem.split('_')

    # Find the part that is the record number
    number = None
    for part in parts:
        if part.isdigit():
            number = part
            break

    # Extract the region from the second part
    region = parts[1].split('-')[0]

    # Map long region names onto the short codes used by cfr
    region_mapping = {
        'Asia': 'Asi',
        'Ocean2kHR': 'Ocn',
        'Africa': 'Afr',
        'Afr': 'Afr',
        'NAm': 'NAm',
        'Arc': 'Arc',
        'Ant': 'Ant',
        'Aus': 'Aus',
        'SAm': 'SAm'
    }
    region = region_mapping.get(region, region)

    if number is None:
        # If no purely numeric part was found, extract the digits
        # from the first part that contains any
        for part in parts:
            if any(char.isdigit() for char in part):
                number = ''.join(char for char in part if char.isdigit())
                break

    return f"{region}_{number}"

We apply the naming function to our metadata dataframe so that the new index is the shortened pid, as used by cfr.

# apply the naming function to every long-format pid
short_pids = [shorten_pid(p) for p in pids]

df_p2k_meta_short = df_p2k_meta.rename(index=shorten_pid)
archive_data_new_idx = archive_data.rename(index=shorten_pid)


data_df = pd.DataFrame({
    'pid': short_pids,
    'time': times,
    'value': values
})

Now we take the data dataframe with shortened pids and merge it with the metadata dataframe to combine all the columns necessary to make a ProxyDatabase object.

archive_data_new_idx = archive_data_new_idx.reset_index(names=['pid'])

new_pdb = pd.merge(
    archive_data_new_idx,
    data_df,
    on='pid',
    how='inner'
)

Assigning Proxy Type Labels (ptype)#

Each record in the cfr.ProxyDatabase is tagged with a ptype, which combines the archive and the measured variable (e.g., tree.d18O, marine.MgCa). We generate these labels using a mapping based on the archive and measurement metadata.
To do so, we add a new column called ‘ptype’ using the function create_ptype. The mapping was constructed by hand, comparing records from cfr’s built-in ProxyDatabase against the metadata from the Tardif et al. (2019) pickle to get an exact match.

def create_ptype(row):

    archive_mapping = {
        'Tree Rings': 'tree',
        'Corals and Sclerosponges': 'coral',
        'Lake Cores': 'lake',
        'Ice Cores': 'ice',
        'Bivalve': 'bivalve',
        'Speleothems': 'speleothem',
        'Marine Cores': 'marine'
    }

    measure_mapping = {
        'Sr_Ca': 'SrCa',
        'trsgi': 'TRW',
        'calcification': 'calc',
        'd18O': 'd18O',
        'dD': 'dD',
        'MXD': 'MXD',
        'density': 'MXD',
        'composite': 'calc',
        'massacum': 'accumulation',
        'thickness': 'varve_thickness',
        'melt': 'melt',
        'RABD660_670': 'reflectance',
        'X_radiograph_dark_layer': 'varve_property'
    }
    
    # Get the shortened archive type (fall back to a lowercased copy)
    archive = archive_mapping.get(row['Archive type'], row['Archive type'].lower())
    # Get the standardized measurement label (fall back to the raw name)
    proxy = measure_mapping.get(row['Proxy measurement'], row['Proxy measurement'])

    # Combine archive and measurement into the ptype label
    return f"{archive}.{proxy}"

We now apply the function to our full dataframe.

new_pdb['ptype'] = new_pdb.apply(create_ptype, axis=1)
new_pdb = new_pdb.drop(['Archive type', 'Proxy measurement', 'Proxy ID'], axis=1)

Filtering by resolution (as done in LMR)#

We know that LMRv2.1 used only annually and subannually resolved records. To apply the same criterion, we take our new dataframe, which now has all the right columns, and turn it into a cfr.ProxyDatabase object so we can filter by resolution.


new_pdb = new_pdb.rename(columns={
    'Lat (N)': 'lat',
    'Lon (E)': 'lon',
    'Elev': 'elev'
})
# examine the dataframe

new_pdb
pid lat lon elev time value ptype
0 Ocn_148 43.6561 290.1983 -30.0 [1033.0, 1034.0, 1035.0, 1036.0, 1037.0, 1038.... [1.14, 1.06, 0.8, 0.69, 1.21, 1.4, 1.26, 0.85,... bivalve.d18O
1 Ocn_170 -21.8333 114.1833 -6.0 [1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.... [-29.80451, 10.97039, 28.05434, 21.96584, -2.7... coral.calc
2 Ocn_174 -21.9000 113.9667 -6.0 [1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.... [10.976, -4.39066, -3.39162, 16.9882, -9.2862,... coral.calc
3 Ocn_073 20.8300 273.2600 -3.1 [1773.0, 1774.0, 1775.0, 1776.0, 1777.0, 1778.... [-0.605009101918667, 0.11922153808142, -0.9671... coral.calc
4 Ocn_173 -17.5167 118.9667 -6.0 [1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.... [-3.65572, 15.4608, 6.62304, -20.1302, 14.6722... coral.calc
... ... ... ... ... ... ... ...
567 NAm_139 49.7000 241.1000 2000.0 [1512.0, 1513.0, 1514.0, 1515.0, 1516.0, 1517.... [0.87, 0.885, 0.893, 0.893, 0.87, 0.81, 0.826,... tree.MXD
568 NAm_200 44.8000 252.1000 2820.0 [1508.0, 1509.0, 1510.0, 1511.0, 1512.0, 1513.... [1.03, 0.928, 0.929, 1.052, 1.001, 0.991, 0.94... tree.MXD
569 NAm_130 55.3000 282.2000 50.0 [1352.0, 1353.0, 1354.0, 1355.0, 1356.0, 1357.... [0.98, 0.988, 0.956, 0.845, 1.053, 1.129, 1.07... tree.MXD
570 NAm_199 41.3000 252.3000 3150.0 [1401.0, 1402.0, 1403.0, 1404.0, 1405.0, 1406.... [1.042, 0.998, 0.953, 1.079, 1.057, 0.91, 1.03... tree.MXD
571 NAm_185 45.3000 238.3000 1300.0 [1504.0, 1505.0, 1506.0, 1507.0, 1508.0, 1509.... [1.016, 0.962, 1.009, 0.94, 0.932, 0.847, 0.97... tree.MXD

572 rows × 7 columns

# create the ProxyDatabase object using the correctly labeled column names

blank = cfr.ProxyDatabase()

lmr_pdb = blank.from_df(
    new_pdb,
    pid_column='pid',
    lat_column='lat',
    lon_column='lon',
    elev_column='elev',
    time_column='time',
    value_column='value',
    ptype_column='ptype'
)

To filter by resolution, we use cfr’s built-in ProxyDatabase.filter method. We filter by ‘dt’, the median timestep between consecutive time values. The following cell ensures that there are no None values when computing dt.

missing_dt = [pid for pid, r in lmr_pdb.records.items() if r.dt is None]
print(f"{len(missing_dt)} records missing dt")
1 records missing dt
missing = lmr_pdb.filter(by='pid', keys=missing_dt, mode='exact')
lmr_pdb = lmr_pdb - missing
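
For intuition, ‘dt’ is (per the description above) the median spacing of a record’s time axis; a minimal sketch of the computation with illustrative values:

# dt is the median spacing of the time axis; values here are illustrative
time = np.array([1900.0, 1901.0, 1902.0, 1904.0])
dt = np.median(np.diff(time))  # -> 1.0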

We want dt <= 1.0, but we widen the upper bound slightly to catch any records with marginally coarser than annual resolution.

# everything should be annual

filt = lmr_pdb.filter(by='dt', keys=(0.0, 1.2))

for pid in filt.records:
    pobj = filt[pid]
    print(pobj.pid, pobj.dt)
Ocn_148 1.0
Ocn_170 1.0
Ocn_174 1.0
Ocn_073 1.0
Ocn_173 1.0
Ocn_171 1.0
Ocn_065 1.0
Ocn_158 1.0
Ocn_172 1.0
Ocn_167 1.0
Ocn_121 1.0
Ocn_131 1.0
Ocn_163 1.0
Ocn_069 1.0
Ocn_112 1.0
Ocn_061 1.0
Ocn_182 1.0
Ocn_129 1.0
Ocn_067 1.0
Ocn_160 1.0
Ocn_070 1.0
Ocn_155 1.0
Ocn_153 1.0
Ocn_084 1.0
Ocn_093 1.0
Ocn_141 1.0
Ocn_159 1.0
Ocn_109 1.0
Ocn_161 1.0
Ocn_101 1.0
Ocn_143 1.0
Ocn_096 1.0
Ocn_176 1.0
Ocn_072 1.0
Ocn_106 1.0
Ocn_125 1.0
Ocn_060 1.0
Ocn_157 1.0
Ocn_110 1.0
Ocn_077 1.0
Ocn_166 1.0
Ocn_120 1.0
Ocn_099 1.0
Ocn_083 1.0
Ocn_179 1.0
Ocn_123 1.0
Ocn_130 1.0
Ocn_079 1.0
Ocn_087 1.0
Ocn_115 1.0
Ocn_162 1.0
Ocn_068 1.0
Ocn_075 1.0
Ocn_178 1.0
Ocn_088 1.0
Ocn_074 1.0
Ocn_097 1.0
Ocn_111 1.0
Ocn_139 1.0
Ocn_062 1.0
Ocn_138 1.0
Ocn_119 1.0
Ocn_076 1.0
Ocn_086 1.0
Ocn_181 1.0
Ocn_066 1.0
Ocn_128 1.0
Ocn_080 1.0
Ocn_098 1.0
Ocn_095 1.0
Ocn_116 1.0
Ocn_154 1.0
Ocn_169 1.0
Ocn_107 1.0
Ocn_114 1.0
Ocn_104 1.0
Ocn_127 1.0
Ocn_091 1.0
Ocn_140 1.0
Ocn_108 1.0
Ocn_078 1.0
Ocn_082 1.0
Ocn_081 1.0
Ocn_147 1.0
Ocn_142 1.0
Ocn_168 1.0
Ocn_175 1.0
Ocn_071 1.0
Ocn_122 1.0
Ocn_146 1.0
Ocn_180 1.0
Ocn_118 1.0
Ocn_090 1.0
Ocn_156 1.0
Arc_080 1.0
Ant_020 1.0
Ant_003 1.0
Arc_032 1.0
Arc_011 1.0
Ant_011 1.0
Ant_024 1.0
Arc_078 1.0
SAm_026 1.0
Arc_072 1.0
Ant_005 1.0
Arc_035 1.0
Arc_036 1.0
Ant_008 1.0
Arc_029 1.0
Arc_027 1.0
Ant_006 1.0
Arc_033 1.0
Arc_028 1.0
Ant_002 1.0
Ant_004 1.0
Arc_034 1.0
Ant_021 1.0
Ant_012 1.0
Arc_075 1.0
Arc_018 1.0
Ant_007 1.0
Asi_232 1.0
Arc_005 1.0
Arc_064 1.0
Ant_025 1.0
Ant_010 1.0
Ant_026 1.0
Ant_019 1.0
Ant_017 1.0
Ant_028 1.0
Ant_001 1.0
Arc_004 1.0
Arc_026 1.0
NAm_072 1.0
Arc_020 1.0
Arc_014 1.0
Arc_001 1.0
Arc_025 1.0
Arc_022 1.0
Afr_012 1.0
Arc_22 1.0
NAm_049 1.0
Asi_221 1.0
Asi_153 1.0
Asi_044 1.0
Asi_130 1.0
NAm_030 1.0
NAm_046 1.0
NAm_178 1.0
Asi_142 1.0
Asi_146 1.0
SAm_006 1.0
Asi_143 1.0
Aus_030 1.0
NAm_156 1.0
Asi_137 1.0
NAm_011 1.0
NAm_045 1.0
Asi_119 1.0
Asi_216 1.0
NAm_149 1.0
NAm_032 1.0
NAm_109 1.0
Asi_079 1.0
Asi_021 1.0
Asi_100 1.0
NAm_147 1.0
Asi_010 1.0
Asi_047 1.0
Asi_093 1.0
NAm_019 1.0
Asi_054 1.0
Asi_206 1.0
Asi_065 1.0
NAm_096 1.0
Asi_007 1.0
Asi_134 1.0
Asi_172 1.0
NAm_126 1.0
Asi_184 1.0
NAm_180 1.0
NAm_166 1.0
Eur_005 1.0
NAm_160 1.0
NAm_111 1.0
NAm_018 1.0
NAm_110 1.0
Asi_063 1.0
Asi_131 1.0
Asi_048 1.0
Asi_052 1.0
Asi_068 1.0
Eur_009 1.0
Asi_023 1.0
Aus_005 1.0
Asi_087 1.0
Asi_176 1.0
NAm_099 1.0
Asi_162 1.0
Asi_014 1.0
NAm_182 1.0
Asi_071 1.0
Asi_053 1.0
Asi_219 1.0
Asi_189 1.0
NAm_176 1.0
Asi_029 1.0
Asi_036 1.0
Asi_192 1.0
NAm_105 1.0
Asi_086 1.0
Asi_008 1.0
Asi_180 1.0
NAm_188 1.0
NAm_170 1.0
Aus_007 1.0
Asi_011 1.0
NAm_151 1.0
Asi_220 1.0
Arc_002 1.0
Asi_013 1.0
NAm_173 1.0
Asi_156 1.0
Asi_168 1.0
NAm_002 1.0
Asi_015 1.0
Asi_028 1.0
Asi_195 1.0
Asi_201 1.0
Asi_136 1.0
Asi_096 1.0
Eur_008 1.0
NAm_142 1.0
NAm_177 1.0
Asi_118 1.0
Asi_225 1.0
NAm_009 1.0
SAm_024 1.0
Asi_165 1.0
Asi_148 1.0
Asi_190 1.0
Asi_104 1.0
NAm_148 1.0
Asi_140 1.0
Asi_193 1.0
Asi_103 1.0
Asi_215 1.0
Asi_179 1.0
Asi_135 1.0
Asi_127 1.0
Asi_005 1.0
NAm_136 1.0
Asi_202 1.0
NAm_081 1.0
Asi_102 1.0
NAm_158 1.0
Asi_024 1.0
Asi_126 1.0
NAm_125 1.0
Asi_226 1.0
SAm_029 1.0
Asi_183 1.0
Asi_199 1.0
Asi_059 1.0
Asi_074 1.0
Asi_056 1.0
NAm_196 1.0
Asi_207 1.0
Asi_173 1.0
Asi_090 1.0
Asi_121 1.0
Asi_098 1.0
NAm_128 1.0
Asi_163 1.0
NAm_100 1.0
Asi_040 1.0
Asi_203 1.0
Asi_169 1.0
Asi_078 1.0
NAm_097 1.0
Asi_166 1.0
NAm_161 1.0
NAm_087 1.0
Asi_155 1.0
Arc_024 1.0
Asi_159 1.0
Asi_062 1.0
NAm_135 1.0
NAm_189 1.0
Asi_141 1.0
Asi_161 1.0
Asi_045 1.0
Asi_123 1.0
Asi_129 1.0
NAm_013 1.0
NAm_133 1.0
Asi_209 1.0
Asi_150 1.0
Asi_112 1.0
Asi_196 1.0
Asi_060 1.0
NAm_083 1.0
Asi_083 1.0
Asi_124 1.0
Asi_091 1.0
NAm_060 1.0
Asi_223 1.0
Asi_051 1.0
Asi_122 1.0
Asi_018 1.0
Asi_200 1.0
Asi_089 1.0
Asi_138 1.0
Asi_111 1.0
Asi_105 1.0
NAm_070 1.0
Asi_082 1.0
Asi_228 1.0
NAm_190 1.0
Asi_170 1.0
Asi_076 1.0
NAm_085 1.0
Asi_026 1.0
NAm_091 1.0
Asi_077 1.0
NAm_001 1.0
Asi_016 1.0
NAm_071 1.0
Asi_110 1.0
NAm_007 1.0
Arc_008 1.0
NAm_154 1.0
Asi_160 1.0
NAm_193 1.0
Asi_114 1.0
NAm_203 1.0
NAm_146 1.0
Asi_080 1.0
Asi_147 1.0
Asi_145 1.0
Asi_151 1.0
Asi_058 1.0
NAm_127 1.0
Arc_016 1.0
NAm_114 1.0
Asi_218 1.0
Asi_217 1.0
Asi_164 1.0
NAm_008 1.0
Asi_027 1.0
NAm_132 1.0
Asi_197 1.0
Asi_154 1.0
Asi_064 1.0
Asi_020 1.0
NAm_088 1.0
Asi_003 1.0
Asi_085 1.0
Asi_229 1.0
Asi_106 1.0
NAm_145 1.0
Asi_108 1.0
Asi_004 1.0
Asi_158 1.0
Eur_006 1.0
Asi_210 1.0
Asi_009 1.0
Asi_177 1.0
Asi_174 1.0
Asi_025 1.0
Asi_061 1.0
Asi_191 1.0
Asi_139 1.0
NAm_155 1.0
Asi_069 1.0
Asi_186 1.0
NAm_153 1.0
NAm_050 1.0
Asi_057 1.0
NAm_186 1.0
Asi_032 1.0
Asi_037 1.0
Asi_072 1.0
NAm_183 1.0
Asi_101 1.0
NAm_059 1.0
NAm_029 1.0
Aus_031 1.0
Aus_009 1.0
Asi_043 1.0
Asi_113 1.0
NAm_098 1.0
NAm_120 1.0
Asi_149 1.0
Asi_204 1.0
Asi_128 1.0
Asi_213 1.0
NAm_092 1.0
Asi_088 1.0
Asi_214 1.0
Asi_211 1.0
Asi_050 1.0
Asi_175 1.0
Asi_187 1.0
Asi_049 1.0
NAm_090 1.0
NAm_201 1.0
Asi_095 1.0
Asi_035 1.0
Asi_041 1.0
Asi_030 1.0
Asi_198 1.0
NAm_195 1.0
Asi_185 1.0
NAm_140 1.0
Asi_034 1.0
Aus_004 1.0
SAm_025 1.0
Asi_066 1.0
Asi_002 1.0
Asi_132 1.0
Asi_075 1.0
Asi_073 1.0
NAm_163 1.0
Asi_006 1.0
Asi_067 1.0
Asi_081 1.0
NAm_089 1.0
Asi_046 1.0
NAm_162 1.0
Asi_092 1.0
Arc_073 1.0
Asi_017 1.0
Asi_022 1.0
Asi_182 1.0
Asi_097 1.0
Asi_117 1.0
Asi_031 1.0
NAm_117 1.0
Asi_167 1.0
Asi_115 1.0
Asi_019 1.0
Asi_094 1.0
Asi_133 1.0
Eur_004 1.0
Asi_001 1.0
Asi_178 1.0
NAm_168 1.0
Asi_099 1.0
NAm_003 1.0
Asi_157 1.0
Asi_070 1.0
NAm_066 1.0
NAm_094 1.0
NAm_093 1.0
Asi_116 1.0
NAm_191 1.0
NAm_138 1.0
NAm_144 1.0
Asi_038 1.0
Asi_171 1.0
Asi_055 1.0
Asi_152 1.0
Asi_042 1.0
NAm_129 1.0
NAm_198 1.0
Asi_224 1.0
Asi_222 1.0
Asi_205 1.0
Asi_181 1.0
Asi_039 1.0
Asi_033 1.0
Asi_125 1.0
Asi_109 1.0
NAm_171 1.0
Asi_144 1.0
Asi_120 1.0
NAm_159 1.0
Asi_107 1.0
Asi_208 1.0
Asi_084 1.0
NAm_112 1.0
Asi_194 1.0
Asi_212 1.0
Asi_227 1.0
Asi_188 1.0
NAm_157 1.0
NAm_044 1.0
NAm_179 1.0
NAm_192 1.0
NAm_124 1.0
NAm_113 1.0
NAm_108 1.0
NAm_084 1.0
NAm_181 1.0
NAm_167 1.0
Arc_068 1.0
NAm_102 1.0
NAm_119 1.0
NAm_116 1.0
NAm_041 1.0
NAm_152 1.0
NAm_174 1.0
Arc_074 1.0
Arc_066 1.0
Arc_065 1.0
NAm_104 1.0
NAm_137 1.0
NAm_026 1.0
NAm_101 1.0
NAm_143 1.0
NAm_134 1.0
Eur_007 1.0
Eur_003 1.0
NAm_122 1.0
NAm_086 1.0
NAm_194 1.0
NAm_204 1.0
NAm_115 1.0
NAm_107 1.0
NAm_175 1.0
NAm_064 1.0
NAm_103 1.0
NAm_123 1.0
NAm_121 1.0
Arc_071 1.0
Arc_063 1.0
Arc_061 1.0
NAm_150 1.0
NAm_202 1.0
NAm_184 1.0
NAm_141 1.0
NAm_172 1.0
NAm_197 1.0
NAm_187 1.0
Arc_077 1.0
NAm_106 1.0
NAm_165 1.0
NAm_164 1.0
NAm_118 1.0
NAm_169 1.0
NAm_095 1.0
NAm_139 1.0
NAm_200 1.0
NAm_130 1.0
NAm_199 1.0
NAm_185 1.0
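
Before moving on, a single record can be visualized as a quick sanity check (a sketch assuming cfr’s ProxyRecord.plot method; any pid from the list above works):

# plot one record; 'Ocn_148' is just an example pid
fig, ax = filt['Ocn_148'].plot()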

(Optional) Check for sparse records that don’t meet the 0.5 threshold#

LMRv2.1 uses the parameter valid_frac to remove records whose valid data cover less than half of their full time span. If the database has been cleaned up, chances are this will not remove anything, but it is good to run just in case, to see which records would get filtered out. Records that do not meet the overlap requirement are also dropped during the calibration step of the data assimilation workflow.

valid_frac = 0.5
filtered_records = []

for pid, record in filt.records.items():
    time = np.array(record.time)
    value = np.array(record.value)

    # drop entries where either time or value is missing
    mask = ~np.isnan(time) & ~np.isnan(value)
    time = time[mask]
    value = value[mask]

    if len(time) < 2:
        continue

    # fraction of years covered between the first and last observation
    years = time.astype(int)
    year_range = years.max() - years.min() + 1
    coverage = len(years) / year_range if year_range > 0 else 0

    # keep the record if coverage >= valid_frac, or if it is manually whitelisted
    if coverage >= valid_frac or pid == 'Ocn_148':
        filtered_records.append(record)
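
As a quick check, we can report how many records survive the coverage screen:

print(f"{len(filtered_records)} of {len(filt.records)} records meet the coverage threshold")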

Final ProxyDatabase Object#

The final ProxyDatabase object contains only PAGES2kv2 records that:

  • Are annual or subannual in resolution (dt ≤ 1.2 years)

  • Have at least 50% coverage over their date range

  • Include complete time and value arrays

  • Are labeled with a standardized ptype format (e.g., tree.d18O, marine.MgCa)

This cleaned and filtered database is now ready for use in the cfr framework for climate field reconstruction.

# remove dropped records from final database

final_pdb = cfr.ProxyDatabase()
final_pdb += filtered_records

Save the ProxyDatabase as a netCDF file#

final_pdb.to_nc('./prev_data/lmr_pickle_pages2k.nc')
>>> Warning: this is an experimental feature.
>>> Warning: this is an experimental feature.

100%|██████████| 544/544 [00:15<00:00, 35.39it/s]
>>> Data before 1 CE is dropped for records: ['Arc_032', 'Arc_033', 'Ant_007', 'Ant_010', 'Ant_028', 'Arc_004', 'Arc_026', 'Arc_020', 'Arc_022', 'NAm_011', 'NAm_019', 'Arc_002', 'Eur_006', 'Eur_003'].
ProxyDatabase saved to: ./prev_data/lmr_pickle_pages2k.nc
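
To verify that the file round-trips cleanly, it can be reloaded (a minimal sketch assuming cfr’s load_nc method and nrec attribute):

# reload the saved database and count its records
pdb_check = cfr.ProxyDatabase().load_nc('./prev_data/lmr_pickle_pages2k.nc')
print(pdb_check.nrec)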

Summary and Output#

We have constructed a cleaned, standardized proxy database containing only PAGES2kv2 records. The database has been filtered by resolution and (optionally) by coverage, and formatted for direct use in cfr paleoclimate data assimilation workflows.

The result has been saved as a NetCDF file:

./prev_data/lmr_pickle_pages2k.nc