Database Assembly using the LMRv2.1 pickle file#

This notebook walks through the process of building a cfr ProxyDatabase object containing only the PAGES 2k (version 2.0.0) proxy records. For reproducibility, our starting point is the pair of proxy pickle files from the LMR Project Data Page of Tardif et al. (2019); we clean and reformat these records to be compatible with the cfr data assimilation workflow.

Key steps include:

  • Filtering only PAGES2kv2 records

  • Standardizing proxy record identifiers (short pid)

  • Merging proxy time and values with metadata

  • Assigning standardized ptype labels used by cfr

  • Filtering for annual and subannual records with sufficient temporal coverage

The final result is saved as a NetCDF file that can be used in subsequent data assimilation notebooks.

import cfr
import pandas as pd
import numpy as np
import pyleoclim as pyleo
import matplotlib.pyplot as plt

Combining metadata and data pickle files#

Loading the metadata and data CSV files and filtering only PAGES2kv2 data#

When the raw pickle files (one for metadata, one for data) for the proxies are downloaded from the LMR Project Data Page, they will need to be pre-processed. The pickle files were created using an older version of Python, hence you will need to open the files as pd.sparse_dataframe objects before converting them to .csv files to be used for the rest of this notebook.
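
A minimal sketch of this pre-processing step is below; the pickle filenames are placeholders, so substitute the names from your download:

import pandas as pd

# placeholder filenames for the two pickles from the LMR Project Data Page
df_meta_raw = pd.read_pickle('./proxydb_meta.pckl')
df_data_raw = pd.read_pickle('./proxydb_data.pckl')

# pickles written by older pandas may deserialize as sparse objects; densify if so
# df_data_raw = df_data_raw.sparse.to_dense()

df_meta_raw.to_csv('./proxydb_meta.csv')
df_data_raw.to_csv('./proxydb_data.csv')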

First, load the metadata CSV and isolate only the columns we need to build a cfr.ProxyDatabase object.

# loading metadata

df_meta = pd.read_csv('./proxydb_meta.csv')

# keep only PAGES2kv2 records; .copy() avoids SettingWithCopyWarning later
df_p2k_meta = df_meta[df_meta['Proxy ID'].str.contains('PAGES2kv2')].copy()
df_p2k_meta.set_index('Proxy ID', inplace=True)

# isolate the metadata columns needed for the ProxyDatabase
archive_data = df_p2k_meta[['Archive type', 'Proxy measurement', 'Lat (N)', 'Lon (E)', 'Elev']].copy()
archive_data['Proxy ID'] = archive_data.index  # recover it as a column

Next, load the data CSV, which contains the time and proxy value arrays. This step selects all the PAGES2kv2 columns and resets the index.

# loading data (time and value)

df = pd.read_csv('./proxydb_data.csv')

# keep the shared time axis ('Year C.E.') plus every PAGES2kv2 proxy column
clist = [c for c in df.columns if ('PAGES2kv2' in c) or ('Year C.E.' in c)]

p2k = df[clist].reset_index(drop=True)

Standardizing Proxy IDs (pid)#

The cfr.ProxyDatabase class expects short proxy IDs (e.g., Ocn_148) that follow a specific location-based naming convention. We create a function that converts the long-format PAGES2kv2 IDs into these standardized shortened IDs.

time_col = p2k.columns[0]

pids = []
times = []
values = []

for pid in p2k.columns[1:]:  # skip the shared time column
    # get the time series for this proxy, dropping missing entries
    proxy_data = p2k[[time_col, pid]].dropna()

    if len(proxy_data) > 0:  # only keep proxies that actually have data
        pids.append(pid)
        times.append(proxy_data[time_col].tolist())
        values.append(proxy_data[pid].tolist())

def shorten_pid(pid):
    # Split at the colon first, to drop the measurement suffix
    # (some proxy measurements contain underscores)
    stem = pid.split(':')[0]
    parts = stem.split('_')

    # Find the part that is the record number
    number = None
    for part in parts:
        if part.isdigit():
            number = part
            break

    # Extract the region from the second part
    region = parts[1].split('-')[0]

    # Map long region names onto the short codes used by cfr
    region_mapping = {
        'Asia': 'Asi',
        'Ocean2kHR': 'Ocn',
        'Africa': 'Afr',
        'Afr': 'Afr',
        'NAm': 'NAm',
        'Arc': 'Arc',
        'Ant': 'Ant',
        'Aus': 'Aus',
        'SAm': 'SAm'
    }
    region = region_mapping.get(region, region)

    if number is None:
        # If no purely numeric part was found, extract the digits
        # from the first part that contains any
        for part in parts:
            if any(char.isdigit() for char in part):
                number = ''.join(char for char in part if char.isdigit())
                break

    return f"{region}_{number}"

We apply the naming function to our metadata dataframe so that the new index is the shortened pid, as used by cfr.

# apply the naming function to every long-format pid
short_pids = [shorten_pid(p) for p in pids]

df_p2k_meta_short = df_p2k_meta.rename(index=shorten_pid)
archive_data_new_idx = archive_data.rename(index=shorten_pid)


data_df = pd.DataFrame({
    'pid': short_pids,
    'time': times,
    'value': values
})

Now we take the data dataframe with shortened pids and merge it with the metadata dataframe to combine all the columns necessary to make a ProxyDatabase object.

archive_data_new_idx = archive_data_new_idx.reset_index(names=['pid'])

new_pdb = pd.merge(
    archive_data_new_idx,
    data_df,
    on='pid',
    how='inner'
)

Assigning Proxy Type Labels (ptype)#

Each record in the cfr.ProxyDatabase is tagged with a ptype, which combines the archive and the measured variable (e.g., tree.d18O, marine.MgCa). We generate these labels using a mapping based on the archive and measurement metadata.
To do so, we add a new column called ‘ptype’ using the function create_ptype. The mapping was constructed by hand, comparing records from cfr’s built-in ProxyDatabase against the metadata from the Tardif et al. (2019) pickle to get an exact match.

def create_ptype(row):

    archive_mapping = {
        'Tree Rings': 'tree',
        'Corals and Sclerosponges': 'coral',
        'Lake Cores': 'lake',
        'Ice Cores': 'ice',
        'Bivalve': 'bivalve',
        'Speleothems': 'speleothem',
        'Marine Cores': 'marine'
    }

    measure_mapping = {
        'Sr_Ca': 'SrCa',
        'trsgi': 'TRW',
        'calcification': 'calc',
        'd18O': 'd18O',
        'dD': 'dD',
        'MXD': 'MXD',
        'density': 'MXD',
        'composite': 'calc',
        'massacum': 'accumulation',
        'thickness': 'varve_thickness',
        'melt': 'melt',
        'RABD660_670': 'reflectance',
        'X_radiograph_dark_layer': 'varve_property'
    }
    
    # Get the shortened archive type (fall back to a lowercased copy)
    archive = archive_mapping.get(row['Archive type'], row['Archive type'].lower())
    # Get the standardized measurement label (fall back to the raw name)
    proxy = measure_mapping.get(row['Proxy measurement'], row['Proxy measurement'])

    # Combine archive and measurement into the ptype label
    return f"{archive}.{proxy}"

We now apply the function to our full dataframe.

new_pdb['ptype'] = new_pdb.apply(create_ptype, axis=1)
new_pdb = new_pdb.drop(['Archive type', 'Proxy measurement', 'Proxy ID'], axis=1)

Filtering by resolution (as done in LMR)#

We know that LMRv2.1 used only annually and subannually resolved records. To apply the same criterion, we take our new dataframe, which now has all the right columns, and turn it into a cfr.ProxyDatabase object so we can filter by resolution.


new_pdb = new_pdb.rename(columns={
    'Lat (N)': 'lat',
    'Lon (E)': 'lon',
    'Elev': 'elev'
})
# examine the dataframe

new_pdb
pid lat lon elev time value ptype
0 Ocn_148 43.6561 290.1983 -30.0 [1033.0, 1034.0, 1035.0, 1036.0, 1037.0, 1038.... [1.14, 1.06, 0.8, 0.69, 1.21, 1.4, 1.26, 0.85,... bivalve.d18O
1 Ocn_170 -21.8333 114.1833 -6.0 [1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.... [-29.80451, 10.97039, 28.05434, 21.96584, -2.7... coral.calc
2 Ocn_174 -21.9000 113.9667 -6.0 [1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.... [10.976, -4.39066, -3.39162, 16.9882, -9.2862,... coral.calc
3 Ocn_073 20.8300 273.2600 -3.1 [1773.0, 1774.0, 1775.0, 1776.0, 1777.0, 1778.... [-0.605009101918667, 0.11922153808142, -0.9671... coral.calc
4 Ocn_173 -17.5167 118.9667 -6.0 [1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.... [-3.65572, 15.4608, 6.62304, -20.1302, 14.6722... coral.calc
... ... ... ... ... ... ... ...
567 NAm_139 49.7000 241.1000 2000.0 [1512.0, 1513.0, 1514.0, 1515.0, 1516.0, 1517.... [0.87, 0.885, 0.893, 0.893, 0.87, 0.81, 0.826,... tree.MXD
568 NAm_200 44.8000 252.1000 2820.0 [1508.0, 1509.0, 1510.0, 1511.0, 1512.0, 1513.... [1.03, 0.928, 0.929, 1.052, 1.001, 0.991, 0.94... tree.MXD
569 NAm_130 55.3000 282.2000 50.0 [1352.0, 1353.0, 1354.0, 1355.0, 1356.0, 1357.... [0.98, 0.988, 0.956, 0.845, 1.053, 1.129, 1.07... tree.MXD
570 NAm_199 41.3000 252.3000 3150.0 [1401.0, 1402.0, 1403.0, 1404.0, 1405.0, 1406.... [1.042, 0.998, 0.953, 1.079, 1.057, 0.91, 1.03... tree.MXD
571 NAm_185 45.3000 238.3000 1300.0 [1504.0, 1505.0, 1506.0, 1507.0, 1508.0, 1509.... [1.016, 0.962, 1.009, 0.94, 0.932, 0.847, 0.97... tree.MXD

572 rows × 7 columns

# create the ProxyDatabase object using the correctly labeled column names

blank = cfr.ProxyDatabase()

lmr_pdb = blank.from_df(
    new_pdb,
    pid_column='pid',
    lat_column='lat',
    lon_column='lon',
    elev_column='elev',
    time_column='time',
    value_column='value',
    ptype_column='ptype'
)

To filter by resolution, we use cfr’s built-in ProxyDatabase.filter method. We filter by ‘dt’, the median timestep between consecutive time values. The following cell ensures that there are no None values when computing dt.

missing_dt = [pid for pid, r in lmr_pdb.records.items() if r.dt is None]
print(f"{len(missing_dt)} records missing dt")
1 records missing dt
missing = lmr_pdb.filter(by='pid', keys=missing_dt, mode='exact')
lmr_pdb = lmr_pdb - missing
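
For intuition, ‘dt’ is (per the description above) the median spacing of a record’s time axis; a minimal sketch of the computation with illustrative values:

# dt is the median spacing of the time axis; values here are illustrative
time = np.array([1900.0, 1901.0, 1902.0, 1904.0])
dt = np.median(np.diff(time))  # -> 1.0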

We want dt <= 1.0, but we widen the upper bound slightly to catch any records with marginally coarser than annual resolution.

# everything should be annual

filt = lmr_pdb.filter(by='dt', keys=(0.0, 1.2))

for pid in filt.records:
    pobj = filt[pid]
    print(pobj.pid, pobj.dt)
Ocn_148 1.0
Ocn_170 1.0
Ocn_174 1.0
Ocn_073 1.0
Ocn_173 1.0
Ocn_171 1.0
Ocn_065 1.0
Ocn_158 1.0
Ocn_172 1.0
Ocn_167 1.0
Ocn_121 1.0
Ocn_131 1.0
Ocn_163 1.0
Ocn_069 1.0
Ocn_112 1.0
Ocn_061 1.0
Ocn_182 1.0
Ocn_129 1.0
Ocn_067 1.0
Ocn_160 1.0
Ocn_070 1.0
Ocn_155 1.0
Ocn_153 1.0
Ocn_084 1.0
Ocn_093 1.0
Ocn_141 1.0
Ocn_159 1.0
Ocn_109 1.0
Ocn_161 1.0
Ocn_101 1.0
Ocn_143 1.0
Ocn_096 1.0
Ocn_176 1.0
Ocn_072 1.0
Ocn_106 1.0
Ocn_125 1.0
Ocn_060 1.0
Ocn_157 1.0
Ocn_110 1.0
Ocn_077 1.0
Ocn_166 1.0
Ocn_120 1.0
Ocn_099 1.0
Ocn_083 1.0
Ocn_179 1.0
Ocn_123 1.0
Ocn_130 1.0
Ocn_079 1.0
Ocn_087 1.0
Ocn_115 1.0
Ocn_162 1.0
Ocn_068 1.0
Ocn_075 1.0
Ocn_178 1.0
Ocn_088 1.0
Ocn_074 1.0
Ocn_097 1.0
Ocn_111 1.0
Ocn_139 1.0
Ocn_062 1.0
Ocn_138 1.0
Ocn_119 1.0
Ocn_076 1.0
Ocn_086 1.0
Ocn_181 1.0
Ocn_066 1.0
Ocn_128 1.0
Ocn_080 1.0
Ocn_098 1.0
Ocn_095 1.0
Ocn_116 1.0
Ocn_154 1.0
Ocn_169 1.0
Ocn_107 1.0
Ocn_114 1.0
Ocn_104 1.0
Ocn_127 1.0
Ocn_091 1.0
Ocn_140 1.0
Ocn_108 1.0
Ocn_078 1.0
Ocn_082 1.0
Ocn_081 1.0
Ocn_147 1.0
Ocn_142 1.0
Ocn_168 1.0
Ocn_175 1.0
Ocn_071 1.0
Ocn_122 1.0
Ocn_146 1.0
Ocn_180 1.0
Ocn_118 1.0
Ocn_090 1.0
Ocn_156 1.0
Arc_080 1.0
Ant_020 1.0
Ant_003 1.0
Arc_032 1.0
Arc_011 1.0
Ant_011 1.0
Ant_024 1.0
Arc_078 1.0
SAm_026 1.0
Arc_072 1.0
Ant_005 1.0
Arc_035 1.0
Arc_036 1.0
Ant_008 1.0
Arc_029 1.0
Arc_027 1.0
Ant_006 1.0
Arc_033 1.0
Arc_028 1.0
Ant_002 1.0
Ant_004 1.0
Arc_034 1.0
Ant_021 1.0
Ant_012 1.0
Arc_075 1.0
Arc_018 1.0
Ant_007 1.0
Asi_232 1.0
Arc_005 1.0
Arc_064 1.0
Ant_025 1.0
Ant_010 1.0
Ant_026 1.0
Ant_019 1.0
Ant_017 1.0
Ant_028 1.0
Ant_001 1.0
Arc_004 1.0
Arc_026 1.0
NAm_072 1.0
Arc_020 1.0
Arc_014 1.0
Arc_001 1.0
Arc_025 1.0
Arc_022 1.0
Afr_012 1.0
Arc_22 1.0
NAm_049 1.0
Asi_221 1.0
Asi_153 1.0
Asi_044 1.0
Asi_130 1.0
NAm_030 1.0
NAm_046 1.0
NAm_178 1.0
Asi_142 1.0
Asi_146 1.0
SAm_006 1.0
Asi_143 1.0
Aus_030 1.0
NAm_156 1.0
Asi_137 1.0
NAm_011 1.0
NAm_045 1.0
Asi_119 1.0
Asi_216 1.0
NAm_149 1.0
NAm_032 1.0
NAm_109 1.0
Asi_079 1.0
Asi_021 1.0
Asi_100 1.0
NAm_147 1.0
Asi_010 1.0
Asi_047 1.0
Asi_093 1.0
NAm_019 1.0
Asi_054 1.0
Asi_206 1.0
Asi_065 1.0
NAm_096 1.0
Asi_007 1.0
Asi_134 1.0
Asi_172 1.0
NAm_126 1.0
Asi_184 1.0
NAm_180 1.0
NAm_166 1.0
Eur_005 1.0
NAm_160 1.0
NAm_111 1.0
NAm_018 1.0
NAm_110 1.0
Asi_063 1.0
Asi_131 1.0
Asi_048 1.0
Asi_052 1.0
Asi_068 1.0
Eur_009 1.0
Asi_023 1.0
Aus_005 1.0
Asi_087 1.0
Asi_176 1.0
NAm_099 1.0
Asi_162 1.0
Asi_014 1.0
NAm_182 1.0
Asi_071 1.0
Asi_053 1.0
Asi_219 1.0
Asi_189 1.0
NAm_176 1.0
Asi_029 1.0
Asi_036 1.0
Asi_192 1.0
NAm_105 1.0
Asi_086 1.0
Asi_008 1.0
Asi_180 1.0
NAm_188 1.0
NAm_170 1.0
Aus_007 1.0
Asi_011 1.0
NAm_151 1.0
Asi_220 1.0
Arc_002 1.0
Asi_013 1.0
NAm_173 1.0
Asi_156 1.0
Asi_168 1.0
NAm_002 1.0
Asi_015 1.0
Asi_028 1.0
Asi_195 1.0
Asi_201 1.0
Asi_136 1.0
Asi_096 1.0
Eur_008 1.0
NAm_142 1.0
NAm_177 1.0
Asi_118 1.0
Asi_225 1.0
NAm_009 1.0
SAm_024 1.0
Asi_165 1.0
Asi_148 1.0
Asi_190 1.0
Asi_104 1.0
NAm_148 1.0
Asi_140 1.0
Asi_193 1.0
Asi_103 1.0
Asi_215 1.0
Asi_179 1.0
Asi_135 1.0
Asi_127 1.0
Asi_005 1.0
NAm_136 1.0
Asi_202 1.0
NAm_081 1.0
Asi_102 1.0
NAm_158 1.0
Asi_024 1.0
Asi_126 1.0
NAm_125 1.0
Asi_226 1.0
SAm_029 1.0
Asi_183 1.0
Asi_199 1.0
Asi_059 1.0
Asi_074 1.0
Asi_056 1.0
NAm_196 1.0
Asi_207 1.0
Asi_173 1.0
Asi_090 1.0
Asi_121 1.0
Asi_098 1.0
NAm_128 1.0
Asi_163 1.0
NAm_100 1.0
Asi_040 1.0
Asi_203 1.0
Asi_169 1.0
Asi_078 1.0
NAm_097 1.0
Asi_166 1.0
NAm_161 1.0
NAm_087 1.0
Asi_155 1.0
Arc_024 1.0
Asi_159 1.0
Asi_062 1.0
NAm_135 1.0
NAm_189 1.0
Asi_141 1.0
Asi_161 1.0
Asi_045 1.0
Asi_123 1.0
Asi_129 1.0
NAm_013 1.0
NAm_133 1.0
Asi_209 1.0
Asi_150 1.0
Asi_112 1.0
Asi_196 1.0
Asi_060 1.0
NAm_083 1.0
Asi_083 1.0
Asi_124 1.0
Asi_091 1.0
NAm_060 1.0
Asi_223 1.0
Asi_051 1.0
Asi_122 1.0
Asi_018 1.0
Asi_200 1.0
Asi_089 1.0
Asi_138 1.0
Asi_111 1.0
Asi_105 1.0
NAm_070 1.0
Asi_082 1.0
Asi_228 1.0
NAm_190 1.0
Asi_170 1.0
Asi_076 1.0
NAm_085 1.0
Asi_026 1.0
NAm_091 1.0
Asi_077 1.0
NAm_001 1.0
Asi_016 1.0
NAm_071 1.0
Asi_110 1.0
NAm_007 1.0
Arc_008 1.0
NAm_154 1.0
Asi_160 1.0
NAm_193 1.0
Asi_114 1.0
NAm_203 1.0
NAm_146 1.0
Asi_080 1.0
Asi_147 1.0
Asi_145 1.0
Asi_151 1.0
Asi_058 1.0
NAm_127 1.0
Arc_016 1.0
NAm_114 1.0
Asi_218 1.0
Asi_217 1.0
Asi_164 1.0
NAm_008 1.0
Asi_027 1.0
NAm_132 1.0
Asi_197 1.0
Asi_154 1.0
Asi_064 1.0
Asi_020 1.0
NAm_088 1.0
Asi_003 1.0
Asi_085 1.0
Asi_229 1.0
Asi_106 1.0
NAm_145 1.0
Asi_108 1.0
Asi_004 1.0
Asi_158 1.0
Eur_006 1.0
Asi_210 1.0
Asi_009 1.0
Asi_177 1.0
Asi_174 1.0
Asi_025 1.0
Asi_061 1.0
Asi_191 1.0
Asi_139 1.0
NAm_155 1.0
Asi_069 1.0
Asi_186 1.0
NAm_153 1.0
NAm_050 1.0
Asi_057 1.0
NAm_186 1.0
Asi_032 1.0
Asi_037 1.0
Asi_072 1.0
NAm_183 1.0
Asi_101 1.0
NAm_059 1.0
NAm_029 1.0
Aus_031 1.0
Aus_009 1.0
Asi_043 1.0
Asi_113 1.0
NAm_098 1.0
NAm_120 1.0
Asi_149 1.0
Asi_204 1.0
Asi_128 1.0
Asi_213 1.0
NAm_092 1.0
Asi_088 1.0
Asi_214 1.0
Asi_211 1.0
Asi_050 1.0
Asi_175 1.0
Asi_187 1.0
Asi_049 1.0
NAm_090 1.0
NAm_201 1.0
Asi_095 1.0
Asi_035 1.0
Asi_041 1.0
Asi_030 1.0
Asi_198 1.0
NAm_195 1.0
Asi_185 1.0
NAm_140 1.0
Asi_034 1.0
Aus_004 1.0
SAm_025 1.0
Asi_066 1.0
Asi_002 1.0
Asi_132 1.0
Asi_075 1.0
Asi_073 1.0
NAm_163 1.0
Asi_006 1.0
Asi_067 1.0
Asi_081 1.0
NAm_089 1.0
Asi_046 1.0
NAm_162 1.0
Asi_092 1.0
Arc_073 1.0
Asi_017 1.0
Asi_022 1.0
Asi_182 1.0
Asi_097 1.0
Asi_117 1.0
Asi_031 1.0
NAm_117 1.0
Asi_167 1.0
Asi_115 1.0
Asi_019 1.0
Asi_094 1.0
Asi_133 1.0
Eur_004 1.0
Asi_001 1.0
Asi_178 1.0
NAm_168 1.0
Asi_099 1.0
NAm_003 1.0
Asi_157 1.0
Asi_070 1.0
NAm_066 1.0
NAm_094 1.0
NAm_093 1.0
Asi_116 1.0
NAm_191 1.0
NAm_138 1.0
NAm_144 1.0
Asi_038 1.0
Asi_171 1.0
Asi_055 1.0
Asi_152 1.0
Asi_042 1.0
NAm_129 1.0
NAm_198 1.0
Asi_224 1.0
Asi_222 1.0
Asi_205 1.0
Asi_181 1.0
Asi_039 1.0
Asi_033 1.0
Asi_125 1.0
Asi_109 1.0
NAm_171 1.0
Asi_144 1.0
Asi_120 1.0
NAm_159 1.0
Asi_107 1.0
Asi_208 1.0
Asi_084 1.0
NAm_112 1.0
Asi_194 1.0
Asi_212 1.0
Asi_227 1.0
Asi_188 1.0
NAm_157 1.0
NAm_044 1.0
NAm_179 1.0
NAm_192 1.0
NAm_124 1.0
NAm_113 1.0
NAm_108 1.0
NAm_084 1.0
NAm_181 1.0
NAm_167 1.0
Arc_068 1.0
NAm_102 1.0
NAm_119 1.0
NAm_116 1.0
NAm_041 1.0
NAm_152 1.0
NAm_174 1.0
Arc_074 1.0
Arc_066 1.0
Arc_065 1.0
NAm_104 1.0
NAm_137 1.0
NAm_026 1.0
NAm_101 1.0
NAm_143 1.0
NAm_134 1.0
Eur_007 1.0
Eur_003 1.0
NAm_122 1.0
NAm_086 1.0
NAm_194 1.0
NAm_204 1.0
NAm_115 1.0
NAm_107 1.0
NAm_175 1.0
NAm_064 1.0
NAm_103 1.0
NAm_123 1.0
NAm_121 1.0
Arc_071 1.0
Arc_063 1.0
Arc_061 1.0
NAm_150 1.0
NAm_202 1.0
NAm_184 1.0
NAm_141 1.0
NAm_172 1.0
NAm_197 1.0
NAm_187 1.0
Arc_077 1.0
NAm_106 1.0
NAm_165 1.0
NAm_164 1.0
NAm_118 1.0
NAm_169 1.0
NAm_095 1.0
NAm_139 1.0
NAm_200 1.0
NAm_130 1.0
NAm_199 1.0
NAm_185 1.0
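
Before moving on, a single record can be visualized as a quick sanity check (a sketch assuming cfr’s ProxyRecord.plot method; any pid from the list above works):

# plot one record; 'Ocn_148' is just an example pid
fig, ax = filt['Ocn_148'].plot()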

(Optional) Check for sparse records that don’t meet the 0.5 threshold#

LMRv2.1 uses the parameter valid_frac to remove records whose valid data cover less than half of their full time span. If the database has been cleaned up, chances are this will not remove anything, but it is good to run just in case, to see which records would get filtered out. Records that do not meet the overlap requirement are also dropped during the calibration step of the data assimilation workflow.

valid_frac = 0.5
filtered_records = []

for pid, record in filt.records.items():
    time = np.array(record.time)
    value = np.array(record.value)

    # drop entries where either time or value is missing
    mask = ~np.isnan(time) & ~np.isnan(value)
    time = time[mask]
    value = value[mask]

    if len(time) < 2:
        continue

    # fraction of years covered between the first and last observation
    years = time.astype(int)
    year_range = years.max() - years.min() + 1
    coverage = len(years) / year_range if year_range > 0 else 0

    # keep the record if coverage >= valid_frac, or if it is manually whitelisted
    if coverage >= valid_frac or pid == 'Ocn_148':
        filtered_records.append(record)
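
As a quick check, we can report how many records survive the coverage screen:

print(f"{len(filtered_records)} of {len(filt.records)} records meet the coverage threshold")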

Final ProxyDatabase Object#

The final ProxyDatabase object contains only PAGES2kv2 records that:

  • Are annual or subannual in resolution (dt ≤ 1.2 years)

  • Have at least 50% coverage over their date range

  • Include complete time and value arrays

  • Are labeled with a standardized ptype format (e.g., tree.d18O, marine.MgCa)

This cleaned and filtered database is now ready for use in the cfr framework for climate field reconstruction.

# remove dropped records from final database

final_pdb = cfr.ProxyDatabase()
final_pdb += filtered_records

Save the ProxyDatabase as a netCDF file#

final_pdb.to_nc('./prev_data/lmr_pickle_pages2k.nc')
>>> Warning: this is an experimental feature.
>>> Warning: this is an experimental feature.

100%|██████████| 544/544 [00:15<00:00, 35.39it/s]
>>> Data before 1 CE is dropped for records: ['Arc_032', 'Arc_033', 'Ant_007', 'Ant_010', 'Ant_028', 'Arc_004', 'Arc_026', 'Arc_020', 'Arc_022', 'NAm_011', 'NAm_019', 'Arc_002', 'Eur_006', 'Eur_003'].
ProxyDatabase saved to: ./prev_data/lmr_pickle_pages2k.nc
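
To verify that the file round-trips cleanly, it can be reloaded (a minimal sketch assuming cfr’s load_nc method and nrec attribute):

# reload the saved database and count its records
pdb_check = cfr.ProxyDatabase().load_nc('./prev_data/lmr_pickle_pages2k.nc')
print(pdb_check.nrec)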

Summary and Output#

We have constructed a cleaned, standardized proxy database containing only PAGES2kv2 records. The database has been filtered by resolution and (optionally) by coverage, and formatted for direct use in cfr paleoclimate data assimilation workflows.

The result has been saved as a NetCDF file:

./prev_data/lmr_pickle_pages2k.nc