Annual Cost of Living Monte Carlo Models

Jun 1, 2022

Cost of Living Projections

Introduction

I do not like negotiating for salary. Especially, without valid projections to determine a range.

I prepared this report to estimate a salary expectation that will maintain my current standard of living.

I present two Monte Carlo models of Houston and NYC annual living costs. The data is somewhat dated and --particularly in the case of houston-- are high level estimates.

In order to produce a better report, I am currently scraping data from the internet for more accurate sample distributions. I will be able to present that soon.

With that said, the model should not deviate by more than about 5-10 percent from what is presented in below.

Findings

An annual salary of $90,000 would be sufficient to qualify for rent in Houston and most likely the median level income neighbors of NYC.

I came about this number by quantifying a confidence inverval of annual rent costs in boths cities across a normal distribution. I then simply multiplied that number by 3 in order to meet the lease qualifications of most landlords.

Limitations of the Model

Old Nyc Data

The data I am using was sourced from 2018. I will be updating it soon.

Houston Data

The houston estimate is based an estimate to stay in the property I am currently staying in. The rent is 2400 a month. I estimated that it could raise at maximum to about 2600 in the next year. If I were to move similiar housing goes for around 2200 to about 2600 a month. I used these as the bounds of my estimates

Houston Cost of Living Expenses

I intend to stay in Houston for the next year. I would like to move to NY eventually to be nearer to a central office, but not in the near future.

lower_bound = int(2400)
upper_bound = int(2600)

median = 2500
standard_dev = 100  #file:///Users/jnapolitano/Downloads/LNG_Shipping_a_Descriptive_Analysis.pdf

cap_range = range(lower_bound, upper_bound)

rent_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

rent_sample = choice(rent_distribution,12)

Houston Monthly food costs

lower_bound = int(300)
upper_bound = int(500)

median = 400
standard_dev = 50 

food_range = range(lower_bound, upper_bound)

food_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

food_sample = choice(food_distribution, 12)

Houston Insurance Costs

lower_bound = int(200)
upper_bound = int(300)

median = 250
standard_dev = 25

insurance_range = range(lower_bound, upper_bound)

insurance_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

The Houston Cost of Living DF

cost_of_living_df = pd.DataFrame()
cost_of_living_df['rent']= choice(rent_distribution,12)
cost_of_living_df['food'] = choice(food_distribution, 12)
cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
cost_of_living_df

	rent	food	insurance	monthly_cost
0	2472.688851	334.419350	231.162225	3038.270426
1	2399.284893	444.677340	248.645107	3092.607340
2	2684.456976	430.277801	252.578613	3367.313390
3	2478.390464	360.661703	291.989836	3131.042002
4	2513.324309	429.771020	252.866861	3195.962190
5	2501.390892	413.121444	243.717854	3158.230190
6	2554.433859	363.994333	226.672435	3145.100627
7	2530.369935	299.997467	239.663510	3070.030911
8	2635.681318	394.667441	241.502045	3271.850803
9	2596.457738	513.944623	229.362551	3339.764912
10	2455.017883	371.266360	283.637179	3109.921421
11	2427.449703	485.960065	276.488430	3189.898198

Houston Costs Per Annum Algorithm

The algorithm below calculates the annual cost of rent, food, and insurance to determine total cost per year. Rent, food, and insurance are set by random choice based on the distributions defined in the functions above.

I run the simulation 10,000 times which in theory corresponds to 10,000 random samples of annual costs. The point in doing this is to create a random normal distribution to define convidence intervals of my total annual costs.


years = 10000
year_counter = 0
#carbon_total_millions_metric_tons = 300000000
#total_tons_shipped = 0
total_price = 0
cycle_price_samples = np.zeros(shape=years)
cycle_rent_samples = np.zeros(shape=years)
cycle_food_samples = np.zeros(shape=years)
cycle_insurance_samples = np.zeros(shape=years)
annual_cost = 0


for year in range(years):
    # Define a New DataFrame. It should fall out of scope with each iteration 
    cost_of_living_df = pd.DataFrame()
    #random choice of rent 
    cost_of_living_df['rent']= choice(rent_distribution,12)
    #random choice of food
    cost_of_living_df['food'] = choice(food_distribution, 12)
    #random Choice of Insurance
    cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
    #Random Choice of total annual cost
    cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
    # must use apply to account for multiple 0 conditions.  If i simply vectorized the function across the dataframe in a single call i would assign the the same values each day 
    #calculate cost per day for fun...
    # query all that are = o.  Summate the capacities deduct the total 
    annual_cost = cost_of_living_df['monthly_cost'].sum()
    annual_rent = cost_of_living_df.rent.sum()
    annual_food = cost_of_living_df.food.sum()
    annual_insurance = cost_of_living_df.insurance.sum()
    cycle_price_samples[year] = annual_cost
    cycle_food_samples[year] = annual_food
    cycle_insurance_samples[year] = annual_insurance
    cycle_rent_samples[year] = annual_rent
    #print(carbon_total_millions_metric_tons)
    year_counter = year_counter+1

Houston Prediction Df

prediction_df = pd.DataFrame()
prediction_df['rent'] = cycle_rent_samples
prediction_df['food'] = cycle_food_samples
prediction_df['insurance'] = cycle_insurance_samples
prediction_df['total'] = cycle_price_samples

prediction_df.describe()

	rent	food	insurance	total
count	10000.000000	10000.000000	10000.000000	10000.000000
mean	30003.016272	4800.864106	2997.910667	37801.791045
std	344.473477	171.736899	86.991071	394.976839
min	28586.298471	4159.970425	2699.038887	36163.596078
25%	29771.562236	4683.226307	2940.117598	37537.005225
50%	30003.442289	4800.664909	2997.584664	37797.598919
75%	30234.927776	4915.307716	3056.853675	38072.961560
max	31370.239418	5495.020896	3314.016695	39469.935965

Houston Annual Cost Histogram

prediction_df.total.plot.hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.xlabel('Annual Total Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)

png

Houston: Calculating the Confidence Interval For Total Costs

The data is nearly normal. Greater samples sizes would produce a graph of nearly perfect normality


st.norm.interval(alpha=0.90, loc=np.mean(prediction_df.total), scale=st.sem(prediction_df.total))

(37795.2942543157, 37808.287836034055)

Houston Annual Rent Histogram

### Annual Cost Histogram Histogram
prediction_df.rent.plot.hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.title('Annual Rent Cost Distribution ')
plt.xlabel('Annual Rent Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)

png

Houston: Calculating the Confidence Interval For Annual Rent

The data is nearly normal. Greater samples sizes would produce a graph of nearly perfect normality


st.norm.interval(alpha=0.95, loc=np.mean(prediction_df.rent), scale=st.sem(prediction_df.rent))

(29996.264715447538, 30009.767827637417)

New York Cost of Living Expenses

For the sake of comparison, the New York Expense distributions are calculated below. I assume that everything but rent will be equivalent to Houston. A more accurate model would account for insurance, food, and incidental differences.

I am assuming the rent of a two bedroom apartment.

The data i am using was scraped from craigslist in 2018. I will redo it later for 2022 data to get a better model.

nyc_df = pd.read_csv("/Users/jnapolitano/Projects/cost-of-living-projections/nyc-housing.csv", encoding="unicode-escape")

#assuiming a two bedroom
nyc_df = nyc_df[nyc_df['Bedrooms']== '2br']

nyc_df.describe()

	Zipcode	Price
count	2626.000000	2625.000000
mean	10845.203351	2755.018286
std	556.758722	7465.827048
min	10001.000000	16.000000
25%	10065.000000	1950.000000
50%	11210.000000	2330.000000
75%	11231.000000	2922.000000
max	11697.000000	378888.000000

The price is about 2800 with a std of 7,465. Which is absurd. To do a better analysis, I need to clean the data.


idx = (nyc_df.Price > 500) & (nyc_df.Price < 4500)
nyc_df = nyc_df[idx]

nyc_df.describe()

	Zipcode	Price
count	2441.000000	2441.00000
mean	10881.331422	2435.25891
std	541.102216	728.96291
min	10001.000000	600.00000
25%	10302.000000	1950.00000
50%	11211.000000	2300.00000
75%	11233.000000	2750.00000
max	11697.000000	4495.00000

When accounting for outliers the data is far more managable. I'm surprised by the mean price. Again this data is old, but it is also does not accout for neighborhoods. I will redo the analysis at a later data filtered by neighborhoods.

Creating the NYC Distributions

lower_bound = int(600)
upper_bound = int(4500)

median = 2435
standard_dev = 729 

cap_range = range(lower_bound, upper_bound)

rent_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

rent_sample = choice(rent_distribution,12)

NYC Monthly food costs

lower_bound = int(300)
upper_bound = int(500)

median = 400
standard_dev = 50 

food_range = range(lower_bound, upper_bound)

food_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

food_sample = choice(food_distribution, 12)

NYC Insurance Costs

lower_bound = int(200)
upper_bound = int(300)

median = 250
standard_dev = 25

insurance_range = range(lower_bound, upper_bound)

insurance_distribution = np.random.normal(loc=median , scale=standard_dev, size=10000)

NYC Cost of Living Distribution

cost_of_living_df = pd.DataFrame()
cost_of_living_df['rent']= choice(rent_distribution,12)
cost_of_living_df['food'] = choice(food_distribution, 12)
cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
cost_of_living_df

	rent	food	insurance	monthly_cost
0	2440.594149	404.104193	263.802114	3108.500457
1	3509.157666	399.234822	206.641152	4115.033640
2	3351.649621	297.314475	284.177204	3933.141300
3	1977.607960	359.872656	255.831381	2593.311996
4	2169.224724	386.271512	244.469415	2799.965652
5	2661.843885	356.660878	218.425732	3236.930495
6	3595.833071	385.012912	273.882653	4254.728637
7	1765.419028	404.770447	236.665360	2406.854835
8	1708.955308	348.178355	231.690103	2288.823766
9	3227.258413	392.787025	252.315570	3872.361007
10	1941.492537	404.384587	247.628257	2593.505381
11	2081.218740	416.678465	213.204362	2711.101567

NYC Costs Per Annum Algorithm


years = 10000
year_counter = 0
#carbon_total_millions_metric_tons = 300000000
#total_tons_shipped = 0
total_price = 0
cycle_price_samples = np.zeros(shape=years)
cycle_rent_samples = np.zeros(shape=years)
cycle_food_samples = np.zeros(shape=years)
cycle_insurance_samples = np.zeros(shape=years)
annual_cost = 0


for year in range(years):
    # Define a New DataFrame. It should fall out of scope with each iteration 
    cost_of_living_df = pd.DataFrame()
    #random choice of rent 
    cost_of_living_df['rent']= choice(rent_distribution,12)
    #random choice of food
    cost_of_living_df['food'] = choice(food_distribution, 12)
    #random Choice of Insurance
    cost_of_living_df['insurance'] = choice(insurance_distribution, 12)
    #Random Choice of total annual cost
    cost_of_living_df['monthly_cost'] = cost_of_living_df.rent + cost_of_living_df.food + cost_of_living_df.insurance
    # must use apply to account for multiple 0 conditions.  If i simply vectorized the function across the dataframe in a single call i would assign the the same values each day 
    #calculate cost per day for fun...
    # query all that are = o.  Summate the capacities deduct the total 
    annual_cost = cost_of_living_df['monthly_cost'].sum()
    annual_rent = cost_of_living_df.rent.sum()
    annual_food = cost_of_living_df.food.sum()
    annual_insurance = cost_of_living_df.insurance.sum()
    cycle_price_samples[year] = annual_cost
    cycle_food_samples[year] = annual_food
    cycle_insurance_samples[year] = annual_insurance
    cycle_rent_samples[year] = annual_rent
    #print(carbon_total_millions_metric_tons)
    year_counter = year_counter+1

NYC Prediction Df

prediction_df = pd.DataFrame()
prediction_df['rent'] = cycle_rent_samples
prediction_df['food'] = cycle_food_samples
prediction_df['insurance'] = cycle_insurance_samples
prediction_df['total'] = cycle_price_samples

prediction_df.describe()

	rent	food	insurance	total
count	10000.000000	10000.000000	10000.000000	10000.000000
mean	29219.509691	4797.809482	3004.224643	37021.543816
std	2532.300418	172.673041	87.221734	2542.267617
min	18744.517281	4116.639699	2574.323735	26447.949901
25%	27545.387716	4678.877662	2945.270499	35351.052672
50%	29244.878069	4797.251203	3005.337764	37034.425389
75%	30915.545611	4915.266687	3062.210984	38722.269645
max	38516.336096	5429.519670	3327.233629	46383.324453

NYC Annual Cost Histogram

prediction_df.total.plot.hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.xlabel('Annual Total Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)

png

NYC: Calculating the Confidence Interval For Total Costs

The data is nearly normal. Greater samples sizes would produce a graph of nearly perfect normality


st.norm.interval(alpha=0.90, loc=np.mean(prediction_df.total), scale=st.sem(prediction_df.total))

(36979.727235126586, 37063.36039733022)

NYC Annual Rent Histogram

### Annual Cost Histogram Histogram
prediction_df.rent.plot.hist(grid=True, bins=20, rwidth=0.9,
                   color='#607c8e')
plt.title('Annual Rent Cost Distribution ')
plt.xlabel('Annual Rent Costs Price USD')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)

png

Calculating the Confidence Interval For Annual Rent

The data is nearly normal. Greater samples sizes would produce a graph of nearly perfect normality


st.norm.interval(alpha=0.95, loc=np.mean(prediction_df.rent), scale=st.sem(prediction_df.rent))

(29169.877514702926, 29269.14186706609)

NYC Closing Remarks

The rent distribution in NYC with 2018 data is actually nearly comparible to my houston estimate. An annual salary of 90,000 would permit me to live at about the median level in the city. I will be redoing this report soon as the data is old. I am currently scraping data in houston and nyc to produce a better analysis.

Imports

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st
from shapely.geometry import Point
from numpy.random import choice
import warnings

warnings.filterwarnings('ignore')