My Data Science coding learning journey in Python

Francisco Ebeling
8 min read · Jan 12, 2022


I started my data journey in Berlin’s Ironhack Data Analytics bootcamp back in March 2021, practically without any coding skills. Needless to say, I was scared when I saw more complex things like functions, for loops and list comprehensions. When someone from the bootcamp told me that so-called “coding katas” were part of possible technical challenges or interviews, I was almost scared to death.

Because of this, I decided to create a side project of my own making to practice my coding skills, in parallel to my work on Ironhack’s projects. Back then I was interested in the calculation of so-called Rasmussen-Hirschman linkage indexes, an economic topic that I explored in my thesis. While calculating these indexes with Excel for one country and one year is fairly straightforward, doing so for the multiple years (2000–2014) and multiple countries (44 in total) available in the World Input-Output Database (WIOD) would be a lot of tedious manual work. By contrast, managing to calculate them with Python would surely save me a lot of time.

If you are not interested in the theoretical side of Rasmussen-Hirschman linkages, you can safely skip the next section; after it, I return to the description of my coding journey.

Rasmussen-Hirschman Linkages: a theoretical overview

Rasmussen-Hirschman linkage indexes are based on Wassily Leontief’s pioneering input-output model (Leontief, 1951). In his model, the economy’s total production (X) is the sum of the production devoted to the intermediate consumption of the economy’s different sectors (Z, i.e. AX) and final demand (Y). Here is what an input-output table looks like:

As can be seen in the equation below, A is the technical coefficients matrix and Y the vector that represents final demand.
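The equation appeared as an image in the original; a standard rendering of Leontief’s accounting identity, consistent with the definitions above, is:

```latex
X = Z + Y = AX + Y
```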

X can also be calculated as follows, where I is an identity matrix and (I − A)⁻¹ Leontief’s inverse matrix:
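As a quick sanity check of this formula, here is a minimal sketch for a hypothetical two-sector economy; the coefficient values are illustrative, not taken from WIOD:

```python
import numpy as np

# Hypothetical 2-sector technical coefficients matrix A (illustrative values)
A = np.array([[0.2, 0.3],
              [0.4, 0.1]])

# Final demand vector Y
Y = np.array([100.0, 50.0])

# Leontief inverse (I - A)^-1
L = np.linalg.inv(np.eye(2) - A)

# Total production X = (I - A)^-1 Y:
# each sector's output covers final demand plus intermediate use
X = L @ Y
print(X)
```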

After Rasmussen’s (1956) and Hirschman’s (1958) seminal works, the Rasmussen-Hirschman backward linkage (BL) index has been defined as a measure of how much sector j demands from all other sectors. In turn, forward linkages (FL) measure how much sector j’s production is demanded by the others. To estimate the Rasmussen-Hirschman indexes, it is necessary to calculate the matrix of technical coefficients A and then take its average. The indexes are calculated in relation to their total average, as can be seen in the equation below:

where Uj is the sector’s backward linkage index, Xj the sum of a given line, n the number of sectors, X* the total sum of the coefficients, Ui the sector’s forward linkage index, and Xi the sum of a given column. If an index is higher than one, the sector’s contribution to the economy is above average.
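The index equation was also an image in the original; a plausible reconstruction in the notation just defined (each sector’s average coefficient normalized by the economy-wide average) is:

```latex
U_j = \frac{X_j / n}{X^{*} / n^{2}}, \qquad
U_i = \frac{X_i / n}{X^{*} / n^{2}}
```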

Although the calculation of Rasmussen-Hirschman linkages with Python turned out to be doable and yielded satisfactory results, at a time when my coding skills were still in their infancy my final code was messy and gigantic. The main reason for this was that at that point I was still not comfortable working with more complex constructs such as functions and for loops.

Here is how my code looked back then, and it computed said linkages for only one country! Notice that each of the larger blocks of code did the calculation for a single year, and thus had to be repeated 15 times. After that, I still manually concatenated the resulting csv files in Excel by dragging and dropping. Crazy, right?

import pandas as pd
import numpy as np
from sklearn import preprocessing
from numpy.linalg import inv
data = pd.read_csv('usa.csv', low_memory=False)
data['Year'] = pd.to_datetime(data['Year'], format='%Y')
# 2000
data1 = pd.DataFrame(data, columns = ['Year','Code', 'Origin','A01', 'A02', 'A03', 'B', 'C10-C12', 'C13-C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26', 'C27', 'C28', 'C29', 'C30', 'C31_C32', 'C33', 'D35', 'E36', 'E37-E39', 'F', 'G45', 'G46', 'G47', 'H49', 'H50', 'H51', 'H52', 'H53', 'I', 'J58', 'J59_J60', 'J61', 'J62_J63', 'K64', 'K65', 'K66', 'L68', 'M69_M70', 'M71', 'M72', 'M73', 'M74_M75', 'N', 'O84', 'P85', 'Q', 'R_S', 'T', 'U'
])
data2 = data1.loc[(data1['Year'] == '2000-01-01') & (data1['Origin'] == 'Domestic')]
data3 = data2.drop(['Year', 'Code', 'Origin'], axis=1)
data3 = data3.apply(pd.to_numeric, errors='coerce')
flow_tbl = data3.to_numpy()
mx_A = np.linalg.inv(np.identity(56) - flow_tbl)
novo = pd.DataFrame(data=mx_A)
a = novo.values.sum()
total_avg = a/(56**2)
soma_colunas = np.sum(mx_A, axis=1).tolist()
para_tras = pd.DataFrame(data=soma_colunas, columns=["values"])
para_tras.insert(1, "codes", ['A01', 'A02', 'A03', 'B', 'C10-C12', 'C13-C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26', 'C27', 'C28', 'C29', 'C30', 'C31_C32', 'C33', 'D35', 'E36', 'E37-E39', 'F', 'G45', 'G46', 'G47', 'H49', 'H50', 'H51', 'H52', 'H53', 'I', 'J58', 'J59_J60', 'J61', 'J62_J63', 'K64', 'K65', 'K66', 'L68', 'M69_M70', 'M71', 'M72', 'M73', 'M74_M75', 'N', 'O84', 'P85', 'Q', 'R_S', 'T', 'U'])
para_tras = para_tras.set_index('codes')
data10 = pd.DataFrame(data, columns = ['Year','Code', 'Origin', 'GO'])
data11 = data10.loc[(data10['Year'] == '2000-01-01') & (data10['Origin'] == 'Domestic')]
data11 = data11.drop(['Year', 'Origin', 'Code'], axis=1)
data11 = data11.apply(pd.to_numeric, errors="coerce")
sumcolgo = data11['GO'].sum()
data12 = 1 + (data11/sumcolgo)
data12.insert(1, "codes", ['A01', 'A02', 'A03', 'B', 'C10-C12', 'C13-C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26', 'C27', 'C28', 'C29', 'C30', 'C31_C32', 'C33', 'D35', 'E36', 'E37-E39', 'F', 'G45', 'G46', 'G47', 'H49', 'H50', 'H51', 'H52', 'H53', 'I', 'J58', 'J59_J60', 'J61', 'J62_J63', 'K64', 'K65', 'K66', 'L68', 'M69_M70', 'M71', 'M72', 'M73', 'M74_M75', 'N', 'O84', 'P85', 'Q', 'R_S', 'T', 'U'])
data12 = data12.set_index('codes')
para_trasb = para_tras*data12.values
para_trasb['2000']=(para_trasb['values']/56)/total_avg
para_trasb = para_trasb.drop(['values'], axis=1)
soma_linhas = np.sum(mx_A,axis=0).tolist()
para_frente = pd.DataFrame(data=soma_linhas, columns=["values"])
para_frente.insert(1, "codes", ['A01', 'A02', 'A03', 'B', 'C10-C12', 'C13-C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26', 'C27', 'C28', 'C29', 'C30', 'C31_C32', 'C33', 'D35', 'E36', 'E37-E39', 'F', 'G45', 'G46', 'G47', 'H49', 'H50', 'H51', 'H52', 'H53', 'I', 'J58', 'J59_J60', 'J61', 'J62_J63', 'K64', 'K65', 'K66', 'L68', 'M69_M70', 'M71', 'M72', 'M73', 'M74_M75', 'N', 'O84', 'P85', 'Q', 'R_S', 'T', 'U'])
para_frente = para_frente.set_index('codes')
para_frenteb = para_frente*data12.values
para_frenteb['2000']=(para_frenteb['values']/56)/total_avg
para_frenteb = para_frenteb.drop(['values'], axis=1)
# Join tables (backward linkages) and export file to csv
backward_linkages = pd.merge(para_trasb, para_trasb_1, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_2, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_3, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_4, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_5, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_6, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_7, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_8, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_9, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_10, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_11, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_12, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_13, on="codes")
backward_linkages = pd.merge(backward_linkages, para_trasb_14, on="codes")
backward_linkages.to_csv('usa_backward_w.csv', index = True)
# Join tables (forward linkages) and export file to csv
forward_linkages = pd.merge(para_frenteb, para_frenteb_1, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_2, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_3, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_4, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_5, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_6, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_7, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_8, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_9, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_10, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_11, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_12, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_13, on="codes")
forward_linkages = pd.merge(forward_linkages, para_frenteb_14, on="codes")
forward_linkages.to_csv('usa_forward_w.csv', index = True)
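Looking back, the fifteen near-identical merge blocks above are exactly the kind of repetition a for loop (or a fold) removes. A minimal sketch of the idea, with a hypothetical merge_years helper and illustrative frames rather than the variables above:

```python
import pandas as pd
from functools import reduce

# Hypothetical: yearly_frames is a list of per-year DataFrames that
# share a "codes" column (one frame per year, as in the blocks above).
def merge_years(yearly_frames):
    # Fold the list into a single wide table, one merge per year
    return reduce(lambda left, right: pd.merge(left, right, on="codes"),
                  yearly_frames)
```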

Things started to change for me when I decided to enrol in Spiced Academy’s Data Science bootcamp back in September 2021. As they say in Portugal, Brazil and Italy, I don’t want to spit on the plate I ate from: Ironhack did teach me a lot of valuable things and was an overall nice experience. However, I couldn’t have dramatically improved my coding skills without some of the things Spiced taught me, the motivation they gave me, and some code snippets worth their weight in gold, which they generously shared with us.

Code snippets such as the ones below hugely facilitate the task of extracting data from several data files and transforming it into a ready-to-use DataFrame.

import glob

path = r'data'
all_files = glob.glob(path + "/*.csv")
for filename in all_files:
    data = pd.read_csv(filename)

filenames = glob.glob(path + "/*.csv.csv")
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
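The same pattern can also be wrapped in a small helper; this is a sketch assuming a hypothetical load_all_csvs function name and a directory of uniformly structured CSV files:

```python
import glob
import pandas as pd

def load_all_csvs(path="data"):
    # Read every CSV under `path` into a list of DataFrames,
    # then stack them into a single DataFrame.
    frames = [pd.read_csv(f) for f in glob.glob(path + "/*.csv")]
    return pd.concat(frames, ignore_index=True)
```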

At the end of the day, my code to calculate said linkages for all 44 countries in the WIOD database, without additional processing steps in Excel, was substantially improved and optimized, and looks like this.

Much better, right?

Importing libraries
import pandas as pd
import numpy as np
from sklearn import preprocessing
from numpy.linalg import inv
import glob
For loops that generate calculations for all years and countries
#Specify data path and filenames
path = r'data'
all_files = glob.glob(path + "/*.csv")

#Sector codes of the 56 WIOD sectors
codes = ['A01', 'A02', 'A03', 'B', 'C10-C12', 'C13-C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26', 'C27', 'C28', 'C29', 'C30', 'C31_C32', 'C33', 'D35', 'E36', 'E37-E39', 'F', 'G45', 'G46', 'G47', 'H49', 'H50', 'H51', 'H52', 'H53', 'I', 'J58', 'J59_J60', 'J61', 'J62_J63', 'K64', 'K65', 'K66', 'L68', 'M69_M70', 'M71', 'M72', 'M73', 'M74_M75', 'N', 'O84', 'P85', 'Q', 'R_S', 'T', 'U']

#Outer for loop that reads in each country's file
for filename in all_files:
    data = pd.read_csv(filename)

    #Specify target dataframe
    first_year = ['1999']
    forward_linkages = pd.DataFrame(index=codes, columns=first_year)
    forward_linkages.index.name = 'codes'

    #Inner for loop that performs the actual calculation, year by year
    year_list = list(range(2000, 2015))
    for j in year_list:

        #Data preparation and cleaning
        data1 = pd.DataFrame(data, columns=['Year', 'Code', 'Origin'] + codes)
        data2 = data1.loc[(data1['Year'] == j) & (data1['Origin'] == 'Domestic')]
        data3 = data2.drop(['Year', 'Code', 'Origin'], axis=1)
        data3 = data3.apply(pd.to_numeric, errors='coerce')

        #Matrix calculation steps
        flow_tbl = data3.to_numpy()
        mx_A = np.linalg.inv(np.identity(56) - flow_tbl)
        total_avg = mx_A.sum() / (56**2)
        sum_lines = np.sum(mx_A, axis=0).tolist()
        forward = pd.DataFrame(data=sum_lines, columns=["values"], index=codes)
        forward.index.name = 'codes'
        forward[j] = (forward['values'] / 56) / total_avg
        forward = forward.drop(['values'], axis=1)

        #Merge this year's column into the target dataframe
        forward_linkages = pd.merge(forward_linkages, forward, on="codes")

    #Final steps: label the country and export one csv per country
    forward_linkages['Country'] = filename
    forward_linkages.Country = forward_linkages.Country.str[5:]
    forward_linkages.Country = forward_linkages.Country.str[:-4]
    forward_linkages.to_csv(filename + '.csv')
Creating dataframe for forward linkages
# get data file names
path = r'data'
filenames = glob.glob(path + "/*.csv.csv")
# Create dfs list and append the read-in files
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
df_final = pd.concat(dfs, ignore_index=True)
df_final.drop(columns=['1999'], inplace=True)
df_final = df_final[['codes', 'Country', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014']]
df_final

My coding journey with Python appears to confirm something I have heard a few times already, namely that everyone can code. While this is of course an exaggeration, there is a grain of truth in the saying. Learning how to code is mostly not about what you have learned and achieved before, but about the amount of effort you are willing to invest in the process.

Here is the link to the Rasmussen-Hirschman Index Calculation project repo.
