Legal Research with AI Part 7: Wrangling Data with Julia
Intro
In a previous post, I separated all of the results returned from the Library of Congress API into individual JSON documents to be imported as nodes into a Neo4j graph.
In this post, I filter the LOC data against another data set from Oyez that will be integrated in the next post.
Filtering Data
Both data sets have been separated into individual case nodes, each stored as a JSON file named after the case citation.
The Library of Congress data contains indices, admonitions, briefs, and other data that I will not yet be incorporating into my data set.
In order to find only the case data, I will create a dataframe containing the paths of the JSON files with matching citations.
Using Julia Instead of Python
I love Python, but I want to try something new, and Julia's multiple dispatch design tempted me. This is my first Julia program, so I will be documenting the work in more detail than usual.
Julia "import" Functions
Coming from Python, I typically import libraries and packages with an import statement, something like import pandas.
In Julia, we use the using keyword to load a package:
using DataFrames
using CSV
A package can also be loaded with import, but this does not bring its functions into the namespace directly (as far as I understand it).
For instance, import CSV would only load the package; I would have to call CSV.method to actually do something, much like calling pandas.read_csv after an import pandas in Python, rather than from pandas import read_csv.
import DataFrames
import CSV
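As a small illustration (my own, not code from the case-law program), the difference looks like this:

import DataFrames
# With import, names must be qualified with the package name
df = DataFrames.DataFrame(file = ["a.json"], path = ["loc/a.json"])

using DataFrames
# With using, exported names like DataFrame are available directly
df = DataFrame(file = ["a.json"], path = ["loc/a.json"])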
The Main Function
Just like in C, and like we should in Python, I declared a main function to run the program and call it with main(). I do not know if there is a convention similar to Python's if __name__ == "__main__"; I will find out soon.
The main difference in function declaration between Python and Julia is the inclusion of the end keyword at the end of the function.
For instance, review the main function below:
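In outline, it looks something like this; the directory names passed to get_files and the call to df_to_file are placeholders rather than the exact code:

function main()
    # Path the master dataframe will be written to
    outpath = joinpath(pwd(), "case_files.csv")

    # Build a dataframe of file names and paths for each data set
    loc_df = get_files(joinpath(pwd(), "loc_json"))    # placeholder directory name
    oyez_df = get_files(joinpath(pwd(), "oyez_json"))  # placeholder directory name

    # Keep only the citations that appear in both data sets
    master_df = innerjoin(loc_df, oyez_df, on = :File, makeunique = true)

    # Drop the .DS_Store entry macOS leaves in each directory
    master_df = filter(row -> row.File != ".DS_Store", master_df)

    # Write the master dataframe to file
    df_to_file(master_df, outpath)
end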
Creating an outpath
The main function creates an outpath for writing the resulting master_df to file by calling joinpath(pwd(), "case_files.csv").
The Get Files Function
Next, the get_files function is called to create two data frames: the loc_df and the oyez_df.
Declaring empty string arrays
The file_name and file_path arrays are declared empty with <array_name> = String[].
Reading Files with readdir()
File names come from a directory path passed to the built-in readdir() function.
Appending Files to file_name Array
Each file name is appended to the file_name array (declared with file_name = String[]) with the push!(file_name, f) call. Note the ! following push. This typically means that the function operates on the data in memory and will not return a new value.
Appending File Paths to file_path Array
I also record the full file path by appending the result of path = joinpath(working_path, f) to the file_path array.
I love the built-in joinpath function. Python's os.sep.join() works well, but I really like Julia's implementation.
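Put together, the collection step looks something like this (the directory name is a placeholder):

working_path = joinpath(pwd(), "loc_json")   # placeholder directory of case JSON files
file_name = String[]
file_path = String[]
for f in readdir(working_path)
    push!(file_name, f)                      # the bare file name
    path = joinpath(working_path, f)
    push!(file_path, path)                   # the full path to the file
end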
Sorting the Arrays with Merge Sort
Arrays are sorted by calling sort_array(<array>), which returns a sorted array using the merge sort algorithm.
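sort_array is a standard recursive merge sort. A sketch of the idea (not necessarily line for line what the program uses; the merge_sorted helper exists only for this sketch):

# Split the array in half, sort each half, then merge the sorted halves
function sort_array(arr::Vector{String})
    length(arr) <= 1 && return arr
    mid = div(length(arr), 2)
    left = sort_array(arr[1:mid])
    right = sort_array(arr[mid+1:end])
    return merge_sorted(left, right)
end

# Merge two sorted arrays into one sorted array
function merge_sorted(left::Vector{String}, right::Vector{String})
    merged = String[]
    i, j = 1, 1
    while i <= length(left) && j <= length(right)
        if left[i] <= right[j]
            push!(merged, left[i]); i += 1
        else
            push!(merged, right[j]); j += 1
        end
    end
    append!(merged, left[i:end])
    append!(merged, right[j:end])
    return merged
end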
Creating a Dataframe with the Arrays
Finally, a dataframe containing the sorted file_name and file_path arrays as the columns File and Path is created and then returned.
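With placeholder values, that last step looks like:

using DataFrames

# Placeholder arrays standing in for the sorted file_name and file_path
sorted_names = ["a.json", "b.json"]
sorted_paths = ["loc/a.json", "loc/b.json"]
df = DataFrame(File = sorted_names, Path = sorted_paths)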
A note on refactoring
This function should be refactored into separate ones, but it works well enough for this workflow that I am going to leave it.
Joining Data Frames by Citation
Julia's DataFrames package can easily join dataframes on a column. In this workflow, the join is on the File column, since each file is titled after its case citation.
# Join on File excluding extraneous data not in the oyez dataset
master_df = innerjoin(loc_df, oyez_df, on = :File, makeunique = true)   # makeunique renames the duplicate Path column
Filtering the DF for Extraneous Files
The master_df is filtered to remove .DS_Store from the list of files to be processed. Below, notice the !: in this case it is part of the != comparison, and the call returns a data frame of rows whose File column is not equal to .DS_Store.
#Select every file but the .DS_Store from the dataframe.
master_df = filter(row -> row.File != ".DS_Store", master_df)
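For contrast with the push! discussion above, the in-place variant carries the ! on the function name itself and modifies master_df directly:

# filter! mutates master_df in place instead of returning a new dataframe
filter!(row -> row.File != ".DS_Store", master_df)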
The df_to_file Function
Finally, the df is written to file.
#Write to file
outpath = df_to_file(master_df, outpath)   # assumed call signature: (dataframe, output path)
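df_to_file is a small helper of my own. A sketch of what it might look like, assuming it simply wraps CSV.write and returns the path it wrote to:

using CSV
using DataFrames

# Hypothetical sketch: write the dataframe to a CSV file and hand back the path
function df_to_file(df::DataFrame, outpath::String)
    CSV.write(outpath, df)
    return outpath
end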
The Complete Program
using DataFrames
using CSV
main