Sussy Stock Trades

Prabina Pokharel, Atharva Kulkarni

Summary of Findings

Introduction

This dataset has been maintained by Timothy Carambat as part of the House Stock Watcher project. Each row in this dataset corresponds to one stock trasanction that was carried out by a member of the U.S. House of Representatives.

There are several features, such as:

disclosure_date transaction_date
ticker
asset_description
amount
representative

and several more...

For reference, the question we are going to be exploring is:

Are representatives from the state of Michigan more likely to have an average trade proportion that result in capital gains over 200 usd higher than the rest of the population?

So in order to be able to answer this question, let's identify a few important columns:

cap_gains_over_200_usd

and

state

Cleaning and EDA

Here's an overview of our cleaning steps:

Assessment of Missingness

We picked owner has the column to analyze missingness for because it contained the most missing values. We ran permutation testing to analyze whether the missigness of owner is depended on any columns.

We used ks as test stat for permutation test for two numerical columns disclosure_year and amount_cleaned and we used TVD as test stat for one categorical column state.

For all the three columns, our p-values were very close to 0.0, which suggests that the distribution of all three columns when owner is missing and the distribution of all three columns when owner is not missing are most likely different. This means that the missingness of owner depends on those columns individually.

Hypothesis Test

The reps from the state of Michigan have 77% of trades that result in the cap_gains_over_200_usd column having a True value.

Is this just by chance, or are the representatives from some states, like The Great Lakes State, just that damn much better at trading then representatives from other states?

Here's our setup for the hypothesis (permutation) test:

Describe the setup and results of your hypothesis test. Make sure to explicitly state your null and alternative hypotheses, test statistic, the significance level you used, and what conclusions you can draw from the results.

Code

Cleaning and EDA

Data Cleaning

A note for this section: we decided to carry out cleaning and EDA before formulating our question, hoping that an interesting question would dawn upon us while doing EDA, so our cleaning will include more steps than necessary than required for eventual hypothesis testing.

Let's look at the types of the columns to see which ones may need to have their types changed.

disclosure_data and transaction_date can be represented in date time, but they are normal objects right now.

Let's change them to pandas's datetime objects

This next step isn't super important for data analysis, but I'd say it is very important for some readability. If you can't read your data, what will you do??

Let's clean the 'Hon. ' out of every rep's name so it becomes easier to read.

Let's create a state column so we can do groupby state if we need to in the future.

Now let's deal with the ranges in the amount column. The ranges are an issue for 2 reasons:

So let's create a new column, amount_cleaned, that consists of the mean of the range given in amount.

And for values without a range, e.g. '$1,001 -', we'll deal with those after we fill in the ranges with averages.

Now let's deal with the non-ranges, i.e., '$1,001 -' and '$1,000,000 +'

Let's find the average of amount_cleaned column when values are above $1,001 and $1,000,000 seperately, and then replace the respective values.

What if we wanted to examine how long it takes for reps to disclose their trade? Let's create a new column called non_disclosure_period(days).

Our previous work of converting disclosure_date and transaction_date to datetime objects will also come in handy here, since we can simply subtract the 2 columns.

Let's see if there are any weird values in this new column.

There are 13 transactions where non disclosure days are negative. We will exclude those.

This concludes our data cleaning section.

EDA - Univariate Analysis

Note: value_counts() will help us understand the values of a column by showing us how many times each entry occurs.

Let's explore the disclosure_year columm first.

Clearly 2020 was the most popular year in this data set. This doesn't mean that the reps just traded more in 2020 than in 2021 or 2022, they probably just have not disclosed trades from those two years yet.

Now let's look at the ticker column. There are too many individiual tickers (abbreviations used to uniquely identify shares of a particular stock), so we will only look at the top 20.

Clearly, the most popular stock trades are in tech:

But there is even a Fed Fund:

Surprised rich people are investing in fed funds?

Hey, maybe the old people in the House of Representatives really do care about their retirement funds!

The biggest surprise, however, is the most popular ticker, '--', which isn't even a stock. What is '--'? Let's clean that up and convert to np.nan.

Let's make sure the -- characters are gone.

Now let's look at the amount_cleaned column, to see how much of that 💰💸🤑 reps bring in.

Wow! Such an overwhelming majority of stock trades are averaged out to 8,000 usd that the graph gets stretched vertically so much that we cannot even estimate how many transactions are done for any transaction above 53267.76 usd.

EDA - Bivariate Analysis

We want to see whether reps have many counts of capital gains of over $200 or few. Let's find that out by grouping amount_cleaned and cap_gains_over_200_usd.

Let's see if the non_disclosure_period(days) column has any correlation with amount_cleaned.

Does a bigger stock transaction (higher amount_cleaned) lead to a longer non_disclosure_period(days) period? Let's find out.

Before analyzing anything, an important point: this graph seems kinda weird right? Why are all the points jumbled around a few x values?

Well, this is the result of replacing a range with the average of the range.

As most of our data is seems to be less than trade transaction amount of 2 million, let's zoom in on its left to see if we notice anything.

Huh!

We notice that there doesn't seem to be much relationship between transaction amount and the non disclosure period. Our hypothesis was that reps that have invested more in trades delay disclosing it, but here we see the opposite, where, if the transaction was high, they seem to be disclosing it sooner than if the transaction was low.

This makes sense, they probably don't want to get into legal trouble. They are politicians after all, and they will do anything to make sure they get reelected. I'm sure the headlines of

Rep. X. arrested after failing to disclose a trade of $5,000,000.

won't be great for their political career.

Interesting Aggregates / Pivot Tables

Let's find out if there are any reps with a 100% cap_gains_over_200_usd proportion score.

It looks like Reps. Patrick T. McHenry, Mr. TJ John (Tj) Cox, and Tim Burchett are pros, with a 100% cap_gains_over_200_usd proportion score. But, are they really?"

Upon further review, it seems that they each have only traded once so we don't have enough data to conclude that they are really pros at stock trading.

Ok, that was a flop.

Here's a new one to aggregate: Which state has the highest average number of trades that result in capital gains over $200?

Michigan! With a pretty mind blowing result of 77% 🤯

Does the value of 77% seem sussy? Well, we do. (Yes, this is foreshadowing for our Hypothesis Testing)

Note: The first that may come to mind is that Michigan may have had a high cap_gains_over_200_usd value for relatively small trades, in the $8000.5 category, and low numbers in the rest. This would have induced Simpson's paradox. But as shown below, we checked with another pivot table, and it is really the case that Michigan has high cap_gains_over_200_usd values for all amount categories.

Let's look at non_disclosure_period(days) values indexed by the stock transaction amount itself, for each type of exchange.

This table is confusing, so let's look at just the means of each of the columns.

For some reason, or just sheer random chance, exchanges take on average almost 12 days longer to be disclosed than purchases, full sales, and partial sales. Interesting 🤔🤔

This concludes our EDA section.

Assessment of Missingness

Follwing are all the columns that have one or more null values:

Because owner has the most amount of missing values, we will use that column to analyze if the missingness of owner is depended on any columns. In order to find out, we have to run permutation testing. Permutation testing will allow us to analyze whether the missingness of owner is depended on any other columns.

Let's begin permutation testing!

We will use ks_2samp from SciPy library as a test stat for our permutation testing. But ks is used on columns that are only numerical so let's focus on two numerical columns for now.

The p-val for disclosure_year and amount_cleaned is extremely low. This means that the distribution of disclosure_year, for instance, when owner is missing and the distribution of disclosure_year when owner is not missing are likely different, which means that the missingness of owner likely depends on disclosure_year. Same goes for amount_cleaned.

Let's find one more column where the depended-on column for missingness of owner is categorical. The test stat we need to use if the depended-on column is categorical is TVD.

Here, we see that the p-val is 0.0, implying that the distribution of state when owner is missing and the distribution of state when owner is not missing are likely different, so the missingness of owner likely depends on state.

Hypothesis Testing

The question we've decided to explore is the case of why are Michigan reps so damn good at trading stocks?

Here's a little bit of background again to refresh your memory.

In our interesting aggregates / pivot tables section, we wanted to answer: Which state has the highest average number of trades that result in capital gains over $200?

Formalized...

The reps from the state of Michigan have 77% of trades that result in the cap_gains_over_200_usd column having a True value.

Is this just by chance, or are the representatives from some states, like The Great Lakes State, just that damn much better at trading then representatives from other states?

There's an easy solution to this problem - Permutation Testing!

We're choosing permutation testing over hypothesis testing because we are given 2 observed samples (versus the only one that hypothesis testing works on). One group is all trades from reps from Michigan, and the other group are all trades that are not from reps from Michigan. We need to see if they are they fundamentally different, or could they have been generated by the same process?

Let's begin the permutation testing.

First, let's find the observed test statistic: difference in means.

So the reps from Michigan have an average value of cap_gains_over_200_usd substantially higher than non-Michigan reps, by about 72%.

Now let's run the permutation test 500 times.

Now let's calculate the p-value!

The p-value represents what proportion of our observed_diff_means list had more extreme values than our observed test-statistic, diff_means.

0!

So this means that our result is highly significant, and we can reject the null hypothesis since our p-value (0.0) is lower than our significance level (0.01).

I.E., Reps from Michigan did indeed have an average trade proportion that resulted in capital gains over 200 usd higher than the rest of the population, and this was not due by chance.

If it was only due by chance, our p-value would not be so low, and we would not be rejecting the null hypothesis.

A few implications from our results: