Mastering market dynamics: Transforming transaction cost analytics with ultra-precise Tick History – PCAP and Amazon Athena for Apache Spark

Transaction cost analysis (TCA) is widely used by traders, portfolio managers, and brokers for pre-trade and post-trade analysis, and helps them measure and optimize transaction costs and the effectiveness of their trading strategies. In this post, we analyze options bid-ask spreads from the LSEG Tick History – PCAP dataset using Amazon Athena for Apache Spark. We show you how to access data, define custom functions to apply on data, query and filter the dataset, and visualize the results of the analysis, all without having to worry about setting up infrastructure or configuring Spark, even for large datasets.

Background

The Options Price Reporting Authority (OPRA) serves as a crucial securities information processor, collecting, consolidating, and disseminating last sale reports, quotes, and pertinent information for US options. With 18 active US options exchanges and over 1.5 million eligible contracts, OPRA plays a pivotal role in providing comprehensive market data.

On February 5, 2024, the Securities Industry Automation Corporation (SIAC) is set to upgrade the OPRA feed from 48 to 96 multicast channels. This enhancement aims to optimize symbol distribution and line capacity utilization in response to escalating trading activity and volatility in the US options market. SIAC has recommended that firms prepare for peak data rates of up to 37.3 Gbits per second.

Although the upgrade doesn't immediately change the total volume of published data, it enables OPRA to disseminate data at a significantly faster rate. This transition is crucial for addressing the demands of the dynamic options market.

OPRA stands out as one of the most voluminous feeds, with a peak of 150.4 billion messages in a single day in Q3 2023 and a capacity headroom requirement of 400 billion messages over a single day. Capturing every single message is critical for transaction cost analytics, market liquidity monitoring, trading strategy evaluation, and market research.

About the data

LSEG Tick History – PCAP is a cloud-based repository, exceeding 30 PB, housing ultra-high-quality global market data. This data is meticulously captured directly in the exchange data centers, using redundant capture processes strategically positioned in major primary and backup exchange data centers worldwide. LSEG's capture technology ensures lossless data capture and uses a GPS time source for nanosecond timestamp precision. Additionally, sophisticated data arbitrage techniques are employed to seamlessly fill any data gaps. Subsequent to capture, the data undergoes meticulous processing and arbitration, and is then normalized into Parquet format using LSEG's Real Time Ultra Direct (RTUD) feed handlers.

The normalization process, which is integral to preparing the data for analysis, generates up to 6 TB of compressed Parquet files per day. The massive volume of data is attributed to the encompassing nature of OPRA, spanning multiple exchanges and featuring numerous options contracts characterized by diverse attributes. Increased market volatility and market making activity on the options exchanges further contribute to the volume of data published on OPRA.

The attributes of Tick History – PCAP enable firms to conduct various analyses, including the following:

  • Pre-trade analysis – Evaluate potential trade impact and explore different execution strategies based on historical data
  • Post-trade evaluation – Measure actual execution costs against benchmarks to assess the performance of execution strategies
  • Optimized execution – Fine-tune execution strategies based on historical market patterns to minimize market impact and reduce overall trading costs
  • Risk management – Identify slippage patterns, detect outliers, and proactively manage risks associated with trading activities
  • Performance attribution – Separate the impact of trading decisions from investment decisions when analyzing portfolio performance

The LSEG Tick History – PCAP dataset is available in AWS Data Exchange and can be accessed through AWS Marketplace. With AWS Data Exchange for Amazon S3, you can access PCAP data directly from LSEG's Amazon Simple Storage Service (Amazon S3) buckets, eliminating the need for firms to store their own copy of the data. This approach streamlines data management and storage, providing clients immediate access to high-quality PCAP or normalized data with ease of use, integration, and substantial data storage savings.
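
After you subscribe, you can verify your in-place access to the shared objects before starting any analytics. The following is a minimal sketch using boto3; the bucket name and prefix are hypothetical placeholders for the values supplied with your AWS Data Exchange entitlement:

import boto3

# List a few normalized OPRA Parquet objects directly in the provider's bucket.
# Bucket and prefix are hypothetical placeholders; use the values supplied
# with your AWS Data Exchange entitlement.
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="<entitled-lseg-bucket>",
    Prefix="mt=trade/f=opra/dt=2023-05-17/",
    MaxKeys=5,
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])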

Athena for Apache Spark

For analytical endeavors, Athena for Apache Spark offers a simplified notebook experience accessible through the Athena console or Athena APIs, allowing you to build interactive Apache Spark applications. With an optimized Spark runtime, Athena supports the analysis of petabytes of data by dynamically scaling the number of Spark engines in less than a second. Moreover, common Python libraries such as pandas and NumPy are seamlessly integrated, allowing for the creation of intricate application logic. The flexibility extends to the importation of custom libraries for use in notebooks. Athena for Spark accommodates most open-data formats and is seamlessly integrated with the AWS Glue Data Catalog.
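
Although this post works in the notebook editor, you can drive the same engine programmatically through the Athena APIs. The following is a minimal sketch using boto3, assuming a Spark-enabled workgroup such as the one created in the next section; in practice, you would poll get_session_status until the session reaches the IDLE state before submitting code:

import boto3

athena = boto3.client("athena")

# Start a Spark session in an existing Spark-enabled workgroup
# ("tca-analysis" is a placeholder for your workgroup name)
session = athena.start_session(
    WorkGroup="tca-analysis",
    EngineConfiguration={"MaxConcurrentDpus": 20},
)

# Submit a code block to the session, just as if it were a notebook cell
calc = athena.start_calculation_execution(
    SessionId=session["SessionId"],
    CodeBlock="print(spark.version)",
)
print(calc["CalculationExecutionId"], calc["State"])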

Dataset

For this analysis, we used the LSEG Tick History – PCAP OPRA dataset from May 17, 2023. This dataset comprises the following components:

  • Best bid and offer (BBO) – Reports the highest bid and lowest ask for a security at a given exchange
  • National best bid and offer (NBBO) – Reports the highest bid and lowest ask for a security across all exchanges
  • Trades – Records completed trades across all exchanges

The dataset includes the following data volumes (a quick way to confirm this layout follows the list):

  • Trades – 160 MB distributed across approximately 60 compressed Parquet files
  • BBO – 2.4 TB distributed across approximately 300 compressed Parquet files
  • NBBO – 2.8 TB distributed across approximately 200 compressed Parquet files
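
As a quick sanity check, you can inspect a feed's schema and row count from the notebook before running the analysis. This is a minimal sketch; <bucket> is the placeholder used throughout this post, and the columns named in the comment are the ones queried later:

# Inspect one feed's schema and size before analyzing it
trades = spark.read.parquet("s3://<bucket>/mt=trade/f=opra/dt=2023-05-17/")
trades.printSchema()  # expect columns such as Product, Price, Quantity, ReceiptTimestamp, MarketParticipant
print(trades.count(), "trade records")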

Analysis overview

Analyzing OPRA Tick History data for transaction cost analysis (TCA) involves scrutinizing market quotes and trades around a specific trade event. We use the following metrics as part of this study:

  • Quoted spread (QS) – Calculated as the difference between the BBO ask and the BBO bid
  • Effective spread (ES) – Calculated as the difference between the trade price and the midpoint of the BBO (BBO bid + (BBO ask – BBO bid)/2)
  • Effective/quoted spread (EQF) – Calculated as (ES / QS) * 100

We calculate these spreads before the trade and additionally at four intervals after the trade (just after, 1 second, 10 seconds, and 60 seconds after the trade). A short worked example follows.
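
To make the definitions concrete, here is a small worked example with hypothetical quote values: suppose the BBO bid is 1.10 and the BBO ask is 1.20 when a trade executes at 1.16:

# Hypothetical values illustrating the three spread metrics
bid, ask, trade_price = 1.10, 1.20, 1.16
qs = ask - bid                    # quoted spread: 0.10
midpoint = bid + (ask - bid) / 2  # BBO midpoint: 1.15
es = trade_price - midpoint       # effective spread: ~0.01
eqf = es / qs * 100               # effective/quoted spread: ~10%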


Configure Athena for Apache Spark

To configure Athena for Apache Spark, complete the following steps:

  • On the Athena console, under Get started, choose Analyze your data using PySpark and Spark SQL.
  • If this is your first time using Athena Spark, choose Create workgroup.
  • For Workgroup name, enter a name for the workgroup, such as tca-analysis.
  • In the Analytics engine section, choose Apache Spark.

  • In the Additional configurations section, you can select Use defaults or provide a custom AWS Identity and Access Management (IAM) role and Amazon S3 location for calculation results.
  • Choose Create workgroup.

  • After you create the workgroup, navigate to the Notebooks tab and choose Create notebook.
  • Enter a name for your notebook, such as tca-analysis-with-tick-history.
  • Choose Create to create your notebook.

Launch your notebook

If you have already created a Spark workgroup, choose Launch notebook editor under Get started.

After your notebook is created, you will be redirected to the interactive notebook editor.

Now we can add and run the following code in our notebook.

Create an analysis

Complete the following steps to create an analysis:

  • Import common libraries:

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
  • Create our data frames for BBO, NBBO, and trades:
bbo_quote = spark.read.parquet(f"s3://<bucket>/mt=bbo_quote/f=opra/dt=2023-05-17/*")
bbo_quote.createOrReplaceTempView("bbo_quote")
nbbo_quote = spark.read.parquet(f"s3://<bucket>/mt=nbbo_quote/f=opra/dt=2023-05-17/*")
nbbo_quote.createOrReplaceTempView("nbbo_quote")
trades = spark.read.parquet(f"s3://<bucket>/mt=trade/f=opra/dt=2023-05-17/29_1.parquet")
trades.createOrReplaceTempView("trades")
  • Now we can identify a trade to use for transaction cost analysis:
filtered_trades = spark.sql("select Product, Price, Quantity, ReceiptTimestamp, MarketParticipant from trades")
filtered_trades.show(5, truncate=False)

We get the following output:

+-------------------+---------------------+----------------------+-------------------+-------------------+
|Product            |Price                |Quantity              |ReceiptTimestamp   |MarketParticipant  |
+-------------------+---------------------+----------------------+-------------------+-------------------+
|QQQ 230518C00329000|1.1700000000000000000|10.0000000000000000000|1684338565538021907|NYSEArca           |
|QQQ 230518C00329000|1.1700000000000000000|20.0000000000000000000|1684338576071397557|NASDAQOMXPHLX      |
|QQQ 230518C00329000|1.1600000000000000000|1.0000000000000000000 |1684338579104713924|ISE                |
|QQQ 230518C00329000|1.1400000000000000000|1.0000000000000000000 |1684338580263307057|NASDAQOMXBX_Options|
|QQQ 230518C00329000|1.1200000000000000000|1.0000000000000000000 |1684338581025332599|ISE                |
+-------------------+---------------------+----------------------+-------------------+-------------------+

We use the information from the third trade in this output (price 1.16, executed on ISE) going forward for the trade product (tp), trade price (tpr), and trade time (tt).

  • Next, we create several helper functions for our analysis:
def calculate_es_qs_eqf(df, trade_price):
    df['BidPrice'] = df['BidPrice'].astype('double')
    df['AskPrice'] = df['AskPrice'].astype('double')
    # Effective spread: trade price minus the BBO midpoint (bid + (ask - bid)/2)
    df["ES"] = trade_price - (df["BidPrice"] + (df["AskPrice"] - df["BidPrice"]) / 2)
    # Quoted spread: ask minus bid
    df["QS"] = df["AskPrice"] - df["BidPrice"]
    # Effective/quoted spread, as a percentage
    df["EQF"] = (df["ES"] / df["QS"]) * 100
    return df

def get_trade_before_n_seconds(trade_time, df, seconds=0, groupby_col=None):
    # Cutoff timestamp: trade time plus the offset, in nanoseconds
    nseconds = seconds * 1000000000 + trade_time
    # Last quote from each market participant before the cutoff
    ret_df = df[df['ReceiptTimestamp'] < nseconds].groupby(groupby_col).last()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    ret_df = ret_df.reset_index()
    return ret_df

def get_trade_after_n_seconds(trade_time, df, seconds=0, groupby_col=None):
    nseconds = seconds * 1000000000 + trade_time
    # First quote from each market participant after the cutoff
    ret_df = df[df['ReceiptTimestamp'] > nseconds].groupby(groupby_col).first()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    ret_df = ret_df.reset_index()
    return ret_df

def get_nbbo_trade_before_n_seconds(trade_time, df, seconds=0):
    nseconds = seconds * 1000000000 + trade_time
    # Last NBBO quote before the cutoff (.copy() avoids SettingWithCopyWarning)
    ret_df = df[df['ReceiptTimestamp'] < nseconds].iloc[-1:].copy()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    return ret_df

def get_nbbo_trade_after_n_seconds(trade_time, df, seconds=0):
    nseconds = seconds * 1000000000 + trade_time
    # First NBBO quote after the cutoff
    ret_df = df[df['ReceiptTimestamp'] > nseconds].iloc[:1].copy()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    return ret_df
  • In the following function, we create the dataset that contains all the quotes before and after the trade. Athena Spark automatically determines how many DPUs to launch for processing our dataset.
def get_tca_analysis_via_df_single_query(trade_product, trade_price, trade_time):
    # BBO quotes: pull all quotes for the traded product into pandas
    bbos = spark.sql(f"SELECT Product, ReceiptTimestamp, AskPrice, BidPrice, MarketParticipant FROM bbo_quote WHERE Product = '{trade_product}'")
    bbos = bbos.toPandas()

    bbo_just_before = get_trade_before_n_seconds(trade_time, bbos, seconds=0, groupby_col='MarketParticipant')
    bbo_just_after = get_trade_after_n_seconds(trade_time, bbos, seconds=0, groupby_col='MarketParticipant')
    bbo_1s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=1, groupby_col='MarketParticipant')
    bbo_10s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=10, groupby_col='MarketParticipant')
    bbo_60s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=60, groupby_col='MarketParticipant')
    
    all_bbos = pd.concat([bbo_just_before, bbo_just_after, bbo_1s_after, bbo_10s_after, bbo_60s_after], ignore_index=True, sort=False)
    bbos_calculated = calculate_es_qs_eqf(all_bbos, trade_price)

    # NBBO quotes: pull the consolidated best bid/offer for the traded product into pandas
    nbbos = spark.sql(f"SELECT Product, ReceiptTimestamp, AskPrice, BidPrice, BestBidParticipant, BestAskParticipant FROM nbbo_quote WHERE Product = '{trade_product}'")
    nbbos = nbbos.toPandas()

    nbbo_just_before = get_nbbo_trade_before_n_seconds(trade_time,nbbos, seconds=0)
    nbbo_just_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=0)
    nbbo_1s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=1)
    nbbo_10s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=10)
    nbbo_60s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=60)

    all_nbbos = pd.concat([nbbo_just_before, nbbo_just_after, nbbo_1s_after, nbbo_10s_after, nbbo_60s_after], ignore_index=True, sort=False)
    nbbos_calculated = calculate_es_qs_eqf(all_nbbos, trade_price)

    calc = pd.concat([bbos_calculated, nbbos_calculated], ignore_index=True, sort=False)
    
    return calc
  • Now let’s call the TCA analysis function with the information from our selected trade:
tp = "QQQ 230518C00329000"
tpr = 1.16
tt = 1684338579104713924
c = get_tca_analysis_via_df_single_query(tp, tpr, tt)

Visualize the analysis results

Now let’s create the data frames we use for our visualization. Each data frame contains quotes for one of the five time intervals for each data feed (BBO, NBBO):

# BBO rows carry a MarketParticipant value; NBBO rows (which have
# BestBidParticipant/BestAskParticipant instead) do not, so split on that column
bbo = c[c['MarketParticipant'].notna()]
bbo_bef = bbo[bbo['ReceiptTimestamp'] < tt]
bbo_aft_0 = bbo[bbo['ReceiptTimestamp'].between(tt,tt+1000000000)]
bbo_aft_1 = bbo[bbo['ReceiptTimestamp'].between(tt+1000000000,tt+10000000000)]
bbo_aft_10 = bbo[bbo['ReceiptTimestamp'].between(tt+10000000000,tt+60000000000)]
bbo_aft_60 = bbo[bbo['ReceiptTimestamp'] > (tt+60000000000)]

nbbo = c[c['MarketParticipant'].isna()]
nbbo_bef = nbbo[nbbo['ReceiptTimestamp'] < tt]
nbbo_aft_0 = nbbo[nbbo['ReceiptTimestamp'].between(tt,tt+1000000000)]
nbbo_aft_1 = nbbo[nbbo['ReceiptTimestamp'].between(tt+1000000000,tt+10000000000)]
nbbo_aft_10 = nbbo[nbbo['ReceiptTimestamp'].between(tt+10000000000,tt+60000000000)]
nbbo_aft_60 = nbbo[nbbo['ReceiptTimestamp'] > (tt+60000000000)]

In the following sections, we provide example code to create different visualizations.

Plot QS and NBBO before the trade

Use the following code to plot the quoted spread and NBBO before the trade:

fig = px.bar(title="Quoted Spread Before The Trade",
    x=bbo_bef.MarketParticipant,
    y=bbo_bef['QS'],
    labels={'x': 'Market', 'y':'Quoted Spread'})
fig.add_hline(y=nbbo_bef.iloc[0]['QS'],
    line_width=1, line_dash="dash", line_color="red",
    annotation_text="NBBO", annotation_font_color="red")
%plotly fig

Plot QS for each market and NBBO after the trade

Use the following code to plot the quoted spread for each market and the NBBO immediately after the trade:

fig = px.bar(title="Quoted Spread After The Trade",
    x=bbo_aft_0.MarketParticipant,
    y=bbo_aft_0['QS'],
    labels={'x': 'Market', 'y':'Quoted Spread'})
fig.add_hline(y=nbbo_aft_0.iloc[0]['QS'],
    line_width=1, line_dash="dash", line_color="red",
    annotation_text="NBBO", annotation_font_color="red")
%plotly fig

Plot QS for each time interval and each market for BBO

Use the following code to plot the quoted spread for each time interval and each market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['QS']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['QS']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['QS']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['QS']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['QS'])])
fig.update_layout(barmode='group',title="BBO Quoted Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Quoted Spread'})
%plotly fig

Plot ES for each time interval and market for BBO

Use the following code to plot the effective spread for each time interval and market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['ES']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['ES']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['ES']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['ES']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['ES'])])
fig.update_layout(barmode='group', title="BBO Effective Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Effective Spread'})
%plotly fig

Plot EQF for each time interval and market for BBO

Use the following code to plot the effective/quoted spread for each time interval and market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['EQF']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['EQF']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['EQF']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['EQF']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['EQF'])])
fig.update_layout(barmode='group', title="BBO Effective/Quoted Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Effective/Quoted Spread'})
%plotly fig

Athena Spark calculation performance

When you run a code block, Athena Spark automatically determines how many DPUs it requires to complete the calculation. In the last code block, where we call the get_tca_analysis_via_df_single_query function, we are actually instructing Spark to process the data, and we then convert the resulting Spark data frames into pandas data frames. This constitutes the most intensive processing part of the analysis, and when Athena Spark runs this block, it shows the progress bar, the elapsed time, and how many DPUs are currently processing data. For example, in the following calculation, Athena Spark is utilizing 18 DPUs.

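If you run calculations through the Athena APIs instead of the console, you can retrieve similar usage information programmatically. The following is a small sketch, assuming a calculation was started as in the earlier API example; the execution ID is a hypothetical placeholder:

import boto3

athena = boto3.client("athena")

# Fetch the state and cumulative DPU usage of a calculation
# (the execution ID below is a hypothetical placeholder)
response = athena.get_calculation_execution(
    CalculationExecutionId="<calculation-execution-id>"
)
print(response["Status"]["State"])                     # e.g., COMPLETED
print(response["Statistics"]["DpuExecutionInMillis"])  # cumulative DPU milliseconds
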
When you configure your Athena Spark notebook, you have the option of setting the maximum number of DPUs that it can use. The default is 20 DPUs, but we tested this calculation with 10, 20, and 40 DPUs to demonstrate how Athena Spark automatically scales to run our analysis. We observed that Athena Spark scales linearly, taking 15 minutes and 21 seconds when the notebook was configured with a maximum of 10 DPUs, 8 minutes and 23 seconds when the notebook was configured with 20 DPUs, and 4 minutes and 44 seconds when the notebook was configured with 40 DPUs. Because Athena Spark charges based on DPU usage, at a per-second granularity, the cost of these calculations is similar, but if you set a higher maximum DPU value, Athena Spark can return the result of the analysis much quicker. For more details on Athena Spark pricing, refer to Amazon Athena pricing.
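
To see why the cost is similar, the following back-of-the-envelope sketch compares DPU-hours for the three runs under the simplifying assumption that all configured DPUs stay busy for the entire run (in reality, Athena Spark scales DPUs up and down during the calculation); the per-DPU-hour rate is a hypothetical placeholder, since pricing varies by Region:

# Rough DPU-hour comparison for the three runs, assuming full utilization
# of the configured maximum for the whole duration (a simplification)
runs = {10: 15 * 60 + 21, 20: 8 * 60 + 23, 40: 4 * 60 + 44}  # max DPUs -> seconds
price_per_dpu_hour = 0.35  # hypothetical rate; check Amazon Athena pricing for your Region

for dpus, seconds in runs.items():
    dpu_hours = dpus * seconds / 3600
    print(f"{dpus} DPUs: {dpu_hours:.2f} DPU-hours, ~${dpu_hours * price_per_dpu_hour:.2f}")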

Conclusion

In this post, we demonstrated how you can use high-fidelity OPRA data from LSEG's Tick History – PCAP to perform transaction cost analytics using Athena Spark. The timely availability of OPRA data, complemented by the accessibility innovations of AWS Data Exchange for Amazon S3, strategically reduces the time to analytics for firms seeking to create actionable insights for critical trading decisions. OPRA generates approximately 7 TB of normalized Parquet data every day, and managing the infrastructure to provide analytics based on OPRA data is challenging.

Athena's scalability in handling large-scale data processing for Tick History – PCAP OPRA data makes it a compelling choice for organizations seeking fast and scalable analytics solutions on AWS. This post shows the seamless interaction between the AWS ecosystem and Tick History – PCAP data, and how financial institutions can take advantage of this synergy to drive data-driven decision-making for critical trading and investment strategies.