Blog on Healthcare Efficiency & Transparency

Learnings from MRF Land

In this post, we’ll lay out some of the key learnings and gotchas from the past few months of MRF processing, so you and your team don’t have to learn the hard way (like we did). 

Background

In 2021 and 2022, before the payer data was released, Serif Health did the work to ingest hospital price transparency data across a quarter of the country. We were excited to gather the data so we could offer simple and impactful analytics to our customers and enable them to run effective healthcare businesses. It was a ton of work (and a story for another day) - but ultimately there were enough reporting gaps with professional fees, and enough validity concerns with the institutional rates posted, that we concluded the hospital data alone wouldn’t suffice for our analytics needs.

We attempted to license a multi-state claims dataset to gap-fill and QA the hospital data we did have, and learned firsthand how expensive, cumbersome, and restrictive it is to obtain bulk data in the healthcare industry. Licensing claims data at national scale from a clearinghouse or data vendor would, at the low end, cost us millions of dollars annually. The data would also come with highly restrictive clauses - eligible use cases, audit rights, no-caching requirements, IP retention, anonymization requirements, and so on - that would make using and incorporating the claims data into our customer portal extremely difficult and expensive.

The payer data release was a second chance to leverage price transparency data for our customers. Again, we were excited to offer effective analytics without all the pain and costs of claims. And our sense was that CMS had learned from the hospital price transparency initiative, and done a better job declaring and defining clear standards for the payer machine readable files (MRFs). We talked to the authors of the Final Rule, we followed along with the CMS regulations and Price Transparency Guide as the rules were being finalized, we re-tooled our hospital ingestion code to match the new JSON schemas, and we felt prepared for release day.

What we weren’t prepared for was the sheer scale and volume of the files posted. In the first few days after release, regardless of which part of the MRF puzzle we looked at - the payer directories, the index files, the provider_references blocks, the uncompressed file sizes - each large payer introduced new edge cases and situations where simple, straightforward approaches and engineering decisions that had worked for hospital price ingestion weren’t going to get the job done. 

After a few months of iterating and refining our approach, we are now able to ingest the 650GB compressed Cigna PPO file in about 12 hours of compute runtime. Humana’s national network(s) of 15,000+ small files takes a bit longer to pull in due to the time required to make all those parallel network requests, but once processed, we can easily query the data, extract aggregates, and pull samples for interested parties in just a minute or two. As one example, we extracted UT Southwestern’s rates from the Humana NPOS network, scanning 512GB of data in 87 seconds.

We were able to get through all the major payers and produce cost-effective, usable MRF data with a small team - scrappy, but efficient - and we’re excited to bring that efficiency and expertise to your organization.

 

Learning #1: Data sizes and volumes are obscene. 

To do even the most basic tasks with the MRFs, you’ll need tools that don’t assume you’ll be loading a full JSON file into memory. A streaming decompressor & JSON token parser is essential so you can take a quick peek inside the Anthem index or a national PPO rate file and test assumptions or verify contents. Keep in mind that gzip compression means you won’t be able to seek ahead in the file; each file scan starts with byte zero and goes from there.  Our custom JSON parser library reduced 500GB compressed file scans from tens of minutes down to one, and enabled faster coding and testing.
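For illustration, here’s a minimal sketch of that kind of quick peek using off-the-shelf Python libraries (requests and ijson) rather than our custom parser; the URL is a hypothetical placeholder.

    import gzip
    import itertools

    import ijson
    import requests

    MRF_URL = "https://example-payer.com/in-network-rates.json.gz"  # hypothetical placeholder

    with requests.get(MRF_URL, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # Wrap the raw byte stream in a streaming gzip decompressor; nothing touches
        # disk and only small buffers are ever held in memory.
        with gzip.GzipFile(fileobj=resp.raw) as stream:
            # ijson walks the JSON token stream and yields one in_network item at a time.
            for item in itertools.islice(ijson.items(stream, "in_network.item"), 3):
                print(item["billing_code"], item.get("negotiation_arrangement"))

Because it stops reading as soon as the first few items have streamed past, a peek like this usually finishes quickly even against a multi-hundred-GB file.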

Leverage streams - ideally, you’re not downloading or unpacking any of these files to disk in order to work with them, and you’re never trying to load any full file into memory. An on-the-fly stream of bytes, coming down from the payer’s S3 bucket and pushing transformed buffers up into your own, allows you to handle these massive files on a c6.large EC2 instance without it falling over.
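Here’s a minimal sketch of that pass-through pattern, assuming requests and boto3; the bucket, key, and URL are hypothetical placeholders.

    import boto3
    import requests

    MRF_URL = "https://example-payer.com/in-network-rates.json.gz"  # hypothetical placeholder
    s3 = boto3.client("s3")

    with requests.get(MRF_URL, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # upload_fileobj reads the response in chunks and multipart-uploads them,
        # so neither disk nor memory ever holds the whole file.
        s3.upload_fileobj(resp.raw, "my-mrf-bucket", "raw/payer/in-network-rates.json.gz")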

Finally, take the time to define some small, simple test cases based on reasonably sized files for each payer. After processing, you’ll still have a multi-GB CSV or parquet file with a few hundred million rows, and spot checking individual values here and there for outliers or coding bugs becomes a challenge. If a payer doesn’t post any reasonably sized files, make one by sampling a few records out of a big in-network file. That way, you’ll be testing against their actual schema and can verify data integrity (and catch regressions) easily.
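One way to build such a fixture, sketched with ijson (file names are placeholders): stream the first few dozen in_network items out of a real payer file and wrap them in a minimal header so the payer’s actual schema is preserved.

    import gzip
    import itertools
    import json

    import ijson

    SOURCE = "cigna-in-network-rates.json.gz"   # a real payer file (placeholder name)
    FIXTURE = "cigna_small_fixture.json"        # small file for unit tests

    with gzip.open(SOURCE, "rb") as src:
        # Pull the first 25 in_network items without reading the rest of the file.
        sample = list(itertools.islice(ijson.items(src, "in_network.item"), 25))

    fixture = {
        "reporting_entity_name": "test fixture",
        "reporting_entity_type": "health insurance issuer",
        "last_updated_on": "2022-10-01",
        "version": "1.0.0",
        "in_network": sample,
    }

    with open(FIXTURE, "w") as out:
        # ijson parses numbers as Decimal, so fall back to str for serialization.
        json.dump(fixture, out, default=str)

If the payer uses provider_references, you’ll also want to copy over the referenced blocks so the fixture resolves cleanly in tests.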

Learning #2: The data is good, but it isn’t always clean and doesn’t always make sense. 

Unlike with the hospital transparency data, we were able to verify the data in the payer files by identifying various provider group TINs and NPIs we work with, finding those values in the MRF files, and verifying that their rates match what is stated in their contracts.

In every case we’ve tested so far, the contracts were listed and the amounts were correct down to the penny. So we know the files contain good data and are based on verifiable reality (a concern we’d heard and been unable to resolve with the hospital transparency data).

But that doesn’t mean the data is perfect or simple to reason about. We’ve noted all of the following:

  • Many MRF files have provider reference blocks containing a ‘location’ - a remote file URL - which 404’s or times out
  • Some MRF files have provider reference IDs that don’t exist anywhere in the file
  • Aetna lists multiple prices, with significant spread, for the exact same arrangement and place of service codes - with no explanation why or when the different rates should be applied.
  • Many MRFs have some number of rows where the NPI or TIN is listed as null, 0 or 99999999. Does this represent valid data, a dead contract, or something else?
  • Physicians show rates for procedures that they can’t perform (see this excellent post at DoltHub for potential explanations why)
  • Place of service codes can be spotty (a concrete example was noted in our previous blog post). Humana does not include place of service codes in any of their rate files, despite a column header for it existing in the MRF files.
  • Anthem only lists NPIs in the TIN field and repeats that same single value in the NPI field, making it difficult to understand the structure or size of groups being contracted.
  • Several payers posted rate files filled with lists of codes and negotiated rate or percentage rate = 0.00. What does that 0.00 rate mean?
  • Some payers’ in-network files look woefully small for the expected size of their network
  • The majority of out of network allowed rates files are empty

We’ve begun reaching out to payer PR contacts on all of these fronts, and thankfully have gotten generally fast responses and some new postings from the plans themselves (special kudos to the teams at Oscar and MVP Healthcare)! We’ll share any clarifications or responses from plans via our blog as we get them.

Learning #3: Finding (and using) the correct index file is critical…but non-obvious

When we started our ingestion project, we started with two states where our customers had immediate and consistent demand - Texas and New York. We looked at the payer directories and were tempted to try and just grab every data file we could find; that would have been an enormous waste of time and money. Choosing the plan you want from the appropriate index file means you’ll only pull in the data you need and can ignore all the other files you don’t - especially critical when 95% of the files you see are redundant postings purely there for compliance reasons. 

For all payers - use fully insured directories and indexes before TPA directories. They tend to be more comprehensive - especially since an individual company might not buy up ancillary benefits or network extensions like dental or behavioral health that are posted alongside the fully-insured plans’ core in-network file(s).

Many payers have contracted a company called HealthSparq to host their directories (Aetna, BCBS, MVP), and HealthSparq does a pretty good job separating self-funded plans, where the payer acts as a TPA or MRF file host, from the fully-insured plans the payers directly operate.

Edge cases:

Cigna just listed one global index file, and United and Humana built their own directory pages that are really tough to navigate. 

For Cigna - you’re pretty much going to have to work from the nationwide OAP or PPO network. The local / state networks they list are paltry, and there is no middle ground.

For United - you want to command-f on their page and look for 2022-10-01_UnitedHealthcare-Insurance-Company_index.json. That’s the global index file, with all the national plans. If you want state specific plan indexes, command-f and look for ‘of-State’ where State is the state you’re searching for. Keep in mind there may be multiple entities, like UnitedHealthcare-of-Illinois-Inc and UnitedHealthcare-Insurance-Company-of-Illinois. One of these entities seems to post HMO and individual exchange plans, the other seems to have nationwide PPO and large group offerings. But that’s certainly not clear from the data in the files or the directory itself.

For Humana - navigating the directory manually is effectively impossible, since the indexes don’t start until offset 448,000. Use a direct API request to that offset, as follows: curl -X GET "https://developers.humana.com/Resource/GetData?fileType=innetwork&sEcho=3&iColumns=3&sColumns=%2C%2C&iDisplayStart=448000&iDisplayLength=1500&mDataProp_0=name&sSearch_0=&bRegex_0=false&bSearchable_0=false&mDataProp_1=modifiedDate&sSearch_1=&bRegex_1=false&bSearchable_1=false&mDataProp_2=sizeToDisplay&sSearch_2=&bRegex_2=false&bSearchable_2=false&sSearch=&bRegex=false"

Pick the specific Humana index file you want from there. Note that some of these are nationwide networks even if a state abbreviation is in the index file name.

For Aetna, use https://health1.aetna.com/app/public/#/one/insurerCode=AETNACVS_I&brandCode=ALICFI/machine-readable-transparency-in-coverage but fall back to self-insured here: https://health1.aetna.com/app/public/#/one/insurerCode=AETNACVS_I&brandCode=ALICSI/machine-readable-transparency-in-coverage

For BCBS, get your index from the state entity and just ignore anything with mrf.bcbs.com in the URL (see next point for why).

For Anthem / Elevance - we plan to post a separate list of URLs for each state’s network files so you don’t have to parse a 30GB index to find them. :)

Learning #4: Networks are overlapping and highly redundant, but file level hashing won’t work.

If you open UHC’s national index file, you’ll see lots of references to Optum Behavioral Health P3 and PPO_P3 in-network files. Open UHC Illinois’ index and you’ll see each plan also has a Behavioral_Health_P3 and a PPO_P3 network, but the file names start with uhc_of_illinois. If you hash the files, the hashes are different - because each file has a starting header where the plan name and EIN differ. Otherwise, the provider_references and in_network sections are the same. Don’t waste time and money pulling this data in multiple times! From a coding perspective, you’ll need to isolate these network-level identifiers and make sure you don’t pull in redundant files, as in the sketch below.
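Here’s one minimal sketch of that idea, under the assumption that hashing a prefix of the rate content is a good-enough fingerprint; the file names are hypothetical.

    import gzip
    import hashlib
    import itertools
    import json

    import ijson

    def network_fingerprint(path, n_items=50):
        """Hash the first n_items in_network entries, ignoring the per-plan header."""
        digest = hashlib.sha256()
        with gzip.open(path, "rb") as f:
            for item in itertools.islice(ijson.items(f, "in_network.item"), n_items):
                digest.update(json.dumps(item, sort_keys=True, default=str).encode())
        return digest.hexdigest()

    seen = set()
    for path in ["uhc_national_ppo_p3.json.gz", "uhc_of_illinois_ppo_p3.json.gz"]:
        fp = network_fingerprint(path)
        if fp in seen:
            print(f"skipping {path}: same network content already ingested")
        else:
            seen.add(fp)
            # ...enqueue for full ingestion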

For BCBS index files, you’ll notice for each state you have some local files like Vermont_HMO_1_of_4 and then a national set of ‘in network nationwide files’ hosted at mrf.bcbs.com - these files represent the BCBS PPO reciprocity agreement across their different entities. You don’t want to lump in-state networks with out-of-state data! In the nationwide filenames, the _xxx_ in the file name tells you which host network is participating in reciprocity. A lookup table is available here: https://www.health.ny.gov/statistics/sparcs/sysdoc/appl.htm.

But that’s not enough. At the individual data point level, you’ll need another level of hashing. That’s because of the shape of PPO vs HMO networks. Most providers sign a contract enrolling them in all of the EPO / PPO / HMO networks offered, but they only get paid one rate regardless of which plan the patient selects. If you pull in each of those networks, combine them together, and then try to build a regional distribution across them, you’re biasing the regional distribution in favor of payers that have numerous plans repeating the same rates. It’s better to only compare specific plan types (say, HMO vs HMO), or to ingest any unique rate exactly once.

A concrete example of this - Anthem has many different networks in New York, each with a different shape, size, and distribution median if ingested on its own.

When we started, we only ingested ONE (ideally the largest) network in a given region - but that’s flawed as wraparound networks, carveouts, and extensions can be unique to some networks. 

The (up for debate) appropriate approach here is to hash each tuple or do a DISTINCT query on the ein/npi/code/rate for a region, and make sure you don’t doubly ingest repeats if you see them in another file. We’re continuing to refine our approach here and will share more as we go.
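As an example, here’s a minimal sketch of the hashing variant, assuming rows have already been flattened to (ein, npi, billing_code, rate) tuples; the warehouse equivalent is simply a SELECT DISTINCT over those same four columns.

    import hashlib

    seen = set()

    def is_new_rate(ein, npi, billing_code, rate):
        """Return True only the first time a given ein/npi/code/rate tuple is seen."""
        key = hashlib.md5(f"{ein}|{npi}|{billing_code}|{rate}".encode()).digest()
        if key in seen:
            return False
        seen.add(key)
        return True

    # During ingestion, duplicate rows from a second (overlapping) network file get skipped:
    rows = [("12-3456789", "1234567890", "99214", "112.50")] * 2   # hypothetical duplicate rows
    kept = [r for r in rows if is_new_rate(*r)]
    assert len(kept) == 1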

Learning #5: Normalizing and Aggregating data is critical for structured storage

Denormalizing and storing every EIN, NPI, site of service, and rate permutation in the files would mean petabytes of data - possible, but practically unqueryable with reasonable speed or cost constraints. And certainly not something you can put an API directly on top of. 

Normalizing the data into separate tables with uniqueness constraints saves lots of space, but leads to very complex insert logic that crosses multiple tables, and slows bulk data loading down unacceptably. Our learnings here could fill another ten page post. 

You can define space-efficient, queryable schemas that:

  • store a contracted entity’s rate as a percentage of CMS across the entity’s codeset, dropping code and rate specificity;
  • aggregate rate values by count of EIN and NPI per region / code / payer / tier / site tuple, dropping EIN and NPI specificity; or
  • store a limited number of code:modifier:rate tuples across the codeset for each EIN / NPI / place of service tuple, leading to partial coverage.

DoltHub has a great idea of dropping ineligible codes, which would lead to substantial space and processing time savings if a canonical list can be defined.

At nationwide scale, each of these is a multi-billion-row table on the order of hundreds of GB; it requires some tooling, but is workable for getting data in and out quickly and is directly servable as an API with sub-100ms response times.
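To illustrate the second option above, here’s a sketch using DuckDB over hypothetical parquet extracts; the column names are assumptions, not actual MRF field names.

    import duckdb

    con = duckdb.connect()

    # Aggregate per region/code/payer/tier/site, keeping counts and a median
    # instead of every individual EIN/NPI rate row.
    agg = con.execute("""
        SELECT
            region,
            billing_code,
            payer,
            negotiated_type                        AS tier,
            service_code                           AS site,
            COUNT(DISTINCT ein)                    AS ein_count,
            COUNT(DISTINCT npi)                    AS npi_count,
            APPROX_QUANTILE(negotiated_rate, 0.5)  AS median_rate
        FROM read_parquet('rates/*.parquet')
        GROUP BY 1, 2, 3, 4, 5
    """).fetchdf()

    print(agg.head())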

Learning #6: You’ll need lookup tables and other APIs to make the data useful

So you’ve made it this far: you have your parsers, you’ve extracted data, you’re querying it, and you’re reading some data out for your use case. You’ll quickly find that real-world use cases can’t be served by the data in the MRFs alone.

First, TINs aren’t that useful for anything but hospitals. Most small practices don’t publish their TINs, so you can’t match on them, and reverse lookup services for TINs are decently expensive pay-to-play APIs. You can find lookup lists for hospitals easily enough, but that’s a small percentage of the entities in most files.

NPIs are much easier to work with thanks to NPPES. In fact, you may find (like we did) that a state, type1/type2, and primary taxonomy code lookup table is essential at the processing stage, so you can correctly bucketize data and calculate counts and aggregations. We initially tried hitting an internal lookup API service, then directly querying a database, then eventually created an in-memory system so each of our processing machines can do direct lookups to keep this super fast at the scale of data ingest (avoiding another expensive post-processing or join step later). 
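Here’s a minimal sketch of that in-memory lookup, built from the public NPPES dissemination CSV; the column names below match recent NPPES releases, but verify them against the file you actually download.

    import csv

    npi_lookup = {}

    with open("npidata_pfile.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            npi_lookup[row["NPI"]] = {
                "entity_type": row["Entity Type Code"],   # 1 = individual, 2 = organization
                "state": row["Provider Business Practice Location Address State Name"],
                "taxonomy": row["Healthcare Provider Taxonomy Code_1"],
            }

    # During MRF parsing, bucketizing a provider is then a single dictionary hit:
    info = npi_lookup.get("1234567890")   # hypothetical NPI

The full NPPES file has several million rows, so holding it this way costs a few GB of RAM - usually a fair trade for avoiding a network hop or join per parsed row.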

Next, you’ll hit the location assignment problem. You are parsing a nationwide network file and you have a provider group reference consisting of a TIN you can’t look up and a list of physician NPIs registered in multiple different states. WHERE should you geolocate that ‘group’? You can’t tag them as every state, and you don’t want to duplicate the data 50 times. Do you explode the data out, so each NPI is its own listing? That causes data size issues and breaks efficient collapsing and counting. Do you pick the most commonly listed state? What if it’s a telehealth group? We’re actively seeking solutions from a provider directory partner, but this is a hard one to solve.

Finally, you’ll hit the taxonomy issues. As noted previously, you’ll see some providers listed with codes and contracted rates that don’t make a ton of sense. Alec Stein at DoltHub documented evidence of this, and the explanation why it happens, in their blog post on C-section rates - in short, a contract for 175% of CMS PFS means a bunch of irrelevant CPT and HCPCS codes will be listed with rates that provider cannot bill, skewing global distributions if included. Similarly, just looking at all rates for a common E/M code like 99214 would place NPs at family practice PCP groups in a distribution alongside board-certified specialist MD visits, whose rates will be substantially higher. The best workaround here is to only compare like physicians’ rates - ophthalmologists to other ophthalmologists - to ensure irrelevant data from other provider types isn’t included. But this requires exploding the NPI list in the MRF data and joining each NPI to a taxonomy code and/or credential enumeration, so you can compare apples to apples. Obviously, this creates challenges for space/time efficiency that are tricky to solve.
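Here’s a small sketch of that explode-and-join step using pandas; the NPIs, taxonomy assignments, and rates are made up for illustration (207W… is the ophthalmology taxonomy prefix, 92004 a comprehensive eye exam code).

    import pandas as pd

    # Hypothetical NPI -> primary taxonomy map (built from NPPES in practice).
    npi_taxonomy = {
        "1111111111": "207W00000X",   # ophthalmology
        "2222222222": "363LF0000X",   # nurse practitioner, family
        "3333333333": "207W00000X",
    }

    rates = pd.DataFrame({
        "billing_code": ["92004", "92004", "92004"],
        "negotiated_rate": [150.00, 95.00, 210.00],
        "npi_list": [["1111111111"], ["2222222222"], ["3333333333"]],
    })

    # Explode the NPI lists so each rate row maps to a single provider, then keep
    # only one specialty before building a distribution.
    exploded = rates.explode("npi_list").rename(columns={"npi_list": "npi"})
    exploded["taxonomy"] = exploded["npi"].map(npi_taxonomy).fillna("")
    ophtho_only = exploded[exploded["taxonomy"].str.startswith("207W")]

    print(ophtho_only.groupby("billing_code")["negotiated_rate"].median())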

We’re continuing to refine our approach and will post engineering updates as we go!

Learning #7: Off the shelf (EMR / Glue Data Studio) won’t work…and is quite expensive

To de-risk the bespoke parsing and coding we were doing, another team at Serif Health attempted to solve the ingestion pipeline problem entirely through a big data pipeline/warehouse approach using AWS Glue, EMR, Athena and Redshift.

While these platforms are more powerful, they can be tougher to customize and substantially more expensive to run. And that power is often wasted, especially when the source files are large, compressed, and require preprocessing. The provider reference schemas and remote lookups embedded in the MRFs don’t work out of the box with Glue, large single files don’t benefit from the concurrency offered by EMR jobs, and we saw parquet conversion jobs time out and fail due to lumpy file sizes and schema violations in the source MRFs.

From an engineering systems perspective, logging and reporting of progress across Spark jobs is harder, making custom code hard to test and debug quickly. A lot of one-off configuration was required to connect notebooks and clusters to our other internal systems and data, and common workflows with our other codebases (like source code management and CI/CD) were hard to establish. The litter of intermediate storage formats and files these tools generated was difficult to manage and maintain. Net, we were spending hundreds of dollars a day on Glue jobs for work that costs just a few dollars with an EC2, job-queue-based pipeline.

There are places where big data approaches with Spark / S3 + Athena or Redshift must be used. Efficiently joining the underlying large dataset to lookup tables and re-calculating aggregations is one of them. Each time we need to do a schema change or join in a new piece of data, re-processing is way more effective with Spark/EMR. For querying and aggregating Humana, this approach was essential - using individual jobs / workers makes little sense when files are already chunked like Humana’s are and the chunk sizes are ideal for super-fast processing in Spark. An aggregation and extraction job that took a day across all the files in a more traditional job queue took a few minutes in Athena. 
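For flavor, here’s roughly what such a re-aggregation job looks like in PySpark; the paths and column names are hypothetical placeholders, not our actual warehouse layout.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("mrf-reaggregate").getOrCreate()

    # Read the already-chunked parquet extracts straight from the warehouse bucket.
    rates = spark.read.parquet("s3://my-mrf-warehouse/humana/rates/")

    agg = (
        rates.groupBy("billing_code", "state")
             .agg(
                 F.countDistinct("npi").alias("npi_count"),
                 F.percentile_approx("negotiated_rate", 0.5).alias("median_rate"),
             )
    )

    agg.write.mode("overwrite").parquet("s3://my-mrf-warehouse/humana/aggregates/")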

Conclusions

In the end, we landed on a hybrid approach. We preprocess with our stream-based worker queue cluster, and convert raw MRF data into multiple different efficient schemas warehoused in S3. Once there, a job queue loads new data into Postgres for serving through the API, while our team leverages Spark / Athena / Redshift for easy bulk extractions, re-pulls, transformations, and aggregations on data sitting in the warehouse as new use cases arise or customer requests come in.

We hope this helps you and your MRF processing efforts - contact us (engineering@serifhealth.com) if you have any questions or want to leverage our expertise with the MRFs.