Saturday, January 21, 2017

Working with the Subsystems functional hierarchy, Part 1

Note: I've been struggling for the last few days to work with the SEED Subsystems database, so I'm writing this blog post in hopes that others, when searching the internet for a similar issue, will be able to use my discoveries for their own benefit.

The idea behind the SEED Subsystems database is a fascinating one; by grouping microbial functions under different groups and hierarchies, it's possible to look at a microbiome in broad terms.  How much respiration is happening in this microbiome?  How much of the total energy is being put towards one-carbon metabolism?  How, at a glance, is the functional activity of this diverse environment changing between sample collection times?

At a database level, the Subsystems hierarchy makes some intrinsic sense.  In a hierarchy database, every classified function is listed, along with its higher-level groupings, like such:

Respiration     Electron donating reactions     Methanophenazine_hydrogenase    Methanophenazine hydrogenase large subunit (EC 1.12.98.3)
In this case, the large subunit of Methanophenazine hydrogenase is grouped under "Methanophenazine hydrogenase", which is in turn grouped under "Electron donating reactions", which is finally grouped under the top-level category of "Respiration."

However, the organization of the actual Subsystems database leaves quite a lot to be desired.  One of the issues, to begin, is that it links to the NMPDR website, which has been defunct for more than half a decade at the time of this post (early 2017).  But really, it's just not laid out at all for incorporation into a bioinformatics pipeline.

The database can be found from this FTP link: ftp://ftp.theseed.org/subsystems/

In the largest file, labeled as "subsystems.complex", specific sequences are linked to an NMPDR ID, which is generally known as a Fig ID, since they all start with "fig" as the first three letters.  These aren't very useful, as NMPDR is dead.  But they're needed for linking to other subsystems entries.

In the file  "subsystems2peg", these Fig IDs are linked to their hierarchy, as listed above.  For most of them.  Some of them aren't present in subsystems2peg, but that Fig ID is linked to a specific function in the subsystems.complex file, and that function name is linked to a hierarchy in "subsystems2role", or sometimes instead in "subsys.txt"...

...yeah, it's confusing.  Want me to make it worse?

NMPDR is dead, but the data has been migrated over to a new location, PATRIC (https://www.patricbrc.org/).  Unfortunately, this new website isn't much better than NMPDR; their search function searches only by keyword, and can't locate a specific Fig ID.  Really, paste a Fig ID into the search bar, and it searches for every Fig ID with any part of your search query in it at all.  Useless.  Furthermore, it seems that they haven't preserved all Fig IDs from Subsystems, so some of the Subsystems IDs aren't on PATRIC at all.

Frustrating.

Still, I think I've worked out a solution.  There are a few steps, but it lets a bioinformatician get the Subsystems raw database into a useful configuration for searches, and then allows for some interpretation of the results.

Relevant GitHub repository with the useful scripts I've built for this purpose: http://github.com/transcript/subsystems

Step 1: Get the Subsystems database into something more useful.

You'll need to download "subsystems.complex" from TheSEED's FTP site: ftp://ftp.theseed.org/subsystems/ .  Take note, this is a nearly 10 gigabyte file, a decent amount of which is garbage.  It will take a while.

Once this monster file is finally downloaded, run the Python program Subsystems_simplifier.py from the command line, with the subsystems.complex file specified as ARGV1.  Note: currently, these programs are not optimized, and lack important features like a usage statement and other reporting stats.  That may come later.

This will create a new file named "subsystems.complex.modified" which contains the Subsystems protein sequences, along with a header on each sequence that matches it to its Fig ID.  This is in FASTA format, and can be used for annotation algorithms, such as BLAST and DIAMOND.

Step 2: Get a database of function matches for Fig IDs.

Now, here's the flow of the next step.  We need to swap the Fig IDs in our results file for their Subsystems function match (the lowest, and most diverse, level of hierarchy).  All of those function names are in the big "subsystems.complex" file - but at a different spot than the actual sequence, so they got stripped out by Subsystems_simplifier.py.  The FIG_extractor.py program can get them out.

Run FIG_extractor.py on the command line, with subsystems.complex as ARGV1.

This will create a file called subsystems.complex.figs that contains every Fig ID and it's function match.

Step 3: in your results file, match Fig IDs to their functions and hierarchy.

Back at TheSeed's FTP site, download "subsystems2peg.gz".  This contains (most of) the Fig IDs and their match in the hierarchy, at the lowest level (function).

While here, you'll probably also want to download "subsys.txt", as this will be needed for building the hierarchy.  You could go ahead and grab "subsystems2role.gz" as well, just so that you've downloaded everything, because putting everything in one database file makes too much sense.

Our goal now is to go from the Fig ID to having the full hierarchy listed, if possible.  This next program, fig_swapper.py, will do all of this, in a few steps.  It will:

1. Read the "subsystems.complex.figs" file and build a dictionary of all Fig IDs and their functional matches.
2. Open "subsystems2role" and get the hierarchy for all functions listed in that file.
3. Open "subsys.txt" and get the hierarchy for the (different) functions listed in THAT file.
4. Open a results file that contains Fig IDs and, for each ID, add additional columns with that ID's functional match, and with the hierarchy, if it's available.

Usage looks like:
python fig_swapper.py subsystems.complex.figs my_results_file.tab subsystems2role subsys.txt
This will generate a results file that has the suffix ".converted", which should contain the Fig IDs, their associated function, and that function's hierarchy, if available.

*****

Hooray!  We've finally managed to get something sensible out of the Subsystems database!

Thursday, September 25, 2014

Hello, World!

print ("Hello, World!");
return;

Seems fitting.

Welcome to Sam Westreich's personal page!  To find out more about me, check out the links at the top, just below the header.