EMCmodule FAQs

Please read the FAQs carefully to check whether there is a solution to your problem before emailing us about reading in the data, runtime issues, etc. Most past problems with EMCmodule have come from incorrect or inconsistent input formats, so make sure you have the EXACT format before running EMCmodule. The authors are not responsible for mistakes caused by incorrect usage of the program. Please do send in questions that you do not find answered, though, so that we can add them to the FAQs to help other users. Enjoy!

Input format

Q. I tried to run EMCmodule under different combinations, including the suggested suffix "20 40 5000 1 200 200 214748361" as in (./emcmodule myfile.sites myfile.fasta 20 40 5000 1 200 200 214748361) but the result is always the same. I get:

Error: Motif matrix file not consistent with sequence file!

I verified that the sites, which I got from MDscan, match the FG in the same way that the sites in your site file (test.sites) match the fasta file you provided (test.fasta). I tried adding spaces and cutting chunks from the file to identify the problem, but could not find it.

A. Tabs and spaces should not matter. However, the motif sites need to be consistent with the sequences, in the sense that you cannot have overlapping sites. The reason is that the background model is estimated from the nucleotides that are not in the motif sites, so no nucleotide should be counted twice. In short, you need to remove redundancies from the motif input file. For example, looking at the first three sites of your motif file for the motifs indexed 1 and 2, they are identical:

1 1 6 TAGTGCCG
1 1 29 TCGTGCCG
1 1 87 GAGTGACG

2 1 6 TAGTGCCG
2 1 29 TCGTGCCG
2 1 87 GAGTGACG
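
A minimal sketch of a pre-check for duplicate or overlapping sites, in Python. It assumes each line of the sites file has four whitespace-separated columns (motif index, sequence number, start position, site); only the first column is confirmed by the example above, and the other column meanings are assumptions. The function names are hypothetical and not part of EMCmodule.

```python
from collections import defaultdict

def parse_sites(lines):
    """Parse lines assumed to look like: motif_index seq_number start site."""
    sites = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        motif, seq, start, word = fields
        sites.append((int(motif), int(seq), int(start), word))
    return sites

def find_conflicts(sites):
    """Return (earlier_site, later_site) pairs that share nucleotides."""
    by_seq = defaultdict(list)
    for site in sites:
        by_seq[site[1]].append(site)
    conflicts = []
    for seq_sites in by_seq.values():
        seq_sites.sort(key=lambda s: s[2])  # order by start position
        furthest = None  # earlier site that reaches furthest to the right
        for cur in seq_sites:
            if furthest is not None and cur[2] < furthest[2] + len(furthest[3]):
                conflicts.append((furthest, cur))
            if furthest is None or cur[2] + len(cur[3]) > furthest[2] + len(furthest[3]):
                furthest = cur
    return conflicts

example = [
    "1 1 6 TAGTGCCG", "1 1 29 TCGTGCCG", "1 1 87 GAGTGACG",
    "2 1 6 TAGTGCCG", "2 1 29 TCGTGCCG", "2 1 87 GAGTGACG",
]
for earlier, later in find_conflicts(parse_sites(example)):
    print("conflict:", earlier, later)
```

On the six sites shown above this reports three conflicts, one per duplicated position. Deciding which of two conflicting sites to keep is left to you.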

Q. I ran the example file in the package and got the same result. However, when I ran my own file, which has 50 sequences with an average length of 1000 bp, I got a memory problem. I tried changing the number of motifs, but it did not help. I also changed other parameters on the command line, and that did not help either. The memory required is 2.9GB.

A. In your input motif file I noticed you had sequence names instead of numbers. Numbers are the required form of the input, so you need to change the sequence names to numbers.
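
A minimal sketch of one way to replace names with numbers, in Python. It rests on assumptions that this FAQ does not confirm: that sequences are numbered 1, 2, 3, ... in the order they appear in the FASTA file, that a sequence name is the first word after ">", and that the name sits in the second column of a four-column sites file. Check these against the format specification before relying on it; the function names are hypothetical.

```python
def fasta_name_to_number(fasta_lines):
    """Map each FASTA record name to its 1-based position in the file."""
    mapping = {}
    number = 0
    for line in fasta_lines:
        if line.startswith(">"):
            number += 1
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if name in mapping:
                raise ValueError(f"duplicate sequence name: {name!r}")
            mapping[name] = number
    return mapping

def renumber_sites(site_lines, mapping):
    """Replace the (assumed) sequence-name column with the sequence number."""
    out = []
    for line in site_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        fields[1] = str(mapping[fields[1]])  # KeyError if the name is unknown
        out.append(" ".join(fields))
    return out

fasta = [">seqA some description", "ACGTACGT", ">seqB", "GGCCGGCC"]
mapping = fasta_name_to_number(fasta)
print(renumber_sites(["1 seqB 6 TAGTGCCG"], mapping))
```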

Q. I would like to try the strategy outlined in your paper (using SDDA to find motif candidates and then running EMCmodule to find cis-regulatory modules). I am wondering whether it is possible to run SDDA first to find possible motif candidates, combine the most likely candidate motifs (such as the top 10 motifs) from several SDDA results, and then run EMCmodule. I tried this approach, but EMCMS gave me the following error message:

Error: Motif matrix file not consistent with sequence file!

A. Is this the results file from combining several runs of SDDA? You need to get rid of overlapping and duplicate sites before using this file as input to EMCmodule.

A general tip: If you are moving files between platforms (e.g. Windows/Unix/Mac OS X), or even between machines, please run the "dos2unix" command on all input files before using EMCmodule.
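
If "dos2unix" is not installed, the same line-ending cleanup can be sketched in Python. This is an illustrative stand-in, not part of EMCmodule: it rewrites the file in place, and unlike the default mode of "dos2unix" it also converts bare carriage returns (old Mac style) to newlines. The file name in the usage comment is only an example.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def convert_file(path):
    """Rewrite the file at `path` in place with Unix line endings."""
    with open(path, "rb") as handle:
        data = handle.read()
    with open(path, "wb") as handle:
        handle.write(to_unix_newlines(data))

# Usage (hypothetical file name):
# convert_file("myfile.sites")
```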

Method/Algorithm implementation

Q. Motif length. I tried, without success, to run the program with an input file containing motifs of different lengths. Is it an absolute requirement for all motifs to have the same length? Could I "cheat" by padding shorter motifs with N's or * ?

A. Unfortunately, the default option right now is to use motifs of the same length. I haven't yet had the time to incorporate variable-width motifs into the program. I don't think using 'N' or special symbols would work, as motifs are not allowed to contain any symbols outside A, C, G, T in the current version.
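
A minimal sketch, in Python, of a check for these two constraints before running the program. It takes the site strings only (for example, the last column of each line in the sites file) and treats lowercase letters as acceptable, which is an assumption; the function name is hypothetical.

```python
def check_site_words(words):
    """Return a list of problems: unequal lengths or symbols outside A, C, G, T."""
    problems = []
    lengths = {len(word) for word in words}
    if len(lengths) > 1:
        problems.append(f"sites have different lengths: {sorted(lengths)}")
    for word in words:
        bad = sorted(set(word.upper()) - set("ACGT"))
        if bad:
            problems.append(f"{word}: symbols outside A, C, G, T: {''.join(bad)}")
    return problems

for problem in check_site_words(["TAGTGCCG", "TCGTGCCG", "TAGNN"]):
    print(problem)
```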

Q. Updating matrices. When I select the "Updating matrices" option, with for example 5000 iterations, first I got a message about completing the 5000 iterations, and then about updating the matrices. However, when I read the paper, it looks like the matrices and the parameters should be optimised at each iteration. If I am wrong, could you tell me precisely which steps are completed at each iteration?

A. This is mainly an implementation issue, done just to speed up the algorithm. The assumption is that since the motif selection is based on a more dependent MCMC chain, it needs more iterations to achieve convergence than the parameter updating step does. So the selection, based on EMC, is done at every iteration, while the motifs are updated only every few hundred steps.
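
The schedule described above can be pictured with a schematic Python loop. This is only an illustration of the timing, not the actual implementation: the two step functions are placeholders, and the update interval of 500 is a made-up value, since the real interval is not stated here.

```python
def run_schedule(n_iterations, update_every, select_step, update_step):
    """Call select_step every iteration and update_step every update_every iterations."""
    n_updates = 0
    for iteration in range(1, n_iterations + 1):
        select_step(iteration)  # motif selection: every iteration
        if iteration % update_every == 0:
            update_step(iteration)  # matrix update: only occasionally
            n_updates += 1
    return n_updates

selections = []
updates = []
run_schedule(5000, 500, selections.append, updates.append)
print(len(selections), "selection steps,", len(updates), "matrix updates")
```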

Q. What are the prior specifications for K?

A. Currently the prior on K is set to be uniform.

Q. Is the distance between motifs optimized separately for each pair of motifs, or is the same value used for all motifs in a module?

A. The same distribution is used for all motifs irrespective of type, and the posterior probability of a module as a whole is optimized.

General tip: Most importantly, please look at all your files carefully before using the program. The program is not made for automated use. You will get the best results only if you (i) understand the basics of how it works, (ii) read the help files and format specifications carefully, and (iii) make informed judgements where needed (e.g. prior specifications) based on your specific problem. Good luck!