Feature selection from gene expression data
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Feature selection from gene expression data
Molecular signatures for breast cancer prognosis and inference of gene
regulatory networks.
Anne-Claire Haury
Centre for Computational Biology
Mines ParisTech, INSERM U900, Institut Curie
PhD Defense, December 14, 2012
1Gene expression
Figure: Central Dogma of Molecular Biology (Source: Wikipedia)
=> RNA as a proxy to measure gene expression.
3Microarrays: measuring gene expression
Microarrays: one option.
RNA hybridized onto a chip.
Quantification of gene activity in different conditions at the
genome scale.
Resulting data: gene expression data.
High expression
samples
Low expression
genes
4Feature selection: extract relevant information
Genome ≈ 25,000 genes
From gene expression, explain a phenomenon
Feature selection: deciding which variables/genes are relevant.
5Feature selection: extract relevant information
Genome ≈ 25,000 genes
From gene expression, explain a phenomenon
Feature selection: deciding which variables/genes are relevant.
Mathematically:
Explain response Y using variables (Xj )j=1...p
Y = f (X1 , X2 , X3 , X4 , . . . , Xp ) + ε
5Feature selection: extract relevant information
Genome ≈ 25,000 genes
From gene expression, explain a phenomenon
Feature selection: deciding which variables/genes are relevant.
Mathematically:
Explain response Y using relevant variables amongst (Xj )j=1...p
Y = f (X1 , X2 , X3 , X4 . . . , Xp ) + ε
Objective 1: more accurate predictors
Objective 2: more interpretable predictors
Objective 3: faster algorithms
5Contributions of this thesis
Gene Regulatory Network Inference
TIGRESS: new method based on local feature selection.
Ranked 3rd/29 at DREAM5 challenge.
Linear method, competitive with more complex algorithms
Molecular signatures for breast cancer prognosis
Select biomarkers to predict metastasis/relapse in breast cancer
patients.
Complete benchmark of feature selection methods.
Investigation of the stability issue.
6alkB
alkA
ada
metI
metQ
yeiB
metC
metF
metN
folE
asnA
metA
metE
gudP
Gene Regulatory Network
gudX
purL garD
purB gudD metJ
mntR
purC phoQ
prs
cvpA ompT
purE ubiX glyA mqsR
metR pagP aidB
speB
purF
speA
emrB purK ahpC glgB metB
yjeP
glnB hflD qseC metL yegR rseB
yrbL mgrB ahpF cheW
ybjG
asnCmgtA
emrA gcvB mdtA dsbA nadB
slyB hemH frc ung
borD yneM glgX yfdX pnuC
motB
envR rstB purH flu ybaS mdtD cpxA yqjA
exbB spy rseC
Inference with TIGRESS
purD metH
gmr fhuD tonB fhuB
hslJ gor tsr nadA
mprA oxyS rcnR fhuE fepC mdtC
cysN ybaO trxC mnmG fepG yidQ
baeS rdoA motA
dsbG ygaC
fliE fhuC gpmA
lon cspD baeR
nemR acrA cdaR grxA mntH livK psd
codB mdtB ydeH rseA
mltF nemA mioC yhhY
cysD pgi purR qseB codA bacA
asnB ryhB
ybiS fliG metK
cysC pqiB fhuA acrD yebE nadR
yncE ygbK fliH pyrC fepD rcnA proV rpoE
yhdY avtA ydfN lpxA ftnB
purA fecI livJ livH yccA
fecR chiA
fldB gltK mukB
potH fliO ycgRnfsB
flgH tdh livF
uof cysP lysU fabZ ycfS cheA
potI poxB ppdC wzb mukF
rfaZ flhB fepB nrdF uspE
cpxP
agaD
tauB inaA flgI fliJ acrR exbD cysW ade yjjZ gnd csgA cpxR
ribA ygiB
nfsA
argT fliR kbl fhuF ppiD
ddpD fliT rimK
cysU flhA ftnA nrdH mukE evgS lpxD
fldA csgC mcbR dsbC agaI
evgA
potF cysK
tauA
ppdB ybjNacrB
tolC ybjCflhE gltI garK agaS
ddpA fimE
tauD
amtB flgJ fliK phoP cysM
fliC sdaA
yciF
potG flgE ybaT emrY yojI proX degP pgaD
flgG purM gcvH
lrhA thrV gadY kbaY
pqiA pepDadrA
yeaH fpr
fliI
purN recC cysH hdeBhdeA wza ilvH proW yciG
argI fliN nrdE agaB
yhdX flgD hmp yhiM
fliL gcvP flgN aslB livG wcaB
tauC ppdA hisP nac hdeD nhaA pgaB
yhdZ ddpX astB ygiC osmC csgB yciE
fliF oxyR ecnB agaC
cysB hisM nrdI hemL gspO
ddpC artM yqjH fliA ybdZ ilvI rpmA pgaA
ygdB soxS rob hisJ
argG entS smtA nhaR
ddpB garLglgC dctR
astA flgMsoxR cbl yhiD omrA wzc seqA aroG
argA rutR ycaC tppB iraP
argD ccmA fimD agaR
yhdW fliP fliS fliD marA entB mdtE
ydeO envY rcsD
gltJ livMmdtF hha rplU pgaC
ddpF cysA emrK
garP hdfR rprA
argE glnK fur hchA trpL
artP rutFflgB fepE fiu trpC
astC glnG gltL yhjH gcvT slp fimA
garR smpA csgD
astE pepAflgA hisQyjbF dppC entC tomB
yeaG pyrD gadX gadE
astD hydN fliM rutD
yhjA flhC fimF wcaA dadX yrbA
fes rcsA glgP
argB ssuB rutA nfo
flhD hypB
rutB pnp rpsO lrp adiY gspB pncB trpE
artQ fliQ rrfA yjbG
ilvL fimIgadW gabT trpB
argH rutG uxuAgabD ydiV
putA fliZ yfiD fliY ilvDhycE ompC
rcsB serA ydeP
argR torC rutC ssuD gltB ompF
csiD yjjP
rnpB micF oppA
yjbE
cydD pspC katG ilvM cadA
hns srlD entD trpR
argF flgC oppF trpA
carA cirA
ilvA sdhB
artI argC
yqjI fdnH ccmG
mglC
acrF nupC bdm argO mlrA
sdiA aroL
aroH
artJ treR omrB
carB
putP pspE pspG katE
gltF nrdR cyoC gdhA galK
ssuCccmF sucA guaA rstAmglB csgG csiR lacA kbaZ
fumC torD hycG glnQ torR narH csgF relAgalM hofO ftsA lipA
ssuE
glnP caiD galR osmB stpA trpD
yecR fimC gadC
hupA rrsH rrsG aroM
ccmB marB dps csgE fecC uxuB entE galE dadA
trmD
glnH rpsS pspB
zwf glnLsufS rutE dcuD fixC
nuoB marR gadB srlR serC lacI
rtcR nrdD rrlA gltD gabP pspD ompR fecD srlB ftsZ
nrdG sufC sufA oppB entA yaiA
nikD napB ccmH cyoB ndh sucC gntP pdhR
ulaF ansB fecB tnaC lacY
mtr agaV tyrA
wzyE ompW fdnI cysJ mglA cydA
iscA napC hypD sufB hyfB aroP
napG hypE glnA metY nagA yjbHfecE rpoS melA pgm
amiA acs dppDhycF guaB gutQ aroA
iscU tar rtcA nrfA ssuA ccmD lpd adiA glgA torA tyrR
rnlA bcsZ pspF nrfGgcd sodA ilvE feoB hupB tnaB
rtcB ibpB glcG ydhT glpA nuoA sufD aroF
rffA rpsP feaB napA osmE sra
rybB tehA napF nrfF glpB oppD caiA fimH cadB nrdB gutM uxaB sohB rbsK dinF
paaJ entF bglB ppiA chbG lpxC
rplD cheBdmsD glcF galS
bcsB ccmC nuoI feaR hypA nuoF sodB rrsC
sdhD malQ deoA fepA uxaC
ftsQ
napD paaI dmsC nirD ynfK bolA tyrB symE
dkgB infC yfgF proL glcA infB truB feoC malS nagDglpG hokE
wzzE ynfE hypF osmY
cyoD malP fimB folA srlE galP dinB
pheS hcr hyfG hycI nrfB nrfD
fumB
tdcC nusA rbfAyiaN sdhA uxuR uidR malZ murG
thrS cheZ glpC nsrR cysI narP nrfE dppF
caiB ihfA treB sgbH tdcF fimGtyrP bglF tnaA uidC
rffH atoD ccmE napH ihfB feoA nuoM gadA fixB cyoE cadC chbA uidA ftsI ydjM
ytfE dmsB cyoAsucD hyfC galT rpoH csiE
ygbA hybB fumA cspA cyaA
cheR nrfC melR rbsR yafO sulA ftsK
tehByjgI hybA pspA nuoK focB fnr glmY fadL hyfR aceE
sufE
rfe hycB caiF tdcE sgbE srlA ptsI lacZ chbF uvrB ybfE
ydhX
ydhU
ynfG
rplP hycC dppB acnA modE fhlA tdcD fecA malM flgF creD
moeBmoaA ydhV
cydCiscR mdh frdB uxaA ulaA hlyE umuD
hyaC dmsA nuoG
caiEptsG hyfF chpA
rpmI rffD rffC scpA modA cysG modB
oppC yiaK gltA exuR araF mhpA
uidB ddlB
rffM norR glcB atoE tdcA bglG rbsD malY yafQ
ydiU ydhW narL norW paaC fucO araD recA
uraA dcuR paaX
adhE araG mhpF creB
xdhC hemF prfA treC aldB nanC creA mraY
narK nikR arcA glcChpt yjcH paaH melB
cheY narI sdhC fis paaA paaF aceB tdcRulaC crp nanR lsrC lsrG nagC cstA
sbmC uvrC
nikA narX frdD
yeaR hycD hyfD glpT cytR malE araE ivbL
ydeJ norV ulaE uvrD
yeaE grxD atoB glcD
appY nuoN
fixA sucB nagB rbsC dinI lexA dnaG
iscS ynfF hipB chpR
acrEactP rbsA chbR uvrA phr
rffG erpA
moaE ynfH
ygjG hipA hemA hypC cydB cdd
yiaO glpR paaE aceK gyrAmanZudp lsrD nanM
grpE
rbsB dacC rpsU murD
pheM narG frdC deoDhyfI yeiL dgsA malT
hyaFdctA nirC
nirB icd dusB mhpE mhpD
rffE nikC uspB nikB aceF aldA
yiaJ yhcH malG dnaA mhpB mhpR
ydhO fdnG hybF fdhF hyfA paaD ulaG manX fucP malI ruvA
ackA hyaB yoaG acnB bglJ xylR proP murE
ydbD nuoE hyaD asr lamB malK creC dinG ftsW cho
moaB hcp nuoL dcuB ulaR yiaL nanE agp chbB polB
mhpT ydhY tpx nuoJ
hycA paaG trg sfsA
pheT rplC tdcB
rpmC aceA yjjQ hofB glmS insK
wzxE ybdN nagE ssb dinD
atoA appC nuoH nanK xseA leuO argP dksA
tadA ulaD
hyfH paaB mhpC umuC uvrY ftsL
tsgA nikE glpD focA
ubiC ubiA gatC ulaBpstC nfuA yiaM ecpD nanT glpX xylA chbC murF
moaC narJ tam crr fucR lsrB glmU
dppA gatZ deoB fucI fucU recX
upp pflB hybC appAprmC glpQ deoC idnR
rplT betT lldP hyfE pstB
rpsJ dcuA caiC gatD lsrR speC ilvB ybfN ampC ruvB
cueR rplB moeA rpsQ
phoU nuoChyfJ rhaT nrdA fruR malF manY
fucA yafN
rpsC trmA rpoD
dcuShyaA hyaE
aspA araJ trxA tisB yafP
copA thrT deoR xylF gntT hofN araC
serT yccBhycH rplW modC pstA prpE gcvA yebG
tapmoaDatoC appB exuTpaaK tsx araH
hybErimM htrE lsrA prpD malX
pitA rplV fadD fadH araB dnaN dinQ dinJ
puuA leuW fixX xylH recF
envZ mtlA nanA hofM ubiG polA recN
hybD caiT gatY gapA ptsH lsrK ppdD
argK xdhA dcuC rpsI flxA murC
cueO
puuR
ychO rplS uspA
fadB aer gatA xylG murQ rhaS araA
prpR idnK
hybG fadI
ysgA queA pstSrplM idnD yhfA
hybO dsdX iclR tdcG gntK idnO
fadA mtlD lsrF
fucK
pepT glpF glpK gntX
mpl ychH ascF
xdhB betA dsdC ydeA cspI msrA gatB fbaA spf
scpC betI thrU ugpC
lldR ogt argU frdA
fadJ apaH pgk gntR idnT
apaG glgS prpB nupG hofP
ompX mazG cpdB ugpE rhaB ompA
argW mtlRzraSprpC gntU
rhaD
proM fadR yibD topA glpE dapB
murP leuL
fadE lldD cyaR ascB
hisR zraR psiE xylB
gltX thrW gyrB ugpQ ugpA
fruK uhpT
betB proK sgbU epd
puuD glyU
hofC ugpB leuC
leuP dsdA envC rhaA dsrA
pdxA eda ascG
dpiA ilvN
slyA rhaR
argX
leuX
scpB yahA
fruA leuA yibQ
edd
tyrU
ACH, Mordelet, F., Vera-Licona, P. and Vert, J.-P., TIGRESS: Trustful
fruB fbaB glk leuB
pfkApykF
gpmM
leuD
tpiA eno pck phoB
glyT
uhpA
setA
citC sgrS
htpG
citG
Inference of Gene REgulation using Stability Selection, BMC Systems
citX kdgR
dpiB
phnE
ppc
citF citD fabA zraP
phoE
phnH
citE fabB phnO
phnJ phnM
phoA
pitB
phoH phnN
phoR
Biology, 2012, 6:145.
psiF phnI
phnD
phnG amn
phnL
fabR sgrR
phnF
phnC
phnK phnP
tbpA
thiP
thiQ
7Gene Regulatory Networks
Gene regulation: control gene expression.
Transcription factors (TF) activate or repress target genes (TG).
Gene Regulatory Network (GRN): representation of the regulatory
interactions between genes.
alkB
alkA
ada
metI
metQ
yeiB
metC
metF
metN
folE
asnA
metA
metE
gudP
gudX
purL garD
purB gudD metJ
mntR
purC phoQ
prs
cvpA ompT
purE ubiX glyA mqsR
metR pagP aidB
speB
purF
speA
emrB purK ahpC glgB metB
yjeP
glnB hflD qseC metL yegR rseB
yrbL mgrB ahpF cheW
ybjG
asnCmgtA
emrA gcvB mdtA dsbA nadB
slyB hemH frc ung
borD yneM glgX yfdX pnuC
motB
envR rstB purH flu ybaS mdtD cpxA yqjA
purD exbB spy rseC
metH
gmr fhuD tonB fhuB
hslJ gor tsr nadA
mprA oxyS rcnR fhuE fepC mdtC
cysN ybaO trxC mnmG fepG yidQ
baeS rdoA motA
dsbG ygaC
fliE fhuC gpmA
lon cspD baeR
nemR acrA cdaR grxA mntH livK psd
codB mdtB ydeH rseA
mltF nemA mioC yhhY
cysD pgi purR qseB codA bacA
asnB ryhB
ybiS fliG metK
cysC pqiB fhuA acrD yebE nadR
yncE ygbK fliH pyrC fepD rcnA proV rpoE
yhdY avtA ydfN lpxA ftnB
gcvA flgF purA fecI yccA
livJ livH chiA
fldB fecR
gltK fliO mukB
potH ycgRnfsB flgH tdh livF ycfS
potI uof cysP lysU mukF fabZ cheA
poxB ppdC wzb
rfaZ flhB fepB nrdF uspE
cpxP
agaD
tauB inaA flgI fliJ acrR exbD cysW ade yjjZ gnd csgA cpxR
ribA ygiB
nfsA
argT fliR k b l fhuF ppiD
ddpD fliT rimK
cysU flhA ftnA nrdH mukEevgS lpxD
fldA csgC mcbR dsbC agaI
ppdB acrB ybjC gltI garK evgA
potF cysK ybjN tolC flhE agaS
ddpA tauA fimE
tauD
amtB flgJ fliK phoP cysM
fliC sdaA
yciF
potG flgE ybaT emrY yojI proX degP pgaD
flgG purM gcvH lrhA thrV gadY kbaY
yeaH fpqiA
pr hdeBhdeA wza proW yciG
pepDadrA
fliI
purN recC fliN cysH ilvH
argI hmp nrdE yhiM agaB
yhdX flgD gcvP flgN livG wcaB
tauC fliL ppdA hisP nac aslB hdeD nhaA pgaB
yhdZ ddpX astB ygiC osmC csgB yciE
fliF oxyR ecnB agaC
cysB hisM nrdI hemL gspO
ddpC artM yqjH fliA ybdZ ilvI rpmA pgaA
ygdB soxS r o b hisJ
argG entS smtA nhaR
ddpB garL g lgC dctR
astA flgMsoxR cbl yhiD omrA wzc seqA aroG
argA rutR ycaC tppB iraP
argD ccmA fimD agaR
yhdW fliP fliS gltJ fliD marA entB mdtE
ydeO envY rcsD
ddpF emrK
garP livMmdtF rprA hha rplU pgaC
argE cysA hdfR hchA trpL
artP glnK rutFflgB fepE fur fiu trpC
astC yhjH fimA
garR slp csgD
astE pepAflgA glnG gltL
hisQ gcvT yjbF dppC smpA
entC tomB
yeaG pyrD gadX gadE
astD fliM hydN rutD
yhjAflhC fimF wcaA dadX yrbA
fes rcsA glgP
argB n f o ssuB rutA
flhD hypB
rutB p n p rpsO l r p adiY gspB pncB trpE
artQ fliQ rrfA ilvL fimI yjbG gabT trpB
argH rutG gadW uxuA gabD ydiV
putA fliZ yfiD fliY ilvD ompC
rcsB serA ydeP
argR torC rutC ssuD hycE gltB ompF
csiD yjjP
rnpB micF oppAyjbE
cydD pspC katG ilvM cadA
hns srlD entD trpR
argF flgC oppF trpA
carA cirA
ilvA sdhB
artI argC
yqjI fdnH ccmG mglC
acrF nupC b d m argO mlrA
sdiA aroL
aroH
artJ treR omrB
putP carB pspEpspGkatE gltF nrdR cyoC gdhA galK
ssuCccmF sucA guaA rstAm glB csgG csiR lacA kbaZ
glnQ ftsA
fumC ssuE torD hycG torR narH csgF osmB relAgalM stpA hofO lipA trpD
yecR glnP caiD galR
gadC
hupA rrsH rrsG aroM
ccmB marB dps fimC csgE fecC uxuB entE galE dadA
trmD
glnH rpsS pspB rutE
z w f glnLsufS dcuD fixC
nuoB
gltD
marR
pspD gadB srlR serC lacI
rtcR nrdD rrlA gabP ompR fecD srlB ftsZ
nrdG sufC sufA oppB entA yaiA
napB ccmH cyoB ndh sucC gntP pdhR
ulaF ansB fecB tnaC mlacY
tr agaV tyrA
wzyE ompW fdnI cysJnikD mglA cydA
iscA napC hypD sufB hyfB aroP
napG hypE glnA metY nagA yjbHfecE rpoS melA pgm
amiA guaB gutQ
iscU tar rtcA nrfA ssuA acs
pspF nrfG ccmD
gcd l pdppD
d hycF adiA glgA
ilvE torA aroA tyrR
rnlA bcsZ glcG ydhT sodA nuoA sufD feoB hupB tnaB aroF
rpsP feaB rtcB ibpB napA glpA osmE sra
rffA
rybB tehA napF nrfF glpB oppD caiA fimH cadB nrdB gutM uxaB sohB rbsK dinF
paaJ entF bglB ppiA chbG lpxC
rplD cheB dmsD glcF galS
bcsB ccmC nuoI feaR hypA nuoF sodB rrsC
sdhD malQ deoA fepA uxaC
ftsQ
napD paaI dmsC nirD ynfK bolA tyrB symE
dkgB infC yfgF proL glcAinfB truB feoC malS nagDglpG hokE
wzzE ynfE hypF cyoD malP fimB
osmY folA srlE galP dinB
hyfG fumB rbfA yiaN sdhA uxuR uidR malZ murG
pheS hcr hycI nrfB nrfD tdcC nusA
thrS cheZ glpC nsrR cysI narP nrfE dppF
caiB ihfA treB sgbH tdcF fimG tyrP bglF tnaA uidC
rffH atoD ccmE napH ihfB feoA nuoM gadA fixB cadC chbA uidA ftsI ydjM
ytfE dmsB cyoAcyoE
sucD hyfC galT rpoH csiE
ygbA hybB fumA cspA cyaA
cheR nrfC melR rbsR yafO sulA ftsK
tehByjgI hybA pspA nuoK focB fnr sufE
glmY fadL hyfR aceE
rfe hycB caiF tdcE sgbE srlA ptsI lacZ chbF uvrB ybfE
ydhX
ydhU
ynfG
rplP hycC dppB acnA modE fhlA tdcD fecA malM creD
moeBmoaA ydhViscR
cydC hyaC mdh frdB nuoG uxaA ulaA hlyE hyfF umuD
dmsA caiEptsG chpA
rpmI rffD rffC scpA modA cysG modBoppC tdcA yiaK gltA exuR araF mhpA
uidB ddlB
rffM norR glcB atoE bglG rbsD malY yafQ
ydiU ydhW narL norW paaC fucO araD recA
uraA dcuR adhEpaaX araG mhpF creB
xdhC hemF prfA treC aldB nanC creA mraY
narK nikR arcA glcCh p t yjcH paaH melB
cheY narI sdhC fis paaA paaF aceB tdcRulaC crp nanR lsrC lsrG nagC cstA
sbmC uvrC
nikA narX frdD
yeaR hycD hyfD glpT cytR malE araE ivbL
ydeJ norV ulaE uvrD
yeaE grxD atoB
appY
glcD
nuoN
fixA sucB nagB rbsC dinI lexA dnaG
iscS ynfF hipB chpR
acrE actP rbsA chbR uvrA phr
rffG erpA
moaE ynfH
ygjG hipA hemA hypC cydB cdd
yiaO glpR paaE aceK gyrAmanZu d p lsrD nanM
grpE
rbsB dacC rpsU murD
pheM narG frdC deoD
hyfI yeiL dgsA malT
hyaFdctA nirC
nirB icd dusB mhpE mhpD
rffE nikC uspB nikB aceF aldA
yiaJ yhcH malG dnaA mhpB mhpR
ydhO fdhF ulaG fucP malI
fdnG hybF
ackA hyaB yoaG hyfA paaD acnB bglJxylR
manX
proP
ruvA
murE
ydbD nuoE hyaD asr lamB malK creC dinG ftsW cho
moaB hcp nuoL dcuB ulaRyiaL nanE agp chbB polB
ydhY t p x nuoJ hycA paaG sfsA
wzxE pheT mhpT rplC rpmCtdcB
ybdN
trg
aceA yjjQ hofB
ssb
glmS insK dinD
atoA appC nuoH argP nagE dksA
tadA ulaD nanK xseA leuO
hyfH paaB umuC uvrY
tsgA nikE ubiC
glpD focA
ubiA gatC ulaB pstC nfuA yiaM ecpD nanT glpX xylA murF ftsL
moaC narJ tam crr fucR lsrBmhpC chbC glmU
dppA gatZ deoB fucI fucU recX
upp pflB hybC appAprmC glpQ deoC idnR
rplT betT lldP hyfE pstB
rpsJ dcuA caiC gatD lsrR speCilvB ybfN ampC ruvB
cueR rplB moeA rpsQ
phoU nuoC hyfJ rhaT nrdA fruR malF manY
fucA yafN
rpsC trmA rpoD
dcuShyaA hyaE
aspA araJ t r x A tisB yafP
copA thrT deoR xylF gntT hofN araC
serT yccBhycH rplW modCpstA yebG
tapmoaDatoC appB
htrE exuTpaaK tsx lsrA
prpE
prpD malX araH
hybErimM rplV fadH araB dnaN dinQ dinJ
pitA puuA leuW xylH recF
fixXfadD envZ mtlA nanA hofM ubiG polA
hybD caiT gatY gapA ptsH lsrK ppdD recN
argK xdhA dcuC rpsI flxA murC
cueO
puuR
ychO rplS uspA
fadB aer gatA xylG murQ rhaS araA
prpR idnK
hybG fadI
ysgA queA pstSrplM idnD yhfA
hybO dsdX iclR tdcG gntK idnO
fadA mtlD lsrF
fucK
pepT glpF glpK gntX
mpl ychH ascF
xdhB betA dsdC ydeA cspI msrA gatB fbaA spf
scpC betI thrU ugpC
lldR ogt argU frdA
fadJ apaH pgk gntR idnT
apaG glgS prpB nupG hofP
ompX mazG cpdB ugpE rhaB ompA
argW mtlRzraSprpC gntU
rhaD
proM fadR yibD topA glpE dapB
murP leuL
fadE lldD cyaR ascB
hisR zraR psiE xylB
gltX thrW gyrB ugpQ ugpA
fruK uhpT
betB proK sgbU epd
puuD glyU
hofC ugpB leuC
leuP dsdA envC rhaA dsrA
pdxA eda ascG
dpiA ilvN
slyA rhaR
leuXargX
scpB yahA
fruA leuA yibQ
edd
tyrU glk leuB
fruB fbaB
pfkApykF
gpmM
leuD
tpiA eno pck phoB
glyT
uhpA
setA
citC sgrS
htpG
citG
citX kdgR
dpiB
phnE
ppc
citF citD fabA zraP
phoE
phnH
citE fabB phnO
phnJ phnM
phoA
pitB
phoH phnN
phoR
psiF phnI
phnD
phnG amn
phnL
fabR sgrR
phnF
phnC
phnK phnP
thiQ
thiP
tbpA
Figure: E. coli regulatory network
8DREAM network inference challenge
Network inference challenge:
DREAM5 results:
Method Network 1 Network 3 Network 4 Overall
AUPR AUROC AUPR AUROC AUPR AUROC
GENIE31 0.291 0.815 0.093 0.617 0.021 0.518 40.28
ANOVerence2 0.245 0.780 0.119 0.671 0.022 0.519 34.02
Naive TIGRESS 0.301 0.782 0.069 0.595 0.020 0.517 31.1
1
Huynh-Thu et al., 2010
2
Kueffner et al., 2012
9Purposes
Introduce TIGRESS: Trustful Inference of Gene REgulation using
Stability Selection.
Assess the impact of the parameters.
Test and benchmark TIGRESS on several datasets.
10Outline
1 Methods
Regression-based inference
TIGRESS
Material
2 Results
In silico network results
In vitro networks results
Undirected case: DREAM4
3 Conclusion
11Regression-based inference: hypotheses
Notations
ntf transcription factors (TF), ntg target genes (TG), nexp experiments
Expression data: X (nexp × (ntf + ntg )).
Xg : expression levels of gene g.
XG : expression levels of genes in G.
Tg : candidate TFs for gene g.
Hypotheses
1 The expression level Xg of a TG g is a function of the expression
levels XTg of Tg :
Xg = fg (XTg ) + ε.
2 A score sg (t) can be derived from fg , for all t ∈ Tg to assess the
probability of the interaction (t, g).
12Regression-based inference: main steps
Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g
1 For each TG, score all ntf candidate interactions:
TG 1 TG 2 TG 3 ... TG ntg
TF 1
TF 2
...
TF ntf
2 Rank the scores altogether:
TF 12 → TG 17 1
TF 23 → TG 5 0.99
TF 2 → TG 1 0.97
... ... ... ...
3 Threshold to a value or a given number N of edges.
13Regression-based inference: main steps
Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g
1 For each TG, score all ntf candidate interactions:
TG 1 TG 2 TG 3 ... TG ntg
TF 1
TF 2
...
TF ntf
2 Rank the scores altogether:
TF 12 → TG 17 1
TF 23 → TG 5 0.99
TF 2 → TG 1 0.97
... ... ... ...
3 Threshold to a value or a given number N of edges.
13Regression-based inference: main steps
Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g
1 For each TG, score all ntf candidate interactions:
TG 1 TG 2 TG 3 ... TG ntg
TF 1 -
TF 2 0.97
... ...
TF ntf 0
2 Rank the scores altogether:
TF 12 → TG 17 1
TF 23 → TG 5 0.99
TF 2 → TG 1 0.97
... ... ... ...
3 Threshold to a value or a given number N of edges.
13Regression-based inference: main steps
Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g
1 For each TG, score all ntf candidate interactions:
TG 1 TG 2 TG 3 ... TG ntg
TF 1 - 0.23
TF 2 0.97 -
... ... ...
TF ntf 0 0
2 Rank the scores altogether:
TF 12 → TG 17 1
TF 23 → TG 5 0.99
TF 2 → TG 1 0.97
... ... ... ...
3 Threshold to a value or a given number N of edges.
13Regression-based inference: main steps
Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g
1 For each TG, score all ntf candidate interactions:
TG 1 TG 2 TG 3 ... TG ntg
TF 1 - 0.23 0
TF 2 0.97 - 0.03
... ... ... ...
TF ntf 0 0 0
2 Rank the scores altogether:
TF 12 → TG 17 1
TF 23 → TG 5 0.99
TF 2 → TG 1 0.97
... ... ... ...
3 Threshold to a value or a given number N of edges.
13Regression-based inference: main steps
Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g
1 For each TG, score all ntf candidate interactions:
TG 1 TG 2 TG 3 ... TG ntg
TF 1 - 0.23 0 ... 0.11
TF 2 0.97 - 0.03 ... 0
... ... ... ... ... ...
TFntf 0 0 0 ... 0.76
2 Rank the scores altogether:
TF 12 → TG 17 1
TF 23 → TG 5 0.99
TF 2 → TG 1 0.97
... ... ... ...
3 Threshold to a value or a given number N of edges.
13Regression-based inference: main steps
Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g
1 For each TG, score all ntf candidate interactions:
TG 1 TG 2 TG 3 ... TG ntg
TF 1 - 0.23 0 ... 0.11
TF 2 0.97 - 0.03 ... 0
... ... ... ... ... ...
TF ntf 0 0 0 ... 0.76
2 Rank the scores altogether:
TF 12 → TG 17 1
TF 23 → TG 5 0.99
TF 2 → TG 1 0.97
... ... ... ...
3 Threshold to a value or a given number N of edges.
13Regression-based inference: main steps
Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g
1 For each TG, score all ntf candidate interactions:
TG 1 TG 2 TG 3 ... TG ntg
TF 1 - 0.23 0 ... 0.11
TF 2 0.97 - 0.03 ... 0
... ... ... ... ... ...
TF ntf 0 0 0 ... 0.76
2 Rank the scores altogether:
TF 12 → TG 17 1
TF 23 → TG 5 0.99
TF 2 → TG 1 0.97
... ... ... ...
3 Threshold to a value or a given number N of edges.
13Regression-based inference: main steps
Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g
1 For each TG, score all ntf candidate interactions:
TG 1 TG 2 TG 3 ... TG ntg
TF 1 - 0.23 0 ... 0.11
TF 2 0.97 - 0.03 ... 0
... ... ... ... ... ...
TF ntf 0 0 0 ... 0.76
2 Rank the scores altogether:
TF 12 → TG 17 1
TF 23 → TG 5 0.99
TF 2 → TG 1 0.97
... ... ... ...
3 Threshold to a value or a given number N of edges.
13Adding a linearity assumption
TIGRESS’ first hypothesis: regulations are linear
Xg = fg (XTg ) + ε = XTg ω g + ε
g
Consequence: if ωt = 0, no edge between g and t.
14Adding a sparsity assumption
TIGRESS’ second hypothesis: few TFs regulate each TG
g
∀g, ]{ωt 6= 0} L TFs selected.
15Stability Selection
Problem: LARS efficiency is limited:
bad response to correlation;
no confidence score for each TF.
Solution: Stability Selection with randomized LARS (Bach, 2008 ;
Meinshausen and Bühlmann, 2009):
Resample the experiments: run LARS many (e.g. 1, 000) times
with different training sets.
“Resample” the variables: also weight the variables
Xit ← Wt Xit (1)
where Wj ; U([α, 1]) for all t = 1...ntf . The smaller α, the more
randomized the variables.
Get a frequency of selection for each TF.
16Stability Selection
Problem: LARS efficiency is limited:
bad response to correlation;
no confidence score for each TF.
Solution: Stability Selection with randomized LARS (Bach, 2008 ;
Meinshausen and Bühlmann, 2009):
Resample the experiments: run LARS many (e.g. 1, 000) times
with different training sets.
“Resample” the variables: also weight the variables
Xit ← Wt Xit (1)
where Wj ; U([α, 1]) for all t = 1...ntf . The smaller α, the more
randomized the variables.
Get a frequency of selection for each TF.
16Stability Selection path
1
0.9
Frequency of selection of each TF
0.8
0.7
over the subsamplings
0.6
0.5
0.4
0.3
0.2
0.1
0
0 2 4 6 8 10 12 14 16 18 20
Number L of LARS steps
(example for one target gene)
17Scoring
1
0.9
Frequency of selection of each TF
0.8
0.7
over the subsamplings
0.6
0.5
0.4
0.3
0.2
0.1
0
0 2 4 6 8 10 12 14 16 18 20
Number L of LARS steps
18Scoring
Choose L, then:
Original scoring
Area scoring (contribution)
18Scoring
Let Ht be the rank of TF t. Then,
score = E[φ(Ht )]
with
Original: φ(h) = 1 if h ≤ L , 0 otherwise
Area: φ(h) = L + 1 − h if h ≤ L , 0 otherwise
=> Area scoring takes the value of the rank into account.
18TIGRESS Summary
Idea: consider as many problems as TGs (ntg subproblems)
subproblem g ⇔ find regulators TFs(g) of gene g
1 For each TG, score all ntf candidate interactions:
TG 1 TG 2 TG 3 ... TG ntg
LARS TF 1 - 0.23 0 ... 0.11
+ Stab. Selection TF 2 0.97 - 0.03 ... 0
=>
+ Choose L ... ... ... ... ... ...
+ Score TF ntf 0 0 0 ... 0.76
2 Rank the scores altogether:
TF 12 → TG 17 1
TF 23 → TG 5 0.99
TF 2 → TG 1 0.97
... ... ... ...
3 Threshold to a value or a given number N of edges.
19Parameters
TIGRESS needs four parameters to be set:
scoring method (original, area, ...);
number of runs R: large;
randomization level α: between 0 and 1;
number of LARS steps L: not obvious.
20Data
Network ] TF ] Genes ] Chips ] Edges
DREAM5 Net 1 (in-silico) 195 1643 805 4012
DREAM5 Net 3 (E. coli) 334 4511 805 2066
DREAM5 Net 4 (S. cerevisiae) 333 5950 536 3940
E. coli Net from Faith et al., 2007 180 1525 907 3812
DREAM4 Multifactorial Net 1 100 100 100 176
DREAM4 Multifactorial Net 2 100 100 100 249
DREAM4 Multifactorial Net 3 100 100 100 195
DREAM4 Multifactorial Net 4 100 100 100 211
DREAM4 Multifactorial Net 5 100 100 100 193
21Outline
1 Methods
Regression-based inference
TIGRESS
Material
2 Results
In silico network results
In vitro networks results
Undirected case: DREAM4
3 Conclusion
22Impact of the parameters: results on in silico network
Area (1,000 runs) Original (1,000 runs)
20 20
15 15
L
10 10
5 5
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
20
Area (4,000 runs)
20
Original (4,000 runs) Area less sensitive than
original to α and L.
15 15
Area systematically
L
10 10
5 5
outperforms original.
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
The more runs, the better
Area (10,000 runs) Original (10,000 runs)
20 20
Best values:
15 15
α = 0.4, L = 2, R = 10, 000.
L
10 10
5 5
0.1 0.3 0.5 0.7 0.9 0.1 0.3 0.5 0.7 0.9
α α
40 60 80 100
Overall Score
231000
200
How
500
to choose L? 100
0 0
0 0 2 4 6 8 0 5 10 15
0 2 4 6 8
First 50,000
First 100,000 edges
edges First 100,000 edges
800 200 150
600 150
100
400 100
50
200 50
0 0 0
0 10 1020 3020 40 30 50 0 50 100 150
L=2: number of TFs/TG smaller L=20: greater number of TFs/TG,
and more variable. less sparsity.
=> L should depend on the expected network’s topology
24TIGRESS vs state-of-the-art
1 1
0.8 0.8
Precision
0.6 GENIE3 0.6
TPR
ANOVerence
0.4 CLR 0.4
ARACNE
0.2 Naive TIGRESS 0.2
TIGRESS
0 0
0 0.2 0.4 0.6 0.8 1 0 0.05 0.1 0.15
FPR Recall
TIGRESS is competitive with state-of-the-art.
25Results on E. coli network
1 1
0.8 0.8
Precision
0.6 0.6
TPR
GENIE3
0.4 0.4
CLR
0.2 ARACNE
0.2
TIGRESS
0 0
0 0.5 1 0 0.05 0.1 0.15
FPR Recall
Random Forests-based and DREAM winner GENIE3 overperforms all
methods.
26False discovery analysis on E. coli
Main false positive pattern found by TIGRESS:
Should find Does find
Good news: spuriously inferred edges close to true edges
Bad news: confusion (due to linear model?)
27Undirected case: DREAM4 challenge
DREAM4: undirected networks (TFs not known in advance)
A posteriori comparison of Default TIGRESS and GENIE3:
Method Network 1 Network 2 Network 3 Network 4 Network 5
AUPR AUROC AUPR AUROC AUPR AUROC AUPR AUROC AUPR AUROC
GENIE3 0.154 0.745 0.155 0.733 0.231 0.775 0.208 0.791 0.197 0.798
TIGRESS 0.165 0.769 0.161 0.717 0.233 0.781 0.228 0.791 0.234 0.764
Overall scores:
GENIE3: 37.48
TIGRESS: 38.85
28Outline
1 Methods
Regression-based inference
TIGRESS
Material
2 Results
In silico network results
In vitro networks results
Undirected case: DREAM4
3 Conclusion
29Conclusion
Contributions:
Automatization and adaptation of Stability Selection to GRN
inference.
Area scoring setting: better results and less elasticity to parameters.
Insights on network’s behavior
TIGRESS is:
Linear
Competitive
Parallelizable
Available (http://cbio.ensmp.fr/tigress)
But outperformed in some cases by random forests: limits of
linearity?
Perspectives:
Adaptive/changeable value for L.
Group selection of TFs.
Use of time series/knock-out/replicates information.
30Molecular signatures for breast
cancer prognosis
ACH, Gestraud, P. and Vert, J.-P., The Influence of Feature Selection
Methods on Accuracy, Stability and Interpretability of Molecular
Signatures, 2011, PLoS ONE
ACH, Jacob, L. and Vert, J.-P., Improving Stability and Interpretability of
Molecular Signatures, 2010, arXiv 1001.3109
31Motivation
Prediction of breast cancer outcome
Assist breast cancer prognosis based on gene expression.
Avoid adjuvant/preventive chemotherapy when not needed.
Gene expression signature
Data: primary site tumor expression arrays.
Among the genome, find the few (50-100) genes sufficient to
predict metastasis/relapse.
Main challenge: high-dimensional data (few samples, many
variables).
32Background
2002: Van’t Veer et al. publish 70-gene signature MammaPrint.
Since then: at least 47 published signatures (Venet et al., 2011).
Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
Many gene sets are equally predictive (Michiels et al., 2005;
Ein-Dor et al., 2005), even within the same dataset.
Prediction discordances (Reyal et al., 2008)
Stability at the functional level? Not sure.
Seeking stability
Accuracy is not enough to choose the right genes.
=> Stability as a confidence indicator.
33Background
2002: Van’t Veer et al. publish 70-gene signature MammaPrint.
Since then: at least 47 published signatures (Venet et al., 2011).
Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
Many gene sets are equally predictive (Michiels et al., 2005;
Ein-Dor et al., 2005), even within the same dataset.
Prediction discordances (Reyal et al., 2008)
Stability at the functional level? Not sure.
Seeking stability
Accuracy is not enough to choose the right genes.
=> Stability as a confidence indicator.
33Background
2002: Van’t Veer et al. publish 70-gene signature MammaPrint.
Since then: at least 47 published signatures (Venet et al., 2011).
Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
Many gene sets are equally predictive (Michiels et al., 2005;
Ein-Dor et al., 2005), even within the same dataset.
Prediction discordances (Reyal et al., 2008)
Stability at the functional level? Not sure.
Seeking stability
Accuracy is not enough to choose the right genes.
=> Stability as a confidence indicator.
33Background
2002: Van’t Veer et al. publish 70-gene signature MammaPrint.
Since then: at least 47 published signatures (Venet et al., 2011).
Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
Many gene sets are equally predictive (Michiels et al., 2005;
Ein-Dor et al., 2005), even within the same dataset.
Prediction discordances (Reyal et al., 2008)
Stability at the functional level? Not sure.
Seeking stability
Accuracy is not enough to choose the right genes.
=> Stability as a confidence indicator.
33Background
2002: Van’t Veer et al. publish 70-gene signature MammaPrint.
Since then: at least 47 published signatures (Venet et al., 2011).
Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
Many gene sets are equally predictive (Michiels et al., 2005;
Ein-Dor et al., 2005), even within the same dataset.
Prediction discordances (Reyal et al., 2008)
Stability at the functional level? Not sure.
Seeking stability
Accuracy is not enough to choose the right genes.
=> Stability as a confidence indicator.
33Background
2002: Van’t Veer et al. publish 70-gene signature MammaPrint.
Since then: at least 47 published signatures (Venet et al., 2011).
Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
Many gene sets are equally predictive (Michiels et al., 2005;
Ein-Dor et al., 2005), even within the same dataset.
Prediction discordances (Reyal et al., 2008)
Stability at the functional level? Not sure.
Seeking stability
Accuracy is not enough to choose the right genes.
=> Stability as a confidence indicator.
33Background
2002: Van’t Veer et al. publish 70-gene signature MammaPrint.
Since then: at least 47 published signatures (Venet et al., 2011).
Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007).
Many gene sets are equally predictive (Michiels et al., 2005;
Ein-Dor et al., 2005), even within the same dataset.
Prediction discordances (Reyal et al., 2008)
Stability at the functional level? Not sure.
Seeking stability
Accuracy is not enough to choose the right genes.
=> Stability as a confidence indicator.
33Contributions
1 Systematic comparison of feature selection methods in terms of:
predictive performance;
stability;
biological stability and interpretability.
2 Group selection of genes:
with predefined groups (Graph Lasso);
with latent groups (k-overlap norm).
3 Evaluation of Ensemble methods
34Evaluation
Accuracy: how well selected genes + classifier predict metastatic
events on test data.
Stability: how similar two lists of genes are.
Interpretability: how much biological sense selected genes make.
35Data
Four public breast cancer datasets from the same technology (Affymetrix
U133A):
GEO Reference ] genes ] samples ] positives
GSE1456 12,065 159 40
GSE2034 12,065 286 107
GSE2990 12,065 125 49
GSE4922 12,065 249 89
36Outline
1 A simple start
2 An attempt at enforcing stability: Ensemble methods
3 Using prior knowledge: Graph Lasso
4 Acknowledging latent team work
5 Conclusion
37Classical feature selection/feature ranking methods
Type Characteristics Examples
Univariate, fast T-test, KL Divergence
Filters Only depend on the data Wilcoxon rank-sum test
Do not use loss function
Learning machine as a criterion SVM RFE, Greedy FS
Wrappers
Computationally expensive
Learning + selecting Lasso, Elastic Net
Embedded
Possible use of prior knowledge
Each of these algorithms returns a ranked list of genes to be
thresholded.
38First results
100 genes over four databases.
0.16
Random
Ttest
0.14
Entropy
0.12 Bhattacharryya
Wilcoxon
0.1 SVM RFE
GFS
Stability
0.08 Lasso
E−Net
0.06
0.04
0.02
0
0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65
Accuracy
39First conclusions
0.16
Random
0.14
Ttest
Entropy
Random better than tossing a
coin.
0.12 Bhattacharryya
Wilcoxon
0.1 SVM RFE
GFS
Stability
0.08 Lasso
E−Net
Elastic Net neither more stable
0.06
0.04
nor more accurate than Lasso.
0.02
Accuracy/Stability trade-off
0
0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65
Accuracy
T-test: both simplest and best.
Next step:
Can we have a better stability without decreasing accuracy?
40First conclusions
0.16
Random
0.14
Ttest
Entropy
Random better than tossing a
coin.
0.12 Bhattacharryya
Wilcoxon
0.1 SVM RFE
GFS
Stability
0.08 Lasso
E−Net
Elastic Net neither more stable
0.06
0.04
nor more accurate than Lasso.
0.02
Accuracy/Stability trade-off
0
0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65
Accuracy
T-test: both simplest and best.
Next step:
Can we have a better stability without decreasing accuracy?
40Outline
1 A simple start
2 An attempt at enforcing stability: Ensemble methods
3 Using prior knowledge: Graph Lasso
4 Acknowledging latent team work
5 Conclusion
41Ensemble Methods
Run each algorithm R times on subsamples.
Get R ranked lists of genes (r b )b=1...R .
Aggregate and get a score for each gene:
R
1X b
S(g) = f (rg ).
R
b=1
average : f (r ) = (p − r )/p
exponential : f (r ) = exp(−αr )
stability selection : f (r ) = δ(r ≤ k )
Sort S by decreasing order and threshold to get final signature
42Results
100 genes over four databases.
Random
0.2
Ttest
0.18 Entropy
Bhattacharryya
0.16
Wilcoxon
0.14 SVM RFE
GFS
Stability
0.12
Lasso
0.1 E−Net
Single−run
0.08
Stab. sel
0.06
0.04
0.02
0
0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65
Accuracy
43Functional stability
100 genes over four databases.
Random
0.3 Ttest
Entropy
0.25 Bhattacharryya
Wilcoxon
Functional stability
0.2 SVM RFE
GFS
Lasso
0.15
E−Net
Single−run
0.1
Stab. sel
0.05
0
0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65
Accuracy
44Ensemble methods: conclusions
Random Random
0.2
Ttest 0.3 Ttest
0.18 Entropy Entropy
Bhattacharryya Bhattacharryya
0.25
0.16
Wilcoxon
Wilcoxon
Functional stability
0.14 SVM RFE
0.2 SVM RFE
GFS
GFS
Stability
0.12
Lasso
Lasso
0.15
0.1 E−Net
E−Net
Single−run
0.08 Single−run
Stab. sel 0.1
Stab. sel
0.06
0.05
0.04
0.02 0
0 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65
0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 Accuracy
Accuracy
Expected improvement in stability not happening.
Slight improvement in accuracy in some cases.
Loss in functional stability.
T-test: still the preferred method.
Next step:
Can we do better by incorporating prior knowledge?
45Ensemble methods: conclusions
Random Random
0.2
Ttest 0.3 Ttest
0.18 Entropy Entropy
Bhattacharryya Bhattacharryya
0.25
0.16
Wilcoxon
Wilcoxon
Functional stability
0.14 SVM RFE
0.2 SVM RFE
GFS
GFS
Stability
0.12
Lasso
Lasso
0.15
0.1 E−Net
E−Net
Single−run
0.08 Single−run
Stab. sel 0.1
Stab. sel
0.06
0.05
0.04
0.02 0
0 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65
0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 Accuracy
Accuracy
Expected improvement in stability not happening.
Slight improvement in accuracy in some cases.
Loss in functional stability.
T-test: still the preferred method.
Next step:
Can we do better by incorporating prior knowledge?
45Outline
1 A simple start
2 An attempt at enforcing stability: Ensemble methods
3 Using prior knowledge: Graph Lasso
4 Acknowledging latent team work
5 Conclusion
46Data
Expression data: Van’t Veer et al.,
2002; Wang et al., 2005.
PPI network with 8141 genes (Chuang
et al., 2007)
Assumption: genes close on the graph behave similarly
Idea: instead of selecting single genes, select edges
47Selecting groups of genes: `1 methods
Lasso : selects single genes (Tibshirani, 1996)
Why groups?
selecting similar genes: improving stability and interpretability
smoothing out noise by "averaging": improving accuracy
Group Lasso (Yuan & Lin, 2006): implies group sparsity for groups
of covariates that form a partition of {1...p}
Overlapping group Lasso (Jacob et al., 2009): selects a union of
potentially overlapping groups of covariates (e.g. gene pathways).
Graph Lasso: uses groups induced by the graph (e.g. edges)
48Is group sparsity enough?
`1 methods work well, but face serious stability issues when
groups are correlated.
Solution: randomization through stability selection.
49Accuracy results
Test on data from Wang et al., 2005.
Signature learnt on Van’t Veer dataset
0.8
Signature learnt on Wang dataset
Balanced Accuracy
0.6
0.4
0.2
0
Lasso Lasso + stab. sel. Graph Lasso Graph Lasso + stab. sel.
Neither prior knowledge nor stability selection bring any
improvement!
50Stability results
7
Number of genes in the overlap
6 Lasso
Graph Lasso with stability selection
5 Lasso with stability selection
Graph Lasso
4
3
2
1
0
0 20 40 60 80 100 120
Number of genes in the signatures
Graph Lasso slightly improves stability.
51Interpretability
Signature obtained using Lasso:
52Interpretability
Signature obtained using Graph Lasso + Stability Selection:
52Graph Lasso conclusion
Graphical prior seems to increase stability and interpretability.
However: no change in accuracy.
Next step:
Grouping increases stability. Now on to accuracy!
53Outline
1 A simple start
2 An attempt at enforcing stability: Ensemble methods
3 Using prior knowledge: Graph Lasso
4 Acknowledging latent team work
5 Conclusion
54Latent grouping
Grouping genes makes sense.
Let the data tell which genes to select together.
The k-support norm
Introduced by Argyriou et al., 2012
A trade-off between `1 and `2 .
Equivalent to overlapping group Lasso (Jacob et al., 2009) with all
possible groups of size k .
Results in selecting groups that are not predefined.
55Extreme randomization
Following Breiman’s random forests
Sample both the examples and the covariates.
Less variables = less correlation.
Give each gene a chance to be selected.
Extreme randomization
For each of the R runs:
Bootstrap samples (classical Ensemble method)
Sample the covariates: randomly choose 10% of them.
Run FS procedure on the restricted data.
=> Compute frequency of selection: P(g selected |g preselected)
56Accuracy or stability?
0.05
T−test
Random
0.045
Ttest
0.04
Lasso
0.035 ENet
kSupport (k=2)
0.03
kSupport (k=10)
Stability
0.025 kSupport (k=20)
Single−run
0.02
Extreme Rand. + SS
0.015
0.01
0.005
0
0.62 0.625 0.63 0.635 0.64 0.645 0.65 0.655
Accuracy
57Stability or redundancy?
0.05
Random
0.045
Ttest
Lasso
0.04
ENet
0.035 kSupport (k=2)
kSupport (k=10)
Stability
0.03
kSupport (k=20)
0.025 Single−run
Extreme Rand. + SS
0.02
0.015
0.01
0.005
0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08
Correlation
58Conclusions
0.05 0.05
T−test Random
Random
0.045 0.045
Ttest
Ttest
0.04 Lasso
Lasso 0.04
ENet
0.035 ENet
0.035 kSupport (k=2)
kSupport (k=2)
0.03
kSupport (k=10)
kSupport (k=10)
Stability
0.03
Stability
0.025 kSupport (k=20)
kSupport (k=20)
0.025 Single−run
Single−run
0.02
Extreme Rand. + SS
Extreme Rand. + SS 0.02
0.015
0.015
0.01
0.005 0.01
0 0.005
0.62 0.625 0.63 0.635 0.64 0.645 0.65 0.655 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08
Accuracy Correlation
Extreme Randomization improves accuracy.
Grouping improves stability.
But: both effects do not add up that well (redundancy)
T-test still the best trade-off?
59Outline
1 A simple start
2 An attempt at enforcing stability: Ensemble methods
3 Using prior knowledge: Graph Lasso
4 Acknowledging latent team work
5 Conclusion
60Signature selection: conclusion
Contributions:
Step-by-step study of FS methods behavior on several breast
cancer datasets.
Systematic analysis of accuracy, stability, interpretability.
Insights on the accuracy/stability trade-off.
What have we learned?
Best methods: simple t-test or complex black box
Grouping improves gene and functional stability.
Randomization improves accuracy (sometimes) but has unwanted
effects on stability.
Accuracy/Stability Trade-off: Stability => redundancy => lower
accuracy.
61Signature selection: perspectives
One unique signature?
single breast cancer subtype
many samples
larger signature
Is expression data sufficient?
probably not all information is there
clinical data: same accuracy (same information?)
possibly look at genotype, methylation, clinical and expression
Is stability important?
not as important as accuracy + prediction concordance
possibly not even achievable
62Conclusion
63Conclusion
Gene expression data:
High-dimensional, noisy
Possibly contains important information
Feature selection:
Find the needle in the haystack.
Output relevant genes to be studied further.
Main issues:
Results are not necessarily transferable across datasets.
Models rely on hypotheses!
Fixing:
Testing on many databases.
Keeping model hypotheses in mind / not being afraid of black boxes.
64Conclusion
Gene expression data:
High-dimensional, noisy
Possibly contains important information
Feature selection:
Find the needle in the haystack.
Output relevant genes to be studied further.
Main issues:
Results are not necessarily transferable across datasets.
Models rely on hypotheses!
Fixing:
Testing on many databases.
Keeping model hypotheses in mind / not being afraid of black boxes.
64Conclusion
Gene expression data:
High-dimensional, noisy
Possibly contains important information
Feature selection:
Find the needle in the haystack.
Output relevant genes to be studied further.
Main issues:
Results are not necessarily transferable across datasets.
Models rely on hypotheses!
Fixing:
Testing on many databases.
Keeping model hypotheses in mind / not being afraid of black boxes.
64Conclusion
Gene expression data:
High-dimensional, noisy
Possibly contains important information
Feature selection:
Find the needle in the haystack.
Output relevant genes to be studied further.
Main issues:
Results are not necessarily transferable across datasets.
Models rely on hypotheses!
Fixing:
Testing on many databases.
Keeping model hypotheses in mind / not being afraid of black boxes.
64Acknowledgements
Fantine Paola Pierre Laurent
Mordelet Vera-Licona Gestraud Jacob
65The k-support norm
It can be shown that:
k −r −1 d
!2 12
1
Ωsp (|ω|↓i )2 + |ω|↓i
X X
k (ω) =
r +1
i=1 i=k −r
where r is the only integer in {0, . . . , k − 1} statisfying
d
1 X
|ω|↓k −r −1 > |ω|↓i ≥ |ω|↓k −r .
r +1
i=k −r
and |ω|↓i is the i-th largest value of |ω| (|ω|↓0 = +∞).
66Link with overlapping Group Lasso
The k-support norm is equivalent to the overlapping group Lasso
norm
X
Ωsp
X
k (ω) = min ||v ||
I 2 : supp(vI ) ⊆ I, vI = ω
v ∈Rp×Gk
I∈Gk I∈Gk
where Gk denotes all subsets of {1, . . . , d} of cardinality k .
Remark 1: it selects at least k variables.
Remark 2: the first selected group consists of the k variables most
correlated with the response
67ADMM - applied to k-support problem
Our problem
minω,β R̂l (ω) + λ2 Ωsp 2
k (β)
s.t. ω − β = 0
Augmented Lagrangian
Lρ (ω, β, µ) = R̂l (ω) + λ2 Ωsp 2 0 ρ
k (β) + µ (ω − β) + 2 ||ω − β||
2
Algorithm
1 Initialize: β (1) , ω (1) , µ(1)
2 for t = 1, 2, ..., do
n o
w (t+1) = arg minw R̂l (w) + µ(t)T w + ρ2 ||w − β (t) ||2
(t)
β (t+1) = prox λ Ωsp (.)2 w (t+1) + µρ
2ρ k
µ(t+1) = µ(t) + ρ(w (t+1) − β (t+1) )
68ADMM - optimality conditions
Three first order conditions:
Primal condition: ω ∗ − β ∗ = 0
Dual condition 1: ∇R̂l (ω ∗ ) + µ∗ = 0
Dual condition 2: 0 ∈ λ2 ∂Ωsp ∗ 2
k (β ) + µ
∗
Resulting in a definition for the residuals at step t + 1:
Primal residuals: r (t+1) = ω (t+1) − β (t+1)
Dual residuals: s(t+1) = ρ(β (t+1) − β (t) )
As the algorithm converges, the (norm of the) residuals tend to zero.
69ADMM - choice of parameter
Parameter ρ is critical in ADMM: it controls how much variables change.
It can be seen as a step size.
How to choose it?
Adaptive ADMM
One solution
is to let(t)it adapt to the problem:
(1 + τ )ρ if ||r (t+1) ||2 > η||s(t+1) ||2 and t ≤ tmax
ρ(t+1) = ρ(t) /(1 + τ ) if η||r (t+1) ||2 < ||s(t+1) ||2 and t ≤ tmax
(t)
ρ otherwise
In practice, we use τ = 1, η = 10 and tmax = 100.
=> Adaptive ADMM forces the primal and dual residuals to be of a
similar amplitude.
70Comparison
k=1 k=5
5 5
0 0
RDG
RDG
−5 −5
−10 −10
−15 −15
0 10 20 0 10 20
Time (Seconds) Time (Seconds)
ADMM − adaptive
k = 10 ADMM − rho=1 k = 100
5 ADMM − rho=10 5
ADMM − rho=100
0 FISTA
0
RDG
RDG
−5 −5
−10 −10
−15 −15
0 10 20 0 20 40 60
Time (Seconds) Time (Seconds)
71Accuracy vs size of the signature
Single−run Ensemble−Mean
0.65 0.65
0.6
AUC
0.6
AUC
0.55 0.55
0.5 0.5
0.45 0.45
0 20 40 60 80 100 0 20 40 60 80 100
Ensemble−Exponential Ensemble−Stability Selection
0.65 0.65
0.6
AUC
0.6
AUC
0.55 0.55
0.5 0.5
0.45 0.45
0 20 40 60 80 100 0 20 40 60 80 100
Random T−test Entropy Bhatt. Wilcoxon SVM RFE GFS Lasso E−Net
72Stability vs size of the signature
0.8
Single-run 0.8
Ensemble-average
Stability
0.6
Stability
0.6
0.4 0.4
0.2 0.2
0 0
0 20 40 60 80 100 0 20 40 60 80 100
0.8
Ensemble-exponential Ensemble-stability selection
0.8
Stability
Stability
0.6 0.6
0.4 0.4
0.2 0.2
0 0
0 20 40 60 80 100 0 20 40 60 80 100
Random T test Entropy Bhatt. Wilcoxon RFE GFS Lasso E−Net
73You can also read