J. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic et al., Unsupervised learning from narrated instruction videos, CVPR, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01171193

J. Alayrac, J. Sivic, I. Laptev, and S. Lacoste-Julien, Joint discovery of object states and manipulation actions, ICCV, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01676084

F. Bach and Z. Harchaoui, DIFFRAC: A discriminative and flexible framework for clustering, NIPS, 2007.

P. Bojanowski and A. Joulin, Unsupervised learning by predicting noise, ICML, 2017.

P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce et al., Weakly supervised action labeling in videos under ordering constraints, ECCV, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01053967

P. Bojanowski, R. Lajugie, E. Grave, F. Bach, I. Laptev et al., Weakly-supervised alignment of video with text, ICCV, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01154523

M. Caron, P. Bojanowski, A. Joulin, and M. Douze, Deep clustering for unsupervised learning of visual features, ECCV, 2018.

J. Carreira and A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, CVPR, 2017.

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari et al., Scaling egocentric vision: The EPIC-KITCHENS dataset, ECCV, 2018.

D. Damen, T. Leelasawassuk, O. Haines, A. Calway, and W. Mayol-Cuevas, You-do, I-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video, BMVC, 2014.

K. Fang, T. Wu, D. Yang, S. Savarese, and J. J. Lim, Demo2Vec: Reasoning object affordances from online videos, CVPR, 2018.

A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, Describing objects by their attributes, CVPR, 2009.

V. Ferrari and A. Zisserman, Learning visual attributes, NIPS, 2007.

D. F. Fouhey, W. Kuo, A. A. Efros, and J. Malik, From lifestyle vlogs to everyday interactions, CVPR, 2018.

S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney et al., YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition, ICCV, 2013.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, 2016.

S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen et al., CNN architectures for large-scale audio classification, ICASSP, 2017.

D. Huang, L. Fei-Fei, and J. C. Niebles, Connectionist temporal modeling for weakly supervised action labeling, ECCV, 2016.

D. Huang, J. J. Lim, L. Fei-Fei, and J. C. Niebles, Unsupervised visual-linguistic reference resolution in instructional videos, CVPR, 2017.

D. Huang, V. Ramanathan, D. Mahajan, L. Torresani, M. Paluri et al., Finding "it": Weakly-supervised reference-aware visual grounding in instructional video, CVPR, 2018.

D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv, 2014.

H. Kuehne, A. Richard, and J. Gall, Weakly supervised learning of actions from transcripts, CVIU, 2017.

J. Liu, B. Kuipers, and S. Savarese, Recognizing human actions by attributes, CVPR, 2011.

J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich et al., What's cookin'? Interpreting cooking videos using text, speech and vision, NAACL, 2015.

I. Misra, A. Gupta, and M. Hebert, From Red Wine to Red Tomato: Composition with Context, CVPR, 2017.

A. Richard, H. Kuehne, and J. Gall, Weakly supervised action learning with RNN-based fine-to-coarse modeling, CVPR, 2017.

A. Richard, H. Kuehne, and J. Gall, Action sets: Weakly supervised action segmentation without ordering constraints, CVPR, 2018.

F. Sener and A. Yao, Unsupervised learning and segmentation of complex activities from video, CVPR, 2018.

O. Sener, A. Zamir, S. Savarese, and A. Saxena, Unsupervised semantic parsing of video collections, ICCV, 2015.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2014.

H. Wang and C. Schmid, Action recognition with improved trajectories, ICCV, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00873267

L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, Maximum margin clustering, NIPS, 2004.

B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas et al., Human action recognition by learning bases of action attributes and parts, ICCV, 2011.

M. Yatskar, V. Ordonez, L. Zettlemoyer, and A. Farhadi, Commonly uncommon: Semantic sparsity in situation recognition, CVPR, 2017.

L. Zhou, C. Xu, and J. J. Corso, Towards automatic learning of procedures from web instructional videos, AAAI, 2018.

[Figure: outputs of the classifier are shown in blue, correctly localized steps in green, and false detections in red.]

J. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev et al., Learning from narrated instruction videos, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01580630

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, NIPS, 2013.

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, Enriching word vectors with subword information, arXiv, 2017.

L. van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research, 2008.