Data science: a review of the state of the art

Naivy Pujol Méndez, Joelsy Porven Rubier


With the explosive growth of structured and unstructured data, organizations are looking for ways to innovate through analytics and data science; the availability of Big Data allows organizations in every industry to take advantage of data analysis. The aim of this article is therefore to review the state of the art in data science. An initial study was carried out to determine the most representative topics and terms in the field, and the analytic-synthetic and historical-logical research methods were used to examine the fundamental and characteristic elements of data science and data scientists, and to determine the different processes, solutions, and tools and how they have evolved over time. The main conclusions are the following: the wide application of data science has led to many different solutions, closely tied to the application area and the characteristics of the problem; driven by Big Data, machine learning is used to solve the problems in most cases; and the most widely used techniques are linear regression, k-Nearest Neighbors (k-NN), k-means, logistic regression, Bayesian networks, support vector machines, and neural networks.

Keywords: Data science; Data scientist; Machine learning






UCE Ciencia. Revista de postgrado - ISSN 2306-3556 - Vicerrectoría de Estudios de Graduados - Universidad Central del Este (UCE) - Address: Av. Francisco Alberto Caamaño Deñó, San Pedro de Macorís, Dominican Republic - Phone: +1 809-529-3562 Fax: +1 809-529-5146 - Website developed as an adaptation of Open Journal Systems - This journal is published under a license that allows its contents to be used and derivative works to be created, provided that such uses are non-commercial and the authors' rights are acknowledged.