Data science: a review of the state of the art
Abstract
With the explosive growth of structured and unstructured data, organizations are looking for ways to innovate through analytics and data science; the availability of Big Data allows organizations in every industry to take advantage of data analysis. The objective of this article is therefore to review the state of the art in data science. An initial study was carried out to determine the most representative topics and terms in the field, and the analytic-synthetic and historical-logical research methods were used to examine the fundamental and characteristic elements of data science and data scientists, and to determine the different processes, solutions, and tools and how they have evolved over time. The main conclusions are the following: the wide application of data science means that there are many different solutions, closely tied to the application area and the characteristics of the problem; driven by Big Data, machine learning is used to solve the problems in most cases; and the most widely used techniques are linear regression, k-Nearest Neighbors (k-NN), k-means, logistic regression, Bayesian networks, support vector machines, and neural networks.
Keywords: Data science; Data scientist; Machine learning
License
Articles in UCE Ciencia are published under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0). This license requires reusers to give credit to the creator. It allows reusers to copy and distribute the material in any medium or format, and to remix, adapt, and build upon the material, for noncommercial purposes only.