Large, unwieldy, beautiful data sets and predicting the future

Posted by mouthyb | Posted on 6:21 PM


It used to be something of a truism in sociology that predicting human behavior is beyond the ability of theory and theorists. Early network- and complexity-oriented sociologists like Durkheim, Parsons and Merton saw their work roundly criticized for its emphasis on reactions to modernization, complexity and dynamism in observed social phenomena, and for its insistence that human behavior can be characterized by systems and roles instead of by motivations.

Computers and life on the internet are making it abundantly clear that these thinkers were on to something: there are discernible patterns in human behavior which are highly structured, often role-driven, and repeated across large swaths of the population. One need only look at data mining and the trends spotted in huge, unwieldy social data sets to see that, despite the overarching insistence that we are independent agents capable of free will, it is possible to say with some certainty what we are likely to do tomorrow. There are companies which specialize in this.

Previous limitations on predicting human behavior have included the insistence that we are unique creatures (a heritage of Romanticism and religion) who are not so deeply affected by our environment as to be dictated by it, a limit on how much data it is possible or desirable to collect, and a limit on how much of it can be interpreted.

For this reason, I've been shocked a few times to realize that the data sets used for some studies are incredibly small. Saturation (i.e., the point at which you stop getting unique entries) can be reached very early, which is itself evidence of environmental influences and role-driven decision processes, but it is used to justify small sets and to discard data because it is repetitive. Huge data sets are viewed as something to avoid whenever possible.
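To make "saturation" concrete, here is a minimal Python sketch of one way to locate it. The function name `saturation_point`, the `window` rule (stop after three consecutive repeats) and the sample responses are all my own assumptions for illustration, not a standard definition from any study.

```python
def saturation_point(entries, window=3):
    """Return how many entries were read (1-based) when the last `window`
    consecutive entries added nothing new, or None if that never happens.

    The `window` stopping rule is an assumption made for this sketch.
    """
    seen = set()
    repeats_in_a_row = 0
    for count, entry in enumerate(entries, start=1):
        if entry in seen:
            repeats_in_a_row += 1
            if repeats_in_a_row >= window:
                return count
        else:
            seen.add(entry)
            repeats_in_a_row = 0
    return None


# Hypothetical survey answers, invented for this example.
responses = ["bus", "car", "bus", "bike", "car", "bus", "car", "walk"]
print(saturation_point(responses))
```

Note that the invented sample contains a new answer ("walk") right after the stopping point, which is exactly the kind of entry that gets lost when collection stops at saturation.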

I don't understand why. Computers are excellent at performing repetitive tasks, with only as much error as you introduce when programming the parameters. With such large sets, the tendency toward unrepresentative data stemming from problems in the sampling process would be considerably reduced. Aside from the cost of a program which can do that kind of scanning and the time spent formulating parameters, there is no reason, besides the conventional objection to experiments outside controlled (e.g., lab) conditions, not to prefer massive, user-provided data sets.

If you accept the assumption, which can be proven, that habits and systems govern behavior, and that those habits are repetitive, it is not much of a leap to believe that they will tend to keep repeating.

While prediction methods are obviously not going to be perfect, those assumptions suggest a significant probability that human behavior will have repetitive features, and that those features can be derived from previous behavior. It is possible to know what someone is likely to do, based on what they've already done.
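The idea of guessing the next action from past actions can be sketched in a few lines of Python. This is only a toy under stated assumptions: it counts which action has followed which (a first-order transition count), and the names `fit_transitions` and `predict_next` and the sample history are hypothetical, not a method taken from any particular company or study.

```python
from collections import Counter, defaultdict


def fit_transitions(history):
    """Count how often each action was immediately followed by each other action."""
    counts = defaultdict(Counter)
    for previous, following in zip(history, history[1:]):
        counts[previous][following] += 1
    return counts


def predict_next(counts, current):
    """Return (most frequent follower, its observed share), or None if unseen."""
    followers = counts.get(current)
    if not followers:
        return None
    action, times = followers.most_common(1)[0]
    return action, times / sum(followers.values())


# Hypothetical activity log, invented for this example.
history = ["coffee", "email", "coffee", "email", "coffee", "news", "coffee", "email"]
model = fit_transitions(history)
print(predict_next(model, "coffee"))
```

The share returned is just the observed frequency in the log, so it is only as trustworthy as the amount of history behind it, which is the argument for the larger data sets.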

Given access to those huge, beautiful data sets, a program and the correct parameters, I expect to be able to understand and predict specific trends in human behavior with a probability between 80 and 90 percent (0.80 < P < 0.90).

I can't wait to get my hands on that data.
