Ask HN: How big of a problem is unstructured data for companies?

16 points by dark7 a year ago

I read somewhere that 90% of companies have unstructured data such as documents, PDFs, videos, images, and audio clips, and that this will be a big obstacle for AI.

Are there companies already in this space?

Trying to see if there's something here before I possibly create an MVP.

danjl a year ago

The trick is finding a problem where providing structure to the data increases revenue or decreases costs. Sure, it would be great to bring structure to assets, but you can't just provide search or labeling. You have to figure out how providing the structure actually brings value to those companies. You'd hope they would do that for you, but you need to figure it all out, at least for one set of customers who will pay you, before you build the MVP. The details of how it benefits the company have a profound effect on the design of the MVP, including how to access the assets and how to expose the structure in the UX.

pizza a year ago

The other implication of that is, if the company knew that there was real value in structuring those documents, they would be going out of their way to do it already. So you as a 3rd party kind of have to figure out some opportunity that the company, a 1st party, doesn't already know about itself. Or you have to find something they're not prioritizing sufficiently, for one reason or another, and understand why.

sam29681749 a year ago

+ There may also be a cost in maintaining the organisation of the data that needs to be justified.

dark7 a year ago

Yeah, good point. Would have to provide some value other than search. They can probably already do that with ChatGPT to an extent with PDFs and whatnot.

edmundsauto a year ago

I work for big tech. Our problem is not the unstructured nature of the data; it is the ratio of noise to signal. Basic ranking and information retrieval are implemented; we have LLM/RAG systems that can be queried. However, it’s hard to evaluate what is good, up-to-date information: 98% of the documents people kick out are not useful.

dark7 a year ago

Interesting. So for example you’re saying you make a query and the information in the document is old and out of date?
- yourapostasy a year ago
  
  > ...and the information in the document is old and out of date?
  Not just old and out of date. Plain wrong, like referring to internal systems, processes or intranet URLs that are (often poorly done) duplications of enterprise standard services, processes or pages, and should be deprecated instead of incorporated into training data.
  Or the unstructured data sits within structured data. Some development teams stick giant strings with their own internal, proprietary formatting into a database blob column because...I dunno, they thought it was expedient or whatever for Very Good [Undocumented] Reasons. Invariably, these fields hold all sorts of nastiness you either want to tease out, or do not want to exist at all because they expose in cleartext what you wish they wouldn't. And even though your company pays the vendors of these monstrosities (oh yes, my PTSD-suppressed memories of multiple vendors committing these same terrible, no good, very bad software sins) three commas a year of "support", they adamantly refuse to share their precious proprietary format with you, much less extend you the courtesy of telling you when they change the byzantine format that gives the three-body problem a run for its money when it comes to deterministic solutions.
  If you can get a generative AI to make sense of that, then you can practically segment the market by "how much before the customer starts to put away the blank check" and you'll make bank. Everyone I know of in the Data Loss Prevention (DLP) or adjacent spaces is a deer in the headlights when the conversation turns towards these operational realities and what we might do with their solutions to put a dent in these multiverses of pain. No one has roadmaps talking of iterative pipelines, search outcome behavior analysis and automated training thereof, or really much beyond various degrees of pattern matching, whether that be regexes or LLMs.
  - dark7 a year ago
    
    Very interesting. Definitely got me thinking. What's the workflow like when these issues arise? At what point in that workflow could generative AI be of help?
    
    edmundsauto a year ago
    
    What ends up happening is people go to each other to find their answers. No idea if a genAI workflow would be helpful at all, since the information is hidden in 1:1 chats and people's brains.
    Where I see genAI fitting in is providing an incentive for teams to keep their docs up to date, similar to how SEO was a driver for webmasters to clean up their info architecture. If I know that people are going to be chatting w/ an AI instead of bugging my team, I could justify more investment in documentation.
    
    yourapostasy a year ago
    
    > What's the workflow like when these issues arise?
    The workflow is by hand. Not quite all the way down, but it damn near is.
    > At what point in that workflow would a possible generative ai be of help?
    Wherever the non-deterministic output of Generative AI still delivers benefits that exceed the transactional cost of getting it wrong some X% of the time. Highly data and context dependent. That's a way of saying the "It Depends" cop-out, but I have a hunch we have to solve each industry segment individually, if not in finer gradations.
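
The break-even condition in the comment above (benefit when the model is right versus cost when it is wrong X% of the time) can be written as a small expected-value calculation. This is only an illustrative sketch: the function names and the per-task benefit and cost figures are hypothetical and do not come from the thread, and real deployments would need per-segment numbers, as the comment notes.

```python
def expected_net_value(benefit_when_right, cost_when_wrong, error_rate):
    """Expected value per task of using a model that is wrong `error_rate` of the time.

    All inputs are illustrative assumptions: benefit and cost are in the same
    currency unit, and error_rate is a fraction between 0 and 1.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return (1.0 - error_rate) * benefit_when_right - error_rate * cost_when_wrong


def break_even_error_rate(benefit_when_right, cost_when_wrong):
    """Error rate at which the expected net value is exactly zero.

    Solves (1 - x) * benefit - x * cost = 0 for x.
    """
    if benefit_when_right < 0 or cost_when_wrong < 0:
        raise ValueError("benefit and cost must be non-negative")
    total = benefit_when_right + cost_when_wrong
    if total == 0:
        raise ValueError("benefit and cost cannot both be zero")
    return benefit_when_right / total


if __name__ == "__main__":
    # Hypothetical numbers: each correct answer saves 5 units of manual work,
    # each wrong answer costs 95 units to detect and clean up.
    benefit, cost = 5.0, 95.0
    print("break-even error rate:", break_even_error_rate(benefit, cost))
    for rate in (0.01, 0.05, 0.10):
        print(rate, expected_net_value(benefit, cost, rate))
```

With these made-up figures the model has to be wrong less than about 5% of the time to be worth using. A segment where mistakes are cheap tolerates a much higher error rate, which is one way to read "solve each industry segment individually".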

evanjrowley a year ago

It's an even bigger obstacle for data management, particularly classification and loss prevention. By comparison, it's less of an obstacle for AI, and AI will most likely be a game changer for addressing those other issues.

datadrivenangel a year ago

For most companies, the unstructured 'data' is barely information, let alone data, let alone valuable.

Most companies have internal training videos, recordings of meetings, and PDFs of policies, and basically all of these are worthless within a year from a business perspective. Some things are useful for longer, or for reasons of historical interest, but the half-life is short. The real value lies in what those could potentially represent, like decisions or events that would benefit from action. If a meeting has a decision that some executive would want to know about, maybe a summary of the transcript could be useful?

Turning the random memoranda generated in the course of business into valuable insights without process redesign is a pipe dream in most scenarios. Not all, though.

constantinum a year ago

Unstract is trying to solve this problem by fully leveraging the LLM stack. It is open source: https://github.com/Zipstack/unstract

aworks a year ago

Ben Thompson suggests Palantir as a company positioned to leverage deep enterprise data with AI.

https://stratechery.com/2024/enterprise-philosophy-and-the-f...

muzani a year ago

I would think that the main purpose of multimodal AI like GPT-4o is to be able to process data from videos, images, audio, etc.

theGnuMe a year ago

There are companies, but it is a wide-open space. There was a legal AI startup sold last year for a billion or so...